*** elodilles_pto is now known as elodilles | 05:53 | |
ykarel | corvus, fungi: some recent multinode failures you might want to check now that the last patch for the quota message has merged | 12:11 |
ykarel | 19th https://zuul.openstack.org/build/a97ae07d126648ffa02d979759a672f4 | 12:11 |
ykarel | 20th https://zuul.openstack.org/build/2bccf7810f0547b3a09175ff0eee295a | 12:11 |
ykarel | 21st https://zuul.openstack.org/build/8a295130e958491cba1d343f4da19754 | 12:11 |
*** ykarel_ is now known as ykarel | 12:40 | |
fungi | ykarel: thanks, i think the latest improvement for that (955292) would have taken effect at worst during the manual restart of zuul services that occurred a little before 19:00 utc on saturday (2025-07-19), so at least the second and third example happened with it in place | 13:00 |
fungi | frickler: did my comment on https://review.opendev.org/952861 answer your question? | 13:27 |
Clark[m] | Making note of it early as I boot my day because it is potentially important/relevant: there are two reports from Gerrit users who updated to newer versions like we did. First is problems doing full offline reindexing after upgrading from 3.11.3 to 3.11.4. We can check on this as part of our 3.10 to 3.11 upgrade. Second is replication failing for specific refs until restarting Gerrit (even a request to replicate everything fails). The | 14:00 |
Clark[m] | presumed commit with the issue is in the replication plugin for 3.11.4 and 3.12.1 but not 3.10.7 (which is what we run) | 14:00 |
Clark[m] | If we notice problems with either of those on 3.10.7 I can pass that info along to aid upstream debugging | 14:01 |
corvus | ykarel: ack, thanks! i'll look into those | 14:02 |
corvus | https://zuul.openstack.org/build/8a295130e958491cba1d343f4da19754 | 14:17 |
corvus | both nodes were in ovh-bhs1 | 14:17 |
corvus | https://zuul.openstack.org/build/2bccf7810f0547b3a09175ff0eee295a | 14:17 |
corvus | both nodes were in ovh-gra1 | 14:17 |
corvus | https://zuul.openstack.org/build/a97ae07d126648ffa02d979759a672f4 | 14:17 |
corvus | both nodes were in rax-dfw | 14:17 |
corvus | ykarel: it looks like those failures were not zuul/nodepool related | 14:17 |
ykarel | corvus, but i see those on different nodes | 14:20 |
ykarel | first one ovh-bhs1-main and rax-dfw-main | 14:20 |
ykarel | https://zuul.openstack.org/build/8a295130e958491cba1d343f4da19754/log/job-output.txt#38 | 14:21 |
ykarel | second ovh-gra1-main and rax-dfw-main https://zuul.openstack.org/build/2bccf7810f0547b3a09175ff0eee295a/log/job-output.txt#38 | 14:22 |
ykarel | third rax-dfw-main and rax-ord-main https://zuul.openstack.org/build/a97ae07d126648ffa02d979759a672f4/log/job-output.txt#38 | 14:22 |
clarkb | fungi: maybe after you squash the stack of gitea page updates I can rebase https://review.opendev.org/c/opendev/system-config/+/955411 onto that and we can rollout 1.24.3 after the edits | 14:23 |
fungi | sounds good, or we can go ahead with the gitea upgrade while we wait for consensus on the splash page content updates | 14:24 |
corvus | ykarel: oh sorry! i misread | 14:25 |
corvus | okay, for https://zuul.openstack.org/build/8a295130e958491cba1d343f4da19754 it looks like ovh-bhs1 put the server in "error" state with no fault message, so there was no way for us to tell if it was quota related or not | 14:32 |
corvus | for https://zuul.openstack.org/build/2bccf7810f0547b3a09175ff0eee295a we got a fault message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance ... | 14:38 |
fungi | mmm, i spotted a sneaky spam message that made it onto openstack-discuss and thought that someone approved that new subscriber through moderation without noticing... but in revisiting the list settings i see that message acceptance for members is set to default processing, but i was sure i had set it to hold for moderation. i wonder how/when it switched back | 14:41 |
fungi | i've set it again now and will keep a closer eye on it | 14:44 |
corvus | and https://zuul.openstack.org/build/a97ae07d126648ffa02d979759a672f4 was the "error" state with no fault message again | 14:44 |
corvus | okay, so i think z-l is working as designed here, and we're just getting unclear error messages from openstack. so if we want to improve this, we need to change the design. i think we could either have z-l be more insistent that node requests stay in the same provider, or we could have it (optionally) error if they can't. | 14:46 |
corvus | the cost of the first option would be that we might leave some ready nodes around if, say, 1 of 2 nodes succeeds. | 14:47 |
corvus | nonetheless, i think that might be the best approach, so i'll look into doing that first. | 14:47 |
corvus | (if we end up with a ready node, it probably won't stick around for long, unless it's an esoteric label) | 14:48 |
fungi | checking the other list i'd set to default hold for moderation (legal-discuss) it's still set, so maybe whatever happened to openstack-discuss was user error on my part | 14:48 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM Test offline reindexing with Gerrit 3.11.4 https://review.opendev.org/c/opendev/system-config/+/955487 | 15:05 |
Ramereth[m] | fungi: so there's no single endpoint that I can look at to see if there are any images still stuck in a deleting state like before? I see several that seem to be stuck in a "saving" state and one in "queued" currently from my side. | 15:37 |
fungi | Ramereth[m]: i don't believe so, but maybe corvus or clarkb have some ideas | 15:37 |
fungi | an api query to get that might be possible | 15:37 |
corvus | Ramereth[m]: fungi are we interested in a particular cloud? | 15:43 |
corvus | like, is the intended use something like "show all images in rax-flex that are in X state"? | 15:45 |
clarkb | corvus: I suspect Ramereth[m] is only interested in the osuosl cloud image states | 15:52 |
corvus | okay, so we want to see "what is the status of every upload in osuosl"? | 15:52 |
clarkb | yes I think so. The reason for that is sometimes the cloud fails to delete the images properly and intervention is required. I think Ramereth[m] was monitoring that via the old api and periodically cleaning things out on the cloud side | 15:53 |
corvus | indeed we don't have a summary view of all uploads for a given provider in the web ui, so right now, you'd have to click through to each image. | 16:00 |
corvus | but we can get that info through the api | 16:00 |
corvus | Ramereth[m]: try this: `curl -q https://zuul.opendev.org/api/tenant/opendev/providers |jq -r '.[] | select(.name == "osuosl-regionone-main").images[].build_artifacts[].uploads[] | [.state, .external_id] | @tsv'` | 16:00 |
corvus | it does not look entirely healthy | 16:01 |
clarkb | my test in 955487 seems to show that in the trivial case offline reindexing is fine with gerrit 3.11.4. I suspect the problem with reindexing there is specific to the state of the changes on the system tripping over some bug | 16:05 |
clarkb | I think I'll just continue with upgrade prep and monitor the situation upstream to determine if we need to do some testing on our side. I think worst case we'll be able to revert to older versions if we hit things on our side (note offline reindexing shouldn't be necessary for the upgrade itself) | 16:06 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM Forced fail on Gerrit to test 3.11 upgrade and downgrade https://review.opendev.org/c/opendev/system-config/+/893571 | 16:45 |
clarkb | put a couple of holds in place for ^ to dig into the upgrade more | 16:48 |
clarkb | corvus: any reason to not approve https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/955236 at this point (this removes the nodepool config verification job) | 16:57 |
corvus | clarkb: nope i think we can proceed with that | 16:58 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/955497 Avoid duplicate launcher upload jobs [NEW] | 17:11 |
corvus | we have a lot of "pending" uploads; ^ that change should address that | 17:11 |
clarkb | fungi: re the opendev.org main page I guess you're looking for more consensus around that? Should we plan to upgrade gitea first then and get reviews on https://review.opendev.org/c/opendev/system-config/+/955411 ? I expect to be around today and can monitor that if we do it. The main concern I have is with gitea13 (and any others I guess) being overloaded still | 17:12 |
clarkb | it is busy this morning (the others aren't) but it's below the worst that I've seen | 17:13 |
clarkb | corvus: I had a couple of notes on 955497 but nothing major so went ahead and +2'd it | 17:23 |
corvus | clarkb: thx, replied | 17:28 |
fungi | clarkb: mainly i was hoping to see if frickler's concern on 952861 was addressed | 17:33 |
clarkb | corvus: huh we don't mark the image build artifact ready as a precursor to then uploading the image build? | 17:33 |
clarkb | corvus: I guess in my mind it was the image build goes ready, then we upload it | 17:33 |
fungi | once i know whether 952861 has consensus i can decide between squashing all three or just squashing the two parents for now | 17:34 |
corvus | nope, because "ready image builds with no uploads" look like something that should be deleted. so we create the uploads before we say the image is ready. it's the last thing that happens, and it happens in the same "transaction" | 17:34 |
clarkb | got it | 17:36 |
corvus | (we normally create the uploads as "pending"; that unit test does it as uploading, but otherwise, it's the same) | 17:36 |
clarkb | fungi: ack. I guess we review the 1.24.3 upgrade in parallel and if it is ready first it can go first otherwise I can rebase it | 17:36 |
fungi | i'll probably disappear around 19:00-20:00 utc to get a late lunch/early dinner, but otherwise expect to be around for helping monitor the gitea upgrade | 17:37 |
corvus | clarkb: here's the code (you can see it structurally mirrors the test) https://opendev.org/zuul/zuul/src/commit/01e95e978668123a54ff76f1da77e178c88d13f9/zuul/scheduler.py#L3407 | 17:37 |
clarkb | corvus: thanks | 17:37 |
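For readers following along, here is a rough illustrative sketch of the ordering corvus describes above. This is not the actual Zuul code (the linked scheduler.py is authoritative) and the function and transaction API names are invented for illustration; the point is only that uploads are created in a "pending" state first and that marking the build artifact "ready" is the last step, in the same transaction, so a ready build never exists without its uploads.

```python
# Hypothetical sketch only -- names and API are invented for illustration;
# see the linked zuul/scheduler.py for the real implementation.
def finish_image_build(zk, build_artifact, providers):
    with zk.transaction() as txn:
        # Create an upload record per provider first, in "pending" state
        # (the unit test mentioned above uses "uploading" instead).
        for provider in providers:
            txn.create_upload(build_artifact, provider, state="pending")
        # Marking the build artifact ready is the last thing that happens,
        # and it happens in the same "transaction" as the upload creation.
        txn.set_state(build_artifact, "ready")
```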
fungi | the splash page changes aren't urgent, i'd just prioritize the gitea upgrade for now | 17:37 |
clarkb | ack if anyone else wants to review https://review.opendev.org/c/opendev/system-config/+/955411 before 20:00 that would be great. Otherwise I can approve it around then (fungi's early dinner lines up with my lunch hour so I figure we can wait until after that block of time) | 17:41 |
clarkb | jrosser: it's been long enough that https://review.opendev.org/c/openstack/diskimage-builder/+/954760 should've addressed your debian image problems with backports | 17:53 |
clarkb | jrosser: not a rush but any chance you can confirm that is the case? | 17:53 |
clarkb | once that is confirmed I can abandon https://review.opendev.org/c/zuul/zuul-jobs/+/954280 | 18:05 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Multiple Gitea splash page updates https://review.opendev.org/c/opendev/system-config/+/952407 | 18:41 |
fungi | headed out to grab a bite to eat, shouldn't be more than an hour | 18:56 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 19:12 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 19:13 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 19:14 |
Ramereth[m] | corvus: that's helpful, but all of the external_id fields are null so I don't know how to reference them on my end | 19:15 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 19:15 |
corvus | Ramereth[m]: the pending ones haven't created a cloud image yet, so you can ignore them (also, there are way too many of those right now due to a bug that should be fixed soon) | 19:16 |
Ramereth[m] | corvus: but the deleting ones don't show any external_id's either | 19:17 |
corvus | Ramereth[m]: you can ignore those too -- those are likely to have failed before we started uploading them | 19:18 |
Ramereth[m] | ah okay | 19:18 |
corvus | (in other words, if we have an external id, it should be there; otherwise, something has gone wrong that doesn't involve the target cloud) | 19:19 |
Ramereth[m] | So I updated your script to the following: | 19:25 |
Ramereth[m] | curl -q https://zuul.opendev.org/api/tenant/opendev/providers |jq -r '.[] | select(.name == "osuosl-regionone-main").images[].build_artifacts[].uploads[] | select(.state == "deleting") | select(.external_id != null) | [.state, .external_id] | @tsv' | 19:25 |
Ramereth[m] | According to this, it looks like we don't have any images that are "stuck"? | 19:25 |
corvus | Ramereth[m]: yes, i agree | 19:31 |
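As a small aside, Ramereth's filtered jq one-liner above can also be expressed in Python, which may be easier to extend. This is a minimal sketch assuming the field layout the providers endpoint returned for the queries above (providers → images → build_artifacts → uploads, each upload carrying "state" and "external_id"); skipping uploads without an external_id follows corvus's note that those never reached the target cloud.

```python
# Minimal sketch: list "deleting" uploads for one provider that have a
# cloud-side image id (external_id), i.e. candidates for being stuck.
# Assumes the JSON layout shown by the jq queries above.
import requests

URL = "https://zuul.opendev.org/api/tenant/opendev/providers"
PROVIDER = "osuosl-regionone-main"

for provider in requests.get(URL, timeout=30).json():
    if provider["name"] != PROVIDER:
        continue
    for image in provider["images"]:
        for artifact in image["build_artifacts"]:
            for upload in artifact["uploads"]:
                if upload["state"] == "deleting" and upload["external_id"]:
                    print(upload["state"], upload["external_id"])
```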
clarkb | I've approved the gitea 1.24.3 upgrade change. It usually takes about an hour to gate so figured a head start was good | 19:37 |
clarkb | can always -2 or -W it if something comes up in any additional reviews | 19:37 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM test eavesdrop running on a noble node https://review.opendev.org/c/opendev/system-config/+/955520 | 20:05 |
fungi | cool | 20:24 |
fungi | ready for when it merges | 20:25 |
opendevreview | Merged opendev/system-config master: Update to gitea 1.24.3 https://review.opendev.org/c/opendev/system-config/+/955411 | 21:02 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM test eavesdrop running on a noble node https://review.opendev.org/c/opendev/system-config/+/955520 | 21:03 |
opendevreview | Clark Boylan proposed opendev/system-config master: Switch IRC and matrix bots to log with journald rather than syslog https://review.opendev.org/c/opendev/system-config/+/955544 | 21:03 |
fungi | zuul estimates 3 minutes left on the hourly buildset | 21:05 |
fungi | after which the deploy jobs for 955411 should kick in | 21:06 |
corvus | clarkb fungi ykarel as i mentioned earlier, i think we're now at the point where we've done everything we can to make zuul-launcher work as designed, and now we need to change the design to accommodate the un-actionable errors we're getting from the cloud. i proposed this change which i think will have the desired effect: https://review.opendev.org/c/zuul/zuul/+/955545 Require multinode requests served from same provider | 21:07 |
fungi | thanks corvus! | 21:07 |
corvus | i don't want to rush that one through, so we may still have the current behavior for another day or so | 21:08 |
corvus | (while it's reviewed) | 21:08 |
corvus | the bugfix for pending uploads merged, so i'm going to restart the launchers now | 21:10 |
fungi | infra-prod-service-gitea is in progress | 21:12 |
clarkb | https://gitea09.opendev.org:3081/opendev/system-config/ is upgraded | 21:13 |
clarkb | I can git clone system-config from gitea09 as well | 21:13 |
clarkb | it seems to be working for me so far | 21:14 |
fungi | Powered by Gitea Version: v1.24.2 | 21:15 |
fungi | mmm | 21:15 |
fungi | oh, i hit a redirect | 21:15 |
clarkb | ya you have to be careful navigating the web ui; some links send you back to the haproxy | 21:16 |
fungi | okay, getting v1.24.3 now | 21:16 |
fungi | yeah, i was doing 3080/tcp instead of 3081 | 21:16 |
clarkb | 09-11 are done now | 21:16 |
clarkb | 13 is the one I'm slightly worried about | 21:16 |
fungi | looks like it should be done now? | 21:19 |
fungi | ah, still restarting | 21:20 |
clarkb | no 13 is just finishing up image downloads and doing the upgrade now | 21:20 |
fungi | yeah | 21:20 |
fungi | stopping is in progress still | 21:20 |
clarkb | this restart will be slow because I think it stays up until a timeout for existing connections, and since this is the backend with all the connections it is going to be slower | 21:21 |
fungi | https://gitea13.opendev.org:3081/ is loading for me | 21:21 |
clarkb | now 13 is done | 21:21 |
fungi | agreed | 21:21 |
clarkb | worried unnecessarily. I figured it would be ok but if any were to give us trouble it would be this one | 21:21 |
clarkb | all of them are done now. The job should finish up soon | 21:23 |
clarkb | corvus: reading the commit message on that change this seems like a reasonable approach. It maintains locality when possible but still allows you to mix k8s and openstack and ec2 or whatever if you wish in a single nodeset | 21:24 |
clarkb | I'll work on a proper review shortly | 21:24 |
clarkb | https://zuul.opendev.org/t/openstack/buildset/738b650c6e0b4de38309f62ecea974ad that is a successful gitea upgrade buildset | 21:25 |
fungi | yeah, i'm already about halfway through it, but based on the description i agree it's the best possible tradeoff | 21:25 |
clarkb | the one thing I haven't checked yet is replication | 21:25 |
corvus | clarkb: yep that's what i'm thinking | 21:25 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM test eavesdrop running on a noble node https://review.opendev.org/c/opendev/system-config/+/955520 | 21:27 |
clarkb | https://opendev.org/opendev/system-config/commit/a0fc9e3edab9e2b7c5806fe93a5386381499756f replication seems to be working | 21:27 |
corvus | since the bugfix for the leaked nodeset request locks merged and we're running that now, i have deleted what is hopefully the last batch of leaked lock znodes. | 21:33 |
corvus | the launchers look like they are more or less done deleting all the extra "pending" uploads | 21:34 |
clarkb | corvus: ok posted a couple of comments but overall I think this is good | 21:41 |
corvus | thx, replied | 21:52 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/955544 is a prep change for eventually replacing the current eavesdrop server with a new noble one. I think it is a big noop (we've tested it a bit at this point) to move from syslog to journald for container logging | 22:01 |
clarkb | ok time to put the meeting agenda together. I'll try to do niz updates and updates on gerrit and so on. Anything else that should be in there? | 22:43 |
clarkb | my edits are in. I'll send this out in ~10-15 minutes. Let me know before then if I'm missing something important | 22:56 |
corvus | my main question is "wen delete nodepool" which i imagine is covered under at least one of the existing topics :) | 23:08 |
clarkb | yup we can talk about that in the first topic tomorrow. I've just sent the agenda | 23:08 |
clarkb | oh nope I failed to send. One moment | 23:09 |
clarkb | now it should be in your inboxes | 23:10 |
corvus | email is hard | 23:11 |
fungi | also as far as wen baleet nodepull i'm good with any time now | 23:13 |
fungi | we can arrange to have a small memorial service for it if anyone wants | 23:15 |
fungi | otherwise... _press_button_ | 23:15 |
fungi | the stories about where things in the openstack sdk came from are going to get that much more amusing now when we talk about shade getting split off of a component that no longer exists at all | 23:17 |
mordred | what a fun family tree :) | 23:45 |