Monday, 2025-07-21

*** elodilles_pto is now known as elodilles  [05:53]
<ykarel> corvus, fungi some recent multinode failures you might want to check after the last patch for the quota message merged  [12:11]
<ykarel> 19th https://zuul.openstack.org/build/a97ae07d126648ffa02d979759a672f4  [12:11]
<ykarel> 20th https://zuul.openstack.org/build/2bccf7810f0547b3a09175ff0eee295a  [12:11]
<ykarel> 21st https://zuul.openstack.org/build/8a295130e958491cba1d343f4da19754  [12:11]
*** ykarel_ is now known as ykarel  [12:40]
<fungi> ykarel: thanks, i think the latest improvement for that (955292) would have taken effect at worst during the manual restart of zuul services that occurred a little before 19:00 utc on saturday (2025-07-19), so at least the second and third example happened with it in place  [13:00]
<fungi> frickler: did my comment on https://review.opendev.org/952861 answer your question?  [13:27]
<Clark[m]> Making note of it early as I boot my day because it is potentially important/relevant: there are two reports from Gerrit users after people updated to newer versions like we did. First is problems doing full offline reindexing after upgrading from 3.11.3 to 3.11.4. We can check on this as part of our 3.10 to 3.11 upgrading. Second is replication failing for specific refs until restarting Gerrit (even a request to replicate everything fails). The presumed commit with the issue is in the replication plugin for 3.11.4 and 3.12.1 but not 3.10.7 (which is what we run)  [14:00]
<Clark[m]> If we notice problems with either of those on 3.10.7 I can pass that info along to aid upstream debugging  [14:01]
<corvus> ykarel: ack, thanks!  i'll look into those  [14:02]
<corvus> https://zuul.openstack.org/build/8a295130e958491cba1d343f4da19754  [14:17]
<corvus> both nodes were in ovh-bhs1  [14:17]
<corvus> https://zuul.openstack.org/build/2bccf7810f0547b3a09175ff0eee295a  [14:17]
<corvus> both nodes were in ovh-gra1  [14:17]
<corvus> https://zuul.openstack.org/build/a97ae07d126648ffa02d979759a672f4  [14:17]
<corvus> both nodes were in rax-dfw  [14:17]
<corvus> ykarel: it looks like those failures were not zuul/nodepool related  [14:17]
<ykarel> corvus, but i see those on different nodes  [14:20]
<ykarel> first one ovh-bhs1-main and rax-dfw-main  [14:20]
<ykarel> https://zuul.openstack.org/build/8a295130e958491cba1d343f4da19754/log/job-output.txt#38  [14:21]
<ykarel> second ovh-gra1-main and rax-dfw-main https://zuul.openstack.org/build/2bccf7810f0547b3a09175ff0eee295a/log/job-output.txt#38  [14:22]
<ykarel> third rax-dfw-main and rax-ord-main https://zuul.openstack.org/build/a97ae07d126648ffa02d979759a672f4/log/job-output.txt#38  [14:22]
<clarkb> fungi: maybe after you squash the stack of gitea page updates I can rebase https://review.opendev.org/c/opendev/system-config/+/955411 onto that and we can roll out 1.24.3 after the edits  [14:23]
<fungi> sounds good, or we can go ahead with the gitea upgrade until there's consensus on the splash page content updates  [14:24]
<corvus> ykarel: oh sorry!  i misread  [14:25]
<corvus> okay, for https://zuul.openstack.org/build/8a295130e958491cba1d343f4da19754 it looks like ovh-bhs1 put the server in "error" state with no fault message, so there was no way for us to tell if it was quota related or not  [14:32]
<corvus> for https://zuul.openstack.org/build/2bccf7810f0547b3a09175ff0eee295a we got a fault message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance ...  [14:38]
<fungi> mmm, i spotted a sneaky spam message that made it onto openstack-discuss and thought that someone approved that new subscriber through moderation without noticing... but in revisiting the list settings i see that message acceptance for members is set to default processing, but i was sure i had set it to hold for moderation. i wonder how/when it switched back  [14:41]
<fungi> i've set it again now and will keep a closer eye on it  [14:44]
<corvus> and https://zuul.openstack.org/build/a97ae07d126648ffa02d979759a672f4 was the "error" state with no fault message again  [14:44]
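For the builds above where the cloud only reported an "error" state, the fault detail (when nova recorded one) can also be checked from the cloud side with something like the following; this is only a sketch, it assumes credentials for the project that owned the instance, and the server UUID is a placeholder:

    # list servers currently stuck in ERROR state for the project
    openstack server list --status ERROR -f value -c ID -c Name
    # show whatever fault detail nova recorded for one of them (may be empty)
    openstack server show <server-uuid> -f json | jq '.fault'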
<corvus> okay, so i think z-l (zuul-launcher) is working as designed here, and we're just getting unclear error messages from openstack.  so if we want to improve this, we need to change the design.  i think we could either have z-l be more insistent that node requests stay in the same provider, or we could have it (optionally) error if they can't.  [14:46]
<corvus> the cost of the first option would be that we might leave some ready nodes around if, say, only 1 of 2 nodes succeeds.  [14:47]
<corvus> nonetheless, i think that might be the best approach, so i'll look into doing that first.  [14:47]
<corvus> (if we end up with a ready node, it probably won't stick around for long, unless it's an esoteric label)  [14:48]
<fungi> checking the other list i'd set to default hold for moderation (legal-discuss) it's still set, so maybe whatever happened to openstack-discuss was user error on my part  [14:48]
<opendevreview> Clark Boylan proposed opendev/system-config master: DNM Test offline reindexing with Gerrit 3.11.4  https://review.opendev.org/c/opendev/system-config/+/955487  [15:05]
<Ramereth[m]> fungi: so there's no single endpoint that I can look at to see if there are any images still stuck in a deleting state like before? I see several that seem to be stuck in a "saving" state and one in "queued" currently from my side.  [15:37]
<fungi> Ramereth[m]: i don't believe so, but maybe corvus or clarkb have some ideas  [15:37]
<fungi> an api query to get that might be possible  [15:37]
<corvus> Ramereth[m]: fungi are we interested in a particular cloud?  [15:43]
<corvus> like, is the intended use something like "show all images in rax-flex that are in X state"?  [15:45]
<clarkb> corvus: I suspect Ramereth[m] is only interested in the osuosl cloud image states  [15:52]
<corvus> okay, so we want to see "what is the status of every upload in osuosl"?  [15:52]
<clarkb> yes I think so. The reason for that is sometimes the cloud fails to delete the images properly and intervention is required. I think Ramereth[m] was monitoring that via the old api and periodically cleaning things out on the cloud side  [15:53]
<corvus> indeed we don't have a summary view of all uploads for a given provider in the web ui, so right now, you'd have to click through to each image.  [16:00]
<corvus> but we can get that info through the api  [16:00]
<corvus> Ramereth[m]: try this: `curl -q https://zuul.opendev.org/api/tenant/opendev/providers |jq -r '.[] | select(.name == "osuosl-regionone-main").images[].build_artifacts[].uploads[] | [.state, .external_id] | @tsv'`  [16:00]
<corvus> it doesn't look entirely healthy  [16:01]
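A hedged variation on that query, for anyone who wants a quick summary instead of the full per-upload listing; it assumes the same endpoint and JSON layout shown above and simply counts uploads per state:

    # count osuosl uploads by state (sketch; same API shape as corvus's command)
    curl -s https://zuul.opendev.org/api/tenant/opendev/providers \
      | jq -r '.[] | select(.name == "osuosl-regionone-main").images[].build_artifacts[].uploads[].state' \
      | sort | uniq -c | sort -rn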
<clarkb> my test in 955487 seems to show that in the trivial case offline reindexing is fine with gerrit 3.11.4. I suspect the problem with reindexing there is specific to the state of the changes on that system tripping over some bug  [16:05]
<clarkb> I think I'll just continue with upgrade prep and monitor the situation upstream to determine if we need to do some testing on our side. I think worst case we'll be able to revert to older versions if we hit things on our side (note offline reindexing shouldn't be necessary for the upgrade itself)  [16:06]
<opendevreview> Clark Boylan proposed opendev/system-config master: DNM Forced fail on Gerrit to test 3.11 upgrade and downgrade  https://review.opendev.org/c/opendev/system-config/+/893571  [16:45]
<clarkb> put a couple of holds in place for ^ to dig into the upgrade more  [16:48]
<clarkb> corvus: any reason to not approve https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/955236 at this point (this removes the nodepool config verification job)  [16:57]
<corvus> clarkb: nope i think we can proceed with that  [16:58]
<corvus> remote:   https://review.opendev.org/c/zuul/zuul/+/955497 Avoid duplicate launcher upload jobs [NEW]  [17:11]
<corvus> we have a lot of "pending" uploads; ^ that change should address that  [17:11]
<clarkb> fungi: re the opendev.org main page I guess you're looking for more consensus around that? Should we plan to upgrade gitea first then and get reviews on https://review.opendev.org/c/opendev/system-config/+/955411 ? I expect to be around today and can monitor that if we do it. The main concern I have is with gitea13 (and any others I guess) being overloaded still  [17:12]
<clarkb> it is busy this morning (the others aren't) but it's below the worst that I've seen  [17:13]
<clarkb> corvus: I had a couple of notes on 955497 but nothing major so went ahead and +2'd it  [17:23]
<corvus> clarkb: thx, replied  [17:28]
<fungi> clarkb: mainly i was hoping to see if frickler's concern on 952861 was addressed  [17:33]
<clarkb> corvus: huh, we don't mark the image build artifact ready as a precursor to then uploading the image build?  [17:33]
<clarkb> corvus: I guess in my mind it was: the image build goes ready, then we upload it  [17:33]
<fungi> once i know whether 952861 has consensus i can decide between squashing all three or just squashing the two parents for now  [17:34]
<corvus> nope, because "ready image builds with no uploads" look like something that should be deleted.  so we create the uploads before we say the image is ready.  it's the last thing that happens, and it happens in the same "transaction"  [17:34]
<clarkb> got it  [17:36]
<corvus> (we normally create the uploads as "pending"; that unit test does it as uploading, but otherwise, it's the same)  [17:36]
<clarkb> fungi: ack. I guess we review the 1.24.3 upgrade in parallel and if it is ready first it can go first; otherwise I can rebase it  [17:36]
<fungi> i'll probably disappear around 19:00-20:00 utc to get a late lunch/early dinner, but otherwise expect to be around for helping monitor the gitea upgrade  [17:37]
<corvus> clarkb: here's the code (you can see it structurally mirrors the test) https://opendev.org/zuul/zuul/src/commit/01e95e978668123a54ff76f1da77e178c88d13f9/zuul/scheduler.py#L3407  [17:37]
<clarkb> corvus: thanks  [17:37]
<fungi> the splash page changes aren't urgent, i'd just prioritize the gitea upgrade for now  [17:37]
<clarkb> ack, if anyone else wants to review https://review.opendev.org/c/opendev/system-config/+/955411 before 20:00 that would be great. Otherwise I can approve it around then (fungi's early dinner lines up with my lunch hour so I figure we can wait until after that block of time)  [17:41]
<clarkb> jrosser: it's been long enough that https://review.opendev.org/c/openstack/diskimage-builder/+/954760 should've addressed your debian image problems with backports  [17:53]
<clarkb> jrosser: not a rush but any chance you can confirm that is the case?  [17:53]
<clarkb> once that is confirmed I can abandon https://review.opendev.org/c/zuul/zuul-jobs/+/954280  [18:05]
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Multiple Gitea splash page updates  https://review.opendev.org/c/opendev/system-config/+/952407  [18:41]
<fungi> headed out to grab a bite to eat, shouldn't be more than an hour  [18:56]
<opendevreview> Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer  https://review.opendev.org/c/openstack/project-config/+/941138  [19:12]
<opendevreview> Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer  https://review.opendev.org/c/openstack/project-config/+/941138  [19:13]
<opendevreview> Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer  https://review.opendev.org/c/openstack/project-config/+/941138  [19:14]
<Ramereth[m]> corvus: that's helpful, but all of the external_id fields are null so I don't know how to reference them on my end  [19:15]
<opendevreview> Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer  https://review.opendev.org/c/openstack/project-config/+/941138  [19:15]
<corvus> Ramereth[m]: the pending ones haven't created a cloud image yet, so you can ignore them (also, there are way too many of those right now due to a bug that should be fixed soon)  [19:16]
<Ramereth[m]> corvus: but the deleting ones don't show any external_ids either  [19:17]
<corvus> Ramereth[m]: you can ignore those too -- those are likely to have failed before we started uploading them  [19:18]
<Ramereth[m]> ah okay  [19:18]
<corvus> (in other words, if we have an external id, it should be there; otherwise, something has gone wrong that doesn't involve the target cloud)  [19:19]
<Ramereth[m]> So I updated your script to the following:  [19:25]
<Ramereth[m]> `curl -q https://zuul.opendev.org/api/tenant/opendev/providers |jq -r '.[] | select(.name == "osuosl-regionone-main").images[].build_artifacts[].uploads[] | select(.state == "deleting") | select(.external_id != null) | [.state, .external_id] | @tsv'`  [19:25]
<Ramereth[m]> According to this, it looks like we don't have any images that are "stuck"?  [19:25]
<corvus> Ramereth[m]: yes, i agree  [19:31]
<clarkb> I've approved the gitea 1.24.3 upgrade change. It usually takes about an hour to gate so figured a head start was good  [19:37]
<clarkb> can always -2 or -W it if something comes up in any additional reviews  [19:37]
<opendevreview> Clark Boylan proposed opendev/system-config master: DNM test eavesdrop running on a noble node  https://review.opendev.org/c/opendev/system-config/+/955520  [20:05]
<fungi> cool  [20:24]
<fungi> ready for when it merges  [20:25]
<opendevreview> Merged opendev/system-config master: Update to gitea 1.24.3  https://review.opendev.org/c/opendev/system-config/+/955411  [21:02]
<opendevreview> Clark Boylan proposed opendev/system-config master: DNM test eavesdrop running on a noble node  https://review.opendev.org/c/opendev/system-config/+/955520  [21:03]
<opendevreview> Clark Boylan proposed opendev/system-config master: Switch IRC and matrix bots to log with journald rather than syslog  https://review.opendev.org/c/opendev/system-config/+/955544  [21:03]
<fungi> zuul estimates 3 minutes left on the hourly buildset  [21:05]
<fungi> after which the deploy jobs for 955411 should kick in  [21:06]
<corvus> clarkb fungi ykarel as i mentioned earlier, i think we're now at the point where we've done everything we can to make zuul-launcher work as designed, and now we need to change the design to accommodate the un-actionable errors we're getting from the cloud.  i proposed this change which i think will have the desired effect: https://review.opendev.org/c/zuul/zuul/+/955545 Require multinode requests served from same provider  [21:07]
<fungi> thanks corvus!  [21:07]
<corvus> i don't want to rush that one through, so we may still have the current behavior for another day or so  [21:08]
<corvus> (while it's reviewed)  [21:08]
<corvus> the bugfix for pending uploads merged, so i'm going to restart the launchers now  [21:10]
<fungi> infra-prod-service-gitea is in progress  [21:12]
<clarkb> https://gitea09.opendev.org:3081/opendev/system-config/ is upgraded  [21:13]
<clarkb> I can git clone system-config from gitea09 as well  [21:13]
<clarkb> it seems to be working for me so far  [21:14]
<fungi> Powered by Gitea Version: v1.24.2  [21:15]
<fungi> mmm  [21:15]
<fungi> oh, i hit a redirect  [21:15]
<clarkb> ya, you have to be careful navigating the web ui, some links send you back to the haproxy  [21:16]
<fungi> okay, getting v1.24.3 now  [21:16]
<fungi> yeah, i was doing 3080/tcp instead of 3081  [21:16]
<clarkb> 09-11 are done now  [21:16]
<clarkb> 13 is the one I'm slightly worried about  [21:16]
<fungi> looks like it should be done now?  [21:19]
<fungi> ah, still restarting  [21:20]
<clarkb> no, 13 is just finishing up image downloads and doing the upgrade now  [21:20]
<fungi> yeah  [21:20]
<fungi> stopping is in progress still  [21:20]
<clarkb> this restart will be slow because I think it stays up until a timeout for existing connections, and since this is the backend with all the connections it is going to be slower  [21:21]
<fungi> https://gitea13.opendev.org:3081/ is loading for me  [21:21]
<clarkb> now 13 is done  [21:21]
<fungi> agreed  [21:21]
<clarkb> worried unnecessarily. I figured it would be ok but if any were to give us trouble it would be this one  [21:21]
<clarkb> all of them are done now. The job should finish up soon  [21:23]
<clarkb> corvus: reading the commit message on that change this seems like a reasonable approach. It maintains locality when possible but still allows you to mix k8s and openstack and ec2 or whatever if you wish in a single nodeset  [21:24]
<clarkb> I'll work on a proper review shortly  [21:24]
<clarkb> https://zuul.opendev.org/t/openstack/buildset/738b650c6e0b4de38309f62ecea974ad that is a successful gitea upgrade buildset  [21:25]
<fungi> yeah, i'm already about halfway through it, but based on the description i agree it's the best possible tradeoff  [21:25]
<clarkb> the one thing I haven't checked yet is replication  [21:25]
<corvus> clarkb: yep that's what i'm thinking  [21:25]
<opendevreview> Clark Boylan proposed opendev/system-config master: DNM test eavesdrop running on a noble node  https://review.opendev.org/c/opendev/system-config/+/955520  [21:27]
<clarkb> https://opendev.org/opendev/system-config/commit/a0fc9e3edab9e2b7c5806fe93a5386381499756f replication seems to be working  [21:27]
<corvus> since the bugfix for the leaked nodeset request locks merged and we're running that now, i have deleted what is hopefully the last batch of leaked lock znodes.  [21:33]
<corvus> the launchers look like they are more or less done deleting all the extra "pending" uploads  [21:34]
<clarkb> corvus: ok posted a couple of comments but overall I think this is good  [21:41]
<corvus> thx, replied  [21:52]
<clarkb> https://review.opendev.org/c/opendev/system-config/+/955544 is a prep change for eventually replacing the current eavesdrop server with a new noble one. I think it is largely a no-op (we've tested it a bit at this point) to move from syslog to journald for container logging  [22:01]
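As a rough illustration of what that switch means on the consumer side (a sketch only; the container name and tag below are placeholders, not necessarily what the change uses): with docker's journald log driver, container output lands in the systemd journal and can be filtered per container:

    # compose-style logging config for a bot container (sketch):
    #   logging:
    #     driver: journald
    #     options:
    #       tag: ircbot
    # logs can then be read back via journald field filters:
    journalctl CONTAINER_NAME=ircbot --since today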
<clarkb> ok time to put the meeting agenda together. I'll try to do niz updates and updates on gerrit and so on. Anything else that should be in there?  [22:43]
<clarkb> my edits are in. I'll send this out in ~10-15 minutes. Let me know before then if I'm missing something important  [22:56]
<corvus> my main question is "wen delete nodepool" which i imagine is covered under at least one of the existing topics :)  [23:08]
<clarkb> yup we can talk about that in the first topic tomorrow. I've just sent the agenda  [23:08]
<clarkb> oh nope I failed to send. One moment  [23:09]
<clarkb> now it should be in your inboxes  [23:10]
<corvus> email is hard  [23:11]
<fungi> also as far as wen baleet nodepull i'm good with any time now  [23:13]
<fungi> we can arrange to have a small memorial service for it if anyone wants  [23:15]
<fungi> otherwise... _press_button_  [23:15]
<fungi> the stories about where things in the openstack sdk came from are going to get that much more amusing now when we talk about shade getting split off of a component that no longer exists at all  [23:17]
<mordred> what a fun family tree :)  [23:45]

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!