Monday, 2025-11-03

opendevreviewyatin proposed opendev/irc-meetings master: Move neutron team and CI meeting by 1 hour  https://review.opendev.org/c/opendev/irc-meetings/+/96590610:47
ykarelcorvus, fungi started seeing the mixed node provider issue again, was something changed again?12:39
ykarelhttps://zuul.openstack.org/build/8afc3ba76b6a437d82bb37a0cea0637012:39
ykarelhttps://zuul.openstack.org/build/9a8dc417c1164b3482d4daecf0f0af6712:39
opendevreviewMerged opendev/irc-meetings master: Move neutron team and CI meeting by 1 hour  https://review.opendev.org/c/opendev/irc-meetings/+/96590612:52
ykarelanother from today https://zuul.opendev.org/t/openstack/build/facb5a9deebe4af3b1f7170618338b4713:28
fricklerykarel: looks like something is amiss with rax in general for the past two days, very likely related to the latest updates https://grafana.opendev.org/d/fd44466e7f/zuul-launcher3a-rackspace?orgId=1&from=now-7d&to=now&timezone=utc&var-region=$__all13:35
fricklerinfra-root: ^^13:35
frickleralthough other graphs look similarly broken and if we were really running only 5 instances in total I think we'd know?13:37
frickleryes, global zuul status graph looks more like I'd expect it to. so the multinode issue will likely be something different, will defer to corvus I guess13:41
ykarelfrickler, ack13:51
ykarelyes the last one above was on rax itself and that's not on graphs so something odd there13:52
*** dhill is now known as Guest3046913:53
corvushrm, i wonder if we're letting "ready/building node" take priority over provider locality after a failure....14:49
corvusthe graphs are an easy fix, we just changed some of the escaping in the metric names.  i'll look at both things later today.14:49
opendevreviewJames E. Blair proposed openstack/project-config master: Update zuul-launcher metrics  https://review.opendev.org/c/openstack/project-config/+/96594015:04
fungiso... someone reached out to me via irc privmsg over the weekend asking to have their third-party ci account added to the "Zuul Summary" tab in gerrit, which is controlled by inclusion in the "Third-Party CI" group in gerrit i think?15:23
fungithat group is owned by the "Third-Party Coordinators" group, for which i think the only currently active member is frickler15:23
fungiis this a workflow, or even feature, that we want to reconsider?15:24
clarkbfungi: https://gerrit.googlesource.com/plugins/zuul-results-summary/+/refs/heads/main/web/zuul-summary-status-tab.ts#270 I think it only looks at message content15:47
fungiaha, okay so it's a matter of formatting the results appropriately?15:48
clarkbfungi: https://gerrit.googlesource.com/plugins/zuul-results-summary/+/refs/heads/main/web/zuul-summary-status-tab.ts#223 looks like it checks other attributes too but it's all message content15:48
clarkbso yes I think you need to configure the commenter name and the message correctly then it is automatically picked up15:49
fungimakes sense. thanks for the clarification! i expect we have outdated user-facing documentation about a lot of this stuff15:51
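For reference, a rough sketch of the kind of commenter-name and message-content check described above. This is illustrative only: the actual patterns live in the linked zuul-summary-status-tab.ts and may differ, and the sample comment format shown here is an assumption.

    import re

    # Illustrative only -- the real matching lives in zuul-summary-status-tab.ts
    # and may use different patterns.  This just mimics the idea of recognizing
    # a Zuul results comment by the commenter's name plus per-build lines in the
    # message body.
    COMMENTER = re.compile(r"\bzuul\b", re.IGNORECASE)  # assumed name check
    BUILD_LINE = re.compile(
        r"^- (?P<job>\S+) (?P<url>https?://\S+) : (?P<result>[A-Z_]+)",
        re.MULTILINE,
    )

    def looks_like_zuul_results(author_name: str, message: str) -> bool:
        """Return True if a Gerrit comment resembles a Zuul results summary."""
        return bool(COMMENTER.search(author_name)) and bool(BUILD_LINE.search(message))

    sample = (
        "Build succeeded (check pipeline).\n"
        "- openstack-tox-py312 https://zuul.opendev.org/t/openstack/build/abc123 : SUCCESS in 5m 02s\n"
    )
    print(looks_like_zuul_results("Zuul CI", sample))  # True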
clarkbinfra-root I believe https://review.opendev.org/c/opendev/system-config/+/965866 will address the zuul restart stoppage when a service is already shut down. The message check there is based on the stderr captured by the failed run from the weekend before last16:06
clarkband then I was going to double check no one else wanted to try the gitea 1.25 update before I started working on that. Oh and if we think we can land the etherpad upgrade with the launcher situation then I'd like to try and do that soon too (please test on the held node too; it is at 50.56.157.144 and I have a clarkb-test pad up there you can check)16:08
corvusclarkb: fungi i think https://review.opendev.org/965940 warrants quick review to fix the graphs16:42
corvushas a +2 from frickler already but i want us all to be aware of that16:43
fricklerI am "Third-Party Coordinators"? good to know ;) should I add infra-root or some other similar group? I think this isn't something really openstack specific anymore, right? or is that group really obsolete, then?16:43
clarkbdone16:43
clarkbfrickler: I suspect that the way the group is used is fairly openstack specific, but yes if some other projects (like zuul with software factory third party ci) wanted to allow voting +/-1 we'd need to sort that out16:44
fricklerthe other option would be to add the TC. I'd just prefer not to be left on my own there once I remove the inactive members16:47
corvusfungi: just a guess: opendev puts the pipeline name in its results, which is not standard zuul.  that's the first thing i'd check for what may not match.16:52
fungithanks. in this case i have a feeling they're not using zuul at all, or it's a very old zuul. for example, no buildset result text/link, and individual build result entries aren't in a bullet list17:04
opendevreviewMerged openstack/project-config master: Update zuul-launcher metrics  https://review.opendev.org/c/openstack/project-config/+/96594017:10
mnasiadkafrickler: is that third party thing OpenStack or OpenDev? ;-) (judging if TC should be even involved)17:20
clarkbI think the friction here is that opendev doesn't want to police third party ci for projects because the policies are varied and it puts us closer to the position of supporting people with the ci itself, which is something we really don't have the time to do17:21
clarkbbut at the same time third party ci isn't inherently openstack specific (zuul has third party ci for example)17:21
clarkbI'm trying to remember what exactly that group all does. I think I'm mistaken that it allows +/-1 voting and instead it merely identifies the bots as bots in the gerrit system17:22
clarkbit's probably fine for opendev to manage that17:22
fricklerhttps://docs.opendev.org/opendev/system-config/latest/gerrit.html has some hints, like "capability.emailReviewers = deny group Third-Party CI", but not sure if that is still current?17:26
fungii hope it's still current, since we don't normally make changes to the global gerrit configuration without documenting it there17:27
clarkbya it should be in sync17:27
corvusykarel: remote:   https://review.opendev.org/c/zuul/zuul/+/965954 Launcher: honor provider when selecting ready nodes [NEW]        17:36
corvusthat should take care of it -- sorry that slipped through.  that was the "reuse nodes that were already built for a canceled request" code path, but we forgot to limit it by provider.17:36
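A purely illustrative sketch of the fix being described (not Zuul's actual launcher code): when reusing ready nodes left over from a canceled request, only consider nodes belonging to the provider handling the current request. The node ids and provider names below are made up.

    from dataclasses import dataclass

    # Not Zuul's implementation -- just the shape of the provider-locality filter.
    @dataclass
    class Node:
        id: str
        provider: str
        state: str  # e.g. "ready", "building", "in-use"

    def reusable_ready_nodes(nodes, provider):
        """Select leftover ready nodes, limited to the handling provider."""
        return [n for n in nodes if n.state == "ready" and n.provider == provider]

    pool = [
        Node("np0000000001", "rax-dfw", "ready"),
        Node("np0000000002", "raxflex-iad3", "ready"),  # skipped: wrong provider
    ]
    print([n.id for n in reusable_ready_nodes(pool, "rax-dfw")])  # ['np0000000001']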
fungiclarkb: stephenfin: looks like pypi has pbr 7.0.3 as of 35 minutes ago17:39
clarkbcorvus: in that test case you set the instances limit to 2, is that per provider, not global, right?17:39
clarkbfungi: yup I just looked at that17:39
clarkbso theoretically we're ready now from the pbr side of things?17:39
stephenfinthere's still pyproject.toml work to be done, but hopefully that's the end of the panic17:40
corvusclarkb: yep17:40
stephenfinguess we'll know at some point in the next month or so17:40
clarkbcorvus: thanks +2 but with a small testing suggestion17:41
clarkbstephenfin: I think upstream had some ability to test it. Maybe we ask them to check now?17:41
stephenfinI already left notes for jaraco on the two tickets he opened17:42
clarkbperfect thanks17:43
corvusclarkb: ++ done17:44
fricklerfungi: so it seems to me that the 3rd party thing is better suited to be in opendev hands then, rather than openstack. how about I add you to it and then you can follow-up however you think is appropriate?17:44
corvus(and yes, it was implicit due to quotas, but belts and suspenders)17:44
fungifrickler: that's fine, though in this case it turned out there was no followup required since what the individual was asking about wasn't actually controlled by membership in that group anyway17:46
fricklerfungi: yeah, I was more thinking about following up with further member additions and/or cleanup17:47
clarkbcorvus: considering that launcher bug I would guess that things generally work as it's still a fallback case? In that case moving forward with https://review.opendev.org/c/opendev/system-config/+/956593 is probably fine as long as the held etherpad looks good18:15
corvusyes, i don't think we need to change any plans due to it.18:16
clarkbI just wanted to make sure it's not a 100% fail case for multinode jobs that rely on shared networks18:17
corvusit's more likely to be hit after gate resets when there are canceled node requests18:17
corvusif things are proceeding smoothly, less so.18:18
corvusi'd like to get it merged today if possible; so i'm trying to figure out what's causing the timeouts.18:18
opendevreviewClark Boylan proposed opendev/system-config master: Update Gitea to 1.25  https://review.opendev.org/c/opendev/system-config/+/96596018:30
clarkbthat is a first pass at gitea 1.25.18:31
clarkbLooks like we also need to prune the vexxhost backup server. We can potentially purge backups for eavesdrop01, paste01, restack01, and review02 from that server as well if we're comfortable doing that (maybe a subset or all of them?)18:35
fungii just started18:36
fungiit's been running in a root screen session for about 5 minutes18:36
clarkback thanks18:38
clarkbany thoughts on whether we should be purging any of those four sets of backups at this point?18:38
fungino thoughts. my head is currently full of other thoughts18:39
fungialso see #openstack-infra where i pinged dmsimard (since he doesn't seem to be in here) about our ovh credits running out18:40
clarkbya just saw that. We can ping amorin too if necessary18:40
clarkbgood news on the backup front is we get regular reminders about it so if we don't make a decision now we'll be reminded in a month or so :)18:40
fungiyeah, he seems to be in here (and not there)18:40
clarkbhttps://zuul.opendev.org/t/openstack/build/6b4b75604b4b49c3b4431327b3a1c25e something is angry18:49
clarkbafs: disk cache read error in CacheItems slot 1562498 off 124999860/125000020 code -5/80 pid 493006 (apache2)18:50
clarkbfrom dmesg18:50
fungiurg18:50
opendevreviewJames E. Blair proposed openstack/project-config master: Further fixes to zuul-launcher metrics  https://review.opendev.org/c/openstack/project-config/+/96596618:50
clarkbspot checking another mirror the problem does not appear to be global18:50
fungii guess that's on an executor?18:50
clarkbfungi: no that is mirror.dfw3.raxflex.opendev.org18:50
fungioh18:50
corvusmaybe stop apache there and clean the cache?18:51
fungiah, the apache2 there should have been a giveaway18:51
corvusactually, maybe just fs flushall first without even stopping apache18:51
clarkbya looking at the logs it's the same slot each time18:52
clarkback will try that18:52
fungichecking dmesg to see if there are any lower level errors mentioned for that block device18:52
fungithough these errors have completely filled the ring buffer18:53
clarkbfungi: I couldn't find any and the volume appears to still be rw18:53
clarkbfs flushall is running and not returning quickly. But I guess that may be expected18:53
fungislot 1562498 osm18:53
corvusi think it's a GC based thing18:54
fungislot 1562498 isn't the only one18:54
clarkbsystem load is really high now18:54
fungiup around 18:52:05 i see it complain about slots 1404560, 1523212, 1363181, and 143246318:54
clarkbbut the system seems mostly idle. Maybe it is something with writing to that disk?18:55
fungi[Mon Nov  3 18:52:05 2025] afs: Cannot open cache file (code -5). Trying to continue, but AFS accesses may return errors or panic the system18:55
fungiand then there's a kernel oops18:55
corvusreboot may be in order then18:56
fungikernel BUG at /var/lib/dkms/openafs/1.8.13/build/src/libafs/MODLOAD-6.8.0-85-generic-SP/afs_dcache.c:1209! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI18:56
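A small helper of the sort that could be used to spot these errors quickly; the patterns are taken from the dmesg lines quoted above, and other deployments may log slightly different text.

    import re
    import subprocess

    # Patterns copied from the dmesg output quoted above; adjust as needed.
    PATTERNS = [
        re.compile(r"afs: disk cache read error in CacheItems slot (\d+)"),
        re.compile(r"afs: Cannot open cache file \(code -5\)"),
    ]

    def afs_cache_errors(dmesg_text: str):
        """Return dmesg lines that look like openafs disk cache failures."""
        return [
            line
            for line in dmesg_text.splitlines()
            if any(pat.search(line) for pat in PATTERNS)
        ]

    if __name__ == "__main__":
        out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        for line in afs_cache_errors(out):
            print(line)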
clarkbshould we stop apache and disable it then reboot so we can try clearing the cache manually after it comes back up?18:56
fungilooks like we mount /var/cache/openafs from a logical volume on a cinder volume18:56
clarkbfungi: yes and that cinder volume should be split with apache2's cache18:57
fungiso something may have happened to the cinder volume, e.g. iscsi network interruption18:57
clarkbshould I reboot now?18:57
corvus++18:58
clarkbI'll disable apache2 as well18:58
clarkbfungi: you're ssh'd into the host is it ok if I reboot now?18:59
fungiyes, i'm logged out, sorry!19:00
clarkbI've asked the server to reboot itself. Pings have not stopped so I worry it may be hung up on some shutdown routine. But I've got a local machine with something like a 60 second timeout before it proceeds so I'll wait for a couple of minutes to see if that is the case here before asking openstack to shut down and start again19:01
clarkbafter 176 pings it appears to have stopped pinging. Just as I was sorting out how to ask openstack to shut it down19:04
clarkbnow we wait to see if it boots up properly19:04
clarkbok console log show shows it waiting on a number of processes. I'll give them a bit longer but I think I may need to hard stop it19:05
clarkbFailed to unmount /afs: Device or resource busy is the most recent message19:06
fungiyeah, perhaps unsurprising19:07
clarkbok I think it may be time to ask for the "physical" power button19:08
amorinhey, something wrong?19:08
fungithe shiny, candy-red button19:08
clarkbamorin: the issue we're debugging is unrelated to ovh. But we are near the end of our ovh credits19:09
fungiamorin: unrelated to what we're working on right now, just ovh credits expiring. got the "unable to process payment" notice today19:09
amorinoh, damn19:09
fungii pinged dmsimard about it already over in #openstack-infra since he's hanging out in there19:09
amorinok19:10
fungii'll bother you with it if he's not around19:10
clarkbmirror.dfw3 is in a powering off state but is not yet powered off19:10
amorinwill double check with him then and keep you posted19:10
fungithanks amorin!19:10
amoringood luck with the debug session19:10
clarkbaha there it goes it is shutdown now. I'll try booting it now19:10
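For reference, roughly the same stop/wait/start sequence done here by hand, sketched with openstacksdk; the cloud name is a placeholder for whatever clouds.yaml defines, and the timeouts are arbitrary.

    import openstack

    # A sketch of the hard power cycle above.  "raxflex-dfw3" is a placeholder
    # cloud name; adjust to your clouds.yaml entry.
    conn = openstack.connect(cloud="raxflex-dfw3")
    server = conn.compute.find_server("mirror01.dfw3.raxflex.opendev.org")

    conn.compute.stop_server(server)
    conn.compute.wait_for_server(server, status="SHUTOFF", wait=600)

    conn.compute.start_server(server)
    conn.compute.wait_for_server(server, status="ACTIVE", wait=600)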
fungi#status log Pruned /opt/backups-202010 on backup02.ca-ymq-1.vexxhost reducing utilization from 92% to 67%19:12
opendevstatusfungi: finished logging19:12
clarkbit won't power on19:12
clarkbI think we should disable that provider in zuul and then either boot a new server like we've done in the past and replace the whole thing, or engage rax to help debug19:13
fungiare you trying server start or reboot?19:13
clarkbfungi: server start after a server stop19:13
fungik19:13
clarkbshould I try a server reboot?19:13
fungino i don't think that'll work from a shutoff state19:13
corvusi'll write a config change to stop the provider19:13
fungianyway, i agree on pausing the provider for now19:13
clarkbfungi: yup confirmed it is an error to try rebooting a stopped instance19:13
fungiwe probably need dan_with or one of his colleagues to intercede19:14
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Disable rax-flex due to mirror issue  https://review.opendev.org/c/opendev/zuul-providers/+/96597119:14
clarkbdan_with: fyi we had a server in DFW3 (mirror01.dfw3.raxflex.opendev.org) become unhappy with io to a filesystem backed by a cinder volume. We ended up asking nova to shut down the server and now it won't boot up again after that19:14
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Revert "Disable rax-flex due to mirror issue"  https://review.opendev.org/c/opendev/zuul-providers/+/96597219:15
clarkbdan_with: we are capable of replacing the entire system (server, volume, and floating ip) and moving on from here with a replacement, but I figure considering it may be related to volume behaviors it may be worth having ya'll debug it?19:15
clarkbserver show also doesn't seem to record any reason for the boot failure19:16
fungiyeah, if nova thinks something's wrong with an attached volume it may refuse to start the instance19:16
clarkbvolume show on the volume doesn't show anything problematic either fwiw, but it's not the first time nova may be withholding a full story from us19:17
opendevreviewMerged opendev/zuul-providers master: Disable rax-flex due to mirror issue  https://review.opendev.org/c/opendev/zuul-providers/+/96597119:17
fungiright, i don't know if the cinder volume itself is actually the problem. corruption we saw before shutdown could have been entirely guest-side19:17
fungialso volume show is telling you what cinder thinks about the volume, which may not be the same as what nova thinks19:18
clarkbya I just wish nova had a note for why something isn't booting19:19
clarkbtask_state goes from powering-on to None and vm_state never changes from stopped19:24
clarkb(just skimming my terminal scrollback to see if I see anything useful, but that's about it. It does try to power on and then doesn't succeed apparently)19:25
opendevreviewClark Boylan proposed opendev/system-config master: Update Gitea to 1.25  https://review.opendev.org/c/opendev/system-config/+/96596019:38
opendevreviewClark Boylan proposed opendev/system-config master: Update Gitea to 1.25  https://review.opendev.org/c/opendev/system-config/+/96596020:01
corvusclarkb: https://review.opendev.org/965966 is a quick oopsie fix that would be good to get in20:10
fungithat is indeed a lot of additional occurrences20:18
clarkbcorvus: ack I've approved it20:21
clarkb(sorry was eating a sandwich)20:21
opendevreviewMerged openstack/project-config master: Further fixes to zuul-launcher metrics  https://review.opendev.org/c/openstack/project-config/+/96596620:47
clarkbI'm going to start putting the meeting agenda together. Gitea 1.25, gitea load mitigations, trixie mirroring, and the rax dfw3 mirror issue are all on my list of items to add in21:39
clarkbif there is anything else just let me know21:40
clarkbok I think I have all of those updates in now21:57
cardoeSo looking at the openstack-helm job at https://zuul.opendev.org/t/openstack/status and wondering what I need to do to unstick the 52-hour-stuck job.22:07
clarkblooks like that is request 9d53f1877a6648a38e11e064a3bdb09622:07
clarkbat 2025-11-01 17:41:16,212 zl01 marks a node for that request as ready but then doesn't proceed from there22:09
clarkbthat appears to be after corvus unstuck the upgrade process on zl01 and zl0222:10
clarkbby about 2.5 hours22:10
clarkblooks like the request is for 5 nodes22:11
clarkbhttps://grafana.opendev.org/d/0172e0bb72/zuul-launcher3a-rackspace-flex?orgId=1&from=now-24h&to=now&timezone=utc&var-region=$__all seems to reflect this: we have 3 ready nodes and 2 requested, but I'm not sure we're doing anything at this point for the other two requested nodes22:12
corvusi manually re-enqueued that.  we should be able to continue to debug based on the ids from clark.22:12
clarkbcorvus: cardoe  I think this may be that nova api error that forces us to use a different microversion22:13
clarkba normal server list against iad3 just failed for me22:13
clarkbnow to remember how to specify the older version so that we can list servers and delete them22:14
clarkbcorvus: all 5 servers show as active so none of them clearly indicate an error. I'm going to try and show each one without the microversion to see if I can determine which one is the problem and delete only that one22:18
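One way to pin the older microversion from Python, assuming the openstacksdk compute proxy honors default_microversion the way keystoneauth adapters do; the CLI equivalent is the --os-compute-api-version option. The cloud name is a placeholder.

    import openstack

    # Assumption: setting default_microversion on the compute proxy pins the
    # microversion for subsequent calls, letting server list/show work around
    # the null-field errors discussed above.
    conn = openstack.connect(cloud="raxflex-iad3")
    conn.compute.default_microversion = "2.1"

    for server in conn.compute.servers():
        print(server.id, server.name, server.status)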
clarkbcorvus: is that safe enough with zuul launcher?22:18
corvusclarkb: those nodes might get reused22:19
clarkb7f50209b-8224-4b3a-b192-171d4b6bd185 (npb29cdece0ec64) and ec872fa3-f349-44a5-8694-6d00becef9ea (npa24e1b4413c14) are bad22:19
clarkbcorvus: ya I was worried about that. Do we need to mark both of those for deletion on the nodepool side first then I can delete them manually on the cloud side?22:20
clarkbs/nodepool/launcher22:20
corvusnpb29cdece0ec64 is in hold22:20
corvusnpa24e1b4413c14 is ready22:20
clarkbinteresting that implies the node was good for long enough for us to use it22:21
clarkbbut now it appears to be creating the "you can't show this node with the new microversion" problem22:21
corvusa24 has been reassigned to 00d042a30e61445f90c498311c29979722:22
corvusso it's going to be in use when that request is finished22:22
clarkbI guess once the node is booted and we ssh keyscan it we're not interacting with the nova api again until we try to delete the instance22:23
corvusyep22:23
clarkbI have a number of holds in place if that server is one of mine (perhaps for etherpad?) I can recycle it22:23
corvusit is, but that's probably not important now22:23
clarkbcorvus: I think we have to clear out the two bad nodes from nova to get things working again. Either that or we have to start forcing zuul launcher to use the old microversion22:24
clarkbwhich seems like a step backwards from a high level22:24
corvusso... lemme get this straight:22:25
corvus1) everything works fine22:25
corvus2) something happens; we don't know what22:25
corvus3) all nova api calls start failing22:25
corvus4) the only way to fix that is a) delete all servers or b) use an old microversion22:25
corvusis that right?22:25
clarkbroughly, for 2) I think nova believes it to be some sort of db field corruption replacing some field with an unexpected null value. That then caused 3): failures for any attribute listing of that node with a new microversion that enforces that field be not null22:26
clarkbthe workaround works because old microversions didn't enforce that behavior. It seems likely to me that this bug has existed since forever and because there was no enforcement it was simply never noticed. Now that they enforce it, kaboom22:27
corvusbut this affects the "list all servers" call?22:27
clarkbcorvus: correct22:27
corvusso one server gets corrupted and it tanks the region.22:27
clarkb(as well as server show which is how I narrowed it down to those two)22:27
clarkbyes22:27
corvusclarkb: it looks like the new request is pretty stuck too.. honestly, i'd say sure just delete the nodes out from under the launcher and let's see what happens.  the job will fail, but that should be the extent of the fallout.  anything more is a bug.22:29
clarkbcorvus: ok this is weird, now server listing works again22:29
clarkbwithout the microversion I mean22:30
corvusoO22:30
clarkbcardoe you didn't go and magic things on the backend did you?22:30
cardoeI don't know what magic things to do.22:30
cardoeHeck I'd love to figure out Zuul enough to clean up container builds and all of this stuff22:31
cardoebut I'll admit too much black magic and not enough cycles22:31
clarkbcardoe: this is all booting servers in rax flex iad3 fwiw. Not really anything to do with job payloads22:31
cardoeNow the PTL might have done some magic22:31
clarkbcardoe: I believe the problem is that listing servers broke in that region. But now it magically works again for some reason. Maybe by listing/showing things using the old microversion it uncorrupted the fields or something22:32
clarkbhttps://bugs.launchpad.net/nova/+bug/2095364 this is the underlying issue I think22:34
clarkbthat new request is for the same job on a different openstack helm change. And it again has 3 available and 2 requested22:41
clarkbserver list is working, but I don't see any new booting instances. So it could be that there are two problems here. Or one now but was two before when server listing was not working22:42
clarkbok I think I see what may be part of problem 222:44
clarkbwe have these quotas: 10 instances, 51200MB of memory, and 20 cores. Unfortunately only 5 four-core instances fit into the core limit and 6 fit into the memory limit22:45
clarkbso I think we need to drop the instance limit in that cloud region to 5 :/22:45
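The arithmetic behind that conclusion, with the per-instance flavor size (4 vCPUs, 8192 MB) assumed from the figures above:

    # Quota figures from the log; the flavor size is an assumption.
    quota = {"instances": 10, "ram_mb": 51200, "cores": 20}
    flavor = {"vcpus": 4, "ram_mb": 8192}

    fit_by_cores = quota["cores"] // flavor["vcpus"]   # 20 // 4 = 5
    fit_by_ram = quota["ram_mb"] // flavor["ram_mb"]   # 51200 // 8192 = 6
    effective_cap = min(quota["instances"], fit_by_cores, fit_by_ram)

    print(effective_cap)  # 5 -- the core quota binds long before the 10-instance cap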
clarkbI know that cloudnull intended to bump the quotas up for us but it hasn't happened22:45
clarkbcardoe: ^ not sure if that is something you are able to assist with. I'm not sure if it hasn't happened due to capacity of the cloud or just time constraints22:46
clarkbcorvus: on the launcher side I suspect that the expectation is that the held nodes will eventually go away and we'll have enough room to fulfill the request. But maybe we need to consider nodes in a hold state as less ephemeral when calculating whether or not we have sufficient room to supply a request?22:47
cardoeLet me prod him.22:47
opendevreviewClark Boylan proposed opendev/system-config master: DNM intentional Gitea failure to hold a node  https://review.opendev.org/c/opendev/system-config/+/84818123:10
corvusclarkb: that is an interesting suggestion.  i can see how it makes sense in this case, but i wonder whether we should generalize it (since in other cases, it could cause NODE_FAILUREs).  it probably does make sense, but it's a tradeoff to consider carefully.23:10
clarkbcorvus: I'm not sure I understand the note about node failures? Wouldn't we boot fewer nodes if we counted held nodes against "permanent" quota?23:13
corvusif we only had iad3 with quota for 5 nodes and received a request for 5 nodes while we had 1 held node and we implement your suggestion, that becomes a NODE_FAILURE in the future, but today it waits for the held node to free up.23:14
corvus(with your suggestion, in opendev, we fail over to another cloud, but not everyone has 6 clouds)23:14
clarkbah I guess in my mind it would've never accepted the request in the first place23:14
clarkbso in the opendev scenario it falls back to another provider, and in a single cloud deployment it would indeed produce node failures since there is no fallback23:15
corvusit fails because if we consider held nodes as unusable provider quota, then we have no provider that could possibly fulfill the request.23:15
corvusyeah.  there's not really much of a distinction between "accepted" and "not accepted" for purposes of this conversation.  it's more about deciding whether a provider can possibly fulfill a request.23:16
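An illustrative sketch (not Zuul's launcher code) of the tradeoff being weighed here: if held nodes count as permanently consumed quota, a provider declines a request it could eventually serve once the hold expires, which becomes a NODE_FAILURE when no other provider can take it.

    # Not Zuul code -- just the decision being discussed.
    def provider_can_fulfill(request_nodes, quota_instances, in_use, held,
                             count_held_as_permanent=True):
        consumed = in_use + held if count_held_as_permanent else in_use
        return request_nodes <= quota_instances - consumed

    # iad3-like numbers: effective quota of 5, one held node, a 5-node request.
    print(provider_can_fulfill(5, 5, in_use=0, held=1))        # False: decline / fail over
    print(provider_can_fulfill(5, 5, in_use=0, held=1,
                               count_held_as_permanent=False)) # True: accept and wait for quota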
clarkbother (probably bad) ideas that come to mind: we could cannibalize held nodes if necessary and force them to be deleted to return quota. I wonder too if we can annotate the request in the status page somehow as "waiting for quota"23:18
clarkbI think if we don't want to change the behavior of request handling then updating the status page to make it more clear why we're waiting is a decent compromise23:18
corvusi think cannibalizing held nodes is not worth it.  but i definitely think we should expose more info.23:19
corvus(held nodes are generally held because a human needs time on them to fix something complex, so that's an expensive process we don't want to elongate.  but we do, by default, set a time limit on held nodes)23:20
clarkbin the meantime I guess I can delete my held nodes so that the OSH jobs can complete? Or do you think it is helpful to keep things in this state for now?23:29
corvusyeah, i think fundamentally, we don't have the capacity in raxflex-iad for that many held nodes, so that's probably best.23:37
corvuswe could probably fix this situation by deleting the request and the ready nodes; that might get it assigned to a different provider, but no guarantee it won't happen again.23:37
clarkbok I've deleted the autohold corresponding to those two nodes in rax iad23:48
clarkbthat should allow it to boot the two nodes required for the current request23:48
