Monday, 2025-07-28

opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Update ensure-ghc to use ghcup  https://review.opendev.org/c/zuul/zuul-jobs/+/84711106:54
*** ykarel_ is now known as ykarel08:02
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Update ensure-ghc to use ghcup  https://review.opendev.org/c/zuul/zuul-jobs/+/84711108:11
*** ykarel_ is now known as ykarel11:12
*** ykarel_ is now known as ykarel11:48
EnriqueVallespiGil[m]clarkb: fungi after allowing the new IP, the connection against opendev works perfectly fine. Thanks a lot!12:51
fungiEnriqueVallespiGil[m]: glad to hear it, let us know if you run into any other issues13:40
fungilooks like backup02.ca-ymq-1.vexxhost may be down. lots of backup failures reported and it's not responding to ssh/ping. i'll take a closer look in a sec15:25
fungii had left a prune going in a root screen session yesterday, hopefully it ended before the server went down or otherwise didn't corrupt anything by getting stopped early15:26
fungii had detached from it and was going to check back up on it now15:26
clarkbI think the underlying data structure is very git-like. In theory that makes it resilient to these types of issues15:27
funginova reports the instance is active, but console log show is taking a while to return, i have a feeling it's going to time out15:28
fungiFailure: Unable to establish connection to [...]: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')). Retrying in 0.5s. 1 retries left15:29
fungiand the automatic retry failed similarly15:29
fungiso it's acting like maybe the kernel is hung?15:30
fungior the hypervisor process itself15:30
clarkbya seems like either the instance within the hypervisor has gone sideways or the hypervisor itself has, and libvirt doesn't know so nova doesn't know15:31
fungisince we have redundant backup servers, i guess we could leave it in this state temporarily in case guilhermesp or ricolin or mnaser want to take a look (server 5665c088-8ce4-410d-9edd-53633c0d0b76 in ca-ymq-1)15:32
clarkbya may be worth waiting a bit15:35
clarkbI guess double check the other server is still up?15:35
fungibackup01.ord.rax up 1278 days15:37
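The checks described above can be reproduced from the command line; a rough sketch, assuming credentials for the vexxhost ca-ymq-1 region are available (the --os-cloud name below is illustrative):

    # Ask nova for the instance state, then try to pull the guest console log,
    # which hangs or times out when the hypervisor side is wedged.
    $ openstack --os-cloud vexxhost-ca-ymq-1 server show 5665c088-8ce4-410d-9edd-53633c0d0b76
    $ openstack --os-cloud vexxhost-ca-ymq-1 console log show 5665c088-8ce4-410d-9edd-53633c0d0b76 --lines 50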
opendevreviewMauricio Harley proposed opendev/irc-meetings master: Changing Barbican meeting frequency to bi-weekly.  https://review.opendev.org/c/opendev/irc-meetings/+/95601815:37
clarkbI've got local updates to apply but will be back shortly15:42
opendevreviewMauricio Harley proposed opendev/irc-meetings master: Changing Barbican meeting frequency to biweekly.  https://review.opendev.org/c/opendev/irc-meetings/+/95601815:48
clarkblooks like the most recent image builds had trouble finding debootstrap?15:51
clarkbwe're building on ubuntu, so it's unlikely that updates due to the impending trixie release are to blame. I wonder what could cause that?15:52
fungifinding the package or the command?15:52
clarkbthe package during dib dependency installation15:52
clarkbhttps://zuul.opendev.org/t/opendev/build/f18e2b9380e0492e85c2ee6f3636634c/console#1/0/1/ubuntu-noble15:52
clarkblooks like there may have also been an ssh connectivity blip to raxflex sjc1 I wonder if the underlying cause is the same in both cases15:54
fungipuc claims debootstrap 1.0.134ubuntu1 is in noble15:54
clarkbya seems to have affected raxflex sjc315:54
clarkbI suspect the problem is network related : https://zuul.opendev.org/t/opendev/build/e53a128884684cd19b76150442ece6aa is an ssh failure15:54
fungi"No package matching 'debootstrap' is available" is an odd way for an ssh failure to manifest15:55
fungii guess that can be chalked up to misleading error handling in ansible's package module?15:56
clarkbfungi: two different errors. One is ssh can't connect; the other is the debootstrap package isn't available15:56
clarkbI'm theorizing that both are due to some underlying network issue within that cloud region, as the cloud region seems consistent between them (though I did not check every single incident)15:57
fungiyeah, i mean if this were an error from apt itself i'd have a better idea as to what happened, but since the actual problem has been swallowed and hidden by an ansible module i'm going to guess it's just a random network problem getting misidentified as a missing package15:58
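A quick way to separate a genuinely missing package from a transient mirror or network failure is to redo the lookup by hand on an affected node; a minimal sketch, assuming shell access to the build node:

    # Refresh the package index, then ask apt what it knows about debootstrap.
    # If the index refresh failed against the mirror, the policy query reports
    # the package as unknown, which is consistent with the "No package matching
    # 'debootstrap' is available" error surfaced by Ansible's package module.
    $ sudo apt-get update
    $ apt-cache policy debootstrap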
opendevreviewMerged opendev/irc-meetings master: Changing Barbican meeting frequency to biweekly.  https://review.opendev.org/c/opendev/irc-meetings/+/95601816:03
*** qwebirc22566 is now known as dpanech116:13
dpanech1Hi, this review hasn't merged for nearly 3hrs, could someone check? Thanks. https://review.opendev.org/c/starlingx/metal/+/95596816:16
clarkbdpanech1: that change was approved before its depends-on merged. If the depends-on and that change don't share a zuul queue then zuul won't auto enqueue it to the gate16:23
clarkbyou can reapprove the change and it should enter the gate now that its depends-on has merged16:23
fungigood eye, i was still looking at it and hadn't spotted the timing there16:24
fungii wonder if zuul could learn to reprocess changes in that situation, though i suppose that would present a caching challenge16:26
clarkbya I think the easy solution would be to configure zuul to share a queue amongst related projects16:26
clarkbIn theory you want to do this anyway so that they are cogating16:27
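For reference, sharing a queue is a small amount of Zuul project configuration; a hypothetical sketch (the queue name and file contents below are illustrative, not the actual starlingx config):

    # In each related project's .zuul.yaml, reference the same named queue so
    # approved changes connected by Depends-On gate together.
    $ cat .zuul.yaml
    - queue:
        name: starlingx
    - project:
        queue: starlingx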
dpanechclarkb: when you say "reapprove", what do you mean ?16:28
clarkbdpanech: leave a new comment on the gerrit change that +1 approves the change16:29
clarkbdpanech: you left a Code-Review +1 (lgtm) vote, but I think it needs to be a Workflow +1 (approved) vote16:33
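For completeness, the vote can also be reapplied without the web UI via Gerrit's ssh interface; a minimal sketch, assuming an account with Workflow voting rights (the username and patchset number are illustrative):

    # Re-add the Workflow +1 approval on the change so zuul enqueues it to gate.
    $ ssh -p 29418 someuser@review.opendev.org gerrit review --label Workflow=+1 955968,2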
clarkbcorvus: I've been asked to put together openinfra foundation newsletter content for opendev and zuul. I think the most notable thing that has happened in opendev recently is dropping nodepool. Would you prefer we leave that out of the newsletter for now to avoid people trying it before we consider niz ready?16:38
corvusyes please16:39
clarkbok in that case maybe I can talk about some of the dib improvements instead.16:39
clarkbOpen to other ideas as well. Maybe mention the matrix comms spec?16:39
corvusclarkb: i think if you word it carefully, you could probably talk about the new location for configuring image builds, and how it's more accessible.  it's a fine line to walk.  :)16:41
clarkback let me pull up an etherpad and work on a draft16:41
clarkbcorvus: infra-root drafting here https://etherpad.opendev.org/p/_MU8qz8RHgiUv4yP1jNY16:47
clarkbmaybe something simple and mostly just relevant facts like that is the way to go16:47
fungiyeah, i think we can probably talk about some of the upshot of niz in the opendev section without advertising the zuul feature yet16:48
clarkbthen I'm also open to ideas on any content we want to capture for zuul too16:48
clarkbbut we can discuss that in the zuul matrix room16:49
fungifrom the opendev side some people have wondered where image build logs, image downloads and the api for image status have gone16:49
clarkback I've added a link to the zuul dashboard images tab in the openstack tenant to my draft document above too16:53
fungithanks! i'll take a look when i'm done babysitting plumbers16:53
dpanechclarkb: I'm not a core reviewer. Can the user who had added the W+1 remove and re-add it? Will that work?17:05
clarkbdpanech: yes17:06
clarkbdpanech: you might also be able to recheck the change17:06
clarkbI think that may work too17:06
clarkb(and that is something anyone can do as long as the votes are already in place)17:06
fungijust note that in the particular zuul tenant those projects use, a recheck will cause the change to go back into the check pipeline first and then into the gate once it passes there, so it would result in additional testing and a longer delay than just removing and re-adding the workflow +1 or another core reviewer adding a new workflow +117:08
clarkbto follow up on the executor disk utilization: we seem to peak around 30% during periodic jobs, so I think the ephemeral repurposing should leave plenty of room for now17:10
dpanechclarkb: ok thanks, we'll try that17:11
clarkbit merged17:24
clarkbcurrent draft content for the newsletter has been provided to the foundation18:02
clarkbI made some small organization edits but the content is basically the same18:02
clarkb(I combined opendev list items 1 and 2 together then did the same for 3 and 4)18:03
fungilgtm, thanks!18:06
clarkbhttps://zuul.opendev.org/t/openstack/nodeset-requests this view is really neat18:38
clarkbcorvus: ^ idea the zuul status page could have a link for queued items waiting on requests to that page (maybe with a deep link to the specific request in the list on that page?)18:38
fungimy personal fave for the new dashboard views is the images list, how you can drill down to the actual build result pages to find the build logs and downloadable images as artifacts18:40
corvusclarkb: yep i'd like to get there; unclear whether it should go to that page and we highlight the row, or go to a dedicated page for the request (that doesn't exist yet).  we can also link the other direction.18:45
fricklerwhat's wrong with those 2d old tacker periodic-weekly buildsets, did something get stuck there?18:47
clarkbthey are for periodic weekly jobs that appear to still be queued up19:23
clarkband all of them are grabbed by ovh-bhs1-main19:23
clarkbthe timing from 2 days ago makes me wonder about our reboots and upgrades; maybe there is some shutdown race or something.19:24
clarkbit wouldn't surprise me if there is something odd about the jobs too. The tacker jobs continue to run devstack in pre-run but are sensitive to changes in tacker19:25
clarkbI'm working on lunch but next on my todo list is the meeting agenda. Is there anything to add/remove/edit on the agenda?19:31
clarkbI guess niz updates19:32
clarkband maybe a recap of the gitea-lb stuff19:32
clarkbcorvus: did the niz change land that forces nodes into the same provider unless it is impossible to fulfill the request that way?20:09
clarkblooks like yes, so we should be operating with that in place at this point. Just want to make note of that in the agenda so that if we still see mixed nodesets we can call them out20:18
clarkblooking at the tacker requests they are all for ubuntu-noble which I think is a valid request still? But I notice that most of the other requests being made are for the modern niz labels20:24
clarkbCould it be something to do with our config for the ubuntu-noble label being unsatisfiable somehow?20:24
clarkblooks like there are failures for 0821f4e2251c407395a0b3766de7722d in ovh bhs1 due to quota being exceeded20:27
clarkboh ya that cloud is completely cooked20:27
clarkbhttps://grafana.opendev.org/d/cde4271d48/zuul-launcher3a-ovh?orgId=1&from=now-24h&to=now&timezone=utc&var-region=$__all20:27
clarkbcorvus: will setting the max-servers limit on that provider to 0 update existing requests there?20:27
clarkbdoing a server list against that cloud does seem to show a large number of active nodes. At first I thought maybe these got leaked from the nodepool wind-down, but the hostnames appear to use the new-style non-consecutive identifiers20:29
clarkbcorvus: this smells like a zuul launcher bug to me. In particular all of the nodes are marked ready but not being reused by new requests. I think we identified this as a risk of the nodeset "clamping" but the 8 hour ready node timeout was expected to clear things out20:32
clarkbseems like maybe that isn't happening (the timestamps indicate the nodes are ~2 days old)20:32
clarkbI think rax-sjc3 may be entering a similar state but isn't quite using all of its quota yet, so things semi-work there20:32
clarkbtl;dr is we have ~125 active ready nodes in ovh-bhs1 consuming the quota and preventing new boots. They are at least 2 days old so aged out past the timeout. Ideally they either get recycled into new requests or aged out to free up quota20:35
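The leaked-looking capacity can be confirmed directly against the cloud; a rough sketch, assuming a clouds.yaml entry for the region (the --os-cloud name is illustrative):

    # List what is still ACTIVE in the region, then spot-check the age of a node
    # to see whether it predates the 8 hour ready-node timeout.
    $ openstack --os-cloud ovh-bhs1 server list --status ACTIVE
    $ openstack --os-cloud ovh-bhs1 server show <server-uuid> -c name -c created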
corvusready nodes for ubuntu-noble should just get used by single-node nodeset requests, so that's likely a bug20:50
corvussetting a 0 limit should (might?) cause requests to bounce to another provider because it would be impossible to fulfill20:51
corvusnot positive about that20:51
corvusclarkb: if it's operationally urgent (ie, can't wait for debug and fix), you can just delete the ready nodes20:51
corvusi'm not able to look at that right now, but would be happy to later20:52
clarkbcorvus: I don't think it is urgent. Things are quiet in zuul right now21:04
clarkbdeletion would be via the web ui as an authenticated admin?21:04
corvusyes or client/api21:04
clarkb(mostly asking for the info I think for now I'll leave it as is)21:04
clarkbI've just tested reindexing changes on gerrit 3.11 using the held 3.11 test node. That worked fine, but there are only a handful of simple changes in that installation, so it isn't a super exhaustive test; still, it's good to see that reindexing works in general21:11
clarkbI was worried that the held nodes were old enough to have gotten cleaned up as part of the nodepool cleanup, but they are niz nodes so no problem there21:11
clarkbI didn't want to dig into actual upgrade testing if that was going to fail, but it doesn't so I should start trying to do the testing now21:12
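One common way to run the kind of reindex exercised on the held node is Gerrit's offline reindex program; a minimal sketch, assuming a plain (non-container) test site at an illustrative path:

    # Rebuild the changes index against the test site; other indexes can be
    # included by dropping the --index flag.
    $ java -jar /home/gerrit/site/bin/gerrit.war reindex -d /home/gerrit/site --index changes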
opendevreviewJames E. Blair proposed opendev/system-config master: Mirror python2.7 images to quay  https://review.opendev.org/c/opendev/system-config/+/95605622:47
clarkbfungi: looks like the backup server is still not reachable? I'm thinking maybe tomorrow we go ahead and ask nova to reboot it for us?22:57
clarkbI'll put that on the agenda for the meeting then get the agenda sent out22:57
opendevreviewMerged opendev/system-config master: Mirror python2.7 images to quay  https://review.opendev.org/c/opendev/system-config/+/95605623:05
fungiclarkb: yeah, i haven't seen any of the vexxhost crew pop in to say they were looking at it either23:34
