Monday, 2025-07-28

opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Update ensure-ghc to use ghcup  https://review.opendev.org/c/zuul/zuul-jobs/+/84711106:54
*** ykarel_ is now known as ykarel08:02
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Update ensure-ghc to use ghcup  https://review.opendev.org/c/zuul/zuul-jobs/+/84711108:11
*** ykarel_ is now known as ykarel11:12
*** ykarel_ is now known as ykarel11:48
EnriqueVallespiGil[m]clarkb: fungi after allowing the new IP, the connection against opendev works perfectly fine. Thanks a lot!12:51
fungiEnriqueVallespiGil[m]: glad to hear it, let us know if you run into any other issues13:40
fungilooks like backup02.ca-ymq-1.vexxhost may be down. lots of backup failures reported and it's not responding to ssh/ping. i'll take a closer look in a sec15:25
fungii had left a prune going in a root screen session yesterday, hopefully it ended before the server went down or otherwise didn't corrupt anything by getting stopped early15:26
fungii had detached from it and was going to check back up on it now15:26
clarkbI think the underlying data structure is very git-like. In theory that makes it resilient to these types of issues15:27
funginova reports the instance is active, but console log show is taking a while to return, i have a feeling it's going to time out15:28
fungiFailure: Unable to establish connection to [...]: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')). Retrying in 0.5s. 1 retries left15:29
fungiand the automatic retry failed similarly15:29
fungiso it's acting like maybe the kernel is hung?15:30
fungior the hypervisor process itself15:30
clarkbya seems like either the instance within the hypervisor has gone sideways or the hypervisor itself has, and libvirt doesn't know so nova doesn't know15:31
fungisince we have redundant backup servers, i guess we could leave it in this state temporarily in case guilhermesp or ricolin or mnaser want to take a look (server 5665c088-8ce4-410d-9edd-53633c0d0b76 in ca-ymq-1)15:32
clarkbya may be worth waiting a bit15:35
clarkbI guess double check the other server is still up?15:35
fungibackup01.ord.rax up 1278 days15:37
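The checks described above can be reproduced from the command line; a rough sketch, assuming credentials for the vexxhost ca-ymq-1 region are available (the --os-cloud name below is illustrative):

    # Ask nova for the instance state, then try to pull the guest console log,
    # which hangs or times out when the hypervisor side is wedged.
    $ openstack --os-cloud vexxhost-ca-ymq-1 server show 5665c088-8ce4-410d-9edd-53633c0d0b76
    $ openstack --os-cloud vexxhost-ca-ymq-1 console log show 5665c088-8ce4-410d-9edd-53633c0d0b76 --lines 50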
opendevreviewMauricio Harley proposed opendev/irc-meetings master: Changing Barbican meeting frequency to bi-weekly.  https://review.opendev.org/c/opendev/irc-meetings/+/95601815:37
clarkbI've got local updates to apply but will be back shortly15:42
opendevreviewMauricio Harley proposed opendev/irc-meetings master: Changing Barbican meeting frequency to biweekly.  https://review.opendev.org/c/opendev/irc-meetings/+/95601815:48
clarkblooks like the most recent image builds had trouble finding debootstrap?15:51
clarkbwe're building on ubuntu, so it's unlikely that updates due to the impending trixie release are to blame. I wonder what could cause that?15:52
fungifinding the package or the command?15:52
clarkbthe package during dib dependency installation15:52
clarkbhttps://zuul.opendev.org/t/opendev/build/f18e2b9380e0492e85c2ee6f3636634c/console#1/0/1/ubuntu-noble15:52
clarkblooks like there may have also been an ssh connectivity blip to raxflex sjc1 I wonder if the underlying cause is the same in both cases15:54
fungipuc claims debootstrap 1.0.134ubuntu1 is in noble15:54
clarkbya seems to have affected raxflex sjc315:54
clarkbI suspect the problem is network related : https://zuul.opendev.org/t/opendev/build/e53a128884684cd19b76150442ece6aa is an ssh failure15:54
fungi"No package matching 'debootstrap' is available" is an odd way for an ssh failure to manifest15:55
fungii guess that can be chalked up to misleading error handling in ansible's package module?15:56
clarkbfungi: two different errors. One is ssh can't connect; the other is the debootstrap package isn't available15:56
clarkbI'm theorizing that both are due to some underlying network issue within that cloud region, as the cloud region seems consistent between them (though I did not check every single incident)15:57
fungiyeah, i mean if this were an error from apt itself i'd have a better idea as to what happened, but since the actual problem has been swallowed and hidden by an ansible module i'm going to guess it's just a random network problem getting misidentified as a missing package15:58
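A quick way to separate a genuinely missing package from a transient mirror or network failure is to redo the lookup by hand on an affected node; a minimal sketch, assuming shell access to the build node:

    # Refresh the package index, then ask apt what it knows about debootstrap.
    # If the index refresh failed against the mirror, the policy query reports
    # the package as unknown, which is consistent with the "No package matching
    # 'debootstrap' is available" error surfaced by Ansible's package module.
    $ sudo apt-get update
    $ apt-cache policy debootstrap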
opendevreviewMerged opendev/irc-meetings master: Changing Barbican meeting frequency to biweekly.  https://review.opendev.org/c/opendev/irc-meetings/+/95601816:03
*** qwebirc22566 is now known as dpanech116:13
dpanech1Hi, this review hasn't merged for nearly 3hrs, could someone check? Thanks. https://review.opendev.org/c/starlingx/metal/+/95596816:16
clarkbdpanech1: that change was approved before its depends-on merged. If the depends-on and that change don't share a zuul queue then zuul won't auto enqueue it to the gate16:23
clarkbyou can reapprove the change and it should enter the gate now that its depends-on has merged16:23
fungigood eye, i was still looking at it and hadn't spotted the timing there16:24
fungii wonder if zuul could learn to reprocess changes in that situation, though i suppose that would present a caching challenge16:26
clarkbya I think the easy solution would be to configure zuul to share a queue amongst related projects16:26
clarkbIn theory you want to do this anyway so that they are cogating16:27
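For reference, sharing a queue is a small amount of Zuul project configuration; a hypothetical sketch (the queue name and file contents below are illustrative, not the actual starlingx config):

    # In each related project's .zuul.yaml, reference the same named queue so
    # approved changes connected by Depends-On gate together.
    $ cat .zuul.yaml
    - queue:
        name: starlingx
    - project:
        queue: starlingx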
dpanechclarkb: when you say "reapprove", what do you mean ?16:28
clarkbdpanech: leave a new comment on the gerrit change that +1 approves the change16:29
clarkbdpanech: you left a Code-Review +1 (lgtm) vote, but I think it needs to be a Workflow +1 (approved) vote16:33
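For completeness, the vote can also be reapplied without the web UI via Gerrit's ssh interface; a minimal sketch, assuming an account with Workflow voting rights (the username and patchset number are illustrative):

    # Re-add the Workflow +1 approval on the change so zuul enqueues it to gate.
    $ ssh -p 29418 someuser@review.opendev.org gerrit review --label Workflow=+1 955968,2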
clarkbcorvus: I've been asked to put together openinfra foundation newsletter content for opendev and zuul. I think the most notable thing that has happened in opendev recently is dropping nodepool. Would you prefer we leave that out of the newsletter for now to avoid people trying it before we consider niz ready?16:38
corvusyes please16:39
clarkbok in that case maybe I can talk about some of the dib improvements instead.16:39
clarkbOpen to other ideas as well. Maybe mention the matrix comms spec?16:39
corvusclarkb: i think if you word it carefully, you could probably talk about the new location for configuring image builds, and how it's more accessible.  it's a fine line to walk.  :)16:41
clarkback let me pull up an etherpad and work on a draft16:41
clarkbcorvus: infra-root drafting here https://etherpad.opendev.org/p/_MU8qz8RHgiUv4yP1jNY16:47
clarkbmaybe something simple and mostly just relevant facts like that is the way to go16:47
fungiyeah, i think we can probably talk about some of the upshot of niz in the opendev section without advertising the zuul feature yet16:48
clarkbthen I'm also open to ideas on any content we want to capture for zuul too16:48
clarkbbut we can discuss that in the zuul matrix room16:49
fungifrom the opendev side some people have wondered where image build logs, image downloads and the api for image status have gone16:49
clarkback I've added a link to the zuul dashboard images tab in the openstack tenant to my draft document above too16:53
fungithanks! i'll take a look when i'm done babysitting plumbers16:53
dpanechclarkb: I'm not a core reviewer. Can the user who had added the W+1 remove and re-add it? Will that work?17:05
clarkbdpanech: yes17:06
clarkbdpanech: you might also be able to recheck the change17:06
clarkbI think that may work too17:06
clarkb(and that is something anyone can do as long as the votes are already in place)17:06
fungijust note that in the particular zuul tenant those projects use, a recheck will cause the change to go back into the check pipeline first and then into the gate once it passes there, so it would result in additional testing and a longer delay than just removing and re-adding the workflow +1 or another core reviewer adding a new workflow +117:08
clarkbto follow up on the executor disk utilization: we seem to peak around 30% during periodic jobs, so I think the ephemeral repurposing should leave plenty of room for now17:10
dpanechclarkb: ok thanks, we'll try that17:11
clarkbit merged17:24
clarkbcurrent draft content for the newsletter has been provided to the foundation18:02
clarkbI made some small organization edits but the content is basically the same18:02
clarkb(I combined opendev list items 1 and 2 together then did the same for 3 and 4)18:03
fungilgtm, thanks!18:06
clarkbhttps://zuul.opendev.org/t/openstack/nodeset-requests this view is really neat18:38
clarkbcorvus: ^ idea the zuul status page could have a link for queued items waiting on requests to that page (maybe with a deep link to the specific request in the list on that page?)18:38
fungimy personal fave for the new dashboard views is the images list, how you can drill down to the actual build result pages to find the build logs and downloadable images as artifacts18:40
corvusclarkb: yep i'd like to get there; unclear whether it should go to that page and we highlight the row, or go to a dedicated page for the request (that doesn't exist yet).  we can also link the other direction.18:45
fricklerwhat's wrong with those 2d old tacker periodic-weekly buildsets, did something get stuck there?18:47
clarkbthey are for periodic weekly jobs that appear to still be queued up19:23
clarkband all of them are grabbed by ovh-bhs1-main19:23
clarkbthe timing from 2 days ago makes me wonder about our reboots and upgrades; maybe there is some shutdown race or something.19:24
clarkbit wouldn't surprise me if there is something odd about the jobs too. The tacker jobs continue to run devstack in pre-run but are sensitive to changes in tacker19:25
clarkbI'm working on lunch but next on my todo list is the meeting agenda. Is there anything to add/remove/edit on the agenda?19:31
clarkbI guess niz updates19:32
clarkband maybe a recap of the gitea-lb stuff19:32
clarkbcorvus: did the niz change land that forces nodes into the same provider unless it is impossible to fulfill the request that way?20:09
clarkblooks like yes, so we should be operating with that in place at this point. Just want to make note of that in the agenda so that if we still see mixed nodesets we can call them out20:18
clarkblooking at the tacker requests they are all for ubuntu-noble which I think is a valid request still? But I notice that most of the other requests being made are for the modern niz labels20:24
clarkbCould it be something to do with our config for the ubuntu-noble label being unsatisfiable somehow?20:24
clarkblooks like there are failures for 0821f4e2251c407395a0b3766de7722d in ovh bhs1 due to quota being exceeded20:27
clarkboh ya that cloud is completely cooked20:27
clarkbhttps://grafana.opendev.org/d/cde4271d48/zuul-launcher3a-ovh?orgId=1&from=now-24h&to=now&timezone=utc&var-region=$__all20:27
clarkbcorvus: will setting the max-servers limit on that provider to 0 update existing requests there?20:27
clarkbdoing a server list against that cloud does seem to show a large number of active nodes. At first I thought maybe these got leaked from the nodepool wind-down, but the hostnames appear to use the new-style non-consecutive identifiers20:29
clarkbcorvus: this smells like a zuul launcher bug to me. In particular all of the nodes are marked ready but not being reused by new requests. I think we identified this as a risk of the nodeset "clamping" but the 8 hour ready node timeout was expected to clear things out20:32
clarkbseems like maybe that isn't happening (the timestamps indicate the nodes are ~2 days old)20:32
clarkbI think rax-sjc3 may be entering a similar state but isn't quite using all of its quota yet, so things semi-work there20:32
clarkbtl;dr is we have ~125 active ready nodes in ovh-bhs1 consuming the quota and preventing new boots. They are at least 2 days old so aged out past the timeout. Ideally they either get recycled into new requests or aged out to free up quota20:35
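The leaked-looking capacity can be confirmed directly against the cloud; a rough sketch, assuming a clouds.yaml entry for the region (the --os-cloud name is illustrative):

    # List what is still ACTIVE in the region, then spot-check the age of a node
    # to see whether it predates the 8 hour ready-node timeout.
    $ openstack --os-cloud ovh-bhs1 server list --status ACTIVE
    $ openstack --os-cloud ovh-bhs1 server show <server-uuid> -c name -c created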
corvusready nodes for ubuntu-noble should just get used by single-node nodeset requests, so that's likely a bug20:50
corvussetting a 0 limit should (might?) cause requests to bounce to another provider because it would be impossible to fulfill20:51
corvusnot positive about that20:51
corvusclarkb: if it's operationally urgent (ie, can't wait for debug and fix), you can just delete the ready nodes20:51
corvusi'm not able to look at that right now, but would be happy to later20:52
clarkbcorvus: I don't think it is urgent. Things are quiet in zuul right now21:04
clarkbdeletion would be via the web ui as an authenticated admin?21:04
corvusyes or client/api21:04
clarkb(mostly asking for the info I think for now I'll leave it as is)21:04
clarkbI've just tested reindexing changes on gerrit 3.11 using the held 3.11 test node. That worked fine, but there are only a handful of simple changes in that installation, so it isn't a super exhaustive test; still, it's good to see that reindexing works in general21:11
clarkbI was worried that the held nodes were old enough to have gotten cleaned up as part of the nodepool cleanup, but they are niz nodes so no problem there21:11
clarkbI didn't want to dig into actual upgrade testing if that was going to fail, but it doesn't so I should start trying to do the testing now21:12
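One common way to run the kind of reindex exercised on the held node is Gerrit's offline reindex program; a minimal sketch, assuming a plain (non-container) test site at an illustrative path:

    # Rebuild the changes index against the test site; other indexes can be
    # included by dropping the --index flag.
    $ java -jar /home/gerrit/site/bin/gerrit.war reindex -d /home/gerrit/site --index changes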
opendevreviewJames E. Blair proposed opendev/system-config master: Mirror python2.7 images to quay  https://review.opendev.org/c/opendev/system-config/+/95605622:47
clarkbfungi: looks like the backup server is still not reachable? I'm thinking maybe tomorrow we go ahead and ask nova to reboot it for us?22:57
clarkbI'll put that on the agenda for the meeting then get the agenda sent out22:57
opendevreviewMerged opendev/system-config master: Mirror python2.7 images to quay  https://review.opendev.org/c/opendev/system-config/+/95605623:05
fungiclarkb: yeah, i haven't seen any of the vexxhost crew pop in to say they were looking at it either23:34
