| opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/957995 | 02:08 |
|---|---|---|
| *** mrunge_ is now known as mrunge | 05:39 | |
| dtantsur | Hey folks, quick question: do we already have CS10 nodes? | 11:04 |
| frickler | dtantsur: https://zuul.opendev.org/t/openstack/labels says yes | 11:08 |
| dtantsur | frickler: nice, thanks! Do you remember if we still need to limit the node provider to use it? I recall that some months ago not all hosters supported the necessary CPU level. | 11:09 |
| mnasiadka | dtantsur: no, they are targeted only at the providers having required HW | 12:24 |
| dtantsur | great, thank you | 12:29 |
| Clark[m] | dtantsur: to clarify, CentOS 10 can run on only about 50% of our capacity. But you don't need to do anything special on your side to limit the distro to those providers; it's done for you | 13:28 |
| Clark[m] | So yes it's limited, but it's not a detail you need to explicitly add to your job configuration | 13:29 |
| mnasiadka | hrm, reno job just post_failed in gate - https://zuul.opendev.org/t/openstack/build/c1bafd4d7e47403b8e41eefc69123a8c | 13:29 |
| mnasiadka | Could somebody enqueue https://review.opendev.org/c/openstack/kolla/+/948520 into the gate again, or do I need to go through the check+gate cycle? ;-) | 13:30 |
| opendevreview | Takashi Kajinami proposed openstack/diskimage-builder master: Drop remaining reference to TripleO https://review.opendev.org/c/openstack/diskimage-builder/+/958884 | 13:34 |
| Clark[m] | mnasiadka post failures without logs generally indicate log upload failed. If that is a persistent common issue someone (probably me) will need to look into disabling whichever backend is unhappy. | 13:36 |
| Clark[m] | The necessary info is found in zuul executor debug logs | 13:36 |
| Clark[m] | I don't see a sea of orange on the zuul status page so that is a good sign | 13:37 |
| Clark[m] | mnasiadka I would recheck it now and if things are still not happy I can look at intervention in a couple hours when I'm able to sit down and actually do that | 13:43 |
| corvus | Clark: ze11 is still ungood. haven't finished the clone yet, but can see it looks like yesterday. | 13:49 |
| mnasiadka | Clark[m]: thanks, rechecking | 13:53 |
| mnasiadka | Clark[m]: I thought I'd seen support for failing over log uploads to another swift/s3 endpoint, but maybe that's not that simple | 13:54 |
| fungi | mnasiadka: if you think it's persistent for that one change/job, you can keep a tab with the console stream for it open and you might see some hints that are otherwise inaccessible due to a post-run build or upload failure | 13:58 |
| mnasiadka | fungi: thanks, I'm probably not THAT persistent, especially as it's 4pm on Friday :) | 13:59 |
| corvus | mnasiadka: there was a change to zuul-jobs to do failover, but it isn't compatible with the randomization we use in opendev. i would be open to a change that supported both -- but i think we need to discuss it with the rest of the opendev team since if we lose a cloud we still want to know about it. | 13:59 |
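A rough sketch of the idea under discussion (the real zuul-jobs upload roles are Ansible; the helper and endpoint names below are hypothetical and Python is used purely for illustration): randomize the endpoint order as opendev does today, fail over to the remaining endpoints on error, and still surface which backends failed so a broken cloud does not go unnoticed.

```python
import random


def upload_with_failover(endpoints, upload, rng=random):
    """Try log storage endpoints in random order until one succeeds.

    `endpoints` is a list of endpoint identifiers and `upload` is a callable
    that raises on failure; both are hypothetical stand-ins, not zuul-jobs
    APIs. Returns the endpoint that worked plus any failures seen on the way.
    """
    order = list(endpoints)
    rng.shuffle(order)  # keep the randomization used for load spreading
    failures = {}
    for endpoint in order:
        try:
            upload(endpoint)
            return endpoint, failures  # failures are reported even on success
        except Exception as exc:  # illustration only; real code would be narrower
            failures[endpoint] = exc
    raise RuntimeError("all log upload endpoints failed: %s" % failures)
```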
| corvus | Clark: 12:45 for the clone on ze11 | 13:59 |
| mnasiadka | corvus: understand | 14:00 |
| mnasiadka | corvus: by the way - is there any outlook for a zuul release that includes zuul-launcher for usage outside opendev? My local Zuul instance is not really used a lot, and I would be thinking of migrating to zuul-launcher sooner rather than later | 14:02 |
| corvus | mnasiadka: not ready for prime time yet, still missing some important stuff and still making what would be significant user-facing breaking changes. best guess: end of the year for early adoption, early next year for a wider push for adoption. keyword: guess. :) | 14:05 |
| mnasiadka | corvus: fine then, I'll wait :) | 14:08 |
| opendevreview | Clark Boylan proposed opendev/git-review master: Drop testing on Bionic and Python36 https://review.opendev.org/c/opendev/git-review/+/958906 | 16:45 |
| opendevreview | Clark Boylan proposed opendev/bindep master: Drop Bionic testing https://review.opendev.org/c/opendev/bindep/+/958907 | 16:49 |
| opendevreview | Clark Boylan proposed opendev/glean master: Drop testing on Bionic and Xenial https://review.opendev.org/c/opendev/glean/+/958909 | 16:55 |
| clarkb | rax flex sjc3 will do maintenance on systems that will impact our mirror cinder volume on September 3 from 15:30-17:30 UTC | 16:58 |
| clarkb | we can disable that region and shut down the mirror beforehand to avoid errors | 16:58 |
| * clarkb makes a note | 16:58 | |
| opendevreview | Clark Boylan proposed opendev/git-review master: Drop testing on Bionic and Python36 https://review.opendev.org/c/opendev/git-review/+/958906 | 17:01 |
| opendevreview | Clark Boylan proposed opendev/glean master: Drop testing on Bionic and Xenial https://review.opendev.org/c/opendev/glean/+/958909 | 17:14 |
| clarkb | corvus: nova reported that https://review.opendev.org/c/openstack/nova/+/951640 did not enqueue to the gate like they expected it to earlier today. Basically there was a recheck on August 29 that reported back Verified +1 at ~12:18 UTC August 29 and that should've created a comment-added event that zuul picked up to re-enqueue the change to the gate. I can see on zuul01 we have | 18:30 |
| clarkb | 2025-08-29 12:18:33,436 DEBUG zuul.GerritReporter: [e: 9b96f69b970e4daba151bbc30a1d4eda] Report change <Change 0x7c8d4e72fe10 openstack/nova 951640,3>, params {'Verified': 1}, message: Build succeeded (check pipeline). But I cannot find the corresponding comment-added event from the GerritConnection afterwards on either zuul01 or zuul02 | 18:30 |
| clarkb | corvus: it looks like we simply never get the event. I do see on zuul01 we have 2025-08-29 12:14:25,870 ERROR zuul.GerritConnection.ssh: kazoo.exceptions.ConnectionLoss but that happens before the issue not after | 18:33 |
| clarkb | but maybe we didn't reset things in the GerritConnection's ssh system quickly enough? | 18:33 |
| clarkb | ok actually I see 2025-08-29 12:19:56,430 ERROR zuul.GerritConnection.ssh: kazoo.exceptions.ConnectionLoss on zuul02 so I think this must be it | 18:34 |
| clarkb | the zookeeper containers have all been up for about two weeks, so we didn't cause this by upgrading zk, and hosting side issues seem unlikely unless it was in the network | 18:35 |
| clarkb | ok the traceback shows that occurred when the gerrit ssh connection tried to put the event in the event_queue in addEvent. I suspect that this particular event was the one being added, and we don't retry | 18:39 |
| clarkb | corvus: thinking out loud here should we try except around addEvent catching kazoo.exceptions.ConnectionLoss and retrying once reconnected? | 18:39 |
| clarkb | I can probably work on that if we think it is a good idea. I just don't want to go down that path if there is a good reason for not already doing so | 18:40 |
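A minimal sketch of the retry idea being floated here (the queue object, names, and retry policy are hypothetical, not Zuul's actual addEvent code; kazoo.exceptions.ConnectionLoss is the real kazoo exception):

```python
import time

from kazoo.exceptions import ConnectionLoss


def add_event_with_retry(event_queue, event, retries=3, delay=5):
    """Put an event on the ZooKeeper-backed queue, retrying on ConnectionLoss.

    `event_queue.put` stands in for whatever addEvent ultimately calls; a real
    fix would likely wait for the kazoo client to reconnect rather than sleep.
    """
    for attempt in range(1, retries + 1):
        try:
            event_queue.put(event)
            return
        except ConnectionLoss:
            if attempt == retries:
                raise  # surface the error after the last attempt
            time.sleep(delay)  # crude wait for the connection to come back
```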
| clarkb | looks like these connection errors first started at 2025-08-29 10:39:33,515 and last occurred at 2025-08-29 17:58:51,342 | 18:54 |
| clarkb | and happened about once or twice an hour per scheduler in the time between | 18:55 |
| clarkb | seems we've gone almost an hour without issue so maybe whatever was causing it is no longer a problem? | 18:55 |
| clarkb | it's possible we need to investigate why these were occurring as well. But let's see how things go over time | 18:55 |
| clarkb | on zk01 I can see logs for zuul01 and zuul02 authenticating to the zk server but I don't see any disconnection complaints there | 18:59 |
| *** gmaan is now known as gmaan_afk | 19:30 | |
| clarkb | there was one more disconnect on zuul02 at 19:26 UTC | 20:22 |
| clarkb | I don't see anything pointing at a clear problem in the grafana graphs either. The leader has remained consistent for example | 20:26 |
| clarkb | it seems like things are improving. I guess if they don't completely resolve then we can bring this up with rax maybe? I checked and we are using ipv4 addrs for zk connections from zuul to the zk cluster | 20:28 |
| clarkb | Another option may be to try ipv6. I feel like we used ipv4 because it was more reliable in the past though | 20:29 |
| clarkb | ok looking further it seems like the zuul webs running on the same hosts may not experience the same issue (I see them registering the zuul components when the scheduler reconnects). I wonder if that points at a bug in kazoo or zuul itself. | 20:41 |
| clarkb | Like maybe something is preventing kazoo from maintaining the connection so it goes past the timeout and then errors? | 20:41 |
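One hedged way to get more visibility into whether kazoo itself sees the connection drop before the timeout would be a connection state listener; add_listener and KazooState are kazoo's real API, but the host and logging below are placeholders:

```python
import logging

from kazoo.client import KazooClient, KazooState

log = logging.getLogger("zk-state")


def state_listener(state):
    # Called by kazoo on every connection state change: SUSPENDED means the
    # TCP connection dropped, LOST means the ZooKeeper session expired.
    if state == KazooState.LOST:
        log.warning("ZooKeeper session lost")
    elif state == KazooState.SUSPENDED:
        log.warning("ZooKeeper connection suspended")
    else:
        log.info("ZooKeeper connection (re)established")


zk = KazooClient(hosts="zk01.example.org:2181")  # placeholder host
zk.add_listener(state_listener)
zk.start()
```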
| clarkb | however it seems to have started today and we haven't updated zuul yet so I don't know | 20:42 |
| clarkb | I think I'm going to call it a day, but the ConnectionLoss errors are still happening, just with slightly less frequency maybe | 22:57 |
| clarkb | I suppose we can see if weekly updates and reboots change anything and if not dig in and maybe see if it could be something on the cloud side | 22:57 |
| *** gmaan_afk is now known as gmaan | 23:20 | |