Friday, 2025-08-29

opendevreviewOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/95799502:08
*** mrunge_ is now known as mrunge05:39
dtantsurHey folks, quick question: do we already have CS10 nodes?11:04
fricklerdtantsur: https://zuul.opendev.org/t/openstack/labels says yes11:08
dtantsurfrickler: nice, thanks! Do you remember if we still need to limit the node provider to use it? I recall that some months ago not all hosters supported the necessary CPU level.11:09
mnasiadkadtantsur: no, they are targeted only at the providers that have the required HW12:24
dtantsurgreat, thank you12:29
Clark[m]dtantsur: to clarify, centos 10 can run on only about 50% of our capacity. But you don't need to do anything special on your side to limit the distro to those providers, it's done for you13:28
Clark[m]So yes it's limited, but it's not a detail you need to explicitly add to your job configuration 13:29
mnasiadkahrm, reno job just post_failed in gate - https://zuul.opendev.org/t/openstack/build/c1bafd4d7e47403b8e41eefc69123a8c13:29
mnasiadkaIs there any chance somebody can enqueue https://review.opendev.org/c/openstack/kolla/+/948520 into the gate again - or do I need to go through the check+gate cycle? ;-)13:30
opendevreviewTakashi Kajinami proposed openstack/diskimage-builder master: Drop remaining reference to TripleO  https://review.opendev.org/c/openstack/diskimage-builder/+/95888413:34
Clark[m]mnasiadka post failures without logs generally indicate log upload failed. If that is a persistent common issue someone (probably me) will need to look into disabling whichever backend is unhappy.13:36
Clark[m]The necessary info is found in zuul executor debug logs 13:36
Clark[m]I don't see a sea of orange on the zuul status page so that is a good sign13:37
Clark[m]mnasiadka I would recheck it now and if things are still not happy I can look at intervention in a couple hours when I'm able to sit down and actually do that13:43
corvusClark: ze11 is still ungood.  haven't finished the clone yet, but can see it looks like yesterday.13:49
mnasiadkaClark[m]: thanks, rechecking13:53
mnasiadkaClark[m]: I thought I'd seen support for failing over log uploads to another swift/s3 endpoint, but maybe that's not that simple13:54
fungimnasiadka: if you think it's persistent for that one change/job, you can keep a tab with the console stream for it open and you might see some hints that are otherwise inaccessible due to a post-run build or upload failure13:58
mnasiadkafungi: thanks, I'm probably not THAT persistent, especially as it's 4pm on Friday :)13:59
corvusmnasiadka: there was a change to zuul-jobs to do failover, but it isn't compatible with the randomization we use in opendev.  i would be open to a change that supported both -- but i think we need to discuss it with the rest of the opendev team, since if we lose a cloud we still want to know about it.13:59
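
A minimal sketch of how failover could coexist with the randomization corvus mentions, assuming hypothetical helper names rather than the real zuul-jobs role interface: shuffle the candidate endpoints first to keep spreading load, then fall back through the remaining ones, recording every failure so a lost cloud still shows up.

import random

def upload_logs(endpoints, upload_func):
    """Try endpoints in random order, falling back to the next on failure.

    endpoints: opaque endpoint descriptors (e.g. swift/s3 configs).
    upload_func: callable(endpoint) that raises on failure.
    Both are hypothetical stand-ins, not the zuul-jobs role interface.
    """
    candidates = list(endpoints)
    random.shuffle(candidates)  # keep the load-spreading randomization
    failures = []
    for endpoint in candidates:
        try:
            return upload_func(endpoint)
        except Exception as exc:
            # Record the failure so a lost cloud is still visible, even
            # though the build itself no longer post-fails.
            failures.append((endpoint, exc))
    raise RuntimeError("all log upload endpoints failed: %s" % failures)
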
corvusClark: 12:45 for the clone on ze1113:59
mnasiadkacorvus: understand14:00
mnasiadkacorvus: by the way - is there any outlook for a zuul release that includes zuul-launcher for usage outside opendev? My local Zuul instance is not really used a lot, and I would be thinking of a migration to zuul-launcher sooner rather than later14:02
corvusmnasiadka: not ready for prime time yet, still missing some important stuff and still making what would be significant user-facing breaking changes.  best guess: end of the year for early adoption, early next year for a wider push for adoption.  keyword: guess.  :)14:05
mnasiadkacorvus: fine then, I'll wait :)14:08
opendevreviewClark Boylan proposed opendev/git-review master: Drop testing on Bionic and Python36  https://review.opendev.org/c/opendev/git-review/+/95890616:45
opendevreviewClark Boylan proposed opendev/bindep master: Drop Bionic testing  https://review.opendev.org/c/opendev/bindep/+/95890716:49
opendevreviewClark Boylan proposed opendev/glean master: Drop testing on Bionic and Xenial  https://review.opendev.org/c/opendev/glean/+/95890916:55
clarkbrax flex sjc3 will do maintenance on systems that will impact our mirror cinder volume september 3 from 15:30 - 17:30 UTC16:58
clarkbwe can disable that region and shut down the mirror beforehand to avoid errors16:58
* clarkb makes a note16:58
opendevreviewClark Boylan proposed opendev/git-review master: Drop testing on Bionic and Python36  https://review.opendev.org/c/opendev/git-review/+/95890617:01
opendevreviewClark Boylan proposed opendev/glean master: Drop testing on Bionic and Xenial  https://review.opendev.org/c/opendev/glean/+/95890917:14
clarkbcorvus: nova reported that https://review.opendev.org/c/openstack/nova/+/951640 did not enqueue to the gate like they expected it to earlier today. Basically there was a recheck on August 29 that reported back Verified +1 at ~12:18 UTC August 29 and that should've created a comment-added event that zuul would pick up to re-enqueue the change to the gate. I can see on zuul01 we have18:30
clarkb2025-08-29 12:18:33,436 DEBUG zuul.GerritReporter: [e: 9b96f69b970e4daba151bbc30a1d4eda] Report change <Change 0x7c8d4e72fe10 openstack/nova 951640,3>, params {'Verified': 1}, message: Build succeeded (check pipeline). But I cannot find the corresponding comment-added event from the GerritConnection afterwards on either zuul01 or zuul0218:30
clarkbcorvus: it looks like we simply never get the event. I do see on zuul01 we have 2025-08-29 12:14:25,870 ERROR zuul.GerritConnection.ssh:   kazoo.exceptions.ConnectionLoss but that happens before the issue not after18:33
clarkbbut maybe we didn't reset things in the GerritConnection's ssh system quickly enough?18:33
clarkbok actually I see 2025-08-29 12:19:56,430 ERROR zuul.GerritConnection.ssh:   kazoo.exceptions.ConnectionLoss on zuul02 so I think this must be it18:34
clarkbthe zookeeper containers have all been up for about 2 weeks, so we didn't cause this by upgrading zk, and hosting side issues seem unlikely unless it was in the network18:35
clarkbok the traceback shows that occurred when the gerrit ssh connection tried to put the event in the event_queue in addEvent. I suspect that this particular event was the event that was getting added and we don't retry18:39
clarkbcorvus: thinking out loud here should we try except around addEvent catching kazoo.exceptions.ConnectionLoss and retrying once reconnected?18:39
clarkbI can probably work on that if we think it is a good idea. I just don't want to go down that path if there is a good reason for not already doing so18:40
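
A rough sketch of the retry clarkb is proposing; event_queue and the surrounding names are stand-ins for Zuul's actual GerritConnection internals, not its real API. The idea is to catch kazoo's ConnectionLoss when queueing the event and retry instead of dropping it.

import time

from kazoo.exceptions import ConnectionLoss

def add_event_with_retry(event_queue, event, attempts=3, delay=5.0):
    """Queue a Gerrit event, retrying if ZooKeeper drops the connection."""
    for attempt in range(attempts):
        try:
            event_queue.put(event)
            return
        except ConnectionLoss:
            if attempt == attempts - 1:
                # Give up loudly rather than dropping the event silently,
                # which is what appears to have happened to 951640.
                raise
            # Give kazoo a moment to re-establish the session before retrying.
            time.sleep(delay)
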
clarkblooks like these connection errors first started at 2025-08-29 10:39:33,515 and last occurred at 2025-08-29 17:58:51,34218:54
clarkband happened about once or twice an hour per scheduler in the time between18:55
clarkbseems we've gone almost an hour without issue so maybe whatever was causing it is no longer a problem?18:55
clarkbit's possible we need to investigate why these were occurring as well. But let's see how things go over time18:55
clarkbon zk01 I can see logs for zuul01 and zuul02 authenticating to the zk server but I don't see any disconnection complaints there18:59
*** gmaan is now known as gmaan_afk19:30
clarkbthere was one more disconnect on zuul02 at 19:26 UTC20:22
clarkbI don't see anything pointing at a clear problem in the grafana graphs either. The leader has remained consistent for example20:26
clarkbit seems like things are improving. I guess if they don't completely resolve then we can maybe bring this up with rax? I checked and we are using ipv4 addrs for zk connections from zuul to the zk cluster20:28
clarkbAnother option may be to try ipv6. I feel like we used ipv4 because it was more reliable in the past though20:29
clarkbok looking further it seems like the zuul webs running on the same hosts may not experience the same issue (I see them registering the zuul components when the scheduler reconnects). I wonder if that points at a bug in kazoo or zuul itself.20:41
clarkbLike maybe something is preventing kazoo from maintaining the connection so it goes past the timeout and then errors?20:41
clarkbhowever it seems to have started today and we haven't updated zuul yet so I don't know20:42
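
One way to narrow this down, sketched with plain kazoo as an assumption about how a standalone reproducer might look (Zuul wraps its own client): register a session state listener and log every CONNECTED/SUSPENDED/LOST transition to correlate with the scheduler's ConnectionLoss errors.

import logging

from kazoo.client import KazooClient

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("zk-state")

def state_listener(state):
    # Called by kazoo on each CONNECTED/SUSPENDED/LOST transition.
    log.info("ZooKeeper session state changed to %s", state)

# Host and port are placeholders; the real cluster would also need TLS options.
client = KazooClient(hosts="zk.example.org:2181")
client.add_listener(state_listener)
client.start()
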
clarkbI think I'm going to call it a day, but the ConnectionLoss errors are still happening, just with slightly less frequency maybe22:57
clarkbI suppose we can see if weekly updates and reboots change anything and if not dig in and maybe see if it could be something on the cloud side22:57
*** gmaan_afk is now known as gmaan23:20
