| opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/957995 | 02:08 |
|---|---|---|
| *** mrunge_ is now known as mrunge | 05:39 | |
| dtantsur | Hey folks, quick question: do we already have CS10 nodes? | 11:04 |
| frickler | dtantsur: https://zuul.opendev.org/t/openstack/labels says yes | 11:08 |
| dtantsur | frickler: nice, thanks! Do you remember if we still need to limit the node provider to use it? I recall that some months ago not all hosters supported the necessary CPU level. | 11:09 |
| mnasiadka | dtantsur: no, they are targeted only at the providers having required HW | 12:24 |
| dtantsur | great, thank you | 12:29 |
| Clark[m] | dtantsur: to clarify, CentOS 10 can run on only about 50% of our capacity. But you don't need to do anything special on your side to limit the distro to those providers; it's done for you | 13:28 |
| Clark[m] | So yes it's limited, but it's not a detail you need to explicitly add to your job configuration | 13:29 |
| mnasiadka | hrm, reno job just post_failed in gate - https://zuul.opendev.org/t/openstack/build/c1bafd4d7e47403b8e41eefc69123a8c | 13:29 |
| mnasiadka | Could somebody enqueue https://review.opendev.org/c/openstack/kolla/+/948520 into the gate again, or do I need to go through the check+gate cycle? ;-) | 13:30 |
| opendevreview | Takashi Kajinami proposed openstack/diskimage-builder master: Drop remaining reference to TripleO https://review.opendev.org/c/openstack/diskimage-builder/+/958884 | 13:34 |
| Clark[m] | mnasiadka post failures without logs generally indicate log upload failed. If that is a persistent common issue someone (probably me) will need to look into disabling whichever backend is unhappy. | 13:36 |
| Clark[m] | The necessary info is found in zuul executor debug logs | 13:36 |
| Clark[m] | I don't see a sea of orange on the zuul status page so that is a good sign | 13:37 |
| Clark[m] | mnasiadka I would recheck it now and if things are still not happy I can look at intervention in a couple hours when I'm able to sit down and actually do that | 13:43 |
| corvus | Clark: ze11 is still ungood. haven't finished the clone yet, but can see it looks like yesterday. | 13:49 |
| mnasiadka | Clark[m]: thanks, rechecking | 13:53 |
| mnasiadka | Clark[m]: I thought I'd seen support for failing over log uploads to another swift/s3 endpoint, but maybe that's not that simple | 13:54 |
| fungi | mnasiadka: if you think it's persistent for that one change/job, you can keep a tab with the console stream for it open and you might see some hints that are otherwise inaccessible due to a post-run build or upload failure | 13:58 |
| mnasiadka | fungi: thanks, I'm probably not THAT persistent, especially as it's 4pm on Friday :) | 13:59 |
| corvus | mnasiadka: there was a change to zuul-jobs to do failover, but it isn't compatible with the randomization we use in opendev. i would be open to a change that supported both -- but i think we need to discuss it with the rest of the opendev team since if we lose a cloud we still want to know about it. | 13:59 |
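A rough sketch of the idea under discussion (the real zuul-jobs upload roles are Ansible; the helper and endpoint names below are hypothetical and Python is used purely for illustration): randomize the endpoint order as opendev does today, fail over to the remaining endpoints on error, and still surface which backends failed so a broken cloud does not go unnoticed.

```python
import random


def upload_with_failover(endpoints, upload, rng=random):
    """Try log storage endpoints in random order until one succeeds.

    `endpoints` is a list of endpoint identifiers and `upload` is a callable
    that raises on failure; both are hypothetical stand-ins, not zuul-jobs
    APIs. Returns the endpoint that worked plus any failures seen on the way.
    """
    order = list(endpoints)
    rng.shuffle(order)  # keep the randomization used for load spreading
    failures = {}
    for endpoint in order:
        try:
            upload(endpoint)
            return endpoint, failures  # failures are reported even on success
        except Exception as exc:  # illustration only; real code would be narrower
            failures[endpoint] = exc
    raise RuntimeError("all log upload endpoints failed: %s" % failures)
```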
| corvus | Clark: 12:45 for the clone on ze11 | 13:59 |
| mnasiadka | corvus: understand | 14:00 |
| mnasiadka | corvus: by the way - is there any outlook for a zuul release that includes zuul-launcher for usage outside opendev? My local Zuul instance is not really used a lot, and I would be thinking of migrating to zuul-launcher sooner rather than later | 14:02 |
| corvus | mnasiadka: not ready for prime time yet, still missing some important stuff and still making what would be significant user-facing breaking changes. best guess: end of the year for early adoption, early next year for a wider push for adoption. keyword: guess. :) | 14:05 |
| mnasiadka | corvus: fine then, I'll wait :) | 14:08 |
| opendevreview | Clark Boylan proposed opendev/git-review master: Drop testing on Bionic and Python36 https://review.opendev.org/c/opendev/git-review/+/958906 | 16:45 |
| opendevreview | Clark Boylan proposed opendev/bindep master: Drop Bionic testing https://review.opendev.org/c/opendev/bindep/+/958907 | 16:49 |
| opendevreview | Clark Boylan proposed opendev/glean master: Drop testing on Bionic and Xenial https://review.opendev.org/c/opendev/glean/+/958909 | 16:55 |
| clarkb | rax flex sjc3 will do maintenance on systems that will impact our mirror cinder volume on September 3 from 15:30-17:30 UTC | 16:58 |
| clarkb | we can disable that region and shut down the mirror beforehand to avoid errors | 16:58 |
| * clarkb makes a note | 16:58 | |
| opendevreview | Clark Boylan proposed opendev/git-review master: Drop testing on Bionic and Python36 https://review.opendev.org/c/opendev/git-review/+/958906 | 17:01 |
| opendevreview | Clark Boylan proposed opendev/glean master: Drop testing on Bionic and Xenial https://review.opendev.org/c/opendev/glean/+/958909 | 17:14 |
| clarkb | corvus: nova reported that https://review.opendev.org/c/openstack/nova/+/951640 did not enqueue to the gate like they expected it to earlier today. Basically there was a recheck on August 29 that reported back Verified +1 at ~12:18 UTC August 29 and that should've created a comment-added event that zuul picked up to re-enqueue the change to the gate. I can see on zuul01 we have | 18:30 |
| clarkb | 2025-08-29 12:18:33,436 DEBUG zuul.GerritReporter: [e: 9b96f69b970e4daba151bbc30a1d4eda] Report change <Change 0x7c8d4e72fe10 openstack/nova 951640,3>, params {'Verified': 1}, message: Build succeeded (check pipeline). But I cannot find the corresponding comment-added event from the GerritConnection afterwards on either zuul01 or zuul02 | 18:30 |
| clarkb | corvus: it looks like we simply never get the event. I do see on zuul01 we have 2025-08-29 12:14:25,870 ERROR zuul.GerritConnection.ssh: kazoo.exceptions.ConnectionLoss but that happens before the issue not after | 18:33 |
| clarkb | but maybe we didn't reset things in the GerritConnection's ssh system quickly enough? | 18:33 |
| clarkb | ok actually I see 2025-08-29 12:19:56,430 ERROR zuul.GerritConnection.ssh: kazoo.exceptions.ConnectionLoss on zuul02 so I think this must be it | 18:34 |
| clarkb | the zookeeper containers have all been up for about two weeks, so we didn't cause this by upgrading zk, and hosting side issues seem unlikely unless it was in the network | 18:35 |
| clarkb | ok the traceback shows that occurred when the gerrit ssh connection tried to put the event in the event_queue in addEvent. I suspect that this particular event was the one being added, and we don't retry | 18:39 |
| clarkb | corvus: thinking out loud here should we try except around addEvent catching kazoo.exceptions.ConnectionLoss and retrying once reconnected? | 18:39 |
| clarkb | I can probably work on that if we think it is a good idea. I just don't want to go down that path if there is a good reason for not already doing so | 18:40 |
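A minimal sketch of the retry idea being floated here (the queue object, names, and retry policy are hypothetical, not Zuul's actual addEvent code; kazoo.exceptions.ConnectionLoss is the real kazoo exception):

```python
import time

from kazoo.exceptions import ConnectionLoss


def add_event_with_retry(event_queue, event, retries=3, delay=5):
    """Put an event on the ZooKeeper-backed queue, retrying on ConnectionLoss.

    `event_queue.put` stands in for whatever addEvent ultimately calls; a real
    fix would likely wait for the kazoo client to reconnect rather than sleep.
    """
    for attempt in range(1, retries + 1):
        try:
            event_queue.put(event)
            return
        except ConnectionLoss:
            if attempt == retries:
                raise  # surface the error after the last attempt
            time.sleep(delay)  # crude wait for the connection to come back
```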
| clarkb | looks like these connection errors first started at 2025-08-29 10:39:33,515 and last occurred at 2025-08-29 17:58:51,342 | 18:54 |
| clarkb | and happened about once or twice an hour per scheduler in the time between | 18:55 |
| clarkb | seems we've gone almost an hour without issue so maybe whatever was causing it is no longer a problem? | 18:55 |
| clarkb | it's possible we need to investigate why these were occurring as well. But let's see how things go over time | 18:55 |
| clarkb | on zk01 I can see logs for zuul01 and zuul02 authenticating to the zk server but I don't see any disconnection complaints there | 18:59 |
| *** gmaan is now known as gmaan_afk | 19:30 | |
| clarkb | there was one more disconnect on zuul02 at 19:26 UTC | 20:22 |
| clarkb | I don't see anything pointing at a clear problem in the grafana graphs either. The leader has remained consistent for example | 20:26 |
| clarkb | it seems like things are improving. I guess if they don't completely resolve then we can bring this up with rax maybe? I checked and we are using ipv4 addrs for zk connections from zuul to the zk cluster | 20:28 |
| clarkb | Another option may be to try ipv6. I feel like we used ipv4 because it was more reliable in the past though | 20:29 |
| clarkb | ok looking further it seems like the zuul webs running on the same hosts may not experience the same issue (I see them registering the zuul components when the scheduler reconnects). I wonder if that points at a bug in kazoo or zuul itself. | 20:41 |
| clarkb | Like maybe something is preventing kazoo from maintaining the connection so it goes past the timeout and then errors? | 20:41 |
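One hedged way to get more visibility into whether kazoo itself sees the connection drop before the timeout would be a connection state listener; add_listener and KazooState are kazoo's real API, but the host and logging below are placeholders:

```python
import logging

from kazoo.client import KazooClient, KazooState

log = logging.getLogger("zk-state")


def state_listener(state):
    # Called by kazoo on every connection state change: SUSPENDED means the
    # TCP connection dropped, LOST means the ZooKeeper session expired.
    if state == KazooState.LOST:
        log.warning("ZooKeeper session lost")
    elif state == KazooState.SUSPENDED:
        log.warning("ZooKeeper connection suspended")
    else:
        log.info("ZooKeeper connection (re)established")


zk = KazooClient(hosts="zk01.example.org:2181")  # placeholder host
zk.add_listener(state_listener)
zk.start()
```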
| clarkb | however it seems to have started today and we haven't updated zuul yet so I don't know | 20:42 |
| clarkb | I think I'm going to call it a day, but the ConnectionLoss errors are still happening, just with slightly less frequency maybe | 22:57 |
| clarkb | I suppose we can see if weekly updates and reboots change anything and if not dig in and maybe see if it could be something on the cloud side | 22:57 |
| *** gmaan_afk is now known as gmaan | 23:20 | |