Monday, 2025-11-24

*** ykarel_ is now known as ykarel09:05
*** ykarel__ is now known as ykarel12:44
fungistill getting 85.5MB/s pulling uncompressible random content from mirror.iad3.openmetal to ze12, so whatever slowdown we observed last week seems to be gone15:39
fungineed to run some errands, will be back after lunch15:45
clarkbfungi: similarly load on static.o.o remains reasonable and we're still nowhere near the new process cap but definitely above the old one most of the time16:04
clarkbI think we can probably consider both of those items addressed at least until we notice new problems16:04
clarkbhttps://review.opendev.org/c/zuul/zuul/+/967891 is still outstanding from last week's launcher floating ip deletion debugging16:05
clarkbIt had an unrelated test case failure so I rechecked it just now16:06
clarkbI'm going to work on putting together a meeting agenda today. It should include gerrit upgrade testing and preparation updates. Launcher floating ip and image upload updates from last week. Also a look ahead into December scheduling of meetings. Let me know if there is anything else to add16:07
clarkboh also there is a gitea 1.25.2 release. Is that something another infra-root is interested in proposing an update for?16:08
clarkbthat can go on the agenda as well16:08
clarkbmy other big goal for this week is to keep chipping away at the TODOs on the gerrit upgrade document16:08
opendevreviewClark Boylan proposed opendev/system-config master: Add a Gerrit read block rule for Blocked Users  https://review.opendev.org/c/opendev/system-config/+/96822817:14
clarkbI've marked that change WIP so that we can confirm post upgrade that the change to the acls is automatically added for us by the upgrade17:17
clarkbbut I expect that rule will be added and this is a good reminder and upfront warning17:17
clarkbjitsi meet just updated docker images. We will automatically upgrade at ~0200 UTC tomorrow.17:30
clarkbthe release notes don't point at anything that concerns me. But those are release notes for the images not jitsi meet itself. It is possible that software updates introduce new unwanted behavior. Let us know if you hit problems after the upgrade17:31
fungiclarkb: for the meeting agenda, be aware that starlingx has a project rename request17:43
clarkbfungi: thanks looks like they updated the wiki with it. I guess the main question is do we do that before or after the upgrade17:45
clarkbwe test project renames with both current gerrit and gerrit next in our ci jobs so theoretically it works in both17:46
clarkbmy immediate thought is we should continue to prioritize the upgrade since we've fallen behind there17:46
fungiwfm17:47
clarkbI need to remember how I did all that openid testing before. I want to temporarily update my held 3.11 node to use openid to ensure we still function as expected when logging in with openid there17:48
clarkb(I am not super worried but it's the first major version bump since I fixed it)17:48
clarkbI'm going to take the held 3.11 node down for a short period while I update its config and test openid login17:55
corvusfungi: the timeout bump on 966200 lgtm -- i wonder if it even makes sense to have a timeout now though... maybe we should just hardcode that to a big number in the role itself, since we expect it to always finish by the time the upload is done anyway.  (we can make that change on top of yours though if we want to, no need to block)18:02
fungiseparately, are all the node_failure results for that change a sign that we just don't have sufficient capacity in our one arm64 provider?18:02
corvuslooking18:10
clarkbI checked last week on a couple of those failures and they were due to ssh key scanning timing out then18:11
corvus /var/log/zuul/launcher-debug.log.2.gz:2025-11-22 19:30:00,052 DEBUG zuul.Launcher: [e: d605a1733ef34f5fa8e576ab14480345] [req: 1bdcd21fccbe4333bd13811aeac3d840] Nodescan request failed with 0 keys, 39 initial connection attempts, 0 key connection failures, 0 key negotiation failures in 120 seconds18:11
corvusclarkb: yeah looks like that's still the case18:12
corvusso maybe a thundering herd causing boots to be slow?18:12
corvusmaybe we can try bumping boot-timeout on that provider?18:12
fungithe frequency with which builds for that change consistently hit node_failure in that provider suggest the buildset may be its own thundering herd18:13
fungithat last recheck was in the middle of the weekend when there shouldn't have been much else going on18:13
clarkbya longer timeouts may make sense especially considering that I think the arm stuff is somewhat underpowered relative to the x86 hardware we use18:14
corvusyeah, especially during times where other changes are minimal, like the weekend.18:14
corvusduring the week, openstack keeps it busy with one new job at a time, so we don't get a herd.  but once that clears out, a buildset like that can get a bunch of nodes simultaneously18:14
clarkbThe held gerrit 3.11 node has been returned to its normal login config state18:14
fungigot it, so unrelated load spreading out the nodescans probably helps keep most builds from hitting it18:15
clarkbinfra-root I've largely concluded my upfront testing of concerns pulled out of the Gerrit 3.11 release notes and my notes on https://etherpad.opendev.org/p/gerrit-upgrade-3.11 should now be up to date. There are still a couple of TODOs (one has to do with adding a step to the upgrade itself, one has to do with taking a temperature on whether or not more testing should be done with18:16
clarkbone item, and the last is in checking if tools like zuul or gertty rely on api response etag values)18:16
clarkbI think now is a reasonable time for people to look over those notes and make sure they don't see anything else concerning and also provide input on the latter two remaining TODOs18:17
clarkbThe next step in this process will be for me to start running through a test upgrade on the held 3.10 node. I may target doing that for tomorrow?18:17
corvusclarkb: neither relies on etag18:18
clarkbcorvus: excellent18:18
clarkbnotes have been updated18:19
opendevreviewClark Boylan proposed opendev/system-config master: Update gitea to 1.25.2  https://review.opendev.org/c/opendev/system-config/+/96824518:38
clarkbthat's the gitea 1.25.2 ball rolling18:38
clarkbside note: I subscribe to jitsi meet docker, gitea, etherpad-lite, and maybe a couple others i don't recall now release notifications via github18:38
clarkbit's a useful way of keeping up to date on that stuff18:38
fungii wonder how many of them are full of sha1-hulud now18:51
clarkbthat is certainly one consideration with doing a gitea update. I haven't checked them for spice contamination18:52
clarkbif we think that is prudent I can try to figure that out18:52
clarkbgitea 1.25.2 includes a pnpm-lock.yaml file. They switched to pnpm in the 1.25.0 release. In theory pnpm mitigates some of these issues by more slowly opting into newer releases, but in theory the lock file still wins so I think the process is cross check lock file against the known bad list that came out today18:57
clarkbshouldn't be too much work if I can find a relatively clean source list to iterate over and grep into the pnpm lock file18:57
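A minimal sketch of the cross-check described above, assuming the known bad list can be reduced to one `name version` or `name@version` pair per line and that the lock file uses `name@version` keys without a leading slash. The function names and the package names in the usage note are hypothetical and do not come from any published list.

```python
import re


def parse_bad_list(text):
    """Parse one 'name version' or 'name@version' pair per line (assumed format)."""
    pairs = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if " " in line:
            name, version = line.split(None, 1)
        else:
            # rpartition keeps a leading '@scope/' attached to the name.
            name, _, version = line.rpartition("@")
        pairs.add((name, version.strip()))
    return pairs


def find_matches(lock_text, bad_pairs):
    """Return the (name, version) pairs that appear as 'name@version' in the lock text."""
    hits = []
    for name, version in sorted(bad_pairs):
        pattern = re.compile(
            # Do not match inside a longer name such as '@scope/name@1.0.0'.
            r"(?<![\w@/.-])"
            + re.escape(f"{name}@{version}")
            # Do not match a longer version such as '1.0.0-beta' or '1.0.01'.
            + r"(?![\w.-])"
        )
        if pattern.search(lock_text):
            hits.append((name, version))
    return hits
```

Usage would look something like `find_matches(open("pnpm-lock.yaml").read(), parse_bad_list(open("bad-list.txt").read()))`, where both file names are placeholders. An empty result only means no exact name and version pair from the list was found in the lock file.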
fungii guess we need to care about gerrit builds too?19:21
fungioh, and mailman, not that we update it often at all19:21
clarkbfungi: yes though gerrit uses dependencies that google checks19:27
clarkbfungi: which in theory means those packages are scrutinized more than probably any others we consume19:27
clarkbI'm trying to sort out where those are listed to double check them19:29
clarkbas far as I can tell gerrit and zuul and gitea all seem fine. Only gitea had any overlap (a single lib) and the version is not one that is listed as affected19:33
*** dhill is now known as Guest3220320:49
clarkbgitea 1.25.2 screenshots lgtm. I am looking at popping out shortly for a bike ride though21:01
opendevreviewJeremy Stanley proposed opendev/zuul-providers master: Increase boot-timeout for osuosl to 10 minutes  https://review.opendev.org/c/opendev/zuul-providers/+/96825921:14
fungiclarkb: corvus: ^ as discussed earlier21:15
corvus++21:26
clarkbfungi: approved thanks23:02
opendevreviewMerged opendev/zuul-providers master: Increase boot-timeout for osuosl to 10 minutes  https://review.opendev.org/c/opendev/zuul-providers/+/96825923:02
clarkbas a heads up I'm going to restart zuul launchers, web servers and schedulers to pick up some changes that landed today23:08
clarkband when that is done I'll work on our meeting agenda23:08
clarkbthe first set of services have been restarted. Once they are up I'll do their other halves23:12
clarkbcorvus: interestingly zl01 has a STOPPED launcher and a running launcher. I assume that is because I didn't stop it gracefully so now we need to timeout or something23:13
clarkbhrm looking at the components list ze07 isn't listed and ze06 and ze08 are running different versions. I wonder if our weekly restart failed on ze07 somehow23:14
clarkbI don't anticipate that causing problems for the launchers, webs or schedulers so I'll stick to my plan for now23:15
clarkbze07 did fail with /usr/bin/apt-get upgrade --with-new-pkgs ' failed: E: Could not get lock /var/lib/dpkg/lock-frontend23:22
clarkbthe launchers, webs, and schedulers have all updated and the last stragglers are initializing now.23:23
clarkbcorvus: I suspect that the next weekly pass can take care of the mergers and executors unless there is a reason to fix them. We can even leave ze07 in the state it is in right now in order to test the case where the executor is already down (which should be handled now)23:24
clarkbI am going to see if I can have that package update step delay or retry over time since the lock is typically not going to be held for long23:24
clarkbhrm we already do have retries set up but the log on bridge says it only did one attempt so I'll need to debug further23:27
clarkboh I see it. We wait for a specific string to not be in the error message and that string appears different not23:28
clarkb*different now so we stop waiting23:28
opendevreviewClark Boylan proposed opendev/system-config master: Update zuul_reboot package upgrade error conditions  https://review.opendev.org/c/opendev/system-config/+/96827023:33
corvusclarkb: yes, stopped components can hang around for a bit, not a problem.23:33
corvusclarkb: +2 with comment23:36
corvusand i think leaving ze07 down sounds fine23:36
clarkbcorvus: ya I noted that in the commit message. I can be swayed to simply retry for up to 20 minutes and do our best in all situations23:37
clarkbcorvus: do you think you prefer that?23:37
corvusright now, i'm like 51% to 49% in favor of removing the check.  the next time the error message changes i will be 99% in favor of removing it.  :)23:38
clarkbheh ok. Why don't we just get ahead of it and fix it now then. I'll push a new patchset23:39
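The difference between the two retry strategies discussed above can be sketched roughly as follows. This is an illustration in Python, not the actual task in system-config; `retry_until`, its parameters, and the matcher in the usage note are hypothetical names introduced here.

```python
import time


def retry_until(action, timeout=20 * 60, delay=30, should_retry=None,
                clock=time.monotonic, sleep=time.sleep):
    """Run action() until it succeeds or the timeout (seconds) is used up.

    action() returns (ok, message). If should_retry is given, a failure
    whose message it rejects ends the loop immediately. If should_retry is
    None, every failure is retried until the deadline.
    Returns (succeeded, attempts).
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        ok, message = action()
        if ok:
            return True, attempts
        if should_retry is not None and not should_retry(message):
            # The failure text was not the expected one: give up early.
            return False, attempts
        if clock() + delay > deadline:
            # Another wait would run past the deadline.
            return False, attempts
        sleep(delay)
```

With a matcher such as `lambda msg: "expected wording" in msg`, a lock error whose text has changed stops after a single attempt, which is the behavior seen on ze07. With `should_retry=None` the same error is retried for the full window regardless of wording.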
opendevreviewClark Boylan proposed opendev/system-config master: Update zuul_reboot package upgrade error conditions  https://review.opendev.org/c/opendev/system-config/+/96827023:40
clarkbok I've got the meeting agenda updated with all of the updates I am aware of23:48
clarkbI'll send that out in about 15 minutes if I don't hear any new suggestions for edits23:48
