Monday, 2025-11-24

*** ykarel_ is now known as ykarel		09:05
*** ykarel__ is now known as ykarel		12:44
fungi	still getting 85.5MB/s pulling uncompressible random content from mirror.iad3.openmetal to ze12, so whatever slowdown we observed last week seems to be gone	15:39
fungi	need to run some errands, will be back after lunch	15:45
clarkb	fungi: similarly load on static.o.o remains reasonable and we're still nowhere near the new process cap but definitely above the old one most of the time	16:04
clarkb	I think we can probably consider both of those items addressed at least until we notice new problems	16:04
clarkb	https://review.opendev.org/c/zuul/zuul/+/967891 is still outstanding from last week's launcher floating ip deletion debugging	16:05
clarkb	It had an unrelated test case failure so I rechecked it just now	16:06
clarkb	I'm going to work on putting together a meeting agenda today. It should include gerrit upgrade testing and preparation updates, launcher floating ip and image upload updates from last week, and a look ahead into december scheduling of meetings. Let me know if there is anything else to add	16:07
clarkb	oh also there is a gitea 1.25.2 release. Is that something another infra-root is interested in proposing an update for?	16:08
clarkb	that can go on the agenda as well	16:08
clarkb	my other big goal for this week is to keep chipping away at the TODOs on the gerrit upgrade document	16:08
opendevreview	Clark Boylan proposed opendev/system-config master: Add a Gerrit read block rule for Blocked Users https://review.opendev.org/c/opendev/system-config/+/968228	17:14
clarkb	I've marked that change WIP so that we can confirm post upgrade that the change to the acls is automatically added for us by the upgrade	17:17
clarkb	but I expect that rule will be added and this is a good reminder and upfront warning	17:17
clarkb	jitsi meet just updated docker images. We will automatically upgrade at ~0200 UTC tomorrow.	17:30
clarkb	the release notes don't point at anything that concerns me. But those are release notes for the images not jitsi meet itself. It is possible that software updates introduce new unwanted behavior. Let us know if you hit problems after the upgrade	17:31
fungi	clarkb: for the meeting agenda, be aware that starlingx has a project rename request	17:43
clarkb	fungi: thanks looks like they updated the wiki with it. I guess the main question is do we do that before or after the upgrade	17:45
clarkb	we test project renames with both current gerrit and gerrit next in our ci jobs so theoretically it works in both	17:46
clarkb	my immediate thought is we should continue to prioritize the upgrade since we've fallen behind there	17:46
fungi	wfm	17:47
clarkb	I need to remember how I did all that openid testing before. I want to temporarily update my held 3.11 node to use openid to ensure we still function as expected when logging in with openid there	17:48
clarkb	(I am not super worried but it's the first major version bump since I fixed it)	17:48
clarkb	I'm going to take the held 3.11 node down for a short period while I update its config and test openid login	17:55
corvus	fungi: the timeout bump on 966200 lgtm -- i wonder if it even makes sense to have a timeout now though... maybe we should just hardcode that to a big number in the role itself, since we expect it to always finish by the time the upload is done anyway. (we can make that change on top of yours though if we want to, no need to block)	18:02
fungi	separately, are all the node_failure results for that change a sign that we just don't have sufficient capacity in our one arm64 provider?	18:02
corvus	looking	18:10
clarkb	I checked last week on a couple of those failures and they were due to ssh key scanning timing out then	18:11
corvus	/var/log/zuul/launcher-debug.log.2.gz:2025-11-22 19:30:00,052 DEBUG zuul.Launcher: [e: d605a1733ef34f5fa8e576ab14480345] [req: 1bdcd21fccbe4333bd13811aeac3d840] Nodescan request failed with 0 keys, 39 initial connection attempts, 0 key connection failures, 0 key negotiation failures in 120 seconds	18:11
corvus	clarkb: yeah looks like that's still the case	18:12
corvus	so maybe a thundering herd causing boots to be slow?	18:12
corvus	maybe we can try bumping boot-timeout on that provider?	18:12
fungi	the frequency with which builds for that change consistently hit node_failure in that provider suggests the buildset may be its own thundering herd	18:13
fungi	that last recheck was in the middle of the weekend when there shouldn't have been much else going on	18:13
clarkb	ya longer timeouts may make sense especially considering that I think the arm stuff is somewhat underpowered relative to the x86 hardware we use	18:14
corvus	yeah, especially during times where other changes are minimal, like the weekend.	18:14
corvus	during the week, openstack keeps it busy with one new job at a time, so we don't get a herd. but once that clears out, a buildset like that can get a bunch of nodes simultaneously	18:14
clarkb	The held gerrit 3.11 node has been returned to its normal login config state	18:14
fungi	got it, so unrelated load spreading out the nodescans probably helps keep most builds from hitting it	18:15
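
For tallying failures like the one corvus pasted, the nodescan message can be pulled apart with a regular expression. The following is a minimal Python sketch; the pattern is derived only from the single log line quoted above, and the function names are hypothetical rather than part of Zuul. Reading "zero keys and zero key failures" as a sign of a slow boot is an assumption consistent with the discussion, not something the log states.

```python
import re

# Pattern based solely on the launcher-debug.log line quoted in the channel.
NODESCAN_RE = re.compile(
    r"\[req: (?P<req>[0-9a-f]+)\] Nodescan request failed with "
    r"(?P<keys>\d+) keys, "
    r"(?P<attempts>\d+) initial connection attempts, "
    r"(?P<conn_failures>\d+) key connection failures, "
    r"(?P<neg_failures>\d+) key negotiation failures "
    r"in (?P<seconds>\d+) seconds"
)


def parse_nodescan_failure(line):
    """Return the counters from a nodescan failure line, or None if no match."""
    match = NODESCAN_RE.search(line)
    if match is None:
        return None
    info = {k: int(v) for k, v in match.groupdict().items() if k != "req"}
    info["req"] = match.group("req")
    return info


def looks_like_slow_boot(info):
    """Heuristic (an assumption): no keys and no key-level failures at all
    suggests sshd was never reached before the scan timed out."""
    return (
        info["keys"] == 0
        and info["conn_failures"] == 0
        and info["neg_failures"] == 0
    )
```

Run over the rotated launcher logs, this would show whether the failures cluster around moments when a single buildset requested many nodes at once.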
clarkb	infra-root I've largely concluded my upfront testing of concerns pulled out of the Gerrit 3.11 release notes and my notes on https://etherpad.opendev.org/p/gerrit-upgrade-3.11 should now be up to date. There are still a couple of TODOs (one has to do with adding a step to the upgrade itself, one has to do with taking a temperature on whether or not more testing should be done with	18:16
clarkb	one item, and the last is in checking if tools like zuul or gertty rely on api response etag values)	18:16
clarkb	I think now is a reasonable time for people to look over those notes and make sure they don't see anything else concerning and also provide input on the latter two remaining TODOs	18:17
clarkb	The next step in this process will be for me to start running through a test upgrade on the held 3.10 node. I may target doing that for tomorrow?	18:17
corvus	clarkb: neither relies on etag	18:18
clarkb	corvus: excellent	18:18
clarkb	notes have been updated	18:19
opendevreview	Clark Boylan proposed opendev/system-config master: Update gitea to 1.25.2 https://review.opendev.org/c/opendev/system-config/+/968245	18:38
clarkb	that's the gitea 1.25.2 ball rolling	18:38
clarkb	side note: I subscribe to release notifications via github for jitsi meet docker, gitea, etherpad-lite, and maybe a couple others I don't recall now	18:38
clarkb	it's a useful way of keeping up to date on that stuff	18:38
fungi	i wonder how many of them are full of sha1-hulud now	18:51
clarkb	that is certainly one consideration with doing a gitea update. I haven't checked them for spice contamination	18:52
clarkb	if we think that is prudent I can try to figure that out	18:52
clarkb	gitea 1.25.2 includes a pnpm-lock.yaml file. They switched to pnpm in the 1.25.0 release. In theory pnpm mitigates some of these issues by more slowly opting into newer releases, but the lock file still wins so I think the process is to cross-check the lock file against the known bad list that came out today	18:57
clarkb	shouldn't be too much work if I can find a relatively clean source list to iterate over and grep into the pnpm lock file	18:57
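
The cross-check described above could look roughly like the following Python sketch. It assumes the known-bad list is available as plain text containing `name@version` entries and that the lock file records resolved packages as `name@version` keys. Both formats and all function names here are illustrative assumptions, not taken from any published tool.

```python
import re

# Matches "name@1.2.3" and "@scope/name@1.2.3" (optionally with a
# "-prerelease" suffix). Assumes lowercase npm-style package names.
PKG_RE = re.compile(
    r"(@[a-z0-9._~-]+/[a-z0-9._~-]+|[a-z0-9._~-]+)"
    r"@(\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?)"
)


def parse_pairs(text):
    """Collect every (name, version) pair found anywhere in the text."""
    return set(PKG_RE.findall(text))


def find_compromised(lock_text, bad_list_text):
    """Return pairs that appear in both the lock file and the bad list."""
    return sorted(parse_pairs(lock_text) & parse_pairs(bad_list_text))


def name_only_overlap(lock_text, bad_list_text):
    """Return names present in both inputs but never at a listed bad version."""
    locked = parse_pairs(lock_text)
    bad = parse_pairs(bad_list_text)
    exact_names = {name for name, _ in locked & bad}
    shared = {n for n, _ in locked} & {n for n, _ in bad}
    return sorted(shared - exact_names)
```

An empty result from `find_compromised` means no locked version is on the list. Anything returned by `name_only_overlap` is a package that shares a name with an affected one but is pinned to a different version, which is worth a manual look.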
fungi	i guess we need to care about gerrit builds too?	19:21
fungi	oh, and mailman, not that we update it often at all	19:21
clarkb	fungi: yes though gerrit uses dependencies that google checks	19:27
clarkb	fungi: which in theory means those packages are scrutinized more than probably any others we consume	19:27
clarkb	I'm trying to sort out where those are listed to double check them	19:29
clarkb	as far as I can tell gerrit and zuul and gitea all seem fine. Only gitea had any overlap (a single lib) and the version is not one that is listed as affected	19:33
*** dhill is now known as Guest32203		20:49
clarkb	gitea 1.25.2 screenshots lgtm. I am looking at popping out shortly for a bike ride though	21:01
opendevreview	Jeremy Stanley proposed opendev/zuul-providers master: Increase boot-timeout for osuosl to 10 minutes https://review.opendev.org/c/opendev/zuul-providers/+/968259	21:14
fungi	clarkb: corvus: ^ as discussed earlier	21:15
corvus	++	21:26
clarkb	fungi: approved thanks	23:02
opendevreview	Merged opendev/zuul-providers master: Increase boot-timeout for osuosl to 10 minutes https://review.opendev.org/c/opendev/zuul-providers/+/968259	23:02
clarkb	as a heads up I'm going to restart zuul launchers, web servers and schedulers to pick up some changes that landed today	23:08
clarkb	and when that is done I'll work on our meeting agenda	23:08
clarkb	the first set of services have been restarted. Once they are up I'll do their other halves	23:12
clarkb	corvus: interestingly zl01 has a STOPPED launcher and a running launcher. I assume that is because I didn't stop it gracefully so now we need to timeout or something	23:13
clarkb	hrm looking at the components list ze07 isn't listed and ze06 and ze08 are running different versions. I wonder if our weekly restart failed on ze07 somehow	23:14
clarkb	I don't anticipate that causing problems for the launchers, webs or schedulers so I'll stick to my plan for now	23:15
clarkb	ze07 did fail with /usr/bin/apt-get upgrade --with-new-pkgs ' failed: E: Could not get lock /var/lib/dpkg/lock-frontend	23:22
clarkb	the launchers, webs, and schedulers have all updated and the last stragglers are initializing now.	23:23
clarkb	corvus: I suspect that the next weekly pass can take care of the mergers and executors unless there is a reason to fix them. We can even leave ze07 in the state it is in right now in order to test the case where the executor is already down (which should be handled now)	23:24
clarkb	I am going to see if I can have that package update step delay or retry over time since the lock is typically not going to be held for long	23:24
clarkb	hrm we already do have retries set up but the log on bridge says it only did one attempt so I'll need to debug further	23:27
clarkb	oh I see it. We wait for a specific string to not be in the error message and that string appears different now so we stop waiting	23:28
opendevreview	Clark Boylan proposed opendev/system-config master: Update zuul_reboot package upgrade error conditions https://review.opendev.org/c/opendev/system-config/+/968270	23:33
corvus	clarkb: yes, stopped components can hang around for a bit, not a problem.	23:33
corvus	clarkb: +2 with comment	23:36
corvus	and i think leaving ze07 down sounds fine	23:36
clarkb	corvus: ya I noted that in the commit message. I can be swayed to simply retry for up to 20 minutes and do our best in all situations	23:37
clarkb	corvus: do you think you prefer that?	23:37
corvus	right now, i'm like 51% to 49% in favor of removing the check. the next time the error message changes i will be 99% in favor of removing it. :)	23:38
clarkb	heh ok. Why don't we just get ahead of it and fix it now then. I'll push a new patchset	23:39
opendevreview	Clark Boylan proposed opendev/system-config master: Update zuul_reboot package upgrade error conditions https://review.opendev.org/c/opendev/system-config/+/968270	23:40
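
The approach settled on here, retrying for up to 20 minutes without inspecting the error text, can be sketched in Python as below. This is not the Ansible change under review; the function name, the 30 second delay, and the injectable `run`, `clock`, and `sleep` parameters are assumptions made so the sketch is testable.

```python
import subprocess
import time


def run_with_retries(cmd, timeout_s=20 * 60, delay_s=30,
                     run=subprocess.run, clock=time.monotonic,
                     sleep=time.sleep):
    """Retry cmd until it exits 0 or the time budget is used up.

    The error message is deliberately not inspected, so a reworded
    "Could not get lock" message cannot end the retries early.
    Returns the number of attempts made on success.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return attempts
        # Give up if the next attempt would start after the deadline.
        if clock() + delay_s > deadline:
            raise RuntimeError(
                f"{cmd!r} still failing after {attempts} attempts: "
                f"{result.stderr.strip()}"
            )
        sleep(delay_s)
```

The trade-off is that a failure which will never clear on its own also waits out the full budget before being reported.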
clarkb	ok I've got the meeting agenda updated with all of the updates I am aware of	23:48
clarkb	I'll send that out in about 15 minutes if I don't hear any new suggestions for edits	23:48
