| *** ykarel_ is now known as ykarel | 09:05 | |
| *** ykarel__ is now known as ykarel | 12:44 | |
| fungi | still getting 85.5MB/s pulling uncompressible random content from mirror.iad3.openmetal to ze12, so whatever slowdown we observed last week seems to be gone | 15:39 |
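For context, a minimal sketch of the kind of throughput check fungi describes, assuming Python with the `requests` library; the URL is a hypothetical stand-in for a large file of random data on the mirror (incompressible content keeps transparent compression from inflating the measured rate):

```python
import time

import requests

# Hypothetical URL for a large incompressible test file on the mirror.
URL = "https://mirror.iad3.openmetal.opendev.org/test-random.bin"

start = time.monotonic()
total = 0
with requests.get(URL, stream=True, timeout=30) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=1 << 20):
        total += len(chunk)
elapsed = time.monotonic() - start
print(f"{total / elapsed / 1e6:.1f} MB/s over {total / 1e6:.1f} MB")
```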
| fungi | need to run some errands, will be back after lunch | 15:45 |
| clarkb | fungi: similarly load on static.o.o remains reasonable and we're still nowhere near the new process cap but definitely above the old one most of the time | 16:04 |
| clarkb | I think we can probably consider both of those items addressed at least until we notice new problems | 16:04 |
| clarkb | https://review.opendev.org/c/zuul/zuul/+/967891 is still outstanding from last week's launcher floating ip deletion debugging | 16:05 |
| clarkb | It had an unrelated test case failure so I rechecked it just now | 16:06 |
| clarkb | I'm going to work on putting together a meeting agenda today. It should include gerrit upgrade testing and preparation updates, launcher floating ip and image upload updates from last week, and a look ahead at december meeting scheduling. Let me know if there is anything else to add | 16:07 |
| clarkb | oh also there is a gitea 1.25.2 release. Is that something another infra-root is interested in proposing an update for? | 16:08 |
| clarkb | that can go on the agenda as well | 16:08 |
| clarkb | my other big goal for this week is to keep chipping away at the TODOs on the gerrit upgrade document | 16:08 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Add a Gerrit read block rule for Blocked Users https://review.opendev.org/c/opendev/system-config/+/968228 | 17:14 |
| clarkb | I've marked that change WIP so that we can confirm post upgrade that the change to the acls is automatically added for us by the upgrade | 17:17 |
| clarkb | but I expect that rule will be added and this is a good reminder and upfront warning | 17:17 |
| clarkb | jitsi meet just updated docker images. We will automatically upgrade at ~0200 UTC tomorrow. | 17:30 |
| clarkb | the release notes don't point at anything that concerns me. But those are release notes for the images, not jitsi meet itself. It is possible that software updates introduce new unwanted behavior. Let us know if you hit problems after the upgrade | 17:31 |
| fungi | clarkb: for the meeting agenda, be aware that starlingx has a project rename request | 17:43 |
| clarkb | fungi: thanks looks like they updated the wiki with it. I guess the main question is do we do that before or after the upgrade | 17:45 |
| clarkb | we test project renames with both current gerrit and gerrit next in our ci jobs so theoretically it works in both | 17:46 |
| clarkb | my immediate thought is we should continue to prioritize the upgrade since we've fallen behind there | 17:46 |
| fungi | wfm | 17:47 |
| clarkb | I need to remember how I did all that openid testing before. I want to temporarily update my held 3.11 node to use openid to ensure we still function as expected when logging in with openid there | 17:48 |
| clarkb | (I am not super worried but it's the first major version bump since I fixed it) | 17:48 |
| clarkb | I'm going to take the held 3.11 node down for a short period while I update its config and test openid login | 17:55 |
| corvus | fungi: the timeout bump on 966200 lgtm -- i wonder if it even makes sense to have a timeout now though... maybe we should just hardcode that to a big number in the role itself, since we expect it to always finish by the time the upload is done anyway. (we can make that change on top of yours though if we want to, no need to block) | 18:02 |
| fungi | separately, are all the node_failure results for that change a sign that we just don't have sufficient capacity in our one arm64 provider? | 18:02 |
| corvus | looking | 18:10 |
| clarkb | I checked last week on a couple of those failures and they were due to ssh key scanning timing out then | 18:11 |
| corvus | /var/log/zuul/launcher-debug.log.2.gz:2025-11-22 19:30:00,052 DEBUG zuul.Launcher: [e: d605a1733ef34f5fa8e576ab14480345] [req: 1bdcd21fccbe4333bd13811aeac3d840] Nodescan request failed with 0 keys, 39 initial connection attempts, 0 key connection failures, 0 key negotiation failures in 120 seconds | 18:11 |
| corvus | clarkb: yeah looks like that's still the case | 18:12 |
| corvus | so maybe a thundering herd causing boots to be slow? | 18:12 |
| corvus | maybe we can try bumping boot-timeout on that provider? | 18:12 |
| fungi | the frequency with which builds for that change consistently hit node_failure in that provider suggests the buildset may be its own thundering herd | 18:13 |
| fungi | that last recheck was in the middle of the weekend when there shouldn't have been much else going on | 18:13 |
| clarkb | ya longer timeouts may make sense especially considering that I think the arm stuff is somewhat underpowered relative to the x86 hardware we use | 18:14 |
| corvus | yeah, especially during times where other changes are minimal, like the weekend. | 18:14 |
| corvus | during the week, openstack keeps it busy with one new job at a time, so we don't get a herd. but once that clears out, a buildset like that can get a bunch of nodes simultaneously | 18:14 |
| clarkb | The held gerrit 3.11 node has been returned to its normal login config state | 18:14 |
| fungi | got it, so unrelated load spreading out the nodescans probably helps keep most builds from hitting it | 18:15 |
| clarkb | infra-root I've largely concluded my upfront testing of concerns pulled out of the Gerrit 3.11 release notes, and my notes on https://etherpad.opendev.org/p/gerrit-upgrade-3.11 should now be up to date. There are still a couple of TODOs (one has to do with adding a step to the upgrade itself, one has to do with taking a temperature on whether or not more testing should be done with one item, and the last is in checking if tools like zuul or gertty rely on api response etag values) | 18:16 |
| clarkb | I think now is a reasonable time for people to look over those notes and make sure they don't see anything else concerning and also provide input on the latter two remaining TODOs | 18:17 |
| clarkb | The next step in this process will be for me to start running through a test upgrade on the held 3.10 node. I may target doing that for tomorrow? | 18:17 |
| corvus | clarkb: neither relies on etag | 18:18 |
| clarkb | corvus: excellent | 18:18 |
| clarkb | notes have been updated | 18:19 |
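As an aside, one way to spot-check whether a client depends on etag values is to fetch a resource twice, replaying the first response's ETag; a minimal sketch against a Gerrit REST endpoint, assuming the `requests` library and using placeholder host and change values:

```python
import requests

# Placeholder host and change number. We only inspect headers here, so
# the )]}' prefix Gerrit adds to REST response bodies can be ignored.
URL = "https://review.example.org/changes/12345"

first = requests.get(URL, timeout=30)
etag = first.headers.get("ETag")
print("server sent ETag:", etag)

if etag:
    # A client that relies on etags would send If-None-Match and
    # short-circuit on a 304 Not Modified response.
    second = requests.get(URL, headers={"If-None-Match": etag}, timeout=30)
    print("conditional request status:", second.status_code)
```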
| opendevreview | Clark Boylan proposed opendev/system-config master: Update gitea to 1.25.2 https://review.opendev.org/c/opendev/system-config/+/968245 | 18:38 |
| clarkb | that's the gitea 1.25.2 ball rolling | 18:38 |
| clarkb | side note: I subscribe to release notifications via github for jitsi meet docker, gitea, etherpad-lite, and maybe a couple others I don't recall now | 18:38 |
| clarkb | it's a useful way of keeping up to date on that stuff | 18:38 |
| fungi | i wonder how many of them are full of sha1-hulud now | 18:51 |
| clarkb | that is certainly one consideration with doing a gitea update. I haven't checked them for spice contamination | 18:52 |
| clarkb | if we think that is prudent I can try to figure that out | 18:52 |
| clarkb | gitea 1.25.2 includes a pnpm-lock.yaml file. They switched to pnpm in the 1.25.0 release. In theory pnpm mitigates some of these issues by opting into newer releases more slowly, but the lock file still pins what actually gets installed, so I think the process is to cross check the lock file against the known bad list that came out today | 18:57 |
| clarkb | shouldn't be too much work if I can find a relatively clean source list to iterate over and grep into the pnpm lock file | 18:57 |
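A rough sketch of that cross-check, assuming a hypothetical bad-packages.txt of newline-delimited name@version entries alongside gitea's pnpm-lock.yaml; pnpm lock files record resolved packages as name@version strings, so a substring scan works as a first pass before verifying any hits by hand:

```python
import sys

# Hypothetical inputs: a known-bad list of "name@version" entries and
# the pnpm-lock.yaml from the gitea source tree.
BAD_LIST = "bad-packages.txt"
LOCKFILE = "pnpm-lock.yaml"

with open(BAD_LIST) as f:
    bad = [line.strip() for line in f if line.strip()]

with open(LOCKFILE) as f:
    lock_text = f.read()

# Flag any known-bad name@version string appearing in the lock file.
hits = [entry for entry in bad if entry in lock_text]
if hits:
    print("possible matches, verify by hand:")
    print("\n".join(hits))
    sys.exit(1)
print("no known-bad name@version strings found")
```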
| fungi | i guess we need to care about gerrit builds too? | 19:21 |
| fungi | oh, and mailman, not that we update it often at all | 19:21 |
| clarkb | fungi: yes though gerrit uses dependencies that google checks | 19:27 |
| clarkb | fungi: which in theory means those packages are scrutinized more than probably any others we consume | 19:27 |
| clarkb | I'm trying to sort out where those are listed to double check them | 19:29 |
| clarkb | as far as I can tell gerrit and zuul and gitea all seem fine. Only gitea had any overlap (a single lib) and the version is not one that is listed as affected | 19:33 |
| *** dhill is now known as Guest32203 | 20:49 | |
| clarkb | gitea 1.25.2 screenshots lgtm. I am looking at popping out shortly for a bike ride though | 21:01 |
| opendevreview | Jeremy Stanley proposed opendev/zuul-providers master: Increase boot-timeout for osuosl to 10 minutes https://review.opendev.org/c/opendev/zuul-providers/+/968259 | 21:14 |
| fungi | clarkb: corvus: ^ as discussed earlier | 21:15 |
| corvus | ++ | 21:26 |
| clarkb | fungi: approved thanks | 23:02 |
| opendevreview | Merged opendev/zuul-providers master: Increase boot-timeout for osuosl to 10 minutes https://review.opendev.org/c/opendev/zuul-providers/+/968259 | 23:02 |
| clarkb | as a heads up I'm going to restart zuul launchers, web servers and schedulers to pick up some changes that landed today | 23:08 |
| clarkb | and when that is done I'll work on our meeting agenda | 23:08 |
| clarkb | the first set of services have been restarted. Once they are up I'll do their other halves | 23:12 |
| clarkb | corvus: interestingly zl01 has a STOPPED launcher and a running launcher. I assume that is because I didn't stop it gracefully so now we need to timeout or something | 23:13 |
| clarkb | hrm looking at the components list ze07 isn't listed and ze06 and ze08 are running different versions. I wonder if our weekly restart failed on ze07 somehow | 23:14 |
| clarkb | I don't anticipate that causing problems for the launchers, webs or schedulers so I'll stick to my plan for now | 23:15 |
| clarkb | ze07 did fail with '/usr/bin/apt-get upgrade --with-new-pkgs' failed: E: Could not get lock /var/lib/dpkg/lock-frontend | 23:22 |
| clarkb | the launchers, webs, and schedulers have all updated and the last stragglers are initializing now. | 23:23 |
| clarkb | corvus: I suspect that the next weekly pass can take care of the mergers and executors unless there is a reason to fix them. We can even leave ze07 in the state it is in right now in order to test the case where the executor is already down (which should be handled now) | 23:24 |
| clarkb | I am going to see if I can have that package update step delay or retry over time since the lock is typically not going to be held for long | 23:24 |
| clarkb | hrm we already do have retries set up but the log on bridge says it only did one attempt so I'll need to debug further | 23:27 |
| clarkb | oh I see it. We wait for a specific string to not be in the error message, and that string appears different now so we stop waiting | 23:28 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Update zuul_reboot package upgrade error conditions https://review.opendev.org/c/opendev/system-config/+/968270 | 23:33 |
| corvus | clarkb: yes, stopped components can hang around for a bit, not a problem. | 23:33 |
| corvus | clarkb: +2 with comment | 23:36 |
| corvus | and i think leaving ze07 down sounds fine | 23:36 |
| clarkb | corvus: ya I noted that in the commit message. I can be swayed to simply retry for up to 20 minutes and do our best in all situations | 23:37 |
| clarkb | corvus: do you think you prefer that? | 23:37 |
| corvus | right now, i'm like 51% to 49% in favor of removing the check. the next time the error message changes i will be 99% in favor of removing it. :) | 23:38 |
| clarkb | heh ok. Why don't we just get ahead of it and fix it now then. I'll push a new patchset | 23:39 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Update zuul_reboot package upgrade error conditions https://review.opendev.org/c/opendev/system-config/+/968270 | 23:40 |
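The revised approach in miniature: instead of matching an exact apt error string (which evidently changes between releases), just retry on any failure until the dpkg lock frees up or a deadline passes. This is a sketch of the logic, not the actual Ansible task in system-config:

```python
import subprocess
import time

CMD = ["apt-get", "upgrade", "--with-new-pkgs", "-y"]
DEADLINE = time.monotonic() + 20 * 60  # retry for up to 20 minutes
DELAY = 30  # seconds between attempts

# dpkg locks are normally held only briefly (e.g. by unattended-upgrades),
# so treating any nonzero exit as retryable is simpler and more robust
# than matching a specific "Could not get lock" message.
while True:
    result = subprocess.run(CMD, capture_output=True, text=True)
    if result.returncode == 0:
        break
    if time.monotonic() > DEADLINE:
        raise RuntimeError(f"apt-get kept failing: {result.stderr.strip()}")
    time.sleep(DELAY)
```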
| clarkb | ok I've got the meeting agenda updated with all of the updates I am aware of | 23:48 |
| clarkb | I'll send that out in about 15 minutes if I don't hear any new suggestions for edits | 23:48 |