opendevreview | Merged opendev/zuul-providers master: Add centos-10-stream and rockylinux-10 image definitions https://review.opendev.org/c/opendev/zuul-providers/+/953726 | 06:49 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 07:41 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 08:12 |
opendevreview | Tristan Cacqueray proposed zuul/zuul-jobs master: DNM: testing third-party CI config https://review.opendev.org/c/zuul/zuul-jobs/+/954212 | 08:15 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 08:37 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 08:48 |
*** liuxie is now known as liushy | 08:50 | |
priteau | Hello. Do you know when we could expect new nodepool images to be available on all clouds following the merge of https://review.opendev.org/c/opendev/zuul-providers/+/953908? | 09:53 |
frickler | priteau: iiuc this should have happened right after the promote pipeline ran for that change. but there were also fresh image builds this morning. do you still see any issues? | 10:51 |
priteau | I still had the issue this morning in https://zuul.opendev.org/t/openstack/build/3d027faf5d084bd185363ef931f65ea6 | 11:07 |
priteau | zuul-info says /etc/dib-builddate.txt is 2025-05-27 19:27 | 11:07 |
priteau | (stable/2025.1 branch) | 11:08 |
frickler | infra-root: ^^ I can confirm that https://zuul.opendev.org/t/openstack/image/rockylinux-9 only shows pretty old images, I guess someone needs to take a closer look as to why the images e.g. from https://review.opendev.org/c/opendev/zuul-providers/+/953908 don't get uploaded and/or used | 11:23 |
frickler | ok, looks like rocky images are missing from the periodic-image-build pipeline, but shouldn't uploads also be triggered when promoting images? | 11:35 |
mnasiadka | seems not | 11:41 |
mnasiadka | let me raise a patch | 11:42 |
mnasiadka | I think promote just promotes the built image | 11:42 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add missing rocky jobs to perodic pipeline https://review.opendev.org/c/opendev/zuul-providers/+/954231 | 11:44 |
frickler | right, but promote should act on the images that were built in the gate pipeline? | 11:44 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add missing rocky jobs to perodic pipeline https://review.opendev.org/c/opendev/zuul-providers/+/954231 | 11:45 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add missing rocky jobs to perodic pipeline https://review.opendev.org/c/opendev/zuul-providers/+/954231 | 11:47 |
*** dhill is now known as Guest21504 | 11:47 | |
mnasiadka | frickler: seems there's some logic in the promote playbooks which queries for artifacts; if nothing was built and uploaded, nothing is promoted I guess | 11:50 |
mnasiadka | frickler: although the gating jobs should be uploading something, so I'm a bit lost there ;-) | 11:52 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add missing rocky jobs to periodic pipeline https://review.opendev.org/c/opendev/zuul-providers/+/954231 | 12:16 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 13:19 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add image names to promote jobs https://review.opendev.org/c/opendev/zuul-providers/+/954244 | 13:28 |
corvus | frickler: mnasiadka ^ that should fix the promote jobs | 13:28 |
opendevreview | James E. Blair proposed openstack/project-config master: Update Zuul status node graphs https://review.opendev.org/c/openstack/project-config/+/954246 | 13:43 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 13:45 |
ykarel | Hi is there some recent change which made multinode jobs to have nodes across different cloud providers? Seeing random failures in CI with such cases | 13:49 |
ykarel | example https://zuul.openstack.org/build/37303139f81f4390a04446c9e5f1a9a5 and https://zuul.openstack.org/build/886bbfaeb5894fb389d1fecb989b631f, but there are many such cases started ~ week ago | 13:51 |
fungi | ykarel: what's your most recent example? have you seen it happen today? | 13:51 |
ykarel | fungi, you mean this is fixed already? | 13:51 |
ykarel | one of above ex is from today | 13:51 |
fungi | over the weekend we upgraded to a zuul-launcher fix that should try harder to keep nodes for the same request in one provider | 13:51 |
fungi | er, zuul, not zuul-launcher | 13:53 |
fungi | a fix to the zuul-launcher service in the zuul repo | 13:53 |
fungi | ykarel: in theory https://review.opendev.org/c/zuul/zuul/+/954064 should have addressed the problem | 13:53 |
ykarel | ok, seeing the failures today, at least it seems it hasn't helped fully | 13:53 |
fungi | it does look like our upgrade over the weekend maybe didn't happen? we're running several different versions on various components according to https://zuul.opendev.org/components | 13:54 |
ykarel | no idea about that :) | 13:56 |
fungi | the launchers are still a couple of commits behind, i think 12.1.1.dev48 75766e938 that some components are running has it, but 12.1.1.dev46 ad9d8bc4e on the launchers doesn't | 13:59 |
fungi | looks like the upgrade may have stopped between ze08 and ze09? | 13:59 |
fungi | ze09 is missing from the components list, and then ze10 onwards are still reporting older versions | 14:00 |
fungi | infra-root: ^ i'm in meetings for the next 2 hours and can't look deeper into this yet | 14:00 |
opendevreview | Tristan Cacqueray proposed zuul/zuul-jobs master: Remove python2-devel from bindep for Fedora https://review.opendev.org/c/zuul/zuul-jobs/+/954257 | 14:31 |
opendevreview | Merged opendev/zuul-providers master: Add image names to promote jobs https://review.opendev.org/c/opendev/zuul-providers/+/954244 | 14:43 |
corvus | i can look into the launcher/nodes issue; i'll leave debugging the restart playbook for someone else or later | 14:49 |
fungi | corvus: i suspect the launcher/nodes issue is simply that the fix from last week isn't deployed yet due to the stuck upgrade | 14:52 |
corvus | fungi: oh i restarted the launchers immediately with that fix | 14:53 |
corvus | so that should be deployed | 14:53 |
fungi | oh, the fix in 954064 is 12.1.1.dev46 then? | 14:54 |
corvus | i'm not sure, i'm just sure that i did a pull and restart | 14:54 |
fungi | i was trying to map it based on commit count since the last tag and it seemed like that was one or two dev version commits too low | 14:55 |
fungi | but maybe i was off by a couple | 14:55 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 14:56 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Rocky Linux 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/954265 | 14:56 |
mnasiadka | hmm, DCO is not required in opendev tenant? | 14:56 |
corvus | nope | 14:57 |
mnasiadka | fine by me :) | 14:58 |
mnasiadka | (was just surprised) | 14:58 |
corvus | those policies are set by the projects themselves (eg, "openstack", "starlingx", "opendev", etc) | 14:59 |
fungi | afaik it's currently set for official openstack deliverables, as well as airship and starlingx | 15:00 |
fungi | opendev isn't an official openinfra project, we're more like a community that operates with assistance from openinfra | 15:01 |
corvus | fungi: zuul-launcher is running commit "Add monitoring server to zuul-launcher" which is current master | 15:01 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Rocky Linux 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/954265 | 15:02 |
corvus | (based on docker inspect) | 15:02 |
fungi | corvus: huh, i wonder if the components list is stale then? | 15:02 |
fungi | maybe something broke version reporting | 15:04 |
corvus | fungi: oops, sorry, it's running "Minor launcher log improvements", one commit back | 15:05 |
corvus | that seems consistent with the components page, where launchers are running 46, and executors are running 48. | 15:06 |
fungi | okay, so launchers should have the locality fix already since before the weekend, and separately the upgrade to an even newer state broke in the middle of the executors list after that | 15:09 |
fungi | does indeed sound like two different unrelated problems in that case | 15:09 |
corvus | ++ | 15:10 |
frickler | ze09 upgrade failed in "Upgrade executor server packages" with "E:Could not get lock /var/lib/apt/lists/lock. It is held by process 917246 (apt-get)". possibly a collision with automated upgrades | 15:15 |
fungi | fatal: [ze09.opendev.org]: FAILED! => ... MSG: Failed to lock apt for exclusive operation: Failed to lock directory /var/lib/apt/lists/: W:Be aware that removing the lock file is not a solution and may break your system., E:Could not get lock /var/lib/apt/lists/lock. It is held by process 917246 (apt-get) | 15:17 |
fungi | oh, frickler beat me to the log | 15:17 |
frickler | actually looks like "/bin/sh /usr/lib/apt/apt.systemd.daily update" is stuck since 20250630 | 15:17 |
fungi | but yeah, sounds like the reboot playbook collided with unattended upgrades on ze09, leaving it offline and stopping cold there | 15:17 |
fungi | 20250630 is coincidentally close to "up 8 days" | 15:19 |
Clark[m] | I assume that caused the reboot playbook to exit with a failure as well? | 15:19 |
fungi | yes | 15:22 |
fungi | it aborted at that point, and since it serializes the upgrades/reboots it left half the list un-upgraded | 15:23 |
fungi | /var/log/dpkg.log is empty, the last entry in /var/log/dpkg.log.1 is "2025-06-29 02:20:16 status installed libc-bin:amd64 2.39-0ubuntu8.4" | 15:24 |
fungi | similarly, /var/log/unattended-upgrades/unattended-upgrades.log is empty but the most recent rotated one ends at "2025-06-30 06:09:41,059 INFO No packages found that can be upgraded unattended and no pending auto-removals" | 15:25 |
Clark[m] | corvus: both of ykarel's examples appear to have been in the periodic pipeline. I wonder if that makes a difference with the stampede of requests all at once vs normal operating through the rest of the day | 15:31 |
Clark[m] | thinking maybe boot failures are more likely during that time which may lead to three failures in a row in some clouds | 15:31 |
fungi | `ps -q 917246 -o lstart` indicates the apt-get update process started "Mon Jun 30 11:26:29 2025" which is well after the last entry in unattended-upgrades.log | 15:36 |
clarkb | fungi: ansible runs apt-get update too iirc | 15:37 |
clarkb | its possible that an ansible run triggered the process and then things went sour | 15:37 |
fungi | well, in this case as frickler said it's a child of /usr/lib/apt/apt.systemd.daily | 15:39 |
opendevreview | Dr. Jens Harbott proposed opendev/zuul-providers master: Add debian-trixie build https://review.opendev.org/c/opendev/zuul-providers/+/951471 | 15:39 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Rocky Linux 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/954265 | 15:39 |
fungi | so locally scheduled process on the machine, not ansible as far as i can tell | 15:40 |
clarkb | oh, that's odd that it would be so offset from when the daily runs occur | 15:40 |
fungi | looks like systemd has registered a apt-daily.timer and a apt-daily-upgrade.timer (with an associated service for each) | 15:43 |
fungi | `journalctl -u apt-daily.service` reports that it started Jun 30 11:26:28 for "Daily apt download activities" and never reported completion | 15:45 |
fungi | looks like it fires twice a day at somewhat random times | 15:46 |
fungi | /usr/lib/systemd/system/apt-daily.timer has OnCalendar=*-*-* 6,18:00 and RandomizedDelaySec=12h | 15:46 |
fungi | so it will run at a random time between 06:00-18:00 utc and again at a random time between 18:00-06:00 utc | 15:47 |
clarkb | we can probably make the reboot playbook more resilient by detecting a lock and then waiting. However, in this case that wouldn't have helped much as things seem properly stuck | 15:49 |
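The lock-wait clarkb suggests could be sketched roughly as below. This is a hypothetical helper, not the actual reboot playbook (which uses Ansible's apt module and fails immediately on a held lock); the function name and timeout are assumptions:

```shell
# Sketch of a lock-wait helper (assumption: util-linux flock is available).
# Retry taking the apt lock non-blocking instead of failing on first contact.
wait_for_lock() {
    lockfile=$1
    tries=${2:-30}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # flock -n: try to take the lock without blocking; "true" is the
        # no-op command run while briefly holding it.
        if flock -n "$lockfile" true 2>/dev/null; then
            return 0
        fi
        sleep 2
        i=$((i + 1))
    done
    return 1
}

# Usage sketch (not run here):
# wait_for_lock /var/lib/apt/lists/lock && apt-get update
```

As noted in the discussion, this only helps with transient collisions; a process stuck for days would still exhaust any sensible timeout.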
fungi | yeah, looks like child processes are in select loops waiting for something that's never coming | 15:50 |
fungi | and strace indicates the parent is waiting on each of them | 15:51 |
fungi | i'm not sure how much deeper i can dig into this, but suspect the next step is to try to forcibly terminate the apt-daily.service and then clean up any hung processes | 15:53 |
clarkb | that seems reasonable to me | 15:55 |
clarkb | I recall there were problems reaching the ubuntu mirrors a little while back | 15:55 |
clarkb | I'd have to look at irc logs to confirm but I wonder if this daily update occurred during that period of time and its a network sad related to that | 15:55 |
fungi | entirely possible | 15:56 |
jrosser | there is something a bit strange happening here https://zuul.opendev.org/t/openstack/build/22b144ba953b42758cec8f4cc32d0fce/log/job-output.txt#326-343 | 16:05 |
jrosser | i can see two different mirrors there for different CI regions | 16:05 |
jrosser | the iad3 ones are correct, the dfw ones not | 16:08 |
clarkb | my hunch is that the image build process is setting the dfw mirror in the files then the configure-mirrors role is not overriding the backports target | 16:09 |
jrosser | i don't see it for noble on the same job, so it could be debian specific | 16:10 |
clarkb | jrosser: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/tasks/mirror/Debian.yaml#L12-L13 | 16:11 |
clarkb | and yes that behavior differs to Ubuntu https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/tasks/mirror/Ubuntu.yaml | 16:11 |
clarkb | I think you can set that flag and then things will work. I'm not sure why that is hidden behind a var though | 16:12 |
jrosser | seems to be me that added that :) | 16:13 |
clarkb | reading the README it seems the implication is that backports are not enabled on debian by default so you have to opt into them here | 16:14 |
clarkb | but apparently we are already opted into them during image builds? I wonder if that is a bug | 16:14 |
clarkb | fungi: ^ you may have thoughts | 16:14 |
jrosser | but that doesnt explain why the mirror URL is different? | 16:14 |
jrosser | that should just be enable/disable backports iirc | 16:14 |
fungi | having backports in the sources list doesn't automatically result in installing packages from backports | 16:14 |
clarkb | jrosser: my assumption which I haven't confirmed is that the image build is producing the backports source file with that mirror in it | 16:14 |
clarkb | jrosser: then your job runs and because it doesn't override the value it uses the wrong backports target | 16:15 |
fungi | backports have a lower priority so you have to explicitly request packages from the backports suite or specify versions that can only be satisfied from backports | 16:15 |
clarkb | with nodepool you'd be hard-set to the public dfw mirror interface. With zuul-launcher it's inheriting from the test node environment I think | 16:15 |
clarkb | so you probably just never noticed the backports mirror was configured on nodepool images, but now we notice because the target moved | 16:16 |
fungi | basically, including backports in the sources list means jobs can request packages from backports if they want, without having to first add more sources list entries | 16:16 |
clarkb | fungi: jrosser: given the backports-won't-be-used-by-default behavior, I think the easiest and most correct thing would be to have configure-mirrors always configure backports to the correct target | 16:16 |
fungi | also, i think debian's policy on whether backports are enabled by default differs between their installer and their cloud images | 16:17 |
jrosser | i am pretty confused now | 16:18 |
fungi | last i recall, official debian cloud images have backports available by default in the sources list, but if you use the installer it prompts you whether you want them enabled defaulting to no | 16:18 |
clarkb | jrosser: when your test node booted all of the apt list configuration was pointed at mirror.dfw-int.rax.opendev.org. Then configure-mirrors ran and updated all of the apt list config except backports to point at the current cloud location. | 16:19 |
clarkb | (this is still my hunch I haven't checked the image to see if that is the case, but I'm likt 95% certain) | 16:19 |
clarkb | this happened due to the move from nodepool building the image to zuul-launcher jobs building the image. With nodepool we always hardcoded to mirror.dfw.rax.opendev.org | 16:20 |
fungi | i think mirror-int.dfw.rax.opendev.org is used initially because that's local to where the image builders are, so that starts out baked into our images | 16:20 |
jrosser | ah and this is because in my job i set `configure_mirrors_extra_repos` to be False, and that leaves the stale backports mirror config visible | 16:20 |
clarkb | so it just worked before (tm) but not quite in the way you expected it to | 16:20 |
clarkb | yes exactly | 16:20 |
jrosser | right, yes i understand now, thank you :) | 16:20 |
clarkb | I think we should remove configure_mirrors_extra_repos checking for debian backports and always configure it | 16:20 |
clarkb | since backports are not used by default it seems like it would be correct to ensure they are available and configured correctly for things that opt into using them | 16:20 |
fungi | i concur, as long as that's always going to be used on images with the backports suite enabled | 16:21 |
fungi | it may be that at some point in the past we didn't have backports in the initial sources list for images we built | 16:21 |
jrosser | in the past it was extremely difficult to remove the backports config that was in the image | 16:22 |
fungi | which in theory shouldn't be necessary unless you're trying to make sure you don't accidentally request a version of something from backports | 16:23 |
fungi | unless you're doing things like `apt install -t bookworm-backports foo` or `apt install foo/bookworm-backports` or `apt install foo>=some.backport.version` | 16:24 |
fungi | in which case you're going to get an error without bookworm-backports in sources | 16:25 |
fungi | but installing packages from backports shouldn't ever happen implicitly unless you alter the suite priority in apt.conf or do bespoke package pinning configuration | 16:27 |
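The priority behavior fungi describes comes from backports suites being published with `NotAutomatic: yes` / `ButAutomaticUpgrades: yes`, which apt maps to pin priority 100 (below the default 500). An illustrative preferences stanza that reproduces that default explicitly would look like:

```
# Illustrative /etc/apt/preferences.d/backports stanza (hypothetical file);
# this mirrors the priority apt already assigns to the backports suite by
# default, so nothing is installed from it unless requested with
# `apt install -t bookworm-backports foo` or an explicit version.
Package: *
Pin: release a=bookworm-backports
Pin-Priority: 100
```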
jrosser | well anyway, my commit message here has some of the reasoning for this https://opendev.org/zuul/zuul-jobs/commit/5d01b68574931435ce7605a8773fc4db0b47b60c | 16:28 |
fungi | the only reason i can think of to not include backports in sources is to reduce the number of indices downloaded during an apt update, making jobs (slightly) faster and consuming less bandwidth with fewer random network failures, though little of that is relevant since we have local mirrors | 16:28 |
clarkb | I guess the alternative would be to ensure the backports file is absent when that var is not set | 16:30 |
fungi | also as previously mentioned, if you're trying to approximate official debian cloud images, i believe those *do* include backports in their sources lists. it's debian-installer which defaults to not enabling it (and prompts asking whether to if run interactively) | 16:30 |
jrosser | that flag is just there to avoid collisions in jobs that themselves expect to be managing that aspect of the repo setup | 16:31 |
jrosser | we've previously had very fragile code in pre.yml to try to unpick this | 16:31 |
clarkb | wouldn'y you just ensure the file is absent and update? | 16:32 |
clarkb | thats all I was going to do to the alternative of just always configuring it | 16:32 |
clarkb | mostly just looking for some rough consensus on what we think is appropriate. I'm leaning towards alwyas configuring backports since you have to opt into using them anyway | 16:34 |
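Mechanically, "always configure backports" amounts to unconditionally rewriting the backports source entry to point at the per-region mirror. A minimal sketch (the real change lives in the configure-mirrors role; the helper name, mirror URL, and file path here are placeholders):

```shell
# Hypothetical sketch of the intended end state: point the backports
# entry at the local mirror unconditionally, since apt won't install
# from backports unless a job explicitly opts in.
write_backports_list() {
    mirror=$1   # per-region mirror base URL (placeholder)
    suite=$2    # e.g. bookworm
    outfile=$3  # e.g. /etc/apt/sources.list.d/backports.list
    printf 'deb %s %s-backports main\n' "$mirror" "$suite" > "$outfile"
}

# Usage sketch (not run here):
# write_backports_list http://mirror.region.example/debian bookworm \
#     /etc/apt/sources.list.d/backports.list
```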
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Always configure Debian backports in configure-mirrors https://review.opendev.org/c/zuul/zuul-jobs/+/954280 | 16:40 |
mnasiadka | Ok, finally stream 10 zuul-launcher builds are working - if anyone has some time to review https://review.opendev.org/c/opendev/zuul-providers/+/953460 - I’d be grateful | 16:40 |
clarkb | We can decide in review if 954280 is what we want | 16:40 |
clarkb | mnasiadka: reviewed. One nit but then the thing with rax classic not being able to boot el10 should be addressed so I -1'd | 16:46 |
mnasiadka | Ah right | 16:48 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 16:50 |
clarkb | following up on things from last week: has anyone checked if the zuul key backup cron is working now with the updated container name? | 17:00 |
clarkb | I've done a first pass update to the meeting agenda. I tried to capture the current state of zuul launcher things in particular | 17:05 |
clarkb | let me know what else should go on there. I guess adding centos 10 stream nodes is a good topic too | 17:05 |
clarkb | ok added centos 10 stream and debian trixie to the agenda | 17:06 |
clarkb | the zuul keys backup file has a modtime of July 7 00:00 UTC and has many bytes of content. I suspect that is happy now | 17:08 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/954284 Further fixes to multi-tenant provider locality [NEW] | 17:12 |
corvus | clarkb: ^ previous fix was not complete | 17:12 |
clarkb | on it thanks | 17:13 |
fungi | thanks corvus! | 17:13 |
fungi | ykarel: ^ is the answer to earlier | 17:13 |
fungi | since i didn't hear any objections, i issued a `sudo systemctl stop apt-daily` on ze09 which seems to have killed the hung processes and the associated timer automatically started it again. i'll monitor to make sure it completes | 17:18 |
fungi | /var/log/dpkg.log indicates package upgrades are in progress this time, i see a bunch scrolling by | 17:19 |
clarkb | corvus: why use explicit includes in ps2's tenant config? isn't the default to include all of that stuff anyway? | 17:20 |
clarkb | change lgtm either way | 17:20 |
corvus | clarkb: not the niz objects. they are not included by default while niz is still in stealth mode | 17:28 |
corvus | (otherwise, zuul installations could be surprised their users are using an undocumented feature) | 17:29 |
fungi | automatic package upgrades on ze09 completed this time (took a while since a new kernel triggered another dkms rebuild for the openafs lkm) | 17:30 |
fungi | everything should be back to normal there now | 17:30 |
clarkb | corvus: aha that explains it | 17:31 |
clarkb | fungi: corvus frickler should we manually trigger the cronjob on bridge that upgrades and reboots zuul now? Or just wait for Friday again? | 17:32 |
clarkb | I think the main concern is launchers and those are easy to manually update if/when we need to out of band if we just want to wait for Friday | 17:32 |
fungi | well also we're one executor down for the moment | 17:32 |
fungi | since ze09 never got rebooted | 17:33 |
clarkb | I think the playbook does handle the case where services are already off on the executor | 17:34 |
clarkb | also I wonder if noble is running those updates more aggressively than jammy did | 17:34 |
corvus | i think we can start up ze09 and leave it | 17:34 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Rocky Linux 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/954265 | 17:35 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Rocky Linux 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/954265 | 17:35 |
clarkb | that might explain part of why this is a new issue | 17:35 |
corvus | i do think this is the second time this happened, but i can't be sure without digging through irc logs that it's not the same stuck process as the first. | 17:35 |
fungi | what's the best way to recover ze09? issue a reboot? | 17:36 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Rocky Linux 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/954265 | 17:36 |
fungi | since there was also a kernel update in the mix | 17:36 |
corvus | oh i just remembered, searching through irc logs is easy with matrix/element | 17:36 |
corvus | the thing i'm remembering is that it got stuck on june 28 when the server booted initially | 17:37 |
clarkb | fungi: a reboot won't start the containers again. But rebooting to pick up the updates is probably a good idea anyway. So reboot then docker compose up -d in /etc/zuul-executor-docker/ or whatever the dir is called | 17:37 |
clarkb | gitea09 is a jammy node and has both the daily and daily upgrade timer in systemd | 17:37 |
clarkb | and the daily run runs ExecStart=/usr/lib/apt/apt.systemd.daily update | 17:38 |
clarkb | fungi: both the upgrade unit and the daily unit run /usr/lib/apt/apt.systemd.daily. One passes install as the argument (upgrade) and the other update (daily). Reading the script it seems like these two may be redundant with one another | 17:42 |
clarkb | oh no "update" only runs unattended-upgrade if download only is set | 17:43 |
clarkb | so it doesn't install any packages with unattended-upgrade. That is the difference. I wonder why we need both to run though | 17:43 |
fungi | yeah, i think the update is meant to happen more often to pull newer indices and maybe download and stage packages for later upgrading, so that the upgrade will be faster | 17:44 |
fungi | i didn't really dig into the reasoning behind those | 17:45 |
fungi | i see they're present on my debian/unstable systems as well | 17:45 |
clarkb | ya I think that is correct. But they both set the update timestamp, so I think you can get away with only the upgrade unit and not the daily | 17:46 |
clarkb | there isn't a strict dependency between them, just a potential optimization if both are used? I wonder if we should disable the apt-daily unit and rely only on the upgrade unit to reduce the chances of conflict | 17:46 |
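If the standalone daily "update" timer were ever dropped in favor of the upgrade timer alone, a systemd drop-in would suffice. This is a sketch of one option being mused about, not something decided in this conversation:

```
# Hypothetical drop-in at /etc/systemd/system/apt-daily.timer.d/override.conf:
# clearing OnCalendar disables the standalone index-refresh timer, leaving
# apt-daily-upgrade.timer (whose service also refreshes indices before
# installing) as the only scheduled apt activity.
[Timer]
OnCalendar=
```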
clarkb | but also I see that this happens on jammy too so not a new regression I don't think | 17:46 |
clarkb | possibly just more likely to happen on noble since it is newer so maybe gets more package updates | 17:47 |
fungi | possible | 17:49 |
clarkb | to be clear "this happens" == the two crons exist and run. Not necessarily the stuck process we saw on ze09 | 17:50 |
clarkb | mnasiadka: -1 on the rocky image change due to missing nodeset details but otherwise lgtm | 17:57 |
clarkb | frickler: the trixie image change lgtm as well. We'll just have to remember to switch testing to trixie once the release happens and also add arm64 images at some point | 17:57 |
clarkb | following up on the ipv4 -> ipv6 -> ipv4 -> ipv6 git fetch behavior seen building an image that retried git repo updates it looks like the node had ipv6 configured on its primary interface with the expected global address at the beginning of the job when we record node details | 17:59 |
clarkb | https://zuul.opendev.org/t/opendev/build/2f4200f24ae4492295d5c43307e93038/log/zuul-info/zuul-info.ubuntu-noble.txt#53 any idea what stale ipv6 router neighbors indicates? | 18:02 |
clarkb | oh that is for the scope link network anyway? | 18:06 |
fungi | i think stale state may be an indicator that the tcp/ip stack simply hasn't seen any traffic for that address and so hasn't bothered to try to refresh it | 18:07 |
fungi | basically the kernel doesn't know whether that address is reachable or not | 18:08 |
fungi | but saw it working at one time | 18:08 |
fungi | and yeah the via routes (default being the only one listed) use the gateway's global address instead, which is marked as reachable in the neighbor table, so should be fine | 18:10 |
fungi | my guess is that the router is sending route announcements via its linklocal address, but not frequently enough that the linklocal address is always fresh in the neighbor table | 18:11 |
clarkb | I've hopped onto a running bhs1 noble node and we actually configure ipv6 statically on that node | 18:11 |
clarkb | so now I'm even more confused as to why we would see things move from ipv4 to ipv6 and back again | 18:12 |
clarkb | could it be DNS record lookup failures? | 18:12 |
fungi | trying to remember what the actual issue was... is this concerning git? | 18:12 |
clarkb | fungi: yes, its git running within the image build chroot (which I suppose may impact things) updating each repo | 18:13 |
clarkb | https://zuul.opendev.org/t/opendev/build/2f4200f24ae4492295d5c43307e93038/log/job-output.txt#4353-4357 | 18:13 |
fungi | i believe git will automatically try over ipv4 if it gets an error doing things over v6 | 18:13 |
fungi | which can lead to odd behavior if you have intermittent network issues | 18:13 |
clarkb | here you can see it fails then tries again. The prior git command used ipv6 from what I can tell from the haproxy logs | 18:14 |
clarkb | the first attempt presumably tries ipv6 as well but that is never recorded on the haproxy side. Then ipv4 request shows up | 18:14 |
mnasiadka | clarkb: will update soon, thanks for review | 18:14 |
clarkb | and before using ipv6 it did a bunch of requests via ipv4 first | 18:14 |
fungi | basically git assumes ipv6 is broken when it gets an error, even though it may be a network issue impacting both v6 and v4 | 18:14 |
clarkb | fungi: well, git doesn't automatically retry over ipv4; we do that | 18:15 |
clarkb | fungi: it fails using ipv6 then when we execute the same git command again in a loop to retry it used ipv4 | 18:15 |
clarkb | after a 5 minute timeout | 18:16 |
fungi | huh, for some reason i thought git did that on its own too | 18:16 |
clarkb | the trying again in the linked log is something explicit that mnasiadka added to the dib script | 18:17 |
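One way to make that retry deterministic (a sketch of the idea only, not the actual dib script) is to pin git to a single address family, so a retried command can't silently flip between v6 and v4; git has supported `-4/--ipv4` and `-6/--ipv6` on clone/fetch since 2.8:

```shell
# Hypothetical retry wrapper: pin git to IPv4 so retries don't flip
# between address families. The retry count and sleep are placeholders.
clone_with_retry() {
    url=$1
    dest=$2
    n=0
    while [ "$n" -lt 3 ]; do
        if git clone --ipv4 "$url" "$dest"; then
            return 0
        fi
        # Clean up any partial clone before retrying.
        rm -rf "$dest"
        sleep 1
        n=$((n + 1))
    done
    return 1
}
```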
clarkb | Maybe need to check how dns is configured in the chroot | 18:19 |
clarkb | I think our centos 9 stream mirror is sad | 18:22 |
clarkb | Appstream repomd.xml hashes don't match expected values. This probably means upstream made updates out of order and we're seeing the fallout | 18:23 |
fungi | revisiting my recollections and trying to reproduce, it does look like git only tries one address family. i was probably thinking of curl, which does implement a "happy eyeballs" fast fallback (ietf rfc 6555/8305) and exhibits the oddities i described | 18:32 |
clarkb | I'm still a bit stumped on what could cause the change of behavior between subsequent git invocations now that i know the addr is statically configured | 18:34 |
clarkb | the only thing I can figure is dns, but the ttl on the record is 3600 and we see the flip-flopping multiple times in under an hour | 18:34 |
clarkb | maybe git is round-robinning between A and AAAA records? | 18:34 |
fungi | it really shouldn't. based on my reading it uses ipv6 if aaaa records are returned, otherwise ipv4 | 18:35 |
fungi | except when explicitly overridden with the -4/--ipv4 and -6/--ipv6 cli options | 18:35 |
clarkb | `sudo zgrep '158.69.71.233\|2607:5300:201:2000::5e8' /var/log/haproxy.log.4.gz` this will show you the flapping on gitea-lb02 | 18:40 |
clarkb | I'm thinking we probably don't need to preserve that log since the node is gone and any real debugging probably needs a held node | 18:41 |
clarkb | but feel free to grab it if you think otherwise | 18:41 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Rocky Linux 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/954265 | 18:54 |
fungi | clarkb: not sure if you've seen https://review.opendev.org/c/openstack/pbr/+/954040 (not urgent), but istr you have some strong feelings about it | 18:57 |
clarkb | fungi: I'm actually fine with whatever format people want to use. I just don't want pre-commit installing everything from git repos if we can use pypi packages instead (for reliability reasons) but that is orthogonal | 19:00 |
clarkb | fungi: as for why it helps refactor: black's formatting is intentionally done to make diffs cleaner | 19:00 |
mnasiadka | clarkb: rechecked the stream10 patch because of some arm64 package fetch issues (maybe mirror update in progress) and updated the rocky10 one - let's see if these pass | 19:00 |
clarkb | so in theory it makes refactoring easier as the diffs between old and new code will be cleaner | 19:01 |
clarkb | mnasiadka: ya centos 9 stream's mirror seems to be in a sad state while they do updates | 19:01 |
corvus | this is... not important... but i just opened one file at random from that change and the reformatting that black is doing is clearly not going to make diffs easier: https://review.opendev.org/c/openstack/pbr/+/954040/2/pbr/tests/test_files.py#86 | 19:03 |
corvus | i'm not here to argue against black in pbr | 19:03 |
corvus | i'm just saying i don't agree with that assessment of black | 19:04 |
mordred | the pre-commit stuff is gross, although it looks like that line has already been crossed | 19:04 |
corvus | (you don't make diffs easier to read by squashing a multi-line dict construction into a single line) | 19:04 |
mordred | I agree - that diff goes the wrong direction | 19:04 |
clarkb | huh that is weird since elsewhere black breaks things up across multiple lines | 19:05 |
clarkb | like here https://review.opendev.org/c/openstack/pbr/+/954040/2/pbr/cmd/main.py I wonder if that is a bug or intentional | 19:05 |
mordred | I like the general approach of indenting after opening parens and brackets, and of being more aggressive about breaking across lines. I really don't like the "this is small, we'll collapse it" thing | 19:05 |
clarkb | but I agree that won't make diffs easier to read | 19:05 |
corvus | mordred: ++ | 19:06 |
clarkb | anyway I don't use black so have no idea if that is likely intentional or not. But I do know one of the stated goals of the project is to make diffing easier so I would probably consider that a bug myself given that goal | 19:07 |
mnasiadka | corvus: I've had multiple issues with black in some repos, it usually boiled down to bugs (whose fixes were hidden behind an experimental flag or were not released yet) - I'm probably not a fan as well ;-) | 19:07 |
mnasiadka | (and it was always some weird reformatting) | 19:08 |
fungi | yeah, skimming, it seems like it wants to combine lists into a single line if short, but explode them to one per line if not | 19:08 |
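The collapse-if-short behavior fungi describes is black's documented default, and its "magic trailing comma" is the documented escape hatch. A small illustration (both forms are equivalent Python; the dict contents are made up):

```python
# Without a trailing comma after the last element, black collapses
# a collection that fits within the line length onto one line:
short = {"name": "pbr", "version": "1.0"}

# With a trailing comma after the last element, black preserves the
# exploded one-element-per-line layout even though it would fit,
# which keeps future diffs down to one line per changed element:
exploded = {
    "name": "pbr",
    "version": "1.0",
}

assert short == exploded  # same value, different formatting
```

So the collapsing seen in test_files.py could likely have been avoided by leaving (or adding) trailing commas before running black.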
mordred | I use black on things I work on these days because it's easy enough to do so, and even if I disagree with some of its choices they're not usually terrible choices, so it's usually less cognitive burden to just "shrug, autoformat and be done with it" than to spend time configuring rule specifics. | 19:14 |
mordred | this is hilarious though: https://review.opendev.org/c/openstack/pbr/+/954040/2/pbr/sphinxext.py#68 | 19:14 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Rocky Linux 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/954265 | 19:57 |
corvus | for my own edification: rocky10 has the same cpu flag requirements as centos10? | 20:13 |
clarkb | corvus: yes | 20:13 |
clarkb | corvus: alma however does rebuilds of packages for older hardware. But they don't do that for all packages so it may or may not work depending on what you install | 20:13 |
corvus | is that a bug or a feature? | 20:13 |
clarkb | in alma's case it is an intentional choice on their side to allow people with older hardware to try and use almalinux | 20:14 |
corvus | yeah; i guess i wondered if rocky might make the decision differently too | 20:15 |
clarkb | iirc they decided not to | 20:15 |
corvus | here's a stray thought on a different subject: both the rocky and trixie changes are building all the images because they require some pro-forma changes to dib elements to account for the new releases. perhaps that's something that could be optimized out of the elements themselves. | 20:16 |
clarkb | maybe. The element is literally the infra package needs element. We could possibly have copies of that element per platform with symlinks to shared bits maybe | 20:17 |
clarkb | Also I think the trixie change makes a broader change that potentially impacts all of debuntu | 20:17 |
corvus | (example, the trixie change edits a file that basically says "the ntp package is still called 'ntp' in trixie". and the rocky change edits a file that says "we still want to install haveged from epel") | 20:18 |
corvus | yeah, they're totally running the right jobs for those changes | 20:18 |
corvus | just thinking there could be a lot less churn if those elements were restructured, maybe with defaults that carry over automatically, or possibly just have values supplied as input | 20:19 |
clarkb | we may be able to invert the defaults where old things override rather than new things overriding | 20:19 |
corvus | ya | 20:19 |
corvus | for infra-package-needs, could that just be a list that we pass in through env variables? then we can set it once for debuntu and we're done until something changes in the distros | 20:20 |
corvus | (i mean, we set it once in the debuntu zuul image build job) | 20:21 |
clarkb | maybe? we're relying on dib tooling to take the list and expand it into the list of packages to install and uninstall | 20:21 |
clarkb | so it would depend on whether or not we could have an element that writes files into itself that meet the interface for that tooling | 20:21 |
clarkb | or updating those elements to look for the data via env vars if they don't already | 20:22 |
corvus | ack. i don't have time to dig into that right now, but that could be a big win for efficiency if someone does. | 20:22 |
corvus | we're currently using 50% of the arm quota every time we make one of these changes. | 20:22 |
clarkb | in the trixie example we could invert the pkg-map pretty easily I think | 20:24 |
clarkb | basically things older than bookworm and focal get listed and then we default the family/distro entries to what we're setting noble,trixie et al to | 20:24 |
clarkb | and do similar for the timesyncd check | 20:25 |
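The inversion described could look something like this in a dib pkg-map file (a sketch with a hypothetical abstract package name; as I understand it, pkg-map consults `release`, then `distro`, then `family`, then `default`, so only the legacy releases need explicit entries and new releases like trixie inherit the default with no edit):

```json
{
  "release": {
    "debian": {
      "buster": { "timesync": "ntp" }
    },
    "ubuntu": {
      "bionic": { "timesync": "ntp" }
    }
  },
  "default": {
    "timesync": "systemd-timesyncd"
  }
}
```

With defaults carrying the current state, adding a new release would stop requiring a churn-inducing edit that rebuilds every image.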
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add ubuntu-bionic and ubuntu-focal labels https://review.opendev.org/c/opendev/zuul-providers/+/954296 | 20:43 |
clarkb | fungi: jrosser: should we go ahead and proceed with https://review.opendev.org/c/zuul/zuul-jobs/+/954280 ? | 20:48 |
fungi | i think so, but holding for confirmation | 20:56 |
fungi | the reality is that right now we have those sources enabled, but they're potentially unreachable from most cloud providers if not overridden in the role | 20:56 |
jrosser | I didn't expect to lose a feature that we are using as a side effect of reporting a different bug tbh | 21:05 |
opendevreview | Merged opendev/zuul-providers master: Add missing rocky jobs to periodic pipeline https://review.opendev.org/c/opendev/zuul-providers/+/954231 | 21:06 |
fungi | jrosser: as i understand it, the sources entries were already there, but pulling from a distant cloud mirror most of the time, and we didn't find out until we switched their default to one which wasn't reachable from any other clouds | 21:13 |
jrosser | this is a new behaviour though | 21:13 |
fungi | looks like it's been in there since https://review.openstack.org/563748 merged almost 7 years ago? | 21:15 |
fungi | the errors are a new behavior because we switched the baked-in mirror hostname to one which isn't reachable from other clouds | 21:16 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add ubuntu-bionic and ubuntu-focal labels https://review.opendev.org/c/opendev/zuul-providers/+/954296 | 21:16 |
opendevreview | Clark Boylan proposed opendev/zuul-providers master: Normalize infra-package-needs packages on current releases https://review.opendev.org/c/opendev/zuul-providers/+/954299 | 21:16 |
clarkb | corvus: ^ something like that maybe. If that passes testing then mnasiadka and frickler might want to rebase on top of that change | 21:16 |
corvus | lgtm | 21:20 |
corvus | thanks! | 21:20 |
fungi | jrosser: these days it's being set by https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/ubuntu-minimal/environment.d/12-ubuntu-repo-dists.bash#L15 | 21:21 |
clarkb | fungi: jrosser right I think this never worked the way jrosser thought it did | 21:22 |
clarkb | it's just working less well and failing in the process rather than silently succeeding | 21:22 |
jrosser | yep - well clearly i have no logs or anything concrete to show related to the original issue we had with the backports repo | 21:23 |
clarkb | the new behavior is that we're embedding the mirror for the node that built the image rather than the "central" mirror.dfw.rax.opendev.org mirror in every image when built by nodepool on static nodes within dfw | 21:24 |
jrosser | so its fine, we can go back to removing it in a pre playbook | 21:24 |
fungi | yeah, i'm not opposed to figuring out a way to make the addition of those sources optional, but they've been included for many years | 21:24 |
fungi | and the current toggle was a useless feel-good knob, it didn't actually control the existence of the entries, merely let you choose whether they should get updated to point to the closest mirror instead of the baked-in one | 21:25 |
fungi | fixing it down to the dib layer would be a much larger amount of engineering | 21:26 |
fungi | mainly because we haven't previously required jobs which might intentionally install things from backports to first enable the backports sources | 21:26 |
fungi | so would need to shake out any potential breakage if we stopped including them | 21:27 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add nested-virt labels https://review.opendev.org/c/opendev/zuul-providers/+/954301 | 21:56 |
corvus | i think that change and its parent should be the last of the labels being handled by nodepool | 21:57 |
clarkb | I approved the parent | 22:04 |
clarkb | hopefully that is ok | 22:04 |
opendevreview | Merged opendev/zuul-providers master: Add ubuntu-bionic and ubuntu-focal labels https://review.opendev.org/c/opendev/zuul-providers/+/954296 | 22:04 |
fungi | corvus: one possible error on 954301 | 22:08 |
fungi | also i see we have an incorrect comment on the one above it, though it was there before your change so i didn't comment on that | 22:08 |
opendevreview | Clark Boylan proposed opendev/zuul-providers master: Normalize infra-package-needs packages on current releases https://review.opendev.org/c/opendev/zuul-providers/+/954299 | 22:13 |
clarkb | my brain saw cronie and thought that's where chrony goes | 22:13 |
clarkb | but no we have both cronie and chrony | 22:13 |
clarkb | Last call for meeting updates. I think it's mostly caught up based on the work I've seen going on but let me know if there are updates I should make | 22:14 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add nested-virt labels https://review.opendev.org/c/opendev/zuul-providers/+/954301 | 22:20 |
corvus | fungi: ^ thx fixed clarkb ^ | 22:20 |
corvus | that comment above tripped me up; i fixed it too | 22:20 |
corvus | #status log restarted zuul-launcher with locality fix | 22:24 |
opendevstatus | corvus: finished logging | 22:25 |
fungi | thanks! | 22:25 |
fungi | ykarel: ^ should hopefully be fixed now | 22:25 |
clarkb | periodic jobs tonight will be a good exercise | 22:26 |
clarkb | I've approved 954301 | 22:27 |
opendevreview | Merged opendev/zuul-providers master: Add nested-virt labels https://review.opendev.org/c/opendev/zuul-providers/+/954301 | 22:27 |
corvus | yeah -- sorry i forgot to respond to your point directly earlier, but, yes, the rush of jobs put some of our providers over quota; so those temp-failures triggered the flaw | 22:27 |
clarkb | ya, I remembered you'd said before that the periodic job rush is a bit of a stampede that exercises zuul-launcher in weird ways, so I figured it was related | 22:31 |
corvus | the short version of that is: quota calculations are (intentionally) racy now, and stampedes tickle that. | 22:35 |
clarkb | meeting agenda should be in your inboxes now | 22:55 |
opendevreview | Clark Boylan proposed opendev/zuul-providers master: Normalize infra-package-needs packages on current releases https://review.opendev.org/c/opendev/zuul-providers/+/954299 | 23:01 |
clarkb | while I'm burning through test nodes it is nice that we can check all of this pre merge without spinning up dib locally | 23:02 |
corvus | ++ also, we could add image validation in zuul if we get bored | 23:05 |
opendevreview | Clark Boylan proposed opendev/zuul-providers master: Normalize infra-package-needs packages on current releases https://review.opendev.org/c/opendev/zuul-providers/+/954299 | 23:35 |
clarkb | apparently I don't understand bash regexes. Those quotes I added are not appropriate | 23:36 |
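The quoting gotcha here is likely the standard bash one: inside `[[ ... =~ ... ]]`, any quoted part of the right-hand side is matched as a literal string, so quoting the whole pattern turns the regex into a fixed-string match. An illustrative example (hypothetical value, not the actual change):

```shell
s="build-123"

# Unquoted: the right-hand side is treated as an ERE regex.
[[ $s =~ [0-9]+ ]] && echo "regex matched"

# Quoted: the pattern becomes the literal string "[0-9]+",
# which does not occur in $s, so this branch does not match.
[[ $s =~ "[0-9]+" ]] || echo "literal pattern did not match"
```

A common workaround when the pattern contains spaces or special characters is to put it in a variable and use `[[ $s =~ $pattern ]]` unquoted.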
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!