corvus | me too :) | 00:00 |
---|---|---|
fungi | ah, yeah i left a similar comment on the change after looking at the post_failure | 00:06 |
fungi | i'll set an autohold and reapprove it | 00:08 |
fungi | 200.225.47.58 is the held node | 00:26 |
corvus | how unfortunate -- that one actually failed the image build due to failing to download cirros for the cache | 00:55 |
corvus | i've set another autohold | 00:56 |
fungi | bah | 01:02 |
noonedeadpunk | hey folks! Can I ask for some reviews on https://review.opendev.org/c/opendev/system-config/+/930294 ? | 11:04 |
noonedeadpunk | as Dalmatian has already been released, it would be nice to have UCA in the mirrors for CI.. | 11:05 |
noonedeadpunk | as otherwise we need to make an exception for noble so it doesn't use the local mirrors | 11:06 |
opendevreview | Merged opendev/system-config master: reprepro: mirror Ubuntu UCA Dalmatian for Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/930294 | 14:19 |
fungi | that ^ just barely missed the last cronjob by a couple of minutes. i'm tailing the log on the mirror-update server and will check https://static.opendev.org/mirror/ubuntu-cloud-archive/dists/ again after the next run in two hours | 14:35 |
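The check fungi describes can also be scripted; below is a minimal sketch, assuming the dists/ index is served as a plain listing and that "noble-updates" is the suite name to watch for (the URL is the one mentioned above).

```python
# Minimal sketch of the mirror check described above: fetch the published
# UCA dists/ index and report whether the new suite has shown up yet.
# The URL comes from the discussion; treating "noble-updates" as the string
# to look for is an assumption.
import urllib.request

MIRROR_DISTS = "https://static.opendev.org/mirror/ubuntu-cloud-archive/dists/"


def suite_published(suite: str = "noble-updates") -> bool:
    """Return True if the suite appears in the mirror's dists/ listing."""
    with urllib.request.urlopen(MIRROR_DISTS, timeout=30) as resp:
        index = resp.read().decode("utf-8", errors="replace")
    return suite in index


if __name__ == "__main__":
    print("published" if suite_published()
          else "not yet; check again after the next reprepro run")
```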
corvus | fungi: huh, the most recent attempt actually spent a long time trying to upload the image; then the job timed out. that's much more like success than the one that failed for an unknown reason after 30s. | 15:04 |
clarkb | it took far too long to check that noble-updates had a dalmatian repo. But it does, so 930294 should be good. Also cool that UCA keeps up so quickly | 15:09 |
corvus | i'm tempted to actually use a cli method to upload the image to object storage... i bet if we do that, we could get streaming output with progress... | 15:09 |
clarkb | corvus: I seem to recall that a lot of tools do the chunked uploads serially | 15:10 |
clarkb | that may explain the timeout | 15:10 |
clarkb | something else to check if we are concerned about runtime | 15:10 |
corvus | clarkb: this is just sdk with the default values like we do with logs... i thought it did the right thing there... | 15:10 |
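For context, an upload through openstacksdk's cloud layer with default parameters looks roughly like the sketch below; the cloud name, container, and file paths are hypothetical placeholders, and whether the build job invokes the sdk exactly this way is an assumption.

```python
# Rough shape of an openstacksdk upload with default parameters, as the
# image/log upload path is described above. Cloud, container, and file
# names are hypothetical, not the job's actual configuration.
import openstack

conn = openstack.connect(cloud="image-upload-target")  # assumed clouds.yaml entry
conn.create_object(
    container="zuul-image-uploads",
    name="debian-bullseye.qcow2",
    filename="/opt/dib/debian-bullseye.qcow2",
    # segment_size and use_slo are left at their defaults here; see the
    # 413 discussion later in the log for why that may matter.
)
```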
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Finish upload job https://review.opendev.org/c/opendev/zuul-jobs/+/931355 | 15:11 |
corvus | i added a 2hr timeout to that for the upload | 15:12 |
corvus | since we have the autohold, i'll also go ahead and re-run the upload from there in screen | 15:13 |
corvus | this node is in ovh, so this is pretty much worst case for us network wise | 15:17 |
corvus | okay, running in root screen on 158.69.70.53 | 15:19 |
fungi | in the past we had a provider (iweb?) whose fw/proxy/wag was terminating established tcp connections after a certain amount of time, and that caused our image uploads to their glance to fail. in this case it might be some similar middlebox terminating long-running swift upload connections? i wonder if it's possible to try rotating connections after a certain number of chunks? | 15:31 |
fungi | oh! you meant ansible timed out the playbook | 15:34 |
corvus | yep zuul/ansible timeout after 30m | 15:34 |
fungi | gonna go grab lunch but should be back in about an hour | 15:37 |
clarkb | I should find breakfast | 15:38 |
clarkb | 926970 is failing again because we have stale centos arm64 images? | 16:50 |
clarkb | I can look at that in a bit. But this is the second time we've hit that. Any thoughts on switching the jobs to non voting? | 16:51 |
fungi | noonedeadpunk: mnasiadka: jrosser: https://static.opendev.org/mirror/ubuntu-cloud-archive/dists/ now shows a noble-updates subtree, so ubuntu 2024.04 lts mirrors for uca should be usable | 17:09 |
noonedeadpunk | sweet, will run recheck now | 17:18 |
noonedeadpunk | this one was needing it: https://review.opendev.org/c/openstack/openstack-ansible-openstack_hosts/+/929631 | 17:33 |
clarkb | ok, back. I feel like I'm somehow busy but getting nothing done at all. Time to look at the mm3 change, then also where are we on updating the meetpad restart process? | 18:14 |
clarkb | fungi: thank you for the followup on the mm3 comments, that change lgtm | 18:17 |
clarkb | for meetpad https://review.opendev.org/c/opendev/system-config/+/930637 is still open I wonder if we should just land it | 18:17 |
clarkb | with the idea being that getting it in now gives us more opportunity to see if it causes problems before the ptg, and in theory avoids anyone needing to manually restart jitsi during the ptg if they do a release that week | 18:18 |
fungi | yes, the sooner the better, i'll approve | 18:21 |
fungi | oh, i forgot it's my change ;) i'll let someone else approve in that case | 18:21 |
clarkb | done | 18:23 |
fungi | though that also brings up a question... if we're auto-upgrading and restarting services at random, should we put things like meetpad and etherpad servers into the emergency disable list during the ptg? | 18:23 |
clarkb | fungi: thoughts on making the openafs arm rpm package build jobs non-voting to land the ozj linter update? | 18:23 |
clarkb | fungi: it hasn't been an issue yet, but it is something to consider | 18:23 |
fungi | jitsi could upload new container images in the middle of an openstack nova session, for example | 18:23 |
clarkb | fungi: for etherpad we don't auto update etherpad itself but we do allow mariadb to update | 18:23 |
clarkb | fungi: we'll only apply them during our daily ansible runs though which should be a quiet time for the ptg | 18:24 |
clarkb | it's $delaytime after 0200 | 18:24 |
clarkb | but yes if we want to be extra cautious we can do that | 18:24 |
fungi | ah, good point wrt being dependent on the daily periodic timer trigger, that's less worrisome then | 18:25 |
clarkb | my main concern with meetpad is that the daily run will occur then jitsi will be non functional for the start of the european ptg timeslot and we may not all be awake/around then | 18:25 |
fungi | looks like the ptg is on break every day between 00-04 utc | 18:25 |
clarkb | landing your change in theory will address that | 18:25 |
clarkb | jitsi itself could still break us though with bad images or similar issues | 18:26 |
fungi | so yes, an upgrade and restart at 02 utc shouldn't impact any scheduled sessions | 18:26 |
clarkb | so may still be worth doing the emergency file stopgap | 18:26 |
fungi | clarkb: what was the afs arm64 package build failure detail? | 18:33 |
clarkb | fungi: I think the issue is that we're not reliably building centos arm images (and maybe not the other images either, I haven't fully checked) and that results in our images having stale running kernels. Then when the rpm package build runs it hits an incompatibility between the current latest kernel headers and the running kernel and fails | 18:34 |
clarkb | previously we've always fixed this by fixing our image builds, waiting a day, then rerunning jobs and it works until image builds fail again | 18:34 |
clarkb | we could also potentially update those jobs to do a system update, reboot, then package build to mitigate | 18:34 |
clarkb | but they are being triggered by the linter update because we're making modifications to those playbooks | 18:35 |
fungi | oh, not afs-specific | 18:35 |
clarkb | afs specific in that it's the only thing we build packages for that depends on kernel headers aligning. But not afs behavior being an issue | 18:35 |
clarkb | it's the more generic "make a kernel module that links against the kernel" problem | 18:35 |
clarkb | I'm about to hop onto the builder and start seeing what is unhappy there | 18:36 |
fungi | nodepool dib-image-list says we built centos-9-stream in the past day but centos-9-stream-arm64 image builds last succeeded 20 days ago | 18:36 |
fungi | i don't mind bypassing the job to land a change, but we should also either fix or drop the centos-9-stream-arm64 images if they're not building | 18:37 |
clarkb | only 5.9gb available in /opt on nb04 so we're still leaking there | 18:37 |
fungi | where "we" can be anyone who cares about continuing to use those images | 18:37 |
clarkb | well this is a problem for all arm images | 18:38 |
clarkb | we just notice in those jobs | 18:38 |
clarkb | anyway I'm going to do the typical thing: stop the builder, start an rm of the dib_tmp content in screen, and check whether losetup has any room left, which would otherwise need a reboot cleanup | 18:38 |
clarkb | a lot of this will in theory be mitigated by nodepool in zuul since the builders will be throwaway so probably not worth investing too much in diagnosing a root cause at the moment | 18:39 |
clarkb | in theory if I do all this we can land the ozj change on monday without any changes to ozj jobs | 18:40 |
clarkb | losetup -a shows 9 devices which is leaky but I think the limit is higher than that | 18:40 |
clarkb | ya limit should be 16 in the kernel | 18:40 |
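A tiny sketch of the loop-device check being discussed here, counting entries from `losetup -a` against the limit of 16 mentioned above (the limit is taken as an assumption from the conversation, not verified):

```python
# Sketch of the loop-device sanity check discussed above: count devices
# reported by `losetup -a` and compare against the assumed kernel limit.
import subprocess

LOOP_LIMIT = 16  # assumption taken from the conversation, not verified


def loop_devices_in_use() -> int:
    out = subprocess.run(
        ["losetup", "-a"], capture_output=True, text=True, check=True
    ).stdout
    return sum(1 for line in out.splitlines() if line.strip())


if __name__ == "__main__":
    used = loop_devices_in_use()
    warning = " - reboot cleanup may be needed" if used >= LOOP_LIMIT else ""
    print(f"{used}/{LOOP_LIMIT} loop devices in use{warning}")
```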
fungi | yeah, a quick look indicates that the most recent time we built any arm64 image successfully was 16 days ago (ubuntu-jammy-arm64) | 18:41 |
fungi | so i agree it's not just centos | 18:41 |
clarkb | cleanup of /opt/dib_tmp is in progress. Ideally we do a reboot to clear out those losetup devices after this is done too but it may not complete until tomorrow | 18:42 |
clarkb | we'll see how I'm feeling in the morning and maybe I'll remember to do that | 18:42 |
clarkb | it's running under screen (though my user owns the screen, not root) | 18:43 |
fungi | connected and watching too. maybe i'll be around to reboot it when it finishes, and if so i'll do that | 18:44 |
clarkb | and then maybe monday is also a good day to flip openstack's zuul tenant (and the rest of the tenants) to ansible 9 by default? | 18:44 |
clarkb | fungi: cool thanks | 18:44 |
fungi | i think so, yes | 18:45 |
clarkb | I'm expecting my ability to push things like that along will degrade as we get deeper into next week due to family stuff so ideally we can rip some bandaids off early and get it done with | 18:45 |
fungi | openstack cycle-trailing deliverables are still working on preparing their 2024.2/dalmatian releases, but there's no clear timeline (supposed to be over the next three weeks-ish), so i wouldn't hold up for fear of breaking those | 18:46 |
clarkb | and we've got decentish coverage between opendev and zuul tenants as well as my spot check of devstack+tempest | 18:46 |
clarkb | it's not like we're going in blind but there may still be some corner cases | 18:46 |
opendevreview | Merged opendev/system-config master: Explicitly down Jitsi-Meet containers on upgrade https://review.opendev.org/c/opendev/system-config/+/930637 | 18:52 |
fungi | https://zuul.opendev.org/t/openstack/build/5148da9596ef47b48ce8c811624306aa (correctly) did not restart the containers because there was no image update from the pull | 19:01 |
clarkb | excellent, that confirms half of what we want to see now we just need them to make a release | 19:02 |
fungi | TASK [jitsi-meet : Run docker-compose down] skipping: [meetpad02.opendev.org] => { "changed": false, "false_condition": "'downloaded newer image' in docker_compose_pull.stderr", "skip_reason": "Conditional result was False"} | 19:02 |
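The condition in that task output amounts to: only bounce the containers when the pull actually fetched newer images. A hedged Python rendering of the same idea follows; it is not the actual role code, and the compose directory is a hypothetical placeholder.

```python
# Sketch of the conditional-restart logic visible in the task output above:
# only take the containers down and back up when `docker-compose pull`
# reports that it downloaded a newer image.
import subprocess

COMPOSE_DIR = "/etc/jitsi-meet-compose"  # hypothetical path, not the real layout


def pull_and_maybe_restart() -> None:
    pull = subprocess.run(
        ["docker-compose", "pull"],
        cwd=COMPOSE_DIR, capture_output=True, text=True, check=True,
    )
    # Mirrors the Ansible condition: 'downloaded newer image' in the pull stderr.
    if "downloaded newer image" in pull.stderr.lower():
        subprocess.run(["docker-compose", "down"], cwd=COMPOSE_DIR, check=True)
        subprocess.run(["docker-compose", "up", "-d"], cwd=COMPOSE_DIR, check=True)


if __name__ == "__main__":
    pull_and_maybe_restart()
```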
opendevreview | Merged zuul/zuul-jobs master: Only update qemu-static container settings on x86_64 https://review.opendev.org/c/zuul/zuul-jobs/+/930939 | 22:31 |
opendevreview | Merged zuul/zuul-jobs master: Print instance type in emit-job-header role https://review.opendev.org/c/zuul/zuul-jobs/+/925754 | 22:45 |
opendevreview | Merged zuul/zuul-jobs master: Add other nodes to buildx builder https://review.opendev.org/c/zuul/zuul-jobs/+/930927 | 23:00 |
corvus | clarkb: fungi hrm, i don't have a timestamp so i don't know how long it took, but my test on the held node failed with: HttpException: 413: Client Error for url: https://swift.api.sjc3.rackspacecloud.com/v1/AUTH_f063ac0bb70c486db47bcf2105eebcbd/images-a3d39eaeea5f/f01a3fbaad534977956da95dc6d99c5f-debian-bullseye.qcow2, Request Entity Too Large | 23:52 |
corvus | so we probably do need to check and see if we should be calling into sdk a different way | 23:53 |
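If the 413 really is the endpoint rejecting a single monolithic PUT, one different way to call into the sdk is to pass an explicit segment_size so the object goes up as a segmented (SLO) upload. This is a sketch under that assumption, with hypothetical names, not a confirmed fix.

```python
# Sketch of "calling into sdk a different way": request a segmented (SLO)
# upload with an explicit segment size so that no single PUT exceeds the
# endpoint's per-object limit. Cloud/container names and the 1 GiB segment
# size are assumptions for illustration.
import openstack

conn = openstack.connect(cloud="image-upload-target")  # assumed clouds.yaml entry
conn.create_object(
    container="zuul-image-uploads",
    name="debian-bullseye.qcow2",
    filename="/opt/dib/debian-bullseye.qcow2",
    segment_size=1024 * 1024 * 1024,  # upload in 1 GiB segments
    use_slo=True,
)
```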