Friday, 2024-10-04

00:00 <corvus> me too :)
00:06 <fungi> ah, yeah i left a similar comment on the change after looking at the post_failure
00:08 <fungi> i'll set an autohold and reapprove it
00:26 <fungi> 200.225.47.58 is the held node
00:55 <corvus> how unfortunate -- that one actually failed the image build due to failing to download cirros for the cache
00:56 <corvus> i've set another autohold
01:02 <fungi> bah
11:04 <noonedeadpunk> hey folks! Can I ask for some reviews on https://review.opendev.org/c/opendev/system-config/+/930294 ?
11:05 <noonedeadpunk> Dalmatian has already been released, so it would be nice to have UCA in the mirrors for CI..
11:06 <noonedeadpunk> as otherwise we need to make an exception for noble not to use local mirrors
14:19 <opendevreview> Merged opendev/system-config master: reprepro: mirror Ubuntu UCA Dalmatian for Ubuntu Noble  https://review.opendev.org/c/opendev/system-config/+/930294
14:35 <fungi> that ^ just barely missed the last cronjob by a couple of minutes. i'm tailing the log on the mirror-update server and will check https://static.opendev.org/mirror/ubuntu-cloud-archive/dists/ again after the next run in two hours
15:04 <corvus> fungi: huh, the most recent attempt actually spent a long time trying to upload the image; then the job timed out.  that's much more like success than the one that failed for an unknown reason after 30s.
15:09 <clarkb> it took far too long to check that noble-updates had a dalmatian repo. But it does, so 930294 should be good. Also cool that UCA keeps up so quickly
15:09 <corvus> i'm tempted to actually use a cli method to upload the image to object storage... i bet if we do that, we could get streaming output with progress...
15:10 <clarkb> corvus: I seem to recall that a lot of tools do the chunked uploads serially
15:10 <clarkb> that may explain the timeout
15:10 <clarkb> something else to check if we are concerned about runtime
15:10 <corvus> clarkb: this is just sdk with the default values like we do with logs... i thought it did the right thing there...
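For reference, a minimal sketch of what that sdk-with-defaults style of upload looks like; the cloud name, container, and file path here are hypothetical, not taken from the actual job:

    # Minimal sketch, not the actual job code: upload an image to swift via
    # openstacksdk's cloud layer with default settings. When segment_size is
    # omitted, the sdk's default segmentation behavior decides whether and
    # how the object gets split into chunks.
    import openstack

    conn = openstack.connect(cloud='hypothetical-cloud')

    conn.create_object(
        container='images-example',
        name='debian-bullseye.qcow2',
        filename='/tmp/debian-bullseye.qcow2',
    )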
15:11 <opendevreview> James E. Blair proposed opendev/zuul-jobs master: Finish upload job  https://review.opendev.org/c/zuul/zuul-jobs/+/931355
15:12 <corvus> i added a 2hr timeout to that for the upload
15:13 <corvus> since we have the autohold, i'll also go ahead and re-run the upload from there in screen
15:17 <corvus> this node is in ovh, so this is pretty much worst case for us network wise
15:19 <corvus> okay, running in root screen on 158.69.70.53
15:31 <fungi> in the past we had a provider (iweb?) whose fw/proxy/wag was terminating established tcp connections after a certain amount of time, and that caused our image uploads to their glance to fail. in this case it might be some similar middlebox terminating long-running swift upload connections? i wonder if it's possible to try rotating connections after a certain number of chunks?
15:34 <fungi> oh! you meant ansible timed out the playbook
15:34 <corvus> yep zuul/ansible timeout after 30m
15:37 <fungi> gonna go grab lunch but should be back in about an hour
15:38 <clarkb> I should find breakfast
16:50 <clarkb> 926970 is failing again because we have stale centos arm64 images?
16:51 <clarkb> I can look at that in a bit. But this is the second time we've hit that. Any thoughts on switching the jobs to non voting?
17:09 <fungi> noonedeadpunk: mnasiadka: jrosser: https://static.opendev.org/mirror/ubuntu-cloud-archive/dists/ now shows a noble-updates subtree, so ubuntu 24.04 lts mirrors for uca should be usable
17:18 <noonedeadpunk> sweet, will run recheck now
17:33 <noonedeadpunk> this one needed it: https://review.opendev.org/c/openstack/openstack-ansible-openstack_hosts/+/929631
18:14 <clarkb> ok back, I feel like I'm somehow busy but getting nothing done at all. Time to look at the mm3 change, then also where are we on updating the meetpad restart process?
18:17 <clarkb> fungi: thank you for the followup on the mm3 comments, that change lgtm
18:17 <clarkb> for meetpad https://review.opendev.org/c/opendev/system-config/+/930637 is still open, I wonder if we should just land it
18:18 <clarkb> with the idea being that getting it in now gives us more opportunity to see if it causes problems prior to the ptg and in theory avoids anyone needing to manually restart jitsi during the ptg if they do a release during that week
18:21 <fungi> yes, the sooner the better, i'll approve
18:21 <fungi> oh, i forgot it's my change ;) i'll let someone else approve in that case
18:23 <clarkb> done
18:23 <fungi> though that also brings up a question... if we're auto-upgrading and restarting services at random, should we put things like the meetpad and etherpad servers into the emergency disable list during the ptg?
18:23 <clarkb> fungi: thoughts on making the openafs arm rpm package build jobs non voting to land the ozj linter update?
18:23 <clarkb> fungi: it hasn't been an issue yet, but it is something to consider
18:23 <fungi> jitsi could upload new container images in the middle of an openstack nova session, for example
18:23 <clarkb> fungi: for etherpad we don't auto update etherpad itself but we do allow mariadb to update
18:24 <clarkb> fungi: we'll only apply them during our daily ansible runs though, which should be a quiet time for the ptg
18:24 <clarkb> it's $delaytime after 0200
18:24 <clarkb> but yes if we want to be extra cautious we can do that
18:25 <fungi> ah, good point wrt being dependent on the daily periodic timer trigger, that's less worrisome then
18:25 <clarkb> my main concern with meetpad is that the daily run will occur, then jitsi will be non functional for the start of the european ptg timeslot and we may not all be awake/around then
18:25 <fungi> looks like the ptg is on break every day between 00-04 utc
18:25 <clarkb> landing your change in theory will address that
18:26 <clarkb> jitsi itself could still break us though with bad images or similar issues
18:26 <fungi> so yes, an upgrade and restart at 02 utc shouldn't impact any scheduled sessions
18:26 <clarkb> so it may still be worth doing the emergency file stopgap
18:33 <fungi> clarkb: what was the afs arm64 package build failure detail?
18:34 <clarkb> fungi: I think the issue is that we're not reliably building centos arm images (and maybe not the other images either, I haven't fully checked) and that results in our images having stale running kernels. Then when the rpm package build runs it hits an incompatibility between the current latest kernel headers and the running kernel and fails
18:34 <clarkb> previously we've always fixed this by fixing our image builds, waiting a day, then rerunning jobs, and it works until image builds fail again
18:34 <clarkb> we could also potentially update those jobs to do a system update, reboot, then package build to mitigate
18:35 <clarkb> but they are being triggered by the linter update because we're making modifications to those playbooks
18:35 <fungi> oh, not afs-specific
18:35 <clarkb> afs specific in that it's the only thing we build packages for that depends on kernel headers aligning. But not afs behavior being an issue
18:35 <clarkb> it's a more generic "build a kernel module that links against the running kernel" problem
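As a rough illustration of the mismatch described above, a hypothetical check like this (not part of any existing job) would flag a node whose running kernel has no matching headers installed:

    # Hypothetical diagnostic, not part of any existing job: compare the
    # running kernel with the kernel versions that devel/header packages
    # are installed for. A mismatch is what makes out-of-tree module
    # builds like openafs fail on stale images.
    import glob
    import os

    running = os.uname().release
    # on rpm-based distros, kernel-devel installs one tree per version here
    available = [os.path.basename(p) for p in glob.glob('/usr/src/kernels/*')]

    if running not in available:
        print(f"running kernel {running} has no matching headers; found: {available or 'none'}")
    else:
        print(f"headers for running kernel {running} are present")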
18:36 <clarkb> I'm about to hop onto the builder and start seeing what is unhappy there
18:36 <fungi> nodepool dib-image-list says we built centos-9-stream in the past day but centos-9-stream-arm64 image builds last succeeded 20 days ago
18:37 <fungi> i don't mind bypassing the job to land a change, but we should also either fix or drop the centos-9-stream-arm64 images if they're not building
18:37 <clarkb> only 5.9gb available in /opt on nb04 so we're still leaking there
18:37 <fungi> where "we" can be anyone who cares about continuing to use those images
18:38 <clarkb> well this is a problem for all arm images
18:38 <clarkb> we just notice it in those jobs
18:38 <clarkb> anyway I'm going to do the typical stop builder, start an rm of dib_tmp content in screen, and check if losetup has any free devices or whether we need a reboot cleanup process
18:39 <clarkb> a lot of this will in theory be mitigated by nodepool in zuul since the builders will be throwaway, so probably not worth investing too much in diagnosing a root cause at the moment
18:40 <clarkb> in theory if I do all this we can land the ozj change on monday without any changes to ozj jobs
18:40 <clarkb> losetup -a shows 9 devices, which is leaky, but I think the limit is higher than that
18:40 <clarkb> ya, the limit should be 16 in the kernel
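A quick hypothetical way to script the check done by hand above; the 16-device figure comes from the conversation and isn't verified here:

    # Hypothetical helper: count loop devices currently attached according
    # to `losetup -a` and compare against the limit mentioned above
    # (assumed, not verified; it depends on kernel configuration).
    import subprocess

    ASSUMED_LIMIT = 16

    out = subprocess.run(['losetup', '-a'], capture_output=True, text=True, check=True)
    attached = [line for line in out.stdout.splitlines() if line.strip()]

    print(f"{len(attached)} loop devices attached (assumed limit {ASSUMED_LIMIT})")
    if len(attached) >= ASSUMED_LIMIT:
        print("likely out of loop devices; a reboot/cleanup may be needed")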
18:41 <fungi> yeah, a quick look indicates that the most recent successful arm64 image build of any kind was 16 days ago (ubuntu-jammy-arm64)
18:41 <fungi> so i agree it's not just centos
18:42 <clarkb> cleanup of /opt/dib_tmp is in progress. Ideally we do a reboot to clear out those losetup devices after this is done too, but it may not complete until tomorrow
18:42 <clarkb> we'll see how I'm feeling in the morning and maybe I'll remember to do that
18:43 <clarkb> it's running under screen (though my user owns the screen, not root)
18:44 <fungi> connected and watching too. maybe i'll be around to reboot it when it finishes, and if so i'll do that
18:44 <clarkb> and then maybe monday is also a good day to flip openstack's zuul tenant (and the rest of the tenants) to ansible 9 by default?
18:44 <clarkb> fungi: cool thanks
18:45 <fungi> i think so, yes
18:45 <clarkb> I'm expecting my ability to push things like that along will degrade as we get deeper into next week due to family stuff, so ideally we can rip some bandaids off early and get it done with
18:46 <fungi> openstack cycle-trailing deliverables are still working on preparing their 2024.2/dalmatian releases, but there's no clear timeline (supposed to be over the next three weeks-ish), so i wouldn't hold up for fear of breaking those
18:46 <clarkb> and we've got decentish coverage between the opendev and zuul tenants as well as my spot check of devstack+tempest
18:46 <clarkb> it's not like we're going in blind but there may still be some corner cases
18:52 <opendevreview> Merged opendev/system-config master: Explicitly down Jitsi-Meet containers on upgrade  https://review.opendev.org/c/opendev/system-config/+/930637
19:01 <fungi> https://zuul.opendev.org/t/openstack/build/5148da9596ef47b48ce8c811624306aa (correctly) did not restart the containers because there was no image update from the pull
19:02 <clarkb> excellent, that confirms half of what we want to see; now we just need them to make a release
19:02 <fungi> TASK [jitsi-meet : Run docker-compose down] skipping: [meetpad02.opendev.org] => { "changed": false, "false_condition": "'downloaded newer image' in docker_compose_pull.stderr", "skip_reason": "Conditional result was False"}
22:31 <opendevreview> Merged zuul/zuul-jobs master: Only update qemu-static container settings on x86_64  https://review.opendev.org/c/zuul/zuul-jobs/+/930939
22:45 <opendevreview> Merged zuul/zuul-jobs master: Print instance type in emit-job-header role  https://review.opendev.org/c/zuul/zuul-jobs/+/925754
23:00 <opendevreview> Merged zuul/zuul-jobs master: Add other nodes to buildx builder  https://review.opendev.org/c/zuul/zuul-jobs/+/930927
23:52 <corvus> clarkb: fungi hrm, i don't have a timestamp so i don't know how long it took, but my test on the held node failed with:  HttpException: 413: Client Error for url: https://swift.api.sjc3.rackspacecloud.com/v1/AUTH_f063ac0bb70c486db47bcf2105eebcbd/images-a3d39eaeea5f/f01a3fbaad534977956da95dc6d99c5f-debian-bullseye.qcow2, Request Entity Too Large
23:53 <corvus> so we probably do need to check and see if we should be calling into sdk a different way
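One possibility worth checking, since a 413 like that can mean a single PUT exceeded the provider's maximum object size: ask the sdk to segment the upload explicitly, roughly along these lines (values and names are illustrative, not a confirmed fix):

    # Illustrative sketch only, not a confirmed fix: explicitly segment the
    # upload so each PUT stays under the provider's per-object size limit,
    # assembling the pieces with a static large object manifest.
    import openstack

    conn = openstack.connect(cloud='hypothetical-cloud')

    conn.create_object(
        container='images-example',
        name='debian-bullseye.qcow2',
        filename='/tmp/debian-bullseye.qcow2',
        segment_size=1024 * 1024 * 1024,  # 1GiB segments, illustrative value
        use_slo=True,
    )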
