Friday, 2024-10-04

00:00 <corvus> me too :)
00:06 <fungi> ah, yeah i left a similar comment on the change after looking at the post_failure
00:08 <fungi> i'll set an autohold and reapprove it
00:26 <fungi> 200.225.47.58 is the held node
00:55 <corvus> how unfortunate -- that one actually failed the image build due to failing to download cirros for the cache
00:56 <corvus> i've set another autohold
01:02 <fungi> bah
11:04 <noonedeadpunk> hey folks! Can I ask for some reviews on https://review.opendev.org/c/opendev/system-config/+/930294 ?
11:05 <noonedeadpunk> Dalmatian has already been released, so it would be nice to have UCA in the mirrors for CI..
11:06 <noonedeadpunk> as otherwise we need to make an exception for noble not to use local mirrors
14:19 <opendevreview> Merged opendev/system-config master: reprepro: mirror Ubuntu UCA Dalmatian for Ubuntu Noble  https://review.opendev.org/c/opendev/system-config/+/930294
14:35 <fungi> that ^ just barely missed the last cronjob by a couple of minutes. i'm tailing the log on the mirror-update server and will check https://static.opendev.org/mirror/ubuntu-cloud-archive/dists/ again after the next run in two hours
15:04 <corvus> fungi: huh, the most recent attempt actually spent a long time trying to upload the image; then the job timed out.  that's much more like success than the one that failed for an unknown reason after 30s.
15:09 <clarkb> it took far too long to check that noble-updates had a dalmatian repo. But it does, so 930294 should be good. Also cool that UCA keeps up so quickly
15:09 <corvus> i'm tempted to actually use a cli method to upload the image to object storage... i bet if we do that, we could get streaming output with progress...
15:10 <clarkb> corvus: I seem to recall that a lot of tools do the chunked uploads serially
15:10 <clarkb> that may explain the timeout
15:10 <clarkb> something else to check if we are concerned about runtime
15:10 <corvus> clarkb: this is just sdk with the default values like we do with logs... i thought it did the right thing there...
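For reference, a minimal sketch of what that sdk-with-defaults style of upload looks like; the cloud name, container, and file path here are hypothetical, not taken from the actual job:

    # Minimal sketch, not the actual job code: upload an image to swift via
    # openstacksdk's cloud layer with default settings. When segment_size is
    # omitted, the sdk's default segmentation behavior decides whether and
    # how the object gets split into chunks.
    import openstack

    conn = openstack.connect(cloud='hypothetical-cloud')

    conn.create_object(
        container='images-example',
        name='debian-bullseye.qcow2',
        filename='/tmp/debian-bullseye.qcow2',
    )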
15:11 <opendevreview> James E. Blair proposed opendev/zuul-jobs master: Finish upload job  https://review.opendev.org/c/zuul/zuul-jobs/+/931355
15:12 <corvus> i added a 2hr timeout to that for the upload
15:13 <corvus> since we have the autohold, i'll also go ahead and re-run the upload from there in screen
15:17 <corvus> this node is in ovh, so this is pretty much worst case for us network wise
15:19 <corvus> okay, running in root screen on 158.69.70.53
15:31 <fungi> in the past we had a provider (iweb?) whose fw/proxy/wag was terminating established tcp connections after a certain amount of time, and that caused our image uploads to their glance to fail. in this case it might be some similar middlebox terminating long-running swift upload connections? i wonder if it's possible to try rotating connections after a certain number of chunks?
15:34 <fungi> oh! you meant ansible timed out the playbook
15:34 <corvus> yep zuul/ansible timeout after 30m
15:37 <fungi> gonna go grab lunch but should be back in about an hour
15:38 <clarkb> I should find breakfast
16:50 <clarkb> 926970 is failing again because we have stale centos arm64 images?
16:51 <clarkb> I can look at that in a bit. But this is the second time we've hit that. Any thoughts on switching the jobs to non voting?
17:09 <fungi> noonedeadpunk: mnasiadka: jrosser: https://static.opendev.org/mirror/ubuntu-cloud-archive/dists/ now shows a noble-updates subtree, so ubuntu 24.04 lts mirrors for uca should be usable
17:18 <noonedeadpunk> sweet, will run recheck now
17:33 <noonedeadpunk> this one needed it: https://review.opendev.org/c/openstack/openstack-ansible-openstack_hosts/+/929631
18:14 <clarkb> ok back, I feel like I'm somehow busy but getting nothing done at all. Time to look at the mm3 change, then also where are we on updating the meetpad restart process?
18:17 <clarkb> fungi: thank you for the followup on the mm3 comments, that change lgtm
18:17 <clarkb> for meetpad https://review.opendev.org/c/opendev/system-config/+/930637 is still open, I wonder if we should just land it
18:18 <clarkb> with the idea being that getting it in now gives us more opportunity to see if it causes problems prior to the ptg and in theory avoids anyone needing to manually restart jitsi during the ptg if they do a release during that week
18:21 <fungi> yes, the sooner the better, i'll approve
18:21 <fungi> oh, i forgot it's my change ;) i'll let someone else approve in that case
18:23 <clarkb> done
18:23 <fungi> though that also brings up a question... if we're auto-upgrading and restarting services at random, should we put things like the meetpad and etherpad servers into the emergency disable list during the ptg?
18:23 <clarkb> fungi: thoughts on making the openafs arm rpm package build jobs non voting to land the ozj linter update?
18:23 <clarkb> fungi: it hasn't been an issue yet, but it is something to consider
18:23 <fungi> jitsi could upload new container images in the middle of an openstack nova session, for example
18:23 <clarkb> fungi: for etherpad we don't auto update etherpad itself but we do allow mariadb to update
18:24 <clarkb> fungi: we'll only apply them during our daily ansible runs though, which should be a quiet time for the ptg
18:24 <clarkb> it's $delaytime after 0200
18:24 <clarkb> but yes if we want to be extra cautious we can do that
18:25 <fungi> ah, good point wrt being dependent on the daily periodic timer trigger, that's less worrisome then
18:25 <clarkb> my main concern with meetpad is that the daily run will occur, then jitsi will be non functional for the start of the european ptg timeslot and we may not all be awake/around then
18:25 <fungi> looks like the ptg is on break every day between 00-04 utc
18:25 <clarkb> landing your change in theory will address that
18:26 <clarkb> jitsi itself could still break us though with bad images or similar issues
18:26 <fungi> so yes, an upgrade and restart at 02 utc shouldn't impact any scheduled sessions
18:26 <clarkb> so it may still be worth doing the emergency file stopgap
18:33 <fungi> clarkb: what was the afs arm64 package build failure detail?
18:34 <clarkb> fungi: I think the issue is that we're not reliably building centos arm images (and maybe not the other images either, I haven't fully checked) and that results in our images having stale running kernels. Then when the rpm package build runs it hits an incompatibility between the current latest kernel headers and the running kernel and fails
18:34 <clarkb> previously we've always fixed this by fixing our image builds, waiting a day, then rerunning jobs, and it works until image builds fail again
18:34 <clarkb> we could also potentially update those jobs to do a system update, reboot, then package build to mitigate
18:35 <clarkb> but they are being triggered by the linter update because we're making modifications to those playbooks
18:35 <fungi> oh, not afs-specific
18:35 <clarkb> afs specific in that it's the only thing we build packages for that depends on kernel headers aligning. But not afs behavior being an issue
18:35 <clarkb> it's a more generic "build a kernel module that links against the running kernel" problem
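As a rough illustration of the mismatch described above, a hypothetical check like this (not part of any existing job) would flag a node whose running kernel has no matching headers installed:

    # Hypothetical diagnostic, not part of any existing job: compare the
    # running kernel with the kernel versions that devel/header packages
    # are installed for. A mismatch is what makes out-of-tree module
    # builds like openafs fail on stale images.
    import glob
    import os

    running = os.uname().release
    # on rpm-based distros, kernel-devel installs one tree per version here
    available = [os.path.basename(p) for p in glob.glob('/usr/src/kernels/*')]

    if running not in available:
        print(f"running kernel {running} has no matching headers; found: {available or 'none'}")
    else:
        print(f"headers for running kernel {running} are present")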
18:36 <clarkb> I'm about to hop onto the builder and start seeing what is unhappy there
18:36 <fungi> nodepool dib-image-list says we built centos-9-stream in the past day but centos-9-stream-arm64 image builds last succeeded 20 days ago
18:37 <fungi> i don't mind bypassing the job to land a change, but we should also either fix or drop the centos-9-stream-arm64 images if they're not building
18:37 <clarkb> only 5.9gb available in /opt on nb04 so we're still leaking there
18:37 <fungi> where "we" can be anyone who cares about continuing to use those images
18:38 <clarkb> well this is a problem for all arm images
18:38 <clarkb> we just notice it in those jobs
18:38 <clarkb> anyway I'm going to do the typical stop builder, start an rm of dib_tmp content in screen, and check if losetup has any free devices or whether we need a reboot cleanup process
18:39 <clarkb> a lot of this will in theory be mitigated by nodepool in zuul since the builders will be throwaway, so probably not worth investing too much in diagnosing a root cause at the moment
18:40 <clarkb> in theory if I do all this we can land the ozj change on monday without any changes to ozj jobs
18:40 <clarkb> losetup -a shows 9 devices, which is leaky, but I think the limit is higher than that
18:40 <clarkb> ya, the limit should be 16 in the kernel
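A quick hypothetical way to script the check done by hand above; the 16-device figure comes from the conversation and isn't verified here:

    # Hypothetical helper: count loop devices currently attached according
    # to `losetup -a` and compare against the limit mentioned above
    # (assumed, not verified; it depends on kernel configuration).
    import subprocess

    ASSUMED_LIMIT = 16

    out = subprocess.run(['losetup', '-a'], capture_output=True, text=True, check=True)
    attached = [line for line in out.stdout.splitlines() if line.strip()]

    print(f"{len(attached)} loop devices attached (assumed limit {ASSUMED_LIMIT})")
    if len(attached) >= ASSUMED_LIMIT:
        print("likely out of loop devices; a reboot/cleanup may be needed")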
18:41 <fungi> yeah, a quick look indicates that the most recent successful arm64 image build of any kind was 16 days ago (ubuntu-jammy-arm64)
18:41 <fungi> so i agree it's not just centos
18:42 <clarkb> cleanup of /opt/dib_tmp is in progress. Ideally we do a reboot to clear out those losetup devices after this is done too, but it may not complete until tomorrow
18:42 <clarkb> we'll see how I'm feeling in the morning and maybe I'll remember to do that
18:43 <clarkb> it's running under screen (though my user owns the screen, not root)
18:44 <fungi> connected and watching too. maybe i'll be around to reboot it when it finishes, and if so i'll do that
18:44 <clarkb> and then maybe monday is also a good day to flip openstack's zuul tenant (and the rest of the tenants) to ansible 9 by default?
18:44 <clarkb> fungi: cool thanks
18:45 <fungi> i think so, yes
18:45 <clarkb> I'm expecting my ability to push things like that along will degrade as we get deeper into next week due to family stuff, so ideally we can rip some bandaids off early and get it done with
18:46 <fungi> openstack cycle-trailing deliverables are still working on preparing their 2024.2/dalmatian releases, but there's no clear timeline (supposed to be over the next three weeks-ish), so i wouldn't hold up for fear of breaking those
18:46 <clarkb> and we've got decentish coverage between the opendev and zuul tenants as well as my spot check of devstack+tempest
18:46 <clarkb> it's not like we're going in blind but there may still be some corner cases
18:52 <opendevreview> Merged opendev/system-config master: Explicitly down Jitsi-Meet containers on upgrade  https://review.opendev.org/c/opendev/system-config/+/930637
19:01 <fungi> https://zuul.opendev.org/t/openstack/build/5148da9596ef47b48ce8c811624306aa (correctly) did not restart the containers because there was no image update from the pull
19:02 <clarkb> excellent, that confirms half of what we want to see; now we just need them to make a release
19:02 <fungi> TASK [jitsi-meet : Run docker-compose down] skipping: [meetpad02.opendev.org] => { "changed": false, "false_condition": "'downloaded newer image' in docker_compose_pull.stderr", "skip_reason": "Conditional result was False"}
22:31 <opendevreview> Merged zuul/zuul-jobs master: Only update qemu-static container settings on x86_64  https://review.opendev.org/c/zuul/zuul-jobs/+/930939
22:45 <opendevreview> Merged zuul/zuul-jobs master: Print instance type in emit-job-header role  https://review.opendev.org/c/zuul/zuul-jobs/+/925754
23:00 <opendevreview> Merged zuul/zuul-jobs master: Add other nodes to buildx builder  https://review.opendev.org/c/zuul/zuul-jobs/+/930927
23:52 <corvus> clarkb: fungi hrm, i don't have a timestamp so i don't know how long it took, but my test on the held node failed with:  HttpException: 413: Client Error for url: https://swift.api.sjc3.rackspacecloud.com/v1/AUTH_f063ac0bb70c486db47bcf2105eebcbd/images-a3d39eaeea5f/f01a3fbaad534977956da95dc6d99c5f-debian-bullseye.qcow2, Request Entity Too Large
23:53 <corvus> so we probably do need to check and see if we should be calling into sdk a different way
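One possibility worth checking, since a 413 like that can mean a single PUT exceeded the provider's maximum object size: ask the sdk to segment the upload explicitly, roughly along these lines (values and names are illustrative, not a confirmed fix):

    # Illustrative sketch only, not a confirmed fix: explicitly segment the
    # upload so each PUT stays under the provider's per-object size limit,
    # assembling the pieces with a static large object manifest.
    import openstack

    conn = openstack.connect(cloud='hypothetical-cloud')

    conn.create_object(
        container='images-example',
        name='debian-bullseye.qcow2',
        filename='/tmp/debian-bullseye.qcow2',
        segment_size=1024 * 1024 * 1024,  # 1GiB segments, illustrative value
        use_slo=True,
    )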
