Tuesday, 2024-06-18

<fungi> tonyb: still waiting on the rest of the images to upload  [00:13]
<fungi> from logs it looked like upload errors were related to nodepool image cleanup, because we want to upload raw files but the nodepool builders don't keep those around so couldn't upload any of the already-built ones. and some of our images are only rebuilt weekly now i think?  [00:14]
<tonyb> fungi: Ahh okay.  So I guess we wait for a bit.  [00:15]
<fungi> also builder logs are full of upload errors for gentoo and centos-8-stream because we're not rebuilding those at all so no new raw files are appearing  [00:18]
<fungi> i suppose nodepool should refrain from trying to upload image files it has itself deleted  [00:19]
<fungi> instead of raising BuilderInvalidCommandError over and over  [00:20]
<tonyb> fungi: So once OpenMetal is added we should remove the gentoo and centos-8 images?  I'm guessing doing it now will cause more confusion  [00:20]
<fungi> well, new image builds for both are paused but the images already uploaded in our other providers presumably still work  [00:22]
<fungi> we just can't upload them to openmetal-iad3 because nodepool cleaned up the raw files  [00:23]
<fungi> but also i don't see any evidence that the builders are trying to upload new builds of other images, e.g. ubuntu-jammy  [00:24]
<tonyb> fungi: Okay.  I've never looked at nodepool but I'm happy to poke around and see if I can figure out why that might be  [00:25]
<fungi> for example, nb02 built ubuntu-jammy-6ee881ddc05e439b8e2cd01551b1ac80 yesterday and uploaded it to ovh-bhs1 and ovh-gra1 but nowhere else...  [00:29]
<fungi> i'm starting to think that the repeating impossible uploads are blocking some uploadworker threads  [00:33]
<tonyb> fungi: Oh I guess that could be  [00:34]
<opendevreview> Tony Breeds proposed opendev/system-config master: DNM: Initial dump or mediawiki role and config  https://review.opendev.org/c/opendev/system-config/+/921322  [00:46]
<tonyb> I don't think it's a big deal but should we move https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl01.opendev.org.yaml#L33-L34 from bionic to jammy?  [01:27]
<fungi> tonyb: oh, probably yes. good catch!  [01:43]
<tonyb> fungi: Okay I'll push a change.  [01:44]
<fungi> thanks!  [01:50]
<opendevreview> Tony Breeds proposed openstack/project-config master: nodepool: Switch "common job platform" from bionic to jammy  https://review.opendev.org/c/openstack/project-config/+/922174  [01:50]
<opendevreview> Merged openstack/project-config master: nodepool: Switch "common job platform" from bionic to jammy  https://review.opendev.org/c/openstack/project-config/+/922174  [02:47]
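For reference, a rough sketch of the kind of launcher-config edit 922174 makes; what actually sits at lines 33-34 of nl01.opendev.org.yaml is a guess here, so the label name and min-ready value are illustrative only:

    # nodepool/nl01.opendev.org.yaml (hypothetical excerpt)
    labels:
      - name: ubuntu-jammy    # previously ubuntu-bionic, the old "common job platform"
        min-ready: 1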
*** ykarel_ is now known as ykarel  [05:49]
<opendevreview> Jens Harbott proposed openstack/project-config master: Fix image uploads for openmetal  https://review.opendev.org/c/openstack/project-config/+/922188  [08:03]
<frickler> tonyb: ^^ after looking at nb logs, I'm pretty sure that the issue won't resolve itself by waiting. please have a look if you're still around, otherwise I'll self-approve and check the outcome  [08:04]
<tonyb> frickler: thanks. I did add my +2 FWIW.  [09:13]
<opendevreview> Merged openstack/project-config master: Fix image uploads for openmetal  https://review.opendev.org/c/openstack/project-config/+/922188  [09:38]
<frickler> hmm, seems there is a difference in nodepool between "pause: true" in the config and running "image-pause"? the former, done for centos-8-stream, still lists "State: ready" in the "dib-image-list" output, only gentoo shows "State: paused"  [10:20]
<frickler> anyway the config change has deployed and I've triggered another jammy build to see what will happen with the upload once that completes  [10:20]
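A minimal sketch of the two pause mechanisms being compared, assuming (as the dib-image-list output suggests) that centos-8-stream was paused in the builder config while gentoo was paused with the CLI, which as I understand it records the state in ZooKeeper rather than in the file; the surrounding config is illustrative, not the real opendev builder config:

    # builder config (hypothetical excerpt)
    diskimages:
      - name: centos-8-stream
        pause: true   # paused via the config file; still shows "State: ready" in dib-image-list
      - name: gentoo
        # presumably paused via the CLI instead:  nodepool image-pause <image-name>
        # which is what shows up as "State: paused" in dib-image-list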
<frickler> hmm, something seems more deeply broken with image uploads. the new ubuntu-jammy image hasn't been uploaded anywhere. images on vexxhost + rax are > 5 days old  [13:11]
<frickler> found this error, is it possible that nodepool completely gives up on uploading the image after that? https://paste.opendev.org/show/boDBRdKNVyTYjNk3AOc7/  [13:15]
<frickler> nodepool also seems to go through deleting all on-disk images once per minute and is very verbose in logging that  [13:16]
<frickler> so I think we might want to disable uploads for rax-dfw for a bit and see whether the other providers fare better then, at least  [13:17]
<fungi> frickler: did you maybe miss scrollback from yesterday?  [13:20]
<fungi> i'm wondering if uploadworker threads are blocked on retrying uploads of raw image files that nodepool has already cleaned up  [13:21]
<fungi> but that's about as far as i got  [13:21]
<fungi> probably next step is to collect a thread dump from one of the builders  [13:22]
<fungi> though yeah, good find, i overlooked the rax-dfw upload failure in the midst of the missing raw image errors  [13:23]
<frickler> fungi: I saw the scrollback and therefore dropped the no-longer-existing images from the nodepool config earlier. the errors related to that no longer appear in the log  [13:33]
<fungi> aha, thanks!  [13:34]
<frickler> there might still be blocked threads, though, maybe do a dump and then a restart of the container?  [13:34]
<fungi> so the remaining weirdness is something with ssl/tls for the rax-dfw swift where uploads get staged by glance  [13:34]
<frickler> that plus the question of whether/why one failing provider can block uploads to other providers  [13:36]
*** ykarel is now known as ykarel|away  [15:39]
<corvus> it does look like a failing provider can block uploads of others; i believe it's inadvertent, but it's there.  removing the image from rax-dfw should permit the others to resume.  [17:34]
<fungi> thanks corvus!  [17:51]
<fungi> while you're here, do you have any particular reaction to the intersection of nodepool image cleanups and adding new providers? i wonder if nodepool could figure out that it deleted files and not repeatedly try to upload them  [17:52]
<fungi> or maybe we just remember that can happen, and worry about making things more correct with nodepool-in-zuul instead  [17:53]
<corvus> remote:   https://review.opendev.org/c/zuul/nodepool/+/922242 Continue trying uploads for other providers after failure [NEW]  [17:53]
<corvus> that should fix the upload issue in nodepool  [17:53]
<corvus> fungi: that does sound like something nodepool should be able to do; i'm guessing with the above change, that probably would "only" be a log spam issue at that point; does that sound right?  [17:55]
<frickler> hmm, how would I disable uploads for rax-dfw without removing the diskimages completely? I have a vague memory that we discussed this already some time ago?  [17:55]
<corvus> make https://opendev.org/openstack/project-config/src/branch/master/nodepool/nodepool.yaml#L63 the empty list?  [17:57]
<fungi> corvus: i think so yes, or maybe ever so slightly inefficient in that it also spends some cycles trying to upload those constantly (even if it hopefully never opens a socket to glance)  [17:58]
<corvus> not sure if we actually want to run a region with different images than the others; so probably disabling it entirely would be desirable too?  [17:58]
<fungi> yeah, if we can't upload images to rax-dfw for an extended period of time we're probably better off temporarily disabling the provider until it comes back into working order  [17:59]
<frickler> well we currently have a delta of some days anyway, which isn't nice but also not critical IMO. waiting for another week before disabling the region completely as a last resort should be fine IMO  [17:59]
<corvus> fungi: yeah, i think in that case it might triage as below the threshold of me wanting to fix it in current nodepool vs continuing work on nodepool-in-zuul, which should fix it a different way.  but if it's a more serious blocker, then it might be worth fixing now.  [18:00]
<corvus> frickler: ack; just noting it for consideration  [18:00]
<fungi> corvus: also the missing files situation does self-correct eventually as the preexisting images are rotated out of nodepool  [18:01]
<frickler> rax-dfw is 140 servers, which I would like to avoid disabling if we can  [18:01]
<fungi> assuming the missing images *can* be rotated out, that is. paused image builds sort of throw a wrench into that assumption in reality  [18:01]
<frickler> regarding emptying the diskimage list, won't that tell nodepool to clean up the existing images, according to https://zuul-ci.org/docs/nodepool/latest/operation.html#removing-from-the-builder?  [18:02]
<corvus> yes, so if you don't want the builders to touch rax-dfw at all, but leave the existing images there, then perhaps remove it from the file completely.  i think that might do the trick, but i'm not 100% sure on that and haven't tested it recently.  [18:05]
<corvus> or wait for 922242 to merge  [18:06]
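To make the two options concrete, a hedged sketch of the relevant piece of the builder config; the stanza is illustrative rather than the real contents of nodepool.yaml:

    providers:
      # option at the nodepool.yaml link: keep the provider but empty its
      # diskimages list. uploads stop, but per the "removing from the builder"
      # docs the builder will also delete the images it already uploaded here
      - name: rax-dfw
        diskimages: []
    # corvus's alternative (untested recently): drop the rax-dfw stanza from the
    # builder config entirely, so the builders stop touching the region and the
    # existing uploads are left alone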
<frickler> ok, I think I can wait until tomorrow and then maybe do some testing first, like trying on the openmetal cloud to see what happens  [18:13]
<fungi> sgtm  [18:20]
<frickler> corvus: fungi: regarding the "deterministic order" comment on the nodepool patch, does the upload actually go through the providers in the order they are listed? so as a workaround we could simply move rax-dfw to the last position for now? (I'm off for now, but I can test that first thing tomorrow)  [19:47]
<corvus> i think so  [19:58]
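If uploads really are attempted in the order the providers are listed (corvus's "i think so" above), the workaround is simply reordering the builder's provider list; the names besides rax-dfw are the providers mentioned earlier in this log, and the entries are abbreviated:

    providers:
      - name: ovh-bhs1
      - name: ovh-gra1
      - name: openmetal-iad3
      # ... any remaining providers ...
      - name: rax-dfw    # moved last so its failing uploads are attempted after the others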
<opendevreview> Radoslaw Smigielski proposed openstack/diskimage-builder master: growvols, enforce pvcreate and overwrite old VG signature  https://review.opendev.org/c/openstack/diskimage-builder/+/922059  [20:06]
*** dhill is now known as Guest10078  [20:11]
