fungi | tonyb: still waiting on the rest of the images to upload | 00:13 |
---|---|---|
fungi | from logs it looked like the upload errors were related to nodepool image cleanup: we want to upload raw files, but the nodepool builders don't keep those around, so they couldn't upload any of the already-built images. and some of our images are only rebuilt weekly now i think? | 00:14 |
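For context, the builder only keeps the artifact formats its configuration asks for, so a provider that needs raw files can't be served from earlier builds whose raw files were never produced or have already been cleaned up. A rough, hypothetical sketch of the relevant diskimage settings (names and values are placeholders, not OpenDev's actual config):

```yaml
# Illustrative nodepool builder config only; names and values are
# placeholders, not the real OpenDev settings.
diskimages:
  - name: ubuntu-jammy
    # formats the builder keeps on disk; a provider that needs "raw"
    # can only receive builds made after "raw" was added here
    formats:
      - qcow2
      - vhd
      - raw
    rebuild-age: 604800  # e.g. rebuild weekly (seconds)
    elements:
      - ubuntu-minimal
      - vm
```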
tonyb | fungi: Ahh okay. So I guess we wait for a bit. | 00:15 |
fungi | also builder logs are full of upload errors for gentoo and centos-8-stream because we're not rebuilding those at all so no new raw files are appearing | 00:18 |
fungi | i suppose nodepool should refrain from trying to upload image files it has itself deleted | 00:19 |
fungi | instead of raising BuilderInvalidCommandError over and over | 00:20 |
tonyb | fungi: So once OpenMetal is added we should remove the gentoo and centos-8 images? I'm guessing doing it now will cause more confusion | 00:20 |
fungi | well, new image builds for both are paused but the images already uploaded in our other providers presumably still work | 00:22 |
fungi | we just can't upload them to openmetal-iad3 because nodepool cleaned up the raw files | 00:23 |
fungi | but also i don't see any evidence that the builders are trying to upload new builds of other images, e.g. ubuntu-jammy | 00:24 |
tonyb | fungi: Okay. I've never looked at nodepool but I'm happy to poke around and see if I can figure out why that might be | 00:25 |
fungi | for example, nb02 built ubuntu-jammy-6ee881ddc05e439b8e2cd01551b1ac80 yesterday and uploaded it to ovh-bhs1 and ovh-gra1 but nowhere else... | 00:29 |
fungi | i'm starting to think that the repeating impossible uploads are blocking some uploadworker threads | 00:33 |
tonyb | fungi: Oh I guess that could be | 00:34 |
opendevreview | Tony Breeds proposed opendev/system-config master: DNM: Initial dump of mediawiki role and config https://review.opendev.org/c/opendev/system-config/+/921322 | 00:46 |
tonyb | I don't think it's a big deal but should we move https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl01.opendev.org.yaml#L33-L34 from bionic to jammy ? | 01:27 |
fungi | tonyb: oh, probably yes. good catch! | 01:43 |
tonyb | fungi: Okay I'll push a change. | 01:44 |
fungi | thanks! | 01:50 |
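As an aside, the change discussed above boils down to swapping which distro the launcher treats as the "common job platform". The actual layout of nl01.opendev.org.yaml isn't reproduced here; this is a purely hypothetical sketch of that kind of label tweak:

```yaml
# Hypothetical illustration only; the real nl01.opendev.org.yaml layout
# and values differ.
labels:
  - name: ubuntu-jammy
    # keep a couple of nodes pre-booted for the most commonly used platform
    min-ready: 2
  - name: ubuntu-bionic
    # no longer held ready now that jammy is the common platform
    min-ready: 0
```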
opendevreview | Tony Breeds proposed openstack/project-config master: nodepool: Switch "common job platform" from bionic to jammy https://review.opendev.org/c/openstack/project-config/+/922174 | 01:50 |
opendevreview | Merged openstack/project-config master: nodepool: Switch "common job platform" from bionic to jammy https://review.opendev.org/c/openstack/project-config/+/922174 | 02:47 |
*** ykarel_ is now known as ykarel | 05:49 | |
opendevreview | Jens Harbott proposed openstack/project-config master: Fix image uploads for openmetal https://review.opendev.org/c/openstack/project-config/+/922188 | 08:03 |
frickler | tonyb: ^^ after looking at nb logs, I'm pretty sure that the issue won't resolve itself by waiting. please have a look if you're still around, otherwise I'll self-approve and check the outcome | 08:04 |
tonyb | frickler: thanks. I did add my +2 FWIW. | 09:13 |
opendevreview | Merged openstack/project-config master: Fix image uploads for openmetal https://review.opendev.org/c/openstack/project-config/+/922188 | 09:38 |
frickler | hmm, seems there is a difference in nodepool between "pause: true" in the config and running "image-pause"? the former, done for centos-8-stream, still lists "State: ready" in the "dib-image-list" output; only gentoo shows "State: paused" | 10:20 |
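For reference, these are the two pause mechanisms being compared. This is only a sketch of where each one lives; the exact semantics (and what each shows in the image listing) is precisely what's being questioned above.

```yaml
# Sketch only, not the actual OpenDev config.
# Mechanism 1: in the builder's diskimage config, pause new builds:
diskimages:
  - name: centos-8-stream
    pause: true
# Mechanism 2: the CLI, run against the builder/ZooKeeper state
# (assuming a nodepool version that provides these subcommands):
#   nodepool image-pause <image-name>
#   nodepool image-unpause <image-name>
```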
frickler | anyway the config change has deployed and I've triggered another jammy build to see what will happen with the upload once that completes | 10:20 |
frickler | hmm, something seems more deeply broken with image uploads. the new ubuntu-jammy image hasn't been uploaded anywhere. images on vexxhost + rax are > 5 days old | 13:11 |
frickler | found this error, is it possible that nodepool completely gives up on uploading the image after that? https://paste.opendev.org/show/boDBRdKNVyTYjNk3AOc7/ | 13:15 |
frickler | nodepool also seems to go through deleting all on-disk images once per minute and is very verbose in logging that | 13:16 |
frickler | so I think we might want to disable uploads for rax-dfw for a bit and see whether the other providers fare better then, at least | 13:17 |
fungi | frickler: did you maybe miss scrollback from yesterday? | 13:20 |
fungi | i'm wondering if uploadworker threads are blocked on retrying uploads of raw image files that nodepool has already cleaned up | 13:21 |
fungi | but that's about as far as i got | 13:21 |
fungi | probably next step is to collect a thread dump from one of the builders | 13:22 |
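If it comes to collecting one: nodepool's daemons, like zuul's, register a SIGUSR2 handler that writes per-thread stack traces to the debug log, so a dump can be taken without a restart. A sketch, assuming the builder runs as a docker container (the container name and log path below are guesses, not confirmed deployment details):

```shell
# Assumes a container named "nodepool-builder"; adjust for the real deployment.
# SIGUSR2 asks the daemon to dump a stack trace for every thread to its
# debug log without stopping it.
docker kill --signal=USR2 nodepool-builder
# then read the dump out of the builder's debug log, e.g.:
tail -n 200 /var/log/nodepool/builder-debug.log
```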
fungi | though yeah, good find, i overlooked the rax-dfw upload failure in the midst of the missing raw image errors | 13:23 |
frickler | fungi: I saw the scrollback and therefore dropped the no longer existing images from nodepool config earlier. the errors related to that no longer appear in the log | 13:33 |
fungi | aha, thanks! | 13:34 |
frickler | there might still be blocked threads, though, maybe do a dump and then a restart of the container? | 13:34 |
fungi | so the remaining weirdness is something with ssl/tls for the rax-dfw swift where uploads get staged by glance | 13:34 |
frickler | that plus the question whether/why one failing provider can block uploads to other providers | 13:36 |
*** ykarel is now known as ykarel|away | 15:39 | |
corvus | it does look like a failing provider can block uploads of others; i believe it's inadvertent, but it's there. removing the image from rax-dfw should permit the others to resume. | 17:34 |
fungi | thanks corvus! | 17:51 |
fungi | while you're here, do you have any particular reaction to the intersection of nodepool image cleanups and adding new providers? i wonder if nodepool could figure out that it deleted files and not repeatedly try to upload them | 17:52 |
fungi | or maybe we just remember that can happen, and worry about making things more correct with nodepool-in-zuul instead | 17:53 |
corvus | remote: https://review.opendev.org/c/zuul/nodepool/+/922242 Continue trying uploads for other providers after failure [NEW] | 17:53 |
corvus | that should fix the upload issue in nodepool | 17:53 |
corvus | fungi: that does sound like something nodepool should be able to do; i'm guessing with the above change, that probably would "only" be a log spam issue at that point; does that sound right? | 17:55 |
frickler | hmm, how would I disable uploads for rax-dfw without removing the diskimages completely? I have a vague memory that we discussed this already some time ago? | 17:55 |
corvus | make https://opendev.org/openstack/project-config/src/branch/master/nodepool/nodepool.yaml#L63 the empty list? | 17:57 |
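For reference, corvus's suggestion amounts to something like the following in the provider section (sketch only, not the actual nodepool.yaml contents; note the cleanup caveat raised a bit further down):

```yaml
# Sketch of the suggestion above, not the real provider config.
providers:
  - name: rax-dfw
    # ... other provider settings unchanged ...
    # an empty list stops new uploads for this provider, but (per the
    # builder docs discussed below) may also mark existing uploads for
    # cleanup, which is the concern raised later in the log
    diskimages: []
```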
fungi | corvus: i think so yes, or maybe ever so slightly inefficient in that it also spends some cycles trying to upload those constantly (even if it hopefully never opens a socket to glance) | 17:58 |
corvus | not sure if we actually want to run a region with different images than others; so probably disabling it entirely would be desirable too? | 17:58 |
fungi | yeah, if we can't upload images to rax-dfw for an extended period of time we're probably better off temporarily disabling the provider until it comes back into working order | 17:59 |
frickler | well we currently have a delta of some days anyway, which isn't nice but also not critical IMO. waiting for another week before disabling the region completely as a last resort should be fine IMO | 17:59 |
corvus | fungi: yeah, i think in that case it might triage as below the threshold of me wanting to fix it in current nodepool vs continuing work on nodepool-in-zuul which should fix it a different way. but if it's a more serious blocker, then might be worth fixing now. | 18:00 |
corvus | frickler: ack; just noting for consideration | 18:00 |
fungi | corvus: also the missing files situation does self-correct eventually as the preexisting images are rotated out of nodepool | 18:01 |
frickler | rax-dfw is 140 servers, which I would like to avoid disabling if we can | 18:01 |
fungi | assuming the missing images *can* be rotated out that is. paused image builds sort of throw a wrench into that assumption in reality | 18:01 |
frickler | regarding emptying the diskimage list, won't that tell nodepool to clean up existing images according to https://zuul-ci.org/docs/nodepool/latest/operation.html#removing-from-the-builder? | 18:02 |
corvus | yes, so if you don't want the builders to touch rax-dfw at all, but leave existing images there, then perhaps remove it from the file completely. i think that might do the trick, but i'm not 100% sure on that and haven't tested it recently. | 18:05 |
corvus | or wait for 922242 to merge | 18:06 |
frickler | ok, I think I can wait until tomorrow and then maybe do some testing first, like try on the openmetal cloud what happens | 18:13 |
fungi | sgtm | 18:20 |
frickler | corvus: fungi: regarding the "deterministic order" comment on the nodepool patch, does the upload actually go through providers in the order listed in the list? so as a workaround we could simply move rax-dfw to the last position for now? (I'm off for now, but I can test that first thing tomorrow) | 19:47 |
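If uploads really are attempted in the order providers appear in the config, the workaround is just a reordering along these lines (everything except the provider names mentioned in the log is elided):

```yaml
# Illustrative ordering only; all provider settings are elided.
providers:
  - name: ovh-bhs1
  - name: ovh-gra1
  - name: openmetal-iad3
  # ... remaining providers ...
  # keep the provider with failing uploads last so a failure there
  # no longer starves uploads to the providers listed above it
  - name: rax-dfw
```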
corvus | i think so | 19:58 |
opendevreview | Radoslaw Smigielski proposed openstack/diskimage-builder master: growvols, enforce pvcreate and overwrite old VG signature https://review.opendev.org/c/openstack/diskimage-builder/+/922059 | 20:06 |
*** dhill is now known as Guest10078 | 20:11 |