Wednesday, 2023-08-09

01:50 <tkajinam> fungi, yeah that was merged after recheck.
04:29 <frickler> so even if nodepool did upload images to iad, those uploads weren't successful from a nodepool pov, the active images e.g. for rockylinux-9 are still 10 and 11d old. I'll try to clean up the older one via nodepool in the hope of getting a bit of space freed up on the builders
04:31 <frickler> also the osc command is "openstack image task list", it does work fine on other tenants and shows just a single task for the ci@iad tenant
04:52 <frickler> so the nodepool image-delete worked, I'll try to go through the other ones slowly in order not to overload the API or backend
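The cleanup frickler describes would look roughly like the following with the nodepool CLI. This is a sketch, not a transcript of the actual commands: the build and upload ids are placeholders, and the exact flags depend on the deployed nodepool version.

```shell
# List image uploads known to nodepool and find the stale rockylinux-9 ones
# (a hypothetical grep; the real output has provider, build, and upload ids).
nodepool image-list | grep rockylinux-9

# Delete one old upload; all four identifiers below are illustrative.
nodepool image-delete --provider rax-iad --image rockylinux-9 \
    --build-id 0000001234 --upload-id 0000000001
```

Deleting one at a time, as frickler does here, avoids hammering the provider's API or the glance backend with bulk deletions.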
04:53 <frickler> regarding deleting the old instances, as expected a manual delete command didn't change the status, so those will likely need some rax intervention
04:54 <frickler> the retried run on meetpad01 also still didn't work, so I'll do the manual cert copying steps later today
06:57 <frickler> seems the image deletion worked well and we now have at least about 25% free on the builders again
11:04 *** amoralej is now known as amoralej|lunch
11:42 <fungi> there were 3 images in rax-iad that nodepool had no record of, so i deleted them just now and the counts are finally matched up
<opendevreview> Jeremy Stanley proposed openstack/project-config master: Revert "Temporarily pause image uploads to rax-iad"
11:52 <fungi> and yes, after comparing times based on the image names, i'm certain the ones i was seeing appear after the pause took effect were somehow delayed/queued on the glance side. the builders didn't try to upload anything there after the pause, but things they had previously tried to upload were eventually appearing in the image list in glance many hours later
12:03 <fungi> counts in dfw and ord aren't as bad as iad was (each a little over 400) but still clearly about 95% leaked. i'll try to find time to clean those up later today
12:08 <frickler> fungi: regarding iad uploads, maybe try manually uploading an image first? does nodepool complain then? we could also test in the ci tenant?
12:09 <frickler> fungi: also any idea what to do about the stuck-in-deleting instances?
12:11 <fungi> for instances stuck deleting, i've usually had to open a support ticket or otherwise reach out to a cloud admin regardless of which provider it is
12:13 <fungi> i can try manually uploading an image from one of the builders and see what happens, though working out the invocation could be tricky (i'll probably need to install osc in a venv first)
12:14 <fungi> and we don't install the python3-venv package on the servers
12:15 <fungi> we've got some room on bridge01 though, i guess i could try copying an image to it and uploading from there
12:20 <frickler> maybe start with a standard cloud image or even cirros. would also be interesting to see if smaller images work better
12:21 <fungi> well, i've got debian-bookworm-0000000167.vhd copying over from nb01
12:23 <fungi> conveniently, bridge has ssh access to everything already
12:23 *** amoralej|lunch is now known as amoralej
12:40 <fungi> i've got to go do a couple other things, but i've started an image create in rax-iad with that vhd file and am timing it
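The manual test fungi describes (a venv on bridge, osc installed into it, then an image create from the copied vhd) would look roughly like this. The venv path, cloud name, and resulting image name are illustrative assumptions, not values taken from bridge:

```shell
# Hypothetical one-off setup on bridge: osc in a throwaway venv,
# since the servers don't carry python-openstackclient themselves.
python3 -m venv ~/osc-venv
~/osc-venv/bin/pip install python-openstackclient

# Upload the vhd copied over from nb01, timing the call as fungi does;
# "rax-iad" here assumes a matching clouds.yaml entry on bridge.
time ~/osc-venv/bin/openstack --os-cloud rax-iad image create \
    --disk-format vhd \
    --container-format bare \
    --file debian-bookworm-0000000167.vhd \
    test-upload-debian-bookworm
```

On clouds that use glance's task-based import (as the rest of the log shows rackspace does), the id this command prints back may be a task id rather than the final image id.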
<opendevreview> Slawek Kaplonski proposed openstack/project-config master: Allow neutron-core to act as osc/sdk service-core
13:11 <fungi> created_at 2023-08-09T12:44:44Z
13:11 <fungi> status pending
13:11 <fungi> image list still isn't returning it
13:12 <fungi> and image show with its uuid returns "No Image found for ..."
13:17 <fungi> image task list is taking quite a while to return anything
13:32 <fungi> took about 9 minutes to complete and listed thousands of entries (i'm running it again to try and get a proper count of them, since the output was longer than my buffer)
13:44 <fungi> 45275 tasks listed
13:44 <fungi> the image i uploaded still isn't being returned by the api
13:45 <fungi> need to go run some errands, but will bbiab
14:32 <guilhermesp_____> fungi: sorry for the delay here -- we did have an incident yesterday with one of the storage nodes in ca-ymq-1 around 10:30 am est -- wondering how this is looking for you now
<opendevreview> Slawek Kaplonski proposed openstack/project-config master: Allow neutron-core to act as osc/sdk service-core
14:43 *** dviroel__ is now known as dviroel
14:59 <fungi> guilhermesp_____: nothing new since 14:49:46 utc yesterday, so it seems the impact was limited. thanks for confirming it was what it seemed like!
<opendevreview> gnuoy proposed openstack/project-config master: Add OpenStack K8S Telemetry charms
15:34 <guilhermesp_____> cool, thanks for confirming fungi, we took measures to avoid that in the future
15:34 <fungi> much appreciated!
15:57 <frickler> fungi: wow, that's a lot of tasks. seems I never waited long enough for that command to return. should we try to delete all of them and see if that helps? I would hope that these are simple db operations that can work faster than image deletions
16:03 <fungi> frickler: i don't think they can be deleted?
16:04 <fungi> it seems like it's more of a log, since they all have states like "success" or "failure"
16:05 <fungi> `openstack image task ...` subcommands are limited to list and show, from what i can see
16:05 <fungi> probably some background process in glance is supposed to expire those entries after some time
16:07 <fungi> some of these entries are from 2015
16:10 <frickler> oh, right, tab completion was giving me hallucinated extra commands
16:10 <fungi> i'm going to actually save the output to a file this time and then i can more easily see the newer entries (i think they're at the top of the list) and check what extended data they have using image task show
16:12 <fungi> probably the expected way to use the task listing endpoint in the api is to request the n newest entries or entries after x time
16:13 <fungi> certainly getting 42k task entries from the past 8 years isn't terribly useful, at least in our case
16:13 <fungi> er, 45k
16:16 <fungi> also, if the image upload processing backlog is consistent with what i was seeing yesterday, that image i uploaded at 12:45 utc will probably start showing up in the image list soonish
16:19 <fungi> 8d06a385-edfa-4a52-aa3d-223b6ae3bd96 is the top one in the task list and it has a status of success. it's for that test image i uploaded
16:19 <fungi> huh, okay so the image uuid returned earlier by the create command doesn't match the one referenced in the task detail
16:20 <fungi> that image does exist, and is status active updated_at 2023-08-09T13:12:14Z
16:20 <fungi> so took just shy of 30 minutes to appear, probably
16:23 <fungi> no, wait, that was an update from reaching its "expiration" (not sure what that does exactly). the import task was created at 12:44:44 and the image says it was created at 12:54:44, which is only 10 minutes
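The correction above is easy to verify: the gap between the import task's created_at and the image's created_at, both quoted in the log, is exactly ten minutes.

```python
from datetime import datetime, timezone

# Timestamps quoted in the log: the glance import task was created at
# 12:44:44 UTC, and the resulting image record at 12:54:44 UTC.
task_created = datetime(2023, 8, 9, 12, 44, 44, tzinfo=timezone.utc)
image_created = datetime(2023, 8, 9, 12, 54, 44, tzinfo=timezone.utc)

delay = image_created - task_created
print(delay.total_seconds() / 60)  # minutes the import took -> 10.0
```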
16:24 <fungi> i'll delete that and do another upload test to see if it's consistent
16:25 <fungi> started the new upload at 16:25:00 precisely
16:28 <frickler> I actually tried to list tasks with --limit 2, but that didn't seem to work any faster
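For what it's worth, glance's v2 tasks API does document limit, sort_key, and sort_dir query parameters, so hitting the endpoint directly would show whether the server honours them or whether the slowness is on the backend regardless. The endpoint URL below is a placeholder, not the real rackspace one:

```shell
# Get a token via osc, then query the tasks endpoint directly with paging
# and sort parameters; GLANCE is a hypothetical endpoint URL.
TOKEN=$(openstack token issue -f value -c id)
GLANCE=https://image.example.com

curl -s -H "X-Auth-Token: $TOKEN" \
    "$GLANCE/v2/tasks?limit=5&sort_key=created_at&sort_dir=desc"
```

If that still takes minutes to answer, the cost is in the server-side query over the ~45k-row task table, not in transferring the listing to the client.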
16:30 <fungi> real 4m23.219s
16:30 <fungi> created_at 2023-08-09T16:29:23Z
16:31 <fungi> the image create response said the id was c35b96c7-5448-4e24-9d97-983ae3c2aab4
16:32 <fungi> but i'll check for it by name instead since it seems like it eventually appears with a different uuid (maybe as a result of the import task)
16:33 <fungi> oh! the id is actually the glance task's id, not the image's id, this makes slightly more sense
16:33 <fungi> that uuid is an import task with status processing
16:34 <fungi> so once that task's status reaches success, it should mention an image id for the uploaded image
16:45 <fungi> still processing
17:20 <fungi> image is now appearing in the list, and claims to have been created at 2023-08-09T16:37:56Z (but wasn't there when i checked for it at 16:45, so something else is still going on)
18:48 <frickler> if I read the nodepool code correctly, it simply calls the create_image function in sdk. that has a default timeout of 1h, so I guess that should be fine and we can try to go on with the revert
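The SDK path frickler is describing is openstacksdk's create_image, which uploads the file and can block until the image goes active, giving up after a default timeout of 3600 seconds. A minimal sketch, assuming a clouds.yaml entry and file path that are purely illustrative:

```python
import openstack

# Hypothetical clouds.yaml entry; not a real credential from this log.
conn = openstack.connect(cloud="rax-iad")

image = conn.create_image(
    "debian-bookworm-test",     # illustrative image name
    filename="/tmp/test.vhd",   # illustrative local file
    disk_format="vhd",
    container_format="bare",
    wait=True,                  # block until the image is active...
    timeout=3600,               # ...or raise after an hour (the sdk default)
)
print(image.id)
```

Since the observed end-to-end delay was well under an hour, that default timeout is why the revert looked safe to proceed with.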
18:50 <frickler> fungi: I've +2d the patch so you can proceed with it whenever you feel ready. or if you just un-wip it, I can merge it and watch tomorrow morning (my time)
18:53 <fungi> thanks frickler! i'm going to do a couple more upload tests and see if i can get a clearer picture of how quickly they're appearing in the image list
18:53 <fungi> but if it's significantly less than an hour i agree that seems good enough
20:26 <fungi> started a test upload at 19:39:32 which finished at 19:44:19 and appeared in the image list between 20:11:12 and 20:12:12, so worst case call that 32m40s (but probably a few minutes less if nodepool is counting from when the create call returns from blocking)
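The worst-case figure quoted above checks out from the log's own timestamps: create started at 19:39:32, and the image was listable by 20:12:12.

```python
from datetime import datetime, timezone

start = datetime(2023, 8, 9, 19, 39, 32, tzinfo=timezone.utc)    # image create started
visible = datetime(2023, 8, 9, 20, 12, 12, tzinfo=timezone.utc)  # first listing showing it

worst_case = visible - start
minutes, seconds = divmod(int(worst_case.total_seconds()), 60)
print(f"{minutes}m{seconds}s")  # -> 32m40s, matching the log
```

Measuring from when the blocking create call returned (19:44:19) instead would shave a bit under five minutes off that, which is the caveat fungi notes.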
20:27 <fungi> this seems reasonable, i'll approve the revert and keep an eye on it while i knock out some yardwork
<opendevreview> Merged openstack/project-config master: Revert "Temporarily pause image uploads to rax-iad"
20:45 <fungi> keeping an eye on new uploads to rax-iad once that deploys
20:47 <fungi> which have now started
20:51 <fungi> currently uploading 15 images... hopefully that thundering herd doesn't slow ready time to more than an hour
21:33 <fungi> not looking good, we're already past 45 minutes and they're all still "uploading"
21:35 <fungi> then again, uploading one image from bridge took over 4 minutes. even splitting the load between two builders, maybe it really takes this long just to transfer the data for ~7.5 images from each builder to glance
21:35 <fungi> especially if there are uploads to other providers going on at the same time
22:32 <fungi> not good, now the uploading count has risen to 16
