frickler | so (some of) the image uploads look successful in the "image list" output, lots of new images there, but somehow that status doesn't end up in nodepool. I guess I'll revert the revert for now until we have some better idea. maybe increase the timeout to 2h? | 07:23 |
---|---|---|
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Revert "Revert "Temporarily pause image uploads to rax-iad"" https://review.opendev.org/c/openstack/project-config/+/890850 | 07:24 |
frickler | so as an example the last one in the list says "created_at: 2023-08-10T02:28:04Z", "updated_at: 2023-08-10T03:59:58Z", if that is the relevant timespan, it would explain the timeout on the nodepool side | 07:27 |
opendevreview | Merged openstack/project-config master: Revert "Revert "Temporarily pause image uploads to rax-iad"" https://review.opendev.org/c/openstack/project-config/+/890850 | 07:39 |
opendevreview | gnuoy proposed openstack/project-config master: Add OpenStack K8S Telemetry charms https://review.opendev.org/c/openstack/project-config/+/890921 | 08:12 |
opendevreview | gnuoy proposed openstack/project-config master: Add OpenStack K8S Telemetry charms https://review.opendev.org/c/openstack/project-config/+/890921 | 09:30 |
fungi | frickler: what i was observing yesterday is the "updated at" times on the images correspond to the "expiration" times on their import tasks. not sure why a task expiration results in an update to the image record though | 11:51 |
fungi | when i get a break between today's meetings i'll try manually un-pausing one image at a time and see if they upload successfully when there's not a bunch all going at once | 11:52 |
frickler | fungi: I'm currently trying that, I also added timeout=7200 to the create_image call, will see how that goes. last attempt failed but maybe there were still other uploads in progress after the revert | 12:11 |
fungi | ah okay | 12:12 |
fungi | what i ended up doing was checking once a minute in a loop until the image with the expected name appeared in the output of `openstack image list` because none of the timestamps associated with the image or import task seemed to correspond to when that happens | 12:13 |
fungi | date ; time sudo openstack --os-cloud=openstackjenkins-rax --os-region-name=IAD image create --disk-format=vhd --file=/opt/debian-bookworm-0000000167.vhd test-image-create-`date +%s` ; while ! ( sudo openstack --os-cloud=openstackjenkins-rax --os-region-name=IAD image list | grep -q test-image-create ) ; do sleep 60 ; done ; date | 12:14 |
fungi | that's basically how i checked it | 12:15 |
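A rough openstacksdk version of the second half of that check (polling until an image with the expected name shows up) might look like the sketch below; the cloud, region, and name prefix are the ones from the CLI call above, everything else is illustrative.

```python
# Sketch only: after starting an upload by whatever means, poll the image
# list once a minute until an image with the expected name prefix appears,
# then report how long that took. Assumes a clouds.yaml entry named
# "openstackjenkins-rax" as in the CLI call above.
import time
import openstack

conn = openstack.connect(cloud='openstackjenkins-rax', region_name='IAD')
prefix = 'test-image-create'

start = time.time()
while not any((img.name or '').startswith(prefix) for img in conn.image.images()):
    time.sleep(60)
print('image visible after %.0f minutes' % ((time.time() - start) / 60))
```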
frickler | I'll need to check how the sdk actually does the waiting. too bad the osc image create doesn't simply have a --wait like server create does | 12:57 |
*** amoralej is now known as amoralej|lunch | 13:05 | |
fungi | from what i can tell, glance doesn't know at upload what the image id will be, it returns the import task id instead, and then you have to poll that to find out what the image id is once it completes | 13:08 |
frickler | fungi: yes, but IMO that whole thing could/should be wrapped by sdk+osc, no need for each user to reinvent the wheel like in your script | 13:10 |
fungi | oh, yes i agree there | 13:11 |
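For reference, doing that polling by hand with openstacksdk would look roughly like the sketch below; the task id is a placeholder and the exact attribute names should be treated as an approximation of the Glance v2 tasks API.

```python
# Minimal sketch, assuming openstacksdk and a matching clouds.yaml entry:
# poll a Glance import task until it reaches a terminal state, then read
# the new image id out of the task's result blob.
import time
import openstack

conn = openstack.connect(cloud='openstackjenkins-rax', region_name='IAD')
task_id = 'TASK_ID_FROM_THE_UPLOAD'  # placeholder for the id the upload returned

while True:
    task = conn.image.get_task(task_id)
    print(time.strftime('%H:%M:%S'), task.status)
    if task.status in ('success', 'failure'):
        break
    time.sleep(60)

if task.status == 'success':
    print('image id:', task.result['image_id'])
else:
    print('task failed:', task.message)
```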
frickler | #status log manually updated the jitsi web cert on meetpad01 and restarted the jitsi web container | 13:12 |
opendevstatus | frickler: finished logging | 13:12 |
fungi | thanks for doing that frickler! hopefully it renews normally next time | 13:26 |
*** amoralej|lunch is now known as amoralej | 13:45 | |
frickler | 2023-08-10 15:29:13,173 DEBUG nodepool.OpenStackAdapter.rax-iad: API call create_image in 8549.978593384847 | 15:35 |
frickler | openstack.exceptions.ResourceTimeout: Timeout waiting for Task:b233e83c-ecd6-4360-8a2b-d294668c2792 to transition to success | 15:36 |
frickler | still no success with 2h timeout :( | 15:36 |
frickler | I'm also getting a 503 now for image list | 15:37 |
frickler | hmm, weird, the 503 only happens when I do "openstack image list | grep something"; without the pipe, it returns within a couple of seconds | 15:40 |
Clark[m] | are all image uploads to rax failing? | 18:13 |
Clark[m] | I'm not quite sure I understand the situation there | 18:13 |
fungi | Clark[m]: mainly just rax-iad, seems like | 18:17 |
fungi | but the problem is that every task timeout seems to result in another leaked image | 18:17 |
fungi | so the builders are just leaking lost images into that region in a loop because they continually retry the uploads | 18:18 |
fungi | probably we could address this somehow on the builder side, if there's a way to get it to recognize images it made even though it thought the image creation never succeeded | 18:19 |
Clark[m] | So the upload ultimately succeeds but the API calls fail? | 18:19 |
fungi | at its worst, what i observed is that the builder would upload the image, give up waiting for the import task to complete in the cloud, and then ~5 hours later the image shows up in the image list there | 18:19 |
Clark[m] | I wonder if this is something we should engage rax on and fix it on their end | 18:19 |
fungi | i don't think the api calls fail | 18:20 |
Clark[m] | Oh is the timeout in nodepool polling for the image to go active? | 18:20 |
fungi | the way it seems to work is that the image upload returns a task id (not an image id) and then the builder watches that task to see when it completes, at which point its details include an image id | 18:20 |
fungi | but also the import task has an expiration time on it, and the images are appearing (sometimes hours) after that expiration | 18:21 |
fungi | so yes, from an api perspective there seems to be some sort of disconnect, since the task state goes to success even though the image still doesn't exist for a while even after that, implying there's some additional background process happening after the import task completes but before the image shows up as available | 18:22 |
frickler | I don't think the task itself has a timeout | 18:22 |
Clark[m] | Does make me wonder if there is some sort of network issue between swift and glance in that region if the data move is taking that long | 18:22 |
frickler | it's the sdk that times out after 1h by default | 18:22 |
fungi | frickler: not a timeout, but it has an "expiration" time (which i think is supposed to be when it ceases to be included in the task list, except it's not actually getting purged from the task list after that expiration) | 18:23 |
frickler | the upload I triggered earlier seems to have finished after about 5h | 18:23 |
frickler | so maybe I'll make another try with a 12h timeout in nodepool | 18:23 |
frickler | that might not be a permanent solution if we want to continue to rotate images every 24h, but it would give another datapoint | 18:24 |
fungi | fwiw, my test uploads from bridge01 with nodepool uploads to that region paused took around 35 minutes each from the start of the upload until the image appeared in the image list. i have a feeling two builders continuously trying to upload 15 images is contributing to the performance problem there, thundering herd style | 18:24 |
frickler | well we stopped these for now | 18:25 |
fungi | yeah, but also it took 5+ hours before i was getting reasonable test upload results | 18:25 |
fungi | so i think stopping the uploads doesn't immediately solve it, because there's still a bunch of lingering load the cloud is trying to catch back up from | 18:25 |
fungi | my first test upload shortly after we paused that region in the builders took around 5 hours, but once that one was done my subsequent (one at a time) uploads took a little over half an hour | 18:27 |
frickler | | 0000232814 | 0000000003 | rax-iad | ubuntu-bionic | ubuntu-bionic-1691694393 | 018c7205-83de-4eeb-a645-6f2259ce3710 | ready | 00:00:38:36 | | 20:31 |
frickler | fungi: the patched nodepool is running on nb01, if you want you can try some other uploads there, otherwise I'll continue tomorrow | 20:33 |
fungi | frickler: what was the patch? just the increased timeout? | 20:35 |
frickler | yes, though this one even took much less than 1h, so it would have worked with the default nodepool timeout, too: 2023-08-10 19:52:07,018 DEBUG nodepool.OpenStackAdapter.rax-iad: API call create_image in 2733.277339382097 | 20:36 |
fungi | frickler: right, i think that supports my theory. if the builder had a per-provider concurrency setting so we could tell it "only upload one image at a time to rax-iad" then it might also work | 20:38 |
Clark[m] | Is that timeout hard coded or is it configurable? | 20:38 |
Clark[m] | I wonder if we should make it configurable if it isn't already | 20:39 |
frickler | it is configurable in sdk, but currently not in nodepool | 20:39 |
frickler | https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/adapter.py#L561-L569 | 20:40 |
frickler | sdk default is timeout=3600, I added timeout=36000 on nb01 | 20:40 |
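The call behind that adapter code is the sdk cloud-layer create_image(), which already accepts wait/timeout arguments, so the nb01 patch amounts to something like the sketch below (illustrative only, not the literal nodepool diff; the variable names are placeholders).

```python
# Illustrative only, not the literal nodepool source: the adapter's upload
# reduces to an openstacksdk cloud-layer create_image() call, and the local
# patch on nb01 simply passes a wait timeout larger than the 3600s default.
image = provider_client.create_image(
    name=upload_name,          # placeholder names, not the real adapter variables
    filename=image_file,
    wait=True,
    timeout=36000,             # sdk default is 3600; raised to 10h for this experiment
    **image_properties)
```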
frickler | Clark[m]: not sure if you saw that, we also have a lot of nodes stuck in deleting in iad and dfw | 20:41 |
frickler | and with that I'm out for today, cu | 20:42 |
fungi | yeah, seems like the options around image uploads in rax-iad are 1. increase timeout to >5 hours, 2. add a max concurrency for uploads to the provider, 3. trickle images into it and try to avoid creating a thundering herd, or 4. disable that region entirely until it gets fixed | 20:43 |
fungi | thanks frickler! | 20:43 |
Clark[m] | I suspect adding max uploads per provider isn't too difficult as the total uploads are already configurable and limited. That said, I think we should talk to rax before doing that, as 8 image uploads shouldn't make things fall over, particularly since we upload to swift first and glance is doing an in-cloud transfer | 20:46 |
fungi | well, it was 15 at the same time, but yes. regardless it's something that needs fixing on their end ultimately | 20:47 |
fungi | i find it hard to believe they're unaware, since it looks like this has been going on for a long time and got considerably worse around 2 weeks ago | 20:48 |
fungi | but a big part of why i cleaned up those 1200-odd leaked images is that i wanted one less thing for them to balk at if we open a ticket | 20:49 |
fungi | and a lot of the testing was to better quantify the symptom we're seeing | 20:49 |
fungi | i have a feeling if we open a ticket that says "when we try to simultaneously upload 15 images around 25gb each, your service inexplicably gets slow" we're going to get a response like "please stop" | 20:50 |
fungi | also there's still a bunch of leaked images that need cleaning up in the other two regions, though only about a third each of what we had in iad | 20:51 |
Clark[m] | ++ to cleanup and testing. I just don't think reducing 15 uploads at a time to one is something we should prioritize on our end, as it indicates a cloud problem first | 20:54 |
fungi | sure, i agree | 20:54 |
Clark[m] | AIUI the whole point of using the task system with a swift upload was to make multiple image uploads of any size a trivial action | 20:55 |
Clark[m] | Since swift would handle the data loads and is already designed to do so | 20:55 |
fungi | to me there are two questions: what more should we do/collect before opening a ticket, and should we also do anything to try to continue using that region in the meantime or just turn it off for now? | 20:55 |
fungi | if we do want to try to get the images there updated in the meantime so we won't need to turn it off (they're approaching two weeks old in some cases), we have a few possible options for that | 20:57 |
Clark[m] | Maybe pausing the region entirely, benchmarking a single upload of a representative image, and then reporting that is all we can do for a reasonable bug report? That way our noisy neighbor problem is minimized | 20:58 |
Clark[m] | (and I think we did that it's 5 hours?) | 20:58 |
fungi | well yes and no. if we benchmark a manual upload right after pausing the mass uploads it's around 5.5 hours. if we pause everything and wait 5 or 6 hours for things to get back to normal, it's a little over half an hour | 20:59 |
fungi | we've independently done this a couple of times over the past two days, and it seems to be consistent. basically it's like there's a lingering backlog glance is still working through after we pause our uploads, and that takes hours to catch back up | 21:00 |
Clark[m] | I see | 21:03 |
fungi | so unfortunately the single upload timing under normal circumstances is really only a half hour delay, which isn't all that indicative of a problem on its own | 21:05 |
fungi | it just seems to get much worse if we try to upload a bunch at once (and stays terrible for a while even after we stop) | 21:05 |
Clark[m] | 15*0.5 is suspiciously close to 5 hours too. I wonder if things are getting serialized | 21:08 |
Clark[m] | On the cloud side I mean | 21:08 |
fungi | that's certainly what it seems like to me, yes | 21:08 |
clarkb | given that (hey, I managed to connect to irc) I feel like engaging the cloud is probably a good idea since that seems like a problem | 22:31 |
fungi | agreed, it's going to take some careful phrasing though so that we describe our experience in ways that don't seem outside realistic expectation, and i've burnt my available creative writing capacity for the day on openstack security bugs instead | 22:43 |
JayF | -.- | 22:49 |
clarkb | I didn't start today with creative writing ability | 22:50 |
clarkb | the jet lag is real | 22:50 |
JayF | ty for the help on all that fungi; not how I wanted the day to go for sure either lol | 22:50 |
fungi | JayF: oh, it's not just that one! much as i wish it were | 22:51 |
JayF | =| | 22:52 |
clarkb | fungi: only iad is affected? seems like, since we can point out that ord and dfw don't suffer this, it is reasonable to expect iad to behave like the others | 22:52 |
fungi | clarkb: i can't say *only* iad is affected, but certainly it seems much worse than the other two we use | 22:52 |
fungi | and is the only one misbehaving badly enough to have become unusable for us now | 22:53 |
clarkb | ya so maybe that is the best way to frame it. IAD suffers compared to the others | 22:53 |
fungi | without being party to the underlying details of the problem, it may merely be a matter of degree | 22:54 |
clarkb | and that way we can show this behavior is at least unexpected | 22:54 |