opendevreview | Merged opendev/system-config master: Correct static known_hosts entry for goaccess jobs https://review.opendev.org/c/opendev/system-config/+/890698 | 05:13 |
ajaiswal | how can i change my gerrit username? | 06:52 |
frickler | ajaiswal: the username is immutable | 07:10 |
opendevreview | Merged openstack/project-config master: Allow cyborg-core to act as osc/sdk service-core https://review.opendev.org/c/openstack/project-config/+/890475 | 07:43 |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Allow designate-core as osc/sdk service-core https://review.opendev.org/c/openstack/project-config/+/890365 | 08:18 |
*** dhill is now known as Guest8289 | 11:37 |
opendevreview | Merged openstack/project-config master: Allow designate-core as osc/sdk service-core https://review.opendev.org/c/openstack/project-config/+/890365 | 12:15 |
*** amoralej is now known as amoralej|lunch | 12:19 | |
*** amoralej|lunch is now known as amoralej | 12:53 | |
frickler | infra-root: seems we are not only having issues with image upload for rax, but also with instance deletion. both IAD and DFW have like 90% of instances stuck deleting | 14:10 |
frickler | from https://grafana.opendev.org/d/a8667d6647/nodepool-rackspace?orgId=1&from=now-6M&to=now it looks like for dfw it started at the end of may; some of the instances I checked have that creation date | 14:11 |
frickler | also nb01+02 disks seem to be full again, I guess we need to either fix iad uploads or disable them | 14:14 |
fungi | it may be exacerbated by the image count. i'm going to go back to working on that between meetings today | 14:16 |
fungi | i'm slow-deleting 1172 leaked images in rax-iad now, with a 10-second delay between each request. that's roughly 3h15m worth of delays so will take at least that long to complete | 15:11 |
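A minimal sketch of the kind of slow-delete loop described above, assuming the leaked image IDs have already been gathered into a file (leaked-images.txt is a hypothetical name, and the 10-second pause matches the pacing mentioned here):

    # delete one image per line from the file, pausing between API calls
    while read -r image_id; do
        openstack image delete "$image_id"
        sleep 10
    done < leaked-images.txt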
frickler | does deleting an image also delete the task(s) associated with it? | 15:14 |
fungi | glance tasks? no idea really. though a lot of these images are years old so i would be surprised if there are still lingering tasks | 15:15 |
fungi | for many of them anyway | 15:15 |
fungi | it deleted ~17 and is now sitting | 15:17 |
fungi | presumably one request is not returning in a timely manner | 15:18 |
fungi | that one request has been waiting for approximately 15 minutes now. if a lot of them are like this then it's going to take many orders of magnitude longer than i expected | 15:25 |
fungi | it eventually started moving again, but only did a few more and has stalled once more | 15:35 |
fungi | seems like image deletions might be expensive operations and they're getting backlogged | 15:36 |
fungi | or hitting internal timeouts with services not communicating reliably with one another | 15:37 |
fungi | also it may be fighting with image-related api calls from nodepool, so i agree we should probably pause image uploads to rackspace regions until we can clean things up | 15:40 |
fungi | we haven't successfully uploaded anything to rackspace since the builders were fixed yesterday | 15:40 |
fungi | er, to iad that is | 15:41 |
fungi | we have uploaded successfully to dfw and ord since then | 15:41 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Temporarily pause image uploads to rax-iad https://review.opendev.org/c/openstack/project-config/+/890807 | 15:49 |
fungi | infra-root: ^ | 15:49 |
fungi | thanks frickler! | 15:53 |
fungi | seems just over 40 of my `openstack image delete` calls have returned so far | 15:54 |
fungi | going to take a while at this pace | 15:54 |
frickler | but at least there is progress ;) | 15:55 |
fungi | well, i don't even know that it actually deleted anything, all i know is that openstackclient isn't reporting back with any errors | 15:56 |
fungi | after a while i'll check the image list from that region and see if it's shrinking at all | 15:56 |
fungi | but it will be easier to tell once that nodepool config change deploys | 15:56 |
frickler | if you have a specific image id, "image show" may be faster than list | 16:07 |
frickler | like to verify if at least the first one got deleted | 16:08 |
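Spot-checking a single image by ID, as frickler suggests, avoids waiting on a full listing; a rough sketch (the UUID below is a placeholder):

    # once the delete has gone through, this should fail with a not-found error
    openstack image show 00000000-0000-0000-0000-000000000000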
opendevreview | Merged openstack/project-config master: Temporarily pause image uploads to rax-iad https://review.opendev.org/c/openstack/project-config/+/890807 | 16:12 |
fungi | since that has deployed, i'm taking a baseline image count now | 16:31 |
fungi | 1236 images in rax-iad according to the image list that completed a few minutes ago. i'll check again in an hour | 16:45 |
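One way to take a baseline count like this, as a sketch (the --private filter restricts the listing to the project's own uploads; counts quoted elsewhere in this log were taken with and without it):

    # count the project's private images in the currently selected region/cloud
    openstack image list --private -f value -c ID | wc -l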
fungi | i do actually see one image delete error in my terminal scrollback, a rather nondescript 503 response | 16:46 |
fungi | "service unavailable" | 16:46 |
tkajinam | I'll check the status tomorrow (it's actually today) but it seems this one is stuck after +2 was voted by zuul... https://review.opendev.org/c/openstack/puppet-openstack-integration/+/890672 | 16:49 |
fungi | tkajinam: it looks like it succeeded but never merged. also interesting is that zuul commented twice in the same second about the gate pipeline success for that buildset | 16:52 |
fungi | i'll probably need to look in the debug logs on the schedulers to find out what went wrong | 16:53 |
tkajinam | ah, yes. I didn't notice that duplicate comment. | 16:53 |
tkajinam | no jobs in the status view. wondering if recheck works here or we should get it submitted forcefully. | 16:54 |
fungi | a recheck might work, unless there's something wrong with the change itself that's causing gerrit to reject the merge. hopefully i'll be able to tell from the service logs, but also i have to get on a conference call (and several back-to-back meetings) so can't look just yet | 16:57 |
tkajinam | ack. I've triggered recheck... it seems the patch now appears both in the check queue and the gate queue. I'm leaving soon and will recheck the patch during the day tomorrow (today) but I might need some help if recheck still behaves strangely | 17:00 |
fungi | looks like decent progress on the image deletions in rax-iad. count is down to 937, so about 200 fewer in roughly an hour | 17:57 |
fungi | multitasking during the openstack tc meeting, found this in gerrit's error log regarding tkajinam's problem change: | 18:05 |
fungi | [2023-08-08T15:04:29.860Z] [HTTP POST /a/changes/openstack%2Fpuppet-openstack-integration~stable%2F2023.1~I544fa835ee86a41bb4ba4bf391857b8a64750af2/ (zuul from [2001:4800:7819:103:be76:4eff:fe04:42c2])] WARN com.google.gerrit.server.update.RetryHelper : REST_WRITE_REQUEST was attempted 3 times [CONTEXT project="openstack/puppet-openstack-integration" request="REST | 18:05 |
fungi | /changes/*/revisions/*/review" ] | 18:05 |
fungi | i wonder if that means it failed to write the merge to the filesystem? | 18:05 |
fungi | [Tue Aug 8 14:47:45 2023] INFO: task jbd2/dm-0-8:823 blocked for more than 120 seconds. | 18:06 |
fungi | that's from dmesg | 18:06 |
fungi | another at 14:49:46 | 18:07 |
fungi | i think that's the cinder volume our gerrit data lives on | 18:08 |
fungi | infra-root: just a heads up that we may be seeing communication issues with the cinder volume for our gerrit data | 18:10 |
fungi | i'll do a quick check over the rackspace tickets/status page | 18:11 |
fungi | no, wait, we moved it to vexxhost a while back | 18:11 |
fungi | guilhermesp_____: mnaser: just a heads up we may be seeing cinder communication issues in ca-ymq-1 | 18:13 |
fungi | so the two events in dmesg were for dm-0 (the gerrit data volume) at 14:47:45 utc, and vda1 (the rootfs) at 14:49:46 utc | 18:23 |
fungi | these are the timestamps from dmesg though, which are notoriously inaccurate, so easily a few minutes off from the actual events that were logged | 18:23 |
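One way to sanity-check those kernel timestamps is to pull the same messages from the journal, which records its own wall-clock time at collection rather than deriving it from the boot-time offset; a sketch, assuming systemd-journald is capturing kernel messages on the server:

    # kernel messages around the window of interest, with journald's timestamps
    journalctl -k --since "2023-08-08 14:40" --until "2023-08-08 15:00" | grep "blocked for more than"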
fungi | side note, 890672 did end up merging successfully after tkajinam rechecked it | 18:24 |
fungi | for vexxhost peeps who may see this in scrollback, the server is 16acb0cb-ead1-43b2-8be7-ab4a310b4e0a and the volumes it logged problems writing to are de64f4c6-c5c8-4281-a265-833236b78480 and then d1884ff4-528d-4346-a172-6c9abafb8cdf in chronological order | 18:26 |
fungi | down to 780 images in rax-iad now, so still chugging along | 18:35 |
Clark[m] | frickler: fungi: pretty sure deleting images in glance does not delete any related import tasks or related swift data. This is my major complaint with that system, after the "it's completely undocumented and non-standard" part. Basically the service imposes massive amounts of state tracking on the end user, and you don't really know it's necessary either | 20:12 |
fungi | fwiw, i can't tell how to get osc to list glance tasks | 20:27 |
fungi | but i haven't gone digging in the docs yet, just context help | 20:27 |
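If osc really has no task subcommand, the image API itself exposes a v2 tasks endpoint that can be queried directly; a rough sketch, assuming the deployment enables the tasks API at all (public clouds frequently don't) and using a placeholder endpoint URL:

    # fetch a token and list glance tasks via the raw v2 API
    TOKEN=$(openstack token issue -f value -c id)
    curl -s -H "X-Auth-Token: $TOKEN" "https://glance.example.com/v2/tasks" | python3 -m json.tool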
fungi | rax-iad image count is down to 241, so my initial cleanup pass will hopefully finish within the hour | 20:32 |
fungi | done, leaving 171 images in the list. nodepool says it knows about 34, so i'll put together a new deletion list for the other 137 | 20:55 |
fungi | oh, actually that wasn't --private so the count included public images too. actual private image count is 130 meaning we have 96 still to clean up | 20:58 |
fungi | er, 92 actually leaked | 21:00 |
fungi | slow-deleting those now | 21:01 |
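A sketch of the kind of comparison used to build that deletion list; nodepool-known-ids.txt is a hypothetical file holding the external image IDs nodepool still tracks for this provider (extracted however is convenient from `nodepool image-list`), one UUID per line:

    # provider-side private image IDs
    openstack image list --private -f value -c ID | sort > provider-images.txt
    # IDs nodepool still knows about, sorted for comm
    sort nodepool-known-ids.txt > nodepool-images.txt
    # anything the provider has but nodepool doesn't is a leak candidate
    comm -23 provider-images.txt nodepool-images.txt > leaked-images.txt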
Clark[m] | I don't know if you can list tasks. The whole glance task system exists in a corner of no documentation and little testing | 21:09 |
fungi | also frustration and gnashing of teeth | 21:10 |
fungi | okay, after the latest pass, `openstack image list` for rax-iad has 39 entries compared to the 34 from `nodepool image-list` | 21:23 |
fungi | i'll see if i can work out what's up with the other 5 | 21:23 |
fungi | no, i keep counting the decorative lines | 21:24 |
fungi | it's really 35, so only 1 that nodepool has no record of | 21:24 |
fungi | um, it's a rockylinux-9 image uploaded 2023-08-08T20:13:42Z (a little over an hour ago) | 21:27 |
fungi | did the upload pause not actually work? | 21:27 |
fungi | though i guess the good news is that the builders are successfully uploading images to rax-iad now | 21:27 |
fungi | the bad news is that i may have deleted some recently uploaded images | 21:27 |
Clark[m] | If the upload was in progress when you paused I think it is allowed to finish | 21:27 |
Clark[m] | But I'm not sure of that | 21:28 |
fungi | the deploy pipeline reported success on the pause change at 16:25:18, so almost 4 hours before that was uploaded | 21:29 |
fungi | i guess worst case we'll have some boot failures in rax-iad for the next ~day | 21:30 |
fungi | oh, actually that image is also "leaked" (it doesn't appear in nodepool image-list) | 21:31 |
fungi | but still suggests that the builders are continuing to upload images | 21:32 |
fungi | and yeah, no successful uploads to rax-iad still according to nodepool image-list | 21:34 |
fungi | okay, i think these are deferred image uploads being processed by glance tasks. the last image uploads logged were at 17:40 | 21:41 |
fungi | so i guess i'll watch for a while to see if the image count goes up any more | 21:42 |
fungi | yeah, now a bionic image just showed up, created timestamp 2023-08-08T20:48:04Z updated at 2023-08-08T21:37:55Z | 21:44 |
fungi | image name on that one was ubuntu-bionic-1691511227 and `date -d@1691511227` reports "2023-08-08T16:13:47 UTC" | 21:48 |
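Since the numeric suffix on nodepool image names decodes as a unix epoch timestamp (as the `date` call above shows), the name alone dates the build; a quick sketch:

    # strip the epoch serial off the image name and print it as UTC
    name=ubuntu-bionic-1691511227
    date -u -d "@${name##*-}" +%Y-%m-%dT%H:%M:%SZ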
fungi | so yes, seems like glance tasks are running on the uploads, but are severely backlogged, and so nodepool gives up after the timeout response thinking the upload failed, but then the upload actually appears hours later | 21:48 |
fungi | and since nodepool assumes it was never there, a leak ensues | 21:49 |
fungi | if the backlog is predictable, then we should hopefully cease seeing new ones appear after roughly 23:00 utc | 21:52 |
fungi | so about another hour | 21:53 |
fungi | i'll try to check back then and do another cleanup pass, or may not get to it until tomorrow | 21:53 |
*** dviroel_ is now known as dviroel | 22:13 |
fungi | the count seems to have stabilized with only a few stragglers. i'll clean them up in a bit | 23:01 |