opendevreview | Merged opendev/system-config master: Correct static known_hosts entry for goaccess jobs https://review.opendev.org/c/opendev/system-config/+/890698 | 05:13 |
ajaiswal | how can i change my gerrit username? | 06:52 |
frickler | ajaiswal: the username is immutable | 07:10 |
opendevreview | Merged openstack/project-config master: Allow cyborg-core to act as osc/sdk service-core https://review.opendev.org/c/openstack/project-config/+/890475 | 07:43 |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Allow designate-core as osc/sdk service-core https://review.opendev.org/c/openstack/project-config/+/890365 | 08:18 |
*** dhill is now known as Guest8289 | 11:37 |
opendevreview | Merged openstack/project-config master: Allow designate-core as osc/sdk service-core https://review.opendev.org/c/openstack/project-config/+/890365 | 12:15 |
*** amoralej is now known as amoralej|lunch | 12:19 | |
*** amoralej|lunch is now known as amoralej | 12:53 | |
frickler | infra-root: seems we are not only having issues with image upload for rax, but also with instance deletion. both IAD and DFW have like 90% of instances stuck deleting | 14:10 |
frickler | from https://grafana.opendev.org/d/a8667d6647/nodepool-rackspace?orgId=1&from=now-6M&to=now it looks like for dfw it started at the end of may; some of the instances I checked have that creation date | 14:11 |
frickler | also nb01+02 disks seem to be full again, I guess we need to either fix iad uploads or disable them | 14:14 |
fungi | it may be exacerbated by the image count. i'm going to go back to working on that between meetings today | 14:16 |
fungi | i'm slow-deleting 1172 leaked images in rax-iad now, with a 10-second delay between each request. that's roughly 3h15m worth of delays so will take at least that long to complete | 15:11 |
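A minimal sketch of the kind of slow-delete loop described above, assuming the leaked image IDs have already been gathered into a file (leaked-images.txt is a hypothetical name, and the 10-second pause matches the pacing mentioned here):

    # delete one image per line from the file, pausing between API calls
    while read -r image_id; do
        openstack image delete "$image_id"
        sleep 10
    done < leaked-images.txt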
frickler | does deleting an image also delete the task(s) associated with it? | 15:14 |
fungi | glance tasks? no idea really. though a lot of these images are years old so i would be surprised if there are still lingering tasks | 15:15 |
fungi | for many of them anyway | 15:15 |
fungi | it deleted ~17 and is now sitting | 15:17 |
fungi | presumably one request is not returning in a timely manner | 15:18 |
fungi | that one request has been waiting for approximately 15 minutes now. if a lot of them are like this then it's going to take many orders of magnitude longer than i expected | 15:25 |
fungi | it eventually started moving again, but only did a few more and has stalled once more | 15:35 |
fungi | seems like image deletions might be expensive operations and they're getting backlogged | 15:36 |
fungi | or hitting internal timeouts with services not communicating reliably with one another | 15:37 |
fungi | also it may be fighting with image-related api calls from nodepool, so i agree we should probably pause image uploads to rackspace regions until we can clean things up | 15:40 |
fungi | we haven't successfully uploaded anything to rackspace since the builders were fixed yesterday | 15:40 |
fungi | er, to iad that is | 15:41 |
fungi | we have uploaded successfully to dfw and ord since then | 15:41 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Temporarily pause image uploads to rax-iad https://review.opendev.org/c/openstack/project-config/+/890807 | 15:49 |
fungi | infra-root: ^ | 15:49 |
fungi | thanks frickler! | 15:53 |
fungi | seems just over 40 of my `openstack image delete` calls have returned so far | 15:54 |
fungi | going to take a while at this pace | 15:54 |
frickler | but at least there is progress ;) | 15:55 |
fungi | well, i don't even know that it actually deleted anything, all i know is that openstackclient isn't reporting back with any errors | 15:56 |
fungi | after a while i'll check the image list from that region and see if it's shrinking at all | 15:56 |
fungi | but it will be easier to tell once that nodepool config change deploys | 15:56 |
frickler | if you have a specific image id, "image show" may be faster than list | 16:07 |
frickler | like to verify if at least the first one got deleted | 16:08 |
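Spot-checking a single image by ID, as frickler suggests, avoids waiting on a full listing; a rough sketch (the UUID below is a placeholder):

    # once the delete has gone through, this should fail with a not-found error
    openstack image show 00000000-0000-0000-0000-000000000000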
opendevreview | Merged openstack/project-config master: Temporarily pause image uploads to rax-iad https://review.opendev.org/c/openstack/project-config/+/890807 | 16:12 |
fungi | since that has deployed, i'm taking a baseline image count now | 16:31 |
fungi | 1236 images in rax-iad according to the image list that completed a few minutes ago. i'll check again in an hour | 16:45 |
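One way to take a baseline count like this, as a sketch (the --private filter restricts the listing to the project's own uploads; counts quoted elsewhere in this log were taken with and without it):

    # count the project's private images in the currently selected region/cloud
    openstack image list --private -f value -c ID | wc -l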
fungi | i do actually see one image delete error in my terminal scrollback, a rather nondescript 503 response | 16:46 |
fungi | "service unavailable" | 16:46 |
tkajinam | I'll check the status tomorrow (it's actually today) but it seems this one is stuck after +2 was voted by zuul... https://review.opendev.org/c/openstack/puppet-openstack-integration/+/890672 | 16:49 |
fungi | tkajinam: it looks like it succeeded but never merged. also interesting is that zuul commented twice in the same second about the gate pipeline success for that buildset | 16:52 |
fungi | i'll probably need to look in the debug logs on the schedulers to find out what went wrong | 16:53 |
tkajinam | ah, yes. I didn't notice that duplicate comment. | 16:53 |
tkajinam | no jobs in the status view. wondering if recheck works here or we should get it submitted forcefully. | 16:54 |
fungi | a recheck might work, unless there's something wrong with the change itself that's causing gerrit to reject the merge. hopefully i'll be able to tell from the service logs, but also i have to get on a conference call (and several back-to-back meetings) so can't look just yet | 16:57 |
tkajinam | ack. I've triggered recheck... it seems the patch now appears both in the check queue and the gate queue. I'm leaving soon and will recheck the patch during the day tomorrow (today) but I might need some help if recheck still behaves strangely | 17:00 |
fungi | looks like decent progress on the image deletions in rax-iad. count is down to 937, so about 200 fewer in roughly an hour | 17:57 |
fungi | multitasking during the openstack tc meeting, found this in gerrit's error log regarding tkajinam's problem change: | 18:05 |
fungi | [2023-08-08T15:04:29.860Z] [HTTP POST /a/changes/openstack%2Fpuppet-openstack-integration~stable%2F2023.1~I544fa835ee86a41bb4ba4bf391857b8a64750af2/ (zuul from [2001:4800:7819:103:be76:4eff:fe04:42c2])] WARN com.google.gerrit.server.update.RetryHelper : REST_WRITE_REQUEST was attempted 3 times [CONTEXT project="openstack/puppet-openstack-integration" request="REST | 18:05 |
fungi | /changes/*/revisions/*/review" ] | 18:05 |
fungi | i wonder if that means it failed to write the merge to the filesystem? | 18:05 |
fungi | [Tue Aug 8 14:47:45 2023] INFO: task jbd2/dm-0-8:823 blocked for more than 120 seconds. | 18:06 |
fungi | that's from dmesg | 18:06 |
fungi | another at 14:49:46 | 18:07 |
fungi | i think that's the cinder volume our gerrit data lives on | 18:08 |
fungi | infra-root: just a heads up that we may be seeing communication issues with the cinder volume for our gerrit data | 18:10 |
fungi | i'll do a quick check over the rackspace tickets/status page | 18:11 |
fungi | no, wait, we moved it to vexxhost a while back | 18:11 |
fungi | guilhermesp_____: mnaser: just a heads up we may be seeing cinder communication issues in ca-ymq-1 | 18:13 |
fungi | so the two events in dmesg were for dm-0 (the gerrit data volume) at 14:47:45 utc, and vda1 (the rootfs) at 14:49:46 utc | 18:23 |
fungi | these are the timestamps from dmesg though, which are notoriously inaccurate, so easily a few minutes off from the actual events that were logged | 18:23 |
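One way to sanity-check those kernel timestamps is to pull the same messages from the journal, which records its own wall-clock time at collection rather than deriving it from the boot-time offset; a sketch, assuming systemd-journald is capturing kernel messages on the server:

    # kernel messages around the window of interest, with journald's timestamps
    journalctl -k --since "2023-08-08 14:40" --until "2023-08-08 15:00" | grep "blocked for more than"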
fungi | side note, 890672 did end up merging successfully after tkajinam rechecked it | 18:24 |
fungi | for vexxhost peeps who may see this in scrollback, the server is 16acb0cb-ead1-43b2-8be7-ab4a310b4e0a and the volumes it logged problems writing to are de64f4c6-c5c8-4281-a265-833236b78480 and then d1884ff4-528d-4346-a172-6c9abafb8cdf in chronological order | 18:26 |
fungi | down to 780 images in rax-iad now, so still chugging along | 18:35 |
Clark[m] | frickler: fungi: pretty sure deleting images in glance does not delete any related import tasks or related swift data. This is my major complaint with that system, after the "it's completely undocumented and non-standard" part. Basically the service imposes massive amounts of state tracking on the end user, and you don't really know it's necessary either | 20:12 |
fungi | fwiw, i can't tell how to get osc to list glance tasks | 20:27 |
fungi | but i haven't gone digging in the docs yet, just context help | 20:27 |
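If osc really has no task subcommand, the image API itself exposes a v2 tasks endpoint that can be queried directly; a rough sketch, assuming the deployment enables the tasks API at all (public clouds frequently don't) and using a placeholder endpoint URL:

    # fetch a token and list glance tasks via the raw v2 API
    TOKEN=$(openstack token issue -f value -c id)
    curl -s -H "X-Auth-Token: $TOKEN" "https://glance.example.com/v2/tasks" | python3 -m json.tool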
fungi | rax-iad image count is down to 241, so my initial cleanup pass will hopefully finish within the hour | 20:32 |
fungi | done, leaving 171 images in the list. nodepool says it knows about 34, so i'll put together a new deletion list for the other 137 | 20:55 |
fungi | oh, actually that wasn't --private so the count included public images too. actual private image count is 130 meaning we have 96 still to clean up | 20:58 |
fungi | er, 92 actually leaked | 21:00 |
fungi | slow-deleting those now | 21:01 |
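A sketch of the kind of comparison used to build that deletion list; nodepool-known-ids.txt is a hypothetical file holding the external image IDs nodepool still tracks for this provider (extracted however is convenient from `nodepool image-list`), one UUID per line:

    # provider-side private image IDs
    openstack image list --private -f value -c ID | sort > provider-images.txt
    # IDs nodepool still knows about, sorted for comm
    sort nodepool-known-ids.txt > nodepool-images.txt
    # anything the provider has but nodepool doesn't is a leak candidate
    comm -23 provider-images.txt nodepool-images.txt > leaked-images.txt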
Clark[m] | I don't know if you can list tasks. The whole glance task system exists in a corner of no documentation and little testing | 21:09 |
fungi | also frustration and gnashing of teeth | 21:10 |
fungi | okay, after the latest pass, `openstack image list` for rax-iad has 39 entries compared to the 34 from `nodepool image-list` | 21:23 |
fungi | i'll see if i can work out what's up with the other 5 | 21:23 |
fungi | no, i keep counting the decorative lines | 21:24 |
fungi | it's really 35, so only 1 that nodepool has no record of | 21:24 |
fungi | um, it's a rockylinux-9 image uploaded 2023-08-08T20:13:42Z (a little over an hour ago) | 21:27 |
fungi | did the upload pause not actually work? | 21:27 |
fungi | though i guess the good news is that the builders are successfully uploading images to rax-iad now | 21:27 |
fungi | the bad news is that i may have deleted some recently uploaded images | 21:27 |
Clark[m] | If the upload was in progress when you paused I think it is allowed to finish | 21:27 |
Clark[m] | But I'm not sure of that | 21:28 |
fungi | the deploy pipeline reported success on the pause change at 16:25:18, so almost 4 hours before that was uploaded | 21:29 |
fungi | i guess worst case we'll have some boot failures in rax-iad for the next ~day | 21:30 |
fungi | oh, actually that image is also "leaked" (it doesn't appear in nodepool image-list) | 21:31 |
fungi | but still suggests that the builders are continuing to upload images | 21:32 |
fungi | and yeah, no successful uploads to rax-iad still according to nodepool image-list | 21:34 |
fungi | okay, i think these are deferred image uploads being processed by glance tasks. the last image uploads logged were at 17:40 | 21:41 |
fungi | so i guess i'll watch for a while to see if the image count goes up any more | 21:42 |
fungi | yeah, now a bionic image just showed up, created timestamp 2023-08-08T20:48:04Z updated at 2023-08-08T21:37:55Z | 21:44 |
fungi | image name on that one was ubuntu-bionic-1691511227 and `date -d@1691511227` reports "2023-08-08T16:13:47 UTC" | 21:48 |
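Since the numeric suffix on nodepool image names decodes as a unix epoch timestamp (as the `date` call above shows), the name alone dates the build; a quick sketch:

    # strip the epoch serial off the image name and print it as UTC
    name=ubuntu-bionic-1691511227
    date -u -d "@${name##*-}" +%Y-%m-%dT%H:%M:%SZ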
fungi | so yes, seems like glance tasks are running on the uploads, but are severely backlogged, and so nodepool gives up after the timeout response thinking the upload failed, but then the upload actually appears hours later | 21:48 |
fungi | and since nodepool assumes it was never there, a leak ensues | 21:49 |
fungi | if the backlog is predictable, then we should hopefully cease seeing new ones appear after roughly 23:00 utc | 21:52 |
fungi | so about another hour | 21:53 |
fungi | i'll try to check back then and do another cleanup pass, or may not get to it until tomorrow | 21:53 |
*** dviroel_ is now known as dviroel | 22:13 |
fungi | the count seems to have stabilized with only a few stragglers. i'll clean them up in a bit | 23:01 |