Wednesday, 2025-09-17

*** ykarel__ is now known as ykarel08:00
opendevreviewPierre Crégut proposed openstack/diskimage-builder master: root password for dynamic-login made simpler  https://review.opendev.org/c/openstack/diskimage-builder/+/961449  08:04
opendevreviewMichal Nasiadka proposed opendev/system-config master: UCA: Add Flamingo  https://review.opendev.org/c/opendev/system-config/+/961460  09:04
mnasiadkaclarkb, fungi: ^^ that's why we're not using UCA mirror ;-)09:11
opendevreviewPierre Crégut proposed openstack/diskimage-builder master: root password for dynamic-login made simpler  https://review.opendev.org/c/openstack/diskimage-builder/+/961449  09:42
opendevreviewTakashi Kajinami proposed openstack/project-config master: Migrate propose-updates job for p-o-i in noble  https://review.opendev.org/c/openstack/project-config/+/961475  12:30
opendevreviewTakashi Kajinami proposed openstack/project-config master: Migrate propose-updates job for p-o-i to noble  https://review.opendev.org/c/openstack/project-config/+/961475  12:46
fungimnasiadka: that makes far more sense. the ubuntu mirror is back up to date btw13:49
mnasiadkafungi: thanks - I'll recheck and see how the kolla build is feeling13:50
fungimirror.ubuntu rw volume move from afs02.dfw to afs01.dfw took a little over 42 hours, i'll start on the final afs02.dfw upgrade now13:51
fungi57 rw volumes on afs01.dfw, 0 on afs02.dfw or afs01.ord so this is looking just like it should now13:52
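For reference, per-server volume counts like these can be double-checked with the OpenAFS vos tooling. A minimal sketch, assuming -localauth as root on one of the fileservers:
  # volumes physically hosted on a fileserver (RW volumes lack the .readonly/.backup suffix)
  vos listvol -server afs01.dfw.openstack.org -localauth
  # VLDB view restricted to entries with a site on that server
  vos listvldb -server afs01.dfw.openstack.org -localauth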
*** dansmith_ is now known as dansmith14:39
fungiafs02.dfw is now rebooted into noble and in the process of reinstalling our openafs ppa builds14:50
clarkbcorvus: I was thinking about ze11 a bit more this morning and wondered "what if we just delete it?"15:10
clarkbcorvus: specifically wondering if we need 12 executors anymore or if we could get away with 11 (or 10 if we drop ze12 to avoid confusion)15:10
fungiokay, afs02.dfw is back up and running as a fileserver again, and out of the emergency disable list finally15:17
fungigonna go grab lunch out and run a quick errand, then start looking at mirror content cleanup15:18
clarkbnice!15:21
clarkbI'm going to pop out in a little bit too to get a bike ride in before it gets too warm15:22
opendevreview임채민 proposed openstack/project-config master: Add Weblate support to translation update pipeline  https://review.opendev.org/c/openstack/project-config/+/961499  15:33
clarkbfungi: when you get back https://review.opendev.org/c/opendev/system-config/+/961410 should be a noop update to close out the container stuff I've been working on lately (everything already explicitly uses docker now I just want to ensure our default matches that)15:34
opendevreviewWade Carpenter proposed zuul/zuul-jobs master: Bump bazelisk version to v1.27.0  https://review.opendev.org/c/zuul/zuul-jobs/+/961512  16:15
corvusclarkb: i think it's likely that 10 is okay, and more likely that 11 is okay.  i'm fine with deleting it.16:31
corvuswe need to do something to clear out the statsd gauges16:31
mnasiadkafungi: looks good (ubuntu mirror)16:57
fungimnasiadka: thanks for confirming. i've approved the flamingo uca addition as well16:59
mnasiadkaThanks17:00
opendevreviewStephen Finucane proposed openstack/project-config master: Initiate retirement of shade  https://review.opendev.org/c/openstack/project-config/+/961522  17:05
opendevreviewStephen Finucane proposed openstack/project-config master: Retire shade  https://review.opendev.org/c/openstack/project-config/+/961524  17:09
fungii think we're going to need to increase the mirror.centos-stream volume quota, unless there's content we can clean up... it's over 99% used for its 400gb quota17:21
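The usage and quota numbers for a mirror volume can be read with the AFS fs tooling, and the quota raised the same way. A rough sketch, assuming the usual /afs/openstack.org/mirror/<name> mount path and admin rights for the setquota step:
  # show quota, usage and percent used for the volume mounted here
  fs listquota /afs/openstack.org/mirror/centos-stream
  # raise the quota; the value is in kilobyte blocks (roughly 450GB here)
  fs setquota -path /afs/openstack.org/mirror/centos-stream -max 450000000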
*** clif1 is now known as clif17:32
Clark[m]corvus: I seem to recall that required rewriting the whisper files? We could also try to filter the node out of grafana as a workaround?17:56
Clark[m]fungi: someone like spotz might have ideas? But aiui they just don't clean up packages when updated in the same way debuntu do so the growth over time is larger17:56
Clark[m]I'm not opposed to increasing the quota but then trying to recruit someone who might understand better how to trim it down17:57
opendevreviewJeremy Stanley proposed opendev/system-config master: Clean up OpenEuler mirroring infrastructure  https://review.opendev.org/c/opendev/system-config/+/959892  17:58
opendevreviewJeremy Stanley proposed opendev/system-config master: Stop updating and delete OpenEuler mirror content  https://review.opendev.org/c/opendev/system-config/+/961528  17:58
corvusClark: i think we can either just emit a 0 gauge, or revisit the statsd timeout values for gauges (because i think there's a setting for when they time out eventually).  maybe if we change that setting we have to rewrite files?  maybe that's what you're remembering?17:59
Clark[m]Ya maybe that is what I remember. It would've been from the puppet days so quite some time ago18:02
corvushttps://github.com/statsd/statsd/blob/master/exampleConfig.js#L64  18:04
corvusi think deleteGauges and gaugesMaxTTL would be what we want to set18:05
corvusi think we've been hesitant to set those, because we have a number of gauges that may not emit very often18:06
corvusbut maybe 24h would be an okay value?18:06
corvushttps://review.opendev.org/246610 is a relevant change18:06
opendevreviewJames E. Blair proposed opendev/system-config master: Delete statsd gauges after 24h of inactivity  https://review.opendev.org/c/opendev/system-config/+/961530  18:10
fungiClark[m]: is the new stack of 961528 and 959892 what you were suggesting?18:11
clarkbfungi: yes. I did have one minor thing about test updates on the child change18:24
clarkbcorvus: ya I guess for zuul jobs it isn't uncommon to run them infrequently18:26
clarkband we probably can't easily separate zuul the service from zuul the workload metrics right now?18:26
clarkbinfra-root I noticed zuul seemed oddly busy and looking at https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-6h&to=now&timezone=utc I think our ratio between building nodes and in-use/available seems odd. I wonder if we're struggling to successfully boot in a region or two18:32
clarkbyes rax, rax-flex, and ovh all look odd to me. osuosl and openmetal seem normalish18:33
opendevreviewJeremy Stanley proposed opendev/system-config master: Clean up OpenEuler mirroring infrastructure  https://review.opendev.org/c/opendev/system-config/+/959892  18:33
corvusclarkb: i don't think any zuul jobs emit gauges, so that's fine.  i think the worry is probably queues or nodepool providers that may not get used as often.  it's possible that we now have enough periodic stats emissions that it's not actually a problem.18:37
clarkbrax flex is complaining about multiple networks found. I guess something in the cloud changed and now we need to be more specific in our configs? rax-iad (classic) is returning 413 errors for listing servers18:37
fungilooking at https://zuul.opendev.org/t/openstack/nodes for nodes that booted successfully and are in-use i don't see any indication that we're able to boot certain distros and not others18:37
clarkbI'm going to dig into the rax flex multiple networks issue first since I think that likely has an actionable resolution on our side of things18:37
fungiso a systemic cause seems more likely, or causes18:37
fungiopenstack network list shows what i would expect (PUBLICNET and opendevzuul-network1)18:39
fungisame in all 3 flex regions18:39
fungicould it be an sdk change?18:40
clarkbeither that or a change to the attributes on those networks (most likely publicnet) causing sdk to get confused18:40
clarkbwe manage opendevzuul-network1 and as far as I know have made no changes to it since they were created and had their mtus reduced18:40
fungiupdated_at 2025-06-25T15:56:12Z18:41
fungithat's for opendevzuul-network1 in SJC318:41
clarkbwhich is probably the mtu update so ya I'm guessing a change to publicnet18:41
clarkbor a code update to sdk18:41
fungiin SJC3 PUBLICNET has updated_at 2025-06-02T17:16:27Z18:42
corvusclarkb: if the resolution is to specify the network id in zuul config, the config attribute name is `networks`18:42
clarkbcorvus: yup I'm reading the sdk docs now18:43
funginothing obvious in the recent openstacksdk release notes18:46
opendevreviewClark Boylan proposed opendev/system-config master: Select the network to use in raxflex  https://review.opendev.org/c/opendev/system-config/+/961537  18:48
clarkbsomething like that I think18:48
fungii guess we were just relying on default detection heuristics and those have suddenly stopped working. i wonder what caused the behavior change18:50
corvusoh you put it in clouds.yaml; i was thinking of the zuul provider config.18:51
corvusso to be clear, my mentioning networks was for that.  is it the same name in the clouds.yaml?18:51
clarkbcorvus: oh I didn't realize that you could configure that via zuul directly. Yes its the same for clouds.yaml let me find a link to the published docs18:52
corvusi think clouds.yaml may be a good choice though.  we also do that for at least another cloud18:52
corvusbut i think we attached it to the region, not the cloud for that one.  does it work for both?18:52
clarkbcorvus: https://docs.openstack.org/openstacksdk/latest/user/config/network-config.html  18:52
clarkbcorvus: yes I think it works for both. The example in ^ does it for the entire cloud not per region18:53
clarkbI could change it to per region if we want to be more explicit18:53
corvussgtm18:53
corvusglobal seems fine18:53
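Roughly the shape of clouds.yaml network configuration the linked openstacksdk doc describes, applied to the networks being discussed. This is only a sketch; the cloud entry name is a placeholder and the actual content of 961537 may differ:
  clouds:
    rax-flex:                         # placeholder entry name
      networks:
        - name: PUBLICNET             # provider network that routes externally
          routes_externally: true
        - name: opendevzuul-network1  # tenant network we manage
          nat_destination: true
          default_interface: true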
clarkbI see the 'openstack.exceptions.HttpException: HttpException: 413 OverLimit Retry...' in rax-ord and rax-iad  but not dfw18:53
clarkband it seems to always be on the /v2/tenantid/servers url18:54
clarkbwhich I think is the url for listing servers. So apparently the data requested is too large when doing that?18:54
clarkbfor OVH GRA1 I see quota errors18:56
clarkbok looking at gra1 more closely there are nodes entering an error state due to 'message': 'No valid host was found. There are not enough hosts available.'18:58
clarkbI think the quota stuff may just be noise and ^ is the reason we're not transitioning from building -> ready. We just happen to be right at quota and some requests are failing for going over18:58
clarkbin BHS1 I see nodes in an error state but they all complain about 'message': 'Quota exceeded for cores: Requested 8, but already used 1024 of 1024 cores'19:00
clarkbone of them 9710ec38-c6b0-4e1f-87d1-90aea7e5b62f was created about 11 hours ago19:00
clarkbI'm going to try and manually delete that one and see what happens19:01
clarkb`No Server found for 9710ec38-c6b0-4e1f-87d1-90aea7e5b62f`19:02
clarkbcorvus: so for ovh I think we're not deleting things properly and that is causing a pile up and quota issues19:02
clarkbcorvus: I won't delete any more manually. I just wanted to see if it was possible or if the cloud was preventing those deletions from occurring19:02
clarkbcorvus: the launcher's cleanupNodes() method skips over nodes that are not in a USED state. Is it possible that we just leak all of these errored boots and they build up over time?19:05
clarkbhrm only tests seem to call that method19:05
clarkblooks like cleanupLeakedResource may be the actual thing responsible for this. But it only logs if it fails an action19:07
leonardo-serranoHi, just to confirm, is the error being discussed related to this server https://zuul.opendev.org/t/openstack/status? There seems to be a bit of lag on reviews19:11
clarkbleonardo-serrano: the opendev zuul server isn't provisioning test nodes as quickly as we'd like. That means that zuul won't report CI results as quickly as normal. Is that what you mean by "bit of lag on reviews" ?19:12
clarkbcorvus: if I grep the server's cloud uuid in the launcher debug file I get no hits. If I grep the suffix after np in the hostname I get one match for 'Requested node'19:13
clarkbcorvus: so it does seem like somehow we're losing track of this node and not cleaning it up so that we can try again later19:13
clarkbbut also maybe we need to improve the log lines to include that info?19:14
fungii'm having to wait a bit on the mirror.debian volume cleanup, apparently even though i bos shutdown afs02.dfw before the upgrade, something caused vos release to fail earlier when the server was offline so now the mirror.debian, mirror.debian-security and mirror.ubuntu volumes have been performing a full release for the past several hours (no idea how much longer they'll take since all19:14
fungithree are going in parallel)19:14
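One way to watch a release like this is the volserver transaction list. A sketch, again assuming -localauth on the fileservers:
  # active volserver transactions; a full-dump release shows up here while it runs
  vos status -server afs02.dfw.openstack.org -localauth
  # VLDB entry and site status for one of the affected volumes
  vos examine mirror.debian -localauth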
corvusclarkb: are you talking about instance 9710ec38-c6b0-4e1f-87d1-90aea7e5b62f ?  if so what was its node id?19:16
clarkbcorvus: yes that one. Its name was npea8939bf25d24 in ovh gra119:16
clarkbsorry ovh bhs1 not gra119:16
corvusclarkb: that node was built and deleted on zl02 and the entire process took 6 seconds19:17
corvus2025-09-17 08:24:54,895 DEBUG zuul.Launcher: [e: 521b48788f37421eb3e23ebc42189614] [req: e656a95f49744572b6ac31cde4f69672] Building node <OpenstackProviderNode uuid=ea8939bf25d2412ea595ee5f0f3d4fc1, label=ubuntu-noble, state=requested, provider=opendev.org%2Fopendev%2Fzuul-providers/ovh-bhs1-main>19:18
corvus2025-09-17 08:25:00,805 INFO zuul.Launcher: [e: 521b48788f37421eb3e23ebc42189614] [req: e656a95f49744572b6ac31cde4f69672] Marking node <OpenstackProviderNode uuid=ea8939bf25d2412ea595ee5f0f3d4fc1, label=ubuntu-noble, state=building, provider=opendev.org%2Fopendev%2Fzuul-providers/ovh-bhs1-main> as tempfailed due to Quota exceeded19:18
corvus2025-09-17 08:25:02,601 DEBUG zuul.Launcher: [e: 521b48788f37421eb3e23ebc42189614] [req: e656a95f49744572b6ac31cde4f69672] Removing provider node <OpenstackProviderNode uuid=ea8939bf25d2412ea595ee5f0f3d4fc1, label=ubuntu-noble, state=tempfailed, provider=opendev.org%2Fopendev%2Fzuul-providers/ovh-bhs1-main>19:18
leonardo-serranoclarkb: Yes, that was it. Just wanted to confirm, thanks19:18
clarkboh right we have multiple launchers. the node was not deleted in the cloud19:18
corvusbased on the log entries, i suspect that the cloud may not have given us an instance id19:19
corvuswhich, makes sense, because there should be no instance if we get a quota error19:19
clarkband I guess we skip over trying to delete the resource if there is no resource id to delete19:20
corvusyep.  but if it shows up later, it should still get deleted as a leaked resource19:20
corvuswhat was the metadata on that instance?19:20
clarkbcorvus: properties | zuul_node_uuid='ea8939bf25d2412ea595ee5f0f3d4fc1', zuul_system_id='99bcc7878aba48d697497abd56e737b6'19:21
clarkbcorvus: there are other examples in that cloud that have not been deleted if you want to look at one that is still there19:21
clarkbcorvus: 0cbca713-6cd3-46bf-9213-1ed03317c673 created 09:13 UTC ish today in a similar state19:22
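The zuul_node_uuid and zuul_system_id properties quoted above are the metadata the launcher's leak detection matches on (see the later discussion of pre-August instances lacking it), so inspecting a suspect instance looks something like this, with the cloud name as a placeholder:
  openstack --os-cloud <cloud> --os-region-name BHS1 \
    server show 0cbca713-6cd3-46bf-9213-1ed03317c673 \
    -f value -c status -c properties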
corvusi think we have a minor mystery of why those error instances aren't being deleted as leaked resources, but they shouldn't be taking up any actual quota in the cloud so they shouldn't be a factor19:22
corvusi think the bigger issue is that the quota on the account doesn't match the cloud capacity, so zuul continually hits those errors.19:23
fungiare we certain that instances in error state don't count towards quota?19:24
clarkbcorvus: I think as long as nova lists them they count against our quota19:24
clarkbfungi: no I'm pretty sure they do19:24
corvusi'd love for someone to explain that to me :)19:24
corvusshow me which cores it's using19:25
clarkbI think the way nova calculates quotas is via its database not via libvirt19:25
clarkbso as long as the items are in the database it counts19:25
fungii had previously assumed they did, but don't recall whether it was based on prior experience or just guesswork19:25
clarkbmelwitt: might be able to confirm?19:25
fungii agree from an accounting perspective it's rather absurd to limit or bill a customer based on resources they can't use because they're in error, but maybe some errors are temporary and it's to avoid those things suddenly exceeding the quota if they transition to active later19:26
corvusthe specific server we're looking at is in error state because the cloud decided it didn't have enough resources to run it19:27
clarkbyes but somehow that server also has a uuid and a flavor attached to it19:27
corvussure, but those aren't resources19:27
corvusit doesn't have any cores or ram or disk attached to it19:27
fungiwas it one of the "no valid host was found" errors or a different one?19:27
corvusyes that19:28
clarkbfungi: its 'message': 'Quota exceeded for cores: Requested 8, but already used 1024 of 1024 cores'19:28
corvusclarkb: oh it was that error?19:28
clarkbit's been a while since I dug into nova quota stuff but I'm pretty sure it takes a naive view and anything listed in `nova list` counts against your quota if it has a flavor19:28
fungioh, quota exceeded is an even more strange case. i wouldn't have expected it to even try to create anything in that case19:28
clarkbI agree that ideally nova should only count quota against actual usage19:29
clarkbcorvus: yes the server exists and that is the recorded fault19:29
corvusclarkb: you said `No Server found for 9710ec38-c6b0-4e1f-87d1-90aea7e5b62f`19:29
corvusi thought that was the instance we were discussing19:29
clarkbcorvus: after I deleted it. Sorry I was trying to say deletes work. The bug isn't in the cloud failing to delete the instance19:29
clarkbthe problem is that zuul appears to not be deleting any of these instances. So they accumulate, count against our quota and now we have more errors which probably snowballs19:30
corvusokay where did you see the quota error?19:30
clarkband 0cbca713-6cd3-46bf-9213-1ed03317c673 is the same state as 9710ec38-c6b0-4e1f-87d1-90aea7e5b62f was before I deleted it but this instance is not deleted19:30
clarkbcorvus: server show 0cbca713-6cd3-46bf-9213-1ed03317c67319:30
clarkb(or server show 9710ec38-c6b0-4e1f-87d1-90aea7e5b62f before the deletion)19:31
corvus2025-09-17 08:25:00,804 ERROR zuul.Launcher: [e: 521b48788f37421eb3e23ebc42189614] [req: e656a95f49744572b6ac31cde4f69672] Launch attempt failed: Error in creating the server. Compute service reports fault: Quota exceeded for cores: Requested 8, but already used 1024 of 1024 cores19:31
corvusthe launcher got that in an api response as well19:31
corvus(nonetheless, if we get a server id back in the response, we will store it and try to delete it; since we didn't delete it, that suggests that we didn't get one back)19:32
clarkbbut then we should still delete it in leaked node cleanups which should run every minute?19:33
corvusyep19:34
clarkbthe zuul uuid for the not deleted 0cbca713-6cd3-46bf-9213-1ed03317c673 appears to be 037b3d4dd45c4531b940908ddf1d829d19:35
corvusthe logs for the cleanup worker aren't as chatty; we don't have an "everything is okay" alarm there.  i'll try to see if it's stuck or just thinks it has nothing to do.19:37
corvusnot stuck19:38
clarkblooking at rax-iad we seem to have a similar problem with a number of nodes stuck in an error state. However most of them are from last year19:39
clarkbI will try manually deleting one of them next19:39
clarkbthose servers don't appear to have properties set on them so cleanup can't handle them19:40
corvusanything before august probably isn't going to be detected as leaked because they won't have matching metadata19:40
corvus(so we should just delete them without further thought)19:40
clarkbyup. I'm trying on one first to see if it goes away or if we need cloud intervention19:41
clarkb929edb89-4d6a-4a00-a4b3-4794d05f279d is the cloud uuid for that in rax-iad19:41
corvusi'm going to dig into the leak algorithm now, i'll be occupied with that for a little while.19:41
clarkbsounds good. I'm going to work on cleaning up these old error'd nodes from iad (and then check if ord is similar)19:41
clarkbI won't manually delete anymore that look recent and have the metadata in place for cleanup to handle them19:42
clarkbthey were easy to identify as they all had the old style np0038 name prefix from nodepool19:45
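A sketch of that kind of manual sweep, with the cloud name as a placeholder; --name takes a regex, so it matches the old nodepool np0038 prefix:
  # find leftover nodepool-era instances and their states
  openstack --os-cloud <cloud> server list --name '^np0038' --long
  # then, for each one confirmed stale:
  openstack --os-cloud <cloud> server delete <uuid>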
clarkbthere are two old nodepool nodes in iad. One says it is 'DELETED' and the other is 'ACTIVE' I'll leave those alone for the moment19:50
clarkbORD has a number of old nodepool nodes in an ERROR state just like IAD. I'm going to try deleting them next. Note it also has a number of ACTIVE nodepool nodes that I am not deleting yet19:51
clarkb'message': 'DBDeadlock' <- that is from one of the ord error nodes. That is a new error to me19:52
fungithat's an interesting amount of implementation detail exposed to the end user19:53
clarkbthat one does not want to delete. I spot checked two others and they have the same fault and are all in a deleting task state (server state is ERROR)19:55
clarkbI suspect for those we may need cloud intervention19:55
melwittclarkb: that's right, nova quotas count resources in the database (instances' flavors for example) if [quota]driver=nova.quota.DbQuotaDriver, and if [quota]driver=nova.quota.UnifiedLimitsDriver it will count resources in the Placement service when possible19:55
clarkbmelwitt: thank you for confirming19:56
clarkbmelwitt: does it take the floor of db and placement if there is a disagreement?19:57
melwittclarkb: no it won't do both at the same time, nova can be configured for one or the other19:57
clarkbgot it thanks19:57
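To see what the cloud itself is counting against the cores quota (the database-derived numbers described above), something along these lines works; exact field names can vary by deployment:
  # absolute compute limits, including maxTotalCores and totalCoresUsed
  openstack --os-cloud <cloud> limits show --absolute | grep -i cores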
clarkbthe old nodepool nodes that are ACTIVE in ORD lack nodepool properties entirely so I think this is why they didn't get garbage collected forever ago19:58
clarkbI suspect that I can manually delete those given their age and situation. I'm happy to do that but would be good for someone else to confirm they think that is safe. The main risk would be to held nodes I guess?19:59
clarkbfungi: re rax-flex looks like things changed around the 12th-13th20:02
clarkbfungi: that coincides with the previous automated upgrade of zuul so it could be an sdk update in the week prior?20:02
corvusi see a problem with the cleanup method. i believe i can easily monkeypatch the running processes, so i'm going to do that, which should kick off a round of deletions 2 minutes after i'm done.20:03
clarkbI am currently unable to list servers in dfw20:06
clarkbI get a 503 error20:06
corvuslaunchers patched20:06
clarkbGRA1 looks cleaner now. But 0cbca713-6cd3-46bf-9213-1ed03317c673 is still in BHS1 and there are still a number of ERROR'd nodes there (could be we just need to wait a bit for all the api requests to be made and get processed)20:09
fungiclarkb: openstacksdk 4.7.1 was released the day before zuul was upgraded, so a possible candidate. 4.7.0 was several weeks prior so i feel like that would probably have been in use for a while20:09
fungithe 4.7.1 release notes are very sparse though20:10
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/961547 Launcher: fix endpoint_score [NEW]        20:10
corvusthe TLDR is that both launchers thought the other was handling it.20:10
clarkb0cbca713-6cd3-46bf-9213-1ed03317c673 is gone now20:10
corvusthere are many deletions from ord and bhs120:10
corvusand gra120:11
clarkbcorvus: I expect a lot of those ord deletes to fail based on my earlier manual tests20:11
clarkbcorvus: but I'll be happy if they succeed20:11
clarkbI'll go review the fixup now20:11
corvusit'll keep trying20:11
corvusa bunch from iad now too20:12
clarkbfwiw I think part of the problem right now is that rax-flex is grabbing requests and needing to fail X times before someone else gets a chance20:12
clarkbthe confluence of problems made things extra bad. So now that we're addressing the different pieces it should slowly start catching up again (I hope)20:13
fungiwe could temporarily hand-patch 961537 onto the launchers20:13
corvusi'm in favor of that and happy to do it20:14
clarkbtrue, corvus do you know if that also requires a launcher restart?20:14
clarkbcorvus: no objections from me if you want to do that20:14
clarkb(the hourly jobs will undo it but we can put hosts in the emergency file or just hope that the 50-ish minutes to the next run is enough time to get that change landed)20:14
corvusi don't know, but i suspect it does.  i'm happy to do that and re-apply my monkeypatch.20:14
clarkb+2 on the launcher fix change. That was a surprisingly small edit20:16
clarkbcorvus: I'm still seeing large numbers of requests sent to openmetal and ovh but not rax classic20:20
clarkbI wonder if rax classic still thinks it is at quota or something20:21
corvuswasn't there an api error from one of the classic regions?20:21
clarkbcorvus: yes I was getting the 413 errors there against /v2/tenantid/servers I thought that was for listing servers but now I wonder if that is the response you get when over quota? Let me dig into that a bit more20:22
corvusif we can't list servers we can't build them20:22
clarkbrax-iad-main> usage factor: 0.21465968586387435 is a recent log entry so it seems to know that it isn't at quota20:23
clarkboh and ya rax dfw was returning 503 errors when I attempted to manually list nodes there (I wanted to check for possible nodepool node cleanups like I did for iad and ord)20:24
clarkbcorvus: in ord we have a number of old active servers from the nodepool era. They don't appear to have properties/metadata set so I don't think they will automatically get deleted. Should I manually delete them at this point? Is that safe (thinking about possible held nodes)20:25
corvusclarkb: yes, i think we should delete them.  i tried to ping people about held nodes, but didn't get a response, and i deleted them from zuul 2 days ago anyway.20:26
clarkbok I'll attempt to clear those out of ord now20:27
corvus2025-09-17 20:27:26,442 DEBUG zuul.Launcher: [e: 89af73402813442b8ac38dbb6bd18d3a] [req: 5e9297bf85804a5fa0588b3f5eb0346d] Node <OpenstackProviderNode uuid=17af3b74730c4cd696392cd10ea6bd47, label=ubuntu-jammy-8GB, state=building, provider=opendev.org%2Fopendev%2Fzuul-providers/raxflex-dfw3-main> advanced from submit creating server to complete20:28
corvusthat looks good for flex20:28
corvusi've restarted both launchers with the network fix manually applied.  re-applying the cleanup fix now.20:28
clarkbthanks!20:30
corvusdone20:32
clarkbORD has three nodepool nodes from july so not as old as the ones I just deleted. Deleting those should be safe too since you've severed all ties with nodepool held nodes at this point right?20:33
corvusyep20:33
corvus2025-09-17 20:34:16,359 ERROR zuul.Launcher:   keystoneauth1.exceptions.connection.ConnectTimeout: Request to https://dfw.servers.api.rackspacecloud.com/v2/637776/flavors/detail timed out20:34
clarkbya I suspect that region is sad. I want to say this behavior was similar to what we saw in iad/ord previously then we emailed cloudnull and things got better20:35
clarkbmaybe dfw needs similar intervention?20:35
clarkbin rax-ord the nodepool nodes are all either ERROR and not deletable or DELETING. I've cleared out the others20:35
clarkbI'll scan the other clouds too and apply similar cleanups if necessary20:36
*** mnaser_ is now known as mnaser20:36
clarkbok rax-ord is the only provider I see old nodepool nodes in. They are all in an ERROR state and I think we can't delete them as end users as a result. Note I haven't checked rax-dfw due to errors listing servers there20:43
clarkbI think we've addressed most of what we can (other than actually landing the permanent code fixes for the two issues) and now it's a matter of allowing the requests to rebalance across happier providers20:44
fungijudging from https://grafana.opendev.org/d/9871b26303/afs?orgId=1&from=now-6h&to=now&timezone=utc&viewPanel=panel-6 i suspect those drops in afs02.dfw usage are the old ro replicas for mirror.debian, mirror.debian-security and mirror.ubuntu being dropped and we're waiting for the slow climb back up to around its prior size, so maybe finishing sometime tomorrow20:44
clarkbfungi: afs02 deleted its ro volumes?20:45
clarkbI guess if it decides it is too out of date it does that maybe?20:45
fungii'm not sure "deleted" is the appropriate term20:45
fungiwell, maybe delete is the appropriate term20:46
fungiThis is a recovery of previously failed release20:46
fungiStarting transaction on cloned volume 536870983... done20:46
fungiDeleting extant RO_DONTUSE site on afs02.dfw.openstack.org... done20:46
fungiCreating new volume 536870983 on replication site afs02.dfw.openstack.org:  done20:46
fungiThis will be a full dump: previous release failed20:47
fungiStarting ForwardMulti from 536870983 to 536870983 on afs02.dfw.openstack.org (entire volume).20:47
clarkbinteresting so ya seems like it starts over and when it does so it doesn't try to use any of the old existing data20:47
clarkbwhich is probably very inefficient for this particular use case20:47
clarkbgate and check queue lengths are slowly shrinking (as things merge other pipelines are growing so I'm mostly ignoring them and using gate and check as the pipelines to monitor)20:49
fungithe graph shows it's recovering at a rate of about 0.1tb/hr and needs to recover about another 1.5tb so 15 hours to go, roughly?20:49
clarkbzoom zoom20:49
fungishould be finishing around the time i wake up, though i may not end up doing the stretch cleanup from mirror.debian and mirror.debian-security until friday since i wasn't really planning to be around tomorrow20:50
clarkbthat should be fine. That content has been there for some time20:51
opendevreviewMerged opendev/system-config master: Switch generic container role image builds back to docker  https://review.opendev.org/c/opendev/system-config/+/961410  20:52
fungiwe're not really at risk of running out of space on the fileservers any time soon regardless, we've got nearly a tb free on the two dfw servers20:52
clarkbhourlies are starting up which will undo the rax flex fix for the launchers. However if we have to restart them to pick up the change it won't matter21:02
clarkbI'll do my best to monitor that via logs and grafana21:03
fungiif it's restarting the launchers, then the cleanup handpatch will get undone too right?21:03
clarkbfungi: I don't think it restarts them. Only the weekly reboots do that21:03
clarkbso it depends on whether or not it reloads the cloud configs automatically or if it needs a restart21:04
fungiin that case they also may not pick up the clouds.yaml revert either21:04
clarkbyup it may be fine and just work as long as we don't restart the launchers21:04
clarkbas for rax-dfw and rax-ord I'm thinking wait to see if things are happier in dfw tomorrow then either way send an email to try and get rax-ord cleaned up and optionally ask about dfw's state in that email depending on its state tomorrow21:05
fungiand then once the config change deploys it'll be back in sync until the weekend, when it will hopefully be upgrading onto the cleanup fix proper anyway21:05
clarkbexactly21:05
clarkbcorvus: do you think we should disable rax-dfw in the meantime? that should get it out of the endpoint calculations right?21:06
clarkbdoing a server list seems to be a reasonable canary for that cloud so I don't think we need zuul-launcher running against it to determine if it is happy or sad21:07
opendevreviewMerged opendev/system-config master: UCA: Add Flamingo  https://review.opendev.org/c/opendev/system-config/+/961460  21:07
clarkbwow one kolla-ansible change runs >60 jobs in check21:08
clarkbthe vast majority of which are non voting. but thats like 10% of our entire capacity minimum for one change21:08
fungimegacheck21:08
clarkb(I didn't bother to see if those jobs are multinode in which case it can quickly become >10%)21:09
clarkbI think the remaining slowness is largely due to openmetal having had a bunch of requests assigned to it before we fixed the other clouds. So everything assigned there is going through a small funnel21:12
clarkbnewly enqueued items seem to avoid this problem and are handled more quickly21:12
clarkbthe requested value peaked at 124 for openmetal and is down to 56 now21:13
corvusclarkb: yeah, i think if it's kaput we should pull it out21:14
clarkbworking on a change now21:14
opendevreviewClark Boylan proposed opendev/zuul-providers master: Disable rax-dfw due to API errors  https://review.opendev.org/c/opendev/zuul-providers/+/961553  21:16
opendevreviewMerged opendev/zuul-providers master: Disable rax-dfw due to API errors  https://review.opendev.org/c/opendev/zuul-providers/+/961553  21:20
clarkbbased on the graphs for rax flex I don't think the hourly jobs impacted the running service21:21
clarkbas long as my change and corvus' update to the launcher land before we next restart (friday at the latest) we should be good there21:21
fungii have a feeling a little while after that deploys we'll be able to openstack server list successfully, if it's a request backlog pileup leading to backend timeouts like we saw previously21:22
fungier, i guess there's no deploying necessary, it should be picked up by zuul as soon as it merges?21:24
clarkbya looks like grafana already reflects it21:25
clarkbI'll try to run a server list against dfw in a bit after we give it a little time to chill21:31
fungiyeah, i'll wager it starts responding again once we're no longer hammering it for a while21:33
corvuswere we hammering it?  was the rate of server create calls too high?  if so, we can tune the rate in the provider21:41
fungii suspect "hammering" was not very fast at all in this case21:42
fungijust that calls on the backend were taking too long to complete, queuing up, and we were making new requests faster than the queue was clearing out21:42
clarkbserver list did just work21:42
fungiat least that's how it was acting in iad a while back21:42
clarkbthere are a number of old nodepool nodes in this region too21:43
clarkblooks like from around july. I'll go ahead and delete them next to match what I did in ord and iad21:43
corvuslmk if we need to change the rate limit21:44
clarkbcorvus: last time with iad they identified an issue and corrected it on the cloud side21:45
corvusit's currently 2 requests/second now (per launcher)21:45
clarkbso I'm hopeful that just needs to be repeated here21:45
clarkbsince we know that things seem happier after disabling the region I'll go ahead and start drafting that email for ord and dfw now21:47
clarkbthe old nodepool instances are gone21:47
clarkbemail sent, y'all should be cc'd too22:04
fungithanks!22:04
clarkband openmetal has caught up now so we should be more business as usual at this point22:04
opendevreviewMerged opendev/system-config master: Select the network to use in raxflex  https://review.opendev.org/c/opendev/system-config/+/961537  22:27
clarkbthat didn't actually trigger the zuul deploy job. I'll double check after the next round of hourlies that the clouds.yaml look correct22:33
fungiyeah, only ran infra-prod-bootstrap-bridge22:51
clarkb/etc/openstack/clouds.yaml was updated at 23:05 on both zl01 and zl02 and both files contain the networks list again23:17
clarkbso I'm happy that we've updated that correctly and the fix for the launcher cleanup stuff landed as well so we should be all set for the friday restarts23:17
fungiperfect23:18
