Wednesday, 2025-09-17

*** ykarel__ is now known as ykarel08:00
opendevreviewPierre Crégut proposed openstack/diskimage-builder master: root password for dynamic-login made simpler  https://review.opendev.org/c/openstack/diskimage-builder/+/961449  08:04
opendevreviewMichal Nasiadka proposed opendev/system-config master: UCA: Add Flamingo  https://review.opendev.org/c/opendev/system-config/+/961460  09:04
mnasiadkaclarkb, fungi: ^^ that's why we're not using UCA mirror ;-)09:11
opendevreviewPierre Crégut proposed openstack/diskimage-builder master: root password for dynamic-login made simpler  https://review.opendev.org/c/openstack/diskimage-builder/+/961449  09:42
opendevreviewTakashi Kajinami proposed openstack/project-config master: Migrate propose-updates job for p-o-i in noble  https://review.opendev.org/c/openstack/project-config/+/961475  12:30
opendevreviewTakashi Kajinami proposed openstack/project-config master: Migrate propose-updates job for p-o-i to noble  https://review.opendev.org/c/openstack/project-config/+/961475  12:46
fungimnasiadka: that makes far more sense. the ubuntu mirror is back up to date btw13:49
mnasiadkafungi: thanks - I'll recheck and see how the kolla build is feeling13:50
fungimirror.ubuntu rw volume move from afs02.dfw to afs01.dfw took a little over 42 hours, i'll start on the final afs02.dfw upgrade now13:51
fungi57 rw volumes on afs01.dfw, 0 on afs02.dfw or afs01.ord so this is looking just like it should now13:52
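For reference, per-server volume counts like these can be double-checked with the OpenAFS vos tooling. A minimal sketch, assuming -localauth as root on one of the fileservers:
  # volumes physically hosted on a fileserver (RW volumes lack the .readonly/.backup suffix)
  vos listvol -server afs01.dfw.openstack.org -localauth
  # VLDB view restricted to entries with a site on that server
  vos listvldb -server afs01.dfw.openstack.org -localauth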
*** dansmith_ is now known as dansmith14:39
fungiafs02.dfw is now rebooted into noble and in the process of reinstalling our openafs ppa builds14:50
clarkbcorvus: I was thinking about ze11 a bit more this morning and wondered "what if we just delete it?"15:10
clarkbcorvus: specifically wondering if we need 12 executors anymore or if we could get away with 11 (or 10 if we drop ze12 to avoid confusion)15:10
fungiokay, afs02.dfw is back up and running as a fileserver again, and out of the emergency disable list finally15:17
fungigonna go grab lunch out and run a quick errand, then start looking at mirror content cleanup15:18
clarkbnice!15:21
clarkbI'm going to pop out in a little bit too to get a bike ride in before it gets too warm15:22
opendevreview임채민 proposed openstack/project-config master: Add Weblate support to translation update pipeline  https://review.opendev.org/c/openstack/project-config/+/961499  15:33
clarkbfungi: when you get back https://review.opendev.org/c/opendev/system-config/+/961410 should be a noop update to close out the container stuff I've been working on lately (everything already explicitly uses docker now I just want to ensure our default matches that)15:34
opendevreviewWade Carpenter proposed zuul/zuul-jobs master: Bump bazelisk version to v1.27.0  https://review.opendev.org/c/zuul/zuul-jobs/+/961512  16:15
corvusclarkb: i think it's likely that 10 is okay, and more likely that 11 is okay.  i'm fine with deleting it.16:31
corvuswe need to do something to clear out the statsd gauges16:31
mnasiadkafungi: looks good (ubuntu mirror)16:57
fungimnasiadka: thanks for confirming. i've approved the flamingo uca addition as well16:59
mnasiadkaThanks17:00
opendevreviewStephen Finucane proposed openstack/project-config master: Initiate retirement of shade  https://review.opendev.org/c/openstack/project-config/+/961522  17:05
opendevreviewStephen Finucane proposed openstack/project-config master: Retire shade  https://review.opendev.org/c/openstack/project-config/+/961524  17:09
fungii think we're going to need to increase the mirror.centos-stream volume quota, unless there's content we can clean up... it's over 99% used for its 400gb quota17:21
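The usage and quota numbers for a mirror volume can be read with the AFS fs tooling, and the quota raised the same way. A rough sketch, assuming the usual /afs/openstack.org/mirror/<name> mount path and admin rights for the setquota step:
  # show quota, usage and percent used for the volume mounted here
  fs listquota /afs/openstack.org/mirror/centos-stream
  # raise the quota; the value is in kilobyte blocks (roughly 450GB here)
  fs setquota -path /afs/openstack.org/mirror/centos-stream -max 450000000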
*** clif1 is now known as clif17:32
Clark[m]corvus: I seem to recall that required rewriting the whisper files? We could also try to filter the node out of grafana as a workaround?17:56
Clark[m]fungi: someone like spotz might have ideas? But aiui they just don't clean up packages when updated in the same way debuntu do so the growth over time is larger17:56
Clark[m]I'm not opposed to increasing the quota but then trying to recruit someone who might understand better how to trim it down17:57
opendevreviewJeremy Stanley proposed opendev/system-config master: Clean up OpenEuler mirroring infrastructure  https://review.opendev.org/c/opendev/system-config/+/959892  17:58
opendevreviewJeremy Stanley proposed opendev/system-config master: Stop updating and delete OpenEuler mirror content  https://review.opendev.org/c/opendev/system-config/+/961528  17:58
corvusClark: i think we can either just emit a 0 gauge, or revisit the statsd timeout values for gauges (because i think there's a setting for when they time out eventually).  maybe if we change that setting we have to rewrite files?  maybe that's what you're remembering?17:59
Clark[m]Ya maybe that is what I remember. It would've been from the puppet days so quite some time ago18:02
corvushttps://github.com/statsd/statsd/blob/master/exampleConfig.js#L64  18:04
corvusi think deleteGauges and gaugesMaxTTL would be what we want to set18:05
corvusi think we've been hesitant to set those, because we have a number of gauges that may not emit very often18:06
corvusbut maybe 24h would be an okay value?18:06
corvushttps://review.opendev.org/246610 is a relevant change18:06
opendevreviewJames E. Blair proposed opendev/system-config master: Delete statsd gauges after 24h of inactivity  https://review.opendev.org/c/opendev/system-config/+/961530  18:10
fungiClark[m]: is the new stack of 961528 and 959892 what you were suggesting?18:11
clarkbfungi: yes. I did have one minor thing about test updates on the child change18:24
clarkbcorvus: ya I guess for zuul jobs it isn't uncommon to run them infrequently18:26
clarkband we probably can't easily separate zuul the service from zuul the workload metrics right now?18:26
clarkbinfra-root I noticed zuul seemed oddly busy and looking at https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-6h&to=now&timezone=utc I think our ratio between building nodes and in-use/available seems odd. I wonder if we're struggling to successfully boot in a region or two18:32
clarkbyes rax, rax-flex, and ovh all look odd to me. osuosl and openmetal seem normalish18:33
opendevreviewJeremy Stanley proposed opendev/system-config master: Clean up OpenEuler mirroring infrastructure  https://review.opendev.org/c/opendev/system-config/+/959892  18:33
corvusclarkb: i don't think any zuul jobs emit gauges, so that's fine.  i think the worry is probably queues or nodepool providers that may not get used as often.  it's possible that we now have enough periodic stats emissions that it's not actually a problem.18:37
clarkbrax flex is complaining about multiple networks found. I guess something in the cloud changed and now we need to be more specific in our configs? rax-iad (classic) is returning 413 errors for listing servers18:37
fungilooking at https://zuul.opendev.org/t/openstack/nodes for nodes that booted successfully and are in-use i don't see any indication that we're able to boot certain distros and not others18:37
clarkbI'm going to dig into the rax flex multiple networks issue first since I think that likely has an actionable resolution on our side of things18:37
fungiso a systemic cause seems more likely, or causes18:37
fungiopenstack network list shows what i would expect (PUBLICNET and opendevzuul-network1)18:39
fungisame in all 3 flex regions18:39
fungicould it be an sdk change?18:40
clarkbeither that or a change to the attributes on those networks (most likely publicnet) causing sdk to get confused18:40
clarkbwe manage opendevzuul-network1 and as far as I know have made no changes to it since they were created and had their mtus reduced18:40
fungiupdated_at 2025-06-25T15:56:12Z18:41
fungithat's for opendevzuul-network1 in SJC318:41
clarkbwhich is probably the mtu update so ya I'm guessing a change to publicnet18:41
clarkbor a code update to sdk18:41
fungiin SJC3 PUBLICNET has updated_at 2025-06-02T17:16:27Z18:42
corvusclarkb: if the resolution is to specify the network id in zuul config, the config attribute name is `networks`18:42
clarkbcorvus: yup I'm reading the sdk docs now18:43
funginothing obvious in the recent openstacksdk release notes18:46
opendevreviewClark Boylan proposed opendev/system-config master: Select the network to use in raxflex  https://review.opendev.org/c/opendev/system-config/+/961537  18:48
clarkbsomething like that I think18:48
fungii guess we were just relying on default detection heuristics and those have suddenly stopped working. i wonder what caused the behavior change18:50
corvusoh you put it in clouds.yaml; i was thinking of the zuul provider config.18:51
corvusso to be clear, my mentioning networks was for that.  is it the same name in the clouds.yaml?18:51
clarkbcorvus: oh I didn't realize that you could configure that via zuul directly. Yes its the same for clouds.yaml let me find a link to the published docs18:52
corvusi think clouds.yaml may be a good choice though.  we also do that for at least another cloud18:52
corvusbut i think we attached it to the region, not the cloud for that one.  does it work for both?18:52
clarkbcorvus: https://docs.openstack.org/openstacksdk/latest/user/config/network-config.html  18:52
clarkbcorvus: yes I think it works for both. The example in ^ does it for the entire cloud not per region18:53
clarkbI could change it to per region if we want to be more explicit18:53
corvussgtm18:53
corvusglobal seems fine18:53
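Roughly the shape of clouds.yaml network configuration the linked openstacksdk doc describes, applied to the networks being discussed. This is only a sketch; the cloud entry name is a placeholder and the actual content of 961537 may differ:
  clouds:
    rax-flex:                         # placeholder entry name
      networks:
        - name: PUBLICNET             # provider network that routes externally
          routes_externally: true
        - name: opendevzuul-network1  # tenant network we manage
          nat_destination: true
          default_interface: true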
clarkbI see the 'openstack.exceptions.HttpException: HttpException: 413 OverLimit Retry...' in rax-ord and rax-iad  but not dfw18:53
clarkband it seems to always be on the /v2/tenantid/servers url18:54
clarkbwhich I think is the url for listing servers. So apparently the data requested is too large when doing that?18:54
clarkbfor OVH GRA1 I see quota errors18:56
clarkbok looking at gra1 more closely there are nodes entering an error state due to 'message': 'No valid host was found. There are not enough hosts available.'18:58
clarkbI think the quota stuff may just be noise and ^ is the reason we're not transitioning from building -> ready. We just happen to be right at quota and some requests are failing for going over18:58
clarkbin BHS1 I see nodes in an error state but they all complain about 'message': 'Quota exceeded for cores: Requested 8, but already used 1024 of 1024 cores'19:00
clarkbone of them 9710ec38-c6b0-4e1f-87d1-90aea7e5b62f was created about 11 hours ago19:00
clarkbI'm going to try and manually delete that one and see what happens19:01
clarkb`No Server found for 9710ec38-c6b0-4e1f-87d1-90aea7e5b62f`19:02
clarkbcorvus: so for ovh I think we're not deleting things properly and that is causing a pile up and quota issues19:02
clarkbcorvus: I won't delete any more manually. I just wanted to see if it was possible or if the cloud was preventing those deletions from occurring19:02
clarkbcorvus: the launcher's cleanupNodes() method skips over nodes that are not in a USED state. Is it possible that we just leak all of these errored boots and they build up over time?19:05
clarkbhrm only tests seem to call that method19:05
clarkblooks like cleanupLeakedResource may be the actual thing responsible for this. But it only logs if it fails an action19:07
leonardo-serranoHi, just to confirm, is the error being discussed related to this server https://zuul.opendev.org/t/openstack/status? There seems to be a bit of lag on reviews19:11
clarkbleonardo-serrano: the opendev zuul server isn't provisioning test nodes as quickly as we'd like. That means that zuul won't report CI results as quickly as normal. Is that what you mean by "bit of lag on reviews" ?19:12
clarkbcorvus: if I grep the server's cloud uuid in the launcher debug file I get no hits. If I grep the suffix after np in the hostname I get one match for 'Requested node'19:13
clarkbcorvus: so it does seem like somehow we're losing track of this node and not cleaning it up so that we can try again later19:13
clarkbbut also maybe we need to improve the log lines to include that info?19:14
fungii'm having to wait a bit on the mirror.debian volume cleanup, apparently even though i bos shutdown afs02.dfw before the upgrade, something caused vos release to fail earlier when the server was offline so now the mirror.debian, mirror.debian-security and mirror.ubuntu volumes have been performing a full release for the past several hours (no idea how much longer they'll take since all19:14
fungithree are going in parallel)19:14
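One way to watch a release like this is the volserver transaction list. A sketch, again assuming -localauth on the fileservers:
  # active volserver transactions; a full-dump release shows up here while it runs
  vos status -server afs02.dfw.openstack.org -localauth
  # VLDB entry and site status for one of the affected volumes
  vos examine mirror.debian -localauth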
corvusclarkb: are you talking about instance 9710ec38-c6b0-4e1f-87d1-90aea7e5b62f ?  if so what was its node id?19:16
clarkbcorvus: yes that one. Its name was npea8939bf25d24 in ovh gra119:16
clarkbsorry ovh bhs1 not gra119:16
corvusclarkb: that node was built and deleted on zl02 and the entire process took 6 seconds19:17
corvus2025-09-17 08:24:54,895 DEBUG zuul.Launcher: [e: 521b48788f37421eb3e23ebc42189614] [req: e656a95f49744572b6ac31cde4f69672] Building node <OpenstackProviderNode uuid=ea8939bf25d2412ea595ee5f0f3d4fc1, label=ubuntu-noble, state=requested, provider=opendev.org%2Fopendev%2Fzuul-providers/ovh-bhs1-main>19:18
corvus2025-09-17 08:25:00,805 INFO zuul.Launcher: [e: 521b48788f37421eb3e23ebc42189614] [req: e656a95f49744572b6ac31cde4f69672] Marking node <OpenstackProviderNode uuid=ea8939bf25d2412ea595ee5f0f3d4fc1, label=ubuntu-noble, state=building, provider=opendev.org%2Fopendev%2Fzuul-providers/ovh-bhs1-main> as tempfailed due to Quota exceeded19:18
corvus2025-09-17 08:25:02,601 DEBUG zuul.Launcher: [e: 521b48788f37421eb3e23ebc42189614] [req: e656a95f49744572b6ac31cde4f69672] Removing provider node <OpenstackProviderNode uuid=ea8939bf25d2412ea595ee5f0f3d4fc1, label=ubuntu-noble, state=tempfailed, provider=opendev.org%2Fopendev%2Fzuul-providers/ovh-bhs1-main>19:18
leonardo-serranoclarkb: Yes, that was it. Just wanted to confirm, thanks19:18
clarkboh right we have multiple launchers. the node was not deleted in the cloud19:18
corvusbased on the log entries, i suspect that the cloud may not have given us an instance id19:19
corvuswhich, makes sense, because there should be no instance if we get a quota error19:19
clarkband I guess we skip over trying to delete the resource if there is no resource id to delete19:20
corvusyep.  but if it shows up later, it should still get deleted as a leaked resource19:20
corvuswhat was the metadata on that instance?19:20
clarkbcorvus: properties | zuul_node_uuid='ea8939bf25d2412ea595ee5f0f3d4fc1', zuul_system_id='99bcc7878aba48d697497abd56e737b6'19:21
clarkbcorvus: there are other examples in that cloud that have not been deleted if you want to look at one that is still there19:21
clarkbcorvus: 0cbca713-6cd3-46bf-9213-1ed03317c673 created 09:13 UTC ish today in a similar state19:22
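The zuul_node_uuid and zuul_system_id properties quoted above are the metadata the launcher's leak detection matches on (see the later discussion of pre-August instances lacking it), so inspecting a suspect instance looks something like this, with the cloud name as a placeholder:
  openstack --os-cloud <cloud> --os-region-name BHS1 \
    server show 0cbca713-6cd3-46bf-9213-1ed03317c673 \
    -f value -c status -c properties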
corvusi think we have a minor mystery of why those error instances aren't being deleted as leaked resources, but they shouldn't be taking up any actual quota in the cloud so they shouldn't be a factor19:22
corvusi think the bigger issue is that the quota on the account doesn't match the cloud capacity, so zuul continually hits those errors.19:23
fungiare we certain that instances in error state don't count towards quota?19:24
clarkbcorvus: I think as long as nova lists them they count against our quota19:24
clarkbfungi: no I'm pretty sure they do19:24
corvusi'd love for someone to explain that to me :)19:24
corvusshow me which cores it's using19:25
clarkbI think the way nova calculates quotas is via its database not via libvirt19:25
clarkbso as long as the items are in the database it counts19:25
fungii had previously assumed they did, but don't recall whether it was based on prior experience or just guesswork19:25
clarkbmelwitt: might be able to confirm?19:25
fungii agree from an accounting perspective it's rather absurd to limit or bill a customer based on resources they can't use because they're in error, but maybe some errors are temporary and it's to avoid those things suddenly exceeding the quota if they transition to active later19:26
corvusthe specific server we're looking at is in error state because the cloud decided it didn't have enough resources to run it19:27
clarkbyes but somehow that server also has a uuid and a flavor attached to it19:27
corvussure, but those aren't resources19:27
corvusit doesn't have any cores or ram or disk attached to it19:27
fungiwas it one of the "no valid host was found" errors or a different one?19:27
corvusyes that19:28
clarkbfungi: its 'message': 'Quota exceeded for cores: Requested 8, but already used 1024 of 1024 cores'19:28
corvusclarkb: oh it was that error?19:28
clarkbit's been a while since I dug into nova quota stuff but I'm pretty sure it takes a naive view and anything listed in `nova list` counts against your quota if it has a flavor19:28
fungioh, quota exceeded is an even more strange case. i wouldn't have expected it to even try to create anything in that case19:28
clarkbI agree that ideally nova should only count quota against actual usage19:29
clarkbcorvus: yes the server exists and that is the recorded fault19:29
corvusclarkb: you said `No Server found for 9710ec38-c6b0-4e1f-87d1-90aea7e5b62f`19:29
corvusi thought that was the instance we were discussing19:29
clarkbcorvus: after I deleted it. Sorry I was trying to say deletes work. The bug isn't in the cloud failing to delete the instance19:29
clarkbthe problem is that zuul appears to not be deleting any of these instances. So they accumulate, count against our quota and now we have more errors which probably snowballs19:30
corvusokay where did you see the quota error?19:30
clarkband 0cbca713-6cd3-46bf-9213-1ed03317c673 is the same state as 9710ec38-c6b0-4e1f-87d1-90aea7e5b62f was before I deleted it but this instance is not deleted19:30
clarkbcorvus: server show 0cbca713-6cd3-46bf-9213-1ed03317c67319:30
clarkb(or server show 9710ec38-c6b0-4e1f-87d1-90aea7e5b62f before the deletion)19:31
corvus2025-09-17 08:25:00,804 ERROR zuul.Launcher: [e: 521b48788f37421eb3e23ebc42189614] [req: e656a95f49744572b6ac31cde4f69672] Launch attempt failed: Error in creating the server. Compute service reports fault: Quota exceeded for cores: Requested 8, but already used 1024 of 1024 cores19:31
corvusthe launcher got that in an api response as well19:31
corvus(nonetheless, if we get a server id back in the response, we will store it and try to delete it; since we didn't delete it, that suggests that we didn't get one back)19:32
clarkbbut then we should still delete it in leaked node cleanups which should run every minute?19:33
corvusyep19:34
clarkbthe zuul uuid for the not deleted 0cbca713-6cd3-46bf-9213-1ed03317c673 appears to be 037b3d4dd45c4531b940908ddf1d829d19:35
corvusthe logs for the cleanup worker aren't as chatty; we don't have an "everything is okay" alarm there.  i'll try to see if it's stuck or just thinks it has nothing to do.19:37
corvusnot stuck19:38
clarkblooking at rax-iad we seem to have a similar problem with a number of nodes stuck in an error state. However most of them are from last year19:39
clarkbI will try manually deleting one of them next19:39
clarkbthose servers don't appear to have properties set on them so cleanup can't handle them19:40
corvusanything before august probably isn't going to be detected as leaked because they won't have matching metadata19:40
corvus(so we should just delete them without further thought)19:40
clarkbyup. I'm trying on one first to see if it goes away or if we need cloud intervention19:41
clarkb929edb89-4d6a-4a00-a4b3-4794d05f279d is the cloud uuid for that in rax-iad19:41
corvusi'm going to dig into the leak algorithm now, i'll be occupied with that for a little while.19:41
clarkbsounds good. I'm going to work on cleaning up these old error'd nodes from iad (and then check if ord is similar)19:41
clarkbI won't manually delete anymore that look recent and have the metadata in place for cleanup to handle them19:42
clarkbthey were easy to identify as they all had the old style np0038 name prefix from nodepool19:45
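A sketch of that kind of manual sweep, with the cloud name as a placeholder; --name takes a regex, so it matches the old nodepool np0038 prefix:
  # find leftover nodepool-era instances and their states
  openstack --os-cloud <cloud> server list --name '^np0038' --long
  # then, for each one confirmed stale:
  openstack --os-cloud <cloud> server delete <uuid>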
clarkbthere are two old nodepool nodes in iad. One says it is 'DELETED' and the other is 'ACTIVE' I'll leave those alone for the moment19:50
clarkbORD has a number of old nodepool nodes in an ERROR state just like IAD. I'm going to try deleting them next. Note it also has a number of ACTIVE nodepool nodes that I am not deleting yet19:51
clarkb'message': 'DBDeadlock' <- that is from one of the ord error nodes. That is a new error to me19:52
fungithat's an interesting amount of implementation detail exposed to the end user19:53
clarkbthat one does not want to delete. I spot checked two others and they have the same fault and are all in a deleting task state (server state is ERROR)19:55
clarkbI suspect for those we may need cloud intervention19:55
melwittclarkb: that's right, nova quotas count resources in the database (instances' flavors for example) if [quota]driver=nova.quota.DbQuotaDriver, and if [quota]driver=nova.quota.UnifiedLimitsDriver it will count resources in the Placement service when possible19:55
clarkbmelwitt: thank you for confirming19:56
clarkbmelwitt: does it take the floor of db and placement if there is a disagreement?19:57
melwittclarkb: no it won't do both at the same time, nova can be configured for one or the other19:57
clarkbgot it thanks19:57
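To see what the cloud itself is counting against the cores quota (the database-derived numbers described above), something along these lines works; exact field names can vary by deployment:
  # absolute compute limits, including maxTotalCores and totalCoresUsed
  openstack --os-cloud <cloud> limits show --absolute | grep -i cores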
clarkbthe old nodepool nodes that are ACTIVE in ORD lack nodepool properties entirely so I think this is why they didn't get garbage collected forever ago19:58
clarkbI suspect that I can manually delete those given their age and situation. I'm happy to do that but would be good for someone else to confirm they think that is safe. The main risk would be to held nodes I guess?19:59
clarkbfungi: re rax-flex looks like things changed around the 12th-13th20:02
clarkbfungi: that coincides with the previous automated upgrade of zuul so it could be an sdk update in the week prior?20:02
corvusi see a problem with the cleanup method. i believe i can easily monkeypatch the running processes, so i'm going to do that, which should kick off a round of deletions 2 minutes after i'm done.20:03
clarkbI am currently unable to list servers in dfw20:06
clarkbI get a 503 error20:06
corvuslaunchers patched20:06
clarkbGRA1 looks cleaner now. But 0cbca713-6cd3-46bf-9213-1ed03317c673 is still in BHS1 and there are still a number of ERROR'd nodes there (could be we just need to wait a bit for all the api requests to be made and get processed)20:09
fungiclarkb: openstacksdk 4.7.1 was released the day before zuul was upgraded, so a possible candidate. 4.7.0 was several weeks prior so i feel like that would probably have been in use for a while20:09
fungithe 4.7.1 release notes are very sparse though20:10
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/961547 Launcher: fix endpoint_score [NEW]        20:10
corvusthe TLDR is that both launchers thought the other was handling it.20:10
clarkb0cbca713-6cd3-46bf-9213-1ed03317c673 is gone now20:10
corvusthere are many deletions from ord and bhs120:10
corvusand gra120:11
clarkbcorvus: I expect a lot of those ord deletes to fail based on my earlier manual tests20:11
clarkbcorvus: but I'll be happy if they succeed20:11
clarkbI'll go review the fixup now20:11
corvusit'll keep trying20:11
corvusa bunch from iad now too20:12
clarkbfwiw I think part of the problem right now is that rax-flex is grabbing requests and needing to fail X times before someone else gets a chance20:12
clarkbthe confluence of problems made things extra bad. So now that we're addressing the different pieces it should slowly start catching up again (I hope)20:13
fungiwe could temporarily hand-patch 961537 onto the launchers20:13
corvusi'm in favor of that and happy to do it20:14
clarkbtrue, corvus do you know if that also requires a launcher restart?20:14
clarkbcorvus: no objections from me if you want to do that20:14
clarkb(the hourly jobs will undo it but we can put hosts in the emergency file or just hope that the 50-ish minutes to the next run is enough time to get that change landed)20:14
corvusi don't know, but i suspect it does.  i'm happy to do that and re-apply my monkeypatch.20:14
clarkb+2 on the launcher fix change. That was a surprisingly small edit20:16
clarkbcorvus: I'm still seeing large numbers of requests sent to openmetal and ovh but not rax classic20:20
clarkbI wonder if rax classic still thinks it is at quota or something20:21
corvuswasn't there an api error from one of the classic regions?20:21
clarkbcorvus: yes I was getting the 413 errors there against /v2/tenantid/servers I thought that was for listing servers but now I wonder if that is the response you get when over quota? Let me dig into that a bit more20:22
corvusif we can't list servers we can't build them20:22
clarkbrax-iad-main> usage factor: 0.21465968586387435 is a recent log entry so it seems to know that it isn't at quota20:23
clarkboh and ya rax dfw was returning 503 errors when I attempted to manually list nodes there (I wanted to check for possible nodepool node cleanups like I did for iad and ord)20:24
clarkbcorvus: in ord we have a number of old active servers from the nodepool era. They don't appear to have properties/metadata set so I don't think they will automatically get deleted. Should I manually delete them at this point? Is that safe (thinking about possible held nodes)20:25
corvusclarkb: yes, i think we should delete them.  i tried to ping people about held nodes, but didn't get a response, and i deleted them from zuul 2 days ago anyway.20:26
clarkbok I'll attempt to clear those out of ord now20:27
corvus2025-09-17 20:27:26,442 DEBUG zuul.Launcher: [e: 89af73402813442b8ac38dbb6bd18d3a] [req: 5e9297bf85804a5fa0588b3f5eb0346d] Node <OpenstackProviderNode uuid=17af3b74730c4cd696392cd10ea6bd47, label=ubuntu-jammy-8GB, state=building, provider=opendev.org%2Fopendev%2Fzuul-providers/raxflex-dfw3-main> advanced from submit creating server to complete20:28
corvusthat looks good for flex20:28
corvusi've restarted both launchers with the network fix manually applied.  re-applying the cleanup fix now.20:28
clarkbthanks!20:30
corvusdone20:32
clarkbORD has three nodepool nodes from july so not as old as the ones I just deleted. Deleting those should be safe too since you've severed all ties with nodepool held nodes at this point right?20:33
corvusyep20:33
corvus2025-09-17 20:34:16,359 ERROR zuul.Launcher:   keystoneauth1.exceptions.connection.ConnectTimeout: Request to https://dfw.servers.api.rackspacecloud.com/v2/637776/flavors/detail timed out20:34
clarkbya I suspect that region is sad. I want to say this behavior was similar to what we saw in iad/ord previously then we emailed cloudnull and things got better20:35
clarkbmaybe dfw needs similar intervention?20:35
clarkbin rax-ord the nodepool nodes are all either ERROR and not deletable or DELETING. I've cleared out the others20:35
clarkbI'll scan the other clouds too and apply similar cleanups if necessary20:36
*** mnaser_ is now known as mnaser20:36
clarkbok rax-ord is the only provider I see old nodepool nodes in. They are all in an ERROR state and I think we can't delete them as end users as a result. Note I haven't checked rax-dfw due to errors listing servers there20:43
clarkbI think we've addressed most of what we can (other than actually landing the permanent code fixes for the two issues) and now it's a matter of allowing the requests to rebalance across happier providers20:44
fungijudging from https://grafana.opendev.org/d/9871b26303/afs?orgId=1&from=now-6h&to=now&timezone=utc&viewPanel=panel-6 i suspect those drops in afs02.dfw usage are the old ro replicas for mirror.debian, mirror.debian-security and mirror.ubuntu being dropped and we're waiting for the slow climb back up to around its prior size, so maybe finishing sometime tomorrow20:44
clarkbfungi: afs02 deleted its ro volumes?20:45
clarkbI guess if it decides it is too out of date it does that maybe?20:45
fungii'm not sure "deleted" is the appropriate term20:45
fungiwell, maybe delete is the appropriate term20:46
fungiThis is a recovery of previously failed release20:46
fungiStarting transaction on cloned volume 536870983... done20:46
fungiDeleting extant RO_DONTUSE site on afs02.dfw.openstack.org... done20:46
fungiCreating new volume 536870983 on replication site afs02.dfw.openstack.org:  done20:46
fungiThis will be a full dump: previous release failed20:47
fungiStarting ForwardMulti from 536870983 to 536870983 on afs02.dfw.openstack.org (entire volume).20:47
clarkbinteresting so ya seems like it starts over and when it does so it doesn't try to use any of the old existing data20:47
clarkbwhich is probably very inefficient for this particular use case20:47
clarkbgate and check queue lengths are slowly shrinking (as things merge other pipelines are growing so I'm mostly ignoring them and using gate and check as the pipelines to monitor)20:49
fungithe graph shows it's recovering at a rate of about 0.1tb/hr and needs to recover about another 1.5tb so 15 hours to go, roughly?20:49
clarkbzoom zoom20:49
fungishould be finishing around the time i wake up, though i may not end up doing the stretch cleanup from mirror.debian and mirror.debian-security until friday since i wasn't really planning to be around tomorrow20:50
clarkbthat should be fine. That content has been there for some time20:51
opendevreviewMerged opendev/system-config master: Switch generic container role image builds back to docker  https://review.opendev.org/c/opendev/system-config/+/961410  20:52
fungiwe're not really at risk of running out of space on the fileservers any time soon regardless, we've got nearly a tb free on the two dfw servers20:52
clarkbhourlies are starting up which will undo the rax flex fix for the launchers. However if we have to restart them to pick up the change it won't matter21:02
clarkbI'll do my best to monitor that via logs and grafana21:03
fungiif it's restarting the launchers, then the cleanup handpatch will get undone too right?21:03
clarkbfungi: I don't think it restarts them. Only the weekly reboots do that21:03
clarkbso it depends on whether or not it reloads the cloud configs automatically or if it needs a restart21:04
fungiin that case they also may not pick up the clouds.yaml revert either21:04
clarkbyup it may be fine and just work as long as we don't restart the launchers21:04
clarkbas for rax-dfw and rax-ord I'm thinking wait to see if things are happier in dfw tomorrow then either way send an email to try and get rax-ord cleaned up and optionally ask about dfw's state in that email depending on its state tomorrow21:05
fungiand then once the config change deploys it'll be back in sync until the weekend, when it will hopefully be upgrading onto the cleanup fix proper anyway21:05
clarkbexactly21:05
clarkbcorvus: do you think we should disable rax-dfw in the meantime? that should get it out of the endpoint calculations right?21:06
clarkbdoing a server list seems to be a reasonable canary for that cloud so I don't think we need zuul-launcher running against it to determine if it is happy or sad21:07
opendevreviewMerged opendev/system-config master: UCA: Add Flamingo  https://review.opendev.org/c/opendev/system-config/+/961460  21:07
clarkbwow one kolla-ansible change runs >60 jobs in check21:08
clarkbthe vast majority of which are non voting. but thats like 10% of our entire capacity minimum for one change21:08
fungimegacheck21:08
clarkb(I didn't bother to see if those jobs are multinode in which case it can quickly become >10%)21:09
clarkbI think the remaining slowness is largely due to openmetal having had a bunch of requests assigned to it before we fixed the other clouds. So everything assigned there is going through a small funnel21:12
clarkbnewly enqueued items seem to avoid this problem and are handled more quickly21:12
clarkbthe requested value peaked at 124 for openmetal and is down to 56 now21:13
corvusclarkb: yeah, i think if it's kaput we should pull it out21:14
clarkbworking on a change now21:14
opendevreviewClark Boylan proposed opendev/zuul-providers master: Disable rax-dfw due to API errors  https://review.opendev.org/c/opendev/zuul-providers/+/961553  21:16
opendevreviewMerged opendev/zuul-providers master: Disable rax-dfw due to API errors  https://review.opendev.org/c/opendev/zuul-providers/+/961553  21:20
clarkbbased on the graphs for rax flex I don't think the hourly jobs impacted the running service21:21
clarkbas long as my change and corvus' update to the launcher land before we next restart (friday at the latest) we should be good there21:21
fungii have a feeling a little while after that deploys we'll be able to openstack server list successfully, if it's a request backlog pileup leading to backend timeouts like we saw previously21:22
fungier, i guess there's no deploying necessary, it should be picked up by zuul as soon as it merges?21:24
clarkbya looks like grafana already reflects it21:25
clarkbI'll try to run a server list against dfw in a bit after we give it a little time to chill21:31
fungiyeah, i'll wager it starts responding again once we're no longer hammering it for a while21:33
corvuswere we hammering it?  was the rate of server create calls too high?  if so, we can tune the rate in the provider21:41
fungii suspect "hammering" was not very fast at all in this case21:42
fungijust that calls on the backend were taking too long to complete, queuing up, and we were making new requests faster than the queue was clearing out21:42
clarkbserver list did just work21:42
fungiat least that's how it was acting in iad a while back21:42
clarkbthere are a number of old nodepool nodes in this region too21:43
clarkblooks like from around july. I'll go ahead and delete them next to match what I did in ord and iad21:43
corvuslmk if we need to change the rate limit21:44
clarkbcorvus: last time with iad they identified an issue and corrected it on the cloud side21:45
corvusit's currently 2 requests/second now (per launcher)21:45
clarkbso I'm hopeful that just needs to be repeated here21:45
clarkbsince we know that things seem happier after disabling the region I'll go ahead and start drafting that email for ord and dfw now21:47
clarkbthe old nodepool instances are gone21:47
clarkbemail sent, y'all should be cc'd too22:04
fungithanks!22:04
clarkband openmetal has caught up now so we should be more business as usual at this point22:04
opendevreviewMerged opendev/system-config master: Select the network to use in raxflex  https://review.opendev.org/c/opendev/system-config/+/961537  22:27
clarkbthat didn't actually trigger the zuul deploy job. I'll double check after the next round of hourlies that the clouds.yaml look correct22:33
fungiyeah, only ran infra-prod-bootstrap-bridge22:51
clarkb/etc/openstack/clouds.yaml was updated at 23:05 on both zl01 and zl02 and both files contain the networks list again23:17
clarkbso I'm happy that we've updated that correctly and the fix for the launcher cleanup stuff landed as well so we should be all set for the friday restarts23:17
fungiperfect23:18
