Thursday, 2025-08-14

00:23 <corvus> hrm, log rotation has happened on zl01 but not zl02
00:25 <clarkb> that rotation is happening in python (not using logrotate)
00:25 <corvus> perhaps fallout from the memory issues?
00:26 <clarkb> ya maybe the object in python crashed trying to allocate memory?
00:26 <clarkb> or even the file copy itself?
00:27 <clarkb> however 8e437d947d1c4d6eb53212fd5ce8d53c this request was one that failed per https://zuul.opendev.org/t/openstack/build/e7fe7c44d29242f9993be9137f5f6cf9 and it appears to have been a single node request whose noderequest was fulfilled by zl01
00:27 <clarkb> though I suppose that maybe zl02 was then responsible for handling request completion for the entire request?
00:28 <corvus> yeah, they often trade off :)
00:28 <corvus> i was looking up request fd545f68533344b8b64135ed500cb4e4
00:29 <clarkb> looks like zl02 handled the ssh keyscanning for that node
00:29 <clarkb> I don't see anything indicating why the request failed
00:29 <clarkb> unless Exception loading ZKObject at /zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c/revision which happened early in the process was to blame?
00:29 <corvus> no that's benign
00:29 <corvus> 2025-08-13 22:27:02,098 DEBUG zuul.Launcher: [e: 38edb69d08d34ae4b9cc0faee9bef52a] [req: fd545f68533344b8b64135ed500cb4e4] Request <NodesetRequest uuid=fd545f68533344b8b64135ed500cb4e4, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/fd545f68533344b8b64135ed500cb4e4> nodes ready: [<OpenstackProviderNode uuid=fbfe4e20998942d188a06e9a90270dbe, label=ubuntu-noble-8GB, state=ready, provider=opendev.org%2Fopendev%2Fzuul-providers/rax-dfw-main>]
00:29 <corvus> 2025-08-13 22:27:02,100 DEBUG zuul.Launcher: [e: 38edb69d08d34ae4b9cc0faee9bef52a] [req: fd545f68533344b8b64135ed500cb4e4] Marking request <NodesetRequest uuid=fd545f68533344b8b64135ed500cb4e4, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/fd545f68533344b8b64135ed500cb4e4> as fulfilled
00:29 <clarkb> I thought so considering it seemed to proceed afterwards
00:30 <corvus> i think the launcher thought everything was fine with that request
00:30 <clarkb> maybe need to look at the scheduler then?
00:30 <corvus> yep
00:31 <clarkb> the schedulers have ERROR zuul.LauncherClient: Exception loading ZKObject at /zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c/revision
00:31 <corvus> that's fine too
00:32 <clarkb> on zuul01 we have 2025-08-14 00:04:34,373 INFO zuul.Pipeline.openstack.gate: [e: 6892d4f49d6e479fa8af9c2296e48c07] Node(set) request <NodesetRequest uuid=8e437d947d1c4d6eb53212fd5ce8d53c, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c>: failure for openstack-tox-pep8
00:32 <corvus> (https://review.opendev.org/957184 will reduce or eliminate a bunch of those; that change is approved)
00:32 <clarkb> problem is very few jobs seem to be running right now
00:33 <clarkb> maybe like 50% of them; the others are hitting this issue so it may not land until we figure out what is going on
00:37 <clarkb> jumping up a level and looking at the event logs it kinda seems like that is what we get log wise
00:40 <clarkb> ok that log message originates from onNodesProvisioned when request.fulfilled is not true. request.fulfilled requires the state to be fulfilled but it is accepted in that log line
00:40 <clarkb> on zl01 we get 2025-08-14 00:04:33,972 DEBUG zuul.Launcher: [e: 6892d4f49d6e479fa8af9c2296e48c07] [req: 8e437d947d1c4d6eb53212fd5ce8d53c] Marking request <NodesetRequest uuid=8e437d947d1c4d6eb53212fd5ce8d53c, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c> as fulfilled
00:41 <clarkb> then half a second later on zuul01 we get 2025-08-14 00:04:34,373 INFO zuul.Pipeline.openstack.gate: [e: 6892d4f49d6e479fa8af9c2296e48c07] Node(set) request <NodesetRequest uuid=8e437d947d1c4d6eb53212fd5ce8d53c, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c>: failure for openstack-tox-pep8
00:41 <clarkb> the zl01 log indicates it is marking it fulfilled but the state logged on zuul01 is still accepted
00:41 <clarkb> so I think the problem is that we are not transitioning the state record properly for some reason
00:43 <clarkb> I don't see any changes to the code that marks things fulfilled in the set of changes today
00:46 <corvus> i think the launcher may be running very slowly and we're now consistently losing a race
00:46 <clarkb> ya I just noticed in the launcher code we send the event for nodes provisioned then update the request attributes
00:46 <corvus> i think it's worth restarting the launchers to clear things out
00:46 <corvus> exactly that yes
00:46 <clarkb> which would be a race condition we can lose if things are slower than the network
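
[A minimal sketch of the ordering bug being described here, assuming a simplified launcher; the names (fulfill_request_*, zk.save, the event tuple) are invented for illustration and are not Zuul's actual code.]

    # Buggy ordering: the "nodes provisioned" event is emitted before the state
    # change is persisted, so a fast scheduler can read the request while it
    # still says "accepted" and report the nodeset request as failed.
    def fulfill_request_buggy(request, zk, scheduler_events):
        scheduler_events.put(("nodes-provisioned", request.uuid))
        request.state = "fulfilled"
        zk.save(request)

    # Fixed ordering: persist the state transition first, then notify, so the
    # scheduler can no longer observe a stale "accepted" state after the event.
    def fulfill_request_fixed(request, zk, scheduler_events):
        request.state = "fulfilled"
        zk.save(request)
        scheduler_events.put(("nodes-provisioned", request.uuid))
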
00:46 <corvus> do you want to restart launchers while i work on a code fix?
00:47 <corvus> and maybe do something about logs while they're down
00:47 <clarkb> corvus: sure. Just docker compose down && docker compose up -d?
00:47 <clarkb> or do you want me to pull first?
00:48 <clarkb> I guess I should pull since we had some updates
00:49 <clarkb> I'm doing zuul01 now. Will pull then down then up -d as it doesn't need log updates. Then I'll go look at what needs to be done while zl02 is down
00:49 <clarkb> Not zuul01. zl01
00:52 <clarkb> ok it appears to be running. Moving on to zl02
00:55 <clarkb> zl02 is down. I'm gzipping the large log file in case there is any info in there worth saving. Then I'll remove the uncompressed file if gzip doesn't do that automatically and start the service up again
00:57 <clarkb> zl02 launcher is starting up now
01:02 <clarkb> corvus: the logfile is gzipped and renamed with a dated suffix. We'll probably want to delete that file in a day or two if we don't end up digging into it for more data
01:02 <corvus> ++ thx
01:03 <tkajinam> I was about to ask if there is any maintenance going on, seeing frequent node failures, but I think you all are aware of the situation.
01:03 * tkajinam is still reading the log
01:04 <clarkb> things look happier after I restarted the launchers. Newly pushed / enqueued changes aren't showing up with a bunch of orange in the ui
01:05 <clarkb> I'm going to have to pop out soon for dinner. But will check back in on this in the morning
01:07 <clarkb> looking at the zuul components list I think our weekend reboots may have gotten stuck or crashed again too
01:08 <clarkb> I'm not going to dig into that deeper at the moment since it's late and already Thursday UTC time so things have mostly been working despite that
01:08 <clarkb> corvus: ^ just fyi that is probably something to follow up on next too
01:10 <corvus> ack
01:31 <corvus> i'll clean out the leaked image files; i should be able to do it safely while the launchers are running, just based on timestamps.
02:16 <corvus> nl01 is not fully functional due to https://review.opendev.org/957296 -- i'm going to restart it
02:25 <corvus> 2025-08-14 02:23:15,164 ERROR zuul.Launcher:   openstack.exceptions.HttpException: HttpException: 503: Server Error for url: https://ord.servers.api.rackspacecloud.com/v2/637776/flavors/detail?is_public=None, Service Unavailable
02:25 <corvus> now we're getting errors from rax-ord
02:27 <corvus> 2025-08-14 02:27:17,384 ERROR zuul.Launcher:   keystoneauth1.exceptions.connection.ConnectTimeout: Request to https://ord.servers.api.rackspacecloud.com/v2/637776/servers timed out
02:28 <opendevreview> James E. Blair proposed opendev/zuul-providers master: Disable rax-ord  https://review.opendev.org/c/opendev/zuul-providers/+/957297
02:29 <corvus> infra-root: ^ i self-approved that
02:29 <opendevreview> Merged opendev/zuul-providers master: Disable rax-ord  https://review.opendev.org/c/opendev/zuul-providers/+/957297
02:41 <corvus> 2025-08-14 02:41:00,922 ERROR zuul.Launcher:   Exception: Unable to find flavor: opendev.xxlarge in osuosl-regionone-main
02:43 <corvus> apparently there is an opendev.xxxlarge
02:43 <corvus> there's no timestamp or anything, so i don't know if that changed
02:44 <opendevreview> James E. Blair proposed opendev/zuul-providers master: Correct osuosl flavor name  https://review.opendev.org/c/opendev/zuul-providers/+/957298
02:44 <corvus> infra-root: ^ i'm self-approving that too
02:44 <opendevreview> Merged opendev/zuul-providers master: Correct osuosl flavor name  https://review.opendev.org/c/opendev/zuul-providers/+/957298
02:47 <corvus> it's pretty bad if we can't talk to a cloud at startup... we may need to parallelize the startup process and make the providers a bit more independent.
02:51 <opendevreview> Takashi Kajinami proposed opendev/system-config master: WIP: Add OpenVox to mirror  https://review.opendev.org/c/opendev/system-config/+/957299
02:51 <corvus> we can't deal with the load right now in this degraded state, i'm going to dump the periodic pipeline.
03:22 <corvus> okay, afaict the launchers are working as well as they can be atm with ord misbehaving
03:23 <corvus> i had to dump the periodic, periodic-stable, and check-arm64 queues in openstack
03:23 <corvus> i also had to re-enqueue some changes in zuul/gate in order to get it to stop trying to boot nodes on ord
08:49 *** mrunge_ is now known as mrunge
10:50 <stephenfin> clarkb: fungi: When you're around, I think we want a v7.0.1 release for https://review.opendev.org/c/openstack/pbr/+/957262/ and above
10:51 <stephenfin> ...and above
12:42 <opendevreview> Jan Gutter proposed zuul/zuul-jobs master: Raise connection pool for boto3 in s3 upload role  https://review.opendev.org/c/zuul/zuul-jobs/+/957218
13:19 <tristanC[m]> It looks like I was mentioned last week in this room, but I can't find the message with element, apparently the "jump to notification" feature is not working. So please contact me again if needed!
13:26 <frickler> tristanC[m]: likely this one? https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2025-08-07.log.html#opendev.2025-08-07.log.html#t2025-08-07T18:15:24
13:31 <tristanC[m]> frickler: ha thanks! So are you grepping these irclog files to find matrix mentions?
13:32 <tristanC[m]> clarkb: that's right, that's done for "reproducibility" where the image timestamps are set to a known value. But that's not super useful, next time I update the matrix gerritbot I'll use the current time of the day.
13:33 <tristanC[m]> But as far as I know, the service has been humming away without noticeable bugs, so I'm inclined to leave it that way unless you want to fix that.
13:49 <frickler> tristanC[m]: I just looked for a mention of your nick last week. but for larger searches I do grep my local logs sometimes. one more advantage of IRC over matrix, I'd think, haven't found a way to save logs from element yet
13:51 <fungi> while i can't recommend the matrix-rs plugin for weechat in its present state due to significant missing features, it does give me the ability to log matrix room discussions just as if they were irc channels
13:53 <corvus> frickler: Room Info -> Export Chat
13:53 <corvus> in element
13:58 <fungi> frickler: on 957262 why are you saying "(alias, arg) in DEPRECATED_CFG" won't match in that spot if it was matching a few lines earlier (the entire reason for the change is to get it to stop outputting those warnings unless they were actually used)
13:58 <corvus> i'm still not happy with the launcher behavior; i have some stuff to do this morning, but i'll resume looking into that when i'm done.
13:59 <corvus> the launchers aren't stuck, they're making progress, just slowly and poorly.  i wouldn't try to do anything manual.
13:59 <fungi> noted, thanks corvus
14:02 <frickler> fungi: from my understanding it cannot have worked before, either
14:04 <fungi> so DEPRECATED_CFG isn't a list of tuples then?
14:04 <fungi> or is the problem more nuanced?
14:05 <fungi> i'm just trying to figure out where the excess warning messages were coming from if that was dead code as you say
14:05 <frickler> fungi: you can check the definitions at the beginning of the same file
14:06 <frickler> these are nested tuples, with alias = (section, keyname), and only that tuple is being used as the key in DEPRECATED_CFG
14:06 <fungi> huh, i think the definition may have missing commas
14:07 <frickler> I commented on that, too, but that's a different - albeit related - bug
14:07 <fungi> the dict keys should be ('metadata', 'home_page') rather than concatenated strings i guess
14:07 <frickler> yes
14:07 <fungi> implicit string concatenation strikes again
14:07 <fungi> oh, *some* of them match
14:08 <fungi> ('metadata', 'requires_dist') is a key, for example
14:08 <fungi> it's just the first few that are missing commas and so will never match
14:08 <frickler> is there a sample of the excess warning messages showing up in logs somewhere?
14:09 <fungi> stephenfin: ^ did you have a public sample to link? otherwise i can fish one out of a random job log for a pbr-using project
14:12 <fungi> basically some of the keys in the current DEPRECATED_CFG dict aren't working at all because of the missing commas leading to implicitly concatenated strings, while others are matching because they're proper tuples as expected
14:14 <fungi> prior to 957262 pbr is emitting a warning for each config option it iterates over if that tuple is a key in the DEPRECATED_CFG dict (so all of the not-broken ones); with 957262 it only does so if the option was present in the configuration it's checking, since the conditional above it short-circuits the loop first
14:15 <frickler> I still claim nothing should match at all, since e.g. (alias, arg) = (('metadata', 'home_page'), 'url') in CFG_TO_PY_SETUP_ARGS, while DEPRECATED_CFG has (modulo the missing comma) only ('metadata', 'home_page') as a dict key
14:16 <fungi> ah, i see, so alias is also a tuple
14:17 <frickler> yes and it is getting split as "section, option = alias" in L410. so using "(section, option)" in the deprecation check could also work and might be more readable
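
[A simplified reconstruction of the two issues discussed above; this is not the actual pbr source, structures and messages are abbreviated for illustration.]

    # Issue 1: a missing comma turns an intended ('section', 'option') tuple
    # into implicit string concatenation, so that entry can never match.
    BROKEN_KEY = ('metadata', 'home_page'
                  'url')                      # == ('metadata', 'home_pageurl')
    GOOD_KEY = ('metadata', 'requires_dist')  # a properly formed key

    DEPRECATED_CFG = {GOOD_KEY: 'deprecation message here'}

    # Issue 2: iterating CFG_TO_PY_SETUP_ARGS yields ((section, option), arg)
    # pairs, so checking the whole pair against DEPRECATED_CFG never matches;
    # only the alias tuple itself (or (section, option)) can match.
    for alias, arg in [(('metadata', 'requires_dist'), 'install_requires')]:
        section, option = alias
        print((alias, arg) in DEPRECATED_CFG)       # False
        print((section, option) in DEPRECATED_CFG)  # True -- the check frickler suggests
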
14:57 <clarkb> Looks like corvus is already looking at launcher things. I'll probably start my morning looking at the zuul weekly reboot status (after noticing inconsistency on the components page last night) then look at the pbr changes.
14:57 <clarkb> but first breakfast
15:03 <clarkb> TASK [zuul-executor : Check if Zuul Executor containers are running] failed on ze05 with rc -13. This is what we believe to be the ssh ControlPersist race with starting a task as the ssh process shuts down
15:03 <clarkb> so I guess that is "good" news in that it isn't a new error
15:04 <clarkb> We can probably just manually start the cronjob today if we want to kick things off sooner but I think with the launcher situation it's probably best to address launcher things first and possibly just wait for the late friday cronjob to fire. Then check the results first thing monday
15:04 <clarkb> the run prior appears to have been successful
15:10 <clarkb> it would be nice if ansible had some "if rc == -13 then retry" option
15:48 <corvus> clarkb: did we merge the change to do keepalives?
15:50 <clarkb> corvus: yes
15:51 <clarkb> `ssh_args = -o ControlMaster=auto -o ControlPersist=180s -o ServerAliveInterval=60` is the current /etc/ansible/ansible.cfg entry on bridge
16:10 <stephenfin> Thanks, frickler 🤦 I've cleaned that up now https://review.opendev.org/c/openstack/pbr/+/957262
16:11 <fungi> and added tests!
16:11 <fungi> very nice
16:26 <corvus> i get a 500 error for /usr/launcher-venv/bin/openstack --os-cloud opendevzuul-rax-flex --os-region SJC3 server list
16:26 <corvus> seems to work for dfw3
16:27 <corvus> Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
16:27 <corvus> <class 'KeyError'>
16:27 <fungi> hah
16:28 <fungi> dan_with: do you happen to know whether there's maintenance or an incident going on in flex sjc3 right now that would cause the nova api to return a 500 error? ^
16:29 <opendevreview> James E. Blair proposed opendev/zuul-providers master: Disable raxflex-sjc3  https://review.opendev.org/c/opendev/zuul-providers/+/957449
16:29 <fungi> i'll test with my personal account too
16:29 <corvus> opendevci account works
16:30 <corvus> may be a problem with opendevzuul
16:30 <fungi> though i did notice a vm i have there spontaneously rebooted at 11:19 utc today
16:30 <clarkb> I want to say we had this problem before and it required intervention on the cloud side
16:30 * clarkb tries to find notes
16:31 <opendevreview> Merged opendev/zuul-providers master: Disable raxflex-sjc3  https://review.opendev.org/c/opendev/zuul-providers/+/957449
16:32 <clarkb> according to my irc logs jamesdenton helped debug this issue on April 9
16:33 <clarkb> at the time there were three instances in an Error state but nova couldn't list them due to keyerrors
16:33 <corvus> we are down 3 cloud regions; this is making a significant dent in our capacity.
16:34 <clarkb> https://bugs.launchpad.net/nova/+bug/2095364 it's this bug
16:34 <clarkb> or was this bug. Not 100% certain it is the same problem this time
16:35 <clarkb> I think we can try an older microversion and attempt to delete the error'd nodes that way
16:35 <clarkb> I'll look into that
16:35 <corvus> earlier one of the launchers saw: 2025-08-14 16:28:22,056 ERROR zuul.Launcher:   <class 'sqlalchemy.exc.OperationalError'>
16:36 <clarkb> `sudo openstack --os-cloud opendevzuul-rax-flex --os-region SJC3 --os-compute-api-version 2.54 server list` this works on bridge
16:36 <corvus> clarkb: if you have a process for listing servers using the earlier microversion, i'd be interested in the process and/or the current output
16:36 <clarkb> corvus: ^ that is it
16:36 <corvus> ++
16:36 <clarkb> I'll hold off on deleting anything but there are a number of servers in an ERROR state.
16:37 <corvus> thanks!  give me 1 second, then i think we can go to town deleting
16:37 <clarkb> setting the version to 2.95 also works
16:37 <clarkb> I don't know why I have a much older version in my command history too. but I think that confirms this is the problem
16:38 <corvus> okay i think we can delete stuff now
16:38 <clarkb> corvus: do you want me to do that?
16:39 <corvus> yes please
16:39 <clarkb> ok I will attempt to manually delete every instance in sjc3 in an error state
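
[The cleanup was done with the CLI shown above; an equivalent sketch using openstacksdk with the microversion pinned would look roughly like the following. The cloud and region names come from the earlier commands; everything else is generic and untested against this cloud.]

    import openstack

    # Pin an older compute microversion so the server listing avoids the nova
    # KeyError bug, then delete anything stuck in ERROR.
    conn = openstack.connect(cloud='opendevzuul-rax-flex',
                             region_name='SJC3',
                             compute_api_version='2.95')

    for server in conn.compute.servers():
        if server.status == 'ERROR':
            print('deleting', server.id, server.name)
            conn.compute.delete_server(server)
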
16:41 <corvus> i just saw the same error in dfw3, then dfw3 worked again in a subsequent attempt a few seconds later
16:41 <corvus> (sjc3 still seems consistent)
16:42 <corvus> and now it works
16:42 <fungi> huh
16:42 <corvus> (after clarkb deleted stuff, i think)
16:42 <opendevreview> James E. Blair proposed opendev/zuul-providers master: Revert "Disable raxflex-sjc3"  https://review.opendev.org/c/opendev/zuul-providers/+/957452
16:42 <clarkb> yes deletions are complete and a listing shows no more ERROR'd nodes
16:42 <fungi> that's "fun"
16:43 <opendevreview> Merged opendev/zuul-providers master: Revert "Disable raxflex-sjc3"  https://review.opendev.org/c/opendev/zuul-providers/+/957452
16:44 <clarkb> last time the cloud intervened on our behalf and deleted the nodes for us
16:45 <clarkb> but it is good to know that we can do it if we use the correct magical microversion
16:45 <fungi> yeah, that's a great find
16:47 <fungi> though it's supposedly fixed in a stable point release for dalmatian, has a fix merged in stable/caracal (no point release yet), and one proposed for stable/bobcat that hasn't merged
16:48 <clarkb> I don't know which version rackspace flex has deployed but maybe dan_with can try to get that fix on their radar
16:51 <clarkb> thinking out loud here and I'm not convinced this is a good idea: but we could probably get away with forcing microversion 2.95 on all clouds that support 2.96?
16:52 <clarkb> that would require looking up supported versions but launchers are already caching things on startup about clouds so maybe this would fit into that process? Basically look up microversions if present and if version 2.95 is in the valid versions use it. otherwise use whatever default the sdk picks
16:52 <corvus> i'm not sure if that affected the launcher... now that i know more of what's going on, that may have been cli only
16:53 <clarkb> huh I guess that could be if the sdk listing things isn't requesting the field with a keyerror or using a "compatible" microversion for some reason?
16:54 <corvus> we don't use the sdk method for listing
16:55 <clarkb> got it. That may explain it then as we probably aren't doing anything with microversions like the cli or sdk would
16:55 <corvus> the launchers do a straight keystoneauth get of '/servers/detail'
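
[For reference, a raw keystoneauth listing of that sort, plus the microversion pin clarkb floated above, might look something like this; the auth details are placeholders rather than the real bridge/launcher configuration.]

    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    auth = v3.Password(auth_url='https://keystone.example.test/v3',   # placeholder
                       username='user', password='secret', project_name='proj',
                       user_domain_name='Default', project_domain_name='Default')
    sess = session.Session(auth=auth)

    # Straight GET of the compute API; the OpenStack-API-Version header is how
    # a specific microversion (e.g. 2.95) would be requested on such a call.
    resp = sess.get('/servers/detail',
                    endpoint_filter={'service_type': 'compute'},
                    headers={'OpenStack-API-Version': 'compute 2.95'})
    servers = resp.json()['servers']
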
17:15 <clarkb> fungi: looking at the timestamps on that nova bug seems like a lot of those changes just merged
17:15 <clarkb> I guess we shouldn't be surprised the fixes haven't rolled out to actual clouds yet
17:15 <fungi> yeah, agreed
17:29 <fungi> apropos of nothing in particular, https://github.com/mfontanini/presenterm looks like a presentty clone in rust (though using markdown instead of restructuredtext)
17:34 <corvus> okay, i have identified 2 improvements i'd like to make to the launchers.  they aren't trivial, but they aren't too complex.  it's going to take a bit.  i don't have any quick fixes to deal with the backlog.  i'm just going to try to plow through writing these changes and we can try to get them merged asap.
17:34 <corvus> after that, i'll fix up the event stuff from yesterday.  even though that was catastrophic, as long as we don't start swapping again, i think it's slightly lower priority.
17:35 <clarkb> makes sense
17:41 <fungi> sounds good, thanks!
17:43 <corvus> remote:   https://review.opendev.org/c/zuul/zuul/+/957456 Include requested nodes in quota percentage calculation [NEW]
17:43 <corvus> that's the first change -- hopefully the commit msg explains sufficiently
17:44 <corvus> the second one i'm going to work on is roughly "order the node objects so we fill old multinode requests sooner"
17:45 <fungi> yep, commit message made sense, now seeing if i can apply that understanding to the diff
18:13 <corvus> remote:   https://review.opendev.org/c/zuul/zuul/+/957457 Sort nodes by request time [NEW]
18:13 <corvus> that's change #2
18:14 <corvus> was not expecting sequential numbers for those
18:23 <corvus> i think the systemic problem that's going on is that:
18:23 <corvus> any time the quota usage for a provider drops, the launchers see an opportunity and create a node request for that provider, regardless of how many other backlogged requests there are
18:23 <corvus> even if that only happens one at a time (because after it creates one node, it should account for that allocation and defer the next request), given enough time, all the node requests can end up with allocated nodes, as each one has plenty of opportunities to find a "hole"
18:24 <corvus> the first change should help us avoid that situation by accounting for the backlog earlier, so that zuul doesn't see the hole in the first place
18:24 <corvus> some leakage will still happen due to multinode requests; the second change helps get them cleared out in the right order.
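
[A toy model of that accounting change, with made-up names rather than Zuul's actual code, just to make the two fixes concrete.]

    # Before: only nodes already allocated count against quota, so the moment a
    # node is deleted any backlogged request can grab the freed capacity,
    # regardless of how long older requests have been waiting.
    # After (957456): requested-but-unfulfilled nodes are counted too, so the
    # launcher no longer sees a "hole" that lets newer requests jump the queue.
    def has_capacity(provider, include_backlog=True):
        used = len(provider.allocated_nodes)
        pending = len(provider.requested_nodes) if include_backlog else 0
        return used + pending < provider.max_servers

    # The companion change (957457) orders work by request time, so partially
    # fulfilled multinode requests are completed before newer requests start:
    #     candidates.sort(key=lambda node: node.request_time)
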
18:27 <clarkb> corvus: I have some questions on https://review.opendev.org/c/zuul/zuul/+/957457
18:31 <corvus> thx replied
18:35 <clarkb> corvus: thanks one followup posted
18:37 <clarkb> side note: I love that datetime.datetime.utcfromtimestamp is deprecated and will be removed in a future python release because you are supposed to use timezone aware objects to represent datetimes in UTC
18:38 <clarkb> "utcfromtimestamp" isn't timezone aware enough I guess
18:38 <clarkb> posix epoch times are recorded as beginning from midnight january 1 1970 UTC. Asking for the datetime in UTC X seconds after the epoch is timezone aware...
18:39 <corvus> clarkb: thx fixed
18:48 <clarkb> corvus: are you pushing a new patchset?
18:48 <clarkb> want to make sure I wait for that to +2 before lunch if so
19:16 <corvus> oh sorry, git-review was asking me a question i forgot to answer
19:16 <corvus> hope you didn't wait for lunch
19:18 <fungi> git-review can be so nosey sometimes
19:31 *** dhill is now known as Guest24174
19:39 <clarkb> I did not wait :)
19:40 <clarkb> I've approved it now
19:50 <JayF> clarkb: utcfromtimestamp() returns a tzinfo=None object; so while you get a datetime that is UTC, it doesn't return a tz-aware value
19:51 <JayF> they can't change the return without breaking the api
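
[A quick stdlib illustration of the tzinfo point above; the timestamp value is arbitrary.]

    from datetime import datetime, timezone

    ts = 1755129600  # arbitrary epoch seconds

    naive = datetime.utcfromtimestamp(ts)                # deprecated; tzinfo is None
    aware = datetime.fromtimestamp(ts, tz=timezone.utc)  # recommended replacement

    print(naive.tzinfo)  # None -- it is UTC only by convention
    print(aware.tzinfo)  # datetime.timezone.utc

    # Mixing the two is where code actually breaks:
    # naive < aware  ->  TypeError: can't compare offset-naive and offset-aware datetimes
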
20:36 <JayF> I am unable to reach security.openstack.org
20:36 <JayF> which is resolving to static02.opendev.org for me
20:38 <JayF> only via ipv4, once I tried ipv6 it's working
20:38 <JayF> mtr implies it's somewhere close to the last hop that's dropping
20:39 <fungi> i can ping it over v4 and v6
20:40 <fungi> so if it's close to the last hop, it might be that you've got an asymmetrical route so your traceroute shows the drop at the hop where they converge (the provider's border)
20:40 <fungi> if you want to msg me your ip address i could try to traceroute to it from static02 for comparison
20:41 <clarkb> JayF: I mean that seems reasonable if A) the function name is literally UTC already and B) they are going to delete it and break everyone anyway
20:41 <clarkb> I dunno just seems like if you're going to break it then breaking a subset of people is probably better
20:42 <JayF> I don't really agree; you can't use tzinfo=None and tzinfo!=None objects interchangeably. It's just a bad interface from the beginning and any pain now is just paying back for that original sin
20:42 <fungi> it sends a clear message
20:42 <fungi> to any other functions that might try to misbehave
20:43 <JayF> fungi: https://usercontent.irccloud-cdn.com/file/fQNI38xe/image.png
20:45 <corvus> i have packet loss to both review02 and static02
20:46 <fungi> JayF: yeah, going back the other way i see packet loss starting in or just past tukw.qwest.net
20:46 <fungi> potentially at the lumen/qwest peer there
20:47 <JayF> just broken internet stuff, got it
20:47 <JayF> makes sense
20:47 <JayF> corvus: are you my ISP-brother lol
20:48 <corvus> probably not; maybe just a bad internet day :)
20:49 <corvus> fungi: clarkb https://review.opendev.org/957464 is the extra fix to the event thing from yesterday (also, swest added a change to that stack too)
20:49 <fungi> static is in rackspace dfw, while review is in vexxhost ca-ymq-1, so different clouds in different parts of the continent even
20:50 <corvus> so i've got a whole sequence stacked up now, with the 2 changes from this morning, the 3 event sequence changes, and then 2 more small patches on that.
20:50 <corvus> stack end is at https://review.opendev.org/957296
20:50 <corvus> if everything looks good to folks, i'll probably take a build of that and manually deploy it to the launchers
21:23 <opendevreview> Jeremy Stanley proposed openstack/project-config master: Replace 2025.2/Flamingo key with 2026.1/Gazpacho  https://review.opendev.org/c/openstack/project-config/+/957467
21:47 <clarkb> corvus: I have a few questions on https://review.opendev.org/c/zuul/zuul/+/957464
21:58 <corvus> replied
22:06 <clarkb> corvus: thanks I think that behavior I'm thinking of with the .get is only on certain classes that have a list of attributes that zuul loops through to load up
22:06 <clarkb> this one doesn't do that so I was mistaken
22:08 <corvus> ++ i like this pattern better
22:08 <corvus> (can't always use it tho)
22:19 <corvus> i think it's going to be a long time until we get an image build we can use with the launchers.
22:19 <corvus> (and even longer until those changes actually merge)
22:19 <corvus> i wonder what we could do to improve things....
22:20 <corvus> i could try to monkeypatch those fixes into the running server, but that's a lot to patch and easy to get wrong
22:20 <corvus> we could dequeue some stuff to clear out the queue
22:21 <corvus> i could try to build an image manually
22:22 <corvus> or i could push up a change exclusively to build an image; if we clear the zuul tenant queues, that job might get done in a reasonable time
22:22 <corvus> i'm open to other suggestions
22:23 <clarkb> a change just to build an image seems like maybe something worth trying, then fall back to a locally built image?
22:23 <corvus> ack.  i'll try that.
22:28 <corvus> okay, that change is uploaded, its one job is running now.  i also am running the test suite locally on that change.
22:30 <corvus> it's also worth noting that even when we restart the launcher with these changes, the existing node requests are still going to be a bit of a mess.  so until we work through that backlog, we may continue to have problems.  we also may not be able to work through the backlog due to the way the multinode jobs have wedged things.  so even then, we may need to start dequeuing some things.
22:35 <corvus> er, i said the job was running, i should have said the node was building... :/
22:37 <corvus> unit tests and linters pass locally; so if the job ever runs, we're good
22:41 <corvus> 2025-08-14 22:38:24,549 ERROR zuul.Launcher:   openstack.exceptions.HttpException: HttpException: 503: Server Error for url: https://dfw.servers.api.rackspacecloud.com/v2/637776/servers, Service Unavailable
22:42 <corvus> we're getting 503 errors in rax-dfw (classic) now
22:42 <corvus> we were getting those with rax-ord yesterday, so i turned it off
22:43 <corvus> should we do the same for dfw?
22:43 <corvus> that is our only remaining region in rax classic
22:46 <opendevreview> James E. Blair proposed opendev/zuul-providers master: Disable rax-dfw  https://review.opendev.org/c/opendev/zuul-providers/+/957470
22:47 <opendevreview> Merged opendev/zuul-providers master: Disable rax-dfw  https://review.opendev.org/c/opendev/zuul-providers/+/957470
22:59 <corvus> i'm aggressively cleaning up logs on the launchers
23:02 <clarkb> fwiw trying to list servers from bridge in rax dfw is very slow to return (I haven't gotten results yet) and I wonder if the proxy will eventually 503
23:02 <clarkb> oh wait no there it goes it listed
23:02 <corvus> the graphs and logs suggest that it's spotty
23:02 <clarkb> corvus: I wonder if the 503 problem is just something we have to be resilient against happening occasionally?
23:02 <corvus> but when there are problems, it's hours at a time
23:02 <clarkb> ya that
23:02 <clarkb> ah ok
23:03 <corvus> yeah, if it were occasional, then i think we could work around it, but it's looking a lot more like a 6 hour stretch where the launcher can't see what servers there are
23:06 <corvus> pulling images
23:09 <corvus> the launchers are restarted
23:15 <corvus> i think we're going to need to dequeue at least some of the queue items with multinode jobs in order to un-wedge things
23:15 <corvus> i will start picking some off.
23:22 <clarkb> is that because we're still susceptible to the ordering problem with those node requests hanging around?
23:22 <clarkb> (basically we need new ones to get the ordering right?)
23:23 <corvus> yep, and because so many of the multinode requests are partially completed and they're hogging the quota as ready nodes waiting for their friends
23:26 <corvus> i dequeued 2 kolla changes, and the number of in-use nodes in ovh-bhs1 has increased from 1 to 25
23:26 <corvus> i think it will be able to slowly make progress in that region at this point
23:56 *** cloudnull10977461 is now known as cloudnull

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!