corvus | hrm, log rotation has happened on zl01 but not zl02 | 00:23 |
---|---|---|
clarkb | that rotation is happening in python (not using logrotate) | 00:25 |
corvus | perhaps fallout from the memory issues? | 00:25 |
clarkb | ya maybe the object in python crashed trying to allocate memory? | 00:26 |
clarkb | or even the file copy itself? | 00:26 |
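(For context: the rotation in question happens inside the Python process rather than via logrotate. A minimal sketch of that general approach using the stdlib handler; the path and settings here are assumptions for illustration, not Zuul's actual logging config.)

```python
# Illustration only: time-based rotation handled inside the Python process,
# assumed to resemble (not reproduce) the launcher's logging setup.
import logging
import logging.handlers

logger = logging.getLogger("zuul.Launcher")
handler = logging.handlers.TimedRotatingFileHandler(
    "/var/log/zuul/launcher-debug.log",  # hypothetical path
    when="midnight",   # roll over once a day
    backupCount=7,     # keep a week of rotated files
)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)
```

Because the rollover (rename and reopen) runs in-process when a record is emitted, a process that is failing memory allocations can skip or abort rotation, which is consistent with the speculation above.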
clarkb | however 8e437d947d1c4d6eb53212fd5ce8d53c this request was one that failed per https://zuul.opendev.org/t/openstack/build/e7fe7c44d29242f9993be9137f5f6cf9 and it appears to have been a single node request whose noderequest was fulfilled by zl01 | 00:27 |
clarkb | though I suppose that maybe zl02 was then responsible for handling request completion for the entire request? | 00:27 |
corvus | yeah, they often trade off :) | 00:28 |
corvus | i was looking up request fd545f68533344b8b64135ed500cb4e4 | 00:28 |
clarkb | looks like zl02 handled the ssh keyscanning for that node | 00:29 |
clarkb | I don't see anything indicating why the request failed | 00:29 |
clarkb | unless Exception loading ZKObject at /zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c/revision which happened early in the process was to blame? | 00:29 |
corvus | no that's benign | 00:29 |
corvus | 2025-08-13 22:27:02,098 DEBUG zuul.Launcher: [e: 38edb69d08d34ae4b9cc0faee9bef52a] [req: fd545f68533344b8b64135ed500cb4e4] Request <NodesetRequest uuid=fd545f68533344b8b64135ed500cb4e4, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/fd545f68533344b8b64135ed500cb4e4> nodes ready: [<OpenstackProviderNode uuid=fbfe4e20998942d188a06e9a90270dbe, label=ubuntu-noble-8GB, state=ready, | 00:29 |
corvus | provider=opendev.org%2Fopendev%2Fzuul-providers/rax-dfw-main>] | 00:29 |
corvus | 2025-08-13 22:27:02,100 DEBUG zuul.Launcher: [e: 38edb69d08d34ae4b9cc0faee9bef52a] [req: fd545f68533344b8b64135ed500cb4e4] Marking request <NodesetRequest uuid=fd545f68533344b8b64135ed500cb4e4, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/fd545f68533344b8b64135ed500cb4e4> as fulfilled | 00:29 |
clarkb | I thought so considering it seemed to proceed afterwards | 00:29 |
corvus | i think the launcher thought everything was fine with that request | 00:30 |
clarkb | maybe need to look at the scheduler then? | 00:30 |
corvus | yep | 00:30 |
clarkb | the schedulers have ERROR zuul.LauncherClient: Exception loading ZKObject at /zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c/revision | 00:31 |
corvus | that's fine too | 00:31 |
clarkb | on zuul01 we have 2025-08-14 00:04:34,373 INFO zuul.Pipeline.openstack.gate: [e: 6892d4f49d6e479fa8af9c2296e48c07] Node(set) request <NodesetRequest uuid=8e437d947d1c4d6eb53212fd5ce8d53c, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c>: failure for openstack-tox-pep8 | 00:32 |
corvus | (https://review.opendev.org/957184 will reduce or eliminate a bunch of those; that change is approved) | 00:32 |
clarkb | problem is very few jobs seem to be running right now | 00:32 |
clarkb | maybe like 50% of them; the others are hitting this issue, so it may not land until we figure out what is going on | 00:33 |
clarkb | jumping up a level and looking at the event logs it kinda seems like that is what we get log wise | 00:37 |
clarkb | ok that log message originates from onNodesProvisioned when request.fulfilled is not true. request.fulfilled requires the state to be fulfilled but it is accepted in that log line | 00:40 |
clarkb | on zl01 we get 2025-08-14 00:04:33,972 DEBUG zuul.Launcher: [e: 6892d4f49d6e479fa8af9c2296e48c07] [req: 8e437d947d1c4d6eb53212fd5ce8d53c] Marking request <NodesetRequest uuid=8e437d947d1c4d6eb53212fd5ce8d53c, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c> as fulfilled | 00:40 |
clarkb | then half a second later on zuul01 we get 2025-08-14 00:04:34,373 INFO zuul.Pipeline.openstack.gate: [e: 6892d4f49d6e479fa8af9c2296e48c07] Node(set) request <NodesetRequest uuid=8e437d947d1c4d6eb53212fd5ce8d53c, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c>: failure for openstack-tox-pep8 | 00:41 |
clarkb | the zl01 log indicates it is marking it fulfilled but the state logged on zuul01 is still accepted | 00:41 |
clarkb | so I think the problem is that we are not transitioning the state record properly for some reason | 00:41 |
clarkb | I don't see any changes to the code that marks things fulfilled in the set of changes today | 00:43 |
corvus | i think the launcher may be running very slowly and we're now consistently losing a race | 00:46 |
clarkb | ya I just noticed in the launcher code we send the event for nodes provisioned then update the request attributes | 00:46 |
corvus | i think it's worth restarting the launchers to clear things out | 00:46 |
corvus | exactly that yes | 00:46 |
clarkb | which would be a race condition we can lose if things are slower than the network | 00:46 |
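(A stripped-down illustration of the race being described; the function names and data structures below are hypothetical stand-ins, not Zuul's actual code. "store" stands in for ZooKeeper and "events" for the scheduler's event queue.)

```python
# Hypothetical sketch of the ordering bug, not Zuul's actual implementation.

def fulfill_request_buggy(request, store, events):
    # BUG: the scheduler can pop this event and re-read the request from
    # the store before the state update below has been written, seeing
    # state == "accepted" and treating the request as failed.
    events.append(("nodes-provisioned", request["uuid"]))
    request["state"] = "fulfilled"
    store[request["uuid"]] = dict(request)

def fulfill_request_fixed(request, store, events):
    # Persist the state change first, then notify; any reader woken by the
    # event is now guaranteed to observe state == "fulfilled".
    request["state"] = "fulfilled"
    store[request["uuid"]] = dict(request)
    events.append(("nodes-provisioned", request["uuid"]))
```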
corvus | do you want to restart launchers while i work on a code fix? | 00:46 |
corvus | and maybe do something about logs while they're down | 00:47 |
clarkb | corvus: sure. Just docker compose down && docker compose up -d? | 00:47 |
clarkb | or do you want me to pull first? | 00:47 |
clarkb | I guess I should pull since we had some updates | 00:48 |
clarkb | I'm doing zuul01 now. Will pull then down then up -d as it doesn't need log updates. Then I'll go look at what needs to be done while zl02 is down | 00:49 |
clarkb | Not zuul01. zl01 | 00:49 |
clarkb | ok it appears to be running. Moving on to zl02 | 00:52 |
clarkb | zl02 is down. I'm gzipping the large log file in case there is any info in there worth saving. Then I'll remove the inflated file version if gzip doesn't do that automatically and start the service up again | 00:55 |
clarkb | zl02 launcher is starting up now | 00:57 |
clarkb | corvus: the logfile is gzipped and renamed with a dated suffix. We'll probably want to delete that file in a day or two if we don't end up digging into it for more data | 01:02 |
corvus | ++ thx | 01:02 |
tkajinam | I was about to ask if there is any maintenance going on, seeing frequent node failure but I think you all are aware of the situation. | 01:03 |
* tkajinam is still reading the log | 01:03 |
clarkb | things look happier after I restarted the launchers. Newly pushed / enqueued changes aren't showing up with a bunch of orange in the ui | 01:04 |
clarkb | I'm going to have to pop out soon for dinner. But will check back in on this in the morning | 01:05 |
clarkb | looking at the zuul components list I think our weekend reboots may have gotten stuck or crashed again too | 01:07 |
clarkb | I'm not going to dig into that deeper at the moment since it's late and already Thursday UTC time so things have mostly been working despite that | 01:08 |
clarkb | corvus: ^ just fyi that is probably something to followup on next too | 01:08 |
corvus | ack | 01:10 |
corvus | i'll clean out the leaked image files; i should be able to do it safely while the launchers are running, just based on timestamps. | 01:31 |
corvus | nl01 is not fully functional due to https://review.opendev.org/957296 -- i'm going to restart it | 02:16 |
corvus | 2025-08-14 02:23:15,164 ERROR zuul.Launcher: openstack.exceptions.HttpException: HttpException: 503: Server Error for url: https://ord.servers.api.rackspacecloud.com/v2/637776/flavors/detail?is_public=None, Service Unavailable | 02:25 |
corvus | now we're getting errors from rax-ord | 02:25 |
corvus | 2025-08-14 02:27:17,384 ERROR zuul.Launcher: keystoneauth1.exceptions.connection.ConnectTimeout: Request to https://ord.servers.api.rackspacecloud.com/v2/637776/servers timed out | 02:27 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Disable rax-ord https://review.opendev.org/c/opendev/zuul-providers/+/957297 | 02:28 |
corvus | infra-root: ^ i self-approved that | 02:29 |
opendevreview | Merged opendev/zuul-providers master: Disable rax-ord https://review.opendev.org/c/opendev/zuul-providers/+/957297 | 02:29 |
corvus | 2025-08-14 02:41:00,922 ERROR zuul.Launcher: Exception: Unable to find flavor: opendev.xxlarge in osuosl-regionone-main | 02:41 |
corvus | apparently there is an opendev.xxxlarge | 02:43 |
corvus | there's no timestamp or anything, so i don't know if that changed | 02:43 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Correct osuosl flavor name https://review.opendev.org/c/opendev/zuul-providers/+/957298 | 02:44 |
corvus | infra-root: ^ i'm self-approving that too | 02:44 |
opendevreview | Merged opendev/zuul-providers master: Correct osuosl flavor name https://review.opendev.org/c/opendev/zuul-providers/+/957298 | 02:44 |
corvus | it's pretty bad if we can't talk to a cloud at startup... we may need to parallelize the startup process and make the providers a bit more independent. | 02:47 |
opendevreview | Takashi Kajinami proposed opendev/system-config master: WIP: Add OpenVox to mirror https://review.opendev.org/c/opendev/system-config/+/957299 | 02:51 |
corvus | we can't deal with the load right now in this degraded state, i'm going to dump the periodic pipeline. | 02:51 |
corvus | okay, afaict the launchers are working as well as they can be atm with ord misbehaving | 03:22 |
corvus | i had to dump the periodic, periodic-stable, and check-arm64 queues in openstack | 03:23 |
corvus | i also had to re-enqueue some changes in zuul/gate in order to get it to stop trying to boot nodes on ord | 03:23 |
*** mrunge_ is now known as mrunge | 08:49 |
stephenfin | clarkb: fungi: When you're around, I think we want a v7.0.1 release for https://review.opendev.org/c/openstack/pbr/+/957262/ and above | 10:50 |
stephenfin | ...and above | 10:51 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: Raise connection pool for boto3 in s3 upload role https://review.opendev.org/c/zuul/zuul-jobs/+/957218 | 12:42 |
tristanC[m] | It looks like I was mentioned last week in this room, but I can't find the message with element, apparently the "jump to notification" feature is not working. So please contact me again if needed! | 13:19 |
frickler | tristanC[m]: likely this one? https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2025-08-07.log.html#opendev.2025-08-07.log.html#t2025-08-07T18:15:24 | 13:26 |
tristanC[m] | frickler: ha thanks! So are you grepping these irclogs file to find matrix mention? | 13:31 |
tristanC[m] | clarkb: that's right, that's done for "reproducibility" where the image timestamps are set to a known value. But that's not super useful, next time I update the matrix gerritbot I'll use the current time of the day. | 13:32 |
tristanC[m] | But as far as I know, the service has been humming away without noticeable bug, so I'm inclined to leave it that way unless you want to fix that. | 13:33 |
frickler | tristanC[m]: I just looked for a mention of your nick last week. but for larger searches I do grep my local logs sometimes. one more advantage of IRC over matrix, I'd think, haven't found a way to save logs from element yet | 13:49 |
fungi | while i can't recommend the matrix-rs plugin for weechat in its present state due to significant missing features, it does give me the ability to log matrix room discussions just as if they were irc channels | 13:51 |
corvus | frickler: Room Info -> Export Chat | 13:53 |
corvus | in element | 13:53 |
fungi | frickler: on 957262 why are you saying "(alias, arg) in DEPRECATED_CFG" won't match in that spot if it was matching a few lines earlier (the entire reason for the change is to get it to stop outputting those warnings unless they were actually used) | 13:58 |
corvus | i'm still not happy with the launcher behavior; i have some stuff to do this morning, but i'll resume looking into that when i'm done. | 13:58 |
corvus | the launchers aren't stuck, they're making progress, just slowly and poorly. i wouldn't try to do anything manual. | 13:59 |
fungi | noted, thanks corvus | 13:59 |
frickler | fungi: from my understanding it cannot have worked before, either | 14:02 |
fungi | so DEPRECATED_CFG isn't a list of tuples then? | 14:04 |
fungi | or is the problem more nuanced? | 14:04 |
fungi | i'm just trying to figure out where the excess warning messages were coming from if that was dead code as you say | 14:05 |
frickler | fungi: you can check the definitions at the beginning of the same file | 14:05 |
frickler | these are nested tuples, with alias = (section, keyname), and only that tuple is being used as key in DEPRECATED_CFG | 14:06 |
fungi | huh, i think the definition may have missing commas | 14:06 |
frickler | I commented on that, too, but that's a different - albeit related - bug | 14:07 |
fungi | the dict keys should be ('metadata', 'home_page') rather than concatenated strings i guess | 14:07 |
frickler | yes | 14:07 |
fungi | implicit string concatenation strikes again | 14:07 |
fungi | oh, *some* of them match | 14:07 |
fungi | ('metadata', 'requires_dist') is a key, for example | 14:08 |
fungi | it's just the first few that are missing commas and so will never match | 14:08 |
frickler | is there a sample of the excess warning messages showing up in logs somewhere? | 14:08 |
fungi | stephenfin: ^ did you have a public sample to link? otherwise i can fish one out of a random job log for a pbr-using project | 14:09 |
fungi | basically some of the keys in the current DEPRECATED_CFG dict aren't working at all because of the missing commas leading to implicit concatenated strings, while others are matching because they're proper tuples as expected | 14:12 |
fungi | prior to 957262 pbr is emitting a warning for every config option it iterates over if that tuple is a key in the DEPRECATED_CFG dict (so all of the not-broken ones), with 957262 it only does so if the option was present in the configuration it's checking, since the conditional above it short-circuits the loop first | 14:14 |
frickler | I still claim nothing should match at all, since e.g. (alias, arg) = (('metadata', 'home_page'), 'url') in CFG_TO_PY_SETUP_ARGS, while DEPRECATED_CFG has (modulo missing comma) only ('metadata', 'home_page') as dict key | 14:15 |
fungi | ah, i see, so alias is also a tuple | 14:16 |
frickler | yes and it is getting split as "section, option = alias" in L410. so using "(section, option)" in the deprecation check could also work and might be more readable | 14:17 |
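(The missing-comma failure mode is easy to reproduce in isolation; this is a toy dict, not pbr's actual definitions.)

```python
# Toy reproduction of the implicit string concatenation bug; keys and
# messages are invented, the real table lives in pbr.
DEPRECATED_CFG = {
    # Missing comma: the two adjacent literals concatenate, so the key is
    # the plain string 'metadatahome_page', not a 2-tuple.
    ('metadata' 'home_page'): 'home_page is deprecated',
    # Correct form: a real ('section', 'option') tuple key.
    ('metadata', 'requires_dist'): 'requires_dist is deprecated',
}

print(list(DEPRECATED_CFG.keys()))
# ['metadatahome_page', ('metadata', 'requires_dist')]
print(('metadata', 'home_page') in DEPRECATED_CFG)      # False: broken key
print(('metadata', 'requires_dist') in DEPRECATED_CFG)  # True:  working key
```

That matches the split behavior noted above: the entries with real tuple keys can warn, while the concatenated ones never match anything.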
clarkb | Looks like corvus is already looking at launcher things. I'll probably start my morning looking at the zuul weekly reboot status (after noticing inconsistency on the components page last night) then look at the pbr changes. | 14:57 |
clarkb | but first breakfast | 14:57 |
clarkb | TASK [zuul-executor : Check if Zuul Executor containers are running] failed on ze05 with rc -13. This is what we believe to be the ssh ControlPersist race with starting a task as the ssh process shuts down | 15:03 |
clarkb | so I guess that is "good" news in that it isn't a new error | 15:03 |
clarkb | We can probably just manually start the cronjob today if we want to kick things off sooner but I think with the launcher situation it's probably best to address launcher things first and possibly just wait for the late friday cronjob to fire. Then check the results first thing monday | 15:04 |
clarkb | the run prior appears to have been successful | 15:04 |
clarkb | it would be nice if ansible had some if rc code == -13 then retry option | 15:10 |
corvus | clarkb: did we merge the change to do keepalives? | 15:48 |
clarkb | corvus: yes | 15:50 |
clarkb | `ssh_args = -o ControlMaster=auto -o ControlPersist=180s -o ServerAliveInterval=60` is the current /etc/ansible/ansible.cfg entry on bridge | 15:51 |
stephenfin | Thanks, frickler 🤦 I've cleaned that up now https://review.opendev.org/c/openstack/pbr/+/957262 | 16:10 |
fungi | and added tests! | 16:11 |
fungi | very nice | 16:11 |
corvus | i get a 500 error for /usr/launcher-venv/bin/openstack --os-cloud opendevzuul-rax-flex --os-region SJC3 server list | 16:26 |
corvus | seems to work for dfw3 | 16:26 |
corvus | Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. | 16:27 |
corvus | <class 'KeyError'> | 16:27 |
fungi | hah | 16:27 |
fungi | dan_with: do you happen to know whether there's maintenance or an incident going on in flex sjc3 right now that would cause the nova api to return a 500 error? ^ | 16:28 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Disable raxflex-sjc3 https://review.opendev.org/c/opendev/zuul-providers/+/957449 | 16:29 |
fungi | i'll test with my personal account too | 16:29 |
corvus | opendevci account works | 16:29 |
corvus | may be a problem with opendevzuul | 16:30 |
fungi | though i did notice a vm i have there spontaneously rebooted at 11:19 utc today | 16:30 |
clarkb | I want to say we had this problem before and it required intervention on the cloud side | 16:30 |
* clarkb tries to find notes | 16:30 |
opendevreview | Merged opendev/zuul-providers master: Disable raxflex-sjc3 https://review.opendev.org/c/opendev/zuul-providers/+/957449 | 16:31 |
clarkb | according to my irc logs jamesdenton helped debug this issue on April 9 | 16:32 |
clarkb | at the time there were three instances in an Error state but nova couldn't list them due to keyerrors | 16:33 |
corvus | we are down 3 cloud regions; this is making a significant dent in our capacity. | 16:33 |
clarkb | https://bugs.launchpad.net/nova/+bug/2095364 its this bug | 16:34 |
clarkb | or was this bug. Not 100% certain it is the same problem this time | 16:34 |
clarkb | I think we can try an older microversion and attempt to delete error'd nodes that way | 16:35 |
clarkb | I'll look into that | 16:35 |
corvus | earlier one of the launchers saw: 2025-08-14 16:28:22,056 ERROR zuul.Launcher: <class 'sqlalchemy.exc.OperationalError'> | 16:35 |
clarkb | `sudo openstack --os-cloud opendevzuul-rax-flex --os-region SJC3 --os-compute-api-version 2.54 server list` this works on bridge | 16:36 |
corvus | clarkb: if you have a process for listing servers using the earlier microversion, i'd be interested in the process and/or the current output | 16:36 |
clarkb | corvus: ^ that is it | 16:36 |
corvus | ++ | 16:36 |
clarkb | I'll hold off on deleting anything but there are a number of servers in an ERROR state. | 16:36 |
corvus | thanks! give me 1 second, then i think we can go to town deleting | 16:37 |
clarkb | setting the version to 2.95 also works | 16:37 |
clarkb | I don't know why I have a much older version in my command history too, but I think that confirms this is the problem | 16:37 |
corvus | okay i think we can delete stuff now | 16:38 |
clarkb | corvus: do you want me to do that? | 16:38 |
corvus | yes please | 16:39 |
clarkb | ok I will attempt to manually delete every instance in sjc3 in an error state | 16:39 |
corvus | i just saw the same error in dfw3, then dfw3 worked again in a subsequent attempt a few seconds later | 16:41 |
corvus | (sjc3 still seems consistent) | 16:41 |
corvus | and now it works | 16:42 |
fungi | huh | 16:42 |
corvus | (after clarkb deleted stuff, i think) | 16:42 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Revert "Disable raxflex-sjc3" https://review.opendev.org/c/opendev/zuul-providers/+/957452 | 16:42 |
clarkb | yes deletions are complete and a listing shows no more ERROR'd nodes | 16:42 |
fungi | that's "fun" | 16:42 |
opendevreview | Merged opendev/zuul-providers master: Revert "Disable raxflex-sjc3" https://review.opendev.org/c/opendev/zuul-providers/+/957452 | 16:43 |
clarkb | last time the cloud intervened on our behalf and deleted the nodes for us | 16:44 |
clarkb | but it is good to know that we can do it if we use the correct magical microversion | 16:45 |
fungi | yeah, that's a great find | 16:45 |
fungi | though it's supposedly fixed in a stable point release for dalmatian, has a fix merged in stable/caracal (no point release yet), and one proposed for stable/bobcat that hasn't merged | 16:47 |
clarkb | I don't know which version rackspace flex has deployed but maybe dan_with can try to get that fix on their radar | 16:48 |
clarkb | thinking out loud here and I'm not convinced this is a good idea: but we could probably get away with forcing microversion 2.95 on all clouds that support 2.96? | 16:51 |
clarkb | that would require looking up supported versions but launchers are already caching things on startup about clouds so maybe this would fit into that process? Basically look up microversions if present and, if version 2.95 is in the valid versions, use it; otherwise use whatever default the sdk picks | 16:52 |
corvus | i'm not sure if that affected the launcher... now that i know more of what's going on, that may have been cli only | 16:52 |
clarkb | huh I guess that could be if the sdk listing things isn't requesting the field with a keyerror or is using a "compatible" microversion for some reason? | 16:53 |
corvus | we don't use the sdk method for listing | 16:54 |
clarkb | got it. That may explain it then as we probably aren't doing anything with microversions like the cli or sdk would | 16:55 |
corvus | the launchers do a straight keystoneauth get of '/servers/detail' | 16:55 |
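(For reference, pinning a microversion on that kind of raw request just means sending the compute microversion header. A hedged sketch of how that could look; the cloud and region names are from this incident, the rest is ordinary keystoneauth/openstacksdk usage rather than the launcher's actual code.)

```python
# Sketch only: a raw /servers/detail GET with a pinned compute microversion,
# not the launcher's real implementation.
import openstack
from keystoneauth1.adapter import Adapter

conn = openstack.connect(cloud="opendevzuul-rax-flex", region_name="SJC3")
compute = Adapter(
    session=conn.session,
    service_type="compute",
    region_name="SJC3",
    interface="public",
    # Sends "OpenStack-API-Version: compute 2.95" on each request, staying
    # below the newer code path that raised the KeyError in this nova bug.
    default_microversion="2.95",
)
servers = compute.get("/servers/detail").json()["servers"]
errored = [s["id"] for s in servers if s["status"] == "ERROR"]
print(errored)
```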
clarkb | fungi: looking at the timestamps on that nova bug seems like a lot of those changes just merged | 17:15 |
clarkb | I guess we shouldn't be surprised the fixes haven't rolled out to actual clouds yet | 17:15 |
fungi | yeah, agreed | 17:15 |
fungi | apropos of nothing in particular, https://github.com/mfontanini/presenterm looks like a presentty clone in rust (though using markdown instead of restructuredtext) | 17:29 |
corvus | okay, i have identified 2 improvements i'd like to make to the launchers. they aren't trivial, but they aren't too complex. it's going to take a bit. i don't have any quick fixes to deal with the backlog. i'm just going to try to plow through writing these changes and we can try to get them merged asap. | 17:34 |
corvus | after that, i'll fix up the event stuff from yesterday. even though that was catastrophic, as long as we don't start swapping again, i think it's slightly lower priority. | 17:34 |
clarkb | makes sense | 17:35 |
fungi | sounds good, thanks! | 17:41 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/957456 Include requested nodes in quota percentage calculation [NEW] | 17:43 |
corvus | that's the first change -- hopefully the commit msg explains sufficiently | 17:43 |
corvus | the second one i'm going to work on is roughly "order the node objects so we fill old multinode requests sooner" | 17:44 |
fungi | yep, commit message made sense, now seeing if i can apply that understanding to the diff | 17:45 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/957457 Sort nodes by request time [NEW] | 18:13 |
corvus | that's change #2 | 18:13 |
corvus | was not expecting sequential numbers for those | 18:14 |
corvus | i think the systemic problem that's going on is that: | 18:23 |
corvus | any time the quota usage for a provider drops, the launchers see an opportunity and create a node request for that provider, regardless of how many other backlogged requests there are | 22:23 |
corvus | even if that only happens one at a time (because after it creates one node, it should account for that allocation and defer the next request), given enough time, all the node requests can end up with allocated nodes, as each one has plenty of opportunities to find a "hole" | 18:23 |
corvus | the first change should help us avoid that situation by accounting for the backlog earlier, so that zuul doesn't see the hole in the first place | 18:24 |
corvus | some leakage will still happen due to multinode requests; the second change helps get them cleared out in the right order. | 18:24 |
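(A toy rendering of those two ideas, deliberately simplified; the field names are invented and this is not the code in 957456 or 957457.)

```python
# Toy model only: illustrates the intent of the two fixes, not Zuul's code.

def provider_has_room(provider, backlog):
    # Fix 1: count nodes already promised to accepted-but-unfulfilled
    # requests as used quota, so a momentary dip in usage is not mistaken
    # for a free slot (the "hole" described above).
    promised = sum(len(req["labels"]) for req in backlog if not req["fulfilled"])
    return provider["in_use"] + provider["building"] + promised < provider["quota"]

def build_order(pending_nodes):
    # Fix 2: serve nodes for the oldest requests first, so partially
    # fulfilled multinode requests get completed instead of sitting on
    # ready nodes while newer requests jump ahead.
    return sorted(pending_nodes, key=lambda n: n["request_created_time"])
```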
clarkb | corvus: I have some questions on https://review.opendev.org/c/zuul/zuul/+/957457 | 18:27 |
corvus | thx replied | 18:31 |
clarkb | corvus: thanks one followup posted | 18:35 |
clarkb | side note: I love that datetime.datetime.utcfromtimestamp is deprecated and will be removed in a future python release because you are supposed to use timezone aware objects to represent datetimes in UTC | 18:37 |
clarkb | "utcfromtimestamp" isn't timezone aware enough I guess | 18:38 |
clarkb | posix epoch times are recorded as beginning from midnight january 1 1970 UTC. Asking for the datetime in UTC X seconds after the epoch is timezone aware... | 18:38 |
corvus | clarkb: thx fixed | 18:39 |
clarkb | corvus: are you pushing a new patchset? | 18:48 |
clarkb | want to make sure I wait for that to +2 before lunch if so | 18:48 |
corvus | oh sorry, git-review was asking me a question i forgot to answer | 19:16 |
corvus | hope you didn't wait for lunch | 19:16 |
fungi | git-review can be so nosey sometimes | 19:18 |
*** dhill is now known as Guest24174 | 19:31 |
clarkb | I did not wait :) | 19:39 |
clarkb | I've approved it now | 19:40 |
JayF | clarkb: utcfromtimestamp() returns a tzinfo=None object; so while you get a datetime that is UTC, it doesn't return a tz-aware value | 19:50 |
JayF | they can't change the return without breaking the api | 19:51 |
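(For the record, the replacement the deprecation points at is the timezone-aware form.)

```python
from datetime import datetime, timezone

ts = 1755129600  # an arbitrary POSIX timestamp

# Deprecated since Python 3.12: returns a naive datetime (tzinfo is None)
# even though its wall-clock fields are UTC.
naive = datetime.utcfromtimestamp(ts)

# Replacement: an aware datetime that carries tzinfo=timezone.utc.
aware = datetime.fromtimestamp(ts, tz=timezone.utc)

print(naive.tzinfo)  # None
print(aware.tzinfo)  # UTC
```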
JayF | I am unable to reach security.openstack.org | 20:36 |
JayF | which is resolving to static02.opendev.org for me | 20:36 |
JayF | only via ipv4, once I tried ipv6 it's working | 20:38 |
JayF | mtr implies it's somewhere close to the lasthop that's dropping | 20:38 |
fungi | i can ping it over v4 and v6 | 20:39 |
fungi | so if it's close to the last hop, it might be that you've got an asymmetrical route so your traceroute shows the drop at the hop where they converge (the provider's border) | 20:40 |
fungi | if you want to msg me your ip address i could try to traceroute to it from static02 for comparison | 20:40 |
clarkb | JayF: I mean that seems reasonable if A) the function name is literally UTC already and B) they are going to delete it and break everyone anyway | 20:41 |
clarkb | I dunno just seems like if you're going to break it then breaking a subset of people is probably better | 20:41 |
JayF | I don't really agree; you can't use tzinfo=None and tzinfo=!None objects interchangeably. It's just a bad interface from the beginning and any pain now is just paying back for that original sin | 20:42 |
fungi | it sends a clear message | 20:42 |
fungi | to any other functions that might try to misbehave | 20:42 |
JayF | fungi: https://usercontent.irccloud-cdn.com/file/fQNI38xe/image.png | 20:43 |
corvus | i have packet loss to both review02 and static02 | 20:45 |
fungi | JayF: yeah, going back the other way i see packet loss starting in or just past tukw.qwest.net | 20:46 |
fungi | potentially at the lumen/qwest peer there | 20:46 |
JayF | just broken internet stuff, got it | 20:47 |
JayF | makes sense | 20:47 |
JayF | corvus: are you my ISP-brother lol | 20:47 |
corvus | probably not; maybe just a bad internet day :) | 20:48 |
corvus | fungi: clarkb https://review.opendev.org/957464 is the extra fix to the event thing from yesterday (also, swest added a change to that stack too) | 20:49 |
fungi | static is in rackspace dfw, while review is in vexxhost ca-ymq-1, so different clouds in different parts of the continent even | 20:49 |
corvus | so i've got a whole sequence stacked up now, with the 2 changes from this morning, the 3 event sequence changes, and then 2 more small patches on that. | 20:50 |
corvus | stack end is at https://review.opendev.org/957296 | 20:50 |
corvus | if everything looks good to folks, i'll probably take a build of that and manually deploy it to launchers | 20:50 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Replace 2025.2/Flamingo key with 2026.1/Gazpacho https://review.opendev.org/c/openstack/project-config/+/957467 | 21:23 |
clarkb | corvus: I have a few questions on https://review.opendev.org/c/zuul/zuul/+/957464 | 21:47 |
corvus | replied | 21:58 |
clarkb | corvus: thanks I think that behavior I'm thinking of with the .get is only on certain classes that have a list of attributes that zuul loops through to load up | 22:06 |
clarkb | this one doesn't do that so I was mistaken | 22:06 |
corvus | ++ i like this pattern better | 22:08 |
corvus | (can't always use it tho) | 22:08 |
corvus | i think it's going to be a long time until we get an image build we can use with the launchers. | 22:19 |
corvus | (and even longer until those changes actually merge) | 22:19 |
corvus | i wonder what we could do to improve things.... | 22:19 |
corvus | i could try to monkeypatch those fixes into the running server, but that's a lot to patch and easy to get wrong | 22:20 |
corvus | we could dequeue some stuff to clear out the queue | 22:20 |
corvus | i could try to build an image manually | 22:21 |
corvus | or i could push up a change exclusively to build an image; if we clear the zuul tenant queues, that job might get done in a reasonable time | 22:22 |
corvus | i'm open to other suggestions | 22:22 |
clarkb | change just to build an image seems like maybe something worth trying, then fall back to a locally built image? | 22:23 |
corvus | ack. i'll try that. | 22:23 |
corvus | okay, that change is uploaded, its one job is running now. i also am running the test suite locally on that change. | 22:28 |
corvus | it's also worth noting that even when we restart the launcher with these changes, the existing node requests are still going to be a bit of a mess. so until we work through that backlog, we may continue to have problems. we also may not be able to work through the backlog due to the way the multinode jobs have wedged things. so even then, we may need to start dequeuing some things. | 22:30 |
corvus | er, i said the job was running, i should have said the node was building... :/ | 22:35 |
corvus | unit tests and linters pass locally; so if the job ever runs, we're good | 22:37 |
corvus | 2025-08-14 22:38:24,549 ERROR zuul.Launcher: openstack.exceptions.HttpException: HttpException: 503: Server Error for url: https://dfw.servers.api.rackspacecloud.com/v2/637776/servers, Service Unavailable | 22:41 |
corvus | we're getting 503 errors in rax-dfw (classic) now | 22:42 |
corvus | we were getting those with rax-ord yesterday, so i turned it off | 22:42 |
corvus | should we do the same for dfw? | 22:43 |
corvus | that is our only remaining region in rax classic | 22:43 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Disable rax-dfw https://review.opendev.org/c/opendev/zuul-providers/+/957470 | 22:46 |
opendevreview | Merged opendev/zuul-providers master: Disable rax-dfw https://review.opendev.org/c/opendev/zuul-providers/+/957470 | 22:47 |
corvus | i'm aggressively cleaning up logs on the launchers | 22:59 |
clarkb | fwiw trying to list servers from bridge in rax dfw is very slow to return (I haven't gotten results yet) and I wonder if the proxy will eventually 503 | 23:02 |
clarkb | oh wait no there it goes it listed | 23:02 |
corvus | the graphs and logs suggest that it's spotty | 23:02 |
clarkb | corvus: I wonder if the 503 problem is just something we have to be resilient against happening occasionally? | 23:02 |
corvus | but when there are problems, it's hours at a time | 23:02 |
clarkb | ya that | 23:02 |
clarkb | ah ok | 23:02 |
corvus | yeah, if it were occasionally, then i think we could work around it, but it's looking a lot more like a 6 hour stretch where the launcher can't see what servers there are | 23:03 |
corvus | pulling images | 23:06 |
corvus | the launchers are restarted | 23:09 |
corvus | i think we're going to need to dequeue at least some of the queue items with multinode jobs in order to un-wedge things | 23:15 |
corvus | i will start picking some off. | 23:15 |
clarkb | is that because we're still susceptible to the ordering problem with those node requests hanging around? | 23:22 |
clarkb | (basically we need new ones to get the ordering right?) | 23:22 |
corvus | yep, and because so many of the multinode requests are partially completed and they're hogging the quota as ready nodes waiting for their friends | 23:23 |
corvus | i dequeued 2 kolla changes, and the number of in-use nodes in ovh-bhs1 has increased from 1 to 25 | 23:26 |
corvus | i think it will be able to slowly make progress in that region at this point | 23:26 |
*** cloudnull10977461 is now known as cloudnull | 23:56 |