*** elodilles_pto is now known as elodilles | 05:53 | |
ykarel | corvus, fungi: some recent multinode failures you might want to check now that the last patch for the quota message has merged | 12:11 |
ykarel | 19th https://zuul.openstack.org/build/a97ae07d126648ffa02d979759a672f4 | 12:11 |
ykarel | 20th https://zuul.openstack.org/build/2bccf7810f0547b3a09175ff0eee295a | 12:11 |
ykarel | 21st https://zuul.openstack.org/build/8a295130e958491cba1d343f4da19754 | 12:11 |
*** ykarel_ is now known as ykarel | 12:40 | |
fungi | ykarel: thanks, i think the latest improvement for that (955292) would have taken effect at worst during the manual restart of zuul services that occurred a little before 19:00 utc on saturday (2025-07-19), so at least the second and third example happened with it in place | 13:00 |
fungi | frickler: did my comment on https://review.opendev.org/952861 answer your question? | 13:27 |
Clark[m] | Making note of it early as I boot my day because it is potentially important/relevant: there are two reports from Gerrit users who updated to newer versions like we did. First is problems doing full offline reindexing after upgrading from 3.11.3 to 3.11.4. We can check on this as part of our 3.10 to 3.11 upgrade. Second is replication failing for specific refs until restarting Gerrit (even a request to replicate everything fails). The | 14:00 |
Clark[m] | presumed commit with the issue is in the replication plugin for 3.11.4 and 3.12.1 but not 3.10.7 (which is what we run) | 14:00 |
Clark[m] | If we notice problems with either of those on 3.10.7 I can pass that info along to aid upstream debugging | 14:01 |
corvus | ykarel: ack, thanks! i'll look into those | 14:02 |
corvus | https://zuul.openstack.org/build/8a295130e958491cba1d343f4da19754 | 14:17 |
corvus | both nodes were in ovh-bhs1 | 14:17 |
corvus | https://zuul.openstack.org/build/2bccf7810f0547b3a09175ff0eee295a | 14:17 |
corvus | both nodes were in ovh-gra1 | 14:17 |
corvus | https://zuul.openstack.org/build/a97ae07d126648ffa02d979759a672f4 | 14:17 |
corvus | both nodes were in rax-dfw | 14:17 |
corvus | ykarel: it looks like those failures were not zuul/nodepool related | 14:17 |
ykarel | corvus, but i see those on different nodes | 14:20 |
ykarel | first one ovh-bhs1-main and rax-dfw-main | 14:20 |
ykarel | https://zuul.openstack.org/build/8a295130e958491cba1d343f4da19754/log/job-output.txt#38 | 14:21 |
ykarel | second ovh-gra1-main and rax-dfw-main https://zuul.openstack.org/build/2bccf7810f0547b3a09175ff0eee295a/log/job-output.txt#38 | 14:22 |
ykarel | third rax-dfw-main and rax-ord-main https://zuul.openstack.org/build/a97ae07d126648ffa02d979759a672f4/log/job-output.txt#38 | 14:22 |
clarkb | fungi: maybe after you squash the stack of gitea page updates I can rebase https://review.opendev.org/c/opendev/system-config/+/955411 onto that and we can rollout 1.24.3 after the edits | 14:23 |
fungi | sounds good, or we can go ahead with the gitea upgrade while we wait for consensus on the splash page content updates | 14:24 |
corvus | ykarel: oh sorry! i misread | 14:25 |
corvus | okay, for https://zuul.openstack.org/build/8a295130e958491cba1d343f4da19754 it looks like ovh-bhs1 put the server in "error" state with no fault message, so there was no way for us to tell if it was quota related or not | 14:32 |
corvus | for https://zuul.openstack.org/build/2bccf7810f0547b3a09175ff0eee295a we got a fault message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance ... | 14:38 |
fungi | mmm, i spotted a sneaky spam message that made it onto openstack-discuss and thought that someone approved that new subscriber through moderation without noticing... but in revisiting the list settings i see that message acceptance for members is set to default processing, but i was sure i had set it to hold for moderation. i wonder how/when it switched back | 14:41 |
fungi | i've set it again now and will keep a closer eye on it | 14:44 |
corvus | and https://zuul.openstack.org/build/a97ae07d126648ffa02d979759a672f4 was the "error" state with no fault message again | 14:44 |
corvus | okay, so i think z-l is working as designed here, and we're just getting unclear error messages from openstack. so if we want to improve this, we need to change the design. i think we could either have z-l be more insistent that node requests stay in the same provider, or we could have it (optionally) error if they can't. | 14:46 |
corvus | the cost of the first option would be that we might leave some ready nodes around if, say, 1 of 2 nodes succeeds. | 14:47 |
corvus | nonetheless, i think that might be the best approach, so i'll look into doing that first. | 14:47 |
corvus | (if we end up with a ready node, it probably won't stick around for long, unless it's an esoteric label) | 14:48 |
fungi | checking the other list i'd set to default hold for moderation (legal-discuss) it's still set, so maybe whatever happened to openstack-discuss was user error on my part | 14:48 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM Test offline reindexing with Gerrit 3.11.4 https://review.opendev.org/c/opendev/system-config/+/955487 | 15:05 |
Ramereth[m] | fungi: so there's no single endpoint that I can look at to see if there are any images still stuck in a deleting state like before? I see several that seem to be stuck in a "saving" state and one in "queued" currently from my side. | 15:37 |
fungi | Ramereth[m]: i don't believe so, but maybe corvus or clarkb have some ideas | 15:37 |
fungi | an api query to get that might be possible | 15:37 |
corvus | Ramereth[m]: fungi are we interested in a particular cloud? | 15:43 |
corvus | like, is the intended use something like "show all images in rax-flex that are in X state"? | 15:45 |
clarkb | corvus: I suspect Ramereth[m] is only interested in the osuosl cloud image states | 15:52 |
corvus | okay, so we want to see "what is the status of every upload in osuosl"? | 15:52 |
clarkb | yes I think so. The reason for that is sometimes the cloud fails to delete the images properly and intervention is required. I think Ramereth[m] was monitoring that via the old api and periodically cleaning things out on the cloud side | 15:53 |
corvus | indeed we don't have a summary view of all uploads for a given provider in the web ui, so right now, you'd have to click through to each image. | 16:00 |
corvus | but we can get that info through the api | 16:00 |
corvus | Ramereth[m]: try this: `curl -q https://zuul.opendev.org/api/tenant/opendev/providers |jq -r '.[] | select(.name == "osuosl-regionone-main").images[].build_artifacts[].uploads[] | [.state, .external_id] | @tsv'` | 16:00 |
corvus | it does not look entirely healthy | 16:01 |
clarkb | my test in 955487 seems to show that in the trivial case offline reindexing is fine with gerrit 3.11.4. I suspect the problem with reindexing there is specific to the state of the changes on the system tripping over some bug | 16:05 |
clarkb | I think I'll just continue with upgrade prep and monitor the situation upstream to determine if we need to do some testing on our side. I think worst case we'll be able to revert to older versions if we hit things on our side (note offline reindexing shouldn't be necessary for the upgrade itself) | 16:06 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM Forced fail on Gerrit to test 3.11 upgrade and downgrade https://review.opendev.org/c/opendev/system-config/+/893571 | 16:45 |
clarkb | put a couple of holds in place for ^ to dig into the upgrade more | 16:48 |
clarkb | corvus: any reason to not approve https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/955236 at this point (this removes the nodepool config verification job) | 16:57 |
corvus | clarkb: nope i think we can proceed with that | 16:58 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/955497 Avoid duplicate launcher upload jobs [NEW] | 17:11 |
corvus | we have a lot of "pending" uploads; ^ that change should address that | 17:11 |
clarkb | fungi: re the opendev.org main page I guess you're looking for more consensus around that? Should we plan to upgrade gitea first then and get reviews on https://review.opendev.org/c/opendev/system-config/+/955411 ? I expect to be around today and can monitor that if we do it. The main concern I have is with gitea13 (and any others I guess) being overloaded still | 17:12 |
clarkb | it is busy this morning (the others aren't) but it's below the worst that I've seen | 17:13 |
clarkb | corvus: I had a couple of notes on 955497 but nothing major so went ahead and +2'd it | 17:23 |
corvus | clarkb: thx, replied | 17:28 |
fungi | clarkb: mainly i was hoping to see if frickler's concern on 952861 was addressed | 17:33 |
clarkb | corvus: huh we don't mark the image build artifact ready as a precursor to then uploading the image build? | 17:33 |
clarkb | corvus: I guess in my mind it was the image build goes ready, then we upload it | 17:33 |
fungi | once i know whether 952861 has consensus i can decide between squashing all three or just squashing the two parents for now | 17:34 |
corvus | nope, because "ready image builds with no uploads" look like something that should be deleted. so we create the uploads before we say the image is ready. it's the last thing that happens, and it happens in the same "transaction" | 17:34 |
clarkb | got it | 17:36 |
corvus | (we normally create the uploads as "pending"; that unit test does it as uploading, but otherwise, it's the same) | 17:36 |
clarkb | fungi: ack. I guess we review the 1.24.3 upgrade in parallel and if it is ready first it can go first otherwise I can rebase it | 17:36 |
fungi | i'll probably disappear around 19:00-20:00 utc to get a late lunch/early dinner, but otherwise expect to be around for helping monitor the gitea upgrade | 17:37 |
corvus | clarkb: here's the code (you can see it structurally mirrors the test) https://opendev.org/zuul/zuul/src/commit/01e95e978668123a54ff76f1da77e178c88d13f9/zuul/scheduler.py#L3407 | 17:37 |
clarkb | corvus: thanks | 17:37 |
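For readers following along, here is a rough illustrative sketch of the ordering corvus describes above. This is not the actual Zuul code (the linked scheduler.py is authoritative) and the function and transaction API names are invented for illustration; the point is only that uploads are created in a "pending" state first and that marking the build artifact "ready" is the last step, in the same transaction, so a ready build never exists without its uploads.

```python
# Hypothetical sketch only -- names and API are invented for illustration;
# see the linked zuul/scheduler.py for the real implementation.
def finish_image_build(zk, build_artifact, providers):
    with zk.transaction() as txn:
        # Create an upload record per provider first, in "pending" state
        # (the unit test mentioned above uses "uploading" instead).
        for provider in providers:
            txn.create_upload(build_artifact, provider, state="pending")
        # Marking the build artifact ready is the last thing that happens,
        # and it happens in the same "transaction" as the upload creation.
        txn.set_state(build_artifact, "ready")
```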
fungi | the splash page changes aren't urgent, i'd just prioritize the gitea upgrade for now | 17:37 |
clarkb | ack if anyone else wants to review https://review.opendev.org/c/opendev/system-config/+/955411 before 20:00 that would be great. Otherwise I can approve it around then (fungi's early dinner lines up with my lunch hour so I figure we can wait until after that block of time) | 17:41 |
clarkb | jrosser: it's been long enough that https://review.opendev.org/c/openstack/diskimage-builder/+/954760 should've addressed your debian image problems with backports | 17:53 |
clarkb | jrosser: not a rush but any chance you can confirm that is the case? | 17:53 |
clarkb | once that is confirmed I can abandon https://review.opendev.org/c/zuul/zuul-jobs/+/954280 | 18:05 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Multiple Gitea splash page updates https://review.opendev.org/c/opendev/system-config/+/952407 | 18:41 |
fungi | headed out to grab a bite to eat, shouldn't be more than an hour | 18:56 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 19:12 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 19:13 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 19:14 |
Ramereth[m] | corvus: that's helpful, but all of the external_id fields are null so I don't know how to reference them on my end | 19:15 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Stop using storyboard by Freezer https://review.opendev.org/c/openstack/project-config/+/941138 | 19:15 |
corvus | Ramereth[m]: the pending ones haven't created a cloud image yet, so you can ignore them (also, there are way too many of those right now due to a bug that should be fixed soon) | 19:16 |
Ramereth[m] | corvus: but the deleting ones don't show any external_id's either | 19:17 |
corvus | Ramereth[m]: you can ignore those too -- those are likely to have failed before we started uploading them | 19:18 |
Ramereth[m] | ah okay | 19:18 |
corvus | (in other words, if we have an external id, it should be there; otherwise, something has gone wrong that doesn't involve the target cloud) | 19:19 |
Ramereth[m] | So I updated your script to the following: | 19:25 |
Ramereth[m] | curl -q https://zuul.opendev.org/api/tenant/opendev/providers |jq -r '.[] | select(.name == "osuosl-regionone-main").images[].build_artifacts[].uploads[] | select(.state == "deleting") | select(.external_id != null) | [.state, .external_id] | @tsv' | 19:25 |
Ramereth[m] | According to this, it looks like we don't have any images that are "stuck"? | 19:25 |
corvus | Ramereth[m]: yes, i agree | 19:31 |
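As a small aside, Ramereth's filtered jq one-liner above can also be expressed in Python, which may be easier to extend. This is a minimal sketch assuming the field layout the providers endpoint returned for the queries above (providers → images → build_artifacts → uploads, each upload carrying "state" and "external_id"); skipping uploads without an external_id follows corvus's note that those never reached the target cloud.

```python
# Minimal sketch: list "deleting" uploads for one provider that have a
# cloud-side image id (external_id), i.e. candidates for being stuck.
# Assumes the JSON layout shown by the jq queries above.
import requests

URL = "https://zuul.opendev.org/api/tenant/opendev/providers"
PROVIDER = "osuosl-regionone-main"

for provider in requests.get(URL, timeout=30).json():
    if provider["name"] != PROVIDER:
        continue
    for image in provider["images"]:
        for artifact in image["build_artifacts"]:
            for upload in artifact["uploads"]:
                if upload["state"] == "deleting" and upload["external_id"]:
                    print(upload["state"], upload["external_id"])
```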
clarkb | I've approved the gitea 1.24.3 upgrade change. It usually takes about an hour to gate so figured a head start was good | 19:37 |
clarkb | can always -2 or -W it if something comes up in any additional reviews | 19:37 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM test eavesdrop running on a noble node https://review.opendev.org/c/opendev/system-config/+/955520 | 20:05 |
fungi | cool | 20:24 |
fungi | ready for when it merges | 20:25 |
opendevreview | Merged opendev/system-config master: Update to gitea 1.24.3 https://review.opendev.org/c/opendev/system-config/+/955411 | 21:02 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM test eavesdrop running on a noble node https://review.opendev.org/c/opendev/system-config/+/955520 | 21:03 |
opendevreview | Clark Boylan proposed opendev/system-config master: Switch IRC and matrix bots to log with journald rather than syslog https://review.opendev.org/c/opendev/system-config/+/955544 | 21:03 |
fungi | zuul estimates 3 minutes left on the hourly buildset | 21:05 |
fungi | after which the deploy jobs for 955411 should kick in | 21:06 |
corvus | clarkb fungi ykarel as i mentioned earlier, i think we're now at the point where we've done everything we can to make zuul-launcher work as designed, and now we need to change the design to accommodate the un-actionable errors we're getting from the cloud. i proposed this change which i think will have the desired effect: https://review.opendev.org/c/zuul/zuul/+/955545 Require multinode requests served from same provider | 21:07 |
fungi | thanks corvus! | 21:07 |
corvus | i don't want to rush that one through, so we may still have the current behavior for another day or so | 21:08 |
corvus | (while it's reviewed) | 21:08 |
corvus | the bugfix for pending uploads merged, so i'm going to restart the launchers now | 21:10 |
fungi | infra-prod-service-gitea is in progress | 21:12 |
clarkb | https://gitea09.opendev.org:3081/opendev/system-config/ is upgraded | 21:13 |
clarkb | I can git clone system-config from gitea09 as well | 21:13 |
clarkb | it seems to be working for me so far | 21:14 |
fungi | Powered by Gitea Version: v1.24.2 | 21:15 |
fungi | mmm | 21:15 |
fungi | oh, i hit a redirect | 21:15 |
clarkb | ya you have to be careful navigating the web ui; some links send you back to the haproxy | 21:16 |
fungi | okay, getting v1.24.3 now | 21:16 |
fungi | yeah, i was doing 3080/tcp instead of 3081 | 21:16 |
clarkb | 09-11 are done now | 21:16 |
clarkb | 13 is the one I'm slightly worried about | 21:16 |
fungi | looks like it should be done now? | 21:19 |
fungi | ah, still restarting | 21:20 |
clarkb | no 13 is just finishing up image downloads and doing the upgrade now | 21:20 |
fungi | yeah | 21:20 |
fungi | stopping is in progress still | 21:20 |
clarkb | this restart will be slow because I think it stays up until a timeout for existing connections, and since this is the backend with all the connections it is going to be slower | 21:21 |
fungi | https://gitea13.opendev.org:3081/ is loading for me | 21:21 |
clarkb | now 13 is done | 21:21 |
fungi | agreed | 21:21 |
clarkb | worried unnecessarily. I figured it would be ok but if any were to give us trouble it would be this one | 21:21 |
clarkb | all of them are done now. The job should finish up soon | 21:23 |
clarkb | corvus: reading the commit message on that change this seems like a reasonable approach. It maintains locality when possible but still allows you to mix k8s and openstack and ec2 or whatever if you wish in a single nodeset | 21:24 |
clarkb | I'll work on a proper review shortly | 21:24 |
clarkb | https://zuul.opendev.org/t/openstack/buildset/738b650c6e0b4de38309f62ecea974ad that is a successful gitea upgrade buildset | 21:25 |
fungi | yeah, i'm already about halfway through it, but based on the description i agree it's the best possible tradeoff | 21:25 |
clarkb | the one thing I haven't checked yet is replication | 21:25 |
corvus | clarkb: yep that's what i'm thinking | 21:25 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM test eavesdrop running on a noble node https://review.opendev.org/c/opendev/system-config/+/955520 | 21:27 |
clarkb | https://opendev.org/opendev/system-config/commit/a0fc9e3edab9e2b7c5806fe93a5386381499756f replication seems to be working | 21:27 |
corvus | since the bugfix for the leaked nodeset request locks merged and we're running that now, i have deleted what is hopefully the last batch of leaked lock znodes. | 21:33 |
corvus | the launchers look like they are more or less done deleting all the extra "pending" uploads | 21:34 |
clarkb | corvus: ok posted a couple of comments but overall I think this is good | 21:41 |
corvus | thx, replied | 21:52 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/955544 is a prep change for eventually replacing the current eavesdrop server with a new noble one. I think it is a big noop (we've tested it a bit at this point) to move from syslog to journald for container logging | 22:01 |
clarkb | ok time to put the meeting agenda together. I'll try to do niz updates and updates on gerrit and so on. Anything else that should be in there? | 22:43 |
clarkb | my edits are in. I'll send this out in ~10-15 minutes. Let me know before then if I'm missing something important | 22:56 |
corvus | my main question is "wen delete nodepool" which i imagine is covered under at least one of the existing topics :) | 23:08 |
clarkb | yup we can talk about that in the first topic tomorrow. I've just sent the agenda | 23:08 |
clarkb | oh nope I failed to send. One moment | 23:09 |
clarkb | now it should be in your inboxes | 23:10 |
corvus | email is hard | 23:11 |
fungi | also as far as wen baleet nodepull i'm good with any time now | 23:13 |
fungi | we can arrange to have a small memorial service for it if anyone wants | 23:15 |
fungi | otherwise... _press_button_ | 23:15 |
fungi | the stories about where things in the openstack sdk came from are going to get that much more amusing now when we talk about shade getting split off of a component that no longer exists at all | 23:17 |
mordred | what a fun family tree :) | 23:45 |