*** ykarel_ is now known as ykarel | 09:08 | |
*** amoralej_ is now known as amoralej | 12:07 | |
priteau | Hello. CentOS Stream 10 is missing from Zuul mirrors. For example: https://mirror.gra1.ovh.opendev.org/centos-stream/ | 12:28 |
priteau | Would it be possible to add it, or should I use official mirrors? | 12:29 |
priteau | Same for EPEL, we only have 9: http://mirror.dfw3.raxflex.opendev.org/epel/ | 12:41 |
fungi | priteau: i think people were planning to submit changes to add it once centos-10 nodes were seeing more use. maybe we're there now and it's time to revisit that. looking at https://grafana.opendev.org/d/9871b26303/afs?orgId=1&from=now-1y%2Fy&to=now-1y%2Fy&timezone=utc&viewPanel=panel-36 we'll need to expand that volume first | 12:59 |
priteau | I am working on c10s support in Kayobe. Same in kolla/kolla-ansible, where they've started to run some c10s jobs. | 13:05 |
priteau | It's not urgent of course, we can use official mirrors in the meantime. | 13:05 |
priteau | I noticed there was no Rocky Linux content at all, was that a choice given the resources available? | 13:06 |
fungi | priteau: similar plan, if a lot of projects start running frequent jobs on rocky then mirroring packages for it would make sense at that point | 13:50 |
Clark[m] | fungi: I'm not really at the computer for another hour or so but wondering if we should drop the AAAA record and/or do other debugging. I couldn't find a way to verify if dad is the issue via logging but maybe there is some way? | 13:58 |
fungi | i haven't had time to look into it yet, and am on a conference call for the next hour | 13:58 |
Clark[m] | Re distro mirrors: they consume large quantities of space and have historically been a large time sink. In particular, upstreams make bad updates, then we have to debug and explain that it's the upstream's fault. Also, cleanup a few years later is often complicated. So yeah, with newer, less commonly used platforms I asked if we can trial things without dedicated mirror content | 14:00 |
priteau | No problem, as long as we know that we should use official mirrors instead. | 14:07 |
corvus | 2 errors in image builds: | 14:54 |
corvus | https://zuul.opendev.org/t/opendev/build/afd97251bc7740dbaa8ede9dcd573557 -- i wonder why we didn't do the retry loop on that one? | 14:54 |
corvus | https://zuul.opendev.org/t/opendev/build/d65637a8f31f420a93eff81300fcf42e hit the 503 error again | 14:54 |
corvus | so if the haproxy config is in place, then maybe that didn't help? | 14:55 |
fungi | corvus: could it be the ipv6 connectivity problem we've been digging into for gitea-lb02? | 14:57 |
fungi | i see `ip -6 ad sh ens3` is back to only returning a linklocal address at the moment | 14:59 |
corvus | fungi: not saying "no", but i believe the way clark was leaning yesterday was that the connection errors were possibly v6 related, but he was assuming the 503 was a real gitea response that made it all the way from the backend to the client. | 14:59 |
fungi | oh, good point | 14:59 |
clarkb | corvus: those retries occurred in under 2 seconds | 15:00 |
fungi | connection issues at least between the client and proxy should never result in an http error of any kind | 15:00 |
clarkb | corvus: I think we need the sleep 5 change in conjunction with the better health checks | 15:00 |
clarkb | looks like that change has been approved | 15:01 |
clarkb | corvus: I think the gitea healthcheck update should in theory take us from a 10-15 second window to a 2 second window, and it's possible we still hit that 2 second window here | 15:02 |
corvus | clarkb: yes, for the 503 error, perhaps the sleep 5 would have helped -- but my point with that is that, if the haproxy healthz check was in place at the time, then it appears not to have solved the underlying problem that gitea generated 503s in the first place. | 15:02 |
corvus | oh now i understand the significance of 2 seconds in your message ;) | 15:02 |
opendevreview | Merged opendev/system-config master: Trigger load balancer deployment jobs when their roles update https://review.opendev.org/c/opendev/system-config/+/955734 | 15:02 |
clarkb | corvus: yes I think we cannot solve that problem without either 1) terminating https in haproxy and doing inline healthchecks to immediately shut things down on the first error or 2) adding more coordination to remove nodes from haproxy before they get updated | 15:03 |
corvus | i agree, it's possible the lb change reduced the window for 503s and we didn't observe that (other than the reduced failure count :) | 15:03 |
clarkb | we can also reduce the healthcheck interval to reduce that window size further | 15:03 |
corvus | i like the idea of leaving the settings as-is and seeing if the 5 second change will help | 15:03 |
clarkb | I suppose it is also possible the gitea healthcheck is returning 200 when it shouldn't | 15:03 |
clarkb | corvus: ++ | 15:03 |
corvus | yeah, something like a false 200 was my initial concern here | 15:04 |
corvus | but now i agree it's premature to suspect that | 15:04 |
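A quick way to rule a false 200 in or out is to poll the backends' health endpoints directly while a restart is in progress. A minimal sketch; the host list, the 3081 https port, and the /api/healthz path are assumptions and should be matched to whatever the haproxy httpchk actually uses:

```shell
# Poll each backend's health endpoint and print the HTTP status code, to see
# whether any backend answers non-200 (or a suspicious 200) during a restart.
# Hostnames, port, and path are assumptions -- adjust to the real deployment.
for backend in gitea09 gitea10 gitea11 gitea12 gitea13 gitea14; do
  code=$(curl -ks -o /dev/null -w '%{http_code}' \
    "https://${backend}.opendev.org:3081/api/healthz")
  echo "${backend}: ${code}"
done
```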
fungi | clarkb: reading up on dad, it looks like duplicate addresses should be resulting in klog messages like "IPv6 duplicate address nnnn detected!" | 15:05 |
fungi | i don't see anything like that in dmesg at least | 15:05 |
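For reference, the checks being discussed here; a minimal sketch, assuming ens3 is the interface in question:

```shell
# Does the interface currently hold a global-scope v6 address?
ip -6 addr show dev ens3 scope global

# Kernel DAD failures show up as messages containing "duplicate address"
dmesg -T | grep -i 'duplicate address'

# journald keeps older kernel messages that may have rotated out of dmesg
journalctl -k | grep -i 'duplicate address'
```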
opendevreview | James E. Blair proposed opendev/project-config master: Use built images even if some failed https://review.opendev.org/c/opendev/project-config/+/955797 | 15:06 |
clarkb | fungi: ok so probably rule that out then. One interesting datapoint is that my ping -6 from gitea-lb02 to gitea09 doesn't seem to ever fail even though the ip address has gone away according to ip addr | 15:06 |
corvus | i'm honestly ambivalent about that change ^ (955797) -- consider it a topic for thought/discussion | 15:07 |
clarkb | fungi: I wonder if while true; do ping -6 -c 1 ; done would have a different result as it would look up new socket info each time? | 15:07 |
clarkb | it's possible a single ping process over time is just reusing working network info from when it started? | 15:07 |
clarkb | and from what I could see yesterday I don't think any of the backends are suffering this issue | 15:07 |
fungi | yeah, it feels like something is going weird maybe on one hypervisor host or something | 15:08 |
clarkb | they are running the same platform (jammy) with docker + docker-compose and host networking. but they have different software on top of that (gitea + ssh vs haproxy) | 15:08 |
clarkb | I don't think either haproxy or gitea+ssh should be affecting the network interface config | 15:08 |
clarkb | I did some quick sysctl -a | grep ipv6 inspection yesterday and didn't see any obvious differences there either | 15:09 |
fungi | right, that seems unlikely to me too | 15:09 |
clarkb | they are all set up to autoconf and respect DAD | 15:09 |
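A sketch of that comparison between the load balancer and one backend, assuming ens3 is the interface name on both hosts and ssh access is available; any difference in these sysctls would be a likely culprit:

```shell
# Compare the RA/autoconf/DAD-related sysctls on the two hosts
for host in gitea-lb02.opendev.org gitea09.opendev.org; do
  echo "== ${host}"
  ssh "${host}" sysctl \
    net.ipv6.conf.ens3.accept_ra \
    net.ipv6.conf.ens3.autoconf \
    net.ipv6.conf.ens3.accept_dad \
    net.ipv6.conf.ens3.dad_transmits
done
```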
fungi | we could try booting an identical replacement proxy server and switching dns over, i guess | 15:09 |
clarkb | ya. I wonder if this is of interest to guilhermesp, mnaser, and ricolin too | 15:10 |
fungi | maybe also double-check the host id hash in nova metadata to make sure we're on a different compute node | 15:10 |
corvus | clarkb: i took a quick look at the ping source code, and yes, it appears it does all the socket setup once, then reuses that. | 15:11 |
clarkb | guilhermesp mnaser ricolin the TLDR is that we have one server (gitea-lb02) in vexxhost sjc1 that goes back and forth between having and not having a valid global ipv6 address on its primary network interface. Other servers in that region (gitea09-gitea14) don't seem to exhibit the same behavior | 15:11 |
clarkb | corvus: ack I'll stop my ping then as I don't think this is giving us any new info at this point | 15:11 |
clarkb | 54778 packets transmitted, 54778 received, 0% packet loss just to confirm it never failed despite the networking config changing under it | 15:12 |
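A minimal sketch of the per-iteration variant suggested above, which starts a fresh ping process (and therefore fresh address/route lookup) each time instead of reusing the socket set up when a long-running ping started; the target host is a hypothetical example:

```shell
# One ping per iteration; log a timestamp whenever it fails.
while true; do
  if ! ping -6 -c 1 -W 2 gitea09.opendev.org > /dev/null; then
    echo "$(date -u +%FT%TZ) ping failed"
  fi
  sleep 1
done
```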
clarkb | so ya maybe our best option at this point is to spin up a new load balancer. We can run it on noble so that we can simplify some of the code in the haproxy role (zuul-lb is already noble iirc) | 15:14 |
fungi | finding some similar discussions like https://askubuntu.com/questions/1412947/how-to-fix-an-ubuntu-server-losing-its-ipv6-address-without-any-traces-inside-th | 15:16 |
fungi | i guess it could be a systemd regression in jammy that we just happen to be tickling on that one vm somehow | 15:17 |
clarkb | fungi: interesting. More reason to boot a new noble node I guess? | 15:32 |
clarkb | I went ahead and stopped my while loop checking ip addr and ip route info | 15:32 |
clarkb | this is probably user error but why isn't https://review.opendev.org/c/openstack/diskimage-builder/+/955712 showing up in the zuul gate queue? | 15:37 |
clarkb | oh dibs queue is called "glean" heh | 15:37 |
clarkb | user error indeed | 15:37 |
fungi | https://bbs.archlinux.org/viewtopic.php?id=140312 was another interesting similar-sounding case, but i'm not getting how the linked bug/fix would actually address the reported problem | 15:38 |
clarkb | that seems specific to arch's netcfg scripting | 15:41 |
clarkb | which I can only assume they have replaced wtih systemd-networkd now | 15:41 |
fungi | yeah, almost certainly | 15:47 |
fungi | just noticed that particular discussion was ~13 years ago | 15:48 |
fungi | so almost definitely irrelevant | 15:48 |
clarkb | gitea-lb03.opendev.org is booting now on the noble image | 15:50 |
fungi | ah, yeah i guess it's a good opportunity to upgrade anyway | 15:51 |
clarkb | it will allow us to simplify some of the haproxy role stuff since zuul-lb is already on noble | 15:52 |
clarkb | (we can drop the docker compose version specifier and normalize some testing iirc) | 15:52 |
clarkb | which I guess is the next step. Sorting out the testing | 15:52 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add gitea-lb03 https://review.opendev.org/c/opendev/system-config/+/955805 | 16:07 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add gitea-lb03 DNS records https://review.opendev.org/c/opendev/zone-opendev.org/+/955807 | 16:10 |
clarkb | please carefully review ^ I'm distracted by a meeting so may make mistakes | 16:11 |
clarkb | I don't think the load balancer does anything with certs so we should be able to merge those changes in either order | 16:13 |
clarkb | both of those gitea-lb03 changes look happy. I'm actually going to pop out momentarily for a bike ride before it gets hot today | 17:04 |
clarkb | But I think we can proceed with those if they look good to others. As written it should deploy the new host alongside the old host and add dns for it but not update opendev.org dns, so it doesn't change production behavior for anyone yet | 17:04 |
clarkb | we can validate the new load balancer functions, then update DNS to point at the new server and finally clean things up | 17:05 |
clarkb | the host ids do differ too | 17:05 |
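For reference, the host-id comparison fungi suggested earlier can be done from the CLI; a minimal sketch, assuming credentials for the vexxhost sjc1 project are loaded and that this client version exposes the field as hostId:

```shell
# Compare nova host id hashes to confirm the two servers are not on the
# same compute node. Field name may differ across client versions.
openstack server show gitea-lb02 -f value -c hostId
openstack server show gitea-lb03 -f value -c hostId
```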
clarkb | ok popping out now. Should be back around 1900UTC | 17:22 |
fungi | currently on gitea-lb02 i cannot ping either of the two v6 gateways showing up as the default routes, 100% packet loss, and the neighbor table entries for both are perpetually marked stale | 18:03 |
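A minimal sketch of the checks behind that observation, assuming ens3 is the interface and substituting the real gateway address from the route output:

```shell
# Show the v6 default routes and the neighbor-table state for the gateways
ip -6 route show default
ip -6 neigh show dev ens3

# Try to reach a gateway directly over the interface
# (fe80::1 is a placeholder -- use the address from the route output above)
ping -6 -c 3 -I ens3 fe80::1
```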
TheJulia | o/ | 18:11 |
TheJulia | Are there any known issues with gerrit right now? | 18:11 |
TheJulia | https://www.irccloud.com/pastebin/bIBRrElM/ | 18:11 |
fungi | TheJulia: try `git review --no-thin` and see if that helps? on rare occasions cgit and jgit disagree on the packing algorithm | 18:14 |
TheJulia | okay, that worked | 18:14 |
TheJulia | thanks! | 18:14 |
fungi | yw | 18:14 |
fungi | TheJulia: the git-review.1 manpage explains that option... "Disable thin pushes when pushing to Gerrit. This should only be used if you are currently experiencing unpack failures due to missing trees. It should not be required in typical day to day use." | 18:15 |
opendevreview | Merged opendev/zone-opendev.org master: Add gitea-lb03 DNS records https://review.opendev.org/c/opendev/zone-opendev.org/+/955807 | 18:15 |
TheJulia | Yeah, when it says to check the server log, I got a bit worried | 18:16 |
fungi | basically jgit dumps a java backtrace into the gerrit error log any time that happens | 18:21 |
fungi | it's a known issue, we had a bug opened against git-review for it back in 2014 and eventually added the --no-thin option as a workaround, but the bug is really between cgit and jgit which we can't control, and always passing --no-thin to git push would be a pretty significant performance regression | 18:23 |
fungi | it works fine like 99.99% of the time, which makes it hard to justify as a full-time workaround, but also means it's easy to not know about the option if you've rarely or never hit the issue before | 18:25 |
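For anyone hitting the same unpack failure, the workaround is a one-off flag; a minimal sketch, assuming git-review's usual remote name of gerrit and a master target branch for the plain-git form:

```shell
# One-off workaround when a push fails with "unpack error ... missing tree"
git review --no-thin

# Equivalent plain-git form, for reference
git push --no-thin gerrit HEAD:refs/for/master
```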
Clark[m] | fungi: I'm not back yet but maybe add gitea-lb02.opendev.org to the emergency file? I'm not 100% certain it will be happy with the docker compose version specifier removal | 18:25 |
Clark[m] | The things that occur to you while on a bike... | 18:25 |
corvus | fungi: maybe git-review could read the error message and auto-enable that for one push | 18:27 |
fungi | yeah, i can't recall if we looked into that, and whether it has access to the fd where that's printed or if it's going straight to the tty from the git push process | 18:28 |
fungi | but it's good to look into | 18:28 |
fungi | Clark[m]: done | 18:29 |
fungi | i think also one of the problems with implementing a more transparent workaround with an automatic --no-thin retry is coming up with a reliable reproducer to make sure the implementation actually works | 18:53 |
corvus | i'd do it this way: write the code to turn on the flag based on some harmless thing that's always in the output (like "Processing changes"). make sure it actually reads the output and retries with the flag. then switch the pattern match to the actual text, and presume it will work. as long as that doesn't then break normal use, then we're no worse off than before, possibly better. | 19:01 |
fungi | right, my main concern is being able to confirm the error message is coming through on stderr or stdout vs some higher fd that git-review doesn't capture | 19:03 |
fungi | if i had a reproducer i could figure out what fd it's coming through | 19:03 |
corvus | isn't the output from the gerrit server? | 19:04 |
fungi | the usual joys of wrangling output from child processes | 19:04 |
fungi | git is getting it from the gerrit server, yes | 19:04 |
fungi | git-review may be getting it from git, or the parent terminal may be inheriting an fd from git directly | 19:05 |
fungi | at the moment i don't know whether that message is passing through git-review, and even if it is, whether it's on stderr vs stdout | 19:06 |
corvus | like, i guess i'm saying, there's a 95% chance that when git-review prints "remote: Processing changes: refs: 1, new: 1" it's going through the same fd where the thin error message would be printed. | 19:06 |
fungi | i think it's a good guess, yes | 19:07 |
fungi | the infrequency of this particular error is just such that the attempted solution may end up in several releases before we ever find out if it actually worked | 19:08 |
fungi | we can iterate on it that way, it'll just be slow on the order of years, which is probably okay | 19:08 |
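As a rough illustration of corvus' idea (the real change would live inside git-review itself, in Python), a shell wrapper that retries once with --no-thin when the push output looks like the unpack/missing-tree failure; the grep pattern is an assumption about the error text:

```shell
#!/bin/sh
# Capture both stdout and stderr from the push, then retry once with
# --no-thin if the tell-tale error text shows up in the output.
out=$(git review "$@" 2>&1)
status=$?
printf '%s\n' "$out"
if [ $status -ne 0 ] && printf '%s' "$out" | grep -qi 'missing tree\|unpack'; then
    echo 'Push failed with an unpack/missing-tree error; retrying with --no-thin'
    git review --no-thin "$@"
fi
```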
clarkb | ok I made it back before gitea-lb03 change merged | 19:30 |
opendevreview | Merged opendev/system-config master: Add gitea-lb03 https://review.opendev.org/c/opendev/system-config/+/955805 | 19:31 |
clarkb | just | 19:31 |
clarkb | the deploy job for ^ is running base without concurrent jobs. So now I wonder why we had concurrent jobs before. There must be some third job that everything waits on that also depends on -base that didn't run when other jobs started in the other buildset? | 19:36 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Switch opendev.org load balancers https://review.opendev.org/c/opendev/zone-opendev.org/+/955825 | 19:44 |
clarkb | I just tested opendev.org through gitea-lb03 after editing /etc/hosts and it seems to work for browsing and git clone | 19:45 |
clarkb | so I think we're ready for 955825 as soon as someone else concurs | 19:45 |
clarkb | and so far the new server still has its ipv6 address | 19:45 |
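A way to spot-check the new load balancer over http without touching /etc/hosts; a sketch where NEW_LB_IP is a placeholder for gitea-lb03's address:

```shell
# Force resolution of opendev.org to the new load balancer for this request
# only, keeping the normal Host header and SNI.
curl --resolve opendev.org:443:NEW_LB_IP -sSI https://opendev.org/ | head -n 3

# git has no --resolve equivalent, so the clone test still needs a temporary
# /etc/hosts entry (as was done here):
#   NEW_LB_IP opendev.org
```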
fungi | yeah, that was the biggest thing, i wanted to watch the interface for a bit and see if it exhibited similar problems | 19:46 |
fungi | though i guess it can't be any worse than 02 | 19:46 |
clarkb | and ya exercising it to see if it happens might be better than waiting? | 19:48 |
clarkb | (if it is something to do with workload) | 19:48 |
fungi | agreed | 19:48 |
fungi | in related news, 02 now has reachable entries in its neighbor table for the routers again, and has a global v6 address on the ens3 interface | 19:49 |
fungi | though ping6 still reports 100% packet loss to those | 19:49 |
fungi | to the routers i mean | 19:49 |
clarkb | huh, so even when the address is configured the network is sometimes not usable | 19:50 |
clarkb | I wonder if that is why it's being automatically removed | 19:50 |
clarkb | not so much DAD collisions as detection that v6 isn't working, so it's better to have things use ipv4? | 19:50 |
fungi | i went ahead and approved 955825 since things generally check out for me as well | 19:53 |
clarkb | thanks | 19:53 |
fungi | if this doesn't solve the problem, i agree next step is to at least temporarily drop the aaaa records | 19:54 |
opendevreview | Merged opendev/zone-opendev.org master: Switch opendev.org load balancers https://review.opendev.org/c/opendev/zone-opendev.org/+/955825 | 19:55 |
clarkb | that is queued up behind the rest of the deploy for 955805 so will be a minute | 19:56 |
clarkb | fungi: probably a good idea to stop haproxy on gitea-lb02 about an hour after gitea-lb03 takes over? Just to avoid any confusion about where connections are going later if we still see problems? | 19:56 |
clarkb | I can do that if we think that is a good idea | 19:57 |
fungi | yeah, it's likely there are systems clinging to that ip address, and they may need a hard kick in the seat to realize it needs re-resolving | 19:58 |
clarkb | but I was also thinking we can leave gitea-lb02 up for a bit if we think it will aid cloud side debugging | 19:58 |
fungi | i'll be curious to see if the problem vanishes once the network load subsides too | 19:59 |
fungi | maybe it's just a deliverability issue with some icmp6 messaging due to a flooded interface? hard to know | 19:59 |
clarkb | to summarize the next steps are double check traffic largely shifts over to gitea-lb03, then an hour later shutdown services on gitea-lb02, then if things look happy reset the ttl on the dns records, land a change to pull gitea-lb02 out of inventory. Wait to see if cloud wants to debug gitea-lb02 further, delete gitea-lb02 when no longer needed | 20:00 |
fungi | sounds good to me, yep | 20:00 |
fungi | neighbor table entries on 02 for the routers have gone permanently stale again even after pinging them, and the global address has been dropped from ens3 | 20:02 |
clarkb | gitea-lb03 ipv6 continues to look fine. But also no load yet | 20:05 |
clarkb | the dns job is about to complete and I already see requests coming in | 20:12 |
clarkb | and i see both new A and AAAA records myself now | 20:12 |
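A quick way to confirm what resolvers should now be seeing; a sketch querying an authoritative server directly (ns1.opendev.org is assumed to be one of them):

```shell
# The A/AAAA answers should now point at gitea-lb03 and carry the
# temporarily lowered TTL.
dig +noall +answer opendev.org A @ns1.opendev.org
dig +noall +answer opendev.org AAAA @ns1.opendev.org
```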
clarkb | at ~21:12 I'll stop haproxy on gitea-lb02 | 20:13 |
clarkb | assuming this is happy I guess | 20:13 |
clarkb | I'm "impressed" with how many requests are still going to 02. Though a good chunk seem to originate from a single IP, which implies software that is poorly configured | 20:20 |
fungi | yes | 20:21 |
fungi | probably it pipelines multiple connections over a persistent socket session so hasn't needed to do any dns lookup | 20:22 |
fungi | but once the connection gets forcibly closed it will hopefully re-resolve the name when opening a new one | 20:22 |
opendevreview | Clark Boylan proposed opendev/system-config master: Drop gitea-lb02 from our inventory https://review.opendev.org/c/opendev/system-config/+/955829 | 20:24 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Reset opendev.org TTLS and drop gitea-lb02 records https://review.opendev.org/c/opendev/zone-opendev.org/+/955830 | 20:25 |
clarkb | I don't know that we're ready for ^ those yet but figured I'd get them running through CI at least | 20:25 |
clarkb | so far I haven't seen any ipv6 weirdness on lb03 either | 20:30 |
fungi | and the global address is still missing from ens3 on lb02 | 20:32 |
fungi | so the drop in traffic volume hasn't magically cured it, at least not yet | 20:32 |
clarkb | I'm going to stop haproxy on gitea-lb02 now | 21:09 |
clarkb | that's done. And I can still reach services | 21:09 |
clarkb | at this point I think I'm happy with the state of things. ipv6 continues to work on gitea-lb03 etc. I think we can land 955829 now or we can wait and see how stable it is first. Happy either way | 21:17 |
clarkb | it does look like gitea-lb02 has its ipv6 address back after stopping haproxy. | 21:18 |
clarkb | #status log Replaced gitea-lb02 with gitea-lb03 a new Noble node in an effort to maintain stable ipv6 connectivity | 21:19 |
opendevstatus | clarkb: finished logging | 21:20 |
clarkb | ipv6 addr on gitea-lb02 is gone again after shutting down haproxy | 21:31 |
clarkb | so I don't think this is workload related | 21:31 |
fungi | yeah, seems not | 21:33 |
opendevreview | James E. Blair proposed opendev/zone-opendev.org master: Remove nodepool hostnames https://review.opendev.org/c/opendev/zone-opendev.org/+/955839 | 21:34 |
corvus | #status log deleted nb05-nb07 and nl05-nl08 | 21:34 |
clarkb | I'm happy to rebase my dns updates on 955839 since I think 955839 has no reason to delay | 21:34 |
opendevstatus | corvus: finished logging | 21:35 |
corvus | sounds good, also, we can just rebase 839 when the gitea stuff is done, no rush. | 21:35 |
opendevreview | Merged openstack/diskimage-builder master: Add a 5 second delay between cache update retries https://review.opendev.org/c/openstack/diskimage-builder/+/955712 | 22:03 |
corvus | tonyb: are you done with https://zuul.opendev.org/t/openstack/autohold/0000000208 ? | 22:19 |
corvus | (no rush, it's not blocking anything -- but when you are done, i'll need to do some manual cleanup of that node since there's no longer any automated process that can delete it) | 22:20 |
corvus | i'm a little sad that we're losing our node counter with node ids switching from a sequence to uuids... since we got up to 41 million nodes. | 22:21 |
clarkb | oh good the dib change managed to land before image builds today | 22:30 |
opendevreview | Merged opendev/zone-opendev.org master: Remove nodepool hostnames https://review.opendev.org/c/opendev/zone-opendev.org/+/955839 | 22:37 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Reset opendev.org TTLS and drop gitea-lb02 records https://review.opendev.org/c/opendev/zone-opendev.org/+/955830 | 22:38 |