Thursday, 2025-07-24

*** ykarel_ is now known as ykarel09:08
*** amoralej_ is now known as amoralej12:07
priteauHello. CentOS Stream 10 is missing from Zuul mirrors. For example: https://mirror.gra1.ovh.opendev.org/centos-stream/12:28
priteauWould it be possible to add it, or should I use official mirrors?12:29
priteauSame for EPEL, we have 9 only: http://mirror.dfw3.raxflex.opendev.org/epel/12:41
fungipriteau: i think people were planning to submit changes to add it once centos-10 nodes were seeing more use. maybe we're there now and it's time to revisit that. looking at https://grafana.opendev.org/d/9871b26303/afs?orgId=1&from=now-1y%2Fy&to=now-1y%2Fy&timezone=utc&viewPanel=panel-36 we'll need to expand that volume first12:59
priteauI am working on c10s support in Kayobe. Same in kolla/kolla-ansible, they've started to run some.13:05
priteauIt's not urgent of course, we can use official mirrors in the meantime.13:05
priteauI noticed there was no Rocky Linux content at all, was that a choice given the resources available?13:06
fungipriteau: similar plan, if a lot of projects start running frequent jobs on rocky then mirroring packages for it would make sense at that point13:50
Clark[m]fungi: I'm not really at the computer for another hour or so but wondering if we should drop the AAAA record and/or do other debugging. I couldn't find a way to verify if DAD (duplicate address detection) is the issue via logging but maybe there is some way?13:58
fungii haven't had time to look into it yet, and am on a conference call for the next hour13:58
Clark[m]Re distro mirrors: they consume large quantities of space and have historically been a large time sink. In particular, upstreams make bad updates, then we have to debug and say no, it's upstream's fault. Also cleanup in a few years is often complicated. So ya, with newer less commonly used platforms I asked if we can trial things without dedicated mirror content14:00
priteauNo problem, as long as we know that we should use official mirrors instead.14:07
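For reference, whether a given mirror carries a release can be checked directly over HTTP; a minimal sketch against the mirror URLs mentioned above (the exact subdirectory names are an assumption, not confirmed paths):

    # list what the opendev mirrors currently publish
    curl -s https://mirror.gra1.ovh.opendev.org/centos-stream/ | grep -oE 'href="[^"]*"'
    curl -s https://mirror.dfw3.raxflex.opendev.org/epel/ | grep -oE 'href="[^"]*"'
    # a missing tree shows up as a 404, e.g. for a hypothetical 10-stream path
    curl -sI https://mirror.gra1.ovh.opendev.org/centos-stream/10-stream/ | head -1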
corvus2 errors in image builds:14:54
corvushttps://zuul.opendev.org/t/opendev/build/afd97251bc7740dbaa8ede9dcd573557 -- i wonder why we didn't do the retry loop on that one?14:54
corvushttps://zuul.opendev.org/t/opendev/build/d65637a8f31f420a93eff81300fcf42e hit the 503 error again14:54
corvusso if the haproxy config is in place, then maybe that didn't help?14:55
fungicorvus: could it be the ipv6 connectivity problem we've been digging into for gitea-lb02?14:57
fungii see `ip -6 ad sh ens3` is back to only returning a link-local address at the moment14:59
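For context, the check being described amounts to looking for any global-scope address on the interface; a sketch using the interface name from the message above:

    # empty output here means only the link-local fe80:: address is left
    ip -6 addr show dev ens3 scope global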
corvusfungi: not saying "no", but i believe the way clark was leaning yesterday was that the connection errors were possibly v6 related, but he was assuming the 503 was a real gitea response that made it all the way from the backend to the client.14:59
fungioh, good point14:59
clarkbcorvus: those retries occurred in under 2 seconds15:00
fungiconnection issues at least between the client and proxy should never result in an http error of any kind15:00
clarkbcorvus: I think we need the sleep 5 change in conjunction with the better health checks15:00
clarkblooks like that change has been approved15:01
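The "sleep 5 change" is presumably https://review.opendev.org/c/openstack/diskimage-builder/+/955712, which merges later in this log; the general shape of a fixed-delay retry loop is sketched below, with the command name and attempt count as placeholders rather than the actual dib code:

    # hypothetical retry wrapper: up to 3 attempts, 5 seconds apart
    for attempt in 1 2 3; do
        some-cache-update-command && break
        echo "attempt ${attempt} failed, retrying in 5s" >&2
        sleep 5
    done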
clarkbcorvus: I think the gitea healthcheck update should in theory take us from a 10-15 second window to a 2 second window, and it's possible we still hit that 2 second window here15:02
corvusclarkb: yes, for the 503 error, perhaps the sleep 5 would have helped -- but my point with that is that, if the haproxy healthz check was in place at the time, then it appears not to have solved the underlying problem that gitea generated 503s in the first place.15:02
corvusoh now i understand the significance of 2 seconds in your message ;)15:02
opendevreviewMerged opendev/system-config master: Trigger load balancer deployment jobs when their roles update  https://review.opendev.org/c/opendev/system-config/+/95573415:02
clarkbcorvus: yes I think we cannot solve that problem without either 1) terminating https in haproxy and doing inline healthchecks to immediately shut things down on the first error or 2) adding more coordination to remove nodes from haproxy before they get updated15:03
corvusi agree, it's possible the lb change reduced the window for 503s and we didn't observe that (other than the reduced failure count :)15:03
clarkbwe can also reduce the healthcheck interval to reduce that window size further15:03
corvusi like the idea of leaving the settings as-is and seeing if the 5 second change will help15:03
clarkbI suppose it is also possible the gitea healthcheck is returning 200 when it shouldn't15:03
clarkbcorvus: ++15:03
corvusyeah, something like a false 200 was my initial concern here15:04
corvusbut now i agree it's premature to suspect that15:04
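For readers following along: the haproxy change under discussion has the load balancer poll gitea's health endpoint and pull a failing backend out of rotation, which shrinks (but does not eliminate) the window in which a restarting backend can emit 503s. A rough sketch of that kind of backend stanza; the server names, ports, and exact check options here are illustrative, not the production config:

    backend balance_git_https
        mode tcp
        # poll the gitea health endpoint; drop a backend after 2 failed checks
        option httpchk GET /api/healthz
        server gitea09 gitea09.opendev.org:3081 check check-ssl verify none inter 2s fall 2 rise 2
        server gitea10 gitea10.opendev.org:3081 check check-ssl verify none inter 2s fall 2 rise 2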
fungiclarkb: reading up on DAD, it looks like duplicate addresses should be resulting in kernel log messages like "IPv6 duplicate address nnnn detected!"15:05
fungii don't see anything like that in dmesg at least15:05
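A sketch of the check being described, assuming DAD failures are logged by the kernel in the quoted form:

    # look for duplicate address detection failures in the kernel log
    dmesg -T | grep -i 'duplicate address'
    journalctl -k | grep -i 'duplicate address'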
opendevreviewJames E. Blair proposed opendev/project-config master: Use built images even if some failed  https://review.opendev.org/c/opendev/project-config/+/95579715:06
clarkbfungi: ok so probably rule that out then. One interesting datapoint is that my ping -6 from gitea-lb02 to gitea09 doesn't seem to ever fail even though the ip address has gone away according to ip addr15:06
corvusi'm honestly ambivalent about that change ^ (955797) -- consider it a topic for thought/discussion15:07
clarkbfungi: I wonder if while true; do ping -6 -c 1 ; done would have a different result as it would look up new socket info each time?15:07
clarkbit's possible a single ping process over time is just reusing working network info from when it started?15:07
clarkband from what I could see yesterday I don't think any of the backends are suffering this issue15:07
fungiyeah, it feels like something is going weird maybe on one hypervisor host or something15:08
clarkbthey are running the same platform (jammy) with docker + docker-compose and host networking. but they have different software on top of that (gitea + ssh vs haproxy)15:08
clarkbI don't think either haproxy or gitea+ssh should be affecting the network interface config15:08
clarkbI did some quick sysctl -a | grep ipv6 inspection yesterday and didn't see any obvious differences there either15:09
fungiright, that seems unlikely to me too15:09
clarkbthey are all set up to autoconf and respect DAD15:09
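For reference, the per-interface sysctls most relevant to that comparison look something like the following (ens3 from earlier in the log; which keys matter most is a judgment call):

    # router-advertisement, autoconf and DAD knobs for ens3
    sysctl net.ipv6.conf.ens3.accept_ra \
           net.ipv6.conf.ens3.autoconf \
           net.ipv6.conf.ens3.accept_dad \
           net.ipv6.conf.ens3.dad_transmits \
           net.ipv6.conf.ens3.use_tempaddr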
fungiwe could try booting an identical replacement proxy server and switching dns over, i guess15:09
clarkbya. I wonder if this is of interest to guilhermesp, mnaser, and ricolin too15:10
fungimaybe also double-check the host id hash in nova metadata to make sure we're on a different compute node15:10
corvusclarkb: i took a quick look at the ping source code, and yes, it appears it does all the socket setup once, then reuses that.15:11
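The per-iteration variant clarkb suggests would look roughly like this; gitea09 as the target comes from the earlier message, and the timeout and sleep values are arbitrary:

    # fresh ping process (and therefore fresh socket/route setup) every iteration
    while true; do
        ping -6 -c 1 -W 2 gitea09.opendev.org >/dev/null \
            || echo "$(date -u +%FT%TZ) ping to gitea09 failed"
        sleep 1
    done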
clarkbguilhermesp mnaser ricolin the TLDR is that we have one server (gitea-lb02) in vexxhost sjc1 that goes back and forth between having a valid global ipv6 address on its primary network interface. Other servers in that region (gitea09-gitea14) don't seem to exhibit the same behavior15:11
clarkbcorvus: ack I'll stop my ping then as I don't think this is giving us any new info at this point15:11
clarkb54778 packets transmitted, 54778 received, 0% packet loss just to confirm it never failed despite the networking config changing under it15:12
clarkbso ya maybe our best option at this point is to spin up a new load balancer. We can run it on noble so that we can simplify some of the code in the haproxy role (zuul-lb is already noble iirc)15:14
fungifinding some similar discussions like https://askubuntu.com/questions/1412947/how-to-fix-an-ubuntu-server-losing-its-ipv6-address-without-any-traces-inside-th15:16
fungii guess it could be a systemd regression in jammy that we just happen to be tickling on that one vm somehow15:17
clarkbfungi: interesting. More reason to boot a new noble node I guess?15:32
clarkbI went ahead and stopped my while loop checking ip addr and ip route info15:32
clarkbthis is probably user error but why isn't https://review.opendev.org/c/openstack/diskimage-builder/+/955712 showing up in the zuul gate queue?15:37
clarkboh dib's queue is called "glean" heh15:37
clarkbuser error indeed15:37
fungihttps://bbs.archlinux.org/viewtopic.php?id=140312 was another interesting similar-sounding case, but i'm not getting how the linked bug/fix would actually address the reported problem15:38
clarkbthat seems specific to arch's netcfg scripting15:41
clarkbwhich I can only assume they have replaced with systemd-networkd now15:41
fungiyeah, almost certainly15:47
fungijust noticed that particular discussion was ~13 years ago15:48
fungiso almost definitely irrelevant15:48
clarkbgitea-lb03.opendev.org is booting now on the noble image15:50
fungiah, yeah i guess it's a good opportunity to upgrade anyway15:51
clarkbit will allow us to simplify some of the haproxy role stuff since zuul-lb is already on noble15:52
clarkb(we can drop the docker compose version specifier and normalize some testing iirc)15:52
clarkbwhich I guess is the next step. Sorting out the testing15:52
opendevreviewClark Boylan proposed opendev/system-config master: Add gitea-lb03  https://review.opendev.org/c/opendev/system-config/+/95580516:07
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Add gitea-lb03 DNS records  https://review.opendev.org/c/opendev/zone-opendev.org/+/95580716:10
clarkbplease carefully review ^ I'm distracted by a meeting so may make mistakes16:11
clarkbI don't think the load balancer does anything with certs so we should be able to merge those changes in either order16:13
clarkbboth of those gitea-lb03 changes look happy. I'm actually going to pop out momentarily for a bike ride before it gets hot today17:04
clarkbBut I think we can proceed with those if they look good to others. As written it should deploy the new host alongside the old host and add dns for it but not update opendev.org dns so it doesn't change production behavior for anyone yet17:04
clarkbwe can validate that the new load balancer functions, then update DNS to point at the new server and finally clean things up17:05
clarkbthe host ids do differ too17:05
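One way to make that host id comparison, assuming the nova server names match the hostnames and cloud credentials are available:

    # hostId is a per-project hash of the compute node a server is scheduled on
    openstack server show gitea-lb02 -f value -c hostId
    openstack server show gitea-lb03 -f value -c hostId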
clarkbok popping out now. Should be back around 1900UTC17:22
fungicurrently on gitea-lb02 i cannot ping either of the two v6 gateways showing up as the default routes, 100% packet loss, and the neighbor table entries for both are perpetually marked stale18:03
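The state being described can be inspected with something like the following (interface name from earlier in the log):

    # default IPv6 routes and the neighbor entries for their next hops
    ip -6 route show default
    ip -6 neigh show dev ens3
    # STALE or FAILED entries for the router addresses mean neighbor
    # discovery is not completing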
TheJuliao/18:11
TheJuliaAre there any known issues with gerrit right now?18:11
TheJuliahttps://www.irccloud.com/pastebin/bIBRrElM/18:11
fungiTheJulia: try `git review --no-thin` and see if that helps? on rare occasions cgit and jgit disagree on the packing algorithm18:14
TheJuliaokay, that worked18:14
TheJuliathanks!18:14
fungiyw18:14
fungiTheJulia: the git-review.1 manpage explains that option... "Disable thin pushes when pushing to Gerrit. This should only be used if you are currently experiencing unpack failures due to missing trees. It should not be required in typical day to day use."18:15
opendevreviewMerged opendev/zone-opendev.org master: Add gitea-lb03 DNS records  https://review.opendev.org/c/opendev/zone-opendev.org/+/95580718:15
TheJuliaYeah, when it says to check the server log, I got a bit worried18:16
fungibasically jgit dumps a java backtrace into the gerrit error log any time that happens18:21
fungiit's a known issue, we had a bug opened against git-review for it back in 2014 and eventually added the --no-thin option as a workaround, but the bug is really between cgit and jgit which we can't control, and always passing --no-thin to git push would be a pretty significant performance regression18:23
fungiit works fine like 99.99% of the time, which makes it hard to justify as a full-time workaround, but also means it's easy to not know about the option if you've rarely or never hit the issue before18:25
Clark[m]fungi: I'm not back yet but maybe add gitea-lb02.opendev.org to the emergency file? I'm not 100% certain it will be happy with the docker compose version specifier removal18:25
Clark[m]The things that occur to you while on a bike...18:25
corvusfungi: maybe git-review could read the error message and auto-enable that for one push18:27
fungiyeah, i can't recall if we looked into that, and whether it has access to the fd where that's printed or if it's going straight to the tty from the git push process18:28
fungibut it's good to look into18:28
fungiClark[m]: done18:29
fungii think also one of the problems with implementing a more transparent workaround with an automatic --no-thin retry is coming up with a reliable reproducer to make sure the implementation actually works18:53
corvusi'd do it this way: write the code to turn on the flag based on some harmless thing that's always in the output (like "Processing changes").  make sure it actually reads the output and retries with the flag.  then switch the pattern match to the actual text, and presume it will work.  as long as that doesn't then break normal use, then we're no worse off than before, possibly better.19:01
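Roughly the shape corvus describes, expressed as a shell sketch rather than the eventual git-review (Python) implementation; the remote name, target ref, and error pattern are assumptions for illustration:

    # hypothetical wrapper: retry a failed push without thin packs when the
    # server output suggests an unpack/missing-tree failure
    out=$(git push gerrit HEAD:refs/for/master 2>&1) || {
        echo "$out"
        if echo "$out" | grep -qiE 'unpack|missing (tree|blob)'; then
            echo "retrying with --no-thin" >&2
            git push --no-thin gerrit HEAD:refs/for/master
        fi
    }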
fungiright, my main concern is being able to confirm the error message is coming through on stderr or stdout vs some higher fd that git-review doesn't capture19:03
fungiif i had a reproducer i could figure out what fd it's coming through19:03
corvusisn't the output from the gerrit server?19:04
fungithe usual joys of wrangling output from child processes19:04
fungigit is getting it from the gerrit server, yes19:04
fungigit-review may be getting it from git, or the parent terminal may be inheriting an fd from git directly19:05
fungiat the moment i don't know whether that message is passing through git-review, and even if it is, whether it's on stderr vs stdout19:06
corvuslike, i guess i'm saying, there's a 95% chance that when git-review prints "remote: Processing changes: refs: 1, new: 1" it's going through the same fd where the thin error message would be printed.19:06
fungii think it's a good guess, yes19:07
fungithe infrequency of this particular error is just such that the attempted solution may end up in several releases before we ever find out if it actually worked19:08
fungiwe can iterate on it that way, it'll just be slow on the order of years, which is probably okay19:08
clarkbok I made it back before gitea-lb03 change merged19:30
opendevreviewMerged opendev/system-config master: Add gitea-lb03  https://review.opendev.org/c/opendev/system-config/+/95580519:31
clarkbjust19:31
clarkbthe deploy job for ^ is running base without concurrent jobs. So now I wonder why we had concurrent jobs before. There must be some third job that everything waits on that also depends on -base that didn't run when other jobs started in the other buildset?19:36
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Switch opendev.org load balancers  https://review.opendev.org/c/opendev/zone-opendev.org/+/95582519:44
clarkbI just tested opendev.org through gitea-lb03 after editing /etc/hosts and it seems to work for browsing and git clone19:45
clarkbso I think we're ready for 955825 as soon as someone else concurs19:45
clarkband so far the new server still has its ipv6 address19:45
fungiyeah, that was the biggest thing, i wanted to watch the interface for a bit and see if it exhibited similar problems19:46
fungithough i guess it can't be any worse than 0219:46
clarkband ya exercising it to see if it happens might be better than waiting?19:48
clarkb(if it is something to do with workload)19:48
fungiagreed19:48
fungiin related news, 02 now has reachable entries in its neighbor table for the routers again, and has a global v6 address on the ens3 interface19:49
fungithough ping6 still reports 100% packet loss to the routers19:49
clarkbhuh so even when the address is configured the network is sometimes not usable19:50
clarkbI wonder if that is why we're automatically removing it19:50
clarkbnot so much DAD collisions but detection that v6 isn't working, so better to have things use ipv4?19:50
fungii went ahead and approved 955825 since things generally check out for me as well19:53
clarkbthanks19:53
fungiif this doesn't solve the problem, i agree next step is to at least temporarily drop the aaaa records19:54
opendevreviewMerged opendev/zone-opendev.org master: Switch opendev.org load balancers  https://review.opendev.org/c/opendev/zone-opendev.org/+/95582519:55
clarkbthat is queued up behind the rest of the deploy for 955805 so will be a minute19:56
clarkbfungi: probably a good idea to stop haproxy on gitea-lb02 about an hour after gitea-lb03 takes over? Just to avoid any confusion about where connections are going later if we still see problems?19:56
clarkbI can do that if we think that is a good idea19:57
fungiyeah, it's likely there are systems clinging to that ip address, and they may need a hard kick in the seat to realize it needs re-resolving19:58
clarkbbut I was also thinking we can leave gitea-lb02 up for a bit if we think it will aid cloud side debugging19:58
fungii'll be curious to see if the problem vanishes once the network load subsides too19:59
fungimaybe it's just a deliverability issue with some icmp6 messaging due to a flooded interface? hard to know19:59
clarkbto summarize the next steps are double check traffic largely shifts over to gitea-lb03, then an hour later shutdown services on gitea-lb02, then if things look happy reset the ttl on the dns records, land a change to pull gitea-lb02 out of inventory. Wait to see if cloud wants to debug gitea-lb02 further, delete gitea-lb02 when no longer needed20:00
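The "shutdown services" step amounts to stopping the haproxy container on the old host; a sketch, with the compose file location as an assumption about how the haproxy role lays things out:

    # on gitea-lb02
    cd /etc/haproxy-docker     # path assumed; wherever the role puts docker-compose.yaml
    docker-compose down        # or "docker compose down" depending on installed tooling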
fungisounds good to me, yep20:00
fungineighbor table entries on 02 for the routers have gone permanently stale again even after pinging them, and the global address has been dropped from ens320:02
clarkbgitea-lb03 ipv6 continues to look fine. But also no load yet20:05
clarkbthe dns job is about to complete and I already see requests coming in20:12
clarkband i see both new A and AAAA records myself now20:12
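From the client side the cutover can be confirmed with a couple of lookups (record names from the change above):

    # opendev.org should now return gitea-lb03's addresses
    dig +short opendev.org A
    dig +short opendev.org AAAA
    dig +short gitea-lb03.opendev.org A
    dig +short gitea-lb03.opendev.org AAAA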
clarkbat ~21:12 I'll stop haproxy on gitea-lb0220:13
clarkbassuming this is happy I guess20:13
clarkbI'm "impressed" with how many requests are still going to 02. THough a good chunk seem to originate from a single IP which implies software that is poorly configured20:20
fungiyes20:21
fungiprobably it pipelines multiple connections over a persistent socket session so hasn't needed to do any dns lookup20:22
fungibut once the connection gets forcibly closed it will hopefully re-resolve the name when opening a new one20:22
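One way to see which clients are still pinned to the old address is to look at established sockets on gitea-lb02's frontend ports; a sketch, with 443/80 as an assumption about which frontends matter:

    # count established connections per client address on the old LB
    ss -Htn state established '( sport = :443 or sport = :80 )' \
      | awk '{print $4}' | sed 's/:[0-9]*$//' | sort | uniq -c | sort -rn | head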
opendevreviewClark Boylan proposed opendev/system-config master: Drop gitea-lb02 from our inventory  https://review.opendev.org/c/opendev/system-config/+/95582920:24
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Reset opendev.org TTLS and drop gitea-lb02 records  https://review.opendev.org/c/opendev/zone-opendev.org/+/95583020:25
clarkbI don't know that we're ready for ^ those yet but figured I'd get them running through CI at least20:25
clarkbso far I haven't seen any ipv6 weirdness on lb03 either20:30
fungiand the global address is still missing from ens3 on lb0220:32
fungiso the drop in traffic volume hasn't magically cured it, at least not yet20:32
clarkbI'm going to stop haproxy on gitea-lb02 now21:09
clarkbthat's done. And I can still reach services21:09
clarkbat this point I think I'm happy with the state of things. ipv6 continues to work on gitea-lb03 etc. I think we can land 955829 now or we can wait and see how stable it is first. Happy either way21:17
clarkbit does look like gitea-lb02 has its ipv6 address back after stopping haproxy.21:18
clarkb#status log Replaced gitea-lb02 with gitea-lb03 a new Noble node in an effort to maintain stable ipv6 connectivity21:19
opendevstatusclarkb: finished logging21:20
clarkbipv6 addr on gitea-lb02 is gone again after shutting down haproxy21:31
clarkbso I don't think this is workload related21:31
fungiyeah, seems not21:33
opendevreviewJames E. Blair proposed opendev/zone-opendev.org master: Remove nodepool hostnames  https://review.opendev.org/c/opendev/zone-opendev.org/+/95583921:34
corvus#status log deleted nb05-nb07 and nl05-nl0821:34
clarkbI'm happy to rebase my dns updates on 955839 since I think 955839 has no reason to delay21:34
opendevstatuscorvus: finished logging21:35
corvussounds good, also, we can just rebase 839 when the gitea stuff is done, no rush.21:35
opendevreviewMerged openstack/diskimage-builder master: Add a 5 second delay between cache update retries  https://review.opendev.org/c/openstack/diskimage-builder/+/95571222:03
corvustonyb: are you done with https://zuul.opendev.org/t/openstack/autohold/0000000208 ?22:19
corvus(no rush, it's not blocking anything -- but when you are done, i'll need to do some manual cleanup of that node since there's no longer any automated process that can delete it)22:20
corvusi'm a little sad that we're losing our node counter with node ids switching from a sequence to uuids... since we got up to 41 million nodes.22:21
clarkboh good the dib change managed to land before image builds today22:30
opendevreviewMerged opendev/zone-opendev.org master: Remove nodepool hostnames  https://review.opendev.org/c/opendev/zone-opendev.org/+/95583922:37
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Reset opendev.org TTLS and drop gitea-lb02 records  https://review.opendev.org/c/opendev/zone-opendev.org/+/95583022:38
