*** ykarel_ is now known as ykarel | 09:08 | |
*** amoralej_ is now known as amoralej | 12:07 | |
priteau | Hello. CentOS Stream 10 is missing from Zuul mirrors. For example: https://mirror.gra1.ovh.opendev.org/centos-stream/ | 12:28 |
priteau | Would it be possible to add it, or should I use official mirrors? | 12:29 |
priteau | Same for EPEL, we only have 9: http://mirror.dfw3.raxflex.opendev.org/epel/ | 12:41 |
fungi | priteau: i think people were planning to submit changes to add it once centos-10 nodes were seeing more use. maybe we're there now and it's time to revisit that. looking at https://grafana.opendev.org/d/9871b26303/afs?orgId=1&from=now-1y%2Fy&to=now-1y%2Fy&timezone=utc&viewPanel=panel-36 we'll need to expand that volume first | 12:59 |
priteau | I am working on c10s support in Kayobe. Same in kolla/kolla-ansible, where they've started to run some c10s jobs. | 13:05 |
priteau | It's not urgent of course, we can use official mirrors in the meantime. | 13:05 |
priteau | I noticed there was no Rocky Linux content at all, was that a choice given the resources available? | 13:06 |
fungi | priteau: similar plan, if a lot of projects start running frequent jobs on rocky then mirroring packages for it would make sense at that point | 13:50 |
Clark[m] | fungi: I'm not really at the computer for another hour or so but wondering if we should drop the AAAA record and/or do other debugging. I couldn't find a way to verify if dad is the issue via logging but maybe there is some way? | 13:58 |
fungi | i haven't had time to look into it yet, and am on a conference call for the next hour | 13:58 |
Clark[m] | Re distro mirrors: they consume large quantities of space and have historically been a large time sink. In particular, upstreams make bad updates, then we have to debug and explain that it's the upstream's fault. Also, cleanup a few years later is often complicated. So yeah, with newer, less commonly used platforms I asked if we can trial things without dedicated mirror content | 14:00 |
priteau | No problem, as long as we know that we should use official mirrors instead. | 14:07 |
corvus | 2 errors in image builds: | 14:54 |
corvus | https://zuul.opendev.org/t/opendev/build/afd97251bc7740dbaa8ede9dcd573557 -- i wonder why we didn't do the retry loop on that one? | 14:54 |
corvus | https://zuul.opendev.org/t/opendev/build/d65637a8f31f420a93eff81300fcf42e hit the 503 error again | 14:54 |
corvus | so if the haproxy config is in place, then maybe that didn't help? | 14:55 |
fungi | corvus: could it be the ipv6 connectivity problem we've been digging into for gitea-lb02? | 14:57 |
fungi | i see `ip -6 ad sh ens3` is back to only returning a linklocal address at the moment | 14:59 |
corvus | fungi: not saying "no", but i believe the way clark was leaning yesterday was that the connection errors were possibly v6 related, but he was assuming the 503 was a real gitea response that made it all the way from the backend to the client. | 14:59 |
fungi | oh, good point | 14:59 |
clarkb | corvus: those retries occurred in under 2 seconds | 15:00 |
fungi | connection issues at least between the client and proxy should never result in an http error of any kind | 15:00 |
clarkb | corvus: I think we need the sleep 5 change in conjunction with the better health checks | 15:00 |
clarkb | looks like that change has been approved | 15:01 |
clarkb | corvus: I think the gitea healthcheck update should in theory take us from a 10-15 second window to a 2 second window, and it's possible we still hit that 2 second window here | 15:02 |
corvus | clarkb: yes, for the 503 error, perhaps the sleep 5 would have helped -- but my point with that is that, if the haproxy healthz check was in place at the time, then it appears not to have solved the underlying problem that gitea generated 503s in the first place. | 15:02 |
corvus | oh now i understand the significance of 2 seconds in your message ;) | 15:02 |
opendevreview | Merged opendev/system-config master: Trigger load balancer deployment jobs when their roles update https://review.opendev.org/c/opendev/system-config/+/955734 | 15:02 |
clarkb | corvus: yes I think we cannot solve that problem without either 1) terminating https in haproxy and doing inline healthchecks to immediately shut things down on the first error or 2) adding more coordination to remove nodes from haproxy before they get updated | 15:03 |
corvus | i agree, it's possible the lb change reduced the window for 503s and we didn't observe that (other than the reduced failure count :) | 15:03 |
clarkb | we can also reduce the healthcheck interval to reduce that window size further | 15:03 |
corvus | i like the idea of leaving the settings as-is and seeing if the 5 second change will help | 15:03 |
clarkb | I suppose it is also possible the gitea healthcheck is returning 200 when it shouldn't | 15:03 |
clarkb | corvus: ++ | 15:03 |
corvus | yeah, something like a false 200 was my initial concern here | 15:04 |
corvus | but now i agree it's premature to suspect that | 15:04 |
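A quick way to rule a false 200 in or out is to poll the backends' health endpoints directly while a restart is in progress. A minimal sketch; the host list, the 3081 https port, and the /api/healthz path are assumptions and should be matched to whatever the haproxy httpchk actually uses:

```shell
# Poll each backend's health endpoint and print the HTTP status code, to see
# whether any backend answers non-200 (or a suspicious 200) during a restart.
# Hostnames, port, and path are assumptions -- adjust to the real deployment.
for backend in gitea09 gitea10 gitea11 gitea12 gitea13 gitea14; do
  code=$(curl -ks -o /dev/null -w '%{http_code}' \
    "https://${backend}.opendev.org:3081/api/healthz")
  echo "${backend}: ${code}"
done
```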
fungi | clarkb: reading up on dad, it looks like duplicate addresses should be resulting in klog messages like "IPv6 duplicate address nnnn detected!" | 15:05 |
fungi | i don't see anything like that in dmesg at least | 15:05 |
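For reference, the checks being discussed here; a minimal sketch, assuming ens3 is the interface in question:

```shell
# Does the interface currently hold a global-scope v6 address?
ip -6 addr show dev ens3 scope global

# Kernel DAD failures show up as messages containing "duplicate address"
dmesg -T | grep -i 'duplicate address'

# journald keeps older kernel messages that may have rotated out of dmesg
journalctl -k | grep -i 'duplicate address'
```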
opendevreview | James E. Blair proposed opendev/project-config master: Use built images even if some failed https://review.opendev.org/c/opendev/project-config/+/955797 | 15:06 |
clarkb | fungi: ok so probably rule that out then. One interesting datapoint is that my ping -6 from gitea-lb02 to gitea09 doesn't seem to ever fail even though the ip address has gone away according to ip addr | 15:06 |
corvus | i'm honestly ambivalent about that change ^ (955797) -- consider it a topic for thought/discussion | 15:07 |
clarkb | fungi: I wonder if while true; do ping -6 -c 1 ; done would have a different result as it would look up new socket info each time? | 15:07 |
clarkb | it's possible a single ping process over time is just reusing working network info from when it started? | 15:07 |
clarkb | and from what I could see yesterday I don't think any of the backends are suffering this issue | 15:07 |
fungi | yeah, it feels like something is going weird maybe on one hypervisor host or something | 15:08 |
clarkb | they are running the same platform (jammy) with docker + docker-compose and host networking. but they have different software on top of that (gitea + ssh vs haproxy) | 15:08 |
clarkb | I don't think either haproxy or gitea+ssh should be affecting the network interface config | 15:08 |
clarkb | I did some quick sysctl -a | grep ipv6 inspection yesterday and didn't see any obvious differences there either | 15:09 |
fungi | right, that seems unlikely to me too | 15:09 |
clarkb | they are all set up to autoconf and respect DAD | 15:09 |
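A sketch of that comparison between the load balancer and one backend, assuming ens3 is the interface name on both hosts and ssh access is available; any difference in these sysctls would be a likely culprit:

```shell
# Compare the RA/autoconf/DAD-related sysctls on the two hosts
for host in gitea-lb02.opendev.org gitea09.opendev.org; do
  echo "== ${host}"
  ssh "${host}" sysctl \
    net.ipv6.conf.ens3.accept_ra \
    net.ipv6.conf.ens3.autoconf \
    net.ipv6.conf.ens3.accept_dad \
    net.ipv6.conf.ens3.dad_transmits
done
```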
fungi | we could try booting an identical replacement proxy server and switching dns over, i guess | 15:09 |
clarkb | ya. I wonder if this is of interest to guilhermesp, mnaser, and ricolin too | 15:10 |
fungi | maybe also double-check the host id hash in nova metadata to make sure we're on a different compute node | 15:10 |
corvus | clarkb: i took a quick look at the ping source code, and yes, it appears it does all the socket setup once, then reuses that. | 15:11 |
clarkb | guilhermesp mnaser ricolin the TLDR is that we have one server (gitea-lb02) in vexxhost sjc1 that goes back and forth between having and not having a valid global ipv6 address on its primary network interface. Other servers in that region (gitea09-gitea14) don't seem to exhibit the same behavior | 15:11 |
clarkb | corvus: ack I'll stop my ping then as I don't think this is giving us any new info at this point | 15:11 |
clarkb | 54778 packets transmitted, 54778 received, 0% packet loss just to confirm it never failed despite the networking config changing under it | 15:12 |
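A minimal sketch of the per-iteration variant suggested above, which starts a fresh ping process (and therefore fresh address/route lookup) each time instead of reusing the socket set up when a long-running ping started; the target host is a hypothetical example:

```shell
# One ping per iteration; log a timestamp whenever it fails.
while true; do
  if ! ping -6 -c 1 -W 2 gitea09.opendev.org > /dev/null; then
    echo "$(date -u +%FT%TZ) ping failed"
  fi
  sleep 1
done
```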
clarkb | so ya maybe our best option at this point is to spin up a new load balancer. We can run it on noble so that we can simplify some of the code in the haproxy role (zuul-lb is already noble iirc) | 15:14 |
fungi | finding some similar discussions like https://askubuntu.com/questions/1412947/how-to-fix-an-ubuntu-server-losing-its-ipv6-address-without-any-traces-inside-th | 15:16 |
fungi | i guess it could be a systemd regression in jammy that we just happen to be tickling on that one vm somehow | 15:17 |
clarkb | fungi: interesting. More reason to boot a new noble node I guess? | 15:32 |
clarkb | I went ahead and stopped my while loop checking ip addr and ip route info | 15:32 |
clarkb | this is probably user error but why isn't https://review.opendev.org/c/openstack/diskimage-builder/+/955712 showing up in the zuul gate queue? | 15:37 |
clarkb | oh dibs queue is called "glean" heh | 15:37 |
clarkb | user error indeed | 15:37 |
fungi | https://bbs.archlinux.org/viewtopic.php?id=140312 was another interesting similar-sounding case, but i'm not getting how the linked bug/fix would actually address the reported problem | 15:38 |
clarkb | that seems specific to arch's netcfg scripting | 15:41 |
clarkb | which I can only assume they have replaced wtih systemd-networkd now | 15:41 |
fungi | yeah, almost certainly | 15:47 |
fungi | just noticed that particular discussion was ~13 years ago | 15:48 |
fungi | so almost definitely irrelevant | 15:48 |
clarkb | gitea-lb03.opendev.org is booting now on the noble image | 15:50 |
fungi | ah, yeah i guess it's a good opportunity to upgrade anyway | 15:51 |
clarkb | it will allow us to simplify some of the haproxy role stuff since zuul-lb is already on noble | 15:52 |
clarkb | (we can drop the docker compose version specifier and normalize some testing iirc) | 15:52 |
clarkb | which I guess is the next step. Sorting out the testing | 15:52 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add gitea-lb03 https://review.opendev.org/c/opendev/system-config/+/955805 | 16:07 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add gitea-lb03 DNS records https://review.opendev.org/c/opendev/zone-opendev.org/+/955807 | 16:10 |
clarkb | please carefully review ^ I'm distracted by a meeting so may make mistakes | 16:11 |
clarkb | I don't think the load balancer does anything with certs so we should be able to merge those changes in either order | 16:13 |
clarkb | both of those gitea-lb03 changes look happy. I'm actually going to pop out momentarily for a bike ride before it gets hot today | 17:04 |
clarkb | But I think we can proceed with those if they look good to others. As written it should deploy the new host alongside the old host and add dns for it but not update opendev.org dns, so it doesn't change production behavior for anyone yet | 17:04 |
clarkb | we can validate the new load balancer functions, then update DNS to point at the new server and finally clean things up | 17:05 |
clarkb | the host ids do differ too | 17:05 |
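For reference, the host-id comparison fungi suggested earlier can be done from the CLI; a minimal sketch, assuming credentials for the vexxhost sjc1 project are loaded and that this client version exposes the field as hostId:

```shell
# Compare nova host id hashes to confirm the two servers are not on the
# same compute node. Field name may differ across client versions.
openstack server show gitea-lb02 -f value -c hostId
openstack server show gitea-lb03 -f value -c hostId
```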
clarkb | ok popping out now. Should be back around 1900UTC | 17:22 |
fungi | currently on gitea-lb02 i cannot ping either of the two v6 gateways showing up as the default routes, 100% packet loss, and the neighbor table entries for both are perpetually marked stale | 18:03 |
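A minimal sketch of the checks behind that observation, assuming ens3 is the interface and substituting the real gateway address from the route output:

```shell
# Show the v6 default routes and the neighbor-table state for the gateways
ip -6 route show default
ip -6 neigh show dev ens3

# Try to reach a gateway directly over the interface
# (fe80::1 is a placeholder -- use the address from the route output above)
ping -6 -c 3 -I ens3 fe80::1
```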
TheJulia | o/ | 18:11 |
TheJulia | Are there any known issues with gerrit right now? | 18:11 |
TheJulia | https://www.irccloud.com/pastebin/bIBRrElM/ | 18:11 |
fungi | TheJulia: try `git review --no-thin` and see if that helps? on rare occasions cgit and jgit disagree on the packing algorithm | 18:14 |
TheJulia | okay, that worked | 18:14 |
TheJulia | thanks! | 18:14 |
fungi | yw | 18:14 |
fungi | TheJulia: the git-review.1 manpage explains that option... "Disable thin pushes when pushing to Gerrit. This should only be used if you are currently experiencing unpack failures due to missing trees. It should not be required in typical day to day use." | 18:15 |
opendevreview | Merged opendev/zone-opendev.org master: Add gitea-lb03 DNS records https://review.opendev.org/c/opendev/zone-opendev.org/+/955807 | 18:15 |
TheJulia | Yeah, when it says to check the server log, I got a bit worried | 18:16 |
fungi | basically jgit dumps a java backtrace into the gerrit error log any time that happens | 18:21 |
fungi | it's a known issue, we had a bug opened against git-review for it back in 2014 and eventually added the --no-thin option as a workaround, but the bug is really between cgit and jgit which we can't control, and always passing --no-thin to git push would be a pretty significant performance regression | 18:23 |
fungi | it works fine like 99.99% of the time, which makes it hard to justify as a full-time workaround, but also means it's easy to not know about the option if you've rarely or never hit the issue before | 18:25 |
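For anyone hitting the same unpack failure, the workaround is a one-off flag; a minimal sketch, assuming git-review's usual remote name of gerrit and a master target branch for the plain-git form:

```shell
# One-off workaround when a push fails with "unpack error ... missing tree"
git review --no-thin

# Equivalent plain-git form, for reference
git push --no-thin gerrit HEAD:refs/for/master
```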
Clark[m] | fungi: I'm not back yet but maybe add gitea-lb02.opendev.org to the emergency file? I'm not 100% certain it will be happy with the docker compose version specifier removal | 18:25 |
Clark[m] | The things that occur to you while on a bike... | 18:25 |
corvus | fungi: maybe git-review could read the error message and auto-enable that for one push | 18:27 |
fungi | yeah, i can't recall if we looked into that, and whether it has access to the fd where that's printed or if it's going straight to the tty from the git push process | 18:28 |
fungi | but it's good to look into | 18:28 |
fungi | Clark[m]: done | 18:29 |
fungi | i think also one of the problems with implementing a more transparent workaround with an automatic --no-thin retry is coming up with a reliable reproducer to make sure the implementation actually works | 18:53 |
corvus | i'd do it this way: write the code to turn on the flag based on some harmless thing that's always in the output (like "Processing changes"). make sure it actually reads the output and retries with the flag. then switch the pattern match to the actual text, and presume it will work. as long as that doesn't then break normal use, then we're no worse off than before, possibly better. | 19:01 |
fungi | right, my main concern is being able to confirm the error message is coming through on stderr or stdout vs some higher fd that git-review doesn't capture | 19:03 |
fungi | if i had a reproducer i could figure out what fd it's coming through | 19:03 |
corvus | isn't the output from the gerrit server? | 19:04 |
fungi | the usual joys of wrangling output from child processes | 19:04 |
fungi | git is getting it from the gerrit server, yes | 19:04 |
fungi | git-review may be getting it from git, or the parent terminal may be inheriting an fd from git directly | 19:05 |
fungi | at the moment i don't know whether that message is passing through git-review, and even if it is, whether it's on stderr vs stdout | 19:06 |
corvus | like, i guess i'm saying, there's a 95% chance that when git-review prints "remote: Processing changes: refs: 1, new: 1" it's going through the same fd where the thin error message would be printed. | 19:06 |
fungi | i think it's a good guess, yes | 19:07 |
fungi | the infrequency of this particular error is just such that the attempted solution may end up in several releases before we ever find out if it actually worked | 19:08 |
fungi | we can iterate on it that way, it'll just be slow on the order of years, which is probably okay | 19:08 |
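As a rough illustration of corvus' idea (the real change would live inside git-review itself, in Python), a shell wrapper that retries once with --no-thin when the push output looks like the unpack/missing-tree failure; the grep pattern is an assumption about the error text:

```shell
#!/bin/sh
# Capture both stdout and stderr from the push, then retry once with
# --no-thin if the tell-tale error text shows up in the output.
out=$(git review "$@" 2>&1)
status=$?
printf '%s\n' "$out"
if [ $status -ne 0 ] && printf '%s' "$out" | grep -qi 'missing tree\|unpack'; then
    echo 'Push failed with an unpack/missing-tree error; retrying with --no-thin'
    git review --no-thin "$@"
fi
```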
clarkb | ok I made it back before gitea-lb03 change merged | 19:30 |
opendevreview | Merged opendev/system-config master: Add gitea-lb03 https://review.opendev.org/c/opendev/system-config/+/955805 | 19:31 |
clarkb | just | 19:31 |
clarkb | the deploy job for ^ is running base without concurrent jobs. So now I wonder why we had concurrent jobs before. There must be some third job that everything waits on that also depends on -base that didn't run when other jobs started in the other buildset? | 19:36 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Switch opendev.org load balancers https://review.opendev.org/c/opendev/zone-opendev.org/+/955825 | 19:44 |
clarkb | I just tested opendev.org through gitea-lb03 after editing /etc/hosts and it seems to work for browsing and git clone | 19:45 |
clarkb | so I think we're ready for 955825 as soon as someone else concurs | 19:45 |
clarkb | and so far the new server still has its ipv6 address | 19:45 |
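A way to spot-check the new load balancer over http without touching /etc/hosts; a sketch where NEW_LB_IP is a placeholder for gitea-lb03's address:

```shell
# Force resolution of opendev.org to the new load balancer for this request
# only, keeping the normal Host header and SNI.
curl --resolve opendev.org:443:NEW_LB_IP -sSI https://opendev.org/ | head -n 3

# git has no --resolve equivalent, so the clone test still needs a temporary
# /etc/hosts entry (as was done here):
#   NEW_LB_IP opendev.org
```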
fungi | yeah, that was the biggest thing, i wanted to watch the interface for a bit and see if it exhibited similar problems | 19:46 |
fungi | though i guess it can't be any worse than 02 | 19:46 |
clarkb | and ya exercising it to see if it happens might be better than waiting? | 19:48 |
clarkb | (if it is something to do with workload) | 19:48 |
fungi | agreed | 19:48 |
fungi | in related news, 02 now has reachable entries in its neighbor table for the routers again, and has a global v6 address on the ens3 interface | 19:49 |
fungi | though ping6 still reports 100% packet loss to those | 19:49 |
fungi | to the routers i mean | 19:49 |
clarkb | huh, so even when the address is configured the network is sometimes not usable | 19:50 |
clarkb | I wonder if that is why it's being automatically removed | 19:50 |
clarkb | not so much DAD collisions as detection that v6 isn't working, so it's better to have things use ipv4? | 19:50 |
fungi | i went ahead and approved 955825 since things generally check out for me as well | 19:53 |
clarkb | thanks | 19:53 |
fungi | if this doesn't solve the problem, i agree next step is to at least temporarily drop the aaaa records | 19:54 |
opendevreview | Merged opendev/zone-opendev.org master: Switch opendev.org load balancers https://review.opendev.org/c/opendev/zone-opendev.org/+/955825 | 19:55 |
clarkb | that is queued up behind the rest of the deploy for 955805 so will be a minute | 19:56 |
clarkb | fungi: probably a good idea to stop haproxy on gitea-lb02 about an hour after gitea-lb03 takes over? Just to avoid any confusion about where connections are going later if we still see problems? | 19:56 |
clarkb | I can do that if we think that is a good idea | 19:57 |
fungi | yeah, it's likely there are systems clinging to that ip address, and they may need a hard kick in the seat to realize it needs re-resolving | 19:58 |
clarkb | but I was also thinking we can leave gitea-lb02 up for a bit if we think it will aid cloud side debugging | 19:58 |
fungi | i'll be curious to see if the problem vanishes once the network load subsides too | 19:59 |
fungi | maybe it's just a deliverability issue with some icmp6 messaging due to a flooded interface? hard to know | 19:59 |
clarkb | to summarize the next steps are double check traffic largely shifts over to gitea-lb03, then an hour later shutdown services on gitea-lb02, then if things look happy reset the ttl on the dns records, land a change to pull gitea-lb02 out of inventory. Wait to see if cloud wants to debug gitea-lb02 further, delete gitea-lb02 when no longer needed | 20:00 |
fungi | sounds good to me, yep | 20:00 |
fungi | neighbor table entries on 02 for the routers have gone permanently stale again even after pinging them, and the global address has been dropped from ens3 | 20:02 |
clarkb | gitea-lb03 ipv6 continues to look fine. But also no load yet | 20:05 |
clarkb | the dns job is about to complete and I already see requests coming in | 20:12 |
clarkb | and i see both new A and AAAA records myself now | 20:12 |
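A quick way to confirm what resolvers should now be seeing; a sketch querying an authoritative server directly (ns1.opendev.org is assumed to be one of them):

```shell
# The A/AAAA answers should now point at gitea-lb03 and carry the
# temporarily lowered TTL.
dig +noall +answer opendev.org A @ns1.opendev.org
dig +noall +answer opendev.org AAAA @ns1.opendev.org
```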
clarkb | at ~21:12 I'll stop haproxy on gitea-lb02 | 20:13 |
clarkb | assuming this is happy I guess | 20:13 |
clarkb | I'm "impressed" with how many requests are still going to 02. Though a good chunk seem to originate from a single IP, which implies software that is poorly configured | 20:20 |
fungi | yes | 20:21 |
fungi | probably it pipelines multiple connections over a persistent socket session so hasn't needed to do any dns lookup | 20:22 |
fungi | but once the connection gets forcibly closed it will hopefully re-resolve the name when opening a new one | 20:22 |
opendevreview | Clark Boylan proposed opendev/system-config master: Drop gitea-lb02 from our inventory https://review.opendev.org/c/opendev/system-config/+/955829 | 20:24 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Reset opendev.org TTLS and drop gitea-lb02 records https://review.opendev.org/c/opendev/zone-opendev.org/+/955830 | 20:25 |
clarkb | I don't know that we're ready for ^ those yet but figured I'd get them running through CI at least | 20:25 |
clarkb | so far I haven't seen any ipv6 weirdness on lb03 either | 20:30 |
fungi | and the global address is still missing from ens3 on lb02 | 20:32 |
fungi | so the drop in traffic volume hasn't magically cured it, at least not yet | 20:32 |
clarkb | I'm going to stop haproxy on gitea-lb02 now | 21:09 |
clarkb | that's done. And I can still reach services | 21:09 |
clarkb | at this point I think I'm happy with the state of things. ipv6 continues to work on gitea-lb03 etc. I think we can land 955829 now or we can wait and see how stable it is first. Happy either way | 21:17 |
clarkb | it does look like gitea-lb02 has its ipv6 address back after stopping haproxy. | 21:18 |
clarkb | #status log Replaced gitea-lb02 with gitea-lb03 a new Noble node in an effort to maintain stable ipv6 connectivity | 21:19 |
opendevstatus | clarkb: finished logging | 21:20 |
clarkb | ipv6 addr on gitea-lb02 is gone again after shutting down haproxy | 21:31 |
clarkb | so I don't think this is workload related | 21:31 |
fungi | yeah, seems not | 21:33 |
opendevreview | James E. Blair proposed opendev/zone-opendev.org master: Remove nodepool hostnames https://review.opendev.org/c/opendev/zone-opendev.org/+/955839 | 21:34 |
corvus | #status log deleted nb05-nb07 and nl05-nl08 | 21:34 |
clarkb | I'm happy to rebase my dns updates on 955839 since I think 955839 has no reason to delay | 21:34 |
opendevstatus | corvus: finished logging | 21:35 |
corvus | sounds good, also, we can just rebase 839 when the gitea stuff is done, no rush. | 21:35 |
opendevreview | Merged openstack/diskimage-builder master: Add a 5 second delay between cache update retries https://review.opendev.org/c/openstack/diskimage-builder/+/955712 | 22:03 |
corvus | tonyb: are you done with https://zuul.opendev.org/t/openstack/autohold/0000000208 ? | 22:19 |
corvus | (no rush, it's not blocking anything -- but when you are done, i'll need to do some manual cleanup of that node since there's no longer any automated process that can delete it) | 22:20 |
corvus | i'm a little sad that we're losing our node counter with node ids switching from a sequence to uuids... since we got up to 41 million nodes. | 22:21 |
clarkb | oh good the dib change managed to land before image builds today | 22:30 |
opendevreview | Merged opendev/zone-opendev.org master: Remove nodepool hostnames https://review.opendev.org/c/opendev/zone-opendev.org/+/955839 | 22:37 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Reset opendev.org TTLS and drop gitea-lb02 records https://review.opendev.org/c/opendev/zone-opendev.org/+/955830 | 22:38 |