Wednesday, 2025-07-23

opendevreviewMerged opendev/system-config master: Remove nodepool configuration  https://review.opendev.org/c/opendev/system-config/+/95522900:24
*** bbezak is now known as Guest2249301:19
*** ykarel_ is now known as ykarel07:52
*** Guest22493 is now known as bbezak08:01
fricklerinfra-root: another buildset where all jobs finished >1h ago, but zuul still shows it as in progress https://zuul.opendev.org/t/openstack/buildset/0da92bb6a2ec4eba895ac9c564db45a013:06
fricklersaw something similar yesterday, maybe a regression in event handling after all?13:07
fungiyeah, my guess would be the results queue processing is on hold for something else (reconfiguration?)13:08
*** darmach1 is now known as darmach13:15
fricklerseems it finished and merged just after I mentioned it. so possibly nothing critical, but worth keeping an eye on IMO13:34
frickleralso ftr I'll be offline tomorrow and friday13:35
fungihave a good weekend!13:36
corvuswhen that buildset finished, there were 7 changes ahead of it in the gate pipeline.  is it implausible that it took those 7 changes an hour to finish?14:00
fricklerah, good point, I didn't check that, since I only looked at the status with a filter on that change when bbezak mentioned it. but indeed https://review.opendev.org/955570 was ahead in gate and took until 13:09, so that explains it14:09
corvusit reported as soon as the item ahead of it reported.14:09
corvusand over the course of the hour, those 7 changes ahead gradually reduced14:09
corvusso i guess it's a good time to remind folks that the gate pipeline is a queued sequence of merge operations, and none of those are final until all of the items ahead are finalized14:10
corvusthat's because at any time, one of the items ahead might end up failing, and that would invalidate the assumptions that zuul made about what changes would be merged and therefore are included in the testing of changes further behind in the queue14:11
corvusso if tests for changes at the head of the queue are not finished, but tests for changes later in the queue are, it's normal for zuul to wait to report those until everything ahead is done14:12
frickleryes, that's all pretty clear now that you mention it, but the filtered status view of just a single change was misleading me, maybe one could still somehow show the implicit dependencies of a change in the gate pipeline there14:14
fungiyeah, not necessarily the gate pipeline overall, but any shared queues in the gate pipeline have that property, and if you're filtering to a subset of changes or projects for that queue then you won't have a complete picture14:37
fungii'm going to pop out briefly, but should be around again by 16:00 utc14:43
corvusneither the scheduler nor web/fingergw services are running on zuul0214:56
clarkbhuh is that fallout from the reboots I wonder14:57
corvusjuly 19 17:00 is when zuul02 web stops on the graphs14:58
corvus17:46-17:48 to be more precise14:59
clarkbcorvus: it never rebooted15:00
clarkbis the playbook still running and waiting for some condition to issue the reboot or maybe the playbook crashed leaving it this way?15:00
corvusthe playbook finished15:00
clarkbFAILED - RETRYING: [zuul02.opendev.org]: Upgrade scheduler server packages15:01
clarkbfrom /var/log/ansible/zuul_reboot.log.115:01
clarkbFailed to lock apt for exclusive operation: Failed to lock directory /var/lib/apt/lists/15:01
corvusroot        3575  0.0  0.0   2800  1920 ?        Ss   Jun28   0:00 /bin/sh /usr/lib/apt/apt.systemd.daily update15:02
clarkbI think this is the second time we've hit this with the new noble nodes? Specifically the auto updates getting stuck15:03
corvusthird, but the second we think may have been the same event on a different host15:03
clarkboh wait but there was the mirror issue with upstream15:03
clarkbya I think we decided the mirror issue was a likely cause15:03
corvusyeah, and i can't remember what that date was15:03
corvusmaybe this is old enough that it's still just the same event on a third host?15:03
clarkbya that is my suspicion15:03
corvusi killed the http process; it's proceeding15:04
corvusokay, it ran unattended upgrades again and is up to date15:11
corvusi will reboot then start15:11
corvusstarted.  once they start, i'm going to restart zuul01 so the versions are in sync15:16
corvus(i don't think there's a difference, but just in case)15:18
clarkback15:18
clarkbhuh wednesdays have a very even distribution of meetings through the day particularly in even weeks (which I think this week is)15:22
clarkbcorvus: the nodepool cleanup change (955229) did merge and mostly deployed successfully. The zookeeper deployment failed due to hitting docker hub rate limits15:38
clarkbthe daily run of zookeeper ended up succeeding later it looks like15:38
corvuswell how about that :)15:39
corvusi'll execute the delete commands today15:39
clarkbdon't forget the builders have volumes too (deleting the servers typically doesn't cascade to those)15:40
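
A rough sketch of the kind of cleanup being referred to, assuming the openstack CLI and that leftover builder volumes show up as detached/available (the filter and placeholder id are illustrative):

    # list volumes that are no longer attached to anything
    openstack volume list --status available -f value -c ID -c Name
    # delete a specific leftover builder volume once confirmed unused
    openstack volume delete <volume-id>
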
clarkbinfra-root I wrote https://review.opendev.org/c/opendev/system-config/+/955414 because I want to block the IBM server that cannot negotiate ssh with gerrit anymore on review03, but fungi had to manually apply a similar type of rule to static recently15:41
clarkbdo we think that change is A) safe and B) appropriate for our current ruleset?15:41
clarkbbut we could manage more permanent blocklists easily if we land something like that15:41
clarkbI wonder if we capture those rules files in the existing test jobs. Maybe doing that is a good idea15:42
clarkbhttps://zuul.opendev.org/t/openstack/build/7e34ad567e7642539e25a0468a58cf98/log/noble/rules.v4.txt#14-15 we do15:43
clarkbfungi: is there some IP address that we could stick into group_vars for ci nodes to block connectivity to/from maybe without breaking anything?15:43
corvusclarkb: know if there's a testinfra test that connects to port 22 of hosts?  if so, you could add a block rule for, say, an rfc1918 address just to have the rule in there, and then verify ssh still works15:44
clarkbcorvus: I think testinfra implicitly connects via port 22 using ansible15:45
clarkb203.0.113.0/24 looks promising too looking at a wikipedia table of ip ranges15:45
corvusclarkb: yeah, that's one of the ones in  https://datatracker.ietf.org/doc/rfc5737/15:46
corvusthat's a good choice i think15:46
clarkbya, "for use in documentation" implies that things on the internet shouldn't be using them in a valid way15:47
corvushttps://zuul.opendev.org/t/opendev/buildset/ea307964afc4460a8d16992f42db0a59 is the nightly image build15:47
corvusit did not go well15:47
clarkblet me see if there is something similar for ipv6 and I can update the change to block both sets and ensure that my change doesn't break anything in unexpected ways15:48
clarkb3fff::/20 appears to be an ipv6 equivalent15:48
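
As a sketch of the placeholder rules being discussed, using the documentation ranges so nothing real is ever blocked (the actual rules and chain names come from 955414; plain INPUT is just illustrative here):

    # drop traffic from the RFC 5737 IPv4 documentation range and the 3fff::/20
    # IPv6 documentation range, purely to exercise the rule plumbing in testing
    iptables  -I INPUT -s 203.0.113.0/24 -j DROP
    ip6tables -I INPUT -s 3fff::/20 -j DROP
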
corvusgenerally speaking, it looks like the builds encountered errors fetching from opendev.org15:49
corvusconnection timeouts and 503 errors15:49
opendevreviewClark Boylan proposed opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs  https://review.opendev.org/c/opendev/system-config/+/95541415:52
clarkbcorvus: the failures here https://zuul.opendev.org/t/opendev/build/ea58ddf19b3e4ba19945054238c4dbd4/log/job-output.txt#5257-5268 happen before the service-gitea.yaml deployment at 2025-07-23T02:40:4615:55
clarkb(just ruling out potential problems with gitea being deployed at the same time as image builds)15:55
corvusyeah i was wondering about that.15:56
corvusthere are some errors at 02:42:..15:56
clarkbcorvus: looking at the haproxy log on gitea-lb02 the last connection from that image build ip address was Jul 23 02:32:25 and it closed cleanly15:57
clarkboh but I should check ipv6 too maybe15:57
fungisorry, back now15:58
clarkbcorvus: the inventory reports there is no ipv6 address for that host. But ansible fact gathering does report an ipv6 address15:58
fungicorvus: clarkb: the event in question was around the end of last month, so if this was hung from like june 28-30 timeframe that would line up i think15:58
clarkbcorvus: the ipv6 address last spoke to gitea-lb02 at Jul 23 02:27:59 and that failed with cD15:59
clarkbthis feels similar to earlier problems where we flipped back and forth between ipv4 and ipv6 but now instead of the retry immediately working it's consistently broken15:59
fungion the client address blocking topic, the challenge with doing an opendev-wide private group_vars is that it would only take effect the next time we ran deployment jobs for those services15:59
clarkbfungi: yes I think you still manually apply them, but then you can add the rule to the permanent list so that a reboot or restart of iptables services doesn't clear the rule out16:00
fungioh, you're looking for an example network to add, yeah the rfc 5737 network is a good choice16:00
clarkbcorvus: so I think we're back to why are we flip flopping between ipv4 and ipv6 for connectivity as a thread to pull on. I suspect but can't say for sure that ipv6 works until 02:27:59 then fails with cD, we switch to ipv4 and things work until we switch back to ipv6 for some reason but it's still broken for the same reason that caused the cD, and now we never fall back and just hard16:02
clarkbfail16:02
clarkbfungi: which I think gets us back to "is there any log file that exists that records network stack decision making like this"16:03
corvusshould we down the ipv6 interfaces on image builds?16:03
corvus(not a very satisfying solution)16:04
clarkbcorvus: maybe we try that if only to confirm it is more reliable?16:04
fungii don't think per-process routing decisions get logged unless the process decides to log them (which may imply deeper integration into the tcp/ip stack than most software cares to implement)16:04
clarkbif we continue to have problems afterwards then we'd know that this is a likely incorrect thread to follow16:04
clarkbfungi: ya we'd have to use ebpf or strace maybe16:05
clarkbhowever I suspect that git is just using what is available on the system16:05
clarkband the system is deciding ipv6 exists or not and sometimes when it thinks it exists it doesn't actually function16:05
clarkbcorvus: the 503 errors I think are related to gitea restarts16:09
clarkbcorvus: on gitea11 the gitea-web process started at 02:45. According to the apache log there between 23/Jul/2025:02:45:37 and 23/Jul/2025:02:45:49 the requests returned 50316:10
corvusi wonder two things: 1) if we could improve the haproxy health check; and 2) whether we should add more of a delay in the dib element that's retrying the git operations16:11
fungiproblematic v6/v4 fallbacks have been fairly common among projects implementing "happy eyeballs" protocols16:11
clarkbJul 23 02:45:40 gitea11 docker-gitea[747]: 2025/07/23 02:45:40 cmd/web.go:233:serveInstalled() [I] PID: 1 Gitea Web Finished16:11
clarkbJul 23 02:45:43 gitea11 docker-gitea[747]: 2025/07/23 02:45:43 cmd/web.go:261:runWeb() [I] Starting Gitea on PID: 116:12
clarkbJul 23 02:45:48 gitea11 docker-gitea[747]: 2025/07/23 02:45:48 cmd/web.go:323:listen() [I] Listen: https://0.0.0.0:300016:12
clarkbcorvus: yup I suspect based on ^ and apache returning 503s that because we're talking to apache the health check checks out even though the service behind it is failing16:13
clarkbso 1) seems like something we should dig into16:13
clarkbhttps://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/gitea-lb.yaml#L3016:14
clarkbI think that's an L4 check?16:15
clarkbI wonder if we can only do http level checks if we balance http16:15
clarkbhttps://www.haproxy.com/documentation/haproxy-configuration-tutorials/reliability/health-checks/#http-health-checks doesn't seem to indicate we need to balance http to do the http check16:16
clarkbhttps://gitea09.opendev.org:3081/api/healthz16:21
stephenfininterestingly, attempting to paste text including 🤦 to paste.o.o causes an HTTP 50216:25
stephenfinor many other emojis16:26
clarkbstephenfin: I think the database there is still utf8 3 byte only16:26
clarkbbecause mysql defaults from the long long ago16:26
stephenfinTIL16:27
fungithough we could probably rectify that with a migration16:29
fungiafter setting some config options16:29
fungii don't know if lodgeit's underlying db interface code needs any adjustments for that16:30
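
The sort of migration being alluded to would look roughly like this, assuming MariaDB/MySQL and hypothetical database/table names for lodgeit:

    # switch the default charset, then convert existing tables so 4-byte
    # characters (emoji) can be stored; the names here are placeholders
    mysql -e "ALTER DATABASE lodgeit CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"
    mysql -e "ALTER TABLE lodgeit.pastes CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"
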
opendevreviewClark Boylan proposed opendev/system-config master: Have haproxy check gitea's health status endpoint  https://review.opendev.org/c/opendev/system-config/+/95570916:31
clarkbcorvus: ^ the gitea 503 problem may be as simple as that?16:31
clarkbcorvus: the default haproxy healthcheck interval is once every 2 seconds and this build failed within a couple of seconds: https://zuul.opendev.org/t/opendev/build/17ddb6e1797d45d596fb7c6cef4d6199/log/job-output.txt#4165-4175 I think maybe add a delay of at least one second between requests?16:32
corvusthese shouldn't happen, i'd do at least 5 seconds16:33
clarkbcorvus: ++16:33
corvushealthz is a good find16:34
clarkbfungi: ya I'm not sure what sort of db management is in lodgeit itself16:34
clarkbnote I tried to HEAD /api/healthz and that 404s. Seems you have to GET it which is fine16:34
corvusyeah, presumably internal processing of the healthz GET is minimal, which is the main reason to HEAD any other endpoint16:37
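
The behaviour described above can be reproduced with curl against the endpoint linked earlier (assuming it is reachable from wherever you run this):

    # GET returns the health status
    curl -s https://gitea09.opendev.org:3081/api/healthz
    # -I sends a HEAD request to the same path, which was observed to 404
    curl -sI https://gitea09.opendev.org:3081/api/healthz
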
opendevreviewClark Boylan proposed openstack/diskimage-builder master: Add a 5 second delay between cache update retries  https://review.opendev.org/c/openstack/diskimage-builder/+/95571216:41
clarkbcorvus: ^ and that should add the short delay between retries16:41
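
The general pattern that change adds is roughly the following; this is just a sketch of retry-with-delay, not the actual patch in 955712:

    # retry a git cache update a few times, pausing between attempts
    for attempt in 1 2 3; do
        git fetch origin && break
        sleep 5
    done
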
opendevreviewClark Boylan proposed opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs  https://review.opendev.org/c/opendev/system-config/+/95541416:49
opendevreviewClark Boylan proposed opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs  https://review.opendev.org/c/opendev/system-config/+/95541417:02
clarkbok I think ^ is in a good state now17:03
clarkbI'm glad I added the ranges to block to exercise it as it caught a bug17:03
clarkbhttps://zuul.opendev.org/t/openstack/build/9444be613c964224ad4491241f955cd8/log/gitea-lb02.opendev.org/haproxy.log#11-12 this shows the new check seems to work. Note the http backend returns a 307 which is considered valid. I think for http that is probably good enough since it redirects back to https in apache itself iirc17:37
opendevreviewClark Boylan proposed opendev/system-config master: Have haproxy check zuul's health/ready page  https://review.opendev.org/c/opendev/system-config/+/95571818:09
clarkbcorvus: ^ I realized that a similar update to haproxy may be useful for zuul too18:09
corvusgood idea18:11
corvusthough... one sec18:11
corvusclarkb: 2 things: 1) we don't actually start listening until the component is fully ready; 2) /health/ready is served on a different port18:14
corvusso i think we can leave it as it was18:15
clarkbah ok. fwiw I did curl --head https://zuul01.opendev.org/health/ready and got a 200 but maybe that is some other endpoint being confusing18:15
clarkbI can see in the response headers that cherrypy is what responded.18:16
corvusyeah that was probably the universal redirect16:16
clarkb`curl -X OPTIONS https://zuul01.opendev.org/` returns a lot more data than is necessary. I guess a small optimization would be to HEAD / instead of OPTIONS /18:17
clarkbbut we can leave it as is if you think getting the js back is a better canary18:17
corvusi think head should be fine; my expectation is that we don't start listening until we're ready18:19
corvus(also, maybe we should see what options is returning and whether we need to fix something in zuul-web)18:19
clarkbcorvus: it seems to return the web site html/js for me18:19
clarkbwhich should be cached and isn't super huge in its initial response so I don't think it's a big cost to use it as a check. But yes maybe options should return something different to be more correct16:20
clarkbI've abandoned my change with some notes18:20
clarkbinfra-root if anyone else is willing to review https://review.opendev.org/c/opendev/system-config/+/955709 I think that would be a good one to get in today in the hopes it makes image builds more reliable20:01
fungiapproved it just now20:05
clarkbthanks. I can keep an eye on it20:05
fungiwill we need to manually restart haproxy for that?20:05
clarkbfungi: I don't believe so. But I'll double check (iirc ansible is set up to gracefully restart haproxy on config updates)20:06
fungiah, cool20:06
clarkbfungi: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/haproxy/tasks/main.yaml#L60-L67 writing out the config file notifies the reload haproxy handler which does: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/haproxy/handlers/hup_haproxy.yaml and that should gracefully reload things20:07
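
For context, the graceful reload the handler performs amounts to something like the following, assuming haproxy runs in a container named haproxy (the container name is a guess; the authoritative version is the hup_haproxy.yaml handler linked above):

    # send SIGHUP to the haproxy container; haproxy re-reads its config and
    # lets existing connections on the old workers finish before retiring them
    docker kill --signal=HUP haproxy
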
fungiperfect20:12
clarkbthe dib change to add a delay between dib repo cache retries is looking like it will go green after a recheck20:48
clarkbthe pre-recheck failure was an odd one (the ssh client couldn't negotiate the key exchange when setting up the ssh connection)20:49
clarkbbut none of the other builds failed that way so I figured maybe something in centos 9 stream sshd packaging? seems to have passed on a second run though20:49
fungithe nodepool removal change failed infra-prod-service-grafana in deploy, has anyone looked into that yet?21:05
funginever mind, docker rate limit error21:07
fungiand the same jobs has succeeded subsequently21:07
fungier, job21:07
clarkbyup21:09
clarkbthe zookeeper job failed too? Or maybe it was just zookeeper; I checked early today and I would have to go look again21:09
clarkbfungi: https://review.opendev.org/c/openstack/diskimage-builder/+/955712 is the other change aimed at improving daily image builds and it now passes check testing21:10
fungii approved the iptables block change21:10
clarkbok I'm fairly confident in that change due to our testing but we'll probably want to double check anyway after it deploys21:11
clarkbyou know I wonder if the -base job should trigger off of edits to playbooks/roles/iptables?21:12
clarkbya iptables is run by the base.yaml playbook. I think that is a good update21:14
clarkbfungi: do you think ^ is worth unapproving the iptables block change in order to combine them?21:14
clarkbotherwise we either have to land a second change or wait for daily runs21:14
fungii can unapprove, sure21:15
fungidone21:16
fungii agree it's preferable to have that added one way or the other, in order to speed up deployment of changes to those rules in the future21:17
opendevreviewClark Boylan proposed opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs  https://review.opendev.org/c/opendev/system-config/+/95541421:17
clarkbthat is the updated change. system-config-run-base is already triggering on all playbooks/ updates21:18
opendevreviewMerged opendev/system-config master: Have haproxy check gitea's health status endpoint  https://review.opendev.org/c/opendev/system-config/+/95570921:20
clarkbarg ^ that didn't trigger jobs either21:21
opendevreviewClark Boylan proposed opendev/system-config master: Trigger load balancer deployment jobs when their roles update  https://review.opendev.org/c/opendev/system-config/+/95573421:25
clarkbfungi: ^ that change aims to fix the lack of gitea-lb updates from 955709 landing21:25
fungireapproved 95541421:29
fungialso quick-approved 955734 so we can see the haproxy changes take effect sooner21:31
clarkbthanks21:32
fungipython 3.14.0rc1 is out21:44
clarkbfungi: I expect that 955414 will remove your manual rule on static. Do you think we should edit hostvars/group vars on bridge to add that IP address to the new block var list for static?21:54
fungithat's a good idea for a test21:54
fungishould we do it before the triggered deploy, or wait and see if the daily run picks it up?21:55
clarkbI'm thinking before the triggered deploy would ensure that we don't have a gap where that host could become problematic again21:55
fungifair enough, it might save us some cleanup work in that regard21:56
fungii'll get that added and let you double-check i stuck it in the right spot21:56
clarkbsounds good21:57
fungishould i do it in a service-specific group or does this only work if it's added to the global one?21:58
clarkbfungi: it should work in any group. I would make it service specific particularly while we're deploying the new functionality21:58
clarkbthis way we don't accidentally over-block globally21:59
fungiyeah, take a look at the last commit on bridge and see if that's what you expect then22:01
clarkbfungi: almost. host_vars/static.opendev.org.yaml only applies to the host known as static.opendev.org which doesn't exist anymore. Looks like we're using static02.opendev.org22:03
fungioh22:04
clarkbfungi: there is also a static group defined. So you can either add host_vars/static02.opendev.org or group_vars/static.yaml22:04
clarkbgroup vars is probably preferred22:04
fungiyeah i checked group vars first and we didn't seem to have a file for static at all there22:05
fungiwill it just be picked up automatically if i add one?22:05
clarkbyes it should be22:05
clarkbthat service only uses public group vars so far. The two should be mixed together. They are disjoint values so we don't have to worry about precedence22:06
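
A sketch of what adding the group var on bridge might look like; the path and the variable name are assumptions here (the real variable name comes from 955414), and the address is a placeholder:

    # append the blocked address to the static group's private vars on bridge
    cat >> /etc/ansible/hosts/group_vars/static.yaml <<'EOF'
    iptables_blocked_addresses:
      - 192.0.2.10
    EOF
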
fungisee if the amended master branch commit looks right now22:07
clarkbyes that looks correct22:08
fungigreat22:08
clarkbfungi: fwiw I don't see that IP in the current ruleset on static0222:10
clarkboh I see it. It's like the first rule of everything22:11
fungiyeah22:11
fungii inserted it into the input chain without specifying a rule number to insert at/after, so it goes in at the front22:12
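
In other words, the manual block looks roughly like this (the address shown is a placeholder):

    # -I with no rule number inserts at position 1, i.e. the very front of the chain
    iptables -I INPUT -s 192.0.2.10 -j DROP
    # removing it later requires matching the same rule spec
    iptables -D INPUT -s 192.0.2.10 -j DROP
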
clarkbthe iptables update which should noop on every host but static02 is about to merge22:13
opendevreviewMerged opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs  https://review.opendev.org/c/opendev/system-config/+/95541422:14
clarkbhrm why are jobs running at the same time as -base?22:18
clarkbI guess those builds don't actually depend on base. Looking at system-config/zuul.d/project.yaml this appears to be how we've configured it22:20
clarkbI'm going to not worry about that now, but we might consider changing that?22:20
clarkbI see the files on disk have updated but we haven't restarted netfilter-persistent yet. I suspect because that is a handler so runs at the very end of the playbook22:22
clarkboh no actually it did reload the rules on static02. My check of systemctl status netfilter-persistent is just not giving me the timestamp of the last reload22:23
clarkbthis is looking good but I want to make sure that the ruleset was restarted on hosts where we expect it to noop too22:24
clarkbwe don't use systemd to load the rules. We use the netfilter-persistent script directly which does an iptables-restore22:28
clarkbunfortunately this means I can't confirm the timestamps but I think it is working as expected22:28
fungiyeah, the old INPUT chain entry is cleared now, and there's a new entry for it in the openstack-INPUT chain instead22:29
fungii know this is the intended design, but we need to remember that if we've got someone hammering the ssh port on one of our servers we can't block them with this rule22:29
clarkbya it definitely did what I expect on static02. I was mostly trying to confirm that if some other server accidentally reboots in the future it won't load up a broken ruleset22:29
clarkbthe zookeeper job failed pulling docker images22:30
clarkbError: /Stage[main]/Jeepyb/Vcsrepo[/opt/jeepyb]: Could not evaluate: Execution of '/usr/bin/git fetch origin' returned 128: fatal: unable to access 'https://opendev.org/openstack-infra/jeepyb/': Failed to connect to opendev.org port 443: No route to host22:31
clarkbthis is why the puppet job failed22:31
clarkbopendev.org is reachable from here. But now I'm wondering if the problem is maybe ipv6 connectivity on the opendev.org side of things impacting the image builds22:32
clarkbgitea did reload its haproxy config but that is supposed to be graceful so I wouldn't expect it to cause no route to host problems22:33
clarkblooking at the base.yaml.log on bridge the handler did seem to run netfilter-persistent start across the board so I'm a lot less worried about not having applied the noop case22:34
clarkbfungi: from my personal ovh server with ipv6 addressing ping6 fails to opendev.org22:38
clarkbit also is unreachable from mirror.dfw.rax.opendev.org22:39
fungilovely22:39
fungilemme try from here22:39
clarkboho ip addr on gitea-lb02 only shows a link local address22:39
clarkbso I think this is the problem. It's not client side, it's on the load balancer22:39
clarkbdmesg doesn't report why the ip address goes away22:40
fungiyeah, 100% packet loss from my house too22:42
fungii concur, it hasn't picked up any v6 address22:43
clarkbexcept if you look in the haproxy logs we know it has working ipv6 sometimes (because there are ipv6 connections)22:43
clarkbso something is causing the ipv6 address to get unconfigured. And now I'm remembering that we had specific netplan rules for the old gerrit server to statically configure ipv6 due to problems. I wonder if that is an issue here on the other region now22:44
fungiit has stale neighbor table entries for the ll addresses of the two gateways that gitea09 is using for its default routes22:45
fungithe gerrit server, once upon a time, was picking up additional default routes from what looked like stray advertisements (maybe from our own projects' test jobs)22:46
fungiin this case it's not extras i see, but none at all22:47
fungiif i ping6 those gateways they show up reachable in the neighbor table22:48
clarkbit appears that the server was configured with cloud-init which then configured netplan which then configures systemd-networkd22:48
clarkbfungi: the host has a valid ipv6 address again22:48
clarkbnot sure if that is a side effect of your pinging or just regular return to service that makes this work some of the time22:49
fungiso it's expiring and not getting reestablished sometimes i guess?22:49
fungivery well may have been due to me pinging one of the gateways22:49
fungimaybe there's something weird in the switching somewhere, and having a frame traverse the right device in the other direction reestablished a working flow or blew out a stale one22:50
fungithough route announcements should be broadcast type, so that wouldn't make sense22:50
clarkbfungi: `sudo networkctl status ens3` indicates that DHCPv6 leases have been lost in the past22:51
clarkbI suspect that since our lease is gone we're relying on stray RAs?22:52
clarkbthough looking at those logs it seems to say the lease is lost when configuring the interface (on boot?)22:53
fungioh, could be this is not relying on slaac22:53
fungii was thinking slaac not dhcp622:53
clarkbit goes from dhcpv6 lease lost to ens3 up22:53
clarkbfungi: ya the netplan config seems to say use dhcp for both ipv4 and ipv6 and that should be coming from cloud metadata service configuration22:53
fungithough there are hybrid configurations where you can do address selection via slaac and rely on dhcp6 to handle things like dns server info22:54
clarkbbut then there is no dhcpv6 got a lease message afterwards like there is for dhcpv422:54
fungiright it might not be relying on dhcp6 for addressing22:55
clarkbhow long is an RA valid for? can they expire causing us to unconfigure the address?22:55
fungithe route expirations will be reported by ip -6 ro sh22:55
fungidefault proto ra metric 100 expires 1789sec pref medium22:56
fungithese look pretty short, fwiw22:56
fungii guess not, my home v6 routing is similarly short22:57
clarkbthat's just under half an hour. I guess we check again at 23:30 UTC and see if the address is gone?22:57
clarkbin the meantime should we drop the AAAA record for opendev.org?22:57
fungimight not be the worst idea, as a temporary measure while we're trying to sort this out22:57
fungihopefully we don't have any v6-only clients22:58
clarkbI don't think we do anymore. But we should check our base jobs which do routing checks and all that and see if they try to v6 opendev.org22:58
fungiwell, i didn't mean clients we're running, i meant users at large23:00
clarkboh sure. I guess we'd be changing them from failures X% of the time to 100% of the time23:00
clarkbhttps://opendev.org/zuul/zuul-jobs/src/branch/master/roles/validate-host/library/zuul_debug_info.py#L92-L106 we do validate but it is configurable23:05
clarkbhttps://opendev.org/openstack/project-config/src/branch/master/zuul/site-variables.yaml#L14-L15 it is ok if ipv6 fails I think based on this23:06
fungii half wonder if the same thing could be happening to the gitea backends too, and we just don't know because we don't connect directly to them23:06
fungithe route expiration on gitea-lb02 has jumped back up again23:08
fungiso seems like it normally gets refreshed every time it sees a new announcement or something23:08
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Drop the opendev.org AAAA record  https://review.opendev.org/c/opendev/zone-opendev.org/+/95573723:08
clarkbPushed that so we have the option23:09
fungilowest i caught it at was 1201sec23:09
fungibut now it's 1603sec23:09
fungiso it's probably something like 15 minutes and gets reset every 523:09
fungiso under normal circumstances shouldn't fall below 10 minutes23:10
clarkbthe old value was almost 1800 which is ~30 minutes23:10
fungier, yes my math is off23:11
fungii meant 30 minutes expiration reset every 1023:11
fungiso shouldn't fall below 2023:11
fungibecause the highest i've seen is just shy of 30 minutes and lowest i've seen is barely above 20 minutes23:12
fungiwe'll know in about another 3 minutes23:12
clarkball of the backends have ipv6 addrs right now too. Presumably they are on the same network and see the same RAs?23:12
fungiall the ones i checked matched for gateway addresses at least, yeah23:13
clarkbso maybe check them next time we see the load balancer lose its address to see if it has happened across the board23:13
fungiright23:13
fungiabout a minute to go before the default routes are refreshed on gitea-lb02, if my observation holds23:15
fungibingo23:16
fungii saw it jump from 1200 to 179923:16
clarkbI wish I could figure out where this is logged but I don't think anything logs it23:16
fungino, there's probably a sysctl option to turn on verbose logging in that kernel subsystem23:17
fungiwhich would then go to dmesg and probably syslog23:17
clarkbI'm not finding one in https://docs.kernel.org/6.8/networking/ip-sysctl.html but the formatting of that makes it a bit tough to grep through23:19
fungii've +2'd but not approved 955737 for now23:19
clarkbyou can log martians23:19
fungibut not venutians or mercurians, sadly23:19
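
For what it's worth, martian logging is an IPv4-only knob, so it wouldn't catch the v6 address going away, but enabling it would look something like this:

    # log packets with impossible source/destination addresses to the kernel log
    sysctl -w net.ipv4.conf.all.log_martians=1
    sysctl -w net.ipv4.conf.default.log_martians=1
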
clarkbmaybe we just want a naive script that captures ip addr output and ip -6 ro sho output every 5 minutes23:20
fungithough it's really those pesky saturnians you need to look out for23:20
fungiyeah, script or even just a temporary cronjob that e-mails when there are no default v6 routes23:21
clarkbwhile true ; do ip -6 addr show dev ens3 && ip -6 route show default ; done > /var/log/ipv6_networking.log 2>&1 and put that in screen?23:22
clarkbI meant to add a sleep 300 to that too23:22
fungicould even just test whether the output of `ip -6 ro sh default` is empty23:22
fungiyeah, screen session would be fine23:23
fungislap a date command or something in there too so we have timestamping23:23
fungianyway, i need to knock off for the evening, told christine i was coming upstairs something like 30 minutes ago so she's probably wondering where i am (scratch that, i'm sure she knows where i am...)23:24
clarkbI added a date and a sleep 300 and stuck that in a screen writing to my homedir (so not /var/log)23:25
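
Roughly what the loop in that screen session ends up as, per the tweaks discussed above (timestamped, sleeping 5 minutes, logging to the home directory):

    # capture the v6 address and default route every 5 minutes with a timestamp
    while true ; do
        date
        ip -6 addr show dev ens3
        ip -6 route show default
        sleep 300
    done >> ~/ipv6_networking.log 2>&1
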
fungisgtm, thanks!23:26
clarkbquay.io is in emergency maintenance read-only mode23:29
clarkbtheir status page indicates this affects pushes but I'm seeing jobs fail fetching with 502 bad gateway too23:29
clarkbjust a heads up. Not much we can do about it from here other than be aware23:30
clarkbfungi: I know you popped out but almost immediately after you did the ipv6 address went away. The ip -6 route show output still shows a valid default route though23:31
clarkbcuriously this occurred around my original guesstimate of 23:30 UTC but that may be coincidence23:32
clarkbI'm going to leave it alone and see if refreshing the routes refreshes the ip23:33
clarkbnone of the backends are currently affected. They all have valid global ipv6 addresses23:36
clarkbafter the routes refreshed the ip address came back23:36
clarkband its gone again23:40
clarkbI'm beginning to suspect our accept_dad = 1 sysctl option for ens3 means that we're removing the address if a duplicate is found23:47
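
A quick way to check that theory would be something like the following (interface name per the discussion above):

    # confirm the duplicate address detection setting for the interface
    sysctl net.ipv6.conf.ens3.accept_dad
    # addresses that failed DAD (and are still present) are flagged "dadfailed"
    ip -6 addr show dev ens3 | grep -i dadfailed
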
