opendevreview | Merged opendev/system-config master: Remove nodepool configuration https://review.opendev.org/c/opendev/system-config/+/955229 | 00:24 |
*** bbezak is now known as Guest22493 | 01:19 | |
*** ykarel_ is now known as ykarel | 07:52 | |
*** Guest22493 is now known as bbezak | 08:01 | |
frickler | infra-root: another buildset where all jobs finished >1h ago, but zuul shows it still being in progress https://zuul.opendev.org/t/openstack/buildset/0da92bb6a2ec4eba895ac9c564db45a0 | 13:06 |
frickler | saw something similar yesterday, maybe a regression in event handling after all? | 13:07 |
fungi | yeah, my guess would be the results queue processing is on hold for something else (reconfiguration?) | 13:08 |
*** darmach1 is now known as darmach | 13:15 | |
frickler | seems it finished and merged just after I mentioned it. so possibly nothing critical, but worth keeping an eye on IMO | 13:34 |
frickler | also ftr I'll be offline tomorrow and friday | 13:35 |
fungi | have a good weekend! | 13:36 |
corvus | when that buildset finished, there were 7 changes ahead of it in the gate pipeline. is it implausible that it took those 7 changes an hour to finish? | 14:00 |
frickler | ah, good point, I didn't check that, since I only looked at the status with a filter on that change when bbezak mentioned it. but indeed https://review.opendev.org/955570 was ahead in gate and took until 13:09, so that explains it | 14:09 |
corvus | it reported as soon as the item ahead of it reported. | 14:09 |
corvus | and over the course of the hour, those 7 changes ahead gradually reduced | 14:09 |
corvus | so i guess it's a good time to remind folks that the gate pipeline is a queued sequence of merge operations, and none of those are final until all of the items ahead are finalized | 14:10 |
corvus | that's because at any time, one of the items ahead might end up failing, and that would invalidate the assumptions that zuul made about what changes would be merged and therefore are included in the testing of changes further behind in the queue | 14:11 |
corvus | so if tests for changes at the head of the queue are not finished, but tests for changes later in the queue are, it's normal for zuul to wait to report those until everything ahead is done | 14:12 |
frickler | yes, that's all pretty clear now that you mention it, but the filtered status view of just a single change was misleading me, maybe one could still somehow show the implicit dependencies of a change in the gate pipeline there | 14:14 |
fungi | yeah, not necessarily the gate pipeline overall, but any shared queues in the gate pipeline have that property, and if you're filtering to a subset of changes or projects for that queue then you won't have a complete picture | 14:37 |
fungi | i'm going to pop out briefly, but should be around again by 16:00 utc | 14:43 |
corvus | neither the scheduler nor web/fingergw services are running on zuul02 | 14:56 |
clarkb | huh is that fallout from the reboots I wonder | 14:57 |
corvus | july 19 17:00 is when zuul02 web stops on the graphs | 14:58 |
corvus | 17:46-17:48 to be more precise | 14:59 |
clarkb | corvus: it never rebooted | 15:00 |
clarkb | is the playbook still running and waiting for some condition to issue the reboot or maybe the playbook crashed leaving it this way? | 15:00 |
corvus | the playbook finished | 15:00 |
clarkb | FAILED - RETRYING: [zuul02.opendev.org]: Upgrade scheduler server packages | 15:01 |
clarkb | from /var/log/ansible/zuul_reboot.log.1 | 15:01 |
clarkb | Failed to lock apt for exclusive operation: Failed to lock directory /var/lib/apt/lists/ | 15:01 |
corvus | root 3575 0.0 0.0 2800 1920 ? Ss Jun28 0:00 /bin/sh /usr/lib/apt/apt.systemd.daily update | 15:02 |
clarkb | I think this is the second time we've hit this with the new noble nodes? Specifically the auto updates getting stuck | 15:03 |
corvus | third, but the second we think may have been the same event on a different host | 15:03 |
clarkb | oh wait but there was the mirror issue with upstream | 15:03 |
clarkb | ya I think we decided the mirror issue was a likely cause | 15:03 |
corvus | yeah, and i can't remember what that date was | 15:03 |
corvus | maybe this is old enough that it's still just the same event on a third host? | 15:03 |
clarkb | ya that is my suspicion | 15:03 |
corvus | i killed the http process; it's proceeding | 15:04 |
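(A hedged sketch of confirming what is holding the apt lock before killing it, roughly the situation above; not necessarily the exact commands that were run here.)

    # Find long-running apt maintenance processes like the one shown above
    ps -eo pid,etime,cmd | grep -E 'apt\.systemd\.daily|unattended-upgrade' | grep -v grep
    # Confirm which process holds the apt lists lock (lsof may need installing)
    sudo lsof /var/lib/apt/lists/lock
    # Then kill the stuck PID and re-run the upgrade/reboot playbook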
corvus | okay, it ran unattended upgrades again and is up to date | 15:11 |
corvus | i will reboot then start | 15:11 |
corvus | started. once they start, i'm going to restart zuul01 so the versions are in sync | 15:16 |
corvus | (i don't think there's a difference, but just in case) | 15:18 |
clarkb | ack | 15:18 |
clarkb | huh wednesdays have a very even distribution of meetings through the day particularly in even weeks (which I think this week is) | 15:22 |
clarkb | corvus: the nodepool cleanup change (955229) did merge and mostly deployed successfully. The zookeeper deployment failed due to hitting docker hub rate limits | 15:38 |
clarkb | the daily run of zookeeper ended up succeeding later it looks like | 15:38 |
corvus | well how about that :) | 15:39 |
corvus | i'll execute the delete commands today | 15:39 |
clarkb | don't forget the builders have volumes too (volume deletion typically doesn't cascade when the server is deleted) | 15:40 |
clarkb | infra-root I wrote https://review.opendev.org/c/opendev/system-config/+/955414 because I want to block the IBM server that cannot negotiate ssh with gerrit anymore on review03 but fungi had to manually apply a similar type of rule to static recently | 15:41 |
clarkb | do we think that change is A) safe and B) appropriate for our current ruleset? | 15:41 |
clarkb | but we could manage more permanent blocklists easily if we land something like that | 15:41 |
clarkb | I wonder if we capture those rules files in the existing test jobs. Maybe doing that is a good idea | 15:42 |
clarkb | https://zuul.opendev.org/t/openstack/build/7e34ad567e7642539e25a0468a58cf98/log/noble/rules.v4.txt#14-15 we do | 15:43 |
clarkb | fungi: is there some IP address that we could stick into group_vars for ci nodes to block connectivity to/from maybe without breaking anything? | 15:43 |
corvus | clarkb: know if there's a testinfra test that connects to port 22 of hosts? if so, you could add a block rule for, say, an rfc1918 address just to have the rule in there, and then verify ssh still works | 15:44 |
clarkb | corvus: I think testinfra implicitly connects via port 22 using ansible | 15:45 |
clarkb | 203.0.113.0/24 looks promising too looking at a wikipedia table of ip ranges | 15:45 |
corvus | clarkb: yeah, that's one of the ones in https://datatracker.ietf.org/doc/rfc5737/ | 15:46 |
corvus | that's a good choice i think | 15:46 |
clarkb | ya for use in documentation implies that things on the internet shouldn't use them in a valid way | 15:47 |
corvus | https://zuul.opendev.org/t/opendev/buildset/ea307964afc4460a8d16992f42db0a59 is the nightly image build | 15:47 |
corvus | it did not go well | 15:47 |
clarkb | let me see if there is something similar for ipv6 and I can update the change to block both sets and ensure that my change doesn't break anything in unexpected ways | 15:48 |
clarkb | 3fff::/20 appears to be an ipv6 equivalent | 15:48 |
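(A sketch of the kind of drop rules being exercised here, using only the RFC 5737 / RFC 9637 documentation ranges mentioned above; the openstack-INPUT chain name matches what shows up later on static02, but the real change renders these through the persistent rules.v4/rules.v6 files rather than ad-hoc commands.)

    # IPv4 documentation range (TEST-NET-3)
    sudo iptables -I openstack-INPUT -s 203.0.113.0/24 -j DROP
    # IPv6 documentation range
    sudo ip6tables -I openstack-INPUT -s 3fff::/20 -j DROP
    # Confirm the rules are present
    sudo iptables -S openstack-INPUT | grep 203.0.113.0
    sudo ip6tables -S openstack-INPUT | grep 3fff: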
corvus | generally speaking, it looks like the builds encountered errors fetching from opendev.org | 15:49 |
corvus | connection timeouts and 503 errors | 15:49 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs https://review.opendev.org/c/opendev/system-config/+/955414 | 15:52 |
clarkb | corvus: the failures here https://zuul.opendev.org/t/opendev/build/ea58ddf19b3e4ba19945054238c4dbd4/log/job-output.txt#5257-5268 happen before the service-gitea.yaml deployment at 2025-07-23T02:40:46 | 15:55 |
clarkb | (just ruling out potential problems with gitea being deployed at the same time as image builds) | 15:55 |
corvus | yeah i was wondering about that. | 15:56 |
corvus | there are some errors at 02:42:.. | 15:56 |
clarkb | corvus: looking at the haproxy log on gitea-lb02 the last connection from that image build ip address was Jul 23 02:32:25 and it closed cleanly | 15:57 |
clarkb | oh but I should check ipv6 too maybe | 15:57 |
fungi | sorry, back now | 15:58 |
clarkb | corvus: the inventory reports there is no ipv6 address for that host. But ansible fact gathering does report an ipv6 address | 15:58 |
fungi | corvus: clarkb: the event in question was around the end of last month, so if this was hung from like june 28-30 timeframe that would line up i think | 15:58 |
clarkb | corvus: the ipv6 address last spoke to gitea-lb02 at Jul 23 02:27:59 and that failed with cD | 15:59 |
clarkb | this feels similar to earlier problems where we flipped back and forth between ipv4 and ipv6 but now instead of the retry immediately working it's consistently broken | 15:59 |
fungi | on the client address blocking topic, the challenge with doing an opendev-wide private group_vars is that it would only take effect the next time we ran deployment jobs for those services | 15:59 |
clarkb | fungi: yes I think you still manually apply them, but then you can add the rule to the permanent list so that a reboot or restart of iptables services doesn't clear the rule out | 16:00 |
fungi | oh, you're looking for an example network to add, yeah the rfc 5737 network is a good choice | 16:00 |
clarkb | corvus: so I think we're back to why we are flip flopping between ipv4 and ipv6 for connectivity as a thread to pull on. I suspect but can't say for sure that ipv6 works until 02:27:59 then fails with cD, we switch to ipv4 and things work until we switch back to ipv6 for some reason, but it's still broken for the same reason that caused the cD, and now we never fall back and just hard fail | 16:02 |
clarkb | fungi: which I think gets us back to "is there any log file that exists that records network stack decision making like this" | 16:03 |
corvus | should we down the ipv6 interfaces on image builds? | 16:03 |
corvus | (not a very satisfying solution) | 16:04 |
clarkb | corvus: maybe we try that if only to confirm it is more reliable? | 16:04 |
fungi | i don't think per-process routing decisions get logged unless the process decides to log them (which may imply deeper integration into the tcp/ip stack than most software cares to implement) | 16:04 |
clarkb | if we continue to have problems afterwards then we'd know that this is a likely incorrect thread to follow | 16:04 |
clarkb | fungi: ya we'd have to ebpf or strace maybe | 16:05 |
clarkb | however I suspect that git is just using what is available on the system | 16:05 |
clarkb | and the system is deciding ipv6 exists or not and sometimes when it thinks it exists it doesn't actually function | 16:05 |
clarkb | corvus: the 503 errors I think are related to gitea restarts | 16:09 |
clarkb | corvus: on gitea11 the gitea-web process started at 02:45. According to the apache log there between 23/Jul/2025:02:45:37 and 23/Jul/2025:02:45:49 the requests 503 | 16:10 |
corvus | i wonder two things: 1) if we could improve the haproxy health check; and 2) whether we should add more of a delay in the dib element that's retrying the git operations | 16:11 |
fungi | problematic v6/v4 fallbacks have been fairly common among projects implementing "happy eyeballs" protocols | 16:11 |
clarkb | Jul 23 02:45:40 gitea11 docker-gitea[747]: 2025/07/23 02:45:40 cmd/web.go:233:serveInstalled() [I] PID: 1 Gitea Web Finished | 16:11 |
clarkb | Jul 23 02:45:43 gitea11 docker-gitea[747]: 2025/07/23 02:45:43 cmd/web.go:261:runWeb() [I] Starting Gitea on PID: 1 | 16:12 |
clarkb | Jul 23 02:45:48 gitea11 docker-gitea[747]: 2025/07/23 02:45:48 cmd/web.go:323:listen() [I] Listen: https://0.0.0.0:3000 | 16:12 |
clarkb | corvus: yup I suspect based on ^ and apache returning 503s that because we're talking to apache the health check checks out even though the service behind it is failing | 16:13 |
clarkb | so 1) seems like something we should dig into | 16:13 |
clarkb | https://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/gitea-lb.yaml#L30 | 16:14 |
clarkb | I think that's an L4 check? | 16:15 |
clarkb | I wonder if we can only do http level checks if we balance http | 16:15 |
clarkb | https://www.haproxy.com/documentation/haproxy-configuration-tutorials/reliability/health-checks/#http-health-checks doesn't seem to indicate we need to balance http to do the http check | 16:16 |
clarkb | https://gitea09.opendev.org:3081/api/healthz | 16:21 |
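(Roughly what the proposed haproxy check boils down to, expressed as a one-off curl; the actual change uses haproxy's HTTP health checking, this just shows the endpoint being probed and the expected 200.)

    # Expect "200" if gitea itself is serving, not just apache answering
    curl -s -o /dev/null -w '%{http_code}\n' https://gitea09.opendev.org:3081/api/healthz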
stephenfin | interestingly, attempting to paste text including 🤦 to paste.o.o causes a HTTP 502 | 16:25 |
stephenfin | or many other emojis | 16:26 |
clarkb | stephenfin: I think the database there is still 3-byte utf8 only | 16:26 |
clarkb | because mysql defaults from the long long ago | 16:26 |
stephenfin | TIL | 16:27 |
fungi | though we could probably rectify that with a migration | 16:29 |
fungi | after setting some config options | 16:29 |
fungi | i don't know if lodgeit's underlying db interface code needs any adjustments for that | 16:30 |
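(A hypothetical sketch of the migration fungi is describing; the database and table names are placeholders and haven't been checked against the paste server, and lodgeit's own code may need adjusting as noted.)

    # Assumed names: database "lodgeit", table "pastes"; assumes client auth is configured
    mysql -e "ALTER DATABASE lodgeit CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"
    mysql -e "ALTER TABLE lodgeit.pastes CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"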
opendevreview | Clark Boylan proposed opendev/system-config master: Have haproxy check gitea's health status endpoint https://review.opendev.org/c/opendev/system-config/+/955709 | 16:31 |
clarkb | corvus: ^ the gitea 503 problem may be as simple as that? | 16:31 |
clarkb | corvus: the default haproxy healthcheck interval is once every 2 seconds and this build failed within a couple of seconds: https://zuul.opendev.org/t/opendev/build/17ddb6e1797d45d596fb7c6cef4d6199/log/job-output.txt#4165-4175 I think maybe add a delay of at least one second between requests? | 16:32 |
corvus | these shouldn't happen, i'd do at least 5 seconds | 16:33 |
clarkb | corvus: ++ | 16:33 |
corvus | healthz is a good find | 16:34 |
clarkb | fungi: ya I'm not sure what sort of db management is in lodgeit itself | 16:34 |
clarkb | note I tried to HEAD /api/healthz and that 404s. Seems you have to GET it which is fine | 16:34 |
corvus | yeah, presumably internal processing of the healthz GET is minimal, which is the main reason to HEAD any other endpoint | 16:37 |
opendevreview | Clark Boylan proposed openstack/diskimage-builder master: Add a 5 second delay between cache update retries https://review.opendev.org/c/openstack/diskimage-builder/+/955712 | 16:41 |
clarkb | corvus: ^ and that should add the short delay between retries | 16:41 |
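(The shape of the change, not the actual dib element code: retry the cache update a few times with a pause between attempts so a briefly-restarting backend isn't fatal.)

    attempts=0
    until git fetch origin; do
        attempts=$((attempts + 1))
        if [ "$attempts" -ge 3 ]; then
            echo "git fetch failed after $attempts attempts" >&2
            exit 1
        fi
        sleep 5  # the delay 955712 adds between retries
    done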
opendevreview | Clark Boylan proposed opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs https://review.opendev.org/c/opendev/system-config/+/955414 | 16:49 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs https://review.opendev.org/c/opendev/system-config/+/955414 | 17:02 |
clarkb | ok I think ^ is in a good state now | 17:03 |
clarkb | I'm glad I added the ranges to block to exercise it as it caught a bug | 17:03 |
clarkb | https://zuul.opendev.org/t/openstack/build/9444be613c964224ad4491241f955cd8/log/gitea-lb02.opendev.org/haproxy.log#11-12 this shows the new check seems to work. Note the http backend returns a 307 which is considered valid. I think for http that is probably good enough since it redirects back to https in apache itself iirc | 17:37 |
opendevreview | Clark Boylan proposed opendev/system-config master: Have haproxy check zuul's health/ready page https://review.opendev.org/c/opendev/system-config/+/955718 | 18:09 |
clarkb | corvus: ^ I realized that a similar update to haproxy may be useful for zuul too | 18:09 |
corvus | good idea | 18:11 |
corvus | though... one sec | 18:11 |
corvus | clarkb: 2 things: 1) we don't actually start listening until the component is fully ready; 2) /health/ready is served on a different port | 18:14 |
corvus | so i think we can leave it as it was | 18:15 |
clarkb | ah ok. fwiw I did curl --head https://zuul01.opendev.org/health/ready and got a 200 but maybe that is some other endpoint being confusing | 18:15 |
clarkb | I can see in the response headers that cherrypy is what responded. | 18:16 |
corvus | yeah that was probably the universal redirect | 18:16 |
clarkb | `curl -X OPTIONS https://zuul01.opendev.org/` returns a lot more data than is necessary. I guess a small optimization would be to HEAD / instead of OPTIONS / | 18:17 |
clarkb | but we can leave it as is if you think getting the js back is a better canary | 18:17 |
corvus | i think head should be fine; my expectation is that we don't start listening until we're ready | 18:19 |
corvus | (also, maybe we should see what options is returning and whether we need to fix something in zuul-web) | 18:19 |
clarkb | corvus: it seems to return the web site html/js for me | 18:19 |
clarkb | which should be cached and isn't super huge in its initial response so I don't think it's a big cost to use it as a check. But yes maybe options should return something different to be more correct | 18:20 |
clarkb | I've abandoned my change with some notes | 18:20 |
clarkb | infra-root if anyone else is willing to review https://review.opendev.org/c/opendev/system-config/+/955709 I think that would be a good one to get in today in the hopes it makes image builds more reliable | 20:01 |
fungi | approved it just now | 20:05 |
clarkb | thanks. I can keep an eye on it | 20:05 |
fungi | will we need to manually restart haproxy for that? | 20:05 |
clarkb | fungi: I don't believe so. But I'll double check (iirc ansible is set up to gracefully restart haproxy on config updates) | 20:06 |
fungi | ah, cool | 20:06 |
clarkb | fungi: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/haproxy/tasks/main.yaml#L60-L67 writing out the config file notifies the reload haproxy handler which does: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/haproxy/handlers/hup_haproxy.yaml and that should gracefully reload things | 20:07 |
fungi | perfect | 20:12 |
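(Approximately what that reload handler amounts to in shell terms; the real implementation is the linked hup_haproxy.yaml, and the container name is looked up here rather than assumed.)

    # Send SIGHUP to the running haproxy container for a graceful config reload
    container=$(sudo docker ps --format '{{.Names}}' | grep haproxy | head -n1)
    sudo docker kill --signal=HUP "$container"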
clarkb | the dib change to add a delay between dib repo cache retries is looking like it will go green after a recheck | 20:48 |
clarkb | the pre recheck failure was an odd one (the ssh client couldn't negotiate kex exchange stuff when setting up the ssh connection) | 20:49 |
clarkb | but none of the other builds failed that way so I figured maybe something in centos 9 stream sshd packaging? seems to have passed on a second run though | 20:49 |
fungi | the nodepool removal change failed infra-prod-service-grafana in deploy, has anyone looked into that yet? | 21:05 |
fungi | never mind, docker rate limit error | 21:07 |
fungi | and the same jobs has succeeded subsequently | 21:07 |
fungi | er, job | 21:07 |
clarkb | yup | 21:09 |
clarkb | the zookeeper job failed too? Or maybe it was just zookeeper I checked earlier today and I would have to go look again | 21:09 |
clarkb | fungi: https://review.opendev.org/c/openstack/diskimage-builder/+/955712 is the other change aimed at improving daily image builds and it now passes check testing | 21:10 |
fungi | i approved the iptables block change | 21:10 |
clarkb | ok I'm fairly confident in that change due to our testing but we'll probably want to double check anyway after it deploys | 21:11 |
clarkb | you know I wonder if the -base job should trigger off of edits to playbooks/roles/iptables? | 21:12 |
clarkb | ya iptables is run by the base.yaml playbook. I think that is a good update | 21:14 |
clarkb | fungi: do you think ^ is worth unapproving the iptables block change in order to combine them? | 21:14 |
clarkb | otherwise we either have to land a second change or wait for daily runs | 21:14 |
fungi | i can unapprove, sure | 21:15 |
fungi | done | 21:16 |
fungi | i agree it's preferable to have that added one way or the other, in order to speed up deployment of changes to those rules in the future | 21:17 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs https://review.opendev.org/c/opendev/system-config/+/955414 | 21:17 |
clarkb | that is the updated change. system-config-run-base is already triggering on all playbooks/ updates | 21:18 |
opendevreview | Merged opendev/system-config master: Have haproxy check gitea's health status endpoint https://review.opendev.org/c/opendev/system-config/+/955709 | 21:20 |
clarkb | arg ^ that didn't trigger jobs either | 21:21 |
opendevreview | Clark Boylan proposed opendev/system-config master: Trigger load balancer deployment jobs when their roles update https://review.opendev.org/c/opendev/system-config/+/955734 | 21:25 |
clarkb | fungi: ^ that change aims to fix the lack of gitea-lb updates from 955709 landing | 21:25 |
fungi | reapproved 955414 | 21:29 |
fungi | also quick-approved 955734 so we can see the haproxy changes take effect sooner | 21:31 |
clarkb | thanks | 21:32 |
fungi | python 3.14.0rc1 is out | 21:44 |
clarkb | fungi: I expect that 955414 will remove your manual rule on static. Do you think we should edit hostvars/group vars on bridge to add that IP address to the new block var list for static? | 21:54 |
fungi | that's a good idea for a test | 21:54 |
fungi | should we do it before the triggered deploy, or wait and see if the daily run picks it up? | 21:55 |
clarkb | I'm thinking before the triggered deploy would ensure that we don't have a gap where that host could become problematic again | 21:55 |
fungi | fair enough, it might save us some cleanup work in that regard | 21:56 |
fungi | i'll get that added and let you double-check i stuck it in the right spot | 21:56 |
clarkb | sounds good | 21:57 |
fungi | should i do it in a service-specific group or does this only work if it's added to the global one? | 21:58 |
clarkb | fungi: it should work in any group. I would make it service specific particularly while we're deploying the new functionality | 21:58 |
clarkb | this way we don't accidentally over-block globally | 21:59 |
fungi | yeah, take a look at the last commit on bridge and see if that's what you expect then | 22:01 |
clarkb | fungi: almost. host_vars/static.opendev.org.yaml only applies to the host known as static.opendev.org which doesn't exist anymore. Looks like we're using static02.opendev.org | 22:03 |
fungi | oh | 22:04 |
clarkb | fungi: there is also a static group defined. So you can either add host_vars/static02.opendev.org or group_vars/static.yaml | 22:04 |
clarkb | group vars is probably preferred | 22:04 |
fungi | yeah i checked group vars first and we didn't seem to have a file for static at all there | 22:05 |
fungi | will it just be picked up automatically if i add one? | 22:05 |
clarkb | yes it should be | 22:05 |
clarkb | that service only uses public group vars so far. The two should be mixed together. They are disjoint values so we don't have to worry about precedence | 22:06 |
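(A hypothetical sketch of what the bridge-side edit could look like; the group_vars path, variable name, and address here are all placeholders, since the real variable is defined by 955414 and the real IP isn't repeated in this log.)

    # Placeholder path/variable/address; adjust to match what 955414 actually defines
    sudo tee -a /etc/ansible/hosts/group_vars/static.yaml <<'EOF'
    iptables_blocked_hosts:
      - 203.0.113.1
    EOF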
fungi | see if the amended master branch commit looks right now | 22:07 |
clarkb | yes that looks correct | 22:08 |
fungi | great | 22:08 |
clarkb | fungi: fwiw I don't see that IP in the current ruleset on static02 | 22:10 |
clarkb | oh I see it. Its like the first rule of everything | 22:11 |
fungi | yeah | 22:11 |
fungi | i inserted it into the input chain without specifying a rule number to insert at/after, so it goes in at the front | 22:12 |
clarkb | the iptables update which should noop on every host but static02 is about to merge | 22:13 |
opendevreview | Merged opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs https://review.opendev.org/c/opendev/system-config/+/955414 | 22:14 |
clarkb | hrm why are jobs running at the same time as -base? | 22:18 |
clarkb | I guess those builds don't actually depend on base. Looking at system-config/zuul.d/project.yaml this appears to be how we've configured it | 22:20 |
clarkb | I'm going to not worry about that now, but we might consider changing that? | 22:20 |
clarkb | I see the files on disk have updated but we haven't restarted netfilter-persistent yet. I suspect because that is a handler so runs at the very end of the playbook | 22:22 |
clarkb | oh no actually it did reload the rules on static02. My check of systemctl status netfilter-persistent is just not giving me the timestamp of the last reload | 22:23 |
clarkb | this is looking good but I want to make sure that the ruleset was restarted on hosts where we expect it to noop too | 22:24 |
clarkb | we don't use systemd to load the rules. We use the netfilter-persistent script directly which does an iptables-restore | 22:28 |
clarkb | unfortunately this means I can't confirm the timestamps but I think it is working as expected | 22:28 |
fungi | yeah, the old INPUT chain entry is cleared now, and there's a new entry for it in the openstack-INPUT chain instead | 22:29 |
fungi | i know this is the intended design, but we need to remember that if we've got someone hammering the ssh port on one of our servers we can't block them with this rule | 22:29 |
clarkb | ya it definitely did what I expect on static02. I was mostly trying to confirm that if some other server accidentally reboots in the future it won't load up a broken ruleset | 22:29 |
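(A quick hedged way to spot-check what is described above, both on static02 and on a host where the change should have been a noop.)

    # The manually-added INPUT entry should be gone...
    sudo iptables -S INPUT | grep DROP
    # ...and the managed openstack-INPUT chain should now carry the block
    sudo iptables -S openstack-INPUT | grep DROP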
clarkb | the zookeeper job failed pulling docker images | 22:30 |
clarkb | Error: /Stage[main]/Jeepyb/Vcsrepo[/opt/jeepyb]: Could not evaluate: Execution of '/usr/bin/git fetch origin' returned 128: fatal: unable to access 'https://opendev.org/openstack-infra/jeepyb/': Failed to connect to opendev.org port 443: No route to host | 22:31 |
clarkb | this is why the puppet job failed | 22:31 |
clarkb | opendev.org is reachable from here. But now I'm wondering if the problem is maybe ipv6 connectivity on the opendev.org side of things impacting the image builds | 22:32 |
clarkb | gitea did reload its haproxy config but that is supposed to be graceful so I wouldn't expect it to cause no route to host problems | 22:33 |
clarkb | looking at the base.yaml.log on bridge the handler did seem to run netfilter-persistent start across the board so I'm a lot less worried about not having applied the noop case | 22:34 |
clarkb | fungi: from my personal ovh server with ipv6 addressing ping6 fails to opendev.org | 22:38 |
clarkb | it also is unreachable from mirror.dfw.rax.opendev.org | 22:39 |
fungi | lovely | 22:39 |
fungi | lemme try from here | 22:39 |
clarkb | oho ip addr on gitea-lb02 only shows a link local address | 22:39 |
clarkb | so I think this is the problem. It's not client side, it's on the load balancer | 22:39 |
clarkb | dmesg doesn't report why the ip address goes away | 22:40 |
fungi | yeah, 100% packet loss from my house too | 22:42 |
fungi | i concur, it hasn't picked up any v6 address | 22:43 |
clarkb | except if you look in the haproxy logs we know it has working ipv6 sometimes (because there are ipv6 connections) | 22:43 |
clarkb | so something is causing the ipv6 address to get unconfigured. And now I'm remembering that we had specific netplan rules for the old gerrit server to statically configure ipv6 due to problems. I wonder if that is an issue here in the other region now | 22:44 |
fungi | it has stale neighbor table entries for the ll addresses of the two gateways that gitea09 is using for its default routes | 22:45 |
fungi | the gerrit server, once upon a time, was picking up additional default routes from what looked like stray advertisements (maybe from our own projects' test jobs) | 22:46 |
fungi | in this case it's not extras i see, but none at all | 22:47 |
fungi | if i ping6 those gateways they show up reachable in the neighbor table | 22:48 |
clarkb | it appears that the server was configured with cloud-init which then configured netplan which then configures systemd-networkd | 22:48 |
clarkb | fungi: the host has a valid ipv6 address again | 22:48 |
clarkb | not sure if that is a side effect of your pinging or just regular return to service that makes this work some of the time | 22:49 |
fungi | so it's expiring and not getting reestablished sometimes i guess? | 22:49 |
fungi | very well may have been due to me pinging one of the gateways | 22:49 |
fungi | maybe there's something weird in the switching somewhere, and having a frame traverse the right device in the other direction reestablished a working flow or blew out a stale one | 22:50 |
fungi | though route announcements should be broadcast type, so that wouldn't make sense | 22:50 |
clarkb | fungi: `sudo networkctl status ens3` indicates that DHCPv6 leases have been lost in the past | 22:51 |
clarkb | I suspect that since our lease is gone we're relying on stray RAs? | 22:52 |
clarkb | though looking at those logs it seems to say the lease is lost when configuring the interface (on boot?) | 22:53 |
fungi | oh, could be this is not relying on slaac | 22:53 |
fungi | i was thinking slaac not dhcp6 | 22:53 |
clarkb | it goes from dhcpv6 lease lost to ens3 up | 22:53 |
clarkb | fungi: ya the netplan config seems to say use dhcp for both ipv4 and ipv6 and that should be coming from cloud metadata service configuration | 22:53 |
fungi | though there are hybrid configurations where you can do address selection via slaac and rely on dhcp6 to handle things like dns server info | 22:54 |
clarkb | but then there is no dhcpv6 got a lease message afterwards like there is for dhcpv4 | 22:54 |
fungi | right it might not be relying on dhcp6 for addressing | 22:55 |
clarkb | how long is an RA valid for? can they expire causing us to unconfigure the address? | 22:55 |
fungi | the route expirations will be reported by ip -6 ro sh | 22:55 |
fungi | default proto ra metric 100 expires 1789sec pref medium | 22:56 |
fungi | these look pretty short, fwiw | 22:56 |
fungi | i guess not, my home v6 routing is similarly short | 22:57 |
clarkb | that's just under half an hour. I guess we check again at 23:30 UTC and see if the address is gone? | 22:57 |
clarkb | in the meantime should we drop the AAAA record for opendev.org? | 22:57 |
fungi | might not be the worst idea, as a temporary measure while we're trying to sort this out | 22:57 |
fungi | hopefully we don't have any v6-only clients | 22:58 |
clarkb | I don't think we do anymore. But we should check our base jobs which do routing checks and all that and see if they try to v6 opendev.org | 22:58 |
fungi | well, i didn't mean clients we're running, i meant users at large | 23:00 |
clarkb | oh sure. I guess we'd be changing them from failures X% of the time to 100% of the time | 23:00 |
clarkb | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/validate-host/library/zuul_debug_info.py#L92-L106 we do validate but it is configurable | 23:05 |
clarkb | https://opendev.org/openstack/project-config/src/branch/master/zuul/site-variables.yaml#L14-L15 it is ok if ipv6 fails I think based on this | 23:06 |
fungi | i half wonder if the same thing could be happening to the gitea backends too, and we just don't know because we don't connect directly to them | 23:06 |
fungi | the route expiration on gitea-lb02 has jumped back up again | 23:08 |
fungi | so seems like it normally gets refreshed every time it sees a new announcement or something | 23:08 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Drop the opendev.org AAAA record https://review.opendev.org/c/opendev/zone-opendev.org/+/955737 | 23:08 |
clarkb | Pushed that so we have the option | 23:09 |
fungi | lowest i caught it at was 1201sec | 23:09 |
fungi | but now it's 1603sec | 23:09 |
fungi | so it's probably something like 15 minutes and gets reset every 5 | 23:09 |
fungi | so under normal circumstances shouldn't fall below 10 minutes | 23:10 |
clarkb | the old value was almost 1800 which is ~30 minutes | 23:10 |
fungi | er, yes my math is off | 23:11 |
fungi | i meant 30 minutes expiration reset every 10 | 23:11 |
fungi | so shouldn't fall below 20 | 23:11 |
fungi | because the highest i've seen is just shy of 30 minutes and lowest i've seen is barely above 20 minutes | 23:12 |
fungi | we'll know in about another 3 minutes | 23:12 |
clarkb | all of the backends have ipv6 addrs right now too. Presumably they are on the same network and see the same RAs? | 23:12 |
fungi | all the ones i checked matched for gateway addresses at least, yeah | 23:13 |
clarkb | so maybe check them next time we see the load balancer lose its address to see if it has happened across the board | 23:13 |
fungi | right | 23:13 |
fungi | about a minute to go before the default routes are refreshed on gitea-lb02, if my observation holds | 23:15 |
fungi | bingo | 23:16 |
fungi | i saw it jump from 1200 to 1799 | 23:16 |
clarkb | I wish I could figure out where this is logged but I don't think anything logs it | 23:16 |
fungi | no, there's probably a sysctl option to turn on verbose logging in that kernel subsystem | 23:17 |
fungi | which would then go to dmesg and probably syslog | 23:17 |
clarkb | I'm not finding one in https://docs.kernel.org/6.8/networking/ip-sysctl.html but the formatting of that makes it a bit tough to grep through | 23:19 |
fungi | i've +2'd but not approved 955737 for now | 23:19 |
clarkb | you can log martians | 23:19 |
fungi | but not venutians or mercurians, sadly | 23:19 |
clarkb | maybe we just want a naive script that captures ip addr output and ip -6 ro sh output every 5 minutes | 23:20 |
fungi | though it's really those pesky saturnians you need to look out for | 23:20 |
fungi | yeah, script or even just a temporary cronjob that e-mails when there are no default v6 routes | 23:21 |
clarkb | while true ; do ip -6 addr show dev ens3 && ip -6 route show default ; done > /var/log/ipv6_networking.log 2>&1 and put that in screen? | 23:22 |
clarkb | I meant to add a sleep 300 to that too | 23:22 |
fungi | could even just test whether the output of `ip -6 ro sh default` is empty | 23:22 |
fungi | yeah, screen session would be fine | 23:23 |
fungi | slap a date command or something in there too so we have timestamping | 23:23 |
fungi | anyway, i need to knock off for the evening, told christine i was coming upstairs something like 30 minutes ago so she's probably wondering where i am (scratch that, i'm sure she knows where i am...) | 23:24 |
clarkb | I added a date and a sleep 300 and stuck that in a screen writing to my homedir (so not /var/log) | 23:25 |
fungi | sgtm, thanks! | 23:26 |
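(The cleaned-up watcher, roughly as described above: timestamped, with the missing sleep added, and logging to the home directory instead of /var/log.)

    while true; do
        date
        ip -6 addr show dev ens3
        ip -6 route show default
        sleep 300
    done >> ~/ipv6_networking.log 2>&1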
clarkb | quay.io is in emergency maintenance read-only mode | 23:29 |
clarkb | their status page indicates this affects pushes but I'm seeing jobs fail fetching with 502 bad gateway too | 23:29 |
clarkb | just a heads up. Not much we can do about it from here other than be aware | 23:30 |
clarkb | fungi: I know you popped out but almost immediately after you did, the ipv6 address went away. The ip -6 route show output still shows a valid default route though | 23:31 |
clarkb | curiously this occurred around my original guestimate for 23:30 UTC but that may be coincidence | 23:32 |
clarkb | I'm going to leave it alone and see if refreshing the routes refreshes the ip | 23:33 |
clarkb | none of the backends are currently affected. They all have valid global ipv6 addresses | 23:36 |
clarkb | after the routes refreshed the ip address came back | 23:36 |
clarkb | and its gone again | 23:40 |
clarkb | I'm beginning to suspect our accept_dad = 1 sysctl option for ens3 means that we're removing the address if a duplicate is found | 23:47 |
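(A hedged way to chase that suspicion the next time the address drops: check the sysctl and look for DAD-related flags on the interface.)

    sysctl net.ipv6.conf.ens3.accept_dad
    # "dadfailed" on the address would confirm duplicate address detection kicked in
    ip -6 addr show dev ens3 | grep -E 'tentative|dadfailed'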