opendevreview | Merged opendev/system-config master: Remove nodepool configuration https://review.opendev.org/c/opendev/system-config/+/955229 | 00:24 |
*** bbezak is now known as Guest22493 | 01:19 | |
*** ykarel_ is now known as ykarel | 07:52 | |
*** Guest22493 is now known as bbezak | 08:01 | |
frickler | infra-root: another buildset where all jobs finished >1h ago, but zuul shows it still being in progress https://zuul.opendev.org/t/openstack/buildset/0da92bb6a2ec4eba895ac9c564db45a0 | 13:06 |
frickler | saw something similar yesterday, maybe a regression in event handling after all? | 13:07 |
fungi | yeah, my guess would be the results queue processing is on hold for something else (reconfiguration?) | 13:08 |
*** darmach1 is now known as darmach | 13:15 | |
frickler | seems it finished and merged just after I mentioned it. so possibly nothing critical, but worth keeping an eye on IMO | 13:34 |
frickler | also ftr I'll be offline tomorrow and friday | 13:35 |
fungi | have a good weekend! | 13:36 |
corvus | when that buildset finished, there were 7 changes ahead of it in the gate pipeline. is it implausible that it took those 7 changes an hour to finish? | 14:00 |
frickler | ah, good point, I didn't check that, since I only looked at the status with a filter on that change when bbezak mentioned it. but indeed https://review.opendev.org/955570 was ahead in gate and took until 13:09, so that explains it | 14:09 |
corvus | it reported as soon as the item ahead of it reported. | 14:09 |
corvus | and over the course of the hour, those 7 changes ahead gradually reduced | 14:09 |
corvus | so i guess it's a good time to remind folks that the gate pipeline is a queued sequence of merge operations, and none of those are final until all of the items ahead are finalized | 14:10 |
corvus | that's because at any time, one of the items ahead might end up failing, and that would invalidate the assumptions that zuul made about what changes would be merged and therefore are included in the testing of changes further behind in the queue | 14:11 |
corvus | so if tests for changes at the head of the queue are not finished, but tests for changes later in the queue are, it's normal for zuul to wait to report those until everything ahead is done | 14:12 |
frickler | yes, that's all pretty clear now that you mention it, but the filtered status view of just a single change was misleading me, maybe one could still somehow show the implicit dependencies of a change in the gate pipeline there | 14:14 |
fungi | yeah, not necessarily the gate pipeline overall, but any shared queues in the gate pipeline have that property, and if you're filtering to a subset of changes or projects for that queue then you won't have a complete picture | 14:37 |
fungi | i'm going to pop out briefly, but should be around again by 16:00 utc | 14:43 |
corvus | neither the scheduler nor web/fingergw services are running on zuul02 | 14:56 |
clarkb | huh is that fallout from the reboots I wonder | 14:57 |
corvus | july 19 17:00 is when zuul02 web stops on the graphs | 14:58 |
corvus | 17:46-17:48 to be more precise | 14:59 |
clarkb | corvus: it never rebooted | 15:00 |
clarkb | is the playbook still running and waiting for some condition to issue the reboot or maybe the playbook crashed leaving it this way? | 15:00 |
corvus | the playbook finished | 15:00 |
clarkb | FAILED - RETRYING: [zuul02.opendev.org]: Upgrade scheduler server packages | 15:01 |
clarkb | from /var/log/ansible/zuul_reboot.log.1 | 15:01 |
clarkb | Failed to lock apt for exclusive operation: Failed to lock directory /var/lib/apt/lists/ | 15:01 |
corvus | root 3575 0.0 0.0 2800 1920 ? Ss Jun28 0:00 /bin/sh /usr/lib/apt/apt.systemd.daily update | 15:02 |
clarkb | I think this is the second time we've hit this with the new noble nodes? Specifically the auto updates getting stuck | 15:03 |
corvus | third, but the second we think may have been the same event on a different host | 15:03 |
clarkb | oh wait but there was the mirror issue with upstream | 15:03 |
clarkb | ya I think we decided the mirror issue was a likely cause | 15:03 |
corvus | yeah, and i can't remember what that date was | 15:03 |
corvus | maybe this is old enough that it's still just the same event on a third host? | 15:03 |
clarkb | ya that is my suspicion | 15:03 |
corvus | i killed the http process; it's proceeding | 15:04 |
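(A hedged sketch of confirming what is holding the apt lock before killing it, roughly the situation above; not necessarily the exact commands that were run here.)

    # Find long-running apt maintenance processes like the one shown above
    ps -eo pid,etime,cmd | grep -E 'apt\.systemd\.daily|unattended-upgrade' | grep -v grep
    # Confirm which process holds the apt lists lock (lsof may need installing)
    sudo lsof /var/lib/apt/lists/lock
    # Then kill the stuck PID and re-run the upgrade/reboot playbook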
corvus | okay, it ran unattended upgrades again and is up to date | 15:11 |
corvus | i will reboot then start | 15:11 |
corvus | started. once they start, i'm going to restart zuul01 so the versions are in sync | 15:16 |
corvus | (i don't think there's a difference, but just in case) | 15:18 |
clarkb | ack | 15:18 |
clarkb | huh wednesdays have a very even distribution of meetings through the day particularly in even weeks (which I think this week is) | 15:22 |
clarkb | corvus: the nodepool cleanup change (955229) did merge and mostly deployed successfully. The zookeeper deployment failed due to hitting docker hub rate limits | 15:38 |
clarkb | the daily run of zookeeper ended up succeeding later it looks like | 15:38 |
corvus | well how about that :) | 15:39 |
corvus | i'll execute the delete commands today | 15:39 |
clarkb | don't forget the builders have volumes too (volume deletion typically doesn't cascade when the server is deleted) | 15:40 |
clarkb | infra-root I wrote https://review.opendev.org/c/opendev/system-config/+/955414 because I want to block the IBM server that cannot negotiate ssh with gerrit anymore on review03 but fungi had to manually apply a similar type of rule to static recently | 15:41 |
clarkb | do we think that change is A) safe and B) appropriate for our current ruleset? | 15:41 |
clarkb | but we could manage more permanent blocklists easily if we land something like that | 15:41 |
clarkb | I wonder if we capture those rules files in the existing test jobs. Maybe doing that is a good idea | 15:42 |
clarkb | https://zuul.opendev.org/t/openstack/build/7e34ad567e7642539e25a0468a58cf98/log/noble/rules.v4.txt#14-15 we do | 15:43 |
clarkb | fungi: is there some IP address that we could stick into group_vars for ci nodes to block connectivity to/from maybe without breaking anything? | 15:43 |
corvus | clarkb: know if there's a testinfra test that connects to port 22 of hosts? if so, you could add a block rule for, say, an rfc1918 address just to have the rule in there, and then verify ssh still works | 15:44 |
clarkb | corvus: I think testinfra implicitly connects via port 22 using ansible | 15:45 |
clarkb | 203.0.113.0/24 looks promising too looking at a wikipedia table of ip ranges | 15:45 |
corvus | clarkb: yeah, that's one of the ones in https://datatracker.ietf.org/doc/rfc5737/ | 15:46 |
corvus | that's a good choice i think | 15:46 |
clarkb | ya for use in documentation implies that things on the internet shouldn't use them in a valid way | 15:47 |
corvus | https://zuul.opendev.org/t/opendev/buildset/ea307964afc4460a8d16992f42db0a59 is the nightly image build | 15:47 |
corvus | it did not go well | 15:47 |
clarkb | let me see if there is something similar for ipv6 and I can update the change to block both sets and ensure that my change doesn't break anything in unexpected ways | 15:48 |
clarkb | 3fff::/20 appears to be an ipv6 equivalent | 15:48 |
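(A sketch of the kind of drop rules being exercised here, using only the RFC 5737 / RFC 9637 documentation ranges mentioned above; the openstack-INPUT chain name matches what shows up later on static02, but the real change renders these through the persistent rules.v4/rules.v6 files rather than ad-hoc commands.)

    # IPv4 documentation range (TEST-NET-3)
    sudo iptables -I openstack-INPUT -s 203.0.113.0/24 -j DROP
    # IPv6 documentation range
    sudo ip6tables -I openstack-INPUT -s 3fff::/20 -j DROP
    # Confirm the rules are present
    sudo iptables -S openstack-INPUT | grep 203.0.113.0
    sudo ip6tables -S openstack-INPUT | grep 3fff: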
corvus | generally speaking, it looks like the builds encountered errors fetching from opendev.org | 15:49 |
corvus | connection timeouts and 503 errors | 15:49 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs https://review.opendev.org/c/opendev/system-config/+/955414 | 15:52 |
clarkb | corvus: the failures here https://zuul.opendev.org/t/opendev/build/ea58ddf19b3e4ba19945054238c4dbd4/log/job-output.txt#5257-5268 happen before the service-gitea.yaml deployment at 2025-07-23T02:40:46 | 15:55 |
clarkb | (just ruling out potential problems with gitea being deployed at the same time as image builds) | 15:55 |
corvus | yeah i was wondering about that. | 15:56 |
corvus | there are some errors at 02:42:.. | 15:56 |
clarkb | corvus: looking at the haproxy log on gitea-lb02 the last connection from that image build ip address was Jul 23 02:32:25 and it closed cleanly | 15:57 |
clarkb | oh but I should check ipv6 too maybe | 15:57 |
fungi | sorry, back now | 15:58 |
clarkb | corvus: the inventory reports there is no ipv6 address for that host. But ansible fact gathering does report an ipv6 address | 15:58 |
fungi | corvus: clarkb: the event in question was around the end of last month, so if this was hung from like june 28-30 timeframe that would line up i think | 15:58 |
clarkb | corvus: the ipv6 address last spoke to gitea-lb02 at Jul 23 02:27:59 and that failed with cD | 15:59 |
clarkb | this feels similar to earlier problems where we flipped back and forth between ipv4 and ipv6 but now instead of the retry immediately working it's consistently broken | 15:59 |
fungi | on the client address blocking topic, the challenge with doing an opendev-wide private group_vars is that it would only take effect the next time we ran deployment jobs for those services | 15:59 |
clarkb | fungi: yes I think you still manually apply them, but then you can add the rule to the permanent list so that a reboot or restart of iptables services doesn't clear the rule out | 16:00 |
fungi | oh, you're looking for an example network to add, yeah the rfc 5737 network is a good choice | 16:00 |
clarkb | corvus: so I think we're back to why we are flip flopping between ipv4 and ipv6 for connectivity as a thread to pull on. I suspect but can't say for sure that ipv6 works until 02:27:59 then fails with cD, we switch to ipv4 and things work until we switch back to ipv6 for some reason, but it's still broken for the same reason that caused the cD, and now we never fall back and just hard fail | 16:02 |
clarkb | fungi: which I think gets us back to "is there any log file that exists that records network stack decision making like this" | 16:03 |
corvus | should we down the ipv6 interfaces on image builds? | 16:03 |
corvus | (not a very satisfying solution) | 16:04 |
clarkb | corvus: maybe we try that if only to confirm it is more reliable? | 16:04 |
fungi | i don't think per-process routing decisions get logged unless the process decides to log them (which may imply deeper integration into the tcp/ip stack than most software cares to implement) | 16:04 |
clarkb | if we continue to have problems afterwards then we'd know that this is a likely incorrect thread to follow | 16:04 |
clarkb | fungi: ya we'd have to ebpf or strace maybe | 16:05 |
clarkb | however I suspect that git is just using what is available on the system | 16:05 |
clarkb | and the system is deciding ipv6 exists or not and sometimes when it thinks it exists it doesn't actually function | 16:05 |
clarkb | corvus: the 503 errors I think are related to gitea restarts | 16:09 |
clarkb | corvus: on gitea11 the gitea-web process started at 02:45. According to the apache log there between 23/Jul/2025:02:45:37 and 23/Jul/2025:02:45:49 the requests 503 | 16:10 |
corvus | i wonder two things: 1) if we could improve the haproxy health check; and 2) whether we should add more of a delay in the dib element that's retrying the git operations | 16:11 |
fungi | problematic v6/v4 fallbacks have been fairly common among projects implementing "happy eyeballs" protocols | 16:11 |
clarkb | Jul 23 02:45:40 gitea11 docker-gitea[747]: 2025/07/23 02:45:40 cmd/web.go:233:serveInstalled() [I] PID: 1 Gitea Web Finished | 16:11 |
clarkb | Jul 23 02:45:43 gitea11 docker-gitea[747]: 2025/07/23 02:45:43 cmd/web.go:261:runWeb() [I] Starting Gitea on PID: 1 | 16:12 |
clarkb | Jul 23 02:45:48 gitea11 docker-gitea[747]: 2025/07/23 02:45:48 cmd/web.go:323:listen() [I] Listen: https://0.0.0.0:3000 | 16:12 |
clarkb | corvus: yup I suspect based on ^ and apache returning 503s that because we're talking to apache the health check checks out even though the service behind it is failing | 16:13 |
clarkb | so 1) seems like something we should dig into | 16:13 |
clarkb | https://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/gitea-lb.yaml#L30 | 16:14 |
clarkb | I think that's an L4 check? | 16:15 |
clarkb | I wonder if we can only do http level checks if we balance http | 16:15 |
clarkb | https://www.haproxy.com/documentation/haproxy-configuration-tutorials/reliability/health-checks/#http-health-checks doesn't seem to indicate we need to balance http to do the http check | 16:16 |
clarkb | https://gitea09.opendev.org:3081/api/healthz | 16:21 |
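(Roughly what the proposed haproxy check boils down to, expressed as a one-off curl; the actual change uses haproxy's HTTP health checking, this just shows the endpoint being probed and the expected 200.)

    # Expect "200" if gitea itself is serving, not just apache answering
    curl -s -o /dev/null -w '%{http_code}\n' https://gitea09.opendev.org:3081/api/healthz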
stephenfin | interestingly, attempting to paste text including 🤦 to paste.o.o causes a HTTP 502 | 16:25 |
stephenfin | or many other emojis | 16:26 |
clarkb | stephenfin: I think the database there is still 3-byte utf8 only | 16:26 |
clarkb | because mysql defaults from the long long ago | 16:26 |
stephenfin | TIL | 16:27 |
fungi | though we could probably rectify that with a migration | 16:29 |
fungi | after setting some config options | 16:29 |
fungi | i don't know if lodgeit's underlying db interface code needs any adjustments for that | 16:30 |
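(A hypothetical sketch of the migration fungi is describing; the database and table names are placeholders and haven't been checked against the paste server, and lodgeit's own code may need adjusting as noted.)

    # Assumed names: database "lodgeit", table "pastes"; assumes client auth is configured
    mysql -e "ALTER DATABASE lodgeit CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"
    mysql -e "ALTER TABLE lodgeit.pastes CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"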
opendevreview | Clark Boylan proposed opendev/system-config master: Have haproxy check gitea's health status endpoint https://review.opendev.org/c/opendev/system-config/+/955709 | 16:31 |
clarkb | corvus: ^ the gitea 503 problem may be as simple as that? | 16:31 |
clarkb | corvus: the default haproxy healthcheck interval is once every 2 seconds and this build failed within a couple of seconds: https://zuul.opendev.org/t/opendev/build/17ddb6e1797d45d596fb7c6cef4d6199/log/job-output.txt#4165-4175 I think maybe add a delay of at least one second between requests? | 16:32 |
corvus | these shouldn't happen, i'd do at least 5 seconds | 16:33 |
clarkb | corvus: ++ | 16:33 |
corvus | healthz is a good find | 16:34 |
clarkb | fungi: ya I'm not sure what sort of db management is in lodgeit itself | 16:34 |
clarkb | note I tried to HEAD /api/healthz and that 404s. Seems you have to GET it which is fine | 16:34 |
corvus | yeah, presumably internal processing of the healthz GET is minimal, which is the main reason to HEAD any other endpoint | 16:37 |
opendevreview | Clark Boylan proposed openstack/diskimage-builder master: Add a 5 second delay between cache update retries https://review.opendev.org/c/openstack/diskimage-builder/+/955712 | 16:41 |
clarkb | corvus: ^ and that should add the short delay between retries | 16:41 |
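(The shape of the change, not the actual dib element code: retry the cache update a few times with a pause between attempts so a briefly-restarting backend isn't fatal.)

    attempts=0
    until git fetch origin; do
        attempts=$((attempts + 1))
        if [ "$attempts" -ge 3 ]; then
            echo "git fetch failed after $attempts attempts" >&2
            exit 1
        fi
        sleep 5  # the delay 955712 adds between retries
    done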
opendevreview | Clark Boylan proposed opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs https://review.opendev.org/c/opendev/system-config/+/955414 | 16:49 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs https://review.opendev.org/c/opendev/system-config/+/955414 | 17:02 |
clarkb | ok I think ^ is in a good state now | 17:03 |
clarkb | I'm glad I added the ranges to block to exercise it as it caught a bug | 17:03 |
clarkb | https://zuul.opendev.org/t/openstack/build/9444be613c964224ad4491241f955cd8/log/gitea-lb02.opendev.org/haproxy.log#11-12 this shows the new check seems to work. Note the http backend returns a 307 which is considered valid. I think for http that is probably good enough since it redirects back to https in apache itself iirc | 17:37 |
opendevreview | Clark Boylan proposed opendev/system-config master: Have haproxy check zuul's health/ready page https://review.opendev.org/c/opendev/system-config/+/955718 | 18:09 |
clarkb | corvus: ^ I realized that a similar update to haproxy may be useful for zuul too | 18:09 |
corvus | good idea | 18:11 |
corvus | though... one sec | 18:11 |
corvus | clarkb: 2 things: 1) we don't actually start listening until the component is fully ready; 2) /health/ready is served on a different port | 18:14 |
corvus | so i think we can leave it as it was | 18:15 |
clarkb | ah ok. fwiw I did curl --head https://zuul01.opendev.org/health/ready and got a 200 but maybe that is some other endpoint being confusing | 18:15 |
clarkb | I can see in the response headers that cherrypy is what responded. | 18:16 |
corvus | yeah that was probably the universal redirect | 18:16 |
clarkb | `curl -X OPTIONS https://zuul01.opendev.org/` returns a lot more data than is necessary. I guess a small optimization would be to HEAD / instead of OPTIONS / | 18:17 |
clarkb | but we can leave it as is if you think getting the js back is a better canary | 18:17 |
corvus | i think head should be fine; my expectation is that we don't start listening until we're ready | 18:19 |
corvus | (also, maybe we should see what options is returning and whether we need to fix something in zuul-web) | 18:19 |
clarkb | corvus: it seems to return the web site html/js for me | 18:19 |
clarkb | which should be cached and isn't super huge in its initial response so I don't think it's a big cost to use it as a check. But yes maybe options should return something different to be more correct | 18:20 |
clarkb | I've abandoned my change with some notes | 18:20 |
clarkb | infra-root if anyone else is willing to review https://review.opendev.org/c/opendev/system-config/+/955709 I think that would be a good one to get in today in the hopes it makes image builds more reliable | 20:01 |
fungi | approved it just now | 20:05 |
clarkb | thanks. I can keep an eye on it | 20:05 |
fungi | will we need to manually restart haproxy for that? | 20:05 |
clarkb | fungi: I don't believe so. But I'll double check (iirc ansible is set up to gracefully restart haproxy on config updates) | 20:06 |
fungi | ah, cool | 20:06 |
clarkb | fungi: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/haproxy/tasks/main.yaml#L60-L67 writing out the config file notifies the reload haproxy handler which does: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/haproxy/handlers/hup_haproxy.yaml and that should gracefully reload things | 20:07 |
fungi | perfect | 20:12 |
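(Approximately what that reload handler amounts to in shell terms; the real implementation is the linked hup_haproxy.yaml, and the container name is looked up here rather than assumed.)

    # Send SIGHUP to the running haproxy container for a graceful config reload
    container=$(sudo docker ps --format '{{.Names}}' | grep haproxy | head -n1)
    sudo docker kill --signal=HUP "$container"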
clarkb | the dib change to add a delay between dib repo cache retries is looking like it will go green after a recheck | 20:48 |
clarkb | the pre recheck failure was an odd one (the ssh client couldn't negotiate kex exchange stuff when setting up the ssh connection) | 20:49 |
clarkb | but none of the other builds failed that way so I figured maybe something in centos 9 stream sshd packaging? seems to have passed on a second run though | 20:49 |
fungi | the nodepool removal change failed infra-prod-service-grafana in deploy, has anyone looked into that yet? | 21:05 |
fungi | never mind, docker rate limit error | 21:07 |
fungi | and the same jobs has succeeded subsequently | 21:07 |
fungi | er, job | 21:07 |
clarkb | yup | 21:09 |
clarkb | the zookeeper job failed too? Or maybe it was just zookeeper I checked earlier today and I would have to go look again | 21:09 |
clarkb | fungi: https://review.opendev.org/c/openstack/diskimage-builder/+/955712 is the other change aimed at improving daily image builds and it now passes check testing | 21:10 |
fungi | i approved the iptables block change | 21:10 |
clarkb | ok I'm fairly confident in that change due to our testing but we'll probably want to double check anyway after it deploys | 21:11 |
clarkb | you know I wonder if the -base job should trigger off of edits to playbooks/roles/iptables? | 21:12 |
clarkb | ya iptables is run by the base.yaml playbook. I think that is a good update | 21:14 |
clarkb | fungi: do you think ^ is worth unapproving the iptables block change in order to combine them? | 21:14 |
clarkb | otherwise we either have to land a second change or wait for daily runs | 21:14 |
fungi | i can unapprove, sure | 21:15 |
fungi | done | 21:16 |
fungi | i agree it's preferable to have that added one way or the other, in order to speed up deployment of changes to those rules in the future | 21:17 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs https://review.opendev.org/c/opendev/system-config/+/955414 | 21:17 |
clarkb | that is the updated change. system-config-run-base is already triggering on all playbooks/ updates | 21:18 |
opendevreview | Merged opendev/system-config master: Have haproxy check gitea's health status endpoint https://review.opendev.org/c/opendev/system-config/+/955709 | 21:20 |
clarkb | arg ^ that didn't trigger jobs either | 21:21 |
opendevreview | Clark Boylan proposed opendev/system-config master: Trigger load balancer deployment jobs when their roles update https://review.opendev.org/c/opendev/system-config/+/955734 | 21:25 |
clarkb | fungi: ^ that change aims to fix the lack of gitea-lb updates from 955709 landing | 21:25 |
fungi | reapproved 955414 | 21:29 |
fungi | also quick-approved 955734 so we can see the haproxy changes take effect sooner | 21:31 |
clarkb | thanks | 21:32 |
fungi | python 3.14.0rc1 is out | 21:44 |
clarkb | fungi: I expect that 955414 will remove your manual rule on static. Do you think we should edit hostvars/group vars on bridge to add that IP address to the new block var list for static? | 21:54 |
fungi | that's a good idea for a test | 21:54 |
fungi | should we do it before the triggered deploy, or wait and see if the daily run picks it up? | 21:55 |
clarkb | I'm thinking before the triggered deploy would ensure that we don't have a gap where that host could become problematic again | 21:55 |
fungi | fair enough, it might save us some cleanup work in that regard | 21:56 |
fungi | i'll get that added and let you double-check i stuck it in the right spot | 21:56 |
clarkb | sounds good | 21:57 |
fungi | should i do it in a service-specific group or does this only work if it's added to the global one? | 21:58 |
clarkb | fungi: it should work in any group. I would make it service specific particularly while we're deploying the new functionality | 21:58 |
clarkb | this way we don't accidentally over-block globally | 21:59 |
fungi | yeah, take a look at the last commit on bridge and see if that's what you expect then | 22:01 |
clarkb | fungi: almost. host_vars/static.opendev.org.yaml only applies to the host known as static.opendev.org which doesn't exist anymore. Looks like we're using static02.opendev.org | 22:03 |
fungi | oh | 22:04 |
clarkb | fungi: there is also a static group defined. So you can either add host_vars/static02.opendev.org or group_vars/static.yaml | 22:04 |
clarkb | group vars is probably preferred | 22:04 |
fungi | yeah i checked group vars first and we didn't seem to have a file for static at all there | 22:05 |
fungi | will it just be picked up automatically if i add one? | 22:05 |
clarkb | yes it should be | 22:05 |
clarkb | that service only uses public group vars so far. The two should be mixed together. They are disjoint values so we don't have to worry about precedence | 22:06 |
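(A hypothetical sketch of what the bridge-side edit could look like; the group_vars path, variable name, and address here are all placeholders, since the real variable is defined by 955414 and the real IP isn't repeated in this log.)

    # Placeholder path/variable/address; adjust to match what 955414 actually defines
    sudo tee -a /etc/ansible/hosts/group_vars/static.yaml <<'EOF'
    iptables_blocked_hosts:
      - 203.0.113.1
    EOF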
fungi | see if the amended master branch commit looks right now | 22:07 |
clarkb | yes that looks correct | 22:08 |
fungi | great | 22:08 |
clarkb | fungi: fwiw I don't see that IP in the current ruleset on static02 | 22:10 |
clarkb | oh I see it. Its like the first rule of everything | 22:11 |
fungi | yeah | 22:11 |
fungi | i inserted it into the input chain without specifying a rule number to insert at/after, so it goes in at the front | 22:12 |
clarkb | the iptables update which should noop on every host but static02 is about to merge | 22:13 |
opendevreview | Merged opendev/system-config master: Add iptables rule blocks to drop traffic from specific IPs https://review.opendev.org/c/opendev/system-config/+/955414 | 22:14 |
clarkb | hrm why are jobs running at the same time as -base? | 22:18 |
clarkb | I guess those builds don't actually depend on base. Looking at system-config/zuul.d/project.yaml this appears to be how we've configured it | 22:20 |
clarkb | I'm going to not worry about that now, but we might consider changing that? | 22:20 |
clarkb | I see the files on disk have updated but we haven't restarted netfilter-persistent yet. I suspect because that is a handler so runs at the very end of the playbook | 22:22 |
clarkb | oh no actually it did reload the rules on static02. My check of systemctl status netfilter-persistent is just not giving me the timestamp of the last reload | 22:23 |
clarkb | this is looking good but I want to make sure that the ruleset was restarted on hosts where we expect it to noop too | 22:24 |
clarkb | we don't use systemd to load the rules. We use the netfilter-persistent script directly which does an iptables-restore | 22:28 |
clarkb | unfortunately this means I can't confirm the timestamps but I think it is working as expected | 22:28 |
fungi | yeah, the old INPUT chain entry is cleared now, and there's a new entry for it in the openstack-INPUT chain instead | 22:29 |
fungi | i know this is the intended design, but we need to remember that if we've got someone hammering the ssh port on one of our servers we can't block them with this rule | 22:29 |
clarkb | ya it definitely did what I expect on static02. I was mostly trying to confirm that if some other server accidentally reboots in the future it won't load up a broken ruleset | 22:29 |
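(A quick hedged way to spot-check what is described above, both on static02 and on a host where the change should have been a noop.)

    # The manually-added INPUT entry should be gone...
    sudo iptables -S INPUT | grep DROP
    # ...and the managed openstack-INPUT chain should now carry the block
    sudo iptables -S openstack-INPUT | grep DROP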
clarkb | the zookeeper job failed pulling docker images | 22:30 |
clarkb | Error: /Stage[main]/Jeepyb/Vcsrepo[/opt/jeepyb]: Could not evaluate: Execution of '/usr/bin/git fetch origin' returned 128: fatal: unable to access 'https://opendev.org/openstack-infra/jeepyb/': Failed to connect to opendev.org port 443: No route to host | 22:31 |
clarkb | this is why the puppet job failed | 22:31 |
clarkb | opendev.org is reachable from here. But now I'm wondering if the problem is maybe ipv6 connectivity on the opendev.org side of things impacting the image builds | 22:32 |
clarkb | gitea did reload its haproxy config but that is supposed to be graceful so I wouldn't expect it to cause no route to host problems | 22:33 |
clarkb | looking at the base.yaml.log on bridge the handler did seem to run netfilter-persistent start across the board so I'm a lot less worried about not having applied the noop case | 22:34 |
clarkb | fungi: from my personal ovh server with ipv6 addressing ping6 fails to opendev.org | 22:38 |
clarkb | it also is unreachable from mirror.dfw.rax.opendev.org | 22:39 |
fungi | lovely | 22:39 |
fungi | lemme try from here | 22:39 |
clarkb | oho ip addr on gitea-lb02 only shows a link local address | 22:39 |
clarkb | so I think this is the problem. It's not client side, it's on the load balancer | 22:39 |
clarkb | dmesg doesn't report why the ip address goes away | 22:40 |
fungi | yeah, 100% packet loss from my house too | 22:42 |
fungi | i concur, it hasn't picked up any v6 address | 22:43 |
clarkb | except if you look in the haproxy logs we know it has working ipv6 sometimes (because there are ipv6 connections) | 22:43 |
clarkb | so something is causing the ipv6 address to get unconfigured. And now I'm remembering that we had specific netplan rules for the old gerrit server to statically configure ipv6 due to problems. I wonder if that is an issue here in the other region now | 22:44 |
fungi | it has stale neighbor table entries for the ll addresses of the two gateways that gitea09 is using for its default routes | 22:45 |
fungi | the gerrit server, once upon a time, was picking up additional default routes from what looked like stray advertisements (maybe from our own projects' test jobs) | 22:46 |
fungi | in this case it's not extras i see, but none at all | 22:47 |
fungi | if i ping6 those gateways they show up reachable in the neighbor table | 22:48 |
clarkb | it appears that the server was configured with cloud-init which then configured netplan which then configures systemd-networkd | 22:48 |
clarkb | fungi: the host has a valid ipv6 address again | 22:48 |
clarkb | not sure if that is a side effect of your pinging or just regular return to service that makes this work some of the time | 22:49 |
fungi | so it's expiring and not getting reestablished sometimes i guess? | 22:49 |
fungi | very well may have been due to me pinging one of the gateways | 22:49 |
fungi | maybe there's something weird in the switching somewhere, and having a frame traverse the right device in the other direction reestablished a working flow or blew out a stale one | 22:50 |
fungi | though route announcements should be broadcast type, so that wouldn't make sense | 22:50 |
clarkb | fungi: `sudo networkctl status ens3` indicates that DHCPv6 leases have been lost in the past | 22:51 |
clarkb | I suspect that since our lease is gone we're relying on stray RAs? | 22:52 |
clarkb | though looking at those logs it seems to say the lease is lost when configuring the interface (on boot?) | 22:53 |
fungi | oh, could be this is not relying on slaac | 22:53 |
fungi | i was thinking slaac not dhcp6 | 22:53 |
clarkb | it goes from dhcpv6 lease lost to ens3 up | 22:53 |
clarkb | fungi: ya the netplan config seems to say use dhcp for both ipv4 and ipv6 and that should be coming from cloud metadata service configuration | 22:53 |
fungi | though there are hybrid configurations where you can do address selection via slaac and rely on dhcp6 to handle things like dns server info | 22:54 |
clarkb | but then there is no dhcpv6 got a lease message afterwards like there is for dhcpv4 | 22:54 |
fungi | right it might not be relying on dhcp6 for addressing | 22:55 |
clarkb | how long is an RA valid for? can they expire causing us to unconfigure the address? | 22:55 |
fungi | the route expirations will be reported by ip -6 ro sh | 22:55 |
fungi | default proto ra metric 100 expires 1789sec pref medium | 22:56 |
fungi | these look pretty short, fwiw | 22:56 |
fungi | i guess not, my home v6 routing is similarly short | 22:57 |
clarkb | that's just under half an hour. I guess we check again at 23:30 UTC and see if the address is gone? | 22:57 |
clarkb | in the meantime should we drop the AAAA record for opendev.org? | 22:57 |
fungi | might not be the worst idea, as a temporary measure while we're trying to sort this out | 22:57 |
fungi | hopefully we don't have any v6-only clients | 22:58 |
clarkb | I don't think we do anymore. But we should check our base jobs which do routing checks and all that and see if they try to v6 opendev.org | 22:58 |
fungi | well, i didn't mean clients we're running, i meant users at large | 23:00 |
clarkb | oh sure. I guess we'd be changing them from failures X% of the time to 100% of the time | 23:00 |
clarkb | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/validate-host/library/zuul_debug_info.py#L92-L106 we do validate but it is configurable | 23:05 |
clarkb | https://opendev.org/openstack/project-config/src/branch/master/zuul/site-variables.yaml#L14-L15 it is ok if ipv6 fails I think based on this | 23:06 |
fungi | i half wonder if the same thing could be happening to the gitea backends too, and we just don't know because we don't connect directly to them | 23:06 |
fungi | the route expiration on gitea-lb02 has jumped back up again | 23:08 |
fungi | so seems like it normally gets refreshed every time it sees a new announcement or something | 23:08 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Drop the opendev.org AAAA record https://review.opendev.org/c/opendev/zone-opendev.org/+/955737 | 23:08 |
clarkb | Pushed that so we have the option | 23:09 |
fungi | lowest i caught it at was 1201sec | 23:09 |
fungi | but now it's 1603sec | 23:09 |
fungi | so it's probably something like 15 minutes and gets reset every 5 | 23:09 |
fungi | so under normal circumstances shouldn't fall below 10 minutes | 23:10 |
clarkb | the old value was almost 1800 which is ~30 minutes | 23:10 |
fungi | er, yes my math is off | 23:11 |
fungi | i meant 30 minutes expiration reset every 10 | 23:11 |
fungi | so shouldn't fall below 20 | 23:11 |
fungi | because the highest i've seen is just shy of 30 minutes and lowest i've seen is barely above 20 minutes | 23:12 |
fungi | we'll know in about another 3 minutes | 23:12 |
clarkb | all of the backends have ipv6 addrs right now too. Presumably they are on the same network and see the same RAs? | 23:12 |
fungi | all the ones i checked matched for gateway addresses at least, yeah | 23:13 |
clarkb | so maybe check them next time we see the load balancer lose its address to see if it has happened across the board | 23:13 |
fungi | right | 23:13 |
fungi | about a minute to go before the default routes are refreshed on gitea-lb02, if my observation holds | 23:15 |
fungi | bingo | 23:16 |
fungi | i saw it jump from 1200 to 1799 | 23:16 |
clarkb | I wish I could figure out where this is logged but I don't think anything logs it | 23:16 |
fungi | no, there's probably a sysctl option to turn on verbose logging in that kernel subsystem | 23:17 |
fungi | which would then go to dmesg and probably syslog | 23:17 |
clarkb | I'm not finding one in https://docs.kernel.org/6.8/networking/ip-sysctl.html but the formatting of that makes it a bit tough to grep through | 23:19 |
fungi | i've +2'd but not approved 955737 for now | 23:19 |
clarkb | you can log martians | 23:19 |
fungi | but not venutians or mercurians, sadly | 23:19 |
clarkb | maybe we just want a naive script that captures ip addr output and ip -6 ro sh output every 5 minutes | 23:20 |
fungi | though it's really those pesky saturnians you need to look out for | 23:20 |
fungi | yeah, script or even just a temporary cronjob that e-mails when there are no default v6 routes | 23:21 |
clarkb | while true ; do ip -6 addr show dev ens3 && ip -6 route show default ; done > /var/log/ipv6_networking.log 2>&1 and put that in screen? | 23:22 |
clarkb | I meant to add a sleep 300 to that too | 23:22 |
fungi | could even just test whether the output of `ip -6 ro sh default` is empty | 23:22 |
fungi | yeah, screen session would be fine | 23:23 |
fungi | slap a date command or something in there too so we have timestamping | 23:23 |
fungi | anyway, i need to knock off for the evening, told christine i was coming upstairs something like 30 minutes ago so she's probably wondering where i am (scratch that, i'm sure she knows where i am...) | 23:24 |
clarkb | I added a date and a sleep 300 and stuck that in a screen writing to my homedir (so not /var/log) | 23:25 |
fungi | sgtm, thanks! | 23:26 |
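(The cleaned-up watcher, roughly as described above: timestamped, with the missing sleep added, and logging to the home directory instead of /var/log.)

    while true; do
        date
        ip -6 addr show dev ens3
        ip -6 route show default
        sleep 300
    done >> ~/ipv6_networking.log 2>&1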
clarkb | quay.io is in emergency maintenance read-only mode | 23:29 |
clarkb | their status page indicates this affects pushes but I'm seeing jobs fail fetching with 502 bad gateway too | 23:29 |
clarkb | just a heads up. Not much we can do about it from here other than be aware | 23:30 |
clarkb | fungi: I know you popped out but almost immediately after you did, the ipv6 address went away. The ip -6 route show output still shows a valid default route though | 23:31 |
clarkb | curiously this occurred around my original guestimate for 23:30 UTC but that may be coincidence | 23:32 |
clarkb | I'm going to leave it alone and see if refreshing the routes refreshes the ip | 23:33 |
clarkb | none of the backends are currently affected. They all have valid global ipv6 addresses | 23:36 |
clarkb | after the routes refreshed the ip address came back | 23:36 |
clarkb | and its gone again | 23:40 |
clarkb | I'm beginning to suspect our accept_dad = 1 sysctl option for ens3 means that we're removing the address if a duplicate is found | 23:47 |
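(A hedged way to chase that suspicion the next time the address drops: check the sysctl and look for DAD-related flags on the interface.)

    sysctl net.ipv6.conf.ens3.accept_dad
    # "dadfailed" on the address would confirm duplicate address detection kicked in
    ip -6 addr show dev ens3 | grep -E 'tentative|dadfailed'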