| -@gerrit:opendev.org- Mauricio Harley proposed: [openstack/project-config] 990078: Remove gerritbot project notifications from #openstack-pqc https://review.opendev.org/c/openstack/project-config/+/990078 | 09:59 | |
| -@gerrit:opendev.org- Zuul merged on behalf of Mauricio Harley: [openstack/project-config] 990078: Remove gerritbot project notifications from #openstack-pqc https://review.opendev.org/c/openstack/project-config/+/990078 | 13:08 | |
| @fungicide:matrix.org | looks like the project.tarballs afs volume has been running vos release since some time yesterday, blocking all the other static volume updates | 13:16 |
|---|---|---|
| @fungicide:matrix.org | `2026-05-25 02:55:02,365 release DEBUG Running: ssh -T -i /root/.ssh/id_vos_release vos_release@afs01.dfw.openstack.org -- vos release project.tarballs` | 13:19 |
| @fungicide:matrix.org | that's the one still running | 13:20 |
| -@gerrit:opendev.org- Sylvain Bauza proposed: [openstack/project-config] 990118: Add #openstack-agentic-workflows IRC channel https://review.opendev.org/c/openstack/project-config/+/990118 | 14:12 | |
| @fungicide:matrix.org | out to run a quick errand, back shortly | 14:16 |
| -@gerrit:opendev.org- Zuul merged on behalf of Sylvain Bauza: [openstack/project-config] 990118: Add #openstack-agentic-workflows IRC channel https://review.opendev.org/c/openstack/project-config/+/990118 | 14:28 | |
| -@gerrit:opendev.org- Stephen Finucane proposed: [openstack/project-config] 990122: Rename x/cursive to openstack/cursive https://review.opendev.org/c/openstack/project-config/+/990122 | 14:42 | |
| @clarkb:matrix.org | infra-root we'll discuss much of this in today's meeting but the things I'm looking at this week are hopefully landing https://review.opendev.org/c/opendev/system-config/+/988993 to ensure new executors have manual configs that were added. Upgrading Gitea to 1.26.2 https://review.opendev.org/c/opendev/system-config/+/989448 Helping mnasiadka add a new backup server https://review.opendev.org/c/opendev/system-config/+/989567 and its depends on. Then doing more Gerrit upgrade prep. I plan to announce the June 5 upgrade date if there are no concerns with that today and then work through my TODO list in the planning etherpad: https://etherpad.opendev.org/p/gerrit-upgrade-3.13 | 15:23 |
| @clarkb:matrix.org | oh and digging into the dns job failure. Looks like my last recheck of the test change may have caught a failure but I haven't looked any closer yet | 15:24 |
| @clarkb:matrix.org | https://3edfd4ea22585141d74d-f3c4fca3c92876a4d627c25bf953ebd1.ssl.cf1.rackcdn.com/openstack/486a31d808e04145a00d975a8b984e59/bridge99.opendev.org/ara-report/results/373.html shows a SERVFAIL dns response | 15:28 |
| @clarkb:matrix.org | that query was run against the local resolver. The next two queries query the authoritative servers successfully | 15:29 |
| @clarkb:matrix.org | Looking at that I think I need to grab /var/log/unbound.log as there isn't really any info as to why the local resolver failed | 15:30 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989784: Add some dns lookup debugging to adns test server https://review.opendev.org/c/opendev/system-config/+/989784 | 15:33 | |
| @clarkb:matrix.org | now with more logging | 15:33 |
| -@gerrit:opendev.org- Zuul merged on behalf of Sylvain Bauza: [opendev/system-config] 988406: Add bots to #openstack-agentic-worfklows https://review.opendev.org/c/opendev/system-config/+/988406 | 15:42 | |
| -@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 990139: DNM: Test zuul_user_dir https://review.opendev.org/c/zuul/zuul-jobs/+/990139 | 16:22 | |
| -@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 989997: Use zuul_user_dir in some roles https://review.opendev.org/c/zuul/zuul-jobs/+/989997 | 17:09 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed on behalf of Mohammed Naser: | 18:09 | |
| - [opendev/system-config] 980840: Add Prometheus monitoring service https://review.opendev.org/c/opendev/system-config/+/980840 | ||
| - [opendev/system-config] 980994: Deploy node_exporter across all managed hosts https://review.opendev.org/c/opendev/system-config/+/980994 | ||
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 988310: Add GrepTimeDB long term storage for Prometheus https://review.opendev.org/c/opendev/system-config/+/988310 | 18:11 | |
| @mnasiadka:matrix.org | Clark: updated the node-exporter patch, will work on the greptimedb one - still I'm unsure if we should have that in the deploy pipeline from the start or not | 18:12 |
| -@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 990164: DNM: test zuul_user_dir fallback https://review.opendev.org/c/zuul/zuul-jobs/+/990164 | 18:16 | |
| @clarkb:matrix.org | mnasiadka: I think we can leave it out of deploy and add it to deploy when we add a server if we prefer that approach | 19:07 |
| @fungicide:matrix.org | the project.tarballs vos release completed some time between 18:30 and 18:35 utc, and another vos release is in progress for it now, hopefully wrapping up soon | 19:08 |
| @dpanech:matrix.org | Hi all, in this review: https://review.opendev.org/c/starlingx/tools/+/988511 the blueprint link doesn't work right. It links to https://blueprints.launchpad.net/openstack/... rather than .../starlingx/... . Is there a workaround? | 19:11 |
| I suspect the answer is "no" based on this: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gerrit/templates/gerrit.config.j2#L137 . | ||
Could somebody confirm? | ||
| @fungicide:matrix.org | i think it's taking advantage of launchpad's umbrella org functionality that does bp name lookups for child projects of an umbrella parent | 19:21 |
| @fungicide:matrix.org | maybe we could drop the "openstack/" from the url and do this? https://blueprints.launchpad.net/?searchtext=testbp | 19:22 |
| @fungicide:matrix.org | it would potentially return blueprints from other projects on lp, but if people namespace their blueprints consistently maybe things would just work out most of the time? | 19:23 |
| @clarkb:matrix.org | ya I think you'd need to namespace blueprint names themselves to make that work and that may be an appropriate solution here | 19:24 |
| @fungicide:matrix.org | also worth noting, we already had the possibility for bp names to collide between different openstack projects, resulting in the existing query returning multiple results | 19:30 |
| @mnasiadka:matrix.org | So we just have a smaller collision domain right now | 19:35 |
| @fungicide:matrix.org | right. and insofar as our gerrit isn't only for openstack, this feels like another legacy decision we've needed to generalize for some time but nobody's pointed out until now | 19:37 |
| @mnasiadka:matrix.org | True, sounds like an easy switch, but some communication work is needed | 19:40 |
| -@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: | 19:46 | |
| - [opendev/system-config] 990186: Don't assume all LP blueprints belong to OpenStack https://review.opendev.org/c/opendev/system-config/+/990186 | ||
| - [opendev/system-config] 990187: Update the bug reporting link in Gerrit https://review.opendev.org/c/opendev/system-config/+/990187 | ||
| @fungicide:matrix.org | the second one was something trivial i happened to notice while i was in there | 19:46 |
| @clarkb:matrix.org | fungi: I think you might be right about bind and unbound getting in a fight. In the unbound log I see that we do queries for us.archive.ubuntu.com to install bind and a couple of other packages. Then I don't see queries from my debug script after that. In particular google.com shows up nowhere in the log and I do a query for google.com against the local resolver | 19:49 |
| @clarkb:matrix.org | however, it does resolve | 19:49 |
| @clarkb:matrix.org | so maybe bind is coming up and taking over resolution duties? And for some reason google.com can resolve but not opendev.org in that situation? | 19:49 |
| @fungicide:matrix.org | though we do disable recursion in bind | 19:50 |
| @fungicide:matrix.org | or at least it seemed like we did | 19:50 |
| @clarkb:matrix.org | fungi: yes, but this is before we configure bind | 19:50 |
| @clarkb:matrix.org | we're in the period of time between package install for bind and configuring bind | 19:50 |
| @fungicide:matrix.org | mmm, so could also be coming from a cache | 19:51 |
| @fungicide:matrix.org | maybe even cached results can be returned later after disabling recursion | 19:51 |
| @clarkb:matrix.org | I think I would see the request go to unbound in that case. Unless systemd or libc is caching it and short circuiting before going to unbound? | 19:51 |
| @fungicide:matrix.org | or systemd-resolved | 19:52 |
| @clarkb:matrix.org | anyway at this point I'm wondering if I should try to hold a node or if I can collect more debug output (ss to see what's listening where and maybe the bind log?) | 19:52 |
| @fungicide:matrix.org | hopefully not that | 19:52 |
| @fungicide:matrix.org | but yeah, an autohold is warranted for this | 19:52 |
| @clarkb:matrix.org | ok I'll put one in place and start rechecking. I do worry that after we configure bind the issue will go away | 19:52 |
| @fungicide:matrix.org | otherwise you're probably stuck dumping lsof to a file | 19:52 |
| @clarkb:matrix.org | so maybe I should also add more debugging and cover all the bases | 19:52 |
| @fungicide:matrix.org | getting query logs from bind as well as unbound would help us identify if the queries were going to the wrong daemon | 19:53 |
| @clarkb:matrix.org | yup I'll work both angles in case one is insufficient | 19:53 |
| @dpanech:matrix.org | fungi: thanks for the BP links update | 19:57 |
| @fungicide:matrix.org | davlet: thanks for pointing it out! that wasn't something i had thought about in a very long time | 19:59 |
| @clarkb:matrix.org | fungi: more datapoints. We are collecting syslog which has named logs and it is listening at 127.0.0.1:53: `listening on IPv4 interface lo, 127.0.0.1#53` and that happens according to syslog before my debugging command task from ansible runs. In prod we have unbound on local and bind on the "public" (which is not publicly accessible) interfaces | 20:01 |
| @clarkb:matrix.org | also these jobs are failing in rax flex which are fast. I suspect that bind is coming up quicker there and then breaking/conflicting with unbound whereas normally there is enough of a delay there. | 20:02 |
| @clarkb:matrix.org | I think maybe we can refactor these tasks so that we install git first and then clone repos. Then install bind and rsync and worry about synchronization and configuring bind | 20:03 |
| @clarkb:matrix.org | and honestly I'm tempted to just push that change up and recheck it a few times and if it doesn't break maybe we're good | 20:03 |
| @clarkb:matrix.org | I think the problem is just within the window where bind has been installed and we need dns lookups which is small and avoidable | 20:03 |
| @fungicide:matrix.org | makes sense | 20:04 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 990190: Fix order of operations during adns bind installation https://review.opendev.org/c/opendev/system-config/+/990190 | 20:12 | |
| @clarkb:matrix.org | that change is decoupled from the debugging change. That said I have an autohold set and will push an update to the debugging change shortly to better illustrate the problem | 20:12 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989784: Add some dns lookup debugging to adns test server https://review.opendev.org/c/opendev/system-config/+/989784 | 20:16 | |
| @clarkb:matrix.org | https://zuul.opendev.org/t/openstack/build/d87513edff3f4e3ca278d4552e0def4f/log/job-output.txt#17238-17265 I think this shows the bind port conflict | 21:05 |
| @fungicide:matrix.org | interesting, how are they listening on the same port? i thought the kernel didn't allow that | 21:08 |
| @fungicide:matrix.org | oh! udp, so not "listening" in the tcp sense | 21:09 |
| @fungicide:matrix.org | my brain is fried, and it's only tuesday and yesterday was a holiday | 21:10 |
| @clarkb:matrix.org | Though it isn't multicast so not sure how the kernel decides which one to deliver to? Or does it deliver to all in a unicast fashion and then maybe conntrack gets confused | 21:12 |
| @clarkb:matrix.org | fungi: those nodes should be held too if you want to inspect the running system | 21:14 |
| @clarkb:matrix.org | But I'm doing the school run now | 21:14 |
| @fungicide:matrix.org | i think they all receive every datagram and then respond if they want | 21:14 |
| @clarkb:matrix.org | If they respond the resolution should work though? | 21:15 |
| @fungicide:matrix.org | so it's more about which response does the client accept | 21:15 |
| @clarkb:matrix.org | The client (dig) says servfail | 21:15 |
| @fungicide:matrix.org | my guess is the client is receiving two responses | 21:15 |
| @clarkb:matrix.org | Which is in that log just below | 21:15 |
| @fungicide:matrix.org | but i'm currently prepping dinner so can't experiment at the moment | 21:16 |
| @fungicide:matrix.org | we could probably tcpdump on lo0 to confirm that theory, but it doesn't change the fact that they're clearly in conflict with one another | 21:16 |
| @clarkb:matrix.org | Ya and I think the fix is straightforward. Avoid DNS name lookups after bind is installed and before it is reconfigured | 21:17 |
| @fungicide:matrix.org | oh! there's even more fun... | 21:17 |
| @clarkb:matrix.org | My proposed fix did not hit raxflex so I rechecked it | 21:17 |
| @fungicide:matrix.org | `...127.0.0.53%lo:53 0.0.0.0:* users:(("systemd-resolve"...` | 21:17 |
| @clarkb:matrix.org | Ya though I think that systemd magic happens there | 21:18 |
| @fungicide:matrix.org | so in fact we have 3 resolvers all listening on 53/udp | 21:18 |
| @fungicide:matrix.org | on the loopback | 21:19 |
| @clarkb:matrix.org | Ya it's probably worth checking the held node to see how that all shakes out. But I suspect the fix I already pushed is the solution | 21:22 |
| @clarkb:matrix.org | mnasiadka: noticed a test issue with the greptimedb change and left a comment with a suggestion on how to fix it | 21:50 |
| @fungicide:matrix.org | the irc meeting schedule indicates there's nothing going on at this point, so i'm going to restart the ircbot container on eavesdrop02 to pick up new channels | 22:44 |
| @fungicide:matrix.org | looks like it's rejoined all the channels now | 22:47 |
| @fungicide:matrix.org | including the two newly-added ones | 22:49 |
| @clarkb:matrix.org | perfect | 22:59 |
| @clarkb:matrix.org | I hopped onto the held adns node and did a `dig opendev.org` which failed on the first query but succeeded on the second | 23:34 |
| @clarkb:matrix.org | `broken trust chain resolving 'opendev.org/A/IN': 104.239.145.127#53` is in the bind log | 23:35 |
| @clarkb:matrix.org | I think maybe it is failing the first request on dnssec validation but then serving out of the cache on the second (that feels like a bug but whatever) | 23:35 |
| @clarkb:matrix.org | and the unbound log file is acting like it isn't moving at all | 23:35 |
| @clarkb:matrix.org | as if the udp packet isn't getting delivered to it at all | 23:36 |
| @clarkb:matrix.org | In any case I suspect that reconfiguring bind removes the conflict which should allow unbound to get the packets again so reordering ansible tasks to avoid the name lookups when bind is half configured should work well as a fix (but I'm still trying to get a test run of that change in rax flex) | 23:36 |
| @clarkb:matrix.org | ya stracing the unbound process and running dig doesn't show any movement in that process at all so I think it is sitting there idle and bind is getting the packets for whatever reason | 23:38 |
| @clarkb:matrix.org | oh there is already an etherpad 3.2.0 with the session cleanup fix in it | 23:44 |
| @clarkb:matrix.org | I missed that | 23:44 |
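
The 15:28 to 15:30 messages describe a SERVFAIL from the local resolver while the authoritative servers answered fine. As a rough sketch of how that comparison could be scripted without extra tooling, the snippet below builds a minimal DNS query by hand and reads the response code out of the reply header. The function names (`build_query`, `parse_rcode`, `query_rcode`) and the fixed transaction ID are illustrative assumptions, not anything from the debugging change under review.

```python
import socket
import struct

# Response codes from the low 4 bits of the DNS header flags field.
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def build_query(name, qtype=1, txid=0x1234):
    """Build a single-question DNS query (qtype 1 = A, class IN)."""
    # Header: id, flags (only RD set), qdcount=1, ancount/nscount/arcount=0.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def parse_rcode(response):
    """Return the textual rcode from a raw DNS response."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, str(code))


def query_rcode(server, name, port=53, timeout=2.0):
    """Send one UDP query to `server` and return the rcode of the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, port))
        data, _ = sock.recvfrom(4096)
    return parse_rcode(data)
```

Calling `query_rcode("127.0.0.1", "opendev.org")` and then the same query against an authoritative server address (not given in this log, so supply your own) would show whether only the local path fails. This does no retries and no DNSSEC handling, so it is closer to a smoke test than a replacement for `dig`.
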
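
The 21:08 to 21:15 exchange asks how two daemons can hold the same UDP port and which one actually gets a unicast datagram. A minimal sketch for poking at that on a test host follows. It assumes both sockets set `SO_REUSEADDR`, with one bound to the wildcard address and one to `127.0.0.1`; the socket options bind and unbound really use are not shown in this log, so this only illustrates the kernel behaviour and does not reproduce the job failure. `bind_udp` and `who_receives` are hypothetical helper names.

```python
import socket


def bind_udp(addr, port):
    """Open a UDP socket with SO_REUSEADDR set and bind it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((addr, port))
    sock.settimeout(0.5)
    return sock


def who_receives(payload=b"ping"):
    """Bind two sockets to one UDP port and report which gets a datagram."""
    # Let the kernel pick a free port for the wildcard socket, then bind
    # a second socket to the same port on the loopback address only.
    wildcard = bind_udp("0.0.0.0", 0)
    port = wildcard.getsockname()[1]
    specific = bind_udp("127.0.0.1", port)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receivers = []
    try:
        sender.sendto(payload, ("127.0.0.1", port))
        for name, sock in (("wildcard", wildcard), ("specific", specific)):
            try:
                data, _ = sock.recvfrom(512)
                if data == payload:
                    receivers.append(name)
            except socket.timeout:
                pass  # this socket never saw the datagram
    finally:
        for sock in (sender, wildcard, specific):
            sock.close()
    return receivers


if __name__ == "__main__":
    print(who_receives())
```

On Linux, a unicast datagram should go to a single socket, not to every socket bound to the port, which fits the later observation that unbound sat idle while bind received the queries. Which socket wins depends on the bind addresses and socket options, so treat the result as something to check on the held node, not as a given.
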