| -@gerrit:opendev.org- Sei Sano proposed: [opendev/irc-meetings] 988910: Update Masakari IRC meeting ... https://review.opendev.org/c/opendev/irc-meetings/+/988910 | 06:46 | |
| @mnasiadka:matrix.org | There is some slowness in git clone over https from opendev.org - as in `error: RPC failed; curl 28 Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds` | 09:46 |
|---|---|---|
| -@gerrit:opendev.org- Zuul merged on behalf of Elod Illes: [openstack/project-config] 988430: [release-tools] Fix dist_name fetch for upper bump https://review.opendev.org/c/openstack/project-config/+/988430 | 10:38 | |
| -@gerrit:opendev.org- Stephen Finucane proposed: | 12:52 | |
| - [opendev/git-review] 987712: Revert "Clean up all references to branchauthor after removal of usage" https://review.opendev.org/c/opendev/git-review/+/987712 | ||
| - [opendev/git-review] 987713: Add gitreview.autotopic git config flag https://review.opendev.org/c/opendev/git-review/+/987713 | ||
| - [opendev/git-review] 988966: Use configparser utils to parse types https://review.opendev.org/c/opendev/git-review/+/988966 | ||
| -@gerrit:opendev.org- Elod Illes proposed: [openstack/project-config] 988968: Fix publish-openstack-releasenotes-python3 job https://review.opendev.org/c/openstack/project-config/+/988968 | 12:58 | |
| -@gerrit:opendev.org- Stephen Finucane proposed: | 13:02 | |
| - [opendev/git-review] 988966: Use configparser utils to parse types https://review.opendev.org/c/opendev/git-review/+/988966 | ||
| - [opendev/git-review] 987712: Revert "Clean up all references to branchauthor after removal of usage" https://review.opendev.org/c/opendev/git-review/+/987712 | ||
| - [opendev/git-review] 987713: Add gitreview.autotopic git config flag https://review.opendev.org/c/opendev/git-review/+/987713 | ||
| @fungicide:matrix.org | mnasiadka: i'm seeing slowness too, i wonder if it's because an openstack deployment project just made a release recently, or whether the crawlers have figured out how to bypass anubis finally | 13:24 |
| @fungicide:matrix.org | i was able to clone openstack/nova at 5.35 MiB/s but there was quite a delay before the service responded, suggesting apache worker slots may be in short supply... looking now | 13:25 |
| @fungicide:matrix.org | nope, not that, apache worker slots are almost entirely unused on all backends, load averages are sub-1.0 too | 13:29 |
| @fungicide:matrix.org | and i'm not seeing any packet loss to the lb over ipv4 or ipv6 | 13:30 |
| @fungicide:matrix.org | looks like my connections are being balanced to gitea11 currently | 13:32 |
| @fungicide:matrix.org | digging in the apache logs on gitea11 i'm not seeing anything that stands out, nor in the haproxy logs on gitea-lb03 | 13:42 |
| @fungicide:matrix.org | probably unrelated, letsencrypt renewals have been broken for a few days now, i'll see if i can track down the cause | 13:46 |
| @fungicide:matrix.org | likely connected to the ansible upgrade on bridge, given the timing | 13:46 |
| @fungicide:matrix.org | `-rw-r--r-- 1 root root 431076 May 13 03:04 /var/log/ansible/letsencrypt.yaml.log` | 14:05 |
| @fungicide:matrix.org | so yeah, the playbook hasn't run for the past 5 days | 14:05 |
| @fungicide:matrix.org | that last run was earlier on the same day that we upgraded ansible on bridge | 14:07 |
| @fungicide:matrix.org | that also coincides with our daily infra-prod-base job failing: https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-base&project=opendev/system-config | 14:10 |
| @fungicide:matrix.org | aha, it's trying to update translate-dev01.openstack.org which has too-old python | 14:11 |
| @fungicide:matrix.org | also translate01.openstack.org and storyboard01.opendev.org | 14:12 |
| @fungicide:matrix.org | i'll add them to the disable list for now while we discuss further | 14:12 |
| @clarkb:matrix.org | fungi: I made a note about translate and storyboard on saturday. I think we need to force ansible to use python2 on those hosts via our inventory. The autodetection finds python3 which is too old for ansible 9 | 14:44 |
| @clarkb:matrix.org | mnasiadka: fungi: if the backends are happy then the issue could be that the frontend is getting overwhelmed | 14:44 |
| @clarkb:matrix.org | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=now-6h&to=now&timezone=utc looks ok though | 14:44 |
| @clarkb:matrix.org | and it seems to be responsive for me at the moment. So either it's backend specific or the window of time where this was happening has passed? | 14:47 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 988992: Force some hosts to use python2 for compat with Ansible 9 https://review.opendev.org/c/opendev/system-config/+/988992 | 15:04 | |
| @fungicide:matrix.org | maybe it's something on my end, but if i `git clone https://opendev.org/openstack/nova` i see `Cloning into 'nova'...` immediately followed by nothing for 43 seconds before `remote: Enumerating objects:` appears | 15:10 |
| @fungicide:matrix.org | then it proceeds to actually download at a reasonable pace | 15:10 |
| @clarkb:matrix.org | I tested with system-config maybe it is repo specific? | 15:11 |
| @clarkb:matrix.org | nova does have a ton of refs and maybe its spinning on the backend preparing the pack files etc to supply to the client | 15:11 |
| @fungicide:matrix.org | doesn't matter if i add `-4` or `-6` either, so seems to be the same for either address family | 15:11 |
| @fungicide:matrix.org | if i clone opendev/bindep instead, the pause there is more like 1s | 15:12 |
| @fungicide:matrix.org | so yeah, seems to perhaps be related to repository size | 15:12 |
| @fungicide:matrix.org | maybe it's taking gitea a long time to read in the files? | 15:13 |
| @clarkb:matrix.org | ya I think maybe we should try to profile a specific git clone of say nova against a specific backend and see where that may be slow | 15:15 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 988993: Set kernel.yama.ptrace_scope to 2 on executors https://review.opendev.org/c/opendev/system-config/+/988993 | 15:17 | |
| @fungicide:matrix.org | fwiw, doesn't look like the backend i'm hitting has any memory pressure either, though the kernel does report 90% of ram is consumed for buff/cache | 15:19 |
| @clarkb:matrix.org | I think we have a repacking cronjob on the giteas like we do for gerrit and that is done for performance reasons. Maybe we should check that is working? | 15:20 |
| @clarkb:matrix.org | basically check the cron isn't failing and then also inspect if it looks like we've got thousands of loose refs that we don't expect | 15:20 |
| @clarkb:matrix.org | but also if you initiate a git clone and its taking that long you may also be able to directly observe what is happening and strace it or something | 15:21 |
| @clarkb:matrix.org | might take a couple of attempts but should be doable? | 15:21 |
| @fungicide:matrix.org | trying to work out the ssh permission error in system-config-run-base-arm64 for my mixed architecture change, i see we don't collect `/var/log/auth.log`, should we start? or would it make more sense for me to just hold a node? | 15:43 |
| @clarkb:matrix.org | It might make sense to hold the nodes | 15:59 |
| @fungicide:matrix.org | i'll just do that, then | 16:04 |
| @clarkb:matrix.org | fungi: https://review.opendev.org/c/opendev/system-config/+/988992 passed testing but I don't think it actually exercises the code path | 16:34 |
| @clarkb:matrix.org | if we proceed with ^ we'll want to remove the hosts from the emergency file so that the deployment actually affects those servers | 16:34 |
| @fungicide:matrix.org | yep | 16:39 |
| @fungicide:matrix.org | i can do that shortly, thanks! | 16:40 |
| @mnasiadka:matrix.org | 988992 looks legit, although I wasn't aware Ansible is so funny, that it supports 2.7 but not some older 3.x ;-) | 17:03 |
| @mnasiadka:matrix.org | (but Kolla-Ansible is way forward in Ansible versions) | 17:03 |
| @clarkb:matrix.org | one thing I learned after my networking issues on Friday is that tumbleweed has moved/is moving from apparmor to selinux. This partially explains the lack of testing that broke things. My laptop which is a newer install is all selinux and no apparmor | 17:06 |
| @clarkb:matrix.org | so I've been digging around in documentation and trying to decide if I convert my desktop. I probably should at this point but maybe that is another Friday activity rather than a monday activity | 17:07 |
| @clarkb:matrix.org | Looks like the process is to install all the selinux tools and tell the kernel to boot with selinux rather than apparmor enabled. First boot sets a flag to auto relabel everything and you boot into permissive mode first. If that all looks ok then you switch to enforcing and reboot again. Straightforward but considering fundamental items like networking can break I don't want to do that on Monday | 17:09 |
| @clarkb:matrix.org | I'm holding new gerrit upgrade test nodes to retest things with the latest images. Looking at a calendar June 5 is looking like it might be good for the 3.13 upgrade? | 17:21 |
| @clarkb:matrix.org | I'll bring that up in tomorrow's meeting but would be curious if June 5 works for others | 17:22 |
| @fungicide:matrix.org | okay, i've taken the three older servers i added this morning back out of the disable list, and have approved 988992 now | 17:26 |
| @fungicide:matrix.org | Clark: okay, bit of a head-scratcher on the multi-arch job failure... | 17:41 |
| @fungicide:matrix.org | i can see in the auth log on noble that the root login from bridge99 is being refused, and indeed `/root/.ssh/authorized_keys` has the wrong ipv6 address for bridge99 (though the correct ipv4 address) and the connection is being attempted over ipv6 | 17:42 |
| @fungicide:matrix.org | the zuul inventory doesn't have any record of the ipv6 address that's in the local config, for that matter | 17:43 |
| @fungicide:matrix.org | but also the inventory shows a null v6 address for bridge99 as `public_ipv6: ''` even though it was booted in rax-dfw | 17:44 |
| @fungicide:matrix.org | separately, i wonder if this would also be broken in cases where bridge99 ended up in a provider with no ipv6 at all while the inventory listed raw v6 addresses in `ansible_host` for the arm64 nodes in the inventory | 17:46 |
| @fungicide:matrix.org | since it seems like it wants to ssh from bridge99 to `root@$ansible_host` | 17:47 |
| @fungicide:matrix.org | the ip addresses of the held nodes are listed in https://zuul.opendev.org/t/openstack/build/604e9936e7c14b7cad823ed72d7ef30d/log/zuul-info/inventory.yaml if you want to take a look | 17:48 |
| @fungicide:matrix.org | aha, the mystery ipv6 address in `/root/.ssh/authorized_keys` actually belongs to production bridge01 | 17:50 |
| @fungicide:matrix.org | so probably it got filled from a dns lookup fallback due to the empty `public_ipv6` for bridge99 | 17:50 |
| @clarkb:matrix.org | Though I thought that is why we use bridge99 instead of the actual name to better decouple that stuff | 17:51 |
| @fungicide:matrix.org | but maybe we test for whether the address exists? | 17:52 |
| @clarkb:matrix.org | But ya maybe we just aren't handling the no ipv6 case since the prod bridge has ipv6 but how did this work when bridge was arm which has no ipv6? | 17:52 |
| @fungicide:matrix.org | the arm nodes do have ipv6 addresses | 17:52 |
| @fungicide:matrix.org | and the amd node *should* in theory because it's in rax classic, but for some reason it's empty | 17:52 |
| @clarkb:matrix.org | Oh I think arm nodes having ipv6 may be new | 17:53 |
| @fungicide:matrix.org | yeah, possible | 17:53 |
| @fungicide:matrix.org | regardless, and to clarify, the amd64 bridge99 *does* in fact have a working global ipv6 address and is using it to connect from rax-dfw to the arm64 nodes in osuosl-regionone, but the zuul/ansible *inventory* doesn't include it | 17:55 |
| @clarkb:matrix.org | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/base/server/tasks/main.yaml#L32-L39 | 17:56 |
| @clarkb:matrix.org | this is where we set up that authorized key rule | 17:56 |
| @fungicide:matrix.org | if it connected over ipv4 the tests would probably work, since the correct ipv4 address for bridge99 is allowed | 17:56 |
| @clarkb:matrix.org | fungi: there are two levels of ansible here. The one that zuul runs to execute the job and the nested level that runs our ansible playbooks and I think the same one testinfra tests use | 17:57 |
| @clarkb:matrix.org | its the nested inventory that is the problem right? | 17:57 |
| @clarkb:matrix.org | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/base/server/defaults/main.yaml#L2 and this explains why we get the prod ip addr when nothing else is defined (its the default for that var) | 17:58 |
| @fungicide:matrix.org | probably, i'm looking to see if i can find where we save a copy of the nested ansible's inventory | 17:58 |
| @clarkb:matrix.org | https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-base.yaml#L105 this is where we try to set it in the test env's nested ansible config I think | 17:59 |
| @fungicide:matrix.org | here it is: https://zuul.opendev.org/t/openstack/build/604e9936e7c14b7cad823ed72d7ef30d/log/bridge99.opendev.org/etc/ansible/hosts/group_vars/all.yaml | 17:59 |
| @fungicide:matrix.org | and yeah, it contains `bastion_ipv4` but no mention of `bastion_ipv6` | 18:00 |
| @clarkb:matrix.org | https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/templates/group_vars/all.yaml.j2#L1-L9 which should write out this file to override the prod defaults | 18:01 |
| @fungicide:matrix.org | so i suspect `public_ipv6` being an empty string in the zuul inventory bridge99 entry is probably resulting in an undefined `bastion_ipv6` in the nested vars | 18:01 |
| @clarkb:matrix.org | ok so https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-base.yaml#L105 is not writing out the ipv6 address we expect | 18:01 |
| @clarkb:matrix.org | fungi: yes I think that is correct | 18:01 |
| @clarkb:matrix.org | fungi: and then since we do actually have ipv6 on those hosts they use ipv6 and we fail. The expectation is that if we don't have an ipv6 addr listed in facts that we don't have working ipv6 at all and it will be fine | 18:02 |
| @clarkb:matrix.org | so is this an openstacksdk or zuul launcher level bug? | 18:02 |
| @clarkb:matrix.org | these addresses come from the openstack APIs not ansible fact gathering I think | 18:03 |
| @clarkb:matrix.org | what is weird is the base job is working on the non mixed setup I think | 18:04 |
| @clarkb:matrix.org | but let me see if I can find one that also had the bridge in rax | 18:04 |
| @fungicide:matrix.org | well, for the non-mixed job zuul will try to put all the nodes in the same provider right? so in that case none of them will have v6 addresses in the zuul inventory in theory, and the nested ansible will refer to them all by their v4 addresses instead? | 18:06 |
| @clarkb:matrix.org | fungi: same issue here: https://zuul.opendev.org/t/openstack/build/ea9eddfb5ee6403384d5a445e48c81c8/log/zuul-info/inventory.yaml but the job passes | 18:06 |
| @clarkb:matrix.org | and yes I think that explains it. When everything is one provider there is no mixing of ipv6 and ipv4 and we just sidestep the issue | 18:07 |
| @clarkb:matrix.org | I suspect this problem is openstacksdk returning the wrong values for rax classic now | 18:07 |
| @fungicide:matrix.org | but separately, what will happen if an amd64 bridge99 ends up in openmetal-iad3 which has no ipv6 and then tries to ssh to arm64 nodes in osuosl-regionone? will it refer to them by their v6 addresses and error out with no route to host/invalid address family/whatever? | 18:09 |
| @clarkb:matrix.org | it depends on how we write out the inventory using the write-inventory role here: https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-base.yaml#L73 | 18:11 |
| @clarkb:matrix.org | fungi: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/write-inventory/library/write_inventory.py writes that out which is a little annoying because it isn't an easy template file to just read | 18:13 |
| @clarkb:matrix.org | fungi: I think it is taking the current inventory (from the zuul ansible) and writing it out to a file that the nested ansible will use | 18:14 |
| @clarkb:matrix.org | which preserves the ansible_host values from the zuul ansible inventory | 18:15 |
| @clarkb:matrix.org | and it looks like we prefer ipv6 if present so ya I think this will break in both directions | 18:15 |
| @clarkb:matrix.org | so there are two bugs. The first is empty ipv6 addrs on hosts that do have ipv6 possibly as a problem with the openstack apis or openstacksdk or the zuul launcher. Then second we need to ensure that we only use ipv6 if all hosts can use ipv6 and vice versa with ipv4 when writing the nested inventory | 18:16 |
| @clarkb:matrix.org | fixing the second thing will paper over the first thing | 18:16 |
| @fungicide:matrix.org | okay, so 1. openstacksdk seems to not be finding rackspace classic ipv6 addresses recently, 2. our test `authorized_keys` entry shouldn't fall back to production addresses when test addresses are empty, 3. we need to figure out how we make the jobs work when bridge is in a provider with no ipv6 while other nodes have v6 addresses | 18:17 |
| @clarkb:matrix.org | fungi: I think 2 may be ok since in theory we should be ignoring those network families if we don't have them (but we're not because we do have them we're just mistaken about it) | 18:18 |
| @clarkb:matrix.org | for 3. I suspect that we can add a new task after this task: https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-base.yaml#L73 that checks the inventory's ansible_host value for each host and if they aren't all ipv6 sets them to ipv4 or something like that | 18:19 |
| @clarkb:matrix.org | basically do the current write-inventory step to translate things directly. Then do a cleanup pass | 18:19 |
| @clarkb:matrix.org | alternatively: zuul-launcher intends to support mixed provider nodesets. Maybe the actual bug here should be fixed in zuul-launcher itself | 18:21 |
| @clarkb:matrix.org | though that is a bit of a stretch since zuul's ansible can talk to all of them as is | 18:21 |
| @clarkb:matrix.org | but maybe this is a "make things easier for zuul users" item | 18:21 |
| @fungicide:matrix.org | as in have zuul-launcher omit v6 addresses from the inventory if not all nodes have them? | 18:21 |
| @clarkb:matrix.org | yup | 18:22 |
| @fungicide:matrix.org | (and same for v4 addresses i suppose, to support the v6-only case) | 18:22 |
| @clarkb:matrix.org | or at least don't set ansible_host to ipv6 (maybe still record it in the inventory but don't make it the default connection type) | 18:22 |
| @fungicide:matrix.org | oh, that sounds a little more reasonable | 18:23 |
| @fungicide:matrix.org | and i suppose there's also a corner case that would need to be covered: some nodes have only ipv4 addresses while others have only ipv6 | 18:23 |
| @fungicide:matrix.org | not that it would ever happen for us, but there could be some zuul deployments like that | 18:23 |
| @clarkb:matrix.org | yes, which I don't think zuul can solve in inventory. That would have to come with mixed nodeset partitioning | 18:24 |
| @clarkb:matrix.org | like don't mix this group with that group graph coloring | 18:24 |
| @fungicide:matrix.org | i suppose the logic could work like this: if not all nodes have ipv6 addresses, then the nodes which do have both ipv6 and ipv4 addresses should use their ipv4 address for `ansible_host` | 18:25 |
| @clarkb:matrix.org | yes or more generically: if all hosts have an ipvX address and not an ipvY address then ansible_host should be set to use ipvX for all hosts | 18:26 |
| @fungicide:matrix.org | then the all-dual-stack and all-v6-only cases would still work the way they do now, as would the some-v6-only while others v4-only case | 18:26 |
| @fungicide:matrix.org | well, my point was that the current v6 preference means that v6-only hosts will still get their v6 address as host anyway, so that doesn't need to change | 18:27 |
| @fungicide:matrix.org | there wouldn't be an actual need to "fall back" from missing v4 to present v6 | 18:28 |
| @clarkb:matrix.org | got it | 18:28 |
| @fungicide:matrix.org | so don't even need to test for that | 18:28 |
| @clarkb:matrix.org | so something like: if you only have one ipvX or ipvY address use that. If all hosts share either available ipvX or ipvY but not all hosts have both then prefer the shared version. Finally prefer ipv6 if all else fails? | 18:29 |
| @clarkb:matrix.org | anyway I think we can encode something like that in our run-base.yaml playbook as a followup to the initial inventory copy over | 18:29 |
| @clarkb:matrix.org | it might look ugly in ansible but should be doable | 18:29 |
| @fungicide:matrix.org | "if at least one node has no v6, every node which has v4 should prefer it" | 18:29 |
| @clarkb:matrix.org | ++ | 18:30 |
| @fungicide:matrix.org | that should be the simplest logical encoding | 18:30 |
| @clarkb:matrix.org | and then if zuul wants to make it easier for people by encoding these rules a level higher we can drop what we do in run-base.yaml | 18:31 |
| @clarkb:matrix.org | but step 0 seems like updating run-base.yaml to do this | 18:31 |
| @fungicide:matrix.org | is there a convenient tool for adjusting the values in that `gate-hosts.yaml` file, or do we need to make an external script/module for that? | 18:34 |
| @clarkb:matrix.org | fungi: I think ansible can load the yaml, then you can change values, then you can write it back again | 18:35 |
| @clarkb:matrix.org | one problem may be having that pollute the local inventory | 18:35 |
| @clarkb:matrix.org | I don't know if we can "namespace" things | 18:35 |
| @clarkb:matrix.org | ya looks like you can just load vars and set a new variable that way | 18:36 |
| @clarkb:matrix.org | * ya looks like you can just include vars and set a new variable that way | 18:36 |
| @fungicide:matrix.org | ick, https://zuul.opendev.org/t/openstack/build/604e9936e7c14b7cad823ed72d7ef30d/log/bridge99.opendev.org/gate-hosts.yaml is all one very long line of yaml | 18:37 |
| @clarkb:matrix.org | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/update-json-file/tasks/main.yaml this is how the docker roles update docker related json files | 18:38 |
| @clarkb:matrix.org | fungi: that shouldn't matter too much if we use an approach like ^ | 18:38 |
| @clarkb:matrix.org | basically load the file and loop over the entries and check if ipv6 is set for all of them. If not then in a new loop pass update ansible_host to the public_v4 value for the current item | 18:39 |
| @clarkb:matrix.org | finally write the file back out again if we made changes | 18:39 |
| @fungicide:matrix.org | yeah, if we use something that groks yaml | 18:39 |
| @clarkb:matrix.org | yes https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/update-json-file/tasks/main.yaml#L15 should work if you change from_json to from_yaml iirc | 18:40 |
| @clarkb:matrix.org | https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/from_yaml_filter.html | 18:40 |
| -@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 988992: Force some hosts to use python2 for compat with Ansible 9 https://review.opendev.org/c/opendev/system-config/+/988992 | 18:51 | |
| @clarkb:matrix.org | infra-prod-base succeeded in deploy for ^ | 19:14 |
| @clarkb:matrix.org | it is issuing certs now so deployment isn't done yet | 19:14 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989022: Handle mixed provider nodests and ipv6 availability https://review.opendev.org/c/opendev/system-config/+/989022 | 19:21 | |
| @clarkb:matrix.org | fungi: ^ that is totally untested, but I think captures the general shape of what we're looking for | 19:21 |
| @clarkb:matrix.org | fungi: I think having both static02 and static03 trying to renew certs may be creating problems. static03 failed with `security.openstack.org:Verify error:{"type":"urn:ietf:params:acme:error:serverInternal","detail":"Unable to validate JWS","status": 500}` but static02 succeeded. I'm reading the interval error as being maybe we tried to issue a cert for that name too many times in short succession? | 19:24 |
| @clarkb:matrix.org | though we've done this with 6 giteas for a long time | 19:24 |
| @clarkb:matrix.org | so I don't know maybe that is wrong | 19:24 |
| @clarkb:matrix.org | oh its internal not interval so ya I'm probably reading that wrong | 19:24 |
| @fungicide:matrix.org | ah thanks! i was still trying to figure out the syntax for making ansible find hosts that were missing v6 addresses, i.e. step #1 of the process | 19:25 |
| @fungicide:matrix.org | yeah, i'll push up a change now to drop static02 and static04 from our production inventory, then work on cleanup once that lands | 19:26 |
| @clarkb:matrix.org | I think all of the certs that succeeded do get new values on disk but we maybe didn't reload the vhosts | 19:27 |
| @clarkb:matrix.org | so in theory this will either self heal this evening during the next daily runs, which will auto reload apache configs to pick up the new certs, or it will continue to fail and we can intervene further | 19:27 |
| @clarkb:matrix.org | and now lunch | 19:28 |
| -@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/system-config] 989023: Drop static02 and static04 from inventory https://review.opendev.org/c/opendev/system-config/+/989023 | 19:33 | |
| -@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/system-config] 989023: Drop static02 and static04 from inventory https://review.opendev.org/c/opendev/system-config/+/989023 | 19:34 | |
| @fungicide:matrix.org | forgot to reset the head initially | 19:34 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989022: Handle mixed provider nodests and ipv6 availability https://review.opendev.org/c/opendev/system-config/+/989022 | 19:37 | |
| @clarkb:matrix.org | there was a small bug that testing caught immediately. I'm hoping that the next failure is a big bug :) | 19:38 |
| @clarkb:matrix.org | ok really eating lunch now | 19:38 |
| -@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/zone-opendev.org] 989024: Remove records for static02 and static04 https://review.opendev.org/c/opendev/zone-opendev.org/+/989024 | 19:39 | |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989022: Handle mixed provider nodests and ipv6 availability https://review.opendev.org/c/opendev/system-config/+/989022 | 19:49 | |
| -@gerrit:opendev.org- Stephen Finucane proposed: | 19:51 | |
| - [opendev/git-review] 988966: Use configparser utils to parse types https://review.opendev.org/c/opendev/git-review/+/988966 | ||
| - [opendev/git-review] 987712: Revert "Clean up all references to branchauthor after removal of usage" https://review.opendev.org/c/opendev/git-review/+/987712 | ||
| - [opendev/git-review] 987713: Add gitreview.autotopic git config flag https://review.opendev.org/c/opendev/git-review/+/987713 | ||
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989022: Handle mixed provider nodests and ipv6 availability https://review.opendev.org/c/opendev/system-config/+/989022 | 20:05 | |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989022: Handle mixed provider nodests and ipv6 availability https://review.opendev.org/c/opendev/system-config/+/989022 | 20:22 | |
| @clarkb:matrix.org | fungi: I have approved the static02 and static04 inventory removal change | 20:24 |
| @fungicide:matrix.org | i guess if we land 989023 to remove the unused static servers from our inventory, that will re-exercise the let's encrypt playbook in deploy and hopefully get through all the depending jobs | 20:24 |
| @clarkb:matrix.org | yup | 20:24 |
| @fungicide:matrix.org | great timing | 20:24 |
| @clarkb:matrix.org | in 989022 I'm struggling with getting all the type conversions and attribute lookups to line up. But hopefully each update is a small bit of progress | 20:26 |
| @jim:acmegating.com | Clark: what arm nodes do we have? | 20:27 |
| @fungicide:matrix.org | we test on arm in osuosl because we run a mirror server there | 20:28 |
| @fungicide:matrix.org | so the system-config-run-mirror job | 20:28 |
| @clarkb:matrix.org | yes I think that is the only arm node left in production. Everything else is a test node | 20:29 |
| @fungicide:matrix.org | er, system-config-run-mirror-arm64 technically | 20:29 |
| @fungicide:matrix.org | and also system-config-run-base-arm64 of course | 20:29 |
| @jim:acmegating.com | got it (and that doesn't require arm, it's just the only thing available in that region) | 20:30 |
| @fungicide:matrix.org | right. if it were a mixed-architecture region we could run an amd64 mirror serving the arm64 nodes located there | 20:30 |
| @fungicide:matrix.org | but since it's the only architecture in that region we want to make sure we can continue to deploy and manage a mirror server in it | 20:31 |
| @jim:acmegating.com | do we have any ipv6 only clouds? | 20:31 |
| @fungicide:matrix.org | also makes for an interesting test-case for zuul-launcher's new cross-provider nodeset capability | 20:32 |
| @fungicide:matrix.org | we do not currently have any ipv6-only clouds, no | 20:32 |
| @fungicide:matrix.org | though earlier we discussed the potential challenges with booting an ipv4-only bridge99 and some other node in an ipv6-only provider | 20:32 |
| @fungicide:matrix.org | or vice versa | 20:33 |
| @jim:acmegating.com | then i think the thing to do kind of depends on what we want to test; if we want the test to exercise real-world v4->v6 bridge->server connectivity, then we'll need something like clark's change, so that we make the most of whatever the clouds give us. i have doubts whether we care about this very much though, since it's luck of the draw whether we get v6 capable server nodes. if we don't care about that so much then: | 20:34 |
| @fungicide:matrix.org | ftr, this initially came up because https://review.opendev.org/c/opendev/system-config/+/988698 is attempting to work around lack of arm wheels for some python packages ansible needs, which resulted in the ansible v9 upgrade breaking the arm jobs | 20:34 |
| @jim:acmegating.com | a zuul-ish way of solving this would be to turn off v6 on all our clouds. possibly scoped to just some system-config specific labels, or potentially globally. | 20:34 |
| @jim:acmegating.com | (by "turn off v6" i mean "tell zuul-launcher not to use v6 addresses") | 20:35 |
| @fungicide:matrix.org | corvus: yeah, we also talked about maybe making zuul-launcher smart enough to set the v4 address as preferred on any dual-stack node when at least one node in the nodeset is v4-only | 20:35 |
| @clarkb:matrix.org | there is also the issue of empty ipv6 addresses | 20:36 |
| @jim:acmegating.com | that's something worth considering -- but it also means inconsistent behavior from a given provider | 20:36 |
| @fungicide:matrix.org | right, the missing address is a separate (likely openstacksdk) bug which seems to just be affecting rackspace classic | 20:36 |
| @jim:acmegating.com | like, you get v6 unless you get something from another provider in which case you get v4. convenient for this case, but is it universally sensible? | 20:37 |
| @fungicide:matrix.org | i suppose it depends on how likely most people's multi-node jobs are to want all the nodes to be able to communicate with each other via their ansible host identifiers | 20:38 |
| @fungicide:matrix.org | (specifically in heterogenous multi-provider-aware deployments with nodesets needing resources from more than one provider) | 20:39 |
| @jim:acmegating.com | i'm arguing there's an extra clause that belongs in that statement: "and they also do not want to disable ipv6 on the labels involved" | 20:39 |
| @fungicide:matrix.org | so seems like the potential target audience for such an optimization is vanishingly small and maybe "just opendev" | 20:39 |
| @jim:acmegating.com | yes, i would bet a nickel that all versions of this qualifier (yours and mine) match only opendev. :) | 20:40 |
| @fungicide:matrix.org | anyway, that's why we started with a workaround in our own playbooks | 20:41 |
| @jim:acmegating.com | incidentally, that's probably the thing that would convince me it's okay to have zuul-launcher downgrade the protocol: there's probably no other use case for this configuration than opendev :) | 20:42 |
| @jim:acmegating.com | (it's a coin flip which behavior is "better", and there's only one user who cares) | 20:43 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989022: Handle mixed provider nodests and ipv6 availability https://review.opendev.org/c/opendev/system-config/+/989022 | 20:44 | |
| @clarkb:matrix.org | fwiw I think that generally being able to talk between nodeset nodes is a good goal unless intentionally broken. And in general I think jobs would figure that out themselves based on the network info. The oddity here is our nested ansible and reuse of the inventory | 20:45 |
| @clarkb:matrix.org | and ya it is probably unlikely to be useful to many others (because nested ansible seems less common and few have the heterogenous envs that we do) | 20:46 |
| @jim:acmegating.com | note that if we do want to have zuul-launcher do this, it will require a small bit of refactoring since currently the drivers themselves select the interface ip, not the launcher. so we'd need to move that logic to the launcher. in the case of the openstack driver, we use the value returned by openstacksdk, so we would end up no longer using that (that's probably fine) | 20:46 |
| @clarkb:matrix.org | I'm happy to start with our own little workaround assuming I can ever get the jinja filters to work before changing anything in zuul | 20:47 |
| @jim:acmegating.com | ack. maybe the second time we hit this issue we look at the zl change. :) | 20:47 |
| @clarkb:matrix.org | re the `public_v6: ''` values I'm assuming we can try and reproduce that using sdk directly to list a node and see what value we get back for the address | 20:50 |
| @clarkb:matrix.org | https://zuul.opendev.org/t/openstack/build/ead900ebf29e4e3290db0531f0557849/log/zuul-info/inventory.yaml#26 actually this is affecting ovh too | 20:51 |
| @clarkb:matrix.org | so not a rax classic specific problem. Which maybe means you can reproduce this more easily | 20:51 |
| -@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/system-config] 989023: Drop static02 and static04 from inventory https://review.opendev.org/c/opendev/system-config/+/989023 | 20:52 | |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989022: Handle mixed provider nodests and ipv6 availability https://review.opendev.org/c/opendev/system-config/+/989022 | 20:56 | |
| @clarkb:matrix.org | ok that seems like it is getting closer. I managed to address the prior issue with dict2items. And then I had a small syntax bug. fixing that will probably just uncover the next small issue. But I have to do a school run in a few minutes so feel free to continue to iterate on that while I'm doing that. Otherwise I'll pick it up when I get back. | 20:57 |
| @clarkb:matrix.org | Also I'll be making meeting agenda edits when I get back. Let me know if there are any edits you'd like to see | 20:57 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989022: Handle mixed provider nodests and ipv6 availability https://review.opendev.org/c/opendev/system-config/+/989022 | 22:00 | |
| @clarkb:matrix.org | fungi: remote puppet else failed as did lists3 on the deploy for static02 and static04 removal so that is an improvement | 22:09 |
| @clarkb:matrix.org | I haven't looked into those failures but the remote puppet else failure may be related to the python version change or ansible 9 update I suppose | 22:10 |
| @fungicide:matrix.org | let's encrypt job i was just looking at that | 22:10 |
| @fungicide:matrix.org | infra-prod-letsencrypt succeeded, but infra-prod-service-lists3 and infra-prod-remote-puppet-else failed | 22:10 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989022: Handle mixed provider nodests and ipv6 availability https://review.opendev.org/c/opendev/system-config/+/989022 | 22:11 | |
| @fungicide:matrix.org | ERROR: for anubis Head "https://ghcr.io/v2/techarohq/anubis/manifests/v1.25.0": Get "https://ghcr.io/token?scope=repository%3Atecharohq%2Fanubis%3Apull&service=ghcr.io": context deadline exceeded (Client.Timeout exceeded while awaiting headers) | 22:11 |
| @fungicide:matrix.org | so i think that's probably just a one-off | 22:11 |
| @clarkb:matrix.org | ya that looks like lists trying to figure out if there are updates to the v1.25.0 anubis image. I agree we can ignore that one | 22:12 |
| @clarkb:matrix.org | unless it becomes persistent then figure out mirroring that image or something. but for now it's enough to ignore I think | 22:12 |
| @fungicide:matrix.org | digging in `/var/log/ansible/remote_puppet_else.yaml.log` i'm not immediately seeing the problem, no tasks were failed or unreachable | 22:13 |
| @fungicide:matrix.org | aha! | 22:13 |
| @fungicide:matrix.org | `ERROR! [DEPRECATED]: ansible.builtin.include has been removed. Use include_tasks or import_tasks instead. This feature was removed from ansible-core in a release after 2023-05-16. Please update your playbooks.` | 22:14 |
| @fungicide:matrix.org | `The error appears to be in '/etc/ansible/roles/puppet/tasks/main.yaml': line 131, column 3, but may be elsewhere in the file depending on the exact syntax problem.` | 22:14 |
| @fungicide:matrix.org | no other obvious error messages in the log | 22:14 |
| @clarkb:matrix.org | of course finding the source of that role might be fun | 22:15 |
| @clarkb:matrix.org | given the path I'm assuming that is a separate repo | 22:15 |
| @clarkb:matrix.org | https://opendev.org/opendev/ansible-role-puppet likely | 22:16 |
| @clarkb:matrix.org | yup https://opendev.org/opendev/ansible-role-puppet/src/branch/master/tasks/main.yaml#L133 I think this matches | 22:17 |
| @clarkb:matrix.org | note the line number is for the task definition starting so it's off by a couple but it matches | 22:18 |
| @clarkb:matrix.org | fungi: so I think that include: should just be include_tasks: and it will be happy again | 22:19 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989022: Handle mixed provider nodests and ipv6 availability https://review.opendev.org/c/opendev/system-config/+/989022 | 22:26 | |
| -@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/ansible-role-puppet] 989028: Use include_tasks instead of include https://review.opendev.org/c/opendev/ansible-role-puppet/+/989028 | 22:29 | |
| @fungicide:matrix.org | Clark: like that ^ ? | 22:29 |
| @fungicide:matrix.org | anyway, i'm stepping away for the night, but can pick this back up in the morning if nobody beats me to it | 22:30 |
| @clarkb:matrix.org | fungi: ya that looks right. I won't merge anything with what day is left for me. However, my change for the ipv6 handling in inventory is looking like it may be working now. I'll rebase your x86 bridge change on top of it if that is the case | 22:38 |
| -@gerrit:opendev.org- Clark Boylan proposed wip on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/system-config] 988698: Use mixed arches for system-config-run-base-arm64 https://review.opendev.org/c/opendev/system-config/+/988698 | 22:40 | |
| @clarkb:matrix.org | a few test jobs have reported success so I've done the rebase there ^ | 22:41 |
| @clarkb:matrix.org | ok I just did a first pass on the meeting agenda. I know it's late so I don't expect any input at this point. But just in case there are thoughts I'll wait another 10-20 minutes before sending the email to make it official | 22:49 |
| @clarkb:matrix.org | fungi: hrm maybe testinfra isn't using the inventory like I thought it was. It's still trying to connect to ipv6 like before even after rebased onto my change | 23:27 |
Generated by irclog2html.py 4.1.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!
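
For readers following the address-selection discussion above (prefer v4 for the whole nodeset when at least one node lacks a usable v6 address, and treat `public_v6: ''` as missing), here is a minimal Python sketch of that rule. It is not the code from change 989022, nor anything in zuul-launcher. The function names, the node names, and the `public_v4` key are assumptions made for illustration; only the `public_v6` key and its empty-string value appear in the log.

```python
from typing import Dict, List, Optional

# Hypothetical node record: {"name": ..., "public_v4": ..., "public_v6": ...}
Node = Dict[str, Optional[str]]


def usable(addr: Optional[str]) -> bool:
    """Treat None, '' and whitespace-only values as 'no address'."""
    return bool(addr and addr.strip())


def pick_connection_addresses(nodes: List[Node]) -> Dict[str, Optional[str]]:
    """Choose one address family for an entire nodeset.

    IPv6 is used only when every node has a usable IPv6 address.
    Otherwise every node falls back to IPv4 so that all nodes can
    reach each other through the same family.
    """
    all_v6 = all(usable(n.get("public_v6")) for n in nodes)
    key = "public_v6" if all_v6 else "public_v4"
    return {str(n["name"]): n.get(key) for n in nodes}


if __name__ == "__main__":
    # Documentation-range addresses; the node names are made up.
    nodeset = [
        {"name": "bridge", "public_v4": "203.0.113.10", "public_v6": "2001:db8::10"},
        {"name": "arm-node", "public_v4": "203.0.113.20", "public_v6": ""},
    ]
    print(pick_connection_addresses(nodeset))
```

The sketch picks a single family per nodeset rather than per node, which is the trade-off raised in the discussion: a dual-stack node's address then depends on which other nodes share its nodeset.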