19:01:14 #startmeeting infra
19:01:14 Meeting started Tue May 31 19:01:14 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:14 The meeting name has been set to 'infra'
19:01:21 #link https://lists.opendev.org/pipermail/service-discuss/2022-May/000338.html Our Agenda
19:01:45 we have an agenda but I forgot to add a couple of items before I sent it out (too many distractions) so we'll audible those in as we go
19:01:51 #topic Announcements
19:02:01 The OpenInfra Summit is happening next week
19:02:24 o/
19:02:26 keep that in mind when making changes (particularly to etherpad which will be used by the colocated forum)
19:02:41 but also several of us will be distracted and on different timezones than normal
19:03:04 yeah, i'll be living on cest all week
19:03:37 #topic Actions from last meeting
19:03:42 #link http://eavesdrop.openstack.org/meetings/infra/2022/infra.2022-05-24-19.01.txt minutes from last meeting
19:03:45 There were no actions
19:03:48 #topic Topics
19:03:53 #topic Improving CD throughput
19:04:02 #link https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul_reboot.yaml Automated graceful Zuul upgrades and server reboots
19:04:15 Zuul management has become related to this topic; we landed that playbook and ran it last week
19:04:35 It mostly worked as expected. Zuul mergers didn't gracefully stop (they stopped doing work but then never exited). That bug has since been fixed
19:04:40 went very smoothly, only a few bumps along the way
19:05:01 The other issue we hit was that zuul updated its model API version from 7 to 8 through this upgrade, and there was a bug in that transition
19:05:20 we managed to work around that by dequeuing the affected buildsets and re-enqueuing them
19:05:41 as the model differences were basically deleted and then recreated on the expected content version when re-enqueued
19:05:59 Zuul has also addressed this problem upstream (though it shouldn't be a problem for us now that we've updated to API version 8 anyway)
19:06:22 One change I made after the initial run was to add more package updating to the playbook, so there is a small difference to pay attention to when we next run it
19:06:55 One thing I realized about ^ is that I'm not sure what happens if we try to do that when the auto updater is running. We might hit dpkg lock conflicts and fail, but we can keep learning as we go
19:07:13 That was all I had on this topic. Anything else?
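On the dpkg lock concern above: a minimal sketch of how a pre-update check could work, assuming the playbook runs as root on Debian/Ubuntu hosts and that apt/dpkg take their usual fcntl lock on /var/lib/dpkg/lock-frontend. This is not part of the zuul_reboot.yaml playbook, just one illustrative way to avoid racing unattended-upgrades.

```python
#!/usr/bin/env python3
"""Sketch: wait for the dpkg frontend lock to be free before updating packages.

Assumes it runs as root (like apt itself) and that apt/dpkg hold a POSIX
fcntl lock on /var/lib/dpkg/lock-frontend while they work.
"""

import errno
import fcntl
import os
import sys
import time

LOCK_PATH = "/var/lib/dpkg/lock-frontend"


def dpkg_locked(path=LOCK_PATH):
    """Return True if another process currently holds the dpkg frontend lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    try:
        # Non-blocking attempt to take the same style of lock apt uses.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return False
    except OSError as exc:
        if exc.errno in (errno.EAGAIN, errno.EACCES):
            return True
        raise
    finally:
        # Closing the descriptor releases the lock if we managed to take it.
        os.close(fd)


if __name__ == "__main__":
    deadline = time.monotonic() + 600  # give an in-progress auto update ~10 minutes
    while dpkg_locked():
        if time.monotonic() > deadline:
            sys.exit("dpkg lock still held; skipping package updates this run")
        time.sleep(15)
    print("dpkg lock is free; safe to run package updates")
```

An alternative on the Ansible side would simply be to retry the package task a few times with a delay, which needs no extra tooling on the host.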
19:07:41 the upgrade brought us new openstacksdk, but we can cover that in another topic
19:07:47 yup, that's next
19:09:08 #topic Job log uploads to swift now with broken CORS headers
19:09:38 (only on rackspace, as far as we've observed)
19:09:38 We noticed ~Friday that job logs uploaded to swift didn't have CORS headers, but only with rax (ovh is fine)
19:10:01 The current suspicion is that the openstacksdk 0.99.0 release, which we picked up by restarting and upgrading executors with the above playbook, may be to blame
19:10:11 yeah, and anything which was uploaded to swift on/before thursday seems to still work fine
19:10:21 https://review.opendev.org/844090 will downgrade the version of openstacksdk on the executors and is in the zuul gate
19:10:25 which lends weight to it being related to the update
19:10:33 we should be able to confirm/deny this theory by upgrading our executors to that newer image once it lands
19:10:58 or at least significantly increase our confidence level that it's related
19:11:14 if downgrading sdk doesn't help we'll need to debug further, as we're currently only uploading logs to ovh which is less than ideal
19:11:30 technically we still only have time-based correlation to back it up, even if the problem goes away at the next restart
19:11:59 yup, that and we now know the 0.99.0 openstacksdk release is one that's expected to break things in general
19:12:04 (we expected that with 1.0.0 but got it early)
19:12:46 oh, also on ~saturday we merged a change to temporarily stop uploading logs to rackspace until we confirm via base-test that things have recovered
19:13:00 i think we can batch a lot of these changes in the next zuul restart
19:13:05 so this really was only an impact window of 24-48 hours' worth of log uploads
19:13:13 i'll keep an eye out for that and let folks know when i think the time is ripe
19:13:18 corvus: thanks
19:13:43 and we'll pick this up further if downgrading sdk doesn't help
19:13:55 but for now be aware of that and the current process being taken to try to debug/fix the issue
19:13:55 ok, happy to help with verifying things once we're in a place to upgrade (downgrade?) executors
19:14:08 as for people with job results impacted during that window, the workaround is to look at the raw logs linked from the results page
19:14:09 ianw: we need to upgrade executors, which will downgrade openstacksdk in the images
19:15:11 Anything else on this subject?
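For reference, a minimal sketch of the kind of check that tells whether a swift-hosted log object comes back with CORS headers. The URL and origin below are placeholders, and this only exercises the simple GET case, not an OPTIONS preflight.

```python
#!/usr/bin/env python3
"""Quick check for the CORS header on a swift-hosted job log, roughly what a
browser-based dashboard needs in order to fetch logs cross-origin."""

import sys
import urllib.request

# Placeholder URL; substitute a real log object from a recent build.
LOG_URL = "https://storage.example.com/v1/AUTH_abc/zuul_logs/example/job-output.json"
ORIGIN = "https://zuul.opendev.org"


def check_cors(url, origin):
    """Send a GET with an Origin header and inspect the CORS response header."""
    request = urllib.request.Request(url, headers={"Origin": origin})
    with urllib.request.urlopen(request) as response:
        allow = response.headers.get("Access-Control-Allow-Origin")
    if allow in ("*", origin):
        print(f"OK: Access-Control-Allow-Origin: {allow}")
        return True
    print(f"Missing or unexpected CORS header: {allow!r}")
    return False


if __name__ == "__main__":
    sys.exit(0 if check_cors(LOG_URL, ORIGIN) else 1)
```

Running the same check against an ovh-hosted log and a rax-hosted log from the affected window is enough to see the difference described above.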
19:15:53 nada
19:16:01 #topic Glean doesn't handle static ipv6 configuration on rpm distros
19:16:18 We discovered recently that glean isn't configuring ipv6 statically on rpm/red hat distros
19:16:34 this particular combo happens on OVH because they are now publishing the ipv6 network info in their config drives but don't do RAs
19:16:49 statically meaning addressing supplied via instance metadata
19:16:51 all of the other clouds seem to also do RAs in addition to the config drive info, so we don't have to statically configure ipv6 to get working ipv6
19:17:10 fungi: right, and with the expectation that the host statically configure it rather than relying on RAs
19:17:33 in the past this was a non-issue because OVH didn't include that info in the metadata via config drive
19:17:39 #link https://review.opendev.org/q/topic:rh-ipv6 Changes to add support to glean
19:17:53 ianw has picked this up and has a stack of changes ^ that I'm now noticing I need to re-review
19:17:57 thank you for picking that up
19:18:17 this will be awesome, because it means we can finally do ipv6 in ovh generally, but also we can consider running servers there now as well
19:18:34 ++
19:18:43 the lack of automatable v6 (because it wasn't provided via metadata) was a blocker for us in the past
19:18:58 ianw: any gotchas or things to be aware of as we review those changes?
19:19:51 right, so rax/ovh both provide "ipv6" in their metadata, vexxhost provides "ipv6_slaac"
19:20:06 but only ovh doesn't also advertise RAs
19:20:29 I guess the rax metadata is maybe slightly wrong if it isn't ipv6_slaac? did you test these changes on rax to ensure we don't regress there by statically configuring things?
19:21:01 I don't expect it to be a problem; the kernel might just configure ipv6 first or NM will do it later, depending on when RAs are received
19:21:04 i haven't touched the !rh path at all, so that should not change
19:21:12 ianw: I mean for rh on rax
19:21:30 since we'll be going from ignoring it entirely and using kernel RA default handling to static NM configuration of ipv6
19:21:56 still probably fine so long as the metadata and neutron are in agreement
19:22:05 nm doesn't configure ipv6 there, for rh platforms
19:22:19 but the kernel doesn't accept RAs either
19:22:43 so basically we're moving from no ipv6 -> statically configured ipv6 on rax
19:22:49 no, rax has ipv6 today
19:23:04 I believe it is accepting RAs by default and glean is just ignoring it all
19:23:37 that is the difference between rax and ovh. Rax sends RAs so we got away with glean not having support, but ovh does not. Now that ovh adds the info to the config drive we notice
19:24:09 i don't think so, e.g. 172.99.69.200 is running 9-stream now on rax and doesn't have ipv6
19:24:23 oh, it may only be working with debuntu nodes
19:24:26 the interface doesn't have IPV6INIT=yes set for it
19:24:59 https://paste.opendev.org/show/b3ya7eC9zN7oyzrEgQpk/ is a 9-stream with the change on rax i tested earlier
19:25:49 huh, how did we not notice this issue in the past then? The ovh change is relatively new but I doubt rax has changed much
19:25:55 yeah, the debuntu case is different -- i called that out in https://review.opendev.org/c/opendev/glean/+/843996
19:25:56 anyway sounds like we'll get ipv6 in rax on rh too then
19:26:04 an improvement all around
19:26:08 https://zuul.opendev.org/t/openstack/build/429f8a274074476c9de7792aa71f5258/log/zuul-info/zuul-info.ubuntu-focal.txt#30
19:26:29 that ran in rax-dfw on focal and got ipv6 addressing
19:26:32 i feel like probably the debuntu path should be setting something in its interface files to say "use autoconfig on this interface"
19:26:55 ianw: well if the type is "ipv6" you are not supposed to autoconfig
19:26:59 "ipv6_slaac" you do
19:27:02 iirc
19:27:14 [we noticed in zuul because of the particular way openshift 3 tries to start up (it relies on an ipv6 /etc/hosts record and does not fall back to ipv4) -- perhaps nothing else behaves like that]
19:27:41 corvus: ah, that could be
19:28:17 anyway it sounds like there is a stack of changes that will fix this properly and they just need review. I've put reviewing those on my todo list for this afternoon
19:28:20 yeah, that sounds like a somewhat naive implementation
19:28:38 fungi: it's specifically how neutron defines it
19:28:42 but a great canary
19:28:49 like "iface eth inet6 auto"
19:29:19 i mean openshift's decision to base ipv6 detection on /etc/hosts content seems naive
19:29:22 ah
19:29:41 i feel like the issue is maybe that the network management tools and the kernel start to get in each other's way if the network management tools don't know ipv6 is autoconfigured
19:29:51 but, maybe the way we operate it just never becomes an issue
19:30:52 Thank you for digging into that. Anything else before we move on?
19:30:55 anyway, there is basically no cross-over between the debian path and the rh path in any of the changes. so nothing should change on the !rh path
19:31:35 sounds very promising
19:31:47 do we have a consensus on the revert in that stack?
19:31:58 i saw it got reshuffled
19:32:13 I think it is at the bottom of the stack still
19:32:20 (at least gerrit appears to be telling me it is)
19:32:36 oh, yeah i just missed a file on that one
19:32:55 I'm ok with that, I suspect we were the only users of the feature. We just need to communicate when we release that the feature has been removed and tag appropriately
19:33:09 and if necessary can add it back in again if anyone complains
19:33:43 yeah, the way it tests "OVH" but is actually testing "ignore everything" was quite confusing
19:33:49 yeah, i'm okay with no deprecation period as long as everyone else is in agreement and we just communicate thoroughly
19:34:24 if anyone's going to be impacted it's probably ironic, and i doubt they used it either
19:34:31 (i haven't checked codesearch)
19:34:39 ya, I think ironic makes a lot of use of dhcp on all interfaces
19:34:44 i did check codesearch and nothing i could see was setting it
19:34:51 wfm. thanks!
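To make the "ipv6" vs "ipv6_slaac" distinction above concrete, here is a minimal illustrative sketch (not the actual code in the rh-ipv6 stack under review) of turning a config-drive network_data.json entry of type "ipv6" into an ifcfg-style fragment with IPV6INIT=yes, which is the piece currently missing on the rpm path. The key names (ip_address, netmask) follow the usual network_data.json layout; routes/gateways, matching interfaces by MAC, and NetworkManager keyfiles are all left out.

```python
"""Illustrative sketch of static IPv6 handling for a config-drive
network_data.json entry of type 'ipv6' on an rpm-style platform."""

import ipaddress
import json

# Illustrative ifcfg template; the real glean templates and key names may differ.
IFCFG_TEMPLATE = """DEVICE={dev}
BOOTPROTO=none
ONBOOT=yes
IPV6INIT=yes
IPV6AUTOCONF=no
IPV6ADDR={addr}/{prefix}
"""


def prefix_length(netmask):
    """Convert an IPv6 netmask like ffff:ffff:ffff:ffff:: to a prefix length."""
    return bin(int(ipaddress.IPv6Address(netmask))).count("1")


def render_ipv6(network_data, device="eth0"):
    """Render ifcfg fragments for every static 'ipv6' network in the metadata."""
    fragments = []
    for net in network_data.get("networks", []):
        if net.get("type") != "ipv6":
            continue  # 'ipv6_slaac' and friends keep relying on RAs, so skip them
        fragments.append(IFCFG_TEMPLATE.format(
            dev=device,
            addr=net["ip_address"],
            prefix=prefix_length(net["netmask"]),
        ))
    return fragments


if __name__ == "__main__":
    # Path assumes the config drive is mounted at /mnt/config.
    with open("/mnt/config/openstack/latest/network_data.json") as f:
        data = json.load(f)
    for fragment in render_ipv6(data):
        print(fragment)
```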
19:34:53 which is essentially what glean fell back to there, and they get it from other tools
19:35:39 also, up to https://review.opendev.org/c/opendev/glean/+/843979/2 is all no-op refactoring
19:36:05 i was thinking of 2 releases; the first with that to make sure there's no regressions, then a second one a day later that actually turns on ipv6
19:36:12 that seems reasonable to me
19:36:44 i'll probably announce pending ipv6 enablement to the service-discuss list as a heads up
19:36:56 ++
19:36:57 before we tag any release that does that, and the next build picks it up
19:37:12 yeah, even just getting working v6 on ovh is likely to be disruptive somehow
19:37:53 Alright, let's continue as we have a few more items to get to
19:38:01 this all sounds good to me though and I'll try to review shortly
19:38:06 #topic Container Maintenance
19:38:20 ianw: on this topic I was wondering if you had time to look at the mariadb container upgrade path via the env var
19:38:27 if not that's fine (a lot has been going on)
19:39:09 sorry not yet, but top of todo list is to recheck that and the new gerrit release (the ipv6 was way more work than i thought it would be :)
19:39:44 #topic Gerrit 3.5 upgrade
19:40:04 #link https://review.opendev.org/c/opendev/system-config/+/843298 new gerrit minor releases and our image updates for them
19:40:18 Getting that landed and production updated is probably a good idea before we upgrade
19:40:36 ianw: are there any other upgrade-related changes that need review or items that need assistance?
19:41:34 i can look at a quick restart for the minor releases during my afternoon
19:41:42 thanks
19:42:01 no updates, but track
19:42:01 I do feel like I'm getting sucked into travel and summit prep and want to not be tied to too many ops obligations
19:42:04 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.5
19:42:15 if you're interested in keeping up, anything will be logged there
19:42:27 noted, thanks
19:42:33 #topic Manually triggering periodic jobs in zuul
19:42:46 I saw some talk about this at one point so I kept it on the agenda this week
19:43:30 Does anyone know if we managed to trigger a periodic job manually? If not that's fine and we can continue, but I wanted to make sure we called it out if someone managed it and has notes to share
19:43:55 I haven't succeeded so far
19:44:02 still planning to do local testing
19:44:17 thanks for the update
19:44:28 ++ on getting instructions
19:44:50 #topic Zuul changing default versions of Ansible to v5
19:45:13 There is still no hard date for when zuul will make this change, but I think we should continue to aim for end of June in opendev's deployment
19:45:36 note the version needs to be a string in yaml (this was something else we hit when we upgraded zuul last week)
19:45:54 Our image updates to include the acl package appear to have made devstack happy, and I think the zuul-jobs fixups have all landed
19:46:15 frickler: you mentioned wanting to test kolla jobs with newer ansible. Have you managed to do that yet? anything to report if so?
19:46:42 nope, still another todo
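On the note above about the Ansible version needing to be a string in YAML: a quick illustration of why that matters, since an unquoted value loads as an integer (or a float, for dotted versions) rather than the string the configuration expects. The ansible-version key is shown only as an example setting.

```python
"""Demonstrate how YAML types an Ansible version value depending on quoting."""

import yaml  # PyYAML

unquoted = yaml.safe_load("ansible-version: 5")
quoted = yaml.safe_load('ansible-version: "5"')

print(type(unquoted["ansible-version"]).__name__)  # int
print(type(quoted["ansible-version"]).__name__)    # str

# Dotted versions are even less forgiving: an unquoted 2.9 becomes the float
# 2.9, so quoting the value consistently avoids type surprises in config.
print(type(yaml.safe_load("ansible-version: 2.9")["ansible-version"]).__name__)  # float
```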
19:46:46 I think I'll send email about this change at the end of next week assuming we don't hit any major blockers by then
19:46:57 I'll send that to service-announce
19:47:13 and plan for June 30 change (it's a thursday)
19:48:41 #topic Removing Ethercalc
19:48:50 I have shut down and snapshotted this server
19:49:09 I figure there isn't much harm in waiting a day or two before actually deleting it, just in case someone starts screaming about problems due to this
19:49:24 any objections to leaving the turned-off server around for a couple of days before I delete it and its dns records?
19:49:54 also if you have a moment to double-check the snapshot exists and looks sane, please do.
19:50:10 nope, we have backups too
19:50:12 lgtm
19:50:34 i guess they will prune but otherwise just stay around
19:50:40 good point
19:50:48 alright, ~thursday I'll delete the server and its dns records
19:51:16 #topic Do we want to have a meeting next week
19:51:36 last item on the agenda. Do we want to have a meeting next week? I think fungi, corvus, and I will be unable to make it
19:52:03 correct; i vote no
19:52:06 yeah, i expect to be at a dinner at that time, i think
19:52:20 at any rate, i won't be around
19:52:41 I guess we can skip
19:53:15 I won't plan to send an agenda, but if others do want a meeting and can run it feel free to send one and drive it
19:53:36 but ya, I expect it is a good day for ianw to sleep in and otherwise not worry about it :)
19:53:54 heh, yeah, let's see if anything comes up during the week that needs an eye kept on it
19:54:32 #topic Open Discussion
19:54:34 Anything else?
19:54:34 the openstacksdk 1.0.0 release seems to have potential :) but hopefully not!
19:54:49 ianw: we'll probably have everything pinned to 0.61.0 by then :)
19:56:16 i ran across an explanation of how browsers detect whether a page is trying to "auto" play audio; apparently they check to see if the user interacts by clicking or entering things in the page first, so guidance from jitsi-meet is to turn on a landing page where users preselect their options before proceeding to the room, which is generally sufficient to keep browsers from thinking the page is just spontaneously generating sound
19:56:42 seems like this may address a lot of the "my sound doesn't work" complaints we've been getting about meetpad
19:57:15 fungi: are there changes up to add the landing page thing to our jitsi?
19:57:22 I know you were working on it, then I got totally distracted
19:57:25 i've also noticed that the upstream config files and examples have diverged quite a bit from the copies we forked, and a number of the options we set have become defaults now, so i'm trying to rectify all that first
19:57:56 the current upstream configs have turned on that landing page by default for some time
19:58:13 ah, so if we sync with upstream then we'll get it for free
19:58:22 so if i realign our configs with the upstream defaults as much as possible, that should just come for free, yes
19:58:52 but in an effort not to destabilize anything people might be using for remote participation in forum sessions, i think we probably shouldn't merge any major config updates until after next week
19:59:02 wfm
19:59:30 i'll try to push up something which minimizes our divergence from the upstream configs though, for further discussion
20:00:00 thanks
20:00:02 and we are at time
20:00:15 Thank you everyone for joining us. We likely won't be here next week but should be back in two weeks
20:00:20 #endmeeting