19:01:14 #startmeeting infra
19:01:14 Meeting started Tue May 31 19:01:14 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:14 The meeting name has been set to 'infra'
19:01:21 #link https://lists.opendev.org/pipermail/service-discuss/2022-May/000338.html Our Agenda
19:01:45 we have an agenda but I forgot to add a couple of items before I sent it out (too many distractions) so we'll audible those in as we go
19:01:51 #topic Announcements
19:02:01 The OpenInfra Summit is happening next week
19:02:24 o/
19:02:26 keep that in mind when making changes (particularly to etherpad which will be used by the colocated forum)
19:02:41 but also several of us will be distracted and on different timezones than normal
19:03:04 yeah, i'll be living on cest all week
19:03:37 #topic Actions from last meeting
19:03:42 #link http://eavesdrop.openstack.org/meetings/infra/2022/infra.2022-05-24-19.01.txt minutes from last meeting
19:03:45 There were no actions
19:03:48 #topic Topics
19:03:53 #topic Improving CD throughput
19:04:02 #link https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul_reboot.yaml Automated graceful Zuul upgrades and server reboots
19:04:15 Zuul management has become related to this topic; we landed that playbook and ran it last week
19:04:35 It mostly worked as expected. Zuul mergers didn't gracefully stop (they stopped doing work but then never exited). That bug has since been fixed
19:04:40 went very smoothly, only a few bumps along the way
19:05:01 The other issue we hit was that zuul updated its model API version from 7 to 8 through this upgrade, and there was a bug in that transition
19:05:20 we managed to work around that by dequeuing the affected buildsets and re-enqueuing them
19:05:41 as the model differences were basically deleted and then recreated on the expected content version when re-enqueued
19:05:59 Zuul has also addressed this problem upstream (though it shouldn't be a problem for us now that we've updated to API version 8 anyway)
19:06:22 One change I made after the initial run was to add more package updating to the playbook, so there is a small difference to pay attention to when we next run it
19:06:55 One thing I realized about ^ is that I'm not sure what happens if we try to do that when the auto updater is running. We might hit dpkg lock conflicts and fail, but we can keep learning as we go
19:07:13 That was all I had on this topic. Anything else?
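On the dpkg lock concern above: a minimal sketch of how a pre-update check could work, assuming the playbook runs as root on Debian/Ubuntu hosts and that apt/dpkg take their usual fcntl lock on /var/lib/dpkg/lock-frontend. This is not part of the zuul_reboot.yaml playbook, just one illustrative way to avoid racing unattended-upgrades.

```python
#!/usr/bin/env python3
"""Sketch: wait for the dpkg frontend lock to be free before updating packages.

Assumes it runs as root (like apt itself) and that apt/dpkg hold a POSIX
fcntl lock on /var/lib/dpkg/lock-frontend while they work.
"""

import errno
import fcntl
import os
import sys
import time

LOCK_PATH = "/var/lib/dpkg/lock-frontend"


def dpkg_locked(path=LOCK_PATH):
    """Return True if another process currently holds the dpkg frontend lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    try:
        # Non-blocking attempt to take the same style of lock apt uses.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return False
    except OSError as exc:
        if exc.errno in (errno.EAGAIN, errno.EACCES):
            return True
        raise
    finally:
        # Closing the descriptor releases the lock if we managed to take it.
        os.close(fd)


if __name__ == "__main__":
    deadline = time.monotonic() + 600  # give an in-progress auto update ~10 minutes
    while dpkg_locked():
        if time.monotonic() > deadline:
            sys.exit("dpkg lock still held; skipping package updates this run")
        time.sleep(15)
    print("dpkg lock is free; safe to run package updates")
```

An alternative on the Ansible side would simply be to retry the package task a few times with a delay, which needs no extra tooling on the host.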
19:07:41 the upgrade brought us new openstacksdk, but we can cover that in another topic
19:07:47 yup, that's next
19:09:08 #topic Job log uploads to swift now with broken CORS headers
19:09:38 (only on rackspace, as far as we've observed)
19:09:38 We noticed ~Friday that job logs uploaded to swift didn't have CORS headers, but only with rax (ovh is fine)
19:10:01 The current suspicion is that the openstacksdk 0.99.0 release, which we picked up by restarting and upgrading executors with the above playbook, may be to blame
19:10:11 yeah, and anything which was uploaded to swift on/before thursday seems to still work fine
19:10:21 https://review.opendev.org/844090 will downgrade the version of openstacksdk on the executors and is in the zuul gate
19:10:25 which lends weight to it being related to the update
19:10:33 we should be able to confirm/deny this theory by upgrading our executors to that newer image once it lands
19:10:58 or at least significantly increase our confidence level that it's related
19:11:14 if downgrading sdk doesn't help we'll need to debug further, as we're currently only uploading logs to ovh which is less than ideal
19:11:30 technically we still only have time-based correlation to back it up, even if the problem goes away at the next restart
19:11:59 yup, that and we now know the 0.99.0 openstacksdk release is one that's expected to break things in general
19:12:04 (we expected that with 1.0.0 but got it early)
19:12:46 oh, also on ~saturday we merged a change to temporarily stop uploading logs to rackspace until we confirm via base-test that things have recovered
19:13:00 i think we can batch a lot of these changes in the next zuul restart
19:13:05 so this really was only an impact window of 24-48 hours' worth of log uploads
19:13:13 i'll keep an eye out for that and let folks know when i think the time is ripe
19:13:18 corvus: thanks
19:13:43 and we'll pick this up further if downgrading sdk doesn't help
19:13:55 but for now be aware of that and the current process being taken to try to debug/fix the issue
19:13:55 ok, happy to help with verifying things once we're in a place to upgrade (downgrade?) executors
19:14:08 as for people with job results impacted during that window, the workaround is to look at the raw logs linked from the results page
19:14:09 ianw: we need to upgrade executors, which will downgrade openstacksdk in the images
19:15:11 Anything else on this subject?
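For reference, a minimal sketch of the kind of check that tells whether a swift-hosted log object comes back with CORS headers. The URL and origin below are placeholders, and this only exercises the simple GET case, not an OPTIONS preflight.

```python
#!/usr/bin/env python3
"""Quick check for the CORS header on a swift-hosted job log, roughly what a
browser-based dashboard needs in order to fetch logs cross-origin."""

import sys
import urllib.request

# Placeholder URL; substitute a real log object from a recent build.
LOG_URL = "https://storage.example.com/v1/AUTH_abc/zuul_logs/example/job-output.json"
ORIGIN = "https://zuul.opendev.org"


def check_cors(url, origin):
    """Send a GET with an Origin header and inspect the CORS response header."""
    request = urllib.request.Request(url, headers={"Origin": origin})
    with urllib.request.urlopen(request) as response:
        allow = response.headers.get("Access-Control-Allow-Origin")
    if allow in ("*", origin):
        print(f"OK: Access-Control-Allow-Origin: {allow}")
        return True
    print(f"Missing or unexpected CORS header: {allow!r}")
    return False


if __name__ == "__main__":
    sys.exit(0 if check_cors(LOG_URL, ORIGIN) else 1)
```

Running the same check against an ovh-hosted log and a rax-hosted log from the affected window is enough to see the difference described above.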
19:15:53 nada
19:16:01 #topic Glean doesn't handle static ipv6 configuration on rpm distros
19:16:18 We discovered recently that glean isn't configuring ipv6 statically on rpm/red hat distros
19:16:34 this particular combo happens on OVH because they are now publishing the ipv6 network info in their config drives but don't do RAs
19:16:49 statically meaning addressing supplied via instance metadata
19:16:51 all of the other clouds seem to also do RAs in addition to the config drive info, so we don't have to statically configure ipv6 to get working ipv6
19:17:10 fungi: right, and with the expectation that the host statically configure it rather than relying on RAs
19:17:33 in the past this was a non-issue because OVH didn't include that info in the metadata via config drive
19:17:39 #link https://review.opendev.org/q/topic:rh-ipv6 Changes to add support to glean
19:17:53 ianw has picked this up and has a stack of changes ^ that I'm now noticing I need to re-review
19:17:57 thank you for picking that up
19:18:17 this will be awesome, because it means we can finally do ipv6 in ovh generally, but also we can consider running servers there now as well
19:18:34 ++
19:18:43 the lack of automatable v6 (because it wasn't provided via metadata) was a blocker for us in the past
19:18:58 ianw: any gotchas or things to be aware of as we review those changes?
19:19:51 right, so rax/ovh both provide "ipv6" in their metadata, vexxhost provides "ipv6_slaac"
19:20:06 but only ovh doesn't also advertise RAs
19:20:29 I guess the rax metadata is maybe slightly wrong if it isn't ipv6_slaac? did you test these changes on rax to ensure we don't regress there by statically configuring things?
19:21:01 I don't expect it to be a problem; the kernel might just configure ipv6 first or NM will do it later, depending on when RAs are received
19:21:04 i haven't touched the !rh path at all, so that should not change
19:21:12 ianw: I mean for rh on rax
19:21:30 since we'll be going from ignoring it entirely and using kernel RA default handling to static NM configuration of ipv6
19:21:56 still probably fine so long as the metadata and neutron are in agreement
19:22:05 nm doesn't configure ipv6 there, for rh platforms
19:22:19 but the kernel doesn't accept RAs either
19:22:43 so basically we're moving from no ipv6 -> statically configured ipv6 on rax
19:22:49 no, rax has ipv6 today
19:23:04 I believe it is accepting RAs by default and glean is just ignoring it all
19:23:37 that is the difference between rax and ovh. Rax sends RAs so we got away with glean not having support, but ovh does not. Now that ovh adds the info to the config drive we notice
19:24:09 i don't think so, e.g. 172.99.69.200 is running 9-stream now on rax and doesn't have ipv6
19:24:23 oh, it may only be working with debuntu nodes
19:24:26 the interface doesn't have IPV6INIT=yes set for it
19:24:59 https://paste.opendev.org/show/b3ya7eC9zN7oyzrEgQpk/ is a 9-stream with the change on rax i tested earlier
19:25:49 huh, how did we not notice this issue in the past then? The ovh change is relatively new but I doubt rax has changed much
19:25:55 yeah, the debuntu case is different -- i called that out in https://review.opendev.org/c/opendev/glean/+/843996
19:25:56 anyway sounds like we'll get ipv6 in rax on rh too then
19:26:04 an improvement all around
19:26:08 https://zuul.opendev.org/t/openstack/build/429f8a274074476c9de7792aa71f5258/log/zuul-info/zuul-info.ubuntu-focal.txt#30
19:26:29 that ran in rax-dfw on focal and got ipv6 addressing
19:26:32 i feel like probably the debuntu path should be setting something in its interface files to say "use autoconfig on this interface"
19:26:55 ianw: well if the type is "ipv6" you are not supposed to autoconfig
19:26:59 "ipv6_slaac" you do
19:27:02 iirc
19:27:14 [we noticed in zuul because of the particular way openshift 3 tries to start up (it relies on an ipv6 /etc/hosts record and does not fall back to ipv4) -- perhaps nothing else behaves like that]
19:27:41 corvus: ah, that could be
19:28:17 anyway it sounds like there is a stack of changes that will fix this properly and they just need review. I've put reviewing those on my todo list for this afternoon
19:28:20 yeah, that sounds like a somewhat naive implementation
19:28:38 fungi: it's specifically how neutron defines it
19:28:42 but a great canary
19:28:49 like "iface eth inet6 auto"
19:29:19 i mean openshift's decision to base ipv6 detection on /etc/hosts content seems naive
19:29:22 ah
19:29:41 i feel like the issue is maybe that the network management tools and the kernel start to get in each other's way if the network management tools don't know ipv6 is autoconfigured
19:29:51 but, maybe the way we operate it just never becomes an issue
19:30:52 Thank you for digging into that. Anything else before we move on?
19:30:55 anyway, there is basically no cross-over between the debian path and the rh path in any of the changes. so nothing should change on the !rh path
19:31:35 sounds very promising
19:31:47 do we have a consensus on the revert in that stack?
19:31:58 i saw it got reshuffled
19:32:13 I think it is at the bottom of the stack still
19:32:20 (at least gerrit appears to be telling me it is)
19:32:36 oh, yeah i just missed a file on that one
19:32:55 I'm ok with that, I suspect we were the only users of the feature. We just need to communicate when we release that the feature has been removed and tag appropriately
19:33:09 and if necessary can add it back in again if anyone complains
19:33:43 yeah, the way it tests "OVH" but is actually testing "ignore everything" was quite confusing
19:33:49 yeah, i'm okay with no deprecation period as long as everyone else is in agreement and we just communicate thoroughly
19:34:24 if anyone's going to be impacted it's probably ironic, and i doubt they used it either
19:34:31 (i haven't checked codesearch)
19:34:39 ya, I think ironic makes a lot of use of dhcp on all interfaces
19:34:44 i did check codesearch and nothing i could see was setting it
19:34:51 wfm. thanks!
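To make the "ipv6" vs "ipv6_slaac" distinction above concrete, here is a minimal illustrative sketch (not the actual code in the rh-ipv6 stack under review) of turning a config-drive network_data.json entry of type "ipv6" into an ifcfg-style fragment with IPV6INIT=yes, which is the piece currently missing on the rpm path. The key names (ip_address, netmask) follow the usual network_data.json layout; routes/gateways, matching interfaces by MAC, and NetworkManager keyfiles are all left out.

```python
"""Illustrative sketch of static IPv6 handling for a config-drive
network_data.json entry of type 'ipv6' on an rpm-style platform."""

import ipaddress
import json

# Illustrative ifcfg template; the real glean templates and key names may differ.
IFCFG_TEMPLATE = """DEVICE={dev}
BOOTPROTO=none
ONBOOT=yes
IPV6INIT=yes
IPV6AUTOCONF=no
IPV6ADDR={addr}/{prefix}
"""


def prefix_length(netmask):
    """Convert an IPv6 netmask like ffff:ffff:ffff:ffff:: to a prefix length."""
    return bin(int(ipaddress.IPv6Address(netmask))).count("1")


def render_ipv6(network_data, device="eth0"):
    """Render ifcfg fragments for every static 'ipv6' network in the metadata."""
    fragments = []
    for net in network_data.get("networks", []):
        if net.get("type") != "ipv6":
            continue  # 'ipv6_slaac' and friends keep relying on RAs, so skip them
        fragments.append(IFCFG_TEMPLATE.format(
            dev=device,
            addr=net["ip_address"],
            prefix=prefix_length(net["netmask"]),
        ))
    return fragments


if __name__ == "__main__":
    # Path assumes the config drive is mounted at /mnt/config.
    with open("/mnt/config/openstack/latest/network_data.json") as f:
        data = json.load(f)
    for fragment in render_ipv6(data):
        print(fragment)
```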
19:34:53 which is essentially what glean fell back to there, and they get it from other tools
19:35:39 also, up to https://review.opendev.org/c/opendev/glean/+/843979/2 is all no-op refactoring
19:36:05 i was thinking of 2 releases; the first with that to make sure there's no regressions, then a second one a day later that actually turns on ipv6
19:36:12 that seems reasonable to me
19:36:44 i'll probably announce pending ipv6 enablement to the service-discuss list as a heads up
19:36:56 ++
19:36:57 before we tag any release that does that, and the next build picks it up
19:37:12 yeah, even just getting working v6 on ovh is likely to be disruptive somehow
19:37:53 Alright, let's continue as we have a few more items to get to
19:38:01 this all sounds good to me though and I'll try to review shortly
19:38:06 #topic Container Maintenance
19:38:20 ianw: on this topic I was wondering if you had time to look at the mariadb container upgrade path via the env var
19:38:27 if not that's fine (a lot has been going on)
19:39:09 sorry not yet, but top of todo list is to recheck that and the new gerrit release (the ipv6 was way more work than i thought it would be :)
19:39:44 #topic Gerrit 3.5 upgrade
19:40:04 #link https://review.opendev.org/c/opendev/system-config/+/843298 new gerrit minor releases and our image updates for them
19:40:18 Getting that landed and production updated is probably a good idea before we upgrade
19:40:36 ianw: are there any other upgrade-related changes that need review or items that need assistance?
19:41:34 i can look at a quick restart for the minor releases during my afternoon
19:41:42 thanks
19:42:01 no updates, but track
19:42:01 I do feel like I'm getting sucked into travel and summit prep and want to not be tied to too many ops obligations
19:42:04 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.5
19:42:15 if you're interested in keeping up, anything will be logged there
19:42:27 noted, thanks
19:42:33 #topic Manually triggering periodic jobs in zuul
19:42:46 I saw some talk about this at one point so I kept it on the agenda this week
19:43:30 Does anyone know if we managed to trigger a periodic job manually? If not that's fine and we can continue, but I wanted to make sure we called it out if someone managed it and has notes to share
19:43:55 I haven't succeeded so far
19:44:02 still planning to do local testing
19:44:17 thanks for the update
19:44:28 ++ on getting instructions
19:44:50 #topic Zuul changing default versions of Ansible to v5
19:45:13 There is still no hard date for when zuul will make this change, but I think we should continue to aim for end of June in opendev's deployment
19:45:36 note the version needs to be a string in yaml (this was something else we hit when we upgraded zuul last week)
19:45:54 Our image updates to include the acl package appear to have made devstack happy, and I think the zuul-jobs fixups have all landed
19:46:15 frickler: you mentioned wanting to test kolla jobs with newer ansible. Have you managed to do that yet? anything to report if so?
19:46:42 nope, still another todo
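On the note above about the Ansible version needing to be a string in YAML: a quick illustration of why that matters, since an unquoted value loads as an integer (or a float, for dotted versions) rather than the string the configuration expects. The ansible-version key is shown only as an example setting.

```python
"""Demonstrate how YAML types an Ansible version value depending on quoting."""

import yaml  # PyYAML

unquoted = yaml.safe_load("ansible-version: 5")
quoted = yaml.safe_load('ansible-version: "5"')

print(type(unquoted["ansible-version"]).__name__)  # int
print(type(quoted["ansible-version"]).__name__)    # str

# Dotted versions are even less forgiving: an unquoted 2.9 becomes the float
# 2.9, so quoting the value consistently avoids type surprises in config.
print(type(yaml.safe_load("ansible-version: 2.9")["ansible-version"]).__name__)  # float
```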
19:46:46 I think I'll send email about this change at the end of next week assuming we don't hit any major blockers by then
19:46:57 I'll send that to service-announce
19:47:13 and plan for June 30 change (it's a thursday)
19:48:41 #topic Removing Ethercalc
19:48:50 I have shut down and snapshotted this server
19:49:09 I figure there isn't much harm in waiting a day or two before actually deleting it, just in case someone starts screaming about problems due to this
19:49:24 any objections to leaving the turned-off server around for a couple of days before I delete it and its dns records?
19:49:54 also if you have a moment to double-check the snapshot exists and looks sane, please do.
19:50:10 nope, we have backups too
19:50:12 lgtm
19:50:34 i guess they will prune but otherwise just stay around
19:50:40 good point
19:50:48 alright, ~thursday I'll delete the server and its dns records
19:51:16 #topic Do we want to have a meeting next week
19:51:36 last item on the agenda. Do we want to have a meeting next week? I think fungi, corvus, and I will be unable to make it
19:52:03 correct; i vote no
19:52:06 yeah, i expect to be at a dinner at that time, i think
19:52:20 at any rate, i won't be around
19:52:41 I guess we can skip
19:53:15 I won't plan to send an agenda, but if others do want a meeting and can run it feel free to send one and drive it
19:53:36 but ya, I expect it is a good day for ianw to sleep in and otherwise not worry about it :)
19:53:54 heh, yeah, let's see if anything comes up during the week that needs an eye kept on it
19:54:32 #topic Open Discussion
19:54:34 Anything else?
19:54:34 the openstacksdk 1.0.0 release seems to have potential :) but hopefully not!
19:54:49 ianw: we'll probably have everything pinned to 0.61.0 by then :)
19:56:16 i ran across an explanation of how browsers detect whether a page is trying to "auto" play audio; apparently they check to see if the user interacts by clicking or entering things in the page first, so guidance from jitsi-meet is to turn on a landing page where users preselect their options before proceeding to the room, which is generally sufficient to keep browsers from thinking the page is just spontaneously generating sound
19:56:42 seems like this may address a lot of the "my sound doesn't work" complaints we've been getting about meetpad
19:57:15 fungi: are there changes up to add the landing page thing to our jitsi?
19:57:22 I know you were working on it, then I got totally distracted
19:57:25 i've also noticed that the upstream config files and examples have diverged quite a bit from the copies we forked, and a number of the options we set have become defaults now, so i'm trying to rectify all that first
19:57:56 the current upstream configs have turned on that landing page by default for some time
19:58:13 ah, so if we sync with upstream then we'll get it for free
19:58:22 so if i realign our configs with the upstream defaults as much as possible, that should just come for free, yes
19:58:52 but in an effort not to destabilize anything people might be using for remote participation in forum sessions, i think we probably shouldn't merge any major config updates until after next week
19:59:02 wfm
19:59:30 i'll try to push up something which minimizes our divergence from the upstream configs though, for further discussion
20:00:00 thanks
20:00:02 and we are at time
20:00:15 Thank you everyone for joining us. We likely won't be here next week but should be back in two weeks
20:00:20 #endmeeting