19:01:04 <clarkb> #startmeeting infra
19:01:04 <opendevmeet> Meeting started Tue Nov 29 19:01:04 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:04 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:04 <opendevmeet> The meeting name has been set to 'infra'
19:01:06 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-November/000383.html Our Agenda
19:01:44 <clarkb> #topic Announcements
19:01:45 <ianw> o/
19:02:22 <clarkb> There is a board meeting next week December 6 at 2100 UTC and weblate for openstack translations will be discussed if anyone is interested in attending
19:03:36 <clarkb> The summit cfp is also open. I noticed that SCaLE's cfp closes friday as well
19:03:43 <clarkb> and FOSDEM is doing things
19:04:24 <clarkb> anything else to announce?
19:05:20 <clarkb> #topic Bastion Host Updates
19:06:30 <clarkb> with the holiday weekend I've sort of lost where we are at on this
19:06:45 <clarkb> I think there were changes to manage rax dns with a small script
19:07:06 <clarkb> and openstacksdk updated to fix the networking issue when booting rax nodes
19:07:20 <ianw> yep there's a stack for updating the node launcher @
19:07:32 <ianw> #link https://review.opendev.org/q/topic:rax-rdns
19:07:51 <fungi> #link https://review.opendev.org/865320 Improve launch-node deps and fix script bugs
19:07:53 <fungi> also that
19:08:11 <ianw> that has a tool for updating RAX rdns automatically when launching nodes, and also updates our dns outputs etc.
19:08:44 <ianw> there's also
19:08:46 <ianw> #link https://review.opendev.org/q/topic:bridge-osc
19:09:01 <ianw> which updates the "openstack" command on the bastion host, which doesn't currently work
19:09:22 <ianw> and then
19:09:26 <ianw> #link https://review.opendev.org/q/topic:bridge-ansible-update
19:09:42 <fungi> though in theory we could just deeplink from /usr/local/(s)bin to the osc in the launch env and not manage two separate copies
19:09:43 <ianw> is a separate stack that updates ansible on the bridge
19:10:08 <ianw> fungi: that's what https://review.opendev.org/c/opendev/system-config/+/865606 does :)
19:10:15 <fungi> oh, perfect ;)
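For reference, the deduplication fungi describes here amounts to symlinking the launch-env client rather than installing a second copy; a minimal sketch, assuming a hypothetical venv path (865606 defines the real one):

    # assumed venv location for the launch-node tooling; adjust to the real path
    ln -sf /usr/launcher-venv/bin/openstack /usr/local/bin/openstack
    # quick sanity check that the linked client resolves its dependencies
    /usr/local/bin/openstack --version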
19:11:36 <clarkb> cool. I'll do my best to pull these up after my appointment and review as many as I can
19:11:53 <clarkb> anything else with the bastion?
19:12:27 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/865784
19:12:42 <ianw> is a tool for backing up bits of the host
19:13:25 <ianw> that's about it.  i think it's pretty close to being as much of a "normal" host as it can be
19:13:42 <ianw> there's still
19:13:45 <ianw> #link https://review.opendev.org/q/topic:prod-bastion-group
19:13:58 <clarkb> and that last link will do the parallel runs right?
19:14:05 <ianw> to update jobs so they can run in parallel
19:14:08 <ianw> yep
19:14:11 <clarkb> I'm thinking we should stabilize as much as we can prior to that, as that alone demands a fair bit of attention
19:14:25 <ianw> agree on that, not high priority
19:14:53 <clarkb> #topic Upgrading Old Servers
19:14:58 <clarkb> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes
19:15:08 <clarkb> I don't have anything new to add to this :( I keep finding distractions
19:15:29 <clarkb> #topic Mailman 3
19:15:35 <clarkb> #link https://etherpad.opendev.org/p/mm3migration Server and list migration notes
19:15:38 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/865361
19:15:48 <clarkb> I believe fungi is just about ready to deploy the new server
19:15:57 <clarkb> then its a matter of announcing and performing the swap over
19:16:01 <fungi> well, the server exists, but the inventory change is in the gate now
19:16:09 <fungi> dns is already there
19:16:10 <clarkb> it just merged :)
19:16:55 <fungi> perfect
19:17:13 <fungi> and i've added reverse dns and am making sure it's clear of any blocklists
19:17:44 <fungi> so assuming the deploy from the inventory addition checks out, we should be able to announce a maintenance window now
19:18:14 <clarkb> is there anything you need from us at this point?
19:18:19 <fungi> i'll float a draft etherpad in #opendev later today or early tomorrow
19:18:40 <clarkb> sounds good
19:18:46 <fungi> when would folks want to do the maintenance? monday december 5?
19:18:55 <clarkb> that day should work for me
19:19:10 <ianw> ++
19:20:03 <fungi> as for things to cover in the announcement, people wanting to manage list moderation queues and configs or adjust their subscriptions will need to create accounts (but if they use the same address from the configs we've imported then those roles will be linked as soon as they follow the link from the resulting confirmation e-mail)
19:20:36 <fungi> also we're unable to migrate any held messages from moderation queues, so list moderators will want to do a quick pass over theirs shortly before the maintenance window
19:21:29 <fungi> other than that, and the brief outage and what the new ip addresses are, is there anything else i should cover in the announcement?
19:21:33 <frickler> can we first stop incoming mails to avoid race conditions?
19:21:40 <corvus> i can do that for zuul on sunday/monday
19:21:50 <clarkb> frickler: yes the etherpad link above covers the whole process
19:22:07 <clarkb> and I'm pretty sure stopping incoming mail is part of it?
19:22:27 <frickler> but before the final moderation run?
19:22:27 <clarkb> that should force it to queue up and then get through to the new server when dns is updated
19:22:36 <clarkb> frickler: oh specifically for moderation. For that I'm not sure
19:22:56 <frickler> maybe not so relevant for these low volume lists, but possibly for openstack-discuss
19:23:02 <fungi> the plan is to stop incoming mails by switching the hostnames for the sites to nonexistent addresses temporarily
19:23:23 <fungi> because we can't easily turn off some sites and not others at the mta layer
19:24:05 <fungi> but also to keep that as brief as possible, just long enough for dns to propagate and the imports to run
19:24:36 <fungi> and then we'll switch to the proper addresses in dns so messages deferred on the senders mtas will end up at the correct server
19:25:33 <fungi> for openstack-discuss, i'll personally be moderating it anyway and can literally check the moderation queue via an /etc/hosts override after dns is swizzled before the import
19:25:34 <frickler> ok, guess I should've read the etherpad first ;)
19:26:01 <fungi> but that won't be migrated in this upcoming window, it'll be sometime early next year
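A minimal sketch of the /etc/hosts override fungi mentions for checking a moderation queue after the DNS cutover; the address below is a placeholder for wherever the old lists server actually answers:

    # placeholder address for the old lists server; lets a moderator's browser
    # keep reaching the existing web UI while public DNS points elsewhere
    203.0.113.10   lists.openstack.org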
19:27:39 <clarkb> #topic Quo Vadis Storyboard
19:27:51 <clarkb> I sent the followup to the thread that I promised (a little late sorry)
19:28:10 <clarkb> tried to summarize the position the opendev team is in and some ideas for improving the status quo
19:28:25 <clarkb> I'd love to hear feedback (ideally on the thread so that others can easily follow along)
19:28:54 <clarkb> frickler did mention that if we're going to invest further in the gitea deployment (and maybe if we don't anyway) we'll need to sort out the gitea fork situation
19:29:10 <clarkb> I followed that whole thing as it was happening a few weeks ago and didn't feel like we needed to take any drastic action at the time
19:29:27 <clarkb> But i agree it is worth being informed and up to date on that situation to ensure we're putting effort in the correct location
19:29:37 <fungi> i guess people are still pushing to fork and it hasn't settled out yet?
19:29:53 <clarkb> yes I think a fork is in progress, but I'm not sure it is deployable yet
19:30:02 <fungi> does it have a name?
19:30:11 <frickler> yes, my main concern when I saw this was we might build something for gitea and then it turns out we would want/need the fork instead
19:30:16 <frickler> forgejo
19:30:30 <frickler> https://codeberg.org/forgejo/forgejo
19:30:50 <frickler> not sure though whether that is _the_ fork or just one of multiple ones
19:30:58 <fungi> they had to one-up gitea on unpronounceable names
19:31:18 <frickler> it's esperanto for "forge" they say
19:32:04 <fungi> i saw
19:32:21 <clarkb> I don't think we need drastic action today either fwiw
19:32:37 <clarkb> but it would be good to evaluate both if we dig deeper into using gitea's extra functionality
19:33:32 <fungi> some of the folks involved in that have a history of getting very excited about things and then dropping them months later, so i'm not holding my breath yet anyway
19:34:19 <clarkb> #topic Vexxhost instance rescuing
19:34:45 <clarkb> This is mostly on the agenda at this point to try and remind me to find time when jrosser's day overlaps with mine to gather info on the bfv setup they use for rescuing
19:35:03 <clarkb> then I can try and set that up in vexxhost and test it. If I can make that work then we can develop a rescue image with all the right settings maybe
19:35:11 <clarkb> but no new updates on this since the last time we met
19:35:14 <clarkb> (holidays will do that)
19:35:21 <clarkb> #topic Gerrit 3.6 Upgrade
19:35:25 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.6
19:35:58 <clarkb> ianw: I've skimmed this doc a couple times at this point. The main new bit of feedback I've got (which is on the etherpad too) is that the latest 3.5 release which we upgraded to recently was made partially to fix a bunch of issues with copy approvals
19:36:25 <clarkb> I think it might be worth catching up on the state of copy approvals upstream (just to be sure there aren't any more bug fixes outstanding) then give it a go on our installation?
19:36:39 <clarkb> that way if our install finds new issues we have time to work upstream to address them
19:36:56 <ianw> yeah i agree -- aiui that can be run online just fine right?
19:37:11 <ianw> it looks likely to be a multiple-hour thing
19:37:18 <clarkb> ianw: yes, it is supposed to be able to run online in the background. Digging through logs for that might be the hard part to double check it was happy
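For context, a sketch of kicking off that background copy-approvals run; the exact invocation and site path here are assumptions, so check the 3.5 release notes before running it:

    # run on the Gerrit server; -d points at the site directory (path assumed)
    java -jar gerrit.war copy-approvals -d /var/gerrit/review_site
    # it is meant to be safe to run while Gerrit is online; the error log is
    # where per-change problems would show up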
19:37:56 <clarkb> the other piece which I don't think we've done is evaluate any other potentially breaking or user visible updates
19:38:06 <clarkb> just to try and identify any major things people might complain about
19:38:30 <ianw> yeah, i need to spin up the node
19:38:43 <ianw> then we can poke
19:38:57 <clarkb> 3.7 is looking to be far more problematic too so I'm glad we aren't trying to make that jump yet
19:39:13 <clarkb> it has currently broken recheck comments for example
19:39:26 <ianw> if you like, i can try running the copy-approvals when it slows down in a few hours, and monitor it
19:39:49 <clarkb> ianw: I think the first thing is to look at the changelog and open changes for 3.5 to see if we are missing any copy approvals updates
19:39:57 <clarkb> update our image if necessary but then ya running it
19:40:04 <corvus> clarkb:  upstream bug suggests there should be an easy fix for that in zuul; i'll be investigating it soon
19:40:31 <ianw> ++ i'll check that out this morning and we can sync on it
19:40:35 <clarkb> corvus: it is a bit weird to me that they want zuul to address it? it's the comment-added event which has for a decade now included the content of the comment
19:40:50 <clarkb> corvus: I mean, it's great if we can work around it but I feel like that breaks a pretty base level expectation
19:41:28 <clarkb> ianw: sounds good. Anything else gerrit 3.6 upgrade related?
19:41:42 <corvus> assuming the upstream bug is accurate, they've had this backwards compat in place for years at our request, so i'm not ready to ding them on that.
19:42:26 <ianw> clarkb: nope, let's get that groundwork done then we can decide on update schedules
19:42:32 <clarkb> sounds good
19:42:37 <corvus> anyway, give me a chance to actually look into it so i can speak from knowledge :)
19:42:41 <clarkb> corvus: ++
19:43:03 <clarkb> #topic Acme.sh failures
19:43:24 <clarkb> acme.sh switched to ecc certs/keys by default (from rsa) recently and broke our ability to renew things
19:43:47 <clarkb> ianw fungi and I all poked at it a bit and I think ianw was able to track it down to that specific change and wrote an upstream issue about it
19:43:51 <clarkb> #link https://github.com/acmesh-official/acme.sh/issues/4416
19:44:04 <clarkb> Basically file paths changed and now acme.sh can't find data it is looking for later
19:44:21 <clarkb> to address this we've pinned to the previous release: 3.0.5 and the change to do that landed recently
19:44:40 <clarkb> we should expect the certs that our cert checker complains about to refresh overnight today, and we can address that then (maybe sooner)
19:45:03 <clarkb> One thing I wanted to ask is if we should try and explicitly set settings like that to avoid underlying changes impacting us
19:45:10 <clarkb> we can set the key type and length and so on explicitly
19:45:18 <clarkb> and that might allow us to avoid pinning?
19:45:56 <clarkb> The downside to this is 5 years from now when rsa 2048 is no longer safe <_<
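Being explicit would look roughly like the following at issuance time; a sketch assuming acme.sh's --keylength option, with placeholder domain and validation details:

    # ask for an RSA key explicitly so an upstream default flip doesn't move
    # cert paths out from under us (domain/webroot here are placeholders)
    ./acme.sh --issue --keylength 4096 -d example.opendev.org -w /var/www/acme
    # or opt in to ECC deliberately, which stores certs under a *_ecc directory
    ./acme.sh --issue --keylength ec-256 -d example.opendev.org -w /var/www/acme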
19:46:07 <ianw> we could -- this feels like a bit of a corner case because the ecc path seems to be a bit of a separate, opt-in thing, and upstream sort of half-switched it
19:47:09 <ianw> they can't really make the tool suddenly start updating certs in a different place on disk (a directory with _ecc appended) because that would seem to break everything
19:47:22 <clarkb> maybe we wait for feedback on the issue before deciding on how to move forward post pin
19:47:36 <clarkb> I just wanted to call out the option of being explicit as an alternative to pinning
19:47:53 <ianw> yeah; i think it's good we've pointed it out -- we've run from dev for several years and this is the first time we've had an issue
19:48:40 <ianw> so i'll keep an eye, and if we can get back to a point of running from dev I think that's desirable to continue being a canary
19:48:46 <ianw> we can always pin to the last known working thing easily
19:48:56 <clarkb> sounds good and thank you for digging into that yesterday
19:49:04 <clarkb> I had a note on your debugging change too, not sure if you saw
19:49:16 <clarkb> the one that updates driver.sh
19:49:39 <clarkb> #topic Open Discussion
19:49:41 <ianw> ok thanks, will go over those.  just a couple of things that would have made it easier if it happens again
19:49:46 <clarkb> ++
19:50:13 <clarkb> corvus: ianw's fix to openstacksdk for launch node may be what we need for latest openstacksdk to work with nodepool too so I'll try to reprioritize testing that with your test tool
19:50:43 <clarkb> also linaro's new arm cloud appears to be near ready for use. This is being spun up to get us off old hardware that equinix wants to shutdown
19:51:02 <clarkb> I expect we'll need to move the builder and the mirror node manually (but maybe linaro has some magic to move those vms? I doubt it though)
19:51:19 <ianw> yeah pretty sure that will all be a rebuild
19:51:21 <clarkb> And a reminder that we'll turn off the iweb cloud at the end of the year.
19:51:26 <ianw> that's ok, good test of the launch node changes :)
19:51:30 <clarkb> ++
19:52:38 <clarkb> last call for anything else
19:54:29 <fungi> i've checked the new mm3 server's ip addresses against spamhaus and senderbase, and they're all clean
19:55:00 <clarkb> excellent
19:55:04 <clarkb> Thank you all for your time (in the meeting and working on OpenDev)! We'll be back next week. I should look at a calendar soon to see how December holidays and the new year impact our schedule.
19:55:08 <fungi> announcement is being drafted in the migration plan etherpad and is nearly complete, so i'll give folks a heads up in #opendev once it's ready to proof
19:55:15 <fungi> thanks clarkb!
19:55:27 <clarkb> #endmeeting