19:01:06 <clarkb> #startmeeting infra
19:01:06 <opendevmeet> Meeting started Tue Dec 6 19:01:06 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 <opendevmeet> The meeting name has been set to 'infra'
19:01:24 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/GCABXQDEGIAYG4T63NXZJGNHACEICKAP/ Our Agenda
19:01:35 <clarkb> #topic Announcements
19:02:15 <clarkb> The foundation is entering election time for the board. Nominations for individual members close in 10 days on the 16th of December
19:02:28 <clarkb> Then an election is held in January
19:02:51 <clarkb> Any other announcements?
19:03:59 <fungi> board meeting today
19:04:12 <clarkb> oh right an hour after the end of this meeting (21:00 UTC) there will be a board meeting
19:04:16 <fungi> 21:00 utc in zoom
19:04:22 <fungi> yep
19:04:46 <clarkb> tools for openstack translations will be discussed which might interest this crowd
19:04:52 <fungi> https://board.openinfra.dev/meetings/2022-12-06
19:04:58 <fungi> that 'un
19:06:00 <clarkb> #topic Bastion Host Updates
19:06:13 <clarkb> I think we are getting very close to the end of this thread.
19:06:26 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/866542 addresses ansible installation on bridge to actually update to the ansible we are testing with
19:06:53 <clarkb> #link https://review.opendev.org/q/topic:prod-bastion-group parallelized zuul jobs on bridge. Should land when bridge is stable and we can monitor
19:07:10 <clarkb> #link https://review.opendev.org/q/topic:bridge-ansible-venv This group appears to have all of its changes merged or abandoned
19:07:26 <clarkb> ianw: Anything else to say on this topic? I need to re-review 866542 which is on my todo list for today
19:08:11 <ianw> yeah 866542 just got a rebase really since you looked at it yesterday, i removed a change that was updating a removed comment
19:08:37 <ianw> the stack really expanded to
19:08:41 <ianw> #link https://review.opendev.org/q/topic:boostrap-ansible-from-req
19:08:51 <clarkb> oh I see there are a few follow-ons
19:09:02 <ianw> which just moves the same idea to the venv creation, which i noticed when watching the logs
19:09:31 <ianw> the other stack that needs feedback and action, particularly from infra-roots, is
19:09:33 <ianw> #link https://review.opendev.org/q/topic:bridge-backups
19:09:58 <clarkb> oh right I had that in my local agenda notes. sorry
19:10:29 <clarkb> and what that does is encrypt things locally so they can be backed up remotely right?
19:11:05 <ianw> essentially yes, with a key split requiring 2 people to recombine
19:11:38 <ianw> this is so nobody needs to feel like they need to set up fort knox to keep the backup
19:12:46 <clarkb> I'll have to read into that more to understand the mechanics of it. Like do we all need to forward gpg agents or something? But that can happen in review or in #opendev
19:13:04 <clarkb> I'll do my best to review those two stacks after the board meeting today
19:13:25 <clarkb> Anything else bastion related?
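[As an illustration of the 2-of-N key split ianw describes at 19:11:05, the general technique can be done with an off-the-shelf Shamir secret-sharing tool such as ssss. This is only a hedged sketch of the idea, not necessarily what the topic:bridge-backups changes implement; the paths, filenames, and share counts are placeholders.]

```shell
# Generate a random passphrase and symmetrically encrypt the backup with it.
PASS=$(openssl rand -base64 32)
tar -czf - /path/to/bridge/secrets \
  | gpg --batch --pinentry-mode loopback --symmetric \
        --passphrase "$PASS" -o bridge-backup.tar.gz.gpg

# Split the passphrase into 5 shares, any 2 of which reconstruct it, so no
# single keyholder has to guard a complete secret ("fort knox" avoidance).
echo "$PASS" | ssss-split -t 2 -n 5 -w bridge-backup

# Recovery: any two share holders run ssss-combine, which prints the
# recovered passphrase; decrypting then prompts for it interactively.
ssss-combine -t 2
gpg --decrypt bridge-backup.tar.gz.gpg | tar -xzf -
```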
19:13:26 <ianw> thanks; https://review.opendev.org/c/opendev/system-config/+/866430 should be pretty self-explanatory for that i think
19:13:51 <ianw> nope, thanks
19:14:00 <clarkb> #topic Upgrading old servers
19:14:09 <clarkb> Nothing new here other than we should find time to do more of this :/
19:14:25 <clarkb> I guess technically the bastion work is a subset of this so we are pushing that along :)
19:14:37 <fungi> technically we've partially upgraded the listserv too
19:14:37 <clarkb> and the mm3 work isn't directly related but does get us off an old server that has kernel fun
19:14:43 <fungi> yeah that
19:14:57 <clarkb> progress then. I'll take it :)
19:15:05 <fungi> i guess we already upgraded the distro on the old mailman server anyway
19:15:08 <clarkb> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes
19:15:12 <fungi> just not painlessly
19:15:15 <clarkb> yup
19:15:31 <clarkb> Which is a good lead into the next topic
19:15:39 <clarkb> #topic Mailman 3
19:15:44 <clarkb> #link https://etherpad.opendev.org/p/mm3migration Server and list migration notes
19:15:58 <clarkb> lists.opendev.org and lists.zuul-ci.org moved to mailman3 on the new server yesterday
19:16:09 <fungi> and within the scheduled window even
19:16:18 <fungi> though in retrospect i should have called it two hours just in case
19:16:34 <fungi> i didn't factor in gate/deploy time for the dns updates
19:16:37 <clarkb> there were/are a couple of issues we found in the process. One was fixed which corrected some url routes. The other is setting a site_owner value which was missed because all the other settings are set by env vars but not this one
19:16:51 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/866632 set mailman3 site_owner
19:17:04 <clarkb> fungi: we managed to make the timing work in the end
19:17:20 <clarkb> There is also a change to spin things down on the old server for these two domains
19:17:27 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/866630 disable old mm2 domains
19:17:38 <fungi> yeah, the broken archiving is totally my bad. when i re-tested the held node after we updated to the latest mm release, i forgot to double-check that new list messages ended up in the imported archive
19:18:01 <fungi> thanks to corvus and clarkb for figuring it out while i was off stuffing my face
19:18:17 <clarkb> One "exciting" thing is that the upstream mailing list has a thread from today (of course the day after we upgrade) suggesting people not run latest mm3 which we are doing.
19:18:28 <fungi> hah
19:18:32 <ianw> will 866632 require a restart?
19:18:34 <clarkb> The reason for this is a bug when handling more than 10 list sites, possibly postfix specific (we run exim)
19:18:36 <fungi> that's so our luck
19:18:36 <clarkb> ianw: yes
19:18:49 <clarkb> #link https://gitlab.com/mailman/mailman/-/issues/1044 bug with more than 10 lists
19:19:30 <clarkb> lists.opendev.org has 10 lists and lists.zuul-ci.org has fewer. I've also grepped for that warning string in our logs. I'm not sure if we are not affected because we use exim or if it is because we have few lists, but I haven't found evidence we have a problem yet
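[For context on the site_owner issue discussed at 19:16:37 and 19:18:32: Mailman 3 core reads this value from mailman.cfg, and its stock default is a changeme@example.com-style address, which is easy to overlook when every other setting is injected via environment variables. The sketch below assumes a containerized deployment; the container name and address are placeholders, and 866632 may wire the value differently.]

```shell
# Core reads site_owner from mailman.cfg, roughly:
#   [mailman]
#   site_owner: mailman@lists.opendev.org    <- placeholder address
# site_owner receives site-wide notifications such as undeliverable bounces.

# After the restart that 866632 requires (per ianw's question above),
# confirm the running core picked up the new value:
docker exec mailman-core mailman conf -s mailman -k site_owner
```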
19:19:45 <clarkb> something to watch though as lists.openstack.org has many more than 10 lists and we want this to be working before we upgrade
19:19:53 <clarkb> er s/upgrade/migrate lists.openstack.org/
19:20:13 <clarkb> And please say something if you see the indicated behavior with the migrated lists
19:20:43 <fungi> though lists.openstack.org also has way fewer lists than it used to, and might have fewer still if i can convince them to retire some more unused ones before we move it
19:20:47 <clarkb> Other than that I think this went really well. These two domains are newer but we migrated stuff off of an ancient server and software to something modern and it seems to work
19:21:14 <clarkb> the testing and planning seem to have done their job. Thank you to everyone who helped with that. fungi in particular did a lot of work on that side of things
19:21:21 <fungi> thanks everyone for all the work on that
19:21:53 <fungi> it's been on my wishlist for years and i was never able to find the time to tackle it on my own
19:22:26 <clarkb> anything else to add?
19:22:45 <fungi> nothing on my end
19:23:07 <clarkb> #topic Quo vadis Storyboard
19:23:22 <clarkb> I just realized my old link should be updated to the new hyperkitty archives. Oh well
19:23:33 <clarkb> I did send a followup covering our options and asked for feedback
19:23:52 <clarkb> The one response I got was someone offering to help with the software but unfortunately I think we need to start with the deployment if we are going to adopt it
19:24:04 <clarkb> *if we adopt software maintenance we need to commit to updating the deployment first
19:24:23 <clarkb> I'll leave it open for more feedback as it has only been about a week. I'd be happy to hear from you
19:24:53 <clarkb> and I guess if that doesn't work I can suggest that people provide semi-anonymous feedback instead and I can try to collate it if people trust me to do that
19:25:08 <clarkb> But I want to make sure whatever we do here is reasonable and will be accepted
19:25:11 <fungi> yes, the software is already well ahead of what we're running in terms of major bug and performance fixes and new features
19:25:30 <fungi> which is a big part of the problem
19:25:58 <fungi> we had volunteers to develop the software, but nobody keeping our deployment up to date with it
19:26:04 <clarkb> right
19:26:37 <clarkb> anyway, let's see how we do over the next week for feedback and we can take a different approach if this continues to not generate input
19:26:47 <clarkb> I think 2 weeks is a reasonable amount of time for this sort of thing and we are halfway through that right now
19:26:54 <fungi> agreed. thanks!
19:27:10 <clarkb> #topic Vexxhost server rescues
19:27:20 <clarkb> jrosser shared image settings with me
19:27:27 <clarkb> #link https://paste.opendev.org/show/bxxFEEUWeUrkIVlBSGrw/ jrosser's image settings that work in their cloud
19:27:47 <clarkb> I've got a todo item to try and upload an image with those settings set and use it as a rescue image after modifying the root boot label
19:27:53 <clarkb> But I haven't done that yet
19:28:19 <clarkb> They use ceph too so I'm hopeful that this will work
19:28:30 <clarkb> #topic Gerrit 3.6
19:28:35 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.6
19:29:03 <clarkb> ianw ran copy-approvals on all of our repos. We had a small problem in neutron due to a change with more than 1k patchsets which is our current limit
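[For reference, copy-approvals is the program Gerrit ships for persisting "copied" votes onto changes ahead of the 3.6 upgrade, so the upgrade itself doesn't have to compute them. A hedged sketch of the invocation is below; the site path is an assumed placeholder, and any option for raising the patchset limit ianw bumped is omitted since the exact flag isn't shown in the log.]

```shell
# Sketch: run copy-approvals against the Gerrit site.
# /home/gerrit2/review_site is an assumed path for illustration only.
java -jar /home/gerrit2/review_site/bin/gerrit.war \
    copy-approvals -d /home/gerrit2/review_site
```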
19:29:18 <clarkb> ianw temporarily bumped that limit and reran which caused things to work except for a corrupt change
19:29:39 <clarkb> even if that didn't work we would've been fine because all of the neutron changes are closed and not open so their votes are largely there for historical accuracy
19:30:01 <clarkb> ianw: looks like you've noted the next steps are holding a node and double checking things a bit more directly
19:30:11 <clarkb> as well as working on a proposal for the upgrade
19:30:51 <clarkb> ianw: is there anything around the gerrit upgrade we can help with?
19:31:06 <clarkb> I will note the openstack release cycle schedule is at https://releases.openstack.org/antelope/schedule.html which we should avoid conflicts with
19:31:18 <ianw> not really, i just want to hold a node and validate downgrade, which i should be able to do very soon
19:32:04 <ianw> if the mm upgrade time worked, we could do it then
19:32:25 <clarkb> ianw: ~2000UTC on a monday you mean?
19:33:03 <fungi> wfm
19:33:18 <ianw> yep, that would open up the 12th or the 19th, i'm around for both, though less time after the 19th
19:33:56 <clarkb> ya it's a trade off I guess. Less lead time to test and announce with the 12th and less time to fix/debug if the 19th
19:34:05 <ianw> if we get some held node validation over the next few days, maybe the 12th?
19:34:19 <ianw> i'm fairly confident, there doesn't seem to be much more we could test
19:34:22 <clarkb> ya if that all looks good and doesn't show anything that users should need to worry about I'd be good with the 12th
19:34:39 <clarkb> we can even send an announcement nowish indicating we plan to do it on the 12th and postpone if necessary
19:34:50 <clarkb> I think there is a downgrade process for 3.6 -> 3.5 too so we have that option if necessary
19:35:00 <ianw> and that's an excuse to send a message through the list to keep it ungreylisted too :)
19:35:07 <fungi> heh
19:35:31 <clarkb> if we test and confirm the downgrade process seems to work then I'm extra happy to proceed early
19:35:43 <clarkb> I think 3.7 -> 3.6 has a less easy downgrade though so that upgrade will be a funner one
19:36:10 <ianw> ok, i will get onto all that, https://etherpad.opendev.org/p/gerrit-upgrade-3.6 will be updated as things happen
19:36:37 <clarkb> sounds good, thanks!
19:36:52 <clarkb> #topic Open Discussion
19:37:13 <clarkb> that was it for the agenda but this morning I noticed something I had on my back burner has made some progress and is worth calling out
19:37:29 <clarkb> The first bit is nodepool updated to latest openstacksdk which includes ianw's fix for network stuff against older apis
19:38:03 <clarkb> image uploads seem to work (we have recent images) and I haven't seen any launcher issues. But we should skim the grafana dashboard for any evidence of problems
19:38:20 <clarkb> And then that unlocked the path for updating zuul and nodepool images to python3.11
19:38:30 <clarkb> The zuul change has landed and nodepool is gating
19:38:53 <clarkb> nodepool will restart once that change lands. zuul will normally restart over the weekend. Do we want to manually restart zuul sooner to observe it?
19:39:09 <clarkb> I should be able to do that tomorrow if we think that is a good idea.
19:39:29 <clarkb> In particular one thing I realized is that ansible might not like python3.11? However, we do have zuul testing that exercises ansible so maybe it's fine?
19:39:36 <clarkb> cc corvus ^ if you have an opinion
19:39:44 <clarkb> I'm also happy to revert if we think we need more prep
19:40:45 <clarkb> Oh also last week I cleaned up the inmotion cloud's nova and placement records
19:41:07 <clarkb> There were two distinct issues. The first was that placement had leaked a few records for nodes that just didn't exist anymore either on the host or in the nova db
19:41:17 <clarkb> the second was the nova db leaked instances that didn't exist in libvirt on the hosts
19:41:32 <clarkb> cleaning up the first thing is relatively straightforward and placement has docs on the process.
19:42:15 <clarkb> Cleaning up the second thing required manually editing the nova db to associate nodes with the cell they lived in because some nova bug allowed them to be disassociated which broke server deletion. Once those records were updated server delete worked fine
19:42:35 <clarkb> melwitt was a huge help in sorting that out, but now we have more nodes to test with so yay
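[To make the two cleanups from 19:41:07 onward concrete, here is a hedged sketch. The placement side uses the osc-placement plugin commands; the nova side assumes the disassociation clarkb describes shows up as a NULL cell_id in the nova_api instance_mappings table. The UUIDs and the 'cell1' cell name are placeholders, and this illustrates the general shape rather than the exact commands that were run.]

```shell
# 1) Placement: inspect and remove allocations (and, if fully dead, the
#    resource provider) for nodes that no longer exist anywhere.
openstack resource provider list
openstack resource provider allocation show "$INSTANCE_UUID"
openstack resource provider allocation delete "$INSTANCE_UUID"
openstack resource provider delete "$PROVIDER_UUID"

# 2) Nova: re-associate an orphaned instance mapping with its cell so that
#    server deletion can locate the instance again. Assumes the bug left
#    cell_id NULL; 'cell1' is a placeholder cell name.
mysql nova_api -e "
  UPDATE instance_mappings
     SET cell_id = (SELECT id FROM cell_mappings WHERE name = 'cell1')
   WHERE instance_uuid = '$INSTANCE_UUID' AND cell_id IS NULL;"
openstack server delete "$INSTANCE_UUID"
```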
19:43:37 <clarkb> oh we also had leaked nodes in rax
19:43:51 <clarkb> they didn't have proper nodepool metadata so nodepool refused to clean them up. i manually cleared those out
19:43:55 <corvus> clarkb: i think zuul's own tests should give us a heads up on any ansible/python probs like that. i don't have a strong feeling about whether we need to restart early or just let it happen on the weekend
19:44:03 <clarkb> corvus: ack thanks
19:44:47 <clarkb> as far as team meetings go I think we'll cancel the 27th. Any strong opinions for having meetings on the 13th or january 3?
19:45:17 <fungi> i'll be around on the 13th and 3rd but don't necessarily require a meeting
19:45:20 <clarkb> er sorry the 20th and 3rd
19:45:26 <clarkb> I plan to be around on the 13th and have that meeting
19:45:56 <fungi> i also should be around on the 20th but may be a little distracted
19:46:36 <ianw> i should be around on the 20th ... unsure on the 3rd
19:46:50 <ianw> for sure not the 27th
19:47:00 <clarkb> ok we can do a low key meeting on the 20th, then see what the new year looks like when we get there
19:47:09 <fungi> i do expect to have far more work than usual the week of the 3rd so may be distracted then too
19:47:22 <clarkb> ya it's the time of year when all the paperwork needs to be done :)
19:47:44 <fungi> so much paperwork
19:47:54 <clarkb> alright then we'll see you here on the 13th and probably the 20th. Then we can enjoy the holidays for a bit (and you should enjoy them earlier too if you are able :) )
19:48:03 <fungi> thanks clarkb!
19:48:04 <clarkb> anything else?
19:48:32 <corvus> schedule a holiday party for the 20th ;)
19:49:13 <clarkb> good idea. Let me see if I can figure something out for that
19:49:24 <clarkb> board game arena game or something :)
19:49:46 <clarkb> thank you everyone for your time, I'll let you go now. See you next week.
19:49:48 <clarkb> #endmeeting