19:01:06 <clarkb> #startmeeting infra
19:01:06 <opendevmeet> Meeting started Tue Dec 6 19:01:06 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 <opendevmeet> The meeting name has been set to 'infra'
19:01:24 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/GCABXQDEGIAYG4T63NXZJGNHACEICKAP/ Our Agenda
19:01:35 <clarkb> #topic Announcements
19:02:15 <clarkb> The foundation is entering election time for the board. Nominations for individual members close in 10 days on the 16th of December
19:02:28 <clarkb> Then an election is held in January
19:02:51 <clarkb> Any other announcements?
19:03:59 <fungi> board meeting today
19:04:12 <clarkb> oh right an hour after the end of this meeting (21:00 UTC) there will be a board meeting
19:04:16 <fungi> 21:00 utc in zoom
19:04:22 <fungi> yep
19:04:46 <clarkb> tools for openstack translations will be discussed which might interest this crowd
19:04:52 <fungi> https://board.openinfra.dev/meetings/2022-12-06
19:04:58 <fungi> that 'un
19:06:00 <clarkb> #topic Bastion Host Updates
19:06:13 <clarkb> I think we are getting very close to the end of this thread.
19:06:26 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/866542 addresses ansible installation on bridge to actually update to the ansible we are testing with
19:06:53 <clarkb> #link https://review.opendev.org/q/topic:prod-bastion-group parallelized zuul jobs on bridge. Should land when bridge is stable and we can monitor
19:07:10 <clarkb> #link https://review.opendev.org/q/topic:bridge-ansible-venv This group appears to have all of its changes merged or abandoned
19:07:26 <clarkb> ianw: Anything else to say on this topic? I need to re-review 866542 which is on my todo list for today
19:08:11 <ianw> yeah 866542 just got a rebase really since you looked at it yesterday, i removed a change that was updating a removed comment
19:08:37 <ianw> the stack really expanded to
19:08:41 <ianw> #link https://review.opendev.org/q/topic:boostrap-ansible-from-req
19:08:51 <clarkb> oh I see there are a few follow-ons
19:09:02 <ianw> which just moves the same idea to the venv creation, which i noticed when watching the logs
19:09:31 <ianw> the other stack that needs feedback and action, particularly from infra-roots, is
19:09:33 <ianw> #link https://review.opendev.org/q/topic:bridge-backups
19:09:58 <clarkb> oh right I had that in my local agenda notes. sorry
19:10:29 <clarkb> and what that does is encrypt things locally so they can be backed up remotely right?
19:11:05 <ianw> essentially yes, with a key split requiring 2 people to recombine
19:11:38 <ianw> this is so nobody needs to feel like they need to set up fort knox to keep the backup
19:12:46 <clarkb> I'll have to read into that more to understand the mechanics of it. Like do we all need to forward gpg agents or something? But that can happen in review or in #opendev
19:13:04 <clarkb> I'll do my best to review those two stacks after the board meeting today
19:13:25 <clarkb> Anything else bastion related?
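[As an illustration of the 2-of-N key split ianw describes at 19:11:05, the general technique can be done with an off-the-shelf Shamir secret-sharing tool such as ssss. This is only a hedged sketch of the idea, not necessarily what the topic:bridge-backups changes implement; the paths, filenames, and share counts are placeholders.]

```shell
# Generate a random passphrase and symmetrically encrypt the backup with it.
PASS=$(openssl rand -base64 32)
tar -czf - /path/to/bridge/secrets \
  | gpg --batch --pinentry-mode loopback --symmetric \
        --passphrase "$PASS" -o bridge-backup.tar.gz.gpg

# Split the passphrase into 5 shares, any 2 of which reconstruct it, so no
# single keyholder has to guard a complete secret ("fort knox" avoidance).
echo "$PASS" | ssss-split -t 2 -n 5 -w bridge-backup

# Recovery: any two share holders run ssss-combine, which prints the
# recovered passphrase; decrypting then prompts for it interactively.
ssss-combine -t 2
gpg --decrypt bridge-backup.tar.gz.gpg | tar -xzf -
```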
19:13:26 <ianw> thanks; https://review.opendev.org/c/opendev/system-config/+/866430 should be pretty self-explanatory for that i think
19:13:51 <ianw> nope, thanks
19:14:00 <clarkb> #topic Upgrading old servers
19:14:09 <clarkb> Nothing new here other than we should find time to do more of this :/
19:14:25 <clarkb> I guess technically the bastion work is a subset of this so we are pushing that along :)
19:14:37 <fungi> technically we've partially upgraded the listserv too
19:14:37 <clarkb> and the mm3 work isn't directly related but does get us off an old server that has kernel fun
19:14:43 <fungi> yeah that
19:14:57 <clarkb> progress then. I'll take it :)
19:15:05 <fungi> i guess we already upgraded the distro on the old mailman server anyway
19:15:08 <clarkb> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes
19:15:12 <fungi> just not painlessly
19:15:15 <clarkb> yup
19:15:31 <clarkb> Which is a good lead into the next topic
19:15:39 <clarkb> #topic Mailman 3
19:15:44 <clarkb> #link https://etherpad.opendev.org/p/mm3migration Server and list migration notes
19:15:58 <clarkb> lists.opendev.org and lists.zuul-ci.org moved to mailman3 on the new server yesterday
19:16:09 <fungi> and within the scheduled window even
19:16:18 <fungi> though in retrospect i should have called it two hours just in case
19:16:34 <fungi> i didn't factor in gate/deploy time for the dns updates
19:16:37 <clarkb> there were/are a couple of issues we found in the process. One was fixed which corrected some url routes. The other is setting a site_owner value which was missed because all the other settings are set by env vars but not this one
19:16:51 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/866632 set mailman3 site_owner
19:17:04 <clarkb> fungi: we managed to make the timing work in the end
19:17:20 <clarkb> There is also a change to spin things down on the old server for these two domains
19:17:27 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/866630 disable old mm2 domains
19:17:38 <fungi> yeah, the broken archiving is totally my bad. when i re-tested the held node after we updated to the latest mm release, i forgot to double-check that new list messages ended up in the imported archive
19:18:01 <fungi> thanks to corvus and clarkb for figuring it out while i was off stuffing my face
19:18:17 <clarkb> One "exciting" thing is that the upstream mailing list has a thread from today (of course the day after we upgrade) suggesting people not run latest mm3 which we are doing.
19:18:28 <fungi> hah
19:18:32 <ianw> will 866632 require a restart?
19:18:34 <clarkb> The reason for this is a bug when handling more than 10 list sites, possibly postfix specific (we run exim)
19:18:36 <fungi> that's so our luck
19:18:36 <clarkb> ianw: yes
19:18:49 <clarkb> #link https://gitlab.com/mailman/mailman/-/issues/1044 bug with more than 10 lists
19:19:30 <clarkb> lists.opendev.org has 10 lists and lists.zuul-ci.org has fewer. I've also grepped for that warning string in our logs. I'm not sure if we are not affected because we use exim or if it is because we have few lists, but I haven't found evidence we have a problem yet
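[For context on the site_owner issue discussed at 19:16:37 and 19:18:32: Mailman 3 core reads this value from mailman.cfg, and its stock default is a changeme@example.com-style address, which is easy to overlook when every other setting is injected via environment variables. The sketch below assumes a containerized deployment; the container name and address are placeholders, and 866632 may wire the value differently.]

```shell
# Core reads site_owner from mailman.cfg, roughly:
#   [mailman]
#   site_owner: mailman@lists.opendev.org    <- placeholder address
# site_owner receives site-wide notifications such as undeliverable bounces.

# After the restart that 866632 requires (per ianw's question above),
# confirm the running core picked up the new value:
docker exec mailman-core mailman conf -s mailman -k site_owner
```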
19:19:45 <clarkb> something to watch though as lists.openstack.org has many more than 10 lists and we want this to be working before we upgrade
19:19:53 <clarkb> er s/upgrade/migrate lists.openstack.org/
19:20:13 <clarkb> And please say something if you see the indicated behavior with the migrated lists
19:20:43 <fungi> though lists.openstack.org also has way fewer lists than it used to, and might have fewer still if i can convince them to retire some more unused ones before we move it
19:20:47 <clarkb> Other than that I think this went really well. These two domains are newer but we migrated stuff off of an ancient server and software to something modern and it seems to work
19:21:14 <clarkb> the testing and planning seem to have done their job. Thank you to everyone who helped with that. fungi in particular did a lot of work on that side of things
19:21:21 <fungi> thanks everyone for all the work on that
19:21:53 <fungi> it's been on my wishlist for years and i was never able to find the time to tackle it on my own
19:22:26 <clarkb> anything else to add?
19:22:45 <fungi> nothing on my end
19:23:07 <clarkb> #topic Quo vadis Storyboard
19:23:22 <clarkb> I just realized my old link should be updated to the new hyperkitty archives. Oh well
19:23:33 <clarkb> I did send a followup covering our options and asked for feedback
19:23:52 <clarkb> The one response I got was someone offering to help with the software but unfortunately I think we need to start with the deployment if we are going to adopt it
19:24:04 <clarkb> *if we adopt software maintenance we need to commit to updating the deployment first
19:24:23 <clarkb> I'll leave it open for more feedback as it has only been about a week. I'd be happy to hear from you
19:24:53 <clarkb> and I guess if that doesn't work I can suggest that people provide semi-anonymous feedback instead and I can try to collate it if people trust me to do that
19:25:08 <clarkb> But I want to make sure whatever we do here is reasonable and will be accepted
19:25:11 <fungi> yes, the software is already well ahead of what we're running in terms of major bug and performance fixes and new features
19:25:30 <fungi> which is a big part of the problem
19:25:58 <fungi> we had volunteers to develop the software, but nobody keeping our deployment up to date with it
19:26:04 <clarkb> right
19:26:37 <clarkb> anyway, let's see how we do over the next week for feedback and we can take a different approach if this continues to not generate input
19:26:47 <clarkb> I think 2 weeks is a reasonable amount of time for this sort of thing and we are halfway through that right now
19:26:54 <fungi> agreed. thanks!
19:27:10 <clarkb> #topic Vexxhost server rescues
19:27:20 <clarkb> jrosser shared image settings with me
19:27:27 <clarkb> #link https://paste.opendev.org/show/bxxFEEUWeUrkIVlBSGrw/ jrosser's image settings that work in their cloud
19:27:47 <clarkb> I've got a todo item to try and upload an image with those settings set and use it as a rescue image after modifying the root boot label
19:27:53 <clarkb> But I haven't done that yet
19:28:19 <clarkb> They use ceph too so I'm hopeful that this will work
19:28:30 <clarkb> #topic Gerrit 3.6
19:28:35 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.6
19:29:03 <clarkb> ianw ran copy-approvals on all of our repos. We had a small problem in neutron due to a change with more than 1k patchsets which is our current limit
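[For reference, copy-approvals is the program Gerrit ships for persisting "copied" votes onto changes ahead of the 3.6 upgrade, so the upgrade itself doesn't have to compute them. A hedged sketch of the invocation is below; the site path is an assumed placeholder, and any option for raising the patchset limit ianw bumped is omitted since the exact flag isn't shown in the log.]

```shell
# Sketch: run copy-approvals against the Gerrit site.
# /home/gerrit2/review_site is an assumed path for illustration only.
java -jar /home/gerrit2/review_site/bin/gerrit.war \
    copy-approvals -d /home/gerrit2/review_site
```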
19:29:18 <clarkb> ianw temporarily bumped that limit and reran which caused things to work except for a corrupt change
19:29:39 <clarkb> even if that didn't work we would've been fine because all of the neutron changes are closed and not open so their votes are largely there for historical accuracy
19:30:01 <clarkb> ianw: looks like you've noted the next steps are holding a node and double checking things a bit more directly
19:30:11 <clarkb> as well as working on a proposal for the upgrade
19:30:51 <clarkb> ianw: is there anything around the gerrit upgrade we can help with?
19:31:06 <clarkb> I will note the openstack release cycle schedule is at https://releases.openstack.org/antelope/schedule.html which we should avoid conflicts with
19:31:18 <ianw> not really, i just want to hold a node and validate downgrade, which i should be able to do very soon
19:32:04 <ianw> if the mm upgrade time worked, we could do it then
19:32:25 <clarkb> ianw: ~2000UTC on a monday you mean?
19:33:03 <fungi> wfm
19:33:18 <ianw> yep, that would open up the 12th or the 19th, i'm around for both, though less time after the 19th
19:33:56 <clarkb> ya it's a trade off I guess. Less lead time to test and announce with the 12th and less time to fix/debug if the 19th
19:34:05 <ianw> if we get some held node validation over the next few days, maybe the 12th?
19:34:19 <ianw> i'm fairly confident, there doesn't seem to be much more we could test
19:34:22 <clarkb> ya if that all looks good and doesn't show anything that users should need to worry about I'd be good with the 12th
19:34:39 <clarkb> we can even send an announcement nowish indicating we plan to do it on the 12th and postpone if necessary
19:34:50 <clarkb> I think there is a downgrade process for 3.6 -> 3.5 too so we have that option if necessary
19:35:00 <ianw> and that's an excuse to send a message through the list to keep it ungreylisted too :)
19:35:07 <fungi> heh
19:35:31 <clarkb> if we test and confirm the downgrade process seems to work then I'm extra happy to proceed early
19:35:43 <clarkb> I think 3.7 -> 3.6 has a less easy downgrade though so that upgrade will be a funner one
19:36:10 <ianw> ok, i will get onto all that, https://etherpad.opendev.org/p/gerrit-upgrade-3.6 will be updated as things happen
19:36:37 <clarkb> sounds good, thanks!
19:36:52 <clarkb> #topic Open Discussion
19:37:13 <clarkb> that was it for the agenda but this morning I noticed something I had on my back burner has made some progress and is worth calling out
19:37:29 <clarkb> The first bit is nodepool updated to latest openstacksdk which includes ianw's fix for network stuff against older apis
19:38:03 <clarkb> image uploads seem to work (we have recent images) and I haven't seen any launcher issues. But we should skim the grafana dashboard for any evidence of problems
19:38:20 <clarkb> And then that unlocked the path for updating zuul and nodepool images to python3.11
19:38:30 <clarkb> The zuul change has landed and nodepool is gating
19:38:53 <clarkb> nodepool will restart once that change lands. zuul will normally restart over the weekend. Do we want to manually restart zuul sooner to observe it?
19:39:09 <clarkb> I should be able to do that tomorrow if we think that is a good idea.
19:39:29 <clarkb> In particular one thing I realized is that ansible might not like python3.11? However, we do have zuul testing that exercises ansible so maybe it's fine?
19:39:36 <clarkb> cc corvus ^ if you have an opinion
19:39:44 <clarkb> I'm also happy to revert if we think we need more prep
19:40:45 <clarkb> Oh also last week I cleaned up the inmotion cloud's nova and placement records
19:41:07 <clarkb> There were two distinct issues. The first was that placement had leaked a few records for nodes that just didn't exist anymore either on the host or in the nova db
19:41:17 <clarkb> the second was the nova db leaked instances that didn't exist in libvirt on the hosts
19:41:32 <clarkb> cleaning up the first thing is relatively straightforward and placement has docs on the process.
19:42:15 <clarkb> Cleaning up the second thing required manually editing the nova db to associate nodes with the cell they lived in because some nova bug allowed them to be disassociated which broke server deletion. Once those records were updated server delete worked fine
19:42:35 <clarkb> melwitt was a huge help in sorting that out, but now we have more nodes to test with so yay
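[To make the two cleanups from 19:41:07 onward concrete, here is a hedged sketch. The placement side uses the osc-placement plugin commands; the nova side assumes the disassociation clarkb describes shows up as a NULL cell_id in the nova_api instance_mappings table. The UUIDs and the 'cell1' cell name are placeholders, and this illustrates the general shape rather than the exact commands that were run.]

```shell
# 1) Placement: inspect and remove allocations (and, if fully dead, the
#    resource provider) for nodes that no longer exist anywhere.
openstack resource provider list
openstack resource provider allocation show "$INSTANCE_UUID"
openstack resource provider allocation delete "$INSTANCE_UUID"
openstack resource provider delete "$PROVIDER_UUID"

# 2) Nova: re-associate an orphaned instance mapping with its cell so that
#    server deletion can locate the instance again. Assumes the bug left
#    cell_id NULL; 'cell1' is a placeholder cell name.
mysql nova_api -e "
  UPDATE instance_mappings
     SET cell_id = (SELECT id FROM cell_mappings WHERE name = 'cell1')
   WHERE instance_uuid = '$INSTANCE_UUID' AND cell_id IS NULL;"
openstack server delete "$INSTANCE_UUID"
```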
19:43:37 <clarkb> oh we also had leaked nodes in rax
19:43:51 <clarkb> they didn't have proper nodepool metadata so nodepool refused to clean them up. i manually cleared those out
19:43:55 <corvus> clarkb: i think zuul's own tests should give us a heads up on any ansible/python probs like that. i don't have a strong feeling about whether we need to restart early or just let it happen on the weekend
19:44:03 <clarkb> corvus: ack thanks
19:44:47 <clarkb> as far as team meetings go I think we'll cancel the 27th. Any strong opinions for having meetings on the 13th or january 3?
19:45:17 <fungi> i'll be around on the 13th and 3rd but don't necessarily require a meeting
19:45:20 <clarkb> er sorry the 20th and 3rd
19:45:26 <clarkb> I plan to be around on the 13th and have that meeting
19:45:56 <fungi> i also should be around on the 20th but may be a little distracted
19:46:36 <ianw> i should be around on the 20th ... unsure on the 3rd
19:46:50 <ianw> for sure not the 27th
19:47:00 <clarkb> ok we can do a low key meeting on the 20th, then see what the new year looks like when we get there
19:47:09 <fungi> i do expect to have far more work than usual the week of the 3rd so may be distracted then too
19:47:22 <clarkb> ya it's the time of year when all the paperwork needs to be done :)
19:47:44 <fungi> so much paperwork
19:47:54 <clarkb> alright then we'll see you here on the 13th and probably the 20th. Then we can enjoy the holidays for a bit (and you should enjoy them earlier too if you are able :) )
19:48:03 <fungi> thanks clarkb!
19:48:04 <clarkb> anything else?
19:48:32 <corvus> schedule a holiday party for the 20th ;)
19:49:13 <clarkb> good idea. Let me see if I can figure something out for that
19:49:24 <clarkb> board game arena game or something :)
19:49:46 <clarkb> thank you everyone for your time, I'll let you go now. See you next week.
19:49:48 <clarkb> #endmeeting