19:00:18 <clarkb> #startmeeting infra
19:00:18 <opendevmeet> Meeting started Tue Mar 11 19:00:18 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:18 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:18 <opendevmeet> The meeting name has been set to 'infra'
19:00:25 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/X75EQWC4QAGVRJKPB3HTTKHKONAM26JJ/ Our Agenda
19:00:32 <clarkb> #topic Announcements
19:00:42 <clarkb> I don't have anything to announce. Did anyone else have something?
19:02:12 * tonyb has nothing
19:03:24 <clarkb> #topic Zuul-launcher image builds
19:03:44 <clarkb> I think the big update here is that we did do the raxflex surgery to switch tenants and update networks (and their MTUs)
19:03:59 <clarkb> that happened for both zuul-launcher and nodepool and as far as I know is working
19:04:00 <corvus> yay!
19:04:26 <corvus> i know at least some nodes were provided by zuul-launcher yesterday
19:04:35 <corvus> i need to check on a node failure though
19:04:47 <clarkb> that would imply things are mostly working I think. That's good
19:04:48 <corvus> so let's say i should confirm that's actually working with zl
19:05:36 <corvus> hrm, i think there may be a rax flex problem with zl
19:05:47 <corvus> 2025-03-10 17:26:56,062 ERROR zuul.Launcher:   TypeError: create_server() requires either 'image' or 'boot_volume'
19:05:49 <corvus> not sure what the deal is there
19:06:02 <corvus> i'll follow up in #opendev tho
19:06:05 <clarkb> nodepool is using images to boot off of not boot from volume fwiw
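(For context, the TypeError above comes from openstacksdk: its create_server() call refuses to proceed unless it is given either an image or a boot_volume. A minimal sketch of the image-based boot path discussed here; the cloud, image, flavor, and network names below are placeholders, not the actual production values:)

    import openstack

    # Placeholder cloud name; real credentials live in clouds.yaml on the launcher host.
    conn = openstack.connect(cloud='raxflex-sjc3')

    # Passing an image (or boot_volume) is what avoids the TypeError seen in the launcher log.
    server = conn.create_server(
        name='test-node',
        image='ubuntu-noble',            # placeholder image name
        flavor='gp.0.4.8',               # placeholder flavor name
        network='opendevzuul-network',   # placeholder network name
        wait=True,
    )
    print(server.status)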
19:06:12 <clarkb> ack. Anything else to bring up on this topic?
19:06:35 <corvus> i think we're really close to being ready to start expanding the use of niz
19:06:48 <corvus> like maybe one or two more features to add to the launcher
19:07:07 <corvus> so it's probably worth asking how to proceed?
19:07:10 <corvus> i'm thinking...
19:07:44 <corvus> continue to add some more providers, and switch the zuul tenant over to use niz exclusively.  do that without the new flavors though (so we use the old flavors -- 8gb everywhere)
19:07:56 <fungi> i do have a change up to add rackspace flex dfw3
19:08:00 <corvus> (basically, decouple the adding of new flavors from the switching to niz)
19:08:32 <corvus> fungi: oh cool
19:08:42 <fungi> note that sjc3 and dfw3 use different flavor names, so my change moves the label->flavor mapping up into the region layer
19:09:00 <corvus> if we do what i suggest -- should we also dial down max-servers in nodepool just a little?  or let them duke it out?
19:09:01 <fungi> #link https://review.opendev.org/943104 Add the DFW3 region for Rackspace Flex
19:09:24 <clarkb> corvus: I suspect we can dial back max-servers by some small amount in nodepool
19:09:30 <corvus> fungi: nice -- that's why we can do it in both places
19:09:43 <fungi> yeah, i found that convenient
19:09:47 <fungi> great design!
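(Since sjc3 and dfw3 expose different flavor names, one quick way to double-check the label->flavor mapping is to list the flavors each region offers with openstacksdk; a rough sketch, assuming placeholder cloud names for the two Flex regions:)

    import openstack

    # Placeholder cloud names; the real entries live in clouds.yaml.
    for cloud in ('raxflex-sjc3', 'raxflex-dfw3'):
        conn = openstack.connect(cloud=cloud)
        names = sorted(flavor.name for flavor in conn.compute.flavors())
        print(cloud, names)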
19:10:27 <clarkb> we tend to float under max-servers due to $cloudproblems anyway so we may find that eventually leads to them duking it out :)
19:10:28 <corvus> we're definitely getting to the point where more images would be good to have :)  we have everything needed for zuul i think, but it would be good to have other images that other tenants use
19:10:38 <corvus> clarkb: good point
19:11:30 <corvus> i think that's about it from me
19:11:54 <clarkb> thanks for the update. Adding more images and starting with the zuul tenant sounds great to me
19:12:03 <clarkb> #topic Updating Flavors in OVH
19:12:11 <clarkb> related is the OVH flavor surgery that we've discussed with amorin
19:12:16 <clarkb> #link https://etherpad.opendev.org/p/ovh-flavors
19:12:23 <clarkb> the proposal is to do this Monday (March 17) ish
19:12:37 <corvus> did amorin respond about scheduling?
19:12:54 <corvus> i missed a bunch of scrollback due to travel
19:13:06 <clarkb> 19:21:33*       amorin | corvus: I will check with the team tomorrow about march 17 and let you know.
19:13:09 <clarkb> this was the last message I saw
19:13:20 <clarkb> so I guess we're still waiting on final confirmation they will proceed then
19:13:25 <corvus> cool, i did miss that.  gtk.
19:13:34 <corvus> probably worth a ping to check on the status, but no rush
19:13:53 <clarkb> ++
19:14:14 <clarkb> and ya otherwise it's mostly: be aware of this as a thing we're trying to do. It should make working with ovh in zuul-launcher and nodepool nicer
19:14:24 <clarkb> more future proofed by getting off the bespoke scheduling system
19:16:33 <clarkb> #topic Running infra-prod Jobs in Parallel on Bridge
19:16:56 <clarkb> Yesterday we landed the change to increase our infra-prod jobs semaphore limit to 2. Since then we have been running our infra-prod jobs in parallel
19:17:07 <corvus> yaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaay
19:17:23 <tonyb> \o/  #greatsuccess
19:17:33 <fungi> now deploys are happening ~2x as fast
19:17:40 <fungi> it's very nice!
19:17:43 <clarkb> I monitored deploy, opendev-prod-hourly, and periodic buildsets as well as general demand/load on bridge and thankfully no major issues came up
19:17:51 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/943999 A small logs fixup
19:18:02 <clarkb> this fixup addresses ansible logging on bridge that ianw noticed is off
19:18:21 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/943992 A small job ordering fixup
19:18:33 <clarkb> and this change makes puppet-else depend on letsencrypt because puppet deployed services do use LE certs
19:18:45 <clarkb> neither of which is urgent nor impacting functionality at the moment
19:19:15 <corvus> it's important that puppet work so we can update our jenkins configuration :)
19:19:45 <clarkb> indeed
19:20:42 <corvus> (amusingly, since change #1 is a puppet change, we can say 943998 changes later and we still have puppet)
19:20:42 <clarkb> other than those two changes I think the next step is considering increasing that semaphore limit to a larger number. Bridge has 8 vcpus and load avg was below 2 every time I checked. I think that means we can safely bump up to 4 or 5 in the semaphore
19:21:24 <corvus> 4 sgtm
19:21:40 <tonyb> I like 4 also
19:21:59 <clarkb> I was also hoping to land a few more regular system-config changes to see deploy do its thing but the one that merged today largely noop'd
19:22:07 <clarkb> fungi's change from yesterday did not noop and was a good exercise thankfully
19:22:26 <clarkb> ok maybe try to land a change or two to system-config later today and if that and periodic look good bump up to 4 tomorrow
19:23:10 <fungi> once we have performance graphs so we can check easily again, we might consider increasing it further
19:23:16 <clarkb> anything else on this topic? It was a long time coming but its here and seems to be working so yay and thank you to everyone who helped make it happen
19:24:37 <clarkb> #topic Upgrading old servers
19:24:54 <clarkb> As expected last week was consumed by infra-prod changes and the start of this week too
19:25:04 <clarkb> so I don't have anything new on this front. But it is the next big item on my todo list to tackle
19:25:09 <clarkb> tonyb: did you have any updates?
19:26:07 <tonyb> :( No.
19:26:16 <clarkb> #topic Sprinting to Upgrade Servers to Noble
19:26:26 <clarkb> so as mentioned I'm hoping to pick this up again starting tomorrow most likely
19:26:53 <clarkb> Help with reviews is always appreciated as is help bootstrapping replacement servers
19:26:58 <clarkb> #link https://etherpad.opendev.org/p/opendev-server-replacement-sprint
19:27:12 <clarkb> feel free to throw some notes in that etherpad and I'll try to get it updated from my side as I pick it up again
19:28:04 <clarkb> Then in related news I started brainstorming what the Gerrit replacement looks like. I think it is basically boot a new gerrit and volume, have system-config deploy it "empty" without replication configured. Sanity check things look right and functional. Then sync over current prod state and recheck operation. Then schedule a downtime to do a final sync and cut over dns
19:28:21 <clarkb> the main thing is that we don't want replication to run until the new server is in production to avoid overwriting content on gitea
19:28:37 <clarkb> if you can think of any other sync points that need to be handled carefully let me know
19:29:43 <clarkb> #topic Running certcheck on bridge
19:29:59 <clarkb> as mentioned last week I volunteered to look at running this within an infra-prod job that runs daily instead
19:30:15 <clarkb> I haven't done that yet but am hopeful I'll be able to look before next meeting just because I have so much infra-prod stuff paged in at this point
19:30:42 <clarkb> and with things running in parallel the wall clock time cost for that should be minimal
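(As an illustration of the kind of check certcheck performs -- this is not the certcheck tool itself, just a standalone sketch of verifying certificate expiry for a host:)

    import datetime
    import socket
    import ssl

    def cert_days_remaining(host, port=443):
        # Return days until the TLS certificate presented by host:port expires.
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        expires = datetime.datetime.strptime(cert['notAfter'], '%b %d %H:%M:%S %Y %Z')
        return (expires - datetime.datetime.utcnow()).days

    print(cert_days_remaining('review.opendev.org'))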
19:31:05 <clarkb> #topic Working through our TODO list
19:31:09 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:31:28 <clarkb> this is just your weekly reminder to look at that list if you need good ideas for what to work on next. I think I can probably mark off infra-prod running in parallel now too
19:32:48 <clarkb> #topic Open Discussion
19:32:50 <clarkb> Anything else?
19:33:21 <fungi> i didn't have anything
19:34:28 <clarkb> my kids are out on spring break week after next. I may try to pop out a day or two to do family things as a result. but we have nothing big planned so my guess is it wouldn't be anything crazy
19:34:31 <tonyb> Nothing from me this week
19:36:10 <clarkb> that was quick. I think we can end early today then
19:36:15 <clarkb> thank you everyone for your time and help
19:36:19 <clarkb> see you back here next week
19:36:22 <clarkb> #endmeeting