19:00:18 <clarkb> #startmeeting infra
19:00:18 <opendevmeet> Meeting started Tue Mar 11 19:00:18 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:18 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:18 <opendevmeet> The meeting name has been set to 'infra'
19:00:25 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/X75EQWC4QAGVRJKPB3HTTKHKONAM26JJ/ Our Agenda
19:00:32 <clarkb> #topic Announcements
19:00:42 <clarkb> I don't have anything to announce. Did anyone else have something?
19:02:12 * tonyb has nothing
19:03:24 <clarkb> #topic Zuul-launcher image builds
19:03:44 <clarkb> I think the big update here is that we did do the raxflex surgery to switch tenants and update networks (and their MTUs)
19:03:59 <clarkb> that happened for both zuul-launcher and nodepool and as far as I know is working
19:04:00 <corvus> yay!
19:04:26 <corvus> i know at least some nodes were provided by zuul-launcher yesterday
19:04:35 <corvus> i need to check on a node failure though
19:04:47 <clarkb> that would imply things are mostly working I think. That's good
19:04:48 <corvus> so let's say i should confirm that's actually working with zl
19:05:36 <corvus> hrm, i think there may be a rax flex problem with zl
19:05:47 <corvus> 2025-03-10 17:26:56,062 ERROR zuul.Launcher: TypeError: create_server() requires either 'image' or 'boot_volume'
19:05:49 <corvus> not sure what the deal is there
19:06:02 <corvus> i'll follow up in #opendev tho
19:06:05 <clarkb> nodepool is using images to boot off of not boot from volume fwiw
19:06:12 <clarkb> ack. Anything else to bring up on this topic?
19:06:35 <corvus> i think we're really close to being ready to start expanding the use of niz
19:06:48 <corvus> like maybe one or two more features to add to the launcher
19:07:07 <corvus> so it's probably worth asking how to proceed?
19:07:10 <corvus> i'm thinking...
19:07:44 <corvus> continue to add some more providers, and switch the zuul tenant over to use niz exclusively. do that without the new flavors though (so we use the old flavors -- 8gb everywhere)
19:07:56 <fungi> i do have a change up to add rackspace flex dfw3
19:08:00 <corvus> (basically, decouple the adding of new flavors from the switching to niz)
19:08:32 <corvus> fungi: oh cool
19:08:42 <fungi> note that sjc3 and dfw3 use different flavor names, so my change moves the label->flavor mapping up into the region layer
19:09:00 <corvus> if we do what i suggest -- should we also dial down max-servers in nodepool just a little? or let them duke it out?
19:09:01 <fungi> #link https://review.opendev.org/943104 Add the DFW3 region for Rackspace Flex
19:09:24 <clarkb> corvus: I suspect we can dial back max-servers by some small amount in nodepool
19:09:30 <corvus> fungi: nice -- that's why we can do it in both places
19:09:43 <fungi> yeah, i found that convenient
19:09:47 <fungi> great design!
19:10:27 <clarkb> we tend to float under max-servers due to $cloudproblems anyway so we may find that eventually leads to them duking it out :)
19:10:28 <corvus> we're definitely getting to the point where more images would be good to have :) we have everything needed for zuul i think, but it would be good to have other images that other tenants use
19:10:38 <corvus> clarkb: good point
19:11:30 <corvus> i think that's about it from me
19:11:54 <clarkb> thanks for the update. Adding more images and starting with the zuul tenant sounds great to me
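(For context on the create_server() error corvus quotes above: the openstacksdk cloud layer refuses to boot a server unless either an image or a boot volume is supplied. A minimal sketch of the two call shapes, with a hypothetical cloud name, image, flavor, and volume ID, is below; this is only an illustration of the SDK behavior, not the zuul-launcher code itself.)

    # Sketch only: cloud name, image, flavor and volume ID are made up.
    # Omitting both 'image' and 'boot_volume' raises exactly the TypeError
    # seen in the launcher log above.
    import openstack

    conn = openstack.connect(cloud='raxflex-sjc3')  # hypothetical clouds.yaml entry

    # Boot from a glance image (what nodepool does today):
    server = conn.create_server(
        name='test-node',
        image='ubuntu-noble-example',   # must resolve to an image in the cloud
        flavor='gp.0.4.8',              # hypothetical flavor name
        wait=True,
    )

    # Or boot from a pre-existing volume instead of an image:
    server = conn.create_server(
        name='test-node-bfv',
        boot_volume='volume-uuid-goes-here',  # hypothetical volume ID
        flavor='gp.0.4.8',
        wait=True,
    )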
19:12:03 <clarkb> #topic Updating Flavors in OVH
19:12:11 <clarkb> related is the OVH flavor surgery that we've discussed with amorin
19:12:16 <clarkb> #link https://etherpad.opendev.org/p/ovh-flavors
19:12:23 <clarkb> the proposal is to do this Monday (March 17) ish
19:12:37 <corvus> did amorin respond about scheduling?
19:12:54 <corvus> i missed a bunch of scrollback due to travel
19:13:06 <clarkb> 19:21:33* amorin | corvus: I will check with the team tomorrow about march 17 and let you know.
19:13:09 <clarkb> this was the last message I saw
19:13:20 <clarkb> so I guess we're still waiting on final confirmation they will proceed then
19:13:25 <corvus> cool, i did miss that. gtk.
19:13:34 <corvus> probably worth a ping to check on the status, but no rush
19:13:53 <clarkb> ++
19:14:14 <clarkb> and ya otherwise it's mostly be aware of this as a thing we're trying to do. It should make working with ovh in zuul-launcher and nodepool nicer
19:14:24 <clarkb> more future proofed by getting off the bespoke scheduling system
19:16:33 <clarkb> #topic Running infra-prod Jobs in Parallel on Bridge
19:16:56 <clarkb> Yesterday we landed the change to increase our infra-prod jobs semaphore limit to 2. Since then we have been running our infra-prod jobs in parallel
19:17:07 <corvus> yaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaay
19:17:23 <tonyb> \o/ #greatsuccess
19:17:33 <fungi> now deploys are happening ~2x as fast
19:17:40 <fungi> it's very nice!
19:17:43 <clarkb> I monitored deploy, opendev-prod-hourly, and periodic buildsets as well as general demand/load on bridge and thankfully no major issues came up
19:17:51 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/943999 A small logs fixup
19:18:02 <clarkb> this fixup addresses ansible logging on bridge that ianw noticed is off
19:18:21 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/943992 A small job ordering fixup
19:18:33 <clarkb> and this change makes puppet-else depend on letsencrypt because puppet deployed services do use LE certs
19:18:45 <clarkb> neither of which is urgent nor impacting functionality at the moment
19:19:15 <corvus> it's important that puppet work so we can update our jenkins configuration :)
19:19:45 <clarkb> indeed
19:20:42 <corvus> (amusingly, since change #1 is a puppet change, we can say 943998 changes later and we still have puppet)
19:20:42 <clarkb> other than those two changes I think the next step is considering increasing that semaphore limit to a larger number. Bridge has 8 vcpus and load avg was below 2 every time I checked. I think that means we can safely bump up to 4 or 5 in the semaphore
19:21:24 <corvus> 4 sgtm
19:21:40 <tonyb> I like 4 also
19:21:59 <clarkb> I was also hoping to land a few more regular system-config changes to see deploy do its thing but the one that merged today largely noop'd
19:22:07 <clarkb> fungi's change from yesterday did not noop and was a good exercise thankfully
19:22:26 <clarkb> ok maybe try to land a change or two to system-config later today and if that and periodic look good bump up to 4 tomorrow
19:23:10 <fungi> once we have performance graphs we can check easily again, we might consider increasing it further
19:23:16 <clarkb> anything else on this topic? It was a long time coming but it's here and seems to be working so yay and thank you to everyone who helped make it happen
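(The semaphore being raised above is Zuul's, defined in project configuration; as a rough mental model of what bumping the limit from 2 to 4 changes, here is a hedged stdlib Python sketch of "at most N infra-prod playbooks run at once". The job names and sleep are made up, and this is not how Zuul implements semaphores.)

    # Conceptual model only, not Zuul internals: a bounded semaphore caps how
    # many deploy playbooks execute concurrently on bridge.
    import threading
    import time

    INFRA_PROD_LIMIT = 2  # the current limit; the proposal is to raise it to 4

    semaphore = threading.BoundedSemaphore(INFRA_PROD_LIMIT)

    def run_playbook(name: str) -> None:
        with semaphore:                  # blocks once LIMIT playbooks are running
            print(f"starting {name}")
            time.sleep(1)                # stand-in for ansible-playbook runtime
            print(f"finished {name}")

    jobs = ["service-gitea", "service-gerrit", "service-zuul", "letsencrypt"]  # illustrative names
    threads = [threading.Thread(target=run_playbook, args=(j,)) for j in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()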
19:24:37 <clarkb> #topic Upgrading old servers
19:24:54 <clarkb> As expected last week was consumed by infra-prod changes and the start of this week too
19:25:04 <clarkb> so I don't have anything new on this front. But it is the next big item on my todo list to tackle
19:25:09 <clarkb> tonyb: did you have any updates?
19:26:07 <tonyb> :( No.
19:26:16 <clarkb> #topic Sprinting to Upgrade Servers to Noble
19:26:26 <clarkb> so as mentioned I'm hoping to pick this up again starting tomorrow most likely
19:26:53 <clarkb> Help with reviews is always appreciated as is help bootstrapping replacement servers
19:26:58 <clarkb> #link https://etherpad.opendev.org/p/opendev-server-replacement-sprint
19:27:12 <clarkb> feel free to throw some notes in that etherpad and I'll try to get it updated from my side as I pick it up again
19:28:04 <clarkb> Then in related news I started brainstorming what the Gerrit replacement looks like. I think it is basically boot a new gerrit and volume, have system-config deploy it "empty" without replication configured. Sanity check things look right and functional. Then sync over current prod state and recheck operation. Then schedule a downtime to do a final sync and cut over dns
19:28:21 <clarkb> the main thing is that we don't want replication to run until the new server is in production to avoid overwriting content on gitea
19:28:37 <clarkb> if you can think of any other sync points that need to be handled carefully let me know
19:29:43 <clarkb> #topic Running certcheck on bridge
19:29:59 <clarkb> as mentioned last week I volunteered to look at running this within an infra-prod job that runs daily instead
19:30:15 <clarkb> I haven't done that yet but am hopeful I'll be able to look before next meeting just because I have so much infra-prod stuff paged in at this point
19:30:42 <clarkb> and with things running in parallel the wall clock time cost for that should be minimal
19:31:05 <clarkb> #topic Working through our TODO list
19:31:09 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:31:28 <clarkb> this is just your weekly reminder to look at that list if you need good ideas for what to work on next. I think I can probably mark off infra-prod running in parallel now too
19:32:48 <clarkb> #topic Open Discussion
19:32:50 <clarkb> Anything else?
19:33:21 <fungi> i didn't have anything
19:34:28 <clarkb> my kids are out on spring break week after next. I may try to pop out a day or two to do family things as a result. But we have nothing big planned so my guess is it wouldn't be anything crazy
19:34:31 <tonyb> Nothing from me this week
19:36:10 <clarkb> that was quick. I think we can end early today then
19:36:15 <clarkb> thank you everyone for your time and help
19:36:19 <clarkb> see you back here next week
19:36:22 <clarkb> #endmeeting
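(A footnote on the certcheck topic above: the actual tool the team runs is a separate script, but the kind of check it performs can be sketched with only the Python standard library. The hostnames and the 30-day warning threshold below are illustrative assumptions, not the real certcheck configuration.)

    # Illustrative only -- not the certcheck tool itself. Reports how many days
    # remain on a host's TLS certificate using only the standard library.
    import datetime
    import socket
    import ssl

    def days_until_expiry(host: str, port: int = 443) -> int:
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        not_after = datetime.datetime.strptime(
            cert["notAfter"], "%b %d %H:%M:%S %Y %Z"
        )
        return (not_after - datetime.datetime.utcnow()).days

    for host in ["review.opendev.org", "zuul.opendev.org"]:  # example hosts
        remaining = days_until_expiry(host)
        status = "OK" if remaining > 30 else "WARNING"       # 30-day cutoff is arbitrary
        print(f"{host}: {remaining} days remaining ({status})")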