19:00:18 #startmeeting infra
19:00:18 Meeting started Tue Mar 11 19:00:18 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:18 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:18 The meeting name has been set to 'infra'
19:00:25 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/X75EQWC4QAGVRJKPB3HTTKHKONAM26JJ/ Our Agenda
19:00:32 #topic Announcements
19:00:42 I don't have anything to announce. Did anyone else have something?
19:02:12 * tonyb has nothing
19:03:24 #topic Zuul-launcher image builds
19:03:44 I think the big update here is that we did do the raxflex surgery to switch tenants and update networks (and their MTUs)
19:03:59 that happened for both zuul-launcher and nodepool and as far as I know is working
19:04:00 yay!
19:04:26 i know at least some nodes were provided by zuul-launcher yesterday
19:04:35 i need to check on a node failure though
19:04:47 that would imply things are mostly working I think. That's good
19:04:48 so let's say i should confirm that's actually working with zl
19:05:36 hrm, i think there may be a rax flex problem with zl
19:05:47 2025-03-10 17:26:56,062 ERROR zuul.Launcher: TypeError: create_server() requires either 'image' or 'boot_volume'
19:05:49 not sure what the deal is there
19:06:02 i'll follow up in #opendev tho
19:06:05 nodepool is using images to boot off of, not boot from volume, fwiw
19:06:12 ack. Anything else to bring up on this topic?
19:06:35 i think we're really close to being ready to start expanding the use of niz
19:06:48 like maybe one or two more features to add to the launcher
19:07:07 so it's probably worth asking how to proceed?
19:07:10 i'm thinking...
19:07:44 continue to add some more providers, and switch the zuul tenant over to use niz exclusively. do that without the new flavors though (so we use the old flavors -- 8gb everywhere)
19:07:56 i do have a change up to add rackspace flex dfw3
19:08:00 (basically, decouple the adding of new flavors from the switching to niz)
19:08:32 fungi: oh cool
19:08:42 note that sjc3 and dfw3 use different flavor names, so my change moves the label->flavor mapping up into the region layer
19:09:00 if we do what i suggest -- should we also dial down max-servers in nodepool just a little? or let them duke it out?
19:09:01 #link https://review.opendev.org/943104 Add the DFW3 region for Rackspace Flex
19:09:24 corvus: I suspect we can dial back max-servers by some small amount in nodepool
19:09:30 fungi: nice -- that's why we can do it in both places
19:09:43 yeah, i found that convenient
19:09:47 great design!
19:10:27 we tend to float under max-servers due to $cloudproblems anyway so we may find that eventually leads to them duking it out :)
19:10:28 we're definitely getting to the point where more images would be good to have :) we have everything needed for zuul i think, but it would be good to have other images that other tenants use
19:10:38 clarkb: good point
19:11:30 i think that's about it from me
19:11:54 thanks for the update. Adding more images and starting with the zuul tenant sounds great to me
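(Context: the max-servers cap and per-region flavor names discussed above live in nodepool's provider configuration. A minimal sketch of an OpenStack provider entry follows, with illustrative provider, cloud, image, and flavor names rather than OpenDev's actual values.)

    providers:
      - name: rax-flex-sjc3          # illustrative provider name
        cloud: raxflex               # clouds.yaml entry used for credentials
        region-name: SJC3
        diskimages:
          - name: ubuntu-noble       # image built and uploaded by nodepool-builder
        pools:
          - name: main
            max-servers: 32          # the cap to dial back as zuul-launcher takes on load
            labels:
              - name: ubuntu-noble
                diskimage: ubuntu-noble
                flavor-name: gp.0.4.8  # flavor names differ between regions like sjc3 and dfw3, hence the label->flavor remapping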
19:12:03 #topic Updating Flavors in OVH
19:12:11 related is the OVH flavor surgery that we've discussed with amorin
19:12:16 #link https://etherpad.opendev.org/p/ovh-flavors
19:12:23 the proposal is to do this Monday (March 17) ish
19:12:37 did amorin respond about scheduling?
19:12:54 i missed a bunch of scrollback due to travel
19:13:06 19:21:33 * amorin | corvus: I will check with the team tomorrow about march 17 and let you know.
19:13:09 this was the last message I saw
19:13:20 so I guess we're still waiting on final confirmation they will proceed then
19:13:25 cool, i did miss that. gtk.
19:13:34 probably worth a ping to check on the status, but no rush
19:13:53 ++
19:14:14 and ya otherwise it's mostly be aware of this as a thing we're trying to do. It should make working with ovh in zuul-launcher and nodepool nicer
19:14:24 more future proofed by getting off the bespoke scheduling system
19:16:33 #topic Running infra-prod Jobs in Parallel on Bridge
19:16:56 Yesterday we landed the change to increase our infra-prod jobs semaphore limit to 2. Since then we have been running our infra-prod jobs in parallel
19:17:07 yaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaay
19:17:23 \o/ #greatsuccess
19:17:33 now deploys are happening ~2x as fast
19:17:40 it's very nice!
19:17:43 I monitored deploy, opendev-prod-hourly, and periodic buildsets as well as general demand/load on bridge and thankfully no major issues came up
19:17:51 #link https://review.opendev.org/c/opendev/system-config/+/943999 A small logs fixup
19:18:02 this fixup addresses ansible logging on bridge that ianw noticed is off
19:18:21 #link https://review.opendev.org/c/opendev/system-config/+/943992 A small job ordering fixup
19:18:33 and this change makes puppet-else depend on letsencrypt because puppet deployed services do use LE certs
19:18:45 neither of which is urgent nor impacting functionality at the moment
19:19:15 it's important that puppet work so we can update our jenkins configuration :)
19:19:45 indeed
19:20:42 (amusingly, since change #1 is a puppet change, we can say 943998 changes later and we still have puppet)
19:20:42 other than those two changes I think the next step is considering increasing that semaphore limit to a larger number. Bridge has 8 vCPUs and load average was below 2 every time I checked. I think that means we can safely bump up to 4 or 5 in the semaphore
19:21:24 4 sgtm
19:21:40 I like 4 also
19:21:59 I was also hoping to land a few more regular system-config changes to see deploy do its thing but the one that merged today largely noop'd
19:22:07 fungi's change from yesterday did not noop and was a good exercise thankfully
19:22:26 ok maybe try to land a change or two to system-config later today and if that and periodic look good bump up to 4 tomorrow
19:23:10 once we have performance graphs we can check easily again, we might consider increasing it further
19:23:16 anything else on this topic? It was a long time coming but it's here and seems to be working so yay and thank you to everyone who helped make it happen
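(Context: the limit in question is a Zuul semaphore, so raising it from 2 to 4 is a one-line change to its max value. A minimal sketch follows; the semaphore and job names are assumptions, not necessarily OpenDev's exact configuration.)

    - semaphore:
        name: infra-prod-playbook    # assumed name; the real one may differ
        max: 2                       # bump to 4 to allow four concurrent infra-prod jobs

    - job:
        name: infra-prod-base        # assumed base job; jobs holding the semaphore run at most 'max' at once
        semaphores:
          - name: infra-prod-playbook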
19:24:37 #topic Upgrading old servers
19:24:54 As expected last week was consumed by infra-prod changes and the start of this week too
19:25:04 so I don't have anything new on this front. But it is the next big item on my todo list to tackle
19:25:09 tonyb: did you have any updates?
19:26:07 :( No.
19:26:16 #topic Sprinting to Upgrade Servers to Noble
19:26:26 so as mentioned I'm hoping to pick this up again starting tomorrow most likely
19:26:53 Help with reviews is always appreciated as is help bootstrapping replacement servers
19:26:58 #link https://etherpad.opendev.org/p/opendev-server-replacement-sprint
19:27:12 feel free to throw some notes in that etherpad and I'll try to get it updated from my side as I pick it up again
19:28:04 Then in related news I started brainstorming what the Gerrit replacement looks like. I think it is basically boot a new gerrit and volume, have system-config deploy it "empty" without replication configured. Sanity check things look right and functional. Then sync over current prod state and recheck operation. Then schedule a downtime to do a final sync and cut over dns
19:28:21 the main thing is that we don't want replication to run until the new server is in production to avoid overwriting content on gitea
19:28:37 if you can think of any other sync points that need to be handled carefully let me know
19:29:43 #topic Running certcheck on bridge
19:29:59 as mentioned last week I volunteered to look at running this within an infra-prod job that runs daily instead
19:30:15 I haven't done that yet but am hopeful I'll be able to look before next meeting just because I have so much infra-prod stuff paged in at this point
19:30:42 and with things running in parallel the wall clock time cost for that should be minimal
19:31:05 #topic Working through our TODO list
19:31:09 #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:31:28 this is just your weekly reminder to look at that list if you need good ideas for what to work on next. I think I can probably mark off infra-prod running in parallel now too
19:32:48 #topic Open Discussion
19:32:50 Anything else?
19:33:21 i didn't have anything
19:34:28 my kids are out on spring break week after next. I may try to pop out a day or two to do family things as a result. but we have nothing big planned so my guess is it wouldn't be anything crazy
19:34:31 Nothing from me this week
19:36:10 that was quick. I think we can end early today then
19:36:15 thank you everyone for your time and help
19:36:19 see you back here next week
19:36:22 #endmeeting