19:00:11 <clarkb> #startmeeting infra
19:00:11 <opendevmeet> Meeting started Tue Apr 15 19:00:11 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:11 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:11 <opendevmeet> The meeting name has been set to 'infra'
19:00:25 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/ICGFTZD6Y3OYOHTZWY4HN4VYLTN7R6BY/ Our Agenda
19:00:30 <clarkb> #topic Announcements
19:00:49 <clarkb> We announced the Gerrit server upgrade for April 21, 2025 at 1600 UTC (we'll talk more about that later)
19:01:56 <clarkb> anything else to announce? I guess the PTG is over as is the openstack release (mostly) so we're able to do more potentially impactful changes
19:03:15 <clarkb> I pulled the meetpad servers out of the emergency file so they updated over the weekend and I've tested the service since
19:03:28 <clarkb> if thats it I think we can dive into the agenda
19:03:31 <clarkb> #topic Zuul-launcher image builds
19:03:52 <corvus> a status update first, then i think we should make some decisions.
19:03:53 <corvus> status: i'm rolling out fixes to the defects previously observed: vhd upload timeouts and leaked node cleanups.  very soon (today/tomorrow?) i'll be ready to proceed with the previously discussed/agreed move of opendev and zuul tenants to using only launcher nodes
19:03:57 <corvus> there's still an outstanding item before i'm ready to propose moving larger (openstack) tenants to the launcher: statsd output.  i expect to have that done and ready to move openstack, maybe about a month from now?
19:04:04 <corvus> which leads us to the thing we should start making decisions about: i haven't seen any activity on getting the remaining images prepped.  specifically, i don't see any work on centos or arm images.
19:04:09 <corvus> we've been talking about this for many months now.  if someone here is going to volunteer to do that, i think now would be a good time to record that and an expected completion date.  if no one is interested in that work, then we should start making alternate plans now.
19:05:01 <clarkb> This is a good opportunity for someone who is interested but doesn't have a lot of time or doesn't have root as its largely contained to zuul jobs
19:06:16 <clarkb> re arm64 image builds: they are much quicker than before after the changes made to the cloud
19:06:21 <corvus> in the case of no volunteers, i think it would be reasonable to send a message to the service-discuss list (and maybe someone should bounce it to openstack-discuss?) telling people that we need someone to take ownership of maintaining those images, otherwise, by mid-may they will become best effort and possibly start returning node errors, and at some later date (july?), they will be removed and will always return errors.
19:06:24 <clarkb> so I'm far less worried about them being too slow or annoyingly slow
19:06:25 <corvus> that should give people adequate notice to step up and fix them before we remove them due to lack of maintenance.
19:06:45 <fungi> i can also communicate that need to the openstack tc, maybe they can find volunteers to add those resources since some projects in openstack are relying on them
19:07:11 <clarkb> that all seems reasonable to me
19:07:29 <corvus> fungi: sounds good
19:07:32 <corvus> should we do those in parallel (email / tc) or one then the other?
19:07:54 <clarkb> maybe send to service-discuss first then we can point the openstack tc to that thread
19:07:56 <clarkb> ?
19:08:15 <fungi> if someone else wants to send the notification to service-discuss, i can take on mentioning it on the openstack-discuss list and also to the tc at the same time
19:08:33 <corvus> i'm happy to draft that email message
19:08:37 <corvus> s/happy/willing/
19:08:52 <clarkb> sounds like a plan
19:09:07 <corvus> (i'm not happy about dropping images, i want someone to volunteer, and i would be happy to answer any questions they have)
19:09:22 <corvus> okay, clarkb want to #action?
19:09:54 <clarkb> #action corvus draft message to service-discuss asking for volunteers to port ci image builds otherwise images may become unmaintained
19:10:15 <clarkb> #action fungi followup to corvus' message with the openstack TC to see if anyone in openstack relying on those images is interested
19:10:24 <fungi> will do
19:10:31 <clarkb> anything else on this topic?
19:10:43 <corvus> thanks, that's it from me
19:10:52 <clarkb> #topic Container hygiene tasks
19:10:58 <clarkb> #link https://review.opendev.org/q/topic:%22opendev-python3.12%22+status:open Update images to use python3.12
19:11:09 <clarkb> this has stalled out somewhat while I work on moving Gerrit servers (our next topic)
19:11:46 <clarkb> but this is still a topic whose changes could use another set of reviews if people have time. Though once Gerrit stuff is done I'll probably proceed with fungi's +2 if no one else reviews
19:12:14 <clarkb> Basically not urgent but a good thing to flush out of our queue for general hygiene
19:12:22 <clarkb> #topic Switching Gerrit to run on Review03
19:12:53 <clarkb> As mentioned at the start of the meeting we've announced that April 21 (Monday) at 16:00 we're taking an outage to swap the servers around
19:13:26 <clarkb> The new server is very similar to the old one (still BFV, still running in vexxhost ca-ymq-1, using the same sized server) but with new backing resources
19:14:03 <clarkb> It is enrolled in our inventory as a review-staging group member and I've synced data and started gerrit on it. This means you can go to https://review03.opendev.org and test it today. If you want to log in you need to update /etc/hosts to point review.opendev.org at its IP(s)
19:14:22 <clarkb> the reason for needing /etc/hosts to login is the openid redirects use the review.opendev.org name
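For anyone following along, a minimal sketch of that /etc/hosts override; the addresses below are documentation-range placeholders, not review03's real IPs:

    # point review.opendev.org at review03 so the OpenID redirect lands on
    # the new server (substitute review03's actual IPv4/IPv6 addresses)
    echo "203.0.113.10  review.opendev.org" | sudo tee -a /etc/hosts
    echo "2001:db8::10  review.opendev.org" | sudo tee -a /etc/hosts
    # remove these lines again once testing is finished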
19:14:45 <clarkb> #link https://etherpad.opendev.org/p/i_vt63v18c3RKX2VyCs3 review03 testing and service migration planning
19:14:58 <clarkb> I've also been working on this document to track what I've done, what needs to be done, and the migration plan for Monday
19:15:10 <clarkb> feel free to leave comments on there if you have questions or concerns or testing shows something unexpected
19:15:21 <clarkb> Keep in mind that anything you do to review03 between now and Monday will be deleted when we switch servers
19:15:33 <clarkb> we aren't merging content; we will replace what is on 03 with what is on 02 during the outage
19:15:45 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/947044 Prep work to ensure port 29418 ssh host keys match on old and new servers
19:16:04 <clarkb> This change is for something I discovered when spinning up the new server. Running init on the new server generated new ecdsa and ed25519 ssh host keys.
19:16:26 <clarkb> This change adds management of those host keys to ansible and I've updated private vars to match review02 so that review02 and review03 should have the same hostkeys
19:16:45 <clarkb> when you review that change please double check the private vars content if you have time. That's probably the trickiest thing about the change: mapping all the data properly
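As a quick cross-check once that change is in place, something like the following (standard OpenSSH tooling, nothing opendev-specific) should show identical fingerprints for both servers:

    # fetch the port-29418 host keys from each server and print fingerprints;
    # after the ansible-managed keys deploy these should match exactly
    ssh-keyscan -p 29418 -t ed25519,ecdsa review02.opendev.org 2>/dev/null | ssh-keygen -lf -
    ssh-keyscan -p 29418 -t ed25519,ecdsa review03.opendev.org 2>/dev/null | ssh-keygen -lf -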
19:17:35 <clarkb> and finally I realized that since we manage DNS with gerrit changes updating DNS during the outage is a bit tricky. I've documented the half manual path for getting that done on the etherpad. That may be something good to check for viability
19:18:02 <clarkb> basically we have to force merge the change to swap dns (because zuul won't be working yet) then we have to manually run the service-nameserver.yaml playbook (because zuul isn't running yet). Doable but different
19:18:24 <fungi> we could in theory wait for the dns change to deploy and start the outage then
19:18:43 <fungi> which could avoid additional manual steps unless we need to roll back
19:18:50 <clarkb> fungi: possibly, one upside to using the path I describe is it will exercise replication for us
19:19:04 <clarkb> which is something I can't easily test until we switch
19:19:17 <clarkb> so its a quick early indication of whether or not that fundamental feature is functional
19:20:14 <clarkb> corvus: I did want to ask if you have any concerns with leaving zuul up during the outage
19:20:28 <clarkb> I think it will be fine and zuul will gracefully degrade so I haven't been planning on touching zuul. But let me know if that is a bad plan
19:20:30 <fungi> though another downside of updating dns after we bring up the new server is that we can't have dns occurring in parallel with the rsync and startup
19:20:54 <fungi> can't have dns propagation happening in parallel i mean
19:20:57 <corvus> clarkb: yeah i thought about that and agree that we should leave it up, but just monitor logs on the schedulers and look for how it handles the failover
19:21:04 <clarkb> fungi: ya it would have to wait for gerrit to be up on 03
19:21:33 <fungi> so deploying dns first could shorten the effective outage duration
19:21:59 <clarkb> fungi: ya, by about 5 minutes I guess. I'll have to think about that alternative a bit more
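For reference, the half-manual DNS path described above boils down to something like this; the checkout location and any inventory flags are assumptions about how bridge is laid out, not verified details:

    # after force-merging the DNS swap change (zuul can't gate or deploy it yet),
    # run the nameserver playbook by hand from the system-config checkout on bridge
    cd ~/src/opendev.org/opendev/system-config   # assumed checkout location
    ansible-playbook playbooks/service-nameserver.yaml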
19:22:19 <clarkb> any questions or concerns about the preparation, timing, testing, etc?
19:22:52 <fungi> none from me
19:23:42 <clarkb> I think my next goal is to login to review03 and make sure that functions and I see my personal content that I expect. Then maybe first thing tomorrow is a good time to land the ssh key management update if people can review before then
19:24:10 <clarkb> and then before Friday we land the dns ttl update to shorten the review.o.o record's TTL to 5 minutes
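Once that TTL change deploys, a quick check that the shorter TTL is actually being served might look like this; the nameserver hostname is an assumption, use whichever authoritative opendev server is current:

    # the second column of the answer is the TTL; it should read 300 (5 minutes)
    dig +noall +answer review.opendev.org @ns1.opendev.org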
19:24:25 <clarkb> it just occurred to me that review.openstack.org is probably in DNS too and will need manual intervention
19:24:33 <clarkb> I'll update the etherpad for ^ after the meeting
19:24:55 <clarkb> fungi: ^ you will need to do that I think actually
19:24:58 <clarkb> since it is cloudflare now
19:25:23 <fungi> yeah, can do, thanks for the reminder
19:25:31 <fungi> i can adjust that during the outage
19:25:45 <clarkb> thanks!
19:26:04 <clarkb> #topic Upgrading old servers
19:26:33 <clarkb> Today I noticed that our noble nodes have a ~881MB /boot partition
19:27:06 <clarkb> due to historical reasons I have concerns about small /boot partitions but I did some cross checking against a personal machine that I had problems on and I think we're ok with that cloud image's setup
19:27:30 <clarkb> the initrds for kvm nodes are small enough that even with the smaller partition I think we can fit roughly twice as many kernels as my personal machine can on its 1.9GB /boot
19:27:46 <clarkb> But I wanted to call this out as a change from our existing older nodes that do not have a /boot partition
19:28:00 <fungi> yeah, that's plenty of room
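For the curious, a rough way to gauge that headroom on a noble node with standard tooling:

    # per-kernel footprint versus free space in /boot; dividing the free space
    # by the vmlinuz+initrd total gives a rough count of how many kernels fit
    du -ch /boot/vmlinuz-* /boot/initrd.img-*
    df -h /boot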
19:28:14 <clarkb> any other server upgrade things to note?
19:29:30 <fungi> not from me
19:29:38 <clarkb> ya seems like not, we can continue
19:29:40 <clarkb> #topic Running certcheck on bridge
19:29:47 <clarkb> I was going to drop this from the agenda but I forgot so its here :)
19:29:50 <fungi> no update
19:29:58 <fungi> yeah, please do
19:29:59 <clarkb> no updates from me. That said once Gerrit is sorted out I should have time to look at this
19:30:08 <clarkb> so maybe it will go away temporarily. We'll see
19:30:14 <fungi> we can defer it to when we get prometheus done
19:30:19 <clarkb> Related to this the wiki cert expires in 2X days
19:30:53 <clarkb> and some consortium of cert groups just announced a plan to limit cert validity down to 47 days over the next several years
19:31:12 <clarkb> March 2026 will be the end of getting a cert with one year of validity. So we might get one year now, then another year in less than a year depending on how things go
19:31:49 <clarkb> anyway thats all for later
19:31:51 <fungi> wow, 47 days will be considerably less than let's encrypt does now even
19:31:58 <clarkb> fungi: yup
19:32:01 <fungi> by half
19:32:13 <clarkb> fungi: the original argument was for 90 days but apple apparently convinced everyone to go for ~half that
19:32:24 <clarkb> but that won't take effect until 2029 or something
19:32:46 <clarkb> #topic Working through our TODO list
19:32:50 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:32:59 <clarkb> just a reminder that we have a higher level todo list on this etherpad
19:33:15 <clarkb> it might want a better permanent home but until then feel free to pick up work off that list if you need something to do or want to get more involved in opendev
19:33:30 <clarkb> I'm happy to answer any questions or concerns about the list and guide people on those tasks if they get picked up
19:33:47 <clarkb> and feel free to add things there as time goes on and we have new things
19:33:53 <clarkb> #topic Rotating mailman 3 logs
19:34:04 <fungi> not done yet
19:34:14 <clarkb> is there a held node or anything along those lines yet?
19:34:49 <fungi> no, not even a proposed change yet unless i did it and forgot
19:34:59 <clarkb> ack
19:35:18 <clarkb> we've lived without it for a while but eventually that file is going to become too unwieldy so we should try and figure out something here
19:35:27 <clarkb> let me know if I can help (I did discover the problem after all)
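If it helps whoever picks this up, a minimal logrotate sketch; the log path, rotation schedule, and whether copytruncate is safe for how mailman writes the file are all assumptions to confirm against the actual deployment:

    # hypothetical drop-in; adjust the glob to wherever the mailman3
    # containers actually write the offending log before using this
    sudo tee /etc/logrotate.d/mailman3 <<'EOF' >/dev/null
    /var/lib/mailman/*/logs/*.log {
        weekly
        rotate 8
        compress
        missingok
        copytruncate
    }
    EOF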
19:35:46 <clarkb> #topic Moving hound image to quay.io
19:35:51 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/947010
19:36:09 <clarkb> We did this to lodgeit after running the service on noble. Codesearch has also been running on noble for some time and can use quay.io instead of docker too
19:36:23 <clarkb> Just another step in the slow process of getting off of docker hub
19:36:33 <clarkb> we'll be able to do this with the gerrit images too once we're confident we aren't rolling back
19:37:29 <clarkb> in related news it seems like the docker hub issues are less prevalent. I think all of the little changes we've made to rely on docker hub less add up to more reliability which is nice
19:37:47 <clarkb> (though the recent base python image updates did hit the issues pretty reliably but I think that was due to the number of images being updated at once)
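Illustratively (this is not the literal diff in 947010), moving a service image off Docker Hub mostly means repointing the pull reference and the publish job; whether codesearch uses these exact image names is an assumption:

    # before (illustrative): image: docker.io/opendevorg/hound:latest
    # after  (illustrative): image: quay.io/opendevorg/hound:latest
    # quick scan for leftover Docker Hub references in a system-config checkout:
    grep -rn 'docker.io' playbooks/ docker/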
19:38:19 <clarkb> #topic Open Discussion
19:38:22 <clarkb> Anything else?
19:38:33 <fungi> i've got nothing
19:39:20 <clarkb> the weather is going to be great all week here before raining on the weekend again. So don't be surprised if I pop out for a bike ride here and there (definitely going to try for this afternoon)
19:39:31 <corvus> 1 thing
19:39:47 <corvus> * openEuler (x86 and Arm)... (full message at <https://matrix.org/oftc/media/v1/media/download/AUJLRSPeJHqFGHLgsbmtZfjle05809G8xfMA4kubBTrqB0se1-QBVrK6t_uAgY2M0OgPHkcXo8OnLmIOHKz0IMBCeWgxAFvQAG1hdHJpeC5vcmcva0NPaE9WQmFNVUhUaVB0a1VsdlV4SnNt>)
19:40:14 <corvus> that look okay in irc? i can make an alternative if not
19:40:27 <clarkb> corvus: it gave us irc folks a link to a paste essentially
19:40:41 <corvus> oh sorry.  that was my alternative anyway.  :)
19:40:47 <clarkb> openeuler and gentoo are effectively dead right now because they stopped building and there wasn't sufficient pick up to fix them
19:41:11 <corvus> okay, should i omit those from this email?
19:41:16 <clarkb> so you might treat them differently in your email and note that they don't currently build as-is and will need a larger effort from interested parties, if any still exist. Otherwise letting them die is a good thing
19:41:36 <clarkb> the others should all currently have working builds and are more straightforward to port from nodepool to zuul-launcher
19:41:37 <corvus> okay, i'll try for that.  we can see how it looks in the etherpad.
19:41:51 <corvus> cool, thx
19:43:12 <clarkb> I'll keep things open until 19:45 and if there is nothing before then I think we can end 15 minutes early
19:45:00 <clarkb> thanks everyone!
19:45:04 <fungi> thanks clarkb!
19:45:08 <clarkb> we'll be back here same time and location next week
19:45:13 <clarkb> #endmeeting