19:00:11 <clarkb> #startmeeting infra
19:00:11 <opendevmeet> Meeting started Tue Apr 15 19:00:11 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:11 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:11 <opendevmeet> The meeting name has been set to 'infra'
19:00:25 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/ICGFTZD6Y3OYOHTZWY4HN4VYLTN7R6BY/ Our Agenda
19:00:30 <clarkb> #topic Announcements
19:00:49 <clarkb> We announced the Gerrit server upgrade for April 21, 2025 at 1600 UTC (we'll talk more about that later)
19:01:56 <clarkb> anything else to announce? I guess the PTG is over as is the openstack release (mostly) so we're able to do more potentially impactful changes
19:03:15 <clarkb> I pulled the meetpad servers out of the emergency file so they updated over the weekend and I've tested the service since
19:03:28 <clarkb> if thats it I think we can dive into the agenda
19:03:31 <clarkb> #topic Zuul-launcher image builds
19:03:52 <corvus> a status update first, then i think we should make some decisions.
19:03:53 <corvus> status: i'm rolling out fixes to the defects previously observed: vhd upload timeouts and leaked node cleanups. very soon (today/tomorrow?) i'll be ready to proceed with the previously discussed/agreed move of opendev and zuul tenants to using only launcher nodes
19:03:57 <corvus> there's still an outstanding item before i'm ready to propose moving larger (openstack) tenants to the launcher: statsd output. i expect to have that done and ready to move openstack, maybe about a month from now?
19:04:04 <corvus> which leads us to the thing we should start making decisions about: i haven't seen any activity on getting the remaining images prepped. specifically, i don't see any work on centos or arm images.
19:04:09 <corvus> we've been talking about this for many months now. if someone here is going to volunteer to do that, i think now would be a good time to record that and an expected completion date. if no one is interested in that work, then we should start making alternate plans now.
19:05:01 <clarkb> This is a good opportunity for someone who is interested but doesn't have a lot of time or doesn't have root as it's largely contained to zuul jobs
19:06:16 <clarkb> re arm64 image builds they are much quicker than before after changes made to the cloud
19:06:21 <corvus> in the case of no volunteers, i think it would be reasonable to send a message to the service-discuss list (and maybe someone should bounce it to openstack-discuss?) telling people that we need someone to take ownership of maintaining those images, otherwise, by mid-may they will become best effort and possibly start returning node errors, and at some later date (july?), they will be removed and will always return errors.
19:06:24 <clarkb> so I'm far less worried about them being too slow or annoyingly slow
19:06:25 <corvus> that should give people adequate notice to step up and fix them before we remove them due to lack of maintenance.
19:06:45 <fungi> i can also communicate that need to the openstack tc, maybe they can find volunteers to add those resources since some projects in openstack are relying on them
19:07:11 <clarkb> that all seems reasonable to me
19:07:29 <corvus> fungi: sounds good
19:07:32 <corvus> should we do those in parallel (email / tc) or one then the other?
19:07:54 <clarkb> maybe send to service-discuss first then we can point the openstack tc to that thread
19:07:56 <clarkb> ?
19:08:15 <fungi> if someone else wants to send the notification to service-discuss, i can take on mentioning it on the openstack-discuss list and also to the tc at the same time
19:08:33 <corvus> i'm happy to draft that email message
19:08:37 <corvus> s/happy/willing/
19:08:52 <clarkb> sounds like a plan
19:09:07 <corvus> (i'm not happy about dropping images, i want someone to volunteer, and i would be happy to answer any questions they have)
19:09:22 <corvus> okay, clarkb want to #action?
19:09:54 <clarkb> #action corvus draft message to service-discuss asking for volunteers to port ci image builds otherwise images may become unmaintained
19:10:15 <clarkb> #action fungi followup to corvus' message with the openstack TC to see if anyone in openstack relying on those images is interested
19:10:24 <fungi> will do
19:10:31 <clarkb> anything else on this topic?
19:10:43 <corvus> thanks, that's it from me
19:10:52 <clarkb> #topic Container hygiene tasks
19:10:58 <clarkb> #link https://review.opendev.org/q/topic:%22opendev-python3.12%22+status:open Update images to use python3.12
19:11:09 <clarkb> this has stalled out somewhat while I work on moving Gerrit servers (our next topic)
19:11:46 <clarkb> but this is still a topic whose changes could use another set of reviews if people have time. Though once Gerrit stuff is done I'll probably proceed with fungi's +2 if no one else reviews
19:12:14 <clarkb> Basically not urgent but a good thing to flush out of our queue for general hygiene
19:12:22 <clarkb> #topic Switching Gerrit to run on Review03
19:12:53 <clarkb> As mentioned at the start of the meeting we've announced that April 21 (Monday) at 16:00 we're taking an outage to swap the servers around
19:13:26 <clarkb> The new server is very similar to the old one (still BFV, still running in vexxhost ca-ymq-1, using the same sized server) but with new backing resources
19:14:03 <clarkb> It is enrolled in our inventory as a review-staging group member and I've synced data and started gerrit on it. This means you can go to https://review03.opendev.org and test it today. If you want to log in you need to update /etc/hosts to set up review.opendev.org to point at its IP(s)
19:14:22 <clarkb> the reason for needing /etc/hosts to login is the openid redirects use the review.opendev.org name
19:14:45 <clarkb> #link https://etherpad.opendev.org/p/i_vt63v18c3RKX2VyCs3 review03 testing and service migration planning
19:14:58 <clarkb> I've also been working on this document to track what I've done, what needs to be done, and the migration plan for Monday
19:15:10 <clarkb> feel free to leave comments on there if you have questions or concerns or testing shows something unexpected
19:15:21 <clarkb> Keep in mind that anything you do to review03 between now and monday will be deleted when we switch servers
19:15:33 <clarkb> we aren't merging content we will replace what is on 03 with what is on 02 during the outage
19:15:45 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/947044 Prep work to ensure port 29418 ssh host keys match on old and new servers
19:16:04 <clarkb> This change is for something I discovered when spinning up the new server. Running init on the new server generated new ecdsa and ed25519 ssh host keys.
19:16:26 <clarkb> This change adds management of those host keys to ansible and I've updated private vars to match review02 so that review02 and review03 should have the same hostkeys
19:16:45 <clarkb> when you review that change please double check the private vars content if you have time. Thats probably the most tricky thing about the change, just mapping all the data properly
19:17:35 <clarkb> and finally I realized that since we manage DNS with gerrit changes updating DNS during the outage is a bit tricky. I've documented the half manual path for getting that done on the etherpad. That may be something good to check for viability
19:18:02 <clarkb> basically we have to force merge the change to swap dns (because zuul won't be working yet) then we have to manually run the service-nameserver.yaml playbook (because zuul isn't running yet). Doable but different
19:18:24 <fungi> we could in theory wait for the dns change to deploy and start the outage then
19:18:43 <fungi> which could avoid additional manual steps unless we need to roll back
19:18:50 <clarkb> fungi: possibly, one upside to using the path I describe is it will exercise replication for us
19:19:04 <clarkb> which is something I can't easily test until we switch
19:19:17 <clarkb> so its a quick early indication of whether or not that fundamental feature is functional
19:20:14 <clarkb> corvus: I did want to ask if you have any concerns with leaving zuul up during the outage
19:20:28 <clarkb> I think it will be fine and zuul will gracefully degrade so I haven't been planning on touching zuul. But let me know if that is a bad plan
19:20:30 <fungi> though another downside of updating dns after we bring up the new server is that we can't have dns occurring in parallel with the rsync and startup
19:20:54 <fungi> can't have dns propagation happening in parallel i mean
19:20:57 <corvus> clarkb: yeah i thought about that and agree that we should leave it up, but just monitor logs on the schedulers and look for how it handles the failover
19:21:04 <clarkb> fungi: ya it would have to wait for gerrit to be up on 03
19:21:33 <fungi> so deploying dns first could shorten the effective outage duration
19:21:59 <clarkb> fungi: ya by about 5 minutes I guess I'll have to think about that alternative a bit more
19:22:19 <clarkb> any questions or concerns about the preparation, timing, testing, etc?
19:22:52 <fungi> none from me
19:23:42 <clarkb> I think my next goal is to login to review03 and make sure that functions and I see my personal content that I expect. Then maybe first thing tomorrow is a good time to land the ssh key management update if people can review before then
19:24:10 <clarkb> and then before friday we land the dns ttl update to shorten to 5 minutes on the review.o.o record
19:24:25 <clarkb> it just occurred to me that review.openstack.org is probably in DNS too and will need manual intervention
19:24:33 <clarkb> I'll update the etherpad for ^ after the meeting
19:24:55 <clarkb> fungi: ^ you will need to do that I think actually
19:24:58 <clarkb> since it is cloudflare now
19:25:23 <fungi> yeah, can do, thanks for the reminder
19:25:31 <fungi> i can adjust that during the outage
19:25:45 <clarkb> thanks!
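[editor's note] For anyone testing review03 before Monday, here is a minimal sketch of the /etc/hosts override clarkb describes above. The addresses shown are placeholders (documentation ranges), not review03's real IPs; substitute the actual IPv4/IPv6 addresses and remove the entries once the migration is done.

    # /etc/hosts on your own workstation -- placeholder addresses, use review03's real IPs
    203.0.113.10    review.opendev.org
    2001:db8::10    review.opendev.org

With that override in place, the OpenID redirect back to review.opendev.org lands on review03 rather than the current production server, which is why login only works when the name is repointed.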
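[editor's note] A rough sketch of the half-manual DNS step clarkb outlines for the outage itself, assuming it runs from a system-config checkout on bridge; the checkout path and invocation below are assumptions for illustration only, and the migration etherpad remains the authoritative procedure.

    # after force-merging the DNS swap change in Gerrit (zuul is not deploying changes yet)
    # path and flags are assumed for illustration; follow the etherpad for the real commands
    cd /home/zuul/src/opendev.org/opendev/system-config
    ansible-playbook playbooks/service-nameserver.yaml

As discussed above, fungi's alternative of letting the DNS change deploy normally before starting the outage would avoid this manual step, at the cost of losing the early replication check clarkb mentions.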
19:26:04 <clarkb> #topic Upgrading old servers
19:26:33 <clarkb> Today I noticed that our noble nodes have a ~881MB /boot partition
19:27:06 <clarkb> due to historical reasons I have concerns about small /boot partitions but I did some cross checking against a personal machine that I had problems on and I think we're ok with that cloud image's setup
19:27:30 <clarkb> the initrds for kvm nodes are small enough that even with the smaller size I think we can fit ~twice as many kernels as my personal machine can do on a /boot that is 1.9GB large
19:27:46 <clarkb> But I wanted to call this out as a change from our existing older nodes that do not have a /boot partition
19:28:00 <fungi> yeah, that's plenty of room
19:28:14 <clarkb> any other server upgrade things to note?
19:29:30 <fungi> not from me
19:29:38 <clarkb> ya seems like not, we can continue
19:29:40 <clarkb> #topic Running certcheck on bridge
19:29:47 <clarkb> I was going to drop this from the agenda but I forgot so its here :)
19:29:50 <fungi> no update
19:29:58 <fungi> yeah, please do
19:29:59 <clarkb> no updates from me. That said once Gerrit is sorted out I should have time to look at this
19:30:08 <clarkb> so maybe it will go away temporarily. We'll see
19:30:14 <fungi> we can defer it to when we get prometheus done
19:30:19 <clarkb> Related to this the wiki cert expires in 2X days
19:30:53 <clarkb> and some consortium of cert groups just announced a plan to limit cert validity down to 47 days over the next several years
19:31:12 <clarkb> March 2026 will be the end of getting a cert with one year of validity. So we might get one year now. Then another year in less than a year depending on how things go
19:31:49 <clarkb> anyway thats all for later
19:31:51 <fungi> wow, 47 days will be considerably less than let's encrypt does now even
19:31:58 <clarkb> fungi: yup
19:32:01 <fungi> by half
19:32:13 <clarkb> fungi: the original argument was for 90 days but apple apparently convinced everyone to go for ~half that
19:32:24 <clarkb> but that won't take effect until 2029 or something
19:32:46 <clarkb> #topic Working through our TODO list
19:32:50 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:32:59 <clarkb> just a reminder that we have a higher level todo list on this etherpad
19:33:15 <clarkb> it might want a better permanent home but until then feel free to pick up work off that list if you need something to do or want to get more involved in opendev
19:33:30 <clarkb> I'm happy to answer any questions or concerns about the list and guide people on those tasks if they get picked up
19:33:47 <clarkb> and feel free to add things there as time goes on and we have new things
19:33:53 <clarkb> #topic Rotating mailman 3 logs
19:34:04 <fungi> not done yet
19:34:14 <clarkb> is there a held node or anything along those lines yet?
19:34:49 <fungi> no, not even a proposed change yet unless i did it and forgot
19:34:59 <clarkb> ack
19:35:18 <clarkb> we've lived without it for a while but eventually that file is going to become too unwieldy so we should try and figure out something here
19:35:27 <clarkb> let me know if I can help (I did discover the problem after all)
19:35:46 <clarkb> #topic Moving hound image to quay.io
19:35:51 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/947010
19:36:09 <clarkb> We did this to lodgeit after running the service on noble. Codesearch has also been running on noble for some time and can use quay.io instead of docker too
19:36:23 <clarkb> Just another step in the slow process of getting off of docker hub
19:36:33 <clarkb> we'll be able to do this with the gerrit images too once we're confident we aren't rolling back
19:37:29 <clarkb> in related news it seems like the docker hub issues are less prevalent. I think all of the little changes we've made to rely on docker hub less add up to more reliability which is nice
19:37:47 <clarkb> (though the recent base python image updates did hit the issues pretty reliably but I think that was due to the number of images being updated at once)
19:38:19 <clarkb> #topic Open Discussion
19:38:22 <clarkb> Anything else?
19:38:33 <fungi> i've got nothing
19:39:20 <clarkb> the weather is going to be great all week here before raining on the weekend again. So don't be surprised if I pop out for a bike ride here and there (definitely going to try for this afternoon)
19:39:31 <corvus> 1 thing
19:39:47 <corvus> * openEuler (x86 and Arm)... (full message at <https://matrix.org/oftc/media/v1/media/download/AUJLRSPeJHqFGHLgsbmtZfjle05809G8xfMA4kubBTrqB0se1-QBVrK6t_uAgY2M0OgPHkcXo8OnLmIOHKz0IMBCeWgxAFvQAG1hdHJpeC5vcmcva0NPaE9WQmFNVUhUaVB0a1VsdlV4SnNt>)
19:40:14 <corvus> that look okay in irc? i can make an alternative if not
19:40:27 <clarkb> corvus: it gave us irc folks a link to a paste essentially
19:40:41 <corvus> oh sorry. that was my alternative anyway. :)
19:40:47 <clarkb> openeuler and gentoo are effectively dead right now because they stopped building and there wasn't sufficient pick up to fix them
19:41:11 <corvus> okay, should i omit those from this email?
19:41:16 <clarkb> so you might treat them differently in your email and note that they don't currently build as is and will need a larger effort for the interested parties if those still exist. Otherwise letting them die is a good thing
19:41:36 <clarkb> the others should all currently have working builds and are more straightforward to port from nodepool to zuul-launcher
19:41:37 <corvus> okay, i'll try for that. we can see how it looks in the etherpad.
19:41:51 <corvus> cool, thx
19:43:12 <clarkb> I'll keep things open until 19:45 and if there is nothing before then I think we can end 15 minutes early
19:45:00 <clarkb> thanks everyone!
19:45:04 <fungi> thanks clarkb!
19:45:08 <clarkb> we'll be back here same time and location next week
19:45:13 <clarkb> #endmeeting