Tuesday, 2025-04-15

clarkbJust about meeting time18:59
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Apr 15 19:00:11 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/ICGFTZD6Y3OYOHTZWY4HN4VYLTN7R6BY/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbWe announced the Gerrit server upgrade for April 21, 2025 at 1600 UTC (we'll talk more about that later)19:00
clarkbanything else to announce? I guess the PTG is over as is the openstack release (mostly) so we're able to do more potentially impactful changes19:01
clarkbI pulled the meetpad servers out of the emergency file so they updated over the weekend and I've tested the service since19:03
clarkbif thats it I think we can dive into the agenda19:03
clarkb#topic Zuul-launcher image builds19:03
corvusa status update first, then i think we should make some decisions.19:03
corvusstatus: i'm rolling out fixes to the defects previously observed: vhd upload timeouts and leaked node cleanups.  very soon (today/tomorrow?) i'll be ready to proceed with the previously discussed/agreed move of opendev and zuul tenants to using only launcher nodes19:03
corvusthere's still an outstanding item before i'm ready to propose moving larger (openstack) tenants to the launcher: statsd output.  i expect to have that done and ready to move openstack, maybe about a month from now?19:03
corvuswhich leads us to the thing we should start making decisions about: i haven't seen any activity on getting the remaining images prepped.  specifically, i don't see any work on centos or arm images.19:04
corvuswe've been talking about this for many months now.  if someone here is going to volunteer to do that, i think now would be a good time to record that and an expected completion date.  if no one is interested in that work, then we should start making alternate plans now.19:04
clarkbThis is a good opportunity for someone who is interested but doesn't have a lot of time or doesn't have root, as it's largely contained to zuul jobs19:05
clarkbre arm64 image builds: they are much quicker than before after changes made to the cloud19:06
corvusin the case of no volunteers, i think it would be reasonable to send a message to the service-discuss list (and maybe someone should bounce it to openstack-discuss?) telling people that we need someone to take ownership of maintaining those images, otherwise, by mid-may they will become best effort and possibly start returning node errors, and at some later date (july?), they will be removed and will always return errors.19:06
clarkbso I'm far less worried about them being too slow or annoyingly slow19:06
corvusthat should give people adequate notice to step up and fix them before we remove them due to lack of maintenance.19:06
fungii can also communicate that need to the openstack tc, maybe they can find volunteers to add those resources since some projects in openstack are relying on them19:06
clarkbthat all seems reasonable to me19:07
corvusfungi: sounds good19:07
corvusshould we do those in parallel (email / tc) or one then the other?19:07
clarkbmaybe send to service-discuss first then we can point the openstack tc to that thread19:07
clarkb?19:07
fungiif someone else wants to send the notification to service-discuss, i can take on mentioning it on the openstack-discuss list and also to the tc at the same time19:08
corvusi'm happy to draft that email message19:08
corvuss/happy/willing/19:08
clarkbsounds like a plan19:08
corvus(i'm not happy about dropping images, i want someone to volunteer, and i would be happy to answer any questions they have)19:09
corvusokay, clarkb want to #action?19:09
clarkb#action corvus draft message to service-discuss asking for volunteers to port ci image builds otherwise images may become unmaintained19:09
clarkb#action fungi followup to corvus' message with the openstack TC to see if anyone in openstack relying on those images is interested19:10
fungiwill do19:10
clarkbanything else on this topic?19:10
corvusthanks, that's it from me19:10
clarkb#topic Container hygiene tasks19:10
clarkb#link https://review.opendev.org/q/topic:%22opendev-python3.12%22+status:open Update images to use python3.1219:10
clarkbthis has stalled out somewhat while I work on moving Gerrit servers (our next topic)19:11
clarkbbut this is still a topic whose changes could use another set of reviews if people have time. Though once Gerrit stuff is done I'll probably proceed with fungi's +2 if no one else reviews19:11
clarkbBasically not urgent but a good thing to flush out of our queue for general hygiene19:12
clarkb#topic Switching Gerrit to run on Review0319:12
clarkbAs mentioned at the start of the meeting we've announced that April 21 (Monday) at 16:00 we're taking an outage to swap the servers around19:12
clarkbThe new server is very similar to the old one (still BFV, still running in vexxhost ca-ymq-1, using the same sized server) but with new backing resources19:13
clarkbIt is enrolled in our inventory as a review-staging group member and I've synced data and started gerrit on it. This means you can go to https://review03.opendev.org and test it today. If you want to log in you need to update /etc/hosts so that review.opendev.org points at its IP(s)19:14
clarkbthe reason for needing /etc/hosts to login is the openid redirects use the review.opendev.org name19:14
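For anyone testing logins, a minimal /etc/hosts override could look like the lines below; the addresses are documentation placeholders, not review03's real IPs, so substitute whatever the server actually resolves to:

    # point review.opendev.org at review03 for local testing (placeholder addresses)
    203.0.113.10     review.opendev.org
    2001:db8::10     review.opendev.org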
clarkb#link https://etherpad.opendev.org/p/i_vt63v18c3RKX2VyCs3 review03 testing and service migration planning19:14
clarkbI've also been working on this document to track what I've done, what needs to be done, and the migration plan for Monday19:14
clarkbfeel free to leave comments on there if you have questions or concerns or testing shows something unexpected19:15
clarkbKeep in mind that anything you do to review03 between now and monday will be deleted when we switch servers19:15
clarkbwe aren't merging content; we will replace what is on 03 with what is on 02 during the outage19:15
clarkb#link https://review.opendev.org/c/opendev/system-config/+/947044 Prep work to ensure port 29418 ssh host keys match on old and new servers19:15
clarkbThis change is for something I discovered when spinning up the new server. Running init on the new server generated new ecdsa and ed25519 ssh host keys.19:16
clarkbThis change adds management of those host keys to ansible and I've updated private vars to match review02 so that review02 and review03 should have the same hostkeys19:16
clarkbwhen you review that change please double check the private vars content if you have time. That's probably the trickiest part of the change: mapping all the data properly19:16
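As a sanity check once that change lands, one way to confirm both servers present identical Gerrit SSH host keys is to compare ssh-keyscan output; a minimal sketch using standard OpenSSH tooling (hostnames and the 29418 port are taken from the discussion above):

    # fetch the ecdsa and ed25519 host keys from Gerrit's SSH port on each server
    ssh-keyscan -p 29418 -t ecdsa,ed25519 review02.opendev.org > /tmp/review02.keys
    ssh-keyscan -p 29418 -t ecdsa,ed25519 review03.opendev.org > /tmp/review03.keys
    # print fingerprints for each; matching fingerprints mean the keys are in sync
    ssh-keygen -lf /tmp/review02.keys
    ssh-keygen -lf /tmp/review03.keys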
clarkband finally I realized that since we manage DNS with gerrit changes, updating DNS during the outage is a bit tricky. I've documented the half manual path for getting that done on the etherpad. That may be something good to check for viability19:17
clarkbbasically we have to force merge the change to swap dns (because zuul won't be working yet) then we have to manually run the service-nameserver.yaml playbook (because zuul isn't running yet). Doable but different19:18
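A rough sketch of what that half manual path might look like; the force-merge mechanics, change number, and playbook path on bridge are assumptions here, so the etherpad remains the authoritative plan:

    # force-merge the DNS swap change (admin rights required; <change>,1 is a placeholder)
    ssh -p 29418 review.opendev.org gerrit review <change>,1 --submit
    # with zuul not deploying yet, run the nameserver playbook by hand on bridge
    # (path is an assumption based on the playbook name mentioned above)
    ansible-playbook /home/zuul/src/opendev.org/opendev/system-config/playbooks/service-nameserver.yaml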
fungiwe could in theory wait for the dns change to deploy and start the outage then19:18
fungiwhich could avoid additional manual steps unless we need to roll back19:18
clarkbfungi: possibly, one upside to using the path I describe is it will exercise replication for us19:18
clarkbwhich is something I can't easily test until we switch19:19
clarkbso it's a quick early indication of whether or not that fundamental feature is functional19:19
clarkbcorvus: I did want to ask if you have any concerns with leaving zuul up during the outage19:20
clarkbI think it will be fine and zuul will gracefully degrade so I haven't been planning on touching zuul. But let me know if that is a bad plan19:20
fungithough another downside of updating dns after we bring up the new server is that we can't have dns occurring in parallel with the rsync and startup19:20
fungican't have dns propagation happening in parallel i mean19:20
corvusclarkb: yeah i thought about that and agree that we should leave it up, but just monitor logs on the schedulers and look for how it handles the failover19:20
clarkbfungi: ya it would have to wait for gerrit to be up on 0319:21
fungiso deploying dns first could shorten the effective outage duration19:21
clarkbfungi: ya, by about 5 minutes I guess. I'll have to think about that alternative a bit more19:21
clarkbany questions or concerns about the preparation, timing, testing, etc?19:22
funginone from me19:22
clarkbI think my next goal is to login to review03 and make sure that functions and that I see the personal content I expect. Then maybe first thing tomorrow is a good time to land the ssh key management update if people can review before then19:23
clarkband then before Friday we land the dns ttl update to shorten to 5 minutes on the review.o.o record19:24
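For illustration, dropping the TTL is just a smaller per-record value in the zone file; a hypothetical BIND-style record (name, type, and address are placeholders, and the real zone is managed through gerrit changes as noted above):

    ; review record with a 5 minute (300 second) TTL ahead of the move
    review  300  IN  A     203.0.113.10
    review  300  IN  AAAA  2001:db8::10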
clarkbit just occurred to me that review.openstack.org is probably in DNS too and will need manual intervention19:24
clarkbI'll update the etherpad for ^ after the meeting19:24
clarkbfungi: ^ you will need to do that I think actually19:24
clarkbsince it is cloudflare now19:24
fungiyeah, can do, thanks for the reminder19:25
fungii can adjust that during the outage19:25
clarkbthanks!19:25
clarkb#topic Upgrading old servers19:26
clarkbToday I noticed that our noble nodes have a ~881MB /boot partition19:26
clarkbdue to historical reasons I have concerns about small /boot partitions but I did some cross checking against a personal machine that I had problems on and I think we're ok with that cloud image's setup19:27
clarkbthe initrds for kvm nodes are small enough that even with the smaller size I think we can fit ~twice as many kernels as my personal machine can do on a /boot that is 1.9GB large19:27
clarkbBut I wanted to call this out as a change from our existing older nodes that do not have a /boot partition19:27
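A quick way to check that estimate on a node is to see how much space each kernel plus initrd pair actually takes; a small sketch with standard tools (the ~881MB figure comes from the discussion above, everything else is generic):

    # size in MB of each installed kernel and initrd
    du -m /boot/vmlinuz-* /boot/initrd.img-* 2>/dev/null
    # current usage versus the /boot partition size
    df -h /boot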
fungiyeah, that's plenty of room19:28
clarkbany other server upgrade things to note?19:28
funginot from me19:29
clarkbya seems like not, we can continue19:29
clarkb#topic Running certcheck on bridge19:29
clarkbI was going to drop this from the agenda but I forgot so it's here :)19:29
fungino update19:29
fungiyeah, please do19:29
clarkbno updates from me. That said once Gerrit is sorted out I should have time to look at this19:29
clarkbso maybe it will go away temporarily. We'll see19:30
fungiwe can defer it to when we get prometheus done19:30
clarkbRelated to this the wiki cert expires in 2X days19:30
clarkband some consortium of cert groups just announced a plan to limit cert validity down to 47 days over the next several years19:30
clarkbMarch 2026 will be the end of getting a cert with one year of validity. So we might get one year now. Then another year in less than a year depending on how things go19:31
clarkbanyway thats all for later19:31
fungiwow, 47 days will be considerably less than let's encrypt does now even19:31
clarkbfungi: yup19:31
fungiby half19:32
clarkbfungi: the original argument was for 90 days but apple apparently convinced everyone to go for ~half that19:32
clarkbbut that won't take effect until 2029 or something19:32
clarkb#topic Working through our TODO list19:32
clarkb#link https://etherpad.opendev.org/p/opendev-january-2025-meetup19:32
clarkbjust a reminder that we have a higher level todo list on this etherpad19:32
clarkbit might want a better permanent home but until then feel free to pick up work off that list if you need something to do or want to get more involved in opendev19:33
clarkbI'm happy to answer any questions or concerns about the list and guide people on those tasks if they get picked up19:33
clarkband feel free to add things there as time goes on and we have new things19:33
clarkb#topic Rotating mailman 3 logs19:33
funginot done yet19:34
clarkbis there a held node or anything along those lines yet?19:34
fungino, not even a proposed change yet unless i did it and forgot19:34
clarkback19:34
clarkbwe've lived without it for a while but eventually that file is going to become too unwieldy so we should try and figure out something here19:35
clarkblet me know if I can help (I did discover the problem after all)19:35
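When someone does pick this up, one plausible shape is a logrotate snippet; the log path below is a guess rather than the real mailman 3 layout, so treat it purely as a sketch (copytruncate avoids having to signal the process inside the container):

    # hypothetical /etc/logrotate.d/mailman3 entry; the path is an assumption
    /var/lib/mailman/web-data/logs/*.log {
        weekly
        rotate 8
        compress
        missingok
        notifempty
        copytruncate
    }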
clarkb#topic Moving hound image to quay.io19:35
clarkb#link https://review.opendev.org/c/opendev/system-config/+/94701019:35
clarkbWe did this to lodgeit after running the service on noble. Codesearch has also been running on noble for some time and can use quay.io instead of docker too19:36
clarkbJust another step in the slow process of getting off of docker hub19:36
clarkbwe'll be able to do this with the gerrit images too once we're confident we aren't rolling back19:36
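The switch itself is typically just repointing the image reference in the compose file; a minimal sketch, with the quay.io namespace and tag shown as assumptions rather than confirmed values:

    # before: image pulled from Docker Hub
    #   image: docker.io/opendevorg/hound:latest
    # after: same image, served from quay.io (namespace and tag are assumptions)
    services:
      hound:
        image: quay.io/opendevorg/hound:latest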
clarkbin related news it seems like the docker hub issues are less prevalent. I think all of the little changes we've made to rely on docker hub less add up to more reliability which is nice19:37
clarkb(though the recent base python image updates did hit the issues pretty reliably but I think that was due to the number of images being updated at once)19:37
clarkb#topic Open Discussion19:38
clarkbAnything else?19:38
fungii've got nothing19:38
clarkbthe weather is going to be great all week here before raining on the weekend again. So don't be surprised if I pop out for a bike ride here and there (definitely going to try for this afternoon)19:39
corvus1 thing19:39
corvus* openEuler (x86 and Arm)... (full message at <https://matrix.org/oftc/media/v1/media/download/AUJLRSPeJHqFGHLgsbmtZfjle05809G8xfMA4kubBTrqB0se1-QBVrK6t_uAgY2M0OgPHkcXo8OnLmIOHKz0IMBCeWgxAFvQAG1hdHJpeC5vcmcva0NPaE9WQmFNVUhUaVB0a1VsdlV4SnNt>)19:39
corvusthat look okay in irc? i can make an alternative if not19:40
clarkbcorvus: it gave us irc folks a link to a paste essentially19:40
corvusoh sorry.  that was my alternative anyway.  :)19:40
clarkbopeneuler and gentoo are effectively dead right now because they stopped building and there wasn't sufficient pick up to fix them19:40
corvusokay, should i omit those from this email?19:41
clarkbso you might treat them differently in your email and note that they don't currently build as is and will need a larger effort for the interested parties if those still exist. Otherwise letting them die is a good thing19:41
clarkbthe others should all currently have working builds and are more straightforward to port from nodepool to zuul-launcher19:41
corvusokay, i'll try for that.  we can see how it looks in the etherpad.19:41
corvuscool, thx19:41
clarkbI'll keep things open until 19:45 and if there is nothing before then I think we can end 15 minutes early19:43
clarkbthanks everyone!19:45
fungithanks clarkb!19:45
clarkbwe'll be back here same time and location next week19:45
clarkb#endmeeting19:45
opendevmeetMeeting ended Tue Apr 15 19:45:13 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:45
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-04-15-19.00.html19:45
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-04-15-19.00.txt19:45
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-04-15-19.00.log.html19:45
