19:01:56 #startmeeting infra
19:01:56 Meeting started Tue Feb 4 19:01:56 2025 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:56 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:56 The meeting name has been set to 'infra'
19:02:30 #link as always, the agenda is at https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:02:50 #topic Announcements
19:03:13 i don't have any announcements in mind, did anyone have anything that needs mentioning?
19:03:18 not me
19:03:45 i'll skip over the actions and specs review sections since they seem to be similarly empty
19:04:00 #topic Zuul-launcher image builds (corvus)
19:04:27 the agenda mentions that zuul-launcher related configs (images, labels, providers, etc) have been refactored into a zuul-providers repo
19:04:38 #link https://opendev.org/opendev/zuul-providers
19:04:53 also that Zuul itself is starting to explore dogfooding of the new launcher-managed images in jobs zuul runs
19:05:21 i reviewed some of those changes, but don't really have any updates myself
19:05:26 i was hoping we could keep the jobs separate, but they have to live together with the image definitions; so the idea is we'll (eventually) put that repo in all the tenants, but only load jobs/projects in the opendev tenant.
19:06:04 anyway, that's almost exactly what was in opendev/zuul-jobs before, so anyone working on image jobs can just retarget that repo with no changes
19:06:24 is the colocation of build jobs and image definitions a security requirement?
19:06:30 yep
19:06:42 sounds good to me
19:07:14 any next steps? what should we be reviewing/changing now?
19:07:19 i've run a job using a node on the launcher, including the new nodescan functionality
19:07:41 i think next i'll probably propose something like having zuul run its unit test jobs on new-style nodes
19:07:47 so we can get a little more volume on it
19:08:21 cool, so we can do a piecemeal migration from nodepool-managed labels to zuul-launcher labels?
19:08:41 yep (eventually; i wouldn't say we're ready for that yet)
19:08:54 i think we want to try to catch some more bugs first :)
19:09:00 sure, i just meant as a future low-impact migration plan
19:09:06 yep definitely
19:09:13 we're not stuck with a big-bang (much)
19:09:31 i think we will have a sliding scale, and we can slide it backwards any time when the time comes
19:09:57 great! anything else you want to note about this?
19:10:03 that's it
19:10:38 in that case let's move on to the next topic (unless anyone has questions, feel free to interrupt whenever)
19:10:53 #topic Unpinning our Grafana deployment (clarkb)
19:11:04 seems like major progress was made on this
19:11:28 ya this went more quickly than I anticipated
19:11:36 the changes linked in the agenda are merged
19:11:47 we're running a noble grafana02 in production on the latest grafana 10 release as of an hour ago or so
19:12:04 the next steps will be cleaning up the old server once we're happy with this new one
19:12:12 and then figuring out an upgrade to grafana 11
19:12:21 yeah, i was about to ask
19:12:29 I guess be on the lookout for cleanup changes as I'll try to get those up soon
19:12:43 so probably similar process but in-place now that we're on a noble server?
19:12:46 ya
19:12:55 the 11 upgrade will probably require updates to grafyaml or our graph definitions though due to angular being deprecated in 10
19:13:14 it looks like 11 may have a toggle to re-enable angular if it comes to that too but I think we should try and move our graphs away from angular first
19:13:22 so hold an 11.x test node, see if it works, then let the upgrade deploy if no obvious issues are identified
19:13:38 yup, along with debugging of the angular deprecation on that held node
19:13:44 oh, got it, we probably need adjustments to grafyaml
19:14:46 anything else we should know? or anyone have questions?
19:14:55 not from me
19:15:25 great, thanks for working on this!
19:15:37 #topic Upgrading Old Servers (clarkb)
19:15:53 mostly this is a catch-all for updates around this.
19:16:02 We did update launch node to error if we detect fewer than 2 cpus
19:16:06 tonyb was working on the wiki, i might have missed updates if there were any
19:16:16 ya not sure if tonyb has any updates for wiki specifically
19:16:38 and we've still got cacti, storyboard, and translate as well
19:17:21 seems like probably nothing to cover here for now, aside from the change to detect problem instances in rax xen
19:17:31 ++
19:17:38 changes i guess, with the typo correction
19:17:53 #topic Sprinting to Upgrade Servers to Focal (clarkb)
19:18:00 related to the previous topic
19:18:20 this is an idea I had that came out of doing the paste and grafana server replacements. The work itself is often fairly straightforward, with most issues caught in CI before we deploy anything
19:18:37 then the major time sink is waiting for reviews on the various changes to update dns, add to inventory, reupdate dns, etc
19:19:14 i'll note that the topic is probably a typo
19:19:22 i guess you meant upgrade to noble
19:19:22 oh yes
19:19:24 sorry
19:19:29 no probs
19:19:57 so basically I was wondering if others would be willing to focus on this next week so that we can try and speed the process up and get some of the lower-hanging fruit done
19:20:05 i should have read the notes under it before i set the topic ;)
19:20:18 i'm around next week for a sprint. did you have a particular day or days in mind?
19:20:21 part of my end state goal I'm hoping for is general confidence in noble and podman before we replace the gerrit server
19:20:43 that would certainly be good to have
19:20:45 no particular days, probably start monday and end when we're tired of working on this specific set of tasks
19:21:05 and just ask people to try and help replace servers as well as review changes to replace servers
19:21:24 to be clear, this is essentially a blocker to moving our images off dockerhub to quay, if we want to retain speculative testing of images, right?
19:21:43 yes
19:21:52 there are many reasons to do it
19:21:52 just making sure i've got the motivation stated
19:22:05 better ci, less docker hub, less old ubuntu
19:22:18 sure, upgrades are a good idea regardless, but at least to me that's the big carrot
19:22:28 dockerhub equals pain
19:23:34 I think most of the platform specific gotchas have been addressed at this point. Now we just need to do the uplift, hence the ask for focused time on it
19:23:48 i also feel like we can be flexible about reviews on essentially "rote" changes... like if there's nothing too novel about an upgrade, we've all agreed it's a good idea and it's probably okay to push that through with minimal review
19:23:58 I'm happy with that too
19:24:12 and if something novel comes up, flag it for more discussion
19:24:18 I like that
19:24:23 yeah, i have been doing mostly single-core approvals on those if they come from another of our sysadmins and i plan to be around to keep an eye on things
19:24:41 so if folks are interested in making a solid dent in the random dockerhub rate limit failures for our jobs, let's try to move a bunch of stuff to new enough ubuntu next week
19:24:52 ++
19:25:47 anything else we want to do right now for planning on this? or questions/concerns?
19:26:03 nope, later this week I'll try and put up a todo list that people can pick off
19:26:05 thanks!
19:27:18 that would be great
19:27:54 #topic Switch to quay.io/opendevmirror images where possible (clarkb)
19:28:04 seems like we have a logical progression in topics
19:28:13 I tried to order them that way :)
19:28:23 prescient
19:28:30 made progress on this last week but still have gerrit, zuul db, and I think one other to do
19:28:45 corvus: any concern with just doing this for zuul or should it be coordinated to minimize the loss of build records?
19:29:14 zuul-db only really affects zuul-web services right? or will it cause reporting failures while it's down? i can't remember now
19:29:29 it could cause reporting failures
19:29:33 and yeah, we'd presumably lose some builds between the cracks
19:29:34 it will affect the record keeping of jobs that finish while the db restarts
19:29:34 but it will also retry
19:29:38 so if we do it fast enough it may be ok
19:29:56 so we could probably get by without pausing the whole system
19:30:00 I think it is a relatively quick but not instantaneous restart. On the order of 15-30 seconds?
19:30:16 mostly in mariadb startup costs
19:30:19 yeah... given we're not doing a release or anything, i'd say roll the dice :)
19:30:29 i'm willing to do it early or late in my hours, or on a weekend, to minimize impact
19:30:29 wfm thanks
19:30:47 (i mean, technically, this could happen any time if a hypervisor hiccups)
19:30:58 good point
19:31:09 you make a really good point, we still have making the db ha as an outstanding task
19:31:28 and we've considered the risk low
19:31:44 so maybe just ~whenever (within reason)
19:32:21 we can probably knock it out later this week in that case
19:32:28 ++
19:32:29 ++
19:32:45 any other points that bear raising on this?
19:32:53 not from me
19:33:17 #topic Running certcheck on bridge (fungi)
19:33:33 it said "clarkb" on the agenda but it's really me at this point
19:33:40 and i haven't gotten to it yet, but this is a good reminder
19:34:13 i don't really have anything to add, other than to note that i'm holding up one of the things we could move off the old cacti server
19:34:25 and i should really find a few minutes to get to it
19:34:37 #topic Service Coordinator Election (clarkb)
19:34:45 congratulations! no, wait, too soon
19:34:47 This is a reminder that the nomination period opens today
19:35:13 I'm happy to answer questions if there is interest in someone else running
19:35:34 as am i, and all our previous leaders
19:35:50 (i'm speaking on their behalf. we'll get mordred back here yet)
19:36:32 ha
19:36:57 it's totally rewarding, and nothing like whitewashing this here picket fence
19:37:42 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/NGS2APEFQB45OCJCQ645P5N6XCH52BXW/ February 2025 OpenDev Service Coordinator Election
19:38:08 #topic Working through our TODO list (clarkb)
19:38:28 And this is a reminder that we have a todo list that came out of the meetup last month
19:38:50 I've been trying to use it to inform what I poke at (the first few items are directly related to launch node updates and grafana server replacement)
19:38:58 #link https://etherpad.opendev.org/p/r.cb73d0388959699f27a517446dabaa71 2025q1 meetup notes
19:39:05 if you're lacking things to do feel free to take a look there and dive in
19:39:39 an excellent reminder
19:39:55 any specific items you want to call out as priorities from there?
19:40:01 not really
19:40:16 let's get cracking!
19:40:29 #topic Open discussion
19:41:01 freeform poetry is welcome, or whatever you feel appropriate
19:42:00 vogon poetry?
19:42:06 I'm going to step out now and try to get this headache under control. thanks everyone!
19:42:11 thy micturations are to me...
19:42:27 feel better!
19:42:43 these services aren't going to coordinate themselves, after all
19:42:50 ++
19:43:26 in that case, enjoy the remaining 15-20 minutes for your preferred pastimes
19:43:34 thanks everyone!
19:43:55 #endmeeting
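
Editor's note: for readers unfamiliar with the launch-node change mentioned under the Upgrading Old Servers topic (erroring out when fewer than 2 CPUs are detected, to catch problem instances such as the misbehaving rax xen hosts), the following is a minimal illustrative sketch in Python of that style of guard. It is not the actual opendev launch-node code; the MIN_CPUS name and the local-host check are assumptions for illustration only, and in practice the check would be made against the newly booted server rather than the machine running the script.

    import os
    import sys

    # Assumed threshold from the meeting discussion: refuse to proceed on
    # instances that report fewer than 2 CPUs.
    MIN_CPUS = 2

    # os.cpu_count() may return None on some platforms, so fall back to 0.
    detected = os.cpu_count() or 0

    if detected < MIN_CPUS:
        # Abort before the instance is added to the inventory or DNS,
        # so a bad server never makes it into production.
        sys.exit("Only %d CPU(s) detected; expected at least %d. "
                 "Aborting server launch." % (detected, MIN_CPUS))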