19:00:09 <clarkb> #startmeeting infra
19:00:09 <opendevmeet> Meeting started Tue Mar 4 19:00:09 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:09 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:09 <opendevmeet> The meeting name has been set to 'infra'
19:00:16 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/5FMGNRCSUJQZYAOHNLLXQI4QT7LIG4C7/ Our Agenda
19:00:22 <clarkb> #topic Announcements
19:01:20 <clarkb> PTG preparation is underway. I wasn't planning on participating as OpenDev. Just wanted to make that clear. I think we've found more value in our out of band meetups
19:01:34 <clarkb> and that gives each of us the ability to attend other PTG activities without additional distraction
19:01:52 <clarkb> Anything else to add?
19:03:38 <clarkb> Sounds like no. We can continue
19:03:44 <clarkb> #topic Redeploying raxflex resources
19:04:04 <clarkb> As mentioned previously we need to redeploy raxflex resources to take advantage of better network MTUs and to have tenant alignment across regions
19:04:27 <clarkb> this work is underway with new mirrors being booted in the new regions. DNS has been updated for sjc3 test nodes to talk to that mirror in the other tenant
19:04:59 <clarkb> We did discover that floating ips are intended to be mandatory for externally routable IPs and the MTUs we got by default were quite large so fungi manually reduced them to 1500 at the neutron network level
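For context, capping a tenant network's MTU at the neutron level is a one-attribute network update. A minimal sketch with openstacksdk follows; the cloud and network names are placeholders rather than the real raxflex values, and fungi may just as easily have done this with the openstack CLI.

```python
import openstack

# Placeholders: substitute the real clouds.yaml entry and network name.
conn = openstack.connect(cloud="raxflex-sjc3")
network = conn.network.find_network("opendev-ci-net")
if network and network.mtu != 1500:
    # Reduce the provider-assigned MTU to a safe 1500 bytes.
    conn.network.update_network(network, mtu=1500)
    print(f"capped MTU on {network.name} (was {network.mtu})")
```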
19:04:59 <corvus> i think i'm just assuming fungi will approve all the changes to switch when he has bandwidth
19:05:08 <clarkb> #link https://review.opendev.org/q/hashtag:flex-dfw3
19:05:33 <clarkb> and ya fungi has pushed a number of changes that need to be stepped through in a particular order to effect the migration. That link should contain all of them though I suspect many/most/all have sufficient reviews at this point
19:05:58 <corvus> i don't think any zuul-launcher coordination is necessary; so i say go when ready
19:06:09 <fungi> sorry, trying to switch gears, openstack tc meeting is just wrapping up late
19:06:22 <clarkb> corvus: ack that was going to be my last question. In that case I agree, proceed when ready fungi
19:07:00 <clarkb> fungi: let me know if you have anything else to add otherwise I'll continue
19:07:34 <fungi> i guess just make sure i haven't missed anything (besides lingering cleanup)
19:07:42 <corvus> (looks like the image upload container switch merged and worked)
19:07:52 <clarkb> nothing stood out to me when I reviewed the changes
19:08:10 <clarkb> I double checked the reenablement stack ensured images go before max-servers etc and that all looked fine
19:08:14 <fungi> there are a bunch of moving parts in different repos with dependencies and pauses for deploy jobs to complete before approving the next part
19:09:00 <clarkb> probably at this point the best thing is to start rolling forward and just adjust if necessary
19:09:48 <clarkb> I also double checked quotas in both new tenants and the math worked out for me
19:10:20 <fungi> yeah, seems we're primarily constrained by the ram quota
19:10:56 <fungi> we could increase by 50% if just the ram were raised
19:11:16 <clarkb> ya a good followup once everything is moved is asking if we can bump up memory quotas to meet the 50 instance quota
19:11:27 <clarkb> then we'd have 100 instances total (50 each in two regions)
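The capacity math being referenced is simply the minimum across the per-tenant quotas: instances, RAM, and vCPUs. A back-of-the-envelope sketch, using placeholder quota numbers rather than the real raxflex values:

```python
# Illustrative placeholders only -- not the actual raxflex quotas or flavor.
FLAVOR_RAM_MB = 8192      # assumed 8GB test-node flavor
FLAVOR_VCPUS = 8          # assumed
INSTANCE_QUOTA = 50       # per region, per the discussion above
RAM_QUOTA_MB = 262144     # placeholder
CORES_QUOTA = 512         # placeholder

# Effective node count is bounded by whichever quota runs out first.
capacity = min(INSTANCE_QUOTA,
               RAM_QUOTA_MB // FLAVOR_RAM_MB,
               CORES_QUOTA // FLAVOR_VCPUS)
print(f"effective nodes per region: {capacity}")
print(f"across two regions: {2 * capacity}")
```

With numbers like these the RAM term is the smallest, which is the "primarily constrained by the ram quota" situation; raising only the RAM quota lifts the node count toward the 50 instance cap.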
19:11:58 <clarkb> #topic Zuul-launcher image builds
19:12:10 <clarkb> let's keep moving, we're 1/4 through our time and have more items to discuss
19:12:26 <corvus> yeah most action is in previous and next topics
19:12:56 <clarkb> I also wanted to ask if we should drop max-servers on the nodepool side for raxflex to give zuul-launcher more of a chance to get nodes but I think that can happen after fungi flips stuff over
19:13:12 <clarkb> not sure if you think that will be helpful based on usage patterns
19:13:20 <corvus> the quota support should help us get nodes now
19:13:37 <corvus> it was actually useful to test that even
19:13:46 <corvus> (that we still get nodes and not errors when we're at unexpected quota)
19:14:02 <clarkb> got it. The less than ideal state ensures we exercise our handling of that scenario (which is reasonably common in the real world so a good thing)
19:14:22 <clarkb> I guess we can discuss the next item then
19:14:25 <corvus> so for the occasional test like we're doing now, i think it's fine. once we use it more regularly, yeah, we should probably start to split the quota a bit.
19:14:28 <corvus> ++
19:14:34 <clarkb> #topic Updating Flavors in OVH
19:14:48 <clarkb> the zuul-launcher effort looked into expanding into ovh then discovered we have a single flavor to use there
19:15:11 <clarkb> that flavor is an 8GB memory 8cpu 80GB disk flavor that gets scheduled to specific hardware for us and we aren't supposed to use the general purpose flavors in ovh
19:15:34 <clarkb> this started a conversation with amorin about expanding the flavor options for us as zuul-launcher is attempting to support a spread of 4gb, 8gb, and 16gb nodes
19:15:40 <clarkb> #link https://etherpad.opendev.org/p/ovh-flavors
19:15:54 <corvus> (and we learned bad things will happen if we do! they can't be scheduled on the same hypervisors, so we would end up excluding ourselves from our own resources)
19:15:55 <clarkb> this document attempts to capture some of the problems and considerations with making the change
19:16:37 <clarkb> ya the existing flavors are using some custom scheduling parameters that don't work with a mix of flavors. They can move us to entirely new flavors that don't use those scheduling methods and we can avoid this problem. However, we have to take a downtime to do so
19:16:54 <clarkb> considering that ovh is a fairly large chunk of our quota corvus suggested we might want to hold off on this until after the openstack release is settled
19:17:12 <corvus> when is that?
19:17:17 <clarkb> in general I'm on board with the plan. I think it gets us away from weird special case behaviors and makes the system more flexible
19:17:40 <clarkb> #link https://releases.openstack.org/epoxy/schedule.html
19:18:08 <clarkb> the actual release is April 2. But things usually settle after the first rc which is March 14 ish
19:18:16 <fungi> fwiw, test usage should already be falling now that the freeze is done
19:18:35 <fungi> next week is when release candidates and stable branches appear
19:18:56 <clarkb> ya and after ^ we're past the major demands on the CI system. So week after next is fine? or maybe late next week
19:19:12 <corvus> would >= mar 17 be a good choice?
19:19:27 <clarkb> ya I think so
19:19:49 <corvus> sounds good, i'll let amorin know in #opendev
19:19:54 <tonyb> Is there any additional logging we want to add to zuul before the stable branches appear.... To try and find where we're missing branch creation events #tangent
19:19:57 <fungi> i'd even go so far as to say now is probably fine, but would want to look at the graphs a bit more to confirm
19:21:32 <clarkb> tonyb: I think we can follow up on that in open discussion. It is a good question
19:21:39 <clarkb> Anything else on the subject of ovh flavor migrations?
19:21:41 <tonyb> ++
19:22:18 <clarkb> #topic Running infra-prod Jobs in Parallel on Bridge
19:22:37 <clarkb> This replaces the fix known_hosts problems on bridge topic from last week
19:22:55 <clarkb> this is a followup to that which is basically let's get the parallel infra-prod jobs running finally
19:23:00 <corvus> [i would like to engage on the branch thing in open discussion, but need a reminder of the problem and current state; so if anyone has background bandwidth to find a link or something before we get there i would find that helpful]
19:23:01 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/942439
19:23:50 <clarkb> This is a change that ianw proposed and I've since edited to address some concerns. But the tl;dr is we can use a paused parent job that sets up the bridge to run infra-prod ansible in subsequent jobs. That paused job will hold a lock preventing any other parent job from starting (to avoid conflicts across pipelines)
19:24:33 <clarkb> then we can run those child jobs in parallel. To start I've created a new semaphore with a limit of 1 so that we can transition and observe the current behavior is maintained. Then we can bump that semaphore limit to 2 or 5 etc and have the infra prod jobs run in parallel
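For readers unfamiliar with the approach, the pattern described above is roughly: an exclusive lock serializes whole buildsets on bridge (so pipelines can't interleave), while a semaphore bounds how many infra-prod child jobs run at once (limit 1 today, raised later). The sketch below uses plain Python threading purely to illustrate that pattern; it is not the Zuul job configuration in 942439.

```python
import threading
import time

bridge_lock = threading.Lock()                  # only one paused parent buildset at a time
infra_prod_semaphore = threading.Semaphore(1)   # start at 1 (serial); bump to 2, 5, ... later

def infra_prod_job(name):
    # Each child job waits for a semaphore slot before touching bridge.
    with infra_prod_semaphore:
        print(f"running {name}")
        time.sleep(0.1)

def buildset(jobs):
    # The "parent" holds the lock for the whole buildset while children run.
    with bridge_lock:
        threads = [threading.Thread(target=infra_prod_job, args=(j,)) for j in jobs]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

buildset(["infra-prod-base", "infra-prod-service-gitea", "infra-prod-service-zuul"])
```

The point of starting the semaphore at 1 is that the new structure can be deployed without changing observable behavior, and parallelism is then a one-line limit bump.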
19:24:51 <corvus> looks great and i see no reason not to +w it whenever you have time to check on it.
19:25:09 <clarkb> ya I think the main thing is to call it out now so anyone else can review it if they would like to. Then I would like to +w it tomorrow
19:25:21 <clarkb> I should be able to make plenty of time to monitor it tomorrow
19:25:44 <clarkb> and I'm happy to answer any questions about it in the interim
19:25:57 <clarkb> any questions or concerns to raise in the meeting about this change?
19:26:45 <tonyb> None from me
19:26:55 <clarkb> #topic Upgrading old servers
19:27:04 <clarkb> cool feel free to ping me or leave review comments
19:27:26 <clarkb> that takes us to our long term topic on upgrading old servers. I don't have any tasks on this in flight at the moment. Refactoring our CD pipeline stuff took over temporarily
19:27:32 <clarkb> does anyone else have anything to bring up on this subject?
19:28:34 <tonyb> One small thing, I did discover that upgrading mediawiki beyond 1.31 will likely break openid auth
19:28:51 <fungi> did they drop support for the plugin?
19:29:13 <tonyb> So that essentially means that once we get the automation on support we're blocked behind keycloak updates
19:29:42 <fungi> makes sense
19:29:43 <tonyb> fungi: not officially, but the core dropped functions and the plugin wasn't updated.
19:29:44 <clarkb> we'd still be in a better position overall than the situation today right?
19:30:03 <clarkb> so this is more of a speedbump than a complete roadblock?
19:30:04 <tonyb> I suppose there may be scope to fix it upstream
19:30:25 <fungi> i remember when i was trying to update mediawiki a few years ago, i was struggling with getting the openid plugin working and didn't find a resolution
19:30:45 <tonyb> clarkb: correct it was just a new surprise I discovered yesterday so I wanted to share
19:31:21 <clarkb> cool just making sure I understand the implications
19:31:28 <tonyb> ++
19:32:20 <clarkb> #topic Sprinting to Upgrade Servers to Jammy
19:32:42 <clarkb> The initial sprint got a number of servers upgraded but there is still a fairly large backlog to get through.
19:32:59 <clarkb> I should be able to shift gears back to this effort next week and treat next week as another sprint
19:33:15 <clarkb> with known_hosts addressed and hopefully parallel infra-prod execution things should move a bit more quickly too
19:33:20 <fungi> and what was the reason to upgrade them to jammy instead of noble at this point?
19:33:34 <clarkb> oh sorry that's a typo, arg, it used to say focal
19:33:40 <clarkb> then I "fixed" it to jammy
19:33:43 <fungi> aha, makes more sense
19:33:43 <clarkb> #undo
19:33:43 <opendevmeet> Removing item from minutes: #topic Sprinting to Upgrade Servers to Jammy
19:33:58 <clarkb> #topic Sprinting to Upgrade Servers to Noble
19:34:07 <clarkb> #link https://etherpad.opendev.org/p/opendev-server-replacement-sprint
19:34:31 <fungi> since noble also gets us the ability to use podman, so we can deploy containers from quay with speculative testing
19:34:41 <clarkb> if anyone else has time and bandwidth to boot a new server or three that would be appreciated too. My general thought on this process is to continue to build out a set of servers that provide us feedback before we get to gerrit
19:34:48 <clarkb> yup exactly
19:34:53 <clarkb> there are benefits all around
19:35:13 <clarkb> reduces our dependency on docker hub, leaves old distro releases behind and so on
19:35:36 <clarkb> but ya I don't really have anything new on this subject this week. Going to try and push forward again next week
19:35:42 <clarkb> Help is very much welcome
19:35:48 <clarkb> which takes us to
19:35:50 <clarkb> #topic Docker Hub rate limits
19:36:18 <clarkb> Docker hub announced that on March 1, 2025 anonymous image pull rate limits would go from 100 per 6 hours per ipv4 address and per ipv6 /64 to 10 per hour
19:36:38 <clarkb> I haven't done the dance of manually testing that their api reports those new limits yet but I've been operating under the assumption that they are in place
19:36:50 <corvus> "please don't use our service"
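The manual check clarkb mentions is roughly Docker's documented flow for anonymous pulls: request a token for the ratelimitpreview/test repository, then HEAD its manifest and read the ratelimit response headers. A sketch, assuming the requests library is available:

```python
import requests

# Anonymous token for the rate-limit preview repository (documented by Docker).
token = requests.get(
    "https://auth.docker.io/token",
    params={"service": "registry.docker.io",
            "scope": "repository:ratelimitpreview/test:pull"},
    timeout=30,
).json()["token"]

# HEAD the manifest; the rate limit headers come back without consuming a pull's worth of data.
resp = requests.head(
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
# Values look like "10;w=3600" -- a limit of 10 per 3600-second window.
print("ratelimit-limit:        ", resp.headers.get("ratelimit-limit"))
print("ratelimit-remaining:    ", resp.headers.get("ratelimit-remaining"))
print("docker-ratelimit-source:", resp.headers.get("docker-ratelimit-source"))
```

Running this from a test node in each cloud would confirm whether the new 10-per-hour limit is in effect and which source address the limit is being keyed on.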
19:37:05 <clarkb> it is possible this is a good thing for us because the rate limit resets hourly rather than every 6 hours and our jobs often run for about an hour so we might get away with this
19:37:15 <clarkb> but it is a great reminder that we should do our best to mitigate and get off the service
19:37:30 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/943326
19:37:40 <clarkb> this change updates our use of selenium containers to fetch from mirrored images on quay
19:37:46 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/943216
19:38:08 <clarkb> this change forces docker fetches to go over ipv4 rather than ipv6 as we get more rate limit quota that way due to how ipv6 is set up in the clouds we use (many nodes can share the same /64)
19:38:40 <clarkb> tonyb: on 943216 I made a note about a problem with that approach on long lived servers
19:38:57 <clarkb> essentially if ip addrs in dns records change then we can orphan bad/broken and potentially even security risky values
19:38:59 <tonyb> Yup. I'll address that when I'm at my laptop
19:39:03 <clarkb> cool
19:39:37 <clarkb> beyond these changes and the longer term efforts we've got in place I don't really have any good answers. It's mostly a case of keep moving away as much as we possibly can
19:40:27 <clarkb> #topic Running certcheck on bridge
19:40:35 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/942719
19:41:07 <clarkb> this is fungi's change to move certcheck over to bridge. I think ianw does have a good point there though which is we can run this out of a zuul job instead and avoid baking extra stuff into bridge that we don't need
19:41:45 <clarkb> the suggested job is infra-prod-letsencrypt but I don't think it even needs to be an LE job if we can generate the list of domains somehow (that letsencrypt job may be a good place because it already has the list of domains available)
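A certcheck-style zuul job would essentially just walk a domain list and flag certificates nearing expiry. A minimal sketch of that idea follows, with placeholder domains and warning threshold; it is not the proposed job or the certcheck tool itself.

```python
import datetime
import socket
import ssl

DOMAINS = ["review.opendev.org", "zuul.opendev.org"]  # illustrative list only
WARN_DAYS = 30

def cert_expiry(host, port=443):
    # Complete a TLS handshake and read the peer certificate's notAfter date.
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return datetime.datetime.utcfromtimestamp(ssl.cert_time_to_seconds(not_after))

for domain in DOMAINS:
    expires = cert_expiry(domain)
    days_left = (expires - datetime.datetime.utcnow()).days
    status = "WARNING" if days_left < WARN_DAYS else "OK"
    print(f"{status}: {domain} expires {expires} ({days_left} days left)")
```

In job form the domain list would presumably come from the same data the letsencrypt roles already have, which is why that job was suggested as a home for it.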
19:42:30 <fungi> makes sense, all this really started as a naive attempt to get rid of a github connectivity failure in a job because we install certcheck from git in order to work around the fact that the cacti server is stuck on an old ubuntu version where the certcheck package is too old to work correctly any longer
19:43:07 <fungi> i'm not sure i have the time or appetite to develop a certcheck notification job (i didn't really have the time to work on moving it off the cacti server, but gave it a shot)
19:43:41 <clarkb> maybe we leave that open for a week or two and see if anyone gets to it otherwise we can go ahead with what you put together already
19:44:14 <fungi> turned out the job failure was actually due to a github outage, but by then i'd already written the initial change
19:44:35 <clarkb> ya the status quo is probably sufficient to wait a week or two
19:44:38 <fungi> so went ahead and pushed it for review in case anyone found it useful
19:45:30 <clarkb> #topic Working through our TODO list
19:45:35 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:45:51 <clarkb> this is just another friendly reminder that if you get bored or need a change of pace this list has a good set of things to dive into
19:46:18 <tonyb> ++
19:46:28 <clarkb> I've also added at least one item there that we didn't discuss during the meetup but made sense as something to get done (the parallel infra prod jobs)
19:46:36 <clarkb> so feel free to edit the list as well
19:46:51 <clarkb> #topic Open Discussion
19:47:00 <clarkb> I had two really quick things then let's discuss zuul branch creation behavior
19:47:13 <fungi> story time wrt bulk branch creation and missed events?
19:47:21 <fungi> ah, yeah
19:47:29 <clarkb> first is we updated gitea to use an external memcached for caching rather than a process internal golang hashmap. The idea is to avoid golang gc pause-the-world behavior
19:48:06 <clarkb> so far my anecdotal experience is that gitea is much more consistent in its response times so that is good. but please say something if you notice abnormally long responses suddenly that may indicate additional problems or a misdiagnosed problem
19:48:39 <clarkb> then the other thing is the oftc matrix bridge may shut down at the end of this month. Those of us relying on matrix for irc connectivity in one form or another should be prepared with some other alternative (run your own bridge, use a native irc client, etc)
19:48:47 <clarkb> ok that was it from me
19:49:29 <clarkb> I can do my best to recollect the zuul thing, but if someone else did research already you go for it
19:50:24 <fungi> i mostly just remember a gerrit to gitea replication issue with branches not appearing in gitea. what's the symptom you're talking about, tonyb?
19:51:12 <clarkb> the symptom tonyb refers to is the one that starlingx also hit recently
19:51:47 <clarkb> basically when projects like openstack or starlingx use an automated script to create branches across many projects those events go into zuul and zuul needs to load their configs. But for some subset of projects the config for those branches isn't loaded and changes to those branches are ignored by zuul
19:52:00 <clarkb> triggering a full tenant reload of the affected zuul tenant fixes the problem
19:52:11 <fungi> oh, right, that. thanks
19:52:40 <tonyb> That's more detail than I had.
19:52:52 <tonyb> But that's the issue I was thinking of
19:53:32 <clarkb> I don't think it is clear yet if gerrit is failing to emit the events correctly, if zuul is getting the events then mishandling them, or something else
19:53:42 <corvus> https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-01-05.log.html that day has some talk about missed deleted events
19:54:12 <tonyb> I thought there was a question about the source of the problem: is Gerrit sending all the events? Is zuul missing them?
19:54:40 <corvus> i believe we do log all gerrit events, so it should be possible to answer that q
19:54:54 <tonyb> Okay.
19:55:11 <corvus> oh, i have a vague memory that maybe last time this came up, we were like 1 day past the retention time or something?
19:55:27 <corvus> maybe that was a different issue though.
19:55:30 <clarkb> yes I think that was correct for the latest starlingx case
19:55:33 <tonyb> That sounds familiar
19:55:45 <clarkb> they did the creation in january then showed up in february asking why changes weren't getting tested and by then the logs were gone
19:55:51 <corvus> 2025-03-04 00:00:40,918 DEBUG zuul.GerritConnection.ssh: Received data from Gerrit event stream:
19:55:51 <corvus> 2025-03-04 00:00:40,918 DEBUG zuul.GerritConnection.ssh: {'eventCreatedOn': 1741046440,
19:56:00 <corvus> just confirmed that has the raw event stream from gerrit
19:56:07 <fungi> so perhaps we should proactively look for this symptom immediately after openstack stable/2025.1 branches are created
19:56:10 <clarkb> ++
19:56:26 <corvus> so yeah, if we find out in time, we should be able to dig through that and answer the first question. probably the next one after that too.
19:56:26 <clarkb> you should be able to query the zuul api for the valid branches for each project in zuul
19:56:56 <clarkb> so create branches. Wait an hour or two (since reloading config isn't fast) then query the api for those that hit the problem. If we find at least one go back to the zuul logs and see if we got the events from gerrit and work from there?
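A rough sketch of that follow-up check, assuming the zuul-web per-project endpoint reports the config objects it loaded along with the branch each came from (a "source_context"); the exact response fields should be verified against the Zuul REST API documentation before relying on this.

```python
import requests

ZUUL = "https://zuul.opendev.org"
TENANT = "openstack"
PROJECT = "openstack/nova"      # example project
EXPECTED = "stable/2025.1"      # branch we expect Zuul to have picked up

data = requests.get(
    f"{ZUUL}/api/tenant/{TENANT}/project/{PROJECT}", timeout=30).json()

# Assumed response shape: each loaded config entry names the branch it came from.
branches = {cfg.get("source_context", {}).get("branch")
            for cfg in data.get("configs", [])}
print("branches zuul loaded config from:", sorted(b for b in branches if b))
if EXPECTED not in branches:
    print(f"MISSING: {EXPECTED} -- check the scheduler debug logs for the "
          f"corresponding gerrit ref-updated event, then reload the tenant config")
```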
19:56:59 <tonyb> Yup if we have logging that'll be helpful
19:57:11 <corvus> ++
19:57:21 <tonyb> ++
19:57:38 <frickler> just note the branches aren't all created at the same time, but by release patches per project (team)
19:58:03 <clarkb> but they should all be created within a week period next week?
19:58:14 <clarkb> maybe just check for missing branches on friday and call that sufficient
19:58:21 <clarkb> (friday of next week I mean)
19:58:31 <tonyb> Yup. I believe "large" projects like OSA have hit it
19:58:46 <frickler> there might be exceptions but that should mostly work
19:59:10 <frickler> oh, osa is one of the cycle trailing projects, so kind of special
19:59:30 <fungi> the goal is not to necessarily find all instances of the problem, but to find at least one so we can collect the relevant log data
19:59:43 <clarkb> yup exactly
19:59:53 <clarkb> and we can reload tenant configs as the general solution otherwise
20:00:13 <tonyb> Yeah but, when they branch.... All by their lonesome they've seen this behaviour so it isn't like we need "all of openstack" to trigger it
20:00:21 <clarkb> got it
20:00:39 <clarkb> and we are at time
20:00:43 <clarkb> I think that is a good plan and we can take it from there
20:00:51 <clarkb> feel free to discuss further in #opendev or on the mailing list
20:00:54 <clarkb> thank you everyone!
20:00:56 <clarkb> #endmeeting