19:00:09 <clarkb> #startmeeting infra
19:00:09 <opendevmeet> Meeting started Tue Mar  4 19:00:09 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:09 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:09 <opendevmeet> The meeting name has been set to 'infra'
19:00:16 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/5FMGNRCSUJQZYAOHNLLXQI4QT7LIG4C7/ Our Agenda
19:00:22 <clarkb> #topic Announcements
19:01:20 <clarkb> PTG preparation is underway. I wasn't planning on participating as OpenDev. Just wanted to make that clear. I think we've found more value in our out of band meetups
19:01:34 <clarkb> and that gives each of us the ability to attend other PTG activities without additional distraction
19:01:52 <clarkb> Anything else to add?
19:03:38 <clarkb> Sounds like no. We can continue
19:03:44 <clarkb> #topic Redeploying raxflex resources
19:04:04 <clarkb> As mentioned previously we need to redeploy raxflex resources to take advantage of better network MTUs and to have tenant alignment across regions
19:04:27 <clarkb> this work is underway with new mirrors being booted in the new regions. DNS has been updated for sjc3 test nodes to talk to that mirror in the other tenant
19:04:59 <clarkb> We did discover that floating ips are intended to be mandatory for externally routable IPs and the MTUs we got by default were quite large so fungi manually reduced them to 1500 at the neutron network level
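For reference, lowering the MTU on an existing neutron network like this is typically a one-line OpenStack CLI operation; a minimal sketch, with the network name below being purely illustrative:
    # reduce the MTU advertised to instances on an existing neutron network
    openstack network set --mtu 1500 example-tenant-network
Existing instances generally only pick up the new value after a DHCP lease renewal or a reboot.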
19:04:59 <corvus> i think i'm just assuming fungi will approve all the changes to switch when he has bandwidth
19:05:08 <clarkb> #link https://review.opendev.org/q/hashtag:flex-dfw3
19:05:33 <clarkb> and ya fungi has pushed a number of changes that need to be stepped through in a particular order to effect the migration. That link should contain all of them though I suspect many/most/all have sufficient reviews at this point
19:05:58 <corvus> i don't think any zuul-launcher coordination is necessary; so i say go when ready
19:06:09 <fungi> sorry, trying to switch gears, openstack tc meeting is just wrapping up late
19:06:22 <clarkb> corvus: ack that was going to be my last question. In that case I agree proceed when ready fungi
19:07:00 <clarkb> fungi: let me know if you have anything else to add otherwise I'll continue
19:07:34 <fungi> i guess just make sure i haven't missed anything (besides lingering cleanup)
19:07:42 <corvus> (looks like the image upload container switch merged and worked)
19:07:52 <clarkb> nothing stood out to me when I reviewed the changes
19:08:10 <clarkb> I double checked that the reenablement stack ensures images go before max-servers etc and that all looked fine
19:08:14 <fungi> there are a bunch of moving parts in different repos with dependencies and pauses for deploy jobs to complete before approving the next part
19:09:00 <clarkb> probably at this point the best thing is to start rolling forward and just adjust if necessary
19:09:48 <clarkb> I also double checked quotas in both new tenants and the math worked out for me
19:10:20 <fungi> yeah, seems we're primarily constrained by the ram quota
19:10:56 <fungi> we could increase by 50% if just the ram were raised
19:11:16 <clarkb> ya a good followup once everything is moved is asking if we can bump up memory quotas to meet the 50 instance quota
19:11:27 <clarkb> then we'd have 100 instances total (50 each in two regions)
19:11:58 <clarkb> #topic Zuul-launcher image builds
19:12:10 <clarkb> lets keep moving we're 1/4 through our time and have more items to discuss
19:12:26 <corvus> yeah most action is in previous and next topics
19:12:56 <clarkb> I also wanted to ask if we should drop max-servers on the nodepool side for raxflex to give zuul-launcher more of a chance to get nodes but I think that can happen after fungi flips stuff over
19:13:12 <clarkb> not sure if you think that will be helpful based on usage patterns
19:13:20 <corvus> the quota support should help us get nodes now
19:13:37 <corvus> it was actually useful to test that even
19:13:46 <corvus> (that we still get nodes and not errors when we're at unexpected quota)
19:14:02 <clarkb> got it. The less than ideal state ensures we exercise our handling of that scenario (which is reasonably common in the real world so a good thing)
19:14:22 <clarkb> I guess we can discuss the next item then
19:14:25 <corvus> so for the occasional test like we're doing now, i think it's fine.  once we use more regularly, yeah, we should probably start to split the quota a bit.
19:14:28 <corvus> ++
19:14:34 <clarkb> #topic Updating Flavors in OVH
19:14:48 <clarkb> the zuul-launcher effort looked into expanding into ovh then discovered we have a single flavor to use there
19:15:11 <clarkb> that flavor is an 8GB memory 8cpu 80GB disk flavor that gets scheduled to specific hardware for us and we aren't supposed to use the general purpose flavors in ovh
19:15:34 <clarkb> this started a conversation with amorin about expanding the flavor options for us as zuul-launcher is attempting to support a spread of 4gb, 8gb, and 16gb nodes
19:15:40 <clarkb> #link https://etherpad.opendev.org/p/ovh-flavors
19:15:54 <corvus> (and we learned bad things will happen if we do!  they can't be scheduled on the same hypervisors, so we would end up excluding ourselves from our own resources)
19:15:55 <clarkb> this document attempts to capture some of the problems and considerations with making the change
19:16:37 <clarkb> ya the existing flavors are using some custom scheduling parameters that don't work with a mix of flavors. They can move us to entirely new flavors that don't use those scheduling methods and we can avoid this problem. However, we have to take a downtime to do so
19:16:54 <clarkb> considering that ovh is a fairly large chunk of our quota corvus suggested we might want to hold off on this until after the openstack release is settled
19:17:12 <corvus> when is that?
19:17:17 <clarkb> in general I'm on board with the plan. I think it gets us away from weird special case behaviors and makes the system more flexible
19:17:40 <clarkb> #link https://releases.openstack.org/epoxy/schedule.html
19:18:08 <clarkb> the actual release is April 2. But things usually settle after the first rc which is March 14 ish
19:18:16 <fungi> fwiw, test usage should already be falling now that the freeze is done
19:18:35 <fungi> next week is when release candidates and stable branches appear
19:18:56 <clarkb> ya and after ^ we're past the major demands on the CI system. So week after next is fine? or maybe late next week
19:19:12 <corvus> would >= mar 17 be a good choice?
19:19:27 <clarkb> ya I think so
19:19:49 <corvus> sounds good, i'll let amorin know in #opendev
19:19:54 <tonyb> Is there any additional logging we want to add to zuul before the stable branches appear.... To try and find where we're missing branch creation events #tangent
19:19:57 <fungi> i'd even go so far as to say now is probably fine, but would want to look at the graphs a bit more to confirm
19:21:32 <clarkb> tonyb: I think we can followup on that in open discussion. It is a good question
19:21:39 <clarkb> Anything else on the subject of ovh flavor migrations?
19:21:41 <tonyb> ++
19:22:18 <clarkb> #topic Running infra-prod Jobs in Parallel on Bridge
19:22:37 <clarkb> This replaces the fix known_hosts problems on bridge topic from last week
19:22:55 <clarkb> this is a followup to that which is basically lets get the parallel infra-prod jobs running finally
19:23:00 <corvus> [i would like to engage on the branch thing in open discussion, but need a reminder of the problem and current state; so if anyone has background bandwidth to find a link or something before we get there i would find that helpful]
19:23:01 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/942439
19:23:50 <clarkb> This is a change that ianw proposed and I've since edited to address some concerns. But the tl;dr is we can use a paused parent job that sets up the bridge to run infra-prod ansible in subsequent jobs. That paused job will hold a lock preventing any other parent job from starting (to avoid conflict across pipelines)
19:24:33 <clarkb> then we can run those child jobs in parallel. To start I've created a new semaphore with a limit of 1 so that we can transition and observe the current behavior is maintained. Then we can bump that semaphore limit to 2 or 5 etc and have the infra prod jobs run in parallel
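For anyone unfamiliar with the mechanism, a Zuul semaphore is just a small config stanza whose max count caps how many jobs may hold it at once; a minimal sketch with an assumed name (the real definition lives in the change linked above):
    - semaphore:
        name: infra-prod-playbook
        max: 1
Bumping max to 2 or higher later is the only change needed to let the child jobs actually run concurrently.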
19:24:51 <corvus> looks great and i see no reason not to +w it whenever you have time to check on it.
19:25:09 <clarkb> ya I think the main thing is to call it out now so anyone else can review it if they would like to. Then I would like to +w it tomorrow
19:25:21 <clarkb> I should be able to make plenty of time to monitor it tomorrow
19:25:44 <clarkb> and I'm happy to answer any questions about it in the interim
19:25:57 <clarkb> any questions or concerns to raise in the meeting about this change?
19:26:45 <tonyb> None from me
19:26:55 <clarkb> #topic Upgrading old servers
19:27:04 <clarkb> cool feel free to ping me or leave review comments
19:27:26 <clarkb> that takes us to our long term topic on upgrading old servers. I don't have any tasks on this inflight at the moment. Refactoring our CD pipeline stuff took over temporarily
19:27:32 <clarkb> does anyone else have anything to bring up on this subject?
19:28:34 <tonyb> One small thing, I did discover that upgrading mediawiki beyond 1.31 will likely break openid auth
19:28:51 <fungi> did they drop support for the plugin?
19:29:13 <tonyb> So that essentially means that once we get the automation in place we're blocked behind keycloak updates
19:29:42 <fungi> makes sense
19:29:43 <tonyb> fungi: not officially, but the core dropped functions and the plugin wasn't updated.
19:29:44 <clarkb> we'd still be in a better position overall than the situation today right?
19:30:03 <clarkb> so this is more of a speedbump than a complete roadblock?
19:30:04 <tonyb> I suppose there may be scope to fix it upstream
19:30:25 <fungi> i remember when i was trying to update mediawiki a few years ago, i was struggling with getting the openid plugin working and didn't find a resolution
19:30:45 <tonyb> clarkb: correct it was just a new surprise I discovered yesterday so I wanted to share
19:31:21 <clarkb> cool just making sure I understand the implications
19:31:28 <tonyb> ++
19:32:20 <clarkb> #topic Sprinting to Upgrade Servers to Jammy
19:32:42 <clarkb> The initial sprint got a number of servers upgraded but there is still a fairly large backlog to get through.
19:32:59 <clarkb> I should be able to shift gears back to this effort next week and treat next week as another sprint
19:33:15 <clarkb> with known_hosts addressed and hopefully parallel infra-prod execution things should move a bit more quickly too
19:33:20 <fungi> and what was the reason to upgrade them to jammy instead of noble at this point?
19:33:34 <clarkb> oh sorry thats a typo arg it used to say focal
19:33:40 <clarkb> then I "fixed" it to jammy
19:33:43 <fungi> aha, makes more sense
19:33:43 <clarkb> #undo
19:33:43 <opendevmeet> Removing item from minutes: #topic Sprinting to Upgrade Servers to Jammy
19:33:58 <clarkb> #topic Sprinting to Upgrade Servers to Noble
19:34:07 <clarkb> #link https://etherpad.opendev.org/p/opendev-server-replacement-sprint
19:34:31 <fungi> since noble also gets us the ability to use podman, and to deploy containers from quay with speculative testing
19:34:41 <clarkb> if anyone else has time and bandwidth to boot a new server or three that would be appreciated too. My general thought on this process is to continue to build out a set of servers that provide us feedback before we get to gerrit
19:34:48 <clarkb> yup exactly
19:34:53 <clarkb> there are benefits all around
19:35:13 <clarkb> reduces our dependency on docker hub, leaves old distro releases behind and so on
19:35:36 <clarkb> but ya I don't really have anything new on this subject this week. Going to try and push forward again next week
19:35:42 <clarkb> Help is very much welcome
19:35:48 <clarkb> which takes us to
19:35:50 <clarkb> #topic Docker Hub rate limits
19:36:18 <clarkb> Docker hub announced that on March 1, 2025 anonymous image pull rate limits would go from 100 per 6 hours per ipv4 address and per ipv6 /64 to 10 per hour
19:36:38 <clarkb> I haven't done the dance of manually testing that their api reports those new limits yet but I've been operating under the assumption that they are in place
19:36:50 <corvus> "please don't use our service"
19:37:05 <clarkb> it is possible this is a good thing for us because the rate limit resets hourly rather than every 6 hours and our jobs often run for about an hour so we might get away with this
19:37:15 <clarkb> but it is a great reminder that we should do our best to mitigate and get off the service
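Checking the advertised limits is a short curl exercise against Docker Hub's documented ratelimitpreview image; a minimal sketch (requires jq) that reads the rate limit headers without actually pulling anything:
    # fetch an anonymous pull token for the rate limit preview repo
    TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
    # HEAD the manifest and inspect the ratelimit-limit / ratelimit-remaining headers
    curl -sI -H "Authorization: Bearer $TOKEN" https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest | grep -i ratelimit
The headers returned should reflect whatever policy is actually in effect for the requesting address.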
19:37:30 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/943326
19:37:40 <clarkb> this change updates our use of selenium containers to fetch from mirrored images on quay
19:37:46 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/943216
19:38:08 <clarkb> this change forces docker fetches to go over ipv4 rather than ipv6 as we get more rate limit quota that way due to how ipv6 is set up in the clouds we use (many nodes can share the same /64)
19:38:40 <clarkb> tonyb: on 943216 I made a note about a problem with that approach on long lived servers
19:38:57 <clarkb> essentially if ip addrs in dns records change then we can orphan bad/broken and potentially even security risky values
19:38:59 <tonyb> Yup.   I'll address that when I'm at my laptop
19:39:03 <clarkb> cool
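For context, the approach in 943216 amounts to pinning the registry hostnames to IPv4 addresses on the node; a hypothetical illustration only (the address below is documentation space, not a real Docker Hub IP, and the change itself should be consulted for the actual mechanism):
    # resolve the current IPv4 address(es) for the registry
    getent ahostsv4 registry-1.docker.io
    # example /etc/hosts style pin; real addresses rotate over time, which is the orphaning concern above
    203.0.113.10 registry-1.docker.io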
19:39:37 <clarkb> beyond these changes and the longer term efforts we've got in place I don't really have any good answers. It's mostly a case of keep moving away as much as we possibly can
19:40:27 <clarkb> #topic Running certcheck on bridge
19:40:35 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/942719
19:41:07 <clarkb> this is fungi's change to move certcheck over to bridge. I think ianw does have a good point there though which is we can run this out of a zuul job instead and avoid baking extra stuff into bridge that we don't need
19:41:45 <clarkb> the suggested job is infra-prod-letsencrypt but I don't think it even needs to be an LE job if we can generate the list of domains somehow (that letsencrypt job may be a good place because it already has the list of domains available)
19:42:30 <fungi> makes sense, all this really started as a naive attempt to get rid of a github connectivity failure in a job because we install certcheck from git in order to work around the fact that the cacti server is stuck on an old ubuntu version where the certcheck package is too old to work correctly any longer
19:43:07 <fungi> i'm not sure i have the time or appetite to develop a certcheck notification job (i didn't really have the time to work on moving it off the cacti server, but gave it a shot)
19:43:41 <clarkb> maybe we leave that open for a week or two and see if anyone gets to it otherwise we can go ahead with what you put together already
19:44:14 <fungi> turned out the job failure was actually due to a github outage, but by then i'd already written the initial change
19:44:35 <clarkb> ya the status quo is probably sufficient to wait a week or two
19:44:38 <fungi> so went ahead and pushed it for review in case anyone found it useful
19:45:30 <clarkb> #topic Working through our TODO list
19:45:35 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:45:51 <clarkb> this is just another friendly reminder that if you get bored or need a change of pace this list has a good set of things to dive into
19:46:18 <tonyb> ++
19:46:28 <clarkb> I've also added at least one item there that we didn't discuss during the meetup but made sense as something to get done (the parallel infra prod jobs)
19:46:36 <clarkb> so feel free to edit the list as well
19:46:51 <clarkb> #topic Open Discussion
19:47:00 <clarkb> I had two really quick things then lets discuss zuul branch creation behavior
19:47:13 <fungi> story time wrt bulk branch creation and missed events?
19:47:21 <fungi> ah, yeah
19:47:29 <clarkb> first is we updated gitea to use an external memcached for caching rather than a process internal golang hashmap. The idea is to avoid golang gc pause-the-world behavior
19:48:06 <clarkb> so far my anecdotal experience is that gitea is much more consistent in its response times so that is good. but please say something if you notice abnormally long responses suddenly that may indicate additional problems or a misdiagnosed problem
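For anyone curious what that switch looks like, gitea selects its cache backend in app.ini; a minimal sketch assuming a memcached instance reachable on localhost (the host/port in our actual deployment may differ):
    [cache]
    ADAPTER = memcache
    HOST = 127.0.0.1:11211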
19:48:39 <clarkb> then the other thing is the oftc matrix bridge may shutdown at the end of this month. Those of us relying on matrix for irc connectivity in one form or another should be prepared with some other alternative (run your own bridge, use native irc client, etc)
19:48:47 <clarkb> ok that was it from me
19:49:29 <clarkb> I can do my best to recollect the zuul thing, but if someone else did research already you go for it
19:50:24 <fungi> i mostly just remember a gerrit to gitea replication issue with branches not appearing in gitea. what's the symptom you're talking about, tonyb?
19:51:12 <clarkb> the symptom tonyb refers to is the one that starlingx also hit recently
19:51:47 <clarkb> basically when projects like openstack or starlingx use an automated script to create branches across many projects those events go into zuul and zuul needs to load their configs. But for some subset of projects the config for those branches isn't loaded and changes to those branches are ignored by zuul
19:52:00 <clarkb> triggering a full tenant reload of the affected zuul tenant fixes the problem
19:52:11 <fungi> oh, right, that thanks
19:52:40 <tonyb> That's more detail than I had.
19:52:52 <tonyb> But that's the issue I was thinking of
19:53:32 <clarkb> I don't think it is clear yet if gerrit is failing to emit the events correctly, if zuul is getting the events then mishandling them, or something else
19:53:42 <corvus> https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-01-05.log.html that day has some talk about missed deleted events
19:54:12 <tonyb> I thought there was a question about the source of the problem: is Gerrit sending all the events? Is zuul missing them?
19:54:40 <corvus> i believe we do log all gerrit events, so it should be possible to answer that q
19:54:54 <tonyb> Okay.
19:55:11 <corvus> oh, i have a vague memory that maybe last time this came up, we were like 1 day past the retention time or something?
19:55:27 <corvus> maybe that was a different issue though.
19:55:30 <clarkb> yes I think that was correct for the latest starlingx case
19:55:33 <tonyb> That sounds familiar
19:55:45 <clarkb> they did the creation in january then showed up in february asking why changes weren't getting tested and by then the logs were gone
19:55:51 <corvus> 2025-03-04 00:00:40,918 DEBUG zuul.GerritConnection.ssh: Received data from Gerrit event stream:
19:55:51 <corvus> 2025-03-04 00:00:40,918 DEBUG zuul.GerritConnection.ssh:   {'eventCreatedOn': 1741046440,
19:56:00 <corvus> just confirmed that has the raw event stream from gerrit
19:56:07 <fungi> so perhaps we should proactively look for this symptom immediately after openstack stable/2025.1 branches are created
19:56:10 <clarkb> ++
19:56:26 <corvus> so yeah, if we find out in time, we should be able to dig through that and answer the first question.  probably the next one after that too.
19:56:26 <clarkb> you should be able to query the zuul api for the valid branches for each project in zuul
19:56:56 <clarkb> so create branches. Wait an hour or two (since reloading config isn't fast) then query the api for those that hit the problem. If we find at least one go back to the zuul logs and see if we got the events from gerrit and work from there?
19:56:59 <tonyb> Yup if we have logging that'll be helpful
19:57:11 <corvus> ++
19:57:21 <tonyb> ++
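As a rough sketch of the "query the API" step, something along these lines should show what Zuul currently knows about a project and its per-branch config (nova is just an example project, and the exact response shape should be double checked against the Zuul REST API docs):
    # inspect the project in the openstack tenant and look for the newly created branch
    curl -s https://zuul.opendev.org/api/tenant/openstack/project/opendev.org/openstack/nova | jq .
If the expected branch is missing from the output an hour or two after creation, that project is a candidate for digging through the scheduler logs for the original gerrit event.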
19:57:38 <frickler> just note the branches aren't all created at the same time, but by release patches per project (team)
19:58:03 <clarkb> but they should all be created within a week period next week?
19:58:14 <clarkb> maybe just check for missing branches on friday and call that sufficient
19:58:21 <clarkb> (friday of next week I mean)
19:58:31 <tonyb> Yup.  I believe "large" projects like OSA have hit it
19:58:46 <frickler> there might be exceptions but that should mostly work
19:59:10 <frickler> oh, osa is one of the cycle trailing projects, so kind of special
19:59:30 <fungi> the goal is not to necessarily find all incidences of the problem, but to find at least one so we can collect the relevant log data
19:59:43 <clarkb> yup exactly
19:59:53 <clarkb> and we can reload tenant configs as the general solution otherwise
20:00:13 <tonyb> Yeah but, when they branch.... All by their lonesome they've seen this behaviour so it isn't like we need "all of openstack" to trigger it
20:00:21 <clarkb> got it
20:00:39 <clarkb> and we are at time
20:00:43 <clarkb> I think that is a good plan and we can take it from there
20:00:51 <clarkb> feel free to discuss further in #opendev or on the mailing list
20:00:54 <clarkb> thank you everyone!
20:00:56 <clarkb> #endmeeting