19:00:26 <clarkb> #startmeeting infra
19:00:26 <opendevmeet> Meeting started Tue May 27 19:00:26 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:26 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:26 <opendevmeet> The meeting name has been set to 'infra'
19:00:39 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/WTHTKBQ5IUYSAX6ITU7F46PBDATVMYCU/ Our Agenda
19:01:06 <clarkb> #topic Announcements
19:01:13 <clarkb> I don't have anything to announce. Did anyone else have something?
19:03:42 <clarkb> sounds like no. We can probably continue in that case
19:03:49 <clarkb> #topic Zuul-launcher image builds
19:03:54 <clarkb> #link https://review.opendev.org/c/opendev/zuul-providers/+/949696 Rocky Images
19:03:57 <clarkb> this change just merged
19:04:12 <clarkb> There was an issue with jobs hitting POST_FAILUREs that seems to have self-resolved. Probably some issue with the internet or one of our dependencies
19:05:43 <clarkb> The next steps here that I am aware of are to switch to using zuul provided image types and use the zuul-jobs role to upload to swift
19:05:52 <clarkb> https://review.opendev.org/c/opendev/zuul-providers/+/951018
19:05:58 <corvus> i think the only thing i'd add to that is that there is a change out there to remove no_log if we feel comfortable with that
19:06:14 <clarkb> https://review.opendev.org/c/opendev/zuul-providers/+/949944
19:06:31 <clarkb> https://review.opendev.org/c/opendev/zuul-providers/+/948989 this is the no_log change
19:06:36 <corvus> yep
19:06:45 <clarkb> I think I'm ok with ^ if we approve it when people can check the results afterwards (so that we can rotate creds quickly if necessary)
19:06:50 <clarkb> I'll +2 it but not approve
19:06:53 <corvus> definitely want buy-in on that; needs a few more +2s at least
19:07:18 <clarkb> corvus: any preference in order between zuul-jobs role switch or zuul image type source?
19:07:41 <corvus> image type
19:07:44 <corvus> then role
19:08:02 <clarkb> can I recheck that one now that rocky images have landed?
19:08:06 <clarkb> that one == image type
19:08:14 <corvus> ++ thanks
19:08:46 <clarkb> https://review.opendev.org/c/opendev/zuul-providers/+/949944 has been rechecked
19:08:58 <clarkb> #topic Gerrit shutdown problems
19:09:54 <clarkb> last week we restarted gerrit to update from 3.10.5 to 3.10.6 and unfortunately our sigint didn't seem to shut down gerrit cleanly
19:10:07 <fungi> and now we think this is cache cleanup taking too long?
19:10:20 <clarkb> we ended up waiting for the 5 minute timeout before docker compose issued a sigkill. During the restart prior to this one we managed to test things and sigint did work then
19:10:39 <clarkb> so ya I started brainstorming what could be different and one difference is the size of caches and our use of h2 cache db compaction
19:11:26 <clarkb> I think it is possible that shutdown is slow because it is trying to compact things and not doing so quickly enough. The total compaction time should be about 4 minutes max if done serially though, which is less than our timeout. But if the shutdown needs at least a minute to do other things...
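(A quick way to sanity-check the compaction theory is to see how much on-disk H2 cache data Gerrit would have to compact at shutdown. A minimal sketch, assuming a conventional Gerrit site layout under /home/gerrit2/review_site; the actual site path and cache file names on the review server may differ.)

    # List Gerrit's persistent cache files by size; the compaction work at
    # shutdown roughly tracks how large these have grown.
    sudo ls -lhS /home/gerrit2/review_site/cache/
    # Total on-disk cache footprint for a rough upper bound.
    sudo du -sh /home/gerrit2/review_site/cache/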
19:11:33 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/950595 One theory is that h2 compaction time may be slowing down shutdown enough to time out
19:12:05 <clarkb> I've got this change to remove the compaction timeout increase (default is 200ms so should be very quick after we remove the config). This won't apply until the next restart as this config is in place for this restart (compaction happens on shutdown, not startup)
19:13:02 <clarkb> I'd like to propose that the next time we've got a clean block of time to restart gerrit we do this: land 950595 and wait for it to apply, manually issue a SIGHUP using kill to bypass the podman SIGHUP apparmor issue, and see if SIGHUP behaves differently since the compaction change won't take effect for that restart anyway
19:13:28 <fungi> sgtm
19:13:33 <clarkb> basically use SIGHUP to gather more data this next restart. Then after the next restart we will be running without the compaction timeout increase, which means the restart after next can attempt to use sigint again
19:13:55 <clarkb> and if compaction is the problem then we should see sigint become more reliable. If it isn't then I want to know if hup vs int is something we can measure more accurately
19:14:04 <corvus> ++
19:14:59 <clarkb> great. In that case let me know if you want to restart Gerrit and I can help. Or if you have to restart gerrit for urgent reasons try to remember to use kill -HUP $GERRITPID ; docker-compose down ; docker-compose up -d
19:15:21 <clarkb> that should be safe since we don't auto restart gerrit, so docker compose will notice that the container is not running, then down will delete the container and we can start fresh on the up -d
19:15:34 <clarkb> otherwise I'll keep in mind that I want to do that soonish and try to make time for it
19:15:45 <clarkb> #topic Gerrit 3.11 Upgrade Planning
19:15:46 <corvus> I don't restart gerrit often, but when I do, I use kill -HUP $GERRITPID ; docker-compose down ; docker-compose up -d
19:16:05 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade
19:16:39 <clarkb> I haven't made a ton of progress on this since we last spoke. Was hoping to get some of the pre steps out of the way, like the 3.10.6 upgrade and switching the image location to quay, but that got derailed by the shutdown issue
19:17:11 <clarkb> I suppose switching the image location to quay is not something that should impact gerrit shutdown behavior so that should be safe to do in conjunction with the earlier restart proposals
19:17:16 <clarkb> so hopefully I can sneak that in soon too
19:17:46 <clarkb> I think that is the last major pre-pre step. Then it's all testing things and double checking behavior changes that we'll have to accommodate
19:17:54 <clarkb> #link https://www.gerritcodereview.com/3.11.html
19:18:08 <clarkb> if you have time to look over the release notes and make notes in the etherpad about things that you think deserve testing or attention please do so
19:19:00 <clarkb> I was hoping to have things a bit further along for an early june upgrade but I'm not sure that is feasible now just with other stuff I know I need to get done in the next couple of weeks
19:19:07 <clarkb> but we'll see, maybe mid june is still doable
19:19:26 <clarkb> Any questions or concerns or comments about the Gerrit 3.11 upgrade?
19:20:06 <fungi> not from my end
19:20:37 <clarkb> #topic Upgrading old servers
19:20:56 <clarkb> no updates here from me. fungi I don't think we have any word on refstack yet do we?
19:21:37 <fungi> nope, sorry
19:21:45 <fungi> been distracted recently
19:22:30 <clarkb> ya I have similar distractions
19:22:47 <clarkb> #topic Working through our TODO list
19:22:52 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:23:21 <clarkb> just our weekly reminder to anyone listening that if they would like to help out more, starting with the list at the bottom of this etherpad is a good place to start. Happy to answer any questions anyone may have about the list too
19:23:28 <clarkb> #topic OFTC Matrix bridge no longer supporting new users
19:23:47 <clarkb> I mentioned this to the openstack TC last week and gouthamr did some testing and it seemed to work for him
19:24:05 <clarkb> so I'm not sure if this is a persistent issue, or maybe only affects subsets of users (specific server, client, I dunno)
19:24:10 <clarkb> #link https://github.com/matrix-org/matrix-appservice-irc/issues/1851
19:24:34 <clarkb> the issue did get closed as others seem to have noticed it working since a bot restart too
19:24:38 <clarkb> that occurred ~May 19
19:24:43 <corvus> there was definitely a bridge restart
19:24:57 <clarkb> long story short this may not be super urgent and irc users should be able to talk to matrix users once again
19:25:22 <corvus> we're in the same place we were: the bridge has an unknown future.
19:25:34 <clarkb> right. Things are functional today but in limbo for the future
19:25:49 <corvus> but we also now have some mixed evidence: it may be subject to bitrot; and fixing that bitrot may happen; but if it does, it may not be a high priority
19:25:56 <corvus> so... "unknown" :)
19:26:18 <clarkb> Personally I would still be happy to migrate opendev to matrix if we want to go that route.
19:26:20 <corvus> (it was broken for... 2 weeks? many days at least)
19:27:08 <clarkb> after a week of thinking about our options here (irc for irc, matrix for matrix; pay for a bridge; host a bridge; move to matrix) did anyone else have opinions on what they'd like to see us do?
19:27:55 <clarkb> and again to be clear this would basically be for #opendev and #opendev-meeting (though maybe #opendev-meeting becomes a thread in #opendev)
19:28:09 <fungi> oh please not a thread
19:28:11 <clarkb> not talking about moving openstack or anyone else. Just our opendev specific synchronous comms channels
19:28:13 <clarkb> fungi: ha
19:28:45 <fungi> you were just baiting me, i'm sure
19:29:53 <clarkb> to justify my position on this: I think having a single room, whether that be IRC or Matrix, is valuable. Matrix enables us to cater to those using matrix-to-IRC today without forcing them to figure out persistent connections for scrollback etc. And we don't have to give up on using open source tools
19:30:26 <clarkb> then from a user perspective I've largely been happy using matrix, particularly when encryption is not involved. The only real issues I've had have been in rooms with encryption, which we would not configure for opendev as it would be public and logged anyway
19:30:53 <corvus> ++ and we continue to blaze a trail for other openinfra projects to follow in addressing their own issues
19:31:03 <clarkb> and given the regular care and feeding bridges appear to need, I worry that either paying for one or hosting one would just be more effort and time we could spend elsewhere
19:31:12 <corvus> (to be clear re encryption, the issues are usually that it works too well, not the other way, so... could be worse :)
19:32:00 <corvus> i agree, i don't love the bridge idea at the opendev/openinfra level. i think it works best either network-wide or very small (individual/team)
19:32:50 <clarkb> so I guess thinking about next steps here: do we think I should make a formal proposal on service-discuss? or do we want to have rough consensus among us before proposing things more broadly on the list?
19:33:56 <fungi> i still struggle a bit to make matrix something i can pay attention to the way i can irc, but that's down to my obscure workflows and preferences not being met well by existing client options so i try not to let that cloud my opinion
19:34:49 <corvus> i think a consensus check would be good
19:35:17 <corvus> then take it wider if no one violently objects
19:35:19 <clarkb> in that case can the other infra-roots let me know what they are thinking as far as options here go? feel free to pm or email me or discuss publicly further
19:35:38 <clarkb> then based on that I can make a formal proposal if appropriate
19:35:57 <clarkb> I don't think we need to do the polling in this meeting. But please do follow up
19:36:00 <fungi> i'm willing to go along with and supportive of whatever others want to propose for this
19:36:29 <clarkb> ack
19:36:33 <fungi> but i don't have any strong opinions either way
19:36:51 <clarkb> I think we can move on for now and follow up when I have a bit more feedback
19:36:59 <clarkb> #topic Enabling hashtags globally
19:37:09 <fungi> this on the other hand ;)
19:37:20 <clarkb> corvus brought this up again and asks if we need a process to enable this globally for registered users
19:37:48 <clarkb> I think that if we set this in All-Projects then any existing configuration for specific projects limiting this would continue to limit things, so we wouldn't immediately break those users' use cases
19:37:49 <fungi> what's the config option (if anyone knows off the top of their heads)?
19:38:05 <clarkb> fungi: editHashtags
19:38:16 <corvus> seeing some folks have issues since it's now bifurcated so some projects allow it and others don't, and being able to set them on a group of changes across a bunch of projects would be useful :)
19:38:25 <clarkb> given that existing configs should continue to win I'm thinking we can probably proceed with enabling this in All-Projects to address the 80% case
19:38:31 <fungi> thanks, just making sure i know what to git grep so i can figure out who in openstack to reach out to if they're already overriding it in different ways
19:38:55 <clarkb> then after we've enabled it and things haven't burnt down for a week or two we can reach out and get those other projects to drop their specific configs
19:38:57 <fungi> the main hurdle is that some projects enabled it only for change owner and core review teams
19:39:34 <fungi> so i wanted to look to see who might have done that, exactly
19:39:36 <clarkb> ya I'm hoping that if we just go ahead and enable it then we've got examples of how we don't need to limit it anymore
19:39:55 <clarkb> looks like it's about a 50-50 split between registered users and core groups in project-config
19:40:26 <clarkb> I think the main hurdle here has been that All-Projects isn't managed by project-config, so one of us has to update it using admin creds, which is annoying (but doable)
19:40:37 <fungi> at quick count, 11 openstack acls restrict editHashtags to core reviewers
19:41:09 <corvus> i'm very happy to do the typing if/when it's decided
19:41:35 <clarkb> oh I guess since editHashtags isn't marked exclusive we wouldn't have their specific rules override the global rule
19:41:43 <fungi> but looks like 9 out of the 11 are managed by the technical committee directly, so maybe only a few groups of folks for me to reach out to about it
19:41:47 <clarkb> we would essentially make the specific rules obsolete/redundant with a global rule
19:42:23 <fungi> basically, it looks like the tc restricted hashtag use in ~all of their own repositories
19:42:39 <clarkb> for a process, how about we announce our intent to change this on service-announce (or discuss if we think this is too much noise for announce), give the tc a week to object, and if that doesn't happen corvus can do the typing
19:43:11 <clarkb> fungi: or do you think you want to reach out directly first since only openstack has the special rules and we can change it as soon as we get the all clear?
19:43:12 <corvus> sgtm
19:43:31 <clarkb> I'm happy to draft and send the announcement if we go that route
19:43:37 <clarkb> I should be able to do that this afternoon
19:43:50 <corvus> (that sounds even better tm)
19:44:04 <fungi> i can handle the direct outreach, sure. it's just the tc and kolla teams, looks like
19:44:21 <fungi> 2 kolla repos, 9 tc repos
19:44:32 <clarkb> fungi: cool, do you want to do that post announcement or do you think we can forego the announcement?
19:45:01 <fungi> i wouldn't forego the announcement, because it'll still be a behavior change for all projects
19:45:23 <clarkb> ack I'll send that out today with an announced All-Projects update of June 3
19:45:29 <clarkb> corvus: ^ does that timing work for your typing driver?
19:45:32 <fungi> and i can do outreach more easily after the announcement if i'm able to refer people back to it
19:46:08 <corvus> yep
19:46:13 <clarkb> excellent
19:46:22 <clarkb> #topic Adding CentOS 10 Stream Support to Glean, DIB, and Nodepool
19:46:30 <clarkb> (assuming that with that decided we can move on to the next topic)
19:47:14 <clarkb> CentOS 10 Stream drops NetworkManager support for the old /etc/sysconfig (or whatever the paths were) network configuration compatibility layer
19:47:27 <clarkb> this means you have to configure interfaces with NetworkManager directly, which requires updates to glean
19:47:29 <clarkb> #link https://review.opendev.org/c/opendev/glean/+/941672 Glean NetworkManager Keyfile support
19:48:06 <clarkb> I think this change is basically there at this point to enable that (there is one small open question but it shouldn't impact many people if anyone, and using this as a forcing function to get their feedback seems useful at this point. Inline comments have details)
19:48:15 <clarkb> Reviews on that are helpful
19:48:30 <clarkb> Then with glean sorted out we can figure out diskimage-builder support
19:48:38 <clarkb> #link https://review.opendev.org/c/openstack/diskimage-builder/+/934045 DIB support for CentOS 10 Stream
19:49:32 <clarkb> Getting DIB testing of CentOS 10 Stream working has been somewhat complicated for two reasons. The first is CentOS 10 Stream requires x86-64-v3 hardware capabilities which rax classic does not provide (the other clouds do apparently, but still that means only ~40% of our cloud resources can boot CentOS 10 Stream, which is not ideal)
19:50:01 <clarkb> This requirement has impacted dib's nodepool based testing and functest chroot based testing, as code built for centos 10 stream is executed in both cases and needs to handle those cpu instructions
19:50:15 <clarkb> the current plan for dib testing is to rely on nested virt labels which aren't in rax classic
19:51:01 <clarkb> the last major complication related to this in dib is updating nodepool devstack deployments to configure the devstack nested VM cpu type (by default devstack uses some old cpu type to simplify testing of openstack things like live migration)
19:51:34 <clarkb> the plan there is to switch over to running devstack and dib without nodepool so that we can have greater control over the devstack configuration and don't need to update nodepool and zuul-jobs related stuff for this big corner case
19:51:42 <clarkb> I think tonyb was looking into this
19:52:15 <clarkb> Then the second issue is centos 10 stream's upstream disk images label / and /boot using partition uuids in the partition table
19:52:29 <clarkb> this breaks dib's detection of filesystems during image builds in the centos element
19:52:46 <clarkb> I've asked that we not add workarounds to dib until we've tried to get centos 10 stream to fix their partition labels instead
19:53:23 <clarkb> but this means we may land initial centos 10 stream support in dib only with the centos-minimal element, which builds things from scratch and doesn't use the upstream image as a starting point
19:53:47 <clarkb> but overall I think we have a plan to land some sort of support for CentOS 10 Stream in dib that is also tested
19:54:18 <clarkb> Once that happens we'll have to consider whether or not we're comfortable adding CentOS 10 Stream images to nodepool/zuul-launcher that can only run in 40% of our cloud resources
19:54:34 <clarkb> consider this a warning to start mulling that over
19:55:08 <clarkb> I'm somewhat concerned that that will become a hack to not use the 60% of our resources, creating extra contention for the other 40%
19:55:34 <clarkb> but it's a bit early to worry about that. If you have time to review the glean change I think that is reviewable at this point
19:55:54 <clarkb> then the dib stuff is close but you may wish to wait for the testing job fixups before worrying about proper review
19:56:05 <clarkb> #topic Open Discussion
19:56:25 <clarkb> that was all I had to say about CentOS 10 Stream and that was the last thing on the agenda. Anything else in the last ~4 minutes of our hour?
19:58:02 <fungi> not from me
19:58:55 <clarkb> sounds like that may be everything then
19:59:21 <clarkb> thank you everyone. We'll be back here at the same time and location if I don't have any last minute stuff come up (there is a small possibility this happens next week...)
19:59:40 <clarkb> #endmeeting
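(Following up on the x86-64-v3 discussion above: a minimal sketch for checking whether a given host's CPU exposes the x86-64-v3 feature level that CentOS 10 Stream requires. This assumes a guest with glibc 2.33 or newer, since older dynamic loaders don't print the hwcaps report.)

    # The glibc dynamic loader reports which x86-64 microarchitecture
    # levels the running CPU supports (glibc >= 2.33).
    /lib64/ld-linux-x86-64.so.2 --help | grep 'x86-64-v'
    # Alternatively, spot-check a few of the CPU flags that v3 requires.
    grep -oE 'avx2|bmi2|fma' /proc/cpuinfo | sort -u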