19:00:26 <clarkb> #startmeeting infra
19:00:26 <opendevmeet> Meeting started Tue May 27 19:00:26 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:26 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:26 <opendevmeet> The meeting name has been set to 'infra'
19:00:39 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/WTHTKBQ5IUYSAX6ITU7F46PBDATVMYCU/ Our Agenda
19:01:06 <clarkb> #topic Announcements
19:01:13 <clarkb> I don't have anything to announce. Did anyone else have something?
19:03:42 <clarkb> sounds like no. We can probably continue in that case
19:03:49 <clarkb> #topic Zuul-launcher image builds
19:03:54 <clarkb> #link https://review.opendev.org/c/opendev/zuul-providers/+/949696 Rocky Images
19:03:57 <clarkb> this change just merged
19:04:12 <clarkb> There was an issue with jobs hitting POST_FAILUREs that seems to have self-resolved. Probably some issue with the internet or one of our dependencies
19:05:43 <clarkb> The next steps here that I am aware of are to switch to using zuul provided image types and use the zuul-jobs role to upload to swift
19:05:52 <clarkb> https://review.opendev.org/c/opendev/zuul-providers/+/951018
19:05:58 <corvus> i think the only thing i'd add to that is that there is a change out there to remove no_log if we feel comfortable with that
19:06:14 <clarkb> https://review.opendev.org/c/opendev/zuul-providers/+/949944
19:06:31 <clarkb> https://review.opendev.org/c/opendev/zuul-providers/+/948989 this is the no_log change
19:06:36 <corvus> yep
19:06:45 <clarkb> I think I'm ok with ^ if we approve it when people can check the results afterwards (so that we can rotate creds quickly if necessary)
19:06:50 <clarkb> I'll +2 it but not approve
19:06:53 <corvus> definitely want buy-in on that; needs a few more +2s at least
19:07:18 <clarkb> corvus: any preference in order between zuul-jobs role switch or zuul image type source?
19:07:41 <corvus> image type
19:07:44 <corvus> then role
19:08:02 <clarkb> can I recheck that one now that rocky images have landed?
19:08:06 <clarkb> that one == image type
19:08:14 <corvus> ++ thanks
19:08:46 <clarkb> https://review.opendev.org/c/opendev/zuul-providers/+/949944 has been rechecked
19:08:58 <clarkb> #topic Gerrit shutdown problems
19:09:54 <clarkb> last week we restarted gerrit to update from 3.10.5 to 3.10.6 and unfortunately our sigint didn't seem to shut down gerrit cleanly
19:10:07 <fungi> and now we think this is cache cleanup taking too long?
19:10:20 <clarkb> we ended up waiting for the 5 minute timeout before docker compose issued a sigkill. During the restart prior to this one we managed to test things and sigint did work then
19:10:39 <clarkb> so ya I started brainstorming what could be different and one difference is the size of caches and our use of h2 cache db compaction
19:11:26 <clarkb> I think it is possible that shutdown is slow because it is trying to compact things and not doing so quickly enough. The total compaction time should be about 4 minutes max if done serially though, which is less than our timeout. But if the shutdown needs at least a minute to do other things...
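(A quick way to sanity-check the compaction theory is to see how much on-disk H2 cache data Gerrit would have to compact at shutdown. A minimal sketch, assuming a conventional Gerrit site layout under /home/gerrit2/review_site; the actual site path and cache file names on the review server may differ.)

    # List Gerrit's persistent cache files by size; the compaction work at
    # shutdown roughly tracks how large these have grown.
    sudo ls -lhS /home/gerrit2/review_site/cache/
    # Total on-disk cache footprint for a rough upper bound.
    sudo du -sh /home/gerrit2/review_site/cache/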
19:11:33 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/950595 One theory is that h2 compaction time may be slowing down shutdown enough to time out
19:12:05 <clarkb> I've got this change to remove the compaction timeout increase (default is 200ms so should be very quick after we remove the config). This won't apply until the next restart as this config is in place for this restart (compaction happens on shutdown, not startup)
19:13:02 <clarkb> I'd like to propose that the next time we've got a clean block of time to restart gerrit we do this: land 950595 and wait for it to apply, manually issue a SIGHUP using kill to bypass the podman SIGHUP apparmor issue, and see if SIGHUP behaves differently since the compaction change won't take effect for that restart anyway
19:13:28 <fungi> sgtm
19:13:33 <clarkb> basically use SIGHUP to gather more data this next restart. Then after the next restart we will be running without the compaction timeout increase, which means the restart after next can attempt to use sigint again
19:13:55 <clarkb> and if compaction is the problem then we should see sigint become more reliable. If it isn't then I want to know if hup vs int is something we can measure more accurately
19:14:04 <corvus> ++
19:14:59 <clarkb> great. In that case let me know if you want to restart Gerrit and I can help. Or if you have to restart gerrit for urgent reasons try to remember to use kill -HUP $GERRITPID ; docker-compose down ; docker-compose up -d
19:15:21 <clarkb> that should be safe since we don't auto restart gerrit, so docker compose will notice that the container is not running, then down will delete the container and we can start fresh on the up -d
19:15:34 <clarkb> otherwise I'll keep in mind that I want to do that soonish and try to make time for it
19:15:45 <clarkb> #topic Gerrit 3.11 Upgrade Planning
19:15:46 <corvus> I don't restart gerrit often, but when I do, I use kill -HUP $GERRITPID ; docker-compose down ; docker-compose up -d
19:16:05 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade
19:16:39 <clarkb> I haven't made a ton of progress on this since we last spoke. Was hoping to get some of the pre steps out of the way, like the 3.10.6 upgrade and switching the image location to quay, but that got derailed by the shutdown issue
19:17:11 <clarkb> I suppose switching the image location to quay is not something that should impact gerrit shutdown behavior so that should be safe to do in conjunction with the earlier restart proposals
19:17:16 <clarkb> so hopefully I can sneak that in soon too
19:17:46 <clarkb> I think that is the last major pre-pre step. Then it's all testing things and double checking behavior changes that we'll have to accommodate
19:17:54 <clarkb> #link https://www.gerritcodereview.com/3.11.html
19:18:08 <clarkb> if you have time to look over the release notes and make notes in the etherpad about things that you think deserve testing or attention please do so
19:19:00 <clarkb> I was hoping to have things a bit further along for an early june upgrade but I'm not sure that is feasible now just with other stuff I know I need to get done in the next couple of weeks
19:19:07 <clarkb> but we'll see, maybe mid june is still doable
19:19:26 <clarkb> Any questions or concerns or comments about the Gerrit 3.11 upgrade?
19:20:06 <fungi> not from my end
19:20:37 <clarkb> #topic Upgrading old servers
19:20:56 <clarkb> no updates here from me. fungi I don't think we have any word on refstack yet do we?
19:21:37 <fungi> nope, sorry
19:21:45 <fungi> been distracted recently
19:22:30 <clarkb> ya I have similar distractions
19:22:47 <clarkb> #topic Working through our TODO list
19:22:52 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:23:21 <clarkb> just our weekly reminder to anyone listening that if they would like to help out more, starting with the list at the bottom of this etherpad is a good place to start. Happy to answer any questions anyone may have about the list too
19:23:28 <clarkb> #topic OFTC Matrix bridge no longer supporting new users
19:23:47 <clarkb> I mentioned this to the openstack TC last week and gouthamr did some testing and it seemed to work for him
19:24:05 <clarkb> so I'm not sure if this is a persistent issue, or maybe only affects subsets of users (specific server, client, I dunno)
19:24:10 <clarkb> #link https://github.com/matrix-org/matrix-appservice-irc/issues/1851
19:24:34 <clarkb> the issue did get closed as others seem to have noticed it working since a bot restart too
19:24:38 <clarkb> that occurred ~May 19
19:24:43 <corvus> there was definitely a bridge restart
19:24:57 <clarkb> long story short this may not be super urgent and irc users should be able to talk to matrix users once again
19:25:22 <corvus> we're in the same place we were: the bridge has an unknown future.
19:25:34 <clarkb> right. Things are functional today but in limbo for the future
19:25:49 <corvus> but we also now have some mixed evidence: it may be subject to bitrot; and fixing that bitrot may happen; but if it does, it may not be a high priority
19:25:56 <corvus> so... "unknown" :)
19:26:18 <clarkb> Personally I would still be happy to migrate opendev to matrix if we want to go that route.
19:26:20 <corvus> (it was broken for... 2 weeks? many days at least)
19:27:08 <clarkb> after a week of thinking about our options here (irc for irc, matrix for matrix; pay for a bridge; host a bridge; move to matrix) did anyone else have opinions on what they'd like to see us do?
19:27:55 <clarkb> and again to be clear this would basically be for #opendev and #opendev-meeting (though maybe #opendev-meeting becomes a thread in #opendev)
19:28:09 <fungi> oh please not a thread
19:28:11 <clarkb> not talking about moving openstack or anyone else. Just our opendev specific synchronous comms channels
19:28:13 <clarkb> fungi: ha
19:28:45 <fungi> you were just baiting me, i'm sure
19:29:53 <clarkb> to justify my position on this: I think having a single room, whether that be IRC or Matrix, is valuable. Matrix enables us to cater to those using matrix-to-IRC today without forcing them to figure out persistent connections for scrollback etc. And we don't have to give up on using open source tools
19:30:26 <clarkb> then from a user perspective I've largely been happy using matrix, particularly when encryption is not involved. The only real issues I've had have been in rooms with encryption, which we would not configure for opendev as it would be public and logged anyway
19:30:53 <corvus> ++ and we continue to blaze a trail for other openinfra projects to follow in addressing their own issues
19:31:03 <clarkb> and given the regular care and feeding bridges appear to need, I worry that either paying for one or hosting one would just be more effort and time we could spend elsewhere
19:31:12 <corvus> (to be clear re encryption, the issues are usually that it works too well, not the other way, so... could be worse :)
19:32:00 <corvus> i agree, i don't love the bridge idea at the opendev/openinfra level. i think it works best either network-wide or very small (individual/team)
19:32:50 <clarkb> so I guess thinking about next steps here: do we think I should make a formal proposal on service-discuss? or do we want to have rough consensus among us before proposing things more broadly on the list?
19:33:56 <fungi> i still struggle a bit to make matrix something i can pay attention to the way i can irc, but that's down to my obscure workflows and preferences not being met well by existing client options so i try not to let that cloud my opinion
19:34:49 <corvus> i think a consensus check would be good
19:35:17 <corvus> then take it wider if no one violently objects
19:35:19 <clarkb> in that case can the other infra-roots let me know what they are thinking as far as options here go? feel free to pm or email me or discuss publicly further
19:35:38 <clarkb> then based on that I can make a formal proposal if appropriate
19:35:57 <clarkb> I don't think we need to do the polling in this meeting. But please do follow up
19:36:00 <fungi> i'm willing to go along with and supportive of whatever others want to propose for this
19:36:29 <clarkb> ack
19:36:33 <fungi> but i don't have any strong opinions either way
19:36:51 <clarkb> I think we can move on for now and follow up when I have a bit more feedback
19:36:59 <clarkb> #topic Enabling hashtags globally
19:37:09 <fungi> this on the other hand ;)
19:37:20 <clarkb> corvus brought this up again and asks if we need a process to enable this globally for registered users
19:37:48 <clarkb> I think that if we set this in All-Projects then any existing configuration for specific projects limiting this would continue to limit things, so we wouldn't immediately break those users' use cases
19:37:49 <fungi> what's the config option (if anyone knows off the top of their heads)?
19:38:05 <clarkb> fungi: editHashtags
19:38:16 <corvus> seeing some folks have issues since it's now bifurcated so some projects allow it and others don't, and being able to set them on a group of changes across a bunch of projects would be useful :)
19:38:25 <clarkb> given that existing configs should continue to win I'm thinking we can probably proceed with enabling this in All-Projects to address the 80% case
19:38:31 <fungi> thanks, just making sure i know what to git grep so i can figure out who in openstack to reach out to if they're already overriding it in different ways
19:38:55 <clarkb> then after we've enabled it and things haven't burnt down for a week or two we can reach out and get those other projects to drop their specific configs
19:38:57 <fungi> the main hurdle is that some projects enabled it only for change owner and core review teams
19:39:34 <fungi> so i wanted to look to see who might have done that, exactly
19:39:36 <clarkb> ya I'm hoping that if we just go ahead and enable it then we've got examples of how we don't need to limit it anymore
19:39:55 <clarkb> looks like it's about a 50-50 split between registered users and core groups in project-config
19:40:26 <clarkb> I think the main hurdle here has been that All-Projects isn't managed by project-config, so one of us has to update it using admin creds, which is annoying (but doable)
19:40:37 <fungi> at quick count, 11 openstack acls restrict editHashtags to core reviewers
19:41:09 <corvus> i'm very happy to do the typing if/when it's decided
19:41:35 <clarkb> oh I guess since editHashtags isn't marked exclusive we wouldn't have their specific rules override the global rule
19:41:43 <fungi> but looks like 9 out of the 11 are managed by the technical committee directly, so maybe only a few groups of folks for me to reach out to about it
19:41:47 <clarkb> we would essentially make the specific rules obsolete/redundant with a global rule
19:42:23 <fungi> basically, it looks like the tc restricted hashtag use in ~all of their own repositories
19:42:39 <clarkb> for a process, how about we announce our intent to change this on service-announce (or discuss if we think this is too much noise for announce), give the tc a week to object, and if that doesn't happen corvus can do the typing
19:43:11 <clarkb> fungi: or do you think you want to reach out directly first since only openstack has the special rules and we can change it as soon as we get the all clear?
19:43:12 <corvus> sgtm
19:43:31 <clarkb> I'm happy to draft and send the announcement if we go that route
19:43:37 <clarkb> I should be able to do that this afternoon
19:43:50 <corvus> (that sounds even better tm)
19:44:04 <fungi> i can handle the direct outreach, sure. it's just the tc and kolla teams, looks like
19:44:21 <fungi> 2 kolla repos, 9 tc repos
19:44:32 <clarkb> fungi: cool, do you want to do that post announcement or do you think we can forego the announcement?
19:45:01 <fungi> i wouldn't forego the announcement, because it'll still be a behavior change for all projects
19:45:23 <clarkb> ack I'll send that out today with an announced All-Projects update of June 3
19:45:29 <clarkb> corvus: ^ does that timing work for your typing driver?
19:45:32 <fungi> and i can do outreach more easily after the announcement if i'm able to refer people back to it
19:46:08 <corvus> yep
19:46:13 <clarkb> excellent
19:46:22 <clarkb> #topic Adding CentOS 10 Stream Support to Glean, DIB, and Nodepool
19:46:30 <clarkb> (assuming that with that decided we can move on to the next topic)
19:47:14 <clarkb> CentOS 10 Stream drops NetworkManager support for the old /etc/sysconfig (or whatever the paths were) network configuration compatibility layer
19:47:27 <clarkb> this means you have to configure interfaces with NetworkManager directly, which requires updates to glean
19:47:29 <clarkb> #link https://review.opendev.org/c/opendev/glean/+/941672 Glean NetworkManager Keyfile support
19:48:06 <clarkb> I think this change is basically there at this point to enable that (there is one small open question but it shouldn't impact many people if anyone, and using this as a forcing function to get their feedback seems useful at this point. Inline comments have details)
19:48:15 <clarkb> Reviews on that are helpful
19:48:30 <clarkb> Then with glean sorted out we can figure out diskimage-builder support
19:48:38 <clarkb> #link https://review.opendev.org/c/openstack/diskimage-builder/+/934045 DIB support for CentOS 10 Stream
19:49:32 <clarkb> Getting DIB testing of CentOS 10 Stream working has been somewhat complicated for two reasons. The first is CentOS 10 Stream requires x86-64-v3 hardware capabilities which rax classic does not provide (the other clouds do apparently, but still that means only ~40% of our cloud resources can boot CentOS 10 Stream, which is not ideal)
19:50:01 <clarkb> This requirement has impacted dib's nodepool based testing and functest chroot based testing, as code built for centos 10 stream is executed in both cases and needs to handle those cpu instructions
19:50:15 <clarkb> the current plan for dib testing is to rely on nested virt labels which aren't in rax classic
19:51:01 <clarkb> the last major complication related to this in dib is updating nodepool devstack deployments to configure the devstack nested VM cpu type (by default devstack uses some old cpu type to simplify testing of openstack things like live migration)
19:51:34 <clarkb> the plan there is to switch over to running devstack and dib without nodepool so that we can have greater control over the devstack configuration and don't need to update nodepool and zuul-jobs related stuff for this big corner case
19:51:42 <clarkb> I think tonyb was looking into this
19:52:15 <clarkb> Then the second issue is centos 10 stream's upstream disk images label / and /boot using partition uuids in the partition table
19:52:29 <clarkb> this breaks dib's detection of filesystems during image builds in the centos element
19:52:46 <clarkb> I've asked that we not add workarounds to dib until we've tried to get centos 10 stream to fix their partition labels instead
19:53:23 <clarkb> but this means we may land initial centos 10 stream support in dib only with the centos-minimal element, which builds things from scratch and doesn't use the upstream image as a starting point
19:53:47 <clarkb> but overall I think we have a plan to land some sort of support for CentOS 10 Stream in dib that is also tested
19:54:18 <clarkb> Once that happens we'll have to consider whether or not we're comfortable adding CentOS 10 Stream images to nodepool/zuul-launcher that can only run in 40% of our cloud resources
19:54:34 <clarkb> consider this a warning to start mulling that over
19:55:08 <clarkb> I'm somewhat concerned that that will become a hack to not use the 60% of our resources, creating extra contention for the other 40%
19:55:34 <clarkb> but it's a bit early to worry about that. If you have time to review the glean change I think that is reviewable at this point
19:55:54 <clarkb> then the dib stuff is close but you may wish to wait for the testing job fixups before worrying about proper review
19:56:05 <clarkb> #topic Open Discussion
19:56:25 <clarkb> that was all I had to say about CentOS 10 Stream and that was the last thing on the agenda. Anything else in the last ~4 minutes of our hour?
19:58:02 <fungi> not from me
19:58:55 <clarkb> sounds like that may be everything then
19:59:21 <clarkb> thank you everyone. We'll be back here at the same time and location if I don't have any last minute stuff come up (there is a small possibility this happens next week...)
19:59:40 <clarkb> #endmeeting
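(Following up on the x86-64-v3 discussion above: a minimal sketch for checking whether a given host's CPU exposes the x86-64-v3 feature level that CentOS 10 Stream requires. This assumes a guest with glibc 2.33 or newer, since older dynamic loaders don't print the hwcaps report.)

    # The glibc dynamic loader reports which x86-64 microarchitecture
    # levels the running CPU supports (glibc >= 2.33).
    /lib64/ld-linux-x86-64.so.2 --help | grep 'x86-64-v'
    # Alternatively, spot-check a few of the CPU flags that v3 requires.
    grep -oE 'avx2|bmi2|fma' /proc/cpuinfo | sort -u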