19:00:11 <clarkb> #startmeeting infra
19:00:11 <opendevmeet> Meeting started Tue Nov 25 19:00:11 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:11 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:11 <opendevmeet> The meeting name has been set to 'infra'
19:01:18 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/BDETGBBWKNBD6ILJ652HNF5FBOZLMEOJ/ Our Agenda
19:01:29 <clarkb> #topic Announcements
19:02:39 <clarkb> just a reminder that this weekend is a big holiday weekend for several of us. I won't be around Thursday or Friday and it looks like I'll be out Wednesday afternoon doing food prep
19:02:43 <clarkb> anything else to announce?
19:04:25 <clarkb> #topic Gerrit 3.11 Upgrade Planning
19:04:36 <clarkb> #link https://www.gerritcodereview.com/3.11.html Gerrit 3.11 release notes
19:05:07 <clarkb> I've gone through these release notes and taken my own notes on this etherpad
19:05:12 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade
19:05:27 <clarkb> I've tried to identify concerns or potential issues and then followed up by doing my best to test them
19:05:32 <clarkb> so far everything has checked out ok
19:05:52 <clarkb> After lunch today I plan to perform an actual upgrade on the held test node to test the process and take notes on the process itself
19:06:29 <clarkb> I don't anticipate any problems so I think we should probably go ahead and commit to a date. I had previously suggested December 7 at 2100 UTC so that tonyb can participate. Does that seem reasonable?
19:06:48 <clarkb> If so I can announce that (maybe I'll send that email tomorrow so that it happens after my upgrade tests and gives others a chance to chime in on timing)
19:07:21 <tonyb> works for me.
19:07:23 <clarkb> thoughts, concerns, or questions? Gerrit upgrades are big undertakings so more than happy to have input on it
19:07:56 <fungi> i'll be around
19:08:19 <clarkb> great
19:08:36 <clarkb> I'll let everyone know if I find anything concerning in the upgrade tests today
19:08:50 <clarkb> and then if I don't hear otherwise announce the plan for December 7 at 2100 UTC tomorrow
19:08:58 <clarkb> #topic Rename project app-kubernetes-module-manager to app-kernel-module-management
19:08:58 <fungi> december 7 is a sunday (at that time on this side of the globe) just to be clear
19:09:05 <clarkb> #undo
19:09:05 <opendevmeet> Removing item from minutes: #topic Rename project app-kubernetes-module-manager to app-kernel-module-management
19:09:21 <clarkb> right, the idea is to do it during a slow time while tonyb, fungi, and I are all awake and operating
19:09:30 <clarkb> which is somewhat complicated and sunday fits the bill
19:09:43 <fungi> yeah, so our sunday night, tonyb's monday morning
19:09:54 <fungi> wfm
19:09:58 <clarkb> #topic Rename project app-kubernetes-module-manager to app-kernel-module-management
19:10:17 <clarkb> Then a related question is how do we want to go about renaming this project as requested by the starlingx community
19:10:31 <clarkb> my inclination is to focus on the gerrit upgrade first since 3.10 is eol now and we've fallen behind on gerrit upgrades
19:10:49 <clarkb> the project rename process is tested in our CI jobs for gerrit; we rename y/testproject to x/test-project iirc
19:10:59 <clarkb> so it should be safe to do the rename sometime after the upgrade.
19:11:06 <fungi> i'll be around the week after the upgrade and am happy to drive a rename maintenance then
19:11:23 <fungi> in theory it'll be fast, really no longer than a restart
19:11:27 <clarkb> fungi: great thanks. So sometime between the 8th and 12th
19:12:04 <fungi> yeah, i have no unusual commitments/appointments that week
19:12:52 <clarkb> I think I have a doctors visit on the 9th
19:13:21 <clarkb> but otherwise I don't anticipate any conflicts other than the typical meetings
19:13:25 <clarkb> and nope, it's the 4th so even better
19:13:52 <clarkb> ok I'll let fungi drive planning and set a time
19:13:57 <clarkb> #topic Upgrading old servers
19:14:00 <fungi> we'll have another meeting between now and then, so can keep this on the agenda and pick an exact date/time as things get closer
19:14:03 <clarkb> ++
19:14:15 <clarkb> Any movement on server upgrades?
19:14:26 <tonyb> no progress on my stuff
19:14:42 <clarkb> I'm not aware of any other new server work recently
19:15:00 <clarkb> I did want to note that I think I discovered that apt messages around lock failures have changed between jammy and noble
19:15:21 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/968270 is the fix for that
19:15:24 <fungi> apt or apt-get?
19:15:33 <clarkb> fungi: unclear what the ansible module uses under the hood
19:15:45 <fungi> okay, so neither. just ansible
19:15:46 <clarkb> fungi: but you can look at ps1 of that change to see the diff between the two versions
19:15:57 <clarkb> well I think ansible is bubbling up the message from apt or apt-get
19:16:26 <clarkb> it's a small difference but one we need to accommodate at least for zuul. Reviews very much welcome
19:16:32 <fungi> to be clear, apt-get tries to maintain a stable interface. apt (the cli) has not since its creation and outputs a warning to stderr if called in a pipe or otherwise unattended (though that's changing in forky)
19:16:49 <clarkb> got it. I'm guessing ansible uses apt then
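The kind of version-tolerant matching being discussed here can be sketched roughly as below. The exact message strings are illustrative guesses at jammy-vs-noble wording drift, not quotes from apt or from change 968270.

```python
import re

# Hypothetical patterns for apt/dpkg lock failures; the strings are
# examples of wording that differs between releases, not verified quotes.
LOCK_PATTERNS = [
    re.compile(r"Could not get lock /var/lib/dpkg/lock"),
    re.compile(r"Unable to acquire the dpkg frontend lock"),
]

def is_lock_failure(stderr: str) -> bool:
    """Return True if the error output looks like an apt/dpkg lock failure."""
    return any(p.search(stderr) for p in LOCK_PATTERNS)
```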
19:17:23 <clarkb> #topic Matrix for OpenDev comms
19:17:25 <clarkb> Speaking of reviews
19:17:38 <clarkb> #link https://review.opendev.org/q/hashtag:%22opendev-matrix%22+status:open
19:17:49 <clarkb> I think this effort is largely stalled out on getting these changes landed
19:18:04 <clarkb> that isn't all we need to do but is the set of next steps (statusbot, eavesdrop)
19:18:10 <clarkb> oh and gerritbot
19:18:49 <clarkb> then with that done we can schedule a cutover date and flip eavesdropping from irc to matrix for #opendev and start migrating users over
19:18:56 <fungi> i feel like i may have previously committed to reviewing those and then failed to do so (can't remember now, it's been a busy few weeks)
19:19:07 <tonyb> I'll try to review them today
19:19:13 <fungi> i'll try to prioritize them
19:19:19 <clarkb> thank you
19:19:31 <clarkb> #topic Zuul Launcher Updates
19:19:44 <clarkb> there are two things we've noticed recently both affecting raxflex that I want to call out
19:20:05 <fungi> (at least i have session verification working in my matrix client now)
19:20:06 <clarkb> The first is that we discovered a second corrupted ubuntu noble image in raxflex sjc3. The first occurred on October 6 and the second last Friday
19:20:33 <clarkb> I did more poking around and testing the second time since two occurrences in two months feels less likely to be cosmic ray bit flips and could be something bigger
19:20:53 <clarkb> and tl;dr is that the md5sum we calculate for the image does not match the one in glance's checksum field in sjc3 but it did match in iad3
19:21:05 <fungi> we still have the region disabled in our launcher config too
19:21:07 <clarkb> so it definitely seems like the data got corrupted by the time glance got to it
19:21:19 <clarkb> I ended up emailing rackspace and cc'd infra root on it
19:21:27 <clarkb> #link https://review.opendev.org/c/zuul/zuul/+/968090 Will validate the glance checksum against our checksum and reject mismatches
19:21:39 <fungi> i guess we could enable it again now that the image has been superseded?
19:21:46 <corvus> i just did fungi's suggestion in that change
19:21:54 <clarkb> corvus wrote this change in response too. Basically have zuul check the hashes since glance doesn't appear to
19:22:05 <corvus> i spot checked ubuntu-focal images at random in all the clouds
19:22:31 <clarkb> corvus: ok cool that was going to be my question, can we generally expect that checksum value to match the one we calculate or does glance do some manipulation first then hash?
19:22:42 <clarkb> corvus: any concerns with that from your spot check?
19:23:05 <clarkb> fungi: and yes I think we can probably reenable the region if we confirm the image has been superseded there
19:23:06 <corvus> they all match except: all the rax-classic images have different md5sums.  the raxflex-sjc3 image has a different md5sum.
19:23:30 <corvus> i suspect the rax-classic is the behavior that fungi was concerned about: that on the backend, the cloud is mutating them
19:23:36 <clarkb> corvus: neat in the raxflex-sjc3 case I'm guessing that is actually another corrupted image
19:23:46 <corvus> the flex-sjc3 checksum mismatch seems unexpected, yeah, i'm guessing so
19:23:51 <clarkb> corvus: and ya I think for classic they must be manipulating it, breaking the assumption that we can validate things this way
19:23:54 <fungi> i suspect if we were e.g. uploading qcow2 to vexxhost and relying on a glance convert task to turn it into raw we'd also see a mismatch
19:24:03 <corvus> i was just about to spot check a few more sjc3 to see if they are all like that, or if sjc3 is just very prone to breakage.
19:24:12 <clarkb> corvus: ++ that seems like a good idea
19:24:27 <clarkb> corvus: do we think we could make validating the hashes provider specific then we can just disable it for rax classic?
19:24:44 <corvus> fungi: if that's an option, i'm guessing so... we are currently uploading raw to vexxhost though, so they do match
19:25:07 <corvus> clarkb: i think for it to be useful, we may need to do that... i haven't thought about how yet
19:25:14 <clarkb> corvus: ack
19:25:20 <corvus> i believe we are doing special vhd conversion for rax
19:25:33 <fungi> right. that was more of a theoretical argument for why other users might want it to be configurable per-provider too
19:25:38 <clarkb> corvus: yes we rely on that hacked up xen tool thing monty made
19:25:53 <corvus> are we doing everything expected of us there?  or are we expected to do something else before upload that we are not doing?
19:26:10 <corvus> or is it just the case that no matter what we upload to rax-classic, they will mutate the image
19:26:33 <clarkb> corvus: I am not sure. I think that the whole process is a bit underdocumented. We know that qemu-img can convert to vhd but those images don't work in xen (but do work in azure?)
19:26:59 <clarkb> so I think even outside of our little bubble this process is somewhat odd
19:27:10 <corvus> maybe the best approach is to just turn this off for classic and wait for flex to completely take over?
19:27:15 <clarkb> corvus: but that may be something we can ask rackspace directly about. Perhaps in a followup email with more sjc3 info too
19:27:34 <clarkb> corvus: ya I think that works as a way to pay down all the xen related tech debt
19:27:56 <clarkb> (to be clear I'm happy to send followup emails if there is new data. You don't need to send that)
19:28:24 <corvus> i just spot checked more sjc3 images
19:28:31 <fungi> but also i agree that there's probably no point in caring about this in rackspace classic, just the ability to disable the check in certain providers should suffice
19:28:59 <corvus> and it looks like they don't match
19:29:05 <tonyb> we could grab the mutated image and compare them. I realise that may be a lot, but it may also show something obvious
19:29:30 <corvus> yeah, sjc3 images never match.
19:29:43 <fungi> makes me wonder if flex sjc3 has some post-upload glance task applied that other flex regions don't
19:29:44 <corvus> 4/4 i checked so far
19:30:08 <corvus> and interesting that seems to correlate with where we have observed corruption
19:30:12 <clarkb> fungi: ya that is beginning to seem likely. So maybe the mismatched checksum is not the smoking gun I had hoped it was and we need to also consider validation jobs again
19:30:17 <clarkb> corvus: ++
19:30:36 <fungi> tonyb: sounds interesting, but i'm also fine just asking fanatical support why
19:31:02 <corvus> clarkb: i think questions about sjc3 are probably higher priority than rax-classic.  i'd suggest just focusing on that in the next email for now?
19:31:15 <fungi> clarkb: though also while not smoking gun, a likely place for things to be going wrong
19:31:42 <corvus> (i mean, i'm not opposed to understanding rax-classic, but it's not quite as interesting as sjc3)
19:31:45 <clarkb> corvus: wfm. And I think the new info is that the checksums never appear to match there, implying there is some backend process manipulating and changing the image data. Could that process be corrupting the images, since this is the only region with the problem so far?
19:31:57 <corvus> clarkb: ++
19:32:00 <clarkb> I can write that followup today
19:32:11 <fungi> always changing, occasionally corrupting
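The validation being added in the zuul change above boils down to comparing a locally computed md5 against glance's stored checksum, roughly like this. The openstacksdk call and the `checksum` attribute name are assumptions here, not quoted from 968090.

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through md5 so large raw images don't need to fit in RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hedged sketch of the comparison itself: `conn` would come from
# openstacksdk's openstack.connect(), and glance stores an md5 in the
# image's checksum field. Treat the attribute names as assumptions.
def image_matches(conn, image_name: str, local_path: str) -> bool:
    image = conn.image.find_image(image_name)
    return image.checksum == md5_of_file(local_path)
```

As discussed above, this check only helps in providers that don't mutate images server-side, so it would likely need to be configurable per provider.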
19:32:37 <clarkb> the other launcher topic I wanted to bring up was the held node floating ip deletion
19:32:54 <clarkb> we suspect that this is a side effect of our leaked floating ip cleanup and have added more debugging to that process in the launcher
19:33:03 <fungi> sjc3 is the oldest flex region, so possible it has some cruft they avoided in later deployments too
19:33:41 <clarkb> I think my held node may have lost its floating ip too so maybe we have data to look at for this
19:34:30 <clarkb> corvus: ^ fyi on that. I'll try to dig into logs to see if I can figure out when that happened. There is a window of time yesterday between testing gerrit things and updating the launcher where it may have occurred without logging in place
19:34:35 <fungi> since it's the only provider where we use floating ips, i'm not surprised if we had a blind spot there until now
19:35:09 <clarkb> anything else to note about the launchers?
19:35:09 <corvus> clarkb: wonder if it's restart related
19:35:27 <corvus> i can look at that (with you if you're around) after lunch
19:35:30 <clarkb> corvus: oh that is a good question. The last time it happened it would've crossed an automated restart boundary
19:35:32 <fungi> also something i considered, the fip disappearances seemed to span restart times
19:35:42 <clarkb> and this time I did a manual restart yesterday
19:35:59 <fungi> not too hard to confirm in that case
19:36:11 <clarkb> cool we can followup after lunch
19:36:13 <clarkb> #topic Gitea 1.25.2 Upgrade
19:36:18 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/968245 Upgrade gitea to 1.25.2
19:36:58 <clarkb> there is a new gitea bugfix release. The screenshots from that change lgtm. There is the sha1-hulud question, but I did some digging and the one package I could find in the pnpm lock file that overlapped with the bad list had a version that wasn't listed as bad
19:37:19 <clarkb> I think if we're comfortable with that we can proceed with the upgrade. If we want more sha1-hulud stuff to shake out first we can hold off
19:37:23 <fungi> yeah, thanks for burrowing beneath the sands to check that
19:37:43 <fungi> gitea must flow
19:38:00 <clarkb> in theory pnpm is more resistant to the problems posed by this attack too, so it may be a non-issue even if a bad package were included (I don't like taking those odds though)
19:38:17 <clarkb> anyway reviews welcome. I'm happy to help babysit if we proceed today or tomorrow
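The lockfile audit described above amounts to intersecting the lockfile's name@version pairs with a known-bad list, something like the sketch below. The pnpm-lock.yaml key format varies between pnpm versions, so the regex and the placeholder bad list are loose assumptions, not the tooling actually used.

```python
import re

# Placeholder bad list; a real audit would load the published
# sha1-hulud compromised-package list instead.
BAD_PACKAGES = {("some-compromised-pkg", "1.2.3")}

# Loose match for pnpm lockfile package keys like "/lodash@4.17.21:"
# or "@scope/name@1.0.0:" -- an assumption about the format, not a
# complete parser.
PKG_KEY = re.compile(r"^\s*/?(?P<name>@?[\w.-]+(?:/[\w.-]+)?)@(?P<version>[\w.-]+):", re.M)

def audit_lockfile(text: str):
    """Return the (name, version) pairs that intersect the bad list."""
    found = {(m.group("name"), m.group("version")) for m in PKG_KEY.finditer(text)}
    return sorted(found & BAD_PACKAGES)
```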
19:38:29 <clarkb> #topic December Meeting Planning
19:39:07 <clarkb> I wanted to call out that we're approaching December which is full of holidays and time off for many. I expect December 23, 30 and possibly January 6 to be the most problematic dates for meetings
19:39:20 <clarkb> I believe that I can personally run meetings on every one of those days if we want to have them
19:39:51 <fungi> december 23 i'll probably skip. i should be around for the others if people want a meeting, and happy to run any where you're unavailable too
19:40:30 <tonyb> of those dates Dec 30 is the only one I'll likely skip.
19:40:37 <clarkb> good to know. I don't think we need to have answers right now. But if you're unlikely to make a specific meeting let me know. Then if it seems like only one or two people will be there then I can cancel
19:40:44 <fungi> i wouldn't mind skipping 30 too
19:40:55 <corvus> i would like to be not around on dec 30.  happy to skip others but will probably be around for those.
19:41:04 <fungi> well, really i wouldn't mind skipping any of them yeah ;)
19:41:18 <clarkb> ok lets say we're skipping the 30th for now and we can cancel others as plans come together
19:41:25 <fungi> wfm
19:41:31 <tonyb> ++
19:41:50 <clarkb> #topic Open Discussion
19:41:55 <clarkb> Anything else?
19:42:14 <clarkb> dmsimard[m] indicated an interest in talking about centralized Ara deployment for rendering CI job ansible run data
19:42:33 <dmsimard[m]> hi o/
19:42:37 <clarkb> #link https://etherpad.opendev.org/p/ara-for-databases
19:42:39 <dmsimard[m]> someone mentioned last week that this meeting would be a good place to talk about it
19:42:53 <fungi> as good as any
19:43:59 <clarkb> the problem is basically that doing an ara file export of a non trivial ansible run produces many small files which don't play nice with ci job log uploads
19:44:02 <dmsimard[m]> At a high level I think we were starting from the principle that we would like the databases to stay in s3 so we don't need a different way of uploading/expiring files
19:44:05 <clarkb> it takes many many minutes to upload all the files
19:44:31 <dmsimard[m]> right, generating and uploading html doesn't scale very well
19:44:35 <clarkb> and one solution to that problem is that ara can export an sqlite database with all of the data in it. But then you need a running ara to render the data in the database
19:45:02 <fungi> a running ara that you can point to an arbitrary database location
19:45:12 <clarkb> I noted on the etherpad and in irc that my main concern with this approach is that various tools (opendev, kolla, openstack ansible) can and do run different versions of ansible and different versions of ara
19:45:28 <clarkb> looking at the notes from dmsimard[m] the different versions of ansible problem is probably less of a concern today
19:45:49 <clarkb> but different versions of ara could be a problem as ara wants the sqlite db to be generated by the same version of ara that is rendering it
19:46:44 <dmsimard[m]> ideally, yes, otherwise there would need to be a mechanism where ara knows it should automatically run sql migrations on arbitrary databases
19:46:53 <clarkb> this is probably not so much a problem at bootstrapping time, but updating ara on the central server would require us to update all the ci jobs. Maybe we're ok with that and basically just expect users to update ara if they want to use the tool
19:47:20 <dmsimard[m]> the database schema and migrations haven't historically moved a lot, but I could see it being an issue that could arise
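One way a central ara could sanity-check an uploaded database before rendering it is to inspect which migrations it carries. ara's API server is Django-based, so assuming the standard `django_migrations` table exists is a reasonable guess, but the table and column names here are assumptions, not verified against ara.

```python
import sqlite3

def applied_migrations(db_path: str):
    """List (app, name) migration rows, or None if this doesn't look like an ara db."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute("SELECT app, name FROM django_migrations").fetchall()
    except sqlite3.OperationalError:
        return None  # table missing: not a recognizable database
    finally:
        conn.close()
    return sorted(rows)
```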
19:48:03 <dmsimard[m]> we discussed an implementation that would know how to get the ara database stored as a zuul artifact
19:48:34 <clarkb> right, some sort of click a button and get automatically sent to a proxy that knows how to get ara to look things up from zuul?
19:49:27 <dmsimard[m]> a more ... exotic option was to consider whether ara could load databases straight from swift with something like s3ql (mounting swift as a local filesystem, basically), but I'm not sure about that since there's more than one swift, but maybe it's not impossible
19:50:17 <dmsimard[m]> clarkb: yeah, something that would let ara download the database from the provided url and then load it
19:50:20 <clarkb> before we do any of that we may want to survey the current state of the expected users to see what versions of ansible and ara we're dealing with and whether or not the current mix is workable
19:50:31 <fungi> if it could read the sqlite database from a url you wouldn't need the filesystem abstraction
19:50:44 <clarkb> then if we think there aren't major conflicts in tool choice the next step is in trying to figure out the automated loading of the correct sqlite db
19:51:52 <dmsimard[m]> fungi: sqlite is fundamentally a file so loading one from a filesystem is generally the expectation, I don't know how to "stream" sqlite database queries over http if it happens to be possible
19:52:13 <clarkb> you'd probably have to fetch the entire thing and cache it locally
19:52:22 <fungi> yeah, depends on how much state the process is expected to keep
19:52:24 <clarkb> so load from url is fetch file at url to local disk then read from disk
19:53:05 <dmsimard[m]> yeah I could see how downloading to a local cache would work
19:53:52 <dmsimard[m]> we can probably figure out a not-so-insecure plumbing to make it work, then have a cron that routinely wipes the cache or something like that
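The "fetch to local disk, then open" flow discussed above could look roughly like this. The cache location and expiry-by-cron are assumptions about a possible design, not a settled one.

```python
import hashlib
import os
import sqlite3
import urllib.request

def open_cached_db(url: str, cache_dir: str = "/var/cache/ara-dbs") -> sqlite3.Connection:
    """Download an sqlite db once into a url-keyed cache, then open it read-only."""
    os.makedirs(cache_dir, exist_ok=True)
    local = os.path.join(cache_dir, hashlib.sha256(url.encode()).hexdigest() + ".sqlite")
    if not os.path.exists(local):
        urllib.request.urlretrieve(url, local)  # fetch once; later opens hit the cache
    # sqlite can't be queried over plain http, hence the local copy
    return sqlite3.connect(f"file:{local}?mode=ro", uri=True)
```

A periodic cleanup job would then expire old cache entries, as suggested above.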
19:54:22 <clarkb> but I think my main concern continues to be the logistics of making this work with all the different consumers
19:55:07 <clarkb> particularly if opendev is going to be responsible for this service, it's easy to get in an awkward spot where kolla and opendev are in conflict with one another, and I don't think we should run an ara for each user (at that point running a local script to deploy ara in docker makes more sense to me)
19:56:08 <dmsimard[m]> yeah I understand the maintenance aspect of it
19:57:35 <clarkb> but that is why my suggestion is to start with an audit of what that situation looks like today to inform whether or not it is likely to be a problem later
19:58:04 <dmsimard[m]> yeah, I was going to say your suggestion from earlier would be a good start
19:58:21 <fungi> seems like consensus
19:59:00 <dmsimard[m]> A question, maybe
19:59:34 <clarkb> go for it (we have less than a minute left :) )
19:59:35 <dmsimard[m]> I could find the gerrit patch again but I remember an installation (for bridge?) in system-config that installed ara from source instead of pip
20:00:13 <clarkb> dmsimard[m]: that would be our system-config-run-base-ansible-devel job
20:00:20 <dmsimard[m]> I was wondering to what extent installing from source was necessary, as that would be where sql migrations would land first
20:00:25 <clarkb> but I think it may pip install ara now and only install ansible from source?
20:00:55 <fungi> but could install ara or anything else from source if we wanted
20:00:56 <dmsimard[m]> I could be misremembering, I'll check it out and report back if needed
20:00:56 <clarkb> dmsimard[m]: the idea behind that job was to be forward looking to detect problems. Potentially exactly like that. In that case I think this is a feature and not a bug if we're still doing it
20:01:17 <clarkb> people debugging that job would just have to know that if ara breaks that is signal about using unreleased ara
20:01:29 <clarkb> and we are at time.
20:01:44 <clarkb> Feel free to continue discussion about this topic or any others in #opendev or on the mailing list
20:01:53 <clarkb> but I need to end here because I need food
20:01:58 <clarkb> thank you everyone!
20:02:00 <clarkb> #endmeeting