19:00:11 <clarkb> #startmeeting infra
19:00:11 <opendevmeet> Meeting started Tue Nov 25 19:00:11 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:11 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:11 <opendevmeet> The meeting name has been set to 'infra'
19:01:18 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/BDETGBBWKNBD6ILJ652HNF5FBOZLMEOJ/ Our Agenda
19:01:29 <clarkb> #topic Announcements
19:02:39 <clarkb> just a reminder that this weekend is a big holiday weekend for several of us. I won't be around Thursday or Friday and it looks like I'll be out Wednesday afternoon doing food prep
19:02:43 <clarkb> anything else to announce?
19:04:25 <clarkb> #topic Gerrit 3.11 Upgrade Planning
19:04:36 <clarkb> #link https://www.gerritcodereview.com/3.11.html Gerrit 3.11 release notes
19:05:07 <clarkb> I've gone through these release notes and taken my own notes on this etherpad
19:05:12 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade
19:05:27 <clarkb> I've tried to identify concerns or potential issues and then followed up by doing my best to test them
19:05:32 <clarkb> so far everything has checked out ok
19:05:52 <clarkb> After lunch today I plan to perform an actual upgrade on the held test node to test the process and take notes on the process itself
19:06:29 <clarkb> I don't anticipate any problems so I think we should probably go ahead and commit to a date. I had previously suggested December 7 at 2100 UTC so that tonyb can participate. Does that seem reasonable?
19:06:48 <clarkb> If so I can announce that (maybe I'll send that email tomorrow so that it happens after my upgrade tests and gives others a chance to chime in on timing)
19:07:21 <tonyb> works for me.
19:07:23 <clarkb> thoughts, concerns, or questions?
Gerrit upgrades are big undertakings so I'm more than happy to have input on it
19:07:56 <fungi> i'll be around
19:08:19 <clarkb> great
19:08:36 <clarkb> I'll let everyone know if I find anything concerning in the upgrade tests today
19:08:50 <clarkb> and then if I don't hear otherwise announce the plan for December 7 at 2100 UTC tomorrow
19:08:58 <clarkb> #topic Rename project app-kubernetes-module-manager to app-kernel-module-management
19:08:58 <fungi> december 7 is a sunday (at that time on this side of the globe) just to be clear
19:09:05 <clarkb> #undo
19:09:05 <opendevmeet> Removing item from minutes: #topic Rename project app-kubernetes-module-manager to app-kernel-module-management
19:09:21 <clarkb> right the idea is to do it during a slow time while tonyb and fungi and I are awake and operating
19:09:30 <clarkb> which is somewhat complicated and sunday fits the bill
19:09:43 <fungi> yeah, so our sunday night, tonyb's monday morning
19:09:54 <fungi> wfm
19:09:58 <clarkb> #topic Rename project app-kubernetes-module-manager to app-kernel-module-management
19:10:17 <clarkb> Then a related question is how do we want to go about renaming this project as requested by the starlingx community
19:10:31 <clarkb> my inclination is to focus on the gerrit upgrade first since 3.10 is eol now and we've fallen behind on gerrit upgrades
19:10:49 <clarkb> the project rename process is tested in our CI jobs for gerrit; we rename y/testproject to x/test-project iirc
19:10:59 <clarkb> so it should be safe to do the rename after the upgrade some time.
19:11:06 <fungi> i'll be around the week after the upgrade and am happy to drive a rename maintenance then
19:11:23 <fungi> in theory it'll be fast, really no longer than a restart
19:11:27 <clarkb> fungi: great thanks.
So sometime between the 8th and 12th
19:12:04 <fungi> yeah, i have no unusual commitments/appointments that week
19:12:52 <clarkb> I think I have a doctor's visit on the 9th
19:13:21 <clarkb> but otherwise I don't anticipate any conflicts other than the typical meetings
19:13:25 <clarkb> and nope it's the 4th so even better
19:13:52 <clarkb> ok I'll let fungi drive planning and set a time
19:13:57 <clarkb> #topic Upgrading old servers
19:14:00 <fungi> we'll have another meeting between now and then, so can keep this on the agenda and pick an exact date/time as things get closer
19:14:03 <clarkb> ++
19:14:15 <clarkb> Any movement on server upgrades?
19:14:26 <tonyb> no progress on my stuff
19:14:42 <clarkb> I'm not aware of any other new server work recently
19:15:00 <clarkb> I did want to note that I think I discovered that apt messages around lock failures have changed between jammy and noble
19:15:21 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/968270 is the fix for that
19:15:24 <fungi> apt or apt-get?
19:15:33 <clarkb> fungi: unclear what the ansible module uses under the hood
19:15:45 <fungi> okay, so neither. just ansible
19:15:46 <clarkb> fungi: but you can look at ps1 of that change to see the diff between the two versions
19:15:57 <clarkb> well I think ansible is bubbling up the image from apt or apt-get
19:16:02 <clarkb> s/image/message/
19:16:26 <clarkb> it's a small difference but one we need to accommodate at least for zuul. Reviews very much welcome
19:16:32 <fungi> to be clear, apt-get tries to maintain a stable interface. apt (the cli) has not since its creation and outputs a warning to stderr if called in a pipe or otherwise unattended (though that's changing in forky)
19:16:49 <clarkb> got it.
I'm guessing ansible uses apt then
19:17:23 <clarkb> #topic Matrix for OpenDev comms
19:17:25 <clarkb> Speaking of reviews
19:17:38 <clarkb> #link https://review.opendev.org/q/hashtag:%22opendev-matrix%22+status:open
19:17:49 <clarkb> I think this effort is largely stalled out on getting these changes landed
19:18:04 <clarkb> that isn't all we need to do but is the set of next steps (statusbot, eavesdrop)
19:18:10 <clarkb> oh and gerritbot
19:18:49 <clarkb> then with that done we can schedule a cutover date and flip eavesdropping from irc to matrix for #opendev and start migrating users over
19:18:56 <fungi> i feel like i may have previously committed to reviewing those and then failed to do so (can't remember now, it's been a busy few weeks)
19:19:07 <tonyb> I'll try to review them today
19:19:13 <fungi> i'll try to prioritize them
19:19:19 <clarkb> thank you
19:19:31 <clarkb> #topic Zuul Launcher Updates
19:19:44 <clarkb> there are two things we've noticed recently both affecting raxflex that I want to call out
19:20:05 <fungi> (at least i have session verification working in my matrix client now)
19:20:06 <clarkb> The first is that we discovered a second corrupted ubuntu noble image in raxflex sjc3.
The first occurred on October 6 and the second last Friday
19:20:33 <clarkb> I did more poking around and testing the second time since two occurrences in two months feels less likely to be cosmic ray bit flips and could be something bigger
19:20:53 <clarkb> and tl;dr is that the md5sum we calculate for the image does not match the one in glance's checksum field in sjc3 but it did match in iad3
19:21:05 <fungi> we still have the region disabled in our launcher config too
19:21:07 <clarkb> so it definitely seems like the data got corrupted by the time glance got to it
19:21:19 <clarkb> I ended up emailing rackspace and cc'd infra root on it
19:21:27 <clarkb> #link https://review.opendev.org/c/zuul/zuul/+/968090 Will validate the glance checksum against our checksum and reject mismatches
19:21:39 <fungi> i guess we could enable it again now that the image has been superseded?
19:21:46 <corvus> i just did fungi's suggestion in that change
19:21:54 <clarkb> corvus wrote this change in response too. Basically have zuul check the hashes since glance doesn't appear to
19:22:05 <corvus> i spot checked ubuntu-focal images at random in all the clouds
19:22:31 <clarkb> corvus: ok cool that was going to be my question, can we generally expect that checksum value to match the one we calculate or does glance do some manipulation first then hash?
19:22:42 <clarkb> corvus: any concerns with that from your spot check?
19:23:05 <clarkb> fungi: and yes I think we can probably reenable the region if we confirm the image has been superseded there
19:23:06 <corvus> they all match except: all the rax-classic images have different md5sums. the raxflex-sjc3 image has a different md5sum.
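[Editor's note: for readers following along, the validation added by change 968090 boils down to comparing glance's md5-based `checksum` field against a locally computed digest of the image data. A minimal, hypothetical sketch of that comparison, not zuul's actual implementation:]

```python
import hashlib


def md5_of_stream(chunks):
    """Compute an MD5 hex digest from an iterable of byte chunks,
    so large image files never need to fit in memory."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def matches_glance_checksum(path, glance_checksum, chunk_size=1024 * 1024):
    """Return True if the local image file's md5 matches the value
    reported in the Glance image record's `checksum` field."""
    with open(path, "rb") as f:
        # iter() with a b"" sentinel reads until EOF in fixed-size chunks
        local = md5_of_stream(iter(lambda: f.read(chunk_size), b""))
    return local == glance_checksum
```

[A mismatch here would mean the bytes glance holds are not the bytes that were uploaded, which is exactly the corruption signal discussed above.]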
19:23:30 <corvus> i suspect the rax-classic is the behavior that fungi was concerned about: that on the backend, the cloud is mutating them
19:23:36 <clarkb> corvus: neat, in the raxflex-sjc3 case I'm guessing that is actually another corrupted image
19:23:46 <corvus> the flex-sjc3 checksum mismatch seems unexpected, yeah, i'm guessing so
19:23:51 <clarkb> corvus: and ya I think for classic they must be manipulating it, breaking the assumption we can validate things this way
19:23:54 <fungi> i suspect if we were e.g. uploading qcow2 to vexxhost and relying on a glance convert task to turn it into raw we'd also see a mismatch
19:24:03 <corvus> i was just about to spot check a few more sjc3 to see if they are all like that, or if sjc3 is just very prone to breakage.
19:24:12 <clarkb> corvus: ++ that seems like a good idea
19:24:27 <clarkb> corvus: do we think we could make validating the hashes provider specific? then we can just disable it for rax classic
19:24:44 <corvus> fungi: if that's an option, i'm guessing so... we are currently uploading raw to vexxhost though, so they do match
19:25:07 <corvus> clarkb: i think for it to be useful, we may need to do that... i haven't thought about how yet
19:25:14 <clarkb> corvus: ack
19:25:20 <corvus> i believe we are doing special vhd conversion for rax
19:25:33 <fungi> right. that was more of a theoretical argument for why other users might want it to be configurable per-provider too
19:25:38 <clarkb> corvus: yes we rely on that hacked up xen tool thing monty made
19:25:53 <corvus> are we doing everything expected of us there? or are we expected to do something else before upload that we are not doing?
19:26:10 <corvus> or is it just the case that no matter what we upload to rax-classic, they will mutate the image
19:26:33 <clarkb> corvus: I am not sure. I think that the whole process is a bit underdocumented. We know that qemu-img can convert to vhd but those images don't work in xen (but do work in azure?)
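[Editor's note: for reference on the conversion step clarkb mentions, VHD output in qemu-img is the `vpc` driver. A sketch that only builds the command line rather than executing it, since the actual pipeline reportedly uses a patched vhd-util instead of qemu-img:]

```python
def qemu_img_vhd_argv(src, dest, src_format="raw"):
    """Build a qemu-img command line converting a disk image to
    dynamic (sparse) VHD. Per the meeting, images produced this way
    reportedly don't boot under Rackspace classic's Xen, which is
    why a patched vhd-util is used there instead."""
    return [
        "qemu-img", "convert",
        "-f", src_format,           # input format of the source image
        "-O", "vpc",                # VHD is qemu's 'vpc' output driver
        "-o", "subformat=dynamic",  # dynamic VHD rather than fixed-size
        src, dest,
    ]
```

[Running this via subprocess would require qemu-img to be installed; the sketch stops at argv construction on purpose.]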
19:26:59 <clarkb> so I think even outside of our little bubble this process is somewhat odd
19:27:10 <corvus> maybe the best approach is to just turn this off for classic and wait for flex to completely take over?
19:27:15 <clarkb> corvus: but that may be something we can ask rackspace directly about. Perhaps in a followup email with more sjc3 info too
19:27:34 <clarkb> corvus: ya I think that works as a way to pay down all the xen related tech debt
19:27:56 <clarkb> (to be clear I'm happy to send followup emails if there is new data. You don't need to send that)
19:28:24 <corvus> i just spot checked more sjc3 images
19:28:31 <fungi> but also i agree that there's probably no poing in caring about this in rackspace classic, just the ability to disable the check in certain providers should suffice
19:28:42 <fungi> s/poing/point/
19:28:59 <corvus> and it looks like they don't match
19:29:05 <tonyb> we could grab the mutated image and compare them. I realise that may be a lot but it may also show something obvious
19:29:30 <corvus> yeah, sjc3 images never match.
19:29:43 <fungi> makes me wonder if flex sjc3 has some post-upload glance task applied that other flex regions don't
19:29:44 <corvus> 4/4 i checked so far
19:30:08 <corvus> and interesting that seems to correlate with where we have observed corruption
19:30:12 <clarkb> fungi: ya that is beginning to seem likely. So maybe the mismatched checksum is not the smoking gun I had hoped it was and we need to also consider validation jobs again
19:30:17 <clarkb> corvus: ++
19:30:36 <fungi> tonyb: sounds interesting, but i'm also fine just asking fanatical support why
19:31:02 <corvus> clarkb: i think questions about sjc3 are probably higher priority than rax-classic. i'd suggest just focusing on that in the next email for now?
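[Editor's note: one shape the per-provider toggle corvus and fungi discuss could take. The flag name and provider entries here are hypothetical illustrations, not zuul's real configuration schema:]

```python
# Hypothetical per-provider flags; zuul's actual config may differ.
PROVIDERS = {
    "raxflex-sjc3": {"validate-image-checksum": True},
    "raxflex-iad3": {"validate-image-checksum": True},
    # rax-classic mutates images on the backend, so its glance
    # checksum will never match ours; skip validation there.
    "rax-classic": {"validate-image-checksum": False},
}


def should_validate_checksum(provider, default=True):
    """Return whether uploads to this provider should have glance's
    checksum compared against our locally computed md5. Unknown
    providers fall back to validating, the safer default."""
    return PROVIDERS.get(provider, {}).get("validate-image-checksum", default)
```

[Defaulting unknown providers to True keeps the check opt-out rather than opt-in, matching the "disable it only where the cloud is known to mutate images" idea above.]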
19:31:15 <fungi> clarkb: though also while not a smoking gun, a likely place for things to be going wrong
19:31:42 <corvus> (i mean, i'm not opposed to understanding rax-classic, but it's not quite as interesting as sjc3)
19:31:45 <clarkb> corvus: wfm. And I think the new info is that the checksums never appear to match there, implying there is some backend process manipulating and changing the image data. Could that process be corrupting the images, as this is the only region with the problem so far?
19:31:57 <corvus> clarkb: ++
19:32:00 <clarkb> I can write that followup today
19:32:11 <fungi> always changing, occasionally corrupting
19:32:37 <clarkb> the other launcher topic I wanted to bring up was the held node floating ip deletion
19:32:54 <clarkb> we suspect that this is a side effect of our leaked floating ip cleanup and have added more debugging to that process in the launcher
19:33:03 <fungi> sjc3 is the oldest flex region, so possible it has some cruft they avoided in later deployments too
19:33:41 <clarkb> I think my held node may have lost its floating ip too so maybe we have data to look at for this
19:34:30 <clarkb> corvus: ^ fyi on that. I'll try to dig into logs to see if I can figure out when that happened. There is a window of time yesterday between testing gerrit things and updating the launcher where it may have occurred without logging in place
19:34:35 <fungi> since it's the only provider where we use floating ips, i'm not surprised if we had a blind spot there until now
19:35:09 <clarkb> anything else to note about the launchers?
19:35:09 <corvus> clarkb: wonder if it's restart related
19:35:27 <corvus> i can look at that (with you if you're around) after lunch
19:35:30 <clarkb> corvus: oh that is a good question.
The last time it happened it would've crossed an automated restart boundary
19:35:32 <fungi> also something i considered, the fip disappearances seemed to span restart times
19:35:42 <clarkb> and this time I did a manual restart yesterday
19:35:59 <fungi> not too hard to confirm in that case
19:36:11 <clarkb> cool we can followup after lunch
19:36:13 <clarkb> #topic Gitea 1.25.2 Upgrade
19:36:18 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/968245 Upgrade gitea to 1.25.2
19:36:58 <clarkb> there is a new gitea bugfix release. The screenshots from that change lgtm. There is the sha1-hulud question but I did some digging and the one package I could find in the pnpm lock file that overlapped with the bad list had a version that wasn't listed as bad
19:37:19 <clarkb> I think if we're comfortable with that we can proceed with the upgrade. If we want more sha1-hulud stuff to shake out first we can hold off
19:37:23 <fungi> yeah, thanks for burrowing beneath the sands to check that
19:37:43 <fungi> gitea must flow
19:38:00 <clarkb> in theory pnpm is more resistant to the problems posed by this attack too so it may be a non issue even if a bad package were included (I don't like taking those odds though)
19:38:17 <clarkb> anyway reviews welcome. I'm happy to help babysit if we proceed today or tomorrow
19:38:29 <clarkb> #topic December Meeting Planning
19:39:07 <clarkb> I wanted to call out that we're approaching December which is full of holidays and time off for many. I expect December 23, 30 and possibly January 6 to be the most problematic dates for meetings
19:39:20 <clarkb> I believe that I can personally run meetings on every one of those days if we want to have them
19:39:51 <fungi> december 23 i'll probably skip. i should be around for the others if people want a meeting, and happy to run any where you're unavailable too
19:40:30 <tonyb> of those dates Dec 30 is the only one I'll likely skip.
19:40:37 <clarkb> good to know.
I don't think we need to have answers right now. But if you're unlikely to make a specific meeting let me know. Then if it seems like only one or two people will be there I can cancel
19:40:44 <fungi> i wouldn't mind skipping 30 too
19:40:55 <corvus> i would like to be not around on dec 30. happy to skip others but will probably be around for those.
19:41:04 <fungi> well, really i wouldn't mind skipping any of them yeah ;)
19:41:18 <clarkb> ok let's say we're skipping the 30th for now and we can cancel others as plans come together
19:41:25 <fungi> wfm
19:41:31 <tonyb> ++
19:41:50 <clarkb> #topic Open Discussion
19:41:55 <clarkb> Anything else?
19:42:14 <clarkb> dmsimard[m] indicated an interest in talking about centralized Ara deployment for rendering CI job ansible run data
19:42:33 <dmsimard[m]> hi o/
19:42:37 <clarkb> #link https://etherpad.opendev.org/p/ara-for-databases
19:42:39 <dmsimard[m]> someone mentioned last week that this meeting would be a good place to talk about it
19:42:53 <fungi> as good as any
19:43:59 <clarkb> the problem is basically that doing an ara file export of a non trivial ansible run produces many small files which don't play nice with ci job log uploads
19:44:02 <dmsimard[m]> At a high level I think we were starting from the principle that we would like the databases to stay in s3 so we don't need a different way of uploading/expiring files
19:44:05 <clarkb> it takes many many minutes to upload all the files
19:44:31 <dmsimard[m]> right, generating and uploading html doesn't scale very well
19:44:35 <clarkb> and one solution to that problem is that ara can export an sqlite database with all of the data in it.
But then you need a running ara to render the data in the database
19:45:02 <fungi> a running ara that you can point to an arbitrary database location
19:45:12 <clarkb> I noted on the etherpad and in irc that my main concern with this approach is that various tools (opendev, kolla, openstack ansible) can and do run different versions of ansible and different versions of ara
19:45:28 <clarkb> looking at the notes from dmsimard[m] the different versions of ansible problem is probably less of a concern today
19:45:49 <clarkb> but different versions of ara could be a problem as ara wants the sqlite db to be generated by the same version of ara that is rendering it
19:46:44 <dmsimard[m]> ideally, yes, otherwise there would need to be a mechanism where ara knows it should automatically run sql migrations on arbitrary databases
19:46:53 <clarkb> this is probably not so much a problem at bootstrapping time, but updating ara on the central server would require us to update all the ci jobs. Maybe we're ok with that and basically just expect users to update ara if they want to use the tool
19:47:20 <dmsimard[m]> the database schema and migrations haven't historically moved a lot, but I could see it being an issue that could arise
19:48:03 <dmsimard[m]> we discussed an implementation that would know how to get the ara database stored as a zuul artifact
19:48:34 <clarkb> right, some sort of click a button and get automatically sent to a proxy that knows how to get ara to look things up from zuul?
19:49:27 <dmsimard[m]> a more ...
exotic option was to consider whether ara could load databases straight from swift with something like s3ql (mounting swift as a local filesystem, basically), but I'm not sure about that since there's more than one swift, but maybe it's not impossible
19:50:17 <dmsimard[m]> clarkb: yeah, something that would let ara download the database from the provided url and then load it
19:50:20 <clarkb> before we do any of that we may want to do a current state of the expected users to see what versions of ansible and ara we're dealing with and whether or not the current mix is workable
19:50:31 <fungi> if it could read the sqlite database from a url you wouldn't need the filesystem abstraction
19:50:44 <clarkb> then if we think there aren't major conflicts in tool choice the next step is in trying to figure out the automated loading of the correct sqlite db
19:51:52 <dmsimard[m]> fungi: sqlite is fundamentally a file so loading one from a filesystem is generally the expectation, I don't know how to "stream" sqlite database queries over http if it happens to be possible
19:52:13 <clarkb> you'd probably have to fetch the entire thing and cache it locally
19:52:22 <fungi> yeah, depends on how much state the process is expected to keep
19:52:24 <clarkb> so load from url is fetch file at url to local disk then read from disk
19:53:05 <dmsimard[m]> yeah I could see how downloading to a local cache would work
19:53:52 <dmsimard[m]> we can probably figure out a not-so-insecure plumbing to make it work, then have a cron that routinely wipes the cache or something like that
19:54:22 <clarkb> but I think my main concern continues to be the logistics of making this work with all the different consumers
19:55:07 <clarkb> particularly if opendev is going to be responsible for this service it's easy to get in an awkward spot where kolla and opendev are in conflict with one another and I don't think we should run an ara for each user (at that point running a local script to deploy ara in docker makes more sense to me)
19:56:08 <dmsimard[m]> yeah I understand the maintenance aspect of it
19:57:35 <clarkb> but that is why my suggestion is to start with an audit of what that situation looks like today to inform on whether or not it is likely to be a problem later
19:58:04 <dmsimard[m]> yeah, I was going to say your suggestion from earlier would be a good start
19:58:21 <fungi> seems like consensus
19:59:00 <dmsimard[m]> A question, maybe
19:59:34 <clarkb> go for it (we have less than a minute left :) )
19:59:35 <dmsimard[m]> I could find the gerrit patch again but I remember an installation (for bridge?) in system-config that installed ara from source instead of pip
20:00:13 <clarkb> dmsimard[m]: that would be our system-config-run-base-ansible-devel job
20:00:20 <dmsimard[m]> I was wondering to what extent installing from source was necessary, as that would be where sql migrations would land first
20:00:25 <clarkb> but I think it may pip install ara now and only install ansible from source?
20:00:55 <fungi> but could install ara or anything else from source if we wanted
20:00:56 <dmsimard[m]> I could be misremembering, I'll check it out and report back if needed
20:00:56 <clarkb> dmsimard[m]: the idea behind that job was to be forward looking to detect problems. Potentially exactly like that. In that case I think this is a feature and not a bug if we're still doing it
20:01:17 <clarkb> people debugging that job would just have to know that if ara breaks that is signal about using unreleased ara
20:01:29 <clarkb> and we are at time.
20:01:44 <clarkb> Feel free to continue discussion about this topic or any others in #opendev or on the mailing list
20:01:53 <clarkb> but I need to end here because I need food
20:01:58 <clarkb> thank you everyone!
20:02:00 <clarkb> #endmeeting
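[Editor's note: as an appendix to the ara discussion above, the "fetch the database to a local cache, then let ara read it from disk" idea could look roughly like this. This is a hypothetical sketch assuming a plain URL to the exported sqlite file; it is not an existing ara feature:]

```python
import os
import sqlite3
import tempfile
import urllib.request


def fetch_ara_database(url, cache_dir=None):
    """Download an exported ara sqlite database to a local cache file
    and return its path. sqlite can only open local files, so any
    'load from URL' feature has to stage the file on disk first."""
    cache_dir = cache_dir or tempfile.mkdtemp(prefix="ara-cache-")
    path = os.path.join(cache_dir, "ansible.sqlite")
    urllib.request.urlretrieve(url, path)  # handles http(s):// and file://
    return path


def open_readonly(path):
    """Open the cached database read-only, so the rendering side can't
    accidentally modify a fetched CI artifact."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)
```

[A cron that wipes the cache directory, as dmsimard suggests, would bound disk usage; the read-only URI mode guards the artifact itself.]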