Tuesday, 2025-11-25

clarkbmeeting time19:00
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Nov 25 19:00:11 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/BDETGBBWKNBD6ILJ652HNF5FBOZLMEOJ/ Our Agenda19:01
clarkb#topic Announcements19:01
clarkbjust a reminder that this weekend is a big holiday weekend for several of us. I won't be around Thursday or Friday and it looks like I'll be out Wednesday afternoon doing food prep19:02
clarkbanything else to announce?19:02
clarkb#topic Gerrit 3.11 Upgrade Planning19:04
clarkb#link https://www.gerritcodereview.com/3.11.html Gerrit 3.11 release notes19:04
clarkbI've gone through these release notes and taken my own notes on this etherpad19:05
clarkb#link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade19:05
clarkbI've tried to identify concerns or potential issues and then followed up by doing my best to test them19:05
clarkbso far everything has checked out ok19:05
clarkbAfter lunch today I plan to perform an actual upgrade on the held test node to test the process and take notes on the process itself19:05
clarkbI don't anticipate any problems so I think we should probably go ahead and commit to a date. I had previously suggested December 7 at 2100 UTC so that tonyb can participate. Does that seem reasonable?19:06
clarkbIf so I can announce that (maybe I'll send that email tomorrow so that it happens after my upgrade tests and gives others a chance to chime in on timing)19:06
tonybworks for me.19:07
clarkbthoughts, concerns, or questions? Gerrit upgrades are big undertakings so more than happy to have input on it19:07
fungii'll be around19:07
clarkbgreat19:08
clarkbI'll let everyone know if I find anything concerning in the upgrade tests today19:08
clarkband then if I don't hear otherwise announce the plan for December 7 at 2100 UTC tomorrow19:08
clarkb#topic Rename project app-kubernetes-module-manager to app-kernel-module-management19:08
fungidecember 7 is a sunday (at that time on this side of the globe) just to be clear19:08
clarkb#undo19:09
opendevmeetRemoving item from minutes: #topic Rename project app-kubernetes-module-manager to app-kernel-module-management19:09
clarkbright the idea is to do it during a slow time while tonyb, fungi, and I are all awake and operating19:09
clarkbwhich is somewhat complicated and sunday fits the bill19:09
fungiyeah, so our sunday night, tonyb's monday morning19:09
fungiwfm19:09
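The weekday arithmetic above can be checked with a short sketch. It converts the proposed 2025-12-07 21:00 UTC window into a couple of fixed UTC offsets. The offsets and their labels are assumptions chosen for illustration; the log does not state anyone's actual timezone.

```python
from datetime import datetime, timedelta, timezone

# Proposed upgrade window taken from the discussion.
upgrade_utc = datetime(2025, 12, 7, 21, 0, tzinfo=timezone.utc)

# Illustrative fixed offsets only (assumptions, not anyone's confirmed timezone).
offsets = {
    "UTC": timezone.utc,
    "UTC-8 (assumed)": timezone(timedelta(hours=-8)),
    "UTC+11 (assumed)": timezone(timedelta(hours=11)),
}


def local_view(moment, tz):
    """Return (weekday name, HH:MM) for an aware datetime as seen in tz."""
    local = moment.astimezone(tz)
    return local.strftime("%A"), local.strftime("%H:%M")


for label, tz in offsets.items():
    day, clock = local_view(upgrade_utc, tz)
    print(f"{label}: {day} {clock}")
```

Under these assumed offsets the window is still Sunday at UTC-8 and already Monday morning at UTC+11.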
clarkb#topic Rename project app-kubernetes-module-manager to app-kernel-module-management19:09
clarkbThen a related question is how do we want to go about renaming this project as requested by the starlingx community19:10
clarkbmy inclination is to focus on the gerrit upgrade first since 3.10 is eol now and we've fallen behind on gerrit upgrades19:10
clarkbthe project rename process is tested in our CI jobs for gerrit; we rename y/testproject to x/test-project iirc19:10
clarkbso it should be safe to do the rename after the upgrade some time.19:10
fungii'll be around the week after the upgrade and am happy to drive a rename maintenance then19:11
fungiin theory it'll be fast, really no longer than a restart19:11
clarkbfungi: great thanks. So sometime between the 8th and 12th19:11
fungiyeah, i have no unusual commitments/appointments that week19:12
clarkbI think I have a doctors visit on the 9th19:12
clarkbbut otherwise I don't anticipate any conflicts other than the typical meetings19:13
clarkband nope it's the 4th so even better19:13
clarkbok I'll let fungi drive planning and set a time19:13
clarkb#topic Upgrading old servers19:13
fungiwe'll have another meeting between now and then, so can keep this on the agenda and pick an exact date/time as things get closer19:14
clarkb++19:14
clarkbAny movement on server upgrades?19:14
tonybno progress on my stuff 19:14
clarkbI'm not aware of any other new server work recently19:14
clarkbI did want to note that I think I discovered that apt messages around lock failures have changed between jammy and noble19:15
clarkb#link https://review.opendev.org/c/opendev/system-config/+/968270 is the fix for that19:15
fungiapt or apt-get?19:15
clarkbfungi: unclear what the ansible module uses under the hood19:15
fungiokay, so neither. just ansible19:15
clarkbfungi: but you can look at ps1 of that change to see the diff between the two versions19:15
clarkbwell I think ansible is bubbling up the image from apt or apt-get 19:15
clarkbs/image/message/19:16
clarkbit's a small difference but one we need to accommodate at least for zuul. Reviews very much welcome19:16
fungito be clear, apt-get tries to maintain a stable interface. apt (the cli) has not since its creation and outputs a warning to stderr if called in a pipe or otherwise unattended (though that's changing in forky)19:16
clarkbgot it. I'm guessing ansible uses apt then19:16
clarkb#topic Matrix for OpenDev comms19:17
clarkbSpeaking of reviews19:17
clarkb#link https://review.opendev.org/q/hashtag:%22opendev-matrix%22+status:open19:17
clarkbI think this effort is largely stalled out on getting these changes landed19:17
clarkbthat isn't all we need to do but is the set of next steps (statusbot, eavesdrop)19:18
clarkboh and gerritbot19:18
clarkbthen with that done we can schedule a cutover date and flip eavesdropping from irc to matrix for #opendev and start migrating users over19:18
fungii feel like i may have previously committed to reviewing those and then failed to do so (can't remember now, it's been a busy few weeks)19:18
tonybI'll try to review them today19:19
fungii'll try to prioritize them19:19
clarkbthank you19:19
clarkb#topic Zuul Launcher Updates19:19
clarkbthere are two things we've noticed recently both affecting raxflex that I want to call out19:19
fungi(at least i have session verification working in my matrix client now)19:20
clarkbThe first is that we discovered a second corrupted ubuntu noble image in raxflex sjc3. The first occurred on October 6 and the second last Friday19:20
clarkbI did more poking around and testing the second time since two occurrences in two months feels less likely to be cosmic ray bit flips and could be something bigger19:20
clarkband tl;dr is that the md5sum we calculate for the image does not match the one in glance's checksum field in sjc3 but it did match in iad319:20
fungiwe still have the region disabled in our launcher config too19:21
clarkbso it definitely seems like the data got corrupted by the time glance got to it19:21
clarkbI ended up emailing rackspace and cc'd infra root on it19:21
clarkb#link https://review.opendev.org/c/zuul/zuul/+/968090 Will validate the glance checksum against our checksum and reject mismatches19:21
fungii guess we could enable it again now that the image has been superseded?19:21
corvusi just did fungi's suggestion in that change19:21
clarkbcorvus wrote this change in response too. Basically have zuul check the hashes since glance doesn't appear to19:21
corvusi spot checked ubuntu-focal images at random in all the clouds19:22
clarkbcorvus: ok cool that was going to be my question, can we generally expect that checksum value to match the one we calculate or does glance do some manipulation first then hash?19:22
clarkbcorvus: any concerns with that from your spot check?19:22
clarkbfungi: and yes I think we can probably reenable the region if we confirm the image has been superseded there19:23
corvusthey all match except: all the rax-classic images have different md5sums.  the raxflex-sjc3 image has a different md5sum.19:23
corvusi suspect the rax-classic is the behavior that fungi was concerned about: that on the backend, the cloud is mutating them19:23
clarkbcorvus: neat in the raxflex-sjc3 case I'm guessing that is actually another corrupted image19:23
corvusthe flex-sjc3 checksum mismatch seems unexpected, yeah, i'm guessing so19:23
clarkbcorvus: and ya I think for classic they must be manipulating it, breaking the assumption we can validate things this way19:23
fungii suspect if we were e.g. uploading qcow2 to vexxhost and relying on a glance convert task to turn it into raw we'd also see a mismatch19:23
corvusi was just about to spot check a few more sjc3 to see if they are all like that, or if sjc3 is just very prone to breakage.19:24
clarkbcorvus: ++ that seems like a good idea19:24
clarkbcorvus: do we think we could make validating the hashes provider specific then we can just disable it for rax classic?19:24
corvusfungi: if that's an option, i'm guessing so... we are currently uploading raw to vexxhost though, so they do match19:24
corvusclarkb: i think for it to be useful, we may need to do that... i haven't thought about how yet19:25
clarkbcorvus: ack19:25
corvusi believe we are doing special vhd conversion for rax19:25
fungiright. that was more of a theoretical argument for why other users might want it to be configurable per-provider too19:25
clarkbcorvus: yes we rely on that hacked up xen tool thing monty made19:25
corvusare we doing everything expected of us there?  or are we expected to do something else before upload that we are not doing?19:25
corvusor is it just the case that no matter what we upload to rax-classic, they will mutate the image19:26
clarkbcorvus: I am not sure. I think that the whole process is a bit underdocumented. We know that qemu-img can convert to vhd but those images don't work in xen (but do work in azure?)19:26
clarkbso I think even outside of our little bubble this process is somewhat odd19:26
corvusmaybe the best approach is to just turn this off for classic and wait for flex to completely take over?19:27
clarkbcorvus: but that may be something we can ask rackspace directly about. Perhaps in a followup email with more sjc3 info too19:27
clarkbcorvus: ya I think that works as a way to pay down all the xen related tech debt19:27
clarkb(to be clear I'm happy to send followup emails if there is new data. You don't need to send that)19:27
corvusi just spot checked more sjc3 images19:28
fungibut also i agree that there's probably no poing in caring about this in rackspace classic, just the ability to disable the check in certain providers should suffice19:28
fungis/poing/point/19:28
corvusand it looks like they don't match19:28
tonybwe could grab the mutated image and compare them.    I realise that may be a lot but it may also show something obvious 19:29
corvusyeah, sjc3 images never match.19:29
fungimakes me wonder if flex sjc3 has some post-upload glance task applied that other flex regions don't19:29
corvus4/4 i checked so far19:29
corvusand interesting that seems to correlate with where we have observed corruption19:30
clarkbfungi: ya that is beginning to seem likely. So maybe the mismatched checksum is not the smoking gun I had hoped it was and we need to also consider validation jobs again19:30
clarkbcorvus: ++19:30
fungitonyb: sounds interesting, but i'm also fine just asking fanatical support why19:30
corvusclarkb: i think questions about sjc3 are probably higher priority than rax-classic.  i'd suggest just focusing on that in the next email for now?19:31
fungiclarkb: though also while not smoking gun, a likely place for things to be going wrong19:31
corvus(i mean, i'm not opposed to understanding rax-classic, but it's not quite as interesting as sjc3)19:31
clarkbcorvus: wfm. And I think the new info is that the checksums never appear to match there implying there is some backend process manipulating and changing the image data. Could that process be corrupting the images as this is the only region with the problem so far19:31
corvusclarkb: ++19:31
clarkbI can write that followup today19:32
fungialways changing, occasionally corrupting19:32
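The checksum comparison discussed above, with the per-provider opt-out that was suggested for rax-classic, can be sketched roughly as follows. This is a minimal illustration and not the code in the linked Zuul change: the function names and the `skip_providers` parameter are hypothetical, and it assumes the cloud reports an md5 hex digest for the stored image.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_checksum_ok(local_md5, reported_md5, provider, skip_providers=frozenset()):
    """Return True if the upload should be accepted.

    skip_providers is a hypothetical per-provider opt-out for clouds that
    are known to rewrite images after upload, where a mismatch is expected.
    """
    if provider in skip_providers:
        return True
    if not reported_md5:
        # No checksum reported: treat as unverifiable and reject.
        return False
    return local_md5.lower() == reported_md5.lower()
```

With `skip_providers={"rax-classic"}`, a mismatch in raxflex-sjc3 would still be rejected while rax-classic uploads would pass unchecked. Whether rejecting is the right response for a region whose checksums never match is exactly the open question in the discussion.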
clarkbthe other launcher topic I wanted to bring up was the held node floating ip deletion19:32
clarkbwe suspect that this is a side effect of our leaked floating ip cleanup and have added more debugging to that process in the launcher19:32
fungisjc3 is the oldest flex region, so possible it has some cruft they avoided in later deployments too19:33
clarkbI think my held node may have lost its floating ip too so maybe we have data to look at for this19:33
clarkbcorvus: ^ fyi on that. I'll try to dig into logs to see if I can figure out when that happened. There is a window of time yesterday between testing gerrit things and updating the launcher where it may have occurred without logging in place19:34
fungisince it's the only provider where we use floating ips, i'm not surprised if we had a blind spot there until now19:34
clarkbanything else to note about the launchers?19:35
corvusclarkb: wonder if it's restart related19:35
corvusi can look at that (with you if you're around) after lunch19:35
clarkbcorvus: oh that is a good question. The last time it happened it would've crossed an automated restart boundary19:35
fungialso something i considered, the fip disappearances seemed to span restart times19:35
clarkband this time I did a manual restart yesterday19:35
funginot too hard to confirm in that case19:35
clarkbcool we can followup after lunch19:36
clarkb#topic Gitea 1.25.2 Upgrade19:36
clarkb#link https://review.opendev.org/c/opendev/system-config/+/968245 Upgrade gitea to 1.25.219:36
clarkbthere is a new gitea bugfix release. The screenshots from that change lgtm. There is the sha1-hulud question but I did some digging and the one package I could find in the pnpm lock file that overlapped with the bad list had a version that wasn't listed as bad19:36
clarkbI think if we're comfortable with that we can proceed with the upgrade. If we want more sha1-hulud stuff to shake out first we can hold off19:37
fungiyeah, thanks for burrowing beneath the sands to check that19:37
fungigitea must flow19:37
clarkbin theory pnpm is more resistant to the problems posed by this attack too so it may be a non issue even if a bad package were included (I don't like taking those odds though)19:38
clarkbanyway reviews welcome. I'm happy to help babysit if we proceed today or tomorrow19:38
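The manual lock file check described above amounts to two comparisons: which locked package names also appear on the bad list, and which of those are pinned to a version actually listed as bad. A minimal sketch, assuming the lock file and the advisory list have already been reduced to plain dictionaries (no pnpm lock file parsing is attempted, and all package names below are made up):

```python
def name_overlap(locked, bad_versions):
    """Package names present in both the lock data and the bad list."""
    return sorted(set(locked) & set(bad_versions))


def flagged_packages(locked, bad_versions):
    """Return {name: version} for locked packages pinned to a listed-bad version."""
    return {
        name: version
        for name, version in locked.items()
        if version in bad_versions.get(name, ())
    }


# Hypothetical data for illustration only.
locked = {"example-pkg": "2.0.1", "other-pkg": "1.4.0"}
bad_versions = {"example-pkg": {"2.0.2", "2.0.3"}, "unrelated-pkg": {"0.1.0"}}

print(name_overlap(locked, bad_versions))
print(flagged_packages(locked, bad_versions))
```

In this made-up data the name overlaps but the pinned version is not on the list, which mirrors the situation reported for the gitea change.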
clarkb#topic December Meeting Planning19:38
clarkbI wanted to call out that we're approaching December which is full of holidays and time off for many. I expect December 23, 30 and possibly January 6 to be the most problematic dates for meetings19:39
clarkbI believe that I can personally run meetings on every one of those days if we want to have them19:39
fungidecember 23 i'll probably skip. i should be around for the others if people want a meeting, and happy to run any where you're unavailable too19:39
tonybof those dates Dec 30 is the only one I'll likely skip.19:40
clarkbgood to know. I don't think we need to have answers right now. But if you're unlikely to make a specific meeting let me know. Then if it seems like only one or two people will be there then I can cancel19:40
fungii wouldn't mind skipping 30 too19:40
corvusi would like to be not around on dec 30.  happy to skip others but will probably be around for those.19:40
fungiwell, really i wouldn't mind skipping any of them yeah ;)19:41
clarkbok lets say we're skipping the 30th for now and we can cancel others as plans come together19:41
fungiwfm19:41
tonyb++19:41
clarkb#topic Open Discussion19:41
clarkbAnything else?19:41
clarkbdmsimard[m] indicated an interest in talking about centralized Ara deployment for rendering CI job ansible run data19:42
dmsimard[m]hi o/19:42
clarkb#link https://etherpad.opendev.org/p/ara-for-databases19:42
dmsimard[m]someone mentioned last week that this meeting would be a good place to talk about it19:42
fungias good as any19:42
clarkbthe problem is basically that doing an ara file export of a non trivial ansible run produces many small files which don't play nice with ci job log uploads19:43
dmsimard[m]At a high level I think we were starting from the principle that we would like the databases to stay in s3 so we don't need a different way of uploading/expiring files19:44
clarkbit takes many many minutes to upload all the files19:44
dmsimard[m]right, generating and uploading html doesn't scale very well19:44
clarkband one solution to that problem is that ara can export an sqlite database with all of the data in it. But then you need a running ara to render the data in the database19:44
fungia running ara that you can point to an arbitrary database location19:45
clarkbI noted on the etherpad and in irc that my main concern with this approach is that various tools (opendev, kolla, openstack ansible) can and do run different versions of ansible and different versions of ara19:45
clarkblooking at the notes from dmsimard[m] the different versions of ansible problem is probably less of a concern today19:45
clarkbbut different versions of ara could be a problem as ara wants the sqlite db to be generated by the same version of ara that is rendering it19:45
dmsimard[m]ideally, yes, otherwise there would need to be a mechanism where ara knows it should automatically run sql migrations on arbitrary databases19:46
clarkbthis is probably not so much a problem at bootstrapping time, but updating ara on the central server would require us to update all the ci jobs. Maybe we're ok with that and basically just expect users to update ara if they want to use the tool19:46
dmsimard[m]the database schema and migrations haven't historically moved a lot, but I could see it being an issue that could arise19:47
dmsimard[m]we discussed an implementation that would know how to get the ara database stored as a zuul artifact19:48
clarkbright some sort of click a button and then get automatically sent to a proxy that knows how to get ara to look things up from zuul?19:48
dmsimard[m]a more ... exotic option was to consider whether ara could load databases straight from swift with something like s3ql (mounting swift as a local filesystem, basically), but I'm not sure about that since there's more than one swift, but maybe it's not impossible19:49
dmsimard[m]clarkb: yeah, something that would let ara download the database from the provided url and then load it19:50
clarkbbefore we do any of that we may want to do a current state of the expected users to see what versions of ansible and ara we're dealing with and whether or not the current mix is workable19:50
fungiif it could read the sqlite database from a url you wouldn't need the filesystem abstraction19:50
clarkbthen if we think there aren't major conflicts in tool choice the next step is in trying to figure out the automated loading of the correct sqlite db19:50
dmsimard[m]fungi: sqlite is fundamentally a file so loading one from a filesystem is generally the expectation, I don't know how to "stream" sqlite database queries over http if it happens to be possible19:51
clarkbyou'd probably have to fetch the entire thing and cache it locally19:52
fungiyeah, depends on how much state the process is expected to keep19:52
clarkbso load from url is fetch file at url to local disk then read from disk19:52
dmsimard[m]yeah I could see how downloading to a local cache would work19:53
dmsimard[m]we can probably figure out a not-so-insecure plumbing to make it work, then have a cron that routinely wipes the cache or something like that19:53
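The "fetch to local disk, then read from disk" idea, together with a periodic cache wipe, could look roughly like the sketch below. It uses only the Python standard library and is not ARA's API: every function name here is hypothetical, and the cache layout (one file per URL, named by a hash of the URL) is an assumption made for illustration.

```python
import hashlib
import shutil
import sqlite3
import time
import urllib.request
from pathlib import Path


def cached_db_path(url, cache_dir):
    """Map a database URL to a stable file name inside the cache directory."""
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".sqlite"
    return Path(cache_dir) / name


def fetch_database(url, cache_dir):
    """Download the sqlite file at url into the cache unless already present."""
    target = cached_db_path(url, cache_dir)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        partial = target.with_name(target.name + ".part")
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        # Rename only after the download completes so readers never see a partial file.
        partial.replace(target)
    return target


def open_read_only(path):
    """Open a cached database read-only so rendering cannot modify it."""
    return sqlite3.connect(Path(path).resolve().as_uri() + "?mode=ro", uri=True)


def prune_cache(cache_dir, max_age_seconds, now=None):
    """Delete cached databases older than max_age_seconds; return removed names."""
    now = time.time() if now is None else now
    removed = []
    for entry in Path(cache_dir).glob("*.sqlite"):
        if now - entry.stat().st_mtime > max_age_seconds:
            entry.unlink()
            removed.append(entry.name)
    return removed
```

A cron job calling something like `prune_cache` would cover the routine wipe. A real service would also need to restrict which URLs it is willing to fetch, since `urlopen` accepts schemes other than http(s); that restriction is left out of this sketch.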
clarkbbut I think my main concern continues to be the logistics of making this work with all the different consumers19:54
clarkbparticularly if opendev is going to be responsible for this service it's easy to get in an awkward spot where kolla and opendev are in conflict with one another and I don't think we should run an ara for each user (at that point the run a local script to deploy ara in docker makes more sense to me)19:55
dmsimard[m]yeah I understand the maintenance aspect of it19:56
clarkbbut that is why my suggestion is to start with an audit of what that situation looks like today to inform on whether or not it is likely to be a problem later19:57
dmsimard[m]yeah, I was going to say your suggestion from earlier would be a good start19:58
fungiseems like consensus19:58
dmsimard[m]A question, maybe19:59
clarkbgo for it (we have less than a minute left :) )19:59
dmsimard[m]I could find the gerrit patch again but I remember an installation (for bridge?) in system-config that installed ara from source instead of pip19:59
clarkbdmsimard[m]: that would be our system-config-run-base-ansible-devel job20:00
dmsimard[m]I was wondering to what extent installing from source was necessary, as that would be where sql migrations would land first20:00
clarkbbut I think it may pip install ara now and only install ansible from source?20:00
fungibut could install ara or anything else from source if we wanted20:00
dmsimard[m]I could be misremembering, I'll check it out and report back if needed20:00
clarkbdmsimard[m]: the idea behind that job was to be forward looking to detect problems. Potentially exactly like that. In that case I think this is a feature and not a bug if we're still doing it20:00
clarkbpeople debugging that job would just have to know that if ara breaks that is signal about using unreleased ara20:01
clarkband we are at time.20:01
clarkbFeel free to continue discussion about this topic or any others in #opendev or on the mailing list20:01
clarkbbut I need to end here because I need food20:01
clarkbthank you everyone!20:01
clarkb#endmeeting20:02
opendevmeetMeeting ended Tue Nov 25 20:02:00 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:02
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-25-19.00.html20:02
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-25-19.00.txt20:02
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-25-19.00.log.html20:02
