| nick | message | time |
|---|---|---|
| clarkb | #startmeeting infra | 19:00 |
| opendevmeet | Meeting started Tue Nov 25 19:00:11 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
| opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
| opendevmeet | The meeting name has been set to 'infra' | 19:00 |
| clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/BDETGBBWKNBD6ILJ652HNF5FBOZLMEOJ/ Our Agenda | 19:01 |
| clarkb | #topic Announcements | 19:01 |
| clarkb | just a reminder that this weekend is a big holiday weekend for several of us. I won't be around Thursday or Friday and it looks like I'll be out Wednesday afternoon doing food prep | 19:02 |
| clarkb | anything else to announce? | 19:02 |
| clarkb | #topic Gerrit 3.11 Upgrade Planning | 19:04 |
| clarkb | #link https://www.gerritcodereview.com/3.11.html Gerrit 3.11 release notes | 19:04 |
| clarkb | I've gone through these release notes and taken my own notes on this etherpad | 19:05 |
| clarkb | #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade | 19:05 |
| clarkb | I've tried to identify concerns or potential issues and then followed up by doing my best to test them | 19:05 |
| clarkb | so far everything has checked out ok | 19:05 |
| clarkb | After lunch today I plan to perform an actual upgrade on the held test node to test the process and take notes on the process itself | 19:05 |
| clarkb | I don't anticipate any problems so I think we should probably go ahead and commit to a date. I had previously suggested December 7 at 2100 UTC so that tonyb can participate. Does that seem reasonable? | 19:06 |
| clarkb | If so I can announce that (maybe I'll send that email tomorrow so that it happens after my upgrade tests and gives others a chance to chime in on timing) | 19:06 |
| tonyb | works for me. | 19:07 |
| clarkb | thoughts, concerns, or questions? Gerrit upgrades are big undertakings so more than happy to have input on it | 19:07 |
| fungi | i'll be around | 19:07 |
| clarkb | great | 19:08 |
| clarkb | I'll let everyone know if I find anything concerning in the upgrade tests today | 19:08 |
| clarkb | and then if I don't hear otherwise announce the plan for December 7 at 2100 UTC tomorrow | 19:08 |
| clarkb | #topic Rename project app-kubernetes-module-manager to app-kernel-module-management | 19:08 |
| fungi | december 7 is a sunday (at that time on this side of the globe) just to be clear | 19:08 |
| clarkb | #undo | 19:09 |
| opendevmeet | Removing item from minutes: #topic Rename project app-kubernetes-module-manager to app-kernel-module-management | 19:09 |
| clarkb | right, the idea is to do it during a slow time while tonyb, fungi, and I are all awake and operating | 19:09 |
| clarkb | which is somewhat complicated and sunday fits the bill | 19:09 |
| fungi | yeah, so our sunday night, tonyb's monday morning | 19:09 |
| fungi | wfm | 19:09 |
| clarkb | #topic Rename project app-kubernetes-module-manager to app-kernel-module-management | 19:09 |
| clarkb | Then a related question is how do we want to go about renaming this project as requested by the starlingx community | 19:10 |
| clarkb | my inclination is to focus on the gerrit upgrade first since 3.10 is eol now and we've fallen behind on gerrit upgrades | 19:10 |
| clarkb | the project rename process is tested in our CI jobs for gerrit; we rename y/testproject to x/test-project iirc | 19:10 |
| clarkb | so it should be safe to do the rename after the upgrade some time. | 19:10 |
| fungi | i'll be around the week after the upgrade and am happy to drive a rename maintenance then | 19:11 |
| fungi | in theory it'll be fast, really no longer than a restart | 19:11 |
| clarkb | fungi: great thanks. So sometime between the 8th and 12th | 19:11 |
| fungi | yeah, i have no unusual commitments/appointments that week | 19:12 |
| clarkb | I think I have a doctors visit on the 9th | 19:12 |
| clarkb | but otherwise I don't anticipate any conflicts other than the typical meetings | 19:13 |
| clarkb | and nope, it's the 4th so even better | 19:13 |
| clarkb | ok I'll let fungi drive planning and set a time | 19:13 |
| clarkb | #topic Upgrading old servers | 19:13 |
| fungi | we'll have another meeting between now and then, so can keep this on the agenda and pick an exact date/time as things get closer | 19:14 |
| clarkb | ++ | 19:14 |
| clarkb | Any movement on server upgrades? | 19:14 |
| tonyb | no progress on my stuff | 19:14 |
| clarkb | I'm not aware of any other new server work recently | 19:14 |
| clarkb | I did want to note that I think I discovered that apt messages around lock failures have changed between jammy and noble | 19:15 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/968270 is the fix for that | 19:15 |
| fungi | apt or apt-get? | 19:15 |
| clarkb | fungi: unclear what the ansible module uses under the hood | 19:15 |
| fungi | okay, so neither. just ansible | 19:15 |
| clarkb | fungi: but you can look at ps1 of that change to see the diff between the two versions | 19:15 |
| clarkb | well I think ansible is bubbling up the message from apt or apt-get | 19:15 |
| clarkb | it's a small difference but one we need to accommodate at least for zuul. Reviews very much welcome | 19:16 |
| fungi | to be clear, apt-get tries to maintain a stable interface. apt (the cli) has not since its creation and outputs a warning to stderr if called in a pipe or otherwise unattended (though that's changing in forky) | 19:16 |
| clarkb | got it. I'm guessing ansible uses apt then | 19:16 |
| clarkb | #topic Matrix for OpenDev comms | 19:17 |
| clarkb | Speaking of reviews | 19:17 |
| clarkb | #link https://review.opendev.org/q/hashtag:%22opendev-matrix%22+status:open | 19:17 |
| clarkb | I think this effort is largely stalled out on getting these changes landed | 19:17 |
| clarkb | that isn't all we need to do but is the set of next steps (statusbot, eavesdrop) | 19:18 |
| clarkb | oh and gerritbot | 19:18 |
| clarkb | then with that done we can schedule a cutover date and flip eavesdropping from irc to matrix for #opendev and start migrating users over | 19:18 |
| fungi | i feel like i may have previously committed to reviewing those and then failed to do so (can't remember now, it's been a busy few weeks) | 19:18 |
| tonyb | I'll try to review them today | 19:19 |
| fungi | i'll try to prioritize them | 19:19 |
| clarkb | thank you | 19:19 |
| clarkb | #topic Zuul Launcher Updates | 19:19 |
| clarkb | there are two things we've noticed recently both affecting raxflex that I want to call out | 19:19 |
| fungi | (at least i have session verification working in my matrix client now) | 19:20 |
| clarkb | The first is that we discovered a second corrupted ubuntu noble image in raxflex sjc3. The first occurred on October 6 and the second last Friday | 19:20 |
| clarkb | I did more poking around and testing the second time since two occurrences in two months feels less likely to be cosmic ray bit flips and could be something bigger | 19:20 |
| clarkb | and tl;dr is that the md5sum we calculate for the image does not match the one in glance's checksum field in sjc3 but it did match in iad3 | 19:20 |
| fungi | we still have the region disabled in our launcher config too | 19:21 |
| clarkb | so it definitely seems like the data got corrupted by the time glance got to it | 19:21 |
| clarkb | I ended up emailing rackspace and cc'd infra root on it | 19:21 |
| clarkb | #link https://review.opendev.org/c/zuul/zuul/+/968090 Will validate the glance checksum against our checksum and reject mismatches | 19:21 |
| fungi | i guess we could enable it again now that the image has been superseded? | 19:21 |
| corvus | i just did fungi's suggestion in that change | 19:21 |
| clarkb | corvus wrote this change in response too. Basically have zuul check the hashes since glance doesn't appear to | 19:21 |
| corvus | i spot checked ubuntu-focal images at random in all the clouds | 19:22 |
| clarkb | corvus: ok cool that was going to be my question, can we generally expect that checksum value to match the one we calculate or does glance do some manipulation first then hash? | 19:22 |
| clarkb | corvus: any concerns with that from your spot check? | 19:22 |
| clarkb | fungi: and yes I think we can probably reenable the region if we confirm the image has been superseded there | 19:23 |
| corvus | they all match except: all the rax-classic images have different md5sums. the raxflex-sjc3 image has a different md5sum. | 19:23 |
| corvus | i suspect the rax-classic is the behavior that fungi was concerned about: that on the backend, the cloud is mutating them | 19:23 |
| clarkb | corvus: neat in the raxflex-sjc3 case I'm guessing that is actually another corrupted image | 19:23 |
| corvus | the flex-sjc3 checksum mismatch seems unexpected, yeah, i'm guessing so | 19:23 |
| clarkb | corvus: and ya I think for classic they must be manipulating it, breaking the assumption we can validate things this way | 19:23 |
| fungi | i suspect if we were e.g. uploading qcow2 to vexxhost and relying on a glance convert task to turn it into raw we'd also see a mismatch | 19:23 |
| corvus | i was just about to spot check a few more sjc3 to see if they are all like that, or if sjc3 is just very prone to breakage. | 19:24 |
| clarkb | corvus: ++ that seems like a good idea | 19:24 |
| clarkb | corvus: do we think we could make validating the hashes provider specific then we can just disable it for rax classic? | 19:24 |
| corvus | fungi: if that's an option, i'm guessing so... we are currently uploading raw to vexxhost though, so they do match | 19:24 |
| corvus | clarkb: i think for it to be useful, we may need to do that... i haven't thought about how yet | 19:25 |
| clarkb | corvus: ack | 19:25 |
| corvus | i believe we are doing special vhd conversion for rax | 19:25 |
| fungi | right. that was more of a theoretical argument for why other users might want it to be configurable per-provider too | 19:25 |
| clarkb | corvus: yes we rely on that hacked up xen tool thing monty made | 19:25 |
| corvus | are we doing everything expected of us there? or are we expected to do something else before upload that we are not doing? | 19:25 |
| corvus | or is it just the case that no matter what we upload to rax-classic, they will mutate the image | 19:26 |
| clarkb | corvus: I am not sure. I think that the whole process is a bit underdocumented. We know that qemu-img can convert to vhd but those images don't work in xen (but do work in azure?) | 19:26 |
| clarkb | so I think even outside of our little bubble this process is somewhat odd | 19:26 |
| corvus | maybe the best approach is to just turn this off for classic and wait for flex to completely take over? | 19:27 |
| clarkb | corvus: but that may be something we can ask rackspace directly about. Perhaps in a followup email with more sjc3 info too | 19:27 |
| clarkb | corvus: ya I think that works as a way to pay down all the xen related tech debt | 19:27 |
| clarkb | (to be clear I'm happy to send followup emails if there is new data. You don't need to send that) | 19:27 |
| corvus | i just spot checked more sjc3 images | 19:28 |
| fungi | but also i agree that there's probably no point in caring about this in rackspace classic, just the ability to disable the check in certain providers should suffice | 19:28 |
| corvus | and it looks like they don't match | 19:28 |
| tonyb | we could grab the mutated image and compare them. I realise that may be a lot but it may also show something obvious | 19:29 |
| corvus | yeah, sjc3 images never match. | 19:29 |
| fungi | makes me wonder if flex sjc3 has some post-upload glance task applied that other flex regions don't | 19:29 |
| corvus | 4/4 i checked so far | 19:29 |
| corvus | and interesting that seems to correlate with where we have observed corruption | 19:30 |
| clarkb | fungi: ya that is beginning to seem likely. So maybe the mismatched checksum is not the smoking gun I had hoped it was and we need to also consider validation jobs again | 19:30 |
| clarkb | corvus: ++ | 19:30 |
| fungi | tonyb: sounds interesting, but i'm also fine just asking fanatical support why | 19:30 |
| corvus | clarkb: i think questions about sjc3 are probably higher priority than rax-classic. i'd suggest just focusing on that in the next email for now? | 19:31 |
| fungi | clarkb: though also while not smoking gun, a likely place for things to be going wrong | 19:31 |
| corvus | (i mean, i'm not opposed to understanding rax-classic, but it's not quite as interesting as sjc3) | 19:31 |
| clarkb | corvus: wfm. And I think the new info is that the checksums never appear to match there implying there is some backend process manipulating and changing the image data. Could that process be corrupting the images as this is the only region with the problem so far | 19:31 |
| corvus | clarkb: ++ | 19:31 |
| clarkb | I can write that followup today | 19:32 |
| fungi | always changing, occasionally corrupting | 19:32 |
| clarkb | the other launcher topic I wanted to bring up was the held node floating ip deletion | 19:32 |
| clarkb | we suspect that this is a side effect of our leaked floating ip cleanup and have added more debugging to that process in the launcher | 19:32 |
| fungi | sjc3 is the oldest flex region, so possible it has some cruft they avoided in later deployments too | 19:33 |
| clarkb | I think my held node may have lost its floating ip too so maybe we have data to look at for this | 19:33 |
| clarkb | corvus: ^ fyi on that. I'll try to dig into logs to see if I can figure out when that happened. There is a window of time yesterday between testing gerrit things and updating the launcher where it may have occurred without logging in place | 19:34 |
| fungi | since it's the only provider where we use floating ips, i'm not surprised if we had a blind spot there until now | 19:34 |
| clarkb | anything else to note about the launchers? | 19:35 |
| corvus | clarkb: wonder if it's restart related | 19:35 |
| corvus | i can look at that (with you if you're around) after lunch | 19:35 |
| clarkb | corvus: oh that is a good question. The last time it happened it would've crossed an automated restart boundary | 19:35 |
| fungi | also something i considered, the fip disappearances seemed to span restart times | 19:35 |
| clarkb | and this time I did a manual restart yesterday | 19:35 |
| fungi | not too hard to confirm in that case | 19:35 |
| clarkb | cool we can followup after lunch | 19:36 |
| clarkb | #topic Gitea 1.25.2 Upgrade | 19:36 |
| clarkb | #link https://review.opendev.org/c/opendev/system-config/+/968245 Upgrade gitea to 1.25.2 | 19:36 |
| clarkb | there is a new gitea bugfix release. The screenshots from that change lgtm. There is the sha1-hulud question, but I did some digging and the one package I could find in the pnpm lock file that overlapped with the bad list had a version that wasn't listed as bad | 19:36 |
| clarkb | I think if we're comfortable with that we can proceed with the upgrade. If we want more sha1-hulud stuff to shake out first we can hold off | 19:37 |
| fungi | yeah, thanks for burrowing beneath the sands to check that | 19:37 |
| fungi | gitea must flow | 19:37 |
| clarkb | in theory pnpm is more resistant to the problems posed by this attack too, so it may be a non-issue even if a bad package were included (I don't like taking those odds though) | 19:38 |
| clarkb | anyway reviews welcome. I'm happy to help babysit if we proceed today or tomorrow | 19:38 |
| clarkb | #topic December Meeting Planning | 19:38 |
| clarkb | I wanted to call out that we're approaching December which is full of holidays and time off for many. I expect December 23, 30 and possibly January 6 to be the most problematic dates for meetings | 19:39 |
| clarkb | I believe that I can personally run meetings on every one of those days if we want to have them | 19:39 |
| fungi | december 23 i'll probably skip. i should be around for the others if people want a meeting, and happy to run any where you're unavailable too | 19:39 |
| tonyb | of those dates Dec 30 is the only one I'll likely skip. | 19:40 |
| clarkb | good to know. I don't think we need to have answers right now. But if you're unlikely to make a specific meeting let me know. Then if it seems like only one or two people will be there then I can cancel | 19:40 |
| fungi | i wouldn't mind skipping 30 too | 19:40 |
| corvus | i would like to be not around on dec 30. happy to skip others but will probably be around for those. | 19:40 |
| fungi | well, really i wouldn't mind skipping any of them yeah ;) | 19:41 |
| clarkb | ok lets say we're skipping the 30th for now and we can cancel others as plans come together | 19:41 |
| fungi | wfm | 19:41 |
| tonyb | ++ | 19:41 |
| clarkb | #topic Open Discussion | 19:41 |
| clarkb | Anything else? | 19:41 |
| clarkb | dmsimard[m] indicated an interest in talking about centralized Ara deployment for rendering CI job ansible run data | 19:42 |
| dmsimard[m] | hi o/ | 19:42 |
| clarkb | #link https://etherpad.opendev.org/p/ara-for-databases | 19:42 |
| dmsimard[m] | someone mentioned last week that this meeting would be a good place to talk about it | 19:42 |
| fungi | as good as any | 19:42 |
| clarkb | the problem is basically that doing an ara file export of a non trivial ansible run produces many small files which don't play nice with ci job log uploads | 19:43 |
| dmsimard[m] | At a high level I think we were starting from the principle that we would like the databases to stay in s3 so we don't need a different way of uploading/expiring files | 19:44 |
| clarkb | it takes many many minutes to upload all the files | 19:44 |
| dmsimard[m] | right, generating and uploading html doesn't scale very well | 19:44 |
| clarkb | and one solution to that problem is that ara can export an sqlite database with all of the data in it. But then you need a running ara to render the data in the database | 19:44 |
| fungi | a running ara that you can point to an arbitrary database location | 19:45 |
| clarkb | I noted on the etherpad and in irc that my main concern with this approach is that various tools (opendev, kolla, openstack ansible) can and do run different versions of ansible and different versions of ara | 19:45 |
| clarkb | looking at the notes from dmsimard[m] the different versions of ansible problem is probably less of a concern today | 19:45 |
| clarkb | but different versions of ara could be a problem as ara wants the sqlite db to be generated by the same version of ara that is rendering it | 19:45 |
| dmsimard[m] | ideally, yes, otherwise there would need to be a mechanism where ara knows it should automatically run sql migrations on arbitrary databases | 19:46 |
| clarkb | this is probably not so much a problem at bootstrapping time, but updating ara on the central server would require us to update all the ci jobs. Maybe we're ok with that and basically just expect users to update ara if they want to use the tool | 19:46 |
| dmsimard[m] | the database schema and migrations haven't historically moved a lot, but I could see it being an issue that could arise | 19:47 |
| dmsimard[m] | we discussed an implementation that would know how to get the ara database stored as a zuul artifact | 19:48 |
| clarkb | right, some sort of click-a-button flow where you get automatically sent to a proxy that knows how to get ara to look things up from zuul? | 19:48 |
| dmsimard[m] | a more ... exotic option was to consider whether ara could load databases straight from swift with something like s3ql (mounting swift as a local filesystem, basically), but I'm not sure about that since there's more than one swift, but maybe it's not impossible | 19:49 |
| dmsimard[m] | clarkb: yeah, something that would let ara download the database from the provided url and then load it | 19:50 |
| clarkb | before we do any of that we may want to do a current state of the expected users to see what versions of ansible and ara we're dealing with and whether or not the current mix is workable | 19:50 |
| fungi | if it could read the sqlite database from a url you wouldn't need the filesystem abstraction | 19:50 |
| clarkb | then if we think there aren't major conflicts in tool choice the next step is in trying to figure out the automated loading of the correct sqlite db | 19:50 |
| dmsimard[m] | fungi: sqlite is fundamentally a file so loading one from a filesystem is generally the expectation, I don't know how to "stream" sqlite database queries over http if it happens to be possible | 19:51 |
| clarkb | you'd probably have to fetch the entire thing and cache it locally | 19:52 |
| fungi | yeah, depends on how much state the process is expected to keep | 19:52 |
| clarkb | so load from url is fetch file at url to local disk then read from disk | 19:52 |
| dmsimard[m] | yeah I could see how downloading to a local cache would work | 19:53 |
| dmsimard[m] | we can probably figure out a not-so-insecure plumbing to make it work, then have a cron that routinely wipes the cache or something like that | 19:53 |
| clarkb | but I think my main concern continues to be the logistics of making this work with all the different consumers | 19:54 |
| clarkb | particularly if opendev is going to be responsible for this service; it's easy to get into an awkward spot where kolla and opendev are in conflict with one another, and I don't think we should run an ara for each user (at that point running a local script to deploy ara in docker makes more sense to me) | 19:55 |
| dmsimard[m] | yeah I understand the maintenance aspect of it | 19:56 |
| clarkb | but that is why my suggestion is to start with an audit of what that situation looks like today to inform whether or not it is likely to be a problem later | 19:57 |
| dmsimard[m] | yeah, I was going to say your suggestion from earlier would be a good start | 19:58 |
| fungi | seems like consensus | 19:58 |
| dmsimard[m] | A question, maybe | 19:59 |
| clarkb | go for it (we have less than a minute left :) ) | 19:59 |
| dmsimard[m] | I could find the gerrit patch again but I remember an installation (for bridge?) in system-config that installed ara from source instead of pip | 19:59 |
| clarkb | dmsimard[m]: that would be our system-config-run-base-ansible-devel job | 20:00 |
| dmsimard[m] | I was wondering to what extent installing from source was necessary, as that would be where sql migrations would land first | 20:00 |
| clarkb | but I think it may pip install ara now and only install ansible from source? | 20:00 |
| fungi | but could install ara or anything else from source if we wanted | 20:00 |
| dmsimard[m] | I could be misremembering, I'll check it out and report back if needed | 20:00 |
| clarkb | dmsimard[m]: the idea behind that job was to be forward-looking to detect problems. Potentially exactly like that. In that case I think this is a feature and not a bug if we're still doing it | 20:00 |
| clarkb | people debugging that job would just have to know that if ara breaks that is signal about using unreleased ara | 20:01 |
| clarkb | and we are at time. | 20:01 |
| clarkb | Feel free to continue discussion about this topic or any others in #opendev or on the mailing list | 20:01 |
| clarkb | but I need to end here because I need food | 20:01 |
| clarkb | thank you everyone! | 20:01 |
| clarkb | #endmeeting | 20:02 |
| opendevmeet | Meeting ended Tue Nov 25 20:02:00 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:02 |
| opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-25-19.00.html | 20:02 |
| opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-25-19.00.txt | 20:02 |
| opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-11-25-19.00.log.html | 20:02 |
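
The jammy/noble difference in apt lock-failure messages discussed at 19:15 boils down to matching more than one wording of the same condition. A minimal sketch, assuming hypothetical message texts; the real strings and the real fix live in the linked system-config change, not here:

```python
import re

# Hypothetical examples of lock-failure wording. The exact phrasing differs
# between distro releases, so we match a set of patterns rather than one
# literal string.
LOCK_PATTERNS = [
    re.compile(r"Could not get lock /var/lib/dpkg/lock"),
    re.compile(r"Failed to lock directory /var/lib/apt/lists"),
    re.compile(r"Unable to acquire the dpkg frontend lock"),
]


def is_lock_failure(error_output: str) -> bool:
    """Return True if apt/apt-get error output looks like lock contention."""
    return any(pattern.search(error_output) for pattern in LOCK_PATTERNS)
```

Something shaped like this would let a retry loop distinguish transient lock contention from real package failures regardless of which wording the underlying tool emits.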
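
The checksum comparison described around 19:20 (our md5 of the uploaded image versus glance's `checksum` field) can be sketched roughly as follows; `glance_checksum` stands in for whatever value the image API reports, and the helper names are illustrative, not from the actual zuul change:

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the image in chunks so multi-GB files never load into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(local_path: str, glance_checksum: str) -> bool:
    """Compare our locally computed md5 with the checksum glance reports."""
    return md5_of_file(local_path) == glance_checksum
```

As the meeting notes, this check is only meaningful for providers that store the bytes unmodified; anywhere the cloud converts images on the backend (rax-classic, or a qcow2-to-raw glance task) the values will never match and the check needs to be disabled per provider.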
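
The pnpm lockfile audit mentioned at 19:36 amounts to scanning for compromised name@version pairs. A hedged sketch with a hypothetical bad-package map (not the real advisory data), using a rough textual scan rather than a full YAML parse:

```python
import re


def find_compromised(lockfile_text: str, bad_packages: dict) -> list:
    """Scan lockfile text for name@version specifiers that appear in a
    bad-package map (package name -> set of compromised versions).

    This is a rough substring scan: a package name embedded in a longer
    name can false-positive, so hits should be verified by hand.
    """
    hits = []
    for name, versions in bad_packages.items():
        for match in re.finditer(re.escape(name) + r"@(\d[\w.\-]*)", lockfile_text):
            if match.group(1) in versions:
                hits.append((name, match.group(1)))
    return sorted(set(hits))
```

A matching name with a version outside the compromised set, as in the gitea case discussed above, would produce no hit.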
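
The fetch-to-cache-then-read approach for ara sqlite databases from around 19:52 (sqlite needs a real file, so download the artifact once, then open it from local disk) could look roughly like this; the cache location, naming scheme, and the cron-based wiping policy are all assumptions from the discussion, not an existing service:

```python
import hashlib
import os
import sqlite3
import tempfile
import urllib.request

# Assumption: a periodic job wipes this directory, per the meeting discussion.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "ara-db-cache")


def open_cached_database(url: str) -> sqlite3.Connection:
    """Download the sqlite artifact to a local cache (if not already
    present) and open it; sqlite cannot read directly over http."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    # Key the cache file on the URL so distinct builds don't collide.
    local = os.path.join(
        CACHE_DIR, hashlib.sha256(url.encode()).hexdigest() + ".sqlite"
    )
    if not os.path.exists(local):
        urllib.request.urlretrieve(url, local)
    return sqlite3.connect(local)
```

In the proxy idea floated in the meeting, the URL would come from a zuul build artifact record; the version-skew concern still applies, since ara would expect the downloaded database's schema to match the version doing the rendering.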