Friday, 2025-11-21

Ramereth[m]clarkb: No I hadn't noticed that. I don't know of any issues that might cause that. Is it still happening?00:15
clarkbRamereth[m]: the last occurence we have knowledge of was yesterday. I can retrigger the image build now to see if it is still happening00:16
clarkbit is also possible something else is going on, but the slow image builds were previously explained by the io problems iirc00:16
Ramereth[m]Oh, I think I know what might be causing it. I moved the nodes to a different 10g in preparation for the upcoming DC move00:17
Ramereth[m]let me look at the switch00:17
clarkbRamereth[m]: https://review.opendev.org/c/opendev/zuul-providers/+/966200 this change in particular to add debian trixie arm64 images has been hitting the timeouts. We can see copying files around is slow here https://zuul.opendev.org/t/opendev/build/3cbc8c070679496a8bf37b868284ff1d/log/job-output.txt#8737-873800:17
Ramereth[m]Is the upload happening from the mirror node you have hosted here or an external host?00:22
clarkbRamereth[m]: I think its disk io within the VM00:23
clarkbwhich I'm guessing is backed by ceph? so probably within the cloud?00:23
Ramereth[m]which VM is doing the upload?00:25
clarkbRamereth[m]: it is one of the single use VMs booted by zuul. For the currently running build that is np65006145d0154 with ip address 140.211.11.6800:26
clarkbit is still in its bootstrapping phase of the image build job and probably needs a bit of time to get to that slow data copy00:27
clarkblooking at the log for the historical run I think it was also slow when doing a qemu-img convert between raw and qcow2 as well as vhd. We don't need all of those image formats on our side so we may be able to optimize some of that on our end too00:28
clarkbcorvus: ^ fyi https://zuul.opendev.org/t/opendev/build/3cbc8c070679496a8bf37b868284ff1d/log/job-output.txt#9149-9152 we seem to be building all three image formats for arm64 when we only use one (because there is only one arm provider. Not sure if that should be qcow2 or raw yet)00:29
clarkbRamereth[m]: I suppose it is possible that building all the different image formats is a regression on our end that puts us over the timeout and that things on your side aren't necessarily any worse than they were before00:30
clarkbRamereth[m]: and apologies for the bother if that is the case. But sometimes being forced to really examine a problem leads to new discoveries...00:30
clarkbhrm I don't think this is a new regression on our end. https://opendev.org/opendev/zuul-providers/src/branch/master/zuul.d/image-build-jobs.yaml#L36 seems to be the only place we configure this and I suspect it has been that way for some time00:31
clarkbcorvus: can I simply add a build_diskimage_formats: raw entry to all of the arm64 image builds to limit us to raw there? I thought this was supposed to be autodetected based on provider config for some reason but I'm not finding that anywhere so I suspect that it may be this simple00:32
Ramereth[m]If you aren't using a raw image format, it will force it to download the image from the hypervisor and then upload it again I think00:32
Ramereth[m]raw will allow for CoW w/ ceph00:32
clarkbRamereth[m]: we are using raw. But our job is also converting raw to qcow2 and vhd so it's wasting time in that process00:33
Ramereth[m]ah, yeah that would be a problem00:33
clarkbRamereth[m]: I think if I limit our job to only raw (which is what we want for your cloud) then we may come in under the timeout00:33
clarkbRamereth[m]: so I would say if you don't see anything obviously wrong on your end then let us fix ^ and get back to you if the problem persists00:34
clarkbmnasiadka: ^ fyi since you were looking at this too00:34
clarkbI think where I'm confused is I thought we were auto selecting the image types to build (and convert) based on the provider needs. Maybe we were and that broke? or maybe I was always mistaken and we're too close to the time limit and need to restrict things now00:36
clarkbcorvus should be able to clarify whether or not the provider based autoselection was ever occurring00:36
clarkbheh and now it is failing early because it can't install the vhd conversion tool from our ppa00:39
clarkb(some sort of ppa problem)00:39
clarkbbut yet another reason to maybe limit where we try to do anything with vhd00:39
mnasiadkaclarkb: auto selecting works after the initial upload06:20
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: nodepool-base: Add firewalld zuul-console rule  https://review.opendev.org/c/opendev/zuul-providers/+/96796208:32
mnasiadkaFound out why rocky9/10 doesn’t stream console logs properly ^^08:32
mnasiadkaclarkb: looking at ^^ I think the arm64 provider has a bad day10:48
*** tkajinam is now known as Guest3183611:05
clarkbmnasiadka: looks like some of the builds succeeded, others timed out, and others hit nodeset failures. I suspect the nodeset failures are distinct from whatever is causing the timeouts15:02
clarkbmnasiadka: re auto selecting the image format do you know where that is plumbed through?15:02
clarkbmostly I'm thinking if we build all three the first time that isn't great if we're right on the edge of timeouts so wondering how we might be able to improve that. And maybe for arm we just force it to raw only?15:02
clarkbsince for arm we know we only want raw15:03
fungiclarkb: if memory serves, the gerrit->gitea linking is done with the [gitweb] section here: https://opendev.org/opendev/system-config/src/commit/37856f0/playbooks/roles/gerrit/templates/gerrit.config.j2#L160-L172 but also we disable gitiles with https://opendev.org/opendev/system-config/src/commit/37856f0/playbooks/roles/gerrit/files/gitiles.config15:05
clarkbfungi: hrm both of those configs should be the same in the test nodes and in production. I wonder why production has a working link but testing doesn't15:06
fungii'll poke around and see if i can spot it15:07
clarkbmnasiadka: https://zuul.opendev.org/t/opendev/build/c06f086292f040d6817aa3dac89fec95/log/job-output.txt#9776-9777 does seem to confirm that autoselection works for noble images15:11
opendevreviewClark Boylan proposed opendev/zuul-providers master: Limit arm64 image builds to producing raw images  https://review.opendev.org/c/opendev/zuul-providers/+/96802915:18
clarkbthat should be a noop because it only modifies existing images which should already be raw only. However, I figure we should update everything if we also update trixie.15:19
clarkbmnasiadka: ^ do you want to update the trixie builds to do that too? I figure it is still a good improvement while we try to sort out the other problems15:19
clarkbthe node failures appear to be due to Error scanning keys: Timeout connecting to 140.211.11.75 on port 2215:21
clarkbwhich might also point to slowness problems if things aren't booting quickly enough to have ssh keys?15:21
fungiclarkb: found the problem (or misunderstanding?) with the browse link. i think it's a user permission. log out of review.opendev.org and then look at https://review.opendev.org/admin/repos/opendev/bindep,general and you'll see browse is greyed out for you, sign in and it's clickable15:26
clarkboh interesting15:26
clarkbI wonder why that is when the gitea links generally work even when not logged in15:27
clarkbbut also that likely explains the observed difference between testing and prod and isn't an actual regression (which is good news)15:27
fungiyeah, it seems like an inconsistency in the permissions model, but doesn't look like it's necessarily changed behavior15:28
fungiit may have been this way for a very long time15:28
clarkbya I doubt the feature is used often so not surprised this behavior is easy to overlook15:28
clarkbthank you for checking. Do you want to update the entry on the etherpad and strike it off or should I?15:30
fungii added a comment about it, but can strike through it too15:30
corvusclarkb: i have some stuff to do this morning; can we avoid changing the image format stuff until i look at it later?15:32
clarkbcorvus: yes I'll WIP it15:32
clarkbinfra-root we've gotten a support request to the security incident list :/ anyway tl;dr is that loading openstack-helm helm charts from https://tarballs.opendev.org/openstack/openstack-helm is slow. I can confirm this behavior via my web browser and directly interacting with /afs/ on static. Looks like that directory has almost 15k entries in it which I suspect is part of the issue?16:35
clarkbfungi: is there an easy way to bump this over to service-discuss? But then as far as why this is slow, it seems to speed up once cached, but I'm wondering if this behavior of creating a new version of everything for each commit is generally compatible with afs16:36
clarkbsorry I meant to direct the question of email management to fungi. The afs question is more out loud to everyone16:37
clarkbin theory this is what we have git for16:37
clarkbhaving done some general poking around, it looks like things are definitely more responsive when there are fewer entries (openstacksdk has <500 and comes back quickly. Nova has <6k and comes back quicker than openstack-helm but more slowly than openstacksdk)16:39
clarkbcardoe: can this data be stored in subdirectories for each service rather than doing a flat giant directory? Additionally or alternatively do we really need a new object for each commit (or at least that is what it looks like in the directory listings)16:40
clarkbit looks like helm chart repositories do have indexes of some sort so I think things could be reorganized. I don't know how painful that would be to convert from the old layout to a new one though16:42
clarkbin any case I think the problem is afs doesn't handle large directories like this super quickly so sharding is helpful. openstack-helm is not currently sharding and has ~15k entries in a single directory. This makes the web responses slow (however they do respond, so helm's timeouts may also be too aggressive)16:44
clarkbthe default `helm repo add` --timeout value is 2 minutes16:45
clarkbfetching that page locally took wget 3 minutes and 12 seconds16:49
clarkbso ya definitely over that timeout16:49
clarkbthe actual data transfer took about 1 second after it started, so all the time is in creating the index.html on the apache side which requires reading all the necessary dirent metadata16:50
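
A quick way to reproduce the measurement above, using only the real tarballs URL; the helm line is the client-side workaround being suggested and assumes your helm version actually exposes the --timeout flag mentioned earlier (worth verifying):

# Time apache rendering the autoindex for the flat directory; the transfer
# itself finishes in about a second, the rest is index generation.
time wget -qO /dev/null https://tarballs.opendev.org/openstack/openstack-helm/

# Suggested workaround: raise the helm timeout past the 2 minute default.
helm repo add --timeout 5m0s openstack-helm https://tarballs.opendev.org/openstack/openstack-helm/
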
clarkbhttps://docs.openafs.org/AdminGuide/HDRWQ402.html talks about tuning the cache but it looks like it should be making decisions based on the cache size (which we set) so I'm not sure we need to do much more. But I'm trying to double check the cache size is reasonable now16:56
clarkbI think we set our cache to 50GB and we are using 42GB16:58
clarkbso ya best I can figure is "don't do that because afs" but I'm definitely not an expert here16:59
clarkbI think we should bump the thread over to service-discuss if we can do that easily then respond with a suggestion to increase the --timeout value to something like 5 minutes then work with the openstack-helm team to better organize the data17:01
clarkbmy suggestion would be openstack-helm/index.yaml then then points to openstack-helm/ironic/2024.2/ironic-2024.2-$helmversion17:03
clarkbIdeally it would be great if you could just have the one helm chart version for the one version of ironic, but if we can't have that then I think we should break it down by logical component then component version, then the helm chart version. That will likely make afs much happier17:05
clarkbcardoe: ^ fyi17:05
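
A hedged sketch of the reorganization suggested above, assuming chart tarballs follow the ironic-2024.2-<helmversion>.tgz naming implied by the proposed layout and that `helm repo index` will pick charts up from subdirectories (both assumptions worth verifying before doing this for real):

# Shard the flat directory by component and release, then rebuild index.yaml.
cd /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm
mkdir -p ironic/2024.2
mv ironic-2024.2-*.tgz ironic/2024.2/
helm repo index . --url https://tarballs.opendev.org/openstack/openstack-helm
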
clarkbI need to pop out for a bit now. I skipped normal morning tasks today and need to get some things done17:06
fungiclarkb: i think the helm tarballs slowness was brought up in here at one point a month or two back (maybe by mnaser?) and we pointed out at the time that it has waaaaaay too many entries in that directory17:08
clarkbfungi: ya I think if they broke things down how I suggest above it would make things happen on the openstacksdk scale of performance which is well under the 2 minute timeout17:08
fungias for redirecting a post, i would reply to service-discuss and set both reply-to and (if your mua will let you) mail-followup-to as service-discuss as well17:09
clarkbfungi: so I think if we can get that email redirected to our normal support channel then respond that this is a fundamental issue with how they've laid things out, suggest increasing the timeout as a workaround and then point them to openstack-helm to refactor that is our job done17:09
clarkbI think my mua is of the braindead variety. Not sure I can do that but will check17:09
clarkblooks like fastmail wants you to manage reply-to's as additional email addresses within your account rather than allowing you to set the header arbitrarily. I bet this is to avoid abuse but it makes doing what you suggest very difficult or maybe impossible? I think I would have to verify the address to use it17:12
clarkbanyway I really need to pop out now. fungi when I get back I can draft an email and maybe you can send it?17:13
clarkbthat way you don't have to do what I think is the difficult part but can avoid me figuring out a new client etc17:13
fungiclarkb: i'm not even sure it's necessarily afs at fault for making that slow, could easily be that apache autoindex is just not optimized for it17:14
fungias far as me sending something, i can't really cc the original poster's (gmail) address from my subscribed address, but i can try to work out something with one of my alternate addresses17:16
cardoeI would like to improve that as well.17:16
cardoeSo one thing I've noticed is that there's 2 charts built for every merge to master17:17
cardoeThe PTL has wanted the charts to remain generic and support multiple versions of the containers at once.17:17
cardoeHe's not on IRC at the moment.17:18
fungiand yeah, i just tested with `time ls /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm>/dev/null` on static.openstack.org (the server hosting the tarballs site) and it takes 0.029 seconds to complete, so i think it's apache autoindexing that's slow with 15k files to turn into an html index17:18
Clark[m]fungi: maybe I just quote the entire email and send it to discuss and cc the original sender. That may be a reasonable compromise 17:19
fungiyeah, that's probably easiest17:19
fungier, tested on static.opendev.org i meant17:19
Clark[m]Sharding like I suggest would help apache auto index too so I think that is still a viable option17:19
Clark[m]And I think is something that the helm team can drive from their end but we may have to clean up the old data when the new data is in place?17:20
fungianyway, getting a directory listing from afs is fast, looks like, but yeah generating an html file index on the fly is not17:20
Clark[m]fungi it may also be that the timestamp and file size data is slow to retrieve from afs and your ls didn't stat all that data 17:21
cardoeor give me an OCI registry and I'll push the charts there. :D17:21
fungifair, `ls -l` does seem to take longer17:21
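
For reference, the comparison being made here, run on static.opendev.org with the path from the earlier test:

# A bare readdir is quick; stat-ing all ~15k entries for the long listing is
# where the time goes, and that is roughly what apache autoindex must do too.
time ls /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm > /dev/null
time ls -l /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm > /dev/null
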
Clark[m]cardoe there are several that are free to use for open source projects. We use quay.io17:21
Clark[m]But I think this is solvable with the tools we have we just can't have such a large flat directory. The data lends itself to sharding in a simple manner and I think it will solve the problem17:22
cardoeRight now the project pushes to quay.io/airshipit17:22
cardoeIt used to push to docker.io/openstackhelm 17:23
cardoeThe tarballs.openstack.org thing is more of a compliance thing (and, well, all the docs pointing there)17:23
fungiany idea what it's complying with?17:24
* cardoe shrugs.17:25
cardoeThat's what I've been told.17:25
cardoeI would love to make some radical changes.17:25
fungimaybe i'm the only one who asks questions when hearing unsubstantiated claims about compliance17:25
fungibut then again i used to work in compliance17:26
cardoeThere's a lot of stuff that goes against the best practices guide published by helm. Or the even more extended common patterns published by bitnami previously and now by bwj-s that most of the world refers to and linting tools use.17:26
cardoeI would love to have a real quay org that conveys the project name instead of being centered on airship17:27
fungi(i guess i do still spend a fair amount of my work hours on regulatory compliance, for that matter)17:27
fungicardoe: yeah, basically openstack teams create quay namespaces and then set their jobs up to push to those17:28
fungisimilar to pushing to dockerhub17:28
opendevreviewEduardo Franca proposed openstack/project-config master: Rename Project to Kernel Module Management  https://review.opendev.org/c/openstack/project-config/+/96804717:29
corvusfungi: Clark cardoe keep in mind that afs has a dirent maximum of between 15k and 60k based on filename length.  my guess based on the average lengths i'm seeing in that dir is probably 32k for that particular directory.17:49
fungiaha, so it'll also likely fill up in the near-to-mid-term future if it doesn't get sharded or trimmed17:51
clarkboh so not only is performance a concern but there is a hard limit. Another reason for them to shard17:55
clarkbI'm working on figuring out a response now17:55
corvuswe can calculate the exact limit with some effort (it's a fixed number of blocks, and you get 15 bytes with the first block, and then an additional 32 bytes for each additional block the entry takes up).17:57
clarkbprobably not worth doing since we already know they should be sharding anyway17:58
corvusthe number of blocks will vary, but a lot of those entries are over the 15 char limit, so we're looking at 2 blocks for most of them.  i see some that will use 3 blocks.17:58
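
A back-of-the-envelope estimate of directory block usage based on corvus's description (15 name bytes fit in an entry's first block, each additional block holds another 32 bytes); the exact on-disk format details are an assumption here:

# Count entries and approximate the directory blocks consumed by their names.
ls /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm |
  awk '{ n = length($0);
         extra = (n > 15) ? int((n - 15 + 31) / 32) : 0;
         blocks += 1 + extra }
       END { print NR " entries using ~" blocks " directory blocks" }'
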
clarkbthough in my email I'm responding to a user not the openstack-helm folks so I'll basically be saying this is why this happens and this is how they can fix it. It would be great if you could engage with them to drive that to avoid us playing a game of telephone. Feel free to quote this response or have them engage with us on the technical details17:59
opendevreviewEduardo Franca proposed openstack/project-config master: Rename Project to Kernel Module Management  https://review.opendev.org/c/openstack/project-config/+/96804718:03
corvusclarkb: the image building friction is this: we can calculate the formats used for the image once the image is attached to a provider.  if it's not attached to any providers, there are image formats to use.  https://review.opendev.org/c/opendev/zuul-providers/+/966200 is creating the image and adding it to providers.18:07
corvusall of the zuul config for that change is evaluated dynamically, except providers.  they're like pipelines: they don't change until something merges.  so the moment that change merges, any images built after that point will have the format set correctly.18:07
corvusbut until it merges, we're using the default value that we (opendev) chose which is all 3 formats18:07
corvusmaybe we should just choose a different default value?18:07
clarkbcorvus: maybe defaulting to raw (which I think is most common and the default for dib) to bootstrap, then letting the provider values take over once the initial builds are done, is one approach18:08
corvusoh sorry, correction: s/there are image formats to use/there are NO image formats to use/.  sorry that was hard to read.18:08
clarkbcorvus: another might be to start arm64 builds with an explicit raw request since we know that arch only uses raw then we can optionally clean that up later once onboarded?18:09
corvusyeah, that's a possibility, but i don't love it because if we're unable to build all 3 (that's the case, right? like we're running out of time or space or something when building all 3 on arm?) then if we ever have to recover, we'll have to remember to do that again.18:10
corvus(was referring to second suggestion)18:10
clarkbcorvus: in this case we only need one because osuosl only supports the one. But as a general rule I think that is the case18:11
corvusyeah, so i think if we wanted to set the value explicitly, i'd say let's just do it for all arm images (like you did in your change) and leave it in place rather than remove it18:12
corvusthat way it's clear that there is a real limitation for those jobs18:12
corvusi think maybe my concern wasn't clear18:12
clarkbthat wfm, and I think it makes what is going on clear for any potential edits if we add a second arm64 provider18:13
corvusoh, well, actually, even if we lost all our images, we'd still have the correct provider configuration, so we're not likely to run into a problem in practice if we remove the format restriction after the initial merge.18:14
corvusso we're only likely to see this problem when we add new images.18:14
corvusbut still, there's a good argument to keep the restriction in the job: so that we copypasta them correctly when we make new images.18:15
corvusclarkb: qq: what about increasing the timeout for the arm image build jobs?18:16
clarkbcorvus: that was another idea I contemplated. I think we are close to the limit but we timed out during the first half of the vhd conversion so it is hard to say how much longer we need18:16
clarkbthat is why I preferred just using raw since we know we only need raw and we build that as the base image format then convert to others18:17
clarkbmaybe we want to do both things?18:17
clarkblimiting to raw as an optimization and increasing the timeout in an effort to avoid throwing away good work?18:17
corvusyeah, but it's still a one-time issue... i sort of like the idea that our jobs should be able to run the same on all the providers, so i think my first choice would be to increase the timeout (with the knowledge that we're only going to waste just the first time).  but then if we decide to only increase the timeout for arm (so the job definitions would be different anyway), then yeah, maybe the format restriction is better.18:19
corvusi think if we were to consider increasing the timeout globally, then that's my first choice.  if we want it tailored narrowly, then i like your format restriction change.18:20
clarkback I'm willing to start with the longer timeout. The current one is 2 hours. I guess bump to 3?18:20
corvusyeah, maybe let's see if that's enough... and if it isn't, then we'll do the other change so we don't set a crazy timeout.18:21
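
Roughly what the timeout bump looks like (zuul job timeouts are in seconds; the job name is one from this log and the surrounding zuul-providers structure is omitted, so treat this as a sketch rather than the literal change):

cat <<'EOF'
- job:
    name: opendev-build-diskimage-debian-trixie-arm64
    timeout: 10800  # 3 hours, up from the previous 2 hour limit
EOF
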
corvusclarkb: the results just came back on 968029 and they all failed18:22
clarkbthat is curious. Only one timed out. Most are near the timeout though18:23
clarkbmaybe not most, but several are near the timeout18:24
clarkbimplying even if restricting to raw we may need longer timeouts18:24
clarkbcorvus:  chown: cannot access '/opt/dib_tmp/dib-images/debian-bullseye-arm64.r': No such file or directory18:25
corvusoh i think the failures are yaml18:25
corvusi think it needs to be [raw] not raw18:26
clarkbcorvus: ya I think format: raw is being treated as three entries r, a, w18:26
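
For clarity, the two YAML spellings under discussion (variable name from the earlier conversation; the surrounding job definition is omitted):

cat <<'EOF'
# broken: a bare string gets iterated per character ("r", "a", "w")
build_diskimage_formats: raw
# working: a single-element list
build_diskimage_formats: [raw]
EOF
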
clarkbI can leave a comment about that but leave it wip and instead pivot to increasing the timeout18:26
clarkbthen that way if we return to the format selection idea we'll know what needs updating18:26
corvussgtm, and agree about the timeout analysis18:26
clarkbhttps://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VTMDDVSPM5HRUYWAATNMZOILT5OE57VR/ responded to the openstack helm question18:30
opendevreviewClark Boylan proposed opendev/zuul-providers master: Increase image builds timeouts to 3 hours  https://review.opendev.org/c/opendev/zuul-providers/+/96805218:34
clarkbcorvus: ^ I think that should do it18:34
corvusclarkb:  +2; we could probably merge now if you think that's appropriate.18:36
corvusi think that will probably run all the jobs18:37
corvusso that might be something to squash into the arm change for expediency?18:37
clarkbcorvus: into the add trixie change you mean?18:37
clarkb*add trixie arm64 change18:37
corvuser yeah that18:37
clarkbsure let me combine the two18:37
clarkbif I abandon 968052 zuul should abandon its jobs right?18:40
clarkbor dequeue them18:40
opendevreviewClark Boylan proposed opendev/zuul-providers master: Add trixie-arm64  https://review.opendev.org/c/opendev/zuul-providers/+/96620018:40
clarkbyes looks like it dequeued perfect18:41
clarkbhttps://zuul.opendev.org/t/opendev/build/21e1aa2d3e4b409087c2811b1516fd2d/log/job-output.txt#4040-4049 how does this happen when the other image builds are all apparently happy?18:49
clarkboh maybe its another corrupted cache in an image?18:50
clarkbthat build ran in rax flex sjc3. Lets see if more hit this problem and if they all belong to that cloud region18:50
clarkbdoes anyone remember what we did to improve debugging around that case? I feel like we did something but my brain is fried this friday and I can't remember what18:51
fungias far as debugging corrupt git caches?18:51
clarkbya18:52
fungii don't remember either (just trying to be clear about what it is i don't remember)18:52
clarkback, if we see additional errors of this sort from the same provider we should probably pull on that thread and dig into what we did before beyond simply deleting the suspected bad image (if anything)18:53
clarkbreading the error messages it is complaining about loose objects. I think when you do a git fetch you get packs always so yes this does imply the data already on disk is bad?18:55
clarkblast time I booted the noble image and ran git fsck across all repos in the cache to confirm the problem18:55
clarkband then we 'resolved' it by deleting the image from the cloud18:55
clarkbI think step 1 here is to see if any other builds fail the same way and try to correlate to this one location and if that happens step 2 is booting a node out of band and checking it with git fsck18:59
clarkbthen if we confirm this is the problem step 3 is figuring out what if anything we did last time beyond deleting the bad image and then figure out potential causes or debugging from there18:59
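
A sketch of the cache check described above, run on a node booted from the suspect image; the /opt/git path matches what shows up later in this log:

# Walk the on-image git cache and flag any repo that fails an object check.
find /opt/git -name .git -type d | while read -r gitdir; do
    repo=${gitdir%/.git}
    git -C "$repo" fsck --full > /dev/null 2>&1 || echo "corrupt: $repo"
done
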
fungisounds right to me19:00
clarkbhttps://zuul.opendev.org/t/opendev/build/6d12ece0bfbc48d0bb497a5b42b4c8a3/log/job-output.txt#4011 is a second case and in the same provider so ya I think this is now the likely issue19:02
clarkbfor my own sanity I may not dig into that until after lunch. I'm happy to support someone else who looks into it too19:05
clarkbphoronix reports that git 3.0 will switch the default branch name to main and is expected to release by the end of 2026. It is also expected to have working sha256-sha1 interoperability19:15
clarkbI think opendev is pretty close to being set to flip the switch on branch name defaults. We just have to change the gitea management and gerrit management defaults. It is possible to work with main in zuul today as well so I don't expect major problems but worth calling out19:16
clarkbcorvus: ^ you may be interested in that from a zuul perspective too19:16
fungiis git 3.0 protocol going to work with gerrit?19:22
fungii mean, eventually yes i know, just wondering if gerrit (and gitea) are set up to handle the new protocol and hash length and such19:23
clarkbfungi: my understanding is that using sha256 by default but interoperating with sha1 is the goal19:23
clarkbso I'm assuming that git 3.0 will fallback to speaking whatever old git server speaks if it can't speak the new thing19:23
clarkbbut ya I think that is a big open question too19:24
clarkbgit currently has a number of fallbacks. protocol v2  vs v1, and smart http vs dumb http19:25
clarkbonce we're generally caught up on gerrit stuff it is probably a good idea to ask upstream what if any impacts they anticipate from the git 3.0 release19:27
mnaserFYI as a Zuul user there a bit of intricacies with assumptions around "master" being the default branch name19:28
mnaserwe had to do a bit of configuring to make it work well with "main"19:28
clarkbyes you have to configure it per repo using main today19:29
clarkbbut once you do that the fallback behavior that default to master should just work19:29
mnaseri think in our case it's only been an issue for the config projects that we had to afaik19:29
mnasernormal projects i believe use the git provider configs (so github branch protection or default branch from gerit)19:29
clarkbfungi: looking at https://git-scm.com/docs/BreakingChanges#_changes I don't see anything indicating that git 3.0 won't be able to talk to older servers. The main issue is likely if you git init on sha256 locally then try to push that into an older server19:30
clarkbfungi: but the implication seems to be that existing repos should interoperate.19:30
clarkbthen there is the reftable change which jgit actually implemented first so is likely to be ahead of the game on that aspect of the 3.0 changes19:31
clarkbin any case my takeaway from this is we'll probably have to be careful about which tools we use to create and maybe fsck repos going forward but gerrit should handle them as long as we don't jump too far ahead of it in terms of hashes19:31
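
For anyone wanting to poke at this locally, sha256 repositories can already be created explicitly today; the sha1 interoperability is the part git 3.0 is expected to add:

# Create a sha256 repository and confirm the object format in use.
git init --object-format=sha256 sha256-demo
git -C sha256-demo rev-parse --show-object-format   # prints: sha256
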
fungiyeah, mostly wondering how long before new repos we create in gerrit will have the longer hashes19:32
fungiand whether projects will want to convert19:32
clarkbhttps://github.com/eclipse-jgit/jgit/issues/7319:36
clarkbgerrit does not support sha256 today because jgit does not. They seem to be aware of the upstream timeline19:36
clarkbthis will almost certainly require us to upgrade to a version of gerrit newer than any released today to gain that support19:37
clarkbso that is probably the biggest impact to us19:37
clarkbI guess all the more reason to figure out the 3.11 upgrade but I've been distracted so far today19:39
fungiopendev-build-diskimage-debian-trixie-arm64 ended in node_failure this time20:39
fungias did opendev-build-diskimage-ubuntu-noble-arm6420:39
fungia bunch of other arm64 image builds succeeded though20:40
fungii'm guessing the amd64 build failures are all from the corrupt git cache (the one i spot-checked was)20:42
fungiand it ran in flex sjc320:43
clarkbfungi: ya I'm finishing up lunch and can boot a noble node in sjc3 to git fsck the cache in a bit20:50
fungithough is there much point if the fix is to revert/replace the image in that region?21:03
clarkbfungi: mostly to confirm this is the issue so that we can escalate to the cloud provider21:06
clarkbthis is interesting: simply running find or ls on the files in question produces the error21:07
clarkb`find ./ -name .git -type d` hits it21:07
fungiso sounds like the filesystem is corrupt?21:07
fungiwhich could mean the image is corrupt at the block level too21:08
clarkbcertainly seems that way21:08
fungianything in dmesg?21:08
clarkb159.135.207.215 is the ip addr I booted it with infra root keys if you want to look too21:08
clarkb`ls -l /opt/git/opendev.org/openstack/python-ironicclient/.git/objects/7a` seems to be a trivial reproduction case21:09
clarkbEXT4-fs error (device vda1): htree_dirblock_to_tree:1083: inode #2099948: comm ls: Directory block failed checksum21:09
fungiyeah, a ton of EXT4-fs hits in dmesg21:09
clarkbso ya I think the fs is corrupted which has impacted the git repo21:09
clarkbcorvus: is there any way to mark the image as not to be used but otherwise preserve it for the cloud to debug with?21:10
clarkbI don't think there is21:10
fungiwe might be able to copy it?21:11
clarkbwe could disable the region entirely. I think that may be worth considering because this is the second time we have seen this problem occur in the same cloud21:11
clarkbfungi: I worry that any data transfer stuff like that may make it harder to debug the problem (either by masking the issue or decoupling attributes that are important)21:11
fungibut yeah, a copy would probably end up on different backend blocks21:11
clarkbtl;dr seems to be that we occasionally get corrupted images in this one particular cloud. That seems like the sort of issue that as cloud users we wouldn't want to happen and as recipients of free cloud quota we should do our best to help debug21:12
fungihas it only been in sjc3 so far?21:13
clarkbI think so. The last time this happened appears to have also been sjc3 based on my command history on bridge21:13
clarkbI was able to copy the old boot server command and only had to change the image id21:13
clarkbmy proposal is to disable sjc3 via server boot limits, then send an email to rax with the details we know21:14
fungiwe upload qcow2 or raw? do the checksums in glance match between there and other regions? in reference to recent discussions about the usefulness or pointlessness of glance checksums21:14
clarkbthis is interesting I'm getting errors trying to run image list and image show against sjc3 right now. let me try other regions21:15
fungiin theory glance provides a checksum so that nova can verify the image copy it has isn't altered/corrupted21:15
clarkbya I mean I have strong feelings (but not clear memory) that the entire glance checksum system is smoke and mirrors and doesn't actually help at all21:16
fungiand i agree wrt disabling sjc3 temporarily. i doubt we'll be running close to capacity for the next week21:17
clarkbwe upload a qcow2 to sjc3 according to my old shows21:17
clarkbso ya maybe they are using raw images on the backend and the conversion is corrupting the image. Or maybe the storage system for glance has bad disks or bad memory21:18
fungiglance checksums, aiui, record what gets stored/served to other services, so reflects the post-conversion-task state of the image if applicable21:18
fungi(not necessarily a checksum of what file was uploaded to the api)21:19
clarkbooh this is interesting the corresponding image in IAD3 lists the same owner_specified.openstack.md5='801dc3ed32c4910d9fc7dd636f7cf376' value as sjc321:19
clarkbthat iad3 image says checksum         | 801dc3ed32c4910d9fc7dd636f7cf37621:19
clarkbwhich coincidentally matches the md5 we supply which is good I would expect that to be the case21:20
clarkbchecksum         | e22ce366197c876b45efb75d629a4dd821:20
clarkbthat is the checksum for the image in sjc321:20
clarkbso what is the point of supplying a checksum if glance will happily store different data and boot it later?21:20
clarkbmaybe this is why I have these feelings about glance checksums not mattering21:21
clarkbits not that it isn't checking things its that it doesn't take appropriate action when they differ21:21
clarkbsjc3 image name: ubuntu-noble-0e9f85a5dc5440c78366a363a999448f. iad3 image name: ubuntu-noble-5b0c8ce4c6c44f6099dda90b73e51e0421:21
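
The comparison described above, roughly; the image names are from this log, while the --os-cloud and region spellings are assumptions about the local clouds.yaml:

# Compare the checksum glance computed in each region against the uploaded value.
openstack --os-cloud raxflex --os-region-name SJC3 image show \
  ubuntu-noble-0e9f85a5dc5440c78366a363a999448f -f value -c checksum -c properties
openstack --os-cloud raxflex --os-region-name IAD3 image show \
  ubuntu-noble-5b0c8ce4c6c44f6099dda90b73e51e04 -f value -c checksum -c properties
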
fungiso if it weren't for some clouds doing conversions or other automated alterations, we could compare owner_specified.openstack.md5 and checksum to detect this21:26
clarkbfungi: yes though glance could (and imo should) checksum the pre-conversion value and report that (at least in addition)21:27
clarkbbecause then we'd know if we failed at the first step at least but it doesn't do that aiui 21:27
opendevreviewClark Boylan proposed opendev/zuul-providers master: Disable raxflex sjc3  https://review.opendev.org/c/opendev/zuul-providers/+/96808221:29
clarkbI think the commit message there sums it up pretty well21:29
clarkbI can work on an email that repeats that info and links to ^ and send that to rackspace folks21:30
clarkbfungi: is there any other information that you think would be useful?21:30
clarkbmaybe offer to add their ssh keys to my test node?21:31
fungino, that sums it up nicely21:32
opendevreviewMerged opendev/zuul-providers master: Disable raxflex sjc3  https://review.opendev.org/c/opendev/zuul-providers/+/96808221:32
clarkbI think the launcher will automatically age that image out over the next few days too which might mean it is gone after the weekend anyway21:32
fungiwe can offer, but i doubt having shell access to a vm booted from the image will be much help to tracking down why/where it got corrupted21:32
clarkbkeeping the node booted might help preserve things at least21:33
clarkblast time this happened I booted the test node october 621:33
clarkbso ya twice in ~ 2 months or so21:33
fungioh good point21:36
fungithough it may not block image deletion since we're not doing bfv there afaik21:38
clarkboh right21:42
clarkbfungi: I could boot a bfv node to do that probably21:42
clarkbwhy don't I try that21:43
funginot a bad idea21:43
clarkbthis server is spending a lot of time in the BUILD state21:49
clarkboh it's a qcow2 image that it probably has to convert to raw for bfv?21:49
clarkbI'll be patient21:49
clarkbok bfv instance is up and running at 66.70.103.231 and exhibits the same issue21:52
clarkbI'm going to clean up the non bfv instance now21:52
clarkbok I sent an email22:15
clarkbarg the image show output did not format nicely...22:16
