| Ramereth[m] | clarkb: No I hadn't noticed that. I don't know of any issues that might cause that. Is it still happening? | 00:15 |
|---|---|---|
| clarkb | Ramereth[m]: the last occurrence we have knowledge of was yesterday. I can retrigger the image build now to see if it is still happening | 00:16 |
| clarkb | it is also possible something else is going on, but the slow image builds were previously explained by the io problems iirc | 00:16 |
| Ramereth[m] | Oh, I think I know what might be causing it. I moved the nodes to a different 10g in preparation for the upcoming DC move | 00:17 |
| Ramereth[m] | let me look at the switch | 00:17 |
| clarkb | Ramereth[m]: https://review.opendev.org/c/opendev/zuul-providers/+/966200 this change in particular to add debian trixie arm64 images has been hitting the timeouts. We can see copying files around is slow here https://zuul.opendev.org/t/opendev/build/3cbc8c070679496a8bf37b868284ff1d/log/job-output.txt#8737-8738 | 00:17 |
| Ramereth[m] | Is the upload happening from the mirror node you have hosted here or an external host? | 00:22 |
| clarkb | Ramereth[m]: I think its disk io within the VM | 00:23 |
| clarkb | which I'm guessing is backed by ceph? so probably within the cloud? | 00:23 |
| Ramereth[m] | which VM is doing the upload? | 00:25 |
| clarkb | Ramereth[m]: it is one of the single use VMs booted by zuul. For the currently running build that is np65006145d0154 with ip address 140.211.11.68 | 00:26 |
| clarkb | it is still in its bootstrapping phase of the image build job and probably needs a bit of time to get to that slow data copy | 00:27 |
| clarkb | looking at the log for the historical run I think it was also slow when doing a qemu-img convert between raw and qcow2 as well as vhd. We don't need all of those image versions on our side so we may be able to optimize some of that on our end too | 00:28 |
| clarkb | corvus: ^ fyi https://zuul.opendev.org/t/opendev/build/3cbc8c070679496a8bf37b868284ff1d/log/job-output.txt#9149-9152 we seem to be building all three image formats for arm64 when we only use one (because there is only one arm provider. Not sure if that should be qcow2 or raw yet) | 00:29 |
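
For context, the conversion step being discussed is roughly the following; this is a hedged sketch of generic qemu-img invocations for producing qcow2 and vhd from a raw base image, not the exact commands the image build job runs (its vhd handling goes through a separate tool from a ppa, as noted later in the log), and the filenames are illustrative.

```bash
# Sketch: converting a raw diskimage into the two additional formats the job was producing
qemu-img convert -f raw -O qcow2 debian-trixie-arm64.raw debian-trixie-arm64.qcow2
qemu-img convert -f raw -O vpc   debian-trixie-arm64.raw debian-trixie-arm64.vhd
```
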
| clarkb | Ramereth[m]: I suppose it is possible that building all the different image formats is a regression on our end that puts us over the timeout and that things on your side aren't necessarily any worse than they were before | 00:30 |
| clarkb | Ramereth[m]: and apologies for the bother if that is the case. But sometimes being forced to really examine a problem leads to new discoveries... | 00:30 |
| clarkb | hrm I don't think this is a new regression on our end. https://opendev.org/opendev/zuul-providers/src/branch/master/zuul.d/image-build-jobs.yaml#L36 seems to be the only place we configure this and I suspect it has been that way for some time | 00:31 |
| clarkb | corvus: can I simply add a build_diskimage_formats: raw entry to all of the arm64 image builds to limit us to raw there? I thought this was supposed to be autodetected based on provider config for some reason but I'm not finding that anywhere so I suspect that it may be this simple | 00:32 |
| Ramereth[m] | If you aren't using a raw image format, it will force it to download the image from the hypervisor and then upload it again I think | 00:32 |
| Ramereth[m] | raw will allow for CoW w/ ceph | 00:32 |
| clarkb | Ramereth[m]: we are using raw. But our job is also converting raw to qcow2 and vhd so its wasting time in that process | 00:33 |
| Ramereth[m] | ah, yeah that would be a problem | 00:33 |
| clarkb | Ramereth[m]: I think if I limit our job to only raw (which is what we want for your cloud) then we may come in under the timeout | 00:33 |
| clarkb | Ramereth[m]: so I would say if you don't see anything obviously wrong on your end then let us fix ^ and get back to you if the problem persists | 00:34 |
| clarkb | mnasiadka: ^ fyi since you were looking at this too | 00:34 |
| clarkb | I think where I'm confused is I thought we were auto selecting the image types to build (and convert) based on the provider needs. Maybe we were and that broke? or maybe I was always mistaken and we're too close to the time limit and need to restrict things now | 00:36 |
| clarkb | corvus should be able to clarify whether or not the provider based autoselection was ever occurring | 00:36 |
| clarkb | heh and now it is failing early because it can't install the vhd conversion tool from our ppa | 00:39 |
| clarkb | (some sort of ppa problem) | 00:39 |
| clarkb | but yet another reason to maybe limit where we try to do anything with vhd | 00:39 |
| mnasiadka | clarkb: auto selecting works after the initial upload | 06:20 |
| opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: nodepool-base: Add firewalld zuul-console rule https://review.opendev.org/c/opendev/zuul-providers/+/967962 | 08:32 |
| mnasiadka | Found out why rocky9/10 doesn’t stream console logs properly ^^ | 08:32 |
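
A minimal sketch of the equivalent manual firewalld commands, assuming the default zuul_console streaming port of 19885; the actual fix is the Ansible change in the review linked above.

```bash
# Sketch: open the zuul_console log streaming port on a firewalld-enabled node
sudo firewall-cmd --permanent --add-port=19885/tcp
sudo firewall-cmd --reload
```
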
| mnasiadka | clarkb: looking at ^^ I think the arm64 provider has a bad day | 10:48 |
| *** | tkajinam is now known as Guest31836 | 11:05 |
| clarkb | mnasiadka: looks like some of the builds succeeded, others timed out, and others hit nodeset failures. I suspect the nodeset failures are distinct from whatever is causing the timeouts | 15:02 |
| clarkb | mnasiadka: re auto selecting the image format do you know where that is plumbed through? | 15:02 |
| clarkb | mostly I'm thinking if we build all three the first time that isn't great if we're right on the edge of timeouts so wondering how we might be able to improve that. And maybe for arm we just force it to raw only? | 15:02 |
| clarkb | since for arm we know we only want raw | 15:03 |
| fungi | clarkb: if memory serves, the gerrit->gitea linking is done with the [gitweb] section here: https://opendev.org/opendev/system-config/src/commit/37856f0/playbooks/roles/gerrit/templates/gerrit.config.j2#L160-L172 but also we disable gitiles with https://opendev.org/opendev/system-config/src/commit/37856f0/playbooks/roles/gerrit/files/gitiles.config | 15:05 |
| clarkb | fungi: hrm both of those configs should be the same in the test nodes and in production. I wonder why production has a working link but testing doesn't | 15:06 |
| fungi | i'll poke around and see if i can spot it | 15:07 |
| clarkb | mnasiadka: https://zuul.opendev.org/t/opendev/build/c06f086292f040d6817aa3dac89fec95/log/job-output.txt#9776-9777 does seem to confirm that autoselection works for noble images | 15:11 |
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Limit arm64 image builds to producing raw images https://review.opendev.org/c/opendev/zuul-providers/+/968029 | 15:18 |
| clarkb | that should be a noop because it only modified existing images which should already be raw only. However, I figure we should update everything if we also update trixie. | 15:19 |
| clarkb | mnasiadka: ^ do you want to update the trixie builds to do that too? I figure it is still a good improvement while we try to sort out the other problems | 15:19 |
| clarkb | the node failures appear to be due to Error scanning keys: Timeout connecting to 140.211.11.75 on port 22 | 15:21 |
| clarkb | which might also point to slowness problems if things aren't booting quickly enough to have ssh keys? | 15:21 |
| fungi | clarkb: found the problem (or misunderstanding?) with the browse link. i think it's a user permission. log out of review.opendev.org and then look at https://review.opendev.org/admin/repos/opendev/bindep,general and you'll see browse is greyed out for you, sign in and it's clickable | 15:26 |
| clarkb | oh interesting | 15:26 |
| clarkb | I wonder why that is when the gitea links generally work even when not logged in | 15:27 |
| clarkb | but also that likely explains the observed difference between testing and prod and isn't an actual regression (which is good news) | 15:27 |
| fungi | yeah, it seems like an inconsistency in the permissions model, but doesn't look like it's necessarily changed behavior | 15:28 |
| fungi | it may have been this way for a very long time | 15:28 |
| clarkb | ya I doubt the feature is used often so not surprised this behavior is easy to overlook | 15:28 |
| clarkb | thank you for checking. Do you want to update the entry on the etherpad and strike it off or should I? | 15:30 |
| fungi | i added a comment about it, but can strike through it too | 15:30 |
| corvus | clarkb: i have some stuff to do this morning; can we avoid changing the image format stuff until i look at it later? | 15:32 |
| clarkb | corvus: yes I'll WIP it | 15:32 |
| clarkb | infra-root we've gotten a support request to the security incident list :/ anyway tl;dr is that loading openstack-helm helm charts from https://tarballs.opendev.org/openstack/openstack-helm is slow. I can confirm this behavior via my web browser and directly interacting with /afs/ on static. Looks like that directory has almost 15k entries in it which I suspect is part of the issue? | 16:35 |
| clarkb | fungi: is there an easy way to bump this over to service-discuss? As far as why this is slow, it seems to speed up once cached, but I'm wondering if this behavior of creating a new version of everything for each commit is generally compatible with afs | 16:36 |
| clarkb | sorry I meant to direct the question of email management to fungi. The afs question is more out loud to everyone | 16:37 |
| clarkb | in theory this is what we have git for | 16:37 |
| clarkb | having done some general poking around it looks like things are definitely more responsive when there are fewer entries (openstacksdk has <500 and comes back quickly. Nova has <6k and comes back quicker than openstack-helm but more slowly than openstacksdk) | 16:39 |
| clarkb | cardoe: can this data be stored in subdirectories for each service rather than doing a flat giant directory? Additionally or alternatively do we really need a new object for each commit (or at least that is what it looks like in the directory listings) | 16:40 |
| clarkb | it looks like helm chart repositories do have indexes of some sort so I think things could be reorganized. I don't know how painful that would be to convert from the old layout to a new one though | 16:42 |
| clarkb | in any case I think the problem is afs doesn't handle large directories like this super quickly so sharding is helpful. openstack-helm is not currently sharding and has ~15k entries in a single directory. This makes the web responses slow (however they do respond so helm's timeouts may also be too aggressive) | 16:44 |
| clarkb | the default `helm repo add` --timeout value is 2 minutes | 16:45 |
| clarkb | fetching that page locally took wget 3 minutes and 12 seconds | 16:49 |
| clarkb | so ya definitely over that timeout | 16:49 |
| clarkb | the actual data transfer took about 1 second after it started, so all the time is in creating the index.html on the apache side which requires reading all the necessary dir ent metadata | 16:50 |
| clarkb | https://docs.openafs.org/AdminGuide/HDRWQ402.html talks about tuning the cache but it looks like it should be making decisions based on the cache size (which we set) so I'm not sure we need to do much more. But I'm trying to double check the cache size is reasonable now | 16:56 |
| clarkb | I think we set our cache to 50GB and we are using 42GB | 16:58 |
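
A sketch of how the AFS client cache usage can be checked on the server; the numbers in the comment are illustrative, not readings taken from static.

```bash
# Sketch: report AFS cache manager usage vs its configured size (run on the AFS client)
fs getcacheparms
# e.g. "AFS using 44040192 of the cache's available 52428800 1K byte blocks."
```
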
| clarkb | so ya best I can figure is "don't do that because afs" but I'm definitely not an expert here | 16:59 |
| clarkb | I think we should bump the thread over to service-discuss if we can do that easily then respond with a suggestion to increase the --timeout value to something like 5 minutes then work with the openstack-helm team to better organize the data | 17:01 |
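
A sketch of the workaround for chart consumers, assuming the --timeout flag mentioned above; the repository URL is the one from the report and the 5 minute value is the suggestion being made here.

```bash
# Sketch: add the openstack-helm repo with a longer timeout than the 2 minute default
helm repo add openstack-helm https://tarballs.opendev.org/openstack/openstack-helm --timeout 5m
```
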
| clarkb | my suggestion would be openstack-helm/index.yaml that then points to openstack-helm/ironic/2024.2/ironic-2024.2-$helmversion | 17:03 |
| clarkb | Ideally it would be great if you could just have the one helm chart version for the one version of ironic, but if we can't have that then I think we should break it down by logical component then component version, then the helm chart version. That will likely make afs much happier | 17:05 |
| clarkb | cardoe: ^ fyi | 17:05 |
| clarkb | I need to pop out for a bit now. I skipped normal morning tasks today and need to get some things done | 17:06 |
| fungi | clarkb: i think the helm tarballs slowness was brought up in here at one point a month or two back (maybe by mnaser?) and we pointed out at the time that it has waaaaaay too many entries in that directory | 17:08 |
| clarkb | fungi: ya I think if they broke things down how I suggest above it would make things happen on the openstacksdk scale of performance which is well under the 2 minute timeout | 17:08 |
| fungi | as for redirecting a post, i would reply to service-discuss and set both reply-to and (if your mua will let you) mail-followup-to as service-discuss as well | 17:09 |
| clarkb | fungi: so I think if we can get that email redirected to our normal support channel then respond that this is a fundamental issue with how they've laid things out, suggest increasing the timeout as a workaround and then point them to openstack-helm to refactor that is our job done | 17:09 |
| clarkb | I think my mua is of the braindead variety. Not sure I can do that but will check | 17:09 |
| clarkb | looks like fastmail wants you to manage reply-to's as additional email addresses within your account rather than allowing you to set the header arbitrarily. I bet this is to avoid abuse but it makes doing what you suggest very difficult or maybe impossible? I think I would have to verify the address to use it | 17:12 |
| clarkb | anyway I really need to pop out now. fungi when I get back I can draft an email and maybe you can send it? | 17:13 |
| clarkb | that way you don't have to do what I think is the difficult part but can avoid me figuring out a new client etc | 17:13 |
| fungi | clarkb: i'm not even sure it's necessarily afs at fault for making that slow, could easily be that apache autoindex is just not optimized for it | 17:14 |
| fungi | as far as me sending something, i can't really cc the original poster's (gmail) address from my subscribed address, but i can try to work out something with one of my alternate addresses | 17:16 |
| cardoe | I would like to improve that as well. | 17:16 |
| cardoe | So one thing I've noticed is that there's 2 charts built for every merge to master | 17:17 |
| cardoe | The PTL has wanted the charts to remain generic and support multiple versions of the containers at once. | 17:17 |
| cardoe | He's not on IRC at the moment. | 17:18 |
| fungi | and yeah, i just tested with `time ls /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm>/dev/null` on static.openstack.org (the server hosting the tarballs site) and it takes 0.029 seconds to complete, so i think it's apache autoindexing that's slow with 15k files to turn into an html index | 17:18 |
| Clark[m] | fungi: maybe I just quote the entire email and send it to discuss and cc the original sender. That may be a reasonable compromise | 17:19 |
| fungi | yeah, that's probably easiest | 17:19 |
| fungi | er, tested on static.opendev.org i meant | 17:19 |
| Clark[m] | Sharding like I suggest would help apache auto index too so I think that is still a viable option | 17:19 |
| Clark[m] | And I think is something that the helm team can drive from their end but we may have to clean up the old data when the new data is in place? | 17:20 |
| fungi | anyway, getting a directory listing from afs is fast, looks like, but yeah generating an html file index on the fly is not | 17:20 |
| Clark[m] | fungi it may also be that the timestamp and file size data is slow to retrieve from afs and your ls didn't stat all that data | 17:21 |
| cardoe | or give me an OCI registry and I'll push the charts there. :D | 17:21 |
| fungi | fair, `ls -l` does seem to take longer | 17:21 |
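
A sketch of the comparison being made here: the plain listing only reads directory entries, while the long listing also stats every entry, which is where the per-file metadata cost shows up.

```bash
# Sketch: entry listing vs stat-heavy listing on the same AFS directory
time ls    /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm > /dev/null
time ls -l /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm > /dev/null
```
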
| Clark[m] | cardoe there are several that are free to use for open source projects. We use quay.io | 17:21 |
| Clark[m] | But I think this is solvable with the tools we have we just can't have such a large flat directory. The data lends itself to sharding in a simple manner and I think it will solve the problem | 17:22 |
| cardoe | Right now the project pushes to quay.io/airshipit | 17:22 |
| cardoe | It used to push to docker.io/openstackhelm | 17:23 |
| cardoe | The tarballs.openstack.org is more compliance thing (and well all the docs pointing there) | 17:23 |
| fungi | any idea what it's complying with? | 17:24 |
| * cardoe | shrugs. | 17:25 |
| cardoe | That's what I've been told. | 17:25 |
| cardoe | I would love to make some radical changes. | 17:25 |
| fungi | maybe i'm the only one who asks questions when hearing unsubstantiated claims about compliance | 17:25 |
| fungi | but then again i used to work in compliance | 17:26 |
| cardoe | There's a lot of stuff that goes against the best practices guide published by helm. Or the even more extended common patterns published by bitnami previously and now by bwj-s that most of the world refers to and linting tools use. | 17:26 |
| cardoe | I would love to have a real quay org that conveys the project name instead of being centered on airship | 17:27 |
| fungi | (i guess i do still spend a fair amount of my work hours on regulatory compliance, for that matter) | 17:27 |
| fungi | cardoe: yeah, basically openstack teams create quay namespaces and then set their jobs up to push to those | 17:28 |
| fungi | similar to pushing to dockerhub | 17:28 |
| opendevreview | Eduardo Franca proposed openstack/project-config master: Rename Project to Kernel Module Management https://review.opendev.org/c/openstack/project-config/+/968047 | 17:29 |
| corvus | fungi: Clark cardoe keep in mind that afs has a dirent maximum of between 15k and 60k based on filename length. my guess based on the average lengths i'm seeing in that dir is probably 32k for that particular directory. | 17:49 |
| fungi | aha, so it'll also likely fill up in the near-to-mid-term future if it doesn't get sharded or trimmed | 17:51 |
| clarkb | oh so not only is performance a concern but there is a hard limit. Another reason for them to shard | 17:55 |
| clarkb | I'm working on figuring out a response now | 17:55 |
| corvus | we can calculate the exact limit with some effort (it's a fixed number of blocks, and you get 15 bytes with the first block, and then an additional 32 bytes for each additional block the entry takes up). | 17:57 |
| clarkb | probably not worth doing since we already know they should be sharding anyway | 17:58 |
| corvus | the number of blocks will vary, but a lot of those entries are over the 15 char limit, so we're looking at 2 blocks for most of them. i see some that will use 3 blocks. | 17:58 |
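
A rough sketch of the per-entry block math corvus describes, assuming the stated formula (15 name bytes fit in the first block, each additional block holds 32 more); the filenames are illustrative.

```bash
# Sketch: estimate how many directory blocks a single AFS dirent consumes
dirent_blocks() {
  local len=${#1} extra
  extra=$(( len > 15 ? len - 15 : 0 ))
  echo $(( 1 + (extra + 31) / 32 ))
}
dirent_blocks "short.tgz"                          # -> 1 block
dirent_blocks "openstack-helm-infra-2024.2.15.tgz" # -> 2 blocks (34 character name)
```
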
| clarkb | though in my email I'm responding to a user not the openstack-helm folks so I'll basically be saying this is why this happens and this is how they can fix it. It would be great if you could engage with them to drive that to avoid us playing a game of telephone. Feel free to quote this response or have them engage with us on the technical details | 17:59 |
| opendevreview | Eduardo Franca proposed openstack/project-config master: Rename Project to Kernel Module Management https://review.opendev.org/c/openstack/project-config/+/968047 | 18:03 |
| corvus | clarkb: the image building friction is this: we can calculate the formats used for the image once the image is attached to a provider. if it's not attached to any providers, there are image formats to use. https://review.opendev.org/c/opendev/zuul-providers/+/966200 is creating the image and adding it to providers. | 18:07 |
| corvus | all of the zuul config for that change is evaluated dynamically, except providers. they're like pipelines: they don't change until something merges. so the moment that change merges, any images built after that point will have the format set correctly. | 18:07 |
| corvus | but until it merges, we're using the default value that we (opendev) chose which is all 3 formats | 18:07 |
| corvus | maybe we should just choose a different default value? | 18:07 |
| clarkb | corvus: maybe defaulting to raw (which I think is most common and the default for dib) to bootstrap, then letting the provider values take over once the initial builds are done, is one approach | 18:08 |
| corvus | oh sorry, correction: s/there are image formats to use/there are NO image formats to use/. sorry that was hard to read. | 18:08 |
| clarkb | corvus: another might be to start arm64 builds with an explicit raw request since we know that arch only uses raw then we can optionally clean that up later once onboarded? | 18:09 |
| corvus | yeah, that's a possibility, but i don't love it because if we're unable to build all 3 (that's the case, right? like we're running out of time or space or something when building all 3 on arm?) then if we ever have to recover, we'll have to remember to do that again. | 18:10 |
| corvus | (was referring to second suggestion) | 18:10 |
| clarkb | corvus: in this case we only need one because osuosl only supports the one. But as a general rule I think that is the case | 18:11 |
| corvus | yeah, so i think if we wanted to set the value explicitly, i'd say let's just do it for all arm images (like you did in your change) and leave it in place rather than remove it | 18:12 |
| corvus | that way it's clear that there is a real limitation for those jobs | 18:12 |
| corvus | i think maybe my concern wasn't clear | 18:12 |
| clarkb | that wfm and I think it makes it clear what is going on for any potential edits if we add a second arm64 provider | 18:13 |
| corvus | oh, well, actually, even if we lost all our images, we'd still have the correct provider configuration, so we're not likely to run into a problem in practice if we remove the format restriction after the initial merge. | 18:14 |
| corvus | so we're only likely to see this problem when we add new images. | 18:14 |
| corvus | but still, there's a good argument to keep the restriction in the job: so that we copypasta them correctly when we make new images. | 18:15 |
| corvus | clarkb: qq: what about increasing the timeout for the arm image build jobs? | 18:16 |
| clarkb | corvus: that was another idea I contemplated. I think we are close to the limit but we timed out during the first half of the vhd conversion so it is hard to say how much longer we need | 18:16 |
| clarkb | that is why I preferred just using raw since we know we only need raw and we build that as the base image format then convert to others | 18:17 |
| clarkb | maybe we want to do both things? | 18:17 |
| clarkb | limiting to raw as an optimization and increasing the timeout in an effort to avoid throwing away good work? | 18:17 |
| corvus | yeah, but it's still a one-time issue... i sort of like the idea that our jobs should be able to run the same on all the providers, so i think my first choice would be to increase the timeout (with the knowledge that we're only going to waste time just the first time). but then if we decide to only increase the timeout for arm (so the job definitions would be different anyway), then yeah, maybe the format restriction is better. | 18:19 |
| corvus | i think if we were to consider increasing the timeout globally, then that's my first choice. if we want it tailored narrowly, then i like your format restriction change. | 18:20 |
| clarkb | ack I'm willing to start with the longer timeout. The current one is 2 hours. I guess bump to 3? | 18:20 |
| corvus | yeah, maybe let's see if that's enough... and if it isn't, then we'll do the other change so we don't set a crazy timeout. | 18:21 |
| corvus | clarkb: the results just came back on 968029 and they all failed | 18:22 |
| clarkb | that is curious. Only one timed out. Most are near the timeout though | 18:23 |
| clarkb | maybe not most, but several are near the timeout | 18:24 |
| clarkb | implying even if restricting to raw we may need longer timeouts | 18:24 |
| clarkb | corvus: chown: cannot access '/opt/dib_tmp/dib-images/debian-bullseye-arm64.r': No such file or directory | 18:25 |
| corvus | oh i think the failures are yaml | 18:25 |
| corvus | i think it needs to be [raw] not raw | 18:26 |
| clarkb | corvus: ya I think format: raw is being treated as three entries r, a, w | 18:26 |
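
A quick sketch of the failure mode, assuming PyYAML is available locally and that the job simply loops over the configured value: a bare YAML string iterates per character while a flow-style list iterates per element. The build_diskimage_formats key name is the one mentioned earlier in the log.

```bash
# Sketch: why "raw" expanded to r, a, w while "[raw]" behaves as intended
python3 -c 'import yaml; print(list(yaml.safe_load("build_diskimage_formats: raw")["build_diskimage_formats"]))'
# -> ['r', 'a', 'w']
python3 -c 'import yaml; print(list(yaml.safe_load("build_diskimage_formats: [raw]")["build_diskimage_formats"]))'
# -> ['raw']
```
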
| clarkb | I can leave a comment about that but leave it wip and instead pivot to increasing the timeout | 18:26 |
| clarkb | then that way if we return to the format selection idea we'll know what needs updating | 18:26 |
| corvus | sgtm, and agree about the timeout analysis | 18:26 |
| clarkb | https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VTMDDVSPM5HRUYWAATNMZOILT5OE57VR/ responded to the openstack helm question | 18:30 |
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Increase image builds timeouts to 3 hours https://review.opendev.org/c/opendev/zuul-providers/+/968052 | 18:34 |
| clarkb | corvus: ^ I think that should do it | 18:34 |
| corvus | clarkb: +2; we could probably merge now if you think that's appropriate. | 18:36 |
| corvus | i think that will probably run all the jobs | 18:37 |
| corvus | so that might be something to squash into the arm change for expediency? | 18:37 |
| clarkb | corvus: into the add trixie change you mean? | 18:37 |
| clarkb | *add trixie arm64 change | 18:37 |
| corvus | er yeah that | 18:37 |
| clarkb | sure let me combine the two | 18:37 |
| clarkb | if I abandon 968052 zuul should abandon its jobs right? | 18:40 |
| clarkb | or dequeue them | 18:40 |
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Add trixie-arm64 https://review.opendev.org/c/opendev/zuul-providers/+/966200 | 18:40 |
| clarkb | yes looks like it dequeued perfect | 18:41 |
| clarkb | https://zuul.opendev.org/t/opendev/build/21e1aa2d3e4b409087c2811b1516fd2d/log/job-output.txt#4040-4049 how does this happen when the other image builds are all apparently happy? | 18:49 |
| clarkb | oh maybe it's another corrupted cache in an image? | 18:50 |
| clarkb | that build ran in rax flex sjc3. Lets see if more hit this problem and if they all belong to that cloud region | 18:50 |
| clarkb | does anyone remember what we did to improve debugging around that case? I feel like we did something but my brain is fried this friday and I can't remember what | 18:51 |
| fungi | as far as debugging corrupt git caches? | 18:51 |
| clarkb | ya | 18:52 |
| fungi | i don't remember either (just trying to be clear about what it is i don't remember) | 18:52 |
| clarkb | ack, if we see additional errors of this sort from the same provider we should probably pull on that thread and dig into what we did before beyond simply deleting the suspected bad image (if anything) | 18:53 |
| clarkb | reading the error messages it is complaining about loose objects. I think when you do a git fetch you get packs always so yes this does imply the data already on disk is bad? | 18:55 |
| clarkb | last time I booted the noble image and ran git fsck across all repos in the cache to confirm the problem | 18:55 |
| clarkb | and then we 'resolved' it by deleting the image from the cloud | 18:55 |
| clarkb | I think step 1 here is to see if any other builds fail the same way and try to correlate to this one location and if that happens step 2 is booting a node out of band and checking it with git fsck | 18:59 |
| clarkb | then if we confirm this is the problem step 3 is figuring out what if anything we did last time beyond deleting the bad image and then figure out potential causes or debugging from there | 18:59 |
| fungi | sounds right to me | 19:00 |
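
A sketch of the out-of-band check described in the plan above, using the cache path that shows up later in the log (/opt/git/opendev.org/&lt;org&gt;/&lt;repo&gt;); run on a node booted from the suspect image.

```bash
# Sketch: fsck every repo in the on-image git cache to confirm corruption
for repo in /opt/git/opendev.org/*/*; do
  echo "=== $repo"
  git -C "$repo" fsck --full || echo "CORRUPT: $repo"
done
```
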
| clarkb | https://zuul.opendev.org/t/opendev/build/6d12ece0bfbc48d0bb497a5b42b4c8a3/log/job-output.txt#4011 is a second case and in the same provider so ya I think this is now the likely issue | 19:02 |
| clarkb | for my own sanity I may not dig into that until after lunch. I'm happy to support someone else who looks into it too | 19:05 |
| clarkb | phoronix reports that git 3.0 will switch the default branch name to main and is expected to release by the end of 2026. It is also expected to have working sha256-sha1 interoperability | 19:15 |
| clarkb | I think opendev is pretty close to being set to flip the switch on branch name defaults. We just have to change the gitea management and gerrit management defaults. It is possible to work with main in zuul today as well so I don't expect major problems but worth calling out | 19:16 |
| clarkb | corvus: ^ you may be interested in that from a zuul perspective too | 19:16 |
| fungi | is git 3.0 protocol going to work with gerrit? | 19:22 |
| fungi | i mean, eventually yes i know, just wondering if gerrit (and gitea) are set up to handle the new protocol and hash length and such | 19:23 |
| clarkb | fungi: my understanding is that using sha256 by default but interoperating with sha1 is the goal | 19:23 |
| clarkb | so I'm assuming that git 3.0 will fallback to speaking whatever old git server speaks if it can't speak the new thing | 19:23 |
| clarkb | but ya I think that is a big open question too | 19:24 |
| clarkb | git currently has a number of fallbacks. protocol v2 vs v1, and smart http vs dumb http | 19:25 |
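
A sketch of how that negotiation can be observed from a client today against an opendev repo; this just illustrates the existing v2-with-fallback behavior being described, not anything git 3.0 specific.

```bash
# Sketch: request protocol v2 and watch what the server actually negotiates
GIT_TRACE_PACKET=1 git -c protocol.version=2 ls-remote https://opendev.org/opendev/bindep 2>&1 | head -n 5
```
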
| clarkb | once we're generally caught up on gerrit stuff it is probably a good idea to ask upstream what if any impacts they anticipate from the git 3.0 release | 19:27 |
| mnaser | FYI as a Zuul user there are a few intricacies with assumptions around "master" being the default branch name | 19:28 |
| mnaser | we had to do a bit of configuring to make it work well with "main" | 19:28 |
| clarkb | yes you have to configure it per repo using main today | 19:29 |
| clarkb | but once you do that the fallback behavior that default to master should just work | 19:29 |
| mnaser | i think in our case it's only been an issue for the config projects that we had to afaik | 19:29 |
| mnaser | normal projects i believe use the git provider configs (so github branch protection or default branch from gerrit) | 19:29 |
| clarkb | fungi: looking at https://git-scm.com/docs/BreakingChanges#_changes I don't see anything indicating that git 3.0 won't be able to talk to older servers. The main issue is likely if you git init on sha256 locally then try to push that into an older server | 19:30 |
| clarkb | fungi: but the implication seems to be that existing repos should interoperate. | 19:30 |
| clarkb | then there is the reftable change which jgit actually implemented first so is likely to be ahead of the game on that aspect of the 3.0 changes | 19:31 |
| clarkb | in any case my takeaway from this is we'll probably have to be careful about which tools we use to create and maybe fsck repos going forward but gerrit should handle them as long as we don't jump too far ahead of it in terms of hashes | 19:31 |
| fungi | yeah, mostly wondering how long before new repos we create in gerrit will have the longer hashes | 19:32 |
| fungi | and whether projects will want to convert | 19:32 |
| clarkb | https://github.com/eclipse-jgit/jgit/issues/73 | 19:36 |
| clarkb | gerrit does not support sha256 today because jgit does not. They seem to be aware of the upstream timeline | 19:36 |
| clarkb | this will almost certainly require us to upgrade to a version of gerrit newer than any released today to gain that support | 19:37 |
| clarkb | so that is probably the biggest impact to us | 19:37 |
| clarkb | I guess all the more reason to figure out the 3.11 upgrade but I've been distracted so far today | 19:39 |
| fungi | opendev-build-diskimage-debian-trixie-arm64 ended in node_failure this time | 20:39 |
| fungi | as did opendev-build-diskimage-ubuntu-noble-arm64 | 20:39 |
| fungi | a bunch of other arm64 image builds succeeded though | 20:40 |
| fungi | i'm guessing the amd64 build failures are all from the corrupt git cache (the one i spot-checked was) | 20:42 |
| fungi | and it ran in flex sjc3 | 20:43 |
| clarkb | fungi: ya I'm finishing up lunch and can boot a noble node in sjc3 to git fsck the cache in a bit | 20:50 |
| fungi | though is there much point if the fix is to revert/replace the image in that region? | 21:03 |
| clarkb | fungi: mostly to confirm this is the issue so that we can escalate to the cloud provider | 21:06 |
| clarkb | this is interesting: simply running find or ls on the files in question produces the error | 21:07 |
| clarkb | `find ./ -name .git -type d` hits it | 21:07 |
| fungi | so sounds like the filesystem is corrupt? | 21:07 |
| fungi | which could mean the image is corrupt at the block level too | 21:08 |
| clarkb | certainly seems that way | 21:08 |
| fungi | anything in dmesg? | 21:08 |
| clarkb | 159.135.207.215 is the ip addr I booted it with infra root keys if you want to look too | 21:08 |
| clarkb | `ls -l /opt/git/opendev.org/openstack/python-ironicclient/.git/objects/7a` seems to be a trivial reproduction case | 21:09 |
| clarkb | EXT4-fs error (device vda1): htree_dirblock_to_tree:1083: inode #2099948: comm ls: Directory block failed checksum | 21:09 |
| fungi | yeah, a ton of EXT4-fs hits in dmesg | 21:09 |
| clarkb | so ya I think the fs is corrupted whcih has impacted the git repo | 21:09 |
| clarkb | corvus: is there any way to mark the image as nto to be used but otherwise preserve it for the cloud to debug with? | 21:10 |
| clarkb | I don't think there is | 21:10 |
| fungi | we might be able to copy it? | 21:11 |
| clarkb | we could disable the region entirely. I think that may be worth considering because this is the second time we have seen this problem occur in the same cloud | 21:11 |
| clarkb | fungi: I worry that any data transfer stuff like that may make it harder to debug the problem (either by masking the issue or decoupling attributes that are important) | 21:11 |
| fungi | but yeah, a copy would probably end up on different backend blocks | 21:11 |
| clarkb | tl;dr seems to be that we occasionally get corrupted images in this one particular cloud. That seems like the sort of issue that as cloud users we wouldn't want to happen and as recipients of free cloud quota we should do our best to help debug | 21:12 |
| fungi | has it only been in sjc3 so far? | 21:13 |
| clarkb | I think so. The last time this happened appears to have also been sjc3 based on my command history on bridge | 21:13 |
| clarkb | I was able to copy the old boot server command and only had to change the image id | 21:13 |
| clarkb | my proposal is to disable sjc3 via server boot limits, then send an email to rax with the details we know | 21:14 |
| fungi | we upload qcow2 or raw? do the checksums in glance match between there and other regions? in reference to recent discussions about the usefulness or pointlessness of glance checksums | 21:14 |
| clarkb | this is interesting I'm getting errors trying to run image list and image show against sjc3 right now. let me try other regions | 21:15 |
| fungi | in theory glance provides a checksum so that nova can verify the image copy it has isn't altered/corrupted | 21:15 |
| clarkb | ya I mean I have strong feelings (but not clear memory) that the entire glance checksum system is smoke and mirrors and doesn't actually help at all | 21:16 |
| fungi | and i agree wrt disabling sjc3 temporarily. i doubt we'll be running close to capacity for the next week | 21:17 |
| clarkb | we upload a qcow2 to sjc3 according to my old image show output | 21:17 |
| clarkb | so ya maybe they are using raw images on the backend and the conversion is corrupting the image. Or maybe the storage system for glance has bad disks or bad memory | 21:18 |
| fungi | glance checksums, aiui, record what gets stored/served to other services, so reflects the post-conversion-task state of the image if applicable | 21:18 |
| fungi | (not necessarily a checksum of what file was uploaded to the api) | 21:19 |
| clarkb | ooh this is interesting the corresponding image in IAD3 lists the same owner_specified.openstack.md5='801dc3ed32c4910d9fc7dd636f7cf376' value as sjc3 | 21:19 |
| clarkb | that iad3 image says checksum \| 801dc3ed32c4910d9fc7dd636f7cf376 | 21:19 |
| clarkb | which coincidentally matches the md5 we supply which is good I would expect that to be the case | 21:20 |
| clarkb | checksum \| e22ce366197c876b45efb75d629a4dd8 | 21:20 |
| clarkb | that is the checksum for the image in sjc3 | 21:20 |
| clarkb | so what is the point of supplying a checksum if glance will happily store different data and boot it later? | 21:20 |
| clarkb | maybe this is why I have these feelings about glance checksums not mattering | 21:21 |
| clarkb | it's not that it isn't checking things, it's that it doesn't take appropriate action when they differ | 21:21 |
| clarkb | sjc3 image name: ubuntu-noble-0e9f85a5dc5440c78366a363a999448f. iad3 image name: ubuntu-noble-5b0c8ce4c6c44f6099dda90b73e51e04 | 21:21 |
| fungi | so if it weren't for some clouds doing conversions or other automated alterations, we could compare owner_specified.openstack.md5 and checksum to detect this | 21:26 |
| clarkb | fungi: yes though glance could (and imo should) checksum the pre conversion value and report that (at least in addition) | 21:27 |
| clarkb | because then we'd know if we failed at the first step at least but it doesn't do that aiui | 21:27 |
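
A sketch of the comparison described in this exchange; the --os-cloud names are placeholders for however the two regions are configured locally, and the image names are the ones quoted above.

```bash
# Sketch: compare the checksum glance stored in each region against the md5 we supplied
openstack --os-cloud raxflex-iad3 image show ubuntu-noble-5b0c8ce4c6c44f6099dda90b73e51e04 \
  -f value -c checksum -c properties
openstack --os-cloud raxflex-sjc3 image show ubuntu-noble-0e9f85a5dc5440c78366a363a999448f \
  -f value -c checksum -c properties
```
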
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Disable raxflex sjc3 https://review.opendev.org/c/opendev/zuul-providers/+/968082 | 21:29 |
| clarkb | I think the commit message there sums it up pretty well | 21:29 |
| clarkb | I can work on an email that repeats that info and links to ^ and send that to rackspace folks | 21:30 |
| clarkb | fungi: is there any other information that you think would be useful? | 21:30 |
| clarkb | maybe offer to add their ssh keys to my test node? | 21:31 |
| fungi | no, that sums it up nicely | 21:32 |
| opendevreview | Merged opendev/zuul-providers master: Disable raxflex sjc3 https://review.opendev.org/c/opendev/zuul-providers/+/968082 | 21:32 |
| clarkb | I think the launcher will automatically age that image out over the next few days too which might mean it is gone after the weekend anyway | 21:32 |
| fungi | we can offer, but i doubt having shell access to a vm booted from the image will be much help to tracking down why/where it got corrupted | 21:32 |
| clarkb | keeping the node booted might help preserve things at least | 21:33 |
| clarkb | last time this happened I booted the test node october 6 | 21:33 |
| clarkb | so ya twice in ~ 2 months or so | 21:33 |
| fungi | oh good point | 21:36 |
| fungi | though it may not block image deletion since we're not doing bfv there afaik | 21:38 |
| clarkb | oh right | 21:42 |
| clarkb | fungi: I could boot a bfv node to do that probably | 21:42 |
| clarkb | why don't I try that | 21:43 |
| fungi | not a bad idea | 21:43 |
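
A sketch of the boot-from-volume approach; the flavor, network, keypair, and cloud names are placeholders, and only the image name comes from the log.

```bash
# Sketch: boot a BFV instance so the backing volume keeps the suspect image data around
openstack --os-cloud raxflex-sjc3 server create \
  --image ubuntu-noble-0e9f85a5dc5440c78366a363a999448f \
  --flavor gp.0.4.8 --network PUBLICNET --key-name infra-root-keys \
  --boot-from-volume 80 \
  clarkb-sjc3-image-debug
```
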
| clarkb | this server is spending a lot of time in the BUILD state | 21:49 |
| clarkb | oh it's a qcow2 image that it probably has to convert to raw for bfv? | 21:49 |
| clarkb | I'll be patient | 21:49 |
| clarkb | ok bfv instance is up and running at 66.70.103.231 and exhibits the same issue | 21:52 |
| clarkb | I'm going to clean up the non bfv instance now | 21:52 |
| clarkb | ok I sent an email | 22:15 |
| clarkb | arg the image show output did not format nicely... | 22:16 |