| Ramereth[m] | clarkb: No I hadn't noticed that. I don't know of any issues that might cause that. Is it still happening? | 00:15 |
|---|---|---|
| clarkb | Ramereth[m]: the last occurrence we have knowledge of was yesterday. I can retrigger the image build now to see if it is still happening | 00:16 |
| clarkb | it is also possible something else is going on, but the slow image builds were previously explained by the io problems iirc | 00:16 |
| Ramereth[m] | Oh, I think I know what might be causing it. I moved the nodes to a different 10g in preparation for the upcoming DC move | 00:17 |
| Ramereth[m] | let me look at the switch | 00:17 |
| clarkb | Ramereth[m]: https://review.opendev.org/c/opendev/zuul-providers/+/966200 this change in particular to add debian trixie arm64 images has been hitting the timeouts. We can see copying files around is slow here https://zuul.opendev.org/t/opendev/build/3cbc8c070679496a8bf37b868284ff1d/log/job-output.txt#8737-8738 | 00:17 |
| Ramereth[m] | Is the upload happening from the mirror node you have hosted here or an external host? | 00:22 |
| clarkb | Ramereth[m]: I think its disk io within the VM | 00:23 |
| clarkb | which I'm guessing is backed by ceph? so probably within the cloud? | 00:23 |
| Ramereth[m] | which VM is doing the upload? | 00:25 |
| clarkb | Ramereth[m]: it is one of the single use VMs booted by zuul. For the currently running build that is np65006145d0154 with ip address 140.211.11.68 | 00:26 |
| clarkb | it is still in its bootstrapping phase of the image build job and probably needs a bit of time to get to that slow data copy | 00:27 |
| clarkb | looking at the log for the historical run I think it was also slow when doing a qemu-img convert between raw and qcow2 as well as vhd. We don't need all of those image versions on our side so we may be able to optimize some of that on our end too | 00:28 |
| clarkb | corvus: ^ fyi https://zuul.opendev.org/t/opendev/build/3cbc8c070679496a8bf37b868284ff1d/log/job-output.txt#9149-9152 we seem to be building all three image formats for arm64 when we only use one (because there is only one arm provider. Not sure if that should be qcow2 or raw yet) | 00:29 |
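
For context, the conversion step being discussed is roughly the following; this is a hedged sketch of generic qemu-img invocations for producing qcow2 and vhd from a raw base image, not the exact commands the image build job runs (its vhd handling goes through a separate tool from a ppa, as noted later in the log), and the filenames are illustrative.

```bash
# Sketch: converting a raw diskimage into the two additional formats the job was producing
qemu-img convert -f raw -O qcow2 debian-trixie-arm64.raw debian-trixie-arm64.qcow2
qemu-img convert -f raw -O vpc   debian-trixie-arm64.raw debian-trixie-arm64.vhd
```
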
| clarkb | Ramereth[m]: I suppose it is possible that building all the different image formats is a regression on our end that puts us over the timeout and that things on your side aren't necessarily any worse than they were before | 00:30 |
| clarkb | Ramereth[m]: and apologies for the bother if that is the case. But sometimes being forced to really examine a problem leads to new discoveries... | 00:30 |
| clarkb | hrm I don't think this is a new regression on our end. https://opendev.org/opendev/zuul-providers/src/branch/master/zuul.d/image-build-jobs.yaml#L36 seems to be the only place we configure this and I suspect it has been that way for some time | 00:31 |
| clarkb | corvus: can I simply add a build_diskimage_formats: raw entry to all of the arm64 image builds to limit us to raw there? I thought this was supposed to be autodetected based on provider config for some reason but I'm not finding that anywhere so I suspect that it may be this simple | 00:32 |
| Ramereth[m] | If you aren't using a raw image format, it will force it to download the image from the hypervisor and then upload it again I think | 00:32 |
| Ramereth[m] | raw will allow for CoW w/ ceph | 00:32 |
| clarkb | Ramereth[m]: we are using raw. But our job is also converting raw to qcow2 and vhd so its wasting time in that process | 00:33 |
| Ramereth[m] | ah, yeah that would be a problem | 00:33 |
| clarkb | Ramereth[m]: I think if I limit our job to only raw (which is what we want for your cloud) then we may come in under the timeout | 00:33 |
| clarkb | Ramereth[m]: so I would say if you don't see anything obviously wrong on your end then let us fix ^ and get back to you if the problem persists | 00:34 |
| clarkb | mnasiadka: ^ fyi since you were looking at this too | 00:34 |
| clarkb | I think where I'm confused is I thought we were auto selecting the image types to build (and convert) based on the provider needs. Maybe we were and that broke? or maybe I was always mistaken and we're too close to the time limit and need to restrict things now | 00:36 |
| clarkb | corvus should be able to clarify whether or not the provider based autoselection was ever occurring | 00:36 |
| clarkb | heh and now it is failing early because it can't install the vhd conversion tool from our ppa | 00:39 |
| clarkb | (some sort of ppa problem) | 00:39 |
| clarkb | but yet another reason to maybe limit where we try to do anything with vhd | 00:39 |
| mnasiadka | clarkb: auto selecting works after the initial upload | 06:20 |
| opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: nodepool-base: Add firewalld zuul-console rule https://review.opendev.org/c/opendev/zuul-providers/+/967962 | 08:32 |
| mnasiadka | Found out why rocky9/10 doesn’t stream console logs properly ^^ | 08:32 |
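
A minimal sketch of the equivalent manual firewalld commands, assuming the default zuul_console streaming port of 19885; the actual fix is the Ansible change in the review linked above.

```bash
# Sketch: open the zuul_console log streaming port on a firewalld-enabled node
sudo firewall-cmd --permanent --add-port=19885/tcp
sudo firewall-cmd --reload
```
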
| mnasiadka | clarkb: looking at ^^ I think the arm64 provider has a bad day | 10:48 |
| *** | tkajinam is now known as Guest31836 | 11:05 |
| clarkb | mnasiadka: looks like some of the builds succeeded, others timed out, and others hit nodeset failures. I suspect the nodeset failures are distinct from whatever is causing the timeouts | 15:02 |
| clarkb | mnasiadka: re auto selecting the image format do you know where that is plumbed through? | 15:02 |
| clarkb | mostly I'm thinking if we build all three the first time that isn't great if we're right on the edge of timeouts so wondering how we might be able to improve that. And maybe for arm we just force it to raw only? | 15:02 |
| clarkb | since for arm we know we only want raw | 15:03 |
| fungi | clarkb: if memory serves, the gerrit->gitea linking is done with the [gitweb] section here: https://opendev.org/opendev/system-config/src/commit/37856f0/playbooks/roles/gerrit/templates/gerrit.config.j2#L160-L172 but also we disable gitiles with https://opendev.org/opendev/system-config/src/commit/37856f0/playbooks/roles/gerrit/files/gitiles.config | 15:05 |
| clarkb | fungi: hrm both of those configs should be the same in the test nodes and in production. I wonder why production has a working link but testing doesn't | 15:06 |
| fungi | i'll poke around and see if i can spot it | 15:07 |
| clarkb | mnasiadka: https://zuul.opendev.org/t/opendev/build/c06f086292f040d6817aa3dac89fec95/log/job-output.txt#9776-9777 does seem to confirm that autoselection works for noble images | 15:11 |
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Limit arm64 image builds to producing raw images https://review.opendev.org/c/opendev/zuul-providers/+/968029 | 15:18 |
| clarkb | that should be a noop because it only modified existing images which should already be raw only. However, I figure we should update everything if we also update trixie. | 15:19 |
| clarkb | mnasiadka: ^ do you want to update the trixie builds to do that too? I figure it is still a good improvement while we try to sort out the other problems | 15:19 |
| clarkb | the node failures appear to be due to Error scanning keys: Timeout connecting to 140.211.11.75 on port 22 | 15:21 |
| clarkb | which might also point to slowness problems if things aren't booting quickly enough to have ssh keys? | 15:21 |
| fungi | clarkb: found the problem (or misunderstanding?) with the browse link. i think it's a user permission. log out of review.opendev.org and then look at https://review.opendev.org/admin/repos/opendev/bindep,general and you'll see browse is greyed out for you, sign in and it's clickable | 15:26 |
| clarkb | oh interesting | 15:26 |
| clarkb | I wonder why that is when the gitea links generally work even when not logged in | 15:27 |
| clarkb | but also that likely explains the observed difference between testing and prod and isn't an actual regression (which is good news) | 15:27 |
| fungi | yeah, it seems like an inconsistency in the permissions model, but doesn't look like it's necessarily changed behavior | 15:28 |
| fungi | it may have been this way for a very long time | 15:28 |
| clarkb | ya I doubt the feature is used often so not surprised this behavior is easy to overlook | 15:28 |
| clarkb | thank you for checking. Do you want to update the entry on the etherpad and strike it off or should I? | 15:30 |
| fungi | i added a comment about it, but can strike through it too | 15:30 |
| corvus | clarkb: i have some stuff to do this morning; can we avoid changing the image format stuff until i look at it later? | 15:32 |
| clarkb | corvus: yes I'll WIP it | 15:32 |
| clarkb | infra-root we've gotten a support request to the security incident list :/ anyway tl;dr is that loading openstack-helm helm charts from https://tarballs.opendev.org/openstack/openstack-helm is slow. I can confirm this behavior via my web browser and directly interacting with /afs/ on static. Looks like that directory has almost 15k entries in it which I suspect is part of the issue? | 16:35 |
| clarkb | fungi: is there an easy way to bump this over to service-discuss? As far as why this is slow, it seems to speed up once cached, but I'm wondering if this behavior of creating a new version of everything for each commit is generally compatible with afs | 16:36 |
| clarkb | sorry I meant to direct the question of email management to fungi. The afs question is more out loud to everyone | 16:37 |
| clarkb | in theory this is what we have git for | 16:37 |
| clarkb | having done some general poking around it looks like things are definitely more responsive when there are fewer entries (openstacksdk has <500 and comes back quickly. Nova has <6k and comes back quicker than openstack-helm but more slowly than openstacksdk) | 16:39 |
| clarkb | cardoe: can this data be stored in subdirectories for each service rather than doing a flat giant directory? Additionally or alternatively do we really need a new object for each commit (or at least that is what it looks like in the directory listings) | 16:40 |
| clarkb | it looks like helm chart repositories do have indexes of some sort so I think things could be reorganized. I don't know how painful that would be to convert from the old layout to a new one though | 16:42 |
| clarkb | in any case I think the problem is afs doesn't handle large directories like this super quickly so sharding is helpful. openstack-helm is not currently sharding and has ~15k entries in a single directory. This makes the web responses slow (however they do respond so helm's timeouts may also be too aggressive) | 16:44 |
| clarkb | the default `helm repo add` --timeout value is 2 minutes | 16:45 |
| clarkb | fetching that page locally took wget 3 minutes and 12 seconds | 16:49 |
| clarkb | so ya definitely over that timeout | 16:49 |
| clarkb | the actual data transfer took about 1 second after it started, so all the time is in creating the index.html on the apache side which requires reading all the necessary dir ent metadata | 16:50 |
| clarkb | https://docs.openafs.org/AdminGuide/HDRWQ402.html talks about tuning the cache but it looks like it should be making decisions based on the cache size (which we set) so I'm not sure we need to do much more. But I'm trying to double check the cache size is reasonable now | 16:56 |
| clarkb | I think we set our cache to 50GB and we are using 42GB | 16:58 |
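
A sketch of how the AFS client cache usage can be checked on the server; the numbers in the comment are illustrative, not readings taken from static.

```bash
# Sketch: report AFS cache manager usage vs its configured size (run on the AFS client)
fs getcacheparms
# e.g. "AFS using 44040192 of the cache's available 52428800 1K byte blocks."
```
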
| clarkb | so ya best I can figure is "don't do that because afs" but I'm definitely not an expert here | 16:59 |
| clarkb | I think we should bump the thread over to service-discuss if we can do that easily then respond with a suggestion to increase the --timeout value to something like 5 minutes then work with the openstack-helm team to better organize the data | 17:01 |
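
A sketch of the workaround for chart consumers, assuming the --timeout flag mentioned above; the repository URL is the one from the report and the 5 minute value is the suggestion being made here.

```bash
# Sketch: add the openstack-helm repo with a longer timeout than the 2 minute default
helm repo add openstack-helm https://tarballs.opendev.org/openstack/openstack-helm --timeout 5m
```
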
| clarkb | my suggestion would be openstack-helm/index.yaml that then points to openstack-helm/ironic/2024.2/ironic-2024.2-$helmversion | 17:03 |
| clarkb | Ideally it would be great if you could just have the one helm chart version for the one version of ironic, but if we can't have that then I think we should break it down by logical component then component version, then the helm chart version. That will likely make afs much happier | 17:05 |
| clarkb | cardoe: ^ fyi | 17:05 |
| clarkb | I need to pop out for a bit now. I skipped normal morning tasks today and need to get some things done | 17:06 |
| fungi | clarkb: i think the helm tarballs slowness was brought up in here at one point a month or two back (maybe by mnaser?) and we pointed out at the time that it has waaaaaay too many entries in that directory | 17:08 |
| clarkb | fungi: ya I think if they broke things down how I suggest above it would make things happen on the openstacksdk scale of performance which is well under the 2 minute timeout | 17:08 |
| fungi | as for redirecting a post, i would reply to service-discuss and set both reply-to and (if your mua will let you) mail-followup-to as service-discuss as well | 17:09 |
| clarkb | fungi: so I think if we can get that email redirected to our normal support channel then respond that this is a fundamental issue with how they've laid things out, suggest increasing the timeout as a workaround and then point them to openstack-helm to refactor that is our job done | 17:09 |
| clarkb | I think my mua is of the braindead variety. Not sure I can do that but will check | 17:09 |
| clarkb | looks like fastmail wants you to manage reply-to's as additional email addresses within your account rather than allowing you to set the header arbitrarily. I bet this is to avoid abuse but it makes doing what you suggest very difficult or maybe impossible? I think I would have to verify the address to use it | 17:12 |
| clarkb | anyway I really need to pop out now. fungi when I get back I can draft an email and maybe you can send it? | 17:13 |
| clarkb | that way you don't have to do what I think is the difficult part but can avoid me figuring out a new client etc | 17:13 |
| fungi | clarkb: i'm not even sure it's necessarily afs at fault for making that slow, could easily be that apache autoindex is just not optimized for it | 17:14 |
| fungi | as far as me sending something, i can't really cc the original poster's (gmail) address from my subscribed address, but i can try to work out something with one of my alternate addresses | 17:16 |
| cardoe | I would like to improve that as well. | 17:16 |
| cardoe | So one thing I've noticed is that there's 2 charts built for every merge to master | 17:17 |
| cardoe | The PTL has wanted the charts to remain generic and support multiple versions of the containers at once. | 17:17 |
| cardoe | He's not on IRC at the moment. | 17:18 |
| fungi | and yeah, i just tested with `time ls /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm>/dev/null` on static.openstack.org (the server hosting the tarballs site) and it takes 0.029 seconds to complete, so i think it's apache autoindexing that's slow with 15k files to turn into an html index | 17:18 |
| Clark[m] | fungi: maybe I just quote the entire email and send it to discuss and cc the original sender. That may be a reasonable compromise | 17:19 |
| fungi | yeah, that's probably easiest | 17:19 |
| fungi | er, tested on static.opendev.org i meant | 17:19 |
| Clark[m] | Sharding like I suggest would help apache auto index too so I think that is still a viable option | 17:19 |
| Clark[m] | And I think is something that the helm team can drive from their end but we may have to clean up the old data when the new data is in place? | 17:20 |
| fungi | anyway, getting a directory listing from afs is fast, looks like, but yeah generating an html file index on the fly is not | 17:20 |
| Clark[m] | fungi it may also be that the timestamp and file size data is slow to retrieve from afs and your ls didn't stat all that data | 17:21 |
| cardoe | or give me an OCI registry and I'll push the charts there. :D | 17:21 |
| fungi | fair, `ls -l` does seem to take longer | 17:21 |
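
A sketch of the comparison being made here: the plain listing only reads directory entries, while the long listing also stats every entry, which is where the per-file metadata cost shows up.

```bash
# Sketch: entry listing vs stat-heavy listing on the same AFS directory
time ls    /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm > /dev/null
time ls -l /afs/openstack.org/project/tarballs.opendev.org/openstack/openstack-helm > /dev/null
```
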
| Clark[m] | cardoe there are several that are free to use for open source projects. We use quay.io | 17:21 |
| Clark[m] | But I think this is solvable with the tools we have we just can't have such a large flat directory. The data lends itself to sharding in a simple manner and I think it will solve the problem | 17:22 |
| cardoe | Right now the project pushes to quay.io/airshipit | 17:22 |
| cardoe | It used to push to docker.io/openstackhelm | 17:23 |
| cardoe | The tarballs.openstack.org is more compliance thing (and well all the docs pointing there) | 17:23 |
| fungi | any idea what it's complying with? | 17:24 |
| * cardoe | shrugs. | 17:25 |
| cardoe | That's what I've been told. | 17:25 |
| cardoe | I would love to make some radical changes. | 17:25 |
| fungi | maybe i'm the only one who asks questions when hearing unsubstantiated claims about compliance | 17:25 |
| fungi | but then again i used to work in compliance | 17:26 |
| cardoe | There's a lot of stuff that goes against the best practices guide published by helm. Or the even more extended common patterns published by bitnami previously and now by bwj-s that most of the world refers to and linting tools use. | 17:26 |
| cardoe | I would love to have a real quay org that conveys the project name instead of being centered on airship | 17:27 |
| fungi | (i guess i do still spend a fair amount of my work hours on regulatory compliance, for that matter) | 17:27 |
| fungi | cardoe: yeah, basically openstack teams create quay namespaces and then set their jobs up to push to those | 17:28 |
| fungi | similar to pushing to dockerhub | 17:28 |
| opendevreview | Eduardo Franca proposed openstack/project-config master: Rename Project to Kernel Module Management https://review.opendev.org/c/openstack/project-config/+/968047 | 17:29 |
| corvus | fungi: Clark cardoe keep in mind that afs has a dirent maximum of between 15k and 60k based on filename length. my guess based on the average lengths i'm seeing in that dir is probably 32k for that particular directory. | 17:49 |
| fungi | aha, so it'll also likely fill up in the near-to-mid-term future if it doesn't get sharded or trimmed | 17:51 |
| clarkb | oh so not only is performance a concern but there is a hard limit. Another reason for them to shard | 17:55 |
| clarkb | I'm working on figuring out a response now | 17:55 |
| corvus | we can calculate the exact limit with some effort (it's a fixed number of blocks, and you get 15 bytes with the first block, and then an additional 32 bytes for each additional block the entry takes up). | 17:57 |
| clarkb | probably not worth doing since we already know they should be sharding anyway | 17:58 |
| corvus | the number of blocks will vary, but a lot of those entries are over the 15 char limit, so we're looking at 2 blocks for most of them. i see some that will use 3 blocks. | 17:58 |
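
A rough sketch of the per-entry block math corvus describes, assuming the stated formula (15 name bytes fit in the first block, each additional block holds 32 more); the filenames are illustrative.

```bash
# Sketch: estimate how many directory blocks a single AFS dirent consumes
dirent_blocks() {
  local len=${#1} extra
  extra=$(( len > 15 ? len - 15 : 0 ))
  echo $(( 1 + (extra + 31) / 32 ))
}
dirent_blocks "short.tgz"                          # -> 1 block
dirent_blocks "openstack-helm-infra-2024.2.15.tgz" # -> 2 blocks (34 character name)
```
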
| clarkb | though in my email I'm responding to a user not the openstack-helm folks so I'll basically be saying this is why this happens and this is how they can fix it. It would be great if you could engage with them to drive that to avoid us playing a game of telephone. Feel free to quote this response or have them engage with us on the technical details | 17:59 |
| opendevreview | Eduardo Franca proposed openstack/project-config master: Rename Project to Kernel Module Management https://review.opendev.org/c/openstack/project-config/+/968047 | 18:03 |
| corvus | clarkb: the image building friction is this: we can calculate the formats used for the image once the image is attached to a provider. if it's not attached to any providers, there are image formats to use. https://review.opendev.org/c/opendev/zuul-providers/+/966200 is creating the image and adding it to providers. | 18:07 |
| corvus | all of the zuul config for that change is evaluated dynamically, except providers. they're like pipelines: they don't change until something merges. so the moment that change merges, any images built after that point will have the format set correctly. | 18:07 |
| corvus | but until it merges, we're using the default value that we (opendev) chose which is all 3 formats | 18:07 |
| corvus | maybe we should just choose a different default value? | 18:07 |
| clarkb | corvus: maybe defaulting to raw (which I think is most common and the default for dib) to bootstrap, then letting the provider values take over once the initial builds are done, is one approach | 18:08 |
| corvus | oh sorry, correction: s/there are image formats to use/there are NO image formats to use/. sorry that was hard to read. | 18:08 |
| clarkb | corvus: another might be to start arm64 builds with an explicit raw request since we know that arch only uses raw then we can optionally clean that up later once onboarded? | 18:09 |
| corvus | yeah, that's a possibility, but i don't love it because if we're unable to build all 3 (that's the case, right? like we're running out of time or space or something when building all 3 on arm?) then if we ever have to recover, we'll have to remember to do that again. | 18:10 |
| corvus | (was referring to second suggestion) | 18:10 |
| clarkb | corvus: in this case we only need one because osuosl only supports the one. But as a general rule I think that is the case | 18:11 |
| corvus | yeah, so i think if we wanted to set the value explicitly, i'd say let's just do it for all arm images (like you did in your change) and leave it in place rather than remove it | 18:12 |
| corvus | that way it's clear that there is a real limitation for those jobs | 18:12 |
| corvus | i think maybe my concern wasn't clear | 18:12 |
| clarkb | that wfm and I think it makes it clear what is going on for any potential edits if we add a second arm64 provider | 18:13 |
| corvus | oh, well, actually, even if we lost all our images, we'd still have the correct provider configuration, so we're not likely to run into a problem in practice if we remove the format restriction after the initial merge. | 18:14 |
| corvus | so we're only likely to see this problem when we add new images. | 18:14 |
| corvus | but still, there's a good argument to keep the restriction in the job: so that we copypasta them correctly when we make new images. | 18:15 |
| corvus | clarkb: qq: what about increasing the timeout for the arm image build jobs? | 18:16 |
| clarkb | corvus: that was another idea I contemplated. I think we are close to the limit but we timed out during the first half of the vhd conversion so it is hard to say how much longer we need | 18:16 |
| clarkb | that is why I preferred just using raw since we know we only need raw and we build that as the base image format then convert to others | 18:17 |
| clarkb | maybe we want to do both things? | 18:17 |
| clarkb | limiting to raw as an optimization and increasing the timeout in an effort to avoid throwing away good work? | 18:17 |
| corvus | yeah, but it's still a one-time issue... i sort of like the idea that our jobs should be able to run the same on all the providers, so i think my first choice would be to increase the timeout (with the knowledge that we're only going to waste time just the first time). but then if we decide to only increase the timeout for arm (so the job definitions would be different anyway), then yeah, maybe the format restriction is better. | 18:19 |
| corvus | i think if we were to consider increasing the timeout globally, then that's my first choice. if we want it tailored narrowly, then i like your format restriction change. | 18:20 |
| clarkb | ack I'm willing to start with the longer timeout. The current one is 2 hours. I guess bump to 3? | 18:20 |
| corvus | yeah, maybe let's see if that's enough... and if it isn't, then we'll do the other change so we don't set a crazy timeout. | 18:21 |
| corvus | clarkb: the results just came back on 968029 and they all failed | 18:22 |
| clarkb | that is curious. Only one timed out. Most are near the timeout though | 18:23 |
| clarkb | maybe not most, but several are near the timeout | 18:24 |
| clarkb | implying even if restricting to raw we may need longer timeouts | 18:24 |
| clarkb | corvus: chown: cannot access '/opt/dib_tmp/dib-images/debian-bullseye-arm64.r': No such file or directory | 18:25 |
| corvus | oh i think the failures are yaml | 18:25 |
| corvus | i think it needs to be [raw] not raw | 18:26 |
| clarkb | corvus: ya I think format: raw is being treated as three entries r, a, w | 18:26 |
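
A quick sketch of the failure mode, assuming PyYAML is available locally and that the job simply loops over the configured value: a bare YAML string iterates per character while a flow-style list iterates per element. The build_diskimage_formats key name is the one mentioned earlier in the log.

```bash
# Sketch: why "raw" expanded to r, a, w while "[raw]" behaves as intended
python3 -c 'import yaml; print(list(yaml.safe_load("build_diskimage_formats: raw")["build_diskimage_formats"]))'
# -> ['r', 'a', 'w']
python3 -c 'import yaml; print(list(yaml.safe_load("build_diskimage_formats: [raw]")["build_diskimage_formats"]))'
# -> ['raw']
```
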
| clarkb | I can leave a comment about that but leave it wip and instead pivot to increasing the timeout | 18:26 |
| clarkb | then that way if we return to the format selection idea we'll know what needs updating | 18:26 |
| corvus | sgtm, and agree about the timeout analysis | 18:26 |
| clarkb | https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VTMDDVSPM5HRUYWAATNMZOILT5OE57VR/ responded to the openstack helm question | 18:30 |
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Increase image builds timeouts to 3 hours https://review.opendev.org/c/opendev/zuul-providers/+/968052 | 18:34 |
| clarkb | corvus: ^ I think that should do it | 18:34 |
| corvus | clarkb: +2; we could probably merge now if you think that's appropriate. | 18:36 |
| corvus | i think that will probably run all the jobs | 18:37 |
| corvus | so that might be something to squash into the arm change for expediency? | 18:37 |
| clarkb | corvus: into the add trixie change you mean? | 18:37 |
| clarkb | *add trixie arm64 change | 18:37 |
| corvus | er yeah that | 18:37 |
| clarkb | sure let me combine the two | 18:37 |
| clarkb | if I abandon 968052 zuul should abandon its jobs right? | 18:40 |
| clarkb | or dequeue them | 18:40 |
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Add trixie-arm64 https://review.opendev.org/c/opendev/zuul-providers/+/966200 | 18:40 |
| clarkb | yes looks like it dequeued perfect | 18:41 |
| clarkb | https://zuul.opendev.org/t/opendev/build/21e1aa2d3e4b409087c2811b1516fd2d/log/job-output.txt#4040-4049 how does this happen when the other image builds are all apparently happy? | 18:49 |
| clarkb | oh maybe it's another corrupted cache in an image? | 18:50 |
| clarkb | that build ran in rax flex sjc3. Lets see if more hit this problem and if they all belong to that cloud region | 18:50 |
| clarkb | does anyone remember what we did to improve debugging around that case? I feel like we did something but my brain is fried this friday and I can't remember what | 18:51 |
| fungi | as far as debugging corrupt git caches? | 18:51 |
| clarkb | ya | 18:52 |
| fungi | i don't remember either (just trying to be clear about what it is i don't remember) | 18:52 |
| clarkb | ack, if we see additional errors of this sort from the same provider we should probably pull on that thread and dig into what we did before beyond simply deleting the suspected bad image (if anything) | 18:53 |
| clarkb | reading the error messages it is complaining about loose objects. I think when you do a git fetch you get packs always so yes this does imply the data already on disk is bad? | 18:55 |
| clarkb | last time I booted the noble image and ran git fsck across all repos in the cache to confirm the problem | 18:55 |
| clarkb | and then we 'resolved' it by deleting the image from the cloud | 18:55 |
| clarkb | I think step 1 here is to see if any other builds fail the same way and try to correlate to this one location and if that happens step 2 is booting a node out of band and checking it with git fsck | 18:59 |
| clarkb | then if we confirm this is the problem step 3 is figuring out what if anything we did last time beyond deleting the bad image and then figure out potential causes or debugging from there | 18:59 |
| fungi | sounds right to me | 19:00 |
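
A sketch of the out-of-band check described in the plan above, using the cache path that shows up later in the log (/opt/git/opendev.org/&lt;org&gt;/&lt;repo&gt;); run on a node booted from the suspect image.

```bash
# Sketch: fsck every repo in the on-image git cache to confirm corruption
for repo in /opt/git/opendev.org/*/*; do
  echo "=== $repo"
  git -C "$repo" fsck --full || echo "CORRUPT: $repo"
done
```
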
| clarkb | https://zuul.opendev.org/t/opendev/build/6d12ece0bfbc48d0bb497a5b42b4c8a3/log/job-output.txt#4011 is a second case and in the same provider so ya I think this is now the likely issue | 19:02 |
| clarkb | for my own sanity I may not dig into that until after lunch. I'm happy to support someone else who looks into it too | 19:05 |
| clarkb | phoronix reports that git 3.0 will switch the default branch name to main and is expected to release by the end of 2026. It is also expected to have working sha256-sha1 interoperability | 19:15 |
| clarkb | I think opendev is pretty close to being set to flip the switch on branch name defaults. We just have to change the gitea management and gerrit management defaults. It is possible to work with main in zuul today as well so I don't expect major problems but worth calling out | 19:16 |
| clarkb | corvus: ^ you may be interested in that from a zuul perspective too | 19:16 |
| fungi | is git 3.0 protocol going to work with gerrit? | 19:22 |
| fungi | i mean, eventually yes i know, just wondering if gerrit (and gitea) are set up to handle the new protocol and hash length and such | 19:23 |
| clarkb | fungi: my understanding is that using sha256 by default but interoperating with sha1 is the goal | 19:23 |
| clarkb | so I'm assuming that git 3.0 will fallback to speaking whatever old git server speaks if it can't speak the new thing | 19:23 |
| clarkb | but ya I think that is a big open question too | 19:24 |
| clarkb | git currently has a number of fallbacks. protocol v2 vs v1, and smart http vs dumb http | 19:25 |
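
A sketch of how that negotiation can be observed from a client today against an opendev repo; this just illustrates the existing v2-with-fallback behavior being described, not anything git 3.0 specific.

```bash
# Sketch: request protocol v2 and watch what the server actually negotiates
GIT_TRACE_PACKET=1 git -c protocol.version=2 ls-remote https://opendev.org/opendev/bindep 2>&1 | head -n 5
```
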
| clarkb | once we're generally caught up on gerrit stuff it is probably a good idea to ask upstream what if any impacts they anticipate from the git 3.0 release | 19:27 |
| mnaser | FYI as a Zuul user there are a few intricacies with assumptions around "master" being the default branch name | 19:28 |
| mnaser | we had to do a bit of configuring to make it work well with "main" | 19:28 |
| clarkb | yes you have to configure it per repo using main today | 19:29 |
| clarkb | but once you do that the fallback behavior that default to master should just work | 19:29 |
| mnaser | i think in our case it's only been an issue for the config projects that we had to afaik | 19:29 |
| mnaser | normal projects i believe use the git provider configs (so github branch protection or default branch from gerrit) | 19:29 |
| clarkb | fungi: looking at https://git-scm.com/docs/BreakingChanges#_changes I don't see anything indicating that git 3.0 won't be able to talk to older servers. The main issue is likely if you git init on sha256 locally then try to push that into an older server | 19:30 |
| clarkb | fungi: but the implication seems to be that existing repos should interoperate. | 19:30 |
| clarkb | then there is the reftable change which jgit actually implemented first so is likely to be ahead of the game on that aspect of the 3.0 changes | 19:31 |
| clarkb | in any case my takeaway from this is we'll probably have to be careful about which tools we use to create and maybe fsck repos going forward but gerrit should handle them as long as we don't jump too far ahead of it in terms of hashes | 19:31 |
| fungi | yeah, mostly wondering how long before new repos we create in gerrit will have the longer hashes | 19:32 |
| fungi | and whether projects will want to convert | 19:32 |
| clarkb | https://github.com/eclipse-jgit/jgit/issues/73 | 19:36 |
| clarkb | gerrit does not support sha256 today because jgit does not. They seem to be aware of the upstream timeline | 19:36 |
| clarkb | this will almost certainly require us to upgrade to a version of gerrit newer than any released today to gain that support | 19:37 |
| clarkb | so that is probably the biggest impact to us | 19:37 |
| clarkb | I guess all the more reason to figure out the 3.11 upgrade but I've been distracted so far today | 19:39 |
| fungi | opendev-build-diskimage-debian-trixie-arm64 ended in node_failure this time | 20:39 |
| fungi | as did opendev-build-diskimage-ubuntu-noble-arm64 | 20:39 |
| fungi | a bunch of other arm64 image builds succeeded though | 20:40 |
| fungi | i'm guessing the amd64 build failures are all from the corrupt git cache (the one i spot-checked was) | 20:42 |
| fungi | and it ran in flex sjc3 | 20:43 |
| clarkb | fungi: ya I'm finishing up lunch and can boot a noble node in sjc3 to git fsck the cache in a bit | 20:50 |
| fungi | though is there much point if the fix is to revert/replace the image in that region? | 21:03 |
| clarkb | fungi: mostly to confirm this is the issue so that we can escalate to the cloud provider | 21:06 |
| clarkb | this is interesting: simply running find or ls on the files in question produces the error | 21:07 |
| clarkb | `find ./ -name .git -type d` hits it | 21:07 |
| fungi | so sounds like the filesystem is corrupt? | 21:07 |
| fungi | which could mean the image is corrupt at the block level too | 21:08 |
| clarkb | certainly seems that way | 21:08 |
| fungi | anything in dmesg? | 21:08 |
| clarkb | 159.135.207.215 is the ip addr I booted it with infra root keys if you want to look too | 21:08 |
| clarkb | `ls -l /opt/git/opendev.org/openstack/python-ironicclient/.git/objects/7a` seems to be a trivial reproduction case | 21:09 |
| clarkb | EXT4-fs error (device vda1): htree_dirblock_to_tree:1083: inode #2099948: comm ls: Directory block failed checksum | 21:09 |
| fungi | yeah, a ton of EXT4-fs hits in dmesg | 21:09 |
| clarkb | so ya I think the fs is corrupted whcih has impacted the git repo | 21:09 |
| clarkb | corvus: is there any way to mark the image as nto to be used but otherwise preserve it for the cloud to debug with? | 21:10 |
| clarkb | I don't think there is | 21:10 |
| fungi | we might be able to copy it? | 21:11 |
| clarkb | we could disable the region entirely. I think that may be worth considering because this is the second time we have seen this problem occur in the same cloud | 21:11 |
| clarkb | fungi: I worry that any data transfer stuff like that may make it harder to debug the problem (either by masking the issue or decoupling attributes that are important) | 21:11 |
| fungi | but yeah, a copy would probably end up on different backend blocks | 21:11 |
| clarkb | tl;dr seems to be that we occasionally get corrupted images in this one particular cloud. That seems like the sort of issue that as cloud users we wouldn't want to happen and as recipients of free cloud quota we should do our best to help debug | 21:12 |
| fungi | has it only been in sjc3 so far? | 21:13 |
| clarkb | I think so. The last time this happened appears to have also been sjc3 based on my command history on bridge | 21:13 |
| clarkb | I was able to copy the old boot server command and only had to change the image id | 21:13 |
| clarkb | my proposal is to disable sjc3 via server boot limits, then send an email to rax with the details we know | 21:14 |
| fungi | we upload qcow2 or raw? do the checksums in glance match between there and other regions? in reference to recent discussions about the usefulness or pointlessness of glance checksums | 21:14 |
| clarkb | this is interesting I'm getting errors trying to run image list and image show against sjc3 right now. let me try other regions | 21:15 |
| fungi | in theory glance provides a checksum so that nova can verify the image copy it has isn't altered/corrupted | 21:15 |
| clarkb | ya I mean I have strong feelings (but not clear memory) that the entire glance checksum system is smoke and mirrors and doesn't actually help at all | 21:16 |
| fungi | and i agree wrt disabling sjc3 temporarily. i doubt we'll be running close to capacity for the next week | 21:17 |
| clarkb | we upload a qcow2 to sjc3 according to my old image show output | 21:17 |
| clarkb | so ya maybe they are using raw images on the backend and the conversion is corrupting the image. Or maybe the storage system for glance has bad disks or bad memory | 21:18 |
| fungi | glance checksums, aiui, record what gets stored/served to other services, so reflects the post-conversion-task state of the image if applicable | 21:18 |
| fungi | (not necessarily a checksum of what file was uploaded to the api) | 21:19 |
| clarkb | ooh this is interesting the corresponding image in IAD3 lists the same owner_specified.openstack.md5='801dc3ed32c4910d9fc7dd636f7cf376' value as sjc3 | 21:19 |
| clarkb | that iad3 image says checksum \| 801dc3ed32c4910d9fc7dd636f7cf376 | 21:19 |
| clarkb | which coincidentally matches the md5 we supply which is good I would expect that to be the case | 21:20 |
| clarkb | checksum \| e22ce366197c876b45efb75d629a4dd8 | 21:20 |
| clarkb | that is the checksum for the image in sjc3 | 21:20 |
| clarkb | so what is the point of supplying a checksum if glance will happily store different data and boot it later? | 21:20 |
| clarkb | maybe this is why I have these feelings about glance checksums not mattering | 21:21 |
| clarkb | it's not that it isn't checking things, it's that it doesn't take appropriate action when they differ | 21:21 |
| clarkb | sjc3 image name: ubuntu-noble-0e9f85a5dc5440c78366a363a999448f. iad3 image name: ubuntu-noble-5b0c8ce4c6c44f6099dda90b73e51e04 | 21:21 |
| fungi | so if it weren't for some clouds doing conversions or other automated alterations, we could compare owner_specified.openstack.md5 and checksum to detect this | 21:26 |
| clarkb | fungi: yes though glance could (and imo should) checksum the pre conversion value and report that (at least in addition) | 21:27 |
| clarkb | because then we'd know if we failed at the first step at least but it doesn't do that aiui | 21:27 |
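
A sketch of the comparison described in this exchange; the --os-cloud names are placeholders for however the two regions are configured locally, and the image names are the ones quoted above.

```bash
# Sketch: compare the checksum glance stored in each region against the md5 we supplied
openstack --os-cloud raxflex-iad3 image show ubuntu-noble-5b0c8ce4c6c44f6099dda90b73e51e04 \
  -f value -c checksum -c properties
openstack --os-cloud raxflex-sjc3 image show ubuntu-noble-0e9f85a5dc5440c78366a363a999448f \
  -f value -c checksum -c properties
```
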
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Disable raxflex sjc3 https://review.opendev.org/c/opendev/zuul-providers/+/968082 | 21:29 |
| clarkb | I think the commit message there sums it up pretty well | 21:29 |
| clarkb | I can work on an email that repeats that info and links to ^ and send that to rackspace folks | 21:30 |
| clarkb | fungi: is there any other information that you think would be useful? | 21:30 |
| clarkb | maybe offer to add their ssh keys to my test node? | 21:31 |
| fungi | no, that sums it up nicely | 21:32 |
| opendevreview | Merged opendev/zuul-providers master: Disable raxflex sjc3 https://review.opendev.org/c/opendev/zuul-providers/+/968082 | 21:32 |
| clarkb | I think the launcher will automatically age that image out over the next few days too which might mean it is gone after the weekend anyway | 21:32 |
| fungi | we can offer, but i doubt having shell access to a vm booted from the image will be much help to tracking down why/where it got corrupted | 21:32 |
| clarkb | keeping the node booted might help preserve things at least | 21:33 |
| clarkb | last time this happened I booted the test node october 6 | 21:33 |
| clarkb | so ya twice in ~ 2 months or so | 21:33 |
| fungi | oh good point | 21:36 |
| fungi | though it may not block image deletion since we're not doing bfv there afaik | 21:38 |
| clarkb | oh right | 21:42 |
| clarkb | fungi: I could boot a bfv node to do that probably | 21:42 |
| clarkb | why don't I try that | 21:43 |
| fungi | not a bad idea | 21:43 |
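
A sketch of the boot-from-volume approach; the flavor, network, keypair, and cloud names are placeholders, and only the image name comes from the log.

```bash
# Sketch: boot a BFV instance so the backing volume keeps the suspect image data around
openstack --os-cloud raxflex-sjc3 server create \
  --image ubuntu-noble-0e9f85a5dc5440c78366a363a999448f \
  --flavor gp.0.4.8 --network PUBLICNET --key-name infra-root-keys \
  --boot-from-volume 80 \
  clarkb-sjc3-image-debug
```
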
| clarkb | this server is spending a lot of time in the BUILD state | 21:49 |
| clarkb | oh it's a qcow2 image that it probably has to convert to raw for bfv? | 21:49 |
| clarkb | I'll be patient | 21:49 |
| clarkb | ok bfv instance is up and running at 66.70.103.231 and exhibits the same issue | 21:52 |
| clarkb | I'm going to clean up the non bfv instance now | 21:52 |
| clarkb | ok I sent an email | 22:15 |
| clarkb | arg the image show output did not format nicely... | 22:16 |