Wednesday, 2025-08-27

opendevreviewOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/95799502:13
opendevreviewMichal Nasiadka proposed opendev/system-config master: epel: Add mirroring of EPEL-10  https://review.opendev.org/c/opendev/system-config/+/95861806:44
opendevreviewClark Boylan proposed opendev/system-config master: Revert "reprepro: temporarily ignore undefinedtarget"  https://review.opendev.org/c/opendev/system-config/+/95866614:53
clarkbthere is a fair bit of interest in adding things to the mirrors right now. I think we should avoid adding anything until upgrades are complete and any outstanding cleanups are done. For example it looks like we may still have some debian stretch content in the mirrors, and we also have the bullseye backports content that we're working on cleaning out. I think we'll also be able to delete14:54
clarkbubuntu bionic soon though this one may take longer14:54
mnasiadkaclarkb: I'm fine with not adding that for now, just raised it so it doesn't get forgotten14:59
nhicher[m]clarkb: Hello, I've started to look at using geneve tunnels + a linux bridge to replace ovs for the multi-node-bridge role in zuul-jobs, so I created a simple playbook to validate it works as expected. I'd like to know if somebody has already started working on that, and also whether the config has to survive a reboot, since I only tested the setup using iproute and brctl; it will be more work to support centos, debian, fedora, gentoo, redhat15:06
nhicher[m]and suse if I have to set up config files (https://softwarefactory-project.io/paste/show/2531/). Thanks15:06
clarkbnhicher[m]: I am not aware of anyone else working on this yet. I don't think the current ovs system survives a reboot so ignoring that problem is probably fine. As far as testing goes, if you update the role in zuul/zuul-jobs the existing test framework there is probably a good start15:07
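(For reference, a rough sketch of the iproute2-based setup nhicher describes, written as Ansible tasks; the bridge name, geneve VNI, and peer address below are hypothetical and not taken from the paste or the existing role.)

  # Minimal sketch, assuming one hypothetical peer; brctl is not needed
  # since iproute2 can manage the bridge directly.
  - name: Create the linux bridge
    ansible.builtin.command: ip link add br-infra type bridge
  - name: Create a geneve tunnel to the peer node
    ansible.builtin.command: ip link add gnv0 type geneve id 42 remote 192.0.2.10
  - name: Attach the tunnel to the bridge and bring both up
    ansible.builtin.shell: |
      ip link set gnv0 master br-infra
      ip link set gnv0 up
      ip link set br-infra up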
opendevreviewClark Boylan proposed opendev/system-config master: Update jinjia-init and gitea-init to modern image build tooling  https://review.opendev.org/c/opendev/system-config/+/95859815:09
nhicher[m]clarkb: great, thanks. I have a zuul deployment to start my work, I will try to propose a change next week on opendev15:10
clarkbcorvus: ^ just a note that I think the earlier failure on this change indicated that our new system generally works and continues to prove the old system does not work.15:10
clarkbnhicher[m]: the only other thought I have is that I'm not sure how new of a kernel you need for geneve support. It may be worth continuing to use vxlan (or maybe making it an option?) if geneve would otherwise require a very new system15:10
nhicher[m]clarkb: yes I tested on fedora 42 only, I will add an option to check if geneve could be used15:12
corvusclarkb: \o/15:13
opendevreviewClark Boylan proposed opendev/grafyaml master: Pull python base images from quay.io  https://review.opendev.org/c/opendev/grafyaml/+/95860115:25
clarkbI'm less sure about ^ it failed when building with docker so I'm trying podman to see if that works. We are building with the buildkit configuration set up correctly as far as I can tell (we don't log that though)15:26
clarkbthe change for the base images themselves has uwsgi-base depending on python-base that only exists in the speculative state but maybe it works because it is all in one change?15:27
clarkbwe'll see if using podman changes the behavior15:27
clarkband then a couple of failures are similar but don't run a buildset registry so will need to be updated to add the buildset registry and then possibly switch to podman I guess?15:27
clarkbanyway this has been a good exercise to propose the changes and see what assumptions hold and which fail15:28
corvusclarkb: i don't think we should be looking at building with podman -- i mean it's one thing if you're doing it for data collection, but if you start thinking that might be something we do for real, we should probably have a conversation about that.15:32
clarkbya at this point I'm mostly trying to narrow down where the failure might be. If podman works but docker doesn't then the buildset registry itself is probably fine etc15:35
clarkbthough I think we are already building system-config images with podman by default15:35
corvus++15:35
corvuseek?  like which ones?15:36
clarkbcorvus: https://review.opendev.org/c/opendev/system-config/+/947872 I guess you didn't get a chance to weigh in on that one before it was approved15:37
clarkb(so basically any images that don't override it back to docker which the python base images change does because they are multiarch currently)15:37
clarkbthough I've just realized we don't need multiarch now that we are running zuul-launcher instead of nodepool15:38
clarkbfwiw the grafyaml image did build with podman15:38
corvusokay, i thought we had discussed sticking with docker buildx15:39
clarkbso there probably is something in our config that is preventing buildkit and the buildset registry from cooperating together to find the quay image via the proxy system15:39
corvusthat's what zuul does, and i sort of expected system-config to do that15:39
corvus      container_command: docker 15:39
corvus      container_build_extra_env: 15:39
corvus        DOCKER_BUILDKIT: 1 15:39
clarkbcorvus: yes that was more recent. I had forgotten things had switched before15:39
corvuszuul does that ^15:39
clarkback let me try updating to use container_command: docker and DOCKER_BUILDKIT: 115:40
corvusdid you set the buildkit var?  i wonder if that was missing/necessary15:40
clarkbgrafyaml is not. I thought it was the default, but maybe we need it to be explicit or something15:40
clarkbI'll test that next15:40
corvuskk15:40
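(For context, those two settings would sit in a job's variables roughly like this; the job name, parent, and image details are illustrative rather than copied from zuul's or grafyaml's actual job definitions.)

  - job:
      name: example-build-container-image
      parent: opendev-build-container-image
      vars:
        container_command: docker
        container_build_extra_env:
          DOCKER_BUILDKIT: 1
        container_images:
          - context: .
            registry: quay.io
            repository: quay.io/example/image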
clarkbI think using docker + buildkit would be fine. I'm not really attached to any particular tool. Just trying to find the magic combo that works with the least headache (the venn diagram here is amazing)15:41
corvusyep.  i also think if we like the podman buildkit path, that could be fine too.  my main goal is consistency :)15:42
corvus(my inclination without further research is to stick with docker buildkit because i feel like that's the easiest and most reliable path to supporting all the new image build options that improve efficiency; but if someone does the research and says all that works great with podman now, i'd be fine with moving opendev and zuul both)15:44
corvus(because -- theoretically -- they should be the same because they should both be buildkit, iiuc)15:44
clarkbanother thing I personally think docker does better than the podman toolchain is making it easy to install on all the platforms with consistent behavior15:44
opendevreviewClark Boylan proposed opendev/grafyaml master: Pull python base images from quay.io  https://review.opendev.org/c/opendev/grafyaml/+/95860115:49
clarkbok that tests with the explicit env var15:49
clarkbcorvus: ya so that fails: https://zuul.opendev.org/t/openstack/build/0a810637de964545b78a9a3e537ecc3b which implies we don't need the explicit flag to enable extra buildkit features. But it doesn't explain what is broken. I think it is either something to do with the buildkit config, the buildset registry, or our job config utilizing those items16:03
clarkbI'm wondering what the best way to debug that is. I need to hold both the buildset registry job and the build job. But that means I need to induce a failure in the buildset registry job?16:04
*** dhill is now known as Guest2515616:10
clarkblooks like we restart docker before we modify the buildkitd.toml16:11
clarkbmaybe we need to restart docker afterwards. I can test that pretty easily16:11
corvusclarkb: yes i think induce failure on both16:16
opendevreviewClark Boylan proposed opendev/grafyaml master: Pull python base images from quay.io  https://review.opendev.org/c/opendev/grafyaml/+/95860116:17
clarkbok I'll try ^ since it is easy and if that fails then look into holding both jobs16:18
clarkboh!16:18
clarkbcorvus: since the parent change that moves things to quay hasn't landed we don't actually have these images with these tags in quay.io yet16:19
clarkbcorvus: we configure buildkitd.toml to proxy requests for quay.io to the buildset registry. But the image is actually in the insecure-ci-registry and we need to pre-prime things by pushing that image into the buildset registry16:20
clarkbI wonder if buildkit's mirror config differs from podman's in a way that makes that necessary, or if podman is looking at the insecure ci registry directly, or something along those lines16:20
clarkbhrm no we're running pull-from-intermediate-registry and it appears to be putting the images into the buildset registry properly16:24
clarkbhttps://zuul.opendev.org/t/openstack/build/0a810637de964545b78a9a3e537ecc3b/console that happened here. So the communication that is breaking is between docker build (which we think is buildkit using its buildkitd.toml) and the buildset registry16:24
clarkbso ya if restarting doesn't help things, holding a node and inspecting configs and behavior is a good next step16:25
clarkb:w16:25
clarkboops16:25
clarkbI've rechecked https://review.opendev.org/c/opendev/statusbot/+/958603 and put a hold in place for it as I realized that that job runs a local buildset registry rather than a separate one. So it's simpler to debug in that context16:39
clarkbone job to hold instead of two16:39
clarkbI've got a held node and initial inspection looks fine. I do notice that the buildkitd.toml has an empty [registry] block in addition to the specific targets16:53
clarkbI have no idea if that would impact behaviors, but can experiment with it16:53
clarkbbut first I'm going to pop out for some outdoor time before it gets hot today16:53
clarkboh the other thing I notice is in the buildset registry log we can see all the writes but no reads with timestamps matching when the build ran16:54
clarkbso seems like docker build isn't trying to hit the local buildset registry at all16:54
clarkbmy hunch right now is it's DNS16:55
clarkbin the configuration we set up zuul-jobs.buildset-registry:5000 as the mirror target, then in /etc/hosts we configure zuul-jobs.buildset-registry to point at our IP address (or the remote buildset registry). I suspect but am not certain that buildkitd is running things within a different fs namespace and isn't seeing the host level /etc/hosts16:56
clarkbI think we can change that to an ipv4 address to test since this held node is using ipv4. We use the name instead because docker doesn't know how to handle ipv6 literals16:56
clarkbanyway I'm going to pop out now. Will be back to test these ideas16:57
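(The two pieces of configuration being described have roughly the following shape; this is a hedged reconstruction for readability, not the actual zuul-jobs template, and the insecure setting and example IP are assumptions.)

  - name: Point buildkit's quay.io mirror at the buildset registry
    ansible.builtin.copy:
      dest: /etc/buildkit/buildkitd.toml
      content: |
        [registry."quay.io"]
          mirrors = ["zuul-jobs.buildset-registry:5000"]
        [registry."zuul-jobs.buildset-registry:5000"]
          insecure = true
  - name: Resolve the registry name to the buildset registry's address
    ansible.builtin.lineinfile:
      path: /etc/hosts
      line: "203.0.113.5 zuul-jobs.buildset-registry"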
clarkbAfter poking some more I'm not entirely convinced that the default buildx builder looks at that config. I'll try creating a new buildx builder that does use that config and see what happens if I do that19:25
corvusclarkb: istr some stuff to create a builder explicitly, i'll see if i can find it19:29
corvusroles/build-container-image/tasks/setup-buildx.yaml:  command: "docker buildx create --name mybuilder --node {{ inventory_hostname | replace('-', '_') }} --driver=docker-container --driver-opt image=quay.io/opendevmirror/buildkit:buildx-stable-1 --driver-opt network=host{% if buildset_registry is defined %} --config /etc/buildkit/buildkitd.toml {% endif %}"19:30
corvuslooks like that's only run for multiarch19:31
corvusthat file also talks about etc/hosts19:32
corvusso i'm wondering 2 things: should we do that everywhere?  what are we missing in the testing for these roles?19:33
corvus(cause i thought all of this was tested in zuul-jobs test jobs)19:33
Clark[m]corvus: I think it's required for multiarch to get the qemu emulation. But ya let me test doing that without multiarch after lunch and see if it fixes things19:40
corvusyeah, if we do need to control the builder start, we may need to separate the builder start stuff from the qemu stuff19:43
Clark[m]corvus: if this is the issue I think it explains why uwsgi works. That job specifies an arch list (of just x86-64 but still it's there) and that may cause it to go into the multiarch path19:50
Clark[m]One hack may be to just always enable multiarch and default to the local system arch if not specified19:51
Clark[m]That way we don't need confusing config options or branching19:51
corvusi don't know if there's anything in that file related to other arches other than starting the qemu image19:54
corvus(that's my way of agreeing with you, but phrasing it differently)19:54
opendevreviewClark Boylan proposed opendev/grafyaml master: Pull python base images from quay.io  https://review.opendev.org/c/opendev/grafyaml/+/95860120:05
clarkbI'm going to let zuul test that theory out for us while I use the held node for additional experiments20:05
clarkboh wow the /etc/hosts handling is quite involved20:11
clarkbI'm just going to use the ip address in the config for now20:11
clarkbcorvus: does the buildset registry run with tls or just http?20:19
clarkblooks like we configure it to use tls. I wonder why when I tell it that things are insecure it still fails making the request due to https20:20
clarkbthe grafyaml update there to specify arch (effectively going through the multiarch path and setting up all of this stuff) did work20:21
clarkbI haven't successfully reproduced on the held node as I'm having issues with ssl, but I think that confirms what the problem is20:21
clarkbto tl;dr I think there are a couple of issues here. The main one is that we need to configure our own buildx builder to modify config in this way. We do that for multiarch builds. If you do configure your own builder then you need to worry about dns resolution and ssl certificate trust chains20:25
clarkbI've got everything working in my held node but the ssl cert validation20:25
clarkbI'm not sure it is valuable to keep trying to make it work locally when we can see it work with multiarch builds. That basically proves how to make this work20:25
corvusclarkb: i don't have an answer to your question other than it looks like we copy the cert into the buildx container ca-certificates dir, and i bet there's a reason for that (like, the process in the builder can't be told to ignore the cert or something)20:26
clarkbya20:26
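(A hedged illustration of the cert handling corvus mentions; the container name and paths are placeholders, not the real task from setup-buildx.yaml.)

  - name: Trust the buildset registry cert inside the buildx builder container
    # A real implementation would also need to refresh the trust store
    # inside the container after copying the cert in.
    ansible.builtin.command: >-
      docker cp /etc/docker/registry.crt
      buildx_buildkit_builder0:/usr/local/share/ca-certificates/buildset-registry.crt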
corvusclarkb: the only question i still have is "what's the deal with testing in zuul-jobs?"  are we not testing this, or is the test broken?20:27
clarkbcorvus: or maybe we're testing it with the multiarch case so it works and not testing the default?20:27
corvusyep.  i know we're testing multiarch, but i thought we were testing default too20:28
clarkbalternatively it's possible the test is properly broken. I think an easy way to break that is using an image that we can find in the upstream and not just in our downstream buildset registry mirror20:28
corvuszuul-jobs-test-build-container-image-docker-release and zuul-jobs-test-build-container-image-docker-release-multiarch are what i assume are the 2 jobs20:29
clarkblooking at my held node builder logs it appears to run a HEAD request against both the mirror and the canonical location20:29
clarkbit fails here because we don't have any python3.11 or 3.12 tags for these images in quay yet20:29
clarkbbut if those tests use an image and tag that exists in the upstream we may get a fallback behavior. This might be a trivial test update to use a random tag value to confirm20:30
clarkbcorvus: FROM quay.io/opendevmirror/httpd:alpine this is what the image build does in those jobs20:34
clarkband we don't seem to build downstream images. So I think this is largely just uncovered by the test case20:34
corvusgot it.  so maybe we should be building an image from scratch with an unused name or something.20:35
clarkbto cover the case I'm running into I think we want two image builds. The first can essentially just copy httpd:alpine to httpd:some-downstream-tag. Then have that publish to the buildset registry (which we don't seem to run in these tests) then build a second image FROM httpd:some-downstream-tag.20:36
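(Sketched as the zuul-jobs container_images job variable, that two-stage test could look something like this; the contexts, repository names, and tag are made up for illustration.)

  container_images:
    # Stage 1: republish the mirrored httpd image under a tag that will
    # only exist in the buildset registry.
    - context: test-playbooks/registry/downstream
      registry: quay.io
      repository: quay.io/example/httpd
      tags:
        - some-downstream-tag
    # Stage 2: a Dockerfile whose FROM line references
    # quay.io/example/httpd:some-downstream-tag, so the build can only
    # succeed if it resolves through the buildset registry mirror.
    - context: test-playbooks/registry/consumer
      registry: quay.io
      repository: quay.io/example/httpd-consumer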
clarkbmaybe the buildset registry tests are a closer match to what we're trying to cover here?20:36
clarkbyes I think zuul-jobs-test-registry-docker is closer to what we want20:39
clarkbcorvus: that ^ test covers this except that it uses docker.io as the image location so I think things magically work20:42
clarkbthe simplest thing here may be to change that test to use quay.io as the registry (or something else entirely) so that we're not running into the docker.io magic20:42
clarkbhrm that test also uses the build-docker-image role when container command is docker and build-container-image otherwise. We probably want a build-container-image version of the job and a build-docker-image version of the job?20:45
clarkbto tl;dr there do appear to be some definite test improvements that could be made in zuul-jobs (really speaks to the complexity of the venn diagram the ecosystem has given us here). Then for making things actually work we could just do multiarch always and default to the host's arch. I don't think there is any overhead for native builds other than pulling the qemu stuff in. It will be20:47
clarkbunused20:47
corvusclarkb: it's possible this is why we don't have that test.  :(20:47
clarkbcorvus: fwiw I think the two roles share the multiarch-for-docker code (at least today). So we could make both roles just always do multiarch and default to the current host arch and get some coverage via the existing jobs if we change the image location from docker.io to something else20:49
clarkbthen maybe build from there?20:49
corvusclarkb: i think the main thing is just run the buildx-setup task list always.  i don't actually understand what you mean when you say "default to the current host arch", because i don't see anything in the buildx-setup file that does anything with architecture.  maybe i'm missing it.20:50
clarkbcorvus: in our opendev jobs we enable multiarch builds by setting the arch: list in the container_images dict. I guess I was thinking of setting that list by default to the current host arch and let the existing code run as is20:51
clarkbcorvus: https://review.opendev.org/c/opendev/grafyaml/+/958601/5/.zuul.yaml basically that but do it in the role by default20:51
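(i.e. roughly the following shape in the job vars, where the presence of the arch list is what switches the role into the multiarch/buildx path; the repository name here is illustrative rather than copied from the change.)

  container_images:
    - context: .
      registry: quay.io
      repository: quay.io/example/grafyaml
      arch:
        - linux/amd64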
corvusyeah that is a way we could make things work for opendev image builds quickly, but if we now know the roles are broken i think we should fix the roles20:51
corvusokay this is becoming more clear now20:52
corvusthe issue is really that this only works with buildkit, and buildkit must be configured correctly, and the only code path that configures buildkit correctly is multiarch.20:53
clarkbI guess doing it that was leaves dead code in the role20:53
clarkbcorvus: ya20:53
clarkb*doing it that way20:53
corvusin zuul, we set DOCKER_BUILDKIT=120:53
corvusbut that isn't enough because that doesn't init buildkit correctly20:54
corvusso maybe what we should do is have a role-level variable that is "enable buildkit" and have it init buildkit correctly, set that env var.  and if multiarch is specified, we also set that variable.20:54
corvus(and then someday make that the default)20:55
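(A very rough sketch of that idea; the variable name and wiring are hypothetical, and only setup-buildx.yaml is an existing file in the role.)

  # roles/build-container-image/defaults/main.yaml (hypothetical default)
  container_build_use_buildkit: false

  # roles/build-container-image/tasks/main.yaml (hypothetical wiring; the
  # same condition would also set DOCKER_BUILDKIT=1 in the build env)
  - name: Initialize a buildx builder when buildkit is requested or multiarch is used
    ansible.builtin.include_tasks: setup-buildx.yaml
    when: >-
      container_build_use_buildkit | bool
      or container_images | selectattr('arch', 'defined') | list | length > 0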
clarkbon the OpenDev side of things I think we have these options: A) Use docker buildx everywhere with multiarch builds configured on our side or help fix zuul-jobs as proposed above (or otherwise) B) Use podman everywhere and add support for podman multiarch builds to zuul-jobs or C) use podman everywhere for non multiarch and only incur the complicated overhead of docker buildx20:55
clarkbmultiarch stuff when necessary20:55
clarkbI suspect A is what corvus would prefer20:56
clarkband I don't really have any objections to it other than a small concern that the build process is a lot more complicated than you'd hope is necessary, but that's how it is with docker so meh20:56
corvusdo we need any multiarch builds?20:57
clarkbcorvus: currently opendev does not. However, nodepool still mostly exists and it relies on the python-base and python-builder opendev images20:57
corvusi think we can disregard that as a requirement20:58
corvuslike, not a design requirement20:58
clarkblooking at https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/build-container-image/tasks/buildx.yaml I think you are right that we can just always setup buildx and then only when doing multiarch builds do the extra registry and the weird push pull push dance20:59
clarkbwhich simplifies the docker build path a bit (but we still have to do all of the buildx setup stuff either way)20:59
corvusi think we need all of those because buildx doesn't put output in the local image cache by default21:00
clarkbit is also possible that we would get arm based resources in a cloud that we want to use for services, and having the ability to do multiarch would be nice for that. But if we ignore nodepool then I don't think there is any reason for opendev to care about multiarch today21:00
clarkbcorvus: oh right its inside the buildx builder container21:00
clarkbcorvus: so the main difference would be whether or not we provide the --platform argument. But otherwise the tasks are the same21:01
corvusoh, we use a temporary registry because of multiarch though21:01
corvusso yeah, that's more complicated21:01
corvus(like, i think there's still the step to get it into the image cache, but we don't normally need an extra registry for that, we only needed it for multiarch for some reason i don't recall)21:02
clarkbcorvus: I wonder if that temporary registry is there to take advantage of the --push argument rather than, say, use skopeo after the build to move things from the buildx builder to the local image cache?21:03
clarkbthere are docker image loading commands that can be used too I think21:03
clarkbit's possible that could be refactored21:04
corvusclarkb: i think i prefer A then B and am against C.  A and B both seem reasonable.  what A has going for it is that all the code is already written, it's just changing the conditional blocks.  B needs a bunch of new code developed and we don't actually know if we love podman for building images.21:04
corvusmaybe -- i thought maybe there was something about maybe we only got the local arch into the local image cache instead of all of them?  but if we do the registry we get all?21:05
corvusmaybe i like the word maybe21:05
clarkboh ya that could be it since docker tries to be smart about giving you images that work with the local cpu(s)21:05
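(For the record, my understanding of the underlying constraint, stated as an assumption rather than documentation: buildx's --load export can only put a single-platform image into the local docker image store, while a multi-platform manifest list has to go through a registry via --push, hence the temporary registry for multiarch. Roughly:)

  # Builder, registry, and image names are placeholders.
  - name: A native build can land directly in the local image cache
    ansible.builtin.command: docker buildx build --load -t example:latest .
  - name: A multiarch build has to round-trip through a registry
    ansible.builtin.command: >-
      docker buildx build --platform linux/amd64,linux/arm64
      --push -t localhost:5000/example:latest .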
clarkbto catch up opendev: we've found a small gap in the container image building with docker that we have solved for multiarch builds. OpenDev can work around that by doing single-arch multiarch builds as I proposed for grafyaml. I think before we do that we should try and fix things for option A) above21:07
clarkbI can probably start to dig into that either tomorrow or Friday. And in the meantime we can hold off on moving our python base images to quay21:07
clarkbcorvus: do you have any need for the held statusbot image builder node? I will clean it up if not21:09
corvusclarkb: nope, thanks21:13
clarkbthe autohold has been deleted and I put a WIP on https://review.opendev.org/c/opendev/system-config/+/957277 so that we don't accidentally land that before the zuul-jobs situation is cleaned up21:27
