| opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/957995 | 02:13 |
|---|---|---|
| opendevreview | Michal Nasiadka proposed opendev/system-config master: epel: Add mirroring of EPEL-10 https://review.opendev.org/c/opendev/system-config/+/958618 | 06:44 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Revert "reprepro: temporarily ignore undefinedtarget" https://review.opendev.org/c/opendev/system-config/+/958666 | 14:53 |
| clarkb | there is a fair bit of interest in adding things to the mirrors right now. I think we should avoid adding anything until upgrades are complete and any outstanding cleanups are done. For example, it looks like we may still have some debian stretch content in the mirrors, and we also have the bullseye backports content that we're working on cleaning out. I think we'll also be able to delete | 14:54 |
| clarkb | ubuntu bionic soon, though this one may take longer | 14:54 |
| mnasiadka | clarkb: I'm fine with not adding that for now, just raised it so it doesn't get forgotten | 14:59 |
| nhicher[m] | clarkb: Hello, I've started to look at using geneve tunnels + linux bridge to replace ovs for the multi-node-bridge role in zuul-jobs, so I created a simple playbook to validate it works as expected. I'd like to know if somebody has already started to work on that, and also whether the config has to survive a reboot, since I only tested the setup using iproute2 and brctl; it will be more work to support centos, debian, fedora, gentoo, redhat | 15:06 |
| nhicher[m] | and suse if I have to set up config files (https://softwarefactory-project.io/paste/show/2531/). Thanks | 15:06 |
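A minimal sketch of the kind of setup being described, written as Ansible tasks since the goal is a playbook for the multi-node-bridge role. The bridge name, tunnel name, VNI, and peer address below are placeholders and are not taken from the linked paste:

```yaml
# Hypothetical tasks; a real replacement for multi-node-bridge would loop
# over all peer nodes rather than hard-coding a single tunnel.
- name: Create a plain linux bridge in place of the ovs bridge
  command: ip link add br-infra type bridge

- name: Bring the bridge up
  command: ip link set br-infra up

- name: Create a geneve tunnel to one peer node (VNI 42 is arbitrary)
  command: ip link add gen0 type geneve id 42 remote 203.0.113.10

- name: Attach the tunnel to the bridge (the iproute2 equivalent of brctl addif) and bring it up
  shell: ip link set gen0 master br-infra && ip link set gen0 up
```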
| clarkb | nhicher[m]: I am not aware of anyone else working on this yet. I don't think the current ovs system survives a reboot so ignoring that problem is probably fine. As far as testing goes, if you update the role in zuul/zuul-jobs the existing test framework there is probably a good start | 15:07 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Update jinjia-init and gitea-init to modern image build tooling https://review.opendev.org/c/opendev/system-config/+/958598 | 15:09 |
| nhicher[m] | clarkb: great, thanks. I have a zuul deployment to start my work, I will try to propose a change next week on opendev | 15:10 |
| clarkb | corvus: ^ just a note that I think the earlier failure on this change indicated that our new system generally works and continues to prove the old system does not work. | 15:10 |
| clarkb | nhicher[m]: the only other thought I have is that I'm not sure how new of a kernel you need for geneve support. It may be worth continuing to use vxlan (or maybe make it an option?) if geneve would require a very new system | 15:10 |
| nhicher[m] | clarkb: yes I tested on fedora 42 only, I will add an option to check if geneve could be used | 15:12 |
| corvus | clarkb: \o/ | 15:13 |
| opendevreview | Clark Boylan proposed opendev/grafyaml master: Pull python base images from quay.io https://review.opendev.org/c/opendev/grafyaml/+/958601 | 15:25 |
| clarkb | I'm less sure about ^; it failed when building with docker so I'm trying podman to see if that works. We are building with the buildkit configuration set up correctly as far as I can tell (we don't log that though) | 15:26 |
| clarkb | the change for the base images themselves has uwsgi-base depending on a python-base that only exists in the speculative state, but maybe it works because it is all in one change? | 15:27 |
| clarkb | we'll see if using podman changes the behavior | 15:27 |
| clarkb | and then a couple of failures are similar but don't run a buildset registry, so those will need to be updated to add the buildset registry and then possibly switch to podman I guess? | 15:27 |
| clarkb | anyway this has been a good exercise to propose the changes and see what assumptions hold and which fail | 15:28 |
| corvus | clarkb: i don't think we should be looking at building with podman -- i mean it's one thing if you're doing it for data collection, but if you start thinking that might be something we do for real, we should probably have a conversation about that. | 15:32 |
| clarkb | ya at this point I'm mostly trying to narrow down where the failure might be. If podman works but docker doesn't then the buildset registry itself is probably fine etc | 15:35 |
| clarkb | though I think we are already building system-config images with podman by default | 15:35 |
| corvus | ++ | 15:35 |
| corvus | eek? like which ones? | 15:36 |
| clarkb | corvus: https://review.opendev.org/c/opendev/system-config/+/947872 I guess you didn't get a chance to weigh in on that one before it was approved | 15:37 |
| clarkb | (so basically any images that don't override it back to docker, which the python base images change does because they are multiarch currently) | 15:37 |
| clarkb | though I've just realized we don't need multiarch now that we are running zuul-launcher instead of nodepool | 15:38 |
| clarkb | fwiw the grafyaml image did build with podman | 15:38 |
| corvus | okay, i thought we had discussed sticking with docker buildx | 15:39 |
| clarkb | so there probably is something in our config that is preventing buildkit and the buildset registry from cooperating together to find the quay image via the proxy system | 15:39 |
| corvus | that's what zuul does, and i sort of expected system-config to do that | 15:39 |
| corvus | container_command: docker | 15:39 |
| corvus | container_build_extra_env: | 15:39 |
| corvus | DOCKER_BUILDKIT: 1 | 15:39 |
| clarkb | corvus: yes that was more recent. I had forgotten things had switched before | 15:39 |
| corvus | zuul does that ^ | 15:39 |
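For reference, in a consuming job definition those settings look roughly like the following; the job name is made up and only the two variables are taken from the configuration corvus pasted:

```yaml
# Illustrative job; container_command and container_build_extra_env are the
# variables quoted above, everything else is a placeholder.
- job:
    name: example-build-container-image
    vars:
      container_command: docker
      container_build_extra_env:
        DOCKER_BUILDKIT: 1
```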
| clarkb | ack let me try updating to use container_command: docker and DOCKER_BUILDKIT: 1 | 15:40 |
| corvus | did you set the buildkit var? i wonder if that was missing/necessary | 15:40 |
| clarkb | grafyaml is not. I thought it was the default, but maybe we need it to be explicit or something | 15:40 |
| clarkb | I'll test that next | 15:40 |
| corvus | kk | 15:40 |
| clarkb | I think using docker + buildkit would be fine. I'm not really attached to any particular tool. Just trying to find the magic combo that works with the least headache (the venn diagram here is amazing) | 15:41 |
| corvus | yep. i also think if we like the podman buildkit path, that could be fine too. my main goal is consistency :) | 15:42 |
| corvus | (my inclination without further research is to stick with docker buildkit because i feel like that's the easiest and most reliable path to supporting all the new image build options that improve efficiency; but if someone does the research and says all that works great with podman now, i'd be fine with moving opendev and zuul both) | 15:44 |
| corvus | (because -- theoretically -- they should be the same because they should both be buildkit, iiuc) | 15:44 |
| clarkb | another thing I personally think docker does better than the podman toolchain is make it easy to install on all the platforms with consistent behavior | 15:44 |
| opendevreview | Clark Boylan proposed opendev/grafyaml master: Pull python base images from quay.io https://review.opendev.org/c/opendev/grafyaml/+/958601 | 15:49 |
| clarkb | ok that tests with the explicit env var | 15:49 |
| clarkb | corvus: ya so that fails: https://zuul.opendev.org/t/openstack/build/0a810637de964545b78a9a3e537ecc3b which implies we don't need the explicit flag to enable extra buildkit features, but it doesn't explain what is broken. I think it is either something to do with the buildkit config, the buildset registry, or our job config utilizing those items | 16:03 |
| clarkb | I'm wondering what the best way to debug that is. I need to hold both the buildset registry job and the build job. But that means I need to induce a failure in the buildset registry job? | 16:04 |
| *** | dhill is now known as Guest25156 | 16:10 |
| clarkb | looks like we restart docker before we modify the buildkitd.toml | 16:11 |
| clarkb | maybe we need to restart docker afterwards. I can test that pretty easily | 16:11 |
| corvus | clarkb: yes i think induce failure on both | 16:16 |
| opendevreview | Clark Boylan proposed opendev/grafyaml master: Pull python base images from quay.io https://review.opendev.org/c/opendev/grafyaml/+/958601 | 16:17 |
| clarkb | ok I'll try ^ since it is easy and if that fails then look into holding both jobs | 16:18 |
| clarkb | oh! | 16:18 |
| clarkb | corvus: since the parent change that moves things to quay hasn't landed we don't actually have these images with these tags in quay.io yet | 16:19 |
| clarkb | corvus: we configure buildkitd.toml to proxy requests for quay.io to the buildset registry. But the image is actually in the insecure-ci-registry and we need to pre-prime things by pushing that image into the buildset registry | 16:20 |
| clarkb | I wonder if buildkit's mirror config differs from podman's in a way that makes that necessary, while podman is looking at the insecure ci registry directly, or something along those lines | 16:20 |
| clarkb | hrm no we're running pull-from-intermediate-registry and it appears to be putting the images into the buildset registry properly | 16:24 |
| clarkb | https://zuul.opendev.org/t/openstack/build/0a810637de964545b78a9a3e537ecc3b/console that happened here. So the communication that is breaking is between docker build (which we think is buildkit using its buildkitd.toml) and the buildset registry | 16:24 |
| clarkb | so ya if restarting doesn't help things, holding a node and inspecting configs and behavior is a good next step | 16:25 |
| clarkb | :w | 16:25 |
| clarkb | oops | 16:25 |
| clarkb | I've rechecked https://review.opendev.org/c/opendev/statusbot/+/958603 and put a hold in place for it as I realize that that job runs a local buildset registry rather than a separate one. So it's simpler to debug in that context | 16:39 |
| clarkb | one job to hold instead of two | 16:39 |
| clarkb | I've got a held node and initial inspection looks fine. I do notice that the buildkitd.toml has an empty [registry] block in addition to the specific targets | 16:53 |
| clarkb | I have no idea if that would impact behaviors, but can experiment with it | 16:53 |
| clarkb | but first I'm going to pop out for some outdoor time before it gets hot today | 16:53 |
| clarkb | oh the other thing I notice is in the buildset registry log we can see all the writes but no reads with timestamps that match from when the build ran | 16:54 |
| clarkb | so seems like docker build isn't trying to hit the local buildset registry at all | 16:54 |
| clarkb | my hunch right now is it's DNS | 16:55 |
| clarkb | in the configuration we set up zuul-jobs.buildset-registry:5000 as the mirror target, then in /etc/hosts we configure zuul-jobs.buildset-registry to point at our IP address (or the remote buildset registry). I suspect but am not certain that buildkitd is running things within a different fs namespace and isn't seeing the host-level /etc/hosts | 16:56 |
| clarkb | I think we can change that to an IPv4 address to test since this held node is using IPv4. We use the name instead because docker doesn't know how to handle IPv6 literals | 16:56 |
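Roughly what that wiring looks like when written out as Ansible tasks; the buildkitd.toml path matches the one used elsewhere in this discussion, but the exact options, the TLS handling, and the IPv4 address are assumptions for illustration:

```yaml
# Sketch only: the registry/mirror stanzas are standard buildkitd.toml syntax,
# but treat the insecure setting and the placeholder address as assumptions.
- name: Point quay.io at the buildset registry in buildkitd.toml
  copy:
    dest: /etc/buildkit/buildkitd.toml
    content: |
      [registry."quay.io"]
        mirrors = ["zuul-jobs.buildset-registry:5000"]

      [registry."zuul-jobs.buildset-registry:5000"]
        insecure = true

- name: Resolve the registry alias to the buildset registry address
  lineinfile:
    path: /etc/hosts
    line: "203.0.113.20 zuul-jobs.buildset-registry"  # placeholder IPv4 literal
```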
| clarkb | anyway I'm going to pop out now. Will be back to test these ideas | 16:57 |
| clarkb | After poking some more I'm not entirely convinced that the default buildx builder looks at that config. I'll try creating a new buildx builder that does use that config and see what happens if I do that | 19:25 |
| corvus | clarkb: istr some stuff to create a builder explicitly, i'll see if i can find it | 19:29 |
| corvus | roles/build-container-image/tasks/setup-buildx.yaml: command: "docker buildx create --name mybuilder --node {{ inventory_hostname | replace('-', '_') }} --driver=docker-container --driver-opt image=quay.io/opendevmirror/buildkit:buildx-stable-1 --driver-opt network=host{% if buildset_registry is defined %} --config /etc/buildkit/buildkitd.toml {% endif %}" | 19:30 |
| corvus | looks like that's only run for multiarch | 19:31 |
| corvus | that file also talks about etc/hosts | 19:32 |
| corvus | so i'm wondering 2 things: should we do that everywhere? what are we missing in the testing for these roles? | 19:33 |
| corvus | (cause i thought all of this was tested in zuul-jobs test jobs) | 19:33 |
| Clark[m] | corvus: I think it's required for multiarch to get the qemu emulation. But ya let me test doing that without multiarch after lunch and see if it fixes things | 19:40 |
| corvus | yeah, if we do need to control the builder start, we may need to separate the builder start stuff from the qemu stuff | 19:43 |
| Clark[m] | corvus: if this is the issue I think it explains why uwsgi works. That job specifies an arch list (of just x86-64 but still it's there) and that may cause it to go into the multiarch path | 19:50 |
| Clark[m] | One hack may be to just always enable multiarch and default to the local system arch if not specified | 19:51 |
| Clark[m] | That way we don't need confusing config options or branching | 19:51 |
| corvus | i don't know if there's anything in that file related to other arches other than starting the qemu image | 19:54 |
| corvus | (that's my way of agreeing with you, but phrasing it differently) | 19:54 |
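A rough sketch of the split being suggested: always create the dedicated builder that honours the buildkit config, and only do the qemu registration when extra architectures are requested. The qemu command and the condition are illustrative assumptions rather than the current zuul-jobs code:

```yaml
# Always start a docker-container builder so buildkitd.toml is used
# (flags based on the setup-buildx.yaml task quoted above).
- name: Create a buildx builder that uses the buildkit config
  command: >-
    docker buildx create --name mybuilder
    --driver=docker-container --driver-opt network=host
    {% if buildset_registry is defined %}--config /etc/buildkit/buildkitd.toml{% endif %}

# Only needed for cross-architecture builds; native builds skip the qemu pull.
- name: Register qemu binfmt handlers for cross-arch builds
  command: docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
  when: container_images | selectattr('arch', 'defined') | list | length > 0
```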
| opendevreview | Clark Boylan proposed opendev/grafyaml master: Pull python base images from quay.io https://review.opendev.org/c/opendev/grafyaml/+/958601 | 20:05 |
| clarkb | I'm going to let zuul test that theory out for us while I use the held node for additional experiments | 20:05 |
| clarkb | oh wow the /etc/hosts handling is quite involved | 20:11 |
| clarkb | I'm just going to use the ip address in the config for now | 20:11 |
| clarkb | corvus: does the buildset registry run with tls or just http? | 20:19 |
| clarkb | looks like we configure it to use tls. I wonder why when I tell it that things are insecure it still fails making the request due to https | 20:20 |
| clarkb | the grafyaml update there to specify arch (effectively going through the multiarch path and setting up all of this stuff) did work | 20:21 |
| clarkb | I haven't successfully reproduced on the held node as I'm having issues with ssl, but I think that confirms what the problem is | 20:21 |
| clarkb | to tl;dr I think there are a couple of issues here. The main one is that we need to configure our own buildx builder to modify config in this way. We do that for multiarch builds. If you do configure your own builder then you need to worry about dns resolution and ssl certificate trust chains | 20:25 |
| clarkb | I've got everything working in my held node but the ssl cert validation | 20:25 |
| clarkb | I'm not sure it is valuable to keep trying to make it work locally when we can see it work with multiarch builds. That basically proves how to make this work | 20:25 |
| corvus | clarkb: i don't have an answer to your question other than it looks like we copy the cert into the buildx container ca-certificates dir, and i bet there's a reason for that (like, the process in the builder can't be told to ignore the cert or something) | 20:26 |
| clarkb | ya | 20:26 |
| corvus | clarkb: the only question i still have is "what's the deal with testing in zuul-jobs?" are we not testing this, or is the test broken? | 20:27 |
| clarkb | corvus: or maybe we're testing it with the multiarch case so it works and not testing the default? | 20:27 |
| corvus | yep. i know we're testing multiarch, but i thought we were testing default too | 20:28 |
| clarkb | alternatively it's possible the test is properly broken. I think an easy way to break that is using an image that we can find in the upstream and not just in our downstream buildset registry mirror | 20:28 |
| corvus | zuul-jobs-test-build-container-image-docker-release and zuul-jobs-test-build-container-image-docker-release-multiarch are what i assume are the 2 jobs | 20:29 |
| clarkb | looking at my held node builder logs it appears to run a HEAD request against both the mirror and the canonical location | 20:29 |
| clarkb | it fails here because we don't have any python3.11 or 3.12 tags for these images in quay yet | 20:29 |
| clarkb | but if those tests use an image and tag that exists in the upstream we may get a fallback behavior. This might be a trivial test update to use a random tag value to confirm | 20:30 |
| clarkb | corvus: FROM quay.io/opendevmirror/httpd:alpine this is what the image build does in those jobs | 20:34 |
| clarkb | and we don't seem to build downstream images. So I think this is largely just uncovered by the test case | 20:34 |
| corvus | got it. so maybe we should be building an image from scratch with an unused name or something. | 20:35 |
| clarkb | to cover the case I'm running into I think we want two image builds. The first can essentially just copy httpd:alpine to httpd:some-downstream-tag. Then have that published to the buildset registry (which we don't seem to run in these tests), and then build a second image FROM httpd:some-downstream-tag. | 20:36 |
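Sketched as a container_images list for such a test job; the contexts, repository names, and tag are invented. The first context's Dockerfile would be just `FROM quay.io/opendevmirror/httpd:alpine` and the second's `FROM quay.io/example/httpd:some-downstream-tag`, so the second build can only succeed by finding the first image in the buildset registry:

```yaml
# Hypothetical test data; none of these repositories or tags exist upstream,
# which is the point -- the consumer image must come from the buildset registry.
container_images:
  - context: test-playbooks/container/producer
    repository: quay.io/example/httpd
    tags:
      - some-downstream-tag
  - context: test-playbooks/container/consumer
    repository: quay.io/example/consumer
    tags:
      - latest
```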
| clarkb | maybe the buildset registry tests are a closer match to what we're trying to cover here? | 20:36 |
| clarkb | yes I think zuul-jobs-test-registry-docker is closer to what we want | 20:39 |
| clarkb | corvus: that ^ test covers this except that it uses docker.io as the image location so I think things magically work | 20:42 |
| clarkb | the simplest thing here may be to change that test to use quay.io as the registry (or something else entirely) so that we're not running into the docker.io magic | 20:42 |
| clarkb | hrm that test also uses the build-docker-image role when container command is docker and build-container-image otherwise. We probably want a build-container-image version of the job and a build-docker-image version of the job? | 20:45 |
| clarkb | to tl;dr there do appear to be some definite test improvements that could be made in zuul-jobs (really speaks to the complexity of the Venn diagram the ecosystem has given us here). Then for making things actually work we could just do multiarch always and default to the host's arch. I don't think there is any overhead for native builds other than pulling the qemu stuff in. It will be | 20:47 |
| clarkb | unused | 20:47 |
| corvus | clarkb: it's possible this is why we don't have that test. :( | 20:47 |
| clarkb | corvus: fwiw I think the two roles share the multiarch for docker code (at least today). So we could make both roles just always do multiarch and default to current host arch and get some coverage via the existing jobs if we change the image location from docker.io to something else | 20:49 |
| clarkb | then maybe build from there? | 20:49 |
| corvus | clarkb: i think the main thing is just run the buildx-setup task list always. i don't actually understand what you mean when you say "default to the current host arch", because i don't see anything in the buildx-setup file that does anything with architecture. maybe i'm missing it. | 20:50 |
| clarkb | corvus: in our opendev jobs we enable multiarch builds by setting the arch: list in the container_images dict. I guess I was thinking of setting that list by default to the current host arch and let the existing code run as is | 20:51 |
| clarkb | corvus: https://review.opendev.org/c/opendev/grafyaml/+/958601/5/.zuul.yaml basically that but do it in the role by default | 20:51 |
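In job configuration terms the workaround is roughly the following; the repository and tag are illustrative and the linked change may differ, but the key point is the single-entry arch list that forces the build down the multiarch/buildx path. The linux/amd64 platform string is an assumption about the format the role expects:

```yaml
# Sketch of the single-arch "multiarch" workaround applied to one image.
container_images:
  - context: .
    repository: quay.io/opendevorg/grafyaml
    arch:
      - linux/amd64  # single entry: native arch only, but triggers the buildx path
    tags:
      - latest
```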
| corvus | yeah that is a way we could make things work for opendev image builds quickly, but if we now know the roles are broken i think we should fix the roles | 20:51 |
| corvus | okay this is becoming more clear now | 20:52 |
| corvus | the issue is really that this only works with buildkit, and buildkit must be configured correctly, and the only code path that configures buildkit correctly is multiarch. | 20:53 |
| clarkb | I guess doing it that way leaves dead code in the role | 20:53 |
| clarkb | corvus: ya | 20:53 |
| corvus | in zuul, we set DOCKER_BUILDKIT=1 | 20:53 |
| corvus | but that isn't enough because that doesn't init buildkit correctly | 20:54 |
| corvus | so maybe what we should do is have a role-level variable that is "enable buildkit" and have it init buildkit correctly, set that env var. and if multiarch is specified, we also set that variable. | 20:54 |
| corvus | (and then someday make that the default) | 20:55 |
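A hypothetical shape for that switch; the variable name container_enable_buildkit does not exist in zuul-jobs today and is only meant to show the idea of one flag gating both the buildx setup and the env var:

```yaml
# Hypothetical tasks in the role; assumes a new role default
# "container_enable_buildkit: false" defined alongside them.
- name: Run the buildx/buildkit setup when requested or when multiarch is in use
  include_tasks: setup-buildx.yaml
  when: >-
    container_enable_buildkit | bool
    or container_images | selectattr('arch', 'defined') | list | length > 0

- name: Make sure the build commands run with buildkit enabled
  set_fact:
    container_build_extra_env: "{{ container_build_extra_env | default({}) | combine({'DOCKER_BUILDKIT': 1}) }}"
  when: container_enable_buildkit | bool
```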
| clarkb | on the OpenDev side of things I think we have these options: A) Use docker buildx everywhere with multiarch builds configured on our side, or help fix zuul-jobs as proposed above (or otherwise); B) Use podman everywhere and add support for podman multiarch builds to zuul-jobs; or C) use podman everywhere for non-multiarch and only incur the complicated overhead of docker buildx | 20:55 |
| clarkb | multiarch stuff when necessary | 20:55 |
| clarkb | I suspect A is what corvus would prefer | 20:56 |
| clarkb | and I don't really have any objections to it other than a small concern that the build process is a lot more complicated than you'd hope is necessary but it is with docker so meh | 20:56 |
| corvus | do we need any multiarch builds? | 20:57 |
| clarkb | corvus: currently opendev does not. However, nodepool still mostly exists and it relies on the python-base and python-builder opendev images | 20:57 |
| corvus | i think we can disregard that as a requirement | 20:58 |
| corvus | like, not a design requirement | 20:58 |
| clarkb | looking at https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/build-container-image/tasks/buildx.yaml I think you are right that we can just always set up buildx and then only when doing multiarch builds do the extra registry and the weird push pull push dance | 20:59 |
| clarkb | which simplifies the docker build path a bit (but we still have to do all of the buildx setup stuff either way) | 20:59 |
| corvus | i think we need all of those because buildx doesn't put output in the local image cache by default | 21:00 |
| clarkb | it is also possible that we would get resources in a cloud that we want to use for services that are arm based and having the ability to do multiarch would be nice for that. But if we ignore nodepool then I don't think there is any reason for opendev to care about multiarch today | 21:00 |
| clarkb | corvus: oh right its inside the buildx builder container | 21:00 |
| clarkb | corvus: so the main difference would be whether or not we provide the --platform argument. But otherwise the tasks are the same | 21:01 |
| corvus | oh, we use a temporary registry because of multiarch though | 21:01 |
| corvus | so yeah, that's more complicated | 21:01 |
| corvus | (like, i think there's still the step to get it into the image cache, but we don't normally need an extra registry for that, we only needed it for multiarch for some reason i don't recall) | 21:02 |
| clarkb | corvus: I wonder if that temporary registry is there to take advantage of the --push argument rather than, say, use skopeo after the build to move things from the buildx builder to the local image cache? | 21:03 |
| clarkb | there are docker image loading commands that can be used too I think | 21:03 |
| clarkb | it's possible that could be refactored | 21:04 |
| corvus | clarkb: i think i prefer A then B and am against C. A and B both seem reasonable. what A has going for it is that all the code is already written, it's just changing the conditional blocks. B needs a bunch of new code developed and we don't actually know if we love podman for building images. | 21:04 |
| corvus | maybe -- i thought maybe there was something about maybe we only got the local arch into the local image cache instead of all of them? but if we do the registry we get all? | 21:05 |
| corvus | maybe i like the word maybe | 21:05 |
| clarkb | oh ya that could be it since docker tries to be smart about giving you images that work with the local cpu(s) | 21:05 |
| clarkb | to catch up opendev: we've found a small gap in the container image building with docker that we have solved for multiarch builds. OpenDev can work around that by doing single-arch multiarch builds as I proposed for grafyaml. I think before we do that we should try and fix things for option A) above | 21:07 |
| clarkb | I can probably start to dig into that either tomorrow or Friday. And in the meantime we can hold off on moving our python base images to quay | 21:07 |
| clarkb | corvus: do you have any need for the held statusbot image builder node? I will clean it up if not | 21:09 |
| corvus | clarkb: nope, thanks | 21:13 |
| clarkb | the autohold has been deleted and I put a WIP on https://review.opendev.org/c/opendev/system-config/+/957277 so that we don't accidentally land that before the zuul-jobs situation is cleaned up | 21:27 |