*** amoralej|off is now known as amoralej | 06:11 | |
opendevreview | Alfredo Moralejo proposed zuul/zuul-jobs master: Use release CentOS SIGS repo to install openvswitch in C9S https://review.opendev.org/c/zuul/zuul-jobs/+/883790 | 07:23 |
opendevreview | Alfredo Moralejo proposed zuul/zuul-jobs master: Use release CentOS SIGS repo to install openvswitch in C9S https://review.opendev.org/c/zuul/zuul-jobs/+/883790 | 08:06 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: fedora: don't use CI mirrors https://review.opendev.org/c/openstack/diskimage-builder/+/883798 | 10:31 |
*** amoralej is now known as amoralej|lunch | 12:29 | |
*** dmellado90 is now known as dmellado | 12:48 | |
*** amoralej|lunch is now known as amoralej | 13:09 | |
TheJulia | o/ Hi folks, any chance we can get a node held for the next failed ironic-grenade job ? | 14:55 |
TheJulia | we have a few different changes which now seems to result in the database upgrade freezing :( | 14:56 |
fungi | TheJulia: sure, failure on any project and change for the job named "ironic-grenade" ? | 14:59 |
TheJulia | on openstack/ironic is fine | 14:59 |
TheJulia | but I think that is the only place it is run | 14:59 |
fungi | in the past we matched failures for an ironic-grenade-multinode-multitenant job according to my shell history, but this time the job name is just "ironic-grenade" right? | 15:00 |
TheJulia | correct | 15:01 |
fungi | cool, i set this just now: | 15:01 |
fungi | zuul-client autohold --tenant=openstack --project=opendev.org/openstack/ironic --job=ironic-grenade --reason="TheJulia troubleshooting frozen database upgrades" --count=1 | 15:01 |
TheJulia | awesome, either myself or iurygregory will be investigating. We have independent changes which seem to tickle extreme database sadness :( | 15:02 |
fungi | TheJulia: iurygregory: you should see a node with state=hold and the above reason text in the comment column at https://zuul.opendev.org/t/openstack/nodes once there is one. just let us know in here and one of us can add access for your ssh key | 15:03 |
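A minimal sketch of how the same hold could also be checked from the command line, assuming zuul-client is pointed at the opendev Zuul API and the autohold-list subcommand is available in the installed version:

```shell
# List current autohold requests for the openstack tenant; the hold set above
# should appear here, and a held node shows up once ironic-grenade fails.
zuul-client --zuul-url https://zuul.opendev.org autohold-list --tenant=openstack
```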
TheJulia | thanks | 15:04 |
fungi | also i love the new(ish) nodes view in the zuul dashboard | 15:04 |
TheJulia | That is kind of nice to see | 15:05 |
TheJulia | gives people an idea of the scope of what is going on quite nicely | 15:05 |
fungi | now to figure out what's gone sideways in rax-iad | 15:05 |
fungi | TheJulia: a more direct indicator of scale can be seen at https://grafana.opendev.org/d/21a6e53ea4/zuul-status | 15:06 |
fungi | but that's not built into zuul, just plotting the statsd emissions it provides | 15:07 |
fungi | as for rax-iad, openstack server list reports 117 instances in ERROR state. looking at one chosen at random, it has task_state=deleting vm_state=error fault={'message': 'InternalServerError', 'code': 500, 'created': '2023-04-07T12:36:17Z'} | 15:14 |
fungi | so that one has been stuck that way consuming quota for a month and a half | 15:15 |
fungi | i wonder if they all have roughly the same timestamp | 15:15 |
iurygregory | Thanks for the information fungi o/ | 15:20 |
opendevreview | Birger J. Nordølum proposed openstack/diskimage-builder master: feat: add almalinux-container element https://review.opendev.org/c/openstack/diskimage-builder/+/883855 | 15:31 |
fungi | looking at the created timestamps in the fault messages, they range from 2023-04-03T03:35:15Z to 2023-05-22T05:23:55Z so whatever the issue, it's ongoing | 15:31 |
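A hedged sketch of the triage described above, assuming standard openstackclient commands; the server UUID is a placeholder rather than one of the affected instances:

```shell
# Count the instances stuck in ERROR state in the region.
openstack server list --status ERROR -f value -c ID | wc -l

# Inspect the fault details (including the 'created' timestamp) for one of them;
# replace the placeholder with a UUID taken from the listing above.
openstack server show <server-uuid> -f json -c fault
```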
clarkb | I've had a weekend to think about it after now spending a good chunk of a couple of weeks digging into the whole quay.io + docker + speculative container testing problem and I just can't bring myself to recommend "switch everything to podman first". Podman, unfortunately, brings its own set of problems that we've run into so far. Installing it on Ubuntu is sketchy until Jammy, you | 15:35 |
clarkb | can't syslog, there's a whole transition problem I haven't even begun to really dig into (are podman and docker even coinstallable, I think they share some binary dependencies?), do we temporarily double our disk space needs between images and volumes?, how do we automate the switch (do we automate the switch)? Nothing that would prevent us from moving forward (though I haven't yet | 15:35 |
clarkb | been able to poke at nested podman with nodepool-builder), but plenty that will make this process necessarily slow and measured. Additionally, it feels like I'm being expected to do 99% of the work. I understand there are ideals at play here but I can't personally be expected to upgrade every server to Jammy so that podman is installable, rewrite and test all of the configuration | 15:35 |
clarkb | management, and transition running services myself. If others continue to feel strongly about this I can help get the nodepool-builder testing up and running, but I don't think I can commit to more at this point. I'm also happy to revert the quay.io image moves or implement the skopeo workaround hack. The more I think about this workaround the less it bothers me. It is quick, | 15:35 |
clarkb | straightforward, gets us the functionality we want without dramatically compromising the testing of what we will eventually deploy to production. The biggest downside is we have to manually curate the list of images and the inclusion of the role in our playbooks/roles. | 15:35 |
clarkb | cc corvus fungi frickler ianw tonyb and anyone else that might be interested | 15:35 |
corvus | clarkb: we don't store any important data volumes, so i think you can basically strike that one off the list. | 15:39 |
corvus | clarkb: (that's a minor point of course) | 15:40 |
fungi | infra-root: i've opened ticket #230522-ord-0001072 with rackspace about the stuck deleting nodes in iad | 15:41 |
clarkb | corvus: that's true, everything should be bind mounted in opendev, except for any mounts we may have missed. It does look like both mysql/mariadb and zookeeper have a complete set though, which would be the main ones to worry about | 15:41 |
clarkb | fungi: thanks! | 15:41 |
corvus | clarkb: i think if we don't want to do podman, then we should either switch back to docker, or make the skopeo solution a real solution (with automated pulls from zuul artifacts). but keep in mind that has issues, like we can never do a "docker pull", and our production playbooks do that a lot. | 15:42 |
corvus | i mean, do we actually have an idea for a solution to the "pull" problem? | 15:42 |
clarkb | corvus: the change I wrote addresses that by injecting the skopeo pull after the docker(-compose) pulls. I think that is really the only option | 15:43 |
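A rough illustration of the workaround being discussed, assuming a skopeo copy injected after the normal pull; the registry address and image name below are placeholders, not values from the actual role:

```shell
# After docker(-compose) pull, overwrite the local image with the speculative
# one from the buildset registry so that the subsequent compose up uses it.
skopeo copy --src-tls-verify=false \
  docker://<buildset-registry>:5000/opendevorg/gerrit:latest \
  docker-daemon:opendevorg/gerrit:latest
```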
corvus | right, so it's basically giving up on the testing production idea -- we have to remember to write our production playbooks to include test code, and if we forget that, we transparently lose testing without any indication. | 15:45 |
clarkb | we could update the role I have written to only do the skopeo pull based on artifacts and stop needing to account for the specific list of images there. But I don't think you can make this transparent and have docker(-compose) pulls | 15:45 |
corvus | tbh, dockerhub is sounding pretty good now | 15:46 |
opendevreview | Birger J. Nordølum proposed openstack/diskimage-builder master: feat: add almalinux-container element https://review.opendev.org/c/openstack/diskimage-builder/+/883855 | 15:46 |
clarkb | I don't personally see it as giving up on testing production. We are still testing production. We even cover the docker(-compose) pull code. We just run a little extra code in testing. It isn't perfect but I don't see it as giving up | 15:46 |
corvus | if the choice is between two principles, then i'd rather choose the principle of testing our exact production playbooks | 15:46 |
fungi | infra-root: i also opened ticket #230522-ord-0001075 for removal of a stuck shutoff instance in dfw which responds with an "is locked" error if i try to delete it | 15:46 |
clarkb | and ya I see both states as less than ideal, and if we decide to pick one it is a matter of deciding which is less problematic for us | 15:48 |
corvus | since dockerhub is still an option, i have a hard time saying it's better to give up the fully working system we have now just to avoid using it. | 15:48 |
corvus | (and keep in mind, the alternative is still "use all the docker tools just with quay.io for hosting", so we're not even making a very strong "pro-community" stance) | 15:49 |
corvus | i think all things considered, we should just roll back to dockerhub, then start picking things off the podman punch list as we can (jammy, running nested, etc) | 15:50 |
clarkb | that works for me. FWIW I think if people want to they could push on podman for services already running on jammy too. (gitea and etherpad for example) | 15:51 |
corvus | also -- if we make the tool switch before the hosting switch, that addresses a lot | 15:51 |
clarkb | yup | 15:51 |
corvus | like we can potentially slowly migrate to podman with images on dockerhub, one service at a time, then later switch hosts | 15:52 |
fungi | that sounds like a reasonable way forward to me | 15:53 |
clarkb | also this morning it has been discovered that siblings testing with nodepool, dib, and glean is broken due to this issue. Apparently we push a :siblings tag into the buildset registry to make that happen? | 15:53 |
clarkb | This may have a different solution (I think the skopeo hack may be more acceptable there for example) | 15:53 |
corvus | fwiw, i'm okay with eating crow and switching zuul back to dockerhub too, though i'm not sure if that's necessary or not? | 15:53 |
clarkb | or maybe just move all of that to podman since it isn't touching production | 15:54 |
clarkb | corvus: I think the nodepool builder jobs that do siblings with dib and glean are the only place that should really affect zuul and friends. And I think there are options there | 15:54 |
clarkb | specifically move those jobs to podman and if that doesn't work for some reason use a skopeo hack since that isn't a test like production case (it's a test for testing's sake case) | 15:54 |
clarkb | 99% of the problem here for opendev is that we're trying to also deploy this stuff to production on real servers which brings different concerns and needs | 15:55 |
corvus | clarkb: i agree that doesn't need to drive the question | 15:56 |
corvus | yeah, i think the main reason to move zuul back would be in solidarity (ie, to keep using the same sets of jobs), but if we still have a desire to (very slowly over time) move opendev to quay, then maybe zuul should stay there and be the advance team? | 15:57 |
corvus | i should say: move opendev to podman and quay | 15:58 |
clarkb | ya I think it is ok for the jobs to differ. We might also be able to run the same jobs just with different options. I think maybe the container jobs would work with docker hub too | 15:58 |
clarkb | but that change should come post rollback to simplify things | 15:58 |
corvus | (if you actually want to keep opendev on docker+dockerhub indefinitely, then we should move zuul back i think. that way we're better using our limited resources to collectively maintain a smaller set of common jobs) | 15:59 |
corvus | (or, after reading your last comment, maintaining a smaller set of common job configurations :) | 15:59 |
clarkb | ok cool. I had a lot of time to noodle on this over the weekend and wanted to get the week started with a conversation to avoid doing a bunch of unnecessary work then deciding on things. We can bring this back up in tomorrow's meeting to catch any other opinions and if there aren't objections there I can start on the rollback for opendev. | 15:59 |
corvus | okay. i think on the zuul side, we still need to see nodepool functional testing in action with podman, right? but for zuul itself, we worked out the issues and can switch when we're ready? | 16:01 |
clarkb | corvus: I think there are still good reasons to move to podman. I just don't see it as being quick and easy. side note: I feel like both docker and podman exhibit problems with what seems like straightforward functionality (logging, pulling images from not docker.io, exploding on ubuntu rootless due to a documented fallback that doesn't actually fall back, etc) | 16:01 |
clarkb | corvus: correct re zuul and nodepool testing | 16:01 |
corvus | ok, so i think if we want to have zuul as the advance party, then we should do the nodepool thing next, and if that works, switch them both over. | 16:02 |
corvus | the nested nodepool thing is not something i will be able to do though, unfortunately. | 16:03 |
clarkb | now that I think about it the nodepool testing update may exercise that for us. So we can use that as the advance party too | 16:03 |
corvus | nodepool testing update? | 16:04 |
clarkb | "we still need to see nodepool functional testing in action with podman" | 16:04 |
corvus | right -- i mean, if that is anything other than straightforward, i'm not going to be in a position to fix it | 16:05 |
clarkb | gotcha | 16:05 |
clarkb | also sidenote: podman had a ppa, they removed this ppa in favor of the opensuse kubic obs repo, kubic deleted packages from this because they weren't going to support it anymore for older things, everyone (rightly imo) complained since the ppa was also dead and the documentation for installing things says use kubic, kubic restored the packages but isn't updating them aiui. But kubic | 16:07 |
clarkb | doesn't matter for new things because new things package podman but that is all kubic updates for. TL;DR I'm highly skeptical of kubic as a package source | 16:07 |
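For context, a hedged illustration of why Jammy is the cutoff: from Ubuntu 22.04 onward podman ships in the distro archive, so none of the PPA/kubic history above applies there:

```shell
# On Ubuntu 22.04 (Jammy) or newer, podman comes straight from the archive.
sudo apt-get update && sudo apt-get install -y podman
podman --version

# On focal/bionic there is no distro package, which is where the deprecated
# PPA and the kubic OBS repo come into play.
```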
*** amoralej is now known as amoralej|off | 16:42 | |
opendevreview | Merged opendev/system-config master: reprepro: mirror Ubuntu UCA Antelope for Ubuntu Jammy https://review.opendev.org/c/opendev/system-config/+/883467 | 17:54 |
*** mooynick is now known as yoctozepto | 18:22 | |
yoctozepto | morning | 18:22 |
yoctozepto | a question about opendev container jobs not being ready for podman | 18:23 |
yoctozepto | https://opendev.org/opendev/base-jobs/src/commit/3fc688b08dbe2ff41a75f051f53b4929dd35800f/playbooks/buildset-registry/pre.yaml | 18:23 |
yoctozepto | only docker is installed there | 18:23 |
yoctozepto | would it be ok if I proposed a patch to handle podman as well? | 18:24 |
yoctozepto | maybe there is one already | 18:24 |
yoctozepto | forgot to check it | 18:24 |
yoctozepto | https://review.opendev.org/q/project:opendev/base-jobs+podman | 18:24 |
yoctozepto | nope | 18:24 |
yoctozepto | ok, so let me experiment in a moment | 18:25 |
yoctozepto | unfortunately, it's a config project | 18:25 |
yoctozepto | :-( | 18:25 |
*** mooynick is now known as yoctozepto | 18:31 | |
yoctozepto | (mobile network switch) | 18:31 |
clarkb | yoctozepto: the jobs are fine with podman | 18:32 |
clarkb | you'll need to be more specific why they are not | 18:32 |
clarkb | (I mean zuul is doing it in a half merged state and I've got at least one change up to experiment with it too. The problems are not with the jobs) | 18:33 |
clarkb | what the buildset registry uses to run the buildset registry software is orthogonal to what you end up testing with the buildset registry as a tool | 18:34 |
yoctozepto | clarkb: https://review.opendev.org/c/nebulous/component-template/+/883304?tab=change-view-tab-header-zuul-results-summary | 18:48 |
yoctozepto | it tries to run podman | 18:48 |
yoctozepto | not having installed it | 18:49 |
yoctozepto | and fails obviously | 18:49 |
yoctozepto | that's the issue | 18:49 |
yoctozepto | what you are saying means to me that it should not be trying to use podman | 19:00 |
yoctozepto | maybe it's some new development that it dies | 19:00 |
yoctozepto | s/dies/does | 19:00 |
yoctozepto | https://opendev.org/zuul/zuul-jobs/commits/branch/master/roles/run-buildset-registry/tasks/main.yaml | 19:02 |
yoctozepto | nah, been like this for quite some time | 19:02 |
fungi | looks like the container-image pre-run picks between docker and podman: https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/container-image/pre.yaml | 19:02 |
fungi | depending on what container_command is set to | 19:03 |
yoctozepto | fungi: yeah, but it will come later | 19:03 |
yoctozepto | it does not reach it by then | 19:03 |
yoctozepto | see the run | 19:03 |
yoctozepto | https://zuul.opendev.org/t/nebulous/build/b86ef46fbe424a54ac4cb46b0432dfb0/console | 19:03 |
yoctozepto | container_command is set to podman | 19:03 |
iurygregory | fungi, hey just saw the node in https://zuul.opendev.org/t/openstack/nodes | 19:03 |
fungi | right, i was comparing to the buildset-registry pre-run | 19:03 |
yoctozepto | my patch would be doing the same in buildset-registry | 19:04 |
fungi | iurygregory: what ssh key do you want added? | 19:04 |
yoctozepto | I wonder if that's the only place that will need fixing | 19:04 |
yoctozepto | but I guess we won't know without trying | 19:04 |
iurygregory | fungi, will send to you in 1min | 19:06 |
fungi | iurygregory: ssh root@104.130.135.41 | 19:13 |
fungi | let me know if it doesn't authenticate for you | 19:13 |
opendevreview | Radosław Piliszek proposed opendev/base-jobs master: buildset-registry: Add podman support https://review.opendev.org/c/opendev/base-jobs/+/883869 | 19:16 |
yoctozepto | fungi, clarkb ^ | 19:16 |
iurygregory | fungi, done | 19:16 |
iurygregory | it worked | 19:16 |
yoctozepto | meh that I can't just depends-on it to test it | 19:18 |
clarkb | yoctozepto: as mentioned that is orthogonal to what you are doing | 19:20 |
clarkb | the buildset registry is a service that runs in jobs/buildsets. How it runs is independent of your jobs. If your jobs need podman then you need to install it in your jobs | 19:20 |
yoctozepto | clarkb: it's buildset-registry that fails to run when I set the command to podman | 19:20 |
yoctozepto | see https://review.opendev.org/c/nebulous/component-template/+/883304 | 19:21 |
yoctozepto | according to docs, this should work fine | 19:21 |
yoctozepto | it fails to start the buildset-registry | 19:21 |
clarkb | ok I see you finally linked to a failure :) | 19:21 |
clarkb | ok so run-buildset-registry doesn't install either docker or podman | 19:24 |
yoctozepto | yeah, I even wrote a nice commit message on the fix to explain what's happening | 19:24 |
yoctozepto | buildset-registry installs docker only now | 19:24 |
clarkb | I feel like "buildset-registry is independent of your job content" is what should be happening but I guess isn't | 19:24 |
yoctozepto | seems like some old approach before it was made more flexible | 19:24 |
clarkb | corvus: ^ do you have an opinion on that? | 19:24 |
yoctozepto | yeah, we could go that direction | 19:25 |
yoctozepto | like, using buildset_registry_container_command | 19:25 |
yoctozepto | independent of container_command | 19:25 |
clarkb | yoctozepto: or just set the var when you include the role | 19:25 |
clarkb | that should override it in inner scopes but not outer right? | 19:25 |
yoctozepto | I simply reuse your jobs, see my commit | 19:26 |
yoctozepto | not easy to hack without violating DRY | 19:26 |
yoctozepto | :-) | 19:26 |
clarkb | yes I mean here https://review.opendev.org/c/opendev/base-jobs/+/883869/1/playbooks/buildset-registry/pre.yaml#36 | 19:26 |
yoctozepto | ah | 19:26 |
yoctozepto | could be | 19:26 |
yoctozepto | though maybe supporting podman simply makes more sense | 19:26 |
yoctozepto | in the long term | 19:27 |
clarkb | my concern with that is podman doesn't run in a lot of places | 19:27 |
clarkb | docker runs everywhere so for generic "run this service" things where we don't really care about speculative gating I think docker might still be a better choice | 19:27 |
clarkb | I also don't know that there is much value to supporting more than one way to run it | 19:27 |
clarkb | twice as many ways it might break | 19:28 |
yoctozepto | true that | 19:28 |
clarkb | I guess I can go either way on that now that I understand the problem | 19:28 |
clarkb | flexibility vs potential reliability | 19:28 |
clarkb | I'll update my review | 19:29 |
yoctozepto | thanks, I am also largely indifferent; at most lazy to update the commit to do the other way ;p | 19:29 |
yoctozepto | as long as the desired speculative runs work in the end, I am happy to base it on either solution | 19:30 |
clarkb | I left two notes, one to fix the proposal as is and the other to try and isolate running a registry from what is happening in the jobs | 19:32 |
clarkb | One upside to being consistent is that it reduces the number of external deps | 19:35 |
clarkb | which is probably a bigger reliability concern than the chance of podman or docker changing behavior in unexpected ways | 19:36 |
clarkb | yoctozepto: ^ if you want to update it to fix the default value to match run-buildset-registry's default of docker I think we can probably land it for that reason. Note that ensure-podman does not work on older ubuntu | 19:36 |
corvus | there's some pretty docker-specific stuff in there, so adding podman to that might be more than initially expected. i think clarkb 's suggestion about the default makes sense | 19:40 |
yoctozepto | amending, clarkb | 19:42 |
opendevreview | Radosław Piliszek proposed opendev/base-jobs master: buildset-registry: Always use Docker https://review.opendev.org/c/opendev/base-jobs/+/883869 | 19:46 |
* yoctozepto is finishing work for today | 19:51 | |
yoctozepto | talk to you on gerrit | 19:51 |
dansmith | clarkb: I didn't really follow the above, but just FYI we're jamming podman into jammy for the ceph jobs: https://github.com/openstack/devstack-plugin-ceph/blob/master/devstack/files/debs/devstack-plugin-ceph#L4 | 21:13 |
dansmith | because cephadm wants it (I think) | 21:14 |
dansmith | I certainly agree that docker is a known and supported quantity for anything that isn't opinionated about it | 21:14 |
clarkb | dansmith: for your needs the main issue is that there is no reliable source of podman for ubuntu older than jammy | 21:17 |
dansmith | clarkb: ah, older than jammy, yeah for sure | 21:18 |
dansmith | we were using another package repo (possibly the one you mentioned above) but switched to the inbuilt packages during the recent modernization effort | 21:18 |
clarkb | dansmith: the longer story is that there was a PPA for this stuff which got deprecated and is no longer updated. This happened because there is an OBS repo called "kubic" that started building packages instead. But then kubic said nevermind the older distros and deleted them all. People got angry/panicked/voiced displeasure so kubic added the packages back but is no longer updating | 21:19 |
clarkb | them. The problem is that podman exists on newer stuff so really you only need kubic for the older things anyway which means it too is not super useful | 21:19 |
clarkb | for OpenDev we have a mixture of servers and can't simply rely on jammy everywhere. In CI this is less problematic | 21:19 |
dansmith | ack | 21:19 |
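A small hedged check for which of the situations clarkb describes a given node is in: apt-cache policy shows whether podman resolves to the distro archive (Jammy and newer), to a third-party source such as kubic, or to nothing at all:

```shell
# Shows the candidate version and the repository it would be installed from;
# on releases older than Jammy with no extra repos there is no candidate.
apt-cache policy podman
```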
clarkb | I've just updated the meeting agenda. Anything important missing? | 22:53 |
fungi | nothing i can think of | 22:57 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!