*** amoralej|off is now known as amoralej | 06:11 | |
opendevreview | Alfredo Moralejo proposed zuul/zuul-jobs master: Use release CentOS SIGS repo to install openvswitch in C9S https://review.opendev.org/c/zuul/zuul-jobs/+/883790 | 07:23 |
opendevreview | Alfredo Moralejo proposed zuul/zuul-jobs master: Use release CentOS SIGS repo to install openvswitch in C9S https://review.opendev.org/c/zuul/zuul-jobs/+/883790 | 08:06 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: fedora: don't use CI mirrors https://review.opendev.org/c/openstack/diskimage-builder/+/883798 | 10:31 |
*** amoralej is now known as amoralej|lunch | 12:29 | |
*** dmellado90 is now known as dmellado | 12:48 | |
*** amoralej|lunch is now known as amoralej | 13:09 | |
TheJulia | o/ Hi folks, any chance we can get a node held for the next failed ironic-grenade job ? | 14:55 |
TheJulia | we have a few different changes which now seems to result in the database upgrade freezing :( | 14:56 |
fungi | TheJulia: sure, failure on any project and change for the job named "ironic-grenade" ? | 14:59 |
TheJulia | on openstack/ironic is fine | 14:59 |
TheJulia | but I think that is the only place it is run | 14:59 |
fungi | in the past we matched failures for an ironic-grenade-multinode-multitenant job according to my shell history, but this time the job name is just "ironic-grenade" right? | 15:00 |
TheJulia | correct | 15:01 |
fungi | cool, i set this just now: | 15:01 |
fungi | zuul-client autohold --tenant=openstack --project=opendev.org/openstack/ironic --job=ironic-grenade --reason="TheJulia troubleshooting frozen database upgrades" --count=1 | 15:01 |
TheJulia | awesome, either myself or iurygregory will be investigating. We have independent changes which seem to tickle extreme database sadness :( | 15:02 |
fungi | TheJulia: iurygregory: you should see a node with state=hold and the above reason text in the comment column at https://zuul.opendev.org/t/openstack/nodes once there is one. just let us know in here and one of us can add access for your ssh key | 15:03 |
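A minimal sketch of how the same hold could also be checked from the command line, assuming zuul-client is pointed at the opendev Zuul API and the autohold-list subcommand is available in the installed version:

```shell
# List current autohold requests for the openstack tenant; the hold set above
# should appear here, and a held node shows up once ironic-grenade fails.
zuul-client --zuul-url https://zuul.opendev.org autohold-list --tenant=openstack
```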
TheJulia | thanks | 15:04 |
fungi | also i love the new(ish) nodes view in the zuul dashboard | 15:04 |
TheJulia | That is kind of nice to see | 15:05 |
TheJulia | gives people an idea of the scope of what is going on quite nicely | 15:05 |
fungi | now to figure out what's gone sideways in rax-iad | 15:05 |
fungi | TheJulia: a more direct indicator of scale can be seen at https://grafana.opendev.org/d/21a6e53ea4/zuul-status | 15:06 |
fungi | but that's not built into zuul, just plotting the statsd emissions it provides | 15:07 |
fungi | as for rax-iad, openstack server list reports 117 instances in ERROR state. looking at one chosen at random, it has task_state=deleting vm_state=error fault={'message': 'InternalServerError', 'code': 500, 'created': '2023-04-07T12:36:17Z'} | 15:14 |
fungi | so that one has been stuck that way consuming quota for a month and a half | 15:15 |
fungi | i wonder if they all have roughly the same timestamp | 15:15 |
iurygregory | Thanks for the information fungi o/ | 15:20 |
opendevreview | Birger J. Nordølum proposed openstack/diskimage-builder master: feat: add almalinux-container element https://review.opendev.org/c/openstack/diskimage-builder/+/883855 | 15:31 |
fungi | looking at the created timestamps in the fault messages, they range from 2023-04-03T03:35:15Z to 2023-05-22T05:23:55Z so whatever the issue, it's ongoing | 15:31 |
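A hedged sketch of the triage described above, assuming standard openstackclient commands; the server UUID is a placeholder rather than one of the affected instances:

```shell
# Count the instances stuck in ERROR state in the region.
openstack server list --status ERROR -f value -c ID | wc -l

# Inspect the fault details (including the 'created' timestamp) for one of them;
# replace the placeholder with a UUID taken from the listing above.
openstack server show <server-uuid> -f json -c fault
```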
clarkb | I've had a weekend to think about it after now spending a good chunk of a couple of weeks digging into the whole quay.io + docker + speculative container testing problem and I just can't bring myself to recommend "switch everything to podman first". Podman, unfortunately, brings its own set of problems that we've run into so far. Installing it on Ubuntu is sketchy until Jammy, you | 15:35 |
clarkb | can't syslog, there's a whole transition problem I haven't even begun to really dig into (are podman and docker even coinstallable, I think they share some binary dependencies?), do we temporarily double our disk space needs between images and volumes?, how do we automate the switch (do we automate the switch)? Nothing that would prevent us from moving forward (though I haven't yet | 15:35 |
clarkb | been able to poke at nested podman with nodepool-builder), but plenty that will make this process necessarily slow and measured. Additionally, it feels like I'm being expected to do 99% of the work. I understand there are ideals at play here but I can't personally be expected to upgrade every server to Jammy so that podman is installable, rewrite and test all of the configuration | 15:35 |
clarkb | management, and transition running services myself. If others continue to feel strongly about this I can help get the nodepool-builder testing up and running, but I don't think I can commit to more at this point. I'm also happy to revert the quay.io image moves or implement the skopeo workaround hack. The more I think about this workaround the less it bothers me. It is quick, | 15:35 |
clarkb | straightforward, gets us the functionality we want without dramatically compromising the testing of what we will eventually deploy to production. The biggest downside is we have to manually curate the list of images and the inclusion of the role in our playbooks/roles. | 15:35 |
clarkb | cc corvus fungi frickler ianw tonyb and anyone else that might be interested | 15:35 |
corvus | clarkb: we don't store any important data volumes, so i think you can basically strike that one off the list. | 15:39 |
corvus | clarkb: (that's a minor point of course) | 15:40 |
fungi | infra-root: i've opened ticket #230522-ord-0001072 with rackspace about the stuck deleting nodes in iad | 15:41 |
clarkb | corvus: that's true, everything should be bind mounted in opendev, except for any mounts we may have missed. It does look like both mysql/mariadb and zookeeper have a complete set though, which would be the main ones to worry about | 15:41 |
clarkb | fungi: thanks! | 15:41 |
corvus | clarkb: i think if we don't want to do podman, then we should either switch back to docker, or make the skopeo solution a real solution (with automated pulls from zuul artifacts). but keep in mind that has issues, like we can never do a "docker pull", and our production playbooks do that a lot. | 15:42 |
corvus | i mean, do we actually have an idea for a solution to the "pull" problem? | 15:42 |
clarkb | corvus: the change I wrote addresses that by injecting the skopeo pull after the docker(-compose) pulls. I think that is really the only option | 15:43 |
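A rough illustration of the workaround being discussed, assuming a skopeo copy injected after the normal pull; the registry address and image name below are placeholders, not values from the actual role:

```shell
# After docker(-compose) pull, overwrite the local image with the speculative
# one from the buildset registry so that the subsequent compose up uses it.
skopeo copy --src-tls-verify=false \
  docker://<buildset-registry>:5000/opendevorg/gerrit:latest \
  docker-daemon:opendevorg/gerrit:latest
```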
corvus | right, so it's basically giving up on the testing production idea -- we have to remember to write our production playbooks to include test code, and if we forget that, we transparently lose testing without any indication. | 15:45 |
clarkb | we could update the role I have written to only do the skopeo pull based on artifacts and stop needing to account for the specific list of images there. But I don't think you can make this transparent and have docker(-compose) pulls | 15:45 |
corvus | tbh, dockerhub is sounding pretty good now | 15:46 |
opendevreview | Birger J. Nordølum proposed openstack/diskimage-builder master: feat: add almalinux-container element https://review.opendev.org/c/openstack/diskimage-builder/+/883855 | 15:46 |
clarkb | I don't personally see it as giving up on testing production. We are still testing production. We even cover the docker(-compose) pull code. We just run a little extra code in testing. It isn't perfect but I don't see it as giving up | 15:46 |
corvus | if the choice is between two principles, then i'd rather choose the principle of testing our exact production playbooks | 15:46 |
fungi | infra-root: i also opened ticket #230522-ord-0001075 for removal of a stuck shutoff instance in dfw which responds with an "is locked" error if i try to delete it | 15:46 |
clarkb | and ya I see both states as less than ideal, and if we decide to pick one it is a matter of deciding which is less problematic for us | 15:48 |
corvus | since dockerhub is still an option, i have a hard time saying it's better to give up the fully working system we have now just to avoid using it. | 15:48 |
corvus | (and keep in mind, the alternative is still "use all the docker tools just with quay.io for hosting", so we're not even making a very strong "pro-community" stance) | 15:49 |
corvus | i think all things considered, we should just roll back to dockerhub, then start picking things off the podman punch list as we can (jammy, running nested, etc) | 15:50 |
clarkb | that works for me. FWIW I think if people want to they could push on podman for services already running on jammy too. (gitea and etherpad for example) | 15:51 |
corvus | also -- if we make the tool switch before the hosting switch, that addresses a lot | 15:51 |
clarkb | yup | 15:51 |
corvus | like we can potentially slowly migrate to podman with images on dockerhub, one service at a time, then later switch hosts | 15:52 |
fungi | that sounds like a reasonable way forward to me | 15:53 |
clarkb | also this morning it has been discovered that siblings testing with nodepool, dib, and glean is broken due to this issue. Apparently we push a :siblings tag into the buildset registry to make that happen? | 15:53 |
clarkb | This may have a different solution (I think the skopeo hack may be more acceptable there for example) | 15:53 |
corvus | fwiw, i'm okay with eating crow and switching zuul back to dockerhub too, though i'm not sure if that's necessary or not? | 15:53 |
clarkb | or maybe just move all of that to podman since it isn't touching production | 15:54 |
clarkb | corvus: I think the nodepool builder jobs that do siblings with dib and glean are the only place that should really affect zuul and friends. And I think there are options there | 15:54 |
clarkb | specifically move those jobs to podman and if that doesn't work for some reason use a skopeo hack since that isn't a test like production case (it's a test for testing's sake case) | 15:54 |
clarkb | 99% of the problem here for opendev is that we're trying to also deploy this stuff to production on real servers which brings different concerns and needs | 15:55 |
corvus | clarkb: i agree that doesn't need to drive the question | 15:56 |
corvus | yeah, i think the main reason to move zuul back would be in solidarity (ie, to keep using the same sets of jobs), but if we still have a desire to (very slowly over time) move opendev to quay, then maybe zuul should stay there and be the advance team? | 15:57 |
corvus | i should say: move opendev to podman and quay | 15:58 |
clarkb | ya I think it is ok for the jobs to differ. We might also be able to run the same jobs just with different options. I think maybe the container jobs would work with docker hub too | 15:58 |
clarkb | but that change should come post rollback to simplify things | 15:58 |
corvus | (if you actually want to keep opendev on docker+dockerhub indefinitely, then we should move zuul back i think. that way we're better using our limited resources to collectively maintain a smaller set of common jobs) | 15:59 |
corvus | (or, after reading your last comment, maintaining a smaller set of common job configurations :) | 15:59 |
clarkb | ok cool. I had a lot of time to noodle on this over the weekend and wanted to get the week started with a conversation to avoid doing a bunch of unnecessary work then deciding on things. We can bring this back up in tomorrow's meeting to catch any other opinions and if there aren't objections there I can start on the rollback for opendev. | 15:59 |
corvus | okay. i think on the zuul side, we still need to see nodepool functional testing in action with podman, right? but for zuul itself, we worked out the issues and can switch when we're ready? | 16:01 |
clarkb | corvus: I think there are still good reasons to move to podman. I just don't see it as being quick and easy. side note: I feel like both docker and podman exhibit problems with what seems like straightforward functionality (logging, pulling images from not docker.io, exploding on ubuntu rootless due to a documented fallback that doesn't actually fall back, etc) | 16:01 |
clarkb | corvus: correct re zuul and nodepool testing | 16:01 |
corvus | ok, so i think if we want to have zuul as the advance party, then we should do the nodepool thing next, and if that works, switch them both over. | 16:02 |
corvus | the nested nodepool thing is not something i will be able to do though, unfortunately. | 16:03 |
clarkb | now that I think about it the nodepool testing update may exercise that for us. So we can use that as the advance party too | 16:03 |
corvus | nodepool testing update? | 16:04 |
clarkb | "we still need to see nodepool functional testing in action with podman" | 16:04 |
corvus | right -- i mean, if that is anything other than straightforward, i'm not going to be in a position to fix it | 16:05 |
clarkb | gotcha | 16:05 |
clarkb | also sidenote: podman had a ppa, they removed this ppa in favor of the opensuse kubic obs repo, kubic deleted packages from this because they weren't going to support it anymore for older things, everyone (rightly imo) complained since the ppa was also dead and the documentation for installing things says use kubic, kubic restored the packages but isn't updating them aiui. But kubic | 16:07 |
clarkb | doesn't matter for new things because new things package podman but that is all kubic updates for. TL;DR I'm highly skeptical of kubic as a package source | 16:07 |
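For context, a hedged illustration of why Jammy is the cutoff: from Ubuntu 22.04 onward podman ships in the distro archive, so none of the PPA/kubic history above applies there:

```shell
# On Ubuntu 22.04 (Jammy) or newer, podman comes straight from the archive.
sudo apt-get update && sudo apt-get install -y podman
podman --version

# On focal/bionic there is no distro package, which is where the deprecated
# PPA and the kubic OBS repo come into play.
```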
*** amoralej is now known as amoralej|off | 16:42 | |
opendevreview | Merged opendev/system-config master: reprepro: mirror Ubuntu UCA Antelope for Ubuntu Jammy https://review.opendev.org/c/opendev/system-config/+/883467 | 17:54 |
*** mooynick is now known as yoctozepto | 18:22 | |
yoctozepto | morning | 18:22 |
yoctozepto | a question about opendev container jobs not being ready for podman | 18:23 |
yoctozepto | https://opendev.org/opendev/base-jobs/src/commit/3fc688b08dbe2ff41a75f051f53b4929dd35800f/playbooks/buildset-registry/pre.yaml | 18:23 |
yoctozepto | only docker is installed there | 18:23 |
yoctozepto | would it be ok if I proposed a patch to handle podman as well? | 18:24 |
yoctozepto | maybe there is one already | 18:24 |
yoctozepto | forgot to check it | 18:24 |
yoctozepto | https://review.opendev.org/q/project:opendev/base-jobs+podman | 18:24 |
yoctozepto | nope | 18:24 |
yoctozepto | ok, so let me experiment in a moment | 18:25 |
yoctozepto | unfortunately, it's a config project | 18:25 |
yoctozepto | :-( | 18:25 |
*** mooynick is now known as yoctozepto | 18:31 | |
yoctozepto | (mobile network switch) | 18:31 |
clarkb | yoctozepto: the jobs are fine with podman | 18:32 |
clarkb | you'll need to be more specific why they are not | 18:32 |
clarkb | (I mean zuul is doing it in a half merged state and I've got at least one change up to experiment with it too. The problems are not with the jobs) | 18:33 |
clarkb | what the buildset registry uses to run the buildset registry software is orthogonal to what you end up testing with the buildset registry as a tool | 18:34 |
yoctozepto | clarkb: https://review.opendev.org/c/nebulous/component-template/+/883304?tab=change-view-tab-header-zuul-results-summary | 18:48 |
yoctozepto | it tries to run podman | 18:48 |
yoctozepto | not having installed it | 18:49 |
yoctozepto | and fails obviously | 18:49 |
yoctozepto | that's the issue | 18:49 |
yoctozepto | what you are saying means to me that it should not be trying to use podman | 19:00 |
yoctozepto | maybe it's some new development that it dies | 19:00 |
yoctozepto | s/dies/does | 19:00 |
yoctozepto | https://opendev.org/zuul/zuul-jobs/commits/branch/master/roles/run-buildset-registry/tasks/main.yaml | 19:02 |
yoctozepto | nah, been like this for quite some time | 19:02 |
fungi | looks like the container-image pre-run picks between docker and podman: https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/container-image/pre.yaml | 19:02 |
fungi | depending on what container_command is set to | 19:03 |
yoctozepto | fungi: yeah, but it will come later | 19:03 |
yoctozepto | it does not reach it by then | 19:03 |
yoctozepto | see the run | 19:03 |
yoctozepto | https://zuul.opendev.org/t/nebulous/build/b86ef46fbe424a54ac4cb46b0432dfb0/console | 19:03 |
yoctozepto | container_command is set to podman | 19:03 |
iurygregory | fungi, hey just saw the node in https://zuul.opendev.org/t/openstack/nodes | 19:03 |
fungi | right, i was comparing to the buildset-registry pre-run | 19:03 |
yoctozepto | my patch would be doing the same in buildset-registry | 19:04 |
fungi | iurygregory: what ssh key do you want added? | 19:04 |
yoctozepto | I wonder if that's the only place that will need fixing | 19:04 |
yoctozepto | but I guess we won't know without trying | 19:04 |
iurygregory | fungi, will send to you in 1min | 19:06 |
fungi | iurygregory: ssh root@104.130.135.41 | 19:13 |
fungi | let me know if it doesn't authenticate for you | 19:13 |
opendevreview | Radosław Piliszek proposed opendev/base-jobs master: buildset-registry: Add podman support https://review.opendev.org/c/opendev/base-jobs/+/883869 | 19:16 |
yoctozepto | fungi, clarkb ^ | 19:16 |
iurygregory | fungi, done | 19:16 |
iurygregory | it worked | 19:16 |
yoctozepto | meh that I can't just depends-on it to test it | 19:18 |
clarkb | yoctozepto: as mentioned that is orthogonal to what you are doing | 19:20 |
clarkb | the buildset registry is a service that runs in jobs/buildsets. How it runs is independent of your jobs. If your jobs need podman then you need to install it in your jobs | 19:20 |
yoctozepto | clarkb: it's buildset-registry that fails to run when I set the command to podman | 19:20 |
yoctozepto | see https://review.opendev.org/c/nebulous/component-template/+/883304 | 19:21 |
yoctozepto | according to docs, this should work fine | 19:21 |
yoctozepto | it fails to start the buildset-registry | 19:21 |
clarkb | ok I see you finally linked to a failure :) | 19:21 |
clarkb | ok so run-buildset-registry doesn't install either docker or podman | 19:24 |
yoctozepto | yeah, I even wrote a nice commit message on the fix to explain what's happening | 19:24 |
yoctozepto | buildset-registry installs docker only now | 19:24 |
clarkb | I feel like "buildset-registry is independent of your job content" is what should be happening but I guess isn't | 19:24 |
yoctozepto | seems like some old approach before it was made more flexible | 19:24 |
clarkb | corvus: ^ do you have an opinion on that? | 19:24 |
yoctozepto | yeah, we could go that direction | 19:25 |
yoctozepto | like, using buildset_registry_container_command | 19:25 |
yoctozepto | independent of container_command | 19:25 |
clarkb | yoctozepto: or just set the var when you include the role | 19:25 |
clarkb | that should override it in inner scopes but not outer right? | 19:25 |
yoctozepto | I simply reuse your jobs, see my commit | 19:26 |
yoctozepto | not easy to hack without violating DRY | 19:26 |
yoctozepto | :-) | 19:26 |
clarkb | yes I mean here https://review.opendev.org/c/opendev/base-jobs/+/883869/1/playbooks/buildset-registry/pre.yaml#36 | 19:26 |
yoctozepto | ah | 19:26 |
yoctozepto | could be | 19:26 |
yoctozepto | though maybe supporting podman simply makes more sense | 19:26 |
yoctozepto | in the long term | 19:27 |
clarkb | my concern with that is podman doesn't run in a lot of places | 19:27 |
clarkb | docker runs everywhere so for generic "run this service" things where we don't really care about speculative gating I think docker might still be a better choice | 19:27 |
clarkb | I also don't know that there is much value to supporting more than one way to run it | 19:27 |
clarkb | twice as many ways it might break | 19:28 |
yoctozepto | true that | 19:28 |
clarkb | I guess I can go either way on that now that I understand the problem | 19:28 |
clarkb | flexibility vs potential reliability | 19:28 |
clarkb | I'll update my review | 19:29 |
yoctozepto | thanks, I am also largely indifferent; at most lazy to update the commit to do the other way ;p | 19:29 |
yoctozepto | as long as the desired speculative runs work in the end, I am happy to base it on either solution | 19:30 |
clarkb | I left two notes, one to fix the proposal as is and the other to try and isolate running a registry from what is happening in the jobs | 19:32 |
clarkb | One upside to being consistent is that it reduces the number of external deps | 19:35 |
clarkb | which is probably a bigger reliability concern than the chance of podman or docker changing behavior in unexpected ways | 19:36 |
clarkb | yoctozepto: ^ if you want to update it to fix the default value to match run-buildset-registry's default of docker I think we can probably land it for that reason. Note that ensure-podman does not work on older ubuntu | 19:36 |
corvus | there's some pretty docker-specific stuff in there, so adding podman to that might be more than initially expected. i think clarkb 's suggestion about the default makes sense | 19:40 |
yoctozepto | amending, clarkb | 19:42 |
opendevreview | Radosław Piliszek proposed opendev/base-jobs master: buildset-registry: Always use Docker https://review.opendev.org/c/opendev/base-jobs/+/883869 | 19:46 |
* yoctozepto is finishing work for today | 19:51 | |
yoctozepto | talk to you on gerrit | 19:51 |
dansmith | clarkb: I didn't really follow the above, but just FYI we're jamming podman into jammy for the ceph jobs: https://github.com/openstack/devstack-plugin-ceph/blob/master/devstack/files/debs/devstack-plugin-ceph#L4 | 21:13 |
dansmith | because cephadm wants it (I think) | 21:14 |
dansmith | I certainly agree that docker is a known and supported quantity for anything that isn't opinionated about it | 21:14 |
clarkb | dansmith: for your needs the main issue is that there is no reliable source of podman for ubuntu older than jammy | 21:17 |
dansmith | clarkb: ah, older than jammy, yeah for sure | 21:18 |
dansmith | we were using another package repo (possibly the one you mentioned above) but switched to the inbuilt packages during the recent modernization effort | 21:18 |
clarkb | dansmith: the longer story is that there was a PPA for this stuff which got deprecated and is no longer updated. This happened because there is an OBS repo called "kubic" that started building packages instead. But then kubic said nevermind the older distros and deleted them all. People got angry/panicked/voiced displeasure so kubic added the packages back but is no longer updating | 21:19 |
clarkb | them. The problem is that podman exists on newer stuff so really you only need kubic for the older things anyway which means it too is not super useful | 21:19 |
clarkb | for OpenDev we have a mixture of servers and can't simply rely on jammy everywhere. In CI this is less problematic | 21:19 |
dansmith | ack | 21:19 |
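A small hedged check for which of the situations clarkb describes a given node is in: apt-cache policy shows whether podman resolves to the distro archive (Jammy and newer), to a third-party source such as kubic, or to nothing at all:

```shell
# Shows the candidate version and the repository it would be installed from;
# on releases older than Jammy with no extra repos there is no candidate.
apt-cache policy podman
```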
clarkb | I've just updated the meeting agenda. Anything important missing? | 22:53 |
fungi | nothing i can think of | 22:57 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!