Friday, 2020-03-13

openstackgerrit	Merged opendev/system-config master: pip3: Add python3-distutils https://review.opendev.org/712818	00:29
openstackgerrit	Ian Wienand proposed opendev/system-config master: [dnm] test with plain nodes https://review.opendev.org/712819	01:40
openstackgerrit	Ian Wienand proposed opendev/system-config master: [dnm] test with plain nodes https://review.opendev.org/712819	01:43
openstackgerrit	Ian Wienand proposed opendev/system-config master: [dnm] test with plain nodes https://review.opendev.org/712819	01:44
openstackgerrit	Ian Wienand proposed openstack/project-config master: Move fedora-30 builds to nb01.opendev.org https://review.opendev.org/693120	01:55
openstackgerrit	Merged openstack/project-config master: Move fedora-30 builds to nb01.opendev.org https://review.opendev.org/693120	02:22
openstackgerrit	Ian Wienand proposed opendev/system-config master: nodepool-builder: add /opt/dib_cache https://review.opendev.org/712824	02:22
openstackgerrit	Merged openstack/diskimage-builder master: Remove hacking from requirements https://review.opendev.org/712778	05:27
openstackgerrit	Ian Wienand proposed openstack/project-config master: Revert "Move fedora-30 builds to nb01.opendev.org" https://review.opendev.org/712836	05:28
*** DSpider has joined #opendev		05:51
openstackgerrit	Merged openstack/project-config master: Revert "Move fedora-30 builds to nb01.opendev.org" https://review.opendev.org/712836	06:33
*** factor has joined #opendev		07:50
*** lpetrut has joined #opendev		10:12
*** factor has quit IRC		14:00
*** factor has joined #opendev		14:00
*** factor has quit IRC		14:03
*** factor has joined #opendev		14:04
*** factor has quit IRC		14:14
*** factor has joined #opendev		14:14
*** factor has quit IRC		14:15
openstackgerrit	Merged opendev/glean master: Fix a handful of bugs in config-drive processing https://review.opendev.org/703623	14:42
openstackgerrit	Merged openstack/project-config master: Add a new project and repository for tripleo-ipa https://review.opendev.org/711114	15:44
openstackgerrit	Sorin Sbarnea proposed zuul/zuul-jobs master: DNM: rebase unittests to base-minimal-test https://review.opendev.org/712985	15:45
*** lpetrut has quit IRC		15:48
openstackgerrit	Merged openstack/project-config master: New repo: devstack-plugin-open-cas https://review.opendev.org/711878	15:51
openstackgerrit	Merged openstack/project-config master: Add OpenInfra Labs IRC channels to bots https://review.opendev.org/712586	15:52
openstackgerrit	Merged openstack/project-config master: Remove neutron-tempest-dvr job from Neutron's dashboard https://review.opendev.org/712048	15:52
*** lpetrut has joined #opendev		16:14
openstackgerrit	Clark Boylan proposed openstack/project-config master: Install tox into a virtualenv on our images https://review.opendev.org/713017	16:29
*** openstackgerrit has quit IRC		16:31
*** openstackgerrit has joined #opendev		16:34
openstackgerrit	Lance Bragstad proposed openstack/project-config master: Add queues for tripleo-ipa project https://review.opendev.org/711115	16:34
openstackgerrit	Merged openstack/project-config master: Install tox into a virtualenv on our images https://review.opendev.org/713017	17:32
corvus	i'm seeing node failures for fedora-30 nodes	18:06
corvus	and i notice that there are some recent commits to project-config moving them around	18:06
corvus	is this known / is anyone working on it?	18:06
corvus	infra-root, config-core: ^ ?	18:06
clarkb	it is not known to me	18:07
clarkb	my guess is that the new builder managed to upload a fedora-30 image that does not work	18:07
fungi	ianw moved them to the new nb01.opendev.org late yesterday	18:07
clarkb	we can probably pause the builds on that builder then revert to the previous image?	18:08
fungi	i think they're already paused	18:08
corvus	i don't see builder logs on nb01.opendev	18:08
fungi	i believe he said he stopped the services on it and added it to the emergency disable list	18:08
corvus	i see build logs	18:08
corvus	but not the builder itself	18:08
corvus	i don't know how to tell if the builder is running	18:09
corvus	we don't have docker on that machine, so i guess we're using podman	18:09
clarkb	corvus: `docker ps -a` or podman equivalent if using podman	18:09
corvus	clarkb: which user should i run podman as?	18:10
Shrews	i thought the new builder was reverted? https://review.opendev.org/#/c/712836/	18:10
corvus	(there's no global podman thingy like docker)	18:10
corvus	Shrews: apparently that's not a revert, that's a rework	18:10
clarkb	looks like root	18:10
corvus	Shrews: at least that's what the commit message says?	18:10
clarkb	(because we run podman-compose as root in system-config/playbooks/roles/nodepool-builder/tasks/main.yaml)	18:11
corvus	there are now 2 podman processes, one running as corvus, one as nodepool, probably because i ran "podman ls"	18:11
clarkb	but ya rereading it should be stopped because of nodepool's "delete everything I don't know about" behavior	18:12
clarkb	in which case I think we can remove the newer fedora-30 image and fall back to the older one?	18:13
corvus	as far as i can tell, nodepool-builder is not running on this host	18:13
corvus	do we have a plan for running the nodepool cli on the docker hosts?	18:13
clarkb	corvus: I think a lot of that is still in the learning phase	18:13
mordred	corvus: I agree that nodepool-builder does not seem to be running on the host	18:13
clarkb	but we do produce a nodepool image to run commands	18:14
fungi	ianw mentioned in scrollback (maybe in #openstack-infra) that he stopped it	18:14
clarkb	(so could probably add that to our setup)	18:14
mordred	perhaps add a convenience script so that just running "nodepool" works and does the right thing with the image	18:14
mordred	like in that openstackclient patch I did in system-config a little while ago	18:15
corvus	that would be swell	18:15
corvus	i think having the nodepool cli handy before we break things would be good	18:15
mordred	++	18:15
fungi	yeah, #openstack-infra at 02:44z	18:15
corvus	so i guess we're done with nb01.opendev for now; i'll log into a different host	18:15
corvus	"nodepool image-list | grep -i fedora" shows nothing	18:16
fungi	he stopped the builder because it was deleting all the images id didn't know about	18:16
fungi	s/id/it/	18:16
corvus	so i'm guessing all the f30 images were deleted, and the current state of the half-revert is that the builder config for f30 is on a host which is down	18:16
corvus	so we should continue to complete the revert so that the f30 builder config moves back to the nb*.openstack hosts ?	18:17
clarkb	corvus: we can't build fedora30 on those hosts though	18:17
clarkb	I think we have to roll forward?	18:17
corvus	how did we ever build f30?	18:17
fungi	https://review.opendev.org/712836 should have put them back	18:17
clarkb	corvus: ~5 months ago fedora rpms were made with a compression tool that was available on ubuntu xenial then at some point they switched off that aiui	18:18
corvus	fungi: line 280: pause: true	18:18
fungi	oh, it was put back with pause: true	18:18
fungi	yep, just spotted that myself	18:18
clarkb	at that point we paused the builds then ianw has spent the intervening time trying to come up with a system to build them (and this is the result)	18:18
corvus	then having that system delete the irreplaceable images is especially unfortunate	18:19
clarkb	yes, I think this is an aspect of nodepool that we should probably think about more. (running disjoint builders is likely desirable to accommodate different builder needs, architecture, operating system, whatever)	18:21
clarkb	I noted on IRC last night that I think the way nodepool wants you to express this is to always list all images, then pause them where they should not build	18:21
mordred	I don't suppose we accidentally still have the old fedora30 qcow on any of the nodes right?	18:21
clarkb	mordred: probably not if nodepool deleted them	18:21
fungi	it tries to clean them up aggressively	18:21
clarkb	(its pretty good about cleaning those up)	18:21
mordred	yeah	18:22
Shrews	i suspect the hostname change is what triggered the cleanup (cc: ianw)	18:22
clarkb	Shrews: there was no hostname change	18:22
clarkb	Shrews: this is a new additive host	18:22
Shrews	clarkb: nb01.opendev.org vs. nb01.openstack.org	18:23
Shrews	right?	18:23
clarkb	Shrews: yes that wasn't a change	18:23
clarkb	both are/were expected to run side by side	18:23
Shrews	clarkb: nodepool stores the hostname of the builder, so yes, as far as nodepool is concerned, it was a change	18:23
clarkb	Shrews: I am trying to clarify that we didn't delete or remove nb01.openstack.org	18:24
clarkb	we added nb01.opendev.org to the set of existing servers	18:24
clarkb	(I understand why the images were deleted)	18:24
Shrews	clarkb: i understand that	18:24
Shrews	i don't think that invalidates my statement	18:25
clarkb	Shrews: I read it as we changed nb01.openstack.org to nb01.opendev.org which did not happen	18:25
clarkb	we simply added nb01.opendev.org	18:25
Shrews	clarkb: didn't mean that. i meant "host ownership of a build" changed	18:26
clarkb	Shrews: ah. I don't think that is fully it either. Because nb01.opendev.org apparently tried to delete all of nb01.openstack.org's images	18:26
clarkb	and nb01.openstack.org tried to delete nb01.opendev.org's f30 image	18:27
clarkb	maybe that is what you mean by host ownership? Basically they each decided the others disjoint set was invalid	18:27
clarkb	(which is why I suggested that listing all images then pausing where we don't want to run it would be a way to express this to nodepool)	18:28
corvus	they are a cluster, and are all supposed to have the same configuration. the thing that we're doing with nb03 only works because it has a disjoint set of providers.	18:28
corvus	(it is, in effect, a second cluster of one)	18:29
corvus	but... about the future...	18:29
corvus	this is affecting at least one project (nodepool). what are our options?	18:29
clarkb	I expect that if we fixed the nodepool configs (possibly via the pause idea or just letting the new server build all the images) that we'd be able to build and upload a fedora30 image	18:30
clarkb	basically roll forward	18:30
clarkb	other ideas: manually upload a fedora-30 image in some set of clouds and use that in our providers	18:31
clarkb	(and the probably bad option) stop testing on fedora	18:31
fungi	i think either press forward trying to get a nodepool builder working on a newer distro, or try to get the necessary decompression tooling backported to ubuntu-xenial so the existing builders can unpack newer rpms	18:31
fungi	and yes, also possibly someone build an image locally as a stopgap	18:32
clarkb	(I've secretly been hoping that centos rolling distro can slip into the spot fedora fills, but thats an entirely different set of things to sort out and should maybe be ignored for now)	18:32
fungi	what are the details on the rpm decompression problem? do we have that documented somewhere?	18:32
clarkb	I'm sure ianw has it in a story. Let me see if I can find it	18:33
mordred	wait- the container deployment is the thing that solves the rpm decompression problem	18:33
clarkb	mordred: yes	18:33
mordred	I don't think that's a thing that we need to go back to try to solve, is it?	18:33
clarkb	mordred: no	18:34
clarkb	(other than the container deployment had a sad)	18:34
mordred	right.	18:34
clarkb	but I think we can make it not have a sad	18:34
mordred	I agree	18:34
clarkb	(the put all images in the config and then pause where we don't want them to build idea)	18:34
mordred	I just wanted to be clear that we didn't need to go back to a more complicated drawing board	18:34
mordred	clarkb: ++	18:34
corvus	yeah, i think rolling forward with container using the new config file strategy is probably easiest (depending on how easy manually uploading an image is)	18:35
corvus	i can help with that after lunch	18:35
fungi	but yeah, i expect that if there were an easy way to backport a decompression solution for new rpms to xenial, ianw would already have done that	18:35
AJaeger	corvus: the move around was reverted	18:35
corvus	AJaeger: partially reverted	18:35
fungi	AJaeger: it was, but only after fedora images we've ceased to be able to build were accidentally deleted by it	18:36
corvus	AJaeger: it's in backscroll, but tldr: nothing is building f30 images now	18:36
corvus	(and there are no f30 images)	18:36
mordred	how about I take a stab at the config file change	18:36
AJaeger	corvus: see it now	18:36
clarkb	the paging buttons in the storyboard story search page don't work	18:36
clarkb	oh its 1 to 6 stories not 1 to 6 pages	18:37
*** tristanC has joined #opendev		18:37
clarkb	mordred: wfm (I'll keep trying to dig the story details out of storyboard)	18:37
corvus	i gotta run, i'll be back after lunch to help.	18:37
openstackgerrit	Monty Taylor proposed openstack/project-config master: Add fedora-30 to nb01.opendev.org https://review.opendev.org/713047	18:40
mordred	clarkb, corvus, fungi: ^^	18:40
mordred	I believe that is what we're saying we want on nb01 yeah?	18:41
mordred	Shrews:	18:41
clarkb	mordred: yes I think that would allow us to build without associated deletes. If shrews can confirm that would be good	18:41
fungi	that'll have to be manually installed for the time being, right?	18:42
Shrews	i'm still not fully understanding why one is trying to delete the other's images. i can't say for sure if that's what we want until i know that	18:42
clarkb	fungi: no I think we are still ansibling it	18:43
clarkb	fungi: we've just stopped giving the service a config to do any work with	18:43
fungi	clarkb: did it get taken back out of the emergency disable list?	18:43
clarkb	oh if its in the emergency disable then ya we have to remove it from there or manually add it	18:43
clarkb	maybe manually adding it is the safest thing	18:44
fungi	i haven't checked the disable list, just saw ianw say that he had added it there	18:44
clarkb	ya its in there	18:45
clarkb	so ya I think once we are comfortable with that change (I am but others should definitely double check me on it) we can manually apply it and re up the podman-compose config on nb01.opendev	18:45
clarkb	then monitor it for deletions as well as building f30	18:45
clarkb	another option while we are brainstorming is to reduce that provider list in the nb01.opendev.org config to a single cloud to reduce blast radius. Have it build the image and upload it, then add the other providers once we are happy with it	18:51
clarkb	I'm worried that will trigger some other cluster mismatch deletion behavior though. I think the proposed config from mordred is likely safest	18:52
openstackgerrit	Monty Taylor proposed opendev/system-config master: Add a nodepool helper script into /usr/local https://review.opendev.org/713050	18:52
mordred	how's that look for a helper script?	18:52
clarkb	mordred: I think we can probably trim the mount list since nodepool cli commands aren't running dib builds or logging to disk.	18:53
clarkb	mordred: we should only need to mount in the cloud config and the nodepool config I think	18:54
clarkb	I'm going to find lunch now too	18:54
clarkb	(I think both changes are good as proposed even if we can do cleanup on the helper script)	18:55
openstackgerrit	Monty Taylor proposed opendev/system-config master: Add a nodepool helper script into /usr/local https://review.opendev.org/713050	18:57
mordred	clarkb: good point	18:57
openstackgerrit	Monty Taylor proposed opendev/system-config master: Add a nodepool helper script into /usr/local https://review.opendev.org/713050	18:57
*** lpetrut has quit IRC		19:05
fungi	mordred: is that going to have to be run via sudo -u nodepool so that it has access to bindmount the config?	19:23
openstackgerrit	Andreas Jaeger proposed zuul/zuul-jobs master: Use a zuul_* and add an .ansible-lint file https://review.opendev.org/712547	19:27
AJaeger	clarkb: the goaccess reports run fine today, see https://7b8a363e2631f871420c-9d822a96c58fccb739d55f79e396b06d.ssl.cf1.rackcdn.com/periodic/opendev.org/opendev/system-config/master/docs-openstack-goaccess-report/44c9931/docs.openstack.org_goaccess_report.html	19:29
clarkb	AJaeger: that is great to hear	19:29
clarkb	AJaeger: and the reports have at least as much info as the old one right?	19:30
AJaeger	clarkb: it gives the data - but lots of other stuff as well ;) Need to dig into how to just get those URLs.	19:35
AJaeger	clarkb: so, I'm fine with going forward with it	19:35
* AJaeger calls it a day and waves good night		19:36
* corvus is back and catching up		19:36
clarkb	corvus: I think if people agree mordred's changes look good we can proceed to apply the config update to new nb01 manually and manually up the service there	19:37
clarkb	I'm still about 20 minutes from being able to help with that	19:38
corvus	clarkb, mordred: +3	19:40
clarkb	corvus: note the server is in the emergency file	19:40
clarkb	if we want ansible to update it instead of manual we should remove it	19:40
corvus	clarkb: yeah i saw	19:40
corvus	Shrews: are you still looking into that? (don't mind the +3, since it's not getting applied until we're ready)	19:41
corvus	I think we should add warnings to both files that they need to be kept in sync during the transition period.	19:41
Shrews	corvus: "that" being why images were being deleted? if so, no. i think i need ianw to walk me through it.	19:43
openstackgerrit	Merged openstack/project-config master: Add fedora-30 to nb01.opendev.org https://review.opendev.org/713047	19:47
corvus	Shrews: i think you are right to be concerned. the builder id's present in 'nodepool dib-image-list' are short hostnames	19:50
corvus	Shrews: ie "nb01"	19:50
Shrews	we left the hostname comparisons in for compatibility with older nodepool (each should have a unique id now). it might be time to just remove that	19:51
corvus	ugh. i just tried to run "podman run ..." to see if i could test the behavior on nb01.opendev.org and ran into gshadow permissions	19:52
corvus	i thought we were removing that?	19:52
Shrews	i don't think there were plans to do so	19:53
clarkb	I'm not aware of gshadow issues (is that a podman thing?)	19:53
corvus	no, it's a we're doing something in the nodepool image we shouldn't be doing thing	19:53
* fungi assumed something related to the system shadow group file (/etc/gshadow)		19:54
corvus	i thought there was a patch to nodepool to revert that out, after we rejected the corresponding patch to zuul	19:54
corvus	anyway, i also can't run it as root, for a different reason (failed to find plugin "loopback")	19:55
corvus	so, i can't really predict the behavior on nb01.opendev.org because i can't get a python prompt in the production environment to test :/	19:55
mordred	corvus: I can confirm we have not reverted that out of nodepool	19:56
mordred	but I think we should	19:56
corvus	mordred: ack; i'll put in on my backlog of things to do when we can merge nodepool changes again	19:56
corvus	mordred: i'm less confident about the config file change now	19:56
corvus	(i'm also less confident we can actually run nodepool)	19:57
mordred	corvus: we can make an image by hand and upload it to a personal dockerhub location then try running that manually	19:57
mordred	to check	19:57
mordred	corvus: I don't see the revert patch in the system - want me to make one real quick?	19:57
corvus	mordred: let's not worry about that for now	19:58
corvus	the gshadow thing is preventing me from running nodepool as a user	19:58
corvus	but we run it as root	19:58
corvus	the root error is different and i don't understand it	19:58
mordred	corvus: ok. let me see if the root error makes sense to me	19:59
clarkb	are you running it with all the mounts? if not perhaps that is causing trouble?	19:59
corvus	no mounts	19:59
openstackgerrit	James E. Blair proposed openstack/project-config master: Revert "Add fedora-30 to nb01.opendev.org" https://review.opendev.org/713058	19:59
corvus	mordred: ^ that's a revert of your change because of the nb01/nb01 issue	20:00
mordred	corvus: +#	20:00
mordred	gah	20:00
Shrews	https://review.opendev.org/713057 Stop comparing hostnames to determine ownership	20:00
corvus	so to summarize, i have 2 concerns: 1) some nodepool behavior is determined by the hostname, and we will have 2 hosts with the same name. 2) i wasn't able to run the nodepool container as root with "podman run" which weirds me out	20:01
corvus	Shrews: it looks like that "or" is going to cause us to hit that case when we currently don't want to.	20:02
mordred	corvus: we need --net=host	20:02
mordred	corvus: I have a script in /root	20:02
mordred	called "n"	20:02
corvus	Shrews: so i think we either need to change the nb01.opendev.org hostname, or merge your change first	20:02
mordred	that you can use	20:02
openstackgerrit	Monty Taylor proposed opendev/system-config master: Add a nodepool helper script into /usr/local https://review.opendev.org/713050	20:03
fungi	or turn down nb01.openstack.org, though i suspect we have too much build load to do that before the new builder is in operation	20:03
corvus	mordred: thanks. that confirms that the nb01.opendev environment reports 'nb01' as the hostname	20:04
corvus	fungi: yes, that's an option. i'm unsure about the build load, let's check that out.	20:04
fungi	i believe it comes down to how much usable disk space there is. the more builders we have, the less disk utilization on each	20:04
clarkb	its a lot better now than it was after cleaning up old fedoras and precise	20:05
fungi	nb02 is 65% used on /opt, nb01 34%	20:05
fungi	so ~100% if we turned off nb01, i expect?	20:05
clarkb	fungi: ya that math should be roughly correct	20:06
fungi	(i mean, it wouldn't be immediate, it would take some time to hit that, but still)	20:06
clarkb	and in the meantime we'd possibly delete all those images?	20:06
fungi	and yeah, the sudden deletion of fedora images is presumably what dropped the disk usage on nb01	20:06
clarkb	spinning up nb04.opendev.org wouldnt be too bad	20:07
corvus	that's probably the safest thing	20:07
clarkb	basically add mordred change back but change the name	20:07
clarkb	and then update groups and stuff as necessary	20:07
fungi	makes sense to me, though i'm about to be tending a hot wok for the next little while so will be less help	20:07
corvus	there's also the possibility we could change the hostname in the podman-compose file without building a new host. but that could also inspire madness.	20:08
clarkb	corvus: I like that for its simplicity but have no idea how reliable it would be	20:08
fungi	right, i thought about that, convince nodepool its hostname is different than what the system knows itself as	20:08
fungi	but i agree that's icky	20:09
clarkb	or land shrews' change	20:09
corvus	can't land nodepool changes	20:09
fungi	which i've already +2'd	20:09
fungi	but right, catch-22	20:09
clarkb	because we need fedora-30?	20:09
corvus	yep. it's used in a gating job	20:09
corvus	i think it would be safe to make it non-voting for Shrews change though, if we wanted to go that way	20:10
Shrews	could make that job nv in my change, then re-enable?	20:10
corvus	it's just going to be a thing.	20:10
Shrews	jinx	20:10
corvus	(there are some changes though that i definitely don't want to land without it)	20:10
corvus	(including the one to remove the gshadow thing)	20:10
mordred	++	20:11
corvus	adding "--hostname nb04" to podman run works as expected	20:11
clarkb	crazy udea tine lets do ^ to tide us over then on monday we can build it right	20:12
clarkb	*crazy idea time	20:12
mordred	so - are we thinking do that - get a f30 image, be able to land changes - fix the underlying stuff	20:12
corvus	so i think changing the podman-compose file as a temporary measure would be workable. but the longer that goes on, the more cognitive dissonance we will experience.	20:12
mordred	then stop nb01 and rename it back	20:12
mordred	yeah	20:12
mordred	it seems like a thing that should only be in place for exactly as long as it takes for us to get a f30 node	20:12
clarkb	ya I think we should commit to replacing new nb01.opendev with nb04 monday	20:12
corvus	this is all assuming we can build an f30 image after not doing it for 6 months :)	20:13
clarkb	(I can do that)	20:13
mordred	corvus: wcpgw?	20:13
corvus	how about we call the innerhostname "nb01opendev" ?	20:13
clarkb	I think the basics of f30 building is gated in dib	20:13
fungi	nb01forealz	20:14
clarkb	corvus: ++ that will be less confusing	20:14
corvus	then it won't conflict with its past or future replacement	20:14
fungi	yeah, wfm	20:14
mordred	++	20:14
mordred	just call it george	20:14
Shrews	can we just suspend all testing over covid 19 concerns?	20:15
corvus	Shrews: but we promised anyone who wants fedora30 tests could get them	20:15
mordred	corvus: by anyone we only meant Tom Hanks	20:16
clarkb	and nba players	20:16
fungi	it would probably be okay if the tests were only 50% accurate	20:16
mordred	clarkb: Rudy Gobert and Tom Hanks are the same person	20:16
mordred	clarkb: Tom is just that good of an actor you never noticed	20:16
clarkb	the french american sweetheart	20:16
corvus	i think the way to do the compose file change is just to keep nb01.opendev in emergency and manually apply that and the new rev of mordred's change	20:17
clarkb	corvus: wfm	20:17
mordred	++	20:17
fungi	seems reasonable	20:18
corvus	i'll work on that now	20:19
openstackgerrit	Merged openstack/project-config master: Revert "Add fedora-30 to nb01.opendev.org" https://review.opendev.org/713058	20:21
corvus	mordred, clarkb, fungi, Shrews: any of you who are available, want to check out the state on nb01.opendev.org? i modified /etc/nodepool-builder-compose/docker-compose.yaml and checkout out mordred's change 713047 to update /etc/nodepool/nodepool.yaml	20:23
clarkb	corvus: looking	20:23
corvus	s/checkout out/checked out/	20:23
clarkb	both lgtm (and for others looking /opt/project-config was updated with mordreds change and /etc/nodepool/nodepool.yaml is a symlink into that repo)	20:25
corvus	anyone else want to weigh in, or should we run that now?	20:29
mordred	corvus: lgtm	20:30
corvus	so now we should: cd /etc/nodepool-builder-compose; podman-compose up ?	20:30
corvus	(as root)	20:31
clarkb	yes, I'll start a tail -f on builder-debug.log here and watch it	20:31
corvus	oh yeah, that was the other thing	20:31
corvus	no builder-debug.log	20:31
clarkb	oh that file doesn't exist right now	20:31
corvus	hrm. we do have /var/log/nodepool bind mounted	20:32
clarkb	the permissions and mounts are such that it should be logging there	20:32
clarkb	and /var/log/nodepool/builds/ has content	20:32
corvus	but we're probably running the default run in foreground and log to stderr thing	20:32
clarkb	in which case podman logs $containername would work?	20:32
corvus	yes: podman logs nodepool-builder-compose_nodepool-builder_1	20:33
clarkb	as root	20:33
corvus	yep. i feel like this is probably not how we want to run it in the long run.	20:33
clarkb	I'm ready to run that command (as root) and watch it once it is going	20:34
corvus	okay, i will "up" it now	20:34
corvus	http://paste.openstack.org/show/790683/	20:35
corvus	apparently podman-compose does not work the same as docker-compose	20:35
corvus	also, it's just sitting there now, it hasn't returned from that command invocation	20:35
clarkb	scared me for a second that it was trying to delete things again in the logs but those are from 18 hours ago	20:35
corvus	so... i guess i will ^C ?	20:36
clarkb	ya it doesn't seem to be doing anything from what I am able to see	20:36
mordred	hrm	20:36
corvus	i will run podman-compose down, then podman-compose up.	20:36
corvus	(docker-compose "up" automatically recreates containers if needed)	20:36
corvus	it's running	20:37
mordred	yay	20:37
clarkb	mkdir: cannot create directory '/opt/dib_cache': Permission denied	20:38
clarkb	that is why the build is failing	20:38
clarkb	I want to say I saw something about this one moment please	20:38
corvus	it wants to delete some vexxhost images	20:38
mordred	that seems like a really bad choice	20:39
clarkb	corvus: mordred I think those are old logs double check timestamps	20:39
corvus	no	20:39
mordred	oh - wait - vexxhost - those could be f30 vexxhost?	20:39
clarkb	oh no its doing it again	20:39
clarkb	probably need to stop it then	20:39
corvus	i've stopped it	20:39
corvus	but i didn't see it succeed at deleting anything it shouldn't	20:39
Shrews	does it still think its name is nb01 by chance?	20:39
corvus	can someone point it out to me	20:39
clarkb	corvus: Shrews one sec I think I know what is happening (and its ok for us)	20:40
clarkb	we leak images in vexxhost	20:40
clarkb	when that happens its fair game for any nodepool builder to delete them from the cloud side	20:40
clarkb	I think it has detected this case and is helpfully trying to delete a leaked image in vexxhost. But we should double check before turning it back on	20:40
clarkb	also https://review.opendev.org/#/c/712824/ is the proposed fix for the dib cache thing	20:40
corvus	so all our builders just log that error every few minutes?	20:40
clarkb	corvus: ya I think so	20:41
clarkb	I have a paste from the other day where I dug into this trying to find it now so I can cross reference ids	20:41
corvus	openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://image-sjc1.vexxhost.us/v2/images/a0b6ea3e-6c39-41b6-8243-c0c9c6d027c8, Image a0b6ea3e-6c39-41b6-8243-c0c9c6d027c8 could not be deleted because it is in use: The image cannot be deleted because it is in use through the backend store outside of Glance.: 409 Conflict	20:41
clarkb	http://paste.openstack.org/show/790497/ hrm those ids don't match so either they are new leaks or it is bugging out in the scary way	20:42
clarkb	http://paste.openstack.org/show/790684/ I think that shows we have new leaks	20:43
clarkb	and those ids seem to match what it was deleting on new nb01 (I think that means we are ok)	20:43
mordred	sigh	20:44
corvus	clarkb: er, so your final word is, these leaks are expected?	20:44
mordred	clarkb: do we need to clear bfv's again?	20:44
clarkb	corvus: ya basically boot from volume in vexxhost sometimes leaks volumes which prevents us from deleting the image. Once the image is in a "deleting" state in zk any builder is free to delete it in the cloud	20:45
clarkb	corvus: the transition from active to deleting only happens on the "owner" builder though so we avoid races there by having it try first	20:45
corvus	yeah, but your pastes i don't understand	20:45
clarkb	(that also allows it to clean up its local disk)	20:45
corvus	clarkb: i think you said "ids don't match which is bad" "something something nevermind it's good we're safe"	20:46
clarkb	corvus: you can ignore the first paste at this point I think. What the second paste shows is we have 6 images that are all failing to delete in vexxhost and they are in a deleting state in zk.	20:46
corvus	so i just want to make sure you still think those particular leaks are harmless and don't represent some new bad thing that nb01.opendev is doing	20:46
corvus	ok, cool.	20:46
clarkb	corvus: what the builder logs show from what you just ran is that those images are not deleting because there are leaked volumes in vexxhost preventing the image from deleting	20:46
corvus	clarkb, mordred: so what's the deal with cleaning this up?	20:46
clarkb	yup I do think that because the ids in the builder logs match the ids in my second paste	20:47
clarkb	corvus: usually we run volume list and search for volumes which have leaked then try to delete them again (there are heuristics for this and mordred has a tool but its not perfect)	20:47
corvus	so it's impractical to have nodepool fix this?	20:47
clarkb	corvus: do you think we should try cleaning that up first so that we can have clean logs in the builder when we try again with a dib_cache dir?	20:47
clarkb	corvus: yes I would say it is impractical to have nodepool fix it	20:47
corvus	clarkb: under the circumstances, i think clean logs is important	20:48
clarkb	nodepool could probably do a subset of cases though	20:48
mordred	yes - there are some cases that are safe	20:48
clarkb	but there is another subset where the server itself can't delete and it has the volume attached and that prevents the volume from deleting	20:48
mordred	but not all	20:48
clarkb	and we've seen that require intervention from the cloud itself	20:48
mordred	yeah	20:48
clarkb	corvus: ok I'm going to look into cleaning these up manually	20:48
corvus	clarkb: cool, i'll modify the compose file with the cache fix	20:49
corvus	oh it's a perm thing	20:49
corvus	whatever	20:49
corvus	i will do what 712824 does :)	20:49
clarkb	for anyone else following along there are a bunch of unattached volumes in vexxhost (these are likely the source of the leak)	20:49
clarkb	I'm going to spot check them to ensure they don't belong to something important (its a test node tenant so shouldn't) then delete them	20:50
clarkb	mordred: ^ your tool might do it quicker than me though if you want to queue up running it ?	20:50
mordred	sure	20:50
clarkb	yup spot checking ~5 of them they are all from just after 2300UTC on march 11	20:52
clarkb	and they are boot from volume volumes with image ids that match our unhappy images in nodepool	20:52
clarkb	what should happen is we delete them (using mordred's tool should work in this case), then old nb0X will delete them from the cloud as there is no volume keeping them around	20:52
clarkb	we can then confirm with nodepool image-list as per my paste above and then try again with the new builder	20:53
openstackgerrit	James E. Blair proposed opendev/system-config master: nodepool-builder: add /opt/dib_cache https://review.opendev.org/712824	20:53
corvus	clarkb: ^ the directory had been created; but it was not bind-mounted, so i have added it to docker-compose	20:53
corvus	(i'm guessing ianw may have manually run the mkdir?)	20:53
mordred	I'm going to run the clean tool yeah?	20:53
clarkb	oh possibly	20:53
clarkb	mordred: yes I think it is safe to do so since the cleanup tool checks for unattached volumes >24 hours old iirc	20:54
clarkb	mordred: and I've confirmed these seem to be in that state	20:54
mordred	that is correct	20:54
mordred	clarkb: can you re-check your list?	20:54
clarkb	mordred: they seem to still be there	20:55
mordred	clarkb: yeah. I didn't get prints. lemme see what's up	20:55
clarkb	should I start manually deleting?	20:55
mordred	one sec	20:56
clarkb	ok. I've edited a file with a list of them and can run it through xargs openstack volume delete if we want another option	20:57
clarkb	will wait	20:57
mordred	clarkb: yeah. I don't know why it's not cleaning them	20:58
mordred	go ahead	20:58
clarkb	k	20:58
clarkb	its doing them serially and taking a couple seconds each so may be a minute or two	21:00
clarkb	but it is going	21:00
mordred	cool	21:00
mordred	clarkb: oh - I think my script didn't do it because they're already unattached volumes	21:00
clarkb	ah	21:00
mordred	not volumes reporting being attached bogusly	21:01
mordred	so - yay script doing what it's supposed to!	21:01
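The manual selection clarkb describes above (unattached boot-from-volume volumes whose image ids match the images stuck in "deleting") can be sketched roughly as follows. This is only an illustration: the dict fields (`id`, `status`, `attachments`, `image_id`, `created_at`) are hypothetical stand-ins for whatever a volume listing returns, and the 24-hour threshold comes from clarkb's recollection of the cleanup tool, not from the tool's source.

```python
from datetime import datetime, timedelta, timezone


def select_leaked_volumes(volumes, stuck_image_ids, now,
                          min_age=timedelta(hours=24)):
    """Return ids of unattached volumes created from images stuck in deleting.

    `volumes` is a list of dicts with hypothetical keys: id, status,
    attachments, image_id, created_at (timezone-aware datetime).
    """
    leaked = []
    for vol in volumes:
        # "available" with no attachments is treated here as unattached.
        unattached = vol["status"] == "available" and not vol["attachments"]
        from_stuck_image = vol["image_id"] in stuck_image_ids
        old_enough = now - vol["created_at"] >= min_age
        if unattached and from_stuck_image and old_enough:
            leaked.append(vol["id"])
    return leaked
```

A volume that still reports an attachment, like the opensuse-15 one discussed later, would be skipped by this filter and needs separate handling.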
clarkb	down to 4 images in zk now	21:01
clarkb	from 6	21:01
clarkb	there is one that is much older than the others that we might have less luck cleaning	21:01
clarkb	but if we can get it down to one, check logs for a single uuid is better than 6	21:01
clarkb	down to 2 now	21:03
clarkb	ok I don't think that bionic image will delete because its used by 3 volumes that refuse to delete	21:05
clarkb	the opensuse image isn't deleting because volume list claims it is in use by a server volume (not unattached)	21:05
clarkb	| 303ed29e-3c06-4738-a0bd-e2f0eb50991c | | in-use | 80 | Attached to opensuse-15-vexxhost-sjc1-0014437332 on /dev/vda |	21:06
clarkb	I'm checking to see if that is a held node	21:06
clarkb	| 0014437332 | vexxhost-sjc1 | opensuse-15 | d2d73e84-d988-4605-a596-b0ddef9b2b23 | 38.108.68.90 | 2604:e100:3:0:f816:3eff:fe52:b724 | deleting | 00:00:02:34 | locked |	21:07
corvus	that seems to be a recently deleting node...	21:07
clarkb	ya we can probably be patient with it assuming that server deletes	21:07
clarkb	if it doesn't delete then it may be in a similar situation to the bionic images where it was attached to servers that refuse to delete which causes a chain reaction of undeletable resources	21:07
corvus	it is an opensuse-15 node, that does seem likely	21:08
clarkb	c5b3b55a-4c74-4d41-998c-265342ab3afc and c10176f9-56a3-4749-a5dc-44ab56ec3771 are the images that are safe for new builder to delete if it comes to that	21:08
corvus	well, how about we go ahead and fire it up again	21:08
clarkb	I'm ok with that	21:08
corvus	hopefully we can deal with those 2 errors :/	21:08
corvus	ok, here goes	21:09
corvus	failing the build again	21:09
clarkb	2020-03-13 21:09:53.958 | mount: /opt/dib_tmp/dib_build.s3OQSzgg/mnt/proc: permission denied.	21:10
clarkb	I think fs perms are ok, is that a caps issue with procfs?	21:11
clarkb	I wonder if the ci of this is using docker instead of podman and we are hitting behavior differences there	21:12
corvus	what kind of testing has this undergone?	21:12
corvus	what ci?	21:12
clarkb	corvus: there is a full on integration job similar to the older nodepool job that runs it outside of a container. I'm trying to pull it up now	21:12
corvus	right, i'm curious where "run a nodepool-builder which runs dib inside a podman container" has been tested	21:13
corvus	(or even in a docker container)	21:13
clarkb	I know there was something; it's what we set up the sibling container stuff for so we could use glean and stuff from source in containers	21:15
clarkb	now just trying to sort out where it ended up	21:15
clarkb	(but that may have been docker not podman)	21:15
mordred	yeah. may have been	21:15
mordred	in fact - probably was	21:15
mordred	so maybe this is a good reason to use docker not podman - at least until such a time as we have podman-based gate testing	21:16
corvus	we should "test like production"	21:16
clarkb	nodepool-functional-container*	21:17
clarkb	https://zuul.opendev.org/t/zuul/build/459f34fe1c93447c8353fe43a88e81b6 is a semi recent run	21:18
clarkb	and ya it is using docker	21:18
clarkb	https://review.opendev.org/#/c/698818/6/playbooks/nodepool-functional-container-openstack/templates/docker-compose.yaml.j2 shows the compose file	21:19
clarkb	ok more investigating has been done. that mnt/proc path may be owned by root	21:20
clarkb	its readable by not root though	21:20
clarkb	but dib is trying to mount a thing there	21:21
clarkb	| + /opt/dib_tmp/dib_build.w2ztziu9/hooks/root.d/08-yum-chroot:main:239 : sudo mount -t proc none /opt/dib_tmp/dib_build.w2ztziu9/mnt/proc	21:21
clarkb	and that will require root which it probably doesn't have?	21:22
clarkb	will docker run the container processes as root maybe?	21:22
clarkb	(and podman does not)	21:22
fungi	okay, stir fry has been produced, consumed and then cleaned up. skimming to see where i can be of help	21:22
mordred	clarkb: the container as it is now is supposed to be running as nodepool and that nodepool is supposed to have sudo access	21:23
corvus	clarkb: no, the nodepool dockerfile says run as the nodepool user	21:23
corvus	mordred: i don't see evidence of sudo access	21:23
mordred	corvus: there's a line adding a sudoers file that I deleted in the revert patch ... one sec	21:23
clarkb	drwxr-xr-x 2 root root 4096 Mar 13 21:21 proc	21:23
clarkb	that is what it looks like from outside of the container	21:23
corvus	mordred: when i run "sudo" in "podman run" i get a password prompt	21:24
mordred	https://opendev.org/zuul/nodepool/src/branch/master/Dockerfile#L55-L56	21:24
mordred	corvus: that's in the nodepool-builder image specifically	21:24
corvus	mordred: yeah, that's what i'm running	21:25
mordred	corvus: so we should maybe update that script to use that and not nodepool-builder	21:25
mordred	yeah?	21:25
mordred	awesome	21:25
corvus	mordred: oh, nope, sorry	21:25
corvus	mordred: nm, sudo works	21:25
corvus	i don't understand dib, so i'm not the person to take point on fixing this.	21:26
clarkb	then I'm stumped why this doesn't work because sudo is used in the element (and the dirs on the line above created with sudo are created)	21:26
clarkb	corvus: basically its trying to create a procfs because tools need it	21:26
clarkb	it does that in /opt/dib_tmp/$build_dir/proc so that it can be chrooted into and separate from the host's procfs aiui	21:27
clarkb	http://paste.openstack.org/show/790686/ we see sudo succeed at creating the dirs on the first line	21:28
clarkb	but then line 3 fails due to "mount: /opt/dib_tmp/dib_build.JnvNFPzW/mnt/proc: permission denied."	21:28
mordred	we're setting privileged: true in the compose file - so I'd expect it to not be a procfs thing	21:28
clarkb	I'm guessing this is a capabilities thing since fs stuff looks fine	21:28
clarkb	mordred: maybe privileged means less privileged on podman than on docker?	21:28
corvus	we're throwing this host away anyway, right? should we just apt-get install docker and see if it works?	21:29
clarkb	corvus: ya we could try that I suppose.	21:29
clarkb	(since testing says that should work)	21:30
corvus	mordred: ?	21:30
mordred	yeah	21:30
corvus	k, will do	21:30
mordred	I think bionic has new enough that we don't need to bother doing the upstream repo	21:30
mnaser	i am just jumping in this and not reading scrollback	21:30
mnaser	(yet)	21:30
mnaser	but we use cri-o in prod with k8s for nodepool builder	21:30
mnaser	and our image builds have been ok, if that helps signal anything.	21:31
mordred	cool. so it may just be a settings thing	21:31
mnaser	i _really_ remember running to the similar issue for that mount thing	21:31
corvus	yeah that does seem to suggest there should be a route to getting it working with podman	21:31
* mnaser looks at the helm charts		21:31
mordred	yeah. it would be more fun to figure out if we weren't unexpectedly doing so when it's more important :)	21:32
mnaser	https://opendev.org/zuul/zuul-helm/src/branch/master/charts/nodepool/templates/builder/statefulset.yaml	21:32
mnaser	ok so	21:32
mnaser	i mount /dev into the actual container. i don't remember why, a note there would be nice.	21:32
mnaser	i think its for losetup things	21:32
corvus	we do not have that in our compose file	21:32
mordred	corvus: do you think it's quicker to try adding /dev to the volume mount list real quick?	21:33
mnaser	that might be something that fails much later on though	21:33
clarkb	corvus: its also not in the testing compose file	21:33
mnaser	but .. i remember very much needing it	21:33
corvus	either one at this point :) i have installed docker	21:33
mordred	corvus: you're driving - I defer to you on which you want to try first	21:33
corvus	i'm happy to throw /dev in there, see if it works with podman, then try docker without /dev, then try docker with /dev	21:33
corvus	i'll do that. test cycle should be fast.	21:33
mordred	corvus: let's do that	21:34
corvus	- /dev:/dev:rw	21:34
corvus	just like that?	21:34
mordred	corvus: let's say yes!	21:34
mnaser	seems like that translates to roughly what k8s is modeling, yeah	21:34
corvus	it failed, i'm trying to find the error	21:35
corvus	mount: /opt/dib_tmp/dib_build.xUudni10/mnt/proc: permission denied.	21:36
mordred	docker it is!	21:36
corvus	is dim_tmp bind-mounted in the testing?	21:36
corvus	dib_tmp	21:36
corvus	or is it a straight-up volume?	21:36
clarkb	corvus: good question, no	21:36
clarkb	its not even that, it may be using /tmp	21:37
corvus	sigh	21:37
corvus	this isn't going to work in docker either	21:37
clarkb	dib's default is to use the regular /tmp implementation. We have to use something else because our images are too big for that	21:37
clarkb	(that is where /opt/dib_tmp comes from)	21:38
corvus	i guess i'll do the docker test for completeness	21:38
corvus	but i'm not optimistic	21:38
mordred	corvus: maybe override user and run the container as root	21:38
mordred	rather than with the USER nodepool setting from inside the container	21:39
mordred	I don't know why that would have any difference of course	21:39
mordred	grasping at straws	21:39
corvus	mordred: i doubt that's it -- i suspect the problem has something to do with mounting something on a bind mount in a container	21:39
corvus	like, i don't think the mount can propagate up	21:39
clarkb	hrm	21:39
mordred	clarkb: maybe try not passing -v /opt/dib_tmp ?	21:40
clarkb	mordred: if our / is big enough to support that it may work	21:40
corvus	and hope that's enough for 1 image?	21:40
clarkb	/dev/xvda1 39G 3.4G 36G 9% /	21:40
clarkb	it will be close	21:40
clarkb	but I think that may be enough for a single image	21:40
mordred	yeah - it might work	21:40
mordred	then - when we come back to this - we can set up docker/podman to put its container storage directly in /opt perhaps	21:41
mordred	but that'll be for when we're sorting this out properly in the first place	21:41
clarkb	(fwiw I thought mounts were effectively flat in the kernel, and then the mount points give us an illusion of nesting, however cgroups may have completely changed that I guess)	21:41
corvus	i think it may be succeeding under docker	21:42
mordred	cool	21:42
clarkb	ya fedora-30-0000000549.log is showing it doing package stuff which implies it got further than the /proc mount	21:42
corvus	(current run is docker-compose up without /dev mounted)	21:42
clarkb	yay and interesting podman difference for the supposedly compatible tool :)	21:43
mordred	cool so maybe docker is doing a mount propagation different	21:43
corvus	clarkb: it's incompatible except for that one thing, i guess :(	21:43
corvus	now i regret running docker-compose without the -d argument	21:43
corvus	my hubris is why it succeeded	21:43
clarkb	that is probably worth bringing up with the podman folks since I know rhel installs podman as `docker`	21:43
clarkb	corvus: oops	21:43
clarkb	it complains about its hostname not being resolvable but that appears to be a non issue so far (its not like my dib VMs locally ever resolve in dns properly either)	21:44
mordred	clarkb: https://github.com/containers/libpod/blob/master/docs/source/markdown/podman-run.1.md search for mount propagation	21:44
clarkb	mordred: I think we need shared propagation	21:45
clarkb	?	21:45
mordred	I think it's worth trying if/when we get back to investigation	21:46
clarkb	ya	21:46
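One way to compare the two runtimes when this investigation resumes is to look at the propagation tags on the bind mount from inside each container. The sketch below parses lines in the `/proc/self/mountinfo` format, where the optional fields before the `-` separator carry `shared:N`, `master:N` or `unbindable`. The chat does not confirm that propagation is the actual cause, and the `/opt/dib_tmp` mount point in the usage note is taken from the log purely as an example.

```python
def propagation_of(mountinfo_line):
    """Classify the propagation type of one /proc/<pid>/mountinfo line."""
    fields = mountinfo_line.split()
    # Optional fields sit between the mount options (index 5) and the "-".
    sep = fields.index("-")
    tags = {field.split(":")[0] for field in fields[6:sep]}
    if "shared" in tags:
        # A mount can be both shared and a slave; report it as shared here.
        return "shared"
    if "master" in tags:
        return "slave"
    if "unbindable" in tags:
        return "unbindable"
    return "private"


def propagation_for_path(mountinfo_text, mount_point):
    """Return the propagation of the last mount at mount_point, or None."""
    result = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) > 4 and fields[4] == mount_point:
            result = propagation_of(line)
    return result
```

Inside a container this could be called as `propagation_for_path(open("/proc/self/mountinfo").read(), "/opt/dib_tmp")` under both docker and podman to see whether the results differ.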
clarkb	the build is cloning all the git repos right now	21:46
clarkb	may be a while	21:46
clarkb	mordred: maybe running podman as docker changes those behaviors?	21:46
clarkb	(so is a non issue when running it on rhel8 that way)	21:46
mordred	clarkb: maybe so? I'm sure there's more than one thing to learn here	21:47
mnaser	i think the /dev thing comes into play when losetup happens and tries to mount the qcow2	21:48
clarkb	mnaser: well we don't mount /dev in our testing either	21:48
* mnaser shrugs at why i ended up needing it		21:49
clarkb	at this point I'm mostly worried some element that is infra image specific will break rather than the stuff in dib because we test the stuff in dib	21:49
mnaser	should have probably documented that but yeah	21:49
clarkb	mnaser: possibly because docker mounts a /dev by default?	21:49
clarkb	or it does when privileged?	21:49
clarkb	you need things like /dev/random typically	21:49
mnaser	perhaps it was that, or maybe cause cri-o doesn't mount it? i dunno, sorry, can't provide much more useful input other than memory that not usually well :)	21:50
clarkb	mnaser: also if podman is cool with breaking these things why can't it break or add ipv6 support	21:50
clarkb	er sorry that was for mordred	21:51
mnaser	i'll take that one too	21:51
mnaser	:p	21:51
clarkb	Like I get the goal of being compatible if they actually did that, but they haven't (as evidenced here)	21:51
corvus	i have run "CTRL-\" to exit docker-compose without stopping the underlying containers	21:51
mordred	clarkb: sigh	21:51
mordred	corvus: neat!	21:51
clarkb	the dib build is still proceeding fwiw so seems to have worked	21:51
clarkb	I'm going to take a break from watching git clone log lines scroll by and find something to drink, back in a bit	21:54
corvus	clarkb: something to drink, or something to DRINK? cause i'm pretty sure we could all use the latter	21:55
corvus	i will also bbiab	21:55
clarkb	For now just drink :)	21:55
*** DSpider has quit IRC		22:02
ianw	... thanks for looking in on this ... it was supposed to be a soft rollout but clearly got a little out of hand	22:04
fungi	corvus: i'm just impressed you can ctrl-\ without killing your xsession	22:13
ianw	none of this is helped by me forgetting to git add nb01.opendev.org.yaml in https://review.opendev.org/712836 ... sigh :( but then it seems we've found the hosts use short-names that collide anyway	22:15
clarkb	ianw ya I think all we want at this point is to get f30 uploaded. then we make nb04.opendev.org	22:15
clarkb	as well as clean up nodepool as necessary in parallel	22:16
corvus	ianw: i think we're a little fuzzy on the contribution of the short-names -- the current system is using a unique short name but also completely duplicated file with all the images	22:16
corvus	at this point, i don't know which of those, or perhaps both, are necessary for this to work	22:16
ianw	the other thing i found, that i was hoping would not be an issue till monday, was that the limestone .pem in the config file is hard-coded to ~nodepool	22:17
corvus	ianw: the other thing is this change is necessary: https://review.opendev.org/712824 and also we either need to run with docker, or explore mount propagation settings for podman	22:18
corvus	ianw: i don't think we've recorded that last bit yet; you might want to jot that in your notes	22:18
ianw	yes, i had clearly over-estimated the podman == docker situation	22:18
ianw	btw the new rpm format was switched in for f31; that's what i've been trying to get going	22:20
clarkb	ianw: what stopped the f30 builds then?	22:20
ianw	f30, iirc, started having segfaults building, that, again, iirc, didn't happen with bionic building	22:20
clarkb	ah ok so different issue, but happier on newer platform	22:20
ianw	but, f31 was supposed to be the solution anyway	22:21
fungi	got it, so even if we'd unpaused it on the xenial builders they still wouldn't have produced a f30 image	22:22
ianw	no; and i don't have good story filled out on this :/ which is my own fault	22:23
mordred	ianw: I think we can sort out the limestone thing	22:23
mordred	ianw: it might be a better choice to run the containerized hosts with the config in /etc/openstack and bind-mount that in rather than putting them in /home/nodepool like we have been doing - but we probably have a few things to figure out before we get to that :)	22:24
ianw	mordred: yeah, it will just prevent uploading; could either link or i was thinking it is probably better but a bigger change to move it to /etc all together	22:24
mordred	yah	22:24
ianw	heh, jinx, ... that's why it was a "monday" thing :)	22:25
mordred	yup	22:25
mordred	and - I like making it a self-contained change rather than just part of the puppet>ansible+container	22:25
ianw	2020-03-13 22:24:14.798 | Couldn't parse 'sudo: unable to resolve host nb01opendev: Name or service not known file /opt/cache/files/sudo: unable to resolve host nb01opendev: Name or service not known sudo: unable to resolve host nb01opendev: Name or service not known' as a source repository	22:26
fungi	well poop	22:27
ianw	oh dear; i guess we somehow look at the result of a "sudo" command, and that message has confused it	22:27
mordred	headdesk	22:27
fungi	so whatever we tell docker/podman the hostname is also has to resolve (at least via hostfile?)	22:27
clarkb	fungi: it must be at least via hostfile because I've done builds without proper dns setups on local VMs	22:28
mordred	is there a way to tell sudo to shut up about the host thing?	22:28
clarkb	mordred: there is iirc	22:28
clarkb	of course all the docs around this say just edit /etc/hosts	22:30
ianw	i really have no idea why all this need sudo but it's just been like that @ https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/source-repositories/extra-data.d/98-source-repositories#L157	22:31
fungi	clarkb: yeah, scoured manpages and haven't turned up an option to disable local host resolution	22:33
clarkb	fungi: I thought it was to quiet the logging not necessarily stop the lookups	22:34
ianw	also, this doesn't happen in gate?	22:34
clarkb	I want to say we tried this with devstack at some point	22:34
clarkb	ianw: no because its just checking /etc/hosts	22:34
clarkb	(I guess we can bind mount that in)	22:34
ianw	i mean the gate test	22:34
ianw	... ohh, we probably just don't run it; don't cache any repos	22:35
clarkb	ianw: it uses host networking too so the hostname is probably correct	22:35
ianw	yeah; nothing in e.g. https://zuul.opendev.org/t/zuul/build/6130771f463743708b410e7a2647641f/log/nodepool/builds/test-image-0000000001.log	22:36
corvus	oh, i guess if you use host networking docker might not update /etc/hosts on the container?	22:36
clarkb	corvus: thats my guess	22:37
corvus	the contents inside the container match the host; what if we add it to the host /etc/hosts and restart the container? maybe it will copy it?	22:37
clarkb	corvus: seems reasonable	22:37
corvus	i'll do that now	22:37
corvus	yes, it did that	22:38
clarkb	this time around should be quicker due to caching	22:38
corvus	it's running build 551 now	22:38
clarkb	I don't see sudo warnings after sudo commands	22:39
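The fix that worked here was adding the container's hostname to the host's /etc/hosts so that sudo stops printing "unable to resolve host". A quick pre-flight check along those lines might look like the sketch below. The function name is made up for illustration, and it only inspects hosts-file text; it does not reproduce sudo's full resolution behaviour.

```python
def hosts_file_resolves(hosts_text, hostname):
    """Return True if hostname appears as a name or alias in hosts-file text."""
    for raw in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # First field is the address, the rest are names and aliases.
        _address, *names = line.split()
        if hostname in names:
            return True
    return False
```

For example, `hosts_file_resolves(open("/etc/hosts").read(), "nb01opendev")` could be run before starting a build, using the short name seen in the error above.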
ianw	we can also iterate faster if we want to manually stop the caching with an override	22:40
ianw	https://opendev.org/openstack/project-config/src/branch/master/tools/build-image.sh#L77	22:41
clarkb	ianw: I think we've cached the bulk of them now	22:41
clarkb	ianw: so it should just update them at this point (and be much quicker)	22:41
ianw	yeah, much is relative :)	22:41
clarkb	its 1/4 done now :)	22:44
ianw	i'm adding stories to https://storyboard.openstack.org/#!/story/2007407	22:49
clarkb	thanks	22:49
ianw	converting this to nb04 after seems to avoid any collision issues, i'll put that in	22:49
ianw	do we want to fully investigate podman before that, or convert to docker?	22:49
ianw	(all of this is me just following mordred ... at the time of the gate tests we were using docker, then when i wrote the production deployment i switched to podman because that's what gerrit was using now :) )	22:50
clarkb	ianw: I'm fine with docker honestly. But mordred did link to the podman docs on this	22:50
clarkb	ianw: https://github.com/containers/libpod/blob/master/docs/source/markdown/podman-run.1.md search mount propagation	22:51
clarkb	2020-03-13 22:52:11.795 | Couldn't parse 'E: Unable to locate package lsb-release file /opt/cache/files/E: Unable to locate package lsb-release E: Unable to locate package lsb-release' as a source repository	22:53
clarkb	I'm guessing that means lsb_release isn't working properly	22:54
clarkb	and it is trying to cache distro packages? so relies on that info	22:54
ianw	E: ... is that from apt?	22:55
clarkb	or yum?	22:55
ianw	sorry, yeah pkg manager ... that seems like a weird place to get that	22:55
clarkb	2020-03-13 22:52:11.775 | Getting /opt/dib_cache/source-repositories/repositories_flock: Fri Mar 13 22:52:11 UTC 2020 for /opt/dib_tmp/dib_build.IoVQmc68/hooks/source-repository-images	22:56
clarkb	is the thing before	22:56
clarkb	https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/cache-devstack/extra-data.d/55-cache-devstack-repos from there maybe?	22:57
clarkb	oh that script generates the list then cache-url consumes it	22:59
clarkb	I think we may be generating invalid image list	23:01
clarkb	possibly because lsb-release doesn't exist	23:01
clarkb	https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/cache-devstack/extra-data.d/55-cache-devstack-repos#L84 that script (I'm pulling it up now to see what it does and if it expects lsb to be there)	23:02
ianw	our gate testing build of f30 for comparison https://zuul.opendev.org/t/openstack/build/879fcc41186d4a55b8ed4b6f561e2909/log/nodepool/builds/test-image-0000000001.log	23:04
clarkb	ya I think that is it. That script sources devstack/functions which sources devstack/functions-common which attempts to install lsb-release if it does not exist	23:04
clarkb	and that is coming from apt	23:04
clarkb	whats odd is that it would fail though	23:05
clarkb	(I would expect it to install, but it doesn't possibly because we clean things up on the image enough that it can't do a package install without an update first?)	23:05
ianw	that sounds highly likely	23:06
clarkb	ianw: https://opendev.org/openstack/devstack/src/branch/master/functions-common#L321-L338 is the underlying source of that though I think	23:06
ianw	it seems like lsb-release could be a bindep of dib	23:06
clarkb	except we don't even need it for this functionality	23:07
ianw	lsb-release [platform:dpkg]	23:07
clarkb	(the image list generation doesn't need to know what distro it is running on)	23:07
ianw	it is actually	23:07
clarkb	only as a side effect I think	23:07
ianw	so that should only trigger if lsb_release isn't there, and it should be there from bindep.txt	23:08
clarkb	do we install dib's bindep?	23:09
clarkb	thats what the sibling stuff should get us?	23:09
ianw	wait, "command lsb_release" actually runs it, right	23:10
ianw	oh no, it's "-v"	23:10
ianw	clarkb: yes, i think that dib's bindep should be installed by the container build	23:11
clarkb	thinking out loud here we could edit /opt/project-config to stop caching images	23:12
clarkb	(just to see if anything else will break)	23:12
clarkb	corvus: mordred fungi ^ any opinions on that?	23:12
ianw	ohhh, actually https://zuul.opendev.org/t/zuul/build/d3ffa91e9f8d4fbea364203a864a054a/log/job-output.txt	23:15
ianw	it doesn't install dib bindep.txt ... i remember we talked about this https://opendev.org/zuul/nodepool/src/branch/master/Dockerfile#L66	23:16
clarkb	whats the difference between lsb-base and lsb-release?	23:16
fungi	what's broken in the current image list?	23:16
ianw	i think we need to add it @ https://opendev.org/zuul/nodepool/src/branch/master/Dockerfile#L66	23:16
fungi	lsb-base is a set of "standard" packages which make a given distro meet the lsb minimum requirements	23:17
corvus	catching up	23:17
fungi	lsb-release is a tool to tell you what distro you're on and various other bits needed to gauge lsb compatibility	23:17
clarkb	fungi: we need lsb-release on our image to make https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/cache-devstack/extra-data.d/55-cache-devstack-repos#L84 happy	23:17
clarkb	corvus: ^	23:17
fungi	the lsb and thus lsb-base are effectively dead since years	23:18
clarkb	*on our nodepool-builder image	23:18
fungi	lsb-release as a relatively commonly found utility to tell you what distro/release your on has survived	23:18
fungi	s/your/you're	23:19
fungi	/	23:19
ianw	clarkb: it's emergency-ed right? i think we can skip caching just to see if it gets a .qcow2 out	23:19
clarkb	ianw: yes it is emergency'd	23:19
clarkb	ianw: and ya I think that is probably the next best step now. Should we just disable image caching?	23:19
corvus	yeah, i don't think this is used for devstack tests?	23:19
corvus	so shouldn't be a big deal	23:20
corvus	this==f30	23:20
ianw	umm, it will be, but at this point that's minor concern, it's non-voting	23:20
clarkb	I can comment out https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/cache-devstack/extra-data.d/55-cache-devstack-repos#L123-L145 and add a pass?	23:20
clarkb	or rm the file entirely	23:20
clarkb	in /opt/project-config on the server	23:20
fungi	well, also devstack will just download those images if they're not cached	23:20
corvus	i didn't see it in a codesearch	23:21
corvus	but it's there, i stand corrected	23:22
clarkb	oh we can just remove the cache-devstack element	23:22
corvus	it's not hard to update the image if we want	23:22
clarkb	thats cleaner and nicer	23:22
corvus	we can just exec a shell, install the package and docker commit	23:23
corvus	as long as we're just plowing through the punch list to get something working	23:23
ianw	i think it's actually more just the "apt-get update" that's required	23:23
clarkb	corvus: well dib is actually trying to install that but it is failing (so we'd have to debug that too)	23:23
clarkb	corvus: the error we get is from the attempted install :)	23:23
ianw	yeah, i think that's just because the container has had all its metadata purged	23:23
corvus	dib is trying to install packages on the system it's running on?	23:24
clarkb	corvus: by way of devstack :/	23:25
ianw	... i'm not saying it's right, but it is	23:25
clarkb	corvus: so not really dib, but the dib element that calls into devstack which then side effects	23:25
corvus	i am now in favor of removing the image caching element everywhere	23:25
corvus	that is really uncool	23:25
clarkb	corvus: https://opendev.org/openstack/devstack/src/branch/master/functions-common#L321-L338 there	23:25
corvus	and unsafe	23:25
corvus	that was never supposed to happen	23:25
ianw	well, that package is part of bindep, so if we had fully installed dib's bindep we probably wouldn't notice	23:26
corvus	(i mean, we just gave devstack root on the entire control plane)	23:26
corvus	(i'm not exaggerating -- you can use our image builds to eventually jump to any host we manage)	23:26
clarkb	and yes it wasn't supposed to from memory way back when sean added that	23:26
corvus	well, i added the image caching but sure	23:27
fungi	not just as we've been running a script from it during image builds since, what, 2014?	23:27
clarkb	corvus: well the original version was a naive scan (which was super safe)	23:27
corvus	clarkb: yep	23:27
clarkb	then sean wrote a thing in devstack to list the images non naively	23:27
corvus	and there was a reason for that	23:27
clarkb	and I'm pretty sure the original version of that was also safe	23:27
clarkb	(but I could be wrong)	23:27
clarkb	switching to caching a cirros or three is probably sane at this point	23:28
fungi	sean's implementation that i remember parsed the devstack scripts to find urls	23:28
clarkb	trove and the weird container stuff that was going on are basically EOF	23:28
clarkb	so what we end up with is "what cirros images do we care about"	23:28
corvus	i have to go now. my vote is to disable the caching element globally and add anything we want in a static element.	23:29
clarkb	(and I think we can just list those)	23:29
fungi	i do think maintaining a static list of images at this point is probably low-effort. devstack rarely adds/removes/updates its own image set since ages	23:29
fungi	but also i agree, granting the devstack project a root access backdoor on all nodes through the image build process is not good under our present model	23:30
ianw	i've filed https://storyboard.openstack.org/#!/story/2007407 task #39066	23:32
ianw	clarkb: so are you removing cache-devstack?	23:34
clarkb	ianw: I got sidetracked thinking about rewriting cache-devstack :)	23:36
clarkb	I'll rm it from /etc/nodepool/nodepool.yaml now	23:36
clarkb	done	23:36
ianw	for right now, do you want to apt-get update in the container and just see if it passes?	23:38
clarkb	I probably won't thinking time is better spent updating cache-devstack at the moment	23:39
ianw	docker exec 4f141126d67d sudo apt-get update ... i did that ... let's see if this build goes	23:39
ianw	clarkb: #39066 assigned to you :)	23:40
openstackgerrit	Mohammed Naser proposed openstack/project-config master: add vexxhost/openstack-operator https://review.opendev.org/713080	23:48
ianw	... ok, i think it got a little further, it's still going	23:50
ianw	it's at everyone's favourite pip-and-virtualenv	23:54
* fungi can't wait to see that gone		23:55
ianw	it's making the image! yay	23:55
openstackgerrit	Clark Boylan proposed openstack/project-config master: Statically cache devstack images and packages https://review.opendev.org/713081	23:55
clarkb	ianw: fungi corvus mordred ^ that should be a safe version and largely backward compatible with the current set of cached stuff	23:56
clarkb	I decided to punt on arm64 for now	23:56
ianw	clarkb: i was thinking perhaps devstack should keep that static list? i'm not sure anyone contributing changes there would know to update it, at least it would come up in a grep in the local source tree?	23:57
clarkb	ianw: ya we could do that as an improvement. I was basically trying to get simple thing done that would work for now	23:58
clarkb	though one issue is architecture differences	23:58
clarkb	if that is dynamic and in devstack we'd potentially have the same problem all over again	23:58
clarkb	(also the etcd caching is a bit annoying because as far as I know basically nothing is using it)	23:59
clarkb	(basically I acknowledge that I've punted on a few things including arch and user accessibility, but for short term this should be a good change and we can figure out longer term solutions to those problems?)	23:59