openstackgerrit | Merged opendev/system-config master: pip3: Add python3-distutils https://review.opendev.org/712818 | 00:29 |
---|---|---|
openstackgerrit | Ian Wienand proposed opendev/system-config master: [dnm] test with plain nodes https://review.opendev.org/712819 | 01:40 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [dnm] test with plain nodes https://review.opendev.org/712819 | 01:43 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [dnm] test with plain nodes https://review.opendev.org/712819 | 01:44 |
openstackgerrit | Ian Wienand proposed openstack/project-config master: Move fedora-30 builds to nb01.opendev.org https://review.opendev.org/693120 | 01:55 |
openstackgerrit | Merged openstack/project-config master: Move fedora-30 builds to nb01.opendev.org https://review.opendev.org/693120 | 02:22 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: nodepool-builder: add /opt/dib_cache https://review.opendev.org/712824 | 02:22 |
openstackgerrit | Merged openstack/diskimage-builder master: Remove hacking from requirements https://review.opendev.org/712778 | 05:27 |
openstackgerrit | Ian Wienand proposed openstack/project-config master: Revert "Move fedora-30 builds to nb01.opendev.org" https://review.opendev.org/712836 | 05:28 |
*** DSpider has joined #opendev | 05:51 | |
openstackgerrit | Merged openstack/project-config master: Revert "Move fedora-30 builds to nb01.opendev.org" https://review.opendev.org/712836 | 06:33 |
*** factor has joined #opendev | 07:50 | |
*** lpetrut has joined #opendev | 10:12 | |
*** factor has quit IRC | 14:00 | |
*** factor has joined #opendev | 14:00 | |
*** factor has quit IRC | 14:03 | |
*** factor has joined #opendev | 14:04 | |
*** factor has quit IRC | 14:14 | |
*** factor has joined #opendev | 14:14 | |
*** factor has quit IRC | 14:15 | |
openstackgerrit | Merged opendev/glean master: Fix a handful of bugs in config-drive processing https://review.opendev.org/703623 | 14:42 |
openstackgerrit | Merged openstack/project-config master: Add a new project and repository for tripleo-ipa https://review.opendev.org/711114 | 15:44 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: DNM: rebase unittests to base-minimal-test https://review.opendev.org/712985 | 15:45 |
*** lpetrut has quit IRC | 15:48 | |
openstackgerrit | Merged openstack/project-config master: New repo: devstack-plugin-open-cas https://review.opendev.org/711878 | 15:51 |
openstackgerrit | Merged openstack/project-config master: Add OpenInfra Labs IRC channels to bots https://review.opendev.org/712586 | 15:52 |
openstackgerrit | Merged openstack/project-config master: Remove neutron-tempest-dvr job from Neutron's dashboard https://review.opendev.org/712048 | 15:52 |
*** lpetrut has joined #opendev | 16:14 | |
openstackgerrit | Clark Boylan proposed openstack/project-config master: Install tox into a virtualenv on our images https://review.opendev.org/713017 | 16:29 |
*** openstackgerrit has quit IRC | 16:31 | |
*** openstackgerrit has joined #opendev | 16:34 | |
openstackgerrit | Lance Bragstad proposed openstack/project-config master: Add queues for tripleo-ipa project https://review.opendev.org/711115 | 16:34 |
openstackgerrit | Merged openstack/project-config master: Install tox into a virtualenv on our images https://review.opendev.org/713017 | 17:32 |
corvus | i'm seeing node failures for fedora-30 nodes | 18:06 |
corvus | and i notice that there are some recent commits to project-config moving them around | 18:06 |
corvus | is this known / is anyone working on it? | 18:06 |
corvus | infra-root, config-core: ^ ? | 18:06 |
clarkb | it is not known to me | 18:07 |
clarkb | my guess is that the new builder managed to upload a fedora-30 image that does not work | 18:07 |
fungi | ianw moved them to the new nb01.opendev.org late yesterday | 18:07 |
clarkb | we can probably pause the builds on that builder then revert to the previous image? | 18:08 |
fungi | i think they're already paused | 18:08 |
corvus | i don't see builder logs on nb01.opendev | 18:08 |
fungi | i believe he said he stopped the services on it and added it to the emergency disable list | 18:08 |
corvus | i see build logs | 18:08 |
corvus | but not the builder itself | 18:08 |
corvus | i don't know how to tell if the builder is running | 18:09 |
corvus | we don't have docker on that machine, so i guess we're using podman | 18:09 |
clarkb | corvus: `docker ps -a` or podman equivalent if using podman | 18:09 |
corvus | clarkb: which user should i run podman as? | 18:10 |
Shrews | i thought the new builder was reverted? https://review.opendev.org/#/c/712836/ | 18:10 |
corvus | (there's no global podman thingy like docker) | 18:10 |
corvus | Shrews: apparently that's not a revert, that's a rework | 18:10 |
clarkb | looks like root | 18:10 |
corvus | Shrews: at least that's what the commit message says? | 18:10 |
clarkb | (because we run podman-compose as root in system-config/playbooks/roles/nodepool-builder/tasks/main.yaml) | 18:11 |
corvus | there are now 2 podman processes, one running as corvus, one as nodepool, probably because i ran "podman ls" | 18:11 |
clarkb | but ya rereading it should be stopped because of nodepool's "delete everything I don't know about" behavior | 18:12 |
clarkb | in which case I think we can remove the newer fedora-30 image and fall back to the older one? | 18:13 |
corvus | as far as i can tell, nodepool-builder is not running on this host | 18:13 |
corvus | do we have a plan for running the nodepool cli on the docker hosts? | 18:13 |
clarkb | corvus: I think a lot of that is still in the learning phase | 18:13 |
mordred | corvus: I agree that nodepool-builder does not seem to be running on the host | 18:13 |
clarkb | but we do produce a nodepool image to run commands | 18:14 |
fungi | ianw mentioned in scrollback (maybe in #openstack-infra) that he stopped it | 18:14 |
clarkb | (so could probably add that to our setup) | 18:14 |
mordred | perhaps add a convenience script so that just running "nodepool" works and does the right thing with the image | 18:14 |
mordred | like in that openstackclient patch I did in system-config a little while ago | 18:15 |
corvus | that would be swell | 18:15 |
corvus | i think having the nodepool cli handy before we break things would be good | 18:15 |
mordred | ++ | 18:15 |
fungi | yeah, #openstack-infra at 02:44z | 18:15 |
corvus | so i guess we're done with nb01.opendev for now; i'll log into a different host | 18:15 |
corvus | "nodepool image-list |grep -i fedora" shows nothing | 18:16 |
fungi | he stopped the builder because it was deleting all the images id didn't know about | 18:16 |
fungi | s/id/it/ | 18:16 |
corvus | so i'm guessing all the f30 images were deleted, and the current state of the half-revert is that the builder config for f30 is on a host which is down | 18:16 |
corvus | so we should continue to complete the revert so that the f30 builder config moves back to the nb*.openstack hosts ? | 18:17 |
clarkb | corvus: we can't build fedora30 on those hosts though | 18:17 |
clarkb | I think we have to roll forward? | 18:17 |
corvus | how did we ever build f30? | 18:17 |
fungi | https://review.opendev.org/712836 should have put them back | 18:17 |
clarkb | corvus: ~5 months ago fedora rpms were made with a compression tool that was available on ubuntu xenial then at some point they switched off that aiui | 18:18 |
corvus | fungi: line 280: pause: true | 18:18 |
fungi | oh, it was put back with pause: true | 18:18 |
fungi | yep, just spotted that myself | 18:18 |
clarkb | at that point we paused the builds then ianw has spent the intervening time trying to come up with a system to build them (and this is the result) | 18:18 |
corvus | then having that system delete the irreplaceable images is especially unfortunate | 18:19 |
clarkb | yes, I think this is an aspect of nodepool that we should probably think about more. (running disjoint builders is likely desirable to accommodate different builder needs, architecture, operating system, whatever) | 18:21 |
clarkb | I noted on IRC last night that I think the way nodepool wants you to express this is to always list all images, then pause them where they should not build | 18:21 |
mordred | I don't suppose we accidentally still have the old fedora30 qcow on any of the nodes right? | 18:21 |
clarkb | mordred: probably not if nodepool deleted them | 18:21 |
fungi | it tries to clean them up aggressively | 18:21 |
clarkb | (it's pretty good about cleaning those up) | 18:21 |
mordred | yeah | 18:22 |
Shrews | i suspect the hostname change is what triggered the cleanup (cc: ianw) | 18:22 |
clarkb | Shrews: there was no hostname change | 18:22 |
clarkb | Shrews: this is a new additive host | 18:22 |
Shrews | clarkb: nb01.opendev.org vs. nb01.openstack.org | 18:23 |
Shrews | right? | 18:23 |
clarkb | Shrews: yes that wasn't a change | 18:23 |
clarkb | both are/were expected to run side by side | 18:23 |
Shrews | clarkb: nodepool stores the hostname of the builder, so yes, as far as nodepool is concerned, it was a change | 18:23 |
clarkb | Shrews: I am trying to clarify that we didn't delete or remove nb01.openstack.org | 18:24 |
clarkb | we added nb01.opendev.org to the set of existing servers | 18:24 |
clarkb | (I understand why the images were deleted) | 18:24 |
Shrews | clarkb: i understand that | 18:24 |
Shrews | i don't think that invalidates my statement | 18:25 |
clarkb | Shrews: I read it as we changed nb01.openstack.org to nb01.opendev.org which did not happen | 18:25 |
clarkb | we simply added nb01.opendev.org | 18:25 |
Shrews | clarkb: didn't mean that. i meant "host ownership of a build" changed | 18:26 |
clarkb | Shrews: ah. I don't think that is fully it either. Because nb01.opendev.org apparently tried to delete all of nb01.openstack.org's images | 18:26 |
clarkb | and nb01.openstack.org tried to delete nb01.opendev.org's f30 image | 18:27 |
clarkb | maybe that is what you mean by host ownership? Basically they each decided the other's disjoint set was invalid | 18:27 |
clarkb | (which is why I suggested that listing all images then pausing where we don't want to run it would be a way to express this to nodepool) | 18:28 |
corvus | they are a cluster, and are all supposed to have the same configuration. the thing that we're doing with nb03 only works because it has a disjoint set of providers. | 18:28 |
corvus | (it is, in effect, a second cluster of one) | 18:29 |
corvus | but... about the future... | 18:29 |
corvus | this is affecting at least one project (nodepool). what are our options? | 18:29 |
clarkb | I expect that if we fixed the nodepool configs (possibly via the pause idea or just letting the new server build all the images) that we'd be able to build and upload a fedora30 image | 18:30 |
clarkb | basically roll forward | 18:30 |
clarkb | other ideas: manually upload a fedora-30 image in some set of clouds and use that in our providers | 18:31 |
clarkb | (and the probably bad option) stop testing on fedora | 18:31 |
fungi | i think either press forward trying to get a nodepool builder working on a newer distro, or try to get the necessary decompression tooling backported to ubuntu-xenial so the existing builders can unpack newer rpms | 18:31 |
fungi | and yes, also possibly someone build an image locally as a stopgap | 18:32 |
clarkb | (I've secretly been hoping that centos rolling distro can slip into the spot fedora fills, but that's an entirely different set of things to sort out and should maybe be ignored for now) | 18:32 |
fungi | what are the details on the rpm decompression problem? do we have that documented somewhere? | 18:32 |
clarkb | I'm sure ianw has it in a story. Let me see if I can find it | 18:33 |
mordred | wait- the container deployment is the thing that solves the rpm decompression problem | 18:33 |
clarkb | mordred: yes | 18:33 |
mordred | I don't think that's a thing that we need to go back to try to solve, is it? | 18:33 |
clarkb | mordred: no | 18:34 |
clarkb | (other than the container deployment had a sad) | 18:34 |
mordred | right. | 18:34 |
clarkb | but I think we can make it not have a sad | 18:34 |
mordred | I agree | 18:34 |
clarkb | (the put all images in the config and then pause where we don't want them to build idea) | 18:34 |
mordred | I just wanted to be clear that we didn't need to go back to a more complicated drawing board | 18:34 |
mordred | clarkb: ++ | 18:34 |
corvus | yeah, i think rolling forward with container using the new config file strategy is probably easiest (depending on how easy manually uploading an image is) | 18:35 |
corvus | i can help with that after lunch | 18:35 |
fungi | but yeah, i expect that if there were an easy way to backport a decompression solution for new rpms to xenial, ianw would already have done that | 18:35 |
AJaeger | corvus: the move around was reverted | 18:35 |
corvus | AJaeger: *partially* reverted | 18:35 |
fungi | AJaeger: it was, but only after fedora images we've ceased to be able to build were accidentally deleted by it | 18:36 |
corvus | AJaeger: it's in backscroll, but tldr: nothing is building f30 images now | 18:36 |
corvus | (and there are no f30 images) | 18:36 |
mordred | how about I take a stab at the config file change | 18:36 |
AJaeger | corvus: see it now | 18:36 |
clarkb | the paging buttons in the storyboard story search page don't work | 18:36 |
clarkb | oh it's 1 to 6 stories not 1 to 6 pages | 18:37 |
*** tristanC has joined #opendev | 18:37 | |
clarkb | mordred: wfm (I'll keep trying to dig the story details out of storyboard) | 18:37 |
corvus | i gotta run, i'll be back after lunch to help. | 18:37 |
openstackgerrit | Monty Taylor proposed openstack/project-config master: Add fedora-30 to nb01.opendev.org https://review.opendev.org/713047 | 18:40 |
mordred | clarkb, corvus, fungi: ^^ | 18:40 |
mordred | I believe that is what we're saying we want on nb01 yeah? | 18:41 |
mordred | Shrews: | 18:41 |
clarkb | mordred: yes I think that would allow us to build without associated deletes. If shrews can confirm that would be good | 18:41 |
fungi | that'll have to be manually installed for the time being, right? | 18:42 |
Shrews | i'm still not fully understanding why one is trying to delete the other's images. i can't say for sure if that's what we want until i know that | 18:42 |
clarkb | fungi: no I think we are still ansibling it | 18:43 |
clarkb | fungi: we've just stopped giving the service a config to do any work with | 18:43 |
fungi | clarkb: did it get taken back out of the emergency disable list? | 18:43 |
clarkb | oh if it's in the emergency disable then ya we have to remove it from there or manually add it | 18:43 |
clarkb | maybe manually adding it is the safest thing | 18:44 |
fungi | i haven't checked the disable list, just saw ianw say that he had added it there | 18:44 |
clarkb | ya it's in there | 18:45 |
clarkb | so ya I think once we are comfortable with that change (I am but others should definitely double check me on it) we can manually apply it and re up the podman-compose config on nb01.opendev | 18:45 |
clarkb | then monitor it for deletions as well as building f30 | 18:45 |
clarkb | another option while we are brainstorming is to reduce that provider list in the nb01.opendev.org config to a single cloud to reduce blast radius. Have it build the image and upload it, then add the other providers once we are happy with it | 18:51 |
clarkb | I'm worried that will trigger some other cluster mismatch deletion behavior though. I think the proposed config from mordred is likely safest | 18:52 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Add a nodepool helper script into /usr/local https://review.opendev.org/713050 | 18:52 |
mordred | how's that look for a helper script? | 18:52 |
clarkb | mordred: I think we can probably trim the mount list since nodepool cli commands aren't running dib builds or logging to disk. | 18:53 |
clarkb | mordred: we should only need to mount in the cloud config and the nodepool config I think | 18:54 |
clarkb | I'm going to find lunch now too | 18:54 |
clarkb | (I think both changes are good as proposed even if we can do cleanup on the helper script) | 18:55 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Add a nodepool helper script into /usr/local https://review.opendev.org/713050 | 18:57 |
mordred | clarkb: good point | 18:57 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Add a nodepool helper script into /usr/local https://review.opendev.org/713050 | 18:57 |
*** lpetrut has quit IRC | 19:05 | |
fungi | mordred: is that going to have to be run via sudo -u nodepool so that it has access to bindmount the config? | 19:23 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Use a zuul_* and add an .ansible-lint file https://review.opendev.org/712547 | 19:27 |
AJaeger | clarkb: the goaccess reports run fine today, see https://7b8a363e2631f871420c-9d822a96c58fccb739d55f79e396b06d.ssl.cf1.rackcdn.com/periodic/opendev.org/opendev/system-config/master/docs-openstack-goaccess-report/44c9931/docs.openstack.org_goaccess_report.html | 19:29 |
clarkb | AJaeger: that is great to hear | 19:29 |
clarkb | AJaeger: and the reports have at least as much info as the old one right? | 19:30 |
AJaeger | clarkb: it gives the data - but lots of other stuff as well ;) Need to dig into how to just get those URLs. | 19:35 |
AJaeger | clarkb: so, I'm fine with going forward with it | 19:35 |
* AJaeger calls it a day and waves good night | 19:36 | |
* corvus is back and catching up | 19:36 | |
clarkb | corvus: I think if people agree mordred's changes look good we can proceed to apply the config update to new nb01 manually and manually up the service there | 19:37 |
clarkb | I'm still about 20 minutes from being able to help with that | 19:38 |
corvus | clarkb, mordred: +3 | 19:40 |
clarkb | corvus: note the server is in the emergency file | 19:40 |
clarkb | if we want ansible to update it instead of manual we should remove it | 19:40 |
corvus | clarkb: yeah i saw | 19:40 |
corvus | Shrews: are you still looking into that? (don't mind the +3, since it's not getting applied until we're ready) | 19:41 |
corvus | I think we should add warnings to both files that they need to be kept in sync during the transition period. | 19:41 |
Shrews | corvus: "that" being why images were being deleted? if so, no. i think i need ianw to walk me through it. | 19:43 |
openstackgerrit | Merged openstack/project-config master: Add fedora-30 to nb01.opendev.org https://review.opendev.org/713047 | 19:47 |
corvus | Shrews: i think you are right to be concerned. the builder id's present in 'nodepool dib-image-list' are short hostnames | 19:50 |
corvus | Shrews: ie "nb01" | 19:50 |
Shrews | we left the hostname comparisons in for compatibility with older nodepool (each should have a unique id now). it might be time to just remove that | 19:51 |
corvus | ugh. i just tried to run "podman run ..." to see if i could test the behavior on nb01.opendev.org and ran into gshadow permissions | 19:52 |
corvus | i thought we were removing that? | 19:52 |
Shrews | i don't think there were plans to do so | 19:53 |
clarkb | I'm not aware of gshadow issues (is that a podman thing?) | 19:53 |
corvus | no, it's a we're doing something in the nodepool image we shouldn't be doing thing | 19:53 |
* fungi assumed something related to the system shadow group file (/etc/gshadow) | 19:54 | |
corvus | i thought there was a patch to nodepool to revert that out, after we rejected the corresponding patch to zuul | 19:54 |
corvus | anyway, i *also* can't run it as root, for a different reason (failed to find plugin "loopback") | 19:55 |
corvus | so, i can't really predict the behavior on nb01.opendev.org because i can't get a python prompt in the production environment to test :/ | 19:55 |
mordred | corvus: I can confirm we have not reverted that out of nodepool | 19:56 |
mordred | but I think we should | 19:56 |
corvus | mordred: ack; i'll put it on my backlog of things to do when we can merge nodepool changes again | 19:56 |
corvus | mordred: i'm less confident about the config file change now | 19:56 |
corvus | (i'm also less confident we can actually run nodepool) | 19:57 |
mordred | corvus: we can make an image by hand and upload it to a personal dockerhub location then try running that manually | 19:57 |
mordred | to check | 19:57 |
mordred | corvus: I don't see the revert patch in the system - want me to make one real quick? | 19:57 |
corvus | mordred: let's not worry about that for now | 19:58 |
corvus | the gshadow thing is preventing me from running nodepool as a user | 19:58 |
corvus | but we run it as root | 19:58 |
corvus | the root error is different and i don't understand it | 19:58 |
mordred | corvus: ok. let me see if the root error makes sense to me | 19:59 |
clarkb | are you running it with all the mounts? if not perhaps that is causing trouble? | 19:59 |
corvus | no mounts | 19:59 |
openstackgerrit | James E. Blair proposed openstack/project-config master: Revert "Add fedora-30 to nb01.opendev.org" https://review.opendev.org/713058 | 19:59 |
corvus | mordred: ^ that's a revert of your change because of the nb01/nb01 issue | 20:00 |
mordred | corvus: +# | 20:00 |
mordred | gah | 20:00 |
Shrews | https://review.opendev.org/713057 Stop comparing hostnames to determine ownership | 20:00 |
corvus | so to summarize, i have 2 concerns: 1) some nodepool behavior is determined by the hostname, and we will have 2 hosts with the same name. 2) i wasn't able to run the nodepool container as root with "podman run" which weirds me out | 20:01 |
corvus | Shrews: it looks like that "or" is going to cause us to hit that case when we currently don't want to. | 20:02 |
mordred | corvus: we need --net=host | 20:02 |
mordred | corvus: I have a script in /root | 20:02 |
mordred | called "n" | 20:02 |
corvus | Shrews: so i think we either need to change the nb01.opendev.org hostname, or merge your change first | 20:02 |
mordred | that you can use | 20:02 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Add a nodepool helper script into /usr/local https://review.opendev.org/713050 | 20:03 |
fungi | or turn down nb01.openstack.org, though i suspect we have too much build load to do that before the new builder is in operation | 20:03 |
corvus | mordred: thanks. that confirms that the nb01.opendev environment reports 'nb01' as the hostname | 20:04 |
corvus | fungi: yes, that's an option. i'm unsure about the build load, let's check that out. | 20:04 |
fungi | i believe it comes down to how much usable disk space there is. the more builders we have, the less disk utilization on each | 20:04 |
clarkb | it's a lot better now than it was after cleaning up old fedoras and precise | 20:05 |
fungi | nb02 is 65% used on /opt, nb01 34% | 20:05 |
fungi | so ~100% if we turned off nb01, i expect? | 20:05 |
clarkb | fungi: ya that math should be roughly correct | 20:06 |
fungi | (i mean, it wouldn't be immediate, it would take some time to hit that, but still) | 20:06 |
clarkb | and in the meantime we'd possibly delete all those images? | 20:06 |
fungi | and yeah, the sudden deletion of fedora images is presumably what dropped the disk usage on nb01 | 20:06 |
clarkb | spinning up nb04.opendev.org wouldn't be too bad | 20:07 |
corvus | that's probably the safest thing | 20:07 |
clarkb | basically add mordred change back but change the name | 20:07 |
clarkb | and then update groups and stuff as necessary | 20:07 |
fungi | makes sense to me, though i'm about to be tending a hot wok for the next little while so will be less help | 20:07 |
corvus | there's also the possibility we could change the hostname in the podman-compose file without building a new host. but that could also inspire madness. | 20:08 |
clarkb | corvus: I like that for its simplicity but have no idea how reliable it would be | 20:08 |
fungi | right, i thought about that, convince nodepool its hostname is different than what the system knows itself as | 20:08 |
fungi | but i agree that's icky | 20:09 |
clarkb | or land shrews' change | 20:09 |
corvus | can't land nodepool changes | 20:09 |
fungi | which i've already +2'd | 20:09 |
fungi | but right, catch-22 | 20:09 |
clarkb | because we need fedora-30? | 20:09 |
corvus | yep. it's used in a gating job | 20:09 |
corvus | i think it would be safe to make it non-voting for Shrews change though, if we wanted to go that way | 20:10 |
Shrews | could make that job nv in my change, then re-enable? | 20:10 |
corvus | it's just going to be a thing. | 20:10 |
Shrews | jinx | 20:10 |
corvus | (there are some changes though that i definitely don't want to land without it) | 20:10 |
corvus | (including the one to remove the gshadow thing) | 20:10 |
mordred | ++ | 20:11 |
corvus | adding "--hostname nb04" to podman run works as expected | 20:11 |
clarkb | crazy udea tine lets do ^ to tide us over then on monday we can build it right | 20:12 |
clarkb | *crazy idea time | 20:12 |
mordred | so - are we thinking do that - get a f30 image, be able to land changes - fix the underlying stuff | 20:12 |
corvus | so i think changing the podman-compose file as a temporary measure would be workable. but the longer that goes on, the more cognitive dissonance we will experience. | 20:12 |
mordred | then stop nb01 and rename it back | 20:12 |
mordred | yeah | 20:12 |
mordred | it seems like a thing that should only be in place for exactly as long as it takes for us to get a f30 node | 20:12 |
clarkb | ya I think we should commit to replacing new nb01.opendev with nb04 monday | 20:12 |
corvus | this is all assuming we can build an f30 image after not doing it for 6 months :) | 20:13 |
clarkb | (I can do that) | 20:13 |
mordred | corvus: wcpgw? | 20:13 |
corvus | how about we call the innerhostname "nb01opendev" ? | 20:13 |
clarkb | I think the basics of f30 building is gated in dib | 20:13 |
fungi | nb01forealz | 20:14 |
clarkb | corvus: ++ that will be less confusing | 20:14 |
corvus | then it won't conflict with its past or future replacement | 20:14 |
fungi | yeah, wfm | 20:14 |
mordred | ++ | 20:14 |
mordred | just call it george | 20:14 |
Shrews | can we just suspend all testing over covid 19 concerns? | 20:15 |
corvus | Shrews: but we promised anyone who wants fedora30 tests could get them | 20:15 |
mordred | corvus: by anyone we only meant Tom Hanks | 20:16 |
clarkb | and nba players | 20:16 |
fungi | it would probably be okay if the tests were only 50% accurate | 20:16 |
mordred | clarkb: Rudy Gobert and Tom Hanks are the same person | 20:16 |
mordred | clarkb: Tom is just that good of an actor you never noticed | 20:16 |
clarkb | the french american sweetheart | 20:16 |
corvus | i think the way to do the compose file change is just to keep nb01.opendev in emergency and manually apply that and the new rev of mordred's change | 20:17 |
clarkb | corvus: wfm | 20:17 |
mordred | ++ | 20:17 |
fungi | seems reasonable | 20:18 |
corvus | i'll work on that now | 20:19 |
openstackgerrit | Merged openstack/project-config master: Revert "Add fedora-30 to nb01.opendev.org" https://review.opendev.org/713058 | 20:21 |
corvus | mordred, clarkb, fungi, Shrews: any of you who are available, want to check out the state on nb01.opendev.org? i modified /etc/nodepool-builder-compose/docker-compose.yaml and checkout out mordred's change 713047 to update /etc/nodepool/nodepool.yaml | 20:23 |
clarkb | corvus: looking | 20:23 |
corvus | s/checkout out/checked out/ | 20:23 |
clarkb | both lgtm (and for others looking /opt/project-config was updated with mordreds change and /etc/nodepool/nodepool.yaml is a symlink into that repo) | 20:25 |
corvus | anyone else want to weigh in, or should we run that now? | 20:29 |
mordred | corvus: lgtm | 20:30 |
corvus | so now we should: cd /etc/nodepool-builder-compose; podman-compose up ? | 20:30 |
corvus | (as root) | 20:31 |
clarkb | yes, I'll start a tail -f on builder-debug.log here and watch it | 20:31 |
corvus | oh yeah, that was the other thing | 20:31 |
corvus | no builder-debug.log | 20:31 |
clarkb | oh that file doesn't exist right now | 20:31 |
corvus | hrm. we do have /var/log/nodepool bind mounted | 20:32 |
clarkb | the permissions and mounts are such that it should be logging there | 20:32 |
clarkb | and /var/log/nodepool/builds/ has content | 20:32 |
corvus | but we're probably running the default run in foreground and log to stderr thing | 20:32 |
clarkb | in which case podman logs $containername would work? | 20:32 |
corvus | yes: podman logs nodepool-builder-compose_nodepool-builder_1 | 20:33 |
clarkb | as root | 20:33 |
corvus | yep. i feel like this is probably not how we want to run it in the long run. | 20:33 |
clarkb | I'm ready to run that command (as root) and watch it once it is going | 20:34 |
corvus | okay, i will "up" it now | 20:34 |
corvus | http://paste.openstack.org/show/790683/ | 20:35 |
corvus | apparently podman-compose does not work the same as docker-compose | 20:35 |
corvus | also, it's just sitting there now, it hasn't returned from that command invocation | 20:35 |
clarkb | scared me for a second that it was trying to delete things again in the logs but those are from 18 hours ago | 20:35 |
corvus | so... i guess i will ^C ? | 20:36 |
clarkb | ya it doesn't seem to be doing anything from what I am able to see | 20:36 |
mordred | hrm | 20:36 |
corvus | i will run podman-compose down, then podman-compose up. | 20:36 |
corvus | (docker-compose "up" automatically recreates containers if needed) | 20:36 |
corvus | it's running | 20:37 |
mordred | yay | 20:37 |
clarkb | mkdir: cannot create directory '/opt/dib_cache': Permission denied | 20:38 |
clarkb | that is why the build is failing | 20:38 |
clarkb | I want to say I saw something about this one moment please | 20:38 |
corvus | it wants to delete some vexxhost images | 20:38 |
mordred | that seems like a really bad choice | 20:39 |
clarkb | corvus: mordred I think those are old logs double check timestamps | 20:39 |
corvus | no | 20:39 |
mordred | oh - wait - vexxhost - those could be f30 vexxhost? | 20:39 |
clarkb | oh no its doing it again | 20:39 |
clarkb | probably need to stop it then | 20:39 |
corvus | i've stopped it | 20:39 |
corvus | but i didn't see it succeed at deleting anything it shouldn't | 20:39 |
Shrews | does it still think its name is nb01 by chance? | 20:39 |
corvus | can someone point it out to me | 20:39 |
clarkb | corvus: Shrews one sec I think I know what is happening (and it's ok for us) | 20:40 |
clarkb | we leak images in vexxhost | 20:40 |
clarkb | when that happens it's fair game for any nodepool builder to delete them from the cloud side | 20:40 |
clarkb | I think it has detected this case and is helpfully trying to delete a leaked image in vexxhost. But we should double check before turning it back on | 20:40 |
clarkb | also https://review.opendev.org/#/c/712824/ is the proposed fix for the dib cache thing | 20:40 |
corvus | so all our builders just log that error every few minutes? | 20:40 |
clarkb | corvus: ya I think so | 20:41 |
clarkb | I have a paste from the other day where I dug into this trying to find it now so I can cross reference ids | 20:41 |
corvus | openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://image-sjc1.vexxhost.us/v2/images/a0b6ea3e-6c39-41b6-8243-c0c9c6d027c8, Image a0b6ea3e-6c39-41b6-8243-c0c9c6d027c8 could not be deleted because it is in use: The image cannot be deleted because it is in use through the backend store outside of Glance.: 409 Conflict | 20:41 |
clarkb | http://paste.openstack.org/show/790497/ hrm those ids don't match so either they are new leaks or it is bugging out in the scary way | 20:42 |
clarkb | http://paste.openstack.org/show/790684/ I think that shows we have new leaks | 20:43 |
clarkb | and those ids seem to match what it was deleting on new nb01 (I think that means we are ok) | 20:43 |
mordred | sigh | 20:44 |
corvus | clarkb: er, so your final word is, these leaks are expected? | 20:44 |
mordred | clarkb: do we need to clear bfv's again? | 20:44 |
clarkb | corvus: ya basically boot from volume in vexxhost sometimes leaks volumes which prevents us from deleting the image. Once the image is in a "deleting" state in zk any builder is free to delete it in the cloud | 20:45 |
clarkb | corvus: the transition from active to deleting only happens on the "owner" builder though so we avoid races there by having it try first | 20:45 |
corvus | yeah, but your pastes i don't understand | 20:45 |
clarkb | (that also allows it to clean up its local disk) | 20:45 |
corvus | clarkb: i think you said "ids don't match which is bad" "something something nevermind it's good we're safe" | 20:46 |
clarkb | corvus: you can ignore the first paste at this point I think. What the second paste shows is we have 6 images that are all failing to delete in vexxhost and they are in a deleting state in zk. | 20:46 |
corvus | so i just want to make sure you still think those particular leaks are harmless and don't represent some new bad thing that nb01.opendev is doing | 20:46 |
corvus | ok, cool. | 20:46 |
clarkb | corvus: what the builder logs show from what you just ran is that those images are not deleting because there are leaked volumes in vexxhost preventing the image from deleting | 20:46 |
corvus | clarkb, mordred: so what's the deal with cleaning this up? | 20:46 |
clarkb | yup I do think that because the ids in the builder logs match the ids in my second paste | 20:47 |
clarkb | corvus: usually we run volume list and search for volumes which have leaked then try to delete them again (there are heuristics for this and mordred has a tool but its not perfect) | 20:47 |
corvus | so it's impractical to have nodepool fix this? | 20:47 |
clarkb | corvus: do you think we should try cleaning that up first so that we can have clean logs in the builder when we try again with a dib_cache dir? | 20:47 |
clarkb | corvus: yes I would say it is impractical to have nodepool fix it | 20:47 |
corvus | clarkb: under the circumstances, i think clean logs is important | 20:48 |
clarkb | nodepool could probably do a subset of cases though | 20:48 |
mordred | yes - there are some cases that are safe | 20:48 |
clarkb | but there is another subset where the server itself can't delete and it has the volume attached and that prevents the volume from deleting | 20:48 |
mordred | but not all | 20:48 |
clarkb | and we've seen that require intervention from the cloud itself | 20:48 |
mordred | yeah | 20:48 |
clarkb | corvus: ok I'm going to look into cleaning these up manually | 20:48 |
corvus | clarkb: cool, i'll modify the compose file with the cache fix | 20:49 |
corvus | oh it's a perm thing | 20:49 |
corvus | whatever | 20:49 |
corvus | i will do what 712824 does :) | 20:49 |
clarkb | for anyone else following along there are a bunch of unattached volumes in vexxhost (these are likely the source of the leak) | 20:49 |
clarkb | I'm going to spot check them to ensure they don't belong to something important (its a test node tenant so shouldn't) then delete them | 20:50 |
clarkb | mordred: ^ your tool might do it quicker than me though if you want to queue up running it ? | 20:50 |
mordred | sure | 20:50 |
clarkb | yup spot checking ~5 of them they are all from just after 2300UTC on march 11 | 20:52 |
clarkb | and they are boot from volume volumes with image ids that match our unhappy images in nodepool | 20:52 |
clarkb | what should happen is we delete the volumes (using mordred's tool should work in this case), then old nb0X will delete the images from the cloud as there is no volume keeping them around | 20:52 |
clarkb | we can then confirm with nodepool image-list as per my paste above and then try again with the new builder | 20:53 |
openstackgerrit | James E. Blair proposed opendev/system-config master: nodepool-builder: add /opt/dib_cache https://review.opendev.org/712824 | 20:53 |
corvus | clarkb: ^ the directory had been created; but it was not bind-mounted, so i have added it to docker-compose | 20:53 |
corvus | (i'm guessing ianw may have manually run the mkdir?) | 20:53 |
mordred | I'm going to run the clean tool yeah? | 20:53 |
clarkb | oh possibly | 20:53 |
clarkb | mordred: yes I think it is safe to do so since the cleanup tool checks for unattached volumes >24 hours old iirc | 20:54 |
clarkb | mordred: and I've confirmed these seem to be in that state | 20:54 |
mordred | that is correct | 20:54 |
mordred | clarkb: can you re-check your list? | 20:54 |
clarkb | mordred: they seem to still be there | 20:55 |
mordred | clarkb: yeah. I didn't get prints. lemme see what's up | 20:55 |
clarkb | should I start manually deleting? | 20:55 |
mordred | one sec | 20:56 |
clarkb | ok. I've edited a file with a list of them and can run it through xargs openstack volume delete if we want another option | 20:57 |
clarkb | will wait | 20:57 |
mordred | clarkb: yeah. I don't know why it's not cleaning them | 20:58 |
mordred | go ahead | 20:58 |
clarkb | k | 20:58 |
clarkb | its doing them serially and taking a couple seconds each so may be a minute or two | 21:00 |
clarkb | but it is going | 21:00 |
mordred | cool | 21:00 |
mordred | clarkb: oh - I think my script didn't do it because they're already unattached volumes | 21:00 |
clarkb | ah | 21:00 |
mordred | not volumes reporting being attached bogusly | 21:01 |
mordred | so - yay script doing what it's supposed to! | 21:01 |
clarkb | down to 4 images in zk now | 21:01 |
clarkb | from 6 | 21:01 |
clarkb | there is one that is much older than the others that we might have less luck cleaning | 21:01 |
clarkb | but if we can get it down to one, check logs for a single uuid is better than 6 | 21:01 |
clarkb | down to 2 now | 21:03 |
clarkb | ok I don't think that bionic image will delete because its used by 3 volumes that refuse to delete | 21:05 |
clarkb | the opensuse image isn't deleting because volume list claims it is in use by a server volume (not unattached) | 21:05 |
clarkb | | 303ed29e-3c06-4738-a0bd-e2f0eb50991c | | in-use | 80 | Attached to opensuse-15-vexxhost-sjc1-0014437332 on /dev/vda | | 21:06 |
clarkb | I'm checking to see if that is a held node | 21:06 |
clarkb | | 0014437332 | vexxhost-sjc1 | opensuse-15 | d2d73e84-d988-4605-a596-b0ddef9b2b23 | 38.108.68.90 | 2604:e100:3:0:f816:3eff:fe52:b724 | deleting | 00:00:02:34 | locked | | 21:07 |
corvus | that seems to be a recently deleting node... | 21:07 |
clarkb | ya we can probably be patient with it assuming that server deletes | 21:07 |
clarkb | if it doesn't delete then it may be in a similar situation to the bionic images where it was attached to servers that refuse to delete which causes a chain reaction of undeletable resources | 21:07 |
corvus | it is an opensuse-15 node, that does seem likely | 21:08 |
clarkb | c5b3b55a-4c74-4d41-998c-265342ab3afc and c10176f9-56a3-4749-a5dc-44ab56ec3771 are the images that are safe for new builder to delete if it comes to that | 21:08 |
corvus | well, how about we go ahead and fire it up again | 21:08 |
clarkb | I'm ok with that | 21:08 |
corvus | hopefully we can deal with those 2 errors :/ | 21:08 |
corvus | ok, here goes | 21:09 |
corvus | failing the build again | 21:09 |
clarkb | 2020-03-13 21:09:53.958 | mount: /opt/dib_tmp/dib_build.s3OQSzgg/mnt/proc: permission denied. | 21:10 |
clarkb | I think fs perms are ok, is that a caps issue with procfs? | 21:11 |
clarkb | I wonder if the ci of this is using docker instead of podman and we are hitting behavior differences there | 21:12 |
corvus | what kind of testing has this undergone? | 21:12 |
corvus | what ci? | 21:12 |
clarkb | corvus: there is a full on integration job similar to the older nodepool job that runs it outside of a container. I'm trying to pull it up now | 21:12 |
corvus | right, i'm curious where "run a nodepool-builder which runs dib inside a podman container" has been tested | 21:13 |
corvus | (or even in a docker container) | 21:13 |
clarkb | I know there was something its what we set up the sibling container stuff for so we could use glean and stuff from source in containers | 21:15 |
clarkb | now just trying to sort out where it ended up | 21:15 |
clarkb | (but that may have been docker not podman) | 21:15 |
mordred | yeah. may have been | 21:15 |
mordred | in fact - probably was | 21:15 |
mordred | so maybe this is a good reason to use docker not podman - at least until such a time as we have podman-based gate testing | 21:16 |
corvus | we should "test like production" | 21:16 |
clarkb | nodepool-functional-container* | 21:17 |
clarkb | https://zuul.opendev.org/t/zuul/build/459f34fe1c93447c8353fe43a88e81b6 is a semi recent run | 21:18 |
clarkb | and ya it is using docker | 21:18 |
clarkb | https://review.opendev.org/#/c/698818/6/playbooks/nodepool-functional-container-openstack/templates/docker-compose.yaml.j2 shows the compose file | 21:19 |
clarkb | ok more investigating has been done. that mnt/proc path may be owned by root | 21:20 |
clarkb | its readable by not root though | 21:20 |
clarkb | but dib is trying to mount a thing there | 21:21 |
clarkb | | + /opt/dib_tmp/dib_build.w2ztziu9/hooks/root.d/08-yum-chroot:main:239 : sudo mount -t proc none /opt/dib_tmp/dib_build.w2ztziu9/mnt/proc | 21:21 |
clarkb | and that will require root which it probably doesn't have? | 21:22 |
clarkb | will docker run the container processes as root maybe? | 21:22 |
clarkb | (and podman does not) | 21:22 |
fungi | okay, stir fry has been produced, consumed and then cleaned up. skimming to see where i can be of help | 21:22 |
mordred | clarkb: the container as it is now is supposed to be running as nodepool and that nodepool is supposed to have sudo access | 21:23 |
corvus | clarkb: no, the nodepool dockerfile says run as the nodepool user | 21:23 |
corvus | mordred: i don't see evidence of sudo access | 21:23 |
mordred | corvus: there's a line adding a sudoers file that I deleted in the revert patch ... one sec | 21:23 |
clarkb | drwxr-xr-x 2 root root 4096 Mar 13 21:21 proc | 21:23 |
clarkb | that is what it looks like from outside of the container | 21:23 |
corvus | mordred: when i run "sudo" in "podman run" i get a password prompt | 21:24 |
mordred | https://opendev.org/zuul/nodepool/src/branch/master/Dockerfile#L55-L56 | 21:24 |
mordred | corvus: that's in the nodepool-builder image specifically | 21:24 |
corvus | mordred: yeah, that's what i'm running | 21:25 |
mordred | corvus: so we should maybe update that script to use that and not nodepool-builder | 21:25 |
mordred | yeah? | 21:25 |
mordred | awesome | 21:25 |
corvus | mordred: oh, nope, sorry | 21:25 |
corvus | mordred: nm, sudo works | 21:25 |
corvus | i don't understand dib, so i'm not the person to take point on fixing this. | 21:26 |
clarkb | then I'm stumped why this doesn't work because sudo is used in the element (and the dirs on the line above created with sudo are created) | 21:26 |
clarkb | corvus: basically its trying to create a procfs because tools need it | 21:26 |
clarkb | it does that in /opt/dib_tmp/$build_dir/proc so that it can be chrooted into and separate from the host's procfs aiui | 21:27 |
clarkb | http://paste.openstack.org/show/790686/ we see sudo succeed at creating the dirs on the first line | 21:28 |
clarkb | but then line 3 fails due to "mount: /opt/dib_tmp/dib_build.JnvNFPzW/mnt/proc: permission denied." | 21:28 |
mordred | we're setting privileged: true in the compose file - so I'd expect it to not be a procfs thing | 21:28 |
clarkb | I'm guessing this is a capabilities thing since fs stuff looks fine | 21:28 |
clarkb | mordred: maybe privileged means less privileged on podman than on docker? | 21:28 |
corvus | we're throwing this host away anyway, right? should we just apt-get install docker and see if it works? | 21:29 |
clarkb | corvus: ya we could try that I suppose. | 21:29 |
clarkb | (since testing says that should work) | 21:30 |
corvus | mordred: ? | 21:30 |
mordred | yeah | 21:30 |
corvus | k, will do | 21:30 |
mordred | I think bionic has new enough that we don't need to bother doing the upstream repo | 21:30 |
mnaser | i am just jumping in this and not reading scrollback | 21:30 |
mnaser | (yet) | 21:30 |
mnaser | but we use cri-o in prod with k8s for nodepool builder | 21:30 |
mnaser | and our image builds have been ok, if that helps signal anything. | 21:31 |
mordred | cool. so it may just be a settings thing | 21:31 |
mnaser | i _really_ remember running into a similar issue for that mount thing | 21:31 |
corvus | yeah that does seem to suggest there should be a route to getting it working with podman | 21:31 |
* mnaser looks at the helm charts | 21:31 | |
mordred | yeah. it would be more fun to figure out if we weren't unexpectedly doing so when it's more important :) | 21:32 |
mnaser | https://opendev.org/zuul/zuul-helm/src/branch/master/charts/nodepool/templates/builder/statefulset.yaml | 21:32 |
mnaser | ok so | 21:32 |
mnaser | i mount /dev into the actual container. i don't remember why, a note there would be nice. | 21:32 |
mnaser | i think its for losetup things | 21:32 |
corvus | we do not have that in our compose file | 21:32 |
mordred | corvus: do you think it's quicker to try adding /dev to the volume mount list real quick? | 21:33 |
mnaser | that might be something that fails much later on though | 21:33 |
clarkb | corvus: its also not in the testing compose file | 21:33 |
mnaser | but .. i remember very much needing it | 21:33 |
corvus | either one at this point :) i have installed docker | 21:33 |
mordred | corvus: you're driving - I defer to you on which you want to try first | 21:33 |
corvus | i'm happy to throw /dev in there, see if it works with podman, then try docker without /dev, then try docker with /dev | 21:33 |
corvus | i'll do that. test cycle should be fast. | 21:33 |
mordred | corvus: let's do that | 21:34 |
corvus | - /dev:/dev:rw | 21:34 |
corvus | just like that? | 21:34 |
mordred | corvus: let's say yes! | 21:34 |
mnaser | seems like that translates to roughly what k8s is modeling, yeah | 21:34 |
corvus | it failed, i'm trying to find the error | 21:35 |
corvus | mount: /opt/dib_tmp/dib_build.xUudni10/mnt/proc: permission denied. | 21:36 |
mordred | docker it is! | 21:36 |
corvus | is dim_tmp bind-mounted in the testing? | 21:36 |
corvus | dib_tmp | 21:36 |
corvus | or is it a straight-up volume? | 21:36 |
clarkb | corvus: good question, no | 21:36 |
clarkb | its not even that, it may be using /tmp | 21:37 |
corvus | sigh | 21:37 |
corvus | this isn't going to work in docker either | 21:37 |
clarkb | dib's default is to use the regular /tmp implementation. We have to use something else because our images are too big for that | 21:37 |
clarkb | (that is where /opt/dib_tmp comes from) | 21:38 |
corvus | i guess i'll do the docker test for completeness | 21:38 |
corvus | but i'm not optimistic | 21:38 |
mordred | corvus: maybe override user and run the container as root | 21:38 |
mordred | rather than with the USER nodepool setting from inside the container | 21:39 |
mordred | I don't know why that would have any difference of course | 21:39 |
mordred | grasping at straws | 21:39 |
corvus | mordred: i doubt that's it -- i suspect the problem has something to do with mounting something on a bind mount in a container | 21:39 |
corvus | like, i don't think the mount can propagate up | 21:39 |
clarkb | hrm | 21:39 |
mordred | clarkb: maybe try not passing -v /opt/dib_tmp ? | 21:40 |
clarkb | mordred: if our / is big enough to support that it may work | 21:40 |
corvus | and hope that's enough for 1 image? | 21:40 |
clarkb | /dev/xvda1 39G 3.4G 36G 9% / | 21:40 |
clarkb | it will be close | 21:40 |
clarkb | but I think that may be enough for a single image | 21:40 |
mordred | yeah - it might work | 21:40 |
mordred | then - when we come back to this - we can set up docker/podman to put its container storage directly in /opt perhaps | 21:41 |
mordred | but that'll be for when we're sorting this out properly in the first place | 21:41 |
clarkb | (fwiw I thought mounts were effectively flat in the kernel, and then the mount points give us an illusion of nesting, however cgroups may have completely changed that I guess) | 21:41 |
corvus | i think it may be succeeding under docker | 21:42 |
mordred | cool | 21:42 |
clarkb | ya fedora-30-0000000549.log is showing it doing package stuff which implies it got further than the /proc mount | 21:42 |
corvus | (current run is docker-compose up without /dev mounted) | 21:42 |
clarkb | yay and interesting podman difference for the supposedly compatible tool :) | 21:43 |
mordred | cool so maybe docker is doing mount propagation differently | 21:43 |
corvus | clarkb: it's incompatible except for that one thing, i guess :( | 21:43 |
corvus | now i regret running docker-compose without the -d argument | 21:43 |
corvus | my hubris is why it succeeded | 21:43 |
clarkb | that is probably worth bringing up with the podman folks since I know rhel installs podman as `docker` | 21:43 |
clarkb | corvus: oops | 21:43 |
clarkb | it complains about its hostname not being resolvable but that appears to be a non issue so far (its not like my dib VMs locally ever resolve in dns properly either) | 21:44 |
mordred | clarkb: https://github.com/containers/libpod/blob/master/docs/source/markdown/podman-run.1.md search for mount propagation | 21:44 |
clarkb | mordred: I think we need shared propagation | 21:45 |
clarkb | ? | 21:45 |
mordred | I think it's worth trying if/when we get back to investigation | 21:46 |
clarkb | ya | 21:46 |
clarkb | the build is cloning all the git repos right now | 21:46 |
clarkb | may be a while | 21:46 |
clarkb | mordred: maybe running podman as docker changes those behaviors? | 21:46 |
clarkb | (so is a non issue when running it on rhel8 that way) | 21:46 |
mordred | clarkb: maybe so? I'm sure there's more than one thing to learn here | 21:47 |
mnaser | i think the /dev thing comes into play when losetup happens and tries to mount the qcow2 | 21:48 |
clarkb | mnaser: well we don't mount /dev in our testing either | 21:48 |
* mnaser shrugs at why i ended up needing it | 21:49 | |
clarkb | at this point I'm mostly worried some element that is infra image specific will break rather than the stuff in dib because we test the stuff in dib | 21:49 |
mnaser | should have probably documented that but yeah | 21:49 |
clarkb | mnaser: possibly because docker mounts a /dev by default? | 21:49 |
clarkb | or it does when privileged? | 21:49 |
clarkb | you need things like /dev/random typically | 21:49 |
mnaser | perhaps it was that, or maybe cause cri-o doesn't mount it? i dunno, sorry, can't provide much more useful input other than memory, which is not usually that good :) | 21:50 |
clarkb | mnaser: also if podman is cool with breaking these things why can't it break or add ipv6 support | 21:50 |
clarkb | er sorry that was for mordred | 21:51 |
mnaser | i'll take that one too | 21:51 |
mnaser | :p | 21:51 |
clarkb | Like I get the goal of being compatible if they actually did that, but they haven't (as evidenced here) | 21:51 |
corvus | i have run "CTRL-\" to exit docker-compose without stopping the underlying containers | 21:51 |
mordred | clarkb: sigh | 21:51 |
mordred | corvus: neat! | 21:51 |
clarkb | the dib build is still proceeding fwiw so seems to have worked | 21:51 |
clarkb | I'm going to take a break from watching git clone log lines scroll by and find something to drink, back in a bit | 21:54 |
corvus | clarkb: something to drink, or something to DRINK? cause i'm pretty sure we could all use the latter | 21:55 |
corvus | i will also bbiab | 21:55 |
clarkb | For now just drink :) | 21:55 |
*** DSpider has quit IRC | 22:02 | |
ianw | ... thanks for looking in on this ... it was supposed to be a soft rollout but clearly got a little out of hand | 22:04 |
fungi | corvus: i'm just impressed you can ctrl-\ without killing your xsession | 22:13 |
ianw | none of this is helped by me forgetting to git add nb01.opendev.org.yaml in https://review.opendev.org/712836 ... sigh :( but then it seems we've found the hosts use short-names that collide anyway | 22:15 |
clarkb | ianw ya I think all we want at this point is to get f30 uploaded. then we make nb04.opendev.org | 22:15 |
clarkb | as well as clean up nodepool as necessary in parallel | 22:16 |
corvus | ianw: i think we're a little fuzzy on the contribution of the short-names -- the current system is using a unique short name but also completely duplicated file with all the images | 22:16 |
corvus | at this point, i don't know which of those, or perhaps both, are necessary for this to work | 22:16 |
ianw | the other thing i found, that i was hoping would not be an issue till monday, was that the limestone .pem in the config file is hard-coded to ~nodepool | 22:17 |
corvus | ianw: the other thing is this change is necessary: https://review.opendev.org/712824 and also we either need to run with docker, or explore mount propagation settings for podman | 22:18 |
corvus | ianw: i don't think we've recorded that last bit yet; you might want to jot that in your notes | 22:18 |
ianw | yes, i had clearly over-estimated the podman == docker situation | 22:18 |
ianw | btw the new rpm format was switched in for *f31*; that's what i've been trying to get going | 22:20 |
clarkb | ianw: what stopped the f30 builds then? | 22:20 |
ianw | f30, iirc, started having segfaults building, that, again, iirc, didn't happen with bionic building | 22:20 |
clarkb | ah ok so different issue, but happier on newer platform | 22:20 |
ianw | but, f31 was supposed to be the solution anyway | 22:21 |
fungi | got it, so even if we'd unpaused it on the xenial builders they still wouldn't have produced a f30 image | 22:22 |
ianw | no; and i don't have good story filled out on this :/ which is my own fault | 22:23 |
mordred | ianw: I think we can sort out the limestone thing | 22:23 |
mordred | ianw: it might be a better choice to run the containerized hosts with the config in /etc/openstack and bind-mount that in rather than putting them in /home/nodepool like we have been doing - but we probably have a few things to figure out before we get to that :) | 22:24 |
ianw | mordred: yeah, it will just prevent uploading; could either link or i was thinking it is probably better but a bigger change to move it to /etc all together | 22:24 |
mordred | yah | 22:24 |
ianw | heh, jinx, ... that's why it was a "monday" thing :) | 22:25 |
mordred | yup | 22:25 |
mordred | and - I like making it a self-contained change rather than just part of the puppet>ansible+container | 22:25 |
ianw | 2020-03-13 22:24:14.798 | Couldn't parse 'sudo: unable to resolve host nb01opendev: Name or service not known file /opt/cache/files/sudo: unable to resolve host nb01opendev: Name or service not known sudo: unable to resolve host nb01opendev: Name or service not known' as a source repository | 22:26 |
fungi | well poop | 22:27 |
ianw | oh dear; i guess we somehow look at the result of a "sudo" command, and that message has confused it | 22:27 |
mordred | *headdesk* | 22:27 |
fungi | so whatever we tell docker/podman the hostname is also has to resolve (at least via hostfile?) | 22:27 |
clarkb | fungi: it must be at least via hostfile because I've done builds without proper dns setups on local VMs | 22:28 |
mordred | is there a way to tell sudo to shut up about the host thing? | 22:28 |
clarkb | mordred: there is iirc | 22:28 |
clarkb | of course all the docs around this say just edit /etc/hosts | 22:30 |
ianw | i really have no idea why all this need sudo but it's just been like that @ https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/source-repositories/extra-data.d/98-source-repositories#L157 | 22:31 |
fungi | clarkb: yeah, scoured manpages and haven't turned up an option to disable local host resolution | 22:33 |
clarkb | fungi: I thought it was to quiet the logging not necessarily stop the lookups | 22:34 |
ianw | also, this doesn't happen in gate? | 22:34 |
clarkb | I want to say we tried this with devstack at some point | 22:34 |
clarkb | ianw: no because its just checking /etc/hosts | 22:34 |
clarkb | (I guess we can bind mount that in) | 22:34 |
ianw | i mean the gate test | 22:34 |
ianw | ... ohh, we probably just don't run it; don't cache any repos | 22:35 |
clarkb | ianw: it uses host networking too so the hostname is probably correct | 22:35 |
ianw | yeah; nothing in e.g. https://zuul.opendev.org/t/zuul/build/6130771f463743708b410e7a2647641f/log/nodepool/builds/test-image-0000000001.log | 22:36 |
corvus | oh, i guess if you use host networking docker might not update /etc/hosts on the container? | 22:36 |
clarkb | corvus: thats my guess | 22:37 |
corvus | the contents inside the container match the host; what if we add it to the host /etc/hosts and restart the container? maybe it will copy it? | 22:37 |
clarkb | corvus: seems reasonable | 22:37 |
corvus | i'll do that now | 22:37 |
corvus | yes, it did that | 22:38 |
clarkb | this time around should be quicker due to caching | 22:38 |
corvus | it's running build 551 now | 22:38 |
clarkb | I don't see sudo warnings after sudo commands | 22:39 |
ianw | we can also iterate faster if we want to manually stop the caching with an override | 22:40 |
ianw | https://opendev.org/openstack/project-config/src/branch/master/tools/build-image.sh#L77 | 22:41 |
clarkb | ianw: I think we've cached the bulk of them now | 22:41 |
clarkb | ianw: so it should just update them at this point (and be much quicker) | 22:41 |
ianw | yeah, much is relative :) | 22:41 |
clarkb | its 1/4 done now :) | 22:44 |
ianw | i'm adding stories to https://storyboard.openstack.org/#!/story/2007407 | 22:49 |
clarkb | thanks | 22:49 |
ianw | converting this to nb04 after seems to avoid any collision issues, i'll put that in | 22:49 |
ianw | do we want to fully investigate podman before that, or convert to docker? | 22:49 |
ianw | (all of this is me just following mordred ... at the time of the gate tests we were using docker, then when i wrote the production deployment i switched to podman because that's what gerrit was using now :) | 22:50 |
clarkb | ianw: I'm fine with docker honestly. But mordred did link to the podman docs on this | 22:50 |
clarkb | ianw: https://github.com/containers/libpod/blob/master/docs/source/markdown/podman-run.1.md search mount propagation | 22:51 |
clarkb | 2020-03-13 22:52:11.795 | Couldn't parse 'E: Unable to locate package lsb-release file /opt/cache/files/E: Unable to locate package lsb-release E: Unable to locate package lsb-release' as a source repository | 22:53 |
clarkb | I'm guessing that means lsb_release isn't working properly | 22:54 |
clarkb | and it is trying to cache distro packages? so relies on that info | 22:54 |
ianw | E: ... is that from apt? | 22:55 |
clarkb | or yum? | 22:55 |
ianw | sorry, yeah pkg manager ... that seems like a weird place to get that | 22:55 |
clarkb | 2020-03-13 22:52:11.775 | Getting /opt/dib_cache/source-repositories/repositories_flock: Fri Mar 13 22:52:11 UTC 2020 for /opt/dib_tmp/dib_build.IoVQmc68/hooks/source-repository-images | 22:56 |
clarkb | is the thing before | 22:56 |
clarkb | https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/cache-devstack/extra-data.d/55-cache-devstack-repos from there maybe? | 22:57 |
clarkb | oh that script generates the list then cache-url consumes it | 22:59 |
clarkb | I think we may be generating invalid image list | 23:01 |
clarkb | possibly because lsb-release doesn't exist | 23:01 |
clarkb | https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/cache-devstack/extra-data.d/55-cache-devstack-repos#L84 that script (I'm pulling it up now to see what it does and if it expects lsb to be there) | 23:02 |
ianw | our gate testing build of f30 for comparison https://zuul.opendev.org/t/openstack/build/879fcc41186d4a55b8ed4b6f561e2909/log/nodepool/builds/test-image-0000000001.log | 23:04 |
clarkb | ya I think that is it. That script sources devstack/functions which sources devstack/functions-common which attempts to install lsb-release if it does not exist | 23:04 |
clarkb | and that is coming from apt | 23:04 |
clarkb | whats odd is that it would fail though | 23:05 |
clarkb | (I would expect it to install, but it doesn't possibly because we clean things up on the image enough that it can't do a package install without an update first?) | 23:05 |
ianw | that sounds highly likely | 23:06 |
clarkb | ianw: https://opendev.org/openstack/devstack/src/branch/master/functions-common#L321-L338 is the underlying source of that though I think | 23:06 |
ianw | it seems like lsb-release could be a bindep of dib | 23:06 |
clarkb | except we don't even need it for this functionality | 23:07 |
ianw | lsb-release [platform:dpkg] | 23:07 |
clarkb | (the image list generation doesn't need to know what distro it is running on) | 23:07 |
ianw | it is actually | 23:07 |
clarkb | only as a side effect I think | 23:07 |
ianw | so that should only trigger if lsb_release isn't there, and it should be there from bindep.txt | 23:08 |
clarkb | do we install dib's bindep? | 23:09 |
clarkb | thats what the sibling stuff should get us? | 23:09 |
ianw | wait, "command lsb_release" actually runs it, right | 23:10 |
ianw | oh no, it's "-v" | 23:10 |
ianw | clarkb: yes, i think that dib's bindep should be installed by the container build | 23:11 |
clarkb | thinking out loud here we could edit /opt/project-config to stop caching images | 23:12 |
clarkb | (just to see if anything else will break) | 23:12 |
clarkb | corvus: mordred fungi ^ any opinions on that? | 23:12 |
ianw | ohhh, actually https://zuul.opendev.org/t/zuul/build/d3ffa91e9f8d4fbea364203a864a054a/log/job-output.txt | 23:15 |
ianw | it doesn't install dib bindep.txt ... i remember we talked about this https://opendev.org/zuul/nodepool/src/branch/master/Dockerfile#L66 | 23:16 |
clarkb | whats the difference between lsb-base and lsb-release? | 23:16 |
fungi | what's broken in the current image list? | 23:16 |
ianw | i think we need to add it @ https://opendev.org/zuul/nodepool/src/branch/master/Dockerfile#L66 | 23:16 |
fungi | lsb-base is a set of "standard" packages which make a given distro meet the lsb minimum requirements | 23:17 |
corvus | catching up | 23:17 |
fungi | lsb-release is a tool to tell you what distro you're on and various other bits needed to gauge lsb compatibility | 23:17 |
clarkb | fungi: we need lsb-release on our image to make https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/cache-devstack/extra-data.d/55-cache-devstack-repos#L84 happy | 23:17 |
clarkb | corvus: ^ | 23:17 |
fungi | the lsb and thus lsb-base are effectively dead since years | 23:18 |
clarkb | *on our nodepool-builder image | 23:18 |
fungi | lsb-release as a relatively commonly found utility to tell you what distro/release your on has survived | 23:18 |
fungi | s/your/you're | 23:19 |
fungi | / | 23:19 |
ianw | clarkb: it's emergency-ed right? i think we can skip caching just to see if it gets a .qcow2 out | 23:19 |
clarkb | ianw: yes it is emergency'd | 23:19 |
clarkb | ianw: and ya I think that is probably the next best step now. Should we just disable image caching? | 23:19 |
corvus | yeah, i don't think this is used for devstack tests? | 23:19 |
corvus | so shouldn't be a big deal | 23:20 |
corvus | this==f30 | 23:20 |
ianw | umm, it will be, but at this point that's minor concern, it's non-voting | 23:20 |
clarkb | I can comment out https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/cache-devstack/extra-data.d/55-cache-devstack-repos#L123-L145 and add a pass? | 23:20 |
clarkb | or rm the file entirely | 23:20 |
clarkb | in /opt/project-config on the server | 23:20 |
fungi | well, also devstack will just download those images if they're not cached | 23:20 |
corvus | i didn't see it in a codesearch | 23:21 |
corvus | but it's there, i stand corrected | 23:22 |
clarkb | oh we can just remove the cache-devstack element | 23:22 |
corvus | it's not hard to update the image if we want | 23:22 |
clarkb | that's cleaner and nicer | 23:22 |
corvus | we can just exec a shell, install the package and docker commit | 23:23 |
corvus | as long as we're just plowing through the punch list to get something working | 23:23 |
ianw | i think it's actually more just the "apt-get update" that's required | 23:23 |
clarkb | corvus: well dib is actually trying to install that but it is failing (so we'd have to debug that too) | 23:23 |
clarkb | corvus: the error we get is from the attempted install :) | 23:23 |
ianw | yeah, i think that's just because the container has had all its metadata purged | 23:23 |
corvus | dib is trying to install packages on the system it's running on? | 23:24 |
clarkb | corvus: by way of devstack :/ | 23:25 |
ianw | ... i'm not saying it's right, but it is | 23:25 |
clarkb | corvus: so not really dib, but the dib element that calls into devstack which then side effects | 23:25 |
corvus | i am now in favor of removing the image caching element everywhere | 23:25 |
corvus | that is really uncool | 23:25 |
clarkb | corvus: https://opendev.org/openstack/devstack/src/branch/master/functions-common#L321-L338 there | 23:25 |
corvus | and unsafe | 23:25 |
corvus | that was never supposed to happen | 23:25 |
ianw | well, that package *is* part of bindep, so if we had fully installed dib's bindep we probably wouldn't notice | 23:26 |
corvus | (i mean, we just gave devstack root on the entire control plane) | 23:26 |
corvus | (i'm not exaggerating -- you can use our image builds to eventually jump to any host we manage) | 23:26 |
clarkb | and yes, from memory it wasn't supposed to, way back when sean added that | 23:26 |
corvus | well, i added the image caching but sure | 23:27 |
fungi | not *just* as we've been running a script from it during image builds since, what, 2014? | 23:27 |
clarkb | corvus: well the original version was a naive scan (which was super safe) | 23:27 |
corvus | clarkb: yep | 23:27 |
clarkb | then sean wrote a thing in devstack to list the images non naively | 23:27 |
corvus | and there was a reason for that | 23:27 |
clarkb | and I'm pretty sure the original version of that was also safe | 23:27 |
clarkb | (but I could be wrong) | 23:27 |
clarkb | switching to caching a cirros or three is probably sane at this point | 23:28 |
fungi | sean's implementation that i remember parsed the devstack scripts to find urls | 23:28 |
clarkb | trove and the weird container stuff that was going on are basically EOL | 23:28 |
clarkb | so what we end up with is "what cirros images do we care about" | 23:28 |
corvus | i have to go now. my vote is to disable the caching element globally and add anything we want in a static element. | 23:29 |
clarkb | (and I think we can just list those) | 23:29 |
fungi | i do think maintaining a static list of images at this point is probably low-effort. devstack has rarely added/removed/updated its own image set for ages | 23:29 |
fungi | but also i agree, granting the devstack project a root access backdoor on all nodes through the image build process is not good under our present model | 23:30 |
ianw | i've filed https://storyboard.openstack.org/#!/story/2007407 task #39066 | 23:32 |
ianw | clarkb: so are you removing cache-devstack? | 23:34 |
clarkb | ianw: I got sidetracked thinking about rewriting cache-devstack :) | 23:36 |
clarkb | I'll rm it from /etc/nodepool/nodepool.yaml now | 23:36 |
clarkb | done | 23:36 |
ianw | for right now, do you want to apt-get update in the container and just see if it passes? | 23:38 |
clarkb | I probably won't; I think my time is better spent updating cache-devstack at the moment | 23:39 |
ianw | docker exec 4f141126d67d sudo apt-get update ... i did that ... let's see if this build goes | 23:39 |
ianw | clarkb: #39066 assigned to you :) | 23:40 |
openstackgerrit | Mohammed Naser proposed openstack/project-config master: add vexxhost/openstack-operator https://review.opendev.org/713080 | 23:48 |
ianw | ... ok, i think it got a little further, it's still going | 23:50 |
ianw | it's at everyone's favourite pip-and-virtualenv | 23:54 |
* fungi can't wait to see that gone | 23:55 | |
ianw | it's making the image! yay | 23:55 |
openstackgerrit | Clark Boylan proposed openstack/project-config master: Statically cache devstack images and packages https://review.opendev.org/713081 | 23:55 |
clarkb | ianw: fungi corvus mordred ^ that should be a safe version and largely backward compatible with the current set of cached stuff | 23:56 |
clarkb | I decided to punt on arm64 for now | 23:56 |
ianw | clarkb: i was thinking perhaps devstack should keep that static list? i'm not sure anyone contributing changes there would know to update it, at least it would come up in a grep in the local source tree? | 23:57 |
clarkb | ianw: ya we could do that as an improvement. I was basically trying to get a simple thing done that would work for now | 23:58 |
clarkb | though one issue is architecture differences | 23:58 |
clarkb | if that is dynamic and in devstack we'd potentially have the same problem all over again | 23:58 |
clarkb | (also the etcd caching is a bit annoying because as far as I know basically nothing is using it) | 23:59 |
clarkb | (basically I acknowledge that I've punted on a few things including arch and user accessibility, but for short term this should be a good change and we can figure out longer term solutions to those problems?) | 23:59 |
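
The static-list approach discussed above (a fixed set of image URLs instead of calling into devstack during the build) could look roughly like the sketch below. This is only an illustration, not the content of change 713081: `CACHE_DIR`, `STATIC_IMAGES`, the function names, and the `example.org` URLs are all hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical cache location and image list; the URLs are placeholders,
# not the real set of cirros images.
CACHE_DIR="${CACHE_DIR:-/tmp/image-cache}"
STATIC_IMAGES="
https://example.org/images/cirros-a-x86_64-disk.img
https://example.org/images/cirros-b-x86_64-disk.img
"

# Map a URL to the path of its file inside the cache directory.
cache_path() {
    printf '%s/%s\n' "$CACHE_DIR" "${1##*/}"
}

# Print every URL from the static list that is not cached yet.
missing_images() {
    for url in $STATIC_IMAGES; do
        [ -f "$(cache_path "$url")" ] || printf '%s\n' "$url"
    done
}

# Download the missing images. Only the fixed URLs above are fetched;
# no script from the devstack tree is executed on the builder.
fetch_missing() {
    mkdir -p "$CACHE_DIR"
    missing_images | while read -r url; do
        curl --fail --location --silent --show-error \
            --output "$(cache_path "$url")" "$url"
    done
}
```

Calling `fetch_missing` from an element script would populate the cache. The trade-off raised in the discussion still applies: someone has to update the list by hand when devstack changes its images, and per-architecture lists would need separate handling.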
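
The stopgap mentioned in the log (exec into the running builder container, refresh the package metadata, install the missing package, then `docker commit`) can be sketched as below. It assumes a Debian-based container running as root, so `sudo` is left out; the `DOCKER` variable, the `patch_container` function, and the image tag in the usage note are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Allow the docker binary to be overridden (for example with "echo" for a dry run).
DOCKER="${DOCKER:-docker}"

# patch_container CONTAINER PACKAGE IMAGE
# Refresh apt metadata inside a running container, install one package,
# and save the result as a new image. Stops at the first failing step.
patch_container() {
    container="$1"
    package="$2"
    image="$3"
    "$DOCKER" exec "$container" apt-get update &&
    "$DOCKER" exec "$container" apt-get install -y "$package" &&
    "$DOCKER" commit "$container" "$image"
}
```

For the container from the log this might be invoked as `patch_container 4f141126d67d lsb-release nodepool-builder:patched`, where the tag is made up. A committed image like this is a one-off; the durable fix is in the Dockerfile or in dib's bindep handling, as discussed above.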