*** stevebaker has joined #openstack-infra | 00:01 | |
*** mattw4 has quit IRC | 00:04 | |
*** dchen has joined #openstack-infra | 00:09 | |
openstackgerrit | Mohammed Naser proposed zuul/zuul-helm master: Test helm charts against k8s https://review.opendev.org/701764 | 00:20 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-helm master: Test helm charts against k8s https://review.opendev.org/701764 | 00:27 |
*** hrw has joined #openstack-infra | 00:33 | |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: collect-container-logs: add role https://review.opendev.org/701867 | 00:34 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-helm master: Test helm charts against k8s https://review.opendev.org/701764 | 00:36 |
*** zhurong has quit IRC | 00:36 | |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: collect-container-logs: add role https://review.opendev.org/701867 | 00:38 |
*** tetsuro has joined #openstack-infra | 00:39 | |
openstackgerrit | Mohammed Naser proposed zuul/zuul-registry master: Switch to collect-container-logs https://review.opendev.org/701868 | 00:39 |
openstackgerrit | Mohammed Naser proposed zuul/nodepool master: Switch to collect-container-logs https://review.opendev.org/701869 | 00:42 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-registry master: Switch to collect-container-logs https://review.opendev.org/701868 | 00:42 |
openstackgerrit | Mohammed Naser proposed opendev/system-config master: Switch to collect-container-logs https://review.opendev.org/701870 | 00:47 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: collect-container-logs: add role https://review.opendev.org/701867 | 00:52 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-helm master: Test helm charts against k8s https://review.opendev.org/701764 | 00:52 |
*** eandersson has joined #openstack-infra | 00:54 | |
eandersson | stackalytics.com cert expired? | 00:54 |
fungi | eandersson: so we've heard | 00:56 |
fungi | we don't run it, and no clue who to reach out to at mirantis | 00:56 |
fungi | (we offered to run it more than once in the past) | 00:57 |
eandersson | Hopefully someone that cares enough to fix it :p | 00:57 |
mnaser | infra-root: i think one of the executors might have issues with log streaming, as i'm seeing occasional "--- END OF STREAM ---" on jobs that are clearly running and eventually report a result | 00:58 |
mnaser | example: http://zuul.opendev.org/t/zuul/stream/bf5d120011d448c8baedcce26d0b31d0?logfile=console.log | 00:59 |
mnaser | according to the api, ze05 is the one running that job | 00:59 |
fungi | mnaser: it happens frequently that we go over memory on them and the oom-killer decides the output streamer would be a good thing to arbitrarily kill | 01:04 |
fungi | i'll take a look | 01:04 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: helm-template: Add role to run 'helm template' https://review.opendev.org/701871 | 01:06 |
fungi | the following executors need restarts to get their output streamers going again: 02,03,04,05,12 | 01:07 |
fungi | so almost half | 01:07 |
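The check fungi did by hand could be scripted roughly like this: probe each executor's finger log-streaming port (7900 is the zuul-executor default; the hostnames, domain, and exact fleet are assumptions for illustration).

```shell
# Sketch: flag executors whose log streamer is not accepting connections.
check_streamer() {
    # returns 0 if something is listening on host:port
    timeout 3 bash -c ">/dev/tcp/$1/$2" 2>/dev/null
}

for ze in ze01 ze02 ze03 ze04 ze05; do
    if ! check_streamer "${ze}.openstack.org" 7900; then
        echo "${ze}: log streamer down"
    fi
done
```

A "down" result here usually means the streamer process was killed (e.g. by the oom-killer, as discussed below) while the executor itself kept running.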
openstackgerrit | Mohammed Naser proposed zuul/zuul-helm master: Test helm charts against k8s https://review.opendev.org/701764 | 01:07 |
fungi | i'll poke at them for a bit | 01:07 |
fungi | things look pretty quiet, so i can just restart them all at the same time and let the other 7 handle the load in the interim | 01:09 |
*** zbr|rover has quit IRC | 01:11 | |
*** HenryG has quit IRC | 01:11 | |
*** HenryG has joined #openstack-infra | 01:12 | |
openstackgerrit | Mohammed Naser proposed zuul/zuul-helm master: Test helm charts against k8s https://review.opendev.org/701764 | 01:14 |
*** zhurong has joined #openstack-infra | 01:15 | |
*** roman_g has quit IRC | 01:15 | |
*** zbr has joined #openstack-infra | 01:19 | |
*** zbr has quit IRC | 01:24 | |
*** zbr has joined #openstack-infra | 01:25 | |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: apply-helm-charts: Job to apply Helm charts https://review.opendev.org/701874 | 01:29 |
clarkb | we have LE certs on zuul.o.o now | 01:31 |
clarkb | I'll merge the change to start using those certs first thing tomorrow morning | 01:31 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-helm master: Test helm charts against k8s https://review.opendev.org/701764 | 01:31 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: helm-template: Add role to run 'helm template' https://review.opendev.org/701871 | 01:40 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: apply-helm-charts: Job to apply Helm charts https://review.opendev.org/701874 | 01:40 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-helm master: Test helm charts against k8s https://review.opendev.org/701764 | 01:40 |
*** Lucas_Gray has quit IRC | 01:43 | |
openstackgerrit | Mohammed Naser proposed zuul/zuul-helm master: Test helm charts against k8s https://review.opendev.org/701764 | 01:47 |
*** ricolin_ has joined #openstack-infra | 01:50 | |
*** zbr_ has joined #openstack-infra | 02:15 | |
*** zbr has quit IRC | 02:16 | |
*** zbr_ has quit IRC | 02:17 | |
*** gyee has quit IRC | 02:23 | |
*** rh-jelabarre has joined #openstack-infra | 02:26 | |
*** zbr has joined #openstack-infra | 02:28 | |
*** zxiiro has quit IRC | 02:35 | |
fungi | #status log restarted zuul-executor service on ze02,03,04,05,12 to get log streamers running again after oom-killer got them; had to clear stale pidfile on zm04 | 02:41 |
openstackstatus | fungi: finished logging | 02:41 |
*** rh-jelabarre has quit IRC | 02:44 | |
*** rlandy has quit IRC | 02:55 | |
mnaser | thank you for taking care of it fungi | 02:56 |
fungi | no problem | 02:59 |
*** apetrich has quit IRC | 03:12 | |
*** ricolin_ has quit IRC | 03:18 | |
*** ricolin has joined #openstack-infra | 03:19 | |
*** psachin has joined #openstack-infra | 03:22 | |
*** armax has quit IRC | 03:41 | |
*** ykarel|away has joined #openstack-infra | 04:22 | |
*** hwoarang has quit IRC | 04:24 | |
*** hwoarang has joined #openstack-infra | 04:26 | |
*** stevebaker has quit IRC | 04:41 | |
*** tetsuro has quit IRC | 04:44 | |
*** tetsuro has joined #openstack-infra | 04:45 | |
*** tetsuro has quit IRC | 04:49 | |
hrw | fungi: thanks! | 04:54 |
*** surpatil has joined #openstack-infra | 05:00 | |
*** factor has quit IRC | 05:08 | |
*** factor has joined #openstack-infra | 05:08 | |
*** ykarel has joined #openstack-infra | 05:18 | |
*** ykarel|away has quit IRC | 05:20 | |
*** tkajinam has quit IRC | 05:26 | |
*** tkajinam has joined #openstack-infra | 05:29 | |
*** goldyfruit has quit IRC | 05:30 | |
*** goldyfruit has joined #openstack-infra | 05:30 | |
*** evrardjp has quit IRC | 05:33 | |
*** evrardjp has joined #openstack-infra | 05:34 | |
*** ykarel_ has joined #openstack-infra | 05:35 | |
*** kjackal has joined #openstack-infra | 05:35 | |
*** bdodd has joined #openstack-infra | 05:37 | |
*** ykarel has quit IRC | 05:38 | |
*** exsdev has quit IRC | 05:44 | |
*** tetsuro has joined #openstack-infra | 05:45 | |
*** exsdev has joined #openstack-infra | 05:46 | |
*** exsdev has quit IRC | 05:48 | |
*** tetsuro has quit IRC | 05:49 | |
*** tetsuro has joined #openstack-infra | 05:53 | |
*** tetsuro has quit IRC | 05:57 | |
*** tetsuro_ has joined #openstack-infra | 05:57 | |
*** tkajinam_ has joined #openstack-infra | 06:02 | |
*** exsdev has joined #openstack-infra | 06:02 | |
*** tkajinam has quit IRC | 06:04 | |
*** lpetrut has joined #openstack-infra | 06:08 | |
*** lpetrut has quit IRC | 06:09 | |
*** lpetrut has joined #openstack-infra | 06:10 | |
*** kjackal has quit IRC | 06:11 | |
*** lmiccini has joined #openstack-infra | 06:42 | |
*** ykarel_ is now known as ykarel | 07:03 | |
*** exsdev has quit IRC | 07:09 | |
*** rcernin has quit IRC | 07:15 | |
*** slaweq has joined #openstack-infra | 07:18 | |
*** exsdev has joined #openstack-infra | 07:23 | |
*** pgaxatte has joined #openstack-infra | 07:24 | |
*** dpawlik has joined #openstack-infra | 07:41 | |
*** iurygregory has joined #openstack-infra | 07:42 | |
*** rpittau|afk is now known as rpittau | 07:44 | |
*** kjackal has joined #openstack-infra | 07:45 | |
*** pcaruana has joined #openstack-infra | 07:53 | |
*** kjackal has quit IRC | 07:57 | |
*** ykarel is now known as ykarel|lunch | 07:57 | |
hrw | fungi: kolla-build-ubuntu-source-aarch64 SUCCESS in 1h 48m 13s (non-voting) | 07:58 |
hrw | fungi: thanks again | 07:58 |
*** jtomasek has joined #openstack-infra | 08:09 | |
*** tetsuro_ has quit IRC | 08:10 | |
*** gfidente|afk is now known as gfidente | 08:11 | |
*** tosky has joined #openstack-infra | 08:12 | |
*** dchen has quit IRC | 08:16 | |
*** iurygregory has quit IRC | 08:17 | |
*** tesseract has joined #openstack-infra | 08:22 | |
*** tkajinam_ has quit IRC | 08:22 | |
*** fdegir has quit IRC | 08:22 | |
*** fdegir has joined #openstack-infra | 08:23 | |
*** pkopec has joined #openstack-infra | 08:23 | |
*** pkopec has quit IRC | 08:23 | |
*** kjackal has joined #openstack-infra | 08:29 | |
*** iurygregory has joined #openstack-infra | 08:33 | |
*** dpawlik has quit IRC | 08:39 | |
*** dpawlik has joined #openstack-infra | 08:45 | |
*** pcaruana has quit IRC | 08:46 | |
*** harlowja has quit IRC | 08:48 | |
*** xek__ has joined #openstack-infra | 08:49 | |
*** factor has quit IRC | 08:50 | |
*** factor has joined #openstack-infra | 08:51 | |
*** harlowja has joined #openstack-infra | 08:51 | |
*** pcaruana has joined #openstack-infra | 08:53 | |
*** jpena|off is now known as jpena | 08:54 | |
*** ralonsoh has joined #openstack-infra | 08:56 | |
*** ykarel|lunch is now known as ykarel | 09:04 | |
*** gibi has left #openstack-infra | 09:06 | |
*** zbr is now known as zbr|rover | 09:19 | |
*** lucasagomes has joined #openstack-infra | 09:22 | |
*** apetrich has joined #openstack-infra | 09:28 | |
*** derekh has joined #openstack-infra | 09:35 | |
*** ociuhandu has joined #openstack-infra | 09:48 | |
*** apetrich has quit IRC | 10:04 | |
*** dtantsur|afk is now known as dtantsur | 10:05 | |
*** ykarel is now known as ykarel|afk | 10:11 | |
*** apetrich has joined #openstack-infra | 10:13 | |
*** ykarel|afk is now known as ykarel | 10:35 | |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: [WIP] Docker compose example: add keycloak authentication https://review.opendev.org/664813 | 10:41 |
*** hrw has left #openstack-infra | 10:52 | |
*** aedc has joined #openstack-infra | 10:58 | |
*** aedc has quit IRC | 11:03 | |
*** rpittau is now known as rpittau|bbl | 11:21 | |
*** Lucas_Gray has joined #openstack-infra | 11:48 | |
*** sshnaidm is now known as sshnaidm|off | 11:52 | |
*** Lucas_Gray has quit IRC | 11:53 | |
*** jpena is now known as jpena|lunch | 12:01 | |
*** ykarel is now known as ykarel|afk | 12:20 | |
*** pcaruana has quit IRC | 12:34 | |
*** surpatil has quit IRC | 12:34 | |
*** ykarel|afk is now known as ykarel | 12:36 | |
*** pcaruana has joined #openstack-infra | 12:38 | |
*** rpittau|bbl is now known as rpittau | 12:55 | |
*** ociuhandu has quit IRC | 12:55 | |
*** ociuhandu has joined #openstack-infra | 12:56 | |
*** jpena|lunch is now known as jpena | 12:57 | |
*** ociuhandu has quit IRC | 12:58 | |
*** ociuhandu has joined #openstack-infra | 12:58 | |
*** ykarel is now known as ykarel|afk | 13:03 | |
*** goldyfruit has quit IRC | 13:05 | |
*** rh-jelabarre has joined #openstack-infra | 13:05 | |
*** goldyfruit has joined #openstack-infra | 13:06 | |
*** psachin has quit IRC | 13:09 | |
*** ykarel|afk is now known as ykarel|away | 13:09 | |
openstackgerrit | Lee Yarwood proposed openstack/devstack-gate master: nova: Renable n-net on stable/queens|pike|ocata https://review.opendev.org/701957 | 13:12 |
*** trident has quit IRC | 13:13 | |
*** trident has joined #openstack-infra | 13:15 | |
*** ociuhandu has quit IRC | 13:22 | |
*** ociuhandu_ has joined #openstack-infra | 13:22 | |
*** rlandy has joined #openstack-infra | 13:22 | |
*** aedc has joined #openstack-infra | 13:35 | |
*** Goneri has joined #openstack-infra | 13:39 | |
openstackgerrit | Lee Yarwood proposed openstack/devstack-gate master: nova: Renable n-net on stable/rocky|queens|pike|ocata https://review.opendev.org/701957 | 13:47 |
*** aedc has quit IRC | 13:51 | |
*** gfidente has quit IRC | 13:59 | |
*** gfidente has joined #openstack-infra | 14:05 | |
openstackgerrit | Simon Westphahl proposed zuul/nodepool master: Always identify static nodes by node tuple https://review.opendev.org/701969 | 14:06 |
openstackgerrit | Simon Westphahl proposed zuul/nodepool master: Always identify static nodes by node tuple https://review.opendev.org/701969 | 14:10 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: JWT drivers: Deprecate RS256withJWKS, introduce OpenIDConnect https://review.opendev.org/701972 | 14:20 |
*** liuyulong has joined #openstack-infra | 14:25 | |
*** dtantsur is now known as dtantsur|brb | 14:27 | |
*** ociuhandu has joined #openstack-infra | 14:46 | |
*** ociuhandu_ has quit IRC | 14:46 | |
openstackgerrit | David Shrewsbury proposed zuul/zuul master: Extract project config YAML into ref docs https://review.opendev.org/701977 | 14:47 |
*** eernst has joined #openstack-infra | 14:49 | |
*** ykarel|away is now known as ykarel | 14:53 | |
*** eernst has quit IRC | 14:56 | |
*** lmiccini has quit IRC | 14:57 | |
*** dave-mccowan has joined #openstack-infra | 15:06 | |
fungi | amotoki: tosky: AJaeger: looking at those stable tox failures, the log shows /usr/local/bin/tox is being run directly (not under an explicit interpreter) so it must be getting installed with python3. the job log also indicates the ensure-tox role found an existing tox executable so that suggests it's preinstalled in our images (i couldn't find any record of tox getting installed within the job). | 15:08 |
fungi | unfortunately our nodepool image build logs don't seem to be verbose enough to include confirmation that tox is being installed or how, so i'll need to dig into nodepool element sources | 15:08 |
AJaeger | fungi, thanks for digging into this | 15:09 |
frickler | fungi: where is this failing? I seem to remember that we had (and fixed) some similar issue a couple of weeks ago | 15:15 |
fungi | infra-root: nb01 has run out of tempspace and is no longer able to build images. i'll attempt to remedy | 15:15 |
*** jtomasek has quit IRC | 15:16 | |
*** armax has joined #openstack-infra | 15:16 | |
*** eharney has joined #openstack-infra | 15:17 | |
fungi | frickler: stable branch tox jobs for horizon at least. this example was a pep8 job for stable/pike: https://zuul.opendev.org/t/openstack/build/daaeaedb0a184e29a03eeaae59157c78/ | 15:18 |
fungi | it looks like probably a few weeks ago (mid-december) our ubuntu-xenial images started installing tox under python3 | 15:18 |
frickler | yes they did and iirc we said the fix was to set basepython=2.7 for those jobs that need that | 15:20 |
*** zul has joined #openstack-infra | 15:20 | |
fungi | the solution we suggested for ubuntu-bionic is still applicable in my opinion (stable/pike of horizon can't run `tox -e pep8` on python3 but its tox.ini doesn't actually indicate that) | 15:20 |
fungi | frickler: the ml thread from november was about the default python for tox changing on our ubuntu-bionic images | 15:21 |
fungi | some weeks later, something happened to make it the case on ubuntu-xenial images as well | 15:21 |
fungi | http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010957.html | 15:22 |
openstackgerrit | Merged openstack/project-config master: Remove duplicated ACL files https://review.opendev.org/700913 | 15:22 |
*** ociuhandu has quit IRC | 15:23 | |
amotoki | it seems tox uses the interpreter where tox is installed as the default python when basepython is not specified. | 15:26 |
amotoki | in case of horizon, we have landed a workaround like https://review.opendev.org/#/c/701848/ | 15:27 |
amotoki | it turns out it affects all jobs without basepython.... horizon is okay now but I am afraid that at least horizon plugins are affected. | 15:28 |
fungi | amotoki: yes, the solution in ianw_pto's ml post from november 20 is still probably a good idea just so that the tox.ini is appropriately explicit about what major version of python it needs | 15:31 |
fungi | otherwise a developer who has installed tox with python3 on their machine will encounter the same problems when trying to run tests locally | 15:32 |
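amotoki's observation can be verified directly on any machine: the shebang on the tox entry point shows which interpreter tox was installed under, and that interpreter becomes the default for envs with no `basepython`. (The lookup path is an assumption; `TOX_BIN` can be overridden.)

```shell
# Which python will tox default to? Inspect the entry point's shebang.
TOX_BIN="${TOX_BIN:-$(command -v tox || true)}"
if [ -n "$TOX_BIN" ]; then
    head -1 "$TOX_BIN"    # e.g. "#!/usr/bin/python3"
else
    echo "tox not installed here"
fi
```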
frickler | fungi: amotoki: and that argument holds regardless of how we changed the default for xenial, so I'm not sure how much value there is in digging into that | 15:32 |
fungi | i'm still curious to know how it ended up changing for xenial images, but yes the answer likely doesn't change the recommendation | 15:33 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Define check-release-approval executor job https://review.opendev.org/701982 | 15:33 |
amotoki | yeah, I agree that we suggest to have the interpreter explicitly in tox.ini, | 15:33 |
amotoki | on the other hand, I am confused as it happens in xenial. we would like to avoid a workaround for older stable branches. | 15:34 |
fungi | well, it's not a workaround, it's fixing a latent bug which simply hadn't surfaced in our ci jobs | 15:35 |
fungi | but it's a bug which could easily bite developers running tox locally, as i mentioned | 15:35 |
amotoki | exactly | 15:36 |
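The fix being discussed (make tox.ini explicit about the python it needs) looks roughly like this fragment; the env name and version are illustrative, not the actual horizon change:

```ini
[testenv:pep8]
basepython = python2.7
```

With this pin, the env behaves the same whether tox itself was installed under python2 or python3.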
openstackgerrit | Merged zuul/zuul-jobs master: Make pre-molecule tox playbook platform agnostic https://review.opendev.org/700452 | 15:38 |
*** ociuhandu has joined #openstack-infra | 15:44 | |
*** jpena is now known as jpena|brb | 15:45 | |
fungi | i wonder if https://review.opendev.org/697211 (merged to dib on december 12, released in 2.32.0 the next day, probably started influencing our image builds the day after that) is what changed | 15:46 |
AJaeger | fungi: if it worked on the 18th, then its still 5 days difference, isn't it? | 15:47 |
AJaeger | still, we might not have built images for 5 days... | 15:47 |
fungi | yeah, not sure. you're right the timing doesn't match up though | 15:48 |
fungi | our image build logs don't go back that far | 15:48 |
frickler | fungi: builds on nb02 also seem to be failing, does it need to be cleaned up, too? /me needs to leave now | 15:51 |
fungi | frickler: quite possibly, i'll take a look after i finish with nb01 | 15:52 |
*** rfolco has quit IRC | 15:52 | |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Define check-release-approval executor job https://review.opendev.org/701982 | 15:55 |
fungi | frickler: which image build failures were you seeing on nb02? its dib scratchspace is only 56% used right now | 15:57 |
*** ykarel is now known as ykarel|away | 16:01 | |
*** eernst has joined #openstack-infra | 16:04 | |
*** roman_g has joined #openstack-infra | 16:07 | |
*** dtantsur|brb is now known as dtantsur | 16:08 | |
clarkb | infra-root https://review.opendev.org/#/c/701821/ should be the last step in LE'ing zuul.opendev.org (the certs are in place now we have to consume them in apache) | 16:14 |
clarkb | mordred: corvus frickler ^ I'm able to watch that today if you approve it | 16:14 |
clarkb | also I've confirmed that zuul-ci.org no longer has a "your cert will expire soon" warning | 16:14 |
mordred | clarkb: +A | 16:15 |
clarkb | tyty | 16:15 |
*** jpena|brb is now known as jpena | 16:22 | |
*** mattw4 has joined #openstack-infra | 16:23 | |
*** rlandy is now known as rlandy|brb | 16:29 | |
openstackgerrit | Merged opendev/system-config master: Use zuul.opendev.org LE cert https://review.opendev.org/701821 | 16:34 |
fungi | what's the safest way to clean up a full /opt on a nodepool builder? on nb01 i see we have 0.9tb in /opt/nodepool_dib and 45gb in /opt/dib_cache | 16:35 |
fungi | i've stopped the nodepool-builder service on the server | 16:36 |
fungi | looks like there are a ton of kernel threads for loop and bioset handling, suggesting leaked devices in chroots? | 16:37 |
clarkb | fungi: usually I disable the service then reboot it to clear stale mounts | 16:37 |
fungi | do those need to be cleared out somehow too? | 16:37 |
clarkb | yes | 16:37 |
clarkb | then /opt/dib_tmp is typically what you clean up | 16:37 |
clarkb | having .9tb in nodepool_dib implies that the space is consumed by actual image builds | 16:38 |
clarkb | which may imply that nb02 isn't sharing the load | 16:38 |
fungi | yeah, /opt/dib_tmp is only 7mb | 16:38 |
clarkb | ya nb02 is out to lunch too | 16:38 |
fungi | so clearing that out won't do much | 16:38 |
clarkb | its got a ton of stale build processes from 2020 | 16:39 |
clarkb | er 2019 | 16:39 |
clarkb | I think the fix in this case is to have nb02 come back and take some of the image load off of nb01 | 16:39 |
clarkb | then clean up nb01 if necessary | 16:39 |
fungi | okay, but same cleanup process on both? | 16:39 |
clarkb | ya | 16:39 |
*** lucasagomes has quit IRC | 16:40 | |
clarkb | (this was why I was cleaning up old images a while back, to reduce the total number of images we had so that a single builder had a chance at building them. I think we cleaned up all the images we could clean up at the time though) | 16:40 |
fungi | /opt/dib_tmp on nb02 is definitely larger, waiting on du to tell me how much | 16:41 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: web capabilities: remove unused job_history attribute https://review.opendev.org/702001 | 16:41 |
clarkb | we cleared out a couple fedora images and opensuse images iirc | 16:41 |
clarkb | I wonder if maybe the oldest debian can go too? | 16:41 |
fungi | likely, but we'd want to codesearch to see if it's in use before we pull it | 16:42 |
clarkb | yup that is what we did with the other images. Pushed up changes to remove jobs that use them if just old or update them to use newer options. Then remove the nodeset. Then remove the images | 16:43 |
clarkb | not a quick process, but this was a big part of the motivation for it. | 16:44 |
fungi | i wonder if it's time for another amd64 builder so losing one doesn't cause the other to fill up | 16:44 |
clarkb | or add more disk to the existing builders | 16:45 |
clarkb | another option would be to delete the image from local disk once uploaded to all the clouds (but then people won't be able to download them) | 16:45 |
*** pgaxatte has quit IRC | 16:45 | |
clarkb | we could potentially keep just the qcow2 compressed version and then convert to raw or vhd if necessary from there | 16:45 |
clarkb | Shrews: ^ as soon as I've said that I've realized that could be a really nice nodepool-builder feature | 16:46 |
clarkb | Shrews: basically keep a version of the image (qcow2 will almost always be smallest) for recovery purposes if necessary but delete the other versions once they have finished uploading | 16:46 |
clarkb | then we have 9GB * num images storage space instead of 60GB * num images storage space | 16:46 |
fungi | that does make automated reuploading of raw images harder i guess? | 16:48 |
fungi | or when adding a new provider (the builder would normally start uploading already-built images to it automatically as soon as the provider was added, right?) | 16:48 |
clarkb | fungi: you'd have to qemu-img convert them first | 16:48 |
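If only the compressed copy were kept, regenerating the other formats would be a `qemu-img convert` away; a guarded sketch (the filename is hypothetical, and `vpc` is qemu-img's name for the VHD format):

```shell
# Hypothetical recovery step: rebuild provider-specific formats from the
# one retained qcow2 copy.
img=ubuntu-xenial-0000012345
if command -v qemu-img >/dev/null 2>&1 && [ -f "${img}.qcow2" ]; then
    qemu-img convert -O raw "${img}.qcow2" "${img}.raw"
    qemu-img convert -O vpc "${img}.qcow2" "${img}.vhd"
fi
```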
clarkb | that is a good point about adding a new provider | 16:49 |
clarkb | we could force a new build at that point as a workaround but that isn't very user friendly | 16:49 |
fungi | hrm, not a lot more in /opt/dib_tmp on nb02 either... 50gb according to du | 16:50 |
clarkb | fungi: ya check the ps listings for disk-image-create though | 16:51 |
clarkb | nb02 seems stuck on a process problem not a disk problem | 16:51 |
fungi | right, just in terms of clearing out /opt/dib_tmp it's not really going to free up much is what i meant | 16:51 |
clarkb | ya but its got about 500GB free | 16:51 |
clarkb | which is normal | 16:51 |
clarkb | (and why, if we lose one, the other fills its 1TB of disk) | 16:52 |
*** lpetrut has quit IRC | 16:52 | |
fungi | how do you normally go about disabling nodepool-builder? the update-rc.d tool or rename the rc.2 symlinks from S to K or via systemctl disable or some other way? | 16:52 |
fungi | edit the initscript to exit 0? | 16:52 |
clarkb | systemctl disable nodepool-builder | 16:53 |
clarkb | it should give you a message about updating init script things | 16:53 |
fungi | cool, i didn't know that worked for sysv-compat | 16:56 |
clarkb | yup, the way systemd sysv compat works is it automatically adds a shim unit file for each sysv init script | 16:56 |
*** gyee has joined #openstack-infra | 16:56 | |
*** ociuhandu has quit IRC | 16:57 | |
clarkb | systemctl can then manage that shim as if it were any other unit | 16:57 |
fungi | thanks! | 16:57 |
clarkb | (this is why we have to daemon-reload systemd in our puppetry to have systemd figure out the service exists) | 16:57 |
*** dpawlik has quit IRC | 16:57 | |
fungi | got it | 16:57 |
fungi | so as far as bringing these back online after rebooting and clearing out /opt/dib_tmp, i should enable and start nodepool-builder on nb02 first and leave it stopped on nb01 until a full set of images is going? | 16:59 |
fungi | (so that nb01 doesn't try to build more images when it lacks disk space to write them?) | 16:59 |
clarkb | you'll need to leave it running on nb01 so that it can delete the images that nb02 builds new ones for | 16:59 |
clarkb | I think | 17:00 |
fungi | is it smart enough to know not to try to build any on nb01 until it deletes some? | 17:00 |
clarkb | no | 17:00 |
clarkb | it will fail to build images in that period | 17:00 |
fungi | because it's going to have maybe 90gb free here after i clear dib_tmp | 17:00 |
clarkb | this is where the auto rebuild aggressiveness makes it difficult to work with nodepool | 17:00 |
clarkb | because we could delete the older images of a pair to free up space but then it will immediately start trying to build that image | 17:01 |
fungi | is it safe to delete /opt/dib_tmp itself, or do i need to leave the directory and just remove contents? | 17:01 |
clarkb | you need to remove the contents or wait for puppet to run and put it back or put it back yourself | 17:01 |
clarkb | nodepool doesn't create that dir | 17:01 |
fungi | okay | 17:01 |
fungi | thanks | 17:01 |
clarkb | (typically dib would use /tmp) | 17:02 |
clarkb | another option is to pause all images in the nb01 config, then delete the older image of the pairs on it | 17:02 |
fungi | hrm, nb02 isn't reachable yet. maybe a periodic fsck was triggered | 17:02 |
clarkb | then only unpause the images in nb01's config once nb02 has picked up some slack | 17:03 |
clarkb | Its probably ok to simply leave it running and let some errors happen? | 17:03 |
clarkb | fungi: I think reboots may be slow there due to needing to clean up all those mounts and stuff | 17:03 |
fungi | yeah, that seems reasonable | 17:03 |
clarkb | systemd will immediately stop sshd but then other things are slower | 17:03 |
clarkb | maybe what we need is the ability to set image pausing outside of config | 17:04 |
*** rlandy|brb is now known as rlandy | 17:04 | |
clarkb | then we could say nodepool pause foo, nodepool delete foo-1, wait for nb02, nodepool unpause foo | 17:05 |
clarkb | and not bother with emergency files and config | 17:05 |
*** ociuhandu has joined #openstack-infra | 17:06 | |
clarkb | we had a similar problem in the past where the most recent image was the problem. I wanted to delete the most recent image and use the previous image but nodepool immediately started building a new image that would be broken | 17:06 |
clarkb | solution there is to pause then delete | 17:06 |
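For reference, the config-file pause being contrasted with a hypothetical `nodepool pause` command is a per-diskimage flag in the builder's nodepool.yaml; a minimal fragment (image name illustrative):

```yaml
# nodepool.yaml (fragment)
diskimages:
  - name: ubuntu-xenial
    pause: true   # stop rebuilding; existing uploads stay usable
```

With the image paused in config, an older build can be deleted without the builder immediately kicking off a replacement.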
*** rpittau is now known as rpittau|afk | 17:10 | |
fungi | nb01 cleanup is done but i'm not starting it just yet because nb02 is still unreachable | 17:12 |
fungi | i'll check the oob console | 17:12 |
*** tesseract has quit IRC | 17:19 | |
fungi | nb02 console just shows "Ubuntu 16.04" and a little spinner | 17:20 |
fungi | not sure if it's booting or stopping | 17:20 |
fungi | hiding boot/shutdown progress from the console display is an unpardonable sin. why would that be the default? | 17:21 |
*** eernst has quit IRC | 17:22 | |
fungi | and the `console log show` cli command is unsupported for rackspace | 17:23 |
clarkb | fungi: thats long been an issue on ubuntu (the hiding console output on servers problem) | 17:23 |
fungi | i guess our options are to wait, or try to force a(nother) reboot and hope it doesn't irrecoverably corrupt /opt | 17:23 |
clarkb | I seem to recall that is a symptom of fscking | 17:23 |
*** zxiiro has joined #openstack-infra | 17:24 | |
clarkb | because you get error messages if there was actually something wrong much more quickly | 17:24 |
fungi | ahh | 17:24 |
*** liuyulong_ has joined #openstack-infra | 17:25 | |
fungi | so maybe it did hit a scheduled fsck on boot and since /opt is 1tb (and maybe slow)... | 17:25 |
clarkb | puppet apply at about 1800UTC should update zuul.opendev.org cert | 17:27 |
*** liuyulong has quit IRC | 17:28 | |
*** dtantsur is now known as dtantsur|afk | 17:30 | |
*** eernst has joined #openstack-infra | 17:30 | |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: helm-template: Add role to run 'helm template' https://review.opendev.org/701871 | 17:31 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: apply-helm-charts: Job to apply Helm charts https://review.opendev.org/701874 | 17:31 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: apply-helm-charts: Job to apply Helm charts https://review.opendev.org/701874 | 17:31 |
*** ociuhandu has quit IRC | 17:33 | |
*** evrardjp has quit IRC | 17:33 | |
*** evrardjp has joined #openstack-infra | 17:34 | |
*** ociuhandu has joined #openstack-infra | 17:36 | |
openstackgerrit | Merged zuul/zuul-jobs master: install-go: bump version to 1.13.5 https://review.opendev.org/700467 | 17:42 |
fungi | okay, nb02 finally became reachable, cleaning it up now | 17:45 |
clarkb | infra-root: thoughts on adding gmann to devstack-gate core? seems like the changes going in now are for life support on old branches | 17:46 |
fungi | i'm in favor | 17:46 |
clarkb | gmann in particular seems to be helping to ensure those changes get in so having him be able to approve would be good I think | 17:46 |
fungi | and he's helping drive the openstack cycle goal to drop legacy jobs from master | 17:47 |
fungi | which should mean less use of d-g overall | 17:47 |
clarkb | gmann: ^ would you be interested in that? | 17:47 |
gmann | clarkb: fungi sure, that will be helpful. thanks | 17:48 |
tosky | oh, right, devstack-gate was originally part of infra and not QA | 17:48 |
tosky | (I guess it is still infra) | 17:49 |
*** iurygregory has quit IRC | 17:54 | |
*** derekh has quit IRC | 18:01 | |
smcginnis | zuul down? | 18:01 |
*** ykarel|away has quit IRC | 18:02 | |
clarkb | smcginnis: no, looks like the apache config for new ssl cert is unhappy :/ | 18:04 |
clarkb | zuul is still running though | 18:04 |
smcginnis | OK, I'm just getting a connection refused trying to access the status page. Good it's only that part. | 18:05 |
clarkb | ya I'm trying to sort it out | 18:05 |
smcginnis | Thanks! | 18:05 |
clarkb | ok I've put the old vhost config back in place | 18:08 |
clarkb | and put the host in the emergency file. This way the webserver is up and running while I sort this out | 18:08 |
smcginnis | Confirmed - at least loads for me now. | 18:08 |
clarkb | oh I know what is wrong ugh | 18:09 |
*** pcaruana has quit IRC | 18:09 | |
clarkb | ok, I made the (bad) assumption that having any content at all in /etc/letsencrypt-certs/zuul.opendev.org/ was a sign that things were happy there | 18:09 |
clarkb | they were not, I could not issue the certificate because zuul01.opendev.org does not have a acme delegation record | 18:10 |
clarkb | and the reason for that is we don't have a zuul01.opendev.org, just a zuul01.openstack.org | 18:11 |
clarkb | fix incoming | 18:11 |
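The failure clarkb describes (no delegation record for zuul01.opendev.org) can be spot-checked from outside. A minimal sketch, assuming the delegation is the conventional `_acme-challenge` CNAME used for DNS-01 validation (the record names here are illustrative):

```shell
# A present delegation answers with the target of the CNAME:
dig +short CNAME _acme-challenge.zuul.opendev.org
# An empty answer for the host's other name would explain the failed
# issuance described above:
dig +short CNAME _acme-challenge.zuul01.opendev.org
```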
*** ociuhandu has quit IRC | 18:12 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Don't issue cert for zuul01.opendev.org https://review.opendev.org/702020 | 18:14 |
clarkb | infra-root ^ that cleanup is necessary for zuul.opendev.org le happiness | 18:14 |
*** stevebaker has joined #openstack-infra | 18:15 | |
*** rfolco has joined #openstack-infra | 18:16 | |
*** eernst has quit IRC | 18:18 | |
*** gfidente is now known as gfidente|afk | 18:24 | |
*** jpena is now known as jpena|off | 18:25 | |
fungi | okay, nb02 is cleaned up and nodepool-builder service enabled and started on it, currently building debian-stretch-0000100801 for 6 minutes now | 18:29 |
fungi | per earlier discussion, i'll start the service on nb01 now and maybe it'll fail for a bit until nb02 builds enough replacements that 01 can delete some of its older images | 18:29 |
fungi | nb01 is now building (or at least trying to) gentoo-17-0-systemd-0000131965 | 18:31 |
clarkb | fungi: great. I expect things should start to settle down on the builders after a couple images manage to build and get their old versions cleaned up | 18:32 |
clarkb | ~3 hours away probably | 18:32 |
fungi | yup | 18:33 |
fungi | the /opt partition on nb01 has only 76gb to spare, so i do expect some failures | 18:34 |
*** liuyulong_ has quit IRC | 18:41 | |
*** kjackal has quit IRC | 18:45 | |
*** aedc has joined #openstack-infra | 18:51 | |
*** eharney has quit IRC | 18:54 | |
*** pcaruana has joined #openstack-infra | 18:57 | |
*** ralonsoh has quit IRC | 18:59 | |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Remove retired x/js-* repos from gerritbot https://review.opendev.org/702028 | 19:05 |
yoctozepto | AJaeger: by the looks of it js-openstack-lib already spams in openstack-sdks | 19:06 |
yoctozepto | gerritbot config needs no update :-) | 19:06 |
AJaeger | yoctozepto: indeed, was surprised by that - so, that governance change somehow reflects reality ;) | 19:06 |
AJaeger | yoctozepto: 702028 is the change I did - thanks for reminding me of that one | 19:07 |
yoctozepto | AJaeger: well, it is far from infra | 19:07 |
yoctozepto | no problem, apply cleaning procedures before the weekend :-) | 19:07 |
AJaeger | ;) | 19:07 |
* AJaeger would just love to have the house cleaned up as easily :) | 19:08 | |
clarkb | AJaeger: ++ | 19:08 |
yoctozepto | I'm allergic to dust so I clean mine regularly... | 19:09 |
yoctozepto | AJaeger, clarkb: regarding https://review.opendev.org/#/admin/groups/1408,members <- how to propose change - I presume it should happen after governance change anyways :-) | 19:10 |
AJaeger | clarkb: regarding js-openstack-lib, I suggest you +1 as infra PTL the governance change https://review.opendev.org/#/c/701854/ | 19:11 |
clarkb | yoctozepto: mordred is sdk ptl so we'd give him access then he can edit the list as he wants | 19:11 |
AJaeger | yoctozepto: mordred as PTL and infra-core can take care of it | 19:11 |
clarkb | and ya he is already in there | 19:11 |
yoctozepto | clarkb, AJaeger: mhm, that makes sense | 19:12 |
clarkb | AJaeger: done | 19:12 |
AJaeger | thanks | 19:13 |
* AJaeger disappears to cycle for collecting his kids | 19:14 | |
*** kjackal has joined #openstack-infra | 19:18 | |
yoctozepto | AJaeger: healthy you! | 19:20 |
*** factor has quit IRC | 19:27 | |
*** dklyle has quit IRC | 19:28 | |
openstackgerrit | Radosław Piliszek proposed openstack/project-config master: Remove old openstack/js-openstack-lib jobs https://review.opendev.org/702030 | 19:29 |
*** lpetrut has joined #openstack-infra | 19:30 | |
*** dklyle has joined #openstack-infra | 19:37 | |
fungi | okay, nb02 finished the debian-stretch image and is now onto opensuse-15 as of 30 minutes ago | 19:43 |
fungi | nb01 is still building gentoo-17-0-systemd for over an hour, but will hopefully complete soon | 19:43 |
fungi | and it's still got 55gb worth of space left in /opt so maybe it won't fail to write | 19:44 |
fungi | i'm going to go out for a brief walk since we seem to have a spate of pleasant weather, but i'll be back in an hour-ish to check in on it | 19:45 |
openstackgerrit | Merged opendev/system-config master: Don't issue cert for zuul01.opendev.org https://review.opendev.org/702020 | 19:45 |
*** stevebaker has quit IRC | 19:50 | |
*** eharney has joined #openstack-infra | 19:52 | |
openstackgerrit | Merged openstack/devstack-gate master: nova: Renable n-net on stable/rocky|queens|pike|ocata https://review.opendev.org/701957 | 19:52 |
*** Goneri has quit IRC | 19:59 | |
clarkb | I have removed zuul01 from the emergency file and will keep an eye on it | 20:00 |
clarkb | not hearing any opposition I've added gmann to d-g core | 20:02 |
clarkb | I think that will help with the straggler changes that go in there to keep stable branches running | 20:03 |
clarkb | while checking where we are in ansible + puppet loop I discovered that logstash-worker05 was not responding to ssh | 20:05 |
clarkb | this has been the case for days according to logs. I will reboot it via the api | 20:06 |
clarkb | #status log Added gmann to devstack-gate-core to help support fixes necessary for stable branches there. | 20:07 |
openstackstatus | clarkb: finished logging | 20:07 |
clarkb | #status log Rebooted logstash-worker05 via nova api after discovering it has stopped responding to ssh for several days. | 20:07 |
openstackstatus | clarkb: finished logging | 20:07 |
clarkb | syslog doesn't show anything but it appears to have stopped on december 21, 2019 | 20:08 |
clarkb | fungi: looks like nb03 is also in a similar no disk state. I'm going to apply similar cleanup to it now | 20:13 |
clarkb | fungi: also I think part of our problem is we are holding much older copies of images possibly because we're failing to delete them from clouds (probably vexxhost because of the BFV "you can't delete this image because something is using it" problem) | 20:23 |
clarkb | fungi: I'm going through on nb01 and clearing out files in /opt/nodepool_dib that don't correspond to images reported by dib-image-list | 20:23 |
clarkb | as a first pass cleanup | 20:23 |
clarkb | anything that remains is still valid and possibly "stuck" | 20:23 |
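The first-pass rule clarkb applies on nb01 can be sketched as a small helper; the file and build names below are illustrative inputs following the naming pattern seen elsewhere in this log, not a real listing:

```python
def orphaned_files(disk_files, known_builds):
    """Return files under /opt/nodepool_dib whose names do not match
    any build reported by `nodepool dib-image-list`. A hypothetical
    sketch of the manual cleanup described above: inputs are plain
    lists of names, not real nodepool objects."""
    return [f for f in disk_files
            if not any(f.startswith(b) for b in known_builds)]
```

Anything this returns is safe to delete as a first pass; whatever matches a known build is kept and may still be "stuck" pending provider-side deletion.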
*** arif-ali has quit IRC | 20:25 | |
clarkb | fungi: ok nb01's /opt/nodepool_dib contents should reflect what is in nodepool dib-image-list now | 20:39 |
clarkb | we have an excess of bionic, buster, centos-7, and gentoo images which I think is related to not being able to delete them from cloud providers | 20:40 |
clarkb | nb03 /opt/dib_tmp cleanup is very slow | 20:41 |
clarkb | we have issued an LE cert properly for zuul.opendev.org now | 20:42 |
clarkb | just waiting for puppet to run and switch the apache config over | 20:42 |
clarkb | Failed to delete image with name or ID 'ubuntu-bionic-1573653999': 409 Conflict: Image c68d93eb-72ff-42ad-b5c8-63daace0286a could not be deleted because it is in use: The image cannot be deleted because it is in use through the backend store outside of Glance. (HTTP 409) | 20:46 |
clarkb | that confirms that for at least one of the images | 20:46 |
clarkb | and I've tracked one of the centos-7 image leaks to a volume that reports to be attached to a server that no longer exists | 20:51 |
clarkb | I think that means we want to start with a volume cleanup | 20:51 |
clarkb | then let nodepool cleanup images again then see if there is anything left | 20:51 |
clarkb | I expect that to be fairly involved and I want to finish up this zuul cert thing and find lunch first | 20:52 |
fungi | i wonder if nodepool could be adjusted to delete local copies of images it also wants to delete remotely, regardless of whether remote deletion fails | 20:54 |
fungi | if it's actively trying to delete those images from providers, there's probably no need to keep the local copy of them on disk any longer | 20:55 |
clarkb | ++ | 20:55 |
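fungi's proposed builder behavior is not something nodepool does here; as a hypothetical sketch of the rule (the `build` dict shape and the function name are invented for illustration), the idea is to drop the local file as soon as the record enters the deleting state, while keeping the record itself so deletion keeps being retried remotely:

```python
import os

def prune_local_copy(build, image_file):
    """Hypothetical rule from the discussion above: once a build's
    record is in the 'deleting' state the local file is no longer
    needed, even if provider-side deletion keeps failing. The record
    is untouched so the cleanup worker can keep retrying."""
    if build["state"] == "deleting" and os.path.exists(image_file):
        os.remove(image_file)
        return True
    return False
```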
clarkb | fwiw cleaning up nb03's dib_tmp freed like 16GB. I'm now doing the same cleanup to /opt/nodepool_dib there that I did on nb01 to see if we can free more space | 20:55 |
openstackgerrit | Andreas Jaeger proposed openstack/openstack-zuul-jobs master: Remove jobs and templates used by js-openstack-lib https://review.opendev.org/701510 | 20:55 |
fungi | i assume we'd still need to keep a record of the images since that's how it knows to keep trying to delete them? | 20:55 |
clarkb | nb03 poses a slightly different problem. We've got images associated with linaro-cn1 in zk and those will never delete because the cloud is gone. | 20:57 |
*** michael-beaver has joined #openstack-infra | 20:57 | |
clarkb | For there I'll delete them from disk then after lunch I can figure out how to surgery the zk db? | 20:57 |
clarkb | fungi: all of that is stored in zk and is the source of the problem for ^ | 20:58 |
clarkb | zk says that image must be deleted but it will never be deleted at this point because the cloud is gone so we need to edit the zk db | 20:58 |
clarkb | I'll start with simply removing them from disk as that is easy and frees space | 20:58 |
clarkb | oh except the ones for cn1 are not on disk? we've also got images that refuse to delete in london? | 20:59 |
fungi | that sounds like a royal mess | 21:00 |
fungi | i don't suppose zk has a convenient cli you can use to inspect and manipulate records? | 21:00 |
mordred | fungi: zkshell | 21:01 |
mordred | fungi: https://github.com/apache/zookeeper/blob/master/zookeeper-docs/src/main/resources/markdown/zookeeperCLI.md | 21:01 |
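For reference, a minimal session sketch for this kind of inspection; the PyPI package name and the host/path come from later in this log, and the commands assume zk-shell's standard `ls`/`get` syntax:

```shell
# Install and connect (the package is "zk-shell" on PyPI):
pip install zk-shell
zk-shell zk01.opendev.org:2181
# Inside the shell, list and inspect nodepool's image records:
#   ls /nodepool/images
#   ls /nodepool/images/ubuntu-bionic-arm64/builds
#   get /nodepool/images/ubuntu-bionic-arm64/builds/0000001978/providers/linaro-cn1/images/0000000001
```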
clarkb | I'm trying to manually image delete the leaked images in london now | 21:02 |
clarkb | to see if the error is useful | 21:02 |
clarkb | at the very least we should be able to apply fungi's new rule for deleting from disk when db record is set to deleting manually | 21:03 |
fungi | that's a good point | 21:03 |
clarkb | zuul.opendev.org is LE'd now | 21:04 |
clarkb | I'm going to go find lunch while i wait for this image delete to return | 21:04 |
clarkb | fungi: if you want to poke at the vexxhost image leaks via volume leaks I can poke at nb03 | 21:04 |
clarkb | I'm not doing anything with nb01 or nb02 right now so we won't be getting in each other's way | 21:05 |
fungi | cool, will do | 21:05 |
fungi | though i need to get started making dinner soon | 21:05 |
clarkb | fungi: what I noticed is that if you volume list sjc1 you'll get some volumes that say "attached to $name" and others are "attached to $uuid" | 21:05 |
fungi | will see if i can get through them quickly | 21:05 |
clarkb | the $uuid ones seem to not have names because those servers do not exist anymore and we have leaked those volumes | 21:06 |
clarkb | I think if we delete those volumes after confirming the servers do not exist then the images should be able to delete | 21:06 |
*** rfolco has quit IRC | 21:06 | |
clarkb | and then nodepool will automatically remove the files on disk | 21:06 |
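The leak heuristic clarkb describes can be sketched with plain data structures (the dict shapes below are illustrative, mimicking an API volume listing rather than real SDK objects): a volume whose attachment points at a server UUID that no longer exists has leaked.

```python
def find_leaked_volumes(volumes, server_ids):
    """Return ids of volumes attached to servers that no longer exist,
    per the discussion above. 'volumes' is a list of dicts with an
    'attachments' list; 'server_ids' is the set of live server UUIDs."""
    leaked = []
    for vol in volumes:
        for att in vol.get("attachments", []):
            if att["server_id"] not in server_ids:
                leaked.append(vol["id"])
    return leaked
```

Volumes with no attachments at all are left alone here; those may be legitimately detached.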
fungi | and there's a special way mordred worked out to possibly delete them? | 21:06 |
clarkb | fungi: ya you unattach them first | 21:06 |
fungi | (if still attached to nonexistent instance) | 21:06 |
clarkb | I don't know what the specific details for that are but its some api call to do an unattach | 21:07 |
fungi | right, i have a feeling there is no way to do it with osc, will need to use sdk or api | 21:07 |
clarkb | ah | 21:07 |
fungi | you can detach normally *if* the instance still exists | 21:08 |
fungi | if the instance was deleted but cinder still has an attachment record pointing to it, then you need an undocumented api call | 21:08 |
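A sketch of that call, using the os-force_detach volume action from the block-storage API reference; the volume UUID is the one from this exchange, while the endpoint and token variables are placeholders:

```shell
# POST an os-force_detach action directly against the volume.
# $CINDER_ENDPOINT and $OS_TOKEN are placeholders for a scoped
# block-storage endpoint and auth token. Note the later finding in
# this log that cinder's default policy restricts this action to
# admins (HTTP 403 otherwise).
curl -s -X POST \
  "$CINDER_ENDPOINT/volumes/0f91579c-c627-452b-aad4-67cdeae865c3/action" \
  -H "X-Auth-Token: $OS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"os-force_detach": {"attachment_id": null, "connector": null}}'
```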
*** hwoarang has quit IRC | 21:16 | |
*** hwoarang has joined #openstack-infra | 21:16 | |
corvus | clarkb: thanks for z.o.o! | 21:23 |
*** zxiiro has quit IRC | 21:23 | |
fungi | just reconfirmed, if i try `openstack server remove volume eb0cbf8e-16b5-4712-8274-c4989b1bf956 0f91579c-c627-452b-aad4-67cdeae865c3` i get No server with a name or ID of 'eb0cbf8e-16b5-4712-8274-c4989b1bf956' exists. | 21:24 |
smcginnis | fungi, clarkb: I believe mordred was going to look at doing something for that. | 21:24 |
smcginnis | We were talking about it the other day and he confirmed he can call the API needed to clean things up. | 21:25 |
fungi | smcginnis: yep, in the meantime i can probably use the api/sdk | 21:25 |
clarkb | my image delete against linaro-london hasn't returned yet | 21:25 |
clarkb | I think I'll go ahead and apply fungi's rule of deleting from disk when we start the delete process on nb03 | 21:26 |
smcginnis | It's a long way from Portland to London. | 21:26 |
*** rlandy has quit IRC | 21:26 | |
clarkb | this will give us room for normal operations while we sort out why those images aren't deleting | 21:26 |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Fix typo in helm role https://review.opendev.org/702046 | 21:27 |
*** kjackal has quit IRC | 21:27 | |
*** Goneri has joined #openstack-infra | 21:28 | |
clarkb | nb03 is running a builder again after that cleanup | 21:33 |
clarkb | fungi: is doing the zk surgery something you are interested in doing? I don't think that is urgent so fine if you want to give it a go next week | 21:33 |
fungi | i can, sure | 21:33 |
fungi | still looking at forced volume detachment | 21:34 |
clarkb | I've done it a few times in the past so can help, but figured if you hadn't done it before this might be a good time to try :) | 21:34 |
fungi | might be nice if osc grew a --force option to volume delete which did the os-force_detach action from https://docs.openstack.org/api-ref/block-storage/v3/#force-detach-a-volume | 21:34 |
fungi | oh, whaddya know! `openstack volume delete --force <uuid>` is a thing! | 21:36 |
fungi | --force Attempt forced removal of volume(s), regardless of state | 21:36 |
fungi | unfortunately, in vexxhost: | 21:36 |
fungi | "Policy doesn't allow volume_extension:volume_admin_actions:force_delete to be performed. (HTTP 403)" | 21:36 |
fungi | mnaser: do you happen to know if there's a (maybe safety-related) reason for that ^ ? | 21:37 |
fungi | i guess that's considered a protected admin-only function? | 21:38 |
mnaser | i believe that cinder by default has that as an admin-only policy thing | 21:38 |
mnaser | (we don't have custom policy fwiw) | 21:38 |
fungi | maybe force deleting volumes associated with an existing instance could crash hypervisors or something | 21:38 |
fungi | makes sense | 21:38 |
mnaser | https://github.com/openstack/cinder/blob/master/cinder/policies/volume_actions.py#L106 | 21:39 |
mnaser | yeah the cinder default is admin API | 21:39 |
mnaser | so i'll delegate that answer to them :-p | 21:39 |
fungi | if pabelanger still hung out in here i'd ask him for details on the environment where he was successfully using the os-force_detach action | 21:39 |
mnaser | fungi: i think he might have deleted the attachment in cinder first | 21:40 |
fungi | yeah, i was hoping that's what `openstack volume delete --force` was doing under the hood | 21:40 |
clarkb | fungi: and openstack server remove volume fails because the server does not exist anymore? | 21:45 |
clarkb | that is a command btw `openstack server remove volume` | 21:45 |
fungi | clarkb: yep, that's what i was trying first | 21:45 |
clarkb | still waiting to hear back on this image delete against linaro-london :/ | 21:45 |
clarkb | hrm | 21:45 |
fungi | the instance you specify must exist, because i guess it's asking nova to process the detachment and nova says "i have no idea what server that is" | 21:46 |
fungi | the actual error is "No server with a name or ID of '<uudi>' exists." | 21:47 |
fungi | s/uudi/uuid/ | 21:47 |
clarkb | fungi: well if you want to do the zk stuff instead I can try to pick this up if pabelanger responds in #zuul | 21:48 |
*** lbragstad has quit IRC | 21:54 | |
*** lbragstad has joined #openstack-infra | 21:54 | |
openstackgerrit | Merged zuul/zuul-jobs master: collect-container-logs: add role https://review.opendev.org/701867 | 21:56 |
*** ahosam has joined #openstack-infra | 21:57 | |
openstackgerrit | Clark Boylan proposed opendev/zone-opendev.org master: Manage insecure-ci-registry ssl with LE https://review.opendev.org/702050 | 22:14 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Manage insecure-ci-registry cert with LE https://review.opendev.org/702051 | 22:14 |
clarkb | infra-root ^ the dns change should be able to go in whenever it's reviewed and ready. But I'm thinking best to hold off on the switch until next week while we juggle this nodepool cleaning | 22:15 |
mordred | clarkb, fungi: moving here | 22:16 |
mordred | clarkb: what did you mean by volume size? | 22:16 |
clarkb | mordred: our mirror volume would be bigger than 80GB so you can check that attribute as another sanity check | 22:16 |
mordred | nod | 22:16 |
clarkb | 80GB is our standard size for our nodepool instances | 22:16 |
mordred | yes - that whole list is 80 | 22:17 |
mordred | clarkb, fungi: ready for me to try running it for real? | 22:19 |
clarkb | ya so I think worst case you might kill a running job or delete a held node | 22:19 |
clarkb | mordred: does your script handle the case where volume doesn't have a server because the server hasn't booted yet on initial create? | 22:20 |
clarkb | mordred: that would be the only other case I'd worry about | 22:20 |
clarkb | (I think checking that volume age > 1 hour would be sufficient to guard against that) | 22:20 |
mordred | uh. I have no idea what the race conditions there would be ... good call ... one sec | 22:20 |
openstackgerrit | Merged zuul/zuul master: Make files matcher match changes with no files https://review.opendev.org/678273 | 22:23 |
*** lastmikoi has quit IRC | 22:28 | |
*** arif-ali has joined #openstack-infra | 22:31 | |
openstackgerrit | James E. Blair proposed zuul/zuul-helm master: Add option to manage secrets outside of helm https://review.opendev.org/702052 | 22:32 |
mordred | clarkb: would you look at clean-volumes.py and tell me if my time delta code looks right? | 22:34 |
clarkb | is that on bridge? | 22:35 |
mordred | yeah | 22:35 |
mordred | clarkb: I did it by hand - but datetimes are so horrible in python I'd like a second set of eyes | 22:35 |
mordred | mnaser: are created_at values from volumes in vexxhost going to come back in UTC? | 22:36 |
clarkb | mordred: ya agreed on the horribleness | 22:37 |
mordred | clarkb: also - it's safe to run that script in its current form via the command in its first line | 22:37 |
mordred | if you want to run it and look at the output | 22:37 |
clarkb | mordred: what's with the truncation of created_at? | 22:38 |
clarkb | otherwise it looks right to me. Might also want to print the server uuid for volumes being deleted as that will be a breadcrumb for debugging if it doesn't do what we want | 22:38 |
mordred | clarkb: it has microseconds ... oh - you know - that was from when I was trying to parse with dateutil which doesn't grok those | 22:39 |
clarkb | mordred: ah ya that should be fine | 22:39 |
mordred | clarkb: ok. so - game for me to run that for real? | 22:39 |
clarkb | mordred: I think so. And maybe add in the server uuid logging if you want | 22:39 |
fungi | everything should just return utc epoch seconds, and then whatever you need is basic arithmetic | 22:41 |
mordred | ok. I'm going to run it and then I'll paste the output | 22:41 |
fungi | i mean, ideally we'd return planck units since the big bang within our relativistic frame of reference, but that's probably overengineering until we crack near-light-speed travel | 22:43 |
mordred | :) | 22:43 |
*** KeithMnemonic has quit IRC | 22:46 | |
mordred | http://paste.openstack.org/show/788262/ | 22:47 |
mordred | clarkb, fungi: ^^ | 22:47 |
mordred | (the output is a bit verbose - I should just print volume_id on the delete line I think) | 22:47 |
clarkb | we deleted ~23 volumes? | 22:48 |
mordred | next time someone runs it - it should be slightly less chatty | 22:48 |
mordred | yeah - I think so | 22:48 |
mordred | there's still likely one left around where I manually deleted the attachment from fungi's example earlier | 22:48 |
mordred | b/c I did not delete the volume itself | 22:48 |
*** dave-mccowan has quit IRC | 22:48 | |
openstackgerrit | James E. Blair proposed zuul/zuul-helm master: Add option to manage secrets outside of helm https://review.opendev.org/702052 | 22:49 |
mnaser | mordred: uh, i think so. | 22:49 |
clarkb | the image I tried deleting before is no longer there | 22:49 |
fungi | yeah, looks good | 22:50 |
openstackgerrit | James E. Blair proposed zuul/zuul-helm master: Change builder container name https://review.opendev.org/701793 | 22:50 |
openstackgerrit | James E. Blair proposed zuul/zuul-helm master: Add empty clouds value https://review.opendev.org/701865 | 22:50 |
openstackgerrit | James E. Blair proposed zuul/zuul-helm master: Add option to manage secrets outside of helm https://review.opendev.org/702052 | 22:50 |
clarkb | mordred: fungi there is one image left in vexxhost to delete | 22:50 |
clarkb | probably associated to that volume monty did not delete | 22:50 |
* clarkb looks to find it | 22:50 | |
fungi | it'll be 0f91579c-c627-452b-aad4-67cdeae865c3 | 22:51 |
clarkb | fungi: mordred 0f91579c-c627-452b-aad4-67cdeae865c3 I think it is that one | 22:51 |
clarkb | yup | 22:51 |
clarkb | should I go ahead and delete it | 22:51 |
fungi | go for it as far as i'm concerned | 22:51 |
clarkb | done | 22:53 |
clarkb | cool now only linaro images stuck in deleting | 22:54 |
clarkb | fungi: are you at the end of your week or do you want to do that now? | 22:54 |
fungi | i can take a look after dinner | 22:55 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Add quick script for cleaning boot from volume leaks https://review.opendev.org/702053 | 22:55 |
mordred | mnaser: ^^ there's a script that non-admin users can use to cleanup leaked BFV volumes on vexxhost | 22:55 |
mordred | mnaser: I'm going to generalize it a bit and put it into sdk / osc ... but for now, that seems to safely work | 22:56 |
clarkb | fungi: ok. basically what we want to do is `nodepool image-list | grep linaro-cn1` then for each of those records delete the zk nodes that correspond to them | 22:56 |
fungi | via zkshell | 22:56 |
mordred | mnaser: I mean - it should work on any modern openstack - it's just hardcoded for us to point at vexxhost since that's the only nodepool place we BFV | 22:56 |
clarkb | fungi: yup which you install via pip into a venv (I've got an install on zk01.o.o in my homedir) | 22:57 |
mnaser | mordred: neat | 22:58 |
clarkb | fungi: /nodepool/images/ubuntu-bionic-arm64/builds/0000001978/providers/linaro-cn1/images the content at that path up to linaro-cn1 should be deleted I think | 23:06 |
clarkb | It should be sufficient to simply delete the content below images/ though | 23:06 |
clarkb | I found a case for the other arm64 cloud that was removed and it still was listed under providers but had no image under providers/name/images/ | 23:06 |
fungi | is zkshell known by another name? pypi doesn't know it | 23:12 |
clarkb | fungi: zk-shell | 23:13 |
fungi | ahh | 23:13 |
fungi | yup, that's working | 23:13 |
clarkb | ya I think you only need to remove 0000000001 from /nodepool/images/ubuntu-bionic-arm64/builds/0000001978/providers/linaro-cn1/images | 23:14 |
clarkb | then do that for the other 6 linaro-cn1 images | 23:14 |
fungi | what about nodepool/images/ubuntu-bionic-arm64/builds/0000001978/providers/linaro-cn1/images/lock | 23:19 |
fungi | leave that there? | 23:19 |
clarkb | ya that node is still there for the nrt arm cloud | 23:20 |
clarkb | I think we can actually delete everything linaro-cn1 and below | 23:20 |
fungi | as in `rm nodepool/images/ubuntu-bionic-arm64/builds/0000001978/providers/linaro-cn1` | 23:21 |
clarkb | ya | 23:21 |
clarkb | I think you have to rm things below it first, there is no -r for this | 23:22 |
fungi | indeed: /nodepool/images/ubuntu-bionic-arm64/builds/0000001978/providers/linaro-cn1 is not empty. | 23:25 |
*** rfolco has joined #openstack-infra | 23:25 | |
fungi | okay, manually recursed | 23:25 |
fungi | i'll work through the others | 23:25 |
clarkb | and now image-list doesn't show that image anymore | 23:26 |
clarkb | really the key thing here is to avoid operating on zk when nodepool may be operating on those nodes, but we know nodepool won't do that because this cloud doesn't exist anymore | 23:26 |
*** mattw4 has quit IRC | 23:27 | |
*** lastmikoi has joined #openstack-infra | 23:27 | |
fungi | how did you identify the /nodepool/images/ubuntu-bionic-arm64/builds/0000001978 node? is that the build id reported by nodepool image-list? or the upload id, or something else entirely? | 23:28 |
clarkb | 1978 is the build id for that image name | 23:28 |
fungi | i'm looking at build id 0000012627 upload id 0000010921 for debian-stretch-arm64 in linaro-cn1 | 23:28 |
clarkb | then the 00000...1 you removed under provider is the upload id for the provider | 23:29 |
fungi | oh... i need to make the image name in the path match too | 23:29 |
clarkb | nodepool/images/debian-stretch-arm64/builds/0000012627/providers/linaro-cn1 | 23:29 |
* fungi smacks forehead | 23:29 | |
clarkb | yup | 23:29 |
clarkb | in old nodepool the build id was unique globally | 23:30 |
clarkb | but now it is per image name | 23:30 |
fungi | okay, it's working as expected | 23:30 |
fungi | one more down | 23:31 |
fungi | will work my way through the rest | 23:31 |
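The manual recursion fungi works through can be sketched as a zk-shell session; the paths and ids are the actual ones from this exchange, and leaves go first because, as seen above, a plain `rm` refuses non-empty nodes:

```shell
# Inside zk-shell, delete one leaked upload record bottom-up
# (the "lock" child may or may not exist, per the earlier question):
rm /nodepool/images/debian-stretch-arm64/builds/0000012627/providers/linaro-cn1/images/0000010921
rm /nodepool/images/debian-stretch-arm64/builds/0000012627/providers/linaro-cn1/images/lock
rm /nodepool/images/debian-stretch-arm64/builds/0000012627/providers/linaro-cn1/images
rm /nodepool/images/debian-stretch-arm64/builds/0000012627/providers/linaro-cn1
```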
openstackgerrit | James E. Blair proposed zuul/zuul-helm master: Add Zuul charts https://review.opendev.org/700460 | 23:31 |
fungi | okay, linaro-cn1 entries are entirely gone from image-list | 23:34 |
*** pcaruana has quit IRC | 23:35 | |
clarkb | confirmed | 23:37 |
clarkb | that leaves us with figuring out linaro-london image situation | 23:37 |
clarkb | my image delete is still sitting there | 23:37 |
*** rfolco has quit IRC | 23:40 | |
fungi | i still need to wash dishes, but can probably help once i'm done | 23:43 |
clarkb | I'm going to try manually deleting the other images that have leaked there and see if any act different than the random one I selected first | 23:44 |
clarkb | I can talk to the api because image show on that image name works | 23:46 |
clarkb | unless there is layer 7 firewalling we shouldn't be getting lost that way | 23:46 |
clarkb | adding --debug to the osc command shows it getting all the way to the delete request on the image uuid | 23:47 |
clarkb | so also not getting lost somewhere in between due to name lookups | 23:48 |
clarkb | that makes me think it is likely a cloud problem | 23:49 |
clarkb | kevinz: http://paste.openstack.org/show/788263/ is a list of images that nodepool has been trying to delete in the linaro london cloud. Manually attempting to delete them shows the commands getting as far as the DELETE http request but they seem to hang there | 23:52 |
clarkb | kevinz: nodepool not being able to clean up these images has meant it kept them around on disk which ended up filling the disk on our builder node. | 23:52 |
clarkb | kevinz: hrw: maybe I'm thinking this must be something on the cloud side as I am able to show those images just fine (implying api access is otherwise working) | 23:52 |
clarkb | is that something you can look into when you get a chance? | 23:53 |
fungi | in good news, /opt utilization on nb01 is falling rapidly | 23:55 |
clarkb | kevinz: hrw I realize it is likely your weekend now so no rush. We can pick this up next week | 23:56 |
*** michael-beaver has quit IRC | 23:57 | |
clarkb | fungi: ya I'm not sure there is much else we can do re deleting these images | 23:58 |
openstackgerrit | James E. Blair proposed zuul/zuul-helm master: Allow tenant config file to be managed externally https://review.opendev.org/702057 | 23:58 |
clarkb | lets see if kevinz can help on monday | 23:58 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!