*** tosky has quit IRC | 00:01 | |
fungi | infra-root: we got a ticket from rackspace saying the host for paste01 is "showing imminent signs of hardware failure" so looks like they're going to migrate the instance. maybe related to the connectivity issue this time? maybe coincidence? maybe the migration will fix the connectivity issue anyway? place your bets! | 00:14 |
ianw | fungi: my bet is that it's something to do with migration that then breaks the ipv6 | 00:19 |
clarkb | ianw: I do think that fixture would be helpful. Is it ready for review? | 00:41 |
ianw | clarkb: yep | 00:41 |
ianw | and it's used in the follow-on to autogen the ssl check list | 00:42 |
*** Meiyan has joined #opendev | 00:59 | |
*** ysandeep|away is now known as ysandeep | 01:02 | |
fungi | ianw: ooh, interesting theory... leftover routes or neighbor discovery responses for the old host? | 01:26 |
fungi | something cached | 01:27 |
ianw | yeah, i don't think we could tell without backend access | 01:28 |
fungi | agreed | 01:28 |
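A quick way to confirm the kind of per-address-family reachability problem being discussed here is a TCP probe restricted to IPv4 or IPv6. This is only an illustrative sketch: the host and port are placeholders for whatever service is being checked (paste serves HTTP), and it is not how the issue was actually diagnosed.

    import socket

    def reachable(host: str, family: int, port: int = 80, timeout: float = 5.0) -> bool:
        """Try a TCP connect to `host`, restricted to one address family."""
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror:
            return False  # no A/AAAA record for that family
        for fam, socktype, proto, _, sockaddr in infos:
            try:
                with socket.socket(fam, socktype, proto) as sock:
                    sock.settimeout(timeout)
                    sock.connect(sockaddr)
                    return True
            except OSError:
                continue
        return False

    for fam, label in ((socket.AF_INET, "ipv4"), (socket.AF_INET6, "ipv6")):
        print(label, reachable("paste.openstack.org", fam))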
*** mlavalle has quit IRC | 02:20 | |
*** elod has quit IRC | 03:23 | |
*** elod has joined #opendev | 03:35 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add tool to export Rackspace DNS domains to bind format https://review.opendev.org/728739 | 04:00 |
*** Meiyan has quit IRC | 04:11 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add tool to export Rackspace DNS domains to bind format https://review.opendev.org/728739 | 04:20 |
*** sshnaidm is now known as sshnaidm|off | 04:33 | |
*** ykarel|away is now known as ykarel | 04:34 | |
ianw | infra-root: ^ i have done a manual run of that tool and the results are in bridge:/var/lib/rax-dns-backup | 04:42 |
ianw | clarkb: did you get an answer on whether we could post the openstack.org domain for audit on a public tool? | 04:43 |
clarkb | ianw: fungi (or I) were going to share the output with them privately and have them double check first | 04:48 |
clarkb | I don't think that has happened yet | 04:48 |
clarkb | but with the info on bridge that will make it easy | 04:49 |
ianw | np, we end up with 39 domains dumped all up when we walk the domain list | 04:50 |
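For context, a very rough sketch of what a Rackspace DNS export tool like the one in review 728739 has to do: authenticate, list the account's domains, and ask the Cloud DNS API for a BIND-format export of each. This is not ianw's implementation; the endpoint paths and response fields below are assumptions based on the Cloud DNS v1.0 API (whose export call is asynchronous and has to be polled), so treat it as an outline only.

    import os
    import time
    import requests

    # Endpoint paths and JSON field names are assumptions about the Rackspace
    # Cloud DNS v1.0 API, not taken from the actual tool in 728739.
    IDENTITY = "https://identity.api.rackspacecloud.com/v2.0/tokens"
    auth = requests.post(IDENTITY, json={"auth": {"RAX-KSKEY:apiKeyCredentials": {
        "username": os.environ["RAX_USERNAME"], "apiKey": os.environ["RAX_API_KEY"]}}})
    auth.raise_for_status()
    access = auth.json()["access"]
    token = access["token"]["id"]
    tenant = access["token"]["tenant"]["id"]
    dns = f"https://dns.api.rackspacecloud.com/v1.0/{tenant}"
    headers = {"X-Auth-Token": token}

    domains = requests.get(f"{dns}/domains", headers=headers).json()["domains"]
    for domain in domains:
        # export is asynchronous: the initial call returns a job to poll (assumed)
        job = requests.get(f"{dns}/domains/{domain['id']}/export", headers=headers).json()
        while job.get("status") not in ("COMPLETED", "ERROR"):
            time.sleep(2)
            job = requests.get(job["callbackUrl"], headers=headers,
                               params={"showDetails": "true"}).json()
        if job.get("status") == "COMPLETED":
            with open(f"{domain['name']}.zone", "w") as fh:
                fh.write(job["response"]["contents"])  # BIND zone text (assumed field)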
*** sgw has quit IRC | 06:01 | |
*** slaweq has joined #opendev | 06:57 | |
openstackgerrit | zhangboye proposed openstack/diskimage-builder master: Add py38 package metadata https://review.opendev.org/730220 | 07:04 |
*** ysandeep is now known as ysandeep|afk | 07:12 | |
*** ysandeep|afk is now known as ysandeep | 07:34 | |
*** tosky has joined #opendev | 07:34 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: WIP: Use py3 with elastic-recheck https://review.opendev.org/729336 | 07:39 |
*** DSpider has joined #opendev | 07:51 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed openstack/diskimage-builder master: Validate virtualenv and pip https://review.opendev.org/707104 | 07:58 |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:01 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed zuul/zuul-jobs master: Bump ansible-lint to 4.3.0 https://review.opendev.org/702679 | 08:04 |
*** tkajinam_ has quit IRC | 08:05 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: WIP: Use py3 with elastic-recheck https://review.opendev.org/729336 | 08:26 |
*** lpetrut has joined #opendev | 08:26 | |
*** larainema has joined #opendev | 08:29 | |
*** hashar has joined #opendev | 08:32 | |
*** elod has quit IRC | 08:43 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed zuul/zuul-jobs master: revoke-sudo: improve sudo removal https://review.opendev.org/703065 | 08:44 |
*** elod has joined #opendev | 08:50 | |
*** ykarel is now known as ykarel|lunch | 08:56 | |
*** elod has quit IRC | 08:56 | |
*** elod has joined #opendev | 08:58 | |
*** ysandeep is now known as ysandeep|lunch | 09:09 | |
*** elod has quit IRC | 09:10 | |
*** elod has joined #opendev | 09:10 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed zuul/zuul-jobs master: bindep: Add missing virtualenv and fixed repo install https://review.opendev.org/693637 | 09:10 |
openstackgerrit | RotanChen proposed openstack/diskimage-builder master: The old link does't work,this one does. https://review.opendev.org/730286 | 09:35 |
slaweq | fungi: mordred clarkb: thx a lot, sorry but I was busy yesterday and missed what You told me here. I will try to use that NODEPOOL_MIRROR_HOST variable next week in neutron-tempest-plugin jobs | 09:51 |
*** ysandeep|lunch is now known as ysandeep | 10:01 | |
*** yuri has joined #opendev | 10:04 | |
*** ykarel|lunch is now known as ykarel | 10:10 | |
*** hashar has quit IRC | 10:31 | |
hrw | clarkb: thanks for the invitation. Will discuss with my manager and then reply. | 10:39 |
*** roman_g has joined #opendev | 11:01 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner https://review.opendev.org/728684 | 11:26 |
fungi | ianw: clarkb: i shared the original list with them yesterday... consensus was there's nothing sensitive in there to worry about, but lots of abandoned records they plan to clean up | 11:40 |
zbr | what is the status of gentoo support in zuul-jobs? i see failures like https://zuul.opendev.org/t/zuul/build/ddc06a12b0f44d7a991cc4799c98b7cc | 11:56 |
zbr | can we make it non voting? | 11:56 |
zbr | that reminds me of an older question: who decides when to add/drop support for a specific operating system in zuul-roles? | 11:57 |
zbr | it can easily grow out of control, especially by introducing less mainstream platforms | 11:57 |
*** priteau has joined #opendev | 12:33 | |
*** hashar has joined #opendev | 12:36 | |
hrw | who can I talk with about build-wheel-mirror-* CI jobs? | 12:45 |
fungi | probably any of us, what's the question? | 12:46 |
hrw | I should probably have found it 2-3 years ago ;D | 12:47 |
hrw | from what I see it is used by requirements to build x86-64 wheels and push them to infra mirrors | 12:47 |
hrw | looks like I should add aarch64 to it and then all aarch64 builds will speed up a lot | 12:48 |
hrw | as numpy/scipy/grpcio etc will be already built as binary wheels on infra mirrors | 12:48 |
hrw | am I right? | 12:48 |
AJaeger | hrw: https://review.opendev.org/#/c/550582 was pushed 2 years ago but never moved forward, not sure why. that gives a start. | 12:50 |
AJaeger | hrw: yes, that should speedup the builds | 12:50 |
hrw | AJaeger: will concentrate on getting it working | 12:51 |
hrw | I had no idea that such thing exists | 12:51 |
openstackgerrit | Sorin Sbarnea (zbr) proposed zuul/zuul-jobs master: Minor documentation rephrase https://review.opendev.org/728640 | 12:52 |
hrw | it is not used even on x86-64 | 12:53 |
hrw | as it is run only when bindep changes instead of when upper-constraints does | 12:53 |
zbr | fungi: clarkb ok to merge https://review.opendev.org/#/c/729974/ ? | 12:53 |
fungi | hrw: i think we were previously waiting to have a stable arm64/aarch64 provider to run the job in, but now that we do we should be able to run a mirror-update job there | 12:54 |
fungi | hrw: we run a periodic job, hold on i'll find it | 12:54 |
hrw | fungi: thanks | 12:55 |
AJaeger | hrw: https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L5442 - it's run every day | 12:56 |
AJaeger | fungi, do we need to publish wheels for focal and CentOS-8 as well? I don't see them | 12:57 |
fungi | AJaeger: eventually, i expect | 12:58 |
hrw | AJaeger: ok. so have to add job there | 12:58 |
fungi | infra-root: bridge.o.o can now reach paste.o.o over ipv6, so may have been related to (or fixed by) host migration after all | 12:58 |
hrw | AJaeger: c7 wheels should work on every other distro (maybe not xenial) | 12:59 |
hrw | manylinux2014 PEP defines c7 as base | 12:59 |
hrw | https://review.opendev.org/#/c/728798/ finally is able to build all wheels as a CI job. With one Debian package build on the way | 13:00 |
*** ykarel is now known as ykarel|afk | 13:02 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Undefined envlist should behave like tox -e ALL https://review.opendev.org/730322 | 13:12 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Undefined envlist should behave like tox -e ALL https://review.opendev.org/730322 | 13:17 |
hrw | need to find a job which makes use of those wheels from infra mirror | 13:22 |
hrw | ok, I see it used to create venv. now, in kolla we need to sneak it into being used in build too | 13:24 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Undefined envlist should behave like tox -e ALL https://review.opendev.org/730322 | 13:27 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 13:28 |
openstackgerrit | Marcin Juszkiewicz proposed openstack/project-config master: build and publish wheels for more distributions (x86-64) https://review.opendev.org/730323 | 13:31 |
hrw | AJaeger: here you have - focal, buster, centos8 | 13:31 |
hrw | but it is probably not complete | 13:32 |
hrw | release.yaml has an afs_volume list which needs to be filled with extra entries | 13:33 |
hrw | I may only guess their names | 13:34 |
fungi | hrw: if you're looking for the magic to get those provider-local wheelhouse caches, it's done with the /etc/pip.conf our base jobs install on all nodes | 13:34 |
hrw | fungi: thanks! | 13:34 |
fungi | so if you need them in a container chroot or something you could bindmount that in | 13:34 |
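The /etc/pip.conf fungi mentions is what points pip at both the PyPI proxy and the per-distro/arch wheel cache on the provider-local mirror. Below is a minimal sketch of generating such a file; the exact option names and mirror paths are assumptions (the wheel/<distro>-<arch>/ layout matches the download URL that shows up later in this log), not a copy of the real base-job template.

    # Sketch of an /etc/pip.conf pointing at a provider-local mirror.  Option
    # names and paths are assumptions, not the real opendev template; the wheel
    # path layout matches mirror.bhs1.ovh.opendev.org/wheel/ubuntu-18.04-x86_64/.
    import configparser

    mirror = "http://mirror.bhs1.ovh.opendev.org"   # per-provider mirror host
    wheel_volume = "wheel/ubuntu-18.04-x86_64"      # distro-arch wheel cache

    conf = configparser.ConfigParser()
    conf["global"] = {
        "index-url": f"{mirror}/pypi/simple",            # PyPI proxy (assumed path)
        "extra-index-url": f"{mirror}/{wheel_volume}/",  # prebuilt wheels for this platform
        "timeout": "60",
    }

    with open("pip.conf", "w") as fh:   # would be /etc/pip.conf on a job node
        conf.write(fh)

For the container chroot case fungi mentions, bind-mounting the host's /etc/pip.conf in gives the same behaviour without baking the mirror into the final image.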
*** lpetrut has quit IRC | 13:35 | |
hrw | cool | 13:35 |
openstackgerrit | Marcin Juszkiewicz proposed openstack/project-config master: build and publish wheels for more distributions (x86-64) https://review.opendev.org/730323 | 13:36 |
hrw | with afs_volume names in it. guessed ones, so someone needs to take a look and fix them | 13:36 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Undefined envlist should behave like tox -e ALL https://review.opendev.org/730322 | 13:36 |
hrw | now time to add arm64 ones on top | 13:36 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Undefined envlist should behave like tox -e ALL https://review.opendev.org/730322 | 13:43 |
openstackgerrit | Marcin Juszkiewicz proposed openstack/project-config master: build and publish wheels for more distributions (x86-64) https://review.opendev.org/730323 | 13:44 |
hrw | Debian needs py2 too | 13:44 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 13:44 |
zbr | AJaeger: how big of a gentoo fan are you? | 13:45 |
mordred | zbr: you want prometheanfire for gentoo things | 13:47 |
openstackgerrit | Sorin Sbarnea (zbr) proposed zuul/zuul-jobs master: Revert "Add Gentoo integration tests" https://review.opendev.org/730329 | 13:47 |
zbr | mordred: i asked AJaeger because the change originated from him and created jobs that were not triggered by the addition itself. | 13:48 |
zbr | hmm, in fact i am a bit wrong, they only recently got broken. | 13:48 |
zbr | i still hope to get an answer about the platform support in general, i do not find the current setup sustainable | 13:50 |
zbr | very often when I want to touch a role I end up discovering that the role was already broken on a less-than-mainstream platform. | 13:51 |
hrw | which version would be better: build/publish-wheel-mirror job definitions for x86-64 and then same for aarch64, or rather grouped by distro so build/publish-c7, build/publish-c7-arm64 etc? | 13:52 |
AJaeger | zbr: they were added for completeness. | 13:52 |
zbr | but nobody knew because we do not have periodic on them and also no owners. | 13:52 |
AJaeger | fungi, mordred, do we really need all these different wheels per OS version? | 13:52 |
AJaeger | zbr: prometheanfire is the local Gentoo expert | 13:52 |
zbr | maybe we should run all zuul-jobs once a week to get an idea about what went broken... naturally. | 13:54 |
zbr | a bit-rot pipeline | 13:54 |
*** owalsh has quit IRC | 13:55 | |
AJaeger | and who monitors that one? | 13:56 |
mordred | AJaeger: yeah - if we don't build per-os and per-arch wheels the wheel mirror won't work | 13:56 |
mordred | I mean- it won't work for those arches | 13:56 |
mordred | so - we should build wheels for every arch we have in the gate | 13:57 |
mordred | s/arch/arch-distro-combo/ | 13:57 |
zbr | we can send email on failures, i would not mind looking at it. i would also take responsibility to fix the redhat ones. | 13:57 |
zbr | we can now assume that everything is fine, because we do not run them, but we have no idea how many are in the same situation. | 13:58 |
zbr | maybe we can run every 10, or 14 days, that is only an implementation detail. | 13:58 |
*** priteau has quit IRC | 13:59 | |
zbr | travis has a very neat feature that allows a conditional cron, one that runs only if nothing ran recently, but that is not possible for us. | 14:00 |
zbr | still zuul-jobs is really high-profile imho | 14:00 |
mordred | zbr: the idea of a conditional periodic has come up before - I think it would have to wait for zuul v4 (which isn't too far away) because the scheduler would have to ask the database if a job has been run recently and the db is currently optional | 14:02 |
AJaeger | mordred: I see, seems we missed a few when setting up. This needs a bit of review. | 14:02 |
mordred | zbr: saying that - I still don't know how feasible it would be for us - just that it would _definitely_ require v4 | 14:03 |
mordred | I haven't actually thought about it from a design perspective | 14:03 |
zbr | mordred: super. clearly db would enable lots of useful things. | 14:03 |
mordred | yeah. that's the main v4 thing - the db becomes mandatory instead of optional (also TLS for zk) | 14:04 |
mordred | because from an ops pov, the db all of a sudden becoming mandatory is a breaking change :) | 14:05 |
zbr | probably would make it easy to implement regression detection compared with last-passed-build (coverage going down, more warnings,....) | 14:05 |
mordred | we're pretty sure _everyone_ has a db though | 14:05 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: tox: empty envlist should behave like tox -e ALL https://review.opendev.org/730322 | 14:06 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 14:06 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fetch-tox-output: empty envlist should behave like tox -e ALL https://review.opendev.org/730334 | 14:06 |
*** owalsh has joined #opendev | 14:13 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: tox: empty envlist should behave like tox -e ALL https://review.opendev.org/730322 | 14:24 |
openstackgerrit | Marcin Juszkiewicz proposed openstack/project-config master: build and publish wheels for aarch64 architecture https://review.opendev.org/730342 | 14:25 |
hrw | fungi, AJaeger: please take a look. | 14:26 |
*** ykarel|afk is now known as ykarel | 14:27 | |
AJaeger | hrw: both look good but I'll let fungi et al. review it since it needs manual steps | 14:34 |
*** sgw has joined #opendev | 14:39 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fetch-tox-output: empty envlist should behave like tox -e ALL https://review.opendev.org/730334 | 14:40 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 14:40 |
hrw | AJaeger: thanks. I am aware that some changes may need manual work. Just wanted to know that changes are more or less fine | 14:47 |
hrw | 2020-05-22 14:37:14.511378 | primary | INFO:kolla.common.utils.kolla-toolbox: Downloading http://mirror.bhs1.ovh.opendev.org/wheel/ubuntu-18.04-x86_64/distlib/distlib-0.3.0-py3-none-any.whl (340 kB) | 14:47 |
hrw | mirror will be in use ;D | 14:48 |
hrw | Have to think whether it (pip.conf) should be included in final images or not. distro repos are | 14:49 |
prometheanfire | zbr: hi? | 14:52 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fetch-tox-output: empty envlist should behave like tox -e ALL https://review.opendev.org/730334 | 14:53 |
zbr | prometheanfire: hi! if you can help with https://review.opendev.org/#/c/728640/ it would be great, gentoo error is unrelated to the test patch. | 14:55 |
zbr | feel free to reuse the patch | 14:56 |
*** mlavalle has joined #opendev | 14:56 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 15:04 |
prometheanfire | zbr: looks like one host worked, the other failed, so going to recheck | 15:11 |
zbr | if you check the build history you will see that it started to fail a few days ago, and it is not random. | 15:11 |
prometheanfire | zbr: the nature of the error, is it always that it can't see the ovs bridge? | 15:12 |
zbr | https://zuul.opendev.org/t/zuul/builds?job_name=zuul-jobs-test-multinode-roles-gentoo-17-0-systemd&project=zuul/zuul-jobs | 15:13 |
prometheanfire | it looks like we stabilized openvswitch-2.13.0 on the 11th | 15:13 |
zbr | i bet something happened between 7th and 9th. | 15:13 |
openstackgerrit | Sagi Shnaidman proposed zuul/zuul-jobs master: WIP Add ansible collection roles https://review.opendev.org/730360 | 15:14 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 15:19 |
prometheanfire | zbr: to be honest I've been waiting on https://review.opendev.org/717177 to merge | 15:27 |
prometheanfire | it's what I'm using for http://distfiles.gentoo.org/experimental/amd64/openstack/ at least | 15:27 |
mordred | prometheanfire: +2 | 15:29 |
zbr | i am clueless about ^ but if that is fixing it, merge it. | 15:29 |
prometheanfire | it helps simplify the image build process imo | 15:30 |
prometheanfire | atm, upstream is shipping an older kernel for instance | 15:30 |
zbr | in that case I will make the gentoo job nv. | 15:34 |
prometheanfire | ya, atm that sounds fine | 15:35 |
zbr | mordred: how to make the job nv without breaking update-test-platforms ? | 15:38 |
*** hashar has quit IRC | 15:40 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 15:41 |
openstackgerrit | Sorin Sbarnea (zbr) proposed zuul/zuul-jobs master: Disable broken gentoo job nv https://review.opendev.org/728640 | 15:43 |
zbr | for some reason removing the auto-generated tag and adding voting: false has a nasty side effect: update-test-platforms creates a duplicate. | 15:46 |
*** ykarel is now known as ykarel|away | 15:57 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 15:59 |
* mordred afks for a bit | 16:00 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 16:03 |
openstackgerrit | Sorin Sbarnea (zbr) proposed zuul/zuul-jobs master: Make gentoo jobs nv https://review.opendev.org/728640 | 16:03 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 16:14 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 16:18 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 16:31 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 16:53 |
*** ysandeep is now known as ysandeep|away | 16:54 | |
*** cmurphy is now known as cmorpheus | 17:03 | |
corvus | i *think* we expect the base playbook to run successfully now? i'll re-enqueue that change again | 17:03 |
openstackgerrit | Merged opendev/system-config master: Use ipv4 in inventory https://review.opendev.org/730144 | 17:20 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 17:33 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: tox: envlist bugfixes https://review.opendev.org/730381 | 17:33 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed POLLIN event check https://review.opendev.org/729966 | 17:35 |
corvus | base worked. letsencrypt failed. | 17:36 |
corvus | le failed on nb01 and nb02 | 17:37 |
corvus | not entirely sure what nb01 and nb02 are doing with ssl certs... | 17:38 |
corvus | /opt is full on both of those hosts | 17:40 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed POLLIN event check https://review.opendev.org/729966 | 17:45 |
corvus | infra-root: anyone else around? there seem to be some nodepool problems | 17:46 |
corvus | it looks like there's a db error related to the dib image records | 17:46 |
corvus | and we seem to have a whole bunch of failed image uploads | 17:46 |
corvus | i'm going to start looking into the db error since it's preventing use of a diagnostic tool ("nodepool dib-image-list" fails) | 17:47 |
*** tosky has quit IRC | 17:50 | |
*** tosky has joined #opendev | 17:51 | |
corvus | the znode for build 0000124190 exists but is empty | 17:56 |
corvus | but it does have a providers/vexxhost-ca-ymq-1/images directory (which is also empty) | 17:57 |
corvus | hrm, we should be doing a recursive delete on the build znodes when we delete it, so it shouldn't have mattered that there are nodes under it | 17:58 |
corvus | i can't think of what may have gone wrong; perhaps a zk conflict of some kind | 18:01 |
zbr | infra-root: the POLLPRI change is ready for review at https://review.opendev.org/#/c/729966/ | 18:02 |
corvus | zbr: you can use infra-core to notify infra folks with core approval rights (not the smaller set with root access) | 18:03 |
zbr | tx, time to update the magic keyword list. | 18:04 |
corvus | #status log manually deleted empty znode /nodepool/images/centos-7/builds/0000124190 | 18:05 |
openstackstatus | corvus: finished logging | 18:05 |
zbr | my hopes are quite low around paramiko, it does not have an active community | 18:06 |
corvus | okay now, i can see that we have znodes for 28k failed builds | 18:06 |
corvus | hopefully without the dead znode there, they'll get cleaned up | 18:06 |
corvus | yes, that number is slowly decreasing; i think the thing to do now is to let it run for a bit and see what gets automatically cleaned up | 18:08 |
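For the record, the manual fix corvus describes amounts to deleting a build znode whose data payload is empty. A minimal kazoo sketch of that kind of cleanup is below; the ZooKeeper host is a placeholder, the path is the one from the status log above, and this is not the builder's own cleanup code.

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk01.example.org:2181")  # placeholder ensemble
    zk.start()

    path = "/nodepool/images/centos-7/builds/0000124190"
    data, stat = zk.get(path)
    if not data:  # an empty payload is what broke the JSON decoding
        # recursive delete also removes the empty providers/... subtree
        zk.delete(path, recursive=True)

    zk.stop()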
corvus | nb02 has already managed to recover some space on /opt | 18:09 |
zbr | corvus: give https://review.opendev.org/#/c/729974/ a kick if you do not mind, that use of lowercase l, drives me crazy. | 18:12 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/elastic-recheck master: Resolve unsafe yaml.load use https://review.opendev.org/730389 | 18:22 |
fungi | corvus: i'm back now, can look into nodepool problems | 18:37 |
fungi | thanks for finding/clearing the dead znode. i'll try to keep an eye on it | 18:38 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner https://review.opendev.org/728684 | 18:39 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: tox: empty envlist should behave like tox -e ALL https://review.opendev.org/730322 | 18:41 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fetch-tox-output: empty envlist should behave like tox -e ALL https://review.opendev.org/730334 | 18:41 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: tox: envlist bugfixes https://review.opendev.org/730381 | 18:41 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv https://review.opendev.org/726830 | 18:41 |
fungi | looks like they started growing around the 18th | 18:44 |
corvus | it looks like we may have another znode in a similar situation | 19:02 |
corvus | (presumably newly placed into this situation) | 19:02 |
corvus | i'll dig after lunch | 19:05 |
hrw | fungi: can you take a look at https://review.opendev.org/#/c/730323/ and https://review.opendev.org/#/c/730342/ patches? And add whoever is needed to get AFS volumes created? | 19:15 |
fungi | hrw: i can create them, just may not get to it until next week. trying to take today through monday off except for urgent crises | 19:17 |
fungi | it may end up being straightforward, but i need to check quotas to see how much room we have | 19:18 |
fungi | and how much we've allocated to the other wheel volumes | 19:18 |
hrw | fungi: no problem | 19:20 |
hrw | fungi: get some rest etc. I know the feeling. Spent too much time recently on yak shaving.. | 19:21 |
hrw | fungi: https://marcin.juszkiewicz.com.pl/2020/05/21/from-a-diary-of-aarch64-porter-firefighting/ ;D | 19:22 |
fungi | Error uploading image opensuse-15 to provider airship-kna1: [...] json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) | 19:22 |
fungi | problem with the citycloud api responses? | 19:23 |
fungi | nope. also see it for ovh-bhs1 | 19:24 |
fungi | ahh, yeah this is bubbling up from nodepool.zk._bytesToDict() so presumably the same thing corvus saw earlier | 19:25 |
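The traceback fungi quotes is what the JSON decoder raises when handed an empty znode payload (nodepool.zk._bytesToDict is named in the log; that it boils down to json.loads is an assumption). The exact error message is easy to reproduce:

    import json

    try:
        json.loads(b"")   # an empty znode payload
    except json.decoder.JSONDecodeError as exc:
        print(exc)        # -> Expecting value: line 1 column 1 (char 0)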
*** elod has quit IRC | 19:26 | |
* hrw off | 19:26 | |
fungi | looks like the current opensuse-15 image may be the commonality | 19:26 |
hrw | have a nice weekend folks | 19:26 |
fungi | thanks hrw, you too! | 19:26 |
*** elod has joined #opendev | 19:27 | |
fungi | oh, yeah, nodepool dib-image-list even returns the same error | 19:28 |
fungi | and traceback | 19:28 |
fungi | refreshed my memory on getting zk-shell to work, looks like we have 7 znodes under /nodepool/images/opensuse-15 | 19:37 |
fungi | though in retrospect, opensuse-15 may have been showing up in the errors because that was just the image it was in the process of trying to upload | 19:37 |
fungi | since dib-image-list is also generally returning an error, possible it could be anywhere in the /nodepool/images tree i suppose | 19:38 |
fungi | oof, running `tree` there exceeds my buffer | 19:41 |
fungi | most of the tree looks reasonable except ubuntu-xenial, ubuntu-bionic, and opensuse-tumbleweed, which each have thousands of empty builds | 19:47 |
*** slaweq has quit IRC | 19:49 | |
*** roman_g has quit IRC | 19:51 | |
*** roman_g has joined #opendev | 19:53 | |
corvus | fungi: back | 20:04 |
corvus | fungi: i'll see if i can figure out what znode is borked | 20:05 |
*** jesusaur has joined #opendev | 20:06 | |
corvus | it's /nodepool/images/opensuse-15/builds/0000089491 | 20:11 |
corvus | it has providers/airship-kna1/images under it | 20:12 |
corvus | which is empty, similar to before | 20:12 |
*** jesusaur has quit IRC | 20:37 | |
*** jesusaur has joined #opendev | 20:37 | |
fungi | yep, sorry, had to jump to dinner mode, back again | 20:42 |
*** lpetrut has joined #opendev | 20:43 | |
fungi | okay, so an empty build tree is fine as long as it doesn't have an empty image provider list in it? | 20:43 |
fungi | er, empty provider image | 20:43 |
corvus | i've been looking at the code, and i think we're seeing issues with multiple builders racing and the lock node being held underneath the thing we're deleting | 20:43 |
corvus | fungi: no, it's never okay | 20:43 |
corvus | fungi: but i think it's a clue as to why the node is still there | 20:43 |
fungi | got it, so the slew of empty image build znodes is likely a symptom of that one empty provider upload znode? | 20:44 |
fungi | and each time a builder throws an exception trying to parse that empty upload znode it leaves another empty build znode behind? | 20:45 |
corvus | oh no idea about that | 20:45 |
corvus | i'm not sure if there's a quick code fix for this... i'm inclined to just attempt to get things cleaned up for the weekend though and hope that whatever triggered this doesn't happen again for a while | 20:46 |
corvus | (i think we've learned not to put the lock node under the thing we're locking in future designs) | 20:47 |
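The design problem corvus points at: the per-build lock lives underneath the build znode itself, and ZooKeeper recursive deletes are not atomic, so a delete can interleave with another builder taking the lock. Below is a toy illustration using kazoo; it relies on generic ZooKeeper semantics and placeholder hosts/paths, not the actual nodepool code paths.

    from kazoo.client import KazooClient
    from kazoo.exceptions import NotEmptyError

    # Two clients standing in for two builders; host is a placeholder.
    zk_a = KazooClient(hosts="zk01.example.org:2181")
    zk_b = KazooClient(hosts="zk01.example.org:2181")
    zk_a.start()
    zk_b.start()

    build = "/nodepool/images/example/builds/0000000001"
    zk_a.ensure_path(build)

    # Builder A begins a recursive delete: it lists children first (none yet)...
    children = zk_a.get_children(build)

    # ...meanwhile builder B takes a lock by creating a child under the build.
    zk_b.create(build + "/lock", ephemeral=True)

    # A deletes the children it saw, then the parent, which now fails because a
    # child it never saw has appeared; depending on timing the lock holder can
    # instead find its lock deleted out from under it.
    try:
        zk_a.delete(build)
    except NotEmptyError:
        print("recursive delete raced with a lock taken under the build znode")

    zk_a.stop()
    zk_b.stop()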
corvus | i think the best way out of this is to shut down nb02, clear out the empty znode, then let nb01 do all its cleanup, then start up nb02 again | 20:47 |
fungi | that sounds reasonable. why only nb02? is it the source of the trouble? | 20:49 |
corvus | no, just so it's not racing nb01 | 20:49 |
fungi | oh! right | 20:49 |
fungi | so either 01 or 02 just doesn't have to be both | 20:49 |
corvus | yep | 20:49 |
fungi | are you doing that or shall i? | 20:50 |
corvus | i am | 20:50 |
corvus | nb02 is off, and i've deleted the znode | 20:50 |
fungi | cool, thanks! | 20:50 |
fungi | and we expect those other empty build znodes to clear out on their own | 20:50 |
corvus | there was only one empty znode | 20:50 |
corvus | nodepool dib-image-list succeeds now; and reports ~4400 builds | 20:51 |
corvus | so we're close to bottoming out | 20:51 |
fungi | if i do tree for /nodepool/images i see a ton like ubuntu-xenial/builds/0000109279 with nothing under them... that's what i meant by empty | 20:51 |
*** DSpider has quit IRC | 20:51 | |
corvus | i meant if you "get" them you get the empty string back | 20:52 |
corvus | that's the cause of the traceback | 20:52 |
fungi | only a few have a providers subtree | 20:52 |
fungi | ahh, okay | 20:52 |
fungi | are those leaf build trees normal then? | 20:52 |
fungi | i guess they signify an image build with no provider uploads? | 20:53 |
corvus | yes, probably because the build failed | 20:53 |
fungi | and under at least some conditions we don't clear them out i suppose | 20:54 |
corvus | one of those conditions is when everything is broke because of corrupt data | 20:55 |
fungi | right, so likely a symptom of the problem with the empty upload znode you removed | 20:58 |
corvus | okay, i think nb01 has finished clearing out its stuff, i'm going to stop it and restart nb02 | 20:59 |
fungi | watching the builder log on nb01, exceptions now seem (so far) to be only about failures to delete backing images for bfv in vex | 20:59 |
fungi | sounds good | 20:59 |
corvus | okay restarting nb02 now | 21:13 |
corvus | er nb01 | 21:13 |
corvus | looking at the image list now, it seems like we have some images that i would expect to be deleted but aren't | 21:15 |
corvus | example: | ubuntu-xenial-0000099848 | ubuntu-xenial | nb01 | qcow2,raw,vhd | ready | 19:01:35:13 | | 21:15 |
corvus | there are 3 newer images than that, and no uploads for it, so it should be gone | 21:15 |
*** lpetrut has quit IRC | 21:15 | |
fungi | yeah, and i don't see it in any providers according to nodepool image-list | 21:17 |
fungi | the zk tree for it shows locks under each provider though | 21:18 |
corvus | oooh | 21:18 |
corvus | did we replace the build nodes? | 21:18 |
fungi | yes | 21:18 |
corvus | we did not copy over the builder ids | 21:18 |
fungi | nb01 and 02 went from openstack.org to opendev.org | 21:18 |
corvus | so everything with an nb01 or nb02 hostname is orphaned | 21:19 |
corvus | since i'm here, i'll just delete the znodes | 21:20 |
fungi | aha, and dib-image-list apparently still only shows short hostnames | 21:21 |
corvus | no it shows whatever hostname was used to build it | 21:21 |
corvus | so you can see both nb01 and nb01.opendev.org in there | 21:21 |
corvus | but i think we ran a version of nodepool that used short hostnames when we ran it on the openstack nodes | 21:22 |
fungi | ohh, okay. there was a patch which merged at one point to switch from short hostnames to full hostnames. so could those be from before that transition? | 21:22 |
fungi | yeah, got it | 21:22 |
corvus | i think we're going to leak | 0000123991 | 0000000002 | vexxhost-sjc1 | centos-7 | centos-7-1585726429 | e894339c-807d-4d46-9a36-51b2338e536d | deleting | 47:19:42:15 | | 21:23 |
corvus | since there's nothing left to delete that upload any more | 21:23 |
corvus | i mean, it'll leak on the cloud side | 21:24 |
fungi | so we probably need a todo to check our providers for orphaned images next week? | 21:24 |
*** lpetrut has joined #opendev | 21:26 | |
fungi | though odds are it'll just be vexxhost-sjc1, since we occasionally get stuck undeletable instances which lock the backing images for their boot volumes indefinitely | 21:26 |
corvus | yeah | 21:26 |
fungi | so there were likely a few when the old builders were being taken down | 21:26 |
corvus | okay, i cleaned up everything that looked unused; there are still several images in use that only existed on the old builders :/ | 21:32 |
openstackgerrit | Oleksandr Kozachenko proposed openstack/project-config master: Add openstack/heat and openstack/heat-tempest-plugin https://review.opendev.org/730419 | 21:33 |
fungi | we may be able to forcibly detach and delete the volumes which have them locked in use | 21:34 |
fungi | next week we can try https://opendev.org/opendev/system-config/src/branch/master/tools/clean-leaked-bfv.py on them if that's the problem | 21:38 |
corvus | sorry, i meant that we have uploads of images that we have no built copies of | 21:39 |
fungi | ohh, got it | 21:40 |
corvus | ie, nb01.openstack.org built opensuse-15, uploaded it everywhere, and now we can't build new ones, and we deleted the underlying image when we deleted the builder | 21:40 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner https://review.opendev.org/728684 | 21:45 |
Open10K8S | Hi team | 21:50 |
Open10K8S | Can you check this PS on project-config? https://review.opendev.org/#/c/730419/ | 21:50 |
*** lpetrut has quit IRC | 22:00 | |
openstackgerrit | Merged openstack/project-config master: Add openstack/heat and openstack/heat-tempest-plugin https://review.opendev.org/730419 | 22:17 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner https://review.opendev.org/728684 | 22:21 |
*** smcginnis has quit IRC | 22:24 | |
*** smcginnis has joined #opendev | 22:30 | |
Open10K8S | Hi team | 22:43 |
Open10K8S | Zuul deploy failed for this https://review.opendev.org/#/c/730419/ | 22:43 |
Open10K8S | The error msg is "Please check connectivity to [bridge.openstack.org:19885]" | 22:43 |
fungi | Open10K8S: that error is actually a red herring, we don't stream console logs from that node as it's our deployment bastion, output is directed to /var/log/ansible/service-zuul.yaml.log (per the failed task) so i'll check that file for the actual error | 22:49 |
fungi | exciting, most of our zuul servers, including our scheduler, were considered unreachable | 22:50 |
fungi | but it seems to be reachable from there now (via both ipv4 and ipv6) | 22:52 |
Open10K8S | fungi: ok | 22:52 |
fungi | might have been a temporary network issue in that provider, i'll try to reenqueue the commit into the deploy pipeline | 22:52 |
Open10K8S | fungi: ok | 22:53 |
mnaser | just a heads up | 22:54 |
mnaser | github is unhappy -- https://www.githubstatus.com | 22:54 |
fungi | oh, funzies | 22:57 |
fungi | thanks for the heads up, mnaser! | 22:58 |
*** mlavalle has quit IRC | 22:58 | |
fungi | since 16:41z looks like | 22:59 |
*** tosky has quit IRC | 22:59 | |
*** larainema has quit IRC | 23:00 | |
fungi | and the reenqueued deployment just bombed again | 23:00 |
Open10K8S | fungi: yeah | 23:01 |
Open10K8S | fungi: the same reason, seems like | 23:01 |
fungi | well, the reason is entirely hidden from the ci log | 23:04 |
fungi | the only reason the ci is really reporting there is "something failed during deployment" | 23:04 |
fungi | we redirect all the deployment logging to a local file on the bastion so as to avoid leaking production credentials | 23:04 |
fungi | still seeing a ton of unreachable states reported for most of the zuul servers | 23:05 |
fungi | though also this error for the scheduler: | 23:06 |
fungi | groupadd: GID '10001' already exists | 23:06 |
clarkb | fungi: could connectivity issues be related to https://review.opendev.org/730144 ? | 23:07 |
fungi | getting that for zuul01.openstack.org and ze09.openstack.org | 23:07 |
clarkb | perhaps due to ssh host keys | 23:07 |
mnaser | btw there seems to be an error relating to permission denied | 23:07 |
fungi | oh, possibly... they're all ipv4 addresses it's complaining about in the log | 23:07 |
mnaser | when checking things out | 23:07 |
mnaser | i don't know if that's just a warning _or_ maybe problematic | 23:07 |
fungi | Data could not be sent to remote host "23.253.248.30". Make sure this host can be reached over ssh: Host key verification failed. | 23:08 |
fungi | et cetera | 23:08 |
mnaser | ahhh, i am going to guess known_hosts contains hostnames and ipv6 addresses only fungi | 23:08 |
fungi | and indeed, if i `sudo ssh 23.253.248.30` from bridge.o.o i see it prompts about an unknown host key | 23:08 |
clarkb | we don't use hostnames | 23:08 |
fungi | do i need to `sudo ssh -4 ...` all of the zuul servers from bridge, or are we maintaining a configuration-managed known_hosts file? | 23:09 |
clarkb | fungi: the servers get added to known hosts with the launch node script. I expect it was only adding ipv6 records | 23:10 |
clarkb | fungi: I think that means we need to manually add the ipv4 records (or we could go back to ipv6, or we could switch to hostnames) | 23:10 |
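One way to script what clarkb describes (adding IPv4 host key entries for servers the inventory now addresses by IPv4) would be to ssh-keyscan the resolved addresses and append them to root's known_hosts on bridge. This is only a sketch with placeholder host names; what fungi actually did below was run `sudo ssh -4` to each host interactively.

    import socket
    import subprocess

    hosts = ["zuul01.openstack.org", "ze01.openstack.org"]  # illustrative list
    known_hosts = "/root/.ssh/known_hosts"                  # root's file on bridge

    with open(known_hosts, "a") as fh:
        for name in hosts:
            addr = socket.gethostbyname(name)   # the A record ansible now uses
            # ssh-keyscan emits "<addr> <keytype> <key>" lines ready to append
            out = subprocess.run(
                ["ssh-keyscan", addr],
                capture_output=True, text=True, check=True,
            ).stdout
            fh.write(out)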
fungi | manually running sudo ssh -4 to any of the zuul servers that root already had in its known_hosts file by hostname auto-added the v4 addresses without any need to confirm an unknown key | 23:14 |
fungi | though it choked on zm01-04, ze09 and ze12 | 23:15 |
fungi | for those four mergers it complained about mismatched host keys (i guess we've rebuilt them since the last time it connected to them by name) | 23:15 |
clarkb | note the gid thing is likely to prevent the scheduler from being updated too | 23:15 |
clarkb | and I'm not sure what the correct answer is there | 23:16 |
fungi | and the two executors seemed to not have entries by hostname | 23:16 |
clarkb | I think corvus expected some unhappiness that might need to be corrected? | 23:16 |
clarkb | possibly via manual edit of /etc/passwd and /etc/group | 23:16 |
clarkb | and then maybe restarting services? though the uids stay the same so restarting is probably less important | 23:16 |
fungi | let me at least check whether the change i approved for Open10K8S got applied to the scheduler | 23:17 |
fungi | but yeah, the last successful build for infra-prod-service-zuul was 2020-05-15 and today is the first time it's been triggered since | 23:19 |
fungi | so something we've merged in the past week, presumably | 23:19 |
clarkb | fungi: yes, yesterday I think. It's the zuul -> zuuld user/group name change (but not uid/gid) | 23:20 |
fungi | nope, the config addition from 730419 is not getting applied, so we're currently unable to update the tenant config it looks like | 23:20 |
clarkb | I think we half expected ansible to be angry about it | 23:20 |
clarkb | since a user and group already exist with those uids and gids | 23:21 |
openstackgerrit | Merged zuul/zuul-jobs master: Patch CoreDNS corefile https://review.opendev.org/727868 | 23:24 |
mordred | clarkb: yeah - I think we just have to manually edit the /etc/passwd and group files - I don't think we need to restart anything | 23:35 |
mordred | clarkb: the zuulcd change landed? | 23:35 |
clarkb | mordred: ya I think that stack was what corvus was trying to get applied yesterday when we ran into the problems | 23:36 |
clarkb | /etc/shadow may also need editing too | 23:36 |
mordred | clarkb: yes - almost certainly | 23:39 |
mordred | clarkb: I think it would be 'sed -i "s/ˆzuul:/zuulcd:/" /etc/passwd /etc/group /etc/shadow'' | 23:40 |
clarkb | also we probably want to audit our ssh host key problem after the ipv4 change landed. But I'm on a phone for the foreseeable future | 23:40 |
clarkb | mordred: it might be zuuld not zuulcd | 23:40 |
mordred | oh - yes, zuul d | 23:41 |
mordred | zuuld | 23:41 |
mordred | clarkb: yeah- I'm not in a great position to do a business but I could do either thing in the morning | 23:41 |
clarkb | but otherwise that looks correct to me too | 23:41 |
mordred | or - I think I can do the zuul user rename on the zuul hosts | 23:41 |
mordred | want me to try that and then try re-running service-zuul? | 23:42 |
clarkb | up to you I guess. I expect its that simple but it may not be | 23:42 |
clarkb | fungi: ^ thoughts | 23:42 |
fungi | mordred: worth a try if you're in a position to be able to | 23:45 |
mordred | ok - I just did: | 23:45 |
mordred | ansible zuul -mshell -a"grep zuul: /etc/passwd /etc/group /etc/shadow" | 23:45 |
mordred | (as a quick test) | 23:45 |
mordred | (and I had to accept a few more host keys) | 23:45 |
mordred | but I can run that now with no issues | 23:45 |
mordred | so - I think what I'd run is: ansible zuul -mshell -a"sed -i 's/ˆzuul:/zuulcd:/' /etc/passwd /etc/group /etc/shadow" | 23:47 |
mordred | ok - I ran that (but a fixed version) just on ze01.openstack.org and it seems to have worked | 23:56 |
mordred | ps now shows the zuul processes running as zuuld | 23:56 |
mordred | ansible ze01.openstack.org -mshell -a"sed -i 's/^zuul:/zuuld:/' /etc/passwd /etc/group /etc/shadow" | 23:56 |
mordred | for the record | 23:56 |
mordred | I'm going to run it across all of them | 23:56 |