openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: edit-json-file: add role to combine values into a .json https://review.opendev.org/746834 | 00:46 |
---|---|---|
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: only run docker-setup.yaml when installed https://review.opendev.org/747062 | 00:46 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: Linaro MTU workaround https://review.opendev.org/747063 | 00:46 |
ianw | hrmmm, linaro mirror issues again ... https://zuul.opendev.org/t/zuul/build/f0f9658cd3ca40ff8abb74586e6bb569/console failed getting apt | 01:13 |
ianw | doesn't seem to be responding :/ | 01:14 |
ianw | SHUTOFF | 01:15 |
ianw | again | 01:15 |
ianw | kevinz: ^ | 01:15 |
ianw | i feel like this has to be an oops taking it down | 01:15 |
ianw | i think i might as well rebuild it as a focal node. i'm not going to spend time setting up captures etc. for an old kernel | 01:17 |
ianw | sigh ... bridge is dying too | 01:23 |
ianw | $ ps -aef | grep ansible-playbook | wc -l | 01:23 |
ianw | 211 | 01:23 |
ianw | all stuck on /home/zuul/src/opendev.org/opendev/system-config/playbooks/service-zuul.yaml >> /var/log/ansible/service-zuul.yaml.log | 01:23 |
ianw | i've killed them all. the log file isn't much help, as everything has tried to write to it | 01:26 |
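A minimal sketch of how the stuck runs could be found and cleaned up on bridge, assuming they all match the playbook path quoted above (the exact commands used aren't shown in the log):

```bash
#!/bin/bash
# List the stuck ansible-playbook invocations of service-zuul.yaml and count them.
pgrep -af 'ansible-playbook .*playbooks/service-zuul.yaml' | tee /tmp/stuck-playbooks.txt
wc -l /tmp/stuck-playbooks.txt

# Kill them once the list looks right; fall back to SIGKILL only if SIGTERM is ignored.
pkill -f 'ansible-playbook .*playbooks/service-zuul.yaml'
sleep 10
pgrep -f 'ansible-playbook .*playbooks/service-zuul.yaml' && \
    pkill -9 -f 'ansible-playbook .*playbooks/service-zuul.yaml'
```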
clarkb | ianw: I think that may be the result of our zuul job timeouts that run the service playbooks | 01:36 |
clarkb | they don't seem to clean up nicely (and we run zuul hourly to get images?) | 01:36 |
ianw | clarkb: i'll keep it open and see if one gets stuck, it's easier to debug one than 200 on top of each other :) | 01:38 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: arm64 mirror : update to Focal https://review.opendev.org/747069 | 01:43 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: arm64 mirror : update to Focal https://review.opendev.org/747069 | 01:49 |
ianw | ok, we've caught an afs oops during boot -> http://paste.openstack.org/show/796970/ | 02:03 |
ianw | auristor: ^ ... if that rings any bells | 02:03 |
ianw | i'm performing a hard reboot | 02:04 |
ianw | ... interesting .. same oops | 02:05 |
ianw | so then we seem to be stuck in "A start job is running for OpenAFS client (2min 56s / 3min 3s)" | 02:06 |
ianw | [ 8.338401] Starting AFS cache scan... ; i wonder if the cache is bad | 02:07 |
ianw | i'm going to delete /var/cache/openafs | 02:08 |
ianw | the server is up, but no afs to be clear at this point | 02:09 |
ianw | well that solved the oops, but still no afs. i'm starting to think ipv4 issues again | 02:14 |
ianw | hrm, i dunno, i can ping afs servers | 02:15 |
fungi | that's booting the ubuntu focal replacement arm64 server? | 02:21 |
ianw | fungi: no, the extant bionic one that died | 02:38 |
ianw | i'm going to try rebooting it again ... in case the fresh cache makes some difference | 02:39 |
fungi | okay, but you're ready for reviews on the focal replacement then | 02:41 |
ianw | sort of, it hasn't been tested on focal arm64 i don't think, because the mirror is down | 02:42 |
ianw | but i think we can merge 747069 | 02:42 |
ianw | ok, it's back, and ls /afs works ... | 02:44 |
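Roughly the recovery sequence described above, as a hedged sketch (the openafs-client service name and cache path are assumptions based on a standard Ubuntu OpenAFS install):

```bash
#!/bin/bash
# Stop the OpenAFS client if it is running (it may be hung in a start job).
sudo systemctl stop openafs-client || true

# Discard the on-disk client cache that appeared to trigger the oops at boot.
sudo rm -rf /var/cache/openafs/*

# Reboot and verify the cell is reachable again.
sudo reboot
# ...after the host comes back:
ls /afs
```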
ianw | and now the system-config gate is broken due to some linter stuff ... | 02:46 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: arm64 mirror : update to Focal https://review.opendev.org/747069 | 02:56 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Work around new ansible lint errors. https://review.opendev.org/747094 | 02:56 |
ianw | ok, back to the zuul thing. one of the playbooks is stuck again | 03:08 |
ianw | it's ... 30.248.253.23.in-addr.arpa domain name pointer zm05.openstack.org. | 03:09 |
ianw | as somewhat expected, it accepts the ssh connection then hangs | 03:10 |
ianw | standardish hung tasks messages on console | 03:11 |
ianw | #status log reboot zm05.openstack.org that had hung | 03:13 |
openstackstatus | ianw: finished logging | 03:13 |
openstackgerrit | Merged opendev/system-config master: Work around new ansible lint errors. https://review.opendev.org/747094 | 03:31 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: arm64 mirror : update to Focal https://review.opendev.org/747069 | 03:32 |
*** ysandeep|away is now known as ysandeep | 03:34 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ara-report: add option for artifact prefix https://review.opendev.org/747100 | 04:11 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: run-base-post: fix ARA artifact link https://review.opendev.org/747101 | 04:13 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ara-report: add option for artifact prefix https://review.opendev.org/747100 | 04:39 |
openstackgerrit | Merged opendev/system-config master: arm64 mirror : update to Focal https://review.opendev.org/747069 | 04:42 |
*** raukadah is now known as chkumar|rover | 04:43 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: launch-node: get sshfp entries from the host https://review.opendev.org/744821 | 05:09 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: launch-node: get sshfp entries from the host https://review.opendev.org/744821 | 05:10 |
frickler | ianw: seems logstash-worker08.openstack.org is broken, closes ssh connection immediately, failing the ansible deploy job. do you want to take a deeper look or just reboot via the API? | 05:33 |
ianw | frickler: sounds like the same old thing; i have the console up and can reboot it | 05:34 |
ianw | should be done | 05:42 |
*** lseki has quit IRC | 05:54 | |
*** lseki has joined #opendev | 05:54 | |
ianw | kevinz: so i'm having trouble starting another mirror node ... it seems ipv4 can't get in. i'm attaching to os-control-network. it actually worked once, but i had to delete that node, and now it doesn't work | 06:13 |
ianw | os-control-network=192.168.1.63, 2604:1380:4111:3e54:f816:3eff:fe57:7781, 139.178.85.144 | 06:17 |
ianw | ls -l /tmp/ | grep console | wc -l | 06:20 |
ianw | 104161 | 06:20 |
ianw | bridge has this many "console-bc764e02-6612-005b-e2c9-000000000012-bridgeopenstackorg.log" files | 06:20 |
ianw | i've removed them | 06:23 |
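One way to remove that many leftover console logs without tripping over the shell's argument-length limit, a sketch based on the filename shown above:

```bash
#!/bin/bash
# Count the stale Zuul console streamer logs left behind in /tmp.
find /tmp -maxdepth 1 -name 'console-*-bridgeopenstackorg.log' | wc -l

# Delete them; find -delete avoids expanding ~100k filenames onto a single command line.
sudo find /tmp -maxdepth 1 -name 'console-*-bridgeopenstackorg.log' -delete
```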
*** lpetrut has joined #opendev | 06:50 | |
*** DSpider has joined #opendev | 07:02 | |
*** hashar has joined #opendev | 07:04 | |
zbr | anyone that can help with https://review.opendev.org/#/c/747056/2 ? | 07:10 |
yoctozepto | morning infra; is https://docs.opendev.org/opendev/infra-manual/latest/creators.html the right guide to follow if I want to coordinate the etcd3gw move under the Oslo governance? i.e. the project already exists and this guide assumes it does not - what should I be aware of? | 07:27 |
yoctozepto | the current repo state (for reference) is here: https://github.com/dims/etcd3-gateway | 07:30 |
yoctozepto | it already used the (very old) cookiecutter template for libs; it depends on tox but obviously uses Travis rather than Zuul | 07:31 |
*** dtantsur|afk is now known as dtantsur | 07:34 | |
*** johnsom has quit IRC | 07:41 | |
AJaeger | yoctozepto: yes, that's the right guide - and it explains what to do to import a repository that exists. | 07:43 |
AJaeger | yoctozepto: check step 3 in https://docs.opendev.org/opendev/infra-manual/latest/creators.html#add-the-project-to-the-master-projects-list | 07:44 |
*** rpittau has quit IRC | 07:47 | |
*** fressi has joined #opendev | 07:48 | |
yoctozepto | AJaeger: ah, thanks! I was misled by the toc: https://docs.opendev.org/opendev/infra-manual/latest/creators.html#preparing-a-new-git-repository-using-cookiecutter | 07:53 |
*** rpittau has joined #opendev | 07:56 | |
*** johnsom has joined #opendev | 07:57 | |
*** elod is now known as elod_off | 07:58 | |
chkumar|rover | Hello Infra, We are seeing a rate limit issue in gate jobs | 08:00 |
chkumar|rover | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_047/746801/2/gate/tripleo-buildimage-overcloud-full-centos-8/04734f1/job-output.txt | 08:00 |
chkumar|rover | prepare-workspace-git : Clone cached repo to workspace | 08:00 |
chkumar|rover | primary | /bin/sh: line 1: git: command not found | 08:00 |
jrosser | i have an odd failure here https://zuul.opendev.org/t/openstack/build/f267841a98b443808365468e94ccdfa9/log/job-output.txt#178 | 08:00 |
jrosser | ^ same | 08:00 |
*** moppy has quit IRC | 08:01 | |
chkumar|rover | I think it is widespread on all distros | 08:01 |
*** moppy has joined #opendev | 08:01 | |
openstackgerrit | Antoine Musso proposed opendev/gear master: wakeConnections: Randomize connections before scanning them https://review.opendev.org/747119 | 08:05 |
cgoncalves | chkumar|rover, jrosser: this may help https://review.opendev.org/#/c/747025/ | 08:09 |
chkumar|rover | cgoncalves: thanks, just opened a bug https://bugs.launchpad.net/tripleo/+bug/1892326 | 08:10 |
openstack | Launchpad bug 1892326 in tripleo "Jobs failing with RETRY_LIMIT with primary | /bin/sh: line 1: git: command not found at prepare-workspace-git : Clone cached repo to workspace" [Critical,Triaged] | 08:10 |
cgoncalves | infra-root: would it be possible to manually trigger rebuild of nodepool images and push them to providers once https://review.opendev.org/#/c/747025/ merges? | 08:14 |
*** ykarel has joined #opendev | 08:14 | |
*** tosky has joined #opendev | 08:18 | |
openstackgerrit | yatin proposed zuul/zuul-jobs master: Ensure git is installed in prepare-workspace-git role https://review.opendev.org/747121 | 08:21 |
*** lseki has quit IRC | 08:30 | |
*** lseki has joined #opendev | 08:30 | |
*** rpittau has quit IRC | 08:30 | |
*** rpittau has joined #opendev | 08:30 | |
*** johnsom has quit IRC | 08:30 | |
*** johnsom has joined #opendev | 08:30 | |
ykarel | if some core is around please also check ^ | 08:35 |
ykarel | all jobs relying on this role are affected | 08:36 |
ianw | cgoncalves: i think we might have to release dib now to get it picked up | 08:39 |
cgoncalves | ianw, thing is we got ourselves in a chicken-n-egg situation where CI is failing to verify the revert | 08:40 |
ianw | ykarel: installing git there is probably a better idea than relying on it in the base image, at any rate | 08:40 |
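A rough shell equivalent of what the role-level fix amounts to (the actual zuul-jobs change presumably uses Ansible's package module; the package-manager detection here is only an illustration):

```bash
#!/bin/bash
# Install git on the test node if the image no longer ships it.
if ! command -v git >/dev/null 2>&1; then
    if command -v apt-get >/dev/null 2>&1; then
        sudo apt-get install -y git
    elif command -v dnf >/dev/null 2>&1; then
        sudo dnf install -y git
    else
        sudo yum install -y git
    fi
fi
```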
cgoncalves | at least two voting jobs already hit RETRY_LIMIT | 08:40 |
ianw | i think the build-only thing is a bit of a foot-gun unfortunately. anyway, that's not of immediate importance | 08:42 |
ianw | cgoncalves: will 747121 fix those jobs? | 08:42 |
cgoncalves | ianw, I think so but I've been wrong many times before xD | 08:42 |
ianw | welcome to the club :) | 08:43 |
cgoncalves | thanks!! | 08:43 |
ianw | i'm going to single approve 747121 as i think that should unblock things. then we can worry about the slower path of reverting, releasing, and rebuilding nodepool images and then ci images | 08:48 |
ianw | i have to afk for a bit | 08:48 |
*** priteau has joined #opendev | 08:50 | |
openstackgerrit | Merged zuul/zuul-jobs master: Ensure git is installed in prepare-workspace-git role https://review.opendev.org/747121 | 09:02 |
openstackgerrit | Tobias Henkel proposed openstack/project-config master: Create zuul/zuul-cli https://review.opendev.org/747127 | 09:13 |
openstackgerrit | Tobias Henkel proposed openstack/project-config master: Create zuul/zuul-client https://review.opendev.org/747127 | 09:33 |
*** andrewbonney has joined #opendev | 09:41 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: Use py3 with elastic-recheck https://review.opendev.org/729336 | 10:15 |
ykarel | ianw, Thanks for merging quickly | 10:58 |
ykarel | yes, it should not depend on the base image; having it in the base image is a plus though as it saves a couple of seconds | 10:58 |
zbr | AJaeger: tobiash: https://review.opendev.org/#/c/747056/ -- please, it is needed for https://review.opendev.org/#/c/729336/ | 11:11 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: Use py3 with elastic-recheck https://review.opendev.org/729336 | 11:12 |
*** hipr_c has joined #opendev | 11:33 | |
*** hipr_c has joined #opendev | 11:33 | |
*** hipr_c has joined #opendev | 11:33 | |
tosky | hi! If I click on the "Unit Tests Report" link here https://zuul.opendev.org/t/openstack/build/0cd50335a91b4e22a4776001e2d84785 | 12:19 |
*** jaicaa has quit IRC | 12:19 | |
AJaeger | zbr: please explain what the change is about so that I can decide whether to open it or not. I'm not reviewing either of these repos - and neither is tobiash. Please ask the rest of the admins later | 12:19 |
tosky | I get an empty page on chrome and an encoding error on Firefox | 12:19 |
tosky | s/chrome/Chromium/ | 12:19 |
AJaeger | tosky: is that only for this specific report - or for every one? I'm wondering whether that single file is corrupt or whether there's a generic problem. | 12:20 |
AJaeger | tosky: I can confirm the error on Firefox | 12:21 |
tosky | AJaeger: just that one | 12:22 |
tosky | I understand it may be a specific and once-in-a-while issue | 12:22 |
tosky | but just in case... | 12:22 |
*** jaicaa has joined #opendev | 12:22 | |
hashar | hello. I have a basic patch that fails the task "ubuntu-bionic: Build a tarball and wheel", python setup.py sdist bdist_wheel yields "no module named setuptools" | 12:29 |
hashar | is that a known issue by any chance? The repository is opendev/gear , patch is https://review.opendev.org/#/c/747119/1 | 12:30 |
AJaeger | tosky: ok. Hope other can help further | 12:30 |
tosky | AJaeger: thanks for checking! I know it may not be fixed, and that file is not critical anyway | 12:38 |
tosky | just reporting in case other reports start to pile up | 12:38 |
*** hashar has quit IRC | 12:50 | |
*** redrobot has quit IRC | 13:08 | |
frickler | tosky: AJaeger: looks like a bad upload to me, unless we see duplicates of that, I'd say this can happen and just do a recheck of that patch | 13:13 |
tosky | ack, thanks | 13:17 |
*** hashar has joined #opendev | 13:35 | |
fungi | hashar: i've seen that when a different python is used than the one for which setuptools is installed. we should probably switch that from python to python3 if it's not using a virtualenv | 13:45 |
lourot | hi o/ "openstack-tox-py35 https://zuul.opendev.org/t/openstack/build/8f4947ec185c4479a57b552de4338956 : RETRY_LIMIT in 2m 54s" | 13:45 |
lourot | this happened on at least two of our (openstack-charmers/canonical) reviews this afternoon | 13:46 |
lourot | the job seems to fail apt-installing git on xenial, is it something you noticed already? | 13:47 |
hashar | fungi: I am not sure I understand the reason ;] I have a hard time finding out where the job "build-python-release" is defined though | 13:47 |
fungi | lourot: that looks like the fallout from diskimage-builder removing git by default from images. we're hoping https://review.opendev.org/747121 fixes it so we don't have to wait for a revert and release in dib followed by nodepool image rebuilds and uploads to all providers | 13:47 |
fungi | hashar: take a look at the "console" tab for that build result and it shows the repository and path for the playbook which called the failing task, in this case opendev.org/opendev/base-jobs/playbooks/base/pre.yaml | 13:49 |
yoctozepto | fungi: it seems xenial broke | 13:49 |
yoctozepto | because it has no git packages | 13:49 |
fungi | hashar: er, sorry, i was looking at the wrong console, trying to answer too many questions at once | 13:49 |
lourot | fungi, understood, thanks! | 13:50 |
hashar | :]]]]] | 13:50 |
fungi | hashar: opendev.org/zuul/zuul-jobs/playbooks/python/release.yaml | 13:50 |
yoctozepto | https://review.opendev.org/747121 broke xenial and now we can't merge https://review.opendev.org/747025 | 13:50 |
fungi | yoctozepto: thanks, yeah i think we need git-vcs on xenial... checking now | 13:51 |
hashar | fungi: ahhh thank you very much. So yeah it runs {{ release_python }} setup.py sdist bdist_wheel , which would be python3 | 13:52 |
hashar | and somehow I guess the base image lacks setuptools | 13:52 |
yoctozepto | fungi: thanks | 13:53 |
fungi | hashar: we install setuptools for python3 i think, not python. ideally things should be calling python3 these days | 13:54 |
hashar | oh | 13:54 |
hashar | roles/build-python-release/defaults/main.yaml has an override: release_python: python | 13:54 |
fungi | yoctozepto: i was wrong, it's not git-vcs on xenial either, this error is strange, https://packages.ubuntu.com/xenial/git says it should exist | 13:55 |
fungi | hashar: yeah, probably we're not seeing this in other places because we set release_python: python3 (or something like that). you could check codesearch.openstack.org for release_python: | 13:55 |
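A quick way to confirm the diagnosis on a bionic node, as a sketch only (the interpreter behaviour is taken from the discussion above, not re-verified here):

```bash
#!/bin/bash
# Which interpreter does the release role invoke, and does it have setuptools?
python -c 'import setuptools; print(setuptools.__version__)'   # fails here: no module named setuptools
python3 -c 'import setuptools; print(setuptools.__version__)'  # expected to succeed (setuptools is installed for python3 per fungi above)

# Hence overriding release_python to python3 (or installing setuptools for python2) fixes the build.
```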
*** ykarel is now known as ykarel|away | 13:55 | |
openstackgerrit | Antoine Musso proposed opendev/gear master: zuul: use python3 for build-python-release https://review.opendev.org/747167 | 13:56 |
*** ykarel|away is now known as ykarel | 13:56 | |
hashar | fungi: or maybe if the base image has python2, it should also have setuptools? | 13:57 |
hashar | anyway, I might have found a way to set it to python3 | 13:57 |
yoctozepto | fungi: lack of apt-get update perhaps? | 13:57 |
ykarel | seems ^ the case for git not found | 13:57 |
yoctozepto | after working a lot on centos it feels nice to just hit install | 13:57 |
yoctozepto | but debian does not think so :-) | 13:58 |
hashar | fungi: thank you very much for your guidances | 13:59 |
fungi | yoctozepto: yeah, i suspect we may have tried to install a package too early before we've primed the pump for mirror stuff | 13:59 |
yoctozepto | fungi, ykarel: then let's just do the apt-get update in the role, shall we? | 14:00 |
fungi | though strange that this is only showing up for xenial | 14:00 |
openstackgerrit | Antoine Musso proposed opendev/gear master: zuul: use python3 for build-python-release https://review.opendev.org/747167 | 14:01 |
yoctozepto | maybe bionic+ images have the cache in them | 14:01 |
yoctozepto | which is valid enough | 14:01 |
ykarel | in other images it seems installed, the task is returning ok | 14:01 |
yoctozepto | or xenial's apt just got b0rken in the meantime | 14:01 |
ykarel | i saw a bionic job's log | 14:01 |
yoctozepto | it's sad the gate on the zuul change will not trigger the issue | 14:01 |
yoctozepto | maybe the ubuntu images did not rebuild? | 14:02 |
yoctozepto | i mean bionic+ ones | 14:02 |
yoctozepto | if you say they're 'ok' and not changed | 14:02 |
yoctozepto | tbh, I only saw centos failures in kolla today | 14:02 |
ykarel | ubuntu-bionic | ok https://5fadcfca1ff80d23fcf2-2bdb8be3dd1329f8a48d0e165eec17e9.ssl.cf2.rackcdn.com/746432/1/check/openstack-tox-py36/8f6daf0/job-output.txt | 14:02 |
yoctozepto | so might have been the case | 14:02 |
yoctozepto | bingo | 14:03 |
fungi | i've only just rubbed the sleep from my eyes, started to sip my coffee and stumbled into this in the last few minutes, so still trying to catch up on what's been happening from scrollback | 14:03 |
yoctozepto | fungi: it's a fire-fighting week for me | 14:03 |
ykarel | maybe we can hold a node? and see what's going to fix it quickly? | 14:03 |
yoctozepto | can't wait to see what Friday brings to the table | 14:03 |
fungi | we'll have to pick a change to recheck for the hold. i guess we can use the failing job for the dib revert | 14:04 |
fungi | working on that now | 14:05 |
ykarel | strange, in the dib change the job passed in check, ubuntu-xenial | ok | 14:06 |
frickler | fungi: maybe we also want to throw away the current images and revert to the previous ones until we can fix dib? | 14:06 |
fungi | frickler: i think we need to pause all image builds/uploads if we do that, because just deleting the images will trigger nodepool to start trying to upload them again | 14:08 |
fungi | last time i tried that i think i must not have paused them correctly | 14:08 |
fungi | anyway, the autohold and recheck are in, now waiting for openstack-tox-py35 to get a node | 14:09 |
openstackgerrit | Radosław Piliszek proposed zuul/zuul-jobs master: Fix git install on Debian distro family https://review.opendev.org/747170 | 14:10 |
yoctozepto | in case we want to go the apt-get update route, I prepared the above ^ | 14:10 |
fungi | once we have this node held, i can also bypass zuul to merge 739717 so dib folks can continue with the revert | 14:14 |
dmsimard | regarding that git install issue, I've also seen the issue in non-debian distros | 14:15 |
dmsimard | "/bin/sh: line 1: git: command not found" on CentOS8: https://zuul.openstack.org/build/d48c1f1a9e024f7ba4b1d68dea285d3e/console#0/3/8/centos-8 | 14:16 |
fungi | dmsimard: yep, but for those the role is installing git successfully now i think | 14:17 |
dmsimard | ah, was there a separate fix ? not caught up with entire backlog | 14:17 |
fungi | dmsimard: yeah, https://review.opendev.org/747121 | 14:17 |
dmsimard | neat, thanks | 14:18 |
frickler | fungi: yeah, forcing the revert in would be the other option, but IIUC we'd need to have another dib release then, too. not sure who except ianw can do that | 14:20 |
fungi | oh, right, since this job is failing in pre we have to wait for it to fail three times before it will trigger the autohold :/ | 14:21 |
*** lpetrut has quit IRC | 14:24 | |
*** chkumar|rover is now known as raukadah | 14:33 | |
fungi | it's starting attempt #3 now | 14:33 |
openstackgerrit | Radosław Piliszek proposed openstack/project-config master: Add openstack/etcd3gw https://review.opendev.org/747185 | 14:36 |
mnaser | `/usr/bin/apt-get -y -o \"Dpkg::Options::=--force-confdef\" -o \"Dpkg::Options::=--force-confold\" install 'git'' failed: E: Package 'git' has no installation candidate\n` | 14:39 |
fungi | i think we finally have a held node | 14:39 |
fungi | or should momentarily | 14:40 |
mnaser | ^ anyone seen this today? i'm not seeing anything in logs | 14:40 |
openstackgerrit | Radosław Piliszek proposed openstack/project-config master: Add openstack/etcd3gw https://review.opendev.org/747185 | 14:40 |
fungi | mnaser: yes, it's fallout from the fix for the fix for dib removing git | 14:40 |
fungi | we're now trying to install git in the prepare-workspace-git role but can't figure out why xenial is saying there's no git package | 14:41 |
fungi | i'm trying to get a node with that failure held now to see if i can work out what we're missing | 14:41 |
yoctozepto | fungi, mnaser: I bet on lack of apt-get update and await my fix merged :-) https://review.opendev.org/747170 | 14:42 |
fungi | does retry_limit not trigger autoholds? | 14:42 |
mnaser | ouch | 14:42 |
fungi | oh, nevermind, zuul hasn't finalized that build i guess | 14:42 |
fungi | seems the scheduler's in the middle of a reconfiguration event | 14:44 |
clarkb | retrylimit will hold and only the third and final instance | 14:44 |
fungi | yeah, it's finally failed the third but the result is in the queue backlog while the scheduler's reconfiguring | 14:45 |
fungi | i was just being impatient | 14:45 |
fungi | and there it goes | 14:46 |
fungi | though i still don't have a held node yet | 14:47 |
clarkb | another approach is to manually boot a xenial node | 14:48 |
fungi | oh fudge, i pasted in the wrong change number | 14:49 |
fungi | clarkb: well, we want to see what state the node is in when it's claiming it can't install git | 14:49 |
fungi | so just booting a xenial image won't necessarily get us that | 14:49 |
clarkb | it should be pretty close though | 14:50 |
clarkb | prepare-workspace-git happens very early iirc | 14:50 |
fungi | yep, our current suspicion is that it happens too early to be able to install distro packages | 14:50 |
fungi | like before we've set up mirroring configs and stuff | 14:51 |
clarkb | we can add git to our infra package needs element too | 14:52 |
*** ysandeep is now known as ysandeep|away | 14:52 | |
clarkb | rather than revert dib's change and re-release | 14:52 |
fungi | yeah, i was considering that as a fallback option | 14:52 |
fungi | fallback to installing it in the prepare workspace role i mean | 14:52 |
fungi | i'm ambivalent on whether dib maintainers want to keep or undo the git removal | 14:52 |
fungi | i corrected my autohold and abused zuul promote to restart check pipeline testing on the change in question | 14:54 |
*** qchris has quit IRC | 14:57 | |
fungi | i'm about to enter an hour where i'm triple-booked for meetings, but will try to keep tabs on this at the same time | 14:58 |
clarkb | I'm slowly getting to a real keyboard and can help more shortly | 15:02 |
clarkb | I'll probably work on the infra-package-needs change first so we've got it if we want it | 15:02 |
fungi | thanks | 15:02 |
clarkb | git is already in infra-package-needs | 15:10 |
clarkb | is dib removing it | 15:10 |
*** larainema has quit IRC | 15:10 | |
*** qchris has joined #opendev | 15:10 | |
* clarkb needs to find this dib change | 15:10 | |
fungi | yeesh | 15:11 |
clarkb | https://review.opendev.org/#/c/745678/1 | 15:11 |
clarkb | ya I think the build time only thing gets handled at a later build stage which then removes it | 15:11 |
fungi | right, that was the change which triggered this | 15:12 |
clarkb | basically that overrides our explicit request to install the package elsewhere | 15:12 |
clarkb | that makes me like the revert more | 15:12 |
clarkb | It's one thing to install it at runtime because we didn't install it on our images. It's another to tell dib to install it on the image and be ignored | 15:13 |
clarkb | I'm going to see if we can have the package installs override the other direction | 15:13 |
clarkb | if you ask to install it and not uninstall it somewhere then don't uninstall it | 15:13 |
fungi | looks like we're a bit backlogged on available nodes | 15:20 |
*** ykarel is now known as ykarel|away | 15:21 | |
*** ykarel|away has quit IRC | 15:28 | |
*** tosky_ has joined #opendev | 15:35 | |
*** tosky has quit IRC | 15:36 | |
*** tosky_ is now known as tosky | 15:37 | |
fungi | yep, test nodes flat-lined around 750 in use as of ~12:30z and the node requests have been climbing since | 15:37 |
fungi | current demand seems to be around 2x capacity | 15:38 |
fungi | also looks like we could stand to have an additional executor or two | 15:39 |
fungi | since around 14:00z there's been very little time where we had any executors accepting new builds | 15:40 |
fungi | and the executor queue graph shows we started running fewer concurrent builds since then | 15:41 |
openstackgerrit | Clark Boylan proposed openstack/diskimage-builder master: Don't remove packages that are requested to be installed https://review.opendev.org/747220 | 15:41 |
clarkb | something like that maybe? | 15:41 |
clarkb | fungi: the pre run churn is likely part of that | 15:41 |
fungi | i agree, this is probably a pathological situation | 15:41 |
fungi | openstack-tox-py35 has finally started its first try | 15:42 |
*** mlavalle has joined #opendev | 15:45 | |
fungi | looks like these are spending almost as much time waiting on an executor as they are waiting for a node | 15:47 |
clarkb | that will be affected by the job churn | 15:48 |
fungi | absolutely | 15:48 |
clarkb | since we rate limit job starts on executors | 15:48 |
clarkb | https://review.opendev.org/#/c/729336/ shows https://review.opendev.org/#/c/747056/ is working. fungi: once the bigger fire calms down (the gate won't pass for this with broken git anyway) maybe we can get reviews on those? | 16:04 |
fungi | yep! | 16:05 |
fungi | that's good | 16:05 |
corvus | clarkb: is there a fire that i can help with? | 16:05 |
fungi | also i'll have a break from meetings in about 55 minutes, maybe sooner | 16:05 |
clarkb | corvus: there is a fire. I think we're just trying to confirm which of the various fixes is our best bet. TL;DR is https://review.opendev.org/#/c/745678/1 merged to dib and was released. This has resulted in dib removing git from our images even though we explicitly request for git to be installed in infra-package-needs. | 16:06 |
fungi | corvus: we've discovered that if dib marks a package as build-specific like in https://review.opendev.org/745678 then you can't also explicitly install that package as a runtime need in another element | 16:06 |
clarkb | corvus: an earlier attempt at a fix does a git install in prepare-workspace-git. but on ubuntu we think that may need an apt-get update (fungi is working to confirm that now before we land the update change into prepare-workspace-git) | 16:07 |
clarkb | corvus: on the dib side I've written https://review.opendev.org/747220 to not uninstall packages if something requests they be installed normally | 16:07 |
clarkb | for some reason this seems to most affect xenial. (Do we know why yet?) | 16:07 |
fungi | one theory i've not had a chance to check is that we haven't uploaded new images for bionic et al yet | 16:08 |
fungi | and so they already have git preinstalled causing that task to no-op | 16:08 |
openstackgerrit | Paul Belanger proposed zuul/zuul-jobs master: Revert "Ensure git is installed in prepare-workspace-git role" https://review.opendev.org/747238 | 16:08 |
clarkb | ^^ that revert won't help anything aiui | 16:10 |
clarkb | the jobs will just fail on the next tasks | 16:10 |
clarkb | fwiw I think it's reasonable to make git an image dependency hence https://review.opendev.org/747220 | 16:11 |
corvus | clarkb, fungi: pabelanger left a comment that may be relevant on 747238 | 16:12 |
clarkb | corvus: ya I think we'll be installing git with default package mirrors (whatever those may be) | 16:13 |
clarkb | DNS should work on boot (that's a thing we've tried very hard to ensure) | 16:13 |
clarkb | though we aren't using the same images as ansible so ... | 16:13 |
openstackgerrit | Paul Belanger proposed zuul/zuul-jobs master: Revert "Ensure git is installed in prepare-workspace-git role" https://review.opendev.org/747238 | 16:13 |
fungi | i expect that yoctozepto's fix to do an apt update first would clear the error we're seeing with it, but i respect that ansible's use of the role may be incompatible with installing packages (though it should also no-op if the package is preinstalled in their images) | 16:14 |
yoctozepto | hmm, based on https://review.opendev.org/747170 - I know why it did not fail on the change - it tests PREVIOUS playbooks | 16:14 |
yoctozepto | fungi: yeah, if only it wanted to merge now that the queues are b0rken ^ | 16:15 |
clarkb | fungi: yup that is what I just noted on the change about the noop | 16:15 |
clarkb | basically reverting that zuul-jobs change doesn't help much if the images are broken | 16:16 |
clarkb | if the images are fixed then it noops (so I think we should either fix zuul-jobs or ignore it in favor of fixing the images) | 16:16 |
clarkb | then we can swing around and clean up zuul-jobs as necessary | 16:16 |
clarkb | unless paul has images with git and the only problem is that ansible doesn't noop there for some reason | 16:17 |
clarkb | (that info would be useful /me adds to change) | 16:17 |
fungi | looks like we have a node assignment for retry #3 on the job which my autohold is set for, and then i'll see if i can work out why that's failing so cryptically | 16:17 |
fungi | once it gets a free executor slot anyway | 16:17 |
corvus | iiuc that paul is saying dns is broken, it may be that yoctozepto's change is unsafe for paul | 16:18 |
corvus | (because even a no-op 'apt-get update' would fail due to broken dns) | 16:18 |
yoctozepto | fungi, clarkb, corvus: I think our best bet is to force-merge the zuul-jobs revert by pabelanger, then same with dib revert and rebuild the images | 16:18 |
clarkb | corvus: yup, that is why I'm thinking addressing the image problem is our best bet | 16:18 |
clarkb | yoctozepto: but paul's revert shouldn't affect anything | 16:19 |
clarkb | that's what I'm trying to say. If the images are fixed we don't need the revert. If the images are not fixed the revert won't help | 16:19 |
clarkb | we need to focus on the images imo | 16:19 |
yoctozepto | clarkb: but we do want the revert, let's start from a clean slate | 16:19 |
yoctozepto | anyhow, any idea why those jobs test the PREVIOUS playbooks? | 16:19 |
yoctozepto | I mean, they don't test the CURRENT change | 16:19 |
clarkb | yoctozepto: the revert has no bearing on whether jobs will fail or pass. I think we should ignore it and focus on what has an effect | 16:20 |
clarkb | then later we can revert if we want to clean up | 16:20 |
clarkb | yoctozepto: because they run in trusted repos | 16:20 |
clarkb | yoctozepto: that is normal expected behavior by zuul | 16:20 |
yoctozepto | clarkb: ok, missed that | 16:20 |
yoctozepto | clarkb: so it's even in gate? | 16:20 |
yoctozepto | it's scary to +2 such changes there then | 16:21 |
clarkb | yoctozepto: yes, you have to merge the change before it can be used. We have the base-test base job set up to act as a tester for these things | 16:21 |
fungi | which was not used in this case because things were already broken | 16:21 |
corvus | real quick q -- since there's a fire, did we delete the broken images to revert to previous ones? | 16:22 |
clarkb | corvus: no because nodepool will just rebuild and break us again | 16:22 |
clarkb | (at least that was my read of scrollback) | 16:22 |
corvus | well, that's what pause is for | 16:22 |
fungi | i think last time i tried to pause all image updates i got it wrong | 16:23 |
corvus | i mean, we have a documented procedure for exactly this case. if we had followed it, everything would not be broken. | 16:23 |
corvus | fungi: as an alternative, if there is any confusion, you can just stop the builders | 16:23 |
yoctozepto | can we focus on force-merging the dib revert change? :-) | 16:23 |
corvus | or we could follow procedure and not have to force-merge anything | 16:23 |
fungi | https://docs.opendev.org/opendev/system-config/latest/nodepool.html#bad-images | 16:23 |
fungi | maybe we didn't have that documented the last time i tried to do it | 16:24 |
fungi | i think i'll have to run the nodepool commands from nb03? | 16:24 |
fungi | all the others are docker containers now | 16:24 |
clarkb | fungi: you docker exec | 16:24 |
yoctozepto | corvus: so you pause, delete newest ones, and get previous ones? | 16:25 |
fungi | yoctozepto: the older images will be used automatically | 16:25 |
yoctozepto | fungi: ack | 16:25 |
openstackgerrit | Clark Boylan proposed openstack/project-config master: Pause all image builds https://review.opendev.org/747241 | 16:25 |
clarkb | so we force merge ^ then delete the image(s)? | 16:25 |
corvus | sure, or delete the image and then regular-merge that | 16:26 |
clarkb | fungi: `sudo docker exec nodepool-builder-compose_nodepool-builder_1 nodepool $command` from my scrollback on nb01 | 16:26 |
fungi | or stop builders, delete the image, regular-merge that, then start builders again once it's deployed? | 16:26 |
corvus | fungi: yes or that | 16:27 |
corvus | main thing is -- we shouldn't have to force-merge anything in this situation | 16:27 |
corvus | (and we should be able to get people working again immediately) | 16:27 |
fungi | okay, i'll start downing the builders now | 16:28 |
clarkb | if we delete the image then regular-merge, it will build and then upload I think? so ya downing seems better | 16:28 |
clarkb | (the pause will only apply to builds after the config is updated iirc) | 16:28 |
yoctozepto | that sounds very nice | 16:29 |
fungi | doing `sudo docker-compose down` in /etc/nodepool-builder-compose on nb01,02,04 and `sudo systemctl stop nodepool-builder` on nb03 now | 16:29 |
fungi | #status log all nodepool builders stopped in preparation for image rollback and pause config deployment | 16:31 |
openstackstatus | fungi: finished logging | 16:31 |
fungi | so next we need to build a list of the most recent images where there is at least one prior image and the latest image was built within the past day | 16:32 |
corvus | fungi: should just be the list of images with "00:" as the first part of the age column | 16:33 |
fungi | yep, that's what i just filtered on | 16:34 |
fungi | i guess we can assume there are prior images for all of those | 16:34 |
corvus | nodepool dib-image-list|grep " 00:" | 16:34 |
corvus | fungi: if there aren't, i don't think it matters anyway (essentially, every 00: image is broken yeah?) | 16:34 |
fungi | well, technically it's been less than 24 hours since the regression merged | 16:35 |
clarkb | corvus: the centos-8 one is 14 hours old which may not be new enough | 16:35 |
clarkb | but I think we can just assume they are broken if new like that and clean them up | 16:35 |
fungi | more important is when dib release was published i guess | 16:35 |
* clarkb checks zuul builds | 16:35 | |
corvus | this is what i get for that: http://paste.openstack.org/show/797001/ | 16:35 |
yoctozepto | I think https://review.opendev.org/747025 can (and should) be abandoned thanks to clarkb's patch | 16:36 |
clarkb | 05:28 UTC yesterday | 16:36 |
fungi | 3.2.0 appeared on pypi 05:31z yesterday, so yeah more than 24 hours maybe | 16:36 |
clarkb | oh today is the 20th not 18th | 16:36 |
clarkb | so ya anything built in the last 24 hours is likely bad | 16:36 |
fungi | i guess we just start with 00: | 16:36 |
clarkb | fungi: ++ | 16:37 |
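Putting the pieces together, a hedged sketch of the cleanup pass (the grep filter is corvus's one-liner above; the build id in the delete example is made up, and the nodepool commands would be prefixed with the docker exec invocation quoted earlier when the builder runs in a container):

```bash
#!/bin/bash
# List the image builds finished in the last 24 hours (the age column starts with "00:").
nodepool dib-image-list | grep " 00:"

# Delete each suspect build by its <image>-<build-id> name, e.g.:
nodepool dib-image-delete ubuntu-xenial-0000123456   # illustrative build id only

# The builders must be running for the on-disk files and provider uploads to actually
# get cleaned up; until then the records just sit in the "deleting" state.
```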
fungi | if i nodepool dib-image-delete will that also delete all the uploads of that build? | 16:37 |
fungi | or do i need to also manually delete them? | 16:37 |
clarkb | fungi: it will but only once the builders are started | 16:37 |
clarkb | (same iwth the on disk contents) | 16:37 |
fungi | ohh... right | 16:37 |
clarkb | the zk db updates should be sufficient to start booting on the older images though | 16:37 |
corvus | and yes, the docs say only to run "dib-image-delete"; image-delete is not necessary. | 16:38 |
clarkb | actually wait my earlier day math was right. Today is the 20th. The release was 05:30ish on the 19th | 16:39 |
clarkb | so about 11 hours ago | 16:39 |
clarkb | I think that means the centos-8 image is ok | 16:39 |
clarkb | (but deleting it is also fine) | 16:39 |
fungi | 05:31 utc yesterday is 24 hours before 05:31 utc today. it's now 16:40 utc, so >24 hours | 16:40 |
clarkb | bah timezones | 16:40 |
fungi | i failed to delete centos-7-0000134775 because it was building not ready, i guess i should have filtered on ready too | 16:42 |
clarkb | we'll need to delete it when it goes ready | 16:42 |
clarkb | oh wait it wont | 16:43 |
clarkb | because we stopped the builders :) | 16:43 |
fungi | yep | 16:43 |
clarkb | that should autocleanup then. Cool | 16:43 |
fungi | so this is the list: http://paste.openstack.org/show/797003 | 16:43 |
fungi | for posterity | 16:43 |
fungi | all but centos-7-0000134775 are in deleting state now | 16:43 |
clarkb | now we should cross check with the image-list | 16:44 |
clarkb | it may be the case that we need the builders running to update their states | 16:44 |
corvus | i approved the zuul-jobs revert for paul | 16:44 |
corvus | yes, i think the 'stop the builders' variant is untested | 16:44 |
fungi | this has reminded me that last time we did it without stopping the builders and had to deal with them immediately starting to build new bad images | 16:45 |
fungi | granted, that takes a bit of time | 16:45 |
fungi | so maybe also okay | 16:45 |
corvus | yes, "immediately" is relative here | 16:45 |
clarkb | ya, the image-list hasn't updated yet | 16:46 |
fungi | also while i was working on that, my autohold was finally satisfied, so i'll see if i can confirm why the apt install git was breaking | 16:46 |
corvus | sure, we would probably need to delete a few again | 16:46 |
clarkb | we can set those to delete too, or update a builder config to pause and start it | 16:46 |
corvus | better to let the builder do it | 16:46 |
corvus | tbh, i'd like to just follow the directions we wrote :) | 16:46 |
fungi | well, nodepool dib-image-delete won't let us delete an image which is building, so we have to catch it between completing the build and uploading | 16:47 |
clarkb | fungi: and we'll also start new image builds | 16:47 |
clarkb | but corvus is saying we should just manually delete those again when they happen | 16:47 |
fungi | the directions we wrote last time ended us with the problem coming back because we didn't catch and delete the new images fast enough | 16:47 |
clarkb | maybe start just nb01 to minimize the number of builds that can happen? A single builder should handle cleanup just fine | 16:48 |
clarkb | fungi: yes | 16:48 |
corvus | yes, it's possible that one or two jobs may end up running on new images with this process. but right now, we've been running thousands of jobs on bad images | 16:48 |
corvus | so it's like a 10000000000% improvement | 16:48 |
clarkb | should I up the container on nb01? | 16:48 |
fungi | i suppose we could mitigate it by manually applying the pause configuration to all the builders before starting to delete images? | 16:48 |
clarkb | fungi: we only need to start one, and yes we could manually apply the config there | 16:49 |
clarkb | (corvus is saying don't bother though) | 16:49 |
*** fressi has left #opendev | 16:49 | |
fungi | or do we then risk ansible deploying the old config back over them before the pause config is merged? | 16:49 |
clarkb | fungi: I think the idea is even if we rebuild one or two images we can just delete them again | 16:50 |
clarkb | while we land the pause config change | 16:50 |
clarkb | and if we restart only nb01 we'll minimize nodepools ability to build new images | 16:51 |
clarkb | so I think that is safe enough | 16:51 |
clarkb | corvus: ^ is that basically what you are saying? | 16:51 |
fungi | wfm | 16:51 |
corvus | you may need all the builders up. but yes. | 16:51 |
clarkb | ok I'll start with nb01, then check and see if we need to start the others | 16:51 |
corvus | i'm pretty much going to just keep saying "do what the instructions say" | 16:51 |
clarkb | I'm making sure I'm interpreting them correctly as well as articulating the corner case(s) in what the instructions say | 16:52 |
clarkb | Note the directions say to pause first, which we are not doing | 16:53 |
clarkb | do we want to manually edit the configs to pause first then? | 16:53 |
corvus | nope | 16:53 |
corvus | just start the builders | 16:53 |
corvus | merge the change | 16:53 |
corvus | keep deleting broken images | 16:53 |
fungi | i'd like to improve the instructions if we can come up with a less racy process for this, or at least figure out what feature to implement in nodepool so we can eliminate the race condition | 16:53 |
clarkb | ok nb01 is running | 16:54 |
corvus | fungi: sure it could be better, but i honestly don't think it's a big deal | 16:54 |
corvus | and considering we went off-script (even after we decided to go on-script) by stopping the builders, i don't think we can actually say we followed them this time | 16:54 |
openstackgerrit | Pierre Riteau proposed opendev/irc-meetings master: Update CloudKitty meeting information https://review.opendev.org/747256 | 16:54 |
corvus | they don't say anything about stopping or starting builders | 16:54 |
fungi | my main concern is that in the past it's resulted in us telling people a problem is fixed, only to have it crop back up again hours later and then there's confusion as to when it was actually fixed and what can safely be rechecked | 16:54 |
corvus | okay, let's add a paragraph at the end saying "if new images got built, delete those as well after the pause change has landed" | 16:55 |
fungi | the instructions don't say to stop the builders, they also don't say to keep monitoring the builders and deleting new images which were started before the pause went into place | 16:56 |
corvus | sure, but they do say "if you have a broken image, delete it" | 16:56 |
clarkb | nb01 is attempting to delete images according to the log | 16:57 |
clarkb | there are some auth exceptions to some url I don't recognize | 16:57 |
corvus | clarkb: its own images or others? | 16:57 |
corvus | clarkb: or rather, i think nb01 will only delete images on providers it talks to | 16:57 |
clarkb | corvus: so far just confirmed its own | 16:57 |
corvus | so it may only delete non-arm images | 16:58 |
clarkb | oh ya good point | 16:58 |
* clarkb checks arm | 16:58 | |
clarkb | logan-: fwiw it seems we get a cert verification error talking to limestone. We can dig in more once the images are in a happier place | 16:58 |
clarkb | ya doesn't seem to have touched the arm64 images | 16:59 |
clarkb | I'll start nb03 too | 16:59 |
clarkb | ok I think we're good until new images get uploaded (which will start with centos-7-0000134776 and ubuntu-xenial-arm64-0000094376 in an hour or two) | 17:01 |
fungi | yoctozepto: i've confirmed that your proposed patch to apt update also wouldn't have helped. this is running before we've set our apt configuration so fails with "The repository 'http://mirror.dfw.rax.opendev.org/ubuntu xenial-security Release' is not signed. Updating from such a repository can't be done securely, and is therefore disabled by default." | 17:03 |
corvus | clarkb: okay so should we zuul enqueue 747241 into gate? | 17:03 |
corvus | i'm assuming it's partway through failing some check jobs or something on the old images | 17:03 |
clarkb | corvus: ya I thik we can try that now | 17:04 |
clarkb | fungi: that's odd because we build the images using the mirrors to ensure we don't get ahead with their packages iirc. Which means we should bake in the override for that? | 17:04 |
fungi | clarkb: apparently that's not carried over | 17:04 |
clarkb | fungi: maybe that extra apt config is helpfully cleaned up | 17:04 |
*** dtantsur is now known as dtantsur|afk | 17:04 | |
yoctozepto | fungi: ack, I've abandoned it either way because the approach is wrong | 17:04 |
corvus | clarkb: in progress | 17:04 |
fungi | and yeah, it's also trying to use mirror.dfw.rax.opendev.org when it was booted in ovh-gra1 | 17:05 |
fungi | so apparently *some* of the configuration is not cleaned up | 17:05 |
fungi | though i did confirm that once the package lists were correctly updated, it was able to successfully install the git package | 17:06 |
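A sketch of the sort of checks run on the held xenial node; the exact commands are an assumption, but the errors and outcome match what fungi reports above:

```bash
#!/bin/bash
# The node (booted in ovh-gra1) still points at the rackspace DFW mirror:
grep -r mirror.dfw.rax.opendev.org /etc/apt/sources.list /etc/apt/sources.list.d/ || true

# Refreshing the package lists fails because that repository is treated as unsigned:
sudo apt-get update   # "The repository '... xenial-security Release' is not signed"

# Once the apt sources are corrected (proper mirror and signing config applied),
# installing git works as expected:
sudo apt-get update && sudo apt-get install -y git
```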
clarkb | as next steps I'm thinking: revert the dib change, push a release. Then we can land my fix and a revert of the revert (and test it), then do another release | 17:08 |
clarkb | https://review.opendev.org/#/c/747025/ is the dib revert | 17:09 |
clarkb | yoctozepto: ^ see plan above. I think it makes sense to test this more completely and start by going back to what is known to work then roll forward with better testing from there | 17:09 |
clarkb | I'm going to recheck that change now | 17:09 |
corvus | clarkb, fungi: the pause change is running jobs which have passed the point at which they're doing things with 'git' | 17:13 |
corvus | so ++ | 17:13 |
fungi | good deal | 17:13 |
fungi | does pause cause uploads to be paused too, or just builds? | 17:17 |
openstackgerrit | Merged openstack/project-config master: Pause all image builds https://review.opendev.org/747241 | 17:17 |
corvus | fungi: there's a pause for either; clarkb paused the builds | 17:18 |
yoctozepto | clarkb: I'm not sure I agree but it's not bad either | 17:18 |
corvus | so uploads of already built images will continue | 17:18 |
fungi | corvus: yep, thanks, just found that in the docs too | 17:19 |
corvus | (i think that is fine and correct in this case) | 17:19 |
fungi | so if we wanted to avoid uploading images which were in a building state when the diskimage pause was set, we'd need to also add it for all providers | 17:19 |
fungi | we don't have a mechanism for cancelling a build in progress, right? other than maybe a well placed sigterm | 17:20 |
clarkb | fungi: ya killing the dib process would do it, but nothing beyond that iirc | 17:21 |
fungi | and at that point it wouldn't retry the build because of the pause | 17:23 |
clarkb | yes | 17:23 |
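For completeness, a hedged sketch of the "well placed sigterm" option; it assumes the in-progress build shows up as a disk-image-create process on the builder host:

```bash
#!/bin/bash
# Find the running diskimage-builder invocation on the builder host.
pgrep -af disk-image-create

# Terminate it; with the diskimage paused in the config, the builder should not retry the build.
sudo pkill -TERM -f disk-image-create
```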
* clarkb is trying to figure out how to test https://review.opendev.org/747220 now | 17:24 | |
corvus | i was wondering if some of the nodepool/devstack jobs actually boot an image? but they probably don't do anything on it | 17:25 |
corvus | as a one-off, you could probably do something that verifies that git is installed on the booted vm? | 17:25 |
corvus | but also, aren't there some dib tests that can check stuff like that? | 17:26 |
corvus | (ie, build the image, then verify contents?) | 17:26 |
corvus | at the functional test level | 17:26 |
clarkb | corvus: they boot the vm and I think check that ssh works. Which makes me wonder if I should s/git/openssh-server/ as that will confirm the package ends up sticking around | 17:27 |
fungi | that does seem like it could also just be added as commands in a very last stage of an element, so that if the sanity checks don't succeed the image build fails | 17:27 |
clarkb | corvus: for the functional level tests they seem pretty basic. | 17:27 |
clarkb | but maybe there is something there I am missing /me looks more | 17:27 |
fungi | then the test would be to try building the image. if those checks fail, the image build fails and the job then fails | 17:28 |
clarkb | oh you know I can probably just run the scripts in that element and check the outputs | 17:28 |
*** sgw has joined #opendev | 17:30 | |
*** andrewbonney has quit IRC | 17:35 | |
clarkb | ya I think that is enough to show I've got a bug so I'll keep pulling on it that way | 17:36 |
*** hashar has quit IRC | 17:41 | |
corvus | i'm going to delete | ubuntu-xenial-arm64-0000094376 | ubuntu-xenial-arm64 | nb03.openstack.org | qcow2 | ready | 00:00:12:45 | | 17:46 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Docs: Extra details for image rollback https://review.opendev.org/747261 | 17:47 |
fungi | corvus: thanks! related ^ | 17:47 |
corvus | fungi: i'm not sure that pausing the provider-images would be effective. it can't go into effect any earlier than the dib pause, and i think the dib pause is sufficient to stop the upload | 17:50 |
fungi | oh, uploads won't occur if the diskimage build is paused? | 17:50 |
corvus | fungi: that's my understanding of the intent of the code. | 17:51 |
fungi | that's what i was asking earlier as to whether pausing the diskimage building would also pause uploading of the images | 17:51 |
corvus | i may have misunderstood that question then | 17:51 |
fungi | so if an image is in building state when the pause for it takes effect, once it reaches ready state the nodepool-builder won't attempt to upload it to providers? | 17:52 |
corvus | i believe that's the intent, but i'd give it 50/50 odds that that's what actually happens, because that's essentially a reconfiguration edge-case. | 17:53 |
corvus | but other than that potential edge case, in general, pausing a dib should stop derived uploads. | 17:53 |
fungi | well, yeah, i mean if you don't build an image then there's nothing to upload | 17:54 |
corvus | uploads fail all the time, so the builders are constantly retrying them | 17:54 |
corvus | (this is why i may have answered your question in a different context earlier) | 17:55 |
fungi | oh, i see, so it would prevent the upload from being retried, but not from being tried the first time | 17:55 |
fungi | (maybe) | 17:55 |
corvus | fungi: i'm just hedging my answer because it's a really specific question which i'm not sure is covered by a unit test | 17:56 |
fungi | sure, makes sense | 17:56 |
corvus | in general, i think what we all want to have happen is what the authors of the code wanted to have happen too | 17:56 |
fungi | so maybe really the only race we've encountered is from deleting images before the pause takes effect | 17:56 |
corvus | so i think our docs should reflect that, until we prove otherwise :) | 17:56 |
corvus | fungi: that is my expectation | 17:56 |
corvus | speaking of which, if infra-prod-service-nodepool ran successfully, shouldn't "pause: true" appear in /etc/nodepool/nodepool.yaml on nb03? | 17:58 |
fungi | that's what i would have expected | 17:58 |
fungi | unless infra-prod-service-nodepool isn't handling the non-container deployment? | 17:59 |
fungi | maybe that's still being done by the puppet-all job? | 17:59 |
corvus | that may be the case | 18:00 |
fungi | even though it's technically not being configuration-managed by puppet | 18:00 |
corvus | nb01 has true | 18:00 |
corvus | will that end up updated by a cron or something? | 18:00 |
fungi | also i don't know how far ianw got with bringing the mirror for the arm64 provider back to sanity, so it's possible arm64 builds are hopelessly broken at the moment either way | 18:01 |
fungi | looks like infra-prod-remote-puppet-else is queued in opendev-prod-hourly right now | 18:02 |
corvus | okay, given the limited impact, i don't think exceptional action is warranted. | 18:04 |
corvus | fungi: presumably the currently-building fedora-30 image will be a test of your question | 18:05 |
corvus | fungi: i've confirmed that dibs are paused on nb01, and it's 20m into a build of fedora-30 | 18:05 |
fungi | yeah, we'll know in a "bit" (or "while" at least) whether infra-prod-remote-puppet-else takes care of it | 18:05 |
corvus | so maybe when it's done, before we delete it, let's check to see if it uploads | 18:06 |
fungi | sounds good | 18:06 |
fungi | then i'll revise the docs change accordingly | 18:06 |
openstackgerrit | Clark Boylan proposed openstack/diskimage-builder master: Don't remove packages that are requested to be installed https://review.opendev.org/747220 | 18:06 |
clarkb | that is tested now. It fails pep8 locally but not on any of the files I changed? I wantt to see what zuul says about linting | 18:07 |
fungi | though also i agree if nodepool is expected to not upload images in that state, it's probably something worth fixing in nodepool | 18:07 |
corvus | clarkb: ^ fyi double check that there are no fedora-30-0000018222 uploads once it finishes building | 18:07 |
corvus | (before deleting it) | 18:07 |
clarkb | k | 18:08 |
corvus | i'm going to take a break | 18:09 |
fungi | i'll be breaking in about an hour to work on dinner prep but keeping an eye on this in the meantime | 18:10 |
clarkb | fungi: can https://review.opendev.org/#/c/747056/ get a review before dinner prep? | 18:24 |
fungi | yep, looking | 18:26 |
fungi | deleting centos-8-arm64-0000006345 which went ready ~20 minutes ago | 18:28 |
*** hashar has joined #opendev | 18:30 | |
fungi | dib-image-list indicates fedora-30-0000018222 went ready 2 minutes ago | 18:38 |
fungi | also indicates that nb01 has started building fedora-31-0000011973 | 18:38 |
fungi | so, um, does it not realize we asked it to pause? | 18:38 |
clarkb | fungi: it started before the pause | 18:39 |
fungi | it started 2 minutes ago | 18:39 |
clarkb | oh 31 not 30 | 18:39 |
clarkb | interesting | 18:39 |
fungi | yup | 18:39 |
fungi | also i can confirm fedora-30 is "uploading" to all providers currently | 18:40 |
clarkb | the config for fedora-31 on nb01 clearly says pause: true | 18:40 |
clarkb | maybe it's using cached config? | 18:40 |
fungi | though that could also simply be because the builder didn't actually pause | 18:40 |
fungi | i'm deleting fedora-30-0000018222 now before it taints more job builds | 18:41 |
clarkb | fungi: you mean because there is a bug? | 18:41 |
fungi | which statement was that question in relation to? | 18:41 |
clarkb | "though that could also simply be because the builder didn't actually pause" | 18:41 |
fungi | yes, either a bug in nodepool or a bug in how we're updating its configuration | 18:42 |
fungi | like does the builder daemon also need some signal to tell it to reread its configuration? | 18:43 |
fungi | or does it only read its config at start? | 18:43 |
clarkb | reading the code it seems to read it on every pass through its run loop | 18:44 |
fungi | also deleting debian-stretch-arm64-0000093525 which has gone ready | 18:44 |
clarkb | fungi: I think the loop is roughly: while true: load config; for image in images: if image is stale: rebuild | 18:46 |
fungi | also, the infra-prod-remote-puppet-else build in opendev-prod-hourly finished, but /etc/nodepool/nodepool.yaml on nb03 still hasn't been updated | 18:46 |
clarkb | fungi: my hunch is that it's going to try and build every image with the pre-pause config as it loops through that list | 18:46 |
clarkb | its not reloading the config between rebuilds until it gets through the whole list | 18:46 |
fungi | so if we want it to take effect ~immediately that requires a service restart, otherwise it will take effect in 6-12 hours | 18:47 |
clarkb | yes? Would be good for someone else to double check my read of the code but that is my read of it | 18:48 |
fungi | and i suppose we should bump the config read down one layer deeper in the nested loop if so | 18:48 |
fungi | in other news, /etc/ansible/hosts/emergency.yaml includes "nb03.openstack.org # ianw 2020-05-20 hand edits applied to dib to build focal on xenial" | 18:50 |
fungi | so this marks the three-month anniversary of the last configuration update there, i suppose | 18:50 |
fungi | i'll edit its config by hand for now | 18:50 |
clarkb | hrm I think that can be removed now, but we should confirm with ianw today | 18:50 |
fungi | i should have looked there sooner, but so much going on | 18:50 |
corvus | yeah, sounds like restart is needed currently, and we should have nodepool reload its config after each image build | 18:51 |
fungi | #status log edited /etc/nodepool/nodepool.yaml on nb03 to pause all image builds for now, since its in the emergency disable list | 18:52 |
openstackstatus | fungi: finished logging | 18:52 |
fungi | i've restarted nodepool-builder on nb03 to get it to read its updated configuration now | 18:53 |
fungi | interestingly, after a restart it immediately began building ubuntu-focal-arm64 | 18:55 |
fungi | the config sets pause: true for ubuntu-focal-arm64 | 18:55 |
fungi | why would it begin building? | 18:55 |
fungi | oh! because it's a pause under providers, not under diskimages | 18:56 |
fungi | all the pauses in its config are providers | 18:56 |
* fungi sighs, then fixes | 18:56 | |
clarkb | fungi: oh sorry I missed the difference in context. Normally we have the images set to pause: false ahead of time to toggle them | 18:58 |
fungi | well, in this case the config on nb03 had them set to pause: false in the diskimages list for linaro-us, not the main diskimages definitions list | 18:59 |
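To make the distinction concrete, a small sketch (assuming the config layout discussed above: a top-level diskimages list plus per-provider diskimages entries) that reports where the pause flags actually live:

```python
import yaml

# Sketch only: shows where a "pause" flag sits in a builder config like the
# one being edited above. Path and key layout as discussed; not a validator.
with open("/etc/nodepool/nodepool.yaml") as f:
    cfg = yaml.safe_load(f)

# pause in the top-level diskimages list stops *building* that image
paused_builds = [d["name"] for d in cfg.get("diskimages", []) if d.get("pause")]

# pause in a provider's diskimages list only stops *uploading* to that provider
paused_uploads = [
    (p["name"], d["name"])
    for p in cfg.get("providers", [])
    for d in p.get("diskimages", [])
    if d.get("pause")
]

print("build-paused:", paused_builds)
print("upload-paused:", paused_uploads)
```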
fungi | and even so, after fixing and another restart it's still starting to build yet another new image | 19:02 |
openstackgerrit | Merged opendev/system-config master: Convert ssh keys for ruby net-ssh if necessary https://review.opendev.org/747056 | 19:02 |
fungi | ubuntu-xenial-arm64 this time | 19:02 |
clarkb | have we restarted nb01? | 19:03 |
fungi | aha! that one's on me, i missed adding a pause to ubuntu-xenial-arm64 | 19:03 |
fungi | i haven't restarted anything else yet. was trying to wrestle nb03 into line | 19:04 |
clarkb | gotcha | 19:04 |
clarkb | should I restart nb01 then so that it short circuits that loop? | 19:04 |
fungi | please do | 19:04 |
clarkb | done | 19:05 |
fungi | okay, after correctly reconfiguring nb03 it's no longer trying to build new images | 19:05 |
clarkb | there are no building images now | 19:05 |
fungi | not sure why the pause: false placeholders were in the provider instantiations rather than the definitions | 19:05 |
fungi | i did double-check nb01 and it looked correctly configured by comparison | 19:06 |
fungi | no remaining images in a building state now | 19:06 |
clarkb | ya I think things have stabilized now. If we want we can start nb02 and nb04 | 19:18 |
clarkb | but I'm going to get lunch first. | 19:19 |
clarkb | the dib change will be entering the gate soon I hope as well | 19:20 |
fungi | no need to start more builders until we're ready to un-pause them. they're just going to sit there twiddling their thumbs anyway | 19:24 |
clarkb | yup dib change is gating now. Now I'm really getting food as it's just sit-and-wait time for zuul to run jobs | 19:25 |
fungi | yeh, disappearing to work on dinner now | 19:25 |
openstackgerrit | Radosław Piliszek proposed openstack/project-config master: Add openstack/etcd3gw https://review.opendev.org/747185 | 19:25 |
openstackgerrit | Antoine Musso proposed opendev/gear master: wakeConnections: Randomize connections before scanning them https://review.opendev.org/747119 | 19:51 |
*** hashar has quit IRC | 19:52 | |
*** yoctozepto2 has joined #opendev | 20:06 | |
*** yoctozepto has quit IRC | 20:07 | |
*** yoctozepto2 is now known as yoctozepto | 20:07 | |
*** smcginnis has quit IRC | 20:12 | |
openstackgerrit | Merged openstack/diskimage-builder master: Revert "source-repositories: git is a build-only dependency" https://review.opendev.org/747025 | 20:37 |
clarkb | I expect ianw will be around soon and we can talk about making a release with ^ next | 20:45 |
clarkb | then work to land my change to package accounting and land a revert revert | 20:45 |
*** sshnaidm is now known as sshnaidm|afk | 20:47 | |
openstackgerrit | Pierre Riteau proposed opendev/irc-meetings master: Update CloudKitty meeting information https://review.opendev.org/747256 | 20:50 |
*** priteau has quit IRC | 20:52 | |
clarkb | zbr: the fix for the puppet jobs has merged. I'll try to approve the e-r python3 switch tomorrow (I'm running out of daylight today and want to make sure all the cleanup from the dib stuff is in a good spot) | 20:57 |
openstackgerrit | Merged openstack/project-config master: Re-introduce puppet-tripleo-core group https://review.opendev.org/746759 | 21:00 |
clarkb | the dib change to modify how package installs are handled is passing tests and has new tests to cover the behavior at https://review.opendev.org/#/c/747220/ | 21:18 |
clarkb | I'll stack a revert revert on top of that now | 21:18 |
clarkb | hrm do I need to rebase to do that? | 21:19 |
clarkb | maybe I won't stack then | 21:19 |
ianw | clarkb: hey, looking | 21:49 |
fungi | ianw: to catch you up, all diskimages are paused for all builders right now, and we've deleted the most recent diskimages. i manually edited the config for nb03 since it's been in the emergency disable list for months. we also discovered that builders won't notice config changes straight away generally, and so a restart is warranted if you need them to immediately apply | 21:51 |
fungi | oh, and on nb03 i moved the pause placeholders out of the provider section into the diskimage definitions to pause building instead of only pausing uploading | 21:52 |
ianw | sigh ... so i guess we exposed a lot of assumptions about git being on the host ... | 21:53 |
fungi | ianw: well, also we actually explicitly install git in infra-package-needs | 21:53 |
fungi | but the change to dib "cleans it up" helpfully anyway | 21:53 |
johnsom | FYI, docs.openstack.org seems to not be responding | 21:53 |
fungi | johnsom: thanks, checking now | 21:54 |
ianw | https://review.opendev.org/#/c/747121/ didn't work? | 21:54 |
fungi | ianw: at the time setup workspace runs, we haven't configured package management on the systems yet, and they don't have package indices on debuntu type systems at that point | 21:54 |
clarkb | ianw, fungi: pabelanger in particular doesn't even have working dns at that point | 21:55 |
fungi | johnsom: it's not down for me | 21:55 |
johnsom | fungi Yeah, just started loading for me | 21:55 |
clarkb | but ya we explicitly install git in infra-package-needs and so dib shouldn't undo that | 21:55 |
fungi | yeah, i was about to add, also other users of that role don't even necessarily have fundamental network bits in place yet | 21:55 |
fungi | so trying to install packages at that point is going to break for them regardless | 21:56 |
ianw | ok, so git is a special flower | 21:56 |
ianw | the revert is in -2 https://review.opendev.org/#/c/747238/ | 21:57 |
fungi | johnsom: looks like the webserver temporarily lost contact with the fileserver for six seconds at 21:47:22 and again for 27 seconds at 21:47:31 and another 7 seconds at 21:54:00 | 21:58 |
johnsom | That would do it. | 21:58 |
ianw | so there hasn't been a dib point release? | 21:59 |
clarkb | ianw: not yet, the revert merged not that long ago so I figured we'd wait for you just to double check | 21:59 |
clarkb | ianw: but I think we do that release then work on something like https://review.opendev.org/#/c/747220/ as the next step | 22:00 |
ianw | clarkb: ok, your merging change lgtm as a stop-gap against this returning | 22:00 |
clarkb | then people can have git removed if they don't explicitly install it elsewhere | 22:00 |
ianw | i agree, let me then push a .0.1 release | 22:00 |
fungi | johnsom: which in turn seems to be due to high iowait on the fileserver out of the blue: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=6397&rra_id=all | 22:01 |
fungi | i'm trying to ssh into it now | 22:01 |
clarkb | fungi: did static fail over to the RO server? | 22:02 |
clarkb | (I think that is how it is supposed to work so yay if it did) | 22:02 |
fungi | clarkb: i'm not sure, all the errors in dmesg are about losing and regaining access for 23.253.73.143 (afs02.dfw) | 22:03 |
fungi | and i'm still waiting for ssh to respond on it | 22:03 |
ianw | fungi/clarkb: so now we need to roll out 3.2.1 to builders and rebuild images? | 22:03 |
fungi | checking oob console too | 22:03 |
fungi | ianw: yeah | 22:03 |
fungi | i'm woefully overdue for an evening beer | 22:04 |
ianw | i guess the best way to do that is to bump the dib requirement in nodepool? | 22:04 |
fungi | ianw: or at least blacklist 3.2.0 | 22:04 |
*** tosky has quit IRC | 22:07 | |
clarkb | ya adding a != 3.2.0 is what I would do | 22:07 |
clarkb | and then we need to revert the pause change | 22:07 |
fungi | oob console is showing hung kernel tasks | 22:07 |
clarkb | and start builders on nb02 and nb04 | 22:07 |
ianw | i feel like before we've just done a >= | 22:08 |
ianw | great, my ssh-agent seems to have somehow died | 22:08 |
clarkb | I think thats fine too | 22:08 |
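A quick check of the difference between the two pinning options mentioned here, using the packaging library; the version numbers are the ones under discussion:

```python
from packaging.specifiers import SpecifierSet

exclude_bad = SpecifierSet("!=3.2.0")   # blacklist just the broken release
require_fix = SpecifierSet(">=3.2.1")   # or require the fixed point release

for version in ("3.1.0", "3.2.0", "3.2.1"):
    print(version, version in exclude_bad, version in require_fix)
# 3.1.0 True False   <- != still allows older releases
# 3.2.0 False False
# 3.2.1 True True
```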
fungi | some day distros will start having console kmesg spew use estimated datetime rather than seconds since boot | 22:08 |
fungi | hung kernel tasks on afs02.dfw began 23818825 seconds after boot | 22:09 |
fungi | if that was ~now then it means the server was booted 2019-11-19 05:50 | 22:11 |
fungi | checking to see if we happened to log that | 22:11 |
fungi | yay us! "2019-11-19 06:09:03 UTC rebooted afs02.dfw.openstack.org after it's console was full of I/O errors. very much like what we've seen before during host migrations that didn't go so well" | 22:12 |
fungi | unfortunately unless it miraculously clears up, this probably means an ungraceful reboot, fsck and then lengthy full resync of all afs volumes | 22:14 |
fungi | the cacti graph is also less reassuring... looks like the server stopped responding to snmp entirely 20 minutes ago | 22:15 |
fungi | infra-root: i'm going to hard reboot afs02.dfw | 22:15 |
clarkb | fungi: ok | 22:15 |
clarkb | also looks like docs is still unhappy implying we aren't using the other volume? | 22:16 |
fungi | hopefully once it's down all consumers will switch to the other server | 22:16 |
clarkb | ah ok maybe that is what is needed to flip flop | 22:16 |
ianw | my notes from that day say | 22:17 |
ianw | * eventually debug to afs02 being broken; reboot, retest, working | 22:17 |
fungi | #status log hard rebooted afs02.dfw.openstack.org after it became entirely unresponsive (hung kernel tasks on console too) | 22:17 |
openstackstatus | fungi: finished logging | 22:17 |
ianw | that i didn't log something about having to rebuild the world might be positive :) | 22:18 |
fungi | ianw: the subsequent entries in our status log worried me, until i realized that they were actually the result of a problem with afs01.dfw some days earlier which we didn't really grasp the full effects of until afs02.dfw hung | 22:19 |
fungi | docs.o.o seems to be back up for me, btw | 22:20 |
ianw | to nb03 -- i have hand edited the debootstrap there to know how to build focal images. the plan was to get that replaced with a container. *that* has been somewhat sidetracked by the slow builds of those containers. which led to us looking at arm wheels. which led to us doing 3rd party ci for cryptography | 22:21 |
ianw | which led to us finding page size issues with the manylinux2014 images, which has led to patches for patchelf | 22:21 |
ianw | i think this might be the definition of yak shaving | 22:21 |
fungi | ianw: ubuntu is usually good about backporting debootstrap so you can build chroots of newer releases on older systems | 22:22 |
ianw | perhaps in the meantime xenial has updated its debootstrap | 22:22 |
ianw | i don't think so, last entry seems to be 2016 | 22:23 |
fungi | :( | 22:23 |
ianw | sorry, better if i look in the updates repo | 22:24 |
fungi | check xenial-backports | 22:24 |
fungi | but yeah, not in xenial-backports | 22:24 |
ianw | * Add (Ubuntu) focal as a symlink to gutsy. (LP: #1848716) | 22:24 |
openstack | Launchpad bug 1848716 in debootstrap (Ubuntu) "Add Ubuntu Focal as a known release" [High,Fix released] https://launchpad.net/bugs/1848716 - Assigned to Łukasz Zemczak (sil2100) | 22:24 |
ianw | -- Łukasz 'sil2100' Zemczak <lukasz.zemczak@ubuntu.com> Fri, 18 Oct 2019 14:17:06 +0100 | 22:24 |
ianw | hrmm, i wonder if we don't have that | 22:24 |
ianw | oh i think that's right, we need 1.0.114 for some other reason | 22:27 |
ianw | https://launchpad.net/~openstack-ci-core/+archive/ubuntu/debootstrap/+sourcepub/11302190/+listing-archive-extra | 22:27 |
ianw | http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-05-19.log.html#t2020-05-19T09:30:19 and that's the discussion about it all ... | 22:31 |
clarkb | is the debootstrap fix not in our ppa? | 22:32 |
clarkb | if it is can't we turn ansible puppet back on? | 22:32 |
ianw | it is; i think we can probably turn puppet back on. i'm starting to think i might have just forgotten to do that after building ^^^ | 22:32 |
clarkb | gotcha | 22:32 |
ianw | the reason we run the backport is to build buster (http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-05-19.log.html#t2020-05-19T09:36:59) | 22:33 |
ianw | *that's* why the xenial-updates version doesn't work for us, it can build focal but not buster | 22:34 |
clarkb | ah | 22:35 |
ianw | clarkb: so you're looking into the "builders don't notice config changes"? | 22:36 |
fungi | ianw: he's got a fix proposed | 22:37 |
fungi | https://review.opendev.org/747277 | 22:37 |
clarkb | https://review.opendev.org/747277 is that proposed fix | 22:37 |
ianw | ok, https://review.opendev.org/#/c/747277/ ... | 22:37 |
ianw | jinx | 22:37 |
fungi | they *do* (eventually) load config changes | 22:37 |
fungi | just not until after cycling through all the defined images which need builds | 22:37 |
ianw | so, before everyone eod's :) i can monitor the deploy of https://review.opendev.org/747303 and re-enable builds. nb03 we can probably re-puppet, i'll look into that. and clarkb has the config-not-noticed issue in review | 22:39 |
ianw | i think that was the 3 main branches of the problems? | 22:39 |
fungi | yep, i think that covers it | 22:40 |
clarkb | we also want a revert of the pause change? | 22:40 |
clarkb | I guess that falls under re-enabling builds | 22:40 |
fungi | good reminder that we need to do that part though, yep | 22:41 |
ianw | yeah, i can watch that | 22:42 |
*** mlavalle has quit IRC | 22:56 | |
ianw | kevinz: if you can give me a ping about ipv4 access in the control plane cloud in linaro that would be super :) | 22:58 |
clarkb | oh that was the other thing I noticed | 22:58 |
clarkb | limestone has an ssl cert error | 22:58 |
clarkb | I don't think it is an emergency but once the other fires are out we should look into that /me makes a note for tomorrow and will try to catch lourot | 22:58 |
clarkb | er logan- sorry lourot bad tab complete | 22:58 |
ianw | clarkb: rejection issues or more like not in container issues? | 23:05 |
clarkb | ianw: I think that cloud may use a self-signed cert and we explicitly add a trust for it? and ya maybe that isn't bind mounted or now it's an LE cert or something | 23:06 |
clarkb | I should actually point s_client at it | 23:06 |
clarkb | ya s_client says it is a self signed cert | 23:07 |
clarkb | so we're probably just not supplying the cert in clouds.yaml for verification | 23:08 |
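As a rough illustration of what supplying the cert for verification means on the client side (the hostname and file path below are placeholders; in clouds.yaml this would be the per-cloud cacert option):

```python
import socket
import ssl

ENDPOINT = ("cloud.example.org", 443)        # placeholder API endpoint
CA_FILE = "/etc/openstack/provider-ca.pem"   # hypothetical path to the self-signed cert

# Fetch the certificate the server presents (roughly what s_client shows)
pem = ssl.get_server_certificate(ENDPOINT)
print(pem.splitlines()[0])

# Verification only succeeds if the presented cert chains to CA_FILE;
# with a self-signed cert that means trusting the cert itself.
ctx = ssl.create_default_context(cafile=CA_FILE)
with socket.create_connection(ENDPOINT) as sock:
    with ctx.wrap_socket(sock, server_hostname=ENDPOINT[0]) as tls:
        print("verified, negotiated", tls.version())
```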
openstackgerrit | Ian Wienand proposed openstack/project-config master: Revert "Pause all image builds" https://review.opendev.org/747312 | 23:23 |
*** DSpider has quit IRC | 23:41 | |
fungi | infra-root: i keep forgetting to mention, but i'm planning to try to be on "vacation" all next week. in theory i'll be avoiding the computer | 23:44 |
ianw | fungi: jealous! i will be within my 5km restriction zone and 1hr of exercise time :/ | 23:49 |
fungi | oh, i'm not going anywhere. i'll probably be put to work on a backlog of home improvement tasks | 23:50 |
clarkb | fungi: but will you go past 5km? | 23:51 |
fungi | doubtful. the hardware store is at most half that | 23:51 |
ianw | heh, you could if you *wanted* to though :) | 23:53 |
ianw | so the nodepool image is promoted, i guess we just need to wait for the next hourly roll out | 23:54 |
*** knikolla has quit IRC | 23:56 | |
*** dviroel has quit IRC | 23:56 | |
fungi | ianw: i *could* but i'd rather keep my good health ;) | 23:56 |
*** aannuusshhkkaa has quit IRC | 23:56 | |
clarkb | ianw: yes, the next hourly should even restart the builders iirc | 23:57 |
*** ildikov has quit IRC | 23:58 | |
*** knikolla has joined #opendev | 23:58 |