openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: edit-json-file: add role to combine values into a .json https://review.opendev.org/746834 | 00:46 |
---|---|---|
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: only run docker-setup.yaml when installed https://review.opendev.org/747062 | 00:46 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: Linaro MTU workaround https://review.opendev.org/747063 | 00:46 |
ianw | hrmmm, linaro mirror issues again ... https://zuul.opendev.org/t/zuul/build/f0f9658cd3ca40ff8abb74586e6bb569/console failed getting apt | 01:13 |
ianw | doesn't seem to be responding :/ | 01:14 |
ianw | SHUTOFF | 01:15 |
ianw | again | 01:15 |
ianw | kevinz: ^ | 01:15 |
ianw | i feel like this has to be an oops taking it down | 01:15 |
ianw | i think i might as well rebuild it as a focal node. i'm not going to spend time setting up captures etc. for an old kernel | 01:17 |
ianw | sigh ... bridge is dying too | 01:23 |
ianw | $ ps -aef | grep ansible-playbook | wc -l | 01:23 |
ianw | 211 | 01:23 |
ianw | all stuck on /home/zuul/src/opendev.org/opendev/system-config/playbooks/service-zuul.yaml >> /var/log/ansible/service-zuul.yaml.log | 01:23 |
ianw | i've killed them all. the log file isn't much help, as everything has tried to write to it | 01:26 |
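A minimal sketch of how the stuck runs could be found and cleaned up on bridge, assuming they all match the playbook path quoted above (the exact commands used aren't shown in the log):

```bash
#!/bin/bash
# List the stuck ansible-playbook invocations of service-zuul.yaml and count them.
pgrep -af 'ansible-playbook .*playbooks/service-zuul.yaml' | tee /tmp/stuck-playbooks.txt
wc -l /tmp/stuck-playbooks.txt

# Kill them once the list looks right; fall back to SIGKILL only if SIGTERM is ignored.
pkill -f 'ansible-playbook .*playbooks/service-zuul.yaml'
sleep 10
pgrep -f 'ansible-playbook .*playbooks/service-zuul.yaml' && \
    pkill -9 -f 'ansible-playbook .*playbooks/service-zuul.yaml'
```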
clarkb | ianw: I think that may be the result of our zuul job timeouts that run the service playbooks | 01:36 |
clarkb | they don't seem to clean up nicely (and we run zuul hourly to get images?) | 01:36 |
ianw | clarkb: i'll keep it open and see if one gets stuck, it's easier to debug one than 200 on top of each other :) | 01:38 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: arm64 mirror : update to Focal https://review.opendev.org/747069 | 01:43 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: arm64 mirror : update to Focal https://review.opendev.org/747069 | 01:49 |
ianw | ok, we've caught an afs oops during boot -> http://paste.openstack.org/show/796970/ | 02:03 |
ianw | auristor: ^ ... if that rings any bells | 02:03 |
ianw | i'm performing a hard reboot | 02:04 |
ianw | ... interesting .. same oops | 02:05 |
ianw | so then we seem to be stuck in "A start job is running for OpenAFS client (2min 56s / 3min 3s)" | 02:06 |
ianw | [ 8.338401] Starting AFS cache scan... ; i wonder if the cache is bad | 02:07 |
ianw | i'm going to delete /var/cache/openafs | 02:08 |
ianw | the server is up, but no afs to be clear at this point | 02:09 |
ianw | well that solved the oops, but still no afs. i'm starting to think ipv4 issues again | 02:14 |
ianw | hrm, i dunno, i can ping afs servers | 02:15 |
fungi | that's booting the ubuntu focal replacement arm64 server? | 02:21 |
ianw | fungi: no, the extant bionic one that died | 02:38 |
ianw | i'm going to try rebooting it again ... in case the fresh cache makes some difference | 02:39 |
fungi | okay, but you're ready for reviews on the focal replacement then | 02:41 |
ianw | sort of, it hasn't been tested on focal arm64 i don't think, because the mirror is down | 02:42 |
ianw | but i think we can merge 747069 | 02:42 |
ianw | ok, it's back, and ls /afs works ... | 02:44 |
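Roughly the recovery sequence described above, as a hedged sketch (the openafs-client service name and cache path are assumptions based on a standard Ubuntu OpenAFS install):

```bash
#!/bin/bash
# Stop the OpenAFS client if it is running (it may be hung in a start job).
sudo systemctl stop openafs-client || true

# Discard the on-disk client cache that appeared to trigger the oops at boot.
sudo rm -rf /var/cache/openafs/*

# Reboot and verify the cell is reachable again.
sudo reboot
# ...after the host comes back:
ls /afs
```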
ianw | and now the system-config gate is broken due to some linter stuff ... | 02:46 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: arm64 mirror : update to Focal https://review.opendev.org/747069 | 02:56 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Work around new ansible lint errors. https://review.opendev.org/747094 | 02:56 |
ianw | ok, back to the zuul thing. one of the playbooks is stuck again | 03:08 |
ianw | it's ... 30.248.253.23.in-addr.arpa domain name pointer zm05.openstack.org. | 03:09 |
ianw | as somewhat expected, it accepts the ssh connection then hangs | 03:10 |
ianw | standardish hung tasks messages on console | 03:11 |
ianw | #status log reboot zm05.openstack.org that had hung | 03:13 |
openstackstatus | ianw: finished logging | 03:13 |
openstackgerrit | Merged opendev/system-config master: Work around new ansible lint errors. https://review.opendev.org/747094 | 03:31 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: arm64 mirror : update to Focal https://review.opendev.org/747069 | 03:32 |
*** ysandeep|away is now known as ysandeep | 03:34 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ara-report: add option for artifact prefix https://review.opendev.org/747100 | 04:11 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: run-base-post: fix ARA artifact link https://review.opendev.org/747101 | 04:13 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ara-report: add option for artifact prefix https://review.opendev.org/747100 | 04:39 |
openstackgerrit | Merged opendev/system-config master: arm64 mirror : update to Focal https://review.opendev.org/747069 | 04:42 |
*** raukadah is now known as chkumar|rover | 04:43 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: launch-node: get sshfp entries from the host https://review.opendev.org/744821 | 05:09 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: launch-node: get sshfp entries from the host https://review.opendev.org/744821 | 05:10 |
frickler | ianw: seems logstash-worker08.openstack.org is broken, closes ssh connection immediately, failing the ansible deploy job. do you want to take a deeper look or just reboot via the API? | 05:33 |
ianw | frickler: sounds like the same old thing; i have the console up and can reboot it | 05:34 |
ianw | should be done | 05:42 |
*** lseki has quit IRC | 05:54 | |
*** lseki has joined #opendev | 05:54 | |
ianw | kevinz: so i'm having trouble starting another mirror node ... it seems ipv4 can't get in. i'm attaching to os-control-network. it actually worked once, but i had to delete that node, and now it doesn't work | 06:13 |
ianw | os-control-network=192.168.1.63, 2604:1380:4111:3e54:f816:3eff:fe57:7781, 139.178.85.144 | 06:17 |
ianw | ls -l /tmp/ | grep console | wc -l | 06:20 |
ianw | 104161 | 06:20 |
ianw | bridge has this many "console-bc764e02-6612-005b-e2c9-000000000012-bridgeopenstackorg.log" files | 06:20 |
ianw | i've removed them | 06:23 |
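One way to remove that many leftover console logs without tripping over the shell's argument-length limit, a sketch based on the filename shown above:

```bash
#!/bin/bash
# Count the stale Zuul console streamer logs left behind in /tmp.
find /tmp -maxdepth 1 -name 'console-*-bridgeopenstackorg.log' | wc -l

# Delete them; find -delete avoids expanding ~100k filenames onto a single command line.
sudo find /tmp -maxdepth 1 -name 'console-*-bridgeopenstackorg.log' -delete
```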
*** lpetrut has joined #opendev | 06:50 | |
*** DSpider has joined #opendev | 07:02 | |
*** hashar has joined #opendev | 07:04 | |
zbr | anyone that can help with https://review.opendev.org/#/c/747056/2 ? | 07:10 |
yoctozepto | morning infra; is https://docs.opendev.org/opendev/infra-manual/latest/creators.html the right guide to follow if I want to coordinate the etcd3gw move under the Oslo governance? i.e. the project already exists and this guide assumes it does not - what should I be aware of? | 07:27 |
yoctozepto | the current repo state (for reference) is here: https://github.com/dims/etcd3-gateway | 07:30 |
yoctozepto | it already used the (very old) cookiecutter template for libs; it depends on tox but obviously uses Travis rather than Zuul | 07:31 |
*** dtantsur|afk is now known as dtantsur | 07:34 | |
*** johnsom has quit IRC | 07:41 | |
AJaeger | yoctozepto: yes, that's the right guide - and it explains what to do to import a repository that exists. | 07:43 |
AJaeger | yoctozepto: check step 3 in https://docs.opendev.org/opendev/infra-manual/latest/creators.html#add-the-project-to-the-master-projects-list | 07:44 |
*** rpittau has quit IRC | 07:47 | |
*** fressi has joined #opendev | 07:48 | |
yoctozepto | AJaeger: ah, thanks! I was misled by the toc: https://docs.opendev.org/opendev/infra-manual/latest/creators.html#preparing-a-new-git-repository-using-cookiecutter | 07:53 |
*** rpittau has joined #opendev | 07:56 | |
*** johnsom has joined #opendev | 07:57 | |
*** elod is now known as elod_off | 07:58 | |
chkumar|rover | Hello Infra, We are seeing a rate limit issue in gate jobs | 08:00 |
chkumar|rover | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_047/746801/2/gate/tripleo-buildimage-overcloud-full-centos-8/04734f1/job-output.txt | 08:00 |
chkumar|rover | prepare-workspace-git : Clone cached repo to workspace | 08:00 |
chkumar|rover | primary | /bin/sh: line 1: git: command not found | 08:00 |
jrosser | i have an odd failure here https://zuul.opendev.org/t/openstack/build/f267841a98b443808365468e94ccdfa9/log/job-output.txt#178 | 08:00 |
jrosser | ^ same | 08:00 |
*** moppy has quit IRC | 08:01 | |
chkumar|rover | I think it is widespread on all distros | 08:01 |
*** moppy has joined #opendev | 08:01 | |
openstackgerrit | Antoine Musso proposed opendev/gear master: wakeConnections: Randomize connections before scanning them https://review.opendev.org/747119 | 08:05 |
cgoncalves | chkumar|rover, jrosser: this may help https://review.opendev.org/#/c/747025/ | 08:09 |
chkumar|rover | cgoncalves: thanks, just opened a bug https://bugs.launchpad.net/tripleo/+bug/1892326 | 08:10 |
openstack | Launchpad bug 1892326 in tripleo "Jobs failing with RETRY_LIMIT with primary | /bin/sh: line 1: git: command not found at prepare-workspace-git : Clone cached repo to workspace" [Critical,Triaged] | 08:10 |
cgoncalves | infra-root: would it be possible to manually trigger rebuild of nodepool images and push them to providers once https://review.opendev.org/#/c/747025/ merges? | 08:14 |
*** ykarel has joined #opendev | 08:14 | |
*** tosky has joined #opendev | 08:18 | |
openstackgerrit | yatin proposed zuul/zuul-jobs master: Ensure git is installed in prepare-workspace-git role https://review.opendev.org/747121 | 08:21 |
*** lseki has quit IRC | 08:30 | |
*** lseki has joined #opendev | 08:30 | |
*** rpittau has quit IRC | 08:30 | |
*** rpittau has joined #opendev | 08:30 | |
*** johnsom has quit IRC | 08:30 | |
*** johnsom has joined #opendev | 08:30 | |
ykarel | if some core is around please also check ^ | 08:35 |
ykarel | all jobs relying on this role are affected | 08:36 |
ianw | cgoncalves: i think we might have to release dib now to get it picked up | 08:39 |
cgoncalves | ianw, thing is we got ourselves in a chicken-n-egg situation where CI is failing to verify the revert | 08:40 |
ianw | ykarel: installing git there is probably a better idea than relying on it in the base image, at any rate | 08:40 |
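A rough shell equivalent of what the role-level fix amounts to (the actual zuul-jobs change presumably uses Ansible's package module; the package-manager detection here is only an illustration):

```bash
#!/bin/bash
# Install git on the test node if the image no longer ships it.
if ! command -v git >/dev/null 2>&1; then
    if command -v apt-get >/dev/null 2>&1; then
        sudo apt-get install -y git
    elif command -v dnf >/dev/null 2>&1; then
        sudo dnf install -y git
    else
        sudo yum install -y git
    fi
fi
```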
cgoncalves | at least two voting jobs already hit RETRY_LIMIT | 08:40 |
ianw | i think the build-only thing is a bit of a foot-gun unfortunately. anyway, that's not of immediate importance | 08:42 |
ianw | cgoncalves: will 747121 fix those jobs? | 08:42 |
cgoncalves | ianw, I think so but I've been wrong many times before xD | 08:42 |
ianw | welcome to the club :) | 08:43 |
cgoncalves | thanks!! | 08:43 |
ianw | i'm going to single approve 747121 as i think that should unblock things. then we can worry about the slower path of reverting, releasing, and rebuilding nodepool images and then ci images | 08:48 |
ianw | i have to afk for a bit | 08:48 |
*** priteau has joined #opendev | 08:50 | |
openstackgerrit | Merged zuul/zuul-jobs master: Ensure git is installed in prepare-workspace-git role https://review.opendev.org/747121 | 09:02 |
openstackgerrit | Tobias Henkel proposed openstack/project-config master: Create zuul/zuul-cli https://review.opendev.org/747127 | 09:13 |
openstackgerrit | Tobias Henkel proposed openstack/project-config master: Create zuul/zuul-client https://review.opendev.org/747127 | 09:33 |
*** andrewbonney has joined #opendev | 09:41 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: Use py3 with elastic-recheck https://review.opendev.org/729336 | 10:15 |
ykarel | ianw, Thanks for merging quickly | 10:58 |
ykarel | yes, it should not depend on the base image; having it in the base image is a plus though as it saves a couple of seconds | 10:58 |
zbr | AJaeger: tobiash: https://review.opendev.org/#/c/747056/ -- please, it is needed for https://review.opendev.org/#/c/729336/ | 11:11 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: Use py3 with elastic-recheck https://review.opendev.org/729336 | 11:12 |
*** hipr_c has joined #opendev | 11:33 | |
*** hipr_c has joined #opendev | 11:33 | |
*** hipr_c has joined #opendev | 11:33 | |
tosky | hi! If I click on the "Unit Tests Report" link here https://zuul.opendev.org/t/openstack/build/0cd50335a91b4e22a4776001e2d84785 | 12:19 |
*** jaicaa has quit IRC | 12:19 | |
AJaeger | zbr: please explain what the change is about so that I can decide whether to open it or not. I'm not reviewing either of these repos - and neither is tobiash. Please ask the rest of the admins later | 12:19 |
tosky | I get an empty page on chrome and an encoding error on Firefox | 12:19 |
tosky | s/chrome/Chromium/ | 12:19 |
AJaeger | tosky: is that only for this specific report - or for every one? I'm wondering whether that single file is corrupt or whether there's a generic problem. | 12:20 |
AJaeger | tosky: I can confirm the error on Firefox | 12:21 |
tosky | AJaeger: just that one | 12:22 |
tosky | I understand it may be a specific and once-in-a-while issue | 12:22 |
tosky | but just in case... | 12:22 |
*** jaicaa has joined #opendev | 12:22 | |
hashar | hello. I have a basic patch that fails the task "ubuntu-bionic: Build a tarball and wheel", python setup.py sdist bdist_wheel yields "no module named setuptools" | 12:29 |
hashar | is that a known issue by any chance? The repository is opendev/gear , patch is https://review.opendev.org/#/c/747119/1 | 12:30 |
AJaeger | tosky: ok. Hope other can help further | 12:30 |
tosky | AJaeger: thanks for checking! I know it may not be fixed, and that file is not critical anyway | 12:38 |
tosky | just reporting in case other reports start to pile up | 12:38 |
*** hashar has quit IRC | 12:50 | |
*** redrobot has quit IRC | 13:08 | |
frickler | tosky: AJaeger: looks like a bad upload to me, unless we see duplicates of that, I'd say this can happen and just do a recheck of that patch | 13:13 |
tosky | ack, thanks | 13:17 |
*** hashar has joined #opendev | 13:35 | |
fungi | hashar: i've seen that when a different python is used than the one for which setuptools is installed. we should probably switch that from python to python3 if it's not using a virtualenv | 13:45 |
lourot | hi o/ "openstack-tox-py35 https://zuul.opendev.org/t/openstack/build/8f4947ec185c4479a57b552de4338956 : RETRY_LIMIT in 2m 54s" | 13:45 |
lourot | this happened on at least two of our (openstack-charmers/canonical) reviews this afternoon | 13:46 |
lourot | the job seems to fail apt-installing git on xenial, is it something you noticed already? | 13:47 |
hashar | fungi: I am not sure I understand the reason ;] I have a hard time finding out where the job "build-python-release" is defined though | 13:47 |
fungi | lourot: that looks like the fallout from diskimage-builder removing git by default from images. we're hoping https://review.opendev.org/747121 fixes it so we don't have to wait for a revert and release in dib followed by nodepool image rebuilds and uploads to all providers | 13:47 |
fungi | hashar: take a look at the "console" tab for that build result and it shows the repository and path for the playbook which called the failing task, in this case opendev.org/opendev/base-jobs/playbooks/base/pre.yaml | 13:49 |
yoctozepto | fungi: it seems xenial broke | 13:49 |
yoctozepto | because it has no git packages | 13:49 |
fungi | hashar: er, sorry, i was looking at the wrong console, trying to answer too many questions at once | 13:49 |
lourot | fungi, understood, thanks! | 13:50 |
hashar | :]]]]] | 13:50 |
fungi | hashar: opendev.org/zuul/zuul-jobs/playbooks/python/release.yaml | 13:50 |
yoctozepto | https://review.opendev.org/747121 broke xenial and now we can't merge https://review.opendev.org/747025 | 13:50 |
fungi | yoctozepto: thanks, yeah i think we need git-vcs on xenial... checking now | 13:51 |
hashar | fungi: ahhh thank you very much. So yeah it runs {{ release_python }} setup.py sdist bdist_wheel , which would be python3 | 13:52 |
hashar | and somehow I guess the base image lacks setuptools | 13:52 |
yoctozepto | fungi: thanks | 13:53 |
fungi | hashar: we install setuptools for python3 i think, not python. ideally things should be calling python3 these days | 13:54 |
hashar | oh | 13:54 |
hashar | roles/build-python-release/defaults/main.yaml has an override: release_python: python | 13:54 |
fungi | yoctozepto: i was wrong, it's not git-vcs on xenial either, this error is strange, https://packages.ubuntu.com/xenial/git says it should exist | 13:55 |
fungi | hashar: yeah, probably we're not seeing this in other places because we set release_python: python3 (or something like that). you could check codesearch.openstack.org for release_python: | 13:55 |
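A quick way to confirm the diagnosis on a bionic node, as a sketch only (the interpreter behaviour is taken from the discussion above, not re-verified here):

```bash
#!/bin/bash
# Which interpreter does the release role invoke, and does it have setuptools?
python -c 'import setuptools; print(setuptools.__version__)'   # fails here: no module named setuptools
python3 -c 'import setuptools; print(setuptools.__version__)'  # expected to succeed (setuptools is installed for python3 per fungi above)

# Hence overriding release_python to python3 (or installing setuptools for python2) fixes the build.
```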
*** ykarel is now known as ykarel|away | 13:55 | |
openstackgerrit | Antoine Musso proposed opendev/gear master: zuul: use python3 for build-python-release https://review.opendev.org/747167 | 13:56 |
*** ykarel|away is now known as ykarel | 13:56 | |
hashar | fungi: or maybe if the base image has python2, it should also have setuptools? | 13:57 |
hashar | anyway, I might have found a way to set it to python3 | 13:57 |
yoctozepto | fungi: lack of apt-get update perhaps? | 13:57 |
ykarel | seems ^ the case for git not found | 13:57 |
yoctozepto | after working a lot on centos it feels nice to just hit install | 13:57 |
yoctozepto | but debian does not think so :-) | 13:58 |
hashar | fungi: thank you very much for your guidances | 13:59 |
fungi | yoctozepto: yeah, i suspect we may have tried to install a package too early before we've primed the pump for mirror stuff | 13:59 |
yoctozepto | fungi, ykarel: then let's just do the apt-get update in the role, shall we? | 14:00 |
fungi | though strange that this is only showing up for xenial | 14:00 |
openstackgerrit | Antoine Musso proposed opendev/gear master: zuul: use python3 for build-python-release https://review.opendev.org/747167 | 14:01 |
yoctozepto | maybe bionic+ images have the cache in them | 14:01 |
yoctozepto | which is valid enough | 14:01 |
ykarel | in other images it seems installed, the task is returning ok | 14:01 |
yoctozepto | or xenial's apt just got b0rken in the meantime | 14:01 |
ykarel | i saw a bionic job's log | 14:01 |
yoctozepto | it's sad the gate on the zuul change will not trigger the issue | 14:01 |
yoctozepto | maybe the ubuntu images did not rebuild? | 14:02 |
yoctozepto | i mean bionic+ ones | 14:02 |
yoctozepto | if you say they're 'ok' and not changed | 14:02 |
yoctozepto | tbh, I only saw centos failures in kolla today | 14:02 |
ykarel | ubuntu-bionic | ok https://5fadcfca1ff80d23fcf2-2bdb8be3dd1329f8a48d0e165eec17e9.ssl.cf2.rackcdn.com/746432/1/check/openstack-tox-py36/8f6daf0/job-output.txt | 14:02 |
yoctozepto | so might have been the case | 14:02 |
yoctozepto | bingo | 14:03 |
fungi | i've only just rubbed the sleep from my eyes, started to sip my coffee and stumbled into this in the last few minutes, so still trying to catch up on what's been happening from scrollback | 14:03 |
yoctozepto | fungi: it's a fire-fighting week for me | 14:03 |
ykarel | maybe we can hold a node? and see what's going to fix it quickly? | 14:03 |
yoctozepto | can't wait to see what Friday brings to the table | 14:03 |
fungi | we'll have to pick a change to recheck for the hold. i guess we can use the failing job for the dib revert | 14:04 |
fungi | working on that now | 14:05 |
ykarel | strange, in the dib change the job passed in check, ubuntu-xenial | ok | 14:06 |
frickler | fungi: maybe we also want to throw away the current images and revert to the previous ones until we can fix dib? | 14:06 |
fungi | frickler: i think we need to pause all image builds/uploads if we do that, because just deleting the images will trigger nodepool to start trying to upload them again | 14:08 |
fungi | last time i tried that i think i must not have paused them correctly | 14:08 |
fungi | anyway, the autohold and recheck are in, now waiting for openstack-tox-py35 to get a node | 14:09 |
openstackgerrit | Radosław Piliszek proposed zuul/zuul-jobs master: Fix git install on Debian distro family https://review.opendev.org/747170 | 14:10 |
yoctozepto | in case we want to go the apt-get update route, I prepared the above ^ | 14:10 |
fungi | once we have this node held, i can also bypass zuul to merge 739717 so dib folks can continue with the revert | 14:14 |
dmsimard | regarding that git install issue, I've also seen the issue in non-debian distros | 14:15 |
dmsimard | "/bin/sh: line 1: git: command not found" on CentOS8: https://zuul.openstack.org/build/d48c1f1a9e024f7ba4b1d68dea285d3e/console#0/3/8/centos-8 | 14:16 |
fungi | dmsimard: yep, but for those the role is installing git successfully now i think | 14:17 |
dmsimard | ah, was there a separate fix ? not caught up with entire backlog | 14:17 |
fungi | dmsimard: yeah, https://review.opendev.org/747121 | 14:17 |
dmsimard | neat, thanks | 14:18 |
frickler | fungi: yeah, forcing the revert in would be the other option, but IIUC we'd need to have another dib release then, too. not sure who except ianw can do that | 14:20 |
fungi | oh, right, since this job is failing in pre we have to wait for it to fail three times before it will trigger the autohold :/ | 14:21 |
*** lpetrut has quit IRC | 14:24 | |
*** chkumar|rover is now known as raukadah | 14:33 | |
fungi | it's starting attempt #3 now | 14:33 |
openstackgerrit | Radosław Piliszek proposed openstack/project-config master: Add openstack/etcd3gw https://review.opendev.org/747185 | 14:36 |
mnaser | `/usr/bin/apt-get -y -o \"Dpkg::Options::=--force-confdef\" -o \"Dpkg::Options::=--force-confold\" install 'git'' failed: E: Package 'git' has no installation candidate\n` | 14:39 |
fungi | i think we finally have a held node | 14:39 |
fungi | or should momentarily | 14:40 |
mnaser | ^ anyone seen this today? i'm not seeing anything in logs | 14:40 |
openstackgerrit | Radosław Piliszek proposed openstack/project-config master: Add openstack/etcd3gw https://review.opendev.org/747185 | 14:40 |
fungi | mnaser: yes, it's fallout from the fix for the fix for dib removing git | 14:40 |
fungi | we're now trying to install git in the prepare-workspace-git role but can't figure out why xenial is saying there's no git package | 14:41 |
fungi | i'm trying to get a node with that failure held now to see if i can work out what we're missing | 14:41 |
yoctozepto | fungi, mnaser: I bet on lack of apt-get update and await my fix merged :-) https://review.opendev.org/747170 | 14:42 |
fungi | does retry_limit not trigger autoholds? | 14:42 |
mnaser | ouch | 14:42 |
fungi | oh, nevermind, zuul hasn't finalized that build i guess | 14:42 |
fungi | seems the scheduler's in the middle of a reconfiguration event | 14:44 |
clarkb | retrylimit will hold and only the third and final instance | 14:44 |
fungi | yeah, it's finally failed the third but the result is in the queue backlog while the scheduler's reconfiguring | 14:45 |
fungi | i was just being impatient | 14:45 |
fungi | and there it goes | 14:46 |
fungi | though i still don't have a held node yet | 14:47 |
clarkb | another approach is to manually boot a xenial node | 14:48 |
fungi | oh fudge, i pasted in the wrong change number | 14:49 |
fungi | clarkb: well, we want to see what state the node is in when it's claiming it can't install git | 14:49 |
fungi | so just booting a xenial image won't necessarily get us that | 14:49 |
clarkb | it should be pretty close though | 14:50 |
clarkb | prepare-workspace-git happens very early iirc | 14:50 |
fungi | yep, our current suspicion is that it happens too early to be able to install distro packages | 14:50 |
fungi | like before we've set up mirroring configs and stuff | 14:51 |
clarkb | we can add git to our infra package needs element too | 14:52 |
*** ysandeep is now known as ysandeep|away | 14:52 | |
clarkb | rather than revert dib's change and re-release | 14:52 |
fungi | yeah, i was considering that as a fallback option | 14:52 |
fungi | fallback to installing it in the prepare workspace role i mean | 14:52 |
fungi | i'm ambivalent on whether dib maintainers want to keep or undo the git removal | 14:52 |
fungi | i corrected my autohold and abused zuul promote to restart check pipeline testing on the change in question | 14:54 |
*** qchris has quit IRC | 14:57 | |
fungi | i'm about to enter an hour where i'm triple-booked for meetings, but will try to keep tabs on this at the same time | 14:58 |
clarkb | I'm slowly getting to a real keyboard and can help more shortly | 15:02 |
clarkb | I'll probably work on the infra-package-needs change first so we've got it if we want it | 15:02 |
fungi | thanks | 15:02 |
clarkb | git is already in infra-package-needs | 15:10 |
clarkb | is dib removing it | 15:10 |
*** larainema has quit IRC | 15:10 | |
*** qchris has joined #opendev | 15:10 | |
* clarkb needs to find this dib change | 15:10 | |
fungi | yeesh | 15:11 |
clarkb | https://review.opendev.org/#/c/745678/1 | 15:11 |
clarkb | ya I think the build time only thing gets handled at a later build stage which then removes it | 15:11 |
fungi | right, that was the change which triggered this | 15:12 |
clarkb | basically that overrides our explicit request to install the package elsewhere | 15:12 |
clarkb | that makes me like the revert more | 15:12 |
clarkb | It's one thing to install it at runtime because we didn't install it on our images. It's another to tell dib to install it on the image and be ignored | 15:13 |
clarkb | I'm going to see if we can have the package installs override the other direction | 15:13 |
clarkb | if you ask to install it and not uninstall it somewhere then don't uninstall it | 15:13 |
fungi | looks like we're a bit backlogged on available nodes | 15:20 |
*** ykarel is now known as ykarel|away | 15:21 | |
*** ykarel|away has quit IRC | 15:28 | |
*** tosky_ has joined #opendev | 15:35 | |
*** tosky has quit IRC | 15:36 | |
*** tosky_ is now known as tosky | 15:37 | |
fungi | yep, test nodes flat-lined around 750 in use as of ~12:30z and the node requests have been climbing since | 15:37 |
fungi | current demand seems to be around 2x capacity | 15:38 |
fungi | also looks like we could stand to have an additional executor or two | 15:39 |
fungi | since around 14:00z there's been very little time where we had any executors accepting new builds | 15:40 |
fungi | and the executor queue graph shows we started running fewer concurrent builds since then | 15:41 |
openstackgerrit | Clark Boylan proposed openstack/diskimage-builder master: Don't remove packages that are requested to be installed https://review.opendev.org/747220 | 15:41 |
clarkb | something like that maybe? | 15:41 |
clarkb | fungi: the pre run churn is likely part of that | 15:41 |
fungi | i agree, this is probably a pathological situation | 15:41 |
fungi | openstack-tox-py35 has finally started its first try | 15:42 |
*** mlavalle has joined #opendev | 15:45 | |
fungi | looks like these are spending almost as much time waiting on an executor as they are waiting for a node | 15:47 |
clarkb | that will be affected by the job churn | 15:48 |
fungi | absolutely | 15:48 |
clarkb | since we rate limit job starts on executors | 15:48 |
clarkb | https://review.opendev.org/#/c/729336/ shows https://review.opendev.org/#/c/747056/ is working. fungi: once the bigger fire calms down (the gate won't pass for this with broken git anyway) maybe we can get reviews on those? | 16:04 |
fungi | yep! | 16:05 |
fungi | that's good | 16:05 |
corvus | clarkb: is there a fire that i can help with? | 16:05 |
fungi | also i'll have a break from meetings in about 55 minutes, maybe sooner | 16:05 |
clarkb | corvus: there is a fire. I think we're just trying to confirm which of the various fixes is our best bet. TL;DR is https://review.opendev.org/#/c/745678/1 merged to dib and was released. This has resulted in dib removing git from our images even though we explicitly request for git to be installed in infra-package-needs. | 16:06 |
fungi | corvus: we've discovered that if dib marks a package as build-specific like in https://review.opendev.org/745678 then you can't also explicitly install that package as a runtime need in another element | 16:06 |
clarkb | corvus: an earlier attempt at a fix does a git install in prepare-workspace-git. but on ubuntu we think that may need an apt-get update (fungi is working to confirm that now before we land the update change into prepare-workspace-git) | 16:07 |
clarkb | corvus: on the dib side I've written https://review.opendev.org/747220 to not uninstall packages if something requests they be installed normally | 16:07 |
clarkb | for some reason this seems to most affect xenial. (Do we know why yet?) | 16:07 |
fungi | one theory i've not had a chance to check is that we haven't uploaded new images for bionic et al yet | 16:08 |
fungi | and so they already have git preinstalled causing that task to no-op | 16:08 |
openstackgerrit | Paul Belanger proposed zuul/zuul-jobs master: Revert "Ensure git is installed in prepare-workspace-git role" https://review.opendev.org/747238 | 16:08 |
clarkb | ^^ that revert won't help anything aiui | 16:10 |
clarkb | the jobs will just fail on the next tasks | 16:10 |
clarkb | fwiw I think it's reasonable to make git an image dependency hence https://review.opendev.org/747220 | 16:11 |
corvus | clarkb, fungi: pabelanger left a comment that may be relevant on 747238 | 16:12 |
clarkb | corvus: ya I think we'll be installing git with default package mirrors (whatever those may be) | 16:13 |
clarkb | DNS should work on boot (that's a thing we've tried very hard to ensure) | 16:13 |
clarkb | though we aren't using the same images as ansible so ... | 16:13 |
openstackgerrit | Paul Belanger proposed zuul/zuul-jobs master: Revert "Ensure git is installed in prepare-workspace-git role" https://review.opendev.org/747238 | 16:13 |
fungi | i expect that yoctozepto's fix to do an apt update first would clear the error we're seeing with it, but i respect that ansible's use of the role may be incompatible with installing packages (though it should also no-op if the package is preinstalled in their images) | 16:14 |
yoctozepto | hmm, based on https://review.opendev.org/747170 - I know why it did not fail on the change - it tests PREVIOUS playbooks | 16:14 |
yoctozepto | fungi: yeah, if only it wanted to merge now that the queues are b0rken ^ | 16:15 |
clarkb | fungi: yup that is what I just noted on the change about the noop | 16:15 |
clarkb | basically reverting that zuul-jobs change doesn't help much if the images are broken | 16:16 |
clarkb | if the images are fixed then it noops (so I think we should either fix zuul-jobs or ignore it in favor of fixing the images) | 16:16 |
clarkb | then we can swing around and clean up zuul-jobs as necessary | 16:16 |
clarkb | unless paul has images with git and the only problem is that ansible doesn't noop there for some reason | 16:17 |
clarkb | (that info would be useful /me adds to change) | 16:17 |
fungi | looks like we have a node assignment for retry #3 on the job which my autohold is set for, and then i'll see if i can work out why that's failing so cryptically | 16:17 |
fungi | once it gets a free executor slot anyway | 16:17 |
corvus | iiuc that paul is saying dns is broken, it may be that yoctozepto's change is unsafe for paul | 16:18 |
corvus | (because even a no-op 'apt-get update' would fail due to broken dns) | 16:18 |
yoctozepto | fungi, clarkb, corvus: I think our best bet is to force-merge the zuul-jobs revert by pabelanger, then same with dib revert and rebuild the images | 16:18 |
clarkb | corvus: yup, that is why I'm thinking addressing the image problem is our best bet | 16:18 |
clarkb | yoctozepto: but paul's revert shouldn't affect anything | 16:19 |
clarkb | that's what I'm trying to say. If the images are fixed we don't need the revert. If the images are not fixed the revert won't help | 16:19 |
clarkb | we need to focus on the images imo | 16:19 |
yoctozepto | clarkb: but we do want the revert, let's start from a clean slate | 16:19 |
yoctozepto | anyhow, any idea why those jobs test the PREVIOUS playbooks? | 16:19 |
yoctozepto | I mean, they don't test the CURRENT change | 16:19 |
clarkb | yoctozepto: the revert has no bearing on whether jobs will fail or pass. I think we should ignore it and focus on what has an effect | 16:20 |
clarkb | then later we can revert if we want to clean up | 16:20 |
clarkb | yoctozepto: because they run in trusted repos | 16:20 |
clarkb | yoctozepto: that is normal expected behavior by zuul | 16:20 |
yoctozepto | clarkb: ok, missed that | 16:20 |
yoctozepto | clarkb: so it's even in gate? | 16:20 |
yoctozepto | it's scary to +2 such changes there then | 16:21 |
clarkb | yoctozepto: yes, you have to merge the change before it can be used. We have the base-test base job set up to act as a tester for these things | 16:21 |
fungi | which was not used in this case because things were already broken | 16:21 |
corvus | real quick q -- since there's a fire, did we delete the broken images to revert to previous ones? | 16:22 |
clarkb | corvus: no because nodepool will just rebuild and break us again | 16:22 |
clarkb | (at least that was my read of scrollback) | 16:22 |
corvus | well, that's what pause is for | 16:22 |
fungi | i think last time i tried to pause all image updates i got it wrong | 16:23 |
corvus | i mean, we have a documented procedure for exactly this case. if we had followed it, everything would not be broken. | 16:23 |
corvus | fungi: as an alternative, if there is any confusion, you can just stop the builders | 16:23 |
yoctozepto | can we focus on force-merging the dib revert change? :-) | 16:23 |
corvus | or we could follow procedure and not have to force-merge anything | 16:23 |
fungi | https://docs.opendev.org/opendev/system-config/latest/nodepool.html#bad-images | 16:23 |
fungi | maybe we didn't have that documented the last time i tried to do it | 16:24 |
fungi | i think i'll have to run the nodepool commands from nb03? | 16:24 |
fungi | all the others are docker containers now | 16:24 |
clarkb | fungi: you docker exec | 16:24 |
yoctozepto | corvus: so you pause, delete newest ones, and get previous ones? | 16:25 |
fungi | yoctozepto: the older images will be used automatically | 16:25 |
yoctozepto | fungi: ack | 16:25 |
openstackgerrit | Clark Boylan proposed openstack/project-config master: Pause all image builds https://review.opendev.org/747241 | 16:25 |
clarkb | so we force merge ^ then delete the image(s)? | 16:25 |
corvus | sure, or delete the image and then regular-merge that | 16:26 |
clarkb | fungi: `sudo docker exec nodepool-builder-compose_nodepool-builder_1 nodepool $command` from my scrollback on nb01 | 16:26 |
fungi | or stop builders, delete the image, regular-merge that, then start builders again once it's deployed? | 16:26 |
corvus | fungi: yes or that | 16:27 |
corvus | main thing is -- we shouldn't have to force-merge anything in this situation | 16:27 |
corvus | (and we should be able to get people working again immediately) | 16:27 |
fungi | okay, i'll start downing the builders now | 16:28 |
clarkb | if we delete the image then regular-merge, it will build and then upload I think? so ya downing seems better | 16:28 |
clarkb | (the pause will only apply to builds after the config is updated iirc) | 16:28 |
yoctozepto | that sounds very nice | 16:29 |
fungi | doing `sudo docker-compose down` in /etc/nodepool-builder-compose on nb01,02,04 and `sudo systemctl stop nodepool-builder` on nb03 now | 16:29 |
fungi | #status log all nodepool builders stopped in preparation for image rollback and pause config deployment | 16:31 |
openstackstatus | fungi: finished logging | 16:31 |
fungi | so next we need to build a list of the most recent images where there is at least one prior image and the latest image was built within the past day | 16:32 |
corvus | fungi: should just be the list of images with "00:" as the first part of the age column | 16:33 |
fungi | yep, that's what i just filtered on | 16:34 |
fungi | i guess we can assume there are prior images for all of those | 16:34 |
corvus | nodepool dib-image-list|grep " 00:" | 16:34 |
corvus | fungi: if there aren't, i don't think it matters anyway (essentially, every 00: image is broken yeah?) | 16:34 |
fungi | well, technically it's been less than 24 hours since the regression merged | 16:35 |
clarkb | corvus: the centos-8 one is 14 hours old which may not be new enough | 16:35 |
clarkb | but I think we can just assume they are broken if new like that and clean them up | 16:35 |
fungi | more important is when dib release was published i guess | 16:35 |
* clarkb checks zuul builds | 16:35 | |
corvus | this is what i get for that: http://paste.openstack.org/show/797001/ | 16:35 |
yoctozepto | I think https://review.opendev.org/747025 can (and should) be abandoned thanks to clarkb's patch | 16:36 |
clarkb | 05:28 UTC yesterday | 16:36 |
fungi | 3.2.0 appeared on pypi 05:31z yesterday, so yeah more than 24 hours maybe | 16:36 |
clarkb | oh today is the 20th not 18th | 16:36 |
clarkb | so ya anything built in the last 24 hours is likely bad | 16:36 |
fungi | i guess we just start with 00: | 16:36 |
clarkb | fungi: ++ | 16:37 |
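Putting the pieces together, a hedged sketch of the cleanup pass (the grep filter is corvus's one-liner above; the build id in the delete example is made up, and the nodepool commands would be prefixed with the docker exec invocation quoted earlier when the builder runs in a container):

```bash
#!/bin/bash
# List the image builds finished in the last 24 hours (the age column starts with "00:").
nodepool dib-image-list | grep " 00:"

# Delete each suspect build by its <image>-<build-id> name, e.g.:
nodepool dib-image-delete ubuntu-xenial-0000123456   # illustrative build id only

# The builders must be running for the on-disk files and provider uploads to actually
# get cleaned up; until then the records just sit in the "deleting" state.
```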
fungi | if i nodepool dib-image-delete will that also delete all the uploads of that build? | 16:37 |
fungi | or do i need to also manually delete them? | 16:37 |
clarkb | fungi: it will but only once the builders are started | 16:37 |
clarkb | (same iwth the on disk contents) | 16:37 |
fungi | ohh... right | 16:37 |
clarkb | the zk db updates should be sufficient to start booting on the older images though | 16:37 |
corvus | and yes, the docs say only to run "dib-image-delete"; image-delete is not necessary. | 16:38 |
clarkb | actually wait my earlier day math was right. Today is the 20th. The release was 05:30ish on the 19th | 16:39 |
clarkb | so about 11 hours ago | 16:39 |
clarkb | I think that means the centos-8 image is ok | 16:39 |
clarkb | (but deleting it is also fine) | 16:39 |
fungi | 05:31 utc yesterday is 24 hours before 05:31 utc today. it's now 16:40 utc, so >24 hours | 16:40 |
clarkb | bah timezones | 16:40 |
fungi | i failed to delete centos-7-0000134775 because it was building not ready, i guess i should have filtered on ready too | 16:42 |
clarkb | we'll need to delete it when it goes ready | 16:42 |
clarkb | oh wait it wont | 16:43 |
clarkb | because we stopped the builders :) | 16:43 |
fungi | yep | 16:43 |
clarkb | that should autocleanup then. Cool | 16:43 |
fungi | so this is the list: http://paste.openstack.org/show/797003 | 16:43 |
fungi | for posterity | 16:43 |
fungi | all but centos-7-0000134775 are in deleting state now | 16:43 |
clarkb | now we should cross check with the image-list | 16:44 |
clarkb | it may be the case that we need the builders running to update their states | 16:44 |
corvus | i approved the zuul-jobs revert for paul | 16:44 |
corvus | yes, i think the 'stop the builders' variant is untested | 16:44 |
fungi | this has reminded me that last time we did it without stopping the builders and had to deal with them immediately starting to build new bad images | 16:45 |
fungi | granted, that takes a bit of time | 16:45 |
fungi | so maybe also okay | 16:45 |
corvus | yes, "immediately" is relative here | 16:45 |
clarkb | ya, the image-list hasn't updated yet | 16:46 |
fungi | also while i was working on that, my autohold was finally satisfied, so i'll see if i can confirm why the apt install git was breaking | 16:46 |
corvus | sure, we would probably need to delete a few again | 16:46 |
clarkb | we can set those to delete too, or update a builder config to pause and start it | 16:46 |
corvus | better to let the builder do it | 16:46 |
corvus | tbh, i'd like to just follow the directions we wrote :) | 16:46 |
fungi | well, nodepool dib-image-delete won't let us delete an image which is building, so we have to catch it between completing the build and uploading | 16:47 |
clarkb | fungi: and we'll also start new image builds | 16:47 |
clarkb | but corvus is saying we should just manually delete those again when they happen | 16:47 |
fungi | the directions we wrote last time ended us with the problem coming back because we didn't catch and delete the new images fast enough | 16:47 |
clarkb | maybe start just nb01 to minimize the number of builds that can happen? A single builder should handle cleanup just fine | 16:48 |
clarkb | fungi: yes | 16:48 |
corvus | yes, it's possible that one or two jobs may end up running on new images with this process. but right now, we've been running thousands of jobs on bad images | 16:48 |
corvus | so it's like a 10000000000% improvement | 16:48 |
clarkb | should I up the container on nb01? | 16:48 |
fungi | i suppose we could mitigate it by manually applying the pause configuration to all the builders before starting to delete images? | 16:48 |
clarkb | fungi: we only need to start one, and yes we could manually apply the config there | 16:49 |
clarkb | (corvus is saying don't bother though) | 16:49 |
*** fressi has left #opendev | 16:49 | |
fungi | or do we then risk ansible deploying the old config back over them before the pause config is merged? | 16:49 |
clarkb | fungi: I think the idea is even if we rebuild one or two images we can just delete them again | 16:50 |
clarkb | while we land the pause config change | 16:50 |
clarkb | and if we restart only nb01 we'll minimize nodepools ability to build new images | 16:51 |
clarkb | so I think that is safe enough | 16:51 |
clarkb | corvus: ^ is that basically what you are saying? | 16:51 |
fungi | wfm | 16:51 |
corvus | you may need all the builders up. but yes. | 16:51 |
clarkb | ok I'll start with nb01, then check and see if we need to start the others | 16:51 |
corvus | i'm pretty much going to just keep saying "do what the instructions say" | 16:51 |
clarkb | I'm making sure I'm interpreting them correctly as well as articulating the corner case(s) in what the instructions say | 16:52 |
clarkb | Note the directions say to pause first, which we are not doing | 16:53 |
clarkb | do we want to manually edit the configs to pause first then? | 16:53 |
corvus | nope | 16:53 |
corvus | just start the builders | 16:53 |
corvus | merge the change | 16:53 |
corvus | keep deleting broken images | 16:53 |
fungi | i'd like to improve the instructions if we can come up with a less racy process for this, or at least figure out what feature to implement in nodepool so we can eliminate the race condition | 16:53 |
clarkb | ok nb01 is running | 16:54 |
corvus | fungi: sure it could be better, but i honestly don't think it's a big deal | 16:54 |
corvus | and considering we went off-script (even after we decided to go on-script) by stopping the builders, i don't think we can actually say we followed them this time | 16:54 |
openstackgerrit | Pierre Riteau proposed opendev/irc-meetings master: Update CloudKitty meeting information https://review.opendev.org/747256 | 16:54 |
corvus | they don't say anything about stopping or starting builders | 16:54 |
fungi | my main concern is that in the past it's resulted in us telling people a problem is fixed, only to have it crop back up again hours later and then there's confusion as to when it was actually fixed and what can safely be rechecked | 16:54 |
corvus | okay, let's add a paragraph at the end saying "if new images got built, delete those as well after the pause change has landed" | 16:55 |
fungi | the instructions don't say to stop the builders, they also don't say to keep monitoring the builders and deleting new images which were started before the pause went into place | 16:56 |
corvus | sure, but they do say "if you have a broken image, delete it" | 16:56 |
clarkb | nb01 is attempting to delete images according to the log | 16:57 |
clarkb | there are some auth exceptions to some url I don't recognize | 16:57 |
corvus | clarkb: its own images or others? | 16:57 |
corvus | clarkb: or rather, i think nb01 will only delete images on providers it talks to | 16:57 |
clarkb | corvus: so far just confirmed its own | 16:57 |
corvus | so it may only delete non-arm images | 16:58 |
clarkb | oh ya good point | 16:58 |
* clarkb checks arm | 16:58 | |
clarkb | logan-: fwiw it seems we get a cert verification error talking to limestone. We can dig in more once the images are in a happier place | 16:58 |
clarkb | ya doesn't seem to have touched the arm64 images | 16:59 |
clarkb | I'll start nb03 too | 16:59 |
clarkb | ok I think we're good until new images get uploaded (which will start with centos-7-0000134776 and ubuntu-xenial-arm64-0000094376 in an hour or two) | 17:01 |
fungi | yoctozepto: i've confirmed that your proposed patch to apt update also wouldn't have helped. this is running before we've set our apt configuration so fails with "The repository 'http://mirror.dfw.rax.opendev.org/ubuntu xenial-security Release' is not signed. Updating from such a repository can't be done securely, and is therefore disabled by default." | 17:03 |
corvus | clarkb: okay so should we zuul enqueue 747241 into gate? | 17:03 |
corvus | i'm assuming it's partway through failing some check jobs or something on the old images | 17:03 |
clarkb | corvus: ya I thik we can try that now | 17:04 |
clarkb | fungi: that's odd because we build the images using the mirrors to ensure we don't get ahead with their packages iirc. Which means we should bake in the override for that? | 17:04 |
fungi | clarkb: apparently that's not carried over | 17:04 |
clarkb | fungi: maybe that extra apt config is helpfully cleaned up | 17:04 |
*** dtantsur is now known as dtantsur|afk | 17:04 | |
yoctozepto | fungi: ack, I've abandoned it either way because the approach is wrong | 17:04 |
corvus | clarkb: in progress | 17:04 |
fungi | and yeah, it's also trying to use mirror.dfw.rax.opendev.org when it was booted in ovh-gra1 | 17:05 |
fungi | so apparently *some* of the configuration is not cleaned up | 17:05 |
fungi | though i did confirm that once the package lists were correctly updated, it was able to successfully install the git package | 17:06 |
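A sketch of the sort of checks run on the held xenial node; the exact commands are an assumption, but the errors and outcome match what fungi reports above:

```bash
#!/bin/bash
# The node (booted in ovh-gra1) still points at the rackspace DFW mirror:
grep -r mirror.dfw.rax.opendev.org /etc/apt/sources.list /etc/apt/sources.list.d/ || true

# Refreshing the package lists fails because that repository is treated as unsigned:
sudo apt-get update   # "The repository '... xenial-security Release' is not signed"

# Once the apt sources are corrected (proper mirror and signing config applied),
# installing git works as expected:
sudo apt-get update && sudo apt-get install -y git
```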
clarkb | as next steps I'm thinking: revert the dib change, push a release. Then we can land my fix and a revert of the revert (and test it), then do another release | 17:08 |
clarkb | https://review.opendev.org/#/c/747025/ is the dib revert | 17:09 |
clarkb | yoctozepto: ^ see plan above. I think it makes sense to test this more completely and start by going back to what is known to work then roll forward with better testing from there | 17:09 |
clarkb | I'm going to recheck that change now | 17:09 |
corvus | clarkb, fungi: the pause change is running jobs which have passed the point at which they're doing things with 'git' | 17:13 |
corvus | so ++ | 17:13 |
fungi | good deal | 17:13 |
fungi | does pause cause uploads to be paused too, or just builds? | 17:17 |
openstackgerrit | Merged openstack/project-config master: Pause all image builds https://review.opendev.org/747241 | 17:17 |
corvus | fungi: there's a pause for either; clarkb paused the builds | 17:18 |
yoctozepto | clarkb: I'm not sure I agree but it's not bad either | 17:18 |
corvus | so uploads of already built images will continue | 17:18 |
fungi | corvus: yep, thanks, just found that in the docs too | 17:19 |
corvus | (i think that is fine and correct in this case) | 17:19 |
fungi | so if we wanted to avoid uploading images which were in a building state when the diskimage pause was set, we'd need to also add it for all providers | 17:19 |
fungi | we don't have a mechanism for cancelling a build in progress, right? other than maybe a well placed sigterm | 17:20 |
clarkb | fungi: ya killing the dib process would do it, but nothing beyond that iirc | 17:21 |
fungi | and at that point it wouldn't retry the build because of the pause | 17:23 |
clarkb | yes | 17:23 |
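For completeness, a hedged sketch of the "well placed sigterm" option; it assumes the in-progress build shows up as a disk-image-create process on the builder host:

```bash
#!/bin/bash
# Find the running diskimage-builder invocation on the builder host.
pgrep -af disk-image-create

# Terminate it; with the diskimage paused in the config, the builder should not retry the build.
sudo pkill -TERM -f disk-image-create
```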
* clarkb is trying to figure out how to test https://review.opendev.org/747220 now | 17:24 | |
corvus | i was wondering if some of the nodepool/devstack jobs actually boot an image? but they probably don't do anything on it | 17:25 |
corvus | as a one-off, you could probably do something that verifies that git is installed on the booted vm? | 17:25 |
corvus | but also, aren't there some dib tests that can check stuff like that? | 17:26 |
corvus | (ie, build the image, then verify contents?) | 17:26 |
corvus | at the functional test level | 17:26 |
clarkb | corvus: they boot the vm and I think check that ssh works. Which makes me wonder if I should s/git/openssh-server/ as that will confirm the package ends up sticking around | 17:27 |
fungi | that does seem like it could also just be added as commands in a very last stage of an element, so that if the sanity checks don't succeed the image build fails | 17:27 |
clarkb | corvus: for the functional level tests they seem pretty basic. | 17:27 |
clarkb | but maybe there is something there I am missing /me looks more | 17:27 |
fungi | then the test would be to try building the image. if those checks fail, the image build fails and the job then fails | 17:28 |
clarkb | oh you know I can probably just run the scripts in that element and check the outputs | 17:28 |
*** sgw has joined #opendev | 17:30 | |
*** andrewbonney has quit IRC | 17:35 | |
clarkb | ya I think that is enough to show I've got a bug so I'll keep pulling on it that way | 17:36 |
*** hashar has quit IRC | 17:41 | |
corvus | i'm going to delete | ubuntu-xenial-arm64-0000094376 | ubuntu-xenial-arm64 | nb03.openstack.org | qcow2 | ready | 00:00:12:45 | | 17:46 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Docs: Extra details for image rollback https://review.opendev.org/747261 | 17:47 |
fungi | corvus: thanks! related ^ | 17:47 |
corvus | fungi: i'm not sure that pausing the provider-images would be effective. it can't go into effect any earlier than the dib pause, and i think the dib pause is sufficient to stop the upload | 17:50 |
fungi | oh, uploads won't occur if the diskimage build is paused? | 17:50 |
corvus | fungi: that's my understanding of the intent of the code. | 17:51 |
fungi | that's what i was asking earlier as to whether pausing the diskimage building would also pause uploading of the images | 17:51 |
corvus | i may have misunderstood that question then | 17:51 |
fungi | so if an image is in building state when the pause for it takes effect, once it reaches ready state the nodepool-builder won't attempt to upload it to providers? | 17:52 |
corvus | i believe that's the intent, but i'd give it 50/50 odds that that's what actually happens, because that's essentially a reconfiguration edge-case. | 17:53 |
corvus | but other than that potential edge case, in general, pausing a dib should stop derived uploads. | 17:53 |
fungi | well, yeah, i mean if you don't build an image then there's nothing to upload | 17:54 |
corvus | uploads fail all the time, so the builders are constantly retrying them | 17:54 |
corvus | (this is why i may have answered your question in a different context earlier) | 17:55 |
fungi | oh, i see, so it would prevent the upload from being retried, but not from being tried the first time | 17:55 |
fungi | (maybe) | 17:55 |
corvus | fungi: i'm just hedging my answer because it's a really specific question which i'm not sure is covered by a unit test | 17:56 |
fungi | sure, makes sense | 17:56 |
corvus | in general, i think what we all want to have happen is what the authors of the code wanted to have happen too | 17:56 |
fungi | so maybe really the only race we've encountered is from deleting images before the pause takes effect | 17:56 |
corvus | so i think our docs should reflect that, until we prove otherwise :) | 17:56 |
corvus | fungi: that is my expectation | 17:56 |
corvus | speaking of which, if infra-prod-service-nodepool ran successfully, shouldn't "pause: true" appear in /etc/nodepool/nodepool.yaml on nb03? | 17:58 |
fungi | that's what i would have expected | 17:58 |
fungi | unless infra-prod-service-nodepool isn't handling the non-container deployment? | 17:59 |
fungi | maybe that's still being done by the puppet-all job? | 17:59 |
corvus | that may be the case | 18:00 |
fungi | even though it's technically not being configuration-managed by puppet | 18:00 |
corvus | nb01 has true | 18:00 |
corvus | will that end up updated by a cron or something? | 18:00 |
fungi | also i don't know how far ianw got with bringing the mirror for the arm64 provider back to sanity, so it's possible arm64 builds are hopelessly broken at the moment either way | 18:01 |
fungi | looks like infra-prod-remote-puppet-else is queued in opendev-prod-hourly right now | 18:02 |
corvus | okay, given the limited impact, i don't think exceptional action is warranted. | 18:04 |
corvus | fungi: presumably the currently-building fedora-30 image will be a test of your question | 18:05 |
corvus | fungi: i've confirmed that dibs are paused on nb01, and it's 20m into a build of fedora-30 | 18:05 |
fungi | yeah, we'll know in a "bit" (or "while" at least) whether infra-prod-remote-puppet-else takes care of it | 18:05 |
corvus | so maybe when it's done, before we delete it, let's check to see if it uploads | 18:06 |
fungi | sounds good | 18:06 |
fungi | then i'll revise the docs change accordingly | 18:06 |
openstackgerrit | Clark Boylan proposed openstack/diskimage-builder master: Don't remove packages that are requested to be installed https://review.opendev.org/747220 | 18:06 |
clarkb | that is tested now. It fails pep8 locally but not on any of the files I changed? I wantt to see what zuul says about linting | 18:07 |
fungi | though also i agree if nodepool is expected to not upload images in that state, it's probably something worth fixing in nodepool | 18:07 |
corvus | clarkb: ^ fyi double check that there are no fedora-30-0000018222 uploads once it finishes building | 18:07 |
corvus | (before deleting it) | 18:07 |
clarkb | k | 18:08 |
corvus | i'm going to take a break | 18:09 |
fungi | i'll be breaking in about an hour to work on dinner prep but keeping an eye on this in the meantime | 18:10 |
clarkb | fungi: can https://review.opendev.org/#/c/747056/ get a review before dinner prep? | 18:24 |
fungi | yep, looking | 18:26 |
fungi | deleting centos-8-arm64-0000006345 which went ready ~20 minutes ago | 18:28 |
*** hashar has joined #opendev | 18:30 | |
fungi | dib-image-list indicates fedora-30-0000018222 went ready 2 minutes ago | 18:38 |
fungi | also indicates that nb01 has started building fedora-31-0000011973 | 18:38 |
fungi | so, um, does it not realize we asked it to pause? | 18:38 |
clarkb | fungi: it started before the pause | 18:39 |
fungi | it started 2 minutes ago | 18:39 |
clarkb | oh 31 not 30 | 18:39 |
clarkb | interesting | 18:39 |
fungi | yup | 18:39 |
fungi | also i can confirm fedora-30 is "uploading" to all providers currently | 18:40 |
clarkb | the config for fedora-31 on nb01 clearly says pause: true | 18:40 |
clarkb | maybe it's using cached config? | 18:40 |
fungi | though that could also simply be because the builder didn't actually pause | 18:40 |
fungi | i'm deleting fedora-30-0000018222 now before it taints more job builds | 18:41 |
clarkb | fungi: you mean because there is a bug? | 18:41 |
fungi | which statement was that question in relation to? | 18:41 |
clarkb | "though that could also simply be because the builder didn't actually pause" | 18:41 |
fungi | yes, either a bug in nodepool or a bug in how we're updating its configuration | 18:42 |
fungi | like does the builder daemon also need some signal to tell it to reread its configuration? | 18:43 |
fungi | or does it only read its config at start? | 18:43 |
clarkb | reading the code it seems to read it on every pass through its run loop | 18:44 |
fungi | also deleting debian-stretch-arm64-0000093525 which has gone ready | 18:44 |
clarkb | fungi: I think the loop is roughly: while true: load config; for image in images: if image is stale: rebuild | 18:46 |
fungi | also, the infra-prod-remote-puppet-else build in opendev-prod-hourly finished, but /etc/nodepool/nodepool.yaml on nb03 still hasn't been updated | 18:46 |
clarkb | fungi: my hunch is that it's going to try and build every image with the pre-pause config as it loops through that list | 18:46 |
clarkb | its not reloading the config between rebuilds until it gets through the whole list | 18:46 |
fungi | so if we want it to take effect ~immediately that requires a service restart, otherwise it will take effect in 6-12 hours | 18:47 |
clarkb | yes? Would be good for someone else to double check my read of the code but that is my read of it | 18:48 |
fungi | and i suppose we should bump the config read down one layer deeper in the nested loop if so | 18:48 |
fungi | in other news, /etc/ansible/hosts/emergency.yaml includes "nb03.openstack.org # ianw 2020-05-20 hand edits applied to dib to build focal on xenial" | 18:50 |
fungi | so this marks the three-month anniversary of the last configuration update there, i suppose | 18:50 |
fungi | i'll edit its config by hand for now | 18:50 |
clarkb | hrm I think that can be removed now, but we should confirm with ianw today | 18:50 |
fungi | i should have looked there sooner, but so much going on | 18:50 |
corvus | yeah, sounds like restart is needed currently, and we should have nodepool reload its config after each image build | 18:51 |
fungi | #status log edited /etc/nodepool/nodepool.yaml on nb03 to pause all image builds for now, since its in the emergency disable list | 18:52 |
openstackstatus | fungi: finished logging | 18:52 |
fungi | i've restarted nodepool-builder on nb03 to get it to read its updated configuration now | 18:53 |
fungi | interestingly, after a restart it immediately began building ubuntu-focal-arm64 | 18:55 |
fungi | the config sets pause: true for ubuntu-focal-arm64 | 18:55 |
fungi | why would it begin building? | 18:55 |
fungi | oh! because it's a pause under providers, not under diskimages | 18:56 |
fungi | all the pauses in its config are providers | 18:56 |
* fungi sighs, then fixes | 18:56 | |
clarkb | fungi: oh sorry I missed the difference in context. Normally we have the images set to pause: false ahead of time to toggle them | 18:58 |
fungi | well, in this case the config on nb03 had them set to pause: false in the diskimages list for linaro-us, not the main diskimages definitions list | 18:59 |
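To make the distinction concrete, a small sketch (assuming the config layout discussed above: a top-level diskimages list plus per-provider diskimages entries) that reports where the pause flags actually live:

```python
import yaml

# Sketch only: shows where a "pause" flag sits in a builder config like the
# one being edited above. Path and key layout as discussed; not a validator.
with open("/etc/nodepool/nodepool.yaml") as f:
    cfg = yaml.safe_load(f)

# pause in the top-level diskimages list stops *building* that image
paused_builds = [d["name"] for d in cfg.get("diskimages", []) if d.get("pause")]

# pause in a provider's diskimages list only stops *uploading* to that provider
paused_uploads = [
    (p["name"], d["name"])
    for p in cfg.get("providers", [])
    for d in p.get("diskimages", [])
    if d.get("pause")
]

print("build-paused:", paused_builds)
print("upload-paused:", paused_uploads)
```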
fungi | and even so, after fixing and another restart it's still starting to build yet another new image | 19:02 |
openstackgerrit | Merged opendev/system-config master: Convert ssh keys for ruby net-ssh if necessary https://review.opendev.org/747056 | 19:02 |
fungi | ubuntu-xenial-arm64 this time | 19:02 |
clarkb | have we restarted nb01? | 19:03 |
fungi | aha! that one's on me, i missed adding a pause to ubuntu-xenial-arm64 | 19:03 |
fungi | i haven't restarted anything else yet. was trying to wrestle nb03 into line | 19:04 |
clarkb | gotcha | 19:04 |
clarkb | should I restart nb01 then so that it short circuits that loop? | 19:04 |
fungi | please do | 19:04 |
clarkb | done | 19:05 |
fungi | okay, after correctly reconfiguring nb03 it's no longer trying to build new images | 19:05 |
clarkb | there are no building images now | 19:05 |
fungi | not sure why the pause: false placeholders were in the provider instantiations rather than the definitions | 19:05 |
fungi | i did double-check nb01 and it looked correctly configured by comparison | 19:06 |
fungi | no remaining images in a building state now | 19:06 |
clarkb | ya I think things have stabilized now. If we want we can start nb02 and nb04 | 19:18 |
clarkb | but I'm going to get lunch first. | 19:19 |
clarkb | the dib change will be entering the gate soon I hope as well | 19:20 |
fungi | no need to start more builders until we're ready to un-pause them. they're just going to sit there twiddling their thumbs anyway | 19:24 |
clarkb | yup dib change is gating now. Now I'm really getting food as it's just sit-and-wait time for zuul to run jobs | 19:25 |
fungi | yeh, disappearing to work on dinner now | 19:25 |
openstackgerrit | Radosław Piliszek proposed openstack/project-config master: Add openstack/etcd3gw https://review.opendev.org/747185 | 19:25 |
openstackgerrit | Antoine Musso proposed opendev/gear master: wakeConnections: Randomize connections before scanning them https://review.opendev.org/747119 | 19:51 |
*** hashar has quit IRC | 19:52 | |
*** yoctozepto2 has joined #opendev | 20:06 | |
*** yoctozepto has quit IRC | 20:07 | |
*** yoctozepto2 is now known as yoctozepto | 20:07 | |
*** smcginnis has quit IRC | 20:12 | |
openstackgerrit | Merged openstack/diskimage-builder master: Revert "source-repositories: git is a build-only dependency" https://review.opendev.org/747025 | 20:37 |
clarkb | I expect ianw will be around soon and we can talk about making a release with ^ next | 20:45 |
clarkb | then work to land my change to package accounting and land a revert revert | 20:45 |
*** sshnaidm is now known as sshnaidm|afk | 20:47 | |
openstackgerrit | Pierre Riteau proposed opendev/irc-meetings master: Update CloudKitty meeting information https://review.opendev.org/747256 | 20:50 |
*** priteau has quit IRC | 20:52 | |
clarkb | zbr: the fix for the puppet jobs has merged. I'll try to approve the e-r python3 switch tomorrow (I'm running out of daylight today and want to make sure all the cleanup from the dib stuff is in a good spot) | 20:57 |
openstackgerrit | Merged openstack/project-config master: Re-introduce puppet-tripleo-core group https://review.opendev.org/746759 | 21:00 |
clarkb | the dib change to modify how package installs are handled is passing tests and has new tests to cover the behavior at https://review.opendev.org/#/c/747220/ | 21:18 |
clarkb | I'll stack a revert revert on top of that now | 21:18 |
clarkb | hrm do I need to rebase to do that? | 21:19 |
clarkb | maybe I won't stack then | 21:19 |
ianw | clarkb: hey, looking | 21:49 |
fungi | ianw: to catch you up, all diskimages are paused for all builders right now, and we've deleted the most recent diskimages. i manually edited the config for nb03 since it's been in the emergency disable list for months. we also discovered that builders won't notice config changes straight away generally, and so a restart is warranted if you need them to immediately apply | 21:51 |
fungi | oh, and on nb03 i moved the pause placeholders out of the provider section into the diskimage definitions to pause building instead of only pausing uploading | 21:52 |
ianw | sigh ... so i guess we exposed a lot of assumptions about git being on the host ... | 21:53 |
fungi | ianw: well, also we actually explicitly install git in infra-package-needs | 21:53 |
fungi | but the change to dib "cleans it up" helpfully anyway | 21:53 |
johnsom | FYI, docs.openstack.org seems to not be responding | 21:53 |
fungi | johnsom: thanks, checking now | 21:54 |
ianw | https://review.opendev.org/#/c/747121/ didn't work? | 21:54 |
fungi | ianw: at the time setup workspace runs, we haven't configured package management on the systems yet, and they don't have package indices on debuntu type systems at that point | 21:54 |
clarkb | ianw, fungi: pabelanger in particular doesn't even have working dns at that point | 21:55 |
fungi | johnsom: it's not down for me | 21:55 |
johnsom | fungi Yeah, just started loading for me | 21:55 |
clarkb | but ya we explicitly install git in infra-package-needs and so dib shouldn't undo that | 21:55 |
fungi | yeah, i was about to add, also other users of that role don't even necessarily have fundamental network bits in place yet | 21:55 |
fungi | so trying to install packages at that point is going to break for them regardless | 21:56 |
ianw | ok, so git is a special flower | 21:56 |
ianw | the revert is in -2 https://review.opendev.org/#/c/747238/ | 21:57 |
fungi | johnsom: looks like the webserver temporarily lost contact with the fileserver for six seconds at 21:47:22 and again for 27 seconds at 21:47:31 and another 7 seconds at 21:54:00 | 21:58 |
johnsom | That would do it. | 21:58 |
ianw | so there hasn't been a dib point release? | 21:59 |
clarkb | ianw: not yet, the revert merged not that long ago so I figured we'd wait for you just to double check | 21:59 |
clarkb | ianw: but I think we do that release then work on something like https://review.opendev.org/#/c/747220/ as the next step | 22:00 |
ianw | clarkb: ok, your merging change lgtm as a stop-gap against this returning | 22:00 |
clarkb | then people can have git removed if they don't explicitly install it elsewhere | 22:00 |
ianw | i agree, let me then push a .0.1 release | 22:00 |
fungi | johnsom: which in turn seems to be due to high iowait on the fileserver out of the blue: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=6397&rra_id=all | 22:01 |
fungi | i'm trying to ssh into it now | 22:01 |
clarkb | fungi: did static fail over to the RO server? | 22:02 |
clarkb | (I think that is how it is supposed to work so yay if it did) | 22:02 |
fungi | clarkb: i'm not sure, all the errors in dmesg are about losing and regaining access for 23.253.73.143 (afs02.dfw) | 22:03 |
fungi | and i'm still waiting for ssh to respond on it | 22:03 |
ianw | fungi/clarkb: so now we need to roll out 3.2.1 to builders and rebuild images? | 22:03 |
fungi | checking oob console too | 22:03 |
fungi | ianw: yeah | 22:03 |
fungi | i'm woefully overdue for an evening beer | 22:04 |
ianw | i guess the best way to do that is to bump the dib requirement in nodepool? | 22:04 |
fungi | ianw: or at least blacklist 3.2.0 | 22:04 |
*** tosky has quit IRC | 22:07 | |
clarkb | ya adding a != 3.2.0 is what I would do | 22:07 |
clarkb | and then we need to revert the pause change | 22:07 |
fungi | oob console is showing hung kernel tasks | 22:07 |
clarkb | and start builders on nb02 and nb04 | 22:07 |
ianw | i feel like before we've just done a >= | 22:08 |
ianw | great, my ssh-agent seems to have somehow died | 22:08 |
clarkb | I think thats fine too | 22:08 |
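A quick check of the difference between the two pinning options mentioned here, using the packaging library; the version numbers are the ones under discussion:

```python
from packaging.specifiers import SpecifierSet

exclude_bad = SpecifierSet("!=3.2.0")   # blacklist just the broken release
require_fix = SpecifierSet(">=3.2.1")   # or require the fixed point release

for version in ("3.1.0", "3.2.0", "3.2.1"):
    print(version, version in exclude_bad, version in require_fix)
# 3.1.0 True False   <- != still allows older releases
# 3.2.0 False False
# 3.2.1 True True
```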
fungi | some day distros will start having console kmesg spew use estimated datetime rather than seconds since boot | 22:08 |
fungi | hung kernel tasks on afs02.dfw began 23818825 seconds after boot | 22:09 |
fungi | if that was ~now then it means the server was booted 2019-11-19 05:50 | 22:11 |
fungi | checking to see if we happened to log that | 22:11 |
fungi | yay us! "2019-11-19 06:09:03 UTC rebooted afs02.dfw.openstack.org after it's console was full of I/O errors. very much like what we've seen before during host migrations that didn't go so well" | 22:12 |
fungi | unfortunately unless it miraculously clears up, this probably means an ungraceful reboot, fsck and then lengthy full resync of all afs volumes | 22:14 |
fungi | the cacti graph is also less reassuring... looks like the server stopped responding to snmp entirely 20 minutes ago | 22:15 |
fungi | infra-root: i'm going to hard reboot afs02.dfw | 22:15 |
clarkb | fungi: ok | 22:15 |
clarkb | also looks like docs is still unhappy implying we aren't using the other volume? | 22:16 |
fungi | hopefully once it's down all consumers will switch to the other server | 22:16 |
clarkb | ah ok maybe that is what is needed to flip flop | 22:16 |
ianw | my notes from that day say | 22:17 |
ianw | * eventually debug to afs02 being broken; reboot, retest, working | 22:17 |
fungi | #status log hard rebooted afs02.dfw.openstack.org after it became entirely unresponsive (hung kernel tasks on console too) | 22:17 |
openstackstatus | fungi: finished logging | 22:17 |
ianw | that i didn't log something about having to rebuild the world might be positive :) | 22:18 |
fungi | ianw: the subsequent entries in our status log worried me, until i realized that they were actually the result of a problem with afs01.dfw some days earlier which we didn't really grasp the full effects of until afs02.dfw hung | 22:19 |
fungi | docs.o.o seems to be back up for me, btw | 22:20 |
ianw | to nb03 -- i have hand edited the debootstrap there to know how to build focal images. the plan was to get that replaced with a container. *that* has been somewhat sidetracked by the slow builds of those containers. which led to us looking at arm wheels. which led to us doing 3rd party ci for cryptography | 22:21 |
ianw | which led to us finding page size issues with the manylinux2014 images, which has led to patches for patchelf | 22:21 |
ianw | i think this might be the definition of yak shaving | 22:21 |
fungi | ianw: ubuntu is usually good about backporting debootstrap so you can build chroots of newer releases on older systems | 22:22 |
ianw | perhaps in the meantime xenial has updated its debootstrap | 22:22 |
ianw | i don't think so, last entry seems to be 2016 | 22:23 |
fungi | :( | 22:23 |
ianw | sorry, better if i look in the updates repo | 22:24 |
fungi | check xenial-backports | 22:24 |
fungi | but yeah, not in xenial-backports | 22:24 |
ianw | * Add (Ubuntu) focal as a symlink to gutsy. (LP: #1848716) | 22:24 |
openstack | Launchpad bug 1848716 in debootstrap (Ubuntu) "Add Ubuntu Focal as a known release" [High,Fix released] https://launchpad.net/bugs/1848716 - Assigned to Łukasz Zemczak (sil2100) | 22:24 |
ianw | -- Łukasz 'sil2100' Zemczak <lukasz.zemczak@ubuntu.com> Fri, 18 Oct 2019 14:17:06 +0100 | 22:24 |
ianw | hrmm, i wonder if we don't have that | 22:24 |
ianw | oh i think that's right, we need 1.0.114 for some other reason | 22:27 |
ianw | https://launchpad.net/~openstack-ci-core/+archive/ubuntu/debootstrap/+sourcepub/11302190/+listing-archive-extra | 22:27 |
ianw | http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-05-19.log.html#t2020-05-19T09:30:19 and that's the discussion about it all ... | 22:31 |
clarkb | is the debootstrap fix not in our ppa? | 22:32 |
clarkb | if it is can't we turn ansible puppet back on? | 22:32 |
ianw | it is; i think we can probably turn puppet back on. i'm starting to think i might have just forgotten to do that after building ^^^ | 22:32 |
clarkb | gotcha | 22:32 |
ianw | the reason we run the backport is to build buster (http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-05-19.log.html#t2020-05-19T09:36:59) | 22:33 |
ianw | *that's* why the xenial-updates version doesn't work for us, it can build focal but not buster | 22:34 |
clarkb | ah | 22:35 |
ianw | clarkb: so you're looking into the "builders don't notice config changes"? | 22:36 |
fungi | ianw: he's got a fix proposed | 22:37 |
fungi | https://review.opendev.org/747277 | 22:37 |
clarkb | https://review.opendev.org/747277 is that proposed fix | 22:37 |
ianw | ok, https://review.opendev.org/#/c/747277/ ... | 22:37 |
ianw | jinx | 22:37 |
fungi | they *do* (eventually) load config changes | 22:37 |
fungi | just not until after cycling through all the defined images which need builds | 22:37 |
ianw | so, before everyone eod's :) i can monitor the deploy of https://review.opendev.org/747303 and re-enable builds. nb03 we can probably re-puppet, i'll look into that. and clarkb has the config-not-noticed issue in review | 22:39 |
ianw | i think that was the 3 main branches of the problems? | 22:39 |
fungi | yep, i think that covers it | 22:40 |
clarkb | we also want a revert of the pause change? | 22:40 |
clarkb | I guess that falls under re-enabling builds | 22:40 |
fungi | good reminder that we need to do that part though, yep | 22:41 |
ianw | yeah, i can watch that | 22:42 |
*** mlavalle has quit IRC | 22:56 | |
ianw | kevinz: if you can give me a ping about ipv4 access in the control plane cloud in linaro that would be super :) | 22:58 |
clarkb | oh that was the other thing I noticed | 22:58 |
clarkb | limestone has an ssl cert error | 22:58 |
clarkb | I don't think it is an emergency but once the other fires are out we should look into that /me makes a note for tomorrow and will try to catch lourot | 22:58 |
clarkb | er logan- sorry lourot bad tab complete | 22:58 |
ianw | clarkb: rejection issues or more like not in container issues? | 23:05 |
clarkb | ianw: I think that cloud may use a self-signed cert and we explicitly add a trust for it? and ya maybe that isn't bind mounted or now it's an LE cert or something | 23:06 |
clarkb | I should actually point s_client at it | 23:06 |
clarkb | ya s_client says it is a self signed cert | 23:07 |
clarkb | so we're probably just not supplying the cert in clouds.yaml for verification | 23:08 |
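As a rough illustration of what supplying the cert for verification means on the client side (the hostname and file path below are placeholders; in clouds.yaml this would be the per-cloud cacert option):

```python
import socket
import ssl

ENDPOINT = ("cloud.example.org", 443)        # placeholder API endpoint
CA_FILE = "/etc/openstack/provider-ca.pem"   # hypothetical path to the self-signed cert

# Fetch the certificate the server presents (roughly what s_client shows)
pem = ssl.get_server_certificate(ENDPOINT)
print(pem.splitlines()[0])

# Verification only succeeds if the presented cert chains to CA_FILE;
# with a self-signed cert that means trusting the cert itself.
ctx = ssl.create_default_context(cafile=CA_FILE)
with socket.create_connection(ENDPOINT) as sock:
    with ctx.wrap_socket(sock, server_hostname=ENDPOINT[0]) as tls:
        print("verified, negotiated", tls.version())
```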
openstackgerrit | Ian Wienand proposed openstack/project-config master: Revert "Pause all image builds" https://review.opendev.org/747312 | 23:23 |
*** DSpider has quit IRC | 23:41 | |
fungi | infra-root: i keep forgetting to mention, but i'm planning to try to be on "vacation" all next week. in theory i'll be avoiding the computer | 23:44 |
ianw | fungi: jealous! i will be within my 5km restriction zone and 1hr of exercise time :/ | 23:49 |
fungi | oh, i'm not going anywhere. i'll probably be put to work on a backlog of home improvement tasks | 23:50 |
clarkb | fungi: but will you go past 5km? | 23:51 |
fungi | doubtful. the hardware store is at most half that | 23:51 |
ianw | heh, you could if you *wanted* to though :) | 23:53 |
ianw | so the nodepool image is promoted, i guess we just need to wait for the next hourly roll out | 23:54 |
*** knikolla has quit IRC | 23:56 | |
*** dviroel has quit IRC | 23:56 | |
fungi | ianw: i *could* but i'd rather keep my good health ;) | 23:56 |
*** aannuusshhkkaa has quit IRC | 23:56 | |
clarkb | ianw: yes, the next hourly should even restart the builders iirc | 23:57 |
*** ildikov has quit IRC | 23:58 | |
*** knikolla has joined #opendev | 23:58 |