ianw | let me bring it up... | 00:00 |
---|---|---|
mnaser | corvus: oops that's probably a horizon bug that should be fixed | 00:00 |
*** wolverineav has joined #openstack-infra | 00:04 | |
*** slaweq has quit IRC | 00:05 | |
ianw | clarkb: root@38.145.34.8 ... that is a f29 vm from the .qcow2 of a failed job that i copied off a held node. it never got network during the job, i moved it and it booted just fine :/ | 00:05 |
ianw | then i tunneled port 6080 into the held node, got into devstack's horizon, broke into the bootloader and reset the root password, rebooted it and ... it was alive. so couldn't even replicate it in the same place it failed | 00:06 |
clarkb | ianw: did you delete the interface file before rebooting in that case? | 00:07 |
clarkb | chances are glean got it configured enough that on next startup it worked? | 00:07 |
clarkb | looking at this f29 instance the glean unit file is a bit different than the one in gerrit | 00:08 |
clarkb | it doesn't have the before network manager service thing | 00:08 |
ianw | yeah, this was just before i switched it to the local-fs.target -- which made it work in CI | 00:08 |
ianw | unfortunately, the logs aren't recorded because i was watching it, and when it failed i uploaded a new change so zuul canceled it | 00:10 |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: bridge.o.o : install ansible 2.7.2 https://review.openstack.org/617218 | 00:13 |
clarkb | ianw: maybe this is something. systemctl list-units shows glean@lo.service but not glean@ens3.service on that f29 instance | 00:13 |
*** mriedem has quit IRC | 00:13 | |
clarkb | ianw: is it possible glean isn't managing this interface at all? | 00:13 |
ianw | hrm, it's in nmcli ... | 00:14 |
clarkb | it certainly seems like there is an ifcfg-ens3 file that looks like it came from glean though | 00:14 |
ianw | oh, hrm i've rebooted this, it's second boot so it would have skipped it | 00:15 |
clarkb | oh ya | 00:15 |
ianw | can delete the file and reboot | 00:15 |
clarkb | lets do that, I think it will help to better understand what it is doing on boot. Do you also want to update the unit to match the current version? | 00:15 |
clarkb | (you can reboot it under me, I don't have anything running on that connection I care about currently) | 00:15 |
ianw | ok, let me try | 00:17 |
*** kjackal has joined #openstack-infra | 00:17 | |
clarkb | should glean configure an lo ifcfg too for consistency? (I don't actually know, maybe linux just does that for us?) | 00:19 |
ianw | i think we skip it on purpose, but not sure | 00:20 |
ianw | rebooting. i've left the defaultdependencies=no in, for now | 00:20 |
ianw | it's going to be a while before the gate reports if it's happier with that | 00:20 |
clarkb | I think we skip it in glean ya, but that means we end up with the unit running in systemd (not a huge deal but we don't run the unit for everything else after first boot) | 00:20 |
ianw | without that | 00:20 |
ianw | yeah, ens3 service there now | 00:21 |
ianw | hrm, the dot file from systemd-analyze creates a 21mb png | 00:24 |
*** hamerins has quit IRC | 00:24 | |
clarkb | it made a 6.3kb svg for me doing systemd-analyze dot glean@ens3.service | dot -Tsvg > systemd.svg | 00:25 |
clarkb | did you run it without specifying a unit? that must be for all the things? | 00:25 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Add kube config to nodepool servers https://review.openstack.org/620755 | 00:25 |
clarkb | in any case it seems to confirm that NetworkManager happens after glean@ens3 | 00:25 |
clarkb | and journalctl -u glean@ens3 -u NetworkManager seems to confirm as well | 00:26 |
*** hamerins has joined #openstack-infra | 00:27 | |
ianw | yeah, https://imgur.com/a/2VBZ11G is a more restricted set | 00:28 |
clarkb | http://paste.openstack.org/show/736341/ logging with very precise timestamps | 00:28 |
*** hamerins has quit IRC | 00:30 | |
clarkb | -rw-r--r--. 1 root root 134 2018-11-29 00:20:04.439000000 +0000 ifcfg-ens3 | 00:30 |
clarkb | that lines up with when selinux was restored. Further evidence it isn't likely a sync/flush | 00:30 |
openstackgerrit | James E. Blair proposed openstack-infra/project-config master: Nodepool: add kubernetes provider https://review.openstack.org/620756 | 00:31 |
*** kjackal has quit IRC | 00:32 | |
ianw | i guess the big difference is that during the job it's a binary translated nested vm | 00:32 |
clarkb | ianw: if it continues to not work, it would be curious if adding a sleep(5) after the selinux restore makes it work. Like maybe we just need to go slower because qemu | 00:33 |
corvus | clarkb: can you +3 https://review.openstack.org/620704 and https://review.openstack.org/620646 ? | 00:34 |
clarkb | corvus: yes | 00:35 |
ianw | clarkb: yep, good idea. | 00:36 |
ianw | clarkb: to summarise -- setting "After= & Wants= local-fs.target" empirically works, but is theoretically wrong. setting "Before=NetworkManager.service network-pre.target" is theoretically right, but empirically does not work in the gate | 00:39 |
ianw | currently i'm testing before=networkmanager but dropping "defaultdependencies=no" (which we've just always had) to see if that makes a difference | 00:40 |
ianw | if not, i'll try again with a sleep() and sync() in glean to see if it's some sort of qemu race in the gate between getting the file out and starting networkmanager | 00:41 |
ianw | if not that, we'll just go back to local-fs.target and call it a day i guess | 00:41 |
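For readers following along, here is a sketch of the two unit-file orderings being compared above (the glean@.service template name is from the discussion; the exact production unit file may differ):

```ini
# Variant 1: order after local filesystems -- empirically works in CI,
# but theoretically wrong (nothing guarantees it runs before the
# network comes up):
[Unit]
After=local-fs.target
Wants=local-fs.target
DefaultDependencies=no

# Variant 2: order before NetworkManager -- theoretically right
# (glean writes the ifcfg files before NetworkManager reads them),
# but was failing in the gate:
[Unit]
Before=NetworkManager.service network-pre.target
DefaultDependencies=no
```

`systemd-analyze dot glean@ens3.service` (as used earlier in the log) is one way to verify which ordering actually took effect on a booted node.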
clarkb | ianw: note that case seems to matter so its NetworkManager not networkmanager | 00:41 |
ianw | yep, it's using the CamelCase name in the system files | 00:42 |
openstackgerrit | Clark Boylan proposed openstack-infra/system-config master: Nodepool group no longer hosts zookeeper https://review.openstack.org/620760 | 00:47 |
clarkb | corvus: fyi ^ is a cleanup I noticed when reviewing your change | 00:48 |
*** tosky has quit IRC | 00:50 | |
*** jistr has quit IRC | 01:00 | |
*** jistr has joined #openstack-infra | 01:01 | |
clarkb | ianw: alright I've still got nothing. I am going to go rake leaves and take a better look at my retaining wall that fell over. Here's to hoping there is understanding when I'm back tomorrow :) | 01:05 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: executor: add support for generic build resource https://review.openstack.org/570668 | 01:07 |
*** markvoelker has joined #openstack-infra | 01:23 | |
*** markvoelker has quit IRC | 01:28 | |
ianw | clarkb: yeah, me either :) i'm on a half-day today so will check it all out later | 01:30 |
*** dklyle has joined #openstack-infra | 01:32 | |
*** sthussey has quit IRC | 01:35 | |
openstackgerrit | Merged openstack-infra/project-config master: Add opendev-website project to Zuul https://review.openstack.org/620704 | 01:48 |
*** gyee has quit IRC | 01:54 | |
*** hamerins has joined #openstack-infra | 02:00 | |
*** mgutehall has quit IRC | 02:19 | |
*** wolverineav has quit IRC | 02:21 | |
*** mrsoul has joined #openstack-infra | 02:21 | |
*** wolverineav has joined #openstack-infra | 02:24 | |
*** hongbin has joined #openstack-infra | 02:28 | |
*** wolverineav has quit IRC | 02:28 | |
*** markvoelker has joined #openstack-infra | 02:30 | |
*** hamerins has quit IRC | 02:33 | |
*** sthussey has joined #openstack-infra | 02:36 | |
*** bhavikdbavishi has joined #openstack-infra | 02:42 | |
*** ramishra has joined #openstack-infra | 02:53 | |
*** rcernin has quit IRC | 02:58 | |
*** jamesmcarthur has joined #openstack-infra | 03:00 | |
*** jamesmcarthur has quit IRC | 03:04 | |
*** ramishra has quit IRC | 03:04 | |
*** ramishra has joined #openstack-infra | 03:06 | |
*** diablo_rojo has quit IRC | 03:11 | |
*** jamesmcarthur has joined #openstack-infra | 03:14 | |
*** apetrich has quit IRC | 03:16 | |
*** rlandy|bbl is now known as rlandy | 03:18 | |
*** hamerins has joined #openstack-infra | 03:20 | |
*** jamesmcarthur has quit IRC | 03:30 | |
*** hamerins has quit IRC | 03:34 | |
*** bobh has joined #openstack-infra | 03:35 | |
*** rcernin has joined #openstack-infra | 03:38 | |
*** bobh has quit IRC | 03:41 | |
*** EmLOveAnh has joined #openstack-infra | 03:42 | |
*** roman_g has quit IRC | 03:49 | |
openstackgerrit | Merged openstack/diskimage-builder master: Fix unit tests for elements https://review.openstack.org/619387 | 03:53 |
*** EmLOveAnh has quit IRC | 03:54 | |
*** dklyle has quit IRC | 03:58 | |
openstackgerrit | Merged openstack-infra/zuul master: Remove uneeded if statement https://review.openstack.org/617984 | 04:02 |
*** diablo_rojo has joined #openstack-infra | 04:17 | |
*** jamesmcarthur has joined #openstack-infra | 04:20 | |
*** armax has quit IRC | 04:24 | |
*** jamesmcarthur has quit IRC | 04:25 | |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/nodepool master: Amazon EC2 driver https://review.openstack.org/535558 | 04:29 |
*** hongbin has quit IRC | 04:34 | |
*** dave-mccowan has quit IRC | 04:34 | |
*** dave-mccowan has joined #openstack-infra | 04:35 | |
*** auristor has quit IRC | 04:38 | |
*** eernst has joined #openstack-infra | 04:45 | |
*** auristor has joined #openstack-infra | 04:45 | |
*** dave-mccowan has quit IRC | 04:49 | |
*** auristor has quit IRC | 04:50 | |
*** auristor has joined #openstack-infra | 04:53 | |
openstackgerrit | Kendall Nelson proposed openstack-infra/infra-specs master: StoryBoard Story Attachments https://review.openstack.org/607377 | 04:54 |
*** eernst has quit IRC | 04:55 | |
openstackgerrit | Merged openstack-infra/project-config master: Add opendev-website jobs https://review.openstack.org/620646 | 05:00 |
*** yboaron_ has joined #openstack-infra | 05:01 | |
*** sthussey has quit IRC | 05:05 | |
*** udesale has joined #openstack-infra | 05:06 | |
*** janki has joined #openstack-infra | 05:09 | |
*** bhavikdbavishi has quit IRC | 05:18 | |
*** bhavikdbavishi has joined #openstack-infra | 05:18 | |
*** ykarel|away has joined #openstack-infra | 05:20 | |
*** bhavikdbavishi has quit IRC | 05:30 | |
*** ykarel|away has quit IRC | 06:02 | |
*** bhavikdbavishi has joined #openstack-infra | 06:12 | |
*** mgutehall has joined #openstack-infra | 06:14 | |
*** ykarel|away has joined #openstack-infra | 06:16 | |
openstackgerrit | Ian Wienand proposed openstack-infra/glean master: Add NetworkManager distro plugin support https://review.openstack.org/618964 | 06:19 |
openstackgerrit | Ian Wienand proposed openstack-infra/glean master: A systemd skip for Debuntu systems https://review.openstack.org/620420 | 06:19 |
*** ykarel|away is now known as ykarel|lunch | 06:20 | |
*** ccamacho has quit IRC | 06:21 | |
*** armax has joined #openstack-infra | 06:44 | |
*** kjackal has joined #openstack-infra | 06:56 | |
*** bhavikdbavishi has quit IRC | 07:01 | |
*** ahosam has joined #openstack-infra | 07:02 | |
*** slaweq has joined #openstack-infra | 07:03 | |
*** quiquell|off is now known as quiquell | 07:07 | |
*** e0ne has joined #openstack-infra | 07:10 | |
*** ccamacho has joined #openstack-infra | 07:12 | |
*** chkumar|away has quit IRC | 07:13 | |
*** chandan_kumar has joined #openstack-infra | 07:15 | |
*** rascasoft has quit IRC | 07:25 | |
*** apetrich has joined #openstack-infra | 07:26 | |
*** ahosam has quit IRC | 07:26 | |
*** ahosam has joined #openstack-infra | 07:26 | |
*** armax has quit IRC | 07:27 | |
*** aojea has joined #openstack-infra | 07:32 | |
*** dpawlik has joined #openstack-infra | 07:32 | |
*** rcernin has quit IRC | 07:35 | |
*** ccamacho has quit IRC | 07:39 | |
*** rascasoft has joined #openstack-infra | 07:40 | |
openstackgerrit | Kendall Nelson proposed openstack-infra/infra-specs master: StoryBoard Story Attachments https://review.openstack.org/607377 | 07:53 |
*** florianf|afk is now known as florianf | 07:57 | |
*** ralonsoh has joined #openstack-infra | 07:58 | |
*** ccamacho has joined #openstack-infra | 08:06 | |
*** ccamacho has quit IRC | 08:07 | |
*** ccamacho has joined #openstack-infra | 08:08 | |
*** ginopc has joined #openstack-infra | 08:11 | |
*** quiquell is now known as quiquell|brb | 08:13 | |
*** roman_g has joined #openstack-infra | 08:18 | |
*** bhavikdbavishi has joined #openstack-infra | 08:18 | |
*** olivierbourdon38 has joined #openstack-infra | 08:27 | |
*** diablo_rojo has quit IRC | 08:29 | |
*** jpena|off is now known as jpena | 08:31 | |
*** ykarel|lunch is now known as ykarel | 08:31 | |
*** takamatsu has quit IRC | 08:31 | |
*** dims has quit IRC | 08:32 | |
*** dims has joined #openstack-infra | 08:33 | |
*** quiquell|brb is now known as quiquell | 08:37 | |
*** bhavikdbavishi1 has joined #openstack-infra | 08:37 | |
*** bhavikdbavishi has quit IRC | 08:38 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 08:38 | |
*** ginopc has quit IRC | 08:39 | |
*** ginopc has joined #openstack-infra | 08:39 | |
*** pcaruana has joined #openstack-infra | 08:48 | |
*** markvoelker has quit IRC | 08:50 | |
*** apetrich has quit IRC | 08:52 | |
openstackgerrit | Merged openstack-infra/irc-meetings master: Update Meeting Chairs https://review.openstack.org/620685 | 08:53 |
*** rossella_s has joined #openstack-infra | 08:54 | |
openstackgerrit | Merged openstack-infra/irc-meetings master: rpm-packaging: Adjust meeting time https://review.openstack.org/620616 | 08:57 |
*** tosky has joined #openstack-infra | 09:03 | |
*** shardy has joined #openstack-infra | 09:04 | |
*** gfidente has joined #openstack-infra | 09:06 | |
*** ahosam has quit IRC | 09:10 | |
*** ahosam has joined #openstack-infra | 09:10 | |
*** takamatsu has joined #openstack-infra | 09:13 | |
*** shardy has quit IRC | 09:13 | |
*** jpich has joined #openstack-infra | 09:15 | |
*** derekh has joined #openstack-infra | 09:22 | |
openstackgerrit | Brendan proposed openstack-infra/zuul master: Fix "reverse" Depends-On detection with new Gerrit URL schema https://review.openstack.org/620838 | 09:25 |
*** kaiokmo has quit IRC | 09:38 | |
*** apetrich has joined #openstack-infra | 09:43 | |
*** ssbarnea has quit IRC | 09:45 | |
*** ssbarnea|rover has joined #openstack-infra | 09:45 | |
*** yboaron_ has quit IRC | 09:46 | |
*** yboaron_ has joined #openstack-infra | 09:46 | |
openstackgerrit | Merged openstack-infra/nodepool master: Update node request during locking https://review.openstack.org/618807 | 09:50 |
*** shardy has joined #openstack-infra | 09:51 | |
*** markvoelker has joined #openstack-infra | 09:51 | |
*** rossella_s has quit IRC | 09:52 | |
*** bhavikdbavishi has quit IRC | 09:55 | |
openstackgerrit | Merged openstack-infra/nodepool master: Add second level cache of nodes https://review.openstack.org/619025 | 09:59 |
openstackgerrit | Merged openstack-infra/nodepool master: Add second level cache to node requests https://review.openstack.org/619069 | 09:59 |
openstackgerrit | Merged openstack-infra/nodepool master: Only setup zNode caches in launcher https://review.openstack.org/619440 | 09:59 |
*** jchhatbar has joined #openstack-infra | 10:02 | |
*** janki has quit IRC | 10:05 | |
*** ahosam has quit IRC | 10:06 | |
*** kashyap has left #openstack-infra | 10:10 | |
ianw | clarkb: http://logs.openstack.org/71/618671/14/experimental/nodepool-functional-py35-redhat-src/0067296/controller/logs/screen-nodepool-launcher.txt.gz#_Nov_29_08_02_15_155672 | 10:23 |
ianw | http://paste.openstack.org/show/736383/ | 10:23 |
ianw | even with two syncs() and a pause something still goes wrong | 10:23 |
openstackgerrit | BenoƮt Bayszczak proposed openstack-infra/zuul master: Disable Nodepool nodes lock for SKIPPED jobs https://review.openstack.org/613261 | 10:23 |
*** markvoelker has quit IRC | 10:24 | |
*** udesale has quit IRC | 10:27 | |
*** ahosam has joined #openstack-infra | 10:31 | |
openstackgerrit | Ian Wienand proposed openstack-infra/glean master: Add NetworkManager distro plugin support https://review.openstack.org/618964 | 10:31 |
openstackgerrit | Ian Wienand proposed openstack-infra/glean master: A systemd skip for Debuntu systems https://review.openstack.org/620420 | 10:31 |
*** AJaeger_ has quit IRC | 10:32 | |
*** rossella_s has joined #openstack-infra | 10:36 | |
*** jchhatba_ has joined #openstack-infra | 10:37 | |
*** lpetrut has joined #openstack-infra | 10:38 | |
*** jchhatbar has quit IRC | 10:39 | |
*** gfidente has quit IRC | 10:39 | |
*** mgutehall has quit IRC | 10:40 | |
*** shardy has quit IRC | 10:41 | |
*** yamamoto has quit IRC | 10:43 | |
*** shardy has joined #openstack-infra | 10:43 | |
*** rossella_s has quit IRC | 10:49 | |
*** rossella_s has joined #openstack-infra | 10:53 | |
*** mgutehall has joined #openstack-infra | 10:56 | |
*** electrofelix has joined #openstack-infra | 11:04 | |
*** AJaeger_ has joined #openstack-infra | 11:13 | |
*** markvoelker has joined #openstack-infra | 11:21 | |
*** takamatsu has quit IRC | 11:25 | |
*** jchhatbar has joined #openstack-infra | 11:30 | |
*** takamatsu has joined #openstack-infra | 11:31 | |
*** jchhatba_ has quit IRC | 11:33 | |
*** rossella_s has quit IRC | 11:39 | |
openstackgerrit | Merged openstack/diskimage-builder master: Fix a typo in the help message of disk-image-create https://review.openstack.org/619679 | 11:42 |
*** jchhatbar has quit IRC | 11:43 | |
*** ahosam has quit IRC | 11:45 | |
*** markvoelker has quit IRC | 11:55 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Asynchronously update node statistics https://review.openstack.org/619589 | 11:57 |
*** jpena is now known as jpena|lunch | 11:59 | |
*** jamesmcarthur has joined #openstack-infra | 12:00 | |
*** jamesmcarthur has quit IRC | 12:04 | |
*** takamatsu has quit IRC | 12:06 | |
*** yamamoto has joined #openstack-infra | 12:08 | |
*** xek_ has joined #openstack-infra | 12:12 | |
*** gfidente has joined #openstack-infra | 12:12 | |
*** tpsilva has joined #openstack-infra | 12:15 | |
*** dave-mccowan has joined #openstack-infra | 12:16 | |
*** jcoufal has joined #openstack-infra | 12:19 | |
*** xek_ has quit IRC | 12:21 | |
chandan_kumar | odyssey4me: Hello | 12:23 |
chandan_kumar | odyssey4me: https://review.openstack.org/#/c/620800/ and https://review.openstack.org/#/c/619986/4 both do different tasks | 12:24 |
chandan_kumar | odyssey4me: I am not getting how they are similar | 12:24 |
chandan_kumar | odyssey4me: need help here | 12:24 |
chandan_kumar | sorry wrong channel | 12:24 |
*** jcoufal has quit IRC | 12:33 | |
*** rh-jelabarre has joined #openstack-infra | 12:37 | |
*** panda|pto is now known as panda | 12:41 | |
*** sshnaidm|afk is now known as sshnaidm | 12:51 | |
*** markvoelker has joined #openstack-infra | 12:52 | |
openstackgerrit | Erno Kuvaja proposed openstack-infra/project-config master: Add Review Priority column to glance repos https://review.openstack.org/620904 | 12:53 |
*** e0ne has quit IRC | 12:53 | |
*** boden has joined #openstack-infra | 12:55 | |
*** shardy has quit IRC | 13:03 | |
*** shardy has joined #openstack-infra | 13:10 | |
*** rlandy has joined #openstack-infra | 13:11 | |
*** rossella_s has joined #openstack-infra | 13:12 | |
*** kgiusti has joined #openstack-infra | 13:14 | |
*** yamamoto has quit IRC | 13:19 | |
*** yamamoto has joined #openstack-infra | 13:19 | |
*** markvoelker has quit IRC | 13:24 | |
*** rh-jelabarre has quit IRC | 13:27 | |
*** jpena|lunch is now known as jpena | 13:31 | |
*** kaiokmo has joined #openstack-infra | 13:31 | |
*** e0ne has joined #openstack-infra | 13:31 | |
Dobroslaw | Hello again zuul masters | 13:32 |
Dobroslaw | what I want: in the `release` step, create a docker image with a tag containing the release version | 13:32 |
Dobroslaw | question: does zuul create some env variable on the machine when creating a new release so that I could catch it with a bash script? | 13:32 |
Dobroslaw | or is there any other way for getting this value? | 13:32 |
Dobroslaw | I can't find anything useful in docs or zuul code | 13:32 |
pabelanger | Dobroslaw: you can look for zuul.tag in the inventory | 13:33 |
pabelanger | then check zuul.pipeline | 13:34 |
pabelanger | to know you are in the release pipeline | 13:34 |
*** janki has joined #openstack-infra | 13:35 | |
Dobroslaw | pabelanger: something like this?: https://github.com/openstack/kolla/blob/master/tests/templates/kolla-build.conf.j2#L5 | 13:35 |
*** yboaron_ has quit IRC | 13:36 | |
pabelanger | yup | 13:36 |
pabelanger | https://zuul-ci.org/docs/zuul/user/jobs.html#tag-items | 13:36 |
pabelanger | for more info | 13:36 |
*** yboaron_ has joined #openstack-infra | 13:36 | |
Dobroslaw | pabelanger: great, checking, thank you | 13:37 |
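The approach pabelanger describes can be sketched roughly like this in a release job's shell step (the environment variable names here are illustrative, not something Zuul exports itself; in a real job the values would be templated in from `{{ zuul.tag }}` and `{{ zuul.pipeline }}` in the inventory, as in the kolla-build.conf.j2 example linked above):

```shell
# Hypothetical: assume the playbook exported zuul.pipeline and zuul.tag
# into the environment before running this script.
ZUUL_PIPELINE="${ZUUL_PIPELINE:-release}"
ZUUL_TAG="${ZUUL_TAG:-}"

if [ "$ZUUL_PIPELINE" = "release" ] && [ -n "$ZUUL_TAG" ]; then
    IMAGE_TAG="$ZUUL_TAG"      # e.g. a git tag such as 7.0.1
else
    IMAGE_TAG="latest"         # non-release builds fall back to latest
fi

echo "would build: myimage:${IMAGE_TAG}"
# the real job would then run something like:
#   docker build -t "myimage:${IMAGE_TAG}" .
```

Note that `zuul.tag` only exists for tag-triggered items, hence the guard on both the pipeline and the tag being non-empty.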
*** jamesmcarthur has joined #openstack-infra | 13:48 | |
*** jcoufal has joined #openstack-infra | 13:48 | |
*** jcoufal has quit IRC | 13:50 | |
*** takamatsu has joined #openstack-infra | 13:56 | |
*** zul has quit IRC | 14:00 | |
*** dpawlik has quit IRC | 14:01 | |
*** dpawlik has joined #openstack-infra | 14:03 | |
*** jamesmcarthur has quit IRC | 14:04 | |
*** dpawlik has quit IRC | 14:04 | |
*** bobh has joined #openstack-infra | 14:09 | |
efried | Hey folks, I think I brought this up a few weeks ago, but then lost track of it. | 14:11 |
efried | http://ci-watch.tintri.com/ <== seems to be down. Did we figure out who had been maintaining it, if it was going to be fixed or replaced with something equivalent, etc? | 14:11 |
efried | The nova team (at least) used to get a lot of use out of it. | 14:11 |
mordred | efried: I don't know that we know anything about it. what did it do? | 14:13 |
efried | mordred: It was like a summary table of all the CIs, including 3rd party. You could filter (by things like project, date range, etc) with queryparams. It had nice big green checkmark or red X to indicate whether a particular run passed or failed, with links to the run. | 14:14 |
efried | made it easy to tell at a glance whether a particular CI was really dead (lots of red in a row) etc. | 14:15 |
fungi | it's referenced in this infra spec: https://specs.openstack.org/openstack-infra/infra-specs/specs/deploy-ci-dashboard.html#proposed-change | 14:16 |
*** zul has joined #openstack-infra | 14:16 | |
mordred | gotcha | 14:16 |
mordred | ah - so we have the source code for it at least | 14:17 |
*** roman_g has quit IRC | 14:17 | |
fungi | looks like krtaylor and mmedvede were involved in the drafting of that spec, so maybe they know who was running the poc | 14:17 |
mordred | with the update to config management stuff, it might be an easier task for someone to pick up now | 14:18 |
*** roman_g has joined #openstack-infra | 14:19 | |
fungi | in unrelated news, looks like we're getting a lot of pip install errors in vexxhost-sjc1 so i'm checking out the proxy host now | 14:19 |
ttx | infra-core: we are holding off on releases until the pypi access issue affecting some regions is solved (http://status.openstack.org/elastic-recheck/#1449136) -- if you notice things are working correctly again please let us know ! | 14:19 |
fungi | ttx: the issue in rax-dfw seemed to clear up late yesterday but now we have a problem in another provider i'm just starting to check into | 14:20 |
fungi | mirror.sjc1.vexxhost.o.o seems to be completely unreachable for me | 14:20 |
ttx | hmm, yeah, that graph could be explained by two different issues | 14:20 |
fungi | right, if you tick on the node_provider field in one of the relevant logstash queries you'll see it's a different issue | 14:21 |
*** sthussey has joined #openstack-infra | 14:22 | |
fungi | nova claims the instance is active | 14:24 |
mordred | fungi: anything exciting in the nova console log? | 14:24 |
fungi | console log show is empty | 14:24 |
fungi | nova reboot? | 14:25 |
fungi | er, server reboot | 14:25 |
fungi | unless we want mnaser to see if there's a network-related explanation for why it's unreachable | 14:25 |
fungi | in which case we can turn down that region | 14:26 |
fungi | but we're already running full-out with some ~3 hours to get node assignments in check at this point, so further reduction in capacity is probably not going to help that situation | 14:27 |
fungi | judgement call... i'm going to reboot it via the api and see if i can get any sort of post-mortem from whatever system logs it managed to write (if any) | 14:28 |
mmedvede | fungi: the person who was running the ci-watch.tintri.com poc no longer works there. He left a contact email which is not responding so far. I did deploy the same service on http://ciwatch.mmedvede.net | 14:28 |
fungi | #status log rebooted mirror01.sjc1.vexxhost.openstack.org via api as it seems to have been unreachable since ~02:30z | 14:29 |
*** mriedem has joined #openstack-infra | 14:29 | |
openstackstatus | fungi: finished logging | 14:29 |
fungi | mmedvede: thanks! efried: see mmedvede's comment above | 14:30 |
fungi | A start job is running for Raise ne...k interfaces (2min 5s / 5min 1s) | 14:31 |
fungi | that's not very promising | 14:31 |
efried | mmedvede, fungi: Thanks! | 14:31 |
fungi | if it can't bring up the nic in the next couple minutes, i'll get the region temporarily disabled | 14:31 |
*** eharney has joined #openstack-infra | 14:33 | |
*** dpawlik has joined #openstack-infra | 14:33 | |
*** yboaron_ has quit IRC | 14:35 | |
*** yboaron_ has joined #openstack-infra | 14:35 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Add arbitrary node attributes config option https://review.openstack.org/620691 | 14:36 |
*** fuentess has joined #openstack-infra | 14:37 | |
*** dpawlik has quit IRC | 14:38 | |
openstackgerrit | Jeremy Stanley proposed openstack-infra/project-config master: Temporarily disable vexxhost-sjc1 in nodepool https://review.openstack.org/620924 | 14:38 |
fungi | config-core: ^ | 14:38 |
fungi | i'll see about adding nl03 to the emergency disable list and manually applying that diff while we wait for the change to merge | 14:39 |
mordred | ++ | 14:39 |
pabelanger | +3 | 14:39 |
*** graphene has joined #openstack-infra | 14:40 | |
pabelanger | At one point in time, we quickly discussed the idea of running multiple mirrors, so if one went down we wouldn't have to disable it in nodepool. logan- actually rehashed the discussion in berlin for another reason | 14:41 |
fungi | after editing nodepool.yaml, do i need to reload the config with the nodepool rpc cli? | 14:41 |
pabelanger | fungi: no, it will be live on save | 14:42 |
fungi | perfect | 14:42 |
fungi | #status log temporarily added nl03.o.o to the emergency disable list and manually applied https://review.openstack.org/620924 in advance of it merging | 14:42 |
openstackstatus | fungi: finished logging | 14:42 |
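The manually applied diff amounts to zeroing max-servers for the affected region in the launcher's config; with nl03 in the emergency list puppet won't revert the hand edit, and per pabelanger the launcher picks the file up on save with no reload needed. A sketch of the fragment (pool name and the previous value are illustrative, not the exact production config):

```yaml
# /etc/nodepool/nodepool.yaml (fragment)
providers:
  - name: vexxhost-sjc1
    pools:
      - name: main
        max-servers: 0   # was some nonzero value; 0 drains the region
```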
*** rh-jelabarre has joined #openstack-infra | 14:45 | |
*** jcoufal has joined #openstack-infra | 14:54 | |
*** bhavikdbavishi has joined #openstack-infra | 14:56 | |
*** udesale has joined #openstack-infra | 14:58 | |
*** lpetrut has quit IRC | 14:58 | |
*** xek has joined #openstack-infra | 15:00 | |
*** trown has quit IRC | 15:03 | |
*** trown has joined #openstack-infra | 15:04 | |
fungi | we're down to 0 nodes in the vexxhost-sjc1 main pool now | 15:08 |
*** ykarel is now known as ykarel|away | 15:11 | |
*** ykarel|away has quit IRC | 15:15 | |
*** zul has quit IRC | 15:18 | |
*** hamerins has joined #openstack-infra | 15:23 | |
*** roman_g has quit IRC | 15:24 | |
mnaser | hi | 15:24 |
* mnaser looks | 15:24 | |
mnaser | is it all vms or just mirror, fungi ? | 15:25 |
cmurphy | clarkb: mordred any idea why this failed https://review.openstack.org/602380 http://logs.openstack.org/80/602380/3/gate/infra-puppet-apply-3-ubuntu-trusty/a3a1e0c/job-output.txt.gz#_2018-11-27_21_19_01_265126 and if i recheck is someone around to babysit in case it makes it through? | 15:26 |
mnaser | sigh | 15:26 |
mnaser | ok i know what sgoing on | 15:26 |
mnaser | fungi: it should be back | 15:28 |
*** udesale has quit IRC | 15:29 | |
*** mriedem is now known as mriedem_afk | 15:29 | |
*** udesale has joined #openstack-infra | 15:29 | |
*** hamerins has quit IRC | 15:30 | |
mnaser | config-core: feel free to propose a revert of that patch | 15:30 |
mnaser | the issue should be fixed now | 15:30 |
*** yboaron_ has quit IRC | 15:31 | |
ianychoi | Hello infra team, would some system-config cores kindly review https://review.openstack.org/#/c/620661/ and +A? so many spams (>400 in just the latest ~12 hours..) really hurt my mail box.. | 15:31 |
*** hamerins has joined #openstack-infra | 15:31 |
*** ykarel|away has joined #openstack-infra | 15:32 | |
fungi | mnaser: okay, thanks for finding the issue! | 15:32 |
fungi | i confirm i can ping it now | 15:32 |
fungi | and ssh into it | 15:33 |
*** ykarel|away is now known as ykarel | 15:34 | |
*** panda is now known as panda|pto | 15:34 | |
fungi | i've abandoned the change since it hadn't merged yet, and will unroll the emergency disablement | 15:35 |
mnaser | fungi: thanks | 15:38 |
evrardjp | a non-important patch among my open changes just needs a simple review from someone here: https://review.openstack.org/#/c/619216/ | 15:39 |
*** ahosam has joined #openstack-infra | 15:39 | |
*** ccamacho has quit IRC | 15:39 | |
*** ramishra has quit IRC | 15:39 | |
openstackgerrit | Merged openstack-infra/system-config master: Add kube config to nodepool servers https://review.openstack.org/620755 | 15:44 |
openstackgerrit | Merged openstack-infra/system-config master: Nodepool group no longer hosts zookeeper https://review.openstack.org/620760 | 15:44 |
*** ccamacho has joined #openstack-infra | 15:49 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Asynchronously update node statistics https://review.openstack.org/619589 | 15:50 |
*** quiquell is now known as quiquell|off | 15:55 | |
*** aojea has quit IRC | 15:59 | |
fungi | we're back up running with the original max-servers count in vexxhost-sjc1 | 16:00 |
*** dklyle has joined #openstack-infra | 16:01 | |
*** graphene has quit IRC | 16:03 | |
*** graphene has joined #openstack-infra | 16:04 | |
*** jamesmcarthur has joined #openstack-infra | 16:05 | |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Identify DLRN builds fail intermittently (network errors) https://review.openstack.org/620950 | 16:07 |
*** jamesmcarthur has quit IRC | 16:10 | |
openstackgerrit | Merged openstack-infra/system-config master: Blackhole messages to openstack-ko-owner@l.o.o https://review.openstack.org/620661 | 16:13 |
frickler | mnaser: can you share what the issue was? (just being curious from an ops perspective) | 16:19 |
*** udesale has quit IRC | 16:20 | |
*** mriedem_afk is now known as mriedem | 16:22 | |
mnaser | frickler: an openstack-ansible bug, seems like it didn't cleanly restart neutron-openvswitch-agent on reboot | 16:23 |
*** janki has quit IRC | 16:24 | |
*** boden has quit IRC | 16:25 | |
*** e0ne has quit IRC | 16:26 | |
frickler | mnaser: ah, ok, that certainly causes network issues ;) luckily should be pretty easy to spot. thx | 16:26 |
*** jcoufal has quit IRC | 16:31 | |
*** jcoufal has joined #openstack-infra | 16:31 | |
ssbarnea|rover | anyone looking at zuul , it seems unresponsive | 16:31 |
ssbarnea|rover | http://zuul.openstack.org/status .... not loading. | 16:31 |
*** ykarel is now known as ykarel|away | 16:33 | |
mnaser | loads for me ssbarnea|rover | 16:33 |
mnaser | there is a lot of things in queue so maybe it takes a little while before it comes up | 16:33 |
ssbarnea|rover | mnaser: yep, it did load for me like 1-2 minutes later.... | 16:34 |
*** ccamacho has quit IRC | 16:34 | |
pabelanger | yes, status json file is pretty large, but does load | 16:34 |
*** hamerins has quit IRC | 16:34 | |
pabelanger | wow, tripleo queue is 45hrs | 16:34 |
pabelanger | wonder what is going on there | 16:35 |
*** hamerins has joined #openstack-infra | 16:36 | |
*** boden has joined #openstack-infra | 16:36 | |
ssbarnea|rover | pabelanger: i do have the impression that this was caused by pip : http://status.openstack.org/elastic-recheck/ for which I raised a CR yesterday which was not approved due to risks. | 16:36 |
*** armax has joined #openstack-infra | 16:37 | |
ssbarnea|rover | pabelanger: https://review.openstack.org/#/c/620630/ if you remember well | 16:37 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Support relative priority of node requests https://review.openstack.org/620954 | 16:37 |
pabelanger | ssbarnea|rover: yah, looks like vexxhost had a large impact, but that seems to just be from this morning. Do you know if rax is still having an issue? | 16:38 |
ssbarnea|rover | now, I didn't see the failure rate going down since yesterday, so I suspect the problem is still present. | 16:38 |
*** e0ne has joined #openstack-infra | 16:38 | |
pabelanger | ssbarnea|rover: http://status.openstack.org/elastic-recheck/data/others.html#tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates | 16:39 |
*** roman_g has joined #openstack-infra | 16:39 | |
fungi | pabelanger: rax networking issues cleared up around 1800z yesterday | 16:39 |
pabelanger | looks to be unclassified failures; if we add them to elastic-recheck we'll get a better idea of what is happening too | 16:39 |
pabelanger | fungi: ack | 16:39 |
fungi | ssbarnea|rover: it's not that i pushed back on those changes due to "risks" but that you seem to not understand how extra-index-url works. it won't result in fewer failures (if anything it'll result in more) | 16:40 |
fungi | it adds indexes, all of which will be queried, and then pip will nondeterministically pick one at random to pull packages from if there are duplicate entries | 16:41 |
ssbarnea|rover | i know about its weird implementation, i do not question the downvote ;) | 16:41 |
fungi | it's not a "fallback" mechanism | 16:41 |
mordred | it would be so nice if it was a fallback mechanism | 16:41 |
ssbarnea|rover | but based on my local tests it should make it more reliable | 16:41 |
ssbarnea|rover | if i were not under pressure i would have tried to fix pip | 16:42 |
fungi | it will likely result in ~half of our packages skipping the mirror and pulling from pypi.org anyway | 16:42 |
ssbarnea|rover | fungi: and is that a big issue? | 16:42 |
ssbarnea|rover | the idea is when one of the sources fails the other ones should still be able to serve it , right? | 16:43 |
fungi | yes, that's why we have a local cache in each region. reduces the calls to outside services which, because of the unreliability of the internet, fail at random more often | 16:43 |
fungi | that's not how pip works | 16:43 |
fungi | if the source it decides to pick fails, pip will return an error and exit | 16:44 |
mordred | it never ceases to boggle my mind that this is how it works :) | 16:44 |
ssbarnea|rover | fungi: not if it times out; if it times out, the other source should win. | 16:44 |
ssbarnea|rover | mordred++ :D | 16:45 |
fungi | it doesn't pull from both. it picks one of the entries exclusively | 16:45 |
*** ahosam has quit IRC | 16:46 | |
ssbarnea|rover | fungi: btw, why are we not just using a local http proxy? | 16:46 |
fungi | first it retrieves all the indices, and then it decides which entry from which index will satisfy the stated requirement, and then it tries to retrieve that. and if it gets an error, it fails (and if you have it set to retry, it just retries that same one multiple times) | 16:46 |
fungi | ssbarnea|rover: we _are_ using a local http proxy | 16:46 |
mordred | yeah. that's what the mirrors are | 16:46 |
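The distinction fungi is making can be sketched with pip's configuration. The difference between what the jobs use today and what the rejected change would have added is one line; the hostnames below are invented for illustration, not the real mirror names:

```ini
# Roughly what the configure-mirror role writes: a single index, so
# every package request goes through the regional proxy/cache.
[global]
index-url = https://mirror.example-region.openstack.org/pypi/simple

# What --extra-index-url would have added. pip queries ALL configured
# indexes and picks whichever entry satisfies the requirement -- this
# is not a priority-ordered fallback, so roughly half the fetches
# could bypass the mirror entirely, and an error from the chosen
# source still fails the install.
# extra-index-url = https://pypi.org/simple
```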
*** hamzy has quit IRC | 16:46 | |
fungi | ssbarnea|rover: furthermore, the errors we saw in rackspace's dfw region yesterday (which prompted you to write that change) weren't because the proxy was broken but because networking from that provider region to the pypi.org cdn was broken. bypassing the proxy wouldn't have made any difference there | 16:47 |
pabelanger | in the case of vexxhost this morning, the mirror was down for unrelated pypi reasons. So, other things outside of pip were also affected, eg: apt / rpm. In the past we talked about the idea of standing up a 2nd mirror to load balance requests, maybe we should look back to that and do something like round robin DNS. As I type this, I am also not sure how our validate-hosts roles didn't catch the | 16:47 |
pabelanger | issue and abort the job | 16:47 |
fungi | pabelanger: then the load balancer becomes a single point of failure instead of the proxy | 16:47 |
*** gfidente has quit IRC | 16:48 | |
pabelanger | agreed | 16:48 |
logan- | pabelanger fungi: my thinking is we should handle the mirror selection using a random list of eligible mirrors in the pre-run. select a random one until a health check passes, then use it for the job | 16:48 |
fungi | unless we set up some sort of distributed lb with address takeover anyway | 16:48 |
*** e0ne has quit IRC | 16:48 | |
pabelanger | right, we could make our configure-mirror role a little smarter in that way | 16:49 |
fungi | but that could be an interesting challenge in providers who block multicast | 16:49 |
logan- | there is no need for a lb or spof that way, the zuul pre-run can select a mirror from a list of 1 or more eligibles | 16:49 |
pabelanger | get IP from dns, validate online, then use | 16:49 |
*** trown is now known as trown|lunch | 16:49 | |
pabelanger | that is kinda how validate-host role should work | 16:49 |
fungi | logan-: yeah, if the job node performs the health check, then that at least gets us a solution for situations where one of the servers has died completely, maybe not for when one is having intermittent issues and not the other | 16:50 |
ssbarnea|rover | ... just to recapitulate: in less than 24h we were hit by two different mirrors going down: rax, and later vexxhost. | 16:51 |
fungi | however, most of the time when there's a problem with the mirror/proxy servers it's either a global issue or it's an issue impacting an entire region in a provider, so multiple servers is really only a solution for a fraction of these situations | 16:51 |
fungi | ssbarnea|rover: correct | 16:51 |
pabelanger | Odd, validate-host does not actually check our mirrors | 16:51 |
pabelanger | I thought it did | 16:51 |
ssbarnea|rover | oh,... next time someone tells me that mirrors are reliable I will send them a link to irc logs. | 16:52 |
fungi | ssbarnea|rover: i don't dispute that these problems occurred, i'm saying they were different problems entirely and there's no single solution here | 16:52 |
ssbarnea|rover | does pip respect http_proxy env value or not really? | 16:52 |
fungi | ssbarnea|rover: they weren't "mirror problems" (in both cases they were "network problems") | 16:52 |
pabelanger | ssbarnea|rover: fungi: logan-: At a minimum, I think we could update our pre-run playbook in base to do a health check of both git.o.o and the regional mirror; if either fails, the job will abort and hopefully rerun on another provider | 16:53 |
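A pre-run health check along the lines being discussed could look something like this. The function names are hypothetical and a real implementation would live in the configure-mirror / validate-host roles, with the mirror URL coming from the job's region:

```shell
#!/bin/bash
# Sketch of a pre-run mirror health check (names are illustrative).

check_mirror() {
    local url=$1
    # --max-time bounds the whole transfer so a hung mirror fails fast
    # instead of eating into the job's timeout.
    curl --silent --fail --max-time 10 --output /dev/null "$url"
}

pick_mirror() {
    # Try each candidate in order; print the first healthy one.
    local m
    for m in "$@"; do
        if check_mirror "$m"; then
            echo "$m"
            return 0
        fi
    done
    # No mirror healthy: fail so the job aborts in pre and can be
    # retried on another provider.
    return 1
}
```

A job's pre playbook would then call something like `pick_mirror "http://mirror.$REGION.example.org/"` and abort if it returns non-zero, which is roughly the behavior logan- describes for selecting from a list of eligible mirrors.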
fungi | they were noticed as errors from the mirror servers because that's where jobs were trying to hit the outside through | 16:53 |
ssbarnea|rover | fungi: sure, a mirror can go down for various reasons, but if pip knew how to fall back, this could be one solution covering both outages. | 16:53 |
fungi | that's like saying that you have a "foot problem" because your leg fell off | 16:53 |
fungi | ssbarnea|rover: it wouldn't have solved both outages, no | 16:54 |
ssbarnea|rover | and the irony is that pip would have worked without custom mirror in both cases | 16:54 |
fungi | ssbarnea|rover: yesterday, the inability of rackspace's dfw region to reach pypi was the problem. the only solution would have been to stop running jobs there | 16:55 |
fungi | which we nearly did, but then their network issues in that region cleared up | 16:56 |
fungi | (either a network problem within that region or a problem with the nearest endpoint for the fastly cdn pypi.org uses) | 16:56 |
fungi | if nodes in that region had tried to connect directly to pypi.org the failure rate would have been identical | 16:57 |
fungi | no amount of patching pip or its configuration will solve that | 16:57 |
ssbarnea|rover | fungi: ohh, are you sure? from what i heard this issue was limited to ipv6, and if the nodes had been using ipv4, it would have worked. | 16:59 |
fungi | ssbarnea|rover: yes, an alternative would have been to figure out how to disable ipv6 on our test nodes in that region. also a drastic enough solution that you're not going to just work around it | 17:00 |
ssbarnea|rover | what's the status with vex, is it sorted? i am worried about gate queue which seems to only go up. | 17:01 |
fungi | ssbarnea|rover: it's sorted, yes | 17:01 |
fungi | the gate queue is only going up because tripleo monopolizes it. sorry, i have to point that out. complaints from the team who consume most of our resources are not really making this a fun thing to spend my valuable time maintaining today | 17:02 |
ssbarnea|rover | the gate queue is still at 1.9, didn't see it going down at all. | 17:02 |
ssbarnea|rover | 1.9days, not hours :D | 17:02 |
fungi | i think you mean tripleo's gate queue | 17:03 |
ssbarnea|rover | yeah | 17:03 |
*** eernst has joined #openstack-infra | 17:03 | |
fungi | yeah | 17:03 |
*** david-lyle has joined #openstack-infra | 17:04 | |
*** ginopc has quit IRC | 17:05 | |
*** dklyle has quit IRC | 17:05 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Set relative priority of node requests https://review.openstack.org/615356 | 17:06 |
fungi | ssbarnea|rover: ^ that should help get everyone else's changes moving faster at least | 17:06 |
openstackgerrit | Paul Belanger proposed openstack-infra/project-config master: base-test: Check that regional mirror is online https://review.openstack.org/620961 | 17:08 |
pabelanger | ssbarnea|rover: fungi: logan-: ^is something we can test, to offer some protection to jobs | 17:08 |
*** shardy has quit IRC | 17:09 | |
clarkb | ssbarnea|rover: fungi is right re the proxy, our mirror is a proxy | 17:09 |
clarkb | setting up a different proxy wouldnt help | 17:09 |
*** gyee has joined #openstack-infra | 17:09 | |
clarkb | also we shouldnt set up a transparent proxy as the potential for abuse on that skyrockets | 17:10 |
mordred | pabelanger: I think that would be helpful | 17:10 |
clarkb | this is why we reverse proxy specific services from our mirrors as we do | 17:10 |
pabelanger | mordred: yah, I could have sworn we did that in validate-host before, but seems to be just git.o.o | 17:10 |
pabelanger | also I think we could maybe make validate-hosts take a list of hosts too, and have it validate each of them | 17:11 |
fungi | i'd be good with either solution | 17:11 |
clarkb | we always knew this would be a risk with using a proxy instead of a proper mirror | 17:12 |
clarkb | but pypi's growth just isn't manageable anymore to mirror properly | 17:12 |
clarkb | blame cuda linking | 17:12 |
*** bhavikdbavishi has quit IRC | 17:15 | |
*** bhavikdbavishi has joined #openstack-infra | 17:16 | |
pabelanger | clarkb: https://review.openstack.org/620961/ should help with vexxhost mirror outage this moring, if you'd like to review. We can do a bit of testing before deciding to move it to our base job | 17:19 |
clarkb | pabelanger: I'm not sure it will change anything. We already run bindep and things like devstack in pre | 17:20 |
clarkb | these will look for the distro package mirrors and fail in pre if that is down | 17:20 |
clarkb | we can certainly explicitly check things but I dont expect a major change in job retry behavior | 17:21 |
*** agopi is now known as agopi|food | 17:21 | |
pabelanger | tripleo isn't using devstack, so they don't have coverage. That said, we could just move that check into their jobs | 17:21 |
pabelanger | but I figured, since the mirrors are our infra servers, having the check in base might help protect all jobs | 17:22 |
clarkb | pabelanger: ya we can explicitly check. Maybe this also points to tripleo needing to move stuff into pre? I don't know, I get lost every time I try to trace through a job there | 17:23 |
*** jamesmcarthur has joined #openstack-infra | 17:24 | |
pabelanger | clarkb: yah, that is fair | 17:24 |
*** dpawlik has joined #openstack-infra | 17:25 | |
*** dpawlik has quit IRC | 17:26 | |
clarkb | generally, low-churn bootstrapping steps that are expected to succeed should be in pre | 17:26 |
clarkb | for most of our jobs this means install distro packages is in pre | 17:26 |
*** dpawlik has joined #openstack-infra | 17:26 | |
clarkb | maybe not exclusively, as tripleo is a deployment project and wants to test those steps, but tripleo must have deps itself? | 17:27 |
clarkb | pabelanger: because the next thing we'll run into is mirror is up but we rsynced a bad state and so its broken when you install | 17:28 |
clarkb | then we'll go through all of this again | 17:28 |
*** jpich has quit IRC | 17:29 | |
*** hamerins has quit IRC | 17:29 | |
pabelanger | clarkb: Yah, that was my thought behind adding this check to validate-host; there we only do a traceroute to git.o.o. If that fails, we assume a network issue. We could update it to support a list of hosts, and fail if we cannot traceroute both git.o.o and the mirror | 17:30 |
*** hamerins has joined #openstack-infra | 17:31 | |
*** lujinluo has joined #openstack-infra | 17:33 | |
fungi | i thought we did a ping. the traceroute was more so that we could perform post-mortem analysis on reachability issues | 17:36 |
clarkb | it's all to git.o.o though; that host is no longer a good one actually, since zuul pushes all git state into jobs now | 17:39 |
clarkb | maybe change git.o.o to the in-region mirror and check that host instead | 17:39 |
pabelanger | yah, we could do that too | 17:40 |
pabelanger | I think the idea with git.o.o is to confirm we can route outside the provider network | 17:40 |
pabelanger | so, might be good to also keep that | 17:40 |
fungi | yes, granted it doesn't actually confirm that in rax-dfw since that's where git.o.o resides | 17:42 |
clarkb | in the past all jobs had to talk to that node | 17:42 |
clarkb | so not being able to talk to that load balancer would result in job failure | 17:43 |
clarkb | this is no longer true, but I think that is why it was chosen | 17:43 |
fungi | but having the node try to reach across the internet to some host we expect to always be up can be a good canary to keep | 17:43 |
clarkb | yup | 17:43 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: More strongly recommend the simple reverse proxy deployment https://review.openstack.org/620969 | 17:44 |
fungi | clarkb: is there an "easy" way to export the events data from a logstash query in kibana? | 17:52 |
*** ykarel|away has quit IRC | 17:53 | |
fungi | i have a query with 449 results and don't feel like stitching together 5 pages of copy+paste | 17:53 |
*** jpena is now known as jpena|off | 17:53 | |
fungi | though that's what i did in the end | 17:55 |
clarkb | there should be a csv export option somewhere | 17:56 |
fungi | i found where to adjust the pagination at least | 17:57 |
*** wolverineav has joined #openstack-infra | 17:59 | |
openstackgerrit | Merged openstack-infra/system-config master: docs: add info on generating DS records https://review.openstack.org/619334 | 18:00 |
fungi | config-core: we can add another 80 nodes back to the pool with https://review.openstack.org/619750 | 18:03 |
clarkb | mordred: pabelanger and the actual http GET runs on all of the ansible test nodes not the ansible process on the executor? | 18:03 |
clarkb | fungi: we aren't concerned that it will go back to being unstable? | 18:04 |
fungi | halving our utilization in ovh-bhs1 didn't solve the excessive timeouts there | 18:04 |
clarkb | roger | 18:04 |
fungi | see the comment i just added | 18:04 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Add support for zones in executors https://review.openstack.org/549197 | 18:05 |
*** derekh has quit IRC | 18:06 | |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Identify DLRN builds fail intermittently (network errors) https://review.openstack.org/620950 | 18:06 |
*** trown|lunch is now known as trown | 18:08 | |
pabelanger | clarkb: yah, all nodes should curl the mirror | 18:08 |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Identify DLRN builds fail intermittently (network errors) https://review.openstack.org/620950 | 18:08 |
sshnaidm | can somebody tell why ara-report is from Sep in todays job? http://logs.openstack.org/64/606064/2/gate/tripleo-ci-centos-7-containers-multinode/dfc9a9c/ | 18:09 |
sshnaidm | is some server not time synced? | 18:09 |
sshnaidm | ara-report/2018-09-04 17:54 | 18:09 |
fungi | that's definitely weird | 18:09 |
clarkb | that ara report is generated by the zuul executor I think. Could be that one of them got confused? | 18:10 |
fungi | the timestamps inside the ara report look like they're from today | 18:10 |
pabelanger | ara-report folder will contain a sqlite.db | 18:10 |
corvus | fungi, clarkb: can one of you please obtain an opendev.org cert? | 18:10 |
pabelanger | I think that is created from the executor | 18:10 |
fungi | i think apache is confused | 18:10 |
clarkb | corvus: yes, I can do that | 18:11 |
fungi | ls -l /srv/static/logs/64/606064/2/gate/tripleo-ci-centos-7-containers-multinode/dfc9a9c/ | 18:11 |
fungi | drwxr-xr-x 2 jenkins jenkins 4096 Nov 29 11:30 ara-report | 18:11 |
clarkb | corvus: it will likely require we edit DNS to verify ownership of the domain though now that gdpr means whois is useless | 18:11 |
*** hamzy has joined #openstack-infra | 18:12 | |
pabelanger | lol, jenkins | 18:12 |
corvus | clarkb: the automation should be in place | 18:12 |
clarkb | corvus: cool, I'll push up a change for that when I've got the details from namecheap | 18:12 |
fungi | pabelanger: yeah, we haven't renamed the account on static.o.o | 18:12 |
*** bobh has quit IRC | 18:12 | |
*** bhavikdbavishi has quit IRC | 18:13 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently https://review.openstack.org/610029 | 18:13 |
*** bhavikdbavishi1 has joined #openstack-infra | 18:13 | |
ssbarnea|rover | fungi: clarkb : if you could help with recent CRs on https://review.openstack.org/#/q/project:openstack-infra/elastic-recheck+status:open it would be great | 18:13 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Serve opendev.org website from files.o.o https://review.openstack.org/620979 | 18:14 |
*** dpawlik has quit IRC | 18:14 | |
pabelanger | fungi: sshnaidm: http://logs.openstack.org/64/606064/2/gate/tripleo-ci-centos-7-containers-multinode/dfc9a9c/job-output.txt.gz#_2018-11-29_11_30_44_008080 I think is when we create the directory | 18:14 |
*** david-lyle has quit IRC | 18:15 | |
TheJulia | so the maximum zuul job length is 3 hours? | 18:15 |
*** agopi|food is now known as agopi | 18:15 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently https://review.openstack.org/610029 | 18:15 |
pabelanger | TheJulia: yes | 18:15 |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 18:15 | |
clarkb | corvus: fungi: a one-year cert as per normal? I expect we'll maybe be using free certs in the not too distant future, so that makes sense to me | 18:15 |
corvus | clarkb: ++ | 18:16 |
sshnaidm | pabelanger, so is it apache's fault..? | 18:16 |
fungi | clarkb: yeah, that's what i'd do | 18:16 |
*** wolverineav has quit IRC | 18:16 | |
openstackgerrit | Purnendu Ghosh proposed openstack-infra/project-config master: Create airship-spyglass repo https://review.openstack.org/619493 | 18:17 |
clarkb | sshnaidm: that is fungi's theory | 18:17 |
fungi | sshnaidm: yeah, must be? i can't for the life of me figure out where apache is getting that timestamp: http://paste.openstack.org/show/736434/ | 18:17 |
TheJulia | pabelanger: I guess I should try and see what I can prune out of a grenade job :( | 18:18 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: DNS: replace ip addresses with names https://review.openstack.org/620980 | 18:18 |
*** wolverineav has joined #openstack-infra | 18:18 | |
corvus | TheJulia: for context: the idea was that 3 hours is plenty of headroom for jobs that were designed to be 1 hour long | 18:19 |
*** jistr has quit IRC | 18:19 | |
*** jistr has joined #openstack-infra | 18:19 | |
TheJulia | except grenade has always been 2-2.5 | 18:19 |
*** bhavikdbavishi has quit IRC | 18:19 | |
TheJulia | hit a slow node, and you pass 3 hours | 18:20 |
fungi | always? | 18:20 |
sshnaidm | fungi, yeah, weird..maybe some network storage bug | 18:20 |
TheJulia | as far as I can remember | 18:20 |
fungi | i vaguely remember grenade jobs taking ~1.25 hours when devstack jobs took ~0.75 hours | 18:20 |
TheJulia | You have to keep in mind the whole fake baremetal deployment boot/deploy cycle for ironic adds a chunk of time | 18:20 |
clarkb | corvus: fungi: do we want an alt name of www.opendev.org? | 18:20 |
fungi | sshnaidm: i have a feeling it's something weird in apache's caching layer | 18:20 |
corvus | clarkb: yeah i think so | 18:21 |
*** roman_g has quit IRC | 18:22 | |
fungi | clarkb: yes, that would then be consistent with the one for zuul-ci.org | 18:22 |
fungi | X509v3 Subject Alternative Name: DNS:zuul-ci.org, DNS:www.zuul-ci.org | 18:22 |
fungi | echo|openssl s_client -connect zuul-ci.org:443|openssl x509 -text | 18:23 |
fungi | so "do whatever you did for that domain" | 18:23 |
clarkb | fungi: I don't think I did that domain | 18:23 |
*** jamesmcarthur has quit IRC | 18:24 | |
openstackgerrit | James E. Blair proposed openstack-infra/zone-opendev.org master: Add A(AAA) records for (www.)opendev.org https://review.openstack.org/620982 | 18:25 |
fungi | clarkb: it's in bridge.o.o:~root/certs/2018-03-26/ so one of us did anyway | 18:25 |
fungi | oh, wait, that's git.zuul-ci.org | 18:26 |
fungi | in 2018-01-19 instead | 18:26 |
*** lujinluo has quit IRC | 18:26 | |
corvus | fungi, clarkb: i set "topic:opendev" on all related changes | 18:27 |
fungi | thanks!!! | 18:27 |
clarkb | I'll model it off of the static cert and create a new opendev.org cnf file and generate with that | 18:27 |
openstackgerrit | Merged openstack-infra/project-config master: Revert "Halve ovh-bhs1 max-servers temporarily" https://review.openstack.org/619750 | 18:28 |
fungi | clarkb: i think it adds www automagically as a san, but i could be wrong | 18:29 |
clarkb | fungi: it being openssl? | 18:29 |
fungi | clarkb: namecheap | 18:29 |
fungi | X509v3 Subject Alternative Name: DNS:git.zuul-ci.org, DNS:www.git.zuul-ci.org | 18:30 |
clarkb | oh huh | 18:30 |
fungi | X509v3 Subject Alternative Name: DNS:zuul.openstack.org, DNS:www.zuul.openstack.org | 18:30 |
clarkb | ya same for git.openstack.org. In that case I'll just use our normal process and it should just work (tm) | 18:31 |
clarkb | (I hope) | 18:31 |
clarkb | easy enough if so | 18:31 |
fungi | so if you just ask for a cert for opendev.org they should end up giving you www.opendev.org as a san | 18:31 |
*** hamzy has quit IRC | 18:31 | |
*** diablo_rojo has joined #openstack-infra | 18:31 | |
clarkb | yup | 18:31 |
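For the case where the CA does not add the www name automatically, a CSR can request it explicitly. This is only a sketch, assuming OpenSSL 1.1.1's `-addext` flag (older releases need the SAN spelled out in a config file, which is what the cnf-file approach mentioned above does); file names are illustrative:

```shell
# Generate a key and a CSR that explicitly asks for both names.
openssl req -new -newkey rsa:2048 -nodes \
    -keyout opendev.org.key -out opendev.org.csr \
    -subj "/CN=opendev.org" \
    -addext "subjectAltName=DNS:opendev.org,DNS:www.opendev.org"

# Confirm the SAN made it into the request before submitting it.
openssl req -in opendev.org.csr -noout -text \
    | grep -A1 "Subject Alternative Name"
```

The same `openssl x509 -text` inspection fungi pasted earlier can then be used against the issued cert to confirm the CA kept both names.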
openstackgerrit | Clark Boylan proposed openstack-infra/zone-opendev.org master: Add SSL Cert verification record https://review.openstack.org/620985 | 18:38 |
openstackgerrit | Clark Boylan proposed openstack-infra/zone-opendev.org master: Revert "Add SSL Cert verification record" https://review.openstack.org/620986 | 18:38 |
clarkb | fungi: corvus ^ double check me on that its been years since I hand edited a bind zone file :) | 18:39 |
*** eernst has quit IRC | 18:40 | |
clarkb | I didn't increment the serial | 18:41 |
clarkb | let me fix that real quick | 18:41 |
* fungi was about to point that out ;) | 18:41 | |
fungi | make sure to increment it again in the "revert" too | 18:41 |
corvus | emacs'll do it for you | 18:41 |
fungi | emagic | 18:41 |
*** eernst has joined #openstack-infra | 18:42 | |
openstackgerrit | Clark Boylan proposed openstack-infra/zone-opendev.org master: Add SSL Cert verification record https://review.openstack.org/620985 | 18:43 |
openstackgerrit | Clark Boylan proposed openstack-infra/zone-opendev.org master: Revert "Add SSL Cert verification record" https://review.openstack.org/620986 | 18:43 |
clarkb | corvus: yup got both of them | 18:43 |
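For reference, the kind of hand edit being reviewed looks roughly like the fragment below. This is a hedged sketch only: the record name, token, and serial values are invented, not the actual opendev.org zone contents:

```zone
; Excerpt of a hand-edited zone file. The SOA serial must be bumped
; on every change (including the later revert) or secondaries will
; not pick up the new data.
@   IN  SOA  ns1.opendev.org. hostmaster.opendev.org. (
        2018112902  ; serial -- incremented from 2018112901
        3600        ; refresh
        600         ; retry
        864000      ; expire
        300 )       ; minimum

; Temporary record the CA checks to verify domain control;
; removed again (with another serial bump) once the cert is issued.
_example-validation  IN  TXT  "token-provided-by-the-ca"
```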
fungi | you might want to wip 620986 until you're done, just to be safe | 18:44 |
*** gfidente has joined #openstack-infra | 18:45 | |
clarkb | ++ | 18:45 |
frickler | $ ls -l /usr/local/bin/ara-wsgi-sqlite | 18:45 |
frickler | -rwxr-xr-x 1 root root 1818 Sep 4 17:54 /usr/local/bin/ara-wsgi-sqlite | 18:45 |
*** dklyle has joined #openstack-infra | 18:45 | |
frickler | together with WSGIScriptAliasMatch ^.*/ara-report(?!/ansible.sqlite) /usr/local/bin/ara-wsgi-sqlite | 18:45 |
frickler | is what makes that timestamp | 18:46 |
fungi | frickler: oh! so that's the timestamp of the cgi | 18:46 |
fungi | sshnaidm: ^ mystery solved | 18:46 |
frickler | bit of a weird behaviour of apache I'd say | 18:46 |
sshnaidm | interesting | 18:46 |
clarkb | corvus: fungi: ok I've approved the dns updates so now we wait for that to merge and apply and comodo to notice. I'll get the cert data into hiera/ansiblevars as soon as I have it | 18:49 |
*** gfidente has quit IRC | 18:51 | |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Clarify executor zone documentation https://review.openstack.org/620989 | 18:51 |
*** hamzy has joined #openstack-infra | 18:52 | |
clarkb | ssbarnea|rover: pabelanger out of curiosity, where is the rdo/rpm packaging line vs the pypi line drawn for tripleo testing? is there a general rule there? | 18:52 |
openstackgerrit | Merged openstack-infra/zone-opendev.org master: Add A(AAA) records for (www.)opendev.org https://review.openstack.org/620982 | 18:52 |
*** hamzy_ has joined #openstack-infra | 18:57 | |
openstackgerrit | Merged openstack-infra/zone-opendev.org master: Add SSL Cert verification record https://review.openstack.org/620985 | 18:57 |
*** lpetrut has joined #openstack-infra | 18:58 | |
*** hamzy has quit IRC | 18:58 | |
pabelanger | clarkb: I'm not sure myself, but I believe anything that is openstack (and its dependencies) gets built as an RPM. | 18:59 |
fungi | i think they have some tox jobs too though? | 19:01 |
*** mtreinish has joined #openstack-infra | 19:03 | |
*** wolverineav has quit IRC | 19:04 | |
*** wolverineav has joined #openstack-infra | 19:05 | |
weshay | thanks clarkb! | 19:05 |
weshay | for the elastic recheck reviews | 19:05 |
*** wolverineav has quit IRC | 19:06 | |
clarkb | weshay: np. I'm happy to see people using it for this :) | 19:06 |
*** wolverineav has joined #openstack-infra | 19:06 | |
clarkb | weshay: I figure if mriedem or fungi or some other current root doesn't review in the next day or so I can approve the stack (except maybe the py3 support change since ianw was reviewing that one already and its a bit bigger than adding queries) | 19:06 |
ssbarnea|rover | clarkb: yep, the main rule is to use rpm whenever possible, but there are a few exceptions, like tox testing. | 19:06 |
weshay | k | 19:07 |
*** eernst has quit IRC | 19:07 | |
*** wolverineav has quit IRC | 19:07 | |
*** dklyle has quit IRC | 19:07 | |
mriedem | huh? | 19:07 |
*** wolverineav has joined #openstack-infra | 19:07 | |
clarkb | mriedem: e-r reviews | 19:07 |
ssbarnea|rover | i've seen other workarounds too, but they are usually only temporary, like installing a package from pip until we get an rpm for it. | 19:07 |
ssbarnea|rover | clarkb: exceptions apply only to non-shipping code, like test-related things. anything that ships / installs on production must be rpm based. | 19:08 |
clarkb | ssbarnea|rover: got it, thanks | 19:08 |
*** fuentess has quit IRC | 19:08 | |
*** electrofelix has quit IRC | 19:12 | |
*** eernst has joined #openstack-infra | 19:13 | |
clarkb | ssbarnea|rover: comment on https://review.openstack.org/#/c/620950/3 | 19:16 |
mtreinish | clarkb, fungi: so I've got kind of a random question, do you have a pointer to the script used to build wheels? | 19:17 |
clarkb | mtreinish: ya I'll dig it up | 19:17 |
mtreinish | I got a request to upload wheels for stestr to pypi, and I've been using the old tarball script to upload that to pypi | 19:17 |
mtreinish | and reading the twine docs was not at all helpful | 19:17 |
mtreinish | clarkb: cool, thanks | 19:18 |
*** eernst has quit IRC | 19:18 | |
clarkb | mtreinish: https://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/roles/build-wheels/files/wheel-build.sh | 19:18 |
clarkb | darn I typed that out wrong | 19:19 |
*** wolverineav has quit IRC | 19:19 | |
*** eernst has joined #openstack-infra | 19:19 | |
clarkb | mtreinish: https://git.openstack.org/cgit/openstack-infra/project-config/tree/roles/build-wheels/files/wheel-build.sh | 19:19 |
clarkb | there | 19:19 |
fungi | you can do it in the same command where you also build an sdist, i.e. `python setup.py bdist_wheel sdist` | 19:20 |
*** wolverineav has joined #openstack-infra | 19:20 | |
clarkb | oh wait you are wanting the publish side of wheel building | 19:20 |
clarkb | that is the build all the wheels for mirroring script | 19:20 |
fungi | aha, yes | 19:20 |
fungi | sorry, i too misread | 19:20 |
clarkb | but ya you basically pip wheel ./ or python setup.py bdist_wheel | 19:21 |
fungi | `twine upload dist/*` | 19:21 |
mtreinish | thanks | 19:21 |
mtreinish | hmm, that's what I tried | 19:21 |
*** eernst has quit IRC | 19:21 | |
fungi | if you want to test against the dev pypi, do `twine upload --repository-url https://test.pypi.org/legacy/ dist/*` | 19:21 |
mtreinish | well I've got a bug fix release to push, so I'll give it a try on a fresh tag | 19:21 |
mtreinish | oh, that's good to know | 19:22 |
fungi | before doing that, i also recommend running `twine check dist/*` | 19:22 |
clarkb | corvus: root email says ns2.opendev.org requires a reboot to complete package upgrades. Maybe we should do that once dns is verified with comodo? | 19:22 |
fungi | at the moment i think that only checks the long description to make sure pypi will render it successfully, but in the future the expectation is that will grow additional checks for things like invalid trove classifiers | 19:22 |
clarkb | odd that ns1 wouldn't require it, but then i remember we likely use different base images in the two different clouds | 19:23 |
*** xek has quit IRC | 19:23 | |
ssbarnea|rover | weshay: please read https://review.openstack.org/#/c/620950/3 and add your input, i am not sure if voting:1 should be in or not. | 19:23 |
*** wolverineav has quit IRC | 19:24 | |
*** wolverineav has joined #openstack-infra | 19:24 | |
ssbarnea|rover | mriedem: fungi : i am not sure if you are also aware of the "twine check" command, which proved to be VERY useful as it also lints the readme that goes to pypi and assures it renders well. | 19:24 |
ssbarnea|rover | mtreinish: ^^ this was for you. :) | 19:25 |
clarkb | ssbarnea|rover: the openstack release process uses that command to check things before we make releases | 19:25 |
clarkb | it is indeed quite helpful | 19:25 |
ssbarnea|rover | clarkb: the problem is that most of the projects do not run it as part of their tox targets, so we find out about breakage late in the process. | 19:26 |
ssbarnea|rover | my personal preference is to include "twine check" as part of tox-linters ... as this is what it does, mostly. | 19:26 |
clarkb | ssbarnea|rover: ya, though at least before we try to publish now | 19:26 |
mtreinish | heh, yeah it looks like it will be useful | 19:26 |
fungi | ssbarnea|rover: well, twine check doesn't technically check the readme, it checks the long description field of the built packages (which in our case are embedded copies of a readme) | 19:26 |
mtreinish | but there is a typo in the warning message about a missing dep | 19:27 |
mtreinish | told me to install 'readme_render[md]' but it meant 'readme_renderer[md]' | 19:27 |
ssbarnea|rover | yeah, i know about this. i think I made a PR to fix it, or wanted to. | 19:28 |
fungi | though if you don't use markdown you can ignore that | 19:28 |
fungi | restructuredtext is supported by default | 19:28 |
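A minimal sketch of wiring `twine check` into a lint step, as the tox-linters idea above suggests (the helper name is invented; it just builds the real `twine check` command over whatever built artifacts exist):

```python
import glob
import shutil
import subprocess


def twine_check_cmd(dist_dir="dist"):
    """Build the `twine check` invocation for every built artifact in dist_dir."""
    return ["twine", "check"] + sorted(glob.glob(f"{dist_dir}/*"))


cmd = twine_check_cmd()
if shutil.which("twine") and len(cmd) > 2:
    # twine validates the long description of each sdist/wheel it is given
    subprocess.run(cmd)
else:
    print("nothing to check; would run:", " ".join(cmd))
```

Run after `python setup.py sdist bdist_wheel` (or equivalent) so there is something in `dist/` to validate.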
mtreinish | yeah the README for stestr is rst, but I have other projects that use md so I figured better to have it | 19:28 |
ssbarnea|rover | mtreinish: i had the same impression, better to have it. | 19:29 |
ssbarnea|rover | clarkb: wes replied, on https://review.openstack.org/#/c/620950/ -- you can make a decision. I think in this case it is better to have voting on. | 19:31 |
ssbarnea|rover | to eliminate expected noise | 19:31 |
*** graphene has quit IRC | 19:32 | |
*** graphene has joined #openstack-infra | 19:33 | |
clarkb | ssbarnea|rover: ok I'm ok if we choose voting, just wanted to make sure it was explicit | 19:34 |
ssbarnea|rover | clarkb: now we only need to find someone else to workflow these. slowly we improve the categorization rate. | 19:36 |
*** markvoelker has joined #openstack-infra | 19:36 | |
clarkb | ssbarnea|rover: as I mentioned I'm happy to approve the query changes with my +2 if no one else reviews them today. The py3 port should get more eyes though | 19:36 |
*** jamesmcarthur has joined #openstack-infra | 19:37 | |
ssbarnea|rover | i wonder if we track its value over time, so we can see how it goes. it would be nice to have an alarm: when it goes above x% we start working on it until we bring it to y%. | 19:38 |
openstackgerrit | Jeremy Stanley proposed openstack-infra/zuul-website master: Revert "Add a promotional message banner and events list" https://review.openstack.org/620995 | 19:38 |
ssbarnea|rover | i do find elastic-recheck extremly useful, especially for those doing ruck/rovering. | 19:38 |
fungi | clarkb: i think in the past we've only expected a single +2 on e-r query additions/removals | 19:39 |
*** dpawlik has joined #openstack-infra | 19:39 | |
fungi | mriedem can correct me there | 19:39 |
clarkb | fungi: yes, though typically those reviews were from mriedem and mtreinish who know how to review the changes :) | 19:39 |
clarkb | I'm quite rusty :) | 19:39 |
mriedem | it's one thing if it's a query | 19:40 |
mriedem | these are py3 changes right? | 19:40 |
*** markvoelker has quit IRC | 19:40 | |
clarkb | mriedem: one change is a py3 porting. That one should have multiple reviews. The others are all queries which I've +2'd and can approve if that is what we want | 19:40 |
mriedem | oh ok let me review | 19:42 |
corvus | clarkb: ns2 reboot post comodo wfm | 19:43 |
ssbarnea|rover | mriedem: thanks, i am here to answer your question. the only tricky part was around a few lp modules managed by canonical, which were initially not py3 ready, but they made a new release.... those libraries do not even have a *CI*... :D | 19:43 |
mtreinish | ooh, an updated review on: https://github.com/ansible/ansible/pull/23769 maybe we won't have to carry a local version soon | 19:48 |
fungi | that would be swell | 19:49 |
fungi | latest state just got two shipits. i think you're set? | 19:50 |
*** jamesmcarthur has quit IRC | 19:50 | |
* fungi has no idea how the review process for ansible works | 19:50 | |
mtreinish | nor do I | 19:51 |
ssbarnea|rover | mtreinish: i can help you with a few hints: ping key people on #ansible-devel -- bcoca helped me many times. | 19:51 |
ssbarnea|rover | mtreinish: ok, now you got feedback on it, be sure you address it. | 19:52 |
clarkb | ok ansible + puppet are not running | 19:52 |
*** jamesmcarthur has joined #openstack-infra | 19:52 | |
clarkb | this explains why I'm waiting on dns records for longer than I expected | 19:52 |
clarkb | Failed to discover available identity versions when contacting https://La1.citycloud.com:5000/v3/. Attempting to parse version from URL. | 19:53 |
clarkb | hrm is that a cloud outage? we should've fixed the citycloud per region keystone thing | 19:54 |
clarkb | and I get 502 bad gateway if I try to talk to that url | 19:54 |
mtreinish | ssbarnea|rover: thanks, it might be a while though. I don't have a lot of bandwidth for it right now. It sat idle for a long time and is low on my prio list right now | 19:54 |
clarkb | mordred: ^ any thoughts? | 19:54 |
*** wolverineav has quit IRC | 19:55 | |
ssbarnea|rover | clarkb: mtreinish regarding elastic-recheck I observed that in many cases, before a CR is reviewed, logstash has already recycled the logs, so we should aim to review changes while they are fresh. I aim to review all in 24-48h to avoid this. | 19:56 |
*** wolverineav has joined #openstack-infra | 19:56 | |
*** sshnaidm is now known as sshnaidm|afk | 19:56 | |
*** markvoelker has joined #openstack-infra | 20:00 | |
*** jamesmcarthur has quit IRC | 20:00 | |
mriedem | not sure why specific tests are called out in this https://review.openstack.org/#/c/617579/ | 20:00 |
*** wolverineav has quit IRC | 20:00 | |
clarkb | http://cnstatus.com/?p=4413 maybe the issue is that | 20:00 |
mriedem | as generic ssh failures can hit most of tempest now since it runs with validation in tempest-full jobs | 20:00 |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Categorize ImageNotFoundException on tripleo jobs https://review.openstack.org/620114 | 20:02 |
mordred | clarkb: looking | 20:04 |
*** wolverineav has joined #openstack-infra | 20:04 | |
mordred | clarkb: yeah - I think it's likely that | 20:05 |
mordred | clarkb: sto2.citycloud.com is working | 20:05 |
*** wolverineav has quit IRC | 20:10 | |
mriedem | ssbarnea|rover: https://review.openstack.org/#/c/616578/9 | 20:10 |
ssbarnea|rover | sure, taking care of these now. | 20:11 |
*** slaweq has quit IRC | 20:11 | |
*** irdr has quit IRC | 20:12 | |
*** jtomasek has quit IRC | 20:12 | |
mriedem | clarkb: btw, i haven't seen e-r commenting on failures lately | 20:13 |
mriedem | i wonder if the log index workers are overwhelmed with tripleo console log indexing? | 20:13 |
mriedem | one of the comments when you brought this up as a goal in berlin was that it'd be nice if we had some kind of status page / dashboard for the logstash workers and/or e-r bot to know if it's off the rails | 20:14 |
openstackgerrit | Merged openstack-infra/nodepool master: Add arbitrary node attributes config option https://review.openstack.org/620691 | 20:14 |
clarkb | mriedem: we do sort of have one for the logstash workers. http://grafana.openstack.org/d/T6vSHcSik/zuul-status?orgId=1 the logstash job queue graph there is part of the zuul status | 20:14 |
clarkb | mriedem: it looks like it's keeping up, though individual files in the pipeline may be lagging more than say 20 minutes | 20:15 |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Made elastic-recheck py3 compatible https://review.openstack.org/616578 | 20:15 |
fungi | possible the bot has crashed/hung? | 20:15 |
mriedem | a lot of the time, e-r is dead or something | 20:16 |
mriedem | b/c of a bad query or something like that | 20:16 |
mriedem | although i'd think a bad query would also break the graph | 20:16 |
*** ralonsoh has quit IRC | 20:16 | |
fungi | in the middle of making fried rice, but can take a look once i'm done | 20:17 |
mriedem | oh i didn't know efried was there | 20:17 |
clarkb | mriedem: ya I would expect that to be the case too | 20:17 |
efried | :P | 20:17 |
clarkb | fungi: thanks! | 20:17 |
* clarkb returns to reviewing relative priority support in zuul | 20:17 | |
openstackgerrit | Matt Riedemann proposed openstack-infra/elastic-recheck master: fix tox python3 overrides https://review.openstack.org/605618 | 20:18 |
fungi | my kitchen would be a lot more awesome if efried were running it, i'm sure | 20:18 |
ianw | infra-root: can we look at ansible 2.7.2 install for bridge with -> https://review.openstack.org/#/c/617218/ . the other version didn't get reviews, and the cloud-launcher is still broken. i know we're not rolling out cloud changes, but evidence shows it tends to bitrot easily | 20:18 |
*** eernst has joined #openstack-infra | 20:19 | |
efried | I make a mean fried rice. Though I'm better at curries. | 20:19 |
openstackgerrit | Matt Riedemann proposed openstack-infra/elastic-recheck master: Include query results in graph https://review.openstack.org/260188 | 20:19 |
*** florianf is now known as florianf|afk | 20:19 | |
*** irdr has joined #openstack-infra | 20:19 | |
ianw | infra-root: and actually, now i look at http://grafana.openstack.org/d/qzQ_v2oiz/bridge-runtime?orgId=1&from=now-12h&to=now ... clearly something has gone wrong | 20:20 |
*** e0ne has joined #openstack-infra | 20:20 | |
pabelanger | Nice, didn't know it was hooked up to grafana | 20:21 |
clarkb | ianw: see above, citycloud outage is preventing us from generating inventory | 20:21 |
*** eernst has quit IRC | 20:21 | |
clarkb | ianw: http://cnstatus.com/?p=4413 is the issue I think (sparse on details though) | 20:21 |
ianw | clarkb: oh, oh cool, if we have a reason good :) | 20:21 |
ianw | mordred: if around, would be great if you could review at least the glean bits of https://review.openstack.org/#/q/status:open+topic:fedora29 to enable networkmanager support | 20:23 |
*** eernst has joined #openstack-infra | 20:25 | |
*** jamesmcarthur has joined #openstack-infra | 20:25 | |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Categorize ovs crash bug #1805176 https://review.openstack.org/620105 | 20:26 |
openstack | bug 1805176 in tripleo "tripleo jobs failing to setup bridge: fatal_signal|WARN|terminating with signal 14 (Alarm clock)" [High,Triaged] https://launchpad.net/bugs/1805176 | 20:26 |
clarkb | mriedem: fwiw on the bug that is limited to those specific tests I also left a note that maybe the qa team wants to be involved since many jobs seem to match | 20:26 |
clarkb | ssbarnea|rover: as far as alerting goes, we've generally tried to avoid any semblance of on call, must-react-now type behavior. Instead we present the data so that it can be consumed by individuals as they have time/ability | 20:27 |
clarkb | ssbarnea|rover: so we generate graphs. We could also maybe light a batsignal if thresholds are reached that requires you to "look to the sky" or wherever for that rather than it hitting your laptop/phone | 20:28 |
*** eernst has quit IRC | 20:29 | |
*** jamesmcarthur has quit IRC | 20:29 | |
*** e0ne has quit IRC | 20:30 | |
ssbarnea|rover | clarkb: sure. | 20:30 |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Categorize error mounting image volumes due to libpod bug[1] https://review.openstack.org/619059 | 20:31 |
clarkb | ssbarnea|rover: I think in the past the first (less explicit) signal has been the categorization rate being under like 80% | 20:31 |
clarkb | since the first order of business is tracking the issues, then gaining understanding, then fixing them | 20:31 |
clarkb | we can probably have some metric we call out on the stuff we understand too. Like X failures in a day | 20:32 |
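The threshold idea above can be sketched in a few lines: compute the categorization rate and flag when it drops below the ~80% floor mentioned earlier (function names are illustrative, not part of elastic-recheck):

```python
def categorization_rate(categorized, total):
    """Fraction of observed failures matched by an elastic-recheck query."""
    if total == 0:
        return 1.0  # nothing to categorize counts as fully categorized
    return categorized / total


def needs_attention(categorized, total, floor=0.80):
    """Light the 'batsignal' when the rate drops below the floor."""
    return categorization_rate(categorized, total) < floor


print(needs_attention(70, 100))   # 70% categorized -> True, below the 80% floor
print(needs_attention(95, 100))   # 95% categorized -> False
```

The same shape works for the "X failures in a day" metric: swap the rate for a count and flip the comparison.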
*** slaweq has joined #openstack-infra | 20:32 | |
ssbarnea|rover | clarkb: practical question, this is uncategorized: http://logs.openstack.org/04/618604/1/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/e892675/job-output.txt.gz -- any idea on how to categorize it? | 20:35 |
ssbarnea|rover | i guess i found the string: "PRE-RUN END RESULT_UNREACHABLE" | 20:36 |
clarkb | ssbarnea|rover: I think we already have a query for that one. Its a cloud level issue with using duplicate IPs :/ | 20:36 |
clarkb | however because that ran in pre the job will be retried | 20:36 |
clarkb | hrm not seeing a query for it anymore. maybe it was cleaned up? its a known issue we've engaged rax on. But I'm not sure if they know what causes it or if there is a fix | 20:37 |
ssbarnea|rover | clarkb: i was not able to find any bug or query with "PRE-RUN END RESULT_UNREACHABLE" in it, so maybe this one was missed. I will create a new one, unless someone knows an existing one that can be adapted. | 20:42 |
ssbarnea|rover | https://bugs.launchpad.net/openstack-gate/+bug/1805900 | 20:45 |
openstack | Launchpad bug 1805900 in OpenStack-Gate "PRE-RUN END RESULT_UNREACHABLE" [Undecided,New] - Assigned to Sorin Sbarnea (ssbarnea) | 20:45 |
*** jcoufal has quit IRC | 20:46 | |
fungi | recheck 21279 0.2 14.0 964108 569948 ? Sl Nov16 40:13 /usr/bin/python /usr/local/bin/elastic-recheck /etc/elastic-recheck/elastic-recheck.conf | 20:46 |
fungi | looks like the recheck bot is running, at least | 20:46 |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Identify POST-RUN END RESULT_UNREACHABLE https://review.openstack.org/621004 | 20:50 |
*** wolverineav has joined #openstack-infra | 20:51 | |
*** wolverineav has quit IRC | 20:53 | |
*** wolverineav has joined #openstack-infra | 20:53 | |
*** jamesmcarthur has joined #openstack-infra | 20:53 | |
*** olivierbourdon38 has quit IRC | 20:57 | |
corvus | mordred, clarkb: any news on the ansible issue? | 21:00 |
openstackgerrit | Merged openstack-infra/nodepool master: Asynchronously update node statistics https://review.openstack.org/619589 | 21:00 |
openstackgerrit | Merged openstack-infra/zuul-website master: Revert "Add a promotional message banner and events list" https://review.openstack.org/620995 | 21:00 |
clarkb | corvus: pretty sure its that outage I linked to on citycloud status page. We can disable those regions if it persists | 21:01 |
clarkb | I'm eating lunch now. back in a bit | 21:01 |
corvus | mordred, clarkb: our theory with nodepool is that clouds are unreliable so we handle them disappearing gracefully. but we have the same clouds now blocking our operations whenever there's an error. is there a way to mitigate this, or should we just switch to static inventories? | 21:03 |
corvus | clarkb: this can wait till after lunch :) | 21:03 |
*** hamerins has quit IRC | 21:05 | |
mordred | corvus: I'm honestly torn on that question | 21:05 |
mordred | corvus: there's a config flag that can be given to the inventory to not let cloud errors bomb the whole thing out | 21:05 |
fungi | what other alternatives are there? not update the cached inventory for a given provider if shade hits an error trying to query it? | 21:06 |
mordred | but I think we've been reluctant to set it in the past out of fear we'd silently ignore a chunk of our inventory | 21:06 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Fix updateFromDict overrides https://review.openstack.org/621008 | 21:07 |
mordred | fungi: it's a single inventory and the caching is at the inventory level, so there's not really a way to selectively update or not update portions of the cache | 21:07 |
mordred | we could also switch to static/generated inventories - it's not like the vms we're running in our clouds are terribly dynamic | 21:07 |
mordred | so MOST of the time it's the same servers | 21:07 |
fungi | yeah, i get that would likely entail an overhaul of the dynamic inventory generation | 21:07 |
mordred | but we'd need to think through workflows for that with new server creation and old server deletion | 21:08 |
corvus | mordred: yeah, i'd worry about losing the inventory. i think the biggest problem is we'd have to think about that possibility every time we make a change ("will this role work if half the servers are gone?") | 21:08 |
fungi | perhaps if shade errors aborted inventory generation we could just proceed with the previous cache? | 21:08 |
corvus | previous cache or pre-generation sound better to me | 21:09 |
fungi | as you say, it's the same servers 99.9% of the time anyway | 21:09 |
mordred | fungi: maybe? I'm not sure if we have that possibility - the caching layer is actually handled inside of ansible itself now | 21:09 |
fungi | worst case it goes unnoticed until we need to add or remove a server and notice the inventory's not updating | 21:09 |
corvus | previous cache and pre-generation are functionally the same thing; it sounds like just doing pre-generation would be the simplest way of getting the result | 21:10 |
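The previous-cache / pre-generation approach being settled on here could look roughly like this: attempt generation, atomically replace the inventory file only on success, otherwise serve the last good copy (all names hypothetical):

```python
import json
import os


def refresh_inventory(generate, path):
    """Run `generate` (a callable returning the inventory as a dict); on
    success, atomically replace the file at `path`; on any error, fall back
    to the previously written inventory instead of failing the whole run."""
    try:
        data = generate()
    except Exception:
        # cloud API hiccup: reuse the last successful inventory
        with open(path) as f:
            return json.load(f)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(data, f)
    os.replace(tmp, path)  # atomic rename, so readers never see a partial file
    return data
```

A cron job would call this at the start of each run and simply continue with whatever inventory the function returns.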
*** dklyle has joined #openstack-infra | 21:10 | |
mordred | yeah. I mean - I was thinking generating the inventory and shoving it in to git | 21:10 |
corvus | mordred: oh, i was imagining we just run a script at the start of the cron job, and if it errors, continue anyway... | 21:11 |
fungi | right, only copy the generated inventory into place if the pregen script succeeds, otherwise leave the old one in place | 21:11 |
mordred | corvus: I was thinking basically just update the inventory when we create or delete servers | 21:11 |
mordred | and just remove it from being in the cronjob execution path completely | 21:12 |
corvus | auto-proposed changes to git is interesting | 21:12 |
corvus | i don't object to that, but it sounds like extra work and i don't know what we'd gain (but maybe that's lack of imagination on my part) | 21:13 |
*** dpawlik has quit IRC | 21:13 | |
mordred | corvus: I think the main thing I was thinking we'd gain is less moving parts at runtime - and I wasn't thinking auto-proposed as much as "when you're done running launch-node, run this script and submit a patch to git" ... but you're right, at that point I don't know that there's much benefit to having the data in git - other than visibility of the otherwise hidden data | 21:14 |
mordred | corvus: I'm not sure I'm *actually* advocating that we do that- it's just been a thought in the back of my head when we have inventory issues | 21:15 |
corvus | mordred: yeah. that'd work too. | 21:16 |
clarkb | mordred: ansible handling the caching is why I think we are noticing this now | 21:21 |
fungi | if we ever want more automation in server launching though, that's one more wait-for-a-human step | 21:21 |
clarkb | I mean I'm sure it was an issue before but we just used the last version right? | 21:21 |
fungi | granted, we've already got the "submit a patch to add dns records" step which needs to wait for reviewers | 21:22 |
fungi | unless we decide there's a subdomain we want to run completely off an autogenerated zonefile | 21:23 |
clarkb | another option could be to put all of the mirror nodes in their own ansible cron run thing and stop generating inventories for them in the main control plane run | 21:23 |
clarkb | for the main control plane run we only care about vexxhost and rax today | 21:23 |
clarkb | doesn't fix the issue but limits the scope of it | 21:24 |
*** agopi is now known as agopi|pto | 21:24 | |
clarkb | citynetwork says the issue I found is estimated to be fixed at 2400 CET | 21:25 |
clarkb | which is 35 minutes from now? | 21:25 |
clarkb | or is CET only +1? | 21:25 |
*** agopi|pto has quit IRC | 21:26 | |
*** yamamoto has quit IRC | 21:26 | |
clarkb | mordred: would it be worthwhile to suggest to ansible (or implement for ansible) a fallback to the prior cache? I mean what is the cache actually buying us if we can't use it without hitting the clouds? | 21:26 |
*** dpawlik has joined #openstack-infra | 21:28 | |
corvus | clarkb: i was thinking about splitting, but we have some control-plane nodes in non-rax clouds. i don't think i want to inhibit more of that in the future, so i prefer the idea of making it more robust. | 21:29 |
fungi | clarkb: utc+1 | 21:29 |
clarkb | corvus: ya I'm beginning to think the most generally robust thing would be for ansible caching to act as a cache that doesn't need refreshing every run | 21:30 |
fungi | cest (their dst) is utc+2 | 21:30 |
mordred | I've got a patch locally with a static copy of the inventory that we can look at for sake of argument. I'm 99% sure it's safe to push up - does anyone want to double-check it somewhere private before I do? | 21:30 |
clarkb | corvus: but that is also likely the longest fix time wise | 21:30 |
clarkb | mordred: rax sets passwords, if that is done via metadata that might leak out in the inventory? | 21:31 |
mordred | all the adminPass fields are null | 21:31 |
corvus | i'll give it a look, you want to put it on bridge? | 21:31 |
mordred | they only show it to you the one time in the initial server creation response | 21:31 |
mordred | corvus: /home/mordred/static-inventory.yaml on bridge | 21:31 |
clarkb | mordred: ah | 21:31 |
corvus | that's a restless api | 21:31 |
mordred | also - we could write a friendlier generation script than what I have there - we don't actually use 99% of those variables | 21:32 |
clarkb | if we go the static inventory route do we want to use the machine generated inventory applied against our groups.yaml file? or should we just write an inventory that accommodates both things for human and machine consumption | 21:32 |
clarkb | mordred: ya that | 21:32 |
openstackgerrit | Sean McGinnis proposed openstack-infra/project-config master: Add openstack/arch-design https://review.openstack.org/621012 | 21:33 |
mordred | clarkb: easiest first step is just to have our current groups.yaml plus a really simple file with a server_name: ansible_host: ip_address list | 21:33 |
*** dpawlik has quit IRC | 21:33 | |
mordred | I mean- we could even leave out the ansible_host thing and just rely on dns | 21:33 |
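The simple server_name / ansible_host file mordred describes could be emitted with a few lines of code (the output follows the standard Ansible YAML inventory shape; the host name below is a placeholder):

```python
def render_static_inventory(hosts):
    """Render a minimal Ansible YAML inventory mapping each server name
    to an ansible_host IP, built by hand to avoid extra dependencies."""
    lines = ["all:", "  hosts:"]
    for name, ip in sorted(hosts.items()):
        lines.append(f"    {name}:")
        lines.append(f"      ansible_host: {ip}")
    return "\n".join(lines) + "\n"


print(render_static_inventory({"mirror01.example.org": "203.0.113.5"}))
```

A launch-node wrapper could call this after server creation and write the result to the checked-in (or on-disk) inventory file.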
*** hamerins has joined #openstack-infra | 21:34 | |
openstackgerrit | Sean McGinnis proposed openstack-infra/project-config master: Add openstack/arch-design https://review.openstack.org/621012 | 21:34 |
*** wolverineav has quit IRC | 21:34 | |
*** wolverineav has joined #openstack-infra | 21:35 | |
fungi | i don't think we want to rely on dns for this | 21:35 |
fungi | multiple instances with the same names, being able to update configuration before dns is in place... | 21:36 |
fungi | also allows things to keep working even if dns won't resolve for unrelated (or related!) reasons | 21:37 |
*** wolverineav has quit IRC | 21:38 | |
*** wolverineav has joined #openstack-infra | 21:38 | |
*** markvoelker has quit IRC | 21:38 | |
*** markvoelker has joined #openstack-infra | 21:38 | |
corvus | is our rax user id at all sensitive? | 21:39 |
*** dklyle has quit IRC | 21:39 | |
clarkb | I don't think so. | 21:40 |
corvus | i kind of doubt it. but that's the only thing i can think to question. | 21:40 |
corvus | the file lgtm. | 21:40 |
*** markvoelker has quit IRC | 21:43 | |
mordred | ok. I've actually got a slimmed down version | 21:43 |
openstackgerrit | Monty Taylor proposed openstack-infra/system-config master: Switch to a static inventory https://review.openstack.org/621031 | 21:44 |
*** rlandy is now known as rlandy|biab | 21:44 | |
clarkb | mordred: the location: block there isn't gonna cause ansible to do any lookups that would fail similarly to inventory generation right? (I don't think so) | 21:45 |
clarkb | just double checking that it is info only | 21:45 |
mordred | nope. it's just a piece of metadata from the shade record that I thought might be useful to us as humans looking at a record | 21:45 |
clarkb | ++ | 21:45 |
*** kjackal has quit IRC | 21:46 | |
clarkb | I'm willing to give ^ a go. It will change how we launch new servers, which might be a little weird until we get into the practice of that | 21:46 |
*** dpawlik has joined #openstack-infra | 21:46 | |
mordred | now - if we decided to go this route - we probably want to make a script that generates that file decently - I pulled that one from the json in the ansible inventory cache and then did some transforms on it | 21:46 |
mordred | so consider it 'hand made' | 21:46 |
clarkb | mordred: and maybe we check in the script not the file? | 21:47 |
clarkb | like it could just run the regular inventory generation, and if that failed use the last successful result? | 21:47 |
mordred | well - we should _definitely_ check in the script ... but I think running it regularly as part of runs doesn't gain much value | 21:47 |
*** hamzy_ has quit IRC | 21:47 | |
mordred | if we don't check the inventory in - we should at most just run it after launch-node | 21:47 |
mordred | (if we're not going to be fully dynamic) | 21:48 |
*** jamesmcarthur has quit IRC | 21:48 | |
mordred | but honestly - I still don't know what I think about this :) | 21:48 |
*** jamesmcarthur has joined #openstack-infra | 21:48 | |
clarkb | ya I mostly want to avoid needing to launch node without any inventory (so you only get base server), then push a git change, wait for two people to approve it, then be able to run ansible/puppet/docker on your new server | 21:49 |
clarkb | if we have to do that temporarily for a bit thats fine, and maybe we discover its not that painful | 21:49 |
clarkb | mordred: what does the ansible cache actually cache? | 21:50 |
clarkb | I think understanding ^ may help us formulate a plan too. Like maybe its a matter of using the cache more effectively? | 21:50 |
mordred | clarkb: if you look in ./playbooks/roles/install-ansible/files/inventory_plugins/openstack.py | 21:51 |
mordred | around line 193 | 21:52 |
mordred | that's where the generation sets the cache data | 21:52 |
mordred | clarkb: maybe what we need is to be able to tell if fail_on_errors caused anything to be skipped | 21:53 |
mordred | clarkb: and if so, skip the cache.set step | 21:53 |
mordred | so that we can run during that period on partial data | 21:55 |
mordred | but not cache the partial data, so we're sure to get full data once it comes back? | 21:55 |
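mordred's run-on-partial-data-but-don't-cache-it idea, as a toy sketch (class and method names are invented; a plain dict stands in for ansible's cache plugin):

```python
class InventorySource:
    """Serve partial data when a cloud errors mid-listing, but only
    persist to the cache when every cloud listed cleanly, so the next
    run retries the full listing instead of trusting partial data."""

    def __init__(self, clouds, cache):
        self.clouds = clouds  # mapping: cloud name -> callable returning servers
        self.cache = cache    # dict standing in for ansible's cache layer

    def refresh(self):
        servers, had_errors = [], False
        for name, list_servers in self.clouds.items():
            try:
                servers.extend(list_servers())
            except Exception:
                had_errors = True  # keep going: partial inventory this run
        if not had_errors:
            self.cache["inventory"] = servers  # cache only complete listings
        return servers
```

This mirrors the `fail_on_errors`-style tradeoff discussed above: the run proceeds, but nothing incomplete sticks around in the cache.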
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Set relative priority of node requests https://review.openstack.org/615356 | 21:57 |
*** hamerins has quit IRC | 21:57 | |
clarkb | mriedem: fungi looking at the bot logs it seems that the bot is behind on querying console logs? like it's looking for a failure in neutron-grenade from the 26th against today's index so that has no results | 21:57 |
*** hamerins has joined #openstack-infra | 21:57 | |
clarkb | mriedem: fungi: I think this is a bug in how we check for current results in e-r, not an issue with indexing? | 21:57 |
*** rcernin has joined #openstack-infra | 21:57 | |
mriedem | hmm | 21:58 |
clarkb | the data is there in the index from a few days ago. My guess is over time we get further and further behind then we start querying newer indexes for older data and never get results and then at that point are forever behind on the bot side | 21:59 |
clarkb | mordred: ++ | 21:59 |
fungi | oh, yeah i looked at the log but didn't notice the time on those events | 21:59 |
*** markvoelker has joined #openstack-infra | 22:01 | |
*** hamerins has quit IRC | 22:02 | |
*** jamesmcarthur has quit IRC | 22:03 | |
clarkb | basically we have such a long timeout that things in the queue pile up for a relatively short period of time and we'll be backlogged long enough that future queries stop working | 22:04 |
clarkb | it's almost like we want a more global timeout rather than timing out per event | 22:04 |
clarkb | event comes in, mark that time, then if after 20 minutes from then (regardless of how quickly anything before it went) we don't have results, move on | 22:05 |
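clarkb's per-event deadline could be sketched like this: each event carries its own arrival time, and anything older than the deadline is dropped rather than queried, regardless of how long earlier events took (names are invented, not the actual elastic-recheck code):

```python
import time


def process_events(queue, handle, deadline=20 * 60, now=time.monotonic):
    """queue is an iterable of (arrival_time, event) pairs. Events whose
    age exceeds `deadline` seconds are dropped instead of handled, since
    their results have likely aged out of the daily index by then."""
    handled, dropped = [], []
    for arrival, event in queue:
        if now() - arrival > deadline:
            dropped.append(event)   # too old: skip the query entirely
        else:
            handled.append(handle(event))
    return handled, dropped
```

Injecting `now` as a parameter keeps the deadline logic testable without real clock time.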
*** trown is now known as trown|outtypewww | 22:05 | |
*** yamamoto has joined #openstack-infra | 22:06 | |
fungi | i can restart it for now i guess | 22:10 |
*** xek has joined #openstack-infra | 22:11 | |
clarkb | ya that should reset things | 22:11 |
*** manjeets has quit IRC | 22:15 | |
fungi | #status log manually restarted elastic-recheck service on status.openstack.org to clear event backlog | 22:16 |
openstackstatus | fungi: finished logging | 22:16 |
*** yamamoto has quit IRC | 22:18 | |
*** dklyle has joined #openstack-infra | 22:19 | |
*** rlandy|biab is now known as rlandy | 22:21 | |
*** graphene has quit IRC | 22:23 | |
*** dklyle has quit IRC | 22:24 | |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Support relative priority of node requests https://review.openstack.org/620954 | 22:26 |
*** manjeets has joined #openstack-infra | 22:28 | |
*** rh-jelabarre has quit IRC | 22:29 | |
*** manjeets has quit IRC | 22:29 | |
*** manjeets has joined #openstack-infra | 22:29 | |
*** dklyle has joined #openstack-infra | 22:32 | |
*** mriedem is now known as mriedem_afk | 22:33 | |
*** sshnaidm|afk is now known as sshnaidm|off | 22:37 | |
*** agopi has joined #openstack-infra | 22:38 | |
*** slaweq has quit IRC | 22:41 | |
openstackgerrit | Clark Boylan proposed openstack-infra/elastic-recheck master: Better event checking timeouts https://review.openstack.org/621038 | 22:42 |
clarkb | mriedem_afk: fungi ^ something like that should help | 22:42 |
*** tpsilva has quit IRC | 22:43 | |
*** dklyle has quit IRC | 22:44 | |
*** xek has quit IRC | 22:44 | |
clarkb | ssbarnea|rover: ^ you may be interested in that too (though it doesn't affect the dashboard generation of elastic-recheck, just the IRC and gerrit commenting) | 22:45 |
*** jonher has joined #openstack-infra | 22:45 | |
*** dpawlik has quit IRC | 22:47 | |
*** kgiusti has left #openstack-infra | 22:47 | |
*** dpawlik has joined #openstack-infra | 22:48 | |
*** dpawlik has quit IRC | 22:48 | |
*** slaweq has joined #openstack-infra | 22:53 | |
ianw | dmsimard: thanks for review :) i just removed the +w as explained, sorry i should have had it marked as wip as it requires a glean release | 22:53 |
*** rkukura_ has joined #openstack-infra | 22:54 | |
*** rkukura has quit IRC | 22:57 | |
*** rkukura_ is now known as rkukura | 22:57 | |
*** slaweq has quit IRC | 22:57 | |
openstackgerrit | Merged openstack-infra/project-config master: Remove ansible-role-redhat-subscription from central repo https://review.openstack.org/617974 | 22:58 |
openstackgerrit | Merged openstack-infra/project-config master: add jobs to publish library from governance repo https://review.openstack.org/619347 | 22:58 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Revert "Make tripleo-buildimage-overcloud-full-centos-7 non-voting" https://review.openstack.org/620201 | 23:00 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: package-installs: provide for skip from env var https://review.openstack.org/619119 | 23:00 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: simple-init: allow for NetworkManager support https://review.openstack.org/619120 | 23:00 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Revert "Make tripleo-buildimage-overcloud-full-centos-7 non-voting" https://review.openstack.org/620201 | 23:03 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: package-installs: provide for skip from env var https://review.openstack.org/619119 | 23:03 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: simple-init: allow for NetworkManager support https://review.openstack.org/619120 | 23:03 |
*** boden has quit IRC | 23:07 | |
*** lpetrut has quit IRC | 23:07 | |
*** mgutehal_ has joined #openstack-infra | 23:08 | |
*** mgutehall has quit IRC | 23:09 | |
*** agopi has quit IRC | 23:16 | |
*** eernst has joined #openstack-infra | 23:17 | |
*** eernst has quit IRC | 23:22 | |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: OpenStack: count leaked nodes in unmanaged quota https://review.openstack.org/621040 | 23:22 |
*** jamesdenton has quit IRC | 23:23 | |
*** dhellmann_ has joined #openstack-infra | 23:26 | |
*** eernst has joined #openstack-infra | 23:26 | |
*** dhellmann has quit IRC | 23:26 | |
*** eernst has quit IRC | 23:27 | |
*** dhellmann_ is now known as dhellmann | 23:30 | |
*** jamesdenton has joined #openstack-infra | 23:32 | |
ianw | http://grafana.openstack.org/d/qzQ_v2oiz/bridge-runtime?orgId=1&from=1543271530069&to=1543377490234 this does seem to be a consistent increase in bridge runtime from around 2018-11-27 | 23:32 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: OpenStack: store ZK records for launch error nodes https://review.openstack.org/621043 | 23:38 |
*** pbourke has quit IRC | 23:46 | |
*** manjeets has quit IRC | 23:46 | |
*** manjeets has joined #openstack-infra | 23:46 | |
*** pbourke has joined #openstack-infra | 23:47 | |
ianw | looks like the missing 1/2 hour is in here -> http://paste.openstack.org/show/736459/ | 23:48 |
clarkb | puppet unhappy on the arm node? | 23:49 |
ianw | i'm guessing so, logs on host look weird | 23:55 |
ianw | Nov 29 11:30:19 mirror01 puppet-user[31959]: Compiled catalog for mirror01.nrt1.arm64ci.openstack.org in environment production in 4.82 seconds | 23:55 |
ianw | like it starts but then nothing? | 23:55 |
ianw | oh, wow, a lot of stuck processes | 23:55 |
ianw | interesting, attach strace to one and now strace is dead | 23:56 |
ianw | oh dear, i think we have a smoking gun | 23:57 |
ianw | 1582391.992063] Call trace: | 23:57 |
ianw | [1582391.998695] afs_linux_raw_open+0x114/0x158 [openafs] | 23:57 |
ianw | [1582392.008571] osi_UFSOpen+0xa4/0x1d8 [openafs] | 23:57 |
ianw | afs has got stuck | 23:57 |
ianw | i'm rebooting the host | 23:57 |
ianw | actually | 23:58 |
ianw | [1241534.846289] print_req_error: I/O error, dev sdb, sector 314934816 | 23:58 |
ianw | [1241534.854136] Aborting journal on device dm-1-8. | 23:58 |
ianw | [1241534.964702] sd 0:0:0:1: [sdb] tag#79 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE | 23:58 |
ianw | seems it's not happy in multiple ways | 23:58 |
clarkb | ouch that is a cinder volume? | 23:59 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!