Friday, 2018-11-30

ianw	it would be. i'm waiting but i think there's a chance this reboot will not happen :/	00:00
clarkb	now I'm trying to figure out in my head what happens if mount fails on a specific disk (it should mount it ro or at least attempt to and finish the boot right?)	00:01
ianw	ok, it looks like it's coming up, i'm probably too pessimistic :)	00:02
clarkb	ya I think if it isn't where /boot and other important things hanging under / are mounted then it should come up	00:03
clarkb	it may still have a sad, but should get sshd running	00:03
ianw	ok, it's back	00:04
clarkb	fwiw citynetwork pushed back the fix estimate to 0300CET :/	00:04
clarkb	I guess we should all ponder the static inventory a bit more then maybe plan to convert tomorrow or early next week?	00:04
*** dave-mccowan has quit IRC		00:08
*** mgutehal_ has quit IRC		00:09
*** mgutehall has joined #openstack-infra		00:09
*** wolverineav has quit IRC		00:09
*** flaper87 has quit IRC		00:09
*** wolverineav has joined #openstack-infra		00:10
openstackgerrit	Clint 'SpamapS' Byrum proposed openstack-infra/nodepool master: Amazon EC2 driver https://review.openstack.org/535558	00:10
*** wolverineav has quit IRC		00:12
*** wolverineav has joined #openstack-infra		00:13
openstackgerrit	Merged openstack-infra/elastic-recheck master: Made elastic-recheck py3 compatible https://review.openstack.org/616578	00:13
*** flaper87 has joined #openstack-infra		00:14
*** threestrands has joined #openstack-infra		00:18
openstackgerrit	Adam Coldrick proposed openstack-infra/storyboard master: Fix the stories relation in StoryTag https://review.openstack.org/621045	00:26
openstackgerrit	Adam Coldrick proposed openstack-infra/storyboard master: Add a popularity measurement to tags https://review.openstack.org/621046	00:26
*** tosky has quit IRC		00:27
openstackgerrit	James E. Blair proposed openstack-infra/zuul master: Remove nodeid argument from updateNode https://review.openstack.org/621047	00:32
clarkb	hrm there are a bunch of db migration errors in cinder and nova	00:33
clarkb	I wonder if oslo.db made a release	00:33
ianw	#status log manual reboot of mirror01.nrt1.arm64ci.openstack.org after a lot of i/o failures	00:34
openstackstatus	ianw: finished logging	00:34
*** rh-jelabarre has joined #openstack-infra		00:41
*** yamamoto has joined #openstack-infra		00:48
*** hamzy_ has joined #openstack-infra		01:07
*** sthussey has quit IRC		01:11
*** verdurin has quit IRC		01:33
clarkb	apparently docker 1.12 and newer has a config option to prevent containers from restarting when you restart docker	01:36
clarkb	that is nice	01:36
*** verdurin has joined #openstack-infra		01:38
*** wolverineav has quit IRC		01:38
*** bhavikdbavishi has joined #openstack-infra		01:43
*** yamamoto has quit IRC		01:49
*** dave-mccowan has joined #openstack-infra		01:50
*** gyee has quit IRC		01:52
openstackgerrit	Merged openstack-infra/nodepool master: Support relative priority of node requests https://review.openstack.org/620954	01:58
*** dklyle has joined #openstack-infra		01:59
*** dklyle has quit IRC		02:05
*** mriedem_afk has quit IRC		02:19
*** mrsoul has quit IRC		02:22
*** psachin has joined #openstack-infra		02:39
*** eernst_ has joined #openstack-infra		02:45
*** dave-mccowan has quit IRC		02:46
*** eernst_ has quit IRC		02:50
*** dave-mccowan has joined #openstack-infra		02:56
*** hongbin has joined #openstack-infra		03:02
*** dklyle has joined #openstack-infra		03:11
*** apetrich has quit IRC		03:15
*** dklyle has quit IRC		03:17
openstackgerrit	Merged openstack/diskimage-builder master: Revert "Make tripleo-buildimage-overcloud-full-centos-7 non-voting" https://review.openstack.org/620201	03:18
*** ykarel|away has joined #openstack-infra		03:22
*** rlandy has quit IRC		03:31
*** dave-mccowan has quit IRC		03:33
*** dave-mccowan has joined #openstack-infra		03:35
*** psachin has quit IRC		03:43
openstackgerrit	Merged openstack/diskimage-builder master: Add missing ws separator between words https://review.openstack.org/619169	03:53
*** hongbin has quit IRC		03:56
*** udesale has joined #openstack-infra		04:01
openstackgerrit	Brendan proposed openstack-infra/zuul master: Fix "reverse" Depends-On detection with new Gerrit URL schema https://review.openstack.org/620838	04:03
*** janki has joined #openstack-infra		04:03
*** ramishra has joined #openstack-infra		04:29
*** dave-mccowan has quit IRC		04:30
*** markvoelker has quit IRC		04:32
*** eernst has joined #openstack-infra		04:35
openstackgerrit	Merged openstack-infra/zuul-jobs master: upload-logs-swift: Cleanup temporary directories https://review.openstack.org/592340	04:49
openstackgerrit	Merged openstack-infra/zuul-jobs master: upload-logs-swift: Make indexer more generic https://review.openstack.org/592852	04:50
*** wolverineav has joined #openstack-infra		04:52
*** markvoelker has joined #openstack-infra		05:02
*** eernst has quit IRC		05:04
*** wolverineav has quit IRC		05:15
*** threestrands_ has joined #openstack-infra		05:17
*** threestrands has quit IRC		05:20
*** ykarel|away has quit IRC		05:24
*** pcaruana has quit IRC		05:35
*** ykarel|away has joined #openstack-infra		05:39
*** ykarel|away is now known as ykarel		05:39
*** yamamoto has joined #openstack-infra		05:46
*** yamamoto has quit IRC		05:51
openstackgerrit	Merged openstack-infra/zuul master: Add support for zones in executors https://review.openstack.org/549197	06:00
*** diablo_rojo has quit IRC		06:12
openstackgerrit	Kartikeya Jain proposed openstack/diskimage-builder master: Adding support for SLES 15 in element 'sles' https://review.openstack.org/619186	06:17
*** e0ne has joined #openstack-infra		06:32
*** flaper87 has quit IRC		06:32
*** quiquell|off is now known as quiquell		07:03
*** hamzy__ has joined #openstack-infra		07:11
*** hamzy_ has quit IRC		07:12
*** threestrands_ has quit IRC		07:13
*** apetrich has joined #openstack-infra		07:16
*** stakeda has joined #openstack-infra		07:22
*** pcaruana has joined #openstack-infra		07:22
*** e0ne has quit IRC		07:31
*** slaweq has joined #openstack-infra		07:34
*** florianf|afk is now known as florianf		07:37
*** kjackal has joined #openstack-infra		07:39
*** bhavikdbavishi has quit IRC		07:42
*** ykarel is now known as ykarel|lunch		07:43
openstackgerrit	Merged openstack-infra/zuul master: More strongly recommend the simple reverse proxy deployment https://review.openstack.org/620969	07:56
openstackgerrit	Merged openstack-infra/zuul master: Add gearman stats reference https://review.openstack.org/620192	07:57
*** jpena|off is now known as jpena		08:01
*** ginopc has joined #openstack-infra		08:04
*** bhavikdbavishi has joined #openstack-infra		08:05
*** jtomasek has joined #openstack-infra		08:06
*** rcernin has quit IRC		08:06
*** dpawlik has joined #openstack-infra		08:07
*** ralonsoh has joined #openstack-infra		08:17
*** aojea has joined #openstack-infra		08:18
*** roman_g has joined #openstack-infra		08:22
*** shardy has joined #openstack-infra		08:23
*** shardy has quit IRC		08:24
*** shardy has joined #openstack-infra		08:24
*** yamamoto has joined #openstack-infra		08:30
*** ykarel|lunch is now known as ykarel		08:38
*** xek has joined #openstack-infra		08:38
*** ginopc has quit IRC		08:40
*** tosky has joined #openstack-infra		08:42
*** dpawlik has quit IRC		08:46
*** ccamacho has joined #openstack-infra		08:46
*** jpich has joined #openstack-infra		08:53
*** kjackal has quit IRC		08:58
*** dpawlik has joined #openstack-infra		09:20
frickler	infra-root: does /etc/ansible/hosts/group_vars/all.yaml get edited manually on bridge? I wasn't aware that my acc had been commented out there, just curious when the amount of mail I was receiving was slowly decreasing.	09:21
frickler	I edited it now with a different address that is hopefully going to be bouncing less things, please let me know if you see any	09:22
frickler	Shrews: ^^ you are blocked there too, in case you were wondering	09:23
*** dpawlik has quit IRC		09:24
*** e0ne has joined #openstack-infra		09:27
*** stakeda has quit IRC		09:33
*** derekh has joined #openstack-infra		09:33
openstackgerrit	Ian Wienand proposed openstack-infra/system-config master: bridge.o.o : install ansible 2.7.3 https://review.openstack.org/617218	09:38
*** gfidente has joined #openstack-infra		09:40
*** takamatsu has quit IRC		09:41
*** ginopc has joined #openstack-infra		09:41
openstackgerrit	Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Categorize missing /etc/heat/policy.json https://review.openstack.org/621128	10:01
*** olivierbourdon38 has joined #openstack-infra		10:09
*** tpsilva has joined #openstack-infra		10:31
*** slaweq has quit IRC		10:32
*** ramishra has quit IRC		10:43
*** ramishra has joined #openstack-infra		10:43
*** electrofelix has joined #openstack-infra		10:44
*** pcaruana has quit IRC		10:44
*** quite has left #openstack-infra		10:46
*** pcaruana has joined #openstack-infra		10:50
*** udesale has quit IRC		10:59
*** dtantsur|mtg is now known as dtantsur|afk		11:00
*** rfolco is now known as rfolco_doctor		11:09
*** bhavikdbavishi has quit IRC		11:11
*** ramishra has quit IRC		11:20
*** ramishra has joined #openstack-infra		11:26
*** hwoarang has quit IRC		11:32
*** slaweq has joined #openstack-infra		11:51
openstackgerrit	Tobias Henkel proposed openstack-infra/zuul master: Set relative priority of node requests https://review.openstack.org/615356	11:52
openstackgerrit	Tobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently https://review.openstack.org/610029	12:07
openstackgerrit	Tobias Henkel proposed openstack-infra/zuul master: Remove nodeid argument from updateNode https://review.openstack.org/621047	12:11
*** pcaruana has quit IRC		12:11
*** hwoarang has joined #openstack-infra		12:15
*** lucasagomes is now known as lucas-hungry		12:17
*** lucas-hungry is now known as lucasagomes		12:17
openstackgerrit	Tobias Henkel proposed openstack-infra/zuul master: Report tenant and project specific resource usage stats https://review.openstack.org/616306	12:24
*** slaweq has quit IRC		12:24
*** udesale has joined #openstack-infra		12:26
*** kjackal has joined #openstack-infra		12:26
*** rfolco_doctor is now known as rfolco		12:34
*** eharney has quit IRC		12:34
openstackgerrit	Sorin Sbarnea proposed openstack-infra/elastic-recheck master: devstack: Didn't find service registered by hostname after 60 seconds https://review.openstack.org/621150	12:34
*** Serhii_Rusin has joined #openstack-infra		12:36
*** Serhii_Rusin has quit IRC		12:40
*** jpena is now known as jpena|lunch		12:41
*** kjackal has quit IRC		12:41
*** xek has quit IRC		12:46
Shrews	frickler: yeah. i need to migrate away from gmail	12:46
*** xek has joined #openstack-infra		12:47
*** e0ne has quit IRC		12:54
*** yamamoto has quit IRC		12:59
*** boden has joined #openstack-infra		13:06
*** zul has joined #openstack-infra		13:06
*** yamamoto has joined #openstack-infra		13:15
*** trown|outtypewww has quit IRC		13:17
*** trown|brb has joined #openstack-infra		13:18
*** dave-mccowan has joined #openstack-infra		13:19
pabelanger	frickler: that file should be under control of git, so you should be able to look at history	13:26
pabelanger	if not, then I am not sure	13:26
pabelanger	I don't believe we should have manually edited files on bridge.o.o	13:26
*** annp has quit IRC		13:29
frickler	pabelanger: ah, local git repo, nice. /me can blame mordred now :-D	13:29
pabelanger	frickler: cool, so any changes need to be committed to git there too	13:32
frickler	pabelanger: did that for my earlier change now	13:33
pabelanger	ack	13:35
*** rlandy has joined #openstack-infra		13:36
*** e0ne has joined #openstack-infra		13:37
*** jpena|lunch is now known as jpena		13:40
*** sthussey has joined #openstack-infra		13:40
*** kgiusti has joined #openstack-infra		13:45
openstackgerrit	Merged openstack/diskimage-builder master: Add an element to configure iBFT network interfaces https://review.openstack.org/391787	13:46
*** udesale has quit IRC		13:48
*** udesale has joined #openstack-infra		13:49
*** hwoarang has quit IRC		13:54
*** takamatsu has joined #openstack-infra		13:55
*** slaweq has joined #openstack-infra		13:56
*** jcoufal has joined #openstack-infra		13:57
*** EmilienM is now known as EvilienM		13:58
*** ykarel is now known as ykarel|away		14:07
*** roman_g has quit IRC		14:08
*** roman_g has joined #openstack-infra		14:08
*** hwoarang has joined #openstack-infra		14:12
*** mriedem has joined #openstack-infra		14:19
*** olivierbourdon38 has quit IRC		14:20
*** olivierbourdon38 has joined #openstack-infra		14:20
*** jamesmcarthur has joined #openstack-infra		14:24
ssbarnea|rover	clarkb: are you working on https://review.openstack.org/#/c/621038/1 ? i can fix it myself.	14:25
openstackgerrit	Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Better event checking timeouts https://review.openstack.org/621038	14:28
openstackgerrit	Merged openstack-infra/nodepool master: OpenStack: count leaked nodes in unmanaged quota https://review.openstack.org/621040	14:29
openstackgerrit	Merged openstack-infra/nodepool master: OpenStack: store ZK records for launch error nodes https://review.openstack.org/621043	14:29
*** janki has quit IRC		14:30
pabelanger	clarkb: corvus: mordred: are we thinking of doing nodepool / zuul restarts this morning to pick up new noderequest logic?	14:31
*** ykarel|away has quit IRC		14:33
*** lbragstad is now known as elbragstad		14:36
openstackgerrit	Merged openstack-infra/project-config master: Create airship-spyglass repo https://review.openstack.org/619493	14:37
*** bhavikdbavishi has joined #openstack-infra		14:39
*** eharney has joined #openstack-infra		14:42
openstackgerrit	Merged openstack-infra/project-config master: Add openstack/arch-design https://review.openstack.org/621012	14:44
*** bhavikdbavishi has quit IRC		14:47
*** bhavikdbavishi has joined #openstack-infra		14:47
*** takamatsu has quit IRC		14:48
*** ykarel|away has joined #openstack-infra		14:52
*** ykarel|away is now known as ykarel		14:52
*** dayou_ has joined #openstack-infra		14:59
*** dayou has quit IRC		15:00
*** dayou_ has quit IRC		15:10
*** dayou_ has joined #openstack-infra		15:11
*** chandan_kumar is now known as chkumar|off		15:14
*** jcoufal has quit IRC		15:16
openstackgerrit	Merged openstack-infra/zuul master: Set relative priority of node requests https://review.openstack.org/615356	15:24
*** takamatsu has joined #openstack-infra		15:29
*** jamesmcarthur has quit IRC		15:33
*** bnemec is now known as beekneemech		15:34
*** roman_g has quit IRC		15:36
*** eharney has quit IRC		15:36
*** cdent has joined #openstack-infra		15:37
cdent	Can someone point me to a job that does a webhook post merge? Basically I want to trigger a build on dockerhub after a placement change merges	15:38
*** ramishra has quit IRC		15:38
*** quiquell is now known as quiquell|off		15:39
pabelanger	cdent: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/jobs.yaml#n1128 does it for readthedocs.org	15:42
cdent	pabelanger: awesome, thanks	15:42
pabelanger	np!	15:42
*** dansmith is now known as SteelyDan		15:43
*** therve has joined #openstack-infra		15:44
therve	Hi	15:44
therve	We're getting some issues in the heat gate for the past couple of days	15:45
therve	It would seem there is an issue with ovh, is that a known problem?	15:45
*** jamesmcarthur has joined #openstack-infra		15:46
corvus	pabelanger, clarkb: yes i'd like to restart things today	15:48
*** yamamoto has quit IRC		15:48
*** jamesmcarthur has quit IRC		15:49
*** eernst has joined #openstack-infra		15:49
*** jamesmcarthur has joined #openstack-infra		15:49
*** sthussey has quit IRC		15:50
*** mriedem has quit IRC		15:50
*** eharney has joined #openstack-infra		15:51
*** munimeha1 has joined #openstack-infra		15:51
*** takamatsu has quit IRC		15:53
pabelanger	+1, happy to assist	15:59
openstackgerrit	James E. Blair proposed openstack-infra/puppet-zuul master: Add relative_priority scheduler option https://review.openstack.org/621194	15:59
mordred	cdent: fwiw, you could also just push images built in zuul to dockerhub	16:00
cdent	mordred: yeah, I thought about that too, but the dockerfile being used is "mine", not an openstack thing, so I was exploring the options	16:00
mordred	nod	16:00
cdent	I think it is probably better to make the Dockerfile a real placement thing	16:01
openstackgerrit	James E. Blair proposed openstack-infra/system-config master: Enable relative_priority in zuul https://review.openstack.org/621195	16:01
cdent	mordred: can you point me to an example?	16:01
mordred	it's all good either way - mostly just wanted to mention	16:01
mordred	cdent: yup - one sec	16:01
corvus	pabelanger, mordred: can you review those 2 changes ^ (since they require a scheduler restart, would be good to go ahead and get them in place)	16:01
corvus	gah, bad parent	16:02
openstackgerrit	James E. Blair proposed openstack-infra/system-config master: Enable relative_priority in zuul https://review.openstack.org/621195	16:02
corvus	there we go	16:02
mordred	cdent: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/jobs.yaml#n1260	16:03
pabelanger	+2	16:03
cdent	thanks	16:03
mordred	cdent: also - if you're just publishing images of placement, you might want to check out what the loci folks are doing	16:03
mordred	corvus: +A to both	16:04
cdent	mordred: yeah, on that too	16:05
corvus	given what those changes are intended to do, i'm inclined to direct-enqueue them	16:05
mordred	corvus: ++	16:05
cdent	mordred: what I'm trying to do here is incrementally unwind a bunch of the external side stuff I've built alongside placement before it was extracted	16:06
mordred	cdent: that sounds like a very worthwhile thing to do	16:06
mordred	cdent: while I'm pointing you at 20 things in addition to what you were asking about ... you should check out pbrx :)	16:06
cdent	in this case the image I create is fairly purpose built for testing	16:06
mordred	ah- yes, well in that case neither loci nor pbrx are going to be very helpful :)	16:06
corvus	we're running a 3 for 1 special on answers here today :)	16:07
cdent	pbrx looks pretty interesting, though	16:07
cdent	mordred: the container in question is an example of "just how cdentish can I make this thing". In this case that means zero config in the container and as small as possible and uwsgi instead of apache, etc	16:08
cdent	a lot of which I'm hoping to eventually get back to loci and kolla, but there's only so much time	16:08
mordred	cdent: I'm very much in the 'no config in container' camp	16:09
mordred	cdent: I think putting config into the containers defeats the whole benefit of the containers	16:10
cdent	ayup	16:10
cdent	it's this: https://github.com/cdent/placedock	16:10
mordred	like - a container should contain a single process- and things like config should be mounted in or otherwise provided, and things like apache should be network proxies, etc	16:10
cdent	yup, we sound of like minds	16:11
cdent	(not all that surprising)	16:11
mordred	cdent: if you haven't already, you might want to consider python:alpine as a base image - you can skip your pip and python install steps	16:12
cdent	I've been back and forth on the image a few different times depending on what seems to be working that day	16:12
mordred	:)	16:12
mordred	tell me about it	16:12
cdent	yesterday lots of stuff was not working, so gave up on alpine:edge	16:12
mordred	we've been using python:alpine for the pbrx-built images and I'm fairly happy with it so far- there was one moment where something stopped working that we thought was alpine's fault, but I think it wound up being something else	16:13
cdent	i'll add that to the list, thanks	16:15
*** dayou_ has quit IRC		16:19
*** sthussey has joined #openstack-infra		16:20
*** mriedem has joined #openstack-infra		16:21
*** pcaruana has joined #openstack-infra		16:26
*** gyee has joined #openstack-infra		16:27
fungi	therve: did anyone get back to you on your "issues in the heat gate" specific to ovh yet? can you elaborate? is it a particular error or something general like slow performance leading to job timeouts? is it in both ovh regions (bhs1 and gra1) or only one?	16:27
therve	fungi: No	16:28
therve	fungi: It looks like slow performance	16:28
therve	I haven't checked the region	16:28
therve	http://logs.openstack.org/57/620457/3/check/heat-functional-convg-mysql-lbaasv2/934e143/ is a recent example	16:29
therve	It spent 90mins trying to setup devstack (that's usually our whole runtime)	16:29
*** boden has quit IRC		16:34
*** ccamacho has quit IRC		16:35
*** ccamacho has joined #openstack-infra		16:35
*** adriancz has quit IRC		16:36
clarkb	pabelanger: frickler correct its a local git repo. You edit and commit in place	16:36
fungi	therve: yeah, that looks like ovh-bhs1 which is where we've had a lot of reports of job timeouts. at first i thought it might be because of the 2:1 cpu oversubscription in our dedicated host aggregate for that region but temporarily halving the number of servers we were running didn't move the needle	16:36
*** xek has quit IRC		16:36
clarkb	fungi: therve: I wonder if some of the hypervisors are not using virt?	16:36
openstackgerrit	Monty Taylor proposed openstack-infra/project-config master: Create promstat project https://review.openstack.org/621225	16:37
openstackgerrit	Monty Taylor proposed openstack-infra/project-config master: Add promstat project to Zuul https://review.openstack.org/621226	16:37
corvus	2018-11-30 16:27:30.828209 | primary | manifests/init.pp - WARNING: quoted boolean value found on line 35	16:37
*** yamamoto has joined #openstack-infra		16:37
corvus	okay, so, in puppet, if i want to pass the literal word "true" around.... what? just don't do it?	16:37
fungi	therve: mnaser was just commenting in #openstack-tc about surprisingly slow swapfile creation/preallocation in ovh, so i'm starting to wonder if it could be attributed to disk i/o	16:37
clarkb	fungi: oh interesting	16:38
clarkb	fungi: that said you should totally use fallocate on ext4 and it shouldn't be slow even with bad disk io :P	16:38
therve	fungi: Anecdotally I've found db migrations to be slow	16:38
clarkb	mnaser: ^	16:38
mnaser	clarkb: http://logs.openstack.org/36/619636/1/gate/openstack-ansible-deploy-aio_metal-ubuntu-bionic/72c540f/logs/ara-report/result/f6ed9f8a-419a-41b8-8d81-19d6e5aac6cc/	16:38
*** jamesmcarthur has quit IRC		16:38
mnaser	i mean yes, but you cant fallocate on xfs so really we've only worked around the job	16:39
corvus	therve, fungi, clarkb: wow, a bunch of zuul sql tests have started timing out recently; i wonder if they ran there	16:39
mnaser	5.9 MB/s disk write speed will be bad regardless	16:39
mnaser	if you avoid it in swap, it'll bite you later somewhere else :)	16:39
clarkb	mnaser: our centos7 instances are ext4 not xfs	16:40
mnaser	oh shoot really	16:40
mnaser	TIL	16:40
mnaser	but anyways, still, 6 MB/s disk writes are painful :P	16:40
clarkb	mnaser: yup not saying it fixes the underlying issue. Just pointing out that you likely want to avoid this cost anyway	16:40
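The fallocate point above can be sketched as follows. This is only an illustration under assumptions: the helper name, path, and size are hypothetical, and it shows preallocation alone, not the job's actual swap setup (no mkswap or swapon). Preallocating asks the filesystem to reserve the blocks rather than writing zeros through the disk, so on filesystems that support it the cost does not depend on write throughput. As noted in channel, this does not help for swap on xfs.

```python
import os
import tempfile


def preallocate(path, size_bytes):
    """Reserve size_bytes for path via posix_fallocate; return the file size.

    Hypothetical helper for illustration. Available on Unix only.
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        # Ask the filesystem to allocate the range [0, size_bytes).
        os.posix_fallocate(fd, 0, size_bytes)
    finally:
        os.close(fd)
    return os.path.getsize(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # 64 MiB is an arbitrary example size, not the job's swap size.
        size = preallocate(os.path.join(tmp, "swapfile"), 64 * 1024 * 1024)
        print("allocated bytes:", size)
```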
fungi	corvus: while we were running at max-servers=79 for both bhs1 and gra1 i ran the e-r logstash query for job timeouts and there were 20x as many (no joke) in bhs1 compared to gra1	16:40
mnaser	yeah well we decided to just drop swap entirely	16:40
mnaser	because we rarely ever used it	16:40
mnaser	it was like 1-2MB of swap with some 4.something gig of memory being used by caches/etc	16:41
*** jamesmcarthur has joined #openstack-infra		16:42
openstackgerrit	James E. Blair proposed openstack-infra/puppet-zuul master: Add relative_priority scheduler option https://review.openstack.org/621194	16:42
openstackgerrit	James E. Blair proposed openstack-infra/system-config master: Enable relative_priority in zuul https://review.openstack.org/621195	16:42
corvus	pabelanger: mordred: ^	16:42
*** dayou_ has joined #openstack-infra		16:42
pabelanger	looking	16:44
pabelanger	+3	16:44
clarkb	corvus: is zuul restarted to accept that? or we put config in place first then restart?	16:44
*** bhavikdbavishi has quit IRC		16:45
openstackgerrit	Tobias Urdin proposed openstack-infra/system-config master: Mirror Stein on Ubuntu from Cloud Archive https://review.openstack.org/621231	16:45
clarkb	as for ovh IO, I know on my personal instance I seem capped at 1kiops so lots of really small writes perform poorly but large writes perform just as well as anything else. This seems more extreme than that though	16:45
*** bhavikdbavishi has joined #openstack-infra		16:45
corvus	clarkb: config first then restart, since the scheduler needs a restart to see that, but will ignore extra options.	16:46
fungi	mnaser: looks like the example linked above was also in bhs1, so i'm starting to wonder if disk is waaaay slower there than gra1. could explain the disproportionate number of job timeouts in bhs1 if so	16:46
*** ginopc has quit IRC		16:47
pabelanger	corvus: clarkb: we also need to do nodepool-launcher too right? I can help with that if needed	16:47
clarkb	pabelanger: for it to work yes, I think it will be a noop if zuul sets it until then (but otherwise work as is today)	16:47
corvus	pabelanger: yeah. in fact, do you want to get started on restarting those now?	16:48
corvus	or do we want to restart everything at once?	16:48
*** yamamoto has quit IRC		16:48
pabelanger	I'm fine with both options	16:48
*** udesale has quit IRC		16:49
corvus	i think launcher restarts are only minorly disruptive; i say go ahead and do 'em	16:49
pabelanger	ok	16:49
clarkb	corvus: ++	16:50
clarkb	usually I restart one and make sure its happy then do the others ~5 minutes later	16:50
clarkb	perceived impact is incredibly low	16:50
pabelanger	nl01 doesn't seem to have latest tip of nodepool master, let me see what is happening	16:52
fungi	clarkb: i noticed yesterday you added static.o.o in the emergency disable list. is that still in progress or forgotten?	16:53
clarkb	fungi: semi forgotten. I think we want to get dmsimard wsgi resource increases in place. But I don't think anyone else was reviewing those changes	16:54
* clarkb digs them up		16:54
*** trown|brb is now known as regain		16:55
pabelanger	I think ansible on bridge.o.o is broken, but not 100%. I haven't interacted much with this server	16:55
pabelanger	ERROR! Completely failed to parse inventory source /opt/system-config/inventory/openstack.yaml	16:55
pabelanger	that is in /var/log/ansible/run_all_cron.log	16:55
*** regain is now known as trown		16:55
pabelanger	and looks like 12 hours ago was the last run	16:55
clarkb	fungi: https://review.openstack.org/#/c/616297/	16:55
*** trown is now known as trown|lunch		16:56
clarkb	pabelanger: I think that got fixed because my cert signing request went through (which depended on dns updates)	16:56
corvus	that's the last message in the log	16:56
pabelanger	okay, I might be looking in the wrong log	16:56
clarkb	or maybe it stopped and then started again	16:56
fungi	perhaps citycloud is still busted?	16:57
corvus	clarkb, pabelanger, mordred: http://paste.openstack.org/show/736489/	16:57
pabelanger	thanks, that is what I am seeing	16:57
pabelanger	there was a new release of ansible yesterday, could be related	16:58
clarkb	oh interesting. I wonder if citycloud fixed, then we updated ansible then we broke	16:58
clarkb	ya	16:58
fungi	when it rains it pours	16:58
*** jpena is now known as jpena|off		16:59
*** eharney has quit IRC		16:59
pabelanger	looking to see if ansible upgraded now	16:59
corvus	ansible 2.7.0	16:59
Shrews	pabelanger: corvus: if we're restarting launchers, we'll need to watch those carefully. lots of rather big changes (mostly around caching) have gone in	16:59
*** jpich has quit IRC		17:00
clarkb	fungi: I've discovered that nova takes over the dmi info for instances so you can't really tell if they are qemu or kvm (trying to double check that bhs1 isn't emulated as that could explain io as well as other slowness)	17:00
clarkb	ok systemd-detect-virt says it is kvm proper	17:02
clarkb	fungi: re static I think we should either approve dmsimard's change above or we can safely remove static from the emergency file then watch it for slowness	17:02
clarkb	its safe either way, its just that we think adding the workers helped with the log downloading slowness that was seen a while back	17:03
pabelanger	okay, 2.7.0 seems to be the version we are pinned to in system-config, so don't believe that has changed	17:04
*** pcaruana has quit IRC		17:04
corvus	mordred: can we put in your static inventory stuff asap?	17:05
*** aojea has quit IRC		17:06
fungi	clarkb: only a data point, but our mirror instances in ovh don't show a significant discrepancy in i/o performance. 124 MB/s in gra1 vs 131 MB/s in bhs1 when i tried the same dd as mnaser's job example. could be due to them having a different flavor, or not actually being in the same host aggregate, or maybe only some instances in bhs1 are hitting an i/o tarpit while others land on more performant	17:07
fungi	disk	17:07
clarkb	corvus: mordred fwiw I don't think the static inventory will fix this	17:07
clarkb	this is a different error than what we had with citycloud (that was a url parsing failure due to http 502). This appears to be a python bug with a Proxy object in the openstack plugin not having a servers attribute	17:07
clarkb	possibly openstacksdk updated and not ansible?	17:07
corvus	clarkb: not using the openstack plugin as inventory will fix that problem	17:07
corvus	clarkb: ansible has been broken for two days straight because of two different problems with the openstack inventory plugin.	17:08
clarkb	corvus: it will work around it yes. Mostly pointing it out because mordred is maintainer for that plugin and all of that in ansible. It doesn't work right now. that is probably important info for mordred	17:08
clarkb	for infra we can work around it	17:08
pabelanger	I am not sure what yamlgroup plugin is, is that something we wrote?	17:08
corvus	clarkb: i agree. i'm not the maintainer for that plugin. i just want our systems to work so i can go back to doing my work.	17:09
clarkb	pabelanger: yes it is how we take our host listing and group them into groups	17:09
pabelanger	okay, sorry, not up to speed on recent changes	17:09
*** mriedem is now known as mriedem_lunch		17:11
pabelanger	could it be openstacksdk that is breaking here? there was a new release 19hrs ago	17:11
corvus	pabelanger: almost certainly	17:11
clarkb	pabelanger: yes that is my current hunch	17:11
corvus	feel free to revert or whatever, but i'm working on switching to static inventory	17:11
pabelanger	let me see how we install it	17:12
clarkb	I think for us static inventory is a good move since we have had multiple issues with not static inventory in a couple days. I think for mordred and sdk and ansible it should be clear this isn't just a failing cloud anymore and there is a bug to fix there	17:12
corvus	clarkb: agreed	17:12
pabelanger	okay, so try rolling back to 0.19.0, confirm inventory works. Then, work towards static inventory	17:14
*** eharney has joined #openstack-infra		17:14
clarkb	fungi: on a random ready xenial VM I hopped on in bhs1 3.6MB/s seems to be peak for write sizes of 128bytes and 512 bytes >100MB total	17:16
fungi	granted, if that node was actively running a job then you may not know how much other contention you have for activity on the same node	17:17
clarkb	fungi: ya, though it seems consistent over multiple writes. We can always boot a node external to nodepool if we need better numbers. Mostly this seems to point that io is consistently slow when artificially tested with dd	17:17
pabelanger	http://paste.openstack.org/show/736491/	17:18
pabelanger	that is when openstacksdk was upgraded, then ansible ran 1 more time afterwards	17:18
fungi	clarkb: any chance you can repeat the same experiment on a random gra1 node for comparison?	17:18
pabelanger	going to downgrade it manually now to confirm inventory works again	17:18
clarkb	bumping up the bs to 2048 shows roughly the same throughput (3.4MB/s)	17:19
clarkb	fungi: ya I can do that	17:19
clarkb	that implies to me that this isn't iops caps we are running into. We should've seen difference in throughput with different block sizes (up to the disk block size iirc) if it were just iops	17:20
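The block-size argument above can be sketched as a toy model. This is only an illustration under stated assumptions: `modelled_throughput` and the 7000 IOPS figure are hypothetical, chosen so the 512-byte case lands near the ~3.6MB/s seen on bhs1, and real disks have more limits than these two.

```python
def modelled_throughput(block_size, iops_cap=None, bandwidth_cap=None):
    """Return bytes/second for a toy disk limited by IOPS, bandwidth, or both.

    block_size is in bytes, iops_cap in writes/second, bandwidth_cap in
    bytes/second. At least one cap must be given.
    """
    limits = []
    if iops_cap is not None:
        # An IOPS cap limits the number of writes, so bytes scale with block size.
        limits.append(iops_cap * block_size)
    if bandwidth_cap is not None:
        # A bandwidth cap limits bytes directly, whatever the block size.
        limits.append(bandwidth_cap)
    if not limits:
        raise ValueError("need at least one of iops_cap or bandwidth_cap")
    return min(limits)


# Hypothetical cap picked to match the 512-byte observation.
HYPOTHETICAL_IOPS = 7000

iops_512 = modelled_throughput(512, iops_cap=HYPOTHETICAL_IOPS)
iops_2048 = modelled_throughput(2048, iops_cap=HYPOTHETICAL_IOPS)
print(iops_512 / 1e6, iops_2048 / 1e6)
```

If the limit were purely IOPS, going from 512 to 2048 bytes should roughly quadruple throughput in this model; the measured 3.6MB/s vs 3.4MB/s is flat, which is what a byte-rate limit would look like instead.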
fungi	if disk access is an order of magnitude slower for (at least some) instances in bhs1 than in gra1, we probably have our smoking gun for the difference in timeouts	17:20
openstackgerrit	James E. Blair proposed openstack-infra/system-config master: Switch to static inventory https://review.openstack.org/621247	17:20
corvus	clarkb, pabelanger, fungi, mordred: ^	17:21
corvus	i quickly processed mordred's file from yesterday to include only the attributes he thought important	17:21
*** ykarel is now known as ykarel\|away		17:22
corvus	oh, i wonder if we need ansible_host ?	17:22
*** eharney has quit IRC		17:22
*** shardy has quit IRC		17:22
pabelanger	corvus: clarkb: downgrading to openstacksdk 0.19.0 seems to fix it	17:22
clarkb	corvus: mordred already pushed that change yesterday	17:22
pabelanger	looking at patch now	17:22
* clarkb +2'd it but no one else reviewed it...		17:23
clarkb	https://review.openstack.org/#/c/621031/ if we want to use that change instead	17:23
pabelanger	corvus: +2, how did you generate that by chance?	17:23
corvus	clarkb: approved	17:23
clarkb	fungi: ah ok so I had a minor derp in there I should've used /dev/zero. But ya bhs1 is 16.6MB/s at 512 bs and 9ish MB/s at 2048 bs vs >250MB/s on gra1	17:24
openstackgerrit	Merged openstack-infra/zuul master: Clarify executor zone documentation https://review.openstack.org/620989	17:24
openstackgerrit	James E. Blair proposed openstack-infra/system-config master: Disable openstack inventory plugin https://review.openstack.org/621247	17:25
corvus	clarkb, pabelanger, mordred: ^ that's the other thing from my change that wasn't in mordred's	17:25
clarkb	pabelanger: mordred generated it by taking the ansible openstack plugin output (from the cache?) and filtering out all the extra data we don't need	17:25
fungi	clarkb: thanks. ironically disk writes are ~20x faster on gra1 and jobs time out ~20x more often on bhs1	17:25
clarkb	corvus: oh interesting I wonder if the yaml vs yamlgroup ordering there is important	17:26
corvus	we might have actually needed my change to get past today's bug.	17:26
corvus	clarkb: i don't know, but i re-ordered it to match our actual order assuming it is	17:26
clarkb	corvus: ya, also may need yaml to process before yamlgroup	17:26
pabelanger	clarkb: ack, thanks	17:26
*** efried is now known as fried_rolls		17:27
fungi	corvus: just to confirm, one of the tox-py36 job timeouts i ran across on a zuul change a few minutes ago was indeed in ovh-bhs1 as well. likely related	17:27
pabelanger	so, until bridge.o.o runs again, ansible works with old config. Next run, openstacksdk will update, but think we are protected a little with cached version, which should be enough time for these new patches to land	17:27
corvus	fungi: thanks; i checked a few too, though one of the sql-only timeouts wasn't bhs1 (it was limestone)	17:27
clarkb	amorin: ^ fyi we think we have narrowed down the ovh-bhs1 issues to disk performance. Seeing writes for zeros to disk with dd on the order of 10-15MB/s on bhs1 when the same writes are >250MB/s on gra1	17:28
corvus	pabelanger: well, if we direct-enqueue the patches. otherwise we're going to spend 8 hours just getting to the point where we can begin the day's work.	17:28
pabelanger	++	17:28
corvus	so i will direct-enqueue the static inventory patches now	17:28
clarkb	corvus: pabelanger I think we may want them to go in together so that the config update is in place the first time we run with the static inventory	17:28
corvus	clarkb: i'll do both	17:29
clarkb	(otherwise yamlgroup may not be able to group any hosts into groups and we'll still run ansible in an unproductive manner)	17:29
fungi	i wonder if we should disable ovh-bhs1 for now. it's a 159-node hit to our capacity, but the random timeouts are probably resulting in rechecks and gate resets which waste even more than that	17:29
clarkb	fungi: ya likely	17:29
pabelanger	fungi: actually, 1 sec, can I check something in ovh	17:29
fungi	pabelanger: check whatever you like	17:30
clarkb	I'm going to dig up breakfast then will update things with the opendev cert info	17:30
pabelanger	fungi: okay, thanks. Finished, I wanted to check to see if we had any leaked nodes there, doesn't look like it (unrelated to the current issue).	17:31
pabelanger	+1 to disable if jobs are being affected	17:31
*** weshay is now known as he_hates_me		17:35
openstackgerrit	Merged openstack-infra/puppet-zuul master: Add relative_priority scheduler option https://review.openstack.org/621194	17:35
openstackgerrit	Merged openstack-infra/system-config master: Enable relative_priority in zuul https://review.openstack.org/621195	17:35
*** he_hates_me is now known as weshay		17:36
*** kjackal has joined #openstack-infra		17:36
openstackgerrit	Merged openstack-infra/zuul master: Remove STATE_PENDING https://review.openstack.org/620284	17:38
openstackgerrit	Jeremy Stanley proposed openstack-infra/project-config master: Temporarily disable ovh-bhs1 in nodepool https://review.openstack.org/621250	17:39
openstackgerrit	Jeremy Stanley proposed openstack-infra/project-config master: Revert "Temporarily disable ovh-bhs1 in nodepool" https://review.openstack.org/621251	17:39
openstackgerrit	Merged openstack-infra/system-config master: Bump amount of mod_wsgi processes for static vhosts to 16 https://review.openstack.org/616297	17:42
*** eernst has quit IRC		17:42
*** eernst has joined #openstack-infra		17:43
*** eernst has quit IRC		17:44
*** eernst has joined #openstack-infra		17:44
*** e0ne has quit IRC		17:45
clarkb	corvus: fungi opendev cert and key and friends are all in the usual location on bridge. Do we have hiera/ansible var keys set up for that yet?	17:47
corvus	clarkb: expected key names are here: https://review.openstack.org/620979	17:48
clarkb	thanks	17:48
fungi	aha, i thought i'd already reviewed that but looks like i have not	17:48
pabelanger	it is happening	17:48
fungi	opendev happens	17:49
openstackgerrit	Merged openstack-infra/zone-opendev.org master: Revert "Add SSL Cert verification record" https://review.openstack.org/620986	17:51
*** openstackgerrit has quit IRC		17:51
clarkb	corvus: ok should be in place under those keys in hiera	17:54
corvus	ok approved	17:55
corvus	clarkb: thanks	17:55
*** hamerins has joined #openstack-infra		17:55
*** openstackgerrit has joined #openstack-infra		17:56
openstackgerrit	Merged openstack-infra/system-config master: Switch to a static inventory https://review.openstack.org/621031	17:56
corvus	second change is about 8 minutes out	17:56
clarkb	should we disable the cron on bridge so it doesn't run without second change in ~3 minutes?	17:57
clarkb	(I'm worried our group info won't be correct which could break firewall rules)	17:57
corvus	clarkb: yes	17:57
clarkb	corvus: are you doing that or should I?	17:57
corvus	clarkb: you	17:58
clarkb	#*/15 * * * * flock -n /var/run/ansible/run_all.lock bash /opt/system-config/run_all.sh -c >> /var/log/ansible/run_all_cron.log 2>&1 is in the crontab now	17:59
clarkb	(I also commented out the cloud launcher script as I'm not sure if that one will be happy either)	17:59
*** derekh has quit IRC		18:00
corvus	clarkb, pabelanger: i'm going to afk for 30m	18:00
clarkb	hrm it seems like ansible was running before though?	18:00
clarkb	pabelanger: did you downgrade sdk?	18:01
clarkb	I didn't expect ansible to be working at all, but if someone downgraded the openstacksdk package that might explain it	18:01
corvus	clarkb: i believe pabelanger said he did that	18:02
clarkb	ok	18:02
corvus	17:22 < pabelanger> corvus: clarkb: downgrading to openstacksdk 0.19.0 seems to fix it	18:02
clarkb	thanks	18:02
*** eernst has quit IRC		18:02
*** munimeha1 has quit IRC		18:02
* mordred poking to see if I can figure out what the actual issue is		18:03
*** gfidente has quit IRC		18:04
*** Swami has joined #openstack-infra		18:09
Shrews	mordred: i've been poking as well but not having much luck locally so far	18:09
mordred	Shrews: I've got a script in /root/mttest.py on bridge.openstack.org that exhibits the issue - it can be run with /root/mtvenv/bin/python mttest.py if you wanna look at it	18:10
mordred	something is unhappy with rackspace	18:11
openstackgerrit	Merged openstack-infra/system-config master: Disable openstack inventory plugin https://review.openstack.org/621247	18:11
clarkb	mordred: ^ can you double check that does what we want before we reenable the ansible cron on bridge?	18:12
clarkb	mordred: rax had their old compute api in the catalog for a long time is sdk trying to make sense of it?	18:14
mordred	maybe?	18:14
*** jamesmcarthur has quit IRC		18:15
pabelanger	clarkb: corvus: yes, downgrade was the fix	18:15
mordred	oh. my. dear. god	18:16
*** ykarel\|away has quit IRC		18:17
*** wolverineav has joined #openstack-infra		18:17
* fungi can't wait to hear this one		18:20
fungi	such suspense	18:20
mordred	well - discovery is incorrectly finding the project id as the version when doing discovery	18:20
mordred	so is then not matching version 610275 against v2	18:21
mordred	ultimately it's because rackspace blocks access to the discovery document so we have to fallback to parsing the URL	18:21
mordred	which SHOULD have a provision for stripping the trailing project id from the endpoint: https://dfw.servers.api.rackspacecloud.com/v2/610275	18:21
mordred	but for some reason is not doing that	18:21
amorin	clarkb: that's weird (the bhs1 io issue)	18:22
amorin	I am currently at home and cant test that right now	18:23
amorin	can it wait monday?	18:23
clarkb	amorin: yes I think you should enjoy your weekend	18:23
clarkb	and thank you for checking in!	18:24
amorin	:p	18:24
amorin	that's still weird, we are supposed to have same hardware config with SSD disk	18:24
Shrews	mordred: my that's fun	18:24
amorin	but anyway, I am writing that up in my checklist for monday morning	18:24
clarkb	amorin: thank you!	18:24
fungi	amorin: awesome. i was hesitant to reach out to you until monday anyway since i know it's late already there	18:25
fungi	we think this has been going on at least a week, possibly several, could even have started before the summit	18:26
fungi	so a couple more days aren't going to hurt	18:26
*** diablo_rojo has joined #openstack-infra		18:29
*** kjackal has quit IRC		18:30
*** jamesmcarthur has joined #openstack-infra		18:31
pabelanger	clarkb: corvus: Shrews: okay, seems bridge.o.o got out a pulse and updated nodepool-launcher. Will hold off on restarting until everybody is back	18:32
clarkb	pabelanger: we may also want to prioritize the various inflight tasks. There is nodepool restarts, zuul restart, static inventory switch (and maybe others)	18:33
pabelanger	clarkb: agree, holding off for now	18:34
corvus	clarkb, pabelanger: here	18:36
pabelanger	also here	18:36
clarkb	me three	18:37
corvus	clarkb, pabelanger: so next we should manually update system-config, then uncomment the cron and watch it run?	18:37
clarkb	corvus: ya I think that is a sane next step. Do we want mordred to review the changes to the ansible config too? (since mordred probably groks yamlgroup and friends best?)	18:37
pabelanger	wfm	18:37
mordred	Shrews: ok. I have a 'fix'	18:38
mordred	Shrews: but I'm not 100% sure of the 'right' way to do it	18:38
corvus	mordred: can you review https://review.openstack.org/621247 or should we just muddle along?	18:38
openstackgerrit	Merged openstack-infra/system-config master: Serve opendev.org website from files.o.o https://review.openstack.org/620979	18:38
mordred	corvus: yes - that looks good	18:39
corvus	mordred: thx	18:39
clarkb	corvus: pabelanger also maybe after updating system-config but before uncommenting cron we run the command to list group membership (I forget what it is) and do a kick.sh?	18:39
corvus	clarkb, pabelanger: system-config is updated.	18:39
pabelanger	+1 for kick.sh	18:40
clarkb	I'm looking up the group listing command now	18:40
corvus	google says: ansible localhost -m debug -a 'var=groups'	18:40
mordred	Shrews: remote: https://review.openstack.org/621257 WIP Fix version discovery for rackspace public cloud is the hacky version	18:40
Shrews	mordred: looking	18:40
pabelanger	do we need to delete the inventory cache, or was removing yamlgroup enough?	18:40
corvus	the output of that lgtm	18:40
clarkb	corvus: ansible-playbook --list-hosts zookeeper	18:40
mordred	Shrews: tl;dr - rackspace project ids are integers, so they parse as version numbers	18:40
openstackgerrit	Jeremy Stanley proposed openstack-infra/system-config master: Retire the interop-wg mailing list https://review.openstack.org/619056	18:40
openstackgerrit	Jeremy Stanley proposed openstack-infra/system-config master: Shut down openstack general, dev, ops and sigs mls https://review.openstack.org/621258	18:40
Shrews	mordred: vomit	18:40
*** bhavikdbavishi has quit IRC		18:41
mordred	so when discovery doesn't work there and we fall back to parsing the version from the url, we do so by popping url segments to see if something is a version	18:41
mordred	and we match the project id	18:41
clarkb	hrm that command isn't quite what I expected	18:41
*** ralonsoh has quit IRC		18:41
mordred	Shrews: I think what we actually want to do is pass the project_id along into that function (we do a similar thing in other places) so we can do a test of "does this url segment match the current project id"	18:41
mordred	but that'll take a few seconds to sort out	18:42
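A minimal sketch of the failure mordred describes, and of the "does this url segment match the current project id" check. This is not keystoneauth's code: the regex and both function names are hypothetical simplifications, and only the Rackspace endpoint URL comes from the discussion above.

```python
import re
from urllib.parse import urlparse

# Hypothetical, simplified pattern: matches "v2", "v2.1", or a bare number.
_VERSION_RE = re.compile(r"^v?(\d+)(?:\.(\d+))?$")


def version_from_url_naive(url):
    """Pop path segments from the end; return the first that looks like a version."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    while segments:
        match = _VERSION_RE.match(segments.pop())
        if match:
            return (int(match.group(1)), int(match.group(2) or 0))
    return None


def version_from_url(url, project_id=None):
    """Same walk, but skip a segment that equals the current project id."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    while segments:
        segment = segments.pop()
        if project_id is not None and segment == str(project_id):
            # An all-digit project id would otherwise parse as a version.
            continue
        match = _VERSION_RE.match(segment)
        if match:
            return (int(match.group(1)), int(match.group(2) or 0))
    return None


endpoint = "https://dfw.servers.api.rackspacecloud.com/v2/610275"
print(version_from_url_naive(endpoint))        # project id mistaken for the version
print(version_from_url(endpoint, "610275"))    # project id skipped, v2 found
```

The naive walk reaches the trailing `610275` first and treats it as a major version, so it never matches a request for v2; passing the project id in lets the walk skip that segment and reach `v2`.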
corvus	clarkb: i don't know the answer to your question about caching	18:42
corvus	pabelanger: er i mean yours	18:42
Shrews	mordred: if we have that info, yes, i agree	18:42
clarkb	ok I want ansible --list-hosts $group not ansible-playbook	18:42
clarkb	checking a few groups they look correct to me	18:43
pabelanger	corvus: /var/cache/ansible/inventory should we also delete that now, I wasn't sure if removing yamlgroup was enough	18:43
pabelanger	or maybe that is a mordred question	18:43
corvus	i'll delete it	18:43
mordred	we can delete /var/cache/ansible/inventory - but it shouldn't be being touched by anything	18:43
mordred	so it should be a no-op	18:43
pabelanger	ack	18:44
clarkb	so ya things look good to me. I think next step is a kick.sh then turn cron back on	18:44
corvus	deleted. after doing that ansible localhost -m debug -a 'var=groups' still looks good	18:44
pabelanger	++	18:44
clarkb	ansible --list-hosts $group still looking good too	18:44
corvus	i'll let someone else kick	18:45
clarkb	I'll kick.sh the logstash server since that has firewall rules that matter but if we break them impact is low	18:45
clarkb	that is running now (though I didn't start screen)	18:46
pabelanger	tail -f /var/log/ansible/ansible.log is what I am watching	18:47
clarkb	TASK [base-server : Set ssh key for managment] and TASK [puppet-install : Remove server] changed in the base server playbook	18:47
clarkb	the second I'm not worried about. The first one is a little odd	18:48
clarkb	and then puppet run says changed in the puppet run. But overall this looks sane	18:48
clarkb	ya the puppet changes are fine	18:48
pabelanger	agree, ansible looks to have run properly	18:49
clarkb	and authorized keys don't look wrong	18:49
clarkb	unless there is another server or group people want to run against first I think we can reenable the cron	18:49
corvus	the authorized_keys file contents are the same as on another host	18:49
Shrews	mordred: hrm, we could determine it by place in the url (when split out by '/') too	18:51
mordred	we could - but most of the time there is no project_id there	18:51
Shrews	i think? not sure how standard that url is	18:51
clarkb	pabelanger: corvus: should I reenable the two cron jobs now? I'll wait until I get other confirmation	18:51
mordred	nova made it optional a few years ago	18:51
mordred	and it's now pretty much never there	18:51
corvus	clarkb: ++	18:51
pabelanger	clarkb: ++	18:52
clarkb	done	18:52
clarkb	next run for both is top of the hour	18:52
pabelanger	k, going to fresh coffee	18:52
* Shrews afk for 15m		18:57
pabelanger	back	18:58
*** boden has joined #openstack-infra		18:59
*** electrofelix has quit IRC		18:59
clarkb	I'm watching tail of the run all log file	18:59
clarkb	running now	19:00
clarkb	base server is running now	19:01
clarkb	the thing to watch for here is iptables imo	19:01
clarkb	and if that is happy I think we are good	19:02
*** wolverineav has quit IRC		19:02
openstackgerrit	Jeremy Stanley proposed openstack-dev/cookiecutter master: Update contact address to openstack-discuss ML https://review.openstack.org/621266	19:04
openstackgerrit	Chris Dent proposed openstack-infra/project-config master: Set placement's gate queue to integrated https://review.openstack.org/621267	19:05
*** olivierbourdon38 has quit IRC		19:05
*** wolverineav has joined #openstack-infra		19:06
openstackgerrit	Jeremy Stanley proposed openstack-dev/specs-cookiecutter master: Update contact address to openstack-discuss ML https://review.openstack.org/621269	19:09
mnaser	fungi: glad my random digging into a timeout has resulted in finding a timeout root cause D:	19:10
mordred	infra-root: https://review.openstack.org/621257 is a keystoneauth patch to fix the discovery issue. the openstacksdk update exposed the issue, which has been lurking there all along	19:10
fungi	mnaser: well, we don't know the root cause yet, but yes your report of slow disk writes in ovh was a huuuuge help in turning the corner on that ongoing investigation. thanks!!!	19:10
mnaser	everyone wins D:	19:11
clarkb	ansible still clean from where I'm sitting	19:11
mordred	if we want, we can protect ourselves from this issue between now and a ksa release by putting in compute_endpoint_override settings in our clouds.yaml - or we can ping openstacksdk on the nodepool launchers	19:11
fungi	mnaser: "my jobs run slow and don't complete within the timeout" was a lot harder to track down	19:11
mordred	s/ping/pin/	19:11
corvus	mordred: oh, so we had best not restart the launchers without doing one of those, eh?	19:12
mordred	corvus: yes. it would be bad - we'd lose all rackspace quota	19:12
corvus	mordred: maybe we should put the pin in nodepool?	19:13
tobiash	++	19:13
*** mriedem_lunch is now known as mriedem		19:13
corvus	mordred: maybe you can propose that change since you know what numbers to type?	19:13
mordred	corvus: yup. on it	19:14
clarkb	iptable files being installed now. Lots of Oks and no changed so far	19:15
clarkb	so that looks good	19:15
kmalloc	clarkb, fungi: confirming. gerrit does openID not OIDC (as far as I can tell)	19:16
fungi	kmalloc: i think it's just openid still, yes	19:17
kmalloc	bahg	19:17
kmalloc	ok no problem	19:17
clarkb	iptables has completed and I see nothing amiss there. I think this is working as expected	19:17
fungi	kmalloc: on a related note, have you seen lemonldap-ng?	19:17
kmalloc	working to ensure i have a working example SP for the ipsilon thing	19:17
kmalloc	fungi: no, looking now	19:17
fungi	https://lemonldap-ng.org/	19:17
clarkb	it will also do oauth? whatever google does	19:17
corvus	kmalloc: but related is https://github.com/davido/gerrit-oauth-provider	19:17
kmalloc	oh neat	19:18
fungi	seems to have come out of a project to create a full sso suite for the french government	19:18
kmalloc	corvus: yeah i see	19:18
kmalloc	corvus: i saw that*	19:18
kmalloc	fungi: that is pretty darn cool	19:18
openstackgerrit	Monty Taylor proposed openstack-infra/nodepool master: Block 0.19.0 of openstacksdk https://review.openstack.org/621272	19:18
fungi	kmalloc: it just got packaged in debian this week, which is how it came to my attention	19:18
fungi	and it came up in discussions about replacement sso for debian.org	19:18
* kmalloc nods		19:19
Shrews	mordred: did that show up in 0.19 or 0.20?	19:20
clarkb	now on to git and gerrit servers. I'm going to stop staring at it like a hawk now that base is done and seems to be happy. Also git and gerrit seems to actually just be git and gerrit as expected	19:20
mordred	Shrews: gah. am I stupid?	19:20
* Shrews refrains		19:20
*** jamesmcarthur has quit IRC		19:20
clarkb	pabelanger: ^ see also here :P	19:21
kmalloc	fungi: I'll poke at that thing as well. It is super interesting.	19:21
*** jamesmcarthur has joined #openstack-infra		19:21
openstackgerrit	Monty Taylor proposed openstack-infra/nodepool master: Block 0.20.0 of openstacksdk https://review.openstack.org/621272	19:21
mordred	Shrews: thanks	19:21
fungi	kmalloc: yeah, i haven't delved deeply yet, but it looked like it could be relevant to our situation	19:21
mriedem	clarkb: fwiw i'm seeing e-r commenting on failed changes again	19:24
mriedem	did you tickle it into submission?	19:24
clarkb	mriedem: fungi restarted it to reset the backlog timeout queue	19:24
mriedem	ah	19:24
clarkb	mriedem: I also pushed a change to try and address it directly in the code by tieing the timeout to the event timestamp and not current time	19:25
fungi	also +2'd clarkb's proposed solution	19:25
fungi	but it could stand some more thorough review	19:25
mriedem	i saw but haven't reviewed that yet,	19:25
mriedem	will tab queue it	19:25
clarkb	so once we get behind we won't wait for the whole timeout we'll just check once if the files are there, if yes yay if not move on	19:25
fungi	i'm not as familiar with that codebase as my +2 permissions might imply	19:25
mriedem	maybe al calavicci will let mtreinish come back to help look for a sec too	19:26
*** jamesmcarthur has quit IRC		19:28
*** jamesmcarthur has joined #openstack-infra		19:28
corvus	al calavicci has an impressively detailed wikipedia bio	19:29
*** jamesmcarthur_ has joined #openstack-infra		19:29
clarkb	corvus: pabelanger I've not seen any unexpected behavior from inventory change. I think we can continue with nodepool/zuul work as soon as the sdk thing is handled	19:32
pabelanger	agree	19:32
fungi	some of stockwell's best work, though to me he'll always be wilbur whatley from the dunwich horror	19:33
*** jamesmcarthur has quit IRC		19:33
corvus	handled means that nodepool change lands and we verify that we've downgraded sdk	19:33
corvus	calavicci and cavil are strikingly similar names	19:33
openstackgerrit	David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support https://review.openstack.org/529376	19:34
openstackgerrit	Matt Riedemann proposed openstack-infra/elastic-recheck master: Better event checking timeouts https://review.openstack.org/621038	19:34
openstackgerrit	David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support for RHEL/CentOS https://review.openstack.org/528739	19:35
mtreinish	mriedem: link?	19:36
clarkb	mtreinish: https://review.openstack.org/#/c/621038/	19:37
*** wolverineav has quit IRC		19:37
*** wolverineav has joined #openstack-infra		19:41
*** wolverineav has quit IRC		19:42
*** wolverineav has joined #openstack-infra		19:42
*** wolverineav has quit IRC		19:42
*** wolverineav has joined #openstack-infra		19:43
*** hamerins has quit IRC		19:43
*** wolverineav has quit IRC		19:43
*** wolverineav has joined #openstack-infra		19:43
*** wolverineav has quit IRC		19:44
clarkb	https://www.opendev.org/ and https://opendev.org/ have working ssl now. Just need to add content there	19:44
*** hamerins has joined #openstack-infra		19:44
clarkb	corvus: ^ fyi and thanks!	19:45
pabelanger	awesome!	19:45
fungi	content schmontent	19:45
*** wolverineav has joined #openstack-infra		19:45
openstackgerrit	David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support https://review.openstack.org/529376	19:49
mtreinish	mriedem, clarkb: +2	19:50
mordred	clarkb, fungi, corvus: if you have a sec, mind popping some +A on https://review.openstack.org/#/c/621225/ and https://review.openstack.org/#/c/621226 ?	19:51
openstackgerrit	David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support https://review.openstack.org/529376	19:52
*** wolverineav has quit IRC		19:52
clarkb	mordred: I'm kind of surprised such a thing doesn't exist yet	19:52
*** wolverineav has joined #openstack-infra		19:53
mordred	right?	19:53
mordred	from what I can tell, most people focus on using one or the other in their code and then using an exporter or translation service	19:53
mordred	clarkb: initial code sketch is here: https://review.openstack.org/620990	19:53
corvus	they're vastly different in terms of how you model data; so you need a real abstraction layer if you're going to try to do both	19:54
mordred	yah	19:54
clarkb	mordred: ya in that space and tracing too it seems everyone uses one thing and then has a ton of very specific tooling built around that	19:54
mordred	yup	19:54
mordred	which is great if you're writing a service to run in one and only one place	19:55
corvus	(and even so, the abstraction layer mordred proposes largely helps you think about both at the same time. you still can't just forget about one or the other)	19:55
corvus	at least, that's how it registers in my brain	19:55
mordred	corvus: yah, that's the idea	19:55
mordred	"this is a metric I want to collect. this is how it goes to prometheus. this is how it goes to statsd"	19:56
* mordred waves hands		19:56
Shrews	mordred: if you add a new thing, do you have to change the project name? :)	19:56
clarkb	promstatcchi	19:56
mordred	hah	19:57
openstackgerrit	David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support for RHEL/CentOS https://review.openstack.org/528739	19:58
openstackgerrit	David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support https://review.openstack.org/529376	19:59
openstackgerrit	David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support for RHEL/CentOS https://review.openstack.org/528739	19:59
pabelanger	clarkb: corvus: mordred: given the issues around bridge.o.o today, and tailing ansible.log where did the topic of running ara web on bridge end up? Do people see that actually live on bridge.o.o or some other server hosting the web for that? I know we talked about putting that into trove also.	19:59
*** wolverineav has quit IRC		19:59
dmsimard	pabelanger: last I know we were waiting because bridge.o.o needed to be rebuilt on a larger instance or something along those lines	20:00
clarkb	I think we were going to run it with mostly the plan as is, then if we ever rebuild the bridge we can move it all internal	20:00
clarkb	fungi and ianw were far more on top of that than me though iirc	20:00
pabelanger	okay, yah. I might be able to find a little time to help with that. I was struggling a little today to look at ansible logs on bridge	20:01
fungi	the idea was we could run it on trove temporarily but when we get around to enlarging bridge.o.o we should make sure it's big enough to host its own database for that instead	20:01
pabelanger	assuming we did an audit of ARA for the public	20:01
fungi	s/enlarging/rebuilding/	20:01
dmsimard	just 2 cores and 2gb of ram on bridge.o.o indeed	20:02
dmsimard	is there anything stopping us from resizing it to a larger flavor ?	20:02
corvus	it's not actually large enough to do what it's already doing	20:03
fungi	dmsimard: mainly the lack of a "resize" option in that cloud provider is what's stopping us	20:03
*** rh-jelabarre has quit IRC		20:03
openstackgerrit	Tobias Henkel proposed openstack-infra/zuul master: Report tenant and project specific resource usage stats https://review.openstack.org/616306	20:03
fungi	dmsimard: we need to rebuild it on a larger flavor instead, and i gather it might not be 100% under configuration management yet?	20:03
pabelanger	is that because we think ARA web will need more resources?	20:04
fungi	or maybe it is by now but nobody's tried to build a new one yet outside integration tests	20:04
clarkb	fungi: ya the base server -> running ansible isn't managed yet? mordred is that the case?	20:04
dmsimard	pabelanger: it's not well sized to begin with	20:04
openstackgerrit	Tobias Henkel proposed openstack-infra/zuul master: Report tenant and project specific resource usage stats https://review.openstack.org/616306	20:04
corvus	clarkb: well, the tests certainly suggest it's mostly there :)	20:04
pabelanger	I'm unsure what issues we are hitting today	20:05
*** rh-jelabarre has joined #openstack-infra		20:05
corvus	bootstrapping it may still be a bit tricky. we also need to copy all the non-managed stuff over (passwords, secrets, etc)	20:05
clarkb	pabelanger: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=65004&rra_id=all I think biggest concern is memory	20:05
clarkb	pabelanger: there are spikes that make it appear we should have more breathing room. Particularly to run a mysql	20:06
*** eharney has joined #openstack-infra		20:07
pabelanger	ack	20:07
openstackgerrit	Merged openstack-infra/nodepool master: Block 0.20.0 of openstacksdk https://review.openstack.org/621272	20:08
corvus	i'm going to grab lunch now; feel free to restart launchers after ^ is applied.	20:09
clarkb	I should get lunch too	20:09
pabelanger	clarkb: corvus: okay, I'll start with nl01 first	20:09
pabelanger	and confirm working	20:09
pabelanger	fungi: clarkb: have you seen the zoom flaw making rounds, I think some of the openstack foundation staff are using it for meetings: https://threatpost.com/critical-zoom-flaw-lets-hackers-hijack-conference-meetings/139489/	20:11
*** jistr has quit IRC		20:11
clarkb	I hadn't but thats neat	20:11
clarkb	mrhillsman: ^ uses a bunch of zoom too	20:11
pabelanger	a public exploit in the wild also	20:11
fungi	nifty	20:11
fungi	i've only ever dialled into it from a telephone line, fwiw	20:12
pabelanger	same, figured I'd shared in case you wanted to pass along to presenters	20:12
clarkb	pabelanger: I've done so, thanks!	20:12
fungi	yep, thanks!	20:12
*** jistr has joined #openstack-infra		20:13
*** jamesmcarthur_ has quit IRC		20:14
*** jamesmcarthur has joined #openstack-infra		20:15
pabelanger	clarkb: corvus: 621272 has landed on nl01, and confirmed openstacksdk 0.19.0 is installed	20:18
pabelanger	going to proceed with restart in a moment	20:18
mordred	woot!	20:19
*** xek has joined #openstack-infra		20:19
mordred	clarkb: yeah - what corvus said earlier - bridge itself is mostly managed, but the secrets and passwords file is manual	20:20
*** jamesmcarthur has quit IRC		20:20
mordred	so the bootstrap is ... fun	20:20
pabelanger	restarted	20:20
pabelanger	but there looks to be an issue	20:20
pabelanger	kubernetes.config.config_exception.ConfigException: Invalid kube-config file. Expected object with name default in kube-config/contexts list	20:20
pabelanger	guess we landed a nodepool change?	20:20
tobiash	pabelanger: either that or you landed a kube config	20:21
mordred	pabelanger: yah - we have a test kubernetes in vexxhost and a kube config for it	20:21
openstackgerrit	Merged openstack-infra/project-config master: Create promstat project https://review.openstack.org/621225	20:21
openstackgerrit	Merged openstack-infra/project-config master: Add promstat project to Zuul https://review.openstack.org/621226	20:21
kmalloc	fungi: so looking at lemonldap-ng. it's cool. it is also very very very very perl-ism focused.	20:21
kmalloc	fungi: we might be able to use it for our case over ipsilon	20:21
fungi	yeah, i saw it was perlish	20:22
kmalloc	fungi: it's URL-regex and blob-o-json configured	20:22
fungi	ahh	20:22
kmalloc	so it's per URL, it's a little weird.	20:22
mordred	pabelanger: https://review.openstack.org/#/c/620756/ didn't land yet	20:22
Ng	really? another -ng project? :(	20:22
kmalloc	but i can see it's very feature rich	20:22
kmalloc	Ng: yep.	20:22
* mordred waves to the Ng		20:22
Ng	dangit	20:22
fungi	Ng: i thought you wrote all of those	20:22
Ng	hey mordred	20:22
pabelanger	mordred: yah, I think this is something with default config, I don't see anything in nodepool.yaml	20:22
mordred	pabelanger: https://review.openstack.org/#/c/620755/ added the kube config	20:22
kmalloc	that said, we can totally do something more usable / api driven	20:23
Ng	fdegir: haha	20:23
kmalloc	long term	20:23
pabelanger	oh	20:23
kmalloc	i'll consider ipsilon vs lemonldap for the transition stuff	20:23
fungi	kmalloc: yep, as a canned thing i thought it might at least be an interesting alternative which is actually actively maintained still	20:23
kmalloc	since we have minimal things that need to be protected.	20:23
kmalloc	exactly	20:23
pabelanger	mordred: http://paste.openstack.org/show/736524/ is traceback	20:23
kmalloc	though the concept / mission of ipsilon maps more directly to what we do (even unmaintained)	20:24
Shrews	oh weird. i just liked one of Ng's twitter posts today. crazy coincidence	20:24
pabelanger	mordred: I am going to try and remove the file manually for now to see if nodepool-launcher starts	20:24
*** eernst has joined #openstack-infra		20:24
Ng	Shrews: oh yeah, I saw that. I think I'd forgotten that IFTTT was posting those for me ;)	20:24
pabelanger	mordred: okay, renaming it to config.bak causes nodepool-launcher to start properly	20:25
Shrews	pabelanger: i see an issue	20:28
Shrews	pabelanger: related to tobiash's nodecachelistener	20:28
*** eernst has quit IRC		20:29
pabelanger	Shrews: yes, was just about to say it doesn't look happy	20:29
openstackgerrit	Tobias Henkel proposed openstack-infra/nodepool master: Remove updating stats debug log https://review.openstack.org/621283	20:29
Shrews	http://paste.openstack.org/show/736525/	20:29
Shrews	tobiash: ^^	20:29
pabelanger	yah, we haven't launched a new node yet since restarting	20:30
pabelanger	I believe we should roll back to the previous version that was running	20:30
*** eernst has joined #openstack-infra		20:30
tobiash	Shrews: is that exception recurring?	20:31
pabelanger	okay, we appear to be launching nodes now: http://grafana.openstack.org/dashboard/db/nodepool-rackspace	20:32
mordred	corvus: ping just in case you didn't see the k8s config related traceback	20:32
*** wolverineav has joined #openstack-infra		20:33
*** slaweq has quit IRC		20:34
pabelanger	2018-11-30 20:33:05,859 INFO nodepool.driver.NodeRequestHandler[nl01-13954-PoolWorker.rax-ord-main]: Not enough quota remaining to satisfy request 200-0000583773	20:34
pabelanger	I am unsure if that is correct or not, looking at grafana, I do see space to launch more nodes	20:34
Shrews	tobiash: i'm not seeing more atm	20:34
*** eernst has quit IRC		20:35
pabelanger	so far, we haven't launched a new node in rax-ord / rax-iad since the restart, but rax-dfw is now bringing nodes online	20:36
tobiash	Shrews: maybe we should catch exceptions there and log a warning/error together with the event data and path	20:36
tobiash	Shrews: I think this might be an event we actually don't want to process	20:37
*** eernst has joined #openstack-infra		20:37
tobiash	pabelanger: do you have more context around the quota log?	20:37
pabelanger	tobiash: not yet, still looking into logs	20:38
Shrews	tobiash: you need the numbers?	20:39
pabelanger	tobiash: for example: http://paste.openstack.org/show/736526/	20:39
openstackgerrit	Merged openstack-infra/project-config master: Temporarily disable ovh-bhs1 in nodepool https://review.openstack.org/621250	20:40
*** eernst has quit IRC		20:41
tobiash	pabelanger, Shrews: there seems to be something off in the quota calculation	20:41
tobiash	pabelanger, Shrews: maybe one of the patches that landed today	20:41
clarkb	one of corvus' changes modified how we account for nodes outside of nodepool	20:42
clarkb	(to handle leaks better)	20:42
mrhillsman	thx clarkb	20:42
clarkb	(I'm not really here yet still finishing up lunch)	20:42
tobiash	Shrews, pabelanger: probably this one: https://review.openstack.org/621040	20:42
tobiash	that would explain that we get predicted quota of -50 when launching one node	20:42
*** eernst has joined #openstack-infra		20:43
*** hwoarang has quit IRC		20:44
pabelanger	hmm, let me check if we have leaked nodes	20:44
Shrews	tobiash: pabelanger: corvus: hrm, we should be using zk.getNodes() instead of the nodeIterator in that change	20:45
openstackgerrit	Matt Riedemann proposed openstack-infra/elastic-recheck master: Add query for nova i/o semaphore test race bug 1806123 https://review.openstack.org/621285	20:45
openstack	bug 1806123 in OpenStack Compute (nova) "i/o concurrency semaphore test changes are racy" [High,Confirmed] https://launchpad.net/bugs/1806123	20:45
tobiash	Shrews: I think the negation here is wrong: https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/provider.py#n197	20:45
Shrews	but that's unrelated to the quota thing	20:46
Shrews	(my suggestion, that is)	20:46
*** yamamoto has joined #openstack-infra		20:46
tobiash	Shrews: no, if that's wrong, it increases the unmanaged usage	20:46
tobiash	Shrews: so nodepool thinks it has -50 instances left and the rest is blocked with instances it doesn't manage	20:47
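The effect tobiash describes can be sketched in a few lines of Python. This is a hypothetical illustration, not the code in provider.py: the function names, the `inverted` flag, and the numbers are all assumptions. It shows how flipping the "is this server ours?" test makes nodepool count its own servers as unmanaged usage, so the predicted remaining quota goes negative.

```python
def unmanaged_usage(server_ids, known_ids, inverted=False):
    """Count servers in the cloud that nodepool does not manage."""
    count = 0
    for server_id in server_ids:
        ours = server_id in known_ids
        if inverted:
            ours = not ours  # the suspected wrong negation
        if not ours:
            count += 1
    return count


def remaining_quota(limit, server_ids, known_ids, inverted=False):
    """Instances left after subtracting managed and unmanaged usage."""
    unmanaged = unmanaged_usage(server_ids, known_ids, inverted)
    return limit - len(known_ids) - unmanaged


# Hypothetical cloud: limit of 10 instances, 8 servers, all launched by nodepool.
servers = list(range(8))
known = set(servers)
correct = remaining_quota(10, servers, known)
buggy = remaining_quota(10, servers, known, inverted=True)
```

With the correct test the 8 managed servers are counted once; with the inverted test they are counted twice, once as managed and once as unmanaged.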
pabelanger	well, rax-ord does have a few leaked nodes	20:47
pabelanger	but, unsure why nodepool isn't deleting them	20:47
*** eernst has quit IRC		20:48
pabelanger	for example	20:48
pabelanger	http://paste.openstack.org/show/736527/	20:48
*** eernst has joined #openstack-infra		20:49
pabelanger	that one is missing metadata	20:49
*** openstackgerrit has quit IRC		20:50
*** yamamoto has quit IRC		20:50
*** openstackgerrit has joined #openstack-infra		20:50
openstackgerrit	Tobias Henkel proposed openstack-infra/nodepool master: Fix leak detection in unmanaged quota calculation https://review.openstack.org/621286	20:50
*** hwoarang has joined #openstack-infra		20:51
*** fried_rolls is now known as fried_rice		20:51
fungi	okay, i'm taking a break from mailing list combining work to go obtain sustenance. will return asap	20:51
tobiash	Shrews, pabelanger: I think this should fix it ^	20:51
tobiash	corvus: ^	20:53
pabelanger	oh	20:53
*** takamatsu has joined #openstack-infra		20:53
*** eernst has quit IRC		20:53
tobiash	we also could revert 621040, fix that and re-propose it with a test case	20:54
Shrews	tobiash: corvus: pabelanger: if the instance isn't getting the meta properties set correctly, nodepool will never delete it.	20:55
pabelanger	so we have 2 issues right now with nl01: quota calculation is off, and the kube config is broken	20:55
Shrews	https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/provider.py#n474	20:55
pabelanger	Shrews: yah, I don't know why that is right now. For now, I'm going to manually clean them up after current issue is addressed	20:55
pabelanger	the missing metadata I mean	20:55
Shrews	pabelanger: you'll probably want to back off the upgrade for now	20:57
mordred	pabelanger: "Expected object with name default in kube-config/contexts list" seems to indicate something is unhappy with the "current-context: default" line	20:57
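The error message reads like a lookup of the current-context value in the contexts list. A minimal Python sketch of that kind of check, using a plain dict in place of a parsed kube config; the function name `find_current_context` and the context name `vexxhost` are hypothetical, and this is not the kubernetes client's actual code:

```python
def find_current_context(kube_config):
    """Return the context entry named by current-context, or raise ValueError."""
    wanted = kube_config.get("current-context")
    for context in kube_config.get("contexts", []):
        if context.get("name") == wanted:
            return context
    raise ValueError(
        "Expected object with name %s in kube-config/contexts list" % wanted)


# Hypothetical config: current-context names a context that is not defined.
broken = {
    "current-context": "default",
    "contexts": [{"name": "vexxhost", "context": {"cluster": "vexxhost"}}],
}
# Pointing current-context at a context that exists passes the check.
fixed = {**broken, "current-context": "vexxhost"}
```

Calling `find_current_context(broken)` raises, while `find_current_context(fixed)` returns the `vexxhost` entry.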
pabelanger	Shrews: yah, I can stop and roll back	20:57
pabelanger	let me check which version the others are running	20:57
corvus	back/catching up	20:59
clarkb	I'm back now too	21:00
openstackgerrit	Monty Taylor proposed openstack-infra/system-config master: Update the current-context to valid context https://review.openstack.org/621287	21:01
mordred	corvus, pabelanger: ^^ I think that should fix the kube config error	21:01
mordred	but I'm also sort of just shooting at pickles in a barrel	21:01
mordred	that was based on reading https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/ fwiw	21:01
pabelanger	Shrews: corvus: clarkb: I think we are safe to revert back to 3.3.1, but waiting for somebody to confirm	21:02
pabelanger	I also have to duck out for 10mins to wait for school bus, but will be back online to assist	21:02
clarkb	pabelanger: ya I don't think any of the changes required the updates to the zk schema	21:02
clarkb	so we should be safe to go back to older version	21:02
clarkb	pabelanger: maybe just do that manually by hand on nl01 then we get tobiash's fix in and try again later today or monday	21:03
pabelanger	clarkb: okay, can I pass the revert step to you now?	21:03
clarkb	pabelanger: on the server or getting the code changes in gerrit side? I can do either but want to make sure I know what you are doing too :)	21:03
clarkb	*either or both	21:04
pabelanger	clarkb: yes, the revert of nl01.o.o to 3.3.1.	21:04
pabelanger	I can do it, but will need about 10mins	21:04
corvus	regarding the current-context -- that's really disappointing that the kubernetes lib requires that even though we don't use it. maybe we should default a bogus default context -- because default-context makes no sense in our multi-cloud case.	21:04
pabelanger	on server, I should say	21:04
tobiash	corvus: sounds valid	21:05
clarkb	pabelanger: I'm ssh'ing in now	21:05
pabelanger	okay, afk for 10mins...	21:05
mordred	corvus: I think we can set it to ''	21:05
*** kjackal has joined #openstack-infra		21:06
clarkb	puppet just ran so I have ~30 minutes to get this done.	21:06
clarkb	I am pip installing 3.3.1 now	21:06
mordred	corvus: which seems to be a valid option based on that document	21:06
clarkb	that sound right to everyone?	21:06
clarkb	based on nl02 that looks right to me	21:07
clarkb	hahahaha ok	21:07
clarkb	we don't have the openstacksdk exclusion on 3.3.1	21:08
corvus	i will direct-enqueue both of those changes	21:08
clarkb	so I've got to manually install sdk afterwards	21:08
clarkb	(just a note for anyone else trying to do similar later)	21:08
*** rlandy has quit IRC		21:08
*** wolverineav has quit IRC		21:08
*** hongbin has joined #openstack-infra		21:08
clarkb	nl01 restarted running nodepool==3.3.1 and openstacksdk==0.19.0	21:09
corvus	both changes are in gate	21:10
hongbin	hi folks, i have a patch to modify the zuul job: https://review.openstack.org/#/c/619642/ but i'm consistently getting 'RETRY_LIMIT' errors for some jobs and would like help resolving it	21:10
clarkb	hongbin: RETRY_LIMIT happens when the job fails in the job pre run stage	21:10
hongbin	clarkb: i see, any idea about how to get the logs to see what is wrong?	21:11
clarkb	hongbin: failures in pre-run are retried up to three times until we report back RETRY_LIMIT. The reason for this is that we expect those things to always pass as they shouldn't test code, just be setup	21:11
tobiash	clarkb: with finger links that fails very early	21:11
clarkb	tobiash: ya interesting	21:11
clarkb	hongbin: in this case I think this means it's failing very early so we don't get logs. One way to debug that is to catch them with the streaming logs from the status page while they happen	21:12
clarkb	another is we can go look in the zuul logs and see what that tells us	21:12
openstackgerrit	Matt Riedemann proposed openstack-infra/elastic-recheck master: Add query for libvirt functional evacuate test bug 1806126 https://review.openstack.org/621288	21:12
corvus	it can mean that the post playbook fails (possibly because the host is hosed)	21:12
openstack	bug 1806126 in OpenStack Compute (nova) "LibvirtRbdEvacuateTest and LibvirtFlatEvacuateTest tests race fail" [High,Confirmed] https://launchpad.net/bugs/1806126	21:12
tobiash	clarkb: what we did to improve that is to add a base-logs job as a parent to the job 'base' which only contains the log upload post playbook	21:12
corvus	tobiash: you don't need to use inheritance for that; post-playbooks are separable	21:13
clarkb	corvus: considering that this is making changes to networking in devstack that wouldn't surprise me	21:13
tobiash	corvus: does it now run all post playbooks even if the pre playbook of the same job failed?	21:13
corvus	tobiash: so as long as the last post playbook in base collects the logs, it will run regardless of anything before it	21:13
corvus	tobiash: i think so	21:14
tobiash	ah cool	21:14
hongbin	ok, let me recheck it and try to catch the streaming logs, thanks for the hint	21:14
clarkb	hongbin: it might help us debug, if we understand what you are trying to do with all of those api extensions	21:15
AJaeger_	fungi, I see you updating openstack-discuss - want to review https://review.openstack.org/619216 to handle infra-manual, please?	21:15
clarkb	hongbin: my best guess at this point is one of those extensions results in either a broken network stack or firewall rules that prevent zuul from talking to the test node	21:15
corvus	tobiash: yeah, just double checked. run won't run, but post will.	21:15
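A rough Python simulation of the behavior described above, under stated assumptions: `attempts=3` mirrors "up to three times", the function and argument names are hypothetical, and other retry causes (such as the failing post playbook corvus mentions) are ignored. It is a sketch of the control flow, not Zuul's implementation.

```python
def run_job(pre, run, post, attempts=3):
    """Simulate one job; pre/run/post are callables returning True on success."""
    for _ in range(attempts):
        pre_ok = pre()
        run_ok = run() if pre_ok else None  # run is skipped when pre fails
        post()  # post still runs, for example to collect logs
        if pre_ok:
            return "SUCCESS" if run_ok else "FAILURE"
    return "RETRY_LIMIT"


# Example: a pre playbook that always fails.
calls = []


def failing_pre():
    calls.append("pre")
    return False


def never_run():
    calls.append("run")
    return True


def collect_logs():
    calls.append("post")
    return True


result = run_job(failing_pre, never_run, collect_logs)
```

Here `calls` records only pre and post phases, repeated once per attempt, and the job ends as `RETRY_LIMIT`.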
tobiash	corvus: cool, thx	21:15
AJaeger_	hongbin: make a smaller change first - only change YAML with the goal to do exactly the same setup as before for a single job. And then iterate on it	21:16
hongbin	clarkb: i have two lists of extensions, and in the zuul job config, i combine those two lists and write the result to the devstack config file	21:16
tobiash	maybe that has changed when we separated it, or I just misunderstood that	21:16
clarkb	hongbin: right, but what do those extensions do?	21:16
hongbin	those are just two lists of strings in yaml	21:17
clarkb	hongbin: but they change devstack behavior somehow right?	21:17
clarkb	(and that changes neutron's behavior)	21:17
hongbin	the neutron behavior is not changed, since we should pass exactly the same list to devstack	21:18
AJaeger_	hongbin: so, I would do it as follows: 1) Use yaml anchors and create exactly same job as today. 2) Update the values	21:18
pabelanger	and back, sorry about that. #dadops	21:18
pabelanger	clarkb: thanks for reverting	21:18
clarkb	pabelanger: no worries. want to check on nl01? I think it should be running again	21:18
pabelanger	sure	21:19
AJaeger_	hongbin: oh, I see - that list is long and you copy it over	21:19
AJaeger_	hongbin: so, looks like something is wrong in that setup.	21:19
pabelanger	clarkb: grafana.o.o looks much better	21:19
hongbin	NETWORK_API_EXTENSIONS: "{{ network_api_extensions_common + network_api_extensions_tempest | join(',') }}"	21:19
hongbin	AJaeger_: yes, possibly	21:20
clarkb	ya I wasn't sure that would work when it was first suggested because this side of the config is in zuul not in ansible	21:20
clarkb	zuul doesn't do jinja2	21:20
hongbin	for jobs that are doing NETWORK_API_EXTENSIONS: "{{ network_api_extensions | join(',') }}" , it succeeded	21:21
hongbin	for jobs that are doing NETWORK_API_EXTENSIONS: "{{ network_api_extensions_common + network_api_extensions_tempest | join(',') }}" , it failed	21:21
hongbin	so i guess the usage of "+" to combine lists won't work?	21:21
pabelanger	corvus: clarkb: tobiash: Shrews: do we have an idea what is happening with http://paste.openstack.org/show/736525/	21:22
logan-	hongbin: "{{ (network_api_extensions_common + network_api_extensions_tempest) | join(',') }}"	21:22
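In Jinja2 a filter binds more tightly than `+`, so without the parentheses `join` applies only to the second list. A rough plain-Python analogue of the two expressions (this is not Jinja2 itself, and the list contents are made-up placeholders):

```python
common = ["ext-a", "ext-b"]  # placeholder values
tempest = ["ext-c"]          # placeholder values


def without_parens():
    # Analogue of: common + tempest | join(',')
    # The join happens first, leaving list + str, which raises TypeError.
    return common + ",".join(tempest)


def with_parens():
    # Analogue of: (common + tempest) | join(',')
    # The lists are concatenated first, then joined into one string.
    return ",".join(common + tempest)
```

Only `with_parens()` produces a single comma-separated string of all the entries.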
hongbin	logan-: ack, let me try that	21:22
clarkb	pabelanger: looks like the string was '' and not valid json	21:22
hongbin	logan-: thanks for the advice	21:22
clarkb	pabelanger: I don't know why though	21:23
pabelanger	clarkb: that was the only other thing I noticed on startup	21:23
logan-	clarkb: yep, it works, because zuul just dumps the uninterpreted jinja into the job inventory, which ansible then consumes and interprets :)	21:24
*** kgiusti has left #openstack-infra		21:25
clarkb	pabelanger: I think we want to look for any cases we might write an empty string to zk	21:25
corvus	pabelanger, clarkb, tobiash: we could look in zk to see if there are any node records with empty data	21:26
clarkb	but I haven't followed any of the nodepool cache stuff	21:26
clarkb	so this is all new to me	21:26
tobiash	clarkb, pabelanger: we should catch this exception and log better data in this case, maybe it's just an event we don't want to process but didn't filter correctly	21:26
pabelanger	corvus: good idea, since this happened on startup, possible existing node-requests are missing data for some reason?	21:28
pabelanger	tobiash: +1	21:28
openstackgerrit	James E. Blair proposed openstack-infra/nodepool master: Log exceptions in cache listener events https://review.openstack.org/621292	21:28
corvus	tobiash, clarkb, pabelanger: ^	21:28
clarkb	tobiash: these act as watches on the "filesystem" tree? and we reconcile our local data structures when they change? making sure I understand the basics here	21:28
corvus	clarkb: yep	21:29
tobiash	pabelanger: it's nodes, not node-requests in the exception	21:29
clarkb	then we don't need to read from zk every time we need data we just trust the cache is up to date because it is being reconciled. Got it	21:29
corvus	clarkb: correct. we do read an extra time after we lock nodes, just to make sure everything is in sync.	21:29
pabelanger	tobiash: thanks!	21:29
corvus	i've direct-enqueued that too	21:30
clarkb	reading the code we treat node added and node updated the same. Wouldn't surprise me if we create an empty node then write to it later	21:31
clarkb	but the listener then races its cache update to the update happening?	21:31
clarkb	though I thought storeNode would write json with keys just empty values if there is no data	21:32
clarkb	so maybe this isn't that	21:32
clarkb	in any case logs ++	21:32
pabelanger	and rax looks back at capacity now	21:33
corvus	i'm hoping the event structure converts to strings in a useful way	21:33
corvus	(ie, i hope it looks like "<Event path:foo/bar data:...>"	21:33
openstackgerrit	Merged openstack-infra/nodepool master: Fix leak detection in unmanaged quota calculation https://review.openstack.org/621286	21:34
*** kjackal has quit IRC		21:34
clarkb	corvus: and hopefully ADDED/UPDATED/DELETED types are included too	21:34
corvus	oh, heh, it's a subclass of tuple	21:35
corvus	https://kazoo.readthedocs.io/en/latest/_modules/kazoo/recipe/cache.html#TreeEvent	21:35
corvus	so we should be able to see what we need to	21:35
*** wolverineav has joined #openstack-infra		21:35
openstackgerrit	Merged openstack-infra/system-config master: Update the current-context to valid context https://review.openstack.org/621287	21:35
corvus	so next question is, what's the format of event_data	21:36
tobiash	NodeData inherits from tuple	21:38
*** kjackal has joined #openstack-infra		21:38
corvus	the log is the only outstanding fix now	21:38
corvus	it's probably worth waiting for before we attempt another restart	21:38
tobiash	yes	21:38
clarkb	corvus: ya I think so. we'll just end up restarting it again to debug that anyway (likely)	21:39
clarkb	also suggestion: we use nl04 for next restart since we've disabled bhs1 anyway. That will have low impact	21:39
pabelanger	++	21:39
tobiash	now that I searched for it I see the same exceptions in my log	21:39
openstackgerrit	Monty Taylor proposed openstack-infra/project-config master: Update promstat to use storyboard https://review.openstack.org/621293	21:40
corvus	drat, pep8 failure	21:47
*** boden has quit IRC		21:47
clarkb	ah the exception indentation?	21:47
openstackgerrit	James E. Blair proposed openstack-infra/nodepool master: Log exceptions in cache listener events https://review.openstack.org/621292	21:48
corvus	clarkb: nope, a legit thing. omitted a _	21:48
clarkb	oh ya diff shows that.	21:48
clarkb	I single approved it if you want to reenqueue	21:48
clarkb	(its a trivial fix)	21:49
corvus	done	21:49
clarkb	(fwiw I'm really excited about the relative priority stuff)	21:49
corvus	me too!	21:50
corvus	combine that with splitting zuul into its own tenant, and we should be able to merge changes at least this fast with no admin access needed	21:50
*** tpsilva has quit IRC		21:51
tobiash	corvus, clarkb: 2018-11-30 21:47:58,587 ERROR nodepool.zk.ZooKeeper: Exception in node cache update for event: (0, ('/nodepool/nodes/0000235218', b'', ZnodeStat(czxid=8628450713, mzxid=8628450713, ctime=1543562070559, mtime=1543562070559, version=0, cversion=1, aversion=0, ephemeralOwner=0, dataLength=0, numChildren=1, pzxid=8628450714)))	21:51
corvus	(because we can drop the clean-check pipeline requirement)	21:51
tobiash	so we get empty data from some nodes	21:51
tobiash	checking now if this is a cache setup thing or really like this in zk	21:51
corvus	0 is node_added	21:51
clarkb	tobiash: any idea what the 0 there is? is that the type?	21:51
clarkb	oh cool so my hunch was maybe right?	21:51
tobiash	0 is the event type	21:51
clarkb	in that case we probably want to say if type == added and data or type == updated and data	21:52
clarkb	or similar there	21:52
tobiash	0 means node added	21:52
tobiash	confirmed, that node is really empty in zk	21:53
tobiash	no idea yet what this means	21:53
corvus	looking at storeNode, there should be no case where we set empty data	21:53
pabelanger	corvus: drop clean-check pipeline requirement? Could you explain more?	21:54
clarkb	corvus: tobiash could it be that creating and writing data in zk is not atomic?	21:54
clarkb	corvus: tobiash so as events go we get one first for the node creation then for the update?	21:54
tobiash	clarkb: no, I restarted nodepool with the patch and it read the nodes that were already in the system	21:54
corvus	pabelanger: in the good old days, you used to be able to approve a change and have it gated before check results arrived. i believe that's a way better way to run a system, if you can trust people to behave. we can not, in openstack, so we require check results before approving in gate.	21:55
pabelanger	corvus: interesting	21:56
pabelanger	looking forward to seeing it in action :)	21:57
corvus	it's like all these direct-enqueues i'm doing, but anyone can do them	21:57
openstackgerrit	Merged openstack-infra/elastic-recheck master: Better event checking timeouts https://review.openstack.org/621038	21:57
pabelanger	corvus: yah	21:57
pabelanger	cool	21:57
*** wolverineav has quit IRC		21:59
clarkb	tobiash: so maybe in this case we can just ignore those events?	21:59
clarkb	as they aren't currently processed and lack useful info/data	21:59
tobiash	yes	21:59
corvus	tobiash: but if the node is still empty in zk... what does that mean?	22:00
tobiash	corvus, clarkb: I just looked into our zk data and we have many such nodes	22:00
corvus	if it were as clarkb suggests: a two-phase process -- create then write, you would expect to have data in there by the time you got around to looking at it	22:00
corvus	but to still have an empty string is puzzling	22:01
tobiash	and judging by the numbers really old ones	22:01
corvus	tobiash: any mention of that node id in logs?	22:01
tobiash	corvus: I need to find such a node that is still in my logs	22:01
corvus	i'm setting up zk_shell so i can poke too	22:02
clarkb	fwiw our node count is in the 15k range which is about where it was when I monitored the new zk cluster transition	22:03
clarkb	I don't think we are leaking nodes at an appreciable rate	22:03
corvus	we do have lots of old nodes with empty data	22:03
corvus	latest node id is 0000844533. but 0000470788 exists and is empty	22:04
clarkb	huh	22:04
*** wolverineav has joined #openstack-infra		22:04
corvus	0000819605 is recent-ish and empty	22:05
tobiash	corvus: http://paste.openstack.org/show/736529/	22:06
tobiash	corvus: nothing special here	22:06
clarkb	could it be a quorum thing?	22:06
corvus	tobiash: ditto: http://paste.openstack.org/show/736530/	22:06
clarkb	I would expect that to be fairly quick cleanup though not long term	22:07
clarkb	corvus: maybe check by connecting to other zk nodes to see if they have the same empty structure?	22:07
corvus	clarkb: zk01 and zk02 report the same	22:08
tobiash	corvus: is znode deletion recursive?	22:08
corvus	tobiash: i don't know -- are you noting that there is a lock file under it too?	22:08
tobiash	corvus: the node I looked at still has an empty lock child node	22:08
corvus	this seems like a very likely possibility	22:09
clarkb	would that imply the node is still locked? nodepool list should show us that info	22:09
clarkb	| 0000844533 | inap-mtl01 | ubuntu-xenial | 8067566c-765c-4165-a90e-0adf29e80f1b | 198.72.124.158 | | in-use | 00:00:05:31 | locked |	22:10
clarkb	seems like yes	22:10
tobiash	hrm, we do a recursive delete	22:11
clarkb	er wait I got the wrong node to grep for there	22:11
clarkb	ugh friday	22:11
clarkb	0000819605 is what we want to check for	22:11
clarkb	that one does not show up	22:12
*** wolverineav has quit IRC		22:12
tobiash	clarkb: yes, because nodepool has an if data: clause when getting the node	22:12
clarkb	ha	22:13
*** wolverineav has joined #openstack-infra		22:13
tobiash	so nodepool hid that before the caching patches	22:13
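A minimal Python sketch of the guard being discussed, assuming events arrive as (type, path, data) and that znode data is JSON. The function name `apply_event` and the sample node below are hypothetical, and only "0 means node added" is confirmed in this log; the other event numbers are assumptions. This is not nodepool's cache listener.

```python
import json

# Only NODE_ADDED = 0 is confirmed above; the other two values are assumed.
NODE_ADDED, NODE_UPDATED, NODE_REMOVED = 0, 1, 2


def apply_event(cache, event_type, path, data):
    """Update cache from one event; return False if the event was ignored."""
    if event_type == NODE_REMOVED:
        cache.pop(path, None)
        return True
    if not data:
        # Empty znode (data == b''): nothing to parse, so skip it
        # instead of letting json.loads raise.
        return False
    cache[path] = json.loads(data.decode("utf8"))
    return True


cache = {}
# The empty znode from the logged exception is ignored.
skipped = apply_event(cache, NODE_ADDED, "/nodepool/nodes/0000235218", b"")
# A hypothetical node with real data is cached.
stored = apply_event(cache, NODE_ADDED, "/nodepool/nodes/example",
                     b'{"state": "ready"}')
```

This is the same idea as the `if data:` clause: empty znodes are silently skipped rather than parsed.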
clarkb	ya	22:14
corvus	someone still holds the lock on 0000819605	22:14
clarkb	do the lockers drop an id of some sort on the node to indicate who has it?	22:14
corvus	it's /nodepool/launchers/nl03-13009-PoolWorker.vexxhost-sjc1-main	22:14
clarkb	fungi: mriedem fyi I'm going to restart elastic-recheck bot now and we can see if my change works	22:15
tobiash	at least on my leaked node there is no lock anymore, but still the empty lock child	22:16
clarkb	mriedem: fungi nevermind Nov 30 22:13:30 status puppet-user[17804]: (/Stage[main]/Elastic_recheck/Exec[install_elastic-recheck]/returns) ImportError: No module named docutils.core	22:17
corvus	tobiash: how did you determine there's no lock?	22:17
clarkb	that prevents us from installing my change	22:17
tobiash	dump with grep	22:17
corvus	tobiash: hrm. that's what i did.	22:17
tobiash	or is that not enough?	22:17
openstackgerrit	Merged openstack-infra/nodepool master: Log exceptions in cache listener events https://review.openstack.org/621292	22:17
corvus	tobiash: that should be enough	22:17
*** hwoarang has quit IRC		22:18
*** wolverineav has quit IRC		22:18
corvus	tobiash: i see: http://paste.openstack.org/show/736531/	22:19
*** wolverineav has joined #openstack-infra		22:19
tobiash	corvus: ok, so if there is a lock the lock child has another child	22:19
tobiash	that's not the case in my example, but it might have been the case earlier	22:20
tobiash	I restarted my launcher to pickup the logging change so the launcher cannot have any lock now	22:20
tobiash	(I have only one launcher atm)	22:21
openstackgerrit	James E. Blair proposed openstack-infra/nodepool master: Log exceptions deleting ZK nodes https://review.openstack.org/621301	22:21
corvus	tobiash: ah, so the ephemeral lock may just turn into an empty dir after restarts	22:22
clarkb	https://pagure.io/python-daemon/issue/18 <- that is really frustrating	22:22
tobiash	yes	22:22
corvus	tobiash, clarkb: there's a code path where we can end up without an exception report; see change ^	22:22
tobiash	+2	22:22
clarkb	corvus: looking	22:22
clarkb	corvus: looking at your paste, that was a dump of ephemeral nodes in zk? and the two nodes being under the same item means that that launcher holds that lock? Seems like maybe we want to record that info in the lock itself?	22:23
corvus	tobiash, clarkb: i think we've established this bug as "annoying but mostly harmless" and can probably proceed with restarts and debug this in parallel...	22:24
clarkb	corvus: I agree	22:24
tobiash	++	22:24
pabelanger	++	22:24
corvus	clarkb: your understanding of the ephemeral nodes stuff is correct, but i don't want to mess with the kazoo lock recipe; this is workable enough i think.	22:25
corvus	(*any more -- i've already undertaken one improvement to kazoo locks)	22:25
clarkb	corvus: ok, this might be a thing we add to operational docs then	22:25
clarkb	(this is how you find who owns a lock type thing)	22:26
corvus	clarkb: as long as we bury it deep, deep inside the nodepool developer docs. i've been fighting very hard to avoid the perception that a nodepool operator needs to know anything about zk. this is something only a nodepool dev should ever have to do, and only when nodepool is broke.	22:27
clarkb	corvus: that's fair. Things like why did zuul leak this lock etc	22:27
corvus	pabelanger: you want to go ahead and restart a launcher?	22:28
*** wolverineav has quit IRC		22:28
clarkb	nl04 is my suggestion	22:28
pabelanger	I can yes, nl04 this time?	22:28
openstackgerrit	Tobias Henkel proposed openstack-infra/nodepool master: Don't update caches with empty zNodes https://review.openstack.org/621305	22:28
clarkb	and double check they have the fix for the quota calculations installed	22:28
pabelanger	ack, checking now	22:29
pabelanger	(nl04)	22:29
*** wolverineav has joined #openstack-infra		22:29
pabelanger	f8d20d6 is current version of nodepool	22:30
pabelanger	I think we want the next 2 right	22:30
corvus	pabelanger: f8d20d6 is okay	22:30
pabelanger	k	22:30
clarkb	ya we don't need those logs in place since tobiash was helpful and got them for us	22:31
pabelanger	okay, restarting nl04 now	22:31
corvus	yep, and next debug step there is changes that haven't merged yet	22:31
pabelanger	nodepool-launcher started	22:32
pabelanger	more exceptions at beginning, expected (json)	22:32
tobiash	I have around 200 of these leaked znodes	22:33
openstackgerrit	Merged openstack-infra/elastic-recheck master: Add query for nova i/o semaphore test race bug 1806123 https://review.openstack.org/621285	22:33
openstack	bug 1806123 in OpenStack Compute (nova) "i/o concurrency semaphore test changes are racy" [High,Confirmed] https://launchpad.net/bugs/1806123	22:33
pabelanger	I can see we are launching nodes in ovh-gra1	22:33
clarkb	fungi mriedem I think newer pip may fix this? I'm going to try updating pip on that host to see	22:35
pabelanger	clarkb: corvus: tobiash: nl04 looks to be working as expected now	22:35
tobiash	:)	22:35
pabelanger	I no longer see no quota	22:36
pabelanger	and confirmed we've launched a few nodes	22:36
*** slaweq has joined #openstack-infra		22:36
pabelanger	I think that means we can proceed with other nodepool-launchers, waiting for confirmation	22:37
clarkb	mriedem: fungi ok that didn't change it, but everything else should be the same between local system and remote. So I'm confused	22:37
clarkb	pabelanger: I usually like to wait long enough for deletes to happen if they don't happen on startup	22:37
pabelanger	clarkb: sure, let me confirm	22:37
clarkb	pabelanger: basically make sure we handle create and delete, but ya if both those look good I think you can restart the others	22:37
*** xek has quit IRC		22:37
clarkb	mriedem: fungi maybe a setuptools thing, checking that next	22:38
clarkb	that was it	22:39
clarkb	restarting elastic-recheck now	22:39
pabelanger	2018-11-30 22:39:14,760 DEBUG nodepool.StatsWorker: Updating stats	22:39
pabelanger	that doesn't really seem helpful	22:39
pabelanger	and looks newish	22:39
tobiash	pabelanger: https://review.openstack.org/621283	22:40
pabelanger	there we go, just deleted some nodes	22:40
pabelanger	tobiash: +3, thanks	22:40
pabelanger	clarkb: corvus: okay, I think we are good to proceed to other launchers	22:41
corvus	pabelanger: ++	22:41
tobiash	that was useful during development	22:41
clarkb	pabelanger: go for it	22:41
mriedem	clarkb: i knew you could do it	22:42
*** slaweq has quit IRC		22:43
clarkb	mriedem: heh, in any case can you keep your eye out for it leaving comments/reports on irc?	22:43
pabelanger	okay, all launchers have been restarted	22:43
mriedem	clarkb: not sure which channel those are in, except qa i guess	22:43
pabelanger	looking at logs now to confirm things are working properly	22:43
*** hwoarang has joined #openstack-infra		22:44
pabelanger	corvus: clarkb: okay, I don't see any obvious issues. We are still creating / deleting nodes properly now	22:45
corvus	ready for the zuul part of this?	22:45
clarkb	corvus: I am if you are	22:45
openstackgerrit	Merged openstack-infra/elastic-recheck master: Add query for libvirt functional evacuate test bug 1806126 https://review.openstack.org/621288	22:46
openstack	bug 1806126 in OpenStack Compute (nova) "LibvirtRbdEvacuateTest and LibvirtFlatEvacuateTest tests race fail" [High,Confirmed] https://launchpad.net/bugs/1806126	22:46
pabelanger	++	22:46
tobiash	corvus: just found this in my new nodepool: http://paste.openstack.org/show/736532/	22:46
tobiash	corvus: related to the openstacksdk release?	22:46
*** hamerins has quit IRC		22:46
tobiash	the flavor used to be a munch, now it seems to be a dict	22:46
tobiash	mordred: ^	22:46
clarkb	tobiash: is that with sdk 0.19.0?	22:46
clarkb	tobiash: or 0.20.0?	22:47
pabelanger	tobiash: we blocked 0.20.0 openstacksdk in nodepool not that long ago, which version	22:47
tobiash	checking, but I guess 0.20.0 as I only pulled in the exception change	22:47
pabelanger	tobiash: https://review.openstack.org/621272/	22:47
*** apetrich has quit IRC		22:47
fungi	several hundred lines of scrollback in this channel while i was at dinner. ftr i'm just going to check nick highlights	22:47
*** hamerins has joined #openstack-infra		22:47
corvus	should i just do a scheduler restart or a full restart?	22:48
tobiash	yes 0.20.0	22:48
corvus	we did executors a couple days ago	22:48
tobiash	so another incompatibility	22:48
clarkb	corvus: executors slow things down so I'm good with just scheduler if you want to make it easier	22:48
clarkb	that gets us relative priorities. And infra doesn't currently need executor zones	22:49
clarkb	if we had an immediate need for executor zones I'd say do it all, but I can't think of one	22:49
fungi	AJaeger_: thanks for the heads-up on 619216. i'm mostly trying to fix up cookiecutter templates and important documentation, so that's totally in scope	22:49
pabelanger	whatever is easier to get the next release of zuul with executor zone support is my vote	22:49
*** slaweq has joined #openstack-infra		22:50
clarkb	pabelanger: in that case probably restarting the executors too, but we won't exercise those code paths. Probably want you to run it in beta form?	22:50
pabelanger	clarkb: I am okay with skipping executors today	22:51
pabelanger	yah, I'll test zones locally to be sure	22:51
corvus	i'm already down the scheduler-only path	22:51
*** kjackal has quit IRC		22:52
clarkb	corvus: wfm. I don't think relying on infra to ensure executor zones are ready is prudent	22:52
pabelanger	++	22:52
clarkb	we don't have the constraints in place that made it a feature so we wouldn't be able to test it for general functionality well	22:52
clarkb	(even if we could pretend and check it doesn't completely break)	22:52
*** slaweq has quit IRC		22:52
pabelanger	clarkb: I was only thinking the arm cloud in china might be a good zone to do	22:53
clarkb	pabelanger: I don't think we have trouble with the ssh connections. Only http	22:53
pabelanger	ah	22:53
clarkb	(well and zk, but executor zones don't help that either)	22:53
clarkb	we moved the builder into the london region to deal with the zk issue and switched to https to deal with http problems	22:54
pabelanger	if only infracloud was still around, we had bandwidth issues there :)	22:54
*** mriedem has quit IRC		22:54
*** wolverineav has quit IRC		22:54
*** slaweq has joined #openstack-infra		22:54
corvus	i'm pretty opposed to adding spofs to zuul, so if we were to use zones, we'd have to double up on executors. at least.	22:54
clarkb	ya I just don't see us needing it with current resources	22:55
corvus	(even if we had network restricted resources, i'd suggest we talk about vpns first)	22:55
*** wolverineav has joined #openstack-infra		22:55
clarkb	we'd have an excuse to use that new in kernel vpn system	22:56
corvus	scheduler restarted	22:57
clarkb	looks like check is enque(ing\|ed)	22:57
pabelanger	confirmed, I can see nodepool-launcher creating nodes	22:58
clarkb	corvus: and we are already running with relative priority right?	22:58
clarkb	seems so check is running jobs ahead of gate	22:58
clarkb	neat	22:58
*** wolverineav has quit IRC		23:00
clarkb	or at least it appeared so a minute ago. Now I'm not completely sure	23:00
*** wolverineav has joined #openstack-infra		23:00
corvus	the field is being set: http://paste.openstack.org/show/736533/	23:00
pabelanger	yay	23:00
clarkb	cool I don't have evidence against it working, more lack of evidence it is working (since now all the gate things are happening)	23:01
fungi	and the playing field is being leveled?	23:01
corvus	theoretically	23:01
corvus	gate still has priority over check	23:01
corvus	but we should see, for example nodepool change 621286 get nodes before nova change 620861	23:02
*** apetrich has joined #openstack-infra		23:02
corvus	in check	23:02
pabelanger	corvus: that is because there are more nova changes in check right?	23:02
corvus	yeah, there's a bunch right now	23:02
pabelanger	kk, keeping an eye out	23:03
pabelanger	oh, I think it is working 621011 has jobs running now	23:03
pabelanger	and there is nova before it with none	23:04
corvus	similarly the requirements change 620563 is doing well	23:04
pabelanger	yah, that is pretty cool	23:05
*** Miouge- has quit IRC		23:05
openstackgerrit	James E. Blair proposed openstack-infra/nodepool master: Add relative priority to request list https://review.openstack.org/621314	23:06
corvus	i assume that's going to fail tests, i just don't know which	23:06
pabelanger	now I better understand 'level playing field' comment from fungi :)	23:06
*** cdent has quit IRC		23:07
clarkb	pabelanger: ya this should allow lower resource usage projects "to get a word in" then once done give the higher usage projects a turn	23:07
*** Miouge has joined #openstack-infra		23:08
clarkb	s/lower resource usage/less active/	23:08
clarkb	it's based on changes not nodeset size	23:08
pabelanger	clarkb: yah, that is really cool, and fair.	23:08
openstackgerrit	James E. Blair proposed openstack-infra/zuul master: Only count live items for relative priority https://review.openstack.org/621315	23:09
corvus	or, uh, nearly fair ^ :)	23:10
clarkb	it still won't be a strict ordering because different providers can grab requests and fulfill them at different speeds	23:10
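The "let the less active projects get a word in" behaviour discussed above can be illustrated with a small sketch. This is not Zuul's implementation: the function names, the tuple layout, and the change labels are all hypothetical, and it assumes that a change's relative priority is simply the number of live changes from the same project queued ahead of it in the pipeline (lower is served first). As noted in the conversation, real ordering is only best-effort because different providers fulfil requests at different speeds.

```python
from collections import defaultdict


def relative_priorities(items):
    """Return {change: relative priority} for the live items.

    `items` is a list of (change, project, live) tuples in queue order.
    Assumption for this sketch: the priority of a change is the number of
    live changes from the same project that are ahead of it.
    """
    seen = defaultdict(int)
    result = {}
    for change, project, live in items:
        if not live:
            # Non-live items (e.g. dependencies shown only for context)
            # are not counted and do not request nodes themselves.
            continue
        result[change] = seen[project]
        seen[project] += 1
    return result


def service_order(items):
    """Order live changes by relative priority, lowest first."""
    pri = relative_priorities(items)
    # sorted() is stable, so changes with equal priority keep queue order.
    return sorted(pri, key=pri.get)


# Hypothetical queue: three changes from a busy project, one non-live
# dependency, and a single change from a quiet project at the back.
queue = [
    ("busy-1", "busy", True),
    ("busy-2", "busy", True),
    ("busy-dep", "busy", False),
    ("busy-3", "busy", True),
    ("quiet-1", "quiet", True),
]
order = service_order(queue)
```

In this sketch the quiet project's only change moves ahead of the busy project's second and third changes, while the busy project's first change keeps its place.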
pabelanger	It seems 614035 got nodes before 614012, but 614012 is parent	23:10
clarkb	corvus: oh hah	23:10
corvus	yeah, it's all best-effort	23:10
openstackgerrit	Merged openstack-infra/project-config master: Update promstat to use storyboard https://review.openstack.org/621293	23:11
pabelanger	+2	23:11
corvus	614012 has a relpri of 3	23:12
mordred	tobiash: hrm. flavors should not have changed. they should still be munch ... or at the very least munch-like	23:14
corvus	603930 has relpri 39	23:14
tobiash	mordred: I just confirmed that the downgrade fixed the issue	23:14
mordred	tobiash: AWESOME	23:14
mordred	sorry for the breakage .. I'm ... quite sad about that	23:15
corvus	i don't know what the ones that got nodes are	23:15
clarkb	corvus: all those non live changes bumping it up	23:15
*** hamerins has quit IRC		23:15
corvus	ya	23:15
tobiash	mordred: nothing happened, it's night here and I just tried that exception change and that pulled in the new sdk	23:15
corvus	my guess is the others all grabbed nodes as the system was starting up and there were no other requests	23:16
tobiash	so I noticed that before it had an impact	23:16
mordred	tobiash: oh good	23:18
clarkb	corvus: are you wanting to restart again today with the liveness check? I'm thinking that may be good to get in on friday vs monday	23:18
clarkb	(and i can hang around this afternoon to help with that)	23:18
corvus	clarkb: yeah, at least sometime before monday	23:19
pabelanger	I still have some time today to support	23:19
clarkb	the other thing we should probably do is send an email to the dev list explaining the change so that we can avoid all the questions (or at least some of them) next week :)	23:19
clarkb	corvus: is that something you'd like to draft up? (you came up with the idea and did most of the work to implement it)	23:20
clarkb	I'm happy to write a note too if you'd rather not	23:20
corvus	clarkb: i can do it	23:20
mordred	tobiash: oh! it's server.flavor.id ... so the sub-dict isn't munched	23:21
clarkb	while we are on a improve all the things streak. https://review.openstack.org/#/c/611920/ is a fairly straightforward gear change to make it more ipv6 friendly	23:21
tobiash	yes	23:21
clarkb	if anyone wants to review that one. It causes gear to listen on ipv6 if available	23:21
clarkb	pabelanger: ^ I want to say you've pushed changes around similar stuff there	23:21
pabelanger	looking	23:22
*** slaweq has quit IRC		23:22
pabelanger	Ah, yah. +2	23:22
*** Swami has quit IRC		23:27
openstackgerrit	Merged openstack-infra/nodepool master: Remove updating stats debug log https://review.openstack.org/621283	23:28
pabelanger	clarkb: corvus: there we go, 621286 (nodepool) just got allocated nodes	23:29
clarkb	ya and it's been skipping changes earlier as expected	23:29
pabelanger	way before existing keystone / nova patches	23:29
pabelanger	++	23:29
pabelanger	awesome	23:29
mordred	tobiash: remote: https://review.openstack.org/621316 Transform server with munch before normalizing should fix it	23:30
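The "sub-dict isn't munched" problem mentioned above can be sketched without openstacksdk or the munch library. The `AttrDict` class and `attrify` helper below are hypothetical stand-ins written for this illustration only; they are not the SDK's code, and the actual error in the paste is not reproduced here. The sketch only shows why attribute access such as `server.flavor.id` fails when the outer object supports attribute access but the nested value is a plain dict.

```python
class AttrDict(dict):
    """Hypothetical munch-like dict: keys are also readable as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


def attrify(obj):
    """Recursively convert dicts (including nested ones) to AttrDict."""
    if isinstance(obj, dict):
        return AttrDict((k, attrify(v)) for k, v in obj.items())
    if isinstance(obj, list):
        return [attrify(v) for v in obj]
    return obj


raw = {"flavor": {"id": "hypothetical-flavor-id"}}

# Only the outer layer is converted: shallow.flavor is still a plain dict,
# so shallow.flavor.id raises AttributeError.
shallow = AttrDict(raw)

# Converting before further processing makes every level attribute-friendly.
deep = attrify(raw)
flavor_id = deep.flavor.id
```

Key-style access (`shallow["flavor"]["id"]`) works in both cases, which is one way calling code can stay tolerant of either shape.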
corvus	clarkb, pabelanger: btw this change from tobias https://review.openstack.org/610029 will help things as well	23:30
corvus	(that's probably the reason that 614012, at the top of the list, doesn't have nodes yet)	23:31
pabelanger	ah, cool	23:31
corvus	it was unlucky enough to be locked by a bunch of handlers which couldn't fulfill it, and is now waiting for the ones that can to come back to the top of the loop	23:31
pabelanger	corvus: do we want to restart launchers today to pick it up?	23:32
tobiash	mordred: thanks :)	23:32
corvus	pabelanger: yeah, either today or this weekend i think	23:38
clarkb	TIL about time.monotonic. There are some nice things in python3	23:38
corvus	be very careful about monotonic. it's a trap in unit tests.	23:38
clarkb	corvus: oh?	23:38
corvus	i keep having to swap it out for time.time.	23:38
corvus	yeah, the base value isn't guaranteed	23:38
clarkb	tobiash's change to timeout assigning handlers uses monotonic	23:39
corvus	yeah, i think it's okay there	23:39
clarkb	corvus: right it's not a unix timestamp? I guess that could cause problems in places	23:39
clarkb	it's a time that may start at zero when process starts?	23:39
corvus	clarkb: https://review.openstack.org/529173	23:40
corvus	i think as long as you always initialize to time.monotonic, you're okay	23:41
clarkb	got it	23:41
corvus	but the common pattern of initializing to 0 and assuming that now()-then will work doesn't work with monotonic	23:42
*** wolverineav has quit IRC		23:42
*** tosky has quit IRC		23:42
corvus	or rather, it may or may not work depending on $random	23:42
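The pitfall described above can be shown with a short sketch. The `Deadline` class and `fragile_expired` function are hypothetical names made up for this example and are not taken from nodepool or zuul. The only library fact relied on is that the reference point of `time.monotonic()` is undefined, so only the difference between two of its readings is meaningful.

```python
import time


class Deadline:
    """Hypothetical timeout helper that is safe with time.monotonic()."""

    def __init__(self, timeout):
        self.timeout = timeout
        # Initialize from the monotonic clock itself, never from a literal 0.
        self.start = time.monotonic()

    def expired(self):
        # Both readings come from the same clock, so the difference is valid.
        return time.monotonic() - self.start >= self.timeout


def fragile_expired(timeout, last=0):
    # Anti-pattern: with time.time(), last=0 reliably means "long ago".
    # With time.monotonic() the clock may itself start near zero, so the
    # result depends on where the clock happens to begin on this host.
    return time.monotonic() - last >= timeout


already_due = Deadline(0).expired()
not_yet_due = Deadline(3600).expired()
```

No expected value is given for `fragile_expired`, because its result is exactly the part that varies between hosts and test environments.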
*** wolverineav has joined #openstack-infra		23:45
corvus	clarkb, pabelanger, fungi, mordred: how's this? https://etherpad.openstack.org/p/QHQuh57TiG	23:45
*** pbourke has quit IRC		23:47
pabelanger	corvus: wfm	23:48
clarkb	corvus: looks great, particularly for calling out the out of order behavior	23:48
clarkb	I expect people will notice that and get curious	23:48
*** wolverineav has quit IRC		23:48
*** wolverineav has joined #openstack-infra		23:49
fungi	corvus: ship it!	23:49
*** pbourke has joined #openstack-infra		23:49
pabelanger	clarkb: looking forward to the stats next week for node break down	23:50
clarkb	pabelanger: I don't expect that will change much	23:50
clarkb	we are changing where in the pipeline you get resources not how many you get	23:50
clarkb	(so over a long period like a month or a week it should all be roughly the same)	23:50
pabelanger	I guess it depends if more smaller projects are pushing up more patches	23:51
fungi	hint: they're not	23:52
fungi	that's what makes them "small projects"	23:52
pabelanger	cool, it only took my patch 21mins to get nodes, compared to the existing ones that are pushing 45mins	23:52
pabelanger	very nice	23:53
pabelanger	as it was the first one	23:53
fungi	also we generally do catch up over the weekend, so the week's worth of node requests still all get satisfied within that week	23:53
clarkb	ya once upon a time I tried to look at it yoy and we use ~30% of our total resources when viewed at that scale	23:53
clarkb	weekends and the apac timezones tend to be quieter iirc	23:54
*** slaweq has joined #openstack-infra		23:54
clarkb	so while we are flat out when people are watching, it is much less over long periods of time	23:54
clarkb	we'll need to watch it and see how it goes though	23:58
