*** pabelanger has joined #openstack-infra | 00:00 | |
pabelanger | fungi: mnaser: thinking out loud about a 20-patch stack series for nova, and have idea how it would look, but if zuul knows the 20 patches are submitted together, maybe the relative priority also does. So, if there is the 20-patch nova stack, and a 2nd and 3rd nova patch behind it, then nodepool allocates nodes to the 2nd and 3rd nova changes before the 4th patch in the nova 20 stack... if that makes sense | 00:06 |
clarkb | http://logs.openstack.org/12/615612/3/gate/neutron-grenade/bcd2c51/logs/grenade.sh.txt.gz#_2018-12-04_23_01_10_429 the grenade job failed on ovs/q-agt timing out | 00:06 |
pabelanger | s/idea/no idea/ | 00:06 |
pabelanger | that way, other users in nova also get feedback outside the mass patch series | 00:06 |
openstackgerrit | Clark Boylan proposed openstack-infra/opendev-website master: Add some initial content thoughtso https://review.openstack.org/622624 | 00:13 |
clarkb | that is super rough | 00:13 |
clarkb | and I intend to take faq/q&a content from the email we sent and incorporate it in | 00:13 |
clarkb | but figured getting a draft or even an outline going would probably help get the ball rolling | 00:14 |
*** kjackal has quit IRC | 00:14 | |
clarkb | I'm going to have to pop out soonish as a contractor is coming over to tell me how much a new wall costs but will search out any feedback when I am able | 00:14 |
clarkb | interesting, the tempest change failed on the same thing as grenade | 00:16 |
clarkb | mnaser: ^ we likely have a systemic bug there since the change was from glance and not neutron | 00:16 |
clarkb | hrm looks like both happened on bhs1 and the devstack run took about an hour before it failed | 00:18 |
clarkb | fungi: amorin ^ maybe there is some other issue affecting bhs1? | 00:18 |
*** jcoufal has joined #openstack-infra | 00:30 | |
*** gyee has quit IRC | 00:33 | |
clarkb | looking at dstat for the job there is a spike in writes to ~46MBps | 00:39 |
clarkb | but its pretty consistently closer to 1MBps for the job run | 00:39 |
clarkb | there is also a period of almost persistent 10% cpu wai which could be related (they do overlap somewhat) | 00:39 |
clarkb | possible we are still our own noisy neighbor here? | 00:40 |
clarkb | looking at the devstack log that roughly correlates with when neutron mysql migrations were run | 00:42 |
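For reference, the dstat data being read here is collected by the job with an invocation roughly like the following (a sketch; the exact flags and output path used by devstack may differ):

```sh
# Sample cpu, memory, net, disk, io, system, load, process and paging stats once
# per second with timestamps, and also write them out as CSV for later analysis.
dstat -tcmndrylpg --output /opt/stack/logs/dstat-csv.log 1
```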
clarkb | mordred: perhaps this is a silly idea, but can we run mysql in an unsafe mode where it is eatmydata-y? | 00:42 |
clarkb | if we are our own noisy neighbor that sort of thing may help | 00:43 |
clarkb | The other thing is whether or not kvm is waiting for these writes to succeed before completing them in the VM | 00:43 |
clarkb | we turned that off in infracloud to get more io throughput | 00:43 |
clarkb | amorin: ^ that last question is likely a question for you | 00:43 |
*** wolverineav has quit IRC | 00:45 | |
clarkb | ya then the second 10% ish cpu wai block lines up with nova db migrations | 00:46 |
pabelanger | clarkb: yah, I might end up trying eatmydata for some DIB testing I am doing, I'm seeing a large amount of failures now with ovh | 00:47 |
*** wolverineav has joined #openstack-infra | 00:48 | |
*** rlandy has quit IRC | 00:59 | |
*** jcoufal has quit IRC | 01:00 | |
*** sthussey has quit IRC | 01:01 | |
clarkb | I've launched clarkb-test nodes in vexxhost sjc1 and ovh bhs1 (and working on gra1). So far I've run sysbench on bhs1 and sjc1 and they look really similar | 01:05 |
clarkb | both about 40MB/sec and 2600 requests per second | 01:06 |
clarkb | makes me wonder if it's a bad hypervisor or a misconfigured hypervisor | 01:06 |
clarkb | so some subset of the jobs hit it there | 01:06 |
clarkb | ansible facts don't seem to capture if we are emulated or virt | 01:08 |
clarkb | but the machines I've logged into are definitely kvm according to systemd-detect-virt | 01:08 |
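A minimal way to confirm that from inside a guest (the checks clarkb is referring to; a sketch):

```sh
# Prints "kvm" for hardware-accelerated guests and "qemu" for plain TCG emulation.
systemd-detect-virt
# The hypervisor cpu flag is another quick signal that we are in a guest at all.
grep -c hypervisor /proc/cpuinfo
```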
clarkb | amorin: fungi: d73a3a4c-b31c-41a7-b8a3-226cdae0d558 and 7bb83b73-e279-4c85-af31-f7870e2714fc are two bhs1 instances that exhibited the weird slowness. Maybe we can zero in on the hypervisor(s) that ran those and see if there is something wrong there? virt not enabled (so using qemu emulation), slow disk, unhappy disk, etc | 01:11 |
clarkb | ba25071f-edaf-417d-8564-fab75676116e is my test instance that seems to be fine in that region if you need to compare (or check if it is running on the same hypervisor) | 01:12 |
*** jamesmcarthur has joined #openstack-infra | 01:13 | |
clarkb | gra1 actually has much slower random reads and writes to disk according to sysbench. About 1/10 the others | 01:15 |
clarkb | little more than that, but not much | 01:15 |
clarkb | but we don't see the issues in gra1 right? | 01:15 |
clarkb | and now I must go sort out dinner, fungi ^ you may want to try and duplicate my results. Feel free to use the clarkb-test instances in those regions if you do so | 01:16 |
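A hedged sketch of the kind of sysbench comparison described above (assuming the fileio test in random read/write mode; the exact invocation isn't shown in the log):

```sh
# Older sysbench releases want "--test=fileio" instead of the bare test name.
sysbench fileio --file-total-size=8G prepare
sysbench fileio --file-total-size=8G --file-test-mode=rndrw run
sysbench fileio --file-total-size=8G cleanup
```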
*** wolverineav has quit IRC | 01:24 | |
fungi | yeah, at least mining logstash for job timeouts i was seeing disproportionally far more in bhs1 than graphene | 01:24 |
fungi | er, than gra1 | 01:25 |
fungi | sorry graphene, my tab key is next to my 1 | 01:25 |
*** wolverineav has joined #openstack-infra | 01:25 | |
pabelanger | yes, I can confirm, IO heavy jobs on ovh-bhs1 are timing out here | 01:26 |
*** wolverineav has quit IRC | 01:30 | |
openstackgerrit | Merged openstack/os-testr master: Update the home-page URL https://review.openstack.org/622427 | 01:31 |
*** wolverineav has joined #openstack-infra | 01:31 | |
*** studarus has joined #openstack-infra | 01:34 | |
*** bobh has joined #openstack-infra | 01:35 | |
*** hongbin has joined #openstack-infra | 01:36 | |
fungi | maybe we let amorin reproduce/investigate with load on it | 01:36 |
mordred | clarkb: there's some stuff that could be done like eatmydata - also some my.cnf settings to turn down data durability - although it'll ultimately only usually affect fsync - if we're saturating throughput it wouldn't help a ton | 01:38 |
mordred | but it's totally worth trying eatmydata | 01:38 |
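A rough sketch of the two approaches mordred mentions, for illustration only (the config path, settings and wrapped command are assumptions, not what devstack actually configures):

```sh
# Option 1: eatmydata turns fsync()/fdatasync() into no-ops for whatever it wraps.
sudo apt-get install -y eatmydata
eatmydata some-io-heavy-command   # hypothetical command; wrapping mysqld itself
                                  # would mean prepending it to the service's ExecStart

# Option 2: relax mysql durability instead of disabling fsync entirely.
sudo tee /etc/mysql/conf.d/low-durability.cnf <<'EOF'
[mysqld]
innodb_flush_log_at_trx_commit = 2
sync_binlog = 0
EOF
sudo systemctl restart mysql
```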
*** jamesdenton has joined #openstack-infra | 01:46 | |
*** witek has quit IRC | 02:00 | |
*** witek has joined #openstack-infra | 02:00 | |
*** jamesmcarthur has quit IRC | 02:03 | |
*** sthussey has joined #openstack-infra | 02:04 | |
*** wolverineav has quit IRC | 02:07 | |
*** dklyle has quit IRC | 02:12 | |
*** dklyle has joined #openstack-infra | 02:13 | |
*** mrsoul has joined #openstack-infra | 02:14 | |
*** wolverineav has joined #openstack-infra | 02:14 | |
openstackgerrit | MarcH proposed openstack-infra/git-review master: tests/__init__.py: ssh-keygen -m PEM for bouncycastle https://review.openstack.org/622636 | 02:18 |
clarkb | fungi pabelanger more and more I'm suspecting a specific hypervisor | 02:20 |
clarkb | since sjc1 and bhs1 are basically the same for io in limited testing | 02:20 |
*** wolverineav has quit IRC | 02:21 | |
*** eernst has joined #openstack-infra | 02:25 | |
*** bobh has quit IRC | 02:33 | |
*** studarus has quit IRC | 02:35 | |
*** larainema has joined #openstack-infra | 02:37 | |
mnaser | fyi we do throttle iops per gb | 02:39 |
mnaser | 30 iops per gb so @ 80 gb volumes => 240 iops | 02:40 |
mnaser | er | 02:40 |
mnaser | 2400. | 02:40 |
mnaser | and 0.5MB/s per GB for SSD volumes, hence the 40MB/s | 02:41 |
mnaser | (nice to know our qos works well, lol) | 02:41 |
pabelanger | clarkb: I guess there is no way to track the hypervisor from guest OS | 02:43 |
*** psachin has joined #openstack-infra | 02:44 | |
mnaser | pabelanger, clarkb: https://review.openstack.org/#/c/577933/ there is and i tried :D | 02:50 |
mnaser | but you can check nova show and look at hostId from the API | 02:50 |
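For reference, reading that field via the CLI is a one-liner (a sketch; needs credentials for the tenant that owns the instance):

```sh
# hostId is an opaque per-tenant hash of the hypervisor, enough to tell whether
# two instances landed on the same host without revealing the host's name.
openstack server show <instance-uuid> -f value -c hostId
```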
*** imacdonn has quit IRC | 02:52 | |
*** imacdonn has joined #openstack-infra | 02:53 | |
*** jd_ has quit IRC | 02:54 | |
*** bhavikdbavishi has joined #openstack-infra | 02:56 | |
*** bhavikdbavishi has quit IRC | 03:01 | |
*** jd_ has joined #openstack-infra | 03:02 | |
*** hongbin has quit IRC | 03:07 | |
*** auristor has quit IRC | 03:08 | |
*** eernst has quit IRC | 03:09 | |
*** eernst has joined #openstack-infra | 03:09 | |
clarkb | ya unfortunately the nodes are gone by the time we can look | 03:11 |
clarkb | so either we log that with nodepool or ask the cloud | 03:12 |
openstackgerrit | melissaml proposed openstack/os-testr master: Change openstack-dev to openstack-discuss https://review.openstack.org/622698 | 03:14 |
*** apetrich has quit IRC | 03:15 | |
*** bobh has joined #openstack-infra | 03:18 | |
*** bhavikdbavishi has joined #openstack-infra | 03:18 | |
*** ykarel|away has joined #openstack-infra | 03:21 | |
*** graphene has quit IRC | 03:23 | |
*** ramishra has joined #openstack-infra | 03:23 | |
*** agopi has joined #openstack-infra | 03:23 | |
pabelanger | mnaser: it seems 577933 might just need documentation updates at this point, but not sure what other reviewers say | 03:32 |
*** auristor has joined #openstack-infra | 03:32 | |
pabelanger | agree it would be helpful from job POV to collect that info | 03:32 |
*** bobh has quit IRC | 03:36 | |
*** yamamoto has joined #openstack-infra | 03:38 | |
*** yamamoto has quit IRC | 03:42 | |
*** bobh has joined #openstack-infra | 03:45 | |
*** jamesmcarthur has joined #openstack-infra | 04:03 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool master: Set type for error'ed instances https://review.openstack.org/622101 | 04:04 |
*** bobh has quit IRC | 04:04 | |
*** wolverineav has joined #openstack-infra | 04:08 | |
*** jamesmcarthur has quit IRC | 04:08 | |
*** wolverineav has quit IRC | 04:12 | |
*** wolverineav has joined #openstack-infra | 04:19 | |
*** hongbin has joined #openstack-infra | 04:24 | |
*** eernst has quit IRC | 04:26 | |
*** yamamoto has joined #openstack-infra | 04:32 | |
*** wolverineav has quit IRC | 04:37 | |
*** janki has joined #openstack-infra | 04:40 | |
*** sthussey has quit IRC | 04:51 | |
*** mordred has quit IRC | 04:57 | |
*** mordred has joined #openstack-infra | 04:57 | |
*** hongbin has quit IRC | 05:19 | |
*** yamamoto has quit IRC | 05:34 | |
openstackgerrit | Vieri proposed openstack/gertty master: Change openstack-dev to openstack-discuss https://review.openstack.org/622850 | 05:37 |
*** ahosam has joined #openstack-infra | 05:37 | |
*** stevebaker has quit IRC | 05:45 | |
*** dmellado has quit IRC | 05:46 | |
*** gouthamr has quit IRC | 05:46 | |
openstackgerrit | Merged openstack-infra/nodepool master: Add cleanup routine to delete empty nodes https://review.openstack.org/622616 | 05:47 |
*** dmellado has joined #openstack-infra | 05:48 | |
*** diablo_rojo has quit IRC | 05:51 | |
*** stevebaker has joined #openstack-infra | 05:51 | |
*** gouthamr has joined #openstack-infra | 05:53 | |
*** yamamoto has joined #openstack-infra | 05:54 | |
*** dmellado has quit IRC | 05:55 | |
*** wolverineav has joined #openstack-infra | 05:56 | |
*** dmellado has joined #openstack-infra | 05:57 | |
*** stevebaker has quit IRC | 05:59 | |
*** wolverineav has quit IRC | 06:00 | |
*** dmellado has quit IRC | 06:02 | |
*** stevebaker has joined #openstack-infra | 06:06 | |
*** dmellado has joined #openstack-infra | 06:14 | |
*** gouthamr has quit IRC | 06:15 | |
*** stevebaker has quit IRC | 06:15 | |
*** gouthamr has joined #openstack-infra | 06:18 | |
*** stevebaker has joined #openstack-infra | 06:19 | |
*** ykarel|away has quit IRC | 06:21 | |
*** diablo_rojo has joined #openstack-infra | 06:24 | |
*** stevebaker has quit IRC | 06:25 | |
*** gouthamr has quit IRC | 06:29 | |
*** stevebaker has joined #openstack-infra | 06:29 | |
*** gouthamr has joined #openstack-infra | 06:32 | |
*** stevebaker has quit IRC | 06:41 | |
*** stevebaker has joined #openstack-infra | 06:42 | |
*** ykarel|away has joined #openstack-infra | 06:45 | |
*** stevebaker has quit IRC | 06:51 | |
*** ykarel|away is now known as ykarel | 06:54 | |
*** stevebaker has joined #openstack-infra | 06:55 | |
*** kjackal has joined #openstack-infra | 06:57 | |
*** gouthamr has quit IRC | 06:58 | |
*** ralonsoh has joined #openstack-infra | 06:58 | |
*** stevebaker has quit IRC | 07:03 | |
*** stevebaker has joined #openstack-infra | 07:06 | |
*** gouthamr has joined #openstack-infra | 07:07 | |
*** bhavikdbavishi has quit IRC | 07:09 | |
*** bhavikdbavishi1 has joined #openstack-infra | 07:09 | |
*** pcaruana has joined #openstack-infra | 07:10 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 07:12 | |
*** ahosam has quit IRC | 07:15 | |
*** stevebaker has quit IRC | 07:15 | |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Add default value for relative_priority https://review.openstack.org/622175 | 07:17 |
*** quiquell|off is now known as quiquell | 07:17 | |
*** stevebaker has joined #openstack-infra | 07:19 | |
*** takamatsu has joined #openstack-infra | 07:21 | |
*** stevebaker has quit IRC | 07:27 | |
*** florianf has joined #openstack-infra | 07:27 | |
*** stevebaker has joined #openstack-infra | 07:30 | |
*** yboaron has joined #openstack-infra | 07:33 | |
*** stevebaker has quit IRC | 07:37 | |
*** stevebaker has joined #openstack-infra | 07:40 | |
*** dpawlik has joined #openstack-infra | 07:40 | |
*** gouthamr has quit IRC | 07:40 | |
*** apetrich has joined #openstack-infra | 07:42 | |
*** gouthamr has joined #openstack-infra | 07:42 | |
*** kjackal has quit IRC | 07:45 | |
*** kjackal has joined #openstack-infra | 07:45 | |
*** stevebaker has quit IRC | 07:46 | |
*** slaweq has joined #openstack-infra | 07:48 | |
*** quiquell is now known as quiquell|brb | 07:48 | |
*** stevebaker has joined #openstack-infra | 07:50 | |
amorin | mordred: fungi clarkb I am on the host that was hosting d73a3a4c-b31c-41a7-b8a3-226cdae0d558 and 7bb83b73-e279-4c85-af31-f7870e2714fc | 07:54 |
amorin | it seems to be slower than others | 07:54 |
amorin | dd zero on host itself is slower than on another | 07:54 |
amorin | but there are some instances on it | 07:54 |
amorin | I will disable it to test without any load | 07:54 |
*** stevebaker has quit IRC | 07:55 | |
*** stevebaker has joined #openstack-infra | 08:00 | |
*** ykarel is now known as ykarel|lunch | 08:00 | |
openstackgerrit | Arnaud Morin proposed openstack-infra/project-config master: Reduce a little number of instances on BHS1 https://review.openstack.org/622876 | 08:02 |
*** ahosam has joined #openstack-infra | 08:02 | |
*** ginopc has joined #openstack-infra | 08:03 | |
*** e0ne has joined #openstack-infra | 08:05 | |
*** stevebaker has quit IRC | 08:05 | |
AJaeger | amorin: do you need this quickly? ^ | 08:07 |
AJaeger | infra-root, ^ | 08:08 |
*** stevebaker has joined #openstack-infra | 08:09 | |
amorin | AJaeger: nop | 08:11 |
amorin | it can wait until the afternoon | 08:11 |
amorin | (I am in europe tz) | 08:11 |
amorin | I know that most of the guys are in US | 08:11 |
amorin | it can wait | 08:11 |
frickler | amorin: just an idea: are you using deadline of cfq scheduler on the hypervisors? we found that cfq can cause issues in an iops throttled environment and that effect increased a lot from 4.13 to 4.15 kernel | 08:14 |
frickler | s/of/or/ | 08:14 |
amorin | I have no idea, but I can check | 08:14 |
*** stevebaker has quit IRC | 08:17 | |
*** florianf has quit IRC | 08:19 | |
*** priteau has joined #openstack-infra | 08:21 | |
*** stevebaker has joined #openstack-infra | 08:21 | |
amorin | frickler: we are using cfq on hypervisors | 08:21 |
*** quiquell|brb is now known as quiquell | 08:22 | |
frickler | amorin: o.k., so for our setup, we resolved the issue by changing to deadline. | 08:25 |
amorin | I was told that for SSD disks, it's better to use noop instead | 08:25 |
amorin | but I'm not an expert on that part | 08:26 |
amorin | I'll see if I can apply that on the whole aggregate for the OSF | 08:26 |
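For context, checking and switching the scheduler on a given block device is straightforward (a sketch; the change is not persistent across reboots unless also set in the host's boot or udev configuration):

```sh
# The bracketed entry is the active scheduler.
cat /sys/block/sda/queue/scheduler
echo deadline | sudo tee /sys/block/sda/queue/scheduler
```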
frickler | amorin: thx, I'll be offline for a bit, will check back later. your node reduction should merge any moment | 08:28 |
openstackgerrit | Merged openstack-infra/project-config master: Reduce a little number of instances on BHS1 https://review.openstack.org/622876 | 08:30 |
*** stevebaker has quit IRC | 08:30 | |
*** stevebaker has joined #openstack-infra | 08:31 | |
*** gfidente has joined #openstack-infra | 08:31 | |
*** hrubi has joined #openstack-infra | 08:33 | |
*** bhavikdbavishi has quit IRC | 08:33 | |
amorin | ok | 08:33 |
*** jpena|off is now known as jpena | 08:39 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Set type for error'ed instances https://review.openstack.org/622101 | 08:44 |
*** florianf_ has joined #openstack-infra | 08:44 | |
*** tosky has joined #openstack-infra | 08:50 | |
*** fresta has quit IRC | 08:50 | |
*** fresta has joined #openstack-infra | 08:51 | |
openstackgerrit | Natal Ngétal proposed openstack/diskimage-builder master: [Core] Change openstack-dev to openstack-discuss. https://review.openstack.org/622895 | 08:52 |
*** aojea has joined #openstack-infra | 08:54 | |
*** ykarel|lunch is now known as ykarel | 09:00 | |
*** shardy has joined #openstack-infra | 09:00 | |
*** verdurin has quit IRC | 09:04 | |
*** dpawlik has quit IRC | 09:04 | |
*** jpich has joined #openstack-infra | 09:05 | |
*** wolverineav has joined #openstack-infra | 09:05 | |
*** dpawlik has joined #openstack-infra | 09:05 | |
*** bhavikdbavishi has joined #openstack-infra | 09:06 | |
*** bhavikdbavishi has quit IRC | 09:07 | |
*** bhavikdbavishi has joined #openstack-infra | 09:07 | |
*** verdurin has joined #openstack-infra | 09:07 | |
*** wolverineav has quit IRC | 09:09 | |
*** ccamacho has joined #openstack-infra | 09:20 | |
*** zhangfei has joined #openstack-infra | 09:21 | |
*** xek has joined #openstack-infra | 09:21 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Set type for error'ed instances https://review.openstack.org/622101 | 09:30 |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Make estimatedNodepoolQuotaUsed more resilient https://review.openstack.org/622906 | 09:30 |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Make estimatedNodepoolQuotaUsed more resilient https://review.openstack.org/622906 | 09:32 |
*** zhangfei has quit IRC | 09:34 | |
*** rcernin has quit IRC | 09:38 | |
*** agopi is now known as agopi|brb | 09:40 | |
*** bhavikdbavishi has quit IRC | 09:40 | |
chandan_kumar | hello | 09:44 |
chandan_kumar | Is there a way to set basepython=python3 when a zuul job's run or post_run is called to run an ansible playbook? | 09:45 |
*** larainema has quit IRC | 09:47 | |
*** zhangfei has joined #openstack-infra | 09:47 | |
*** derekh has joined #openstack-infra | 09:49 | |
*** witek has quit IRC | 09:50 | |
*** zhangfei has quit IRC | 09:50 | |
*** zhangfei has joined #openstack-infra | 10:09 | |
*** diablo_rojo has quit IRC | 10:12 | |
amorin | AJaeger frickler, I configured the whole OSF aggregate to deadline instead of CFQ on hypervisors in BHS1, we'll see if it's giving better results on your side | 10:17 |
*** ahosam has quit IRC | 10:20 | |
*** quiquell is now known as quiquell|brb | 10:21 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: update status page layout based on screen size https://review.openstack.org/622010 | 10:22 |
*** agopi|brb is now known as agopi | 10:29 | |
sshnaidm | fungi, clarkb not sure what happened, but my email in gerrit turned out to be "unverified" and I can't submit patches | 10:42 |
*** alexchadin has joined #openstack-infra | 10:49 | |
*** dtantsur|afk is now known as dtantsur | 10:50 | |
*** quiquell|brb is now known as quiquell | 10:52 | |
stephenfin | Which repo configures the projects that the openstackgerrit IRC bot monitors? | 10:55 |
*** graphene has joined #openstack-infra | 10:57 | |
openstackgerrit | Merged openstack-infra/git-review master: Use six for cross python compatibility https://review.openstack.org/616688 | 10:57 |
*** bhavikdbavishi has joined #openstack-infra | 10:58 | |
*** bhavikdbavishi has quit IRC | 11:09 | |
*** bhavikdbavishi has joined #openstack-infra | 11:09 | |
*** xek has quit IRC | 11:12 | |
openstackgerrit | Merged openstack-infra/git-review master: Avoid UnicodeEncodeError on python 2 https://review.openstack.org/583535 | 11:15 |
*** bhavikdbavishi has quit IRC | 11:16 | |
*** bhavikdbavishi has joined #openstack-infra | 11:16 | |
*** bhavikdbavishi has quit IRC | 11:20 | |
*** zhangfei has quit IRC | 11:22 | |
*** xek has joined #openstack-infra | 11:26 | |
cmurphy | stephenfin: http://git.openstack.org/cgit/openstack-infra/project-config/tree/gerritbot/channels.yaml | 11:27 |
stephenfin | ta | 11:27 |
*** tpsilva has joined #openstack-infra | 11:31 | |
*** xek has quit IRC | 11:38 | |
fungi | sshnaidm: when was the last time it worked? could you have switched accounts when we deactivated one of your duplicate gerrit accounts a couple of weeks ago? | 11:50 |
*** ahosam has joined #openstack-infra | 11:50 | |
fungi | sshnaidm: is it your pullusum@ or einarum@ address? | 11:53 |
*** yamamoto has quit IRC | 11:53 | |
*** yamamoto has joined #openstack-infra | 11:53 | |
*** yamamoto has quit IRC | 11:57 | |
*** bhavikdbavishi has joined #openstack-infra | 12:02 | |
openstackgerrit | Merged openstack-infra/zuul master: Fix "reverse" Depends-On detection with new Gerrit URL schema https://review.openstack.org/620838 | 12:04 |
sshnaidm | fungi, it started today; yesterday I submitted patches | 12:04 |
sshnaidm | fungi, I had another email there "sshnaidm@redhat.com" and set it as "preferred" | 12:05 |
alexchadin | hi, where can I find some grenade job examples which were adapted for zuul? | 12:05 |
openstackgerrit | Brendan proposed openstack-infra/zuul master: Fix urllib imports in Gerrit HTTP form auth code https://review.openstack.org/622942 | 12:05 |
sshnaidm | fungi, and now seems like gerrit removed it, I'm trying to add it again, but no verification mail so far.. | 12:05 |
fungi | sshnaidm: indeed, i can't find that address associated with any account in gerrit's database | 12:06 |
fungi | i'll see if i can tell when it sent verification e-mails for it | 12:06 |
fungi | sshnaidm: i see it sent messages to that address as recently as 11:00:48 utc, so barely an hour ago and at least 15 minutes after you pinged me in here | 12:11 |
fungi | was that when you attempted to re-add the address? | 12:11 |
*** pbourke has quit IRC | 12:14 | |
*** kashyap has joined #openstack-infra | 12:15 | |
sshnaidm | fungi, yeah, I think so | 12:25 |
*** jpena is now known as jpena|lunch | 12:25 | |
*** ahosam has quit IRC | 12:25 | |
sshnaidm | fungi, found this mail finally | 12:26 |
sshnaidm | fungi, yay! can submit patches again :) | 12:27 |
*** e0ne has quit IRC | 12:27 | |
*** e0ne has joined #openstack-infra | 12:29 | |
fungi | sshnaidm: great! glad nothing seems to be broken on our end anyway | 12:39 |
*** pbourke has joined #openstack-infra | 12:40 | |
kashyap | [OT] Some folks here might appreciate this: https://www.qemu-advent-calendar.org/ | 12:40 |
kashyap | If you're wondering WTH it is: | 12:41 |
kashyap | [quote] | 12:41 |
kashyap | The QEMU Advent Calendar 2018 features a QEMU disk image each day of December until Christmas. Each day a new package becomes available for download [...] The disk images contain interesting operating systems and software that run under the QEMU emulator. Some of them are well-known or not-so-well-known operating systems, old and new, others are custom demos and neat algorithms. | 12:41 |
kashyap | [/quote] | 12:41 |
fungi | i recall they've done that in previous years too | 12:41 |
fungi | it's neat | 12:41 |
kashyap | fungi: We didn't do it last year :-) | 12:41 |
kashyap | Last time was in 2016. (And before that was in 2014) | 12:41 |
fungi | in even years then? ;) | 12:42 |
kashyap | Heh, yeah | 12:45 |
kashyap | It's just too much damn work. | 12:45 |
kashyap | This year, I only part-volunteered; 2016, I spent lot more time on it. | 12:46 |
*** ramishra has quit IRC | 12:51 | |
*** yamamoto has joined #openstack-infra | 12:54 | |
*** kjackal has quit IRC | 12:54 | |
*** kjackal has joined #openstack-infra | 12:55 | |
*** bobh has joined #openstack-infra | 12:58 | |
*** kjackal has quit IRC | 12:59 | |
*** aojea has quit IRC | 13:01 | |
*** ykarel is now known as ykarel|afk | 13:03 | |
*** kjackal has joined #openstack-infra | 13:04 | |
*** ramishra has joined #openstack-infra | 13:05 | |
*** yamamoto has quit IRC | 13:06 | |
*** janki has quit IRC | 13:08 | |
*** ykarel|afk has quit IRC | 13:10 | |
*** bobh has quit IRC | 13:22 | |
*** udesale has joined #openstack-infra | 13:26 | |
*** jpena|lunch is now known as jpena | 13:33 | |
*** ykarel|afk has joined #openstack-infra | 13:35 | |
*** rlandy has joined #openstack-infra | 13:36 | |
*** ykarel|afk is now known as ykarel | 13:36 | |
*** dtantsur is now known as dtantsur|brb | 13:39 | |
mordred | morning fungi - how's things? | 13:40 |
*** dpawlik has quit IRC | 13:41 | |
*** dpawlik has joined #openstack-infra | 13:44 | |
fungi | i have contractors ripping out and rebuilding my previously-flooded downstairs entry (finally) | 13:44 |
fungi | it's sort of like my usual industrial music but with less shouting | 13:44 |
*** jcoufal has joined #openstack-infra | 13:44 | |
mordred | hrm. maybe you could figure out some way to get the contractors to yell more? | 13:45 |
*** agopi has quit IRC | 13:45 | |
fungi | i could flip this breaker back on, i suppose | 13:45 |
*** jcoufal_ has joined #openstack-infra | 13:46 | |
openstackgerrit | Monty Taylor proposed openstack-infra/system-config master: Add a script to generate the static inventory https://review.openstack.org/622964 | 13:47 |
mordred | fungi: that might just result in muffled thuds and then less noise | 13:48 |
fungi | fair point | 13:48 |
*** jcoufal has quit IRC | 13:48 | |
mordred | fungi: there's a stab at an inventory generation script. in writing the commit message for it, it occurred to me that in most cases it should be completely unneeded, as all somebody needs to do is add some ips to a yaml file after running launch-node | 13:49 |
*** dpawlik has quit IRC | 13:49 | |
mordred | fungi: maybe we should just have launch-node print a little yaml snippet that could be copy-pasted into the inventory file? | 13:49 |
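Something along these lines is what's being suggested launch-node could emit at the end of a run (a hypothetical snippet; the real inventory layout in system-config may differ):

```sh
# Hypothetical tail of launch-node output: a ready-to-paste inventory entry.
cat <<EOF
Add the following to the static inventory:

  new-server01.openstack.org:
    ansible_host: 203.0.113.10
EOF
```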
*** agopi has joined #openstack-infra | 13:49 | |
*** priteau has quit IRC | 13:50 | |
*** dpawlik has joined #openstack-infra | 13:50 | |
*** mriedem has joined #openstack-infra | 13:52 | |
fungi | not a bad idea. we do something similar for dns currently and will likely be needing to print a snippet to add to a commit to one of our zone repos soon | 13:53 |
fungi | maybe both can be wrapped up together | 13:54 |
fungi | (i mean, snippets for two commits since they go in different repos, but the same routine could spit out both) | 13:54 |
*** jamesmcarthur has joined #openstack-infra | 13:55 | |
fungi | that also gives us a start on having automation directly propose those patches in the future, should we wish it | 13:55 |
*** sthussey has joined #openstack-infra | 14:00 | |
*** kgiusti has joined #openstack-infra | 14:03 | |
mordred | ++ | 14:04 |
*** priteau has joined #openstack-infra | 14:05 | |
*** florianf_ is now known as florianf | 14:05 | |
openstackgerrit | Matt Riedemann proposed openstack-infra/elastic-recheck master: Add query for n-api/g-api startup timeout bug 1806912 https://review.openstack.org/622966 | 14:06 |
openstack | bug 1806912 in OpenStack-Gate "devstack timeout because n-api/g-api takes longer than 60 seconds to start" [Undecided,Confirmed] https://launchpad.net/bugs/1806912 | 14:06 |
*** sshnaidm has quit IRC | 14:06 | |
*** quiquell is now known as quiquell|off | 14:11 | |
*** sshnaidm has joined #openstack-infra | 14:14 | |
mordred | infra-root: I'm afk for the next few hours - giving a talk about zuul today | 14:24 |
fungi | mordred: ooh! g'luck! | 14:25 |
pabelanger | +1 | 14:25 |
mordred | thanks! this one will be fun - these humans have zero background in openstack at all, so it's a complete blank canvas (I'm sure I'll be completely forgetting some important context :) ) | 14:26 |
*** psachin has quit IRC | 14:27 | |
fungi | they have a background in ci systems at least? | 14:27 |
mordred | who knows! | 14:27 |
fungi | sounds like it'll be a blast | 14:28 |
fungi | "openstack: sort of like aws without selling your soul" | 14:29 |
mordred | wait - I didn't have to sell my soul? | 14:29 |
mordred | I knew I was doing something wrong | 14:29 |
fungi | i can get you a receipt | 14:29 |
*** janki has joined #openstack-infra | 14:33 | |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Add query for n-api/g-api startup timeout bug 1806912 https://review.openstack.org/622966 | 14:38 |
openstack | bug 1806912 in OpenStack-Gate "devstack timeout because n-api/g-api takes longer than 60 seconds to start" [Undecided,Confirmed] https://launchpad.net/bugs/1806912 | 14:38 |
mriedem | ovh-bhs1 nodes must be slow | 14:39 |
mriedem | seeing lots of slow-node related timeout failures in e-r on those nodes | 14:39 |
*** jamesmcarthur has quit IRC | 14:45 | |
*** boden has joined #openstack-infra | 14:48 | |
*** eharney has joined #openstack-infra | 14:48 | |
fungi | mriedem: yes, we think it's disk write performance, amorin is attempting to troubleshoot | 14:50 |
mriedem | ok | 14:50 |
fungi | it may be just some of the hosts in our dedicated aggregate in that region | 14:50 |
fungi | which is making it tough to pin down | 14:51 |
pabelanger | mriedem: fungi: last evening mnaser linked https://review.openstack.org/577933/ as maybe a way we could help track hostid from the guest VM when clarkb was trying to get more info from jobs. | 14:53 |
openstackgerrit | Stephen Finucane proposed openstack-infra/project-config master: Remove openstack/osc-placement from #openstack-nova https://review.openstack.org/622987 | 14:54 |
mriedem | pabelanger: no updates since june so i forgot about that one | 14:55 |
*** anteaya has joined #openstack-infra | 14:55 | |
mriedem | probably needs a helping hand | 14:55 |
fungi | yeah, with that we could collect the hostid along with other instance metadata and even expose it as a column in logstash/kibana once our providers start supporting it | 14:56 |
pabelanger | yah, would defer to mnaser, but looked like just needs doc updates | 14:56 |
*** zul has quit IRC | 14:57 | |
*** sshnaidm has quit IRC | 14:57 | |
fungi | getting nodepool to request it from the api and pass that all the way through zuul to ansible somehow is probably possible but seems like it would be rather complicated to instrument | 14:57 |
*** quiquell|off has quit IRC | 14:59 | |
pabelanger | fungi: interesting idea, maybe something to ask about in #zuul. | 14:59 |
openstackgerrit | Merged openstack-infra/system-config master: Retire the interop-wg mailing list https://review.openstack.org/619056 | 15:10 |
*** lpetrut has joined #openstack-infra | 15:12 | |
*** dtantsur|brb is now known as dtantsur | 15:13 | |
*** sshnaidm has joined #openstack-infra | 15:18 | |
*** hwoarang has joined #openstack-infra | 15:27 | |
logan- | reviews on https://review.openstack.org/#/q/starredby:logan2211%2540gmail.com+status:open+project:%255Eopenstack/openstack-ansible-.* will be appreciated | 15:30 |
logan- | er wrong channel, sorry | 15:30 |
*** jamesmcarthur has joined #openstack-infra | 15:33 | |
corvus | fungi, pabelanger, mnaser: looks like we get the hostId from nova already in nodepool, we just don't do anything with it; it would be pretty easy to plumb that through to zuul | 15:34 |
fungi | oh, really? | 15:34 |
openstackgerrit | melissaml proposed openstack/ansible-role-cloud-launcher master: Change openstack-dev to openstack-discuss https://review.openstack.org/623008 | 15:34 |
*** zul has joined #openstack-infra | 15:34 | |
fungi | corvus: like, by returning it through gearman? | 15:35 |
corvus | fungi: yeah, i *think* it should be there when we're done with server creation | 15:35 |
*** jamesmcarthur has quit IRC | 15:35 | |
corvus | fungi: via zookeeper (to scheduler) and gearman (to executor), yes | 15:35 |
*** jamesmcarthur has joined #openstack-infra | 15:35 | |
*** sshnaidm is now known as sshnaidm|afk | 15:36 | |
fungi | oh, right, i keep forgetting launcher<->scheduler communication is zk | 15:36 |
fungi | your definition of "pretty easy" differs a bit from mine ;) | 15:37 |
*** jesusaur has quit IRC | 15:37 | |
corvus | fungi: heh, it's 2 changes to 3 components, but it's basically just adding data to structures that already exist | 15:37 |
*** kashyap has left #openstack-infra | 15:38 | |
pabelanger | cool, so patches welcome then :) | 15:40 |
pabelanger | might look at it more this afternoon | 15:41 |
Linkid | hi | 15:41 |
*** jesusaur has joined #openstack-infra | 15:42 | |
Linkid | I don't know if I'm on the right channel | 15:42 |
Linkid | I would like some info about resources available at the OSF | 15:42 |
Linkid | like disk space | 15:43 |
Linkid | because I would like to suggest a tool for hosting sonething and I don't know if it is possible | 15:43 |
Linkid | (before I suggest something impossible) | 15:44 |
Linkid | *something | 15:44 |
fungi | Linkid: the openstack foundation doesn't maintain community services, but you've found the channel for a community of people who collaborate on providing services to people who work on building free/libre open source projects | 15:45 |
fungi | er, building services for | 15:46 |
fungi | Linkid: what tool are you thinking about? | 15:46 |
Linkid | fungi: I was thinking about installing peertube to host OpenStack Summit (among others) videos (in addition to Youtube) | 15:48 |
*** lpetrut has quit IRC | 15:49 | |
Linkid | and I would be happy to help :) | 15:49 |
openstackgerrit | Stephen Finucane proposed openstack-infra/project-config master: Add openstack/os-api-ref to #openstack-doc https://review.openstack.org/623013 | 15:49 |
Linkid | https://joinpeertube.org | 15:49 |
fungi | an interesting idea. i wasn't aware of peertube until just now | 15:50 |
fungi | curious what drove them to choose the agpl | 15:50 |
Linkid | this is a libre project maintained by a French orga, and it is a really good alternative to youtube :) | 15:50 |
fungi | looks like last year they switched their codebase from gplv3 to agplv3 but the commit doesn't indicate why | 15:51 |
Linkid | If there are enough resources, I would be happy to install it and to make scripts to add video metadata to it | 15:51 |
Linkid | they made a presentation at FOSDEM 2018, if you want | 15:52 |
Linkid | (but it was in french, if I remember it well) | 15:52 |
fungi | i think we'd likely need to get some sort of copyright agreement in place with the openstack foundation since i'm not sure what the copyright situation is with the summit recordings | 15:52 |
Linkid | ah | 15:53 |
fungi | but this is compelling since i discovered not long ago that people in mainland china can't watch our conference session recordings on youtube so there's already a need to host those videos in multiple places | 15:53 |
Linkid | I thought summit recordings were the property of the foundation (or CC0) | 15:53 |
Linkid | :) | 15:54 |
fungi | not to mention, it's always rubbed me the wrong way that we espouse free software at out conferences but then expect people who want to watch recordings of the sessions from them to do so through proprietary video hosting services | 15:55 |
fungi | er, at our conferences | 15:55 |
corvus | this sounds like a great idea; i agree we should clarify the licensing status (i really hope they are CC licensed, and if they aren't we should see about getting that changed) | 15:56 |
Linkid | fungi: yes, Framasoft (the orga which promotes peertube) thought the same ^^ | 15:57 |
Linkid | (about other conferences) | 15:57 |
corvus | it can be run in docker: https://github.com/Chocobozzz/PeerTube/blob/develop/support/doc/docker.md | 15:57 |
Linkid | do you want me to write a mail for the suggestion ? | 15:57 |
Linkid | corvus: yep :) | 15:58 |
corvus | Linkid: assuming we get all the pre-requisites worked out, we're working on moving our infrastructure to be mostly driven by ansible + containers, so that's what most of the work to get it running would entail. | 15:58 |
fungi | Linkid: sure, starting a discussion on the openstack-infra@lists.openstack.org mailing list might be an easier way for us to also get osf staff who handle the event logistics and video production stuff involved | 15:59 |
*** jamesmcarthur has quit IRC | 15:59 | |
corvus | maybe this could be an opendev branded service? we could probably offload the current static video hosting we have for zuul-ci.org to it. | 16:00 |
pabelanger | TIL: https://joinpeertube.org/ looks very cool | 16:00 |
pabelanger | corvus: +1 | 16:00 |
Linkid | fungi: ok :). I'll do it in 2 hours, when I'll be able to write the mail, then :) | 16:01 |
fungi | Linkid: awesome, thanks! i'm digging into the copyright situation for summit session videos now, so hopefully will have a better answer about them by then | 16:01 |
*** slaweq has quit IRC | 16:02 | |
*** haleyb has joined #openstack-infra | 16:12 | |
openstackgerrit | Jeremy Stanley proposed openstack-infra/zuul master: Add instructions for reporting vulnerabilities https://review.openstack.org/554352 | 16:12 |
*** jamesmcarthur has joined #openstack-infra | 16:13 | |
*** graphene has quit IRC | 16:13 | |
anteaya | Linkid: are you affiliated with peertube? | 16:14 |
openstackgerrit | Merged openstack-infra/nodepool master: Set type for error'ed instances https://review.openstack.org/622101 | 16:14 |
openstackgerrit | Merged openstack-infra/nodepool master: Make estimatedNodepoolQuotaUsed more resilient https://review.openstack.org/622906 | 16:14 |
*** graphene has joined #openstack-infra | 16:15 | |
*** sshnaidm|afk has quit IRC | 16:15 | |
anteaya | Linkid: I'm wondering if the peertube folks want to stay with the use of the word 'spy' in these descriptions: https://framatube.org/about/peertube or would rather go with the word 'view'? | 16:15 |
*** jamesmcarthur has quit IRC | 16:18 | |
Linkid | anteaya: I have contacts with Framasoft (the orga which promotes peertube) | 16:18 |
Linkid | I could ask them, if you want | 16:18 |
*** pcaruana has quit IRC | 16:18 | |
fungi | i have a feeling the content on that page was translated, and the translator didn't realize that word could have negative connotations | 16:19 |
anteaya | fungi: that is my feeling as well | 16:20 |
*** bobh has joined #openstack-infra | 16:21 | |
fungi | i do wonder how well bittorrent protocol works through the great firewall of china (if at all) | 16:21 |
corvus | it's possible they used the word intentionally since it's describing a potentially privacy invading action -- learning which videos someone else is watching | 16:21 |
anteaya | Linkid: thank you, they can contact me in this channel or via pm using my nick or email me at mynick@mynick.info | 16:21 |
anteaya | corvus: agreed, I'm curious as well | 16:21 |
fungi | corvus: i thought that too at first, but had a hard time parsing the sentence to be sure | 16:21 |
anteaya | if it is intentional, then great, I'm just not sure | 16:22 |
anteaya | I do like their transparency | 16:22 |
corvus | the third use ("worst-case scenario") makes me lean towards the intentional interpretation | 16:22 |
corvus | anteaya: yes, they're very up-front about what it does and how it works | 16:23 |
fungi | yeah, if it's just missing some prepositions then i can see how it might have been intended that way | 16:23 |
anteaya | I like that, gives me a good feeling inside | 16:23 |
Linkid | yes, they do want transparency to show that this tool is great | 16:25 |
clarkb | cool those bhs1 test nodes were on the same hypervisor | 16:25 |
clarkb | fungi ^ any sense yet if amorin's changes have made bhs1 more reliable? | 16:25 |
clarkb | re videos confs like lca upload to youtube then separately host the videos in free format(s) on a browseable index | 16:26 |
fungi | clarkb: no, he's looking into potential performance issues in bhs1 again today | 16:26 |
clarkb | swift may make something like lca's setup easy for us | 16:26 |
Linkid | and they started a campaign some months ago to translate everything into english (and into other languages). Maybe there are still some mistakes | 16:26 |
clarkb | fungi ya looks like a scheduler change was applied | 16:27 |
clarkb | (reading sb) | 16:27 |
fungi | oh, right, i saw that. no idea but i'll take a dig into logstash | 16:28 |
openstackgerrit | Doug Hellmann proposed openstack-infra/project-config master: import git-os-job source repo https://review.openstack.org/623023 | 16:30 |
anteaya | Linkid: ah okay, thank you. I don't like to start off a relationship by assuming someone else is incorrect; I like to ask first, since sometimes I'm the one who has misunderstood | 16:30 |
*** janki has quit IRC | 16:30 | |
*** sshnaidm|afk has joined #openstack-infra | 16:30 | |
*** gyee has joined #openstack-infra | 16:34 | |
anteaya | looks like one thing framasoft does is aggregate free software tools, rebrand them as frama-* and host instances | 16:35 |
anteaya | their collaborative editing tool for instance is etherpad-lite | 16:36 |
anteaya | looks like an awesome service for end users looking to use free software via the browser | 16:37 |
anteaya | looks like their targets are schools and small businesses | 16:38 |
*** studarus has joined #openstack-infra | 16:39 | |
fungi | btw, i can't find content licensing or copyright details for summit session videos anywhere so i've asked some osf staff who ought to know (and am also recommending they publish information about that one way or another) | 16:41 |
anteaya | fungi: I'm surprised it hasn't come up before | 16:42 |
*** dpawlik has quit IRC | 16:42 | |
anteaya | Linkid: I'm looking for the code for the https://framacolibri.org/ service, so far I can't seem to find that | 16:42 |
anteaya | Linkid: it looks really awesome | 16:43 |
*** bobh has quit IRC | 16:43 | |
anteaya | so far it appears it is a combination of twitter, bug tracker and linkedin | 16:43 |
*** dpawlik has joined #openstack-infra | 16:44 | |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Query: [primary] Waiting for logger https://review.openstack.org/622210 | 16:45 |
*** udesale has quit IRC | 16:46 | |
*** udesale has joined #openstack-infra | 16:47 | |
ssbarnea|rover | mriedem: can you please help with open elastic-recheck CRs? https://review.openstack.org/#/q/project:openstack-infra/elastic-recheck+status:open | 16:51 |
ssbarnea|rover | clarkb: fungi : i found one log file from yesterday which I cannot search on logstash: http://logs.openstack.org/20/619520/12/check/tripleo-ci-centos-7-scenario004-standalone/0edbd40/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz#_2018-12-04_20_08_20 | 16:56 |
*** jamesmcarthur has joined #openstack-infra | 16:57 | |
ssbarnea|rover | i searched for that auth.docker.io error line and got zero hits, but based on the file name it should have been indexed, right? | 16:57 |
ssbarnea|rover | i did not see any post failure, which makes me believe it should have been found. | 16:58 |
fungi | clarkb: amorin: frickler: not a huge sample size (35 occurrences registered in the past 6 hours) but the proportion of job timeouts occurring in ovh-bhs1 is roughly 2x what a random distribution would give based on its proportion of overall quota | 16:58 |
fungi | which isn't great, but also not nearly so bad as what we were seeing last week | 16:58 |
*** priteau has quit IRC | 16:59 | |
fungi | ovh-bhs1 accounts for 15% of our quota and is where we saw 34% of timeouts occur in the past 6 hours | 17:00 |
*** bhavikdbavishi has quit IRC | 17:02 | |
*** kjackal has quit IRC | 17:02 | |
*** bhavikdbavishi has joined #openstack-infra | 17:02 | |
fungi | on the other hand, ovh-gra1 accounts for only 8% of our quota and is where 20% of the timeouts occurred over the past 6 hours, so bhs1 is actually doing considerably better than gra1 in that regard | 17:03 |
mriedem | ssbarnea|rover: looking | 17:03 |
fungi | it'll take a while to accumulate a large enough sample size to be sure this is representative though | 17:03 |
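The over-representation fungi describes comes from dividing each region's share of timeouts by its share of quota, using the figures quoted above:

```sh
# bhs1: 34% of timeouts vs 15% of quota ~= 2.3x; gra1: 20% vs 8% ~= 2.5x
python3 -c 'print(0.34 / 0.15, 0.20 / 0.08)'
```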
*** yamamoto has joined #openstack-infra | 17:04 | |
clarkb | fungi: interesting. Not sure if you saw my notes last night but vexxhost sjc1 and bhs1 had roughly the same io throughput according to sysbench random read and writes benchmarking. That is what made me think there was maybe an unhappy hypervisor | 17:05 |
clarkb | and amorin seems to have confirmed that a couple of the unhappy jobs ran on the same hypervisor which was slower so at least some progress there | 17:05 |
*** jamesmcarthur has quit IRC | 17:05 | |
*** priteau has joined #openstack-infra | 17:06 | |
*** jamesmcarthur has joined #openstack-infra | 17:07 | |
*** slaweq has joined #openstack-infra | 17:08 | |
*** yamamoto has quit IRC | 17:08 | |
*** ginopc has quit IRC | 17:08 | |
*** priteau has quit IRC | 17:08 | |
*** priteau has joined #openstack-infra | 17:09 | |
*** rlandy is now known as rlandy|brb | 17:10 | |
Linkid | anteaya : yes, Framasoft promotes free software for people to use as alternatives :). | 17:11 |
*** shardy has quit IRC | 17:11 | |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Change openstack-dev to openstack-discuss https://review.openstack.org/622326 | 17:12 |
Linkid | And to show that they are alternatives, they offer access to those ones with their frama* services :) | 17:12 |
*** shardy has joined #openstack-infra | 17:13 | |
clarkb | ssbarnea|rover: re https://bugs.launchpad.net/openstack-gate/+bug/1806655 that should only happen if the logger daemon is stopped for some reason. One reason may be if your job reboots the test nodes (then you have to restart the logger daemon in the job) | 17:13 |
openstack | Launchpad bug 1806655 in OpenStack-Gate "Zuul console spam: [primary] Waiting for logger" [Undecided,New] | 17:13 |
clarkb | ssbarnea|rover: looking at logstash it appears to be a tripleo job specific issue? | 17:14 |
ssbarnea|rover | clarkb: ok. i have another case where some errors from mistral/api.log.txt.gz are not to be found on logstash. is this file indexed too or not? | 17:15 |
ssbarnea|rover | clarkb: yep, i suspect it is specific to tripleo. | 17:15 |
anteaya | Linkid: thanks, lots to read on the various links | 17:16 |
*** jpich has quit IRC | 17:17 | |
*** wolverineav has joined #openstack-infra | 17:17 | |
clarkb | ssbarnea|rover: https://git.openstack.org/cgit/openstack-infra/project-config/tree/roles/submit-logstash-jobs/defaults/main.yaml is the list of stuff we index. I don't think api.log matches anything there | 17:18 |
clarkb | hrm I was trying to catch up on email but now there is more than when I started | 17:19 |
*** slaweq has quit IRC | 17:20 | |
ssbarnea|rover | clarkb: thanks for the link. now i don't know what to do about the mistral log: it is huge and indexing the whole thing sounds like a pretty bad idea, but indexing errors and warnings from it does not sound like a bad idea to me. | 17:20 |
ssbarnea|rover | do we have a way to do selective indexing? | 17:20 |
clarkb | ssbarnea|rover: not with the current logstash config. It will index everything >=INFO level from the oslo.logging format logs | 17:21 |
*** slaweq has joined #openstack-infra | 17:21 | |
clarkb | ssbarnea|rover: in the past we have worked with various projects to improve their logging to make it more usable by logstash/us and consequently ops | 17:21 |
*** e0ne has quit IRC | 17:22 | |
ssbarnea|rover | ahh, that is ok 95% of the spam is debug so extra load should not be an issue. | 17:22 |
*** jamesmcarthur has quit IRC | 17:22 | |
clarkb | ssbarnea|rover: can you link me to an example file? I'm curious to see how big it is | 17:22 |
ssbarnea|rover | i am really glad to hear that we drop DEBUG lines | 17:22 |
mriedem | ssbarnea|rover: i've gone over several of your newer e-r queries and -1ed several of them | 17:22 |
clarkb | we actually rely on os-loganalyze for that iirc. so we should double check it is filtering mistral logs properly too (which I expect it isn't for a file named just api.log) | 17:23 |
ssbarnea|rover | mriedem: thanks, looking now at them. | 17:23 |
*** jamesmcarthur has joined #openstack-infra | 17:23 | |
*** zul has quit IRC | 17:23 | |
fungi | ssbarnea|rover: if we didn't drop debug level loglines our logstash retention would probably be more like a day instead of 10 | 17:24 |
*** bobh has joined #openstack-infra | 17:25 | |
fungi | and it takes a 6tb elasticsearch cluster just to house that much | 17:25 |
openstackgerrit | Merged openstack-infra/zuul master: Add instructions for reporting vulnerabilities https://review.openstack.org/554352 | 17:25 |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Categorize missing /etc/heat/policy.json https://review.openstack.org/621128 | 17:27 |
clarkb | fungi: we would also likely take two days to index that one day of logs | 17:28 |
fungi | or need many times more indexing workers | 17:28 |
mnaser | clarkb: have you seen anything about systemd-python failing to install on centos 7.6? | 17:28 |
mnaser | "Cannot find libsystemd or libsystemd-journal" | 17:29 |
fungi | efried: nice plug for git-restack in your howto! | 17:29 |
clarkb | mnaser: nope that one is new to me | 17:30 |
mnaser | :( okay, i'm a bit lost on why it's happening | 17:30 |
mnaser | http://logs.openstack.org/51/620651/5/gate/openstack-ansible-deploy-aio_lxc-centos-7/01ca2fd/logs/openstack/aio1_keystone_container-c3bb8d22/python_venv_build.log.txt.gz | 17:30 |
fungi | efried: makes me wonder if git-restack should grow a --continue convenience option which just invokes git-rebase --continue | 17:31 |
mnaser | http://logs.openstack.org/51/620651/5/gate/openstack-ansible-deploy-aio_lxc-centos-7/01ca2fd/logs/ara-report/result/ad4ef8f9-31c4-4b5d-886d-b692fcd5e1d0/ and we install systemd-devel | 17:31 |
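A few checks that usually narrow this kind of failure down (a sketch assuming a CentOS 7 build host; not taken from the job logs):

```sh
# Confirm the headers and pkg-config metadata the build looks for are present.
rpm -q systemd-devel gcc pkgconfig
pkg-config --exists libsystemd && echo "libsystemd found"
# Re-run the build verbosely to see which probe fails.
pip install -v systemd-python
```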
openstackgerrit | Duc Truong proposed openstack-infra/irc-meetings master: Change Senlin meeting to different biweekly times https://review.openstack.org/623031 | 17:33 |
*** bobh has quit IRC | 17:34 | |
*** jmorgan1 has quit IRC | 17:36 | |
*** jmorgan1 has joined #openstack-infra | 17:40 | |
*** mriedem is now known as mriedem_away | 17:40 | |
*** bobh has joined #openstack-infra | 17:42 | |
*** rlandy|brb is now known as rlandy | 17:42 | |
*** jpena is now known as jpena|off | 17:42 | |
*** florianf is now known as florianf|afk | 17:43 | |
ssbarnea|rover | mriedem_away: clarkb : read last comment on https://review.openstack.org/#/c/621004/1 -- mainly the query to look for ssh failures is currently broken. | 17:44 |
openstackgerrit | James E. Blair proposed openstack-infra/infra-specs master: Add opendev Gerrit spec https://review.openstack.org/623033 | 17:45 |
corvus | clarkb, fungi, mordred, pabelanger, tobiash: ^ can you take a look at that? i'd like to get it into shape as quickly as possible | 17:46 |
ssbarnea|rover | because the ansible output changed with the improved json logging, there are lots of queries that may have broken because they assumed that two strings are on the same log line, which is no longer true. See https://github.com/openstack-infra/elastic-recheck/blob/master/queries/1721093.yaml | 17:47 |
anteaya | this looks like the catalog of services without the frama-* branding: https://chatons.org/en/find though the page is yet to be translated into english, I welcome corrections on my assumptions from French colleagues | 17:48 |
*** mrhillsman has quit IRC | 17:51 | |
*** mrhillsman has joined #openstack-infra | 17:51 | |
*** sthussey has quit IRC | 17:52 | |
*** jiapei has quit IRC | 17:52 | |
*** adrianreza has quit IRC | 17:52 | |
*** jiapei has joined #openstack-infra | 17:53 | |
clarkb | corvus: next on my list after getting through email backlog | 17:53 |
*** sthussey has joined #openstack-infra | 17:55 | |
*** adrianreza has joined #openstack-infra | 17:55 | |
*** jamesmcarthur has quit IRC | 17:55 | |
anteaya | looks like framasoft uses a feature on etherpads that allows them to be sorted by name or date, subscribed to, and made private or public: https://mypads.framapad.org/mypads/?/mypads/group/framalang-7l3ibkl0/view | 17:57 |
clarkb | anteaya: ya there are plugins to do that. For the private vs public you have to set up auth iirc. I'm more of the opinion that "etherpads" shouldn't be permanent records (since they can be changed after all) | 17:59 |
anteaya | ah, thank you | 18:00 |
*** bobh has quit IRC | 18:00 | |
fungi | an interesting correlation, it looks like our average job uses fairly close to 1 node-hour? zuul seems to be averaging ~1kjph at capacity and we have 1030 nodes in our aggregate quota | 18:01 |
anteaya | bnemec: I didn't know you had your own channel | 18:01 |
anteaya | you have your broadcast on in the background, you are very soothing | 18:01 |
fungi | bnn: the ben nemec network | 18:02 |
clarkb | hrm apparently I don't understand what "a reboot is required to complete this package upgrade" means | 18:02 |
clarkb | because ns2 is still complaining | 18:02 |
anteaya | fungi: I do recommend | 18:02 |
fungi | clarkb: yeah, i saw the cron e-mail again this morning | 18:02 |
* bnemec trademarks bnn :-) | 18:02 | |
fungi | clarkb: puzzling to say the least | 18:02 |
clarkb | fungi: is that a nice way of saying that unattended upgrades wants a human to do the upgrade? | 18:02 |
clarkb | fungi: also looking more closely it's mad about lxd and friends. Maybe we uninstall those | 18:03 |
clarkb | (I thought we were uninstalling those fwiw, so maybe start by double checking that) | 18:03 |
*** derekh has quit IRC | 18:03 | |
clarkb | corvus: did you see https://review.openstack.org/#/c/622624/ for website content? I'm hoping to pull that back up again and refine it after lunch today. So reviews there much appreciated as well | 18:05 |
fungi | clarkb: i didn't hold onto the cronspam from ns2 so not entirely sure to be honest | 18:06 |
fungi | clarkb: oh, these are new packages, that's why | 18:07 |
*** jamesmcarthur has joined #openstack-infra | 18:07 | |
corvus | clarkb: oh nice i'll review now | 18:07 |
fungi | see /var/log/dpkg.log for entries from today | 18:07 |
bnemec | anteaya: I don't think of it so much as a channel as a dumping ground for videos I want other people to see. :-) | 18:07 |
bnemec | There's also about a hundred videos of hurdle races, but they're not public because I'm not subjecting other people's kids to YouTube commenters. | 18:07 |
*** jamesmcarthur has quit IRC | 18:07 | |
clarkb | fungi: will do thanks | 18:07 |
bnemec | anteaya: But I'm glad you like it. | 18:07 |
fungi | clarkb: there was a new kernel (again) | 18:07 |
*** jamesmcarthur has joined #openstack-infra | 18:07 | |
clarkb | fungi: ah so the lxd being held back is just noise? | 18:07 |
fungi | i think so | 18:08 |
clarkb | why doesn't ns1 send the same spam? | 18:08 |
clarkb | or maybe my filters are broken | 18:08 |
fungi | likely not configured the same | 18:08 |
anteaya | bnemec: ha ha ha, thanks for the explanation | 18:09 |
*** dave-mccowan has joined #openstack-infra | 18:09 | |
anteaya | bnemec: and yes, I do like it, like background talk radio without the death | 18:10 |
bnemec | :-) | 18:10 |
anteaya | would listen to more ben radio anytime | 18:10 |
anteaya | and yeah, +1 for protecting the younglings | 18:11 |
fungi | clarkb: yeah, ns1 isn't auto-updating at all | 18:11 |
fungi | though unattended-upgrades is installed and the checksum on /etc/apt/apt.conf.d/50unattended-upgrades matches between them | 18:12 |
Shrews | infra-root: I'd like to restart the nodepool launchers to pick up some fixes. Any objections to doing that now? | 18:13 |
corvus | Shrews: ++ | 18:13 |
fungi | Shrews: sounds fine to me | 18:13 |
Shrews | ok. i'll begin the preparations for it | 18:14 |
corvus | i added infra-core to opendev-website-core | 18:14 |
tobiash | corvus: lgtm and ++ for not using cgit :) | 18:15 |
fungi | clarkb: 2018-12-05 06:50:20,794 ERROR Cache has broken packages, exiting | 18:17 |
fungi | seen in /var/log/unattended-upgrades/unattended-upgrades.log on ns1 | 18:17 |
clarkb | the one "feature" cgit has that few others seem to do is the ability to look at source files in a semi rendered state so that you can link to lines in eg rst files | 18:17 |
clarkb | fungi: huh | 18:17 |
clarkb | fungi: can we apt-get autoclean and see if that fixes it? | 18:18 |
Shrews | corvus: oh, your new addition for the DELETED node state is going to force a total shutdown i think | 18:18 |
Shrews | corvus: if we do 1-at-a-time, some launchers may see that as an invalid state and bork | 18:18 |
fungi | clarkb: https://bugs.debian.org/898607 | 18:18 |
openstack | Debian bug 898607 in unattended-upgrades "unattended-upgrades: send mail on all ERROR conditions" [Important,Fixed] | 18:18 |
fungi | clarkb: i think it's to do with the lxd packages | 18:20 |
clarkb | fungi: so maybe ensure those are removed and do an autoclean? | 18:20 |
*** ralonsoh has quit IRC | 18:21 | |
tobiash | Shrews: so we should add a release note before we do a release then | 18:21 |
Shrews | clarkb: i forget, is a pip install necessary for launcher upgrade? | 18:24 |
fungi | Shrews: is `pbr freeze` not listing the latest commit? | 18:28 |
clarkb | Shrews: puppet should do it for us | 18:28 |
clarkb | but ya running python3 $(which pbr) freeze should show you the sha1 | 18:28 |
clarkb | to confirm | 18:29 |
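To script the confirmation clarkb describes, a sketch like the following would work; it assumes pbr is installed in the same environment as nodepool and just picks out the nodepool line, whose comment carries the installed git sha.

    # Sketch of the check described above: run pbr's freeze command through
    # the interpreter and print the nodepool line, which carries the git sha.
    # Assumes pbr is installed in the same environment as nodepool.
    import shutil
    import subprocess

    pbr = shutil.which("pbr")
    output = subprocess.check_output(["python3", pbr, "freeze"], text=True)
    for line in output.splitlines():
        if line.startswith("nodepool"):
            print(line)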
openstackgerrit | Clark Boylan proposed openstack-infra/system-config master: Don't install lxd on our servers https://review.openstack.org/623040 | 18:30 |
openstackgerrit | Clark Boylan proposed openstack-infra/system-config master: Configure packages on ubuntu arm servers https://review.openstack.org/623041 | 18:30 |
clarkb | fungi: ianw ^ package fixes including for arm | 18:30 |
Shrews | fungi: clarkb: k, cool. i was doing 'pip freeze' which was less than helpful | 18:30 |
fungi | clarkb: aha, i'm noticing that one of the packages which is behind in updates on ns1 compared to ns2 is unattended-upgrades itself | 18:31 |
mnaser | does anyone know if new images have been uploaded to all clouds? | 18:31 |
mnaser | i'm still seeing centos 7.5 machines go up, even as of 2 hours ago | 18:32 |
fungi | clarkb: though the package changelog for the newer one doesn't seem to indicate it includes the fix for debian bug 898607 so i'm still at a loss | 18:32 |
openstack | Debian bug 898607 in unattended-upgrades "unattended-upgrades: send mail on all ERROR conditions" [Important,Fixed] http://bugs.debian.org/898607 | 18:32 |
frickler | corvus: seems https://review.openstack.org/621639 would need to be applied manually before the check can pass? | 18:33 |
clarkb | mnaser: looks like our image builds have been broken for centos for ~one week | 18:33 |
*** wolverineav has quit IRC | 18:34 | |
corvus | frickler: does that check not run as the openstackinfra user? | 18:34 |
mnaser | clarkb: can i help fix that? i'm noticing some servers get centos 7.6 and others with 7.5 | 18:34 |
Shrews | ok, stopping all launchers | 18:34 |
clarkb | mnaser: I don't think any of our test nodes should be 7.6, they will update to 7.6 when running though | 18:34 |
corvus | Shrews: you can do one at a time for slightly less disruption | 18:35 |
clarkb | mnaser: it appears to be systemic as only arm images on nb03 are successfully building right now? | 18:35 |
Shrews | corvus: see my previous query to you | 18:36 |
mnaser | clarkb: ok yeah it looks like now our containers are 7.6 and our hosts are 7.5 | 18:36 |
Shrews | corvus: new DELETED node state | 18:36 |
Shrews | infra-root: all nodepool launchers restarted now | 18:36 |
mnaser | clarkb: if there are any logs, i can look into them | 18:36 |
clarkb | mnaser: looks like we've got dib processes running since November 14 and November 28 on the two non-arm builders. So the pipeline jammed up there | 18:36 |
*** wolverineav has joined #openstack-infra | 18:37 | |
fungi | as in dib was hung and still trying to build something? | 18:37 |
openstackgerrit | Salvador Fuentes Garcia proposed openstack-infra/project-config master: kata-containers: re-enable Fedora job https://review.openstack.org/623043 | 18:37 |
clarkb | mnaser: https://nb02.openstack.org/ubuntu-bionic-0000000042.log and https://nb01.openstack.org/opensuse-423-0000000025.log | 18:37 |
frickler | corvus: I was assuming there is a difference between access to the channel and access to listing permissions from chanserv. but maybe permissions still aren't set up correctly in the first place, yes | 18:37 |
clarkb | fungi: yup exactly and those logs seem to line up with that | 18:37 |
corvus | Shrews: ah missed that | 18:37 |
corvus | frickler: the bot has access to the channel | 18:38 |
clarkb | in this case a root likely needs to strace to understand what is going on there | 18:38 |
fungi | clarkb: and not hung on the same image types either. nb01 is on opensuse and nb02 is on ubuntu | 18:38 |
corvus | frickler: my question is whether that check runs authenticated | 18:38 |
clarkb | because those logs just end as if they weren't running anymore but ps says otherwise | 18:38 |
clarkb | fungi: they are in different clouds too iirc | 18:38 |
*** priteau has quit IRC | 18:38 | |
mnaser | looks like no builds have been happening since the 27th? | 18:39 |
mnaser | https://nb02.openstack.org | 18:39 |
clarkb | mnaser: 28th | 18:39 |
clarkb | mnaser: I linked the last log file above | 18:39 |
clarkb | both disk-image-create processes are blocking on a wait4() so that's not super helpful | 18:39 |
fungi | clarkb: ps suggests the hung child process on nb02 is a child of `pip install os-testr` in service of the venv-os-testr element | 18:40 |
clarkb | in this case we may just want to take this as an opportunity to reboot the servers (apply package updates and clear out any system resources dib may have leaked) | 18:40 |
clarkb | fungi: yup that lines up with the log above too | 18:40 |
*** ccamacho has quit IRC | 18:40 | |
clarkb | 2018-11-28 00:35:02.023 | Downloading https://files.pythonhosted.org/packages/2a/fd/2a8b894ee3451704cf8525a6a94b87d5ba24747b7bbd3d2f7059189ad79f/stestr-2.1.1.tar.gz (104kB) is last logged line | 18:40 |
clarkb | that pip install is blocking on a read | 18:41 |
frickler | corvus: I was assuming it does, because it takes the nickname as parameter, but looking closer in fact it does seem to | 18:41 |
fungi | clarkb: though on nb01 it's hung running a git clone of openstack/charm-interface-barbican-secrets | 18:41 |
frickler | does not* | 18:41 |
fungi | so didn't even hang doing the same things | 18:41 |
*** wolverineav has quit IRC | 18:41 | |
clarkb | the git operation is blocking on a read to 0 | 18:42 |
clarkb | (stdin) | 18:42 |
*** wolverineav has joined #openstack-infra | 18:43 | |
corvus | frickler: any idea what i need to tell chanserv to make it so the script can run? | 18:43 |
clarkb | plenty of disk on that server too | 18:43 |
clarkb | possibly some condition was set up by an earlier build such that the subsequent ones were unhappy? weird though that it wouldn't be the same issue in both spots | 18:43 |
Shrews | #status log Nodepool launchers restarted and now running with commit ee8ca083a23d5684d62b6a9709f068c59d7383e0 | 18:44 |
openstackstatus | Shrews: finished logging | 18:44 |
frickler | corvus: "set #channel private off" might help | 18:45 |
*** dpawlik has quit IRC | 18:45 | |
fungi | clarkb: and the pip install child process on nb02 is blocking on a read from fd 5 | 18:45 |
clarkb | fungi: yup | 18:45 |
*** diablo_rojo has joined #openstack-infra | 18:45 | |
clarkb | fungi: I'm somewhat inclined to just reboot them both and watch them for similar behavior in the future | 18:45 |
*** dpawlik has joined #openstack-infra | 18:46 | |
fungi | i concur, there's not much more i can learn from these unless someone else wants to take a stab | 18:46 |
clarkb | I guess we can look at the build just prior to the most recent ones to see if there is anything obvious | 18:46 |
clarkb | https://nb02.openstack.org/debian-stretch-0000000037.log completed successfully on nb02 | 18:46 |
*** mriedem_away is now known as mriedem | 18:46 | |
*** manjeets_ is now known as manjeets | 18:47 | |
clarkb | https://nb01.openstack.org/opensuse-423-0000000024.log failed on nb01 so nothing clear there | 18:47 |
clarkb | fungi: do you want to reboot or should I? | 18:47 |
clarkb | I'll go ahead and do it | 18:49 |
*** shardy has quit IRC | 18:49 | |
fungi | ahh, yep sorry, trying to do too many things at once | 18:49 |
mriedem | ssbarnea|rover: replied on https://review.openstack.org/#/c/621004/ | 18:49 |
clarkb | mnaser: fungi https://nb01.openstack.org/ubuntu-trusty-0000000040.log is the running build on nb01 after a reboot | 18:50 |
ssbarnea|rover | mriedem: thanks. that was my plan, after getting confirmation. probably after this we should check existing queries for similar issues. | 18:50 |
*** ccamacho has joined #openstack-infra | 18:50 | |
clarkb | we can watch that to see if builds work now | 18:50 |
*** eharney has quit IRC | 18:52 | |
*** harlowja has joined #openstack-infra | 18:52 | |
clarkb | mnaser: fungi https://nb02.openstack.org/ubuntu-xenial-0000000038.log for nb02. Assuming those builds are now working, they should get around to doing centos7 in the near future | 18:52 |
clarkb | then we should keep our eyes open for new 7.6 failures | 18:52 |
mnaser | clarkb: great, thank you | 18:52 |
frickler | corvus: confirmed on a different channel that setting the private flag is what is hiding the access list | 18:53 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Add an upgrade release note for schema change https://review.openstack.org/623046 | 18:53 |
corvus | frickler: flag removed | 18:56 |
corvus | clarkb, dhellmann, tobiash: great comments on 623033, thx; i replied to all | 19:00 |
*** ccamacho has quit IRC | 19:00 | |
*** wolverineav has quit IRC | 19:01 | |
*** betherly has joined #openstack-infra | 19:01 | |
*** eernst has joined #openstack-infra | 19:01 | |
corvus | dhellmann: also in my reply to you, i tried to both provide technical answers, and also my own feedback. hopefully the two can be disentangled as necessary. :) | 19:04 |
*** betherly has quit IRC | 19:06 | |
clarkb | ssbarnea|rover: seems like "2018-12-04 04:14:47,233 ERROR:dlrn:Known error building packages for openstack-tripleo-heat-templates, will retry later" may be the root cause of one of those waiting on logger failures. | 19:06 |
clarkb | ssbarnea|rover: dlrn probably shouldn't have a retry later option for CI? | 19:06 |
*** therve has left #openstack-infra | 19:07 | |
clarkb | looks like delorean failed to open a local repomd.xml file. Thats odd | 19:09 |
ssbarnea|rover | clarkb: please comment on https://bugs.launchpad.net/tripleo/+bug/1714202 ticket, so others can see it (and hopefully reply). | 19:10 |
openstack | Launchpad bug 1714202 in tripleo "DLRN builds fail intermittently (network errors)" [Medium,Incomplete] | 19:10 |
dhellmann | corvus : yep. I suspect a rename of a bunch of the cruft we have now to some neutral prefix to start combined with a lenient policy about using the openstack/ prefix later is going to be where we end up | 19:10 |
ssbarnea|rover | it's late here and i do not know all the details. to me it looks like an infra issue. | 19:11 |
clarkb | ssbarnea|rover: done | 19:12 |
clarkb | ssbarnea|rover: well in this case it's not the network because it's a local file. So either the file doesn't exist or the permissions don't allow reads or the filesystem is corrupt | 19:12 |
clarkb | we should rule out the first two things first :) | 19:12 |
ssbarnea|rover | clarkb: true, but corrupted filesystem counts as infra too. different cause. i am too tired now to look at it. | 19:13 |
clarkb | it does but it should be a far less common issue and the other two things are much easier to check. Just add a sudo ls -l there | 19:14 |
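As a sketch of those two quick checks (the path below is purely illustrative, not the real dlrn layout):

    # Hedged sketch of the two easy checks: does the file exist, and is it
    # readable? The repomd.xml path here is hypothetical.
    import os

    path = "/path/to/repodata/repomd.xml"  # hypothetical location
    if not os.path.exists(path):
        print("missing:", path)
    elif not os.access(path, os.R_OK):
        print("not readable:", path, oct(os.stat(path).st_mode))
    else:
        print("present and readable:", path)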
*** bhavikdbavishi has quit IRC | 19:15 | |
*** ykarel has quit IRC | 19:16 | |
*** wolverineav has joined #openstack-infra | 19:16 | |
*** wolverineav has quit IRC | 19:16 | |
*** wolverineav has joined #openstack-infra | 19:17 | |
clarkb | ssbarnea|rover: for the other url on your bug we do proxy cache centos buildlogs https://git.openstack.org/cgit/openstack-infra/system-config/tree/modules/openstack_project/templates/mirror.vhost.erb#n172 | 19:19 |
*** gfidente has quit IRC | 19:21 | |
*** kgiusti has left #openstack-infra | 19:29 | |
*** wolverineav has quit IRC | 19:29 | |
*** kgiusti has joined #openstack-infra | 19:30 | |
*** wolverineav has joined #openstack-infra | 19:31 | |
*** wolverineav has quit IRC | 19:39 | |
*** wolverineav has joined #openstack-infra | 19:43 | |
mnaser | clarkb: so far xenial built successfully, opensuse-423 building now | 19:45 |
mnaser | so good progress so far | 19:45 |
*** wolverineav has quit IRC | 19:46 | |
*** wolverineav has joined #openstack-infra | 19:46 | |
openstackgerrit | Merged openstack-infra/project-config master: Add #openstack-designate to accessbot https://review.openstack.org/621639 | 19:46 |
fungi | clarkb: okay, so a bit of spelunking i think reveals that the reason unattended-upgrades doesn't want to run is that the new versions of lxd and lxd-client want to also install libuv1 which was not previously installed, and u-a is configured to only upgrade packages which are already installed but not add any new packages | 19:48 |
clarkb | aha | 19:48 |
fungi | so as suspected, removing lxd and lxd-client should, i think, get it back on track | 19:48 |
fungi | (one exception to the no new packages rule is kernels) | 19:49 |
*** studarus has quit IRC | 19:53 | |
*** jamesmcarthur has quit IRC | 19:55 | |
fungi | clarkb: ahh, nope, lxd looks like it may be unrelated after all | 19:57 |
clarkb | http://logs.openstack.org/62/621562/2/gate/tripleo-ci-centos-7-containers-multinode/0fa88fd/logs/undercloud/var/log/extra/dstat.html.gz in that job from 18:57ish to 19:20ish it appears it's just validating heat yamls? | 19:57 |
*** jamesmcarthur has joined #openstack-infra | 19:57 | |
clarkb | at 19:14 tripleoclient times out on some websocket error (unfortunately its not clear what it is talking to from that traceback) | 19:58 |
clarkb | then at 19:20 the stack is actually deployed | 19:58 |
clarkb | this did run on a bhs1 node but looking at the dstat it actually looks a lot more healthy than the dstats from bhs1 I was looking at yesterday | 19:58 |
fungi | clarkb: because the unattended-upgrades log on both ns1 and ns2 complains that it's not going to upgrade those packages so either there's an actual different corrupt package in the cache and it's not specifying which one, or it's a behavior difference between the versions of unattended-upgrades on those two servers | 19:58 |
clarkb | fungi: time to try the autoclean? | 19:59 |
clarkb | reading that dstat disk writes spike to over 300mbps | 19:59 |
clarkb | and are in the 30 range while validating things | 19:59 |
fungi | just a `sudo apt clean` ought to wipe out the package cache, but i want to try and figure out what's corrupt first | 19:59 |
clarkb | what I do notice is that all the memory is used up (mostly) | 19:59 |
clarkb | and there are load spikes to 29 and 12 | 20:00 |
clarkb | and lots of paging | 20:00 |
clarkb | what this is telling me is that bhs1 itself isn't bad in that specific case, but rather the job is demanding quite a lot from the test node and things timed out (probably due to swapping?) | 20:01 |
clarkb | thinking about disk throughput lots of swapping could have a snowballing effect on disk performance | 20:02 |
clarkb | where tripleo jobs end up being the noisy neighbors spinning the disks a bunch (or at least moving electrons around on ssds) | 20:02 |
*** jaosorior has joined #openstack-infra | 20:03 | |
*** eharney has joined #openstack-infra | 20:05 | |
*** dtantsur is now known as dtantsur|afk | 20:06 | |
*** wolverineav has quit IRC | 20:06 | |
*** wolverineav has joined #openstack-infra | 20:07 | |
fungi | #status log removed lxd and lxd-client packages from ns1 and ns2.opendev.org, autoremoved, upgraded and rebooted | 20:09 |
openstackstatus | fungi: finished logging | 20:09 |
fungi | clarkb: interestingly, i had to manually start nsd on ns2 (per your earlier observation) but not ns1 | 20:09 |
clarkb | fungi: could be a race in the startup. The existing unit does allow for it to start after networking is fully up but doesn't require it | 20:11 |
clarkb | comparing dstat to a run of the same job in inap there are some differences. Inap has slightly more memory available to the VM | 20:16 |
clarkb | 8000MB vs 8096MB maybe? | 20:16 |
fungi | i wonder if ns2 consistently loses the race and ns1 consistently wins it, but don't feel like rebooting them even more to find out | 20:16 |
clarkb | fungi: ya my fix should solve it long term I think | 20:17 |
clarkb | the paging is actually worse by count in inap | 20:18 |
clarkb | implying that maybe that is the big difference (we are able to page in stuff that isn't cached due to no memory more effectively) | 20:18 |
*** wolverineav has quit IRC | 20:18 | |
clarkb | mwhahaha: ssbarnea|rover re ^ I would be curious if there are any easy hacks we can do to alleviate the memory pressure | 20:19 |
clarkb | does tripleo enable kernel same page merging? | 20:19 |
clarkb | that may help | 20:19 |
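For reference, KSM is toggled through sysfs; a rough sketch (run as root) looks like the following. KSM only merges pages that applications have marked MADV_MERGEABLE (qemu/kvm does this for guest memory), so it mainly helps nested-VM workloads, and the tuning value here is just an illustrative starting point.

    # Rough sketch of enabling kernel same-page merging via sysfs (root
    # required). Only pages marked MADV_MERGEABLE by the application are
    # merged; the pages_to_scan value below is illustrative.
    from pathlib import Path

    ksm = Path("/sys/kernel/mm/ksm")
    (ksm / "run").write_text("1\n")
    (ksm / "pages_to_scan").write_text("100\n")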
clarkb | perhaps we can also reduce the number of webservers if we have overprovisioned them? (have no idea if that is a thing, just remember it being something we did with devstack to reduce its overhead) | 20:20 |
clarkb | also this dstat data is great, thank you for adding that. I do notice it doesn't seem to be in the standalone job though? | 20:20 |
mwhahaha | should be | 20:20 |
mwhahaha | it uses the same ci bits | 20:20 |
mwhahaha | (dstat that is) | 20:20 |
clarkb | mwhahaha: hrm I can't find it in some semi random jobs I pulled out to check | 20:21 |
clarkb | http://logs.openstack.org/62/621562/1/check/tripleo-ci-centos-7-standalone/1c762fc/logs/undercloud/var/log/extra/ doesn't have a dstat file | 20:21 |
*** kjackal has joined #openstack-infra | 20:22 | |
mwhahaha | no looks like we don't run that part | 20:22 |
mwhahaha | probably because it's in the undercloud setup | 20:22 |
mwhahaha | which this is not, we can add it tho | 20:22 |
clarkb | I think that would be helpful especially if we do find its something like memory pressure causing the fallout, then we'll be able to measure if we've improved that or not | 20:23 |
*** kjackal has quit IRC | 20:23 | |
mwhahaha | the standalone is less memory intensive | 20:23 |
mwhahaha | by default it fits in like 6 or 7g | 20:23 |
clarkb | oh nice | 20:23 |
mwhahaha | but when tempest goes it might increase | 20:24 |
*** kjackal has joined #openstack-infra | 20:24 | |
clarkb | in the failure above we seem to hit the pressure just before running the overcloud stack create | 20:24 |
clarkb | and then it sticks around until the end of the job for the most part (it gets slightly better near the end) | 20:25 |
clarkb | in any case, that's my current thought on where these are having a sad. Memory pressure reduces IO throughput because lots of paging is happening (which itself spins the disks which could cause noisy neighbor issues) | 20:26 |
clarkb | reducing memory pressure should reduce the need for paging which should improve IO throughput and then hopefully things are happier | 20:26 |
*** jamesmcarthur has quit IRC | 20:28 | |
mwhahaha | it would seem that just running the undercloud with nothing else means all the ram is taken up without doing anything | 20:31 |
pabelanger | Shrews: thanks for restarting nodepool-launchers! | 20:32 |
*** Swami has joined #openstack-infra | 20:32 | |
mwhahaha | so the memory all starts getting gobbled up during step4 of the undercloud install which would be when we actually turn on the openstack services (other than keystone) | 20:33 |
pabelanger | interesting enought, I think we might need another zuul-executor or 2. Our executor queue backlog has been above 0 the last 6 hours: http://grafana.openstack.org/dashboard/db/zuul-status | 20:33 |
mwhahaha | so it's likely that it's the openstack service containers and the lack of sharedness from running python in containers | 20:33 |
pabelanger | enough* | 20:33 |
pabelanger | I am not really sure why that is however | 20:35 |
pabelanger | http://grafana.openstack.org/dashboard/db/zuul-status?from=now-90d&to=now | 20:38 |
pabelanger | we are starting more builds in a given period, since 2018-11-29 | 20:39 |
pabelanger | Which makes me think that the relative priority change has caused us to run more jobs in a given hour | 20:39 |
pabelanger | and more IO on executors, where they are running up against governors | 20:40 |
pabelanger | clarkb: corvus: tobiash: ^something that might be of interest when you have spare cycles | 20:40 |
corvus | pabelanger: that's a fascinating theory | 20:42 |
openstackgerrit | Kendall Nelson proposed openstack-infra/infra-specs master: StoryBoard Story Attachments https://review.openstack.org/607377 | 20:42 |
tobiash | pabelanger: in my eyes this makes sense as now the smaller less active projects have a higher relative priority to the big resource consumers. Often the big resource consumers have long running jobs that block the same resources for a long time while smaller projects typically have faster jobs. Thus in the end you can run more jobs per hour than before. | 20:43 |
corvus | makes sense | 20:43 |
tobiash | I see the same thing in our deployment too | 20:44 |
*** dpawlik has quit IRC | 20:44 | |
tobiash | the resource hogs are always the projects that have slow jobs | 20:44 |
fungi | but over the long term it's still the same number and type of jobs completed in the same amount of time | 20:44 |
*** jamesmcarthur has joined #openstack-infra | 20:44 | |
fungi | we're just changing the order in which they get started | 20:44 |
tobiash | yes, but during peak load the throughput is higher | 20:44 |
corvus | ...theoretically, we could actually end up taking *longer* to run jobs overall -- because right now we have node capacity sitting idle waiting for executors, whereas we did not before.... | 20:45 |
pabelanger | right | 20:45 |
fungi | ahh, and the projects with longer-running jobs which are also the projects with more contention for changes (due to their longer-running jobs) end up waiting until later into the window when activity has died off | 20:45 |
corvus | so we may need to add capacity not only to make things faster now (we will complete the jobs we're supposed to be running right now faster), but also to get our daily aggregate back to the overall time we had before. | 20:46 |
tobiash | I'd guess you need at least two more executors | 20:47 |
fungi | so basically with more executors we'd be able to sustain a higher jph throughput at peak and then have a lower job rate off-peak than otherwise | 20:47 |
tobiash | judging from these stats | 20:47 |
pabelanger | tobiash: yah, I would agree with 2 | 20:47 |
corvus | why 2? | 20:48 |
*** slaweq has quit IRC | 20:49 | |
fungi | if there's a metric for calculating how many executors we need that would be awesome to feed into our future elastic executor autoprovisioning ;) | 20:49 |
pabelanger | executors look to be capped at about 65 running jobs, but executor queue seems to be ~120 | 20:49 |
pabelanger | so, guessing that 2 would help bring that down | 20:49 |
pabelanger | but, yah, just a guess :) | 20:50 |
efried | fungi: catching up... | 20:51 |
efried | I realized later that I should probably not have put commit --amend in my instructions, as that can f ya when there's a merge conflict, and rebase --continue automatically does commit --amend when it's necessary (and not when it's not). | 20:51 |
efried | But yeah, if we had a git restack --continue that served both purposes, it would be simpler to explain to noobs, probably. | 20:51 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Add ze12.openstack.org https://review.openstack.org/623067 | 20:52 |
corvus | efried, fungi: oh, we talking about 'git restack'? can you catch me up? | 20:52 |
corvus | fungi, pabelanger: ^ that change will fail until the host exists, but if we're ready to add it, i can go ahead and create it. | 20:52 |
fungi | corvus: efried sent an awesome howto to the openstack-discuss ml about workflow for stacks of patches, and plugged git-restack heavily | 20:52 |
corvus | sweet, reading now | 20:53 |
tobiash | corvus: it's an educated guess. I see that there are times when all executors deregistered and 1 more would be a less than 10% increase while 2 more would probably give a little more headroom. | 20:53 |
tobiash | but as pabelanger said, it's just a guess | 20:53 |
corvus | tobiash, pabelanger: i agree, but also, i think the 'jobs starting' governor is a big part of this, so that may change things a bit. i lean toward adding 1 and re-evaluating for now... | 20:54 |
pabelanger | starting with one wfm | 20:55 |
pabelanger | I also agree jobs starting is related | 20:55 |
tobiash | corvus: that's fine, I just looked at these graphs already one or two weeks ago and thought that one more would make sense, so my first guess was two | 20:55 |
*** wolverineav has joined #openstack-infra | 20:56 | |
fungi | even with all the executor backoff, we're averaging roughly 1kjph which is pretty awesome | 20:57 |
corvus | tobiash, pabelanger: yeah, you're probably right. i mostly just don't want to delete servers :) | 20:57 |
corvus | efried, fungi: adding 'git restack --continue' makes sense to me | 20:57 |
pabelanger | corvus: ++ | 20:57 |
efried | corvus: Cool | 20:57 |
*** wolverineav has quit IRC | 20:58 | |
*** wolverineav has joined #openstack-infra | 20:58 | |
*** kjackal has quit IRC | 20:59 | |
*** dpawlik has joined #openstack-infra | 20:59 | |
*** dpawlik has quit IRC | 21:03 | |
*** jtomasek_ has quit IRC | 21:05 | |
*** rkukura_ has joined #openstack-infra | 21:08 | |
tobiash | corvus: ah, so you're targeting an exact fit | 21:11 |
*** rkukura has quit IRC | 21:11 | |
*** rkukura_ is now known as rkukura | 21:11 | |
*** jamesmcarthur has quit IRC | 21:11 | |
openstackgerrit | Merged openstack-infra/nodepool master: Add an upgrade release note for schema change https://review.openstack.org/623046 | 21:17 |
fungi | gonna go find some food now that the workmen seem to be done here for the day; i should be back soon though | 21:25 |
*** jamesmcarthur has joined #openstack-infra | 21:28 | |
*** e0ne has joined #openstack-infra | 21:30 | |
*** sambetts|afk has quit IRC | 21:31 | |
*** priteau has joined #openstack-infra | 21:33 | |
*** yboaron has quit IRC | 21:36 | |
corvus | hrm, launch-node doesn't seem to be detecting that the server is up | 21:37 |
corvus | Shrews, mordred: ^ more potential fallout from openstacksdk? | 21:38 |
corvus | i believe launch-node uses the internal wait/timeout feature | 21:39 |
*** jcoufal_ has quit IRC | 21:42 | |
*** graphene has quit IRC | 21:42 | |
corvus | Shrews, mordred: yes, it seems so -- running launch-node with 0.19.0 works, 0.20.0 fails | 21:43 |
*** graphene has joined #openstack-infra | 21:44 | |
corvus | this is all i got: http://paste.openstack.org/show/736724/ | 21:45 |
*** jamesmcarthur has quit IRC | 21:46 | |
*** priteau has quit IRC | 21:50 | |
*** e0ne has quit IRC | 21:53 | |
corvus | how do i find the documentation for 0.19.0? | 21:57 |
*** kgiusti has left #openstack-infra | 21:58 | |
*** jamesmcarthur has joined #openstack-infra | 21:58 | |
corvus | the server is built, but dns.py isn't working because it says: AttributeError: 'Proxy' object has no attribute 'get_server' | 21:59 |
*** e0ne has joined #openstack-infra | 21:59 | |
corvus | but according to the docs under the 0.19 tag, the compute proxy is supposed to have a get_server method: http://git.openstack.org/cgit/openstack/openstacksdk/tree/doc/source/user/proxies/compute.rst?h=0.19.0 | 21:59 |
*** rcernin has joined #openstack-infra | 22:03 | |
corvus | okay, i have to go back to openstacksdk 0.17.2 for that to work | 22:03 |
mordred | corvus: crappit | 22:04 |
mordred | corvus: lemme look real quick and see if I can be more immediately helpfulk | 22:04 |
*** graphene has quit IRC | 22:04 | |
corvus | mordred: ok; i've just got past the immediate needs -- (i have a server and i have the dns commands). so at this point it's just what next to make launch-node.py and dns.py work again | 22:04 |
*** graphene has joined #openstack-infra | 22:05 | |
mordred | corvus: kk. I think that's likely a yukky break in 0.20.0 ... I'm kinda tempted to revert the patch that's causing all of this and then re-revert it with a bunch more testing since there's clearly some issues here | 22:05 |
*** betherly has joined #openstack-infra | 22:06 | |
corvus | mordred: ok. note two problems: the server wait (introduced in 0.20) and the server_get proxy issue (introduced in 0.18) | 22:07 |
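Roughly, the two code paths in question look like the sketch below; the cloud name, image, flavor and server name are placeholders, not what launch-node actually passes.

    # Sketch of the two openstacksdk calls being discussed; all values are
    # placeholders. create_server(wait=True) is the wait that hangs under
    # 0.20, and conn.compute.get_server() is the proxy call dns.py needs.
    import openstack

    conn = openstack.connect(cloud="example-cloud")  # hypothetical clouds.yaml entry

    server = conn.create_server(
        "example-server", image="bionic", flavor="8GB",
        wait=True, timeout=1800)

    server = conn.compute.get_server(server.id)
    print(server.addresses)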
*** jamesmcarthur has quit IRC | 22:07 | |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Add ze12.openstack.org https://review.openstack.org/623067 | 22:08 |
corvus | clarkb, mordred, fungi, pabelanger: ^ that shuld be ready to merge | 22:08 |
*** e0ne has quit IRC | 22:08 | |
mordred | corvus: so .. I think the get_server issue is something else - which I think is something I need to come up with a better logging/error for | 22:09 |
pabelanger | corvus: +2 | 22:09 |
mordred | cloud.compute not having a get_server is more likely to mean it wasn't able to construct a compute Adapter for some reason - but if that's the case it's a TERRIBLE error message for it | 22:10 |
mordred | I *think* what it's saying is "this Proxy, which is a bare Proxy and not an openstack.compute.v2._proxy.Proxy, does not have a get_server method" | 22:11 |
*** betherly has quit IRC | 22:11 | |
mordred | but I'll dig in to both things | 22:11 |
corvus | mordred: i think so -- here's a repr() and dict() from when the error was happening: http://paste.openstack.org/show/736725/ | 22:11 |
corvus | that's a <openstack.compute.v2._proxy.Proxy object at 0x7f24784b94e0> when it's working | 22:12 |
clarkb | anyone want me to keep my three io test nodes around? if not I'll be deleting them shortly (one in bhs1, one in gra1 and one in sjc1) | 22:15 |
clarkb | mnaser: centos7 image just started building a few minutes ago | 22:16 |
clarkb | on nb01 | 22:16 |
*** efried is now known as efried_out_til_j | 22:17 | |
corvus | the "big project backlog" today is not so great. neutron and tht at 6h, nova at 3. | 22:17 |
*** efried_out_til_j is now known as efried_cya_jan | 22:17 | |
corvus | by "not so great" i mean not large | 22:17 |
mordred | corvus: yah. that's what it SHOULD be | 22:18 |
corvus | yeah, it seems like a reasonable amount of time. | 22:18 |
mordred | corvus: do you have any way to see what it is when it's broken without too much effort? | 22:18 |
clarkb | corvus: ya it's looking much better today than yesterday. Likely depending on whether or not long chains of changes are coming in for $project? at least that seemed to be the case with nova yesterday | 22:19 |
*** tpsilva has quit IRC | 22:19 | |
*** jamesmcarthur has joined #openstack-infra | 22:19 | |
corvus | mordred: i defer to clarkb on that :) | 22:20 |
clarkb | aroo? | 22:21 |
clarkb | what the Proxy value is when it doesn't work? | 22:21 |
bkero | Shush Agnew | 22:21 |
mordred | clarkb: yah - my hypothesis is that it's an openstack.proxy.Proxy when it doesn't work | 22:22 |
mordred | oh - wait. yes - of course it is | 22:22 |
corvus | mordred: oh ha i crossed streams | 22:22 |
mordred | that is a bug we actually already have a workaround for in keystoneauth | 22:22 |
corvus | i thought you were asking about backlogs | 22:23 |
corvus | mordred: yes, it's a compute_v2 proxy when it works | 22:23 |
mordred | https://review.openstack.org/#/c/621257/ | 22:23 |
*** bobh has joined #openstack-infra | 22:23 | |
mordred | that will fix the launch_node problem - it's the rackspace version discovery thing | 22:23 |
mordred | or - it'll fix the dns.py problem | 22:23 |
mordred | I do not yet know what the wait problem is | 22:23 |
mordred | I was hoping we could get the rate limiting patch landed - but I think we just need to cut a new keystoneauth with that fix in | 22:24 |
*** wolverineav has quit IRC | 22:24 | |
*** wolverineav has joined #openstack-infra | 22:24 | |
mordred | lbragstad, kmalloc: ^^ mind if I push up a ksa release patch? | 22:25 |
kmalloc | mordred: do eet | 22:25 |
lbragstad | sure | 22:25 |
kmalloc | mordred: also... | 22:25 |
*** udesale has quit IRC | 22:25 | |
kmalloc | mordred: +2 on rate limit | 22:25 |
*** eernst has quit IRC | 22:26 | |
kmalloc | mordred: as long as we commit to getting an in-ksa functional test beyond the SDK case. | 22:26 |
kmalloc | mordred: but the SDK case is a good start | 22:26 |
* kmalloc has to go get car... it's finally repaired! | 22:26 | |
mordred | kmalloc, lbragstad: remote: https://review.openstack.org/623090 Release 3.11.2 of keystoneauth | 22:27 |
mordred | kmalloc: sweet. and yes - actually, before we land it, let's get the sdk consume patch fixed up and green | 22:27 |
mordred | that way we can at least have that and see that it works | 22:27 |
kmalloc | mordred: ++ | 22:27 |
kmalloc | mordred: please mark the KSA -1 Workflow to hold until it's green or at least comment as much on it | 22:28 |
mordred | ++ | 22:28 |
kmalloc | but the code as of now, looks about right sans that functional test | 22:28 |
mordred | kmalloc: done | 22:28 |
mordred | woot | 22:28 |
*** boden has quit IRC | 22:30 | |
clarkb | fungi: interesting thing I've noticed about e-r. A bunch of bug graphs have holes. At first I was worried that we broke indexing, then I realized after opening up those bug graphs that the hole is when we disabled bhs1 | 22:32 |
clarkb | unfortunately I see that most of them are still recurring :/ | 22:33 |
clarkb | http://status.openstack.org/elastic-recheck/#1805176 http://status.openstack.org/elastic-recheck/#1763070 http://status.openstack.org/elastic-recheck/#1802640 http://status.openstack.org/elastic-recheck/#1745168 http://status.openstack.org/elastic-recheck/#1793364 http://status.openstack.org/elastic-recheck/#1806912 and probably others | 22:34 |
*** bobh has quit IRC | 22:34 | |
*** dave-mccowan has quit IRC | 22:35 | |
*** wolverineav has quit IRC | 22:36 | |
corvus | that's pretty standout | 22:36 |
*** bobh has joined #openstack-infra | 22:37 | |
*** e0ne has joined #openstack-infra | 22:37 | |
corvus | in a bit of irony, all of the jobs for the ze12 change have node assignments but are waiting for an executor to start | 22:38 |
*** e0ne has quit IRC | 22:38 | |
pabelanger | yah, I also noticed that | 22:39 |
pabelanger | something happened at 22:30 (gate reset maybe?) and now backlog in executor queue is growing again | 22:39 |
pabelanger | yah, I think integrated queue | 22:40 |
clarkb | pabelanger: ya tripleo had a couple failures, one was ntp sync and the other a tempest unittest failure | 22:40 |
clarkb | integrated queue may have also reset, not sure | 22:40 |
*** owalsh_ has joined #openstack-infra | 22:40 | |
clarkb | fwiw the tripleo ntp sync issue shows up pretty evenly distributed across the clouds so that is one I don't think is bhs1 related | 22:41 |
clarkb | fungi: ^ re bhs1 we think we ruled out cpu time? | 22:41 |
mordred | corvus: well - I haven't found the wait_for_server issue yet - but I have found that we're apparently not doing discovery cache properly and are re-running discovery needlessly :( | 22:41 |
clarkb | http://logs.openstack.org/01/619701/5/gate/tempest-slow/2bb461b/controller/logs/screen-n-api.txt.gz is a recent example of a bhs1 is "slow" failure | 22:41 |
mordred | kmalloc: ^^ just fyi - I haven't diagnosed the issue yet | 22:41 |
clarkb | that was nova starting up taking 64 seconds which is longer than the devstack timeout allows for | 22:42 |
*** owalsh has quit IRC | 22:42 | |
*** bobh has quit IRC | 22:42 | |
clarkb | in that particular case there is only one period of cpu wai of ~10% for a couple seconds, otherwise it actually looks pretty happy overall. Low load, no paging, no memory pressure and so on | 22:44 |
clarkb | mriedem: if you look at http://logs.openstack.org/01/619701/5/gate/tempest-slow/2bb461b/controller/logs/screen-n-api.txt.gz any hunches to what is slow there? its actually pretty slow (there is a gap) right at the start | 22:44 |
clarkb | mriedem: maybe if we can identify what is particularly slow there we can use that as a bread crumb to go further. (and maybe it is just disk IO though my test instance shows that is fine, so possibly more unhappy hypervisors?) | 22:45 |
mriedem | please hold dear caller, currently slitting wrists in nova | 22:46 |
clarkb | maybe I need to stay up late one of these evenings and try to catch amorin for paired debugging | 22:46 |
clarkb | mriedem: I'm sorry, I hope that it gets better | 22:46 |
*** wolverineav has joined #openstack-infra | 22:47 | |
pabelanger | ouch, another gate reset for integrated | 22:49 |
clarkb | the good news is our classification rate is reasonably high, so figuring this out should have a pretty big impact overall | 22:50 |
clarkb | the bad news is that means it is causing problems :/ | 22:50 |
*** jamesmcarthur has quit IRC | 22:51 | |
clarkb | pabelanger: not a bhs1 failure though | 22:51 |
*** wolverineav has quit IRC | 22:51 | |
pabelanger | yah, but it just spiked our executor backlog again, 273 waiting now. | 22:51 |
pabelanger | maybe we should enqueue 623067 to help get ahead of it | 22:52 |
clarkb | ya mostly just calling it out because addressing the oddity in bhs1 won't solve all the problems | 22:52 |
*** rcernin has quit IRC | 22:52 | |
mordred | corvus: so - unfortunately, I cannot reproduce create_server(wait=True) breaking - oh - except ... | 22:52 |
*** rcernin has joined #openstack-infra | 22:52 | |
mordred | corvus: that same bug would actually cause the wait loop to spin until it times out | 22:52 |
mordred | corvus: so - other than having provided you with absolutely atrocious error messages, this whole thing is fixed with that keystoneauth patch - which I have now submitted a release request for | 22:53 |
*** wolverineav has joined #openstack-infra | 22:53 | |
corvus | mordred: feel free to create ze13 if you want to test :) we can defer adding it into the cluster until we're sure we want it | 22:53 |
mordred | corvus: I created a mttest already ... and it worked (this is with the patched ksa) | 22:54 |
corvus | ok good | 22:54 |
clarkb | I'm going to compile a list of test nodes in the last 6 hours or so that seem to have failed in bhs1 where we have those gaps and maybe that can help debug if it is unhappy hypervisors | 22:54 |
mriedem | clarkb: well we appear to be doing this an ass ton | 22:56 |
mriedem | nova.scheduler.client.report._create_client | 22:56 |
mriedem | Dec 05 20:14:22.572362 ubuntu-xenial-ovh-bhs1-0000959981 devstack@n-api.service[23459]: DEBUG nova.compute.rpcapi [None req-87b2e9bd-e752-452f-9933-005e21b6841b None None] Not caching compute RPC version_cap, because min service_version is 0. Please ensure a nova-compute service has been started. Defaulting to current version. {{(pid=23463) _determine_version_cap /opt/stack/nova/nova/compute/rpcapi.py:397}} Dec 05 20:14:22.5 | 22:56 |
mriedem | 0 ubuntu-xenial-ovh-bhs1-0000959981 devstack@n-api.service[23459]: DEBUG oslo_concurrency.lockutils [None req-87b2e9bd-e752-452f-9933-005e21b6841b None None] Lock "placement_client" acquired by "nova.scheduler.client.report._create_client" :: waited 0.000s {{(pid=23463) inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:327}} | 22:56 |
mriedem | yikes | 22:56 |
mriedem | rmine_version_cap called over 1k times on startup | 22:57 |
mriedem | *determine_version_cap | 22:57 |
mriedem | dansmith: ^ | 22:57 |
openstackgerrit | MarcH proposed openstack-infra/git-review master: test_uploads_with_nondefault_rebase: fix git screen scraping https://review.openstack.org/623096 | 22:57 |
mriedem | my guess is, | 22:58 |
mriedem | nova.compute.api.API init per extension, per api worker | 22:58 |
dansmith | mriedem: instantiating a compute rpc somewhere that gets called a lot? | 22:58 |
mriedem | so at least 2 workers | 22:58 |
mriedem | ~50+ extensions | 22:58 |
*** bobh has joined #openstack-infra | 22:59 | |
dansmith | the value is cached per worker, and the warning is calling out a pretty bad sitch | 22:59 |
dansmith | I guess we could try to only log once per process or something | 22:59 |
mriedem | there just aren't any compute services started before the api in a devstack run | 22:59 |
mriedem | also... | 23:00 |
mriedem | this would be looking in cell0... | 23:00 |
mriedem | where we'll never have computes | 23:00 |
clarkb | mriedem: ok so devstack is detecting it as nova-api not having started, but the root cause is something deeper | 23:00 |
mriedem | devstack times out after 60 seconds | 23:00 |
clarkb | ya and it took 64 seconds to start | 23:00 |
mriedem | this is taking 64 | 23:00 |
mriedem | Dec 05 20:14:27.967716 ubuntu-xenial-ovh-bhs1-0000959981 devstack@n-api.service[23459]: WSGI app 0 (mountpoint='') ready in 64 seconds on interpreter 0x27018c0 pid: 23462 (default app) | 23:00 |
clarkb | mriedem: mostly I'm trying to characterize the slowness there so that we can debug it better | 23:01 |
clarkb | as it may be cloud provider side | 23:01 |
clarkb | the good news is (if we ignore the tripleo failures that show the gap because it's convenient) there are only a small number of failures since amorin made changes this morning | 23:01 |
mriedem | this is one of the slow ovh-bhs1 nodes | 23:01 |
dansmith | mriedem: ah, well, the cell0 thing is legit | 23:02 |
mriedem | so my point is, we're doing this db lookup api workers * extensions for something that will never return anything because | 23:02 |
mriedem | connection = mysql+pymysql://root:secretdatabase@127.0.0.1/nova_cell0?charset=utf8 | 23:02 |
*** jamesmcarthur has joined #openstack-infra | 23:02 | |
dansmith | yeah, that's fair | 23:02 |
dansmith | but | 23:02 |
mriedem | and we don't have nova-compute services in cell0 | 23:02 |
dansmith | it also means we probably have a bug there | 23:02 |
clarkb | mriedem: ok so its two things then, yes slow bhs1 node, but maybe it illustrates a cell0 bug? | 23:03 |
dansmith | because we should be scattering and gathering, even though that won't fix the time zero problem | 23:03 |
mriedem | dansmith: but if we cached? | 23:03 |
clarkb | (I'm mostly interested in figuring out the slow bhs1 nodes, but happy to help on the other thing too) | 23:03 |
mriedem | we're not caching b/c 0 | 23:03 |
fungi | clarkb: yes, amorin confirmed cpu oversubscription in bhs1 was 2:1 and analysis of logstash records after we halved max-servers there still showed 20x as many job timeouts as gra1 | 23:03 |
*** bobh has quit IRC | 23:03 | |
dansmith | mriedem: but we'll continue looking at cell0 I imagine | 23:03 |
mriedem | when n-api starts in devstack we won't have any computes, so this would just always be 0 on first startup | 23:03 |
mriedem | but yeah we are not iterating the cells | 23:04 |
mriedem | should be calling get_minimum_version_all_cells yeah? | 23:04 |
clarkb | fungi: cool fwiw looking at dstat there is significant cpu idle time when things time out so not surprising it isn't cpu (or at least doesn't appear to be) | 23:04 |
dansmith | we have service version get all cells | 23:04 |
dansmith | for a reason | 23:04 |
dansmith | yeah that | 23:04 |
dansmith | mriedem: we probably need a signal that none were found and avoid logging until there are computes that are old, not just missing | 23:05 |
dansmith | although that could be a legit misconfig too I guess | 23:05 |
*** wolverineav has quit IRC | 23:05 | |
dansmith | mriedem: if you file a bug and/or remind me tomorrow I could probably take a look at that | 23:05 |
*** jamesmcarthur has quit IRC | 23:05 | |
pabelanger | another gate reset, this time cinder and constraints job | 23:05 |
mriedem | dansmith: will do | 23:06 |
*** wolverineav has joined #openstack-infra | 23:12 | |
pabelanger | corvus: clarkb: we are now up to the same number of jobs running as are queued in the executor backlog, starting builds seems to be our hurdle right now. And with the last few gate resets in the last hour, executors cannot get ahead. At this rate, i think we are still another hour out for 623067 to land. | 23:12 |
corvus | pabelanger: seems likely. tripleo is about to reset again too. | 23:14 |
corvus | (that job at the top timed out) | 23:14 |
clarkb | infra-root I've started to summarize on https://etherpad.openstack.org/p/bhs1-test-node-slowness | 23:14 |
corvus | (it was on bhs1) | 23:14 |
clarkb | in the case of tripleo and bhs1 I think their jobs put a lot more demand on the test nodes than eg tempest | 23:15 |
*** larainema has joined #openstack-infra | 23:15 | |
clarkb | and tripleo may be our noisy neighbor there possibly compounded by some external slowness | 23:15 |
mriedem | clarkb: dansmith: https://bugs.launchpad.net/nova/+bug/1807044 | 23:16 |
openstack | Launchpad bug 1807044 in OpenStack Compute (nova) "nova-api startup does not scan cells looking for minimum nova-compute service version" [Medium,Confirmed] | 23:16 |
clarkb | from what I can tell looking at things that are not tripleo we have far fewer failures on bhs1 since amorin made changes this morning | 23:16 |
clarkb | we need more data to say for sure, but I think those changes did improve the situation for us. | 23:16 |
clarkb | mriedem: thank you | 23:17 |
clarkb | mwhahaha: ssbarnea|rover http://logs.openstack.org/99/616199/4/gate/tripleo-buildimage-overcloud-full-centos-7/c027629/job-output.txt.gz#_2018-12-05_22_56_12_383902 a case of broken ansible? | 23:21 |
mwhahaha | hmm that's odd | 23:22 |
mwhahaha | we set environment_type in job defs (i think), maybe it got missed | 23:22 |
corvus | it takes longer to start a big integration job for a change deep in the integrated gate, compounding the problem with the executors. | 23:23 |
mwhahaha | yea it inherits from tripleo-ci-base which does not have the environment_type set | 23:23 |
* mwhahaha fixes | 23:23 | |
corvus | clarkb, pabelanger: we also seem to be using swap a lot | 23:24 |
clarkb | corvus: ya particularly in the tripleo jobs I think. I noticed there is a ton of paging on some of those jobs | 23:25 |
*** rlandy is now known as rlandy|bbl | 23:25 | |
*** owalsh_ is now known as owalsh | 23:25 | |
corvus | clarkb: i mean on the executors | 23:25 |
clarkb | oh | 23:25 |
clarkb | interesting | 23:25 |
notmyname | I was going to try to make a patch for zuul's dashboard, but then I realized I don't know anything about react | 23:26 |
corvus | http://cacti.openstack.org/cacti/graph.php?local_graph_id=64005&rra_id=all | 23:26 |
corvus | notmyname: https://www.softwarefactory-project.io/react-for-python-developers.html | 23:26 |
clarkb | corvus: there is free memory available on that executor too | 23:26 |
mwhahaha | clarkb: hmm in poking at these base definitions, where is the 'multinode' base job defined? is that in zuul somewhere | 23:26 |
notmyname | corvus: you certainly had that link available quickly :-) | 23:27 |
mwhahaha | nm i found it in zuul-jobs/zuul.yaml | 23:27 |
corvus | notmyname: :) https://www.google.com/search?q=tristan+react+python | 23:27 |
*** wolverineav has quit IRC | 23:27 | |
corvus | at least, that works in my google bubble | 23:27 |
notmyname | heh | 23:27 |
notmyname | who's tristan? | 23:28 |
corvus | notmyname: tristanC, he wrote most of the current dashboard | 23:28 |
notmyname | ah, ok | 23:28 |
clarkb | mwhahaha: it is defined in openstack-infra/zuul-jobs/zuul.yaml | 23:29 |
clarkb | mwhahaha: oh you found it :) | 23:29 |
corvus | clarkb: yeah, and there are no memory hogs, so i'm kinda curious what's getting swapped out | 23:29 |
clarkb | corvus: ya top says ~4% for executor then a bunch of 1%ish ansible playbooks | 23:30 |
mordred | notmyname: https://zuul-ci.org/docs/zuul/developer/javascript.html also might be helpful, toolchain-wise | 23:30 |
clarkb | corvus: apparently /proc can tell us /me looks | 23:30 |
corvus | clarkb: oh, tell me about this magic when you get a chance :) | 23:31 |
*** mriedem has quit IRC | 23:31 | |
mwhahaha | clarkb: actually this is caused by https://review.openstack.org/#/c/618669/10/playbooks/tripleo-ci/post.yaml in the gate | 23:31 |
clarkb | corvus: VmSwap: 2663556 kB for the executor says /proc. Uh it's grep VmSwap in /proc/$pid/status | 23:31 |
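The same check, wrapped up so it can be pointed at any pid (e.g. the zuul-executor process):

    # The /proc check described above: report VmRSS and VmSwap for a pid.
    # Both are standard fields in /proc/<pid>/status on Linux.
    def mem_status(pid):
        fields = {}
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                key, _, value = line.partition(":")
                if key in ("VmRSS", "VmSwap"):
                    fields[key] = value.strip()
        return fields

    print(mem_status(12345))  # placeholder pid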
* fungi throws rip torn style glitter in the air and shouts "kernel magic!" | 23:32 | |
clarkb | mwhahaha: ah, so I got ahead of myself with the debugging. Though not sure if that change will fail on its own | 23:32 |
mwhahaha | it won't | 23:32 |
mwhahaha | but it didn't fail that job either | 23:32 |
clarkb | corvus: I think that's a big chunk of it there, odd that its resident size is comparatively tiny | 23:33 |
mwhahaha | unless i'm missing something, http://logs.openstack.org/99/616199/4/gate/tripleo-buildimage-overcloud-full-centos-7/c027629/job-output.txt.gz#_2018-12-05_22_56_13_434974 seems to show that the error was just ignored | 23:34 |
clarkb | mwhahaha: it resulted in a post-failure result for the job. I think the "normal" there means that we didn't short circuit or abandon the play, it finished "normally" with error | 23:34 |
mwhahaha | ok i'll kick the other change out of the gate | 23:35 |
corvus | clarkb: i don't see much of significance that has changed in the executor code over the past few months | 23:35 |
clarkb | corvus: I wonder if this is side effect of how python does dicts and data structures in general. basically it grows them and they never shrink (granted releasing memory back to OS is not easy outside of python either) | 23:36 |
clarkb | corvus: at some point the executor may have allocated a bunch of memory for "something" | 23:37 |
clarkb | and now it's not needed so we see the delta between rss and vmswap | 23:37 |
fungi | yes, it's not uncommon to see a python process gobbling lots of memory but with a small resident size | 23:37 |
pabelanger | oh, I didn't even think to check cacti.o.o. Yah, wonder what is going on there | 23:37 |
fungi | we likely need some sort of python memory profiler imported and logging sizes of (used and cleared) structures to know where it's all going | 23:38 |
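One concrete option for that is the stdlib tracemalloc module; a generic sketch (not wired into zuul) would be started inside the executor process and dumped periodically:

    # Generic tracemalloc sketch, not actual zuul instrumentation.
    import tracemalloc

    tracemalloc.start(25)   # keep 25 frames of traceback per allocation
    # ... let the workload run for a while ...
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:10]:
        print(stat)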
corvus | http://cacti.openstack.org/cacti/graph_view.php?action=preview&host_id=0&rows=100&columns=-1&graph_template_id=41&thumbnails=true&filter=ze | 23:38 |
clarkb | started ~week and a half ago? likely unrelated to the relative priority changes then | 23:39 |
clarkb | but they all started doing it together | 23:39 |
clarkb | (for the most part) | 23:39 |
fungi | ze02 is either getting overwhelmed or is suffering unrelated packet loss from cacti.o.o (ongoing rackspace ipv6 issues?) | 23:40 |
clarkb | fungi: it has hadthose issues in the past | 23:40 |
fungi | ~1.5 weeks ago is when a lot of our contributors decided their holiday vacation following the summit was over | 23:40 |
clarkb | maybe we should just rebuild it | 23:40 |
corvus | clarkb: the only restart i have in the logs is from nov 28. | 23:41 |
corvus | which seems right around the beginning of the increase | 23:41 |
corvus | that was 3.3.1.dev61 (whatever that is) | 23:42 |
corvus | i don't know what it was running before | 23:42 |
pabelanger | that's when we did the security fix, right? | 23:42 |
corvus | yep | 23:42 |
fungi | so could be that a change between the 2018-11-28 and the restart before that introduced a memory leak, or just a memory inefficiency maybe | 23:42 |
fungi | doesn't look especially leaky as it's not really climbing | 23:43 |
*** graphene has quit IRC | 23:43 | |
clarkb | corvus: 8715505e6d38c092257179b8a089a2a560df5e58 that should be 3.3.1.dev61 according to git describe | 23:44 |
clarkb | that was the restart for the security fix | 23:44 |
*** _alastor_ has quit IRC | 23:44 | |
corvus | clarkb: makes sense | 23:44 |
*** wolverineav has joined #openstack-infra | 23:45 | |
clarkb | looking at commits before that, the semaphore fix, the "python3 can't do unbuffered binary io" fix, and possibly "Include executor variables in initial setup call" stand out to me reading the log | 23:47 |
clarkb | ah we don't run the console logger thing under python3 currently so that binary io one should be a noop | 23:48 |
clarkb | however, that does make me wonder if the place where we might be allocating large buffers then not using them much of the time is the log streaming | 23:49 |
clarkb | that goes through the executor right? | 23:49 |
*** jamesmcarthur has joined #openstack-infra | 23:50 | |
*** bobh has joined #openstack-infra | 23:51 | |
corvus | clarkb: zuul_console runs on the remote node, so shouldn't affect things. yes, the executor has a subprocess which handles log streaming | 23:51 |
corvus | clarkb: it's not the one that's eating up all that swap | 23:51 |
corvus | it's an order of magnitude smaller | 23:51 |
corvus | on ze07: 2018-11-15 23:13:16,178 DEBUG zuul.Executor: Configured logging: 3.3.1.dev31 | 23:52 |
corvus | it must have rebooted or something | 23:53 |
corvus | its graphs don't look like the rest | 23:53 |
clarkb | b31b866fbca35939204bb69c447cd68412efa9ce dev31 is that commit I think | 23:54 |
*** jamesmcarthur has quit IRC | 23:54 | |
corvus | and 2018-11-06 19:28:34,195 DEBUG zuul.Executor: Configured logging: 3.3.1.dev15 | 23:55 |
corvus | on ze09 | 23:55 |
corvus | also has different-looking graphs | 23:55 |
clarkb | between 31 and 61 in the executor server file there are changes to job output parsing for the line-based commenting | 23:56 |
corvus | ze07's increased usage starts shortly after that restart | 23:56 |
clarkb | other than that it's just winrm fixes | 23:56 |
corvus | ze09 looks suspicious after its restart as well | 23:57 |
clarkb | actually maybe it's not per-line commenting. But it is something that iterates through the log file then modifies it if an error condition is met | 23:58 |
clarkb | 5e9f77326 is the commit that changes things in that part of the code | 23:59 |
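Purely as an illustration of the concern (not the actual zuul code): whether a job-output scan reads the whole file at once or streams it line by line makes a large difference to peak memory on multi-hundred-MB logs.

    # Illustrative only; not the zuul implementation. The first variant holds
    # the whole log in memory, the second streams it one line at a time.
    def scan_all_at_once(path):
        data = open(path).read()
        return [l for l in data.splitlines() if "ERROR" in l]

    def scan_streaming(path):
        with open(path) as f:
            return [l for l in f if "ERROR" in l]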