Thursday, 2018-12-06

clarkb	ah it's going through the ansible log, not the job console log	00:00
corvus	clarkb: so 1deb5f1e391aa7eea4d84b2032bb1c970e005500 would be dev15?	00:00
clarkb	1deb5f1e391aa7eea4d84b2032bb1c970e005500 is what I find for dev15	00:03
clarkb	yup	00:03
clarkb	(the thing that makes this weird is that pbr will do 3.3.1.dev$commits since 3.3.0 but git describe will do 3.3.0-$commits since 3.3.0)	00:03
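A minimal sketch of the mismatch described above, assuming a plain `git describe` string of the form `<tag>-<commits>-g<sha>` and assuming only the patch component is bumped for the dev version (pbr's real logic can differ, for example when commit messages request a larger bump). The function name `describe_to_pbr_dev` is hypothetical.

```python
import re


def describe_to_pbr_dev(describe: str) -> str:
    """Convert 'X.Y.Z-N-gSHA' (git describe) to 'X.Y.(Z+1).devN' (pbr-style)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)-g[0-9a-f]+", describe)
    if match is None:
        # Exactly on a tag, or a format this sketch does not handle:
        # return the input unchanged.
        return describe
    major, minor, patch, commits = (int(group) for group in match.groups())
    return f"{major}.{minor}.{patch + 1}.dev{commits}"
```

With the commit discussed above, `describe_to_pbr_dev("3.3.0-15-g1deb5f1")` gives the dev15 form of the version.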
*** jamesmcarthur has joined #openstack-infra		00:08
openstackgerrit	Paul Belanger proposed openstack-infra/nodepool master: Include host_id for openstack provider https://review.openstack.org/623107	00:08
corvus	clarkb, pabelanger: none of those changes look suspicious. we don't have objgraph installed on the executors, so we can't get a histogram of memory usage.	00:11
corvus	meanwhile, the current executor behavior is highly anomalous for the last 90 days. this is the first time we've ever had more jobs queued than running.	00:11
corvus	we're running a mere 331 jobs right now	00:12
*** jamesmcarthur has quit IRC		00:12
clarkb	huh	00:12
pabelanger	agree, have all the executors been rebooted recently? I haven't checked to see if maybe we have a new hwe kernel	00:12
corvus	infra-root: so i'd say that whatever this problem is, it's critical at this point.	00:12
clarkb	pabelanger: I don't think they have, ~111 days	00:12
*** _alastor_ has joined #openstack-infra		00:13
clarkb	corvus: do we want to rotate executors out and see if a reboot resets swap and returns them to happiness?	00:13
clarkb	(then monitor it for any signs of swap returning)	00:13
clarkb	mostly thinking that without instrumentation we likely aren't going to debug the swappage now, but restarting may give us breathing room	00:14
corvus	clarkb: yeah, i think it's likely to buy us a few days at least, assuming this behavior is consistent with the way it's been for the past week	00:14
corvus	yeah	00:14
pabelanger	+1	00:14
corvus	so let's pip3 install objgraph on all of them and do that	00:14
pabelanger	++	00:14
clarkb	let me know how I can help	00:14
corvus	though i'm just inclined to go ahead and hard-restart them all, rather than rotate out	00:14
corvus	mostly because it's eod	00:15
clarkb	also only using 30% capacity anyway (so the shock is low)	00:15
corvus	okay, objgraph is installed everywhere	00:15
corvus	i'll stop all executors now	00:16
clarkb	status page reflects that executors are stopping	00:17
*** gyee has quit IRC		00:17
*** _alastor_ has quit IRC		00:17
clarkb	ze01 appears to have stopped its executor	00:24
corvus	curiously 7 and 9 seem to be the slowest to stop	00:24
corvus	all stopped now; i will reboot them all	00:25
pabelanger	ack	00:25
corvus	they're starting	00:26
corvus	all running except 8,9,10	00:27
*** wolverineav has quit IRC		00:27
corvus	all running now; i guess those were just slow to boot	00:27
corvus	gah, i should have deleted the old builds	00:28
clarkb	was that not fixed?	00:28
corvus	clarkb: not merged: https://review.openstack.org/620697	00:28
clarkb	fwiw ze01 looks sane so far. Memory usage seems to be roughly proportional to the number of processes running	00:29
clarkb	ah	00:29
johnsom	I am guessing you all are talking about the jobs that have been sitting for over an hour, started, but no progress/stream?	00:29
*** wolverineav has joined #openstack-infra		00:30
clarkb	johnsom: they aren't quite started yet. They go to the empty box on the status page as soon as they have a node assigned aiui, then you have to wait for an executor to pick up that node and start the job. But yes	00:30
johnsom	Yep, cool. Just checking that it's a known issue.	00:31
corvus	i'll manually delete some build dirs (lots of old stuff sitting around will cause du to waste cycles)	00:31
clarkb	snmp hasn't quite caught up on all the nodes according to cacti but spot checking by hand it looks like things haven't immediately returned to the former state	00:32
clarkb	corvus: another thing I notice is that ansible 2.5 had a release at the end of october that we may have picked up? there have been a couple since then too (which we have been using on more recent restarts)	00:34
corvus	clarkb: yeah, i wonder if something about that could affect it. we don't import it or anything, but it could be using more memory and driving us into swap in general. or outputting more data that we capture or something.	00:35
clarkb	ya. The change log https://github.com/ansible/ansible/blob/stable-2.5/changelogs/CHANGELOG-v2.5.rst looks sane though	00:35
clarkb	we are now running more jobs than are queued	00:37
*** jaosorior has quit IRC		00:37
corvus	#status log rebooted all zuul executors (ze01-ze11) due to suspected performance degradation due to swap. underlying cause is unclear, but may be due to a regression in zuul introduced since 3.3.0, or in dependencies (including ansible). objgraph installed on all executors to support future memory profiling.	00:40
openstackstatus	corvus: finished logging	00:40
corvus	clarkb, pabelanger, tobiash: i'm not 100% sure i want to put ze12 into production at this point. we may have been wrong about our supposition for the increased queued jobs.	00:42
pabelanger	sure, makes sense	00:42
clarkb	ya if slowness was caused by memory issues we may not need it	00:43
clarkb	corvus: fwiw I did approve the change to puppet ze12	00:43
clarkb	do we need to -W it?	00:43
corvus	especially based on the sort of exponential regression we were seeing	00:43
pabelanger	yah, we should in next 5mins, about to merge	00:43
corvus	i'll -w it	00:43
pabelanger	I'll look at sf.io tomorrow to see if we are also seeing an increase of swap	00:44
clarkb	corvus: on ze01 and ze03 there are a few megabytes of swap being used, none of it from the two zuul-executor processes	00:44
*** yamamoto has joined #openstack-infra		00:45
corvus	clarkb: yeah, looking at the list it seems pretty reasonable -- kernel just moving idle stuff out of the way	00:46
corvus	also, we don't need to run apache on those servers	00:46
clarkb	++	00:46
corvus	i've deleted old build dirs from the 3 largest offenders, so the servers should be generally in-line now. there may be a few stragglers, but no big deal	00:47
corvus	clarkb, pabelanger: it's possible that ansible is using more memory and the only thing to do about it is just to add more executors after all	00:50
corvus	i kinda don't want to jump to that conclusion though	00:50
pabelanger	Yah, I can also look at open issues in ansible/ansible tomorrow, see if anybody has reported anything	00:50
clarkb	corvus: looks like there are ~200 playbooks running on ze01 but only ~60 jobs?	00:51
pabelanger	like clarkb said, there have been a few releases of 2.5 recently	00:51
clarkb	I guess that could be ansible forking	00:51
clarkb	ah yup it appears there are multiple ansible playbook processes running concurrently per build	00:51
*** Swami has quit IRC		00:52
corvus	queued jobs: 0	00:52
pabelanger	yay	00:52
clarkb	corvus: if that is the case we'd still want to have the governors throttle such that they don't swap	00:53
clarkb	though the swap was from the zuul-executor process so I dunno	00:54
corvus	clarkb: the zuul-executor process on ze01 is the same virt size it was before the reboot	00:54
corvus	1882 zuul 20 0 5534272 175336 10224 S 45.5 2.1 9:10.47 zuul-executor	00:54
corvus	less resident	00:54
*** Belgar81 has joined #openstack-infra		00:55
clarkb	and about 10MB into swap now	00:55
clarkb	(far less than before)	00:55
corvus	but given that we're running all out now, and we've achieved the same virtual size as before, and pretty close to the same resident size (what was it, like 300000 or 400000 before?) i'm not sure the executor is going to turn out to be the smoking gun	00:56
corvus	i'm going to eod now	00:58
clarkb	ya I need to pop out myself.	00:59
clarkb	ianw and/or fungi if amorin wanders by later today/tomorrow morning (relative to me) maybe you can point out https://etherpad.openstack.org/p/bhs1-test-node-slowness I've tried to capture what we/I know there	00:59
clarkb	corvus: thinking out loud here it might be good to instrument things like ansible as used by zuul so that we'll know if/when there are regressions in performance or resource usage	01:00
clarkb	that feedback might also be useful to ansible itself	01:01
*** sthussey has quit IRC		01:17
*** yamamoto has quit IRC		01:18
*** yamamoto has joined #openstack-infra		01:18
*** tosky has quit IRC		01:19
*** rkukura has quit IRC		01:20
*** harlowja has quit IRC		01:27
*** markvoelker has quit IRC		01:33
rm_work	hey, how complex is the process of getting a cloud added to nodepool? including technical and legal/political/whatever -- I assume there's all sorts of things that need to be signed?	01:35
*** betherly has joined #openstack-infra		01:40
*** betherly has quit IRC		01:44
*** david-lyle has joined #openstack-infra		01:48
*** manjeets_ has joined #openstack-infra		01:49
clarkb	rm_work: https://docs.openstack.org/infra/system-config/contribute-cloud.html is the doc we have for it. It tends to be more informal and we try to do our best to accommodate the needs of the provider	01:49
rm_work	cool cool	01:50
clarkb	corvus: I have a really derpy script at ze01:~clarkb/swap.sh that looks for playbooks using more than 60MB ish of swap. It seems that "larger" jobs tend to be in that club, things like grenade and tripleo and lbaas jobs	01:50
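The contents of swap.sh are not shown in this log. As a rough sketch of the same idea, the following reads `VmSwap` from `/proc/<pid>/status` on Linux and reports processes above roughly 60MB. Filtering on the string `ansible-playbook` in the command line is an assumption about how playbook processes appear, and the function names are hypothetical.

```python
import glob
import re

SWAP_THRESHOLD_KB = 60 * 1024  # "60MB ish", per the message above


def vmswap_kb(status_text: str) -> int:
    """Return the VmSwap value (in kB) from /proc/<pid>/status text, or 0."""
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def heavy_swappers(threshold_kb: int = SWAP_THRESHOLD_KB):
    """Yield (pid, swap_kb, cmdline) for playbook processes above the threshold."""
    for status_path in glob.glob("/proc/[0-9]*/status"):
        pid = status_path.split("/")[2]
        try:
            with open(status_path) as f:
                swap_kb = vmswap_kb(f.read())
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited or is unreadable; skip it
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        # Assumption: playbook runs are identifiable by this substring.
        if swap_kb > threshold_kb and "ansible-playbook" in cmdline:
            yield int(pid), swap_kb, cmdline


if __name__ == "__main__":
    for pid, swap_kb, cmdline in heavy_swappers():
        print(f"{pid}\t{swap_kb} kB\t{cmdline[:80]}")
```

Kernel threads have no `VmSwap` line, so `vmswap_kb` treats a missing field as zero rather than raising.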
*** dklyle has quit IRC		01:51
*** manjeets has quit IRC		01:51
clarkb	also it seems that swap usage may have stabilized a bit. And now really calling it a day	01:51
*** _alastor_ has joined #openstack-infra		02:13
*** mrsoul has quit IRC		02:15
*** _alastor_ has quit IRC		02:18
*** hongbin has joined #openstack-infra		02:35
*** wolverineav has quit IRC		02:40
*** bhavikdbavishi has joined #openstack-infra		02:41
*** wolverineav has joined #openstack-infra		02:41
*** ykarel has joined #openstack-infra		02:41
*** bhavikdbavishi1 has joined #openstack-infra		02:44
*** yamamoto has quit IRC		02:45
*** bhavikdbavishi has quit IRC		02:45
*** bhavikdbavishi1 is now known as bhavikdbavishi		02:45
*** wolverineav has quit IRC		02:46
*** betherly has joined #openstack-infra		02:51
*** imacdonn has quit IRC		02:53
*** imacdonn has joined #openstack-infra		02:53
*** betherly has quit IRC		02:55
*** rlandy\|bbl is now known as rlandy		03:09
*** rlandy has quit IRC		03:10
openstackgerrit	Paul Belanger proposed openstack-infra/nodepool master: Include host_id for openstack provider https://review.openstack.org/623107	03:12
*** wolverineav has joined #openstack-infra		03:21
*** yamamoto has joined #openstack-infra		03:24
*** wolverineav has quit IRC		03:26
*** wolverineav has joined #openstack-infra		03:30
*** psachin has joined #openstack-infra		03:32
*** wolverineav has quit IRC		03:34
*** yamamoto has quit IRC		03:34
*** ramishra has quit IRC		03:36
*** wolverineav has joined #openstack-infra		03:46
*** hwoarang has quit IRC		03:47
*** hwoarang has joined #openstack-infra		03:50
*** wolverineav has quit IRC		04:02
*** wolverineav has joined #openstack-infra		04:03
*** diablo_rojo has quit IRC		04:06
*** wolverineav has quit IRC		04:07
*** yamamoto has joined #openstack-infra		04:29
*** betherly has joined #openstack-infra		04:32
*** hongbin has quit IRC		04:33
*** janki has joined #openstack-infra		04:34
*** ramishra has joined #openstack-infra		04:35
*** betherly has quit IRC		04:37
*** rh-jelabarre has quit IRC		04:41
*** ykarel is now known as ykarel\|afk		04:50
*** lpetrut has joined #openstack-infra		04:52
*** ykarel\|afk has quit IRC		04:54
*** yboaron has joined #openstack-infra		05:02
*** apetrich has quit IRC		05:07
*** ykarel\|afk has joined #openstack-infra		05:10
*** ykarel\|afk is now known as ykarel		05:10
openstackgerrit	Ian Wienand proposed openstack/diskimage-builder master: [wip] rhel8 beta support https://review.openstack.org/623137	05:13
*** ahosam has joined #openstack-infra		05:32
*** lpetrut has quit IRC		05:36
*** wolverineav has joined #openstack-infra		05:43
*** apetrich has joined #openstack-infra		05:48
tonyb	tobias-urdin: I sent a list of repos to openstack-discuss can you verify them and then I'll get them taken care of	05:49
*** wolverineav has quit IRC		05:50
*** _alastor_ has joined #openstack-infra		06:15
*** _alastor_ has quit IRC		06:19
openstackgerrit	Tristan Cacqueray proposed openstack-infra/zuul master: web: update status page layout based on screen size https://review.openstack.org/622010	06:43
*** ahosam has quit IRC		07:00
*** wolverineav has joined #openstack-infra		07:05
*** wolverineav has quit IRC		07:10
*** bhavikdbavishi has quit IRC		07:13
*** jtomasek has joined #openstack-infra		07:22
*** ahosam has joined #openstack-infra		07:24
*** yboaron has quit IRC		07:26
*** yboaron has joined #openstack-infra		07:26
*** dpawlik has joined #openstack-infra		07:28
openstackgerrit	Tobias Henkel proposed openstack-infra/zuul master: Report tenant and project specific resource usage stats https://review.openstack.org/616306	07:33
*** gfidente has joined #openstack-infra		07:35
*** e0ne has joined #openstack-infra		07:38
*** ginopc has joined #openstack-infra		07:50
*** irdr has quit IRC		07:55
*** rcernin has quit IRC		07:56
amorin	hey all	07:57
*** pcaruana has joined #openstack-infra		07:58
*** pcaruana is now known as muttley		07:58
*** shardy has joined #openstack-infra		08:02
*** rcernin has joined #openstack-infra		08:03
*** shardy has quit IRC		08:05
amorin	so as far as I can read, the results are a little bit better since I moved the disk sched to deadline, but this is still not perfect.	08:05
*** lpetrut has joined #openstack-infra		08:06
amorin	on my side, I am investigating two things: re-enabling the VMX flag on cpu (for nested virt)	08:06
openstackgerrit	Tristan Cacqueray proposed openstack-infra/zuul master: web: refactor jobs page to use a reducer https://review.openstack.org/621396	08:06
*** slaweq has joined #openstack-infra		08:06
openstackgerrit	Tristan Cacqueray proposed openstack-infra/zuul master: web: refactor job page to use a reducer https://review.openstack.org/623156	08:06
openstackgerrit	Tristan Cacqueray proposed openstack-infra/zuul master: web: refactor tenants page to use a reducer https://review.openstack.org/623157	08:06
amorin	and also completely removing iotune on osf flavors	08:06
amorin	cc fungi clarkb mordred dmsimard	08:07
*** slaweq has quit IRC		08:15
*** florianf\|afk is now known as florianf		08:17
*** ykarel is now known as ykarel\|lunch		08:18
frickler	amorin: the comment regarding kvm caching iiuc would amount to setting "libvirt:disk_cachemodes=writeback" in nova.conf	08:18
frickler	amorin: but I'd defer that to a second step	08:18
frickler	amorin: looking at the last 6h, I still see about 50% of the job timeouts on bhs1, which sadly doesn't look like much progress	08:20
*** ralonsoh has joined #openstack-infra		08:21
amorin	frickler: ok	08:26
*** shardy has joined #openstack-infra		08:28
tobias-urdin	tonyb: will check it out asap, sorry missed that yesterday was out on adventures	08:32
*** rcernin has quit IRC		08:33
tonyb	tobias-urdin: All good. I hope they were good adventures :)	08:34
tobias-urdin	tonyb: answered on ML, but that list is correct and complete, thanks tonyb!	08:42
*** AJaeger has quit IRC		08:49
*** AJaeger has joined #openstack-infra		08:51
*** ahosam has quit IRC		08:54
*** bhavikdbavishi has joined #openstack-infra		08:55
*** bkero has quit IRC		08:55
*** jpich has joined #openstack-infra		08:56
*** lpetrut has quit IRC		08:57
*** tosky has joined #openstack-infra		09:00
*** markvoelker has joined #openstack-infra		09:01
*** ahosam has joined #openstack-infra		09:01
*** kjackal has joined #openstack-infra		09:10
*** gfidente has quit IRC		09:14
*** ramishra has quit IRC		09:19
*** ramishra has joined #openstack-infra		09:21
*** ahosam has quit IRC		09:26
*** ahosam has joined #openstack-infra		09:26
*** ykarel\|lunch is now known as ykarel		09:27
*** witek has joined #openstack-infra		09:28
*** dtantsur\|afk is now known as dtantsur		09:29
*** markvoelker has quit IRC		09:34
*** derekh has joined #openstack-infra		09:44
*** sshnaidm\|afk has quit IRC		09:45
*** sshnaidm\|afk has joined #openstack-infra		09:46
*** bhavikdbavishi has quit IRC		09:49
*** electrofelix has joined #openstack-infra		10:04
dulek	Hey, is it possible to sync job lists between two repos in Zuul v3?	10:05
dulek	It's a bit hard to keep tempest plugin repo job list in sync with the main repo manually.	10:06
*** sshnaidm\|afk is now known as sshnaidm		10:12
*** ahosam has quit IRC		10:15
*** e0ne has quit IRC		10:24
*** gfidente has joined #openstack-infra		10:26
*** yboaron_ has joined #openstack-infra		10:26
*** e0ne has joined #openstack-infra		10:27
*** yboaron has quit IRC		10:29
*** markvoelker has joined #openstack-infra		10:31
frickler	dulek: iiuc the usual solution would be to use project-templates, see https://zuul-ci.org/docs/zuul/user/config.html#project-template	10:32
*** sshnaidm has quit IRC		10:33
*** sshnaidm has joined #openstack-infra		10:34
dulek	frickler: Oh my, how early is Christmas this year, this is awesome!	10:34
dulek	Thanks!	10:34
frickler	for an example see http://git.openstack.org/cgit/openstack/designate/tree/.zuul.yaml#n155 vs. http://git.openstack.org/cgit/openstack/designate-tempest-plugin/tree/.zuul.yaml	10:35
*** Belgar81 has quit IRC		10:54
*** markvoelker has quit IRC		11:05
*** yboaron_ has quit IRC		11:05
*** yboaron_ has joined #openstack-infra		11:08
*** jesusaur has quit IRC		11:27
*** jesusaur has joined #openstack-infra		11:31
*** yboaron_ has quit IRC		11:33
*** bhavikdbavishi has joined #openstack-infra		11:48
*** agopi has quit IRC		11:52
*** gfidente has quit IRC		11:54
stephenfin	fungi, clarkb: When you're about, fancy taking a look at these three doc changes for git-review? https://review.openstack.org/#/q/project:openstack-infra/git-review+status:open+owner:%22Stephen+Finucane+%253Cstephenfin%2540redhat.com%253E%22	11:58
ssbarnea\|rover	does anyone know the bashate-friendly way of doing something like local foo=$(cmd) ? -- see https://github.com/openstack-dev/bashate/blob/master/bashate/messages.py#L166	12:01
openstackgerrit	Sorin Sbarnea proposed openstack-infra/git-review master: Stash and unstash changes during download https://review.openstack.org/340024	12:07
*** psachin has quit IRC		12:08
*** sshnaidm is now known as sshnaidm\|bbl		12:08
*** irdr has joined #openstack-infra		12:13
*** _alastor_ has joined #openstack-infra		12:18
*** yamamoto has quit IRC		12:22
*** _alastor_ has quit IRC		12:23
*** ahosam has joined #openstack-infra		12:27
*** mriedem has joined #openstack-infra		12:27
*** gfidente has joined #openstack-infra		12:28
*** yboaron_ has joined #openstack-infra		12:30
*** yamamoto has joined #openstack-infra		12:31
*** pbourke has quit IRC		12:40
*** pbourke has joined #openstack-infra		12:40
*** udesale has joined #openstack-infra		12:41
openstackgerrit	Dirk Mueller proposed openstack-infra/openstack-zuul-jobs master: use opensuse15 as generic name instead of opensuse150 https://review.openstack.org/619628	12:45
*** tpsilva has joined #openstack-infra		12:48
frickler	ssbarnea\|rover: split it into two commands: local foo; foo=$(cmd)	12:49
*** hjensas has quit IRC		12:49
ssbarnea\|rover	fresta: thanks. I was considering it but I was not sure if that had the desired behavior.	12:49
*** betherly has joined #openstack-infra		12:52
*** eharney has quit IRC		12:54
openstackgerrit	Merged openstack/os-performance-tools master: Change openstack-dev to openstack-discuss https://review.openstack.org/622173	12:56
*** rh-jelabarre has joined #openstack-infra		12:57
*** rlandy has joined #openstack-infra		12:58
*** bobh has quit IRC		13:00
*** bobh has joined #openstack-infra		13:00
*** betherly has quit IRC		13:01
openstackgerrit	Ghanshyam Mann proposed openstack-dev/hacking master: Fix 'ref' format errors in README file https://review.openstack.org/623203	13:05
openstackgerrit	Ghanshyam Mann proposed openstack-dev/hacking master: Fix 'ref' format errors in README file https://review.openstack.org/623203	13:06
*** muttley has quit IRC		13:08
*** boden has joined #openstack-infra		13:15
*** dave-mccowan has joined #openstack-infra		13:16
*** rcernin has joined #openstack-infra		13:17
fungi	#status log deleted stale /var/log/exim4/paniclog on ns2.opendev.org to silence nightly cron alert e-mails about it	13:17
openstackstatus	fungi: finished logging	13:17
*** dave-mccowan has quit IRC		13:21
*** muttley has joined #openstack-infra		13:21
*** muttley has quit IRC		13:25
*** muttley has joined #openstack-infra		13:26
Linkid	hi	13:29
*** rcernin has quit IRC		13:29
Linkid	fungi: about peertube, can I add a page here : https://docs.openstack.org/infra/system-config/systems.html ?	13:29
*** muttley has quit IRC		13:29
Linkid	(as WIP)	13:29
*** yboaron_ has quit IRC		13:29
AJaeger	Linkid: a spec is the better next step	13:30
AJaeger	Linkid: against openstack-infra/infra-specs repo	13:30
AJaeger	Linkid: once the spec is approved, adding a page is one step	13:30
Linkid	and corvus talked about ansible to deploy services, but I only saw puppet classes for services on the system-config repo	13:30
fungi	Linkid: yes, in whatever change implements the configuration management for the service you would also add a document there explaining its management, but AJaeger is right, after the mailing list discussion the next step is likely an infra-spec describing how we will get it bootstrapped	13:31
Linkid	ok, I'll make a spec this Friday or this week-end, then	13:31
Linkid	(I don't have enough time today)	13:32
fungi	Linkid: there is a template file in the openstack-infra/infra-specs repo you can fill in with the proposal, see the readme in that repo for instructions	13:32
Linkid	oh, great :)	13:33
fungi	and there's no hurry, we operate on the assumption that people are working on these sorts of things in their spare/volunteer time anyway	13:33
fungi	and feel free to look at other approved specs in that repo for examples, or ask questions in here or on the ml if you need help	13:34
*** pcaruana has joined #openstack-infra		13:34
Linkid	ok, I will read some other specs today, I think	13:35
*** kota_ has quit IRC		13:36
Linkid	thanks for your help :)	13:36
fungi	my pleasure!	13:37
*** ahosam has quit IRC		13:37
*** ahosam has joined #openstack-infra		13:37
fungi	rereading the readme in the specs repo, i can see that it could use some clarifications too. i'll improve it a bit here shortly	13:38
*** kota_ has joined #openstack-infra		13:38
*** pcaruana has quit IRC		13:39
*** rfolco has quit IRC		13:41
*** rfolco has joined #openstack-infra		13:41
*** yamamoto has quit IRC		13:41
*** pcaruana has joined #openstack-infra		13:43
*** pcaruana has quit IRC		13:47
*** ahosam has quit IRC		13:47
*** kgiusti has joined #openstack-infra		13:47
*** ykarel is now known as ykarel\|away		13:48
*** dpawlik has quit IRC		13:50
openstackgerrit	Jeremy Stanley proposed openstack-infra/infra-specs master: Overhaul instructions in README.rst for clarity https://review.openstack.org/623211	13:51
fungi	Linkid: ^ those updates to the readme might be helpful to you	13:52
*** bhavikdbavishi has quit IRC		13:53
Linkid	ok, I'll take a look	13:55
openstackgerrit	Jeremy Stanley proposed openstack-infra/infra-specs master: Overhaul instructions in README.rst for clarity https://review.openstack.org/623211	13:56
*** agopi has joined #openstack-infra		13:57
openstackgerrit	Jeremy Stanley proposed openstack-infra/infra-specs master: Overhaul instructions in README.rst for clarity https://review.openstack.org/623211	13:58
openstackgerrit	Jeremy Stanley proposed openstack-infra/infra-specs master: Overhaul instructions in README.rst for clarity https://review.openstack.org/623211	14:00
fungi	okay, i think i'm happy with it now ;)	14:00
amorin	fungi: AJaeger clarkb I just enabled the cpu VMX flag on BHS1, so now, you should be able to spawn icccccdlucfbribncbuvefvbjlbeeckvvikkcvuhtdgn	14:00
amorin	instances	14:01
*** eharney has joined #openstack-infra		14:01
fungi	those are some fun looking instances	14:01
amorin	with nested virt	14:01
amorin	yes they are :p	14:01
amorin	sorry	14:01
fungi	no worries. i fall asleep on my keyboard all the time ;)	14:01
fungi	and thanks!	14:01
*** pgaxatte has joined #openstack-infra		14:03
*** pgaxatte has left #openstack-infra		14:03
*** pgaxatte has joined #openstack-infra		14:04
*** eernst has joined #openstack-infra		14:07
*** jcoufal has joined #openstack-infra		14:10
openstackgerrit	Jeremy Stanley proposed openstack-infra/infra-specs master: Overhaul instructions in README.rst for clarity https://review.openstack.org/623211	14:11
fungi	okay, now i'm really done with it... i think	14:12
*** sthussey has joined #openstack-infra		14:13
frickler	fungi: oh, since when do we only need the Task: header? that feature has managed to avoid making itself known to me so far	14:17
fungi	frickler: ever since https://review.openstack.org/607699 merged on halloween	14:20
fungi	hrm, though i still need to restart the gerrit service on review.o.o to pick that up. it's been running undisturbed since august 3	14:21
fungi	that's probably why it is unknown to you, i haven't announced it working because gerrit hasn't been restarted	14:22
fungi	perhaps i can do a quick gerrit restart late today when things (hopefully) quiet down	14:22
*** jaosorior has joined #openstack-infra		14:23
mordred	fungi: I really need to start working on the gerrit upgrade again	14:23
mordred	fungi: too many plates spinnin	14:23
mordred	fungi: it looks like the project rename plugin is actually real now though https://gerrit.googlesource.com/plugins/rename-project/+/master/src/main/resources/Documentation/about.md	14:24
*** yamamoto has joined #openstack-infra		14:24
fungi	yay! that'll be swell. once we have a new enough gerrit to use it	14:25
mordred	yah. I'll be glad when that's no longer a downtime event	14:25
frickler	fungi: hmm, IIUC that patch only creates the hyperlink to the task, will updates to the story still get posted on storyboard when a patch contains only the task reference?	14:26
fungi	frickler: yes, in fact only the task footer causes story updates to happen	14:27
fungi	after digging into the current implementation of the its-storyboard plugin for gerrit, it does nothing at all with story ids, and only acts on task ids	14:27
fungi	we had wanted it to leave story comments when the story footer was included in a change even without a task footer, but it seems that was never actually implemented	14:28
*** eernst has quit IRC		14:28
*** ykarel\|away has quit IRC		14:29
fungi	so if someone with a good grasp of java wants to work on the its-storyboard plugin for gerrit, that would be a good next feature	14:29
frickler	fungi: o.k., going via the task seems to work just as well, so fine for me	14:29
*** psachin has joined #openstack-infra		14:31
*** janki has quit IRC		14:33
*** yamamoto has quit IRC		14:37
mordred	infra-root: keystoneauth1 3.11.2 has been released, which has the fix for rackspace discovery in it	14:38
mordred	it should be safe to unpin nodepool and to use latest sdk for launch-node now	14:39
mordred	but I'm on a plane, so I'm not going to do any of those things right now	14:39
fungi	also i notice we still have some significant gaps in executor availability so we may want to proceed with the ze12 addition	14:40
*** rkukura has joined #openstack-infra		14:44
*** Swami has joined #openstack-infra		14:44
*** sshnaidm\|bbl is now known as sshnaidm		14:51
*** _alastor_ has joined #openstack-infra		14:53
*** janki has joined #openstack-infra		14:54
*** Swami has quit IRC		14:55
*** Swami has joined #openstack-infra		14:56
*** _alastor_ has quit IRC		14:57
fungi	oh joy, now spammers seem to be mistyping their spoofed domain and i'm getting messages into the openstack-discuss moderation queue for random addresses @q.com instead of @qq.com	14:59
fungi	on the order of one every few seconds	14:59
*** ramishra has quit IRC		15:00
fungi	updated the nonmember discard filter to ^[0-9]+@q+\.com$ now	15:00
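A quick check of the discard pattern quoted above, using Python's `re` module. The sample addresses are made up for illustration, and the mailing list software may apply the pattern with slightly different matching rules than `re.match`.

```python
import re

# Pattern copied from the message above: an all-digit local part at
# q.com, qq.com, qqq.com, and so on.
DISCARD = re.compile(r"^[0-9]+@q+\.com$")


def is_discarded(sender: str) -> bool:
    """Return True if the sender address matches the discard filter."""
    return DISCARD.match(sender) is not None
```

The `q+` is what lets one pattern cover both the usual spoofed domain and the mistyped single-q variant, while the anchors keep it from matching longer domains.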
fungi	and my renovation contractors are making me high on spray-foam insulation fumes	15:03
fungi	i should open a window but it's windy and close to freezing outside right now	15:03
amorin	where are you from?	15:04
fungi	a barrier island in the atlantic, off the coast of north carolina (usa)	15:05
fungi	we're ~16km from shore	15:06
amorin	nice, windy situation I guess	15:06
openstackgerrit	Merged openstack-dev/hacking master: Fix 'ref' format errors in README file https://review.openstack.org/623203	15:07
*** alexchadin has quit IRC		15:07
*** ykarel\|away has joined #openstack-infra		15:08
fungi	yeah, the water is no more than 30 meters from my house, at the end of my yard, so very windy	15:08
fungi	nothing to slow it down	15:08
*** ykarel\|away is now known as ykarel		15:08
fungi	i ended up opening a window anyway because i figure i'm far less likely to pass out from hypothermia than hypoxia (and i can always at least put on a jacket)	15:09
*** dpawlik has joined #openstack-infra		15:10
*** bobh has quit IRC		15:12
*** jcoufal_ has joined #openstack-infra		15:14
*** jcoufal_ has quit IRC		15:15
*** dpawlik has quit IRC		15:15
*** bobh has joined #openstack-infra		15:16
*** jcoufal_ has joined #openstack-infra		15:16
*** dpawlik has joined #openstack-infra		15:17
openstackgerrit	Chris Dent proposed openstack-infra/openstack-zuul-jobs master: Make lower-constraints job use python 3.6 https://review.openstack.org/623229	15:17
*** dpawlik has quit IRC		15:17
*** jcoufal has quit IRC		15:17
fungi	we nearly caught up on node requests around 1300z but i guess then north america woke up	15:17
*** dpawlik has joined #openstack-infra		15:18
*** slaweq has joined #openstack-infra		15:23
*** slaweq has quit IRC		15:29
*** bobh has quit IRC		15:31
mordred	stupid north america	15:32
*** adam_zhang has joined #openstack-infra		15:33
mordred	fungi: you should also ventilate for a while once they're done with that spray foam - it offgasses for a while, is my understanding	15:33
*** jhesketh has quit IRC		15:34
*** adam_zhang has quit IRC		15:35
*** jhesketh has joined #openstack-infra		15:35
fungi	yeah	15:35
fungi	luckily this house is fairly leaky already (part of why we're renovating the downstairs entry instead of just repairing it)	15:36
fungi	just an unfortunate time of year to need to leave windows open	15:36
*** bobh has joined #openstack-infra		15:37
mordred	++	15:40
*** bobh has quit IRC		15:42
*** jamesmcarthur has joined #openstack-infra		15:43
jrosser	could i get some more eyes on this when folks have a moment? https://review.openstack.org/#/c/622169/	15:45
corvus	fungi: good morning; i'll start looking at data in a bit	15:46
frickler	jrosser: done	15:47
jrosser	frickler: thanks :)	15:47
mriedem	clarkb: also for you https://bugs.launchpad.net/nova/+bug/1807219	15:51
openstack	Launchpad bug 1807219 in OpenStack Compute (nova) "SchedulerReporClient init slows down nova-api startup" [Medium,Triaged]	15:51
mriedem	working a patch now	15:51
*** zul has joined #openstack-infra		15:52
*** dpawlik has quit IRC		15:55
*** bobh has joined #openstack-infra		15:55
*** ramishra has joined #openstack-infra		15:59
*** bobh has quit IRC		16:00
corvus	fungi, clarkb: i'm going to sigusr2 ze01 to get an objgraph list	16:00
*** _alastor_ has joined #openstack-infra		16:00
fungi	oh! right, it was so late for me i didn't commit to memory that we'd added it to all the executors	16:01
*** jcoufal_ has quit IRC		16:01
*** janki is now known as janki\|dinner		16:02
corvus	here are the object counts: http://paste.openstack.org/show/736764/	16:04
openstackgerrit	Merged openstack-infra/project-config master: Add centos/suse to OSA grafana dashboard https://review.openstack.org/622169	16:04
corvus	our first class on that list is Repo. and 1700 repos sounds about right.	16:05
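For readers unfamiliar with this kind of dump, here is a standard-library sketch of counting live objects by type when a signal arrives. It is a stand-in for illustration only: it is not the executor's actual handler, it uses `gc` rather than objgraph, and the function names are hypothetical.

```python
import collections
import gc
import signal


def most_common_types(limit: int = 10):
    """Count gc-tracked live objects by type name, most common first."""
    counts = collections.Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(limit)


def dump_object_counts(signum, frame):
    """Signal handler: print a per-type object count histogram."""
    for name, count in most_common_types():
        print(f"{name:<30} {count}")


def install_handler():
    """Register the handler for SIGUSR2 (must run in the main thread)."""
    if hasattr(signal, "SIGUSR2"):  # SIGUSR2 does not exist on Windows
        signal.signal(signal.SIGUSR2, dump_object_counts)
```

After calling `install_handler()` at startup, sending the process SIGUSR2 (for example with `kill -USR2 <pid>`) prints the histogram without stopping the process. Only objects tracked by the garbage collector are counted, so simple values such as ints and strings are not included.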
fungi	yep	16:08
corvus	i agree with clarkb; i wish we had historical values for "how much memory ansible is using". also, for that matter, "how much memory the executor process is using"	16:09
corvus	cause at this point, all we suspect is something changed and we don't even know which piece of software.	16:09
fungi	it does at least look like we're not swapping as hard since the restart	16:10
*** takamatsu has quit IRC		16:10
corvus	fungi: yeah, we seem to be around 2g, which is a value we've encountered before in our history without too much issue.	16:11
*** bobh has joined #openstack-infra		16:17
*** jcoufal has joined #openstack-infra		16:20
*** bobh has quit IRC		16:22
clarkb	corvus what I found interesting is ansible memory/swap use seems to correlate to the job playbooks	16:25
clarkb	grenade and tripleo were showing up a bunch	16:25
mordred	you know ...	16:26
mordred	in the callback plugins, we actually have the entire log data in memory for the entire job in RAM	16:26
mordred	at least in the json one	16:26
openstackgerrit	Frank Kloeker proposed openstack-infra/irc-meetings master: Change meeting time and format for Docs & I18n team https://review.openstack.org/623242	16:27
mordred	which we can improve by switching that to be yaml and use multiple documents which we can just append without reading the old data like we discussed in berlin	16:27
mordred	so - grenade and tripleo being long/complex and potentially verbose could be causing the json callback plugin to eat ram	16:27
mordred	that said - we could even improve the json plugin by only reading the old data in to memory right before doing the append and write out - so that it's not holding the RAM for the whole playbook invocation	16:28
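A minimal sketch of the idea mordred describes here for the json plugin, not the actual zuul callback code: the old results are read from disk only at the moment the new entry is written, so they are not held in RAM for the whole playbook invocation. The function name `append_playbook_result` and the on-disk layout (a single JSON list per file) are assumptions made for illustration.

```python
import json
import os


def append_playbook_result(path, new_entry):
    """Append new_entry to the JSON list stored at path; return the new length.

    Hypothetical helper: the name and file layout are illustrative only.
    """
    # Load the previous results only now, at write time, so the old data
    # is not kept in memory while the playbook is still running.
    if os.path.exists(path):
        with open(path) as f:
            entries = json.load(f)
    else:
        entries = []
    entries.append(new_entry)
    # Rewrite the whole file with the old and new data combined.
    with open(path, "w") as f:
        json.dump(entries, f)
    return len(entries)
```

The file is still rewritten in full on every call; only the length of time the old data stays in memory changes.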
*** e0ne has quit IRC		16:30
openstackgerrit	Monty Taylor proposed openstack-infra/zuul master: Read old json data right before writing new data https://review.openstack.org/623245	16:30
mordred	clarkb, corvus: ^^ like that	16:30
*** ykarel is now known as ykarel|away		16:33
clarkb	ah ya if thats held for say 2 hours by grenade or tripleo I could see how those would be good swap candidates	16:37
mriedem	clarkb: btw, this takes 26 seconds on nova-api startup:	16:37
mriedem	Dec 05 20:13:23.060958 ubuntu-xenial-ovh-bhs1-0000959981 devstack@n-api.service[23459]: running "unix_signal:15 gracefully_kill_them_all" (master-start)...	16:37
mriedem	http://logs.openstack.org/01/619701/5/gate/tempest-slow/2bb461b/controller/logs/screen-n-api.txt.gz#_Dec_05_20_13_23_060958	16:37
mriedem	then we spend about 27 seconds loading up API extensions	16:38
*** bobh has joined #openstack-infra		16:38
clarkb	mriedem: ya is it waiting on child pids to go away? that looked like uwsgi not nova?	16:38
mriedem	we're looking into the latter but i don't know if we can control the former	16:38
*** rossella_s has quit IRC		16:38
mriedem	yeah i'm not sure what's doing that	16:40
*** psachin has quit IRC		16:41
mriedem	http://git.openstack.org/cgit/openstack-dev/devstack/tree/lib/apache#n272	16:41
mriedem	devstack sets the hook	16:41
*** bobh has quit IRC		16:42
openstackgerrit	Monty Taylor proposed openstack-infra/zuul master: Add appending yaml log plugin https://review.openstack.org/623256	16:47
mordred	and there's the yaml version	16:47
*** pgaxatte has quit IRC		16:49
clarkb	mordred: left a small comment on the json change but +2	16:49
mordred	clarkb: yeah - totally. that would be better for sure	16:53
*** bobh has joined #openstack-infra		16:56
clarkb	mriedem: reading uwsgi that sets up a hook that on SIGTERM (signal 15) it calls the kill them all gracefully function	16:59
clarkb	mriedem: I wonder if that is actually slow or if uwsgi is just not logging what it is doing in the interim well	16:59
*** bobh has quit IRC		17:01
mriedem	clarkb: same, i suspect it's doing other things but not logging it	17:01
mriedem	i'll enable debug logging and see if that shows anything	17:07
mriedem	26 https://review.openstack.org/623265	17:12
clarkb	mriedem: like a season of 24 but two episodes longer	17:13
*** Swami has quit IRC		17:14
*** janki|dinner has quit IRC		17:16
*** bobh has joined #openstack-infra		17:16
*** jamesmcarthur has quit IRC		17:17
mriedem	heh the 26 was typing in the wrong window	17:17
mriedem	i can't watch 24, the ads alone with kiefer constantly yelling is just too much	17:18
mriedem	"the pop tarts are done omfg!!!"	17:18
clarkb	mnaser: followup on centos images. All regions but inap-mtl1 should have centos 7.6 now	17:18
clarkb	mnaser: waiting on inap upload to complete	17:18
clarkb	mriedem: I couldn't watch it when broadcast but managed to get through the first season on netflix relatively recently	17:19
openstackgerrit	Frank Kloeker proposed openstack-infra/irc-meetings master: Change meeting time and format for Docs & I18n team https://review.openstack.org/623242	17:19
*** bobh has quit IRC		17:21
*** jamesmcarthur has joined #openstack-infra		17:22
clarkb	fungi: heh ns2 now emails about packages on hold?	17:25
clarkb	https://packages.ubuntu.com/bionic/netplan.io	17:25
fungi	clarkb: that's yet another package which can't be upgraded because it will bring in a new dependency	17:31
*** kjackal has quit IRC		17:32
*** jtomasek has quit IRC		17:32
corvus	fungi, clarkb: i think we should throw ze12 at the problem.	17:32
clarkb	corvus: ok, just a matter of merging the change to puppet it right?	17:32
corvus	clarkb: yeah, i'll re-approve	17:33
*** kjackal has joined #openstack-infra		17:33
fungi	wfm	17:34
*** bobh has joined #openstack-infra		17:34
openstackgerrit	Jeremy Stanley proposed openstack-infra/infra-specs master: Overhaul instructions in README.rst for clarity https://review.openstack.org/623211	17:36
clarkb	notmyname: http://logs.openstack.org/31/592231/6/gate/swift-probetests-centos-7/7bde795/job-output.txt.gz#_2018-12-06_17_32_48_836444 just failed in the gate. I took a quick look to see if it was for any of the known problems associated with the centos 7.6 release and unless that is a new race caused by new python or libs I don't think it is. (just an fyi that it appears to be an actual bug and not	17:36
clarkb	centos 7.6 causing problems)	17:36
*** rascasoft has quit IRC		17:38
openstackgerrit	David Shrewsbury proposed openstack-infra/nodepool master: Fix race in test_handler_poll_session_expired https://review.openstack.org/623269	17:39
openstackgerrit	Chris Dent proposed openstack-infra/openstack-zuul-jobs master: Make lower-constraints job use python 3.6 https://review.openstack.org/623229	17:39
notmyname	clarkb: thanks	17:40
clarkb	ssbarnea|rover: new theory on the file:// lookup issue with delorean. Is it possible that delorean is looking for that file within its chroot but it exists on the surrounding fs?	17:42
fungi	clarkb: followup on the stackalytics-bot-2 ip6tables block rule. it looks like the bot eventually switched to ipv4 anyway, so probably safe to say it's not what was causing the gerrit slowdowns a couple weeks back and there's no point in continuing to leave that block rule in place	17:48
*** jpich has quit IRC		17:48
*** florianf has quit IRC		17:49
*** eharney has quit IRC		17:52
*** gyee has joined #openstack-infra		17:55
fungi	so based on graphs it looks like we're topping out around 70 concurrent builds per executor? i guess if with the addition of ze12 we see we get up around 850 concurrent builds for extended periods that suggests we need another	17:55
fungi	the hysteresis kicking in on the executor queue graph around the time we stop accepting more builds is interesting and fairly pronounced	17:57
fungi	or perhaps each of the queued jobs spikes there reflects a major gate queue reset	17:58
openstackgerrit	Merged openstack-infra/system-config master: Add ze12.openstack.org https://review.openstack.org/623067	17:59
*** derekh has quit IRC		17:59
pabelanger	fungi: yah, gate resets are amplifying the backlog for sure	17:59
fungi	yeah, i guess there are corresponding spikes on the node requests graph so that seems likely	17:59
*** bdodd has quit IRC		17:59
fungi	though we do seem to be managing ~0.2kjph higher than yesterday already	18:00
*** dtantsur is now known as dtantsur|afk		18:01
*** udesale has quit IRC		18:04
*** Swami has joined #openstack-infra		18:05
*** gfidente is now known as gfidente|afk		18:06
*** ykarel|away has quit IRC		18:07
*** e0ne has joined #openstack-infra		18:08
*** jamesmcarthur has quit IRC		18:08
*** e0ne has quit IRC		18:09
*** bobh has quit IRC		18:09
*** bobh has joined #openstack-infra		18:17
openstackgerrit	Merged openstack-infra/zuul master: web: break the reducers module into logical units https://review.openstack.org/621385	18:20
clarkb	ssbarnea|rover: I'll move the conversation back here since I'm no longer thinking this is likely a zuul bug. http://logs.openstack.org/25/620625/2/gate/tripleo-ci-centos-7-standalone/70949b6/logs/ara_oooq/reports/730db7e5-5c8a-4aec-a2a4-836c4367225a.html That ansible run crashes, this seems to crash the console log streaming which is then noticed when pre is started	18:20
clarkb	ssbarnea|rover: it seems to crash when executing tempest and even the tempest log seems truncated: http://logs.openstack.org/25/620625/2/gate/tripleo-ci-centos-7-standalone/70949b6/logs/undercloud/home/zuul/tempest.log.txt.gz notice that it has concurrency = 4 but only workers 0 and 2 record tests (we should at least have {1} as well)	18:21
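The check clarkb does by eye here (concurrency of 4, but only workers 0 and 2 recording tests) could be sketched roughly as below. It assumes each test result line in the log starts with the worker id in braces, such as `{0}`; that layout and the function name `missing_workers` are assumptions for illustration, not something confirmed by the log itself.

```python
import re

# Assumed layout: a result line begins with the worker id in braces, e.g. "{2} ...".
WORKER_RE = re.compile(r"^\{(\d+)\}")


def missing_workers(log_lines, concurrency):
    """Return the sorted worker ids below `concurrency` that never logged a test."""
    seen = set()
    for line in log_lines:
        match = WORKER_RE.match(line)
        if match:
            seen.add(int(match.group(1)))
    return sorted(set(range(concurrency)) - seen)
```

A worker missing from the log is only a hint that the run was cut short; it does not say why.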
*** bobh has quit IRC		18:22
*** electrofelix has quit IRC		18:26
*** wolverineav has joined #openstack-infra		18:29
*** wolverineav has quit IRC		18:29
*** wolverineav has joined #openstack-infra		18:29
*** dave-mccowan has joined #openstack-infra		18:29
clarkb	ssbarnea|rover: ok I figured it out http://logs.openstack.org/25/620625/2/gate/tripleo-ci-centos-7-standalone/70949b6/logs/undercloud/var/log/journal.txt.gz#_Dec_06_16_08_49 -- Reboot --	18:29
clarkb	ssbarnea|rover: ^ so thats the issue after all	18:29
clarkb	mordred: ^ fyi	18:30
*** slaweq has joined #openstack-infra		18:30
fungi	hah, rebooting a node mid-job. we knew that would prematurely terminate the console stream at least, right?	18:31
clarkb	fungi: yes	18:32
mordred	yeah. I really do need to raise the priority of reworking streaming	18:32
mordred	in the mean time - they can add a zuul_console line to their playbook after the reboot	18:32
mordred	it's just an ansible module - nothing stopping it from being restarted	18:32
clarkb	mordred: while I agree, I also don't think that running tempest should induce a reboot	18:34
*** yamamoto has joined #openstack-infra		18:34
clarkb	so I think there is a bigger tripleo bug here	18:34
clarkb	(or maybe I am misunderstanding the logs around what is going on at that time)	18:35
openstackgerrit	Merged openstack-infra/irc-meetings master: Change meeting time and format for Docs & I18n team https://review.openstack.org/623242	18:35
*** diablo_rojo has joined #openstack-infra		18:35
*** slaweq has quit IRC		18:36
openstackgerrit	Monty Taylor proposed openstack-infra/system-config master: Import install-docker role https://review.openstack.org/605585	18:37
*** wolverineav has quit IRC		18:37
mordred	clarkb: I saw something about enabling something - perhaps a new kernel is happening?	18:37
*** yamamoto has quit IRC		18:38
openstackgerrit	Merged openstack-infra/irc-meetings master: Change Senlin meeting to different biweekly times https://review.openstack.org/623031	18:39
*** jcoufal has quit IRC		18:40
*** jcoufal has joined #openstack-infra		18:40
clarkb	mordred: is zuul_console a task that you run or a role?	18:41
clarkb	updating the bug for this now and hoping to give that ^ info	18:41
mordred	clarkb: task. one sec ...	18:41
*** eharney has joined #openstack-infra		18:42
*** bdodd has joined #openstack-infra		18:42
mordred	clarkb: sorry, role: http://git.openstack.org/cgit/openstack-infra/zuul-jobs/tree/roles/start-zuul-console	18:42
clarkb	mordred: thanks	18:42
logan-	clarkb: i began seeing similar behavior about 2-3 days ago in jobs I have that launch nested vms with nested virt enabled. i have temporarily changed the affected jobs to use software virt and they no longer reboot. this is on ubuntu xenial test nodes launching bionic nested vms.	18:44
clarkb	logan-: ah good to know. Nested virt hits again. I'll leave that note too	18:45
logan-	pretty concerning to see it is happening on centos guests too	18:45
mordred	clarkb: that said - it's a one-task-role - so if it's more convenient to run it as a task, that's fine too	18:45
*** wolverineav has joined #openstack-infra		18:45
logan-	nothing has changed on the hosts, but maybe I should look to see if there is a newer kernel we can try.	18:45
clarkb	logan-: well centos just updated its kernels with 7.6 I'm sure	18:46
clarkb	logan-: could be in the guest side of things	18:46
fungi	#status log unblocked stackalytics-bot-2 access to review.o.o since the performance problems observed leading up to addition of the rule on 2018-11-23 seem to be unrelated (it eventually fell back to connecting via ipv4 and no recurrence was reported)	18:46
openstackstatus	fungi: finished logging	18:47
*** ramishra has quit IRC		18:47
logan-	yeah it seems like these breakages are usually guest induced by updated nodepool images, and then we usually get it back on track by updating the hosts. when I looked the other day there were no kernel updates available for the hosts :/	18:47
logan-	there is a kernel update available now. I'm taking a host out of the aggregate to update and test. will let you know how it goes.	18:52
*** slaweq has joined #openstack-infra		18:52
clarkb	logan-: sounds good. It will be really interesting to see in a year or so (when the current 4.19 kernel shows up in places) if the intel nested virt enabled by default there actually pans out as being much more reliable	18:53
logan-	no kidding. I'm running xenial hwe on these hosts and it is pretty sad that it still breaks a few times per quarter while still being more reliable than the regular xenial kernel. :(	18:54
openstackgerrit	David Shrewsbury proposed openstack-infra/nodepool master: Fix race in test_handler_poll_session_expired https://review.openstack.org/623269	18:56
*** slaweq has quit IRC		18:57
*** shardy has quit IRC		18:59
clarkb	mwhahaha: ssbarnea|rover EmilienM to tl;dr it I think the three issues I'm aware of affecting tripleo jobs that are not "slowness" are the ntp errors, delorean not finding the /home/zuul/.*/repomd.xml file, delorean using pypi.python.org directly and having errors, and the nested virt possibly crashing VM and rebooting it (which may be the cause of the repomd.xml thing as a side effect?)	19:01
clarkb	I guess thats 3.5 issues	19:01
clarkb	I'm pretty sure all of these have bugs and I've updated the one I have new info on (reboots) with the data I collected	19:02
clarkb	Then there is the ovh slowness related stuff. I do still think reducing memory pressure would be worthwhile as an exercise to see if that helps. Especially if there are any easy wins like kernel same page merging	19:03
clarkb	and we'll keep working with ovh on the infra side to characterize and hopefully address underlying issues as well	19:04
*** udesale has joined #openstack-infra		19:11
clarkb	hrm my neighbor is getting a new roof, will need to find the airplane headphones	19:14
clarkb	fungi: ^ must be worse at your place :)	19:14
mwhahaha	clarkb: ok we had a fix for the ntp thing but it failed in the gate, maybe we can get that promoted next gate reset	19:14
mwhahaha	which seems to have just occurred	19:15
* mwhahaha sighs		19:15
mwhahaha	clarkb: https://review.openstack.org/#/c/621930/ if you can promote that to the top of the gate so we stop getting that one	19:15
mwhahaha	clarkb: i'm not aware of the repomd.xml one or the crashing vm. Is the crashing VM the standalone job?	19:16
clarkb	mwhahaha: ya an example is http://logs.openstack.org/25/620625/2/gate/tripleo-ci-centos-7-standalone/70949b6/logs/ara_oooq/	19:17
mwhahaha	ok so originally we used just qemu hard coded and then we moved to try and do the auto direction	19:18
clarkb	mwhahaha: notice that the multinode-standalone.yaml playbook is incomplete/interrupted. If you then go look at the journal log you'll see that there was a reboot around 16:07 ish	19:18
mwhahaha	yea so it crashes in tempest	19:18
clarkb	then later in the job delorean fails because the repomd.xml isn't present (possibly because ansible crashed in a way that really confused things?)	19:18
clarkb	mwhahaha: ya I think tempest is the trigger there	19:18
mwhahaha	we can force that to qemu but that is less than ideal	19:18
openstackgerrit	Jeremy Stanley proposed openstack-infra/system-config master: Run a local MySQL service on StoryBoard servers https://review.openstack.org/623290	19:19
openstackgerrit	Jeremy Stanley proposed openstack-infra/system-config master: Switch StoryBoard database backups to local https://review.openstack.org/623291	19:19
clarkb	mwhahaha: unfortunately nested virt has never been stable	19:19
clarkb	it will work then stop then work and its really hard to debug unless you've got logan- or mnaser investigating the hypervisor side too	19:19
fungi	clarkb: roof rebuild hasn't started yet, we're still getting quotes and arguing about whether we want shingle or steel	19:19
clarkb	mwhahaha: ntp fix is being promoted now	19:19
mwhahaha	thanks	19:19
mwhahaha	i'll propose a patch for the qemu thing	19:20
clarkb	fungi: I'm unsure of the comparative advantages in your part of the world but steel is very loud when it rains	19:20
clarkb	fungi: growing up it would rain hard enough that it would be louder than the speakers hooked to the tv	19:21
clarkb	(granted we lived in the tropics with minimal insulation to dampen things too)	19:21
*** dpawlik has joined #openstack-infra		19:22
fungi	yeah, lots of insulation here. i know metal roofs are loud, though in theory require less maintenance and last a lot longer for not a lot higher cost	19:22
*** dpawlik has quit IRC		19:24
mwhahaha	clarkb: for the nested virt thing: https://review.openstack.org/#/c/623293/	19:26
mwhahaha	ssbarnea|rover, EmilienM fyi -^	19:26
clarkb	mwhahaha: thanks!	19:26
EmilienM	mwhahaha: ack	19:27
*** ndahiwade has joined #openstack-infra		19:27
openstackgerrit	Ronelle Landy proposed openstack-infra/zuul-jobs master: WIP: Default private_ipv4 to use public_ipv4 address when null https://review.openstack.org/623294	19:28
*** udesale has quit IRC		19:31
openstackgerrit	Merged openstack-infra/zuul master: web: refactor info and tenant reducers action https://review.openstack.org/621386	19:35
fungi	this is awesome: https://github.com/systemd/systemd/issues/11026	19:39
*** dpawlik has joined #openstack-infra		19:39
*** boden has quit IRC		19:42
fungi	though now https://gitlab.freedesktop.org/polkit/polkit/issues/74 is arguing it's a systemd bug after all	19:42
clarkb	you get root I get root we all get root	19:43
fungi	finger pointing ftw!	19:43
*** sshnaidm is now known as sshnaidm|afk		19:43
*** dpawlik has quit IRC		19:44
*** bobh has joined #openstack-infra		19:47
openstackgerrit	David Shrewsbury proposed openstack-infra/nodepool master: Fix race in test_handler_poll_session_expired https://review.openstack.org/623269	19:50
*** bobh has quit IRC		19:54
*** jamesmcarthur has joined #openstack-infra		19:57
*** ndahiwade has quit IRC		19:59
*** wolverineav has quit IRC		19:59
corvus	#status log added ze12 to zuul executor pool to reduce memory pressure	20:00
openstackstatus	corvus: finished logging	20:00
corvus	infra-root: ze12 is in production	20:00
*** wolverineav has joined #openstack-infra		20:00
*** jamesmcarthur has quit IRC		20:01
*** jamesmcarthur has joined #openstack-infra		20:01
fungi	ahh, yep, just noticed the green line on the executors graph bump up a notch	20:02
corvus	i really like that it's immediately reflected in monitoring :)	20:02
fungi	that is super nice	20:02
corvus	all the governor graphs have an extra line now too. even cacti is updated.	20:02
fungi	and there's a gate resent underway in the integrated queue. curious to see if we fall into exhaustion again	20:04
fungi	er, gate reset	20:04
corvus	oh good, that will help things equalize across all the executors faster :)	20:04
clarkb	ya its been bumpy there too. I've been trying to context switch into debugging some of those failures next, but running out of steam	20:04
clarkb	glance python3 unittests were the last failure	20:05
*** wolverineav has quit IRC		20:05
clarkb	seemed to be a legit issue with bytes and unicode or something	20:05
clarkb	http://logs.openstack.org/61/610661/7/gate/openstack-tox-py35/f70430e/job-output.txt.gz#_2018-12-06_18_48_43_311604	20:05
*** bobh has joined #openstack-infra		20:07
clarkb	the most recent reset was grenade job failing on bhs1 because the nova node tempest was testing didn't reach an active state before the timeout	20:07
*** bobh has quit IRC		20:11
*** sthussey has quit IRC		20:12
*** rcernin has joined #openstack-infra		20:12
fungi	any puppet gurus know how to work around http://logs.openstack.org/90/623290/1/check/infra-puppet-apply-4-ubuntu-xenial/62c7ca7/applytest/puppetapplytest32.final.out.FAILED ?	20:17
fungi	seems we can't use our mysql::backup_remote class with the puppet mysql module because both want to install mysql-client	20:18
fungi	unfortunately one of them isn't a module under our control (i think?)	20:18
clarkb	fungi: ya the mysql module is an upstream module.	20:19
fungi	maybe it provides a way to not declare the mysql-client package	20:19
clarkb	fungi: puppet 4 is ordered (and maybe 3 is now too?) in any case in the backup module you can do the if !defined() check for mysql-client and install it that way. Then ensure that you include backup class after the regular mysql stuff	20:20
clarkb	then the ordering should be such that it works	20:20
fungi	ohh	20:20
fungi	i can actually order them?	20:20
fungi	will try that, thanks!	20:20
clarkb	fungi: its implied top to bottom order in puppet 4	20:20
clarkb	but I think you can also have mysql backup require mysql then it will order them explicitly too	20:20
fungi	we already do the if ! defined(Package['mysql-client']) dance in puppet-mysql_backup so if i can order them that should do the trick	20:21
*** david-lyle is now known as dklyle		20:24
*** udesale has joined #openstack-infra		20:29
openstackgerrit	Jeremy Stanley proposed openstack-infra/system-config master: Run a local MySQL service on StoryBoard servers https://review.openstack.org/623290	20:31
openstackgerrit	Jeremy Stanley proposed openstack-infra/system-config master: Switch StoryBoard database backups to local https://review.openstack.org/623291	20:31
fungi	hopefully that ^ will solve it for puppet 3 and 4 then	20:31
pabelanger	corvus: clarkb: I think we forgot to move /var/lib/zuul to /dev/xvde2 partition for ze12.o.o, which means we only have 40GB there for builds	20:32
*** bobh has joined #openstack-infra		20:33
*** wolverineav has joined #openstack-infra		20:33
corvus	pabelanger: hrm. we should automate that.	20:34
pabelanger	agree!	20:35
openstackgerrit	Tobias Henkel proposed openstack-infra/zuul master: WIP: Add spec for scale out scheduler https://review.openstack.org/621479	20:35
corvus	pabelanger: i think it's okay for now; we can shut it down later when it's quieter	20:35
*** bobh has quit IRC		20:37
*** wolverineav has quit IRC		20:38
fungi	ooh, just remembered, since i want to restart gerrit soon for the task footer hyperlinking config addition, might be nice to get https://review.openstack.org/471078 merged before as well	20:39
fungi	adding lp bug trackingids	20:39
fungi	(to make them searchable)	20:40
openstackgerrit	Merged openstack-infra/zuul master: web: add error reducer and info toast notification https://review.openstack.org/621387	20:42
fungi	corvus: looks like we've entered a period of no executors accepting new builds again	20:43
*** mriedem has quit IRC		20:45
clarkb	fungi: we had a few periods of that overnight according to grafana, they all seem to have recovered on their own (not sure if this one will)	20:45
fungi	yeah, it's more of a question of whether we'll have any more 2-hour-long ones	20:46
*** udesale has quit IRC		20:46
fungi	this at least has only persisted for ~15 minutes so far	20:46
*** mriedem has joined #openstack-infra		20:47
clarkb	on the bhs1 front I hopped into a test node and manually checked it had reasonable disk throughput, then found the job it is running https://zuul.openstack.org/stream/b81a8b3afe0f48819fcd3ed0fa201fba?logfile=console.log in the hopes of looking at dstat for that job to see if it exhibits similar behavior to the jobs that timeout	20:47
clarkb	thats a heat functional job that uses devstack, I'm not actually sure if it runs dstat :/	20:48
fungi	fingers crossed	20:48
clarkb	back to swapping. Last night I found that each of the swapping ansible jobs would use up to 75MB swap each	20:49
clarkb	I think we should consider getting mordreds patch in around the json handling	20:50
clarkb	that should reduce the window where memory is needed in the jobs	20:50
fungi	theory is that it's paging out the console stream?	20:50
clarkb	additionally we probably want to consider testing a downgrade of ansible to 2.5.older to see if it changes behavior	20:50
clarkb	fungi: its the ansible json log data not the console stream itself, but ya we open it and keep it open the whole time when we really only need to write a new copy with new data at the end aiui	20:51
clarkb	I think it scales in the size of the role and tasks in a playbook as its capturing all of that data?	20:51
mordred	there's a patch up to fix that	20:51
clarkb	mordred: ya I +2'd I'm saying we should try to get that in	20:51
mordred	yah. I totally agree	20:52
mordred	I think we should try rolling that out before we try downgrading anything	20:52
clarkb	++	20:52
corvus	i'll take a look now	20:52
fungi	623245?	20:53
*** wolverineav has joined #openstack-infra		20:53
clarkb	fungi: yes	20:53
corvus	mordred: i like the local var idea; is there any reason not to do that?	20:54
mordred	there's also a followup patch that will do the same thing but with yaml and appending to a file instead of reading and re-writing	20:54
clarkb	on_stats is the thing that runs at the end of a playbook run to display stats around what took time and all that	20:54
mordred	corvus: there isn't - although that function is the last thing called before the process exits, so I didn't just to avoid the respin	20:54
mordred	but I can totally do that real quick	20:55
clarkb	mordred: ya I did double check on that in ansible docs (that the hook fires at the end)	20:55
clarkb	so I didn't -1	20:55
fungi	i suppose a local would be more future-proofing in case more function calls get tacked on after that down the road?	20:55
mordred	yeah. I'll push up a followup	20:55
corvus	yeah, it's mostly just confusing from a dev/maint pov.	20:55
corvus	mordred: you may as well ammend	20:56
mordred	kk	20:56
corvus	or however you spell that :)	20:56
fungi	ammm...mmmend	20:56
fungi	we have enough people around to approve the revised version anyway	20:56
corvus	i'd want to restart with that anyway, so we'd be waiting for the second, and we can all re+2 real quick	20:56
fungi	do we want to get the yaml equivalent in too?	20:57
clarkb	the one upside to json is browsers nicely render it	20:57
clarkb	yaml is more readable on its own though	20:57
corvus	personally, i'd like to take that one slower if we can, since it's basically a new feature.	20:57
fungi	oh, i see the yaml one is more involved	20:57
corvus	the json one seems more like an operational fix	20:58
mordred	yeah. the yaml one is like a change in approach	20:58
fungi	only just started looking at it and, yeah, i agree	20:58
corvus	i like it, i just think we should talk through it fully (eg clarkb's point)	20:58
fungi	entirely possible there are users of zuul who prefer the json version, so it's probably a bigger community question	20:59
openstackgerrit	Monty Taylor proposed openstack-infra/zuul master: Read old json data right before writing new data https://review.openstack.org/623245	20:59
openstackgerrit	Monty Taylor proposed openstack-infra/zuul master: Add appending yaml log plugin https://review.openstack.org/623256	20:59
mordred	ok. there's the updated	20:59
*** bobh has joined #openstack-infra		21:02
clarkb	looks like dstat is enabled in that test, should have that data once logs post	21:04
fungi	looks like that busy cycle for the executors lasted ~17 minutes	21:05
openstackgerrit	Nate Johnston proposed openstack-infra/project-config master: Neutron grafana update for co-gating section https://review.openstack.org/622418	21:05
*** bobh has quit IRC		21:07
*** agopi has quit IRC		21:12
*** agopi has joined #openstack-infra		21:12
*** gfidente|afk has quit IRC		21:15
SpamapS	Hey I just realized my company affiliation changed recently. Is there still a place to update somewhere?	21:17
* SpamapS always forgets where it is		21:17
*** jcoufal has quit IRC		21:19
corvus	SpamapS: the openstack foundation site has a thing for that	21:21
corvus	SpamapS: foundation individual membership	21:21
corvus	fungi, clarkb, mordred: i think our next steps should be to restart executors with mordred's patch, observe behavior, then create ze13 if needed.	21:22
clarkb	ok. I'm stepping out for a bit for lunch and need a break from staring at monitor	21:22
clarkb	can help when I return	21:22
mriedem	clarkb: interesting, not seeing that 26 sec mystery time gap in this run on n-api startup http://logs.openstack.org/65/623265/1/check/tempest-full/0e80f2a/controller/logs/screen-n-api.txt.gz#_Dec_06_17_53_58_039303	21:22
mriedem	uwsgi debug logging seems to not do anything	21:23
*** jaosorior has quit IRC		21:23
clarkb	mriedem: it could be ovh specific :(	21:23
fungi	SpamapS: and if you care about stackalytics at all (or your handlers do?) then there's a config file in the stackalytics repo you could push up a review for	21:23
fungi	someday stackalytics might start consuming the affiliation info in osf profiles since there's an api to query that now, but it doesn't today	21:24
fungi	corvus: this sounds like a fine plan. i need to disappear in about 55 minutes to meet some friends for dinner, but can help with restarts prior to that	21:26
fungi	or once i get back (probably around 23:30-00:00z)	21:26
SpamapS	corvus: ty	21:26
SpamapS	fungi: ty too	21:26
*** kjackal has quit IRC		21:27
fungi	corvus: oh, and taking ze12 offline briefly to add a cinder volume i guess slots in there somewhere too	21:29
*** bobh has joined #openstack-infra		21:30
corvus	fungi: i don't believe we use cinder volumes	21:30
corvus	fungi: we just mount the ephemeral volume at /var/lib/zuul	21:31
corvus	(we should probably instead have our automation symlink that to /opt or something)	21:31
corvus	but i don't know how to untangle our deployment from openstackci at this point, so i don't want to touch it until we can move to the new stuff.	21:31
*** bobh has quit IRC		21:34
fungi	oh, got it, i always forget we have ephemeral disk in that provider	21:34
fungi	looks like 623245 has all its node requests fulfilled in the gate as of just now, eta 18 minutes	21:37
*** boden has joined #openstack-infra		21:40
*** jamesmcarthur has quit IRC		21:49
clarkb	I think the way it has been done in the past is passing the option to launch node to mount the ephemeral disk elsewhere?	21:50
clarkb	Putting http://logs.openstack.org/38/589238/13/check/heat-functional-convg-mysql-lbaasv2-amqp1/b81a8b3/logs/dstat-csv_log.txt.gz into https://lamada.eu/dstat-graph/ shows a much happier test run than the runs that fail. And there is quite a bit of IO happening too	21:54
clarkb	fungi: ianw ^ I think that points to ovh bhs1 (and maybe gra1) having unhappy and happy hypervisors as the source of the problem (assuming we aren't seeing noisy neighbor issues)	21:54
*** bobh has joined #openstack-infra		21:55
openstackgerrit	Merged openstack-infra/zuul master: Read old json data right before writing new data https://review.openstack.org/623245	21:56
fungi	i have a feeling puppet isn't going to update the executors before i have to disappear in 20 minutes, but happy to assist with restarts when i return from dinner if there's still any to be done at that point	21:59
*** bobh has quit IRC		22:01
clarkb	I've updated https://etherpad.openstack.org/p/bhs1-test-node-slowness	22:02
clarkb	I think we should consider halving the max-servers again and watch those e-r bugs I identified as being corrected by not running in bhs1	22:02
clarkb	amorin pointed out that none of those slow jobs ran on the same hypervisor so less likely it is one or two unhappy hypervisors. Instead we are maybe our own noisy neighbor	22:02
clarkb	and halving the number of nodes should reduce noisy neighbor impacts	22:03
corvus	clarkb: ack	22:03
fungi	makes sense to me	22:04
openstackgerrit	Ronelle Landy proposed openstack-infra/zuul-jobs master: WIP: Default private_ipv4 to use public_ipv4 address when null https://review.openstack.org/623294	22:04
openstackgerrit	Clark Boylan proposed openstack-infra/project-config master: Halve bhs1 max-servers value https://review.openstack.org/623338	22:06
clarkb	fungi: corvus ^ quick review to implement that	22:06
corvus	clarkb: unfortunately that's a variable into our zuul executor swap problem. we should do it because resets are bad, we're just going to need to keep that in mind as we evaluate further changes.	22:07
clarkb	corvus: ++	22:07
fungi	if we drop it back to 79 instead of 75 we can more directly compare behavior differences between bhs1 and gra1	22:08
logan-	clarkb: updating xenial hwe from 4.15.0.34.56 to 4.15.0.42.63 makes my nested kvm jobs work again. ¯\_(ツ)_/¯	22:08
logan-	i will cycle thru the hvs and update them all over the next day or so	22:08
clarkb	logan-: weird	22:08
clarkb	fungi: I think gra1 has less physical hardware too though	22:09
clarkb	so that comparison won't be super accurate?	22:09
fungi	yeah, not a big deal either way	22:09
fungi	certainly if we still see more failures in bhs1 than gra1 even with a lower max-servers, that's telling too	22:10
fungi	anyway, i approved it	22:10
*** kjackal has joined #openstack-infra		22:11
fungi	and i'm being dragged away 10 minutes early. back as soon as i can be	22:11
clarkb	thanks!	22:11
*** calebb has quit IRC		22:12
clarkb	logan-: that does sort of imply to me that canonical/ubuntu must have testing for this stuff, but it's likely a losing battle for them trying to keep up	22:17
clarkb	corvus: for the executor stuff we are waiting on mordred's change to merge now?	22:18
corvus	clarkb: it's merged; waiting to deploy	22:19
clarkb	rgr	22:19
corvus	clarkb: but i think we should wait a bit after your quota change merges before we restart	22:19
corvus	so we get a new baseline	22:19
*** eernst has joined #openstack-infra		22:20
openstackgerrit	Ronelle Landy proposed openstack-infra/zuul-jobs master: WIP: Default private_ipv4 to use public_ipv4 address when null https://review.openstack.org/623294	22:20
*** eernst has quit IRC		22:25
*** bobh has joined #openstack-infra		22:25
*** kgiusti has left #openstack-infra		22:27
*** manjeets_ is now known as manjeets		22:28
openstackgerrit	Jonathan Rosser proposed openstack-infra/project-config master: Separate out success/failure/timeout charts in grafana for OSA https://review.openstack.org/623341	22:29
*** bobh has quit IRC		22:29
openstackgerrit	Merged openstack-infra/project-config master: Halve bhs1 max-servers value https://review.openstack.org/623338	22:30
tonyb	Can I please be added to the bootstrappers gerrit group so I can EOL the puppet repos as per: http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000663.html	22:32
* tonyb will self remove when done		22:32
clarkb	tonyb: yes, one moment	22:32
tonyb	clarkb: Thanks	22:33
clarkb	tonyb: done	22:33
tonyb	clarkb: \o/ as always I'll be careful :)	22:33
*** boden has quit IRC		22:45
*** bobh has joined #openstack-infra		22:45
tonyb	clarkb, tobias-urdin: Done and removed	22:48
*** bobh has quit IRC		22:49
openstackgerrit	Ronelle Landy proposed openstack-infra/zuul-jobs master: WIP: Default private_ipv4 to use public_ipv4 address when null https://review.openstack.org/623294	22:53
*** bobh has joined #openstack-infra		23:04
*** lbragstad has quit IRC		23:08
*** bobh has quit IRC		23:09
*** lbragstad has joined #openstack-infra		23:09
clarkb	melwitt: mriedem: email sent	23:16
mriedem	clarkb: thanks	23:19
clarkb	corvus: max-servers change was applied at ~2300UTC and dropped to ~75 in use at about 23:15UTC	23:19
clarkb	baseline numbers probably want to start at 23:15UTC	23:20
corvus	clarkb: yeh, i was just looking. so maybe we wait until at least 24:00 before we restart any executors.	23:20
clarkb	wfm	23:20
melwitt	clarkb: danke	23:21
*** bobh has joined #openstack-infra		23:30
*** bobh has quit IRC		23:35
clarkb	corvus: we seem to be stabilizing just under 2GB swap per executor?	23:42
*** rkukura_ has joined #openstack-infra		23:48
corvus	clarkb: looks like it	23:48
openstackgerrit	Paul Belanger proposed openstack-infra/nodepool master: Include host_id for openstack provider https://review.openstack.org/623107	23:49
*** bobh has joined #openstack-infra		23:49
*** rkukura has quit IRC		23:51
*** rkukura_ is now known as rkukura		23:51
*** bobh has quit IRC		23:54
*** yamamoto has joined #openstack-infra		23:56
corvus	clarkb: other than 'atomic images prune' and 'atomic pull --storage ostree docker.io/openstackmagnum/kubernetes-kubelet:v1.11.5-1' did you have to do anything else to upgrade the cluster?	23:58
clarkb	corvus: yes, I "vacuumed" the journald contents to free up more disk space	23:59
clarkb	corvus: oh and I upgraded proxy, scheduler, kubelet, api, and controller-manager	23:59
