*** rlandy has quit IRC | 00:27 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Support cross-source dependencies https://review.openstack.org/530806 | 00:48 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Add cross-source tests https://review.openstack.org/532699 | 00:48 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Compress testr results.html before fetching it https://review.openstack.org/533828 | 00:49 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool feature/zuulv3: Clarify provider manager vs provider config https://review.openstack.org/531618 | 01:09 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Revert "Add consolidated role for processing subunit" https://review.openstack.org/533831 | 01:10 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Revert "Revert "Add consolidated role for processing subunit"" https://review.openstack.org/533834 | 01:17 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Adjust check for .stestr directory https://review.openstack.org/532688 | 01:33 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Revert "Add consolidated role for processing subunit" https://review.openstack.org/533831 | 01:55 |
*** haint_ has joined #zuul | 02:01 | |
*** threestrands_ has joined #zuul | 02:02 | |
*** haint has quit IRC | 02:04 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool feature/zuulv3: Do pep8 housekeeping according to zuul rules https://review.openstack.org/522945 | 02:04 |
*** jappleii__ has quit IRC | 02:04 | |
*** dtruong2 has joined #zuul | 02:46 | |
*** tflink has quit IRC | 02:54 | |
*** dtruong2 has quit IRC | 03:37 | |
*** jaianshu has joined #zuul | 03:47 | |
*** tflink has joined #zuul | 04:03 | |
*** dtruong2 has joined #zuul | 04:17 | |
*** dtruong2 has quit IRC | 04:26 | |
*** dtruong2 has joined #zuul | 04:40 | |
*** bhavik1 has joined #zuul | 05:17 | |
*** dtruong2 has quit IRC | 05:17 | |
*** bhavik1 has quit IRC | 05:56 | |
*** dtruong2 has joined #zuul | 06:08 | |
tobiash | corvus, dmsimard: re executor governor discussion of the meeting | 06:11 |
tobiash | couldn't participate as it's too late for me (every time I attend I'm kind of jetlagged the whole next day) | 06:12 |
tobiash | I think we need something smarter than the simple on/off approach based on current load/ram | 06:13 |
tobiash | I think we need something like what tcp does with slow start and a sliding window | 06:17
*** dtruong2 has quit IRC | 06:24 | |
SpamapS | tobiash: could poll stats more aggressively on changes, and back off over time. | 06:26 |
tobiash | SpamapS: I'm thinking about something like the slow start and a congestion window where we could hook in generic 'sensors' like load, ram | 06:33 |
tobiash | e.g. the ram governor would not work for me in the openshift/kubernetes environment as it doesn't know how much of the free ram it is allowed to take | 06:34 |
tobiash | for this we will have to check the cgroups | 06:34 |
tobiash | and I like the idea of a general congestion algorithm where we can put sensors into rather than having its own governor for each metric | 06:35 |
tobiash | I'll think more about this topic and maybe write a post to the ml | 06:36 |
clarkb | tcp slow start was the model used with the dependent pipeline windowing too | 06:36 |
tobiash | I think that could be a good fit | 06:39 |
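A minimal sketch of the slow-start / congestion-window governor with pluggable sensors described above. Everything here (class names, the load threshold, the additive-increase/multiplicative-decrease policy) is illustrative and assumed, not Zuul's actual executor code.

```python
import os


class Sensor:
    """A pluggable sensor reports True while its metric still has headroom."""

    def ok(self):
        raise NotImplementedError


class LoadSensor(Sensor):
    # A RamSensor could read cgroup limits instead of system-wide free memory,
    # per the openshift/kubernetes concern above.
    def __init__(self, max_load_per_cpu=2.5):
        self.limit = max_load_per_cpu * (os.cpu_count() or 1)

    def ok(self):
        return os.getloadavg()[0] < self.limit


class CongestionWindow:
    """TCP-style window: grow while all sensors report headroom, shrink otherwise."""

    def __init__(self, sensors, start=1, maximum=64):
        self.sensors = sensors
        self.window = start
        self.maximum = maximum
        self.running = 0  # jobs currently accepted by this executor

    def can_accept_job(self):
        if all(s.ok() for s in self.sensors):
            # additive increase while healthy (slow start could double instead)
            self.window = min(self.window + 1, self.maximum)
        else:
            # multiplicative decrease on "congestion"
            self.window = max(self.window // 2, 1)
        return self.running < self.window
```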
*** dtruong2 has joined #zuul | 06:44 | |
*** dtruong2 has quit IRC | 06:55 | |
SpamapS | querying the cgroups is also not a bad idea at all. | 07:09 |
*** threestrands_ has quit IRC | 07:18 | |
openstackgerrit | Andreas Jaeger proposed openstack-infra/zuul-jobs master: Remove testr and stestr specific roles https://review.openstack.org/529340 | 07:28 |
*** hashar has joined #zuul | 08:14 | |
SpamapS | https://github.com/facebook/pyre2/pull/10 | 08:19 |
SpamapS | So now to figure out what to do about the MIA author. | 08:19 |
*** jpena|off is now known as jpena | 08:46 | |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Use re2 for change_matcher https://review.openstack.org/534190 | 08:50 |
SpamapS | tobiash: ^ patch to use re2 for change matchers. | 08:50 |
tobiash | cool | 08:51 |
SpamapS | yeah.. note that pyre2 has not merged my PR yet. | 08:51 |
SpamapS | and that's blocked on GoDaddy signing Facebook's Corp CLA | 08:51 |
SpamapS | so it might be a while before we can merge that. :-P | 08:51 |
tobiash | SpamapS: how long do you expect? | 08:52 |
tobiash | signing the openstack cla took me a year... | 08:52 |
SpamapS | No clue, it's my first time asking for a CLA to be signed. ;) | 08:52 |
SpamapS | We've signed about 15 of them | 08:52 |
SpamapS | so I expect we'll do this one relatively "easily" | 08:52 |
SpamapS | I just dunno how long it takes. | 08:52 |
tobiash | the openstack cla was our first ;) | 08:52 |
SpamapS | Yeah, IMO they're all stupid. | 08:53 |
SpamapS | Lawyers making work for themselves. | 08:53 |
tobiash | yepp | 08:53 |
SpamapS | But I guess nobody wants another SCO vs. Linux situation. | 08:53 |
tobiash | probably | 08:53 |
SpamapS | Also wondering if we can get some speed gains by using re2 | 08:54 |
SpamapS | in some cases it is 10x faster. | 08:54 |
SpamapS | Not that scheduler CPU has been an issue. | 08:54 |
tobiash | that would be cool, but I guess we're not limited by regex parsing currently | 08:54 |
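For context, the change-matcher patch above mostly comes down to where the pattern gets compiled. A hedged sketch of the "prefer re2, fall back to re" shape, assuming the pyre2 binding exposes an re-compatible compile(); this is not the actual code in 534190.

```python
import logging
import re

try:
    import re2  # facebook/pyre2 binding; assumed to mirror the re API
except ImportError:
    re2 = None

log = logging.getLogger("example.re2_matcher")


def compile_pattern(pattern):
    """Prefer re2 (linear time, no catastrophic backtracking); fall back to re."""
    if re2 is not None:
        try:
            return re2.compile(pattern)
        except Exception:
            # e.g. '^(?!stable)': re2 has no lookahead, so patterns like this
            # are rejected and we fall back to the stdlib engine.
            log.warning("Pattern %r not supported by re2, using re", pattern)
    return re.compile(pattern)
```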
*** sshnaidm|afk is now known as sshnaidm | 09:52 | |
*** saop has joined #zuul | 10:06 | |
saop | tristanC, Hello | 10:06 |
saop | tristanC, Now the CI is able to pick up the job and run it | 10:07
saop | tristanC, But in the ansible execution phase, we are getting the error: Ansible output: b'Unknown option --unshare-all' | 10:08
*** ankkumar has joined #zuul | 10:08 | |
saop | tristanC, Sending result: {"data": {}, "result": "FAILURE"} | 10:08 |
saop | tristanC, any idea about that? | 10:08 |
saop | tristanC, We have very basic ansible playbook file | 10:08 |
tristanC | saop: it seems like you need a more recent bubblewrap version | 10:08 |
saop | tristanC, thanks | 10:09 |
tristanC | well, --unshare-all was added in bwrap-0.1.7 | 10:11 |
saop | tristanC, We installed some 0.1.2 | 10:14 |
saop | tristanC, will upgrade now | 10:14 |
tristanC | saop: what distrib are you using? | 10:14 |
saop | tristanC, ubuntu xenial | 10:14 |
saop | tristanC, One more question, our CI is able to post the result but it's not showing in gerrit, but when we do toggle CI we can see something like: Our CI: test-basic finger://ubuntu/f3f5f726a27a480687b826d9bf6a3e57 : FAILURE in 22s | 10:21
saop | tristanC, Do we need to have any configuration for that? | 10:21 |
tristanC | saop: you mean result in the table under vote? | 10:22 |
saop | tristanC, Yes | 10:23 |
saop | tristanC, you can check here: https://review.openstack.org/#/c/534137/ | 10:23 |
saop | tristanC, toggle for HPE Proliant CI | 10:23 |
tristanC | saop: iirc there is some javascript magic that parses ci comments, and it only works when the result has http links | 10:23
saop | tristanC, Now we are getting in ansible b'Unknown option --die-with-parent' | 10:26 |
saop | tristanC, Do we need to upgrade more? | 10:26
*** AJaeger has quit IRC | 10:26 | |
saop | tristanC, We are using bwrap-0.1.7 | 10:26 |
tristanC | saop: yes, i meant you need the most recent bubblewrap version, that is 0.2.0 | 10:27
saop | tristanC, ohh okay | 10:27 |
tristanC | perhaps ubuntu only needs 0.1.8 | 10:28 |
*** AJaeger has joined #zuul | 10:32 | |
saop | tristanC, how to set log url in zuul v3? | 10:50 |
tristanC | saop: you should use https://docs.openstack.org/infra/zuul-jobs/roles.html#role-upload-logs or build something similar | 10:54 |
*** electrofelix has joined #zuul | 11:01 | |
*** fbo has joined #zuul | 11:31 | |
*** jpena is now known as jpena|lunch | 12:35 | |
*** weshay_PTO is now known as weshay | 13:01 | |
saop | tristanC, I created post-logs.yaml according to the documentation and referenced it in the job's post-run section, but it didn't execute, any idea? | 13:07
tristanC | saop: to debug this kind of failure, use the executor keep option to get the raw ansible job logs in /tmp | 13:12 |
saop | tristanC, Thanks | 13:13 |
*** rlandy has joined #zuul | 13:30 | |
*** rlandy_ has joined #zuul | 13:30 | |
*** rlandy_ has quit IRC | 13:30 | |
*** jpena|lunch is now known as jpena | 13:39 | |
*** jaianshu_ has joined #zuul | 13:46 | |
*** jaianshu has quit IRC | 13:49 | |
*** jaianshu_ has quit IRC | 13:50 | |
*** saop has quit IRC | 13:51 | |
*** ankkumar has quit IRC | 14:05 | |
*** dkranz has joined #zuul | 14:13 | |
pabelanger | http://grafana.openstack.org/dashboard/db/nodepool-inap is an interesting pattern with test nodes, I wonder if we are getting a large number of locked ready nodes pooling up that cannot transition to in-use because we cannot fulfill the requests (no more quota) | 15:00
pabelanger | we end up getting more than 1/2 the nodes marked ready, then seem to use them all at once | 15:00
pabelanger | maybe we had a gate reset during that time too? | 15:01 |
Shrews | pabelanger: what are you seeing in zk for that provider? | 15:10 |
pabelanger | Shrews: anything specific I should be looking for? Just that we are at quota for the provider, and we have locked ready nodes, waiting for other nodes to come online | 15:13 |
Shrews | pabelanger: first thing i'd look at is if the ready&locked nodes have been around for a long time. if so, could be an issue we might need to look into. otherwise, might be a normal pattern | 15:15 |
*** bhavik1 has joined #zuul | 15:16 | |
pabelanger | Shrews: yah, I'll see if I can find pattern in logs | 15:18 |
*** flepied has joined #zuul | 15:33 | |
*** flepied_ has joined #zuul | 15:33 | |
*** flepied_ has quit IRC | 15:33 | |
*** bhavik1 has quit IRC | 16:09 | |
*** hashar has quit IRC | 16:32 | |
*** dkranz has quit IRC | 16:36 | |
*** jpena is now known as jpena|off | 16:37 | |
*** jpena|off is now known as jpena | 16:41 | |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Only decline requests if no cloud can service them https://review.openstack.org/533372 | 16:46 |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Add test_launcher test https://review.openstack.org/533771 | 16:46 |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Only fail requests if no cloud can service them https://review.openstack.org/533372 | 16:48 |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Add test_launcher test https://review.openstack.org/533771 | 16:48 |
*** dkranz has joined #zuul | 16:52 | |
*** tflink has quit IRC | 16:54 | |
*** tflink has joined #zuul | 16:57 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Add cross-source tests https://review.openstack.org/532699 | 17:15 |
*** dkranz has quit IRC | 17:15 | |
*** dkranz has joined #zuul | 17:27 | |
*** zigo has quit IRC | 17:33 | |
*** openstackgerrit has quit IRC | 17:33 | |
*** sshnaidm is now known as sshnaidm|afk | 17:34 | |
*** sshnaidm|afk has quit IRC | 17:34 | |
*** bhavik1 has joined #zuul | 17:35 | |
*** bhavik1 has quit IRC | 17:35 | |
*** zigo has joined #zuul | 17:37 | |
*** openstackgerrit has joined #zuul | 17:49 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Support cross-source dependencies https://review.openstack.org/530806 | 17:49 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add cross-source tests https://review.openstack.org/532699 | 17:49 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Documentation changes for cross-source dependencies https://review.openstack.org/534397 | 17:49 |
corvus | that patch series is a 3.0 release blocker and is ready for review ^ | 17:49 |
corvus | it may be worth reading the docs change first | 17:50 |
SpamapS | corvus: so, re2 doesn't support this regexp: '^(?!stable)' .. apparently that negative lookahead is one of the constructs it rejects. | 17:50
SpamapS | wondering if there's a more efficient way to say "doesn't start with stable" | 17:51
corvus | hrm. well, i think we're going to need that less in the future, however, we still have vestigial uses of it now, and it's proven so useful in the past i worry about not being able to use something like that... | 17:52
SpamapS | yeah, re2 seems to only support negation of character classes, not strings. | 17:56 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Revert "Revert "Add consolidated role for processing subunit"" https://review.openstack.org/533834 | 17:58 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Adjust check for .stestr directory https://review.openstack.org/532688 | 17:58 |
SpamapS | corvus: we could make an irrelevant-branches maybe? | 18:02 |
*** sshnaidm|afk has joined #zuul | 18:03 | |
SpamapS | which would allow positive matching | 18:03 |
corvus | SpamapS: an alternative might be to make the zuul config language more sophisticated, so you could specify boolean ops on the regexes. something like "branches: [not: stable/.*]". the underlying classes are sophisticated enough to support that, it's just not exposed in syntax. | 18:03 |
corvus | SpamapS: we had similar ideas simultaneously :) | 18:03 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove updateChange history from github driver https://review.openstack.org/531904 | 18:07 |
SpamapS | corvus: Might be good to wrap this up before 3.0. Would be a shame to release with a DoS'able scheduler feature. | 18:19 |
corvus | SpamapS: there are still so many other ways to DoS, and this one has been out there for 6 years | 18:21 |
corvus | openstack's zuul runs out of memory every 4 days just through normal use, and i wasn't even planning on considering that a 3.0 blocker at this point. | 18:22 |
SpamapS | :-/ | 18:22 |
corvus | arbitrary node set size is another one | 18:22 |
SpamapS | kk, 3.1 -- the hardening release. ;) | 18:23 |
SpamapS | (where we let you turn off re maybe? :-P ) | 18:23 |
corvus | SpamapS: ++ 3.1 hardening, but i think we should avoid making language features configurable -- we should either reduce the scope of re (ie, switch to re2, possibly compensate by irrelevant-branches or booleans), or drop it altogether | 18:25
SpamapS | Yeah I like the idea of allowing only positive matches and using re2. | 18:25 |
SpamapS | The CLA process has begun here so hopefully we can get my py3k support merged and released relatively quickly. | 18:27 |
corvus | SpamapS: if you end up being the maintainer, the CLA process won't matter anyway :) | 18:27 |
* fungi feels sorry at a personal level that this cla is still hanging around | 18:27 |
corvus | fungi: different cla | 18:27 |
fungi | oh! | 18:27 |
fungi | hah, i missed that | 18:27 |
corvus | facebook i think? | 18:28 |
SpamapS | corvus: actually the maintainer may have re-appeared. Apparently facebook changed their email domain and they didn't update their addresses in the README (they were @facebook.com and are now @fb.com) | 18:28 |
clarkb | if we don't do negative lookahead do we even need regexes? | 18:28 |
fungi | regardless, not *another* cla i need to feel bad about | 18:28 |
clarkb | could just list all positive matches. Would be more verbose but avoids the re problem entirely | 18:28 |
clarkb | (then just string == otherstring) | 18:29 |
corvus | clarkb: agreed, i think that's something worth considering, and tbh, i would prefer that to our (openstack) habit of negative lookaheads regardless. | 18:29 |
fungi | i suppose if you don't need negative lookahead you could get away with implementing positive matches via an expansion syntax to save some space | 18:30 |
fungi | e.g. branch: stable/{ocata,pike},master | 18:30 |
fungi | then internally expand and match against the resulting list | 18:31 |
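Purely as an illustration of the "expand, then compare literal strings" direction (the brace syntax is fungi's off-the-cuff example, not an implemented Zuul feature):

```python
import re


def split_top_level(spec):
    """Split 'stable/{ocata,pike},master' on commas outside braces."""
    parts, buf, depth = [], '', 0
    for ch in spec:
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
        if ch == ',' and depth == 0:
            parts.append(buf)
            buf = ''
        else:
            buf += ch
    parts.append(buf)
    return parts


def expand(part):
    """'stable/{ocata,pike}' -> ['stable/ocata', 'stable/pike']"""
    m = re.search(r'\{([^}]*)\}', part)
    if not m:
        return [part]
    head, tail = part[:m.start()], part[m.end():]
    return [x for alt in m.group(1).split(',') for x in expand(head + alt + tail)]


def branch_matches(branch, spec):
    """Exact string comparison only -- no regex engine involved at match time."""
    return any(branch == candidate
               for top in split_top_level(spec)
               for candidate in expand(top))


# branch_matches('stable/pike', 'stable/{ocata,pike},master')  -> True
```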
corvus | this would be a really good mailing list discussion -- have folks weigh in on "regex positive-only matches vs no regex at all" | 18:31 |
clarkb | oh that's what I need to do, sign up for ml again | 18:31
fungi | yup! | 18:31 |
* clarkb wonders if that will make mailman mad | 18:31 | |
corvus | fungi: if we do forward expansions, we may as well use re2 rather than rolling our own, i'd think | 18:31 |
fungi | i took the hint yesterday and finally signed up for the lists.zuul-ci.org mailing lists | 18:32 |
clarkb | I signed up last week but things were not working | 18:32
fungi | corvus: oh, that's a feature of re2? even nicer | 18:32 |
corvus | fungi: er, i dunno? maybe i'm making stuff up. | 18:32 |
fungi | that's cool too ;) | 18:32 |
fungi | everyone needs a hobby | 18:33 |
corvus | fungi: i just assumed that something like "stable/(foo|bar)" would work | 18:33 |
fungi | oh, sure | 18:33
clarkb | re2 does safer, more performant regexes with fewer magical features like negative lookahead | 18:33
corvus | SpamapS: the other thing is there are some more uses of regex -- i think some of the pipeline trigger / approval matching stuff uses it. maybe it's okay to leave that since that's config-repos only. maybe we just need to identify all the untrusted-repo uses of regex and address them. | 18:34 |
SpamapS | corvus: yeah I was mostly concerned with job config. Did not dig into other uses. | 18:34 |
SpamapS | At some point it seems like it would make sense to just use re2 everywhere since it is also far faster even on the simple cases. | 18:35 |
corvus | files needs either regex or glob. so if we entertain dropping regex, we'd have to glob there i think. | 18:35 |
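If the files matcher went glob-only, the stdlib already covers it; a quick sketch, not Zuul's implementation:

```python
from fnmatch import fnmatch


def files_match(changed_files, patterns):
    """True if any changed file matches any glob pattern."""
    return any(fnmatch(f, pat) for f in changed_files for pat in patterns)


# files_match(['doc/source/index.rst'], ['doc/*', 'releasenotes/*'])  -> True
# (note fnmatch's '*' crosses '/', unlike shell pathname globbing)
```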
corvus | talk of mailing lists reminds me... | 18:36 |
-corvus- Please sign up for new zuul mailing lists: http://lists.openstack.org/pipermail/openstack-infra/2018-January/005800.html | 18:36 | |
corvus | and we should update the readme/docs too | 18:37 |
*** sshnaidm|afk is now known as sshnaidm | 18:39 | |
*** jpena is now known as jpena|off | 18:45 | |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Clarify provider manager vs provider config https://review.openstack.org/531618 | 18:46 |
*** electrofelix has quit IRC | 18:57 | |
*** electrofelix has joined #zuul | 18:58 | |
*** flepied has quit IRC | 19:07 | |
*** flepied has joined #zuul | 19:08 | |
*** flepied has quit IRC | 19:30 | |
*** flepied has joined #zuul | 19:30 | |
*** harlowja has joined #zuul | 19:44 | |
openstackgerrit | Andreas Jaeger proposed openstack-infra/zuul-jobs master: Refactor fetch-subunit-output https://review.openstack.org/534427 | 20:04 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Clean up when conditions in fetch-subunit-output https://review.openstack.org/534428 | 20:05 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Use subunit2html from tox envs instead of os-testr-env https://review.openstack.org/534429 | 20:05 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Make fetch-subunit-output work with multiple tox envs https://review.openstack.org/534430 | 20:05 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Move testr finding to a script https://review.openstack.org/534431 | 20:05 |
*** hashar has joined #zuul | 20:07 | |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Revert "Revert "Add consolidated role for processing subunit"" https://review.openstack.org/533834 | 20:10 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Move testr finding to a script https://review.openstack.org/534431 | 20:23 |
pabelanger | Shrews: we seem to be continuing to accumulate ready / locked nodes in inap ATM. I'm trying to see why new requests are coming online when existing ones haven't been fulfilled yet | 20:25
pabelanger | we are up to 32 ready / locked nodes right now | 20:25 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Move testr finding to a script https://review.openstack.org/534431 | 20:28 |
pabelanger | odd, up to 52 now | 20:29 |
pabelanger | something is going on | 20:29 |
Shrews | pabelanger: "why new requests are coming online, when exists ones haven't been fulfilled yet" ... that confuses me. requests are allowed to exist in parallel for a provider, so not sure what you mean by that | 20:30 |
Shrews | but i will see if i can glean anything from the logs | 20:31 |
pabelanger | 2018-01-16 20:28:04,870 DEBUG nodepool.driver.openstack.OpenStackNodeRequestHandler[nl01.openstack.org-30110-PoolWorker.inap-mtl01-main]: Fulfilled node request 100-0002104682 | 20:33 |
pabelanger | at that point, we seem to fulfill about 52 nodes, but I don't see why it took so long | 20:33
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Documentation changes for cross-source dependencies https://review.openstack.org/534397 | 20:33 |
pabelanger | some of them appear to be single ubuntu-xenial nodes | 20:33 |
Shrews | pabelanger: i do not see any ready&locked nodes | 20:34 |
pabelanger | Shrews: yah, they just unlocked at the timestamp above | 20:35
pabelanger | if you look at logs, we fulfilled a bunch of requests at one time, upwards of 45 nodes | 20:35 |
Shrews | pabelanger: we are at capacity | 20:35 |
Shrews | rather, inap is | 20:35 |
pabelanger | http://grafana.openstack.org/dashboard/db/nodepool-inap shows the spike in avail ready nodes | 20:36 |
Shrews | max-servers is 190 | 20:36 |
pabelanger | Shrews: right, but I am trying to understand how we keep growing ready / locked nodes. If at capacity, don't the next node requests go towards existing open nodesets waiting for nodes? | 20:37
pabelanger | For example: http://grafana.openstack.org/dashboard/db/nodepool-inap?from=1516133623844&to=1516134574705&var-provider=All | 20:38 |
Shrews | wow, i cannot grok that sentence for some reason. :) | 20:38 |
Shrews | pabelanger: i'm not sure i | 20:38 |
pabelanger | for almost 15mins, we grew ready / locked nodes, until something happened for them to all dump | 20:38 |
pabelanger | almost 30% of capacity grew to be idle for 15mins | 20:39 |
pabelanger | http://grafana.openstack.org/dashboard/db/nodepool-rackspace?from=1516125202560&to=1516126777140 is another interesting graph for rackspace | 20:41 |
pabelanger | IAD had 87 available nodes but only 37 in-use | 20:41
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove updateChange history from github driver https://review.openstack.org/531904 | 20:44 |
corvus | clarkb: i'm a little confused by https://storyboard.openstack.org/#!/story/2001427 -- that diagram makes it look like C is the parent of B, but the text says that B is the parent of C. | 20:48 |
corvus | pabelanger: you should entertain the idea that zuul doesn't have enough executors to handle all of the jobs. zuul collects nodes before it gives them to executors. i see some small and large spikes on the executor queue graph. that means zuul is spending at least some time waiting for executors to pick up jobs. | 20:54
corvus | in those cases, the nodes will be ready and locked by zuul, and their requests will be fulfilled. | 20:55
corvus | no idea if that's what you're seeing, just throwing it out there. | 20:55 |
pabelanger | corvus: okay, let me confirm, but I think nodepool is still holding the lock at this point. Would zuul be involved if that is still the case? | 20:56 |
pabelanger | the part I am trying to figure out is, I see the following in the logs: | 20:57 |
pabelanger | 2018-01-16 20:28:04,866 DEBUG nodepool.driver.openstack.OpenStackNodeRequestHandler[nl01.openstack.org-30110-PoolWorker.inap-mtl01-main]: Pausing request handling to satisfy request | 20:57 |
pabelanger | <snipped for size> | 20:57 |
pabelanger | then nodepool start to unlock nodes | 20:57 |
pabelanger | which brings ready / locked nodes back down to zero | 20:58 |
pabelanger | it seems we only pause once at capacity, according to comments in nodepool/driver/openstack/handler.py | 20:59
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add skipped CRD tests https://review.openstack.org/531887 | 21:01 |
*** haint_ has quit IRC | 21:01 | |
Shrews | pabelanger: i'm not sure what's happening here, but it does seem that something is keeping completed request handlers from being processed | 21:02 |
Shrews | in a timely fashion | 21:02 |
*** haint_ has joined #zuul | 21:02 | |
pabelanger | I'm trying to walk myself through the poll function in nodepool/driver/__init__.py | 21:05
*** rlandy_ has joined #zuul | 21:05 | |
corvus | clarkb: though, i think the direction of that arrow doesn't actually matter, the resulting trees are equivalent (though the final tree would be BCA not CBA) | 21:05 |
*** rlandy__ has joined #zuul | 21:05 | |
*** rlandy__ has quit IRC | 21:05 | |
Shrews | pabelanger: theory... i believe something in _assignHandlers(), where it loops through the node requests and processes them, is causing a significant delay within the loop | 21:08 |
Shrews | pabelanger: because until that loop finishes, it will never remove the completed handlers (and i'm not seeing that happen until minutes later in the log) | 21:09 |
Shrews | i only see zookeeper communication happening there, so i wonder if something is wonky with that | 21:10 |
Shrews | an overloaded ZK? | 21:10 |
Shrews | or communication issues with it? | 21:10 |
pabelanger | 1 sec, need to AFk quickly | 21:11 |
clarkb | corvus: C is the parent of B that is what the text says | 21:12 |
Shrews | java.io.IOException: No space left on device | 21:12 |
Shrews | on zookeeper | 21:12 |
Shrews | wheeeee | 21:12 |
clarkb | corvus: C -> B -> A, then separately C and B depends on A | 21:12 |
Shrews | pabelanger: ^^^ | 21:12 |
corvus | clarkb: right, i was confused by "Change C has parent change B" which means B is the parent of C. | 21:12 |
clarkb | corvus: actually the depends on may just be C to A | 21:13 |
Shrews | pabelanger: oh, nm. that's an old log | 21:13 |
corvus | clarkb: but honestly, i don't think it matters, we can just pick one and stick with it. :) | 21:13 |
clarkb | er now I'm all confused. In any case I think the arrows are correct :) | 21:13
clarkb | corvus: ya, basically you just need a dag that looks like a cycle but with one of the arrows pointed the wrong way | 21:13 |
corvus | clarkb: yeah, both sets of arrows match. | 21:13 |
pabelanger | sorry, #dadop | 21:14 |
clarkb | Shrews: ya nodepool.o.o has ~50% of / free | 21:14 |
pabelanger | I'm looking at cacti.o.o now | 21:14 |
corvus | it looks busy, but not critically so. | 21:14 |
pabelanger | to see if we are spiking anything | 21:14 |
clarkb | could it be the executors just aren't grabbing more jobs? | 21:14
clarkb | or not grabbing nodes from nodepool fast enough? | 21:15 |
pabelanger | if I understand logs, I don't think nodepool is releasing them to zuul | 21:15 |
pabelanger | but will let Shrews confirm | 21:15 |
corvus | clarkb: it's possible that's happened a few times, but i think Shrews and pabelanger have also found behavior that it doesn't explain, and it's more common | 21:15
corvus | (the grafana graphs suggest that has maybe happened after some gate resets) | 21:16 |
*** threestrands_ has joined #zuul | 21:18 | |
Shrews | corvus: pabelanger: is it possible that we could be getting rate limited on quota queries by the provider here? if those are slow, we could spend a long time in the request acceptance loop | 21:33
Shrews | which i'm growing more confident that we are doing | 21:33 |
pabelanger | Shrews: let me check if we are doing any caching for quota or not | 21:35 |
pabelanger | maybe we need to add it, if missing | 21:35 |
Shrews | i'm not certain what those queries are... trying to track down where we do that | 21:35 |
clarkb | Shrews: we may not be rate limited as much as clouds responding slow | 21:36 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix dependency cycle false positive https://review.openstack.org/534444 | 21:36 |
clarkb | Shrews: pabelanger you can probably test that out of band by making the same requests and timing them | 21:37 |
pabelanger | yah, was going to suggest that | 21:37 |
pabelanger | we could also start tracking them via statsd too, like create servers | 21:38 |
corvus | we should be caching the quota queries -- except when we unexpectedly get an over-quota failure, we empty the cache. we're continually getting over-quota failures in inap due to the nova mismatch bug. | 21:38 |
mordred | pabelanger: we should already be tracking them in statsd, just need to add grafana graphs for them | 21:38 |
pabelanger | mordred: nice | 21:38 |
mordred | pabelanger: (we should be generating statsd metrics for every REST call made - so we should get all the metrics all the time) | 21:39
pabelanger | stats.nodepool.task.inap-mtl01.ComputeGetLimits looks to be key | 21:41 |
mordred | yup. that seems about right | 21:41 |
pabelanger | I'll confirm and work up grafyaml patch | 21:42 |
corvus | pabelanger: do you have a graphite link handy for us to look at? | 21:43 |
Shrews | seems getting compute limits on inap takes about a second | 21:44 |
corvus | clarkb: in 534444 i opted to fix only the new-style depends-on; do you think it's important to fix the legacy gerrit depends-on as well? | 21:44 |
pabelanger | let me see how to share | 21:44 |
mordred | corvus: test_crd_gate_triangle sounds like a place I don't want to fly a small aircraft over | 21:44 |
corvus | pabelanger: right click on image ? | 21:44 |
clarkb | corvus: is old style going away? | 21:44 |
corvus | mordred: definitely | 21:44 |
clarkb | if old style is going away, probably not, but if we expect it to stick around it might be good to have consistent behavior | 21:44
corvus | clarkb: yeah, but with a timeframe convenient for openstack, so maybe 3-6 mo... | 21:44 |
corvus | i don't want to immediately invalidate a bunch of depends-on headers | 21:45 |
pabelanger | http://graphite.openstack.org/render/?width=586&height=308&_salt=1516139140.947&target=stats.nodepool.task.inap-mtl01.ComputeGetLimits | 21:46 |
pabelanger | that is the image, but not sure how to properly share | 21:46 |
corvus | you just did | 21:46 |
pabelanger | kk | 21:46 |
pabelanger | seems pretty flat right now | 21:46 |
Shrews | i'm not sure how to track this down without throwing in a bunch of spurious logging statements in a printf-style-ala-mordred attack | 21:47 |
corvus | pabelanger: i suspect that's a graph of when we emit those calls | 21:48 |
mordred | pabelanger: the timing key is ... | 21:48 |
corvus | http://graphite.openstack.org/render/?width=586&height=308&_salt=1516139263.857&target=stats.timers.nodepool.task.inap-mtl01.ComputeGetLimits.mean | 21:48 |
corvus | something like that ^ i think | 21:48 |
pabelanger | ah, thank you, that does look right | 21:48
mordred | yes. what corvus said | 21:48 |
mordred | that's in miliseconds? | 21:49 |
corvus | mordred: i think so | 21:49 |
mordred | yah. so - not the world's fastest call - but not the world's slowest either | 21:49 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Support cross-source dependencies https://review.openstack.org/530806 | 21:49 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add cross-source tests https://review.openstack.org/532699 | 21:49 |
clarkb | corvus: reviewed the fix for triangle deps | 21:52 |
clarkb | corvus: there is a bug | 21:52 |
*** flepied has quit IRC | 21:55 | |
*** flepied has joined #zuul | 21:57 | |
Shrews | pabelanger: corvus: http://paste.openstack.org/show/645748/ Those "predicted" calculations happen back-to-back in the code, so the second one is taking at least 5 seconds. I'm not sure how long the 1st is taking | 22:03 |
Shrews | I'm seeing processing a single request take between 20-30 seconds. If there are lots of requests to go through, then the code will not get around to removing completed requests for quite a while | 22:03 |
Shrews | So we could be seeing a combination of things here | 22:04 |
Shrews | Heavy load + inefficient request processing | 22:04 |
Shrews | This would explain why pabelanger saw many ready+locked nodes suddenly free up | 22:05 |
*** dkranz has quit IRC | 22:06 | |
Shrews | or change to in-use, rather | 22:06 |
pabelanger | yah, nodepool-launcher is at 100% CPU for the most part | 22:07 |
Shrews | pabelanger: this was a good spot. well done for catching it | 22:08 |
pabelanger | could we run another nodepool-launcher process on the host, just for inap, as a test? | 22:09
Shrews | pabelanger: now tell us how to make it better!!! :) | 22:09 |
Shrews | i don't think this is inap specific. i see similar trends with other providers | 22:09 |
pabelanger | yah, nl01 is 8vCPU; if we are at 100% for a single nodepool-launcher daemon, maybe we shard the config more and run a single process per provider? | 22:10
Shrews | i don't think that'd help, tbh | 22:13 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix dependency cycle false positive https://review.openstack.org/534444 | 22:13 |
corvus | clarkb: thx ^ | 22:14 |
pabelanger | Shrews: ack | 22:14 |
Shrews | adding caching of limits to shade might help a bit | 22:16 |
Shrews | i can work on that in the morning if we think that's a good idea | 22:17 |
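Sketch of the caching behaviour corvus described above (serve limits from a short-lived cache, invalidate on an unexpected over-quota failure). get_compute_limits() stands in for whatever shade/provider call actually fetches the limits; the TTL and names are assumptions.

```python
import time


class CachedLimits:
    def __init__(self, client, ttl=300):
        self.client = client
        self.ttl = ttl
        self._value = None
        self._fetched_at = 0.0

    def get(self):
        # Refresh only when empty or stale, so quota checks during request
        # handling don't turn into one API round trip per request.
        if self._value is None or time.time() - self._fetched_at > self.ttl:
            self._value = self.client.get_compute_limits()  # assumed API
            self._fetched_at = time.time()
        return self._value

    def invalidate(self):
        """Call when the cloud unexpectedly reports over-quota."""
        self._value = None
```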
corvus | Shrews: can you summarize what the problem is? i don't have the full picture. | 22:17 |
mordred | Shrews: I think adding in the same caching pattern we use for servers ports and floating-ips would be a fine idea (pending we don't learn something else to the contrary between now and then) | 22:18 |
Shrews | corvus: requests aren't getting satisfied in a timely fashion because of the processing of the node requests loop. we are seeing several seconds between processing each request. until we get through all of them, we do not attempt to mark requests fulfilled | 22:19 |
corvus | Shrews: which loop is the 'node requests loop' ? | 22:19 |
Shrews | corvus: PoolWorker._assignHandlers() | 22:20 |
corvus | Shrews: what in there do you think is slow? | 22:21 |
*** haint_ has quit IRC | 22:22 | |
Shrews | corvus: all of it? :) the paste from me above shows several seconds for at least one of the quota estimations. | 22:22 |
*** haint_ has joined #zuul | 22:22 | |
Shrews | corvus: i have not identified other areas of slowness yet, but can see over 20s between assigning a request, and actually launching the node for it | 22:23 |
Shrews | without some more debugging info, i can only guess as to what else is slow about it | 22:25
corvus | Shrews: okay, so we've got several seconds for the new handlers to run if that method creates any new ones... | 22:26 |
corvus | Shrews: if the handler is paused, we short out of there pretty quick | 22:27 |
corvus | Shrews: but if it isn't paused, then we're going to try to lock every request. we have 1800 of them right now. | 22:27 |
corvus | but it seems like we should run up to the point where we are paused pretty quickly, so that shouldn't be an issue | 22:28 |
Shrews | right. looking at the logs, i saw the inap thread accept 50 requests before it completed the loop, which took about 13m in all | 22:28 |
Shrews | it never paused in that time | 22:29 |
Shrews | which means it had capacity to handle them all | 22:29 |
clarkb | each provider gets its own thread too right? and those threads fight all the launch threads? could we be cpu starved? | 22:29 |
clarkb | like maybe we should run multiple nodepool launchers per host or something | 22:29 |
pabelanger | clarkb: yah, that is what I was thinking about multiple launchers per host | 22:30 |
pabelanger | they'd each need to have a config with a specific provider | 22:30 |
Shrews | clarkb: yes, 1 provider pool per thread | 22:30
Shrews | a provider can have multiple pools | 22:30 |
clarkb | ah right its PoolWorker | 22:31 |
clarkb | in our case pools and providers are currently 1:1 | 22:31 |
corvus | nl01 is currently cpu-bound | 22:31 |
corvus | they both are | 22:32 |
Shrews | there could be a lot of thread context switching going on since we have 1 thread per node launch too | 22:32 |
Shrews | maybe multiple processes could help | 22:32 |
Shrews | as pabelanger suggested | 22:32 |
corvus | yep. this is one of the reasons we designed it to accommodate sharding -- we were starved on one machine. it's interesting that we are already starved with two processes. | 22:33
corvus | so more processes (whether that's on more machines or the same one) may help, if the main pool thread is being starved by the launch threads. | 22:33 |
clarkb | in this case I meant the same host, since I think we have a lot of available cpu; it's just python being unable to take advantage of it | 22:34
corvus | fwiw, the context switch overhead before it became unbearable was about 1k threads. | 22:34 |
mordred | seems about right | 22:34 |
pabelanger | nb01.o.o is also 8vcpu / 8GB RAM; if we are CPU bound, that's an expensive server for a single process. We could try bringing on more launcher nodes in smaller sizes, or run more launchers on each host | 22:35
corvus | Shrews, mordred: i'm not 100% sure, but i worry that quota caching isn't an immediate answer. we should verify that nodepool isn't already caching sufficiently and that changing shade would actually do anything different before we go down that road. | 22:35
Shrews | corvus: nod | 22:36 |
corvus | pabelanger: yes that server is too large. it should be 2g | 22:36 |
clarkb | pabelanger: more nodes in smaller sizes potentially makes us less outage prone, so that's a plus, but also harder to do things like switch zk servers | 22:36
corvus | i'd size it 2G for each process we want to run on it. | 22:36 |
corvus | maybe even fit a few more processes on larger hosts | 22:37 |
pabelanger | if we did a single process on 2GB hosts, that would work well with how we manage nodepool.yaml in project-config today (by hostname). If we do more processes on a larger server, we'll need to rework some puppet code first | 22:38
corvus | Shrews: we could also make the poolworker a little more concurrent, either by having another thread handle completions, or just interleaving completions with assignments. | 22:38 |
Shrews | corvus: difficult to handle correctly since either way, we'd want to modify the same data structure (the handler array). but yeah, could be possible with the right locking and whatnot | 22:39 |
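Roughly what "interleaving completions with assignments" could look like, with a lock around the shared handler list; every name here is invented for illustration and the real PoolWorker code differs.

```python
import threading


class PoolWorkerSketch:
    def __init__(self):
        self.request_handlers = []
        self.handler_lock = threading.Lock()  # guards the shared handler list

    def _sweep_completed(self):
        # Drop handlers whose poll() says they are finished (fulfilled or failed).
        with self.handler_lock:
            self.request_handlers = [h for h in self.request_handlers
                                     if not h.poll()]

    def _assign_handlers(self, requests, sweep_every=10):
        for i, request in enumerate(requests):
            handler = self._launch_handler(request)  # assumed helper
            with self.handler_lock:
                self.request_handlers.append(handler)
            if i % sweep_every == 0:
                # Don't let fulfilled requests sit locked for the whole
                # (potentially 10+ minute) assignment pass.
                self._sweep_completed()
```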
pabelanger | I have to step away for a bit, will catch up when back | 22:39 |
Shrews | and i need to make dinner | 22:40 |
corvus | let's resume this tomorrow :) | 22:41 |
clarkb | another idea: we could use multiprocessing for the pool workers possibly? | 22:41
clarkb | and have python handle it more behind the scenes for us? that might be friendlier to other users | 22:41 |
corvus | clarkb: yes, as long as we do it at that level (the pool worker), where there's no shared data between any of them. that's worth looking into. | 22:42 |
clarkb | ya all the communication is through zk anyways so that should be pretty safe | 22:42 |
corvus | multiprocessing is cool as long as you don't try to share data, then it gets bad. so a process for a pool worker, but then all the launchers still as threads. | 22:42 |
mordred | ++ | 22:43 |
corvus | they already even have their own zk connection, so even that wouldn't be different | 22:43 |
Shrews | i think the Nodepool object itself is shared, which could be problematic | 22:43 |
Shrews | anyway, really away now | 22:43 |
corvus | Shrews: yeah, i think that's mostly config stuff at this point; we could probably fix that. | 22:44 |
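A hypothetical shape for "one OS process per provider pool, launch threads inside each", where every child builds its own config and ZooKeeper connection and shares nothing in-process; the PoolWorker stub and argument list are placeholders, not nodepool's real API.

```python
import multiprocessing


class PoolWorker:
    """Placeholder for the real pool worker; runs launch threads for one pool."""

    def __init__(self, config_file, provider, pool):
        self.config_file, self.provider, self.pool = config_file, provider, pool

    def run(self):
        pass  # real code would open its own ZK connection and process requests


def run_pool(config_file, provider, pool):
    # Each child process constructs its own config, ZK client and per-node
    # launch threads, so no Python objects cross the process boundary;
    # all coordination happens through ZooKeeper.
    worker = PoolWorker(config_file, provider, pool)
    worker.run()  # blocks; spawns launch threads internally


def main(config_file, pools):
    procs = []
    for provider, pool in pools:
        p = multiprocessing.Process(
            target=run_pool, args=(config_file, provider, pool),
            name='pool-%s-%s' % (provider, pool))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
```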
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix dependency cycle false positive https://review.openstack.org/534444 | 22:48 |
corvus | clarkb: there are tests which would have caught the error you pointed out. most of them worked, but two needed a fix, so i updated the patch with that as well. ^ | 22:49 |
clarkb | corvus: cool I will rereview | 22:51 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Stop running tox-cover job https://review.openstack.org/534458 | 22:51 |
*** flepied has quit IRC | 22:57 | |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Really change patchset column to string https://review.openstack.org/532459 | 23:05 |
clarkb | Shrews: corvus do you want to review https://review.openstack.org/#/c/533771/5 and parent? | 23:06 |
clarkb | I think that is slightly more important now that infracloud might shutdown at any time | 23:06 |
corvus | clarkb: looking | 23:10 |
pabelanger | https://review.openstack.org/534450/ should start tracking getlimits API query in grafana to help with tomorrow | 23:11 |
clarkb | there is another cleanup related to that where I think we can more aggressively unset allocated_to (or whatever the var is called) when we fail a multinode nodeset | 23:12 |
clarkb | I think right now we let the cleanup thread find those and unset them, but we can unset them early allowing them to be used in other nodesets sooner | 23:12
corvus | clarkb: lgtm -- we have some tests which set an image flag that tells the fake to always fail when booting that image. i wonder if there's a way to turn that on mid-stream to avoid that particular monkey-patch. | 23:14 |
corvus | clarkb: i've +2d it; let's see if Shrews wants to review it tomorrow? unless you think we need to accelerate deployment of it. | 23:15 |
clarkb | looking at multiprocess PoolWorker more closely we use nodepool.getZK() and nodepool.getPoolWorkers() along with some config related stuff | 23:16 |
clarkb | I think the only thing that may be a real problem is getPoolWorkers? /me makes a quick change and sees what happens | 23:16 |
clarkb | oh you know where this might break the most is in the tests :/ | 23:17 |
clarkb | because we sort of assume a single process we can manipulate | 23:18 |
corvus | hrm. i'm guessing black-box tests might work, but not so much white-box. | 23:21 |
corvus | the tests which do end-to-end testing through zookeeper should work. | 23:22 |
corvus | how does this message about the zuulv3 merge look? https://etherpad.openstack.org/p/4sX3qKYDBN | 23:28 |
clarkb | ya, just hacking in multiprocessing and running it against the test I added above; once I fix some structural issues, we run into the problem where the test framework introspects its threads and stuff to know how things are doing | 23:32
clarkb | so it will be a bit of work to get that going in the test suite | 23:32
corvus | clarkb: maybe we need a slightly different structure then -- maybe we need to handle the process split in a way where we can test the system all in one process, but when actually started, we get multiple ones. | 23:34 |
clarkb | that seems viable, we would need a more true functional black box test to go with that (which we do have) | 23:34
clarkb | this isn't something I'm going to dive into now but wanted to confirm my suspicions it is non trivial before I ignored it :) | 23:35
clarkb | but I do think if we can make it work this will be the most user friendly way of scaling up launchers per host | 23:36 |
corvus | clarkb: yeah, for the most part, i'd think even just a special test like "i started a two-provider-pool system with multiple-processes and they both gave me a node" would be sufficient to exercise that -- then all the rest of the correctness tests can remain in the single-process realm | 23:38
*** sshnaidm is now known as sshnaidm|off | 23:42 | |
corvus | i sent the first message to zuul-announce: http://lists.zuul-ci.org/pipermail/zuul-announce/2018-January/000000.html | 23:43 |
corvus | hopefully folks got that | 23:43 |
pabelanger | yup, got email here | 23:46 |
*** rlandy_ has quit IRC | 23:52 | |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Use hotlink instead log url in github job report https://review.openstack.org/531545 | 23:53 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Disambiguate with Netflix and Javascript zuul https://review.openstack.org/531292 | 23:55 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add support for protected jobs https://review.openstack.org/522985 | 23:56 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Documentation changes for cross-source dependencies https://review.openstack.org/534397 | 23:56 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Remove updateChange history from github driver https://review.openstack.org/531904 | 23:58 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Handle sigterm in nodaemon mode https://review.openstack.org/528646 | 23:58 |