SpamapS | jeblair: AH right ok | 00:01 |
* SpamapS may poke at that, I understand now | 00:01 | |
SpamapS | tho I kinda wanna bang out the last few test re-enables first | 00:01 |
* SpamapS EOD's | 00:01 | |
clarkb | wow is it that time already? | 00:09 |
clarkb | jeblair: can you look at test_client_enqueue_negative? is the reuse of client there valid when using shutdown? | 00:12 |
clarkb | test_client_promote_negative does it too | 00:14 |
clarkb | I think those are likely passing because they raise the expected exceptions, but because the client is shut down, not because it actually spoke through gearman and failed /me will adjust | 00:21 |
*** Shrews has quit IRC | 00:51 | |
jeblair | clarkb: the exception check ostensibly checks the specific error, so it seems like they ought to be valid checks (ie, not fooled by the client being shutdown). however, i agree, the reuse of client is wrong and should cause problems. so you're almost certainly on to something. :) | 00:57 |
clarkb | ya I think I have this mostly sorted out | 00:58 |
clarkb | It's ~133 tests in and no fails except for 3 where I derped a cleanup | 00:58 |
clarkb | (still not fast but seems stable at least) | 01:02 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Don't run apsched thread when no jobs need scheduling https://review.openstack.org/457798 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use current class name on super call https://review.openstack.org/459409 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Add ability to stop drivers https://review.openstack.org/459477 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use cStringIO if available. https://review.openstack.org/459478 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Join watchdog thread so it doesn't leak in tests https://review.openstack.org/459479 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup zookeeper and fake nodepool in nodepool tests https://review.openstack.org/459480 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup gearman clients in tests https://review.openstack.org/459481 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Error if threads leak. https://review.openstack.org/459482 | 01:14 |
clarkb | thats today's braindump | 01:14 |
clarkb | I'm not sure if last one will pass as I haven't been able to run full suite locally yet | 01:14 |
clarkb | But its way better than it has been I think | 01:14 |
jeblair | clarkb: i haven't looked, but it's possible that could cause follow-on failures | 01:15 |
jeblair | clarkb: like, if one test fails, then all subsequent tests in that test runner will also fail | 01:15 |
jeblair | which is a bit distracting; hard to see where the real failure is | 01:15 |
clarkb | jeblair: the error if threads leak change? | 01:15 |
clarkb | I don't think it should | 01:15 |
jeblair | clarkb: yeah. | 01:15 |
clarkb | cleanups should run regardless of other cleanups raising | 01:16 |
clarkb | and its at the end of that cleanup so I think we are fine | 01:16 |
jeblair | clarkb: if a test fails, it seems quite likely that it may fail to clean up properly, and threads may leak. | 01:16 |
clarkb | even failed tests should cleanup properly though... | 01:16 |
clarkb | thats why cleanup is better than teardown | 01:17 |
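For context on the cleanup-versus-teardown point above, a minimal sketch (not from the zuul tree) of the behaviour clarkb is relying on: with stdlib unittest, which testtools follows here, every cleanup registered via addCleanup runs even when the test body fails, and each cleanup runs even if another one raises.

```python
import unittest


class CleanupDemo(unittest.TestCase):
    def setUp(self):
        super(CleanupDemo, self).setUp()
        self.client_a = object()              # stand-in for e.g. a gear client
        self.addCleanup(self._release, 'a')   # runs last (cleanups are LIFO)
        self.client_b = object()
        self.addCleanup(self._release, 'b')   # runs first, even if the test fails

    def _release(self, name):
        # Each cleanup runs independently; one raising does not skip the rest.
        print('releasing %s' % name)

    def test_something(self):
        self.fail('a failing test still triggers both cleanups above')


if __name__ == '__main__':
    unittest.main()
```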
clarkb | anyways I can sort out how to make that assertion stronger later. I do think we need something like it though | 01:17 |
clarkb | since leaked threads are very expensive based on my local job runs | 01:17 |
jeblair | ideally yes. we can choose to make that a priority. i'm just saying it's probably not the case now. we'll probably be mopping that up for a while. | 01:17 |
clarkb | one reason I want it to fail is so that you see leaked threads on the first test that leaks them (since otherwise logs are not captured by default) | 01:23 |
clarkb | but I can work on that | 01:23 |
clarkb | I think http://logs.openstack.org/81/459481/1/check/gate-zuul-python27-ubuntu-xenial/02d8108/console.html is possibly a race introduced by my changes too :/ I touched the ansible execution code a little. Will have to poke more tomorrow | 01:30 |
clarkb | jlk: SpamapS: would appreciate it if you could grab that stack and see if it's better for you too | 01:45 |
jlk | kk | 01:57 |
clarkb | hrm tests.unit.test_scheduler.TestScheduler.test_needed_changes_enqueue has been running for ~8 minutes | 02:09 |
clarkb | might have to look at that test more closely too | 02:10 |
jamielennox | jeblair: so backscroll, but do you consider py3 and py3gear a blocker for that | 03:38 |
jamielennox | jeblair: ie, is py3 a specific valuable task that someone could work on today? | 03:39 |
jamielennox | SpamapS: obviously has some influence in the gear space so i can talk with him about a path to getting that done | 03:42 |
jamielennox | because it definitely makes sense that the proxying service be asyncio based | 03:42 |
tobiash | morning | 05:09 |
tobiash | what do you think about enriching nodepool logging with the extra keyword argument? | 05:10 |
tobiash | with that we could add e.g. the node id as extra metadata | 05:11 |
tobiash | using log formatters which support this extra metadata would make it easy to e.g. track/filter the lifecycle of a node in the logs | 05:14 |
tobiash | I use logstash_formatter which formats the log to json (including the extra dict) which is pushed directly to elasticsearch | 05:15 |
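A hedged sketch of the logging enrichment tobiash suggests, using only the stdlib `extra` argument; the field names are illustrative, and the JSON serialization would come from whatever formatter (e.g. the logstash_formatter package) is configured on the handler.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('nodepool.launcher')


def launch(node_id, provider):
    # Keys passed via 'extra' become attributes on the LogRecord, so a
    # JSON-emitting formatter can index them as separate fields, which is
    # what makes per-node filtering in Elasticsearch/Kibana easy. The keys
    # must not collide with built-in record attributes like 'message'.
    log.info('Launching node', extra={'node_id': node_id,
                                      'provider': provider})


launch(123, 'example-cloud')
```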
*** yolanda has quit IRC | 05:31 | |
SpamapS | jamielennox: yeah I think the idea is that to do websockets properly the right path is asyncio | 05:54 |
*** bhavik1 has joined #zuul | 06:29 | |
*** bhavik1 has quit IRC | 06:58 | |
*** yolanda has joined #zuul | 07:04 | |
*** isaacb has joined #zuul | 07:15 | |
*** jeblair_ has joined #zuul | 07:41 | |
*** jeblair has quit IRC | 07:41 | |
*** yolanda has quit IRC | 07:44 | |
*** hashar has joined #zuul | 07:56 | |
*** yolanda has joined #zuul | 08:00 | |
*** yolanda has quit IRC | 08:17 | |
*** yolanda has joined #zuul | 08:33 | |
*** yolanda has quit IRC | 09:22 | |
*** yolanda has joined #zuul | 09:37 | |
*** jkilpatr has quit IRC | 10:49 | |
*** yolanda has quit IRC | 11:05 | |
*** jkilpatr has joined #zuul | 11:11 | |
*** Shrews has joined #zuul | 11:29 | |
mordred | jamielennox, SpamapS: yes re: websockets and asyncio - but I don't think py3/py3gear needs to be a blocker for getting the websockets service done | 11:56 |
mordred | it'll only be a blocker for writing unit tests for the websockets service | 11:56 |
mordred | I say that because while technically getting gear python3'd has already eaten one person and is an easy hole to fall down and be frustrated inside of | 11:57 |
mordred | while getting the websockets streamer going should actually be fun | 11:58 |
mordred | OH | 11:58 |
mordred | well - I mean, tox -epy35 isn't really going to do anything right now - so I suppose one could actually even write unit tests for the websockets streamer and just put something into the tox definition that excludes the non-websockets unit tests from running under 35 | 11:59 |
mordred | so if it's a separate little daemon, and it fetches the information it needs about where the logs are from an http call to status.json ... | 12:00 |
mordred | I don't think there are any blockers | 12:00 |
mordred | (it wants to make the rest call, fwiw, so that it's not just an open proxy to streaming from random ports - if the browser->websocket protocol is "please give me the stream for job XXX and optionally for logfile YYY", then the websocket server asks zuul "hey, where's XXX and YYY?" - it should be a fairly closed loop) | 12:02 |
mordred | also, Shrews is currently working on the thing the websocket server will be streaming _from_, so if anybody does decide to dive on that topic, syncing with him will likely be helpful | 12:03 |
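A very rough sketch of the flow mordred describes, assuming aiohttp; the URL, message format, status.json layout, and the lookup_build() helper are all made up for illustration and are not zuul's actual API. The point is only that the websocket server streams solely from whatever location zuul reports for the requested build, so it is not an open proxy.

```python
import asyncio

import aiohttp
from aiohttp import web

STATUS_URL = 'http://zuul.example.org/status.json'   # placeholder


def lookup_build(status, build_uuid):
    # Placeholder for walking zuul's real status document; pretend it told
    # us the host and port the console stream for this build lives on.
    return ('executor.example.org', 7900)


async def find_stream(build_uuid):
    # Ask zuul where the console stream for this build lives.
    async with aiohttp.ClientSession() as session:
        async with session.get(STATUS_URL) as resp:
            status = await resp.json()
    return lookup_build(status, build_uuid)


async def stream(request):
    ws = web.WebSocketResponse()
    await ws.prepare(request)
    req = await ws.receive_json()             # e.g. {"uuid": "...", "logfile": "..."}
    host, port = await find_stream(req['uuid'])
    reader, _ = await asyncio.open_connection(host, port)
    while True:
        data = await reader.read(1024)
        if not data:
            break
        await ws.send_str(data.decode('utf-8', 'replace'))
    await ws.close()
    return ws


app = web.Application()
app.router.add_get('/console-stream', stream)
# web.run_app(app)
```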
jamielennox | mordred: so i was mostly asking because if that just needed someone to put in the work to unblock something i could pick it up | 12:17 |
jamielennox | mordred: but so gear is apparently already running with py35, but it has no asyncio | 12:17 |
jamielennox | mordred: and a google search for asyncio gearman has 3 libraries in top 4 results, what was the initial rationale for that being an openstack library? | 12:18 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: Don't getChange on source not triggering a change https://review.openstack.org/406129 | 12:19 |
*** yolanda has joined #zuul | 12:21 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool feature/zuulv3: Support externally managed images https://review.openstack.org/458073 | 13:10 |
*** atarakt has quit IRC | 13:14 | |
*** yolanda has quit IRC | 13:28 | |
pabelanger | https://admin.fedoraproject.org/pkgdb/package/rpms/zuul/ | 13:40 |
pabelanger | \o/ | 13:40 |
*** yolanda has joined #zuul | 13:44 | |
*** hashar is now known as wmf-SWAT-shop | 13:52 | |
*** wmf-SWAT-shop is now known as hashar | 13:52 | |
pabelanger | zuulv3-dev.o.o back online, but I see an exception | 14:04 |
pabelanger | http://paste.openstack.org/show/607867/ | 14:04 |
pabelanger | possibly another configuration change? | 14:05 |
pabelanger | Hmm, possibly a bug. According to docs, server should be the default for canonical_hostname | 14:10 |
Shrews | commit 1c7744 added canonical hostname to the source obj. probably related | 14:11 |
pabelanger | thanks | 14:11 |
pabelanger | I think we want our config to be the following now | 14:11 |
pabelanger | server = review.openstack.org | 14:12 |
pabelanger | canonical_hostname = git.openstack.org | 14:12 |
jeblair_ | yes, though it should work without that, so we should fix the error. | 14:12 |
pabelanger | agree | 14:13 |
*** jeblair_ is now known as jeblair | 14:15 | |
*** isaacb has quit IRC | 14:17 | |
jeblair | pabelanger: i'm going to do some editing/debugging on zuulv3-dev, ok? | 14:17 |
pabelanger | jeblair: sure | 14:17 |
pabelanger | going to grab a coffee | 14:17 |
mordred | jamielennox: we wrote gear originally just because we needed a better/simpler library for infra things back in the day (the other python libs were either incorrect or tried to be too clever) | 14:17 |
mordred | jamielennox: I think if we have the streamer get the needed data via http and not via gear we can avoid the need to get gear ported to py3 for now (it still needs to be done to support zuul on py3, but there's no need to block the websocket streaming on gear getting a py3 update) | 14:18 |
*** yolanda has quit IRC | 14:21 | |
jeblair | pabelanger: oh, it's because of the 'legacy gerrit section'. it's time to remove that. :) | 14:32 |
jeblair | pabelanger: [gerrit] needs to be [connection gerrit] | 14:33 |
pabelanger | Ah, cool! | 14:36 |
*** yolanda has joined #zuul | 14:36 | |
*** yolanda has quit IRC | 15:15 | |
pabelanger | apparently gnocchi is moving out of openstack namespace. Not sure what they are planning on for testing, but maybe another project to add to the list for our zuulv3 github connection | 15:23 |
clarkb | well they weren't using our testing before, just merge check and noop jobs. | 15:25 |
clarkb | or is that just debs ETOOEARLY | 15:25 |
clarkb | oh that's transitional, gah I need more caffeine | 15:26 |
pabelanger | I think they had python-jobs at one point | 15:26 |
pabelanger | and publish-to-pypi | 15:26 |
clarkb | pabelanger: ya the change diff I saw was removal of their transition out change | 15:26 |
clarkb | so first change is to remove python jobs et al and replace with noop jobs | 15:27 |
clarkb | then change to remove noop jobs and I saw this one | 15:27 |
*** yolanda has joined #zuul | 15:30 | |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Error if threads leak. https://review.openstack.org/459482 | 15:54 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup gearman clients in tests https://review.openstack.org/459481 | 15:54 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup zookeeper and fake nodepool in nodepool tests https://review.openstack.org/459480 | 15:54 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Join watchdog thread so it doesn't leak in tests https://review.openstack.org/459479 | 15:54 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Properly record timed out builds https://review.openstack.org/459767 | 15:54 |
clarkb | ok that should fix up the watchdog related error | 15:54 |
clarkb | and it fixes a bug I found in the process | 15:54 |
*** Shrews has quit IRC | 15:54 | |
*** Shrews has joined #zuul | 15:54 | |
clarkb | SpamapS: jlk ^ I still have one test worker crash on me locally so I don't run the full suite, but this seems to get me reliably completing the test suite and things are generally happier than they were, so reviews and test driving would be great | 15:55 |
*** Shrews has quit IRC | 15:55 | |
clarkb | now to turn timeout to gentle and see if I can't sort that out | 15:55 |
*** Shrews has joined #zuul | 15:58 | |
Shrews | \o/ got my irc bouncer back | 15:59 |
Shrews | no thx to vexxhost support, though | 15:59 |
jeblair | Shrews: regarding 456685, i just remembered that the random tmp dir name has a security component. it seems like the build uuid might be sufficiently secure, but i think we need to be as careful in creating that as mkdtemp. | 16:06 |
jeblair | Shrews: i left a comment inline. I think what it describes will be sufficient. if we're not comfortable with that, your previous version (which used a prefix and glob) would work. | 16:06 |
jeblair | Shrews: we may want to ask SpamapS to look at this too since he has opinions on this sort of thing. :) | 16:07 |
* SpamapS awake | 16:09 | |
Shrews | jeblair: ack | 16:09 |
SpamapS | I think it's probably unpredictable enough. Unless it's sitting around in a globally readable place for a long time (more than 5 seconds == long) before the dir gets created. | 16:11 |
SpamapS | seems like build UUID's are only internal to zuul and the gearman jobs | 16:12 |
jeblair | SpamapS: it will be exposed by the status json, and if it takes a while for a node to show up, it could sit there for a while. | 16:13 |
clarkb | tests.unit.test_scheduler.TestScheduler.test_needed_changes_enqueue seems to be the problem test based on my grepping of subunit | 16:13 |
clarkb | gentle timeout also doesn't seem strong enough | 16:14 |
jeblair | SpamapS, Shrews: so we may not be able to rely on the element of surprise. given that, is the mkdir sufficient? (it should fail if it already exists) | 16:14 |
SpamapS | jeblair: yeah I think it being a dir name and not a file, we can relax with a mkdir that fails. | 16:14 |
jeblair | cool, yeah that seems like a nice benefit of using a dir | 16:15 |
*** yolanda has quit IRC | 16:15 | |
SpamapS | since the trouble would only come if they symlinked the UUID to a place with files that would match what gets written into a job dir. | 16:15 |
Shrews | jeblair: i suppose still using os.makedirs() on the root (not including the build uuid) is necessary, yeah? | 16:16 |
SpamapS | and then zuul would wreak havoc on that dir. So yeah, just make sure the mkdir raises on existing anything. | 16:16 |
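A minimal sketch of the conclusion here: when the directory name is predictable (the build UUID ends up in the status JSON), creation must fail loudly if anything already exists at that path, so nobody can pre-create or symlink it out from under zuul. Paths and the mode are illustrative.

```python
import os


def make_jobdir(root, build_uuid):
    path = os.path.join(root, build_uuid)
    # os.mkdir() raises (FileExistsError on py3, OSError/EEXIST on py2) if
    # the path already exists -- even as a symlink -- unlike
    # os.makedirs(..., exist_ok=True). mkdtemp() would also randomize the
    # name, but a restrictive mode plus a hard failure on collision covers
    # the symlink-swap concern for a directory with a known name.
    os.mkdir(path, 0o700)
    return path
```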
jeblair | Shrews: i don't think we do that now; i think we expect the root to exist. | 16:16 |
jeblair | Shrews: er, i'm using 'root' to mean the 'root' argument as passed in to jobdir | 16:17 |
Shrews | jeblair: ok. was wondering because all of the following code uses that | 16:17 |
jeblair | Shrews: also, fwiw, JobDir is used in exactly one place, so you can drop the conditionals (all arguments are, in reality, required). | 16:18 |
jeblair | that might make this simpler to reason about | 16:19 |
Shrews | jeblair: it looks like 'root' can actually be None | 16:25 |
Shrews | so should probaby keep that conditional | 16:25 |
jeblair | Shrews: oh, and that uses TMPDIR? sounds right then. | 16:25 |
Shrews | i'll make the other two params really required | 16:27 |
jeblair | clarkb: re 457798 -- how does that leak happen? | 16:28 |
clarkb | jeblair: https://review.openstack.org/#/c/459477/1 is what properly addresses the leak since we never actually stop the driver otherwise | 16:29 |
clarkb | jeblair: so everytime a test creates a new scheduler you get a new apsched thread | 16:30 |
jeblair | clarkb: ok, so i think we may not want 457798 (it has a structural problem i don't see a good way to fix; comment inline) | 16:30 |
clarkb | ya I see that now | 16:31 |
clarkb | jeblair: I'll wait for more reviews on the stack if you are going through it then address them in one go (rather than push a bunch) | 16:31 |
jeblair | kk | 16:32 |
*** yolanda has joined #zuul | 16:32 | |
clarkb | there is still something causing the tests to run longer under testr than testtools.run | 16:34 |
clarkb | test_crd_branch is ~40 seconds vs ~16 seconds | 16:34 |
clarkb | so it's actually quicker for me to do for X in `cat testlist` ; do testtools.run $X ; done than testr run | 16:34 |
jeblair | that's so weird | 16:35 |
clarkb | also more reliable | 16:35 |
clarkb | my hunch is that its processing the subunit stream | 16:36 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for using build UUID as temp job dir name https://review.openstack.org/456685 | 16:37 |
clarkb | however the time difference is only apparent when all the tests are run together and not individually so maybe not | 16:37 |
clarkb | *** Error in `python': free(): invalid pointer: 0x00007fdf880000a8 *** <- that just happened running tox -epy27 -- test_crd_branch | 16:39 |
*** Shrews has quit IRC | 16:39 | |
jeblair | clarkb: 459479 -- it seems like even if that does leak, it won't leak for more than 10 seconds. the fix may cause an extra 0-10 second delay in production after every job completes. is it enough of a problem in the tests to make that worthwhile? | 16:40 |
clarkb | jeblair: the delay should only be up to one second | 16:41 |
clarkb | and it actually makes the job that tests for timeouts quicker | 16:41 |
jeblair | clarkb: i don't think we should increase the watchdog interval in production. we can do it in tests, but in prod, i think we should stick with 10s. we're going to have at least 100 of those threads in production, having them all poll every second will be a considerable amount of context switching and python GIL locking. | 16:42 |
jeblair | or, i guess that should have said 'decrease the watchdog interval'. | 16:42 |
clarkb | the reason to clean it up is so that we can assert the end state of tests more strongly, we could whitelist the watchdog thread when we do that | 16:42 |
clarkb | so change could be to add a name to watchdog thread, then whitelist it in the tests rather than shorten sleep and join | 16:43 |
*** yolanda has quit IRC | 16:44 | |
jeblair | and that amount of thread leakage will be okay you think? | 16:44 |
clarkb | ya especially since that thread will be on a 10 second timeout | 16:45 |
jeblair | wfm then. | 16:45 |
clarkb | I think the costly threads are gearman and kazoo because they go as quickly as possible doing nothing in the background but generating log noise | 16:45 |
clarkb | oh them and apsched | 16:45 |
clarkb | for the same reason | 16:45 |
jeblair | clarkb: i think gearman has a 2s sleep, but yeah, it adds up. | 16:46 |
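A sketch of the compromise clarkb describes, with names of our own invention rather than the real zuul ones: give the long-interval watchdog thread a recognizable name and have the test-suite leak check tolerate it while still failing on anything else.

```python
import threading

WHITELIST = ('ExecutorWatchdog',)     # long-interval threads we tolerate


def start_watchdog(target):
    t = threading.Thread(target=target, name='ExecutorWatchdog')
    t.daemon = True
    t.start()
    return t


def assert_no_leaked_threads(testcase):
    # Called from a test cleanup: anything still alive besides the current
    # thread and the whitelisted watchdogs counts as a leak.
    leaked = [t for t in threading.enumerate()
              if t is not threading.current_thread()
              and not any(t.name.startswith(w) for w in WHITELIST)]
    testcase.assertEqual([], leaked, 'leaked threads: %s' % leaked)
```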
jlk | o/ | 16:50 |
jeblair | clarkb: stack review complete | 16:51 |
clarkb | jeblair: this stack appears to have made my local runs of CRD tests not error on timeout waiting for zuul to settle. So ya I think it does add up. Unfortunately it hasn't corrected the tests are generally slower in testr problem :/ | 16:52 |
*** Shrews has joined #zuul | 16:57 | |
SpamapS | clarkb: I'm at home with sick baby today, so I'll take a look when she goes down for a nap. | 16:57 |
SpamapS | "at home" vs. "working at home" ;-) | 16:58 |
*** hashar_ has joined #zuul | 17:01 | |
*** hashar_ has quit IRC | 17:01 | |
jlk | So what's with the "queue" key in a project definition? | 17:13 |
jlk | and queue-name in a job definition? | 17:13 |
jeblair | jlk: queue-name should go away (it's zuulv2 only) | 17:14 |
jeblair | jlk: queue in a project assigns that project to a shared change queue in a dependent pipeline. | 17:15 |
jlk | jeblair: is that how multiple unrelated projects can use the same "dependent" manager pipeline, but each have their own queues? | 17:16 |
jlk | projects A, B, C all use one queue, project D uses a different one, and thus is not dependent on jobs of A, B, C? | 17:16 |
jeblair | jlk: exactly. | 17:16 |
jlk | cool, I've had that question bouncing in my head recently | 17:16 |
jeblair | jlk: in zuulv2, combining them was automatic (based on job name). but in v3, we don't assume anything about job names, so you have to manually group projects together like that. | 17:17 |
jlk | ah | 17:17 |
jlk | oh yeah that is bad for us on v2, where everything shares a single job name, the job where we run whatever "run.sh" is in the project repo | 17:17 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove queue-name from job schema https://review.openstack.org/459792 | 17:18 |
jeblair | jlk: yeah. it was designed under the assumption of job names like "gate-nova-python27". | 17:18 |
jlk | makes sense | 17:18 |
jlk | so only a tiny bit related, I'm working on the patch that adds dependent queue support to github | 17:19 |
jlk | and in the tests I'm failing, because the call in manager/dependent.py getChangeQueue() is calling out to self.pipeline.getQueue(change.project), which then looks at the self.queues, and tries to see if "project" is in queue.projects | 17:20 |
jlk | and while it looks like it, the comparison is failing for some reason | 17:20 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Error if threads leak. https://review.openstack.org/459482 | 17:20 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup gearman clients in tests https://review.openstack.org/459481 | 17:20 |
jlk | so I'm trying to figure out where equivalence of project objects is determined | 17:20 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup zookeeper and fake nodepool in nodepool tests https://review.openstack.org/459480 | 17:20 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Properly record timed out builds https://review.openstack.org/459767 | 17:20 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Whitelist watchdog threads in test thread leak checker https://review.openstack.org/459479 | 17:20 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use cStringIO if available. https://review.openstack.org/459478 | 17:21 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use current class name on super call https://review.openstack.org/459409 | 17:21 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Add ability to stop drivers https://review.openstack.org/459477 | 17:21 |
clarkb | jeblair: ^ I think that addresses your concerns | 17:21 |
jlk | I only have one queue in self.queues | 17:21 |
clarkb | now to abandon the unneeded apsched change | 17:21 |
jlk | and there is only one project in that queue | 17:21 |
jlk | and there is only one project in that queue: <Project org/project> | 17:22 |
jlk | which LOOKS like it should match the project in question, <Project org/project> | 17:22 |
jlk | (Pdb) project == queue.projects[0] | 17:22 |
jlk | False | 17:22 |
jlk | but it doesn't | 17:22 |
jeblair | jlk: i believe we rely on object identity for equilavence. Sources are responsible for supplying Project objects and they are expected to cache them (probably on their driver), so that they always return the same object for a given project. | 17:22 |
jlk | hrm. | 17:23 |
jlk | I'll go hunting in that direction, thanks. | 17:24 |
jeblair | (it's quite possible our data model has evolved to the point where we no longer need to do that and could implement a real __eq__ method (now that we can fully qualify project names with hostnames) but let's defer changing that for later) | 17:24 |
jeblair | jlk: given that, temporarily throwing an id(self) in the __repr__ for project might help track that down too. | 17:26 |
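An illustrative sketch of the pitfall being debugged (the class and cache are stand-ins, not zuul's model code): without an __eq__, Python compares objects by identity, so two Project objects that print identically still compare unequal unless every lookup hands back the one cached instance; putting id(self) in __repr__ makes the difference visible in pdb.

```python
class Project(object):
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        # Including id(self) exposes whether two "equal looking" projects
        # are really the same object.
        return '<Project %s at %#x>' % (self.name, id(self))


_cache = {}


def get_project(name):
    # Returning the cached instance is what makes identity comparison work.
    if name not in _cache:
        _cache[name] = Project(name)
    return _cache[name]


print(Project('org/project') == Project('org/project'))          # False
print(get_project('org/project') == get_project('org/project'))  # True
```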
clarkb | tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue is actually the test that appears to timeout under full testr run. It takes ~55 seconds alone either with testr or testtools. So confused | 17:27 |
jeblair | clarkb: 15s for me locally | 17:27 |
jeblair | clarkb: and about 20-25s in the gate | 17:27 |
jeblair | (alone) | 17:28 |
clarkb | jeblair: do you see the same time difference between tox -epy27 run tests and testtools.run or similar? | 17:28 |
jeblair | clarkb: nope | 17:28 |
jlk | yeah the IDs are definitely different | 17:28 |
jeblair | (i checked that yesterday) | 17:28 |
jlk | unfortunately how the github code is handling projects basically mirrors how gerrit does. | 17:28 |
jlk | unless... | 17:28 |
jlk | nope, doesn't look like tests/base.py overrides that | 17:29 |
jeblair | clarkb: 'nodepool list|grep jeblair' if you want to putz around on the 4 test nodes i held. ~root/zuul already exists checked out at feature/zuulv3 with a .tox/py27 | 17:29 |
clarkb | jeblair: thanks | 17:30 |
jeblair | jlk: yeah, looks like the projects are cached on the connection object. | 17:33 |
jlk | oh this is interesting | 17:33 |
jlk | I think the source object is not being cached | 17:34 |
jlk | so some calls to self.source.getProject are hitting a _different_ source object. | 17:34 |
jeblair | jlk: yeah, i think that's okay though, since the different sources should have the same connection object. | 17:35 |
jlk | hrm. | 17:35 |
jlk | then it's the connection object that's different | 17:35 |
jlk | either way this trip through pdb I'm seeing an empty list of projects even though the first trip through some projects were added | 17:35 |
jlk | self.connections.getSourceByHostname | 17:36 |
jlk | I think that failed to find the right source for me | 17:36 |
jeblair | jlk: which test are you running? | 17:36 |
jlk | it's a new one in this patch. It's running jobs through a pipeline with a dependent manager | 17:37 |
jlk | oooooh wait. | 17:37 |
jlk | what happens if you have two sources that point to the same hostname? | 17:37 |
jeblair | jlk: yeah, i've got it checked out | 17:37 |
* jlk has a hunch | 17:38 | |
jeblair | jlk: two connections? bad mojo. | 17:38 |
jlk | okay I think I know where this is fubar then | 17:38 |
jeblair | jlk: i think there's a right way to do that though, lemme look it up | 17:38 |
jlk | https://review.openstack.org/#/c/444034/7/tests/fixtures/zuul-github-driver.conf | 17:39 |
jlk | I think this is what did it, in an earlier test | 17:39 |
jlk | er earlier patch | 17:39 |
jeblair | jlk: hrm. it should actually work -- tests/fixtures/zuul-connections-same-gerrit.conf tests that case | 17:40 |
jlk | oh, darn. | 17:40 |
jeblair | but it's possible there's a bug that's slipping by that test case but not yours | 17:40 |
jlk | well I'm going to modify that file and see if things change. | 17:41 |
jlk | sure did | 17:42 |
jeblair | jlk: generally speaking, the idea is that even if we have two connections for a given hostname, in the tenant config file (main.yaml), we associate each project with a specific connection. so in your case, if you say "source: github:" in main.yaml, the following projects are associated with that specific connection (vs "source: github_ssh") | 17:42 |
clarkb | woo if I increase the test timeout I get successful tox run on tip of my stack | 17:43 |
clarkb | tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue 141.804 so I'm about 20s past the default timeout | 17:43 |
jeblair | jlk: however, i think the call to getSourceByHostname in scheduler.process_event_queue is wrong; it fails to take that into account. | 17:44 |
clarkb | next slowest is 98 seconds | 17:44 |
jlk | jeblair: yeah I was about to trace around that call | 17:44 |
jeblair | jlk: my guess is that the gerrit test happens to have the sources in the right order to mask that, but the github test doesn't. | 17:44 |
jeblair | jlk: i have an idea; gimme a few to hack it up. | 17:47 |
jlk | okay | 17:48 |
jlk | I'll keep poking too | 17:48 |
clarkb | jeblair: http://paste.openstack.org/show/607901/ I'm seeing the testr full run slowdown on one of your held nodes | 17:51 |
clarkb | gonna tweak the --concurrency to see if that affects it too (eg less concurrency == slower?) | 17:51 |
jlk | jeblair: I think the logic in line 108 of zuul/lib/connections.py is suspect. | 17:52 |
jlk | It'll wind up replacing sources['hostname'] with whatever the last read connection to the same host is | 17:53 |
jlk | so the sources list will be incomplete | 17:53 |
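A stripped-down illustration of the bug jlk is pointing at (not the actual connections.py code): indexing sources by hostname alone means a second connection to the same host silently overwrites the first, so a hostname lookup can return a source that never registered your projects.

```python
connections = {
    'github':     {'host': 'github.com', 'auth': 'app'},
    'github_ssh': {'host': 'github.com', 'auth': 'ssh'},
}

sources_by_hostname = {}
for name, conn in connections.items():
    # Last connection read wins; whichever of 'github' / 'github_ssh' comes
    # second clobbers the first, and iteration order is not guaranteed here.
    sources_by_hostname[conn['host']] = name

print(sources_by_hostname)    # {'github.com': <only one of the two>}
```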
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: remove getSourceByHostname https://review.openstack.org/459821 | 17:58 |
jeblair | jlk: yes, i think that's right. ^ may be the solution. | 17:58 |
jeblair | clarkb: yeah, that test is pretty busy so i expect it's especially affected by concurrency. | 18:00 |
clarkb | jeblair: what I am curious to see is if less concurrency makes it slower or faster. Faster is what you'd expect because that means more cpu time/less contention | 18:00 |
clarkb | jeblair: but I'm thinking maybe it's slower because that means a single runner is running more jobs that are leaking something, causing it to be slow? | 18:01 |
clarkb | jeblair: http://paste.openstack.org/show/607905/ it is indeed slower the fewer cpus we use | 18:07 |
clarkb | and that scaling factor is very similar to mine compared to running isolated | 18:07 |
*** isaacb has joined #zuul | 18:08 | |
jeblair | clarkb: interesting... though it could also be unfortunate test ordering, yeah? | 18:08 |
clarkb | jeblair: I don't think so. each process is running fairly isolated. The only shared resource is zk but you'd expect slower times with more concurrency if it was getting hit hard | 18:09 |
clarkb | each process runs one test at a time, regardless of order I would expect similar test runtimes | 18:10 |
jeblair | clarkb: io is shared, which is quite a lot of what these are doing | 18:10 |
clarkb | unless it is a resource leak early in test run that negatively affects subsequent tests | 18:10 |
clarkb | jeblair: right but you'd expect less concurrency to be faster if that was the issue not slower | 18:10 |
clarkb | by reducing concurrency we are reducing total cpu and io load on the system but the tests run slower | 18:11 |
jeblair | clarkb: yes, i mostly agree, just holding out the thought that in a pathological case, with concurrency=8 we could be up against cpu bound tests and concurrency=4 up against io bound tests. that's unlikely, but possible. | 18:12 |
clarkb | jeblair: I'm going to run it with concurrency=1 | 18:13 |
clarkb | which may rule out that possibility | 18:13 |
jlk | hey look at that, all my tests are actually passing now :/ | 18:21 |
clarkb | jeblair: --concurrency=1 on that held node just hit timeout waiting for zuul to settle | 18:36 |
clarkb | so I think there must be something else going on in here that causes consolidated test processes to perform poorly | 18:36 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: remove getSourceByHostname https://review.openstack.org/459821 | 18:52 |
jeblair | jlk: ^ passes tests now. | 18:52 |
jlk | woo | 18:52 |
clarkb | tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue 100.753 <- under timeout. I killed logging | 18:56 |
clarkb | I think our logging stuff must be leaking somehow? | 18:56 |
jeblair | clarkb: that's still 30s longer than run alone, right? | 19:06 |
clarkb | no sorry thats local to me where it took ~55s | 19:07 |
clarkb | so that's in line with my local isolated run | 19:07 |
clarkb | I'm testing a change that I think may help, iteration time is so long though | 19:07 |
clarkb | (while keeping our logging) | 19:07 |
clarkb | er derp I see what you are saying | 19:11 |
clarkb | I don't know how I math'd that wrong | 19:11 |
*** yolanda has joined #zuul | 19:38 | |
*** hashar has quit IRC | 19:55 | |
jlk | GREAT NEWS EVERYBODY | 20:00 |
jlk | er, GOOD NEWS whatever. | 20:00 |
jlk | Github branch protection now has an option where new pushes to a PR will dismiss any Reviews on the PR! | 20:00 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup logging handlers https://review.openstack.org/459849 | 20:00 |
clarkb | jeblair: ^ thats a massive improvement in test runtime but still slower than isolated | 20:00 |
pabelanger | jlk: success? | 20:01 |
clarkb | SpamapS: jlk ^ that change is the one you'll want to grab for local testing | 20:02 |
jlk | kk | 20:02 |
clarkb | I need to grab lunch now, but very interested if ^ helps anyone else running tests locally | 20:04 |
*** jkilpatr has quit IRC | 20:10 | |
clarkb | jeblair: http://logs.openstack.org/49/459849/1/check/gate-zuul-python27-ubuntu-xenial/b9421e1/console.html http://logs.openstack.org/77/459477/2/check/gate-zuul-python27-ubuntu-xenial/e93f61e/console.html comparison seems to confirm this change works as expected in the gate too | 20:45 |
*** jkilpatr has joined #zuul | 20:47 | |
clarkb | based on ^ it would be excellent to get reviews on https://review.openstack.org/#/q/topic:test-stability jeblair has reviewed most of them and it should make everyone's dev process happier | 21:19 |
jeblair | clarkb: nice, those even both ran on internap | 21:23 |
clarkb | jeblair: yup so should be a good comparison. Ran in half the time | 21:27 |
jeblair | clarkb: +2 from me | 21:28 |
clarkb | I'm still not sure where the other 40-50s is going locally | 21:28 |
clarkb | but guessing its another leak in the same sort of genre | 21:28 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove queue-name from job schema https://review.openstack.org/459792 | 21:30 |
*** isaacb has quit IRC | 21:36 | |
SpamapS | oh wow.. github has added the ability to dismiss PR reviews on push to the pr source. | 21:48 |
SpamapS | I guess they did get enough angry fist shakes | 21:48 |
clarkb | jeblair: ok I just finished a successful --concurrency=1 run locally. And the individual tests had runtimes closer to what I would expect. So I think the remaining 40-50s is contention with shared resources like disk IO and zk | 22:03 |
clarkb | jeblair: I'm going to treat this as "solved" for now as a result of ^ those results | 22:03 |
jeblair | clarkb: w00t, thanks! | 22:03 |
clarkb | tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue 61.805 | 22:03 |
clarkb | Ran 236 tests in 1614.315s (+1008.612s) wall time is much much longer but individual tests run quicker | 22:03 |
jesusaur | clarkb++ continues to be awesome | 22:08 |
jlk | woo | 22:21 |
jlk | Reviews on https://review.openstack.org/459821 would help me a lot in my progress :) | 22:21 |
jeblair | clarkb, mordred: i think you can help jlk there ^ | 22:22 |
Shrews | clarkb: question for you on 459480 | 22:23 |
jeblair | jlk, mordred, SpamapS: i have reviewed the first 4 github-v3 changes. i think 439834 and 443947 open a set of related questions on the subject of driver->zuul event manipulation which would be good to come to consensus on. | 22:25 |
jlk | thanks! I saw those hit my inbox, and I briefly went over them. I'll probably have replies on them tomorrow morning | 22:25 |
clarkb | Shrews: responded | 22:26 |
clarkb | Shrews: that's actually something I think we likely want to get better at: constructing things such that the cleanups are in place before they have a chance to break | 22:26 |
jeblair | clarkb: cleanups are all in individual try/except clauses, right? | 22:27 |
clarkb | jeblair: yes, one cleanup failing doesn't prevent the next from running | 22:27 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add ability to stop drivers https://review.openstack.org/459477 | 22:27 |
clarkb | jeblair: unless you sys.exit or something | 22:27 |
jeblair | clarkb: so yeah, i agree, that's probably the safer method and we may want to move many of our cleanups up a line. | 22:28 |
Shrews | clarkb: k. iirc, the disconnect() checks to see if it's actually connected first anyway (assuming it was copied directly from nodepool's zk.py) | 22:28 |
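A sketch of the ordering clarkb and Shrews are agreeing on, with stand-in class names rather than the real test fixtures: register the cleanup as soon as the object exists (before connecting), which is safe precisely because a disconnect-style cleanup checks whether it ever connected.

```python
import unittest


class FakeZK(object):
    connected = False

    def connect(self):
        self.connected = True

    def disconnect(self):
        if self.connected:        # harmless no-op if connect() never ran
            self.connected = False


class SomeTest(unittest.TestCase):
    def setUp(self):
        super(SomeTest, self).setUp()
        self.zk = FakeZK()
        self.addCleanup(self.zk.disconnect)   # registered *before* connect()
        self.zk.connect()                     # if this raises, cleanup still runs
```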
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove queue-name from job schema https://review.openstack.org/459792 | 22:30 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Use current class name on super call https://review.openstack.org/459409 | 22:32 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Use cStringIO if available. https://review.openstack.org/459478 | 22:32 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Whitelist watchdog threads in test thread leak checker https://review.openstack.org/459479 | 22:32 |
*** harlowja has quit IRC | 22:34 | |
clarkb | jeblair: jlk make sure I get this right, basically you can have review.o.o/foo and review.o.o/bar so sources[review.o.o] isn't a singleton? | 22:34 |
jeblair | clarkb: your example looks like two projects over the same connection which is more straightforward than the problem -- the problem is actually two *connections* to the same server. | 22:35 |
clarkb | when would you have two connections to the same server? | 22:36 |
jeblair | clarkb: it's actually the original impetus for multi-connection support, though we've never used it: report changes to gerrit under a different username. | 22:36 |
clarkb | aha | 22:36 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Cleanup zookeeper and fake nodepool in nodepool tests https://review.openstack.org/459480 | 22:38 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Cleanup gearman clients in tests https://review.openstack.org/459481 | 22:38 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Properly record timed out builds https://review.openstack.org/459767 | 22:40 |
jeblair | clarkb: so the only way to really know which connection to use for a project is via the tenant config file, where project names are definitively mapped to connections. that process ensures that we don't end up with two identically named projects (with that index, it's okay to have two connections with the same hostname, as long as they don't also have projects with the same name) | 22:41 |
jeblair | (attempting to do so is a configuration error, so we can rely on that) | 22:41 |
mordred | jeblair: I believe this matches what we talked about that one time | 22:46 |
clarkb | jeblair: ok I am confused about the comment at line 498 https://review.openstack.org/#/c/459821/2/zuul/scheduler.py. What is the corresponding live item and why is it N items behind the current item. Is it valid to have live -> not live -> not live -> live -> not live ? | 22:48 |
clarkb | jeblair: at least with v2 I thought that once you are not live everything behind is not live? | 22:49 |
jeblair | clarkb: behind and ahead may be confusing here. take the gnocchi change near the top of check right now | 22:50 |
jeblair | clarkb: that's a bunch of non-live items *ahead of* a single live item. | 22:51 |
clarkb | ah | 22:51 |
clarkb | I'm thinking of the gate case with windows (because familiar with that code) | 22:51 |
jeblair | (which is the only configuration non-live shows up in, at the moment) | 22:51 |
clarkb | thanks | 22:51 |
jeblair | clarkb: ah, that's 'active' | 22:51 |
jeblair | (rather than 'live') | 22:51 |
clarkb | gotcha | 22:51 |
jeblair | honestly, foreign projects and multi-connections need a bit of a rethink, because at some point we're going to want to say "this zuul patch Depends-On this ansible pull request". but we have no way of saying that at the moment, so that code makes things continue to work the way they do now with minimal fuss. | 22:53 |
clarkb | where you won't merge until the other has merged, but won't actually get that pull request in your git repo? | 22:53 |
jeblair | clarkb: you will get that pull request in the executor-staged copy of the ansible repo | 22:54 |
jeblair | (so the executor will stage both your change and the ansible pull request in the jobdir src directory) | 22:54 |
clarkb | but the PR would not be live in that case beacuse we aren't voting back on it? | 22:55 |
clarkb | we'd vote on zuul | 22:55 |
jeblair | clarkb: correct, it'd still be non-live | 22:55 |
clarkb | ok that makes a lot more sense now | 22:55 |
jeblair | clarkb: i mention this because the code i just wrote will break in that circumstance. so it'll have to change. but it's not possible to get into that state now, so i think it's good enough. | 22:56 |
jeblair | (because it assumes that all changes for foreign projects were pulled in via a depends-on from the same source/connection, which obviously won't be the case when you can do cross-connection-dependencies) | 22:57 |
clarkb | Shrews: just two more changes to review on topic:test-stability | 22:58 |
clarkb | Shrews: the last one is the best one too | 22:58 |
jeblair | also, holy cow i think it's time for a glossary. "item", "foreign", "live", "active", "cross-repo-dependency", "cross-connection-dependency" are all zuul-specific terms of art used in the last 5 minutes with very specific meanings. :) | 22:59 |
clarkb | indeed | 22:59 |
Shrews | clarkb: my pizza arrived. Basic math tells us that pizza > reviews | 23:01 |
jeblair | mordred: should i take "jeblair: I believe this matches what we talked about that one time" as a +2 on 459821? | 23:01 |
clarkb | Shrews: except that the last change will make your tests run twice as fast. So you can eat more pizza | 23:01 |
clarkb | :P | 23:01 |
clarkb | enjoy dinner | 23:01 |
Shrews | Will look after fooding | 23:02 |
mordred | jeblair: I was about to leave the actual +2 | 23:03 |
mordred | jeblair: also - yes, I think it's time for a glossary :) | 23:03 |
jeblair | mordred: ok cool :) | 23:03 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: remove getSourceByHostname https://review.openstack.org/459821 | 23:14 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Error if threads leak. https://review.openstack.org/459482 | 23:19 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Cleanup logging handlers https://review.openstack.org/459849 | 23:19 |
*** openstack has joined #zuul | 23:31 | |
mordred | clarkb: well, eatmydata does make sync() a no-op. I'm not sure if tmpfs does any more work than "return 0" on a sync or not? | 23:38 |