*** jamielennox is now known as jamielennox|away | 01:33 | |
*** jamielennox|away is now known as jamielennox | 01:44 | |
*** yolanda has joined #zuul | 03:54 | |
*** yolanda has quit IRC | 05:23 | |
*** isaacb has joined #zuul | 06:26 | |
*** hashar has joined #zuul | 06:59 | |
*** bhavik1 has joined #zuul | 07:00 | |
*** yolanda has joined #zuul | 07:29 | |
*** yolanda has quit IRC | 08:20 | |
*** yolanda has joined #zuul | 08:39 | |
*** yolanda has quit IRC | 08:54 | |
*** yolanda has joined #zuul | 09:11 | |
*** yolanda has quit IRC | 09:39 | |
*** jkilpatr has quit IRC | 10:40 | |
*** jkilpatr has joined #zuul | 11:02 | |
*** yolanda has joined #zuul | 11:16 | |
*** Shrews has joined #zuul | 11:26 | |
*** bhavik1 has quit IRC | 11:40 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for using build UUID as temp job dir name https://review.openstack.org/456685 | 13:16 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for specifying root job directory https://review.openstack.org/456691 | 13:16 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for using build UUID as temp job dir name https://review.openstack.org/456685 | 13:34 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for specifying root job directory https://review.openstack.org/456691 | 13:34 |
*** yolanda has quit IRC | 13:36 | |
*** bhavik1 has joined #zuul | 13:58 | |
*** bhavik1 has quit IRC | 14:05 | |
*** yolanda has joined #zuul | 15:03 | |
*** yolanda has quit IRC | 15:10 | |
*** yolanda has joined #zuul | 15:11 | |
pabelanger | mordred: speaking of packaging, just bringing the zuul package for fedora rawhide online today: https://bugzilla.redhat.com/show_bug.cgi?id=1220451 | 15:20 |
openstack | bugzilla.redhat.com bug 1220451 in Package Review "Review Request: zuul - Trunk gating system developed for the OpenStack Project" [Medium,Post] - Assigned to karlthered | 15:20 |
*** yolanda has quit IRC | 15:23 | |
*** yolanda has joined #zuul | 15:32 | |
*** hashar is now known as hasharAway | 15:37 | |
jlk | Good morning you zuuligans! | 15:54 |
*** isaacb has quit IRC | 15:59 | |
clarkb | I finally have zuul test suite running under a hand crafted all natural free range python install made with gcc7 | 16:09 |
clarkb | curious if this will show any difference | 16:09 |
*** isaacb has joined #zuul | 16:11 | |
clarkb | Exception: Timeout waiting for Zuul to settle <- so free range python didn't help that. Now to see if it at least completes running in a reasonable amount of time | 16:21 |
clarkb | SpamapS: jeblair was thinking about this more and could it be that git itself is slower in newer versions in how zuul uses it? | 16:22 |
clarkb | I'm going to attempt putting yappi in place on a per test basis | 16:28 |
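(For readers following along: wiring yappi in on a per-test basis might look roughly like the sketch below. The mixin name, output path, and pstat format are assumptions for illustration; clarkb's actual change is not shown in this log.)

    import os
    import yappi

    class YappiProfileMixin(object):
        """Hypothetical mixin: profile each test and dump stats afterwards."""

        def setUp(self):
            super(YappiProfileMixin, self).setUp()
            yappi.clear_stats()
            yappi.start()
            self.addCleanup(self._save_profile)

        def _save_profile(self):
            yappi.stop()
            # Save pstat-format output per test so it survives stdout capture.
            path = os.path.join('/tmp', '%s.prof' % self.id())
            yappi.get_func_stats().save(path, type='pstat')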
jlk | What time (UTC) is the meeting today? | 16:38 |
clarkb | 2200 iirc | 16:38 |
SpamapS | clarkb: no I think it's python. | 16:49 |
SpamapS | clarkb: free range python.. did you enable all the optimizations? | 16:50 |
jlk | oh good I'll be able to make that. | 16:50 |
jeblair | i have iad and osic trusty and xenial nodes held; i'll kick off some tests there | 16:54 |
clarkb | SpamapS: ya I enabled lto and pgo | 16:55 |
clarkb | currently running into the problem of not being able to get stdout out of the test suite even when setting OS_STDOUT_CAPTURE=0 | 16:56 |
jeblair | fwiw, osic installed the tox deps in 1m6s +/-2s; rax-iad in 1m36s +/- 1s. | 16:59 |
clarkb | ok, if I print in the test cases I see that, but not in BaseTestCase.setUp(); even sys.stdout.write isn't working | 17:05 |
clarkb | hrm, I wonder if we are even calling that setUp(); pdb doesn't seem to break there | 17:07 |
SpamapS | clarkb: threads make that tricky | 17:08 |
*** isaacb has quit IRC | 17:08 | |
SpamapS | or are you pdb.set_trace()'ing in it? should break there | 17:08 |
clarkb | I figured it out, bug in test suite I think. Gonna grab HEAD and double check and push fix up if still there | 17:09 |
pabelanger | clarkb: jeblair: mordred: care to add https://review.openstack.org/#/c/452494/ to your review queue? Fetch console log if SSH fails. | 17:10 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use current class name on super call https://review.openstack.org/459409 | 17:12 |
clarkb | SpamapS: ^ is fix | 17:12 |
clarkb | setUp was just never running on the fast test I was iterating on | 17:12 |
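(The bug clarkb hit is the classic python2 super() pitfall: naming the wrong class in a super() call skips the intermediate setUp() methods in the MRO. A minimal illustration - the class names here are stand-ins, not the actual diff in 459409:)

    import testtools

    class BaseTestCase(testtools.TestCase):
        def setUp(self):
            super(BaseTestCase, self).setUp()
            self.base_ready = True        # common fixtures

    class ZuulTestCase(BaseTestCase):
        def setUp(self):
            super(ZuulTestCase, self).setUp()
            self.zuul_ready = True        # heavier fixtures

    class TestScheduler(ZuulTestCase):
        def setUp(self):
            # Bug: naming BaseTestCase jumps past it in the MRO, so neither
            # ZuulTestCase.setUp() nor BaseTestCase.setUp() ever runs.
            super(BaseTestCase, self).setUp()
            # Fix: always name the current class:
            # super(TestScheduler, self).setUp()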
jeblair | pabelanger: is that a port of a change to another branch? | 17:16 |
pabelanger | jeblair: yes, I decided just to add it to feature/zuulv3 to avoid recent nodepool.yaml changes | 17:17 |
clarkb | woo I have more infos. If I run a test that fails for me, tests.unit.test_scheduler.TestScheduler.test_crd_undefined_project, under testr alone it takes 55 seconds. If I run it under testtools, 12 seconds | 17:17 |
jeblair | pabelanger: can you mention/link to the previous change? | 17:17 |
pabelanger | jeblair: sure, 1 moment | 17:18 |
clarkb | and I haven't managed to get profiling working yet with testr runs to see what is eating all that time (I can't print out the results as stdout seems to be eaten) | 17:18 |
jeblair | clarkb: what does it mean to "run under testtools"? | 17:19 |
clarkb | jeblair: python -m testtools.run instead of testr run | 17:19 |
jeblair | ah ok | 17:19 |
jeblair | clarkb: testr does the discover thing, testtools.run doesn't, right? | 17:19 |
jeblair | clarkb: (that was a big motivation for mordred writing ttrun) | 17:20 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool feature/zuulv3: Fetch server console log if ssh connection fails https://review.openstack.org/452494 | 17:20 |
clarkb | jeblair: yes but discover is really fast | 17:20 |
clarkb | I can time it independently in a sec | 17:20 |
clarkb | subunit.run is ~14 seconds | 17:21 |
clarkb | discover is 1 second wall clock | 17:21 |
jeblair | clarkb: was your 55 seconds the time that the "testr" command took, or the time that testr reported for the test? | 17:24 |
clarkb | jeblair: the time that testr reported for the test | 17:24 |
jeblair | oh. yeah. i agree those should be the same. :) | 17:24 |
clarkb | let me rerun under time just to make sure there is no funny business in that recording | 17:24 |
SpamapS | ooooooooooooooooooh my | 17:25 |
SpamapS | yeah that's a bit nuts | 17:25 |
clarkb | wow also seems quite variable. last run was 27 seconds | 17:29 |
clarkb | (with testr, testtools seems pretty consistently 12-14 seconds) | 17:29 |
jeblair | i'm not seeing that locally with behind_dequeue | 17:30 |
* clarkb tries with that test | 17:32 | |
jeblair | also, running behind_dequeue on trusty/xenial nodepool nodes produced fairly consistent low-to-mid 20s times | 17:32 |
clarkb | jeblair: using testr or testtools? | 17:32 |
jeblair | clarkb: testr ("tox -e py27 ...") | 17:33 |
jeblair | clarkb: i ran crd_undefined_project locally and get 3.3s both ways | 17:33 |
clarkb | ZUUL_TEST_ROOT=/home/clark/tmp/tmpfs time testr run tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue produces: 103.92user 18.40system 1:15.73elapsed 161%CPU (0avgtext+0avgdata 278300maxresident)k 0inputs+496outputs (0major+4407299minor)pagefaults 0swaps Ran 1 tests in 69.140s (+48.013s) | 17:34 |
clarkb | now to try with testtools.run | 17:34 |
clarkb | Ran 1 test in 60.213s 86.46user 16.58system 1:00.92elapsed 169%CPU (0avgtext+0avgdata 270040maxresident)k 0inputs+0outputs (0major+4220868minor)pagefaults 0swaps | 17:35 |
clarkb | so that test seems a lot more consistent | 17:36 |
clarkb | jeblair: fwiw the crd tests are the ones that always break locally for me | 17:36 |
clarkb | ..py:531 RecordingAnsibleJob.execute 44 0.004697 10.05743 0.228578 is biggest cost in dequeue job according to yappi | 17:38 |
clarkb | with testr run we are ingesting the subunit streams and writing data out to the "db" files | 17:38 |
clarkb | possibly that is what is making it slower in some cases depending on the stream contents? | 17:38 |
clarkb | after that git is the highest cost | 17:39 |
clarkb | now if I could just get the testr runs to print out their yappi infos :/ | 17:40 |
jeblair | clarkb: yeah. the new behavior is that we only save log files from failing tests, because they are quite large. you mention your test is failing, so that should be happening (though your env variables may alter that) | 17:42 |
clarkb | jeblair: well it only fails when run as a whole; it's actually passing when I run it alone (so possibly an inter-test interaction?) | 17:44 |
clarkb | but I have new data. Set *_CAPTURE=0 in .testr.conf and the crd test takes a minute. Set them to 1 (the default) and it takes half a minute (still twice the time of a testtools.run), implying the cost is in parsing the output stream, and that is somehow affected by not capturing things - possibly because it's bubbling all the way up to testr and it has to work harder? | 17:45 |
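(For reference, an OpenStack-style .testr.conf usually looks roughly like the following; this is a generic sketch rather than zuul's exact file, but it shows where the *_CAPTURE defaults being toggled above live:)

    [DEFAULT]
    test_command=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
                 OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
                 OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \
                 ${PYTHON:-python} -m subunit.run discover -t ./ tests $LISTOPT $IDOPTION
    test_id_option=--load-list $IDFILE
    test_list_option=--list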
*** yolanda has quit IRC | 17:48 | |
clarkb | if I run testr run --subunit foo.test | subunit-2to1 then I see the output. But I don't see it in the .testrepository files. So testr is ingesting a bunch of information then discarding it? | 17:50 |
clarkb | jeblair: is gdbm present on the xenial and trusty nodes? (that's another potential difference here: testr won't have a gdbm to use on my system, apparently) | 17:52 |
* clarkb installs it | 17:52 |
jeblair | -bash: gdbm: command not found | 17:53 |
jeblair | on both | 17:53 |
clarkb | jeblair: python interpreter import gdbm? | 17:53 |
jeblair | oh yeah that's there | 17:53 |
clarkb | having it makes the warning go away but doesn't affect the test runtime | 17:55 |
clarkb | still ~30s for that crd test with *_CAPTURE=1 | 17:55 |
clarkb | http://paste.openstack.org/show/607721/ | 17:59 |
clarkb | also, comparing function calls, logging is more expensive, so maybe that is the cost there? | 18:11 |
clarkb | oh wow so that time increase was purely due to having stdout written that wasn't part of the subunit stream | 18:22 |
clarkb | (my yappi stuff was not getting monkey patched stdout, if I fixed that the run time is in line with testtools.run) | 18:23 |
clarkb | actually it's not working for capturing yet either. Arg. But good to know that having "raw" stdout over the wire makes things really unhappy with testr | 18:28 |
Shrews | Can I just say that Vexxhost's first response to my "instance down" issue being to ask if it is still down doesn't instill much confidence in their support methods. | 18:40 |
clarkb | woo, finally have yappi working the way I want it and test run times are not crazy... and then I discover some of these tests run long enough to have the counters roll over | 18:44 |
clarkb | yes, negative runtimes, that's gotta be correct. Zuul can time travel you know | 18:45 |
jeblair | clarkb: if zuulv3 can time travel backwards that will be great for our release timetable | 19:07 |
clarkb | fwiw I have the suite running now and trying to get a failure, none yet | 19:08 |
clarkb | changes are yappi profiling and use cStringIO | 19:08 |
jeblair | clarkb: i wanted to use cstringio, but that's apparently gone in python3. | 19:22 |
clarkb | jeblair: ya I've got it conditionally importing | 19:24 |
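(The conditional import clarkb describes is presumably along these lines - a minimal sketch:)

    try:
        # python2: the C implementation is much faster than StringIO.StringIO
        from cStringIO import StringIO
    except ImportError:
        # python3: cStringIO is gone, io.StringIO is the replacement
        from io import StringIO

    buf = StringIO()
    buf.write('captured output\n')
    assert buf.getvalue() == 'captured output\n'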
*** hasharAway is now known as hashar | 19:25 | |
SpamapS | oh we got that time travel feature done? | 19:26 |
SpamapS | cause.. I have a few ideas for use cases.. | 19:26 |
clarkb | oh it just failed. But the data is useless because it's all rolling over :( | 19:27 |
* clarkb tries to get test run alone to fail | 19:27 | |
clarkb | something must be leaking | 19:28 |
clarkb | 11seconds when run alone | 19:30 |
clarkb | 1 minute 8 seconds when run with everything else (but with concurrency=1 so sequentially) | 19:30 |
clarkb | SpamapS: ^ | 19:31 |
clarkb | SpamapS: do you see that behavior too? | 19:31 |
SpamapS | clarkb: which test? | 19:36 |
SpamapS | $ ttrun -e py27 tests.unit.test_scheduler.TestScheduler.test_crd_check_transitive | 19:37 |
SpamapS | takes ~21s consistently | 19:37 |
SpamapS | worth noting I'm on Ice9c2fd320a4d581f0f71cbacc61f7ac58183c23 sha=070ee7e979dbb9b488493984aeddb866da3884ba | 19:38 |
clarkb | tests.unit.test_scheduler.TestScheduler.test_crd_check_ignore_dependencies | 19:38 |
clarkb | SpamapS: right, I am comparing the runtime of tests.unit.test_scheduler.TestScheduler.test_crd_check_ignore_dependencies run on its own to the runtime of that test within a full testr run | 19:39 |
clarkb | running it on its own seems consistent | 19:39 |
SpamapS | 8.5s for that one | 19:41 |
SpamapS | 7.5s sometimes | 19:42 |
SpamapS | tests.unit.test_scheduler.TestScheduler.test_crd_check_ignore_dependencies 5.401 | 19:42 |
SpamapS | faster under testr | 19:42 |
clarkb | SpamapS: a full testr run? | 19:43 |
SpamapS | no that's just alone | 19:43 |
clarkb | right its consistent and fast alone | 19:43 |
SpamapS | hm can I tell testr to show me last_run-1 ? | 19:43 |
clarkb | SpamapS: testr last | 19:43 |
SpamapS | no that's last_run | 19:44 |
clarkb | oh for that i just vim the file in .testrepository directly, you can also testr load it then testr last | 19:44 |
SpamapS | ahh | 19:46 |
SpamapS | tests.unit.test_scheduler.TestScheduler.test_crd_check_ignore_dependencies 5.401 | 19:47 |
SpamapS | clarkb: that's actually from the last real run | 19:47 |
SpamapS | same as alone | 19:47 |
SpamapS | to the thousandth | 19:47 |
* SpamapS is suspicious | 19:48 | |
SpamapS | seems like there's some funny business going on | 19:48 |
clarkb | Exception: Timeout waiting for Zuul to settle is the exception which makes me think that maybe we aren't settling deterministically for some reason | 19:48 |
clarkb | alone is fine and runs in 11 seconds. Not alone it takes more than a minute, then hits ^ | 19:48 |
clarkb | so I think profiling not necessary (at least not for this), but glad I got to this point | 19:52 |
clarkb | (the other angle I am investigating is maybe there is rogue data in the subunit stream which is causing testr to be unhappy. Have it running dumping subunit to stdout and into file to check. Then will also do a full run with testtools and see if it exhibits the same behavior of longer tests over time) | 19:56 |
*** harlowja has quit IRC | 20:03 | |
*** jamielennox is now known as jamielennox|away | 20:13 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Add a finger protocol log streamer https://review.openstack.org/456721 | 20:14 |
Shrews | jeblair: that ^^^ caused me much headache | 20:14 |
Shrews | hopefully it's treading the right path now | 20:15 |
jeblair | Shrews: cool! | 20:17 |
Shrews | i also fully expect tests to fail b/c of the single port thing | 20:17 |
mordred | Shrews: its correctness is directly proportional to how much your head hurts | 20:17 |
jeblair | Shrews: single port thing? | 20:18 |
Shrews | jeblair: not sure if more than 1 executor is started for tests, but if so, they'll get port conflicts for the finger port | 20:18 |
Shrews | don't know enough about tests though | 20:19 |
*** jamielennox|away is now known as jamielennox | 20:20 | |
jeblair | Shrews: we don't start an executor using the cmd, we do it directly. so there shouldn't be anything starting a log streamer until you add it to the tests. that's why i asked you to split it out from the startup code. | 20:20 |
Shrews | ah. then yay | 20:20 |
jeblair | Shrews: ya. when we add it, we'll want to have it start on port 0 in the tests to pick a random port, since there can be multiple tests which start it running, and we also don't want to require root in the tests. | 20:21 |
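(A minimal sketch of the port-0 trick jeblair describes: bind to port 0 so the kernel assigns a free ephemeral port, then read it back - no root required, and concurrent tests don't collide:)

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(('127.0.0.1', 0))      # port 0 = "pick any free port"
    sock.listen(5)
    host, port = sock.getsockname()  # the port actually assigned
    print('finger streamer listening on port %d' % port)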
mordred | jeblair: it's worth noting that this approach does tie us to a 1:1 relationship between executors and hosts - which I believe we're fine with - but noting it because I believe there had been musing in the past about the viability of running more than one executor on a given machine | 20:23 |
mordred | jeblair: I'm quite in favor of it as an approach and don't think the limitation is an issue - that said, it might make things fun if people attempt to run an executor farm in something like k8s | 20:24 |
jeblair | mordred: indeed. i think the 1:1 aspect is the main reason to do this. i may not fully recall the musings you're recalling -- is there a potential advantage to having more than one on a host? | 20:24 |
mordred | jeblair: not to my knowledge, no - unless there were some thread-scaling issue that would be better served by multiple processes - although in the infra case I'd imagine if that were the case we'd just spawn more smaller VMs | 20:25 |
jeblair | mordred: if you were using multiple executors in a k8s environment, would you be able to run a single log streamer across them? | 20:26 |
mordred | jeblair: _maybe_? I mean, we haven't really designed for that at all - but with the 1:1 relationship it implies that each executor needs a routable IP | 20:27 |
mordred | I believe one could fairly easily "expose" a service on each executor to make that happen in k8s land | 20:27 |
mordred | but down the road we might wind up wanting/needing a single finger multiplexer that could hit each of the executor fingers on the non-routable network ... this is HIGHLY hand-wavey - trying to put my brain in the space of the people who like to put everything on private addresses and expose a bare minimum of public ips | 20:29 |
jeblair | mordred: yes, we even want something similar for the websocket multiplexer | 20:29 |
mordred | (this is all "I think this is the right approach and will scale well for us. I don't think it will prevent a k8s scale-out of executors - but we might have some enablement work to do in the future to make that nicer for some of those users") | 20:30 |
mordred | jeblair: ++ | 20:30 |
mordred | jeblair: luckily finger already includes in the protocol the ability to chain hosts | 20:30 |
mordred | jeblair: so it's almost like the RFC authors back when the internet worked planned this for us! | 20:30 |
jeblair | mordred: what i'm trying to figure out is whether the 1:1 or 1:N choice has an impact on that. i guess i don't understand enough about k8s to see how you would create a scenario where multiple executors write log files to a space that a single log streamer reads from. | 20:30 |
fungi | sorry all, i'm entertaining inlaws this evening and will likely be out at dinner during the zuul meeting, but will check out the log when i get back | 20:32 |
mordred | jeblair: yah - the more I think about it the more I think the 1:1 choice is beneficial for that - and in a k8s land with less public IPs than desired executors having a multiplexor will be good | 20:33 |
mordred | and since it's a 1:1 with the executors, we're talking about an easy to understand "service talks to other service over network" | 20:34 |
mordred | and _not_ "this one daemon reads the files these other 10 daemons are writing - oh wait, they're all in their own containers" | 20:34 |
jeblair | mordred: ok. i definitely think this is worth thinking about; we don't want to back ourselves into a corner and there are certainly compromises in this direction. a good argument could easily push us one way or the other at this point. this approach has 'ease of deployment' going for it right now. | 20:36 |
Shrews | fwiw, if we make the decision similar to "we need to rewrite this to be a highly distributed, highly redundant, dockerized application", imma gonna fight somebody | 20:38 |
mordred | Shrews: hell no. BUT - I don't want to do something that prevents someone who does want to run it in k8s or mesos from doing so | 20:39 |
mordred | jeblair: I think it gets us a one-executor==one-network-listener, which I think folks grok fairly well in k8s land ("microservices") - so you'd wind up with N containers each running an executor/finger-listener | 20:39 |
mordred | which is good | 20:39 |
Shrews | mordred: lol, i know. was j/k :) | 20:39 |
mordred | Shrews: :) | 20:39 |
mordred | Shrews: I dunno - you COULD want to fight someone anyway | 20:39 |
mordred | jeblair: and an executor is never going to work in a container that doesn't support fork - because ansible | 20:40 |
jeblair | Shrews: it already is a highly distributed application, and becoming moreso, especially thanks to your work. it only has one SPOF left, and in my mind, zuulv4 will get rid of that. :) | 20:40 |
jeblair | mordred: yeah, that makes sense. | 20:40 |
Shrews | all hail zuulv4 | 20:40 |
mordred | so we're never going to hit the "purist" place of only ever having a single executable in each container with the fork scaleout model actually being at the k8s layer | 20:41 |
clarkb | SpamapS: ok, after lunch and looking at my log from the testr run: that job is trying to connect to a bunch of gearman servers that it can't connect to | 20:41 |
clarkb | SpamapS: but the general idea is to run in isolation and compare logs | 20:41 |
mordred | (I mean, we could if we wanted to go 100% down the road of doing that by having the executor make k8s api calls instead of subprocess calls - but that's a whole GIANT other thing | 20:41 |
mordred | and I do not think it buys us, well, anything | 20:41 |
jeblair | mordred: yeah, i feel that that is significant enough of a change to warrant revisiting this anyway. i mean that's not an implementation choice you can easily make today. that's something worth considering later, and when we do, "how to get the logs off" will be a good question to answer at the same time. | 20:42 |
jeblair | mordred: i know that's something SpamapS and others were interested in when we were discussing the executor-security thing. | 20:43 |
jeblair | clarkb: if there are multiple test failures in your run, it's quite possible those gearman connects are from threads left over from previous tests which did not shut down correctly. they should be mostly harmless, other than taking up some resources. | 20:44 |
mordred | jeblair: ++ | 20:45 |
clarkb | jeblair: they show up in the first test that failed. So I don't think it's just a dirty test env from fails | 20:46 |
jeblair | clarkb: hrm. i've been considering having us save a subunit attachment with the gearman port number of every test, even the successful ones, so we can trace those back to the test that launched them. | 20:47 |
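(What jeblair describes maps onto testtools' detail mechanism; a rough sketch, where FakeGearmanServer stands in for whatever the test harness actually provides:)

    from testtools import content

    class ZuulTestCase(BaseTestCase):
        def setUp(self):
            super(ZuulTestCase, self).setUp()
            self.gearman_server = FakeGearmanServer()  # assumed test helper
            # Attach the port to the subunit stream for every test, pass or
            # fail, so leaked gearman clients can be traced back to their test.
            self.addDetail('gearman-port',
                           content.text_content(str(self.gearman_server.port)))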
clarkb | I now have log files up; gonna work on seeing where they start to diverge time-wise | 20:50 |
clarkb | have confirmed that the single test run does not have a bunch of servers it's trying to connect to | 20:50 |
mordred | jeblair: re: significant enough change - I totally agree. and I mostly want to make sure we don't prevent anyone from going deep on running what we have now in k8s so that we can get a bunch of feedback/knowledge of what the ups and downs are... so that if we decide that zuul v5 will just darned-well depend on k8s, we're doing it not because of new-shiny but because we've actually been able to learn about | 20:51 |
mordred | real/tangible benefits | 20:51 |
jeblair | mordred: yep | 20:51 |
clarkb | it's actually really early on, hrm. The time from test start to the gearman server starting is 6 seconds in the slow failed test and one second in the test run alone | 20:52 |
clarkb | so maybe it is just leaked resources vying for cpu time | 20:52 |
clarkb | and the reason gate jobs are happy is by splitting into more cpus you have processes that can better time share the cpu? | 20:52 |
clarkb | anyways will keep digging | 20:52 |
jeblair | clarkb: are you saying those may not be leaked threads, but are just the actual threads for that test but geard hasn't started yet? | 20:53 |
clarkb | jeblair: no, pretty sure they are leaked. I'm looking at the python logging time sequence and the time taken to go from test start to gearman server starting for that test is different | 20:54 |
clarkb | and that happens really early on | 20:54 |
*** harlowja has joined #zuul | 20:59 | |
SpamapS | jeblair: mordred jlk I will likely miss today's Zuul meeting. Sick child has to be taken to doctor. | 21:01 |
jlk | :( | 21:02 |
SpamapS | pink eye... always pink eye.. :-/ | 21:02 |
clarkb | jeblair: there is also a lot more kazoo activity | 21:02 |
*** yolanda has joined #zuul | 21:02 | |
jamielennox | mordred, jeblair: having just found out about the finger thing, and it basically replacing my little websocket app, why are we sending the uuid to the executor? isn't the executor only running one job? | 21:13 |
mordred | jamielennox: oh - it's not replacing the websocket app - we also need a websocket app | 21:14 |
jamielennox | i'm reading scrollback and we are in a position of limited public ips, and i'm not even sure executors would be public | 21:14 |
mordred | jamielennox: nod, so you might be the first user who needs the multiplexor then :) | 21:15 |
mordred | jeblair: ^^ look how quickly we go from thought to actual usecases | 21:16 |
jamielennox | so i think what we would need there is some way to give zuul a build UUID and have it resolve that to a (private) executor ip | 21:21 |
jamielennox | or otherwise just expose the private IPs to a user which is what my thing is doing now | 21:21 |
jamielennox | but i'm also interested in the decision to limit to 1 executor per VM; we don't have a decision on this, but most executors are just going to be running ansible (network bound) and will register themselves on a job queue | 21:24 |
jeblair | wait there's already a websocket app? | 21:24 |
jamielennox | jeblair: it was something i POCed for our zuul 2.5 | 21:24 |
jamielennox | but i can show you | 21:24 |
jeblair | jamielennox: which framework did you use? | 21:25 |
jamielennox | jeblair: https://github.com/BonnyCI/logging-proxy | 21:25 |
jamielennox | jeblair: nodejs :( | 21:25 |
jamielennox | in its defense, when pumping out to a browser it's probably the right choice | 21:25 |
mordred | jamielennox: so - the fun thing about multi-plexing is that it's built in to the finger spec - so if you say "finger XXX@zuul.openstack.org@private-executor01.openstack.org" | 21:26 |
mordred | which basically means we just have to do a little more plumbing | 21:26 |
mordred | gah | 21:26 |
mordred | I didn't finish my sentence | 21:26 |
jeblair | mordred: while that's neat, i don't think that's what we should actually do. i don't think we need to conform to the finger spec here. :) | 21:26 |
mordred | if you say "finger XXX@zuul.openstack.org@private-executor01.openstack.org" - that tells zuul.o.o to finger private-executor01.openstack.org and pass the results back | 21:26 |
jamielennox | mordred: that's really cool from a tech perspective and something you could build into a client side zuul-cli, but probably not something many people would use | 21:27 |
jeblair | i think our process should be to build up layers to get to the point where we have a user-friendly system. here's how i think we should proceed: | 21:27 |
clarkb | ok, running testr run with OS_LOG_CAPTURE set to 0 so that I can see the logs: we are leaking threads | 21:27 |
clarkb | so best guess is that we are eventually getting to the point where those threads are harmful? | 21:28 |
mordred | clarkb: if so, it would make sense that it would be more harmful on machines with less cores | 21:28 |
jeblair | 1) implement a console log streamer on each test node (done) | 21:28 |
clarkb | mordred: and slower cores | 21:28 |
jeblair | 2) implement a daemon which multiplexes the console log from each test node in a job along with the ansible log on the executor (in progress) | 21:29 |
clarkb | a good chunk of them are apscheduler threads, I will pull down my change and use it in my testing too | 21:29 |
clarkb | <Thread(Thread-158, started daemon 140017141806848)> a lot of those too | 21:29 |
jeblair | 3) implement a *central* daemon which can connect to the correct executor and ship that log stream over websockets | 21:30 |
jeblair | it's not much of a stretch to have that daemon *also* implement finger as well as websockets | 21:31 |
mordred | jeblair: ++ yes, I agree | 21:31 |
mordred | (turns out finger is much easier to implement than websockets) | 21:31 |
jeblair | that way we end up with "ws://zuul.example.com/build/..." and "finger build@zuul.example.com" as easy front-ends so users don't have to see executors. | 21:32 |
jeblair | i anticipate that we would use the executor finger url as the build url for now, and then once we have the central daemons, replace that | 21:32 |
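(To make the "finger build@zuul.example.com" front-end concrete: a finger query is just a one-line TCP exchange, so a client is tiny. A minimal sketch, with zuul.example.com and the build id as placeholders; the central daemon would look up which executor holds the build and proxy its stream back:)

    import socket

    def finger(query, host, port=79):
        """Send a finger (RFC 1288) query and return the streamed response."""
        sock = socket.create_connection((host, port))
        try:
            sock.sendall((query + '\r\n').encode('utf-8'))
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
            return b''.join(chunks).decode('utf-8', 'replace')
        finally:
            sock.close()

    # print(finger('BUILD_UUID', 'zuul.example.com'))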
jeblair | jamielennox: does that address your needs? | 21:33 |
jamielennox | jeblair: yea, that will work - one way or another we'd be overriding the build url for viewing anyway, so as long as we have a central place to hit to find information for BUILD_UUID we're good | 21:35 |
mordred | jeblair: the central daemon should also be pretty easy to scale-out and be a little fleet of log streaming daemons if needed too, yeah? | 21:36 |
jeblair | jamielennox: well, the end goal is to have the build url point to a js page which uses websockets to stream the logs, so hopefully you won't need to overwrite it | 21:36 |
jeblair | mordred: yes, i don't see a reason it can't be a scalable component (if you put it behind a load balancer) | 21:36 |
jamielennox | mordred: yea, there shouldn't be any state in that central component | 21:37 |
jeblair | mordred, jamielennox: i think we brainstormed either having the central daemon perform a http/json query to the zuul status endpoint or a direct gearman function call to the scheduler to find out which executor to talk to. other than that, it's just shifting network traffic. | 21:38 |
jeblair | jamielennox: so, aside from that, are there other concerns about 1 executor per machine? | 21:39 |
clarkb | I think I may have found that kazoo leak | 21:39 |
jamielennox | jeblair: not real concerns, i don't see any reason we couldn't run a number of small vms with executors as opposed to cgroup/docker/something isolated on the one machine | 21:41 |
*** hashar has quit IRC | 21:41 | |
jamielennox | we haven't really discussed it, it was just at first pass of the idea it seemed that log streaming is a strange thing to be imposing that limitation | 21:42 |
jamielennox | NFS ftw? | 21:43 |
jeblair | jamielennox: once we have a central daemon, it will make much more sense to put the executor finger daemon on an alternate port, then you could run multiple executor-streamer pairs. *however*, i really think you'd need a sizable machine to get to the point where we have more python thread contention than ansible process overhead. | 21:45 |
jamielennox | jeblair: yea, there are ways we could get around it later if required | 21:46 |
jamielennox | quick question though - what's the log streamer on the test node? is that taking syslog etc? | 21:47 |
jlk | jeblair: for today's meeting, can we do the github subject early? I have to pick up my kid from school, leaving about 10~15 minutes after the meeting starts. | 21:47 |
mordred | jamielennox: it's hacked in to the ansible command/shell module | 21:48 |
jamielennox | mordred: but won't that be coming from the executor now? | 21:48 |
mordred | jamielennox: it's actually the exact same mechanism that provides the telnet streaming in the 2.5 codebase | 21:48 |
mordred | jamielennox: the executor has to get the data somehow | 21:48 |
*** jkilpatr has quit IRC | 21:48 | |
jeblair | jamielennox: it streams stdout of commands. i want to enhance it later to also fetch syslog, etc. (and similarly plumb that through the entire stack too) | 21:48 |
jeblair | jlk: sure thing | 21:48 |
jlk | <3 | 21:49 |
mordred | jamielennox: so currently the callback plugin on the executor makes a connection to the streamer on the test node and pulls back whatever the test node is streaming out on its telnet stream, then writes it to the ansible log file on the executor | 21:49 |
jamielennox | mordred: oh - ok, i wasn't aware that was how that worked | 21:50 |
jamielennox | makes sense | 21:50 |
jeblair | oh it's meeting time in #openstack-meeting-alt | 22:00 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Add a finger protocol log streamer https://review.openstack.org/456721 | 22:21 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Add a finger protocol log streamer https://review.openstack.org/456721 | 22:24 |
*** jkilpatr has joined #zuul | 22:28 | |
jeblair | jamielennox: the plan for the websocket proxy in zuul is to use the autobahn framework with the asyncio module in python3. we won't be able to run tests for that until we convert zuul to python3 (which is blocked on gear for python3). but we can get started on the code without integrating it into tests for now. | 22:50 |
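(A bare-bones autobahn/asyncio skeleton of the kind jeblair mentions; the protocol class and its echo behaviour are placeholders, not the planned zuul implementation:)

    import asyncio

    from autobahn.asyncio.websocket import (WebSocketServerFactory,
                                            WebSocketServerProtocol)

    class BuildLogProtocol(WebSocketServerProtocol):
        def onMessage(self, payload, isBinary):
            # The real proxy would resolve the requested build UUID to an
            # executor and relay its log stream; here we just echo.
            self.sendMessage(payload, isBinary)

    factory = WebSocketServerFactory()
    factory.protocol = BuildLogProtocol

    loop = asyncio.get_event_loop()
    server = loop.run_until_complete(loop.create_server(factory, '0.0.0.0', 9000))
    loop.run_forever()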
mordred | (we do run our pep8 tests under python3, so python3 syntax checking at least happens) | 22:53 |
mordred | (this is important since asyncio in python3 uses new python3 syntax that does not exist in python2 and thus pep8 has sads) | 22:54 |
jeblair | ya | 22:54 |
mordred | very not working patch: https://review.openstack.org/#/c/320563 | 22:55 |
clarkb | the helpful yaml exception parsing we do for bad configs is really confusing when you've broken the config validation :) (easy enough to add a raise in the context manager though) | 22:56 |
jeblair | mordred: oh hey i didn't know that was so far along. :) | 22:58 |
jeblair | jamielennox: ^ | 22:58 |
jeblair | mordred: do you think it's worth porting to v3 and provisionally merging it with the understanding we'll run tests on it later? | 22:58 |
mordred | jeblair: sure - I mean, it doesn't have the extra info plumbed in to it - but I could get that ball rolling probably | 23:00 |
mordred | jeblair: oh - actually, the thing it doesn't have is gear for py3 | 23:01 |
jeblair | mordred: ah, right, so that's a harder dep on the gear py3 thing. | 23:02 |
mordred | jeblair: although that part could be rewritten as a GET on the status.json | 23:02 |
jeblair | true; that could be a good interim thing. | 23:02 |
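(The interim lookup could be as simple as fetching status.json and walking it for the build. The nesting below - pipelines, change queues, heads, items, jobs - is an assumption about the status format and may differ between zuul versions:)

    import json
    import urllib.request

    def find_executor_for_build(zuul_url, build_uuid):
        with urllib.request.urlopen('%s/status.json' % zuul_url) as resp:
            status = json.loads(resp.read().decode('utf-8'))
        for pipeline in status.get('pipelines', []):
            for queue in pipeline.get('change_queues', []):
                for head in queue.get('heads', []):
                    for item in head:
                        for job in item.get('jobs', []):
                            if job.get('uuid') == build_uuid:
                                return job.get('worker', {}).get('hostname')
        return None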
jeblair | i don't think py3gear will take much time once we have the brainspace for it, but until then, if folks would like to move forward on that, there is a path. | 23:03 |
jeblair | jamielennox, SpamapS: ^ | 23:03 |
jeblair | (and if it's not urgent, we can wait for py3gear) | 23:03 |
clarkb | jeblair: one of the problems with leaking is that the timer driver has a stop() which is never called. That means we leak the apsched threads. I tried addressing this by implementing a shim connection for timer but that implies zuul.conf config that will be checked against the layout.yaml and that fails | 23:04 |
jeblair | clarkb: sounds like we need driver shutdown? | 23:04 |
clarkb | ya, or separate the shutdown sequence from the magic that sets up valid config | 23:05 |
jeblair | clarkb: i think explicit driver shutdown is better anyway, since connection shutdown doesn't necessarily imply driver shutdown | 23:05 |
jeblair | clarkb: the drivers are loaded regardless of configuration | 23:06 |
clarkb | ya, that's just a lot more work :) today the driver manipulation appears to go through connections | 23:06 |
clarkb | also it looks like currently stop() is implied for both | 23:07 |
clarkb | e.g. we don't stop only one gerrit connection, we stop all the gerrit connections. So maybe this is easier and I can just add a "for driver in drivers: if hasattr(driver, 'stop'): driver.stop()" loop there | 23:07 |
jeblair | clarkb: yeah, i think so. | 23:10 |
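(Roughly the shape clarkb sketches; where exactly it would live is an assumption here - a connection registry with a drivers dict is used for illustration:)

    class ConnectionRegistry(object):
        # ... existing driver/connection loading ...

        def stop(self):
            # Drivers are loaded regardless of configuration, so stop any
            # driver that knows how - this catches the timer driver's
            # apscheduler threads even when no connection references it.
            for name, driver in self.drivers.items():
                if hasattr(driver, 'stop'):
                    driver.stop()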
jlk | Can v3 jobs have their own failure-message any more? | 23:15 |
SpamapS | I lost the thread on py3gear a bit.. | 23:15 |
jlk | oh n/m | 23:16 |
SpamapS | I believe we had settled on assuming utf8 and adding a binary set of classes for py3 users who want to do non-utf8 function names and payloads. | 23:16 |
SpamapS | (with the understanding that a gear user migrating to python3 in this particular position would likely see oddness) | 23:17 |
jeblair | SpamapS: me too, i'd need to spend a few minutes refreshing my memory. but that sounds reasonable. | 23:18 |
SpamapS | I think I wrote that patch | 23:21 |
SpamapS | https://review.openstack.org/398560 | 23:21 |
SpamapS | I also started on a py3 branch of v3 | 23:22 |
jeblair | SpamapS: ya, looks like the next patch is where we left off | 23:23 |
jeblair | https://review.openstack.org/393544 | 23:23 |
SpamapS | jeblair: I think we may have actually decided to abandon 393544 as it was a more heavy-for-py3k-migrations approach. | 23:24 |
* SpamapS forgot to write that down | 23:24 | |
SpamapS | 398560 handles name automagically.. and payload I think works if all you do is json decode/encode | 23:25 |
SpamapS | ah no, payload ends up coming out as bytes in py3 | 23:27 |
SpamapS | which won't json.loads without a decode | 23:27 |
SpamapS | but that works the same py2/py3, so it isn't too painful to wrap your json.loads calls in it | 23:27 |
jeblair | SpamapS: i think we *don't* want to require wrapping encodes around everything -- the idea was that TextJob lets you avoid that. so i think we still want the idea of 393544, just a simpler implementation. | 23:39 |
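(The wrapper SpamapS describes is essentially the following; gear's Job.arguments attribute is real, the helper name is made up for the sketch:)

    import json

    def load_job_args(job):
        # Under python3 gear returns the payload as bytes, which older
        # json.loads implementations won't accept; under python2 the same
        # .decode('utf-8') just yields unicode, so one wrapper serves both.
        return json.loads(job.arguments.decode('utf-8'))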