*** jamielennox is now known as jamielennox|away | 01:33 | |
*** jamielennox|away is now known as jamielennox | 01:44 | |
*** yolanda has joined #zuul | 03:54 | |
*** yolanda has quit IRC | 05:23 | |
*** isaacb has joined #zuul | 06:26 | |
*** hashar has joined #zuul | 06:59 | |
*** bhavik1 has joined #zuul | 07:00 | |
*** yolanda has joined #zuul | 07:29 | |
*** yolanda has quit IRC | 08:20 | |
*** yolanda has joined #zuul | 08:39 | |
*** yolanda has quit IRC | 08:54 | |
*** yolanda has joined #zuul | 09:11 | |
*** yolanda has quit IRC | 09:39 | |
*** jkilpatr has quit IRC | 10:40 | |
*** jkilpatr has joined #zuul | 11:02 | |
*** yolanda has joined #zuul | 11:16 | |
*** Shrews has joined #zuul | 11:26 | |
*** bhavik1 has quit IRC | 11:40 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for using build UUID as temp job dir name https://review.openstack.org/456685 | 13:16 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for specifying root job directory https://review.openstack.org/456691 | 13:16 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for using build UUID as temp job dir name https://review.openstack.org/456685 | 13:34 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for specifying root job directory https://review.openstack.org/456691 | 13:34 |
*** yolanda has quit IRC | 13:36 | |
*** bhavik1 has joined #zuul | 13:58 | |
*** bhavik1 has quit IRC | 14:05 | |
*** yolanda has joined #zuul | 15:03 | |
*** yolanda has quit IRC | 15:10 | |
*** yolanda has joined #zuul | 15:11 | |
pabelanger | mordred: speaking of packaging, just bringing the zuul package for fedora rawhide online today: https://bugzilla.redhat.com/show_bug.cgi?id=1220451 | 15:20 |
openstack | bugzilla.redhat.com bug 1220451 in Package Review "Review Request: zuul - Trunk gating system developed for the OpenStack Project" [Medium,Post] - Assigned to karlthered | 15:20 |
*** yolanda has quit IRC | 15:23 | |
*** yolanda has joined #zuul | 15:32 | |
*** hashar is now known as hasharAway | 15:37 | |
jlk | Good morning you zuuligans! | 15:54 |
*** isaacb has quit IRC | 15:59 | |
clarkb | I finally have zuul test suite running under a hand crafted all natural free range python install made with gcc7 | 16:09 |
clarkb | curious if this will show any difference | 16:09 |
*** isaacb has joined #zuul | 16:11 | |
clarkb | Exception: Timeout waiting for Zuul to settle <- so free range python didn't help that. Now to see if it at least completes running in a reasonable amount of time | 16:21 |
clarkb | SpamapS: jeblair was thinking about this more and could it be that git itself is slower in newer versions in how zuul uses it? | 16:22 |
clarkb | I'm going to attempt putting yappi in place on a per test basis | 16:28 |
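(For readers following along: wiring yappi in on a per-test basis might look roughly like the sketch below. The mixin name, output path, and pstat format are assumptions for illustration; clarkb's actual change is not shown in this log.)

    import os
    import yappi

    class YappiProfileMixin(object):
        """Hypothetical mixin: profile each test and dump stats afterwards."""

        def setUp(self):
            super(YappiProfileMixin, self).setUp()
            yappi.clear_stats()
            yappi.start()
            self.addCleanup(self._save_profile)

        def _save_profile(self):
            yappi.stop()
            # Save pstat-format output per test so it survives stdout capture.
            path = os.path.join('/tmp', '%s.prof' % self.id())
            yappi.get_func_stats().save(path, type='pstat')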
jlk | What time (UTC) is the meeting today? | 16:38 |
clarkb | 2200 iirc | 16:38 |
SpamapS | clarkb: no I think it's python. | 16:49 |
SpamapS | clarkb: free range python.. did you enable all the optimizations? | 16:50 |
jlk | oh good I'll be able to make that. | 16:50 |
jeblair | i have iad and osic trusty and xenial nodes held; i'll kick off some tests there | 16:54 |
clarkb | SpamapS: ya I enabled lto and pgo | 16:55 |
clarkb | currently running into the problem of not being able to get stdout out of the test suite even when setting OS_STDOUT_CAPTURE=0 | 16:56 |
jeblair | fwiw, osic installed the tox deps in 1m6s +/-2s; rax-iad in 1m36s +/- 1s. | 16:59 |
clarkb | ok, if I print in the test cases I see that, but not in BaseTestCase.setUp(); even sys.stdout.write isn't working | 17:05 |
clarkb | hrm, I wonder if we are even calling that setUp(); pdb doesn't seem to break there | 17:07 |
SpamapS | clarkb: threads make that tricky | 17:08 |
*** isaacb has quit IRC | 17:08 | |
SpamapS | or are you pdb.set_trace()'ing in it? should break there | 17:08 |
clarkb | I figured it out, bug in test suite I think. Gonna grab HEAD and double check and push fix up if still there | 17:09 |
pabelanger | clarkb: jeblair: mordred: care to add https://review.openstack.org/#/c/452494/ to your review queue? Fetch console log if SSH fails. | 17:10 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use current class name on super call https://review.openstack.org/459409 | 17:12 |
clarkb | SpamapS: ^ is fix | 17:12 |
clarkb | setUp was just never running on the fast test I was iterating on | 17:12 |
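(The bug clarkb hit is the classic python2 super() pitfall: naming the wrong class in a super() call skips the intermediate setUp() methods in the MRO. A minimal illustration - the class names here are stand-ins, not the actual diff in 459409:)

    import testtools

    class BaseTestCase(testtools.TestCase):
        def setUp(self):
            super(BaseTestCase, self).setUp()
            self.base_ready = True        # common fixtures

    class ZuulTestCase(BaseTestCase):
        def setUp(self):
            super(ZuulTestCase, self).setUp()
            self.zuul_ready = True        # heavier fixtures

    class TestScheduler(ZuulTestCase):
        def setUp(self):
            # Bug: naming BaseTestCase jumps past it in the MRO, so neither
            # ZuulTestCase.setUp() nor BaseTestCase.setUp() ever runs.
            super(BaseTestCase, self).setUp()
            # Fix: always name the current class:
            # super(TestScheduler, self).setUp()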
jeblair | pabelanger: is that a port of a change to another branch? | 17:16 |
pabelanger | jeblair: yes, I decided just to add it to feature/zuulv3 to avoid recent nodepool.yaml changes | 17:17 |
clarkb | woo I have more infos. If I run a test that fails for me, tests.unit.test_scheduler.TestScheduler.test_crd_undefined_project, under testr alone it takes 55 seconds. If I run it under testtools, 12 seconds | 17:17 |
jeblair | pabelanger: can you mention/link to the previous change? | 17:17 |
pabelanger | jeblair: sure, 1 moment | 17:18 |
clarkb | and I haven't managed to get profiling working yet with testr runs to see what is eating all that time (I can't print out the results as stdout seems to be eaten) | 17:18 |
jeblair | clarkb: what does it mean to "run under testtools"? | 17:19 |
clarkb | jeblair: python -m testtools.run instead of testr run | 17:19 |
jeblair | ah ok | 17:19 |
jeblair | clarkb: testr does the discover thing, testtools.run doesn't, right? | 17:19 |
jeblair | clarkb: (that was a big motivation for mordred writing ttrun) | 17:20 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool feature/zuulv3: Fetch server console log if ssh connection fails https://review.openstack.org/452494 | 17:20 |
clarkb | jeblair: yes but discover is really fast | 17:20 |
clarkb | I can time it independently in a sec | 17:20 |
clarkb | subunit.run is ~14 seconds | 17:21 |
clarkb | discover is 1 second wall clock | 17:21 |
jeblair | clarkb: was your 55 seconds the time that the "testr" command took, or the time that testr reported for the test? | 17:24 |
clarkb | jeblair: the time that testr reported for the test | 17:24 |
jeblair | oh. yeah. i agree those should be the same. :) | 17:24 |
clarkb | let me rerun under time just to make sure there is no funny business in that recording | 17:24 |
SpamapS | ooooooooooooooooooh my | 17:25 |
SpamapS | yeah that's a bit nuts | 17:25 |
clarkb | wow also seems quite variable. last run was 27 seconds | 17:29 |
clarkb | (with testr, testtools seems pretty consistently 12-14 seconds) | 17:29 |
jeblair | i'm not seeing that locally with behind_dequeue | 17:30 |
* clarkb tries with that test | 17:32 | |
jeblair | also, running behind_dequeue on trusty/xenial nodepool nodes produced fairly consistent low-to-mid 20s times | 17:32 |
clarkb | jeblair: using testr or testtools? | 17:32 |
jeblair | clarkb: testr ("tox -e py27 ...") | 17:33 |
jeblair | clarkb: i ran crd_undefined_project locally and get 3.3s both ways | 17:33 |
clarkb | ZUUL_TEST_ROOT=/home/clark/tmp/tmpfs time testr run tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue produces: 103.92user 18.40system 1:15.73elapsed 161%CPU (0avgtext+0avgdata 278300maxresident)k 0inputs+496outputs (0major+4407299minor)pagefaults 0swaps Ran 1 tests in 69.140s (+48.013s) | 17:34 |
clarkb | now to try with testtools.run | 17:34 |
clarkb | Ran 1 test in 60.213s 86.46user 16.58system 1:00.92elapsed 169%CPU (0avgtext+0avgdata 270040maxresident)k 0inputs+0outputs (0major+4220868minor)pagefaults 0swaps | 17:35 |
clarkb | so that test seems a lot more consistent | 17:36 |
clarkb | jeblair: fwiw the crd tests are the ones that always break locally for me | 17:36 |
clarkb | ..py:531 RecordingAnsibleJob.execute 44 0.004697 10.05743 0.228578 is biggest cost in dequeue job according to yappi | 17:38 |
clarkb | with testr run we are ingesting the subunit streams and writing data out to the "db" files | 17:38 |
clarkb | possibly that is what is making it slower in some cases depending on the stream contents? | 17:38 |
clarkb | after that git is the highest cost | 17:39 |
clarkb | now if I could just get the testr runs to print out their yappi infos :/ | 17:40 |
jeblair | clarkb: yeah. the new behavior is that we only save log files from failing tests, because they are quite large. you mention your test is failing, so that should be happening (though your env variables may alter that) | 17:42 |
clarkb | jeblair: well it only fails when run as a whole; it's actually passing when I run it alone (so possibly an inter-test interaction?) | 17:44 |
clarkb | but I have new data. Set *_CAPTURE=0 in .testr.conf and the crd test takes a minute. Set them to 1 (the default) and it takes half a minute (still twice the time of a testtools.run), implying the cost is in parsing the output stream, and that is somehow affected by not capturing things - possibly because it's bubbling all the way up to testr and it has to work harder? | 17:45 |
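(For reference, an OpenStack-style .testr.conf usually looks roughly like the following; this is a generic sketch rather than zuul's exact file, but it shows where the *_CAPTURE defaults being toggled above live:)

    [DEFAULT]
    test_command=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
                 OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
                 OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \
                 ${PYTHON:-python} -m subunit.run discover -t ./ tests $LISTOPT $IDOPTION
    test_id_option=--load-list $IDFILE
    test_list_option=--list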
*** yolanda has quit IRC | 17:48 | |
clarkb | if I run testr run --subunit foo.test | subunit-2to1 then I see the output. But I don't see it in the .testrepository files. So testr is ingesting a bunch of information then discarding it? | 17:50 |
clarkb | jeblair: is gdbm present on the xenial and trusty nodes? (that's another potential difference here: testr won't have a gdbm to use on my system, apparently) | 17:52 |
* clarkb installs it | 17:52 |
jeblair | -bash: gdbm: command not found | 17:53 |
jeblair | on both | 17:53 |
clarkb | jeblair: python interpreter import gdbm? | 17:53 |
jeblair | oh yeah that's there | 17:53 |
clarkb | having it makes the warning go away but doesn't affect the test runtime | 17:55 |
clarkb | still ~30s for that crd test with *_CAPTURE=1 | 17:55 |
clarkb | http://paste.openstack.org/show/607721/ | 17:59 |
clarkb | also, comparing function calls, logging is more expensive, so maybe that is the cost there? | 18:11 |
clarkb | oh wow so that time increase was purely due to having stdout written that wasn't part of the subunit stream | 18:22 |
clarkb | (my yappi stuff was not getting monkey patched stdout, if I fixed that the run time is in line with testtools.run) | 18:23 |
clarkb | actually it's not working for capturing yet either. Arg. But good to know that having "raw" stdout over the wire makes things really unhappy with testr | 18:28 |
Shrews | Can I just say that Vexxhost's first response to my "instance down" issue being to ask if it is still down doesn't instill much confidence in their support methods. | 18:40 |
clarkb | woo, finally have yappi working the way I want it and test run times are not crazy... and then I discover some of these tests run long enough to have the counters roll over | 18:44 |
clarkb | yes, negative runtimes, that's gotta be correct. Zuul can time travel you know | 18:45 |
jeblair | clarkb: if zuulv3 can time travel backwards that will be great for our release timetable | 19:07 |
clarkb | fwiw I have the suite running now and trying to get a failure, none yet | 19:08 |
clarkb | changes are yappi profiling and use cStringIO | 19:08 |
jeblair | clarkb: i wanted to use cstringio, but that's apparently gone in python3. | 19:22 |
clarkb | jeblair: ya I've got it conditionally importing | 19:24 |
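(The conditional import clarkb describes is presumably along these lines - a minimal sketch:)

    try:
        # python2: the C implementation is much faster than StringIO.StringIO
        from cStringIO import StringIO
    except ImportError:
        # python3: cStringIO is gone, io.StringIO is the replacement
        from io import StringIO

    buf = StringIO()
    buf.write('captured output\n')
    assert buf.getvalue() == 'captured output\n'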
*** hasharAway is now known as hashar | 19:25 | |
SpamapS | oh we got that time travel feature done? | 19:26 |
SpamapS | cause.. I have a few ideas for use cases.. | 19:26 |
clarkb | oh it just failed. But the data is useless because it's all rolling over :( | 19:27 |
* clarkb tries to get test run alone to fail | 19:27 | |
clarkb | something must be leaking | 19:28 |
clarkb | 11seconds when run alone | 19:30 |
clarkb | 1 minute 8 seconds when run with everything else (but with concurrency=1 so sequentially) | 19:30 |
clarkb | SpamapS: ^ | 19:31 |
clarkb | SpamapS: do you see that behavior too? | 19:31 |
SpamapS | clarkb: which test? | 19:36 |
SpamapS | $ ttrun -e py27 tests.unit.test_scheduler.TestScheduler.test_crd_check_transitive | 19:37 |
SpamapS | takes ~21s consistently | 19:37 |
SpamapS | worth noting I'm on Ice9c2fd320a4d581f0f71cbacc61f7ac58183c23 sha=070ee7e979dbb9b488493984aeddb866da3884ba | 19:38 |
clarkb | tests.unit.test_scheduler.TestScheduler.test_crd_check_ignore_dependencies | 19:38 |
clarkb | SpamapS: right, I am comparing the runtime of tests.unit.test_scheduler.TestScheduler.test_crd_check_ignore_dependencies run on its own to the runtime of that test within a full testr run | 19:39 |
clarkb | running it on its own seems consistent | 19:39 |
SpamapS | 8.5s for that one | 19:41 |
SpamapS | 7.5s sometimes | 19:42 |
SpamapS | tests.unit.test_scheduler.TestScheduler.test_crd_check_ignore_dependencies 5.401 | 19:42 |
SpamapS | faster under testr | 19:42 |
clarkb | SpamapS: a full testr run? | 19:43 |
SpamapS | no that's just alone | 19:43 |
clarkb | right its consistent and fast alone | 19:43 |
SpamapS | hm can I tell testr to show me last_run-1 ? | 19:43 |
clarkb | SpamapS: testr last | 19:43 |
SpamapS | no that's last_run | 19:44 |
clarkb | oh for that i just vim the file in .testrepository directly, you can also testr load it then testr last | 19:44 |
SpamapS | ahh | 19:46 |
SpamapS | tests.unit.test_scheduler.TestScheduler.test_crd_check_ignore_dependencies 5.401 | 19:47 |
SpamapS | clarkb: that's actually from the last real run | 19:47 |
SpamapS | same as alone | 19:47 |
SpamapS | to the thousandth | 19:47 |
* SpamapS is suspicious | 19:48 | |
SpamapS | seems like there's some funny business going on | 19:48 |
clarkb | Exception: Timeout waiting for Zuul to settle is the exception which makes me think that maybe we aren't settling deterministically for some reason | 19:48 |
clarkb | alone is fine and runs in 11 seconds. Not alone it takes more than a minute, then hits ^ | 19:48 |
clarkb | so I think profiling not necessary (at least not for this), but glad I got to this point | 19:52 |
clarkb | (the other angle I am investigating is maybe there is rogue data in the subunit stream which is causing testr to be unhappy. Have it running dumping subunit to stdout and into file to check. Then will also do a full run with testtools and see if it exhibits the same behavior of longer tests over time) | 19:56 |
*** harlowja has quit IRC | 20:03 | |
*** jamielennox is now known as jamielennox|away | 20:13 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Add a finger protocol log streamer https://review.openstack.org/456721 | 20:14 |
Shrews | jeblair: that ^^^ caused me much headache | 20:14 |
Shrews | hopefully it's treading the right path now | 20:15 |
jeblair | Shrews: cool! | 20:17 |
Shrews | i also fully expect tests to fail b/c of the single port thing | 20:17 |
mordred | Shrews: its correctness is directly proportional to how much your head hurts | 20:17 |
jeblair | Shrews: single port thing? | 20:18 |
Shrews | jeblair: not sure if more than 1 executor is started for tests, but if so, they'll get port conflicts for the finger port | 20:18 |
Shrews | don't know enough about tests though | 20:19 |
*** jamielennox|away is now known as jamielennox | 20:20 | |
jeblair | Shrews: we don't start an executor using the cmd, we do it directly. so there shouldn't be anything starting a log streamer until you add it to the tests. that's why i asked you to split it out from the startup code. | 20:20 |
Shrews | ah. then yay | 20:20 |
jeblair | Shrews: ya. when we add it, we'll want to have it start on port 0 in the tests to pick a random port, since there can be multiple tests which start it running, and we also don't want to require root in the tests. | 20:21 |
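(A minimal sketch of the port-0 trick jeblair describes: bind to port 0 so the kernel assigns a free ephemeral port, then read it back - no root required, and concurrent tests don't collide:)

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(('127.0.0.1', 0))      # port 0 = "pick any free port"
    sock.listen(5)
    host, port = sock.getsockname()  # the port actually assigned
    print('finger streamer listening on port %d' % port)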
mordred | jeblair: it's worth noting that this approach does tie us to a 1:1 relationship between executors and hosts - which I believe we're fine with - but noting it because I believe there had been musing in the past about the viability of running more than one executor on a given machine | 20:23 |
mordred | jeblair: I'm quite in favor of it as an approach and don't think the limitation is an issue - that said, it might make things fun if people attempt to run an executor farm in something like k8s | 20:24 |
jeblair | mordred: indeed. i think the 1:1 aspect is the main reason to do this. i may not fully recall the musings you're recalling -- is there a potential advantage to having more than one on a host? | 20:24 |
mordred | jeblair: not to my knowledge, no - unless there were some thread-scaling issue that would be better served by multiple processes - although in the infra case I'd imagine if that were the case we'd just spawn more smaller VMs | 20:25 |
jeblair | mordred: if you were using multiple executors in a k8s environment, would you be able to run a single log streamer across them? | 20:26 |
mordred | jeblair: _maybe_? I mean, we haven't really designed for that at all - but with the 1:1 relationship it implies that each executor needs a routable IP | 20:27 |
mordred | I believe one could fairly easily "expose" a service on each executor to make that happen in k8s land | 20:27 |
mordred | but down the road we might wind up wanting/needing a single finger multiplexer that could hit each of the executor fingers on the non-routable network ... this is HIGHLY hand-wavey - trying to put my brain in the space of the people who like to put everything on private addresses and expose a bare minimum of public ips | 20:29 |
jeblair | mordred: yes, we even want something similar for the websocket multiplexer | 20:29 |
mordred | (this is all "I think this is the right approach and will scale well for us. I don't think it will prevent a k8s scale-out of executors - but we might have some enablement work to do in the future to make that nicer for some of those users") | 20:30 |
mordred | jeblair: ++ | 20:30 |
mordred | jeblair: luckily finger already includes in the protocol the ability to chain hosts | 20:30 |
mordred | jeblair: so it's almost like the RFC authors back when the internet worked planned this for us! | 20:30 |
jeblair | mordred: what i'm trying to figure out is whether the 1:1 or 1:N choice has an impact on that. i guess i don't understand enough about k8s to see how you would create a scenario where multiple executors write log files to a space that a single log streamer reads from. | 20:30 |
fungi | sorry all, i'm entertaining inlaws this evening and will likely be out at dinner during the zuul meeting, but will check out the log when i get back | 20:32 |
mordred | jeblair: yah - the more I think about it the more I think the 1:1 choice is beneficial for that - and in a k8s land with less public IPs than desired executors having a multiplexor will be good | 20:33 |
mordred | and since it's a 1:1 with the executors, we're talking about an easy to understand "service talks to other service over network" | 20:34 |
mordred | and _not_ "this one daemon reads the files these other 10 daemons are writing - oh wait, they're all in their own containers" | 20:34 |
jeblair | mordred: ok. i definitely think this is worth thinking about; we don't want to back ourselves into a corner and there are certainly compromises in this direction. a good argument could easily push us one way or the other at this point. this approach has 'ease of deployment' going for it right now. | 20:36 |
Shrews | fwiw, if we make the decision similar to "we need to rewrite this to be a highly distributed, highly redundant, dockerized application", imma gonna fight somebody | 20:38 |
mordred | Shrews: hell no. BUT - I don't want to do something that prevents someone who does want to run it in k8s or mesos from doing so | 20:39 |
mordred | jeblair: I think it gets us a one-executor==one-network-listener, which I think folks grok fairly well in k8s land ("microservices") - so you'd wind up with N containers each running an executor/finger-listener | 20:39 |
mordred | which is good | 20:39 |
Shrews | mordred: lol, i know. was j/k :) | 20:39 |
mordred | Shrews: :) | 20:39 |
mordred | Shrews: I dunno - you COULD want to fight someone anyway | 20:39 |
mordred | jeblair: and an executor is never going to work in a container that doesn't support fork - because ansible | 20:40 |
jeblair | Shrews: it already is a highly distributed application, and becoming moreso, especially thanks to your work. it only has one SPOF left, and in my mind, zuulv4 will get rid of that. :) | 20:40 |
jeblair | mordred: yeah, that makes sense. | 20:40 |
Shrews | all hail zuulv4 | 20:40 |
mordred | so we're never going to hit the "purist" place of only ever having a single executable in each container with the fork scaleout model actually being at the k8s layer | 20:41 |
clarkb | SpamapS: ok, after lunch and looking at my log from the testr run: that job is trying to connect to a bunch of gearman servers that it can't connect to | 20:41 |
clarkb | SpamapS: but the general idea is to run in isolation and compare logs | 20:41 |
mordred | (I mean, we could if we wanted to go 100% down the road of doing that by having the executor make k8s api calls instead of subprocess calls - but that's a whole GIANT other thing | 20:41 |
mordred | and I do not think it buys us, well, anything | 20:41 |
jeblair | mordred: yeah, i feel that that is significant enough of a change to warrant revisiting this anyway. i mean that's not an implementation choice you can easily make today. that's something worth considering later, and when we do, "how to get the logs off" will be a good question to answer at the same time. | 20:42 |
jeblair | mordred: i know that's something SpamapS and others were interested in when we were discussing the executor-security thing. | 20:43 |
jeblair | clarkb: if there are multiple test failures in your run, it's quite possible those gearman connects are from threads left over from previous tests which did not shut down correctly. they should be mostly harmless, other than taking up some resources. | 20:44 |
mordred | jeblair: ++ | 20:45 |
clarkb | jeblair: they show up in the first test that failed. So I don't think it's just a dirty test env from fails | 20:46 |
jeblair | clarkb: hrm. i've been considering having us save a subunit attachment with the gearman port number of every test, even the successful ones, so we can trace those back to the test that launched them. | 20:47 |
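(What jeblair describes maps onto testtools' detail mechanism; a rough sketch, where FakeGearmanServer stands in for whatever the test harness actually provides:)

    from testtools import content

    class ZuulTestCase(BaseTestCase):
        def setUp(self):
            super(ZuulTestCase, self).setUp()
            self.gearman_server = FakeGearmanServer()  # assumed test helper
            # Attach the port to the subunit stream for every test, pass or
            # fail, so leaked gearman clients can be traced back to their test.
            self.addDetail('gearman-port',
                           content.text_content(str(self.gearman_server.port)))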
clarkb | I now have log files up; gonna work on seeing where they start to diverge time-wise | 20:50 |
clarkb | have confirmed that the single test run does not have a bunch of servers it's trying to connect to | 20:50 |
mordred | jeblair: re: significant enough change - I totally agree. and I mostly want to make sure we don't prevent anyone from going deep on running what we have now in k8s so that we can get a bunch of feedback/knowledge of what the ups and downs are... so that if we decide that zuul v5 will just darned-well depend on k8s, we're doing it not because of new-shiny but because we've actually been able to learn about | 20:51 |
mordred | real/tangible benefits | 20:51 |
jeblair | mordred: yep | 20:51 |
clarkb | it's actually really early on, hrm. The time from test start to the gearman server starting is 6 seconds in the slow failed test and one second in the test run alone | 20:52 |
clarkb | so maybe it is just leaked resources vying for cpu time | 20:52 |
clarkb | and the reason gate jobs are happy is by splitting into more cpus you have processes that can better time share the cpu? | 20:52 |
clarkb | anyways will keep digging | 20:52 |
jeblair | clarkb: are you saying those may not be leaked threads, but are just the actual threads for that test but geard hasn't started yet? | 20:53 |
clarkb | jeblair: no, pretty sure they are leaked. I'm looking at the python logging time sequence and the time taken to go from test start to gearman server starting for that test is different | 20:54 |
clarkb | and that happens really early on | 20:54 |
*** harlowja has joined #zuul | 20:59 | |
SpamapS | jeblair: mordred jlk I will likely miss today's Zuul meeting. Sick child has to be taken to doctor. | 21:01 |
jlk | :( | 21:02 |
SpamapS | pink eye... always pink eye.. :-/ | 21:02 |
clarkb | jeblair: there is also a lot more kazoo activity | 21:02 |
*** yolanda has joined #zuul | 21:02 | |
jamielennox | mordred, jeblair: having just found out about the finger thing, and it basically replacing my little websocket app, why are we sending the uuid to the executor? isn't the executor only running one job? | 21:13 |
mordred | jamielennox: oh - it's not replacing the websocket app - we also need a websocket app | 21:14 |
jamielennox | i'm reading scrollback and we are in a position of limited public ips, and i'm not even sure executors would be public | 21:14 |
mordred | jamielennox: nod, so you might be the first user who needs the multiplexor then :) | 21:15 |
mordred | jeblair: ^^ look how quickly we go from thought to actual usecases | 21:16 |
jamielennox | so i think what we would need there is some way to give zuul a build UUID and have it resolve that to a (private) executor ip | 21:21 |
jamielennox | or otherwise just expose the private IPs to a user which is what my thing is doing now | 21:21 |
jamielennox | but i'm also interested in the decision to limit to 1 executor per VM; we don't have a decision on this, but most executors are just going to be running ansible (network bound) and will register themselves on a job queue | 21:24 |
jeblair | wait there's already a websocket app? | 21:24 |
jamielennox | jeblair: it was something i POCed for our zuul 2.5 | 21:24 |
jamielennox | but i can show you | 21:24 |
jeblair | jamielennox: which framework did you use? | 21:25 |
jamielennox | jeblair: https://github.com/BonnyCI/logging-proxy | 21:25 |
jamielennox | jeblair: nodejs :( | 21:25 |
jamielennox | in its defense, when pumping out to a browser it's probably the right choice | 21:25 |
mordred | jamielennox: so - the fun thing about multi-plexing is that it's built in to the finger spec - so if you say "finger XXX@zuul.openstack.org@private-executor01.openstack.org" | 21:26 |
mordred | which basically means we just have to do a little more plumbing | 21:26 |
mordred | gah | 21:26 |
mordred | I didn't finish my sentence | 21:26 |
jeblair | mordred: while that's neat, i don't think that's what we should actually do. i don't think we need to conform to the finger spec here. :) | 21:26 |
mordred | if you say "finger XXX@zuul.openstack.org@private-executor01.openstack.org" - that tells zuul.o.o to finger private-executor01.openstack.org and pass the results back | 21:26 |
jamielennox | mordred: that's really cool from a tech perspective and something you could build into a client side zuul-cli, but probably not something many people would use | 21:27 |
jeblair | i think our process should be to build up layers to get to the point where we have a user-friendly system. here's how i think we should proceed: | 21:27 |
clarkb | ok, running testr run with OS_LOG_CAPTURE set to 0 so that I can see the logs: we are leaking threads | 21:27 |
clarkb | so best guess is that we are eventually getting to the point where those threads are harmful? | 21:28 |
mordred | clarkb: if so, it would make sense that it would be more harmful on machines with less cores | 21:28 |
jeblair | 1) implement a console log streamer on each test node (done) | 21:28 |
clarkb | mordred: and slower cores | 21:28 |
jeblair | 2) implement a daemon which multiplexes the console log from each test node in a job along with the ansible log on the executor (in progress) | 21:29 |
clarkb | a good chunk of them are apscheduler threads, I will pull down my change and use it in my testing too | 21:29 |
clarkb | <Thread(Thread-158, started daemon 140017141806848)> a lot of those too | 21:29 |
jeblair | 3) implement a *central* daemon which can connect to the correct executor and ship that log stream over websockets | 21:30 |
jeblair | it's not much of a stretch to have that daemon *also* implement finger as well as websockets | 21:31 |
mordred | jeblair: ++ yes, I agree | 21:31 |
mordred | (turns out finger is much easier to implement than websockets) | 21:31 |
jeblair | that way we end up with "ws://zuul.example.com/build/..." and "finger build@zuul.example.com" as easy front-ends so users don't have to see executors. | 21:32 |
jeblair | i anticipate that we would use the executor finger url as the build url for now, and then once we have the central daemons, replace that | 21:32 |
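(To make the "finger build@zuul.example.com" front-end concrete: a finger query is just a one-line TCP exchange, so a client is tiny. A minimal sketch, with zuul.example.com and the build id as placeholders; the central daemon would look up which executor holds the build and proxy its stream back:)

    import socket

    def finger(query, host, port=79):
        """Send a finger (RFC 1288) query and return the streamed response."""
        sock = socket.create_connection((host, port))
        try:
            sock.sendall((query + '\r\n').encode('utf-8'))
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
            return b''.join(chunks).decode('utf-8', 'replace')
        finally:
            sock.close()

    # print(finger('BUILD_UUID', 'zuul.example.com'))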
jeblair | jamielennox: does that address your needs? | 21:33 |
jamielennox | jeblair: yea, that will work - one way or another we'd be overriding the build url for viewing anyway, so as long as we have a central place to hit to find information for BUILD_UUID we're good | 21:35 |
mordred | jeblair: the central daemon should also be pretty easy to scale-out and be a little fleet of log streaming daemons if needed too, yeah? | 21:36 |
jeblair | jamielennox: well, the end goal is to have the build url point to a js page which uses websockets to stream the logs, so hopefully you won't need to overwrite it | 21:36 |
jeblair | mordred: yes, i don't see a reason it can't be a scalable component (if you put it behind a load balancer) | 21:36 |
jamielennox | mordred: yea, there shouldn't be any state in that central component | 21:37 |
jeblair | mordred, jamielennox: i think we brainstormed either having the central daemon perform a http/json query to the zuul status endpoint or a direct gearman function call to the scheduler to find out which executor to talk to. other than that, it's just shifting network traffic. | 21:38 |
jeblair | jamielennox: so, aside from that, are there other concerns about 1 executor per machine? | 21:39 |
clarkb | I think I may have found that kazoo leak | 21:39 |
jamielennox | jeblair: not real concerns, i don't see any reason we couldn't run a number of small vms with executors as opposed to cgroup/docker/something isolated on the one machine | 21:41 |
*** hashar has quit IRC | 21:41 | |
jamielennox | we haven't really discussed it, it was just at first pass of the idea it seemed that log streaming is a strange thing to be imposing that limitation | 21:42 |
jamielennox | NFS ftw? | 21:43 |
jeblair | jamielennox: once we have a central daemon, it will make much more sense to put the executor finger daemon on an alternate port, then you could run multiple executor-streamer pairs. *however*, i really think you'd need a sizable machine to get to the point where we have more python thread contention than ansible process overhead. | 21:45 |
jamielennox | jeblair: yea, there are ways we could get around it later if required | 21:46 |
jamielennox | quick question though - what's the log streamer on the test node? is that taking syslog etc? | 21:47 |
jlk | jeblair: for today's meeting, can we do the github subject early? I have to pick up my kid from school, leaving about 10~15 minutes after the meeting starts. | 21:47 |
mordred | jamielennox: it's hacked in to the ansible command/shell module | 21:48 |
jamielennox | mordred: but won't that be coming from the executor now? | 21:48 |
mordred | jamielennox: it's actually the exact same mechanism that provides the telnet streaming in the 2.5 codebase | 21:48 |
mordred | jamielennox: the executor has to get the data somehow | 21:48 |
*** jkilpatr has quit IRC | 21:48 | |
jeblair | jamielennox: it streams stdout of commands. i want to enhance it later to also fetch syslog, etc. (and similarly plumb that through the entire stack too) | 21:48 |
jeblair | jlk: sure thing | 21:48 |
jlk | <3 | 21:49 |
mordred | jamielennox: so currently the callback plugin on the executor makes a connection to the streamer on the test node and pulls back whatever the test node is streaming out on its telnet stream, then writes it to the ansible log file on the executor | 21:49 |
jamielennox | mordred: oh - ok, i wasn't aware that was how that worked | 21:50 |
jamielennox | makes sense | 21:50 |
jeblair | oh it's meeting time in #openstack-meeting-alt | 22:00 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Add a finger protocol log streamer https://review.openstack.org/456721 | 22:21 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Add a finger protocol log streamer https://review.openstack.org/456721 | 22:24 |
*** jkilpatr has joined #zuul | 22:28 | |
jeblair | jamielennox: the plan for the websocket proxy in zuul is to use the autobahn framework with the asyncio module in python3. we won't be able to run tests for that until we convert zuul to python3 (which is blocked on gear for python3). but we can get started on the code without integrating it into tests for now. | 22:50 |
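(A bare-bones autobahn/asyncio skeleton of the kind jeblair mentions; the protocol class and its echo behaviour are placeholders, not the planned zuul implementation:)

    import asyncio

    from autobahn.asyncio.websocket import (WebSocketServerFactory,
                                            WebSocketServerProtocol)

    class BuildLogProtocol(WebSocketServerProtocol):
        def onMessage(self, payload, isBinary):
            # The real proxy would resolve the requested build UUID to an
            # executor and relay its log stream; here we just echo.
            self.sendMessage(payload, isBinary)

    factory = WebSocketServerFactory()
    factory.protocol = BuildLogProtocol

    loop = asyncio.get_event_loop()
    server = loop.run_until_complete(loop.create_server(factory, '0.0.0.0', 9000))
    loop.run_forever()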
mordred | (we do run our pep8 tests under python3, so python3 syntax checking at least happens) | 22:53 |
mordred | (this is important since asyncio in python3 uses new python3 syntax that does not exist in python2 and thus pep8 has sads) | 22:54 |
jeblair | ya | 22:54 |
mordred | very not working patch: https://review.openstack.org/#/c/320563 | 22:55 |
clarkb | the helpful yaml exception parsing we do for bad configs is really confusing when you've broken the config validation :) (easy enough to add a raise in the context manager though) | 22:56 |
jeblair | mordred: oh hey i didn't know that was so far along. :) | 22:58 |
jeblair | jamielennox: ^ | 22:58 |
jeblair | mordred: do you think it's worth porting to v3 and provisionally merging it with the understanding we'll run tests on it later? | 22:58 |
mordred | jeblair: sure - I mean, it doesn't have the extra info plumbed in to it - but I could get that ball rolling probably | 23:00 |
mordred | jeblair: oh - actually, the thing it doesn't have is gear for py3 | 23:01 |
jeblair | mordred: ah, right, so that's a harder dep on the gear py3 thing. | 23:02 |
mordred | jeblair: although that part could be rewritten as a GET on the status.json | 23:02 |
jeblair | true; that could be a good interim thing. | 23:02 |
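(The interim lookup could be as simple as fetching status.json and walking it for the build. The nesting below - pipelines, change queues, heads, items, jobs - is an assumption about the status format and may differ between zuul versions:)

    import json
    import urllib.request

    def find_executor_for_build(zuul_url, build_uuid):
        with urllib.request.urlopen('%s/status.json' % zuul_url) as resp:
            status = json.loads(resp.read().decode('utf-8'))
        for pipeline in status.get('pipelines', []):
            for queue in pipeline.get('change_queues', []):
                for head in queue.get('heads', []):
                    for item in head:
                        for job in item.get('jobs', []):
                            if job.get('uuid') == build_uuid:
                                return job.get('worker', {}).get('hostname')
        return None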
jeblair | i don't think py3gear will take much time once we have the brainspace for it, but until then, if folks would like to move forward on that, there is a path. | 23:03 |
jeblair | jamielennox, SpamapS: ^ | 23:03 |
jeblair | (and if it's not urgent, we can wait for py3gear) | 23:03 |
clarkb | jeblair: one of the problems with leaking is that the timer driver has a stop() which is never called. That means we leak the apsched threads. I tried addressing this by implementing a shim connection for timer but that implies zuul.conf config that will be checked against the layout.yaml and that fails | 23:04 |
jeblair | clarkb: sounds like we need driver shutdown? | 23:04 |
clarkb | ya, or separate the shutdown sequence from the magic that sets up valid config | 23:05 |
jeblair | clarkb: i think explicit driver shutdown is better anyway, since connection shutdown doesn't necessarily imply driver shutdown | 23:05 |
jeblair | clarkb: the drivers are loaded regardless of configuration | 23:06 |
clarkb | ya, that's just a lot more work :) today the driver manipulation appears to go through connections | 23:06 |
clarkb | also it looks like currently stop() is implied for both | 23:07 |
clarkb | e.g. we don't stop only one gerrit connection, we stop all the gerrit connections. So maybe this is easier and I can just add a "for driver in drivers: if hasattr(driver, 'stop'): driver.stop()" loop there | 23:07 |
jeblair | clarkb: yeah, i think so. | 23:10 |
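(Roughly the shape clarkb sketches; where exactly it would live is an assumption here - a connection registry with a drivers dict is used for illustration:)

    class ConnectionRegistry(object):
        # ... existing driver/connection loading ...

        def stop(self):
            # Drivers are loaded regardless of configuration, so stop any
            # driver that knows how - this catches the timer driver's
            # apscheduler threads even when no connection references it.
            for name, driver in self.drivers.items():
                if hasattr(driver, 'stop'):
                    driver.stop()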
jlk | Can v3 jobs have their own failure-message any more? | 23:15 |
SpamapS | I lost the thread on py3gear a bit.. | 23:15 |
jlk | oh n/m | 23:16 |
SpamapS | I believe we had settled on assuming utf8 and adding a binary set of classes for py3 users who want to do non-utf8 function names and payloads. | 23:16 |
SpamapS | (with the understanding that a gear user migrating to python3 in this particular position would likely see oddness) | 23:17 |
jeblair | SpamapS: me too, i'd need to spend a few minutes refreshing my memory. but that sounds reasonable. | 23:18 |
SpamapS | I think I wrote that patch | 23:21 |
SpamapS | https://review.openstack.org/398560 | 23:21 |
SpamapS | I also started on a py3 branch of v3 | 23:22 |
jeblair | SpamapS: ya, looks like the next patch is where we left off | 23:23 |
jeblair | https://review.openstack.org/393544 | 23:23 |
SpamapS | jeblair: I think we may have actually decided to abandon 393544 as it was a more heavy-for-py3k-migrations approach. | 23:24 |
* SpamapS forgot to write that down | 23:24 | |
SpamapS | 398560 handles name automagically.. and payload I think works if all you do is json decode/encode | 23:25 |
SpamapS | ah no, payload ends up coming out as bytes in py3 | 23:27 |
SpamapS | which won't json.loads without a decode | 23:27 |
SpamapS | but that works the same py2/py3, so it isn't too painful to wrap your json.loads calls in it | 23:27 |
jeblair | SpamapS: i think we *don't* want to require wrapping encodes around everything -- the idea was that TextJob lets you avoid that. so i think we still want the idea of 393544, just a simpler implementation. | 23:39 |
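(The wrapper SpamapS describes is essentially the following; gear's Job.arguments attribute is real, the helper name is made up for the sketch:)

    import json

    def load_job_args(job):
        # Under python3 gear returns the payload as bytes, which older
        # json.loads implementations won't accept; under python2 the same
        # .decode('utf-8') just yields unicode, so one wrapper serves both.
        return json.loads(job.arguments.decode('utf-8'))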