SpamapS | jeblair: AH right ok | 00:01 |
* SpamapS may poke at that, I understand now | 00:01 | |
SpamapS | tho I kinda wanna bang out the last few test re-enables first | 00:01 |
* SpamapS EOD's | 00:01 | |
clarkb | wow is it that time already? | 00:09 |
clarkb | jeblair: can you look at test_client_enqueue_negative? is the reuse of client there valid when using shutdown? | 00:12 |
clarkb | test_client_promote_negative does it too | 00:14 |
clarkb | I think those are likely passing because they raise the expected exceptions, but because the client is shut down, not because it actually spoke through gearman and failed /me will adjust | 00:21 |
*** Shrews has quit IRC | 00:51 | |
jeblair | clarkb: the exception check ostensibly checks the specific error, so it seems like they ought to be valid checks (ie, not fooled by the client being shutdown). however, i agree, the reuse of client is wrong and should cause problems. so you're almost certainly on to something. :) | 00:57 |
clarkb | ya I think I have this mostly sorted out | 00:58 |
clarkb | It's ~133 tests in and no fails except for 3 where I derped a cleanup | 00:58 |
clarkb | (still not fast but seems stable at least) | 01:02 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Don't run apsched thread when no jobs need scheduling https://review.openstack.org/457798 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use current class name on super call https://review.openstack.org/459409 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Add ability to stop drivers https://review.openstack.org/459477 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use cStringIO if available. https://review.openstack.org/459478 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Join watchdog thread so it doesn't leak in tests https://review.openstack.org/459479 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup zookeeper and fake nodepool in nodepool tests https://review.openstack.org/459480 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup gearman clients in tests https://review.openstack.org/459481 | 01:14 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Error if threads leak. https://review.openstack.org/459482 | 01:14 |
clarkb | thats today's braindump | 01:14 |
clarkb | I'm not sure if last one will pass as I haven't been able to run full suite locally yet | 01:14 |
clarkb | But its way better than it has been I think | 01:14 |
jeblair | clarkb: i haven't looked, but it's possible that could cause follow-on failures | 01:15 |
jeblair | clarkb: like, if one test fails, then all subsequent tests in that test runner will also fail | 01:15 |
jeblair | which is a bit distracting; hard to see where the real failure is | 01:15 |
clarkb | jeblair: the error if threads leak change? | 01:15 |
clarkb | I don't think it should | 01:15 |
jeblair | clarkb: yeah. | 01:15 |
clarkb | cleanups should run regardless of other cleanups raising | 01:16 |
clarkb | and its at the end of that cleanup so I think we are fine | 01:16 |
jeblair | clarkb: if a test fails, it seems quite likely that it may fail to clean up properly, and threads may leak. | 01:16 |
clarkb | even failed tests should cleanup properly though... | 01:16 |
clarkb | thats why cleanup is better than teardown | 01:17 |
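For context on the cleanup-versus-teardown point above, a minimal sketch (not from the zuul tree) of the behaviour clarkb is relying on: with stdlib unittest, which testtools follows here, every cleanup registered via addCleanup runs even when the test body fails, and each cleanup runs even if another one raises.

```python
import unittest


class CleanupDemo(unittest.TestCase):
    def setUp(self):
        super(CleanupDemo, self).setUp()
        self.client_a = object()              # stand-in for e.g. a gear client
        self.addCleanup(self._release, 'a')   # runs last (cleanups are LIFO)
        self.client_b = object()
        self.addCleanup(self._release, 'b')   # runs first, even if the test fails

    def _release(self, name):
        # Each cleanup runs independently; one raising does not skip the rest.
        print('releasing %s' % name)

    def test_something(self):
        self.fail('a failing test still triggers both cleanups above')


if __name__ == '__main__':
    unittest.main()
```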
clarkb | anyways I can sort out how to make that assertion stronger later. I do think we need something like it though | 01:17 |
clarkb | since leaked threads are very expensive based on my local job runs | 01:17 |
jeblair | ideally yes. we can choose to make that a priority. i'm just saying it's probably not the case now. we'll probably be mopping that up for a while. | 01:17 |
clarkb | one reason I want it to fail is so that you see leaked threads on the first test that leaks them (since otherwise logs are not captured by default) | 01:23 |
clarkb | but I can work on that | 01:23 |
clarkb | I think http://logs.openstack.org/81/459481/1/check/gate-zuul-python27-ubuntu-xenial/02d8108/console.html is possibly a race introduced by my changes too :/ I touched the ansible execution code a little. Will have to poke more tomorrow | 01:30 |
clarkb | jlk: SpamapS: would appreciate it if you could grab that stack and see if it's better for you too | 01:45 |
jlk | kk | 01:57 |
clarkb | hrm tests.unit.test_scheduler.TestScheduler.test_needed_changes_enqueue has been running for ~8 minutes | 02:09 |
clarkb | might have to look at that test more closely too | 02:10 |
jamielennox | jeblair: so backscroll, but do you consider py3 and py3gear a blocker for that | 03:38 |
jamielennox | jeblair: ie, is py3 a specific valuable task that someone could work on today? | 03:39 |
jamielennox | SpamapS: obviously has some influence in the gear space so i can talk with him about a path to getting that done | 03:42 |
jamielennox | because it definitely makes sense that the proxying service be asyncio based | 03:42 |
tobiash | morning | 05:09 |
tobiash | what do you think about enriching nodepool logging with the extra keyword argument? | 05:10 |
tobiash | with that we could add e.g. the node id as extra metadata | 05:11 |
tobiash | using log formatters which support this extra metadata would make it easy to e.g. track/filter the lifecycle of a node in the logs | 05:14 |
tobiash | I use logstash_formatter which formats the log to json (including the extra dict) which is pushed directly to elasticsearch | 05:15 |
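A hedged sketch of the logging enrichment tobiash suggests, using only the stdlib `extra` argument; the field names are illustrative, and the JSON serialization would come from whatever formatter (e.g. the logstash_formatter package) is configured on the handler.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('nodepool.launcher')


def launch(node_id, provider):
    # Keys passed via 'extra' become attributes on the LogRecord, so a
    # JSON-emitting formatter can index them as separate fields, which is
    # what makes per-node filtering in Elasticsearch/Kibana easy. The keys
    # must not collide with built-in record attributes like 'message'.
    log.info('Launching node', extra={'node_id': node_id,
                                      'provider': provider})


launch(123, 'example-cloud')
```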
*** yolanda has quit IRC | 05:31 | |
SpamapS | jamielennox: yeah I think the idea is that to do websockets properly the right path is asyncio | 05:54 |
*** bhavik1 has joined #zuul | 06:29 | |
*** bhavik1 has quit IRC | 06:58 | |
*** yolanda has joined #zuul | 07:04 | |
*** isaacb has joined #zuul | 07:15 | |
*** jeblair_ has joined #zuul | 07:41 | |
*** jeblair has quit IRC | 07:41 | |
*** yolanda has quit IRC | 07:44 | |
*** hashar has joined #zuul | 07:56 | |
*** yolanda has joined #zuul | 08:00 | |
*** yolanda has quit IRC | 08:17 | |
*** yolanda has joined #zuul | 08:33 | |
*** yolanda has quit IRC | 09:22 | |
*** yolanda has joined #zuul | 09:37 | |
*** jkilpatr has quit IRC | 10:49 | |
*** yolanda has quit IRC | 11:05 | |
*** jkilpatr has joined #zuul | 11:11 | |
*** Shrews has joined #zuul | 11:29 | |
mordred | jamielennox, SpamapS: yes re: websockets and asyncio - but I don't think py3/py3gear needs to be a blocker for getting the websockets service done | 11:56 |
mordred | it'll only be a blocker for writing unit tests for the websockets service | 11:56 |
mordred | I say that because while technically getting gear python3'd has already eaten one person and is an easy hole to fall down and be frustrated inside of | 11:57 |
mordred | while getting the websockets streamer going should actually be fun | 11:58 |
mordred | OH | 11:58 |
mordred | well - I mean, tox -epy35 isn't really going to do anything right now - so I suppose one could actually even write unit tests for the websockets streamer and just put something into the tox definition that excludes the non-websockets unit tests from running under 35 | 11:59 |
mordred | so if it's a separate little daemon, and it fetches the information it needs about where the logs are from an http call to status.json ... | 12:00 |
mordred | I don't think there are any blockers | 12:00 |
mordred | (it wants to make the rest call, fwiw, so that it's not just an open proxy to streaming from random ports - if the browser->websocket protocol is "please give me the stream for job XXX and optionally for logfile YYY", then the websocket server asks zuul "hey, where's XXX and YYY?" - it should be a fairly closed loop) | 12:02 |
mordred | also, Shrews is currently working on the thing the websocket server will be streaming _from_, so if anybody does decide to dive on that topic, syncing with him will likely be helpful | 12:03 |
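A very rough sketch of the flow mordred describes, assuming aiohttp; the URL, message format, status.json layout, and the lookup_build() helper are all made up for illustration and are not zuul's actual API. The point is only that the websocket server streams solely from whatever location zuul reports for the requested build, so it is not an open proxy.

```python
import asyncio

import aiohttp
from aiohttp import web

STATUS_URL = 'http://zuul.example.org/status.json'   # placeholder


def lookup_build(status, build_uuid):
    # Placeholder for walking zuul's real status document; pretend it told
    # us the host and port the console stream for this build lives on.
    return ('executor.example.org', 7900)


async def find_stream(build_uuid):
    # Ask zuul where the console stream for this build lives.
    async with aiohttp.ClientSession() as session:
        async with session.get(STATUS_URL) as resp:
            status = await resp.json()
    return lookup_build(status, build_uuid)


async def stream(request):
    ws = web.WebSocketResponse()
    await ws.prepare(request)
    req = await ws.receive_json()             # e.g. {"uuid": "...", "logfile": "..."}
    host, port = await find_stream(req['uuid'])
    reader, _ = await asyncio.open_connection(host, port)
    while True:
        data = await reader.read(1024)
        if not data:
            break
        await ws.send_str(data.decode('utf-8', 'replace'))
    await ws.close()
    return ws


app = web.Application()
app.router.add_get('/console-stream', stream)
# web.run_app(app)
```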
jamielennox | mordred: so i was mostly asking because if that just needed someone to put in the work to unblock something i could pick it up | 12:17 |
jamielennox | mordred: but so gear is apparently already running with py35, but it has no asyncio | 12:17 |
jamielennox | mordred: and a google search for asyncio gearman has 3 libraries in top 4 results, what was the initial rationale for that being an openstack library? | 12:18 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: Don't getChange on source not triggering a change https://review.openstack.org/406129 | 12:19 |
*** yolanda has joined #zuul | 12:21 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool feature/zuulv3: Support externally managed images https://review.openstack.org/458073 | 13:10 |
*** atarakt has quit IRC | 13:14 | |
*** yolanda has quit IRC | 13:28 | |
pabelanger | https://admin.fedoraproject.org/pkgdb/package/rpms/zuul/ | 13:40 |
pabelanger | \o/ | 13:40 |
*** yolanda has joined #zuul | 13:44 | |
*** hashar is now known as wmf-SWAT-shop | 13:52 | |
*** wmf-SWAT-shop is now known as hashar | 13:52 | |
pabelanger | zuulv3-dev.o.o back online, but I see an exception | 14:04 |
pabelanger | http://paste.openstack.org/show/607867/ | 14:04 |
pabelanger | possibly another configuration change? | 14:05 |
pabelanger | Hmm, possibly a bug. According to docs, server should be the default for canonical_hostname | 14:10 |
Shrews | commit 1c7744 added canonical hostname to the source obj. probably related | 14:11 |
pabelanger | thanks | 14:11 |
pabelanger | I think we want our config to be the following now | 14:11 |
pabelanger | server = review.openstack.org | 14:12 |
pabelanger | canonical_hostname = git.openstack.org | 14:12 |
jeblair_ | yes, though it should work without that, so we should fix the error. | 14:12 |
pabelanger | agree | 14:13 |
*** jeblair_ is now known as jeblair | 14:15 | |
*** isaacb has quit IRC | 14:17 | |
jeblair | pabelanger: i'm going to do some editing/debugging on zuulv3-dev, ok? | 14:17 |
pabelanger | jeblair: sure | 14:17 |
pabelanger | going to grab a coffee | 14:17 |
mordred | jamielennox: we wrote gear originally just because we needed a better/simpler library for infra things back in the day (the other python libs were either incorrect or tried to be too clever) | 14:17 |
mordred | jamielennox: I think if we have the streamer get the needed data via http and not via gear we can avoid the need to get gear ported to py3 for now (it still needs to be done to support zuul on py3, but there's no need to block the websocket streaming on gear getting a py3 update) | 14:18 |
*** yolanda has quit IRC | 14:21 | |
jeblair | pabelanger: oh, it's because of the 'legacy gerrit section'. it's time to remove that. :) | 14:32 |
jeblair | pabelanger: [gerrit] needs to be [connection gerrit] | 14:33 |
pabelanger | Ah, cool! | 14:36 |
*** yolanda has joined #zuul | 14:36 | |
*** yolanda has quit IRC | 15:15 | |
pabelanger | apparently gnocchi is moving out of openstack namespace. Not sure what they are planning on for testing, but maybe another project to add to the list for our zuulv3 github connection | 15:23 |
clarkb | well they weren't using our testing before, just merge check and noop jobs. | 15:25 |
clarkb | or is that just debs ETOOEARLY | 15:25 |
clarkb | oh that's transitional, gah I need more caffeine | 15:26 |
pabelanger | I think they had python-jobs at one point | 15:26 |
pabelanger | and publish-to-pypi | 15:26 |
clarkb | pabelanger: ya the change diff I saw was removal of their transition out change | 15:26 |
clarkb | so first change is to remove python jobs et al and replace with noop jobs | 15:27 |
clarkb | then change to remove noop jobs and I saw this one | 15:27 |
*** yolanda has joined #zuul | 15:30 | |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Error if threads leak. https://review.openstack.org/459482 | 15:54 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup gearman clients in tests https://review.openstack.org/459481 | 15:54 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup zookeeper and fake nodepool in nodepool tests https://review.openstack.org/459480 | 15:54 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Join watchdog thread so it doesn't leak in tests https://review.openstack.org/459479 | 15:54 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Properly record timed out builds https://review.openstack.org/459767 | 15:54 |
clarkb | ok that should fix up the watchdog related error | 15:54 |
clarkb | and it fixes a bug I found in the process | 15:54 |
*** Shrews has quit IRC | 15:54 | |
*** Shrews has joined #zuul | 15:54 | |
clarkb | SpamapS: jlk ^ I still have one test worker crash on me locally so I don't run the full suite, but this seems to get me reliably completing the test suite and things are generally happier than they were, so reviews and test driving would be great | 15:55 |
*** Shrews has quit IRC | 15:55 | |
clarkb | now to turn timeout to gentle and see if I can't sort that out | 15:55 |
*** Shrews has joined #zuul | 15:58 | |
Shrews | \o/ got my irc bouncer back | 15:59 |
Shrews | no thx to vexxhost support, though | 15:59 |
jeblair | Shrews: regarding 456685, i just remembered that the random tmp dir name has a security component. it seems like the build uuid might be sufficiently secure, but i think we need to be as careful in creating that as mkdtemp. | 16:06 |
jeblair | Shrews: i left a comment inline. I think what it describes will be sufficient. if we're not comfortable with that, your previous version (which used a prefix and glob) would work. | 16:06 |
jeblair | Shrews: we may want to ask SpamapS to look at this too since he has opinions on this sort of thing. :) | 16:07 |
* SpamapS awake | 16:09 | |
Shrews | jeblair: ack | 16:09 |
SpamapS | I think it's probably unpredictable enough. Unless it's sitting around in a globally readable place for a long time (more than 5 seconds == long) before the dir gets created. | 16:11 |
SpamapS | seems like build UUID's are only internal to zuul and the gearman jobs | 16:12 |
jeblair | SpamapS: it will be exposed by the status json, and if it takes a while for a node to show up, it could sit there for a while. | 16:13 |
clarkb | tests.unit.test_scheduler.TestScheduler.test_needed_changes_enqueue seems to be the problem test based on my grepping of subunit | 16:13 |
clarkb | gentle timeout also doesn't seem strong enough | 16:14 |
jeblair | SpamapS, Shrews: so we may not be able to rely on the element of surprise. given that, is the mkdir sufficient? (it should fail if it already exists) | 16:14 |
SpamapS | jeblair: yeah I think it being a dir name and not a file, we can relax with a mkdir that fails. | 16:14 |
jeblair | cool, yeah that seems like a nice benefit of using a dir | 16:15 |
*** yolanda has quit IRC | 16:15 | |
SpamapS | since the trouble would only come if they symlinked the UUID to a place with files that would match what gets written into a job dir. | 16:15 |
Shrews | jeblair: i suppose still using os.makedirs() on the root (not including the build uuid) is necessary, yeah? | 16:16 |
SpamapS | and then zuul would wreak havoc on that dir. So yeah, just make sure the mkdir raises on existing anything. | 16:16 |
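A minimal sketch of the conclusion here: when the directory name is predictable (the build UUID ends up in the status JSON), creation must fail loudly if anything already exists at that path, so nobody can pre-create or symlink it out from under zuul. Paths and the mode are illustrative.

```python
import os


def make_jobdir(root, build_uuid):
    path = os.path.join(root, build_uuid)
    # os.mkdir() raises (FileExistsError on py3, OSError/EEXIST on py2) if
    # the path already exists -- even as a symlink -- unlike
    # os.makedirs(..., exist_ok=True). mkdtemp() would also randomize the
    # name, but a restrictive mode plus a hard failure on collision covers
    # the symlink-swap concern for a directory with a known name.
    os.mkdir(path, 0o700)
    return path
```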
jeblair | Shrews: i don't think we do that now; i think we expect the root to exist. | 16:16 |
jeblair | Shrews: er, i'm using 'root' to mean the 'root' argument as passed in to jobdir | 16:17 |
Shrews | jeblair: ok. was wondering because all of the following code uses that | 16:17 |
jeblair | Shrews: also, fwiw, JobDir is used in exactly one place, so you can drop the conditionals (all arguments are, in reality, required). | 16:18 |
jeblair | that might make this simpler to reason about | 16:19 |
Shrews | jeblair: it looks like 'root' can actually be None | 16:25 |
Shrews | so should probaby keep that conditional | 16:25 |
jeblair | Shrews: oh, and that uses TMPDIR? sounds right then. | 16:25 |
Shrews | i'll make the other two params really required | 16:27 |
jeblair | clarkb: re 457798 -- how does that leak happen? | 16:28 |
clarkb | jeblair: https://review.openstack.org/#/c/459477/1 is what properly addresses the leak since we never actually stop the driver otherwise | 16:29 |
clarkb | jeblair: so everytime a test creates a new scheduler you get a new apsched thread | 16:30 |
jeblair | clarkb: ok, so i think we may not want 457798 (it has a structural problem i don't see a good way to fix; comment inline) | 16:30 |
clarkb | ya I see that now | 16:31 |
clarkb | jeblair: I'll wait for more reviews on the stack if you are going through it then address them in one go (rather than push a bunch) | 16:31 |
jeblair | kk | 16:32 |
*** yolanda has joined #zuul | 16:32 | |
clarkb | there is still something causing the tests to run longer under testr than testtools.run | 16:34 |
clarkb | test_crd_branch is ~40 seconds vs ~16 seconds | 16:34 |
clarkb | so it's actually quicker for me to do for X in `cat testlist` ; do testtools.run $X ; done than testr run | 16:34 |
jeblair | that's so weird | 16:35 |
clarkb | also more reliable | 16:35 |
clarkb | my hunch is that its processing the subunit stream | 16:36 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for using build UUID as temp job dir name https://review.openstack.org/456685 | 16:37 |
clarkb | however the time difference is only apparent when all the tests are run together and not individually so maybe not | 16:37 |
clarkb | *** Error in `python': free(): invalid pointer: 0x00007fdf880000a8 *** <- that just happened running tox -epy27 -- test_crd_branch | 16:39 |
*** Shrews has quit IRC | 16:39 | |
jeblair | clarkb: 459479 -- it seems like even if that does leak, it won't leak for more than 10 seconds. the fix may cause an extra 0-10 second delay in production after every job completes. is it enough of a problem in the tests to make that worthwhile? | 16:40 |
clarkb | jeblair: the delay should only be up to one second | 16:41 |
clarkb | and it actually makes the job that tests for timeouts quicker | 16:41 |
jeblair | clarkb: i don't think we should increase the watchdog interval in production. we can do it in tests, but in prod, i think we should stick with 10s. we're going to have at least 100 of those threads in production, having them all poll every second will be a considerable amount of context switching and python GIL locking. | 16:42 |
jeblair | or, i guess that should have said 'decrease the watchdog interval'. | 16:42 |
clarkb | the reason to clean it up is so that we can assert the end state of tests more strongly, we could whitelist the watchdog thread when we do that | 16:42 |
clarkb | so change could be to add a name to watchdog thread, then whitelist it in the tests rather than shorten sleep and join | 16:43 |
*** yolanda has quit IRC | 16:44 | |
jeblair | and that amount of thread leakage will be okay you think? | 16:44 |
clarkb | ya especially since that thread will be on a 10 second timeout | 16:45 |
jeblair | wfm then. | 16:45 |
clarkb | I think the costly threads are gearman and kazoo because they go as quickly as possible doing nothing in the background but generating log noise | 16:45 |
clarkb | oh them and apsched | 16:45 |
clarkb | for the same reason | 16:45 |
jeblair | clarkb: i think gearman has a 2s sleep, but yeah, it adds up. | 16:46 |
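A sketch of the compromise clarkb describes, with names of our own invention rather than the real zuul ones: give the long-interval watchdog thread a recognizable name and have the test-suite leak check tolerate it while still failing on anything else.

```python
import threading

WHITELIST = ('ExecutorWatchdog',)     # long-interval threads we tolerate


def start_watchdog(target):
    t = threading.Thread(target=target, name='ExecutorWatchdog')
    t.daemon = True
    t.start()
    return t


def assert_no_leaked_threads(testcase):
    # Called from a test cleanup: anything still alive besides the current
    # thread and the whitelisted watchdogs counts as a leak.
    leaked = [t for t in threading.enumerate()
              if t is not threading.current_thread()
              and not any(t.name.startswith(w) for w in WHITELIST)]
    testcase.assertEqual([], leaked, 'leaked threads: %s' % leaked)
```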
jlk | o/ | 16:50 |
jeblair | clarkb: stack review complete | 16:51 |
clarkb | jeblair: this stack appears to have made my local runs of CRD tests not error on timeout waiting for zuul to settle. So ya I think it does add up. Unfortunately it hasn't corrected the tests are generally slower in testr problem :/ | 16:52 |
*** Shrews has joined #zuul | 16:57 | |
SpamapS | clarkb: I'm at home with sick baby today, so I'll take a look when she goes down for a nap. | 16:57 |
SpamapS | "at home" vs. "working at home" ;-) | 16:58 |
*** hashar_ has joined #zuul | 17:01 | |
*** hashar_ has quit IRC | 17:01 | |
jlk | So what's with the "queue" key in a project definition? | 17:13 |
jlk | and queue-name in a job definition? | 17:13 |
jeblair | jlk: queue-name should go away (it's zuulv2 only) | 17:14 |
jeblair | jlk: queue in a project assigns that project to a shared change queue in a dependent pipeline. | 17:15 |
jlk | jeblair: is that how multiple unrelated projects can use the same "dependent" manager pipeline, but each have their own queues? | 17:16 |
jlk | projects A, B, C all use one queue, project D uses a different one, and thus is not dependent on jobs of A, B, C? | 17:16 |
jeblair | jlk: exactly. | 17:16 |
jlk | cool, I've had that question bouncing in my head recently | 17:16 |
jeblair | jlk: in zuulv2, combining them was automatic (based on job name). but in v3, we don't assume anything about job names, so you have to manually group projects together like that. | 17:17 |
jlk | ah | 17:17 |
jlk | oh yeah that is bad for us on v2, where everything shares a single job name, the job where we run whatever "run.sh" is in the project repo | 17:17 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove queue-name from job schema https://review.openstack.org/459792 | 17:18 |
jeblair | jlk: yeah. it was designed under the assumption of job names like "gate-nova-python27". | 17:18 |
jlk | makes sense | 17:18 |
jlk | so only a tiny bit related, I'm working on the patch that adds dependent queue support to github | 17:19 |
jlk | and in the tests I'm failing, because the call in manager/dependent.py getChangeQueue() is calling out to self.pipeline.getQueue(change.project), which then looks at the self.queues, and tries to see if "project" is in queue.projects | 17:20 |
jlk | and while it looks like it, the comparison is failing for some reason | 17:20 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Error if threads leak. https://review.openstack.org/459482 | 17:20 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup gearman clients in tests https://review.openstack.org/459481 | 17:20 |
jlk | so I'm trying to figure out where equivalence of project objects is determined | 17:20 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup zookeeper and fake nodepool in nodepool tests https://review.openstack.org/459480 | 17:20 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Properly record timed out builds https://review.openstack.org/459767 | 17:20 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Whitelist watchdog threads in test thread leak checker https://review.openstack.org/459479 | 17:20 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use cStringIO if available. https://review.openstack.org/459478 | 17:21 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use current class name on super call https://review.openstack.org/459409 | 17:21 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Add ability to stop drivers https://review.openstack.org/459477 | 17:21 |
clarkb | jeblair: ^ I think that addresses your concerns | 17:21 |
jlk | I only have one queue in self.queues | 17:21 |
clarkb | now to abandon the unneeded apsched change | 17:21 |
jlk | and there is only one project in that queue | 17:21 |
jlk | and there is only one project in that queue: <Project org/project> | 17:22 |
jlk | which LOOKS like it should match the project in question, <Project org/project> | 17:22 |
jlk | (Pdb) project == queue.projects[0] | 17:22 |
jlk | False | 17:22 |
jlk | but it doesn't | 17:22 |
jeblair | jlk: i believe we rely on object identity for equilavence. Sources are responsible for supplying Project objects and they are expected to cache them (probably on their driver), so that they always return the same object for a given project. | 17:22 |
jlk | hrm. | 17:23 |
jlk | I'll go hunting in that direction, thanks. | 17:24 |
jeblair | (it's quite possible our data model has evolved to the point where we no longer need to do that and could implement a real __eq__ method (now that we can fully qualify project names with hostnames) but let's defer changing that for later) | 17:24 |
jeblair | jlk: given that, temporarily throwing an id(self) in the __repr__ for project might help track that down too. | 17:26 |
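An illustrative sketch of the pitfall being debugged (the class and cache are stand-ins, not zuul's model code): without an __eq__, Python compares objects by identity, so two Project objects that print identically still compare unequal unless every lookup hands back the one cached instance; putting id(self) in __repr__ makes the difference visible in pdb.

```python
class Project(object):
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        # Including id(self) exposes whether two "equal looking" projects
        # are really the same object.
        return '<Project %s at %#x>' % (self.name, id(self))


_cache = {}


def get_project(name):
    # Returning the cached instance is what makes identity comparison work.
    if name not in _cache:
        _cache[name] = Project(name)
    return _cache[name]


print(Project('org/project') == Project('org/project'))          # False
print(get_project('org/project') == get_project('org/project'))  # True
```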
clarkb | tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue is actually the test that appears to timeout under full testr run. It takes ~55 seconds alone either with testr or testtools. So confused | 17:27 |
jeblair | clarkb: 15s for me locally | 17:27 |
jeblair | clarkb: and about 20-25s in the gate | 17:27 |
jeblair | (alone) | 17:28 |
clarkb | jeblair: do you see the same time difference between tox -epy27 run tests and testtools.run or similar? | 17:28 |
jeblair | clarkb: nope | 17:28 |
jlk | yeah the IDs are definitely different | 17:28 |
jeblair | (i checked that yesterday) | 17:28 |
jlk | unfortunately how the github code is handling projects basically mirrors how gerrit does. | 17:28 |
jlk | unless... | 17:28 |
jlk | nope, doesn't look like tests/base.py overrides that | 17:29 |
jeblair | clarkb: 'nodepool list|grep jeblair' if you want to putz around on the 4 test nodes i held. ~root/zuul already exists checked out at feature/zuulv3 with a .tox/py27 | 17:29 |
clarkb | jeblair: thanks | 17:30 |
jeblair | jlk: yeah, looks like the projects are cached on the connection object. | 17:33 |
jlk | oh this is interesting | 17:33 |
jlk | I think the source object is not being cached | 17:34 |
jlk | so some calls to self.source.getProject are hitting a _different_ source object. | 17:34 |
jeblair | jlk: yeah, i think that's okay though, since the different sources should have the same connection object. | 17:35 |
jlk | hrm. | 17:35 |
jlk | then it's the connection object that's different | 17:35 |
jlk | either way this trip through pdb I'm seeing an empty list of projects even though the first trip through some projects were added | 17:35 |
jlk | self.connections.getSourceByHostname | 17:36 |
jlk | I think that failed to find the right source for me | 17:36 |
jeblair | jlk: which test are you running? | 17:36 |
jlk | it's a new one in this patch. It's running jobs through a pipeline with a dependent manager | 17:37 |
jlk | oooooh wait. | 17:37 |
jlk | what happens if you have two sources that point to the same hostname? | 17:37 |
jeblair | jlk: yeah, i've got it checked out | 17:37 |
* jlk has a hunch | 17:38 | |
jeblair | jlk: two connections? bad mojo. | 17:38 |
jlk | okay I think I know where this is fubar then | 17:38 |
jeblair | jlk: i think there's a right way to do that though, lemme look it up | 17:38 |
jlk | https://review.openstack.org/#/c/444034/7/tests/fixtures/zuul-github-driver.conf | 17:39 |
jlk | I think this is what did it, in an earlier test | 17:39 |
jlk | er earlier patch | 17:39 |
jeblair | jlk: hrm. it should actually work -- tests/fixtures/zuul-connections-same-gerrit.conf tests that case | 17:40 |
jlk | oh, darn. | 17:40 |
jeblair | but it's possible there's a bug that's slipping by that test case but not yours | 17:40 |
jlk | well I'm going to modify that file and see if things change. | 17:41 |
jlk | sure did | 17:42 |
jeblair | jlk: generally speaking, the idea is that even if we have two connections for a given hostname, in the tenant config file (main.yaml), we associate each project with a specific connection. so in your case, if you say "source: github:" in main.yaml, the following projects are associated with that specific connection (vs "source: github_ssh") | 17:42 |
clarkb | woo if I increase the test timeout I get successful tox run on tip of my stack | 17:43 |
clarkb | tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue 141.804 so I'm about 20s past the default timeout | 17:43 |
jeblair | jlk: however, i think the call to getSourceByHostname in scheduler.process_event_queue is wrong; it fails to take that into account. | 17:44 |
clarkb | next slowest is 98 seconds | 17:44 |
jlk | jeblair: yeah I was about to trace around that call | 17:44 |
jeblair | jlk: my guess is that the gerrit test happens to have the sources in the right order to mask that, but the github test doesn't. | 17:44 |
jeblair | jlk: i have an idea; gimme a few to hack it up. | 17:47 |
jlk | okay | 17:48 |
jlk | I'll keep poking too | 17:48 |
clarkb | jeblair: http://paste.openstack.org/show/607901/ I'm seeing the testr full run slowdown on one of your held nodes | 17:51 |
clarkb | gonna tweak the --concurrency to see if that affects it too (eg less concurrency == slower?) | 17:51 |
jlk | jeblair: I think the logic in line 108 of zuul/lib/connections.py is suspect. | 17:52 |
jlk | It'll wind up replacing sources['hostname'] with whatever the last read connection to the same host is | 17:53 |
jlk | so the sources list will be incomplete | 17:53 |
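A stripped-down illustration of the bug jlk is pointing at (not the actual connections.py code): indexing sources by hostname alone means a second connection to the same host silently overwrites the first, so a hostname lookup can return a source that never registered your projects.

```python
connections = {
    'github':     {'host': 'github.com', 'auth': 'app'},
    'github_ssh': {'host': 'github.com', 'auth': 'ssh'},
}

sources_by_hostname = {}
for name, conn in connections.items():
    # Last connection read wins; whichever of 'github' / 'github_ssh' comes
    # second clobbers the first, and iteration order is not guaranteed here.
    sources_by_hostname[conn['host']] = name

print(sources_by_hostname)    # {'github.com': <only one of the two>}
```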
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: remove getSourceByHostname https://review.openstack.org/459821 | 17:58 |
jeblair | jlk: yes, i think that's right. ^ may be the solution. | 17:58 |
jeblair | clarkb: yeah, that test is pretty busy so i expect it's especially affected by concurrency. | 18:00 |
clarkb | jeblair: what I am curious to see is if less concurrency makes it slower or faster. Faster is what you'd expect because that means more cpu time/less contention | 18:00 |
clarkb | jeblair: but I'm thinking maybe it's slower because that means a single runner is running more jobs that are leaking something, causing it to be slow? | 18:01 |
clarkb | jeblair: http://paste.openstack.org/show/607905/ it is indeed slower the fewer cpus we use | 18:07 |
clarkb | and that scaling factor is very similar to mine compared to running isolated | 18:07 |
*** isaacb has joined #zuul | 18:08 | |
jeblair | clarkb: interesting... though it could also be unfortunate test ordering, yeah? | 18:08 |
clarkb | jeblair: I don't think so. each process is running fairly isolated. The only shared resource is zk but you'd expect slower times with more concurrency if it was getting hit hard | 18:09 |
clarkb | each process runs one test at a time, regardless of order I would expect similar test runtimes | 18:10 |
jeblair | clarkb: io is shared, which is quite a lot of what these are doing | 18:10 |
clarkb | unless it is a resource leak early in test run that negatively affects subsequent tests | 18:10 |
clarkb | jeblair: right but you'd expect less concurrency to be faster if that was the issue not slower | 18:10 |
clarkb | by reducing concurrency we are reducing total cpu and io load on the system but the tests run slower | 18:11 |
jeblair | clarkb: yes, i mostly agree, just holding out the thought that in a pathological case, with concurrency=8 we could be up against cpu bound tests and concurrency=4 up against io bound tests. that's unlikely, but possible. | 18:12 |
clarkb | jeblair: I'm going to run it with concurrency=1 | 18:13 |
clarkb | which may rule out that possibility | 18:13 |
jlk | hey look at that, all my tests are actually passing now :/ | 18:21 |
clarkb | jeblair: --concurrency=1 on that held node just hit timeout waiting for zuul to settle | 18:36 |
clarkb | so I think there must be something else going on in here that causes consolidated test processes to perform poorly | 18:36 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: remove getSourceByHostname https://review.openstack.org/459821 | 18:52 |
jeblair | jlk: ^ passes tests now. | 18:52 |
jlk | woo | 18:52 |
clarkb | tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue 100.753 <- under timeout. I killed logging | 18:56 |
clarkb | I think our logging stuff must be leaking somehow? | 18:56 |
jeblair | clarkb: that's still 30s longer than run alone, right? | 19:06 |
clarkb | no sorry thats local to me where it took ~55s | 19:07 |
clarkb | so that's in line with my local isolated run | 19:07 |
clarkb | I'm testing a change that I think may help, iteration time is so long though | 19:07 |
clarkb | (while keeping our logging) | 19:07 |
clarkb | er derp I see what you are saying | 19:11 |
clarkb | I don't know how I math'd that wrong | 19:11 |
*** yolanda has joined #zuul | 19:38 | |
*** hashar has quit IRC | 19:55 | |
jlk | GREAT NEWS EVERYBODY | 20:00 |
jlk | er, GOOD NEWS whatever. | 20:00 |
jlk | Github branch protection now has an option where new pushes to a PR will dismiss any Reviews on the PR! | 20:00 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup logging handlers https://review.openstack.org/459849 | 20:00 |
clarkb | jeblair: ^ thats a massive improvement in test runtime but still slower than isolated | 20:00 |
pabelanger | jlk: success? | 20:01 |
clarkb | SpamapS: jlk ^ that change is the one you'll want to grab for local testing | 20:02 |
jlk | kk | 20:02 |
clarkb | I need to grab lunch now, but very interested if ^ helps anyone else running tests locally | 20:04 |
*** jkilpatr has quit IRC | 20:10 | |
clarkb | jeblair: http://logs.openstack.org/49/459849/1/check/gate-zuul-python27-ubuntu-xenial/b9421e1/console.html http://logs.openstack.org/77/459477/2/check/gate-zuul-python27-ubuntu-xenial/e93f61e/console.html comparison seems to confirm this change works as expected in the gate too | 20:45 |
*** jkilpatr has joined #zuul | 20:47 | |
clarkb | based on ^ it would be excellent to get reviews on https://review.openstack.org/#/q/topic:test-stability jeblair has reviewed most of them and it should make everyone's dev process happier | 21:19 |
jeblair | clarkb: nice, those even both ran on internap | 21:23 |
clarkb | jeblair: yup so should be a good comparison. Ran in half the time | 21:27 |
jeblair | clarkb: +2 from me | 21:28 |
clarkb | I'm still not sure where the other 40-50s is going locally | 21:28 |
clarkb | but guessing its another leak in the same sort of genre | 21:28 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove queue-name from job schema https://review.openstack.org/459792 | 21:30 |
*** isaacb has quit IRC | 21:36 | |
SpamapS | oh wow.. github has added the ability to dismiss PR reviews on push to the pr source. | 21:48 |
SpamapS | I guess they did get enough angry fist shakes | 21:48 |
clarkb | jeblair: ok I just finished a successful --concurrency=1 run locally. And the individual tests had runtimes closer to what I would expect. So I think the remaining 40-50s is contention with shared resources like disk IO and zk | 22:03 |
clarkb | jeblair: I'm going to treat this as "solved" for now as a result of ^ those results | 22:03 |
jeblair | clarkb: w00t, thanks! | 22:03 |
clarkb | tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue 61.805 | 22:03 |
clarkb | Ran 236 tests in 1614.315s (+1008.612s) wall time is much much longer but individual tests run quicker | 22:03 |
jesusaur | clarkb++ continues to be awesome | 22:08 |
jlk | woo | 22:21 |
jlk | Reviews on https://review.openstack.org/459821 would help me a lot in my progress :) | 22:21 |
jeblair | clarkb, mordred: i think you can help jlk there ^ | 22:22 |
Shrews | clarkb: question for you on 459480 | 22:23 |
jeblair | jlk, mordred, SpamapS: i have reviewed the first 4 github-v3 changes. i think 439834 and 443947 open a set of related questions on the subject of driver->zuul event manipulation which would be good to come to consensus on. | 22:25 |
jlk | thanks! I saw those hit my inbox, and I briefly went over them. I'll probably have replies on them tomorrow morning | 22:25 |
clarkb | Shrews: responded | 22:26 |
clarkb | Shrews: that's actually something I think we likely want to get better at: constructing things such that the cleanups are in place before they have a chance to break | 22:26 |
jeblair | clarkb: cleanups are all in individual try/except clauses, right? | 22:27 |
clarkb | jeblair: yes, one cleanup failing doesn't prevent the next from running | 22:27 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add ability to stop drivers https://review.openstack.org/459477 | 22:27 |
clarkb | jeblair: unless you sys.exit or something | 22:27 |
jeblair | clarkb: so yeah, i agree, that's probably the safer method and we may want to move many of our cleanups up a line. | 22:28 |
Shrews | clarkb: k. iirc, the disconnect() checks to see if it's actually connected first anyway (assuming it was copied directly from nodepool's zk.py) | 22:28 |
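A sketch of the ordering clarkb and Shrews are agreeing on, with stand-in class names rather than the real test fixtures: register the cleanup as soon as the object exists (before connecting), which is safe precisely because a disconnect-style cleanup checks whether it ever connected.

```python
import unittest


class FakeZK(object):
    connected = False

    def connect(self):
        self.connected = True

    def disconnect(self):
        if self.connected:        # harmless no-op if connect() never ran
            self.connected = False


class SomeTest(unittest.TestCase):
    def setUp(self):
        super(SomeTest, self).setUp()
        self.zk = FakeZK()
        self.addCleanup(self.zk.disconnect)   # registered *before* connect()
        self.zk.connect()                     # if this raises, cleanup still runs
```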
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove queue-name from job schema https://review.openstack.org/459792 | 22:30 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Use current class name on super call https://review.openstack.org/459409 | 22:32 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Use cStringIO if available. https://review.openstack.org/459478 | 22:32 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Whitelist watchdog threads in test thread leak checker https://review.openstack.org/459479 | 22:32 |
*** harlowja has quit IRC | 22:34 | |
clarkb | jeblair: jlk make sure I get this right, basically you can have review.o.o/foo and review.o.o/bar so sources[review.o.o] isn't a singleton? | 22:34 |
jeblair | clarkb: your example looks like two projects over the same connection which is more straightforward than the problem -- the problem is actually two *connections* to the same server. | 22:35 |
clarkb | when would you have two connections to the same server? | 22:36 |
jeblair | clarkb: it's actually the original impetus for multi-connection support, though we've never used it: report changes to gerrit under a different username. | 22:36 |
clarkb | aha | 22:36 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Cleanup zookeeper and fake nodepool in nodepool tests https://review.openstack.org/459480 | 22:38 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Cleanup gearman clients in tests https://review.openstack.org/459481 | 22:38 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Properly record timed out builds https://review.openstack.org/459767 | 22:40 |
jeblair | clarkb: so the only way to really know which connection to use for a project is via the tenant config file, where project names are definitively mapped to connections. that process ensures that we don't end up with two identically named projects (with that index, it's okay to have two connections with the same hostname, as long as they don't also have projects with the same name) | 22:41 |
jeblair | (attempting to do so is a configuration error, so we can rely on that) | 22:41 |
mordred | jeblair: I believe this matches what we talked about that one time | 22:46 |
clarkb | jeblair: ok I am confused about the comment at line 498 https://review.openstack.org/#/c/459821/2/zuul/scheduler.py. What is the corresponding live item and why is it N items behind the current item. Is it valid to have live -> not live -> not live -> live -> not live ? | 22:48 |
clarkb | jeblair: at least with v2 I thought that once you are not live everything behind is not live? | 22:49 |
jeblair | clarkb: behind and ahead may be confusing here. take the gnocchi change near the top of check right now | 22:50 |
jeblair | clarkb: that's a bunch of non-live items *ahead of* a single live item. | 22:51 |
clarkb | ah | 22:51 |
clarkb | I'm thinking of the gate case with windows (because familiar with that code) | 22:51 |
jeblair | (which is the only configuration non-live shows up in, at the moment) | 22:51 |
clarkb | thanks | 22:51 |
jeblair | clarkb: ah, that's 'active' | 22:51 |
jeblair | (rather than 'live') | 22:51 |
clarkb | gotcha | 22:51 |
jeblair | honestly, foreign projects and multi-connections need a bit of a rethink, because at some point we're going to want to say "this zuul patch Depends-On this ansible pull request". but we have no way of saying that at the moment, so that code makes things continue to work the way they do now with minimal fuss. | 22:53 |
clarkb | where you won't merge until the other has merged, but won't actually get that pull request in your git repo? | 22:53 |
jeblair | clarkb: you will get that pull request in the executor-staged copy of the ansible repo | 22:54 |
jeblair | (so the executor will stage both your change and the ansible pull request in the jobdir src directory) | 22:54 |
clarkb | but the PR would not be live in that case beacuse we aren't voting back on it? | 22:55 |
clarkb | we'd vote on zuul | 22:55 |
jeblair | clarkb: correct, it'd still be non-live | 22:55 |
clarkb | ok that makes a lot more sense now | 22:55 |
jeblair | clarkb: i mention this because the code i just wrote will break in that circumstance. so it'll have to change. but it's not possible to get into that state now, so i think it's good enough. | 22:56 |
jeblair | (because it assumes that all changes for foreign projects were pulled in via a depends-on from the same source/connection, which obviously won't be the case when you can do cross-connection-dependencies) | 22:57 |
clarkb | Shrews: just two more changes to review on topic:test-stability | 22:58 |
clarkb | Shrews: the last one is the best one too | 22:58 |
jeblair | also, holy cow i think it's time for a glossary. "item", "foreign", "live", "active", "cross-repo-dependency", "cross-connection-dependency" are all zuul-specific terms of art used in the last 5 minutes with very specific meanings. :) | 22:59 |
clarkb | indeed | 22:59 |
Shrews | clarkb: my pizza arrived. Basic math tells us that pizza > reviews | 23:01 |
jeblair | mordred: should i take "jeblair: I believe this matches what we talked about that one time" as a +2 on 459821? | 23:01 |
clarkb | Shrews: except that the last change will make your tests run twice as fast. So you can eat more pizza | 23:01 |
clarkb | :P | 23:01 |
clarkb | enjoy dinner | 23:01 |
Shrews | Will look after fooding | 23:02 |
mordred | jeblair: I was about to leave the actual +2 | 23:03 |
mordred | jeblair: also - yes, I think it's time for a glossary :) | 23:03 |
jeblair | mordred: ok cool :) | 23:03 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: remove getSourceByHostname https://review.openstack.org/459821 | 23:14 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Error if threads leak. https://review.openstack.org/459482 | 23:19 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Cleanup logging handlers https://review.openstack.org/459849 | 23:19 |
*** openstack has joined #zuul | 23:31 | |
mordred | clarkb: well, eatmydata does make sync() a no-op. I'm not sure if tmpfs does any more work than "return 0" on a sync or not? | 23:38 |