openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Handle build_set being None for priority https://review.openstack.org/508634 | 00:13 |
*** harlowja has quit IRC | 00:14 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Handle build_set being None for priority https://review.openstack.org/508634 | 00:14 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Protect against builds dict changing while we iterate https://review.openstack.org/508629 | 00:14 |
SpamapS | ooo I wish I hadn't been in meetings all day I'd love to do the load average limiter patch | 01:40 |
SpamapS | fungi: still working on it? | 01:41 |
fungi | SpamapS: i never even really got off the ground with it--if you want to take it, all yours! | 01:44 |
fungi | there's some discussion in here on a viable direction for it, at least | 01:45 |
fungi | if you haven't already caught up | 01:45 |
SpamapS | I did see that | 01:46 |
SpamapS | I'm wondering if we can just go simpler and limit things with a thread pool. | 01:46 |
SpamapS | if ansible jobs are the source of load and RAM usage, then limiting concurrency seems like the way to go. | 01:47 |
fungi | and just let the remaining jobs pile up in gearman until an executor has available threads again? i guess that would be the result | 01:50 |
SpamapS | yep | 01:51 |
SpamapS | but it's easier to just make a thread pool than monitor load | 01:51 |
SpamapS | and the way gearman works, busier executors will always respond slower than idle ones if the concurrency hasn't all been used up | 01:52 |
openstackgerrit | John L. Villalovos proposed openstack-infra/zuul feature/zuulv3: Fix pep8 error https://review.openstack.org/508643 | 01:55 |
*** harlowja has joined #zuul | 02:39 | |
*** harlowja has quit IRC | 02:55 | |
SpamapS | hm actually no | 03:00 |
SpamapS | if I just have a thread pool for jobs the executor server will slurp all of the jobs in. | 03:00 |
SpamapS | it's simpler than that anyway. I can have a counter for active jobs and deregister/register when it crosses the concurrency threshold | 03:01 |
jeblair | SpamapS: that works for a max job count, but we were thinking that load average might be more adaptive | 03:11 |
jeblair | (like, actual system load average) | 03:11 |
SpamapS | jeblair: It is, but it's also more complicated. ;) | 03:13 |
SpamapS | now that I'm digging in | 03:14 |
SpamapS | it's not thaaaat much more complicated | 03:15 |
SpamapS | I have it working where it will deregister when it has more than 5 jobs running, and re-register when it drops below 5 | 03:16 |
SpamapS | jeblair: a concurrency limit will also help control memory in a coarse kind of way. | 03:17 |
SpamapS | but if we can poll load average, we can poll free | 03:17 |
SpamapS | still I'm inclined to start with this and see how it goes. | 03:18 |
SpamapS | as much because I'm about to get home and I won't be coding for about 48 hours after that. ;) | 03:18 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Add a concurrency limit to zuul-executor https://review.openstack.org/508649 | 03:21 |
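The counter-based scheme described above (deregister the expensive gearman function once too many jobs are running, re-register when the count drops back below the limit) could look roughly like the sketch below. The method names registerFunction/unRegisterFunction and the function name execute:execute are assumptions taken from the discussion, not the contents of the 508649 patch itself.

```python
import threading


class ConcurrencyGovernor:
    """Sketch of a counter-based concurrency limit (illustrative only;
    not the actual change proposed in 508649)."""

    def __init__(self, worker, max_jobs=5):
        self.worker = worker          # assumed to be a gear.Worker-like object
        self.max_jobs = max_jobs
        self.active = 0
        self.registered = True
        self.lock = threading.Lock()

    def job_started(self):
        with self.lock:
            self.active += 1
            if self.registered and self.active >= self.max_jobs:
                # Stop advertising the expensive function; cheap functions
                # such as execute:stop stay registered so cancels still work.
                self.worker.unRegisterFunction('execute:execute')
                self.registered = False

    def job_finished(self):
        with self.lock:
            self.active -= 1
            if not self.registered and self.active < self.max_jobs:
                self.worker.registerFunction('execute:execute')
                self.registered = True
```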
SpamapS | anyway, I may hack on a load average based one later or something | 03:21 |
* SpamapS afk for a bit | 03:21 | |
jeblair | SpamapS: i'm not opposed to more tunable limits; i think what i'd like the default to be though is automatic based on load average. i'm not a fan of systems that make sysadmins guess tunable parameters when we have computers that can do it for them. but we can build on that. :) | 03:22 |
jeblair | i love ndb. i'd love it even more if it ran that perl script for me before starting. ;) | 03:22 |
SpamapS | hah yeah | 03:44 |
SpamapS | jeblair: I love adaptable systems too. Rarely do I get to write one. :) Sitting on a bus now, will see what flies out of me. | 03:45 |
SpamapS | I think the right thing is to just check load before getJob | 03:46 |
SpamapS | if load is too high, sleep a bit and check again. | 03:46 |
jeblair | SpamapS: thing is we need to keep getting jobs other than execute:execute though. especially execute:stop | 03:49 |
jeblair | well, i guess that's the only other one. | 03:50 |
jeblair | but it is important. :) | 03:50 |
SpamapS | oh right | 03:50 |
SpamapS | not sleep, unregister | 03:50 |
SpamapS | so check load... if too busy, unregister expensive | 03:50 |
SpamapS | I think that's the ticket | 03:50 |
SpamapS | note that there's still a tunable required | 03:51 |
SpamapS | Which is "what's an OK load?" | 03:51 |
jeblair | yeah, the load average. we could set it to nproc*3 | 03:51 |
jeblair | by default | 03:51 |
SpamapS | what's the 3 coming from? | 03:51 |
jeblair | swag | 03:51 |
SpamapS | +1 | 03:52 |
jeblair | looking at http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=63999&rra_id=1&view_type=&graph_start=1506660317&graph_end=1506742125&graph_height=120&graph_width=500&title_font_size=10 make it look like 20 is the magic number for those servers | 03:54 |
jeblair | so maybe nproc*2.5 :) | 03:54 |
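A load-average default derived from the machine itself, along the lines jeblair suggests, could be computed like this; the 2.5 factor is just the swag taken from the cacti graphs above:

```python
import multiprocessing
import os

# Hypothetical default: keep accepting work while the 1-minute load average
# stays below nproc * 2.5 (the guess from the graphs above).
LOAD_FACTOR = 2.5
max_load = multiprocessing.cpu_count() * LOAD_FACTOR

load_1min, _, _ = os.getloadavg()
accepting_work = load_1min < max_load
```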
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP Do not add implied branch matchers in project-templates https://review.openstack.org/508658 | 03:58 |
jeblair | clarkb, mordred: ^ i *think* that's 99% there; i think the logic is correct; i just need to write a commit message and do a little cleanup from a previous attempt. should be able to do that tomorrow. | 03:59 |
SpamapS | whoa that was weird | 04:01 |
SpamapS | I just did git review -d 508649, and got the content from 508658 | 04:01 |
SpamapS | almost like it collided with you uploading 508658 | 04:02 |
SpamapS | bah | 04:18 |
SpamapS | if I do it only before getJob... we stay doing nothing until a cancel/cat/merge comes in | 04:18 |
SpamapS | need a thread and I think a lock on the client :-P | 04:18 |
clarkb | can you do it without a thread as a noop handler? | 04:20 |
clarkb | iirc the server sends those gratuitously to wake workers? | 04:20 |
clarkb | or maybe it was different request | 04:21 |
SpamapS | no | 04:21 |
SpamapS | well | 04:21 |
SpamapS | yeah | 04:22 |
SpamapS | NOOP is what the server sends to say "Hey, you say you can do this, wake up and GRAB_JOB" | 04:22 |
SpamapS | so yes | 04:22 |
SpamapS | this delay thing is interesting | 04:22 |
SpamapS | Not sure why that's there. | 04:22 |
SpamapS | I mean I know why it says it is there. | 04:23 |
SpamapS | but that seems unnecessary. The delay should already be happening by virtue of the fact that the less busy servers should respond faster. | 04:23 |
clarkb | the problem is the job cost is delayed itself | 04:26 |
SpamapS | clarkb: so the problem is once we've unregistered from work, we won't get noop's anymore. | 04:26 |
clarkb | so they will more slowly grab jobs | 04:27 |
clarkb | but that doesn't matter when load is 80 because of ansible | 04:27 |
clarkb | SpamapS: aha | 04:27 |
SpamapS | and gearman will send us jobs for anything we're registered for, so if we don't unregister, getjob will still assign us jobs | 04:27 |
SpamapS | so we need something that periodically checks to see that we're ready for more work | 04:27 |
clarkb | but ya if you grab 50 jobs and load skyrockets you'll just continue to add on but more slowly | 04:28 |
SpamapS | yeah it's untenable this way | 04:28 |
SpamapS | so a thread that just sleeps and goes "am I taking work? If not, is load low enough to take more work? If yes, register" every few seconds seems like the right thing. | 04:29 |
SpamapS | but that also gets into "are gear.Worker's thread safe?" | 04:30 |
SpamapS | because the worker is likely to be in getJob() | 04:31 |
SpamapS | I wonder if we could just make the tunable "concurrency_factor" and basically say "multiply this times nproc to get the concurrent jobs"? | 04:32 |
SpamapS | because controlling the # of concurrent jobs is pretty easy | 04:32 |
SpamapS | since we're always going to be in control when jobs are started or finished. | 04:32 |
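If the knob is expressed as a factor rather than an absolute job count, the limit is a one-line derivation at startup; concurrency_factor is a hypothetical option name here:

```python
import multiprocessing


def max_concurrent_jobs(concurrency_factor=2.5):
    # Site-tunable factor times core count, with a floor of one job.
    return max(1, int(multiprocessing.cpu_count() * concurrency_factor))
```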
SpamapS | that also feels like a pretty reasonable factor to expect sites to tune as they optimize. We can make the default roughly what infra sees on its executors, but other sites may have very different jobs | 04:34 |
SpamapS | It's also a bit safer since taking on jobs while load is low and we're just waiting for a bunch of devstack runs may backfire if the devstack runs all finish at once and we're doing 50 concurrent rsyncs. | 04:34 |
SpamapS | we could even get good at tracking job cost eventually, and not go by the count of jobs, but by the count of expected job execution seconds/iops/etc. | 04:36 |
SpamapS | anyway.. home now... will ponder more. If we see it getting out of control the patch I made will at least give us a governor | 04:37 |
SpamapS | Actually I just check load in finish and start. I can make a special exception never to unregister in finish if it would take me below 1 job running. | 04:48 |
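Checking only at job start and finish, with the exception described above so the executor never unregisters itself into a state where nothing can finish, might look like this sketch (names are illustrative, and the approach is superseded later in the discussion by a periodic check):

```python
import os


def check_load(worker, max_load, running_jobs, registered):
    """Start/finish-time load check only (no polling thread).
    Returns the new registration state; all names are illustrative."""
    load_1min, _, _ = os.getloadavg()
    if load_1min > max_load:
        # Never unregister if that would leave fewer than one job running,
        # since then nothing would ever finish and re-trigger this check.
        if registered and running_jobs >= 1:
            worker.unRegisterFunction('execute:execute')
            return False
    elif not registered:
        worker.registerFunction('execute:execute')
        return True
    return registered
```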
* SpamapS may not sleep until this is implemented | 04:48 | |
*** xinliang has quit IRC | 07:46 | |
*** xinliang has joined #zuul | 07:59 | |
*** bhavik1 has joined #zuul | 09:31 | |
*** bhavik1 has quit IRC | 09:35 | |
*** xinliang has quit IRC | 09:35 | |
*** huangtianhua has quit IRC | 11:11 | |
*** zhuli has quit IRC | 13:15 | |
openstackgerrit | Jeremy Stanley proposed openstack-infra/zuul-jobs master: Yet still more fix post log location https://review.openstack.org/508684 | 14:31 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Handle build_set being None for priority https://review.openstack.org/508634 | 15:04 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Yet still more fix post log location https://review.openstack.org/508684 | 15:06 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Protect against builds dict changing while we iterate https://review.openstack.org/508629 | 15:06 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Make fetch-tox-output more resilient https://review.openstack.org/508563 | 15:28 |
jeblair | SpamapS: i was about to say "oh it's totally safe to call register functions from another thread", and in fact we do exactly that in zuul v2.5 | 15:49 |
jeblair | SpamapS: i went to double check that though, and i did find that we may want this: remote: https://review.openstack.org/508698 Add a send lock to the base Connection class | 15:49 |
jeblair | SpamapS: zuul v2.5 was not using ssl, but v3 is. so we were very unlikely to see a problem with that in v2, but somewhat more likely in v3. | 15:50 |
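The shape of that change is a per-connection lock around the write path, so two threads can't interleave bytes on the same (possibly SSL-wrapped) socket; this is only a sketch of the idea, not the code in 508698:

```python
import threading


class LockedConnection:
    """Illustrative only: serialize sends on a shared socket."""

    def __init__(self, sock):
        self.sock = sock                  # plain or SSL-wrapped socket
        self.send_lock = threading.Lock()

    def send_bytes(self, data):
        # Without the lock, two threads writing concurrently can interleave
        # their bytes (especially visible with SSL record framing).
        with self.send_lock:
            while data:
                sent = self.sock.send(data)
                data = data[sent:]
```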
jeblair | SpamapS: also, a downside to only checking at start/end of jobs is that if we unregister, and a bunch of jobs then finish while the load is high, we stay unregistered; if the load then drops but it's an hour until the next job finishes, we end up substantially underutilized. so i favor something that checks more regularly. | 15:52 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Make fetch-tox-output more resilient https://review.openstack.org/508563 | 15:55 |
SpamapS | jeblair: yeah in fact finishJob sends work complete from another thread from what I see. | 16:13 |
jeblair | good point, we're already playing the odds | 16:14 |
SpamapS | Yeah so I think I can just start a thread that checks load every few seconds and registers or unregisters appropriately. | 16:15 |
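That polling thread might look like the following; the class name, the five-second interval, and the load factor are assumptions for illustration rather than what 508649 ends up doing:

```python
import multiprocessing
import os
import threading
import time


class LoadGovernor(threading.Thread):
    """Sketch: periodically re-check load and register/unregister the
    expensive gearman function; cheap functions stay registered."""

    def __init__(self, worker, interval=5.0, load_factor=2.5):
        super().__init__(daemon=True)
        self.worker = worker
        self.interval = interval
        self.max_load = multiprocessing.cpu_count() * load_factor
        self.registered = True
        self._running = True

    def run(self):
        while self._running:
            load_1min, _, _ = os.getloadavg()
            if self.registered and load_1min > self.max_load:
                self.worker.unRegisterFunction('execute:execute')
                self.registered = False
            elif not self.registered and load_1min <= self.max_load:
                self.worker.registerFunction('execute:execute')
                self.registered = True
            time.sleep(self.interval)

    def stop(self):
        self._running = False
```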
SpamapS | And then maybe we can put some armor on gear to make it less of a dice roll. | 16:16 |
jeblair | SpamapS: https://review.openstack.org/508698 should be armor | 16:19 |
SpamapS | I will say also that I'm not sure gearman is the best protocol for this. AMQP has a specific response which is "send this to somebody else I'm too busy" will would allow us to make per-job cost decisions. | 16:19 |
SpamapS | s/will/which/. DYAC | 16:20 |
SpamapS | Nice! | 16:21 |
SpamapS | Re: lock | 16:21 |
SpamapS | oh man I just discovered mosh the other day.. nice that my irc session never disconnects now | 16:24 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Handle build_set being None for priority https://review.openstack.org/508634 | 16:42 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Limit concurrency in zuul-executor under load https://review.openstack.org/508649 | 16:48 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Do not add implied branch matchers in project-templates https://review.openstack.org/508658 | 16:48 |
SpamapS | jeblair: ^ load based, definitely could end up trying to send at the same time we're sending other stuff so may add another dice roll. ;) | 16:50 |
SpamapS | I'm going to test it out in my GD internal zuul. | 16:51 |
jeblair | i'm never going to stap giggling whenever you say that :) | 16:51 |
jeblair | stop even | 16:51 |
*** bhavik1 has joined #zuul | 16:54 | |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Yet still more fix post log location https://review.openstack.org/508684 | 16:54 |
*** bhavik1 has quit IRC | 16:56 | |
*** bhavik1 has joined #zuul | 16:56 | |
*** bhavik1 has quit IRC | 16:58 | |
*** bhavik1 has joined #zuul | 16:58 | |
*** bhavik1 has quit IRC | 17:00 | |
SpamapS | jeblair: It's a great GD zuul. ;) | 17:07 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Do not add implied branch matchers in project-templates https://review.openstack.org/508658 | 17:19 |
openstackgerrit | David Moreau Simard proposed openstack-infra/zuul feature/zuulv3: Delete IncludeRole object from result object for include_role tasks https://review.openstack.org/504238 | 17:57 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Do not add implied branch matchers in project-templates https://review.openstack.org/508658 | 18:11 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Protect against builds dict changing while we iterate https://review.openstack.org/508629 | 20:39 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Multi-node: Set up hosts file https://review.openstack.org/504552 | 21:45 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Multi-node: Set up firewalls https://review.openstack.org/504553 | 21:45 |
*** mrhillsman has joined #zuul | 21:55 |