*** rlandy has quit IRC | 00:27 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Support cross-source dependencies https://review.openstack.org/530806 | 00:48 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Add cross-source tests https://review.openstack.org/532699 | 00:48 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Compress testr results.html before fetching it https://review.openstack.org/533828 | 00:49 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool feature/zuulv3: Clarify provider manager vs provider config https://review.openstack.org/531618 | 01:09 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Revert "Add consolidated role for processing subunit" https://review.openstack.org/533831 | 01:10 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Revert "Revert "Add consolidated role for processing subunit"" https://review.openstack.org/533834 | 01:17 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Adjust check for .stestr directory https://review.openstack.org/532688 | 01:33 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Revert "Add consolidated role for processing subunit" https://review.openstack.org/533831 | 01:55 |
*** haint_ has joined #zuul | 02:01 | |
*** threestrands_ has joined #zuul | 02:02 | |
*** haint has quit IRC | 02:04 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool feature/zuulv3: Do pep8 housekeeping according to zuul rules https://review.openstack.org/522945 | 02:04 |
*** jappleii__ has quit IRC | 02:04 | |
*** dtruong2 has joined #zuul | 02:46 | |
*** tflink has quit IRC | 02:54 | |
*** dtruong2 has quit IRC | 03:37 | |
*** jaianshu has joined #zuul | 03:47 | |
*** tflink has joined #zuul | 04:03 | |
*** dtruong2 has joined #zuul | 04:17 | |
*** dtruong2 has quit IRC | 04:26 | |
*** dtruong2 has joined #zuul | 04:40 | |
*** bhavik1 has joined #zuul | 05:17 | |
*** dtruong2 has quit IRC | 05:17 | |
*** bhavik1 has quit IRC | 05:56 | |
*** dtruong2 has joined #zuul | 06:08 | |
tobiash | corvus, dmsimard: re executor governor discussion of the meeting | 06:11 |
tobiash | couldn't participate as it's too late for me (every time I attend I'm kind of jetlagged the whole next day) | 06:12 |
tobiash | I think we need something smarter than the simple on/off approach based on current load/ram | 06:13 |
tobiash | I think we need something like what tcp does with slow start and a sliding window | 06:17
*** dtruong2 has quit IRC | 06:24 | |
SpamapS | tobiash: could poll stats more aggressively on changes, and back off over time. | 06:26 |
tobiash | SpamapS: I'm thinking about something like the slow start and a congestion window where we could hook in generic 'sensors' like load, ram | 06:33 |
tobiash | e.g. the ram governor would not work for me in the openshift/kubernetes environment as it doesn't know how much of the free ram it is allowed to take | 06:34 |
tobiash | for this we will have to check the cgroups | 06:34 |
tobiash | and I like the idea of a general congestion algorithm where we can put sensors into rather than having its own governor for each metric | 06:35 |
tobiash | I'll think more about this topic and maybe write a post to the ml | 06:36 |
clarkb | tcp slow start was the model used with the dependent pipeline windowing too | 06:36 |
tobiash | I think that could be a good fit | 06:39 |
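A minimal sketch of the slow-start / congestion-window governor with pluggable sensors described above. Everything here (class names, the load threshold, the additive-increase/multiplicative-decrease policy) is illustrative and assumed, not Zuul's actual executor code.

```python
import os


class Sensor:
    """A pluggable sensor reports True while its metric still has headroom."""

    def ok(self):
        raise NotImplementedError


class LoadSensor(Sensor):
    # A RamSensor could read cgroup limits instead of system-wide free memory,
    # per the openshift/kubernetes concern above.
    def __init__(self, max_load_per_cpu=2.5):
        self.limit = max_load_per_cpu * (os.cpu_count() or 1)

    def ok(self):
        return os.getloadavg()[0] < self.limit


class CongestionWindow:
    """TCP-style window: grow while all sensors report headroom, shrink otherwise."""

    def __init__(self, sensors, start=1, maximum=64):
        self.sensors = sensors
        self.window = start
        self.maximum = maximum
        self.running = 0  # jobs currently accepted by this executor

    def can_accept_job(self):
        if all(s.ok() for s in self.sensors):
            # additive increase while healthy (slow start could double instead)
            self.window = min(self.window + 1, self.maximum)
        else:
            # multiplicative decrease on "congestion"
            self.window = max(self.window // 2, 1)
        return self.running < self.window
```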
*** dtruong2 has joined #zuul | 06:44 | |
*** dtruong2 has quit IRC | 06:55 | |
SpamapS | querying the cgroups is also not a bad idea at all. | 07:09 |
*** threestrands_ has quit IRC | 07:18 | |
openstackgerrit | Andreas Jaeger proposed openstack-infra/zuul-jobs master: Remove testr and stestr specific roles https://review.openstack.org/529340 | 07:28 |
*** hashar has joined #zuul | 08:14 | |
SpamapS | https://github.com/facebook/pyre2/pull/10 | 08:19 |
SpamapS | So now to figure out what to do about the MIA author. | 08:19 |
*** jpena|off is now known as jpena | 08:46 | |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Use re2 for change_matcher https://review.openstack.org/534190 | 08:50 |
SpamapS | tobiash: ^ patch to use re2 for change matchers. | 08:50 |
tobiash | cool | 08:51 |
SpamapS | yeah.. note that pyre2 has not merged my PR yet. | 08:51 |
SpamapS | and that's blocked on GoDaddy signing Facebook's Corp CLA | 08:51 |
SpamapS | so it might be a while before we can merge that. :-P | 08:51 |
tobiash | SpamapS: how long do you expect? | 08:52 |
tobiash | signing the openstack cla took me a year... | 08:52 |
SpamapS | No clue, it's my first time asking for a CLA to be signed. ;) | 08:52 |
SpamapS | We've signed about 15 of them | 08:52 |
SpamapS | so I expect we'll do this one relatively "easily" | 08:52 |
SpamapS | I just dunno how long it takes. | 08:52 |
tobiash | the openstack cla was our first ;) | 08:52 |
SpamapS | Yeah, IMO they're all stupid. | 08:53 |
SpamapS | Lawyers making work for themselves. | 08:53 |
tobiash | yepp | 08:53 |
SpamapS | But I guess nobody wants another SCO vs. Linux situation. | 08:53 |
tobiash | probably | 08:53 |
SpamapS | Also wondering if we can get some speed gains by using re2 | 08:54 |
SpamapS | in some cases it is 10x faster. | 08:54 |
SpamapS | Not that scheduler CPU has been an issue. | 08:54 |
tobiash | that would be cool, but I guess we're not limited by regex parsing currently | 08:54 |
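For context, the change-matcher patch above mostly comes down to where the pattern gets compiled. A hedged sketch of the "prefer re2, fall back to re" shape, assuming the pyre2 binding exposes an re-compatible compile(); this is not the actual code in 534190.

```python
import logging
import re

try:
    import re2  # facebook/pyre2 binding; assumed to mirror the re API
except ImportError:
    re2 = None

log = logging.getLogger("example.re2_matcher")


def compile_pattern(pattern):
    """Prefer re2 (linear time, no catastrophic backtracking); fall back to re."""
    if re2 is not None:
        try:
            return re2.compile(pattern)
        except Exception:
            # e.g. '^(?!stable)': re2 has no lookahead, so patterns like this
            # are rejected and we fall back to the stdlib engine.
            log.warning("Pattern %r not supported by re2, using re", pattern)
    return re.compile(pattern)
```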
*** sshnaidm|afk is now known as sshnaidm | 09:52 | |
*** saop has joined #zuul | 10:06 | |
saop | tristanC, Hello | 10:06 |
saop | tristanC, Now the CI is able to pick up the job and run it | 10:07
saop | tristanC, But in the ansible execution phase, we are getting the error: Ansible output: b'Unknown option --unshare-all' | 10:08
*** ankkumar has joined #zuul | 10:08 | |
saop | tristanC, Sending result: {"data": {}, "result": "FAILURE"} | 10:08 |
saop | tristanC, any idea about that? | 10:08 |
saop | tristanC, We have very basic ansible playbook file | 10:08 |
tristanC | saop: it seems like you need a more recent bubblewrap version | 10:08 |
saop | tristanC, thanks | 10:09 |
tristanC | well, --unshare-all was added in bwrap-0.1.7 | 10:11 |
saop | tristanC, We installed some 0.1.2 | 10:14 |
saop | tristanC, will upgrade now | 10:14 |
tristanC | saop: what distrib are you using? | 10:14 |
saop | tristanC, ubuntu xenial | 10:14 |
saop | tristanC, One more question, our CI is able to post the result but it's not showing in gerrit, but when we do toggle CI we can see something like: Our CI: test-basic finger://ubuntu/f3f5f726a27a480687b826d9bf6a3e57 : FAILURE in 22s | 10:21
saop | tristanC, Do we need to have any configuration for that? | 10:21 |
tristanC | saop: you mean result in the table under vote? | 10:22 |
saop | tristanC, Yes | 10:23 |
saop | tristanC, you can check here: https://review.openstack.org/#/c/534137/ | 10:23 |
saop | tristanC, toggle for HPE Proliant CI | 10:23 |
tristanC | saop: iirc there is some javascript magic that parses ci comments, and it only works when the result has http links | 10:23
saop | tristanC, Now we are getting in ansible b'Unknown option --die-with-parent' | 10:26 |
saop | tristanC, Do we need to upgrade more? | 10:26
*** AJaeger has quit IRC | 10:26 | |
saop | tristanC, We are using bwrap-0.1.7 | 10:26 |
tristanC | saop: yes, i meant you need the most recent bubblewrap version, that is 0.2.0 | 10:27
saop | tristanC, ohh okay | 10:27 |
tristanC | perhaps ubuntu only needs 0.1.8 | 10:28 |
*** AJaeger has joined #zuul | 10:32 | |
saop | tristanC, how to set log url in zuul v3? | 10:50 |
tristanC | saop: you should use https://docs.openstack.org/infra/zuul-jobs/roles.html#role-upload-logs or build something similar | 10:54 |
*** electrofelix has joined #zuul | 11:01 | |
*** fbo has joined #zuul | 11:31 | |
*** jpena is now known as jpena|lunch | 12:35 | |
*** weshay_PTO is now known as weshay | 13:01 | |
saop | tristanC, I created post-logs.yaml according to the documentation and referenced it in the job's post-run section, but it didn't execute, any idea? | 13:07
tristanC | saop: to debug this kind of failure, use the executor keep option to get the raw ansible job logs in /tmp | 13:12 |
saop | tristanC, Thanks | 13:13 |
*** rlandy has joined #zuul | 13:30 | |
*** rlandy_ has joined #zuul | 13:30 | |
*** rlandy_ has quit IRC | 13:30 | |
*** jpena|lunch is now known as jpena | 13:39 | |
*** jaianshu_ has joined #zuul | 13:46 | |
*** jaianshu has quit IRC | 13:49 | |
*** jaianshu_ has quit IRC | 13:50 | |
*** saop has quit IRC | 13:51 | |
*** ankkumar has quit IRC | 14:05 | |
*** dkranz has joined #zuul | 14:13 | |
pabelanger | http://grafana.openstack.org/dashboard/db/nodepool-inap is an interesting pattern with test nodes, I wonder if we are getting a large number of locked ready nodes pooling up that cannot transition to in-use because we cannot fulfill the requests (no more quota) | 15:00
pabelanger | we end up getting more than 1/2 the nodes marked ready, then seem to use them all at once | 15:00
pabelanger | maybe we had a gate reset during that time too? | 15:01 |
Shrews | pabelanger: what are you seeing in zk for that provider? | 15:10 |
pabelanger | Shrews: anything specific I should be looking for? Just that we are at quota for the provider, and we have locked ready nodes, waiting for other nodes to come online | 15:13 |
Shrews | pabelanger: first thing i'd look at is if the ready&locked nodes have been around for a long time. if so, could be an issue we might need to look into. otherwise, might be a normal pattern | 15:15 |
*** bhavik1 has joined #zuul | 15:16 | |
pabelanger | Shrews: yah, I'll see if I can find pattern in logs | 15:18 |
*** flepied has joined #zuul | 15:33 | |
*** flepied_ has joined #zuul | 15:33 | |
*** flepied_ has quit IRC | 15:33 | |
*** bhavik1 has quit IRC | 16:09 | |
*** hashar has quit IRC | 16:32 | |
*** dkranz has quit IRC | 16:36 | |
*** jpena is now known as jpena|off | 16:37 | |
*** jpena|off is now known as jpena | 16:41 | |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Only decline requests if no cloud can service them https://review.openstack.org/533372 | 16:46 |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Add test_launcher test https://review.openstack.org/533771 | 16:46 |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Only fail requests if no cloud can service them https://review.openstack.org/533372 | 16:48 |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Add test_launcher test https://review.openstack.org/533771 | 16:48 |
*** dkranz has joined #zuul | 16:52 | |
*** tflink has quit IRC | 16:54 | |
*** tflink has joined #zuul | 16:57 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Add cross-source tests https://review.openstack.org/532699 | 17:15 |
*** dkranz has quit IRC | 17:15 | |
*** dkranz has joined #zuul | 17:27 | |
*** zigo has quit IRC | 17:33 | |
*** openstackgerrit has quit IRC | 17:33 | |
*** sshnaidm is now known as sshnaidm|afk | 17:34 | |
*** sshnaidm|afk has quit IRC | 17:34 | |
*** bhavik1 has joined #zuul | 17:35 | |
*** bhavik1 has quit IRC | 17:35 | |
*** zigo has joined #zuul | 17:37 | |
*** openstackgerrit has joined #zuul | 17:49 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Support cross-source dependencies https://review.openstack.org/530806 | 17:49 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add cross-source tests https://review.openstack.org/532699 | 17:49 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Documentation changes for cross-source dependencies https://review.openstack.org/534397 | 17:49 |
corvus | that patch series is a 3.0 release blocker and is ready for review ^ | 17:49 |
corvus | it may be worth reading the docs change first | 17:50 |
SpamapS | corvus: so, re2 doesn't support this regexp: '^(?!stable)' .. apparently that negative lookahead is one of the constructs it rejects. | 17:50
SpamapS | wondering if there's a more efficient way to say "doesn't start with stable" | 17:51
corvus | hrm. well, i think we're going to need that less in the future, however, we still have vestigial uses of it now, and it's proven so useful in the past i worry about not being able to use something like that... | 17:52
SpamapS | yeah, re2 seems to only support negation of character classes, not strings. | 17:56 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Revert "Revert "Add consolidated role for processing subunit"" https://review.openstack.org/533834 | 17:58 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Adjust check for .stestr directory https://review.openstack.org/532688 | 17:58 |
SpamapS | corvus: we could make an irrelevant-branches maybe? | 18:02 |
*** sshnaidm|afk has joined #zuul | 18:03 | |
SpamapS | which would allow positive matching | 18:03 |
corvus | SpamapS: an alternative might be to make the zuul config language more sophisticated, so you could specify boolean ops on the regexes. something like "branches: [not: stable/.*]". the underlying classes are sophisticated enough to support that, it's just not exposed in syntax. | 18:03 |
corvus | SpamapS: we had similar ideas simultaneously :) | 18:03 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove updateChange history from github driver https://review.openstack.org/531904 | 18:07 |
SpamapS | corvus: Might be good to wrap this up before 3.0. Would be a shame to release with a DoS'able scheduler feature. | 18:19 |
corvus | SpamapS: there are still so many other ways to DoS, and this one has been out there for 6 years | 18:21 |
corvus | openstack's zuul runs out of memory every 4 days just through normal use, and i wasn't even planning on considering that a 3.0 blocker at this point. | 18:22 |
SpamapS | :-/ | 18:22 |
corvus | arbitrary node set size is another one | 18:22 |
SpamapS | kk, 3.1 -- the hardening release. ;) | 18:23 |
SpamapS | (where we let you turn off re maybe? :-P ) | 18:23 |
corvus | SpamapS: ++ 3.1 hardening, but i think we should avoid making language features configurable -- we should either reduce the scope of re (ie, switch to re2, possibly compensate by irrelevant-branches or booleans), or drop it altogether | 18:25
SpamapS | Yeah I like the idea of allowing only positive matches and using re2. | 18:25 |
SpamapS | The CLA process has begun here so hopefully we can get my py3k support merged and released relatively quickly. | 18:27 |
corvus | SpamapS: if you end up being the maintainer, the CLA process won't matter anyway :) | 18:27 |
* fungi feels sorry at a personal level that this cla is still hanging around | 18:27 |
corvus | fungi: different cla | 18:27 |
fungi | oh! | 18:27 |
fungi | hah, i missed that | 18:27 |
corvus | facebook i think? | 18:28 |
SpamapS | corvus: actually the maintainer may have re-appeared. Apparently facebook changed their email domain and they didn't update their addresses in the README (they were @facebook.com and are now @fb.com) | 18:28 |
clarkb | if we don't do negative lookahead do we even need regexes? | 18:28 |
fungi | regardless, not *another* cla i need to feel bad about | 18:28 |
clarkb | could just list all positive matches. Would be more verbose but avoids the re problem entirely | 18:28 |
clarkb | (then just string == otherstring) | 18:29 |
corvus | clarkb: agreed, i think that's something worth considering, and tbh, i would prefer that to our (openstack) habit of negative lookaheads regardless. | 18:29 |
fungi | i suppose if you don't need negative lookahead you could get away with implementing positive matches via an expansion syntax to save some space | 18:30 |
fungi | e.g. branch: stable/{ocata,pike},master | 18:30 |
fungi | then internally expand and match against the resulting list | 18:31 |
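Purely as an illustration of the "expand, then compare literal strings" direction (the brace syntax is fungi's off-the-cuff example, not an implemented Zuul feature):

```python
import re


def split_top_level(spec):
    """Split 'stable/{ocata,pike},master' on commas outside braces."""
    parts, buf, depth = [], '', 0
    for ch in spec:
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
        if ch == ',' and depth == 0:
            parts.append(buf)
            buf = ''
        else:
            buf += ch
    parts.append(buf)
    return parts


def expand(part):
    """'stable/{ocata,pike}' -> ['stable/ocata', 'stable/pike']"""
    m = re.search(r'\{([^}]*)\}', part)
    if not m:
        return [part]
    head, tail = part[:m.start()], part[m.end():]
    return [x for alt in m.group(1).split(',') for x in expand(head + alt + tail)]


def branch_matches(branch, spec):
    """Exact string comparison only -- no regex engine involved at match time."""
    return any(branch == candidate
               for top in split_top_level(spec)
               for candidate in expand(top))


# branch_matches('stable/pike', 'stable/{ocata,pike},master')  -> True
```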
corvus | this would be a really good mailing list discussion -- have folks weigh in on "regex positive-only matches vs no regex at all" | 18:31 |
clarkb | oh that's what I need to do, sign up for ml again | 18:31
fungi | yup! | 18:31 |
* clarkb wonders if that will make mailman mad | 18:31 | |
corvus | fungi: if we do forward expansions, we may as well use re2 rather than rolling our own, i'd think | 18:31 |
fungi | i took the hint yesterday and finally signed up for the lists.zuul-ci.org mailing lists | 18:32 |
clarkb | I signed up last week but things were not working | 18:32
fungi | corvus: oh, that's a feature of re2? even nicer | 18:32 |
corvus | fungi: er, i dunno? maybe i'm making stuff up. | 18:32 |
fungi | that's cool too ;) | 18:32 |
fungi | everyone needs a hobby | 18:33 |
corvus | fungi: i just assumed that something like "stable/(foo|bar)" would work | 18:33 |
fungi | oh, sure | 18:33
clarkb | re2 does safer, more performant regexes with fewer magical features like negative lookahead | 18:33
corvus | SpamapS: the other thing is there are some more uses of regex -- i think some of the pipeline trigger / approval matching stuff uses it. maybe it's okay to leave that since that's config-repos only. maybe we just need to identify all the untrusted-repo uses of regex and address them. | 18:34 |
SpamapS | corvus: yeah I was mostly concerned with job config. Did not dig into other uses. | 18:34 |
SpamapS | At some point it seems like it would make sense to just use re2 everywhere since it is also far faster even on the simple cases. | 18:35 |
corvus | files needs either regex or glob. so if we entertain dropping regex, we'd have to glob there i think. | 18:35 |
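If the files matcher went glob-only, the stdlib already covers it; a quick sketch, not Zuul's implementation:

```python
from fnmatch import fnmatch


def files_match(changed_files, patterns):
    """True if any changed file matches any glob pattern."""
    return any(fnmatch(f, pat) for f in changed_files for pat in patterns)


# files_match(['doc/source/index.rst'], ['doc/*', 'releasenotes/*'])  -> True
# (note fnmatch's '*' crosses '/', unlike shell pathname globbing)
```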
corvus | talk of mailing lists reminds me... | 18:36 |
-corvus- Please sign up for new zuul mailing lists: http://lists.openstack.org/pipermail/openstack-infra/2018-January/005800.html | 18:36 | |
corvus | and we should update the readme/docs too | 18:37 |
*** sshnaidm|afk is now known as sshnaidm | 18:39 | |
*** jpena is now known as jpena|off | 18:45 | |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Clarify provider manager vs provider config https://review.openstack.org/531618 | 18:46 |
*** electrofelix has quit IRC | 18:57 | |
*** electrofelix has joined #zuul | 18:58 | |
*** flepied has quit IRC | 19:07 | |
*** flepied has joined #zuul | 19:08 | |
*** flepied has quit IRC | 19:30 | |
*** flepied has joined #zuul | 19:30 | |
*** harlowja has joined #zuul | 19:44 | |
openstackgerrit | Andreas Jaeger proposed openstack-infra/zuul-jobs master: Refactor fetch-subunit-output https://review.openstack.org/534427 | 20:04 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Clean up when conditions in fetch-subunit-output https://review.openstack.org/534428 | 20:05 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Use subunit2html from tox envs instead of os-testr-env https://review.openstack.org/534429 | 20:05 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Make fetch-subunit-output work with multiple tox envs https://review.openstack.org/534430 | 20:05 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Move testr finding to a script https://review.openstack.org/534431 | 20:05 |
*** hashar has joined #zuul | 20:07 | |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Revert "Revert "Add consolidated role for processing subunit"" https://review.openstack.org/533834 | 20:10 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Move testr finding to a script https://review.openstack.org/534431 | 20:23 |
pabelanger | Shrews: we seem to be continuing to accumulate ready / locked nodes in inap ATM. I'm trying to see why new requests are coming online when existing ones haven't been fulfilled yet | 20:25
pabelanger | we are up to 32 ready / locked nodes right now | 20:25 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Move testr finding to a script https://review.openstack.org/534431 | 20:28 |
pabelanger | odd, up to 52 now | 20:29 |
pabelanger | something is going on | 20:29 |
Shrews | pabelanger: "why new requests are coming online, when exists ones haven't been fulfilled yet" ... that confuses me. requests are allowed to exist in parallel for a provider, so not sure what you mean by that | 20:30 |
Shrews | but i will see if i can glean anything from the logs | 20:31 |
pabelanger | 2018-01-16 20:28:04,870 DEBUG nodepool.driver.openstack.OpenStackNodeRequestHandler[nl01.openstack.org-30110-PoolWorker.inap-mtl01-main]: Fulfilled node request 100-0002104682 | 20:33 |
pabelanger | at that point, we seem to fulfill about 52 nodes, but I don't see why it took so long | 20:33
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Documentation changes for cross-source dependencies https://review.openstack.org/534397 | 20:33 |
pabelanger | some of them appear to be single ubuntu-xenial nodes | 20:33 |
Shrews | pabelanger: i do not see any ready&locked nodes | 20:34 |
pabelanger | Shrews: yah, they just unlocked at the timestamp above | 20:35
pabelanger | if you look at logs, we fulfilled a bunch of requests at one time, upwards of 45 nodes | 20:35 |
Shrews | pabelanger: we are at capacity | 20:35 |
Shrews | rather, inap is | 20:35 |
pabelanger | http://grafana.openstack.org/dashboard/db/nodepool-inap shows the spike in avail ready nodes | 20:36 |
Shrews | max-servers is 190 | 20:36 |
pabelanger | Shrews: right, but I am trying to understand how we keep growing ready / locked nodes. If at capacity, don't the next node requests go towards existing open nodesets waiting for nodes? | 20:37
pabelanger | For example: http://grafana.openstack.org/dashboard/db/nodepool-inap?from=1516133623844&to=1516134574705&var-provider=All | 20:38 |
Shrews | wow, i cannot grok that sentence for some reason. :) | 20:38 |
Shrews | pabelanger: i'm not sure i | 20:38 |
pabelanger | for almost 15mins, we grew ready / locked nodes, until something happened for them to all dump | 20:38 |
pabelanger | almost 30% of capacity grew to be idle for 15mins | 20:39 |
pabelanger | http://grafana.openstack.org/dashboard/db/nodepool-rackspace?from=1516125202560&to=1516126777140 is another interesting graph for rackspace | 20:41 |
pabelanger | IAD had 87 available nodes but only 37 in-use | 20:41
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove updateChange history from github driver https://review.openstack.org/531904 | 20:44 |
corvus | clarkb: i'm a little confused by https://storyboard.openstack.org/#!/story/2001427 -- that diagram makes it look like C is the parent of B, but the text says that B is the parent of C. | 20:48 |
corvus | pabelanger: you should entertain the idea that zuul doesn't have enough executors to handle all of the jobs. zuul collects nodes before it gives them to executors. i see some small and large spikes on the executor queue graph. that means zuul is spending at least some time waiting for executors to pick up jobs. | 20:54
corvus | in those cases, the nodes will be ready and locked by zuul, and their requests will be fulfilled. | 20:55
corvus | no idea if that's what you're seeing, just throwing it out there. | 20:55 |
pabelanger | corvus: okay, let me confirm, but I think nodepool is still holding the lock at this point. Would zuul be involved if that is still the case? | 20:56 |
pabelanger | the part I am trying to figure out is, I see the following in the logs: | 20:57 |
pabelanger | 2018-01-16 20:28:04,866 DEBUG nodepool.driver.openstack.OpenStackNodeRequestHandler[nl01.openstack.org-30110-PoolWorker.inap-mtl01-main]: Pausing request handling to satisfy request | 20:57 |
pabelanger | <snipped for size> | 20:57 |
pabelanger | then nodepool start to unlock nodes | 20:57 |
pabelanger | which brings ready / locked nodes back down to zero | 20:58 |
pabelanger | it seems we only pause once at capacity, according to comments in nodepool/driver/openstack/handler.py | 20:59
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add skipped CRD tests https://review.openstack.org/531887 | 21:01 |
*** haint_ has quit IRC | 21:01 | |
Shrews | pabelanger: i'm not sure what's happening here, but it does seem that something is keeping completed request handlers from being processed | 21:02 |
Shrews | in a timely fashion | 21:02 |
*** haint_ has joined #zuul | 21:02 | |
pabelanger | I'm trying to walk myself through the poll function in nodepool/driver/__init__.py | 21:05
*** rlandy_ has joined #zuul | 21:05 | |
corvus | clarkb: though, i think the direction of that arrow doesn't actually matter, the resulting trees are equivalent (though the final tree would be BCA not CBA) | 21:05 |
*** rlandy__ has joined #zuul | 21:05 | |
*** rlandy__ has quit IRC | 21:05 | |
Shrews | pabelanger: theory... i believe something in _assignHandlers(), where it loops through the node requests and processes them, is causing a significant delay within the loop | 21:08 |
Shrews | pabelanger: because until that loop finishes, it will never remove the completed handlers (and i'm not seeing that happen until minutes later in the log) | 21:09 |
Shrews | i only see zookeeper communication happening there, so i wonder if something is wonky with that | 21:10 |
Shrews | an overloaded ZK? | 21:10 |
Shrews | or communication issues with it? | 21:10 |
pabelanger | 1 sec, need to AFk quickly | 21:11 |
clarkb | corvus: C is the parent of B that is what the text says | 21:12 |
Shrews | java.io.IOException: No space left on device | 21:12 |
Shrews | on zookeeper | 21:12 |
Shrews | wheeeee | 21:12 |
clarkb | corvus: C -> B -> A, then separately C and B depends on A | 21:12 |
Shrews | pabelanger: ^^^ | 21:12 |
corvus | clarkb: right, i was confused by "Change C has parent change B" which means B is the parent of C. | 21:12 |
clarkb | corvus: actually the depends on may just be C to A | 21:13 |
Shrews | pabelanger: oh, nm. that's an old log | 21:13 |
corvus | clarkb: but honestly, i don't think it matters, we can just pick one and stick with it. :) | 21:13 |
clarkb | er now I'm all confused. In any case I think the arrows are correct :) | 21:13
clarkb | corvus: ya, basically you just need a dag that looks like a cycle but with one of the arrows pointed the wrong way | 21:13 |
corvus | clarkb: yeah, both sets of arrows match. | 21:13 |
pabelanger | sorry, #dadop | 21:14 |
clarkb | Shrews: ya nodepool.o.o has ~50% of / free | 21:14 |
pabelanger | I'm looking at cacti.o.o now | 21:14 |
corvus | it looks busy, but not critically so. | 21:14 |
pabelanger | to see if we are spiking anything | 21:14 |
clarkb | could it be the executors just aren't grabbing more jobs? | 21:14
clarkb | or not grabbing nodes from nodepool fast enough? | 21:15 |
pabelanger | if I understand logs, I don't think nodepool is releasing them to zuul | 21:15 |
pabelanger | but will let Shrews confirm | 21:15 |
corvus | clarkb: it's possible that's happened a few times, but i think Shrews and pabelanger have also found behavior that it doesn't explain, and it's more common | 21:15
corvus | (the grafana graphs suggest that has maybe happened after some gate resets) | 21:16 |
*** threestrands_ has joined #zuul | 21:18 | |
Shrews | corvus: pabelanger: is it possible that we could be getting rate limited on quota queries by the provider here? if those are slow, we could spend a long time in the request acceptance loop | 21:33
Shrews | which i'm growing more confident that we are doing | 21:33 |
pabelanger | Shrews: let me check if we are doing any caching for quota or not | 21:35 |
pabelanger | maybe we need to add it, if missing | 21:35 |
Shrews | i'm not certain what those queries are... trying to track down where we do that | 21:35 |
clarkb | Shrews: we may not be rate limited as much as clouds responding slow | 21:36 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix dependency cycle false positive https://review.openstack.org/534444 | 21:36 |
clarkb | Shrews: pabelanger you can probably test that out of band by making the same requests and timing them | 21:37 |
pabelanger | yah, was going to suggest that | 21:37 |
pabelanger | we could also start tracking them via statsd too, like create servers | 21:38 |
corvus | we should be caching the quota queries -- except when we unexpectedly get an over-quota failure, we empty the cache. we're continually getting over-quota failures in inap due to the nova mismatch bug. | 21:38 |
mordred | pabelanger: we should already be tracking them in statsd, just need to add grafana graphs for them | 21:38 |
pabelanger | mordred: nice | 21:38 |
mordred | pabelanger: (we should be generating statsd metrics for every REST call made - so we should get all the metrics all the time) | 21:39
pabelanger | stats.nodepool.task.inap-mtl01.ComputeGetLimits looks to be key | 21:41 |
mordred | yup. that seems about right | 21:41 |
pabelanger | I'll confirm and work up grafyaml patch | 21:42 |
corvus | pabelanger: do you have a graphite link handy for us to look at? | 21:43 |
Shrews | seems getting compute limits on inap takes about a second | 21:44 |
corvus | clarkb: in 534444 i opted to fix only the new-style depends-on; do you think it's important to fix the legacy gerrit depends-on as well? | 21:44 |
pabelanger | let me see how to share | 21:44 |
mordred | corvus: test_crd_gate_triangle sounds like a place I don't want to fly a small aircraft over | 21:44 |
corvus | pabelanger: right click on image ? | 21:44 |
clarkb | corvus: is old style going away? | 21:44 |
corvus | mordred: definitely | 21:44 |
clarkb | if old style is going away, probably not, but if we expect it to stick around it might be good to have consistent behavior | 21:44
corvus | clarkb: yeah, but with a timeframe convenient for openstack, so maybe 3-6 mo... | 21:44 |
corvus | i don't want to immediately invalidate a bunch of depends-on headers | 21:45 |
pabelanger | http://graphite.openstack.org/render/?width=586&height=308&_salt=1516139140.947&target=stats.nodepool.task.inap-mtl01.ComputeGetLimits | 21:46 |
pabelanger | that is the image, but not sure how to properly share | 21:46 |
corvus | you just did | 21:46 |
pabelanger | kk | 21:46 |
pabelanger | seems pretty flat right now | 21:46 |
Shrews | i'm not sure how to track this down without throwing in a bunch of spurious logging statements in a printf-style-ala-mordred attack | 21:47 |
corvus | pabelanger: i suspect that's a graph of when we emit those calls | 21:48 |
mordred | pabelanger: the timing key is ... | 21:48 |
corvus | http://graphite.openstack.org/render/?width=586&height=308&_salt=1516139263.857&target=stats.timers.nodepool.task.inap-mtl01.ComputeGetLimits.mean | 21:48 |
corvus | something like that ^ i think | 21:48 |
pabelanger | ah, thank you, that does look right | 21:48
mordred | yes. what corvus said | 21:48 |
mordred | that's in miliseconds? | 21:49 |
corvus | mordred: i think so | 21:49 |
mordred | yah. so - not the world's fastest call - but not the world's slowest either | 21:49 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Support cross-source dependencies https://review.openstack.org/530806 | 21:49 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add cross-source tests https://review.openstack.org/532699 | 21:49 |
clarkb | corvus: reviewed the fix for triangle deps | 21:52 |
clarkb | corvus: there is a bug | 21:52 |
*** flepied has quit IRC | 21:55 | |
*** flepied has joined #zuul | 21:57 | |
Shrews | pabelanger: corvus: http://paste.openstack.org/show/645748/ Those "predicted" calculations happen back-to-back in the code, so the second one is taking at least 5 seconds. I'm not sure how long the 1st is taking | 22:03 |
Shrews | I'm seeing processing a single request take between 20-30 seconds. If there are lots of requests to go through, then the code will not get around to removing completed requests for quite a while | 22:03 |
Shrews | So we could be seeing a combination of things here | 22:04 |
Shrews | Heavy load + inefficient request processing | 22:04 |
Shrews | This would explain why pabelanger saw many ready+locked nodes suddenly free up | 22:05 |
*** dkranz has quit IRC | 22:06 | |
Shrews | or change to in-use, rather | 22:06 |
pabelanger | yah, nodepool-launcher is at 100% CPU for the most part | 22:07 |
Shrews | pabelanger: this was a good spot. well done for catching it | 22:08 |
pabelanger | could we run another nodepool-launcher process on the host, just for inap, as a test? | 22:09
Shrews | pabelanger: now tell us how to make it better!!! :) | 22:09 |
Shrews | i don't think this is inap specific. i see similar trends with other providers | 22:09 |
pabelanger | yah, nl01 is 8vCPU; if we are at 100% for a single nodepool-launcher daemon, maybe we shard the config more and run a single process per provider? | 22:10
Shrews | i don't think that'd help, tbh | 22:13 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix dependency cycle false positive https://review.openstack.org/534444 | 22:13 |
corvus | clarkb: thx ^ | 22:14 |
pabelanger | Shrews: ack | 22:14 |
Shrews | adding caching of limits to shade might help a bit | 22:16 |
Shrews | i can work on that in the morning if we think that's a good idea | 22:17 |
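Sketch of the caching behaviour corvus described above (serve limits from a short-lived cache, invalidate on an unexpected over-quota failure). get_compute_limits() stands in for whatever shade/provider call actually fetches the limits; the TTL and names are assumptions.

```python
import time


class CachedLimits:
    def __init__(self, client, ttl=300):
        self.client = client
        self.ttl = ttl
        self._value = None
        self._fetched_at = 0.0

    def get(self):
        # Refresh only when empty or stale, so quota checks during request
        # handling don't turn into one API round trip per request.
        if self._value is None or time.time() - self._fetched_at > self.ttl:
            self._value = self.client.get_compute_limits()  # assumed API
            self._fetched_at = time.time()
        return self._value

    def invalidate(self):
        """Call when the cloud unexpectedly reports over-quota."""
        self._value = None
```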
corvus | Shrews: can you summarize what the problem is? i don't have the full picture. | 22:17 |
mordred | Shrews: I think adding in the same caching pattern we use for servers ports and floating-ips would be a fine idea (pending we don't learn something else to the contrary between now and then) | 22:18 |
Shrews | corvus: requests aren't getting satisfied in a timely fashion because of the processing of the node requests loop. we are seeing several seconds between processing each request. until we get through all of them, we do not attempt to mark requests fulfilled | 22:19 |
corvus | Shrews: which loop is the 'node requests loop' ? | 22:19 |
Shrews | corvus: PoolWorker._assignHandlers() | 22:20 |
corvus | Shrews: what in there do you think is slow? | 22:21 |
*** haint_ has quit IRC | 22:22 | |
Shrews | corvus: all of it? :) the paste from me above shows several seconds for at least one of the quota estimations. | 22:22 |
*** haint_ has joined #zuul | 22:22 | |
Shrews | corvus: i have not identified other areas of slowness yet, but can see over 20s between assigning a request, and actually launching the node for it | 22:23 |
Shrews | without some more debugging info, i can only guess as to what else is slow about it | 22:25
corvus | Shrews: okay, so we've got several seconds for the new handlers to run if that method creates any new ones... | 22:26 |
corvus | Shrews: if the handler is paused, we short out of there pretty quick | 22:27 |
corvus | Shrews: but if it isn't paused, then we're going to try to lock every request. we have 1800 of them right now. | 22:27 |
corvus | but it seems like we should run up to the point where we are paused pretty quickly, so that shouldn't be an issue | 22:28 |
Shrews | right. looking at the logs, i saw the inap thread accept 50 requests before it completed the loop, which took about 13m in all | 22:28 |
Shrews | it never paused in that time | 22:29 |
Shrews | which means it had capacity to handle them all | 22:29 |
clarkb | each provider gets its own thread too right? and those threads fight all the launch threads? could we be cpu starved? | 22:29 |
clarkb | like maybe we should run multiple nodepool launchers per host or something | 22:29 |
pabelanger | clarkb: yah, that is what I was thinking about multiple launchers per host | 22:30 |
pabelanger | they'd each need to have a config with a specific provider | 22:30 |
Shrews | clarkb: yes, 1 provider pool per thread | 22:30
Shrews | a provider can have multiple pools | 22:30 |
clarkb | ah right its PoolWorker | 22:31 |
clarkb | in our case pools and providers are currently 1:1 | 22:31 |
corvus | nl01 is currently cpu-bound | 22:31 |
corvus | they both are | 22:32 |
Shrews | there could be a lot of thread context switching going on since we have 1 thread per node launch too | 22:32 |
Shrews | maybe multiple processes could help | 22:32 |
Shrews | as pabelanger suggested | 22:32 |
corvus | yep. this is one of the reasons we designed it to accommodate sharding -- we were starved on one machine. it's interesting that we are already starved with two processes. | 22:33
corvus | so more processes (whether that's on more machines or the same one) may help, if the main pool thread is being starved by the launch threads. | 22:33 |
clarkb | in this case I meant the same host, since I think we have a lot of available cpu; it's just python being unable to take advantage of it | 22:34
corvus | fwiw, the context switch overhead before it became unbearable was about 1k threads. | 22:34 |
mordred | seems about right | 22:34 |
pabelanger | nb01.o.o is also 8vcpu / 8GB RAM; if we are CPU bound, that's an expensive server for a single process. We could try bringing on more launcher nodes in smaller sizes, or run more launchers on each host | 22:35
corvus | Shrews, mordred: i'm not 100% sure, but i worry that quota caching isn't an immediate answer. we should verify that nodepool isn't already caching sufficiently and that changing shade would actually do anything different before we go down that road. | 22:35
Shrews | corvus: nod | 22:36 |
corvus | pabelanger: yes that server is too large. it should be 2g | 22:36 |
clarkb | pabelanger: more nodes in smaller sizes potentially makes us less outage prone, so that's a plus, but also harder to do things like switch zk servers | 22:36
corvus | i'd size it 2G for each process we want to run on it. | 22:36 |
corvus | maybe even fit a few more processes on larger hosts | 22:37 |
pabelanger | if we did a single process on 2GB hosts, that would work well with how we manage nodepool.yaml in project-config today (by hostname). If we do more processes on a larger server, we'll need to rework some puppet code first | 22:38
corvus | Shrews: we could also make the poolworker a little more concurrent, either by having another thread handle completions, or just interleaving completions with assignments. | 22:38 |
Shrews | corvus: difficult to handle correctly since either way, we'd want to modify the same data structure (the handler array). but yeah, could be possible with the right locking and whatnot | 22:39 |
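Roughly what "interleaving completions with assignments" could look like, with a lock around the shared handler list; every name here is invented for illustration and the real PoolWorker code differs.

```python
import threading


class PoolWorkerSketch:
    def __init__(self):
        self.request_handlers = []
        self.handler_lock = threading.Lock()  # guards the shared handler list

    def _sweep_completed(self):
        # Drop handlers whose poll() says they are finished (fulfilled or failed).
        with self.handler_lock:
            self.request_handlers = [h for h in self.request_handlers
                                     if not h.poll()]

    def _assign_handlers(self, requests, sweep_every=10):
        for i, request in enumerate(requests):
            handler = self._launch_handler(request)  # assumed helper
            with self.handler_lock:
                self.request_handlers.append(handler)
            if i % sweep_every == 0:
                # Don't let fulfilled requests sit locked for the whole
                # (potentially 10+ minute) assignment pass.
                self._sweep_completed()
```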
pabelanger | I have to step away for a bit, will catch up when back | 22:39 |
Shrews | and i need to make dinner | 22:40 |
corvus | let's resume this tomorrow :) | 22:41 |
clarkb | another idea: we could use multiprocessing for the pool workers possibly? | 22:41
clarkb | and have python handle it more behind the scenes for us? that might be friendlier to other users | 22:41 |
corvus | clarkb: yes, as long as we do it at that level (the pool worker), where there's no shared data between any of them. that's worth looking into. | 22:42 |
clarkb | ya all the communication is through zk anyways so that should be pretty safe | 22:42 |
corvus | multiprocessing is cool as long as you don't try to share data, then it gets bad. so a process for a pool worker, but then all the launchers still as threads. | 22:42 |
mordred | ++ | 22:43 |
corvus | they already even have their own zk connection, so even that wouldn't be different | 22:43 |
Shrews | i think the Nodepool object itself is shared, which could be problematic | 22:43 |
Shrews | anyway, really away now | 22:43 |
corvus | Shrews: yeah, i think that's mostly config stuff at this point; we could probably fix that. | 22:44 |
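A hypothetical shape for "one OS process per provider pool, launch threads inside each", where every child builds its own config and ZooKeeper connection and shares nothing in-process; the PoolWorker stub and argument list are placeholders, not nodepool's real API.

```python
import multiprocessing


class PoolWorker:
    """Placeholder for the real pool worker; runs launch threads for one pool."""

    def __init__(self, config_file, provider, pool):
        self.config_file, self.provider, self.pool = config_file, provider, pool

    def run(self):
        pass  # real code would open its own ZK connection and process requests


def run_pool(config_file, provider, pool):
    # Each child process constructs its own config, ZK client and per-node
    # launch threads, so no Python objects cross the process boundary;
    # all coordination happens through ZooKeeper.
    worker = PoolWorker(config_file, provider, pool)
    worker.run()  # blocks; spawns launch threads internally


def main(config_file, pools):
    procs = []
    for provider, pool in pools:
        p = multiprocessing.Process(
            target=run_pool, args=(config_file, provider, pool),
            name='pool-%s-%s' % (provider, pool))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
```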
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix dependency cycle false positive https://review.openstack.org/534444 | 22:48 |
corvus | clarkb: there are tests which would have caught the error you pointed out. most of them worked, but two needed a fix, so i updated the patch with that as well. ^ | 22:49 |
clarkb | corvus: cool I will rereview | 22:51 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Stop running tox-cover job https://review.openstack.org/534458 | 22:51 |
*** flepied has quit IRC | 22:57 | |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Really change patchset column to string https://review.openstack.org/532459 | 23:05 |
clarkb | Shrews: corvus do you want to review https://review.openstack.org/#/c/533771/5 and parent? | 23:06 |
clarkb | I think that is slightly more important now that infracloud might shutdown at any time | 23:06 |
corvus | clarkb: looking | 23:10 |
pabelanger | https://review.openstack.org/534450/ should start tracking getlimits API query in grafana to help with tomorrow | 23:11 |
clarkb | there is another cleanup related to that where I think we can more aggressively unset allocated_to (or whatever the var is called) when we fail a multinode nodeset | 23:12 |
clarkb | I think right now we let the cleanup thread find those and unset them, but we can unset them early allowing them to be used in other nodesets sooner | 23:12
corvus | clarkb: lgtm -- we have some tests which set an image flag that tells the fake to always fail when booting that image. i wonder if there's a way to turn that on mid-stream to avoid that particular monkey-patch. | 23:14 |
corvus | clarkb: i've +2d it; let's see if Shrews wants to review it tomorrow? unless you think we need to accelerate deployment of it. | 23:15 |
clarkb | looking at multiprocess PoolWorker more closely we use nodepool.getZK() and nodepool.getPoolWorkers() along with some config related stuff | 23:16 |
clarkb | I think the only thing that may be a real problem is getPoolWorkers? /me makes a quick change and sees what happens | 23:16 |
clarkb | oh you know where this might break the most is in the tests :/ | 23:17 |
clarkb | because we sort of assume a single process we can manipulate | 23:18 |
corvus | hrm. i'm guessing black-box tests might work, but not so much white-box. | 23:21 |
corvus | the tests which do end-to-end testing through zookeeper should work. | 23:22 |
corvus | how does this message about the zuulv3 merge look? https://etherpad.openstack.org/p/4sX3qKYDBN | 23:28 |
clarkb | ya, just hacking in multiprocessing and running it against the test I added above; once I fix some structural issues, we run into the problem where the test framework introspects its threads and stuff to know how things are doing | 23:32
clarkb | so it will be a bit of work to get that going in the test suite | 23:32
corvus | clarkb: maybe we need a slightly different structure then -- maybe we need to handle the process split in a way where we can test the system all in one process, but when actually started, we get multiple ones. | 23:34 |
clarkb | that seems viable, we would need a more true functional black box test to go with that (which we do have) | 23:34
clarkb | this isn't something I'm going to dive into now but wanted to confirm my suspicions it is non trivial before I ignored it :) | 23:35
clarkb | but I do think if we can make it work this will be the most user friendly way of scaling up launchers per host | 23:36 |
corvus | clarkb: yeah, for the most part, i'd think even just a special test like "i started a two-provider-pool system with multiple-processes and they both gave me a node" would be sufficient to exercise that -- then all the rest of the correctness tests can remain in the single-process realm | 23:38
*** sshnaidm is now known as sshnaidm|off | 23:42 | |
corvus | i sent the first message to zuul-announce: http://lists.zuul-ci.org/pipermail/zuul-announce/2018-January/000000.html | 23:43 |
corvus | hopefully folks got that | 23:43 |
pabelanger | yup, got email here | 23:46 |
*** rlandy_ has quit IRC | 23:52 | |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Use hotlink instead log url in github job report https://review.openstack.org/531545 | 23:53 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Disambiguate with Netflix and Javascript zuul https://review.openstack.org/531292 | 23:55 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add support for protected jobs https://review.openstack.org/522985 | 23:56 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Documentation changes for cross-source dependencies https://review.openstack.org/534397 | 23:56 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Remove updateChange history from github driver https://review.openstack.org/531904 | 23:58 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Handle sigterm in nodaemon mode https://review.openstack.org/528646 | 23:58 |