*** _ari_ has quit IRC | 00:17 | |
*** _ari_ has joined #zuul | 00:19 | |
*** dmellado has joined #zuul | 00:37 | |
*** _ari_ has quit IRC | 01:01 | |
*** dmellado has quit IRC | 01:06 | |
*** _ari_ has joined #zuul | 01:08 | |
*** tobiash_ has left #zuul | 03:38 | |
jamielennox | interesting - don't know why yet: | 04:02 |
jamielennox | File "/usr/local/lib/python3.5/dist-packages/nodepool/launcher.py", line 1486, in removeCompletedRequests | 04:02 |
jamielennox | for label in requested_labels: | 04:02 |
jamielennox | RuntimeError: dictionary changed size during iteration | 04:02 |
tobiash | mordred, jeblair: retested synchronize | 04:56 |
tobiash | and found out now that it actually worked and the error I saw was from rsync :( | 04:57 |
tobiash | I had several issues during debugging making me think it was an ansible issue | 04:58 |
tobiash | synchronize seems to be also handled by the log streamer (which is not logged in the executor) and with post failure I didn't know how to get the console log | 04:58 |
tobiash | further I have the issue that the executor often prepared old versions of the trusted repo (triggering patch is in a different repo) | 04:59 |
tobiash | that fooled my debug iterations | 05:00 |
tobiash | -> https://review.openstack.org/#/c/469214/ helped now with getting the log from rsync | 05:00 |
tobiash | -> rsync worked but had some transient errors: http://paste.openstack.org/show/613461/ | 05:02 |
tobiash | so sorry for bothering you with that topic | 05:02 |
tobiash | but the non-deterministic playbook repo preparation is an issue I have to look at (I think that fooled me into thinking that shell also doesn't work delegated) | 05:03 |
tobiash | my current setup is scheduler + 5 merger + 1 executor for testing how that scales | 05:07 |
jamielennox | gah, i should know this but how do i get nodepool/shade to give me a floating ip | 05:25 |
clarkb | jamielennox: iirc you get one if you get back a non routable ipv4 addr | 05:28 |
clarkb | or if you specify it in the oscc network config | 05:29 |
jamielennox | clarkb: yea, i was looking in the wrong place it seems, we had routes_externally: True in the clouds.yaml which was stopping the assignment | 05:29 |
jamielennox | any idea whilst people are looking for how to add a security group to nodepool nodes? | 05:30 |
clarkb | I don't know that we support setting security groups | 05:33 |
clarkb | unless oscc allows you to | 05:33 |
clarkb | we have clouds that don't do security groups so our deployments have always opened them up where they exist and then lock down on the host | 05:34 |
jamielennox | yea, i seem to remember this once before, i guess given that nodepool is likely creating nodes in its own project it's reasonable to just add ssh to default | 05:35 |
mordred | jamielennox: yah - that's what we do - just modify the default | 05:37 |
mordred | jamielennox: that said - it's probably not a terrible idea to add support for specifying a security-group - I can imagine it might be a thing people might want | 05:38 |
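A rough sketch of the "just modify the default security group" approach discussed above, assuming shade's create_security_group_rule call behaves as recalled here; the cloud name and CIDR are placeholders, not anything from this log:

```python
# Hedged sketch: open SSH in the project's default security group so
# nodepool-launched nodes are reachable, then lock things down on-host.
# The cloud name and CIDR below are placeholders.
import shade

cloud = shade.openstack_cloud(cloud='nodepool-cloud')
cloud.create_security_group_rule(
    'default',
    port_range_min=22,
    port_range_max=22,
    protocol='tcp',
    remote_ip_prefix='0.0.0.0/0',
)
```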
mordred | tobiash: good to know - and no worries - I learned a lot about synchronize in the process :) | 05:41 |
mordred | tobiash: non-deterministic playbook repo prep certainly doesn't sound good though. jeblair when you're up that sounds like a thing you may also want to look at | 05:43 |
mordred | jamielennox: if you were to want to add security-group support, it would be basically identical to key-name | 05:48 |
mordred | jamielennox: so if you do 'git grep key.name' in nodepool, you'll see where you need to add it | 05:50 |
mordred | although man, looking at that our naming there seems very awkward | 05:50 |
jamielennox | mordred: it's interesting because it's openstack driver specific and i imagine that once v3 is stable and that kicks off it'll change again | 05:50 |
jamielennox | not even thinking too far afield it's not useful for static nodes | 05:51 |
mordred | jamielennox: indeed - but then neither is key-name, min-ram or console-log | 05:51 |
jamielennox | true, there is a lot to do there | 05:52 |
mordred | there's a bunch of cloud-specific things that go into a label definition | 05:52 |
mordred | that I think it's likely fine if they're specific to a label definition | 05:52 |
mordred | (also, my comment about awkward naming was because I was looking at master - we have fixed the naming issue I was concerned about :) ) | 05:52 |
*** hashar has joined #zuul | 07:13 | |
tobiash | mordred, jeblair: the playbook repos don't seem to be part of the repo state the executor gets | 07:13 |
tobiash | mordred, jeblair: excerpt of the data structure the executor gets for running the job: | 07:22 |
tobiash | http://paste.openstack.org/show/613474/ | 07:22 |
tobiash | the playbook repos are also not part of the projects dict | 08:09 |
*** jkilpatr has quit IRC | 10:38 | |
mordred | tobiash: oh - there's also a bug I think we have with streaming logs and trusted playbooks | 10:58 |
tobiash | due to non-bwrap? | 10:59 |
mordred | no - due to how we're injecting the jobid so that the remote command knows how to log | 11:03 |
mordred | I BELIEVE we're doing that in a place that doesn't get installed when the job is trusted | 11:03 |
* mordred is writing down bug right now so he doesn't forget to actually look | 11:04 | |
mordred | tobiash: oh. nope. I'm totally wrong | 11:06 |
*** jkilpatr has joined #zuul | 11:10 | |
*** dmellado_ has joined #zuul | 11:17 | |
*** dmellado_ is now known as dmellado | 11:19 | |
*** openstackgerrit has quit IRC | 11:33 | |
Shrews | jamielennox: that's interesting on the dictionary RuntimeError. i think the line above that, which tries to prevent that error, is not correct | 12:45 |
Shrews | ah, it seems keys() returns an iterator in py3 | 12:47 |
Shrews | jamielennox: fix coming | 12:48 |
*** openstackgerrit has joined #zuul | 12:51 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Fix dict key copy operation https://review.openstack.org/476902 | 12:51 |
Shrews | jamielennox: ^^^ | 12:51 |
tristanC | what do you think about adding a zuul_comment module so that a job can report custom content to a review? | 13:10 |
tristanC | i mean, to help users figure out why a job failed without going to the failure-url, it may be convenient to report more content such as the list of unit-tests that actually failed | 13:16 |
mordred | tristanC: yes - SO - funny story, I'm actually writing up some things related to that use case from discussions we had this week | 13:34 |
clarkb | I thought you can already put arbitrary text | 13:50 |
clarkb | we do for requirements checking jobs for example | 13:50 |
tristanC | mordred: nice, please let me know how it goes, I'm interested in such capabilities, it may be very useful for custom reporters... e.g. code coverage test should be able to tell patch author when the coverage drop a certain threshold | 13:51 |
clarkb | I guess it's mostly static though | 13:51 |
*** hashar_ has joined #zuul | 15:12 | |
SpamapS | That would work nicely with the "job drops a json file" thing we have talked about for artifact sharing too. | 15:43 |
SpamapS | (so jobs would drop a json with some comment text that zuul would dump into the reporter) | 15:43 |
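A hypothetical sketch of the "job drops a json file" idea: the job writes a small JSON artifact containing comment text, and a reporter-side step (which does not exist yet) would fold it into the review message. The file name and keys below are invented for illustration only:

```python
# Hypothetical artifact a job could drop for the reporter to pick up;
# the path and key names are invented, not an existing zuul interface.
import json

comment = {
    'comment': 'Unit tests failed:\n  - test_foo\n  - test_bar',
    'coverage_delta': -2.3,
}

with open('zuul-comment.json', 'w') as f:
    json.dump(comment, f, indent=2)
```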
*** hashar is now known as hasharAway | 15:57 | |
*** hashar_ has quit IRC | 16:04 | |
dmsimard | Question about nodepool: as I understand it, TripleO is currently sharing the tenant nodepool is using with other processes which manage virtual machines in that tenant. I know it goes against the recommendation of dedicating a tenant to nodepool but what kind of problems could this expose? | 16:18 |
Shrews | I would think that the biggest risk you open yourself to is getting reality (available vms) out of sync with expectations (the database or ZK in v3) if something other than nodepool deletes/modifies/etc the nodepool owned nodes | 17:19 |
Shrews | harlowja: hrm, there might be an issue with the Event change you proposed yesterday. seeing some random job failures with what looks like tests hanging causing the job to timeout | 17:29 |
Shrews | but i'm not seeing anything obviously wrong :( | 17:29 |
harlowja | hmmmm, that'd seem odd, logically those should work (since i've been doing this for a long time, lol) | 17:29 |
harlowja | *doing similar stuff | 17:29 |
harlowja | any more info? | 17:30 |
harlowja | on py3 it fails, or py2.7? | 17:30 |
Shrews | harlowja: nothing really useful in the logs. failing on coverage job the last few times I checked | 17:35 |
Shrews | i'm beginning to wonder about Event.wait() and context switching | 17:40 |
tobiash | Shrews, harlowja: I also saw some random job hangs on my deployment | 17:40 |
harlowja | how much of eventlet is that stuff using? | 17:40 |
harlowja | i mean Event.wait is basically waiting on a condition variable | 17:40 |
tobiash | looking at 'ps afx' it looked like ansible didn't start an ssh process | 17:40 |
harlowja | which is basically waiting on a threading primitive that all that it does is force context switches / waits, lol | 17:41 |
harlowja | so ya, it'd be interesting to know what happens more tobiash and Shrews | 17:42 |
harlowja | how much eventlet are u guys using btw? | 17:42 |
harlowja | none, some? | 17:42 |
tobiash | harlowja: I don't know if we're talking about the same issue | 17:43 |
harlowja | perhaps, lol | 17:43 |
Shrews | no eventlet used | 17:43 |
harlowja | kk | 17:43 |
harlowja | then ya, i can't think why it wouldn't work | 17:43 |
tobiash | my observation: in my zuulv3 deployment sometimes a shell task just hangs with ansible not starting ssh from what I could see | 17:43 |
Shrews | harlowja: yeah, conditional is used, so you'd think that would work as expected: https://github.com/python/cpython/blob/3.6/Lib/threading.py#L551 | 17:43 |
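For reference, a condensed sketch of what threading.Event does internally (compare the linked CPython source): wait() blocks on a Condition until set() flips the flag, so a hang normally means set() was never called or the flag was cleared again afterwards. This is an approximation, not the stdlib code:

```python
import threading

class SketchEvent:
    """Condensed approximation of threading.Event, for illustration."""

    def __init__(self):
        self._cond = threading.Condition()
        self._flag = False

    def set(self):
        with self._cond:
            self._flag = True
            self._cond.notify_all()

    def clear(self):
        with self._cond:
            self._flag = False

    def wait(self, timeout=None):
        with self._cond:
            if not self._flag:
                self._cond.wait(timeout)
            return self._flag
```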
harlowja | ya, i mean if those don't work, bigger problems exist all over python :-P | 17:44 |
tobiash | killing the leaf-ansible-playbook process on the executor made the job continue (even successfully) | 17:44 |
Shrews | I have yet to see these hangs on the changes NOT related to the use of Event (eg., https://review.openstack.org/476902) so it makes me thing the event change is causing it | 17:45 |
Shrews | s/thing/think/ | 17:45 |
Shrews | tobiash: i haven't experienced what you are seeing :( | 17:46 |
harlowja | `RuntimeError: dictionary changed size during iteration` is a threading thing | 17:47 |
harlowja | that's just something u have to know about :-P | 17:47 |
harlowja | and not do, lol | 17:47 |
tobiash | Shrews: it was happening in about 5% of the runs and I didn't see anything obvious so far | 17:47 |
tobiash | could be anything in zuul, ansible or bwrap | 17:48 |
harlowja | Shrews let me know what u find, i'd be surprised if its something busted in threading :-/ | 17:48 |
harlowja | but weird shit could happen, just would seem hard | 17:48 |
harlowja | *hard to believe | 17:49 |
Shrews | harlowja: i'm not going to spend too much time on it. will explore it a bit today, but may just have to ditch it and stick with our time.sleep() :( | 17:51 |
harlowja | k | 17:51 |
harlowja | though it's odd, because the whole underlying wait and things it's using internally is also sort of equivalent to time.sleep | 17:52 |
harlowja | lol | 17:52 |
harlowja | just u aren't seeing the equivalent of time.sleep, lol | 17:53 |
Shrews | i wish our logs were more helpful. they are useless for timeouts | 17:53 |
harlowja | :-/ | 17:53 |
Shrews | and of course cannot reproduce it locally (yet) | 17:54 |
Shrews | aha. got a local hang | 17:56 |
harlowja | u can try https://gist.github.com/harlowja/5544d84e8e734ea1cc7c163eff007531 | 17:59 |
harlowja | at least it will show u some <waiting> events | 17:59 |
harlowja | event stuff isn't too crazy :-P | 18:00 |
harlowja | (ie making your own to see whats happening) | 18:01 |
harlowja | typically what happens is someone isn't calling `set` somewhere | 18:01 |
harlowja | then wait (especially with no timeout) just waits forever, lol | 18:01 |
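The gist's contents aren't reproduced here, but the same idea can be sketched as an Event subclass that logs its calls, so a hang shows up as a "<waiting>" entry that is never followed by a matching set:

```python
# Sketch of a debug Event (approximating the idea in the gist above):
# log set/clear/wait so a hang is visible as a wait with no matching set.
import logging
import threading

LOG = logging.getLogger('debug-event')

class LoggingEvent(threading.Event):
    def set(self):
        LOG.debug('set from %s', threading.current_thread().name)
        super().set()

    def clear(self):
        LOG.debug('clear from %s', threading.current_thread().name)
        super().clear()

    def wait(self, timeout=None):
        LOG.debug('<waiting> in %s (timeout=%s)',
                  threading.current_thread().name, timeout)
        return super().wait(timeout)
```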
harlowja | the one thing that might be useful to look at is `run` in those changes | 18:02 |
harlowja | self._death.clear() | 18:02 |
harlowja | if shutdown/stop happens before run | 18:03 |
harlowja | (for some reason) | 18:03 |
harlowja | then run will never die, lol | 18:03 |
harlowja | perhaps put some print/logs in https://gist.github.com/harlowja/5544d84e8e734ea1cc7c163eff007531#file-gistfile1-txt-L10 | 18:05 |
harlowja | and say in clear | 18:05 |
harlowja | and see if someone is setting (ie shutting down) and then run calls clear, lol | 18:05 |
Shrews | harlowja: the stop before run thing is what i'm looking at now, particularly: https://github.com/openstack-infra/nodepool/blob/feature/zuulv3/nodepool/builder.py#L1110-L1118 | 18:09 |
harlowja | lol | 18:10 |
harlowja | ya... | 18:10 |
harlowja | u can take my latch code | 18:10 |
harlowja | and then drop that whole loop | 18:10 |
harlowja | but ya, that's usually what hangs, something like this :-P | 18:11 |
Shrews | harlowja: i'm not sure how that paste you just gave me helps | 18:11 |
harlowja | mainly that u can add log statements and shit all over it :-P | 18:11 |
harlowja | *good shit, not the bad kind, lol | 18:12 |
harlowja | Shrews so my guess would be to remove that clear in those various `run` methods | 18:16 |
harlowja | they are probably re-clearing it | 18:16 |
harlowja | and the clear should probably happen outside of `run` in some kind of 'reset` method (if those threads are really re-used) | 18:16 |
harlowja | if they aren't reused (after shutdown) then meh, just throw them away, and don't add reset on | 18:17 |
harlowja | so i'd drop that clear from all the places (in run) | 18:23 |
harlowja | my 3 cents | 18:24 |
harlowja | ha | 18:24 |
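A minimal sketch of the race being described (not the builder.py code): if stop() fires before the thread is scheduled, a clear() inside run() wipes out the stop signal and the wait loop never exits; clearing once in __init__ (or in a separate reset method) avoids that.

```python
# Minimal reproduction sketch, not the actual nodepool builder code.
import threading

class Worker(threading.Thread):
    def __init__(self):
        super().__init__()
        self._death = threading.Event()   # cleared once, here, not in run()

    def run(self):
        # Buggy variant: calling self._death.clear() here would discard a
        # stop() that arrived before this thread got scheduled.
        while not self._death.wait(timeout=1.0):
            pass  # periodic work goes here

    def stop(self):
        self._death.set()
```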
Shrews | seems to be hanging on join to one of the worker threads during the shutdown process. so one of them is not stopping properly | 18:27 |
* Shrews adds more logging | 18:27 | |
dmsimard | Shrews: you happen to know if 'env-vars' defined for a DIB image definition in nodepool end up being set on the node when it actually boots up ? | 18:29 |
dmsimard | or if they are only effective during the image build ? | 18:29 |
dmsimard | doesn't look like those env vars are persisted (i.e, writing to /etc/profile) | 18:29 |
Shrews | dmsimard: they are not set on the launched node | 18:29 |
dmsimard | Shrews: so here's my dilemma -- I'm trying to reproduce the upstream image definitions outside of the openstack-infra environment (RDO's nodepool/zuul). I got the images to build no problem with the upstream elements/scripts except the configure_mirror.sh ready-script borks because it's not picking up NODEPOOL_MIRROR_HOST | 18:31 |
dmsimard | https://github.com/openstack-infra/project-config/blob/master/nodepool/scripts/configure_mirror.sh#L70 | 18:32 |
dmsimard | image builds fine with https://github.com/rdo-infra/review.rdoproject.org-config/blob/master/nodepool/nodepool.yaml#L13-L34 but then configure_mirror fails | 18:32 |
dmsimard | I was wondering if it'd make sense to persist those env vars | 18:32 |
Shrews | dmsimard: oh, you're not talking about v3 nodepool. i'd have to search the v2 code to know what's happening with those vars | 18:34 |
dmsimard | yeah we're not running v2.5/v3 yet :/ | 18:34 |
Shrews | (we did away with the ready script codepath in v3) | 18:34 |
Shrews | dmsimard: so you're not even using the zk version of the builder, right? | 18:36 |
dmsimard | Shrews: nope, still geard so far as I am aware -- we'll be moving towards 2.5ish in the near future but we're not there yet | 18:37 |
Shrews | i'm _fairly_ certain that no env vars were set on the launched node (but we did write some variable info to files on that host) | 18:37 |
dmsimard | Shrews: I think for the time being I won't bother too much and insert an element to persist the env vars | 18:37 |
dmsimard | yeah there's like /etc/nodepool/provider with some info, amongst other things | 18:38 |
Shrews | dmsimard: this is the thing I was thinking of https://github.com/openstack-infra/nodepool/blob/0.3.0/nodepool/nodepool.py#L630 but nothing for mirror host | 18:42 |
dmsimard | yeah that's /etc/nodepool/provider | 18:43 |
dmsimard | I half-expected the values defined in 'env-vars' to be carried over there | 18:43 |
Shrews | harlowja: k, i think i understand the problem now. we'll have to use a barrier instead of that code i pointed out | 18:54 |
harlowja | if u desire :-P | 18:55 |
harlowja | https://github.com/openstack/taskflow/blob/master/taskflow/types/latch.py is all yours | 18:55 |
harlowja | if u want it | 18:55 |
harlowja | lol | 18:55 |
harlowja | since py3.x got the latch addition | 18:55 |
harlowja | maybe someone backported it, i didn't, lol | 18:55 |
Shrews | because x.running no longer means "hey, i've started and reached a certain point in the startup" | 18:55 |
harlowja | ya, fair | 18:57 |
Shrews | ugh, this might be too much for right now. would have to pass the Latch in to each thread b/c the # of threads that would need to use it is configurable | 19:14 |
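One hedged sketch of the barrier idea: the starter and each of the N (configurable) worker threads share a single Barrier, so the starter only proceeds once every worker has reached its "I'm running" point; the barrier has to be sized for N and passed in, which is the extra plumbing Shrews is referring to.

```python
# Sketch only: a shared Barrier sized for N workers plus the starter.
import threading

def start_workers(num_workers, target):
    barrier = threading.Barrier(num_workers + 1)  # +1 for the starter

    def wrapped():
        barrier.wait()   # "I've started and reached this point"
        target()

    threads = [threading.Thread(target=wrapped) for _ in range(num_workers)]
    for t in threads:
        t.start()
    barrier.wait()       # starter blocks until every worker checks in
    return threads
```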
harlowja | yup | 19:15 |
harlowja | up to u | 19:16 |
Shrews | i think other v3 things are pressing, atm. will revisit later | 19:16 |
tobiash | I've noticed that ansible uses quite a lot of cpu when running (also when just waiting for a task to finish) | 19:43 |
tobiash | setting "internal_poll_interval = 0.01" in ansible.cfg fixed that for me locally | 19:43 |
tobiash | any objections to adding this setting to the generated ansible configs? | 19:47 |
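A sketch of how such a setting might be written into a generated ansible.cfg; the [defaults] section is where ansible reads internal_poll_interval from, and the file path here is a placeholder, not zuul's actual layout:

```python
# Sketch: add internal_poll_interval to a generated ansible.cfg.
# The path is a placeholder; only the option and value come from above.
import configparser

config = configparser.ConfigParser()
config.read('ansible.cfg')
if not config.has_section('defaults'):
    config.add_section('defaults')
config.set('defaults', 'internal_poll_interval', '0.01')

with open('ansible.cfg', 'w') as f:
    config.write(f)
```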
tobiash | encountered again a job freeze: http://paste.openstack.org/show/613548/ | 19:52 |
tobiash | the last one is an ansible-playbook process in a sleep state with no ssh connection launched | 19:53 |
SpamapS | ruhroh | 20:33 |
SpamapS | tobiash: what does strace think that process is doing? | 20:33 |
*** jkilpatr has quit IRC | 22:05 | |
*** jkilpatr has joined #zuul | 22:20 | |
*** hasharAway has quit IRC | 22:30 | |
mordred | dmsimard, Shrews: the main issue with sharing an openstack project between nodepool and non-nodepool is quota calculations | 22:49 |
mordred | dmsimard, Shrews: nodepool should not touch nodes it did not create - but whatever you tell nodepool its quota is will need to take into account actual quota - minus non-nodepool nodes | 22:50 |
mordred | the main problem with that will be that nodepool may sit there spinning hitting the nova api over and over trying and failing to create a node, which could put undue load on your control plane | 22:53 |
mordred | but if you manage the quotas right, you'll be good | 22:53 |
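Back-of-the-envelope version of that quota point, with made-up numbers: whatever max-servers you hand nodepool has to leave room for the non-nodepool instances sharing the tenant.

```python
# Hypothetical numbers, just to illustrate the arithmetic above.
tenant_instance_quota = 100     # what nova actually allows in the tenant
non_nodepool_instances = 30     # VMs managed by the other processes
nodepool_max_servers = tenant_instance_quota - non_nodepool_instances  # 70
```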