*** _ari_ has quit IRC | 00:17 | |
*** _ari_ has joined #zuul | 00:19 | |
*** dmellado has joined #zuul | 00:37 | |
*** _ari_ has quit IRC | 01:01 | |
*** dmellado has quit IRC | 01:06 | |
*** _ari_ has joined #zuul | 01:08 | |
*** tobiash_ has left #zuul | 03:38 | |
jamielennox | interesting - don't know why yet: | 04:02 |
jamielennox | File "/usr/local/lib/python3.5/dist-packages/nodepool/launcher.py", line 1486, in removeCompletedRequests | 04:02 |
jamielennox | for label in requested_labels: | 04:02 |
jamielennox | RuntimeError: dictionary changed size during iteration | 04:02 |
tobiash | mordred, jeblair: retested synchronize | 04:56 |
tobiash | and found out now that it actually worked and the error I saw was from rsync :( | 04:57 |
tobiash | I had several issues during debugging making me think it was an ansible issue | 04:58 |
tobiash | synchronize seems to be also handled by the log streamer (which is not logged in the executor) and with post failure I didn't know how to get the console log | 04:58 |
tobiash | further I have the issue that the executor often prepared old versions of the trusted repo (triggering patch is in a different repo) | 04:59 |
tobiash | that fooled my debug iterations | 05:00 |
tobiash | -> https://review.openstack.org/#/c/469214/ helped now with getting the log from rsync | 05:00 |
tobiash | -> rsync worked but had some transient errors: http://paste.openstack.org/show/613461/ | 05:02 |
tobiash | so sorry for bothering you with that topic | 05:02 |
tobiash | but the non-deterministic playbook repo preparation is an issue I have to look at (I think that fooled me into thinking that shell also doesn't work delegated) | 05:03 |
tobiash | my current setup is scheduler + 5 merger + 1 executor for testing how that scales | 05:07 |
jamielennox | gah, i should know this but how do i get nodepool/shade to give me a floating ip | 05:25 |
clarkb | jamielennox: iirc you get one if you get back a non routable ipv4 addr | 05:28 |
clarkb | or if you specify it in the oscc network config | 05:29 |
jamielennox | clarkb: yea, i was looking in the wrong place it seems, we had routes_externally: True in the clouds.yaml which was stopping the assignment | 05:29 |
jamielennox | any idea whilst people are looking for how to add a security group to nodepool nodes? | 05:30 |
clarkb | I don't know that we support setting security groups | 05:33 |
clarkb | unless oscc allows you to | 05:33 |
clarkb | we have clouds that don't do security groups so our deployments have always opened them up where they exist and then lock down on the host | 05:34 |
jamielennox | yea, i seem to remember this once before, i guess given that nodepool is likely creating nodes in its own project it's reasonable to just add ssh to default | 05:35 |
mordred | jamielennox: yah - that's what we do - just modify the default | 05:37 |
mordred | jamielennox: that said - it's probably not a terrible idea to add support for specifying a security-group - I can imagine it might be a thing people might want | 05:38 |
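A rough sketch of the "just modify the default security group" approach discussed above, assuming shade's create_security_group_rule call behaves as recalled here; the cloud name and CIDR are placeholders, not anything from this log:

```python
# Hedged sketch: open SSH in the project's default security group so
# nodepool-launched nodes are reachable, then lock things down on-host.
# The cloud name and CIDR below are placeholders.
import shade

cloud = shade.openstack_cloud(cloud='nodepool-cloud')
cloud.create_security_group_rule(
    'default',
    port_range_min=22,
    port_range_max=22,
    protocol='tcp',
    remote_ip_prefix='0.0.0.0/0',
)
```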
mordred | tobiash: good to know - and no worries - I learned a lot about synchronize in the process :) | 05:41 |
mordred | tobiash: non-deterministic playbook repo prep certainly doesn't sound good though. jeblair when you're up that sounds like a thing you may also want to look at | 05:43 |
mordred | jamielennox: if you were to want to add security-group support, it would be basically identical to key-name | 05:48 |
mordred | jamielennox: so if you do 'git grep key.name' in nodepool, you'll see where you need to add it | 05:50 |
mordred | although man, looking at that our naming there seems very awkward | 05:50 |
jamielennox | mordred: it's interesting because it's openstack driver specific and i imagine that once v3 is stable and that kicks off it'll change again | 05:50 |
jamielennox | not even thinking too far afield it's not useful for static nodes | 05:51 |
mordred | jamielennox: indeed - but then neither is key-name, min-ram or console-log | 05:51 |
jamielennox | true, there is a lot to do there | 05:52 |
mordred | there's a bunch of cloud-specific things that go into a label definition | 05:52 |
mordred | that I think it's likely fine if they're specific to a label definition | 05:52 |
mordred | (also, my comment about awkward naming was because I was looking at master - we have fixed the naming issue I was concerned about :) ) | 05:52 |
*** hashar has joined #zuul | 07:13 | |
tobiash | mordred, jeblair: the playbook repos don't seem to be part of the repo state the executor gets | 07:13 |
tobiash | mordred, jeblair: excerpt of the data structure the executor gets for running the job: | 07:22 |
tobiash | http://paste.openstack.org/show/613474/ | 07:22 |
tobiash | the playbook repos are also not part of the projects dict | 08:09 |
*** jkilpatr has quit IRC | 10:38 | |
mordred | tobiash: oh - there's also a bug I think we have with streaming logs and trusted playbooks | 10:58 |
tobiash | due to non-bwrap? | 10:59 |
mordred | no - due to how we're injecting the jobid so that the remote command knows how to log | 11:03 |
mordred | I BELIEVE we're doing that in a place that doesn't get installed when the job is trusted | 11:03 |
* mordred is writing down bug right now so he doesn't forget to actually look | 11:04 | |
mordred | tobiash: oh. nope. I'm totally wrong | 11:06 |
*** jkilpatr has joined #zuul | 11:10 | |
*** dmellado_ has joined #zuul | 11:17 | |
*** dmellado_ is now known as dmellado | 11:19 | |
*** openstackgerrit has quit IRC | 11:33 | |
Shrews | jamielennox: that's interesting on the dictionary RuntimeError. i think the line above that, which tries to prevent that error, is not correct | 12:45 |
Shrews | ah, it seems keys() returns an iterator in py3 | 12:47 |
Shrews | jamielennox: fix coming | 12:48 |
*** openstackgerrit has joined #zuul | 12:51 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Fix dict key copy operation https://review.openstack.org/476902 | 12:51 |
Shrews | jamielennox: ^^^ | 12:51 |
tristanC | what do you think about adding a zuul_comment module so that a job can report custom content to a review? | 13:10 |
tristanC | i mean, to help users figure out why a job failed without going to the failure-url, it may be convenient to report more content such as the list of unit-tests that actually failed | 13:16 |
mordred | tristanC: yes - SO - funny story, I'm actually writing up some things related to that use case from discussions we had this week | 13:34 |
clarkb | I thought you can already put arbitrary text | 13:50 |
clarkb | we do for requirements checking jobs for example | 13:50 |
tristanC | mordred: nice, please let me know how it goes, I'm interested in such capabilities, it may be very useful for custom reporters... e.g. code coverage test should be able to tell patch author when the coverage drop a certain threshold | 13:51 |
clarkb | I guess it's mostly static though | 13:51 |
*** hashar_ has joined #zuul | 15:12 | |
SpamapS | That would work nicely with the "job drops a json file" thing we have talked about for artifact sharing too. | 15:43 |
SpamapS | (so jobs would drop a json with some comment text that zuul would dump into the reporter) | 15:43 |
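A hypothetical sketch of the "job drops a json file" idea: the job writes a small JSON artifact containing comment text, and a reporter-side step (which does not exist yet) would fold it into the review message. The file name and keys below are invented for illustration only:

```python
# Hypothetical artifact a job could drop for the reporter to pick up;
# the path and key names are invented, not an existing zuul interface.
import json

comment = {
    'comment': 'Unit tests failed:\n  - test_foo\n  - test_bar',
    'coverage_delta': -2.3,
}

with open('zuul-comment.json', 'w') as f:
    json.dump(comment, f, indent=2)
```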
*** hashar is now known as hasharAway | 15:57 | |
*** hashar_ has quit IRC | 16:04 | |
dmsimard | Question about nodepool: as I understand it, TripleO is currently sharing the tenant nodepool is using with other processes which manage virtual machines in that tenant. I know it goes against the recommendation of dedicating a tenant to nodepool but what kind of problems could this expose? | 16:18 |
Shrews | I would think that the biggest risk you open yourself to is getting reality (available vms) out of sync with expectations (the database or ZK in v3) if something other than nodepool deletes/modifies/etc the nodepool owned nodes | 17:19 |
Shrews | harlowja: hrm, there might be an issue with the Event change you proposed yesterday. seeing some random job failures with what looks like tests hanging causing the job to timeout | 17:29 |
Shrews | but i'm not seeing anything obviously wrong :( | 17:29 |
harlowja | hmmmm, that'd seem odd, logically those should work (since i've been doing this for a long time, lol) | 17:29 |
harlowja | *doing similar stuff | 17:29 |
harlowja | any more info? | 17:30 |
harlowja | on py3 it fails, or py2.7? | 17:30 |
Shrews | harlowja: nothing really useful in the logs. failing on coverage job the last few times I checked | 17:35 |
Shrews | i'm beginning to wonder about Event.wait() and context switching | 17:40 |
tobiash | Shrews, harlowja: I also saw some random job hangs on my deployment | 17:40 |
harlowja | how much of eventlet is that stuff using? | 17:40 |
harlowja | i mean Event.wait is basically waiting on a condition variable | 17:40 |
tobiash | looking at 'ps afx' it looked like ansible didn't start an ssh process | 17:40 |
harlowja | which is basically waiting on a threading primitive that all that it does is force context switches / waits, lol | 17:41 |
harlowja | so ya, it'd be interesting to know what happens more tobiash and Shrews | 17:42 |
harlowja | how much eventlet are u guys using btw? | 17:42 |
harlowja | none, some? | 17:42 |
tobiash | harlowja: I don't know if we're talking about the same issue | 17:43 |
harlowja | perhaps, lol | 17:43 |
Shrews | no eventlet used | 17:43 |
harlowja | kk | 17:43 |
harlowja | then ya, i can't think why it wouldn't work | 17:43 |
tobiash | my observation: in my zuulv3 deployment sometimes a shell task just hangs with ansible not starting ssh from what I could see | 17:43 |
Shrews | harlowja: yeah, conditional is used, so you'd think that would work as expected: https://github.com/python/cpython/blob/3.6/Lib/threading.py#L551 | 17:43 |
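For reference, a condensed sketch of what threading.Event does internally (compare the linked CPython source): wait() blocks on a Condition until set() flips the flag, so a hang normally means set() was never called or the flag was cleared again afterwards. This is an approximation, not the stdlib code:

```python
import threading

class SketchEvent:
    """Condensed approximation of threading.Event, for illustration."""

    def __init__(self):
        self._cond = threading.Condition()
        self._flag = False

    def set(self):
        with self._cond:
            self._flag = True
            self._cond.notify_all()

    def clear(self):
        with self._cond:
            self._flag = False

    def wait(self, timeout=None):
        with self._cond:
            if not self._flag:
                self._cond.wait(timeout)
            return self._flag
```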
harlowja | ya, i mean if those don't work, bigger problems exist all over python :-P | 17:44 |
tobiash | killing the leaf-ansible-playbook process on the executor made the job continue (even successfully) | 17:44 |
Shrews | I have yet to see these hangs on the changes NOT related to the use of Event (eg., https://review.openstack.org/476902) so it makes me thing the event change is causing it | 17:45 |
Shrews | s/thing/think/ | 17:45 |
Shrews | tobiash: i haven't experienced what you are seeing :( | 17:46 |
harlowja | `RuntimeError: dictionary changed size during iteration` is a threading thing | 17:47 |
harlowja | that's just something u have to know about :-P | 17:47 |
harlowja | and not do, lol | 17:47 |
tobiash | Shrews: it was happening in about 5% of the runs and I didn't see anything obvious so far | 17:47 |
tobiash | could be anything in zuul, ansible or bwrap | 17:48 |
harlowja | Shrews let me know what u find, i'd be surprised if its something busted in threading :-/ | 17:48 |
harlowja | but weird shit could happen, just would seem hard | 17:48 |
harlowja | *hard to believe | 17:49 |
Shrews | harlowja: i'm not going to spend too much time on it. will explore it a bit today, but may just have to ditch it and stick with our time.sleep() :( | 17:51 |
harlowja | k | 17:51 |
harlowja | though it's odd, because the whole underlying wait and things it's using internally is also sort of equivalent to time.sleep | 17:52 |
harlowja | lol | 17:52 |
harlowja | just u aren't seeing the equivalent of time.sleep, lol | 17:53 |
Shrews | i wish our logs were more helpful. they are useless for timeouts | 17:53 |
harlowja | :-/ | 17:53 |
Shrews | and of course cannot reproduce it locally (yet) | 17:54 |
Shrews | aha. got a local hang | 17:56 |
harlowja | u can try https://gist.github.com/harlowja/5544d84e8e734ea1cc7c163eff007531 | 17:59 |
harlowja | at least it will show u some <waiting> events | 17:59 |
harlowja | event stuff isn't too crazy :-P | 18:00 |
harlowja | (ie making your own to see whats happening) | 18:01 |
harlowja | typically what happens is someone isn't calling `set` somewhere | 18:01 |
harlowja | then wait (especially with no timeout) just waits forever, lol | 18:01 |
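The gist's contents aren't reproduced here, but the same idea can be sketched as an Event subclass that logs its calls, so a hang shows up as a "<waiting>" entry that is never followed by a matching set:

```python
# Sketch of a debug Event (approximating the idea in the gist above):
# log set/clear/wait so a hang is visible as a wait with no matching set.
import logging
import threading

LOG = logging.getLogger('debug-event')

class LoggingEvent(threading.Event):
    def set(self):
        LOG.debug('set from %s', threading.current_thread().name)
        super().set()

    def clear(self):
        LOG.debug('clear from %s', threading.current_thread().name)
        super().clear()

    def wait(self, timeout=None):
        LOG.debug('<waiting> in %s (timeout=%s)',
                  threading.current_thread().name, timeout)
        return super().wait(timeout)
```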
harlowja | the one thing that might be useful to look at is `run` in those changes | 18:02 |
harlowja | self._death.clear() | 18:02 |
harlowja | if shutdown/stop happens before run | 18:03 |
harlowja | (for some reason) | 18:03 |
harlowja | then run will never die, lol | 18:03 |
harlowja | perhaps put some print/logs in https://gist.github.com/harlowja/5544d84e8e734ea1cc7c163eff007531#file-gistfile1-txt-L10 | 18:05 |
harlowja | and say in clear | 18:05 |
harlowja | and see if someone is setting (ie shutting down) and then run calls clear, lol | 18:05 |
Shrews | harlowja: the stop before run thing is what i'm looking at now, particularly: https://github.com/openstack-infra/nodepool/blob/feature/zuulv3/nodepool/builder.py#L1110-L1118 | 18:09 |
harlowja | lol | 18:10 |
harlowja | ya... | 18:10 |
harlowja | u can take my latch code | 18:10 |
harlowja | and then drop that whole loop | 18:10 |
harlowja | but ya, that's usually what hangs, something like this :-P | 18:11 |
Shrews | harlowja: i'm not sure how that paste you just gave me helps | 18:11 |
harlowja | mainly that u can add log statements and shit all over it :-P | 18:11 |
harlowja | *good shit, not the bad kind, lol | 18:12 |
harlowja | Shrews so my guess would be to remove that clear in those various `run` methods | 18:16 |
harlowja | they are probably re-clearing it | 18:16 |
harlowja | and the clear should probably happen outside of `run` in some kind of 'reset` method (if those threads are really re-used) | 18:16 |
harlowja | if they aren't reused (after shutdown) then meh, just throw them away, and don't add reset on | 18:17 |
harlowja | so i'd drop that clear from all the places (in run) | 18:23 |
harlowja | my 3 cents | 18:24 |
harlowja | ha | 18:24 |
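A minimal sketch of the race being described (not the builder.py code): if stop() fires before the thread is scheduled, a clear() inside run() wipes out the stop signal and the wait loop never exits; clearing once in __init__ (or in a separate reset method) avoids that.

```python
# Minimal reproduction sketch, not the actual nodepool builder code.
import threading

class Worker(threading.Thread):
    def __init__(self):
        super().__init__()
        self._death = threading.Event()   # cleared once, here, not in run()

    def run(self):
        # Buggy variant: calling self._death.clear() here would discard a
        # stop() that arrived before this thread got scheduled.
        while not self._death.wait(timeout=1.0):
            pass  # periodic work goes here

    def stop(self):
        self._death.set()
```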
Shrews | seems to be hanging on join to one of the worker threads during the shutdown process. so one of them is not stopping properly | 18:27 |
* Shrews adds more logging | 18:27 | |
dmsimard | Shrews: you happen to know if 'env-vars' defined for a DIB image definition in nodepool end up being set on the node when it actually boots up ? | 18:29 |
dmsimard | or if they are only effective during the image build ? | 18:29 |
dmsimard | doesn't look like those env vars are persisted (i.e, writing to /etc/profile) | 18:29 |
Shrews | dmsimard: they are not set on the launched node | 18:29 |
dmsimard | Shrews: so here's my dilemma -- I'm trying to reproduce the upstream image definitions outside of the openstack-infra environment (RDO's nodepool/zuul). I got the images to build no problem with the upstream elements/scripts except the configure_mirror.sh ready-script borks because it's not picking up NODEPOOL_MIRROR_HOST | 18:31 |
dmsimard | https://github.com/openstack-infra/project-config/blob/master/nodepool/scripts/configure_mirror.sh#L70 | 18:32 |
dmsimard | image builds fine with https://github.com/rdo-infra/review.rdoproject.org-config/blob/master/nodepool/nodepool.yaml#L13-L34 but then configure_mirror fails | 18:32 |
dmsimard | I was wondering if it'd make sense to persist those env vars | 18:32 |
Shrews | dmsimard: oh, you're not talking about v3 nodepool. i'd have to search the v2 code to know what's happening with those vars | 18:34 |
dmsimard | yeah we're not running v2.5/v3 yet :/ | 18:34 |
Shrews | (we did away with the ready script codepath in v3) | 18:34 |
Shrews | dmsimard: so you're not even using the zk version of the builder, right? | 18:36 |
dmsimard | Shrews: nope, still geard so far as I am aware -- we'll be moving towards 2.5ish in the near future but we're not there yet | 18:37 |
Shrews | i'm _fairly_ certain that no env vars were set on the launched node (but we did write some variable info to files on that host) | 18:37 |
dmsimard | Shrews: I think for the time being I won't bother too much and insert an element to persist the env vars | 18:37 |
dmsimard | yeah there's like /etc/nodepool/provider with some info, amongst other things | 18:38 |
Shrews | dmsimard: this is the thing I was thinking of https://github.com/openstack-infra/nodepool/blob/0.3.0/nodepool/nodepool.py#L630 but nothing for mirror host | 18:42 |
dmsimard | yeah that's /etc/nodepool/provider | 18:43 |
dmsimard | I half-expected the values defined in 'env-vars' to be carried over there | 18:43 |
Shrews | harlowja: k, i think i understand the problem now. we'll have to use a barrier instead of that code i pointed out | 18:54 |
harlowja | if u desire :-P | 18:55 |
harlowja | https://github.com/openstack/taskflow/blob/master/taskflow/types/latch.py is all yours | 18:55 |
harlowja | if u want it | 18:55 |
harlowja | lol | 18:55 |
harlowja | since py3.x got the latch addition | 18:55 |
harlowja | maybe someone backported it, i didn't, lol | 18:55 |
Shrews | because x.running no longer means "hey, i've started and reached a certain point in the startup" | 18:55 |
harlowja | ya, fair | 18:57 |
Shrews | ugh, this might be too much for right now. would have to pass the Latch in to each thread b/c the # of threads that would need to use it is configurable | 19:14 |
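One hedged sketch of the barrier idea: the starter and each of the N (configurable) worker threads share a single Barrier, so the starter only proceeds once every worker has reached its "I'm running" point; the barrier has to be sized for N and passed in, which is the extra plumbing Shrews is referring to.

```python
# Sketch only: a shared Barrier sized for N workers plus the starter.
import threading

def start_workers(num_workers, target):
    barrier = threading.Barrier(num_workers + 1)  # +1 for the starter

    def wrapped():
        barrier.wait()   # "I've started and reached this point"
        target()

    threads = [threading.Thread(target=wrapped) for _ in range(num_workers)]
    for t in threads:
        t.start()
    barrier.wait()       # starter blocks until every worker checks in
    return threads
```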
harlowja | yup | 19:15 |
harlowja | up to u | 19:16 |
Shrews | i think other v3 things are pressing, atm. will revisit later | 19:16 |
tobiash | I've noticed that ansible uses quite a lot of cpu when running (also when just waiting for a task to finish) | 19:43 |
tobiash | setting "internal_poll_interval = 0.01" in ansible.cfg fixed that for me locally | 19:43 |
tobiash | any objections to adding this setting to the generated ansible configs? | 19:47 |
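A sketch of how such a setting might be written into a generated ansible.cfg; the [defaults] section is where ansible reads internal_poll_interval from, and the file path here is a placeholder, not zuul's actual layout:

```python
# Sketch: add internal_poll_interval to a generated ansible.cfg.
# The path is a placeholder; only the option and value come from above.
import configparser

config = configparser.ConfigParser()
config.read('ansible.cfg')
if not config.has_section('defaults'):
    config.add_section('defaults')
config.set('defaults', 'internal_poll_interval', '0.01')

with open('ansible.cfg', 'w') as f:
    config.write(f)
```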
tobiash | encountered again a job freeze: http://paste.openstack.org/show/613548/ | 19:52 |
tobiash | the last one is an ansible-playbook process in a sleep state with no ssh connection launched | 19:53 |
SpamapS | ruhroh | 20:33 |
SpamapS | tobiash: what does strace think that process is doing? | 20:33 |
*** jkilpatr has quit IRC | 22:05 | |
*** jkilpatr has joined #zuul | 22:20 | |
*** hasharAway has quit IRC | 22:30 | |
mordred | dmsimard, Shrews: the main issue with sharing an openstack project between nodepool and non-nodepool is quota calculations | 22:49 |
mordred | dmsimard, Shrews: nodepool should not touch nodes it did not create - but whatever you tell nodepool its quota is will need to take into account actual quota - minus non-nodepool nodes | 22:50 |
mordred | the main problem with that will be that nodepool may sit there spinning hitting the nova api over and over trying and failing to create a node, which could put undue load on your control plane | 22:53 |
mordred | but if you manage the quotas right, you'll be good | 22:53 |
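Back-of-the-envelope version of that quota point, with made-up numbers: whatever max-servers you hand nodepool has to leave room for the non-nodepool instances sharing the tenant.

```python
# Hypothetical numbers, just to illustrate the arithmetic above.
tenant_instance_quota = 100     # what nova actually allows in the tenant
non_nodepool_instances = 30     # VMs managed by the other processes
nodepool_max_servers = tenant_instance_quota - non_nodepool_instances  # 70
```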