openstackgerrit | Adam Gandelman proposed openstack/astara: WIP Add clustering capabilities to instance manager https://review.openstack.org/267263 | 00:23 |
*** shashank_hegde has quit IRC | 01:42 | |
*** phil_hopk has quit IRC | 01:49 | |
*** phil_hop has joined #openstack-astara | 01:50 | |
*** phil_hopk has joined #openstack-astara | 01:51 | |
*** phil_hop has quit IRC | 01:55 | |
*** davidlenwell has quit IRC | 02:18 | |
*** davidlenwell has joined #openstack-astara | 02:19 | |
*** davidlenwell has quit IRC | 02:23 | |
*** davidlenwell has joined #openstack-astara | 02:27 | |
*** shashank_hegde has joined #openstack-astara | 02:37 | |
*** justinlund has quit IRC | 04:24 | |
*** shashank_hegde has quit IRC | 05:02 | |
*** shashank_hegde has joined #openstack-astara | 05:27 | |
*** reedip is now known as reedip_away | 06:30 | |
*** shashank_hegde has quit IRC | 07:23 | |
*** openstackgerrit has quit IRC | 09:17 | |
*** openstackgerrit has joined #openstack-astara | 09:17 | |
*** phil_hop has joined #openstack-astara | 11:48 | |
*** phil_hopk has quit IRC | 11:52 | |
*** justinlund has joined #openstack-astara | 14:54 | |
drwahl | so, akanda dudes, i think we found a bug | 17:31 |
drwahl | https://github.com/openstack/astara/blob/f2360d861f3904c8a06d94175be553fe5e7bab05/astara/worker.py#L212 | 17:31 |
drwahl | that get is blocking. for some reason, we have threads hanging on that get | 17:31 |
drwahl | which causes everything to back up and the workers to stop processing jobs | 17:31 |
drwahl | we're testing a (probably improper) patch for this right now | 17:32 |
drwahl | by setting sm = self.work_queue.get(block=False) | 17:32 |
drwahl | and then adding a 10 second sleep to the except | 17:32 |
drwahl | if that solves our problem, we'll create a bug and submit a patch and then you guys can tell us how crappy our patch is and we can figure out a proper fix for this | 17:32 |
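A minimal, self-contained sketch of the workaround described above, assuming a plain Python 2 `Queue.Queue` as the work queue; `process()` and the 10-second back-off are illustrative stand-ins rather than the actual worker.py code:

```python
import Queue  # Python 2 stdlib queue, as used by the kilo-era code
import time


def process(sm):
    """Stand-in for the real state machine update."""
    pass


def thread_target(work_queue):
    while True:
        try:
            # Non-blocking get: raises Queue.Empty immediately instead of
            # parking the thread inside the queue's internal wait.
            sm = work_queue.get(block=False)
        except Queue.Empty:
            # Back off so the loop doesn't spin; 10s matches the
            # experimental patch described above.
            time.sleep(10)
            continue
        try:
            process(sm)
        finally:
            work_queue.task_done()
```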
cleverdevil | :) | 17:34 |
cleverdevil | the `Queue.Queue` object pretty clearly says that it *won't* block for longer than `timeout` seconds. | 17:35 |
cleverdevil | so, that's a strange issue. | 17:35 |
rods | yeah, that's weird | 17:36 |
drwahl | i have other suspicions as to what *could* be happening, and it wouldn't incriminate Queue | 17:39 |
cleverdevil | yeah, I would find it highly unlikely that Queue is the issue. | 17:39 |
cleverdevil | have we confirmed that we actually block longer on that line than `timeout`? | 17:39 |
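One way to answer that question empirically, sketched here with illustrative names: time the call and log whenever it blocks noticeably longer than its timeout.

```python
import logging
import time
import Queue

LOG = logging.getLogger(__name__)
GET_TIMEOUT = 10  # whatever timeout the real call passes


def timed_get(work_queue):
    start = time.time()
    try:
        return work_queue.get(timeout=GET_TIMEOUT)
    except Queue.Empty:
        return None
    finally:
        elapsed = time.time() - start
        if elapsed > GET_TIMEOUT + 1:
            # If this fires, get() really is blocking past its timeout,
            # which would point at something interfering with the queue
            # rather than at the caller.
            LOG.warning('work_queue.get() blocked for %.1fs', elapsed)
```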
drwahl | we created a bug in that try/except and it spooled up the queue as well | 17:40 |
drwahl | which makes me think that the try/except block isn't working as expected | 17:40 |
cleverdevil | hmm | 17:40 |
drwahl | need to test more, but haven't had time yet | 17:40 |
cleverdevil | its possible that something other than Queue.Empty is being raised. | 17:41 |
drwahl | indeed | 17:41 |
drwahl | however, i'd expect that to actually raise through the stack and cause problems | 17:42 |
drwahl | instead of just spooling up the queue... | 17:42 |
cleverdevil | yeah... still... Queue is a pretty proven and simple bit of the Python standard library. | 17:42 |
drwahl | yup, i doubt the bug resides in Queue | 17:42 |
drwahl | but rather, in the way it's being used in that section of the code | 17:43 |
* cleverdevil nods | 17:43 | |
cleverdevil | we'll figure it out :) | 17:43 |
drwahl | unless our current patch *doesn't* fix it. then we're back to square 1 :( | 17:43 |
cleverdevil | yeah, not to be a downer, but my instinct is that the problem happens somewhere in the body of the loop, when an operation hangs. | 17:45 |
cleverdevil | not in the management of the queue | 17:45 |
*** shashank_hegde has joined #openstack-astara | 17:46 | |
drwahl | so, in the "except Queue.Empty", we added time.sleep(10), but forgot to import time, so it was creating a traceback | 17:49 |
drwahl | however | 17:49 |
drwahl | it wasn't crashing the thread or creating any logs mentioning anything | 17:49 |
drwahl | and the queue was spooling up | 17:49 |
drwahl | which is the *exact* behavior we see happen eventually | 17:49 |
drwahl | also, checking strace, the threads are clearly hanging on some FIFO read | 17:49 |
drwahl | which this queue get call should be executing | 17:50 |
drwahl | so we're clearly in the right area of the code. just need to figure out what exactly in this area is causing the threads to hang | 17:51 |
drwahl | also, since we changed that queue.get to non-blocking, things seem to be working, which again makes it suspicious that a non-blocking call makes things work there whereas a blocking call causes us issues | 17:51 |
drwahl | afaik, this is all an internal/IPC queue, so there shouldn't be anything external (like rabbitmq) causing the threads to hang | 17:52 |
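To see exactly which Python line those hung threads are sitting on, rather than just the FIFO read strace reports, one option is a signal handler that dumps every thread's stack via sys._current_frames(); the signal choice and print-based output here are illustrative.

```python
import signal
import sys
import threading
import traceback


def dump_stacks(signum, frame):
    # Map thread idents to names so the output is readable.
    names = dict((t.ident, t.name) for t in threading.enumerate())
    for ident, stack in sys._current_frames().items():
        print('--- thread %s (%s) ---' % (ident, names.get(ident, '?')))
        traceback.print_stack(stack)


# Send SIGUSR1 to the orchestrator process to trigger the dump.
signal.signal(signal.SIGUSR1, dump_stacks)
```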
markmcclain | drwahl, cleverdevil: interesting... don't forget eventlet messes with implementations of standard things, so it's possible something isn't yielding for the timeout event to fire | 18:16 |
markmcclain | adam_g: ^^^^ | 18:16 |
cleverdevil | that's a *really* good point | 18:28 |
cleverdevil | I absolutely think it could be eventlet. | 18:28 |
cleverdevil | (I. hate. eventlet.) | 18:28 |
markmcclain | eventlet and multi-processing are known not to play nice with each other | 18:28 |
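A quick way to check that theory in a running process: eventlet's patcher module can report which stdlib subsystems have been replaced with green versions (the module list below is just the common set; whether astara patches them at all is the thing being verified).

```python
import eventlet
import eventlet.patcher

# OpenStack services typically monkey-patch early; done here only so the
# sketch is self-contained.
eventlet.monkey_patch()

# Report which stdlib subsystems eventlet has swapped for green versions.
for name in ('os', 'select', 'socket', 'thread', 'time'):
    print('%s patched: %s' % (name, eventlet.patcher.is_monkey_patched(name)))
```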
cleverdevil | is there any reason that eventlet needs to be used at all with the rug? | 18:30 |
markmcclain | rpc and message processing since we inherit some hooks from neutron | 18:31 |
cleverdevil | hm. | 18:31 |
adam_g | drwahl, do you have any good way to reproduce the original issue or is it sporadic? | 18:32 |
adam_g | also id be interested in seeing the orchestrator logs leading up to the hangup, if you have them | 18:32 |
drwahl | adam_g: there has been effectively nothing notable in the logs | 18:36 |
drwahl | i'll see if i can dig them up | 18:36 |
drwahl | but we've gone over them really thoroughly over the last couple of weeks and saw nothing | 18:36 |
adam_g | drwahl, it might be worth adding a bare 'except:' at the end of the block to log and re-raise | 18:40 |
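A sketch of that suggestion, with the logger and timeout as stand-ins for whatever the real block uses: catch the expected Queue.Empty as before, but log and re-raise anything else so it can't disappear silently.

```python
import logging
import Queue

LOG = logging.getLogger(__name__)


def get_next(work_queue):
    try:
        return work_queue.get(timeout=10)
    except Queue.Empty:
        return None
    except Exception:
        # Anything landing here was previously invisible; log the full
        # traceback and re-raise so the failure is not swallowed.
        LOG.exception('unexpected error while waiting on work_queue')
        raise
```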
drwahl | was thinking the same thing | 18:41 |
adam_g | also, are you hitting this on master or a stable branch? | 18:41 |
adam_g | how did you narrow it down to the .get() blocking? i believe an uncaught exception anywhere in the thread target will cause the state machine to gum up. i ran into a similar issue that had the same symptoms fixed by https://review.openstack.org/#/c/271158/ | 18:44 |
drwahl | adam_g: we narrowed it down to that get because when we set .get(block=False), things work | 18:47 |
drwahl | at least, they've been working for the last 2 hours... | 18:47 |
drwahl | the jury is still out | 18:47 |
drwahl | also, when things are locked up, the threads are stuck in reading some FIFO pipe | 18:47 |
drwahl | so clearly it's some IPC | 18:47 |
drwahl | as for the code base, we're building from stable/kilo | 18:50 |
drwahl | we last built from it on Feb 2 | 18:51 |
adam_g | drwahl, i wonder, is setting it to get(block=False) just masking the fact that the get()'s corresponding task_done() is never getting called? | 18:51 |
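To illustrate the pairing being pointed at here: every Queue.get() must eventually be matched by a task_done(), or the queue's unfinished-task count never drains and join() blocks forever. A minimal sketch, unrelated to the actual astara code paths, using try/finally to guarantee the pairing:

```python
import Queue

q = Queue.Queue()
q.put('job')


def consume_one(queue):
    item = queue.get()
    try:
        pass  # do the real work with `item` here
    finally:
        # Always mark the item done, even if the work raised; otherwise
        # queue.join() waits forever on the unfinished-task count.
        queue.task_done()


consume_one(q)
q.join()  # returns immediately because task_done() was called
```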
adam_g | drwahl, so https://review.openstack.org/#/c/271158/1 hasnt been backported to kilo yet and the symptoms sound the same, you might want to pull that in as well | 18:53 |
adam_g | ill put up a backport now | 18:53 |
* drwahl doesn't see any "_release_resource_lock" | 18:55 | |
drwahl | that entire function seems absent from the code we're running | 18:55 |
adam_g | drwahl, https://github.com/openstack/astara/blob/stable/kilo/akanda/rug/worker.py#L366 | 18:58 |
adam_g | backporting it now | 18:58 |
cleverdevil | once we get our new cluster fully stabilized, we just need to push forward to get on the newest release of openstack we're happy with | 18:59 |
openstackgerrit | Adam Gandelman proposed openstack/astara: Handle exception when worker unlocks already unlocked resource https://review.openstack.org/276875 | 19:02 |
adam_g | drwahl, ^ | 19:02 |
drwahl | do i have perms to +2 that? | 19:04 |
* drwahl checks... | 19:04 | |
drwahl | +1 | 19:06 |
drwahl | (as good as i can do for you) | 19:06 |
adam_g | :) | 19:06 |
openstackgerrit | Adam Gandelman proposed openstack/astara: Handle exception when worker unlocks already unlocked resource https://review.openstack.org/276877 | 19:09 |
adam_g | (liberty) | 19:09 |
markmcclain | drwahl: rods and ryanpetrello have +2 too | 19:11 |
drwahl | roger that. thanks | 19:12 |
openstackgerrit | Merged openstack/astara: Handle exception when worker unlocks already unlocked resource https://review.openstack.org/276875 | 20:25 |
*** adam_g has left #openstack-astara | 20:32 | |
*** adam_g has joined #openstack-astara | 20:32 | |
*** justinlund1 has joined #openstack-astara | 20:57 | |
*** justinlund has quit IRC | 20:58 | |
*** shashank_hegde has quit IRC | 21:49 | |
*** shashank_hegde has joined #openstack-astara | 22:32 | |
*** phil_h has quit IRC | 22:34 | |
*** phil_h has joined #openstack-astara | 22:50 | |
markmcclain | adam_g: took a moment to dig into the functional failure when we turn off auto add resources | 23:04 |
markmcclain | looks like I'll have to make a few changes to how we set up the router since we set it into error state fairly early | 23:04 |
*** phil_hop has quit IRC | 23:09 | |
*** phil_hop has joined #openstack-astara | 23:10 |