openstackgerrit | Adam Gandelman proposed openstack/astara: WIP Add clustering capabilities to instance manager https://review.openstack.org/267263 | 00:23 |
*** shashank_hegde has quit IRC | 01:42 | |
*** phil_hopk has quit IRC | 01:49 | |
*** phil_hop has joined #openstack-astara | 01:50 | |
*** phil_hopk has joined #openstack-astara | 01:51 | |
*** phil_hop has quit IRC | 01:55 | |
*** davidlenwell has quit IRC | 02:18 | |
*** davidlenwell has joined #openstack-astara | 02:19 | |
*** davidlenwell has quit IRC | 02:23 | |
*** davidlenwell has joined #openstack-astara | 02:27 | |
*** shashank_hegde has joined #openstack-astara | 02:37 | |
*** justinlund has quit IRC | 04:24 | |
*** shashank_hegde has quit IRC | 05:02 | |
*** shashank_hegde has joined #openstack-astara | 05:27 | |
*** reedip is now known as reedip_away | 06:30 | |
*** shashank_hegde has quit IRC | 07:23 | |
*** openstackgerrit has quit IRC | 09:17 | |
*** openstackgerrit has joined #openstack-astara | 09:17 | |
*** phil_hop has joined #openstack-astara | 11:48 | |
*** phil_hopk has quit IRC | 11:52 | |
*** justinlund has joined #openstack-astara | 14:54 | |
drwahl | so, akanda dudes, i think we found a bug | 17:31 |
drwahl | https://github.com/openstack/astara/blob/f2360d861f3904c8a06d94175be553fe5e7bab05/astara/worker.py#L212 | 17:31 |
drwahl | that get is blocking. for some reason, we have threads hanging on that get | 17:31 |
drwahl | which causes everything to back up and the workers to stop processing jobs | 17:31 |
drwahl | we're testing a (probably improper) patch for this right now | 17:32 |
drwahl | by setting sm = self.work_queue.get(block=False) | 17:32 |
drwahl | and then adding a 10 second sleep to the except | 17:32 |
drwahl | if that solves our problem, we'll create a bug and submit a patch and then you guys can tell us how crappy our patch is and we can figure out a proper fix for this | 17:32 |
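A minimal, self-contained sketch of the workaround described above, assuming a plain Python 2 `Queue.Queue` as the work queue; `process()` and the 10-second back-off are illustrative stand-ins rather than the actual worker.py code:

```python
import Queue  # Python 2 stdlib queue, as used by the kilo-era code
import time


def process(sm):
    """Stand-in for the real state machine update."""
    pass


def thread_target(work_queue):
    while True:
        try:
            # Non-blocking get: raises Queue.Empty immediately instead of
            # parking the thread inside the queue's internal wait.
            sm = work_queue.get(block=False)
        except Queue.Empty:
            # Back off so the loop doesn't spin; 10s matches the
            # experimental patch described above.
            time.sleep(10)
            continue
        try:
            process(sm)
        finally:
            work_queue.task_done()
```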
cleverdevil | :) | 17:34 |
cleverdevil | the `Queue.Queue` object pretty clearly says that it *won't* block for longer than `timeout` seconds. | 17:35 |
cleverdevil | so, that's a strange issue. | 17:35 |
rods | yeah, that's weird | 17:36 |
drwahl | i have other suspicions as to what *could* be happening, and it wouldn't incriminate Queue | 17:39 |
cleverdevil | yeah, I would find it highly unlikely that Queue is the issue. | 17:39 |
cleverdevil | have we confirmed that we actually block longer on that line than `timeout`? | 17:39 |
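One way to answer that question empirically, sketched here with illustrative names: time the call and log whenever it blocks noticeably longer than its timeout.

```python
import logging
import time
import Queue

LOG = logging.getLogger(__name__)
GET_TIMEOUT = 10  # whatever timeout the real call passes


def timed_get(work_queue):
    start = time.time()
    try:
        return work_queue.get(timeout=GET_TIMEOUT)
    except Queue.Empty:
        return None
    finally:
        elapsed = time.time() - start
        if elapsed > GET_TIMEOUT + 1:
            # If this fires, get() really is blocking past its timeout,
            # which would point at something interfering with the queue
            # rather than at the caller.
            LOG.warning('work_queue.get() blocked for %.1fs', elapsed)
```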
drwahl | we created a bug in that try/except and it spooled up the queue as well | 17:40 |
drwahl | which makes me think that the try/except block isn't working as expected | 17:40 |
cleverdevil | hmm | 17:40 |
drwahl | need to test more, but haven't had time yet | 17:40 |
cleverdevil | its possible that something other than Queue.Empty is being raised. | 17:41 |
drwahl | indeed | 17:41 |
drwahl | however, i'd expect that to actually raise through the stack and cause problems | 17:42 |
drwahl | instead of just spooling up the queue... | 17:42 |
cleverdevil | yeah... still... Queue is a pretty proven and simple bit of the Python standard library. | 17:42 |
drwahl | yup, i doubt the bug resides in Queue | 17:42 |
drwahl | but rather, in the way it's being used in that section of the code | 17:43 |
* cleverdevil nods | 17:43 | |
cleverdevil | we'll figure it out :) | 17:43 |
drwahl | unless our current patch *doesn't* fix it. then we're back to square 1 :( | 17:43 |
cleverdevil | yeah, not to be a downer, but my instinct is that the problem happens somewhere in the body of the loop, when an operation hangs. | 17:45 |
cleverdevil | not in the management of the queue | 17:45 |
*** shashank_hegde has joined #openstack-astara | 17:46 | |
drwahl | so, in the "except Queue.Empty", we added time.sleep(10), but forgot to import time, so it was creating a traceback | 17:49 |
drwahl | however | 17:49 |
drwahl | it wasn't crashing the thread or creating any logs mentioning anything | 17:49 |
drwahl | and the queue was spooling up | 17:49 |
drwahl | which is the *exact* behavior we see happen eventually | 17:49 |
drwahl | also, checking strace, the threads are clearly hanging on some FIFO read | 17:49 |
drwahl | which this queue get call should be executing | 17:50 |
drwahl | so we're clearly in the right area of the code. just need to figure out what exactly in this area is causing the threads to hang | 17:51 |
drwahl | also, since we changed that queue.get to non-blocking, things seem to be working, which again makes it suspicious that a non-blocking call makes things work there whereas a blocking call causes us issues | 17:51 |
drwahl | afaik, this is all an internal/IPC queue, so there shouldn't be anything external (like rabbitmq) causing the threads to hang | 17:52 |
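To see exactly which Python line those hung threads are sitting on, rather than just the FIFO read strace reports, one option is a signal handler that dumps every thread's stack via sys._current_frames(); the signal choice and print-based output here are illustrative.

```python
import signal
import sys
import threading
import traceback


def dump_stacks(signum, frame):
    # Map thread idents to names so the output is readable.
    names = dict((t.ident, t.name) for t in threading.enumerate())
    for ident, stack in sys._current_frames().items():
        print('--- thread %s (%s) ---' % (ident, names.get(ident, '?')))
        traceback.print_stack(stack)


# Send SIGUSR1 to the orchestrator process to trigger the dump.
signal.signal(signal.SIGUSR1, dump_stacks)
```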
markmcclain | drwahl, cleverdevil: interesting... don't forget eventlet messes with implementations of standard things, so it's possible something isn't yielding for the timeout event to fire | 18:16 |
markmcclain | adam_g: ^^^^ | 18:16 |
cleverdevil | that's a *really* good point | 18:28 |
cleverdevil | I absolutely think it could be eventlet. | 18:28 |
cleverdevil | (I. hate. eventlet.) | 18:28 |
markmcclain | eventlet and multi-processing are known not to play nice with each other | 18:28 |
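A quick way to check that theory in a running process: eventlet's patcher module can report which stdlib subsystems have been replaced with green versions (the module list below is just the common set; whether astara patches them at all is the thing being verified).

```python
import eventlet
import eventlet.patcher

# OpenStack services typically monkey-patch early; done here only so the
# sketch is self-contained.
eventlet.monkey_patch()

# Report which stdlib subsystems eventlet has swapped for green versions.
for name in ('os', 'select', 'socket', 'thread', 'time'):
    print('%s patched: %s' % (name, eventlet.patcher.is_monkey_patched(name)))
```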
cleverdevil | is there any reason that eventlet needs to be used at all with the rug? | 18:30 |
markmcclain | rpc and message processing since we inherit some hooks from neutron | 18:31 |
cleverdevil | hm. | 18:31 |
adam_g | drwahl, do you have any good way to reproduce the original issue or is it sporadic? | 18:32 |
adam_g | also id be interested in seeing the orchestrator logs leading up to the hangup, if you have them | 18:32 |
drwahl | adam_g: there has been effectively nothing notable in the logs | 18:36 |
drwahl | i'll see if i can dig them up | 18:36 |
drwahl | but we've gone over them really thoroughly over the last couple of weeks and saw nothing | 18:36 |
adam_g | drwahl, it might be worth adding a bare 'except:' at the end of the block to log and re-raise | 18:40 |
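A sketch of that suggestion, with the logger and timeout as stand-ins for whatever the real block uses: catch the expected Queue.Empty as before, but log and re-raise anything else so it can't disappear silently.

```python
import logging
import Queue

LOG = logging.getLogger(__name__)


def get_next(work_queue):
    try:
        return work_queue.get(timeout=10)
    except Queue.Empty:
        return None
    except Exception:
        # Anything landing here was previously invisible; log the full
        # traceback and re-raise so the failure is not swallowed.
        LOG.exception('unexpected error while waiting on work_queue')
        raise
```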
drwahl | was thinking the same thing | 18:41 |
adam_g | also, are you hitting this on master or a stable branch? | 18:41 |
adam_g | how did you narrow it down to the .get() blocking? i believe an uncaught exception anywhere in the thread target will cause the state machine to gum up. i ran into a similar issue that had the same symptoms fixed by https://review.openstack.org/#/c/271158/ | 18:44 |
drwahl | adam_g: we narrowed it down to that get because when we set .get(block=False), things work | 18:47 |
drwahl | at least, they've been working for the last 2 hours... | 18:47 |
drwahl | the jury is still out | 18:47 |
drwahl | also, when things are locked up, the threads are stuck in reading some FIFO pipe | 18:47 |
drwahl | so clearly it's some IPC | 18:47 |
drwahl | as for the code base, we're building from stable/kilo | 18:50 |
drwahl | we last built from it on Feb 2 | 18:51 |
adam_g | drwahl, i wonder, is setting it to get(block=False) just masking the fact that the get()'s corresponding task_done() is never getting called? | 18:51 |
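To illustrate the pairing being pointed at here: every Queue.get() must eventually be matched by a task_done(), or the queue's unfinished-task count never drains and join() blocks forever. A minimal sketch, unrelated to the actual astara code paths, using try/finally to guarantee the pairing:

```python
import Queue

q = Queue.Queue()
q.put('job')


def consume_one(queue):
    item = queue.get()
    try:
        pass  # do the real work with `item` here
    finally:
        # Always mark the item done, even if the work raised; otherwise
        # queue.join() waits forever on the unfinished-task count.
        queue.task_done()


consume_one(q)
q.join()  # returns immediately because task_done() was called
```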
adam_g | drwahl, so https://review.openstack.org/#/c/271158/1 hasnt been backported to kilo yet and the symptoms sound the same, you might want to pull that in as well | 18:53 |
adam_g | ill put up a backport now | 18:53 |
* drwahl doesn't see any "_release_resource_lock" | 18:55 | |
drwahl | that entire function seems absent from the code we're running | 18:55 |
adam_g | drwahl, https://github.com/openstack/astara/blob/stable/kilo/akanda/rug/worker.py#L366 | 18:58 |
adam_g | backporting it now | 18:58 |
cleverdevil | once we get our new cluster fully stabilized, we just need to push forward to get on the newest release of openstack we're happy with | 18:59 |
openstackgerrit | Adam Gandelman proposed openstack/astara: Handle exception when worker unlocks already unlocked resource https://review.openstack.org/276875 | 19:02 |
adam_g | drwahl, ^ | 19:02 |
drwahl | do i have perms to +2 that? | 19:04 |
* drwahl checks... | 19:04 | |
drwahl | +1 | 19:06 |
drwahl | (as good as i can do for you) | 19:06 |
adam_g | :) | 19:06 |
openstackgerrit | Adam Gandelman proposed openstack/astara: Handle exception when worker unlocks already unlocked resource https://review.openstack.org/276877 | 19:09 |
adam_g | (liberty) | 19:09 |
markmcclain | drwahl: rods and ryanpetrello have +2 too | 19:11 |
drwahl | roger that. thanks | 19:12 |
openstackgerrit | Merged openstack/astara: Handle exception when worker unlocks already unlocked resource https://review.openstack.org/276875 | 20:25 |
*** adam_g has left #openstack-astara | 20:32 | |
*** adam_g has joined #openstack-astara | 20:32 | |
*** justinlund1 has joined #openstack-astara | 20:57 | |
*** justinlund has quit IRC | 20:58 | |
*** shashank_hegde has quit IRC | 21:49 | |
*** shashank_hegde has joined #openstack-astara | 22:32 | |
*** phil_h has quit IRC | 22:34 | |
*** phil_h has joined #openstack-astara | 22:50 | |
markmcclain | adam_g: took a moment to dig into the functional failure when we turn off auto add resources | 23:04 |
markmcclain | looks like I'll have to make a few changes to how we set up the router since we set it into error state fairly early | 23:04 |
*** phil_hop has quit IRC | 23:09 | |
*** phil_hop has joined #openstack-astara | 23:10 |