*** outofmemory is now known as reedip | 00:05 | |
*** yanghy has quit IRC | 00:14 | |
*** shashank_hegde has quit IRC | 05:02 | |
*** yanghy has joined #openstack-astara | 06:02 | |
*** yanghy has quit IRC | 06:12 | |
*** yanghy has joined #openstack-astara | 06:24 | |
*** shashank_hegde has joined #openstack-astara | 06:27 | |
openstackgerrit | Yang Hongyang proposed openstack/astara-appliance: ignore pbr outputs https://review.openstack.org/277705 | 06:42 |
*** yanghy has quit IRC | 07:05 | |
*** xiayu1 has joined #openstack-astara | 07:05 | |
*** xiayu has quit IRC | 07:07 | |
*** yanghy has joined #openstack-astara | 07:17 | |
*** yanghy has quit IRC | 07:28 | |
*** shashank_hegde has quit IRC | 08:45 | |
*** prithiv has joined #openstack-astara | 10:24 | |
*** prithiv has quit IRC | 10:32 | |
*** prithiv has joined #openstack-astara | 11:01 | |
*** prithiv has quit IRC | 12:08 | |
*** openstackgerrit_ has joined #openstack-astara | 12:39 | |
*** prithiv has joined #openstack-astara | 13:06 | |
*** openstackgerrit_ has quit IRC | 13:56 | |
*** prithiv has quit IRC | 14:30 | |
*** prithiv has joined #openstack-astara | 14:32 | |
drwahl | so, adam_g, we had another hang last night | 14:49 |
*** prithiv has quit IRC | 14:51 | |
*** shashank_hegde has joined #openstack-astara | 16:28 | |
markmcclain | phil_h, elo: I've tested disabling the ARP proxy with l2 pop enabled and things work as expected | 16:36 |
markmcclain | I'll work up a patch for neutron | 16:37 |
phil_h | Thanks, I have been finding different behaviour for the VMs on my laptop than in my remote environment | 16:37 |
markmcclain | will take some changes to get it backported to liberty | 16:38 |
phil_h | the remote one is where I was having the problem and I am trying to figure out the difference and why | 16:38
markmcclain | I've got 2 different nova computes | 16:39 |
markmcclain | to ensure that the instances were not local to each other | 16:39 |
markmcclain | after I get the tests written for this change | 16:40 |
markmcclain | I plan on diving into the wrong source address issue | 16:40
phil_h | thanks | 16:40 |
elopez | thanks | 16:49 |
elopez | drwahl: was this in the same part of the code where it hung? | 16:50
*** shashank_hegde has quit IRC | 16:52 | |
drwahl | yup | 17:06 |
drwahl | we have a fix (that was different than https://review.openstack.org/#/c/276875/) | 17:06 |
*** prithiv has joined #openstack-astara | 17:15 | |
*** prithiv has quit IRC | 17:17 | |
elopez | is this the patch for the BP that j_king submitted the other day? | 17:18
*** prithiv has joined #openstack-astara | 17:19 | |
*** xiayu1 has quit IRC | 17:20 | |
*** xiayu has joined #openstack-astara | 17:20 | |
drwahl | not sure... didn't see his BP | 17:43 |
drwahl | it's essentially a 2 line patch | 17:43 |
drwahl | 1 to set block=False on the queue, and a 2nd line to add a sleep in the except | 17:44 |
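A minimal sketch of the two-line change described above, assuming the worker thread consumes tasks from a standard-library `Queue.Queue` on Python 2; the names `work_queue`, `handle`, and `worker_loop` are illustrative, not the actual astara worker code:

```python
import time
from Queue import Queue, Empty  # Python 2 stdlib; the module is "queue" on Python 3

work_queue = Queue()


def handle(item):
    """Placeholder for the real per-task processing."""
    pass


def worker_loop():
    while True:
        try:
            # block=False returns immediately instead of waiting on the
            # queue's internal condition variable
            item = work_queue.get(block=False)
        except Empty:
            # short sleep so the loop doesn't spin at 100% CPU
            time.sleep(0.1)
            continue
        handle(item)
```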
adam_g | drwahl, eh? | 17:57 |
drwahl | ya, that backported change didn't fix our hangs | 17:57 |
adam_g | :\ | 17:58 |
stupidnic | adam_g: do you recall what the issue was that we had when installing from the git repo? I remember that it was something to do with the security groups and the DHCP patch. | 17:59 |
stupidnic | was that specific to astara-appliance or astara in general | 17:59 |
adam_g | stupidnic, IIRC i believe you were having trouble getting the correct branch of astara-neutron installed? | 18:00 |
stupidnic | Yeah that might be it | 18:00 |
stupidnic | I just don't understand how I could select the stable/liberty branch and still end up with the wrong thing installed on the controller | 18:01
stupidnic | /usr/local/lib/python2.7/dist-packages/astara_neutron/plugins/ml2_neutron_plugin.py | 18:02 |
stupidnic | (That's the issue that Eric just opened up) | 18:03 |
adam_g | stupidnic, the astara_neutron module didn't exist in liberty; it was still packaged as akanda/neutron/plugins | 18:04
stupidnic | Right | 18:04 |
adam_g | could it be you're actually installing it twice? once from master and once from stable/liberty? I really don't know what tools you're using to do any of this | 18:05
stupidnic | I followed the instructions from the etherpad we have | 18:05 |
stupidnic | I modified it a little to actually check out the correct branch | 18:05
stupidnic | I can confirm that git has the correct branch selected | 18:06 |
adam_g | stupidnic, were these systems re-used from a previous installation? | 18:06 |
stupidnic | Nope, fresh wipes/OS installation | 18:06 |
stupidnic | I will go back and review our salt states to confirm that nothing is changing | 18:07 |
adam_g | stupidnic, on that system, what does 'pip list | grep astara' show? | 18:08
stupidnic | astara (8.0.0.0b2.dev132) | 18:08 |
stupidnic | astara-neutron (8.0.0.0b3.dev1) | 18:08 |
adam_g | yeah, 8.x is all master | 18:08 |
adam_g | you should be on 7.x | 18:08 |
stupidnic | root@controller01:/opt/stack/astara# git status | 18:08 |
stupidnic | On branch stable/liberty | 18:08 |
stupidnic | so how can I be on that branch, but still install the wrong version? | 18:09 |
*** shashank_hegde has joined #openstack-astara | 18:09 | |
adam_g | stupidnic, because you've pip installed from that repository when it was checked out to master? | 18:09 |
stupidnic | Is there a specific file I can check in the repo to verify version? | 18:11 |
adam_g | stupidnic, 'pip install /opt/stack/astara' (by default) will install it in its current state into /usr/local/lib/python2.7 -- so if you 'pip install' on master and then check out stable/liberty, you're going to get master installed | 18:11
stupidnic | yeah... my salt state checks out the stable branch by default | 18:11 |
adam_g | stupidnic, it sounds like something is screwy with your salt stuff | 18:11 |
stupidnic | Next question then... if I figure out how that happens, can I downgrade my version of Astara? Will I have to rebuild my Astara router images? | 18:13 |
adam_g | stupidnic, downgrades are not something anyone's ever tested AFAIK, but swapping out the running astara bits while keeping the rest of the cloud static should be okay, provided you update configs accordingly | 18:15
adam_g | drwahl, any chance you can get some logs collected running up to and after things start hanging? | 18:19 |
drwahl | yup, i'll email you what we have | 18:26 |
adam_g | thanks | 18:27 |
drwahl | sent. should be in your inbox momentarily | 18:29 |
elo | can you CC me as well? | 18:33
*** prithiv has quit IRC | 18:34 | |
*** prithiv has joined #openstack-astara | 19:34 | |
*** prithiv has quit IRC | 19:38 | |
*** prithiv has joined #openstack-astara | 19:38 | |
ryanpetrello | we're nearly certain it's some type of deadlock on https://github.com/openstack/astara/blob/stable/kilo/akanda/rug/worker.py#L120 | 19:44 |
ryanpetrello | I'm working w/ jordan to try to reproduce with even more logging added | 19:44 |
ryanpetrello | but the symptom is that the queues for *every* thread in a worker start growing | 19:44 |
ryanpetrello | and are never consumed | 19:44 |
ryanpetrello | it's as if the Queue.Queue.get(timeout=...) call is hanging forever (and causing a lock contention for *all* of the threads that share that queue) | 19:44 |
ryanpetrello | for experimentation purposes, in devstack I replaced that .get() call | 19:45 |
ryanpetrello | with a function wrapped in a @lockutils.synchronized | 19:45 |
ryanpetrello | and the function just loops endlessly | 19:45 |
ryanpetrello | to sort of simulate a deadlock | 19:45 |
ryanpetrello | and the behavior is exactly what we're seeing in production | 19:46 |
ryanpetrello | (in terms of no threads doing any work) | 19:46 |
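A rough sketch of the experiment described above, assuming the `lockutils.synchronized` decorator comes from oslo.concurrency; the lock name and function name are made up for illustration and the exact devstack patch may differ:

```python
import time

from oslo_concurrency import lockutils


@lockutils.synchronized('simulated-worker-deadlock')
def fake_blocking_get():
    # The first caller acquires the lock and never returns, so every other
    # thread that calls this (in place of Queue.get) blocks on the
    # decorator's lock -- mimicking a worker where no thread makes progress.
    while True:
        time.sleep(1)
```

Swapping something like this in for the `.get()` call reproduces the symptom described: queues grow and nothing is consumed.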
ryanpetrello | elo adam_g ^ | 19:49 |
adam_g | ryanpetrello, right, looking at logs it looks like a global lockup | 19:51
adam_g | and not per-SM | 19:51 |
adam_g | I'm spinning up a kilo devstack now to play with it | 19:52
ryanpetrello | k | 19:52 |
ryanpetrello | I don't know if we've seen *that* behavior | 19:52 |
ryanpetrello | but I suppose it's possible | 19:52 |
ryanpetrello | what's interesting is that if we change the Queue.get() call slightly so that it's non-blocking | 19:52 |
ryanpetrello | e.g., | 19:52 |
ryanpetrello | Queue.get(block=False) | 19:52 |
ryanpetrello | except Empty: time.sleep; continue | 19:52 |
ryanpetrello | the issue *seems* to go away | 19:53 |
ryanpetrello | at least we ran a version that was nonblocking over the weekend | 19:53 |
ryanpetrello | and never saw the issue | 19:53 |
ryanpetrello | whereas with the current, blocking version, the rug usually hangs up within 12 hours | 19:54 |
ryanpetrello | we're all working over here trying to figure out some way to trigger it | 19:55 |
ryanpetrello | but it may just be some sort of lock race | 19:55 |
ryanpetrello | we haven't seen the bug in the older Juno code | 19:55 |
ryanpetrello | so it's either some sort of issue in the Kilo version of the rug | 19:55 |
ryanpetrello | or a difference in Python | 19:55 |
ryanpetrello | we *are* running a newer Python in the kilo environment | 19:55 |
ryanpetrello | so it's possible it's actually a cPython queue/thread bug | 19:55 |
adam_g | ryanpetrello, oh thats interesting | 19:59 |
adam_g | ryanpetrello, what python version are you using and is it available somewhere? | 19:59
drwahl | Python 2.7.6 | 20:00 |
adam_g | oh, ok | 20:01 |
*** phil_h has quit IRC | 20:07 | |
ryanpetrello | markmcclain or adam_g you around? | 20:24 |
ryanpetrello | we're seeing the issue *right now* | 20:24 |
*** prithiv has quit IRC | 20:24 | |
*** phil_h has joined #openstack-astara | 20:27 | |
drwahl | ryanpetrello: https://gist.github.com/drwahl/0a4fa4d33fcb7e7b45d1 | 20:29 |
markmcclain | distracted by the TC meeting | 20:31 |
markmcclain | I'm thinking we might need to get a neutron and astara database dump | 20:32
markmcclain | and try to stand up a 2nd instance where we can poke around the internals a bit without risking impact to your prod clients | 20:33
adam_g | ryanpetrello, the block=False fix sounds like a reasonable workaround but it would be good to get to the bottom of it and figure out why it's clogging up. I assume it'll probably affect master+liberty as well | 20:42
ryanpetrello | yea, I'm guessing so | 20:42 |
ryanpetrello | we're digging w/ strace atm | 20:42 |
ryanpetrello | we may need to use a python with debug and gdb to dig further | 20:42 |
markmcclain | what version of eventlet? | 20:43 |
fzylogic | 0.18.1 | 20:45 |
markmcclain | fzylogic: the only fix in 0.18.2 likely won't change things | 20:57 |
markmcclain | I think it might be the FIFO between the parent process and the workers | 20:57
markmcclain | wondering if a gratuitous sleep() just prior to reading off the queue would help to solve the deadlock | 21:00 |
adam_g | I'm walking through the worker code, juno vs kilo, and not much has changed | 21:03
adam_g | drwahl, ryanpetrello what python were you on for juno? | 21:04 |
rods | 2.7.3 | 21:04 |
fzylogic | 2.7.3 from Precise | 21:04 |
markmcclain | adam_g: could just be we introduced enough jitter elsewhere to create the deadlock | 21:06 |
markmcclain | sprinkling the magic eventlet.sleep(0.1) pixie dust might be a super low hanging fruit | 21:07 |
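As a sketch of that idea, a consumer loop that yields to the eventlet hub before each read; the names are again illustrative, and whether this actually avoids the hang is exactly the open question:

```python
import eventlet
from Queue import Empty  # Python 2 stdlib


def consume(work_queue, handle):
    while True:
        eventlet.sleep(0.1)  # cooperative yield before touching the queue
        try:
            item = work_queue.get(timeout=1)
        except Empty:
            continue
        handle(item)
```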
*** prithiv has joined #openstack-astara | 21:10 | |
*** Vp has joined #openstack-astara | 21:40 | |
*** Vp has joined #openstack-astara | 21:40 | |
*** Vp is now known as Guest77892 | 21:41 | |
Guest77892 | Hi | 21:41 |
*** Guest77892 has quit IRC | 21:41 | |
*** phil_h has quit IRC | 22:03 | |
*** prithiv has quit IRC | 22:24 | |
*** prithiv has joined #openstack-astara | 22:24 | |
*** shashank_hegde has quit IRC | 23:53 |