Tuesday, 2016-02-09

00:05 *** outofmemory is now known as reedip
00:14 *** yanghy has quit IRC
05:02 *** shashank_hegde has quit IRC
06:02 *** yanghy has joined #openstack-astara
06:12 *** yanghy has quit IRC
06:24 *** yanghy has joined #openstack-astara
06:27 *** shashank_hegde has joined #openstack-astara
06:42 <openstackgerrit> Yang Hongyang proposed openstack/astara-appliance: ignore pbr outputs  https://review.openstack.org/277705
07:05 *** yanghy has quit IRC
07:05 *** xiayu1 has joined #openstack-astara
07:07 *** xiayu has quit IRC
07:17 *** yanghy has joined #openstack-astara
07:28 *** yanghy has quit IRC
08:45 *** shashank_hegde has quit IRC
10:24 *** prithiv has joined #openstack-astara
10:32 *** prithiv has quit IRC
11:01 *** prithiv has joined #openstack-astara
12:08 *** prithiv has quit IRC
12:39 *** openstackgerrit_ has joined #openstack-astara
13:06 *** prithiv has joined #openstack-astara
13:56 *** openstackgerrit_ has quit IRC
14:30 *** prithiv has quit IRC
14:32 *** prithiv has joined #openstack-astara
14:49 <drwahl> so, adam_g, we had another hang last night
14:51 *** prithiv has quit IRC
16:28 *** shashank_hegde has joined #openstack-astara
16:36 <markmcclain> phil_h, elo: I've tested disabling the ARP proxy with l2 pop enabled and things work as expected
16:37 <markmcclain> I'll work up a patch for neutron
16:37 <phil_h> Thanks, I have been finding different behaviour for the VMs on my laptop than in my remote environment
16:38 <markmcclain> will take some changes to get it backported to liberty
16:38 <phil_h> the remote one is where I was having the problem, and I am trying to figure out the difference and why
16:39 <markmcclain> I've got 2 different nova computes
16:39 <markmcclain> to ensure that the instances were not local to each other
16:40 <markmcclain> after I get the tests written for this change
16:40 <markmcclain> I plan on diving into the wrong source address
16:50 <elopez> drwahl: was this in the same part of the code where it hung?
16:52 *** shashank_hegde has quit IRC
17:06 <drwahl> we have a fix (that was different than https://review.openstack.org/#/c/276875/)
17:15 *** prithiv has joined #openstack-astara
17:17 *** prithiv has quit IRC
17:18 <elopez> is the patch for the BP that j_king submitted the other day?
17:19 *** prithiv has joined #openstack-astara
17:20 *** xiayu1 has quit IRC
17:20 *** xiayu has joined #openstack-astara
17:43 <drwahl> not sure... didn't see his BP
17:43 <drwahl> it's essentially a 2-line patch
17:44 <drwahl> 1 line to set block=False on the queue, and a 2nd line to add a sleep in the except
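[Editor's note: the 2-line patch drwahl describes can be sketched roughly as below. This is an illustrative stand-in, not the actual astara change — the `drain`/`handle`/`done` names and the 0.1 s back-off are assumptions.]

```python
import time
from queue import Queue, Empty  # Queue.Queue / Queue.Empty on the Python 2 the rug runs

def drain(q, handle, done):
    """Consume items from q without ever blocking inside get()."""
    while not done():
        try:
            item = q.get(block=False)  # patch line 1: non-blocking get
        except Empty:
            time.sleep(0.1)            # patch line 2: back off, then retry
            continue
        handle(item)
```

Since get() never blocks, no thread can get stuck in the queue's internal condition wait; the trade-off is a wake-up every 0.1 s even when the queue is idle.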
17:57 <adam_g> drwahl, eh?
17:57 <drwahl> ya, that backported change didn't fix our hangs
17:59 <stupidnic> adam_g: do you recall what the issue was that we had when installing from the git repo? I remember that it was something to do with the security groups and the DHCP patch.
17:59 <stupidnic> was that specific to astara-appliance or astara in general?
18:00 <adam_g> stupidnic, IIRC you were having trouble getting the correct branch of astara-neutron installed?
18:00 <stupidnic> Yeah, that might be it
18:01 <stupidnic> I just don't understand how I can select the stable/liberty branch and still end up with the wrong thing installed on the controller
18:03 <stupidnic> (That's the issue that Eric just opened up)
18:04 <adam_g> stupidnic, the astara_neutron module didn't exist in liberty, it was still packaged as akanda/neutron/plugins
18:05 <adam_g> could it be you're actually installing it twice? once from master and once from stable/liberty? I really don't know what tools you're using to do any of this
18:05 <stupidnic> I followed the instructions from the etherpad we have
18:05 <stupidnic> I actually modified it a little to checkout the correct branch
18:06 <stupidnic> I can confirm that git has the correct branch selected
18:06 <adam_g> stupidnic, were these systems re-used from a previous installation?
18:06 <stupidnic> Nope, fresh wipes/OS installation
18:07 <stupidnic> I will go back and review our salt states to confirm that nothing is changing
18:08 <adam_g> stupidnic, on that system, what does 'pip list | grep astara' show?
<stupidnic> astara (
<stupidnic> astara-neutron (
18:08 <adam_g> yeah, 8.x is all master
18:08 <adam_g> you should be on 7.x
18:08 <stupidnic> root@controller01:/opt/stack/astara# git status
18:08 <stupidnic> On branch stable/liberty
18:09 <stupidnic> so how can I be on that branch, but still install the wrong version?
18:09 *** shashank_hegde has joined #openstack-astara
18:09 <adam_g> stupidnic, because you've pip installed from that repository when it was checked out to master?
18:11 <stupidnic> Is there a specific file I can check in the repo to verify the version?
18:11 <adam_g> stupidnic, 'pip install /opt/stack/astara' (by default) will install it in its current state into /usr/local/lib/python2.7 -- so if you 'pip install' on master and then checkout stable/liberty, you're going to get master installed
18:11 <stupidnic> yeah... my salt state checks out the stable branch by default
18:11 <adam_g> stupidnic, it sounds like something is screwy with your salt stuff
18:13 <stupidnic> Next question then... if I figure out how that happens, can I downgrade my version of Astara? Will I have to rebuild my Astara router images?
18:15 <adam_g> stupidnic, downgrades are not something anyone's ever tested AFAIK, but swapping out the running astara bits while keeping the rest of the cloud static should be okay, provided you update configs accordingly
18:19 <adam_g> drwahl, any chance you can get some logs collected running up to and after things start hanging?
18:26 <drwahl> yup, i'll email you what we have
18:29 <drwahl> sent. should be in your inbox momentarily
18:33 <elo> can you CC me as well?
18:34 *** prithiv has quit IRC
19:34 *** prithiv has joined #openstack-astara
19:38 *** prithiv has quit IRC
19:38 *** prithiv has joined #openstack-astara
19:44 <ryanpetrello> we're nearly certain it's some type of deadlock on https://github.com/openstack/astara/blob/stable/kilo/akanda/rug/worker.py#L120
19:44 <ryanpetrello> I'm working w/ jordan to try to reproduce with even more logging added
19:44 <ryanpetrello> but the symptom is that the queues for *every* thread in a worker start growing
19:44 <ryanpetrello> and are never consumed
19:44 <ryanpetrello> it's as if the Queue.Queue.get(timeout=...) call is hanging forever (and causing lock contention for *all* of the threads that share that queue)
19:45 <ryanpetrello> for experimentation purposes, in devstack I replaced that .get() call
19:45 <ryanpetrello> with a function wrapped in a @lockutils.synchronized
19:45 <ryanpetrello> and the function just loops endlessly
19:45 <ryanpetrello> to sort of simulate a deadlock
19:46 <ryanpetrello> and the behavior is exactly what we're seeing in production
19:46 <ryanpetrello> (in terms of no threads doing any work)
19:49 <ryanpetrello> elo adam_g ^
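[Editor's note: a minimal stand-in for the experiment ryanpetrello describes. A plain threading.Lock plays the role of oslo's @lockutils.synchronized here, and one thread that takes it and never releases it starves every other thread sharing it — reproducing the "no threads doing any work" symptom. All names below are illustrative.]

```python
import threading
import time

lock = threading.Lock()
progress = []

def stuck():
    # Stand-in for the synchronized function that "just loops endlessly":
    lock.acquire()        # taken and never released
    time.sleep(60)

def worker(i):
    with lock:            # every worker needs the same lock, so all of them block
        progress.append(i)

blocker = threading.Thread(target=stuck, daemon=True)
blocker.start()
time.sleep(0.2)           # let stuck() win the lock first

workers = [threading.Thread(target=worker, args=(i,), daemon=True) for i in range(3)]
for t in workers:
    t.start()
for t in workers:
    t.join(timeout=0.5)   # none finishes; their queues would grow unconsumed
```

The daemon flags only exist so the sketch's interpreter can exit despite the deliberately held lock.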
19:51 <adam_g> ryanpetrello, right, looking at logs it looks like a global lockup
19:51 <adam_g> and not per-SM
19:52 <adam_g> I'm spinning up a kilo devstack now to play with it
19:52 <ryanpetrello> I don't know if we've seen *that* behavior
19:52 <ryanpetrello> but I suppose it's possible
19:52 <ryanpetrello> what's interesting is that if we change the Queue.get() call slightly so that it's non-blocking
19:52 <ryanpetrello> except Empty:  time.sleep; continue
19:53 <ryanpetrello> the issue *seems* to go away
19:53 <ryanpetrello> at least we ran a version that was non-blocking over the weekend
19:53 <ryanpetrello> and never saw the issue
19:54 <ryanpetrello> whereas with the current, blocking version, the rug usually hangs within 12 hours
19:55 <ryanpetrello> we're all working over here trying to figure out some way to trigger it
19:55 <ryanpetrello> but it may just be some sort of lock race
19:55 <ryanpetrello> we haven't seen the bug in the older Juno code
19:55 <ryanpetrello> so it's either some sort of issue in the Kilo version of the rug
19:55 <ryanpetrello> or a difference in Python
19:55 <ryanpetrello> we *are* running a newer Python in the kilo environment
19:55 <ryanpetrello> so it's possible it's actually a cPython queue/thread bug
19:59 <adam_g> ryanpetrello, oh that's interesting
19:59 <adam_g> ryanpetrello, what python version are you using and is it available somewhere?
20:00 <drwahl> Python 2.7.6
20:01 <adam_g> oh, ok
20:07 *** phil_h has quit IRC
20:24 <ryanpetrello> markmcclain or adam_g you around?
20:24 <ryanpetrello> we're seeing the issue *right now*
20:24 *** prithiv has quit IRC
20:27 *** phil_h has joined #openstack-astara
20:29 <drwahl> ryanpetrello: https://gist.github.com/drwahl/0a4fa4d33fcb7e7b45d1
20:31 <markmcclain> distracted by the TC meeting
20:32 <markmcclain> I'm thinking we might need to get neutron and astara database dumps
20:33 <markmcclain> and try to stand up a 2nd instance where we can poke around the internals a bit without risking impact to your prod clients
20:42 <adam_g> ryanpetrello, the block=False fix sounds like a reasonable workaround but it would be good to get to the bottom of it and figure out why it's clogging up. i assume it'll probably affect master+liberty as well
20:42 <ryanpetrello> yea, I'm guessing so
20:42 <ryanpetrello> we're digging w/ strace atm
20:42 <ryanpetrello> we may need to use a python with debug and gdb to dig further
20:43 <markmcclain> what version of eventlet?
20:57 <markmcclain> fzylogic: the only fix in 0.18.2 likely won't change things
20:57 <markmcclain> I think it might be FIFO between the parent process and workers
21:00 <markmcclain> wondering if a gratuitous sleep() just prior to reading off the queue would help to solve the deadlock
21:03 <adam_g> I'm walking through the worker code juno vs kilo and not much changed
21:04 <adam_g> drwahl, ryanpetrello what python were you on for juno?
21:04 <fzylogic> 2.7.3 from Precise
21:06 <markmcclain> adam_g: could just be we introduced enough jitter elsewhere to create the deadlock
21:07 <markmcclain> sprinkling the magic eventlet.sleep(0.1) pixie dust might be a super low-hanging fruit
21:10 *** prithiv has joined #openstack-astara
21:40 *** Vp has joined #openstack-astara
21:41 *** Vp is now known as Guest77892
21:41 *** Guest77892 has quit IRC
22:03 *** phil_h has quit IRC
22:24 *** prithiv has quit IRC
22:24 *** prithiv has joined #openstack-astara
23:53 *** shashank_hegde has quit IRC

Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!