*** outofmemory is now known as reedip | 00:05 | |
*** yanghy has quit IRC | 00:14 | |
*** shashank_hegde has quit IRC | 05:02 | |
*** yanghy has joined #openstack-astara | 06:02 | |
*** yanghy has quit IRC | 06:12 | |
*** yanghy has joined #openstack-astara | 06:24 | |
*** shashank_hegde has joined #openstack-astara | 06:27 | |
openstackgerrit | Yang Hongyang proposed openstack/astara-appliance: ignore pbr outputs https://review.openstack.org/277705 | 06:42 |
*** yanghy has quit IRC | 07:05 | |
*** xiayu1 has joined #openstack-astara | 07:05 | |
*** xiayu has quit IRC | 07:07 | |
*** yanghy has joined #openstack-astara | 07:17 | |
*** yanghy has quit IRC | 07:28 | |
*** shashank_hegde has quit IRC | 08:45 | |
*** prithiv has joined #openstack-astara | 10:24 | |
*** prithiv has quit IRC | 10:32 | |
*** prithiv has joined #openstack-astara | 11:01 | |
*** prithiv has quit IRC | 12:08 | |
*** openstackgerrit_ has joined #openstack-astara | 12:39 | |
*** prithiv has joined #openstack-astara | 13:06 | |
*** openstackgerrit_ has quit IRC | 13:56 | |
*** prithiv has quit IRC | 14:30 | |
*** prithiv has joined #openstack-astara | 14:32 | |
drwahl | so, adam_g, we had another hang last night | 14:49 |
*** prithiv has quit IRC | 14:51 | |
*** shashank_hegde has joined #openstack-astara | 16:28 | |
markmcclain | phil_h, elo: I've tested disabling the ARP proxy with l2 pop enabled and things work as expected | 16:36 |
markmcclain | I'll work up a patch for neutron | 16:37 |
phil_h | Thanks, I have been finding different behaviour for the VMs on my laptop than in my remote environment | 16:37 |
markmcclain | will take some changes to get it backported to liberty | 16:38 |
phil_h | the remote one is where I was having the problem and I am trying to figure out the difference and why | 16:38
markmcclain | I've got 2 different nova computes | 16:39 |
markmcclain | to ensure that the instances were not local to each other | 16:39 |
markmcclain | after I get the tests written for this change | 16:40 |
markmcclain | I plan on diving into the wrong source address issue | 16:40
phil_h | thanks | 16:40 |
elopez | thanks | 16:49 |
elopez | drwahl: was this in the same part of the code where it hung? | 16:50
*** shashank_hegde has quit IRC | 16:52 | |
drwahl | yup | 17:06 |
drwahl | we have a fix (that was different than https://review.openstack.org/#/c/276875/) | 17:06 |
*** prithiv has joined #openstack-astara | 17:15 | |
*** prithiv has quit IRC | 17:17 | |
elopez | is this the patch for the BP that j_king submitted the other day? | 17:18
*** prithiv has joined #openstack-astara | 17:19 | |
*** xiayu1 has quit IRC | 17:20 | |
*** xiayu has joined #openstack-astara | 17:20 | |
drwahl | not sure... didn't see his BP | 17:43 |
drwahl | it's essentially a 2 line patch | 17:43 |
drwahl | 1 to set block=False on the queue, and a 2nd line to add a sleep in the except | 17:44 |
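A minimal sketch of the two-line change described above, assuming the worker thread consumes tasks from a standard-library `Queue.Queue` on Python 2; the names `work_queue`, `handle`, and `worker_loop` are illustrative, not the actual astara worker code:

```python
import time
from Queue import Queue, Empty  # Python 2 stdlib; the module is "queue" on Python 3

work_queue = Queue()


def handle(item):
    """Placeholder for the real per-task processing."""
    pass


def worker_loop():
    while True:
        try:
            # block=False returns immediately instead of waiting on the
            # queue's internal condition variable
            item = work_queue.get(block=False)
        except Empty:
            # short sleep so the loop doesn't spin at 100% CPU
            time.sleep(0.1)
            continue
        handle(item)
```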
adam_g | drwahl, eh? | 17:57 |
drwahl | ya, that backported change didn't fix our hangs | 17:57 |
adam_g | :\ | 17:58 |
stupidnic | adam_g: do you recall what the issue was that we had when installing from the git repo? I remember that it was something to do with the security groups and the DHCP patch. | 17:59 |
stupidnic | was that specific to astara-appliance or astara in general | 17:59 |
adam_g | stupidnic, IIRC i believe you were having trouble getting the correct branch of astara-neutron installed? | 18:00 |
stupidnic | Yeah that might be it | 18:00 |
stupidnic | I just don't understand how I could select the stable/liberty branch and still end up with the wrong thing installed on the controller | 18:01
stupidnic | /usr/local/lib/python2.7/dist-packages/astara_neutron/plugins/ml2_neutron_plugin.py | 18:02 |
stupidnic | (That's the issue that Eric just opened up) | 18:03 |
adam_g | stupidnic, the astara_neutron module didn't exist in liberty; it was still packaged as akanda/neutron/plugins | 18:04
stupidnic | Right | 18:04 |
adam_g | could it be you're actually installing it twice? once from master and once from stable/liberty? I really don't know what tools you're using to do any of this | 18:05
stupidnic | I followed the instructions from the etherpad we have | 18:05 |
stupidnic | I modified it a little to actually check out the correct branch | 18:05
stupidnic | I can confirm that git has the correct branch selected | 18:06 |
adam_g | stupidnic, were these systems re-used from a previous installation? | 18:06 |
stupidnic | Nope, fresh wipes/OS installation | 18:06 |
stupidnic | I will go back and review our salt states to confirm that nothing is changing | 18:07 |
adam_g | stupidnic, on that system, what does 'pip list | grep astara' show? | 18:08
stupidnic | astara (8.0.0.0b2.dev132) | 18:08 |
stupidnic | astara-neutron (8.0.0.0b3.dev1) | 18:08 |
adam_g | yeah, 8.x is all master | 18:08 |
adam_g | you should be on 7.x | 18:08 |
stupidnic | root@controller01:/opt/stack/astara# git status | 18:08 |
stupidnic | On branch stable/liberty | 18:08 |
stupidnic | so how can I be on that branch, but still install the wrong version? | 18:09 |
*** shashank_hegde has joined #openstack-astara | 18:09 | |
adam_g | stupidnic, because you've pip installed from that repository when it was checked out to master? | 18:09 |
stupidnic | Is there a specific file I can check in the repo to verify version? | 18:11 |
adam_g | stupidnic, 'pip install /opt/stack/astara' (by default) will install it in its current state into /usr/local/lib/python2.7 -- so if you 'pip install' on master and then check out stable/liberty, you're going to get master installed | 18:11
stupidnic | yeah... my salt state checks out the stable branch by default | 18:11 |
adam_g | stupidnic, it sounds like something is screwy with your salt stuff | 18:11 |
stupidnic | Next question then... if I figure out how that happens, can I downgrade my version of Astara? Will I have to rebuild my Astara router images? | 18:13 |
adam_g | stupidnic, downgrades are not something anyone's ever tested AFAIK, but swapping out the running astara bits while keeping the rest of the cloud static should be okay, provided you update configs accordingly | 18:15
adam_g | drwahl, any chance you can get some logs collected running up to and after things start hanging? | 18:19 |
drwahl | yup, i'll email you what we have | 18:26 |
adam_g | thanks | 18:27 |
drwahl | sent. should be in your inbox momentarily | 18:29 |
elo | can you CC me as well? | 18:33
*** prithiv has quit IRC | 18:34 | |
*** prithiv has joined #openstack-astara | 19:34 | |
*** prithiv has quit IRC | 19:38 | |
*** prithiv has joined #openstack-astara | 19:38 | |
ryanpetrello | we're nearly certain it's some type of deadlock on https://github.com/openstack/astara/blob/stable/kilo/akanda/rug/worker.py#L120 | 19:44 |
ryanpetrello | I'm working w/ jordan to try to reproduce with even more logging added | 19:44 |
ryanpetrello | but the symptom is that the queues for *every* thread in a worker start growing | 19:44 |
ryanpetrello | and are never consumed | 19:44 |
ryanpetrello | it's as if the Queue.Queue.get(timeout=...) call is hanging forever (and causing a lock contention for *all* of the threads that share that queue) | 19:44 |
ryanpetrello | for experimentation purposes, in devstack I replaced that .get() call | 19:45 |
ryanpetrello | with a function wrapped in a @lockutils.synchronized | 19:45 |
ryanpetrello | and the function just loops endlessly | 19:45 |
ryanpetrello | to sort of simulate a deadlock | 19:45 |
ryanpetrello | and the behavior is exactly what we're seeing in production | 19:46 |
ryanpetrello | (in terms of no threads doing any work) | 19:46 |
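A rough sketch of the experiment described above, assuming the `lockutils.synchronized` decorator comes from oslo.concurrency; the lock name and function name are made up for illustration and the exact devstack patch may differ:

```python
import time

from oslo_concurrency import lockutils


@lockutils.synchronized('simulated-worker-deadlock')
def fake_blocking_get():
    # The first caller acquires the lock and never returns, so every other
    # thread that calls this (in place of Queue.get) blocks on the
    # decorator's lock -- mimicking a worker where no thread makes progress.
    while True:
        time.sleep(1)
```

Swapping something like this in for the `.get()` call reproduces the symptom described: queues grow and nothing is consumed.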
ryanpetrello | elo adam_g ^ | 19:49 |
adam_g | ryanpetrello, right, looking at logs it looks like a global lockup | 19:51
adam_g | and not per-SM | 19:51 |
adam_g | I'm spinning up a kilo devstack now to play with it | 19:52
ryanpetrello | k | 19:52 |
ryanpetrello | I don't know if we've seen *that* behavior | 19:52 |
ryanpetrello | but I suppose it's possible | 19:52 |
ryanpetrello | what's interesting is that if we change the Queue.get() call slightly so that it's non-blocking | 19:52 |
ryanpetrello | e.g., | 19:52 |
ryanpetrello | Queue.get(block=False) | 19:52 |
ryanpetrello | except Empty: time.sleep; continue | 19:52 |
ryanpetrello | the issue *seems* to go away | 19:53 |
ryanpetrello | at least we ran a version that was nonblocking over the weekend | 19:53 |
ryanpetrello | and never saw the issue | 19:53 |
ryanpetrello | whereas with the current, blocking version, the rug usually hangs up within 12 hours | 19:54 |
ryanpetrello | we're all working over here trying to figure out some way to trigger it | 19:55 |
ryanpetrello | but it may just be some sort of lock race | 19:55 |
ryanpetrello | we haven't seen the bug in the older Juno code | 19:55 |
ryanpetrello | so it's either some sort of issue in the Kilo version of the rug | 19:55 |
ryanpetrello | or a difference in Python | 19:55 |
ryanpetrello | we *are* running a newer Python in the kilo environment | 19:55 |
ryanpetrello | so it's possible it's actually a cPython queue/thread bug | 19:55 |
adam_g | ryanpetrello, oh thats interesting | 19:59 |
adam_g | ryanpetrello, what python version are you using and is it available somewhere? | 19:59
drwahl | Python 2.7.6 | 20:00 |
adam_g | oh, ok | 20:01 |
*** phil_h has quit IRC | 20:07 | |
ryanpetrello | markmcclain or adam_g you around? | 20:24 |
ryanpetrello | we're seeing the issue *right now* | 20:24 |
*** prithiv has quit IRC | 20:24 | |
*** phil_h has joined #openstack-astara | 20:27 | |
drwahl | ryanpetrello: https://gist.github.com/drwahl/0a4fa4d33fcb7e7b45d1 | 20:29 |
markmcclain | distracted by the TC meeting | 20:31 |
markmcclain | I'm thinking we might need to get a neutron and astara database dump | 20:32
markmcclain | and try to stand up a 2nd instance where we can poke around the internals a bit without risking impact to your prod clients | 20:33
adam_g | ryanpetrello, the block=False fix sounds like a reasonable workaround but it would be good to get to the bottom of it and figure out why it's clogging up. I assume it'll probably affect master+liberty as well | 20:42
ryanpetrello | yea, I'm guessing so | 20:42 |
ryanpetrello | we're digging w/ strace atm | 20:42 |
ryanpetrello | we may need to use a python with debug and gdb to dig further | 20:42 |
markmcclain | what version of eventlet? | 20:43 |
fzylogic | 0.18.1 | 20:45 |
markmcclain | fzylogic: the only fix in 0.18.2 likely won't change things | 20:57 |
markmcclain | I think it might be the FIFO between the parent process and the workers | 20:57
markmcclain | wondering if a gratuitous sleep() just prior to reading off the queue would help to solve the deadlock | 21:00 |
adam_g | I'm walking through the worker code, juno vs kilo, and not much has changed | 21:03
adam_g | drwahl, ryanpetrello what python were you on for juno? | 21:04 |
rods | 2.7.3 | 21:04 |
fzylogic | 2.7.3 from Precise | 21:04 |
markmcclain | adam_g: could just be we introduced enough jitter elsewhere to create the deadlock | 21:06 |
markmcclain | sprinkling the magic eventlet.sleep(0.1) pixie dust might be a super low hanging fruit | 21:07 |
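As a sketch of that idea, a consumer loop that yields to the eventlet hub before each read; the names are again illustrative, and whether this actually avoids the hang is exactly the open question:

```python
import eventlet
from Queue import Empty  # Python 2 stdlib


def consume(work_queue, handle):
    while True:
        eventlet.sleep(0.1)  # cooperative yield before touching the queue
        try:
            item = work_queue.get(timeout=1)
        except Empty:
            continue
        handle(item)
```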
*** prithiv has joined #openstack-astara | 21:10 | |
*** Vp has joined #openstack-astara | 21:40 | |
*** Vp has joined #openstack-astara | 21:40 | |
*** Vp is now known as Guest77892 | 21:41 | |
Guest77892 | Hi | 21:41 |
*** Guest77892 has quit IRC | 21:41 | |
*** phil_h has quit IRC | 22:03 | |
*** prithiv has quit IRC | 22:24 | |
*** prithiv has joined #openstack-astara | 22:24 | |
*** shashank_hegde has quit IRC | 23:53 |