Monday, 2018-03-26

*** chkumar246 has quit IRC01:04
*** Alex_Staf has quit IRC01:15
*** chandankumar has joined #openstack-lbaas01:22
*** fnaval has quit IRC01:38
*** annp has joined #openstack-lbaas02:34
*** yamamoto has joined #openstack-lbaas03:27
*** ianychoi has joined #openstack-lbaas03:29
*** yamamoto has quit IRC03:31
*** rcernin has quit IRC03:55
*** rcernin has joined #openstack-lbaas03:55
*** yamamoto has joined #openstack-lbaas04:28
nmagnezijohnsom, o/06:12
nmagnezijohnsom, still around?06:12
*** links has joined #openstack-lbaas06:19
*** AlexeyAbashkin has joined #openstack-lbaas06:19
*** AlexeyAbashkin has quit IRC06:24
*** ianychoi_ has joined #openstack-lbaas06:26
*** ianychoi has quit IRC06:29
*** Alex_Staf has joined #openstack-lbaas06:34
*** kobis has joined #openstack-lbaas06:53
*** pcaruana has joined #openstack-lbaas07:08
*** rcernin has quit IRC07:10
*** aojea has joined #openstack-lbaas07:13
*** aojea has quit IRC07:22
*** tesseract has joined #openstack-lbaas07:23
*** aojea has joined #openstack-lbaas07:29
*** kobis1 has joined #openstack-lbaas07:31
*** kobis1 has quit IRC07:31
*** aojea has quit IRC07:35
*** aojea has joined #openstack-lbaas07:42
*** aojea has quit IRC07:49
*** AlexeyAbashkin has joined #openstack-lbaas07:50
*** kobis has quit IRC08:12
*** velizarx has joined #openstack-lbaas08:35
openstackgerritGanpat Agarwal proposed openstack/octavia master: Active-Active: ExaBGP amphora L3 distributor driver
*** voelzmo has joined #openstack-lbaas09:10
*** voelzmo has quit IRC09:16
*** salmankhan has joined #openstack-lbaas09:21
*** Alex_Staf has quit IRC09:29
*** Alex_Staf has joined #openstack-lbaas09:38
*** Alex_Staf has quit IRC09:51
*** salmankhan has quit IRC09:57
*** salmankhan has joined #openstack-lbaas09:57
*** salmankhan has quit IRC10:50
*** salmankhan has joined #openstack-lbaas11:08
*** AlexeyAbashkin has quit IRC11:14
*** Alex_Staf has joined #openstack-lbaas11:16
*** yamamoto has quit IRC11:27
*** AlexeyAbashkin has joined #openstack-lbaas11:29
*** yamamoto has joined #openstack-lbaas11:36
*** salmankhan has quit IRC11:40
*** yamamoto has quit IRC11:45
*** salmankhan has joined #openstack-lbaas11:48
*** atoth has joined #openstack-lbaas11:59
*** yamamoto has joined #openstack-lbaas12:05
*** velizarx has quit IRC12:20
*** yamamoto has quit IRC12:30
*** Alex_Staf has quit IRC13:02
*** yamamoto has joined #openstack-lbaas13:20
*** yamamoto has quit IRC13:20
*** yamamoto has joined #openstack-lbaas13:20
*** cpusmith_ has joined #openstack-lbaas13:49
*** cpusmith has joined #openstack-lbaas13:50
*** fnaval has joined #openstack-lbaas13:52
*** cpusmith_ has quit IRC13:54
*** yamamoto has quit IRC14:03
*** kong has quit IRC14:13
*** salmankhan has quit IRC14:21
*** links has quit IRC14:32
*** salmankhan has joined #openstack-lbaas14:37
openstackgerritAdam Harwell proposed openstack/octavia master: Switch multinode tests to non-voting
rm_workjohnsom / nmagnezi / xgerman_ ^^14:37
rm_workI *think* I did that correctly?14:38
rm_worknope lol14:38
rm_workmaybe not14:39
openstackgerritAdam Harwell proposed openstack/octavia master: Switch multinode tests to non-voting
rm_workhow about this14:39
rm_workhmmm no14:40
rm_worki don't understand what i did wrong on the first one14:40
rm_workthat should have been right14:40
rm_worki guess i have to select branches in order to make it non-voting?14:40
johnsomrm_work The semi-colon was missing in the first try15:01
openstackgerritAdam Harwell proposed openstack/octavia master: Switch multinode tests to non-voting
*** yamamoto has joined #openstack-lbaas15:03
*** chandankumar has quit IRC15:06
xgerman_k, they have been failing forever15:10
xgerman_rm_work: also my HM ate up 10G RAM over the weekend. Anywhere in the DB to look for bottlenecks?15:11
rm_workit's not the DB15:11
rm_worki mean, not really15:11
xgerman_well, I know — but it’s not like the HM tells me how long its queries/inserts take15:11
rm_workcutting down the query count helps a lot (see: but you need the rest of those patches15:12
rm_workthat one will help *a lot* though15:12
xgerman_mmh, I like to have numbers to base that on. There should be a setting where the HM spews diagnostic info15:13
rm_workas will
rm_workxgerman_: it was something like ... many hundreds to 10a15:13
rm_work*to 10s15:13
rm_workjohnsom had some numbers15:13
rm_workbut, an order of magnitude reduction15:13
xgerman_yeah, I guess we should have the measurement code submitted - since other operators need that for diagnostics15:14
rm_workyou REALLY need to get those first three15:14
rm_workit will improve things significantly15:14
rm_workbut to REALLY fix it you need the last three, which ... i don't know if we'll be able to get in15:14
rm_workmaybe you can just apply those patches locally15:14
*** yamamoto has quit IRC15:15
xgerman_well, if Pike doesn’t work we need to tell people15:15
rm_worki had a hard time proving it, i tried, but i'm just not that good at performance diagnostics15:16
rm_worki can show the symptoms in a variety of ways15:16
rm_workvia like, debug logging15:16
rm_workor system performance graphs15:16
xgerman_yes, but if I am some operator I need something in the logs which tells me what is going on other than my OpenStack cluster crashing because we have a run-away health manager15:17
rm_workbut i worked quite a bit with zzzeek on the sqlalchemy bits that seemed to be taking a long time, and could never find a smoking gun15:17
rm_workso see my third patch there15:17
xgerman_because if I didn’t need Octavia, after one or two of those episodes I would lose all faith15:17
rm_workif you start seeing those messages15:18
rm_workit's bad news bears time15:18
*** pcaruana has quit IRC15:18
rm_worki put that in specifically for what you're talking about15:18
rm_workto warn operators that something is going wrong15:18
rm_workand they need to be aware15:18
rm_workThe language is as strong as I could make it and stay "PC"15:19
rm_worksaying "your shit is fucked" in a log message isn't the best, but i wanted to, lol15:19
xgerman_well, we can also try to keep track of jobs and cancel the ones that get overtaken… to keep the queue at a ~constant size15:21
*** chkumar246 has joined #openstack-lbaas15:23
rm_workxgerman_: that's what that code does15:24
rm_workless tracking, more just cancelling15:24
rm_workyou're better off letting the original job finish than restarting15:25
rm_workif you cancelled the first job and tried to do the new one, once you got behind almost nothing would ever actually update, and it'd be catastrophic15:25
rm_workwhich brings me back to, you *absolutely* need that third patch
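A sketch (not Octavia code) of the "less tracking, more just cancelling" behaviour rm_work describes: while an update job for an amp is already in flight, newer messages for that amp are dropped rather than cancelling the running job, so the older state actually lands instead of nothing ever completing. All names here are illustrative.

```python
def submit(in_flight, amp_id, run_job):
    """Run run_job(amp_id) unless a job for amp_id is already in flight.

    in_flight is a shared set of amp ids with work currently running.
    """
    if amp_id in in_flight:
        return "dropped"          # let the original job finish instead
    in_flight.add(amp_id)
    try:
        run_job(amp_id)           # do the actual update work
    finally:
        in_flight.discard(amp_id)
    return "ran"
```

The point of the design choice: cancel-and-restart means a backlogged queue never finishes anything, while drop-the-newer at least converges on a slightly stale state.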
johnsomYeah, we need to stop the panic here and figure out  why this system is so much different than everyone else's15:25
rm_workjohnsom: regardless of whatever xgerman_'s issue is, I firmly believe we need to backport those first three15:26
*** imacdonn has joined #openstack-lbaas15:26
rm_work(only three because the middle one prevents a ton of merge errors, so i threw it in there, it's test-only)15:26
rm_workthis is based entirely on my assessment of the state of the pike codebase, and cgoncalves's plea to more actively backport stuff that we think is necessary for good operation15:27
rm_worki didn't realize before that we hadn't fixed any of that stuff in Pike yet15:27
rm_workhealth management is completely out of control without at least your patch (the first one in the chain) and mine (the third one) prevents a lot of headache and gives at least some heads-up that operators would need to even realize the issue15:28
rm_worki think if we merge those three, we'll see more people coming forward saying "i'm getting this warning message from my HMs..."15:28
rm_workas it is, people may not even realize until it's too late, and then it takes a good deal of forensics to realize what happened15:29
rm_workI should have asked cgoncalves to do the backports so I could +2 them myself <_<15:29
xgerman_ok, let’s see if my system is really that special15:30
johnsomWell, I'm going to have a look at this environment and see what is up.  The context switches are near zero there and the load is too high for the # of amps. So something else is going on there15:30
rm_workthe first three have nothing to do with that15:31
rm_workthe first one is yours that objectively makes the number of DB operations less ridiculous, and the third is mine that gives operators at least a chance to see something is wrong and possibly head off a catastrophic failure15:31
rm_workIMO not merging at least those would be irresponsible15:32
rm_workI should have proposed them earlier honestly15:32
rm_workbut i hadn't really been thinking about backports / remembering that people actually will try to run Pike15:32
xgerman_My 2ct: We run an unbounded service which can chew up resources. We need to constrain that somehow15:32
xgerman_rm_work: people are just now getting on newton ;-)15:33
rm_workxgerman_: yes, well, the next set fixes all of that15:33
rm_workif we merge those it isn't an issue15:33
rm_workand since we'd have no reason to make a patch to master to do constraints (since it isn't an issue), we're not going to be able to backport anything15:33
rm_workanyway, i don't understand why we'd make and backport a workaround when the *resolution* is already merged in master15:34
xgerman_there are also rumours that the queue in the ProcessPool is better constrained15:34
xgerman_yeah, I’d like to work off one code base as well15:34
rm_workxgerman_: did you get a chance to *try* this patch chain in your env15:35
xgerman_no, we wanted to test the exception handling fix over the weekend15:36
johnsomLooks like the environment reverted to stock pike, no exception fix, etc. in it15:38
johnsomHmm, is there a way to find the sha from an pip installed app?  The version in pip list is a "dev" version #15:40
rm_workah you don't do -e ?15:42
rm_workis the dev version number not like... a shorthash?15:42
johnsomApparently here, they do dev. I think I found one in the egg under pbr. looking now15:42
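A minimal sketch of digging the sha out the way johnsom describes: pbr records the build-time git version in a `pbr.json` file inside the package's egg-info, which survives even when `pip list` only shows a `.dev` version. The `octavia` distribution name and the `pkg_resources` calls in the comment are assumptions about this environment, not verified here.

```python
import json

def git_sha_from_pbr_json(text):
    """Parse the pbr.json metadata that pbr writes at build time."""
    return json.loads(text)["git_version"]

# Against a real (non-editable) install it would look something like:
#   import pkg_resources
#   dist = pkg_resources.get_distribution("octavia")
#   print(git_sha_from_pbr_json(dist.get_metadata("pbr.json")))

# Offline demonstration with a sample of what pbr.json contains:
print(git_sha_from_pbr_json('{"git_version": "d9e24e8", "is_release": false}'))
# -> d9e24e8
```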
johnsomThere are so many amps reporting in that don't have listeners in the DB....15:44
rm_workbecause they didn't get their configs updated correctly?15:47
johnsomyeah, ok, it has failed over 83 amps in 10 minutes15:47
johnsomI think we are ping ponging on these corrupt amps/records15:47
johnsomRemember this isn't a normal environment, folks have messed in the DB, it was running neutron-lbaas, etc.15:48
johnsomYeah, they are looping on the amp ids....  Interesting.15:49
rm_workyeah so15:49
rm_workthat is a cascade failure15:49
rm_worki saw the same thing15:49
rm_workyou need to shut it all down15:49
johnsomYeah, this is immediate after a fresh restart too15:49
rm_workgotta shut down the HMs15:49
rm_workmark all amps busy in health15:50
rm_workbring the HMs up15:50
rm_workwait for the health messages to stabilize15:50
rm_workthen un-busy and failover one at a time15:50
rm_workto get them back to "actually correct" state15:50
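The recovery sequence rm_work lists above, as a plain-Python sketch. The callables stand in for operator actions and DB updates; none of this is a real Octavia interface.

```python
def recover_from_cascade(amps, stop_hms, start_hms, set_busy, wait_stable,
                         failover):
    """Sketch of the cascade-failure recovery procedure."""
    stop_hms()                    # 1. shut down the health managers
    for amp in amps:
        set_busy(amp, True)       # 2. mark every amp busy in health
    start_hms()                   # 3. bring the HMs back up
    wait_stable()                 # 4. wait for health messages to settle
    for amp in amps:
        set_busy(amp, False)      # 5. un-busy and fail over one at a
        failover(amp)             #    time, back to a known-good state
```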
xgerman_johnsom: infra01 has the monkey patched version15:52
johnsomNo, I am on infra01, it's running clean d9e24e83bbf7d12f8bec16072b28c6ca655690ac15:53
johnsomOur changes are gone15:53
xgerman_I only restarted the hm15:53
johnsomMe too15:53
johnsomI haven't touched this since we did.  Let me see if I can get a timestamp when it was redeployed15:54
johnsomYeah, no idea, the files all say jan 415:55
johnsomxgerman_ oh, somehow that IP you gave me took me to infra3 this time...15:56
xgerman_mmh, no idea15:56
johnsomOk, yeah, I think we have it figured out, now just why....  It's failing over slightly more than 8 a minute, with a failover queue of 10, so...  That is the queue growth, and why the CPU is high.  Now digging into the why....16:03
*** salmankhan has quit IRC16:03
*** salmankhan has joined #openstack-lbaas16:07
*** salmankhan has quit IRC16:12
johnsomYeah, ok so it's amps we show as deleted reporting in.16:15
johnsomI think we need to add some zombie protection...  lol16:17
*** salmankhan has joined #openstack-lbaas16:31
rm_workplease please PLEASE take full advantage of the possible commit messages16:34
rm_worki will -1 for not properly taking advantage of subject matter16:34
*** pcaruana has joined #openstack-lbaas16:44
rm_worki wish we had a way to figure out if the IP that reports in is actually an old amp compute, and if so, figure out what its compute-id is and delete it16:47
*** salmankhan has quit IRC16:51
johnsomSo I am tracking by the "reports 1 listeners when 0 expected" which gives the amp ID, which is the name of the amp in nova17:04
*** AlexeyAbashkin has quit IRC17:15
*** salmankhan has joined #openstack-lbaas17:16
*** salmankhan has quit IRC17:20
rm_work"amp id"?17:36
rm_workwhat if it's not in the DB?17:37
rm_workwe're talking about zombies right?17:37
rm_workamps that were "deleted" but the compute instance survived somehow?17:37
johnsomYeah, looks like nova was trashed here for a while and some non-graceful shutdowns17:45
xgerman_I think nova foo-baring is a fact of life17:47
johnsomAgreed, and we handled it appropriately17:48
johnsomretried and then marked them ERROR17:48
xgerman_well, and then we dropped the ball on zombie vms17:48
johnsomYeah, it looks like we are not handling amps we told nova to delete, nova said success, but they didn't actually delete. We can improve this.17:49
*** tesseract has quit IRC18:10
rm_workyeah so how to get the compute-id of them18:12
rm_workthe only thing i could think of is their address <_<18:13
rm_workwhich ... is maybe not 100% reliable?18:13
johnsomThey come into the HM with their IDs18:13
johnsomWe just need to check if the amp is DELETED during a failover and re-delete if so18:13
rm_workso that's if they're still in the DB18:14
rm_work[02:37:12]  <rm_work>what if it's not in the DB?18:14
rm_work[02:37:49]  <rm_work>we're talking about zombies right?18:14
rm_work[02:38:02]  <rm_work>amps that were "deleted" but the compute instance survived somehow?18:14
johnsomOk, or missing I guess, but yeah, normally they will be in "DELETED"18:14
rm_worki don't know about you, but mine go away18:14
johnsomat least for some period of time18:15
rm_workso there is literally nothing18:15
rm_worki usually have to do lookups by IP18:15
rm_workbecause there's this thing coming in with an amp-id that has no record in the amphora table18:15
johnsomEven if you have your HK DB cleanup too low, we get in an amp ID, it's failed because it claims listeners that we show as 0, we check DB if DELETED or missing, call nova delete "amphora-<ID>"18:16
johnsomDo you happen to remember the columns we need to fill in to re-create an amp record to trigger a failover?18:18
johnsomI have a MASTER healthy but no BACKUP18:18
rm_workah, yes i think so, one sec18:21
rm_workand also: it has nothing to do with "HK cleanup time"18:21
rm_workuntil ya'll review that18:21
rm_work... i will admit that i need to really give it a good testing in devstack18:22
*** fnaval has quit IRC18:23
rm_workalso, to distract even more, I have one more feature that I'm going to be whipping through today and hopefully have a patch up shortly -- an admin API call that will give usage statistics for the whole system (total number of LBs, amps, listeners, members, etc)18:24
rm_workdrawing up a mock output first to run past people18:24
*** fnaval has joined #openstack-lbaas18:26
rm_workI would do a spec but18:27
rm_workthen it'd take a month to get to code18:28
rm_workrather just get some basic agreement18:28
*** AlexeyAbashkin has joined #openstack-lbaas18:42
*** pcaruana has quit IRC18:46
rm_workjohnsom / xgerman_ / nmagnezi / cgoncalves / anyone really: when you have a minute:
rm_workproposing *something* like that18:51
rm_workplease to comment / propose alternatives18:51
rm_workneed to have a way to expose some stats about how utilized the system is18:52
rm_workah maybe it should be *utilization*18:52
*** AlexeyAbashkin has quit IRC18:54
*** voelzmo has joined #openstack-lbaas18:54
*** voelzmo has quit IRC19:00
*** atoth has quit IRC19:00
nmagnezirm_work, looking now19:07
nmagnezirm_work, a question in regards to the amphorae section19:08
nmagnezirm_work, since we plan to implement providers support in Rocky, will it support an equivalent of that for other providers?19:08
nmagnezirm_work, would you like me to make comment on this in the etherpad itself?19:09
nmagnezijohnsom, o/19:09
rm_worknmagnezi: i mean i am thinking yes19:10
rm_worki'm not sure19:10
rm_worki think that might be digging into the internals of other providers19:10
rm_workand i think they have their own stats endpoints19:10
*** AlexeyAbashkin has joined #openstack-lbaas19:14
xgerman_well, I think we could have a more common API call listing how many LBs are in the system and such…19:14
*** voelzmo has joined #openstack-lbaas19:15
xgerman_yeah, both lb and listeners could be part of lbaasv219:15
xgerman_^^ rm_work19:15
xgerman_so I would make it two calls - one on the main API for lb, listeners, members — and one on the octavia-API for amps and other things we can think of19:16
rm_workhmmm k19:17
rm_worki mean for amps technically you could just ... call the amphora api I guess19:17
rm_worksince it already spits back everything19:17
rm_workand the total is "len(result)"19:17
rm_workand the statuses you can make a map of19:18
rm_workit's this other stuff that's a bit finicky19:18
rm_worksince you can't get a list of members, for example, without querying by pool19:18
*** AlexeyAbashkin has quit IRC19:18
rm_workso yeah, i could at least remove amphora listing from it19:18
rm_workthough i figured it would be fine to include, and just have it be None or 0 in the case that it's unused19:19
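The counting rm_work sketches above (amp total is `len(result)`, statuses become a map, members have to be gathered pool by pool) could look something like this. Field names and the output shape are assumptions for illustration, not the proposed API.

```python
from collections import Counter

def usage_summary(amphorae, pools):
    """Sketch of the light-weight, DB-only usage stats being proposed.

    `amphorae` is whatever the existing amphora API already returns;
    members are summed per pool since there is no flat member listing.
    """
    return {
        "amphorae": {
            "total": len(amphorae),
            "statuses": dict(Counter(a["status"] for a in amphorae)),
        },
        "members": sum(len(p["members"]) for p in pools),
    }
```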
*** aojea has joined #openstack-lbaas19:29
johnsomnmagnezi Hi19:29
xgerman_rm_work: I think we need to do a bit more with the amphora API to get a utilization view. After all we report out mem and load stats19:32
rm_workerr, we do?19:32
johnsomLet's not go too far with controller health stuff. There is a cross project proposal coming out for that19:33
johnsomI linked to it in one of the last two meetings19:33
rm_workthis isn't controller health19:33
rm_workthis is dataplane health19:33
johnsomOk, just thought xgerman_ was implying controller stats19:34
rm_workat least, what i'm proposing19:34
rm_worki thought he meant amp load/mem19:34
xgerman_yes, amp specific things19:34
xgerman_the /info19:36
rm_workbut we don't *push* that19:36
rm_worki'm not looking to have something that calls out to every amp when you call it :/19:36
rm_workthis is a proposal for something very light-weight, DB only19:36
*** harlowja has joined #openstack-lbaas19:37
*** voelzmo_ has joined #openstack-lbaas19:38
rm_workjust basic utilization statistics that could be fetched for a dashboard19:42
*** voelzmo has quit IRC19:42
xgerman_sounds good19:42
*** yamamoto has joined #openstack-lbaas19:45
rm_workso, plz comments on etherpad :P19:45
rm_worki asked a ton of questions to my own example, lol19:45
*** voelzmo_ has quit IRC19:46
*** yamamoto has quit IRC19:50
*** voelzmo has joined #openstack-lbaas19:51
*** voelzmo has quit IRC19:51
*** voelzmo has joined #openstack-lbaas19:51
*** voelzmo_ has joined #openstack-lbaas19:52
*** voelzmo has quit IRC19:55
*** voelzmo_ has quit IRC20:02
*** voelzmo has joined #openstack-lbaas20:02
nmagnezirm_work, placed some comments in the etherpad20:02
nmagnezijohnsom, o/20:02
nmagnezijohnsom, a question re
johnsomnmagnezi Hi20:03
nmagnezijohnsom, so if I understand this correctly, do we expect the operator to assign the load-balancer_member role for each newly created user/tenant?20:04
nmagnezijohnsom, otherwise, that user can't actually interact with o-api20:04
johnsomproject ID, yes, I think there is a hierarchical thing that lets you set it for everyone or a group too.20:05
johnsomOr deploy our policy.json and switch it back to the old way20:05
nmagnezido we know of any other projects who do the same? what was our incentive for that?20:05
johnsomnmagnezi If I remember right nova does this. It was prompted by their work and a general consensus that we need finer-grained control in OpenStack.20:06
nmagneziindeed, but I was told that projects are moving away from policy.json20:07
johnsomnmagnezi Yes, like we did....20:07
johnsompolicy.json is only for overrides now20:07
nmagneziso moving back to it won't be a good idea for new deployments :)20:07
johnsomnmagnezi No, policy.json will always be an option for deployments. That is not going away as it is the standard way to override the defaults.  However, going back to the old RBAC rules "owner or admin" may not be the best as going forward the rules are changing.  Let me find some references for you...20:09
nmagnezijohnsom, thank you!20:10
johnsomnmagnezi So this is the spec for "make everyone do this and make it the same"
johnsomnmagnezi Still digging for the older stuff about why we did it this way.20:14
johnsomnmagnezi Geez that was hard to find:
johnsomAnd this:
rm_worknmagnezi: responded20:33
rm_workanyone else? johnsom?20:33
nmagnezijohnsom, thanks a lot. really.20:33
nmagnezirm_work, cool :)20:35
xgerman_ok, I think I added my 2ct20:40
*** rm_mobile has joined #openstack-lbaas20:48
johnsomrm_work So, done cleaning up that zombie mess21:00
johnsomI think if we are going to do stats about the objects it should be in the main api (admin), then have another that is amps and driver specific stuffs21:01
openstackgerritMerged openstack/neutron-lbaas-dashboard master: Show last updated timestamp for docs
*** voelzmo has quit IRC21:01
*** voelzmo has joined #openstack-lbaas21:02
*** sshank has joined #openstack-lbaas21:02
*** voelzmo has quit IRC21:02
*** voelzmo has joined #openstack-lbaas21:02
openstackgerritMerged openstack/neutron-lbaas-dashboard master: Update links in README
openstackgerritMerged openstack/neutron-lbaas-dashboard master: i18n: Do not include html directives in translation strings
*** voelzmo has quit IRC21:03
*** voelzmo has joined #openstack-lbaas21:04
*** voelzmo has quit IRC21:04
*** voelzmo has joined #openstack-lbaas21:04
*** voelzmo has quit IRC21:05
*** voelzmo has joined #openstack-lbaas21:05
*** voelzmo has quit IRC21:05
*** kong has joined #openstack-lbaas21:08
xgerman_johnsom: +121:24
rm_mobileDid you comment on the spec etherpad? :P21:26
*** rm_mobile has quit IRC21:28
rm_worki meant johnsom21:32
rm_workbut i can check now21:32
rm_work... great. he said that21:33
*** sshank has quit IRC21:50
rm_workjohnsom: A little more detail? Like ... just remove the amphora keyword and we'd be good to merge something like option#1?22:04
johnsomHa, I was pulling an rm_work. paper-minimization22:05
rm_worklol is that my signature move? :P22:05
rm_workalso -- still no one has addressed any of my other comments really22:06
johnsomI think option one is fine, if it has value for someone.  I would change the path to not be octavia obviously.22:06
rm_workdo you agree with xgerman_ to split out tls and non-tls listeners?22:06
johnsomJust now digging myself out of zombie land. Still one stuck in nova "deleting"22:07
johnsomSince the load on the system is higher with TLS it might be nice, but could be a follow on IMO22:08
johnsomI don't have a real need for this, so kinda flexible.  I mean I just query the DB by hand if I need these counts.22:09
johnsomBut I realize not everyone is as comfortable with that22:10
johnsomAlso, the by project stuff, I think the quota API and coming quota stuff may provide that answer22:10
xgerman_it’s not all about our needs but …22:10
rm_workjohnsom: yeah i was doing that too22:11
rm_workthen talking it over with my team here, we realized that's ... dumb22:11
rm_workthat should not be a thing we have to do22:11
johnsomYeah, fair22:11
*** AlexeyAbashkin has joined #openstack-lbaas22:12
*** AlexeyAbashkin has quit IRC22:16
*** aojea has quit IRC22:18
*** rcernin has joined #openstack-lbaas22:20
*** cpusmith has quit IRC22:26
johnsomHmm, what do you folks think? If we get a DELETED amp submitted to failover (HM or API) should we log and mark it busy to exclude it from future health checks or should we try to go delete it in nova?22:27
rm_workhow does that even happen22:44
johnsomgerman's lab had 30+22:44
johnsomHis nova seems to be brokenish22:45
xgerman_well, it probably didn’t help we were recycling like mad22:45
johnsomWell, it's going to be a bit of work if we want to have it try another delete22:47
johnsomI need to add a compute driver interface for get compute id by name22:48
*** fnaval has quit IRC22:50
xgerman_don’t we already have that — after all we failover…22:51
johnsomWe have the compute ID in failover cases. In this case we may not have that record anymore22:52
xgerman_johnsom: wait - we have the compute-id in the DB on the amphora — is that one no-good?22:52
johnsomIf they set an aggressive DELETED purge it will be gone.22:53
xgerman_ah, ok -22:53
johnsomYou had a few zombies that had been purged22:54
johnsomSo, if we are going to do deletes, we might as well do it all, not just those with records22:54
xgerman_makes sense22:55
*** chkumar246 has quit IRC22:57
rm_workjohnsom: yes to both BTW above22:59
rm_workwhich ... if it IS gone already, need to just delete the health record22:59
rm_workof course it may come back22:59
johnsomrm_work the question is, if it's DELETED in our DB, do we retry delete with nova23:00
rm_worki think it's probably happening because german's health stuff is delayed by so many minutes that the records are still being saved well after the vms are gone23:00
xgerman_you extended that to “if we can’t find it in our DB, let’s delete, too”23:00
rm_workwhich is why he REALLY needs the third patch i backported23:00
johnsomNo, he had no delay at all23:00
rm_workwell, anyway, yes23:00
*** fnaval has joined #openstack-lbaas23:01
xgerman_so, the zombie vms were bringing us down — that *should* not happen23:03
johnsomYeah, it turned out to be all of these zombies coming in and triggering failovers because they say they have listeners but we say they shouldn't have any. So, the failover queue was getting flooded. That was the memory growth issues. After cleanup and a clean deploy the CPU is down and things are better aside from the amp that nova is stuck "deleting"23:03
xgerman_but zombie vms shouldn’t trigger failovers? They should just be ignored…23:03
johnsomRight, that is what I am working on.  Either we ignore them or we try to kill them and ignore them23:04
xgerman_ah, ok, got it23:04
johnsomI think I will do one patch that just marks it busy so they don't attempt to failover. This will be clean for backports as it's a simple change.23:11
johnsomWe can revisit the "kill the zombies" in another patch if we decide it's a good idea.23:12
xgerman_yeah, I can live with that.23:12
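A sketch of that backport-friendly fix: when a DELETED/missing amp reports in, flip its health row to busy so the failover threads skip it instead of flooding the queue. The raw SQL and minimal schema below are stand-ins for illustration; an actual patch would go through Octavia's repository layer.

```python
import sqlite3

def mark_amp_busy(conn, amp_id):
    """Flag a zombie amp's health record busy so failover skips it."""
    conn.execute(
        "UPDATE amphora_health SET busy = 1 WHERE amphora_id = ?",
        (amp_id,),
    )

# Minimal demonstration against an in-memory stand-in schema:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amphora_health (amphora_id TEXT, busy INTEGER)")
conn.execute("INSERT INTO amphora_health VALUES ('zombie-1', 0)")
mark_amp_busy(conn, "zombie-1")
```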
*** AlexeyAbashkin has joined #openstack-lbaas23:12
xgerman_Thanks. Good sleuthing. I really was under the impression we would ignore zombies already and not make them fail over every time23:13
*** chandankumar has joined #openstack-lbaas23:14
*** AlexeyAbashkin has quit IRC23:16
openstackgerritMichael Johnson proposed openstack/octavia master: Fix health manager edge case with zombie amphora
*** chandankumar has quit IRC23:53

Generated by 2.15.3 by Marius Gedminas - find it at!