Monday, 2018-03-26

*** chkumar246 has quit IRC01:04
*** Alex_Staf has quit IRC01:15
*** chandankumar has joined #openstack-lbaas01:22
*** fnaval has quit IRC01:38
*** annp has joined #openstack-lbaas02:34
*** yamamoto has joined #openstack-lbaas03:27
*** ianychoi has joined #openstack-lbaas03:29
*** yamamoto has quit IRC03:31
*** rcernin has quit IRC03:55
*** rcernin has joined #openstack-lbaas03:55
*** yamamoto has joined #openstack-lbaas04:28
nmagnezijohnsom, o/06:12
nmagnezijohnsom, still around?06:12
*** links has joined #openstack-lbaas06:19
*** AlexeyAbashkin has joined #openstack-lbaas06:19
*** AlexeyAbashkin has quit IRC06:24
*** ianychoi_ has joined #openstack-lbaas06:26
*** ianychoi has quit IRC06:29
*** Alex_Staf has joined #openstack-lbaas06:34
*** kobis has joined #openstack-lbaas06:53
*** pcaruana has joined #openstack-lbaas07:08
*** rcernin has quit IRC07:10
*** aojea has joined #openstack-lbaas07:13
*** aojea has quit IRC07:22
*** tesseract has joined #openstack-lbaas07:23
*** aojea has joined #openstack-lbaas07:29
*** kobis1 has joined #openstack-lbaas07:31
*** kobis1 has quit IRC07:31
*** aojea has quit IRC07:35
*** aojea has joined #openstack-lbaas07:42
*** aojea has quit IRC07:49
*** AlexeyAbashkin has joined #openstack-lbaas07:50
*** kobis has quit IRC08:12
*** velizarx has joined #openstack-lbaas08:35
openstackgerritGanpat Agarwal proposed openstack/octavia master: Active-Active: ExaBGP amphora L3 distributor driver
*** voelzmo has joined #openstack-lbaas09:10
*** voelzmo has quit IRC09:16
*** salmankhan has joined #openstack-lbaas09:21
*** Alex_Staf has quit IRC09:29
*** Alex_Staf has joined #openstack-lbaas09:38
*** Alex_Staf has quit IRC09:51
*** salmankhan has quit IRC09:57
*** salmankhan has joined #openstack-lbaas09:57
*** salmankhan has quit IRC10:50
*** salmankhan has joined #openstack-lbaas11:08
*** AlexeyAbashkin has quit IRC11:14
*** Alex_Staf has joined #openstack-lbaas11:16
*** yamamoto has quit IRC11:27
*** AlexeyAbashkin has joined #openstack-lbaas11:29
*** yamamoto has joined #openstack-lbaas11:36
*** salmankhan has quit IRC11:40
*** yamamoto has quit IRC11:45
*** salmankhan has joined #openstack-lbaas11:48
*** atoth has joined #openstack-lbaas11:59
*** yamamoto has joined #openstack-lbaas12:05
*** velizarx has quit IRC12:20
*** yamamoto has quit IRC12:30
*** Alex_Staf has quit IRC13:02
*** yamamoto has joined #openstack-lbaas13:20
*** yamamoto has quit IRC13:20
*** yamamoto has joined #openstack-lbaas13:20
*** cpusmith_ has joined #openstack-lbaas13:49
*** cpusmith has joined #openstack-lbaas13:50
*** fnaval has joined #openstack-lbaas13:52
*** cpusmith_ has quit IRC13:54
*** yamamoto has quit IRC14:03
*** kong has quit IRC14:13
*** salmankhan has quit IRC14:21
*** links has quit IRC14:32
*** salmankhan has joined #openstack-lbaas14:37
openstackgerritAdam Harwell proposed openstack/octavia master: Switch multinode tests to non-voting
rm_workjohnsom / nmagnezi / xgerman_ ^^14:37
rm_workI *think* I did that correctly?14:38
rm_worknope lol14:38
rm_workmaybe not14:39
openstackgerritAdam Harwell proposed openstack/octavia master: Switch multinode tests to non-voting
rm_workhow about this14:39
rm_workhmmm no14:40
rm_worki don't understand what i did wrong on the first one14:40
rm_workthat should have been right14:40
rm_worki guess i have to select branches in order to make it non-voting?14:40
johnsomrm_work The semi-colon was missing in the first try15:01
openstackgerritAdam Harwell proposed openstack/octavia master: Switch multinode tests to non-voting
*** yamamoto has joined #openstack-lbaas15:03
*** chandankumar has quit IRC15:06
xgerman_k, they have been failing forever15:10
xgerman_rm_work: also my HM ate up 10G RAM over the weekend. Anywhere in the DB to look for bottlenecks?15:11
rm_workit's not the DB15:11
rm_worki mean, not really15:11
xgerman_well, I know — but it’s not like the HM tells me how long its queries/inserts take15:11
rm_workcutting down the query count helps a lot (see: but you need the rest of those patches15:12
rm_workthat one will help *a lot* though15:12
xgerman_mmh, I like to have numbers to base that on. There should be a setting where the HM spews diagnostic info15:13
rm_workas will
rm_workxgerman_: it was something like ... many hundreds to 10a15:13
rm_work*to 10s15:13
rm_workjohnsom had some numbers15:13
rm_workbut, an order of magnitude reduction15:13
xgerman_yeah, I guess we should have the measurement code submitted - since other operators need that for diagnostics15:14
rm_workyou REALLY need to get those first three15:14
rm_workit will improve things significantly15:14
rm_workbut to REALLY fix it you need the last three, which ... i don't know if we'll be able to get in15:14
rm_workmaybe you can just apply those patches locally15:14
*** yamamoto has quit IRC15:15
xgerman_well, if Pike doesn’t work we need to tell people15:15
rm_worki had a hard time proving it, i tried, but i'm just not that good at performance diagnostics15:16
rm_worki can show the symptoms in a variety of ways15:16
rm_workvia like, debug logging15:16
rm_workor system performance graphs15:16
xgerman_yes, but if I am some operator I need something in the logs which tells me what is going on other than my OpenStack cluster crashing because we have a run-away health manager15:17
rm_workbut i worked quite a bit with zzzeek on the sqlalchemy bits that seemed to be taking a long time, and could never find a smoking gun15:17
rm_workso see my third patch there15:17
xgerman_because if I didn’t need Octavia, after one or two of those episodes I would lose all faith15:17
rm_workif you start seeing those messages15:18
rm_workit's bad news bears time15:18
*** pcaruana has quit IRC15:18
rm_worki put that in specifically for what you're talking about15:18
rm_workto warn operators that something is going wrong15:18
rm_workand they need to be aware15:18
rm_workThe language is as strong as I could make it and stay "PC"15:19
rm_worksaying "your shit is fucked" in a log message isn't the best, but i wanted to, lol15:19
xgerman_well, we can also try to keep track of jobs and cancel the ones that get overtaken… to keep the queue at a ~constant size15:21
*** chkumar246 has joined #openstack-lbaas15:23
rm_workxgerman_: that's what that code does15:24
rm_workless tracking, more just cancelling15:24
rm_workyou're better off letting the original job finish than restarting15:25
rm_workif you cancelled the first job and tried to do the new one, once you got behind almost nothing would ever actually update, and it'd be catastrophic15:25
rm_workwhich brings me back to, you *absolutely* need that third patch
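A sketch (not Octavia code) of the "less tracking, more just cancelling" behaviour rm_work describes: while an update job for an amp is already in flight, newer messages for that amp are dropped rather than cancelling the running job, so the older state actually lands instead of nothing ever completing. All names here are illustrative.

```python
def submit(in_flight, amp_id, run_job):
    """Run run_job(amp_id) unless a job for amp_id is already in flight.

    in_flight is a shared set of amp ids with work currently running.
    """
    if amp_id in in_flight:
        return "dropped"          # let the original job finish instead
    in_flight.add(amp_id)
    try:
        run_job(amp_id)           # do the actual update work
    finally:
        in_flight.discard(amp_id)
    return "ran"
```

The point of the design choice: cancel-and-restart means a backlogged queue never finishes anything, while drop-the-newer at least converges on a slightly stale state.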
johnsomYeah, we need to stop the panic here and figure out  why this system is so much different than everyone else's15:25
rm_workjohnsom: regardless of whatever xgerman_'s issue is, I firmly believe we need to backport those first three15:26
*** imacdonn has joined #openstack-lbaas15:26
rm_work(only three because the middle one prevents a ton of merge errors, so i threw it in there, it's test-only)15:26
rm_workthis is based entirely on my assessment of the state of the pike codebase, and cgoncalves's plea to more actively backport stuff that we think is necessary for good operation15:27
rm_worki didn't realize before that we hadn't fixed any of that stuff in Pike yet15:27
rm_workhealth management is completely out of control without at least your patch (the first one in the chain) and mine (the third one) prevents a lot of headache and gives at least some heads-up that operators would need to even realize the issue15:28
rm_worki think if we merge those three, we'll see more people coming forward saying "i'm getting this warning message from my HMs..."15:28
rm_workas it is, people may not even realize until it's too late, and then it takes a good deal of forensics to realize what happened15:29
rm_workI should have asked cgoncalves to do the backports so I could +2 them myself <_<15:29
xgerman_ok, let’s see if my system is really that special15:30
johnsomWell, I'm going to have a look at this environment and see what is up.  The context switches are near zero there and the load is too high for the # of amps. So something else is going on there15:30
rm_workthe first three have nothing to do with that15:31
rm_workthe first one is yours that objectively makes the number of DB operations less ridiculous, and the third is mine that gives operators at least a chance to see something is wrong and possibly head off a catastrophic failure15:31
rm_workIMO not merging at least those would be irresponsible15:32
rm_workI should have proposed them earlier honestly15:32
rm_workbut i hadn't really been thinking about backports / remembering that people actually will try to run Pike15:32
xgerman_My 2ct: We run an unbounded service which can chew up resources. We need to constrain that somehow15:32
xgerman_rm_work: people are just now getting on newton ;-)15:33
rm_workxgerman_: yes, well, the next set fixes all of that15:33
rm_workif we merge those it isn't an issue15:33
rm_workand since we'd have no reason to make a patch to master to do constraints (since it isn't an issue), we're not going to be able to backport anything15:33
rm_workanyway, i don't understand why we'd make and backport a workaround when the *resolution* is already merged in master15:34
xgerman_there are also rumours that the queue in the ProcessPool is better constrained15:34
xgerman_yeah, I’d like to work off one code base as well15:34
rm_workxgerman_: did you get a chance to *try* this patch chain in your env15:35
xgerman_no, we wanted to test the exception handling fix over the weekend15:36
johnsomLooks like the environment reverted to stock pike, no exception fix, etc. in it15:38
johnsomHmm, is there a way to find the sha from an pip installed app?  The version in pip list is a "dev" version #15:40
rm_workah you don't do -e ?15:42
rm_workis the dev version number not like... a shorthash?15:42
johnsomApparently here, they do dev. I think I found one in the egg under pbr. looking now15:42
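A minimal sketch of digging the sha out the way johnsom describes: pbr records the build-time git version in a `pbr.json` file inside the package's egg-info, which survives even when `pip list` only shows a `.dev` version. The `octavia` distribution name and the `pkg_resources` calls in the comment are assumptions about this environment, not verified here.

```python
import json

def git_sha_from_pbr_json(text):
    """Parse the pbr.json metadata that pbr writes at build time."""
    return json.loads(text)["git_version"]

# Against a real (non-editable) install it would look something like:
#   import pkg_resources
#   dist = pkg_resources.get_distribution("octavia")
#   print(git_sha_from_pbr_json(dist.get_metadata("pbr.json")))

# Offline demonstration with a sample of what pbr.json contains:
print(git_sha_from_pbr_json('{"git_version": "d9e24e8", "is_release": false}'))
# -> d9e24e8
```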
johnsomThere are so many amps reporting in that don't have listeners in the DB....15:44
rm_workbecause they didn't get their configs updated correctly?15:47
johnsomyeah, ok, it has failed over 83 amps in 10 minutes15:47
johnsomI think we are ping ponging on these corrupt amps/records15:47
johnsomRemember this isn't a normal environment, folks have messed in the DB, it was running neutron-lbaas, etc.15:48
johnsomYeah, they are looping on the amp ids....  Interesting.15:49
rm_workyeah so15:49
rm_workthat is a cascade failure15:49
rm_worki saw the same thing15:49
rm_workyou need to shut it all down15:49
johnsomYeah, this is immediate after a fresh restart too15:49
rm_workgotta shut down the HMs15:49
rm_workmark all amps busy in health15:50
rm_workbring the HMs up15:50
rm_workwait for the health messages to stabilize15:50
rm_workthen un-busy and failover one at a time15:50
rm_workto get them back to "actually correct" state15:50
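The recovery sequence rm_work lists above, as a plain-Python sketch. The callables stand in for operator actions and DB updates; none of this is a real Octavia interface.

```python
def recover_from_cascade(amps, stop_hms, start_hms, set_busy, wait_stable,
                         failover):
    """Sketch of the cascade-failure recovery procedure."""
    stop_hms()                    # 1. shut down the health managers
    for amp in amps:
        set_busy(amp, True)       # 2. mark every amp busy in health
    start_hms()                   # 3. bring the HMs back up
    wait_stable()                 # 4. wait for health messages to settle
    for amp in amps:
        set_busy(amp, False)      # 5. un-busy and fail over one at a
        failover(amp)             #    time, back to a known-good state
```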
xgerman_johnsom: infra01 has the monkey patched version15:52
johnsomNo, I am on infra01, it's running clean d9e24e83bbf7d12f8bec16072b28c6ca655690ac15:53
johnsomOur changes are gone15:53
xgerman_I only restarted the hm15:53
johnsomMe too15:53
johnsomI haven't touched this since we did.  Let me see if I can get a timestamp when it was redeployed15:54
johnsomYeah, no idea, the files all say jan 415:55
johnsomxgerman_ oh, somehow that IP you gave me took me to infra3 this time...15:56
xgerman_mmh, no idea15:56
johnsomOk, yeah, I think we have it figured out, now just why....  It's failing over slightly more than 8 a minute, with a failover queue of 10, so...  That is the queue growth, and why the CPU is high.  Now digging into the why....16:03
*** salmankhan has quit IRC16:03
*** salmankhan has joined #openstack-lbaas16:07
*** salmankhan has quit IRC16:12
johnsomYeah, ok so it's amps we show as deleted reporting in.16:15
johnsomI think we need to add some zombie protection...  lol16:17
*** salmankhan has joined #openstack-lbaas16:31
rm_workplease please PLEASE take full advantage of the possible commit messages16:34
rm_worki will -1 for not properly taking advantage of subject matter16:34
*** pcaruana has joined #openstack-lbaas16:44
rm_worki wish we had a way to figure out if the IP that reports in is actually an old amp compute, and if so, figure out what its compute-id is and delete it16:47
*** salmankhan has quit IRC16:51
johnsomSo I am tracking by the "reports 1 listeners when 0 expected" which gives the amp ID, which is the name of the amp in nova17:04
*** AlexeyAbashkin has quit IRC17:15
*** salmankhan has joined #openstack-lbaas17:16
*** salmankhan has quit IRC17:20
rm_work"amp id"?17:36
rm_workwhat if it's not in the DB?17:37
rm_workwe're talking about zombies right?17:37
rm_workamps that were "deleted" but the compute instance survived somehow?17:37
johnsomYeah, looks like nova was trashed here for a while and some non-graceful shutdowns17:45
xgerman_I think nova foo-baring is a fact of life17:47
johnsomAgreed, and we handled it appropriately17:48
johnsomretried and then marked them ERROR17:48
xgerman_well, and then we dropped the ball on zombie vms17:48
johnsomYeah, it looks like we are not handling amps we told nova to delete, nova said success, but they didn't actually delete. We can improve this.17:49
*** tesseract has quit IRC18:10
rm_workyeah so how to get the compute-id of them18:12
rm_workthe only thing i could think of is their address <_<18:13
rm_workwhich ... is maybe not 100% reliable?18:13
johnsomThey come into the HM with their IDs18:13
johnsomWe just need to check if the amp is DELETED during a failover and re-delete if so18:13
rm_workso that's if they're still in the DB18:14
rm_work[02:37:12]  <rm_work>what if it's not in the DB?18:14
rm_work[02:37:49]  <rm_work>we're talking about zombies right?18:14
rm_work[02:38:02]  <rm_work>amps that were "deleted" but the compute instance survived somehow?18:14
johnsomOk, or missing I guess, but yeah, normally they will be in "DELETED"18:14
rm_worki don't know about you, but mine go away18:14
johnsomat least for some period of time18:15
rm_workso there is literally nothing18:15
rm_worki usually have to do lookups by IP18:15
rm_workbecause there's this thing coming in with an amp-id that has no record in the amphora table18:15
johnsomEven if you have your HK DB cleanup too low, we get in an amp ID, it's failed because it claims listeners that we show as 0, we check DB if DELETED or missing, call nova delete "amphora-<ID>"18:16
johnsomDo you happen to remember the columns we need to fill in to re-create an amp record to trigger a failover?18:18
johnsomI have a MASTER healthy but no BACKUP18:18
rm_workah, yes i think so, one sec18:21
rm_workand also: it has nothing to do with "HK cleanup time"18:21
rm_workuntil ya'll review that18:21
rm_work... i will admit that i need to really give it a good testing in devstack18:22
*** fnaval has quit IRC18:23
rm_workalso, to distract even more, I have one more feature that I'm going to be whipping through today and hopefully have a patch up shortly -- an admin API call that will give usage statistics for the whole system (total number of LBs, amps, listeners, members, etc)18:24
rm_workdrawing up a mock output first to run past people18:24
*** fnaval has joined #openstack-lbaas18:26
rm_workI would do a spec but18:27
rm_workthen it'd take a month to get to code18:28
rm_workrather just get some basic agreement18:28
*** AlexeyAbashkin has joined #openstack-lbaas18:42
*** pcaruana has quit IRC18:46
rm_workjohnsom / xgerman_ / nmagnezi / cgoncalves / anyone really: when you have a minute:
rm_workproposing *something* like that18:51
rm_workplease to comment / propose alternatives18:51
rm_workneed to have a way to expose some stats about how utilized the system is18:52
rm_workah maybe it should be *utilization*18:52
*** AlexeyAbashkin has quit IRC18:54
*** voelzmo has joined #openstack-lbaas18:54
*** voelzmo has quit IRC19:00
*** atoth has quit IRC19:00
nmagnezirm_work, looking now19:07
nmagnezirm_work, a question in regards to the amphorae section19:08
nmagnezirm_work, since we plan to implement providers support in Rocky, will it support an equivalent of that for other providers?19:08
nmagnezirm_work, would you like me to make comment on this in the etherpad itself?19:09
nmagnezijohnsom, o/19:09
rm_worknmagnezi: i mean i am thinking yes19:10
rm_worki'm not sure19:10
rm_worki think that might be digging into the internals of other providers19:10
rm_workand i think they have their own stats endpoints19:10
*** AlexeyAbashkin has joined #openstack-lbaas19:14
xgerman_well, I think we could have a more common API call listing how many LBs are in the system and such…19:14
*** voelzmo has joined #openstack-lbaas19:15
xgerman_yeah, both lb and listeners could be part of lbaasv219:15
xgerman_^^ rm_work19:15
xgerman_so I would make it two calls - one on the main API for lb, listeners, members — and one on the octavia-API for amps and other things we can think of19:16
rm_workhmmm k19:17
rm_worki mean for amps technically you could just ... call the amphora api I guess19:17
rm_worksince it already spits back everything19:17
rm_workand the total is "len(result)"19:17
rm_workand the statuses you can make a map of19:18
rm_workit's this other stuff that's a bit finicky19:18
rm_worksince you can't get a list of members, for example, without querying by pool19:18
*** AlexeyAbashkin has quit IRC19:18
rm_workso yeah, i could at least remove amphora listing from it19:18
rm_workthough i figured it would be fine to include, and just have it be None or 0 in the case that it's unused19:19
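The counting rm_work sketches above (amp total is `len(result)`, statuses become a map, members have to be gathered pool by pool) could look something like this. Field names and the output shape are assumptions for illustration, not the proposed API.

```python
from collections import Counter

def usage_summary(amphorae, pools):
    """Sketch of the light-weight, DB-only usage stats being proposed.

    `amphorae` is whatever the existing amphora API already returns;
    members are summed per pool since there is no flat member listing.
    """
    return {
        "amphorae": {
            "total": len(amphorae),
            "statuses": dict(Counter(a["status"] for a in amphorae)),
        },
        "members": sum(len(p["members"]) for p in pools),
    }
```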
*** aojea has joined #openstack-lbaas19:29
johnsomnmagnezi Hi19:29
xgerman_rm_work: I think we need to do a bit more with the amphora API to get a utilization view. After all we report out mem and load stats19:32
rm_workerr, we do?19:32
johnsomLet's not go too far with controller health stuff. There is a cross project proposal coming out for that19:33
johnsomI linked to it in one of the last two meetings19:33
rm_workthis isn't controller health19:33
rm_workthis is dataplane health19:33
johnsomOk, just thought xgerman_ was implying controller stats19:34
rm_workat least, what i'm proposing19:34
rm_worki thought he meant amp load/mem19:34
xgerman_yes, amp specific things19:34
xgerman_the /info19:36
rm_workbut we don't *push* that19:36
rm_worki'm not looking to have something that calls out to every amp when you call it :/19:36
rm_workthis is a proposal for something very light-weight, DB only19:36
*** harlowja has joined #openstack-lbaas19:37
*** voelzmo_ has joined #openstack-lbaas19:38
rm_workjust basic utilization statistics that could be fetched for a dashboard19:42
*** voelzmo has quit IRC19:42
xgerman_sounds good19:42
*** yamamoto has joined #openstack-lbaas19:45
rm_workso, plz comments on etherpad :P19:45
rm_worki asked a ton of questions to my own example, lol19:45
*** voelzmo_ has quit IRC19:46
*** yamamoto has quit IRC19:50
*** voelzmo has joined #openstack-lbaas19:51
*** voelzmo has quit IRC19:51
*** voelzmo has joined #openstack-lbaas19:51
*** voelzmo_ has joined #openstack-lbaas19:52
*** voelzmo has quit IRC19:55
*** voelzmo_ has quit IRC20:02
*** voelzmo has joined #openstack-lbaas20:02
nmagnezirm_work, placed some comments in the etherpad20:02
nmagnezijohnsom, o/20:02
nmagnezijohnsom, a question re
johnsomnmagnezi Hi20:03
nmagnezijohnsom, so if I understand this correctly, do we expect the operator to assign the load-balancer_member role for each newly created user/tenant?20:04
nmagnezijohnsom, otherwise, that user can't actually interact with o-api20:04
johnsomproject ID, yes, I think there is a hierarchical thing that lets you set it for everyone or a group too.20:05
johnsomOr deploy our policy.json and switch it back to the old way20:05
nmagnezido we know of any other projects who do the same? what was our incentive for that?20:05
johnsomnmagnezi If I remember right nova does this. It was prompted by their work and a general consensus that we need finer-grained control in OpenStack.20:06
nmagneziindeed, but I was told that projects are moving away from policy.json20:07
johnsomnmagnezi Yes, like we did....20:07
johnsompolicy.json is only for overrides now20:07
nmagneziso moving back to it won't be a good idea for new deployments :)20:07
johnsomnmagnezi No, policy.json will always be an option for deployments. That is not going away as it is the standard way to override the defaults.  However, going back to the old RBAC rules "owner or admin" may not be the best as going forward the rules are changing.  Let me find some references for you...20:09
nmagnezijohnsom, thank you!20:10
johnsomnmagnezi So this is the spec for "make everyone do this and make it the same"
johnsomnmagnezi Still digging for the older stuff about why we did it this way.20:14
johnsomnmagnezi Geez that was hard to find:
johnsomAnd this:
rm_worknmagnezi: responded20:33
rm_workanyone else? johnsom?20:33
nmagnezijohnsom, thanks a lot. really.20:33
nmagnezirm_work, cool :)20:35
xgerman_ok, I think I added my 2ct20:40
*** rm_mobile has joined #openstack-lbaas20:48
johnsomrm_work So, done cleaning up that zombie mess21:00
johnsomI think if we are going to do stats about the objects it should be in the main api (admin), then have another that is amps and driver specific stuffs21:01
openstackgerritMerged openstack/neutron-lbaas-dashboard master: Show last updated timestamp for docs
*** voelzmo has quit IRC21:01
*** voelzmo has joined #openstack-lbaas21:02
*** sshank has joined #openstack-lbaas21:02
*** voelzmo has quit IRC21:02
*** voelzmo has joined #openstack-lbaas21:02
openstackgerritMerged openstack/neutron-lbaas-dashboard master: Update links in README
openstackgerritMerged openstack/neutron-lbaas-dashboard master: i18n: Do not include html directives in translation strings
*** voelzmo has quit IRC21:03
*** voelzmo has joined #openstack-lbaas21:04
*** voelzmo has quit IRC21:04
*** voelzmo has joined #openstack-lbaas21:04
*** voelzmo has quit IRC21:05
*** voelzmo has joined #openstack-lbaas21:05
*** voelzmo has quit IRC21:05
*** kong has joined #openstack-lbaas21:08
xgerman_johnsom: +121:24
rm_mobileDid you comment on the spec etherpad? :P21:26
*** rm_mobile has quit IRC21:28
rm_worki meant johnsom21:32
rm_workbut i can check now21:32
rm_work... great. he said that21:33
*** sshank has quit IRC21:50
rm_workjohnsom: A little more detail? Like ... just remove the amphora keyword and we'd be good to merge something like option#1?22:04
johnsomHa, I was pulling an rm_work. paper-minimization22:05
rm_worklol is that my signature move? :P22:05
rm_workalso -- still no one has addressed any of my other comments really22:06
johnsomI think option one is fine, if it has value for someone.  I would change the path to not be octavia obviously.22:06
rm_workdo you agree with xgerman_ to split out tls and non-tls listeners?22:06
johnsomJust now digging myself out of zombie land. Still one stuck in nova "deleting"22:07
johnsomSince the load on the system is higher with TLS it might be nice, but could be a follow on IMO22:08
johnsomI don't have a real need for this, so kinda flexible.  I mean I just query the DB by hand if I need these counts.22:09
johnsomBut I realize not everyone is as comfortable with that22:10
johnsomAlso, the by project stuff, I think the quota API and coming quota stuff may provide that answer22:10
xgerman_it’s not all about our needs but …22:10
rm_workjohnsom: yeah i was doing that too22:11
rm_workthen talking it over with my team here, we realized that's ... dumb22:11
rm_workthat should not be a thing we have to do22:11
johnsomYeah, fair22:11
*** AlexeyAbashkin has joined #openstack-lbaas22:12
*** AlexeyAbashkin has quit IRC22:16
*** aojea has quit IRC22:18
*** rcernin has joined #openstack-lbaas22:20
*** cpusmith has quit IRC22:26
johnsomHmm, what do you folks think? If we get a DELETED amp submitted to failover (HM or API) should we log and mark it busy to exclude it from future health checks or should we try to go delete it in nova?22:27
rm_workhow does that even happen22:44
johnsomgerman's lab had 30+22:44
johnsomHis nova seems to be brokenish22:45
xgerman_well, it probably didn’t help we were recycling like mad22:45
johnsomWell, it's going to be a bit of work if we want to have it try another delete22:47
johnsomI need to add a compute driver interface for get compute id by name22:48
*** fnaval has quit IRC22:50
xgerman_don’t we already have that — after all we failover…22:51
johnsomWe have the compute ID in failover cases. In this case we may not have that record anymore22:52
xgerman_johnsom: wait - we have the compute-id in the DB on the amphora — is that one no-good?22:52
johnsomIf they set an aggressive DELETED purge it will be gone.22:53
xgerman_ah, ok -22:53
johnsomYou had a few zombies that had been purged22:54
johnsomSo, if we are going to do deletes, we might as well do it all, not just those with records22:54
xgerman_makes sense22:55
*** chkumar246 has quit IRC22:57
rm_workjohnsom: yes to both BTW above22:59
rm_workwhich ... if it IS gone already, need to just delete the health record22:59
rm_workof course it may come back22:59
johnsomrm_work the question is, if it's DELETED in our DB, do we retry delete with nova23:00
rm_worki think it's probably happening because german's health stuff is delayed by so many minutes that the records are still being saved well after the vms are gone23:00
xgerman_you extended that to “if we can’t find it in our DB, let’s delete, too”23:00
rm_workwhich is why he REALLY needs the third patch i backported23:00
johnsomNo, he had no delay at all23:00
rm_workwell, anyway, yes23:00
*** fnaval has joined #openstack-lbaas23:01
xgerman_so, the zombie vms were bringing us down — that *should* not happen23:03
johnsomYeah, it turned out to be all of these zombies coming in and triggering failovers because they say they have listeners but we say they shouldn't have any. So, the failover queue was getting flooded. That was the memory growth issues. After cleanup and a clean deploy the CPU is down and things are better aside from the amp that nova is stuck "deleting"23:03
xgerman_but zombie vms shouldn’t trigger failovers? They should just be ignored…23:03
johnsomRight, that is what I am working on.  Either we ignore them or we try to kill them and ignore them23:04
xgerman_ah, ok, got it23:04
johnsomI think I will do one patch that just marks it busy so they don't attempt to failover. This will be clean for backports as it's a simple change.23:11
johnsomWe can revisit the "kill the zombies" in another patch if we decide it's a good idea.23:12
xgerman_yeah, I can live with that.23:12
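A sketch of that backport-friendly fix: when a DELETED/missing amp reports in, flip its health row to busy so the failover threads skip it instead of flooding the queue. The raw SQL and minimal schema below are stand-ins for illustration; an actual patch would go through Octavia's repository layer.

```python
import sqlite3

def mark_amp_busy(conn, amp_id):
    """Flag a zombie amp's health record busy so failover skips it."""
    conn.execute(
        "UPDATE amphora_health SET busy = 1 WHERE amphora_id = ?",
        (amp_id,),
    )

# Minimal demonstration against an in-memory stand-in schema:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amphora_health (amphora_id TEXT, busy INTEGER)")
conn.execute("INSERT INTO amphora_health VALUES ('zombie-1', 0)")
mark_amp_busy(conn, "zombie-1")
```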
*** AlexeyAbashkin has joined #openstack-lbaas23:12
xgerman_Thanks. Good sleuthing. I really was under the impression we would ignore zombies already and not make them fail over every time23:13
*** chandankumar has joined #openstack-lbaas23:14
*** AlexeyAbashkin has quit IRC23:16
openstackgerritMichael Johnson proposed openstack/octavia master: Fix health manager edge case with zombie amphora
*** chandankumar has quit IRC23:53

Generated by 2.15.3 by Marius Gedminas - find it at!