Wednesday, 2018-05-30

*** mordred has quit IRC00:06
*** zigo has quit IRC00:06
*** zigo_ has joined #openstack-lbaas00:06
*** KeithMnemonic has quit IRC00:08
*** mordred has joined #openstack-lbaas00:16
*** leitan has quit IRC00:18
*** leitan has joined #openstack-lbaas00:18
*** sshank has quit IRC00:20
*** leitan has quit IRC00:22
rm_workfailed one each00:31
rm_worki've seen similar failures before in noops00:32
rm_worknot sure how stuff gets stuck in PENDING statuses??00:32
rm_workbut it seems to00:32
rm_workRMQ failures?00:32
johnsomNo RMQ with noop....00:32
rm_workdigging through o-cw log00:32
rm_workerr ... yes?00:33
rm_workwe still go through to worker00:33
rm_workand it just noops the compute/net/amp parts00:33
johnsomIt's not running noop Octavia driver?00:33
rm_workthis is an octavia tempest test00:33
johnsomYeah, right, I think that is going through00:33
johnsomLooks broken:
rm_workthis doesn't look good tho:
*** atoth has quit IRC00:35
rm_workhmmm yeah00:35
rm_workis this catching an octavia bug somehow? lol00:35
rm_workbut why only sometimes00:35
rm_workmaybe due to the test order / parallel runs?00:35
johnsomOye, I hope not....00:36
*** longkb has joined #openstack-lbaas00:36
rm_worki mean it's here:
rm_workhow would that happen O_o00:36
*** keithmnemonic[m] has quit IRC00:37
rm_workit tries to fetch the pool info from the DB and it isn't there O_o00:38
rm_workhow, what00:38
johnsomWe need to fix that quota log message, it's not printing the object00:38
rm_worki wonder if this is an issue with the new provider stuff ordering the db commit wrong possibly? grasping at straws00:39
rm_workcause yeah... you do the "call_provider" before the commit00:40
rm_workso that's my guess00:40
rm_workwe manage to make it through the RMQ call and into the worker and do the processing, before the commit happens00:40
rm_workin some instances00:40
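The race described above can be reproduced in a toy model. This is only a sketch under stated assumptions: `FakeSession`, `api_create_pool`, and `worker_create_pool` are hypothetical stand-ins, not Octavia code, and the "worker" runs inline rather than over RMQ. It shows why a worker that receives only an ID finds nothing if it runs before the API commits.

```python
class FakeSession:
    """Toy transaction: writes stay invisible to readers until commit()."""

    def __init__(self, store):
        self._store = store
        self._pending = {}

    def add(self, key, row):
        self._pending[key] = row

    def commit(self):
        self._store.update(self._pending)
        self._pending.clear()


def worker_create_pool(store, pool_id):
    """Hypothetical worker: it is handed only an ID and reads the shared DB."""
    return store.get(pool_id)  # None when the API side has not committed yet


def api_create_pool(store, pool_id, call_provider_first):
    """Hypothetical API handler; the flag picks the ordering under discussion."""
    session = FakeSession(store)
    session.add(pool_id, {"id": pool_id, "provisioning_status": "PENDING_CREATE"})
    if call_provider_first:
        seen = worker_create_pool(store, pool_id)  # worker runs before the commit
        session.commit()
    else:
        session.commit()
        seen = worker_create_pool(store, pool_id)  # worker runs after the commit
    return seen


race = api_create_pool({}, "pool-1", call_provider_first=True)
safe = api_create_pool({}, "pool-1", call_provider_first=False)
print(race)
print(safe)
```

In the real system the ordering is not fixed, which would explain why the failure shows up only in some runs.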
rm_worksad panda00:40
rm_workis it OK to move the driver call *after* the commit?00:41
rm_workactually, it should definitely be after, right? because if the driver call fails, we don't want to rollback still IMO00:42
rm_workit'd be direct-to-error?00:42
rm_workor is that not the design00:42
johnsomHmmm, well, we would have to go back to the "ERROR" status thing.00:42
rm_workso... yeah. we have to IMO00:42
rm_worksince we really can't call out to the provider without having the db updated yet T_T00:42
johnsomI switched this around  to try to enroll the provider call in the DB transaction, so if the driver failed we would roll back the whole request00:43
rm_workwe're going to have this race00:43
johnsomWe, as in Octavia driver....00:43
rm_workhow can a driver that relies on the DB function then00:43
rm_workamphora driver, yeah00:43
johnsomNo driver should rely on the DB00:44
rm_workso amp driver needs to be rewritten then <_<00:44
rm_workbecause it does00:44
johnsomThey should get everything from the provider call00:44
rm_worki guess we need to rewrite the amp driver to use the interface?00:44
johnsomYes, ultimately, that is true00:44
rm_workthat means passing that data over rmq00:44
rm_workinstead of just an ID00:44
johnsomI had hoped to avoid that00:45
rm_workwell, you see the problem with avoiding that <_<00:45
johnsomAt least for now, just because that is a lot of change too00:45
rm_workok... so.00:45
rm_workwe can't leave it like this :(00:45
johnsomAdd sleeps?  lol00:47
rm_workwhat do you suggest BESIDES that00:47
* rm_work slaps johnsom 00:47
rm_workno hacks :P00:47
johnsomPoll the DB?00:47
johnsomSorry, still wacky from being sick00:47
rm_workugh, retry decorator?00:47
rm_workyou got sick? :(00:47
rm_worki mean we could do a retry for like 5s or something dumb00:48
*** leitan has joined #openstack-lbaas00:48
johnsomYeah, I got a sinus infection from the trip. Burned my whole weekend00:48
rm_worki said "temporarily", which means shitty hack workaround00:48
rm_worksad :(00:48
rm_workI drove to Boise and back, which burned a lot of mine :P00:48
johnsomYep, hope the wedding was fun00:48
rm_worknah, not a wedding. just meeting up with some folks00:49
johnsomSo..... Yeah, I think the driver code is *right*, it's our amphora side that is *wrong*00:49
johnsomAh, I thought you were going to a wedding for some reason.00:49
rm_workbut whatever meds you're on, I want some00:49
johnsomNone sadly00:49
rm_workyes, I guess I agree then00:50
* johnsom sees the problem00:50
rm_worki don't see a solution besides passing stuff through RMQ00:50
rm_workthe whole kaboodle00:50
johnsomWell, it's either that or do the DB polling on the controller side.00:50
rm_workthat's really shitty :/00:51
rm_workit's impossible to *prove* it works00:51
johnsomRight, long term, it should be using the data it gets in the provider call00:53
*** annp has quit IRC00:53
*** annp has joined #openstack-lbaas00:54
rm_workso you think... what? we temporary-hack-workaround it to just do a retry on all the DB loads?00:54
rm_workif it's not found during a create?00:54
rm_workI GUESS that's kinda ok? since for a create, we really should be able to assume that it's got to be there sooner or later00:55
johnsomIt really comes down to who has time to do the long term solution vs. slapping in a workaround.00:56
rm_workk well, at the moment this is scary00:57
rm_worki couldn't put this in prod, and our gates are going to be unreliable00:57
*** leitan has quit IRC00:58
*** leitan has joined #openstack-lbaas00:58
*** harlowja has quit IRC01:00
*** leitan has quit IRC01:00
*** leitan has joined #openstack-lbaas01:00
*** keithmnemonic[m] has joined #openstack-lbaas01:05
openstackgerritMichael Johnson proposed openstack/octavia master: Implement provider drivers - Cleanup
johnsomrm_work the gotcha with the long term solution is the code expects a DB model there, just as that failure was calling, which doesn't exist with the provider driver method. We don't have a separate DB for the amphora driver yet.01:16
rm_workyeah so it's kinda a rewrite01:18
johnsomTomorrow I will cook up the polling thing.01:21
rm_workk i mean01:22
rm_worki assume it's just a DB retry01:22
rm_workusing like01:22
rm_workif it throws the DB error01:22
rm_worki can prolly do it too01:22
johnsomRight, a decorator that checks the repo get result to see if it is empty, if so, try again. like up to 30 seconds or something01:22
johnsomIt won't throw an exception, it's just a None object back01:23
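A decorator along the lines described here might look like the sketch below. It is an illustration only: `retry_while_none`, `NotFoundYet`, and `get_pool` are hypothetical names, the 30 second ceiling and 1 second interval are taken from the numbers floated in the chat, and the patch that was eventually proposed may be structured differently.

```python
import functools
import time


class NotFoundYet(Exception):
    """Hypothetical error raised when a lookup keeps returning None."""


def retry_while_none(timeout=30.0, interval=1.0, sleep=time.sleep,
                     clock=time.monotonic):
    """Re-run the wrapped lookup until it returns something other than None."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = clock() + timeout
            while True:
                result = func(*args, **kwargs)
                if result is not None:
                    return result
                if clock() >= deadline:
                    raise NotFoundYet(
                        f"{func.__name__} still returned None after {timeout}s")
                sleep(interval)
        return wrapper
    return decorator


calls = {"n": 0}


# sleep is stubbed out so the example finishes immediately.
@retry_while_none(timeout=30.0, interval=1.0, sleep=lambda _seconds: None)
def get_pool(pool_id):
    """Hypothetical repo get: the row only becomes visible on the third call."""
    calls["n"] += 1
    return {"id": pool_id} if calls["n"] >= 3 else None


pool = get_pool("pool-1")
print(pool, calls["n"])
```

Wrapping only the first "get" in each create path, rather than the shared repository method, keeps the retry away from callers where None is a legitimate answer.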
rm_workyeah, i mean, I could make it throw one :P01:23
johnsomThen go down the controller_worker, and wrap those first "get" calls01:23
johnsomThat would be nasty as tons of things use those repo get calls01:24
rm_worknot in the repo-get01:24
rm_workin the controller_worker create_*01:24
johnsomAnd it would only be the create calls, the others should be fine. I think01:25
johnsomOk, I need to run sadly01:25
rm_workupdates will be fine01:27
rm_workand deletes obv01:27
*** mordred has quit IRC01:31
*** hongbin has joined #openstack-lbaas01:38
*** mordred has joined #openstack-lbaas01:45
*** leitan has quit IRC02:38
*** leitan has joined #openstack-lbaas02:39
*** leitan has quit IRC02:43
*** harlowja has joined #openstack-lbaas03:29
*** hongbin has quit IRC03:53
*** links has joined #openstack-lbaas04:11
*** annp has quit IRC04:13
*** harlowja has quit IRC04:14
*** blake has joined #openstack-lbaas04:23
*** JudeC has joined #openstack-lbaas05:10
*** eandersson has joined #openstack-lbaas05:10
*** kobis has joined #openstack-lbaas05:19
*** JudeC has quit IRC05:22
*** JudeC has joined #openstack-lbaas05:24
*** kobis has quit IRC05:37
*** kobis has joined #openstack-lbaas05:40
*** kobis has quit IRC05:41
*** eandersson has quit IRC05:48
*** eandersson has joined #openstack-lbaas05:49
*** kobis has joined #openstack-lbaas06:11
*** blake has quit IRC06:13
*** imacdonn has quit IRC06:16
*** imacdonn has joined #openstack-lbaas06:16
rm_workugh a lot of our docstrings are hilariously wrong in controller_worker06:29
*** AlexeyAbashkin has joined #openstack-lbaas06:32
*** pcaruana has joined #openstack-lbaas06:33
*** JudeC__ has joined #openstack-lbaas06:40
*** JudeC_ has quit IRC06:41
openstackgerritKobi Samoray proposed openstack/octavia master: Octavia devstack plugin API mode
*** annp has joined #openstack-lbaas06:52
*** apple01 has joined #openstack-lbaas06:54
openstackgerritAdam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates
openstackgerritAdam Harwell proposed openstack/octavia-tempest-plugin master: Create api+scenario tests for healthmonitors
*** apple01 has quit IRC07:17
*** tesseract has joined #openstack-lbaas07:20
*** rcernin has quit IRC07:27
*** kobis has quit IRC07:34
*** kobis has joined #openstack-lbaas07:38
*** yboaron has joined #openstack-lbaas07:39
*** apple01 has joined #openstack-lbaas07:46
openstackgerritZhaoBo proposed openstack/octavia master: UDP for [3][5][6]
*** apple01 has quit IRC07:51
*** JudeC__ has quit IRC08:10
*** kobis has quit IRC08:12
*** kobis has joined #openstack-lbaas08:13
*** jiteka has quit IRC08:13
*** nmanos has joined #openstack-lbaas08:19
*** Alexey_Abashkin has joined #openstack-lbaas08:50
*** Alexey_Abashkin has quit IRC08:51
*** AlexeyAbashkin has quit IRC08:51
*** AlexeyAbashkin has joined #openstack-lbaas08:52
*** apple01 has joined #openstack-lbaas08:56
*** apple01 has quit IRC09:01
*** salmankhan has joined #openstack-lbaas09:14
*** yamamoto has quit IRC09:21
*** salmankhan has quit IRC09:25
*** kobis has quit IRC09:26
*** JudeC_ has joined #openstack-lbaas09:28
*** salmankhan has joined #openstack-lbaas09:35
*** salmankhan has quit IRC09:42
*** salmankhan has joined #openstack-lbaas09:47
*** yamamoto has joined #openstack-lbaas09:50
*** zigo_ is now known as zigo09:56
*** annp has quit IRC10:01
*** apple01 has joined #openstack-lbaas10:06
*** kobis has joined #openstack-lbaas10:14
*** apple01 has quit IRC10:14
*** JudeC_ has quit IRC10:15
*** AlexeyAbashkin has quit IRC10:30
*** AlexeyAbashkin has joined #openstack-lbaas10:31
*** AlexeyAbashkin has quit IRC10:36
*** apple01 has joined #openstack-lbaas11:04
*** yboaron has quit IRC11:04
*** apple01 has quit IRC11:08
*** apple01 has joined #openstack-lbaas11:20
*** AlexeyAbashkin has joined #openstack-lbaas11:21
*** yamamoto has quit IRC11:29
*** longkb has quit IRC11:40
*** apple01 has quit IRC11:42
*** apple01 has joined #openstack-lbaas11:49
*** yamamoto has joined #openstack-lbaas12:07
*** leitan has joined #openstack-lbaas12:09
*** atoth has joined #openstack-lbaas12:16
*** amuller has joined #openstack-lbaas12:23
*** apple01 has quit IRC12:44
*** yboaron has joined #openstack-lbaas12:45
*** sajjadg has joined #openstack-lbaas12:59
*** samccann has joined #openstack-lbaas13:00
*** yamamoto has quit IRC13:04
*** leitan has quit IRC13:11
*** leitan has joined #openstack-lbaas13:26
*** leitan has quit IRC13:26
*** links has quit IRC13:29
*** links has joined #openstack-lbaas13:30
*** yamamoto has joined #openstack-lbaas13:35
*** fnaval has joined #openstack-lbaas13:36
*** yamamoto has quit IRC13:39
*** links has quit IRC13:44
*** AlexeyAbashkin has quit IRC14:02
*** AlexeyAbashkin has joined #openstack-lbaas14:04
*** kobis has quit IRC14:06
*** yboaron_ has joined #openstack-lbaas14:25
*** yboaron has quit IRC14:28
*** yboaron_ has quit IRC14:37
*** yboaron_ has joined #openstack-lbaas14:38
*** yboaron_ has quit IRC14:53
*** kobis has joined #openstack-lbaas14:55
*** apple01 has joined #openstack-lbaas15:07
*** rpittau has joined #openstack-lbaas15:27
*** sajjadg has quit IRC15:30
*** pcaruana has quit IRC15:33
amullerRandom Q of the day - When issuing the admin failover command to a loadbalancer in active_standby topology, I see that both amphorae IDs changed. Is this expected?15:44
*** jiteka has joined #openstack-lbaas15:45
*** jiteka has quit IRC15:47
*** jiteka has joined #openstack-lbaas15:48
amullerit seems like both the active and standby amphorae were killed and new ones were spawned15:48
*** apple01 has quit IRC15:51
xgerman_yep, that is expected15:53
xgerman_the idea is to recycle both amps in order to update the amphora image for example15:54
xgerman_the amphora API has a more granular failover15:54
xgerman_^^ amuller15:54
amullerso how do I know that a keepalived based failover happened, via the API?15:55
xgerman_yeah, failover on lb might mean something different for F5 :-)15:55
amuller(thinking of testing)15:55
amullerI mean, how do I know that the old backup is now the active?15:55
johnsomamuller If you want to failover only one of the pair, you can use the amphora failover API.15:56
xgerman_well, I think you mean the active dies and the passive takes over15:56
xgerman_you can “simulate” that with a nova delete on the ACTIVE15:56
xgerman_or a port down, or…15:57
amulleradmin state down on the neutron port of the active?15:57
johnsomAs for which one is currently "MASTER" in the VRRP pair, we don't currently expose that outside the amphora. They manage that completely autonomously to the control plane.15:57
*** nmanos has quit IRC15:57
xgerman_amuller: yes, the vrrp port15:57
amullerjohnsom: ack15:57
johnsomIt also comes down to keepalived not reliably exposing the status of a given instance.15:58
amullerso I'm trying to write a scenario test for failover with active_standby topology15:58
xgerman_yeah, I think in the grander scheme of things killing the amp you think is master will definitely have the other amp take over15:59
amullerso I'm trying to find an easy way (using the API) to see that a failover happened15:59
xgerman_then you check via the nova or amp api if there is a new amp15:59
johnsomAdam had started one:
johnsomHave you looked at that.15:59
xgerman_while you check that traffic kept flowing15:59
amullerI had not15:59
johnsomYeah, the amphora IDs will change, which is queriable via the amphora admin API. That is one clue16:00
johnsomI think that patch is pretty out of date now, but could be a starting place.16:00
amullermhmm, that patch has a larger scope than I was hoping to put up for review16:03
amullerI was hoping for a simpler patch, leaving traffic flow out of it16:04
amullerjust using the API to see that a failover happened16:04
amullerxgerman_: if I find the nova VM for the active amphora and kill it, I imagine octavia will spawn a new amphora, but will the old standby become the new active, without octavia spawning a new 2nd amphora?16:09
amullerin other words if I kill the nova VM for the active amphora, and I wait a bit and issue an amphorae list for the LB, will I see two new amphorae or only one new one?16:10
johnsomOnly one new one16:10
xgerman_passive->active runs “local” on each amp pair and independently we replace broken/deleted amps with the control plane16:11
johnsomWhen you kill one amp by nova delete, the other will automatically assume MASTER if it wasn't. Then health manager will rebuild the other amphora and it will assume the BACKUP role.16:11
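For a test, the "only one new one" expectation can be checked by comparing the amphora IDs listed for the load balancer before and after the kill. The sketch below assumes the IDs have already been fetched from the amphora admin API; `replaced_amphorae` and the sample IDs are hypothetical.

```python
def replaced_amphorae(before_ids, after_ids):
    """Classify amphora IDs as removed, added, or kept across a failover."""
    before, after = set(before_ids), set(after_ids)
    return {
        "removed": before - after,
        "added": after - before,
        "kept": before & after,
    }


# One amphora deleted through nova: expect exactly one replacement.
diff = replaced_amphorae(["amp-a", "amp-b"], ["amp-b", "amp-c"])
print(diff)
```

After a load balancer level failover, by contrast, both IDs would be expected to land in "removed", as noted earlier in the discussion. Neither result says which amphora currently holds the VIP.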
johnsomMASTER and BACKUP should not be confused with the database "ROLE" which is for build time settings and not an indication of which amp is playing what role.16:12
xgerman_+1 — once the pair is live it can do whatever it wants16:13
johnsomI actually had a demo video of doing this via horizon I had in case I ran short on my presentations at the summit16:13
amulleroh, ok good note on the database role16:13
johnsomYeah, that confuses people. It's for settings, not for current status16:14
amulleryeah it's confusing =p16:14
amullerso to double check there's no way to use the admin API to find out who is the active one (per keepalived)?16:14
johnsomCorrect. Keepalived doesn't provide a way to ask this that is reliable enough to use.16:15
johnsomUnless they have added it in the last year or two.16:15
amullerwhen I did the l3 HA work in neutron16:15
amullerbecause keepalived didn't at the time allow you to query it for its status16:15
amuller(I don't know if that changed)16:15
amullerI wrote keepalived-state-change-monitor16:15
amullerwhich is a little python process that uses 'ip monitor' to detect IP address changes16:16
xgerman_mmh, maybe we can incorporate that16:16
amullerand when it does, it notifies the L3 agent via a unix domain socket16:16
amullerand the L3 agent then notifies neutron-server via RPC16:16
amullerwhich updates the DB16:16
amullerso operators can use an admin API to see where the active router replica is16:16
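The detection step of that approach can be sketched as a small parser over address events. This is a rough illustration, not the neutron implementation: the sample lines are an assumption modeled on one-line `ip monitor address` output, and `parse_ip_monitor_line` and `vrrp_state` are hypothetical helpers. Reading from the real `ip monitor` process and notifying an agent are left out.

```python
def parse_ip_monitor_line(line):
    """Return (interface, cidr, added) for an address event, else None."""
    parts = line.split()
    added = True
    if parts and parts[0] == "Deleted":
        added = False
        parts = parts[1:]
    # Assumed shape from here: "<index>:", "<ifname>", "inet" or "inet6", "<cidr>", ...
    if len(parts) < 4 or parts[2] not in ("inet", "inet6"):
        return None
    return parts[1], parts[3], added


def vrrp_state(lines, vip_cidr):
    """Infer master/backup from whether the VIP was last added or removed."""
    state = "unknown"
    for line in lines:
        parsed = parse_ip_monitor_line(line)
        if parsed and parsed[1] == vip_cidr:
            state = "master" if parsed[2] else "backup"
    return state


events = [
    "2: eth1    inet 10.0.0.10/32 scope global eth1",
    "Deleted 2: eth1    inet 10.0.0.10/32 scope global eth1",
]
print(vrrp_state(events[:1], "10.0.0.10/32"))
print(vrrp_state(events, "10.0.0.10/32"))
```

This infers state from address ownership alone, so it would be affected by the startup oddity mentioned later, where both amphorae see the VIP address.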
xgerman_yeah, we can extend the health messages to stream it into the DB16:17
johnsomHmm, well, we have a status/diagnostic API on the amps, I think that would be fine to query and not have yet another status to update in the DB.16:17
amulleryeah, I was just commenting on a way to figure out keepalived state changes16:18
johnsomAlso, I think you will find there are some oddities in keepalived that on initial startup both amps will see the VIP IP but keepalived will only be GARPing from one of them.16:18
amullernot how you'd model it in the API16:18
johnsomThis is one of the "reliable" issues we hit16:18
amullerhmm, I didn't see that in L3 HA16:19
amullerI know that we configure keepalived differently16:19
amullerin neutron and in octavia16:19
johnsomYeah, I think it might be how we are bringing up the interfaces. It wasn't a "problem" so we never went back and tried to fix it16:19
*** kobis has quit IRC16:19
johnsomWe considered log scraping, but that proved to not be accurate. We considered doing the status dump to /tmp using the signal, but that required a newer version than some distros have. etc.16:20
johnsomJust wasn't a priority to run to ground as it doesn't provide a lot of useful information, just some nice to have info for failover.16:21
johnsomI think Ryan brought up that the right answer was likely listening on the dbus, but again, a good chunk of work16:21
amullerI'm not familiar with the status dump using a signal16:22
xgerman_so far the amp (diagnostic) API is not exposed in the best way — that’s why I was thinking health messages… I think one day I will write a way to proxy amp-api->diagnostic-api16:22
*** pcaruana has joined #openstack-lbaas16:23
amullerso it seems like right now there's no way to test failover in active_standby topology, if I want to stay in the realm of the API16:23
johnsomWhat do you mean it's not exposed in the best way? It works just fine.  I have no idea why you would need a proxy16:23
johnsomSure there is, use the amphora admin API to check the amphora IDs16:24
xgerman_you need to curl the amp directly or not?16:25
johnsomnot for the amphora admin API16:25
johnsomI guess list would be more useful:
johnsomFilter by LB ID16:26
johnsomI am pretty sure this is how Adam's failover patch is doing it16:26
amullerthat tells me that octavia spawned new amphoras, does it answer the question of 'did the old backup become the new active'?16:27
johnsomYou can test that by making sure traffic still flows during the failover16:27
amullerI don't really wanna do that =p16:28
amullerit makes the test more complicated than it needs to be16:28
amullerso, even with active_standby you'll see some downtime16:28
johnsomVRRP is designed to be autonomous to the control plane. It can switch at any time for failures the control plane doesn't know about.16:28
amullerso how long of a downtime would you set to be tolerable?16:28
amullerit needs to be shorter than a test with standalone topology16:28
johnsomYeah, depending on how it's tuned it's around a second, usually less16:29
amullerit's gonna be difficult to write it in a way that's reliable at the gate and in loaded environments16:29
amullerbut with a low enough timeout so that the test is still meaningful16:29
*** salmankhan has quit IRC16:29
xgerman_johnsom: I was thinking proxying — but you probably guessed that :-)16:30
johnsomYeah, the hierarchy is:  systemd recovery, sub second; active/standby around a second; failover, under a minute (depends on the cloud infrastructure and your octavia config, typically 30 seconds).16:30
amullerso let's say we do this by pinging the LB16:32
johnsomamuller Here is a demo I showed at the tokyo summit:
amullerwith some timeout16:33
amullerhow do I know which VM to kill16:33
amullerbecause I don't know who the active amphora is16:33
johnsomRight, you won't, you have to cycle through killing them both, one at a time to prove and active/standby transition16:33
xgerman_they could switch in between — so you are probably left with “good enough”16:34
amullerso I pick one at random, kill it, assert < timeout outage16:34
amullerI could have killed either the active or the standby16:34
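The "assert < timeout outage" check could be computed from a series of timestamped probe results, for example pings or HTTP requests against the VIP. The sketch below is hypothetical: `longest_outage` and the sample data are made up, and the tolerable limit is something the test author would have to choose.

```python
def longest_outage(samples):
    """Longest span from a first failed probe to the next successful one.

    samples: list of (timestamp_seconds, ok) tuples ordered by time.
    """
    longest = 0.0
    down_since = None
    for timestamp, ok in samples:
        if not ok and down_since is None:
            down_since = timestamp
        elif ok and down_since is not None:
            longest = max(longest, timestamp - down_since)
            down_since = None
    if down_since is not None:
        # Still failing when probing stopped: count up to the last sample.
        longest = max(longest, samples[-1][0] - down_since)
    return longest


probes = [(0.0, True), (0.5, True), (1.0, False), (1.5, False),
          (2.0, True), (2.5, True)]
outage = longest_outage(probes)
print(outage)
```

Because the killed amphora may have been the standby, a zero outage is also a valid result, which is why the discussion lands on killing each one in turn.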
rm_workamuller: i am mid failover-scenario right now (started on it yesterday)16:34
rm_workupdating my old patch16:35
amulleroh hah ok16:35
rm_workabout to push up the amphora client patch16:35
amullerwell, good thing I asked =p16:35
rm_workjust didn't get the testing quite right last night16:35
amulleralrighty well16:36
amulleris there another area there are glaring testing coverage issues?16:36
amullerone thing that pops to mind is the amphora show and list API16:36
amullerI don't think I saw tests for that in the api subdir16:36
amulleris there a doc or some such y'all are using to track progress on test coverage in the octavia tempest plugin?16:37
rm_workamuller: that's what i meant16:38
openstackgerritAdam Harwell proposed openstack/octavia-tempest-plugin master: Create api+scenario tests for l7policies
openstackgerritAdam Harwell proposed openstack/octavia-tempest-plugin master: Create api+scenario tests for l7rules
openstackgerritAdam Harwell proposed openstack/octavia-tempest-plugin master: Create api+scenario tests for amphora
rm_work^^ there16:40
rm_worknot done but i'll just push it so it'll be obvious what i mean16:40
amullerrm_work: you prolly want a different class name =p16:42
rm_workyes :P16:42
rm_worki wasn't really ready to push yet but ;P16:43
rm_workone of the tests is still just an unmodified clone, lol16:43
amullerstill curious if there's an effort to track test coverage in the octavia tempest plugin16:44
amulleron my way out for lunch, we can sync later :)16:44
*** jcarpentier has joined #openstack-lbaas16:44
rm_worki think it is missing the amp piece16:45
amullerah ha :)16:45
amullerthat's quite a story hehe16:45
*** kobis has joined #openstack-lbaas16:45
*** JudeC_ has joined #openstack-lbaas16:46
rm_workjohnsom: did you see my attempts at the db retry patch?16:52
johnsomHaven't had a chance yet this morning16:53
* rm_work shrugs16:54
rm_worki can tweak that stuff a bunch if we want to do like16:55
rm_workwait 2, retry count = 516:55
rm_workinstead of wait 1, retry until 10s16:55
*** kobis has quit IRC16:59
*** tesseract has quit IRC17:10
johnsomrm_work So this looks fine. I would make those numbers either config or a constant somewhere (could be at the top of the file). I think 1 is the right answer, but maybe the top end as 30. Just in case there is a DB that takes a crazy long time to commit.17:28
*** AlexeyAbashkin has quit IRC17:29
johnsomIt doesn't really handle the case of walking off the end of the retries in a nice way. It's not like we could do much about that (can't set ERROR obviously), but a log message with details would be nice.17:30
*** AlexeyAbashkin has joined #openstack-lbaas17:31
johnsomFYI, someone on the dev mailing list was asking about backend re-encryption17:32
*** jcarpentier has left #openstack-lbaas17:35
*** AlexeyAbashkin has quit IRC17:36
rm_workjohnsom: right, if it fails right now the exception makes just as little sense, i think17:39
rm_workat least this one is representative of the issue17:39
rm_workbut in no case is there anything to be done :(17:40
rm_workI guess I could put in a log like "retrying"17:40
rm_workone sec17:40
johnsomI probably would have put the retry loop inside and had a log message that we "lost this object" and that the database is broken17:40
rm_workI mean, I could just catch the existing error, but it's not *necessarily* as explicit17:41
rm_workthe AttributeError17:42
rm_workbut that'd also be easier17:42
rm_workwhat about:17:45
rm_workwait=tenacity.wait_incrementing(2, 2)17:46
johnsomrm_work Hmm, there are some nice backoff options in the tenacity docs (trying to figure out what increment does, which of course isn't documented). It also has an "after failed" log option that would work here17:49
*** kobis has joined #openstack-lbaas17:49
rm_worki don't really want to go exponential17:50
johnsomyeah, agreed17:50
rm_workok actually, maybe I should just do it like...17:51
johnsomIt should land quickly under normal operation, so adding too much delay between attempts just slows the whole thing down.17:51
rm_workretry on the natural AttributeError, and then reraise after the last retry17:52
rm_workand not increment the delay much17:52
rm_workbut meh...17:54
*** AlexeyAbashkin has joined #openstack-lbaas17:57
*** Alexey_Abashkin has joined #openstack-lbaas18:00
*** AlexeyAbashkin has quit IRC18:02
*** Alexey_Abashkin is now known as AlexeyAbashkin18:02
*** pcaruana has quit IRC18:09
rm_worklol, retry_error_callback is SUPER NEW18:11
openstackgerritMichael Johnson proposed openstack/octavia master: Implement provider drivers - Cleanup
rm_workso not sure it's safe to use yet18:12
johnsom^^^ just a docs update to clarify some questions Kobi raised this morning18:12
rm_workyeah k18:13
*** kobis has quit IRC18:18
openstackgerritAdam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates
rm_workjohnsom: revised ^^18:21
rm_workerk off by one18:21
rm_worki was aiming for 1+2+3+4+5*11 = 6018:21
rm_workbut i need 15 not 14 for that18:22
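The off-by-one can be checked with a small model of an incrementing wait. The assumptions here are inferred from the sum in the chat and are not confirmed by it: waits start at 1 second, grow by 1 second, are capped at 5 seconds, and a wait happens only between attempts, never after the last one. `incrementing_waits` is a hypothetical helper, not a tenacity function.

```python
def incrementing_waits(attempts, start=1, increment=1, cap=5):
    """Waits taken between attempts: start, start+increment, ... clamped at cap."""
    return [min(start + increment * i, cap) for i in range(attempts - 1)]


waits_15 = incrementing_waits(15)
total_14 = sum(incrementing_waits(14))
total_15 = sum(waits_15)
print(waits_15)
print(total_14, total_15)
```

Under this model, 14 attempts give 13 waits, and a 15th attempt is what brings the total to the 60 second target.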
openstackgerritAdam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates
rm_workas this is a *workaround*, I really wanted to avoid adding a bunch more config (that we'd have to deprecate, to boot)18:23
johnsomYeah, just throw a constant at the top of the file18:26
rm_workah, sure18:27
openstackgerritAdam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates
*** AlexeyAbashkin has quit IRC18:40
*** salmankhan has joined #openstack-lbaas18:53
*** atoth has quit IRC18:55
*** salmankhan has quit IRC18:57
johnsomOne minor comment about the log message, otherwise good for me19:17
johnsomAnd I bet lower constraints is wrong19:18
openstackgerritJan Zerebecki proposed openstack/neutron-lbaas master: Remove eager loading of Listener relations
*** salmankhan has joined #openstack-lbaas19:46
johnsom#startmeeting Octavia20:00
openstackMeeting started Wed May 30 20:00:07 2018 UTC and is due to finish in 60 minutes.  The chair is johnsom. Information about MeetBot at
openstackUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.20:00
*** openstack changes topic to " (Meeting topic: Octavia)"20:00
openstackThe meeting name has been set to 'octavia'20:00
johnsomHi folks20:00
johnsomPretty light agenda today20:00
johnsom#topic Announcements20:01
*** openstack changes topic to "Announcements (Meeting topic: Octavia)"20:01
johnsomI don't have much for announcements.  The summit was last week. There are lots of good sessions up for viewing on the openstack site.20:02
johnsomOctavia was demoed and called out in a Keynote, so yay for that!20:02
johnsomWe also came up in a number of sessions I attended, so some good buzz20:02
johnsomNir, I think you have an announcement today....20:03
nmagneziI'm happy to announce that TripleO now fully supports Octavia as a One click install.20:03
nmagneziThat includes:20:03
nmagnezi1. Octavia services in Docker containers.20:03
nmagnezi2. Creation of the mgmt subnet for amphorae20:03
nmagnezi3. If the user is using RHEL/CentOS based amphora - it will automatically pull an image and load it to glance.20:04
nmagneziAdditionally, the SELinux policies for the amphora image are now ready and tested internally. Those policies are available as a part of openstack-selinux package.20:04
nmagnezisome pointers (partial list):20:04
nmagnezi1. I'll find the related tripleO docs and provide them next week (what we currently have is release notes ready for Rocky)20:04
nmagnezi2. SELinux:20:04
nmagneziMany people were involved with this effort (cgoncalves, amuller, myself, bcafarel, and more). And now we can fully support Octavia as an OSP13 component (Based on Queens).20:05
johnsomOk, cool!  So glad to see SELinux enabled for the amps20:05
johnsom+1000 There are people looking for it.  I also mentioned OSP 13 a number of times at the summit20:06
amullerI do expect this to drive up Octavia adoption pretty significantly20:06
nmagneziyup. I'm sure operators that are still using the legacy n-lbaas with haproxy will migrate20:07
cgoncalvesit takes less than 2 people to deploy octavia with tripleo now20:07
johnsomWay to go RH/TripleO folks. I know it's been a journey, but it is great to have the capability.20:07
johnsom2 people?  I thought it was one-click?  One to click the mouse and one to open the beverages?20:07
cgoncalvesjohnsom, I said "less than" ;) one is enough to do both jobs :)20:08
johnsomNice.  Any other announcements today?20:08
*** amuller has quit IRC20:09
johnsom#topic Brief progress reports / bugs needing review20:09
*** openstack changes topic to "Brief progress reports / bugs needing review (Meeting topic: Octavia)"20:09
johnsomI was obviously a bit distracted with the summit and presentation prep.20:10
nmagneziyup :)20:10
johnsomWe have started merging the provider driver code and the tempest plugin code. We are co-testing the patches as we go.20:10
johnsomrm_work Did point out a race condition in my amp driver last night, so he is working on fixing that.20:11
johnsomIt's kind of a migration issue, as I wanted to incrementally migrate the amphora driver over to the new provider driver ways.20:12
johnsomToday my focus is on adding the driver library support (update stats/status).20:12
johnsomI have been chatting with Kobi in the mornings about the VMWare NSX driver, which it sounds like is in-progress.20:12
johnsomHe has been giving feedback that I have been including in the cleanup patch.20:13
johnsomAlong the lines of drivers, it's not clear when we will get an F5 driver. They have had some re-orgs from what I gather, so it may delay that work.20:14
nmagnezigood to know. at least we have feedback from one vendor for now.20:15
johnsomYeah, sad that the vendor that was the key author of the spec may not be able to create their driver right away20:15
openstackgerritAdam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates
johnsomAny other progress updates today?20:16
nmagneziI had some cycles, so I finished my Rally patch to add support for Octavia. it is ready for feedback now and worked okay for me:
johnsomIn case you are really bored, here is the link to my project update.  Feedback welcome.20:17
nmagnezijohnsom, i watched it. you did great work with this.20:18
johnsomRally, cool.  I need to look at that and refresh my memory of how those gates work20:19
nmagnezijohnsom, yeah. the scenario i add now is a port of the existing n-lbaas scenario to Octavia20:19
johnsomI'm guessing it is the "rally-task-load-balancing" gate I should look at?20:19
nmagnezinext up we can add more stuff20:20
nmagnezijohnsom, yup20:20
johnsomCool, I will check it out20:20
johnsomAny other updates?20:21
johnsomcgoncalves I saw the grenade gate was failing, but didn't have much time to dig into why. Are you looking into that?20:22
cgoncalvesjohnsom, I submitted new patch set today (yesterday?) to check what's going on when we curl. it fails post-upgrade20:22
johnsomYeah, odd20:23
cgoncalvesnot sure why yet. it successfully passes the same curl pre and during upgrade20:23
cgoncalvesit started failing all of a sudden20:23
johnsomWell, let us know if we can provide a second set of eyes to look into it.20:24
johnsomI really want to get that merged and voting.20:24
johnsomSo we can start the climb on upgrade tags20:24
johnsomAlso, we discussed fast-forward upgrades a bit at the summit.20:24
johnsomI'm thinking we need to set up grenade starting with Pike (1.0 release) and have gates for Pike->Queens, Queens->Rocky, etc. to prove we can do a fast-forward upgrade20:25
johnsomI guess I am jumping ahead...  grin20:26
johnsom#topic Open Discussion20:26
*** openstack changes topic to "Open Discussion (Meeting topic: Octavia)"20:26
*** mstrohl has joined #openstack-lbaas20:28
johnsomfast-forward is running each upgrade sequentially to move from an older release to current. This is different from leap-frog which is a direct jump Pike->Rocky.  It sounds like fast-forward is going to be the supported plan for OpenStack upgrades20:28
rm_workfast forward sounds like...  a normal upgrade process20:29
johnsomRight, just chained together20:29
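The difference between the two upgrade styles described above can be sketched in a few lines of Python. This is only an illustration: the release list and both helper functions are hypothetical names made up for this example, and nothing here is part of grenade or Octavia.

```python
# Hypothetical illustration of fast-forward vs. leap-frog upgrade paths.
# The release list and function names are assumptions for this sketch only.
RELEASES = ["pike", "queens", "rocky"]


def fast_forward_steps(start, target, releases=RELEASES):
    """Return each sequential (from, to) hop between start and target."""
    i, j = releases.index(start), releases.index(target)
    if i >= j:
        raise ValueError("target must be newer than start")
    return [(releases[k], releases[k + 1]) for k in range(i, j)]


def leap_frog_steps(start, target):
    """Return the single direct jump that a leap-frog upgrade would make."""
    return [(start, target)]


ffu = fast_forward_steps("pike", "rocky")
leap = leap_frog_steps("pike", "rocky")
```

Each tuple in `ffu` corresponds to one gate of the kind johnsom mentions (Pike->Queens, Queens->Rocky), while `leap` holds the single Pike->Rocky jump.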
nmagneziand... fast :)20:29
rm_workunless they add stuff like "you don't have to start/stop the control plane at each stage"20:29
johnsomEh, as long as we have a plan and a test I will be happy.20:29
rm_workbut yeah20:29
johnsomI think it's the script of what is required to do that.20:30
johnsomOther topics today?20:30
cgoncalvesIIUC, we could try that now in the grenade patch by changing the base version from queens to pike:
johnsomcgoncalves that would be a leap frog though, right?20:31
cgoncalvesif there are upgrade issues (e.g. deprecated configs), we create a per version directory with upgrade instructions20:31
cgoncalvesjohnsom, ah right20:32
johnsomyeah, we need to write up an upgrade doc that lays out the steps.  Maybe from there link to any upgrade issues or just link to the release notes20:32
* johnsom notices the room going quiet....20:33
nmagnezieveryone likes docs..20:34
johnsomIn fairness, I think we can just pull the grenade back to the Queens  branch and set it up.  This would give us the FFU gates we need.20:34
johnsomOnce we get it stable on master20:34
nmagneziyup. sounds right20:36
johnsomOk, if there aren't any more topics, have a good week and we will chat next Wednesday!20:38
rm_worksomething is nagging at me about that... but sure, probably20:38
rm_worko/ REVIEW STUFF20:38
johnsomYes, please.  I really hope to do a client release soon20:38
johnsomWould love to get some reviews on that20:39
*** openstack changes topic to "Discussion of OpenStack Load Balancing (Octavia) |"20:40
openstackMeeting ended Wed May 30 20:40:00 2018 UTC.  Information about MeetBot at . (v 0.1.4)20:40
openstackMinutes (text):
rm_workah yeah i need to look at client again20:40
johnsomI think status and timeouts are ready for review.20:41
johnsomUDP might be, there were some updates pending on the controller side20:41
rm_worki need to get scenario tests done on the l7 stuff so we can merge those and then merge yours20:42
rm_workre: the comments earlier about amps not showing who is master -- i could implement it to update our DB at least the way that I do it in my env20:43
rm_workwhich is, on takeover, keepalived can run a script -- and that script can trigger a health message with the takeover notification20:43
rm_workso we could update the db20:43
* rm_work shrugs20:43
johnsomWell, I have a bunch of concerns about that frankly20:43
rm_workit's just a third type of health message, which is like "hey, i'm a master now, do with that what you will"20:44
rm_workand the default driver can just be like "ok, updating amps in the db"20:44
rm_workit's not guaranteed to be super accurate20:45
johnsomYeah, plus new DB columns, a bunch of latency, etc.  Just is it really that useful of info?20:45
rm_workbut it's better than what it does now20:45
rm_workno new db columns20:45
rm_workwe already have a column for role20:45
johnsomOh yes! You CANNOT change the ROLE column20:45
rm_worklol k20:45
* rm_work does20:45
johnsomThat would be a big problem20:45
rm_workthat's how it is in my prod20:45
rm_workhave yet to see any problems20:45
johnsomThat dictates the settings pushed to the amps, like priorities and peers20:46
johnsomMy guess is your failovers aren't happening right and likely pre-empt the master when a failed amp is brought back in20:47
johnsomThat column was never intended to be a status column20:47
rm_workit wouldn't push new configs to keepalived20:47
rm_workunless one of the amps actually gets updated20:47
johnsomRight, but that column dictates the settings deployed into the keepalived config file when we build replacements20:48
rm_workat which point... it then pushes to both20:48
rm_workyeah, i build a replacement BACKUP immediately actually20:48
rm_workwhich gets the new backup settings20:48
johnsomWe have had this conversation, if I remember right you are ok with the downtime20:49
rm_workwhat downtime?20:49
rm_workI mean, we already have some downtime just for the FLIP to move, on a fail20:49
rm_workbut there's no other downtime involved20:49
johnsomThat eats the keepalived downtime20:49
rm_workwe always reload keepalived on amp configs20:50
rm_worknot sure how there would be any different downtime20:50
xgerman_I would rather have working and then relay back to the amp API - my 2cts20:50
johnsomxgerman_ I don't think we want anymore "proxy" code right now20:51
rm_workeugh, public API endpoint that can trigger an actual callout to dataplane -> not something i want to see happen20:51
xgerman_I said relay — not proxy20:51
johnsom<xgerman_> German Eichberger johnsom: I was thinking proxying — but you probably guessed that :-)20:52
johnsomThat was from this morning.20:52
xgerman_yes, that’s why I said relay — I am learning20:53
rm_worki get the idea -- it's not exactly a proxy I think -- but it involves synchronously reaching out to the dataplane20:53
rm_workwhich is meh20:53
rm_worki would rather not expose a direct path from user->dataplane20:53
rm_workthough, it is just admins20:54
johnsomyeah, I struggle with the value of it. To me it is a cattle detail hidden from view that can change at any time autonomously from the control plane.20:54
rm_workit's essentially useful just for testing20:54
rm_workwhich is at least 3/4 of why i even wrote the amp API20:54
xgerman_if it’s just testing they can hit the amp directly20:55
rm_workso i could do stuff in tempest without reaching into the DB20:55
xgerman_I was more thinking for some automation stuff - e.g. change roles (there is also a set)20:55
rm_workyes, for pre-fails that would be nice20:55
rm_workfor the failover call and such20:55
rm_workand my evac20:56
johnsomIt's not like you can force a role change outside of a failure. There is no keepalived call to do that (thank goodness)20:56
xgerman_also seeing amps flap between active-passive is useful20:56
johnsomIs it?20:57
xgerman_yes, if it happens every 3s20:57
johnsomUser wouldn't know20:57
xgerman_the network is wonky20:57
rm_workthe latency +/- on that would be a good deal of that 3s lol20:57
xgerman_yes, it’s useful for admin/ monitoring20:57
*** KeithMnemonic has joined #openstack-lbaas20:57
xgerman_users don’t care…20:58
johnsomI think that is a poor metric. It doesn't tell you anything about why or if it is even a bad thing.20:58
johnsomIf you want that level of deep monitoring, you need the logs that would actually tell you why in addition to if it changed20:59
johnsomBut again, this is treating them like pets, not cattle20:59
xgerman_I see monitoring as detecting potential error conditions and then investigating through logs21:00
xgerman_but if I would see a ton of flapping and heard that networking did something…21:01
rm_workthe biggest thing i see it being useful for is tempest failover testing21:03
johnsomAgain, I think role changes are noise and not an actionable event21:03
*** sapd_ has joined #openstack-lbaas21:03
johnsomRight, I think the only real use case is testing IMO.  Which could be done other ways, like ssh in and dump the status file21:04
johnsomOr just do a dual failure to guarantee the initial state21:04
xgerman_now, to what I actually wanted to do: check zuul for ending jobs — didn’t they have a page I could check by Change-id?21:06
johnsomChange ID, not so sure21:07
*** sapd has quit IRC21:07
*** samccann has quit IRC21:12
*** samccann has joined #openstack-lbaas21:14
*** salmankhan has quit IRC21:29
*** harlowja has joined #openstack-lbaas21:52
lxkongmorning, guys, I have a question, is it possible that the package data sent from amphora to health-manager is in non-ascii format?21:57
lxkongwe met with some error message in health-manager log: `Health Manager experienced an exception processing a heartbeat message from ('', 31538). Ignoring this packet. Exception: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128): UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128)`21:57
lxkongso i made a patch to oslo_utils here
lxkongseems the method hmac.compare_digest doesn't support non-ascii21:58
lxkongbut our issue was solved using this patch21:59
rm_workthat's weird21:59
johnsomYeah, I think it is always non-ascii...  It should be raw bytes22:00
rm_workah Ben's link is interesting22:00
rm_work`a and b must both be of the same type: either str (ASCII only, as e.g. returned by HMAC.hexdigest()), or a bytes-like object`22:00
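The type rule quoted above can be checked with a small sketch. It assumes Python 3 and uses a made-up key and payload; it is not the Octavia health manager code, only a demonstration of what `hmac.compare_digest` accepts.

```python
import hashlib
import hmac

# Hypothetical key and payload, chosen only for this illustration.
key = b"example-heartbeat-key"
payload = b"example payload"

mac = hmac.new(key, payload, hashlib.sha256)
raw = mac.digest()        # bytes; individual bytes may be >= 0x80
hexed = mac.hexdigest()   # str made of ASCII hex characters only

# Both operands are bytes-like: allowed.
bytes_ok = hmac.compare_digest(raw, raw)

# Both operands are ASCII-only str: allowed.
hex_ok = hmac.compare_digest(hexed, hexed)

# Mixing bytes and str is rejected with a TypeError on Python 3.
try:
    hmac.compare_digest(raw, hexed)
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

On Python 2 the raw digest is a `str` that can hold non-ASCII bytes, which is the situation being debugged in this conversation; the sketch does not attempt to reproduce that case.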
lxkongyeah, that's from the doc of hmac module22:03
johnsomlxkong Do you have the traceback?22:04
lxkongjohnsom: i don't have traceback now, but according to my debugging, the error happens here:
lxkongand i also printed the data received, it's non-ascii22:05
johnsomYeah, so the data it's comparing is generated here:
johnsomWhich is clear, it is not ASCII22:08
johnsomWe probably should switch to using hmac.compare_digest22:08
lxkongthe method has different explanation in python 2.7.7 and 3.322:11
johnsomYeah, I see that22:12
rm_workright so22:12
rm_workthe point of that is because we can't use hmac.compare_digest :P22:12
rm_workit does use it, if it can22:12
johnsomOh, yeah, I see it22:13
lxkongdo you think we should fix that in octavia?22:13
rm_workso it looks like we SHOULD have been using hexdigest() instead of digest() ?22:13
johnsomSo, right, maybe for the decode we use hexdigest and for the encode use digest?22:14
johnsomSo, don't we still want to pass the hmac in binary over the wire (half the size) but use the hex dump for the compare?22:16
rm_workyeah i am trying to find where we actually do the encode/decode22:16
johnsomNo, that won't work, it has to be hexdigest all the way around22:17
rm_workah right, same file for both22:17
rm_workso should we even be encoding in utf-8 then?22:18
"utf-8"), payload, hashlib.sha256)22:18
rm_workoh that's just the key22:18
johnsomThat is the key, so yes22:18
rm_workwe also do the payload tho22:18
rm_workah and then it becomes binary22:19
johnsomBefore it's compressed, so again, doesn't matter22:19
johnsomI guess we just have to double the size of the hmac sig on the packets by making them hex22:20
rm_workso... yeah we just switch to hexdigest maybe22:20
rm_workhow big is that22:20
johnsom32 bin, 64 hex22:20
rm_workbits? bytes?22:21
rm_workI assume bytes22:21
rm_workso ... ok, not ideal, but meh22:21
rm_workwe've got .... 65507 bytes to work with? :P22:22
rm_workthough that may not function on all networks?22:22
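The "32 bin, 64 hex" figures can be confirmed with a short sketch. The key and payload are placeholders, and `MAX_UDP_PAYLOAD` simply repeats the 65507-byte number mentioned in the chat; none of these names come from Octavia.

```python
import hashlib
import hmac

# Figure quoted in the chat as the room available in one UDP datagram.
MAX_UDP_PAYLOAD = 65507


def signature_sizes(key, payload):
    """Return (binary_length, hex_length) in bytes for an HMAC-SHA256."""
    mac = hmac.new(key, payload, hashlib.sha256)
    return len(mac.digest()), len(mac.hexdigest().encode("ascii"))


# Placeholder inputs; the signature length does not depend on them.
binary_len, hex_len = signature_sizes(b"hypothetical-key", b"hypothetical payload")
extra_bytes = hex_len - binary_len
```

Switching to the hex form therefore costs `extra_bytes` additional bytes per heartbeat, which is small next to `MAX_UDP_PAYLOAD` but, as rm_work notes, the usable datagram size may be lower on some networks.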
*** rcernin has joined #openstack-lbaas22:22
johnsomUnless there is some way to make those both "bytes like objects"22:25
rm_worki mean, that's what they should be22:25
rm_worki believe that IS what we send22:26
rm_workessentially just a bytefield22:26
johnsomSo, it must be this that is hosing it over to python "what is a string" hell: envelope[-hash_len:]22:27
rm_workcan debug through that and see22:28
rm_workone sec22:28
rm_workstill bytes objects22:30
rm_workwhen we pass to22:30
rm_worksecretutils.constant_time_compare(expected_hmc, calculated_hmc)22:30
rm_workboth expected and calculated are bytes()22:30
rm_workone sec let me see which python i'm on22:30
rm_workok in py2 it's all str()22:31
rm_workin py3 it's all bytes()22:31
rm_workbut even the original envelope is str() in py222:31
rm_workwe DO test this in our unit tests22:32
rm_workit's possible it's an issue with py3 on amps and py2 on controlplane, or vice versa?22:32
rm_worklxkong: what version of python does your HM run on?22:32
lxkonglet me see22:33
lxkongthe hm is on 2.7.6, the python inside amphora is 2.7.1222:34
lxkongusing hexdigest() can also solve the issue. What are you talking about?22:36
*** mstrohl has quit IRC22:36
*** fnaval has quit IRC22:37
johnsomlxkong BTW, you might want to add a docs page in Octavia that talks about using Octavia with K8s and your ingress work. Just so people can find it, etc.22:41
lxkongi don't understand what's the potential problem it will bring if we replace digest() with hexdigest()?22:41
lxkongjohnsom: sure, i will22:42
johnsomlxkong Kind of like I did for the SDKs:
lxkongyeah, i will find an appropriate place to advertise that work :-)22:43
johnsomWell, it doubles the size of the hmac data being sent.  We want our overall data size to fit in a single UDP packet, so trying to keep the size down is good practice22:43
lxkongah, ok yeah, 32 should be 6422:43
johnsomYeah, if we switch to hexdigest it will become 6422:44
lxkongbut reducing from 64 to 32 is just a mitigation22:44
lxkongany other options do we have to solve the problem?22:45
johnsomIt's an optimization to have it stay 32 bytes like it is now, true.22:45
lxkongor is anything i can help?22:45
johnsomThat is a question for rm_work, I think he is poking at this one22:46
rm_workeh, i looked, but22:47
rm_worki don't really have time to22:47
lxkongor we could keep a copy of oslo_utils/ but change things i proposed :-)22:47
rm_worknor have i seen this issue?22:47
rm_workwell, I think we might have to do the hexdigest thing22:47
rm_workto be actually legit22:47
johnsomYeah, so maybe just do that for now.  I think I have seen this, but found out it really was some other issue that was triggering it.  Not sure.  It's vaguely familiar22:48
lxkongso do you guys have time to fix that atm, i can definitely do that because maybe we are the only one who are affected22:50
johnsomGo for it, my vote is the hexdigest approach.  We need to handle backwards compatibility though...  I.e. operator updates the HM controller, but still has digest() amps22:52
rm_workeugh yeah22:53
johnsomIt also might be worth debugging if you can reproduce, see what types we have and the payload22:53
lxkongjohnsom: do you mean we should both check hmc.digest and hmc.hexdigest to make sure we don't break the old amps?22:54
johnsomYes, we will have to fall back in some way22:55
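One possible shape for the fallback discussed above is to try the hex signature first and then the older binary one. This is a minimal sketch assuming Python 3 and an envelope laid out as payload followed by signature; `sign_hex`, `sign_binary`, and `verify` are hypothetical helpers, not the functions in the patch under review.

```python
import hashlib
import hmac

BIN_LEN = hashlib.sha256().digest_size   # 32 bytes for a binary digest
HEX_LEN = BIN_LEN * 2                    # 64 bytes for a hex digest


def sign_hex(payload, key):
    """New-style envelope: payload followed by an ASCII hex HMAC."""
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    return payload + sig


def sign_binary(payload, key):
    """Old-style envelope: payload followed by a raw binary HMAC."""
    return payload + hmac.new(key, payload, hashlib.sha256).digest()


def verify(envelope, key):
    """Return the payload if either signature format validates, else None."""
    for sig_len, is_hex in ((HEX_LEN, True), (BIN_LEN, False)):
        if len(envelope) < sig_len:
            continue
        payload, received = envelope[:-sig_len], envelope[-sig_len:]
        mac = hmac.new(key, payload, hashlib.sha256)
        expected = mac.hexdigest().encode("ascii") if is_hex else mac.digest()
        # Both operands are bytes, so compare_digest accepts them.
        if hmac.compare_digest(expected, received):
            return payload
    return None
```

A controller using something like `verify` would accept heartbeats from both upgraded and not-yet-upgraded amphorae, at the cost of computing a second HMAC for old-style packets.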
openstackgerritLingxian Kong proposed openstack/octavia master: [WIP] Use HMAC.hexdigest to avoid non-ascii characters for package data
lxkongjohnsom, rm_work: when you are available, please take a look at, I want to make sure that's the way you expected before I start adding unit tests23:50
lxkongit's already verified in our preprod23:51
johnsomOne comment, but otherwise yeah, I think that is the path23:55
openstackgerritLingxian Kong proposed openstack/octavia master: [WIP] Use HMAC.hexdigest to avoid non-ascii characters for package data
lxkongjohnsom: thanks23:59
