19:59:14 <johnsom> #startmeeting Octavia
19:59:15 <openstack> Meeting started Wed Apr  3 19:59:14 2019 UTC and is due to finish in 60 minutes.  The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:59:16 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:59:18 <openstack> The meeting name has been set to 'octavia'
19:59:24 <cgoncalves> hello
19:59:28 <johnsom> Hi folks
19:59:46 <johnsom> poke rm_work our future PTL
20:00:14 <johnsom> #topic Announcements
20:00:57 <johnsom> This is the final RC week. I think we need an RC2 for octavia, so we will try to get those fixes in today and hopefully do the RC2 today as well.
20:01:10 <johnsom> We are also close to doing some stable branch releases.
20:01:37 <johnsom> Other than that, I don't have any other announcements this week.  Anyone else?
20:02:35 <johnsom> #topic Proposal to change meeting time (cgoncalves)
20:02:50 <johnsom> cgoncalves Do you want to talk to this?
20:02:59 <cgoncalves> sure, thanks
20:03:48 <cgoncalves> so currently our weekly meetings are at 1pm PST, which means 10 pm CEST and 11 pm IST (Israel)
20:04:04 <johnsom> 2000 UTC
20:04:20 <cgoncalves> in Asia, it is in the middle of the night
20:04:55 <cgoncalves> I was wondering if we could have our meetings earlier to be more friendly to folks in EMEA and Asia
20:05:25 <cgoncalves> thanks for the correction
20:05:30 <johnsom> Yes as long as we get quorum with the change.
20:05:36 <cgoncalves> agreed
20:05:44 <johnsom> (I was just adding to the conversation, grin)
20:05:48 <xgerman> sure, what time are you proposing?
20:06:05 <johnsom> So how this works, community process wise:
20:06:07 <xgerman> also we should make sure rm_work is available
20:06:14 <johnsom> 1. we propose some times/days
20:06:16 <cgoncalves> right
20:06:28 <johnsom> 2. I will create a doodle for those times/days
20:06:42 <johnsom> 3. We e-mail the openstack list with the details and the doodle.
20:06:48 <xgerman> s/I/rm_work/g
20:07:33 <johnsom> 4. We let that soak a week, then if we have quorum for a new time, I will go update all the places that need updating and we have a new time.
20:08:11 <cgoncalves> sounds good to me
20:08:13 <johnsom> Questions/comments on the process?
20:08:27 <xgerman> +1 (other than we should have rm_work own more of the process)
20:09:01 <xgerman> I think it will stretch into when he takes over
20:09:02 <johnsom> I would hope that rm_work would participate in the proposals.
20:09:13 * johnsom wonders how many times we can ping him.... grin
20:09:20 <cgoncalves> perhaps the current meeting time is also not very convenient for rm_work
20:09:40 <xgerman> yeah, who knows which time zone he lives by nowadays
20:09:58 <johnsom> So I will start, 1600 UTC is a nice time for me
20:10:05 <cgoncalves> true. it's 5 am in Japan
20:12:15 <xgerman> since the current time works for the people here, we are the wrong ones to ask to begin with
20:13:03 <cgoncalves> let's throw more time options into the doodle. say 1500 UTC
20:13:20 <johnsom> +1
20:13:51 <cgoncalves> let's also make sure we include the current meeting time
20:14:01 <johnsom> Ok, fair point
20:14:28 <xgerman> +1
20:14:37 <johnsom> Do we have any particular days that we should propose, or any that are no-go for folks?
20:15:07 <xgerman> Let’s stick with Wednesday - Friday/Monday are funny in a lot of time zones
20:15:12 <cgoncalves> Fridays are no go for Israelis
20:15:16 <johnsom> I am guessing Friday, Saturday, Sunday are bad
20:15:36 <johnsom> Yeah, so Tue-Wed-Thur
20:15:38 <cgoncalves> Tue-Thu
20:15:43 <xgerman> +1
20:16:21 <johnsom> Ok, any other proposed times for the doodle?
20:16:22 <cgoncalves> cool. thank you, all!
20:17:05 <cgoncalves> johnsom, there's an option in doodle that allows anyone to add new rows (= times), no?
20:17:24 <cgoncalves> s/rows/columns/
20:17:25 <johnsom> Ok. I will get the process going.
20:17:31 <xgerman> +1
20:17:39 <johnsom> yes I think so, you think we should leave it open?
20:17:44 <xgerman> but really rm_work...
20:17:49 <cgoncalves> why not
20:17:54 * johnsom notes it's been a year or two since I did this
20:18:19 <johnsom> Ok, will do
20:18:34 <cgoncalves> thanks
20:18:42 <johnsom> It just means folks need to check back to it in case new times are added
20:19:06 <johnsom> #topic Brief progress reports / bugs needing review
20:20:08 <johnsom> I have worked on removing the last references to oslosphinx, which is broken with Sphinx 2, deprecated, and won't be fixed.
20:20:35 <johnsom> For the most part we have already done that, but there were two references we missed. You should not see any major changes in the docs/release notes
20:21:30 <johnsom> I helped figure out a solution to our grenade issue.
20:21:53 <johnsom> Lots of reviews, etc.
20:22:30 <cgoncalves> johnsom, thank you for your help troubleshooting and proposing a fix to the grenade issue. really appreciate it!
20:22:31 <johnsom> Currently I'm working on adding the "unset" option to our openstack client. This will make it clearer to users how to clear settings.
20:22:58 <johnsom> I'm going to go through the main options first, then come back and do tags.
20:23:14 <xgerman> I sent back my laptop, took a week of vacation… slowly getting my 2008 Mac onto 2019's software
20:23:22 <johnsom> I need to move a module out of neutron in OSC up to osc-lib so we can share the tags code.
20:23:54 <johnsom> Yeah, I am also running on "alternate" hardware now.  Seems to be working ok though.
20:24:32 <johnsom> I also fixed a security related issue in the OSA role.
20:24:39 <johnsom> #link https://review.openstack.org/648744
20:24:52 <johnsom> xgerman You might want to do a quick review on that
20:25:01 <xgerman> on it
20:25:45 <johnsom> So, that is my plan for the next few days, work on unset and then tags for the client.
20:26:11 <johnsom> Also, I will be travelling and not available much Sun-Wed.
20:26:14 <johnsom> Just as a heads up
20:26:25 <cgoncalves> there is a patch in master and stein that broke spare pools. Change https://review.openstack.org/#/c/649381/ will fix it. we need to backport it to stein and release Stein RC2 this week
20:26:58 <johnsom> Yeah, I am going to take a stab at that after lunch.
20:27:05 <cgoncalves> it would be nice if we could merge a tempest test for spare pool to prevent regressions like this in the future
20:27:09 <cgoncalves> #link https://review.openstack.org/#/c/634988/
20:27:14 <johnsom> I think we just need to do another migration and we can fix it that way
20:28:03 <cgoncalves> migration? as in DB migration?
20:28:10 <johnsom> Yep. Good stuff. I had previously +2'd, will circle back
20:28:13 <cgoncalves> a new patch set then, I take it
20:28:14 <johnsom> Yeah
20:28:52 <johnsom> Any other updates this week?
20:28:54 <cgoncalves> I also addressed reviews on https://review.openstack.org/#/c/645817/
20:29:08 <cgoncalves> I have a customer waiting for it
20:29:52 <johnsom> +2, you addressed my only issue with it.
20:30:01 <xgerman> looking
20:30:09 <cgoncalves> thank you, thank you!
20:30:17 <xgerman> need to see what happened to my patches...
20:31:04 <johnsom> #topic Open Discussion
20:31:09 <johnsom> Ok, other topics this week?
20:32:10 <cgoncalves> some folks were discussing on the channel earlier today an issue where the health manager would trigger a failover
20:32:28 <cgoncalves> while the network was still being configured, i.e. flows, etc
20:32:31 <xgerman> yeah, it can do that ;-)
20:32:54 <xgerman> how do we know that the network is configured
20:32:54 <cgoncalves> no resolution yet
20:32:59 <xgerman> ?
20:33:06 <cgoncalves> precisely, that's the question
20:33:15 <johnsom> Really? The HM honors the lock on the objects, so it should not be able to start a failover if another controller owns the resource
20:33:37 <johnsom> Oh, you mean the neutron networking....
20:33:39 <johnsom> Right
20:33:41 <xgerman> I was thinking neutron was pulling sh*t
20:33:43 <cgoncalves> yes
20:34:08 <cgoncalves> would it work not to fail over amps unless at least one heartbeat from any amp is received by the HM on start up?
20:34:17 <xgerman> that goes back to: should we go into ERROR and tell the operator neutron is broken, or keep retrying
20:34:46 <xgerman> see [1] http://blog.eichberger.de/posts/yolo_cloud/
20:35:00 <cgoncalves> we don't know if neutron is "broken". all we know is the HM hasn't received heartbeat within the heartbeat_timeout (60 seconds)
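For context, heartbeat_timeout is the [health_manager] option in octavia.conf that cgoncalves refers to (60 seconds by default). A minimal sketch of a stale-amphora check based on it; the function and record layout here are illustrative assumptions, not Octavia's actual implementation:

    from datetime import datetime, timedelta

    HEARTBEAT_TIMEOUT = 60  # seconds; mirrors [health_manager] heartbeat_timeout

    def find_stale_amphorae(amphora_health_records, now=None):
        """Return amphora IDs whose last heartbeat is older than the timeout.

        amphora_health_records: iterable of (amphora_id, last_update) tuples,
        where last_update is a datetime of the most recent heartbeat received.
        """
        now = now or datetime.utcnow()
        deadline = now - timedelta(seconds=HEARTBEAT_TIMEOUT)
        # Any amphora that has not reported since the deadline becomes a
        # failover candidate, whether it died or the network dropped its heartbeats.
        return [amp_id for amp_id, last_update in amphora_health_records
                if last_update < deadline]

As the discussion below notes, a check like this cannot distinguish a dead amphora from a healthy one cut off by a neutron or network outage.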
20:35:11 <johnsom> I suspect the issue is around net splits where some hosts and racks are working, but others are not, so any heartbeat would likely have the same issue
20:35:21 <xgerman> cgoncalves: we had it not fail over stuff when there was no heartbeat, which caused other problems
20:35:40 <xgerman> (mainly amp doesn’t come up right and we never know)
20:35:45 <cgoncalves> xgerman, we still do not fail over newly created LBs
20:36:03 <xgerman> I thought we fixed that a while back...
20:36:33 <johnsom> I thought someone was looking at that again and proposed a fix as well. Not positive though.
20:36:46 <cgoncalves> my understanding was that it was a feature/desired behavior, not a bug
20:36:47 <johnsom> It's a trickier problem than it seems on the surface
20:37:20 <xgerman> yeah, it's either wait or throw hands up in the air
20:38:34 <cgoncalves> just wanted to bring this up in case anyone had some thoughts
20:38:54 <cgoncalves> this is affecting some customers on this side
20:39:12 <xgerman> how? OSP 13 is not HA...
20:39:14 <cgoncalves> I heard other people also facing same issue
20:39:21 <johnsom> Personally, if the amp isn't talking to us, it seems like it is the right answer to fail it over. The question is what to do after that, specifically if the neutron outage causes the failover to not be successful.  Right now we fail "safe", in that it's marked ERROR, but the secondary amp is still passing traffic.
20:39:41 <johnsom> lol, ouch
20:40:33 <cgoncalves> having an HA tempest job would go a long way toward it being supported in OSP
20:41:34 <johnsom> We could have a periodic that looks for ERROR LBs with an amp in ERROR and attempts an amp failover again. We would just need to figure out the right back off and make super sure we don't bother the other functional amp.
20:42:06 <johnsom> I think we have some work to do on the failover flows in general actually.
20:42:07 <xgerman> mmh, I would leave that for the PTG…
20:42:23 <cgoncalves> I don't like that, to be honest. we would be killing an amphora that is actually up by failing over not once but twice
20:42:23 <johnsom> Yes, good topic for the PTG
20:42:25 <xgerman> yep, agree on failover flows
20:42:36 <xgerman> been beating that drum for a year now
20:43:55 <johnsom> Added a topic to the PTG etherpad
20:44:00 <johnsom> #link https://etherpad.openstack.org/p/octavia-train-ptg
20:44:31 <cgoncalves> is there a reason you guys don't like my proposed approach?
20:45:39 <johnsom> The one heartbeat one or did I miss something?
20:45:52 <cgoncalves> "would it work not to fail over amps unless at least one heartbeat from any amp is received by the HM on start up?"
20:46:01 <cgoncalves> let me know if I'm not being clear
20:46:15 <johnsom> I responded: "I suspect the issue is around net splits where some hosts and racks are working, but others are not, so any heartbeat would likely have the same issue"
20:46:50 <cgoncalves> ah, sorry. I read that but didn't read it as a reply to my message
20:47:09 <johnsom> We could also try to set some threshold, say if more than x amps are "down", pause failovers
20:47:59 <johnsom> I guess now that we can mutate the config, this could be more feasible. It would allow the operator to have a knob to turn.
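A rough sketch of the threshold idea floated here: pause failovers when too many amphorae look dead at once, with the threshold exposed as an operator knob. The option name and helpers below are hypothetical, not existing Octavia settings or code:

    # Hypothetical knob, not an existing Octavia option. With mutable
    # configuration the operator could adjust it at runtime.
    FAILOVER_PAUSE_THRESHOLD = 5

    def should_pause_failovers(stale_amphora_ids):
        """If many amphorae appear down at once, suspect a network partition
        or neutron outage rather than individual amphora failures, and hold off.
        """
        return len(stale_amphora_ids) > FAILOVER_PAUSE_THRESHOLD

    # Usage sketch: only trigger failovers when the count looks like isolated failures.
    # stale = find_stale_amphorae(records)
    # if not should_pause_failovers(stale):
    #     for amp_id in stale:
    #         trigger_failover(amp_id)  # hypothetical helper

The trade-off discussed next is that some outages (the zapped rack, the powered-off host) never recover on their own, so a pause like this still needs a way to resume failovers.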
20:48:15 <cgoncalves> it would be expected that sometime "soon" the network would be back again so HMs would start receiving heartbeats
20:49:08 <johnsom> Well, there is always the "rack got zapped by an evil mastermind's laser" scenario where it will not come back
20:50:23 <johnsom> Or the scenario I saw once, where the host was just powered off for a day
20:51:36 <cgoncalves> hmm...
20:51:59 <johnsom> That one was nice, it led to a bunch of zombie instances showing up again
20:52:24 <cgoncalves> health manager kills them now, thanks to xgerman's patch
20:52:36 <johnsom> Stuff to think about. I captured a few on the etherpad, please add more!
20:52:53 <johnsom> Yeah, that is some of the background on that patch
20:53:53 <johnsom> We have about 5 minutes, were there other topics we needed to discuss?
20:55:14 <johnsom> Ok, just wanted to check. Sometimes we run out of time before we discuss everything.
20:55:51 <johnsom> Oh, BTW, the devstack patch still hasn't merged, so the barbican job is still going to fail.
20:56:22 <cgoncalves> which devstack patch?
20:56:27 <johnsom> #link https://review.openstack.org/648951
20:58:04 <cgoncalves> thanks
20:58:28 <johnsom> Ok, sounds like we are wrapping up.  Thanks folks!  Have a great week.
20:58:30 <johnsom> #endmeeting