19:59:14 <johnsom> #startmeeting Octavia
19:59:15 <openstack> Meeting started Wed Apr 3 19:59:14 2019 UTC and is due to finish in 60 minutes. The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:59:16 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:59:18 <openstack> The meeting name has been set to 'octavia'
19:59:24 <cgoncalves> hello
19:59:28 <johnsom> Hi folks
19:59:46 <johnsom> poke rm_work our future PTL
20:00:14 <johnsom> #topic Announcements
20:00:57 <johnsom> This is the final RC week. I think we need an RC2 for Octavia, so we will try to get those fixes in today and hopefully do the RC2 today as well.
20:01:10 <johnsom> We are also close to doing some stable branch releases.
20:01:37 <johnsom> Other than that, I don't have any other announcements this week. Anyone else?
20:02:35 <johnsom> #topic Proposal to change meeting time (cgoncalves)
20:02:50 <johnsom> cgoncalves Do you want to talk to this?
20:02:59 <cgoncalves> sure, thanks
20:03:48 <cgoncalves> so currently our weekly meetings are at 1pm PST, which means 10pm CEST and 11pm IST (Israel)
20:04:04 <johnsom> 2000 UTC
20:04:20 <cgoncalves> in Asia, it is in the middle of the night
20:04:55 <cgoncalves> I was wondering if we could have our meetings earlier to be more friendly to folks in EMEA and Asia
20:05:25 <cgoncalves> thanks for the correction
20:05:30 <johnsom> Yes, as long as we get quorum with the change.
20:05:36 <cgoncalves> agreed
20:05:44 <johnsom> (I was just adding to the conversation, grin)
20:05:48 <xgerman> sure, what time are you proposing?
20:06:05 <johnsom> So, how this works, community process wise:
20:06:07 <xgerman> also we should make sure rm_work is available
20:06:14 <johnsom> 1. we propose some times/days
20:06:16 <cgoncalves> right
20:06:28 <johnsom> 2. I will create a doodle for those times/days
20:06:42 <johnsom> 3. We e-mail the openstack list with the details and the doodle.
20:06:48 <xgerman> s/I/rm_work/g
20:07:33 <johnsom> 4. We let that soak a week, then if we have quorum for a new time, I will go update all the places that need updating and we have a new time.
20:08:11 <cgoncalves> sounds good to me
20:08:13 <johnsom> Questions/comments on the process?
20:08:27 <xgerman> +1 (other than we should have rm_work own more of the process)
20:09:01 <xgerman> I think it will stretch into when he takes over
20:09:02 <johnsom> I would hope that rm_work would participate in the proposals.
20:09:13 * johnsom wonders how many times we can ping him.... grin
20:09:20 <cgoncalves> perhaps the current meeting time is also not very convenient for rm_work
20:09:40 <xgerman> yeah, who knows which time zone he lives by nowadays
20:09:58 <johnsom> So I will start, 1600 UTC is a nice time for me
20:10:05 <cgoncalves> true. it's 5am in Japan
20:12:15 <xgerman> since the time works for the people here, we are the wrong ones to ask to begin with
20:13:03 <cgoncalves> let's throw more time options into the doodle. say 1500 UTC
20:13:20 <johnsom> +1
20:13:51 <cgoncalves> let's also make sure we include the current meeting time
20:14:01 <johnsom> Ok, fair point
20:14:28 <xgerman> +1
20:14:37 <johnsom> Do we have any particular days that we should propose, or that are no-go for folks?
20:15:07 <xgerman> Let's stick with Wednesday - Friday/Monday are funny in a lot of time zones
20:15:12 <cgoncalves> Fridays are a no-go for Israelis
20:15:16 <johnsom> I am guessing Friday, Saturday, Sunday are bad
20:15:36 <johnsom> Yeah, so Tue-Wed-Thu
20:15:38 <cgoncalves> Tue-Thu
20:15:43 <xgerman> +1
20:16:21 <johnsom> Ok, any other proposed times for the doodle?
20:16:22 <cgoncalves> cool. thank you, all!
20:17:05 <cgoncalves> johnsom, there's an option in doodle that allows anyone to add new rows (= times), no?
20:17:24 <cgoncalves> s/rows/columns/
20:17:25 <johnsom> Ok. I will get the process going.
20:17:31 <xgerman> +1
20:17:39 <johnsom> yes, I think so. you think we should leave it open?
20:17:44 <xgerman> but really rm_work...
20:17:49 <cgoncalves> why not
20:17:54 * johnsom notes it's been a year or two since I did this
20:18:19 <johnsom> Ok, will do
20:18:34 <cgoncalves> thanks
20:18:42 <johnsom> It just means folks need to check back on it in case new times are added
20:19:06 <johnsom> #topic Brief progress reports / bugs needing review
20:20:08 <johnsom> I have worked on removing the last references to oslosphinx, which is broken with sphinx 2, deprecated, and won't be fixed.
20:20:35 <johnsom> For the most part we had already done that, but there were two references we missed. You should not see any major changes in the docs/release notes
20:21:30 <johnsom> I helped figure out a solution to our grenade issue.
20:21:53 <johnsom> Lots of reviews, etc.
20:22:30 <cgoncalves> johnsom, thank you for your help troubleshooting and proposing a fix to the grenade issue. really appreciated!
20:22:31 <johnsom> Currently I'm working on adding the "unset" option to our openstack client. This will make it clearer for users how to clear settings.
20:22:58 <johnsom> I'm going to go through the main options first, then come back and do tags.
20:23:14 <xgerman> I sent back my laptop, took a week of vacation… slowly getting my 2008 Mac onto 2019's software
20:23:22 <johnsom> I need to move a module out of neutron in OSC up to osc-lib so we can share the tags code.
20:23:54 <johnsom> Yeah, I am also running on "alternate" hardware now. Seems to be working ok though.
20:24:32 <johnsom> I also fixed a security related issue in the OSA role.
20:24:39 <johnsom> #link https://review.openstack.org/648744
20:24:52 <johnsom> xgerman You might want to do a quick review on that
20:25:01 <xgerman> on it
20:25:45 <johnsom> So, that is my plan for the next few days: work on unset and then tags for the client.
20:26:11 <johnsom> Also, I will be travelling and not available much Sun-Wed.
20:26:14 <johnsom> Just as a heads up
20:26:25 <cgoncalves> there is a patch in master and stein that broke spare pools. Change https://review.openstack.org/#/c/649381/ will fix it. we need to backport it to stein and release Stein RC2 this week
20:26:58 <johnsom> Yeah, I am going to take a stab at that after lunch.
20:27:05 <cgoncalves> it would be nice if we could merge a tempest test for spare pools to prevent regressions like this in the future
20:27:09 <cgoncalves> #link https://review.openstack.org/#/c/634988/
20:27:14 <johnsom> I think we just need to do another migration and we can fix it that way
20:28:03 <cgoncalves> migration? as in DB migration?
20:28:10 <johnsom> Yep. Good stuff. I had previously +2'd, will circle back
20:28:13 <cgoncalves> in the patch set, I take it
20:28:14 <johnsom> Yeah
20:28:52 <johnsom> Any other updates this week?
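[Editor's note on the "unset" work mentioned above: the Octavia OpenStack client commands are built on osc-lib/cliff, so an unset subcommand would follow the same command-class pattern as the existing set/show commands. The sketch below only illustrates that pattern; the class name, arguments, and the commented-out client call are assumptions, not the code that eventually merged.]

```python
# Illustrative sketch only: a possible "unset" command for the Octavia
# OpenStack client, following the osc-lib/cliff command pattern the client
# already uses. Names and the client call shown in comments are hypothetical.
from osc_lib.command import command


class UnsetLoadBalancer(command.Command):
    """Clear settings on a load balancer (hypothetical example)."""

    def get_parser(self, prog_name):
        parser = super().get_parser(prog_name)
        parser.add_argument('loadbalancer',
                            help='Load balancer to modify (name or ID).')
        parser.add_argument('--name', action='store_true',
                            help='Clear the load balancer name.')
        parser.add_argument('--description', action='store_true',
                            help='Clear the load balancer description.')
        return parser

    def take_action(self, parsed_args):
        # "Unsetting" an attribute maps to sending null for it in the update
        # request body, so users do not have to know that API detail.
        attrs = {}
        if parsed_args.name:
            attrs['name'] = None
        if parsed_args.description:
            attrs['description'] = None
        if attrs:
            # A real implementation would resolve the name/ID and call the
            # load balancer client here, roughly like:
            # self.app.client_manager.load_balancer.load_balancer_set(
            #     lb_id, json={'loadbalancer': attrs})
            pass
```

[Usage would then look something like `openstack loadbalancer unset --description my_lb`, mirroring the existing set commands; the exact flags are whatever the merged patch defines.]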
20:28:54 <cgoncalves> I also addressed reviews on https://review.openstack.org/#/c/645817/
20:29:08 <cgoncalves> I have a customer waiting for it
20:29:52 <johnsom> +2, you addressed my only issue with it.
20:30:01 <xgerman> looking
20:30:09 <cgoncalves> thank you, thank you!
20:30:17 <xgerman> need to see what happened to my patches...
20:31:04 <johnsom> #topic Open Discussion
20:31:09 <johnsom> Ok, other topics this week?
20:32:10 <cgoncalves> some folks were discussing here on the channel earlier today an issue where the health manager would trigger a failover
20:32:28 <cgoncalves> while the network was still being configured, i.e. flows, etc
20:32:31 <xgerman> yeah, it can do that ;-)
20:32:54 <xgerman> how do we know that the network is configured
20:32:54 <cgoncalves> no resolution yet
20:32:59 <xgerman> ?
20:33:06 <cgoncalves> precisely, that's the question
20:33:15 <johnsom> Really? The HM honors the lock on the objects, so it should not be able to start a failover if another controller owns the resource
20:33:37 <johnsom> Oh, you mean the neutron networking....
20:33:39 <johnsom> Right
20:33:41 <xgerman> I was thinking neutron was pulling sh*t
20:33:43 <cgoncalves> yes
20:34:08 <cgoncalves> would it work not to fail over amps unless at least one heartbeat from any amp is received by the HM on start up?
20:34:17 <xgerman> that goes back to: should we go into ERROR and tell the operator neutron is broken, or keep retrying
20:34:46 <xgerman> see [1] http://blog.eichberger.de/posts/yolo_cloud/
20:35:00 <cgoncalves> we don't know if neutron is "broken". all we know is the HM hasn't received a heartbeat within the heartbeat_timeout (60 seconds)
20:35:11 <johnsom> I suspect the issue is around net splits where some hosts and racks are working, but others are not, so any heartbeat would likely have the same issue
20:35:21 <xgerman> cgoncalves: we had it not fail over stuff when there was no heartbeat, which caused other problems
20:35:40 <xgerman> (mainly the amp doesn't come up right and we never know)
20:35:45 <cgoncalves> xgerman, we still do not fail over newly created LBs
20:36:03 <xgerman> I thought we fixed that a while back...
20:36:33 <johnsom> I thought someone was looking at that again and proposed a fix as well. Not positive though.
20:36:46 <cgoncalves> my understanding was that it was a feature/desired behavior, not a bug
20:36:47 <johnsom> It's a trickier problem than it seems on the surface
20:37:20 <xgerman> yeah, it's the "wait or throw hands up in the air" question
20:38:34 <cgoncalves> just wanted to bring this up in case anyone had some thoughts
20:38:54 <cgoncalves> this is affecting some customers on this side
20:39:12 <xgerman> how? OSP 13 is not HA...
20:39:14 <cgoncalves> I heard other people are also facing the same issue
20:39:21 <johnsom> Personally, if the amp isn't talking to us, it seems like failing it over is the right answer. The question is what to do after that, specifically if the neutron outage causes the failover to not be successful. Right now we fail "safe", in that it's marked ERROR, but the secondary amp is still passing traffic.
20:39:41 <johnsom> lol, ouch
20:40:33 <cgoncalves> having an HA tempest job would go a long way toward having it supported in OSP
20:41:34 <johnsom> We could have a periodic that looks for ERROR LBs with an amp in ERROR and attempts an amp failover again. We would just need to figure out the right back off and make super sure we don't bother the other functional amp.
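[Editor's note to make the "periodic retry" idea above more concrete: it was only floated during the meeting, so the sketch below is a minimal illustration, not existing Octavia code. The repository helpers, attribute names, and back-off constants are all hypothetical, and the concern johnsom raises about not disturbing the still-healthy amphora is only approximated here by an exponential back-off.]

```python
# Minimal sketch of a periodic check that re-attempts failover for load
# balancers in ERROR that have an amphora in ERROR. All helpers and
# attributes are hypothetical placeholders.
import time


class FailoverRetryCheck:
    """Periodically retries failover for amphorae stuck in ERROR."""

    def __init__(self, lb_repo, amp_repo, failover_driver, base_backoff=300):
        self.lb_repo = lb_repo                  # hypothetical DB accessor
        self.amp_repo = amp_repo                # hypothetical DB accessor
        self.failover_driver = failover_driver  # hypothetical controller hook
        self.base_backoff = base_backoff        # seconds before first retry

    def run_once(self):
        # Only look at load balancers already marked ERROR, and only touch
        # the amphora that is itself in ERROR, never its healthy peer.
        for lb in self.lb_repo.get_all(provisioning_status='ERROR'):
            error_amps = [a for a in self.amp_repo.get_all_for_lb(lb.id)
                          if a.status == 'ERROR']
            if not error_amps:
                continue
            wait = self._backoff(lb.failover_retries)
            if time.time() - lb.last_failover_attempt < wait:
                continue  # back off so repeated retries do not hammer the LB
            self.failover_driver.failover_amphora(error_amps[0].id)

    def _backoff(self, retries):
        # 5 min, 10 min, 20 min, ... doubling on each failed attempt.
        return self.base_backoff * (2 ** retries)
```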
20:42:06 <johnsom> I think we have some work to do on the failover flows in general, actually.
20:42:07 <xgerman> mmh, I would leave that for the PTG…
20:42:23 <cgoncalves> I don't like that, to be honest. we would be killing an amphora that is actually up by failing over not once but twice
20:42:23 <johnsom> Yes, good topic for the PTG
20:42:25 <xgerman> yep, agree on failover flows
20:42:36 <xgerman> was beating that drum for a year now
20:43:55 <johnsom> Added a topic to the PTG etherpad
20:44:00 <johnsom> #link https://etherpad.openstack.org/p/octavia-train-ptg
20:44:31 <cgoncalves> is there a reason you guys don't like my proposed approach?
20:45:39 <johnsom> The one-heartbeat one, or did I miss something?
20:45:52 <cgoncalves> "would it work not to fail over amps unless at least one heartbeat from any amp is received by the HM on start up?"
20:46:01 <cgoncalves> let me know if I'm not being clear
20:46:15 <johnsom> I responded: "I suspect the issue is around net splits where some hosts and racks are working, but others are not, so any heartbeat would likely have the same issue"
20:46:50 <cgoncalves> ah, sorry. I read that but didn't read it as a reply to my message
20:47:09 <johnsom> We could also try to set some threshold, say if more than x amps are "down", pause failovers
20:47:59 <johnsom> I guess now that we can mutate the config, this could be more feasible. It would allow the operator to have a knob to turn.
20:48:15 <cgoncalves> it would be expected that sometime "soon" the network would be back again, so HMs would start receiving heartbeats
20:49:08 <johnsom> Well, there is always the "rack got zapped by an evil mastermind's laser" scenario where it will not come back
20:50:23 <johnsom> Or the scenario I saw once, where the host was just powered off for a day
20:51:36 <cgoncalves> hmm...
20:51:59 <johnsom> That one was nice, it led to a bunch of zombie instances showing up again
20:52:24 <cgoncalves> health manager kills them now, thanks to xgerman's patch
20:52:36 <johnsom> Stuff to think about. I captured a few on the etherpad, please add more!
20:52:53 <johnsom> Yeah, that is some of the background on that patch
20:53:53 <johnsom> We have about 5 minutes, were there other topics we needed to discuss?
20:55:14 <johnsom> Ok, just wanted to check. Sometimes we run out of time before we discuss everything.
20:55:51 <johnsom> Oh, BTW, the devstack patch still hasn't merged, so the barbican job is still going to fail.
20:56:22 <cgoncalves> which devstack patch?
20:56:27 <johnsom> #link https://review.openstack.org/648951
20:58:04 <cgoncalves> thanks
20:58:28 <johnsom> Ok, sounds like we are wrapping up. Thanks folks! Have a great week.
20:58:30 <johnsom> #endmeeting
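[Editor's note on the "threshold knob" johnsom floats at 20:47:09 and 20:47:59 (pause failovers when more than X amphorae look down, made practical by mutable configuration): the sketch below is only one way to picture it. The option name, group, and helper are hypothetical and not an existing Octavia setting; it relies on oslo.config's support for mutable options, which can be reloaded at runtime without restarting the health manager.]

```python
# Rough sketch of a hypothetical "pause failovers above a threshold" knob.
# Octavia does not ship this setting; names are illustrative only.
from oslo_config import cfg

CONF = cfg.CONF
CONF.register_opts([
    cfg.IntOpt('failover_pause_threshold',
               default=0,          # 0 means "never pause" in this sketch
               mutable=True,       # operator can change it via config reload
               help='Pause automatic failovers when more than this many '
                    'amphorae appear unhealthy at the same time.'),
], group='health_manager')


def failovers_paused(unhealthy_amp_count):
    """Return True if automatic failovers should be skipped for now."""
    threshold = CONF.health_manager.failover_pause_threshold
    return threshold > 0 and unhealthy_amp_count > threshold
```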