20:00:01 <johnsom> #startmeeting Octavia
20:00:02 <openstack> Meeting started Wed Jun 22 20:00:01 2016 UTC and is due to finish in 60 minutes.  The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:04 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:00:04 <Frito> O.o
20:00:07 <openstack> The meeting name has been set to 'octavia'
20:00:08 <TrevorV> o/
20:00:10 <diltram> o/
20:00:12 <johnsom> #topic Announcements
20:00:19 <longstaff> hi
20:00:20 <sbalukoff> Howdy!
20:00:22 <johnsom> I'm back........   Grin
20:00:24 <diltram> hey
20:00:24 <fnaval> o/
20:00:27 <sbalukoff> Yay!
20:00:34 <eezhova> hi
20:00:34 <rm_work> o/
20:01:01 <TrevorV> He's not back... HE'S OURS
20:01:03 <TrevorV> MUAH HA HA HA HA HA
20:01:05 <johnsom> Also, mid-cycle planning
20:01:09 <fnaval> he's to my right!
20:01:18 <johnsom> #link https://etherpad.openstack.org/p/lbaas-octavia-newton-midcycle
20:01:19 <TrevorV> He's to my 4:30
20:01:22 <blogan> hi
20:01:27 * mhayden stumbles in
20:01:38 <johnsom> Please update with your attendance, etc.
20:01:48 * johnsom Waves to people
20:01:56 <Frito> he's to my that-a-way
20:02:00 <TrevorV> also make sure to note if you can't physically attend but want to be on a conference call
20:02:03 <TrevorV> we can set that up here at Rax
20:02:04 <johnsom> Any other announcements?
20:02:27 <dougwig> have we finalized which days?
20:02:53 <TrevorV> dougwig it's marked in the etherpad and everything, man!
20:03:03 <johnsom> I plan to be there Monday through sometime Friday.  I see we added a section about maybe fewer days.
20:03:08 <TrevorV> (said in "The Dude"'s voice)
20:03:40 <johnsom> That probably means: dougwig can't stand to look at us for a full week.
20:04:09 <blogan> i think it's the other way around
20:04:13 <johnsom> Do we need a vote?
20:04:15 <johnsom> grin
20:04:26 <sbalukoff> Heh!
20:04:44 <TrevorV> I'm not against having the space available for the week, but us calling it a 4-day
20:04:52 <dougwig> it just says 'week of', with some debate.
20:05:00 <johnsom> Other folks traveling, sbalukoff, any comments on how many days?
20:05:12 <Frito> always better to reserve and not need vs not reserve and need. That's just my $0.02
20:05:17 <dougwig> i'll be there mon-thu night, regardless of what y'all decide, because i've learned i'm too much of a grouch on the 5th day.  not fit company for man nor beast.
20:05:37 <sbalukoff> +1 on dougwig being intolerable by the 5th day.
20:05:40 <sbalukoff> ;)
20:05:43 <rm_work> I think Friday usually ends up kinda minimally useful
20:05:49 * dougwig blushes.
20:05:55 <johnsom> Oye, that isn't exactly what I meant, but ok....
20:05:56 <sbalukoff> And actually, I think productivity on that last day is always pretty low anyway
20:06:25 <TrevorV> Yeah, we end up spending the morning touching up some reviews, and then everyone bails after lunch anyway
20:06:27 <rm_work> Maybe Mon-Thurs with Monday as a kind of "get everyone set up and started" day, and Tues-Thurs the real work days
20:06:29 <TrevorV> So I think that's fine, we'll call it 4 days
20:06:29 <sbalukoff> But I'm all for working Monday -> Thurs (and possibly part of Friday, eh.)
20:06:44 <rm_work> yeah and we'll have the room Friday for stragglers :P
20:06:50 <sbalukoff> Sounds good.
20:06:51 <TrevorV> Yeah, sounds good to me.
20:06:55 <xgerman> o/
20:07:23 <johnsom> Yep, works for me too.  I will likely have to leave mid-day Friday myself.
20:07:38 <sbalukoff> Yep, gotta catch a flight back home.
20:08:12 <johnsom> #topic Brief progress reports / bugs needing review
20:08:28 <TrevorV> #link https://review.openstack.org/#/c/306084
20:08:31 <TrevorV> I need eyes on this
20:08:32 <TrevorV> it's ready
20:08:34 <johnsom> How are things going?  I have started doing reviews again.   Lots of good stuff going on!
20:08:54 <johnsom> "it's ready" == ready to -2?
20:08:58 <johnsom> JK
20:09:03 <sbalukoff> I'm still being diverted by internal stuff and recovery from surgery, so I wasn't able to get to many reviews this week; probably won't get to many next week either. :/
20:09:13 <dougwig> TrevorV: looks like it just passed my CI, which is what i was waiting for to review it.
20:09:59 <johnsom> TrevorV It looks like sbalukoff has a comment to attend to on that
20:10:16 <johnsom> Anything else?
20:10:20 <eezhova> #link https://review.openstack.org/#/c/310490/
20:10:22 <fnaval> mostly working on internal stuff;  I still have a few open reviews that need to be looked at
20:10:34 <TrevorV> johnsom I just noticed.
20:10:39 <fnaval> #link https://review.openstack.org/#/c/306083/
20:10:42 <TrevorV> I'll tinker
20:10:43 <sbalukoff> That's because I'm a jerk.
20:10:45 <rm_work> sbalukoff: default rise is 2, fall is 3
20:10:47 <rm_work> for HAProxy
20:10:48 <fnaval> #link https://review.openstack.org/#/c/308091/
20:10:55 <johnsom> Thanks eezhova
20:10:58 <rm_work> just a note
20:11:00 <sbalukoff> rm_work: Oh? Ok-- we should probably do that, then.
20:11:26 <Frito> I've been mainly working on internal again, lol
20:11:33 <rm_work> I'll post that note on the CR
20:11:46 <sbalukoff> rm_work: Sounds good.
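(For reference: rm_work's numbers match HAProxy's documented defaults of rise 2 / fall 3. A minimal sketch of how such thresholds could be rendered into a backend server line follows; the helper function is hypothetical, not Octavia's actual Jinja templating.)

```python
# Hypothetical helper (not Octavia's template code): render an HAProxy
# backend "server" directive with explicit health-check thresholds.
# HAProxy's own defaults are rise 2 / fall 3 when the keywords are omitted.

def render_server_line(name, address, port, interval_s=5, rise=2, fall=3):
    """Return an haproxy.cfg server directive with health-check thresholds."""
    return (
        "server {name} {addr}:{port} check inter {inter}s rise {rise} fall {fall}"
        .format(name=name, addr=address, port=port,
                inter=interval_s, rise=rise, fall=fall)
    )

print(render_server_line("member_1", "10.0.0.10", 80))
# server member_1 10.0.0.10:80 check inter 5s rise 2 fall 3
```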
20:11:58 <blogan> eezhova: done!
20:12:01 <johnsom> Ok, I will be spinning up the review engine over the next week, so hopefully get our velocity back up.
20:12:13 <sbalukoff> Yay!
20:12:27 <eezhova> blogan, thanks!
20:12:31 <johnsom> #topic Should amphorae be rebootable?
20:12:38 <johnsom> Not sure who added this, but good topic
20:12:42 <johnsom> #link https://bugs.launchpad.net/octavia/+bug/1517290
20:12:42 <openstack> Launchpad bug 1517290 in octavia "Not able to ssh to amphora or curl the vip after rebooting" [High,Opinion]
20:12:57 <blogan> why does it need rebooting in the first place?
20:13:10 <rm_work> I added it
20:13:15 <rm_work> I think the answer is no
20:13:19 <johnsom> I agree with the cattle mentality here, but I have also attempted to maintain reboot functionality.
20:13:22 <sbalukoff> My thought is: "No, amphorae should not be rebootable."
20:13:23 <rm_work> but, we just need to make a decision so i can kill the bug or not
20:13:31 <johnsom> It actually was a huge pain for network namespaces.
20:13:48 <blogan> what are the cases a reboot is absolutely needed though?
20:13:51 <sbalukoff> johnsom: Can you think of a reason why we'd want to allow rebooting of amphorae?
20:14:08 <diltram> plus, without reboot support we're able to use a system like Tiny Core Linux
20:14:11 <johnsom> We should make a policy call on this.  At least to remove my guilt.  grin
20:14:19 <sbalukoff> diltram: +1
20:14:48 <johnsom> diltram Not sure how that makes a difference with tiny core linux....
20:14:49 <rm_work> yeah, i'm about to be dealing with minifying the amp image, so i don't want to have to deal with this :P
20:14:54 <dougwig> you always have your amps on UPS's, huh?
20:14:55 <sbalukoff> johnsom: Shall we vote on it? ;)  Seriously, though, I've not heard any compelling arguments for needing to be able to reboot in any case.
20:15:25 <blogan> it probably depends on what the amp is
20:15:28 <johnsom> I was leaving it functional mostly thinking of the situation where an amp could reboot faster than our health check timeout, thus reducing nova churn
20:15:29 <dougwig> what happens if the hypervisor bounces?  downtime while we decide it's dead and spin up another?  is that faster than just letting it boot?
20:15:30 <rm_work> I mean, if an amp goes down to reboot, it should probably just failover anyway
20:15:30 <diltram> johnsom: TCL is a RAM-based system; you actually have to run a command to sync data to disk and set up a configuration to persist anything to storage
20:15:34 <blogan> if someone writes a hardware amp driver, then they'd have to support it
20:15:42 <rm_work> hmm
20:15:43 <blogan> but since we only support VMs right now...
20:15:53 <rm_work> that's an interesting point i guess dougwig <_<
20:16:03 <rm_work> because that happens
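(The trade-off johnsom raises above is purely about timing: a reboot that finishes inside the health manager's heartbeat timeout never costs a nova rebuild. A simplified sketch of that decision follows, with an assumed timeout value and hypothetical logic rather than Octavia's actual health manager code.)

```python
# Simplified sketch of the reboot-vs-failover trade-off (hypothetical logic
# and an assumed timeout value -- not Octavia's actual health manager): an
# amphora is only failed over once its last heartbeat is older than the
# timeout, so a reboot that completes inside that window costs no nova churn.
import time

HEARTBEAT_TIMEOUT = 60  # seconds; illustrative value only

def needs_failover(last_heartbeat_ts, now=None):
    """Return True when the amphora has been silent longer than the timeout."""
    now = time.time() if now is None else now
    return (now - last_heartbeat_ts) > HEARTBEAT_TIMEOUT

# A ~5 second reboot (dougwig's Ubuntu estimate) stays inside the window:
print(needs_failover(time.time() - 5))    # False -> amp kept, no nova rebuild
print(needs_failover(time.time() - 300))  # True  -> fail over to a new amp
```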
20:16:29 <TrevorV> I think I have a better question, what's the *harm* in letting it reboot?
20:16:34 <rm_work> rarely, but imagine having to apply a Xen patch for an XSA and then ...
20:16:41 <sbalukoff> Well, also consider that SSL keys are supposed to be kept in RAM only on amphorae. They'd be gone after a reboot.
20:16:52 <TrevorV> ^^ that
20:16:54 <rm_work> hmm yeah that was in our design but never completed
20:16:55 <TrevorV> That's a downside
20:17:05 <xgerman> if we reboot, are we taking it out of health monitoring?
20:17:10 <rm_work> and I'm not sure if we actually decided to worry about that
20:17:14 <sbalukoff> We should complete that. It's better security. :/
20:17:19 <rm_work> depends on the monitoring frequency
20:17:19 <xgerman> Also if we fail over and it comes back, can we account for that?
20:17:34 <rm_work> well if we failover it, we'll nova-delete it
20:17:37 <johnsom> I also have the question if anyone has tried this recently?  I think I have rebooted and sshed in without issue during namespace testing.
20:17:46 <diltram> there are too many corner cases to cover if we allow reboots
20:17:57 <xgerman> +1
20:18:01 <sbalukoff> Yep.
20:18:08 <sbalukoff> I'm strongly against allowing amphorae reboots.
20:18:13 <diltram> +1
20:18:15 <xgerman> we can always not support it and tell people YMMV
20:18:19 <dougwig> sbalukoff: i'm not sure even just RAM is enough on a shared hypervisor.  we're in smoke and mirrors land without a soft HSM involved.
20:18:23 <sbalukoff> Unless someone can come up with a really compelling reason to go to the trouble of allowing them.
20:18:29 <blogan> depends on the amp driver doesn't it?
20:19:10 <johnsom> dougwig yeah, the plan was an encrypted RAM volume, but yeah, without an HSM it "would" be possible.
20:19:21 <rm_work> i mean, I feel like dougwig's point is valid, it'd be scary if we had to bounce every hypervisor momentarily for a security patch and that meant forced cycling 100% of amps
20:19:26 <sbalukoff> dougwig: Well, RAM is still probably better than non-volatile storage--  at least it's apparent then that it's not supposed to be written to non-volatile storage (even if the back-end does this without our say-so.)
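(The "keys only in RAM" design sbalukoff references was never completed. As one illustration of the idea, a rough sketch follows that writes TLS material to a tmpfs-backed path so it cannot survive a reboot; the path and helper are assumptions for illustration, not the encrypted RAM volume from the original design.)

```python
# Illustrative sketch only: store PEM material on RAM-backed storage so a
# reboot wipes it rather than leaving it on disk.
import os

CERT_DIR = "/run/amphora-certs"  # assumed path; /run is tmpfs on most distros

def write_tls_secret(filename, pem_bytes):
    """Write PEM data to the RAM-backed directory with restrictive perms."""
    os.makedirs(CERT_DIR, mode=0o700, exist_ok=True)
    path = os.path.join(CERT_DIR, filename)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(pem_bytes)
    return path
```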
20:19:57 <xgerman> rm_work those bounces are done by first migrating the VMs, then bouncing, etc.
20:19:59 <johnsom> I think it depends on the amp image actually.  Octavia doesn't care today.  It is either up or not, based on the health monitoring.
20:20:12 <rm_work> hmm yeah i guess i don't actually have experience with doing it
20:20:17 <sbalukoff> rm_work: Yep. This is a known (and solved) problem in the cloud.
20:20:20 <xgerman> I think people would stop paying you if you started rebooting VMs randomly
20:20:37 <rm_work> well i just remember my servers were "rebooted" when that happened
20:20:43 <rm_work> but, i guess it was from the original migration
20:20:43 <dougwig> one example of a simple seeming decision like this that SUCKS for operators is cinder... have you ever tried to reboot a cinder-volume server?  no?  enjoy finding all the related instances and pausing them first.  it's a nightmare.
20:20:52 <johnsom> So, before we go too far here, I think we should test this bug to validate.
20:20:55 <dougwig> but the ssl key argument does have some validity.
20:21:04 <xgerman> +1
20:21:07 <sbalukoff> dougwig: Because block storage is ugly in a cloud.
20:21:19 <xgerman> also in the world of containers you won’t need to reboot
20:21:31 <xgerman> sbalukoff +1
20:21:38 <dougwig> sbalukoff: what part of openstack is not ugly, and don't we exist to make it less ugly?
20:21:49 <xgerman> also if you don’t like cinder I can put you in touch with people who sell storage
20:22:15 <rm_work> ok so then we can't actually close this bug yet? T_T
20:22:35 <sbalukoff> dougwig: True enough. But some parts are just always going to be ugly. Doing block storage (i.e. using a very, very old interface to storage because you can't be bothered to update your damned legacy application) is always going to be ugly.
20:22:46 <rm_work> are we actually agreeing to test this before officially taking a vote, to see if we can avoid taking a vote? :P
20:23:13 <johnsom> I would like someone to try it and either mark it invalid or confirmed
20:23:25 <johnsom> Yes, basically.
20:23:54 <sbalukoff> Again, I think we should just make a policy decision that amphorae are not rebooted, and go with that until someone comes up with a really compelling case to do it. (Compelling enough to revisit code that depends on not allowing reboot functionality.)
20:24:01 <johnsom> If we decide to not support reboots, we should open a bug to make the amp NOT come back up to enforce that decision.
20:25:01 <johnsom> sbalukoff So avoiding nova churn isn't reason enough to maintain reboot capability?  This just seems like one of those decisions we'll come back to and wish we hadn't made.
20:25:39 <sbalukoff> johnsom: Realistically, which cases are you worried about churning nova in?
20:26:20 <johnsom> If an amp can come back up before we health-fail it over, it saves a nova boot, etc.
20:26:35 <dougwig> since the meat of the argument here is around SSL termination security issues, what is the ratio of SSL-terminated to non-terminated VIPs for you big operators?
20:26:38 <xgerman> but that is highly YMMV
20:26:52 <sbalukoff> johnsom: But I mean: That's still a service disruption even if it's smaller than your health check thresholds.
20:27:01 <johnsom> Agreed, just exploring the issue.
20:27:18 <johnsom> sbalukoff Not in act/standby or act/act
20:27:39 <sbalukoff> johnsom: Right, but why would you need to do that reboot in the first place?
20:28:06 <johnsom> sbalukoff I don't have a "need".  reboots happen
20:28:37 <sbalukoff> johnsom: Yeah, but then I suspect you'd want to detect that it happened. Seeing the amphora get recycled is a good indicator that something hiccupped there.
20:28:53 <dougwig> that's a stretch.  :)
20:29:06 <sbalukoff> dougwig: What is?
20:29:25 <dougwig> that a cycled amp is a feature that provides notification.
20:29:47 <sbalukoff> dougwig: Right. But then, random reboots that are "normal" is also a stretch. :P
20:29:49 <johnsom> sbalukoff I don't disagree that you would like to know it happened, but that is a separate question from how the situation resolves itself.
20:29:53 <perelman> hi
20:30:05 <dougwig> sbalukoff: touche.
20:30:29 <johnsom> I guess I still have a certain public cloud nightmare about nova issues
20:30:31 <TrevorV> sbalukoff so you're saying it's better to have random failovers than random reboots?
20:30:42 <johnsom> queues full and instance issues.
20:30:54 <johnsom> Hi perelman
20:30:54 <sbalukoff> TrevorV: Probably, yes.
20:31:12 <TrevorV> I can't say I agree there, sbalukoff, especially when you increase the frequency of either event.
20:31:23 <TrevorV> I think I'd rather have 100 reboots of an amphora than 100 failovers
20:31:24 <johnsom> Hmmm.  Ok, so we can vote or we can test and potentially kick the can....
20:31:29 <sbalukoff> johnsom: If your reboot / failover frequency is that high, you've got some serious other problems.
20:32:02 <TrevorV> Right, but failing over is much, much more taxing to other systems and potentially causes more downtime than just rebooting
20:32:05 <TrevorV> and is just as detectable
20:32:05 <johnsom> sbalukoff I don't disagree.  However that issue was resolved in a different way....
20:32:16 <sbalukoff> TrevorV: If you've got a piece of faulty hardware that keeps rebooting amphora VMs, it's better that the instance get failed over to some other hardware host the very first time.
20:32:17 <rm_work> if i saw that some VM rebooted 100 times, i'd think "ok that VM is busted", IF i even noticed (how would I notice?)
20:32:18 <johnsom> TrevorV +1
20:32:55 <TrevorV> sbalukoff My argument there is if the faulty hardware is the cause, a failover event would be issued before the amphora came back online from a reboot anyway
20:33:20 <johnsom> Ok.  Let's hold this discussion here.  I will try to test by next week.  Then we can do a vote if it is an issue now.  Work?
20:33:34 <TrevorV> Sounds good.
20:33:35 <sbalukoff> TrevorV: Uh... that argument goes against the idea that reboots would take less time than the health check interval. You've got no reason to assume that.
20:33:40 <rm_work> if I saw that an amp cycled 100 times, i'd think "oh shit"
20:33:41 <rm_work> and it'd be easier to see, I think
20:34:08 <TrevorV> No, sbalukoff, what I said was that rebooting an amphora is less taxing than failing over; I wasn't talking about the check interval
20:34:42 <johnsom> I want to leave some time for other discussions (and catch a shuttle in 30 if I can)
20:34:48 <xgerman> now that everybody has their paintbrush, are there other sheds in need of coloring?
20:34:53 <johnsom> #topic Open Discussion
20:35:07 <sbalukoff> Sorry, I meant to say that when you say "if the faulty hardware is the cause, a failover event would be issued before the amphora came back online from a reboot anyway"-- that's a faulty argument. You have no reason to believe faulty hardware wouldn't reboot quickly.
20:35:22 <dougwig> an ubuntu reboot is like 5 seconds.  i know a nova spawn and orchestration is longer.
20:35:33 <johnsom> dougwig +1
20:35:45 <xgerman> +1
20:36:01 <xgerman> also I will skip next week’s meeting…
20:36:02 <TrevorV> sbalukoff I'll concede that. I was just thinking that if, say, a server in a rack is having issues, it's unlikely everything comes back up super quickly; but you're right, I don't know that 100%
20:37:07 <sbalukoff> TrevorV: And my argument is that if you're having random reboot problems, whether the cause is faulty hardware or faulty software (like an improperly set-up compute node), it's better to get off that host the first time anyway.
20:37:29 <TrevorV> sbalukoff but do we have the logic that failover will definitely choose a different host?
20:37:36 <TrevorV> What if it fails over to the same host constantly?
20:37:40 <sbalukoff> I guess the core of my argument is that random reboots should be a relatively rare occurrence. If not, your cloud is already having serious problems anyway.
20:38:07 <sbalukoff> TrevorV: I'm assuming you have a cloud that is somewhat large. And therefore, it's unlikely to be re-issued to the same host.
20:38:38 <sbalukoff> Only large clouds are going to need to worry about nova scheduler bottlenecks and whatnot when you have a lot of failovers near the same time. :/
20:38:38 <johnsom> I'm more worried about software issues than hardware.  Memory leaks, kernel panics, etc. in the amp.
20:38:54 <johnsom> Yes, yes, shame on us, I know, but "stuff happens"
20:39:04 <johnsom> These are fast reboot situations
20:39:07 <xgerman> reboots are really a VM-specific concern
20:39:20 <xgerman> containers wouldn't reboot, and I'm not sure what we would do with hardware
20:39:21 <sbalukoff> xgerman: What do you mean?
20:39:33 <dougwig> hardware wouldn't have 'amps'.
20:39:35 <xgerman> well, for generalizations sake I would just not allow reboots
20:39:42 <blogan> so can i argue that a nova instance being deleted is not going to happen very often and we shouldn't solve for that case?
20:39:56 <blogan> and if it does get deleted by an admin there are other serious problems?
20:39:59 * johnsom glares at blogan
20:40:04 <TrevorV> blogan nova instances are deleted every time a failover happens
20:40:09 <TrevorV> But not every time a reboot would happen
20:40:19 <TrevorV> hence my tax comment
20:40:23 <blogan> TrevorV: i'm alluding to a specific bug we've discussed before
20:40:26 <johnsom> blogan is referencing the failover flow issue with deleted instances
20:40:35 <sbalukoff> Haha
20:40:53 <blogan> same arguments are being made for this that i made for not handling that case
20:41:08 <sbalukoff> In a real production scenario, then yes, I would anticipate that deliberate deletions of amphorae by administrators are going to be a relatively rare occurrence.
20:41:14 <blogan> they're not apples to apples though
20:41:22 <johnsom> They are not
20:41:44 <sbalukoff> Hell, on our blocks load balancer product (that a good portion of Octavia was modeled after) we have instances that have been running continuously for 4 years...
20:42:05 <xgerman> nope, but if an operator wants to reboot they can deactivate health monitoring, reboot, and be back on their merry way
20:42:09 <TrevorV> Alright, well then, I concede to having a test and a decision next week.
20:42:11 <sbalukoff> (They probably shouldn't be, but I'm not in charge of applying security updates, etc. on them anymore. ;) )
20:42:14 <johnsom> To me it comes down to layers of resiliency.  "stuff happens" and I would like clean, efficient ways to deal with those situations that minimize the impact.
20:42:19 <TrevorV> It's not like it's keeping us from doing anything while that testing gets done, right?
20:42:24 <sbalukoff> xgerman: +1
20:42:25 <dougwig> sbalukoff: are those 4-year instances running on IBM hardware?
20:42:37 <xgerman> HP probably :-)
20:42:53 <dougwig> sbalukoff: and how are those instances not vulnerable to about 8 dozen openssl bugs by now?
20:42:57 <sbalukoff> dougwig: Nope. Which means they're going down soon in any case because the Blue Box datacenters are being phased out. ;)
20:43:08 <sbalukoff> dougwig: No comment. ;)
20:43:13 <dougwig> lol
20:43:22 <johnsom> wow...
20:43:30 <johnsom> Ok, any other topics for today?
20:44:02 <TrevorV> sbalukoff  http://az616578.vo.msecnd.net/files/2015/09/19/635782305346788765-336606072_2905279.jpg
20:44:02 * Frito observes the crickets
20:44:09 <sbalukoff> dougwig: Actually, we'd typically patch the libraries and restart haproxy. This doesn't require a reboot.
20:44:30 <dougwig> good that you threw that in the logs there to cover yourself.
20:44:44 <sbalukoff> Haha
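(The patch-and-reload flow sbalukoff describes needs no instance reboot. A rough sketch follows, with the shell steps wrapped in subprocess calls; it assumes an Ubuntu/systemd amphora image with distro-era package names and is not an Octavia feature.)

```python
# Rough sketch of patching TLS libraries and reloading HAProxy without a
# reboot (assumes an Ubuntu/systemd image; package names vary by release).
import subprocess

def patch_ssl_and_reload_haproxy():
    subprocess.check_call(["apt-get", "update"])
    subprocess.check_call(
        ["apt-get", "install", "--only-upgrade", "-y", "openssl", "libssl1.0.0"]
    )
    # Graceful reload: new workers pick up the patched libraries while
    # existing connections drain, so the instance itself stays up.
    subprocess.check_call(["systemctl", "reload", "haproxy"])
```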
20:45:20 <johnsom> Ok, going once.....
20:45:27 <sbalukoff> Thanks folks!
20:45:47 <johnsom> Awesome.  Thanks for joining!
20:45:51 <johnsom> #endmeeting