20:00:01 <johnsom> #startmeeting Octavia
20:00:02 <openstack> Meeting started Wed Jun 22 20:00:01 2016 UTC and is due to finish in 60 minutes. The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:04 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:00:04 <Frito> O.o
20:00:07 <openstack> The meeting name has been set to 'octavia'
20:00:08 <TrevorV> o/
20:00:10 <diltram> o/
20:00:12 <johnsom> #topic Announcements
20:00:19 <longstaff> hi
20:00:20 <sbalukoff> Howdy!
20:00:22 <johnsom> I'm back........ Grin
20:00:24 <diltram> hey
20:00:24 <fnaval> o/
20:00:27 <sbalukoff> Yay!
20:00:34 <eezhova> hi
20:00:34 <rm_work> o/
20:01:01 <TrevorV> He's not back... HE'S OURS
20:01:03 <TrevorV> MUAH HA HA HA HA HA
20:01:05 <johnsom> Also, mid-cycle planning
20:01:09 <fnaval> he's to my right!
20:01:18 <johnsom> #link https://etherpad.openstack.org/p/lbaas-octavia-newton-midcycle
20:01:19 <TrevorV> He's to my 4:30
20:01:22 <blogan> hi
20:01:27 * mhayden stumbles in
20:01:38 <johnsom> Please update with your attendance, etc.
20:01:48 * johnsom Waves to people
20:01:56 <Frito> he's to my that-a-way
20:02:00 <TrevorV> also make sure to note if you can't physically attend but want to be on a conference
20:02:03 <TrevorV> we can set that up here at Rax
20:02:04 <johnsom> Any other announcements?
20:02:27 <dougwig> have we finalized which days?
20:02:53 <TrevorV> dougwig its marked in the etherpad and everything man!
20:03:03 <johnsom> I plan to be there for Monday through sometime Friday. I see we added a section about maybe less days.
20:03:08 <TrevorV> (said in "The Dude"'s voice)
20:03:40 <johnsom> That probably means: dougwig can't stand to look at us for a full week.
20:04:09 <blogan> i think its the other way around
20:04:13 <johnsom> Do we need a vote?
20:04:15 <johnsom> grin
20:04:26 <sbalukoff> Heh!
20:04:44 <TrevorV> I'm not against having the space available for the week, but us calling it a 4-day
20:04:52 <dougwig> it just says 'week of', with some debate.
20:05:00 <johnsom> Other folks traveling, sbalukoff, any comments on how many days?
20:05:12 <Frito> always better to reserve and not need vs not reserve and need. That's just my $0.02
20:05:17 <dougwig> i'll be there mon-thu night, regardless of what y'all decide, because i've learned i'm too much of a grouch on the 5th day. not fit company for man nor beast.
20:05:37 <sbalukoff> +1 on dougwig being intolerable by the 5th day.
20:05:40 <sbalukoff> ;)
20:05:43 <rm_work> I think Friday usually ends up kinda minimally useful
20:05:49 * dougwig blushes.
20:05:55 <johnsom> Oye, that isn't exactly what I meant, but ok....
20:05:56 <sbalukoff> And actually, I think productivity that last day is always pretty waning anyway
20:06:25 <TrevorV> Yeah, we end up spending the morning touching up some reviews, and then everyone bailing after lunch anyway
20:06:27 <rm_work> Maybe Mon-Thurs with Monday as a kind of "get everyone set up and started" day, and Tues-Thurs the real work days
20:06:29 <TrevorV> So I think that's fine, we'll call it 4 days
20:06:29 <sbalukoff> But I'm all for working Monday -> Thurs (and possibly part of Friday, eh.)
20:06:44 <rm_work> yeah and we'll have the room Friday for stragglers :P
20:06:50 <sbalukoff> Sounds good.
20:06:51 <TrevorV> Yeah, sounds good to me.
20:06:55 <xgerman> o/
20:07:23 <johnsom> Yep, works for me too. I will likely have to leave mid-day Friday myself.
20:07:38 <sbalukoff> Yep, gotta catch a flight back home.
20:08:12 <johnsom> #topic Brief progress reports / bugs needing review
20:08:28 <TrevorV> #link https://review.openstack.org/#/c/306084
20:08:31 <TrevorV> I need eyes on this
20:08:32 <TrevorV> its ready
20:08:34 <johnsom> How are things going? I have started doing reviews again. Lots of good stuff going on!
20:08:54 <johnsom> "its ready" == ready to -2?
20:08:58 <johnsom> JK
20:09:03 <sbalukoff> I'm still being diverted by internal stuff and recovery from surgery, so wasn't able to get to many reviews this week; probably won't get to many this week either. :/
20:09:13 <dougwig> TrevorV: looks like it just passed my CI, which is what i was waiting for to review it.
20:09:59 <johnsom> TrevorV It looks like sbalukoff has a comment to attend to on that
20:10:16 <johnsom> Anything else?
20:10:20 <eezhova> #link https://review.openstack.org/#/c/310490/
20:10:22 <fnaval> mostly working on internal stuff; I still have a few open reviews that need to be looked at
20:10:34 <TrevorV> johnsom I just noticed.
20:10:39 <fnaval> #link https://review.openstack.org/#/c/306083/
20:10:42 <TrevorV> I'll tinker
20:10:43 <sbalukoff> That's because I'm a jerk.
20:10:45 <rm_work> sbalukoff: default rise is 2, fall is 3
20:10:47 <rm_work> for HAProxy
20:10:48 <fnaval> #link https://review.openstack.org/#/c/308091/
20:10:55 <johnsom> Thanks eezhova
20:10:58 <rm_work> just a note
20:11:00 <sbalukoff> rm_work: Oh? Ok-- we should probably do that, then.
20:11:26 <Frito> I've been mainly working on internal again, lol
20:11:33 <rm_work> I'll post that note on the CR
20:11:46 <sbalukoff> rm_work: Sounds good.
20:11:58 <blogan> eezhova: done!
20:12:01 <johnsom> Ok, I will be spinning up the review engine over the next week, so hopefully get our velocity back up.
20:12:13 <sbalukoff> Yay!
20:12:27 <eezhova> blogan, thanks!
20:12:31 <johnsom> #topic Should amphorae be rebootable?
20:12:38 <johnsom> Not sure who added this, but good topic
20:12:42 <johnsom> #link https://bugs.launchpad.net/octavia/+bug/1517290
20:12:42 <openstack> Launchpad bug 1517290 in octavia "Not able to ssh to amphora or curl the vip after rebooting" [High,Opinion]
20:12:57 <blogan> why does it need rebooting in the first place?
20:13:10 <rm_work> I added it
20:13:15 <rm_work> I think the answer is no
20:13:19 <johnsom> I agree with the cattle mentality here, but I have also attempted to maintain reboot functionality.
20:13:22 <sbalukoff> My thought is: "No, amphorae should not be rebootable."
20:13:23 <rm_work> but, we just need to make a decision so i can kill the bug or not
20:13:31 <johnsom> It actually was a huge pain for network namespaces.
20:13:48 <blogan> what are the cases a reboot is absolutely needed though?
20:13:51 <sbalukoff> johnsom: Can you think of a reason why we'd want to allow rebooting of amphorae?
20:14:08 <diltram> plus without reboot we're able to use some system like tiny core linux
20:14:11 <johnsom> We should make a policy call on this. At least to remove my guilt. grin
20:14:19 <sbalukoff> diltram: +1
20:14:48 <johnsom> diltram Not sure how that makes a difference with tiny core linux....
20:14:49 <rm_work> yeah, i'm about to be dealing with minifying the amp image, so i don't want to have to deal with this :P
20:14:54 <dougwig> you always have your amps on UPS's, huh?
20:14:55 <sbalukoff> johnsom: Shall we vote on it? ;) Seriously, though, I've not heard any compelling arguments for needing to be able to reboot in any case.
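For reference on rm_work's note above about HAProxy's defaults (rise 2, fall 3): as a minimal sketch, not the exact template Octavia renders, a backend entry that spells those thresholds out explicitly might look like the lines below; the backend name, member name, and address are placeholders.

    backend example_pool
        # rise 2: mark the member up after 2 consecutive successful checks
        # fall 3: mark the member down after 3 consecutive failed checks
        server member_1 10.0.0.10:80 check inter 5s rise 2 fall 3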
20:15:25 <blogan> it probably depends on what the amp is
20:15:28 <johnsom> I was leaving it functional mostly thinking of the situation where an amp could reboot faster than our health check timeout, thus reducing nova churn
20:15:29 <dougwig> what happens if the hypervisor bounces? downtime while we decide its dead and spin up another? is that faster than just letting it boot?
20:15:30 <rm_work> I mean, if an amp goes down to reboot, it should probably just failover anyway
20:15:30 <diltram> johnsom: TCL is a RAM-based system; you really need to run a command to sync data to disk and create configuration to save data on storage
20:15:34 <blogan> if someone writes a hardware amp driver, then they'd have to support it
20:15:42 <rm_work> hmm
20:15:43 <blogan> but since we only support VMs right now...
20:15:53 <rm_work> that's an interesting point i guess dougwig <_<
20:16:03 <rm_work> because that happens
20:16:29 <TrevorV> I think I have a better question, what's the *harm* in letting it reboot?
20:16:34 <rm_work> rarely, but imagine having to apply a Xen patch for an XSA and then ...
20:16:41 <sbalukoff> Well, also consider that SSL keys are supposed to be on RAM only on amphorae. They'd be gone after a reboot.
20:16:52 <TrevorV> ^^ that
20:16:54 <rm_work> hmm yeah that was in our design but never completed
20:16:55 <TrevorV> That's a downside
20:17:05 <xgerman> if we reboot are we taking it out of health?
20:17:10 <rm_work> and I'm not sure if we actually decided to worry about that
20:17:14 <sbalukoff> We should complete that. It's better security. :/
20:17:19 <rm_work> depends on the monitoring frequency
20:17:19 <xgerman> Also if we failover and it comes back can we account for that
20:17:34 <rm_work> well if we failover it, we'll nova-delete it
20:17:37 <johnsom> I also have the question if anyone has tried this recently? I think I have rebooted and sshed in without issue during namespace testing.
20:17:46 <diltram> there are too many corner cases to cover when we're allowing reboots
20:17:57 <xgerman> +1
20:18:01 <sbalukoff> Yep.
20:18:08 <sbalukoff> I'm strongly against allowing amphorae reboots.
20:18:13 <diltram> +1
20:18:15 <xgerman> we can always not support it and tell people YMMV
20:18:19 <dougwig> sbalukoff: i'm not sure even just RAM is enough on a shared hypervisor. we're in smoke and mirrors land without a soft HSM involved.
20:18:23 <sbalukoff> Unless someone can come up with a really compelling reason to go to the trouble of allowing them.
20:18:29 <blogan> depends on the amp driver doesn't it?
20:19:10 <johnsom> dougwig yeah, it was encrypted ram volume, but yeah, without HSM it "would" be possible.
20:19:21 <rm_work> i mean, I feel like dougwig's point is valid, it'd be scary if we had to bounce every hypervisor momentarily for a security patch and that meant forced cycling 100% of amps
20:19:26 <sbalukoff> dougwig: Well, RAM is still probably better than non-volatile storage-- at least it's apparent then that it's not supposed to be written to non-volatile storage (even if the back-end does this without our say-so.)
20:19:57 <xgerman> rm_work those bounces are done by first migrating vms, then bounce, etc.
20:19:59 <johnsom> I think it depends on the amp image actually. Octavia doesn't care today. It is either up or not, based on the health monitoring.
20:20:12 <rm_work> hmm yeah i guess i don't actually have experience with doing it
20:20:17 <sbalukoff> rm_work: Yep. This is a known (and solved) problem in the cloud.
20:20:20 <xgerman> I think people would stop paying you if you start rebooting vms randomly
20:20:37 <rm_work> well i just remember my servers were "rebooted" when that happened
20:20:43 <rm_work> but, i guess it was from the original migration
20:20:43 <dougwig> one example of a simple seeming decision like this that SUCKS for operators is cinder... have you ever tried to reboot a cinder-volume server? no? enjoy finding all the related instances and pausing them first. it's a nightmare.
20:20:52 <johnsom> So, before we go too far here, I think we should test this bug to validate.
20:20:55 <dougwig> but the ssl key argument does have some validity.
20:21:04 <xgerman> +1
20:21:07 <sbalukoff> dougwig: Because block storage is ugly in a cloud.
20:21:19 <xgerman> also in the world of containers you won’t need to reboot
20:21:31 <xgerman> sbalukoff +1
20:21:38 <dougwig> sbalukoff: what part of openstack is not ugly, and don't we exist to make it less ugly?
20:21:49 <xgerman> also if you don’t like cinder I can put you in touch with people who sell storage
20:22:15 <rm_work> ok so then we can't actually close this bug yet? T_T
20:22:35 <sbalukoff> dougwig: True enough. But some parts are just always going to be ugly. Doing block storage (i.e. using a very, very old interface to storage because you can't be bothered to update your damned legacy application) is always going to be ugly.
20:22:46 <rm_work> are we actually agreeing to test this before officially taking a vote, to see if we can avoid taking a vote? :P
20:23:13 <johnsom> I would like someone to try it and either mark it invalid or confirmed
20:23:25 <johnsom> Yes, basically.
20:23:54 <sbalukoff> Again, I think we should just make a policy decision that amphorae are not rebooted, and go with that until someone comes up with a really compelling case to do it. (Compelling enough to revisit code that depends on not allowing reboot functionality.)
20:24:01 <johnsom> If we decide to not support reboots, we should open a bug to make the amp NOT come back up to enforce that decision.
20:25:01 <johnsom> sbalukoff So not churning nova isn't a basic enough reason to maintain reboot capability? This just seems like one of those we will come back and wish we didn't make the decision.
20:25:39 <sbalukoff> johnsom: Realistically, which cases are you worried about churning nova in?
20:26:20 <johnsom> If an amp could come up before we health fail it over. It saves a nova boot, etc.
20:26:35 <dougwig> since the meat of the argument here is around SSL termination security issues, what is the ratio of ssl termination to not in your big operators' VIPs?
20:26:38 <xgerman> but that is highly YMMV
20:26:52 <sbalukoff> johnsom: But I mean: That's still a service disruption even if it's smaller than your health check thresholds.
20:27:01 <johnsom> Agreed, just exploring the issue.
20:27:18 <johnsom> sbalukoff Not in act/standby or act/act
20:27:39 <sbalukoff> johnsom: Right, but why would you need to do that reboot in the first place?
20:28:06 <johnsom> sbalukoff I don't have a "need". reboots happen
20:28:37 <sbalukoff> johnsom: Yeah, but then I suspect you'd want to detect that it happened. Seeing the amphora get recycled is a good indicator that something hiccupped there.
20:28:53 <dougwig> that's a stretch. :)
20:29:06 <sbalukoff> dougwig: What is?
20:29:25 <dougwig> that a cycled amp is a feature that provides notification.
20:29:47 <sbalukoff> dougwig: Right. But then, random reboots that are "normal" is also a stretch. :P
20:29:49 <johnsom> sbalukoff I don't disagree that you would like to know it happened, but that is different from how the situation resolves itself.
20:29:53 <perelman> hi
20:30:05 <dougwig> sbalukoff: touche.
20:30:29 <johnsom> I guess I still have a certain public cloud nightmare about nova issues
20:30:31 <TrevorV> sbalukoff so you're saying its better to have random failovers than random reboots?
20:30:42 <johnsom> queues full and instance issues.
20:30:54 <johnsom> Hi perelman
20:30:54 <sbalukoff> TrevorV: Probably, yes.
20:31:12 <TrevorV> I can't say I agree there sbalukoff, especially when you increase frequency of either event.
20:31:23 <TrevorV> I think I'd rather have 100 reboots of an amphora than 100 failovers
20:31:24 <johnsom> Hmmm. Ok, so we can vote or we can test and potentially kick the can....
20:31:29 <sbalukoff> johnsom: If your reboot / failover frequency is that high, you've got some serious other problems.
20:32:02 <TrevorV> Right, but the failover frequency is much much more taxing to other systems and provides potentially more down-time than just the reboots
20:32:05 <TrevorV> and is just as detectable
20:32:05 <johnsom> sbalukoff I don't disagree. However that issue was resolved in a different way....
20:32:16 <sbalukoff> TrevorV: If you've got a piece of faulty hardware that keeps rebooting amphora VMs, it's better that the instance get failed over to some other hardware host the very first time.
20:32:17 <rm_work> if i saw that some VM rebooted 100 times, i'd think "ok that VM is busted", IF i even noticed (how would I notice?)
20:32:18 <johnsom> TrevorV +1
20:32:55 <TrevorV> sbalukoff My argument there is if the faulty hardware is the cause, a failover event would be issued before the amphora came back online from a reboot anyway
20:33:20 <johnsom> Ok. Let's hold this discussion here. I will try to test by next week. Then we can do a vote if it is an issue now. Work?
20:33:34 <TrevorV> Sounds good.
20:33:35 <sbalukoff> TrevorV: Uh... that argument goes against the idea the reboots would take less than the health check interval. You've got no reason to assume that.
20:33:40 <rm_work> if I saw that an amp cycled 100 times, i'd think "oh shit"
20:33:41 <rm_work> and it'd be easier to see, I think
20:34:08 <TrevorV> No, sbalukoff the reboots of an amphora are less taxing than failing over, that's what I said, not a check interval
20:34:42 <johnsom> I want to leave some time for other discussions (and catch a shuttle in 30 if I can)
20:34:48 <xgerman> now as everybody has their paintbrush are there other sheds in need of coloring
20:34:53 <johnsom> #topic Open Discussion
20:35:07 <sbalukoff> Sorry, I meant to say that when you say "if the faulty hardware is the cause, a failover event would be issued before the amphora came back online from a reboot anyway"-- that's a faulty argument. You have no reason to believe faulty hardware wouldn't reboot quickly.
20:35:22 <dougwig> an ubuntu reboot is like 5 seconds. i know a nova spawn and orchestrate is longer.
20:35:33 <johnsom> dougwig +1
20:35:45 <xgerman> +1
20:36:01 <xgerman> also I will skip next week’s meeting…
20:36:02 <TrevorV> sbalukoff I'll concede that, I was just considering that let's say a server in a rack was having issues, it's unlikely that everything comes back up super quickly, but you're right, I don't know that 100%
20:37:07 <sbalukoff> TrevorV: And my argument is that if you're having random reboot problems, then if faulty hardware is the cause (or faulty software, like an improperly-set-up compute node), it's better to get off that host the first time anyway.
20:37:29 <TrevorV> sbalukoff but do we have the logic that failover will definitely choose a different host?
20:37:36 <TrevorV> What if it fails over to the same host constantly?
20:37:40 <sbalukoff> I guess the core of my argument is that random reboots should be a relatively rare occurrence. If not, your cloud is already having serious problems anyway.
20:38:07 <sbalukoff> TrevorV: I'm assuming you have a cloud that is somewhat large. And therefore, it's unlikely to be re-issued to the same host.
20:38:38 <sbalukoff> Only large clouds are going to need to worry about nova scheduler bottlenecks and whatnot when you have a lot of failovers near the same time. :/
20:38:38 <johnsom> I'm more worried about software issues than hardware. Memory leaks, kernel panics, etc. in the amp.
20:38:54 <johnsom> Yes, yes, shame on us, I know, but "stuff happens"
20:39:04 <johnsom> These are fast reboot situations
20:39:07 <xgerman> reboots are very concentrated on vms
20:39:20 <xgerman> container wouldn’t reboot and not sure what we would do with hardware
20:39:21 <sbalukoff> xgerman: What do you mean?
20:39:33 <dougwig> hardware wouldn't have 'amps'.
20:39:35 <xgerman> well, for generalization's sake I would just not allow reboots
20:39:42 <blogan> so can i argue that a nova instance being deleted is not going to happen very often and we shouldn't solve for that case?
20:39:56 <blogan> and if it does get deleted by an admin there are other serious problems?
20:39:59 * johnsom glares at blogan
20:40:04 <TrevorV> blogan nova instances are deleted every time a failover happens
20:40:09 <TrevorV> But not every time a reboot would happen
20:40:19 <TrevorV> hence my tax comment
20:40:23 <blogan> TrevorV: i'm alluding to a specific bug we've discussed before
20:40:26 <johnsom> blogan is referencing the failover flow issue with deleted instances
20:40:35 <sbalukoff> Haha
20:40:53 <blogan> same arguments are being made for this that i made for not handling that case
20:41:08 <sbalukoff> In a real production scenario, then yes, I would anticipate that deliberate deletions of amphorae by administrators are going to be a relatively rare occurrence.
20:41:14 <blogan> they're not apples to apples though
20:41:22 <johnsom> They are not
20:41:44 <sbalukoff> Hell, on our blocks load balancer product (that a good portion of Octavia was modeled after) we have instances that are still running continuously for 4 years...
20:42:05 <xgerman> nope, but if an operator wants to reboot they can deactivate health monitoring, reboot, and be back on their merry way
20:42:09 <TrevorV> Alright, well then, I concede to having a test and a decision next week.
20:42:11 <sbalukoff> (They probably shouldn't be, but I'm not in charge of applying security updates, etc. on them anymore. ;) )
20:42:14 <johnsom> To me it comes down to layers of resiliency. "stuff happens" and I would like clean, efficient ways to deal with those situations that minimize the impact.
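A rough sketch of the operator workflow xgerman suggests above (deactivate health monitoring, reboot, re-enable), assuming a systemd-managed control plane and the standard nova CLI; the service handling and instance ID are illustrative and will vary by deployment:

    # Pause failover detection so the health manager won't recycle the amp mid-reboot
    sudo systemctl stop octavia-health-manager
    # Reboot the amphora VM via nova (the instance ID is a placeholder)
    openstack server reboot <amphora-instance-id>
    # Once the amphora is back up and serving traffic, re-enable health monitoring
    sudo systemctl start octavia-health-manager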
20:42:19 <TrevorV> It's not like it's keeping us from doing anything if that testing is done, right?
20:42:24 <sbalukoff> xgerman: +1
20:42:25 <dougwig> sbalukoff: are those 4-year instances running on IBM hardware?
20:42:37 <xgerman> HP probably :-)
20:42:53 <dougwig> sbalukoff: and how are those instances not vulnerable to about 8 dozen openssl bugs by now?
20:42:57 <sbalukoff> dougwig: Nope. Which means they're going down soon in any case because the Blue Box datacenters are being phased out. ;)
20:43:08 <sbalukoff> dougwig: No comment. ;)
20:43:13 <dougwig> lol
20:43:22 <johnsom> wow...
20:43:30 <johnsom> Ok, any other topics for today?
20:44:02 <TrevorV> sbalukoff http://az616578.vo.msecnd.net/files/2015/09/19/635782305346788765-336606072_2905279.jpg
20:44:02 * Frito observes the crickets
20:44:09 <sbalukoff> dougwig: Actually, we'd typically patch the libraries and restart haproxy. This doesn't require a reboot.
20:44:30 <dougwig> good that you threw that in the logs there to cover yourself.
20:44:44 <sbalukoff> Haha
20:45:20 <johnsom> Ok, going once.....
20:45:27 <sbalukoff> Thanks folks!
20:45:47 <johnsom> Awesome. Thanks for joining!
20:45:51 <johnsom> #endmeeting