20:00:26 <johnsom> #startmeeting Octavia
20:00:27 <openstack> Meeting started Wed Jan 30 20:00:26 2019 UTC and is due to finish in 60 minutes.  The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:28 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:00:31 <openstack> The meeting name has been set to 'octavia'
20:00:36 <nmagnezi> o/
20:00:37 <colin-> o/
20:00:52 <johnsom> Hi folks
20:01:08 <johnsom> #topic Announcements
20:01:17 <tomtom001> ok that's making sense 3 controllers so 2 per controller
20:01:40 <johnsom> First up, a reminder that TC elections are coming up.  If you have an interest in running for the TC, please see this post:
20:01:47 <johnsom> #link http://lists.openstack.org/pipermail/openstack-discuss/2019-January/001829.html
20:02:21 <johnsom> Next up, HAProxy has reached out and is soliciting feedback on the next "native" protocol to add to HAProxy.
20:02:39 <johnsom> They recently added gRPC and are interested in what folks are looking for.
20:02:43 <colin-> that's awesome
20:03:08 <rm_work> o/
20:03:12 <johnsom> They have put up a poll here:
20:03:21 <johnsom> #link https://xoyondo.com/ap/6KYAwtyjNiaEOjJ
20:03:33 <johnsom> Please give them feedback on what you would like to see
20:03:55 <johnsom> If you select other, either comment on the poll or let me know what you are looking for and I can relay it
20:05:17 <johnsom> If you aren't following the upstream HAProxy work, HTTP/2 is really coming together with the 1.9 release.
20:05:51 <johnsom> #link https://www.haproxy.com/blog/haproxy-1-9-has-arrived/
20:05:55 <rm_work> ah yeah i was wondering why that wasn't in the poll, but i guess since it was partially in 1.8 then they were already planning to finish it in 1.9
20:06:06 <colin-> was surprised not to see amqp there for our own msging
20:06:12 <rm_work> was googling to see if they finished it before i put it as "other" :P
20:06:23 <johnsom> Yeah, it's pretty much done in 1.9 but they are working out  the bugs, edge cases, etc.
20:07:04 <johnsom> Any other announcements today?
20:07:05 <rm_work> cool, so is it packaged anywhere yet? :P
20:07:40 <johnsom> Well, to quote the release announcement "An important point to note, this technical release is not suitable for inclusion in distros, as it will only be maintained for approximately one year (till 2.1 is out). Version 2.0 coming in May will be long-lived and more suitable for distros."
20:08:06 <johnsom> So, likely no.
20:09:02 <johnsom> However, we might want to consider work on our DIB setup to allow users to point to a non-distro package. Not a high priority (easy enough to install it in an image), but something for the roadmap.
20:09:17 <openstackgerrit> Merged openstack/octavia-tempest-plugin master: Add the provider service client.  https://review.openstack.org/630408
20:09:45 <johnsom> I didn't see any other announcements, so...
20:09:45 <johnsom> #topic Brief progress reports / bugs needing review
20:10:22 <johnsom> I have been working on flavors. The client, tempest, and octavia patches are all done and ready for review.  Some have started merging.
20:10:46 <johnsom> I also added an API to force a configuration refresh on the amphora-agent.  That is all posted as well.
20:11:16 <nmagnezi> I back ported some fixes for stable/queens, a chain of 3 patches (the last one is from cgoncalves) that starts here: https://review.openstack.org/#/c/633412/1
20:11:40 <nmagnezi> All related to IPv6 / Keepalived / Active Standby stuff
20:11:42 <johnsom> I had to stop and deal with some gate issues this week as well. 1. requests_toolbelt released with broken dependencies which broke a release test gate. 2. pycodestyle updated and broken some pep8 stuff in openstacksdk.
20:12:13 <johnsom> Currently I am working on getting the openstacksdk updated for our new API capabilities.
20:12:46 <johnsom> tags, stats, amp/provider/flavor/flavorprofile, etc.
20:12:59 <johnsom> And finally, working on patch reviews.
20:13:07 <johnsom> nmagnezi Thank you!
20:13:26 <rm_work> sorry, been reading that haproxy 1.9 announcement. nice cacheing stuff coming soon too, we might need to think about that again
20:13:36 <nmagnezi> johnsom, np at all
20:13:48 <johnsom> Once the SDK stuff is cleaned up I will work on reviews and helping with the TLS patches that are up for review
20:13:51 <nmagnezi> Had some conflicts so look closely :)
20:14:17 <johnsom> Yeah, that is why I want to carve some time out for that review
20:15:25 <rm_work> the last change there (ipv6 prefix) needs to merge in rocky first too... need to figure out why it has a -1 there
20:15:26 <johnsom> Yeah, the caching stuff was neat. At least in 1.8 you had to turn off other things though to use it. Hopefully that got fixed.
20:16:10 <nmagnezi> rm_work, thank you for pointing this out!
20:16:17 <johnsom> Any other updates today?
20:17:03 <johnsom> #topic Discuss spares pool and anti-affinity (colin-)
20:17:06 <rm_work> oh, cgoncalves responded on the rocky one, i may be missing other backports there too, if you have time to look at that
20:17:11 <rm_work> ^^ nmagnezi
20:17:29 <rm_work> similar to your chain in queens
20:17:47 <nmagnezi> rm_work, will do
20:17:48 <johnsom> So colin- raised the question of enabling spares pool when anti-affinity is enabled.  colin- Did you want to lead this discussion?
20:18:27 <nmagnezi> rm_work, I know that the second patch in the chain exists in Rocky , will double check for the rest
20:18:54 <colin-> sure, was looking at what the limitations were and the channel helped me catch up on why it's not true today. in doing so we discussed a couple of ways that could be address, including rm_work's experimental change here https://review.openstack.org/558962
20:19:04 <rm_work> 👍
20:19:41 <rm_work> ah yeah, so that is the way I handled it for AZs, not sure if we want to try to handle it similarly for flavors
20:19:45 <colin-> i was curious if the other operators have a desire/need for this behavior or if it would largely be for convenience sake? is anti-affinity strongly preferred over spares pool generally speaking?
20:19:51 <rm_work> oh sorry this is anti-affinity not flavors
20:20:19 <johnsom> The question came up whether we should be managing the scheduling of spare amphora such that they could be used to replace the nova anti-affinity capability.
20:20:37 <rm_work> so, anti-affinity really only matters for active-standby, and the idea is that there it's ok to wait a little longer, so spares pool isn't as important
20:20:53 <colin-> ok, that was sort of the intent i was getting. that makes sense
20:21:16 * rm_work waits for someone to confirm what he just said
20:21:25 <johnsom> Yeah, it should be a delta of about 30 seconds
20:21:47 <rm_work> yeah i don't THINK it's relevant for STANDALONBE
20:21:54 <rm_work> *STANDALONE
20:21:55 <johnsom> At least on well behaved clouds
20:22:05 <rm_work> since ... there's nothing to anti-affinity to
20:22:26 <johnsom> Right, standalone we don't enable anti-affinity because it would be dumb to have server groups with one instance in them
20:22:59 <rm_work> so that's why we never really cared to take the effort
20:23:12 <colin-> i'm guessing things like querying nova to check a hypervisor attribute on an amphora before pulling a spare out of the pool would be out of the question?
20:23:23 <colin-> a copmarison to see if it matches the value on the STANDALONE during a failover maybe
20:23:50 <rm_work> i mean, the issue is cache reliability
20:23:53 <johnsom> Yeah, I think ideally, nova would allow us to add an instance to a server group with anti-affinity on it, and nova just live migrate it if it wasn't meeting the anti-affinity rule. But this is not a capability of nova today.
20:23:59 <rm_work> with AZ i'm pretty confident
20:24:02 <rm_work> with HV, not so much
20:24:35 <rm_work> yes, i had a productive talk about that in the back of an uber with a nova dev at the PTG a year ago
20:24:40 <rm_work> and then i forgot who it was and nothing happened <_<
20:25:49 <johnsom> I mean, you could query nova for the current host info, then make a decision of which to pull from the spares pool.  That is basically the scheduling that *could* be added to octavia.  Though it seems heavy weight.
20:26:07 <colin-> and would be assuming some scheduling responsibilities it sounds like
20:26:07 <rm_work> yes, and unless the spares pool manager tries to maintain this
20:26:08 <colin-> conceptually
20:26:09 <johnsom> With nova live migration, you don't know if the boot host is the current host, so you have to ask
20:26:11 <rm_work> which it COULD technically
20:26:20 <rm_work> there's a good chance you'd have to try several times
20:26:25 <rm_work> or be out of luck
20:26:30 <rm_work> especially if your nova uses pack
20:26:54 <rm_work> so if we did add a field, and had housekeeping check the accuracy of it every spares run.... that could work
20:27:03 <rm_work> but still like... eh
20:27:41 <colin-> is this the relevant section for this logic? https://github.com/openstack/octavia/blob/979144f2fdf7c391b6c154c01a22f107f45da833/octavia/controller/worker/flows/amphora_flows.py#L224-L230
20:27:42 <johnsom> Yeah, I feel like it's a bit of tight coupling and getting into some nasty duplicate scheduling code with nova.
20:28:06 <johnsom> Nope, just a second
20:28:47 <johnsom> #link https://github.com/openstack/octavia/blob/master/octavia/controller/worker/tasks/database_tasks.py#L479
20:29:03 <johnsom> This is where we check if we can use a spare and allocate it to an LB
20:29:09 <colin-> thanks
20:29:36 <johnsom> There would probably be code in the "create amphora" flow to make sure you create spares with some sort of host diversity too
20:30:22 <johnsom> basically if that MapLoadbalancerToAmphora task fails to assign a spare amphora, we then go and boot one
20:31:37 <johnsom> So, I guess the question on the table is: Is the boot time long enough to warrant building compute scheduling into the Octavia controller or not?
20:31:57 <colin-> given the feedback i've gotten i'm leaning towards not (yet!)
20:32:06 <rm_work> right, exactly what i have in my AZ patch, just using a different field
20:32:10 <rm_work> that's host-based instead of AZ based
20:32:16 <rm_work> which i commented in the patch i think
20:32:29 <rm_work> and yeah, i think it's not worth it
20:32:51 <johnsom> For me, the 30 second boot time is ok.
20:33:01 <colin-> ok. appreciate the indulgence :)
20:33:07 <rm_work> even if it's minutes....
20:33:11 <johnsom> Plus we still have container dreams which might help
20:33:11 <rm_work> it's an active-standby LB
20:33:19 <rm_work> it's probably fine
20:33:36 <rm_work> i see failures so infrequently, i would hope you wouldn't see two on the same LB within 10 minutes
20:33:48 <rm_work> and if you do, i feel like probably you have bigger issues
20:33:51 <jiteka> johnsom: for us it's more about 1:30 and up to 2min with first amp hit a new compute
20:34:00 <rm_work> especially if antri-affinity is enabled
20:34:14 <johnsom> jiteka Why is it 1:30? have you looked at it?
20:34:28 <rm_work> they are still using celerons
20:34:32 <johnsom> Disk IO? Network IO from glance?
20:34:41 <rm_work> 😇✌️
20:34:47 <jiteka> johnsom: our glance is global accross region (datacenter/location)
20:34:47 <johnsom> hyper visor overhead?
20:34:58 <rm_work> no local image cacheing?
20:35:05 <jiteka> we have local cache
20:35:11 <jiteka> but not pre-cached
20:35:21 <johnsom> Yeah, the 2 minutes on first boot should be loading the cache
20:35:23 <jiteka> when new amphora image is promoted and tagged
20:35:23 <rm_work> oh, right, new compute
20:35:26 <jiteka> and used for the first time
20:35:38 <rm_work> yeah, at RAX they had a pre-cache script thing
20:35:47 <johnsom> But after that, I'm curious what is driving the 1:30 time.
20:35:49 <rm_work> so they could define images to force into cache on new computes
20:35:49 <jiteka> in public cloud yes
20:35:56 <jiteka> can't say in private ;)
20:36:00 <rm_work> :P
20:36:21 <rm_work> but per what i said before, even up to 10m... not a real issue IMO
20:36:21 <johnsom> jiteka
20:36:27 <johnsom> Are you using centos images?
20:36:30 <colin-> the spares pool has given us the latitude to overlook that for the time being but abandoning it will be cause to look into long boot times, for sure
20:36:31 <jiteka> also I don't recall any way directly with glance to cherry-pick image you want to pre-cache among all available image
20:36:38 <jiteka> johnsom: using ubuntu
20:36:49 <rm_work> ah for new LBs you mean? colin-
20:36:59 <rm_work> because that's the only place anyone would ever see an issue with times
20:37:10 <colin-> yes, jiteka and i are referring to the same platform
20:37:13 <rm_work> jiteka: they had a custom thing that ran on the HVs
20:37:13 <jiteka> johnsom: Ubuntu 16.04.5 LTS
20:37:17 <johnsom> Hmm, yeah, would be interested to learn what is slow.
20:37:20 <rm_work> they == rax
20:37:35 <johnsom> Ok, yeah, I know there is a centos image slowdown that folks are looking into.
20:37:48 <jiteka> johnsom: all other officially released and used VMs in our cloud are centos based
20:38:15 <jiteka> johnsom: which makes octavia often hit that "first spawn not yet cached" situation
20:39:11 <johnsom> #link https://bugzilla.redhat.com/show_bug.cgi?id=1666612
20:39:12 <openstack> bugzilla.redhat.com bug 1666612 in systemd "Rules "uname -p" and "systemd-detect-virt" kill the system boot time on large systems" [High,Assigned] - Assigned to jsynacek
20:39:18 <jiteka> johnsom: sorry I mean all other user used same glance image that we provided to them which are centos based
20:39:42 <johnsom> Just an FYI on the centos boot time issue.
20:40:46 <jiteka> johnsom: thx
20:41:16 <johnsom> Ok, are we good on the spares/anti-affinity topic? Should I move on to the next topic?
20:41:21 <colin-> yes
20:41:39 <rm_work> IMO AZ anti-affinity is better anyway
20:41:42 <rm_work> if you can use it
20:41:43 <rm_work> :P
20:41:49 <johnsom> Cool. Thanks for bringing it up. I'm always happy to have a discussion on things.
20:42:01 <johnsom> #topic Need to start making decisions about what is going to land in Stein
20:42:27 <johnsom> I want to raise awareness that we are about one month from feature freeze for Stein!
20:43:00 <johnsom> Because of that I plan to focus more on reviews for features we committed to completing at the last PTG and bug fixes.
20:43:31 <jiteka> johnsom: I noticed the exact same behavour in my rocky devstack when I perform "Rotating the Amphora Images" scenario, after kill spare pool and forcing it to re-generate using newly tagged amphora image, it takes long time. Once that first amphora end-up ready, next one are really fast
20:44:34 <johnsom> Sadly we don't have a lot of reviewer cycles for Stein (at least that is my perception) so I'm going to be a bit picky about what we focus on and prioritize reviews for things we agreed on at the PTG.
20:44:37 <johnsom> #link https://etherpad.openstack.org/p/octavia-stein-ptg
20:44:37 <jiteka> johnsom: and nothing more to add about this part
20:44:51 <johnsom> That is the etherpad with the section we listed our priorities.
20:45:32 <johnsom> Do you all agree with my assessment and approach?  Discussion?
20:46:36 <johnsom> I would hate to see patches for things we agreed on at the PTG go un-merged because we didn't have review cycles or spent them on patches for things we didn't agree to.
20:47:33 <openstackgerrit> Adam Harwell proposed openstack/octavia master: WIP: Floating IP Network Driver (spans L3s)  https://review.openstack.org/435612
20:47:52 <colin-> anything known to be slipping out of Stein at this stage? or too early
20:47:55 <colin-> from the etherpad
20:47:55 <johnsom> ^^^^ Speaking of patches not targeted for Stein....
20:48:09 <rm_work> :P
20:48:28 <rm_work> patch not targeted for merging ever :P
20:48:49 <johnsom> I think the TLS patches (client certs, backend re-encryption) are posted but at-risk.  Flavors is in the same camp.  Log offloading.
20:49:15 <johnsom> privsep and members as a base resource are clearly out.
20:49:28 <johnsom> HTTP2 is out
20:49:28 <rm_work> damn, that one is so easy too :P
20:49:32 <rm_work> just needed to ... do it
20:49:37 <rm_work> but agreed
20:50:05 <johnsom> Ok. I didn't want to make a cut line today. I think that is too early.
20:50:28 <johnsom> I just wanted to raise awareness and make sure we discussed the plan as we head to feature freeze.
20:51:22 <johnsom> We have a nice lenghty roadmap, but limited development resources, so we do need to plan a bit.
20:51:52 <rm_work> i'm trying to spin back up
20:51:57 <rm_work> just catching up on everything
20:52:18 <johnsom> Good news is, we did check things off that list already, so we are getting stuff done and there are patches posted.
20:52:55 <johnsom> Ok, unless there are more comments, I will open it to open discussion
20:53:19 <johnsom> #topic Open Discussion
20:53:34 <johnsom> Other topics today?
20:53:59 <colin-> have a plan to maintain a separate aggregate with computes that will exclusively host amphorae, believe we have what we need to proceed but wanted to see if anyone has advice against this or experience doing it successfully?
20:54:00 <jiteka> johnsom: I would like to know more about Containerized amphora
20:54:21 <jiteka> johnsom: on the etherpad, maybe more context about statement "k8s performance overhead"
20:55:22 <johnsom> colin- The only issue I have seem is people treating that group as pets and getting scared to update the hosts or lack of maintenance as they aren't on the main compute radar.
20:55:31 <jiteka> with that strategy of an aggregate of few computes hosting all small amphora VM, I feel that our build time will be relatively small expect when performing a new image rotation
20:56:15 <colin-> johnsom: strictly cattle here, good call out :)
20:56:17 <johnsom> jiteka Yeah, lengthy discussion that comes up regularly. Some of the baseline issues with k8s is the pool network performance due to the ingress/egress handling.
20:57:53 <johnsom> jiteka The real issue now is we don't have anyone that can take time to work on containers.  We were close to having lxc working, but there were bugs with nova-lxd. These are likely not issues any longer, but we don't have anyone to look at them.
20:58:43 <johnsom> jiteka Also, k8s isn't a great platform for base services like networking. You don't exactly want it to shuffle/restart the pods during someone's TCP flow.....
21:00:20 <johnsom> We tend to work a bit lower in the stack.  That said, using containers is still something we think is good and should happen. Just maybe not k8s, or some of the orchestrators that aren't conducive to networking.
21:00:30 <johnsom> Dang, out of time. Thanks folks!
21:00:33 <johnsom> #endmeeting