16:00:33 <johnsom> #startmeeting Octavia
16:00:34 <openstack> Meeting started Wed Jun 17 16:00:33 2020 UTC and is due to finish in 60 minutes.  The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:35 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:37 <openstack> The meeting name has been set to 'octavia'
16:00:56 <johnsom> Hello everyone
16:01:01 <gthiemonge> Hi
16:01:02 <ataraday_> hi
16:01:05 <aannuusshhkkaa> Hello!
16:01:06 <rm_work> o/
16:01:24 <shtepanie> Hi!
16:02:19 <cgoncalves> hi
16:02:40 <openstackgerrit> Merged openstack/octavia master: Use uwsgi binary from path  https://review.opendev.org/736137
16:02:43 <johnsom> #topic Announcements
16:02:58 <johnsom> Well, that was one announcement ^^^^
16:03:53 <johnsom> uwsgi was broken for devstack recently. That patch should resolve it on the master branch.
16:05:09 <johnsom> Does anyone have any other announcements this week?
16:06:16 <rm_work> aannuusshhkkaa and shtepanie are joining us for the summer as dev interns at vzm!
16:06:40 <johnsom> Yay! Welcome
16:06:40 <cgoncalves> nice, welcome!
16:06:45 <gthiemonge> welcome!
16:07:09 <shtepanie> thank you!
16:07:09 <aannuusshhkkaa> Thank you!! :)
16:07:40 <rm_work> In the process of getting them up to speed, and we've got a topic later about what they'll be working on (metrics!)
16:08:03 <openstackgerrit> Merged openstack/octavia master: add the verify for the session  https://review.opendev.org/726567
16:08:31 <johnsom> Sounds good
16:08:40 <johnsom> #topic Brief progress reports / bugs needing review
16:09:44 <johnsom> I have been focused on catching up on reviews, getting the stable branches - sigh - stable, and cutting some stable branch releases.
16:10:05 <ataraday_> I was a bit tied up with internal processes. Now I want to highlight two changes: adding retries and a pre-upgrade check for amphorav2
16:10:06 <johnsom> We got Ussuri and Stein out of the gate. Train is still broken due to grenade issues
16:10:16 <ataraday_> #link https://review.opendev.org/#/c/726084/
16:10:29 <ataraday_> #link https://review.opendev.org/#/c/735556/
16:10:52 <rm_work> Oh was it just this last week that we EOL'd two branches? Or was that already announced
16:10:59 <rm_work> I'm losing track of time
16:11:07 <johnsom> Oh, yeah, in fact it was!
16:11:18 * johnsom is living in a time warp as well
16:11:43 <johnsom> We have officially EOL'd the Ocata and Pike releases of Octavia.
16:12:24 <johnsom> Thanks rm_work for leading that effort and navigating the process waters
16:12:41 <cgoncalves> +1, thank you
16:12:47 <rm_work> You mean breaking through the process wall like koolaid man
16:12:58 <johnsom> Yes, that exactly, lol
16:13:06 <rm_work> Which is my preferred style of political negotiation :D
16:13:47 <johnsom> Well, it was a good thing, as we should have truth in advertising, and really no one was maintaining those branches anymore.
16:14:57 <johnsom> I also spent some time looking at the CentOS amphora images to see if I could find any tricks to speed up boot under qemu tcg. I achieved a huge improvement of 17 seconds.
16:15:12 <johnsom> Which means, it still takes four minutes to boot and is still a problem.
16:16:04 <rm_work> lol
16:16:10 <rm_work> Well that's something
16:16:36 <johnsom> Yeah, not worth the trouble really.
16:17:05 <johnsom> Ok, any other progress reports or updates?
16:17:13 <johnsom> ataraday_ Thanks for the patches!
16:17:29 <cgoncalves> I spent some time working on diskimage-builder to add CentOS 8 Stream support. CentOS Stream is the rolling pre-release of RHEL 8 and CentOS 8. There's a WIP patch on the Octavia side that builds an amphora and runs the tests, all successful
16:17:44 <johnsom> Nice
16:17:49 <rm_work> Cool
16:17:51 <cgoncalves> I also continued to review johnsom's monster patch aka failover refactor patch
16:18:04 <rm_work> Yeah we need to get that in :)
16:18:06 <johnsom> Yeah, I have done a few comment update spins on that
16:18:36 <rm_work> We've been running it in prod for over a month now? Multiple months maybe?
16:18:45 <rm_work> It's been good
16:18:55 <johnsom> Nice, that is good feedback.
16:19:21 <johnsom> For the most part, the comments have been about minor issues. I think the biggest change was adding retry timeouts to the configuration file.
16:20:13 <johnsom> Based on the PTG feedback
16:20:31 <johnsom> Ok, if there are no more updates, we can move on to "metrics"
16:20:35 <johnsom> #topic metrics
16:20:53 <johnsom> rm_work You have the conn
16:21:18 <rm_work> Alright
16:21:35 <rm_work> So, we're picking up this task!
16:21:52 <rm_work> We discussed it briefly last night, and it seems it's essentially three parts
16:22:43 <rm_work> ... and my irc window doesn't want to scroll back that far, apparently
16:23:00 <rm_work> anyway, we think it's essentially:
16:23:15 <rm_work> 1) Add new metrics at the system level (for example, RAM usage, CPU usage)
16:24:22 <rm_work> 2) Transition to sending deltas instead of absolute totals, where it makes sense (for things like total connections and transferred bytes, but probably not for currently active connections or the system stats)
16:25:22 <rm_work> 3) Rework/improve the driver layer to allow running multiple metrics storage drivers at once, and probably add at least one new driver for shipping metrics somewhere like influxdb
16:26:01 <johnsom> +1 That is the list I am aware of
16:27:00 <rm_work> We'll probably tackle 1 and 2 first
16:27:23 <rm_work> The discussion topic today though is basically -- can we brainstorm what we actually want for #1?
16:27:34 <johnsom> Yeah, those should go together nicely with a heartbeat protocol version bump
16:28:15 <johnsom> yeah, that is a good question.
16:28:54 <rm_work> i listed the two i can think of off the top of my head
16:29:16 <johnsom> My first thought is percentages. Simply because the agent would have the best information about the nova flavor of the instance.
16:29:36 <rm_work> yeah, definitely thinking percentages
16:29:56 <johnsom> Ah, you meant which metrics. Yeah, RAM and CPU are on the top of my list. I personally don't have any others.
16:29:58 <rm_work> which does mean those numbers would be absolute, not deltas
16:30:06 <johnsom> Correct
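A minimal sketch of what that agent-side collection of percentages could look like, assuming psutil were added to the amphora image (it is not part of the current agent, and the helper name is illustrative):
    # Hypothetical helper for the amphora agent; psutil is an assumed
    # dependency. Reporting percentages means the control plane does not
    # need to know the nova flavor of the instance.
    import psutil

    def get_system_utilization():
        """Return CPU and RAM utilization as percentages (0-100)."""
        return {
            # cpu_percent() blocks for the interval to take a usage sample.
            'cpu_utilization': psutil.cpu_percent(interval=1),
            'memory_utilization': psutil.virtual_memory().percent,
        }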
16:30:30 <rm_work> yeah is there anything else useful we could collect?
16:30:35 <cgoncalves> disk? local logs can pile up
16:30:38 <rm_work> hmm
16:31:28 <rm_work> as an admin, having an at-a-glance of the disk might be useful in that specific situation
16:31:37 <johnsom> Personally I think we have other ways to address that (log offloading and hourly rotation), but we have seen one case where some other issue in the cloud filled the system log file with garbage.
16:31:49 <rm_work> but i don't know how generally useful that'd be in the 99% case for a user
16:32:23 <cgoncalves> ok, it was just a thought. we can add later if we want to
16:32:26 <rm_work> i guess we should clarify the goal
16:32:53 <rm_work> I THINK what we're trying to do is add metrics that would allow one essential insight: how much "capacity" does my LB have left
16:33:20 <johnsom> Yeah, my goal for those is to get us a step closer to auto-scaling
16:33:25 <rm_work> and really, I am considering formulas that we could use to turn that into one easily digestible number
16:33:44 <rm_work> which is apparently what AWS does with ELB
16:34:41 <johnsom> Initially I'm not sure we should even add the "system" metrics to the API. Simply because they have little to no meaning for other provider drivers
16:35:15 <rm_work> Yeah I think I agree
16:35:20 <johnsom> And I don't want us to get in a strange situation when we enable active/active.
16:35:36 <rm_work> We should just collect at first
16:35:41 <johnsom> +1
16:36:07 <rm_work> which actually simplifies the task a bit :D
16:36:25 <rm_work> then we can handle what to do with those new metrics in step 3, when we ship them somewhere
16:37:27 <aannuusshhkkaa> AWS offers read/write bandwidth, idle time, latency on EBS. Do we want to offer features like that?
16:37:47 <johnsom> Correct. Maybe, if we want to give some indication to the end user, we could consider adding a "HIGH LOAD" operating status, but I would consider that #4 or #20 on the list.
16:38:11 <rm_work> I wonder if we can actually have any idea what percentage of read/write bandwidth is actually being used
16:38:23 <rm_work> that would require an operator config setting, possibly
16:38:37 <johnsom> Right, that is a hard one given neutron can't usually come close to what we can handle.
16:39:04 <rm_work> yeah, and even if we know which NIC is in a HV, we don't know what bandwidth is like on the rest of the VMs that live there
16:39:08 <johnsom> We do have bytes in/out and with deltas you could calculate the rate
16:39:37 <rm_work> ah, yeah... how do we do deltas, exactly? that is one of my major concerns
16:39:46 <rm_work> there's a few concerns there actually
16:40:02 <rm_work> firstly, HOW? do we *reset* haproxy's counters constantly?
16:40:12 <rm_work> do we keep an internal tracker in the agent?
16:40:32 <rm_work> I suppose that's just going to be some research
16:40:44 <johnsom> Yeah, my expectation is the agent will keep the previous value in memory and calculate the delta
16:41:01 <rm_work> also, since we use UDP, do we just... hope all the packets get there, and possibly under-report?
16:41:25 <johnsom> Yes, this would be a "may under report in some cases" scenario
16:41:39 <rm_work> we have a sequence number, so on the control plane we could actually tell if we're missing packets and try to do some fill based on the points on either side... but that could be wrong too
16:41:51 <rm_work> and better to under-report than over-report i guess
16:42:19 <rm_work> also don't want to hugely increase the workload on the heartbeat ingestion
16:42:20 <johnsom> Right and complex. We do already have a sequence number in the heartbeat message. We just don't use it for more than a nonce
16:42:34 <rm_work> right
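A rough sketch of the in-memory delta approach described above, with illustrative counter names; the reset guard is an assumption about how a haproxy reload (which zeroes its counters) could be handled:
    # Keep the last absolute counters in agent memory and report only the
    # difference since the previous heartbeat.
    _previous_counters = {}

    def calculate_deltas(current_counters):
        """Return per-counter deltas since the last heartbeat."""
        deltas = {}
        for name, value in current_counters.items():
            last = _previous_counters.get(name, 0)
            # If a counter went backwards (e.g. haproxy reloaded), treat the
            # current value itself as the delta rather than report a negative.
            deltas[name] = value - last if value >= last else value
            _previous_counters[name] = value
        return deltas
For example, if total connections move from 1500 to 1800 between heartbeats, the agent would report a delta of 300.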
16:43:24 <johnsom> FYI, here are the metrics haproxy can report:
16:43:26 <johnsom> #link http://cbonte.github.io/haproxy-dconv/1.8/management.html#9.1
16:43:39 <johnsom> However, the LVS UDP side cannot support most of those
16:44:17 <rm_work> yeah...
16:44:18 <johnsom> So until we can switch out the UDP engine, that may constrain what we report or we need to call out the limitations.
16:44:49 <johnsom> I also want to make sure we are careful to not put in things that other drivers don't support. I.e. no haproxy specific metrics.
16:45:00 <rm_work> ah, hrsp_4xx, hrsp_5xx, etc. were mentioned
16:45:20 <rm_work> but I don't know if we want to try to ship those from haproxy, or allow those to be calculated by a user via log analysis
16:45:47 <johnsom> ereq and econ are also candidates
16:45:50 <rm_work> I believe other things should be able to report those for HTTP type stuff
16:46:04 <rm_work> we already ship ereq :)
16:46:17 <johnsom> Ah, ok, so .... grin
16:46:31 <rm_work> maybe hanafail?
16:46:35 <rm_work> "failed health checks details"
16:46:42 <rm_work> that is one other use-case
16:46:53 <rm_work> but also, can be handled by logs
16:47:20 <johnsom> Yeah, that is in the flow logs
16:47:52 <johnsom> We also need to keep in mind the heartbeat message size. I think it is limited to 64k at the moment. That includes both stats and status
16:48:16 <johnsom> Not that we can't change that, but just a consideration
16:48:35 <rm_work> lbtot for members would be interesting
16:48:50 <rm_work> "total number of times a server was selected, either for new
16:48:50 <rm_work> sessions, or when re-dispatching"
16:49:16 <johnsom> Yeah, that is per-member hits
16:49:17 <rm_work> but anyway, I suppose we can move on, could be here all day :D
16:49:37 <rm_work> and also, the user could get that info *from their members* :D
16:49:54 <johnsom> Lots of goodies, but we need to be conservative
16:49:59 <rm_work> yeah
16:50:09 <johnsom> Or the flow logs, it's in there
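For context, a sketch of how counters such as ereq, econ, lbtot, and hrsp_5xx can be read from haproxy's stats socket as CSV; the socket path here is an assumption, and the column names are the ones documented in the management guide linked above:
    import csv
    import socket

    HAPROXY_STATS_SOCKET = '/var/lib/octavia/haproxy.sock'  # assumed path

    def show_stat(socket_path=HAPROXY_STATS_SOCKET):
        """Return 'show stat' rows as dicts keyed by CSV column name."""
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.connect(socket_path)
            sock.sendall(b'show stat\n')
            data = b''
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                data += chunk
        # The first line is the CSV header, prefixed with '# '.
        lines = data.decode('utf-8').lstrip('# ').splitlines()
        return list(csv.DictReader(lines))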
16:50:35 <rm_work> alright, last thing would be, does anyone ELSE want to work on #3? since that also could probably be done in parallel
16:50:55 <rm_work> (updating to allow multiple drivers to be used at once, and adding one for influxdb or similar)
16:51:20 <johnsom> The hard work there is defining the interface really
16:51:26 <rm_work> if not, we can look at handling that after we wrap up 1+2
16:51:56 <rm_work> I think the interface is already defined -- technically it's already a driver layer?
16:52:07 <rm_work> and it takes "our health message" :D
16:52:14 <rm_work> unless you are saying you want to rework that
16:52:22 <rm_work> and actually do some level of pre-parsing first
16:53:05 <johnsom> Yeah, I was trying to remember what the content was. It's a de-wrapped heartbeat JSON, isn't it?
16:53:09 <rm_work> that would require a decent refactor -- it'd basically mean shifting 90% of the current "update_db" code up above the driver layer
16:53:16 <rm_work> which maybe should be done
16:53:21 <rm_work> because that doesn't really make sense
16:53:29 <rm_work> we should have all that pre-parsing outside of the drivers
16:53:44 <rm_work> and the "update_db" part should literally just be taking the final stats struct, and ... updating the DB
16:54:08 <johnsom> Yeah, we should be able to rev the message format version without requiring all of the drivers to respin IMO. If we can avoid it.
16:54:10 <rm_work> it's actually pretty badly organized
16:54:15 <rm_work> yeah
16:54:21 <rm_work> ok so maybe we MOVE the driver layer there
16:54:35 <rm_work> do you think it'd be ok to break our existing plugin agreement there?
16:54:46 <rm_work> i doubt anyone is using it?
16:55:10 <johnsom> It is not a published interface today. We don't document it.
16:55:12 <rm_work> it's internal to the amphora-driver
16:55:15 <rm_work> alright
16:55:24 <rm_work> so we'll probably reorganize that first
16:55:44 <rm_work> which i guess actually means parts of #3 will be #0
16:55:48 <johnsom> But do keep in mind, we have a stats interface for the provider drivers too
16:55:57 <rm_work> k
16:56:20 <rm_work> yeah but i believe it is already totally different from the interface i'm referring to
16:56:34 <johnsom> Yeah, it is a bit different
16:57:38 <rm_work> https://github.com/openstack/octavia/blob/master/octavia/amphorae/drivers/health/heartbeat_udp.py#L32-L47
16:57:45 <rm_work> I am referring to that one
16:58:08 <rm_work> because currently 100% of the logic that parses the packet lives in the "update_db" driver
16:58:10 <johnsom> Yeah, that will need improvement
16:58:12 <rm_work> which is ... not correct
16:58:39 <rm_work> that should all happen well before it passes to a driver, and what it should pass is a final structure with data
16:59:25 <johnsom> I am talking about:
16:59:27 <johnsom> #link https://github.com/openstack/octavia/blob/master/octavia/api/drivers/driver_agent/driver_updater.py#L139
17:00:27 <rm_work> yeah, that's already closer to what an "update_db" driver SHOULD be tho
17:00:46 <rm_work> so right inside there, we can actually share the driver layer and the struct we pass, I think
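A hypothetical sketch of the split being discussed: decode the heartbeat once, above the driver layer, and fan the resulting structure out to every enabled statistics driver. Class and method names are illustrative only, not the existing interface:
    import abc

    class StatsDriverBase(metaclass=abc.ABCMeta):
        """Interface each statistics driver (DB, InfluxDB, ...) implements."""

        @abc.abstractmethod
        def update_stats(self, listener_stats):
            """Persist a list of pre-parsed per-listener stats dicts."""

    class LogStatsDriver(StatsDriverBase):
        """Trivial example driver; DB and InfluxDB drivers would be peers."""

        def update_stats(self, listener_stats):
            for stats in listener_stats:
                print(stats)

    def handle_stats(listener_stats, drivers):
        """All packet parsing and delta handling happens before this call."""
        for driver in drivers:
            driver.update_stats(listener_stats)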
17:00:49 <johnsom> Ok, we are out of time today. Thanks for the great discussion and work on metrics!
17:01:03 <rm_work> driver layer should be here: https://github.com/openstack/octavia/blob/master/octavia/api/drivers/driver_agent/driver_updater.py#L166-L167
17:01:06 <rm_work> o/ thanks everyone
17:01:17 <johnsom> #endmeeting