16:00:14 <johnsom> #startmeeting Octavia
16:00:15 <openstack> Meeting started Wed Jul  1 16:00:14 2020 UTC and is due to finish in 60 minutes.  The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:16 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:18 <openstack> The meeting name has been set to 'octavia'
16:00:25 <rm_work> o/
16:00:26 <johnsom> Hi everyone
16:00:28 <ataraday_> hi
16:00:29 <cgoncalves> hi
16:00:56 <johnsom> #topic Announcements
16:01:06 <johnsom> I don't really have any announcements this week.
16:01:35 <johnsom> We finally got a stable/train release out the door. I think it had been nine months since the last release.
16:01:40 * johnsom shames himself
16:01:56 <johnsom> Any other announcements this week?
16:02:56 <johnsom> #topic Brief progress reports / bugs needing review
16:03:43 <ataraday_> I created a patch with an experimental job for amphorav2 https://review.opendev.org/#/c/737993/ and found out that several things are broken.
16:03:52 <ataraday_> #link https://review.opendev.org/#/c/738609/
16:04:03 <johnsom> Aside from reviews, rebases, and the occasional small bug fix, I have been focusing on failover for amphorav2. It's a bit slow going, but making progress.
16:04:42 <ataraday_> there is also an issue with barbican tls, still looking into it..
16:04:54 <johnsom> I think after we have that landed there are going to be some patches to simplify and clean up stuff. I'm trying to not go crazy doing that now as I don't want another monster patch and want to move faster on this patch.
16:05:08 <johnsom> Nice, thank you.
16:05:12 <ataraday_> johnsom, Thanks for proposing failover for amphorav2!
16:05:23 <johnsom> I'm also looking into why the IPv6 job is timing out. sigh
16:05:37 <johnsom> ataraday_ Still a lot to be done, but work in progress
16:06:36 <cgoncalves> I extended the amphora flavor capabilities to add amp_image_tag. this is useful for multi-architecture clouds and testing amphora images in staging environments, for example. as part of that work I also removed some deprecated options and created an image driver interface (noop and glance drivers)
16:07:05 <johnsom> Yeah, nice! I think I have reviewed about half of that now
16:07:19 <rm_work> I'm revisiting the failover threshold thingy
16:07:31 <rm_work> #link https://review.opendev.org/#/c/656811/
16:07:45 <johnsom> Nice, also helpful
16:07:46 <cgoncalves> I successfully deployed via devstack Octavia in an arm64/aarch64 system, although the amphora agent is not coming up in Zuul CI. this is a side project, thus low priority for me
16:08:10 <rm_work> isn't TrevorV working on that stuff? aarch64
16:08:26 <johnsom> Trevor is working on power
16:08:32 <rm_work> ahh he's on ppc, ok
16:09:04 <johnsom> There was a mailing list post about the ppc work and Octavia I mentioned last week. I think I was the only one to reply
16:09:10 <cgoncalves> OpenDev has aarch64 nodepool nodes, thanks to Linaro!
16:09:16 <rm_work> what's a "mailing list"
16:09:21 <johnsom> lol
16:09:32 <rm_work> I forgot what email was after Google killed Inbox
16:09:33 <rm_work> RIP Inbox
16:09:34 <johnsom> You sign up for coupons with one
16:09:43 <rm_work> RIP email
16:09:44 <cgoncalves> IT'S A TRAP!
16:10:18 <rm_work> ooo nice, aarch64 testing resources are good :) do we also have ppc?
16:10:47 <cgoncalves> I found a clever way to run our CI jobs on nested-virt enabled nodes, cutting job time from close to 2 hours down to as low as 38 minutes
16:10:53 <cgoncalves> #link https://review.opendev.org/#/c/738246/
16:10:58 <johnsom> I don't think so, not in zuul at least. Red Hat has some ppc stuffs
16:11:01 <rm_work> yeah I just +2'd that, seems workable
16:11:35 <cgoncalves> yeah, we have also ppc systems internally but, as you said, TrevorV has been working on that
16:12:07 <johnsom> Yeah, I don't have cycles to put into that at the moment
16:12:09 * rm_work is just enjoying causing TrevorV's IRC client to possibly ping
16:12:12 <TrevorV> Woah woah woah!  I haven't touched arm
16:12:18 <rm_work> haha there we go
16:12:26 <TrevorV> What'd I miss now?
16:12:28 <rm_work> yes, johnsom pointed out it was ppc :D
16:12:34 <johnsom> Arm either, though I have run our code on a raspberry pi 4 with very good results.
16:12:38 <rm_work> I just remembered you were on some alternate arch
16:12:58 <cgoncalves> TrevorV, we were saying you've been working on ppc, not arm
16:13:52 <rm_work> ok so moving on
16:13:58 <johnsom> Yep
16:14:09 <johnsom> #topic Open Discussion
16:14:17 <johnsom> Someone seems anxious....
16:14:44 <johnsom> Any other topics this week?
16:15:16 <cgoncalves> I asked just before the meeting started a question about the amphorav2. would now be a good time to discuss it?
16:16:01 <johnsom> Sure
16:16:13 <cgoncalves> ok, I'll paste the question again:
16:16:16 <cgoncalves> have we considered either renaming the amphorav2 provider to amphora or aliasing amphora->amphorav2 once we switch the default to amphorav2?
16:16:21 <johnsom> I think it is a logical choice to make that change at some point, but given the extra infrastructure I'm not sure aliasing amphorav2 to amphora is a good idea.
16:16:29 * johnsom pastes his answer again
16:16:50 <cgoncalves> ok, so you're not for aliasing. are you then for renaming or?
16:17:07 <johnsom> That said, I might look at if we can have the v2 path jobboard/extra requirements optional. It seems like it should be do-able
16:18:05 <johnsom> Interested in what people think....
16:18:15 <rm_work> Yeah I had some concerns here
16:18:51 <rm_work> If we deprecate/remove the amphora (v1) driver, we are going to be explicitly choosing to make our default deployment more complex than previously (introducing a second required service)
16:19:11 <rm_work> if we DON'T, we will forever have two code-paths, and that sucks, so I don't think it's really an option T_T
16:19:51 <johnsom> Yeah, we have to deprecate the v1 path. It's just not workable long term IMO
16:19:53 <rm_work> SO going from that: if we deprecate the old version, and force people over to the new version, given the new complexity/requirements, they MUST explicitly switch over for it to work
16:20:49 <rm_work> so for people who do upgrades without reading release-notes, if they upgrade from say X->Y release and v1 becomes v2, it's just going to explode but not "cleanly"/"clearly"
16:21:19 <rm_work> rather than "the provider does not exist" it'll be some error nestled down inside the worker where it tries and fails to connect to unconfigured redis
16:22:18 <johnsom> We can put some tooling in to warn people or make it obvious early. I think Ann already has updated the upgrade check tool.
16:22:33 * johnsom notes, upgrade check tool nobody runs....
16:22:35 <cgoncalves> ataraday_ has a pre-upgrade check for the amphorav2 provider that may be handy
16:22:37 <cgoncalves> #link https://review.opendev.org/#/c/735556/
16:22:41 <rm_work> yeah
16:23:26 <johnsom> We can set it up to check on startup and fail to start if it's not present as well.
16:23:52 <rm_work> fail to start if a flavor is using a provider that doesn't exist?
16:24:38 <johnsom> Or the default is v2, yeah, basically.  It's ugly, but possible
16:25:59 <rm_work> anyway, yeah, if we rename/alias automatically then it really hides stuff from the operator that I don't think we should be hiding
16:26:41 <johnsom> So I think the original question was about the name. I assume for upgrade reasons. Is there a need to rename it from amphorav2?
16:26:54 <cgoncalves> amphorav2 should be able to manage amphorav1-created resources, correct? when we remove the amphorav1 code, we could either do a db migration to change the provider driver of existing LBs to amphorav2 or rename amphorav2 to amphora (with the bonus that we don't potentially break things for the user as the provider name would not change)
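The rename/alias option cgoncalves raises could be as small as a name-resolution step before driver lookup. A minimal sketch of the idea; the table and function names are illustrative assumptions, not Octavia's actual provider registry (which loads drivers via stevedore entry points):

```python
# Hypothetical alias table: legacy provider name -> current driver name.
# This only illustrates the alias idea from the discussion; it is not
# Octavia's real driver-loading code.
PROVIDER_ALIASES = {"amphora": "amphorav2"}

def resolve_provider(name):
    """Map a legacy provider name to its replacement, pass others through."""
    return PROVIDER_ALIASES.get(name, name)
```

With an alias like this, existing load balancers recorded with provider "amphora" would load the amphorav2 driver without needing a database migration.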
16:26:56 <rm_work> i mean, it does look kinda weird
16:27:52 <rm_work> I guess so, with the caveat that we prevent the service from starting if everything for amphorav2 isn't configured properly?
16:28:05 <rm_work> which kinda handles that in a more up-front way
16:28:13 <cgoncalves> my question comes from a deployment side as I started work to support the amphorav2 in tripleo
16:28:47 <rm_work> yeah ok i'm coming around, maybe we do alias it, and deal with my concerns via the service startup checking its config is valid
16:29:11 <rm_work> same as "can i connect to SQL" and "can I connect to RMQ", add "can I connect to Redis/Zookeeper"
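The startup check rm_work describes could be a plain TCP reachability probe run before the service begins taking work. A standard-library-only sketch (the function name and defaults are assumptions for illustration):

```python
import socket

def backend_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to a required backend (SQL, RMQ,
    or the jobboard's Redis/ZooKeeper) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A controller calling this at startup could refuse to start with a clear "cannot reach Redis" error, instead of failing later deep inside a provisioning flow as described above.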
16:29:15 <johnsom> I think we should also look at making the "jobboard" part of the v2 driver optional.
16:29:24 <rm_work> hmm that is the other possibility
16:29:38 <rm_work> I actually would prefer that but I didn't know if THAT was feasible, as it's pretty baked in
16:29:42 <johnsom> We should be able to run the new flows just like we have, without the need for redis, and have a bit-flip for jobboard
16:30:06 <rm_work> yeah ok I think that is my #1 preferred option -- and then yes, alias amphora to amphorav2
16:30:09 <johnsom> Really it's about how we launch the flows.
16:30:21 <rm_work> and keep the default to False for "use_jobboard"
16:31:00 <ataraday_> without jobboard we won't be able to resume jobs
16:31:18 <rm_work> right
16:31:19 <ataraday_> so why would we need that?
16:31:27 <johnsom> Right, that would be the trade off of that setting
16:31:35 <rm_work> so Octavia is still possible to run without Redis
16:31:41 <ataraday_> in this case we might as well keep amphora
16:31:45 <rm_work> and using no additional upgrade requirements
16:32:12 <openstackgerrit> Pierre Riteau proposed openstack/octavia master: Add debootstrap installation instructions for CentOS  https://review.opendev.org/738885
16:32:15 <rm_work> basically that allows for what I wanted as far as "keeping v1 and v2 around" except we don't ACTUALLY have to keep v1 around, and consolidate code paths
16:32:29 <rm_work> because the ONLY advantage of v1 was not requiring Redis
16:34:38 <johnsom> Yeah, I really don't want to keep the v1 code around. That is asking for mistakes to be made and doubles work effort (I'm feeling the pain now, lol)
16:35:16 <rm_work> yep
16:35:31 <rm_work> i absolutely do not want to keep it around, for the record
16:35:56 <cgoncalves> +1. amphorav2++
16:35:59 <rm_work> I just hated the idea of having no way to run without jobboard (if you want a simpler install, with the possibility of stuff stuck in PENDING)
16:36:09 <johnsom> So, maybe in parallel someone can look at making that config setting? Or I could look at it after failover is done.
16:37:09 <johnsom> I think it's just making a method that runs the flows like the v1 driver does or like the v2 driver does, depending on the config setting.
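The method johnsom sketches could be a single dispatch point keyed off a config flag. This is a rough illustration only; the flag name and the two runner callables are assumptions, not Octavia's or taskflow's real interfaces:

```python
def run_flow(flow_factory, store, jobboard_enabled, post_to_jobboard, run_inline):
    """Run a flow either via the jobboard (resumable, requires
    Redis/ZooKeeper) or inline in-process (v1-style, no extra services)."""
    if jobboard_enabled:
        # Persisted job: another controller can claim and resume it if
        # this one dies mid-flow.
        return post_to_jobboard(flow_factory, store)
    # No persistence: a controller crash mid-flow can leave the LB stuck
    # in PENDING_*, the trade-off discussed above.
    return run_inline(flow_factory, store)
```

The point of the design is that everything above this dispatch (the v2 flows themselves) stays shared, so only the launch mechanism differs.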
16:38:03 <cgoncalves> I have little to no (more like none TBH, lol) understanding of the amphorav2/jobboard but I could maybe take a look
16:38:31 <ataraday_> this may make code really twisted
16:38:46 <johnsom> cgoncalves Ok cool. I can give you a few pointers to what I'm thinking.
16:38:52 * cgoncalves l
16:38:56 <cgoncalves> oops!
16:39:24 <rm_work> yeah, my concern (and why I didn't ask for this originally) was that it might not even be possible or would make things incredibly more complex within v2
16:39:33 <johnsom> ataraday_ Yeah, I may be overlooking something, but I think we should give it a shot.
16:40:21 <aannuusshhkkaa> we needed feedback on metric selection for the amphoras.. can we take that up next?
16:40:28 <rm_work> ok, so decision was: yes to alias, and try to make v2 have a "jobboardless" option?
16:40:45 <cgoncalves> if feasible, I'd advocate for use_jobboard=true as default. devstack can set redis up, and it's our recommendation for production environments, right?
16:40:53 <johnsom> Or was it "alias if we can make the jobboard part optional"?
16:40:58 <rm_work> hmm maybe that
16:41:20 <rm_work> yes, let's plan to take that topic up next aannuusshhkkaa!
16:41:21 <johnsom> Yeah, we should push for it as the default
16:41:33 <rm_work> I am unsure I agree
16:42:12 <rm_work> but I guess it is as simple as "oh, my service won't start due to an error that says I don't have redis configured and might need to turn off use_jobboard" and then they do that
16:42:21 <johnsom> I think it is a question of timing.
16:42:24 <cgoncalves> rm_work, before this conversation, amphorav2 was already set to become the default in Victoria or later release so either way Redis would be required
16:43:21 <rm_work> yeah which I never liked
16:43:24 <johnsom> We need to make a call by the end of ms2 because we need to send an e-mail out about the need for Redis so the deployment tools have time to add it.
16:43:36 <rm_work> ok
16:43:38 <cgoncalves> you get the bonus now that there may be a chance jobboard/redis won't be mandatory :)
16:43:47 <rm_work> let's do some research on this and see if it's even possible to make it optional?
16:44:02 <rm_work> then revisit before ms2
16:44:05 <johnsom> Ok, yeah, I agree
16:44:12 <johnsom> MS2 is coming up fast though
16:44:33 <rm_work> kk
16:44:35 <johnsom> Week of July 27th
16:44:39 <rm_work> can we move on to metrics then?
16:44:44 <aannuusshhkkaa> yes!
16:44:47 <johnsom> #link https://releases.openstack.org/victoria/schedule.html
16:44:47 <aannuusshhkkaa> :D
16:45:09 <cgoncalves> metrics, that's why you were anxious earlier :D
16:45:14 <johnsom> I think so. What is up with metrics? I know there is a patch that needs some reviews
16:45:17 <rm_work> heh
16:45:22 <aannuusshhkkaa> here is the list of metrics we are thinking of implementing:
16:45:22 <aannuusshhkkaa> Must Haves:
16:45:22 <aannuusshhkkaa> CPU Usage (Current %)
16:45:22 <aannuusshhkkaa> Load Averages
16:45:22 <aannuusshhkkaa> RAM Usage (some combo of: total / free / available / cached / etc)
16:45:23 <aannuusshhkkaa> Nice to Haves:
16:45:23 <aannuusshhkkaa> Disk usage (used/free? or used%?)
16:45:24 <aannuusshhkkaa> Random extra HAProxy fields (taking suggestions)
16:45:25 <aannuusshhkkaa> are we good on the ones we have selected? are we missing something? have we included something that isn’t plausible?
16:45:26 <rm_work> yes, we have one patch up that changes the interfaces around slightly
16:45:53 <rm_work> ack, in the future you should paste multi-line stuff into something like http://paste.openstack.org/
16:46:18 <johnsom> Do we need load averages? Personally I would like to keep the data minimal, so just %'s
16:47:01 <rm_work> I feel like it's generally more useful than "point in time" CPU usage
16:47:03 <johnsom> Yeah, you can get booted off the server for too many multi-line pastes.
16:47:22 <aannuusshhkkaa> rm_work gotcha
16:48:16 <aannuusshhkkaa> johnsom, haha okay! thanks for the correction..
16:48:32 <johnsom> Ok, if you have a use for it.
16:48:48 <rm_work> Well, it's super easy for a single point in time CPU % to be totally off
16:48:53 <johnsom> aannuusshhkkaa The server can consider you a spam bot basically.
16:49:09 <aannuusshhkkaa> yeap that makes sense.. will keep that in mind..
16:49:43 <johnsom> Yeah, but we will get samples every 10 seconds or so.
16:49:48 <rm_work> whereas load averages are very nice locally generated averages that we can keep collecting at a much slower interval and still have them be useful
16:50:09 <rm_work> yeah if we were collecting every second, doing our own averages might make more sense, but...
16:50:40 <johnsom> Yeah, maybe. You would have to then get the number of cores to calculate a percent
16:51:12 <rm_work> hmm yes that is true
16:51:30 <rm_work> possibly do THAT locally? and return load average %s
16:51:41 <rm_work> rather than have to ship that info up and calculate later
16:52:05 <johnsom> That could work.
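Computing the normalization on the amphora, as rm_work suggests, takes only the standard library. A sketch under the assumption that the agent reports percentages (the function name is made up for illustration):

```python
import os

def load_average_percent():
    """1/5/15-minute load averages normalized by core count, as percents,
    so differently sized amphorae report comparable numbers."""
    cores = os.cpu_count() or 1  # guard against None on odd platforms
    return tuple(round(avg / cores * 100.0, 1) for avg in os.getloadavg())
```

Note `os.getloadavg()` is Unix-only, which is fine for the Linux-based amphora image.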
16:52:17 <rm_work> anyway, we will look into options for that -- any comments about the others? RAM numbers that are actually useful?
16:52:26 <rm_work> Linux "free memory" is basically a useless metric
16:52:29 <rm_work> AFAICT
16:52:47 <rm_work> but I don't know exactly which combination of RAM metrics is *actually* most useful
16:53:02 <johnsom> Yeah, really I'm looking for % of available memory
16:53:13 <johnsom> available/total
16:53:28 <johnsom> cache and free, not so useful to me
16:55:16 <aannuusshhkkaa> okay.. what about disk usage? used %s?
16:56:20 <rm_work> I would assume used% is possibly better? though I also wonder whether, without context, that might not be so useful: it says 50% and then you wonder "at what rate will that fill"
16:56:46 <johnsom> Rate is why you have time series
16:57:08 <johnsom> IMO
16:57:25 <johnsom> timeseries/deltas
16:58:12 <johnsom> A simple scaling driver could have just thresholds and ignore rate. A fancy driver could build a rate and make decisions.
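Putting both comments together: the amphora reports a point-in-time used%, and a "fancy" driver derives the fill rate from the time series it already has. A sketch with illustrative names (not a real Octavia scaling driver interface):

```python
import shutil

def disk_used_percent(path="/"):
    """Point-in-time disk usage as a percent of total capacity."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0

def fill_rate(samples):
    """Percent-per-second fill rate from (timestamp, used_pct) samples;
    enough for a threshold-plus-rate scaling decision."""
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    return (p1 - p0) / (t1 - t0)
```

A simple driver compares `disk_used_percent` against a threshold; a fancier one also checks `fill_rate` over its collected samples before deciding to scale.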
16:58:59 <johnsom> Then at some point someone can add AI and ML, then build a model of what works and doesn't. Then we retire
16:59:14 <aannuusshhkkaa> lol
16:59:29 <rm_work> so that means... percentage? or used/total? lol
16:59:39 <johnsom> Oh, just about out of time for the meeting. We can continue after if you still have questions.
16:59:47 <johnsom> Or defer to next week.
16:59:52 <aannuusshhkkaa> yes we do..
17:00:04 <aannuusshhkkaa> next week would probably be a little too late..
17:00:10 <johnsom> #endmeeting