20:00:03 <johnsom> #startmeeting Octavia
20:00:04 <openstack> Meeting started Wed Feb 27 20:00:03 2019 UTC and is due to finish in 60 minutes.  The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:05 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:00:07 <openstack> The meeting name has been set to 'octavia'
20:00:11 <johnsom> Hi folks
20:00:17 <colin-> hi
20:00:27 <nmagnezi> o/
20:00:46 <johnsom> #topic Announcements
20:01:00 <cgoncalves> hi
20:01:07 <johnsom> The TC elections are on. You should have received an e-mail with your link to the ballot.
20:01:28 <johnsom> The octavia-lib feature freeze is now in effect.
20:01:47 <johnsom> I have also released version 1.1.0 for Stein with our recent updates.
20:02:07 <colin-> nice
20:02:18 <johnsom> And the most important, NEXT WEEK IS FEATURE FREEZE FOR EVERYTHING ELSE
20:03:18 <johnsom> As usual, we are working against the priority list:
20:03:27 <johnsom> #link https://etherpad.openstack.org/p/octavia-priority-reviews
20:03:45 <johnsom> Any other announcements today?
20:04:22 <johnsom> #topic Brief progress reports / bugs needing review
20:04:58 <johnsom> I have mostly been focused on the TLS patch chains.  The TLS client authentication patches have now merged. They work well in my testing.
20:05:30 <johnsom> I'm currently working on the backend re-encryption chain. I hope I can finish that up today, give it a test, and we can get that merged too.
20:06:39 <johnsom> If all goes well, I might try to help with the volume-backed storage patch and see if we can get it working for Stein. I created a test gate, but the patch fails...
20:07:27 <johnsom> Any other updates?
20:07:32 <cgoncalves> I have been working on multiple fronts
20:07:44 <xgerman> o/
20:07:45 <cgoncalves> 1. RHEL 8 DIB and amphora support (tempest tests passing)
20:07:48 <cgoncalves> #link https://review.openstack.org/#/c/623137/
20:07:49 <colin-> appreciate the oslo merge, rebuilt and running at that point in master now
20:07:54 <cgoncalves> #link https://review.openstack.org/#/c/638581/
20:08:01 <cgoncalves> 2. Allow ERROR'd load balancers to be failed over
20:08:06 <cgoncalves> #link https://review.openstack.org/#/c/638790/
20:08:17 <cgoncalves> 3. iptables-based active-standby tempest test
20:08:18 <cgoncalves> #link https://review.openstack.org/#/c/637073/
20:08:36 <cgoncalves> 4. general bug fix backports
20:09:08 <johnsom> Cool, thank you for working on the backports!
20:09:32 <xgerman> +1
20:09:54 <cgoncalves> stable/rocky grenade job is sadly still broken. I apologize for not having invested much time on it yet
20:10:10 <johnsom> That is next on the agenda, I wanted to check in on that issue.
20:10:36 <johnsom> #topic Status of the Rocky grenade gate
20:11:03 <johnsom> I just wanted to get an update on that. I saw your note earlier about a potential cause.
20:11:10 <cgoncalves> right
20:11:13 <cgoncalves> #link https://review.openstack.org/#/c/639395/
20:11:26 <johnsom> Are you actively working on that or is it an open item?
20:11:38 <cgoncalves> ^ this backport now allows us to see what's going wrong when creating a member
20:11:50 <cgoncalves> that is where the grenade job is failing
20:11:57 <cgoncalves> the error is: http://logs.openstack.org/49/639349/5/check/octavia-grenade/461ebf7/logs/screen-o-cw.txt.gz?level=WARNING#_Feb_27_08_32_43_986674
20:12:27 <cgoncalves> the rocky grenade job started failing between Dec 14-17 if I got that right
20:13:00 <cgoncalves> so I'm wondering if https://review.openstack.org/#/c/624804/ is what introduced the regression
20:13:30 <cgoncalves> the member create call still fails on queens, not rocky
20:13:38 <xgerman> with all those regressions it looks like we are lacking gates
20:14:17 <cgoncalves> xgerman, speaking of that, your VIP refactor patch partially broke active-standby in master :P
20:14:27 <xgerman> I put up a fix
20:14:41 <johnsom> Yeah, not sure how the scenario tests passed but grenade does not.
20:14:52 <cgoncalves> xgerman, I don't see it. we can chat about that after grenade
20:15:17 <johnsom> xgerman It looks like in my rush I forgot to switch it off of amphorae....
20:15:22 <johnsom> lol
20:15:53 <xgerman> yeah, two small changes and it came up on my devstack
20:16:08 <cgoncalves> xgerman, ah, I see it now. you submitted a new PS to Michael's change
20:16:13 <xgerman> yep
20:16:13 <johnsom> Cool, I just rechecked my act/stdby patch which is set up to test that
20:16:23 <cgoncalves> #link https://review.openstack.org/#/c/638992/
20:16:42 <johnsom> #link https://review.openstack.org/#/c/584681
20:17:07 <johnsom> Ok, so cgoncalves you are actively working on the grenade issue?
20:17:37 <cgoncalves> johnsom, I will start actively working on it tomorrow, yes
20:18:06 <johnsom> Ok, cool. Thanks.  Just wanted to make sure we didn't each think the other was looking at it, when in reality neither of us was....
20:18:20 <johnsom> #topic Open Discussion
20:18:42 <johnsom> I have one open discussion topic, but will open the floor up first to other discussions
20:18:44 <cgoncalves> I'm sure you'll be looking at it too, at least reviewing ;)
20:19:09 <johnsom> Other topics today?
20:19:29 <johnsom> Ok, then I will go.
20:19:32 <colin-> would like to solicit guidance
20:19:34 <colin-> very briefly
20:19:44 <johnsom> Sure, go ahead colin-
20:20:23 <colin-> an increasing number of internal customers are asking about the performance capabilities of the VIPs we create with octavia, and we're going to endeavor to measure that really carefully in terms of average latency, connection concurrency, and throughput (as these all vary dramatically based on cloud hw)
20:21:06 <johnsom> Yes, I did a similar exercise last year.
20:21:08 <colin-> so, aside from economies of scale with multiple tcp/udp/http listeners, does anyone have advice on how to capture this information really effectively with octavia and its amphorae?
20:21:38 <colin-> and i'm hoping to use this same approach to measure the benefits of various nova flavors and haproxy configurations later in stein
20:23:00 <openstackgerrit> Vlad Gusev proposed openstack/octavia master: Add support for the oslo_middleware.http_proxy_to_wsgi  https://review.openstack.org/639736
20:23:03 <johnsom> Yeah, so I set up a lab: three hosts for traffic generation, three for content serving, one for the amp
20:23:27 <johnsom> I used iperf3 for the TCP (L4) tests and tsung for the HTTP tests
20:23:47 <openstackgerrit> Vlad Gusev proposed openstack/octavia master: Add support for the oslo_middleware.http_proxy_to_wsgi  https://review.openstack.org/639736
20:23:55 <johnsom> I wrote a custom module for nginx (ugh, but it was easy) that returned static buffers.
20:24:14 <colin-> did you add any monitoring/observability tools for visualizing?
20:24:18 <openstackgerrit> Vlad Gusev proposed openstack/octavia master: Add support for the oslo_middleware http_proxy_to_wsgi  https://review.openstack.org/639736
20:24:23 <colin-> or was shell output sufficient for your purposes
20:24:28 <johnsom> I did one series where traffic crossed hosts, one with everything on one host (eliminates the neutron issues).
20:24:45 <johnsom> tsung comes with reporting tools
20:25:00 <colin-> oh ok
20:25:02 <johnsom> I also did some crossing a neutron router vs. all L2
20:25:31 <johnsom> Then it's just a bunch of time tweaking all of the knobs
20:25:39 <colin-> good feedback, thank you
20:26:20 <johnsom> For the same-host tests, iperf3 with 20 parallel flows, 1vcpu, 1GB ram, 2GB disk did ~14gbps
20:27:02 <johnsom> But of course your hardware, cloud config, and butterflies flapping their wings in Tahiti all impact what you get.
20:27:13 <johnsom> caveat, caveat, caveat.....
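[A rough illustration of the kind of harness described above: a minimal Python sketch that drives iperf3 against a VIP and pulls the aggregate throughput out of its JSON report. The 20 parallel streams match the run quoted above; the VIP address, duration, and everything else are illustrative assumptions, not part of Octavia.]

```python
#!/usr/bin/env python3
"""Rough harness for the iperf3-based L4 test described above.

Assumes iperf3 is installed on the traffic-generation host and an iperf3
server is reachable behind the Octavia VIP. Names and values here are
illustrative only.
"""
import json
import subprocess

VIP_ADDRESS = "203.0.113.10"   # hypothetical VIP created by Octavia
PARALLEL_STREAMS = 20          # matches the 20-flow run mentioned above
DURATION_SECONDS = 30


def run_iperf3(vip, streams, duration):
    """Run one iperf3 TCP test and return aggregate throughput in Gbit/s."""
    result = subprocess.run(
        ["iperf3", "-c", vip, "-P", str(streams), "-t", str(duration), "-J"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    # "end.sum_received.bits_per_second" is the aggregate receive rate.
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


if __name__ == "__main__":
    gbps = run_iperf3(VIP_ADDRESS, PARALLEL_STREAMS, DURATION_SECONDS)
    print(f"{PARALLEL_STREAMS} flows for {DURATION_SECONDS}s: {gbps:.2f} Gbit/s")
```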
20:28:22 <colin-> yeah indeed. if anyone else has done this differently or tested different hardware NICs this way please lmk! that's all i had
20:28:41 <johnsom> Yeah, get ready to add a ton of ******
20:28:49 <johnsom> for all the caveats
20:29:35 <johnsom> I can share the nginx hack code too if you decide you want it.
20:30:22 <johnsom> Ok, so we have this issue where if people kill -9 the controller processes we can leave objects in PENDING_*
20:30:31 <xgerman> also are you running the vip on an overlay? Or dedicated vlan, etc.
20:31:34 <johnsom> I have an idea for an interim solution until we do jobboard/resumption.
20:31:37 <xgerman> johnsom: that type of thing was supposed to get fixed when we adopt job-board
20:31:49 <johnsom> lol, yeah, that
20:31:54 <xgerman> our task-(flow) engine should have a way to deal with that
20:31:59 <xgerman> that’s why we went with an engine
20:32:21 <johnsom> It does, in fact multiple ways, but that will take some development time to address IMO
20:33:18 <johnsom> So, as a short term, interim fix I was thinking that we could have each process create a UUID unique to its instance, write that out to a file somewhere, then check it on startup and mark anything it "owned" as ERROR.
20:33:25 <johnsom> Thoughts?  Comments?
20:33:26 <cgoncalves> if $time.now() > $last_updated_time+$timeout -> ERROR?
20:33:40 <johnsom> The hardest part is where to write the file....
20:34:41 <johnsom> It would require a DB schema change, which we would want to get in before feature freeze (just to be nice for upgrades, etc.). So I thought I would throw the idea out now.
20:36:15 <johnsom> I think the per-process UUID would be more reliable than trying to do a timeout.
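[For reference, a minimal sketch of the interim idea as discussed: each controller process records a UUID for its own lifetime and, on startup, flips anything still owned by the previous instance to ERROR. This is not existing Octavia code; the file path and the mark_orphans_error hook are hypothetical.]

```python
"""Sketch of the per-process "owner UUID" idea discussed above.

Not existing Octavia code: the owner file path and the mark_orphans_error
hook are hypothetical. Each controller process writes a UUID unique to its
own lifetime, tags the resources it locks with that UUID, and on startup
flips anything still tagged with the previous (dead) instance to ERROR.
"""
import os
import uuid

OWNER_FILE = "/var/lib/octavia/worker_owner_id"  # hypothetical location


def startup_recovery(mark_orphans_error):
    """Recover from a kill -9 of the previous process instance.

    mark_orphans_error(owner_id) is a caller-supplied hook that would set
    provisioning_status=ERROR on every PENDING_* resource owned by owner_id.
    """
    if os.path.exists(OWNER_FILE):
        with open(OWNER_FILE) as f:
            previous_owner = f.read().strip()
        if previous_owner:
            # Anything still owned by the dead instance can never complete,
            # so surface it to the operator instead of leaving it PENDING_*.
            mark_orphans_error(previous_owner)

    # Record a fresh UUID for this process instance.
    my_owner_id = str(uuid.uuid4())
    os.makedirs(os.path.dirname(OWNER_FILE), exist_ok=True)
    with open(OWNER_FILE, "w") as f:
        f.write(my_owner_id)
    return my_owner_id
```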
20:36:41 <openstackgerrit> Vlad Gusev proposed openstack/octavia master: Add support for the oslo_middleware http_proxy_to_wsgi  https://review.openstack.org/639736
20:37:14 <cgoncalves> hmmm
20:37:57 <cgoncalves> what about flipping the status to PENDING_UPDATE then? maybe only valid for certain resources
20:38:21 <johnsom> The only downside is we don't have a /var/lib/octavia on the controllers today, so it's an upgrade/packaging issue
20:38:41 <cgoncalves> and not backportable
20:39:01 <johnsom> Right, the "don't do that" still applies to older releases
20:39:23 <johnsom> I didn't follow the PENDING_UPDATE comment
20:40:15 <cgoncalves> nah, never mind. it prolly doesn't make any sense anyway xD (I was thinking along the same lines of allowing ERROR'd LBs to be failed over)
20:40:25 <johnsom> It would have to flip them to ERROR because we don't know where in the flow they killed it
20:41:15 <johnsom> Yeah, maybe a follow-on could attempt to "fix" it, but that again requires logic to identify where it died, which is starting the work on jobboard/resumption.
20:41:33 <cgoncalves> thinking of a backportable solution, wouldn't timeouts suffice?
20:42:48 <johnsom> I don't like that approach for a few reasons. We seem to have widely varying performance in the field, so picking the right number would be hard, short of making it an hour or something, which defeats the purpose of a timely cleanup
20:43:05 <xgerman> mmh, people would likely be happy if we just flip PENDING to ERROR with the housekeeper after a while
20:43:30 <johnsom> I mean we already have flows that timeout after 25 minutes due to some deployments, so it would have to be longer than that.
20:43:49 <xgerman> some operators tend to trade resources for less work… so there’s that
20:44:10 <johnsom> Yeah, the nice thing about the UUID too is it shames the operator for kill -9
20:44:22 <johnsom> We know exactly what happened
20:44:28 <xgerman> or for having servers explode
20:44:44 <xgerman> or power switch mistakes
20:44:48 <cgoncalves> also more and more clouds run services in containers, so docker restart would basically mean kill -9
20:44:55 <xgerman> yep
20:45:05 <johnsom> Yep, k8s is horrible
20:45:22 <cgoncalves> you don't need k8s to run services in containers ;)
20:45:23 <colin-> stop, my eyes will roll out of my head
20:45:27 <cgoncalves> I mean openstack services!
20:45:46 <xgerman> yeah, we should rewrite octavia as a function-as-a-service
20:46:01 <johnsom> I know, but running the openstack control plane in k8s means lots of random kills
20:46:09 <colin-> indeed
20:46:40 <xgerman> so how difficult is job board? did we ever look into the effort?
20:46:53 <johnsom> Anyway, this is an option, yes, may not solve all of the ills.
20:47:11 <johnsom> Yeah, we did, it's going to probably be a cycle's worth of effort to go full job board.
20:47:40 <johnsom> There might be a not-so-full job board that would meet our needs too, but that again is going to take some time.
20:48:15 <xgerman> I would rather start on the “right” solution than do kludges
20:48:19 <cgoncalves> I was unaware of jobboards until now. does it sync state across multiple controller nodes?
20:48:50 <johnsom> not really, but accomplishes the same thing.
20:49:27 <cgoncalves> asking because if octavia worker N on node X goes down, worker N+1 on node X+1 takes over
20:49:43 <johnsom> So first it enables persistence of the flow data. It uses a set of "worker" processes. The main jobboard assigns and monitors the workers' completion of each task
20:50:02 <johnsom> Right, effectively that is what happens.
20:50:06 <cgoncalves> without a syncing mechanism, how would octavia know which pending resources to ERROR?
20:50:11 <xgerman> do we need a zookeeper for jobboard? Yuck!
20:50:25 <johnsom> Much of the state is stored in the DB
20:50:32 <cgoncalves> ok
20:50:53 <colin-> jobboard = ?, for the uninitiated
20:50:55 <johnsom> Yeah, so there was a locking requirement I remember from the analysis. I don't think zookeeper was the only option, but maybe
20:50:59 <colin-> is this a work tracking tool?
20:51:31 <colin-> ah, disregard
20:51:55 <xgerman> https://docs.openstack.org/taskflow/ocata/jobs.html
20:52:00 <johnsom> #link https://docs.openstack.org/taskflow/latest/user/jobs.html
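[For readers unfamiliar with jobboards, here is a purely conceptual sketch of the claim/takeover behavior being discussed. It is not the taskflow jobboard API; in a real deployment the board and its claims would live in a shared store such as ZooKeeper, and flow state would be resumed from the persistence backend.]

```python
"""Conceptual illustration of the jobboard takeover idea (not the taskflow API).

Jobs live on a shared board with an owner and a last-heartbeat timestamp;
any worker may reclaim a job whose owner has stopped heartbeating, which is
how worker N+1 on another node takes over when worker N dies.
"""
import time

CLAIM_EXPIRY = 30  # seconds without a heartbeat before a claim is considered dead


class JobBoard:
    def __init__(self):
        # job_id -> {"details": ..., "owner": None, "heartbeat": 0.0}
        self.jobs = {}

    def post(self, job_id, details):
        """Publish a job so any worker can claim it."""
        self.jobs[job_id] = {"details": details, "owner": None, "heartbeat": 0.0}

    def claim(self, job_id, worker_id):
        """Claim an unowned job, or one whose owner stopped heartbeating."""
        job = self.jobs[job_id]
        expired = time.monotonic() - job["heartbeat"] > CLAIM_EXPIRY
        if job["owner"] is None or expired:
            job["owner"] = worker_id
            job["heartbeat"] = time.monotonic()
            return True
        return False

    def heartbeat(self, job_id, worker_id):
        """Keep the claim alive while the worker is still making progress."""
        job = self.jobs[job_id]
        if job["owner"] == worker_id:
            job["heartbeat"] = time.monotonic()

# Worker N claims a job; if it dies and stops heartbeating, worker N+1's
# next claim() attempt succeeds and the flow resumes from persisted state.
```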
20:52:13 <johnsom> Anyway, I didn't want to go deep on the future solution.
20:52:45 <johnsom> What I am hearing is we would prefer to leave this issue until we have resources to work on the full solution and that an interim solution is not valuable
20:53:32 <xgerman> #vote?
20:53:39 <cgoncalves> I still didn't get why timeouts wouldn't be a good interim (and backportable) solution
20:53:56 <johnsom> What would you pick as a timeout?
20:54:24 <cgoncalves> what ever is in the config file
20:54:27 <johnsom> We know some clouds complete tasks in less than a minute, others it takes over 20
20:54:36 <cgoncalves> if load balancer creation: build timeout + heartbeat timeout
20:54:53 <cgoncalves> otherwise, just heartbeat timeout. no?
20:54:56 <johnsom> So 26 minutes?
20:55:12 <cgoncalves> better than forever and ever
20:55:26 <cgoncalves> and not being able to delete/error
20:55:40 <johnsom> I don't think we can backport this even if it has a timeout really
20:56:16 <johnsom> The timeout would be a new feature to the housekeeping process
20:56:34 <cgoncalves> no API or DB schema changes. no new config option
20:56:41 <johnsom> The other thing that worries me about timeouts is folks setting it and not understanding the ramifications
20:56:42 <colin-> yeah that's tricky, i too don't want to leave them (forever) in the state where they can't be deleted
20:56:44 <cgoncalves> it would be a new periodic in housekeeping
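[A minimal sketch of such a housekeeping periodic, assuming a cutoff along the lines of the "build timeout + heartbeat timeout" suggestion above; the helper names and the 26-minute value are hypothetical, not Octavia APIs.]

```python
"""Sketch of the timeout-based housekeeping periodic discussed above.

Not existing Octavia code; names are illustrative. It flips resources that
have sat in a PENDING_* state longer than a configured timeout to ERROR.
"""
import datetime

# Longer than the longest known flow timeout (~25 minutes), per the discussion.
STUCK_TIMEOUT = datetime.timedelta(minutes=26)


def expire_stuck_resources(list_pending, mark_error, now=None):
    """Flip long-stuck PENDING_* resources to ERROR.

    list_pending() yields (resource_id, updated_at) pairs for resources in a
    PENDING_* state; mark_error(resource_id) sets them to ERROR.
    """
    now = now or datetime.datetime.utcnow()
    for resource_id, updated_at in list_pending():
        if now - updated_at > STUCK_TIMEOUT:
            # The flow that owned this resource is assumed dead; surface it
            # to the operator instead of leaving it PENDING_* forever.
            mark_error(resource_id)
```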
20:57:52 <xgerman> yeah, I am haunted by untuned timeouts almost every day
20:58:01 <colin-> xgerman: thanks for the link
20:58:05 <johnsom> Yep. I think it breaks the "risk of regression" and "self-contained" rules
20:58:44 <johnsom> And certainly the "New feature" rule
20:59:24 <johnsom> Well, we are about out of time.  Thanks folks.
20:59:27 <cgoncalves> "Fix an issue where resources could eternally be left in a transient state" ;)
20:59:41 <johnsom> If you all want to talk about job board more, let me know and I can put it on the agenda.
20:59:52 <cgoncalves> I will certainly read more about it
21:00:20 <johnsom> I just think it's a super dangerous thing in our model to change the state out from under other processes
21:00:25 <johnsom> #endmeeting