20:00:09 #startmeeting Octavia
20:00:10 Meeting started Wed Aug 29 20:00:09 2018 UTC and is due to finish in 60 minutes. The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:12 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:00:14 The meeting name has been set to 'octavia'
20:00:18 o/
20:00:19 o/
20:00:27 Hi folks
20:00:42 #topic Announcements
20:00:55 Same reminder about the PTG etherpad:
20:01:00 #link https://etherpad.openstack.org/p/octavia-stein-ptg
20:01:20 The PTG is coming up fast, so I will try to put a rough schedule together soon
20:01:43 Also a note, the TC election nominations start tomorrow.
20:01:50 #link https://governance.openstack.org/election/
20:02:01 In case you are interested in running for the TC
20:02:38 The official Rocky release patch is up for review. I have already checked the SHAs and our stuff looks good.
20:03:06 Nice
20:03:48 Any other announcements today?
20:04:14 #link https://review.openstack.org/597529
20:04:24 Finally found the Rocky release link...
20:04:28 Octavia Queens 2.0.2 tagging is blocked until the stable team reviews a commit that caught their attention
20:04:39 Yeah, I saw that.
20:05:02 #link https://review.openstack.org/#/c/593954/
20:05:03 FYI, I have already bumped the OpenStack Ansible SHA to use that 2.0.2 version
20:05:20 I think it will be ok since it includes a default.
20:06:08 That is all I have, so moving on down the agenda
20:06:18 #topic Brief progress reports / bugs needing review
20:06:41 I have wrapped up most of my internal work, so I should have a bit more time to work on things upstream.
20:07:23 I recently wrote a tool to stress test the health manager process. It can also be used to populate a DB with load balancers.
20:07:41 I will probably post that in my github space sometime in the near future.
20:08:00 nice, thanks!
20:08:10 do share the link later :)
20:08:11 Testing has shown that we have a bit of a performance regression in the HM, and I have a patch for that in the works.
20:08:53 johnsom, if only we had such tools for nlbaas :D
20:08:56 I also noticed the UDP code is not reporting listener health correctly, and the workaround code is in the way of the HM fix, so I will also be posting a fix for that soonish
20:09:34 nmagnezi HA, well, you can run my existing API stress tool against neutron-lbaas.... But the results aren't pretty
20:09:54 yeah.. we know.. :)
20:10:00 It does pass well on Octavia however
20:10:35 That is mostly what I have been up to over the week. Any other updates?
20:10:48 Oh, I forgot one item
20:10:59 #link https://review.openstack.org/#/c/594786/
20:11:20 I have posted an alternate option for how we can handle API versioning in the tempest plugin.
20:11:34 rm_work also posted an idea for it.
20:11:46 We should try to review those and merge one soonish.
20:12:08 Without this, the tempest plugin fails when run against Queens Octavia
20:12:21 Yes, we saw that in-house
20:12:30 Thanks so much for submitting that one
20:12:33 It tries to test features that were added in Rocky on the Queens cloud
20:12:46 Yeah, stuff like listener timeouts and such
20:13:00 Yep
20:13:13 This patch tests that patch with stable/queens:
20:13:17 #link https://review.openstack.org/#/c/595257/
20:13:28 Reviews would be appreciated.
20:14:34 Ack
20:14:35 ah, forgot to re-review. you made jobs voting, thanks
20:14:44 Any other updates?
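(For context on the API-versioning problem discussed above: a minimal sketch of the general idea, skipping tests for Rocky-only features such as listener timeouts when the target cloud only exposes an older API version. This is not either of the patches under review; the helper name and version values are illustrative assumptions.)

    # Illustrative only; neither https://review.openstack.org/#/c/594786/ nor
    # rm_work's alternative necessarily works this way.
    def satisfies_api_version(current, minimum):
        """Return True if the cloud's reported API version meets the test's minimum."""
        def as_tuple(version):
            return tuple(int(part) for part in version.split('.'))
        return as_tuple(current) >= as_tuple(minimum)

    # A Rocky-only test would run against a cloud reporting the required
    # version, but skip against a Queens cloud that reports an older one
    # (the version numbers here are assumptions for illustration).
    assert satisfies_api_version('2.1', '2.1')
    assert not satisfies_api_version('2.0', '2.1')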
20:14:54 Speaking of reviews, I have a question about another patch you posted
20:14:56 lol, I already re-reviewed :)
20:15:09 #link https://review.openstack.org/#/c/585031/
20:15:25 Sure, what is the question?
20:15:44 I also posted this in gerrit: what would happen if we have a new controller (with this patch) with an older amp
20:15:51 And we try to ask for stats
20:16:26 You should get an answer, as this doesn't connect to the AMP; it just pulls data we already have in the database in a different way.
20:17:16 Oh, alright. Didn't get to test it just yet sadly, just a thought that I had
20:18:14 Yeah, as of now, there is never a connection from the API process directly to an amphora. Only the other three processes
20:19:08 The background on this patch and the two in tempest-plugin is that I needed a gate test that tests out VRRP failover for an internal request.
20:19:32 The best way I could come up with to figure out which amp is passing traffic was to look at the per-amphora stats.
20:20:16 The gate test works, but I still need to add some tempest API tests to the middle patch in that chain, so I have it WIP right now
20:21:32 Ok, if we don't have any more updates, I will move on.
20:21:42 #topic Upgrade-checker community goal
20:22:07 One of the two community goals for Stein is to have an "upgrade-checker" script like nova has.
20:22:33 Frankly I think this is low-hanging fruit if someone is looking for a small project.
20:22:40 There are details here:
20:22:47 #link http://lists.openstack.org/pipermail/openstack-dev/2018-August/133888.html
20:23:36 Just wanted to raise awareness in case someone is looking for a little project.
20:23:57 I don't think we have much that would need to go in there at the moment.
20:24:58 #topic Open Discussion
20:25:08 Any other topics today?
20:26:20 we've received a couple of reports these past days that octavia can bring down all loadbalancers due to a DB outage
20:26:26 #link https://storyboard.openstack.org/#!/story/2003575
20:26:50 I plan to work on this as soon as I can
20:27:09 a traceback can be read here: https://bugzilla.redhat.com/show_bug.cgi?id=1603138#c4
20:27:09 bugzilla.redhat.com bug 1603138 in openstack-octavia "Controller replacement with Octavia- After controller replacement amphora is in ERROR state" [Urgent,Closed: duplicate] - Assigned to cgoncalves
20:27:40 Yeah, bummer. I have added a comment to that story with an idea of how to approach it
20:28:07 examples of operations that can trigger this issue are: node reboot, DB service restart, cloud upgrade (causing DB downtime)
20:28:29 so, whatever causes DB downtime...
20:28:56 Well, ideally you are running a DB cluster, so minor things like a DB restart should not impact Octavia
20:29:42 a temporary network disruption can also block the connection to the DB
20:29:47 yeah, I need to check why it still caused that on 3-controller HA deployments
20:30:56 There are built-in DB retries in the oslo DB layer, so short network blips should not trigger it either. Plus, with the default settings it would have to be unreachable for 60 seconds or more across all of the HMs
20:31:49 Maybe there is a bug in galera (if that is what you are using for clustering) or oslo DB. We should do some testing.
20:32:03 Either way, I think we should improve how the HMs handle the situation
20:32:26 agreed. we should catch db_exc.DBConnectionError exceptions in health_check()
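(A minimal sketch of the approach just agreed on: catch the oslo.db connection error in the health-check loop so a DB outage pauses health checks instead of failing over healthy amphorae. The repository call, loop structure, and interval below are assumptions for illustration, not the actual Octavia health manager code.)

    import time

    from oslo_db import exception as db_exc
    from oslo_log import log as logging

    LOG = logging.getLogger(__name__)


    def health_check(amp_health_repo, interval=10):
        """Hypothetical health-check loop, not the real HealthManager code."""
        while True:
            try:
                # Hypothetical repository call standing in for the stale-amphora query
                stale_amps = amp_health_repo.get_stale_amphorae()
            except db_exc.DBConnectionError:
                LOG.warning('Database is unreachable; skipping this health '
                            'check cycle instead of failing over amphorae.')
                time.sleep(interval)
                continue
            for amp in stale_amps:
                LOG.info('Would trigger failover for stale amphora %s', amp)
            time.sleep(interval)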
20:32:42 sorry, had some prod issue to attend...
20:33:04 xgerman_, let me guess. octavia brought down all LBs due to DB outage? xd
20:33:17 no, unrelated to Octavia
20:33:38 He had to look around for zombies
20:33:52 ha! :)
20:34:59 cgoncalves Feel free to ping me if you want to bounce ideas around....
20:35:33 johnsom, I most certainly will, thanks :)
20:35:42 Other topics for today?
20:36:44 PTG? Did you plug the etherpad?
20:37:09 ah saw it
20:38:02 Yeah, I posted the normal reminder at the start of the meeting
20:38:39 +1
20:38:41 xgerman_ lol, oh, we will be talking about Octavia..... (topic 18 in the etherpad)
20:38:52 I like item "18. Octavia". let's discuss Octavia at the PTG xD
20:39:20 Ok, if we don't have any more topics, I will close out the meeting.
20:40:28 Ok, thanks folks!
20:40:34 #endmeeting
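(Reference note on the upgrade-checker goal raised earlier in the meeting: projects typically implement this goal with the oslo.upgradecheck library. The placeholder check below is a minimal sketch under that assumption and is not Octavia's eventual implementation.)

    import sys

    from oslo_config import cfg
    from oslo_upgradecheck import upgradecheck


    class Checks(upgradecheck.UpgradeCommands):
        """Placeholder checks only; any real Octavia checks would go here."""

        def _check_placeholder(self):
            # Nothing to verify yet, matching the "not much needed" comment above.
            return upgradecheck.Result(upgradecheck.Code.SUCCESS)

        _upgrade_checks = (('Placeholder', _check_placeholder),)


    def main():
        return upgradecheck.main(
            cfg.CONF, project='octavia', upgrade_command=Checks())


    if __name__ == '__main__':
        sys.exit(main())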