15:01:34 <haleyb> #startmeeting neutron_dvr
15:01:35 <openstack> Meeting started Wed Oct 12 15:01:34 2016 UTC and is due to finish in 60 minutes.  The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:37 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:39 <openstack> The meeting name has been set to 'neutron_dvr'
15:01:43 <haleyb> #chair Swami
15:01:47 <openstack> Warning: Nick not in channel: Swami
15:01:49 <openstack> Current chairs: Swami haleyb
15:01:54 <haleyb> #chair Swami__
15:01:55 <openstack> Current chairs: Swami Swami__ haleyb
15:02:22 <haleyb> #topic Announcements
15:02:54 <haleyb> Newton is done, Ocata is open, so let's merge some code :)
15:03:05 <Swami__> haleyb: good
15:03:25 <haleyb> #topic Bugs
15:03:33 <Swami__> haleyb: thanks
15:03:43 <Swami__> There are two new bugs filed this week.
15:04:06 <Swami__> #link https://bugs.launchpad.net/neutron/+bug/1632540
15:04:06 <openstack> Launchpad bug 1632540 in neutron "l3-agent print the ERROR log in l3 log file continuously ,finally fill file space,leading to crash the l3-agent service" [Undecided,New] - Assigned to zhichao zhu (rtmdk)
15:04:53 <Swami__> It seems this bug can be recreated in an L3+HA+DVR scenario where, after a restart, the agent is not able to process a new router that was created while the agent was dead, and the logs quickly fill up.
15:05:13 <Swami__> Unable to process compatible router is the error message seen.
15:05:28 <Swami__> This has been reported as seen in Mitaka.
15:05:48 <Swami__> This bug needs triaging on why the router was not processed in the first place.
15:06:16 <haleyb> we should try and reproduce in master as well
15:06:51 <Swami__> haleyb: yes, while triaging we can try it on both master and Mitaka. It seems to be raising an exception while trying to apply the iptables rules.
15:07:46 <Swami__> Swami: Let me triage it and will update the bug.
15:08:02 * haleyb wonders how many swamis are here :)
15:08:02 <Swami__> The next one in the list is
15:08:21 <Swami__> haleyb: hexchat does the magic for swami
15:08:38 <haleyb> Swami__: thanks, if you can reproduce i can look at logs, or let me know if you don't have time
15:08:43 <Swami__> haleyb: i keep connecting and disconnecting, that is the reason for multiple swami.
15:08:53 <Swami__> The next one in the list is
15:09:14 <Swami__> #link https://bugs.launchpad.net/neutron/+bug/1631513
15:09:14 <openstack> Launchpad bug 1631513 in neutron "DVR: Fix race conditions when trying to add default gateway for fip gateway port." [Undecided,In progress] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:09:40 <Swami__> A patch is up for review. #link https://review.openstack.org/#/c/383941/
15:09:52 <Swami__> This is the one that we were discussing yesterday.
15:10:09 <haleyb> Swami__: i will look at your comments
15:10:14 <Swami__> I know that you had some concerns with the patch.
15:10:45 <Swami__> haleyb: Basically we still can't find why two router_updates are coming in for the same router before the first update completed.
15:10:46 <haleyb> i still wish we could flag router as "building" so updates block until it's done
15:11:36 <Swami__> haleyb: so when do you think that we should say the router is complete? It might depend upon the operation.
15:12:32 <Swami__> haleyb: because here it may be routerA, and we would have successfully completed the deployment of routerA and updated the state to complete. Then when a floatingip association comes in for the same router, do you want to reset the state of the router and then set it again?
15:13:10 <haleyb> Swami__: right, i guess complete means init_l3 has run and/or interfaces have been created.  But yes there are other dynamic things like FIPs.
15:13:14 <Swami__> haleyb: So for every update that is coming in you wanted to update the state?
15:13:43 <Swami__> haleyb: That would be complex and we will hit a timing window there as well.
15:13:51 <haleyb> Swami__: i was originally just thinking initial creation, but it might not work correctly
15:14:23 <haleyb> I think the change is almost there, it's just hard to test being intermittent
15:14:47 <Swami__> haleyb: agreed.
15:15:22 <Swami__> haleyb: but because we could not figure out the real cause I have added a 'try' loop for the update_gateway_port.
15:16:30 <Swami__> haleyb: The reason is that if, for any reason, there is a delay in assigning the IP for the 'fg' port or in creating the port while we are adding the gateway, we can catch it in update_gateway_port, clean up the namespace, and re-raise the exception so that the router can re-sync the update and see if it processes the floatingip properly.
15:17:14 <haleyb> Swami__: yes, i'll have to read your comments again to figure out what code path is building things, and which is broken again.  If one is in update_gateway_port and not done we can always add a semaphore to stop the second
15:17:33 <Swami__> haleyb: on the other hand, if there is a parallel router_update that goes directly to update_gateway_port, we would throw an exception and try to re-sync so that when the next update comes in, everything would be there. (This is just an assumption.)
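The cleanup-and-re-raise approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in the patch under review: only the name update_gateway_port comes from the discussion, while GatewayNotReady, add_gateway, handle_router_update, the dict-based namespace, and the resync list are hypothetical stand-ins.

```python
class GatewayNotReady(Exception):
    """Hypothetical error: the 'fg-' port exists but has no IP yet."""


def update_gateway_port(namespace, fg_port):
    # Stand-in for the agent method named in the discussion; body is assumed.
    namespace["fg_device"] = fg_port["name"]  # partial state is created first
    if not fg_port.get("ip"):
        raise GatewayNotReady("no IP assigned to %s yet" % fg_port["name"])
    namespace["default_gateway"] = fg_port["gateway_ip"]


def add_gateway(namespace, fg_port):
    try:
        update_gateway_port(namespace, fg_port)
    except GatewayNotReady:
        namespace.clear()  # clean up the half-built namespace
        raise              # re-raise so the caller can schedule a re-sync


def handle_router_update(router_id, namespace, fg_port, resync):
    try:
        add_gateway(namespace, fg_port)
        return True
    except GatewayNotReady:
        resync.append(router_id)  # the update is replayed on the next pass
        return False
```

A first call with an fg port that has no IP leaves the namespace empty and queues the router for re-sync; a later call with the IP present sets the default gateway.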
15:19:29 <haleyb> we can continue discussion in the patch
15:19:51 <Swami__> haleyb: The only issue is that we should have something in there to check why the 'fg-' port is not getting created in the first place.
15:19:57 <Swami__> haleyb: thanks will do.
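The semaphore idea raised earlier in this exchange, stopping a second update while the first is still inside update_gateway_port, could look something like the sketch below. It assumes plain Python threads and a hypothetical serialized_update helper; the agent's own concurrency primitives may differ, so this only illustrates the per-router serialization, not how the agent would implement it.

```python
import threading
from collections import defaultdict

# Hypothetical registry: one lock per router id, created on first use.
_router_locks = defaultdict(threading.Lock)
_registry_guard = threading.Lock()


def _lock_for(router_id):
    # Guard the registry so two threads never create two locks for one router.
    with _registry_guard:
        return _router_locks[router_id]


def serialized_update(router_id, update_fn):
    """Run update_fn while holding the lock for router_id.

    A second update for the same router blocks until the first finishes;
    updates for different routers are not affected.
    """
    with _lock_for(router_id):
        return update_fn()
```

This does not explain why two router_updates arrive for the same router; it only keeps them from interleaving.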
15:20:16 <Swami__> The next one is
15:20:20 <Swami__> #link https://bugs.launchpad.net/neutron/+bug/1629539
15:20:20 <openstack> Launchpad bug 1629539 in neutron "Broken distributed virtual router w/ lbaas v1" [Undecided,Incomplete]
15:20:51 <Swami__> I have not triaged this bug yet. It seems that a legacy router is getting migrated to DVR and they are seeing this problem.
15:21:21 <haleyb> that's also mitaka
15:21:30 <Swami__> haleyb: yes in mitaka.
15:21:59 <Swami__> haleyb: I don't know if it is worth testing with lbaasv1 at this point.
15:22:34 <haleyb> i will still caution against getting too deep based on submitter
15:22:52 <Swami__> haleyb: ok got it.
15:23:40 <Swami__> I don't have any other new bugs to discuss; we discussed the rest of the bugs last week.
15:24:25 <haleyb> i was just looking at https://bugs.launchpad.net/neutron/+bug/1612192
15:24:25 <openstack> Launchpad bug 1612192 in neutron "L3 DVR: Unable to complete operation on subnet" [High,Confirmed]
15:24:33 <Swami__> haleyb: yes, any updates?
15:24:59 <haleyb> either logstash is broken, or it's not being seen any more.  i couldn't find anything for the past 30+ days
15:25:35 <haleyb> so i will close that based on that info
15:25:50 <Swami__> haleyb: Ok, makes sense.
15:26:15 <haleyb> this one still shows up https://bugs.launchpad.net/neutron/+bug/1612804 but i haven't looked very deep
15:26:15 <openstack> Launchpad bug 1612804 in neutron "test_shelve_instance fails with sshtimeout" [High,Confirmed]
15:27:05 <Swami__> haleyb: thanks
15:27:24 <haleyb> Any other "in progress" ones to discuss?  i know some just need reviews
15:28:00 <Swami__> haleyb: can I also get your attention on the patch #link https://review.openstack.org/#/c/377108/
15:28:15 <Swami__> haleyb: Also on this https://review.openstack.org/#/c/308068/
15:28:25 <Swami__> They both are related to bugs.
15:28:40 <haleyb> ok, will look
15:29:13 <Swami__> #link https://bugs.launchpad.net/neutron/+bug/1506567
15:29:13 <openstack> Launchpad bug 1506567 in neutron "No information from Neutron Metering agent" [High,In progress] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:29:48 <Swami__> #link https://bugs.launchpad.net/neutron/+bug/1571676
15:29:48 <openstack> Launchpad bug 1571676 in neutron "After binding a floating IP to VM, the static route can't work in DVR." [Undecided,In progress] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:29:50 <Swami__> haleyb: thanks
15:30:08 <Swami__> haleyb: that's all I had for bugs.
15:30:41 <haleyb> this is a better time to land these patches in case we have a regression
15:30:56 <haleyb> jschwarz: any HA bugs?
15:31:53 <Swami__> jschwarz: I don't think he is here.
15:32:04 <haleyb> and i don't see anil, i think he had a 3-node infra change out for review?  i don't have a link
15:33:07 <haleyb> guess it will have to wait until next time
15:33:14 <Swami__> haleyb: true
15:35:20 <haleyb> #topic Gate failures
15:36:48 <haleyb> It doesn't look like DVR is being an issue so nothing critical to look at, anyone seen otherwise?
15:37:25 <Swami__> That is good, but it might be time bound.
15:39:19 <haleyb> Swami__: meaning we timeout before failure?  i.e. the dhcp failure?
15:40:15 <Swami__> haleyb: I meant that maybe we will see bugs today or tomorrow.
15:41:09 * haleyb will take today and tomorrow off then :)
15:41:36 <haleyb> #topic Stable backports
15:42:21 <Swami__> haleyb: Do we have any pending backports for now?
15:42:37 <haleyb> backport policy changed wrt releases, ihar sent an email to the list, but basically liberty is CVE only, mitaka High and CVE, newton more open
15:42:57 <haleyb> Swami__: i think there's a few HA ones, but not much
15:43:36 <haleyb> like https://review.openstack.org/#/c/364407/
15:44:08 <Swami__> haleyb: Some of the HA l2pop patches have table updates, so I am not sure if we can backport those.
15:45:14 <haleyb> no, db updates aren't allowed
15:45:50 <Swami__> haleyb: yes that's what I thought.
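The backport rules as summarized in this topic can be written down as a small check. This is only a sketch of what was said in the meeting, not the official stable policy: the function name and parameters are hypothetical, "mitaka High and CVE" is taken literally, and "newton more open" is assumed here to mean any bug fix.

```python
def backport_allowed(branch, importance, is_cve=False, has_db_migration=False):
    """Return True if a fix looks eligible for backport to the given branch."""
    if has_db_migration:
        return False  # db updates aren't allowed on stable branches
    if branch == "liberty":
        return is_cve  # CVE only
    if branch == "mitaka":
        return is_cve or importance == "High"  # High and CVE
    if branch == "newton":
        return True  # "more open": assumed to accept any bug fix
    return False  # branches not mentioned in the meeting
```

For example, a High-importance fix that adds a table column is rejected for every branch, which matches the answer given for the HA l2pop changes.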
15:46:24 <haleyb> #topic Open Discussion
15:46:38 <haleyb> anything else?  it might just be you and me swami
15:47:01 <Swami__> haleyb: I don't have any.
15:47:30 <Swami__> haleyb: There is one issue that has been reported against metadata and dvr routers.
15:47:57 <Swami__> haleyb: This seems to me like a timing issue, since we do start routers dynamically after the VM pops up.
15:49:01 <haleyb> bug #?  or is this the one we see internally?
15:49:20 <Swami__> haleyb: yes there is a bug for it.
15:49:51 <haleyb> https://bugs.launchpad.net/neutron/+bug/1526855
15:49:51 <openstack> Launchpad bug 1526855 in neutron "VMs fail to get metadata in large scale environments" [Medium,Confirmed] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:49:54 <Swami__> #link https://bugs.launchpad.net/neutron/+bug/1526855
15:50:05 <Swami__> haleyb: yes that's the one.
15:50:42 <Swami__> haleyb: I was looking through it and the metadata driver is only called when the router is added.
15:51:03 <Swami__> Also the bug states that the timeout is seen only for the first router that is getting created on that node.
15:52:16 <haleyb> so VM port comes up before router is created
15:52:31 <Swami__> haleyb: yes
15:53:15 <Swami__> VM comes up and starts sending the request and then times out after 20 retries. In some cases the router creation might take longer than the 20 retries. That is the theory.
15:53:34 <Swami__> Once the router is in place, then any number of VMs that pop up will not fail to get the metadata.
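The timing theory above can be illustrated with a toy model. The 20 retries come from the discussion; the retry interval, the function name, and the idea of a single "router ready" time are assumptions made only for this sketch.

```python
def metadata_request_succeeds(router_ready_at_s, retries=20, retry_interval_s=1.0):
    """Return True if any retry is sent at or after the router becomes ready.

    Attempt n (counting from 0) is assumed to be sent at n * retry_interval_s
    seconds after the VM boots.
    """
    for attempt in range(retries):
        if attempt * retry_interval_s >= router_ready_at_s:
            return True
    return False
```

With these assumed numbers, a router that is ready 5 seconds after boot serves the VM, while one that takes 30 seconds does not, because the last retry goes out at 19 seconds. Later VMs on the same node see a ready time of 0 and succeed on the first attempt.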
15:54:20 <haleyb> i have not gone through the logs, but is the l3-agent getting the message and slow, or getting the message late?
15:54:55 <Swami__> haleyb: no I don't have clues.
15:56:44 <haleyb> I know the internal bug has more info, we might need to copy things to the public one
15:57:46 <Swami__> haleyb: will try to get it and post it.
15:58:54 <haleyb> ok, thanks.  we're about out of time
15:59:36 <haleyb> #endmeeting