15:01:21 <haleyb_> #startmeeting neutron_dvr
15:01:22 <openstack> Meeting started Wed Sep  7 15:01:21 2016 UTC and is due to finish in 60 minutes.  The chair is haleyb_. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:23 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:25 <openstack> The meeting name has been set to 'neutron_dvr'
15:01:29 <jschwarz> haleyb_, underscores ++
15:01:35 <haleyb_> #chair Swami_
15:01:35 <openstack> Current chairs: Swami_ haleyb_
15:01:44 <haleyb_> guess swami is too :)
15:01:56 <Swami_> hi
15:02:33 <haleyb_> #topic Announcements
15:02:49 <haleyb_> N-3 is out, so we're onto RC1
15:03:19 <haleyb_> So any patches targeted for Newton must have a bug targeted at -rc1
15:04:42 <haleyb_> ping a core or armando to get it targeted, we only want the really important stuff, and the bar will be set higher as we move forward
15:04:56 <Swami_> haleyb_: makes sense
15:05:11 <jschwarz> haleyb_, can you remind the team when RC1 will be cut?
15:05:22 <haleyb_> RFEs can be added via the postmortem doc
15:05:33 <haleyb_> jschwarz: let me look
15:06:29 <haleyb_> https://releases.openstack.org/newton/schedule.html shows week of Sept 12th, so next week
15:06:40 <jschwarz> haleyb_, thanks
15:07:46 <haleyb_> #topic Bugs
15:08:32 <jschwarz> haleyb_, I sorted the list out a bit, added a few new patches from the past week and gave an indication of which ones are HA
15:08:40 * jschwarz is bug deputy so had an eye for new bugs
15:09:10 <Swami_> haleyb_: This week there are two new bugs
15:09:30 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1620824
15:09:30 <openstack> Launchpad bug 1620824 in neutron "Neutron DVR(SNAT) steals FIP traffic" [Undecided,In progress] - Assigned to David-wahlstrom (david-wahlstrom)
15:10:22 <haleyb_> i just noticed they are not using the built-in l2pop
15:10:33 <Swami_> The bug report has a detailed description of how it occurs, but it says that under heavy load the connection tracking misbehaves and the packet is forwarded to the SNAT node.
15:11:00 <Swami_> haleyb_: is that the main problem?
15:11:38 <jschwarz> haleyb_, Swami_, I'm wondering if it reproduces with the reference l2pop
15:11:48 <jschwarz> if not, this could be not-our-bug to solve anyway
15:11:50 <haleyb_> Swami_: i don't think so, but it would be good to reproduce without it since we don't have that code
15:12:23 <haleyb_> Swami_: did you see my note in the bug on not setting this to zero?
15:12:27 <Swami_> haleyb_: I don't think we have seen such behavior under heavy load with the reference l2pop.
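For reference, reproducing with the in-tree l2pop only needs the standard ml2/OVS agent settings; a sketch, with file paths as in a stock install:

    # /etc/neutron/plugins/ml2/ml2_conf.ini
    [ml2]
    mechanism_drivers = openvswitch,l2population

    # /etc/neutron/plugins/ml2/openvswitch_agent.ini, on each OVS agent
    [agent]
    l2_population = True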
15:12:40 <Swami_> haleyb_: no
15:13:24 <haleyb_> with non-DVR setups we experimented with setting tcp_loose=0 and realized failover between network nodes didn't work, since connections get dropped with that setting
15:13:58 <Swami_> haleyb_: makes sense
15:14:06 <haleyb_> that's my worry about changing this, even with just SNAT traffic
15:15:01 <Swami_> So in that case we can wait and see how it goes if they don't set tcp_loose to zero.
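For context, the knob under discussion is the standard nf_conntrack sysctl; a minimal sketch of what the bug report suggests:

    # tcp_loose=1 (the default) lets conntrack pick up already-established
    # TCP flows mid-stream, which is what allows connections to survive a
    # failover to another network node.
    sysctl net.netfilter.nf_conntrack_tcp_loose
    # Setting it to 0, as the bug proposes, disables midstream pickup, so
    # existing connections are dropped after a failover.
    sysctl -w net.netfilter.nf_conntrack_tcp_loose=0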
15:15:30 <Swami_> The next one in the list is
15:15:35 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1476469
15:15:35 <openstack> Launchpad bug 1476469 in neutron "with DVR, a VM can't use floatingIP and VPN at the same time" [Medium,Confirmed]
15:16:17 <Swami_> I am not sure if centralized routers can still support VPN with floatingIPs. My memory is weak.
15:16:21 <haleyb_> that is an old bug, resurrected
15:16:54 <Swami_> haleyb_: yeah, but my doubt is whether it is still a valid bug or not.
15:17:32 <haleyb_> oh, reproduced on mitaka
15:17:35 <Swami_> Because today we don't run the IPsec process in the FIP namespace, we only run it in the SNAT namespace.
15:17:55 <haleyb_> is this something we should move to the vpnaas team?
15:18:01 <yedongcan> Hi, Swami_
15:18:08 <Swami_> As per the design we want to run the VPN process only in the SNAT namespace, since we make this service a singleton running on the centralized node.
15:18:12 <Swami_> yedongcan: hi
15:18:23 <yedongcan> centralized routers can'tsup work with port VPN with floatingIPs
15:18:57 <Swami_> yedongcan: not clear what you mean.
15:19:37 <yedongcan> Swami_: sorry, in centralized routers VPN can't work with floatingIPs
15:20:15 <Swami_> yedongcan: yes that was my understanding too. So we can then mark this bug as invalid
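A quick way to confirm where the VPN service actually runs on a DVR deployment; a sketch, with $ROUTER_ID standing in for the router's UUID:

    # On the network node, the IPsec processes live in the snat- namespace:
    ip netns | grep "$ROUTER_ID"                           # qrouter-/snat- namespaces
    ip netns pids "snat-$ROUTER_ID" | xargs -r -n1 ps -fp  # shows the VPN processes
    # On a compute node, the fip- namespace has no VPN processes at all.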
15:20:50 <Swami_> The next in the list is
15:20:54 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1612192
15:20:54 <openstack> Launchpad bug 1612192 in neutron "L3 DVR: Unable to complete operation on subnet" [Critical,Confirmed]
15:22:14 <Swami_> haleyb_: I did see your update on the bug. So are we still considering this as critical? I have not triaged this yet.
15:22:19 <haleyb_> I updated that the other day; that issue was only seen in the SFC and OVN jobs, with only a small number of hits in dvr
15:23:12 <Swami_> haleyb_: thanks
15:23:39 <Swami_> The next in the list is
15:23:42 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1612804
15:23:42 <openstack> Launchpad bug 1612804 in neutron "test_shelve_instance fails with sshtimeout" [Critical,Confirmed]
15:23:48 <Swami_> This is again a gate failure.
15:24:16 <Swami_> I don't think we have triaged this bug either.
15:25:07 <haleyb_> no, i just clicked the logstash link to see if it's still there
15:25:14 <Swami_> haleyb_: thanks
15:26:02 <haleyb_> and the page was completely blank, i'll have to look again later maybe kibana is flaky
15:26:03 * jschwarz is not getting cooperation from logstash
15:26:14 <haleyb_> jschwarz: glad it's not just me
15:26:21 <Swami_> ok.
15:26:24 <Swami_> let us move on
15:26:28 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1597461
15:26:28 <openstack> Launchpad bug 1597461 in neutron "L3 HA: 2 masters after reboot of controller" [High,Fix released] - Assigned to John Schwarz (jschwarz)
15:26:37 <Swami_> jschwarz: any update
15:26:51 <jschwarz> Swami_, the patch merged last week and the issue should be fixed
15:27:10 <jschwarz> Swami_, I'm pondering whether or not we should merge it back - if so I'll send the patches today/tomorrow morning
15:27:17 <haleyb_> i marked as fixed just a minute ago, will need to update wiki and move it lower
15:27:34 <Swami_> haleyb_: ok thanks
15:27:36 <haleyb_> jschwarz: it can't merge back without a dependent patch
15:27:44 <jschwarz> haleyb, which one?
15:28:13 <haleyb_> provisioning blocks
15:28:21 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1607381
15:28:21 <openstack> Launchpad bug 1607381 in neutron "HA router in l3 dvr_snat/legacy agent has no ha_port" [High,Fix released] - Assigned to John Schwarz (jschwarz)
15:28:21 <haleyb_> and that has a db change
15:28:28 <jschwarz> haleyb, ack. will look into it. thanks
15:29:04 <jschwarz> Swami_, this has also been fixed - patch merged on Friday
15:29:24 <Swami_> ok I need to clean up the wiki. thanks
15:29:28 <jschwarz> Swami_, we decided to fix this server-side and not agent-side, so the server-side merged
15:29:30 <haleyb_> guess i need to spend more time before the meeting going through these
15:29:50 <jschwarz> Swami_, we are still debating whether or not we need to go forward with the agent-side patch
15:30:16 <jschwarz> which basically covers a bunch of uninitialized usage of variables all over the HA RouterInfo
15:30:26 <jschwarz> this is a discussion ongoing between kevinbenton and liuyulong
15:30:29 <Swami_> jschwarz: so do you mean this is still incomplete without the agent-side fix?
15:30:40 <jschwarz> Swami_, no - sorry for being unclear
15:30:59 <jschwarz> Swami_, the server-side fix should eliminate all the exceptions that were caused agent-side (so the bug is completely fixed)
15:31:09 <jschwarz> as an added measure, we can also fix the agent-side, which is TBD
15:32:21 <Swami_> jschwarz: thanks for the update
15:32:38 <jschwarz> #link https://review.openstack.org/#/c/265672/ is the discussion going on
15:32:58 <Swami_> jschwarz: thanks
15:33:48 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1593354
15:33:48 <openstack> Launchpad bug 1593354 in neutron "SNAT HA failed because of missing nat rule in snat namespace iptable" [Undecided,Incomplete]
15:34:30 <jschwarz> Swami_, I put in some time in that one yesterday
15:34:41 <jschwarz> I couldn't reproduce this on master (though it was reported on mitaka)
15:35:13 <Swami_> jschwarz: ok. should we still test it on mitaka
15:35:14 <jschwarz> since I don't have a mitaka setup at the ready, I commented as much and I'm waiting for Hao to provide additional information
15:35:30 <Swami_> jschwarz: thanks that would help.
15:35:55 <Swami_> jschwarz: But do we need to mark it as Incomplete now or wait for his response, since this was filed against mitaka?
15:36:17 <jschwarz> Swami_, perhaps the Incomplete was premature. I will re-set to New
15:36:31 <Swami_> jschwarz: ok thanks
15:36:52 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1602320
15:36:52 <openstack> Launchpad bug 1602320 in neutron "ha + distributed router: keepalived process kill vrrp child process" [Medium,In progress] - Assigned to He Qing (tsinghe-7)
15:37:39 <Swami_> There is an outstanding patch under review.
15:37:41 <Swami_> #link https://review.openstack.org/#/c/342730/
15:37:45 <jschwarz> Swami_, so yedongcan worked on a patch https://review.openstack.org/#/c/342730/ to solve this
15:38:02 <jschwarz> Swami_, and an alternative patch https://review.openstack.org/#/c/366493/ has been submitted earlier today
15:38:30 <jschwarz> Swami_, it looks like the second patch is easier to look at but I have yet to look at it in depth
15:38:52 <jschwarz> Swami_, also, the first one fails because the argument that was added to keepalived, -R, is only supported in keepalived versions 1.2.8 and newer
15:39:00 <jschwarz> Swami_, the gate uses (you guessed it) 1.2.7
15:39:16 <jschwarz> Swami_, so that's that. I hope to have more on this next week so this is on me
15:39:25 <Swami_> jschwarz: So is this only possible with 1.2.8, or can it be used with 1.2.7 as well?
15:39:55 <jschwarz> Swami_, it looks like the second patch works for 1.2.7 (and is simpler)
15:40:15 <jschwarz> Swami_, need to see if the second patch actually solves this. I should talk with yedongcan tomorrow to see what he thinks
15:40:26 <Swami_> jschwarz: thanks
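For reference, a rough sketch of the kind of runtime check the first patch would need before passing -R (the parsing and the dpkg helper are illustrative, not what the patch actually does):

    # keepalived prints its version on stderr
    ver=$(keepalived --version 2>&1 | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
    if dpkg --compare-versions "$ver" ge 1.2.8; then
        echo "keepalived $ver supports -R"
    else
        echo "keepalived $ver predates -R, need the alternative approach"
    fi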
15:41:17 <Swami_> That's all I have for bugs that need discussion
15:41:29 <Swami_> Are there any other bugs that need further discussion?
15:41:36 <jschwarz> I have a new one, fresh from a few hours ago
15:41:39 <jschwarz> #link https://bugs.launchpad.net/neutron/+bug/1621086
15:41:39 <openstack> Launchpad bug 1621086 in neutron "Port delete on router interface remove" [Undecided,New]
15:41:50 <yedongcan> jschwarz:  I will take a look tomorrow
15:42:38 <jschwarz> Swami_, basically the reporter creates a port and feeds it into router-interface-add
15:42:58 <jschwarz> Swami_, then when the router interface is removed, the port is also deleted (even though it's a port he created himself)
15:43:09 <Swami_> jschwarz: looking at it.
15:43:11 <jschwarz> Swami_, I think this can be closed as Opinion?
15:43:13 <jschwarz> haleyb_, thoughts?
15:43:36 <haleyb_> i was just looking, but is this a dvr-specific issue?
15:43:49 <jschwarz> haleyb_, nope
15:44:03 <Swami_> jschwarz: is it not true that we need to remove the ports connected to the router before deleting the router.
15:44:03 <jschwarz> it's an L3-specific issue.. perhaps more suitable for tomorrow's meeting?
15:44:36 <haleyb_> yes
15:45:08 <Swami_> Ok so he is removing the interface and it deletes the port that he has created.
15:45:22 <jschwarz> Swami_, that's my understanding
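A minimal reproduction sketch of what the reporter describes (resource names are illustrative):

    neutron port-create --name my-port my-net
    neutron router-interface-add my-router port=my-port
    neutron router-interface-delete my-router port=my-port
    neutron port-show my-port   # fails: the user-created port was deleted too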
15:46:01 <yedongcan> Swami_ haleyb_ : Hi, there is a bug we need to look at. jschwarz commented on it earlier
15:46:31 <yedongcan> #link https://bugs.launchpad.net/neutron/+bug/1606741
15:46:31 <openstack> Launchpad bug 1606741 in neutron "Metadata service for instances is unavailable when the l3-agent on the compute host is dvr_snat mode" [High,In progress] - Assigned to Brian Haley (brian-haley)
15:47:20 <jschwarz> I remember this one. It looks like an issue between resources and connectivity
15:47:22 <Swami_> yedongcan: I think we have discussed this earlier. Why do you need to run the l3 agent in dvr_snat mode on a compute host?
15:48:33 <jschwarz> Swami_, because you can? placing l3 agents in dvr_snat mode should be agnostic to what other agents are running on that node IMO
15:48:36 <haleyb_> Swami_: the only config i can think of is a single-node, which i do in devstack, but that's not "normal"
15:48:49 <yedongcan> this makes deployment easy for us
15:49:09 <jschwarz> haleyb, a single node will work - I think the issue is with dvr_snat nodes which are in backup?
15:49:16 <jschwarz> (i.e. DVR+HA)
15:49:19 <Swami_> haleyb_: I agree with you, it is only apt for a single node scenario.
15:50:00 <haleyb_> jschwarz: you don't really want every node to be a network node, scheduling will be bad possibly
15:50:01 <Swami_> jschwarz: but in the case of DVR+HA, do you still run compute instances on the DVR+HA node?
15:50:27 <jschwarz> Swami_, not necessarily - but you can.
15:51:04 <jschwarz> an alternative solution could be to add flows that will direct metadata packets from backup nodes to the active one... is that possible?
15:51:04 <Swami_> jschwarz: The reason I am asking is that if we know the use case properly, then we can design based on that.
15:51:41 <jschwarz> Swami_, the usecase I'm seeing is having 3 network+compute nodes, and having a router that is DVR+HA scheduled to all of them
15:51:52 <jschwarz> the 2 non-active nodes won't get metadata services
15:52:47 <Swami_> jschwarz: Is this because snat is only active on a single node, so the metadata service is only active on the active node?
15:52:56 <jschwarz> Swami_, exactly
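The topology being discussed, sketched as config (the three combined nodes are the reporter's setup, not a recommendation):

    # /etc/neutron/l3_agent.ini on each of the 3 network+compute nodes
    [DEFAULT]
    agent_mode = dvr_snat
    # With a DVR+HA router scheduled to all three, only the node holding
    # the active snat- namespace serves metadata; the two backups do not.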
15:53:11 <Swami_> In that case we need to change the bug description, which says 'compute host' in it.
15:53:28 <jschwarz> yes
15:54:01 <jschwarz> we should probably add this to our weekly sync as well. I'm gonna reply on the patch and see if the alternative approach of adding redirection flows from backup nodes to the active will suffice
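Purely as a sketch of that alternative idea (the table, priority and tunnel port name are invented for illustration; no such flow exists in Neutron today):

    # On a backup node: match instance metadata requests and push them over
    # the tunnel toward the active node instead of the idle local proxy.
    ovs-ofctl add-flow br-int \
        "table=0,priority=100,tcp,nw_dst=169.254.169.254,tp_dst=80,actions=output:patch-tun"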
15:54:18 <Swami_> jschwarz: we should look into it; I am not sure what the implications are of starting the metadata service on a passive node where the snat is not functional. Will it even work?
15:54:46 <jschwarz> Swami_, it should - it'll serve instances in the local node and not other nodes
15:55:45 <Swami_> jschwarz: I am not familiar with the metadata functionality. Sorry for my questions.
15:55:53 <jschwarz> Swami_, all is good
15:55:59 <jschwarz> Swami_, expect my dvr questions soon ;-)
15:56:22 <Swami_> jschwarz: :-)
15:56:57 <haleyb_> we're running out of time
15:57:22 <haleyb_> any more bugs?
15:57:30 <Swami_> haleyb_: not for today.
15:57:36 <haleyb_> #topic Gate failures
15:57:40 <jschwarz> haleyb_, that's it from me
15:58:00 <Swami_> jschwarz: did you get a chance to try the functional test that I sent you a while back?
15:58:11 <jschwarz> Swami_, not yet - hopefully I will next week
15:58:18 <Swami_> jschwarz: thanks
15:58:20 <haleyb_> My only comment here is that it's hard to tell about the gate, since it looks like jobs were added (for migration ?) and our failures went to zero
15:58:35 <jschwarz> Swami_, I won't be present for next week's meeting (travelling), but if I do come up with something I'll mail you
15:58:47 <Swami_> haleyb_: great
15:58:50 <haleyb_> #topic Open Discussion
15:58:51 <jschwarz> haleyb_, no more gate failures?! rejoice!
15:59:00 <jschwarz> we should tell armax ;-)
15:59:02 <haleyb_> Swami_: yes, zero failures is good, except they're just hiding
15:59:14 <Swami_> jschwarz: that is not permanent
15:59:42 <jschwarz> :)
15:59:56 <haleyb_> well, it's the end of the hour, thanks everyone for the status, let's just get the last few bugs fixed on the high list
16:00:03 <haleyb_> #endmeeting