15:01:21 #startmeeting neutron_dvr
15:01:22 Meeting started Wed Sep 7 15:01:21 2016 UTC and is due to finish in 60 minutes. The chair is haleyb_. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:23 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:25 The meeting name has been set to 'neutron_dvr'
15:01:29 haleyb_, underscores ++
15:01:35 #chair Swami_
15:01:35 Current chairs: Swami_ haleyb_
15:01:44 guess swami is too :)
15:01:56 hi
15:02:33 #topic Announcements
15:02:49 N-3 is out, so we're onto RC1
15:03:19 So any patches targeted for Newton must have a bug targeted at -rc1
15:04:42 ping a core or armando to get it targeted; we only want the really important stuff, and the bar will be set higher as we move forward
15:04:56 haleyb_: makes sense
15:05:11 haleyb_, can you remind the team when RC1 will be cut?
15:05:22 RFEs can be added via the postmortem doc
15:05:33 jschwarz: let me look
15:06:29 https://releases.openstack.org/newton/schedule.html shows the week of Sept 12th, so next week
15:06:40 haleyb_, thanks
15:07:46 #topic Bugs
15:08:32 haleyb_, I sorted the list out a bit, added a few new patches from the past week, and gave an indication of which ones are HA
15:08:40 * jschwarz is bug deputy so had an eye out for new bugs
15:09:10 haleyb_: This week there are two new bugs
15:09:30 #link https://bugs.launchpad.net/neutron/+bug/1620824
15:09:30 Launchpad bug 1620824 in neutron "Neutron DVR(SNAT) steals FIP traffic" [Undecided,In progress] - Assigned to David-wahlstrom (david-wahlstrom)
15:10:22 i just noticed they are not using the built-in l2pop
15:10:33 The bug report has a detailed description of how it occurs, but it says that under heavy load the connection tracking misbehaves and the packet is forwarded to the SNAT node.
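(Editor's note: the connection-tracking behavior referenced for bug 1620824 relates to the kernel's nf_conntrack_tcp_loose tunable, which comes up in the discussion that follows. A minimal ops sketch for inspecting it, assuming a Linux network node with the conntrack module loaded:)

```shell
# Inspect conntrack's "loose" TCP pickup setting (default 1).
# Setting it to 0 stops conntrack from adopting mid-stream connections,
# which per the discussion below is why failover between network nodes
# breaks when tcp_loose=0 is set.
sysctl net.netfilter.nf_conntrack_tcp_loose
```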
15:11:00 haleyb_: is that the main problem
15:11:38 haleyb_, Swami_, I'm wondering if it reproduces with the reference l2pop
15:11:48 if not, this could be not-our-bug to solve anyway
15:11:50 Swami_: i don't think so, but it would be good to reproduce without it since we don't have that code
15:12:23 Swami_: did you see my note in the bug on not setting this to zero?
15:12:27 haleyb_: I don't think we have seen such behavior under heavy load with the reference l2pop.
15:12:40 haleyb_: no
15:13:24 with non-DVR setups we experimented with setting tcp_loose=0 and realized failover between network nodes didn't work, since connections get dropped with that setting
15:13:58 haleyb_: makes sense
15:14:06 that's my worry about changing this, even with just SNAT traffic
15:15:01 So in that case we can wait and see how it goes if they don't set tcp_loose to zero.
15:15:30 The next one in the list is
15:15:35 #link https://bugs.launchpad.net/neutron/+bug/1476469
15:15:35 Launchpad bug 1476469 in neutron "with DVR, a VM can't use floatingIP and VPN at the same time" [Medium,Confirmed]
15:16:17 I am not sure if centralized routers can still support VPN with floatingIPs. My memory is weak.
15:16:21 that is an old bug, resurrected
15:16:54 haleyb_: yeah, but my doubt is whether it is still a valid bug or not.
15:17:32 oh, reproduced on mitaka
15:17:35 Because today we don't run the IPsec process in the FIP namespace, we only run it in the SNAT namespace.
15:17:55 is this something we should move to the vpnaas team?
15:18:01 Hi, Swami_
15:18:08 As per the design we want to run the VPN process only on the SNAT side, since we make this service a singleton running on the centralized node.
15:18:12 yedongcan: hi
15:18:23 centralized routers can't support work with port VPN with floatingIPs
15:18:57 yedongcan: not clear what you mean.
15:19:37 Swami_: sorry, in centralized routers VPN can't work with floatingIPs
15:20:15 yedongcan: yes, that was my understanding too.
So we can then mark this bug as invalid
15:20:50 The next in the list is
15:20:54 #link https://bugs.launchpad.net/neutron/+bug/1612192
15:20:54 Launchpad bug 1612192 in neutron "L3 DVR: Unable to complete operation on subnet" [Critical,Confirmed]
15:22:14 haleyb_: I did see your update on the bug. So are we still considering this as critical? I have not triaged this yet.
15:22:19 I updated that the other day; that issue was only seen in the SFC and OVN jobs, and only in small numbers in dvr
15:23:12 haleyb_: thanks
15:23:39 The next in the list is
15:23:42 #link https://bugs.launchpad.net/neutron/+bug/1612804
15:23:42 Launchpad bug 1612804 in neutron "test_shelve_instance fails with sshtimeout" [Critical,Confirmed]
15:23:48 This is again a gate failure.
15:24:16 I don't think we have triaged this bug either.
15:25:07 no, i just clicked the logstash link to see if it's still there
15:25:14 haleyb_: thanks
15:26:02 and the page was completely blank, i'll have to look again later, maybe kibana is flaky
15:26:03 * jschwarz is not getting cooperation from logstash
15:26:14 jschwarz: glad it's not just me
15:26:21 ok.
15:26:24 let us move on
15:26:28 #link https://bugs.launchpad.net/neutron/+bug/1597461
15:26:28 Launchpad bug 1597461 in neutron "L3 HA: 2 masters after reboot of controller" [High,Fix released] - Assigned to John Schwarz (jschwarz)
15:26:37 jschwarz: any update
15:26:51 Swami_, the patch merged last week and the issue should be fixed
15:27:10 Swami_, I'm pondering whether or not we should backport it - if so I'll send the patches today/tomorrow morning
15:27:17 i marked it as fixed just a minute ago, will need to update the wiki and move it lower
15:27:34 haleyb_: ok thanks
15:27:36 jschwarz: it can't be backported without a dependent patch
15:27:44 haleyb, which one?
15:28:13 provisioning blocks
15:28:21 #link https://bugs.launchpad.net/neutron/+bug/1607381
15:28:21 Launchpad bug 1607381 in neutron "HA router in l3 dvr_snat/legacy agent has no ha_port" [High,Fix released] - Assigned to John Schwarz (jschwarz)
15:28:21 and that has a db change
15:28:28 haleyb, ack. will look into it. thanks
15:29:04 Swami_, this has also been fixed - the patch merged on Friday
15:29:24 ok, I need to clean up the wiki. thanks
15:29:28 Swami_, we decided to fix this server-side and not agent-side, so the server-side patch merged
15:29:30 guess i need to spend more time before the meeting going through these
15:29:50 Swami_, we are still debating whether or not we need to go forward with the agent-side patch
15:30:16 which basically covers a bunch of uninitialized uses of variables all over the HA RouterInfo
15:30:26 this is a discussion ongoing between kevinbenton and liuyulong
15:30:29 jschwarz: so do you mean this is still incomplete without the agent-side fix?
15:30:40 Swami_, no - sorry for being unclear
15:30:59 Swami_, the server-side fix should eliminate all the exceptions that were caused agent-side (so the bug is completely fixed)
15:31:09 as an added measure, we can also fix the agent-side, which is TBD
15:32:21 jschwarz: thanks for the update
15:32:38 #link https://review.openstack.org/#/c/265672/ is where the discussion is going on
15:32:58 jschwarz: thanks
15:33:48 #link https://bugs.launchpad.net/neutron/+bug/1593354
15:33:48 Launchpad bug 1593354 in neutron "SNAT HA failed because of missing nat rule in snat namespace iptable" [Undecided,Incomplete]
15:34:30 Swami_, I put some time into that one yesterday
15:34:41 I couldn't reproduce this on master (though it was reported on mitaka)
15:35:13 jschwarz: ok. should we still test it on mitaka?
15:35:14 since I don't have a mitaka setup at the ready, I commented as much and I'm waiting for Hao to provide additional information
15:35:30 jschwarz: thanks, that would help.
15:35:55 jschwarz: But do we need to mark it as incomplete now or wait for his response, since this was filed against mitaka?
15:36:17 Swami_, perhaps the Incomplete was premature. I will re-set it to New
15:36:31 jschwarz: ok thanks
15:36:52 #link https://bugs.launchpad.net/neutron/+bug/1602320
15:36:52 Launchpad bug 1602320 in neutron "ha + distributed router: keepalived process kill vrrp child process" [Medium,In progress] - Assigned to He Qing (tsinghe-7)
15:37:39 There is an outstanding patch under review.
15:37:41 #link https://review.openstack.org/#/c/342730/
15:37:45 Swami_, so yedongcan worked on a patch https://review.openstack.org/#/c/342730/ to solve this
15:38:02 Swami_, and an alternative patch https://review.openstack.org/#/c/366493/ was submitted earlier today
15:38:30 Swami_, it looks like the second patch is easier to review, but I have yet to look at it in depth
15:38:52 Swami_, also, the first one fails because the argument that was added to keepalived, -R, is only supported in keepalived versions 1.2.8 and later
15:39:00 Swami_, the gate uses (you guessed it) 1.2.7
15:39:16 Swami_, so that's that. I hope to have more on this next week, so this is on me
15:39:25 jschwarz: So is this only possible with 1.2.8, or can it be used with 1.2.7 as well?
15:39:55 Swami_, it looks like the second patch works with 1.2.7 (and is simpler)
15:40:15 Swami_, need to see if the second patch actually solves this. I should talk with yedongcan tomorrow to see what he thinks
15:40:26 jschwarz: thanks
15:41:17 That's all I have for bugs that need discussion.
15:41:29 Are there any other bugs that need further discussion?
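(Editor's note: a quick way to see which keepalived a gate or test node is running, relevant to the -R flag discussion above. A minimal sketch, assuming keepalived is installed on the node:)

```shell
# Per the discussion, the -R argument the first patch relies on only
# exists in keepalived 1.2.8 and later, while the gate ran 1.2.7,
# so checking the installed version first is worthwhile.
keepalived --version
```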
15:41:36 I have a new one, fresh from a few hours ago
15:41:39 #link https://bugs.launchpad.net/neutron/+bug/1621086
15:41:39 Launchpad bug 1621086 in neutron "Port delete on router interface remove" [Undecided,New]
15:41:50 jschwarz: I will see tomorrow
15:42:38 Swami_, basically the reporter creates a port and feeds it into router-interface-add
15:42:58 Swami_, then when the router is deleted, the port is also deleted (even though it's a port he created himself)
15:43:09 jschwarz: looking at it.
15:43:11 Swami_, I think this can be closed as Opinion?
15:43:13 haleyb_, thoughts?
15:43:36 i was just looking, but is this a dvr-specific issue?
15:43:49 haleyb_, nope
15:44:03 jschwarz: is it not true that we need to remove the ports connected to the router before deleting the router?
15:44:03 it's an L3-specific issue.. perhaps more suitable for tomorrow's meeting?
15:44:36 yes
15:45:08 Ok, so he is removing the interface and it deletes the port that he has created.
15:45:22 Swami_, that's my understanding
15:46:01 Swami_ haleyb_ : Hi, there is a bug that needs a look. jschwarz commented on it earlier
15:46:31 #link: https://bugs.launchpad.net/neutron/+bug/1606741
15:46:31 Launchpad bug 1606741 in neutron "Metadata service for instances is unavailable when the l3-agent on the compute host is dvr_snat mode" [High,In progress] - Assigned to Brian Haley (brian-haley)
15:47:20 I remember this one. It looks like an issue between resources and connectivity
15:47:22 yedongcan: I think we have discussed this earlier. Why do you need to run a dvr_snat mode l3 agent on a compute host?
15:48:33 Swami_, because you can? placing l3 agents in dvr_snat mode should be agnostic to what other agents are running on that node IMO
15:48:36 Swami_: the only config i can think of is a single node, which i do in devstack, but that's not "normal"
15:48:49 this makes deployment easier for us
15:49:09 haleyb, a single node will work - I think the issue is with dvr_snat nodes which are in backup?
15:49:16 (i.e. DVR+HA)
15:49:19 haleyb_: I agree with you, it is only apt for a single-node scenario.
15:50:00 jschwarz: you don't really want every node to be a network node, scheduling will possibly be bad
15:50:01 jschwarz: but in the case of DVR+HA, do you still create compute hosts on the DVR+HA node?
15:50:27 Swami_, not necessarily - but you can.
15:51:04 an alternative solution could be to add flows that will direct metadata packets from backup nodes to the active one... is that possible?
15:51:04 jschwarz: The reason I am asking is that if we know the use case properly then we can design based on that.
15:51:41 Swami_, the use case I'm seeing is having 3 network+compute nodes, and having a router that is DVR+HA scheduled to all of them
15:51:52 the 2 non-active nodes won't get metadata services
15:52:47 jschwarz: Is this because snat is only active on a single node, so the metadata service is only active on the active node?
15:52:56 Swami_, exactly
15:53:11 In that case we need to change the bug description, which says 'compute host' in it.
15:53:28 yes
15:54:01 we should probably add this to our weekly sync as well. I'm gonna reply on the patch and see if the alternative approach of adding redirection flows from backup nodes to the active one will suffice
15:54:18 jschwarz: we should look into it, I am not sure what the implications are of starting the metadata service on a passive node where the snat is not functional. Will it even work?
15:54:46 Swami_, it should - it'll serve instances on the local node and not other nodes
15:55:45 jschwarz: I am not familiar with the metadata functionality. Sorry for my questions.
15:55:53 Swami_, all is good
15:55:59 Swami_, expect my dvr questions soon ;-)
15:56:22 jschwarz: -:)
15:56:57 we're running out of time
15:57:22 any more bugs?
15:57:30 haleyb_: not for today.
15:57:36 #topic Gate failures
15:57:40 haleyb_, that's it from me
15:58:00 jschwarz: did you get a chance to test the functional test that I sent you a while back?
15:58:11 Swami_, not yet - hopefully I will next week
15:58:18 jschwarz: thanks
15:58:20 My only comment here is that it's hard to tell about the gate, since it looks like jobs were added (for migration?) and our failures went to zero
15:58:35 Swami_, I won't be present at next week's meeting (travelling), but if I do come up with something I'll mail you
15:58:47 haleyb_: great
15:58:50 #topic Open Discussion
15:58:51 haleyb_, no more gate failures?! rejoice!
15:59:00 we should tell armax ;-)
15:59:02 Swami_: yes, zero failures is good, except they're just hiding
15:59:14 jschwarz: that is not permanent
15:59:42 :)
15:59:56 well, it's the end of the hour, thanks everyone for the status, let's just get the last few bugs fixed on the high list
16:00:03 #endmeeting
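(Editor's note: for reference, the dvr_snat agent mode discussed in the bug 1606741 metadata thread is set in the L3 agent's configuration file. A minimal illustrative fragment for the combined network+compute node scenario described in the meeting; this is a sketch, not a deployment recommendation:)

```ini
# neutron l3_agent.ini on a combined network+compute node
# (the DVR+HA scenario discussed above). Per the discussion, only the
# node hosting the active SNAT instance gets a working metadata proxy.
[DEFAULT]
agent_mode = dvr_snat
```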