15:00:14 #startmeeting neutron_dvr
15:00:14 Meeting started Wed Nov 4 15:00:14 2015 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:15 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:15 o/
15:00:16 hi
15:00:16 hi
15:00:17 The meeting name has been set to 'neutron_dvr'
15:00:19 moo
15:00:46 hi
15:00:53 it's been a while since i chaired a meeting, have a cheat sheet here though
15:01:38 haleyb: feel free to cochair somebody just in case?
15:01:40 #topic Announcements
15:02:10 My only announcement is: Welcome, as this is the first DVR meeting in a while
15:02:18 yay!
15:02:24 #topic Bugs
15:02:29 somebody pass around the jug of punch
15:02:33 haleyb: hi
15:02:51 I have captured a bunch of bugs in the wiki and let us go through the bugs one by one.
15:03:10 #link https://bugs.launchpad.net/neutron/+bug/1365473/
15:03:13 Launchpad bug 1365473 in neutron "Unable to create a router that's both HA and distributed" [High,In progress] - Assigned to Adolfo Duarte (adolfo-duarte)
15:03:17 https://wiki.openstack.org/wiki/Meetings/Neutron-DVR
15:03:38 This is the DVR HA bug and the server side code is still under review. The agent side patch had gone in.
15:04:07 We need some cores attention on this patch. Otherwise it is ready to be pushed in.
15:04:09 how close is the server side patch to making it in?
15:04:28 n/m messages passed in the aether
15:04:35 fitoduarte: do you have any comments on this.
15:05:03 o/
15:05:08 yeap. it has gotten some reviews these past couple of days.
15:05:09 * sc68cal joins late
15:05:11 Swami_: looks like https://review.openstack.org/#/c/143169 is the open review
15:05:34 haleyb: yes, that's the server side patch Swami_ was talking about
15:05:48 haleyb: it needs some core eyes
15:05:52 haleyb: yes this needs to get in for the HA to work.
15:06:02 it keeps going into merge conflict.
besides that I don't think there is anyone adding more to it.
15:06:29 yes, without it you can't use dvr ha
15:06:48 fitoduarte: thanks, can you ping either carl_baldwin or haleyb today to get their attention
15:07:11 * carl_baldwin will look at it today.
15:07:16 carl_baldwin: thanks
15:07:17 I will review it today, there are other cores listed as well that you might want to ping, like Assaf
15:07:29 thanks (ping ping)
15:07:35 haleyb: I think assaf is on vacation
15:08:02 #action carl_baldwin and haleyb will look at https://review.openstack.org/#/c/143169
15:08:06 are we done with this bug, can we move on
15:08:13 yup
15:08:15 is assaf on vacation? I have not seen him in a few days
15:08:21 The next in the list is
15:08:24 #link https://review.openstack.org/#/c/229561/
15:09:01 This patch got merged. But there is a dependent patch that is still in review
15:09:26 #link https://review.openstack.org/#/c/230079/
15:09:29 fitoduarte: I believe amuller is on vacation, yes
15:10:09 haleyb, carl_baldwin I need your attention on this above patch.
15:10:26 note - we are now talking bug 1501873
15:10:26 bug 1501873 in neutron "FIP Namespace add/delete race condition seen in DVR router log" [Undecided,In progress] https://launchpad.net/bugs/1501873 - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:11:21 Swami_: ack
15:11:29 regXboi: Yes you are right bug 1501873
15:11:29 bug 1501873 in neutron "FIP Namespace add/delete race condition seen in DVR router log" [Undecided,In progress] https://launchpad.net/bugs/1501873 - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:11:29 Swami_: ack. I had reviewed it last week but was waiting for Carl since you had addressed his comments with latest patch.
summit got in the way i'm sure
15:11:54 haleyb: thanks
15:11:59 carl_baldwin: thanks
15:12:30 Does anyone have anything to say about this above bug, or can we move on
15:12:51 nit: somebody should give it a priority at some point
15:12:58 but other than that, let's move on
15:13:18 regXboi: got it, we will move on
15:13:23 The next in the list
15:13:29 #link https://bugs.launchpad.net/neutron/+bug/1372141/
15:13:29 Launchpad bug 1372141 in neutron "StaleDataError while updating ml2_dvr_port_bindings" [Medium,Confirmed]
15:14:17 I think there was a patch for this bug.
15:14:55 Does anyone have the link handy for this patch
15:15:08 can somebody update the logstash query? the old one doesn't work
15:15:35 the patch was https://review.openstack.org/#/c/143761/
15:16:13 obondarev_: I think this is the same as pavel bondar was working on.
15:16:52 Swami_: which one?
15:17:26 that patch was abandoned and will likely need a serious merge
15:17:39 Swami_: i don't see any recent patches or info, so we'll need some new info on this one
15:18:39 haleyb: ok I will update this bug or will add a comment if it is the duplicate of the one that pavel is working on.
15:18:42 I think a new logstash query would help
15:18:50 so... a logstash query for StaleDataError shows 54 hits in the last 7 days
15:18:56 and *none* of them are from DVR jobs
15:19:22 #link https://bugs.launchpad.net/neutron/+bug/1494351
15:19:22 Launchpad bug 1494351 in neutron "Observed StaleDataError in gate-neutron-dsvm-api tests if reference IPAM driver is used" [High,In progress] - Assigned to Pavel Bondar (pasha117)
15:19:51 all: there is a new logstash query in the notes for the previous bug
15:20:06 I thought that this bug above was same as the 1372141
15:20:34 It may be/have been
15:20:35 regXboi: thanks
15:21:21 we need to figure out if they both are the same and if so we can close one and focus on the latter.
15:21:45 ok let us move on to the next
15:22:00 #link https://review.openstack.org/#/c/228026/
15:22:09 Swami_: the IPAM change merged, we can look at it again next week and see if failure rate still there...
15:22:27 (bug 1499785 and bug 1499787)
15:22:27 bug 1499785 in neutron "Static routes are not added to the qrouter namespace for DVR routers" [Undecided,In progress] https://launchpad.net/bugs/1499785 - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:22:27 haleyb: ok, makes sense.
15:22:28 bug 1499787 in neutron "Static routes are attempted to add to SNAT Namespace of DVR routers without checking for Router Gateway." [Undecided,In progress] https://launchpad.net/bugs/1499787 - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:22:47 regXboi: thanks for the links
15:22:54 The patch is up for review.
15:23:52 carl_baldwin and assaf had reviewed this patch and I have addressed the review comments.
15:24:18 I think carl_baldwin had a question if we need to add all the routes in the snat namespace or not.
15:24:53 In my opinion, we should be only adding the external network related static routes in the snat_namespace.
15:25:37 My question was more like “why not add all the same routes to all the router namespaces”. My thinking is that the code could be less complicated.
15:27:39 carl_baldwin: I am ok with your suggestion but we should not introduce some other issues.
15:28:10 Swami_: Does it introduce issues?
15:28:42 #link https://docs.google.com/presentation/d/1THyot5yZVLb4ouU90GrY8Mmwm_27-jI6mZF8ZljtoFU/edit#slide=id.p
15:28:58 carl_baldwin: I think this is the slide that was put together to discuss the static routes.
15:29:26 Swami_: i don't have permission to see that slide
15:29:54 is there a danger
15:30:01 to add incorrect routes to snat that way?
15:30:23 haleyb: changed the permission on the slide
15:31:33 fitoduarte: These are next hop routes, and it does not make sense for private subnet routes to be added to the snat namespace, since the nexthop for the external network will be in the external network space.
15:32:29 Swami_: right, should the snat only have the on-link and default route ?
15:32:44 haleyb: yes that's what I thought.
15:33:33 ok, if you have any suggestions please add your comments to the patch and let us move on.
15:33:59 The next one is
15:34:02 #link https://review.openstack.org/#/c/225319/
15:34:32 This patch has been under review for a while and carl_baldwin had blessed this patch a couple of times and it went into merge conflict.
15:34:59 There was a gate issue yesterday where all the lbaasv1 tests were failing.
15:35:12 once the jenkins passes this can go in.
15:35:18 #action haleyb will look at https://review.openstack.org/#/c/225319/
15:35:28 Ok, let us move on to the next one.
15:35:47 #link https://bugs.launchpad.net/neutron/+bug/1504726/
15:35:47 Launchpad bug 1504726 in neutron "The vm can not access the vip of load balancer under DVR enviroment" [High,New] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:36:03 This was the bug that was filed on lbaas and dvr with Kilo.
15:36:14 I verified on the master branch and I don't see any issue.
15:37:01 well, since kilo is security only now, how about we mark it invalid and ask for a retest on liberty/master?
15:37:03 I have one from the fwaas side to toss on the pile
15:37:09 But I was trying to triage this on the Kilo branch and was not successful in getting the LBaas to work with devstack. So this is still in progress and I asked for some help from the lbaas team to run lbaas with kilo on devstack.
15:37:21 whoa... why?
15:37:39 sc68cal: you can go ahead and file a bug and we can triage it and take it next week, if it is ok for you.
15:37:51 I'll just toss in the link for now
15:37:53 https://bugs.launchpad.net/neutron/+bug/1476097
15:37:53 Launchpad bug 1476097 in neutron "[fwaas]Support fwaas to control east-west traffic in dvr router" [High,Triaged] - Assigned to lee jian (leejian0612)
15:37:55 regxboi: agree on lbaas issue bug
15:38:15 sc68cal: I had a couple of discussions with mickeys and sridhark on this bug.
15:38:16 haleyb, carl_baldwin: what are your thoughts?
15:38:39 * carl_baldwin catches up on last few minutes...
15:38:48 The suggestion is to wait until your v2 api targeted towards the VM ports arrives and then use those APIs rather than making any changes to the DVR right now.
15:39:12 sc68cal: making changes to the DVR to support the East-West will be complicated at this point.
15:39:40 Swami_: if we don't see it in master then wait for Zhou to answer your question. We should move on since we've only got :20 left
15:39:44 sc68cal: I have already updated the bug with my notes. Please let me know if you need more info on it.
15:39:50 Swami_: ack. I just wanted to make sure both sides are in sync
15:39:53 ok, thanks.
15:40:08 #link https://bugs.launchpad.net/neutron/+bug/1505575/
15:40:08 Launchpad bug 1505575 in neutron "Fatal memory consumption by neutron-server with DVR at scale" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
15:40:17 haleyb, carl_baldwin: I still want to mark 1504726 as invalid as it is kilo
15:40:29 oleg was working on this bug.
15:40:34 the patch for it is there
15:40:51 armax is not happy with pagination approach though
15:40:59 so we probably need to review it.
15:41:18 obondarev_: thanks for the info
15:41:30 obondarev_: do you have any other thoughts on this patch.
15:41:36 regXboi: I guess I don’t understand the state of support for Kilo. I have a hard time believing that we’re not fixing bugs.
15:41:43 however I need to verify if https://review.openstack.org/#/c/234067/ is still needed given https://review.openstack.org/#/c/214974/
15:42:14 I was unable to reproduce the bug on the 50 nodes scale with https://review.openstack.org/#/c/214974/ applied
15:42:41 though I was sure it won’t help in case agent is resyncing and requests info for all its routers
15:43:08 will try more at a higher scale
15:43:16 obondarev_: can you update us on your testing next week.
15:43:26 obondarev_: then we can decide on this patch.
15:43:34 Ok moving on to the next one.
15:43:34 carl_baldwin: security-supported is described in https://wiki.openstack.org/wiki/StableBranch
15:43:39 Swami_: ok
15:43:47 it means it is a target for security fixes by the VMT
15:44:00 #link https://bugs.launchpad.net/neutron/+bug/1496201/
15:44:00 Launchpad bug 1496201 in neutron "DVR: router namespace can't be deleted if bulk delete VMs" [Medium,New] - Assigned to Kasey Alusi (kasey-alusi)
15:44:09 but I read that as meaning we don't have to fix other items for it
15:44:46 regXboi: carl_baldwin: let us check with armax and then decide the fate of that bug.
15:45:20 The above bug should be triaged for bulk delete.
15:45:20 * regXboi attempts to summon armax and just for S&G mestery as well :)
15:45:49 The last bug on today's list is
15:45:54 #link https://bugs.launchpad.net/neutron/+bug/1512199/
15:45:54 Launchpad bug 1512199 in neutron "change vm fixed ips will cause unable to communicate to vm in other network " [Undecided,Invalid] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:46:14 This bug I did triage it and could not reproduce in the master branch. So marked it as invalid.
15:46:22 Swami_: since we are still triaging the last two bugs, and they're kilo, let's move on since i want to get to gate failures (and there might be other bugs not on the agenda?)
15:46:33 amen to gate failures
15:46:47 haleyb: sure, that's all I had for bugs.
15:46:51 * regXboi will take action item to triage undecided bugs
15:47:07 I know there are other bugs as well, right carl_baldwin ? We saw one internally that I need to file an upstream bug for
15:47:36 haleyb: Yes, I don’t have access to the internal Jira, otherwise I’d get the description.
15:47:44 #topic Gate-failures
15:48:12 we'll move on - regXboi did you want to talk about the issues ?
15:48:25 haleyb: sure
15:48:59 If I look at the check pipeline graphs, we are still seeing too many failures in the gate to make mn voting
15:49:21 http://goo.gl/j5UkwT
15:49:25 OTOH, I think we can argue that single node dvr can be made voting again
15:49:29 haleyb: thx
15:49:52 but note, that is only *think*
15:50:23 most of the failures I'm seeing are still FIP failures and I'm still trying to capture those locally to understand what the failure mode is
15:50:43 It still seems to be double the failure rate of the non-dvr job. Need more good history.
15:51:09 what do people think of the idea of trying to get the pipeline job to dump the cloud state when a FIP failure occurs?
15:51:17 regXboi: yes, it would just be good to maybe have that under 10% for some time to say it's stable, the others are single digits
15:51:29 armax mentioned that we should poke into the gate systems to see what is the state of the VM when these failures happen.
15:52:02 regXboi: Is there a way to do it and if so can we do it.
15:52:05 poking into the gate systems is going to likely take job patches to do the poking
15:52:37 Is there anyone from tempest who could help us figure out the best way to get the state? I vaguely remember a comment in a session at summit where someone from tempest spoke up about this. Anyone recall?
15:52:48 regXboi: I did not get it.
15:52:50 * regXboi suspects it may be *possible*, but we'll have to work with qa/tempest
15:53:25 carl_baldwin: that would have been good to follow up on :(
15:53:31 carl_baldwin: but I think armax was not ok adding any help into the tempest and he wanted someone to look at the VM state in the gate.
15:53:53 Swami_: I don't believe that is practical
15:54:00 because of the tests cleaning up after themselves
15:54:16 regXboi: So much going on… But, it isn’t too late. We need someone to take an action.
15:54:18 carl_baldwin: i don't recall, but perhaps we can start with the tempest PTL
15:54:21 regXboi: Ok I will check with armax on this.
15:54:53 haleyb: can he join this meeting next week, in case we don't make any headway.
15:55:03 #action Swami to ping armax about tempest resources to look at VMs during DVR failures in the gate
15:55:15 I'll ping mtreinish on this in the qa channel and see what *might* be possible
15:55:28 thanks regXboi
15:55:55 anything else on the gate?
15:56:02 not from me
15:56:26 #topic Performance/Scalability
15:56:39 haleyb: I remember assaf mentioned that he had a script that can probe the logstash and pull the data, filter it and provide the details of the failures.
15:57:10 regXboi: do you have something like that script.
15:57:11 obondarev: i had only added https://review.openstack.org/#/c/231555/ to the agenda as a perf change in review
15:57:16 Swami_: no
15:57:21 https://review.openstack.org/#/c/231555/ (and dependent) needs reviews
15:57:56 all: we desperately need to cut down on L3 agent message queue traffic
15:58:10 Swami_: you had a -1 on that patch, can you take a look again?
15:58:19 haleyb: yeah, then the next round of scale tests might probably reveal other pain points/bottlenecks
15:58:21 haleyb: sure
15:58:27 * regXboi has seen installations where neutron L3 agent queue is the only queue dying
15:58:50 regXboi: Could you express that in the form of a bug?
;)
15:59:09 carl_baldwin: you *really* don't want that
15:59:17 https://review.openstack.org/#/c/222863/ should also help scalability
15:59:26 regXboi: yes, the latest FIP change oleg did will help, there probably are more, but filing a bug will get it on the agenda :)
15:59:28 regXboi: ?
15:59:50 but need to investigate why it breaks graceful ovs agent restart
15:59:55 so this would be an overarching bug for all of L3
16:00:13 because the message target patches are addressing half of it
16:00:19 regXboi: we'd have to identify them one by one
16:00:23 and the O(n) stuff I've been chasing addresses the other half of it
16:00:26 top of hour, we need to end...
16:00:33 let's continue in channel
16:00:37 #endmeeting neutron_dvr