15:00:14 <haleyb> #startmeeting neutron_dvr
15:00:14 <openstack> Meeting started Wed Nov  4 15:00:14 2015 UTC and is due to finish in 60 minutes.  The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:15 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:15 <obondarev_> o/
15:00:16 <Swami_> hi
15:00:16 <carl_baldwin> hi
15:00:17 <openstack> The meeting name has been set to 'neutron_dvr'
15:00:19 <regXboi> moo
15:00:46 <fitoduarte> hi
15:00:53 <haleyb> it's been a while since i chaired a meeting, have a cheat sheet here though
15:01:38 <regXboi> haleyb: feel free to cochair somebody just in case?
15:01:40 <haleyb> #topic Announcements
15:02:10 <haleyb> My only announcement is: Welcome, as this is the first DVR meeting in a while
15:02:18 <obondarev_> yay!
15:02:24 <haleyb> #topic Bugs
15:02:29 <regXboi> somebody pass around the jug of punch
15:02:33 <Swami_> haleyb: hi
15:02:51 <Swami_> I have captured a bunch of bugs in the wiki and let us go through the bugs one by one.
15:03:10 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1365473/
15:03:13 <openstack> Launchpad bug 1365473 in neutron "Unable to create a router that's both HA and distributed" [High,In progress] - Assigned to Adolfo Duarte (adolfo-duarte)
15:03:17 <obondarev_> https://wiki.openstack.org/wiki/Meetings/Neutron-DVR
15:03:38 <Swami_> This is the DVR HA bug and the server side code is still under review. The agent side patch had gone in.
15:04:07 <Swami_> We need some cores attention on this patch. Otherwise it is ready to be pushed in.
15:04:09 <regXboi> how close is the server side patch to making it in?
15:04:28 <regXboi> n/m messages passed in the aether
15:04:35 <Swami_> fitoduarte: do you have any comments on this.
15:05:03 <sc68cal> o/
15:05:08 <fitoduarte> yeap.  it has gotten some reviews these past couple of days.
15:05:09 * sc68cal joins late
15:05:11 <haleyb> Swami_: looks like https://review.openstack.org/#/c/143169 is the open review
15:05:34 <regXboi> haleyb: yes, that's the server side patch Swami_ was talking about
15:05:48 <regXboi> haleyb: it needs some core eyes
15:05:52 <Swami_> haleyb: yes this needs to get in for the HA to work.
15:06:02 <fitoduarte> it keeps going into merge conflict. besides that I don't think there is anyone adding more to it.
15:06:29 <fitoduarte> yes, without it you can't use dvr ha
15:06:48 <Swami_> fitoduarte: thanks, can you ping either carl_baldwin or haleyb today to get their attention
15:07:11 * carl_baldwin will look at it today.
15:07:16 <Swami_> carl_baldwin: thanks
15:07:17 <haleyb> I will review it today, there are other cores listed as well that you might want to ping, like Assaf
15:07:29 <fitoduarte> thanks (ping ping)
15:07:35 <Swami_> haleyb: I think assaf is on vacation
15:08:02 <haleyb> #action carl_baldwin and haleyb will look at https://review.openstack.org/#/c/143169
15:08:06 <Swami_> are we done with this bug, can we move on
15:08:13 <haleyb> yup
15:08:15 <fitoduarte> is assaf on vacation? I have not seen him in a few days
15:08:21 <Swami_> The next in the list is
15:08:24 <Swami_> #link https://review.openstack.org/#/c/229561/
15:09:01 <Swami_> This patch got merged. But there is a dependent patch that is still in review
15:09:26 <Swami_> #link https://review.openstack.org/#/c/230079/
15:09:29 <regXboi> fitoduarte: I believe amuller is on vacation, yes
15:10:09 <Swami_> haleyb, carl_baldwin I need your attention on this above patch.
15:10:26 <regXboi> note - we are now talking bug 1501873
15:10:26 <openstack> bug 1501873 in neutron "FIP Namespace add/delete race condition seen in DVR router log" [Undecided,In progress] https://launchpad.net/bugs/1501873 - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:11:21 <carl_baldwin> Swami_: ack
15:11:29 <Swami_> regXboi: Yes you are right bug 1501873
15:11:29 <openstack> bug 1501873 in neutron "FIP Namespace add/delete race condition seen in DVR router log" [Undecided,In progress] https://launchpad.net/bugs/1501873 - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:11:29 <haleyb> Swami_: ack.  I had reviewed it last week but was waiting for Carl since you had addressed his comments with the latest patch.  summit got in the way i'm sure
15:11:54 <Swami_> haleyb: thanks
15:11:59 <Swami_> carl_baldwin: thanks
15:12:30 <Swami_> Does anyone have anything to say about this above bug, or can we move on
15:12:51 <regXboi> nit: somebody should give it a priority at some point
15:12:58 <regXboi> but other than that, let's move on
15:13:18 <Swami_> regXboi: got it, we will move on
15:13:23 <Swami_> The next in the list
15:13:29 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1372141/
15:13:29 <openstack> Launchpad bug 1372141 in neutron "StaleDataError while updating ml2_dvr_port_bindings" [Medium,Confirmed]
15:14:17 <Swami_> I think there was a patch for this bug.
15:14:55 <Swami_> Does anyone have the link handy for this patch
15:15:08 <regXboi> can somebody update the logstash query?  the old one doesn't work
15:15:35 <obondarev_> the patch was https://review.openstack.org/#/c/143761/
15:16:13 <Swami_> obondarev_: I think this is the same as the one Pavel Bondar was working on.
15:16:52 <obondarev_> Swami_: which one?
15:17:26 <regXboi> that patch was abandoned and will likely need a serious merge
15:17:39 <haleyb> Swami_: i don't see any recent patches or info, so we'll need some new info on this one
15:18:39 <Swami_> haleyb: ok I will update this bug or will add a comment if it is the duplicate of the one that pavel is working on.
15:18:42 <haleyb> I think a new logstash query would help
15:18:50 <regXboi> so... a logstash query for StaleDataError shows 54 hits in the last 7 days
15:18:56 <regXboi> and *none* of them are from DVR jobs
15:19:22 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1494351
15:19:22 <openstack> Launchpad bug 1494351 in neutron "Observed StaleDataError in gate-neutron-dsvm-api tests if reference IPAM driver is used" [High,In progress] - Assigned to Pavel Bondar (pasha117)
15:19:51 <regXboi> all: there is a new logstash query in the notes for the previous bug
15:20:06 <Swami_> I thought that this bug above was same as the  1372141
15:20:34 <regXboi> It may be/have been
15:20:35 <Swami_> regXboi: thanks
15:21:21 <Swami_> we need to figure out if they are both the same and if so we can close one and focus on the latter.
15:21:45 <Swami_> ok let us move on to the next
15:22:00 <Swami_> #link https://review.openstack.org/#/c/228026/
15:22:09 <haleyb> Swami_: the IPAM change merged, we can look at it again next week and see if failure rate still there...
15:22:27 <regXboi> (bug 1499785 and bug 1499787)
15:22:27 <openstack> bug 1499785 in neutron "Static routes are not added to the qrouter namespace for DVR routers" [Undecided,In progress] https://launchpad.net/bugs/1499785 - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:22:27 <Swami_> haleyb: ok, makes sense.
15:22:28 <openstack> bug 1499787 in neutron "Static routes are attempted to add to SNAT Namespace of DVR routers without checking for Router Gateway." [Undecided,In progress] https://launchpad.net/bugs/1499787 - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:22:47 <Swami_> regXboi: thanks for the links
15:22:54 <Swami_> The patch is up for review.
15:23:52 <Swami_> carl_baldwin and assaf had reviewed this patch and I have addressed the review comments.
15:24:18 <Swami_> I think carl_baldwin had a question if we need to add all the routes in the snat namespace or not.
15:24:53 <Swami_> In my opinion, we should be only adding the external network related static routes in the snat_namespace.
15:25:37 <carl_baldwin> My question was more like “why not add all the same routes to all the router namespaces”.  My thinking is that the code could be less complicated.
15:27:39 <Swami_> carl_baldwin: I am ok with your suggestion but we should not introduce some other issues.
15:28:10 <carl_baldwin> Swami_: Does it introduce issues?
15:28:42 <Swami_> #link https://docs.google.com/presentation/d/1THyot5yZVLb4ouU90GrY8Mmwm_27-jI6mZF8ZljtoFU/edit#slide=id.p
15:28:58 <Swami_> carl_baldwin: I think this is the slide that was put together to discuss the static routes.
15:29:26 <haleyb> Swami_: i don't have permission to see that slide
15:29:54 <fitoduarte> is there a danger
15:30:01 <fitoduarte> to add incorrect routes to snat that way?
15:30:23 <Swami_> haleyb: changed the permission on the slide
15:31:33 <Swami_> fitoduarte: These are next hop routes, and it does not make sense for private subnet routes to be added to the snat namespace, since the nexthop for the external network will be in the external network space.
15:32:29 <haleyb> Swami_: right, should the snat only have the on-link and default route ?
15:32:44 <Swami_> haleyb: yes that's what I thought.
15:33:33 <Swami_> ok, if you have any suggestions please add your comments to the patch and let us move on.
15:33:59 <Swami_> The next one is
15:34:02 <Swami_> #link https://review.openstack.org/#/c/225319/
15:34:32 <Swami_> This patch has been under review for a while and carl_baldwin had blessed this patch a couple of times and then it went into merge conflict.
15:34:59 <Swami_> There was a gate issue yesterday where all the lbaasv1 tests were failing.
15:35:12 <Swami_> once the jenkins passes this can go in.
15:35:18 <haleyb> #action haleyb will look at https://review.openstack.org/#/c/225319/
15:35:28 <Swami_> Ok, let us move on to the next one.
15:35:47 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1504726/
15:35:47 <openstack> Launchpad bug 1504726 in neutron "The vm can not access the vip of load balancer under DVR enviroment" [High,New] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:36:03 <Swami_> This was the bug that was filed on lbaas and dvr with Kilo.
15:36:14 <Swami_> I verified on the master branch and I don't see any issue.
15:37:01 <regXboi> well, since kilo is security only now, how about we mark it invalid and ask for a retest on liberty/master?
15:37:03 <sc68cal> I have one from the fwaas side to toss on the pile
15:37:09 <Swami_> But I was trying to triage this on the Kilo branch and was not successful in getting the LBaas to work with devstack. So this is still in progress and I asked for some help from the lbaas team to run lbaas with kilo on devstack.
15:37:21 <regXboi> whoa... why?
15:37:39 <Swami_> sc68cal: you can go ahead and file a bug and we can triage it and take it next week, if it is ok for you.
15:37:51 <sc68cal> I'll just toss in the link for now
15:37:53 <sc68cal> https://bugs.launchpad.net/neutron/+bug/1476097
15:37:53 <openstack> Launchpad bug 1476097 in neutron "[fwaas]Support fwaas to control east-west traffic in dvr router" [High,Triaged] - Assigned to lee jian (leejian0612)
15:37:55 <fitoduarte> regxboi: agree on lbaas issue bug
15:38:15 <Swami_> sc68cal: I had a couple of discussions with mickeys and sridhark on this bug.
15:38:16 <regXboi> haleyb, carl_baldwin: what are your thoughts?
15:38:39 * carl_baldwin catches up on last few minutes...
15:38:48 <Swami_> The suggestion is to wait until your v2 API targeted at VM ports arrives and then use that API rather than making any changes to the DVR right now.
15:39:12 <Swami_> sc68cal: making changes to the DVR to support the East-West will be complicated at this point.
15:39:40 <haleyb> Swami_: if we don't see it in master then wait for Zhou to answer your question.  We should move on since we've only got :20 left
15:39:44 <Swami_> sc68cal: I have already updated the bug with my notes. Please let me know if you need more info on it.
15:39:50 <sc68cal> Swami_: ack. I just wanted to make sure both sides are in sync
15:39:53 <Swami_> ok, thanks.
15:40:08 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1505575/
15:40:08 <openstack> Launchpad bug 1505575 in neutron "Fatal memory consumption by neutron-server with DVR at scale" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
15:40:17 <regXboi> haleyb, carl_baldwin: I still want to mark 1504726 as invalid as it is kilo
15:40:29 <Swami_> oleg was working on this bug.
15:40:34 <obondarev_> the patch for it is there
15:40:51 <obondarev_> armax is not happy with pagination approach though
15:40:59 <Swami_> so we probably need to review it.
15:41:18 <Swami_> obondarev_: thanks for the info
15:41:30 <Swami_> obondarev_: do you have any other thoughts on this patch.
15:41:36 <carl_baldwin> regXboi: I guess I don’t understand the state of support for Kilo.  I have a hard time believing that we’re not fixing bugs.
15:41:43 <obondarev_> however I need to verify if https://review.openstack.org/#/c/234067/ is still needed given https://review.openstack.org/#/c/214974/
15:42:14 <obondarev_> I was unable to reproduce the bug on the 50 nodes scale with https://review.openstack.org/#/c/214974/ applied
15:42:41 <obondarev_> though I was sure it won’t help in case agent is resyncing and requests info for all its routers
15:43:08 <obondarev_> will try more at a higher scale
15:43:16 <Swami_> obondarev_: can you update us on your testing next week.
15:43:26 <Swami_> obondarev_: then we can decide on this patch.
15:43:34 <Swami_> Ok moving on to the next one.
15:43:34 <regXboi> carl_baldwin: security-supported is described in https://wiki.openstack.org/wiki/StableBranch
15:43:39 <obondarev_> Swami_: ok
15:43:47 <regXboi> it means it is target for security fixes by the VMT
15:44:00 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1496201/
15:44:00 <openstack> Launchpad bug 1496201 in neutron "DVR: router namespace can't be deleted if bulk delete VMs" [Medium,New] - Assigned to Kasey Alusi (kasey-alusi)
15:44:09 <regXboi> but I read that as meaning we don't have to fix other items for it
15:44:46 <Swami_> regXboi: carl_baldwin: let us check with armax and then decide the fate of that bug.
15:45:20 <Swami_> The above bug should be triaged for bulk delete.
15:45:20 * regXboi attempts to summon armax and just for S&G mestery as well :)
15:45:49 <Swami_> The last bug on today's list is
15:45:54 <Swami_> #link https://bugs.launchpad.net/neutron/+bug/1512199/
15:45:54 <openstack> Launchpad bug 1512199 in neutron "change vm fixed ips will cause unable to communicate to vm in other network " [Undecided,Invalid] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:46:14 <Swami_> I triaged this bug and could not reproduce it on the master branch, so I marked it as invalid.
15:46:22 <haleyb> Swami_: since we are still triaging the last two bugs, and they're kilo, let's move on since i want to get to gate failures (and there might be other bugs not on the agenda?)
15:46:33 <regXboi> amen to gate failures
15:46:47 <Swami_> haleyb: sure, that's all I had for bugs.
15:46:51 * regXboi will take action item to triage undecided bugs
15:47:07 <haleyb> I know there are other bugs as well, right carl_baldwin ?  We saw one internally that I need to file an upstream bug for
15:47:36 <carl_baldwin> haleyb: Yes, I don’t have access to the internal Jira, otherwise I’d get the description.
15:47:44 <haleyb> #topic Gate-failures
15:48:12 <haleyb> we'll move on - regXboi did you want to talk about the issues ?
15:48:25 <regXboi> haleyb: sure
15:48:59 <regXboi> If I look at the check pipeline graphs, we are still seeing too many failures in the gate to make mn voting
15:49:21 <haleyb> http://goo.gl/j5UkwT
15:49:25 <regXboi> OTOH, I think we can argue that single node dvr can be made voting again
15:49:29 <regXboi> haleyb: thx
15:49:52 <regXboi> but note, that is only *think*
15:50:23 <regXboi> most of the failures I'm seeing are still FIP failures and I'm still trying to capture those locally to understand what the failure mode is
15:50:43 <carl_baldwin> It still seems to be double the failure rate of the non-dvr job.  Need more good history.
15:51:09 <regXboi> what do people think of the idea of trying to get the pipeline job to dump the cloud state when a FIP failure occurs?
15:51:17 <haleyb> regXboi: yes, it would just be good to maybe have that under 10% for some time to say it's stable, the others are single digits
15:51:29 <Swami_> armax mentioned that we should poke into the gate systems to see what is the state of the VM when these failures happen.
15:52:02 <Swami_> regXboi: Is there a way to do it, and if so can we do it?
15:52:05 <regXboi> poking into the gate systems is going to likely take job patches to do the poking
15:52:37 <carl_baldwin> Is there anyone from tempest who could help us figure out the best way to get the state?  I vaguely remember a comment in a session at summit where someone from tempest spoke up about this.  Anyone recall?
15:52:48 <Swami_> regXboi: I did not get it.
15:52:50 * regXboi suspects it may be *possible*, but we'll have to work with qa/tempest
15:53:25 <regXboi> carl_baldwin: that would have been good to follow up on :(
15:53:31 <Swami_> carl_baldwin: but I think armax was not ok with adding any help into tempest and he wanted someone to look at the VM state in the gate.
15:53:53 <regXboi> Swami_: I don't believe that is practical
15:54:00 <regXboi> because of the tests cleaning up after themselves
15:54:16 <carl_baldwin> regXboi: So much going on…  But, it isn’t too late.  We need someone to take an action.
15:54:18 <haleyb> carl_baldwin: i don't recall, but perhaps we can start with the tempest PTL
15:54:21 <Swami_> regXboi: Ok I will check with armax on this.
15:54:53 <Swami_> haleyb: can he join this meeting next week, in case we don't make headway?
15:55:03 <haleyb> #action Swami to ping armax about tempest resources to look at VMs during DVR failures in the gate
15:55:15 <regXboi> I'll ping mtreinish on this in the qa channel and see what *might* be possible
15:55:28 <haleyb> thanks regXboi
15:55:55 <haleyb> anything else on the gate?
15:56:02 <regXboi> not from me
15:56:26 <haleyb> #topic Performance/Scalability
15:56:39 <Swami_> haleyb: I remember assaf mentioned that he had a script that can probe the logstash and pull the data, filter it and provide the details of the failures.
15:57:10 <Swami_> regXboi: do you have something like that script.
15:57:11 <haleyb> obondarev: i had only added https://review.openstack.org/#/c/231555/ to the agenda as a perf change in review
15:57:16 <regXboi> Swami_: no
15:57:21 <obondarev_> https://review.openstack.org/#/c/231555/ (and dependent) needs reviews
15:57:56 <regXboi> all: we desperately need to cut down on L3 agent message queue traffic
15:58:10 <haleyb> Swami_: you had a -1 on that patch, can you take a look again?
15:58:19 <obondarev_> haleyb: yeah, then the next round of scale tests might probably reveal other pain points/bottlenecks
15:58:21 <Swami_> haleyb: sure
15:58:27 * regXboi has seen installations where neutron L3 agent queue is the only queue dying
15:58:50 <carl_baldwin> regXboi: Could you express that in the form of a bug?  ;)
15:59:09 <regXboi> carl_baldwin: you *really* don't want that
15:59:17 <obondarev_> https://review.openstack.org/#/c/222863/ should also help scalability
15:59:26 <haleyb> regXboi: yes, the latest FIP change oleg did will help, there probably are more, but filing a bug will get it on the agenda :)
15:59:28 <carl_baldwin> regXboi: ?
15:59:50 <obondarev_> but need to investigate why it breaks graceful ovs agent restart
15:59:55 <regXboi> so this would be an overarching bug for all of L3
16:00:13 <regXboi> because the message target patches are addressing half of it
16:00:19 <haleyb> regXboi: we'd have to identify them one by one
16:00:23 <regXboi> and the O(n) stuff I've been chasing addresses the other half of it
16:00:26 <haleyb> top of hour, we need to end...
16:00:33 <regXboi> let's continue in channel
16:00:37 <haleyb> #endmeeting neutron_dvr