15:00:30 <haleyb> #startmeeting neutron_dvr
15:00:31 <openstack> Meeting started Wed Nov 18 15:00:30 2015 UTC and is due to finish in 60 minutes.  The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:32 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:34 <openstack> The meeting name has been set to 'neutron_dvr'
15:00:39 <haleyb> #chair Swami
15:00:39 <openstack> Current chairs: Swami haleyb
15:01:07 <haleyb> #topic Announcements
15:01:23 <neiljerram> o/
15:01:54 <Swami> haleyb: is the devstack broken yesterday
15:01:58 <haleyb> My only announcement is that I'm away next week, so will have to hand over to Swami or someone else
15:02:17 <Swami> haleyb: no worries I can handle the meeting when you are gone.
15:02:33 <haleyb> i'm assuming others will be here as well
15:03:16 <haleyb> Swami: is devstack broken or devstack is broken ?
15:03:23 <carl_baldwin> Hi, sorry I'm late
15:03:43 <haleyb> np, we just started
15:03:48 <Swami> haleyb: yesterday I was not able to bring up devstack with a fresh pull
15:05:00 <obondarev> Swami: what is the error?
15:05:07 <stephen-ma> hi
15:05:14 <haleyb> Swami: i have one devstack that is also "broken" with a "Could not determine suitable URL for the plugin" failure, assuming i'll have to google that again, but the other one is working
15:05:23 <Swami> haleyb: I did see that it was failing when trying to start the cinder service.
15:05:38 <haleyb> "disable cinder" :)
15:05:44 <Swami> haleyb: Yes I also saw that failure in my previous working node.
15:06:02 <Swami> haleyb: I tried disabling it and still no success.
15:06:40 <dasm> Swami: on Friday, a colleague of mine had a similar problem. The only thing that helped was deleting and reinstalling everything from scratch on a new VM.
15:06:56 <dasm> Swami: i've built devstack on Sunday and had no problems (I know it is not yesterday)
15:07:27 <Swami> dasm: Unfortunately that did not work for me; I tried both a fresh install and an old install. The fresh install had the cinder problem, while the old one had the URL issue.
15:07:59 <Swami> never mind, let us continue, since we have another forum for this. I will see if I can take it up in the channel once we finish the meeting.
15:08:14 <haleyb> Swami: so there is probably some pip update that needs to happen, but we can debug that offline or in the neutron room
15:08:26 <Swami> haleyb: +1
15:08:38 <haleyb> #topic Bugs
15:08:59 <Swami> haleyb: sure
15:09:09 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1372141
15:09:09 <openstack> Launchpad bug 1372141 in neutron "StaleDataError while updating ml2_dvr_port_bindings" [Medium,Confirmed]
15:09:09 <haleyb> There's just a few listed on the agenda, all yours Swami
15:09:39 <Swami> This bug can be closed as per regXboi since he did not see any failures in the last week or so.
15:09:50 <Swami> so I don't think we need more discussion on this bug.
15:10:13 <Swami> The next High one on the list is
15:10:18 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1462154
15:10:18 <openstack> Launchpad bug 1462154 in neutron "With DVR Pings to floating IPs replied with fixed-ips" [High,In progress] - Assigned to ZongKai LI (lzklibj)
15:11:03 <stephen-ma> There is a WIP patch in review right now https://review.openstack.org/#/c/240677
15:11:11 <stephen-ma> for bug 1462154.
15:11:12 <Swami> We have a patch right now up for review from ZongKai. I spoke to stephen-ma, the original owner of this bug, and he mentioned that ZongKai's patch is better and should be pursued.
15:11:34 <Swami> stephen-ma: thanks for the link
15:11:54 <stephen-ma> I fixed some of the unit test failures. There are more unit test failures that need to be fixed.
15:12:15 <Swami> stephen-ma: ok I will ping ZongKai on this or add a comment in there.
15:12:33 <Swami> The next in our list is
15:12:37 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1505575
15:12:37 <openstack> Launchpad bug 1505575 in neutron "Fatal memory consumption by neutron-server with DVR at scale" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
15:13:00 <obondarev> the patch for it is still on review
15:13:02 <Swami> obondarev: any update on this
15:13:31 <obondarev> similar bug was filed by amuller recently https://bugs.launchpad.net/neutron/+bug/1516260
15:13:31 <openstack> Launchpad bug 1505575 in neutron "duplicate for #1516260 Fatal memory consumption by neutron-server with DVR at scale" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
15:13:41 <obondarev> ah, sorry
15:13:52 <dasm> probably this one: https://bugs.launchpad.net/neutron/+bug/1516260
15:14:12 <dasm> https://bugs.launchpad.net/neutron/+bug/1505575
15:14:12 <openstack> Launchpad bug 1505575 in neutron "Fatal memory consumption by neutron-server with DVR at scale" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
15:14:13 <obondarev> dasm: yep
15:14:43 <Swami> are we going to mark the other one as a duplicate, or is it still different?
15:14:49 <obondarev> so I still need to check in the scale lab whether it's a real problem and try to reproduce it
15:15:02 <obondarev> the other one is already marked as duplicate
15:15:30 <obondarev> need to sync with amuller on this
15:15:52 <obondarev> I left a comment on his review
15:16:19 <Swami> obondarev: ok please update next week on your progress.
15:16:23 <obondarev> reviews are welcome however
15:16:47 <obondarev> Swami: sure
15:16:49 <Swami> obondarev: sure we will review
15:16:58 <obondarev> Swami: thanks
15:17:08 <Swami> The next one in the list is
15:17:12 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1513678
15:17:12 <openstack> Launchpad bug 1513678 in neutron "At scale router scheduling takes a long time with DVR routers with multiple compute nodes hosting thousands of VMs" [High,In progress] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:17:35 <Swami> I pushed both patches yesterday and took them out of WIP.
15:18:22 <Swami> But I saw that one of the patches was failing DVR tests. I am still looking into what the problem is. Jenkins does not report any test failures specific to DVR.
15:18:43 <Swami> #link https://review.openstack.org/#/c/241843/
15:19:00 <Swami> #link https://review.openstack.org/#/c/242286/
15:19:32 <Swami> The second patch that I posted is the one that is currently failing; I will test it again in my setup and see where the problem is.
15:20:24 <Swami> Let us move on to the next one.
15:21:14 <Swami> I think we are done with the high bugs, but there are still other bugs that are in progress.
15:21:45 <Swami> haleyb: we need to clean up the bugs that are fix committed.
15:22:01 <Swami> haleyb: who should be doing it.
15:22:18 <Swami> haleyb: This is in launchpad.
15:22:21 <haleyb> will that get them off this list, https://bugs.launchpad.net/neutron/+bugs?field.tag=l3-dvr-backlog
15:23:02 <Swami> haleyb: yes, it should get them off that list.
15:23:50 <Swami> Now let us move on to the next topic.
15:23:58 <Swami> #topic Gate-failures
15:24:52 <Swami> last week I had an action item to create a bug to track any tempest related activity to address the gate failures.
15:24:57 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1515360
15:24:57 <openstack> Launchpad bug 1515360 in neutron "Add more verbose to Tempest Test Errors that causes "SSHTimeout" seen in CVR and DVR" [High,Confirmed]
15:25:07 <Swami> I have created this bug.
15:25:28 <Swami> What can be done to address this issue?
15:25:59 <haleyb> and i didn't reach out to the tempest and/or infra teams
15:26:13 <Swami> haleyb: I had a chat with armax as well and he feels that adding more debug info related to the namespaces will be too much information.
15:26:48 <Swami> haleyb: he also mentioned that previously it used to dump the namespace information and they moved away from that.
15:27:29 <Swami> Is there any other way, within the L3 agent domain, that we can confirm that the FIPs will work?
15:27:30 <haleyb> i was almost hoping we could add an experimental job we can run for debugging
15:28:24 <Swami> haleyb: I was trying to add something like "pinging" from the fip namespace to the router namespace, with the floating IP as the destination.
15:28:37 <Swami> haleyb: But it did not work
15:29:00 <haleyb> Swami: if the tempest patches ran the neutron multinode jobs (they don't now), we could then maybe create a tempest patch with a bunch of debug, then recheck it to trigger a failure ?
15:30:07 <Swami> haleyb: so are you saying to just target it for multinode and not for the single node?
15:30:52 <haleyb> hmm, i guess the single node is growing as well
15:31:43 <Swami> haleyb: carl_baldwin: can we add a _fip_path_validate kind of function in the l3-agent, so that after setting up the floating IP namespace and adding the rules to the router namespace it pings internally and then provides a status update?
15:32:35 <Swami> In that case we have control within neutron and no need to rely on tempest, since tempest cannot add such low-level tests anyway.
15:33:19 <Swami> I am just throwing some ideas in here. Please feel free to add your input if you have any thoughts.
15:33:19 <carl_baldwin> Swami: I need a minute to catch up.  Distracted today.
15:33:30 <Swami> carl_baldwin: thanks, no problem.
15:33:59 <haleyb> Swami: yes, i suppose we could add some debugging via the l3-agent log
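For context, a minimal sketch of the kind of check Swami is describing, assuming a hypothetical helper named _fip_path_validate and plain 'ip netns exec' calls via subprocess (a real patch would go through the agent's ip_lib wrappers; the namespace name and address in the example are made up):

    import subprocess

    def _fip_path_validate(fip_ns, floating_ip, count=1, timeout=2):
        # Ping the floating IP from inside the FIP namespace; if the
        # rfp/fpr link and the DNAT rules in the qrouter namespace are
        # wired up correctly the ping is answered, and the agent could
        # log or report a status update based on the result.
        cmd = ['ip', 'netns', 'exec', fip_ns,
               'ping', '-c', str(count), '-W', str(timeout), floating_ip]
        return subprocess.call(cmd) == 0

    # e.g. _fip_path_validate('fip-<ext-net-id>', '203.0.113.10')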
15:35:02 <Swami> haleyb: ok, I will try to push a WIP patch and we can discuss this in that patch; that way we are all in sync.
15:35:29 <haleyb> Swami: ok, thanks
15:35:51 <Swami> haleyb: also, what I noticed is that I am not sure how the metrics are calculated in logstash.
15:36:28 <Swami> haleyb: I do see that it reports patches that are not even running gate_tempest_dsvm_neutron_dvr as gate_tempest_dsvm_neutron_dvr failures.
15:37:21 <haleyb> so it's not being run but failing?
15:38:12 <Swami> haleyb: If I look at the logstash data and filter by neutron-dvr failures, it reports a patch id, and if I look at that patch, all it has is rally-related tests, but it is still reported as a dvr-related failure.
15:38:23 <Swami> haleyb: do you know what might be happening here?
15:38:39 <haleyb> Swami: pointer ?
15:39:07 <Swami> haleyb: I don't have it handy, but I will send it out.
15:40:02 <Swami> Ok before we move on to the next section, any other ideas or thoughts on how to mitigate the gate failures, feel free to let us know.
15:40:24 * carl_baldwin back
15:40:31 <carl_baldwin> Sorry to have gotten distracted.
15:40:45 <Swami> carl_baldwin: no problem we can catch up later.
15:40:52 <carl_baldwin> Swami: ack
15:41:10 <haleyb> Swami: maybe if we ignore it longer the failures will just go away? :)
15:41:26 <dasm> haleyb: +1 :D
15:41:29 <Swami> haleyb: no, I don't want to do that; then we will never really get to the bottom of it.
15:41:35 <haleyb> no, we're actually making progress on the bugs, so let's not stop
15:41:43 <Swami> haleyb: this will come back to us.
15:41:44 <haleyb> Swami: you missed my :)
15:42:09 <Swami> haleyb: yes got it.
15:42:29 <Swami> haleyb: can we move on to the next topic
15:42:34 <haleyb> yes
15:42:46 <haleyb> #topic Performance/Scalability
15:42:54 <obondarev> not much to update here really
15:43:10 <obondarev> I'm planning to start implementing dvr binding refactoring bp this week/early next week
15:43:32 <obondarev> also should grab scale lab to run some tests, make some analysis
15:44:06 <obondarev> that's it from me
15:44:35 <haleyb> obondarev: great, can you just update the wiki with a link to the BP?  I think we can remove the merged bug from it :)
15:44:51 <obondarev> haleyb: sure, will do
15:45:15 <Swami> obondarev: I had one question on this.
15:45:53 <Swami> obondarev: the original blueprint had some notes about refactoring the agent side. Do we really need to refactor the agent side for these binding changes, or is it orthogonal?
15:46:22 <obondarev> Swami: I don't expect big changes on agent side
15:46:29 <obondarev> probably some minor changes
15:46:40 <carl_baldwin> obondarev: +1
15:46:42 <Swami> obondarev: thanks
15:48:18 <Swami> I think that's all we had on today's agenda. Are there any other items the team wanted to discuss today?
15:48:21 <haleyb> #topic Open Discussion
15:49:09 <Swami> haleyb: is there an easier way to figure out from logstash all the types of failures for dvr?
15:50:40 <Swami> haleyb: I meant just filtering to get the frequently failing test cases, so that if we have missed anything we can focus our attention on it.
15:50:44 <carl_baldwin> Swami:  Any chance you could get the pointer to that DVR failure you mentioned above where dvr was not run on the patch?
15:51:12 <Swami> carl_baldwin: sure, I will get the filter going again and I will let you know.
15:51:31 <Swami> carl_baldwin: I captured a screenshot, but before I saved it my PC crashed. Bad luck.
15:51:35 <haleyb> Swami: i don't know, what you describe seems pretty manual - finding the test failing and searching for it
15:52:29 <Swami> haleyb: logstash right now only displays the first 500 errors, and those might have all come from a single patch.
15:52:53 <Swami> haleyb: when I try to increase the page size in logstash it is a nightmare; it hangs or takes forever.
15:53:50 <haleyb> Swami: ok, now i understand, but i don't know the answer.  Who owns logstash?
15:54:25 <Swami> haleyb: I don't know who owns logstash; is it the infra team or someone else?
15:55:28 <carl_baldwin> I'd start with infra
15:55:43 <Swami> carl_baldwin: thanks
15:56:46 <haleyb> yeah, perhaps there needs to be some other filter button
15:57:35 <Swami> haleyb: you can filter by message type, but only if you know exactly which test is failing and its error message.
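For context, filtering on logstash.openstack.org for this usually means a query along these lines, using the standard build_status/build_name/message fields; the job name and message text below are examples only and would need to match what the gate actually reports:

    build_status:"FAILURE" AND build_name:"gate-tempest-dsvm-neutron-dvr" AND message:"SSHTimeout"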
15:58:26 <haleyb> ok, if nothing else i'll call it, thanks everyone
15:58:29 <haleyb> #endmeeting