15:00:30 <haleyb> #startmeeting neutron_dvr
15:00:31 <openstack> Meeting started Wed Nov 18 15:00:30 2015 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:32 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:34 <openstack> The meeting name has been set to 'neutron_dvr'
15:00:39 <haleyb> #chair Swami
15:00:39 <openstack> Current chairs: Swami haleyb
15:01:07 <haleyb> #topic Announcements
15:01:23 <neiljerram> o/
15:01:54 <Swami> haleyb: is the devstack broken yesterday
15:01:58 <haleyb> My only announcement is that I'm away next week, so will have to hand over to Swami or someone else
15:02:17 <Swami> haleyb: no worries I can handle the meeting when you are gone.
15:02:33 <haleyb> i'm assuming others will be here as well
15:03:16 <haleyb> Swami: is devstack broken or devstack is broken ?
15:03:23 <carl_baldwin> Hi, sorry I'm late
15:03:43 <haleyb> np, we just started
15:03:48 <Swami> haleyb: yesterday I was not able to bringup devstack with a fresh pull
15:05:00 <obondarev> Swami: what is the error?
15:05:07 <stephen-ma> hi
15:05:14 <haleyb> Swami: i have one devstack that is also "broken" with a "Could not determine suitable URL for the plugin" failure, assuming i'll have to google that again, but other one working
15:05:23 <Swami> haleyb: I did see that it was failing when trying to start the cinder service.
15:05:38 <haleyb> "disable cinder" :)
15:05:44 <Swami> haleyb: Yes I also saw that failure in my previous working node.
15:06:02 <Swami> haleyb: I tried disabling it and still no success.
15:06:40 <dasm> Swami: at Friday, colleague of mine had similar problem. The only what helped was deleting/installing everything from scratch on new VM.
15:06:56 <dasm> Swami: i've built devstack at Sunday and had no problems (I know it is not yesterda)
15:07:01 <dasm> *yesterday
15:07:27 <Swami> dasm: Unfortunately it did not work for me, since I tried, a fresh install and old install. The fresh install had this cinder problem, while the old one had this URL issue.
15:07:59 <Swami> never mind, let us continue since we have a forum. I will see if I can take it up to the channel once we finish the meeting.
15:08:14 <haleyb> Swami: so there is probably some pip update that needs to happen, but we can debug that offline or in the neutron room
15:08:26 <Swami> haleyb: +1
15:08:38 <haleyb> #topic Bugs
15:08:59 <Swami> haleyb: sure
15:09:09 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1372141
15:09:09 <openstack> Launchpad bug 1372141 in neutron "StaleDataError while updating ml2_dvr_port_bindings" [Medium,Confirmed]
15:09:09 <haleyb> There's just a few listed on the agenda, all yours Swami
15:09:39 <Swami> This bug can be closed as per regXboi since he did not see any failures in the last week or so.
15:09:50 <Swami> so I don't think we need more discussion on this bug.
15:10:13 <Swami> The next high in the list is
15:10:18 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1462154
15:10:18 <openstack> Launchpad bug 1462154 in neutron "With DVR Pings to floating IPs replied with fixed-ips" [High,In progress] - Assigned to ZongKai LI (lzklibj)
15:11:03 <stephen-ma> There is a WIP patch in review right now https://review.openstack.org/#/c/240677
15:11:11 <stephen-ma> for bug 1462154.
15:11:12 <Swami> We have a patch right now up for review from ZongKai. I spoke to stephen ma the original owner of this bug and he mentioned that ZongKai's patch is better and should be pursued.
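
Bug 1372141 above comes from concurrent updates racing on the ml2_dvr_port_bindings rows and tripping SQLAlchemy's StaleDataError. A minimal, hypothetical sketch of the usual mitigation, retrying the update when the row changes underneath the writer; the decorator name and retry counts are illustrative only and are not neutron's actual code:

    import functools
    import time

    from sqlalchemy.orm.exc import StaleDataError

    def retry_on_stale_data(max_retries=3, delay=0.5):
        """Re-run a DB update a few times if a concurrent writer changed the row."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(max_retries):
                    try:
                        return func(*args, **kwargs)
                    except StaleDataError:
                        # Give up after the last attempt; otherwise back off and retry.
                        if attempt == max_retries - 1:
                            raise
                        time.sleep(delay)
        return wrapper
        return decorator
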
15:11:34 <Swami> stephen-ma: thanks for the link
15:11:54 <stephen-ma> I fixed some of the unit test failures. There are more unit test failures that need to be fixed.
15:12:15 <Swami> stephen-ma: ok I will ping ZongKai on this or add a comment in there.
15:12:33 <Swami> The next in our list is
15:12:37 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1505575
15:12:37 <openstack> Launchpad bug 1505575 in neutron "Fatal memory consumption by neutron-server with DVR at scale" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
15:13:00 <obondarev> the patch for it is still on review
15:13:02 <Swami> obondarev: any update on this
15:13:31 <obondarev> similar bug was filed by amuller recently https://bugs.launchpad.net/neutron/+bug/1516260
15:13:31 <openstack> Launchpad bug 1505575 in neutron "duplicate for #1516260 Fatal memory consumption by neutron-server with DVR at scale" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
15:13:41 <obondarev> ah, sorry
15:13:52 <dasm> probably this one: https://bugs.launchpad.net/neutron/+bug/1516260
15:14:12 <dasm> https://bugs.launchpad.net/neutron/+bug/1505575
15:14:12 <openstack> Launchpad bug 1505575 in neutron "Fatal memory consumption by neutron-server with DVR at scale" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
15:14:13 <obondarev> dasm: yep
15:14:43 <Swami> are we going to mark the other one as duplicate or still is it different.
15:14:49 <obondarev> so I still need to check on scale lab if it's a problem/reproduce
15:15:02 <obondarev> the other one is already marked as duplicate
15:15:30 <obondarev> need to sync with amuller on this
15:15:52 <obondarev> I left a comment on his review
15:16:19 <Swami> obondarev: ok please update next week on your progress.
15:16:23 <obondarev> reviews are welcome however
15:16:47 <obondarev> Swami: sure
15:16:49 <Swami> obondarev: sure we will review
15:16:58 <obondarev> Swami: thanks
15:17:08 <Swami> The next one in the list is
15:17:12 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1513678
15:17:12 <openstack> Launchpad bug 1513678 in neutron "At scale router scheduling takes a long time with DVR routers with multiple compute nodes hosting thousands of VMs" [High,In progress] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:17:35 <Swami> I have pushed in both the patches yesterday and removed it out of WIP.
15:18:22 <Swami> But I saw that one of the patch was failing dvr tests. I am still looking into what the problem is. The jenkens does not report any test failures specific to dvr.
15:18:43 <Swami> #link https://review.openstack.org/#/c/241843/
15:19:00 <Swami> #link https://review.openstack.org/#/c/242286/
15:19:32 <Swami> The second patch that I posted is the one that is currently failing, I will test it again in my setup and see where the problem is.
15:20:24 <Swami> Let us move on to the next one.
15:21:14 <Swami> I think we are done with the high bugs, but there are still other bugs that are in progress.
15:21:45 <Swami> haleyb: we need to clean up the bugs that are fix committed.
15:22:01 <Swami> haleyb: who should be doing it.
15:22:18 <Swami> haleyb: This is in launchpad.
15:22:21 <haleyb> will that get them off this list, https://bugs.launchpad.net/neutron/+bugs?field.tag=l3-dvr-backlog
15:23:02 <Swami> haleyb: yes it should off that list.
15:23:50 <Swami> Now let us move on to the next topic.
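
On the cleanup point above: the Fix Committed entries behind that Launchpad list can be enumerated with a small launchpadlib query against the l3-dvr-backlog tag before being bulk-edited in the web UI. A rough sketch, assuming the standard launchpadlib anonymous login; the consumer name is arbitrary and this is not tied to any existing tooling:

    from launchpadlib.launchpad import Launchpad

    # Anonymous read-only access is enough for listing; editing bugs needs real credentials.
    lp = Launchpad.login_anonymously('dvr-bug-cleanup', 'production', version='devel')
    neutron = lp.projects['neutron']
    for task in neutron.searchTasks(tags=['l3-dvr-backlog'], status=['Fix Committed']):
        print(task.web_link, task.bug.title)
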
15:23:58 <Swami> #topic Gate-failures
15:24:52 <Swami> last week I had an action item to create a bug to track any tempest related activity to address the gate failures.
15:24:57 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1515360
15:24:57 <openstack> Launchpad bug 1515360 in neutron "Add more verbose to Tempest Test Errors that causes "SSHTimeout" seen in CVR and DVR" [High,Confirmed]
15:25:07 <Swami> I have created this bug.
15:25:28 <Swami> What can be done to address this issue?
15:25:59 <haleyb> and i didn't reach-out to tempest and/or infra teams
15:26:13 <Swami> haleyb: I had a chat with armax as well and he feels that adding more debug info related to the namespaces will be too much of information.
15:26:48 <Swami> haleyb: he also mentioned that previously it used to dump the namespace information and they got away from that.
15:27:29 <Swami> Is there any other way within the L3 agent domain that we can confirm that the FIPs will work.
15:27:30 <haleyb> i was almost hoping we could add an experimental job we can run for debugging
15:28:24 <Swami> haleyb: I was trying to add something like "pinging" from the fip namespace to the router namespace destination as the floatingip.
15:28:37 <Swami> haleyb: But it did not work
15:29:00 <haleyb> Swami: if the tempest patches ran the neutron multinode jobs (they don't now), we could then maybe create a tempest patch with a bunch of debug, then recheck it to trigger a failure ?
15:30:07 <Swami> haleyb: so are you saying just target it for the multinode and not for the single node.
15:30:52 <haleyb> hmm, i guess the single node is growing as well
15:31:43 <Swami> haleyb: carl_baldwin: can we add a _fip_path_validate kind of function in l3_agent and then after setting up the floatingip namespace and add rules to the routernamespace ping internally and then provide a status update.
15:32:35 <Swami> In this case we do have control within neutron and no need to rely on tempest. Because also tempest test cannot add such low level tests.
15:33:19 <Swami> I am just throwing in some ideas in here. Please feel free to add in your input if you have any thoughts.
15:33:19 <carl_baldwin> Swami: I need a minute to catch up. Distracted today.
15:33:30 <Swami> carl_baldwin: thanks, no problem.
15:33:59 <haleyb> Swami: yes, i suppose we could add some debugging via the l3-agent log
15:35:02 <Swami> haleyb: ok, I will try to push in a wip patch and we can discuss about this in that patch, that way we are all in sync.
15:35:29 <haleyb> Swami: ok, thanks
15:35:51 <Swami> haleyb: also what I noticed is, I am not sure how the metrics are calculated in the logstash.
15:36:28 <Swami> haleyb: I do see that it reports some patch that is even not running gate_tempest_dsvm_neutron_dvr as gate_tempest_dsvm_neturon_dvr failures.
15:37:21 <haleyb> so it's not being run but failing?
15:38:12 <Swami> haleyb: If I look at the logstash data, and filter by neutron-dvr failures, it reports a patch id, and if I look at the patch, all it has is rally_related tests, but it still reports as dvr related failure.
15:38:23 <Swami> haleyb: do you know what might be happening here.
15:38:39 <haleyb> Swami: pointer ?
15:39:07 <Swami> haleyb: I don't have it handy, but I will send it out.
15:40:02 <Swami> Ok before we move on to the next section, any other ideas or thoughts on how to mitigate the gate failures, feel free to let us know.
15:40:24 * carl_baldwin back
15:40:31 <carl_baldwin> Sorry to have gotten distracted.
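
A rough sketch of the _fip_path_validate idea Swami floated above: once the fip namespace is plumbed, ping the floating IP from inside it and record the result in the l3-agent log. The function name and how it would be hooked into the agent are hypothetical, and it assumes neutron.agent.linux.ip_lib's IPWrapper.netns.execute helper for running commands inside a namespace:

    from neutron.agent.linux import ip_lib
    from oslo_log import log as logging

    LOG = logging.getLogger(__name__)

    def _fip_path_validate(fip_ns_name, floating_ip):
        """Best-effort check that a floating IP answers from the fip namespace."""
        ns = ip_lib.IPWrapper(namespace=fip_ns_name)
        try:
            # Three pings with a short timeout; execute raises if ping exits non-zero.
            ns.netns.execute(['ping', '-c', '3', '-W', '2', floating_ip],
                             run_as_root=True)
            LOG.debug("FIP path OK: %s reachable from %s", floating_ip, fip_ns_name)
            return True
        except RuntimeError:
            LOG.warning("FIP path check failed: %s unreachable from %s",
                        floating_ip, fip_ns_name)
            return False
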
15:40:45 <Swami> carl_baldwin: no problem we can catch up later.
15:40:52 <carl_baldwin> Swami: ack
15:41:10 <haleyb> Swami: maybe if we ignore it longer the failures will just go away? :)
15:41:26 <dasm> haleyb: +1 :D
15:41:29 <Swami> haleyb: no I don't want to do that, then we will not really get to the bottom of it.
15:41:35 <haleyb> no, we're actually making progress on the bugs, so let's not stop
15:41:43 <Swami> haleyb: this will come back to us.
15:41:44 <haleyb> Swami: you missed my :)
15:42:09 <Swami> haleyb: yes got it.
15:42:29 <Swami> haleyb: can we move on to the next topic
15:42:34 <haleyb> yes
15:42:46 <haleyb> #topic Performance/Scalability
15:42:54 <obondarev> not much to update here really
15:43:10 <obondarev> I'm planning to start implementing dvr binding refactoring bp this week/early next week
15:43:32 <obondarev> also should grab scale lab to run some tests, make some analysis
15:44:06 <obondarev> that's it from me
15:44:35 <haleyb> obondarev: great, can you just update the wiki with a link to the BP? I think we can remove the merged bug from it :)
15:44:51 <obondarev> haleyb: sure, will do
15:45:15 <Swami> obondarev: I had one question on this.
15:45:53 <Swami> obondarev: the original blueprint had some notes about refactoring the agent side. Do we really need to refactor the agent side for this binding changes or is it orthogonal.
15:46:22 <obondarev> Swami: I don'e expect big changes on agent side
15:46:29 <obondarev> probably some minor changes
15:46:40 <carl_baldwin> obondarev: +1
15:46:42 <Swami> obondarev: thanks
15:48:18 <Swami> I think that's all we had in for today's agenda. Is there any other items that the team wanted to discuss today.
15:48:21 <haleyb> #topic Open Discussion
15:49:09 <Swami> haleyb: is there an easier way to figure out from logstash all types of failures for dvr.
15:50:40 <Swami> haleyb: I meant to just filter and get the frequently failed test cases, so that if we have missed anything we can focus our attention on that.
15:50:44 <carl_baldwin> Swami: Any chance you could get the pointer to that DVR failure you mentioned above where dvr was not run on the patch?
15:51:12 <Swami> carl_baldwin: sure, I will get the filter going again and I will let you know.
15:51:31 <Swami> carl_baldwin: I captured the screen shot, but before I saved, my pc crashed. Bad luck.
15:51:35 <haleyb> Swami: i don't know, what you describe seems pretty manual - finding the test failing and searching for it
15:52:29 <Swami> haleyb: in logstash right now it only displays first 500 errors and it might have just come from a single patch.
15:52:53 <Swami> haleyb: when I try to increase the page size in logstash it is a nightmare and it hangs or takes forever.
15:53:50 <haleyb> Swami: ok, now i understand, but i don't know the answer. Who owns logstash?
15:54:25 <Swami> haleyb: I don't know who owns the logstash is it the infra team or someone else.
15:55:28 <carl_baldwin> I'd start with infra
15:55:43 <Swami> carl_baldwin: thanks
15:56:46 <haleyb> yeah, perhaps there needs to be some other filter button
15:57:35 <Swami> haleyb: you can filter by message type, but if you exactly know what test is failing and its error message.
15:58:26 <haleyb> ok, if nothing else i'll call it, thanks everyone
15:58:29 <haleyb> #endmeeting
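
For the logstash filtering discussed in open discussion, the practical workaround for the 500-hit cap is to narrow the query up front, combining the job name, the build status, and the failure signature in one Lucene-style query string. A small illustrative snippet that only assembles such a string for pasting into the logstash.openstack.org search box; the field names follow the convention used by elastic-recheck queries, and the job name and message here are examples, not a definitive filter:

    # Assemble a query string for the logstash search box.
    terms = [
        'build_name:"gate-tempest-dsvm-neutron-dvr"',
        'build_status:"FAILURE"',
        'message:"SSHTimeout"',
    ]
    print(' AND '.join(terms))
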