15:00:30 #startmeeting neutron_dvr
15:00:31 Meeting started Wed Nov 18 15:00:30 2015 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:32 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:34 The meeting name has been set to 'neutron_dvr'
15:00:39 #chair Swami
15:00:39 Current chairs: Swami haleyb
15:01:07 #topic Announcements
15:01:23 o/
15:01:54 haleyb: was devstack broken for you yesterday?
15:01:58 My only announcement is that I'm away next week, so I will have to hand over to Swami or someone else
15:02:17 haleyb: no worries, I can handle the meeting when you are gone.
15:02:33 i'm assuming others will be here as well
15:03:16 Swami: is devstack broken now, or was it only broken yesterday?
15:03:23 Hi, sorry I'm late
15:03:43 np, we just started
15:03:48 haleyb: yesterday I was not able to bring up devstack with a fresh pull
15:05:00 Swami: what is the error?
15:05:07 hi
15:05:14 Swami: i have one devstack that is also "broken" with a "Could not determine suitable URL for the plugin" failure, assuming i'll have to google that again, but the other one is working
15:05:23 haleyb: I did see that it was failing when trying to start the cinder service.
15:05:38 "disable cinder" :)
15:05:44 haleyb: Yes, I also saw that failure on my previously working node.
15:06:02 haleyb: I tried disabling it and still had no success.
15:06:40 Swami: on Friday a colleague of mine had a similar problem. The only thing that helped was deleting everything and installing from scratch on a new VM.
15:06:56 Swami: i built devstack on Sunday and had no problems (I know that's not yesterday)
15:07:27 dasm: Unfortunately that did not work for me; I tried both a fresh install and an old install. The fresh install had this cinder problem, while the old one had the URL issue.
15:07:59 never mind, let us continue with the meeting agenda; I will see if I can take this up in the channel once we finish the meeting.
15:08:14 Swami: so there is probably some pip update that needs to happen, but we can debug that offline or in the neutron room
15:08:26 haleyb: +1
15:08:38 #topic Bugs
15:08:59 haleyb: sure
15:09:09 #link https://bugs.launchpad.net/neutron/+bug/1372141
15:09:09 Launchpad bug 1372141 in neutron "StaleDataError while updating ml2_dvr_port_bindings" [Medium,Confirmed]
15:09:09 There are just a few listed on the agenda, all yours Swami
15:09:39 This bug can be closed per regXboi, since he has not seen any failures in the last week or so.
15:09:50 so I don't think we need more discussion on this bug.
15:10:13 The next High bug on the list is
15:10:18 #link https://bugs.launchpad.net/neutron/+bug/1462154
15:10:18 Launchpad bug 1462154 in neutron "With DVR Pings to floating IPs replied with fixed-ips" [High,In progress] - Assigned to ZongKai LI (lzklibj)
15:11:03 There is a WIP patch in review right now: https://review.openstack.org/#/c/240677
15:11:11 for bug 1462154.
15:11:12 We have a patch up for review right now from ZongKai. I spoke to stephen-ma, the original owner of this bug, and he mentioned that ZongKai's patch is better and should be pursued.
15:11:34 stephen-ma: thanks for the link
15:11:54 I fixed some of the unit test failures. There are more unit test failures that need to be fixed.
15:12:15 stephen-ma: ok, I will ping ZongKai on this or add a comment there.
15:12:33 The next one on our list is
15:12:37 #link https://bugs.launchpad.net/neutron/+bug/1505575
15:12:37 Launchpad bug 1505575 in neutron "Fatal memory consumption by neutron-server with DVR at scale" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
15:13:00 the patch for it is still in review
15:13:02 obondarev: any update on this?
15:13:31 a similar bug was filed by amuller recently: https://bugs.launchpad.net/neutron/+bug/1516260
15:13:31 Launchpad bug 1505575 in neutron "duplicate for #1516260 Fatal memory consumption by neutron-server with DVR at scale" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
15:13:41 ah, sorry
15:13:52 probably this one: https://bugs.launchpad.net/neutron/+bug/1516260
15:14:12 https://bugs.launchpad.net/neutron/+bug/1505575
15:14:12 Launchpad bug 1505575 in neutron "Fatal memory consumption by neutron-server with DVR at scale" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
15:14:13 dasm: yep
15:14:43 are we going to mark the other one as a duplicate, or is it still different?
15:14:49 so I still need to check in the scale lab whether it's a problem / reproduce it
15:15:02 the other one is already marked as a duplicate
15:15:30 need to sync with amuller on this
15:15:52 I left a comment on his review
15:16:19 obondarev: ok, please update us next week on your progress.
15:16:23 reviews are welcome however
15:16:47 Swami: sure
15:16:49 obondarev: sure, we will review
15:16:58 Swami: thanks
15:17:08 The next one on the list is
15:17:12 #link https://bugs.launchpad.net/neutron/+bug/1513678
15:17:12 Launchpad bug 1513678 in neutron "At scale router scheduling takes a long time with DVR routers with multiple compute nodes hosting thousands of VMs" [High,In progress] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:17:35 I pushed both patches yesterday and took them out of WIP.
15:18:22 But I saw that one of the patches was failing dvr tests. I am still looking into what the problem is. Jenkins does not report any test failures specific to dvr.
15:18:43 #link https://review.openstack.org/#/c/241843/
15:19:00 #link https://review.openstack.org/#/c/242286/
15:19:32 The second patch that I posted is the one that is currently failing; I will test it again in my setup and see where the problem is.
15:20:24 Let us move on to the next one.
15:21:14 I think we are done with the high bugs, but there are still other bugs that are in progress.
15:21:45 haleyb: we need to clean up the bugs that are fix committed.
15:22:01 haleyb: who should be doing it?
15:22:18 haleyb: This is in launchpad.
15:22:21 will that get them off this list? https://bugs.launchpad.net/neutron/+bugs?field.tag=l3-dvr-backlog
15:23:02 haleyb: yes, it should take them off that list.
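
A rough illustration of the Launchpad cleanup just discussed. This is only a sketch using launchpadlib with anonymous read-only access; the consumer name is made up, and treating "Fix Committed" as the status to clean up is an assumption rather than anything agreed in the meeting:

    from launchpadlib.launchpad import Launchpad

    # Anonymous access is enough to list bugs; actually editing a
    # bug task's status would require Launchpad.login_with() instead.
    lp = Launchpad.login_anonymously('dvr-bug-cleanup', 'production')
    neutron = lp.projects['neutron']

    # Fix Committed bugs tagged l3-dvr-backlog: the candidates to
    # clean up so they drop off the l3-dvr-backlog list.
    for task in neutron.searchTasks(tags=['l3-dvr-backlog'],
                                    status=['Fix Committed']):
        print(task.bug.id, task.bug.title)
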
15:23:50 Now let us move on to the next topic.
15:23:58 #topic Gate-failures
15:24:52 last week I had an action item to create a bug to track any tempest-related activity to address the gate failures.
15:24:57 #link https://bugs.launchpad.net/neutron/+bug/1515360
15:24:57 Launchpad bug 1515360 in neutron "Add more verbose to Tempest Test Errors that causes "SSHTimeout" seen in CVR and DVR" [High,Confirmed]
15:25:07 I have created this bug.
15:25:28 What can be done to address this issue?
15:25:59 and i didn't reach out to the tempest and/or infra teams
15:26:13 haleyb: I had a chat with armax as well, and he feels that adding more debug info related to the namespaces will be too much information.
15:26:48 haleyb: he also mentioned that previously it used to dump the namespace information and they got away from that.
15:27:29 Is there any other way within the L3 agent domain that we can confirm that the FIPs will work?
15:27:30 i was almost hoping we could add an experimental job we can run for debugging
15:28:24 haleyb: I was trying to add something like "pinging" from the fip namespace to the router namespace, with the floating IP as the destination.
15:28:37 haleyb: But it did not work
15:29:00 Swami: if the tempest patches ran the neutron multinode jobs (they don't now), we could then maybe create a tempest patch with a bunch of debug and recheck it to trigger a failure?
15:30:07 haleyb: so are you saying to just target it for the multinode job and not for the single node?
15:30:52 hmm, i guess the single node failures are growing as well
15:31:43 haleyb: carl_baldwin: can we add a _fip_path_validate kind of function to the l3 agent that, after setting up the floating IP namespace and adding rules to the router namespace, pings internally and then provides a status update?
15:32:35 That way we have control within neutron and no need to rely on tempest; tempest cannot add such low-level tests anyway.
15:33:19 I am just throwing out some ideas here. Please feel free to add your input if you have any thoughts.
15:33:19 Swami: I need a minute to catch up. Distracted today.
15:33:30 carl_baldwin: thanks, no problem.
15:33:59 Swami: yes, i suppose we could add some debugging via the l3-agent log
15:35:02 haleyb: ok, I will try to push up a WIP patch and we can discuss this in that patch; that way we are all in sync.
15:35:29 Swami: ok, thanks
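
A minimal sketch of the check Swami proposes, shelling out to "ip netns exec" rather than using the agent's own ip_lib wrappers. The function name comes from the discussion above, but the argument names, timeout values, and return convention are assumptions:

    import subprocess

    def _fip_path_validate(fip_ns, floating_ip):
        """Ping a floating IP from inside the FIP namespace.

        Returns True if a reply came back, i.e. the fip-ns to
        qrouter-ns path and the DVR NAT rules look healthy.
        """
        cmd = ['ip', 'netns', 'exec', fip_ns,
               'ping', '-c', '1', '-W', '2', floating_ip]
        try:
            subprocess.check_output(cmd, stderr=subprocess.STDOUT)
            return True
        except (subprocess.CalledProcessError, OSError):
            return False

    # e.g. right after a floating IP is configured (DVR names the
    # namespace fip-<external network id>):
    #   ok = _fip_path_validate('fip-a1b2c3d4', '203.0.113.10')

The agent could then log the result as the status update Swami mentions, flagging a broken FIP path at setup time instead of waiting for a tempest SSH timeout.
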
15:35:51 haleyb: also, what I noticed is that I am not sure how the metrics are calculated in logstash.
15:36:28 haleyb: I do see it reporting gate_tempest_dsvm_neutron_dvr failures against patches that are not even running gate_tempest_dsvm_neutron_dvr.
15:37:21 so it's not being run but failing?
15:38:12 haleyb: If I look at the logstash data and filter by neutron-dvr failures, it reports a patch id, and if I look at the patch, all it has is rally-related tests, but it is still reported as a dvr-related failure.
15:38:23 haleyb: do you know what might be happening here?
15:38:39 Swami: pointer?
15:39:07 haleyb: I don't have it handy, but I will send it out.
15:40:02 Ok, before we move on to the next section: if anyone has other ideas or thoughts on how to mitigate the gate failures, feel free to let us know.
15:40:24 * carl_baldwin back
15:40:31 Sorry to have gotten distracted.
15:40:45 carl_baldwin: no problem, we can catch up later.
15:40:52 Swami: ack
15:41:10 Swami: maybe if we ignore it longer the failures will just go away? :)
15:41:26 haleyb: +1 :D
15:41:29 haleyb: no, I don't want to do that; then we will not really get to the bottom of it.
15:41:35 no, we're actually making progress on the bugs, so let's not stop
15:41:43 haleyb: this will come back to us.
15:41:44 Swami: you missed my :)
15:42:09 haleyb: yes, got it.
15:42:29 haleyb: can we move on to the next topic?
15:42:34 yes
15:42:46 #topic Performance/Scalability
15:42:54 not much to update here really
15:43:10 I'm planning to start implementing the dvr binding refactoring bp this week or early next week
15:43:32 I also should grab the scale lab to run some tests and do some analysis
15:44:06 that's it from me
15:44:35 obondarev: great, can you just update the wiki with a link to the BP? I think we can remove the merged bug from it :)
15:44:51 haleyb: sure, will do
15:45:15 obondarev: I had one question on this.
15:45:53 obondarev: the original blueprint had some notes about refactoring the agent side. Do we really need to refactor the agent side for these binding changes, or is it orthogonal?
15:46:22 Swami: I don't expect big changes on the agent side
15:46:29 probably some minor changes
15:46:40 obondarev: +1
15:46:42 obondarev: thanks
15:48:18 I think that's all we had on today's agenda. Are there any other items the team wanted to discuss today?
15:48:21 #topic Open Discussion
15:49:09 haleyb: is there an easier way to figure out all the types of dvr failures from logstash?
15:50:40 haleyb: I meant just filtering to get the frequently failing test cases, so that if we have missed anything we can focus our attention on it.
15:50:44 Swami: Any chance you could get the pointer to that DVR failure you mentioned above where dvr was not run on the patch?
15:51:12 carl_baldwin: sure, I will get the filter going again and I will let you know.
15:51:31 carl_baldwin: I captured the screenshot, but before I saved it, my pc crashed. Bad luck.
15:51:35 Swami: i don't know, what you describe seems pretty manual - finding the failing test and searching for it
15:52:29 haleyb: in logstash right now it only displays the first 500 errors, and they might all have come from a single patch.
15:52:53 haleyb: when I try to increase the page size in logstash it is a nightmare; it hangs or takes forever.
15:53:50 Swami: ok, now i understand, but i don't know the answer. Who owns logstash?
15:54:25 haleyb: I don't know who owns logstash; is it the infra team or someone else?
15:55:28 I'd start with infra
15:55:43 carl_baldwin: thanks
15:56:46 yeah, perhaps there needs to be some other filter button
15:57:35 haleyb: you can filter by message type, but only if you know exactly which test is failing and its error message.
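
For reference, the kind of logstash.openstack.org query being discussed looks roughly like this (Lucene syntax, using the fields elastic-recheck queries typically use; the hyphenated job name is an assumption and should be checked against the actual build_name values):

    build_name:"gate-tempest-dsvm-neutron-dvr"
    AND build_status:"FAILURE" AND message:"SSHTimeout"

This only narrows what is displayed, though; the 500-result cap Swami describes sounds like a UI display limit, so tightening the message filter is about the only practical workaround.
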
15:58:26 ok, if nothing else i'll call it, thanks everyone
15:58:29 #endmeeting