15:01:09 #startmeeting neutron_l3
15:01:10 Meeting started Thu Feb 5 15:01:09 2015 UTC and is due to finish in 60 minutes. The chair is carl_baldwin. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:11 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:14 The meeting name has been set to 'neutron_l3'
15:01:35 #topic Announcements
15:01:42 #link https://wiki.openstack.org/wiki/Meetings/Neutron-L3-Subteam
15:01:54 I think Kilo-2 is today!
15:02:15 That means if it hasn’t entered the queue yet, it won’t make it. I’m not sure when they’ll cut it.
15:02:39 It also means that Kilo-3 is coming right up. It’ll be here before we know it.
15:03:02 #link https://wiki.openstack.org/wiki/Kilo_Release_Schedule
15:04:01 Any other announcements?
15:04:11 #topic Bugs
15:04:55 Any bugs to bring up?
15:05:11 https://bugs.launchpad.net/neutron/+bug/1418097
15:05:25 Probably more OVS related than L3, but I need help to triage that one
15:05:45 This one has eluded me until this week: https://bugs.launchpad.net/neutron/+bug/1404743 I’ll take a look at it.
15:06:04 amuller: Anything to discuss wrt this kernel panic?
15:07:04 Logic dictates it's more to do with OVS / kernel versions, and some OVS bug. However, that doesn't explain why it only started happening since the patch linked in the bug report.
15:07:05 amuller: I encountered a Jenkins failure for dsvm-functional neutron.tests.function yesterday evening too
15:07:16 Rajeev: Can you link?
15:07:34 amuller: I can’t imagine how the patch referenced could cause it, but I’ve not thought deeply about it yet.
15:07:45 amuller: Is this only seen in the functional tests?
15:08:02 amuller: http://logs.openstack.org/77/153077/1/check/check-neutron-dsvm-functional/a1a6a1f/
15:08:13 carl_baldwin: I can rebase back to before that patch, and I can run the tests reliably. I then go one patch forward and it happens on the first run. It's very odd. I agree, Carl, that the patch looks innocent.
15:08:56 amuller: my recheck went through ok
15:08:59 Rajeev: That seems unrelated but still interesting
15:09:19 amuller: good to know, thanks.
15:09:41 amuller: Well, I can’t say there is nothing in the patch, but it is strange.
15:09:46 Swami: Some more context: internal testing on a Juno based build (!) started seeing the same issue, with ovs-vswitchd crashing or causing a kernel panic
15:10:04 amuller: thanks
15:10:08 Now, that's the same symptom, but a Juno based build obviously doesn't have any Kilo patches
15:10:28 armax: Is Rajeev’s timeout exception similar to or the same as the one you saw?
15:10:30 Swami: And internal testing was done on a 'production' system, not via functional testing
15:10:59 I suspect that some combinations of kernels and OVS builds are utterly broken
15:11:19 carl_baldwin: yes, it’s the same I believe
15:11:22 and will cause catastrophic failures in any OpenStack cloud using that combination
15:12:15 amuller: Does the OVS crash with Juno have the same trace or other symptoms?
15:12:34 It looks the same
15:13:11 amuller: But, yet, when you rebase to before the patch, it no longer happens? Is that reliably repeatable?
15:13:29 armax: Any idea yet if this timeout thing is happening with any frequency?
15:13:30 But to reiterate, I have a VM where, if I go back in time, everything works properly, then when I fast forward to HEAD it starts breaking, without reinstalling anything, without changing any versions of anything. Simply git rebasing back and forward.
15:13:35 carl_baldwin: Yes, it's reliable.
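
(The rebase-back-and-forward repro amuller describes is effectively a manual bisection. A minimal sketch of automating it with git bisect run, assuming the functional tests can be driven through tox; the tox environment, test path, and timeout below are illustrative placeholders, not taken from the bug:

    #!/usr/bin/env python
    # Bisect helper: git bisect marks a commit bad when this exits non-zero.
    #
    #   git bisect start <bad-commit> <good-commit>
    #   git bisect run python bisect_functional.py
    import subprocess
    import sys

    TESTS = "neutron.tests.functional.agent.test_l3_agent"  # illustrative

    try:
        # Treat a hung run as a failure instead of stalling the bisection.
        subprocess.check_call(
            ["tox", "-e", "dsvm-functional", "--", TESTS], timeout=3600)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        sys.exit(1)  # tests failed or hung: this commit is bad
    sys.exit(0)      # tests passed: this commit is good

This only narrows down the triggering patch; as noted above, the underlying crash may still be an OVS/kernel combination issue.)
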
15:13:41 and if you have the crashdump trace, let's get it in the bug and/or send it to the OVS/netdev ML - could be a known bug
15:14:08 carl_baldwin: not directly, but it should be fairly trivial to pull a logstash query and see what’s going on
15:15:24 armax: okay
15:16:31 haleyb: Will do
15:17:30 amuller: I’ll look into it a bit more after the meeting and add notes to the bug report.
15:17:38 amuller: Anything else we can discuss here?
15:17:47 Nay
15:17:48 amuller: do you know what specific operation it is crashing on?
15:17:57 Rajeev: No
15:18:22 ovs agent log might be of some help
15:18:38 We don't use the OVS agent for the L3 functional testing though
15:18:50 I'll see if there's anything in the OVS logs or crash traces
15:19:42 amuller: Rajeev: thanks. Let’s move on and discuss this on the mail thread or IRC.
15:19:50 carl_baldwin: agreed
15:19:57 Any other bugs?
15:20:23 #topic L3 Agent Restructuring
15:20:24 carl_baldwin: there are a couple of dvr related bugs still waiting for review
15:20:43 Swami: Okay, we’ll get to the dvr section soon.
15:21:00 carl_baldwin: thanks
15:21:07 I have three refactoring patches up for review.
15:21:41 amuller has one that I can see now.
15:21:53 The chain starts here: https://review.openstack.org/#/c/150154
15:22:22 Then mlavalle has a namespace patch.
15:22:36 correct
15:22:49 amuller: will you be able to visit your patch today?
15:23:07 carl_baldwin: Yeah, you should expect a new revision in a few minutes
15:23:45 amuller: Great. Thanks. It is a good patch. I rebased my stuff behind it.
15:23:54 mlavalle: Anything on yours to discuss?
15:24:34 carl_baldwin: as far as the namespace patchset, I finished testing it locally in devstack and with the l3 agent functional tests. I will push the next revision today. I didn't do it last night because it was late and I wanted to respond to all the comments made by amuller, pc_m and you
15:24:49 After these patches, there is only a little bit of router stuff left in the agent. All dealing with plugging interfaces if I’m not mistaken. That’ll be another patch.
15:25:11 good job guys!
15:25:25 mlavalle: Great.
15:25:52 carl_baldwin: there were some changes to the functional test
15:26:06 nothing big
15:26:38 After the interfaces patch, I think there will be one more. I’m nearly convinced that DvrRouter needs to split into two classes.
15:26:53 … one for compute nodes and one for shared snat nodes.
15:27:17 So, two new patches and I think we’ll be in pretty good shape with this project.
15:27:28 * carl_baldwin sees the light at the end of the tunnel.
15:27:39 carl_baldwin: the interfaces patch, is that what we talked about last Friday?
15:27:57 mlavalle: I don’t think so.
15:28:12 ok
15:28:25 mlavalle: ping me a bit later.
15:28:30 will do
15:28:51 #topic neutron-ipam
15:28:53 carl_baldwin: after lunch
15:28:55 salv-orlando: ping
15:29:03 johnbelamaric: pavel_bondar: hi
15:29:06 tidwellr: hi
15:29:13 carl_baldwin: hi
15:29:15 hey
15:29:16 hello
15:29:28 I feel some good momentum starting to build here.
15:29:39 I will address feedback on my patch sometime this morning.
15:30:45 I also put some notes on salv-orlando’s patch.
15:31:08 Is there anything to discuss now?
15:31:28 I have added an early WIP for db_base_refactoring
15:31:34 #link https://review.openstack.org/#/c/153236/
15:32:07 pavel_bondar: ^ it passed one test. ;)
15:32:12 :)
15:32:19 yeah, it does not work
15:32:29 and too early for reviewing it
15:32:54 will keep working on it
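
(An aside for readers following the IPAM thread: the db_base refactoring is working toward a pluggable IPAM driver. A rough sketch of the general shape such an interface could take; the class and method names here are hypothetical, not taken from the patches under review:

    import abc


    class IpamDriver(abc.ABC):
        """Entry point a plugin would load instead of hard-coded IPAM logic."""

        @abc.abstractmethod
        def allocate_subnet(self, subnet_request):
            """Reserve a CIDR; return a subnet handle for address management."""

        @abc.abstractmethod
        def remove_subnet(self, subnet_id):
            """Release the CIDR and any allocation state for the subnet."""


    class IpamSubnet(abc.ABC):
        """Handle returned by allocate_subnet; manages one subnet's addresses."""

        @abc.abstractmethod
        def allocate(self, address_request):
            """Return a specific or driver-chosen IP address."""

        @abc.abstractmethod
        def deallocate(self, address):
            """Return a previously allocated address to the pool."""

Splitting the driver (subnet lifecycle) from the subnet handle (address lifecycle) is what lets the db base class delegate allocation without caring which backend owns the addresses.)
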
15:33:41 pavel_bondar: I added myself as a reviewer. Do you think it is worth looking at it from a high level perspective? Or wait?
15:34:06 it is better to wait for the next patchset
15:34:10 pavel_bondar: ack
15:34:24 Anything else to discuss about IPAM?
15:34:37 carl_baldwin: I think any questions are best addressed in the reviews right now
15:35:08 agreed
15:35:14 carl_baldwin: well, there is this open question on the sequence, but I think it may be easier to deal with in the review than here
15:35:17 johnbelamaric: Fair enough. With refactoring winding down, I can have a better presence on the reviews.
15:35:27 carl_baldwin: excellent!
15:35:45 johnbelamaric: okay.
15:35:54 #topic dvr
15:36:11 Swami: hi, I wanted to be sure to leave some time for dvr.
15:36:19 carl_baldwin: thanks
15:36:45 We are currently working on finding a solution to fix the issues in the gate with the dvr related tests.
15:37:23 One thing that came up is, should we always delete the "fip namespace" when VMs come and go? Is it introducing some delay and complexity?
15:38:05 Swami: That is a good question. I have wondered about that myself.
15:38:06 One idea we had is, can we leave the fip namespace and the fip-agent-gateway there as long as the external network is there for that particular router.
15:38:45 When the external network or the gateway is dropped, we can go ahead and clean up all the fip-agent ports and namespaces.
15:39:06 That would substantially reduce the inter-communication between the agents and plugin.
15:40:26 Swami: The L3 agent knows which external networks are available to it, right?
15:40:41 Yes.
15:41:10 Swami: Will the list of external networks ever be very big?
15:41:18 I guess it could be.
15:41:47 Just give it a thought and let me know if that is ok, and then I can push in the patch.
15:42:24 I also wanted to reduce an rpc call from the agent to the plugin to create the "fip agent gw port".
15:42:30 Swami: So, if I understand it correctly, you want to change it so that the fip namespace will stay even if there are no fips. As long as the router has an external gateway then you will create the namespace.
15:42:37 Swami: Is that correct?
15:42:38 Instead, the plugin can create it during a floatingip associate.
15:42:57 carl_baldwin: Yes, you are right, that is my proposal
15:43:29 Swami: That sounds fine to me. I think it would be good to reduce the potential for thrash in creating/deleting the namespace.
15:43:38 Do others want to weigh in?
15:43:48 carl_baldwin: thanks
15:44:24 Swami: Just off the top of my head, you’d change it to reference count routers with gateway ports on the compute host instead of floating ips.
15:44:27 we like the idea. one clarification
15:44:52 is it that the ns will be deleted when all the routers on the node no longer have an external gateway?
15:46:01 The fip namespace is related to the external network, so when a particular external network is removed from the router, that particular fip namespace will be deleted.
15:46:26 Swami: when a particular network is removed from *all* routers, right?
15:46:36 *all* meaning all routers present on the compute node.
15:46:39 Yes, from all routers.
15:47:07 If there is even a single router that has a gateway with the specified external network, we will leave the namespace intact.
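
(To make the proposed lifecycle concrete: the fip namespace would effectively be reference counted by the local routers gatewayed to each external network, rather than by floating IPs. A minimal sketch with hypothetical names; the actual agent code is organized differently:

    from collections import defaultdict


    class FipNamespaceTracker(object):
        """Keep a fip namespace alive while any local router uses its network."""

        def __init__(self):
            # external network id -> ids of local routers gatewayed to it
            self._routers_by_ext_net = defaultdict(set)

        def router_gateway_added(self, ext_net_id, router_id):
            if not self._routers_by_ext_net[ext_net_id]:
                # First gatewayed router on this node for this network.
                self._create_fip_namespace(ext_net_id)
            self._routers_by_ext_net[ext_net_id].add(router_id)

        def router_gateway_removed(self, ext_net_id, router_id):
            routers = self._routers_by_ext_net[ext_net_id]
            routers.discard(router_id)
            if not routers:
                # Last gatewayed router gone: clean up the namespace
                # and the fip agent gateway port.
                self._delete_fip_namespace(ext_net_id)

        def _create_fip_namespace(self, ext_net_id):
            pass  # placeholder: create ns + fip-agent-gw port

        def _delete_fip_namespace(self, ext_net_id):
            pass  # placeholder: delete ns + fip-agent-gw port

With this, VMs and floating IPs coming and going never thrash the namespace; only gateway add/remove events do.)
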
15:47:21 carl_baldwin: Just a last note before I quit.
15:47:23 Swami: Just pointing out that more than one router on the compute node may be connected to the same external network.
15:47:30 Can you shed some light on the high bugs that are out there.
15:47:46 #link https://bugs.launchpad.net/neutron/+bug/1411883
15:48:02 #link https://bugs.launchpad.net/neutron/+bug/1374473 https://bugs.launchpad.net/neutron/+bug/1413630
15:48:24 carl_baldwin: I need to drop off.
15:48:34 Swami: What do you mean by “shed some light”
15:48:35 ?
15:48:41 Rajeev can continue the discussion if he has any. Or I will ping you later if you have questions.
15:48:59 Swami: Okay. ttyl
15:49:05 carl_baldwin: Sorry, I meant can you take a look at those patches? They have been waiting for a while.
15:49:12 Thanks
15:49:13 bye
15:49:30 Swami: Yes, I expect to have some more review time freed up. bye.
15:50:07 Rajeev: anything more?
15:50:28 carl_baldwin: sure
15:51:11 in one of the test failures we see that the agent is returning the floating ip status as not active
15:51:23 because the fip namespace is still being set up
15:51:52 the tests in the dvr job don't check the status of the fips before using them
15:52:08 so they attempt to ping and time out
15:52:46 Rajeev: So, the agent does not allow the fip namespace to get fully constructed before reporting state on the floating ip?
15:53:05 Rajeev: Does it eventually report active status for the fip?
15:53:29 carl_baldwin: correct, because there is no waiting in the agent
15:53:59 carl_baldwin: as for eventually, I believe not
15:54:26 but I have to look into the code
15:54:58 Does the test attempt to wait for active status? Or does it ignore status and try to ping anyway?
15:55:16 carl_baldwin: the test doesn't check status, just pings
15:55:42 carl_baldwin: and fails
15:56:13 Rajeev: Maybe the test could be enhanced to first wait for active status (fail if it is *never* achieved)
15:57:13 carl_baldwin: yes, that would be a better approach.
15:57:27 Rajeev: hence my question about whether it eventually reports active.
15:58:17 We only have about a minute left.
15:58:23 carl_baldwin: since there is no wait/retry, only
15:58:50 a new update will cause that to happen, so it will be unreliable.
15:59:14 Rajeev: I’m not sure I follow.
15:59:46 Rajeev: do you want to continue in the openstack-neutron room?
15:59:50 carl_baldwin: we can take it offline. Otherwise we are continuing to look into ways to stabilize the job and resume HA work
16:00:12 Rajeev: great. ping me later.
16:00:16 carl_baldwin: sure. will sign in
16:00:17 Thanks all!
16:00:23 #endmeeting
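
(As a follow-up to the test discussion above: a minimal sketch of the wait Rajeev and carl_baldwin talked about, polling the floating IP until it reports ACTIVE before the test pings it, and failing if ACTIVE is never achieved. The client call follows the python-neutronclient style; treat the helper name, timeout, and interval as illustrative:

    import time


    def wait_for_fip_active(client, fip_id, timeout=60, interval=2):
        """Poll a floating IP until it is ACTIVE; fail if never achieved.

        `client` is assumed to expose show_floatingip() in the style of
        python-neutronclient.
        """
        deadline = time.time() + timeout
        while time.time() < deadline:
            fip = client.show_floatingip(fip_id)["floatingip"]
            if fip.get("status") == "ACTIVE":
                return fip
            time.sleep(interval)
        raise AssertionError("Floating IP %s never became ACTIVE" % fip_id)

Pinging only after this returns separates "the agent never finished wiring the fip namespace" failures from genuine connectivity bugs, which is what makes the dvr job failures diagnosable.)
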