15:01:09 <carl_baldwin> #startmeeting neutron_l3
15:01:10 <openstack> Meeting started Thu Feb  5 15:01:09 2015 UTC and is due to finish in 60 minutes.  The chair is carl_baldwin. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:11 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:14 <openstack> The meeting name has been set to 'neutron_l3'
15:01:35 <carl_baldwin> #topic Announcements
15:01:42 <carl_baldwin> #link https://wiki.openstack.org/wiki/Meetings/Neutron-L3-Subteam
15:01:54 <carl_baldwin> I think Kilo-2 is today!
15:02:15 <carl_baldwin> That means if it hasn’t entered the queue yet, it won’t make it.  I’m not sure when they’ll cut it.
15:02:39 <carl_baldwin> Also means that Kilo-3 is coming right up.  It’ll be here before we know it.
15:03:02 <carl_baldwin> #link https://wiki.openstack.org/wiki/Kilo_Release_Schedule
15:04:01 <carl_baldwin> Any other announcements?
15:04:11 <carl_baldwin> #topic Bugs
15:04:55 <carl_baldwin> Any bugs to bring up?
15:05:11 <amuller> https://bugs.launchpad.net/neutron/+bug/1418097
15:05:25 <amuller> Probably more OVS related than L3, but I need help to triage that one
15:05:45 <carl_baldwin> This one has eluded me until this week:  https://bugs.launchpad.net/neutron/+bug/1404743  I’ll take a look at it.
15:06:04 <carl_baldwin> amuller: Anything to discuss wrt this kernel panic?
15:07:04 <amuller> Logic dictates it's more to do with OVS / kernel versions, and some OVS bug. However, that doesn't explain why it only started happening since the patch linked in the bug report.
15:07:05 <Rajeev> amuller: I encountered a Jenkins failure for dsvm-functional neutron.tests.function yesterday evening too
15:07:16 <amuller> Rajeev: Can you link?
15:07:34 <carl_baldwin> amuller: I can’t imagine how the patch referenced could cause it but I’ve not thought deeply about it yet.
15:07:45 <Swami> amuller: Is this only seen in the functional tests?
15:08:02 <Rajeev> amuller: http://logs.openstack.org/77/153077/1/check/check-neutron-dsvm-functional/a1a6a1f/
15:08:13 <amuller> carl_baldwin: I can rebase back to before that patch, and I can run the tests reliably. I then go one patch forward and it happens on the first run. It's very odd. I agree Carl that the patch looks innocent.
15:08:56 <Rajeev> amuller: my recheck went through ok
15:08:59 <amuller> Rajeev: That seems unrelated but still interesting
15:09:19 <Rajeev> amuller: good to know, thanks.
15:09:41 <carl_baldwin> amuller: Well, I can’t say there is nothing in the patch but it is strange.
15:09:46 <amuller> Swami: Some more context: Internal testing on a Juno based build (!) started seeing the same issue, with ovs-vswitch crashing or causing a kernel panic
15:10:04 <Swami> amuller: thanks
15:10:08 <amuller> Now, that's the same symptom, but a Juno based build obviously doesn't have any Kilo patches
15:10:28 <carl_baldwin> armax: Is Rajeev’s timeout exception similar to or the same as the one you saw?
15:10:30 <amuller> Swami: And internal testing was done on a 'production' system, not via functional testing
15:10:59 <amuller> I suspect that some combinations of kernels and OVS builds are utterly broken
15:11:19 <armax> carl_baldwin: yes it’s the same I believe
15:11:22 <amuller> and will cause catastrophic failures in any openstack cloud using that combination
15:12:15 <carl_baldwin> amuller: Does the ovs crash with Juno have the same trace or other symptoms?
15:12:34 <amuller> It looks the same
15:13:11 <carl_baldwin> amuller: But, yet, when you rebase to before the patch, it no longer happens?  Is that reliably repeatable?
15:13:29 <carl_baldwin> armax: Any idea yet if this timeout thing is happening with any frequency?
15:13:30 <amuller> But to reiterate, I have a VM, where if I go back in time everything works properly, then when I fast forward to HEAD it starts breaking, without reinstalling anything, without changing any versions of anything. Simply git rebasing back and forward.
15:13:35 <amuller> carl_baldwin: Yes it's reliable.
15:13:41 <haleyb> and if you have the crashdump trace let's get it in the bug and/or sent to the OVS/netdev ML - could be a known bug
15:14:08 <armax> carl_baldwin: not directly, but it should be fairly trivial to pull a logstash query and see what’s going on
15:15:24 <carl_baldwin> armax: okay
15:16:31 <amuller> haleyb: Will do
15:17:30 <carl_baldwin> amuller: I’ll look in to it a bit more after the meeting and add notes to the bug report.
15:17:38 <carl_baldwin> amuller: Anything else we can discuss here?
15:17:47 <amuller> Nay
15:17:48 <Rajeev> amuller: do you know what specific operation is it crashing on ?
15:17:57 <amuller> Rajeev: No
15:18:22 <Rajeev> ovs agent log might be of some help
15:18:38 <amuller> We don't use the OVS agent for the L3 functional testing though
15:18:50 <amuller> I'll see if there's anything in the OVS logs or crash traces
15:19:42 <carl_baldwin> amuller: Rajeev: thanks.  Let’s move on and discuss this on the mail thread or IRC.
15:19:50 <Rajeev> carl_baldwin: agreed
15:19:57 <carl_baldwin> Any other bugs?
15:20:23 <carl_baldwin> #topic L3 Agent Restructuring
15:20:24 <Swami> carl_baldwin: there are couple of dvr related bugs still waiting for review
15:20:43 <carl_baldwin> Swami: Okay, we’ll get to the dvr section soon.
15:21:00 <Swami> carl_baldwin: thanks
15:21:07 <carl_baldwin> I have three refactoring patches up for review.
15:21:41 <carl_baldwin> amuller has one that I can see now.
15:21:53 <carl_baldwin> The chain starts here:  https://review.openstack.org/#/c/150154
15:22:22 <carl_baldwin> Then mlavalle has a namespace patch.
15:22:36 <mlavalle> correct
15:22:49 <carl_baldwin> amuller: will you be able to visit your patch today?
15:23:07 <amuller> carl_baldwin: Yeah you should expect a new revision in a few minutes
15:23:45 <carl_baldwin> amuller: Great.  Thanks.  It is a good patch.  I rebased my stuff behind it.
15:23:54 <carl_baldwin> mlavalle: Anything on yours to discuss?
15:24:34 <mlavalle> carl_baldwin: as far as the namespace patchset, I finished testing it locally in devstack and with the l3 agent functional tests. I will push the next revision today. I didn't do it last night because it was late and I wanted to respond to all the comments made by amuller, pc_m and you
15:24:49 <carl_baldwin> After these patches, there is only a little bit of router stuff left in the agent.  All dealing with plugging interfaces if I’m not mistaken.  That’ll be another patch.
15:25:11 <pc_m> good job guys!
15:25:25 <carl_baldwin> mlavalle: Great.
15:25:52 <mlavalle> carl_baldwin: there were some changes to the functional test
15:26:06 <mlavalle> nothing big
15:26:38 <carl_baldwin> After the interfaces patch, I think there will be one more.  I’m nearly convinced that DvrRouter needs to split into two classes.
15:26:53 <carl_baldwin> … one for compute nodes and one for shared snat nodes.
15:27:17 <carl_baldwin> So, two new patches and I think we’ll be in pretty good shape with this project.
15:27:28 * carl_baldwin sees the light at the end of the tunnel.
15:27:39 <mlavalle> carl_baldwin: the interfaces patch, is that what we talked about last Friday?
15:27:57 <carl_baldwin> mlavalle: I don’t think so.
15:28:12 <mlavalle> ok
15:28:25 <carl_baldwin> mlavalle: ping me a bit later.
15:28:30 <mlavalle> will do
15:28:51 <carl_baldwin> #topic neutron-ipam
15:28:53 <mlavalle> carl_baldwin: after lunch
15:28:55 <carl_baldwin> salv-orlando: ping
15:29:03 <carl_baldwin> johnbelamaric: pavel_bondar: hi
15:29:06 <carl_baldwin> tidwellr: hi
15:29:13 <pavel_bondar> carl_baldwin: hi
15:29:15 <tidwellr> hey
15:29:16 <johnbelamaric> hello
15:29:28 <carl_baldwin> I feel some good momentum starting to build here.
15:29:39 <carl_baldwin> I will address feedback on my patch sometime this morning.
15:30:45 <carl_baldwin> I also put some notes on salv-orlando ’s patch.
15:31:08 <carl_baldwin> Is there anything to discuss now?
15:31:28 <pavel_bondar> I have added an early WIP for db_base_refactoring
15:31:34 <pavel_bondar> #link https://review.openstack.org/#/c/153236/
15:32:07 <carl_baldwin> pavel_bondar: ^ it passed one test.  ;)
15:32:12 <pavel_bondar> :)
15:32:19 <pavel_bondar> yeah, it does not work
15:32:29 <pavel_bondar> and too early for reviewing it
15:32:54 <pavel_bondar> will keep working on it
15:33:41 <carl_baldwin> pavel_bondar: I added myself as a reviewer.  Do you think it is worth looking at it from a high-level perspective?  Or wait?
15:34:06 <pavel_bondar> it is better to wait for the next patchset
15:34:10 <carl_baldwin> pavel_bondar: ack
15:34:24 <carl_baldwin> Anything else to discuss about IPAM?
15:34:37 <johnbelamaric> carl_baldwin: I think any questions are best addressed in the reviews right now
15:35:08 <tidwellr> agreed
15:35:14 <johnbelamaric> carl_baldwin: well, there is this open question on the sequence, but I think it may be easier to deal with in the review than here
15:35:17 <carl_baldwin> johnbelamaric: Fair enough.  With refactoring winding down, I can have a better presence on the reviews.
15:35:27 <johnbelamaric> carl_baldwin: excellent!
15:35:45 <carl_baldwin> johnbelamaric: okay.
15:35:54 <carl_baldwin> #topic dvr
15:36:11 <carl_baldwin> Swami: hi, I wanted to be sure to leave some time for dvr.
15:36:19 <Swami> carl_baldwin: thanks
15:36:45 <Swami> We are currently working on finding a solution to fix the issues in the gate with the dvr related tests.
15:37:23 <Swami> One thing that came up is, should we always delete the "fip namespace" when vms come and go?  Is it introducing some delay and complexity?
15:38:05 <carl_baldwin> Swami: That is a good question.  I have wondered about that myself.
15:38:06 <Swami> One idea we had is, can we leave the fip namespace and the fip-agent-gateway there until the external network is there for that particular router.
15:38:45 <Swami> When the external network or the gateway is dropped we can go ahead and clean up all the fip-agent ports and namespaces.
15:39:06 <Swami> That would substantially reduce the inter communication between the agents and plugin.
15:40:26 <carl_baldwin> Swami: The L3 agent knows which external networks are available to it, right?
15:40:41 <Swami> Yes.
15:41:10 <carl_baldwin> Swami: Will the list of external networks ever be very big?
15:41:18 <carl_baldwin> I guess it could be.
15:41:47 <Swami> Just give it a thought and let me know if that is ok and then I can push in the patch.
15:42:24 <Swami> I also wanted to reduce an rpc call from agent to the plugin to create the "fip agent gw port".
15:42:30 <carl_baldwin> Swami: So, if I understand it correctly, you want to change it so that the fip namespace will stay even if there are no fips.  As long as the router has an external gateway then you will create the namespace.
15:42:37 <carl_baldwin> Swami: Is that correct?
15:42:38 <Swami> Instead the plugin can create during a floatingip associate.
15:42:57 <Swami> carl_baldwin: Yes you are right, that is my proposal
15:43:29 <carl_baldwin> Swami: That sounds fine to me.  I think it would be good to reduce the potential for thrash in creating/deleting the namespace.
15:43:38 <carl_baldwin> Do others want to weigh in?
15:43:48 <Swami> carl_baldwin: thanks
15:44:24 <carl_baldwin> Swami: Just off the top of my head, you’d change it to reference count routers with gateway ports on the compute host instead of floating ips.
15:44:27 <Rajeev> we like the idea. one clarification
15:44:52 <Rajeev> is it that the ns will be deleted when all the routers on the node do not have an external gateway?
15:46:01 <Swami> The fip namespace is related to the external network, so when particular external network is removed from the router, that particular fip namespace will be deleted.
15:46:26 <carl_baldwin> Swami: when particular network is removed from *all* routers, right?
15:46:36 <carl_baldwin> *all* meaning all routers present on the compute node.
15:46:39 <Swami> Yes, from all routers.
15:47:07 <Swami> If there is even a single router that has a gateway with the specified external network, we will leave the namespace intact.
15:47:21 <Swami> carl_baldwin: Just a last note before I quit.
15:47:23 <carl_baldwin> Swami: Just pointing out that more than one router on the compute node may be connected to the same external network.
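[editor's note: the reference-counting idea carl_baldwin and Swami discuss above could be sketched roughly as below. This is an illustrative sketch only; `FipNamespaceTracker` and its method names are hypothetical and are not Neutron APIs. The idea is to tie the fip namespace's lifetime to the set of local routers with a gateway on the external network, rather than to the floating IP count.]

```python
# Hypothetical sketch of per-external-network fip namespace reference
# counting, as discussed in the meeting.  Names are illustrative, not
# actual Neutron code.
from collections import defaultdict


class FipNamespaceTracker:
    """Keep a fip namespace alive while any router on this node has a
    gateway on its external network, instead of deleting it whenever
    the last floating IP goes away."""

    def __init__(self):
        # external network id -> set of router ids with a gateway on it
        self._gw_routers = defaultdict(set)

    def router_gateway_added(self, ext_net_id, router_id):
        # True means this is the first gateway router for the external
        # network, so the caller should create the fip namespace now.
        create = not self._gw_routers[ext_net_id]
        self._gw_routers[ext_net_id].add(router_id)
        return create

    def router_gateway_removed(self, ext_net_id, router_id):
        # True means no router on this node uses the external network
        # anymore, so the fip namespace can safely be deleted.
        self._gw_routers[ext_net_id].discard(router_id)
        delete = not self._gw_routers[ext_net_id]
        if delete:
            del self._gw_routers[ext_net_id]
        return delete
```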
15:47:30 <Swami> Can you shed some light on the high-priority bugs that are out there?
15:47:46 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1411883
15:48:02 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1374473 https://bugs.launchpad.net/neutron/+bug/1413630
15:48:24 <Swami> carl_baldwin: I need to drop off.
15:48:34 <carl_baldwin> Swami: What do you mean by “shed some light”
15:48:35 <carl_baldwin> ?
15:48:41 <Swami> Rajeev: can continue the discussion if he has any. Or I will ping you later if you have questions.
15:48:59 <carl_baldwin> Swami: Okay.  ttyl
15:49:05 <Swami> carl_baldwin: Sorry, I meant can you take a look at those patches.  It has been waiting for a while.
15:49:12 <Swami> Thanks
15:49:13 <Swami> bye
15:49:30 <carl_baldwin> Swami: Yes, I expect to have some more review time freed up.  bye.
15:50:07 <carl_baldwin> Rajeev: anything more?
15:50:28 <Rajeev> carl_baldwin: sure
15:51:11 <Rajeev> in one of the test failures we see that the agent is returning floating ip status as not active
15:51:23 <Rajeev> because the fip namespace is still being set up
15:51:52 <Rajeev> the tests in the dvr job don't check the status of the fips before using them
15:52:08 <Rajeev> so attempts to ping and times out
15:52:46 <carl_baldwin> Rajeev: So, the agent does not allow the fip namespace to get fully constructed before reporting state on the floating ip?
15:53:05 <carl_baldwin> Rajeev: Does it eventually report active status for the fip?
15:53:29 <Rajeev> carl_baldwin: correct, because there is no waiting in the agent
15:53:59 <Rajeev> carl_baldwin: for eventually I believe not
15:54:26 <Rajeev> but have to look into the code
15:54:58 <carl_baldwin> Does the test attempt to wait for active status?  Or, does it ignore status and try to ping anyway?
15:55:16 <Rajeev> carl_baldwin: the test doesn't check status, just pings
15:55:42 <Rajeev> carl_baldwin: and fails
15:56:13 <carl_baldwin> Rajeev: Maybe the test could be enhanced to first wait for active status (fail if it is *never* achieved)
15:57:13 <Rajeev> carl_baldwin: yes that would be a better approach.
15:57:27 <carl_baldwin> Rajeev: hence my question about whether it eventually reports active.
15:58:17 <carl_baldwin> We only have about a minute left.
15:58:23 <Rajeev> carl_baldwin: since there is no wait/retry, only
15:58:50 <Rajeev> a new update will cause that to happen, so it will make it unreliable.
15:59:14 <carl_baldwin> Rajeev: I’m not sure I follow.
15:59:46 <carl_baldwin> Rajeev: do you want to continue in the openstack-neutron room?
15:59:50 <Rajeev> carl_baldwin: can take offline. Otherwise we are continuing to look into ways to stabilize the job and resume HA work
16:00:12 <carl_baldwin> Rajeev:  great.  ping me later.
16:00:16 <Rajeev> carl_baldwin: sure. will sign in
16:00:17 <carl_baldwin> Thanks all!
16:00:23 <carl_baldwin> #endmeeting