15:01:16 <carl_baldwin> #startmeeting neutron_l3
15:01:17 <openstack> Meeting started Thu Aug 14 15:01:16 2014 UTC and is due to finish in 60 minutes.  The chair is carl_baldwin. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:18 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:21 <openstack> The meeting name has been set to 'neutron_l3'
15:01:23 <carl_baldwin> #topic Announcements
15:01:34 <carl_baldwin> #link https://wiki.openstack.org/wiki/Meetings/Neutron-L3-Subteam
15:02:14 <carl_baldwin> Juno-3 is September 4th. FPF is August 21st, one week from today.
15:02:57 <carl_baldwin> #link https://wiki.openstack.org/wiki/Juno_Release_Schedule
15:03:35 <carl_baldwin> #topic neutron-ovs-dvr
15:03:45 <carl_baldwin> Swami: Do you have a report?
15:03:54 <Swami> carl_baldwin: yes
15:04:06 <Swami> We are progressing with the bug fix
15:04:25 <Swami> There were a couple of bugs that were added to the l3-dvr-backlog yesterday
15:05:07 <Swami> The migration patch for DVR is almost done and we posted it yesterday
15:05:45 <armax> Swami: I am getting confused by all the bug reports about where and when the namespaces get created
15:05:51 <Swami> While we were testing the migration patch we were able to reproduce the DB "lock wait" issue that was reported by Armando.
15:05:51 <armax> but that’s probably just me
15:06:12 <Swami> armax: Yes I can explain.
15:06:19 <armax> Swami: we can take this offline
15:06:36 <mrsmith> armax: I agree - there are too many scheduling bugs being reported
15:06:42 <armax> but it might make sense to have one umbrella bug
15:06:44 <mrsmith> I want to take a look at some of the snat ones
15:07:08 <Swami> armax: For SNAT, the scheduler is looking for a "gw_exists" payload.
15:07:08 <armax> that lists all the expected conditions and file patches that address the issues partially
15:07:09 * pcm_ sorry...I'm late
15:07:21 <Swami> But it has to come from three different scenarios.
15:07:23 <armax> until we’re happy that all the conditions are met
15:07:45 <carl_baldwin> armax: I agree it makes sense to pull them together.
15:07:47 <Swami> When an interface is added, notify_router_updated is called, but only the subnet is passed as the payload.
15:07:49 <armax> I have seen these bug reports being filed over time and it’s confusing as to whether some are regressions, new bugs and whatnot
15:08:15 <Swami> When we create a router with a gateway, we don't send any notify_router_updated at all.
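The three notification scenarios Swami describes could be sketched as follows. This is illustrative Python only: the `notify_router_updated` helper and the payload keys here are stand-ins for the discussion, not the actual Neutron RPC code.

```python
# Record notifications so the gaps Swami describes are visible.
notifications = []

def notify_router_updated(router_id, operation, payload):
    """Stand-in for the RPC notification the DVR SNAT scheduler listens to."""
    notifications.append((router_id, operation, payload))

# 1. The SNAT scheduler looks for a 'gw_exists' key in the payload:
notify_router_updated("r1", "update_gateway", {"gw_exists": True})

# 2. When an interface is added, only the subnet is passed, so the
#    scheduler never sees 'gw_exists':
notify_router_updated("r1", "add_router_interface", {"subnet_id": "s1"})

# 3. When a router is created with a gateway, no notification is sent
#    at all -- the third gap under discussion.
```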
15:09:02 <carl_baldwin> Does it make sense for one person, maybe mrsmith, to work on them as a single project with one patch?
15:09:05 <Swami> I will go over the SNAT namespace issues with respect to the scheduler today and will consult with you armax and carl.
15:09:41 <mrsmith> Swami: I will work with you as well
15:09:41 <Swami> carl_baldwin: agreed
15:09:48 <armax> Swami: thanks, my update is that we’re getting close to getting Tempest to be green across the board with DVR
15:09:54 <Swami> I will work with mrsmith to resolve the scheduler issues.
15:10:24 <armax> there are still a couple issues to iron out, like the DB lock timeout as swami mentioned
15:10:38 <armax> but on a good day, we show DVR failing only on the firewally tests
15:10:42 <armax> which is expected
15:10:43 <mrsmith> nice
15:10:56 <mrsmith> "firewally"
15:11:07 <armax> we have persistent issues that are going to be addressed by this patch
15:11:09 <armax> #link
15:11:10 <armax> https://review.openstack.org/#/c/113420/
15:11:14 <armax> #link: https://review.openstack.org/#/c/113420/
15:11:24 <armax> I welcome tempest folks roaming in this room
15:11:32 <armax> to nudge it in, if they are happy with it
15:11:40 <carl_baldwin> The tempest test situation has improved a lot.  That is good.  Also, the backlog has actually gone down since last week, with some patches proposed to close out others.
15:11:56 <armax> this will unblock the two persistent failures…I managed to run the tests successfully locally with that tempest patch
15:12:17 <armax> so I am really looking forward to enabling non-voting DVR on every change…
15:12:30 <armax> that most likely will happen post j3, but we’re getting pretty close
15:12:31 <mrsmith> +1
15:12:34 <Rajeev> very cool. I will look at a couple of reviews and work on the enable_snat issue
15:12:36 <carl_baldwin> armax: +1
15:12:46 <armax> nothing else from me on DVR…keep reviewing and keep up the good work!
15:12:56 <carl_baldwin> armax: Can we propose a patch to infra to enable it?
15:13:10 <armax> carl_baldwin: yes we can, but it’s still early imo
15:13:27 <carl_baldwin> #action carl_baldwin will look for someone to nudge the tempest patch in.
15:13:32 <armax> until we get rid of all the persistent failures there’s no point
15:13:38 <armax> as the run will steal precious resources
15:13:45 <armax> from more important jobs
15:13:57 <armax> I’d say the cut-off is as soon as we get a full pass
15:14:09 <armax> with the odd random failure
15:14:29 <armax> thanks folks, mic back to you
15:14:34 <carl_baldwin> armax: That sounds good.  Thanks.
15:14:39 <carl_baldwin> Anything else on DVR?
15:14:58 <matrohon> I also heard that some of you are interested in enabling multinode in the gate
15:15:09 <Swami> I think we covered most of the topics
15:15:11 <armax> matrohon: yes
15:15:23 <Rajeev> matrohon: absolutely
15:15:27 <armax> matrohon: link: https://review.openstack.org/#/c/106043/
15:15:30 <matrohon> armax: fine, have you worked on it already?
15:15:37 <armax> this is the patch that is addressing this
15:15:42 <armax> as soon as it lands
15:15:49 <armax> we’ll whip something together to enable DVR
15:15:52 <armax> and see what happens ;)
15:16:08 <matrohon> thanks!! sounds great
15:16:30 <carl_baldwin> armax: Thanks for the link.
15:16:31 <armax> matrohon: what’s your actual name if you don’t mind me asking?
15:16:58 <matrohon> armax : mathieu rohon :)
15:17:01 <armax> gotcha
15:17:08 <armax> I knew that was somewhat familiar
15:17:26 <matrohon> we used to work on this item in the past
15:17:42 <matrohon> but didn't find much time to continue
15:17:56 <carl_baldwin> Anything else or time to move on?
15:18:22 <matrohon> armax: this summarizes our backlog:
15:18:23 <Swami> please move on
15:18:24 <matrohon> https://www.mail-archive.com/openstack-infra@lists.openstack.org/msg01132.html
15:18:50 <carl_baldwin> matrohon: Thanks, we’re looking forward to the multi-node capability.
15:19:02 <carl_baldwin> #topic l3-high-availability
15:19:16 <carl_baldwin> safchain, amuller_: ping
15:19:22 <armax> matrohon: thanks
15:19:28 <amuller_> Sylvain is on PTO, I've been pushing the agent patches
15:19:37 <amuller_> Sylvain left the 2 server-side patches in a good state before leaving
15:19:53 <amuller_> I think that the code is at a state where it needs some attention from cores
15:20:13 <carl_baldwin> amuller_: I had a look through the list of patches recently.  There are a number of them and it is difficult to know where to start.
15:20:50 <carl_baldwin> However, I took a stab at organizing them and I think I’ve nearly got my head wrapped around it.
15:21:19 <carl_baldwin> I will share what I have on the L3 team page today.
15:21:43 <carl_baldwin> Hopefully, that will make reviewing a little less intimidating.  ;)
15:21:55 <amuller_> There are 2 server side patches: https://review.openstack.org/#/c/64553/, then https://review.openstack.org/#/c/66347/
15:22:04 <amuller_> Armando did a few iterations on the first patch in the past
15:22:25 <amuller_> Then there's a chain of 4 patches in the agent side. There's no dependency between the agent and server side patches.
15:22:37 <amuller_> The 4 agent patches start from: https://review.openstack.org/#/c/112140/
15:22:38 <armax> amuller_: I’ll have another pass soon
15:22:53 <armax> amuller_: but things are looking up!
15:23:01 <carl_baldwin> amuller_: Does that chain include your added tests?
15:23:04 <amuller_> I added a functional test for the l3 agent which is working for me locally but failing at the gate
15:23:05 <amuller_> yes Carl
15:23:24 <amuller_> Maru should be helping out with the gate failure Soon (TM)
15:23:55 <amuller_> That's the only known issue at this point
15:24:17 <amuller_> Also I pushed CLI patches
15:24:18 <amuller_> for DVR and VRRP
15:24:20 <amuller_> today
15:24:32 <amuller_> and a devstack dependencies patch
15:24:38 <amuller_> That's it for l3 ha
15:24:51 <carl_baldwin> There is also a devstack patch and maybe one or two more.  Hence, my desire to wrap my head around how the patches are organized and share that knowledge.
15:24:59 <amuller_> right
15:25:12 <amuller_> That'd be helpful Carl
15:25:24 <amuller_> I should be done with a blog post over the weekend, about the feature...
15:25:29 <amuller_> How VRRP works, keepalived, how we use it
15:25:32 <carl_baldwin> I think we may be up to 10 patches if I’m not mistaken.  So, a map will be very useful.
15:25:41 <amuller_> it's aimed at reviewers, operators
15:25:43 <carl_baldwin> #action carl_baldwin will publish a map for reviewers today.
15:26:03 <carl_baldwin> ^ I will post a link to the ML.
15:26:12 <amuller_> Thank you :)
15:26:28 <carl_baldwin> amuller_: if I recall, I saw a TODO to integrate with DVR in one of the patches but I can’t find it at the moment.
15:26:37 <amuller_> it's the first  server side patch
15:26:54 <amuller_> it's working at the model level, everything is persisted correctly last I checked
15:27:05 <amuller_> but we haven't tested it out further than that
15:27:22 <amuller_> (As for how L3 HA interacts with DVR)
15:27:38 <carl_baldwin> amuller_: Is it the right time to start getting testers willing to test the two features together?
15:27:45 <amuller_> I think it is
15:28:12 <carl_baldwin> Has the DVR team looked at this?
15:28:52 <mrsmith> not yet - sounds like we need to
15:29:02 <mrsmith> I've looked at some of the patches
15:29:43 <carl_baldwin> mrsmith: Great, I’d like to see your feedback on them.
15:30:36 <carl_baldwin> I’ll see what I can do about testing DVR + L3 HA.
15:30:38 <amuller_> I think the work will mostly be about scheduling, so that when a router is created with both DVR and HA turned on, it needs to go as HA on the SNAT nodes, and as non-HA on the computes
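A rough sketch of the scheduling rule amuller_ describes, for a router created with both DVR and HA turned on. The helper and mode names are hypothetical, not the actual scheduler code:

```python
# Hypothetical placement rule: a DVR+HA router is scheduled as HA on
# the SNAT/network nodes and as a plain distributed (local) instance
# on the compute nodes.

def placement_mode(router, node_role):
    if router.get("distributed") and router.get("ha"):
        return "ha-snat" if node_role == "network" else "local"
    if router.get("ha"):
        return "ha"
    if router.get("distributed"):
        return "distributed"
    return "legacy"

dvr_ha_router = {"distributed": True, "ha": True}
```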
15:33:12 <carl_baldwin_> Anything else on l3 ha?
15:33:19 <Sudhakar> hi carl
15:33:26 <Sudhakar> i have one quick question..
15:33:27 <amuller_> there's a bit of a mess with the CLI patches but it can be worked out over Gerrit
15:33:40 <amuller_> Swami: ^
15:33:56 <Sudhakar> what are the implications of this patch https://review.openstack.org/#/c/110893/ for L3 HA?
15:34:24 <Sudhakar> hi assaf..
15:34:34 <amuller_> heya Sudhakar
15:34:40 <Sudhakar> i guess you also have reviewed that patch..
15:34:46 <amuller_> Kevin's patch is implementing what many deployments are doing out of band
15:34:57 <Sudhakar> true...
15:35:05 <amuller_> It suffers from long failover times which is what L3 HA aims to solve
15:35:16 <amuller_> moving 10k routers from one node to another can take dozens of minutes
15:35:30 <Sudhakar> or even more ;)
15:35:51 <amuller_> The L3 HA approach should be constant time, not linear with the amount of routers
15:36:05 <amuller_> As for the technical implications of Kevin's patch
15:36:32 <armax> kevinbenton: you there?
15:36:38 <kevinbenton> yes
15:36:39 <amuller_> I'll have to look into reschedule_router with the L3 HA scheduler changes. I'd expect it to see that it's already scheduled and that's it
15:36:49 <amuller_> so it won't actually do anything
15:36:51 <armax> we’re talking about you
15:37:03 <armax> or your patch, more precisely
15:37:04 <Sudhakar> hi kevin
15:37:25 <kevinbenton> yes, i’m not sure how HA looks from a scheduling perspective
15:37:36 <kevinbenton> is one router_id bound to many agents?
15:37:41 <amuller_> yes
15:38:01 <amuller_> L3 HA scheduler changes: https://review.openstack.org/#/c/66347/
15:39:30 <kevinbenton> i’ll have to look at this
15:39:33 <amuller_> unbinding the router might actually be an issue
15:39:48 <amuller_> reschedule_router should perhaps only be called for non-HA routers
15:39:54 <kevinbenton> yeah
15:40:07 <Sudhakar> non-HA and non-distributed as well..
15:40:27 <carl_baldwin> #action carl_baldwin to look in to organizing DVR + L3 HA testing.
15:40:35 <armax> this filtering can only be done after HA merges
15:40:50 <amuller_> Sudhakar: But if a distributed router was scheduled to an SNAT node you'd want to move it if the node is dead
15:41:12 <amuller_> but if the agent that is down is on a compute node, then nothing should be done
15:41:17 <carl_baldwin> Sudhakar: amuller_: scheduling for a distributed router is really only about the snat component of the router.
15:41:19 <Sudhakar> right..agreed
15:41:29 <amuller_> ok
15:42:02 <carl_baldwin> I guess there is a different component to scheduling for compute nodes but it is orthogonal.
15:43:21 <carl_baldwin> Anything else on l3 ha?
15:43:25 <amuller_> Not from me
15:43:33 <Sudhakar> nope..
15:43:44 <carl_baldwin> #topic Reschedule routers from downed agents
15:44:01 <carl_baldwin> We’re kind of already on this topic.  Anything more to discuss here?
15:44:44 <carl_baldwin> kevinbenton: Sudhakar:  Does either of you have anything?
15:45:09 <kevinbenton> i just had a question about terminology
15:45:29 <kevinbenton> armax had some concerns about mentioning the L3 agent being dead
15:45:41 <armax> kevinbenton: only about the wording
15:45:43 <kevinbenton> since the namespace may still be running or it may be disconnected
15:45:57 <Sudhakar> also as discussed above.. we need to handle rescheduling the router considering L3 HA and DVR
15:46:11 <Sudhakar> the current reschedule_router doesn't have any checks and tries to unbind..
15:46:25 <kevinbenton> Sudhakar: i think we can fix this patch after the DVR code merges
15:46:45 <kevinbenton> Sudhakar: it should be a single check for a flag, right?
15:46:56 <amuller_> pretty much
15:46:58 <Sudhakar> kevinbenton: yes
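The "single check for a flag" being discussed might look roughly like this. This is a hypothetical guard for illustration, not the actual patch under review:

```python
# Hypothetical guard for reschedule_router: skip HA routers (VRRP
# already handles failover across the agents the router is bound to)
# and distributed routers (compute-node bindings should not be moved;
# only the SNAT component is centrally scheduled).

def should_reschedule(router):
    if router.get("ha"):
        return False
    if router.get("distributed"):
        return False
    return True
```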
15:47:28 <kevinbenton> i can discuss the wording with armax in #openstack-neutron or on the patch. that’s all i have for now
15:47:45 <armax> kevinbenton: thanks
15:48:15 <Sudhakar> carl_baldwin: what about the concerns on moving the routers around at scale?
15:48:40 <amuller> that's why you have L3 HA :)
15:49:00 <amuller> I think we're facing a documentation challenge though
15:49:06 <carl_baldwin> The concerns are still there.
15:49:20 <amuller> there's 3 different features surrounding the same topics coming in, in the same release
15:49:26 <carl_baldwin> amuller mentioned it can take dozens of minutes to move many routers.  I’ve seen it take much longer.
15:49:30 <Sudhakar> :)
15:50:42 <carl_baldwin> what I’ve seen is that a momentary loss of connectivity or a spike in load on a network node can trigger a lot of disruption.
15:51:08 <carl_baldwin> So I’m concerned about turning this on by default.
15:51:31 <carl_baldwin> We’ve got one more topic so I think we’ll move on.
15:51:34 <kevinbenton> carl_baldwin: my patch? this feature is off by default
15:51:59 <carl_baldwin> kevinbenton: ok, I haven’t stopped by to look in a little while.
15:52:08 <Sudhakar> what about l3 HA .. is it also OFF by default?
15:52:24 <amuller> like DVR, the global conf is off by default
15:52:32 <amuller> and the admin can create DVR or HA routers explicitly
15:52:35 <carl_baldwin> Sudhakar: Yes.  I believe it is.
15:52:43 <amuller> if the conf is turned on, all tenant routers will be HA
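In config terms, the default behaviour described above would look roughly like this (option names as used in the Juno-era DVR and L3 HA patches; treat this as a sketch, not authoritative documentation):

```ini
# neutron.conf (sketch): both features default to off.  Flipping l3_ha
# to True makes all newly created tenant routers HA;
# router_distributed does the same for DVR.
[DEFAULT]
router_distributed = False
l3_ha = False
```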
15:52:49 <Sudhakar> ok..thanks..
15:53:23 <carl_baldwin> I think there is some potential to turning them on by default down the road.
15:53:31 <amuller> sure
15:54:09 <Sudhakar> if L3 HA is ON by default, rescheduling might not be required in the first place..
15:54:24 <amuller> We can consider that for the next release
15:54:32 <carl_baldwin> #topic bgp-dynamic-routing
15:54:44 <carl_baldwin> I’d like to get a quick update on this.  devvesa ping
15:54:51 <devvesa> hi
15:55:09 <devvesa> i've pushed a new patch today https://review.openstack.org/#/c/111324/
15:55:29 <devvesa> amuller asked me to split the previous one in several patches and i'm doing so... it makes sense
15:56:02 <devvesa> I'll create new patches with this one as a dependency
15:56:37 <devvesa> I have a question about this: if I have a bunch of dependent patches, when do they merge upstream? Not until all of them have been approved?
15:57:10 <carl_baldwin> devvesa: any patch that is approved and itself is not dependent on another patch will merge.
15:58:23 <carl_baldwin> devvesa: Dependent patches have their own challenges.  Feel free to ping me if you have any questions or problems.
15:58:23 <devvesa> uhm... this one has trivial functionality, just CRUD of routing peers. Does it make sense as a single patch then?
15:58:46 <devvesa> by itself, it is useless
15:59:13 <amuller> I think that patches should be small and self-contained. The community's tendency toward huge monolithic patches should be moved away from
15:59:47 <amuller> if you can contain functionality in a patch, please do so
16:00:05 <amuller> (i.e. split by functionality as you have done, and not by files or anything like that)
16:00:18 <devvesa> ok then
16:00:42 <carl_baldwin> I don’t think a patch needs to fully implement a feature.  But, a patch should be self-contained and make some meaningful and complete change to the code base.
16:01:12 <carl_baldwin> … and it shouldn’t break any existing functionality or interfere.
16:01:29 <carl_baldwin> I think we’re out of time.
16:01:33 <devvesa> then I think I've done it well
16:01:33 <carl_baldwin> Thanks everyone.
16:01:37 <devvesa> thanks carl
16:01:37 <yamamoto> bye
16:01:41 <devvesa> bye
16:01:48 <carl_baldwin> #endmeeting