15:01:16 <carl_baldwin> #startmeeting neutron_l3
15:01:17 <openstack> Meeting started Thu Aug 14 15:01:16 2014 UTC and is due to finish in 60 minutes. The chair is carl_baldwin. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:18 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:21 <openstack> The meeting name has been set to 'neutron_l3'
15:01:23 <carl_baldwin> #topic Announcements
15:01:34 <carl_baldwin> #link https://wiki.openstack.org/wiki/Meetings/Neutron-L3-Subteam
15:02:14 <carl_baldwin> Juno-3 is September 4th. FPF is the 21st. That is one week from today.
15:02:57 <carl_baldwin> #link https://wiki.openstack.org/wiki/Juno_Release_Schedule
15:03:35 <carl_baldwin> #topic neutron-ovs-dvr
15:03:45 <carl_baldwin> Swami: Do you have a report?
15:03:54 <Swami> carl_baldwin: yes
15:04:06 <Swami> We are progressing with the bug fix
15:04:25 <Swami> There were couple of bugs that was added to the l3-dvr-backlog yesterday
15:05:07 <Swami> The migration patch for the DVR is almost done and we have post it yesterday
15:05:45 <armax> Swami: I am getting confused by all the bugs reports about where and when the namespace get created
15:05:51 <Swami> While we were testing the migration patch we were able to reproduce the "lock wait" issue that was reported by Armando on the DB.
15:05:51 <armax> but that’s probably just me
15:06:12 <Swami> armax: Yes I can explain.
15:06:19 <armax> Swami: we can take this offline
15:06:36 <mrsmith> armax: I agree - there are too many scheduling bugs being reported
15:06:42 <armax> but it might make sense to have one umbrella bug
15:06:44 <mrsmith> I want to take a look at some of the snat ones
15:07:08 <Swami> armax: For snat scheduler is looking for a payload "gw_exists".
15:07:08 <armax> that lists all the expected conditions and file patches that address the issues partially
15:07:09 * pcm_ sorry...I'm late
15:07:21 <Swami> But it has to come from three different scenarios.
15:07:23 <armax> until we’re happy that all the conditions are met
15:07:45 <carl_baldwin> armax: I agree it makes sense to cull them together.
15:07:47 <Swami> When an interface is added we are called notify_router_updated but only "subnet is passed as payload".
15:07:49 <armax> I have seen these bug reports being filed over time and it’s confusing as to whether some are regressions, new bugs and whatnot
15:08:15 <Swami> When we create a router with gateway we don't send any "notify_router_updated".
15:09:02 <carl_baldwin> Does it make sense for one person, maybe mrsmith, to work on them as a single project with one patch?
15:09:05 <Swami> I will go over the SNAT namespace issues with respect to the scheduler today and will consult with you armax and carl.
15:09:41 <mrsmith> Swami: I will work with you as well
15:09:41 <Swami> carl_baldwin: agreed
15:09:48 <armax> Swami: thanks, my update is that we’re getting close to getting Tempest to be green across the board with DVR
15:09:54 <Swami> I will work with mrsmith to resolve the scheduler issues.
15:10:24 <armax> there are still a couple issues to iron out, like the DB lock timeout as swami mentioned
15:10:38 <armax> but on a good day, we show DVR failing only on the firewally tests
15:10:42 <armax> which is expected
15:10:43 <mrsmith> nice
15:10:56 <mrsmith> "firewally"
15:11:07 <armax> we have persistent issues that are going to be addressed by this patch
15:11:09 <armax> #link
15:11:10 <armax> https://review.openstack.org/#/c/113420/
15:11:14 <armax> #link: https://review.openstack.org/#/c/113420/
15:11:24 <armax> I welcome tempest folks roaming in this room
15:11:32 <armax> to nudge it in, if they are happy with it
15:11:40 <carl_baldwin> The tempest test situation has improved a lot. That is good. Also the backlog has actually gone down since last week with some patches propsosed to close out others.
15:11:56 <armax> this will unblock the two persistent faliures…I managed to run the tests successfully locally with that tempest patch
15:12:17 <armax> so I am really looking forward to enabling non-voting DVR on every change…
15:12:30 <armax> that most likely will happen post j3, but we’re getting pretty close
15:12:31 <mrsmith> +1
15:12:34 <Rajeev> very cool. I will look at couple of reviews and work on enable_snat issue
15:12:36 <carl_baldwin> armax: +1
15:12:46 <armax> nothing else from me on DVR…keep reviewing and keep up the good work!
15:12:56 <carl_baldwin> armax: Can we propose a patch to infra to enable it?
15:13:10 <armax> carl_baldwin: yes we can, but it’s still early imo
15:13:27 <carl_baldwin> #action carl_baldwin will look for someone to nudge the tempest patch in.
15:13:32 <armax> until we get rid of all the persistent failures there’s no point
15:13:38 <armax> as the run will steal precious resources
15:13:45 <armax> to more important job
15:13:57 <armax> I’d say the cut-off is as soon as we get a full pass
15:14:09 <armax> with the odd random failure
15:14:29 <armax> thanks folks, mic back to you
15:14:34 <carl_baldwin> armax: That sounds good. Thanks.
15:14:39 <carl_baldwin> Anything else on DVR?
15:14:58 <matrohon> I also that some of you are interested in enabling multinode in the gate
15:15:09 <Swami> I think we covered most of the topics
15:15:11 <armax> matrohon: yes
15:15:23 <Rajeev> matrohon: absolutely
15:15:27 <armax> matrohon: link: https://review.openstack.org/#/c/106043/
15:15:30 <matrohon> armax : fine do you have worked on it already?
15:15:37 <armax> this is the patch that is addressing this
15:15:42 <armax> as soon as it lands
15:15:49 <armax> we’ll whip something together to enable DVR
15:15:52 <armax> and see what happens ;)
15:16:08 <matrohon> thanks!! sounds great
15:16:30 <carl_baldwin> armax: Thanks for the link.
15:16:31 <armax> matrohon: what’s your actual name if you don’t mind me asking?
15:16:58 <matrohon> armax : mathieu rohon :)
15:17:01 <armax> gotcha
15:17:08 <armax> I knew that was somewhat familiar
15:17:26 <matrohon> we use to work on this item by the past
15:17:42 <matrohon> but didn't found much time to continue
15:17:56 <carl_baldwin> Anything else or time to move on?
15:18:22 <matrohon> armax : this summarize our backlog :
15:18:23 <Swami> please move on
15:18:24 <matrohon> https://www.mail-archive.com/openstack-infra@lists.openstack.org/msg01132.html
15:18:50 <carl_baldwin> matrohon: Thanks, we’re looking forward to the multi-node capability.
15:19:02 <carl_baldwin> #topic l3-high-availability
15:19:16 <carl_baldwin> safchain, amuller_: ping
15:19:22 <armax> matrohon: thanks
15:19:28 <amuller_> Sylvain is on PTO, I've been pushing the agent patches
15:19:37 <amuller_> Sylvain left the 2 server-side patches in a good state before leaving
15:19:53 <amuller_> I think that the code is at a state where it needs some attention from core
15:19:57 <amuller_> s
15:20:13 <carl_baldwin> amuller_: I had a look through the list of patches recentlly. There are a number of them and it is difficult to know where to start.
15:20:50 <carl_baldwin> However, I took a stab at organizing them and I think I’ve nearly got my head wrapped around it.
15:21:19 <carl_baldwin> I will share what I have on the L3 team page today.
15:21:43 <carl_baldwin> Hopefully, that will make reviewing a little less intimidating. ;)
15:21:55 <amuller_> There are 2 server side patches: https://review.openstack.org/#/c/64553/, then https://review.openstack.org/#/c/66347/
15:22:04 <amuller_> Armando did a few iterations on the first patch in the past
15:22:25 <amuller_> Then there's a chain of 4 patches in the agent side. There's no dependency between the agent and server side patches.
15:22:37 <amuller_> The 4 agent patches start from: https://review.openstack.org/#/c/112140/
15:22:38 <armax> amuller_: I’ll have another pass soon
15:22:53 <armax> amuller_: but things are looking up!
15:23:01 <carl_baldwin> amuller_: Does that chain include your added tests?
15:23:04 <amuller_> I added a functional test for the l3 agent which is working for me locally but failing at the gate
15:23:05 <amuller_> yes Carl
15:23:24 <amuller_> Maru should be helping out with the gate failure Soon (TM)
15:23:55 <amuller_> That's the only known issue at this point
15:24:17 <amuller_> Also I pushed CLI patches
15:24:18 <amuller_> for DVR and VRRP
15:24:20 <amuller_> today
15:24:32 <amuller_> and a devstack dependencies patch
15:24:38 <amuller_> That's it for l3 ha
15:24:51 <carl_baldwin> There is also a devstack patch and maybe one or two more. Hence, my desire to wrap my head around how the patches are organized and share that knowledge.
15:24:59 <amuller_> right
15:25:12 <amuller_> That'd be helpful Carl
15:25:24 <amuller_> I should be done with a blog post over the weekend, about the feature...
15:25:29 <amuller_> How VRRP works, keepalived, how we use it
15:25:32 <carl_baldwin> I think we may be up to 10 patches if I’m not mistaken. So, a map will be very useful.
15:25:41 <amuller_> it's aimed at reviewers, operators
15:25:43 <carl_baldwin> #action carl_baldwin will publish a map for reviewers today.
15:26:03 <carl_baldwin> ^ I will post a link to the ML.
15:26:12 <amuller_> Thank you :)
15:26:28 <carl_baldwin> amuller_: if I recall, I saw a TODO to integrate with DVR in one of the patches but I can’t find it at the moment.
15:26:37 <amuller_> it's the first server side patch
15:26:54 <amuller_> it's working at the model level, everything is persisted correctly last I checked
15:27:05 <amuller_> but we haven't tested it out further than that
15:27:22 <amuller_> (As for how L3 HA interacts with DVR)
15:27:38 <carl_baldwin> amuller_: Is it the right time to start getting testers willing to test the two features together?
15:27:45 <amuller_> I think it is
15:28:12 <carl_baldwin> Has the DVR team looked at this?
15:28:52 <mrsmith> not yet - sounds like we need to
15:29:02 <mrsmith> I've looked at some of the patches
15:29:43 <carl_baldwin> mrsmith: Great, I’d like to see your feedback on them.
15:30:36 <carl_baldwin> I’ll see what I can do about testing DVR + L3 HA.
15:30:38 <amuller_> I think the work will mostly be about scheduling, so that when a router is created with both DVR and HA turned on, it needs to go as HA on the SNAT nodes, and as non-HA on the computes
15:33:12 <carl_baldwin_> Anything else on l3 ha?
15:33:19 <Sudhakar> hi carl
15:33:26 <Sudhakar> i have one quick question..
15:33:27 <amuller_> there's a bit of a mess with the CLI patches but it can be worked out over Gerrit
15:33:40 <amuller_> Swami: ^
15:33:56 <Sudhakar> what are the implications this patch ..https://review.openstack.org/#/c/110893/ for L3 HA..
15:34:24 <Sudhakar> hi assaf..
15:34:34 <amuller_> heya Sudhakar
15:34:40 <Sudhakar> i guess you also have reviewed that patch..
15:34:46 <amuller_> Kevin's patch is implementing what many deployments are doing out of band
15:34:57 <Sudhakar> true...
15:35:05 <amuller_> It suffers from long failover times which is what L3 HA aims to solve
15:35:16 <amuller_> moving 10k routers from one node to another can take dozens of minutes
15:35:30 <Sudhakar> or even more ;)
15:35:51 <amuller_> The L3 HA approach should be constant time, not linear with the amount of routers
15:36:05 <amuller_> As for the technical implications of Kevin's patch
15:36:32 <armax> kevinbenton: you there?
15:36:38 <kevinbenton> yes
15:36:39 <amuller_> I'll have to look into reschedule_router with the L3 HA scheduler changes. I'd expect it to see that it's already scheduled and that's it
15:36:49 <amuller_> so it won't actually do anything
15:36:51 <armax> we’re talking about you
15:37:03 <armax> or your patch, more precisely
15:37:04 <Sudhakar> hi kevin
15:37:25 <kevinbenton> yes, i’m not sure how HA looks from a scheduling perspective
15:37:36 <kevinbenton> is one router_id bound to many agents?
15:37:41 <amuller_> yes
15:38:01 <amuller_> L3 HA scheduler changes: https://review.openstack.org/#/c/66347/
15:39:30 <kevinbenton> i’ll have to look at this
15:39:33 <amuller_> unbinding the router might actually be an issue
15:39:48 <amuller_> reschedule_router should perhaps only be called for non-HA routers
15:39:54 <kevinbenton> yeah
15:40:07 <Sudhakar> non-HA and non-distributed as well..
15:40:27 <carl_baldwin> #action carl_baldwin to look in to organizing DVR + L3 HA testing.
15:40:35 <armax> this filtering can only be done after HA merges
15:40:50 <amuller_> Sudhakar: But if a distributed router was scheduled to an SNAT node you'd want to move it if the node is dead
15:41:12 <amuller_> but if the agent is down is on a compute node then nothing should be done
15:41:17 <carl_baldwin> Sudhakar: amuller_: scheduling for a distributed router is really only about the snat component of the router.
15:41:19 <Sudhakar> right..agreed
15:41:29 <amuller_> ok
15:42:02 <carl_baldwin> I guess there is a different component to scheduling for compute nodes but it is orthogonal.
15:43:21 <carl_baldwin> Anything else on l3 ha?
15:43:25 <amuller_> Not from me
15:43:33 <Sudhakar> nope..
15:43:44 <carl_baldwin> #topic Reschedule routers from downed agents
15:44:01 <carl_baldwin> We’re kind of already on this topic. Anything more to discuss here?
15:44:44 <carl_baldwin> kevinbenton: Sudhakar: Does either of you have anything?
15:45:09 <kevinbenton> i just had a question about terminology
15:45:29 <kevinbenton> armax had some concerns about mentioning the L3 agent being dead
15:45:41 <armax> kevinbenton: only about the wording
15:45:43 <kevinbenton> since the namespace may still be running or it may be disconnected
15:45:57 <Sudhakar> also as discussed above.. we need to handle rescheduling the router considering L3 HA and DVR
15:46:11 <Sudhakar> current reschedule_router doesnt have any checks and tries to unbind..
15:46:25 <kevinbenton> Sudhakar: i think we can fix this patch after the DVR code merges
15:46:45 <kevinbenton> Sudhakar: it should be a single check for a flag, right?
15:46:56 <amuller_> pretty much
15:46:58 <Sudhakar> kevinbenton: yes
15:47:28 <kevinbenton> i can discuss the wording with armax in #openstack-neutron or on the patch. that’s all i have for now
15:47:45 <armax> kevinbenton: thanks
15:48:15 <Sudhakar> carl_baldwin: what about the concerns on moving the routers around at scale?
15:48:40 <amuller> that's why you have L3 HA :)
15:49:00 <amuller> I think we're facing a documentation challenge though
15:49:06 <carl_baldwin> The concerns are still there.
15:49:20 <amuller> there's 3 different features surrounding the same topics coming in, in the same release
15:49:26 <carl_baldwin> amuller: Mentioned it can take dozens of minutes to move many routers. I’ve seen it take much longer.
15:49:30 <Sudhakar> :)
15:50:42 <carl_baldwin> what I’ve seen is that a momentary loss of connectivity or a spike in load on a network node can trigger a lot of disruption.
15:51:08 <carl_baldwin> So I’m concerned about turning this on by default.
15:51:31 <carl_baldwin> We’ve got one more topic so I think we’ll move on.
15:51:34 <kevinbenton> carl_baldwin: my patch? this feature is off by default
15:51:59 <carl_baldwin> kevinbenton: ok, I haven’t stopped by to look in a little while.
15:52:08 <Sudhakar> what about l3 HA .. is it also OFF by default?
15:52:24 <amuller> like DVR the global conf is off by the default
15:52:32 <amuller> and the admin can create DVR or HA routes explicitly
15:52:35 <carl_baldwin> Sudhakar: Yes. I believe it is.
15:52:43 <amuller> if the conf is turned on, all tenant routers will be HA
15:52:49 <Sudhakar> ok..thanks..
15:53:23 <carl_baldwin> I think there is some potential to turning them on by default down the road.
15:53:31 <amuller> sure
15:54:09 <Sudhakar> if L3 HA is ON by default, rescheduling might not be required in the first place..
15:54:24 <amuller> We can consider that for the next release
15:54:32 <carl_baldwin> #topic bgp-dynamic-routing
15:54:44 <carl_baldwin> I’d like to get a quick update on this. devvesa ping
15:54:51 <devvesa> hi
15:55:09 <devvesa> i've pushed a new patch today https://review.openstack.org/#/c/111324/
15:55:29 <devvesa> amuller asked me to split the previous one in several patches and i'm doing so... it makes sense
15:56:02 <devvesa> I'll create new patches with this one as a dependency
15:56:37 <devvesa> I have a question about this: if i have a bunch of dependent patches, when they do merge into upstream? until all of them has been approved?
15:57:10 <carl_baldwin> devvesa: any patch that is approved as itself is not dependent on another patch will merge.
15:57:54 <carl_baldwin> s/as/and/
15:58:23 <carl_baldwin> devvesa: Dependent patches have their own challenges. Feel free to ping me if you have any questions or problems.
15:58:34 <devvesa> uhm... this one has trivial functionality , just CRUD of routing peers. does is have sense as a single patch then?
15:58:46 <devvesa> by itself, it is useless
15:59:13 <amuller> I think that patches should be small and self contained. The communit's tendancy towards huge monolithic patches should be moved away from
15:59:47 <amuller> if you can contain functionality in a patch, please do so
16:00:05 <amuller> (IE split by functionality as you have done, and not by files or anything like that)
16:00:18 <devvesa> ok then
16:00:42 <carl_baldwin> I don’t think a patch needs to fully implement a feature. But, a patch should be self-contained and make some meaningful and complete change to the code base.
16:01:12 <carl_baldwin> … and it shouldn’t break any existing functionality or interfere.
16:01:29 <carl_baldwin> I think we’re out of time.
16:01:33 <devvesa> then I think I've done it well
16:01:33 <carl_baldwin> Thanks everyone.
16:01:37 <devvesa> thanks carl
16:01:37 <yamamoto> bye
16:01:41 <devvesa> bye
16:01:48 <carl_baldwin> #endmeeting
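
Editor's note: the rescheduling rule discussed under "Reschedule routers from downed agents" (skip HA routers, and for distributed routers only move the SNAT part when a network node dies) can be sketched roughly as below. This is a minimal illustration, not Neutron code: the Router and Agent classes, the hosts_snat field, and should_reschedule are all hypothetical names invented for this note, and the real check in reschedule_router may look quite different.

```python
from dataclasses import dataclass


@dataclass
class Router:
    """Hypothetical, simplified router record (not Neutron's model)."""
    router_id: str
    ha: bool = False
    distributed: bool = False


@dataclass
class Agent:
    """Hypothetical L3 agent record; hosts_snat marks a network (SNAT) node."""
    host: str
    alive: bool
    hosts_snat: bool = True


def should_reschedule(router: Router, agent: Agent) -> bool:
    """Decide whether a router bound to `agent` should be moved elsewhere."""
    if agent.alive:
        return False  # healthy agent: leave the binding alone
    if router.ha:
        return False  # HA routers are expected to fail over on their own
    if router.distributed and not agent.hosts_snat:
        return False  # downed agent on a compute node: nothing to move
    return True  # legacy router, or the SNAT part of a distributed router
```

Under these assumptions, a legacy router on a dead network node is rescheduled, while an HA router or a distributed router whose downed agent sits on a compute node is left untouched.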