15:01:16 #startmeeting neutron_l3
15:01:17 Meeting started Thu Aug 14 15:01:16 2014 UTC and is due to finish in 60 minutes. The chair is carl_baldwin. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:18 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:21 The meeting name has been set to 'neutron_l3'
15:01:23 #topic Announcements
15:01:34 #link https://wiki.openstack.org/wiki/Meetings/Neutron-L3-Subteam
15:02:14 Juno-3 is September 4th. FPF is August 21st; that is one week from today.
15:02:57 #link https://wiki.openstack.org/wiki/Juno_Release_Schedule
15:03:35 #topic neutron-ovs-dvr
15:03:45 Swami: Do you have a report?
15:03:54 carl_baldwin: yes
15:04:06 We are progressing with the bug fixes
15:04:25 There were a couple of bugs that were added to the l3-dvr-backlog yesterday
15:05:07 The migration patch for DVR is almost done and we posted it yesterday
15:05:45 Swami: I am getting confused by all the bug reports about where and when the namespaces get created
15:05:51 While we were testing the migration patch we were able to reproduce the "lock wait" issue on the DB that was reported by Armando.
15:05:51 but that’s probably just me
15:06:12 armax: Yes, I can explain.
15:06:19 Swami: we can take this offline
15:06:36 armax: I agree - there are too many scheduling bugs being reported
15:06:42 but it might make sense to have one umbrella bug
15:06:44 I want to take a look at some of the snat ones
15:07:08 armax: For snat, the scheduler is looking for a "gw_exists" payload.
15:07:08 that lists all the expected conditions, and file patches that address the issues partially
15:07:09 * pcm_ sorry...I'm late
15:07:21 But it has to come from three different scenarios.
15:07:23 until we’re happy that all the conditions are met
15:07:45 armax: I agree it makes sense to pull them together.
15:07:47 When an interface is added, notify_router_updated is called, but only the subnet is passed as payload.
15:07:49 I have seen these bug reports being filed over time and it’s confusing as to whether some are regressions, new bugs, or whatnot
15:08:15 When we create a router with a gateway we don't send any notify_router_updated.
15:09:02 Does it make sense for one person, maybe mrsmith, to work on them as a single project with one patch?
15:09:05 I will go over the SNAT namespace issues with respect to the scheduler today and will consult with you, armax and carl.
15:09:41 Swami: I will work with you as well
15:09:41 carl_baldwin: agreed
15:09:48 Swami: thanks, my update is that we’re getting close to getting Tempest to be green across the board with DVR
15:09:54 I will work with mrsmith to resolve the scheduler issues.
15:10:24 there are still a couple of issues to iron out, like the DB lock timeout as Swami mentioned
15:10:38 but on a good day, we show DVR failing only on the firewally tests
15:10:42 which is expected
15:10:43 nice
15:10:56 "firewally"
15:11:07 we have persistent issues that are going to be addressed by this patch
15:11:09 #link https://review.openstack.org/#/c/113420/
15:11:24 I welcome tempest folks roaming in this room
15:11:32 to nudge it in, if they are happy with it
15:11:40 The tempest test situation has improved a lot. That is good. Also, the backlog has actually gone down since last week with some patches proposed to close out others.
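[Editor's note: a minimal Python sketch of the notification asymmetry Swami describes at 15:07-15:08. The function names, the notifier stub, and the payload keys are illustrative assumptions; only notify_router_updated and "gw_exists" are quoted from the discussion, and this is not the actual Neutron code path.]

```python
class FakeNotifier:
    """Stand-in for the L3 agent RPC notifier (illustrative only)."""
    def notify_router_updated(self, router_id, payload):
        print("router updated: %s, payload: %s" % (router_id, payload))

def add_router_interface(notifier, router_id, subnet_id):
    # Interface add DOES send a notification, but the payload only names
    # the subnet, so a scheduler waiting for a gateway hint cannot act on it.
    notifier.notify_router_updated(router_id, payload={'subnet_id': subnet_id})

def set_router_gateway(notifier, router_id, ext_net_id):
    # Gateway set sends NO router-updated notification at all, so the DVR
    # SNAT scheduler, which keys off a "gw_exists" hint, never runs.
    pass  # missing: notifier.notify_router_updated(router_id, payload={'gw_exists': True})
```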
15:11:56 this will unblock the two persistent failures…I managed to run the tests successfully locally with that tempest patch
15:12:17 so I am really looking forward to enabling non-voting DVR on every change…
15:12:30 that most likely will happen post-J3, but we’re getting pretty close
15:12:31 +1
15:12:34 very cool. I will look at a couple of reviews and work on the enable_snat issue
15:12:36 armax: +1
15:12:46 nothing else from me on DVR…keep reviewing and keep up the good work!
15:12:56 armax: Can we propose a patch to infra to enable it?
15:13:10 carl_baldwin: yes we can, but it’s still early imo
15:13:27 #action carl_baldwin will look for someone to nudge the tempest patch in.
15:13:32 until we get rid of all the persistent failures there’s no point
15:13:38 as the run will steal precious resources
15:13:45 from more important jobs
15:13:57 I’d say the cut-off is as soon as we get a full pass
15:14:09 with the odd random failure
15:14:29 thanks folks, mic back to you
15:14:34 armax: That sounds good. Thanks.
15:14:39 Anything else on DVR?
15:14:58 I also hear that some of you are interested in enabling multinode in the gate
15:15:09 I think we covered most of the topics
15:15:11 matrohon: yes
15:15:23 matrohon: absolutely
15:15:27 matrohon: link: https://review.openstack.org/#/c/106043/
15:15:30 armax: fine, have you worked on it already?
15:15:37 this is the patch that is addressing this
15:15:42 as soon as it lands
15:15:49 we’ll whip something together to enable DVR
15:15:52 and see what happens ;)
15:16:08 thanks!! sounds great
15:16:30 armax: Thanks for the link.
15:16:31 matrohon: what’s your actual name, if you don’t mind me asking?
15:16:58 armax: mathieu rohon :)
15:17:01 gotcha
15:17:08 I knew that was somewhat familiar
15:17:26 we used to work on this item in the past
15:17:42 but didn't find much time to continue
15:17:56 Anything else or time to move on?
15:18:22 armax: this summarizes our backlog:
15:18:23 please move on
15:18:24 https://www.mail-archive.com/openstack-infra@lists.openstack.org/msg01132.html
15:18:50 matrohon: Thanks, we’re looking forward to the multi-node capability.
15:19:02 #topic l3-high-availability
15:19:16 safchain, amuller_: ping
15:19:22 matrohon: thanks
15:19:28 Sylvain is on PTO, I've been pushing the agent patches
15:19:37 Sylvain left the 2 server-side patches in a good state before leaving
15:19:53 I think that the code is at a state where it needs some attention from cores
15:20:13 amuller_: I had a look through the list of patches recently. There are a number of them and it is difficult to know where to start.
15:20:50 However, I took a stab at organizing them and I think I’ve nearly got my head wrapped around it.
15:21:19 I will share what I have on the L3 team page today.
15:21:43 Hopefully, that will make reviewing a little less intimidating. ;)
15:21:55 There are 2 server-side patches: https://review.openstack.org/#/c/64553/, then https://review.openstack.org/#/c/66347/
15:22:04 Armando did a few iterations on the first patch in the past
15:22:25 Then there's a chain of 4 patches on the agent side. There's no dependency between the agent and server-side patches.
15:22:37 The 4 agent patches start from: https://review.openstack.org/#/c/112140/
15:22:38 amuller_: I’ll have another pass soon
15:22:53 amuller_: but things are looking up!
15:23:01 amuller_: Does that chain include your added tests?
15:23:04 I added a functional test for the l3 agent which is working for me locally but failing at the gate
15:23:05 yes Carl
15:23:24 Maru should be helping out with the gate failure Soon (TM)
15:23:55 That's the only known issue at this point
15:24:17 Also I pushed CLI patches
15:24:18 for DVR and VRRP
15:24:20 today
15:24:32 and a devstack dependencies patch
15:24:38 That's it for l3 ha
15:24:51 There is also a devstack patch and maybe one or two more. Hence, my desire to wrap my head around how the patches are organized and share that knowledge.
15:24:59 right
15:25:12 That'd be helpful Carl
15:25:24 I should be done with a blog post over the weekend, about the feature...
15:25:29 How VRRP works, keepalived, how we use it
15:25:32 I think we may be up to 10 patches if I’m not mistaken. So, a map will be very useful.
15:25:41 it's aimed at reviewers, operators
15:25:43 #action carl_baldwin will publish a map for reviewers today.
15:26:03 ^ I will post a link to the ML.
15:26:12 Thank you :)
15:26:28 amuller_: if I recall, I saw a TODO to integrate with DVR in one of the patches but I can’t find it at the moment.
15:26:37 it's the first server-side patch
15:26:54 it's working at the model level, everything is persisted correctly last I checked
15:27:05 but we haven't tested it out further than that
15:27:22 (As for how L3 HA interacts with DVR)
15:27:38 amuller_: Is it the right time to start getting testers willing to test the two features together?
15:27:45 I think it is
15:28:12 Has the DVR team looked at this?
15:28:52 not yet - sounds like we need to
15:29:02 I've looked at some of the patches
15:29:43 mrsmith: Great, I’d like to see your feedback on them.
15:30:36 I’ll see what I can do about testing DVR + L3 HA.
15:30:38 I think the work will mostly be about scheduling, so that when a router is created with both DVR and HA turned on, it needs to go as HA on the SNAT nodes, and as non-HA on the computes
15:33:12 Anything else on l3 ha?
15:33:19 hi carl
15:33:26 i have one quick question..
15:33:27 there's a bit of a mess with the CLI patches but it can be worked out over Gerrit
15:33:40 Swami: ^
15:33:56 what are the implications of this patch https://review.openstack.org/#/c/110893/ for L3 HA?
15:34:24 hi assaf..
15:34:34 heya Sudhakar
15:34:40 i guess you have also reviewed that patch..
15:34:46 Kevin's patch is implementing what many deployments are doing out of band
15:34:57 true...
15:35:05 It suffers from long failover times, which is what L3 HA aims to solve
15:35:16 moving 10k routers from one node to another can take dozens of minutes
15:35:30 or even more ;)
15:35:51 The L3 HA approach should be constant time, not linear with the number of routers
15:36:05 As for the technical implications of Kevin's patch
15:36:32 kevinbenton: you there?
15:36:38 yes
15:36:39 I'll have to look into reschedule_router with the L3 HA scheduler changes. I'd expect it to see that it's already scheduled and that's it
15:36:49 so it won't actually do anything
15:36:51 we’re talking about you
15:37:03 or your patch, more precisely
15:37:04 hi kevin
15:37:25 yes, i’m not sure how HA looks from a scheduling perspective
15:37:36 is one router_id bound to many agents?
15:37:41 yes
15:38:01 L3 HA scheduler changes: https://review.openstack.org/#/c/66347/
15:39:30 i’ll have to look at this
15:39:33 unbinding the router might actually be an issue
15:39:48 reschedule_router should perhaps only be called for non-HA routers
15:39:54 yeah
15:40:07 non-HA and non-distributed as well..
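[Editor's note: the "single check for a flag" agreed on below at 15:46 would look roughly like this Python sketch. reschedule_router is the method named in the chat; the wrapper name and the 'ha'/'distributed' keys are illustrative assumptions, not the actual patch.]

```python
def maybe_reschedule_router(plugin, context, router):
    """Move a router off a dead agent, guarding the new router types."""
    if router.get('ha'):
        # HA routers are bound to several agents and VRRP handles the
        # failover, so unbinding here would only cause churn.
        return
    if router.get('distributed'):
        # For DVR, only the SNAT component is centrally scheduled;
        # bindings on compute nodes should be left alone.
        return
    plugin.reschedule_router(context, router['id'])
```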
15:40:27 #action carl_baldwin to look into organizing DVR + L3 HA testing.
15:40:35 this filtering can only be done after HA merges
15:40:50 Sudhakar: But if a distributed router was scheduled to an SNAT node you'd want to move it if the node is dead
15:41:12 but if the agent that is down is on a compute node then nothing should be done
15:41:17 Sudhakar: amuller_: scheduling for a distributed router is really only about the snat component of the router.
15:41:19 right..agreed
15:41:29 ok
15:42:02 I guess there is a different component to scheduling for compute nodes but it is orthogonal.
15:43:21 Anything else on l3 ha?
15:43:25 Not from me
15:43:33 nope..
15:43:44 #topic Reschedule routers from downed agents
15:44:01 We’re kind of already on this topic. Anything more to discuss here?
15:44:44 kevinbenton: Sudhakar: Does either of you have anything?
15:45:09 i just had a question about terminology
15:45:29 armax had some concerns about mentioning the L3 agent being dead
15:45:41 kevinbenton: only about the wording
15:45:43 since the namespace may still be running or it may be disconnected
15:45:57 also as discussed above.. we need to handle rescheduling the router considering L3 HA and DVR
15:46:11 current reschedule_router doesn't have any checks and tries to unbind..
15:46:25 Sudhakar: i think we can fix this patch after the DVR code merges
15:46:45 Sudhakar: it should be a single check for a flag, right?
15:46:56 pretty much
15:46:58 kevinbenton: yes
15:47:28 i can discuss the wording with armax in #openstack-neutron or on the patch. that’s all i have for now
15:47:45 kevinbenton: thanks
15:48:15 carl_baldwin: what about the concerns on moving the routers around at scale?
15:48:40 that's why you have L3 HA :)
15:49:00 I think we're facing a documentation challenge though
15:49:06 The concerns are still there.
15:49:20 there are 3 different features surrounding the same topics coming in, in the same release
15:49:26 amuller: Mentioned it can take dozens of minutes to move many routers. I’ve seen it take much longer.
15:49:30 :)
15:50:42 what I’ve seen is that a momentary loss of connectivity or a spike in load on a network node can trigger a lot of disruption.
15:51:08 So I’m concerned about turning this on by default.
15:51:31 We’ve got one more topic so I think we’ll move on.
15:51:34 carl_baldwin: my patch? this feature is off by default
15:51:59 kevinbenton: ok, I haven’t stopped by to look in a little while.
15:52:08 what about L3 HA .. is it also off by default?
15:52:24 like DVR, the global conf is off by default
15:52:32 and the admin can create DVR or HA routers explicitly
15:52:35 Sudhakar: Yes. I believe it is.
15:52:43 if the conf is turned on, all tenant routers will be HA
15:52:49 ok..thanks..
15:53:23 I think there is some potential for turning them on by default down the road.
15:53:31 sure
15:54:09 if L3 HA is ON by default, rescheduling might not be required in the first place..
15:54:24 We can consider that for the next release
15:54:32 #topic bgp-dynamic-routing
15:54:44 I’d like to get a quick update on this. devvesa: ping
15:54:51 hi
15:55:09 i've pushed a new patch today https://review.openstack.org/#/c/111324/
15:55:29 amuller asked me to split the previous one into several patches and i'm doing so... it makes sense
15:56:02 I'll create new patches with this one as a dependency
15:56:37 I have a question about this: if i have a bunch of dependent patches, when do they merge upstream? only once all of them have been approved?
15:57:10 devvesa: any patch that is approved and itself is not dependent on another patch will merge.
15:58:23 devvesa: Dependent patches have their own challenges. Feel free to ping me if you have any questions or problems.
15:58:34 uhm... this one has trivial functionality, just CRUD of routing peers. does it make sense as a single patch then?
15:58:46 by itself, it is useless
15:59:13 I think that patches should be small and self-contained. The community's tendency towards huge monolithic patches is something we should move away from
15:59:47 if you can contain functionality in a patch, please do so
16:00:05 (i.e. split by functionality as you have done, and not by files or anything like that)
16:00:18 ok then
16:00:42 I don’t think a patch needs to fully implement a feature. But a patch should be self-contained and make some meaningful and complete change to the code base.
16:01:12 … and it shouldn’t break any existing functionality or interfere.
16:01:29 I think we’re out of time.
16:01:33 then I think I've done it well
16:01:33 Thanks everyone.
16:01:37 thanks carl
16:01:37 bye
16:01:41 bye
16:01:48 #endmeeting
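[Editor's note: for readers following the dependent-patch discussion at 15:56-15:59, a minimal git-review sketch of how such a chain is built and merged; the branch name, file edits, and commit messages are illustrative, not devvesa's actual patches.]

```sh
# One commit per self-contained piece of functionality, all on one branch.
git checkout -b bgp-routing-peers origin/master
# ...implement the routing-peer CRUD, then:
git commit -am "Add CRUD for routing peers"
# ...implement the next piece on top of the first commit:
git commit -am "Wire routing peers into the agent"
git review   # pushes both commits as a linked chain of Gerrit changes

# Gerrit merges the first change as soon as it is approved; the second
# cannot merge before its parent, but the parent never waits for it.
```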