14:00:04 <liuyulong> #startmeeting neutron_l3
14:00:05 <openstack> Meeting started Wed Jul 24 14:00:04 2019 UTC and is due to finish in 60 minutes. The chair is liuyulong. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:06 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:08 <openstack> The meeting name has been set to 'neutron_l3'
14:00:47 <slaweq> hi
14:02:03 <liuyulong> hi
14:02:22 <liuyulong> #topic Announcements
14:02:50 <njohnston> o/
14:06:00 <liuyulong_> Any announcements?
14:06:32 <liuyulong_> OK, let's move on.
14:06:48 <liuyulong_> #topic Bugs
14:07:12 <liuyulong> #topic Bugs
14:07:30 <liuyulong> #link http://lists.openstack.org/pipermail/openstack-discuss/2019-July/007952.html
14:07:36 <liuyulong> Hongbin was our bug deputy last week, thanks.
14:07:55 <liuyulong> IMO, it was a quiet week for L3 (we are in the neutron_l3 meeting) : )
14:08:26 <liuyulong> So today I will re-raise some old bugs. And I've reset some bugs to a higher priority, because they have been open for a really long time.
14:08:45 <liuyulong> (Maybe I should raise the bug priority even higher if they still do not get much activity. LOL)
14:08:50 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1826695
14:09:01 <openstack> liuyulong: Error: Could not gather data from Launchpad for bug #1826695 (https://launchpad.net/bugs/1826695). The error has been logged
14:09:50 <liuyulong> What happened?
14:10:06 <liuyulong> The bug title is "[L3][QoS] cache does not removed when router is down or deleted"
14:10:24 <liuyulong> The fix is here:
14:10:24 <liuyulong> https://review.opendev.org/#/c/656105/
14:10:48 <slaweq> I would say the opposite - if a bug has been there for a long time and nobody really cares about it, we should IMO decrease its priority :)
14:11:15 * njohnston thinks Launchpad is having issues
14:12:11 <ralonsoh> njohnston, I can't load anything
14:12:49 <liuyulong> slaweq, until someday nobody cares about the entire project? LOL
14:13:20 <slaweq> liuyulong: who knows :)
14:13:53 <liuyulong> Next
14:13:55 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1811352
14:14:05 <openstack> liuyulong: Error: Could not gather data from Launchpad for bug #1811352 (https://launchpad.net/bugs/1811352). The error has been logged
14:14:18 <liuyulong> openstack, all right, I know!
14:14:29 <liuyulong> We need this for a Shanghai-related topic:
14:14:29 <liuyulong> https://review.opendev.org/#/c/650062/
14:14:55 <liuyulong> The CLI patch is here ^^
14:17:24 <liuyulong> The progress is a bit slow. All OSC core reviewers have been added to that patch. : (
14:17:50 * tidwellr wanders in late and lurks
14:18:16 <liuyulong> But it's OK, we can tag it locally and install it for the demo.
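For context, the CLI patch above (https://review.opendev.org/#/c/650062/) adds floating IP port forwarding commands to python-openstackclient. A sketch of the intended usage, assuming the syntax proposed in the review; flag names may change before the patch merges, and the IDs and addresses below are placeholders:

```sh
# Forward TCP traffic arriving on the floating IP's port 2222 to
# port 22 of a fixed IP behind the router. Placeholder IDs/addresses.
openstack floating ip port forwarding create \
    --internal-ip-address 10.0.0.5 \
    --port <internal-neutron-port-id> \
    --internal-protocol-port 22 \
    --external-protocol-port 2222 \
    --protocol tcp \
    <floating-ip-id-or-address>

# List the forwardings configured on that floating IP
openstack floating ip port forwarding list <floating-ip-id-or-address>
```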
14:18:37 <liuyulong> Next one: #link https://bugs.launchpad.net/neutron/+bug/1609217
14:18:47 <openstack> liuyulong: Error: Could not gather data from Launchpad for bug #1609217 (https://launchpad.net/bugs/1609217). The error has been logged
14:19:17 <slaweq> liuyulong: do You have any presentation about port forwarding in Shanghai?
14:19:27 <liuyulong> This is a really old one; the title is "DVR: dvr router ns should not exist in scheduled DHCP agent nodes"
14:19:43 <liuyulong> The fix is here, it adds a new config option for cloud deployments: https://review.opendev.org/#/c/364793/
14:20:35 <liuyulong> slaweq, yes, mlavalle submitted a topic.
14:20:49 <slaweq> good to know :)
14:20:54 <slaweq> thx for info
14:21:04 <liuyulong> I will not repeat the reasoning for the fix; if you are interested in this bug, here are the full scenarios I added before:
14:21:08 <liuyulong> https://review.opendev.org/#/c/364793/3//COMMIT_MSG
14:21:50 <liuyulong> It makes large-scale deployments really happy.
14:22:15 <liuyulong> Next
14:22:20 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1813787
14:22:33 <openstack> liuyulong: Error: Could not gather data from Launchpad for bug #1813787 (https://launchpad.net/bugs/1813787). The error has been logged
14:23:19 <liuyulong> The bug title is "[L3] DVR router in compute node was not up but nova port needs its functionality"
14:23:28 <liuyulong> The main fix is here: https://review.opendev.org/#/c/633871/
14:24:06 <liuyulong> We already have some related fixes, but they do not address the root cause. This one is one approach.
14:24:35 <liuyulong> We have run such code locally for a long time. It works well.
14:25:09 <liuyulong> Next #link https://bugs.launchpad.net/neutron/+bug/1825152
14:25:19 <openstack> liuyulong: Error: Could not gather data from Launchpad for bug #1825152 (https://launchpad.net/bugs/1825152). The error has been logged
14:25:54 <liuyulong> The title is "[scale issue] the root rootwrap deamon causes l3 agent router procssing very very slow"
14:26:14 <liuyulong> These two config options really hurt performance: `use_helper_for_ns_read=` and `root_helper_daemon=`.
14:26:37 <liuyulong> The fix https://review.opendev.org/#/c/653378/ just sets it to False by default, since we should set the most appropriate value for the widely used distros.
14:27:22 <slaweq> about this one, I still don't agree that we should change a default value which can possibly break some deployments during upgrade
14:27:38 <liuyulong> Yes, another large-scale issue. And we also have a nice performance improvement locally.
14:28:20 <slaweq> IMO it should be well documented what to do to potentially improve performance here
14:28:31 <slaweq> but IMO changing the default value isn't a good solution
14:29:38 <ralonsoh> slaweq, that's the point, this is a potential issue in some environments
14:29:49 <liuyulong> slaweq, thanks for the advice
14:30:02 <liuyulong> I will update the doc
14:30:12 <slaweq> thx
14:30:36 <liuyulong> But may I know the real distro which relies on this? XEN?
14:31:41 <ralonsoh> not only XEN, but environments where the user can't access the namespaces
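For reference, both options live in the agent's `[agent]` config section. A sketch of the performance-oriented settings the discussion implies; as slaweq and ralonsoh note above, whether these are safe depends on the environment, and deployments where the agent user cannot read namespaces directly (e.g. XEN) still need the helper:

```ini
# neutron agent config (e.g. l3_agent.ini) - illustrative values only
[agent]
# Skip spawning a root helper just to read namespace data; only safe
# where the agent user can read namespaces directly.
use_helper_for_ns_read = False
# Leaving root_helper_daemon unset avoids round-tripping every command
# through the rootwrap daemon, which bug 1825152 reports as a scale
# bottleneck in large deployments.
# root_helper_daemon = sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
```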
14:31:42 <liuyulong> OK, last one
14:31:53 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1828494
14:32:04 <openstack> liuyulong: Error: Could not gather data from Launchpad for bug #1828494 (https://launchpad.net/bugs/1828494). The error has been logged
14:32:34 <slaweq> liuyulong: TBH I don't know - maybe if You want to change the default value You can start a thread on the ML to ask other operators/distro maintainers who can potentially be hurt by this change, and maybe we can change it in the future
14:32:35 <liuyulong> The title is "[RFE][L3] l3-agent should have its capacity"
14:32:43 <liuyulong> It is an RFE
14:33:43 <liuyulong> slaweq, OK, thank you : )
14:33:46 <liuyulong> The spec is "L3 agent capacity and scheduling":
14:33:46 <liuyulong> https://review.opendev.org/#/c/658451/
14:34:09 <slaweq> this needs to be discussed by the drivers team first
14:34:23 <liuyulong> And the ready-to-review code:
14:34:23 <liuyulong> https://review.opendev.org/#/c/661492/
14:34:28 <liuyulong> slaweq, yes
14:35:17 <slaweq> but I'm also not sure if this is a good idea
14:35:23 <liuyulong> But I have not gotten a slot in almost 3 months. : )
14:35:31 <slaweq> it sounds to me a bit like implementing placement in neutron
14:35:48 <ralonsoh> I had the same impression
14:36:02 <liuyulong> slaweq, why?
14:36:18 <ralonsoh> all resource tracking should be done in placement
14:36:26 <ralonsoh> not in the projects
14:36:30 <ralonsoh> to centralize the information
14:36:40 <slaweq> liuyulong: because generally placement is used to get reports about resources, track usage and propose candidates for placing new services based on some criteria
14:36:40 <ralonsoh> for example: the router BW
14:37:45 <slaweq> I can understand that You want to do something in the easiest and fastest possible way, but IMO it's not a good idea - maybe we should instead try to integrate this with placement
14:38:06 <slaweq> and don't get me wrong - I'm just asking questions to think about it :)
14:38:38 <liuyulong> That just makes things complicated. The nova scheduler has already been hurt by it, judging from our colleagues' complaints
14:38:56 <ralonsoh> this is not the nova scheduler
14:39:04 <ralonsoh> actually the nova scheduler is being deprecated
14:39:28 <liuyulong> I mean the nova scheduler has been hurt by placement...
14:39:37 <slaweq> also, it's not a trivial thing to report resources and decide how much bandwidth You have available
14:39:43 <liuyulong> It makes the nova team refactor and refactor
14:39:48 <slaweq> one host can be connected to various physical networks
14:40:16 <slaweq> You can have a router which will later have interfaces on networks which use different physical nets
14:40:25 <liuyulong> slaweq, yes, this is a good point, and it can be easy to implement
14:40:28 <slaweq> how do You want to choose this bandwidth during router creation?
14:41:24 <liuyulong> This is a scheduler mechanism for routers, yes
14:42:30 <liuyulong> random choice and minimum-quantity scheduling are not good enough
14:42:39 <slaweq> I will read this spec once again this week
14:42:49 <slaweq> and will write my comments there
14:43:05 <liuyulong> You cannot say your L3 agent has unlimited capacity
14:43:13 <slaweq> but IMO there are many cases here which may be hard to deal with
14:43:38 <liuyulong> But you have no way to prevent routers being created on it, until someday, boom...
14:43:41 <slaweq> also, if You want this RFE to be discussed in the drivers meeting, please ping mlavalle about that
14:44:11 <liuyulong> Your host dies, and your customer complains again, : )
14:45:29 <slaweq> but with this change You will end up with no space on network nodes because there will be many routers which are doing nothing
14:45:45 <slaweq> and your customer will complain due to an error while creating a router :)
14:46:35 <liuyulong> A "no resources available" error is easy to explain.
14:46:54 <liuyulong> A data-plane outage means you may pay money for it.
14:47:17 <slaweq> so You can use https://github.com/openstack/neutron/blob/master/neutron/scheduler/l3_agent_scheduler.py#L346 now and monitor the number of routers on each L3 agent
14:47:43 <slaweq> or propose a new scheduler which would simply have a configured max number of routers per agent - without reporting bandwidth and things like that
14:47:54 <liuyulong> An API error and a host going down are on totally different levels.
14:49:25 <slaweq> yes, so why not just a new simple scheduler with a limited number of routers per agent?
14:50:31 <liuyulong> slaweq, I considered it once; it is a bit simple and rough, and it does not face the real capacity: NIC bandwidth.
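To make the alternative concrete, here is a minimal sketch of the "capped" scheduler slaweq suggests, building on neutron's LeastRoutersScheduler. The config option and the per-agent router-count plugin call are hypothetical, added only for illustration; they are not neutron's real API:

```python
# Hypothetical sketch of a scheduler that enforces a configured maximum
# number of routers per L3 agent, as suggested above. The plugin call
# and config option are illustrative, not real neutron APIs.
from oslo_config import cfg

from neutron.scheduler import l3_agent_scheduler

OPTS = [
    cfg.IntOpt('max_routers_per_l3_agent', default=500,
               help='Refuse to schedule routers onto an L3 agent that '
                    'already hosts this many routers (hypothetical).'),
]
cfg.CONF.register_opts(OPTS)


class CappedLeastRoutersScheduler(l3_agent_scheduler.LeastRoutersScheduler):
    """Least-routers scheduling with a hard per-agent router cap."""

    def _get_candidates(self, plugin, context, sync_router):
        candidates = super()._get_candidates(plugin, context, sync_router)
        cap = cfg.CONF.max_routers_per_l3_agent
        # Drop agents that are already "full"; an empty candidate list
        # surfaces as an API-level scheduling error instead of an
        # overloaded network node failing later on the data plane.
        return [agent for agent in candidates
                if plugin.get_l3_agent_router_count(  # hypothetical call
                       context, agent.id) < cap]
```

This captures the "max number of routers" idea without any bandwidth reporting; the bandwidth-aware capacity model in the spec above goes further, which is exactly what the rest of the discussion debates.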
14:50:53 <slaweq> but You may have many NICs on a network node
14:51:02 <slaweq> and a router can consume bandwidth from each of them
14:51:17 <slaweq> how do You want to know which bandwidth it will consume?
14:51:44 <slaweq> next question: what about L3 HA?
14:52:32 <slaweq> from which agent will You then "consume" this bandwidth?
14:52:56 <liuyulong> slaweq, all routers will have to be scheduled
14:53:07 <liuyulong> so the bandwidth_ratio will have its value.
14:54:12 <slaweq> another question - what about DVR routers? what will this "bandwidth" attribute mean for them?
14:54:45 <liuyulong> It means, if an HA router needs two nodes with 10Mbps, the scheduler will find two l3-agents for it with 10Mbps of free bandwidth.
14:54:45 <slaweq> I will go through this spec once again this week and will write those questions there for further discussion
14:55:29 <slaweq> 10 Mbits per interface? for all interfaces? on a specific physical segment? or all physical segments?
14:56:30 <slaweq> also, what about other resources? like memory, for example?
14:56:41 <liuyulong> a router can only have one external gateway, so that one.
14:57:14 <slaweq> but a router can also have no external gateway at all - what about those?
14:58:44 <wwriverrat> brief status update (if time allows): multiple segments per host WIP
14:59:18 <liuyulong> wwriverrat, go ahead
14:59:20 <wwriverrat> For status on https://review.opendev.org/#/c/623115
14:59:20 <wwriverrat> Re-working the WIP patch to have a check method on the base classes: `supports_multi_segments_per_host` (False by default). For LinuxBridge implementations it would return True.
14:59:20 <wwriverrat> When False, it takes data from the old self.network_map[network_id]. When True, it gives everything from self.segments for that network_id. Naturally, code may have to handle either a single segment or a list of segments.
14:59:22 <wwriverrat> The code I was working on before spread too far and wide. If other drivers suffer the same problem, they can implement supports_multi_segments_per_host too.
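Paraphrasing that status as code, assuming everything except the names wwriverrat mentions (supports_multi_segments_per_host, network_map, segments) is hypothetical; the actual WIP patch may be structured differently:

```python
# Rough sketch of the check-method pattern described above for
# https://review.opendev.org/#/c/623115. Class names and structure
# other than the names taken from the discussion are hypothetical.


class AgentDriverBase:
    """Base class: drivers opt in to multi-segment-per-host support."""

    def supports_multi_segments_per_host(self):
        # Default: assume at most one segment per network on a host.
        return False

    def get_segments(self, network_id):
        """Return the segments for a network on this host."""
        if self.supports_multi_segments_per_host():
            # New behavior: all segments known for this network.
            return [seg for seg in self.segments
                    if seg['network_id'] == network_id]
        # Old single-segment path, normalized to a one-element list so
        # callers can always iterate over a list of segments.
        return [self.network_map[network_id]]


class LinuxBridgeDriver(AgentDriverBase):
    """Linux bridge can handle multiple segments per host."""

    def supports_multi_segments_per_host(self):
        return True
```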
14:59:35 <liuyulong> slaweq, please add your questions to the patch, I will reply to them.
14:59:42 <liuyulong> Time is up.
14:59:42 <ralonsoh> maybe next time we can talk about #link https://bugs.launchpad.net/neutron/+bug/1837635
15:00:00 <openstack> Launchpad bug 1837635 in neutron "HA router state change from "standby" to "master" should be delayed" [Undecided,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
15:00:00 <slaweq> liuyulong: sure
15:00:00 <liuyulong> #endmeeting