14:00:04 <liuyulong> #startmeeting neutron_l3
14:00:05 <openstack> Meeting started Wed Jul 24 14:00:04 2019 UTC and is due to finish in 60 minutes. The chair is liuyulong. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:06 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:08 <openstack> The meeting name has been set to 'neutron_l3'
14:00:47 <slaweq> hi
14:02:03 <liuyulong> hi
14:02:22 <liuyulong> #topic Announcements
14:02:50 <njohnston> o/
14:06:00 <liuyulong_> Any announcements?
14:06:32 <liuyulong_> OK, let's move on.
14:06:48 <liuyulong_> #topic Bugs
14:07:12 <liuyulong> #topic Bugs
14:07:30 <liuyulong> #link http://lists.openstack.org/pipermail/openstack-discuss/2019-July/007952.html
14:07:36 <liuyulong> Hongbin was our bug deputy last week, thanks.
14:07:55 <liuyulong> IMO, it was a quiet week for L3 (we are in the neutron_l3 meeting) : )
14:08:26 <liuyulong> So today I will re-raise some old bugs. And I've reset some bugs to a higher priority, because they have been open for a really long time.
14:08:45 <liuyulong> (Maybe I should raise the bug priority even higher if they still do not get much activity. LOL)
14:08:50 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1826695
14:09:01 <openstack> liuyulong: Error: Could not gather data from Launchpad for bug #1826695 (https://launchpad.net/bugs/1826695). The error has been logged
14:09:50 <liuyulong> What happened?
14:10:06 <liuyulong> The bug title is "[L3][QoS] cache does not removed when router is down or deleted"
14:10:24 <liuyulong> The fix is here:
14:10:24 <liuyulong> https://review.opendev.org/#/c/656105/
14:10:48 <slaweq> I would say the opposite - if a bug has been there for a long time and nobody really cares about it, we should IMO decrease its priority :)
14:11:15 * njohnston thinks Launchpad is having issues
14:12:11 <ralonsoh> njohnston, I can't load anything
14:12:49 <liuyulong> slaweq, until someday nobody cares about the entire project? LOL
14:13:20 <slaweq> liuyulong: who knows :)
14:13:53 <liuyulong> Next
14:13:55 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1811352
14:14:05 <openstack> liuyulong: Error: Could not gather data from Launchpad for bug #1811352 (https://launchpad.net/bugs/1811352). The error has been logged
14:14:18 <liuyulong> openstack, all right, I know!
14:14:29 <liuyulong> We need this for a Shanghai-related topic:
14:14:29 <liuyulong> https://review.opendev.org/#/c/650062/
14:14:55 <liuyulong> The CLI patch is here ^^
14:17:24 <liuyulong> The progress is a bit slow. All OSC core reviewers have been added to that patch. : (
14:17:50 * tidwellr wanders in late and lurks
14:18:16 <liuyulong> But it's OK, we can tag it locally and install it for the demo.
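For context, the CLI patch above (https://review.opendev.org/#/c/650062/) adds floating IP port forwarding commands to python-openstackclient. A sketch of the intended usage, assuming the syntax proposed in the review; flag names may change before the patch merges, and the IDs and addresses below are placeholders:

```sh
# Forward TCP traffic arriving on the floating IP's port 2222 to
# port 22 of a fixed IP behind the router. Placeholder IDs/addresses.
openstack floating ip port forwarding create \
    --internal-ip-address 10.0.0.5 \
    --port <internal-neutron-port-id> \
    --internal-protocol-port 22 \
    --external-protocol-port 2222 \
    --protocol tcp \
    <floating-ip-id-or-address>

# List the forwardings configured on that floating IP
openstack floating ip port forwarding list <floating-ip-id-or-address>
```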
14:18:37 <liuyulong> Next one: #link https://bugs.launchpad.net/neutron/+bug/1609217
14:18:47 <openstack> liuyulong: Error: Could not gather data from Launchpad for bug #1609217 (https://launchpad.net/bugs/1609217). The error has been logged
14:19:17 <slaweq> liuyulong: do You have any presentation about port forwarding in Shanghai?
14:19:27 <liuyulong> This is a really old one; the title is "DVR: dvr router ns should not exist in scheduled DHCP agent nodes"
14:19:43 <liuyulong> The fix is here, it adds a new config option for cloud deployments: https://review.opendev.org/#/c/364793/
14:20:35 <liuyulong> slaweq, yes, mlavalle submitted a topic.
14:20:49 <slaweq> good to know :)
14:20:54 <slaweq> thx for info
14:21:04 <liuyulong> I will not repeat the reasoning for the fix; if you are interested in this bug, here are the full scenarios I added before:
14:21:08 <liuyulong> https://review.opendev.org/#/c/364793/3//COMMIT_MSG
14:21:50 <liuyulong> It makes large-scale deployments really happy.
14:22:15 <liuyulong> Next
14:22:20 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1813787
14:22:33 <openstack> liuyulong: Error: Could not gather data from Launchpad for bug #1813787 (https://launchpad.net/bugs/1813787). The error has been logged
14:23:19 <liuyulong> The bug title is "[L3] DVR router in compute node was not up but nova port needs its functionality"
14:23:28 <liuyulong> The main fix is here: https://review.opendev.org/#/c/633871/
14:24:06 <liuyulong> We already have some related fixes, but they do not address the root cause. This one is one approach.
14:24:35 <liuyulong> We have run such code locally for a long time. It works well.
14:25:09 <liuyulong> Next #link https://bugs.launchpad.net/neutron/+bug/1825152
14:25:19 <openstack> liuyulong: Error: Could not gather data from Launchpad for bug #1825152 (https://launchpad.net/bugs/1825152). The error has been logged
14:25:54 <liuyulong> The title is "[scale issue] the root rootwrap deamon causes l3 agent router procssing very very slow"
14:26:14 <liuyulong> These two config options really hurt performance: `use_helper_for_ns_read=` and `root_helper_daemon=`.
14:26:37 <liuyulong> The fix https://review.opendev.org/#/c/653378/ just sets it to False by default, since we should set the most appropriate value for the widely used distros.
14:27:22 <slaweq> about this one, I still don't agree that we should change a default value which can possibly break some deployments during upgrade
14:27:38 <liuyulong> Yes, another large-scale issue. And we also have a nice performance improvement locally.
14:28:20 <slaweq> IMO it should be well documented what to do to potentially improve performance here
14:28:31 <slaweq> but IMO changing the default value isn't a good solution
14:29:38 <ralonsoh> slaweq, that's the point, this is a potential issue in some environments
14:29:49 <liuyulong> slaweq, thanks for the advice
14:30:02 <liuyulong> I will update the doc
14:30:12 <slaweq> thx
14:30:36 <liuyulong> But may I know the real distro which relies on this? XEN?
14:31:41 <ralonsoh> not only XEN, but environments where the user can't access the namespaces
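For reference, both options live in the agent's `[agent]` config section. A sketch of the performance-oriented settings the discussion implies; as slaweq and ralonsoh note above, whether these are safe depends on the environment, and deployments where the agent user cannot read namespaces directly (e.g. XEN) still need the helper:

```ini
# neutron agent config (e.g. l3_agent.ini) - illustrative values only
[agent]
# Skip spawning a root helper just to read namespace data; only safe
# where the agent user can read namespaces directly.
use_helper_for_ns_read = False
# Leaving root_helper_daemon unset avoids round-tripping every command
# through the rootwrap daemon, which bug 1825152 reports as a scale
# bottleneck in large deployments.
# root_helper_daemon = sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
```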
14:31:42 <liuyulong> OK, last one
14:31:53 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1828494
14:32:04 <openstack> liuyulong: Error: Could not gather data from Launchpad for bug #1828494 (https://launchpad.net/bugs/1828494). The error has been logged
14:32:34 <slaweq> liuyulong: TBH I don't know - maybe if You want to change the default value You can start a thread on the ML to ask other operators/distro maintainers who can potentially be hurt by this change, and maybe we can change it in the future
14:32:35 <liuyulong> The title is "[RFE][L3] l3-agent should have its capacity"
14:32:43 <liuyulong> It is an RFE
14:33:43 <liuyulong> slaweq, OK, thank you : )
14:33:46 <liuyulong> The spec is "L3 agent capacity and scheduling":
14:33:46 <liuyulong> https://review.opendev.org/#/c/658451/
14:34:09 <slaweq> this needs to be discussed by the drivers team first
14:34:23 <liuyulong> And the ready-to-review code:
14:34:23 <liuyulong> https://review.opendev.org/#/c/661492/
14:34:28 <liuyulong> slaweq, yes
14:35:17 <slaweq> but I'm also not sure if this is a good idea
14:35:23 <liuyulong> But I have not gotten a slot in almost 3 months. : )
14:35:31 <slaweq> it sounds to me a bit like implementing placement in neutron
14:35:48 <ralonsoh> I had the same impression
14:36:02 <liuyulong> slaweq, why?
14:36:18 <ralonsoh> all resource tracking should be done in placement
14:36:26 <ralonsoh> not in the projects
14:36:30 <ralonsoh> to centralize the information
14:36:40 <slaweq> liuyulong: because generally placement is used to get reports about resources, track usage and propose candidates for placing new services based on some criteria
14:36:40 <ralonsoh> for example: the router BW
14:37:45 <slaweq> I can understand that You want to do something in the easiest and fastest possible way, but IMO it's not a good idea - maybe we should instead try to integrate this with placement
14:38:06 <slaweq> and don't get me wrong - I'm just asking questions to think about it :)
14:38:38 <liuyulong> That just makes things complicated. The nova scheduler has already been hurt by it, judging from our colleagues' complaints
14:38:56 <ralonsoh> this is not the nova scheduler
14:39:04 <ralonsoh> actually the nova scheduler is being deprecated
14:39:28 <liuyulong> I mean the nova scheduler has been hurt by placement...
14:39:37 <slaweq> also, it's not a trivial thing to report resources and decide how much bandwidth You have available
14:39:43 <liuyulong> It makes the nova team refactor and refactor
14:39:48 <slaweq> one host can be connected to various physical networks
14:40:16 <slaweq> You can have a router which will later have interfaces on networks which use different physical nets
14:40:25 <liuyulong> slaweq, yes, this is a good point, and it can be easy to implement
14:40:28 <slaweq> how do You want to choose this bandwidth during router creation?
14:41:24 <liuyulong> This is a scheduler mechanism for routers, yes
14:42:30 <liuyulong> random choice and minimum-quantity scheduling are not good enough
14:42:39 <slaweq> I will read this spec once again this week
14:42:49 <slaweq> and will write my comments there
14:43:05 <liuyulong> You cannot say your L3 agent has unlimited capacity
14:43:13 <slaweq> but IMO there are many cases here which may be hard to deal with
14:43:38 <liuyulong> But you have no way to prevent routers being created on it, until someday, boom...
14:43:41 <slaweq> also, if You want this RFE to be discussed in the drivers meeting, please ping mlavalle about that
14:44:11 <liuyulong> Your host dies, and your customer complains again, : )
14:45:29 <slaweq> but with this change You will end up with no space on network nodes because there will be many routers which are doing nothing
14:45:45 <slaweq> and your customer will complain due to an error while creating a router :)
14:46:35 <liuyulong> A "no resources available" error is easy to explain.
14:46:54 <liuyulong> A data-plane outage means you may pay money for it.
14:47:17 <slaweq> so You can use https://github.com/openstack/neutron/blob/master/neutron/scheduler/l3_agent_scheduler.py#L346 now and monitor the number of routers on each L3 agent
14:47:43 <slaweq> or propose a new scheduler which would simply have a configured max number of routers per agent - without reporting bandwidth and things like that
14:47:54 <liuyulong> An API error and a host going down are on totally different levels.
14:49:25 <slaweq> yes, so why not just a new simple scheduler with a limited number of routers per agent?
14:50:31 <liuyulong> slaweq, I considered it once; it is a bit simple and rough, and it does not face the real capacity: NIC bandwidth.
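To make the alternative concrete, here is a minimal sketch of the "capped" scheduler slaweq suggests, building on neutron's LeastRoutersScheduler. The config option and the per-agent router-count plugin call are hypothetical, added only for illustration; they are not neutron's real API:

```python
# Hypothetical sketch of a scheduler that enforces a configured maximum
# number of routers per L3 agent, as suggested above. The plugin call
# and config option are illustrative, not real neutron APIs.
from oslo_config import cfg

from neutron.scheduler import l3_agent_scheduler

OPTS = [
    cfg.IntOpt('max_routers_per_l3_agent', default=500,
               help='Refuse to schedule routers onto an L3 agent that '
                    'already hosts this many routers (hypothetical).'),
]
cfg.CONF.register_opts(OPTS)


class CappedLeastRoutersScheduler(l3_agent_scheduler.LeastRoutersScheduler):
    """Least-routers scheduling with a hard per-agent router cap."""

    def _get_candidates(self, plugin, context, sync_router):
        candidates = super()._get_candidates(plugin, context, sync_router)
        cap = cfg.CONF.max_routers_per_l3_agent
        # Drop agents that are already "full"; an empty candidate list
        # surfaces as an API-level scheduling error instead of an
        # overloaded network node failing later on the data plane.
        return [agent for agent in candidates
                if plugin.get_l3_agent_router_count(  # hypothetical call
                       context, agent.id) < cap]
```

This captures the "max number of routers" idea without any bandwidth reporting; the bandwidth-aware capacity model in the spec above goes further, which is exactly what the rest of the discussion debates.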
14:50:53 <slaweq> but You may have many NICs on a network node
14:51:02 <slaweq> and a router can consume bandwidth from each of them
14:51:17 <slaweq> how do You want to know which bandwidth it will consume?
14:51:44 <slaweq> next question: what about L3 HA?
14:52:32 <slaweq> from which agent will You then "consume" this bandwidth?
14:52:56 <liuyulong> slaweq, all routers will have to be scheduled
14:53:07 <liuyulong> so the bandwidth_ratio will have its value.
14:54:12 <slaweq> another question - what about DVR routers? what will this "bandwidth" attribute mean for them?
14:54:45 <liuyulong> It means, if an HA router needs two nodes with 10Mbps, the scheduler will find two l3-agents for it with 10Mbps of free bandwidth.
14:54:45 <slaweq> I will go through this spec once again this week and will write those questions there for further discussion
14:55:29 <slaweq> 10 Mbits per interface? for all interfaces? on a specific physical segment? or all physical segments?
14:56:30 <slaweq> also, what about other resources? like memory, for example?
14:56:41 <liuyulong> a router can only have one external gateway, so that one.
14:57:14 <slaweq> but a router can also have no external gateway at all - what about those?
14:58:44 <wwriverrat> brief status update (if time allows): multiple segments per host WIP
14:59:18 <liuyulong> wwriverrat, go ahead
14:59:20 <wwriverrat> For status on https://review.opendev.org/#/c/623115
14:59:20 <wwriverrat> Re-working the WIP patch to have a check method on the base classes: `supports_multi_segments_per_host` (False by default). For LinuxBridge implementations it would return True.
14:59:20 <wwriverrat> When False, it takes data from the old self.network_map[network_id]. When True, it gives everything from self.segments for that network_id. Naturally, code may have to handle either a single segment or a list of segments.
14:59:22 <wwriverrat> The code I was working on before spread too far and wide. If other drivers suffer the same problem, they can implement supports_multi_segments_per_host too.
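Paraphrasing that status as code, assuming everything except the names wwriverrat mentions (supports_multi_segments_per_host, network_map, segments) is hypothetical; the actual WIP patch may be structured differently:

```python
# Rough sketch of the check-method pattern described above for
# https://review.opendev.org/#/c/623115. Class names and structure
# other than the names taken from the discussion are hypothetical.


class AgentDriverBase:
    """Base class: drivers opt in to multi-segment-per-host support."""

    def supports_multi_segments_per_host(self):
        # Default: assume at most one segment per network on a host.
        return False

    def get_segments(self, network_id):
        """Return the segments for a network on this host."""
        if self.supports_multi_segments_per_host():
            # New behavior: all segments known for this network.
            return [seg for seg in self.segments
                    if seg['network_id'] == network_id]
        # Old single-segment path, normalized to a one-element list so
        # callers can always iterate over a list of segments.
        return [self.network_map[network_id]]


class LinuxBridgeDriver(AgentDriverBase):
    """Linux bridge can handle multiple segments per host."""

    def supports_multi_segments_per_host(self):
        return True
```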
14:59:35 <liuyulong> slaweq, please add your questions to the patch, I will reply to them.
14:59:42 <liuyulong> Time is up.
14:59:42 <ralonsoh> maybe next time we can talk about #link https://bugs.launchpad.net/neutron/+bug/1837635
15:00:00 <openstack> Launchpad bug 1837635 in neutron "HA router state change from "standby" to "master" should be delayed" [Undecided,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
15:00:00 <slaweq> liuyulong: sure
15:00:00 <liuyulong> #endmeeting