14:00:04 #startmeeting neutron_l3
14:00:05 Meeting started Wed Jul 24 14:00:04 2019 UTC and is due to finish in 60 minutes. The chair is liuyulong. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:08 The meeting name has been set to 'neutron_l3'
14:00:47 hi
14:02:03 hi
14:02:22 #topic Announcements
14:02:50 o/
14:06:00 Any announcements?
14:06:32 OK, let's move on.
14:06:48 #topic Bugs
14:07:12 #topic Bugs
14:07:30 #link http://lists.openstack.org/pipermail/openstack-discuss/2019-July/007952.html
14:07:36 Hongbin was our bug deputy last week, thanks.
14:07:55 IMO it was a quiet week for L3 (we are in the neutron_l3 meeting) : )
14:08:26 So today I will re-raise some old bugs. I've also reset some bugs to a higher priority, because they have been open for a really long time.
14:08:45 (Maybe I should raise the priority even higher if they still don't get much activity. LOL)
14:08:50 #link https://bugs.launchpad.net/neutron/+bug/1826695
14:09:01 liuyulong: Error: Could not gather data from Launchpad for bug #1826695 (https://launchpad.net/bugs/1826695). The error has been logged
14:09:50 What happened?
14:10:06 The bug title is "[L3][QoS] cache does not removed when router is down or deleted"
14:10:24 The fix is here:
14:10:24 https://review.opendev.org/#/c/656105/
14:10:48 I would say the opposite - if a bug has been there for a long time and nobody really cares about it, we should IMO decrease its priority :)
14:11:15 * njohnston thinks Launchpad is having issues
14:12:11 njohnston, I can't load anything
14:12:49 slaweq, until someday nobody cares about the entire project? LOL
14:13:20 liuyulong: who knows :)
14:13:53 Next
14:13:55 #link https://bugs.launchpad.net/neutron/+bug/1811352
14:14:05 liuyulong: Error: Could not gather data from Launchpad for bug #1811352 (https://launchpad.net/bugs/1811352). The error has been logged
14:14:18 openstack, all right, I know!
14:14:29 We need this for a Shanghai-related topic:
14:14:29 https://review.opendev.org/#/c/650062/
14:14:55 The CLI patch is here ^^
14:17:24 The progress is a bit slow. All OSC core reviewers have been added to that patch. : (
14:17:50 * tidwellr wanders in late and lurks
14:18:16 But it's OK, we can tag it locally and install it for the demo.
14:18:37 Next one: #link https://bugs.launchpad.net/neutron/+bug/1609217
14:18:47 liuyulong: Error: Could not gather data from Launchpad for bug #1609217 (https://launchpad.net/bugs/1609217). The error has been logged
14:19:17 liuyulong: do You have any presentation about port forwarding in Shanghai?
14:19:27 This is really an old one, the title is "DVR: dvr router ns should not exist in scheduled DHCP agent nodes"
14:19:43 The fix is here, it adds a new config option for cloud deployments: https://review.opendev.org/#/c/364793/
14:20:35 slaweq, yes, mlavalle submitted a topic.
14:20:49 good to know :)
14:20:54 thx for info
14:21:04 I will not repeat the reasoning behind the fix; if you are interested in this bug, here are the full scenarios I added before:
14:21:08 https://review.opendev.org/#/c/364793/3//COMMIT_MSG
14:21:50 It makes large-scale deployments really happy.
14:22:15 Next
14:22:20 #link https://bugs.launchpad.net/neutron/+bug/1813787
14:22:33 liuyulong: Error: Could not gather data from Launchpad for bug #1813787 (https://launchpad.net/bugs/1813787). The error has been logged
14:23:19 The bug title is "[L3] DVR router in compute node was not up but nova port needs its functionality"
14:23:28 The main fix is here: https://review.opendev.org/#/c/633871/
14:24:06 We already have some related fixes, but they do not address the root cause. This one is one approach.
14:24:35 We have run this code locally for a long time. It works well.
14:25:09 Next #link https://bugs.launchpad.net/neutron/+bug/1825152
14:25:19 liuyulong: Error: Could not gather data from Launchpad for bug #1825152 (https://launchpad.net/bugs/1825152). The error has been logged
14:25:54 The title is "[scale issue] the root rootwrap deamon causes l3 agent router procssing very very slow"
14:26:14 These two config options really hurt performance: `use_helper_for_ns_read` and `root_helper_daemon`.
14:26:37 The fix https://review.opendev.org/#/c/653378/ just sets it to False by default, since we should set a more appropriate default for the widely used distros.
14:27:22 about this one I still don't agree that we should change a default value which can possibly break some deployments during upgrade
14:27:38 Yes, another large-scale issue. And we also see a nice performance improvement locally.
14:28:20 IMO it should be well documented what to do to potentially improve performance here
14:28:31 but IMO changing the default value isn't a good solution
14:29:38 slaweq, that's the point, this is a potential issue in some environments
14:29:49 slaweq, thanks for the advice
14:30:02 I will update the doc
14:30:12 thx
14:30:36 But may I know which distros really rely on this? Xen?
14:31:41 not only Xen but environments where the user can't access the namespaces
14:31:42 OK, last one
14:31:53 #link https://bugs.launchpad.net/neutron/+bug/1828494
14:32:04 liuyulong: Error: Could not gather data from Launchpad for bug #1828494 (https://launchpad.net/bugs/1828494). The error has been logged
14:32:34 liuyulong: TBH I don't know - maybe if You want to change the default value You can start a thread on the ML to ask other operators/distro maintainers who could potentially be hurt by this change, and maybe we can change it in the future
14:32:35 The title is "[RFE][L3] l3-agent should have its capacity"
14:32:43 It is an RFE
14:33:43 slaweq, OK, thank you : )
14:33:46 The spec is "L3 agent capacity and scheduling":
14:33:46 https://review.opendev.org/#/c/658451/
14:34:09 this needs to be discussed by the drivers team first
14:34:23 And the ready-to-review code:
14:34:23 https://review.opendev.org/#/c/661492/
14:34:28 slaweq, yes
14:35:17 but I'm also not sure if this is a good idea
14:35:23 But I have not gotten a slot in almost 3 months. : )
14:35:31 it sounds to me a bit like implementing placement in neutron
14:35:48 I had the same impression
14:36:02 slaweq, why?
14:36:18 all resource tracking should be done in placement
14:36:26 not in the projects
14:36:30 to centralize the information
14:36:40 liuyulong: because generally placement is used to get reports about resources, track usage and propose candidates for placing new services based on some criteria
14:36:40 for example: the router BW
14:37:45 I can understand that You want to do something in the easiest and fastest possible way, but IMO it's not a good idea - maybe we should instead try to integrate this with placement
14:38:06 and don't get me wrong - I'm just asking questions to think about it :)
14:38:38 That just makes things complicated. The Nova scheduler has already been hurt by it, judging from our colleagues' complaints
14:38:56 this is not the nova scheduler
14:39:04 actually the nova scheduler is being deprecated
14:39:28 I mean the nova scheduler has been hurt by placement...
14:39:37 also, it's not a trivial thing to report resources and decide how much bandwidth You have available
14:39:43 It makes the nova team refactor again and again
14:39:48 one host can be connected to various physical networks
14:40:16 You can have a router which will later have interfaces on networks which use different physical nets
14:40:25 slaweq, yes, this is a good point, and it can be easy to implement
14:40:28 how do You want to choose this bandwidth during router creation?
14:41:24 This is a scheduler mechanism for routers, yes
14:42:30 random choice and minimum-quantity scheduling are not good enough
14:42:39 I will read this spec once again this week
14:42:49 and will write my comments there
14:43:05 You cannot say your L3 agent has unlimited capacity
14:43:13 but IMO there are many cases which may be hard to deal with
14:43:38 But you have no way to prevent routers from being created on it, until someday, boom...
14:43:41 also if You want this rfe to be discussed in the drivers meeting, please ping mlavalle about that
14:44:11 Your host dies, and your customer complains again, : )
14:45:29 but with this change You will end up with no space on network nodes while there will be many routers which are doing nothing
14:45:45 and your customer will complain due to an error while creating a router :)
14:46:35 A "no resources available" error is easy to explain.
14:46:54 A data-plane outage means you may pay money for it.
14:47:17 so You can use https://github.com/openstack/neutron/blob/master/neutron/scheduler/l3_agent_scheduler.py#L346 now and monitor the number of routers on each L3 agent
14:47:43 or propose a new scheduler which would simply have a configured max number of routers per agent - without reporting bandwidth and things like that
14:47:54 An API error and a host going down are on totally different levels.
14:49:25 yes, so why not just a new simple scheduler with a limited number of routers per agent?
14:50:31 slaweq, I considered it once; it is a bit simple and rough, and it does not address the real capacity: NIC bandwidth.
14:50:53 but You may have many NICs on a network node
14:51:02 and a router can consume bandwidth from each of them
14:51:17 how do You want to know how much bandwidth it will consume?
14:51:44 next question: what about L3 HA?
14:52:32 from which agent will You then "consume" this bandwidth?
14:52:56 slaweq, all routers will have to be scheduled
14:53:07 so the bandwidth_ratio will have its value.
14:54:12 another question - what about dvr routers? what will this "bandwidth" attribute mean for them?
14:54:45 It means that if an HA router needs two nodes with 10Mbps, the scheduler will find two l3-agents for it with 10Mbps of free bandwidth.
14:54:45 I will go through this spec once again this week and will write those questions there for further discussion
14:55:29 10 Mbits per interface? for all interfaces? on a specific physical segment? or all physical segments?
14:56:30 also what about other resources? like memory for example?
14:56:41 a router can only have one external gateway, so this one.
14:57:14 but a router can also have no external gateway - what about those?
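For reference, a rough sketch of the simpler alternative slaweq suggests above - a scheduler that just caps the number of routers per L3 agent instead of tracking bandwidth. The names (Agent, MAX_ROUTERS_PER_AGENT, pick_agent) are hypothetical illustrations of the idea, not neutron's actual scheduler API:

    from dataclasses import dataclass

    # Assumed operator-configured cap; not an existing neutron option.
    MAX_ROUTERS_PER_AGENT = 200

    @dataclass
    class Agent:
        host: str
        router_count: int  # routers already hosted by this L3 agent

    def pick_agent(candidates):
        """Pick an L3 agent for a new router, skipping agents at capacity.

        Returns None when every candidate is full, so router creation can
        fail with a clear "no capacity" API error instead of silently
        overloading a network node.
        """
        eligible = [a for a in candidates
                    if a.router_count < MAX_ROUTERS_PER_AGENT]
        if not eligible:
            return None
        # Least-loaded first, like LeastRoutersScheduler; a random pick
        # would mirror ChanceScheduler instead.
        return min(eligible, key=lambda a: a.router_count)

    # Example: the node that has reached the cap is skipped.
    agents = [Agent("net-node-1", 200), Agent("net-node-2", 120)]
    chosen = pick_agent(agents)
    print(chosen.host if chosen else "no capacity")  # -> net-node-2

This only counts routers, so it sidesteps the per-NIC and L3 HA bandwidth questions raised above, at the cost of treating every router as the same size.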
14:58:44 brief status update (if time allows): multiple segments per host WIP
14:59:18 wwriverrat, go ahead
14:59:20 For status on https://review.opendev.org/#/c/623115
14:59:20 Re-working the WIP patch to have a check method on the base classes: `supports_multi_segments_per_host` (False by default). For LinuxBridge implementations it would return True.
14:59:20 When False, it takes data from the old self.network_map[network_id]. When True, it gives all segments from self.segments for that network_id. Naturally, code may have to handle either a single segment or a list of segments.
14:59:22 The code I was working on before spread too far and wide. If other drivers suffer from the same problem, they can implement supports_multi_segments_per_host too.
14:59:35 slaweq, please add your questions to the patch, I will reply there.
14:59:42 Time is up.
14:59:42 maybe next time we can talk about #link https://bugs.launchpad.net/neutron/+bug/1837635
15:00:00 Launchpad bug 1837635 in neutron "HA router state change from "standby" to "master" should be delayed" [Undecided,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
15:00:00 liuyulong: sure
15:00:00 #endmeeting
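For context, a minimal sketch of the supports_multi_segments_per_host check described in wwriverrat's status update above, assuming illustrative class names and data shapes (this is not the actual code under review in https://review.opendev.org/#/c/623115):

    class AgentDriverBase(object):
        """Base class: single-segment-per-host behaviour by default."""

        def __init__(self):
            self.network_map = {}  # network_id -> the single known segment
            self.segments = {}     # network_id -> list of segments on this host

        def supports_multi_segments_per_host(self):
            # False by default so existing drivers keep their old behaviour.
            return False

        def segments_for_network(self, network_id):
            """Return the segment(s) to handle for a network on this host."""
            if self.supports_multi_segments_per_host():
                # New path: all segments known for this network.
                return list(self.segments.get(network_id, []))
            # Old path: exactly the one segment from network_map, if any.
            segment = self.network_map.get(network_id)
            return [segment] if segment is not None else []


    class LinuxBridgeDriver(AgentDriverBase):
        def supports_multi_segments_per_host(self):
            # LinuxBridge can wire several segments of one network per host.
            return True

Callers always get a list (possibly of length one), which matches the point in the update that code may have to handle either a single segment or a list of segments.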