14:00:10 <slaweq> #startmeeting neutron_drivers
14:00:11 <openstack> Meeting started Fri Feb 14 14:00:10 2020 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:12 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:14 <openstack> The meeting name has been set to 'neutron_drivers'
14:00:15 <njohnston> o/
14:00:15 <slaweq> hi
14:00:17 <ralonsoh> hi
14:00:18 <yamamoto> hi
14:00:22 <TheJulia> o/
14:00:34 <stephen-ma> hi
14:01:09 <slaweq> let's wait a few more minutes for amotoki, haleyb and mlavalle
14:01:16 <haleyb> hi
14:01:44 <slaweq> hi haleyb
14:01:56 <slaweq> so we already have quorum and I think we can start
14:02:13 <slaweq> as TheJulia is here, let's not start in the usual order :)
14:02:19 <slaweq> #topic On Demand Agenda
14:02:32 <slaweq> TheJulia has added topic to https://wiki.openstack.org/wiki/Meetings/NeutronDrivers
14:03:44 <mlavalle> o/
14:03:47 <njohnston> hello TheJulia
14:04:01 <amotoki> hi! sorry for being late
14:04:08 <slaweq> hi mlavalle and amotoki
14:04:33 <TheJulia> So the bottom line is: we're wondering if the mac address update can be made non-admin or covered by a specific policy. Ironic is making the service more multi-tenant and usable for non-admins, but we pass credentials through for port actions and are trying to avoid pulling a second admin session as the ironic service user just to update the mac address
14:04:35 <slaweq> just FYI, I started today with the On Demand Agenda as TheJulia added a topic to it and I didn't want to keep her in the meeting for the whole hour :)
14:04:56 <TheJulia> slaweq: much appreciated... for I have hours of meetings ahead of me :)
14:05:06 <njohnston> I wonder if this could be achieved with a policy.json modification defining a role tied to a specific service credential for Ironic
14:05:14 <slaweq> njohnston: I think it can: https://github.com/openstack/neutron/blob/master/neutron/conf/policies/port.py#L192
14:05:27 <slaweq> it's defined there, if I understand Ironic's need correctly
14:06:19 <slaweq> and it seems to me that it can be done by an admin or advsvc user
14:06:50 <njohnston> slaweq: yes that is exactly what I was thinking about
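For reference, the rule slaweq links to in neutron/conf/policies/port.py is defined as policy-in-code with oslo.policy; the snippet below is a simplified sketch of that kind of definition (the check string is inferred from the discussion, not quoted verbatim from upstream):

    from oslo_policy import policy

    # Simplified sketch of a policy-in-code rule restricting MAC updates
    # to admins or users with the advsvc role; deployers can still
    # override it via policy.json / policy.yaml.
    rules = [
        policy.DocumentedRuleDefault(
            name='update_port:mac_address',
            check_str='rule:context_is_advsvc or rule:admin_only',
            description='Update the mac_address field of a port',
            operations=[{'method': 'PUT', 'path': '/ports/{id}'}],
        ),
    ]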
14:07:16 <amotoki> are we discussing mac address update by all non-admin users or users with specific roles?
14:08:00 <TheJulia> non-admin users of baremetal, which has me thinking we're going to have to do the thing we don't want to do, which is pull a separate client/session to directly update the port mac as a separate action
14:08:53 <njohnston> the thing is, Neutron has no way of distinguishing between the Ironic use case and the other use cases where non-admin access to this would be a bad idea
14:09:36 <amotoki> we prepared the advsvc role for exactly such a purpose. If it works, that would be great.
14:10:51 <TheJulia> Yeah, I suspect we could just have ironic learn how to do it separately, which would prevent potential security issues. I guess we'll need to look at that. Anyway thanks everyone!
14:11:32 <amotoki> one thing to note is that updating mac address should be limited to a private network.
14:11:41 <amotoki> I mean "self-service network".
14:11:57 <njohnston> Are there some baremetal scenarios where it would not be a good idea to allow mac address updating?
14:12:18 <amotoki> it should not be allowed in a shared network.
14:12:38 <amotoki> in other words, the operation should be limited to a network owner IMHO.
14:12:57 <TheJulia> njohnston: none, we must be able to update the mac address for pxe booting and addressing of physical ports.
14:14:55 <TheJulia> in that case the ironic service account needs to perform the port update action for just the mac address
14:15:28 <TheJulia> since we know it and manage it
14:16:12 <njohnston> I am wondering if we could permit port update for all in policy.json and then later in the logic require specific privileges unless it's a baremetal port.
14:17:12 <njohnston> But I don't know if there are baremetal-but-not-Ironic scenarios that could bite us with that
14:17:24 <slaweq> njohnston: I'm not sure if we should have such hard-coded rules for specific kinds of resources
14:17:26 <TheJulia> that is only conveyed via the binding profile (I think), which comes later on; also users can request vifs on user-created networks and ironic will request that the port be attached to that network
14:18:03 <njohnston> Having a separate admin session has the virtue of simplicity, other methods for doing it get complex quickly it seems to me
14:18:19 <TheJulia> agreed
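A minimal sketch of that separate-session approach, assuming openstacksdk and a dedicated clouds.yaml entry for the ironic service user (the name "ironic-service" and the helper are purely illustrative); it only shows the shape of using elevated credentials for the MAC update while everything else keeps the passed-through user credentials:

    import openstack

    # Connection built from the ironic service user's credentials,
    # separate from the per-tenant credentials used for other port actions.
    service_conn = openstack.connect(cloud='ironic-service')

    def update_port_mac(port_id, new_mac):
        # Only the mac_address field is touched with elevated privileges.
        return service_conn.network.update_port(port_id, mac_address=new_mac)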
14:18:39 <TheJulia> Thanks, I'll let the contributor working on the multitenancy feature set know!
14:19:31 <slaweq> ok, so I think we are good with Your topic TheJulia, right? You will try to use the advsvc role for this action.
14:20:39 <TheJulia> yup, thanks
14:21:02 <slaweq> thx TheJulia :)
14:21:15 <slaweq> so now we can move to our regular topic
14:21:17 <slaweq> #topic RFEs
14:21:31 <slaweq> we have 2 RFEs for today
14:21:34 <slaweq> first one:
14:21:36 <slaweq> https://bugs.launchpad.net/neutron/+bug/1860521
14:21:38 <openstack> Launchpad bug 1860521 in neutron "L2 pop notifications are not reliable" [Undecided,New]
14:24:21 <slaweq> I remember from when I was working at OVH that we had similar problems with the L2pop mechanism, and we added something like a periodic sync of the tunnel config on each host
14:25:49 <njohnston> I am not sure what the impact to the message bus would be to change from fanout/cast to RPC calls
14:27:24 <slaweq> njohnston: for the message bus not much, but for the neutron-rpc workers, which will send such messages and wait for replies, the impact will be at least "noticeable" IMO
14:27:59 <njohnston> slaweq: Yeah, I was worried more about the RPC workers
14:28:21 <slaweq> njohnston: ahh, ok :)
14:28:25 <mlavalle> it's a matter of whether the mesh of tunnels works vs the cost
14:28:36 <njohnston> Does OVN use l2pop?  IIRC it doesn't but I haven't looked in that part of the code in a while.
14:29:06 <slaweq> njohnston: nope
14:29:07 <amotoki> I am not sure right now which is better: to switch it to RPC calls or to sync the info periodically.
14:29:25 <amotoki> if the number of nodes to be informed is small, it makes sense to switch it to RPC calls.
14:30:22 <mlavalle> but I am sure the problem is more acute in large deployments
14:31:09 <njohnston> There are costs and benefits each way - if the RPC call idea adds overhead then it could make the situation in large deployments worse.  With the periodic sync you have a period of time where things might not work correctly, before the next sync.
14:31:27 <mlavalle> yeap
14:31:33 <mlavalle> it's always a trade off
14:31:59 <njohnston> I personally favor the periodic sync as being in keeping with our "eventually consistent" way of doing things, but I have a bias towards large deployment thinking.
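As a rough illustration of the periodic-sync alternative being weighed here, the sketch below uses oslo.service's looping call to have the agent re-request full FDB state on a timer; the RPC helper names are hypothetical and do not match the real l2pop API:

    from oslo_service import loopingcall

    def start_fdb_sync(agent, interval=60):
        def _sync():
            # Hypothetical server-side RPC returning the full FDB view;
            # the agent reapplies it, healing any lost cast messages.
            entries = agent.plugin_rpc.get_all_fdb_entries(agent.context)
            agent.fdb_update(agent.context, entries)

        loop = loopingcall.FixedIntervalLoopingCall(_sync)
        loop.start(interval=interval)
        return loop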
14:32:22 <slaweq> actually, looking at the code, it will not always be possible to switch from cast() to call()
14:32:23 <mlavalle> but if the mesh of tunnels gets to a point where it doesn't work, then it's time to consider the trade offs
14:32:28 <slaweq> https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/l2pop/rpc.py
14:32:45 <slaweq> in some cases it uses fanout=True and then call() can't be used
14:33:15 <ralonsoh> slaweq, why?
14:33:46 <ralonsoh> actually one problem we have with the MQ is that some calls are blocking
14:34:59 <slaweq> ralonsoh: https://docs.openstack.org/oslo.messaging/latest/reference/rpcclient.html#oslo_messaging.RPCClient.call
14:35:16 <slaweq> call() waits for the return value, so You can't send it to many hosts and wait for many replies
14:35:28 <slaweq> that's at least how I understand it
14:35:33 <ralonsoh> I know, but why do we need to use call instead of cast?
14:35:44 <ralonsoh> if the MQ is down, the server will stop working
14:35:56 <slaweq> ralonsoh: this was "Option 2" in the RFE
14:36:01 <ralonsoh> (maybe this is out of topic, sorry)
14:36:06 <amotoki> yeah, switching cast() into call() is not straightforward. in the case of fanout=True, we need to convert it into multiple call()s and also check the status of each individual call().
14:36:18 <amotoki> using call() allows us to check the result
14:36:39 <amotoki> but perhaps it will bring another scaling issue in this case.
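A minimal, generic sketch of the cast()/call() difference being discussed, assuming standard oslo.messaging usage; the topic, method name and wiring are illustrative rather than the exact l2pop code:

    from oslo_config import cfg
    import oslo_messaging

    def notify_fdb(context, host, fdb_entries):
        transport = oslo_messaging.get_rpc_transport(cfg.CONF)
        target = oslo_messaging.Target(topic='l2population')
        client = oslo_messaging.RPCClient(transport, target)

        # cast(): fire-and-forget; with fanout=True it reaches every agent,
        # but the server never learns whether any of them processed it.
        client.prepare(fanout=True).cast(
            context, 'add_fdb_entries', fdb_entries=fdb_entries)

        # call(): blocks until a single reply arrives, so it returns a result
        # (or raises MessagingTimeout) but cannot be combined with fanout.
        return client.prepare(server=host).call(
            context, 'add_fdb_entries', fdb_entries=fdb_entries)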
14:36:51 <stephen-ma> slaweq: when can an RFE be brought up for discussion?
14:37:02 <slaweq> amotoki: yes, but IMO that's not a good idea, as L2pop was designed to address some scaling problems and such a change would make it totally unscalable
14:37:20 <amotoki> slaweq: exactly
14:37:31 <amotoki> I just tried to explain what would happen.
14:37:39 <slaweq> stephen-ma: are You asking about https://bugs.launchpad.net/neutron/+bug/1861529 ? If yes, I keep it for the end of the meeting :)
14:37:40 <openstack> Launchpad bug 1861529 in neutron "[RFE] A port's network should be changable" [Wishlist,New]
14:38:03 <stephen-ma> yes that's the RFE
14:38:21 <slaweq> amotoki: ralonsoh  so IMO, to address the issue described by Oleg, we should only consider "option 1 - periodic sync mechanism"
14:38:45 <amotoki> slaweq: agree. I personally prefer the periodic sync.
14:39:19 <ralonsoh> (I don't, that's why we have the MQ)
14:39:28 <ralonsoh> but the problem is why the MQ is not reliable
14:39:56 <ralonsoh> anyway, if making this update periodic can solve this problem, I'm ok
14:40:02 <slaweq> ralonsoh: the problem here is that with the cast() method neutron-server doesn't know if the agent configured everything fine
14:41:17 <ralonsoh> I will comment in the bug
14:41:34 <ralonsoh> but that should not be a server problem
14:41:49 <ralonsoh> if agent is down, server should keep working
14:41:52 <amotoki> MQ is durable in some cases, but from my operator experience it is not easy to ensure MQ msgs are reliable, so it would be nice if neutron (as an MQ user) provided some mechanism for reliability.
14:42:08 <ralonsoh> if the agent received the config and everything went fine, ok
14:42:20 <ralonsoh> if not, the agent should communicate to the server informing about the error
14:42:44 <njohnston> The other alternative, just to play devil's advocate, is to build the reliability higher up in the stack
14:42:45 <mlavalle> which is synch of sorts, right?
14:43:10 <ralonsoh> njohnston, the reliability is on the services: agent, server, etc
14:43:11 <njohnston> ralonsoh: If the MQ never sent the message to the agent then the agent has no idea it has something to complain about
14:43:26 <ralonsoh> the agent should be smart enough to send a warning message to the server
14:43:35 <ralonsoh> njohnston, I know
14:43:51 <ralonsoh> and if the MQ does not work, then we will have a stopped Neutron server
14:43:54 <slaweq> ralonsoh: exactly as njohnston said, in the case of cast() You will not have e.g. a message timeout error on the server side
14:44:07 <njohnston> ralonsoh: Instead of depending on the call() method, the agent sends a message saying "I processed this update", and the neutron server counts these acks.
14:44:07 <ralonsoh> (something very common in some bugs)
14:44:50 <njohnston> Similarly to how you design a reliable service on top of UDP, you don't have the transmission mechanism ensure reliability, you build it into the application layer
14:44:54 <ralonsoh> exactly: this should be like a UDP call, and the client should inform.....
14:45:01 <ralonsoh> exactly!
14:45:05 <ralonsoh> I was writing the same
14:45:30 <mlavalle> and all of this is a sync mechanism, isn't it?
14:45:30 <ralonsoh> goal: do NOT block the server
14:45:33 <njohnston> that requires neutron to track what agent(s) should respond to this kind of request and keep an account for responses received
14:45:43 <ralonsoh> yes
14:46:05 <ralonsoh> in an async way
14:46:32 <njohnston> mlavalle: the value here is that the approach is not periodic or timer-based
14:46:52 <mlavalle> sure
14:47:56 <mlavalle> whenever two async entities need to cooperate (server and agent, for example) you need a way to find synchronization points, periodic or otherwise
14:48:23 <mlavalle> nature of distributed systems
14:48:41 <mlavalle> I'd say the idea has merit and we should explore it further with a spec
14:49:23 <ralonsoh> agree
14:49:43 <njohnston> My main question: is the work of syncing FDB updates to the database and then keeping an account of responses received worth the effort, compared to the simpler periodic sync mechanism?
14:50:10 <ralonsoh> IMO, you don't need to track the responses
14:50:18 <slaweq> ok, so to sum up what we discussed so far: we should continue the discussion of the sync (periodic or not) mechanism in the spec, and we don't want to switch from cast() to call(), right?
14:50:22 <ralonsoh> you'll have error responses or nothing
14:50:28 <njohnston> If you don't track the responses then you can't reissue the update when a response is not received
14:50:47 <ralonsoh> 1) if the agent is not working, the server will notice that via periodic checks
14:51:17 <ralonsoh> 2) if the agent message didn't work, the agent will reply with an error
14:51:32 <ralonsoh> 3) if the MQ is unreliable.... well, this IS a problem
14:51:40 <ralonsoh> but not Neutron's problem
14:52:00 <njohnston> But addressing 3 is the point of the RFE, it is Neutron's problem
14:52:08 <njohnston> If AMQP drops the message then neither server nor client will know there was an error
14:52:33 <njohnston> drops a casted message to be specific
14:52:38 <slaweq> I agree with njohnston here - we should do as much as we can to address such case on our side
14:52:42 <mlavalle> it is Neutron's problem in the sense that it has to at least cope with it
14:52:57 <ralonsoh> ok
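As a toy illustration of the application-layer reliability njohnston describes (the server counts acks and re-sends an FDB update to any agent that has not acked in time), the sketch below is entirely hypothetical; nothing like this exists in neutron today:

    import time

    class FdbAckTracker:
        """Track which agents have acked an FDB update and re-send overdue ones."""

        def __init__(self, resend, deadline=30):
            self._resend = resend       # callable(host, update_id)
            self._deadline = deadline
            self._pending = {}          # (update_id, host) -> time sent

        def sent(self, update_id, hosts):
            now = time.time()
            for host in hosts:
                self._pending[(update_id, host)] = now

        def acked(self, update_id, host):
            self._pending.pop((update_id, host), None)

        def resend_overdue(self):
            now = time.time()
            for (update_id, host), sent_at in list(self._pending.items()):
                if now - sent_at > self._deadline:
                    self._resend(host, update_id)
                    self._pending[(update_id, host)] = now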
14:55:25 <slaweq> so, I will sum up this discussion in the RFE and ask for a spec to continue the discussion there, right?
14:55:32 <mlavalle> slaweq: I would point out in the RFE that, in light of today's discussion, we lean towards some sort of sync mechanism
14:55:49 <mlavalle> and that we would like to explore it in a spec
14:55:55 <slaweq> mlavalle: sure
14:56:06 <njohnston> +1
14:56:12 <amotoki> +1
14:56:16 <ralonsoh> +1
14:56:59 <slaweq> ok, thx for the discussion about this RFE - it was a good one today :)
14:57:22 <slaweq> as we are almost at the top of the hour, I don't want to start the discussion about the next RFE
14:58:03 <slaweq> but I want to ask all of You to check https://bugs.launchpad.net/neutron/+bug/1861529
14:58:04 <openstack> Launchpad bug 1861529 in neutron "[RFE] A port's network should be changable" [Wishlist,New]
14:58:12 <ralonsoh> ok
14:58:31 <slaweq> and that's all for today from me
14:58:36 <slaweq> thx for attending
14:58:41 <slaweq> and have a great weekend
14:58:42 <slaweq> o/
14:58:45 <amotoki> o/
14:58:45 <mlavalle> o/
14:58:45 <ralonsoh> bye
14:58:46 <slaweq> #endmeeting