14:00:25 <mlavalle> #startmeeting neutron_drivers
14:00:26 <openstack> Meeting started Fri Dec 21 14:00:25 2018 UTC and is due to finish in 60 minutes.  The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:29 <openstack> The meeting name has been set to 'neutron_drivers'
14:02:29 <slaweq> hi
14:02:37 <kailun> hi
14:02:45 <mlavalle> hi
14:03:26 <nicky0419_> hi
14:03:47 <cheng1> hi
14:03:56 <mlavalle> let's wait a few minutes for quorum
14:04:01 <nicky0419_> https://bugs.launchpad.net/neutron/+bug/1805769 Please take a look at it.
14:04:02 <openstack> Launchpad bug 1805769 in neutron "[RFE] Add a config option rpc_response_max_timeout" [Undecided,Incomplete] - Assigned to nicky (ss104301)
14:04:33 <nicky0419_> If possible I want to discuss this RFE.
14:04:36 <mlavalle> nicky0419_: we will probably not discuss this RFE in this meeting today
14:05:45 <nicky0419_> mlavalle: Got it. Thanks.
14:06:00 <mlavalle> nicky0419_: but I will comment on it later today
14:06:57 <haleyb> hi
14:07:40 <mlavalle> hi haleyb
14:07:46 <mlavalle> we have quorum now
14:08:43 <mlavalle> #topic RFEs
14:08:54 <mlavalle> Let's discuss https://bugs.launchpad.net/neutron/+bug/1795212
14:08:55 <openstack> Launchpad bug 1795212 in neutron "[RFE] Prevent DHCP agent from processing stale RPC messages when restarting up" [Wishlist,In progress] - Assigned to Kailun Qin (kailun.qin)
14:09:21 <mlavalle> kailun: I believe this is yours, right?
14:09:34 <kailun> mlavalle: yes
14:10:04 <kailun> team, I've addressed your comments/questions in the launchpad, please kindly have a look
14:10:33 <mlavalle> kailun: yes, I went over it carefully yesterday
14:10:42 <mlavalle> thanks for your responses
14:11:14 <kailun> mlavalle: thank you, I've answered your last comment, together w/ slaweq's, today
14:11:30 <mlavalle> kailun: yes, I am looking at the code right now
14:11:41 <kailun> mlavalle: slaweq: see if you have any further comment on that corner case raised
14:12:19 <slaweq> kailun: mlavalle I think that you dispelled my doubts about it
14:12:24 <liuyulong> This is worth some testing, like booting tons of instances at one time and restarting the dhcp agent.
14:13:20 <liuyulong> Rally may be the tool for it.
14:13:23 <slaweq> liuyulong: yes, I also wonder what impact it will have on performance
14:13:49 <kailun> liuyulong: the bug initially came from a production env, and the fix has been validated there as well
14:14:11 <kailun> slaweq: liuyulong: personally I agree w/ your opinions
14:14:12 <liuyulong> kailun: Thanks, good to know
14:15:59 <slaweq> kailun: so, to be sure I understand it correctly, during this delay just after start of agent, You expect that all "stale" rpc messages will be removed from queue?
14:16:21 <slaweq> and then after delay they will not be processed by agent, am I right?
14:16:41 <kailun> slaweq: liuyulong: actually the stale RPC msgs would be consumed quickly since they are discarded w/o doing any real work in the agent, so the performance shouldn't be impacted much
14:17:01 <kailun> slaweq: correct
14:17:01 <slaweq> ahh, ok
14:17:07 <haleyb> the other option (which i've probably mentioned before) is to do as the l3-agent does - use a queue
14:17:20 <slaweq> so during this delay, the agent will consume old messages and not do anything with them
14:17:39 <slaweq> and then it will start doing full sync, right?
14:18:04 <kailun> haleyb: yes thanks for the proposal, I investigated a little bit but found that it might not meet our requirement
14:18:14 <liuyulong> https://review.openstack.org/#/c/626830/, this patch was submitted today; it uses a resource queue.
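The l3-agent-style "resource queue" haleyb suggests and the patch liuyulong links above could look roughly like the sketch below. This is a simplified, assumed stand-in rather than the code in the linked review; the class and method names are invented for illustration.

    import heapq
    import time


    class ResourceUpdate(object):
        """One queued update for a single resource (e.g. a network)."""

        def __init__(self, resource_id, payload):
            self.resource_id = resource_id
            self.payload = payload
            self.timestamp = time.time()

        def __lt__(self, other):
            return self.timestamp < other.timestamp


    class ResourceQueue(object):
        """Order updates by arrival time; drop ones made obsolete by a sync."""

        def __init__(self):
            self._heap = []
            self._synced_at = {}  # resource_id -> time of last full refresh

        def add(self, update):
            heapq.heappush(self._heap, update)

        def mark_synced(self, resource_id):
            # Called when a full sync has just refreshed this resource, so
            # any update queued before this moment carries no new information.
            self._synced_at[resource_id] = time.time()

        def each_update(self):
            # Yield updates oldest first, skipping those older than the last
            # full refresh of the same resource.
            while self._heap:
                update = heapq.heappop(self._heap)
                if update.timestamp <= self._synced_at.get(update.resource_id, 0):
                    continue
                yield update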
14:19:47 <kailun> slaweq: during the delay, agent will discard the old msgs, but while doing the full sync, it will keep them due to the read lock and process them later
14:19:47 <haleyb> kailun: can you explain a little?  it should order rpc messages based on arrival
14:20:31 <kailun> haleyb: please kindly have a look at the comment #8 of the launchpad
14:21:30 <kailun> haleyb: I tried to answer your question, I'd like to copy-paste the answer here but I have a problem w/ the irc client:) sry
14:22:10 <haleyb> kailun: ack.  but in your case, how does the agent know there are stale rpc messages, just based on the config option timeout?
14:23:16 <mlavalle> yes, that is my understanding. it is assumed that after the wait period, the stale messages would have been consumed
14:24:18 <mlavalle> right kailun?
14:24:58 <kailun> haleyb: I agree it's a little less certain compared w/ a timestamp/sequence-based approach, but based on comment #10, we considered some other points
14:25:13 <slaweq> can it happen that delay will be too short and some messages will not be discarded in some cases?
14:25:42 <kailun> haleyb: personally, I think there's no harm in adding a delay for agents w/ default 0
14:26:10 <liuyulong> IIRC, oslo.messaging has some settings about queue duration
14:27:11 <haleyb> slaweq: yes, and I was thinking it could be too long and we'd throw out "good" ones, but I guess the full sync would fix that
14:27:12 <kailun> mlavalle: the msgs will be discarded if not in the syncing stage, but all those arriving during syncing will have their processing delayed
14:27:25 <mlavalle> kailun: I understand that
14:28:01 <kailun> slaweq: as I said, the delay should not be long as the stale msgs can be consumed quickly
14:28:08 <slaweq> is there maybe any way in oslo.messaging to do something like "clean the queue of all messages" and then start the normal full_sync?
14:28:37 <slaweq> like one call instead of delay and pray that it will be enough :)
14:29:04 <mlavalle> kailun: so the "stale" messages will be consumed while waiting here, right: https://review.openstack.org/#/c/609463/10/neutron/agent/dhcp/agent.py@167?
14:29:21 <kailun> slaweq: understood your wish, but I am not aware of that :)
14:30:00 <kailun> mlavalle: correct
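A minimal sketch of the delay-then-discard behaviour being described, assuming a new config option for the delay. The option name, handler names, and overall structure below are illustrative guesses, not the actual patch linked above.

    import time

    from oslo_config import cfg

    OPTS = [
        cfg.IntOpt('startup_rpc_discard_delay',
                   default=0,
                   help='Seconds to wait after agent start; RPC messages '
                        'consumed during this window are treated as stale '
                        'and ignored (hypothetical option name).'),
    ]
    cfg.CONF.register_opts(OPTS)


    class DhcpAgentSketch(object):
        """Consume and discard stale RPC messages before the first full sync."""

        def __init__(self):
            self._started_at = time.time()

        def _in_discard_window(self):
            # True while the agent is still inside the startup delay.
            return (time.time() - self._started_at
                    < cfg.CONF.startup_rpc_discard_delay)

        def port_update_end(self, context, payload):
            # RPC handler: messages arriving during the discard window are
            # consumed quickly but not acted upon; the later full sync
            # rebuilds the correct state anyway.
            if self._in_discard_window():
                return
            self._process_port_update(payload)

        def _process_port_update(self, payload):
            pass  # real handling of the port update would go here

        def sync_state(self):
            pass  # full resync against the neutron server would go here

        def run(self):
            # Wait out the discard window so queued stale messages drain
            # first, then do the expensive full resync.
            time.sleep(cfg.CONF.startup_rpc_discard_delay)
            self.sync_state()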
14:30:05 <slaweq> kailun: maybe it would be good to ask oslo folks about that, and maybe it would be possible to add something like that there?
14:30:52 <haleyb> right, because otherwise there is no perfect setting for every deployment
14:32:17 <slaweq> there is something like that in rabbitmq: https://www.rabbitmq.com/rabbitmqctl.8.html#purge_queue
14:32:25 <mlavalle> kailun: let's be clear. I think I can safely say we all understand the problem you are bringing up and we highly appreciate the fact that you and your team are uncovering these corner cases
14:32:30 <slaweq> so probably it can be also implemented in oslo.messaging
14:32:51 <haleyb> right, and it might also be applied to other agents that way
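For illustration, purging the agent's queue directly against RabbitMQ, the Python equivalent of "rabbitmqctl purge_queue <name>" linked above, could look like the sketch below. As noted in the discussion, oslo.messaging does not seem to expose such a call today, and the broker host and queue name here are placeholders.

    import pika

    # Connect to the broker; the host is a placeholder.
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host='rabbit-host'))
    channel = connection.channel()

    # Drop every message currently sitting in the queue, then let the agent
    # run its normal full sync from a clean slate.  The queue name is a
    # guess at the per-host DHCP agent queue.
    channel.queue_purge(queue='dhcp_agent.compute-1')

    connection.close()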
14:33:26 <mlavalle> kailun: this is clearly a contribution to improve Neutron's performance. we are just thinking with you what the best solution might be
14:33:57 <kailun> just one concern: if there are multiple dhcp agents but only one restarted
14:34:20 <kailun> does it make sense for the restarted dhcp agent to clean all messages?
14:34:55 <liuyulong> Each dhcp agent may have its own topic; clean that one
14:35:10 <slaweq> doesn't each agent have its own queue?
14:36:17 <slaweq> I'm looking at https://github.com/openstack/neutron/blob/master/neutron/api/rpc/agentnotifiers/dhcp_rpc_agent_api.py#L191
14:36:27 <slaweq> and it looks like messages are sent "to host"
14:36:35 <slaweq> so to specific agent
14:36:45 <kailun> slaweq: liuyulong: per host
14:36:51 <slaweq> and this queue can be purged by agent IMO
14:37:19 <slaweq> kailun: per host and per agent type, right?
14:37:29 <kailun> slaweq: yes
14:37:33 <slaweq> so You will not purge e.g. L3 agent's queue with this
14:37:54 <kailun> slaweq: no
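A hedged sketch of why each DHCP agent ends up with its own queue: the notifier prepares the RPC client with server=<host>, so the cast lands on a per-host, per-agent-type queue (roughly "dhcp_agent.<host>"). The topic, method name, and payload below are simplified assumptions; the real code is in the dhcp_rpc_agent_api.py link above.

    from oslo_config import cfg
    import oslo_messaging

    transport = oslo_messaging.get_rpc_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='dhcp_agent', version='1.0')
    client = oslo_messaging.RPCClient(transport, target)


    def notify_network_create(context, network, host):
        # prepare(server=host) routes the cast only to the DHCP agent
        # consuming the queue for that host.
        cctxt = client.prepare(server=host)
        cctxt.cast(context, 'network_create_end', payload={'network': network})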
14:38:47 <slaweq> I know that I may be annoying with my requests, but can you maybe also explore this possibility instead of doing the delay? Because a constant delay doesn't look very reliable to me :/
14:39:42 <mlavalle> the proposed mechanism is simple and doesn't have backwards compatibility issues
14:39:59 <liuyulong> One little concern about the current approach: messages may be sent twice in the high availability scenario, right? And the dhcp agent just stops that block; I'm not quite sure what will happen after that.
14:40:19 <kailun> slaweq: it's OK no worry:) I cannot definitely ensure that purge queue is a better approach, but i agree it's worth looking into
14:40:29 <mlavalle> however, based on what we are discussing here, it won't perform predictably in all situations
14:41:06 <kailun> mlavalle: agree w/ you
14:42:02 <kailun> mlavalle: as I said, it does no harm anyway, and should address some corner cases as a quick solution
14:42:28 <haleyb> mlavalle: right, and adding a timer value means more work in other projects, for example, we'll need to support changing it on deploy in tripleo or ansible, etc
14:43:21 <mlavalle> kailun: the good thing is that you and your team are helping us to uncover the corner cases that we have been unable to envision so far. But by definition these corner cases are more difficult to fix and therefore it is reasonable that we have a longer / more heated debate
14:43:41 <mlavalle> don't you agree?
14:44:28 <kailun> liuyulong: correct, I need to look further into your HA case to give an answer
14:44:42 <kailun> mlavalle: I agree
14:45:16 <mlavalle> kailun: but again, let me reiterate that we highly appreciate your bringing these cases to our attention
14:45:46 <kailun> mlavalle: no problem, thank you all
14:46:17 <mlavalle> kailun: before you go, let's also take a look at this patch: https://review.openstack.org/#/c/626830
14:46:36 <mlavalle> let's consider it in the range of possible solutions, please
14:46:53 <mlavalle> It might not be applicable, but at first glance it seems relevant
14:47:12 <kailun> mlavalle: yes, yulong pointed that out, I'll take a look
14:47:34 <liuyulong> Thanks kailun for the fix, yes, we have DHCP debt. And we recently also met some DHCP issues locally during a boot storm...
14:48:10 <liuyulong> Mainly this issue: https://bugs.launchpad.net/neutron/+bug/1760047
14:48:11 <openstack> Launchpad bug 1760047 in neutron "some ports does not become ACTIVE during provisioning" [High,Triaged] - Assigned to yangjianfeng (yangjianfeng)
14:49:03 <kailun> liuyulong: thanks, I'll have a look and see if I have any insight
14:49:07 <mlavalle> I knew that the discussion of this RFE was going to be long, so I only scheduled this one for today
14:49:49 <mlavalle> there are a few others, though, that I would like to bring to your attention before we go:
14:50:20 <mlavalle> first https://bugs.launchpad.net/neutron/+bug/1805769
14:50:22 <openstack> Launchpad bug 1805769 in neutron "[RFE] Add a config option rpc_response_max_timeout" [Undecided,Incomplete] - Assigned to nicky (ss104301)
14:50:45 <mlavalle> I will take the time today to comment on it. If you have spare bandwidth please take a look
14:51:30 <slaweq> this one looks reasonable to me, I just thought it would be good to have approval from the drivers team as it proposes a new config option
14:52:23 <mlavalle> what do you think haleyb?
14:53:32 <haleyb> i have not read it yet
14:54:00 <haleyb> was this the config option in neutron-lib
14:54:02 <haleyb> ?
14:54:03 <mlavalle> ok, let's look at it and comment
14:54:29 <slaweq> haleyb: it was originally proposed in neutron-lib
14:54:33 <slaweq> but now it is moved to neutron
14:54:41 <slaweq> as all other config options are in neutron
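A minimal sketch of what registering such an option in neutron might look like with oslo.config; the default value and help text are assumptions for illustration, not taken from the patch under review.

    from oslo_config import cfg

    rpc_opts = [
        cfg.IntOpt('rpc_response_max_timeout',
                   default=600,
                   help='Maximum seconds to wait for a response from an '
                        'RPC call (assumed default and semantics).'),
    ]


    def register_rpc_opts(conf=cfg.CONF):
        conf.register_opts(rpc_opts)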
14:55:25 <haleyb> https://review.openstack.org/#/c/626109/
14:55:37 <slaweq> yep
14:56:42 <haleyb> i guess it looks fine
14:56:59 <haleyb> was there a follow-on that used it?
14:57:52 <mlavalle> I don't think there was
14:58:10 <haleyb> if a tree falls.... :)
14:58:23 <haleyb> it would be good to have a user
14:58:30 <slaweq> https://review.openstack.org/#/c/623401/10
14:58:35 <slaweq> this uses this option
14:59:21 <mlavalle> thanks for digging that one up
14:59:38 <mlavalle> running out of time
14:59:57 <mlavalle> let's finish this one in channel
15:00:03 <mlavalle> #endmeeting