14:00:25 <mlavalle> #startmeeting neutron_drivers
14:00:26 <openstack> Meeting started Fri Dec 21 14:00:25 2018 UTC and is due to finish in 60 minutes. The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:29 <openstack> The meeting name has been set to 'neutron_drivers'
14:02:29 <slaweq> hi
14:02:37 <kailun> hi
14:02:45 <mlavalle> hi
14:03:26 <nicky0419_> hi
14:03:47 <cheng1> hi
14:03:56 <mlavalle> let's wait a few minutes for quorum
14:04:01 <nicky0419_> https://bugs.launchpad.net/neutron/+bug/1805769 Please take a look at it.
14:04:02 <openstack> Launchpad bug 1805769 in neutron "[RFE] Add a config option rpc_response_max_timeout" [Undecided,Incomplete] - Assigned to nicky (ss104301)
14:04:33 <nicky0419_> If possible I want to discuss this RFE.
14:04:36 <mlavalle> nicky0419_: we will probably not discuss this RFE in this meeting today
14:05:45 <nicky0419_> mlavalle: Got it. Thanks.
14:06:00 <mlavalle> nicky0419_: but I will comment on it later today
14:06:57 <haleyb> hi
14:07:40 <mlavalle> hi haleyb
14:07:46 <mlavalle> we have quorum now
14:08:43 <mlavalle> #topic RFEs
14:08:54 <mlavalle> Let's discuss https://bugs.launchpad.net/neutron/+bug/1795212
14:08:55 <openstack> Launchpad bug 1795212 in neutron "[RFE] Prevent DHCP agent from processing stale RPC messages when restarting up" [Wishlist,In progress] - Assigned to Kailun Qin (kailun.qin)
14:09:21 <mlavalle> kailun: I believe this is yours, right?
14:09:34 <kailun> mlavalle: yes
14:10:04 <kailun> team, I've addressed your comments/questions in the launchpad, please kindly have a look
14:10:33 <mlavalle> kailun: yes, I went over it carefully yesterday
14:10:42 <mlavalle> thanks for your responses
14:11:14 <kailun> mlavalle: thank you, your last comment together w/ slaweq's, I've answered it today
14:11:30 <mlavalle> kailun: yes, I am looking at the code right now
14:11:41 <kailun> mlavalle: slaweq: see if you have any further comments on the corner case raised
14:12:19 <slaweq> kailun: mlavalle I think that you dispelled my doubts about it
14:12:24 <liuyulong> This is worth some testing, like booting tons of instances at one time and then restarting the dhcp agent.
14:13:20 <liuyulong> Rally may be the tool for it.
14:13:23 <slaweq> liuyulong: yes, I also wonder what impact it will have on performance
14:13:49 <kailun> liuyulong: the bug originally came from a production env, and the fix has been validated there as well
14:14:11 <kailun> slaweq: liuyulong: personally I agree w/ your opinions
14:14:12 <liuyulong> kailun: Thanks, good to know
14:15:59 <slaweq> kailun: so, to be sure I understand it correctly, during this delay just after the start of the agent, you expect that all "stale" rpc messages will be removed from the queue?
14:16:21 <slaweq> and then after the delay they will not be processed by the agent, am I right?
14:16:41 <kailun> slaweq: liuyulong: actually the stale RPC msgs would be consumed quickly since they are discarded w/o doing any real work in the agent, so the performance shouldn't be impacted much
14:17:01 <kailun> slaweq: correct
14:17:01 <slaweq> ahh, ok
14:17:07 <haleyb> the other option (which i've probably mentioned before) is to do as the l3-agent does - use a queue
14:17:20 <slaweq> so during this delay, the agent will consume old messages and not do anything with them
14:17:39 <slaweq> and then it will start doing full sync, right?
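(For context, a minimal Python sketch of the delay-and-discard behaviour kailun describes above. The option name, class, and handler names are illustrative assumptions, not the actual code in the patch under review.)

    import time

    from oslo_config import cfg
    from oslo_log import log as logging

    LOG = logging.getLogger(__name__)

    # Hypothetical option name; the real patch may call it something else.
    cfg.CONF.register_opts([
        cfg.IntOpt('stale_rpc_discard_window', default=0,
                   help='Seconds after agent start during which incoming '
                        'RPC messages are consumed but discarded as stale.'),
    ])


    class DhcpAgentSketch(object):

        def __init__(self):
            self._started_at = time.time()

        def _in_startup_window(self):
            # Messages are still pulled from the queue during this window,
            # they are just dropped without doing any real work.
            return (time.time() - self._started_at
                    < cfg.CONF.stale_rpc_discard_window)

        def port_update_end(self, context, payload):
            if self._in_startup_window():
                LOG.debug("Discarding stale port_update received during "
                          "startup window: %s", payload)
                return
            # Normal handling; messages that arrive while the full sync
            # holds the read lock are kept and processed afterwards.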
14:18:04 <kailun> haleyb: yes thanks for the proposal, I investigated a little bit but found that it might not meet our requirement
14:18:14 <liuyulong> https://review.openstack.org/#/c/626830/, this patch was submitted today; it uses a resource queue.
14:19:47 <kailun> slaweq: during the delay, the agent will discard the old msgs, but while doing the full sync, it will keep them due to the read lock and process them later
14:19:47 <haleyb> kailun: can you explain a little? it should order rpc messages based on arrival
14:20:31 <kailun> haleyb: please kindly have a look at comment #8 of the launchpad
14:21:30 <kailun> haleyb: I tried to answer your question, I'd like to copy-paste the answer here but I have a problem w/ the irc client :) sry
14:22:10 <haleyb> kailun: ack. but in your case, how does the agent know there are stale rpc messages, just based on the config option timeout?
14:23:16 <mlavalle> yes, that is my understanding. it is assumed that after the wait period, the stale messages would have been consumed
14:24:18 <mlavalle> right kailun?
14:24:58 <kailun> haleyb: I agree it's a little uncertain compared with a timestamp/sequence based approach, but based on comment #10, we considered some other points
14:25:13 <slaweq> can it happen that the delay will be too short and some messages will not be discarded in some cases?
14:25:42 <kailun> haleyb: personally, I think there is no harm in adding a delay for agents w/ default 0
14:26:10 <liuyulong> IIRC, oslo.messaging has some settings about the queue duration
14:27:11 <haleyb> slaweq: yes, and i was thinking too long and we throw out "good" ones, but i guess the fullsync would fix that
14:27:12 <kailun> mlavalle: the msgs will be discarded if not in the syncing stage, but all those arriving during syncing will have their processing delayed
14:27:25 <mlavalle> kailun: I understand that
14:28:01 <kailun> slaweq: as I said, the delay should not be long as the stale msgs can be consumed quickly
14:28:08 <slaweq> is there maybe any way in oslo messaging to do something like "clean queue from all messages" and then start normal full_sync?
14:28:37 <slaweq> like one call instead of delay and pray that it will be enough :)
14:29:04 <mlavalle> kailun: so the "stale" messages will be consumed while waiting here, right: https://review.openstack.org/#/c/609463/10/neutron/agent/dhcp/agent.py@167?
14:29:21 <kailun> slaweq: understood your wish, but I am not aware of anything like that :)
14:30:00 <kailun> mlavalle: correct
14:30:05 <slaweq> kailun: maybe it would be good to ask the oslo folks about that, and maybe it would be possible to add something like that there?
14:30:52 <haleyb> right, because otherwise there is no perfect setting for every deployment
14:32:17 <slaweq> there is something like that in rabbitmq: https://www.rabbitmq.com/rabbitmqctl.8.html#purge_queue
14:32:25 <mlavalle> kailun: let's be clear. I think I can safely say we all understand the problem you are bringing up and we highly appreciate the fact that you and your team are uncovering these corner cases
14:32:30 <slaweq> so probably it could also be implemented in oslo.messaging
14:32:51 <haleyb> right, and it might also be applied to other agents that way
14:33:26 <mlavalle> kailun: this is clearly a contribution to improve Neutron's performance. we are just thinking with you about what the best solution might be
14:33:57 <kailun> just one concern, if there are multiple dhcp agents but only one restarted
14:34:20 <kailun> does it make sense for the restarted dhcp agent to clean all messages?
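(Rough illustration of the purge alternative slaweq raises. oslo.messaging does not expose a purge call, per the discussion, so this sketch drops to the AMQP level with kombu; the per-host queue name is an assumption about driver internals, not a stable interface.)

    import kombu

    def purge_own_queue(amqp_url, host):
        # Assumed queue name; actual names are oslo.messaging driver
        # internals and may differ between releases and drivers.
        queue_name = 'dhcp_agent.%s' % host
        with kombu.Connection(amqp_url) as conn:
            # AMQP queue.purge drops all ready messages and returns the
            # count, roughly what "rabbitmqctl purge_queue <queue>" does
            # on the server side.
            return conn.channel().queue_purge(queue_name)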
14:34:55 <liuyulong> Each dhcp agent may have its own topic, clean that
14:35:10 <slaweq> doesn't each agent have its own queue?
14:36:17 <slaweq> I'm looking at https://github.com/openstack/neutron/blob/master/neutron/api/rpc/agentnotifiers/dhcp_rpc_agent_api.py#L191
14:36:27 <slaweq> and it looks like messages are sent "to host"
14:36:35 <slaweq> so to a specific agent
14:36:45 <kailun> slaweq: liuyulong: per host
14:36:51 <slaweq> and this queue can be purged by the agent IMO
14:37:19 <slaweq> kailun: per host and per agent type, right?
14:37:29 <kailun> slaweq: yes
14:37:33 <slaweq> so you will not purge e.g. the L3 agent's queue with this
14:37:54 <kailun> slaweq: no
14:38:47 <slaweq> I know that I may be annoying with my requests, but could you maybe also explore this possibility instead of doing this delay? Because a constant delay doesn't look very reliable to me :/
14:39:42 <mlavalle> the proposed mechanism is simple and doesn't have backwards compatibility issues
14:39:59 <liuyulong> One little concern about the current approach: messages may be sent twice in the high availability scenario, right? And the dhcp agent just stops at that block, I'm not quite sure what will happen after that.
14:40:19 <kailun> slaweq: it's OK, no worries :) I cannot definitely ensure that purging the queue is a better approach, but i agree it's worth looking into
14:40:29 <mlavalle> however, based on what we are discussing here, it won't perform predictably in all situations
14:41:06 <kailun> mlavalle: agree w/ you
14:42:02 <kailun> mlavalle: as I said, it does no harm anyway, and should address some corner cases as a quick solution
14:42:28 <haleyb> mlavalle: right, and adding a timer value means more work in other projects, for example, we'll need to support changing it on deploy in tripleo or ansible, etc
14:43:21 <mlavalle> kailun: the good thing is that you and your team are helping us uncover the corner cases that we have been unable to envision so far. But by definition these corner cases are more difficult to fix and therefore it is reasonable that we have a longer / more heated debate
14:43:41 <mlavalle> don't you agree?
14:44:28 <kailun> liuyulong: correct, i need to look further into your HA case to give an answer
14:44:42 <kailun> mlavalle: I agree
14:45:16 <mlavalle> kailun: but again, let me reiterate that we highly appreciate your bringing these cases to our attention
14:45:46 <kailun> mlavalle: no problem, thank you all
14:46:17 <mlavalle> kailun: before you go, let's also take a look at this patch: https://review.openstack.org/#/c/626830
14:46:36 <mlavalle> let's consider it in the range of possible solutions, please
14:46:53 <mlavalle> It might not be applicable, but at first glance it seems relevant
14:47:12 <kailun> mlavalle: yes, yulong pointed that out, i'll take a look
14:47:34 <liuyulong> Thanks kailun for the fix, yes, we have DHCP debts. And we also recently met some DHCP issues locally during a boot storm...
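(Simplified sketch of the per-host cast that the dhcp_rpc_agent_api code slaweq links performs; the function and topic names here are illustrative, but prepare()/cast() are standard oslo.messaging RPCClient calls. Because each cast targets server=host, each DHCP agent consumes from its own per-host queue, separate from the L3 agent's queues.)

    from oslo_config import cfg
    import oslo_messaging

    def make_dhcp_notifier_client():
        transport = oslo_messaging.get_rpc_transport(cfg.CONF)
        target = oslo_messaging.Target(topic='dhcp_agent')
        return oslo_messaging.RPCClient(transport, target)

    def cast_to_dhcp_agent(client, context, method, payload, host):
        # server=host routes the message to that host's per-agent queue,
        # so purging (or delaying) only affects the restarted agent.
        cctxt = client.prepare(server=host)
        cctxt.cast(context, method, payload=payload)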
14:48:10 <liuyulong> Mainly this issue: https://bugs.launchpad.net/neutron/+bug/1760047
14:48:11 <openstack> Launchpad bug 1760047 in neutron "some ports does not become ACTIVE during provisioning" [High,Triaged] - Assigned to yangjianfeng (yangjianfeng)
14:49:03 <kailun> liuyulong: thanks, I'll have a look and see if I have any insight
14:49:07 <mlavalle> I knew that the discussion of this RFE was going to be long, so I only scheduled this one for today
14:49:49 <mlavalle> there are a few, though, that I would like to bring to your attention before we go:
14:50:20 <mlavalle> first https://bugs.launchpad.net/neutron/+bug/1805769
14:50:22 <openstack> Launchpad bug 1805769 in neutron "[RFE] Add a config option rpc_response_max_timeout" [Undecided,Incomplete] - Assigned to nicky (ss104301)
14:50:45 <mlavalle> I will take the time today to comment on it. If you have spare bandwidth please take a look
14:51:30 <slaweq> this one looks reasonable to me, I just thought it would be good to have approval from the drivers team as it proposes a new config option
14:52:23 <mlavalle> what do you think haleyb?
14:53:32 <haleyb> i have not read it yet
14:54:00 <haleyb> was this the config option in neutron-lib
14:54:02 <haleyb> ?
14:54:03 <mlavalle> ok, let's look at it and comment
14:54:29 <slaweq> haleyb: it was originally proposed in neutron-lib
14:54:33 <slaweq> but now it has moved to neutron
14:54:41 <slaweq> as all other config options are in neutron
14:55:25 <haleyb> https://review.openstack.org/#/c/626109/
14:55:37 <slaweq> yep
14:56:42 <haleyb> i guess it looks fine
14:56:59 <haleyb> was there a follow-on that used it?
14:57:52 <mlavalle> I don't think there was
14:58:10 <haleyb> if a tree falls.... :)
14:58:23 <haleyb> it would be good to have a user
14:58:30 <slaweq> https://review.openstack.org/#/c/623401/10
14:58:35 <slaweq> this uses the option
14:59:21 <mlavalle> thanks for digging that one up
14:59:38 <mlavalle> running out of time
14:59:57 <mlavalle> let's finish this one in channel
15:00:03 <mlavalle> #endmeeting
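(For reference, a hedged sketch of how the rpc_response_max_timeout option discussed above could be defined with oslo.config; the default value and help text of whatever version actually merges into neutron may differ.)

    from oslo_config import cfg

    rpc_extra_opts = [
        cfg.IntOpt('rpc_response_max_timeout',
                   default=600,
                   help='Maximum seconds to wait for a response from an '
                        'RPC call; an agent that keeps doubling its RPC '
                        'timeout after repeated timeouts would stop at '
                        'this cap.'),
    ]

    def register_rpc_extra_opts(conf=cfg.CONF):
        # Register under the default group so it can be set in neutron.conf.
        conf.register_opts(rpc_extra_opts)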