#openstack-meeting log

14:00:14 <liuyulong> #startmeeting neutron_l3
14:00:14 <openstack> Meeting started Wed Jul 17 14:00:14 2019 UTC and is due to finish in 60 minutes.  The chair is liuyulong. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:15 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:17 <openstack> The meeting name has been set to 'neutron_l3'
14:00:33 <njohnston_> o/
14:00:34 <haleyb> hi
14:00:40 <liuyulong> hi
14:00:46 <liuyulong> #chair haleyb
14:00:47 <openstack> Current chairs: haleyb liuyulong
14:01:15 <liuyulong> #topic Announcements
14:01:23 <ralonsoh> hi
14:01:25 <liuyulong> I have a question
14:01:37 <liuyulong> Where can we apply for the early-bird discount for the Shanghai summit and PTG?
14:01:49 <liuyulong> I have not received any discount CODE recently. If you have any information, please let us know.
14:02:37 <slaweq> hi
14:02:56 <liuyulong> And we have this PTG plan etherpad now:
14:03:03 <liuyulong> #link https://etherpad.openstack.org/p/Shanghai-Neutron-Planning
14:03:37 <liuyulong> And one more important thing: it seems none of us knows the final official 'U' release name yet.
14:03:45 <liuyulong> It is interesting now.
14:04:23 <haleyb> there will be a vote eventually...
14:04:32 <liuyulong> Chinese Pinyin has no pronunciation that starts with 'U'. I sent a suggestion to the mailing list, but it was like a stone dropped into the ocean, without any response.
14:04:39 <liuyulong> #link http://lists.openstack.org/pipermail/openstack-discuss/2019-February/002706.html
14:06:03 <liuyulong> Allow me to quote the contents from that mail:
14:06:11 <liuyulong> """
14:06:27 <liuyulong> And my name is Yulong, then 'Uylong' can be a good example to explain my suggestion, : )
14:07:47 <slaweq> so maybe we should propose 'Uylong' as the name of the most famous Chinese neutron core ;)
14:08:04 <haleyb> +2 :)
14:08:15 <liuyulong> haleyb, when does the vote usually happen?
14:09:14 <liuyulong> By OpenStack tradition, it should be a place name. Haha
14:09:45 <haleyb> liuyulong: i don't remember when exactly, but yes, a place or street or ???
14:10:14 <liuyulong> Yes, mountains and rivers
14:10:43 <haleyb> Ussuri
14:11:09 <haleyb> the TC will eventually send an email with a place to add suggestions
14:11:41 <slaweq> yes, usually it was some wiki page or something like that where people were adding proposals for voting IIRC
14:12:06 <liuyulong> Ussuri is more like a Russian word. In Chinese Pinyin, it is "Wusuli".
14:12:07 <haleyb> we had the same problem in Hong Kong and chose Icehouse (street)
14:13:05 <haleyb> not sure we will solve this in the L3 meeting though :)
14:13:17 <liuyulong> Haha
14:13:27 <liuyulong> OK, Any other announcements?
14:13:45 <liuyulong> Sure, let's move on.
14:13:52 <liuyulong> #topic Bugs
14:14:01 <liuyulong> #link http://lists.openstack.org/pipermail/openstack-discuss/2019-July/007763.html
14:14:07 <liuyulong> Slawek Kaplonski (slaweq) was our bug deputy last week, thank you for the collection.
14:14:35 <liuyulong> I will skip all the bugs that are already fixed or whose related patches are getting merged now.
14:14:55 <liuyulong> First one
14:14:57 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1836642
14:14:58 <openstack> Launchpad bug 1836642 in neutron "Metadata responses are very slow sometimes" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
14:15:09 <liuyulong> Looks like the nova metadata API was doing nothing during those 16s+.
14:15:14 <liuyulong> Here is an example:
14:15:19 <liuyulong> #link http://logs.openstack.org/09/666409/7/check/tempest-full/08f4c53/controller/logs/screen-n-api-meta.txt.gz#_Jul_11_23_43_01_357100
14:15:39 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1821912
14:15:41 <openstack> Launchpad bug 1821912 in neutron "intermittent ssh failures in various scenario tests" [High,In progress] - Assigned to LIU Yulong (dragon889)
14:15:50 <liuyulong> slaweq sent some similar logs here before.
14:16:10 <slaweq> sean-k-mooney was looking into it yesterday with me, and he found that in the case we were analysing most of the time was wasted on http://logs.openstack.org/09/666409/7/check/tempest-full/08f4c53/controller/logs/screen-q-svc.txt.gz#_Jul_11_23_43_04_584216
14:16:35 <slaweq> because when nova is preparing metadata for an instance it asks the neutron server for that instance's security groups :O
14:16:57 <slaweq> and it looks like in this case that call to neutron took most of the time and caused the problem
14:17:24 <slaweq> because of that I sent a DNM patch today to check the time-cost of those quota resync methods
14:17:36 <slaweq> njohnston_: ^^ that's explanation for Your question in review there :)
14:17:47 <slaweq> but I'm also looking at other examples
14:17:48 <haleyb> why does nova need that?  is it something given in metadata?
14:17:55 <liuyulong> We can not blame nova now, haha
14:17:56 <slaweq> haleyb: I have no idea
14:18:14 <njohnsto_> yes security groups are part of metadata
14:18:21 <liuyulong> haleyb, I have a clue
14:18:34 <slaweq> so, I was also looking at other examples and I found that it's not always this quota resync which takes a long time
14:18:52 <liuyulong> haleyb, when nova tries to sync network info it will try to get the port and its security group information.
14:19:12 <slaweq> BUT, in every case, at almost the same time as the timed-out request to metadata is sent, there is some API call in neutron which takes more than 10 seconds
14:19:51 <slaweq> so currently it looks to me like some slowdown in neutron or maybe in the db? I don't know
14:20:16 <haleyb> just a thought, but we should look at the code making the call, and make sure it's not asking for everything, but supplying a good filter
14:20:17 <liuyulong> DB slow query, maybe something like the bug we talked about last week.
14:20:47 <slaweq> haleyb: sure, but it is working fine in most cases
14:20:47 <njohnsto_> in the long term it would be good to
14:21:02 <slaweq> in the tempest job there are plenty of VMs spawned, each of them asking metadata for public-keys
14:21:17 <slaweq> and sometimes one of these queries is slow (more than 10 seconds)
14:21:49 <slaweq> and it's not the last/middle/first test AFAICT - there is no other pattern IMO
14:21:49 <liuyulong> Some related neutron API call is here: https://github.com/openstack/nova/blob/master/nova/network/neutronv2/api.py
14:22:33 <slaweq> the only common thing is that the VM is doing a GET /2009-04-04/public-keys/ request and it takes more than 10 seconds
14:22:41 <slaweq> 10 seconds is set as the timeout in the cirros script
14:22:45 <slaweq> so this fails
14:22:59 <slaweq> even if nova later sends a proper 200 response
14:23:34 <slaweq> I will try to read all the analysis from sean one more time and go through all those calls there
14:23:43 <slaweq> maybe I will find something more
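
The correlation slaweq describes (a timed-out metadata request coinciding with a neutron API call slower than 10 seconds) could be checked with something like the following. This is only a minimal sketch: the `Call` record, the `slow_calls_overlapping` helper and the sample values are hypothetical and not taken from the gate logs; only the 10 second threshold and the public-keys request come from the discussion.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical record of one request parsed out of a service log."""
    name: str        # request line, e.g. "GET /v2.0/security-groups"
    start: float     # seconds since a common reference point
    duration: float  # seconds the call took to complete

    @property
    def end(self) -> float:
        return self.start + self.duration


def slow_calls_overlapping(timed_out, neutron_calls, threshold=10.0):
    """Return neutron calls slower than `threshold` that overlap `timed_out`."""
    return [
        call
        for call in neutron_calls
        if call.duration > threshold
        and call.start < timed_out.end
        and timed_out.start < call.end
    ]


# Made-up sample values, only to exercise the function.
metadata_request = Call("GET /2009-04-04/public-keys/", start=100.0, duration=16.5)
neutron_calls = [
    Call("GET /v2.0/ports", start=99.0, duration=0.4),
    Call("GET /v2.0/security-groups", start=100.3, duration=15.8),
    Call("GET /v2.0/networks", start=130.0, duration=12.0),
]

suspects = slow_calls_overlapping(metadata_request, neutron_calls)
print([c.name for c in suspects])
```

Only calls that are both slow and overlapping the failed request are reported, which matches the observation that the slow neutron call is not always the same one.
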
14:25:23 <liuyulong> slaweq, OK, thank you for working on this. Where is your DNM patch?
14:25:37 <haleyb> slaweq: i wonder how many SGs nova is asking for after noticing this...
14:25:44 <haleyb> https://github.com/openstack/nova/blob/master/nova/network/neutronv2/api.py#L760
14:25:45 <slaweq> liuyulong: https://review.opendev.org/#/c/671300/
14:26:09 <haleyb> it is at least passing the tenant_id though
14:26:12 <liuyulong> slaweq, this should be where the metadata API calls the neutron security group list: https://github.com/openstack/nova/blob/master/nova/api/metadata/base.py#L145
14:26:14 <njohnston_> a related bug filed under the tripleo project for similar failures in queens: https://bugs.launchpad.net/tripleo/+bug/1836046
14:26:15 <openstack> Launchpad bug 1836046 in tripleo "tempest.scenario.test_network_basic_ops.TestNetworkBasicOps Failing on queens" [Critical,Triaged]
14:26:22 <slaweq> haleyb: but as I said, in other cases it wasn't exactly the same, and there was another call which took a long time
14:26:49 <slaweq> njohnston_: yep
14:27:01 <slaweq> and we also have a d/s bug in bugzilla for the same issue
14:27:21 <liuyulong> One more thing: we have added some time-cost tracking logs. They will help us find potential causes of CI failures.
14:27:26 <liuyulong> L3 RPC time-costs:
14:27:30 <liuyulong> #link http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Time-cost%3A%5C%22%20and%20NOT%20message%3A%20%5C%22start%5C%22
14:27:37 <liuyulong> L3 router processing time:
14:27:42 <liuyulong> #link http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Finished%20a%20router%20update%5C%22
14:28:38 <njohnston_> this is great stuff
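
The two logstash links above carry their search string URL-encoded in the fragment part of the URL. As a small sketch, the first one can be decoded with the Python standard library as shown here; the `dashboard_query` helper name is made up for illustration, and the URL is copied from the log.

```python
from urllib.parse import parse_qs, urlsplit

# First logstash link from the log, split only for readability.
LINK = ("http://logstash.openstack.org/#dashboard/file/logstash.json"
        "?query=message%3A%5C%22Time-cost%3A%5C%22%20and%20NOT%20"
        "message%3A%20%5C%22start%5C%22")


def dashboard_query(link):
    """Return the decoded `query` parameter stored in the URL fragment."""
    # Everything after '#' is the fragment, so the '?query=...' part is
    # not seen as a normal query string and has to be split by hand.
    fragment = urlsplit(link).fragment
    _, _, raw_query = fragment.partition("?")
    return parse_qs(raw_query)["query"][0]


print(dashboard_query(LINK))
```

The decoded text is the search for messages containing "Time-cost:" while excluding the matching "start" lines.
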
14:28:55 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1836253
14:28:56 <openstack> Launchpad bug 1836253 in neutron "Sometimes InstanceMetada API returns 404 due to invalid InstaceID returned by _get_instance_and_tenant_id()" [Medium,Confirmed] - Assigned to Bence Romsics (bence-romsics)
14:29:11 <liuyulong> And this one also looks related to the former bugs.
14:29:21 <slaweq> I thought that it may be related
14:29:41 <slaweq> but it seems that we don't have this cache configured in neutron-metadata agent in any job
14:29:48 <slaweq> so it's not the case in gate
14:30:00 <liuyulong> More like a race condition.
14:31:33 <liuyulong> Next one
14:31:52 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1806032
14:31:52 <openstack> Launchpad bug 1806032 in neutron "neutron doesn't prevent the network update from external to internal when floatingIPs present" [Low,New]
14:32:01 <liuyulong> From my understanding, work on this will proceed again.
14:32:24 <liuyulong> I don't know why a cloud wants to change the external network type, but it is indeed a neutron bug.
14:33:05 <liuyulong> Next
14:33:06 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1835914
14:33:07 <openstack> Launchpad bug 1835914 in neutron "Test test_show_network_segment_range failing" [Medium,Confirmed]
14:33:20 <liuyulong> Let's try to contact Kailun Qin, he is the original author.
14:35:06 <liuyulong> And for this new feature, our test team has reported many bugs. I will file them on Launchpad soon.
14:36:53 <liuyulong> The next two were discussed last week:
14:36:55 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1834308
14:36:56 <openstack> Launchpad bug 1834308 in neutron "[DVR][DB] too many slow query during agent restart" [Medium,Confirmed] - Assigned to LIU Yulong (dragon889)
14:36:59 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1835663
14:37:00 <openstack> Launchpad bug 1835663 in neutron "Some L3 RPCs are time-consuming especially get_routers" [Medium,Confirmed]
14:37:53 <liuyulong> Yes, I will upload some fix for the DB slow query.
14:39:18 <liuyulong> No more bugs from me today.
14:39:33 <slaweq> \o/ no more bugs \o/ :D
14:39:37 <haleyb> there was one i had
14:39:46 <haleyb> https://bugs.launchpad.net/neutron/+bug/1835731
14:39:47 <openstack> Launchpad bug 1835731 in neutron "Neutron server error: failed to update port DOWN" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
14:39:47 <slaweq> haleyb: :(
14:39:53 <haleyb> https://review.opendev.org/#/c/669640/
14:40:09 <haleyb> liuyulong had a -1 on the change, didn't know if we needed to discuss
14:40:27 <liuyulong> We have removed that config option on the master branch
14:40:54 <liuyulong> And that removal fixes the bug, IMO
14:41:19 <liuyulong> But for the stable branches, it may need another approach.
14:41:49 <ralonsoh> IMO, we can use this patch with a note and then cherry-pick to stable branches
14:42:00 <ralonsoh> the logic seems to be correct in master and stable
14:42:46 <slaweq> ralonsoh++
14:43:08 <slaweq> and later merge liuyulong's patch which removes this option in master
14:43:16 <ralonsoh> correct
14:43:52 <haleyb> ok, seems we have a way forward, just didn't want it to fall through the cracks
14:46:19 <haleyb> that's all from me
14:46:34 <ralonsoh> just a last note: https://review.opendev.org/#/c/521035/
14:46:41 <ralonsoh> reviews are welcome
14:47:41 <liuyulong> OK, Let's move on.
14:47:45 <liuyulong> #topic Routed Networks
14:48:31 <liuyulong> I have had no response to the concern about "external network with multiple segments".
14:48:31 <liuyulong> #link https://review.opendev.org/#/q/topic:bug/1764738
14:48:38 <liuyulong> No response, and not much activity on these patches.
14:49:21 <ralonsoh> maybe mlavalle can ping David
14:49:30 <wwriverrat> yes. sorry. I fear I'm a little over my head on how far-reaching the change to allow multiple segments per host is
14:51:03 <wwriverrat> Would love to have a 1-1 review with someone who has pulled ^ code and knows the intent of where it was going
14:52:01 <liuyulong> We are now facing exactly this issue: the external network has a large set of IPs, so the broadcast domain is too large.
14:53:03 <wwriverrat> Our original thought: why not allow multiple segments per network *everywhere* (thinking that for most implementations only one would be returned in a list)
14:53:28 <wwriverrat> but this crosses api, rpc and agent boundaries
14:54:43 <wwriverrat> so... I know mlavalle and I were trying to find time to have a review session. Will keep trying
14:55:40 <liuyulong> both mlavalle and tidwellr may help
14:56:12 <liuyulong> #topic On demand agenda
14:56:17 <liuyulong> #link https://blueprints.launchpad.net/neutron/+spec/openflow-based-dvr
14:56:26 <liuyulong> I've added a small NOTE here: we have abandoned it.
14:56:32 <liuyulong> And these patches should be abandoned. I don't have the rights to do that.
14:56:36 <liuyulong> #link https://review.opendev.org/#/q/topic:openflow-based-dvr+status:open
14:58:03 <liuyulong> We are running out of time.
14:58:07 <liuyulong> Let
14:58:17 <liuyulong> us stop here
14:58:21 <liuyulong> #endmeeting