14:00:14 #startmeeting neutron_l3
14:00:14 Meeting started Wed Jul 17 14:00:14 2019 UTC and is due to finish in 60 minutes. The chair is liuyulong. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:15 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:17 The meeting name has been set to 'neutron_l3'
14:00:33 o/
14:00:34 hi
14:00:40 hi
14:00:46 #chair haleyb
14:00:47 Current chairs: haleyb liuyulong
14:01:15 #topic Announcements
14:01:23 hi
14:01:25 I have a question
14:01:37 Where can we apply the early-bird discount for the Shanghai summit and PTG?
14:01:49 I have not received any discount CODE recently. If you have any information, please let us know.
14:02:37 hi
14:02:56 And we have this PTG plan etherpad now:
14:03:03 #link https://etherpad.openstack.org/p/Shanghai-Neutron-Planning
14:03:37 And one more important thing: it seems we all still do not know the final official 'U' release name.
14:03:45 It is interesting now.
14:04:23 there will be a vote eventually...
14:04:32 Chinese Pinyin has no pronunciation that starts with 'U'. But I sent a suggestion to the mailing list and, like a stone dropped into the ocean, it got no response.
14:04:39 #link http://lists.openstack.org/pipermail/openstack-discuss/2019-February/002706.html
14:06:03 Allow me to quote the contents from that mail:
14:06:11 """
14:06:27 And my name is Yulong, then 'Uylong' can be a good example to explain my suggestion, : )
14:07:47 so maybe we should propose 'Uylong' as a name of the most famous Chinese neutron core ;)
14:08:04 +2 :)
14:08:15 haleyb, when does the vote usually happen?
14:09:14 By OpenStack tradition, it should be a place name. Haha
14:09:45 liuyulong: i don't remember when exactly, but yes, a place or street or ???
14:10:14 Yes, mountains and rivers
14:10:43 Ussuri
14:11:09 the TC will eventually send an email with a place to add suggestions
14:11:41 yes, usually it was some wiki page or something like that where people were adding proposals for voting IIRC
14:12:06 Ussuri is more like a Russian word. In Chinese Pinyin, it is "Wusuli".
14:12:07 we had the same problem in Hong Kong and chose Icehouse (street)
14:13:05 not sure we will solve this in the L3 meeting though :)
14:13:17 Haha
14:13:27 OK, any other announcements?
14:13:45 Sure, let's move on.
14:13:52 #topic Bugs
14:14:01 #link http://lists.openstack.org/pipermail/openstack-discuss/2019-July/007763.html
14:14:07 Slawek Kaplonski (slaweq) was our bug deputy last week, thank you for the collection.
14:14:35 I will skip all the bugs which were fixed or whose related patches are getting merged now.
14:14:55 First one
14:14:57 #link https://bugs.launchpad.net/neutron/+bug/1836642
14:14:58 Launchpad bug 1836642 in neutron "Metadata responses are very slow sometimes" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
14:15:09 Looks like the nova metadata API was doing nothing during those 16s+.
14:15:14 Here is an example:
14:15:19 #link http://logs.openstack.org/09/666409/7/check/tempest-full/08f4c53/controller/logs/screen-n-api-meta.txt.gz#_Jul_11_23_43_01_357100
14:15:39 #link https://bugs.launchpad.net/neutron/+bug/1821912
14:15:41 Launchpad bug 1821912 in neutron "intermittent ssh failures in various scenario tests" [High,In progress] - Assigned to LIU Yulong (dragon889)
14:15:50 slaweq sent some similar logs here before.
14:16:10 sean-k-mooney was looking into it yesterday with me, and he found that in the case we were analysing most of the time was wasted on http://logs.openstack.org/09/666409/7/check/tempest-full/08f4c53/controller/logs/screen-q-svc.txt.gz#_Jul_11_23_43_04_584216
14:16:35 because when nova is preparing metadata for an instance it asks the neutron server for the instance's security groups :O
14:16:57 and it looks like in this case that call to neutron took most of the time and caused the problem
14:17:24 because of that I sent a DNM patch today to check the time-cost of those resync quota methods
14:17:36 njohnston_: ^^ that's the explanation for your question in the review there :)
14:17:47 but I'm also looking at other examples
14:17:48 why does nova need that? is it something given in metadata?
14:17:55 We can not blame nova now, haha
14:17:56 haleyb: I have no idea
14:18:14 yes security groups are part of metadata
14:18:21 haleyb, I have some clue
14:18:34 so, I was also looking at other examples and I found that it's not always this quota resync which takes a long time
14:18:52 haleyb, when nova tries to sync network info it will try to get the port and its security group information.
14:19:12 BUT, in every case, at almost the same time as the timed-out request to metadata is sent, there is some API call in neutron which takes more than 10 seconds
14:19:51 so currently it looks to me like some slowdown in neutron or maybe in db? I don't know
14:20:16 just a thought, but we should look at the code making the call, and make sure it's not asking for everything, but supplying a good filter
14:20:17 DB slow query, maybe something like the bug we talked about last week.
14:20:47 haleyb: sure, but it is working fine in most cases
14:20:47 in the long term it would be good to
14:21:02 in the tempest job there are plenty of VMs spawned, each of them asking metadata for public-keys
14:21:17 and sometimes, one of such queries is long (more than 10 seconds)
14:21:49 and it's not the last/middle/first test AFAICT - there is no other pattern IMO
14:21:49 Some related neutron API call is here: https://github.com/openstack/nova/blob/master/nova/network/neutronv2/api.py
14:22:33 the only common thing is that the VM is doing request GET /2009-04-04/public-keys/ and it takes more than 10 seconds
14:22:41 10 seconds is set as the timeout in the cirros script
14:22:45 so this fails
14:22:59 even if nova later sends a proper 200 response
14:23:34 I will try to read all the analysis from sean one more time and go through all those calls there
14:23:43 maybe I will find something more
14:25:23 slaweq, OK, thank you for working on this. Where is your DNM patch?
14:25:37 slaweq: i wonder how many SGs nova is asking for after noticing this...
14:25:44 https://github.com/openstack/nova/blob/master/nova/network/neutronv2/api.py#L760
14:25:45 liuyulong: https://review.opendev.org/#/c/671300/
14:26:09 it is at least passing the tenant_id though
14:26:12 slaweq, this should be where the metadata API calls the neutron security group list: https://github.com/openstack/nova/blob/master/nova/api/metadata/base.py#L145
14:26:14 a related bug filed under the tripleo project for similar failures in queens: https://bugs.launchpad.net/tripleo/+bug/1836046
14:26:15 Launchpad bug 1836046 in tripleo "tempest.scenario.test_network_basic_ops.TestNetworkBasicOps Failing on queens" [Critical,Triaged]
14:26:22 haleyb: but as I said, in other cases it wasn't exactly the same, and there was another call which took a long time
14:26:49 njohnston_: yep
14:27:01 and we also have a d/s bug in bugzilla for the same
14:27:21 One more thing: we have added some time-cost tracking logs. They will help us find potential causes of CI failures.
14:27:26 L3 RPC time-costs:
14:27:30 #link http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Time-cost%3A%5C%22%20and%20NOT%20message%3A%20%5C%22start%5C%22
14:27:37 L3 router processing time:
14:27:42 #link http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Finished%20a%20router%20update%5C%22
14:28:38 this is great stuff
14:28:55 #link https://bugs.launchpad.net/neutron/+bug/1836253
14:28:56 Launchpad bug 1836253 in neutron "Sometimes InstanceMetada API returns 404 due to invalid InstaceID returned by _get_instance_and_tenant_id()" [Medium,Confirmed] - Assigned to Bence Romsics (bence-romsics)
14:29:11 And this one also looks related to the former bugs.
14:29:21 I thought that it may be related
14:29:41 but it seems that we don't have this cache configured in the neutron-metadata agent in any job
14:29:48 so it's not the case in the gate
14:30:00 More like a race condition.
14:31:33 Next one
14:31:52 #link https://bugs.launchpad.net/neutron/+bug/1806032
14:31:52 Launchpad bug 1806032 in neutron "neutron doesn't prevent the network update from external to internal when floatingIPs present" [Low,New]
14:32:01 From my understanding, work on this will proceed again.
14:32:24 I don't know why a cloud would want to change the external network type, but it is indeed a neutron bug.
14:33:05 Next
14:33:06 #link https://bugs.launchpad.net/neutron/+bug/1835914
14:33:07 Launchpad bug 1835914 in neutron "Test test_show_network_segment_range failing" [Medium,Confirmed]
14:33:20 Let's try to contact Kailun Qin, he is the original author.
14:35:06 And for this new feature, our test team has reported many bugs. I will file them on Launchpad soon.
14:36:53 The next two were discussed last week:
14:36:55 #link https://bugs.launchpad.net/neutron/+bug/1834308
14:36:56 Launchpad bug 1834308 in neutron "[DVR][DB] too many slow query during agent restart" [Medium,Confirmed] - Assigned to LIU Yulong (dragon889)
14:36:59 #link https://bugs.launchpad.net/neutron/+bug/1835663
14:37:00 Launchpad bug 1835663 in neutron "Some L3 RPCs are time-consuming especially get_routers" [Medium,Confirmed]
14:37:53 Yes, I will upload a fix for the DB slow query.
14:39:18 No more bugs from me today.
14:39:33 \o/ no more bugs \o/ :D
14:39:37 there was one i had
14:39:46 https://bugs.launchpad.net/neutron/+bug/1835731
14:39:47 Launchpad bug 1835731 in neutron "Neutron server error: failed to update port DOWN" [High,In progress] - Assigned to Oleg Bondarev (obondarev)
14:39:47 haleyb: :(
14:39:53 https://review.opendev.org/#/c/669640/
14:40:09 liuyulong had a -1 on the change, didn't know if we needed to discuss
14:40:27 We have removed that config for the master branch
14:40:54 And that removal fixes the bug, IMO
14:41:19 But for the stable branches, it may need another approach.
14:41:49 IMO, we can use this patch with a note and then cherry-pick to stable branches
14:42:00 the logic seems to be correct in master and stable
14:42:46 ralonsoh++
14:43:08 and later merge liuyulong's patch which removes this option in master
14:43:16 correct
14:43:52 ok, seems we have a way forward, just didn't want it to fall through the cracks
14:46:19 that's all from me
14:46:34 just a last note: https://review.opendev.org/#/c/521035/
14:46:41 reviews are welcome
14:47:41 OK, let's move on.
14:47:45 #topic Routed Networks
14:48:31 I have gotten no response to the concern about "external network with multiple segments".
14:48:31 #link https://review.opendev.org/#/q/topic:bug/1764738
14:48:38 no response and not much activity on these patches.
14:49:21 maybe mlavalle can ping David
14:49:30 yes. sorry. I fear I'm a little over my head on how far-reaching allowing multiple segments per host is
14:51:03 Would love to have a 1-1 review with someone who has pulled ^ code and knows the intent of where it was going
14:52:01 We are now facing such an issue: the external network has a large set of IPs, and the broadcast domain is too large.
14:53:03 Our original thought: why not allow multiple segments per network *everywhere* (thinking that for most implementations only one would be returned in a list)
14:53:28 but this crosses api, rpc and agent boundaries
14:54:43 so... I know mlavalle and I were trying to find time to have a review session. Will keep trying
14:55:40 both mlavalle and tidwellr may help
14:56:12 #topic On demand agenda
14:56:17 #link https://blueprints.launchpad.net/neutron/+spec/openflow-based-dvr
14:56:26 I've added a small NOTE here: we have abandoned it.
14:56:32 And these patches should be abandoned. I have no right to do that.
14:56:36 #link https://review.opendev.org/#/q/topic:openflow-based-dvr+status:open
14:58:03 We are running out of time.
14:58:07 Let
14:58:17 us stop here
14:58:21 #endmeeting