16:00:07 <slaweq> #startmeeting neutron_ci
16:00:08 <openstack> Meeting started Tue Jul 9 16:00:07 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:10 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:12 <openstack> The meeting name has been set to 'neutron_ci'
16:00:16 <mlavalle> o/
16:00:17 <njohnston> o/
16:01:12 <slaweq> lets wait few more minutes for others
16:02:34 <ralonsoh> hi
16:02:53 <slaweq> ok, so lets start
16:02:54 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:03:08 <slaweq> please open now so it will be ready later
16:03:14 <slaweq> #topic Actions from previous meetings
16:03:17 * mlavalle triggered Grafana
16:03:22 <slaweq> mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
16:03:32 <mlavalle> I did continue looking at that
16:03:47 <mlavalle> I am still finding the metadata proxy issue
16:04:13 <mlavalle> Left detailed comment here: https://bugs.launchpad.net/neutron/+bug/1830763/comments/3
16:04:14 <openstack> Launchpad bug 1830763 in neutron "Debug neutron-tempest-plugin-dvr-multinode-scenario failures" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:04:52 <mlavalle> The summary is that the instance is ready before haproxy-metadata-proxy is created
16:04:57 <mlavalle> in one of the nodes
16:05:07 <mlavalle> and therefore fails to get its metadata
16:05:29 <njohnston> just FYI, the failure rate for that job has been dropping the past few days and is down to 32%
16:05:30 <mlavalle> this analysis was done with http://logs.openstack.org/14/668914/1/check/neutron-tempest-plugin-dvr-multinode-scenario/f2ce738/
16:06:05 <mlavalle> this patch was created by ralonsoh late last week
16:06:11 <mlavalle> I think
16:06:43 <slaweq> mlavalle: so there is some error on neutron-server side which slows down creation of router on compute node, right?
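[editor's sketch] The race mlavalle describes — the guest boots before its haproxy-metadata-proxy exists on that node, so the metadata request fails — is the classic reason metadata clients poll with retries. The helper below is purely illustrative (not the tempest or cloud-init code); all names in it are invented:

```python
import time


def wait_for_metadata(fetch, retries=10, delay=0.5):
    """Poll a metadata fetch callable until it succeeds or retries run out.

    `fetch` is any callable that raises OSError on failure (e.g. connection
    refused while the metadata proxy has not been spawned yet).
    """
    for _attempt in range(retries):
        try:
            return fetch()
        except OSError:
            time.sleep(delay)
    raise TimeoutError("metadata service never became reachable")


# Stub that "comes up" after two failed attempts, mimicking the late proxy:
calls = {"n": 0}

def fake_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("connection refused")
    return "instance-id: i-000001"

print(wait_for_metadata(fake_fetch, delay=0))  # prints "instance-id: i-000001"
```

In the failing job the guest gives up before the proxy appears, which is why reordering the setup steps (see the grafana discussion later) reduces but does not remove the race.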
16:07:01 <mlavalle> yes, that's my theory
16:08:09 <slaweq> do You know what is this "8831ed85-9ccf-48a2-92eb-ab39d3d30e89-ubuntu-bionic-rax-ord-0008" which is reported as duplicate in error message?
16:08:26 <mlavalle> no, I haven't gotten to that yet
16:08:31 <slaweq> is it agent_id-hostname pair?
16:08:33 <slaweq> or what?
16:08:37 <slaweq> ok
16:09:01 <slaweq> so You will continue this investigation, right?
16:09:06 <mlavalle> yes
16:09:10 <mlavalle> indeed
16:09:21 <mlavalle> I'll dig in this case as deeply as possible
16:09:23 <slaweq> thx mlavalle, great progress on this :)
16:09:29 <slaweq> #action mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
16:09:42 <mlavalle> the good thing is that this patch is very recent, so the logs will be there
16:09:43 <slaweq> ok, next action was
16:09:50 <slaweq> slaweq to send patch to switch functional tests job in fwaas repo to py3
16:10:11 <slaweq> Patches: https://review.opendev.org/668917 and https://review.opendev.org/668918 are sent
16:10:16 <slaweq> also https://review.opendev.org/#/c/669757/ is needed
16:10:23 <slaweq> please take a look if You will have some time
16:10:29 * njohnston will take a look
16:10:34 <slaweq> njohnston: thx
16:10:43 <slaweq> next one:
16:10:45 <slaweq> ralonsoh to continue debugging TestNeutronServer: start function (bug/1833279) with new logs
16:11:29 <ralonsoh> I didn't see any other error in the CI related to this bug
16:11:56 <ralonsoh> http://logstash.openstack.org/#/dashboard/file/logstash.json?query=build_name:neutron-functional%20AND%20message:%5C%22Timed%20out%20waiting%20for%20file%5C%22
16:13:42 <slaweq> ok, so maybe we can simply close this bug for now
16:13:49 <slaweq> what do You think?
16:13:57 <slaweq> I also didn't see it during last week
16:13:59 <ralonsoh> I'll keep an eye on this during the next week
16:14:10 <slaweq> ok, thx ralonsoh
16:14:12 <ralonsoh> and if I don't see anything, I'll close it
16:14:14 <ralonsoh> np!
16:14:17 <slaweq> ++
16:14:30 <slaweq> ok, next one
16:14:31 <mlavalle> can I say something about fwaas
16:14:32 <slaweq> slaweq to mark test_ha_router_restart_agents_no_packet_lost as unstable again
16:14:35 <slaweq> Done: https://review.opendev.org/668914
16:14:45 <slaweq> mlavalle: sure, go on
16:15:12 <mlavalle> I pinged Sridar yesterday. He responded back that he is still trying to help
16:15:25 <mlavalle> so I will organize a meeting with him
16:15:35 <mlavalle> This coming Thursday
16:15:40 <mlavalle> I'll invite njohnston
16:15:44 <njohnston> yes please
16:15:47 <bcafarel> late o/ sorry the local Q&A session took some time
16:15:59 <mlavalle> that's it
16:16:23 <slaweq> mlavalle: can You invite me to this meeting as well?
16:16:30 <mlavalle> slaweq: yes
16:16:32 <slaweq> thx
16:17:15 <slaweq> ok, lets move on then
16:17:23 <slaweq> #topic Stadium projects
16:17:31 <slaweq> first Python 3 migration
16:17:37 <slaweq> etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:17:59 <slaweq> I moved finished projects to the bottom of the document as bcafarel suggested last week
16:18:18 <njohnston> I'm currently working on the bagpipe tempest jobs
16:18:23 <slaweq> we have only 6 projects on the "not finished" list
16:18:47 <slaweq> but for most of the things we have volunteers already
16:19:09 <slaweq> I think we should ask yamamoto about networking-midonet
16:19:26 <slaweq> or is there anyone else involved in this project who we can ping? mlavalle do You know?
16:19:40 <slaweq> and thx njohnston for taking care of bagpipe :)
16:19:56 <mlavalle> let's ask yamamoto
16:20:05 <slaweq> ok, I will send him an email this week
16:20:11 <slaweq> good for You?
16:20:14 <mlavalle> yes
16:20:33 <slaweq> #action slaweq to contact yamamoto about networking-midonet py3 status
16:20:59 <slaweq> any other updates on it?
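[editor's sketch] For readers unfamiliar with the py3 migration work discussed above: switching a stadium functional job to Python 3 is usually a small Zuul/devstack change. The snippet below is purely illustrative — job and variable names are assumptions, and the actual mechanism is whatever the linked reviews (668917, 668918, 669757) do:

```yaml
# Hypothetical sketch of a py3 job switch; see the real reviews for the
# exact job names used in the neutron-fwaas repo.
- job:
    name: neutron-fwaas-functional
    parent: neutron-functional
    vars:
      devstack_localrc:
        USE_PYTHON3: true
```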
16:22:13 <slaweq> ok, lets move on then
16:22:20 <slaweq> tempest-plugins migration
16:22:26 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_stadium_move_to_tempest_plugin_repo
16:22:28 <slaweq> any updates?
16:22:54 <njohnston> I got back to the fwaas first change and made some progress with it
16:23:19 <slaweq> njohnston: yeah, I saw it today and I commented on one of Your patches already
16:23:24 <njohnston> it has 2 +2s but no +W https://review.opendev.org/#/c/643662/
16:23:42 <njohnston> I'll fix up that issue, it's in the second stage change
16:23:59 <slaweq> +W'ed now
16:24:12 <njohnston> thanks!
16:24:21 <slaweq> thanks for working on this
16:24:40 <slaweq> so we will still have vpnaas and neutron-dynamic-routing unfinished
16:25:13 <slaweq> and we should be good on this finally so gmann will be happy :)
16:25:48 <slaweq> mlavalle: tidwellr do You need any help with patches for vpnaas and neutron-dynamic-routing ? I can help with them if needed
16:26:34 <mlavalle> slaweq: if I am slowing down anybody, go ahead. but I would still like to give it a try
16:26:48 <slaweq> mlavalle: no, it's not urgent of course
16:26:52 <slaweq> I just wanted to ask :)
16:27:00 <mlavalle> thanks
16:27:34 <slaweq> ok, any other questions/updates about stadium projects?
16:28:20 <slaweq> ok, so lets move on then
16:28:26 <slaweq> #topic Grafana
16:28:31 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:29:11 <slaweq> I don't see anything urgent or "special" this week
16:29:23 <njohnston> So hey, how's about that neutron-tempest-plugin-dvr-multinode-scenario failure rate? Looking good!
16:29:53 <slaweq> exactly, it's better
16:29:59 <mlavalle> agree
16:30:29 <slaweq> I think that changing the order of connecting router interfaces and spawning the instance minimized the possibility of this race with metadata proxy
16:30:34 <slaweq> so it is failing less now
16:30:40 <slaweq> but issue is still there :/
16:31:19 <mlavalle> you mean after your patch?
16:31:24 <slaweq> still neutron-tempest-with-uwsgi is failing 100% but there is patch https://review.opendev.org/#/c/668311/ for that
16:31:29 <slaweq> mlavalle: yes
16:32:34 <slaweq> anything else related to grafana?
16:32:37 <mlavalle> nope
16:33:23 <slaweq> ok, lets move on then
16:33:25 <slaweq> #topic fullstack/functional
16:33:41 <slaweq> I found 3 different test failures in functional tests recently:
16:33:47 <slaweq> neutron.tests.functional.agent.linux.test_keepalived.KeepalivedManagerTestCase - http://logs.openstack.org/35/521035/8/check/neutron-functional/65a4ecf/testr_results.html.gz
16:34:03 <slaweq> ^^ I think ralonsoh was looking into something similar some time ago, right?
16:34:19 <ralonsoh> yes, I almost have a patch for this
16:34:41 <slaweq> it's this one which removes locks on netns privileged functions?
16:34:42 <ralonsoh> actually what I'm doing is replacing the ip_monitor in keepalived
16:34:51 <ralonsoh> which is the source of some problems
16:35:15 <ralonsoh> ooooh sorry
16:35:17 <slaweq> but in this case error was "Cannot open network namespace "qrouter-599a3366-e0ea-4a35-9497-4263b8409b00": No such file or directory"
16:35:17 <ralonsoh> one sec
16:35:28 <ralonsoh> yes, one sec (I write without reading first)
16:35:42 <ralonsoh> https://review.opendev.org/#/c/668682/
16:36:20 <ralonsoh> one line, big discussion in the patch
16:36:52 <slaweq> personally I'm fine with merging Your patch
16:37:09 <ralonsoh> (me too)
16:37:19 <slaweq> the config removal proposed by liuyulong should be IMO discussed on neutron team meeting or drivers meeting maybe
16:39:00 <slaweq> mlavalle: njohnston: please add this patch to Your list of reviews and say what is Your opinion about it
16:39:13 <mlavalle> ok
16:39:17 <slaweq> thx a lot
16:39:22 <njohnston_> I think it's a good topic for the drivers meeting, I don't think we need to use the time of the entire neutron community to talk about it
16:39:26 <njohnston_> will do
16:40:05 <slaweq> njohnston_: I agree
16:40:25 <slaweq> ok, next one was:
16:40:27 <slaweq> neutron.tests.functional.agent.linux.test_linuxbridge_arp_protect.LinuxBridgeARPSpoofTestCase - http://logs.openstack.org/70/650270/19/check/neutron-functional-python27/6c9c017/testr_results.html.gz
16:41:10 <slaweq> which I don't remember seeing in the past
16:41:13 <slaweq> did You see it before?
16:41:46 <ralonsoh> IMO, this is similar to other privsep errors
16:41:57 <ralonsoh> the privsep thread pool is limited
16:42:09 <ralonsoh> and sometimes many ops are executed at the same time
16:42:14 <slaweq> in logs there is almost nothing: http://logs.openstack.org/70/650270/19/check/neutron-functional-python27/6c9c017/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.linux.test_linuxbridge_arp_protect.LinuxBridgeARPSpoofTestCase.test_arp_protection_update.txt.gz
16:42:33 <ralonsoh> eventually some operation will not get time to be executed
16:43:02 <slaweq> ralonsoh: so do You think that maybe limiting number of workers can improve this?
16:43:32 <ralonsoh> slaweq, how many workers do we have?
16:43:38 <ralonsoh> one per core/thread?
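[editor's sketch] Two knobs are in play in the discussion around here: the stestr test-runner concurrency (which defaults to one runner per CPU, matching the 8 workers observed) and the oslo.privsep daemon's thread pool. Both sketches below are illustrative — the tox env name is assumed, and the `thread_pool_size` option should be verified against the oslo.privsep release in use:

```ini
# tox.ini sketch -- cap stestr at 4 parallel test runners instead of
# the per-CPU default (env name is an assumption):
[testenv:dsvm-functional]
commands = stestr run --concurrency 4 {posargs}

# neutron.conf sketch -- enlarge the privsep daemon's thread pool
# (defaults to the CPU count; verify this option exists in the
# deployed oslo.privsep version):
[privsep]
thread_pool_size = 16
```

Equivalently, `--concurrency 4` can be appended ad hoc after `--` when invoking tox, since the env forwards posargs to stestr.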
16:44:25 <slaweq> ralonsoh: I don't know exactly
16:44:36 <ralonsoh> slaweq, that should be the limit
16:44:39 <slaweq> ralonsoh: from log in this failed job it looks that it's not specified
16:44:47 <slaweq> and it runs 8 workers IMO
16:44:51 <ralonsoh> slaweq, no, this is infra config
16:44:55 <slaweq> http://logs.openstack.org/70/650270/19/check/neutron-functional-python27/6c9c017/job-output.txt.gz#_2019-07-08_18_54_37_453034
16:46:07 <slaweq> it simply runs "tox -edsvm-functional-python27" and nothing else
16:46:09 <slaweq> http://logs.openstack.org/70/650270/19/check/neutron-functional-python27/6c9c017/job-output.txt.gz#_2019-07-08_18_53_34_474097
16:46:25 <slaweq> so if we want to limit number of test workers we should specify it there IMO
16:46:26 <ralonsoh> slaweq, so by default one worker per thread
16:46:52 <ralonsoh> let me check how to increase, in FT, the privsep pool
16:47:27 <slaweq> ralonsoh: ok, thx a lot
16:47:36 <slaweq> #action ralonsoh to check how to increase, in FT, the privsep pool
16:47:57 <slaweq> last one which I found is
16:47:58 <slaweq> neutron.tests.functional.agent.l3.test_ha_router.L3HATestCase - http://logs.openstack.org/69/668569/3/check/neutron-functional-python27/9d345e8/testr_results.html.gz
16:48:50 <slaweq> and I see error like: http://logs.openstack.org/69/668569/3/check/neutron-functional-python27/9d345e8/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.l3.test_ha_router.L3HATestCase.test_keepalived_state_change_notification.txt.gz#_2019-07-05_10_15_10_728
16:48:55 <slaweq> in logs from this test
16:49:17 <slaweq> but I'm not sure if that is related or not
16:49:38 <slaweq> I can investigate this one during this week
16:49:54 <slaweq> #action slaweq to check neutron.tests.functional.agent.l3.test_ha_router.L3HATestCase issue
16:50:05 <njohnston> did we merge a check to validate gateway is inside the subnet? Because this looks like it would fail such a check: ['ip', 'netns', 'exec', 'qrouter-eb89b020-c3b0-410f-82ec-639b197dc05b', 'ip', 'route', 'replace', 'to', '8.8.8.0/24', 'via', '19.4.4.4']
16:50:34 <njohnston> oh, sorry, I misread that it was a default gateway, that just looks like an additional route
16:50:50 <slaweq> but it's not gateway
16:50:57 <slaweq> only extra route IIRC: http://logs.openstack.org/69/668569/3/check/neutron-functional-python27/9d345e8/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.l3.test_ha_router.L3HATestCase.test_keepalived_state_change_notification.txt.gz#_2019-07-05_10_15_10_662
16:51:03 <njohnston> yeah
16:51:28 <slaweq> so I will look deeper into it this week
16:51:50 <slaweq> and I will report a bug or send a patch, whatever will be needed :)
16:52:12 <slaweq> that's all from my side regarding functional (and fullstack) tests
16:52:22 <slaweq> do You have anything else on that topic?
16:53:23 <slaweq> ok, lets move forward then
16:53:25 <slaweq> #topic Tempest/Scenario
16:53:35 <slaweq> first of all, good news
16:53:37 <slaweq> gmann proposed new integrated-gate templates, one for networking is merged: https://review.opendev.org/#/c/668930/
16:53:43 <slaweq> and I proposed to use it in neutron: https://review.opendev.org/#/c/669815/
16:54:17 <slaweq> I will also later try to change our existing neutron-tempest jobs to inherit from these new jobs dedicated for networking
16:54:36 <slaweq> that should improve our gates as we will not run e.g. cinder or glance related tests in our gate
16:54:38 <slaweq> :)
16:54:58 <mlavalle> cool
16:55:31 <slaweq> and now lets get back to the reality :P
16:55:36 <slaweq> I opened new bug https://bugs.launchpad.net/neutron/+bug/1835914
16:55:36 <openstack> Launchpad bug 1835914 in neutron "Test test_show_network_segment_range failing" [Medium,Confirmed]
16:55:45 <slaweq> I found it at least 2 times during last week
16:56:05 <slaweq> and it is strange for me that network_segment_range object doesn't have project_id sometimes :/
16:56:24 <slaweq> if there is any volunteer who wants to take a look into this, feel free :)
16:56:32 <njohnston> I noticed weirdnesses like that with bulk port - sometimes the requests that get created in testing aren't complete
16:56:42 <mlavalle> I might give it a try later this week
16:56:52 <slaweq> njohnston: good to know
16:57:02 <haleyb> i'm always confused by the tenant_id/project_id transition, are we supposed to be using both still? shouldn't one have gone away?
16:57:22 <slaweq> haleyb: but in this case tenant_id wasn't there either
16:57:41 <njohnston> haleyb: either/or
16:58:02 <haleyb> oh, that's strange
16:58:08 <haleyb> both missing that is
16:58:08 <slaweq> haleyb: exactly
16:58:20 <njohnston> I'll try to take a look as well
16:58:42 <slaweq> I set it as medium priority as it doesn't happen often, but it happens from time to time so there is definitely something to check :)
16:58:51 <slaweq> thx njohnston and mlavalle for taking care of it
16:59:05 <slaweq> ok, we are running out of time now
16:59:17 <slaweq> thx for attending
16:59:21 <mlavalle> o/
16:59:23 <njohnston> o/
16:59:30 <slaweq> have a great week and see You online :)
16:59:32 <slaweq> o/
16:59:36 <slaweq> #endmeeting