16:00:25 <slaweq> #startmeeting neutron_ci
16:00:26 <openstack> Meeting started Tue Jul  2 16:00:25 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:29 <openstack> The meeting name has been set to 'neutron_ci'
16:00:30 <slaweq> hello again :)
16:00:49 <ralonsoh> hi
16:00:57 <mlavalle> o/
16:01:07 <haleyb> hi
16:01:21 <bcafarel> hey again
16:01:22 <slaweq> first of all
16:01:24 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:01:51 <slaweq> #topic Actions from previous meetings
16:02:26 <slaweq> last week we didn't have a meeting with the full agenda as only ralonsoh and I were there and I didn't even have time to prepare everything
16:02:34 <slaweq> so let's check the actions from 2 weeks ago now
16:02:43 <njohnston> o/
16:02:44 <slaweq> first action:
16:02:45 <slaweq> mlavalle to debug neutron-tempest-plugin-dvr-multinode-scenario failures (bug 1830763) reproducing most common failure: test_connectivity_through_2_routers
16:02:46 <openstack> bug 1830763 in neutron "Debug neutron-tempest-plugin-dvr-multinode-scenario failures" [High,Confirmed] https://launchpad.net/bugs/1830763 - Assigned to Miguel Lavalle (minsel)
16:02:58 <mlavalle> I have a few things to say about this
16:03:48 <mlavalle> 1) your neutron-tempest-plugin patch improved the situation. In fact, now I cannot reproduce the failure locally
16:04:18 <slaweq> and it also happens less in the gate IMHO :)
16:04:27 <mlavalle> 2) We are still seeing failures in our jobs
16:05:41 <mlavalle> Look for example at https://review.opendev.org/#/c/668182
16:05:59 <mlavalle> this is a failure from today
16:06:02 <bcafarel> 1) is https://review.opendev.org/#/c/667547/ right?
16:06:43 <mlavalle> bcafarel: yes
16:06:53 <slaweq> I have similar conclusions basically
16:07:07 <slaweq> and I was also looking into many other failed tests in this job
16:07:17 <mlavalle> 3) I dug in http://paste.openstack.org/show/66882
16:07:31 <mlavalle> 3) I dug in http://paste.openstack.org/show/668182
16:07:34 <bcafarel> :)
16:07:53 <slaweq> is that the right link mlavalle?
16:07:55 <slaweq> :D
16:08:13 <mlavalle> Here are some notes from the tempest and L3 agent logs
16:08:20 <slaweq> I see some "Wifi cracker ...." info in there :)
16:08:28 <mlavalle> http://paste.openstack.org/show/753775/
16:08:58 <mlavalle> First note that one of the routers created is c2e104a2-fb21-4f1b-b540-e4620ecc0c50
16:09:06 <mlavalle> lines 1 - 5
16:09:33 <mlavalle> Then note that (lines 8 - 34) we can ssh into the first VM
16:09:46 <mlavalle> but from there we cannot ping the second instance
16:10:06 <slaweq> mlavalle: do You have the console log from the second VM?
16:10:24 <slaweq> I'm almost sure that it couldn't get the instance-id from the metadata service
16:10:47 <slaweq> so there should be something like http://paste.openstack.org/show/753779/ in this console log
16:10:55 <mlavalle> then, in lines 37 - 79, from the L3 agent, note that we have a traceback as a consequence of not being able to retrieve info for the router mentioned above
16:11:20 <mlavalle> mhhhhh
16:11:24 <mlavalle> wait a second
16:12:47 <mlavalle> let's look here: http://logs.openstack.org/82/668182/3/check/neutron-tempest-plugin-dvr-multinode-scenario/495737b/testr_results.html.gz
16:13:14 <slaweq> ahh, yes
16:13:22 <slaweq> so in this one it is like I supposed
16:13:39 <mlavalle> I can see that both instances got the instance_id
16:13:43 <mlavalle> don't they?
16:14:41 <haleyb> failed 20/20: up 92.68. request failed
16:14:41 <haleyb> failed to read iid from metadata. tried 20
16:14:43 <slaweq> yes
16:14:53 <slaweq> in the second failed test one instance didn't get it
16:14:56 <slaweq> sorry for the noise mlavalle
16:15:21 <haleyb> i pasted that from the neutron_tempest_plugin.scenario.test_connectivity.NetworkConnectivityTest failure
16:16:18 <mlavalle> haleyb: the other test, right?
16:16:58 <haleyb> mlavalle: it's the second server?
16:17:39 <mlavalle> haleyb: there are two tests that failed. I am looking at the second one, test_connectivity_through_2_routers
16:17:41 <haleyb> test_connectivity_router_east_west_traffic
16:18:31 <haleyb> yes, in the 2_routers test both got ids
16:18:35 <mlavalle> so the point I am trying to make is that I can see tracebacks in the L3 agent log related to getting info for the routers involved in the test
16:18:56 <mlavalle> test_connectivity_through_2_routers that is
16:19:21 <mlavalle> and that might mean that when the VM attempts the ping one of the routers is not ready
16:20:00 <slaweq> yes
16:20:09 <haleyb> ack
16:20:18 <mlavalle> I'll dig deeper into this
16:20:29 <mlavalle> but haleyb's point is still valid
16:20:34 <slaweq> thx mlavalle
16:20:55 <mlavalle> it seems we still see instances failing to get the instance-id from the metadata service
16:21:03 <mlavalle> I'll pursue both
16:21:08 <slaweq> that is something I also wanted to mention
16:21:19 <slaweq> it looks like we quite often get issues with getting metadata
16:21:24 <haleyb> so this is reminding me of a patch liu (?) had wrt dvr routers and provisioning? just a thought, i'll find it
16:21:33 <slaweq> haleyb: exactly :)
16:22:08 <slaweq> and today I also sent patch https://review.opendev.org/#/c/668643/ just to check how it will work when we don't need to use the metadata service
16:22:23 <slaweq> but I don't think we should merge it
16:22:43 <haleyb> https://review.opendev.org/#/c/633871/ ?
16:22:44 <mlavalle> I was actually thinking of that last night
16:23:26 <slaweq> haleyb: yes
16:23:38 <haleyb> slaweq: the job failed there too :(
16:23:45 <slaweq> that patch can probably help a lot with this issue
16:23:55 <slaweq> haleyb: but that job failed even before the tests started
16:24:19 <haleyb> ahh, you're right, we can wait for a recheck
16:24:26 <slaweq> yep
16:24:33 <mlavalle> I'll investigate its impact with the findings described above ^^^^
16:24:43 <slaweq> I also found one interesting case with this test
16:24:54 <slaweq> please see the log: http://logs.openstack.org/11/668411/2/check/neutron-tempest-plugin-dvr-multinode-scenario/0d123cc/compute1/logs/screen-q-meta.txt.gz#_Jul_02_03_18_47_249581
16:25:12 <slaweq> each request to the metadata service took more than 10 seconds
16:25:32 <slaweq> thus e.g. public keys weren't configured on the instance and ssh was not possible
16:26:31 <slaweq> mlavalle: thx for working on this
16:26:42 <haleyb> slaweq: test ran on slow infra somewhere?
16:26:50 <slaweq> haleyb: maybe
16:27:29 <slaweq> haleyb: but we have more issues like that, where a request to metadata/public-keys fails due to a timeout
16:27:37 <slaweq> and then ssh to the instance is not possible
16:28:12 <mlavalle> but I am pretty sure that the environment has an impact on this test
16:28:24 <slaweq> #action mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
16:28:29 <slaweq> mlavalle: I agree
16:28:33 <mlavalle> since we merged your patch, I have not been able to reproduce it locally
16:28:55 <mlavalle> I'm running it every half hour, with success....
16:28:57 <slaweq> in many of those cases it's IMO a matter of high load on the node when the tests are run
16:29:00 <mlavalle> in getting a failure :-)
16:29:25 <mlavalle> without success getting a failure I meant
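The failure mode discussed above is the guest's cloud-init polling the metadata service for its instance-id and public keys at boot and giving up after a fixed number of attempts. A minimal sketch of that retry loop, assuming the standard 169.254.169.254 endpoint and illustrative retry/timeout values (none of these are taken from the actual job logs): if every request takes longer than the per-request timeout, the loop exhausts its retries, the ssh key never reaches the instance, and the later ssh step fails.

    # Rough illustration of the boot-time metadata fetch, not cloud-init's
    # actual code; endpoint path and retry/timeout values are assumptions.
    import time
    import urllib.request

    METADATA_URL = "http://169.254.169.254/latest/meta-data/"

    def fetch_metadata(path, retries=20, timeout=10):
        for attempt in range(1, retries + 1):
            try:
                with urllib.request.urlopen(METADATA_URL + path,
                                            timeout=timeout) as resp:
                    return resp.read().decode()
            except Exception as exc:  # timeout, connection refused, HTTP error
                print("attempt %d/%d for %s failed: %s"
                      % (attempt, retries, path, exc))
                time.sleep(1)
        raise RuntimeError("failed to read %s from metadata, tried %d times"
                           % (path, retries))

    # instance_id = fetch_metadata("instance-id")
    # ssh_key = fetch_metadata("public-keys/0/openssh-key")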
16:30:00 <slaweq> ok, let's move on
16:30:11 <slaweq> we have many other things to talk about :)
16:30:20 <slaweq> next action was:
16:30:21 <slaweq> slaweq to send patch to switch neutron-fwaas to python 3
16:30:32 <slaweq> I did https://review.opendev.org/#/c/666165/
16:30:44 <slaweq> and it's merged already
16:30:57 <mlavalle> +1
16:30:57 <slaweq> this week I will try to do the same for the functional tests job
16:31:17 <slaweq> #action slaweq to send patch to switch functional tests job in fwaas repo to py3
16:31:17 <njohnston> thanks slaweq!
16:31:32 <slaweq> njohnston: yw
16:31:36 <slaweq> ok, next one
16:31:37 <slaweq> njohnston update dashboard when ovn job becomes voting
16:32:15 <njohnston> I have that change ready locally but I haven't pushed it yet, I'll push it in a sec
16:32:34 <slaweq> njohnston: thx
16:32:43 <slaweq> next one was:
16:32:44 <slaweq> slaweq to open bug related to missing namespace issue in functional tests
16:32:49 <slaweq> Done: https://bugs.launchpad.net/neutron/+bug/1833385
16:32:50 <openstack> Launchpad bug 1833385 in neutron "Functional tests are failing due to missing namespace" [Medium,Confirmed]
16:33:11 <slaweq> but it's still unassigned
16:33:35 <slaweq> and I saw at least 2 other examples of similar failures today
16:33:56 <slaweq> so I will change the Importance to High for this bug, what do You think?
16:33:58 <ralonsoh> I think this is related to https://bugs.launchpad.net/neutron/+bug/1833717
16:33:59 <openstack> Launchpad bug 1833717 in neutron "Functional tests: error during namespace creation" [High,Fix released] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:34:20 <ralonsoh> same problem as before with namespace creation/deletion/listing
16:34:53 <slaweq> ralonsoh: ok, can You then add a "related-bug" or "closes-bug" tag to the commit message in this patch?
16:35:04 <slaweq> and assign it to Yourself, ok?
16:35:11 <ralonsoh> IMO, during functional tests, we are sometimes overloading the privsep thread pool
16:35:31 <ralonsoh> slaweq, ok
16:35:49 <ralonsoh> and related to this: https://review.opendev.org/#/c/668682/
16:35:52 <slaweq> thx ralonsoh
16:36:13 <slaweq> yes, I was thinking about this patch :)
16:36:26 <slaweq> to add info in it that it's also related to https://bugs.launchpad.net/neutron/+bug/1833385
16:36:27 <openstack> Launchpad bug 1833385 in neutron "Functional tests are failing due to missing namespace" [Medium,Confirmed]
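The namespace bugs above come down to the create/check/delete sequence the functional tests drive through the privsep daemon. A standalone sketch of that sequence, using pyroute2 directly instead of neutron's own helpers (requires root; the namespace name is illustrative): under heavy parallel load this is the step that intermittently does not see the freshly created namespace.

    # Minimal reproduction-style sketch of the operation the failing functional
    # tests exercise: create a network namespace, verify it is listed, clean up.
    # Uses pyroute2 directly; the real tests go through neutron's ip_lib and the
    # privsep daemon. Requires root privileges.
    import uuid

    from pyroute2 import netns

    name = "ci-test-" + uuid.uuid4().hex[:8]
    netns.create(name)
    try:
        assert name in netns.listnetns(), (
            "namespace %s not listed right after creation" % name)
    finally:
        netns.remove(name)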
16:37:01 <slaweq> ok, next one
16:37:02 <slaweq> ralonsoh to open bug related to failed test_server functional tests
16:37:22 <ralonsoh> let me check...
16:37:42 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/1833279
16:37:43 <openstack> Launchpad bug 1833279 in neutron "TestNeutronServer: start function not called (or not logged in the temp file)" [Medium,Confirmed] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:38:00 <ralonsoh> just added more info: https://review.opendev.org/#/c/666261/
16:38:44 <slaweq> ok, and now You will try to spot it again and investigate the logs, right?
16:38:53 <ralonsoh> sure
16:39:14 <slaweq> #action ralonsoh to continue debugging TestNeutronServer: start function (bug/1833279) with new logs
16:39:17 <slaweq> thx ralonsoh
16:39:31 <slaweq> and the last one from last week was
16:39:33 <slaweq> slaweq to report bug regarding failing neutron.tests.fullstack.test_l3_agent.TestLegacyL3Agent.test_north_south_traffic tests
16:39:37 <slaweq> Done: https://bugs.launchpad.net/neutron/+bug/1833386
16:39:38 <openstack> Launchpad bug 1833386 in neutron "Fullstack test neutron.tests.fullstack.test_l3_agent.TestLegacyL3Agent.test_north_south_traffic is failing" [High,Fix released] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:39:53 <ralonsoh> yes
16:39:59 <slaweq> and this seems to be fixed by ralonsoh :)
16:40:01 <slaweq> thx a lot
16:40:01 <ralonsoh> the problem wasn't the connectivity
16:40:06 <ralonsoh> but the probe
16:40:21 <ralonsoh> and, of course, on highly loaded servers we can have this problem
16:40:54 <ralonsoh> so it's better to remove this probe in the fullstack tests, it doesn't add anything valuable
16:41:50 <slaweq> yes, and it's merged already
16:42:37 <slaweq> ok, let's move on then
16:42:39 <slaweq> #topic Stadium projects
16:42:50 <slaweq> we already talked about py3 for stadium projects
16:43:02 <slaweq> so I think there is no need to talk about it again
16:43:11 <slaweq> any updates about the tempest-plugins migration?
16:43:36 <mlavalle> not from me
16:43:57 <slaweq> njohnston: any update about fwaas?
16:44:05 * njohnston is going to work on the fwaas one today if it kills him
16:44:21 <slaweq> :)
16:44:44 <slaweq> ok, so let's move on
16:44:46 <slaweq> #topic Grafana
16:45:21 <slaweq> yesterday we had a problem with neutron-tempest-plugin-designate-scenario but it's now fixed with https://review.opendev.org/#/c/668447/
16:45:49 <ralonsoh> slaweq++
16:46:21 <slaweq> as for other things, I see that the functional/fullstack jobs and unit tests are also going up to high numbers today
16:47:02 <slaweq> but today I saw at least a couple of patches where the unit test failures were related to the change, so maybe it will not be a big problem
16:47:05 <njohnston> even the pep8 job shows the same curve
16:47:43 <slaweq> let's observe it and if You notice some repeated failure, please open a bug for it
16:47:51 <mlavalle> ack
16:47:55 <slaweq> or maybe You already saw some :)
16:49:24 <slaweq> ok, anything else You want to discuss regarding grafana?
16:50:41 <mlavalle> nope
16:50:43 <njohnston> nope
16:50:52 <slaweq> ok, let's move on then
16:51:01 <slaweq> #topic Tempest/Scenario
16:51:21 <slaweq> I have a couple of things to mention regarding scenario jobs so let's start with those today
16:51:34 <slaweq> first of all, we promoted the networking-ovn job to be voting recently
16:51:44 <slaweq> and it's now failing about 25-30% of the time
16:51:45 <mlavalle> I noticed
16:51:56 <slaweq> I found one repeated error
16:52:02 <slaweq> so I reported it: https://bugs.launchpad.net/networking-ovn/+bug/1835029
16:52:02 <openstack> Launchpad bug 1835029 in networking-ovn "test_reassign_port_between_servers failing in networking-ovn-tempest-dsvm-ovs-release CI job" [Undecided,New]
16:52:14 <slaweq> and today I asked dalvarez and jlibosva to take a look at it
16:52:39 <slaweq> I also found another issue (but this one isn't strictly related to networking-ovn)
16:53:23 <slaweq> https://bugs.launchpad.net/tempest/+bug/1835058
16:53:24 <openstack> Launchpad bug 1835058 in tempest "During instance creation tenant network should be given always " [Undecided,New]
16:53:29 <slaweq> I reported it as a tempest bug for now
16:53:50 <slaweq> basically some tests are failing due to a "multiple networks found" exception from nova while spawning the VM
16:54:13 <slaweq> as for other issues
16:54:29 <slaweq> I noticed that the port forwarding scenario test fails from time to time
16:54:40 <slaweq> and we don't have the console log from the VM in this test
16:54:50 <slaweq> so I sent a small patch https://review.opendev.org/668645 to add logging of the console log there
16:54:56 <slaweq> please review it if You have time
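A rough sketch of the kind of console-log capture such a patch adds to a scenario test, using the compute servers client's get_console_output call; the helper name and the way the client is passed in are assumptions for illustration, not the contents of https://review.opendev.org/668645.

    # Dump the nova console output of the test servers when a scenario test
    # fails, so we can see whether cloud-init reached the metadata service.
    # The helper and its wiring are illustrative, not the actual patch.
    import logging

    LOG = logging.getLogger(__name__)

    def log_console_output(servers_client, server_ids):
        for server_id in server_ids:
            try:
                output = servers_client.get_console_output(server_id)['output']
            except Exception:
                LOG.exception("Could not fetch console log for server %s",
                              server_id)
                continue
            LOG.debug("Console output for server %s:\n%s", server_id, output)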
16:55:40 <slaweq> and finally, during the weekend I spent some time on debugging the issue with neutron-tempest-with-uwsgi
16:55:57 <slaweq> I found what is wrong there and I proposed a patch to tempest: https://review.opendev.org/#/c/668311/
16:56:18 <slaweq> with this patch I ran this job a couple of times in https://review.opendev.org/#/c/668312/1 and it was passing
16:56:53 <njohnston> nice!
16:56:56 <slaweq> that's all from me about scenario jobs
16:57:01 <slaweq> anything else You want to add?
16:57:05 <mlavalle> Thanks for the patches
16:57:54 <slaweq> ok, let's move on then as we are almost out of time
16:57:56 <slaweq> #topic fullstack/functional
16:58:04 <slaweq> I have one more thing to mention about fullstack tests
16:58:17 <slaweq> we recently merged https://review.opendev.org/#/c/664629/ and again we have failures of the test neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost
16:58:43 <slaweq> so I think we should mark it as unstable again until we finally find out what is going on with this test
16:58:46 <slaweq> what do You think?
16:59:12 <ralonsoh> ok to mark the test
16:59:41 <slaweq> #action slaweq to mark test_ha_router_restart_agents_no_packet_lost as unstable again
16:59:46 <njohnston> +1
16:59:54 <slaweq> ok, we are running out of time
16:59:58 <slaweq> thx for attending
17:00:00 <slaweq> #endmeeting
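For the "mark as unstable" action item above, neutron's test base module ships an unstable_test decorator used for exactly this kind of marking; the minimal version below only illustrates the idea (turn an unexpected failure into a skip that records the bug reference) and is not the exact neutron implementation. The bug number in the usage comment is a placeholder.

    import functools


    def unstable_test(reason):
        """Skip a test instead of failing it, recording why it is unstable."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(self, *args, **kwargs):
                try:
                    return func(self, *args, **kwargs)
                except Exception as exc:
                    self.skipTest("%s marked unstable (%s), failure was: %s"
                                  % (self.id(), reason, exc))
            return wrapper
        return decorator

    # Usage in the fullstack test class (placeholder bug reference):
    # @unstable_test("bug XXXXXXX")
    # def test_ha_router_restart_agents_no_packet_lost(self):
    #     ...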