16:00:25 #startmeeting neutron_ci
16:00:26 Meeting started Tue Jul 2 16:00:25 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:27 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:29 The meeting name has been set to 'neutron_ci'
16:00:30 hello again :)
16:00:49 hi
16:00:57 o/
16:01:07 hi
16:01:21 hey again
16:01:22 first of all
16:01:24 Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:01:51 #topic Actions from previous meetings
16:02:26 last week we didn't have a meeting with a full agenda as there was only me and ralonsoh there and I didn't even have time to prepare everything
16:02:34 so lets check actions from 2 weeks ago now
16:02:43 o/
16:02:44 first action:
16:02:45 mlavalle to debug neutron-tempest-plugin-dvr-multinode-scenario failures (bug 1830763) reproducing most common failure: test_connectivity_through_2_routers
16:02:46 bug 1830763 in neutron "Debug neutron-tempest-plugin-dvr-multinode-scenario failures" [High,Confirmed] https://launchpad.net/bugs/1830763 - Assigned to Miguel Lavalle (minsel)
16:02:58 I have a few things to say about this
16:03:48 1) your neutron-tempest-plugin patch improved the situation. In fact, now I cannot reproduce the failure locally
16:04:18 and it also happens less in gate IMHO :)
16:04:27 2) We are still seeing failures in our jobs
16:05:41 Look for example at https://review.opendev.org/#/c/668182
16:05:59 this is a failure from today
16:06:02 1) is https://review.opendev.org/#/c/667547/ right?
16:06:43 bcafarel: yes
16:06:53 I have similar conclusions basically
16:07:07 and I was also looking into many other failed tests in this job
16:07:17 3) I dug in http://paste.openstack.org/show/66882
16:07:31 3) I dug in http://paste.openstack.org/show/668182
16:07:34 :)
16:07:53 is it a good link mlavalle?
16:07:55 :D
16:08:13 Here are some notes from the tempest and L3 agent logs
16:08:20 I see there some "Wifi cracker ...." info :)
16:08:28 http://paste.openstack.org/show/753775/
16:08:58 First note that one of the routers created is c2e104a2-fb21-4f1b-b540-e4620ecc0c50
16:09:06 lines 1 - 5
16:09:33 Then note that (lines 8 - 34) we can ssh into the first VM
16:09:46 but from there we cannot ping the second instance
16:10:06 mlavalle: do You have the console log from the second vm?
16:10:24 I'm almost sure that it couldn't get instance-id from the metadata service
16:10:47 so there should be something like http://paste.openstack.org/show/753779/ in this console log
16:10:55 then, in lines 37 - 79, from the L3 agent, note that we have a traceback as a consequence of not being able to retrieve info for the router mentioned above
16:11:20 mhhhhh
16:11:24 wait a second
16:12:47 let's look here: http://logs.openstack.org/82/668182/3/check/neutron-tempest-plugin-dvr-multinode-scenario/495737b/testr_results.html.gz
16:13:14 ahh, yes
16:13:22 so in this one it is like I supposed
16:13:39 I can see that both instances got instance_id
16:13:43 don't they?
16:14:41 failed 20/20: up 92.68. request failed
16:14:41 failed to read iid from metadata. tried 20
16:14:43 yes
16:14:53 in the second failed test one instance didn't get it
16:14:56 sorry for noise mlavalle
16:15:21 i pasted that from the neutron_tempest_plugin.scenario.test_connectivity.NetworkConnectivityTest failure
16:16:18 haleyb: the other test, right?
16:16:58 mlavalle: it's the second server?
16:17:39 haleyb: there are two tests that failed. I am looking at the second one, test_connectivity_through_2_routers
16:17:41 test_connectivity_router_east_west_traffic
16:18:31 yes, the 2_routers test both got IDs
16:18:35 so the point I am trying to make is that I can see tracebacks in the l3 agent log related to getting info for the routers involved in the test
16:18:56 test_connectivity_through_2_routers that is
16:19:21 and that might mean that when the VM attempts the ping one of the routers is not ready
16:20:00 yes
16:20:09 ack
16:20:18 I'll dig deeper into this
16:20:29 but haleyb's point is still valid
16:20:34 thx mlavalle
16:20:55 it seems we still see instances failing to get instance-id from the metadata service
16:21:03 I'll pursue both
16:21:08 that is something I also wanted to mention
16:21:19 it looks like we quite often get issues with getting metadata
16:21:24 so this is reminding me of a patch liu (?) had wrt dvr routers and provisioning? just a thought, i'll find it
16:21:33 haleyb: exactly :)
16:22:08 and today I also sent patch https://review.opendev.org/#/c/668643/ just to check how it will work when we don't need to use the metadata service
16:22:23 but I don't think we should merge it
16:22:43 https://review.opendev.org/#/c/633871/ ?
16:22:44 I was actually thinking of that last night
16:23:26 haleyb: yes
16:23:38 slaweq: the job failed there too :(
16:23:45 that patch can probably help a lot with this issue
16:23:55 haleyb: but this job failed even before tests started
16:24:19 ahh, you're right, can wait for a recheck
16:24:26 yep
16:24:33 I'll investigate its impact with the findings described above ^^^^
16:24:43 I also found one interesting case with this test
16:24:54 please see log: http://logs.openstack.org/11/668411/2/check/neutron-tempest-plugin-dvr-multinode-scenario/0d123cc/compute1/logs/screen-q-meta.txt.gz#_Jul_02_03_18_47_249581
16:25:12 each request to the metadata service took more than 10 seconds
16:25:32 thus e.g. public keys weren't configured on the instance and ssh was not possible
16:26:31 mlavalle: thx for working on this
16:26:42 slaweq: test ran on slow infra somewhere ?
16:26:50 haleyb: maybe
16:27:29 haleyb: but we have more issues like that, when the request to metadata/public-keys fails due to timeout
16:27:37 and then ssh to the instance is not possible
16:28:12 but I am pretty sure that environment has impact on this test
16:28:24 #action mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
16:28:29 mlavalle: I agree
16:28:33 since we merged your patch, I have not been able to reproduce locally
16:28:55 I'm running it every half hour, with success....
16:28:57 in many of those cases it's IMO a matter of high load on the node when tests are run
16:29:00 in getting a failure :-)
16:29:25 without success getting a failure I meant
16:30:00 ok, lets move on
16:30:11 we have many other things to talk about :)
16:30:20 next action was:
16:30:21 slaweq to send patch to switch neutron-fwaas to python 3
16:30:32 I did https://review.opendev.org/#/c/666165/
16:30:44 and it's merged already
16:30:57 +1
16:30:57 this week I will try to do the same for the functional tests job
16:31:17 #action slaweq to send patch to switch functional tests job in fwaas repo to py3
16:31:17 thanks slaweq!
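[Side note on the metadata failures discussed under the first action item above: the "failed to read iid from metadata" messages come from the guest retrying the EC2-style metadata URL. A minimal debugging sketch in Python, assuming an ssh_client object that exposes exec_command() like tempest's RemoteClient; the helper name and its use are illustrative only and not part of any existing test.]

    # Hypothetical helper: run from a scenario test after ssh'ing into the
    # guest to see whether the metadata service is reachable at all.
    # The URL is the EC2-style endpoint that cirros retries before it
    # prints "failed to read iid from metadata".
    METADATA_IID_URL = 'http://169.254.169.254/2009-04-04/meta-data/instance-id'

    def check_metadata_reachable(ssh_client, timeout=10):
        # Returns the instance-id on success, or a marker string on timeout.
        cmd = 'curl -s --max-time %d %s || echo METADATA-UNREACHABLE' % (
            timeout, METADATA_IID_URL)
        return ssh_client.exec_command(cmd)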
16:31:32 njohnston: yw
16:31:36 ok, next one
16:31:37 njohnston update dashboard when ovn job becomes voting
16:32:15 I have that change ready locally but I haven't pushed it yet, I'll push it in a sec
16:32:34 njohnston: thx
16:32:43 next one was:
16:32:44 slaweq to open bug related to missing namespace issue in functional tests
16:32:49 Done https://bugs.launchpad.net/neutron/+bug/1833385
16:32:50 Launchpad bug 1833385 in neutron "Functional tests are failing due to missing namespace" [Medium,Confirmed]
16:33:11 but it's still unassigned
16:33:35 and I saw at least 2 other examples of similar failures today
16:33:56 so I will change Importance to High for this bug, what do You think?
16:33:58 I think this is related to https://bugs.launchpad.net/neutron/+bug/1833717
16:33:59 Launchpad bug 1833717 in neutron "Functional tests: error during namespace creation" [High,Fix released] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:34:20 same problem as before with namespace creation/deletion/listing
16:34:53 ralonsoh: ok, can You then add a "related-bug" or "closes-bug" tag to the commit message in this patch?
16:35:04 and assign it to Yourself, ok?
16:35:11 IMO, during functional tests, we are sometimes overloading the privsep thread pool
16:35:31 slaweq, ok
16:35:49 and related to this: https://review.opendev.org/#/c/668682/
16:35:52 thx ralonsoh
16:36:13 yes, I was thinking about this patch :)
16:36:26 to add info in it that it's related to https://bugs.launchpad.net/neutron/+bug/1833385 also
16:36:27 Launchpad bug 1833385 in neutron "Functional tests are failing due to missing namespace" [Medium,Confirmed]
16:37:01 ok, next one
16:37:02 ralonsoh to open bug related to failed test_server functional tests
16:37:22 let me check...
16:37:42 #link https://bugs.launchpad.net/neutron/+bug/1833279
16:37:43 Launchpad bug 1833279 in neutron "TestNeutronServer: start function not called (or not logged in the temp file)" [Medium,Confirmed] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:38:00 just added more info: https://review.opendev.org/#/c/666261/
16:38:44 ok, and now You will try to spot it again and investigate the logs, right?
16:38:53 sure
16:39:14 #action ralonsoh to continue debugging TestNeutronServer: start function (bug/1833279) with new logs
16:39:17 thx ralonsoh
16:39:31 and the last one from last week was
16:39:33 slaweq to report bug regarding failing neutron.tests.fullstack.test_l3_agent.TestLegacyL3Agent.test_north_south_traffic tests
16:39:37 Done: https://bugs.launchpad.net/neutron/+bug/1833386
16:39:38 Launchpad bug 1833386 in neutron "Fullstack test neutron.tests.fullstack.test_l3_agent.TestLegacyL3Agent.test_north_south_traffic is failing" [High,Fix released] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:39:53 yes
16:39:59 and this seems to be fixed by ralonsoh :)
16:40:01 thx a lot
16:40:01 the problem wasn't the connectivity
16:40:06 but the probe
16:40:21 and, of course, on highly loaded servers, we can have this problem
16:40:54 so it's better to remove this probe in fullstack tests, it doesn't add anything valuable
16:41:50 yes, and it's merged already
16:42:37 ok, lets move on then
16:42:39 #topic Stadium projects
16:42:50 we already talked about py3 for stadium projects
16:43:02 so I think there is no need to talk about it again
16:43:11 any updates about tempest-plugins migration?
16:43:36 not from me
16:43:57 njohnston: any update about fwaas?
16:44:05 * njohnston is going to work on the fwaas one today if it kills him
16:44:21 :)
16:44:44 ok, so lets move on
16:44:46 #topic Grafana
16:45:21 yesterday we had a problem with neutron-tempest-plugin-designate-scenario but it's now fixed with https://review.opendev.org/#/c/668447/
16:45:49 slaweq++
16:46:21 from other things, I see that functional/fullstack jobs and unit tests are also going to high numbers today
16:47:02 but I saw today at least a couple of patches where unit test failures were related to the change, so maybe it will not be a big problem
16:47:05 even the pep8 job shows the same curve
16:47:43 lets observe it maybe and if You notice some repeated failure, please open a bug for it
16:47:51 ack
16:47:55 or maybe You already saw some :)
16:49:24 ok, anything else You want to talk about regarding grafana?
16:50:41 nope
16:50:43 nope
16:50:52 ok, lets move on then
16:51:01 #topic Tempest/Scenario
16:51:21 I have a couple of things to mention regarding scenario jobs so lets start with this one today
16:51:34 first of all, we promoted the networking-ovn job to be voting recently
16:51:44 and it's now failing about 25-30% of the time
16:51:45 I noticed
16:51:56 I found one repeated error
16:52:02 so I reported it https://bugs.launchpad.net/networking-ovn/+bug/1835029
16:52:02 Launchpad bug 1835029 in networking-ovn "test_reassign_port_between_servers failing in networking-ovn-tempest-dsvm-ovs-release CI job" [Undecided,New]
16:52:14 and I asked today dalvarez and jlibosva to take a look into it
16:52:39 I also found another issue (but this one isn't strictly related to networking-ovn)
16:53:23 https://bugs.launchpad.net/tempest/+bug/1835058
16:53:24 Launchpad bug 1835058 in tempest "During instance creation tenant network should be given always " [Undecided,New]
16:53:29 I reported it as a tempest bug for now
16:53:50 basically some tests are failing due to a "multiple networks found" exception from nova while spawning the vm
16:54:13 from other issues
16:54:29 I noticed that from time to time the port forwarding scenario test is failing
16:54:40 and we don't have the console log from the vm in this test
16:54:50 so I sent a small patch https://review.opendev.org/668645 to add logging of the console log there
16:54:56 please review if You have time
16:55:40 and finally, during the weekend I spent some time on debugging the issue with neutron-tempest-with-uwsgi
16:55:57 I found what is wrong there and I proposed a patch to tempest https://review.opendev.org/#/c/668311/
16:56:18 with this patch I ran this job a couple of times in https://review.opendev.org/#/c/668312/1 and it was passing
16:56:53 nice!
16:56:56 that's all from me about scenario jobs
16:57:01 anything else You want to add?
16:57:05 Thanks for the patches
16:57:54 ok, lets move on then as we are almost out of time
16:57:56 #topic fullstack/functional
16:58:04 I have one more thing to mention about fullstack tests
16:58:17 we recently merged https://review.opendev.org/#/c/664629/ and again we have failures of test neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost
16:58:43 so I think we should again mark it as unstable until we finally find out what is going on with this test
16:58:46 what do You think?
16:59:12 ok to mark the test
16:59:41 #action slaweq to mark test_ha_router_restart_agents_no_packet_lost as unstable again
16:59:46 +1
16:59:54 ok, we are running out of time
16:59:58 thx for attending
17:00:00 #endmeeting
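[For the last action item, marking a fullstack test unstable usually means applying a decorator to it; a minimal sketch, assuming neutron's unstable_test helper in neutron.tests.base. The class shown and the bug reference are illustrative placeholders, not the actual fullstack code or patch.]

    # Minimal sketch: unstable_test (assumed to live in neutron.tests.base)
    # turns an unexpected failure into a skip, so the gate is not blocked
    # while the tracking bug stays open.
    from neutron.tests import base as tests_base


    class TestHAL3Agent(tests_base.BaseTestCase):

        @tests_base.unstable_test("bug tracking restart_agents_no_packet_lost failures")
        def test_ha_router_restart_agents_no_packet_lost(self):
            pass  # real test body lives in neutron.tests.fullstack.test_l3_agent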