16:00:25 <slaweq> #startmeeting neutron_ci
16:00:26 <openstack> Meeting started Tue Jul  2 16:00:25 2019 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:29 <openstack> The meeting name has been set to 'neutron_ci'
16:00:30 <slaweq> hello again :)
16:00:49 <ralonsoh> hi
16:00:57 <mlavalle> o/
16:01:07 <haleyb> hi
16:01:21 <bcafarel> hey again
16:01:22 <slaweq> first of all
16:01:24 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:01:51 <slaweq> #topic Actions from previous meetings
16:02:26 <slaweq> last week we didn't have a meeting with the full agenda as it was only me and ralonsoh there and I didn't even have time to prepare everything
16:02:34 <slaweq> so lets check actions from 2 weeks ago now
16:02:43 <njohnston> o/
16:02:44 <slaweq> first action:
16:02:45 <slaweq> mlavalle to debug neutron-tempest-plugin-dvr-multinode-scenario failures (bug 1830763) reproducing most common failure: test_connectivity_through_2_routers
16:02:46 <openstack> bug 1830763 in neutron "Debug neutron-tempest-plugin-dvr-multinode-scenario failures" [High,Confirmed] https://launchpad.net/bugs/1830763 - Assigned to Miguel Lavalle (minsel)
16:02:58 <mlavalle> I have a few things to say about this
16:03:48 <mlavalle> 1) your neutron-tempest-plugin patch improved the situation. In fact, now I cannot reproduce the failure locally
16:04:18 <slaweq> and it also happens less in gate IMHO :)
16:04:27 <mlavalle> 2) We are still seeing failures in our jobs
16:05:41 <mlavalle> Look for example at https://review.opendev.org/#/c/668182
16:05:59 <mlavalle> this is a failure from today
16:06:02 <bcafarel> 1) is https://review.opendev.org/#/c/667547/ right?
16:06:43 <mlavalle> bcafarel: yes
16:06:53 <slaweq> I have similar conclusions basically
16:07:07 <slaweq> and I was also looking into many other failed tests in this job
16:07:17 <mlavalle> 3) I dug into http://paste.openstack.org/show/66882
16:07:31 <mlavalle> 3) I dug into http://paste.openstack.org/show/668182
16:07:34 <bcafarel> :)
16:07:53 <slaweq> is that the right link mlavalle?
16:07:55 <slaweq> :D
16:08:13 <mlavalle> Here are some notes from the tempest and L3 agent logs
16:08:20 <slaweq> I see some "Wifi cracker ...." info there :)
16:08:28 <mlavalle> http://paste.openstack.org/show/753775/
16:08:58 <mlavalle> First note that one of the routers created is c2e104a2-fb21-4f1b-b540-e4620ecc0c50
16:09:06 <mlavalle> lines 1 - 5
16:09:33 <mlavalle> Then note that (lines 8 - 34) we can ssh into the first VM
16:09:46 <mlavalle> but from there we cannot ping the second instance
16:10:06 <slaweq> mlavalle: do You have the console log from the second vm?
16:10:24 <slaweq> I'm almost sure that it couldn't get the instance-id from the metadata service
16:10:47 <slaweq> so there should be something like http://paste.openstack.org/show/753779/ in this console log
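For context, the guest's init script decides whether the boot metadata worked by polling the metadata service for an instance-id and gives up after a fixed number of tries, which is what produces the "failed to read iid from metadata" console message referenced here. A rough Python equivalent of that check — the EC2-style endpoint is the conventional one, the retry count and delay are illustrative:

    import time
    import urllib.request

    METADATA_URL = "http://169.254.169.254/latest/meta-data/instance-id"

    def read_instance_id(retries=20, delay=3):
        # Poll the metadata service the way the guest init script does;
        # retry count and delay are illustrative, not the image's exact values.
        for _ in range(retries):
            try:
                with urllib.request.urlopen(METADATA_URL, timeout=10) as resp:
                    return resp.read().decode()
            except OSError:
                time.sleep(delay)
        raise RuntimeError("failed to read iid from metadata. tried %d" % retries)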
16:10:55 <mlavalle> then, in lines 37 - 79, from the L3 agent, note that we have a traceback as a consequence of not being able to retrieve info for the router mentioned above
16:11:20 <mlavalle> mhhhhh
16:11:24 <mlavalle> wait a second
16:12:47 <mlavalle> let's look here: http://logs.openstack.org/82/668182/3/check/neutron-tempest-plugin-dvr-multinode-scenario/495737b/testr_results.html.gz
16:13:14 <slaweq> ahh, yes
16:13:22 <slaweq> so in this one it is like I supposed
16:13:39 <mlavalle> I can see that both instances got instance_id
16:13:43 <mlavalle> didn't they?
16:14:41 <haleyb> failed 20/20: up 92.68. request failed
16:14:41 <haleyb> failed to read iid from metadata. tried 20
16:14:43 <slaweq> yes
16:14:53 <slaweq> in the second failed test one instance didn't get it
16:14:56 <slaweq> sorry for noise mlavalle
16:15:21 <haleyb> i pasted that from neutron_tempest_plugin.scenario.test_connectivity.NetworkConnectivityTest failure
16:16:18 <mlavalle> haleyb: the other test, right?
16:16:58 <haleyb> mlavalle: it's the second server?
16:17:39 <mlavalle> haleyb: there are two tests that failed. I am looking at the second one test_connectivity_through_2_routers
16:17:41 <haleyb> test_connectivity_router_east_west_traffic
16:18:31 <haleyb> yes, the 2_routers test both got id's
16:18:31 <mlavalle> so the point I am trying to make is that I can see tracebacks in the l3 agent log related to getting info for the routers involved in the test
16:18:56 <mlavalle> test_connectivity_through_2_routers that is
16:19:21 <mlavalle> and that might mean that when the VM attempts the ping one of the routers is not ready
16:20:00 <slaweq> yes
16:20:09 <haleyb> ack
16:20:18 <mlavalle> I'll dig deeper in this
16:20:29 <mlavalle> but haleyb's point is still valid
16:20:34 <slaweq> thx mlavalle
16:20:55 <mlavalle> it seems we still see instances failing to get instance-id from metadata service
16:21:03 <mlavalle> I'll pursue both
16:21:08 <slaweq> that is something I also wanted to mention
16:21:19 <slaweq> it looks like we quite often get issues with getting metadata
16:21:24 <haleyb> so this is reminding me of a patch liu (?) had wrt dvr routers and provisioning?  just a thought, i'll find it
16:21:33 <slaweq> haleyb: exactly :)
16:22:08 <slaweq> and today I also sent patch https://review.opendev.org/#/c/668643/ just to check how it will work when we don't need to use the metadata service
16:22:23 <slaweq> but I don't think we should merge it
16:22:43 <haleyb> https://review.opendev.org/#/c/633871/ ?
16:22:44 <mlavalle> I was actually thinking of that last night
16:23:26 <slaweq> haleyb: yes
16:23:38 <haleyb> slaweq: the job failed there too :(
16:23:45 <slaweq> that patch can probably help a lot with this issue
16:23:55 <slaweq> haleyb: but this job failed even before tests started
16:24:19 <haleyb> ahh, you're right, can wait for a recheck
16:24:26 <slaweq> yep
16:24:33 <mlavalle> I'll investigate its impact with the findings described above ^^^^
16:24:43 <slaweq> I also found one interesting case with this test
16:24:54 <slaweq> please see log: http://logs.openstack.org/11/668411/2/check/neutron-tempest-plugin-dvr-multinode-scenario/0d123cc/compute1/logs/screen-q-meta.txt.gz#_Jul_02_03_18_47_249581
16:25:12 <slaweq> each request to metadata service took more than 10 seconds
16:25:32 <slaweq> thus e.g. public keys weren't configured on the instance and ssh was not possible
16:26:31 <slaweq> mlavalle: thx for working on this
16:26:42 <haleyb> slaweq: test ran on slow infra somewhere ?
16:26:50 <slaweq> haleyb: maybe
16:27:29 <slaweq> haleyb: but we have more issues like that, when the request to metadata/public-keys fails due to a timeout
16:27:37 <slaweq> and then ssh to the instance is not possible
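A quick way to confirm that pattern is to scan the metadata agent log for the per-request duration; a minimal sketch, assuming the eventlet-wsgi style access-log lines that end with a "time: <seconds>" field as in the log linked above:

    import re
    import sys

    TIME_RE = re.compile(r"time: (\d+\.\d+)")

    def slow_requests(path, threshold=10.0):
        # Yield access-log lines whose request took longer than the threshold.
        with open(path) as log:
            for line in log:
                match = TIME_RE.search(line)
                if match and float(match.group(1)) > threshold:
                    yield line.rstrip()

    if __name__ == "__main__":
        for line in slow_requests(sys.argv[1]):
            print(line)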
16:28:12 <mlavalle> but I am pretty sure that environment has impact on this test
16:28:24 <slaweq> #action mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
16:28:29 <slaweq> mlavalle: I agree
16:28:33 <mlavalle> since we merged your patch, I have not been able to reproduce locally
16:28:55 <mlavalle> I'm running it every half hour, with success....
16:28:57 <slaweq> in many of those cases it's IMO matter of high load on node when tests are run
16:29:00 <mlavalle> in getting a failure :-)
16:29:25 <mlavalle> without success getting a failure I meant
16:30:00 <slaweq> ok, lets move on
16:30:11 <slaweq> we have many other things to talk about :)
16:30:20 <slaweq> next action was:
16:30:21 <slaweq> slaweq to send patch to switch neutron-fwaas to python 3
16:30:32 <slaweq> I did https://review.opendev.org/#/c/666165/
16:30:44 <slaweq> and it's merged already
16:30:57 <mlavalle> +1
16:30:57 <slaweq> this week I will try to do the same for functional tests job
16:31:17 <slaweq> #action slaweq to send patch to switch functional tests job in fwaas repo to py3
16:31:17 <njohnston> thanks slaweq!
16:31:32 <slaweq> njohnston: yw
16:31:36 <slaweq> ok, next one
16:31:37 <slaweq> njohnston update dashboard when ovn job becomes voting
16:32:15 <njohnston> I have that change ready locally but I haven't pushed it yet, I'll push it in a sec
16:32:34 <slaweq> njohnston: thx
16:32:43 <slaweq> next one was:
16:32:44 <slaweq> slaweq to open bug related to missing namespace issue in functional tests
16:32:49 <slaweq> Done https://bugs.launchpad.net/neutron/+bug/1833385
16:32:50 <openstack> Launchpad bug 1833385 in neutron "Functional tests are failing due to missing namespace" [Medium,Confirmed]
16:33:11 <slaweq> but it's still unassigned
16:33:35 <slaweq> and I saw at least 2 other examples of similar failures today
16:33:56 <slaweq> so I will change Importance to High for this bug, what do You think?
16:33:58 <ralonsoh> I think this is related to https://bugs.launchpad.net/neutron/+bug/1833717
16:33:59 <openstack> Launchpad bug 1833717 in neutron "Functional tests: error during namespace creation" [High,Fix released] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:34:20 <ralonsoh> same problem as before with namespace creation/deletion/listing
16:34:53 <slaweq> ralonsoh: ok, can You then add "related-bug" or "closes-bug" tag to commit message in this patch?
16:35:04 <slaweq> and assign it to Yourself, ok?
16:35:11 <ralonsoh> IMO, during functional tests, we are sometimes overloading the privsep thread pool
16:35:31 <ralonsoh> slaweq, ok
16:35:49 <ralonsoh> and related to this: https://review.opendev.org/#/c/668682/
16:35:52 <slaweq> thx ralonsoh
16:36:13 <slaweq> yes, I was thinking about this patch :)
16:36:26 <slaweq> to add info in it that it's also related to https://bugs.launchpad.net/neutron/+bug/1833385
16:36:27 <openstack> Launchpad bug 1833385 in neutron "Functional tests are failing due to missing namespace" [Medium,Confirmed]
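For illustration, the cycle those functional tests exercise is essentially a namespace create/list/delete roundtrip; a minimal sketch using pyroute2 (which neutron's ip_lib calls through privsep), with an illustrative namespace name and requiring root:

    from pyroute2 import netns

    NS_NAME = "func-test-ns-example"  # illustrative name

    def namespace_roundtrip():
        # Create, verify and remove a namespace; the failing tests see the
        # namespace missing right after creation under heavy load.
        netns.create(NS_NAME)
        try:
            assert NS_NAME in netns.listnetns(), "namespace missing after create"
        finally:
            netns.remove(NS_NAME)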
16:37:01 <slaweq> ok, next one
16:37:02 <slaweq> ralonsoh to open bug related to failed test_server functional tests
16:37:22 <ralonsoh> let me check...
16:37:42 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/1833279
16:37:43 <openstack> Launchpad bug 1833279 in neutron "TestNeutronServer: start function not called (or not logged in the temp file)" [Medium,Confirmed] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:38:00 <ralonsoh> just added more info: https://review.opendev.org/#/c/666261/
16:38:44 <slaweq> ok, and now You will try to spot it again and investigate logs, right?
16:38:53 <ralonsoh> sure
16:39:14 <slaweq> #action ralonsoh to continue debugging TestNeutronServer: start function (bug/1833279) with new logs
16:39:17 <slaweq> thx ralonsoh
16:39:31 <slaweq> and the last one from last week was
16:39:33 <slaweq> slaweq to report bug regarding failing neutron.tests.fullstack.test_l3_agent.TestLegacyL3Agent.test_north_south_traffic tests
16:39:37 <slaweq> Done: https://bugs.launchpad.net/neutron/+bug/1833386
16:39:38 <openstack> Launchpad bug 1833386 in neutron "Fullstack test neutron.tests.fullstack.test_l3_agent.TestLegacyL3Agent.test_north_south_traffic is failing" [High,Fix released] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:39:53 <ralonsoh> yes
16:39:59 <slaweq> and this seems to be fixed by ralonsoh :)
16:40:01 <slaweq> thx a lot
16:40:01 <ralonsoh> the problem wasn't the connectivity
16:40:06 <ralonsoh> but the probe
16:40:21 <ralonsoh> and, of course, on highly loaded servers, we can have this problem
16:40:54 <ralonsoh> so it's better to remove this probe in fullstack tests, it doesn't add anything valuable
16:41:50 <slaweq> yes, and it's merged already
16:42:37 <slaweq> ok, lets move on then
16:42:39 <slaweq> #topic Stadium projects
16:42:50 <slaweq> we already talked about py3 for stadium projects
16:43:02 <slaweq> so I think there is no need to talk about it again
16:43:11 <slaweq> any updates about tempest-plugins migration?
16:43:36 <mlavalle> not from me
16:43:57 <slaweq> njohnston: any update about fwaas?
16:44:05 * njohnston is going to work on the fwaas one today if it kills him
16:44:21 <slaweq> :)
16:44:44 <slaweq> ok, so lets move on
16:44:46 <slaweq> #topic Grafana
16:45:21 <slaweq> yesterday we had a problem with neutron-tempest-plugin-designate-scenario but it's now fixed with https://review.opendev.org/#/c/668447/
16:45:49 <ralonsoh> slaweq++
16:46:21 <slaweq> from other things, I see that functional/fullstack jobs and unit tests are also going to high numbers today
16:47:02 <slaweq> but today I saw at least a couple of patches where unit test failures were related to the change itself, so maybe it will not be a big problem
16:47:05 <njohnston> even the pep8 job shows the same curve
16:47:43 <slaweq> lets maybe observe it and if You notice some repeated failure, please open a bug for it
16:47:51 <mlavalle> ack
16:47:55 <slaweq> or maybe You already saw some :)
16:49:24 <slaweq> ok, anything else You want to discuss regarding grafana?
16:50:41 <mlavalle> nope
16:50:43 <njohnston> nope
16:50:52 <slaweq> ok, lets move on then
16:51:01 <slaweq> #topic Tempest/Scenario
16:51:21 <slaweq> I have a couple of things to mention regarding scenario jobs so lets start with this one today
16:51:34 <slaweq> first of all, we recently promoted the networking-ovn job to be voting
16:51:44 <slaweq> and it's now failing about 25-30% of the time
16:51:45 <mlavalle> I noticed
16:51:56 <slaweq> I found one repeated error
16:52:02 <slaweq> so I reported it https://bugs.launchpad.net/networking-ovn/+bug/1835029
16:52:02 <openstack> Launchpad bug 1835029 in networking-ovn "test_reassign_port_between_servers failing in networking-ovn-tempest-dsvm-ovs-release CI job" [Undecided,New]
16:52:14 <slaweq> and I asked today dalvarez and jlibosva to take a look into it
16:52:39 <slaweq> I also found another issue (but this one isn't strictly related to networking-ovn)
16:53:23 <slaweq> https://bugs.launchpad.net/tempest/+bug/1835058
16:53:24 <openstack> Launchpad bug 1835058 in tempest "During instance creation tenant network should be given always " [Undecided,New]
16:53:29 <slaweq> I reported it as tempest bug for now
16:53:50 <slaweq> basically some tests are failing due to a "multiple networks found" exception from nova during vm spawning
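The tempest bug above boils down to the server create request relying on nova to pick a network; passing the tenant network explicitly removes the ambiguity. A sketch of the request body shape (the function name, names and IDs are placeholders):

    def make_server_request(name, image_id, flavor_id, network_id):
        # Pinning "networks" in the create-server body avoids nova's
        # "multiple networks found" error when the project owns several networks.
        return {
            "server": {
                "name": name,
                "imageRef": image_id,
                "flavorRef": flavor_id,
                "networks": [{"uuid": network_id}],
            }
        }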
16:54:13 <slaweq> from other issues
16:54:29 <slaweq> I noticed that from time to time the port forwarding scenario test is failing
16:54:40 <slaweq> and we don't have the console log from the vm in this test
16:54:50 <slaweq> so I sent a small patch https://review.opendev.org/668645 to add logging of the console output there
16:54:56 <slaweq> please review it if You have time
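For reference, capturing the guest console in a tempest scenario test goes through the compute servers client; a minimal sketch of the kind of helper that patch adds (the function name and exact placement in the test are illustrative):

    import logging

    LOG = logging.getLogger(__name__)

    def log_console_output(servers_client, server_id):
        # Dump the guest console so a failed port-forwarding check
        # leaves something to debug from in the job logs.
        output = servers_client.get_console_output(server_id)['output']
        LOG.debug("Console output for server %s:\n%s", server_id, output)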
16:55:40 <slaweq> and finally, during the weekend I spent some time debugging the issue with neutron-tempest-with-uwsgi
16:55:57 <slaweq> I found what is wrong there and proposed a patch to tempest https://review.opendev.org/#/c/668311/
16:56:18 <slaweq> with this patch I ran this job a couple of times in https://review.opendev.org/#/c/668312/1 and it was passing
16:56:53 <njohnston> nice!
16:56:56 <slaweq> that all from me about scenario jobs
16:57:01 <slaweq> anything else You want to add?
16:57:05 <mlavalle> Thanks for the patches
16:57:54 <slaweq> ok, lets move on then as we are almost out of time
16:57:56 <slaweq> #topic fullstack/functional
16:58:04 <slaweq> I have one more info to mention about fullstack tests
16:58:17 <slaweq> we recently merged https://review.opendev.org/#/c/664629/ and again we have failures of test neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost
16:58:43 <slaweq> so I think we should again mark it as unstable until we finally find out what is going on with this test
16:58:46 <slaweq> what do You think?
16:59:12 <ralonsoh> ok to mark the test
16:59:41 <slaweq> #action slaweq to mark test_ha_router_restart_agents_no_packet_lost as unstable again
16:59:46 <njohnston> +1
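Marking a test as unstable in neutron is typically done with the unstable_test decorator, which turns a failure into a skip while the root cause is investigated; a minimal sketch, assuming the helper in neutron.tests.base and the fullstack base class as on master at the time, with a placeholder bug reference:

    from neutron.tests import base
    from neutron.tests.fullstack import base as fullstack_base

    class TestHAL3Agent(fullstack_base.BaseFullStackTestCase):

        @base.unstable_test("bug XXXXXX - restart loses packets")  # placeholder bug ref
        def test_ha_router_restart_agents_no_packet_lost(self):
            ...  # existing test body unchanged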
16:59:54 <slaweq> ok, we are running out of time
16:59:58 <slaweq> thx for attending
17:00:00 <slaweq> #endmeeting