16:00:06 <slaweq> #startmeeting neutron_ci
16:00:07 <openstack> Meeting started Tue Sep  3 16:00:06 2019 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:10 <openstack> The meeting name has been set to 'neutron_ci'
16:00:23 <slaweq> welcome back after a short break :)
16:00:30 <mlavalle> thanks!
16:00:34 <ralonsoh> hi
16:01:13 <slaweq> let's wait 1 or 2 minutes for njohnston and others
16:01:26 <bcafarel> o/ (though I will probably leave soon)
16:02:51 <slaweq> ok, let's start then
16:03:06 <slaweq> first of all: Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:03:14 <slaweq> please open it now so that it will be ready later :)
16:03:22 <slaweq> #topic Actions from previous meetings
16:03:36 <slaweq> ralonsoh will continue working on error patterns and open bugs for functional tests
16:03:52 <ralonsoh> yes, I found today something that could be an error
16:03:57 <ralonsoh> related to pyroute2
16:04:15 <ralonsoh> tomorrow, while reviewing the CI tests, I'll report an error if there is one
16:04:25 <ralonsoh> no other patterns found this week
16:05:04 <ralonsoh> just for information: the possible error is in test_get_devices_info_veth_different_namespaces
16:05:08 <ralonsoh> that's all
16:05:24 <slaweq> thx ralonsoh
16:05:35 <slaweq> I also saw 2 failures today which I wanted to raise here, but let's do it later in the functional tests section
16:05:50 <slaweq> next one:
16:05:52 <slaweq> mlavalle will continue debugging https://bugs.launchpad.net/neutron/+bug/1838449
16:05:53 <openstack> Launchpad bug 1838449 in neutron "Router migrations failing in the gate" [Medium,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:06:33 <mlavalle> last night I left my latest comments here: https://bugs.launchpad.net/neutron/+bug/1838449
16:07:49 <slaweq> mlavalle: so, based on Your comment, it looks like it could have been introduced by https://review.opendev.org/#/c/597567, right?
16:08:04 <slaweq> as it is some race condition when the router is updated
16:08:19 <mlavalle> well, not necessarily that patch
16:08:46 <mlavalle> there are 2 other patches that touched the "related routers" code later
16:08:55 <mlavalle> which I also mention in the bug
16:09:41 <mlavalle> the naive solution would be for each test case in the test_migration script to create its router in separate nets / subnets
16:09:54 <mlavalle> that would fix our tests
16:10:02 <mlavalle> but this is a real bug IMO
16:10:07 <mlavalle> which we need to fix
16:10:10 <mlavalle> right?
16:10:52 <slaweq> so different routers from those tests are using the same networks/subnets now?
16:10:53 <bcafarel> and backport :)
16:11:06 <mlavalle> I am assuming that
16:11:09 <slaweq> isn't it that there is one migration "per test"?
16:11:19 <slaweq> and then new network/subnet created per each test
16:11:39 <mlavalle> well, if that is the case, the problem is even worse
16:12:07 <mlavalle> why would the router under migration have related routers?
16:13:40 <slaweq> but here: https://github.com/openstack/neutron-tempest-plugin/blob/master/neutron_tempest_plugin/scenario/test_migration.py#L124 it looks like every test has got its own network and subnet
16:13:52 <slaweq> IMO it's created here: https://github.com/openstack/neutron-tempest-plugin/blob/master/neutron_tempest_plugin/scenario/test_migration.py#L129
16:14:25 <slaweq> this method is defined here https://github.com/openstack/neutron-tempest-plugin/blob/d11f4ec31ab1cf7965671817f2733c362765ebb1/neutron_tempest_plugin/scenario/base.py#L173
16:14:41 <mlavalle> see my comment just before yours please ^^^^
16:14:46 <mlavalle> we agree
16:15:23 <slaweq> yes, so it seems that it is "even worse" :)
16:16:04 <mlavalle> I have a question that I want to confirm....
16:16:44 <slaweq> shoot
16:17:07 <mlavalle> the related routers ids are sent to the L3 agent in the router update rpc message, right?
16:18:16 <slaweq> yes (IIRC)
16:18:35 <slaweq> but a router delete should also be sent to the controller in such a case, no?
16:18:38 <mlavalle> in other words, this line https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L720
16:18:56 <mlavalle> returns a non-empty list, correct?
16:19:46 <mlavalle> in the case of a router migration, what is sent from the server to the agent is a router update
16:19:56 <mlavalle> because we are not deleting the router
16:20:13 <mlavalle> we are just setting admin_state_up to false
16:20:43 <mlavalle> let's move on
16:20:48 <mlavalle> I can test that locally
16:20:54 <slaweq> but, according to Your paste:
16:21:00 <slaweq> http://paste.openstack.org/show/769795/
16:21:18 <slaweq> router 2cd0c0f0-75ab-444c-afe1-be7c6a1a7702 was deleted on compute
16:21:27 <slaweq> and was only updated on controller
16:21:29 <slaweq> why?
16:21:31 <mlavalle> yes, but that is the result of processing an update
16:21:41 <mlavalle> an update message
16:21:47 <mlavalle> not a delete message
16:22:16 <slaweq> true
16:22:18 <mlavalle> that I am 100% sure
16:23:07 <mlavalle> the difference is that when the update message contains "related routers" updates
16:23:17 <mlavalle> it is processed in a different manner
16:23:31 <mlavalle> and therefore the router is not deleted locally in the agent
16:24:14 <mlavalle> deleting the router locally means removing the network namespace from that agent, even though the router still exists
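For context, a minimal illustrative sketch of the distinction described here, assuming hypothetical helper names rather than Neutron's actual L3 agent API (the real logic lives around neutron/agent/l3/agent.py#L720):

    # Purely illustrative sketch -- the helper names are hypothetical,
    # not Neutron's real L3 agent methods.
    def process_router_update(agent, router_id, routers_from_server):
        """Keep or tear down the local namespace after a router update RPC."""
        if router_id not in routers_from_server:
            # The server no longer lists this router for the agent, so the
            # agent removes the local qrouter namespace even though the
            # router itself still exists (the "deleted on compute" case).
            agent.remove_local_router(router_id)   # hypothetical helper
            return
        # When the update also carries "related routers", those routers are
        # re-processed instead of removed, so their namespaces survive.
        for router in routers_from_server.values():
            agent.update_local_router(router)      # hypothetical helper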
16:24:15 <slaweq> but what are related routers in such case?
16:24:34 <mlavalle> that is exactly the conundrum :-)
16:24:48 <slaweq> :)
16:25:04 <mlavalle> why does the router under migration have related routers?
16:25:12 <slaweq> maybe You should send a patch with some additional debug logging to get that info later
16:25:27 <mlavalle> that is exactly the plan
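Something along these lines would probably be enough (a rough sketch of such debug logging; the variable names router_id and related_routers are assumptions about the surrounding agent code, not the actual patch):

    # Rough sketch only -- variable names are assumptions, not real code.
    from oslo_log import log as logging

    LOG = logging.getLogger(__name__)

    LOG.debug("Update for router %(router)s carried related routers: %(related)s",
              {'router': router_id, 'related': related_routers})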
16:25:30 <slaweq> ok
16:26:35 <slaweq> #action mlavalle to continue investigating router migrations issue
16:26:41 <slaweq> fine ^^?
16:26:50 <mlavalle> yeap
16:26:53 <slaweq> ok
16:27:01 <slaweq> thx a lot mlavalle for working on this
16:27:09 <slaweq> let's move on
16:27:11 <mlavalle> :-)
16:27:13 <slaweq> next action
16:27:15 <slaweq> njohnston will get the new location for periodic jobs logs
16:27:29 <njohnston> I did that at the end of the last meeting
16:27:46 <njohnston> and then contacted the glance folks for their broken job that was affecting all postgresql jobs
16:28:18 <slaweq> njohnston: do You have a link to those logs then?
16:28:30 <slaweq> just to add it to the CI meeting agenda for the future :)
16:29:40 <njohnston> you can look here: http://zuul.openstack.org/builds?pipeline=periodic-stable&project=openstack%2Fneutron
16:30:00 <njohnston> or in the buildsets view http://zuul.openstack.org/buildsets?project=openstack%2Fneutron&pipeline=periodic
16:31:24 * mlavalle needs to step out 5 minutes
16:31:26 <mlavalle> brb
16:31:32 <slaweq> thx a lot
16:31:48 <slaweq> and thx a lot for taking care of broken postgresql job
16:32:04 <slaweq> ok, let's move on to the next topic
16:32:06 <slaweq> #topic Stadium projects
16:32:12 <slaweq> first
16:32:14 <slaweq> Python 3 migration
16:32:20 <slaweq> Stadium projects etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:32:36 <slaweq> I saw today that we are in pretty good shape with networking-odl now thanks to lajoskatona
16:32:55 <slaweq> and we made some progress with networking-bagpipe
16:33:15 <slaweq> so the only one "not touched" yet is networking-midonet
16:33:38 <slaweq> anything else You want to add regarding python 3 migration?
16:33:53 <njohnston> No, I think that covers it, thanks!
16:34:39 <slaweq> thx
16:34:44 <slaweq> so next is
16:35:04 <slaweq> tempest-plugins migration
16:35:09 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_stadium_move_to_tempest_plugin_repo
16:35:17 <slaweq> any updates on this one?
16:36:10 <njohnston> I don't have any; I don't see tidwellr and mlavalle is AFK, they are who I would look to for updates.
16:36:22 <slaweq> njohnston: right
16:36:26 <slaweq> so lets move on then
16:36:48 <slaweq> #topic Grafana
16:36:56 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:38:13 <slaweq> I see that neutron-functional-python27 in gate queue was failing quite often last week
16:38:27 <slaweq> was there any known problem with it?
16:38:29 <slaweq> do You know?
16:38:55 <ralonsoh> no sorry
16:39:01 <slaweq> ok, maybe it wasn't any big problem as there were only a few runs of this job then
16:39:11 <slaweq> so it wasn't that many failures, relatively speaking
16:39:42 * mlavalle is back
16:40:04 <slaweq> from other things I see that neutron-tempest-plugin-scenario-openvswitch in check queue is failing quite often
16:40:55 <slaweq> any volunteer to check this job more deeply?
16:41:04 <slaweq> if no, I will assign it to myself for this week
16:41:17 <ralonsoh> I can take a look, but I don't know when
16:41:32 <ralonsoh> I have a very busy agenda now
16:42:01 <slaweq> ralonsoh: thx
16:42:10 <slaweq> but if You are busy, I can take a look into this
16:42:21 <slaweq> so I will assign it to myself
16:42:27 <slaweq> to not overload You :)
16:42:30 <slaweq> ok?
16:42:38 <ralonsoh> thanks!
16:43:11 <slaweq> #action slaweq to check reasons of failures of neutron-tempest-plugin-scenario-openvswitch job
16:43:50 <slaweq> from other things which I noticed in grafana, neutron-functional-with-uwsgi has a high failure rate, but it's non-voting so no big problem (yet)
16:44:19 <slaweq> I hope we will be able to focus more on stabilizing uwsgi jobs in the next cycle :)
16:44:37 <njohnston> unit test failures look pretty high - 40% to 50% today
16:45:25 <slaweq> njohnston: correct
16:45:43 <slaweq> but again, please note that it's "just" 6 runs today
16:45:59 <slaweq> so it doesn't have to be a big problem
16:46:06 <njohnston> ok
16:46:08 <njohnston> :-)
16:46:27 <slaweq> and I think that I saw some patches with failures related to the patch itself
16:46:33 <slaweq> so let's keep an eye on it :)
16:46:42 <slaweq> ok?
16:47:05 <njohnston> sounds good.  We'll have better data later - I see 10 neutron jobs in queue right now
16:47:20 <slaweq> agree
16:47:38 <njohnston> FYI for those that don't know, you can put "neutron" in http://zuul.openstack.org/status to see all neutron jobs
16:48:08 <ralonsoh> btw, test_get_devices_info_veth_different_namespaces is a problem now
16:48:24 <ralonsoh> I see many CI jobs failing because of this
16:48:33 <ralonsoh> do we have a new pyroute2 version?
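(As a side note, one quick way to confirm which pyroute2 version a test node or local tox venv actually picked up, for example:)

    # Print the installed pyroute2 version, e.g. on a held node or in a tox venv.
    import pkg_resources

    print(pkg_resources.get_distribution('pyroute2').version)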
16:48:41 <slaweq> thx njohnston
16:48:57 <slaweq> ok, so as ralonsoh started with it, let's move to next topic :)
16:48:59 <slaweq> #topic fullstack/functional
16:49:19 <ralonsoh> so yes, I need to check what is happening with this test
16:49:20 <slaweq> and indeed I saw this test failing today also; first I thought it was maybe related to the patch on which it was run
16:49:27 <ralonsoh> I'm on it now
16:49:33 <slaweq> but now it's clear it's some bigger issue
16:49:46 <slaweq> two examples of failure:
16:49:48 <slaweq> https://a23f52ac6d169d81429a-a52e23b005b6607e27c6770fa63e26fe.ssl.cf1.rackcdn.com/679462/1/gate/neutron-functional/6d6a4c1/testr_results.html.gz
16:49:50 <slaweq> https://e33ddd780e29e3545bf9-6c7fec3fffbf24afb7394804bcdecfae.ssl.cf5.rackcdn.com/679399/6/check/neutron-functional/bc96527/testr_results.html.gz
16:50:05 <ralonsoh> yes, at least it is consistent
16:50:36 <slaweq> ohh, so it's now failing 100% of the time?
16:50:45 <ralonsoh> yes
16:50:57 <slaweq> ralonsoh: will You report a bug for it?
16:51:01 <ralonsoh> yes
16:51:04 <slaweq> thx
16:51:31 <slaweq> #action ralonsoh to report bug and investigate failing test_get_devices_info_veth_different_namespaces functional test
16:51:45 <slaweq> please set it as critical :)
16:51:52 <ralonsoh> ok
16:52:11 <slaweq> thx
16:52:28 <slaweq> as I said before, I also saw 2 other failures in functional tests today
16:52:41 <slaweq> 1. neutron.tests.functional.agent.test_firewall.FirewallTestCase.test_rule_ordering_correct
16:52:47 <slaweq> https://019ab552bc17f89947ce-f1e24edd0ae51a8de312c1bf83189630.ssl.cf2.rackcdn.com/670177/7/check/neutron-functional-python27/74e7c20/testr_results.html.gz
16:52:51 <slaweq> I saw it first time
16:53:00 <slaweq> did You maybe get something similar before?
16:53:16 <ralonsoh> no, first time
16:53:52 <ralonsoh> ok, that's because we need https://review.opendev.org/#/c/679428/
16:54:41 <slaweq> ok, I will check tomorrow whether this failed test_rule_ordering_correct test was related to the patch on which it was running
16:54:54 <slaweq> #action slaweq to check reason of failure neutron.tests.functional.agent.test_firewall.FirewallTestCase.test_rule_ordering_correct
16:55:17 <slaweq> ralonsoh: we need https://review.opendev.org/#/c/679428/ to fix issue with test_get_devices_info_veth_different_namespaces ?
16:55:20 <slaweq> or for what?
16:55:23 <ralonsoh> no no
16:55:34 <ralonsoh> the last one, test_rule_ordering_correct
16:55:50 <ralonsoh> the error in this test
16:55:51 <ralonsoh> File "neutron/agent/linux/ip_lib.py", line 941, in list_namespace_pids
16:55:51 <ralonsoh> return privileged.list_ns_pids(namespace)
16:55:55 <slaweq> ahh, ok
16:55:57 <slaweq> :)
16:56:05 <slaweq> right
16:56:27 <slaweq> mlavalle: njohnston: if You will have some time, please review https://review.opendev.org/#/c/679428/ :)
16:56:37 <njohnston> slaweq ralonsoh: +2+w
16:56:43 <ralonsoh> thanks!
16:56:46 <slaweq> njohnston: thx :)
16:56:49 <slaweq> You're fast
16:56:55 <mlavalle> he is indeed
16:57:05 <slaweq> LOL
16:57:16 <slaweq> ok, and the second failed test which I saw:
16:57:17 <slaweq> neutron.tests.functional.agent.linux.test_keepalived.KeepalivedManagerTestCase.test_keepalived_spawns_conflicting_pid_vrrp_subprocess
16:57:29 <slaweq> https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/logs_66/677166/11/check/neutron-functional-python27/864c837/testr_results.html.gz
16:57:55 <slaweq> but I think I already saw something similar before
16:58:04 <ralonsoh> same problem
16:58:20 <slaweq> ahh, right
16:58:33 <slaweq> different stack trace but there is list_ns_pids in it too :)
16:58:52 <slaweq> ok, so we should be better in functional tests with Your patch
16:59:13 <slaweq> we are almost out of time
16:59:20 <slaweq> so quickly, one last thing for today
16:59:22 <slaweq> #topic Tempest/Scenario
16:59:30 <slaweq> Recently I noticed that we are testing all jobs with MySQL 5.7
16:59:32 <slaweq> So I asked on ML about Mariadb: http://lists.openstack.org/pipermail/openstack-discuss/2019-August/008925.html
16:59:34 <slaweq> And I will need to add a periodic job with mariadb to Neutron
16:59:41 <slaweq> are You ok with such a job?
16:59:43 <njohnston> +1
16:59:44 <ralonsoh> sure
17:00:13 <slaweq> mlavalle? I hope You are fine with such a job too :)
17:00:25 <mlavalle> I'm ok
17:00:28 <slaweq> #action slaweq to add mariadb periodic job
17:00:36 <slaweq> thx
17:00:41 <slaweq> so we are out of time now
17:00:43 <ralonsoh> bye!
17:00:44 <slaweq> thx for attending
17:00:47 <mlavalle> o/
17:00:47 <slaweq> #endmeeting