16:00:06 <slaweq> #startmeeting neutron_ci
16:00:07 <openstack> Meeting started Tue Sep 3 16:00:06 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:10 <openstack> The meeting name has been set to 'neutron_ci'
16:00:23 <slaweq> welcome back after a short break :)
16:00:30 <mlavalle> thanks!
16:00:34 <ralonsoh> hi
16:01:13 <slaweq> let's wait 1 or 2 minutes for njohnston and others
16:01:26 <bcafarel> o/ (though I will probably leave soon)
16:02:51 <slaweq> ok, let's start then
16:03:06 <slaweq> first of all: Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:03:14 <slaweq> please open it now so that it will be ready later :)
16:03:22 <slaweq> #topic Actions from previous meetings
16:03:36 <slaweq> ralonsoh will continue working on error patterns and open bugs for functional tests
16:03:52 <ralonsoh> yes, I found something today that could be an error
16:03:57 <ralonsoh> related to pyroute2
16:04:15 <ralonsoh> tomorrow, reviewing the CI tests, I'll report an error if there is one
16:04:25 <ralonsoh> no other patterns found this week
16:05:04 <ralonsoh> just for information: the possible error is in test_get_devices_info_veth_different_namespaces
16:05:08 <ralonsoh> that's all
16:05:24 <slaweq> thx ralonsoh
16:05:35 <slaweq> I also saw 2 failures today which I wanted to raise here, but let's do it later in the functional tests section
16:05:50 <slaweq> next one:
16:05:52 <slaweq> mlavalle will continue debugging https://bugs.launchpad.net/neutron/+bug/1838449
16:05:53 <openstack> Launchpad bug 1838449 in neutron "Router migrations failing in the gate" [Medium,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:06:33 <mlavalle> last night I left my latest comments here: https://bugs.launchpad.net/neutron/+bug/1838449
16:07:49 <slaweq> mlavalle: so, based on Your comment, it looks like it could have been introduced by https://review.opendev.org/#/c/597567, right?
16:08:04 <slaweq> as it is some race condition when a router is updated
16:08:19 <mlavalle> well, not necessarily that patch
16:08:46 <mlavalle> there are 2 other patches that have touched the "related routers" code later
16:08:55 <mlavalle> which I also mention in the bug
16:09:41 <mlavalle> the naive solution would be for each test case in the test_migration script to create its router in separate nets / subnets
16:09:54 <mlavalle> that would fix our tests
16:10:02 <mlavalle> but this is a real bug IMO
16:10:07 <mlavalle> which we need to fix
16:10:10 <mlavalle> right?
16:10:52 <slaweq> so different routers from those tests are using the same networks/subnets now?
16:10:53 <bcafarel> and backport :)
16:11:06 <mlavalle> I am assuming that
16:11:09 <slaweq> isn't it that there is one migration "per test"?
16:11:19 <slaweq> and then a new network/subnet created for each test
16:11:39 <mlavalle> well, if that is the case, the problem is even worse
16:12:07 <mlavalle> why would the router under migration have related routers?
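For readers without the test code at hand: the gate test being discussed migrates a router between legacy, DVR and HA modes and then re-checks connectivity. Below is a minimal sketch of that flow, assuming a python-neutronclient Client instance named client; it only illustrates the mechanism discussed in this meeting and is not the actual neutron-tempest-plugin code.

    # Simplified sketch of a router migration as the gate tests exercise it;
    # "client" is assumed to be a neutronclient.v2_0.client.Client instance.
    def migrate_router(client, router_id, distributed=True, ha=False):
        # The distributed/ha flags can only be changed while the router is
        # administratively down, so the flags update is wrapped in
        # admin_state_up changes.
        client.update_router(router_id, {'router': {'admin_state_up': False}})
        client.update_router(router_id,
                             {'router': {'distributed': distributed, 'ha': ha}})
        client.update_router(router_id, {'router': {'admin_state_up': True}})

As mlavalle points out below, none of these steps deletes the router on the server side; the open question is how the L3 agent processes the resulting router update messages and their "related routers" payload.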
16:13:40 <slaweq> but here: https://github.com/openstack/neutron-tempest-plugin/blob/master/neutron_tempest_plugin/scenario/test_migration.py#L124 it looks like every test has got its own network and subnet
16:13:52 <slaweq> IMO it's created here: https://github.com/openstack/neutron-tempest-plugin/blob/master/neutron_tempest_plugin/scenario/test_migration.py#L129
16:14:25 <slaweq> this method is defined here https://github.com/openstack/neutron-tempest-plugin/blob/d11f4ec31ab1cf7965671817f2733c362765ebb1/neutron_tempest_plugin/scenario/base.py#L173
16:14:41 <mlavalle> see my comment just before yours please ^^^^
16:14:46 <mlavalle> we agree
16:15:23 <slaweq> yes, so it seems that it is "even worse" :)
16:16:04 <mlavalle> I have a question that I want to confirm....
16:16:44 <slaweq> shoot
16:17:07 <mlavalle> the related routers' IDs are sent to the L3 agent in the router update rpc message, right?
16:18:16 <slaweq> yes (IIRC)
16:18:35 <slaweq> but a router delete should also be sent to the controller in such a case, no?
16:18:38 <mlavalle> in other words, this line https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L720
16:18:56 <mlavalle> returns a non-empty list, correct?
16:19:46 <mlavalle> in the case of a router migration, what is sent from the server to the agent is a router update
16:19:56 <mlavalle> because we are not deleting the router
16:20:13 <mlavalle> we are just setting admin_state_up to false
16:20:43 <mlavalle> let's move on
16:20:48 <mlavalle> I can test that locally
16:20:54 <slaweq> but, according to Your paste:
16:21:00 <slaweq> http://paste.openstack.org/show/769795/
16:21:18 <slaweq> router 2cd0c0f0-75ab-444c-afe1-be7c6a1a7702 was deleted on compute
16:21:27 <slaweq> and was only updated on controller
16:21:29 <slaweq> why?
16:21:31 <mlavalle> yes, but that is the result of processing an update
16:21:41 <mlavalle> an update message
16:21:47 <mlavalle> not a delete message
16:22:16 <slaweq> true
16:22:18 <mlavalle> of that I am 100% sure
16:23:07 <mlavalle> the difference is that when the update message contains "related routers" updates
16:23:17 <mlavalle> it is processed in a different manner
16:23:31 <mlavalle> and therefore the router is not deleted locally in the agent
16:24:14 <mlavalle> deleting the router locally means removing the network namespace from that agent, even though the router still exists
16:24:15 <slaweq> but what are the related routers in such a case?
16:24:34 <mlavalle> that is exactly the conundrum :-)
16:24:48 <slaweq> :)
16:25:04 <mlavalle> why does the router under migration have related routers?
16:25:12 <slaweq> maybe You should send a patch with some additional debug logging to get that info later
16:25:27 <mlavalle> that is exactly the plan
16:25:30 <slaweq> ok
16:26:35 <slaweq> #action mlavalle to continue investigating router migrations issue
16:26:41 <slaweq> fine ^^?
16:26:50 <mlavalle> yeap
16:26:53 <slaweq> ok
16:27:01 <slaweq> thx a lot mlavalle for working on this
16:27:09 <slaweq> let's move on
16:27:11 <mlavalle> :-)
16:27:13 <slaweq> next action
16:27:15 <slaweq> njohnston will get the new location for periodic jobs logs
16:27:29 <njohnston> I did that at the end of the last meeting
16:27:46 <njohnston> and then contacted the glance folks about their broken job that was affecting all postgresql jobs
16:28:18 <slaweq> njohnston: do You have a link to those logs then?
16:28:30 <slaweq> just to add it to the CI meeting agenda for the future :)
16:29:40 <njohnston> you can look here: http://zuul.openstack.org/builds?pipeline=periodic-stable&project=openstack%2Fneutron
16:30:00 <njohnston> or in the buildsets view http://zuul.openstack.org/buildsets?project=openstack%2Fneutron&pipeline=periodic
16:31:24 * mlavalle needs to step out for 5 minutes
16:31:26 <mlavalle> brb
16:31:32 <slaweq> thx a lot
16:31:48 <slaweq> and thx a lot for taking care of the broken postgresql job
16:32:04 <slaweq> ok, let's move on to the next topic
16:32:06 <slaweq> #topic Stadium projects
16:32:12 <slaweq> first
16:32:14 <slaweq> Python 3 migration
16:32:20 <slaweq> Stadium projects etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:32:36 <slaweq> I saw today that we are in pretty good shape with networking-odl now, thanks to lajoskatona
16:32:55 <slaweq> and we made some progress with networking-bagpipe
16:33:15 <slaweq> so the only one "not touched" yet is networking-midonet
16:33:38 <slaweq> anything else You want to add regarding the python 3 migration?
16:33:53 <njohnston> No, I think that covers it, thanks!
16:34:39 <slaweq> thx
16:34:44 <slaweq> so next is
16:35:04 <slaweq> tempest-plugins migration
16:35:09 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_stadium_move_to_tempest_plugin_repo
16:35:17 <slaweq> any updates on this one?
16:36:10 <njohnston> I don't have any; I don't see tidwellr, and mlavalle is AFK, they are who I would look to for updates.
16:36:22 <slaweq> njohnston: right
16:36:26 <slaweq> so let's move on then
16:36:48 <slaweq> #topic Grafana
16:36:56 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:38:13 <slaweq> I see that neutron-functional-python27 in the gate queue was failing quite often last week
16:38:27 <slaweq> was there any known problem with it?
16:38:29 <slaweq> do You know?
16:38:55 <ralonsoh> no, sorry
16:39:01 <slaweq> ok, maybe it wasn't any big problem as there were only a few runs of this job then
16:39:11 <slaweq> so relatively speaking it wasn't many failures
16:39:42 * mlavalle is back
16:40:04 <slaweq> among other things, I see that neutron-tempest-plugin-scenario-openvswitch in the check queue is failing quite often
16:40:55 <slaweq> any volunteer to check this job more deeply?
16:41:04 <slaweq> if not, I will assign it to myself for this week
16:41:17 <ralonsoh> I can take a look, but I don't know when
16:41:32 <ralonsoh> I have a very busy agenda now
16:42:01 <slaweq> ralonsoh: thx
16:42:10 <slaweq> but if You are busy, I can take a look at this
16:42:21 <slaweq> so I will assign it to myself
16:42:27 <slaweq> to not overload You :)
16:42:30 <slaweq> ok?
16:42:38 <ralonsoh> thanks!
16:43:11 <slaweq> #action slaweq to check reasons of failures of neutron-tempest-plugin-scenario-openvswitch job
16:43:50 <slaweq> another thing which I noticed from grafana: neutron-functional-with-uwsgi has a high failure rate, but it's non-voting so no big problem (yet)
16:44:19 <slaweq> I hope we will be able to focus more on stabilizing the uwsgi jobs in the next cycle :)
16:44:37 <njohnston> unit test failures look pretty high - 40% to 50% today
16:45:25 <slaweq> njohnston: correct
16:45:43 <slaweq> but again, please note that it's "just" 6 runs today
16:45:59 <slaweq> so it doesn't have to be a big problem
16:46:06 <njohnston> ok
16:46:08 <njohnston> :-)
16:46:27 <slaweq> and I think that I saw some patches with failures related to the patch itself
16:46:33 <slaweq> so let's keep an eye on it :)
16:46:42 <slaweq> ok?
16:47:05 <njohnston> sounds good. We'll have better data later - I see 10 neutron jobs in the queue right now
16:47:20 <slaweq> agree
16:47:38 <njohnston> FYI for those that don't know, you can put "neutron" in http://zuul.openstack.org/status to see all neutron jobs
16:48:08 <ralonsoh> btw, test_get_devices_info_veth_different_namespaces is a problem now
16:48:24 <ralonsoh> I see many CI jobs failing because of this
16:48:33 <ralonsoh> do we have a new pyroute2 version?
16:48:41 <slaweq> thx njohnston
16:48:57 <slaweq> ok, so as ralonsoh already started with it, let's move to the next topic :)
16:48:59 <slaweq> #topic fullstack/functional
16:49:19 <ralonsoh> so yes, I need to check what is happening with this test
16:49:20 <slaweq> and indeed I saw this test failing today too, at first I thought it was maybe related to the patch on which it was run
16:49:27 <ralonsoh> I'm on it now
16:49:33 <slaweq> but now it's clear it's some bigger issue
16:49:46 <slaweq> two examples of failure:
16:49:48 <slaweq> https://a23f52ac6d169d81429a-a52e23b005b6607e27c6770fa63e26fe.ssl.cf1.rackcdn.com/679462/1/gate/neutron-functional/6d6a4c1/testr_results.html.gz
16:49:50 <slaweq> https://e33ddd780e29e3545bf9-6c7fec3fffbf24afb7394804bcdecfae.ssl.cf5.rackcdn.com/679399/6/check/neutron-functional/bc96527/testr_results.html.gz
16:50:05 <ralonsoh> yes, at least it is consistent
16:50:36 <slaweq> ohh, so it's now failing 100% of the time?
16:50:45 <ralonsoh> yes
16:50:57 <slaweq> ralonsoh: will You report a bug for it?
16:51:01 <ralonsoh> yes
16:51:04 <slaweq> thx
16:51:31 <slaweq> #action ralonsoh to report bug and investigate failing test_get_devices_info_veth_different_namespaces functional test
16:51:45 <slaweq> please set it as critical :)
16:51:52 <ralonsoh> ok
16:52:11 <slaweq> thx
16:52:28 <slaweq> as I said before, I also saw 2 other failures in functional tests today
16:52:41 <slaweq> 1. neutron.tests.functional.agent.test_firewall.FirewallTestCase.test_rule_ordering_correct
16:52:47 <slaweq> https://019ab552bc17f89947ce-f1e24edd0ae51a8de312c1bf83189630.ssl.cf2.rackcdn.com/670177/7/check/neutron-functional-python27/74e7c20/testr_results.html.gz
16:52:51 <slaweq> I saw it for the first time
16:53:00 <slaweq> did You maybe see something similar before?
16:53:16 <ralonsoh> no, first time
16:53:52 <ralonsoh> ok, that's because we need https://review.opendev.org/#/c/679428/
16:54:41 <slaweq> ok, I will check tomorrow whether this failed test_rule_ordering_correct test was related to the patch on which it was running
16:54:54 <slaweq> #action slaweq to check reason of failure of neutron.tests.functional.agent.test_firewall.FirewallTestCase.test_rule_ordering_correct
16:55:17 <slaweq> ralonsoh: we need https://review.opendev.org/#/c/679428/ to fix the issue with test_get_devices_info_veth_different_namespaces ?
16:55:20 <slaweq> or for what?
16:55:23 <ralonsoh> no no
16:55:34 <ralonsoh> the last one, test_rule_ordering_correct
16:55:50 <ralonsoh> the error in this test:
16:55:51 <ralonsoh> File "neutron/agent/linux/ip_lib.py", line 941, in list_namespace_pids
16:55:51 <ralonsoh> return privileged.list_ns_pids(namespace)
16:55:55 <slaweq> ahh, ok
16:55:57 <slaweq> :)
16:56:05 <slaweq> right
16:56:27 <slaweq> mlavalle: njohnston: if You have some time, please review https://review.opendev.org/#/c/679428/ :)
16:56:37 <njohnston> slaweq ralonsoh: +2+w
16:56:43 <ralonsoh> thanks!
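Since the failures quoted above bottom out in the privileged list_ns_pids() helper and ralonsoh suspects a pyroute2 version bump, one quick local check is to confirm which pyroute2 is installed and exercise the same kind of namespace PID lookup directly. A rough diagnostic sketch (run as root), assuming pyroute2's netns module exposes create(), ns_pids() and remove() as in earlier releases; this is not Neutron's own code and the namespace name is made up:

    # Check the installed pyroute2 and try the namespace PID lookup that the
    # failing ip_lib.list_namespace_pids() path ultimately depends on.
    import pkg_resources
    from pyroute2 import netns

    print("pyroute2 version:", pkg_resources.get_distribution("pyroute2").version)

    name = "ci-debug-ns"          # hypothetical namespace used only for this check
    netns.create(name)            # requires root privileges
    try:
        # ns_pids() maps namespace name -> list of PIDs bound to that namespace
        print(netns.ns_pids().get(name, []))
    finally:
        netns.remove(name)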
16:56:46 <slaweq> njohnston: thx :)
16:56:49 <slaweq> You're fast
16:56:55 <mlavalle> he is indeed
16:57:05 <slaweq> LOL
16:57:16 <slaweq> ok, and the second failed test which I saw:
16:57:17 <slaweq> neutron.tests.functional.agent.linux.test_keepalived.KeepalivedManagerTestCase.test_keepalived_spawns_conflicting_pid_vrrp_subprocess
16:57:29 <slaweq> https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/logs_66/677166/11/check/neutron-functional-python27/864c837/testr_results.html.gz
16:57:55 <slaweq> but I think I already saw something similar before
16:58:04 <ralonsoh> same problem
16:58:20 <slaweq> ahh, right
16:58:33 <slaweq> different stack trace but there is list_ns_pids in it too :)
16:58:52 <slaweq> ok, so we should be better off in functional tests with Your patch
16:59:13 <slaweq> we are almost out of time
16:59:20 <slaweq> so quickly, one last thing for today
16:59:22 <slaweq> #topic Tempest/Scenario
16:59:30 <slaweq> Recently I noticed that we are testing all jobs with MySQL 5.7
16:59:32 <slaweq> So I asked on the ML about Mariadb: http://lists.openstack.org/pipermail/openstack-discuss/2019-August/008925.html
16:59:34 <slaweq> And I will need to add a periodic job with mariadb to Neutron
16:59:41 <slaweq> are You ok with such a job?
16:59:43 <njohnston> +1
16:59:44 <ralonsoh> sure
17:00:13 <slaweq> mlavalle? I hope You are fine with such a job too :)
17:00:25 <mlavalle> I'm ok
17:00:28 <slaweq> #action slaweq to add mariadb periodic job
17:00:36 <slaweq> thx
17:00:41 <slaweq> so we are out of time now
17:00:43 <ralonsoh> bye!
17:00:44 <slaweq> thx for attending
17:00:47 <mlavalle> o/
17:00:47 <slaweq> #endmeeting
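Once the mariadb job is added as a periodic job, its results will only appear in the periodic pipeline listings njohnston linked earlier, not in check or gate. A small sketch of pulling those results programmatically, assuming the Zuul REST API at zuul.openstack.org is exposed under /api/builds with the same project/pipeline filters as the web UI (an assumption about the endpoint, not verified here):

    # List recent periodic builds for openstack/neutron via the Zuul API.
    import requests

    resp = requests.get(
        "http://zuul.openstack.org/api/builds",
        params={"project": "openstack/neutron", "pipeline": "periodic", "limit": 20},
    )
    resp.raise_for_status()
    for build in resp.json():
        # Each entry is expected to carry the job name, result and log location.
        print(build.get("job_name"), build.get("result"), build.get("log_url"))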