16:00:40 <slaweq> #startmeeting neutron_ci
16:00:41 <openstack> Meeting started Tue Jul 30 16:00:40 2019 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:42 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:42 <slaweq> hi again
16:00:44 <openstack> The meeting name has been set to 'neutron_ci'
16:00:46 <ralonsoh> hi
16:00:46 <mlavalle> o/
16:01:31 <slaweq> I know that haleyb and bcafarel will be late so I think we can start
16:01:40 <slaweq> I hope njohnston will join us soon :)
16:01:47 <openstack> slaweq: Error: Can't start another meeting, one is in progress.  Use #endmeeting first.
16:01:53 <slaweq> #undo
16:01:55 <slaweq> #topic Actions from previous meetings
16:02:05 <slaweq> first one:
16:02:07 <slaweq> mlavalle to report bug with router migrations
16:02:24 <mlavalle> I didn't report the bug but I started working on fixing it
16:02:32 <slaweq> :)
16:02:33 <mlavalle> I'll report it today
16:02:37 <slaweq> thx a lot
16:02:54 <slaweq> do You have any idea what the root cause of this failure is?
16:03:07 <mlavalle> haven't got to the root yet
16:03:23 <slaweq> ok, so please report it this week so that we can track it
16:03:28 <slaweq> #action mlavalle to report bug with router migrations
16:03:33 <mlavalle> but the problem is that the tests are failing because once the router is updated with...
16:03:50 <mlavalle> admin_state_up False
16:04:29 <mlavalle> the router service ports (at least the one used for the interface) never get set to DOWN
16:04:57 <slaweq> I remember we already had such an issue with some kind of routers in the past
16:04:57 <mlavalle> I can see it being removed from the hypervisor where it was originally scheduled
16:05:29 <mlavalle> but the server doesn't catch it
16:05:39 <mlavalle> that's where I am right now
16:06:13 <slaweq> ok, I think You should maybe look into the ovs-agent, as IMHO that agent is responsible for updating the port status to DOWN or UP
16:06:32 <mlavalle> yeap, that's where I am looking presently
16:06:38 <slaweq> great
16:06:40 <slaweq> thx mlavalle
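(For context, a rough sketch of the expectation the failing migration tests boil down to, written with openstacksdk; the cloud name, router name, and polling loop below are illustrative assumptions, not the actual tempest test code.)

    # Illustrative check: after setting a router's admin_state_up to False,
    # its service ports are expected to transition to DOWN.
    # Cloud name 'devstack-admin' and router name 'router1' are assumptions.
    import time
    import openstack

    conn = openstack.connect(cloud='devstack-admin')
    router = conn.network.find_router('router1')
    conn.network.update_router(router, is_admin_state_up=False)

    deadline = time.time() + 120
    statuses = []
    while time.time() < deadline:
        statuses = [p.status for p in conn.network.ports(device_id=router.id)]
        if statuses and all(s == 'DOWN' for s in statuses):
            break  # ports went DOWN, as the tests expect
        time.sleep(5)
    else:
        print('router service ports never reached DOWN: %s' % statuses)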
16:07:12 <slaweq> ok, lets move on
16:07:16 <slaweq> next action item
16:07:18 <slaweq> ralonsoh to report bug with qos scenario test failures
16:07:38 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/1838068
16:07:39 <openstack> Launchpad bug 1838068 in neutron ""QoSTest:test_qos_basic_and_update" failing in DVR node scenario" [Undecided,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:07:47 <ralonsoh> and patch: #link https://review.opendev.org/#/c/673023/
16:08:00 <ralonsoh> in 20 secs:
16:08:34 <ralonsoh> force the ns process to stop, close the socket from the test machine and set a socket timeout, so the check can be retried if there is still time
16:08:45 <ralonsoh> that's all
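(A minimal illustration of the retry pattern ralonsoh describes above, assuming a plain TCP check; the function name, timeouts, and deadline are made up for the example and are not the actual patch code.)

    # Sketch of the approach: set a timeout on the test-side socket, always
    # close it, and keep rechecking until an overall deadline runs out.
    import socket
    import time

    def check_connectivity(host, port, deadline=60, sock_timeout=10):
        """Return True if a TCP connection succeeds before the deadline."""
        start = time.time()
        while time.time() - start < deadline:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.settimeout(sock_timeout)  # never block forever on connect
            try:
                sock.connect((host, port))
                return True
            except OSError:
                time.sleep(1)  # there is still time, recheck
            finally:
                sock.close()  # always close the socket from the test machine
        return False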
16:09:26 <bcafarel> late hi o/ (as promised)
16:09:35 <slaweq> I hope this will help ralonsoh :)
16:09:38 <slaweq> thx for the patch
16:09:47 <ralonsoh> no problem!
16:09:49 <slaweq> btw. I forgot at the beginning: http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?orgId=1
16:09:55 <slaweq> please open it to be ready later :)
16:10:06 <slaweq> ok, next one
16:10:09 <slaweq> slaweq to take a look at issue with dvr and metadata: https://bugs.launchpad.net/neutron/+bug/1830763
16:10:10 <openstack> Launchpad bug 1830763 in neutron "Debug neutron-tempest-plugin-dvr-multinode-scenario failures" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:10:20 <slaweq> I did, my findings are in https://bugs.launchpad.net/neutron/+bug/1830763/comments/13 and I proposed patch https://review.opendev.org/#/c/673331/
16:10:51 <slaweq> long story short: I found out that there is a race condition and sometimes one L3 agent can end up with 2 "floating ip agent gateway" ports created for a network
16:11:24 <slaweq> and that later causes an error in the L3 agent during configuration of one of the routers, so metadata is not reachable in that router
16:11:54 <slaweq> so the workaround which I proposed now should help solve this problem in the gate, as we are always using only a single controller node there
16:12:12 <slaweq> and this can also be backported to stable branches if needed
16:12:38 <slaweq> but a proper fix will IMO require some db changes to provide a correct constraint at the db level for that kind of port
16:12:43 <slaweq> I will work on it later
16:13:45 <mlavalle> so constrain the db, and if you get a duplicate when creating the gateway, ignore it?
16:13:57 <njohnston> o/ sorry I am late
16:15:05 <slaweq> mlavalle: basically yes, something like that
16:15:19 <mlavalle> ack
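(A minimal sketch of the kind of db-level constraint discussed above, written as a hypothetical Alembic migration; the constraint, table, and column names are assumptions rather than the actual neutron schema, and the real fix may look different.)

    # Hypothetical migration: allow only one floating IP agent gateway port
    # per (network, L3 agent host) pair, enforced by the database itself.
    from alembic import op

    def upgrade():
        op.create_unique_constraint(
            'uniq_fip_agent_gw0network_id0host',  # illustrative constraint name
            'dvr_fip_agent_gateway_ports',        # illustrative table name
            ['network_id', 'agent_host'],         # illustrative column names
        )

With such a constraint in place, the agent-side code could catch the duplicate-entry error on insert and reuse the existing gateway port, matching the "ignore the duplicate" idea above.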
16:16:28 <slaweq> ok, next one
16:16:29 <slaweq> ralonsoh to try a patch to reduce the number of workers in FT
16:17:05 <ralonsoh> slaweq, I've seen that the number of problems we have now in zuul is lower than 2/3 weeks ago
16:17:17 <ralonsoh> and this patch will slow down the FT execution
16:17:22 <ralonsoh> can we hold this patch?
16:17:38 <slaweq> ralonsoh: so do You think that we should just wait to see how it will be in the future?
16:17:43 <ralonsoh> yes
16:18:07 <slaweq> +1 for that, I also didn't see many such issues last week
16:18:09 <ralonsoh> reducing the number of workers (from 8 to 7) will reduce the speed of FT execution a lot
16:18:37 <slaweq> ralonsoh: do You know by how much it will slow down the job?
16:18:56 <ralonsoh> almost proportionally to the core reduction
16:19:20 <ralonsoh> in this case, 12.5%
16:20:03 <slaweq> so second question: do You know by how much it may improve stability of tests? :)
16:20:22 <ralonsoh> slaweq, I can't answer this question
16:20:34 <slaweq> ralonsoh: I thought so :)
16:20:52 <slaweq> ok, lets maybe keep it as our last possible thing to do
16:21:08 <slaweq> thx ralonsoh for checking that
16:21:11 <slaweq> next one
16:21:12 <ralonsoh> np!
16:21:12 <slaweq> ralonsoh to report a bug and investigate failed test neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_bw_limit_qos_port_removed
16:21:30 <ralonsoh> that's the previous one
16:22:08 <ralonsoh> nope, my bad
16:22:16 <ralonsoh> no sorry, I didn't have time for this one
16:22:26 <slaweq> ok, no problem
16:22:35 <slaweq> can I assign it to You for next week then?
16:22:44 <slaweq> just to report it at least :)
16:22:46 <ralonsoh> I hope I have time, yes
16:22:50 <slaweq> thx
16:23:03 <slaweq> #action ralonsoh to report a bug about failed test neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_bw_limit_qos_port_removed
16:23:08 <slaweq> thx ralonsoh
16:23:14 <slaweq> ok, that's all from last week
16:23:19 <slaweq> any questions/comments?
16:24:06 <slaweq> ok, so lets move on
16:24:15 <slaweq> #topic Stadium projects
16:24:23 <slaweq> Python 3 migration
16:24:25 <slaweq> Stadium projects etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:24:38 <njohnston> I think we covered that really well in the neutron team meeting
16:24:39 <slaweq> we already discussed that at the neutron meeting today
16:24:44 <slaweq> right, njohnston
16:24:50 <njohnston> slaweq++
16:24:54 <slaweq> :)
16:25:03 <slaweq> so lets move quickly to the second part of this topic
16:25:05 <slaweq> tempest-plugins migration
16:25:07 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_stadium_move_to_tempest_plugin_repo
16:25:15 <slaweq> any progress on this?
16:26:19 <njohnston> I'll update the second part of the fwaas change today; escalations got the better of me this week
16:27:03 <slaweq> sure, thx njohnston
16:27:18 <slaweq> I know that tidwellr has also been making some progress on neutron-dynamic-routing recently
16:27:40 <bcafarel> I saw some recent updates by tidwellr on https://review.opendev.org/#/c/652099/ (though it's still in zuul checks)
16:27:52 <slaweq> yep, it is
16:27:57 <mlavalle> and I will try to make progress with vpn
16:28:46 <slaweq> so we are covered on this topic and I hope we will be ready with this by the end of the T cycle
16:28:54 <bcafarel> not sure if it's already in the latest revisions, but we should make sure all these newly moved plugins have a config switch to enable/disable them
16:29:06 <bcafarel> (as added for the 3 completed ones for the 0.4.0 release)
16:29:15 <slaweq> bcafarel: good point
16:30:42 <slaweq> ok, I think we can move on to the next topic then
16:30:44 <slaweq> #topic Grafana
16:30:51 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:32:04 <slaweq> we didn't have many commits in the gate queue recently, so there is no data there for the last few days
16:32:14 <slaweq> but lets look at check queue graphs
16:32:28 <njohnston> interesting that in the last couple of hours there is a spike in failures across multiple charts in the check queue.  I wonder if someone is just pushing some really crappy changes.
16:33:25 <haleyb> i was nowhere near the check queue :)
16:33:32 <njohnston> lol
16:34:35 <slaweq> njohnston: IMHO it is just getting back to normal as there was almost nothing running during the weekend
16:34:50 <njohnston> oh, that makes sense
16:35:10 <slaweq> njohnston: but that is only my theory - lets keep an eye on it for the next few days :)
16:35:18 <njohnston> sounds good
16:35:30 <haleyb> is the midonet co-gating job healthy?  it's been 100% failure (non-voting)
16:35:40 <slaweq> haleyb: no, it's not
16:35:56 <njohnston> yeah, I think we mentioned that last week, yamamoto needs to take a look
16:36:15 <mlavalle> I think he also mentioned he doesn't have much time
16:37:32 <slaweq> I will take a look at this job this week and try to check whether it's always the same test(s) failing or various ones
16:37:37 <slaweq> and will report bug(s) for that
16:37:45 <slaweq> sounds good to You?
16:37:54 <mlavalle> yes
16:37:55 <haleyb> yes, good for me
16:38:15 <slaweq> #action slaweq to check midonet job and report bug(s) related to it
16:38:37 <slaweq> other than that, I think we've been in pretty good shape recently
16:39:36 <slaweq> any other questions/comments about grafana?
16:40:32 <mlavalle> not from me
16:40:36 <slaweq> ok, so lets move on then
16:40:39 <slaweq> #topic fullstack/functional
16:41:00 <slaweq> I was looking through some recent patches for failures and I found only a few of them
16:41:06 <slaweq> first functional tests
16:41:11 <slaweq> http://logs.openstack.org/12/672612/4/check/neutron-functional/e357646/testr_results.html.gz
16:41:18 <slaweq> failure in neutron.tests.functional.agent.linux.test_linuxbridge_arp_protect.LinuxBridgeARPSpoofTestCase
16:41:52 <slaweq> but this looks like an issue related to host load, and ralonsoh and I already talked about it
16:41:55 <slaweq> right ralonsoh?
16:42:02 <ralonsoh> I think so
16:42:15 <ralonsoh> yes, that's right
16:42:32 <slaweq> and second issue which I found:
16:42:34 <slaweq> neutron.tests.functional.services.trunk.drivers.openvswitch.agent.test_trunk_manager.TrunkManagerTestCase
16:42:38 <slaweq> http://logs.openstack.org/03/670203/10/check/neutron-functional/80d0831/testr_results.html.gz
16:43:35 <slaweq> looking at logs from this test: http://logs.openstack.org/03/670203/10/check/neutron-functional/80d0831/controller/logs/dsvm-functional-logs/neutron.tests.functional.services.trunk.drivers.openvswitch.agent.test_trunk_manager.TrunkManagerTestCase.test_connectivity.txt.gz
16:43:42 <slaweq> I don't see anything obvious
16:44:47 <slaweq> but as it happened only once, lets just keep an eye on it for now
16:44:50 <slaweq> do You agree?
16:44:54 <njohnston> yes
16:44:54 <ralonsoh> that's curious: the test is claiming that the ping process was not spawned, but it was
16:44:55 <mlavalle> ++
16:46:04 <slaweq> ralonsoh: true
16:46:31 <slaweq> that is strange
16:47:30 <slaweq> if I have some time, I will take a deeper look into this
16:47:47 <slaweq> maybe at least add some more logging that will help debug such issues in the future
16:48:39 <slaweq> any other issues related to functional tests?
16:48:43 <slaweq> or can we continue?
16:49:45 <slaweq> ok, lets move on
16:49:54 <slaweq> I don't have anything new related to fullstack for today
16:49:58 <slaweq> so next topic
16:50:03 <slaweq> #topic Tempest/Scenario
16:50:37 <slaweq> first of all, my patch to tempest https://review.opendev.org/#/c/672715/ is merged
16:51:00 <slaweq> so I hope the SSH failures due to failing to get public-keys should be much better now
16:51:08 <njohnston> \o/
16:51:21 <slaweq> if You see such errors now, let me know - I will investigate again
16:51:38 <mlavalle> Thanks!
16:51:55 <slaweq> this should help all jobs which inherit from devstack-tempest
16:52:07 <slaweq> so it will not solve the problem in e.g. tripleo based jobs
16:52:15 <slaweq> (just saying :))
16:52:30 <slaweq> but in the neutron u/s gate we should be much better now, I hope
16:52:33 <slaweq> ok
16:52:48 <slaweq> apart from that, I spotted one new error in the API tests:
16:52:55 <slaweq> http://logs.openstack.org/30/670930/3/check/neutron-tempest-plugin-api/5a731da/testr_results.html.gz
16:53:02 <slaweq> it is a failure in neutron_tempest_plugin.api.test_port_forwardings.PortForwardingTestJSON
16:53:19 <slaweq> it looks like an issue in the test to me, I will report a bug and work on it
16:53:28 <slaweq> ok for You?
16:53:52 <mlavalle> sure
16:54:17 <slaweq> thx
16:54:33 <slaweq> #action slaweq to report and try to fix bug in neutron_tempest_plugin.api.test_port_forwardings.PortForwardingTestJSON
16:54:38 <slaweq> ok
16:54:46 <slaweq> and that's all from my side for today
16:54:59 <slaweq> anything else You want to discuss today?
16:55:04 <mlavalle> not from me
16:55:22 <bcafarel> catching up on the recent activity, so nothing from me either :)
16:55:48 <slaweq> ok, thx for attending
16:55:50 <bcafarel> nice recent findings btw (race condition, memcached in nova, ...)
16:55:54 <slaweq> and have a nice week
16:55:58 <slaweq> thx bcafarel  :)
16:56:03 <slaweq> o/
16:56:07 <slaweq> #endmeeting