15:00:01 <slaweq> #startmeeting neutron_ci
15:00:02 <openstack> Meeting started Wed May  6 15:00:01 2020 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:03 <slaweq> hi
15:00:04 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:06 <openstack> The meeting name has been set to 'neutron_ci'
15:00:08 <njohnston> o/
15:01:03 <bcafarel> o/
15:02:13 <ralonsoh> hi
15:02:20 <slaweq> ok, I think we can start
15:02:25 <slaweq> #topic Actions from previous meetings
15:02:35 <slaweq> first one
15:02:37 <slaweq> maciejjozefczyk to take a look at ovn related functional test failures
15:02:50 <lajoskatona> Hi
15:03:46 <slaweq> I pinged maciejjozefczyk on neutron channel
15:03:52 <slaweq> maybe he will join soon
15:04:34 <maciejjozefczyk> hey
15:04:50 <maciejjozefczyk> ok, so I took a look at those failures for functional tests
15:06:07 <maciejjozefczyk> #link https://bugs.launchpad.net/neutron/+bug/1868110
15:06:07 <openstack> Launchpad bug 1868110 in neutron "[OVN] neutron.tests.functional.plugins.ml2.drivers.ovn.mech_driver.ovsdb.test_ovn_db_sync.TestOvnNbSyncOverTcp.test_ovn_nb_sync_log randomly fails" [High,In progress] - Assigned to Maciej Jozefczyk (maciej.jozefczyk)
15:07:05 <maciejjozefczyk> I found that the test from this class failed only once recently, with the same error: TimeoutException
15:07:33 <maciejjozefczyk> So I think we have 2 possible solutions: leave it as is, since 1 failure per week is not that much
15:07:41 <maciejjozefczyk> or bump the timeout value
15:08:05 <bcafarel> what's the current timeout?
15:08:07 <slaweq> didn't we lower this timeout already to solve this problem? :)
15:08:11 <maciejjozefczyk> I think that after bumping it from a very little value to 15 seconds we already have fewer occurrences of this test failure
15:08:43 <maciejjozefczyk> slaweq, bcafarel default in the neutron configuration is 180 seconds
15:09:09 <bcafarel> oh right that was the one that was bumped
15:09:14 <maciejjozefczyk> we had issues while it was set to 5 seconds for functional tests; we updated it to 15 seconds and I found only one particular failure of this test class
15:09:14 <slaweq> ahh, it was 5 seconds and You bumped it to 15, right?
15:09:28 <slaweq> why not use default value?
15:09:53 <maciejjozefczyk> slaweq, I think that the author a few years ago wanted to not have such a high timeout value for functional tests because it causes the whole job to time out
15:10:13 <maciejjozefczyk> but we can try setting it to default
15:10:34 <slaweq> so maybe we can try with e.g. 30 seconds for now and we will see :)
15:11:47 <maciejjozefczyk> slaweq, yes
15:12:05 <maciejjozefczyk> but for now, as I said, I found only 1 failure of the test, at least when I was checking a week ago :)
15:13:14 <slaweq> maciejjozefczyk: ok, IMO we can try to bump it a little and see - maybe there will be no occurrences anymore :)
15:13:20 <maciejjozefczyk> slaweq, sure
15:13:41 <slaweq> ralonsoh: njohnston bcafarel what do You think?
15:13:46 <ralonsoh> +1
15:13:54 <ralonsoh> to increase the time
15:14:05 <njohnston> I agree, it seems reasonable
15:14:26 <bcafarel> yep, it is still an OK value
15:14:39 <slaweq> ok, maciejjozefczyk will You take care of it?
15:14:49 <maciejjozefczyk> slaweq, yes
15:14:52 <slaweq> thx
15:15:12 <slaweq> #action maciejjozefczyk will increase timeout in neutron.tests.functional.plugins.ml2.drivers.ovn.mech_driver.ovsdb.test_ovn_db_sync.TestOvnNbSyncOverTcp.test_ovn_nb_sync_log
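[Editor's sketch] For reference, a minimal sketch of what the proposed bump could look like, using oslo.config directly. The option name and group ([ovn] ovsdb_connection_timeout, 180 s service default) follow the discussion above, but the registration shown here and the exact place the functional tests set it are assumptions, not the actual patch:

    # Illustrative stand-in for neutron's functional-test setup; the real
    # option lives in neutron's OVN config module, this only mirrors it.
    from oslo_config import cfg

    CONF = cfg.CONF
    CONF.register_opts(
        [cfg.IntOpt('ovsdb_connection_timeout', default=180,
                    help='Timeout in seconds for the OVSDB connection')],
        group='ovn')
    CONF([], project='ovn-timeout-demo')  # initialize without a config file

    # Functional tests used 5 s, then 15 s; the proposal above is ~30 s,
    # still far below the 180 s service default, so a stuck connection
    # cannot eat the whole functional job timeout.
    CONF.set_override('ovsdb_connection_timeout', 30, group='ovn')
    print(CONF.ovn.ovsdb_connection_timeout)  # -> 30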
15:15:27 <slaweq> ok, next one
15:15:29 <slaweq> maciejjozefczyk to report LP regarding failing neutron_tempest_plugin.scenario.test_trunk.TrunkTest.test_trunk_subport_lifecycle in neutron-ovn-tempest-ovs-release job
15:15:43 <maciejjozefczyk> #link https://bugs.launchpad.net/neutron/+bug/1874447
15:15:43 <openstack> Launchpad bug 1874447 in neutron "[OVN] Tempest test neutron_tempest_plugin.scenario.test_trunk.TrunkTest.test_trunk_subport_lifecycle fails randomly" [High,Confirmed]
15:16:30 <slaweq> maciejjozefczyk: thx
15:16:40 <slaweq> but I think Your comments on this LP are wrong :)
15:16:43 <maciejjozefczyk> I was investigating it, but at that time I also hadn't found any occurrence of that failure (AFAIR the logs are rotated after a week?)
15:17:28 <slaweq> logs are gone after some time, I don't know exactly how long
15:17:37 <njohnston> I've been doing some experimenting with trunks on the 2 changes you cite, but I don't think I have had enough CI runs on them to account for the stats you mention.
15:19:36 <slaweq> if You find similar issues, please link them to this LP
15:19:53 <slaweq> we will see if that is a serious issue
15:20:13 <maciejjozefczyk> ok slaweq
15:20:21 <njohnston> +1
15:20:25 <slaweq> thx
15:20:29 <slaweq> ok, next one
15:20:31 <slaweq> ralonsoh to check timeout during interface creation in functional tests
15:20:51 <ralonsoh> #link https://review.opendev.org/#/c/722254/
15:21:00 <ralonsoh> and a patch in pyroute2
15:21:07 <ralonsoh> https://github.com/svinota/pyroute2/issues/702
15:21:25 <ralonsoh> now we can pass the PyDLL when creating a namespace context
15:21:38 <ralonsoh> we need a new pyroute2 release, 0.5.13
15:21:56 <ralonsoh> same problem as when calling create/delete namespaces
15:22:04 <ralonsoh> that's all
15:22:47 <slaweq> thx a lot
15:22:53 <ralonsoh> yw
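[Editor's sketch] A minimal sketch of the idea behind the patch above: load libc via ctypes.PyDLL (which, unlike CDLL, keeps the GIL during the foreign call) and hand it to pyroute2 when creating/deleting namespaces. The libc= keyword on the pyroute2 helpers is assumed here pending the 0.5.13 release mentioned above (see the linked issue); this is not the actual neutron patch:

    import ctypes
    import ctypes.util

    from pyroute2 import netns

    # PyDLL does not release the GIL around the call, which is the point of
    # the fix when these calls run under eventlet/green threads.
    _libc = ctypes.PyDLL(ctypes.util.find_library('c'), use_errno=True)

    # Assumed libc= keyword (see pyroute2 issue 702); needs root to run.
    netns.create('example-ns', libc=_libc)
    netns.remove('example-ns', libc=_libc)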
15:23:05 <slaweq> ok, so that's all from the last week
15:23:12 <slaweq> lets move to the next topic
15:23:14 <slaweq> #topic Stadium projects
15:23:30 <slaweq> standardize on zuul v3
15:23:43 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron-train-zuulv3-py27drop
15:23:57 <slaweq> the only update which I have is that there is now a zuulv3 grenade job
15:24:09 <slaweq> so we can migrate our grenade jobs too
15:24:24 <slaweq> it's already done for 2 of the neutron grenade jobs
15:24:34 <slaweq> I will do the same for neutron-ovn job
15:24:34 <njohnston> I like the idea of an OVN grenade job to complement the existing ml2/ovs one
15:25:20 <slaweq> and there is also networking-odl grenade job to migrate
15:25:25 <njohnston> so that just means the networking-odl-grenade job, and the regular zuulv3 transition for networking-midonet. Any word from yamamoto recently?
15:25:40 <slaweq> njohnston: nope
15:25:41 <lajoskatona> I started that and got some advice from Luigi Toscano
15:25:53 <njohnston> excellent
15:25:59 <lajoskatona> that -^
15:26:04 <slaweq> he only told me that his time for openstack is "very limited" currently
15:26:41 <slaweq> we will need to review the list of stadium projects again during the ptg probably
15:27:19 <slaweq> thx lajoskatona for taking care of networking-odl
15:27:55 <lajoskatona> slaweq: np, we use it so it's the normal way of working ;)
15:28:26 <lajoskatona> by the way: could you check these please : https://review.opendev.org/#/q/I3aed8031c1b784c16dd7055afd41456ed935b7fc
15:28:54 <lajoskatona> as far as I know these are needed for the master and ussuri patches to be merged, for reno etc...
15:29:14 <slaweq> sure, I will
15:29:43 <slaweq> ok, anything else regarding stadium projects' ci for today?
15:30:18 <lajoskatona> nothing from me at least
15:31:04 <njohnston> nope
15:31:15 <ralonsoh> no
15:31:33 <bcafarel> neutron-dynamic-routing fails tempest on train (though this was the first backport attempt): https://review.opendev.org/#/c/725790/
15:32:05 <bcafarel> related to that series of monkeypatch eventlet fixes
15:32:48 <slaweq> ModuleNotFoundError: No module named 'neutron_tempest_plugin'
15:34:26 <slaweq> so IMO it's some issue with jobs definition
15:35:28 <njohnsto_> yeah
15:35:51 <bcafarel> yes, in train tests were still in-repo
15:35:59 <slaweq> those are still legacy jobs
15:37:16 <slaweq> bcafarel: isn't that related to https://review.opendev.org/#/c/721277/3/devstack/plugin.sh ?
15:38:26 <slaweq> maybe those jobs in stable/train should use some pinned version of neutron-tempest-plugin?
15:39:55 <bcafarel> hmm yes it's possible (so the legacy job was actually using the installed version of the plugin?)
15:40:12 <slaweq> bcafarel: idk really :/
15:40:31 <slaweq> bcafarel: will You check that this week?
15:40:47 <bcafarel> slaweq: I'll try at least :) and worst case file a LP for it
15:40:53 <slaweq> thx
15:41:10 <slaweq> #action bcafarel to check neutron-dynamic-routing failures in stable/train
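[Editor's sketch] If pinning turns out to be the fix, one possible shape for it is a Zuul override-checkout on the required project in the stable/train job definition. The job name and the tag below are placeholders, not confirmed values; only the required-projects/override-checkout mechanism itself is standard Zuul:

    # Hypothetical stable/train job variant pinning neutron-tempest-plugin
    - job:
        name: neutron-dynamic-routing-tempest-train   # placeholder name
        required-projects:
          - name: openstack/neutron-tempest-plugin
            override-checkout: 1.1.0                  # placeholder tag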
15:41:25 <slaweq> ok, lets move on
15:41:26 <slaweq> #topic Stable branches
15:41:32 <slaweq> Train dashboard: http://grafana.openstack.org/d/pM54U-Kiz/neutron-failure-rate-previous-stable-release?orgId=1
15:41:33 <slaweq> Stein dashboard: http://grafana.openstack.org/d/dCFVU-Kik/neutron-failure-rate-older-stable-release?orgId=1
15:42:51 <slaweq> IMHO the dashboards look more or less fine
15:43:08 <bcafarel> most backports got in without too many rechecks yes
15:43:28 <bcafarel> designate gate even behaved :) mostly rally/grenade rechecks
15:44:10 <slaweq> for grenade it should be better when nova backports the fix for scheduler/placement
15:44:19 <slaweq> in master it now works much better IMO
15:44:54 <slaweq> ok, next topic then
15:44:56 <slaweq> #topic Grafana
15:45:00 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:45:10 <slaweq> here there is more data :)
15:45:43 <slaweq> and I see couple of problems there
15:45:47 <slaweq> or potential problems
15:46:09 <slaweq> neutron-ovn-tempest-slow (non-voting)
15:46:11 <slaweq> neutron-ovn-tempest-full-multinode-ovs-master (non-voting)
15:46:20 <slaweq> both jobs are hitting a high failure rate today
15:46:38 <slaweq> I don't know really if that is some new problem
15:47:52 <slaweq> in most cases where I see it both jobs are TIMED_OUT
15:48:05 <slaweq> e.g. https://review.opendev.org/#/c/725836/
15:48:17 <slaweq> or https://review.opendev.org/#/c/724445/
15:49:20 <ralonsoh> I'll try to track those two jobs to see a possible problem
15:49:40 <ralonsoh> maybe it's just a matter of increasing the job timeout
15:49:51 <slaweq> I now checked one such job https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_07c/717083/9/check/neutron-ovn-tempest-slow/07c6abc/job-output.txt
15:50:07 <slaweq> it seems that all tests are failing due to "Failed to allocate the network(s), not rescheduling."
15:50:17 <slaweq> so it seems like serious issue
15:50:33 <maciejjozefczyk> I can try to investigate that
15:50:45 <slaweq> and I think that it's due to something from today or yesterday
15:51:52 <slaweq> https://zuul.opendev.org/t/openstack/build/07c6abc773a84930b20a5ea1b46f666e/log/controller/logs/screen-q-svc.txt?severity=4
15:51:59 <slaweq> I think that this can be culprit :)
15:52:19 <slaweq> or maybe not :) but at least it's IMO worth checking
15:53:09 <slaweq> ok, so ralonsoh maciejjozefczyk - to whom should I assign the action to check that? :)
15:53:12 <maciejjozefczyk> May 06 09:52:01.354607 ubuntu-bionic-rax-iad-0016398327 neutron-server[10171]: WARNING neutron.db.agents_db [None req-e6c8c261-b416-41da-beae-c05e137265bb None None] Agent healthcheck: found 1 dead agents out of 4:
15:53:12 <maciejjozefczyk> May 06 09:52:01.354607 ubuntu-bionic-rax-iad-0016398327 neutron-server[10171]:                 Type       Last heartbeat host
15:53:12 <maciejjozefczyk> May 06 09:52:01.354607 ubuntu-bionic-rax-iad-0016398327 neutron-server[10171]: OVN Controller Gateway agent 2020-05-06 09:52:01.334533 ubuntu-bionic-rax-iad-0016398327
15:53:32 <maciejjozefczyk> ^ during one of failed tests
15:54:29 <ralonsoh> I can take it
15:54:34 <slaweq> thx ralonsoh
15:54:44 <slaweq> and thx maciejjozefczyk for help with this
15:54:53 <slaweq> #action ralonsoh to check ovn jobs timeouts
15:54:58 <slaweq> ok, lets move on
15:55:07 <slaweq> some recheck stats from last weeks
15:55:09 <slaweq> Average number of rechecks in last weeks:
15:55:11 <slaweq> week 17 of 2020: 2.62
15:55:13 <slaweq> week 18 of 2020: 1.4
15:55:15 <slaweq> week 19 of 2020: 1.71
15:55:17 <slaweq> so it seems not very bad recently :)
15:55:50 <slaweq> ok, lets move quickly to the periodic jobs as we are almost out of time today
15:55:57 <slaweq> #topic Periodic
15:56:09 <slaweq> neutron-ovn-tempest-ovs-master-fedora has been failing since 26.04
15:56:18 <ralonsoh> again?
15:56:24 <slaweq> first it was failing with an error like: https://d11d69999bcdac29c8fb-3dea60a35fb0d38e41c535f13b48e895.ssl.cf5.rackcdn.com/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/b12ec3a/job-output.txt
15:56:39 <slaweq> and in the next days it changed a bit:
15:56:41 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_dd7/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/dd76029/job-output.txt
15:56:43 <slaweq> and
15:56:45 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_71e/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/71e5c2f/job-output.txt
15:56:52 <slaweq> ralonsoh: yeah - I don't like this fedora job ;)
15:57:07 <ralonsoh> 2020-04-26 06:28:30.004222 | controller | Error: Unable to find a match: kernel-devel-5.5.17
15:57:40 <ralonsoh> but this is not the only error
15:57:49 <slaweq> but in the following days it was failing with:
15:57:51 <slaweq> configure: error: Linux kernel in /lib/modules/5.6.6-200.fc31.x86_64/build is version 5.6.6, but version newer than 5.5.x is not supported (please refer to the FAQ for advice)
15:58:00 <ralonsoh> and another one
15:58:01 <ralonsoh> 2020-05-02 06:18:55.602437 | controller | configure: error: source dir /lib/modules/5.6.7-200.fc31.x86_64/build doesn't exist
15:58:08 <slaweq> yes
15:58:18 <slaweq> so basically it is failing on compilation of ovs
15:58:24 <ralonsoh> exactly
15:58:44 <ralonsoh> we can ping fedora guys in bugzilla
15:58:49 <ralonsoh> (I think so)
15:58:58 <maciejjozefczyk> we recently added compilation of the ovs module for when qos is used
15:59:03 <slaweq> maciejjozefczyk: You were playing with this compilation of the ovs module
15:59:11 <slaweq> can You take a look at this one?
15:59:13 <maciejjozefczyk> yes, that was me
15:59:17 <maciejjozefczyk> slaweq, ok
15:59:21 <slaweq> thx a lot
15:59:47 <slaweq> #action maciejjozefczyk to check failing compile of ovs in the fedora periodic job
15:59:58 <slaweq> and we are almost out of time
16:00:00 <slaweq> thx for attending
16:00:06 <slaweq> #endmeeting