15:00:01 <slaweq> #startmeeting neutron_ci
15:00:02 <openstack> Meeting started Wed May 6 15:00:01 2020 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:03 <slaweq> hi
15:00:04 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:06 <openstack> The meeting name has been set to 'neutron_ci'
15:00:08 <njohnston> o/
15:01:03 <bcafarel> o/
15:02:13 <ralonsoh> hi
15:02:20 <slaweq> ok, I think we can start
15:02:25 <slaweq> #topic Actions from previous meetings
15:02:35 <slaweq> first one
15:02:37 <slaweq> maciejjozefczyk to take a look at ovn related functional test failures
15:02:50 <lajoskatona> Hi
15:03:46 <slaweq> I pinged maciejjozefczyk on neutron channel
15:03:52 <slaweq> maybe he will join soon
15:04:34 <maciejjozefczyk> hey
15:04:50 <maciejjozefczyk> ok, so I took a look at those failures in the functional tests
15:06:07 <maciejjozefczyk> #link https://bugs.launchpad.net/neutron/+bug/1868110
15:06:07 <openstack> Launchpad bug 1868110 in neutron "[OVN] neutron.tests.functional.plugins.ml2.drivers.ovn.mech_driver.ovsdb.test_ovn_db_sync.TestOvnNbSyncOverTcp.test_ovn_nb_sync_log randomly fails" [High,In progress] - Assigned to Maciej Jozefczyk (maciej.jozefczyk)
15:07:05 <maciejjozefczyk> I found that the test from this class failed only once recently, with the same error, which is: TimeoutException
15:07:33 <maciejjozefczyk> So I think we have 2 possible solutions: let it be, as 1 failure per week is not that much
15:07:41 <maciejjozefczyk> or bump the timeout value
15:08:05 <bcafarel> what's the current timeout?
15:08:07 <slaweq> didn't we lower this timeout already to solve this problem? :)
15:08:11 <maciejjozefczyk> I think that after bumping it from a very little value to 15 seconds we already have fewer occurrences of this test failure
15:08:43 <maciejjozefczyk> slaweq, bcafarel default in the neutron configuration is 180 seconds
15:09:09 <bcafarel> oh right that was the one that was bumped
15:09:14 <maciejjozefczyk> we had issues while it was 5 seconds set for functional tests, we updated to 15 seconds and I found only one particular failure of this test class
15:09:14 <slaweq> ahh, it was 5 seconds and You bumped it to 15, right?
15:09:28 <slaweq> why not use default value?
15:09:53 <maciejjozefczyk> slaweq, I think that the author a few years ago wanted to not have such a high timeout value for functional tests because it causes the whole job to time out
15:10:13 <maciejjozefczyk> but we can try setting it to default
15:10:34 <slaweq> so maybe we can try with e.g. 30 seconds for now and we will see :)
15:11:47 <maciejjozefczyk> slaweq, yes
15:12:05 <maciejjozefczyk> but for now, as I said, I found only 1 failure of the test, at least when I was checking a week ago :)
15:13:14 <slaweq> maciejjozefczyk: ok, IMO we can try to bump it a little and see - maybe there will be no occurrences anymore :)
15:13:20 <maciejjozefczyk> slaweq, sure
15:13:41 <slaweq> ralonsoh: njohnston bcafarel what do You think?
15:13:46 <ralonsoh> +1
15:13:54 <ralonsoh> to increase the time
15:14:05 <njohnston> I agree, it seems reasonable
15:14:26 <bcafarel> yep, it is still an OK value
15:14:39 <slaweq> ok, maciejjozefczyk will You take care of it?
15:14:49 <maciejjozefczyk> slaweq, yes
15:14:52 <slaweq> thx
15:15:12 <slaweq> #action maciejjozefczyk will increase timeout in neutron.tests.functional.plugins.ml2.drivers.ovn.mech_driver.ovsdb.test_ovn_db_sync.TestOvnNbSyncOverTcp.test_ovn_nb_sync_log
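For context, the timeout discussed here is the OVSDB connection timeout used by the OVN functional tests; a minimal sketch of the proposed bump, assuming it maps to the ovsdb_connection_timeout option in the [ovn] group (whose neutron.conf default is the 180 seconds mentioned above):

    # Hedged sketch, not the actual patch: option name and group are assumptions.
    [ovn]
    # was 5 s, currently 15 s in the functional test setup; try e.g. 30 s
    # before falling back to the 180 s service default
    ovsdb_connection_timeout = 30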
15:15:27 <slaweq> ok, next one
15:15:29 <slaweq> maciejjozefczyk to report LP regarding failing neutron_tempest_plugin.scenario.test_trunk.TrunkTest.test_trunk_subport_lifecycle in the neutron-ovn-tempest-ovs-release job
15:15:43 <maciejjozefczyk> #link https://bugs.launchpad.net/neutron/+bug/1874447
15:15:43 <openstack> Launchpad bug 1874447 in neutron "[OVN] Tempest test neutron_tempest_plugin.scenario.test_trunk.TrunkTest.test_trunk_subport_lifecycle fails randomly" [High,Confirmed]
15:16:30 <slaweq> maciejjozefczyk: thx
15:16:40 <slaweq> but I think Your comments to this LP are wrong :)
15:16:43 <maciejjozefczyk> I was investigating it, but at that time I also hadn't found any occurrence of that failure (afair the logs are rotated after a week?)
15:17:28 <slaweq> logs are gone after some time, I don't know exactly how long
15:17:37 <njohnston> I've been doing some experimenting with trunks on the 2 changes you cite, but I don't think I have had enough CI runs on them to account for the statsu you mention.
15:17:51 <njohnston> *stats
15:19:36 <slaweq> if You find similar issues, please link them to this LP
15:19:53 <slaweq> we will see if that is a serious issue
15:20:13 <maciejjozefczyk> ok slaweq
15:20:21 <njohnston> +1
15:20:25 <slaweq> thx
15:20:29 <slaweq> ok, next one
15:20:31 <slaweq> ralonsoh to check timeout during interface creation in functional tests
15:20:51 <ralonsoh> #link https://review.opendev.org/#/c/722254/
15:21:00 <ralonsoh> and a patch in pyroute2
15:21:07 <ralonsoh> https://github.com/svinota/pyroute2/issues/702
15:21:25 <ralonsoh> now we can pass the PyDLL when creating a namespace context
15:21:38 <ralonsoh> we need a new release of pyroute2, 0.5.13
15:21:56 <ralonsoh> same problem as when calling create/delete namespaces
15:22:04 <ralonsoh> that's all
15:22:47 <slaweq> thx a lot
15:22:53 <ralonsoh> yw
15:23:05 <slaweq> ok, so that's all from the last week
15:23:12 <slaweq> lets move to the next topic
15:23:14 <slaweq> #topic Stadium projects
15:23:30 <slaweq> standardize on zuul v3
15:23:43 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron-train-zuulv3-py27drop
15:23:57 <slaweq> the only update which I have is that there is now a zuulv3 grenade job
15:24:09 <slaweq> so we can migrate our grenade jobs too
15:24:24 <slaweq> it's already done for 2 of the neutron grenade jobs
15:24:34 <slaweq> I will do the same for the neutron-ovn job
15:24:34 <njohnston> I like the idea of an OVN grenade job to complement the existing ml2/ovs one
15:25:20 <slaweq> and there is also the networking-odl grenade job to migrate
15:25:25 <njohnston> so that just means the networking-odl-grenade job, and the regular zuulv3 transition for networking-midonet. Any word from yamamoto recently?
15:25:40 <slaweq> njohnston: nope
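For reference, the zuulv3 migration discussed above mostly amounts to parenting the job on the new native grenade base job instead of the legacy-dsvm wrapper; a rough sketch only, with illustrative layout, projects and variables rather than the actual networking-odl definition:

    # Illustrative sketch of a Zuul v3 native grenade job for a stadium
    # project; the name, required-projects and vars are placeholders.
    - job:
        name: networking-odl-grenade
        parent: grenade
        required-projects:
          - openstack/neutron
          - openstack/networking-odl
        vars:
          devstack_plugins:
            networking-odl: https://opendev.org/openstack/networking-odl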
15:25:41 <lajoskatona> I started that and got some advice from Luigi Toscano
15:25:53 <njohnston> excellent
15:25:59 <lajoskatona> that -^
15:26:04 <slaweq> he only told me that his time for openstack is "very limited" currently
15:26:41 <slaweq> we will need to review the list of stadium projects again during the ptg probably
15:27:19 <slaweq> thx lajoskatona for taking care of networking-odl
15:27:55 <lajoskatona> slaweq: np, we use it so it's the normal way of working ;)
15:28:26 <lajoskatona> by the way: could you check these please: https://review.opendev.org/#/q/I3aed8031c1b784c16dd7055afd41456ed935b7fc
15:28:54 <lajoskatona> as far as I know these are needed to get the master and ussuri patches merged for reno etc...
15:29:14 <slaweq> sure, I will
15:29:43 <slaweq> ok, anything else regarding stadium projects' ci for today?
15:30:18 <lajoskatona> nothing from me at least
15:31:04 <njohnston> nope
15:31:15 <ralonsoh> no
15:31:33 <bcafarel> neutron-dynamic-routing fails tempest on train (though this was the first backport attempt): https://review.opendev.org/#/c/725790/
15:32:05 <bcafarel> related to that series of monkeypatch eventlet fixes
15:32:48 <slaweq> ModuleNotFoundError: No module named 'neutron_tempest_plugin'
15:34:26 <slaweq> so IMO it's some issue with the jobs definition
15:35:28 <njohnsto_> yeah
15:35:51 <bcafarel> yes, in train the tests were still in-repo
15:35:59 <slaweq> those are still legacy jobs
15:37:16 <slaweq> bcafarel: isn't that related to https://review.opendev.org/#/c/721277/3/devstack/plugin.sh ?
15:38:26 <slaweq> maybe those jobs in stable/train should use some pinned version of neutron-tempest-plugin?
15:39:55 <bcafarel> hmm yes it's possible (so the legacy job was actually using the installed version of the plugin?)
15:40:12 <slaweq> bcafarel: idk really :/
15:40:31 <slaweq> bcafarel: will You check that this week?
15:40:47 <bcafarel> slaweq: I'll try at least :) and worst case file a LP for it
15:40:53 <slaweq> thx
15:41:10 <slaweq> #action bcafarel to check neutron-dynamic-routing failures in stable/train
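If the pinning idea above turns out to be the fix, the usual Zuul mechanism is an override-checkout on the required project; a sketch under that assumption, with a placeholder job name and tag rather than verified values:

    # Hypothetical sketch: pin neutron-tempest-plugin for a stable/train job.
    - job:
        name: neutron-dynamic-routing-dsvm-tempest-scenario  # placeholder name
        required-projects:
          - name: openstack/neutron-tempest-plugin
            override-checkout: 1.0.0  # placeholder tag compatible with train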
15:41:25 <slaweq> ok, lets move on
15:41:26 <slaweq> #topic Stable branches
15:41:32 <slaweq> Train dashboard: http://grafana.openstack.org/d/pM54U-Kiz/neutron-failure-rate-previous-stable-release?orgId=1
15:41:33 <slaweq> Stein dashboard: http://grafana.openstack.org/d/dCFVU-Kik/neutron-failure-rate-older-stable-release?orgId=1
15:42:51 <slaweq> IMHO the dashboards look more or less fine
15:43:08 <bcafarel> most backports got in without too many rechecks yes
15:43:28 <bcafarel> designate gate even behaved :) mostly rally/grenade rechecks
15:44:10 <slaweq> for grenade it should be better when nova backports the fix for scheduler/placement
15:44:19 <slaweq> in master it now works much better IMO
15:44:54 <slaweq> ok, next topic then
15:44:56 <slaweq> #topic Grafana
15:45:00 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:45:10 <slaweq> here there is more results :)
15:45:17 <slaweq> *more data
15:45:43 <slaweq> and I see a couple of problems there
15:45:47 <slaweq> or potential problems
15:46:09 <slaweq> neutron-ovn-tempest-slow (non-voting)
15:46:11 <slaweq> neutron-ovn-tempest-full-multinode-ovs-master (non-voting)
15:46:20 <slaweq> both jobs are going to a high failure rate today
15:46:38 <slaweq> I don't know really if that is some new problem
15:47:52 <slaweq> in most cases where I see it both jobs are TIMED_OUT
15:48:05 <slaweq> e.g. https://review.opendev.org/#/c/725836/
15:48:17 <slaweq> or https://review.opendev.org/#/c/724445/
15:49:20 <ralonsoh> I'll try to track those two jobs to see a possible problem
15:49:40 <ralonsoh> maybe it's just a matter of increasing the job timeout
15:49:51 <slaweq> I now checked one such job https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_07c/717083/9/check/neutron-ovn-tempest-slow/07c6abc/job-output.txt
15:50:07 <slaweq> it seems that all tests are failing due to "Failed to allocate the network(s), not rescheduling."
15:50:17 <slaweq> so it seems like a serious issue
15:50:33 <maciejjozefczyk> I can try to investigate that
15:50:45 <slaweq> and I think that it's due to something from today or yesterday
15:51:52 <slaweq> https://zuul.opendev.org/t/openstack/build/07c6abc773a84930b20a5ea1b46f666e/log/controller/logs/screen-q-svc.txt?severity=4
15:51:59 <slaweq> I think that this can be the culprit :)
15:52:19 <slaweq> or maybe not :) but at least it's IMO worth checking
15:53:09 <slaweq> ok, so ralonsoh maciejjozefczyk - to whom should I assign the action to check that? :)
15:53:12 <maciejjozefczyk> May 06 09:52:01.354607 ubuntu-bionic-rax-iad-0016398327 neutron-server[10171]: WARNING neutron.db.agents_db [None req-e6c8c261-b416-41da-beae-c05e137265bb None None] Agent healthcheck: found 1 dead agents out of 4:
15:53:12 <maciejjozefczyk> May 06 09:52:01.354607 ubuntu-bionic-rax-iad-0016398327 neutron-server[10171]: Type Last heartbeat host
15:53:12 <maciejjozefczyk> May 06 09:52:01.354607 ubuntu-bionic-rax-iad-0016398327 neutron-server[10171]: OVN Controller Gateway agent 2020-05-06 09:52:01.334533 ubuntu-bionic-rax-iad-0016398327
15:53:32 <maciejjozefczyk> ^ during one of the failed tests
15:54:29 <ralonsoh> I can take it
15:54:34 <slaweq> thx ralonsoh
15:54:44 <slaweq> and thx maciejjozefczyk for help with this
15:54:53 <slaweq> #action ralonsoh to check ovn jobs timeouts
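For the "increasing the job timeout" option ralonsoh mentions, the relevant knob in Zuul v3 is the per-job timeout attribute, in seconds; a minimal sketch with an illustrative value, not the jobs' actual definitions:

    # Hedged sketch only: raise the wall-clock budget of a timing-out job;
    # the value is illustrative, not the job's real current setting.
    - job:
        name: neutron-ovn-tempest-slow
        timeout: 10800  # 3 hours for the whole job run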
15:54:58 <slaweq> ok, lets move on
15:55:07 <slaweq> some recheck stats from the last weeks
15:55:09 <slaweq> Average number of rechecks in the last weeks:
15:55:11 <slaweq> week 17 of 2020: 2.62
15:55:13 <slaweq> week 18 of 2020: 1.4
15:55:15 <slaweq> week 19 of 2020: 1.71
15:55:17 <slaweq> so it seems not very bad recently :)
15:55:50 <slaweq> ok, lets move quickly to the periodic jobs as we are almost out of time today
15:55:57 <slaweq> #topic Periodic
15:56:09 <slaweq> neutron-ovn-tempest-ovs-master-fedora has been failing since 26.04
15:56:18 <ralonsoh> again?
15:56:24 <slaweq> first it was failing with an error like: https://d11d69999bcdac29c8fb-3dea60a35fb0d38e41c535f13b48e895.ssl.cf5.rackcdn.com/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/b12ec3a/job-output.txt
15:56:39 <slaweq> and in the next days it changed a bit:
15:56:41 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_dd7/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/dd76029/job-output.txt
15:56:43 <slaweq> and
15:56:45 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_71e/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/71e5c2f/job-output.txt
15:56:52 <slaweq> ralonsoh: yeah - I don't like this fedora job ;)
15:57:07 <ralonsoh> 2020-04-26 06:28:30.004222 | controller | Error: Unable to find a match: kernel-devel-5.5.17
15:57:40 <ralonsoh> but this is not the only error
15:57:49 <slaweq> but in the next days it was failing with:
15:57:51 <slaweq> configure: error: Linux kernel in /lib/modules/5.6.6-200.fc31.x86_64/build is version 5.6.6, but version newer than 5.5.x is not supported (please refer to the FAQ for advice)
15:58:00 <ralonsoh> and another one
15:58:01 <ralonsoh> 2020-05-02 06:18:55.602437 | controller | configure: error: source dir /lib/modules/5.6.7-200.fc31.x86_64/build doesn't exist
15:58:08 <slaweq> yes
15:58:18 <slaweq> so basically it is failing on compilation of ovs
15:58:24 <ralonsoh> exactly
15:58:44 <ralonsoh> we can ping fedora guys in bugzilla
15:58:49 <ralonsoh> (I think so)
15:58:58 <maciejjozefczyk> we recently added compilation of the ovs module when using qos
15:59:03 <slaweq> maciejjozefczyk: You were playing with this compilation of the ovs module
15:59:11 <slaweq> can You take a look at this one?
15:59:13 <maciejjozefczyk> yes, that was me
15:59:17 <maciejjozefczyk> slaweq, ok
15:59:21 <slaweq> thx a lot
15:59:47 <slaweq> #action maciejjozefczyk to check failing compile of ovs in the fedora periodic job
15:59:58 <slaweq> and we are almost out of time
16:00:00 <slaweq> thx for attending
16:00:06 <slaweq> #endmeeting