15:00:13 <slaweq> #startmeeting neutron_ci
15:00:13 <opendevmeet> Meeting started Tue Feb 14 15:00:13 2023 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:13 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:13 <opendevmeet> The meeting name has been set to 'neutron_ci'
15:00:17 <mlavalle> o/
15:00:23 <slaweq> ping bcafarel, lajoskatona, mlavalle, mtomaska, ralonsoh, ykarel, jlibosva
15:00:27 <bcafarel> o/
15:00:29 <ykarel> o/
15:00:29 <mtomaska> o/
15:00:30 <slaweq> hi
15:00:30 <jlibosva> o/
15:00:32 <ralonsoh> hi
15:00:32 <slaweq> Grafana dashboard: https://grafana.opendev.org/d/f913631585/neutron-failure-rate?orgId=1
15:00:53 <slaweq> ok, let's start as we have many topics for today
15:00:55 <slaweq> #topic Actions from previous meetings
15:01:02 <slaweq> lajoskatona to check additional logs in failing dvr related functional tests
15:01:37 <lajoskatona> o/
15:02:01 <lajoskatona> yes, I checked and I am still not closer
15:02:24 <lajoskatona> I pushed a patch with tries for this issue, and today it is in this state:
15:02:44 <lajoskatona> https://review.opendev.org/c/openstack/neutron/+/873111
15:03:24 <lajoskatona> it is now set to run part of these tests with concurrency=1
15:04:00 <opendevreview> Fernando Royo proposed openstack/ovn-octavia-provider master: Avoid use of ovn metadata port IP for HM checks https://review.opendev.org/c/openstack/ovn-octavia-provider/+/873426
15:04:04 <lajoskatona> when I checked, the functional failures were for other reasons and not for the usual DVR related issues, but I have to check this topic more
15:04:35 <slaweq> what about execution time? is it much longer?
15:05:06 <lajoskatona> good question, I haven't checked it yet, but I see one issue which is "too many files open"
15:05:28 <lajoskatona> not always, but from this patch's 20 runs I saw it once as I remember
15:05:42 <slaweq> :/
15:06:10 <lajoskatona> which is strange, as in serial I would expect less load on such things
15:06:35 <slaweq> maybe that's another issue and some files aren't closed properly?
15:06:56 <ralonsoh> during the config parsing, maybe
15:07:05 <lajoskatona> possible, I can check that
15:07:18 <slaweq> thx
15:07:21 <ralonsoh> can you open another bug for this issue?
15:07:33 <slaweq> I will add an AI for you for next week to remember about it
15:07:44 <lajoskatona> ack
15:07:50 <slaweq> #action lajoskatona to continue checking dvr functional tests issues
15:09:00 <slaweq> next one
15:09:02 <slaweq> ralonsoh to try to store journal log in UT job's results to debug "no such table" issues
15:09:20 <ralonsoh> no, I didn't work on this one
15:09:29 <ralonsoh> I was working on the FT ones
15:09:40 <slaweq> #action ralonsoh to try to store journal log in UT job's results to debug "no such table" issues
15:09:46 <slaweq> let's keep it for next week then
15:09:54 <slaweq> next one
15:09:56 <slaweq> mtomaska to check failed test_restart_rpc_on_sighup_multiple_workers functional test
15:10:06 <mtomaska> I looked into it. Of course I am not able to reproduce it locally. I really think this failure happens because of stestr concurrency, which I can't quite replicate on my dev machine. Adding more logs to the test won't help. Is there a way to mark the test with concurrency=1 permanently?
15:10:41 <ralonsoh> you can move this test to the second FT execution
15:10:48 <ralonsoh> where we use concurrency one
15:11:00 <slaweq> you can do it the way lajoskatona did in https://review.opendev.org/c/openstack/neutron/+/873111
15:11:02 <slaweq> but I would really like to avoid doing that for all tests :)
15:11:15 <ralonsoh> we can move one single test only
15:11:19 <ralonsoh> same as sql ones
15:11:38 <ralonsoh> or "test_get_all_devices"
15:11:40 <mtomaska> ok cool. ralonsoh can I ping you later on how to?
15:11:43 <ralonsoh> sure
15:11:45 <slaweq> but if we are sure that this is an issue
15:12:51 <slaweq> next one
15:12:52 <slaweq> ralonsoh to check "Can't find port binding with logical port XXX" error
15:13:15 <ralonsoh> #link https://review.opendev.org/c/openstack/neutron/+/873126
15:13:20 <ralonsoh> ^ this is the patch
15:13:54 <slaweq> great, at least one fixed :)
15:14:10 <slaweq> thx
15:14:19 <slaweq> next one
15:14:20 <slaweq> mlavalle to check LRP failed to bind failure
15:14:31 <mlavalle> I checked it here: https://zuul.opendev.org/t/openstack/builds?job_name=neutron-functional&branch=master&skip=0
15:14:49 <mlavalle> didn't happen again since the occurrence you reported a week ago
15:15:02 <mlavalle> so for now, we don't need to worry about it
15:15:10 <mlavalle> it was a one-off
15:15:14 <lajoskatona> \o/
15:15:28 <lajoskatona> good news
15:15:48 <slaweq> not really
15:15:53 <slaweq> I saw the same issue today :/
15:16:06 <slaweq> https://32c2553a5b1b0d58bd31-9e074609c9fda723118db1a634ce1a12.ssl.cf5.rackcdn.com/873247/3/gate/neutron-functional-with-uwsgi/6195659/testr_results.html
15:16:09 <lajoskatona> ohhhh
15:16:09 <mlavalle> oh, I'm reporting from my last check last night
15:16:44 <mlavalle> and that is a different job I think
15:16:53 <mlavalle> than the one we looked at last week
15:17:57 <slaweq> yes, it's different
15:18:06 <slaweq> also functional tests, but it's from the gate queue
15:18:12 <mlavalle> yeap
15:18:12 <slaweq> last week it was in periodic
15:18:17 <mlavalle> I was looking at periodic
15:19:01 <mlavalle> so that tells us that test case is flaky
15:19:08 <slaweq> yeap
15:19:14 <mlavalle> acorss jobs
15:19:22 <mlavalle> across jobs
15:19:23 <slaweq> will you try to investigate again this week?
15:19:51 <mlavalle> I'm sorry. I can't. I will be off on PTO starting tomorrow until March 1st
15:20:11 <mlavalle> but if you are willing to wait, I'll be happy to look at it upon my return
15:20:25 <slaweq> oh, ok
15:20:38 <slaweq> I will open an LP bug to not forget about it
15:20:49 <slaweq> and we will see how often it will be occurring
15:21:01 <mlavalle> assign it to me if nobody is going to look at it
15:21:14 <slaweq> #action slaweq to report bug about failed to bind LRP in functional tests
15:21:21 <slaweq> sure, thx mlavalle
15:21:30 <mlavalle> thank you!
15:21:54 <slaweq> and last one
15:21:55 <slaweq> slaweq to check fullstack failures in test_multi_segs_network
15:21:56 <slaweq> I spent some time investigating it
15:21:58 <slaweq> I reported bug https://bugs.launchpad.net/neutron/+bug/2007152
15:22:01 <slaweq> it seems to be a valid issue in the code, not a test problem
15:22:13 <slaweq> but I don't know yet what's wrong exactly there
15:22:38 <slaweq> there is some race between creation of subnets in the network and cleaning stale devices from the namespace
15:23:01 <slaweq> I assigned this bug to myself and I will continue working on it
15:24:42 <slaweq> any questions/comments?
15:25:14 <lajoskatona> nothing from me
15:25:23 <slaweq> if not, let's move on
15:25:28 <slaweq> #topic Stable branches
15:25:36 * bcafarel hides
15:25:36 <slaweq> bcafarel any updates?
15:25:44 <slaweq> :)
15:25:54 <bcafarel> good and bad this week :)
15:26:26 <bcafarel> ussuri grenade and train look good from a quick look thanks to ykarel, for wallaby I opened a bug and there is also a fix in progress by ykarel https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/873708
15:26:58 <bcafarel> victoria, we have 2 backports failing on py38/cover with "pkg_resources.extern.packaging.version.InvalidVersion: Invalid version: '<MagicMock name='execute().split().__getitem__().__getitem__()' id='140203724932048'>'"
15:27:06 <bcafarel> https://zuul.opendev.org/t/openstack/build/8f14ad998131488680c8f2142d1eab94
15:27:29 <bcafarel> this one seems familiar but I could not find anything relevant, does it ring a bell? (if not, I will open an lp to track it)
15:27:47 <bcafarel> (and other branches are good \o/)
15:29:20 <ralonsoh> ^^ we can force _get_version() to return the value, not utils_exec
15:29:25 <ralonsoh> I'll push a patch
15:29:47 <slaweq> ++
15:29:48 <slaweq> thx ralonsoh
15:29:53 <bcafarel> ralonsoh++ thanks
15:30:00 <slaweq> and thx bcafarel for updates
15:31:18 <slaweq> so let's move on
15:31:20 <slaweq> #topic Stadium projects
15:31:33 <slaweq> lajoskatona I see there is a bunch of topics in the agenda
15:31:40 <slaweq> are those added by you?
15:31:43 <lajoskatona> most of them seem ok
15:32:00 <lajoskatona> no, I think it was not me
15:32:27 <lajoskatona> I think it is for one of the not so green ones
15:32:43 <slaweq> ok, maybe ykarel added them
15:33:02 <lajoskatona> fwaas job is failing, and it is causing failures as it is executed with tempest-plugin
15:33:09 <slaweq> there are links added there to failing fwaas tests
15:33:13 <ykarel> the fwaas one we discussed in the previous meeting
15:33:32 <lajoskatona> this is the bug for this: https://bugs.launchpad.net/neutron/+bug/2006683
15:33:37 <lajoskatona> for fwaas I mean
15:34:38 <slaweq> who is the fwaas liaison now?
15:34:47 <lajoskatona> the other topic is for the cfg related issue (see https://review.opendev.org/c/openstack/neutron/+/872644 )
15:35:25 <lajoskatona> I think neutron-dynamic-routing and bgpvpn were the last affected by this, I have to check the fresh runs for them
15:35:29 <ralonsoh> I don't remember where this list is
15:36:10 <slaweq> https://docs.openstack.org/neutron/latest/contributor/policies/neutron-teams.html
15:36:12 <slaweq> found it :)
15:36:26 <slaweq> so it is zhouhenglc
15:36:36 <slaweq> maybe we should ping him to check those failures?
15:36:57 <lajoskatona> +1
15:36:58 <slaweq> and temporarily we can propose to make fwaas jobs non-voting if he does not reply
15:37:00 <lajoskatona> I will ping him
15:37:06 <ralonsoh> thanks
15:37:08 <slaweq> lajoskatona++ thx
15:37:28 <slaweq> #action lajoskatona to talk with zhouhenglc about fwaas jobs issues
15:37:50 <slaweq> ok, anything else regarding stadium today?
15:38:22 <lajoskatona> nothing from me
15:38:51 <slaweq> so let's move on
15:38:55 <slaweq> #topic Grafana
15:39:10 <slaweq> here everything looks like it is on a high failure rate to me
15:39:21 <ralonsoh> don't check the numbers too much
15:39:35 <ralonsoh> tempest ipv6: failing because of the OVN issue in Jammy
15:39:53 <ralonsoh> FT: failing because of the test error I introduced (fixed now)
15:40:08 <slaweq> yes, I also saw pretty many "DNM" or "WIP" patches where most of the jobs failed
15:40:11 <ralonsoh> and I've been playing with CI, using 20 FT jobs at the same time
15:40:12 <ralonsoh> yes
15:40:26 <ralonsoh> so please, don't focus on the numbers this week
15:40:29 <slaweq> so those results may be a bit misleading
15:40:42 <slaweq> ok, so let's move on to the rechecks then
15:40:47 <slaweq> #topic Rechecks
15:40:57 <slaweq> +---------+----------+... (full message at <https://matrix.org/_matrix/media/v3/download/matrix.org/flyUncZffilepRknBJrkPGIz>)
15:41:12 <slaweq> last week it was still very high: 2.0 rechecks on average to get a patch merged
15:41:34 <ralonsoh> yes, but always the same errors
15:41:44 <slaweq> but it's better in e.g. the last 5 days or so
15:42:10 <slaweq> so hopefully it will be going down once we fix the few issues mentioned already
15:42:21 <slaweq> and some which I still have to discuss today :)
15:42:41 <slaweq> #topic Unit tests
15:42:51 <slaweq> critical bug reported https://bugs.launchpad.net/neutron/+bug/2007254
15:43:02 <ralonsoh> I'm on this one
15:43:36 <slaweq> you are working on it?
15:43:38 <ralonsoh> tyes
15:43:40 <ralonsoh> yes
15:43:48 <ralonsoh> I'll assign it to me
15:43:51 <slaweq> ok, please assign it to yourself then :)
15:43:53 <slaweq> thx
15:43:54 <ykarel> k thx, I too was trying to reproduce it, but couldn't
15:44:27 <slaweq> next topic
15:44:28 <slaweq> #topic fullstack/functional
15:44:47 <slaweq> here I found only one new issue
15:44:48 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_777/870081/6/check/neutron-functional-with-uwsgi/7775413/testr_results.html
15:44:56 <slaweq> something related to mysql errors
15:45:33 <ralonsoh> again? maybe those tests have been executed at the same time, using mysql
15:45:39 <ralonsoh> right, mysql
15:46:05 <ralonsoh> I wouldn't investigate it, this is just a test clash
15:46:48 <mtomaska> a test called "test_1" should not even be considered a real test :) . That's just bad naming
15:47:11 <slaweq> mtomaska yeah :)
15:47:29 <slaweq> ralonsoh but both were run with "--concurrency 1"
15:47:30 <ralonsoh> my bad when I pushed this patch
15:48:07 <slaweq> 2023-02-10 11:29:15.131934 | controller | ======... (full message at <https://matrix.org/_matrix/media/v3/download/matrix.org/AJmoLvYurxYfOWemkJFUhUBQ>)
15:48:20 <ralonsoh> slaweq, yes, you are right
15:49:33 <slaweq> my bet is that it was the oom-killer that caused the failure of the first test
15:49:38 <slaweq> so cleanup wasn't done
15:49:41 <slaweq> and that's why the second failed
15:49:50 <ralonsoh> yes, the 2nd is caused by the first one
15:49:56 <slaweq> but I didn't check it in the journal logs
15:50:05 <ralonsoh> and the first because of a timeout
15:50:14 <ralonsoh> I wouldn't spend too much time on this one
15:50:35 <slaweq> yeah, let's move on to the other topics
15:50:44 <slaweq> #topic grenade
15:50:56 <slaweq> I added this today as we have pretty many failures there
15:51:03 <slaweq> first, added by ykarel
15:51:09 <slaweq> seeing pcp deploy issues in these jobs randomly, basically can be seen in jobs where dstat is enabled
15:51:33 <slaweq> ykarel do you think that disabling dstat in grenade jobs will fix that problem (or work around it)?
15:51:54 <ykarel> slaweq, workaround
15:52:03 <ykarel> as the actual fix would be needed in the pcp package
15:52:09 <slaweq> ok
15:52:15 <slaweq> will you propose a patch?
15:52:21 <ykarel> sure, will do that
15:52:25 <ralonsoh> ykarel++
15:52:26 <slaweq> thx a lot
15:52:47 <slaweq> another issue which I have seen in grenade this week was related to failed ping to server:
15:52:52 <slaweq> https://58140b717b0fd2108807-1424e0f18ed1d65bb7e7c00bb059b2d8.ssl.cf2.rackcdn.com/872830/1/gate/neutron-ovs-grenade-multinode/6b0f11b/controller/logs/grenade.sh_log.txt
15:52:52 <slaweq> https://bca8357d212b699db2ea-90e26954923e94ff0df2f7d3073d9fdf.ssl.cf2.rackcdn.com/871983/3/check/neutron-ovn-grenade-multinode-skip-level/fd1c1fa/controller/logs/grenade.sh_log.txt
15:52:52 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_01d/873615/2/check/neutron-ovs-grenade-dvr-multinode/01d90b6/controller/logs/grenade.sh_log.txt
15:53:00 <slaweq> I found at least those 3 occurrences
15:53:07 <slaweq> so I think we should investigate it
15:53:25 <slaweq> anyone wants to check it?
15:53:38 <ralonsoh> sorry, not this week
15:53:49 <slaweq> no need to be sorry ralonsoh :)
15:54:12 <slaweq> if nobody has any cycles, I will report an LP for it and we will see
15:54:28 <slaweq> #action slaweq to report bug with failed ping in grenade jobs
15:54:44 <ykarel> +1
15:54:48 <slaweq> next topic
15:54:55 <slaweq> #topic Tempest/Scenario
15:55:07 <slaweq> here I saw a bunch of tests failing with ssh timeout
15:55:18 <slaweq> like e.g. tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_hotplug_nic:
15:55:24 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f1f/872737/3/gate/neutron-ovn-tempest-ipv6-only-ovs-release/f1f5212/testr_results.html
15:55:24 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_18b/864649/7/check/neutron-ovn-tempest-ipv6-only-ovs-release/18b432e/testr_results.html
15:55:24 <slaweq> https://74b83773945845894408-fddd31f569f44d98bd401bfd88253d97.ssl.cf2.rackcdn.com/873553/1/check/neutron-ovn-tempest-ipv6-only-ovs-release/6800220/testr_results.html
15:55:26 <ralonsoh> ykarel, found the error
15:55:31 <ralonsoh> https://review.opendev.org/c/openstack/neutron/+/873684
15:55:54 <ralonsoh> I think we should mark it as unstable
15:56:02 <ralonsoh> until OVN is fixed in Jammy
15:56:03 <ykarel> yes, from the logs I found that metadata flows were missing
15:56:14 <slaweq> that's great
15:56:20 <ykarel> and the issue is not happening on focal
15:56:30 <ykarel> so it's specific to ovn 22.03.0 in ubuntu jammy
15:56:35 <slaweq> can we skip this one test only, or can it happen in other tests too?
15:56:48 <ralonsoh> only this one
15:56:50 <ralonsoh> so far
15:57:09 <slaweq> so let's skip that test in the job's definition temporarily and keep the job voting
15:57:10 <ykarel> yes, seen only in this test, but it can happen to others too
15:57:37 <slaweq> if it happens in other tests too, we can always mark the job as non-voting temporarily later
15:57:39 <slaweq> wdyt?
15:57:54 <ralonsoh> we can skip this test only, for now
15:58:14 <ykarel> +1 to what ralonsoh said, and monitor the job
15:58:16 <slaweq> ok, ykarel will you propose a patch for that?
15:58:26 <ykarel> ok, will do
15:58:32 <slaweq> thx
15:58:47 <slaweq> ok, we are almost at the top of the hour
15:58:55 <slaweq> but I have one last thing for today
15:59:01 <slaweq> there is a question from amorin in https://review.opendev.org/c/openstack/neutron/+/869741
15:59:16 <amorin> hello!
15:59:17 <slaweq> it's related to CI and neutron-ovs-tempest-dvr-ha-multinode-full
15:59:23 <ralonsoh> yes, I talked to him about this
15:59:26 <slaweq> and I think it's a valid question
15:59:32 <ralonsoh> dvr_snat is not valid on computes
15:59:38 <ralonsoh> but we are still configuring it
15:59:42 <slaweq> I would like obondarev to look into this one too
16:00:25 <slaweq> IMO we can change the L3 agent's mode to "dvr" on computes in that job
16:00:30 <ykarel> fwiw this job used to be 3 nodes in the past, when it was switched to 2 nodes dvr_snat was set
16:00:40 <slaweq> and keep "ha=True" so it will be "ha" routers with only one node
16:00:46 <ralonsoh> yeah...
16:00:49 <slaweq> the same as we are doing in the neutron-tempest-plugin jobs
16:00:58 <slaweq> the code path for ha will be tested
16:01:20 <slaweq> there will be no real failover, but that's not something we are really testing in this job
16:01:24 <slaweq> ok, we are out of time
16:01:33 <slaweq> please comment on this patch and we can discuss there
16:01:37 <ralonsoh> ok
16:01:41 <slaweq> thx for attending the meeting today
16:01:43 <slaweq> #endmeeting
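The stable/victoria py38/cover failure raised in the Stable branches topic boils down to a MagicMock leaking into version parsing: with the low-level execute() helper mocked, the code under test ends up handing the MagicMock's repr string to packaging, which rejects it with InvalidVersion. The approach ralonsoh describes, stubbing the version helper itself rather than the execute call, can be sketched roughly as below; the helper names (_get_version, execute) and version numbers are purely illustrative and are not the actual neutron code or the patch that was pushed.

# Self-contained sketch of the failure mode and the suggested fix; all
# names and version numbers here are hypothetical, not the real patch.
from unittest import mock

from packaging import version


def execute(cmd):
    """Stand-in for a utils.execute-style helper that shells out."""
    raise NotImplementedError('would run: %s' % ' '.join(cmd))


def _get_version():
    """Hypothetical helper: run a tool and parse its version from stdout."""
    out = execute(['some-tool', '--version'])
    return str(out.split()[2])


def is_new_enough(minimum='2.15.0'):
    """Compare the parsed tool version against a minimum requirement."""
    return version.parse(_get_version()) >= version.parse(minimum)


# Mocking execute() (what the failing unit tests effectively did) makes
# _get_version() return the repr of a MagicMock, and recent packaging
# releases reject that string with InvalidVersion -- the py38/cover error.
# Mocking the version helper itself returns a real string, so the parse
# succeeds and the test no longer depends on MagicMock internals:
with mock.patch(__name__ + '._get_version', return_value='2.17.0'):
    assert is_new_enough('2.15.0')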