15:00:13 <slaweq> #startmeeting neutron_ci
15:00:13 <opendevmeet> Meeting started Tue Feb 14 15:00:13 2023 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:13 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:13 <opendevmeet> The meeting name has been set to 'neutron_ci'
15:00:17 <mlavalle> o/
15:00:23 <slaweq> ping bcafarel, lajoskatona, mlavalle, mtomaska, ralonsoh, ykarel, jlibosva
15:00:27 <bcafarel> o/
15:00:29 <ykarel> o/
15:00:29 <mtomaska> o/
15:00:30 <slaweq> hi
15:00:30 <jlibosva> o/
15:00:32 <ralonsoh> hi
15:00:32 <slaweq> Grafana dashboard: https://grafana.opendev.org/d/f913631585/neutron-failure-rate?orgId=1
15:00:53 <slaweq> ok, let's start as we have many topics for today
15:00:55 <slaweq> #topic Actions from previous meetings
15:01:02 <slaweq> lajoskatona to check additional logs in failing dvr related functional tests
15:01:37 <lajoskatona> o/
15:02:01 <lajoskatona> yes, I checked and I am still not closer
15:02:24 <lajoskatona> I pushed a patch with tries for this issue, and today it is in this state:
15:02:44 <lajoskatona> https://review.opendev.org/c/openstack/neutron/+/873111
15:03:24 <lajoskatona> it is now set to run part of these tests with concurrency=1
15:04:00 <opendevreview> Fernando Royo proposed openstack/ovn-octavia-provider master: Avoid use of ovn metadata port IP for HM checks https://review.opendev.org/c/openstack/ovn-octavia-provider/+/873426
15:04:04 <lajoskatona> when I checked, the functional failures were for other reasons and not for the usual DVR related issues, but I have to check this topic more
15:04:35 <slaweq> what about execution time? is it much longer?
15:05:06 <lajoskatona> good question, I haven't checked it yet, but I see one issue which is "too many files open"
15:05:28 <lajoskatona> not always, but from this patch's 20 runs I saw it once as I remember
15:05:42 <slaweq> :/
15:06:10 <lajoskatona> which is strange, as in serial I would expect less load on such things
15:06:35 <slaweq> maybe that's another issue and some files aren't closed properly?
15:06:56 <ralonsoh> during the config parsing, maybe
15:07:05 <lajoskatona> possible, I can check that
15:07:18 <slaweq> thx
15:07:21 <ralonsoh> can you open another bug for this issue?
15:07:33 <slaweq> I will add an AI for you for next week to remember about it
15:07:44 <lajoskatona> ack
15:07:50 <slaweq> #action lajoskatona to continue checking dvr functional tests issues
15:09:00 <slaweq> next one
15:09:02 <slaweq> ralonsoh to try to store journal log in UT job's results to debug "no such table" issues
15:09:20 <ralonsoh> no, I didn't work on this one
15:09:29 <ralonsoh> I was working on the FT ones
15:09:40 <slaweq> #action ralonsoh to try to store journal log in UT job's results to debug "no such table" issues
15:09:46 <slaweq> let's keep it for next week then
15:09:54 <slaweq> next one
15:09:56 <slaweq> mtomaska to check failed test_restart_rpc_on_sighup_multiple_workers functional test
15:10:06 <mtomaska> I looked into it. Of course I am not able to reproduce it locally. I really think this failure happens because of stestr concurrency, which I can't quite replicate on my dev machine. Adding more logs to the test won't help. Is there a way to mark the test with concurrency=1 permanently?
15:10:41 <ralonsoh> you can move this test to the second FT execution
15:10:48 <ralonsoh> where we use concurrency one
15:11:00 <slaweq> you can do it the way lajoskatona did in https://review.opendev.org/c/openstack/neutron/+/873111
15:11:02 <slaweq> but I would really like to avoid doing that for all tests :)
15:11:15 <ralonsoh> we can move one single test only
15:11:19 <ralonsoh> same as sql ones
15:11:38 <ralonsoh> or "test_get_all_devices"
15:11:40 <mtomaska> ok cool. ralonsoh can I ping you later on how to?
15:11:43 <ralonsoh> sure
15:11:45 <slaweq> but if we are sure that this is an issue
15:12:51 <slaweq> next one
15:12:52 <slaweq> ralonsoh to check "Can't find port binding with logical port XXX" error
15:13:15 <ralonsoh> #link https://review.opendev.org/c/openstack/neutron/+/873126
15:13:20 <ralonsoh> ^ this is the patch
15:13:54 <slaweq> great, at least one fixed :)
15:14:10 <slaweq> thx
15:14:19 <slaweq> next one
15:14:20 <slaweq> mlavalle to check LRP failed to bind failure
15:14:31 <mlavalle> I checked it here: https://zuul.opendev.org/t/openstack/builds?job_name=neutron-functional&branch=master&skip=0
15:14:49 <mlavalle> didn't happen again since the occurrence you reported a week ago
15:15:02 <mlavalle> so for now, we don't need to worry about it
15:15:10 <mlavalle> it was a one-off
15:15:14 <lajoskatona> \o/
15:15:28 <lajoskatona> good news
15:15:48 <slaweq> not really
15:15:53 <slaweq> I saw the same issue today :/
15:16:06 <slaweq> https://32c2553a5b1b0d58bd31-9e074609c9fda723118db1a634ce1a12.ssl.cf5.rackcdn.com/873247/3/gate/neutron-functional-with-uwsgi/6195659/testr_results.html
15:16:09 <lajoskatona> ohhhh
15:16:09 <mlavalle> oh, I'm reporting from my last check last night
15:16:44 <mlavalle> and that is a different job I think
15:16:53 <mlavalle> than the one we looked at last week
15:17:57 <slaweq> yes, it's different
15:18:06 <slaweq> also functional tests, but it's from the gate queue
15:18:12 <mlavalle> yeap
15:18:12 <slaweq> last week it was in periodic
15:18:17 <mlavalle> I was looking at periodic
15:19:01 <mlavalle> so that tells us that test case is flaky
15:19:08 <slaweq> yeap
15:19:14 <mlavalle> acorss jobs
15:19:22 <mlavalle> across jobs
15:19:23 <slaweq> will you try to investigate again this week?
15:19:51 <mlavalle> I'm sorry. I can't. I will be off on PTO starting tomorrow until March 1st
15:20:11 <mlavalle> but if you are willing to wait, I'll be happy to look at it upon my return
15:20:25 <slaweq> oh, ok
15:20:38 <slaweq> I will open an LP bug to not forget about it
15:20:49 <slaweq> and we will see how often it will be occurring
15:21:01 <mlavalle> assign it to me if nobody is going to look at it
15:21:14 <slaweq> #action slaweq to report bug about failed to bind LRP in functional tests
15:21:21 <slaweq> sure, thx mlavalle
15:21:30 <mlavalle> thank you!
15:21:54 <slaweq> and last one
15:21:55 <slaweq> slaweq to check fullstack failures in test_multi_segs_network
15:21:56 <slaweq> I spent some time investigating it
15:21:58 <slaweq> I reported bug https://bugs.launchpad.net/neutron/+bug/2007152
15:22:01 <slaweq> it seems to be a valid issue in the code, not a test problem
15:22:13 <slaweq> but I don't know yet what's wrong exactly there
15:22:38 <slaweq> there is some race between creation of subnets in the network and cleaning stale devices from the namespace
15:23:01 <slaweq> I assigned this bug to myself and I will continue working on it
15:24:42 <slaweq> any questions/comments?
15:25:14 <lajoskatona> nothing from me
15:25:23 <slaweq> if not, let's move on
15:25:28 <slaweq> #topic Stable branches
15:25:36 * bcafarel hides
15:25:36 <slaweq> bcafarel any updates?
15:25:44 <slaweq> :)
15:25:54 <bcafarel> good and bad this week :)
15:26:26 <bcafarel> ussuri grenade and train look good from a quick look thanks to ykarel, for wallaby I opened a bug and there is also a fix in progress by ykarel https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/873708
15:26:58 <bcafarel> victoria, we have 2 backports failing on py38/cover with "pkg_resources.extern.packaging.version.InvalidVersion: Invalid version: '<MagicMock name='execute().split().__getitem__().__getitem__()' id='140203724932048'>'"
15:27:06 <bcafarel> https://zuul.opendev.org/t/openstack/build/8f14ad998131488680c8f2142d1eab94
15:27:29 <bcafarel> this one seems familiar but I could not find anything relevant, does it ring a bell? (if not, I will open an lp to track it)
15:27:47 <bcafarel> (and other branches are good \o/)
15:29:20 <ralonsoh> ^^ we can force _get_version() to return the value, not utils_exec
15:29:25 <ralonsoh> I'll push a patch
15:29:47 <slaweq> ++
15:29:48 <slaweq> thx ralonsoh
15:29:53 <bcafarel> ralonsoh++ thanks
15:30:00 <slaweq> and thx bcafarel for updates
15:31:18 <slaweq> so let's move on
15:31:20 <slaweq> #topic Stadium projects
15:31:33 <slaweq> lajoskatona I see there is a bunch of topics in the agenda
15:31:40 <slaweq> are those added by you?
15:31:43 <lajoskatona> most of them seem ok
15:32:00 <lajoskatona> no, I think it was not me
15:32:27 <lajoskatona> I think it is for one of the not so green ones
15:32:43 <slaweq> ok, maybe ykarel added them
15:33:02 <lajoskatona> fwaas job is failing, and it is causing failures as it is executed with tempest-plugin
15:33:09 <slaweq> there are links added there to failing fwaas tests
15:33:13 <ykarel> the fwaas one we discussed in the previous meeting
15:33:32 <lajoskatona> this is the bug for this: https://bugs.launchpad.net/neutron/+bug/2006683
15:33:37 <lajoskatona> for fwaas I mean
15:34:38 <slaweq> who is the fwaas liaison now?
15:34:47 <lajoskatona> the other topic is for the cfg related issue (see https://review.opendev.org/c/openstack/neutron/+/872644 )
15:35:25 <lajoskatona> I think neutron-dynamic-routing and bgpvpn were the last affected by this, I have to check the fresh runs for them
15:35:29 <ralonsoh> I don't remember where this list is
15:36:10 <slaweq> https://docs.openstack.org/neutron/latest/contributor/policies/neutron-teams.html
15:36:12 <slaweq> found it :)
15:36:26 <slaweq> so it is zhouhenglc
15:36:36 <slaweq> maybe we should ping him to check those failures?
15:36:57 <lajoskatona> +1
15:36:58 <slaweq> and temporarily we can propose to make fwaas jobs non-voting if he does not reply
15:37:00 <lajoskatona> I will ping him
15:37:06 <ralonsoh> thanks
15:37:08 <slaweq> lajoskatona++ thx
15:37:28 <slaweq> #action lajoskatona to talk with zhouhenglc about fwaas jobs issues
15:37:50 <slaweq> ok, anything else regarding stadium today?
15:38:22 <lajoskatona> nothing from me
15:38:51 <slaweq> so let's move on
15:38:55 <slaweq> #topic Grafana
15:39:10 <slaweq> here everything looks like it is on a high failure rate to me
15:39:21 <ralonsoh> don't check the numbers too much
15:39:35 <ralonsoh> tempest ipv6: failing because of the OVN issue in Jammy
15:39:53 <ralonsoh> FT: failing because of the test error I introduced (fixed now)
15:40:08 <slaweq> yes, I also saw pretty many "DNM" or "WIP" patches where most of the jobs failed
15:40:11 <ralonsoh> and I've been playing with CI, using 20 FT jobs at the same time
15:40:12 <ralonsoh> yes
15:40:26 <ralonsoh> so please, don't focus on the numbers this week
15:40:29 <slaweq> so those results may be a bit misleading
15:40:42 <slaweq> ok, so let's move on to the rechecks then
15:40:47 <slaweq> #topic Rechecks
15:40:57 <slaweq> +---------+----------+... (full message at <https://matrix.org/_matrix/media/v3/download/matrix.org/flyUncZffilepRknBJrkPGIz>)
15:41:12 <slaweq> last week it was still very high: 2.0 rechecks on average to get a patch merged
15:41:34 <ralonsoh> yes, but always the same errors
15:41:44 <slaweq> but it's better in e.g. the last 5 days or so
15:42:10 <slaweq> so hopefully it will be going down once we fix the few issues mentioned already
15:42:21 <slaweq> and some which I still have to discuss today :)
15:42:41 <slaweq> #topic Unit tests
15:42:51 <slaweq> critical bug reported https://bugs.launchpad.net/neutron/+bug/2007254
15:43:02 <ralonsoh> I'm on this one
15:43:36 <slaweq> you are working on it?
15:43:38 <ralonsoh> tyes
15:43:40 <ralonsoh> yes
15:43:48 <ralonsoh> I'll assign it to me
15:43:51 <slaweq> ok, please assign it to yourself then :)
15:43:53 <slaweq> thx
15:43:54 <ykarel> k thx, I too was trying to reproduce it, but couldn't
15:44:27 <slaweq> next topic
15:44:28 <slaweq> #topic fullstack/functional
15:44:47 <slaweq> here I found only one new issue
15:44:48 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_777/870081/6/check/neutron-functional-with-uwsgi/7775413/testr_results.html
15:44:56 <slaweq> something related to mysql errors
15:45:33 <ralonsoh> again? maybe those tests have been executed at the same time, using mysql
15:45:39 <ralonsoh> right, mysql
15:46:05 <ralonsoh> I wouldn't investigate it, this is just a test clash
15:46:48 <mtomaska> a test called "test_1" should not even be considered a real test :) . That's just bad naming
15:47:11 <slaweq> mtomaska yeah :)
15:47:29 <slaweq> ralonsoh but both were run with "--concurrency 1"
15:47:30 <ralonsoh> my bad when I pushed this patch
15:48:07 <slaweq> 2023-02-10 11:29:15.131934 | controller | ======... (full message at <https://matrix.org/_matrix/media/v3/download/matrix.org/AJmoLvYurxYfOWemkJFUhUBQ>)
15:48:20 <ralonsoh> slaweq, yes, you are right
15:49:33 <slaweq> my bet is that it was the oom-killer that caused the failure of the first test
15:49:38 <slaweq> so cleanup wasn't done
15:49:41 <slaweq> and that's why the second failed
15:49:50 <ralonsoh> yes, the 2nd is caused by the first one
15:49:56 <slaweq> but I didn't check it in the journal logs
15:50:05 <ralonsoh> and the first because of a timeout
15:50:14 <ralonsoh> I wouldn't spend too much time on this one
15:50:35 <slaweq> yeah, let's move on to the other topics
15:50:44 <slaweq> #topic grenade
15:50:56 <slaweq> I added this today as we have pretty many failures there
15:51:03 <slaweq> first, added by ykarel
15:51:09 <slaweq> seeing pcp deploy issues in these jobs randomly, basically can be seen in jobs where dstat is enabled
15:51:33 <slaweq> ykarel do you think that disabling dstat in grenade jobs will fix that problem (or work around it)?
15:51:54 <ykarel> slaweq, workaround
15:52:03 <ykarel> as the actual fix would be needed in the pcp package
15:52:09 <slaweq> ok
15:52:15 <slaweq> will you propose a patch?
15:52:21 <ykarel> sure, will do that
15:52:25 <ralonsoh> ykarel++
15:52:26 <slaweq> thx a lot
15:52:47 <slaweq> another issue which I have seen in grenade this week was related to failed ping to server:
15:52:52 <slaweq> https://58140b717b0fd2108807-1424e0f18ed1d65bb7e7c00bb059b2d8.ssl.cf2.rackcdn.com/872830/1/gate/neutron-ovs-grenade-multinode/6b0f11b/controller/logs/grenade.sh_log.txt
15:52:52 <slaweq> https://bca8357d212b699db2ea-90e26954923e94ff0df2f7d3073d9fdf.ssl.cf2.rackcdn.com/871983/3/check/neutron-ovn-grenade-multinode-skip-level/fd1c1fa/controller/logs/grenade.sh_log.txt
15:52:52 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_01d/873615/2/check/neutron-ovs-grenade-dvr-multinode/01d90b6/controller/logs/grenade.sh_log.txt
15:53:00 <slaweq> I found at least those 3 occurrences
15:53:07 <slaweq> so I think we should investigate it
15:53:25 <slaweq> anyone wants to check it?
15:53:38 <ralonsoh> sorry, not this week
15:53:49 <slaweq> no need to be sorry ralonsoh :)
15:54:12 <slaweq> if nobody has any cycles, I will report an LP for it and we will see
15:54:28 <slaweq> #action slaweq to report bug with failed ping in grenade jobs
15:54:44 <ykarel> +1
15:54:48 <slaweq> next topic
15:54:55 <slaweq> #topic Tempest/Scenario
15:55:07 <slaweq> here I saw a bunch of tests failing with ssh timeout
15:55:18 <slaweq> like e.g. tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_hotplug_nic:
15:55:24 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f1f/872737/3/gate/neutron-ovn-tempest-ipv6-only-ovs-release/f1f5212/testr_results.html
15:55:24 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_18b/864649/7/check/neutron-ovn-tempest-ipv6-only-ovs-release/18b432e/testr_results.html
15:55:24 <slaweq> https://74b83773945845894408-fddd31f569f44d98bd401bfd88253d97.ssl.cf2.rackcdn.com/873553/1/check/neutron-ovn-tempest-ipv6-only-ovs-release/6800220/testr_results.html
15:55:26 <ralonsoh> ykarel, found the error
15:55:31 <ralonsoh> https://review.opendev.org/c/openstack/neutron/+/873684
15:55:54 <ralonsoh> I think we should mark it as unstable
15:56:02 <ralonsoh> until OVN is fixed in Jammy
15:56:03 <ykarel> yes, from the logs I found that metadata flows were missing
15:56:14 <slaweq> that's great
15:56:20 <ykarel> and the issue is not happening on focal
15:56:30 <ykarel> so it's specific to ovn 22.03.0 in ubuntu jammy
15:56:35 <slaweq> can we skip this one test only, or can it happen in other tests too?
15:56:48 <ralonsoh> only this one
15:56:50 <ralonsoh> so far
15:57:09 <slaweq> so let's skip that test in the job's definition temporarily and keep the job voting
15:57:10 <ykarel> yes, seen only in this test, but it can happen to others too
15:57:37 <slaweq> if it happens in other tests too, we can always mark the job as non-voting temporarily later
15:57:39 <slaweq> wdyt?
15:57:54 <ralonsoh> we can skip this test only, for now
15:58:14 <ykarel> +1 to what ralonsoh said, and monitor the job
15:58:16 <slaweq> ok, ykarel will you propose a patch for that?
15:58:26 <ykarel> ok, will do
15:58:32 <slaweq> thx
15:58:47 <slaweq> ok, we are almost at the top of the hour
15:58:55 <slaweq> but I have one last thing for today
15:59:01 <slaweq> there is a question from amorin in https://review.opendev.org/c/openstack/neutron/+/869741
15:59:16 <amorin> hello!
15:59:17 <slaweq> it's related to CI and neutron-ovs-tempest-dvr-ha-multinode-full
15:59:23 <ralonsoh> yes, I talked to him about this
15:59:26 <slaweq> and I think it's a valid question
15:59:32 <ralonsoh> dvr_snat is not valid on computes
15:59:38 <ralonsoh> but we are still configuring it
15:59:42 <slaweq> I would like obondarev to look into this one too
16:00:25 <slaweq> IMO we can change the L3 agent's mode to "dvr" on computes in that job
16:00:30 <ykarel> fwiw this job used to be 3 nodes in the past, when it was switched to 2 nodes dvr_snat was set
16:00:40 <slaweq> and keep "ha=True" so it will be "ha" routers with only one node
16:00:46 <ralonsoh> yeah...
16:00:49 <slaweq> the same as we are doing in the neutron-tempest-plugin jobs
16:00:58 <slaweq> the code path for ha will be tested
16:01:20 <slaweq> there will be no real failover, but that's not something we are really testing in this job
16:01:24 <slaweq> ok, we are out of time
16:01:33 <slaweq> please comment on this patch and we can discuss there
16:01:37 <ralonsoh> ok
16:01:41 <slaweq> thx for attending the meeting today
16:01:43 <slaweq> #endmeeting
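The stable/victoria py38/cover failure raised in the Stable branches topic boils down to a MagicMock leaking into version parsing: with the low-level execute() helper mocked, the code under test ends up handing the MagicMock's repr string to packaging, which rejects it with InvalidVersion. The approach ralonsoh describes, stubbing the version helper itself rather than the execute call, can be sketched roughly as below; the helper names (_get_version, execute) and version numbers are purely illustrative and are not the actual neutron code or the patch that was pushed.

# Self-contained sketch of the failure mode and the suggested fix; all
# names and version numbers here are hypothetical, not the real patch.
from unittest import mock

from packaging import version


def execute(cmd):
    """Stand-in for a utils.execute-style helper that shells out."""
    raise NotImplementedError('would run: %s' % ' '.join(cmd))


def _get_version():
    """Hypothetical helper: run a tool and parse its version from stdout."""
    out = execute(['some-tool', '--version'])
    return str(out.split()[2])


def is_new_enough(minimum='2.15.0'):
    """Compare the parsed tool version against a minimum requirement."""
    return version.parse(_get_version()) >= version.parse(minimum)


# Mocking execute() (what the failing unit tests effectively did) makes
# _get_version() return the repr of a MagicMock, and recent packaging
# releases reject that string with InvalidVersion -- the py38/cover error.
# Mocking the version helper itself returns a real string, so the parse
# succeeds and the test no longer depends on MagicMock internals:
with mock.patch(__name__ + '._get_version', return_value='2.17.0'):
    assert is_new_enough('2.15.0')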