15:00:13 #startmeeting neutron_ci
15:00:13 Meeting started Tue Feb 14 15:00:13 2023 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:13 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:13 The meeting name has been set to 'neutron_ci'
15:00:17 o/
15:00:23 ping bcafarel, lajoskatona, mlavalle, mtomaska, ralonsoh, ykarel, jlibosva
15:00:27 o/
15:00:29 o/
15:00:29 o/
15:00:30 hi
15:00:30 o/
15:00:32 hi
15:00:32 Grafana dashboard: https://grafana.opendev.org/d/f913631585/neutron-failure-rate?orgId=1
15:00:53 ok, let's start as we have many topics for today
15:00:55 #topic Actions from previous meetings
15:01:02 lajoskatona to check additional logs in failing dvr related functional tests
15:01:37 o/
15:02:01 yes, I checked and I am still not closer
15:02:24 I pushed a patch with tries for this issue, and today it is in this state:
15:02:44 https://review.opendev.org/c/openstack/neutron/+/873111
15:03:24 it now runs part of these tests with concurrency=1
15:04:00 Fernando Royo proposed openstack/ovn-octavia-provider master: Avoid use of ovn metadata port IP for HM checks https://review.opendev.org/c/openstack/ovn-octavia-provider/+/873426
15:04:04 when I checked, the functional failures were for other reasons and not for the usual DVR related issues, but I have to check this topic more
15:04:35 what about execution time? is it much longer?
15:05:06 good question, I haven't checked it yet, but I see one issue which is "too many files open"
15:05:28 not always, but from this patch's 20 runs I saw it once as I remember
15:05:42 :/
15:06:10 which is strange as in serial I would expect less load on such things
15:06:35 maybe that's another issue and some files aren't closed properly?
15:06:56 during the config parsing, maybe
15:07:05 possible, I can check that
15:07:18 thx
15:07:21 can you open another bug for this issue?
15:07:33 I will add AI for You for next week to remember about it
15:07:44 ack
15:07:50 #action lajoskatona to continue checking dvr functional tests issues
15:09:00 next one
15:09:02 ralonsoh to try to store journal log in UT job's results to debug "no such table" issues
15:09:20 no, I didn't work on this one
15:09:29 I was working on the FT ones
15:09:40 #action ralonsoh to try to store journal log in UT job's results to debug "no such table" issues
15:09:46 let's keep it for next week then
15:09:54 next one
15:09:56 mtomaska to check failed test_restart_rpc_on_sighup_multiple_workers functional test
15:10:06 I looked into it. Of course I am not able to reproduce it locally. I really think this failure happens because of stestr concurrency which I can't quite replicate on my dev machine. Adding more logs to the test won't help. Is there a way to mark the test with concurrency=1 permanently?
15:10:41 you can move this test to the second FT execution
15:10:48 where we use concurrency one
15:11:00 You can do it in a way like lajoskatona in https://review.opendev.org/c/openstack/neutron/+/873111
15:11:02 but I would really like to avoid doing that for all tests :)
15:11:15 we can move one single test only
15:11:19 same as the sql ones
15:11:38 or "test_get_all_devices"
15:11:40 ok cool. ralonsoh can I ping you on how to later?
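[Editor's note: a minimal, hypothetical sketch related to the "too many files open" failures mentioned above (the suspicion that something, perhaps config parsing, leaks file descriptors). The mixin and helper names are invented for illustration and are not part of the patch under discussion; it assumes a Linux test node where /proc/self/fd lists the process's open descriptors.]

```python
# Hypothetical debugging aid, stdlib only: compare the number of open file
# descriptors before and after each test to spot leaks.
import os
import unittest


def open_fd_count():
    """Return the number of file descriptors currently open in this process."""
    return len(os.listdir('/proc/self/fd'))


class FdLeakProbeMixin(unittest.TestCase):
    """Mixin that reports when a test finishes with more FDs than it started with."""

    def setUp(self):
        super().setUp()
        before = open_fd_count()
        # addCleanup runs even when the test fails or errors out,
        # so the comparison is always made.
        self.addCleanup(self._check_fd_delta, before)

    def _check_fd_delta(self, before):
        after = open_fd_count()
        if after > before:
            print('possible FD leak: %d -> %d open descriptors' % (before, after))
```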
15:11:43 sure
15:11:45 but only if we are sure that this is an issue
15:12:51 next one
15:12:52 ralonsoh to check Can't find port binding with logical port XXX error
15:13:15 #link https://review.opendev.org/c/openstack/neutron/+/873126
15:13:20 ^ this is the patch
15:13:54 great, at least one fixed :)
15:14:10 thx
15:14:19 next one
15:14:20 mlavalle to check LRP failed to bind failure
15:14:31 I checked it here: https://zuul.opendev.org/t/openstack/builds?job_name=neutron-functional&branch=master&skip=0
15:14:49 didn't happen again since the occurrence you reported a week ago
15:15:02 so for now, we don't need to worry about it
15:15:10 it was a one-off
15:15:14 \o/
15:15:28 good news
15:15:48 not really
15:15:53 I saw the same issue today :/
15:16:06 https://32c2553a5b1b0d58bd31-9e074609c9fda723118db1a634ce1a12.ssl.cf5.rackcdn.com/873247/3/gate/neutron-functional-with-uwsgi/6195659/testr_results.html
15:16:09 ohhhh
15:16:09 oh, I'm reporting from my last check last night
15:16:44 and that is a different job I think
15:16:53 than the one we looked at last week
15:17:57 yes, it's different
15:18:06 also functional tests but it's from the gate queue
15:18:12 yeap
15:18:12 last week it was in periodic
15:18:17 I was looking at periodic
15:19:01 so that tells us that test case is flaky
15:19:08 yeap
15:19:14 acorss jobs
15:19:22 across jobs
15:19:23 will You try to investigate again this week?
15:19:51 I'm sorry. I can't. I will be off on PTO starting tomorrow until March 1st
15:20:11 but if you are willing to wait, I'll be happy to look at it upon my return
15:20:25 oh, ok
15:20:38 I will open an LP bug to not forget about it
15:20:49 and we will see how often it will be occurring
15:21:01 assign it to me if nobody is going to look at it
15:21:14 #action slaweq to report bug about failed to bind LRP in functional tests
15:21:21 sure, thx mlavalle
15:21:30 thank you!
15:21:54 and last one
15:21:55 slaweq to check fullstack failures in test_multi_segs_network
15:21:56 I spent some time investigating it
15:21:58 I reported bug https://bugs.launchpad.net/neutron/+bug/2007152
15:22:01 it seems to be a valid issue in the code, not a test problem
15:22:13 but I don't know yet what's wrong exactly there
15:22:38 there is some race between creation of subnets in network and cleaning stale devices from the namespace
15:23:01 I assigned this bug to myself and I will continue working on it
15:24:42 any questions/comments?
15:25:14 nothing from me
15:25:23 if not, let's move on
15:25:28 #topic Stable branches
15:25:36 * bcafarel hides
15:25:36 bcafarel any updates?
15:25:44 :)
15:25:54 good and bad this week :)
15:26:26 ussuri grenade and train look good from a quick look thanks to ykarel, wallaby I opened a bug and also fix in progress by ykarel https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/873708
15:26:58 victoria, we have 2 backports failing on py38/cover with "pkg_resources.extern.packaging.version.InvalidVersion: Invalid version: ''"
15:27:06 https://zuul.opendev.org/t/openstack/build/8f14ad998131488680c8f2142d1eab94
15:27:29 this one seems familiar but I could not find anything relevant, does it ring a bell? (if not will open a lp to track it)
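[Editor's note: for context on the InvalidVersion traceback quoted just above. The error appears to come from the packaging library (vendored by setuptools as pkg_resources.extern.packaging), whose newer releases no longer fall back to a legacy version type, so an empty version string raises immediately. The sketch below only illustrates the failure mode and one defensive pattern; the helper name is hypothetical and is not the actual Neutron code.]

```python
# Illustration only: reproduce the "Invalid version: ''" error and show one
# hedged defensive fallback around version parsing.
from packaging import version


def parse_version_or_default(raw, default="0.0.0"):
    """Parse a version string, falling back to a default when it is empty/invalid."""
    try:
        return version.Version(raw)
    except version.InvalidVersion:
        # packaging raises here for "" and other non-PEP 440 strings,
        # matching the failure seen in the py38/cover job.
        return version.Version(default)


print(parse_version_or_default(""))  # -> 0.0.0 instead of raising
```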
15:27:47 (and other branches are good \o/)
15:29:20 ^^ we can force the _get_version() to return the value, not utils_exec
15:29:25 I'll push a patch
15:29:47 ++
15:29:48 thx ralonsoh
15:29:53 ralonsoh++ thanks
15:30:00 and thx bcafarel for updates
15:31:18 so let's move on
15:31:20 #topic Stadium projects
15:31:33 lajoskatona I see there is a bunch of topics there in the agenda
15:31:40 are those added by You?
15:31:43 most of them seem ok
15:32:00 no I think it was not me
15:32:27 I think it is for one of the not so green ones
15:32:43 ok, maybe ykarel added them
15:33:02 the fwaas job is failing, and it is causing failures as it is executed with the tempest-plugin
15:33:09 there are links added to the failing fwaas tests
15:33:13 the fwaas one we discussed in the previous meeting
15:33:32 this is the bug for this: https://bugs.launchpad.net/neutron/+bug/2006683
15:33:37 for fwaas I mean
15:34:38 who is now fwaas liaison?
15:34:47 the other topic is for the cfg related issue (see https://review.opendev.org/c/openstack/neutron/+/872644 )
15:35:25 I think neutron-dynamic-routing and bgpvpn were the last affected by this, I have to check the fresh runs for them
15:35:29 I don't remember where this list is
15:36:10 https://docs.openstack.org/neutron/latest/contributor/policies/neutron-teams.html
15:36:12 found it :)
15:36:26 so it is zhouhenglc
15:36:36 maybe we should ping him to check those failures?
15:36:57 +1
15:36:58 and temporarily we can propose to make fwaas jobs non-voting if he does not reply
15:37:00 I will ping him
15:37:06 thanks
15:37:08 lajoskatona++ thx
15:37:28 #action lajoskatona to talk with zhouhenglc about fwaas jobs issues
15:37:50 ok, anything else regarding stadium today?
15:38:22 nothing from me
15:38:51 so let's move on
15:38:55 #topic Grafana
15:39:10 here everything looks to be on a high failure rate to me
15:39:21 don't check too much the numbers
15:39:35 tempest ipv6: failing because of OVN issue in Jammy
15:39:53 FT: failing because of the test error I introduced (fixed now)
15:40:08 yes, I also saw pretty many "DNM" or "WIP" patches where most of the jobs failed
15:40:11 and I've been playing with CI, using 20 FT jobs at the same time
15:40:12 yes
15:40:26 so please, don't focus this week on numbers
15:40:29 so those results may be a bit misleading
15:40:42 ok, so let's move on to the rechecks then
15:40:47 #topic Rechecks
15:40:57 +---------+----------+... (full message at )
15:41:12 last week it was still very high: 2.0 rechecks on average to get a patch merged
15:41:34 yes but always the same errors
15:41:44 but it's better in e.g. last 5 days or so
15:42:10 so hopefully it will be going down once we fix a few issues mentioned already
15:42:21 and some which I have to discuss still today :)
15:42:41 #topic Unit tests
15:42:51 critical bug reported https://bugs.launchpad.net/neutron/+bug/2007254
15:43:02 I'm on this one
15:43:36 You are working on it?
15:43:38 tyes
15:43:40 yes
15:43:48 I'll assign it to me
15:43:51 ok, please assign it to Yourself then :)
15:43:53 thx
15:43:54 k thx, I too was trying to reproduce it, but couldn't
15:44:27 next topic
15:44:28 #topic fullstack/functional
15:44:47 here I found only one new issue
15:44:48 https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_777/870081/6/check/neutron-functional-with-uwsgi/7775413/testr_results.html
15:44:56 something related to mysql errors
15:45:33 again? maybe those tests have been executed at the same time, using mysql
15:45:39 right, mysql
15:46:05 I wouldn't investigate it, this is just a test clash
15:46:48 a test called "test_1" should not be even considered as a real test :) . That's just bad naming
15:47:11 mtomaska yeah :)
15:47:29 ralonsoh but both were run with "--concurrency 1"
15:47:30 my bad when I pushed this patch
15:48:07 2023-02-10 11:29:15.131934 | controller | ======... (full message at )
15:48:20 slaweq, yes, you are right
15:49:33 my bet is that it was the oom-killer that caused the failure of the first test
15:49:38 so cleanup wasn't done
15:49:41 and that's why the second failed
15:49:50 yes, the 2nd is caused by the first one
15:49:56 but I didn't check it in the journal logs
15:50:05 and the first because of a timeout
15:50:14 I wouldn't spend too much time on this one
15:50:35 yeah, let's move on to the other topics
15:50:44 #topic grenade
15:50:56 I added this today as we have pretty many failures there
15:51:03 first, added by ykarel
15:51:09 seeing pcp deploy issues in these jobs randomly, basically can be seen in jobs where dstat is enabled
15:51:33 ykarel do You think that disabling dstat in grenade jobs will fix that problem (or work around it)?
15:51:54 slaweq, workaround
15:52:03 as the actual fix would be needed in the pcp package
15:52:09 ok
15:52:15 will You propose a patch?
15:52:21 sure will do that
15:52:25 ykarel++
15:52:26 thx a lot
15:52:47 another issue which I have seen in grenade this week was related to failed ping to server:
15:52:52 https://58140b717b0fd2108807-1424e0f18ed1d65bb7e7c00bb059b2d8.ssl.cf2.rackcdn.com/872830/1/gate/neutron-ovs-grenade-multinode/6b0f11b/controller/logs/grenade.sh_log.txt
15:52:52 https://bca8357d212b699db2ea-90e26954923e94ff0df2f7d3073d9fdf.ssl.cf2.rackcdn.com/871983/3/check/neutron-ovn-grenade-multinode-skip-level/fd1c1fa/controller/logs/grenade.sh_log.txt
15:52:52 https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_01d/873615/2/check/neutron-ovs-grenade-dvr-multinode/01d90b6/controller/logs/grenade.sh_log.txt
15:53:00 I found at least those 3 occurrences
15:53:07 so I think we should investigate it
15:53:25 anyone wants to check it?
15:53:38 sorry, not this week
15:53:49 no need to be sorry ralonsoh :)
15:54:12 if nobody has any cycles, I will report LP for it and we will see
15:54:28 #action slaweq to report bug with failed ping in grenade jobs
15:54:44 +1
15:54:48 next topic
15:54:55 #topic Tempest/Scenario
15:55:07 here I saw a bunch of tests failing with ssh timeout
15:55:18 like e.g. tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_hotplug_nic:
15:55:24 https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f1f/872737/3/gate/neutron-ovn-tempest-ipv6-only-ovs-release/f1f5212/testr_results.html
15:55:24 https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_18b/864649/7/check/neutron-ovn-tempest-ipv6-only-ovs-release/18b432e/testr_results.html
15:55:24 https://74b83773945845894408-fddd31f569f44d98bd401bfd88253d97.ssl.cf2.rackcdn.com/873553/1/check/neutron-ovn-tempest-ipv6-only-ovs-release/6800220/testr_results.html
15:55:26 ykarel, found the error
15:55:31 https://review.opendev.org/c/openstack/neutron/+/873684
15:55:54 I think we should mark it as unstable
15:56:02 until OVN is fixed in Jammy
15:56:03 yes from logs i found that metadata flows were missing
15:56:14 that's great
15:56:20 and issue is not happening in focal
15:56:30 so it's specific to ovn22.03.0 in ubuntu jammy
15:56:35 can we skip this one test only or can it happen in other tests too?
15:56:48 only this one
15:56:50 so far
15:57:09 so let's skip that test in the job's definition temporarily and keep the job voting
15:57:10 yes, seen only in this test, but it can happen to others too
15:57:37 if it happens in other tests too, we can always mark the job as non-voting temporarily later
15:57:39 wdyt?
15:57:54 we can skip this test only, for now
15:58:14 +1 to what ralonsoh said, and monitor the job
15:58:16 ok, ykarel will You propose a patch for that?
15:58:26 ok will do
15:58:32 thx
15:58:47 ok, we are almost on top of the hour
15:58:55 but I have one last thing for today
15:59:01 there is question from amorin in https://review.opendev.org/c/openstack/neutron/+/869741
15:59:16 hello!
15:59:17 it's related to CI and neutron-ovs-tempest-dvr-ha-multinode-full
15:59:23 yes, I talked to him about this
15:59:26 and I think it's a valid question
15:59:32 dvr_snat is not valid in compute
15:59:38 but we are still configuring it
15:59:42 I would like obondarev to look into this one too
16:00:25 IMO we can change L3 agent's mode to "dvr" in compute in that job
16:00:30 fwiw this job used to be 3 nodes in the past; when it was switched to 2 nodes, dvr_snat was set
16:00:40 and keep "ha=True" so it will be "ha" routers with only one node
16:00:46 yeah...
16:00:49 the same as we are doing in neutron-tempest-plugin- jobs
16:00:58 code path for ha will be tested
16:01:20 there will be no real failover, but that's not something we are really testing in this job
16:01:24 ok, we are out of time
16:01:33 please comment on this patch and we can discuss there
16:01:37 ok
16:01:41 thx for attending the meeting today
16:01:43 #endmeeting
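[Editor's note: the team decided above to skip the flaky test_hotplug_nic test in the job definition, but the "mark it as unstable" option ralonsoh mentions also exists as a decorator in tempest.lib. A sketch under that assumption; the class below is a simplified stand-in for tempest's real TestNetworkBasicOps, and the bug number is a placeholder for whatever LP bug ends up tracking the OVN/Jammy metadata issue.]

```python
# Sketch only: marking a known-flaky tempest test as unstable so that its
# failures are reported as skips while the underlying bug is investigated.
from tempest.lib import decorators
from tempest.scenario import manager


class TestNetworkBasicOps(manager.NetworkScenarioTest):

    @decorators.unstable_test(bug="XXXXXXX")  # placeholder bug number
    def test_hotplug_nic(self):
        # a failure here is turned into a skip referencing the bug,
        # so the job can stay voting while the flakiness is tracked
        ...
```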