15:01:37 <slaweq> #startmeeting neutron_ci
15:01:38 <opendevmeet> Meeting started Tue Jun 1 15:01:37 2021 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:39 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:41 <opendevmeet> The meeting name has been set to 'neutron_ci'
15:02:18 <ralonsoh> hi
15:03:14 <lajoskatona> Hi
15:03:18 <obondarev> hi
15:03:28 <slaweq> bcafarel: ping
15:03:33 <slaweq> ci meeting
15:03:42 <bcafarel> o/ sorry
15:03:50 <slaweq> np :)
15:03:54 <bcafarel> I got used to the usual 15-20 min back between these meetings :p
15:03:54 <slaweq> ok, let's start
15:04:05 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:04:07 <slaweq> Please open now :)
15:04:13 <slaweq> #topic Actions from previous meetings
15:04:20 <slaweq> obondarev to check neutron-tempest-dvr-ha-multinode-full and switch it to ML2/OVS
15:05:26 <obondarev> https://review.opendev.org/c/openstack/neutron/+/793104
15:05:39 <obondarev> ready
15:05:55 <slaweq> obondarev: thx
15:06:15 <slaweq> today I found out that there is also neutron-tempest-ipv6 which is now running on ovn
15:06:33 <slaweq> and now the question is - do we want to switch it back to ovs or keep it with default backend?
15:06:52 <ralonsoh> it is not failing
15:07:08 <lajoskatona> +1
15:07:09 <slaweq> I would say - keep it with default backend (ovn now) but maybe You have other opinions about it
15:07:09 <ralonsoh> so, IMO, keep it in OVN
15:07:25 <lajoskatona> agree
15:07:30 <obondarev> will it mean that reference implementation will not be covered in gates?
15:07:44 <obondarev> with tempest ipv6 tests
15:07:53 <slaweq> obondarev: what do You mean by reference implementation? ML2/OVS?
15:07:56 <bcafarel> it may turn close to a duplicate of neutron-ovn-tempest-ovs-release-ipv6-only too?
15:07:59 <obondarev> if yes - that would be a problem
15:08:23 <obondarev> slaweq, yes ML2-OVS
15:08:24 <slaweq> bcafarel: good point, I missed that we have such job already
15:08:37 <ralonsoh> if I'm not wrong, ovs release uses master
15:08:40 <ralonsoh> right?
15:08:56 <ralonsoh> in any case, it won't affect and could be a duplicate
15:09:18 <slaweq> so according to that and to what obondarev said, maybe we should switch neutron-tempest-ipv6 to be ml2/ovs again
15:09:43 <obondarev> so if reference ML2-OVS is not protected with CI then it's prone to regressions which is not good
15:09:55 <obondarev> for ipv6 again
15:10:07 <slaweq> obondarev: that's a good point, we don't want regression in ML2/OVS for sure
15:10:34 <ralonsoh> btw, why don't we rename neutron-ovn-tempest-ovs-release-ipv6-only to be OVS?
15:10:45 <slaweq> ralonsoh: that'
15:10:47 <ralonsoh> and keep neutron-tempest-ipv6 with the default backend
15:10:59 <slaweq> that's my other question - what we should do as next step with our jobs
15:11:21 <slaweq> should we switch "*-ovn" jobs to be "-ovs" and keep "default" jobs as ovn ones now?
15:11:32 <slaweq> to reflect the devstack change in our ci jobs too?
15:11:59 <slaweq> or should we for now just keep everything as it was, so "regular" jobs running ovs and "-ovn" jobs running ovn
15:12:02 <slaweq> wdyt?
15:12:05 <ralonsoh> IMO, rename those with a different backend
15:12:20 <ralonsoh> in this case, -ovs
15:13:12 <obondarev> that makes sense
15:13:39 <bcafarel> +1 at least for a while, it will be clearer
15:13:55 <slaweq> ok, so we need to change many of our jobs now :)
15:14:09 <lajoskatona> yeah, make the names help identifying what the backend is
15:14:32 <bcafarel> before there was only select job with linuxbridge (and some new with ovn) so naming was clear, but with the default switch, it can get confusing (even for us :) )
15:14:34 <slaweq> but I agree that this is better long term, especially that some of our jobs inherit e.g. from tempest jobs and those tempest jobs with default settings are run in e.g. tempest or nova's gate too
15:14:35 <obondarev> we can set them to use OVS explicitly as first step
15:14:46 <obondarev> and go on with renaming as second
15:14:53 <slaweq> obondarev: yes, I agree
15:15:19 <slaweq> let's merge patches which we have now to enforce ovs where it was before, to have working ci
15:15:29 <slaweq> and then let's switch jobs completely
15:15:36 <slaweq> as that will require more work for sure
15:16:23 <bcafarel> sounds good to me!
15:16:45 <slaweq> ok, sounds like a plan :)
15:17:12 <slaweq> I will try to prepare a plan to switch jobs for next week
15:17:40 <slaweq> #action slaweq to prepare plan of switching ovn <-> ovs jobs in neutron CI
15:19:10 <slaweq> ok
15:19:16 <slaweq> next one
15:19:18 <slaweq> ralonsoh to talk with ccamposr about issue https://bugs.launchpad.net/neutron/+bug/1929523
15:19:19 <opendevmeet> Launchpad bug 1929523 in neutron "Test tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_subnet_details is failing from time to time" [High,Confirmed]
15:19:56 <slaweq> ralonsoh: it's related to the patch https://review.opendev.org/c/openstack/tempest/+/779756 right?
15:20:14 <ralonsoh> yes
15:20:54 <slaweq> ralonsoh: I'm still not convinced that this will solve that problem from https://bugs.launchpad.net/neutron/+bug/1929523
15:20:55 <opendevmeet> Launchpad bug 1929523 in neutron "Test tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_subnet_details is failing from time to time" [High,Confirmed]
15:21:06 <slaweq> as the issue is a bit different now
15:21:19 <slaweq> it's not that we have an additional server in the list
15:21:21 <ralonsoh> in this case we don't have any DNS registered
15:21:24 <slaweq> but we got an empty list
15:22:10 <slaweq> and e.g. failure https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_567/785895/1/gate/neutron-tempest-slow-py3/567fc7f/testr_results.html happened 20.05, more than a week after patch https://review.opendev.org/c/openstack/tempest/+/779756 was merged
15:22:11 <ralonsoh> are we using cirros or ubuntu?
15:22:58 <ralonsoh> if we use the advanced image, maybe we should use resolvectl
15:22:59 <slaweq> in that failed test, Ubuntu
15:23:10 <ralonsoh> instead of reading /etc/resolv.conf
15:23:40 <ralonsoh> I'll propose a patch in tempest to use resolvectl, if present in the VM
15:23:45 <slaweq> k
15:23:49 <ralonsoh> that should be more accurate
15:23:54 <slaweq> maybe indeed that will help
15:23:57 <slaweq> thx ralonsoh
15:24:37 <slaweq> #action ralonsoh to propose tempest patch to use resolvectl to address https://bugs.launchpad.net/neutron/+bug/1929523
15:24:38 <opendevmeet> Launchpad bug 1929523 in neutron "Test tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_subnet_details is failing from time to time" [High,Confirmed]
15:24:51 <slaweq> ok, I think we can move on
15:24:53 <slaweq> #topic Stadium projects
15:24:58 <slaweq> lajoskatona: any updates?
15:25:03 <lajoskatona> nothing
15:25:20 <lajoskatona> I think one backend change patch is open, let me check
15:25:43 <lajoskatona> https://review.opendev.org/c/openstack/networking-bagpipe/+/791126
15:26:48 <lajoskatona> yeah one more thing, the old patches of boden for using payload are now active again, but I hope there will be no problem with them
15:27:15 <slaweq> yes, I saw some of them already
15:27:24 <lajoskatona> I have seen the payload patch in x/vmware-nsx as abandoned, and I have no right to activate it
15:27:41 <lajoskatona> not sure if we have to warn them somehow
15:28:21 <slaweq> good idea, I will try to reach out to boden somehow
15:28:35 <slaweq> maybe he will redirect me to someone who is now working on it
15:28:51 <slaweq> #action slaweq to reach out to boden about payload patches and x/vmware-nsx
15:28:54 <slaweq> thx lajoskatona
15:31:08 <slaweq> if that is all, let's move on
15:31:09 <slaweq> #topic Stable branches
15:31:16 <slaweq> bcafarel: anything new here?
15:31:50 <bcafarel> mostly good all around :) one question I had there (coming from https://review.opendev.org/c/openstack/neutron/+/793417/ failing backport)
15:32:53 <bcafarel> other branches do not have that irrelevant-files issue as in newer branches these jobs run in periodic
15:33:49 <bcafarel> but for victoria/ussuri I think it is better to fix the job dep instead of backporting the move to periodic
15:33:50 <slaweq> I think we are still missing many patches like https://review.opendev.org/q/topic:%22improve-neutron-ci-stable%252Fussuri%22+(status:open%20OR%20status:merged)
15:33:55 <slaweq> in stable branches
15:34:02 <slaweq> and that's only an example for ussuri
15:34:09 <slaweq> but similar patches are open for other branches too
15:34:53 <bcafarel> yes, getting these ones in will probably help ussuri in general too
15:35:17 <bcafarel> I have https://review.opendev.org/c/openstack/neutron/+/793799 and https://review.opendev.org/c/openstack/neutron/+/793801 mostly for that provider job issue
15:36:14 <bcafarel> ralonsoh: looks like that whole chain is just waiting on https://review.opendev.org/c/openstack/neutron/+/778708 if you can check it
15:36:59 <ralonsoh> I'll do
15:37:19 <ralonsoh> ah I know this patch, perfect
15:37:50 <bcafarel> yes hopefully it should not take too much off of your infinite cycles :)
15:38:51 <slaweq> LOL
15:38:59 <slaweq> thx ralonsoh
15:39:04 <slaweq> ok, let's move on
15:39:06 <slaweq> #topic Grafana
15:39:29 <slaweq> we have our gate broken now due to the ovs->ovn migration and some other issue
15:39:58 <slaweq> so we can only focus on the check queue graphs today
15:40:31 <slaweq> and the biggest issues which I see for now are with the neutron-ovn-tempest-slow job
15:40:38 <slaweq> which is failing very often
15:40:50 <slaweq> and ralonsoh already proposed to make it non-voting temporarily
15:40:54 <ralonsoh> yes
15:41:09 <slaweq> I reported an LP for that https://bugs.launchpad.net/neutron/+bug/1930402
15:41:10 <opendevmeet> Launchpad bug 1930402 in neutron "SSH timeouts happens very often in the ovn based CI jobs" [Critical,Confirmed]
15:41:23 <slaweq> and I know that jlibosva and lucasgomes are looking into it
15:42:14 <slaweq> do You have anything else regarding grafana?
15:42:42 <bcafarel> I see openstack-tox-py36-with-neutron-lib-master started 100% failing in periodic a few days ago
15:43:06 <ralonsoh> link?
15:43:15 <slaweq> bcafarel: yes, I had it for the last topic of the meeting :)
15:43:22 <slaweq> but as You started, we can discuss it now
15:43:29 <slaweq> https://bugs.launchpad.net/neutron/+bug/1930397
15:43:31 <opendevmeet> Launchpad bug 1930397 in neutron "neutron-lib from master branch is breaking our UT job" [Critical,Confirmed]
15:43:32 <slaweq> there is a bug reported
15:43:41 <slaweq> and an example https://zuul.openstack.org/build/9e852a424a52479695223ac2a7723e1a
15:43:58 <bcafarel> ah thanks, I was looking for some job link
15:44:05 <ralonsoh> maybe this is because of the change in the n-lib session
15:44:13 <ralonsoh> I'll check it
15:44:37 <ralonsoh> good to have this n-lib master job
15:44:40 <slaweq> ralonsoh: yes, I suspect that
15:44:56 <slaweq> so we should avoid releasing a new neutron-lib before we fix that issue
15:45:07 <slaweq> otherwise we will probably break our gate (again) :)
15:45:10 <ralonsoh> right
15:45:13 <ralonsoh> pffff
15:45:17 <ralonsoh> no, not again
15:45:22 <bcafarel> one broken gate at a time
15:45:25 <slaweq> LOL
15:45:28 <obondarev> :)
15:45:35 <bcafarel> maybe related to the recent "Allow lazy load in model_query" neutron-lib commit?
15:45:48 <ralonsoh> no, not this
15:45:50 <obondarev> I checked it but seems unrelated
15:45:54 <ralonsoh> this is not used yet
15:46:15 <obondarev> yes
15:46:18 <bcafarel> ok :)
15:47:12 <slaweq> so, ralonsoh You will check it, right?
15:47:16 <ralonsoh> yes
15:47:19 <slaweq> thx a lot
15:47:31 <slaweq> #action ralonsoh to check failing neutron-lib-from-master periodic job
15:47:43 <slaweq> ok, let's move on then
15:47:45 <slaweq> #topic fullstack/functional
15:47:59 <slaweq> regarding the functional job, I didn't find any new issues for today
15:48:07 <slaweq> but for fullstack there is a new one:
15:48:12 <slaweq> https://bugs.launchpad.net/neutron/+bug/1930401
15:48:13 <opendevmeet> Launchpad bug 1930401 in neutron "Fullstack l3 agent tests failing due to timeout waiting until port is active" [Critical,Confirmed]
15:48:39 <slaweq> seems like it happens pretty often on various L3 related tests
15:48:47 <slaweq> I can investigate it more in the next days
15:48:59 <slaweq> unless someone else wants to take it :)
15:49:19 <ralonsoh> maybe next week
15:49:21 <lajoskatona> I can check
15:50:14 <slaweq> lajoskatona: thx a lot
15:50:31 <slaweq> #action lajoskatona to check fullstack failures https://bugs.launchpad.net/neutron/+bug/1930401
15:50:32 <opendevmeet> Launchpad bug 1930401 in neutron "Fullstack l3 agent tests failing due to timeout waiting until port is active" [Critical,Confirmed]
15:50:54 <slaweq> lajoskatona: and also, there is another fullstack issue: https://bugs.launchpad.net/neutron/+bug/1928764
15:50:55 <opendevmeet> Launchpad bug 1928764 in neutron "Fullstack test TestUninterruptedConnectivityOnL2AgentRestart failing often with LB agent" [Critical,Confirmed] - Assigned to Lajos Katona (lajos-katona)
15:51:02 <slaweq> which is hitting us pretty often
15:51:22 <slaweq> I know You were working on it some time ago
15:51:32 <slaweq> do You have any patch which should fix it?
15:51:38 <lajoskatona> Yes, we discussed it with Oleg in review
15:51:45 <slaweq> or should we maybe mark those failing tests as unstable for now?
15:51:56 <lajoskatona> https://review.opendev.org/c/openstack/neutron/+/792507
15:52:19 <lajoskatona> but obondarev is right, ping should not fail during restart of agent
15:52:53 <slaweq> actually yes - that is even main goal of this test AFAIR
15:53:05 <slaweq> to ensure that ping will work during the restart all the time
15:53:08 <lajoskatona> yeah marking them unstable can be a way forward to decrease the pressure on CI
15:53:21 <slaweq> lajoskatona: will You propose it?
15:53:41 <lajoskatona> Yes
15:53:47 <slaweq> thank You
15:54:19 <slaweq> #action lajoskatona to mark failing TestUninterruptedConnectivityOnL2AgentRestart fullstack tests as unstable temporary
15:54:54 <slaweq> lajoskatona: if You will not have too much time to work on the https://bugs.launchpad.net/neutron/+bug/1930401 this week, maybe You can also mark those tests as unstable for now
15:54:55 <opendevmeet> Launchpad bug 1930401 in neutron "Fullstack l3 agent tests failing due to timeout waiting until port is active" [Critical,Confirmed]
15:55:02 <obondarev> another bug related to PTG discussion on linuxbridge fiture
15:55:08 <obondarev> future*
15:55:10 <lajoskatona> slaweq: I will check
15:55:24 <slaweq> IMHO we need to make our CI to be a bit better as now it's a nightmare
15:55:29 <slaweq> obondarev: yes, that's true
15:55:54 <slaweq> probably we will get back to that discussion in some time :)
15:56:14 <lajoskatona> we should ask NASA to help maintaining it :P
15:56:20 <slaweq> lajoskatona: yeah :)
15:56:24 <slaweq> good idea
15:56:50 <slaweq> can I assign it as an action item to You? :P
15:56:54 * slaweq is just kidding
15:57:39 <lajoskatona> :-)
15:57:45 <slaweq> ok, that was all what I had for today
15:58:04 <slaweq> if You don't have any last minute topics, I will give You few minutes back
15:58:15 <obondarev> o/
15:58:29 <bcafarel> nothing from me
15:58:42 <slaweq> ok, thx for attending the meeting today
15:58:46 <slaweq> #endmeeting
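
For reference, a minimal sketch of the idea ralonsoh mentions above for bug 1929523: prefer resolvectl output when the guest image ships it (e.g. the Ubuntu advanced image) and fall back to /etc/resolv.conf otherwise. The helper name and the ssh_client object with an exec_command() method are assumptions for illustration only, not the actual tempest change that was proposed.

```python
# Hypothetical helper, not the real tempest patch: read the guest's DNS
# servers via resolvectl when available, otherwise via /etc/resolv.conf.
# "ssh_client" is assumed to expose exec_command(), returning stdout.

def get_guest_dns_servers(ssh_client):
    """Return the list of DNS server IPs configured in the guest."""
    has_resolvectl = ssh_client.exec_command(
        'command -v resolvectl || true').strip()
    if has_resolvectl:
        # On systemd-resolved guests "resolvectl dns" prints the per-link
        # DNS servers, e.g. "Link 2 (ens3): 10.0.0.2".
        output = ssh_client.exec_command('resolvectl dns')
        servers = []
        for line in output.splitlines():
            _, _, addrs = line.partition(':')
            servers.extend(addrs.split())
        return servers
    # Cirros and other minimal images: read the classic resolver file.
    output = ssh_client.exec_command('cat /etc/resolv.conf')
    return [line.split()[1] for line in output.splitlines()
            if line.startswith('nameserver') and len(line.split()) > 1]
```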
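
And a sketch of what temporarily marking the flaky fullstack tests as unstable could look like, as agreed with lajoskatona above. It assumes neutron's unstable_test helper in neutron.tests.base and uses a simplified class and test name, so treat it as an illustration rather than the actual patch.

```python
# Illustration only: how a flaky test is typically tagged as unstable
# while the root cause (bug 1928764 / bug 1930401) is investigated.
# The decorator turns a failure into a skip instead of failing the job.
from neutron.tests import base


class TestUninterruptedConnectivityOnL2AgentRestart(base.BaseTestCase):

    @base.unstable_test("bug 1928764")
    def test_l2_agent_restart(self):
        # ... original test body unchanged ...
        pass
```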