15:01:37 <slaweq> #startmeeting neutron_ci
15:01:38 <opendevmeet> Meeting started Tue Jun  1 15:01:37 2021 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:39 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:41 <opendevmeet> The meeting name has been set to 'neutron_ci'
15:02:18 <ralonsoh> hi
15:03:14 <lajoskatona> Hi
15:03:18 <obondarev> hi
15:03:28 <slaweq> bcafarel: ping
15:03:33 <slaweq> ci meeting
15:03:42 <bcafarel> o/ sorry
15:03:50 <slaweq> np :)
15:03:54 <bcafarel> I got used to the usual 15-20 min break between these meetings :p
15:03:54 <slaweq> ok, let's start
15:04:05 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:04:07 <slaweq> Please open now :)
15:04:13 <slaweq> #topic Actions from previous meetings
15:04:20 <slaweq> obondarev to check neutron-tempest-dvr-ha-multinode-full and switch it to ML2/OVS
15:05:26 <obondarev> https://review.opendev.org/c/openstack/neutron/+/793104
15:05:39 <obondarev> ready
15:05:55 <slaweq> obondarev: thx
15:06:15 <slaweq> today I found out that there is also neutron-tempest-ipv6 which is now running on ovn
15:06:33 <slaweq> and now the question is - do we want to switch it back to ovs or keep it with default backend?
15:06:52 <ralonsoh> it is not failing
15:07:08 <lajoskatona> +1
15:07:09 <slaweq> I would say - keep it with default backend (ovn now) but maybe You have other opinions about it
15:07:09 <ralonsoh> so, IMO, keep it in OVN
15:07:25 <lajoskatona> agree
15:07:30 <obondarev> will it mean that reference implementation will not be covered in gates?
15:07:44 <obondarev> with tempest ipv6 tests
15:07:53 <slaweq> obondarev: what do You mean by reference implementation? ML2/OVS?
15:07:56 <bcafarel> it may turn out to be close to a duplicate of neutron-ovn-tempest-ovs-release-ipv6-only too?
15:07:59 <obondarev> if yes - that would be a problem
15:08:23 <obondarev> slaweq, yes ML2-OVS
15:08:24 <slaweq> bcafarel: good point, I missed that we have such job already
15:08:37 <ralonsoh> if I'm not wrong, ovs release uses master
15:08:40 <ralonsoh> right?
15:08:56 <ralonsoh> in any case, it won't affect anything and could be a duplicate
15:09:18 <slaweq> so according to that and to what obondarev said, maybe we should switch neutron-tempest-ipv6 to be ml2/ovs again
15:09:43 <obondarev> so if the reference ML2-OVS is not protected by CI then it's prone to regressions, which is not good
15:09:55 <obondarev> for ipv6 again
15:10:07 <slaweq> obondarev: that's good point, we don't want regression in ML2/OVS for sure
15:10:34 <ralonsoh> btw, why don't we rename neutron-ovn-tempest-ovs-release-ipv6-only to be OVS?
15:10:45 <slaweq> ralonsoh: that'
15:10:47 <ralonsoh> and keep neutron-tempest-ipv6 with the default backend
15:10:59 <slaweq> that's my other question - what should we do as the next step with our jobs
15:11:21 <slaweq> should we switch "*-ovn" jobs to be "-ovs" and keep "default" jobs as ovn ones now?
15:11:32 <slaweq> to reflect the devstack change in our ci jobs too?
15:11:59 <slaweq> or should we for now just keep everything as it was, so "regular" jobs running ovs and "-ovn" jobs running ovn
15:12:02 <slaweq> wdyt?
15:12:05 <ralonsoh> IMO, rename those with a different backend
15:12:20 <ralonsoh> in this case, -ovs
15:13:12 <obondarev> that makes sense
15:13:39 <bcafarel> +1 at least for a while, it will be clearer
15:13:55 <slaweq> ok, so we need to change many of our jobs now :)
15:14:09 <lajoskatona> yeah, make the names help identify what the backend is
15:14:32 <bcafarel> before there were only select jobs with linuxbridge (and some new ones with ovn), so naming was clear, but with the default switch it can get confusing (even for us :) )
15:14:34 <slaweq> but I agree that this is better long term, especially since some of our jobs inherit e.g. from tempest jobs, and those tempest jobs with default settings are run in e.g. tempest's or nova's gate too
15:14:35 <obondarev> we can set them to use OVS explicitly as a first step
15:14:46 <obondarev> and go on with the renaming as a second step
15:14:53 <slaweq> obondarev: yes, I agree
15:15:19 <slaweq> let's merge the patches which we have now to enforce OVS where it was before, to have a working CI
15:15:29 <slaweq> and then let's switch jobs completely
15:15:36 <slaweq> as that will require more work for sure
15:16:23 <bcafarel> sounds good to me!
15:16:45 <slaweq> ok, sounds like a plan :)
15:17:12 <slaweq> I will try to prepare a plan to switch the jobs for next week
15:17:40 <slaweq> #action slaweq to prepare a plan for switching ovn <-> ovs jobs in neutron CI
15:19:10 <slaweq> ok
15:19:16 <slaweq> next one
15:19:18 <slaweq> ralonsoh to talk with ccamposr about issue https://bugs.launchpad.net/neutron/+bug/1929523
15:19:19 <opendevmeet> Launchpad bug 1929523 in neutron "Test tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_subnet_details is failing from time to time" [High,Confirmed]
15:19:56 <slaweq> ralonsoh: it's related to the patch https://review.opendev.org/c/openstack/tempest/+/779756 right?
15:20:14 <ralonsoh> yes
15:20:54 <slaweq> ralonsoh: I'm still not convinced that this will solve that problem from https://bugs.launchpad.net/neutron/+bug/1929523
15:20:55 <opendevmeet> Launchpad bug 1929523 in neutron "Test tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_subnet_details is failing from time to time" [High,Confirmed]
15:21:06 <slaweq> as the issue is a bit different now
15:21:19 <slaweq> it's not that we have an additional server in the list
15:21:21 <ralonsoh> in this case we don't have any DNS registered
15:21:24 <slaweq> but we got an empty list
15:22:10 <slaweq> and e.g. failure https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_567/785895/1/gate/neutron-tempest-slow-py3/567fc7f/testr_results.html happened on 20.05, more than a week after patch https://review.opendev.org/c/openstack/tempest/+/779756 was merged
15:22:11 <ralonsoh> are we using cirros or ubuntu?
15:22:58 <ralonsoh> if we use the advanced image, maybe we should use resolvectl
15:22:59 <slaweq> in that failed test, Ubuntu
15:23:10 <ralonsoh> instead of reading /etc/resolv.conf
15:23:40 <ralonsoh> I'll propose a patch in tempest to use resolvectl, if present in the VM
15:23:45 <slaweq> k
15:23:49 <ralonsoh> that should be more accurate
15:23:54 <slaweq> maybe indeed that will help
15:23:57 <slaweq> thx ralonsoh
15:24:37 <slaweq> #action ralonsoh to propose tempest patch to use resolvectl to address https://bugs.launchpad.net/neutron/+bug/1929523
15:24:38 <opendevmeet> Launchpad bug 1929523 in neutron "Test tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_subnet_details is failing from time to time" [High,Confirmed]
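(For reference, the idea ralonsoh describes above would look roughly like the sketch below. This is a hypothetical helper, not the actual tempest patch: the function name and the exec_command callable are assumptions, and only IPv4 nameservers are handled.)

```python
import re


def get_guest_dns_servers(exec_command):
    """Return the DNS servers configured inside the guest VM.

    ``exec_command`` is any callable that runs a shell command in the
    guest (e.g. over SSH) and returns its stdout as a string.
    """
    # Images running systemd-resolved (e.g. the Ubuntu advanced image)
    # point /etc/resolv.conf at the 127.0.0.53 stub resolver, so ask
    # resolvectl for the real per-link servers instead.
    if exec_command('command -v resolvectl || true').strip():
        output = exec_command('resolvectl dns')
        # Typical line: "Link 2 (ens3): 10.1.0.2 10.1.0.3"
        return re.findall(r'(?:\d{1,3}\.){3}\d{1,3}', output)
    # Fallback for minimal images (e.g. cirros): parse /etc/resolv.conf.
    output = exec_command('cat /etc/resolv.conf')
    return [line.split()[1] for line in output.splitlines()
            if line.startswith('nameserver') and len(line.split()) > 1]
```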
15:24:51 <slaweq> ok, I think we can move on
15:24:53 <slaweq> #topic Stadium projects
15:24:58 <slaweq> lajoskatona: any updates?
15:25:03 <lajoskatona> nothing
15:25:20 <lajoskatona> I think one backend change patch is open, let me check
15:25:43 <lajoskatona> https://review.opendev.org/c/openstack/networking-bagpipe/+/791126
15:26:48 <lajoskatona> yeah, one more thing: boden's old patches for using payload are active again, but I hope there will be no problem with them
15:27:15 <slaweq> yes, I saw some of them already
15:27:24 <lajoskatona> I have seen a payload patch in x/vmware-nsx that was abandoned, and I have no right to restore it
15:27:41 <lajoskatona> not sure if we have to warn them somehow
15:28:21 <slaweq> good idea, I will try to reach out to boden somehow
15:28:35 <slaweq> maybe he will redirect me to someone who is now working on it
15:28:51 <slaweq> #action slaweq to reach out to boden about payload patches and x/vmware-nsx
15:28:54 <slaweq> thx lajoskatona
15:31:08 <slaweq> if that is all, let's move on
15:31:09 <slaweq> #topic Stable branches
15:31:16 <slaweq> bcafarel: anything new here?
15:31:50 <bcafarel> mostly good all around :) one question I had there (coming from https://review.opendev.org/c/openstack/neutron/+/793417/ failing backport)
15:32:53 <bcafarel> other branches do not have that irrelevant-files issue as in newer branches these jobs run in periodic
15:33:49 <bcafarel> but for victoria/ussuri I think it is better to fix the job dep instead of backporting the move to periodic
15:33:50 <slaweq> I think we are still missing many patches like https://review.opendev.org/q/topic:%22improve-neutron-ci-stable%252Fussuri%22+(status:open%20OR%20status:merged)
15:33:55 <slaweq> in stable branches
15:34:02 <slaweq> and that's only example for ussuri
15:34:09 <slaweq> but similar patches are opened for other branches too
15:34:53 <bcafarel> yes, getting these ones in will probably help ussuri in general too
15:35:17 <bcafarel> I have https://review.opendev.org/c/openstack/neutron/+/793799 and https://review.opendev.org/c/openstack/neutron/+/793801 mostly for that provider job issue
15:36:14 <bcafarel> ralonsoh: looks like that whole chain is just waiting on https://review.opendev.org/c/openstack/neutron/+/778708 if you can check it
15:36:59 <ralonsoh> I'll do
15:37:19 <ralonsoh> ah I know this patch, perfect
15:37:50 <bcafarel> yes hopefully it should not take too much off of your infinite cycles :)
15:38:51 <slaweq> LOL
15:38:59 <slaweq> thx ralonsoh
15:39:04 <slaweq> ok, lets move on
15:39:06 <slaweq> #topic Grafana
15:39:29 <slaweq> we have our gate broken now due to ovs->ovn migration and some other issue
15:39:58 <slaweq> so we can only focus on the check queue graphs today
15:40:31 <slaweq> and the biggest issues which I see for now are with neutron-ovn-tempest-slow job
15:40:38 <slaweq> which is failing very often
15:40:50 <slaweq> and ralonsoh already proposed to make it non-voting temporarily
15:40:54 <ralonsoh> yes
15:41:09 <slaweq> I reported LP for that https://bugs.launchpad.net/neutron/+bug/1930402
15:41:10 <opendevmeet> Launchpad bug 1930402 in neutron "SSH timeouts happens very often in the ovn based CI jobs" [Critical,Confirmed]
15:41:23 <slaweq> and I know that jlibosva and lucasgomes are looking into it
15:42:14 <slaweq> do You have anything else regarding grafana?
15:42:42 <bcafarel> I see openstack-tox-py36-with-neutron-lib-master started failing 100% of the time in periodic a few days ago
15:43:06 <ralonsoh> link?
15:43:15 <slaweq> bcafarel: yes, I had it as the last topic of the meeting :)
15:43:22 <slaweq> but as You started, we can discuss it now
15:43:29 <slaweq> https://bugs.launchpad.net/neutron/+bug/1930397
15:43:31 <opendevmeet> Launchpad bug 1930397 in neutron "neutron-lib from master branch is breaking our UT job" [Critical,Confirmed]
15:43:32 <slaweq> there is a bug reported
15:43:41 <slaweq> and example https://zuul.openstack.org/build/9e852a424a52479695223ac2a7723e1a
15:43:58 <bcafarel> ah thanks I was looking for some job link
15:44:05 <ralonsoh> maybe this is because of the change in the n-lib session
15:44:13 <ralonsoh> I'll check it
15:44:37 <ralonsoh> good to have this n-lib master job
15:44:40 <slaweq> ralonsoh: yes, I suspect that
15:44:56 <slaweq> so we should avoid releasing a new neutron-lib before we fix that issue
15:45:07 <slaweq> otherwise we will probably break our gate (again) :)
15:45:10 <ralonsoh> right
15:45:13 <ralonsoh> pffff
15:45:17 <ralonsoh> no, not again
15:45:22 <bcafarel> one broken gate at a time
15:45:25 <slaweq> LOL
15:45:28 <obondarev> :)
15:45:35 <bcafarel> maybe related to recent "Allow lazy load in model_query" neutron-lib commit?
15:45:48 <ralonsoh> no, not this
15:45:50 <obondarev> I checked it but seems unrelated
15:45:54 <ralonsoh> this is not used yet
15:46:15 <obondarev> yes
15:46:18 <bcafarel> ok :)
15:47:12 <slaweq> so, ralonsoh You will check it, right?
15:47:16 <ralonsoh> yes
15:47:19 <slaweq> thx a lot
15:47:31 <slaweq> #action ralonsoh to check failing neutron-lib-from-master periodic job
15:47:43 <slaweq> ok, let's move on then
15:47:45 <slaweq> #topic fullstack/functional
15:47:59 <slaweq> regarding the functional job, I didn't find any new issues for today
15:48:07 <slaweq> but for fullstack there is new one:
15:48:12 <slaweq> https://bugs.launchpad.net/neutron/+bug/1930401
15:48:13 <opendevmeet> Launchpad bug 1930401 in neutron "Fullstack l3 agent tests failing due to timeout waiting until port is active" [Critical,Confirmed]
15:48:39 <slaweq> seems like it happens pretty often on various L3 related tests
15:48:47 <slaweq> I can investigate it more in the next days
15:48:59 <slaweq> unless someone else wants to take it :)
15:49:19 <ralonsoh> maybe next week
15:49:21 <lajoskatona> I can check
15:50:14 <slaweq> lajoskatona: thx a lot
15:50:31 <slaweq> #action lajoskatona to check fullstack failures https://bugs.launchpad.net/neutron/+bug/1930401
15:50:32 <opendevmeet> Launchpad bug 1930401 in neutron "Fullstack l3 agent tests failing due to timeout waiting until port is active" [Critical,Confirmed]
15:50:54 <slaweq> lajoskatona: and also, there is another fullstack issue: https://bugs.launchpad.net/neutron/+bug/1928764
15:50:55 <opendevmeet> Launchpad bug 1928764 in neutron "Fullstack test TestUninterruptedConnectivityOnL2AgentRestart failing often with LB agent" [Critical,Confirmed] - Assigned to Lajos Katona (lajos-katona)
15:51:02 <slaweq> which is hitting us pretty often
15:51:22 <slaweq> I know You were working on it some time ago
15:51:32 <slaweq> do You have any patch which should fix it?
15:51:38 <lajoskatona> Yes we discussed it with Oleg in review
15:51:45 <slaweq> or should we maybe mark those failing tests as unstable for now?
15:51:56 <lajoskatona> https://review.opendev.org/c/openstack/neutron/+/792507
15:52:19 <lajoskatona> but obondarev is right, ping should not fail during restart of agent
15:52:53 <slaweq> actually yes - that is even the main goal of this test AFAIR
15:53:05 <slaweq> to ensure that ping will work during the restart all the time
15:53:08 <lajoskatona> yeah marking them unstable can be a way forward to decrease the pressure on CI
15:53:21 <slaweq> lajoskatona: will You propose it?
15:53:41 <lajoskatona> Yes
15:53:47 <slaweq> thank You
15:54:19 <slaweq> #action lajoskatona to mark failing TestUninterruptedConnectivityOnL2AgentRestart fullstack tests as unstable temporarily
15:54:54 <slaweq> lajoskatona: if You will not have too much time to work on the https://bugs.launchpad.net/neutron/+bug/1930401 this week, maybe You can also mark those tests as unstable for now
15:54:55 <opendevmeet> Launchpad bug 1930401 in neutron "Fullstack l3 agent tests failing due to timeout waiting until port is active" [Critical,Confirmed]
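(A quick note on the mechanics discussed above: marking a test unstable usually means wrapping it with neutron's unstable_test decorator, so a failure is converted into a skip while the bug is investigated. A minimal sketch follows, assuming the decorator lives in neutron.tests.base as I recall; the class and test names are illustrative only, not the real fullstack test.)

```python
# Not the actual neutron change - just an illustration of the pattern.
from neutron.tests import base


class TestL2AgentRestartConnectivity(base.BaseTestCase):

    @base.unstable_test("bug 1928764")
    def test_ping_not_interrupted_on_agent_restart(self):
        # The real fullstack test keeps pinging between two fake VMs
        # while the L2 agent restarts and asserts no ping is lost; with
        # the decorator, a failure is reported as a skip instead of
        # breaking the gate, while passing runs still count normally.
        pass
```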
15:55:02 <obondarev> another bug related to the PTG discussion on linuxbridge future
15:55:10 <lajoskatona> slaweq: I will check
15:55:24 <slaweq> IMHO we need to make our CI a bit better, as now it's a nightmare
15:55:29 <slaweq> obondarev: yes, that's true
15:55:54 <slaweq> probably we will get back to that discussion in some time :)
15:56:14 <lajoskatona> we should ask NASA to help maintain it :P
15:56:20 <slaweq> lajoskatona: yeah :)
15:56:24 <slaweq> good idea
15:56:50 <slaweq> can I assign it as an action item to You? :P
15:56:54 * slaweq is just kidding
15:57:39 <lajoskatona> :-)
15:57:45 <slaweq> ok, that was all I had for today
15:58:04 <slaweq> if You don't have any last minute topics, I will give You a few minutes back
15:58:15 <obondarev> o/
15:58:29 <bcafarel> nothing from me
15:58:42 <slaweq> ok, thx for attending the meeting today
15:58:46 <slaweq> #endmeeting