15:00:41 <slaweq> #startmeeting neutron_ci
15:00:43 <openstack> Meeting started Tue Dec  1 15:00:41 2020 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:44 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:46 <openstack> The meeting name has been set to 'neutron_ci'
15:00:47 <slaweq> welcome again :)
15:00:50 <bcafarel> not even time for coffee break :(
15:00:56 <ralonsoh> hi again
15:00:57 <lajoskatona> o/
15:01:42 <obondarev> o/
15:02:09 <slaweq> ok, let's start, as we have a couple of things to discuss here also :)
15:02:15 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:02:37 <slaweq> #topic Actions from previous meetings
15:02:42 <slaweq> bcafarel to fix stable branches upper-constraints in stadium projects
15:03:34 <bcafarel> done for victoria https://review.opendev.org/c/openstack/requirements/+/764022
15:03:48 <bcafarel> ussuri is close https://review.opendev.org/c/openstack/requirements/+/764021
15:03:50 * mlavalle has a doctor appointment. will skip this meeting o/
15:04:10 <bcafarel> in the end this also required dropping neutron from the blacklist
15:04:11 <slaweq> take care mlavalle :)
15:04:19 <bcafarel> o/ mlavalle
15:04:39 <bcafarel> the requirements folks are still hoping neutron-lib will be complete one day and remove the need for these steps
15:04:51 <bcafarel> but well we know this will not be the case soon™
15:05:13 <bcafarel> anyway at least this will be noted in my next action item
15:05:18 <slaweq> what do You mean by "neutron-lib will be complete"?
15:05:31 <slaweq> so all projects will import only neutron-lib, and not neutron?
15:05:33 * mlavalle is only going for an eye exam. needs new eye glasses. that's all :-)
15:05:44 <bcafarel> slaweq: indeed
15:06:01 <slaweq> bcafarel: that can be hard, especially since we haven't worked on that much recently :/
15:07:06 <bcafarel> yes :/ so I think we will stay with the "need to update requirements after a release" step
15:07:33 <slaweq> bcafarel: and to fix that in ussuri we need https://review.opendev.org/c/openstack/requirements/+/764021 right?
15:08:04 <bcafarel> slaweq: yes that's the one (764022 is the merged one for victoria)
15:08:21 <slaweq> ok, so it's almost there
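(For context: the post-release step bcafarel describes is a small edit to upper-constraints.txt in openstack/requirements for the stable branch. A purely illustrative sketch of its shape, with made-up version numbers rather than the actual content of the linked reviews:)

```
# upper-constraints.txt on the stable branch (illustrative versions only)
-neutron-lib===2.3.0
+neutron-lib===2.3.1
```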
15:09:33 <slaweq> ok, let's move to the next one
15:09:35 <slaweq> bcafarel to check and update doc https://docs.openstack.org/neutron/latest/contributor/policies/release-checklist.html
15:10:04 <bcafarel> barely started, we can put that for next week
15:10:18 <slaweq> ok
15:10:24 <slaweq> #action bcafarel to check and update doc https://docs.openstack.org/neutron/latest/contributor/policies/release-checklist.html
15:10:38 <slaweq> so next one
15:10:42 <slaweq> slaweq to explore options to fix https://bugs.launchpad.net/neutron/+bug/1903531
15:10:44 <openstack> Launchpad bug 1903531 in neutron "Update of neutron-server breaks compatibility to previous neutron-agent version" [Critical,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
15:10:52 <slaweq> we already discussed that at the previous meeting
15:10:56 <bcafarel> just a bit :)
15:10:58 <slaweq> so no need to repeat it here
15:11:05 <slaweq> next one
15:11:07 <slaweq> slaweq to report bug against rally
15:11:16 <slaweq> I checked that and it's really not a rally bug
15:11:23 <slaweq> but a red herring
15:11:34 <slaweq> the real bug was simply that some subnet creation failed
15:11:42 <slaweq> so I didn't report anything against rally
15:12:02 <slaweq> and that's all the actions from last week
15:12:15 <slaweq> next topic
15:12:17 <slaweq> #topic Stadium projects
15:12:28 <slaweq> any updates about stadium projects ci?
15:12:31 <slaweq> lajoskatona?
15:12:41 <lajoskatona> nothing, as far as I have seen
15:12:56 <lajoskatona> things are going on without much trouble
15:13:24 <slaweq> lajoskatona: that's good to hear
15:13:35 <slaweq> #topic Stable branches
15:13:46 <slaweq> Victoria dashboard: https://grafana.opendev.org/d/HUCHup2Gz/neutron-failure-rate-previous-stable-release?orgId=1
15:13:49 <slaweq> Ussuri dashboard: https://grafana.opendev.org/d/smqHXphMk/neutron-failure-rate-older-stable-release?orgId=1
15:14:01 <slaweq> bcafarel: any updates/issues regarding ci of stable branches?
15:14:14 <bcafarel> not that I am aware of at least :)
15:15:02 <slaweq> ok
15:15:05 <slaweq> so let's move on
15:15:07 <slaweq> #topic Grafana
15:15:33 <slaweq> in the master branch I don't think things are going well
15:15:45 <slaweq> we have plenty of issues and failure rates are pretty high for some jobs
15:15:55 <slaweq> especially functional/fullstack recently
15:17:16 <ralonsoh> if we see a recurrent error in the CI (on those jobs), report it and mention it on IRC
15:17:30 <ralonsoh> just to let everybody know that you are on it
15:17:31 <slaweq> ralonsoh: yes, I have a couple of examples
15:17:35 <ralonsoh> perfect
15:17:37 <slaweq> I found them today
15:17:43 <ralonsoh> (test_walk_versions, for example)
15:17:45 <slaweq> but I didn't have time yet to report LPs
15:18:05 <slaweq> ok, regarding grafana I don't really have more to say
15:18:39 <slaweq> I know that some graphs are a bit out of date recently, but I want to propose an update for that once all the patches which change some jobs are merged
15:18:48 <slaweq> I think there is still one or two in gerrit
15:19:02 <slaweq> other than that, I think we can talk about some specific jobs now
15:19:06 <slaweq> are You ok with that?
15:20:01 <ralonsoh> yes
15:20:09 <slaweq> #topic fullstack/functional
15:20:16 <slaweq> ok
15:20:34 <slaweq> first one is bug https://bugs.launchpad.net/neutron/+bug/1889781 which is still hitting us from time to time
15:20:35 <openstack> Launchpad bug 1889781 in neutron "Functional tests are timing out" [High,Confirmed]
15:20:43 <slaweq> and I think it's happening even more often recently
15:21:30 <slaweq> I may try to limit the number of logs sent to stdout during those tests
15:21:47 <slaweq> but if there is anyone else who wants to do that, that would be great :)
15:21:58 <slaweq> please then simply assign this bug to Yourself
15:22:03 <slaweq> and work on it
15:22:18 <ralonsoh> is that related to the size of the logs?
15:22:29 <slaweq> ralonsoh: most likely yes
15:22:33 <ralonsoh> ok
15:22:41 <slaweq> we saw a similar issue in the past in the UT, IIRC
15:22:53 <ralonsoh> but I think this is because of some failing tests
15:22:54 <slaweq> basically it is some bug in stestr or something like that
15:22:58 <slaweq> ralonsoh: no
15:23:00 <ralonsoh> like neutron.tests.functional.agent.linux.test_tc_lib.TcFiltersTestCase.test_add_tc_filter_vxlan [540.005735s] ... FAILED
15:23:11 <ralonsoh> spending too much time
15:23:14 <slaweq> if You look at the logs, there is always a huge gap when nothing happens
15:23:43 <ralonsoh> because all workers are blocked in other tests
15:23:55 <slaweq> see for example:
15:23:57 <slaweq> 2020-11-30 10:03:00.937710 | controller | {1} neutron.tests.functional.agent.ovn.metadata.test_metadata_agent.TestMetadataAgent.test_agent_resync_on_non_existing_bridge [1.997655s] ... ok
15:23:59 <slaweq> 2020-11-30 10:43:39.465033 | RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/openstack/neutron/playbooks/run_functional_job.yaml@master]
15:24:04 <ralonsoh> I know
15:24:11 <slaweq> those are 2 consecutive lines from the log
15:24:24 <slaweq> so there is nothing for about 40 minutes there
15:24:27 <ralonsoh> but IMO this is because the other workers are blocked checking something
15:24:37 <slaweq> and that was exactly the symptom of the issue with too much output and stestr
15:26:07 <slaweq> ralonsoh: maybe the root cause now is different than it was with that stestr issue
15:26:10 <slaweq> idk really
15:26:17 <lajoskatona> there was a new release of stestr recently, not sure though what it fixes
15:26:21 <slaweq> but at first glance it looks similar to what we had in the past
15:28:23 <slaweq> anyway, if someone has some time, You can take a look at that bug :)
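(A minimal sketch of the mitigation slaweq offers above, assuming the timeouts really are triggered by per-test log volume on stdout: capture logs at a higher level in a test base class. The class name is illustrative, not the actual neutron functional test base.)

```python
import logging

import fixtures


class FunctionalTestBase(fixtures.TestWithFixtures):
    """Illustrative base class, not neutron's actual functional test base."""

    def setUp(self):
        super().setUp()
        # Capture log output instead of letting it reach stdout, and drop
        # DEBUG noise; huge per-test output is what was suspected of
        # stalling stestr in the past.
        self.useFixture(fixtures.FakeLogger(level=logging.INFO))
```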
15:28:43 <slaweq> let's move on
15:28:45 <slaweq> next one
15:28:59 <slaweq> I noticed failures with TestSimpleMonitorInterface a few times this week
15:29:03 <slaweq> like e.g.:
15:29:08 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_93d/764365/1/gate/neutron-functional-with-uwsgi/93df51c/testr_results.html
15:29:16 <slaweq> I need to report LP for that
15:29:45 <slaweq> ralonsoh: isn't that related to some of Your changes maybe? It looks like something that You could work on :)
15:30:03 <ralonsoh> sure, I'll check it
15:30:09 <ralonsoh> and I'll report a LP
15:31:12 <ralonsoh> ahh I think you are talking about a fullstack patch
15:31:12 <slaweq> ralonsoh: in log I see something like:
15:31:15 <slaweq> 2020-11-30 10:40:38.271 61912 DEBUG neutron.agent.linux.utils [req-2aa4c2b1-90e9-4f8d-a708-61d18ad4f3ec - - - - -] Running command: ['sudo', '/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/bin/neutron-rootwrap', '/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/etc/neutron/rootwrap.conf', 'ovsdb-client', 'monitor', 'Interface',
15:31:17 <slaweq> 'name,ofport,external_ids', '--format=json'] create_process /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/linux/utils.py:88
15:31:19 <slaweq> 2020-11-30 10:40:38.321 61912 DEBUG neutron.agent.common.async_process [-] Output received from [ovsdb-client monitor Interface name,ofport,external_ids --format=json]: None _read_stdout /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/common/async_process.py:264
15:31:21 <slaweq> 2020-11-30 10:40:38.322 61912 DEBUG neutron.agent.common.async_process [-] Halting async process [ovsdb-client monitor Interface name,ofport,external_ids --format=json] in response to an error. stdout: [[]] - stderr: [[]] _handle_process_error /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/common/async_process.py:222
15:31:48 <ralonsoh> slaweq, in OVS there are two monitors
15:31:56 <ralonsoh> one for the ports and another one for the bridges
15:32:01 <ralonsoh> I migrated the bridges one
15:32:11 <ralonsoh> but I never finished the complex one, for ports
15:32:40 <ralonsoh> https://review.opendev.org/c/openstack/neutron/+/735201
15:32:41 <slaweq> so this seems to me like the monitor of Interfaces
15:32:50 <ralonsoh> yes
15:32:55 <slaweq> name,ofport,external_ids
15:32:55 <ralonsoh> (not ports, interfaces)
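(For reference, the failing tests drive roughly this kind of process: ovsdb-client in monitor mode emits one JSON document per line for each Interface change, which the agent-side monitor parses. A standalone sketch, assuming ovsdb-client is installed and reachable without rootwrap:)

```python
import json
import subprocess

# Follow Interface changes the way the SimpleMonitorInterface tests do:
# ovsdb-client streams one JSON update per line for the monitored columns.
proc = subprocess.Popen(
    ['ovsdb-client', 'monitor', 'Interface',
     'name,ofport,external_ids', '--format=json'],
    stdout=subprocess.PIPE, text=True)

for line in proc.stdout:
    line = line.strip()
    if not line:
        continue
    update = json.loads(line)
    print(update)  # row changes for name/ofport/external_ids
```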
15:33:37 <slaweq> ok, do You want to investigate it? Or do You want me to check that?
15:33:47 <ralonsoh> I'll report and investigate it
15:33:52 <slaweq> thx
15:34:13 <slaweq> #action ralonsoh to report and check issue with TestSimpleMonitorInterface in functional tests
15:34:28 <slaweq> in the meeting agenda https://etherpad.opendev.org/p/neutron-ci-meetings there are more examples of the same failure
15:34:48 <slaweq> let's move now to the fullstack tests
15:34:54 <slaweq> which are also not very stable recently
15:35:12 <slaweq> most often I saw an issue with mysql being killed by the oom killer
15:35:17 <slaweq> bug reported https://launchpad.net/bugs/1906366
15:35:19 <openstack> Launchpad bug 1906366 in neutron "oom killer kills mysqld process on the node running fullstack tests" [Critical,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
15:35:38 <slaweq> I proposed a patch to limit the resources used there
15:35:48 <slaweq> but I saw that ralonsoh had some comments there
15:35:56 <slaweq> I didn't have time yet to address them
15:36:12 <ralonsoh> we test concurrency, so we should not reduce the number of API workers to 1
15:36:15 <ralonsoh> just this
15:36:33 <slaweq> ralonsoh: but is it only this one test which You actually mentioned?
15:36:38 <slaweq> or are there others also
15:36:47 <ralonsoh> I only found this one
15:36:55 <slaweq> because if that's the only test which needs 2 workers, I can set 2 workers only for that test
15:37:05 <ralonsoh> perfect
15:37:05 <slaweq> and use a default of "1" for all other tests
15:37:58 <ralonsoh> yes, I think this is the only one
15:38:07 <slaweq> ok, great
15:38:12 <slaweq> so I will update my patch
15:38:36 <lajoskatona> that is a good example, thanks for mentioning it
15:38:48 <slaweq> and also, as I see in the results now, lowering the number of test runner workers from 4 to 3 adds about 18 minutes to the whole job
15:38:54 <slaweq> so it should be acceptable
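(The override being discussed could take roughly this shape; everything below is illustrative, not the exact neutron fullstack API: a per-test-class knob defaulting to one API worker, raised only for the concurrency-sensitive test.)

```python
from dataclasses import dataclass


@dataclass
class EnvironmentDescription:
    """Stand-in for the fullstack environment config; illustrative only."""
    api_workers: int = 1


class BaseFullStackTestCase:
    # Default every test to a single neutron-server API worker to keep the
    # job's memory footprint down (see the oom-killer bug above).
    api_workers = 1

    def environment(self):
        return EnvironmentDescription(api_workers=self.api_workers)


class TestApiWorkerConcurrency(BaseFullStackTestCase):
    # Hypothetical name for the one test that genuinely exercises
    # concurrency and therefore keeps two workers.
    api_workers = 2
```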
15:39:44 <lajoskatona> slaweq: if you are overloaded I can take care of this api_worker change, that dhcp test comes from us.
15:40:18 <slaweq> lajoskatona: thx, if You could update my patch that would be great
15:41:00 <lajoskatona> slaweq: sure
15:41:21 <slaweq> lajoskatona: thx a lot
15:41:40 <slaweq> ok, let's move on to the scenario/tempest jobs
15:41:47 <slaweq> #topic Tempest/Scenario
15:41:57 <slaweq> first of all neutron-tempest-plugin-api
15:42:11 <slaweq> I noticed quite often that there is one test failing
15:42:16 <slaweq> test_dhcp_port_status_active
15:42:20 <slaweq> e.g.:
15:42:24 <slaweq> https://1973ad26b23f3d5a6239-a05b796fccac2efb122cdf71ce7f0104.ssl.cf5.rackcdn.com/763828/4/check/neutron-tempest-plugin-api/bda79c4/testr_results.html
15:42:28 <slaweq> or
15:42:29 <slaweq> https://38bbf4ec3cadfd43de08-7d0e556db3075d25d1b91bbdcc8a4562.ssl.cf2.rackcdn.com/764108/6/check/neutron-tempest-plugin-api/cc5cbc6/testr_results.html
15:42:34 <slaweq> I need to report that one too
15:44:38 <slaweq> from what I saw in the neutron-ovs-agent logs, it seems that the issue is an rpc loop iteration which takes a long time, and due to that the port does not become ACTIVE within 60 seconds
15:45:00 <slaweq> so one workaround for that could be to bump the timeout in that test
15:45:09 <ralonsoh> is this because the VM is not spawned?
15:45:29 <slaweq> but I thought that maybe ralonsoh's patch which moves sleep(0) to the end of the rpc loop iteration may help with that
15:45:40 <slaweq> and the second patch which lowers the number of workers in the tests
15:45:45 <ralonsoh> agree
15:45:47 <slaweq> is it also for the neutron-tempest-plugin-api job?
15:46:06 <slaweq> ralonsoh: there is no vm actually spawned in that test. It is just checking the dhcp port
15:46:20 <slaweq> but that port also needs to be provisioned by the L2 entity to become ACTIVE
15:48:05 <ralonsoh> it takes more than one minute to set the device UP
15:48:19 <slaweq> ralonsoh: yes
15:48:33 <ralonsoh> that's insane...
15:48:44 <slaweq> and You can check in the neutron-ovs-agent's logs that the rpc loop iteration takes about 80-90 seconds at that specific time
15:49:02 <ralonsoh> yeah
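(The timeout-bump workaround slaweq mentions would amount to a longer polling window on the tempest side. A hedged sketch of such a wait loop using a neutronclient-style show_port call; the helper name and 120s default are ours, not the actual tempest code:)

```python
import time


def wait_for_port_active(client, port_id, timeout=120, interval=2):
    """Poll a port until it is ACTIVE.

    The DHCP port only flips to ACTIVE once the L2 agent finishes a full
    rpc loop iteration, so a 60s timeout races against a slow iteration.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        port = client.show_port(port_id)['port']
        if port['status'] == 'ACTIVE':
            return port
        time.sleep(interval)
    raise AssertionError('port %s not ACTIVE after %ss' % (port_id, timeout))
```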
15:49:29 <slaweq> so I thought that patch https://review.opendev.org/c/openstack/neutron/+/755313 may help with that issue
15:49:50 <slaweq> if that gets merged and we still see the same issues, I will investigate more
15:49:57 <ralonsoh> perfect
15:50:27 <slaweq> #action slaweq to check if test_dhcp_port_status_active is still failing after https://review.opendev.org/c/openstack/neutron/+/755313 is merged
15:50:40 <slaweq> btw. lajoskatona if You can take a look at ^^ that would be great :)
15:51:11 <lajoskatona> slaweq: I'll check it
15:51:14 <slaweq> lajoskatona: thx
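(Why moving sleep(0) can help: under eventlet, sleep(0) is a cooperative yield, so placing it at the end of the loop body gives other greenthreads, such as RPC handlers, a chance to run between iterations. A toy sketch of the idea, not the agent's actual loop:)

```python
import eventlet


def rpc_loop(agent):
    # Toy agent main loop; `agent` is a stand-in object with these attributes.
    while agent.run_daemon_loop:
        agent.process_port_updates()
        # eventlet.sleep(0) yields control to other greenthreads without
        # actually sleeping; yielding at the end of each iteration keeps
        # other work from starving while ports are processed.
        eventlet.sleep(0)
```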
15:51:17 <slaweq> ok, let's move on
15:51:29 <slaweq> the next issue which I found was in neutron-ovn-tempest-ovs-release-ipv6-only
15:51:40 <slaweq> I saw ssh failures in that job a few times
15:51:45 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_cac/764356/1/check/neutron-ovn-tempest-ovs-release-ipv6-only/cacd054/testr_results.html
15:51:47 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_08c/752795/24/check/neutron-ovn-tempest-ovs-release-ipv6-only/08c6400/testr_results.html
15:53:13 <slaweq> in both cases it seems that even metadata wasn't reachable from the vm
15:53:30 <slaweq> do You know of any issues which could cause that and are already reported/in progress?
15:53:48 <ralonsoh> yes but for OVS-DPDK
15:54:00 <ralonsoh> (I think this is not related)
15:54:20 <ralonsoh> https://review.opendev.org/c/openstack/neutron/+/763745
15:57:03 <slaweq> ok, I will report that issue on LP and ask someone from the OVN squad to take a look at it
15:57:29 <slaweq> #action slaweq to report LP about SSH failures in the neutron-ovn-tempest-ovs-release-ipv6-only
15:57:46 <slaweq> and with that I think it's all for today
15:57:49 <ralonsoh> give me 10 secs, please. Fullstack related
15:57:50 <slaweq> from me
15:57:52 <slaweq> sure
15:57:56 <ralonsoh> liuyulong, https://review.opendev.org/c/openstack/neutron/+/738446
15:58:05 <ralonsoh> please, take a look at the replies
15:58:16 <ralonsoh> and anyone else is welcome to review it
15:58:19 <ralonsoh> thanks a lot
15:58:23 <ralonsoh> (that's all)
15:59:35 <slaweq> ok
15:59:40 <slaweq> thx for attending the meeting
15:59:43 <ralonsoh> bye!
15:59:46 <slaweq> see You online
15:59:48 <slaweq> o/
15:59:48 <lajoskatona> Bye!
15:59:50 <slaweq> #endmeeting