15:00:41 <slaweq> #startmeeting neutron_ci
15:00:43 <openstack> Meeting started Tue Dec  1 15:00:41 2020 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:44 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:46 <openstack> The meeting name has been set to 'neutron_ci'
15:00:47 <slaweq> welcome again :)
15:00:50 <bcafarel> not even time for a coffee break :(
15:00:56 <ralonsoh> hi again
15:00:57 <lajoskatona> o/
15:01:42 <obondarev> o/
15:02:09 <slaweq> ok, let's start as we have a couple of things to discuss here also :)
15:02:15 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:02:37 <slaweq> #topic Actions from previous meetings
15:02:42 <slaweq> bcafarel to fix stable branches upper-constraints in stadium projects
15:03:34 <bcafarel> done for victoria https://review.opendev.org/c/openstack/requirements/+/764022
15:03:48 <bcafarel> ussuri is close https://review.opendev.org/c/openstack/requirements/+/764021
15:03:50 * mlavalle has a doctor appointment. will skip this meeting o/
15:04:10 <bcafarel> in the end this also required dropping neutron from the blacklist
15:04:11 <slaweq> take care mlavalle :)
15:04:19 <bcafarel> o/ mlavalle
15:04:39 <bcafarel> the requirements folks are still hoping neutron-lib will be complete one day and remove the need for these steps
15:04:51 <bcafarel> but well, we know this will not be the case soon™
15:05:13 <bcafarel> anyway, at least this will be noted in my next action item
15:05:18 <slaweq> what do You mean by "neutron-lib will be complete"?
15:05:31 <slaweq> so all projects will import only neutron-lib, and not neutron?
15:05:33 * mlavalle is only going for an eye exam. needs new eye glasses. that's all :-)
15:05:44 <bcafarel> slaweq: indeed
15:06:01 <slaweq> bcafarel: that can be hard, especially as we haven't worked on that much recently :/
15:07:06 <bcafarel> yes :/ so I think we will stay with the "need to update requirements after a release" step
15:07:33 <slaweq> bcafarel: and to fix that in ussuri we need https://review.opendev.org/c/openstack/requirements/+/764021 right?
15:08:04 <bcafarel> slaweq: yes that's the one (764022 is the merged one for victoria)
15:08:21 <slaweq> ok, so it's almost there
15:09:33 <slaweq> ok, let's move to the next one
15:09:35 <slaweq> bcafarel to check and update doc https://docs.openstack.org/neutron/latest/contributor/policies/release-checklist.html
15:10:04 <bcafarel> barely started, we can put that down for next week
15:10:18 <slaweq> ok
15:10:24 <slaweq> #action bcafarel to check and update doc https://docs.openstack.org/neutron/latest/contributor/policies/release-checklist.html
15:10:38 <slaweq> so next one
15:10:42 <slaweq> slaweq to explore options to fix https://bugs.launchpad.net/neutron/+bug/1903531
15:10:44 <openstack> Launchpad bug 1903531 in neutron "Update of neutron-server breaks compatibility to previous neutron-agent version" [Critical,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
15:10:52 <slaweq> we already discussed that at the previous meeting
15:10:56 <bcafarel> just a bit :)
15:10:58 <slaweq> so no need to repeat it here
15:11:05 <slaweq> next one
15:11:07 <slaweq> slaweq to report bug against rally
15:11:16 <slaweq> I checked that and it's really not a rally bug
15:11:23 <slaweq> but some red herring
15:11:34 <slaweq> the real bug was that some subnet creation simply failed
15:11:42 <slaweq> so I didn't report anything against rally
15:12:02 <slaweq> and that's all the actions from last week
15:12:15 <slaweq> next topic
15:12:17 <slaweq> #topic Stadium projects
15:12:28 <slaweq> any updates about stadium projects ci?
15:12:31 <slaweq> lajoskatona?
15:12:41 <lajoskatona> nothing as far as I have seen
15:12:56 <lajoskatona> things are going on without much problem
15:13:24 <slaweq> lajoskatona: that's good to hear
15:13:35 <slaweq> #topic Stable branches
15:13:46 <slaweq> Victoria dashboard: https://grafana.opendev.org/d/HUCHup2Gz/neutron-failure-rate-previous-stable-release?orgId=1
15:13:49 <slaweq> Ussuri dashboard: https://grafana.opendev.org/d/smqHXphMk/neutron-failure-rate-older-stable-release?orgId=1
15:14:01 <slaweq> bcafarel: any updates/issues regarding ci of stable branches?
15:14:14 <bcafarel> not that I am aware of at least :)
15:15:02 <slaweq> ok
15:15:05 <slaweq> so let's move on
15:15:07 <slaweq> #topic Grafana
15:15:33 <slaweq> in the master branch I don't think that things are going well
15:15:45 <slaweq> we have plenty of issues and failure rates are pretty high for some jobs
15:15:55 <slaweq> especially functional/fullstack recently
15:17:16 <ralonsoh> if we see a recurrent error in the CI (in those jobs), report it and mention it on IRC
15:17:30 <ralonsoh> just to let everybody know that you are on it
15:17:31 <slaweq> ralonsoh: yes, I have a couple of examples
15:17:35 <ralonsoh> perfect
15:17:37 <slaweq> I found them today
15:17:43 <ralonsoh> (test_walk_versions, for example)
15:17:45 <slaweq> but I didn't have time yet to report LPs
15:18:05 <slaweq> ok, regarding grafana I don't really have more to say
15:18:39 <slaweq> I know that some graphs are a bit out of date recently, but I want to propose one update for that when all the patches which change some jobs are merged
15:18:48 <slaweq> I think there is still one or two in gerrit
15:19:02 <slaweq> other than that, I think we can talk about some specific jobs now
15:19:06 <slaweq> are You ok with that?
15:20:01 <ralonsoh> yes
15:20:09 <slaweq> #topic fullstack/functional
15:20:16 <slaweq> ok
15:20:34 <slaweq> first one is bug https://bugs.launchpad.net/neutron/+bug/1889781 which is still hitting us from time to time
15:20:35 <openstack> Launchpad bug 1889781 in neutron "Functional tests are timing out" [High,Confirmed]
15:20:43 <slaweq> and I think it's even more frequent recently
15:21:30 <slaweq> I may try to limit the number of logs sent to stdout during those tests
15:21:47 <slaweq> but if there is anyone else who wants to do that, that would be great :)
15:21:58 <slaweq> please then simply assign this bug to You
15:22:03 <slaweq> and work on it
15:22:18 <ralonsoh> is that related to the size of the logs?
15:22:29 <slaweq> ralonsoh: most likely yes
15:22:33 <ralonsoh> ok
15:22:41 <slaweq> we saw a similar issue in the past in UT IIRC
15:22:53 <ralonsoh> but I think this is because of some failing tests
15:22:54 <slaweq> basically it is some bug in stestr or something like that
15:22:58 <slaweq> ralonsoh: no
15:23:00 <ralonsoh> like neutron.tests.functional.agent.linux.test_tc_lib.TcFiltersTestCase.test_add_tc_filter_vxlan [540.005735s] ... FAILED
15:23:11 <ralonsoh> spending too much time
15:23:14 <slaweq> if You look at the logs, there is always a huge gap when nothing happens
15:23:43 <ralonsoh> because all workers are blocked in other tests
15:23:55 <slaweq> see for example:
15:23:57 <slaweq> 2020-11-30 10:03:00.937710 | controller | {1} neutron.tests.functional.agent.ovn.metadata.test_metadata_agent.TestMetadataAgent.test_agent_resync_on_non_existing_bridge [1.997655s] ... ok
15:23:59 <slaweq> 2020-11-30 10:43:39.465033 | RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/openstack/neutron/playbooks/run_functional_job.yaml@master]
15:24:04 <ralonsoh> I know
15:24:11 <slaweq> those are 2 consecutive lines from the log
15:24:24 <slaweq> so there is nothing for about 40 minutes there
15:24:27 <ralonsoh> but IMO this is because the other workers are blocked checking something
15:24:37 <slaweq> and that was exactly the symptom of the issue with too much output and stestr
15:26:07 <slaweq> ralonsoh: maybe the root cause now is different than it was with that stestr issue
15:26:10 <slaweq> idk really
15:26:17 <lajoskatona> there was a new release of stestr recently, not sure though what it fixes
15:26:21 <slaweq> but at first glance it looks similar to what we had in the past
15:28:23 <slaweq> anyway, if someone has some time, You can take a look at that bug :)
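
The mitigation slaweq mentions (limiting the log output sent to stdout during the tests) can be sketched roughly as below. This is a minimal illustration, not the actual neutron fixture; the logger names are assumed from the job output quoted in the meeting, and raising their level keeps DEBUG lines out of the stream stestr has to buffer per test.

    import logging

    # Loggers assumed to be flooding DEBUG output, based on the
    # functional job log quoted above.
    NOISY_LOGGERS = (
        'neutron.agent.linux.utils',
        'neutron.agent.common.async_process',
    )

    def quiet_noisy_loggers(level=logging.INFO):
        """Raise the level of known-chatty loggers for the test run."""
        for name in NOISY_LOGGERS:
            logging.getLogger(name).setLevel(level)
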
15:28:43 <slaweq> let's move on
15:28:45 <slaweq> next one
15:28:59 <slaweq> I noticed a few times this week failures with TestSimpleMonitorInterface
15:29:03 <slaweq> like e.g.:
15:29:08 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_93d/764365/1/gate/neutron-functional-with-uwsgi/93df51c/testr_results.html
15:29:16 <slaweq> I need to report an LP for that
15:29:45 <slaweq> ralonsoh: isn't that related to some of Your changes maybe? It looks like something You could work on :)
15:30:03 <ralonsoh> sure, I'll check it
15:30:09 <ralonsoh> and I'll report an LP
15:31:12 <ralonsoh> ahh I think you are talking about a fullstack patch
15:31:12 <slaweq> ralonsoh: in the log I see something like:
15:31:15 <slaweq> 2020-11-30 10:40:38.271 61912 DEBUG neutron.agent.linux.utils [req-2aa4c2b1-90e9-4f8d-a708-61d18ad4f3ec - - - - -] Running command: ['sudo', '/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/bin/neutron-rootwrap', '/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/etc/neutron/rootwrap.conf', 'ovsdb-client', 'monitor', 'Interface',
15:31:17 <slaweq> 'name,ofport,external_ids', '--format=json'] create_process /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/linux/utils.py:88
15:31:19 <slaweq> 2020-11-30 10:40:38.321 61912 DEBUG neutron.agent.common.async_process [-] Output received from [ovsdb-client monitor Interface name,ofport,external_ids --format=json]: None _read_stdout /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/common/async_process.py:264
15:31:21 <slaweq> 2020-11-30 10:40:38.322 61912 DEBUG neutron.agent.common.async_process [-] Halting async process [ovsdb-client monitor Interface name,ofport,external_ids --format=json] in response to an error. stdout: [[]] - stderr: [[]] _handle_process_error /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/common/async_process.py:222
15:31:48 <ralonsoh> slaweq, in OVS there are two monitors
15:31:56 <ralonsoh> one for the ports and another one for the bridges
15:32:01 <ralonsoh> I migrated the bridges one
15:32:11 <ralonsoh> but I never finished the complex one, for ports
15:32:40 <ralonsoh> https://review.opendev.org/c/openstack/neutron/+/735201
15:32:41 <slaweq> so this seems to me like a monitor of Interfaces
15:32:50 <ralonsoh> yes
15:32:55 <slaweq> name,ofport,external_ids
15:32:55 <ralonsoh> (not ports, interfaces)
15:33:37 <slaweq> ok, do You want to investigate it? Or do You want me to check that?
15:33:47 <ralonsoh> I'll report and investigate it
15:33:52 <slaweq> thx
15:34:13 <slaweq> #action ralonsoh to report and check issue with TestSimpleMonitorInterface in functional tests
15:34:28 <slaweq> in the meeting agenda https://etherpad.opendev.org/p/neutron-ci-meetings there are more examples of the same failure
15:34:48 <slaweq> let's move now to fullstack tests
15:34:54 <slaweq> which are also not very stable recently
15:35:12 <slaweq> most often I saw the issue with mysql killed by the oom killer
15:35:17 <slaweq> bug reported https://launchpad.net/bugs/1906366
15:35:19 <openstack> Launchpad bug 1906366 in neutron "oom killer kills mysqld process on the node running fullstack tests" [Critical,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
15:35:38 <slaweq> I proposed a patch to limit resources used there
15:35:48 <slaweq> but I saw that ralonsoh had some comments there
15:35:56 <slaweq> I didn't have time yet to address them
15:36:12 <ralonsoh> we test concurrency, so we should not reduce the number of API workers to 1
15:36:15 <ralonsoh> just this
15:36:33 <slaweq> ralonsoh: but is it only this one test which You actually mentioned?
15:36:38 <slaweq> or are there others also?
15:36:47 <ralonsoh> I only found this one
15:36:55 <slaweq> because if that's the only test which needs 2 workers, I can set 2 workers only for that test
15:37:05 <ralonsoh> perfect
15:37:05 <slaweq> and use the default of "1" for all other tests
15:37:58 <ralonsoh> yes, I think this is the only one
15:38:07 <slaweq> ok, great
15:38:12 <slaweq> so I will update my patch
15:38:36 <lajoskatona> that is a good example, thanks for mentioning it
15:38:48 <slaweq> and also as I see in the results now, lowering the number of test runner workers from 4 to 3 adds about 18 minutes to the whole job
15:38:54 <slaweq> so it should be acceptable
15:39:44 <lajoskatona> slaweq: if you are overloaded I can take care of this api_worker change, that dhcp test comes from us.
15:40:18 <slaweq> lajoskatona: thx, if You could update my patch that would be great
15:41:00 <lajoskatona> slaweq: sure
15:41:21 <slaweq> lajoskatona: thx a lot
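
The compromise agreed on above (one API worker by default, with an opt-in for the test that exercises concurrency) could look roughly like this. This is a hypothetical sketch only: the EnvironmentDescription name and api_workers parameter are assumed from the discussion, and the real fullstack resource class takes many more parameters.

    class EnvironmentDescription:
        def __init__(self, api_workers=1):
            # Default to a single API worker to keep memory usage on the
            # test node in check (see bug 1906366, mysqld vs oom killer).
            self.api_workers = api_workers

    # Most tests take the default...
    env = EnvironmentDescription()

    # ...while the DHCP test that exercises concurrent API workers
    # explicitly asks for two:
    env_concurrent = EnvironmentDescription(api_workers=2)
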
15:41:40 <slaweq> ok, let's move on to the scenario/tempest jobs
15:41:47 <slaweq> #topic Tempest/Scenario
15:41:57 <slaweq> first of all neutron-tempest-plugin-api
15:42:11 <slaweq> I noticed quite often that there is one test failing
15:42:16 <slaweq> test_dhcp_port_status_active
15:42:20 <slaweq> e.g.:
15:42:24 <slaweq> https://1973ad26b23f3d5a6239-a05b796fccac2efb122cdf71ce7f0104.ssl.cf5.rackcdn.com/763828/4/check/neutron-tempest-plugin-api/bda79c4/testr_results.html
15:42:28 <slaweq> or
15:42:29 <slaweq> https://38bbf4ec3cadfd43de08-7d0e556db3075d25d1b91bbdcc8a4562.ssl.cf2.rackcdn.com/764108/6/check/neutron-tempest-plugin-api/cc5cbc6/testr_results.html
15:42:34 <slaweq> I need to report that one too
15:44:38 <slaweq> from what I saw in the neutron-ovs-agent logs, the issue is with an rpc loop iteration which takes a long time, and because of that the port does not become ACTIVE within 60 seconds
15:45:00 <slaweq> so one workaround for that could be to bump the timeout in that test
15:45:09 <ralonsoh> is this because the VM is not spawned?
15:45:29 <slaweq> but I thought that maybe ralonsoh's patch which moves sleep(0) to the end of the rpc loop iteration may help with that
15:45:40 <slaweq> and the second patch which lowers the number of workers in the tests
15:45:45 <ralonsoh> agree
15:45:47 <slaweq> is it also for the neutron-tempest-plugin-api job?
15:46:06 <slaweq> ralonsoh: there is no VM really spawned in that test. It is checking just the dhcp port
15:46:20 <slaweq> but that port also needs to be provisioned by the L2 entity to become ACTIVE
15:48:05 <ralonsoh> it takes more than one minute to set the device UP
15:48:19 <slaweq> ralonsoh: yes
15:48:33 <ralonsoh> that's insane...
15:48:44 <slaweq> and You can check in neutron-ovs-agent's logs that the rpc loop iteration takes about 80-90 seconds at that specific time
15:49:02 <ralonsoh> yeah
15:49:29 <slaweq> so I thought that patch https://review.opendev.org/c/openstack/neutron/+/755313 maybe will help with that issue
15:49:50 <slaweq> if that is merged and we still see the same issues, I will investigate it more
15:49:57 <ralonsoh> perfect
15:50:27 <slaweq> #action slaweq to check if test_dhcp_port_status_active will still be failing after https://review.opendev.org/c/openstack/neutron/+/755313 is merged
15:50:40 <slaweq> btw. lajoskatona if You can take a look at ^^ that would be great :)
15:51:11 <lajoskatona> slaweq: I'll check it
15:51:14 <slaweq> lajoskatona: thx
15:51:17 <slaweq> ok, let's move on
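
The workaround slaweq mentions (bumping the timeout so a slow rpc_loop iteration of 80-90 seconds still fits) amounts to a longer polling window around the port status check. A minimal sketch, assuming a hypothetical client object with a show_port() call; the function name and timeout value are illustrative, not the actual tempest plugin code:

    import time

    def wait_for_port_active(client, port_id, timeout=120, interval=5):
        # 120s leaves headroom for an ovs-agent rpc_loop iteration that
        # itself takes 80-90 seconds before the port can become ACTIVE.
        deadline = time.time() + timeout
        while time.time() < deadline:
            port = client.show_port(port_id)['port']
            if port['status'] == 'ACTIVE':
                return port
            time.sleep(interval)
        raise RuntimeError('port %s not ACTIVE after %ss' % (port_id, timeout))
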
15:51:29 <slaweq> next issue which I found was in neutron-ovn-tempest-ovs-release-ipv6-only
15:51:40 <slaweq> I saw ssh failures a few times in that job
15:51:45 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_cac/764356/1/check/neutron-ovn-tempest-ovs-release-ipv6-only/cacd054/testr_results.html
15:51:47 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_08c/752795/24/check/neutron-ovn-tempest-ovs-release-ipv6-only/08c6400/testr_results.html
15:53:13 <slaweq> in both cases it seems that even metadata wasn't reachable from the vm
15:53:30 <slaweq> do You know any issues which could cause that and are already reported/in progress?
15:53:48 <ralonsoh> yes, but for OVS-DPDK
15:54:00 <ralonsoh> (I think this is not related)
15:54:20 <ralonsoh> https://review.opendev.org/c/openstack/neutron/+/763745
15:57:03 <slaweq> ok, I will report that issue on LP and ask someone from the OVN squad to take a look at it
15:57:29 <slaweq> #action slaweq to report LP about SSH failures in the neutron-ovn-tempest-ovs-release-ipv6-only job
15:57:46 <slaweq> and with that I think it's all for today
15:57:49 <ralonsoh> give me 10 secs, please. Fullstack related
15:57:50 <slaweq> from me
15:57:52 <slaweq> sure
15:57:56 <ralonsoh> liuyulong, https://review.opendev.org/c/openstack/neutron/+/738446
15:58:05 <ralonsoh> please, take a look at the replies
15:58:16 <ralonsoh> and anyone else is welcome to review it
15:58:19 <ralonsoh> thanks a lot
15:58:23 <ralonsoh> (that's all)
15:59:35 <slaweq> ok
15:59:40 <slaweq> thx for attending the meeting
15:59:43 <ralonsoh> bye!
15:59:46 <slaweq> see You online
15:59:48 <slaweq> o/
15:59:48 <lajoskatona> Bye!
15:59:50 <slaweq> #endmeeting