16:00:16 <slaweq> #startmeeting neutron_ci
16:00:17 <openstack> Meeting started Tue Jan 21 16:00:16 2020 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:18 <slaweq> hi
16:00:19 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:21 <openstack> The meeting name has been set to 'neutron_ci'
16:01:28 <njohnston> o/
16:01:43 <ralonsoh> hi
16:01:57 <bcafarel> hi
16:02:26 <slaweq> I think we have our usual attendees so let's start
16:02:36 <slaweq> #topic Actions from previous meetings
16:02:47 <slaweq> first one:
16:02:48 <slaweq> slaweq to remove networking-midonet and tripleo based jobs from Neutron check queue
16:02:52 <slaweq> I did patches:
16:02:59 <slaweq> Midonet: https://review.opendev.org/#/c/703282/1
16:03:00 <slaweq> TripleO: https://review.opendev.org/#/c/703283/
16:03:27 <slaweq> I sent separate patches for each job because it will be easier to revert if we want to bring those jobs back
16:03:34 <ralonsoh> +2 to both
16:03:40 <slaweq> thx ralonsoh
16:04:54 <slaweq> and the next one:
16:04:56 <slaweq> bcafarel to send cherry-pick of https://review.opendev.org/#/c/680001/ to stable/stein to fix functional tests failure
16:05:17 <bcafarel> sent and merged! https://review.opendev.org/#/c/702603/
16:05:23 <slaweq> thx bcafarel
16:05:28 <njohnston> For midonet, I am not sure it is still broken on py3...
look at this midonet change, to drop py27, the py3 jobs all work well https://review.opendev.org/#/c/701210/
16:05:56 <slaweq> njohnston: yes, but there is no scenario job there
16:05:57 <ralonsoh> njohnston, good catch
16:05:59 <slaweq> only UT
16:06:34 <slaweq> problem is that midonet's CI is using centos and it's not really working on python 3 IIRC
16:07:11 <njohnston> OK, I'll look into it and probably +2+W
16:07:22 <slaweq> thx njohnston
16:08:21 <slaweq> ok, I think we can move on
16:08:25 <njohnston> may need to wait a little, recent failures fail with the "pip._internal.distributions.source import SourceDistribution\nImportError: cannot import name SourceDistribution\n" error that is now getting cleared up
16:08:58 <slaweq> njohnston: is that failure from midonet job or from where?
16:09:40 <njohnston> I saw that in networking-midonet-tempest-aio-ml2-centos-7 results, yes
16:10:00 <slaweq> ahh, ok
16:10:23 <njohnston> For those who have not seen, it was discussed on openstack-discuss
16:10:26 <njohnston> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-January/012117.html
16:10:31 <slaweq> if You have some time to play with it to fix it, that would be great, if not - I would still be for removing it for now until it is fixed
16:11:13 <njohnston> slaweq: Yeah, I'll take a quick look, if it gets outside the timebox then I'll approve the change and we can move on :-)
16:11:39 <slaweq> njohnston: thx a lot
16:11:54 <slaweq> ok, next topic then?
16:12:17 <njohnston> +1
16:13:50 <slaweq> #topic Stadium projects
16:14:05 <slaweq> njohnston: any updates about dropping py2?
16:14:23 <slaweq> You weren't on yesterday's team meeting so maybe You have something new for today :)
16:14:48 <njohnston> so midonet merged its change
16:15:13 <njohnston> which I think means the only one left for dropping py27 is neutron-fwaas
16:15:24 <bcafarel> yeah open reviews list is quite short now
16:15:27 <njohnston> #link https://review.opendev.org/#/c/688278/
16:16:19 <njohnston> if neutron-fwaas is going to be retired then I believe there's no sense in doing the work of fixing it
16:16:31 <njohnston> so in short: mission complete!
16:16:38 <ralonsoh> cool!
16:16:38 <slaweq> njohnston: I don't think it's going to be retired this cycle
16:16:40 <bcafarel> \o/
16:16:42 <slaweq> maybe in next one
16:17:00 <slaweq> so IMHO we should merge this patch, but it shouldn't be too much work IMO
16:17:23 <bcafarel> neutron-fwaas change was already in good shape last time I checked, should be able to merge it soon
16:17:25 <njohnston> ok, I'll start pushing it
16:17:49 <slaweq> thx njohnston
16:18:05 <slaweq> yes, amotoki found some small issues in this patch, except that IMO it's good to go
16:18:29 <njohnston> I'll push the fixes after the meeting
16:18:36 <slaweq> and tempest job failure wasn't related to dropping py2 for sure
16:18:39 <slaweq> thx njohnston
16:19:07 <slaweq> ok, I think that's all about stadium projects
16:19:11 <slaweq> so we can move on
16:19:51 <slaweq> #topic Grafana
16:19:59 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:21:40 <njohnston> looking pretty good
16:22:11 <ralonsoh> but during the last week, the number of rechecks has increased
16:22:17 <ralonsoh> the gate is almost "closed"
16:22:31 <ralonsoh> FT is failing a lot because of ovsdb timeouts
16:23:13 <slaweq> yes, grafana looks pretty good but I also have the impression that it's not working very well for some of the jobs
16:23:59 <bcafarel> yup, same feeling here
16:24:09 <slaweq> also my script showed me that last week we had 1 recheck on average to
merge patch
16:24:35 <slaweq> but I think it's because we didn't merge many patches and some of them were with e.g. docs only, so fewer jobs are run on them
16:25:38 <ralonsoh> and, for example, the gate now is dropping all jobs
16:25:38 <slaweq> ok, so let's maybe talk about some specific jobs and issues from last week
16:25:45 <slaweq> ralonsoh: why?
16:25:49 <slaweq> any specific reason?
16:25:55 <slaweq> or just random failures?
16:26:03 <ralonsoh> still reviewing it
16:26:11 <slaweq> ok, thx
16:26:12 <bcafarel> the recent pip issue?
16:26:38 <slaweq> ahh, yes I saw some email about it today
16:27:52 <ralonsoh> seems that the problem is fixed and released
16:27:57 <ralonsoh> pip 20.0.1
16:28:08 <slaweq> yes, this should be better now
16:29:15 <slaweq> ok, let's move on to the specific jobs now
16:29:23 <slaweq> #topic fullstack/functional
16:29:34 <slaweq> I have a couple of issues for today
16:29:46 <slaweq> first one is, already mentioned by ralonsoh, ovsdbapp timeouts, like e.g.:
16:29:54 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_559/703299/1/gate/neutron-functional/559c0ed/testr_results.html
16:29:55 <slaweq> (fullstack) https://1a798a15650a97d81a82-17406e3478c64e603d8ff3ea0aac16c8.ssl.cf1.rackcdn.com/703366/1/check/neutron-fullstack/59e1877/testr_results.html
16:29:57 <slaweq> https://0bad6b662fac8347dc41-be430d2f919a8698d2e96141ed3ac146.ssl.cf5.rackcdn.com/687922/15/gate/neutron-functional/b6769d6/testr_results.html
16:30:11 <slaweq> I think there was even an old bug reported for that
16:30:31 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/1802640
16:30:31 <openstack> Launchpad bug 1802640 in neutron "TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.SetFailModeCommand in q-agt gate failure" [Medium,Confirmed]
16:30:51 <slaweq> yes, this one
16:30:53 <slaweq> thx ralonsoh
16:30:56 <ralonsoh> I would like to submit a patch to increase the ovsdbapp log level
16:31:24 <ralonsoh> and
then, maybe, talk to Terry Wilson to do something there, in the command commit context
16:31:24 <slaweq> if that may help to debug this issue, I'm all for this :)
16:31:46 <bcafarel> +1 if that gives a better image of what it is actually spending time on
16:32:20 <bcafarel> I did not check the specific failure, but functional has been grumpier recently on stein/train too - may be similar issue
16:32:35 <bcafarel> (where rechecks are ping pong between functional and grenade timeouts)
16:33:47 <slaweq> ok, so ralonsoh will increase log level for ovsdbapp in those jobs and we will see then
16:34:04 <njohnston> +1
16:34:08 <slaweq> #action ralonsoh to increase log level for ovsdbapp in fullstack/functional jobs
16:34:17 <slaweq> thx ralonsoh for taking care of it
16:34:32 <ralonsoh> yw
16:35:15 <slaweq> ok, next one which I saw this week was Issue with EVENT OVSNeutronAgentOSKenApp->ofctl_service GetDatapathRequest
16:35:34 <slaweq> like e.g. https://f26da45659020db1220c-76fc92e5e7c4e5a091c792a95503ad1d.ssl.cf5.rackcdn.com/701565/5/check/neutron-functional/4db8d3f/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.l2.extensions.test_ovs_agent_qos_extension.TestOVSAgentQosExtension.test_policy_rule_delete_egress_.txt
16:35:56 <slaweq> and I remember that I already saw such an issue a few times in the past
16:37:00 <slaweq> and it failed on setup_physical_bridges because it couldn't get dp_id
16:37:51 <slaweq> is there anyone who wants to check that?
16:38:03 <ralonsoh> I can take a look at it
16:38:08 <ralonsoh> is there a bug reported?
16:38:36 <slaweq> ralonsoh: nope, but I will open one for it today
16:38:42 <ralonsoh> perfect
16:38:51 <slaweq> #action slaweq to open bug for issue with get_dp_id in os_ken
16:39:01 <slaweq> and I will send it to You
16:39:35 <slaweq> ok, let's move on to the fullstack job then
16:39:42 <slaweq> first issue there:
16:39:43 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_315/601336/42/check/neutron-fullstack/3156265/testr_results.html
16:39:49 <slaweq> I saw it a few weeks ago also
16:40:09 <slaweq> and according to http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22AssertionError%3A%20QoS%20register%20not%20found%20with%20queue-num%5C%22 it is happening from time to time recently
16:40:41 <ralonsoh> yes, it's very "delicate" the min-qos feature
16:40:54 <ralonsoh> if you have only one agent, everything is ok
16:41:10 <ralonsoh> but when we are testing, we are introducing several qos/queue configurations
16:41:20 <ralonsoh> the agent extension is not ready for this
16:41:27 <ralonsoh> but we fake this in the tests
16:41:42 <ralonsoh> but of course, we can have these kinds of situations...
16:42:09 <ralonsoh> but, since the last patch, it is much more stable
16:42:21 <ralonsoh> and, btw, this is still pending
16:42:31 <ralonsoh> https://review.opendev.org/#/c/687922/
16:42:39 <ralonsoh> (re-re-re-rechecking)
16:42:52 <slaweq> ahh, ok
16:43:03 <slaweq> so this patch should help with this issue, right?
16:43:11 <ralonsoh> I hope so
16:43:14 <slaweq> ok
16:43:26 <ralonsoh> (patch you are "my last hope")
16:43:33 <slaweq> LOL
16:43:44 <slaweq> ok, next issue with fullstack is
16:43:46 <slaweq> Failure on cleanup phase: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_564/702250/6/gate/neutron-fullstack/564244f/testr_results.html
16:43:48 <slaweq> Do we really need cleanup in fullstack tests?
16:44:13 <slaweq> I started thinking today that we maybe don't need cleanup phase in fullstack tests
16:44:27 <slaweq> for each test we are spawning new neutron-server, agents and everything else
16:44:36 <slaweq> and it's always only for this one test
16:44:53 <slaweq> so what's the point of deleting routers, ports, networks, etc. at the end?
16:45:00 <slaweq> IMO it's just a waste of time
16:45:07 <slaweq> or am I missing something here?
16:45:11 <njohnston> to make sure the delete functionality doesn't have bugs, yes?
16:45:25 <ralonsoh> exactly
16:45:47 <slaweq> njohnston: sure, but we are testing delete functionality in scenario and api tests also, no?
16:46:21 <slaweq> fullstack is more for kind of whitebox testing where You can check if things are actually configured on host properly
16:48:05 <njohnston> OK, you could make that argument. We could spin a change to remove those cleanups and make sure nothing bag happens.
16:48:23 <slaweq> :)
16:48:31 <njohnston> *bad
16:48:32 <njohnston> SMH
16:48:47 <slaweq> so I will send such patch to also see how much time we can save on that job if we remove cleanups from it
16:49:07 <slaweq> if it's not too much, maybe we can drop this patch and stay with it as it is now
16:49:08 <ralonsoh> my only concern is the L1 elements in the host (taps, devices, bridges, etc)
16:49:09 <slaweq> ok for You?
16:49:26 <slaweq> ralonsoh: sure, I'm not talking about things like that
16:49:30 <ralonsoh> we can try it
16:49:33 <ralonsoh> ok
16:49:39 <slaweq> I just want to skip cleaning neutron resources
16:50:00 <bcafarel> if you run fullstack in a loop and do not start seeing OOM errors it is probably good ;)
16:50:47 <slaweq> bcafarel: good test indeed :)
16:51:11 <slaweq> #action slaweq to try to skip cleaning up neutron resources in fullstack job
16:51:37 <slaweq> ok, let's move on quickly to the next topic as we are running out of time
16:51:39 <slaweq> #topic Tempest/Scenario
16:51:44 <njohnston> just make sure that we also accommodate developers running fullstack on their personal laptops
16:52:06 <slaweq> njohnston: sure, I will check if this will be fine in such case
16:52:57 <slaweq> ok, speaking about tempest jobs I saw again this week issue with paramiko:
16:53:10 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ca7/653883/11/check/neutron-tempest-plugin-scenario-linuxbridge/ca7d140/testr_results.html
16:53:18 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_820/611605/22/gate/neutron-tempest-plugin-scenario-linuxbridge/82023b2/testr_results.html
16:53:45 <ralonsoh> #link https://review.opendev.org/#/c/702903/
16:53:49 <slaweq> basically there is error like "paramiko.ssh_exception.SSHException: No existing session"
16:53:53 <ralonsoh> that's why I pushed this
16:54:22 <ralonsoh> ahhh no, this is another error
16:54:33 <ralonsoh> but that was solved last week
16:54:47 <slaweq> I don't think it was solved last week
16:55:01 <ralonsoh> https://review.opendev.org/#/c/701018/
16:55:13 <slaweq> there was another, similar issue but with error like "NoneType object has no attribute session" or something like that
16:55:26 <ralonsoh> pfffff ok then
16:55:43 <slaweq> there is bug reported for this new issue, I need to find it
16:56:10 <slaweq>
paramiko.ssh_exception.SSHException: No existing session
16:56:17 <slaweq> sorry
16:56:19 <slaweq> https://bugs.launchpad.net/neutron/+bug/1858642
16:56:19 <openstack> Launchpad bug 1858642 in neutron "paramiko.ssh_exception.NoValidConnectionsError error cause dvr scenario jobs failing" [High,Confirmed]
16:56:37 <slaweq> if anyone would have some time to look into it, that would be great
16:56:47 <slaweq> and it's not only in dvr jobs
16:57:34 <slaweq> and that's all from me for today
16:57:35 <ralonsoh> I can take a look, but this problem seems to be in the zuul.test.base
16:57:53 <slaweq> zuul.test.base? what is it?
16:58:15 <ralonsoh> nothing, wrong search
16:58:20 <slaweq> ahh, ok
16:58:53 <slaweq> ok, let's move on quickly
16:58:58 <slaweq> 2 more things
16:59:12 <slaweq> 1. thx ralonsoh for fixing the mariadb periodic job - it's working fine now
16:59:23 <bcafarel> thanks ralonsoh!
16:59:25 <slaweq> 2. I would like to ask You about changing the time of this meeting
16:59:51 <slaweq> what do You say about moving it to Wednesday at 2pm UTC, just after L3 meeting?
17:00:11 <njohnston> that works fine for me
17:00:12 <slaweq> as today at same time there is tc call in redhat every 3 weeks and I would like to attend it sometimes
17:00:23 <ralonsoh> you mean 3pm UTC
17:00:33 <slaweq> sorry, yes
17:00:34 <slaweq> 3pm
17:00:38 <slaweq> 2pm is L3 meeting
17:00:41 <ralonsoh> ok for me
17:00:44 <slaweq> thx
17:00:53 <slaweq> I will propose patch and add all of You as reviewers then
17:01:02 <slaweq> ok, that's all for today
17:01:05 <bcafarel> and mail maybe also for interested folks
17:01:06 <slaweq> thx for attending
17:01:08 <ralonsoh> bye
17:01:12 <slaweq> #endmeeting