16:00:16 <slaweq> #startmeeting neutron_ci
16:00:17 <openstack> Meeting started Tue Jan 21 16:00:16 2020 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:18 <slaweq> hi
16:00:19 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:21 <openstack> The meeting name has been set to 'neutron_ci'
16:01:28 <njohnston> o/
16:01:43 <ralonsoh> hi
16:01:57 <bcafarel> hi
16:02:26 <slaweq> I think we have our usual attendees so let's start
16:02:36 <slaweq> #topic Actions from previous meetings
16:02:47 <slaweq> first one:
16:02:48 <slaweq> slaweq to remove networking-midonet and tripleo based jobs from Neutron check queue
16:02:52 <slaweq> I did patches:
16:02:59 <slaweq> Midonet: https://review.opendev.org/#/c/703282/1
16:03:00 <slaweq> TripleO: https://review.opendev.org/#/c/703283/
16:03:27 <slaweq> I sent a separate patch for each job because it will be easier to revert if we want to bring those jobs back
16:03:34 <ralonsoh> +2 to both
16:03:40 <slaweq> thx ralonsoh
16:04:54 <slaweq> and the next one:
16:04:56 <slaweq> bcafarel to send cherry-pick of https://review.opendev.org/#/c/680001/ to stable/stein to fix functional tests failure
16:05:17 <bcafarel> sent and merged! https://review.opendev.org/#/c/702603/
16:05:23 <slaweq> thx bcafarel
16:05:28 <njohnston> For midonet, I am not sure it is still broken on py3... look at this midonet change to drop py27 - the py3 jobs all work well: https://review.opendev.org/#/c/701210/
16:05:56 <slaweq> njohnston: yes, but there isn't any scenario job there
16:05:57 <ralonsoh> njohnston, good catch
16:05:59 <slaweq> only UT
16:06:34 <slaweq> the problem is that midonet's CI is using centos and it's not really working on python 3 IIRC
16:07:11 <njohnston> OK, I'll look into it and probably +2+W
16:07:22 <slaweq> thx njohnston
16:08:21 <slaweq> ok, I think we can move on
16:08:25 <njohnston> may need to wait a little, recent failures hit the "pip._internal.distributions.source import SourceDistribution\nImportError: cannot import name SourceDistribution\n" error that is now getting cleared up
16:08:58 <slaweq> njohnston: is that failure from the midonet job or from somewhere else?
16:09:40 <njohnston> I saw that in networking-midonet-tempest-aio-ml2-centos-7 results, yes
16:10:00 <slaweq> ahh, ok
16:10:23 <njohnston> For those who have not seen it, it was discussed on openstack-discuss
16:10:26 <njohnston> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-January/012117.html
16:10:31 <slaweq> if You have some time to play with it and fix it, that would be great; if not, I would still be for removing it for now until it is fixed
16:11:13 <njohnston> slaweq: Yeah, I'll take a quick look, if it gets outside the timebox then I'll approve the change and we can move on :-)
16:11:39 <slaweq> njohnston: thx a lot
16:11:54 <slaweq> ok, next topic then?
16:12:17 <njohnston> +1
16:13:50 <slaweq> #topic Stadium projects
16:14:05 <slaweq> njohnston: any updates about dropping py2?
16:14:23 <slaweq> You weren't at yesterday's team meeting so maybe You have something new for today :)
16:14:48 <njohnston> so midonet merged its change
16:15:13 <njohnston> which I think means the only one left for dropping py27 is neutron-fwaas
16:15:24 <bcafarel> yeah open reviews list is quite short now
16:15:27 <njohnston> #link https://review.opendev.org/#/c/688278/
16:16:19 <njohnston> if neutron-fwaas is going to be retired then I believe there's no sense in doing the work of fixing it
16:16:31 <njohnston> so in short: mission complete!
16:16:38 <ralonsoh> cool!
16:16:38 <slaweq> njohnston: I don't think it's going to be retired this cycle
16:16:40 <bcafarel> \o/
16:16:42 <slaweq> maybe in next one
16:17:00 <slaweq> so IMHO we should merge this patch, and it shouldn't be too much work anyway
16:17:23 <bcafarel> the neutron-fwaas change was already in good shape last time I checked, we should be able to merge it soon
16:17:25 <njohnston> ok, I'll start pushing it
16:17:49 <slaweq> thx njohnston
16:18:05 <slaweq> yes, amotoki found some small issues in this patch, but apart from that IMO it's good to go
16:18:29 <njohnston> I'll push the fixes after the meeting
16:18:36 <slaweq> and the tempest job failure wasn't related to dropping py2 for sure
16:18:39 <slaweq> thx njohnston
16:19:07 <slaweq> ok, I think that's all about stadium projects
16:19:11 <slaweq> so we can move on
16:19:51 <slaweq> #topic Grafana
16:19:59 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:21:40 <njohnston> looking pretty good
16:22:11 <ralonsoh> but during the last week, the number of rechecks has increased
16:22:17 <ralonsoh> the gate is almost "closed"
16:22:31 <ralonsoh> FT is failing a lot because ovsdb timeouts
16:23:13 <slaweq> yes, grafana looks pretty good but I also have the impression that it's not working very well for some of the jobs
16:23:59 <bcafarel> yup, same feeling here
16:24:09 <slaweq> also my script showed me that last week we needed 1 recheck on average to merge a patch
16:24:35 <slaweq> but I think it's because we didn't merge many patches and some of them were e.g. docs-only, so fewer jobs are run on them
16:25:38 <ralonsoh> and, for example, the gate is now dropping all jobs
16:25:38 <slaweq> ok, so lets maybe talk about some specific jobs and issues from last week
16:25:45 <slaweq> ralonsoh: why?
16:25:49 <slaweq> any specific reason?
16:25:55 <slaweq> or just random failures?
16:26:03 <ralonsoh> still reviewing it
16:26:11 <slaweq> ok, thx
16:26:12 <bcafarel> the recent pip issue?
16:26:38 <slaweq> ahh, yes I saw some email about it today
16:27:52 <ralonsoh> seems that the problem is fixed and released
16:27:57 <ralonsoh> pip 20.0.1
16:28:08 <slaweq> yes, this should be better now
16:29:15 <slaweq> ok, lets move on to the specific jobs now
16:29:23 <slaweq> #topic fullstack/functional
16:29:34 <slaweq> I have couple of issues for today
16:29:46 <slaweq> the first one is the ovsdbapp timeouts, already mentioned by ralonsoh, like e.g.:
16:29:54 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_559/703299/1/gate/neutron-functional/559c0ed/testr_results.html
16:29:55 <slaweq> (fullstack) https://1a798a15650a97d81a82-17406e3478c64e603d8ff3ea0aac16c8.ssl.cf1.rackcdn.com/703366/1/check/neutron-fullstack/59e1877/testr_results.html
16:29:57 <slaweq> https://0bad6b662fac8347dc41-be430d2f919a8698d2e96141ed3ac146.ssl.cf5.rackcdn.com/687922/15/gate/neutron-functional/b6769d6/testr_results.html
16:30:11 <slaweq> I think there was even an old bug reported for that
16:30:31 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/1802640
16:30:31 <openstack> Launchpad bug 1802640 in neutron "TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.SetFailModeCommand in q-agt gate failure" [Medium,Confirmed]
16:30:51 <slaweq> yes, this one
16:30:53 <slaweq> thx ralonsoh
16:30:56 <ralonsoh> I would like to submit a patch to increase the ovsdbapp log level
16:31:24 <ralonsoh> and then, maybe, talk to Terry Wilson to do something there, in the command commit context
16:31:24 <slaweq> if that may help to debug this issue, I'm all for it :)
16:31:46 <bcafarel> +1 if that gives a better picture of what it is actually spending time on
16:32:20 <bcafarel> I did not check the specific failure, but functional has been grumpier recently on stein/train too - may be a similar issue
16:32:35 <bcafarel> (where rechecks ping-pong between functional and grenade timeouts)
16:33:47 <slaweq> ok, so ralonsoh will increase the log level for ovsdbapp in those jobs and we will see then
16:34:04 <njohnston> +1
16:34:08 <slaweq> #action ralonsoh to increase log level for ovsdbapp in fullstack/functional jobs
16:34:17 <slaweq> thx ralonsoh for taking care of it
16:34:32 <ralonsoh> yw
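A minimal sketch of the idea, assuming the goal is simply to raise the ovsdbapp loggers to DEBUG in the functional/fullstack test setup so a timed-out transaction shows which commands it was stuck on; the helper name below is made up and this is not ralonsoh's actual patch:

    import logging

    def enable_ovsdbapp_debug_logging():
        # ovsdbapp uses the standard Python logging module under the
        # 'ovsdbapp' namespace, so raising that parent logger to DEBUG
        # records the queued commands and transaction results (provided the
        # test log handler also accepts DEBUG records).
        logging.getLogger('ovsdbapp').setLevel(logging.DEBUG)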
16:35:15 <slaweq> ok, the next one which I saw this week was the issue with EVENT OVSNeutronAgentOSKenApp->ofctl_service GetDatapathRequest
16:35:34 <slaweq> like e.g. https://f26da45659020db1220c-76fc92e5e7c4e5a091c792a95503ad1d.ssl.cf5.rackcdn.com/701565/5/check/neutron-functional/4db8d3f/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.l2.extensions.test_ovs_agent_qos_extension.TestOVSAgentQosExtension.test_policy_rule_delete_egress_.txt
16:35:56 <slaweq> and I remember that I already saw such an issue a few times in the past
16:37:00 <slaweq> and it failed on setup_physical_bridges because it couldn't get dp_id
16:37:51 <slaweq> is there anyone who wants to check that?
16:38:03 <ralonsoh> I can take a look at it
16:38:08 <ralonsoh> is there a bug reported?
16:38:36 <slaweq> ralonsoh: nope, but I will open one for it today
16:38:42 <ralonsoh> perfect
16:38:51 <slaweq> #action slaweq to open bug for issue with get_dp_id in os_ken
16:39:01 <slaweq> and I will send it to You
16:39:35 <slaweq> ok, let's move on to the fullstack job then
16:39:42 <slaweq> first issue there:
16:39:43 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_315/601336/42/check/neutron-fullstack/3156265/testr_results.html
16:39:49 <slaweq> I saw it a few weeks ago also
16:40:09 <slaweq> and according to http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22AssertionError%3A%20QoS%20register%20not%20found%20with%20queue-num%5C%22 it is happening from time to time recently
16:40:41 <ralonsoh> yes, the min-qos feature is very "delicate"
16:40:54 <ralonsoh> if you have only one agent, everything is ok
16:41:10 <ralonsoh> but when we are testing, we are introducing several qos/queue configurations
16:41:20 <ralonsoh> the agent extension is not ready for this
16:41:27 <ralonsoh> but we fake this in the tests
16:41:42 <ralonsoh> but of course, we can have these kinds of situations...
16:42:09 <ralonsoh> but, since the last patch, it is much more stable
16:42:21 <ralonsoh> and, btw, this is still pending
16:42:31 <ralonsoh> https://review.opendev.org/#/c/687922/
16:42:39 <ralonsoh> (re-re-re-rechecking)
16:42:52 <slaweq> ahh, ok
16:43:03 <slaweq> so this patch should help with this issue, right?
16:43:11 <ralonsoh> I hope so
16:43:14 <slaweq> ok
16:43:26 <ralonsoh> (patch you are "my last hope")
16:43:33 <slaweq> LOL
16:43:44 <slaweq> ok, next issue with fullstack is
16:43:46 <slaweq> Failure on cleanup phase: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_564/702250/6/gate/neutron-fullstack/564244f/testr_results.html
16:43:48 <slaweq> Do we really need cleanup in fullstack tests?
16:44:13 <slaweq> I started thinking today that maybe we don't need the cleanup phase in fullstack tests
16:44:27 <slaweq> for each test we are spawning a new neutron-server, agents and everything else
16:44:36 <slaweq> and it's always only for this one test
16:44:53 <slaweq> so what's the point of deleting routers, ports, networks, etc. at the end?
16:45:00 <slaweq> IMO it's just a waste of time
16:45:07 <slaweq> or am I missing something here?
16:45:11 <njohnston> to make sure the delete functionality doesn't have bugs, yes?
16:45:25 <ralonsoh> exactly
16:45:47 <slaweq> njohnston: sure, but we are testing delete functionality in scenario and API tests too, no?
16:46:21 <slaweq> fullstack is more a kind of whitebox testing where You can check if things are actually configured properly on the host
16:48:05 <njohnston> OK, you could make that argument.  We could spin a change to remove those cleanups and make sure nothing bag happens.
16:48:23 <slaweq> :)
16:48:31 <njohnston> *bad
16:48:32 <njohnston> SMH
16:48:47 <slaweq> so I will send such a patch to also see how much time we can save on that job if we remove the cleanups from it
16:49:07 <slaweq> if it's not too much, maybe we can drop this patch and keep the job as it is now
16:49:08 <ralonsoh> my only concern is the L1 elements on the host (taps, devices, bridges, etc)
16:49:09 <slaweq> ok for You?
16:49:26 <slaweq> ralonsoh: sure, I'm not talking about things like that
16:49:30 <ralonsoh> we can try it
16:49:33 <ralonsoh> ok
16:49:39 <slaweq> I just want to skip cleaning neutron resources
16:50:00 <bcafarel> if you run fullstack in a loop and do not start seeing OOM errors it is probably good ;)
16:50:47 <slaweq> bcafarel: good test indeed :)
16:51:11 <slaweq> #action slaweq to try to skip cleaning up neutron resources in fullstack job
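A minimal sketch of the idea discussed above, assuming each fullstack test really does get its own neutron-server and agents: per-resource API cleanups are registered only when explicitly requested, and everything else is left to the per-test environment teardown. The class and flag names are hypothetical, not neutron code:

    class ResourceHelperSketch(object):
        """Hypothetical helper illustrating optional API cleanups."""

        def __init__(self, client, test_case, register_cleanups=True):
            self.client = client
            self.test_case = test_case
            # When False, created resources are left for the per-test
            # environment teardown instead of being deleted one by one
            # through the API at the end of the test.
            self.register_cleanups = register_cleanups

        def create_network(self, **kwargs):
            network = self.client.create_network(**kwargs)
            if self.register_cleanups:
                self.test_case.addCleanup(
                    self.client.delete_network, network['id'])
            return network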
16:51:37 <slaweq> ok, let's move on quickly to the next topic as we are running out of time
16:51:39 <slaweq> #topic Tempest/Scenario
16:51:44 <njohnston> just make sure that we also accommodate developers running fullstack on their personal laptops
16:52:06 <slaweq> njohnston: sure, I will check if this will be fine in such a case
16:52:57 <slaweq> ok, speaking of tempest jobs, this week I again saw an issue with paramiko:
16:53:10 <slaweq> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ca7/653883/11/check/neutron-tempest-plugin-scenario-linuxbridge/ca7d140/testr_results.html
16:53:18 <slaweq> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_820/611605/22/gate/neutron-tempest-plugin-scenario-linuxbridge/82023b2/testr_results.html
16:53:45 <ralonsoh> #link https://review.opendev.org/#/c/702903/
16:53:49 <slaweq> basically there is an error like "paramiko.ssh_exception.SSHException: No existing session"
16:53:53 <ralonsoh> that's why I pushed this
16:54:22 <ralonsoh> ahhh no, this is another error
16:54:33 <ralonsoh> but that was solved last week
16:54:47 <slaweq> I don't think it was solved last week
16:55:01 <ralonsoh> https://review.opendev.org/#/c/701018/
16:55:13 <slaweq> there was another, similar issue but with an error like "NoneType object has no attribute session" or something like that
16:55:26 <ralonsoh> pfffff ok then
16:55:43 <slaweq> there is a bug reported for this new issue, I need to find it
16:56:10 <slaweq> paramiko.ssh_exception.SSHException: No existing session
16:56:17 <slaweq> sorry
16:56:19 <slaweq> https://bugs.launchpad.net/neutron/+bug/1858642
16:56:19 <openstack> Launchpad bug 1858642 in neutron "paramiko.ssh_exception.NoValidConnectionsError error cause dvr scenario jobs failing" [High,Confirmed]
16:56:37 <slaweq> if anyone would have some time to look into it, that would be great
16:56:47 <slaweq> and it's not only in dvr jobs
16:57:34 <slaweq> and that's all from me for today
16:57:35 <ralonsoh> I can take a look, but this problem seems to be in the zuul.test.base
16:57:53 <slaweq> zuul.test.base? what is it?
16:58:15 <ralonsoh> nothing, wrong search
16:58:20 <slaweq> ahh, ok
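One possible mitigation, sketched under the assumption that the SSH sessions die early and a fresh connection attempt can succeed; this is not the tempest fix, and the function and its parameters are made up for illustration:

    import time

    import paramiko

    def connect_with_retry(host, username, pkey, attempts=3, delay=5):
        # Retry connections that fail with
        # "paramiko.ssh_exception.SSHException: No existing session"
        # instead of failing the whole scenario test on the first attempt.
        for attempt in range(1, attempts + 1):
            client = paramiko.SSHClient()
            client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            try:
                client.connect(host, username=username, pkey=pkey, timeout=30)
                return client
            except paramiko.ssh_exception.SSHException:
                client.close()
                if attempt == attempts:
                    raise
                time.sleep(delay)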
16:58:53 <slaweq> ok, let's move on quickly
16:58:58 <slaweq> 2 more things
16:59:12 <slaweq> 1. thx ralonsoh for fixing the mariadb periodic job - it's working fine now
16:59:23 <bcafarel> thanks ralonsoh!
16:59:25 <slaweq> 2. I would like to ask You about changing the time of this meeting
16:59:51 <slaweq> what do You say about moving it to Wednesday at 2pm UTC, just after the L3 meeting?
17:00:11 <njohnston> that works fine for me
17:00:12 <slaweq> as today at the same time there is a TC call at Red Hat every 3 weeks and I would like to attend it sometimes
17:00:23 <ralonsoh> you mean 3pm UTC
17:00:33 <slaweq> sorry, yes
17:00:34 <slaweq> 3pm
17:00:38 <slaweq> 2pm is L3 meeting
17:00:41 <ralonsoh> ok for me
17:00:44 <slaweq> thx
17:00:53 <slaweq> I will propose a patch and add all of You as reviewers then
17:01:02 <slaweq> ok, that's all for today
17:01:05 <bcafarel> and maybe also a mail for interested folks
17:01:06 <slaweq> thx for attending
17:01:08 <ralonsoh> bye
17:01:12 <slaweq> #endmeeting