16:00:25 <slaweq> #startmeeting neutron_ci
16:00:29 <slaweq> hi
16:00:31 <openstack> Meeting started Tue Feb 12 16:00:25 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:32 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:35 <openstack> The meeting name has been set to 'neutron_ci'
16:02:54 <slaweq> let's wait a few minutes for other people to join
16:03:39 <slaweq> but looks like there is no crowd today :)
16:04:13 <ralonsoh> hi! I'm still here
16:04:24 <slaweq> hi ralonsoh, good to see You :)
16:04:42 <slaweq> I know that mlavalle will not be available today
16:04:44 <bcafarel> o/
16:04:52 <slaweq> but maybe haleyb and bcafarel will join
16:04:56 <slaweq> hi bcafarel :)
16:05:06 <slaweq> hongbin: are You around for the CI meeting?
16:05:09 <bcafarel> my spider sense was telling me I should switch IRC channel :)
16:05:12 <haleyb> oh, hi, looking at bugs
16:05:14 <hongbin> o/
16:05:25 <slaweq> ok, so now I think we can start :)
16:05:31 <slaweq> welcome everyone!
16:05:36 <slaweq> #topic Actions from previous meetings
16:05:47 <slaweq> first one was:
16:05:49 <slaweq> mlavalle to continue investigating why the L3 agent is considered down and causes trunk tests to fail
16:06:13 <slaweq> mlavalle gave me some update about his findings on it
16:06:28 <slaweq> he said that the problem is with the test migrating a router from HA to dvr/legacy. That migration implies deleting the existing router's state_change_monitor
16:06:53 <slaweq> so we try to execute 'kill', '-15', 'pid' and that basically causes the agent to be marked "dead"
16:06:58 <slaweq> please look here: http://paste.openstack.org/show/744001/
16:07:56 <slaweq> possibly we are missing a filter in https://github.com/openstack/neutron/blob/master/etc/neutron/rootwrap.d/l3.filters for killing the neutron-keepalived-state-change process
16:08:14 <slaweq> haleyb: do You think that it is possible?
16:08:36 <haleyb> so it's getting an error trying to kill the state change monitor?
16:09:05 <slaweq> IIUC what mlavalle was saying then yes :)
16:09:51 <slaweq> and indeed IMO this KillFilter is missing there
16:10:04 <slaweq> so that could be the issue
16:10:50 <haleyb> yes, that missing KillFilter could be an issue
16:11:12 <slaweq> so mlavalle will send a patch to add it and we will see if that helps
16:11:24 <haleyb> +1
16:12:12 <slaweq> #action mlavalle to check if adding KillFilter for neutron-keepalived-state-change will solve issues with L3 agent in dvr jobs
16:12:14 <bcafarel> and it worked before without the filter?
16:12:50 <slaweq> bcafarel: I think that possibly it wasn't killing the keepalived-state-change process but the agent was running fine
16:13:05 <slaweq> maybe the change py27->py36 for this job triggered this somehow
16:13:32 <slaweq> also mlavalle told me that it happens during router migration
16:13:52 <slaweq> from ha to dvr/legacy
16:14:01 <slaweq> and such a scenario isn't tested in other jobs
16:14:18 <slaweq> so it is possible that python 36 triggered that somehow
16:15:16 <slaweq> can we move on or do You have anything else to add here?
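For reference on the discussion above: oslo.rootwrap KillFilter entries live under the [Filters] section of files like etc/neutron/rootwrap.d/l3.filters, in the form "name: KillFilter, run-as user, command, allowed signals". A minimal sketch of what the missing entry could look like is below; the entry names and interpreter commands are illustrative guesses, not the actual patch mlavalle was going to send:

    # Allow the L3 agent to SIGTERM the neutron-keepalived-state-change
    # monitor (a Python process), assuming the monitor runs under
    # python/python3; exact names here are hypothetical.
    kill_keepalived_monitor_py: KillFilter, root, python, -15
    kill_keepalived_monitor_py3: KillFilter, root, python3, -15

Without a matching KillFilter, rootwrap rejects the 'kill -15 <pid>' call, which would explain the failure seen in the paste.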
16:15:54 <bcafarel> nothing, it was mostly personal curiosity :)
16:15:57 <bcafarel> thanks slaweq
16:16:11 <slaweq> bcafarel: sure :) I'm also wondering how it may be possible
16:17:16 <slaweq> ok, let's move on
16:17:19 <slaweq> next action was:
16:17:24 <slaweq> njohnston to work on grenade job transition
16:17:37 <slaweq> I think he didn't make any progress on this recently
16:18:54 <slaweq> next one was:
16:18:55 <bcafarel> I sent https://review.openstack.org/#/c/636356/ to see how it goes with a simple change in the meantime
16:18:56 <patchbot> patch 636356 - neutron - Switch grenade jobs to python3 - 1 patch set
16:19:00 <slaweq> slaweq to check if the new ovsdb monitor implementation can allow listening only for events from a specific bridge
16:19:35 <slaweq> bcafarel: thx, let's get back to it in the python3 section, ok?
16:20:01 <bcafarel> slaweq: oops sorry multitasking is bad for attention
16:20:08 <slaweq> bcafarel: no, it's fine :)
16:20:35 <slaweq> regarding the ovsdb monitor, I checked that SimpleInterfaceMonitor isn't moved to the new implementation yet so there was nothing to check for now.
16:20:44 <slaweq> ralonsoh is still working on it, right?
16:21:09 <ralonsoh> slaweq, yes, but I'm still facing the same functional tests problems
16:21:15 <ralonsoh> and I don't know how to solve them
16:21:23 <ralonsoh> I think this is a race condition
16:21:28 <ralonsoh> but I can't prove it
16:21:33 <slaweq> ralonsoh: do You have a link to the patch?
16:21:53 <ralonsoh> #link https://review.openstack.org/#/c/612400/
16:21:54 <patchbot> patch 612400 - neutron - [WIP] Add native OVSDB implementation for polling ... - 18 patch sets
16:23:05 <slaweq> a lot of events logged: http://logs.openstack.org/00/612400/18/check/neutron-functional-python27/6e7d69b/job-output.txt.gz#_2019-02-07_12_04_15_376869
16:23:51 <slaweq> is it normal?
16:24:22 <ralonsoh> no, just for testing (it's in the patch comment)
16:24:53 <slaweq> but apart from that "issue" I don't see any failed test in PS18 in the functional tests job
16:26:03 <ralonsoh> sorry, neutron-fullstack
16:26:15 <ralonsoh> test_l2_agent_restart(OVS,VLANs,openflow-cli)
16:26:18 <ralonsoh> always the same
16:27:00 <slaweq> ralonsoh: ok, I will take a look at the logs from this test later
16:27:05 <ralonsoh> thanks!
16:27:09 <slaweq> maybe I will find something :)
16:27:23 <slaweq> ok, let's move on
16:27:29 <slaweq> last action from last week was:
16:27:31 <slaweq> njohnston to take care of periodic UT jobs failures
16:27:47 <slaweq> I know that he fixed this issue as periodic jobs are fine now
16:27:57 <slaweq> but I don't have a link to the specific patch
16:29:31 <slaweq> ok, anything else You want to ask/add about actions from last week?
16:30:56 <slaweq> ok, let's move on then
16:30:58 <slaweq> #topic Python 3
16:31:05 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_ci_python3
16:31:45 <slaweq> FYI: after we merged the switch of some jobs to py3 versions I pushed a patch to update Grafana: https://review.openstack.org/636359
16:31:46 <patchbot> patch 636359 - openstack-infra/project-config - Add new python 3 Neutron jobs to grafana dashboard - 1 patch set
16:32:18 <slaweq> for now from gate/check queues we are only missing:
16:32:20 <slaweq> * neutron-tempest-dvr-ha-multinode-full - patch in progress https://review.openstack.org/633979
16:32:21 <patchbot> patch 633979 - neutron - Migrate neutron-tempest-dvr-ha-multinode-full job ... - 7 patch sets
16:32:22 <slaweq> * neutron-grenade-multinode - a patch with the new job is already proposed: https://review.openstack.org/#/c/622612/ but it is failing
16:32:22 <patchbot> patch 622612 - openstack-dev/grenade - Add grenade-multinode-py3 job - 3 patch sets
16:32:24 <slaweq> * neutron-grenade-dvr-multinode - should be done when the above job is done
16:33:22 <slaweq> bcafarel: so regarding Your patch, I think You should rebase/recheck this patch from njohnston https://review.openstack.org/#/c/622612/ and get it merged
16:33:23 <patchbot> patch 622612 - openstack-dev/grenade - Add grenade-multinode-py3 job - 3 patch sets
16:33:32 <slaweq> then we can switch to using this job in neutron
16:34:01 <slaweq> what do You think?
16:34:50 <bcafarel> having the job defined in grenade itself will be nice indeed
16:35:55 <bcafarel> I'll check once we have some initial results on https://review.openstack.org/#/c/636356/ (just enabling python 3 on our side)
16:35:56 <patchbot> patch 636356 - neutron - Switch grenade jobs to python3 - 1 patch set
16:36:05 <slaweq> ok
16:36:32 <slaweq> #action bcafarel to continue work on grenade jobs switch to python 3
16:36:36 <slaweq> thx bcafarel
16:36:38 <slaweq> :)
16:36:43 <bcafarel> :)
16:36:51 <slaweq> that's all from my side regarding python 3
16:36:58 <slaweq> do You want to add anything?
16:38:06 <bcafarel> the list is getting shorter and shorter, that's good :)
16:38:20 <slaweq> indeed :)
16:38:37 <slaweq> ok, let's move on
16:38:44 <slaweq> #topic Grafana
16:38:51 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:40:50 <slaweq> I don't see anything "very bad" in the dashboard
16:41:01 <slaweq> things are running as usual
16:41:15 <slaweq> do You see anything special that You want to discuss related to grafana?
16:42:46 <slaweq> ok, so I will take this as a no :)
16:42:51 <slaweq> next topic
16:42:55 <slaweq> #topic fullstack/functional
16:43:16 <slaweq> 2 things here
16:43:26 <slaweq> first, about https://review.openstack.org/#/c/629420/ -
16:43:27 <patchbot> patch 629420 - neutron - Revert "Mark mysql related functional tests as uns... - 3 patch sets
16:43:38 <slaweq> I was thinking about this a bit today
16:43:58 <slaweq> and I think that maybe I can add a new decorator, something like "skip_if_timeout"
16:44:18 <slaweq> it would be similar to unstable_test() but would skip only if a timeout exception is raised in the test
16:44:48 <slaweq> that way we can still track how many times this issue occurs and see other failures properly
16:45:01 <slaweq> what do You think about this idea?
16:45:28 <haleyb> i like it better than using unstable_test since it is only catching one condition
16:46:23 <slaweq> haleyb: thx, so I will go with it tomorrow :)
16:46:46 <bcafarel> trying to think of situations where timeout would be an actual failure (that we would then miss)
16:46:52 <bcafarel> but it sounds reasonable indeed
16:46:58 <slaweq> #action slaweq to propose patch with new decorator skip_if_timeout in functional tests
16:47:16 <hongbin> +1
16:47:19 <slaweq> bcafarel: yes, it may happen but it's still better than how it is now :)
16:47:42 <bcafarel> definitely!
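For the skip_if_timeout idea discussed above, a minimal sketch of such a decorator is shown below. It only assumes stdlib plus unittest's skipTest(); the way the timeout exception is detected (matching the class name) and the decorator's module placement are assumptions, not the patch slaweq planned to propose:

    import functools


    def skip_if_timeout(reason):
        """Skip the decorated test only when it fails with a timeout.

        Any other exception is re-raised, so real failures still show up
        and the skips remain countable in the test results.
        """
        def decorator(f):
            @functools.wraps(f)
            def wrapper(self, *args, **kwargs):
                try:
                    return f(self, *args, **kwargs)
                except Exception as e:
                    # 'TimeoutException' stands for whichever timeout error
                    # the functional tests actually hit (e.g. from fixtures
                    # or ovsdbapp); matching on the class name is just one
                    # possible way to detect it.
                    if e.__class__.__name__.endswith('TimeoutException'):
                        self.skipTest('Timeout hit, skipping: %s' % reason)
                    raise
            return wrapper
        return decorator

Usage would be the same as unstable_test(), e.g. decorating a DB migration functional test with @skip_if_timeout("migration sometimes times out on slow CI nodes"), so only the timeout case is skipped while other failures still fail the test.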
16:48:26 <slaweq> ok, and the second thing related to functional tests for today is the new (old) https://bugs.launchpad.net/neutron/+bug/1815142
16:48:27 <openstack> Launchpad bug 1815142 in neutron "ovsdbapp.exceptions.TimeoutException in functional tests" [High,Confirmed]
16:48:34 <slaweq> it hits us from time to time
16:49:00 <slaweq> TBH I suspect that it may be a similar issue to the one with those db tests and it's just a slow VM
16:49:18 <slaweq> but maybe we can ask otherwiseguy to look at it again
16:51:36 <slaweq> that's all from my side regarding fullstack/functional jobs
16:51:44 <slaweq> do You want to add anything else?
16:53:04 <slaweq> ok, so let's move on then
16:53:10 <slaweq> #topic Tempest/Scenario
16:53:25 <slaweq> we still have some random ssh failure issues in scenario jobs
16:53:39 <slaweq> but that is a "known" issue and I don't have any updates about that
16:54:05 <slaweq> today I found some new issue which happened at least 2 times in the linuxbridge job:
16:54:12 <slaweq> https://bugs.launchpad.net/neutron/+bug/1815585
16:54:13 <openstack> Launchpad bug 1815585 in neutron "Floating IP status failed to transition to DOWN in neutron-tempest-plugin-scenario-linuxbridge" [High,Confirmed]
16:54:21 <slaweq> are You aware of such a problem?
16:54:41 <haleyb> i am
16:55:19 <slaweq> do You know what the reason for it could be?
16:56:04 <haleyb> although my initial debugging did not find an answer. basically when the port is removed from the instance, we expect the floating IP status to change, but it didn't
16:56:29 <haleyb> at least that is what the test is doing
16:57:43 <slaweq> haleyb: and of course when doing it locally it works fine, right?
16:58:42 <haleyb> it seemed to
16:59:03 <slaweq> :/
16:59:18 <slaweq> ok, the bug is reported, for now it doesn't happen very often
16:59:25 <slaweq> maybe someone will find the culprit :)
16:59:38 <slaweq> we are running out of time
16:59:42 <slaweq> so thx for attending
16:59:47 <slaweq> and see You all next week :)
16:59:53 <slaweq> #endmeeting
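For the floating IP status discussion above (bug 1815585), the check the scenario test is effectively performing can be sketched as a small polling helper. This is not the neutron-tempest-plugin code; get_fip, the timeout values, and the dict shape it returns are assumptions for illustration:

    import time


    def wait_for_fip_status(get_fip, fip_id, expected_status='DOWN',
                            timeout=120, interval=5):
        """Poll a floating IP until it reaches the expected status.

        get_fip is any callable returning a dict with a 'status' key for
        the given floating IP id (e.g. a thin wrapper around the Neutron
        API). Raises RuntimeError if the status never transitions in time.
        """
        status = None
        deadline = time.time() + timeout
        while time.time() < deadline:
            status = get_fip(fip_id)['status']
            if status == expected_status:
                return
            time.sleep(interval)
        raise RuntimeError(
            'Floating IP %s did not reach status %s within %s seconds '
            '(last seen: %s)' % (fip_id, expected_status, timeout, status))

The failure in the bug corresponds to this kind of wait expiring after the port is removed from the instance: the API keeps reporting the old status instead of DOWN.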