16:00:27 <ihrachys> #startmeeting neutron_ci
16:00:28 <openstack> Meeting started Tue Aug 22 16:00:27 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:29 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:32 <openstack> The meeting name has been set to 'neutron_ci'
16:00:42 <ihrachys> jlibosva, o/
16:00:43 <jlibosva> o/
16:00:55 <ihrachys> #topic Actions from prev week
16:01:06 <ihrachys> "jlibosva to look through uncategorized/latest gate tempest failures (15% failure rate atm)"
16:01:25 * jlibosva slacked, didn't happen
16:01:44 <ihrachys> ok, we will revisit whether it's still needed in Grafana section
16:01:54 <ihrachys> "ihrachys to capture os.kill calls in func tests and see if any of those kill test threads"
16:02:19 <ihrachys> I haven't done THAT but since last kernel update we don't seem to see the failure
16:02:29 <ihrachys> jlibosva, right?
16:02:45 <jlibosva> right, and also at the end of last meeting we confirmed it's the kernel that kills the process
16:02:47 <ihrachys> I haven't seen it personally for a week + grafana board is very clean
16:03:00 <jlibosva> I haven't checked via logstash
16:03:22 <jlibosva> but I'm able to reproduce that failure on my ubuntu box with the -83 kernel
16:03:36 <ihrachys> -83, is it the old or new?
16:03:42 <jlibosva> I can update and confirm it's not reproducible
16:03:46 <jlibosva> ah, sorry. 83 is the old one
16:03:48 <jlibosva> I think :)
16:03:51 <ihrachys> ok
16:04:12 <ihrachys> let's close https://bugs.launchpad.net/neutron/+bug/1707933 if/when it's confirmed (I personally feel ok doing it without additional step, but you decide)
16:04:14 <openstack> Launchpad bug 1707933 in neutron "functional tests timeout after a test worker killed" [Critical,Confirmed] - Assigned to Jakub Libosvar (libosvar)
16:04:31 <ihrachys> "jlibosva to tweak gate not to create default subnetpool and enable test_convert_default_subnetpool_to_non_default"
16:04:48 <jlibosva> didn't do either :(
16:05:13 <ihrachys> #action jlibosva to tweak gate not to create default subnetpool and enable test_convert_default_subnetpool_to_non_default
16:05:33 <ihrachys> #topic Grafana
16:05:38 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:07:03 <ihrachys> of gating jobs, the only one that stands out is gate-grenade-dsvm-neutron-multinode-ubuntu-xenial
16:07:14 <ihrachys> not a catastrophe, ~10%
16:07:32 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=6&fullscreen
16:08:35 <ihrachys> apart from that, usual check queue violators - fullstack and scenarios - are at 100%
16:08:42 <ihrachys> for fullstack, I still suck
16:09:00 <ihrachys> for scenarios, I believe jlibosva tried to gather troops to triage/fix remaining known failures
16:09:13 <ihrachys> jlibosva, was there success in that?
16:09:47 <jlibosva> Anil is looking at the router migration tests - but no outcome yet
16:10:42 <ihrachys> I also see dvr-ha job failing at 100%. I don't think that was the case a while ago.
16:11:48 <ihrachys> we may need Brian back to have a closer look at it
16:12:05 <ihrachys> he was working on making it voting
16:12:11 <anilvenkata> jlibosva, ihrachys migration tests are failing intermittently
16:12:45 <anilvenkata> jlibosva, ihrachys if they were failing consistently, it would be easier to fix
16:12:46 <jlibosva> anilvenkata: are you able to reproduce it?
16:12:49 <ihrachys> anilvenkata, ok. any ideas how we can make them not fail? what's missing to understand what happens there?
16:13:12 <anilvenkata> maybe I need to add more checks before the test tries to ssh
16:13:31 <anilvenkata> checks like HA port is active, etc
16:14:26 <anilvenkata> that will make sure the dataplane path is properly configured before the test tries to ssh
16:14:43 <ihrachys> yeah. another thing to look at would be, whether any helpful info is missing in agent logs.
16:15:17 <anilvenkata> I will also look at that
16:15:20 <anilvenkata> thanks Ihar
16:15:30 <ihrachys> #topic anilvenkata to add more control plane checks for migrated router ports tests before ssh'ing
16:15:35 <ihrachys> oops
16:15:36 <ihrachys> #undo
16:15:36 <openstack> Removing item from minutes: #topic anilvenkata to add more control plane checks for migrated router ports tests before ssh'ing
16:15:40 <ihrachys> #action anilvenkata to add more control plane checks for migrated router ports tests before ssh'ing
16:15:57 <anilvenkata> :)
16:17:04 <ihrachys> I also noticed today that we still keep linuxbridge grenade job in dashboard even though it's experimental now
16:17:09 <ihrachys> patch to remove it from there: https://review.openstack.org/#/c/496287/
16:17:36 <ihrachys> jlibosva, back to the earlier action item you had, it doesn't seem to be a pressing need right now to classify tempest failures.
16:18:06 <jlibosva> okies
16:18:10 <ihrachys> #topic Bugs
16:18:20 <ihrachys> I started cleaning up the list: https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure
16:18:35 <ihrachys> closed a bunch of those that no longer reproduce / no logs
16:18:41 <ihrachys> we can reopen if they happen again
16:19:17 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1687709
16:19:19 <openstack> Launchpad bug 1687709 in neutron "fullstack: ovs-agents remove trunk bridges that don't belong to them" [High,Confirmed] - Assigned to Jakub Libosvar (libosvar)
16:19:29 <ihrachys> jlibosva, is it correct that ovs isolation work should help with that?
16:19:37 <jlibosva> I'm gonna close the functional killer bug, no hits in last week
16:19:50 <ihrachys> jlibosva, yep, go for it
16:19:51 <jlibosva> ihrachys: yes, that's still something I'd like to implement but can't find spare time
16:20:06 <ihrachys> jlibosva, is everything ready on ovsdbapp side for that?
16:20:30 <jlibosva> my plan is not to use the sandbox ovs, so ovsdbapp is no longer a requirement
16:20:46 <jlibosva> the isolation will use normal ovsdb-server process running in namespace instead
16:21:04 <jlibosva> as we need actual access to the kernel datapath, which the sandbox mocks
16:21:17 <ihrachys> oh ok
16:22:06 <ihrachys> I haven't walked through all bugs yet; I am going to continue cleaning up the gate-failure list this week
16:22:15 <ihrachys> #action ihrachys to complete gate-failure cleanup
16:23:13 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1712278
16:23:14 <openstack> Launchpad bug 1712278 in neutron "Default qos policy doesn't work when creating network" [High,In progress] - Assigned to Hirofumi Ichihara (ichihara-hirofumi)
16:23:22 <ihrachys> we seem to have a patch here: https://review.openstack.org/#/c/496139/
16:23:54 <ihrachys> I am not sure how it happened. don't we have test coverage for that?
16:24:55 <jlibosva> on higher level, maybe tempest or fullstack - but both are non-voting
16:25:16 <ihrachys> oh right.
16:25:40 <jlibosva> hmm, actually api tests should be able to catch that
16:26:37 <ihrachys> for some reason, logstash shows osc functional test failures only
16:26:42 <ihrachys> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22AttributeError%3A%20'QosPolicyDefault'%20object%20has%20no%20attribute%20'translate'%5C%22
16:27:30 <jlibosva> and api tests don't test default qos policy
16:27:37 <jlibosva> only fullstack
16:28:01 <ihrachys> I don't see fullstack hits in logstash
16:28:41 <jlibosva> do we collect server logs for fullstack?
16:29:07 <ihrachys> oh; we probably collect but don't index
16:29:15 <ihrachys> we index test runner logs only
16:29:26 <ihrachys> that explains then
16:29:48 <ihrachys> would probably make sense to ask to add api tests to the fix
16:30:17 <ihrachys> oh you did already
16:30:19 <ihrachys> ok
16:30:26 <jlibosva> yeah :)
16:30:43 <jlibosva> also looking at the post-gate hook, if I understand the find correctly, we should index also the server and agent logs
16:30:45 <jlibosva> for fullstack
16:32:11 <ihrachys> hm, right. so which test do you think covers the functionality?
16:32:14 <ihrachys> fullstack
16:32:32 <jlibosva> yep, so maybe it doesn't hit that specific
16:32:38 <ihrachys> oh I see TestQoSPolicyIsDefault
16:32:49 <jlibosva> I checked the index file and server and agents are included
16:32:53 <jlibosva> ihrachys++ :)
16:33:31 <jlibosva> so we probably don't create a network after we have the default policy
16:33:33 <ihrachys> I don't think we create a network using default policy in any of those test cases
16:33:36 <ihrachys> right
16:34:06 <ihrachys> ok, that makes sense
16:34:56 <ihrachys> #topic Other patches
16:35:06 <ihrachys> jlibosva, anything of interest not mentioned already?
16:35:31 <jlibosva> just that if you look at functional gate failures in the last 7 days, it draws a batman :)
16:35:47 <jlibosva> nothing else from me
16:35:51 <ihrachys> :))
16:36:39 <ihrachys> ok, I will note that I noticed some time ago that grenade failures don't always trigger corresponding elastic patterns, and I think I came up with the fix here: https://review.openstack.org/#/c/493987/
16:37:01 <ihrachys> elastic bot doesn't wait for all service logs to upload if it's grenade
16:37:26 <ihrachys> I have nothing else
16:37:27 <jlibosva> oh, one more thing I have
16:37:30 <ihrachys> shoot
16:37:53 <jlibosva> I noticed that bot now reports elastic rechecks at the upstream channel :) it was discussed in this meeting a while ago
16:38:20 <ihrachys> yeah. it doesn't seem very consistent though, I failed to grasp when it does and when it doesn't
16:38:37 <ihrachys> (same feeling I have for gerrit comments from the bot)
16:39:04 <ihrachys> ok, we can close the session I guess
16:39:06 <ihrachys> thanks jlibosva
16:39:07 <jlibosva> yep
16:39:09 <ihrachys> #endmeeting