15:00:23 <haleyb> #startmeeting neutron_dvr
15:00:25 <openstack> Meeting started Wed Dec  2 15:00:23 2015 UTC and is due to finish in 60 minutes.  The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:26 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:29 <openstack> The meeting name has been set to 'neutron_dvr'
15:00:35 <carl_baldwin> o/
15:00:51 <haleyb> #chair Swami
15:00:52 <openstack> Current chairs: Swami haleyb
15:01:14 <haleyb> #topic Announcements
15:01:42 <haleyb> hi everyone, hope you had a good turkey day
15:01:50 * regXboi mutters finally
15:02:21 <Swami> went fine
15:02:33 <haleyb> i think my only announcement is that there are some new issues that might be DVR-related
15:02:35 <Swami> it was a very calm week before the storm.
15:02:50 <regXboi> do we know if M-1 got cut yet?
15:03:11 <obondarev> regXboi: I doubt it
15:03:40 <obondarev> regXboi: given the state of the gates..
15:03:49 <regXboi> obondarev: ack
15:03:52 <haleyb> regXboi: i don't see a mitaka tag
15:03:55 <Swami> the gate seems to be unhappy
15:04:21 * regXboi resonates with the gate
15:04:40 <haleyb> #topic Bugs
15:05:00 <Swami> as haleyb mentioned there are two new bugs filed against dvr
15:05:12 <haleyb> I tried to update the agenda with the new ones, looks like 4 of them
15:05:16 <Swami> These are related to the functional test failures in the gate
15:05:39 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1521815
15:05:39 <openstack> Launchpad bug 1521815 in neutron "DVR functional tests failing intermittently" [High,New]
15:06:22 <Swami> I am still seeing that the functional tests are unstable. But it is not only the dvr tests that are failing; other tests are failing as well.
15:06:54 <Swami> I don't think we have root-caused these failures yet.
15:07:20 <haleyb> I believe the footprint is a missing router namespace
15:07:31 <Swami> amuller mentioned yesterday that he was seeing some of the functional tests for dvr taking a long time to complete.
15:08:11 <Swami> haleyb: the missing router namespace seems to me like noise, because I have seen tests passing even with that message.
15:08:41 <regXboi> Well, I'd be bothered by the fact that restarted agent tests are taking over an hour each
15:08:55 <regXboi> even if they pass
15:09:13 <regXboi> see http://logs.openstack.org/18/248418/3/check/gate-neutron-dsvm-functional/8a6dfcf/console.html#_2015-12-01_22_38_48_050
15:09:22 <Swami> regXboi: yes that is what amuller mentioned.
15:09:36 <regXboi> Swami: not quite
15:09:52 <regXboi> I'm saying that even if the restarted agent test passes, it is still taking over an hour
15:10:08 <regXboi> that link I gave shows a test fail after 4000+ seconds
15:10:21 <regXboi> but another restart test passes after 5000+ seconds
15:11:37 <Swami> There was another bug related to fip namespace cleanup in the functional tests.
15:12:08 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1521820
15:12:08 <openstack> Launchpad bug 1521820 in neutron "Some DVR functional tests leak the FIP namespace" [Low,In progress] - Assigned to Assaf Muller (amuller)
15:12:25 <Swami> There was a patch that amuller pushed in for this fix last night.
15:12:45 <Swami> #link https://review.openstack.org/#/c/252139/
15:13:24 <Swami> The next bug in the list is
15:13:28 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1521846
15:13:28 <openstack> Launchpad bug 1521846 in neutron "Metering not configured for all-in-one DVR job, failing tests" [High,New]
15:14:03 <Swami> Based on the information in the bug it seems that metering is not configured for the single-node job.
15:14:24 <obondarev> regression?
15:14:29 <Swami> Does anyone know what triggered this failure? It was passing before, so why did it suddenly start failing?
15:14:43 <haleyb> note in bug mentions https://review.openstack.org/#/c/243949/
15:14:49 <regXboi> I'm thinking all of these are regressions
15:14:51 <Swami> obondarev: might be regression.
15:15:11 <Swami> regXboi: Yes I agree
15:15:59 <Swami> We still need to figure out the patch that caused this regression.
15:16:38 <regXboi> I'm seeing if logstash can give some hints as to when this started
15:16:47 <regXboi> because the DVR jobs have also gone off to insanity
15:16:58 <Swami> It kind of started yesterday.
15:17:28 <haleyb> that infra change merged yesterday morning
15:17:43 <Swami> haleyb: which infra change?
15:17:55 <haleyb> https://review.openstack.org/#/c/243949/
15:18:16 <regXboi> so interestingly
15:18:43 <regXboi> it looks like the functional-py34 job is not showing the same delay signature on the failing tests
15:18:46 <Swami> haleyb: thanks for the link
15:19:20 <Swami> regXboi: is 'functional-py34' passing everything?
15:19:33 <obondarev> regXboi: functional-py34 has been broken for a long time I think
15:19:41 <regXboi> Swami: no, but when it fails, the test fails in less than 10 seconds
15:19:50 <regXboi> as opposed to taking over an hour
15:20:17 <regXboi> I'm looking specifically at failures with test_dvr_router_lifecycle_ha_with_snat_with_fips
15:20:59 <regXboi> the long signature failures started at 18:31:24 UTC yesterday
15:21:47 <Swami> regXboi: thanks for the information.
15:22:07 <Swami> I think we need to still find out the patch that caused this regression.
15:22:29 <regXboi> we had a failure at 16:51:48 but it showed a short time stamp
15:22:36 <Swami> Any other thoughts or information to share related to this bug?
15:22:46 <regXboi> so now we need to see what merged between those two stamps
15:23:20 <Swami> regXboi: the number of patches that got merged into neutron between those times is pretty small.
15:23:41 <Swami> regXboi: but there might be other sub-projects that could have triggered it, like 'infra' etc.
15:23:46 <carl_baldwin> regXboi: It would have to have merged a while before the failure, right?
15:23:48 <regXboi> Swami: if nothing there broke it, then we go look at other things
15:24:08 <regXboi> carl_baldwin: I'm not sure I agree
15:24:17 <regXboi> but I do have a test
15:24:31 <regXboi> the failure at 16:51:48 was my patch
15:24:38 <regXboi> let me rebase it to master and retest locally
15:24:53 <Swami> regXboi: what was that patch
15:24:58 <carl_baldwin> regXboi: It has to merge (successful test run) then a new job has to be started that includes it.  Then, that job needs time to fail.
15:25:13 <regXboi> carl_baldwin - ok, I see your point
15:25:28 <regXboi> Swami: the patch that had a short failure was 251502 (rev 2)
15:25:49 <haleyb> Looking at the gate code, that patch I mentioned will have set OVERRIDE_ENABLED_SERVICES now, which will skip some of the DVR setup code in devstack-vm-gate.sh
15:26:42 <haleyb> https://github.com/openstack-infra/devstack-gate/blob/master/devstack-vm-gate.sh#L190
15:26:45 <Swami> haleyb: thanks for the information.
15:27:17 <haleyb> Down on L210 there's additional neutron setup, which might be skipped now?
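For reference, a rough sketch (not the actual devstack-gate code) of the pattern haleyb is describing: when OVERRIDE_ENABLED_SERVICES is set, the service list is replaced wholesale, so the additions the DVR job would normally make further down (such as the metering agent) never happen.

    # Illustrative only -- simplified from the logic in devstack-vm-gate.sh
    if [[ -n "$OVERRIDE_ENABLED_SERVICES" ]]; then
        # wholesale replacement: whatever the job passes in wins
        export ENABLED_SERVICES=$OVERRIDE_ENABLED_SERVICES
    else
        # later per-job setup appends extra services, e.g. metering for DVR
        export ENABLED_SERVICES=$ENABLED_SERVICES,q-metering
    fi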
15:27:33 <Swami> can we revert that infra patch and see if the symptom goes away
15:28:01 <Swami> or should we fix the infra again to address the "OVERRIDE_ENABLED_SERVICES"
15:28:46 <regXboi> Swami, haleyb: how about we run a test where those additive services aren't there?
15:28:51 <regXboi> locally I mean
15:28:56 <haleyb> I have no idea what is right, seems a tangled mess
15:29:26 <Swami> regXboi: can you run it locally and confirm
15:29:31 <regXboi> Swami: am trying now
15:29:36 <Swami> regXboi: thanks
15:29:58 <carl_baldwin> Swami: A revert in gerrit won't do.  It is in the project-config repo and doesn't run our tests.
15:30:00 <haleyb> is yamamoto online?
15:30:28 <obondarev> carl_baldwin: can we add a noop patch in neutron depending on the revert?
15:30:30 <haleyb> carl_baldwin: that seems broken in itself
15:30:32 <Swami> haleyb: I don't see him
15:30:50 <carl_baldwin> obondarev: That is an idea.
15:30:58 <carl_baldwin> In theory, that should work.
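A minimal sketch of obondarev's suggestion, assuming Zuul's cross-repo Depends-On is honored for this case; the Change-Id below is a placeholder, not a real review.

    # Hypothetical no-op neutron change that pulls in the project-config revert
    git checkout -b dvr-gate-check origin/master
    git commit --allow-empty -m "DNM: no-op change to exercise the DVR functional job

    Depends-On: I0123456789abcdef0123456789abcdef01234567"
    git review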
15:31:29 <carl_baldwin> regXboi: Do you have a logstash query for these failures?
15:31:59 <regXboi> carl_baldwin: hold on a sec
15:32:12 <regXboi> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_lifecycle_ha_with_snat_with_fips%5C%22%20AND%20message:%5C%22FAILED%5C%22%20AND%20build_name:%20%5C%22gate-neutron-dsvm-functional%5C%22
15:32:40 <regXboi> that had 16 failures in the last 7 days, most starting at the time stamp I gave earlier
15:33:26 * regXboi watches tox run
15:34:21 <regXboi> ok, I just restacked a node to not run q-agt,q-l3,etc
15:34:30 <regXboi> and the functional tests passed locally
15:34:39 <regXboi> so I don't think that's the culprit
15:34:48 <regXboi> note: I hadn't rebased to master
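A minimal local.conf sketch of the restack regXboi describes, assuming a standard devstack setup; only the services he names are shown, and the tox target is the usual neutron functional env.

    [[local|localrc]]
    # skip the agent services before re-running ./stack.sh
    disable_service q-agt
    disable_service q-l3

    # then re-run the functional tests locally, e.g.:
    #   tox -e dsvm-functional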
15:35:21 <regXboi> oh crap
15:39:36 <Swami> hi
15:40:36 <carl_baldwin> regXboi: I don't see any rhyme or reason in the failures in that logstash yet.
15:41:21 <Swami> is the channel back to normal now
15:42:40 <Swami> did we lose haleyb and the others
15:46:52 <Swami> #endmeeting