15:00:23 <haleyb> #startmeeting neutron_dvr
15:00:25 <openstack> Meeting started Wed Dec 2 15:00:23 2015 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:26 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:29 <openstack> The meeting name has been set to 'neutron_dvr'
15:00:35 <carl_baldwin> o/
15:00:51 <haleyb> #chair Swami
15:00:52 <openstack> Current chairs: Swami haleyb
15:01:14 <haleyb> #topic Announcements
15:01:42 <haleyb> hi everyone, hope you had a good turkey day
15:01:50 * regXboi mutters finally
15:02:21 <Swami> went fine
15:02:33 <haleyb> i think my only announcement is that there are some new issues that might be DVR-related
15:02:35 <Swami> it was a very calm week before the storm.
15:02:50 <regXboi> do we know if M-1 got cut yet?
15:03:11 <obondarev> regXboi: I doubt it
15:03:40 <obondarev> regXboi: given the state of the gates..
15:03:49 <regXboi> obondarev: ack
15:03:52 <haleyb> regXboi: i don't see a mitaka tag
15:03:55 <Swami> the gate seems to be unhappy
15:04:21 * regXboi resonates with the gate
15:04:40 <haleyb> #topic Bugs
15:05:00 <Swami> as haleyb mentioned, there are two new bugs filed against dvr
15:05:12 <haleyb> I tried to update the agenda with the new ones, looks like 4 of them
15:05:16 <Swami> These are related to the functional test failures in the gate
15:05:39 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1521815
15:05:39 <openstack> Launchpad bug 1521815 in neutron "DVR functional tests failing intermittently" [High,New]
15:06:22 <Swami> I am still seeing that the functional tests are unstable. It is not only the dvr tests that are failing; there are other tests as well.
15:06:54 <Swami> I don't think we have root-caused these failures.
15:07:20 <haleyb> I believe the footprint is a missing router namespace
15:07:31 <Swami> amuller mentioned yesterday that some of the functional tests for dvr were taking a long time to complete.
15:08:11 <Swami> haleyb: the missing router namespace seems like noise to me, because I have seen tests passing even with that message.
15:08:41 <regXboi> Well, I'd be bothered by the fact that restarted agent tests are taking over an hour each
15:08:55 <regXboi> even if they pass
15:09:13 <regXboi> see http://logs.openstack.org/18/248418/3/check/gate-neutron-dsvm-functional/8a6dfcf/console.html#_2015-12-01_22_38_48_050
15:09:22 <Swami> regXboi: yes, that is what amuller mentioned.
15:09:36 <regXboi> Swami: not quite
15:09:52 <regXboi> I'm saying that even if the restarted agent test passes, it is still taking over an hour
15:10:08 <regXboi> that link I gave shows a test fail after 4000+ seconds
15:10:21 <regXboi> but another restart test passes after 5000+ seconds
15:11:37 <Swami> There was another bug related to fip namespace cleanup in the functional tests.
15:12:08 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1521820
15:12:08 <openstack> Launchpad bug 1521820 in neutron "Some DVR functional tests leak the FIP namespace" [Low,In progress] - Assigned to Assaf Muller (amuller)
15:12:25 <Swami> There was a patch that amuller pushed for this fix last night.
15:12:45 <Swami> #link https://review.openstack.org/#/c/252139/
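
A quick way to spot the leak described in bug 1521820 on a host that has just run the functional tests; a minimal sketch, assuming the usual DVR naming where floating-IP namespaces are called "fip-<external-network-id>":

    # list any floating-IP namespaces left behind after the test run
    ip netns list | grep '^fip-'

No output once the tests have finished means no FIP namespace was leaked.
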
15:13:24 <Swami> The next bug in the list is
15:13:28 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1521846
15:13:28 <openstack> Launchpad bug 1521846 in neutron "Metering not configured for all-in-one DVR job, failing tests" [High,New]
15:14:03 <Swami> Based on the information in the bug, it seems that metering is not configured for the single-node job.
15:14:24 <obondarev> regression?
15:14:29 <Swami> Does anyone know what triggered this failure? It was passing before, so why did it suddenly start failing?
15:14:43 <haleyb> a note in the bug mentions https://review.openstack.org/#/c/243949/
15:14:49 <regXboi> I'm thinking all of these are regressions
15:14:51 <Swami> obondarev: might be a regression.
15:15:11 <Swami> regXboi: Yes, I agree
15:15:59 <Swami> We still need to figure out the patch that caused this regression.
15:16:38 <regXboi> I'm seeing if logstash can give some hints as to when this started
15:16:47 <regXboi> because the DVR jobs have also gone off to insanity
15:16:58 <Swami> It kind of started yesterday.
15:17:28 <haleyb> that infra change merged yesterday morning
15:17:43 <Swami> haleyb: which infra change?
15:17:55 <haleyb> https://review.openstack.org/#/c/243949/
15:18:16 <regXboi> so interestingly
15:18:43 <regXboi> it looks like the functional-py34 job is not showing the same delay signature on the failing tests
15:18:46 <Swami> haleyb: thanks for the link
15:19:20 <Swami> regXboi: is 'functional-py34' all passing?
15:19:33 <obondarev> regXboi: functional-py34 has been broken for a long time I think
15:19:41 <regXboi> Swami: no, but when it fails, the test fails in less than 10 seconds
15:19:50 <regXboi> as opposed to taking over an hour
15:20:17 <regXboi> I'm looking specifically at failures with test_dvr_router_lifecycle_ha_with_snat_with_fips
15:20:59 <regXboi> the long-signature failures started at 18:31:24 UTC yesterday
15:21:47 <Swami> regXboi: thanks for the information.
15:22:07 <Swami> I think we still need to find the patch that caused this regression.
15:22:29 <regXboi> we had a failure at 16:51:48, but it showed a short time stamp
15:22:36 <Swami> Any other thoughts or information to share related to this bug?
15:22:46 <regXboi> so now we need to see what merged between those two stamps
15:23:20 <Swami> regXboi: the number of patches that merged into neutron between those times is pretty small.
15:23:41 <Swami> regXboi: but another sub-project, like 'infra' etc., might have triggered it
15:23:46 <carl_baldwin> regXboi: It would have to have merged a while before the failure, right?
15:23:48 <regXboi> Swami: if nothing there broke it, then we go look at other things
15:24:08 <regXboi> carl_baldwin: I'm not sure I agree
15:24:17 <regXboi> but I do have a test
15:24:31 <regXboi> the failure at 16:51:48 was my patch
15:24:38 <regXboi> let me rebase it to master and retest locally
15:24:53 <Swami> regXboi: what was that patch?
15:24:58 <carl_baldwin> regXboi: It has to merge (successful test run), then a new job has to be started that includes it. Then, that job needs time to fail.
15:25:13 <regXboi> carl_baldwin: ok, I see your point
15:25:28 <regXboi> Swami: the patch that had a short failure was 251502 (rev 2)
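
One way to narrow the suspect window discussed above is to list what merged into neutron between the short failure at 16:51:48 and the first long-signature failure at 18:31:24; a sketch, assuming a local clone with an up-to-date origin/master (the same command can be pointed at other repos, such as project-config, if nothing in neutron looks responsible):

    # merge commits that landed in the suspect window on 2015-12-01 (UTC)
    git log --merges --oneline \
        --since="2015-12-01 16:51 UTC" --until="2015-12-01 18:32 UTC" origin/master
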
15:25:49 <haleyb> Looking at the gate code, that patch I mentioned will have set OVERRIDE_ENABLED_SERVICES now, which will skip some of the DVR setup code in devstack-vm-gate.sh
15:26:42 <haleyb> https://github.com/openstack-infra/devstack-gate/blob/master/devstack-vm-gate.sh#L190
15:26:45 <Swami> haleyb: thanks for the information.
15:27:17 <haleyb> Down on L210 there's additional neutron setup, which might be skipped now?
15:27:33 <Swami> can we revert that infra patch and see if the symptom goes away?
15:28:01 <Swami> or should we fix the infra again to address the "OVERRIDE_ENABLED_SERVICES"?
15:28:46 <regXboi> Swami, haleyb: how about we run a test where those additive services aren't there?
15:28:51 <regXboi> locally, I mean
15:28:56 <haleyb> I have no idea what is right, seems a tangled mess
15:29:26 <Swami> regXboi: can you run it locally and confirm?
15:29:31 <regXboi> Swami: am trying now
15:29:36 <Swami> regXboi: thanks
15:29:58 <carl_baldwin> Swami: A revert in gerrit won't do. It is in the project-config repo and doesn't run our tests.
15:30:00 <haleyb> is yamamoto online?
15:30:28 <obondarev> carl_baldwin: can we add a noop patch in neutron depending on the revert?
15:30:30 <haleyb> carl_baldwin: that seems broken in itself
15:30:32 <Swami> haleyb: I don't see him
15:30:50 <carl_baldwin> obondarev: That is an idea.
15:30:58 <carl_baldwin> In theory, that should work.
15:31:29 <carl_baldwin> regXboi: Do you have a logstash query for these failures?
15:31:59 <regXboi> carl_baldwin: hold on a sec
15:32:12 <regXboi> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_lifecycle_ha_with_snat_with_fips%5C%22%20AND%20message:%5C%22FAILED%5C%22%20AND%20build_name:%20%5C%22gate-neutron-dsvm-functional%5C%22
15:32:40 <regXboi> that had 16 failures in the last 7 days, most starting at the time stamp I gave earlier
15:33:26 * regXboi watches tox run
15:34:21 <regXboi> ok, I just restacked a node to not run q-agt, q-l3, etc.
15:34:30 <regXboi> and the functional tests passed locally
15:34:39 <regXboi> so I don't think that's the culprit
15:34:48 <regXboi> note: I hadn't rebased to master
15:35:21 <regXboi> oh crap
15:39:36 <Swami> hi
15:40:36 <carl_baldwin> regXboi: I don't see any rhyme or reason in the failures in that logstash yet.
15:41:21 <Swami> is the channel back to normal now?
15:42:40 <Swami> did we lose haleyb and others?
15:46:52 <Swami> #endmeeting
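
For anyone repeating regXboi's local check, the DVR L3 agent functional tests can be run directly on a devstack node; a sketch, assuming neutron's dsvm-functional tox environment from this cycle and the test module path seen in the logstash query above:

    # run only the DVR router functional tests locally
    tox -e dsvm-functional -- neutron.tests.functional.agent.l3.test_dvr_router
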