15:00:23 #startmeeting neutron_dvr
15:00:25 Meeting started Wed Dec 2 15:00:23 2015 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:26 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:29 The meeting name has been set to 'neutron_dvr'
15:00:35 o/
15:00:51 #chair Swami
15:00:52 Current chairs: Swami haleyb
15:01:14 #topic Announcements
15:01:42 hi everyone, hope you had a good turkey day
15:01:50 * regXboi mutters finally
15:02:21 went fine
15:02:33 I think my only announcement is that there are some new issues that might be DVR-related
15:02:35 it was a very calm week before the storm.
15:02:50 do we know if M-1 got cut yet?
15:03:11 regXboi: I doubt it
15:03:40 regXboi: given the state of the gates..
15:03:49 obondarev: ack
15:03:52 regXboi: i don't see a mitaka tag
15:03:55 the gate seems to be unhappy
15:04:21 * regXboi resonates with the gate
15:04:40 #topic Bugs
15:05:00 as haleyb mentioned there are two new bugs filed against dvr
15:05:12 I tried to update the agenda with the new ones, looks like 4 of them
15:05:16 These are related to the functional test failures in the gate
15:05:39 #link https://bugs.launchpad.net/neutron/+bug/1521815
15:05:39 Launchpad bug 1521815 in neutron "DVR functional tests failing intermittently" [High,New]
15:06:22 I am still seeing that the functional tests are unstable. But it is not only the DVR tests that are failing; there are other tests as well.
15:06:54 I don't think we have root-caused the issue with these failures.
15:07:20 I believe the footprint is a missing router namespace
15:07:31 amuller mentioned yesterday that he was seeing some of the DVR functional tests taking a long time to complete.
15:08:11 haleyb: the missing router namespace seems to me like noise, because I have seen tests passing even with that message.
15:08:41 Well, I'd be bothered by the fact that restarted agent tests are taking over an hour each
15:08:55 even if they pass
15:09:13 see http://logs.openstack.org/18/248418/3/check/gate-neutron-dsvm-functional/8a6dfcf/console.html#_2015-12-01_22_38_48_050
15:09:22 regXboi: yes that is what amuller mentioned.
15:09:36 Swami: not quite
15:09:52 I'm saying that even if the restarted agent test passes, it is still taking over an hour
15:10:08 that link I gave shows a test failing after 4000+ seconds
15:10:21 but another restart test passing after 5000+ seconds
15:11:37 There was another bug related to FIP namespace cleanup in the functional tests.
15:12:08 #link https://bugs.launchpad.net/neutron/+bug/1521820
15:12:08 Launchpad bug 1521820 in neutron "Some DVR functional tests leak the FIP namespace" [Low,In progress] - Assigned to Assaf Muller (amuller)
15:12:25 amuller pushed a patch for this fix last night.
15:12:45 #link https://review.openstack.org/#/c/252139/
15:13:24 The next bug in the list is
15:13:28 #link https://bugs.launchpad.net/neutron/+bug/1521846
15:13:28 Launchpad bug 1521846 in neutron "Metering not configured for all-in-one DVR job, failing tests" [High,New]
15:14:03 Based on the information in the bug, it seems that metering is not configured for the single-node job.
15:14:24 regression?
15:14:29 Does anyone know what triggered this failure? It was passing before, so why did it suddenly fail?
15:14:43 a note in the bug mentions https://review.openstack.org/#/c/243949/
15:14:49 I'm thinking all of these are regressions
15:14:51 obondarev: might be a regression.
15:15:11 regXboi: Yes I agree
15:15:59 We still need to figure out which patch caused this regression.
15:16:38 I'm seeing if logstash can give some hints as to when this started
15:16:47 because the DVR jobs have also gone off to insanity
15:16:58 It kind of started yesterday.
15:17:28 that infra change merged yesterday morning
15:17:43 haleyb: which infra change?
15:17:55 https://review.openstack.org/#/c/243949/
15:18:16 so interestingly
15:18:43 it looks like the functional-py34 job is not showing the same delay signature on the failing tests
15:18:46 haleyb: thanks for the link
15:19:20 regXboi: is 'functional-py34' all passing?
15:19:33 regXboi: functional-py34 has been broken for a long time I think
15:19:41 Swami: no, but when it fails, the test fails in less than 10 seconds
15:19:50 as opposed to taking over an hour
15:20:17 I'm looking specifically at failures with test_dvr_router_lifecycle_ha_with_snat_with_fips
15:20:59 the long-signature failures started at 18:31:24 UTC yesterday
15:21:47 regXboi: thanks for the information.
15:22:07 I think we still need to find out which patch caused this regression.
15:22:29 we had a failure at 16:51:48 but it showed a short timestamp
15:22:36 Any other thoughts or information to share related to this bug?
15:22:46 so now we need to see what merged between those two timestamps
15:23:20 regXboi: the number of patches that got merged into neutron between those times is pretty small.
15:23:41 regXboi: but another sub-project might have triggered it, like 'infra' etc.
15:23:46 regXboi: It would have to have merged a while before the failure, right?
15:23:48 Swami: if nothing there broke it, then we go look at other things
15:24:08 carl_baldwin: I'm not sure I agree
15:24:17 but I do have a test
15:24:31 the failure at 16:51:48 was my patch
15:24:38 let me rebase it to master and retest locally
15:24:53 regXboi: what was that patch?
15:24:58 regXboi: It has to merge (successful test run), then a new job has to be started that includes it. Then, that job needs time to fail.
15:25:13 carl_baldwin - ok, I see your point
15:25:28 Swami: the patch that had a short failure was 251502 (rev 2)
15:25:49 Looking at the gate code, that patch I mentioned will have set OVERRIDE_ENABLED_SERVICES now, which will skip some of the DVR setup code in devstack-vm-gate.sh
15:26:42 https://github.com/openstack-infra/devstack-gate/blob/master/devstack-vm-gate.sh#L190
15:26:45 haleyb: thanks for the information.
15:27:17 Down on L210 there's additional neutron setup, which might be skipped now?
15:27:33 can we revert that infra patch and see if the symptom goes away?
15:28:01 or should we fix the infra again to address the "OVERRIDE_ENABLED_SERVICES"
15:28:46 Swami, haleyb: how about we run a test where those additive services aren't there?
15:28:51 locally I mean
15:28:56 I have no idea what is right, seems a tangled mess
15:29:26 regXboi: can you run it locally and confirm?
15:29:31 Swami: am trying now
15:29:36 regXboi: thanks
15:29:58 Swami: A revert in gerrit won't do. It is in the project-config repo and doesn't run our tests.
15:30:00 is yamamoto online?
15:30:28 carl_baldwin: can we add a noop patch in neutron depending on the revert?
15:30:30 carl_baldwin: that seems broken in itself
15:30:32 haleyb: I don't see him
15:30:50 obondarev: That is an idea.
15:30:58 In theory, that should work.
15:31:29 regXboi: Do you have a logstash query for these failures?
15:31:59 carl_baldwin: hold on a sec
15:32:12 http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_lifecycle_ha_with_snat_with_fips%5C%22%20AND%20message:%5C%22FAILED%5C%22%20AND%20build_name:%20%5C%22gate-neutron-dsvm-functional%5C%22
15:32:40 that had 16 failures in the last 7 days, most starting at the timestamp I gave earlier
15:33:26 * regXboi watches tox run
15:34:21 ok, I just restacked a node to not run q-agt,q-l3,etc
15:34:30 and the functional tests passed locally
15:34:39 so I don't think that's the culprit
15:34:48 note: I hadn't rebased to master
15:35:21 oh crap
15:39:36 hi
15:40:36 regXboi: I don't see any rhyme or reason in the failures in that logstash yet.
15:41:21 is the channel back to normal now?
15:42:40 did we lose haleyb and the others?
15:46:52 #endmeeting