16:00:34 #startmeeting neutron_ci
16:00:35 Meeting started Tue Aug 15 16:00:34 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:36 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:38 The meeting name has been set to 'neutron_ci'
16:00:39 jlibosva, haleyb o/
16:00:41 o/
16:00:41 o/
16:01:18 #topic Actions from prev week
16:01:25 haleyb to reach out to all affected parties, and FBI, to get multinode grenade by default
16:01:59 https://review.openstack.org/#/c/483600
16:02:09 I see that we are going to wait for master to open
16:02:37 also waiting for https://review.openstack.org/#/c/488381/ in devstack before making the multinode flavour voting
16:03:10 next was "haleyb to look at why dvr grenade flavour surged to 30% failure rate compared to 0% for other grenades"
16:03:21 let's discuss that in the grafana section
16:03:25 these are all the action items
16:03:29 #topic Grafana
16:03:30 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:04:08 how about "ihrachys to propose a removal for linuxbridge grenade multinode job"? :)
16:04:43 eh right!
16:04:48 I see the line is gone, so was it done?
16:04:52 manjeet was going to handle that instead of me, sec
16:05:08 this: https://review.openstack.org/#/c/490993/
16:05:10 so it's good
16:05:44 back to grafana, I no longer see the dvr grenade flavour failure uptick; if anything, it's lower than the usual flavour
16:06:04 looks like the gate is generally quite stable lately (?)
16:06:05 ah, so it seems it should be moved in grafana too
16:06:15 jlibosva, does it reflect your perception?
16:06:38 jlibosva, or removed?
16:06:55 yeah, rather removed
16:06:59 ok I will
16:07:20 so, speaking of general stability, grafana seems healthy. is it really back to normal? I haven't tracked the prev week.
16:07:21 yeah, seems stable. also today at the team meeting there were no critical bugs mentioned
16:07:35 there was a spike in one of the tempest dvr jobs 2 days ago
16:07:44 up to ~60%
16:08:02 check queue?
16:08:06 http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=9&fullscreen
16:08:12 yep, check queue - and all other tempest jobs are around ~20%?
16:08:28 not sure I'd call that "stable" :)
16:08:35 it may be the usual rate, since it tracks legit mistakes in patches
16:09:03 worth comparing with gate: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=5&fullscreen
16:10:01 * jlibosva still loading
16:10:25 * ihrachys too
16:10:44 grafana does seem to show me some artifacts instead of a proper chart
16:12:04 ok I have it. 10-15% seems to be the average rate
16:12:11 so yeah, it's not too stable
16:13:12 what could we do about that? except maybe looking at reported bugs and encouraging people not to naked-recheck
16:13:52 if I find a few minutes, I could inspect neutron-full through logstash
16:14:51 ok, let it be that.
note there are a lot of infra failures http://status.openstack.org/elastic-recheck/gate.html
16:15:22 maybe worth clicking through those uncategorized: http://status.openstack.org/elastic-recheck/data/integrated_gate.html
16:16:28 #action jlibosva to look through uncategorized/latest gate tempest failures (15% failure rate atm)
16:16:53 apart from that, we have two major offenders: fullstack and scenarios being broken
16:17:21 for fullstack, I started looking at the l3ha failure that pops up but had little time to bring it to completion; still planning to work on it
16:17:30 for scenarios, I remember jlibosva sent an email asking for help
16:17:40 do we have reviewable results to chew on for that one?
16:18:06 no, doesn't seem to be popular: https://etherpad.openstack.org/p/neutron-dvr-multinode-scenario-gate-failures
16:19:04 but there is a regression in ovs-fw after implementing conjunctions when tests use a lot of remote security groups
16:19:05 we enabled qos tests; have they shown up since then?
16:19:19 that causes random SSH denials from what I observed
16:19:24 * jlibosva looks for a bug
16:19:59 https://bugs.launchpad.net/neutron/+bug/1708092
16:20:00 Launchpad bug 1708092 in neutron "ovsfw sometimes rejects legitimate traffic when multiple remote SG rules are in use" [Undecided,In progress] - Assigned to IWAMOTO Toshihiro (iwamoto)
16:21:27 for QoS we have 13 failures in the last 7 days but it's hard to judge as the failures might be caused by ^^
16:21:55 hm. is it a serious regression? is a revert possible?
16:22:41 yes, a revert is definitely possible
16:23:17 what would the patch to revert be?
16:23:46 https://review.openstack.org/#/c/333804/
16:25:07 there was a followup patch to it
16:25:13 wow, that's a huge piece to revert
16:25:19 so that would need to be reverted first
16:25:24 I guess we can really do it that late in Pike
16:25:31 can't
16:25:33 can or can't
16:25:56 I see there is a patch for that but it's in conflict
16:26:14 https://review.openstack.org/#/c/492404/
16:26:21 and it's a wip
16:27:49 I will ask in gerrit how close we are there
16:28:54 ok let's switch to bugs
16:29:02 #topic Gate failure bugs
16:29:03 https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=-id&start=0
16:29:17 first is https://bugs.launchpad.net/neutron/+bug/1710589
16:29:19 Launchpad bug 1710589 in neutron "rally sla failure / internal error on load" [High,Triaged]
16:30:05 hmm, didn't kevin send a patch to reduce the number of created ports?
16:30:46 ah no, that was for trunk subports
16:31:22 https://review.openstack.org/492638 ?
16:31:27 yeah not related
16:31:52 it seems like StaleDataError is raised over and over until the retry limit is reached
16:34:10 aaand Ihar is gone :)
16:34:18 aaand Ihar is back :)
16:34:25 sorry, lost connectivity
16:34:40 the last thing I saw in the channel was:
16:34:44 it seems like StaleDataError is raised over and over until the retry limit is reached
16:34:44 if it's just high contention, should the remedy be similar?
16:35:33 the last message wasn't sent
16:35:39 and I didn't write anything, was reading the bug
16:36:36 we don't have logstash for the midokura gate and we don't see it on the Neutron one, right?
16:37:10 meh, my internet link is flaky
16:37:12 we don't have logstash for the midokura gate and we don't see it on the Neutron one, right?
16:37:32 is the scenario executed in our gate?
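[editor's note] To illustrate the StaleDataError behaviour mentioned at 16:31:52: below is a minimal Python sketch of an optimistically-locked update retried until a bound is hit. This is not Neutron's actual retry decorator; the MAX_RETRIES constant and the do_update callable are hypothetical, and only the StaleDataError exception comes from SQLAlchemy.

    from sqlalchemy.orm.exc import StaleDataError

    MAX_RETRIES = 10  # hypothetical bound, standing in for the real retry limit


    def update_with_retries(session, do_update):
        """Retry a DB update that may lose an optimistic-locking race.

        `do_update` is a hypothetical callable that mutates versioned rows
        through the given SQLAlchemy session.
        """
        for attempt in range(MAX_RETRIES):
            try:
                do_update(session)
                session.commit()
                return
            except StaleDataError:
                # Another worker updated the same row(s) first; roll back
                # and retry with fresh data.
                session.rollback()
        # Under sustained contention every attempt can lose the race, so the
        # error eventually surfaces to the caller as an internal error, which
        # matches the "retried until the retry limit is reached" symptom.
        raise StaleDataError("update still stale after %d attempts" % MAX_RETRIES)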
16:38:16 I think we have a grafana board for rally, it would show there
16:38:22 http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=11&fullscreen
16:38:47 doesn't seem high
16:39:00 oh, wrong timespan
16:39:38 the bug was reported today
16:39:57 and I don't see an issue in grafana for neutron (it's a 5-10% rate)
16:40:26 and also, those scenarios are pretty basic, they are executed
16:40:32 maybe something midonet-gate specific
16:41:32 sounds like that
16:41:44 I'm sure Yamamoto will figure it out soon :)
16:42:08 ok, I asked in LP for logstash and a diff in the gate jobs; we'll see what the reply is
16:42:15 https://bugs.launchpad.net/neutron/+bug/1709869
16:42:16 Launchpad bug 1709869 in neutron "test_convert_default_subnetpool_to_non_default fails: Subnet pool could not be found" [High,In progress] - Assigned to Itzik Brown (itzikb1)
16:42:26 fix https://review.openstack.org/#/c/492522/
16:43:05 the patch alone doesn't fix the test, I have another one: https://review.openstack.org/#/c/492653/
16:43:05 is it a new test? why hasn't it failed before?
16:43:10 but the test is skipped
16:43:29 because devstack creates a default subnetpool, so we have no way to test it
16:43:42 maybe if the subnetpool is not necessary, we should change devstack
16:44:09 the test has probably never worked, it was merged during the Pike dev cycle
16:44:23 ack; both in gate now
16:44:31 would be nice to follow up on the gate setup
16:44:52 do you want a task for that, or will we punt?
16:45:08 gimme a task!
16:45:23 I need some default subnetpool knowledge first though
16:46:08 #action jlibosva to tweak gate not to create a default subnetpool and enable test_convert_default_subnetpool_to_non_default
16:46:30 https://bugs.launchpad.net/neutron/+bug/1708030
16:46:31 Launchpad bug 1708030 in neutron "netlink test_list_entries failing with mismatch" [High,In progress] - Assigned to Cuong Nguyen (cuongnv)
16:46:40 seems like we have a fix here: https://review.openstack.org/#/c/489831/
16:47:42 those tests are disabled for now
16:47:54 because of a kernel bug in ubuntu xenial
16:48:37 yeah, we may want a follow-up test patch to prove it works
16:48:55 I checked this morning that the newly tagged kernel containing the fix hasn't been picked up by the gate yet
16:48:56 ...but that won't work in gate
16:49:06 oh, there is a new kernel?
16:49:09 yes
16:49:17 it was tagged last friday
16:49:18 cool. I imagine it's a question of days
16:49:34 so we will procrastinate on the fix for now; I have a minor comment there anyway.
16:49:41 I'm monitoring it; once I see we have the proper gate in place, I'll send a patch to enable those tests back
16:49:49 ++
16:49:49 https://bugs.launchpad.net/neutron/+bug/1707933
16:49:50 Launchpad bug 1707933 in neutron "functional tests timeout after a test worker killed" [Critical,Confirmed]
16:49:56 our old friend
16:50:36 oh, it's in Ocata too
16:50:41 I don't think we made progress.
last time I thought about it, I wanted to mock os.kill but of course could not squeeze it in
16:50:47 yeah, I saw it in ocata once
16:51:22 maybe a backport, or an external dep
16:52:49 #action ihrachys to capture os.kill calls in func tests and see if any of those kill test threads
16:53:34 https://bugs.launchpad.net/neutron/+bug/1707003
16:53:35 Launchpad bug 1707003 in neutron "gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:53:48 I think this one is going to be solved by Sean Dague's fix for host discovery
16:53:51 so we just wait
16:54:25 one more thing on the killed functional tests
16:54:27 http://logs.openstack.org/65/487065/5/check/gate-neutron-dsvm-functional-ubuntu-xenial/f9b22b8/logs/syslog.txt.gz#_Jul_31_08_04_37
16:54:48 eh... is it the same time the kill happens?
16:54:53 I wonder whether this could be related
16:54:57 it's about one minute earlier
16:55:25 well, it's about one minute earlier than the Killed output
16:56:12 very interesting. I imagine the parent talks to the tester children via pipes.
16:57:24 see https://github.com/moby/moby/issues/34472
16:57:36 it's fresh, it's ubuntu, and it describes a child hanging
16:57:49 but then why would only functional tests be affected and not all the others using forked workers?
16:58:52 ah, so maybe the executor spawns other processes, like the rootwrap daemon, that are killed
16:59:52 maybe. there are other google hits for a similar trace: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1702665
16:59:53 Launchpad bug 1702665 in linux (Ubuntu) "4.4.0-83-generic + Docker + EC2 frequently crashes at cgroup_rmdir GPF" [High,Confirmed]
17:00:10 ok time
17:00:14 thanks for joining
17:00:19 #endmeeting
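[editor's note] For the action item at 16:52:49 on capturing os.kill calls in the functional tests: below is a minimal Python sketch, assuming only the fixtures and unittest.mock libraries the test suite already depends on. The KillLoggerFixture class name is hypothetical, not an existing Neutron helper; it wraps os.kill so every kill issued from the test process is logged with its target PID, signal, and originating stack before being delivered.

    import os
    import traceback
    from unittest import mock

    import fixtures


    class KillLoggerFixture(fixtures.Fixture):
        """Log every os.kill issued by the test process, then deliver it."""

        def _setUp(self):
            real_kill = os.kill

            def logged_kill(pid, sig):
                # Record the target pid, the signal, and where the call came
                # from, so a kill that takes down a test worker leaves a trace.
                print("os.kill(%s, %s) called from pid %s\n%s"
                      % (pid, sig, os.getpid(),
                         "".join(traceback.format_stack())))
                return real_kill(pid, sig)

            patcher = mock.patch("os.kill", side_effect=logged_kill)
            patcher.start()
            self.addCleanup(patcher.stop)

    # Usage inside a test case that already supports fixtures:
    #     self.useFixture(KillLoggerFixture())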