16:00:34 <ihrachys> #startmeeting neutron_ci
16:00:35 <openstack> Meeting started Tue Aug 15 16:00:34 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:36 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:38 <openstack> The meeting name has been set to 'neutron_ci'
16:00:39 <ihrachys> jlibosva, haleyb o
16:00:41 <ihrachys> o/
16:00:41 <jlibosva> o/
16:01:18 <ihrachys> #topic Actions from prev week
16:01:25 <ihrachys> haleyb to reach out to all affected parties, and FBI, to get multinode grenade by default
16:01:59 <ihrachys> https://review.openstack.org/#/c/483600
16:02:09 <ihrachys> I see that we are going to wait for master to open
16:02:37 <ihrachys> also waiting for https://review.openstack.org/#/c/488381/ in devstack before making the multinode flavour voting
16:03:10 <ihrachys> next was "haleyb to look at why dvr grenade flavour surged to 30% failure rate comparing to 0% for other grenades"
16:03:21 <ihrachys> let's discuss that in the grafana section
16:03:25 <ihrachys> these are all the action items
16:03:29 <ihrachys> #topic Grafana
16:03:30 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:04:08 <jlibosva> how about "ihrachys to propose a removal for linuxbridge grenade multinode job" ? :)
16:04:43 <ihrachys> eh right!
16:04:48 <jlibosva> I see the line is gone, so was it done?
16:04:52 <ihrachys> manjeet was going to handle that instead of me, sec
16:05:08 <ihrachys> this: https://review.openstack.org/#/c/490993/
16:05:10 <ihrachys> so it's good
16:05:44 <ihrachys> back to grafana, I no longer see the dvr grenade flavour failure uptick; if anything, it's lower than the usual flavour
16:06:04 <ihrachys> looks like the gate is generally quite stable lately (?)
16:06:05 <jlibosva> ah, so it seems it should be moved in grafana too
16:06:15 <ihrachys> jlibosva, does it reflect your perception?
16:06:38 <ihrachys> jlibosva, or removed?
16:06:55 <jlibosva> yeah, rather removed
16:06:59 <ihrachys> ok I will
16:07:20 <ihrachys> so, speaking of general stability, grafana seems healthy. is it really back to normal? I haven't tracked the prev week.
16:07:21 <jlibosva> yeah, seems stable. also today at the team meeting there were no critical bugs mentioned
16:07:35 <jlibosva> there is a spike in one of the tempest dvr jobs from 2 days ago
16:07:44 <jlibosva> up to ~60%
16:08:02 <ihrachys> check queue?
16:08:06 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=9&fullscreen
16:08:12 <jlibosva> yep, check queue - and all other tempest jobs are around ~20% ?
16:08:28 <jlibosva> not sure I'd call that "stable" :)
16:08:35 <ihrachys> it may be a usual rate, since it tracks legit mistakes in patches
16:09:03 <ihrachys> worth comparing with gate: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=5&fullscreen
16:10:01 * jlibosva still loading
16:10:25 * ihrachys too
16:10:44 <ihrachys> grafana does seem to show me some artifacts instead of a proper chart
16:12:04 <ihrachys> ok I have it. 10-15% seems to be the average rate
16:12:11 <ihrachys> so yeah, it's not too stable
16:13:12 <ihrachys> what could we do about that? except maybe looking at reported bugs and encouraging people not to naked-recheck
16:13:52 <jlibosva> if I find a few minutes, I could inspect neutron-full through logstash
16:14:51 <ihrachys> ok let it be it. note there are a lot of infra failures http://status.openstack.org/elastic-recheck/gate.html
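For reference, the logstash inspection jlibosva offers here usually starts from a query string pasted into logstash.openstack.org. A rough sketch of what such a query could look like follows; the job name and the build_*/tags fields are assumptions following the usual elastic-recheck conventions, not values taken from this meeting.

```python
# Illustrative only: the kind of query string used on logstash.openstack.org
# to pull recent neutron-full check failures for triage.  Treat the job name
# and field values as assumptions, not a tested query.
NEUTRON_FULL_FAILURES = (
    'build_name:"gate-tempest-dsvm-neutron-full-ubuntu-xenial" '
    'AND build_status:"FAILURE" '
    'AND build_queue:"check" '
    'AND tags:"console"'
)

if __name__ == "__main__":
    # Paste the printed string into the logstash.openstack.org search box
    # and narrow the timespan to the period being investigated.
    print(NEUTRON_FULL_FAILURES)
```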
16:15:22 <ihrachys> maybe worth clicking through those uncategorized: http://status.openstack.org/elastic-recheck/data/integrated_gate.html
16:16:28 <ihrachys> #action jlibosva to look through uncategorized/latest gate tempest failures (15% failure rate atm)
16:16:53 <ihrachys> apart from that, we have two major offenders: fullstack and scenarios being broken
16:17:21 <ihrachys> for fullstack, I started looking at the l3ha failure that pops up but had little time to take it to completion; still planning to work on it
16:17:30 <ihrachys> for scenarios, I remember jlibosva sent an email asking for help
16:17:40 <ihrachys> do we have reviewable results to chew on for that one?
16:18:06 <jlibosva> no, doesn't seem to be popular: https://etherpad.openstack.org/p/neutron-dvr-multinode-scenario-gate-failures
16:19:04 <jlibosva> but there is a regression in ovs-fw after implementing conjunctions, when tests use a lot of remote security groups
16:19:05 <ihrachys> we enabled qos tests; have they shown up since then?
16:19:19 <jlibosva> that causes random SSH denials from what I observed
16:19:24 * jlibosva looks for a bug
16:19:59 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1708092
16:20:00 <openstack> Launchpad bug 1708092 in neutron "ovsfw sometimes rejects legitimate traffic when multiple remote SG rules are in use" [Undecided,In progress] - Assigned to IWAMOTO Toshihiro (iwamoto)
16:21:27 <jlibosva> for QoS we have 13 failures in the last 7 days but it's hard to judge, as the failures might be caused by ^^
16:21:55 <ihrachys> hm. is it a serious regression? is revert possible?
16:22:41 <jlibosva> yes, revert is definitely possible
16:23:17 <ihrachys> what would the patch to revert be?
16:23:46 <jlibosva> https://review.openstack.org/#/c/333804/
16:25:07 <jlibosva> there was a followup patch to it
16:25:13 <ihrachys> wow, that's a huge piece to revert
16:25:19 <jlibosva> so that would need to be reverted first
16:25:24 <ihrachys> I guess we can really do it that late in Pike
16:25:31 <ihrachys> can't
16:25:33 <jlibosva> can or can't
16:25:56 <ihrachys> I see there is a patch for that but it's in conflict
16:26:14 <ihrachys> https://review.openstack.org/#/c/492404/
16:26:21 <jlibosva> and it's a wip
16:27:49 <ihrachys> I will ask in gerrit how close we are there
16:28:54 <ihrachys> ok let's switch to bugs
16:29:02 <ihrachys> #topic Gate failure bugs
16:29:03 <ihrachys> https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=-id&start=0
16:29:17 <ihrachys> first is https://bugs.launchpad.net/neutron/+bug/1710589
16:29:19 <openstack> Launchpad bug 1710589 in neutron "rally sla failure / internal error on load" [High,Triaged]
16:30:05 <jlibosva> hmm, didn't kevin send a patch to reduce the number of created ports?
16:30:46 <jlibosva> ah no, that was for trunk subports
16:31:22 <ihrachys> https://review.openstack.org/492638 ?
16:31:27 <ihrachys> yeah not related
16:31:52 <ihrachys> it seems like StaleDataError is raised over and over until the retry limit is reached
16:34:10 <jlibosva> aaand Ihar is gone :)
16:34:18 <jlibosva> aaand Ihar is back :)
16:34:25 <ihrachys> sorry, lost connectivity
16:34:40 <ihrachys> last I saw in the channel was:
16:34:44 <ihrachys> <ihrachys> it seems like StaleDataError is raised over and over until the retry limit is reached
16:34:44 <ihrachys> <ihrachys> if it's just high contention, should the remedy be similar?
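The "retry limit" mentioned here refers to the DB retry wrapper pattern neutron applies around API operations: retriable errors such as StaleDataError are absorbed and retried with backoff, and only surface as an internal error once the budget is exhausted. A minimal sketch of that pattern, built on oslo.db's wrap_db_retry; the limits and the decorated function are illustrative, not neutron's actual configuration.

```python
# A minimal sketch, not neutron's real decorator: retry a DB-facing function
# when SQLAlchemy reports StaleDataError (a lost optimistic-locking race),
# backing off between attempts.  Limits and names here are illustrative.
from oslo_db import api as oslo_db_api
from sqlalchemy.orm import exc as orm_exc

retry_on_stale_data = oslo_db_api.wrap_db_retry(
    max_retries=10,            # assumed budget; the real code tunes its own
    retry_interval=1,
    inc_retry_interval=True,   # back off between attempts
    exception_checker=lambda exc: isinstance(exc, orm_exc.StaleDataError),
)


@retry_on_stale_data
def update_router_state(context, router_id, state):
    """Hypothetical update that can lose a row-version race under load."""
    # ... ORM update that may raise StaleDataError when writers collide ...
    pass
```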
16:35:33 <jlibosva> the last message wasn't sent
16:35:39 <jlibosva> and I didn't write anything, I was reading the bug
16:36:36 <jlibosva> we don't have logstash for the midokura gate and we don't see it on the Neutron one, right?
16:37:10 <ihrachys> meh, my internet link is flaky
16:37:12 <jlibosva> we don't have logstash for the midokura gate and we don't see it on the Neutron one, right?
16:37:32 <ihrachys> is the scenario executed in our gate?
16:38:16 <ihrachys> I think we have a grafana board for rally, it would show there
16:38:22 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=11&fullscreen
16:38:47 <jlibosva> doesn't seem high
16:39:00 <jlibosva> oh, wrong timespan
16:39:38 <ihrachys> the bug was reported today
16:39:57 <ihrachys> and I don't see an issue in grafana for neutron (it's a 5-10% rate)
16:40:26 <ihrachys> and also, those scenarios are pretty basic, they are executed
16:40:32 <ihrachys> maybe smth midonet gate specific
16:41:32 <jlibosva> sounds like that
16:41:44 <jlibosva> I'm sure Yamamoto will figure it out soon :)
16:42:08 <ihrachys> ok, I asked in LP for logstash and for a diff in gate jobs, we'll see what the reply is
16:42:15 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1709869
16:42:16 <openstack> Launchpad bug 1709869 in neutron "test_convert_default_subnetpool_to_non_default fails: Subnet pool could not be found" [High,In progress] - Assigned to Itzik Brown (itzikb1)
16:42:26 <ihrachys> fix https://review.openstack.org/#/c/492522/
16:43:05 <jlibosva> the patch alone doesn't fix the test, I have another one: https://review.openstack.org/#/c/492653/
16:43:05 <ihrachys> is it a new test? why hasn't it failed before?
16:43:10 <jlibosva> but the test is skipped
16:43:29 <jlibosva> cause devstack creates a default subnetpool, so we have no way to test it
16:43:42 <jlibosva> maybe if the subnetpool is not necessary, we should change devstack
16:44:09 <jlibosva> the test has probably never worked, it was merged during the Pike dev cycle
16:44:23 <ihrachys> ack; both in gate now
16:44:31 <ihrachys> would be nice to follow up on the gate setup
16:44:52 <ihrachys> do you want a task for that, or will we punt?
16:45:08 <jlibosva> gimme a task!
16:45:23 <jlibosva> I need some default subnetpool knowledge first though
16:46:08 <ihrachys> #action jlibosva to tweak gate not to create default subnetpool and enable test_convert_default_subnetpool_to_non_default
16:46:30 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1708030
16:46:31 <openstack> Launchpad bug 1708030 in neutron "netlink test_list_entries failing with mismatch" [High,In progress] - Assigned to Cuong Nguyen (cuongnv)
16:46:40 <ihrachys> seems like we have a fix here: https://review.openstack.org/#/c/489831/
16:47:42 <jlibosva> those tests are disabled for now
16:47:54 <jlibosva> because of a kernel bug in ubuntu xenial
16:48:37 <ihrachys> yeah, we may want a follow-up test patch to prove it works
16:48:55 <jlibosva> I checked this morning that the newly tagged kernel containing the fix hasn't been picked up by the gate yet
16:48:56 <ihrachys> ...but that won't work in the gate
16:49:06 <ihrachys> oh there is a new kernel?
16:49:09 <jlibosva> yes
16:49:17 <jlibosva> it was tagged last friday
16:49:18 <ihrachys> cool. I imagine it's a question of days
16:49:34 <ihrachys> so we will procrastinate on the fix for now; I have a minor comment there anyway.
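For context on the default-subnetpool action item above (bug 1709869): the API allows only one default subnetpool per address family, and devstack already creates one, which is why the conversion test currently has to skip. A hedged openstacksdk illustration of that constraint follows; the cloud name, pool name and prefix are made up for the example, and this is not the gate's actual test code.

```python
# Hedged illustration of why test_convert_default_subnetpool_to_non_default
# skips in the gate: a default subnetpool already exists (devstack creates
# it), and only one default per address family is allowed, so the test has
# nothing of its own to convert.  Uses openstacksdk; values are made up.
import openstack

conn = openstack.connect(cloud='devstack-admin')  # assumed clouds.yaml entry

existing_defaults = list(conn.network.subnet_pools(is_default=True))
if existing_defaults:
    # The gate situation: skip rather than flip a pool other tests rely on.
    print("default subnetpool(s) already present:",
          [pool.name for pool in existing_defaults])
else:
    # With no pre-created default, the test could own the whole lifecycle:
    # create a default pool, then convert it to non-default.
    pool = conn.network.create_subnet_pool(
        name='convert-me', prefixes=['192.0.2.0/24'], is_default=True)
    conn.network.update_subnet_pool(pool, is_default=False)
```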
16:49:41 <jlibosva> I'm monitoring it; once I see we have a proper gate in place, I'll send a patch to enable those tests again
16:49:49 <ihrachys> ++
16:49:49 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1707933
16:49:50 <openstack> Launchpad bug 1707933 in neutron "functional tests timeout after a test worker killed" [Critical,Confirmed]
16:49:56 <ihrachys> our old friend
16:50:36 <jlibosva> oh, it's in Ocata too
16:50:41 <ihrachys> I don't think we made progress. last time I thought about it, I wanted to mock os.kill but ofc could not squeeze it in
16:50:47 <ihrachys> yeah, I saw it in ocata once
16:51:22 <ihrachys> maybe a backport, or an external dep
16:52:49 <ihrachys> #action ihrachys to capture os.kill calls in func tests and see if any of those kill test threads
16:53:34 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1707003
16:53:35 <openstack> Launchpad bug 1707003 in neutron "gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:53:48 <ihrachys> I think this one is going to be solved by Sean Dague's fix for host discovery
16:53:51 <ihrachys> so we just wait
16:54:25 <jlibosva> one more thing on the killed functional tests
16:54:27 <jlibosva> http://logs.openstack.org/65/487065/5/check/gate-neutron-dsvm-functional-ubuntu-xenial/f9b22b8/logs/syslog.txt.gz#_Jul_31_08_04_37
16:54:48 <ihrachys> eh... is it the same time the kill happens?
16:54:53 <jlibosva> I wonder whether this could be related
16:54:57 <jlibosva> it's about one minute earlier
16:55:25 <jlibosva> well, it's about one minute earlier than the Killed output
16:56:12 <ihrachys> very interesting. I imagine the parent talks to the tester children via pipes.
16:57:24 <ihrachys> see https://github.com/moby/moby/issues/34472
16:57:36 <ihrachys> it's fresh, it's ubuntu, and it describes a child hanging
16:57:49 <jlibosva> but then why would only functional tests be affected and not all the others using forked workers
16:58:52 <jlibosva> ah, so maybe the executor spawns other processes, like the rootwrap daemon, that are killed
16:59:52 <ihrachys> maybe. there are other google hits for a similar trace: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1702665
16:59:53 <openstack> Launchpad bug 1702665 in linux (Ubuntu) "4.4.0-83-generic + Docker + EC2 frequently crashes at cgroup_rmdir GPF" [High,Confirmed]
17:00:10 <ihrachys> ok, time
17:00:14 <ihrachys> thanks for joining
17:00:19 <ihrachys> #endmeeting
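On the os.kill action item for bug 1707933: one way to capture the calls without changing test behaviour is to monkey-patch os.kill so each call logs the target pid, signal and calling stack before delegating to the real function. A minimal sketch under those assumptions; the fixture and logger names are made up, and kills issued from outside the test process (for example by the kernel OOM killer) would not show up here.

```python
# A minimal sketch for the os.kill action item: log every os.kill issued from
# inside the functional test process and then perform the real kill.  Names
# are illustrative, not existing neutron code.
import logging
import os
import traceback

import fixtures

LOG = logging.getLogger(__name__)
_original_kill = os.kill


def _logged_kill(pid, sig):
    # Record who asked to kill what before doing the real thing.
    LOG.warning("os.kill(%s, %s) called from:\n%s",
                pid, sig, "".join(traceback.format_stack(limit=10)))
    return _original_kill(pid, sig)


class OsKillTracker(fixtures.Fixture):
    """Use in a test base class: self.useFixture(OsKillTracker())."""

    def _setUp(self):
        self.useFixture(fixtures.MonkeyPatch('os.kill', _logged_kill))
```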