16:00:58 #startmeeting neutron_ci
16:00:59 Meeting started Tue Aug 1 16:00:58 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:00 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:02 The meeting name has been set to 'neutron_ci'
16:01:24 #topic Actions from prev week
16:01:33 "haleyb to reach out to QA PTL about switching grenade integrated gate to multinode"
16:01:52 last comments in https://review.openstack.org/#/c/483600/ suggest there is more leg work there
16:01:57 talking to all projects affected
16:02:00 * haleyb did, needs to contact additional PTLs to get their sign-off as well
16:02:15 * haleyb is lurking for a few more minutes
16:02:34 #action haleyb to reach out to all affected parties, and FBI, to get multinode grenade by default
16:02:50 next is "jlibosva to recheck the tools/kill.sh patch until it hits timeout and report back"
16:03:00 I did
16:03:05 but didn't find anything useful
16:03:09 I believe on the team meeting today we figured it did not help much
16:03:12 sec
16:03:15 I reported https://bugs.launchpad.net/neutron/+bug/1707933
16:03:15 Launchpad bug 1707933 in neutron "functional tests timeout after a test worker killed" [Critical,Confirmed]
16:03:56 I see that the worker was killed here: http://logs.openstack.org/65/487065/5/check/gate-neutron-dsvm-functional-ubuntu-xenial/f9b22b8/console.html#_2017-07-31_08_05_43_074447
16:04:04 at 08:05:43
16:04:36 there were 'kill' calls around that time, but none with the worker's pid
16:05:22 what is so special about FirewallTestCase?
16:05:39 or probably BaseFirewallTestCase
16:05:48 because the ipv6 class is also affected
16:06:02 what do you mean?
16:06:34 well, almost all breakages have firewall tests timing out in the end, no?
16:06:45 so the workers die on running firewall tests
16:07:16 it's unclear to me if the test kills itself or it kills another running test
16:07:35 but there is a RootHelperProcess class (or something like that)
16:07:46 that takes care of finding the correct pid to kill
16:07:46 ok, but again, why are firewall tests killed, even if those are not the source of kills?
16:07:54 maybe it's just because they are slow?
16:08:00 so the chance of hitting them is high?
16:08:21 I don't think it has anything to do with how fast they are
16:08:28 there is an eventlet timeout per test
16:08:39 so slow tests should raise an exception if they don't finish in time
16:08:41 IIRC
16:10:25 no, I mean, it seems like we see firewall tests involved when a kill happens; it's either because there is some bug in the tests, or because a bug happens somewhere else and firewall tests take such a huge chunk of the total time that it hits them so often.
16:10:42 aha
16:11:04 one interesting thing I just noticed is that the "Killed" message in the console is actually almost a minute after the last log in the killed test
16:11:27 from a brief look at the log, it doesn't seem that those tests are too long, or that their number is huge
16:12:01 maybe I'll also add ps output to kill.sh, so we know which process is being killed
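The per-test timeout mentioned at 16:08:28 is usually wired up roughly as in the sketch below. This is an illustrative approximation only, not neutron's actual base test class (which may rely on fixtures.Timeout or oslotest instead); the class name and timeout value are made up.

    # Sketch: enforce a per-test timeout so a slow test raises an exception
    # instead of hanging. Illustrative only; neutron's real base class may
    # use a different mechanism (e.g. fixtures.Timeout).
    import eventlet
    import testtools

    TEST_TIMEOUT = 180  # seconds; illustrative value


    class TimeoutTestCase(testtools.TestCase):
        def setUp(self):
            super(TimeoutTestCase, self).setUp()
            # The timer starts on construction and raises eventlet.Timeout in
            # the test greenthread once it expires. It only fires when the
            # test code yields to the eventlet hub, so a worker killed with
            # SIGKILL still dies silently.
            timeout = eventlet.Timeout(TEST_TIMEOUT)
            self.addCleanup(timeout.cancel)

This also explains why a SIGKILL'd worker never shows up as a timed-out test: the timeout can only raise inside a live process.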
16:12:06 jlibosva, how do you determine the killed test? the one that is 'inprogress' at the end of the run?
16:12:38 ihrachys: yes, also you can see that e.g. in the console log I sent, executor {3} doesn't run any other tests after the Killed message
16:12:53 ihrachys: while there are tests from the {0}, {1} and {2} workers
16:13:32 I wonder if this ~1 minute delay holds for other failure instances
16:13:33 one other thing would be to change signals in firewall tests to SIGTERM and then register a handler in the base test class for that signal
16:14:25 what is 'change signals'?
16:14:34 default uses SIGKILL
16:14:50 for killing nc processes that are used to test whether traffic can or cannot pass through
16:16:49 so if we change from SIGKILL to SIGTERM, we would be able to catch it and reveal more info
16:17:00 what do you think?
16:17:22 sorry, I am not that into the test, so I can't find which piece of it starts nc
16:17:40 oh that's assert_connection ?
16:19:01 ok I see where it finally bubbles down to RootHelperProcess
16:19:02 ihrachys: it's NetcatTester
16:21:11 anyways, I'll try to come up with some more patches, we've already spent a lot of meeting time on this :)
16:21:25 we can come up with ideas on the LP eventually
16:21:27 ok
16:21:41 next was "haleyb to look at why dvr grenade flavour surged to 30% failure rate comparing to 0% for other grenades"
16:22:16 I don't think there was progress on that one?
16:23:11 ok let's repeat it, haleyb is probably gone
16:23:28 #action haleyb to look at why dvr grenade flavour surged to 30% failure rate comparing to 0% for other grenades
16:23:36 oh I missed one
16:23:40 "jlibosva to classify current scenario failures and send email to openstack-dev@ asking for help"
16:23:54 I did and did
16:24:00 I saw the email
16:24:09 was there any progress that emerged from it?
16:24:12 http://lists.openstack.org/pipermail/openstack-dev/2017-July/120294.html
16:24:26 slaweq sent a patch to enable qos as he wants to work on it
16:24:40 that is abandoned now
16:24:48 probably because yamamoto had the same
16:24:58 yeah
16:25:00 https://review.openstack.org/#/c/468326
16:25:01 the yamamoto patch https://review.openstack.org/#/c/468326
16:25:31 it also seems there might be no work item left and it just works now
16:25:40 ok
16:25:41 I don't know whether something fixed the scenario or it just has a low repro rate
16:25:44 let's land it then and see?
16:25:48 yep
16:26:08 I +2d
16:26:27 apart from those qos tests?
16:26:48 armax looked at the trunk failure, which turned out to be a regression in the ovs firewall
16:26:53 btw the etherpad tracking the progress is at https://etherpad.openstack.org/p/neutron-dvr-multinode-scenario-gate-failures
16:27:12 https://bugs.launchpad.net/neutron/+bug/1707339
16:27:12 Launchpad bug 1707339 in neutron "test_trunk_subport_lifecycle fails on subport down timeout" [High,In progress] - Assigned to Armando Migliaccio (armando-migliaccio)
16:27:20 I've been working on it today
16:27:32 I'll send out a patch soon, I'm writing just some basic UTs
16:28:03 nice
16:28:27 other than that, it doesn't seem as popular as the python3 effort
16:28:37 tempest is hard
16:29:00 I plan to take one failure at a time and do something about it, but I've been recently quite busy
16:29:09 we can maybe ask once again, mentioning the progress and asking for more :)
16:29:10 I imagine
16:29:41 I can do the shout out
16:29:53 if you have some cookies to promise, maybe that will help :)
16:30:10 cookie or a stick, hm. tough choice.
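For reference, the signal change floated at 16:13:33-16:16:49 (have the firewall tests' NetcatTester kill nc with SIGTERM rather than SIGKILL, and register a handler in the functional base test class) could look roughly like the sketch below. Class and function names are illustrative stand-ins, not the actual neutron patch.

    # Sketch of the SIGTERM idea: if a kill aimed at an nc process ever
    # reaches a test worker by mistake, log where the worker was instead of
    # dying silently (SIGKILL cannot be caught at all).
    import os
    import signal
    import traceback

    import testtools


    def _report_sigterm(signum, frame):
        # Dump the current stack so the console log shows what the worker
        # was doing when the signal arrived.
        print('test worker %d received SIGTERM:\n%s'
              % (os.getpid(), ''.join(traceback.format_stack(frame))))


    class BaseSudoTestCaseSketch(testtools.TestCase):  # hypothetical stand-in
        def setUp(self):
            super(BaseSudoTestCaseSketch, self).setUp()
            old_handler = signal.signal(signal.SIGTERM, _report_sigterm)
            self.addCleanup(signal.signal, signal.SIGTERM, old_handler)

NetcatTester would then be switched to send signal.SIGTERM when tearing down its nc processes, under the assumption that the handler above turns a stray kill into a readable traceback rather than a dead worker.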
16:30:15 next was "haleyb to check why dvr-ha job is at ~100% failure rate"
16:30:34 I talked to haleyb about that
16:30:44 and it seems like this is a devstack-gate provisioning issue
16:30:46 https://bugs.launchpad.net/neutron/+bug/1707003
16:30:46 Launchpad bug 1707003 in neutron "gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:30:59 there is a patch that tackles a similar problem, in grenade (abandoned)
16:31:15 and we will probably need smth similar, but in devstack-gate, since this job is not even grenade
16:31:24 and Brian is going to work on it
16:31:29 I imagine it may take a while
16:31:32 first, because d-g
16:31:45 second, because reviews there are rare
16:31:46 :)
16:32:10 #topic Grafana
16:32:15 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:32:36 there is a periodic failure for the -pg- job again
16:32:45 I noticed it today and sent an email to armax about it
16:32:55 seems like another instance of sqlalchemy not being ready for -pg-
16:33:02 there is a sql syntax error in the cinder api
16:33:33 there is fullstack still at 80%, and I actually started playing with the test locally yesterday (no progress yet)
16:33:36 about pg, have you seen the patch about testing pg?
16:33:40 good thing it reproduces just fine
16:33:48 jlibosva, which one?
16:33:54 * jlibosva looking
16:34:06 https://review.openstack.org/#/c/427880/
16:34:41 oh that's old news yeah
16:34:47 tl;dr it warns users that pg is not thoroughly tested; I wonder whether it is worth the effort to support pg in the gate
16:34:54 still it seems like we have armax interested in doing it
16:34:59 I proposed removal of the job in the past
16:35:17 and armax was all against it and jumped on fixing issues
16:35:18 I remember, but it was before this was approved
16:35:29 ok
16:35:39 I don't think it changes much if we just do it best effort
16:35:55 if someone benefits from it and maintains it above the water, I am good
16:36:10 okies
16:36:19 there is one failure spike that I think is new
16:36:33 for the linuxbridge grenade multinode job
16:36:41 currently at 45%
16:37:03 I know that the job was never stable because of https://bugs.launchpad.net/neutron/+bug/1683256
16:37:03 Launchpad bug 1683256 in neutron "linuxbridge multinode depending on multicast support of provider" [High,Confirmed] - Assigned to omkar_telee (omkar-telee)
16:37:33 Kevin was going to look at it in the past, but it never happened
16:37:42 and I wonder if that's a sign that we should deprovision the job
16:37:58 because the only reason we didn't remove it a cycle ago was that Kevin was going to make it work
16:38:38 I guess I will send another patch removing other people's jobs and see how they react ;)
16:38:51 #action ihrachys to propose a removal for linuxbridge grenade multinode job
16:39:44 I suspect there is this multicast issue AND something else going on there, but I have no will to dig
16:40:05 #topic Open discussion
16:40:17 jlibosva, progress on ovs isolation for fullstack?
16:40:22 nah
16:40:24 :)
16:40:26 ack :)
16:40:29 anything else?
16:40:32 I have one topic though
16:40:34 yep
16:40:38 I was pinged by the ironic guys
16:40:48 they are getting hit in their grenade job by neutron
16:40:58 bug number?
16:40:59 https://bugs.launchpad.net/neutron/+bug/1707160
16:40:59 Launchpad bug 1707160 in neutron "test_create_port_in_allowed_allocation_pools test fails on ironic grenade" [Undecided,Confirmed]
16:41:07 which probably leads to https://bugs.launchpad.net/neutron/+bug/1705351
16:41:07 Launchpad bug 1705351 in neutron "agent notifier getting amqp NotFound exceptions propagated up to it" [High,In progress] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:41:39 I looked at the logs of the failed job and it seemed to me that the client times out after 60 seconds and then retries the operation
16:42:07 the thing is that the first operation succeeds after some time > 60 secs and then the second "retry" fails
16:42:18 so they bumped the cli timeout to 120s but are still getting hit
16:42:33 I didn't have time to look at why they get hit again
16:42:48 but I thought it was worth raising here even though it's not Neutron CI but some neutron failure :)
16:43:07 * ihrachys tries to understand why the 2nd bug is assigned to him
16:43:21 you sent a patch to catch some exceptions
16:43:23 iirc
16:43:26 oh, a partial bug
16:43:37 https://review.openstack.org/#/c/486687/
16:44:07 jlibosva, how does the 1st bug lead to the 2nd? can you elaborate?
16:44:19 the 2nd is about an exception raised
16:44:24 because of a missing exchange
16:44:52 the former looks more like misconfigured dns or smth else locking the processing thread for a long time
16:45:06 what I saw was a subnet delete in the ironic gate - the delete was trying to notify dhcp agents they should delete their resources
16:45:38 but due to some amqp issue, rpc was failing or stuck (don't remember), which caused the subnet deletion to take 97 seconds
16:46:33 and the tempest rest client called subnet delete again after 60 seconds, but the first call eventually succeeded, while the second failed
16:46:39 because the subnet was already gone
16:46:41 so, one of the reasons I am aware of for a thread to lock on rpc is misconfigured dns, where oslo.messaging takes a while to get a response from libresolv with ipv4/ipv6 addresses.
16:46:49 so the delay on the server side was caused by the amqp issue
16:47:03 jlibosva, not THE issue, but AN issue, right?
16:47:56 in the server log, I see AMQP server on 149.202.183.40:5672 is unreachable: timed out
16:48:11 which probably suggests it's not dns
16:48:19 (the broker is referred to via ip)
16:48:21 I'm referring to amqp as THE :)
16:48:34 IIRC there was a trace about a missing queue or something
16:49:11 oh yeah, I see "NotFound: Basic.publish: (404) NOT_FOUND - no exchange 'q-agent-notifier-port-delete_fanout' in vhost '/'"
16:49:43 then the 2nd bug is indeed relevant
16:50:34 with the fix landed on the neutron side for this, I guess we may ask ironic folks if they are still affected?
16:51:24 makes sense
16:51:42 but I don't have much knowledge about the failure there
16:51:46 just wanted to raise it here
16:52:03 ack, thanks for that
16:52:08 I will update them about the fix in LP
16:52:13 anything else?
16:52:40 not from me
16:52:45 cool. thanks for joining.
16:52:45 #endmeeting
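A minimal sketch of the retry race described in the last topic (16:41:39-16:46:39): the REST client gives up after its timeout and retries, the original DELETE completes late on the server because of the AMQP trouble, and the retry then gets a 404 even though the delete succeeded. The function, client library and timeout value below are illustrative assumptions, not the actual tempest or ironic code.

    # Illustrative reproduction of the race: a client-side timeout plus a
    # blind retry turns one slow-but-successful DELETE into a spurious 404.
    import requests  # assumed available; any HTTP client shows the same race

    CLIENT_TIMEOUT = 60  # seconds the client waits before giving up


    def delete_subnet(session, url):
        try:
            return session.delete(url, timeout=CLIENT_TIMEOUT)
        except requests.Timeout:
            # The server may still be processing the first DELETE (in the
            # ironic job it took ~97s); by the time we retry, the subnet is
            # already gone, so the retry returns 404 and the test fails.
            return session.delete(url, timeout=CLIENT_TIMEOUT)

Bumping CLIENT_TIMEOUT only narrows the window; as long as the server-side delay can exceed it, the retry can still observe a 404, which matches what the ironic team saw after moving to 120s.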