16:00:58 <ihrachys> #startmeeting neutron_ci
16:00:59 <openstack> Meeting started Tue Aug 1 16:00:58 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:00 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:02 <openstack> The meeting name has been set to 'neutron_ci'
16:01:24 <ihrachys> #topic Actions from prev week
16:01:33 <ihrachys> "haleyb to reach out to QA PTL about switching grenade integrated gate to multinode"
16:01:52 <ihrachys> last comments in https://review.openstack.org/#/c/483600/ suggest there is more leg work there
16:01:57 <ihrachys> talking to all projects affected
16:02:00 * haleyb did, needs to contact additional PTLs to get their sign-off as well
16:02:15 * haleyb is lurking for a few more minutes
16:02:34 <ihrachys> #action haleyb to reach out to all affected parties, and FBI, to get multinode grenade by default
16:02:50 <ihrachys> next is "jlibosva to recheck the tools/kill.sh patch until it hits timeout and report back"
16:03:00 <jlibosva> I did
16:03:05 <jlibosva> but didn't find anything useful
16:03:09 <ihrachys> I believe on the team meeting today we figured it did not help much
16:03:12 <jlibosva> sec
16:03:15 <ihrachys> I reported https://bugs.launchpad.net/neutron/+bug/1707933
16:03:15 <openstack> Launchpad bug 1707933 in neutron "functional tests timeout after a test worker killed" [Critical,Confirmed]
16:03:56 <jlibosva> I see that worker was killed here: http://logs.openstack.org/65/487065/5/check/gate-neutron-dsvm-functional-ubuntu-xenial/f9b22b8/console.html#_2017-07-31_08_05_43_074447
16:04:04 <jlibosva> at 08:05:43
16:04:36 <jlibosva> there were 'kill' called around that time, but none with worker's pid
16:05:22 <ihrachys> what is so special about FirewallTestCase ?
16:05:39 <ihrachys> or probably BaseFirewallTestCase
16:05:48 <ihrachys> because ipv6 class also affected
16:06:02 <jlibosva> what do you mean?
16:06:34 <ihrachys> well, almost all breakages have firewall tests timing out in the end, no?
16:06:45 <ihrachys> so the workers die on running firewall tests
16:07:16 <jlibosva> it's unclear to me if the test kills itself or it kills other running test
16:07:35 <jlibosva> but there is a RootHelperProcess class (or something like that)
16:07:46 <jlibosva> that takes care of finding correct pid to kill
16:07:46 <ihrachys> ok, but again, why firewall tests are killed, even if those are not the source of kills?
16:07:54 <ihrachys> maybe it's just because they are slow?
16:08:00 <ihrachys> so the chance of hitting them is high?
16:08:21 <jlibosva> I don't think it has anything to do with how fast they are
16:08:28 <jlibosva> there is an eventlet timeout per test
16:08:39 <jlibosva> so slow tests should raise an exception if they don't finish in time
16:08:41 <jlibosva> IIRC
16:10:25 <ihrachys> no, I mean, it seems like we see firewall tests involved when kill happens; it's either because there is some bug in the test; or because a bug happens somewhere else, but it's just that firewall tests take a huge chunk of total time that it hits them so often.
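The per-test timeout jlibosva refers to works roughly as in the sketch below; the class name and the 180-second limit are illustrative, not neutron's actual base test code. The point is that a merely slow test should surface as an exception in the test results, not as a silently SIGKILLed worker.

```python
# Minimal sketch, assuming eventlet and testtools are available;
# TimeoutGuardedTestCase and TEST_TIMEOUT are illustrative names.
import eventlet
import testtools

TEST_TIMEOUT = 180  # illustrative per-test limit, in seconds


class TimeoutGuardedTestCase(testtools.TestCase):
    def setUp(self):
        super(TimeoutGuardedTestCase, self).setUp()
        # When the limit expires, eventlet raises a Timeout exception in the
        # test's greenthread, so a slow test shows up as a normal failure
        # instead of hanging until something external kills the worker.
        guard = eventlet.Timeout(TEST_TIMEOUT)
        self.addCleanup(guard.cancel)
```

With a guard like this in place, slowness alone would produce a reported test error, which is why an outright "Killed" worker points at something other than tests simply being slow.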
16:10:42 <jlibosva> aha
16:11:04 <jlibosva> one interesting thing I just noticed is that the "Killed" message in console is actually almost a minute after the last log in the killed test
16:11:27 <ihrachys> by the brief look at the log, it doesn't seem that those tests are too long, or their number is huge
16:12:01 <jlibosva> maybe I'll also add a ps output to the kill.sh, so we know which process is being killed
16:12:06 <ihrachys> jlibosva, how do you determine the killed test? the one that is 'inprogress' at the end of the run?
16:12:38 <jlibosva> ihrachys: yes, also you can see that e.g. in the console log I sent, executor {3} doesn't run any other tests after the Killed message
16:12:53 <jlibosva> ihrachys: while there are tests from the {0}, {1} and {2} workers
16:13:32 <ihrachys> I wonder if this ~1 minute delay holds for other failure instances
16:13:33 <jlibosva> one other thing would be to change signals in firewall tests to SIGTERM and then register a handler in the base test class for such signal
16:14:25 <ihrachys> what is 'change signals'?
16:14:34 <jlibosva> default uses SIGKILL
16:14:50 <jlibosva> for killing nc processes that are used to test whether traffic can or cannot pass through
16:16:49 <jlibosva> so if we change from SIGKILL to SIGTERM, we would be able to catch it and reveal more info
16:17:00 <jlibosva> what do you think?
16:17:22 <ihrachys> sorry, I am not that into the test, so I can't find which piece of it starts nc
16:17:40 <ihrachys> oh that's assert_connection ?
16:19:01 <ihrachys> ok I see where it finally bubbles down to RootHelperProcess
16:19:02 <jlibosva> ihrachys: it's NetcatTester
16:21:11 <jlibosva> anyways, I'll try to come up with some more patches, we've already spent a lot of meeting time on this :)
16:21:25 <jlibosva> we can come up with ideas on the LP eventually
16:21:27 <ihrachys> ok
16:21:41 <ihrachys> next was "haleyb to look at why dvr grenade flavour surged to 30% failure rate comparing to 0% for other grenades"
16:22:16 <ihrachys> I don't think there was progress on that one?
16:23:11 <ihrachys> ok let's repeat it, haleyb is probably gone
16:23:28 <ihrachys> #action haleyb to look at why dvr grenade flavour surged to 30% failure rate comparing to 0% for other grenades
16:23:36 <ihrachys> oh I missed one
16:23:40 <ihrachys> "jlibosva to classify current scenario failures and send email to openstack-dev@ asking for help"
16:23:54 <jlibosva> I did and did
16:24:00 <ihrachys> I saw the email
16:24:09 <ihrachys> was there any progress that emerged from it?
16:24:12 <jlibosva> http://lists.openstack.org/pipermail/openstack-dev/2017-July/120294.html
16:24:26 <jlibosva> slaweq sent a patch to enable qos as he wants to work on it
16:24:40 <ihrachys> that is abandoned now
16:24:48 <ihrachys> prolly because yamamoto had the same
16:24:58 <jlibosva> yeah
16:25:00 <jlibosva> https://review.openstack.org/#/c/468326
16:25:01 <ihrachys> the yamamoto patch https://review.openstack.org/#/c/468326
16:25:31 <jlibosva> it also seems there might be no work item and it works now
16:25:40 <ihrachys> ok
16:25:41 <jlibosva> I don't know what fixed the scenario, or whether it just has a low repro rate
16:25:44 <ihrachys> let's land it then and see?
16:25:48 <jlibosva> yep
16:26:08 <ihrachys> I +2d
16:26:27 <ihrachys> anything apart from those qos tests?
16:26:48 <jlibosva> armax looked at the trunk failure, which turned out to be a regression in the ovs firewall
16:26:53 <ihrachys> btw the etherpad tracking the progress is at https://etherpad.openstack.org/p/neutron-dvr-multinode-scenario-gate-failures
16:27:12 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1707339
16:27:12 <openstack> Launchpad bug 1707339 in neutron "test_trunk_subport_lifecycle fails on subport down timeout" [High,In progress] - Assigned to Armando Migliaccio (armando-migliaccio)
16:27:20 <jlibosva> I've been working on it today
16:27:32 <jlibosva> I'll send out a patch soon, I'm just writing some basic UTs
16:28:03 <ihrachys> nice
16:28:27 <jlibosva> other than that, it doesn't seem as popular as the python3 effort
16:28:37 <ihrachys> tempest is hard
16:29:00 <jlibosva> I plan to take one failure at a time and do something about it, but I've been quite busy recently
16:29:09 <ihrachys> we can maybe ask once again, mentioning the progress and asking for more :)
16:29:10 <ihrachys> I imagine
16:29:41 <ihrachys> I can do the shout out
16:29:53 <jlibosva> if you have some cookies to promise, maybe that will help :)
16:30:10 <ihrachys> cookie or a stick, hm. tough choice.
16:30:15 <ihrachys> next was "haleyb to check why dvr-ha job is at ~100% failure rate"
16:30:34 <ihrachys> I talked to haleyb about that
16:30:44 <ihrachys> and it seems like this is a devstack-gate provisioning issue
16:30:46 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1707003
16:30:46 <openstack> Launchpad bug 1707003 in neutron "gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:30:59 <ihrachys> there is a patch that tackles a similar problem, in grenade (abandoned)
16:31:15 <ihrachys> and we will probably need smth similar, but in devstack-gate, since this job is not even grenade
16:31:24 <ihrachys> and Brian is going to work on it
16:31:29 <ihrachys> I imagine it may take a while
16:31:32 <ihrachys> first, because d-g
16:31:45 <ihrachys> second, because reviews there are rare
16:31:46 <ihrachys> :)
16:32:10 <ihrachys> #topic Grafana
16:32:15 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:32:36 <ihrachys> there is a periodic failure for the -pg- job again
16:32:45 <ihrachys> I noticed it today and sent an email to armax about it
16:32:55 <ihrachys> seems like another instance of sqlalchemy not being ready for -pg-
16:33:02 <ihrachys> there is a sql syntax error in the cinder api
16:33:33 <ihrachys> fullstack is still at 80%, and I actually started playing with the test locally yesterday (no progress yet)
16:33:36 <jlibosva> about pg, have you seen the patch about testing pg?
16:33:40 <ihrachys> good thing it reproduces just fine
16:33:48 <ihrachys> jlibosva, which one?
16:33:54 * jlibosva looking
16:34:06 <jlibosva> https://review.openstack.org/#/c/427880/
16:34:41 <ihrachys> oh that's old news yeah
16:34:47 <jlibosva> tldr, it warns users that pg is not thoroughly tested; I wonder whether it is worth the effort to support pg in the gate
16:34:54 <ihrachys> still it seems like we have armax interested in doing it
16:34:59 <ihrachys> I proposed removal of the job in the past
16:35:17 <ihrachys> and armax was all against it and jumped on fixing issues
16:35:18 <jlibosva> I remember, but it was before this was approved
16:35:29 <jlibosva> ok
16:35:39 <ihrachys> I don't think it changes much if we just do it best effort
16:35:55 <ihrachys> if someone benefits from it and maintains it above the water, I am good
16:36:10 <jlibosva> okies
16:36:19 <ihrachys> there is one failure spike that I think is new
16:36:33 <ihrachys> for the linuxbridge grenade multinode job
16:36:41 <ihrachys> currently at 45%
16:37:03 <ihrachys> I know that the job was never stable because of https://bugs.launchpad.net/neutron/+bug/1683256
16:37:03 <openstack> Launchpad bug 1683256 in neutron "linuxbridge multinode depending on multicast support of provider" [High,Confirmed] - Assigned to omkar_telee (omkar-telee)
16:37:33 <ihrachys> Kevin was going to look at it in the past, but it never happened
16:37:42 <ihrachys> and I wonder if that's a sign that we should deprovision the job
16:37:58 <ihrachys> because the only reason we haven't done it a cycle ago was that Kevin was going to make it work
16:38:38 <ihrachys> I guess I will send another patch removing other people's jobs and see how they react ;)
16:38:51 <ihrachys> #action ihrachys to propose a removal for linuxbridge grenade multinode job
16:39:44 <ihrachys> I suspect there is this multicast issue AND something else going on there, but I have no will to dig
16:40:05 <ihrachys> #topic Open discussion
16:40:17 <ihrachys> jlibosva, progress on ovs isolation for fullstack?
16:40:22 <jlibosva> nah
16:40:24 <jlibosva> :)
16:40:26 <ihrachys> ack :)
16:40:29 <ihrachys> anything else?
16:40:32 <jlibosva> I have one topic though
16:40:34 <jlibosva> yep
16:40:38 <jlibosva> I was pinged by the ironic guys
16:40:48 <jlibosva> they are getting hit in their grenade job by a neutron issue
16:40:58 <ihrachys> bug number?
16:40:59 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1707160
16:40:59 <openstack> Launchpad bug 1707160 in neutron "test_create_port_in_allowed_allocation_pools test fails on ironic grenade" [Undecided,Confirmed]
16:41:07 <jlibosva> which probably leads to https://bugs.launchpad.net/neutron/+bug/1705351
16:41:07 <openstack> Launchpad bug 1705351 in neutron "agent notifier getting amqp NotFound exceptions propagated up to it" [High,In progress] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:41:39 <jlibosva> I looked at the logs of the failed job and it seemed to me that the client times out after 60 seconds and then retries the operation
16:42:07 <jlibosva> the thing is that the first operation succeeds after some time > 60 secs and then the second "retry" fails
16:42:18 <jlibosva> so they bumped the cli timeout to 120s but are still getting hit
16:42:33 <jlibosva> I didn't have time to look at why they get hit again
16:42:48 <jlibosva> but I thought it was worth raising here even though it's not Neutron CI but some neutron failure :)
16:43:07 * ihrachys tries to understand why the 2nd bug is assigned to him
16:43:21 <jlibosva> you sent a patch to catch some exceptions
16:43:23 <jlibosva> iirc
16:43:26 <ihrachys> oh, a partial bug
16:43:37 <jlibosva> https://review.openstack.org/#/c/486687/
16:44:07 <ihrachys> jlibosva, how does the 1st bug lead to the 2nd? can you elaborate?
16:44:19 <ihrachys> the 2nd is about an exception raised
16:44:24 <ihrachys> because of a missing exchange
16:44:52 <ihrachys> the former looks more like dns misconfigured or smth else locking the processing thread for a long time
16:45:06 <jlibosva> what I saw was a subnet delete in the ironic gate - the delete was trying to notify dhcp agents they should delete their resources
16:45:38 <jlibosva> but due to some amqp issue, rpc was failing or stuck (don't remember), which caused the subnet deletion to take 97 seconds
16:46:33 <jlibosva> and the tempest rest client called subnet delete again after 60 seconds, but the first call eventually succeeded, while the second failed
16:46:39 <jlibosva> cause the subnet was already gone
16:46:41 <ihrachys> so, one of the reasons I am aware of for a thread to lock on rpc is misconfigured dns, so oslo.messaging takes a while to get a response from libresolv with ipv4/ipv6 addresses.
16:46:49 <jlibosva> so the delay on the server side was caused by the amqp issue
16:47:03 <ihrachys> jlibosva, not THE issue, but A issue, right?
16:47:56 <ihrachys> in the server log, I see "AMQP server on 149.202.183.40:5672 is unreachable: timed out"
16:48:11 <ihrachys> which probably suggests it's not dns
16:48:19 <ihrachys> (the broker is referred to via ip)
16:48:21 <jlibosva> I'm referring to amqp as THE :)
16:48:34 <jlibosva> IIRC there was a trace about a missing queue or something
16:49:11 <ihrachys> oh yeah, i see "NotFound: Basic.publish: (404) NOT_FOUND - no exchange 'q-agent-notifier-port-delete_fanout' in vhost '/'"
16:49:43 <ihrachys> then the 2nd bug is indeed relevant
16:50:34 <ihrachys> with the fix landed on the neutron side for this, I guess we may ask the ironic folks if they are still affected?
16:51:24 <jlibosva> makes sense
16:51:42 <jlibosva> but I don't have much knowledge about the failure there
16:51:46 <jlibosva> just wanted to raise it here
16:52:03 <ihrachys> ack, thanks for that
16:52:08 <ihrachys> I will update them about the fix in LP
16:52:13 <ihrachys> anything else?
16:52:40 <jlibosva> not from me
16:52:45 <ihrachys> cool. thanks for joining.
16:52:45 <ihrachys> #endmeeting
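The retry failure mode jlibosva describes above (a DELETE that times out client-side, eventually succeeds on the server, and then 404s on the retry) can be sketched as below. This is a hypothetical helper built on the requests library, not the tempest REST client or ironic's code; it only illustrates the common mitigation of treating a 404 on a retried DELETE as success.

```python
# Hypothetical helper, not the tempest/ironic client: shows the failure mode
# where a DELETE times out client-side, is retried, and the retry sees 404
# because the first call eventually succeeded on the server.
import requests


def delete_subnet(url, timeout=60, retries=1):
    """Delete a subnet, treating a 404 on a retry as success."""
    for attempt in range(retries + 1):
        try:
            resp = requests.delete(url, timeout=timeout)
        except requests.Timeout:
            # The server may still be processing the first request;
            # retrying here is where the original false failure starts.
            continue
        if attempt > 0 and resp.status_code == 404:
            # Already gone: the earlier, timed-out attempt likely succeeded.
            return
        resp.raise_for_status()
        return
    raise RuntimeError("subnet delete timed out %d times" % (retries + 1))
```

Whether such tolerance is appropriate here depends on why the server-side delete took ~97 seconds in the first place, which is what the amqp exchange fix discussed above targets.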