16:00:58 <ihrachys> #startmeeting neutron_ci
16:00:59 <openstack> Meeting started Tue Aug  1 16:00:58 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:00 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:02 <openstack> The meeting name has been set to 'neutron_ci'
16:01:24 <ihrachys> #topic Actions from prev week
16:01:33 <ihrachys> "haleyb to reach out to QA PTL about switching grenade integrated gate to multinode"
16:01:52 <ihrachys> last comments in https://review.openstack.org/#/c/483600/ suggest there is more leg work there
16:01:57 <ihrachys> talking to all projects affected
16:02:00 * haleyb did, needs to contact additional PTLs to get their sign-off as well
16:02:15 * haleyb is lurking for a few more minutes
16:02:34 <ihrachys> #action haleyb to reach out to all affected parties, and FBI, to get multinode grenade by default
16:02:50 <ihrachys> next is "jlibosva to recheck the tools/kill.sh patch until it hits timeout and report back"
16:03:00 <jlibosva> I did
16:03:05 <jlibosva> but didn't find anything useful
16:03:09 <ihrachys> I believe on the team meeting today we figured it did not help much
16:03:12 <jlibosva> sec
16:03:15 <ihrachys> I reported https://bugs.launchpad.net/neutron/+bug/1707933
16:03:15 <openstack> Launchpad bug 1707933 in neutron "functional tests timeout after a test worker killed" [Critical,Confirmed]
16:03:56 <jlibosva> I see that worker was killed here: http://logs.openstack.org/65/487065/5/check/gate-neutron-dsvm-functional-ubuntu-xenial/f9b22b8/console.html#_2017-07-31_08_05_43_074447
16:04:04 <jlibosva> at 08:05:43
16:04:36 <jlibosva> there were 'kill' calls around that time, but none with the worker's pid
16:05:22 <ihrachys> what is so special about FirewallTestCase ?
16:05:39 <ihrachys> or probably BaseFirewallTestCase
16:05:48 <ihrachys> because the ipv6 class is also affected
16:06:02 <jlibosva> what do you mean?
16:06:34 <ihrachys> well, almost all breakages have firewall tests timing out in the end, no?
16:06:45 <ihrachys> so the workers die on running firewall tests
16:07:16 <jlibosva> it's unclear to me if the test kills itself or it kills other running test
16:07:35 <jlibosva> but there is a RootHelperProcess class (or something like that)
16:07:46 <jlibosva> that takes care of finding correct pid to kill
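
A rough sketch of the pid-resolution idea described above: the process is launched through a root helper (sudo/rootwrap), so the pid the test holds belongs to the wrapper and the real target has to be found among its children. The helper names below are illustrative, not the actual neutron RootHelperProcess API.

    import os
    import signal
    import subprocess


    def find_child_pids(pid):
        """Return the direct child pids of `pid`, or an empty list."""
        try:
            out = subprocess.check_output(
                ['ps', '--ppid', str(pid), '-o', 'pid='])
        except subprocess.CalledProcessError:
            # ps exits non-zero when there are no matching processes
            return []
        return [int(p) for p in out.split()]


    def kill_leaf_process(wrapper_pid, sig=signal.SIGKILL):
        """Walk down the wrapper's children and signal the leaf process."""
        pid = wrapper_pid
        children = find_child_pids(pid)
        while children:
            pid = children[0]
            children = find_child_pids(pid)
        os.kill(pid, sig)
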
16:07:46 <ihrachys> ok, but again, why are the firewall tests the ones being killed, even if they are not the source of the kills?
16:07:54 <ihrachys> maybe it's just because they are slow?
16:08:00 <ihrachys> so the chance of hitting them is high?
16:08:21 <jlibosva> I don't think it has anything to do with how fast they are
16:08:28 <jlibosva> there is an eventlet timeout per test
16:08:39 <jlibosva> so slow tests should raise an exception if they don't finish in time
16:08:41 <jlibosva> IIRC
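
A minimal sketch of the per-test timeout mechanism mentioned above, assuming plain eventlet; the timeout value and wrapper function are illustrative, not the exact neutron test fixture.

    import eventlet

    TEST_TIMEOUT = 180  # seconds; the value here is illustrative


    def run_with_timeout(test_callable):
        # eventlet.Timeout raises inside the running greenthread if the
        # callable does not return in time, so a slow test fails loudly
        # instead of the worker silently hanging
        with eventlet.Timeout(TEST_TIMEOUT):
            test_callable()
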
16:10:25 <ihrachys> no, I mean, it seems like we see firewall tests involved whenever the kill happens; it's either because there is some bug in the test, or because a bug happens somewhere else and firewall tests take such a huge chunk of the total time that the kill hits them so often.
16:10:42 <jlibosva> aha
16:11:04 <jlibosva> one interesting thing I just noticed is that the "Killed" message in the console actually comes almost a minute after the last log line from the killed test
16:11:27 <ihrachys> from a brief look at the log, it doesn't seem that those tests are too long, or that their number is huge
16:12:01 <jlibosva> maybe I'll also add ps output to kill.sh, so we know which process is being killed
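
A sketch of that debugging idea, written in Python purely for illustration even though the real change would go into the tools/kill.sh shell wrapper: log which process is about to be signalled before delivering the signal.

    import subprocess
    import sys


    def logged_kill(pid, sig='KILL'):
        # 'ps -fp <pid>' records the full command line of the process we
        # are about to signal, so the console log shows the victim
        info = subprocess.run(['ps', '-fp', str(pid)],
                              capture_output=True, text=True).stdout
        sys.stderr.write('sending SIG%s to:\n%s' % (sig, info))
        subprocess.run(['kill', '-%s' % sig, str(pid)])
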
16:12:06 <ihrachys> jlibosva, how do you determine the killed test? the one that is 'inprogress' at the end of the run?
16:12:38 <jlibosva> ihrachys: yes, also you can see that e.g. in the console log I sent, executor {3} doesn't run any other tests after Killed message
16:12:53 <jlibosva> ihrachys: while there are tests from {0}, {1} and {2} workers
16:13:32 <ihrachys> I wonder if this ~1 minute delay holds for other failure instances too
16:13:33 <jlibosva> one other thing would be to change the signal in firewall tests to SIGTERM and then register a handler for it in the base test class
16:14:25 <ihrachys> what is 'change signals'?
16:14:34 <jlibosva> default uses SIGKILL
16:14:50 <jlibosva> for killing nc processes that are used to test whether traffic can or cannot pass through
16:16:49 <jlibosva> so if we change from SIGKILL to SIGTERM, we would be able to catch it and reveal more info
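
A sketch of that proposal, assuming the base test class can install a signal handler in setUp(); the helper names below are illustrative, not the actual neutron code.

    import os
    import signal
    import sys
    import traceback


    def _sigterm_handler(signum, frame):
        # dump the stack of whatever the worker was doing when it got
        # SIGTERM, then restore the default handler and re-deliver the
        # signal so the process still dies as before
        sys.stderr.write('received SIGTERM, current stack:\n')
        traceback.print_stack(frame, file=sys.stderr)
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
        os.kill(os.getpid(), signal.SIGTERM)


    def install_sigterm_handler():
        # would be called from something like the base test case's setUp()
        signal.signal(signal.SIGTERM, _sigterm_handler)
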
16:17:00 <jlibosva> what do you think?
16:17:22 <ihrachys> sorry, I am not that into the test, so I can't find which piece of it starts nc
16:17:40 <ihrachys> oh that's assert_connection ?
16:19:01 <ihrachys> ok I see where it finally bubbles down to RootHelperProcess
16:19:02 <jlibosva> ihrachys: it's NetcatTester
16:21:11 <jlibosva> anyways, I'll try to come up with some more patches, we've already spent a lot of meeting time on this :)
16:21:25 <jlibosva> we can come up with ideas on the LP eventually
16:21:27 <ihrachys> ok
16:21:41 <ihrachys> next was "haleyb to look at why dvr grenade flavour surged to 30% failure rate comparing to 0% for other grenades"
16:22:16 <ihrachys> I don't think there was progress on that one?
16:23:11 <ihrachys> ok let's repeat it, haleyb is probably gone
16:23:28 <ihrachys> #action haleyb to look at why dvr grenade flavour surged to 30% failure rate comparing to 0% for other grenades
16:23:36 <ihrachys> oh I missed one
16:23:40 <ihrachys> "jlibosva to classify current scenario failures and send email to openstack-dev@ asking for help"
16:23:54 <jlibosva> I did and did
16:24:00 <ihrachys> I saw the email
16:24:09 <ihrachys> was there any progress that emerged from it?
16:24:12 <jlibosva> http://lists.openstack.org/pipermail/openstack-dev/2017-July/120294.html
16:24:26 <jlibosva> slaweq sent a patch to enable qos as he wants to work on it
16:24:40 <ihrachys> that is abandoned now
16:24:48 <ihrachys> probably because yamamoto had the same patch
16:24:58 <jlibosva> yeah
16:25:00 <jlibosva> https://review.openstack.org/#/c/468326
16:25:01 <ihrachys> the yamamoto patch https://review.openstack.org/#/c/468326
16:25:31 <jlibosva> it also seems there might be no work item left and it works now
16:25:40 <ihrachys> ok
16:25:41 <jlibosva> I don't know whether something fixed the scenario or it just has a low repro rate
16:25:44 <ihrachys> let's land then and see?
16:25:48 <jlibosva> yep
16:26:08 <ihrachys> I +2d
16:26:27 <ihrachys> apart from those qos tests?
16:26:48 <jlibosva> armax looked at the trunk failure, which turned out to be a regression in the ovs firewall
16:26:53 <ihrachys> btw the etherpad tracking the progress is at https://etherpad.openstack.org/p/neutron-dvr-multinode-scenario-gate-failures
16:27:12 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1707339
16:27:12 <openstack> Launchpad bug 1707339 in neutron "test_trunk_subport_lifecycle fails on subport down timeout" [High,In progress] - Assigned to Armando Migliaccio (armando-migliaccio)
16:27:20 <jlibosva> I've been working on it today
16:27:32 <jlibosva> I'll send out patch soon, I'm writing just some basic UTs
16:28:03 <ihrachys> nice
16:28:27 <jlibosva> other than that, it doesn't seem to be as popular as the python3 effort
16:28:37 <ihrachys> tempest is hard
16:29:00 <jlibosva> I plan to take one failure at a time and do something about it, but I've been recently quite busy
16:29:09 <ihrachys> we can maybe ask once again, mentioning the progress and asking for more :)
16:29:10 <ihrachys> I imagine
16:29:41 <ihrachys> I can do the shout out
16:29:53 <jlibosva> if you have some cookies to promise, maybe that will help :)
16:30:10 <ihrachys> cookie or a stick, hm. tough choice.
16:30:15 <ihrachys> next was "haleyb to check why dvr-ha job is at ~100% failure rate"
16:30:34 <ihrachys> I talked to haleyb about that
16:30:44 <ihrachys> and it seems like this is a devstack-gate provisioning issue
16:30:46 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1707003
16:30:46 <openstack> Launchpad bug 1707003 in neutron "gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:30:59 <ihrachys> there is a patch that tackles a similar problem, in grenade (abandoned)
16:31:15 <ihrachys> and we will probably need something similar, but in devstack-gate, since this job is not even a grenade job
16:31:24 <ihrachys> and Brian is going to work on it
16:31:29 <ihrachys> I imagine it may take a while
16:31:32 <ihrachys> first, because d-g
16:31:45 <ihrachys> second, because reviews there are rare
16:31:46 <ihrachys> :)
16:32:10 <ihrachys> #topic Grafana
16:32:15 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:32:36 <ihrachys> there is a periodic failure for -pg- job again
16:32:45 <ihrachys> I noticed it today and sent an email to armax about it
16:32:55 <ihrachys> seems like another instance of sqlalchemy not being ready for -pg-
16:33:02 <ihrachys> there is a sql syntax error in cinder api
16:33:33 <ihrachys> fullstack is still at 80%, and I actually started playing with the test locally yesterday (no progress yet)
16:33:36 <jlibosva> about pg, have you seen the patch about testing pg?
16:33:40 <ihrachys> good thing it reproduces just fine
16:33:48 <ihrachys> jlibosva, which one?
16:33:54 * jlibosva looking
16:34:06 <jlibosva> https://review.openstack.org/#/c/427880/
16:34:41 <ihrachys> oh that's old news yeah
16:34:47 <jlibosva> tl;dr, it warns users that pg is not thoroughly tested; I wonder whether it is worth the effort to support pg in the gate
16:34:54 <ihrachys> still it seems like we have armax interested in doing it
16:34:59 <ihrachys> I proposed removal of the job in the past
16:35:17 <ihrachys> and armax was against it and jumped on fixing the issues
16:35:18 <jlibosva> I remember, but it was before this was approved
16:35:29 <jlibosva> ok
16:35:39 <ihrachys> I don't think it changes much if we just do it best effort
16:35:55 <ihrachys> if someone benefits from it and keeps it above water, I am good
16:36:10 <jlibosva> okies
16:36:19 <ihrachys> there is one failure spike that I think is new
16:36:33 <ihrachys> for linuxbridge grenade multinode job
16:36:41 <ihrachys> currently at 45%
16:37:03 <ihrachys> I know that the job was never stable because of https://bugs.launchpad.net/neutron/+bug/1683256
16:37:03 <openstack> Launchpad bug 1683256 in neutron "linuxbridge multinode depending on multicast support of provider" [High,Confirmed] - Assigned to omkar_telee (omkar-telee)
16:37:33 <ihrachys> Kevin was going to look at it in the past, but it never happened
16:37:42 <ihrachys> and I wonder if that's a sign that we should deprovision the job
16:37:58 <ihrachys> because the only reason we didn't a cycle ago was that Kevin was going to make it work
16:38:38 <ihrachys> I guess I will send another patch removing other people's jobs and see how they react ;)
16:38:51 <ihrachys> #action ihrachys to propose a removal for linuxbridge grenade multinode job
16:39:44 <ihrachys> I suspect there is the multicast issue AND something else going on there, but I have no will to dig
16:40:05 <ihrachys> #topic Open discussion
16:40:17 <ihrachys> jlibosva, progress on ovs isolation for fullstack?
16:40:22 <jlibosva> nah
16:40:24 <jlibosva> :)
16:40:26 <ihrachys> ack :)
16:40:29 <ihrachys> anything else?
16:40:32 <jlibosva> I have one topic though
16:40:34 <jlibosva> yep
16:40:38 <jlibosva> I was pinged by ironic guys
16:40:48 <jlibosva> they are getting hit in their grenade job by a neutron issue
16:40:58 <ihrachys> bug number?
16:40:59 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1707160
16:40:59 <openstack> Launchpad bug 1707160 in neutron "test_create_port_in_allowed_allocation_pools test fails on ironic grenade" [Undecided,Confirmed]
16:41:07 <jlibosva> which probably leads to https://bugs.launchpad.net/neutron/+bug/1705351
16:41:07 <openstack> Launchpad bug 1705351 in neutron "agent notifier getting amqp NotFound exceptions propagated up to it" [High,In progress] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:41:39 <jlibosva> I looked at the logs of the failed job and it seemed to me that the client times out after 60 seconds and then retries the operation
16:42:07 <jlibosva> the thing is that the first operation succeeds after some time > 60 secs and then the second "retry" fails
16:42:18 <jlibosva> so they bumped the cli timeout to 120s but are still getting hit
16:42:33 <jlibosva> I didn't have time to look at why they get hit again
16:42:48 <jlibosva> but I thought it was worth raising here even though it's not Neutron CI but some neutron failure :)
16:43:07 * ihrachys tries to understand why the 2nd bug is assigned to him
16:43:21 <jlibosva> you sent a patch to catch some exceptions
16:43:23 <jlibosva> iirc
16:43:26 <ihrachys> oh, a partial bug
16:43:37 <jlibosva> https://review.openstack.org/#/c/486687/
16:44:07 <ihrachys> jlibosva, how does the 1st bug lead to 2nd? can you elaborate?
16:44:19 <ihrachys> the 2nd is about an exception raised
16:44:24 <ihrachys> because of exchange missing
16:44:52 <ihrachys> the former looks more like misconfigured dns or something else locking the processing thread for a long time
16:45:06 <jlibosva> what I saw was subnet delete in the ironic gate - the delete was trying to notify dhcp agents they should delete their resources
16:45:38 <jlibosva> but due to some amqp issue, rpc was failing or stuck (don't remember), which caused the subnet deletion operation to take 97 seconds
16:46:33 <jlibosva> and the tempest rest client called subnet delete again after 60 seconds, but the first call eventually succeeded, while the second failed
16:46:39 <jlibosva> because the subnet was already gone
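
A sketch of the race described above, with a hypothetical delete_subnet() call standing in for the tempest rest client: the first DELETE exceeds the client timeout but eventually succeeds server-side, so the retry gets a 404.

    import requests

    CLIENT_TIMEOUT = 60  # seconds; the job later bumped this to 120


    def delete_subnet(url):
        try:
            # first attempt: the server is stuck on a slow RPC
            # notification, so this times out even though the delete
            # eventually completes
            return requests.delete(url, timeout=CLIENT_TIMEOUT)
        except requests.Timeout:
            # retry: by now the first delete has finished server-side, so
            # the subnet is gone and the retry gets a 404
            return requests.delete(url, timeout=CLIENT_TIMEOUT)
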
16:46:41 <ihrachys> so, one of the reasons I am aware of for a thread to block on rpc is misconfigured dns, where oslo.messaging takes a while to get a response from libresolv with ipv4/ipv6 addresses.
16:46:49 <jlibosva> so the delay on the server side was caused by the amqp issue
16:47:03 <ihrachys> jlibosva, not THE issue, but A issue, right?
16:47:56 <ihrachys> in the server log, I see AMQP server on 149.202.183.40:5672 is unreachable: timed out
16:48:11 <ihrachys> which probably suggests it's not dns
16:48:19 <ihrachys> (the broker is referred to via ip)
16:48:21 <jlibosva> I'm referring to amqp as THE :)
16:48:34 <jlibosva> IIRC there was a trace about missing queue or something
16:49:11 <ihrachys> oh yeah, i see "NotFound: Basic.publish: (404) NOT_FOUND - no exchange 'q-agent-notifier-port-delete_fanout' in vhost '/'"
16:49:43 <ihrachys> then the 2nd bug is indeed relevant
16:50:34 <ihrachys> with the fix for this landed on the neutron side, I guess we can ask the ironic folks if they are still affected?
16:51:24 <jlibosva> makes sense
16:51:42 <jlibosva> but I don't have much knowledge about the failure there
16:51:46 <jlibosva> just wanted to raise it here
16:52:03 <ihrachys> ack, thanks for that
16:52:08 <ihrachys> I will update them about the fix in LP
16:52:13 <ihrachys> anything else?
16:52:40 <jlibosva> not from me
16:52:45 <ihrachys> cool. thanks for joining.
16:52:45 <ihrachys> #endmeeting