16:00:50 <ihrachys|afk> #startmeeting neutron_ci
16:00:51 <openstack> Meeting started Tue Jun  6 16:00:50 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys|afk. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:53 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:54 <ihrachys|afk> good day everyone
16:00:55 <jlibosva> o/
16:00:56 <openstack> The meeting name has been set to 'neutron_ci'
16:01:03 * ihrachys|afk waves at haleyb
16:01:20 * haleyb waves back
16:01:28 <ihrachys> as usual, starting with actions from prev week
16:01:33 <ihrachys> #topic Actions from prev week
16:01:51 <ihrachys> first is "jlibosva to understand why instance failed to up networking in trunk conn test: https://review.openstack.org/#/c/462227/"
16:02:14 <jlibosva> I sent a new PS today
16:02:26 <ihrachys> yeah, still failing, though in a different way it seems
16:02:29 <jlibosva> I suspect it was because the port security was disabled *after* the instance booted
16:02:33 <ihrachys> for linuxbridge
16:02:48 <ihrachys> http://logs.openstack.org/27/462227/5/check/gate-tempest-dsvm-neutron-scenario-linuxbridge-ubuntu-xenial-nv/f85a4b2/testr_results.html.gz
16:02:54 <jlibosva> so now I changed the approach to disable it by default, and after the instances are up it will be enabled for LB
16:02:56 <jlibosva> looking
16:04:08 <jlibosva> I didn't test it with LB as I have ovs-agt only
16:04:38 <ihrachys> ok
16:04:39 <ihrachys> KeyError: 'port_security_enabled'
16:04:44 <ihrachys> this really looks like port-sec not enabled
16:05:00 <ihrachys> anyhoo, not a bother for the meeting I think
16:05:04 <ihrachys> let's move on
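A side note on the port security issue discussed above: the Neutron API refuses to set port_security_enabled=False on a port that still has security groups attached, so the groups have to be cleared in the same (or an earlier) update. A minimal sketch with python-neutronclient; the credentials and port UUID are hypothetical, and this is not the actual scenario test code:

    from neutronclient.v2_0 import client

    # hypothetical devstack-style credentials
    neutron = client.Client(username='admin', password='secret',
                            tenant_name='admin',
                            auth_url='http://127.0.0.1:5000/v2.0')

    port_id = 'PORT-UUID'  # hypothetical port used by the trunk test
    neutron.update_port(port_id, {'port': {
        'security_groups': [],           # must be emptied before disabling
        'port_security_enabled': False,
    }})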
16:05:19 <ihrachys> next is "jlibosva to fetch and categorize functional py3 failures"
16:05:35 <ihrachys> I see smth in https://etherpad.openstack.org/p/py3-neutron-pike
16:05:42 <jlibosva> so I categorized them into 12 failures: https://etherpad.openstack.org/p/py3-neutron-pike
16:06:37 <ihrachys> nice
16:06:44 <ihrachys> now we need to decide what to do with the list
16:07:06 <ihrachys> considering that we all have our hands full with stuff, maybe we can craft a request for action and send it to openstack-dev?
16:07:34 <ihrachys> maybe also prioritizing them
16:08:03 <ihrachys> some of those may look different but be the same issue; I would like to start where we are pretty sure they are unique
16:08:03 <jlibosva> or in 'spare time' - we can write our name to the number and try to produce some patch
16:08:24 <ihrachys> like one ovs firewall; one wsgi; one sqlfixture
16:08:37 <ihrachys> then once those are tackled, we can revisit the results and see what's still there
16:08:44 <ihrachys> what do you think?
16:09:19 <jlibosva> I wanted to add that some failures might be related to 3rd party libraries not working with python3 - like ovsdbapp or ryu
16:09:32 <jlibosva> as some failures occur only with these drivers
16:10:10 <ihrachys> aha
16:10:23 <ihrachys> well good news is I think we have links to their authors ;)
16:10:45 <ihrachys> maybe worth pulling in those people for failures we suspect are related to the libs
16:10:56 <ihrachys> I am sure otherwiseguy will be able to help with ovsdbapp
16:11:05 <ihrachys> and yamamoto should know whom to pull for ryu
16:11:16 <ihrachys> jlibosva, are you up to craft the mail?
16:11:21 <jlibosva> I haven't confirmed it's really there, but maybe it would be worth e.g. enabling python3 functional tests for ovsdbapp
16:11:25 <ihrachys> (assuming you think it's the right thing)
16:11:37 <jlibosva> yeah, you can make me an AI
16:11:45 <ihrachys> jlibosva, ovsdbapp has functional job?
16:11:50 <ihrachys> ok
16:11:51 <jlibosva> ihrachys: but not python3 flavor
16:11:56 <jlibosva> or does it?
16:12:00 <jlibosva> it didn't last time I checked
16:12:07 <ihrachys> #action jlibosva to craft an email to openstack-dev@ with func-py3 failures and request for action
16:12:39 <ihrachys> there is func job in ovsdbapp as can be seen in e.g. https://review.openstack.org/#/c/470441/
16:13:01 <jlibosva> ihrachys: but that runs with python2
16:13:05 <ihrachys> yeah I know
16:13:13 <ihrachys> just saying there is a job that we could dup for py3
16:13:18 <jlibosva> ah, ok
16:13:30 <jlibosva> there is not much inside though afair :)
16:13:34 <ihrachys> I would start with talking to Terry about it (maybe through same venue)
16:13:46 <ihrachys> there should be no expectation we pull it all ourselves
16:14:58 <ihrachys> ok let's move on, thanks for the work, good progress
16:15:06 <ihrachys> next is "jlibosva to talk to otherwiseguy about isolating ovsdb/ovs agent per fullstack 'machine'"
16:15:11 <ihrachys> boy, you have a lot on your plate
16:15:26 <jlibosva> oh, that didn't happen
16:15:31 <jlibosva> cause I forgot
16:15:31 <ihrachys> that's related to trunk test instability in fullstack job
16:15:42 <ihrachys> ok lemme repeat the AI for the next week
16:15:48 <ihrachys> #action jlibosva to talk to otherwiseguy about isolating ovsdb/ovs agent per fullstack 'machine'
16:15:55 <jlibosva> I should stop trusting my memory
16:16:13 <ihrachys> jlibosva, I usually create a trello card for each thing I say I will look at
16:16:33 <ihrachys> doesn't guarantee I do, but at least it makes me conscious about it being on the plate
16:16:37 <jlibosva> I did create two after the meeting without checking the logs
16:16:56 <ihrachys> I do right away, I don't trust myself :)
16:16:59 <ihrachys> ok, next was "ihrachys to understand why functional job spiked on weekend"
16:17:46 <ihrachys> so the spike (and current instability) is because the job fails on one of the clouds, where the cloud uses the same IP range as the job
16:17:57 <ihrachys> the fix is https://review.openstack.org/#/c/469189/
16:18:12 <ihrachys> which is a switch to devstack-gate for the functional job (and fullstack while at it)
16:18:24 <ihrachys> d-g knows the correct ip range to use for devstack
16:18:46 <ihrachys> there is an issue with the switch right now, since fullstack doesn't use rootwrap, and sudo is disabled by d-g
16:19:36 <jlibosva> aha, so that's why you want to use rootwrap in the test runner :)
16:19:37 <ihrachys> well, we have one piece of rootwrap transition in already, for deployed resources: https://review.openstack.org/459110
16:19:46 <ihrachys> but test runner needs that too
16:19:56 <ihrachys> and the patch for that is at https://review.openstack.org/471097
16:19:59 <ihrachys> jlibosva, yes :)
16:20:09 <ihrachys> the patch is still failing, have to look at it
16:20:52 <ihrachys> I will update the next week about progress, if it's not merged till then
16:21:20 <ihrachys> #action ihrachys to update about functional/fullstack switch to devstack-gate and rootwrap
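On the rootwrap point above, a toy sketch of what "use rootwrap instead of plain sudo" means for a test helper once devstack-gate disables passwordless sudo: privileged commands are prefixed with the root helper, and rootwrap only executes them if they match a whitelisted filter. The helper function below is illustrative, not the actual patch:

    import shlex
    import subprocess

    # standard neutron root helper; rootwrap checks each command against the
    # filters under /etc/neutron/rootwrap.d/ before running it as root
    ROOT_HELPER = 'sudo neutron-rootwrap /etc/neutron/rootwrap.conf'

    def run_as_root(cmd):
        """Run cmd through the root helper rather than invoking sudo directly."""
        return subprocess.check_output(shlex.split(ROOT_HELPER) + cmd)

    print(run_as_root(['ip', 'netns', 'list']))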
16:21:26 <ihrachys> ok next was "haleyb to monitor dvr+ha job and maybe replace existing dvr-multinode"
16:21:38 <ihrachys> haleyb, how's the job feeling these days?
16:22:16 <haleyb> that dashboard is a mess, the job isn't perfect
16:23:06 <ihrachys> haleyb, totally agreed about the dash
16:23:23 <ihrachys> haleyb, not perfect as in higher failure rate?
16:24:02 <haleyb> ihrachys: it's close to the dvr-multinode job
16:24:17 <haleyb> maybe 5% higher
16:24:36 <ihrachys> do we have a grasp of pressing issues there?
16:25:18 <haleyb> i don't think there's any dvr-specific failure from what i've looked at
16:26:06 <haleyb> this is just looking at the check queue jobs, the gate is clearly better since we don't push things in with failures
16:27:14 <haleyb> i will continue to watch it, wouldn't be comfortable changing it right now
16:28:00 <ihrachys> ok. one thing that may help is going through, let's say, the last 30 patches and seeing how it failed there. that can give a clue where to look to make it less scary.
16:28:20 <ihrachys> if we don't know the specific issues that hit it, we can't really make progress towards enabling it
16:28:20 <ihrachys> so
16:28:30 <ihrachys> ok let's monitor/look at it and check next week
16:28:44 <ihrachys> #action haleyb to continue looking at prospects of dvr+ha job
16:28:55 <ihrachys> next in line was "ihrachys to talk to qa/keystone and maybe remove v3-only job"
16:29:07 <ihrachys> I haven't done that, will hopefully find some time this week
16:29:09 <ihrachys> #action ihrachys to talk to qa/keystone and maybe remove v3-only job
16:29:18 <ihrachys> it's not very pressing
16:29:22 <ihrachys> next was "haleyb to analyze all the l3 job flavours in gate/check queues and see where we could trim"
16:30:06 <haleyb> i am not done with that one, still need to look at all the configs for the jobs
16:30:54 <ihrachys> take your time
16:31:01 <ihrachys> I will carry it over to the next week
16:31:02 <ihrachys> #action haleyb to analyze all the l3 job flavours in gate/check queues and see where we could trim
16:31:11 <ihrachys> and these are all we had from prev meeting
16:31:16 <ihrachys> #topic Grafana
16:31:22 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:31:53 <ihrachys> one thing to note is ~12% failure rate *in gate* for unittests
16:32:09 <ihrachys> not sure exactly, but it could be https://review.openstack.org/#/c/469602/
16:32:20 <ihrachys> I still have to get back to it to fix the tests
16:32:45 <ihrachys> if someone would like to take it over while I look at functional job that would be great
16:33:46 <ihrachys> another thing to note is, the linuxbridge job seems to be at a horrible rate
16:33:50 <ihrachys> 30%?
16:33:54 <ihrachys> and it's in gate
16:34:00 <ihrachys> what's going on there?
16:34:22 <clarkb> http://status.openstack.org/elastic-recheck/data/integrated_gate.html will give you a list of the recent fails if you need to dig in
16:34:24 <jlibosva> there was a failure in detaching vifs
16:34:26 <haleyb> i could look at those unit test failures for your review
16:34:42 <ihrachys> mlavalle, are you aware of any tempest failures that could affect linuxbridge? vif detach nova, is it the bug?
16:34:48 <ihrachys> haleyb, please do, thanks
16:34:52 <jlibosva> ihrachys: https://bugs.launchpad.net/nova/+bug/1696006
16:34:54 <openstack> Launchpad bug 1696006 in neutron "Libvirt fails to detach network interface with Linux bridge" [Critical,New]
16:35:00 <haleyb> ihrachys: we talked about that in neutron meeting, right?
16:35:11 <jlibosva> haleyb: ihrachys connected later
16:35:46 <ihrachys> yeah I suck. I can read the logs instead
16:35:53 <ihrachys> so we think it's it?
16:36:02 <jlibosva> I compared whether the times when we bumped os-vif correlate with the failure occurrence
16:36:12 <jlibosva> but I didn't find any patch in particular, I started looking at nova code
16:36:20 <haleyb> ihrachys: possible libvirt issue from what mlavalle saw - failure during port_delete causing this
16:36:28 <ihrachys> is nova team aware of this pressing issue?
16:36:38 <ihrachys> aware as in actively work on?
16:36:45 <jlibosva> 32 hits for 24h
16:36:53 <haleyb> i think he just filed the bug last night
16:37:07 <jlibosva> not sure, but mlavalle did a good triage and is looking at it
16:37:33 <ihrachys> ok
16:37:52 <ihrachys> mriedem, https://bugs.launchpad.net/nova/+bug/1696125 affects neutron gate a lot. can we bump priority on it?
16:37:53 <openstack> Launchpad bug 1696125 in OpenStack Compute (nova) "Detach interface failed - Unable to detach from guest transient domain (pike)" [Medium,Confirmed]
16:39:56 <ihrachys> I guess Matt is not avail
16:39:59 <mriedem> i'm here
16:40:09 <ihrachys> ok
16:40:10 <mriedem> i'm always here for you ihar
16:40:12 <ihrachys> :)
16:40:21 * ihrachys hugs mriedem
16:40:27 <mriedem> i've got some tabs open,
16:40:35 <mriedem> dealing with some other stuff atm and then that this afternoon
16:40:38 <ihrachys> so what's about this bug? is it on the radar for nova?
16:40:49 <mriedem> yeah https://review.openstack.org/#/c/441204/6 needs to be updated
16:40:54 <mriedem> it's on my radar
16:41:06 <mriedem> no one else in nova probably is aware or cares
16:41:15 <ihrachys> ok cool. I will add myself to reviewers to monitor progress.
16:41:25 <ihrachys> thanks for caring
16:42:04 <ihrachys> looking at other grafana dashboards, they are mostly ok-ish, or it's functional/fullstack/scenarios that we know about and already covered
16:42:35 <ihrachys> moving to bugs
16:42:47 <ihrachys> #topic Gate bugs
16:43:17 <ihrachys> one thing that popped up today is that it seems like neutron broke the tripleo pipeline
16:43:20 <ihrachys> https://bugs.launchpad.net/tripleo/+bug/1696094
16:43:21 <openstack> Launchpad bug 1696094 in tripleo "CI: ovb-ha promotion job fails with 504 gateway timeout, neutron-server create-subnet timing out" [Critical,Triaged]
16:43:38 <ihrachys> as per the logs, it seems like neutron-server serves a subnet create request for 2 minutes+
16:43:50 <ihrachys> and holds some locks for 60s+
16:44:25 <ihrachys> I suspect it's something like eventlet interacting badly with workers. like a green thread not yielding
16:44:42 <ihrachys> the ~60s is suspicious, it's same in all failure runs I looked at
16:44:54 <jlibosva> do we monkey patch server? :)
16:45:14 <ihrachys> jlibosva, we do, via neutron/common/eventlet_utils.py
16:45:26 <ihrachys> which is called from neutron/cmd/eventlet/__init__.py
16:45:36 <ihrachys> and neutron-server entrypoint is under it
16:45:58 <ihrachys> there are some suspects https://review.openstack.org/#/c/471345/ and https://review.openstack.org/#/c/471357/
16:46:27 <ihrachys> but really it's just a silly way to find late changes that seem related in some way :)
16:47:03 <ihrachys> ideally someone would run with the bug from there, but I don't know of anyone actively working on it right now
16:48:36 <ihrachys> ok I guess it may require some broader venue to advertise the issue
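To illustrate the "green thread not yielding" suspicion above: once eventlet monkey patching is in place, scheduling is cooperative, so a handler that never hits a patched blocking call (I/O, sleep, a lock wait) starves every other green thread in the same worker for its whole duration. A toy example, not neutron code:

    import eventlet
    eventlet.monkey_patch()

    import time

    def busy_handler():
        # CPU-bound loop with no I/O or sleep, so it never yields to the hub
        deadline = time.time() + 5
        while time.time() < deadline:
            pass
        print('busy handler done')

    def other_request():
        print('other request served')

    pool = eventlet.GreenPool()
    pool.spawn(busy_handler)
    pool.spawn(other_request)  # starved until busy_handler finishes
    pool.waitall()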
16:49:26 <ihrachys> looking at the list of other bugs here: https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=-id&start=0
16:50:19 <ihrachys> it doesn't seem there is anything on the list that was not covered and a new issue
16:50:27 <ihrachys> so let's move on
16:50:33 <ihrachys> #topic Open discussion
16:50:45 <ihrachys> anyone has anything to share? any concerns?
16:51:06 <jlibosva> I saw the pep8 job failing - is it known issue?
16:51:20 <jlibosva> I just saw it on one of the patches in this meeting
16:51:31 <ihrachys> jlibosva, link?
16:51:56 <haleyb> https://review.openstack.org/#/c/469602/
16:52:00 <jlibosva> https://review.openstack.org/#/c/469189/
16:52:06 <haleyb> oops, that wasn't it
16:52:24 * haleyb knew it was one of ihar's patches
16:52:46 <ihrachys> oh this. I just suck and uploaded a patch with a pep8 violation
16:52:49 <ihrachys> nothing to look here :)
16:53:12 <haleyb> ihrachys: but you didn't touch the file it was complaining about
16:53:15 <jlibosva> yeah
16:53:24 <ihrachys> haleyb, it's based on another patch
16:53:30 <ihrachys> that touches it
16:53:37 <jlibosva> aaah
16:53:39 <ihrachys> ok unless someone else has more to share, I call it a day in 30s
16:53:47 <haleyb> i'll call it lunch
16:53:54 <jlibosva> :)
16:54:05 <ihrachys> heh
16:54:14 <ihrachys> #endmeeting
16:54:36 <ihrachys> have I done smth wrong?
16:54:41 <ihrachys> where is the bot?
16:54:59 <ihrachys|afk> #endmeeting