16:01:21 <ihrachys> #startmeeting neutron_ci
16:01:27 <openstack> Meeting started Tue Mar  7 16:01:21 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:28 <jlibosva> o/
16:01:29 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:29 <ihrachys> hi everyone
16:01:31 <openstack> The meeting name has been set to 'neutron_ci'
16:02:12 <mlavalle> o/
16:02:13 <ihrachys> jlibosva: kevinbenton: manjeets:
16:02:24 * jlibosva already waved
16:02:29 * ihrachys waves back at mlavalle
16:02:48 <ihrachys> jlibosva: I long for people's attention, it's never enough
16:03:19 <ihrachys> ok let's start with reviewing action items from the previous meeting
16:03:24 <ihrachys> #topic Action items from previous meeting
16:03:33 <ihrachys> "ihrachys to monitor e-r irc bot reporting in the channel"
16:03:44 <ihrachys> so, I haven't seen a single report in the channel from the bot
16:03:59 <ihrachys> so I guess now I have another action item to track, which is to fix it :)
16:04:09 <ihrachys> #action ihrachys fix e-r bot not reporting in irc channel
16:04:19 <ihrachys> next is: "manjeets to repropose the CI dashboard script for reviewday"
16:04:42 <ihrachys> I see this proposed: https://review.openstack.org/#/c/439114/
16:05:27 <ihrachys> I guess we can't check the result before landing
16:07:23 <ihrachys> ok it seems like a good start, I will review after the meeting
16:07:38 <mlavalle> ihrachys: do you have a pointer?
16:07:47 <ihrachys> mlavalle: pointer to?
16:07:57 <mlavalle> manjeets patchset
16:08:08 <ihrachys> I thought I pasted it
16:08:08 <ihrachys> https://review.openstack.org/#/c/439114/
16:08:16 <mlavalle> ihrachys: you did. my bad
16:08:20 <ihrachys> ok np
16:08:47 <ihrachys> it's basically doing a simple gerrit board with bug/* filters for each gate-failure tagged bug
16:08:53 <ihrachys> so it seems the right thing
16:09:13 <ihrachys> I wonder if there is a place to reuse code between dashboards, but otherwise it seems solid
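
(For reference, the script under review generates a Gerrit dashboard from bug/* filters. A minimal sketch of what one generated definition might look like, assuming the gerrit-dash-creator .dash format; the title, section name, and bug number are illustrative, not taken from the patch:

    [dashboard]
    title = Neutron CI: gate-failure bugs
    description = Open patches addressing gate-failure tagged bugs
    foreach = project:openstack/neutron status:open

    [section "Bug 1509004"]
    query = topic:bug/1509004

One [section] per gate-failure tagged bug, each with its own topic:bug/<id> query.)
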
16:09:28 <ihrachys> ok next item is: "ihrachys to follow up with armax on why trunk connectivity test fails for lb scenario job"
16:09:56 <ihrachys> so kevinbenton looked at the failures and realized that's because the scenario job for ovs was not using ovsfw
16:10:06 <ihrachys> and hybrid firewall does not support trunks
16:10:30 <ihrachys> for that matter there are a bunch of patches: https://review.openstack.org/#/q/topic:fix-trunk-scenario
16:10:50 <ihrachys> two neutron patches in the gate, and we will need a project-config change in to make it all good
16:10:50 <manjeets> o/
16:10:59 <ihrachys> there is still cleanup to do for gate-hook that I will follow up on
16:11:02 <manjeets> sorry for being late
16:11:21 <ihrachys> #action ihrachys to clean up dsvm-scenario flavor handling from gate-hook
16:11:25 * mlavalle waves at manjeets
16:11:42 <ihrachys> manjeets: hi. you have a link to an example dashboard handy?
16:11:55 <ihrachys> (use url shortener if you paste it directly)
16:12:03 <mlavalle> lol
16:12:22 <ihrachys> everyone did it at least once in their lifetime!
16:12:26 <manjeets> ihrachys, I can generate one; continue with the meeting, I'll paste it in a few mins
16:12:31 <ihrachys> manjeets: ok thanks
16:12:48 <mlavalle> I certainly did more than once :-)
16:12:51 <ihrachys> next item was "ihrachys to follow up on PTG working items related to CI and present next week"
16:13:05 <ihrachys> I did check the list, and i have some items to discuss, but will dedicate a separate section
16:13:34 <ihrachys> ok, the next one was on me and not really covered, so I will repeat it for public shame and the next meeting's check
16:13:39 <ihrachys> #action ihrachys to walk thru list of open gate failure bugs and give them love
16:13:48 <ihrachys> next was "armax to assess impact of d-g change on stadium gate hooks"
16:14:19 <ihrachys> (that was the breakage due to local.conf loading changes)
16:14:21 <ihrachys> I suspect armax is not online just yet
16:14:39 <ihrachys> and I haven't seen a complete assessment from him just yet. I gotta chase him down. :)
16:14:57 <ihrachys> #action ihrachys to chase down armax on d-g local.conf breakage assessment for stadium
16:15:32 <ihrachys> speaking of which, the next item was for me to land and backport the fix for neutron repo, and we did
16:15:36 <ihrachys> here: https://review.openstack.org/#/q/Ibd0f67f9131e7f67f3a4a62cb6ad28bf80e11bbf,n,z
16:15:46 <ihrachys> so neutron repo should be safe
16:15:57 <jlibosva> good job!
16:15:57 <ihrachys> as for others, we gotta have a dedicated time to look
16:16:27 <ihrachys> ok let's now have a look at our current state of gate
16:16:48 <ihrachys> #topic Gate state
16:17:05 <ihrachys> the neutron grafana is at: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:17:35 <ihrachys> one thing that stands out is api job showing 50% failure rate in sliding window
16:17:42 <ihrachys> good news is that it should be fixed now
16:18:08 <ihrachys> there was a breakage caused by devstack-gate patch that was fixed by a revert: https://review.openstack.org/#/c/442123/
16:18:44 <ihrachys> since it's not the first time we are borked by d-g changes, I proposed to add a non voting neutron-api job for devstack-gate here: https://review.openstack.org/#/c/442156/
16:20:18 <ihrachys> next job of concern is gate-tempest-dsvm-neutron-dsvm-ubuntu-xenial
16:20:26 <ihrachys> currently at 25%
16:20:28 <ihrachys> in gate
16:20:41 <ihrachys> ofc some of it may be due to OOM killers and libvirtd crashing
16:20:45 <ihrachys> but since it stands out
16:20:52 <ihrachys> it's probably more than just that
16:21:01 <ihrachys> and 25% in gate is not good
16:21:14 <ihrachys> any ideas about what's dragging the job down?
16:21:50 <ihrachys> there was a nasty metadata issue the prev week that kevinbenton tracked down to conntrack killing connections for other tenants
16:21:53 <mlavalle> ihrachys: you meant gate-tempest-dsvm-neutron-dvr-ubuntu-xenial, right?
16:21:57 <ihrachys> that should be fixed by https://review.openstack.org/#/c/441346/
16:22:10 <ihrachys> mlavalle: hm yeah, dvr, sorry
16:23:04 * haleyb hears dvr and listens up
16:23:47 <ihrachys> haleyb: any ideas what makes dvr look worse than the others? ;)
16:24:04 <manjeets> ihrachys, is there any patch in progress for the gate issues?
16:24:20 <manjeets> currently I don't see any that have either a bug id in the message or in the topic
16:24:29 <ihrachys> manjeets: not that I know of for dvr job, hence I wonder
16:24:38 <manjeets> https://goo.gl/zVplMR
16:24:46 <haleyb> ihrachys: it's probably the one lingering bug we have, https://bugs.launchpad.net/neutron/+bug/1509004
16:24:46 <openstack> Launchpad bug 1509004 in neutron ""test_dualnet_dhcp6_stateless_from_os" failures seen in the gate" [High,Confirmed]
16:24:49 <ihrachys> oh you mean generally
16:25:13 <manjeets> it used to show some patches; the logic I added matches either a bug id in the commit message or in the topic of the patch
16:25:37 <haleyb> although that bug is more general l3, but happens more with dvr
16:25:52 <ihrachys> manjeets: makes sense. let's have a look at specific cases after the meeting.
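
(The matching logic manjeets describes maps to a standard Gerrit search; a hedged example, with the bug number illustrative:

    status:open project:openstack/neutron (message:"1509004" OR topic:bug/1509004)

status:, project:, message:, and topic: are stock Gerrit query operators, so a patch qualifies if the bug id appears in either the commit message or the change topic.)
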
16:25:59 <mlavalle> haleyb: +
16:26:22 <ihrachys> haleyb: and do we have any lead on it?
16:27:36 <haleyb> ihrachys: no, it has been elusive for a long time, but not at 25%; more in the single digits
16:27:57 <ihrachys> ok I feel there is some investigation to do
16:28:03 <ihrachys> preferably for l3 folks
16:28:08 <ihrachys> are you up for the job?
16:28:42 <mlavalle> ihrachys: if this is a priority, I can devote some bandwidth to it
16:28:58 <mlavalle> haleyb: ^^^ let me know if you want me to do this
16:29:07 <ihrachys> mlavalle: CI stability especially for jobs that are in gate is of priority for the whole project
16:29:16 <ihrachys> the failures slow down merge velocity
16:29:19 <haleyb> thanks mlavalle, yes, sometimes a new set of eyes will find something
16:29:20 <ihrachys> #action haleyb and mlavalle to investigate what makes the dvr gate job fail at a 25% rate
16:29:27 <mlavalle> ++
16:29:40 <haleyb> ihrachys: but above you said there was a 25% failure?  that wasn't the dvr job
16:29:59 <ihrachys> that was dvr, I made a mistake
16:30:04 <haleyb> oh, now i see it in grafana
16:30:07 <ihrachys> haleyb: see here http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=5&fullscreen
16:30:48 <ihrachys> finally, we have fullstack job and whopping ~100% failure rate
16:31:06 <ihrachys> jlibosva: what's happening with it? caught the flu?
16:31:19 <jlibosva> I haven't checked, just see it's really busted
16:31:30 <jlibosva> I saw couple of failures related to qos
16:31:45 <ihrachys> example: http://logs.openstack.org/10/422210/10/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/3aad0fa/testr_results.html.gz
16:31:55 <ihrachys> yeah qos, I saw it a lot before
16:31:58 <jlibosva> but didn't get to it yet, you can add an AI for me to have a peek
16:32:08 <ihrachys> I assume it's on ajo ;)
16:32:11 <haleyb> ihrachys: sometimes dvr is just the victim, has more moving parts, we make no changes and failures rise and fall
16:32:21 * haleyb realizes that's a bad excuse
16:32:36 <ihrachys> haleyb: bingo ;)
16:32:53 <ihrachys> jlibosva: I will try to have qos folks have a look, and if not we'll revisit
16:33:01 <ihrachys> it's not too pressing since it's non-voting
16:33:15 <ihrachys> #action ajo to chase down fullstack 100% failure rate due to test_dscp_qos_policy_rule_lifecycle failures
16:33:18 <jlibosva> but it's almost up to 100% :-/
16:33:26 <jlibosva> I'll have a look anyways :-P
16:33:39 <ihrachys> you are a free agent of your destiny
16:34:12 <ihrachys> #action jlibosva to help ajo understand why fullstack fails
16:34:44 <ihrachys> ok, seems like we have a grasp of the current issues, or have folks on the hook for the remaining unknowns ;)
16:34:48 <ihrachys> let's move on
16:34:48 <haleyb> ihrachys: for example, in the tempest gate, the dvr failure graph mimics the api failure one, just lower
16:34:52 <haleyb> anyways...
16:35:24 <ihrachys> haleyb: I think the dvr job was not broken by the gate-hook change, but maybe just monitor it for a day before jumping on it
16:35:36 <haleyb> will do
16:35:40 <ihrachys> if it doesn't resolve itself, then start panicking
16:35:52 <ihrachys> ok moving on
16:35:57 <ihrachys> #topic PTG followup
16:36:22 <ihrachys> so I went through the -final etherpad at: https://etherpad.openstack.org/p/neutron-ptg-pike-final and tried to list actionable CI items
16:36:30 <ihrachys> let's review those and find owners
16:36:51 <ihrachys> first block is functional job stability matters
16:37:08 <ihrachys> several things here to track, though some were already covered
16:37:26 <ihrachys> for one thing, the ovsdb-native timeout issue
16:38:03 <ihrachys> there were suggestions that we should try to eliminate some of the broken parts from the run
16:38:27 <ihrachys> 1. raise timeout for ovsdb operations. I think there was some agreement that it may help to reduce the rate
16:38:48 <ihrachys> not sure if otherwiseguy still feels ok-ish about it, considering his latest fixes up for review
16:39:11 <ihrachys> I am specifically talking about https://review.openstack.org/441208 and https://review.openstack.org/441258
16:39:29 <otherwiseguy> ihrachys, I'm ok with raising timeout.
16:40:05 <otherwiseguy> I think it'll help. but we need to raise the probe_interval as well.
16:40:34 <ihrachys> otherwiseguy: would you mind linking your patches to the correct bug in LP?
16:40:50 <otherwiseguy> ihrachys, will do
16:40:52 <ihrachys> that will help them show up on the gerrit CI board manjeets is working on
16:40:56 <ihrachys> thanks
16:41:07 <manjeets> thanks
16:41:11 <ihrachys> ok then I guess it's on ajo to revive the timeout patch
16:41:30 <ihrachys> #action ajo to restore and merge patch raising ovsdb native timeout: https://review.openstack.org/#/c/425623/
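
(For context, the timeout in question is the OVSDB command timeout used by the OVS agent. A rough sketch of the knob involved, assuming the era's option name ovs_vsctl_timeout and its placement in the agent's config — the raised value is illustrative, and whether the IDL probe_interval gets its own option is part of what the linked reviews sort out:

    [DEFAULT]
    # Seconds to wait for an OVSDB command to complete before raising a
    # timeout error; the default was 10.
    ovs_vsctl_timeout = 30

)
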
16:42:40 <ihrachys> another thing that should reduce the number of tests for the job, and hence the chance of hitting the issue, is otherwiseguy's work to spin off the ovsdb layer into a separate repo
16:42:54 <ihrachys> starting from https://review.openstack.org/#/c/442206/ and https://review.openstack.org/#/c/438087/
16:42:56 <ihrachys> it's wip
16:43:28 <ihrachys> I believe it won't fix the issue for the neutron gate because we still have tests that do not target the layer but use it (like dhcp agent)
16:43:31 <otherwiseguy> ihrachys, theoretically it could move out of WIP as soon as it's added to openstack/requirements
16:43:55 <ihrachys> otherwiseguy: oh it's passing everything? you're quick
16:43:57 <otherwiseguy> https://review.openstack.org/#/c/442206/
16:44:14 <ihrachys> otherwiseguy: you could also Depends-On the patch to pass the remaining job
16:44:20 <otherwiseguy> I'll have to merge in any changes we've made since posting it, but git subtree handles that.
16:44:22 <ihrachys> and then just remove WIP
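
(The Depends-On mechanics mentioned here: the neutron patch adopting ovsdbapp can carry a commit-message footer pointing at the openstack/requirements change, so the CI installs the dependency first. A generic sketch; the Change-Ids are placeholders:

    Use ovsdbapp for the OVSDB API layer

    ...commit message body...

    Depends-On: I<change-id-of-the-requirements-patch>
    Change-Id: I<this-change's-id>

)
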
16:44:53 <manjeets> otherwiseguy, would there be a version for the initial ovsdbapp release?
16:45:04 <otherwiseguy> I guess I need to add it to global-requirements.txt as well? I wasn't sure if there was something automated once it's added to projects.txt in requirements.
16:45:29 <ihrachys> otherwiseguy: projects.txt only makes requirements bot to propose syncs for your repo
16:45:40 <ihrachys> it doesn't change a thing beyond that
16:45:48 <ihrachys> (unless I am completely delusional)
16:45:56 <otherwiseguy> manjeets, there are some dev versions posted to pypi. I'll probably make a real release 0.1 release soonish.
16:46:33 <ihrachys> otherwiseguy: so, you need to add ovsdbapp into global-reqs.txt and upper-constraints.txt
16:46:34 <otherwiseguy> There is some refactoring I'd like to do before an actual 1.0 release. But once everything is completely up, I can do Depends-On changes for projects to handle my refactoring.
16:46:52 <otherwiseguy> ihrachys, Ok. I'll do that as well.
16:46:53 <ihrachys> and depends-on that, not the patch that modifies projects.txt
16:47:10 <manjeets> otherwiseguy, then don't we need upper-constraints as well? (I may be wrong)
16:47:16 <ihrachys> we do
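
(What the two additions would roughly look like; the version pin is illustrative, given the 0.1 release otherwiseguy mentions above:

    # global-requirements.txt
    ovsdbapp  # Apache-2.0

    # upper-constraints.txt
    ovsdbapp===0.1.0

As ihrachys notes above, projects.txt only opts the repo into the bot's requirement syncs; it does not make the library installable in gate jobs on its own.)
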
16:47:29 <ihrachys> otherwiseguy: so you are not going to adopt it in neutron before 1.0?
16:47:34 <otherwiseguy> I'll go through the instructions on project creation doc page.
16:48:11 <otherwiseguy> ihrachys, we can (and should) adopt immediately. Just going to change some, but I can keep things in sync. Especially after adding some more cores soon. :p
16:48:40 <ihrachys> otherwiseguy: as long as it does not go too far to make adoption more complicated
16:48:50 <ihrachys> consider what happened to oslo.messaging :)
16:49:06 <ihrachys> better to stick with what we have and refactor once we have the projects tied together
16:49:11 <otherwiseguy> ihrachys, It'll be easier just having to make the ovsdb changes in one place as opposed to making them in neutron, then merging to ovsdbapp, then moving things, etc.
16:49:20 <ihrachys> unless you see a huge architectural issue that requires drastic changes
16:49:49 <otherwiseguy> Right, I'm planning on moving everything to using ovsdbapp as is, then gradually making changes later.
16:50:07 <ihrachys> ack, good
16:50:08 <ihrachys> ok, that seems to be progressing, and I am sure otherwiseguy will complete it by the next meeting ;)
16:50:16 <otherwiseguy> :D
16:50:29 <otherwiseguy> I will be pretty obsessed with it for a bit.
16:50:37 <ihrachys> another item from PTG was driving switch to ha+dvr jobs for most gate jobs
16:51:00 <ihrachys> for that matter, anilvenkata_afk has the following patch in devstack-gate that hasn't moved much lately: https://review.openstack.org/#/c/383827/
16:51:42 <ihrachys> Anil just rebased it
16:51:56 <ihrachys> I will leave an action item on anilvenkata_afk to track that
16:52:11 <ihrachys> #action anilvenkata_afk to track inclusion of HA+DVR patch for devstack-gate
16:52:49 <ihrachys> there were also talks about reducing the number of jobs in gate, like removing non-dvr, non-multinode flavours wherever possible
16:53:25 <ihrachys> anyone willing to take on the initial analysis of what we can do and report back next week with suggestions on what we can remove?
16:53:52 <haleyb> ihrachys: and i had just proposed to make the dvr-multinode job voting, but dvr+ha+multinode is just as good
16:54:06 <ihrachys> btw not just in the gate but in the check queue too, we have some jobs that are not going anywhere, or whose worth I question (ironic?)
16:54:36 <ihrachys> haleyb: I would imagine that instead of piling new jobs on top we would remove non-dvr and replace them with dvr flavours
16:54:42 <haleyb> they have to be voting in check to be voting in gate from what i learned
16:54:43 <ihrachys> haleyb: link to the patch would be handy
16:54:57 <haleyb> https://review.openstack.org/410973
16:55:10 <haleyb> i had started that in newton
16:55:51 <ihrachys> ok. considering that we currently have 25% failure rate for the job, it may be challenging to sell its voting status ;)
16:55:57 <ihrachys> let's discuss on gerrit
16:56:08 <ihrachys> so, anyone to do the general analysis for all jobs we have?
16:56:44 <ihrachys> ok let's leave it for the next meeting to think about, we are running out of time
16:57:27 <ihrachys> we will follow up on more ptg items the next week
16:57:34 <ihrachys> (dude the number is huge)
16:57:44 <ihrachys> #topic Open discussion
16:57:51 <manjeets> ihrachys, one more missing bit: py35 demonstration via a tempest or functional job
16:58:03 <ihrachys> manjeets: yeah it's in my list for next week
16:58:05 <ihrachys> there are more :)
16:58:09 <manjeets> ohk
16:58:24 <ihrachys> I would like to raise attention to jlibosva docs patch on gerrit rechecks: https://review.openstack.org/#/c/426829/
16:58:40 <ihrachys> I think it's ready to go though I see some comments from manjeets that may need addressing
16:59:03 <ihrachys> I don't believe that the policy should apply to 'gate deputies' only
16:59:08 <ihrachys> (btw we don't have such a role)
16:59:14 <jlibosva> I'll look at the comments
16:59:17 <manjeets> ihrachys, it is for upgrades, but it's still CI: gate-grenade-dsvm-neutron-linuxbridge-multinode-ubuntu-xenial-nv
16:59:24 <ihrachys> to make it work, everyone should be on the same page and behave responsibly
16:59:35 <jlibosva> I think he meant bug deputies?
16:59:43 <manjeets> i need to follow up on that just a reminder will do that after meeting
16:59:55 <ihrachys> well I would not expect bug deputies to be on the hook to read logs now
17:00:03 <ihrachys> it's enough work to triage bugs already
17:00:10 <ihrachys> manjeets: ok
17:00:22 <ihrachys> jlibosva: let's follow up on manjeets's comments in gerrit
17:00:25 <ihrachys> we are out of time
17:00:26 <jlibosva> sure
17:00:27 <manjeets> or there could be separate gate deputies; not sure if that would make more sense
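
(For context, jlibosva's patch documents when rechecks are acceptable: the expectation is to first categorize the failure and reference the gate-failure bug rather than fire a bare recheck. A comment on the failing patch would look along these lines — the exact wording is whatever the docs patch settles on, and the bug number is illustrative:

    recheck bug 1509004

)
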
17:00:41 <ihrachys> we haven't covered everything we should have, but that's ok, it means we have work to do ;)
17:00:46 <ihrachys> thanks all
17:00:46 <ihrachys> #endmeeting