16:01:21 <ihrachys> #startmeeting neutron_ci
16:01:27 <openstack> Meeting started Tue Mar 7 16:01:21 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:28 <jlibosva> o/
16:01:29 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:29 <ihrachys> hi everyone
16:01:31 <openstack> The meeting name has been set to 'neutron_ci'
16:02:12 <mlavalle> o/
16:02:13 <ihrachys> jlibosva: kevinbenton: manjeets:
16:02:24 * jlibosva already waved
16:02:29 * ihrachys waves back at mlavalle
16:02:48 <ihrachys> jlibosva: I long for people attention, it's never enough
16:03:19 <ihrachys> ok let's start with reviewing action items from the previous meeting
16:03:24 <ihrachys> #topic Action items from previous meeting
16:03:33 <ihrachys> "ihrachys to monitor e-r irc bot reporting in the channel"
16:03:44 <ihrachys> so, I haven't seen a single report in the channel from the bot
16:03:59 <ihrachys> so I guess now I have another action item to track, which is to fix it :)
16:04:09 <ihrachys> #action ihrachys fix e-r bot not reporting in irc channel
16:04:19 <ihrachys> next is: "manjeets to repropose the CI dashboard script for reviewday"
16:04:42 <ihrachys> I see this proposed: https://review.openstack.org/#/c/439114/
16:05:27 <ihrachys> I guess we can't check the result before landing
16:07:23 <ihrachys> ok it seems like a good start, I will review after the meeting
16:07:38 <mlavalle> ihrachys: do you have a pointer?
16:07:47 <ihrachys> mlavalle: pointer to?
16:07:57 <mlavalle> manjeets patchset
16:08:08 <ihrachys> I thought I pasted it
16:08:08 <ihrachys> https://review.openstack.org/#/c/439114/
16:08:16 <mlavalle> ihrachys: you did. my bad
16:08:20 <ihrachys> ok np
16:08:47 <ihrachys> it's basically doing a simple gerrit board with bug/* filters for each gate-failure tagged bug
16:08:53 <ihrachys> so it seems the right thing
16:09:13 <ihrachys> I wonder if there is place to reuse code between dashboard, but otherwise it seems solid
16:09:28 <ihrachys> ok next item is: "ihrachys to follow up with armax on why trunk connectivity test fails for lb scenario job"
16:09:56 <ihrachys> so kevinbenton looked at the failures and realized that's because the scenario job for ovs was not using ovsfw
16:10:06 <ihrachys> and hybrid firewall does not support trunks
16:10:30 <ihrachys> for that matter there are a bunch of patches: https://review.openstack.org/#/q/topic:fix-trunk-scenario
16:10:50 <ihrachys> two neutron patches in gate, and we will need project-config change in to make it all good
16:10:50 <manjeets> o/
16:10:59 <ihrachys> there is still cleanup to do for gate-hook that I will follow up
16:11:02 <manjeets> sorry for being late
16:11:21 <ihrachys> #action ihrachys to clean up dsvm-scenario flavor handling from gate-hook
16:11:25 * mlavalle waves at manjeets
16:11:42 <ihrachys> manjeets: hi. you have a link to an example dashboard handy?
16:11:55 <ihrachys> (use url shortener if you paste it directly)
16:12:03 <mlavalle> lol
16:12:22 <ihrachys> everyone did it at least once in their lifetime!
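The board discussed above matches changes per gate-failure bug via bug/* topic filters. As a rough illustration of the idea (not the actual reviewday script under review — function names, the query shape, and the `message:` fallback are assumptions based on standard Gerrit search operators), such a dashboard URL could be assembled like this:

```python
from urllib.parse import quote

def dashboard_section(bug_id):
    """Return one 'title=query' Gerrit dashboard fragment for a bug.

    Matches open changes using the bug/<id> topic convention, with a
    fallback on the bug number appearing in the commit message.
    """
    query = "status:open (topic:bug/%d OR message:%d)" % (bug_id, bug_id)
    return "Bug+%d=%s" % (bug_id, quote(query))

def dashboard_url(bug_ids, base="https://review.openstack.org/#/dashboard/?"):
    """Combine per-bug sections into a single custom dashboard URL."""
    parts = ["title=Gate+failure+bugs"]
    parts += [dashboard_section(b) for b in bug_ids]
    return base + "&".join(parts)

print(dashboard_url([1509004]))
```

This is also why patches only show up on such a board if owners remember to set the bug topic or reference the bug in the commit message, which comes up again later in the meeting.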
16:12:26 <manjeets> ihrachys, I can generate one continue on meeting i'll paste in few mins
16:12:31 <ihrachys> manjeets: ok thanks
16:12:48 <mlavalle> I certainly did more than once :-)
16:12:51 <ihrachys> next item was "ihrachys to follow up on PTG working items related to CI and present next week"
16:13:05 <ihrachys> I did check the list, and i have some items to discuss, but will dedicate a separate section
16:13:34 <ihrachys> ok next was on me and not really covered so I will repeat it for public shame and the next meeting check
16:13:39 <ihrachys> #topic ihrachys to walk thru list of open gate failure bugs and give them love
16:13:48 <ihrachys> next was "armax to assess impact of d-g change on stadium gate hooks"
16:14:19 <ihrachys> (that was the breakage due to local.conf loading changes)
16:14:21 <ihrachys> I suspect armax is not online just yet
16:14:39 <ihrachys> and I haven't seen a complete assessment from him just yet. I gotta chase him down. :)
16:14:57 <ihrachys> #action ihrachys to chase down armax on d-g local.conf breakage assessment for stadium
16:15:32 <ihrachys> speaking of which, the next item was for me to land and backport the fix for neutron repo, and we did
16:15:36 <ihrachys> here: https://review.openstack.org/#/q/Ibd0f67f9131e7f67f3a4a62cb6ad28bf80e11bbf,n,z
16:15:46 <ihrachys> so neutron repo should be safe
16:15:57 <jlibosva> good job!
16:15:57 <ihrachys> as for others, we gotta have a dedicated time to look
16:16:27 <ihrachys> ok let's now have a look at our current state of gate
16:16:48 <ihrachys> #topic Gate state
16:17:05 <ihrachys> the neutron grafana is at: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:17:35 <ihrachys> one thing that stands out is api job showing 50% failure rate in sliding window
16:17:42 <ihrachys> good news is that it should be fixed now
16:18:08 <ihrachys> there was a breakage caused by devstack-gate patch that was fixed by a revert: https://review.openstack.org/#/c/442123/
16:18:44 <ihrachys> since it's not the first time we are borked by d-g changes, I proposed to add a non voting neutron-api job for devstack-gate here: https://review.openstack.org/#/c/442156/
16:20:18 <ihrachys> next job of concern is gate-tempest-dsvm-neutron-dsvm-ubuntu-xenial
16:20:26 <ihrachys> currently at 25%
16:20:28 <ihrachys> in gate
16:20:41 <ihrachys> ofc some of it may be due to OOM killers and libvirtd crashing
16:20:45 <ihrachys> but since it stands out
16:20:52 <ihrachys> it's probably more than just that
16:21:01 <ihrachys> and 25% in gate is not good
16:21:14 <ihrachys> any ideas about what lingers the job?
16:21:50 <ihrachys> there was a nasty metadata issue the prev week that kevinbenton tracked down to conntrack killing connections for other tenants
16:21:53 <mlavalle> ihrachys: you meant gate-tempest-dsvm-neutron-dvr-ubuntu-xenial, right?
16:21:57 <ihrachys> that should be fixed by https://review.openstack.org/#/c/441346/
16:22:10 <ihrachys> mlavalle: hm yeah, dvr, sorry
16:23:04 * haleyb hears dvr and listens up
16:23:47 <ihrachys> haleyb: any ideas what makes dvr look worse than others? ;0
16:24:04 <manjeets> ihrachys, is there any patch in progress for gate issues
16:24:20 <manjeets> currentlt i don't see any which have either bug id in message or in topic
16:24:29 <ihrachys> manjeets: not that I know of for dvr job, hence I wonder
16:24:38 <manjeets> https://goo.gl/zVplMR
16:24:46 <haleyb> ihrachys: it's probably the one lingering bug we have, https://bugs.launchpad.net/neutron/+bug/1509004
16:24:46 <openstack> Launchpad bug 1509004 in neutron ""test_dualnet_dhcp6_stateless_from_os" failures seen in the gate" [High,Confirmed]
16:24:49 <ihrachys> oh you mean generally
16:25:13 <manjeets> it used to show some patches, logic i added was either bug id in commit message or topic of patches
16:25:37 <haleyb> although that bug is more general l3, but happens more with dvr
16:25:52 <ihrachys> manjeets: makes sense. let's have a look at specific cases after the meeting.
16:25:59 <mlavalle> haleyb: +
16:26:22 <ihrachys> haleyb: and do we have any lead on it?
16:27:36 <haleyb> ihrachys: no it has been elusive for a long time, but not 25%, in the single digits
16:27:57 <ihrachys> ok I feel there is some investigation to do
16:28:03 <ihrachys> preferrably for l3 folks
16:28:08 <ihrachys> are you up for the job?
16:28:42 <mlavalle> ihrachys: if this is a priority, I can devote some bandwidth to it
16:28:58 <mlavalle> haleyb: ^^^ let me know if you want me to do this
16:29:07 <ihrachys> mlavalle: CI stability especially for jobs that are in gate is of priority for the whole project
16:29:16 <ihrachys> the failures slow down merge velocit
16:29:19 <ihrachys> *velocity
16:29:19 <haleyb> thanks mlavalle, yes, sometimes a new set of eyes will find something
16:29:20 <ihrachys> #action haleyb and mlavalle to investigate what makes dvr gate job failing with 25% rate
16:29:27 <mlavalle> ++
16:29:40 <haleyb> ihrachys: but above you said there was a 25% failure? that wasn't the dvr job
16:29:59 <ihrachys> that was dvr, I made a mistake
16:30:04 <haleyb> oh, now i see it in grafana
16:30:07 <ihrachys> haleyb: see here http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=5&fullscreen
16:30:48 <ihrachys> finally, we have fullstack job and whopping ~100% failure rate
16:31:06 <ihrachys> jlibosva: what's happening with it? caught a flu?
16:31:19 <jlibosva> I haven't checked, just see it's really busted
16:31:30 <jlibosva> I saw couple of failures related to qos
16:31:45 <ihrachys> example: http://logs.openstack.org/10/422210/10/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/3aad0fa/testr_results.html.gz
16:31:55 <ihrachys> yeah qos, I saw it a lot before
16:31:58 <jlibosva> but didn't get to it yet, you can add me an AI to have a peak
16:32:08 <ihrachys> I assume it's on ajo ;)
16:32:11 <haleyb> ihrachys: sometimes dvr is just the victim, has more moving parts, we make no changes and failures rise and fall
16:32:21 * haleyb realizes that's a bad excuse
16:32:36 <ihrachys> haleyb: bingo ;)
16:32:53 <ihrachys> jlibosva: I will try to have qos folks have a look, and if not we'll revisit
16:33:01 <ihrachys> it's not too pressing since it's non-voting
16:33:15 <ihrachys> #action ajo to chase down fullstack 100% failure rate due to test_dscp_qos_policy_rule_lifecycle failures
16:33:18 <jlibosva> but it's almost up to 100% :-/
16:33:26 <jlibosva> I'll have a look anyways :-P
16:33:39 <ihrachys> you are a free agent of your destiny
16:34:12 <ihrachys> #action jlibosva to help ajo understand why fullstack fails
16:34:44 <ihrachys> ok seems like we have grasp of current issues, or have folks on hooks for remaining unknowns ;)
16:34:48 <ihrachys> let's move on
16:34:48 <haleyb> ihrachys: for example, in the tempest gate, the dvr failure graph mimics the api failure one, just lower
16:34:52 <haleyb> anyways...
16:35:24 <ihrachys> haleyb: I think dvr job was not broken by gate-hook, but maybe just monitor for a day before jumping on it
16:35:36 <haleyb> will do
16:35:40 <ihrachys> if it doesn't resolve itself, then start panicking
16:35:52 <ihrachys> ok moving on
16:35:57 <ihrachys> #topic PTG followup
16:36:22 <ihrachys> so I went through the -final etherpad at: https://etherpad.openstack.org/p/neutron-ptg-pike-final and tried to list actionable CI items
16:36:30 <ihrachys> let's review those and find owners
16:36:51 <ihrachys> first block is functional job stability matters
16:37:08 <ihrachys> several things here to track, though some were already covered
16:37:26 <ihrachys> for one thing, the ovsdb-native timeout issue
16:38:03 <ihrachys> there were suggestion that we should try to eliminate some of broken parts from the run
16:38:27 <ihrachys> 1. raise timeout for ovsdb operations. I think there was some agreement that it may help to reduce the rate
16:38:48 <ihrachys> not sure if otherwiseguy still feels ok-ish about it, considering his late fixes up for review
16:39:11 <ihrachys> I am specifically talking about https://review.openstack.org/441208 and https://review.openstack.org/441258
16:39:29 <otherwiseguy> ihrachys, I'm ok with raising timeout.
16:40:05 <otherwiseguy> I think it'll help. but we need to raise the probe_interval as well.
16:40:34 <ihrachys> otherwiseguy: would you mind linking your patches to the correct bug in LP?
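For context on the timeout being discussed: the ovsdb-native timeout is a regular neutron config option, so "raising the timeout" amounts to a one-line default change. A hedged sketch of what that looks like in the OVS agent configuration (the option name and section are assumptions from the neutron tree of that era; the actual value chosen is in the patch under review, not here):

```ini
[OVS]
# Seconds to wait for an ovsdb-native command to complete before the
# operation is abandoned with a timeout exception. Raising it trades
# slower failure detection for fewer spurious functional-test failures
# on overloaded CI nodes.
ovsdb_timeout = 30
```

The probe_interval otherwiseguy mentions is a separate knob: the OVSDB session's inactivity probe, which also has to be long enough that a busy test node isn't mistaken for a dead connection.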
16:40:50 <otherwiseguy> ihrachys, will do
16:40:52 <ihrachys> that will help them to show up at the gerrit CI board manjeets is working on
16:40:56 <ihrachys> thanks
16:41:07 <manjeets> thanks
16:41:11 <ihrachys> ok then I guess it's on ajo to revive the timeout patch
16:41:30 <ihrachys> #action ajo to restore and merge patch raising ovsdb native timeout: https://review.openstack.org/#/c/425623/
16:42:40 <ihrachys> another thing that should reduce the number of tests for the job and hence the chance of hitting the issue is otherwiseguy's work to spin off ovsdb layer into a separate repo
16:42:54 <ihrachys> starting from https://review.openstack.org/#/c/442206/ and https://review.openstack.org/#/c/438087/
16:42:56 <ihrachys> it's wip
16:43:28 <ihrachys> I believe it won't fix the issue for the neutron gate because we still have tests that do not target the layer but use it (like dhcp agent)
16:43:31 <otherwiseguy> ihrachys, theoretically it could move out of WIP as soon as added to openstack/requirements
16:43:55 <ihrachys> otherwiseguy: oh it's passing everything? you're quick
16:43:57 <otherwiseguy> https://review.openstack.org/#/c/442206/
16:44:14 <ihrachys> otherwiseguy: you could as well Depends-On the patch to pass the remaining job
16:44:20 <otherwiseguy> I'll have to merge in any changes we've made since posting it, but git subtree handles that.
16:44:22 <ihrachys> and then just remove WIP
16:44:53 <manjeets> otherwiseguy, would there be any version for initial ovsdb ?
16:45:04 <otherwiseguy> I guess I need to add to global-requirements.txt as well? wasn't sure if there was something automated if added to projects.txt in requirements.
16:45:29 <ihrachys> otherwiseguy: projects.txt only makes requirements bot to propose syncs for your repo
16:45:40 <ihrachys> it doesn't change a thing beyond that
16:45:48 <ihrachys> (unless I am completely delusional)
16:45:56 <otherwiseguy> manjeets, there are some dev versions posted to pypi. I'll probably make a real 0.1 release soonish.
16:46:33 <ihrachys> otherwiseguy: so, you need to add ovsdbapp into global-reqs.txt and upper-constraints.txt
16:46:34 <otherwiseguy> There is some refactoring I'd like to do before an actual 1.0 release. But once everything is completely up, I can do Depends-On changes for projects to handle my refactoring.
16:46:52 <otherwiseguy> ihrachys, Ok. I'll do that as well.
16:46:53 <ihrachys> and depends-on that, not the patch that modifies projects.txt
16:47:10 <manjeets> otherwiseguy, then don't we need upper-constraint as well (I may be wrong) ?
16:47:16 <ihrachys> we do
16:47:29 <ihrachys> otherwiseguy: so you are not going to adopt it in neutron before 1.0?
16:47:34 <otherwiseguy> I'll go through the instructions on project creation doc page.
16:48:11 <otherwiseguy> ihrachys, we can (and should) adopt immediately. Just going to change some, but I can keep things in sync. Especially after adding some more cores soon. :p
16:48:40 <ihrachys> otherwiseguy: as long as it does not go too far to make adoption more complicated
16:48:50 <ihrachys> consider what happened to oslo.messaging :)
16:49:06 <ihrachys> better stick to what we have and refactor once we have projects tied
16:49:11 <otherwiseguy> ihrachys, It'll be easier just having to make the ovsdb changes in one place as opposed to making them in neutron, then merging to ovsdbapp, then moving things, etc.
16:49:20 <ihrachys> unless you see a huge architectural issue that requires drastic changes
16:49:49 <otherwiseguy> Right, I'm planning on moving everything to using ovsdbapp as is, then gradually making changes later.
16:50:07 <ihrachys> ack, good
16:50:08 <ihrachys> ok that seems to progress and I am sure otherwiseguy will complete it till next meeting ;)
16:50:16 <otherwiseguy> :D
16:50:29 <otherwiseguy> I will be pretty obsessed with it for a bit.
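To summarize the requirements dance discussed above: consuming ovsdbapp from neutron needs entries in the openstack/requirements repo, and the neutron-side patch can be tested before those merge by using a Depends-On footer. A hedged sketch of the pieces involved (the version numbers and the Change-Id are hypothetical placeholders, not real values):

```
# openstack/requirements: global-requirements.txt — the minimum version
# any project may depend on
ovsdbapp>=0.1.0  # Apache-2.0

# openstack/requirements: upper-constraints.txt — the exact version CI
# installs
ovsdbapp===0.1.0

# openstack/requirements: projects.txt — only subscribes the repo to
# the requirements bot's sync proposals; it does not gate anything

# Footer of the consuming (e.g. neutron) patch's commit message, so CI
# tests it against the unmerged requirements change:
Depends-On: I0123456789abcdef0123456789abcdef01234567
```

As ihrachys notes, the Depends-On should point at the global-requirements/upper-constraints change, since the projects.txt change has no effect on what CI installs.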
16:50:37 <ihrachys> another item from PTG was driving switch to ha+dvr jobs for most gate jobs
16:51:00 <ihrachys> for that matter, anilvenkata_afk has the following patch in devstack-gate that doesn't move too far lately: https://review.openstack.org/#/c/383827/
16:51:42 <ihrachys> Anil just rebased it
16:51:56 <ihrachys> I will leave an action item on anilvenkata_afk to track that
16:52:11 <ihrachys> #action anilvenkata_afk to track inclusion of HA+DVR patch for devstack-gate
16:52:49 <ihrachys> there were also talks about reducing the number of jobs in gate, like removing non-dvr, non-multinode flavours wherever possible
16:53:25 <ihrachys> anyone willing to take the initial analysis of what we can do and report the next week with suggestions on what we can remove?
16:53:52 <haleyb> ihrachys: and i had just proposed to make the dvr-multinode job voting, but dvr+ha+multinode is just as good
16:54:06 <ihrachys> btw not just in gate but in check queue too, we have some jobs that are not going anywhere, or that are worthless for what I wonder (ironic?)
16:54:36 <ihrachys> haleyb: I would imagine that instead of piling new jobs on top we would remove non-dvr and replace them with dvr flavours
16:54:42 <haleyb> they have to be voting in check to be voting in gate from what i learned
16:54:43 <ihrachys> haleyb: link to the patch would be handy
16:54:57 <haleyb> https://review.openstack.org/410973
16:55:10 <haleyb> i had started that in newton
16:55:51 <ihrachys> ok. considering that we currently have 25% failure rate for the job, it may be challenging to sell its voting status ;)
16:55:57 <ihrachys> let's discuss on gerrit
16:56:08 <ihrachys> so, anyone to do the general analysis for all jobs we have?
16:56:44 <ihrachys> ok let's leave it for the next meeting to think about, we are running out of time
16:57:27 <ihrachys> we will follow up on more ptg items the next week
16:57:34 <ihrachys> (dude the number is huge)
16:57:44 <ihrachys> #topic Open discussion
16:57:51 <manjeets> ihrachys, one more missing bit py35 demonstration via tempest or functional job
16:58:03 <ihrachys> manjeets: yeah it's in my list for next week
16:58:05 <ihrachys> there are more :)
16:58:09 <manjeets> ohk
16:58:24 <ihrachys> I would like to raise attention to jlibosva docs patch on gerrit rechecks: https://review.openstack.org/#/c/426829/
16:58:40 <ihrachys> I think it's ready to go though I see some comments from manjeets that may need addressing
16:59:03 <ihrachys> I don't believe that the policy should apply to 'gate deputies' only
16:59:08 <ihrachys> (btw we don't have such a role)
16:59:14 <jlibosva> I'll look at the comments
16:59:17 <manjeets> ihrachys, It is for upgrades but still ci gate-grenade-dsvm-neutron-linuxbridge-multinode-ubuntu-xenial-nv
16:59:24 <ihrachys> to make it work, everyone should be on the same page and behave responsibly
16:59:35 <jlibosva> I think he meant bug deputies?
16:59:43 <manjeets> i need to follow up on that just a reminder will do that after meeting
16:59:55 <ihrachys> well I would not expect bug deputies to be on the hook to read logs now
17:00:03 <ihrachys> it's enough work to triage bugs already
17:00:10 <ihrachys> manjeets: ok
17:00:22 <ihrachys> jlibosva: let's follow up on manjeets's comments in gerrit
17:00:25 <ihrachys> we are out of time
17:00:26 <jlibosva> sure
17:00:27 <manjeets> or there can be separate gate deputies not sure if that would make more sense
17:00:41 <ihrachys> we haven't covered everything we should have, but that's ok, it means we have work to do ;)
17:00:46 <ihrachys> thanks all
17:00:46 <ihrachys> #endmeeting