16:01:21 #startmeeting neutron_ci
16:01:27 Meeting started Tue Mar 7 16:01:21 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:28 o/
16:01:29 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:29 hi everyone
16:01:31 The meeting name has been set to 'neutron_ci'
16:02:12 o/
16:02:13 jlibosva: kevinbenton: manjeets:
16:02:24 * jlibosva already waved
16:02:29 * ihrachys waves back at mlavalle
16:02:48 jlibosva: I long for people attention, it's never enough
16:03:19 ok let's start with reviewing action items from the previous meeting
16:03:24 #topic Action items from previous meeting
16:03:33 "ihrachys to monitor e-r irc bot reporting in the channel"
16:03:44 so, I haven't seen a single report in the channel from the bot
16:03:59 so I guess now I have another action item to track, which is to fix it :)
16:04:09 #action ihrachys fix e-r bot not reporting in irc channel
16:04:19 next is: "manjeets to repropose the CI dashboard script for reviewday"
16:04:42 I see this proposed: https://review.openstack.org/#/c/439114/
16:05:27 I guess we can't check the result before landing
16:07:23 ok it seems like a good start, I will review after the meeting
16:07:38 ihrachys: do you have a pointer?
16:07:47 mlavalle: pointer to?
16:07:57 manjeets patchset
16:08:08 I thought I pasted it
16:08:08 https://review.openstack.org/#/c/439114/
16:08:16 ihrachys: you did. my bad
16:08:20 ok np
16:08:47 it's basically doing a simple gerrit board with bug/* filters for each gate-failure tagged bug
16:08:53 so it seems the right thing
16:09:13 I wonder if there is place to reuse code between dashboard, but otherwise it seems solid
16:09:28 ok next item is: "ihrachys to follow up with armax on why trunk connectivity test fails for lb scenario job"
16:09:56 so kevinbenton looked at the failures and realized that's because the scenario job for ovs was not using ovsfw
16:10:06 and hybrid firewall does not support trunks
16:10:30 for that matter there are a bunch of patches: https://review.openstack.org/#/q/topic:fix-trunk-scenario
16:10:50 two neutron patches in gate, and we will need project-config change in to make it all good
16:10:50 o/
16:10:59 there is still cleanup to do for gate-hook that I will follow up
16:11:02 sorry for being late
16:11:21 #action ihrachys to clean up dsvm-scenario flavor handling from gate-hook
16:11:25 * mlavalle waves at manjeets
16:11:42 manjeets: hi. you have a link to an example dashboard handy?
16:11:55 (use url shortener if you paste it directly)
16:12:03 lol
16:12:22 everyone did it at least once in their lifetime!
16:12:26 ihrachys, I can generate one continue on meeting i'll paste in few mins
16:12:31 manjeets: ok thanks
16:12:48 I certainly did more than once :-)
16:12:51 next item was "ihrachys to follow up on PTG working items related to CI and present next week"
16:13:05 I did check the list, and i have some items to discuss, but will dedicate a separate section
16:13:34 ok next was on me and not really covered so I will repeat it for public shame and the next meeting check
16:13:39 #topic ihrachys to walk thru list of open gate failure bugs and give them love
16:13:48 next was "armax to assess impact of d-g change on stadium gate hooks"
16:14:19 (that was the breakage due to local.conf loading changes)
16:14:21 I suspect armax is not online just yet
16:14:39 and I haven't seen a complete assessment from him just yet. I gotta chase him down.
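
For reference, the per-bug dashboard approach mentioned at 16:08 (a Gerrit board with bug/* filters for each gate-failure tagged bug) could look roughly like the sketch below. This is only an illustration, not the actual reviewday change under review at https://review.openstack.org/#/c/439114/; the Launchpad project URL, the searchTasks parameters, the Gerrit dashboard URL format, and the query syntax are all assumptions.

    # Rough sketch (not the actual reviewday change): build a Gerrit dashboard
    # URL with one section per gate-failure tagged bug. The Launchpad project
    # name, API parameters, and dashboard/query formats are assumptions.
    import urllib.parse

    import requests

    LP_PROJECT = "https://api.launchpad.net/1.0/neutron"
    GERRIT_DASHBOARD = "https://review.openstack.org/#/dashboard/?"


    def gate_failure_bugs():
        # searchTasks is assumed to return bug tasks tagged 'gate-failure'.
        resp = requests.get(LP_PROJECT,
                            params={"ws.op": "searchTasks",
                                    "tags": "gate-failure"})
        resp.raise_for_status()
        for task in resp.json().get("entries", []):
            # bug_link is assumed to look like .../1.0/bugs/<id>
            yield task["bug_link"].rsplit("/", 1)[-1]


    def dashboard_url():
        sections = [("title", "Neutron gate failures")]
        for bug_id in gate_failure_bugs():
            # One dashboard section per bug: match the bug id in the commit
            # message or a bug/<id> topic, mirroring the filters described
            # in the meeting above.
            query = "status:open (message:%s OR topic:bug/%s)" % (bug_id, bug_id)
            sections.append(("bug %s" % bug_id, query))
        return GERRIT_DASHBOARD + urllib.parse.urlencode(sections)


    print(dashboard_url())
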
:)
16:14:57 #action ihrachys to chase down armax on d-g local.conf breakage assessment for stadium
16:15:32 speaking of which, the next item was for me to land and backport the fix for neutron repo, and we did
16:15:36 here: https://review.openstack.org/#/q/Ibd0f67f9131e7f67f3a4a62cb6ad28bf80e11bbf,n,z
16:15:46 so neutron repo should be safe
16:15:57 good job!
16:15:57 as for others, we gotta have a dedicated time to look
16:16:27 ok let's now have a look at our current state of gate
16:16:48 #topic Gate state
16:17:05 the neutron grafana is at: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:17:35 one thing that stands out is api job showing 50% failure rate in sliding window
16:17:42 good news is that it should be fixed now
16:18:08 there was a breakage caused by devstack-gate patch that was fixed by a revert: https://review.openstack.org/#/c/442123/
16:18:44 since it's not the first time we are borked by d-g changes, I proposed to add a non voting neutron-api job for devstack-gate here: https://review.openstack.org/#/c/442156/
16:20:18 next job of concern is gate-tempest-dsvm-neutron-dsvm-ubuntu-xenial
16:20:26 currently at 25%
16:20:28 in gate
16:20:41 ofc some of it may be due to OOM killers and libvirtd crashing
16:20:45 but since it stands out
16:20:52 it's probably more than just that
16:21:01 and 25% in gate is not good
16:21:14 any ideas about what lingers the job?
16:21:50 there was a nasty metadata issue the prev week that kevinbenton tracked down to conntrack killing connections for other tenants
16:21:53 ihrachys: you meant gate-tempest-dsvm-neutron-dvr-ubuntu-xenial, right?
16:21:57 that should be fixed by https://review.openstack.org/#/c/441346/
16:22:10 mlavalle: hm yeah, dvr, sorry
16:23:04 * haleyb hears dvr and listens up
16:23:47 haleyb: any ideas what makes dvr look worse than others? ;)
16:24:04 ihrachys, is there any patch in progress for gate issues
16:24:20 currently i don't see any which have either bug id in message or in topic
16:24:29 manjeets: not that I know of for dvr job, hence I wonder
16:24:38 https://goo.gl/zVplMR
16:24:46 ihrachys: it's probably the one lingering bug we have, https://bugs.launchpad.net/neutron/+bug/1509004
16:24:46 Launchpad bug 1509004 in neutron ""test_dualnet_dhcp6_stateless_from_os" failures seen in the gate" [High,Confirmed]
16:24:49 oh you mean generally
16:25:13 it used to show some patches, logic i added was either bug id in commit message or topic of patches
16:25:37 although that bug is more general l3, but happens more with dvr
16:25:52 manjeets: makes sense. let's have a look at specific cases after the meeting.
16:25:59 haleyb: +
16:26:22 haleyb: and do we have any lead on it?
16:27:36 ihrachys: no it has been elusive for a long time, but not 25%, in the single digits
16:27:57 ok I feel there is some investigation to do
16:28:03 preferably for l3 folks
16:28:08 are you up for the job?
16:28:42 ihrachys: if this is a priority, I can devote some bandwidth to it
16:28:58 haleyb: ^^^ let me know if you want me to do this
16:29:07 mlavalle: CI stability especially for jobs that are in gate is of priority for the whole project
16:29:16 the failures slow down merge velocit
16:29:19 *velocity
16:29:19 thanks mlavalle, yes, sometimes a new set of eyes will find something
16:29:20 #action haleyb and mlavalle to investigate what makes dvr gate job failing with 25% rate
16:29:27 ++
16:29:40 ihrachys: but above you said there was a 25% failure?
that wasn't the dvr job
16:29:59 that was dvr, I made a mistake
16:30:04 oh, now i see it in grafana
16:30:07 haleyb: see here http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=5&fullscreen
16:30:48 finally, we have fullstack job and whopping ~100% failure rate
16:31:06 jlibosva: what's happening with it? caught a flu?
16:31:19 I haven't checked, just see it's really busted
16:31:30 I saw a couple of failures related to qos
16:31:45 example: http://logs.openstack.org/10/422210/10/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/3aad0fa/testr_results.html.gz
16:31:55 yeah qos, I saw it a lot before
16:31:58 but didn't get to it yet, you can add me an AI to have a peek
16:32:08 I assume it's on ajo ;)
16:32:11 ihrachys: sometimes dvr is just the victim, has more moving parts, we make no changes and failures rise and fall
16:32:21 * haleyb realizes that's a bad excuse
16:32:36 haleyb: bingo ;)
16:32:53 jlibosva: I will try to have qos folks have a look, and if not we'll revisit
16:33:01 it's not too pressing since it's non-voting
16:33:15 #action ajo to chase down fullstack 100% failure rate due to test_dscp_qos_policy_rule_lifecycle failures
16:33:18 but it's almost up to 100% :-/
16:33:26 I'll have a look anyways :-P
16:33:39 you are a free agent of your destiny
16:34:12 #action jlibosva to help ajo understand why fullstack fails
16:34:44 ok seems like we have a grasp of current issues, or have folks on hooks for remaining unknowns ;)
16:34:48 let's move on
16:34:48 ihrachys: for example, in the tempest gate, the dvr failure graph mimics the api failure one, just lower
16:34:52 anyways...
16:35:24 haleyb: I think dvr job was not broken by gate-hook, but maybe just monitor for a day before jumping on it
16:35:36 will do
16:35:40 if it doesn't resolve itself, then start panicking
16:35:52 ok moving on
16:35:57 #topic PTG followup
16:36:22 so I went through the -final etherpad at: https://etherpad.openstack.org/p/neutron-ptg-pike-final and tried to list actionable CI items
16:36:30 let's review those and find owners
16:36:51 first block is functional job stability matters
16:37:08 several things here to track, though some were already covered
16:37:26 for one thing, the ovsdb-native timeout issue
16:38:03 there were suggestions that we should try to eliminate some of the broken parts from the run
16:38:27 1. raise timeout for ovsdb operations. I think there was some agreement that it may help to reduce the rate
16:38:48 not sure if otherwiseguy still feels ok-ish about it, considering his late fixes up for review
16:39:11 I am specifically talking about https://review.openstack.org/441208 and https://review.openstack.org/441258
16:39:29 ihrachys, I'm ok with raising timeout.
16:40:05 I think it'll help. but we need to raise the probe_interval as well.
16:40:34 otherwiseguy: would you mind linking your patches to the correct bug in LP?
16:40:50 ihrachys, will do
16:40:52 that will help them to show up at the gerrit CI board manjeets is working on
16:40:56 thanks
16:41:07 thanks
16:41:11 ok then I guess it's on ajo to revive the timeout patch
16:41:30 #action ajo to restore and merge patch raising ovsdb native timeout: https://review.openstack.org/#/c/425623/
16:42:40 another thing that should reduce the number of tests for the job and hence the chance of hitting the issue is otherwiseguy's work to spin off ovsdb layer into a separate repo
16:42:54 starting from https://review.openstack.org/#/c/442206/ and https://review.openstack.org/#/c/438087/
16:42:56 it's wip
16:43:28 I believe it won't fix the issue for the neutron gate because we still have tests that do not target the layer but use it (like dhcp agent)
16:43:31 ihrachys, theoretically it could move out of WIP as soon as it's added to openstack/requirements
16:43:55 otherwiseguy: oh it's passing everything? you're quick
16:43:57 https://review.openstack.org/#/c/442206/
16:44:14 otherwiseguy: you could as well Depends-On the patch to pass the remaining job
16:44:20 I'll have to merge in any changes we've made since posting it, but git subtree handles that.
16:44:22 and then just remove WIP
16:44:53 otherwiseguy, would there be any version for initial ovsdb ?
16:45:04 I guess I need to add to global-requirements.txt as well? wasn't sure if there was something automated if added to projects.txt in requirements.
16:45:29 otherwiseguy: projects.txt only makes the requirements bot propose syncs for your repo
16:45:40 it doesn't change a thing beyond that
16:45:48 (unless I am completely delusional)
16:45:56 manjeets, there are some dev versions posted to pypi. I'll probably make a real 0.1 release soonish.
16:46:33 otherwiseguy: so, you need to add ovsdbapp into global-reqs.txt and upper-constraints.txt
16:46:34 There is some refactoring I'd like to do before an actual 1.0 release. But once everything is completely up, I can do Depends-On changes for projects to handle my refactoring.
16:46:52 ihrachys, Ok. I'll do that as well.
16:46:53 and depends-on that, not the patch that modifies projects.txt
16:47:10 otherwiseguy, then don't we need upper-constraint as well (I may be wrong) ?
16:47:16 we do
16:47:29 otherwiseguy: so you are not going to adopt it in neutron before 1.0?
16:47:34 I'll go through the instructions on project creation doc page.
16:48:11 ihrachys, we can (and should) adopt immediately. Just going to change some, but I can keep things in sync. Especially after adding some more cores soon. :p
16:48:40 otherwiseguy: as long as it does not go too far to make adoption more complicated
16:48:50 consider what happened to oslo.messaging :)
16:49:06 better stick to what we have and refactor once we have projects tied
16:49:11 ihrachys, It'll be easier just having to make the ovsdb changes in one place as opposed to making them in neutron, then merging to ovsdbapp, then moving things, etc.
16:49:20 unless you see a huge architectural issue that requires drastic changes
16:49:49 Right, I'm planning on moving everything to using ovsdbapp as is, then gradually making changes later.
16:50:07 ack, good
16:50:08 ok that seems to progress and I am sure otherwiseguy will complete it by the next meeting ;)
16:50:16 :D
16:50:29 I will be pretty obsessed with it for a bit.
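
To make the ovsdbapp release-and-adoption flow discussed above concrete, here is a minimal illustration of the pieces involved. The 0.1.0 version is only an assumption based on the "0.1 release soonish" comment, the change-id is a placeholder rather than a real review, and the actual entries would go through their own openstack/requirements reviews:

    # openstack/requirements, global-requirements.txt (hypothetical entry)
    ovsdbapp>=0.1.0  # Apache-2.0

    # openstack/requirements, upper-constraints.txt (hypothetical entry)
    ovsdbapp===0.1.0

    # footer of the neutron adoption patch's commit message, pointing at the
    # requirements change (placeholder, not a real Change-Id)
    Depends-On: <Change-Id of the global-requirements patch>
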
16:50:37 another item from PTG was driving switch to ha+dvr jobs for most gate jobs
16:51:00 for that matter, anilvenkata_afk has the following patch in devstack-gate that doesn't move too far lately: https://review.openstack.org/#/c/383827/
16:51:42 Anil just rebased it
16:51:56 I will leave an action item on anilvenkata_afk to track that
16:52:11 #action anilvenkata_afk to track inclusion of HA+DVR patch for devstack-gate
16:52:49 there were also talks about reducing the number of jobs in gate, like removing non-dvr, non-multinode flavours wherever possible
16:53:25 anyone willing to take the initial analysis of what we can do and report the next week with suggestions on what we can remove?
16:53:52 ihrachys: and i had just proposed to make the dvr-multinode job voting, but dvr+ha+multinode is just as good
16:54:06 btw not just in gate but in check queue too, we have some jobs that are not going anywhere, or that are worthless for what I wonder (ironic?)
16:54:36 haleyb: I would imagine that instead of piling new jobs on top we would remove non-dvr and replace them with dvr flavours
16:54:42 they have to be voting in check to be voting in gate from what i learned
16:54:43 haleyb: link to the patch would be handy
16:54:57 https://review.openstack.org/410973
16:55:10 i had started that in newton
16:55:51 ok. considering that we currently have 25% failure rate for the job, it may be challenging to sell its voting status ;)
16:55:57 let's discuss on gerrit
16:56:08 so, anyone to do the general analysis for all jobs we have?
16:56:44 ok let's leave it for the next meeting to think about, we are running out of time
16:57:27 we will follow up on more ptg items the next week
16:57:34 (dude the number is huge)
16:57:44 #topic Open discussion
16:57:51 ihrachys, one more missing bit py35 demonstration via tempest or functional job
16:58:03 manjeets: yeah it's in my list for next week
16:58:05 there are more :)
16:58:09 ohk
16:58:24 I would like to raise attention to jlibosva docs patch on gerrit rechecks: https://review.openstack.org/#/c/426829/
16:58:40 I think it's ready to go though I see some comments from manjeets that may need addressing
16:59:03 I don't believe that the policy should apply to 'gate deputies' only
16:59:08 (btw we don't have such a role)
16:59:14 I'll look at the comments
16:59:17 ihrachys, It is for upgrades but still ci gate-grenade-dsvm-neutron-linuxbridge-multinode-ubuntu-xenial-nv
16:59:24 to make it work, everyone should be on the same page and behave responsibly
16:59:35 I think he meant bug deputies?
16:59:43 i need to follow up on that just a reminder will do that after meeting
16:59:55 well I would not expect bug deputies to be on the hook to read logs now
17:00:03 it's enough work to triage bugs already
17:00:10 manjeets: ok
17:00:22 jlibosva: let's follow up on manjeets's comments in gerrit
17:00:25 we are out of time
17:00:26 sure
17:00:27 or there can be separate gate deputies not sure if that would make more sense
17:00:41 we haven't covered everything we should have, but that's ok, it means we have work to do ;)
17:00:46 thanks all
17:00:46 #endmeeting