16:02:11 <ihrachys> #startmeeting neutron_ci
16:02:12 <openstack> Meeting started Tue Feb 20 16:02:11 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:02:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:16 <openstack> The meeting name has been set to 'neutron_ci'
16:02:16 <ihrachys> sorry for late start
16:02:51 <ihrachys> I don't think we have jlibosva or haleyb today, both on pto
16:03:15 <ihrachys> waiting for at least some people to show up
16:03:47 <slaweq> ihrachys: hi
16:03:49 <slaweq> sorry for being late
16:03:51 <slaweq> and hello to all of You :)
16:03:58 <ihrachys> slaweq, hi, np I was late too. so far we are two.
16:04:08 <slaweq> ok
16:04:09 <ihrachys> Jakub and Brian are on PTO
16:04:18 <slaweq> I know about them
16:05:33 <ihrachys> ok, even though we are just two, let's do it (quick)
16:05:34 <ihrachys> #topic Actions from prev meeting
16:05:40 <ihrachys> "ihrachys report test_get_service_by_service_and_host_name failure in periodic -pg- job"
16:06:00 <slaweq> ok
16:06:06 <ihrachys> I actually still haven't; I noticed that the grafana periodic dashboard is broken (?) http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=14&fullscreen
16:06:18 <ihrachys> not showing data for legacy jobs
16:06:24 <ihrachys> I suspect it's because we renamed the jobs
16:06:33 <ihrachys> afair mlavalle was working on moving them into the neutron repo
16:06:55 <slaweq> I think the patch for that was merged already
16:07:01 <ihrachys> here: https://review.openstack.org/#/q/topic:remove-neutron-periodic-jobs
16:07:22 <ihrachys> slaweq, right. but as often happens we forgot about grafana
16:07:29 <slaweq> right
16:07:37 <ihrachys> #action ihrachys to update grafana periodic board with new names
16:08:33 <ihrachys> as for the original failure in -pg- I guess I should still follow up
16:08:42 <ihrachys> #action report test_get_service_by_service_and_host_name failure in periodic -pg- job
16:08:51 <ihrachys> next is "slaweq to backport ovsfw fixes to older stable branches"
16:09:00 <slaweq> so I checked those patches
16:09:09 <slaweq> and backport wasn't necessary in fact
16:09:22 <ihrachys> for neither?
16:09:31 <slaweq> yes
16:09:44 <ihrachys> so we know which patch broke it?
16:10:08 <slaweq> we are using an earlier version of ovs there and there is no issue with the crash dump there
16:10:54 <slaweq> and for the second patch I know which patch broke hard reboot of instances - this patch is not in Pike nor Ocata
16:11:18 <ihrachys> hm ok
16:11:52 <ihrachys> next was "slaweq to look at http://logs.openstack.org/53/539953/2/check/neutron-fullstack/55e3511/logs/testr_results.html.gz failures"
16:12:04 <slaweq> yes
16:12:18 <slaweq> so there are at least 4 different issues spotted in those tests
16:12:23 <ihrachys> that's https://bugs.launchpad.net/neutron/+bug/1673531 right?
16:12:23 <openstack> Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [High,Confirmed] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:12:42 <slaweq> it's one of them
16:13:09 <slaweq> no, wait
16:13:24 <slaweq> it is related but this one doesn't describe any specific reason
16:13:35 <slaweq> so I found and reported a couple of issues:
16:13:38 <slaweq> 1. https://bugs.launchpad.net/neutron/+bug/1750337
16:13:38 <openstack> Launchpad bug 1750337 in neutron "Fullstack tests fail due to "block_until_boot" timeout" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:13:59 <slaweq> and the patch for this one is in review: https://review.openstack.org/546069
16:14:06 <slaweq> we discussed it yesterday
16:14:18 <slaweq> 2. https://bugs.launchpad.net/neutron/+bug/1728948
16:14:18 <openstack> Launchpad bug 1728948 in neutron "fullstack: test_connectivity fails due to dhclient crash" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:14:32 <slaweq> the patch for this one is also ready https://review.openstack.org/#/c/545820/
16:14:44 <slaweq> 3. https://bugs.launchpad.net/neutron/+bug/1750334
16:14:45 <openstack> Launchpad bug 1750334 in neutron "ovsdb commands timeouts cause fullstack tests failures" [High,Confirmed]
16:15:20 <slaweq> for this one I don't know exactly how to fix it but maybe otherwiseguy can help?
16:15:57 <ihrachys> slaweq, can the timeout be legit because of the high load you saw?
16:16:05 <slaweq> 4. sometimes tests fail for a "good" reason, i.e. the network really is interrupted, like e.g. http://logs.openstack.org/84/545584/1/check/neutron-fullstack/e49378a/logs/testr_results.html.gz
16:16:35 <slaweq> yes, it's possible that this high load causes timeouts in ovsdb commands also
16:17:01 <slaweq> but I really didn't have more time to check all those issues this week
16:17:23 <ihrachys> slaweq, you mean "RuntimeError: Networking interrupted after controllers have vanished" is a red herring for some other issue that we already track in other plac?
16:17:25 <ihrachys> *place
16:17:50 <ihrachys> slaweq, ok we're landing the workers patch and will see if it gets better.
16:17:58 <slaweq> yes
16:18:07 <ihrachys> great progress
16:18:24 <slaweq> and I will also try to dig more into these interrupted networking errors if they keep happening
16:18:37 <slaweq> but I think it's getting better and better :)
16:18:40 <ihrachys> next was "mlavalle to look into linuxbridge ssh timeout failures" but mlavalle is offline so we will just repeat it
16:18:42 <ihrachys> #action mlavalle to look into linuxbridge ssh timeout failures
16:19:01 <ihrachys> #topic Grafana
16:19:01 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:20:34 <slaweq> neutron-tempest-ovsfw is definitely much better than it was last week :)
16:21:04 <ihrachys> I was watching the functional job the previous week and it was always at an average of ~10-15% (spikes to 20%, dips to 0%)
16:21:35 <ihrachys> slaweq, yeah, and dvr scenarios are also pretty good now
16:21:44 <slaweq> :)
16:21:46 <ihrachys> not the same for linuxbridge
16:22:04 <slaweq> linuxbridge and tempest are still the worst ones
16:22:16 <slaweq> tempest/fullstack/s
16:22:18 <slaweq> sorry
16:23:20 <ihrachys> yeah
16:23:29 <ihrachys> mlavalle, good day sir :)
16:23:38 <mlavalle> sorry, got distracted
16:23:50 <slaweq> hi mlavalle
16:23:54 <mlavalle> hi
16:23:56 <ihrachys> mlavalle, we were going through grafana
16:24:04 <mlavalle> ok cool
16:24:09 <ihrachys> mlavalle, one thing is functional, it seems rather stable, on par with unit tests
16:24:19 <mlavalle> that's great
16:24:21 <ihrachys> both show an average failure rate in the check queue of around 10-15%
16:24:38 <ihrachys> well it's slightly lower for unit tests I guess
16:25:03 <ihrachys> but then one may wonder if it's because more patches legitimately break functional tests than unit tests
16:25:14 <ihrachys> the best validation would be comparing gates I guess
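(For context on where these failure percentages come from: panels on a grafana board like neutron-failure-rate are normally built from the per-job SUCCESS/FAILURE counters that zuul publishes to graphite, turned into a percentage and smoothed over a time window. Below is a minimal sketch of how such a target can be composed; the metric path layout, the averaging window, and the job name are assumptions for illustration only and should be checked against what graphite actually stores.)

    # Illustrative sketch only -- not the actual neutron-failure-rate definition.
    # Assumes counters of the (hypothetical) form
    #   stats_counts.zuul.pipeline.<pipeline>.job.<job>.{SUCCESS,FAILURE}
    def failure_rate_target(job, pipeline='check', window='12hours'):
        """Compose a graphite target plotting FAILURE as a percentage of all runs."""
        base = 'stats_counts.zuul.pipeline.{p}.job.{j}'.format(p=pipeline, j=job)
        return (
            "alias(movingAverage(asPercent("
            "transformNull({base}.FAILURE),"
            "transformNull(sum({base}.{{SUCCESS,FAILURE}}))"
            "), '{w}'), '{j}')"
        ).format(base=base, w=window, j=job)

    # e.g. the functional job discussed above:
    print(failure_rate_target('neutron-functional'))

(A ~10-15% line on such a panel therefore means roughly that fraction of runs of the job ended in FAILURE over the averaging window.)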
16:25:59 <ihrachys> functional mostly stays at 0% but we have a spike to 15% right now there.
16:26:19 <ihrachys> one complexity with using grafana to validate anything is that afaiu it captures results for all branches
16:26:42 <ihrachys> so if e.g. we stabilize functional in master but not stable/queens, and people post patches to the latter, it will show as failure in grafana
16:26:57 <mlavalle> ahh, right
16:27:02 <ihrachys> how do we deal with it
16:27:15 <ihrachys> maybe we should actually make the dashboard about master only
16:27:27 <mlavalle> I would say so
16:27:29 <ihrachys> (I hope graphite carries the data to distinguish)
16:27:55 <mlavalle> yeah, if the underlying platform carries the data, then we should strive for that
16:28:06 <ihrachys> ok, I will have a look
16:28:18 <ihrachys> #action ihrachys to update grafana boards to include master data only
16:28:51 <ihrachys> and I guess we postpone the decision on functional voting till next time in 2 weeks
16:28:55 <slaweq> should we then also do a dashboard for stable branches?
16:29:07 <slaweq> or is it not necessary?
16:29:10 <ihrachys> slaweq, yeah and I was actually planning to do it for quite a while
16:29:20 <ihrachys> that's tangential though
16:29:31 <slaweq> ok :)
16:29:44 <slaweq> just asking to not forget about it :)
16:29:50 <ihrachys> at this point you should know when I plan something it doesn't happen
16:30:06 <slaweq> LOL
16:30:30 <ihrachys> I will add a new board in the patch
16:30:41 <slaweq> if You point me to when exactly it should be done I can do it
16:30:52 <ihrachys> nah I should do SOMETHING right?
16:31:02 <slaweq> ok, as You want :)
16:31:07 <mlavalle> you do more than enough
16:31:22 <mlavalle> and we are thankful
16:31:28 <slaweq> ++
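(On the action item above about master-only boards, and the possible stable-branch board: whether this works depends on graphite carrying a branch dimension in the metric paths. A hypothetical sketch of what a branch-scoped counter path could look like follows; the exact path segments are an assumption, not confirmed, and need to be verified against the metrics zuul actually emits before updating the dashboard.)

    # Hypothetical illustration only: if the post-rename metrics carry project
    # and branch segments (assumed layout below), a master-only board simply
    # scopes every target to that branch.
    def branch_scoped_counter(job, result='FAILURE', branch='master', pipeline='check'):
        """Build a counter path limited to a single branch of openstack/neutron."""
        return ('stats_counts.zuul.tenant.openstack.pipeline.{p}.project.'
                'git_openstack_org.openstack_neutron.{b}.job.{j}.{r}'
                .format(p=pipeline, b=branch, j=job, r=result))

    print(branch_scoped_counter('neutron-fullstack'))
    # a stable-branch board would pass a different branch value, encoded however
    # graphite stores it (e.g. 'stable_queens' is a guess, not a confirmed name)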
16:32:03 <ihrachys> so, fullstack, we already went through it before mlavalle joined, but basically the tl;dr is we land a bunch of slaweq's patches and will see if they fix other issues we may have with high load on the system
16:32:13 <ihrachys> hence skipping it now
16:32:19 <ihrachys> #topic Scenarios
16:32:19 <mlavalle> ok
16:32:24 <mlavalle> thanks for the summary
16:32:53 <ihrachys> with slaweq's patches for ovsfw we seem to be in a great place for both the dvr scenarios and ovsfw jobs now
16:33:04 <ihrachys> but not so much for linuxbridge
16:33:28 <ihrachys> mlavalle, afair you planned to have a look at ssh timeouts in the linuxbridge scenarios job
16:33:34 <ihrachys> have you got a chance to?
16:33:43 <mlavalle> I didn't have the time
16:33:53 <mlavalle> but I will try this week
16:34:39 <ihrachys> ok
16:34:49 <ihrachys> let's check if all failures are the same in the latest run
16:34:58 <mlavalle> ok
16:35:35 <ihrachys> http://logs.openstack.org/69/546069/1/check/neutron-tempest-plugin-scenario-linuxbridge/8561cce/logs/testr_results.html.gz
16:35:56 <ihrachys> yeah seems exactly the same 4 failures
16:36:11 <ihrachys> I guess once they are tackled, we'll have another green job
16:36:22 <slaweq> mlavalle: I can try to have a look at those issues as You are probably busy preparing for the PTG
16:36:47 <mlavalle> slaweq: if you have time on your hands, yes, please go ahead
16:37:04 <slaweq> sure, I will try to debug it
16:37:06 <ihrachys> speaking of green jobs, I suggest we also consider making the dvr scenarios job voting if it survives the next 2 weeks
16:37:09 <mlavalle> Thanks
16:37:22 <mlavalle> ihrachys: yeah
16:37:23 <ihrachys> and the same for the ovsfw job
16:37:45 <mlavalle> it's early in the cycle, so this is the time to be aggressive with this type of thing
16:38:16 <slaweq> I agree
16:38:21 <ihrachys> ok. when we make it voting, do we make it voting in both queues?
16:38:38 <mlavalle> let's go with check
16:39:09 <ihrachys> ok
16:39:26 <ihrachys> we can revisit the gate once it proves to be stable with all the new jobs
16:39:41 <mlavalle> yeah
16:41:25 <ihrachys> #topic Bugs
16:41:29 <ihrachys> https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=status&start=0
16:41:35 <ihrachys> those are all bugs tagged with gate-failure
16:41:50 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1724253
16:41:50 <openstack> Launchpad bug 1724253 in BaGPipe "error in netns privsep wrapper: ImportError: No module named agent.linux.ip_lib" [High,Confirmed] - Assigned to Thomas Morin (tmmorin-orange)
16:42:10 <ihrachys> this one is weird, in that it doesn't really seem to be a gate issue for neutron, and I suspect not for bagpipe either
16:42:28 <ihrachys> because they probably wouldn't have the gate broken for months :)
16:43:29 <ihrachys> tmorin had this patch for neutron to make our scripts reusable for them: https://review.openstack.org/#/c/503280/
16:44:05 <ihrachys> not sure why the patch is not linked to the bug
16:44:10 <ihrachys> but I think it's related
16:44:18 * mlavalle removed -2
16:46:16 <ihrachys> I see fullstack failed in a weird way there though
16:46:31 <ihrachys> last thing I want is to break fullstack with this :)
16:46:55 <ihrachys> but apart from the failure that should be fixed, the patch seems innocent enough to help their project
16:47:13 * slaweq will cry if fullstack will be totally broken again :)
16:47:20 <ihrachys> earlier I suggested the script is not part of the neutron api and shouldn't be reused, but maybe that's too pedantic and they know what they're doing
16:48:23 <ihrachys> regardless, I removed the gate-failure tag from the bug since no gate is broken
16:48:56 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1744983
16:48:56 <openstack> Launchpad bug 1744983 in neutron "coverage job fails with timeout" [High,Confirmed]
16:49:05 <ihrachys> have we seen unit test / coverage job timeouts lately?
16:49:21 <ihrachys> we merged https://review.openstack.org/537016 that recently bumped the time for the job. did it help?
16:49:41 <mlavalle> I haven't seen it lately
16:49:44 <ihrachys> I can't recollect timeouts in the last several weeks so maybe it's gone
16:49:45 <slaweq> I can't remember if I saw such timeouts lately
16:49:59 <ihrachys> ok let me close it; we can reopen if it resurfaces
16:51:12 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1660612
16:51:12 <openstack> Launchpad bug 1660612 in neutron "Tempest full jobs time out on execution" [High,Confirmed]
16:51:54 <ihrachys> not sure if this one still happens much. but back when I looked into it, it was because of low concurrency (1) for tempest scenarios
16:52:08 <ihrachys> (not our scenarios; tempest scenarios that are executed in -full jobs)
16:52:17 <ihrachys> I proposed this in tempest: https://review.openstack.org/#/c/536598/1/tox.ini
16:52:30 <ihrachys> but it seems like they are not entirely supportive of the change
16:54:19 <ihrachys> the -full jobs are still defined in our zuul config
16:54:35 <ihrachys> but the problem is that we don't control how we execute the tox env from tempest
16:54:40 <ihrachys> it's devstack-gate that does it
16:55:24 <clarkb> in the new zuul tempest jobs it is tempest that does it fwiw
16:55:33 <clarkb> (via the in tree job config)
16:56:24 <ihrachys> clarkb, we have some neutron full jobs too. those trigger d-g for sure.
16:57:26 <clarkb> ya if they haven't been transitioned yet they will still hit d-g
16:57:52 <ihrachys> clarkb, but that's a good point, we also want to touch jobs coming from other repos
16:58:06 <ihrachys> clarkb, in the tempest repo, I see the run-tempest role used. where is it defined?
16:58:11 <ihrachys> codesearch doesn't help
16:58:45 <ihrachys> I mean here: http://git.openstack.org/cgit/openstack/tempest/tree/playbooks/devstack-tempest.yaml#n14
16:58:54 <clarkb> ihrachys: I think it may come from devstack/roles
16:59:05 <clarkb> (codesearch probably failing because it's a dir name and not file content)
16:59:37 <ihrachys> clarkb, and devstack/roles is in which repo?
17:00:31 <clarkb> in devstack
17:00:35 <ihrachys> ack
17:00:37 <clarkb> sorry, devstack is the repo, roles is the dir
17:00:46 <ihrachys> we are out of time. thanks for joining.
17:00:48 <ihrachys> #endmeeting