16:02:11 #startmeeting neutron_ci
16:02:12 Meeting started Tue Feb 20 16:02:11 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:02:13 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:16 The meeting name has been set to 'neutron_ci'
16:02:16 sorry for the late start
16:02:51 I don't think we have jlibosva or haleyb today, both on PTO
16:03:15 waiting for at least some people to show up
16:03:47 ihrachys: hi
16:03:49 sorry for being late
16:03:51 and hello to all of You :)
16:03:58 slaweq, hi, np I was late too. so far we are two.
16:04:08 ok
16:04:09 Jakub and Brian are on PTO
16:04:18 I know about them
16:05:33 ok, even though we are just two, let's do it (quick)
16:05:34 #topic Actions from prev meeting
16:05:40 "ihrachys report test_get_service_by_service_and_host_name failure in periodic -pg- job"
16:06:00 ok
16:06:06 I actually still haven't; I noticed that the grafana periodic dashboard is broken (?) http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=14&fullscreen
16:06:18 not showing data for legacy jobs
16:06:24 I suspect it's because we renamed the jobs
16:06:33 afair mlavalle was working on moving them into the neutron repo
16:06:55 I think the patch for that was merged already
16:07:01 here: https://review.openstack.org/#/q/topic:remove-neutron-periodic-jobs
16:07:22 slaweq, right. but as it often happens we forgot about grafana
16:07:29 right
16:07:37 #action ihrachys to update grafana periodic board with new names
16:08:33 as for the original failure in -pg- I guess I should still follow up
16:08:42 #action report test_get_service_by_service_and_host_name failure in periodic -pg- job
16:08:51 next is "slaweq to backport ovsfw fixes to older stable branches"
16:09:00 so I checked those patches
16:09:09 and the backport wasn't necessary in fact
16:09:22 for neither of them?
16:09:31 yes
16:09:44 so we know which patch broke it?
16:10:08 we are using an earlier version of ovs there and there is no issue with the crash dump there
16:10:54 and for the second patch I know which patch broke hard reboot of instances - that patch is not in Pike nor Ocata
16:11:18 hm ok
16:11:52 next was "slaweq to look at http://logs.openstack.org/53/539953/2/check/neutron-fullstack/55e3511/logs/testr_results.html.gz failures"
16:12:04 yes
16:12:18 so there are at least 4 different issues spotted in those tests
16:12:23 that's https://bugs.launchpad.net/neutron/+bug/1673531 right?
16:12:23 Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [High,Confirmed] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:12:42 it's one of them
16:13:09 no, wait
16:13:24 it is related but this one doesn't describe any specific reason
16:13:35 so I found and reported a couple of issues:
16:13:38 1. https://bugs.launchpad.net/neutron/+bug/1750337
16:13:38 Launchpad bug 1750337 in neutron "Fullstack tests fail due to "block_until_boot" timeout" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:13:59 and the patch for this one is in review: https://review.openstack.org/546069
16:14:06 we discussed it yesterday
16:14:18 2. https://bugs.launchpad.net/neutron/+bug/1728948
16:14:18 Launchpad bug 1728948 in neutron "fullstack: test_connectivity fails due to dhclient crash" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:14:32 patch for this one is also ready: https://review.openstack.org/#/c/545820/
16:14:44 3. https://bugs.launchpad.net/neutron/+bug/1750334
16:14:45 Launchpad bug 1750334 in neutron "ovsdb commands timeouts cause fullstack tests failures" [High,Confirmed]
16:15:20 for this one I don't know exactly how to fix it but maybe otherwiseguy can help?
16:15:57 slaweq, can the timeout be legit because of the high load you saw?
16:16:05 4. sometimes tests fail for a "good" reason, i.e. the network is interrupted, like e.g. http://logs.openstack.org/84/545584/1/check/neutron-fullstack/e49378a/logs/testr_results.html.gz
16:16:35 yes, it's possible that this high load causes timeouts in ovsdb commands too
16:17:01 but I really didn't have more time to check all those issues this week
16:17:23 slaweq, you mean "RuntimeError: Networking interrupted after controllers have vanished" is a red herring for some other issue that we already track in another place?
16:17:50 slaweq, ok, we're landing the workers patch and will see if it gets better.
16:17:58 yes
16:18:07 great progress
16:18:24 and I will also try to dig more into these interrupted networking errors if they keep happening
16:18:37 but I think it's getting better and better :)
16:18:40 next was "mlavalle to look into linuxbridge ssh timeout failures" but mlavalle is offline so we will just repeat it
16:18:42 #action mlavalle to look into linuxbridge ssh timeout failures
16:19:01 #topic Grafana
16:19:01 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:20:34 neutron-tempest-ovsfw is definitely much better than it was last week :)
16:21:04 I was watching the functional job the previous week and it was always at an average of ~10-15% (spikes to 20%, dips to 0%)
16:21:35 slaweq, yeah, and dvr scenarios are also pretty good now
16:21:44 :)
16:21:46 not the same for linuxbridge
16:22:04 linuxbridge and fullstack are still the worst ones
16:23:20 yeah
16:23:29 mlavalle, good day sir :)
16:23:38 sorry, got distracted
16:23:50 hi mlavalle
16:23:54 hi
16:23:56 mlavalle, we were going through grafana
16:24:04 ok cool
16:24:09 mlavalle, one thing is functional, it seems rather stable, on par with unit tests
16:24:19 that's great
16:24:21 both show an average failure rate in the check queue around 10-15%
16:24:38 well, it's slightly lower for unit tests I guess
16:25:03 but then one may wonder if it's because more patches legitimately break functional tests than unit tests
16:25:14 the best validation would be comparing gates I guess
16:25:59 functional mostly stays at 0% but we have a spike to 15% right now there.
16:26:19 one complexity with using grafana to validate anything is that afaiu it captures results for all branches
16:26:42 so if e.g. we stabilize functional in master but not stable/queens, and people post patches to the latter, it will show as a failure in grafana
16:26:57 ahh, right
16:27:02 how do we deal with it?
16:27:15 maybe we should actually make the dashboard about master only
16:27:27 I would say so
16:27:29 (I hope graphite carries the data to distinguish)
16:27:55 yeah, if the underlying platform carries the data, then we should strive for that
16:28:06 ok, I will have a look
16:28:18 #action ihrachys to update grafana boards to include master data only
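For context on the two grafana action items above: the neutron-failure-rate dashboard is generated from a grafyaml definition whose panels query the graphite counters Zuul emits per job result. Below is only a rough sketch of what a master-only panel for one job could look like; the metric path layout is an assumption from memory rather than the real dashboard definition, and the job name is just an example.

```yaml
# Rough grafyaml-style sketch, assumptions only: the exact graphite metric
# layout may differ, but the idea is that the Zuul v3 counters carry pipeline,
# project, branch and job name as path segments. Pinning the "master" segment
# gives a master-only panel, and a renamed job needs its segment updated
# before any data shows up again.
dashboard:
  title: Neutron Failure Rate
  rows:
    - title: Functional Failure Rates (check queue, master only)
      panels:
        - title: neutron-functional
          targets:
            - target: >-
                asPercent(
                  transformNull(stats_counts.zuul.tenant.openstack.pipeline.check.project.git_openstack_org.openstack_neutron.master.job.neutron-functional.FAILURE),
                  sum(stats_counts.zuul.tenant.openstack.pipeline.check.project.git_openstack_org.openstack_neutron.master.job.neutron-functional.{SUCCESS,FAILURE}))
```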
16:28:51 and I guess we postpone the decision on functional voting till next time in 2 weeks
16:28:55 should we then also do a dashboard for stable branches?
16:29:07 or is it not necessary?
16:29:10 slaweq, yeah, and I was actually planning to do it for quite a while
16:29:20 that's tangential though
16:29:31 ok :)
16:29:44 just asking to not forget about it :)
16:29:50 at this point you should know when I plan something it doesn't happen
16:30:06 LOL
16:30:30 I will add a new board in the patch
16:30:41 if You point me to where exactly it should be done I can do it
16:30:52 nah, I should do SOMETHING right?
16:31:02 ok, as You want :)
16:31:07 you do more than enough
16:31:22 and we are thankful
16:31:28 ++
16:32:03 so, fullstack, we already went through it before mlavalle joined, but basically the tl;dr is we land a bunch of slaweq's patches and will see if they fix other issues we may have with high load on the system
16:32:13 hence skipping it now
16:32:19 #topic Scenarios
16:32:19 ok
16:32:24 thanks for the summary
16:32:53 with slaweq's patches for ovsfw we seem to be in a great place for both the dvr scenarios and ovsfw jobs now
16:33:04 but not so much for linuxbridge
16:33:28 mlavalle, afair you planned to have a look at ssh timeouts in the linuxbridge scenarios job
16:33:34 have you got a chance to?
16:33:43 I didn't have the time
16:33:53 but I will try this week
16:34:39 ok
16:34:49 let's check if all failures are the same in the latest run
16:34:58 ok
16:35:35 http://logs.openstack.org/69/546069/1/check/neutron-tempest-plugin-scenario-linuxbridge/8561cce/logs/testr_results.html.gz
16:35:56 yeah, seems exactly the same 4 failures
16:36:11 I guess once they are tackled, we'll have another green job
16:36:22 mlavalle: I can try to have a look at those issues as You are probably busy preparing for the PTG
16:36:47 slaweq: if you have time on your hands, yes, please go ahead
16:37:04 sure, I will try to debug it
16:37:06 speaking of green jobs, I suggest we also consider making the dvr scenarios job voting if it survives the next 2 weeks
16:37:09 Thanks
16:37:22 ihrachys: yeah
16:37:23 and the same for the ovsfw job
16:37:45 it's early in the cycle, so this is the time to be aggressive with this type of thing
16:38:16 I agree
16:38:21 ok. when we make it voting, do we make it voting in both queues?
16:38:38 let's go with check
16:39:09 ok
16:39:26 we can revisit gate if it proves to be stable with all the new jobs
16:39:41 yeah
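On the mechanics of the voting decision above: with the in-tree Zuul v3 config, making a job voting in the check queue only means dropping its voting flag and simply not listing it under gate. A minimal sketch follows, using the ovsfw job as the example and simplifying everything else away; this is not the literal neutron .zuul.yaml.

```yaml
# Simplified sketch, not the literal neutron .zuul.yaml.
- job:
    name: neutron-tempest-ovsfw
    # voting: false    # <- removing this flag is what makes the job voting

- project:
    check:
      jobs:
        - neutron-tempest-ovsfw   # voting in check...
    # gate: the job is intentionally not listed here yet, so it stays out of
    # the gate queue until it proves stable
```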
16:41:25 #topic Bugs
16:41:29 https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=status&start=0
16:41:35 those are all bugs tagged with gate-failure
16:41:50 https://bugs.launchpad.net/neutron/+bug/1724253
16:41:50 Launchpad bug 1724253 in BaGPipe "error in netns privsep wrapper: ImportError: No module named agent.linux.ip_lib" [High,Confirmed] - Assigned to Thomas Morin (tmmorin-orange)
16:42:10 this one is weird, in that it doesn't really seem to be a gate issue for neutron, and I suspect not for bagpipe either
16:42:28 because they probably wouldn't leave their gate broken for months :)
16:43:29 tmorin had this patch for neutron to make our scripts reusable for them: https://review.openstack.org/#/c/503280/
16:44:05 not sure why the patch is not linked to the bug
16:44:10 but I think it's related
16:44:18 * mlavalle removed -2
16:46:16 I see fullstack failed in a weird way there though
16:46:31 the last thing I want is to break fullstack with this :)
16:46:55 but apart from the failure that should be fixed, the patch seems innocent enough to help their project
16:47:13 * slaweq will cry if fullstack is totally broken again :)
16:47:20 earlier I suggested the script is not part of the neutron api and shouldn't be reused, but maybe that's too pedantic and they know what they're doing
16:48:23 regardless, I removed the gate-failure tag from the bug since no gate is broken
16:48:56 https://bugs.launchpad.net/neutron/+bug/1744983
16:48:56 Launchpad bug 1744983 in neutron "coverage job fails with timeout" [High,Confirmed]
16:49:05 have we seen unit test / coverage job timeouts lately?
16:49:21 we merged https://review.openstack.org/537016 that bumped the time for the job lately. did it help?
16:49:41 I haven't seen it lately
16:49:44 I can't recollect timeouts in the last several weeks so maybe it's gone
16:49:45 I can't remember if I saw such timeouts lately
16:49:59 ok, let me close it; we can reopen if it resurfaces
16:51:12 https://bugs.launchpad.net/neutron/+bug/1660612
16:51:12 Launchpad bug 1660612 in neutron "Tempest full jobs time out on execution" [High,Confirmed]
16:51:54 not sure if this one still happens much. but back when I looked into it, it was because of low concurrency (1) for tempest scenarios
16:52:08 (not our scenarios; tempest scenarios that are executed in -full jobs)
16:52:17 I proposed this in tempest: https://review.openstack.org/#/c/536598/1/tox.ini
16:52:30 but it seems like they are not entirely supportive of the change
16:54:19 the -full jobs, they are still defined in our zuul config
16:54:35 but the problem is that we don't control how we execute the tox env from tempest
16:54:40 it's devstack-gate that does it
16:55:24 in the new zuul tempest jobs it is tempest that does it fwiw
16:55:33 (via the in-tree job config)
16:56:24 clarkb, we have some neutron full jobs too. those trigger d-g for sure.
16:57:26 ya, if those haven't been transitioned yet they will still hit d-g
16:57:52 clarkb, but it's a good point, we also want to touch jobs coming from other repos
16:58:06 clarkb, in the tempest repo, I see the run-tempest role used. where is it defined?
16:58:11 codesearch doesn't help
16:58:45 I mean here: http://git.openstack.org/cgit/openstack/tempest/tree/playbooks/devstack-tempest.yaml#n14
16:58:54 ihrachys: I think it may come from devstack/roles
16:59:05 (codesearch probably failing because it's a dir name and not file content)
16:59:37 clarkb, and devstack/roles is in which repo?
17:00:31 in devstack
17:00:35 ack
17:00:37 sorry, devstack is the repo, roles is the dir
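For the run-tempest question above, the Zuul v3 mechanics are roughly this: a job's playbook can use Ansible roles from the repo that defines the job as well as from any repos listed under the job's roles: attribute (plus anything inherited from parent jobs), which is why a code search of one repo's file contents may not find the role. A paraphrased sketch of a devstack-tempest style job, not the literal tempest config:

```yaml
# Paraphrased sketch only -- the parent and roles entries are illustrative.
- job:
    name: devstack-tempest
    parent: devstack
    run: playbooks/devstack-tempest.yaml
    roles:
      # roles referenced by the playbook (e.g. run-tempest) are resolved from
      # the repos on this list plus the repo holding the job definition itself
      - zuul: openstack-dev/devstack
```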
17:00:46 we are out of time. thanks for joining.
17:00:48 #endmeeting