16:00:42 #startmeeting neutron_ci
16:00:46 Meeting started Tue Jan 31 16:00:42 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:47 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:50 The meeting name has been set to 'neutron_ci'
16:01:28 hello everyone, assuming there is anyone :)
16:01:32 o/
16:01:37 armax: kevinbenton: jlibosva: ding ding
16:02:39 jlibosva: looks more like it's you and me :)
16:02:44 :_
16:02:58 ihrachys: what is the agenda?
16:03:32 well, I was going to present the initiative and go through the etherpad that we already have, listing patches up for review and such.
16:03:43 and maybe later brainstorm on current issues
16:03:53 ok, I looked at the etherpad and put some comments on the ovs failure
16:04:02 yesterday
16:04:03 if it's just me and you, it may not make much sense
16:04:23 at least we would have meeting minutes
16:04:53 if we come up with any action items
16:04:54 right. ok. so, this is the first CI team meeting, spurred by the latest issues in gate
16:05:09 there was a discussion before on the issues, which was captured in https://etherpad.openstack.org/p/neutron-upstream-ci
16:05:25 we will use the etherpad to capture new details on gate problems in the future
16:05:46 there were several things to follow up on, so let's walk through the list
16:06:53 1. checking all things around elastic-recheck, whether queries can target the check queue and such. I still need to follow up on that with the e-r cores, but it looks like they accepted a query targeting functional tests yesterday, so we should hopefully be able to classify func test failures.
16:07:22 2. "banning rechecks without a bug number": again, I need to check with infra on that point
16:07:44 3. armax added the func tests periodic job to grafana: https://review.openstack.org/#/c/426308/
16:08:04 sadly, I don't see it showing up in the periodic dashboard, see http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=4&fullscreen
16:08:28 I see 3 trend lines while there are supposed to be 5 of those as per the dashboard definition
16:09:02 this will need a revisit I guess
16:09:05 was there a run already?
16:09:26 the patch landed 32 hours ago, and it's supposed to trigger daily (?)
16:09:39 I guess we can give it some time and see if it heals itself
16:09:59 but the fact that another job in the dashboard does not show up either is suspicious
16:10:12 oh, sorry :D I thought it was still 30th Jan today
16:10:31 #action armax to make sure periodic functional test job shows up in grafana
16:10:49 #action ihrachys to follow up on elastic-recheck with e-r cores
16:11:09 #action ihrachys to follow up with infra on forbidding bare gerrit rechecks
16:11:54 there is an action item on adding a CI deputy role, but I believe it's not critical and should be decided on by the next ptl
16:12:11 agreed
16:12:22 also, I was going to map all recent functional test failures, and I did (the report is line 21+ in the pad)
16:12:51 the short story is that though different tests are failing, most of them turn out to be the same ovs native failure
16:13:33 seems like bug 1627106 is our sole enemy right now in terms of func tests
16:13:33 bug 1627106 in neutron "TimeoutException while executing test_post_commit_vswitchd_completed_no_failures" [High,In progress] https://launchpad.net/bugs/1627106 - Assigned to Miguel Angel Ajo (mangelajo)
16:14:08 kevinbenton landed a related patch for the bug: https://review.openstack.org/426032 We will need to track it some more to see if that fixes anything
16:14:25 ihrachys: I think it’s because it might not have a failure yet
16:14:42 armax: wouldn't we see a 0% failure rate?
16:14:47 ihrachys: it’s getting built here: http://logs.openstack.org/periodic/periodic-neutron-dsvm-functional-ubuntu-xenial/
16:14:56 ihrachys: strangely, I see that postgres is missing too
16:15:08 but I have yet to find the time to bug infra about seeing what is actually happening
16:15:09 right
16:15:16 aye, sure
16:15:24 but two builds so far
16:15:25 no errors
16:16:02 back to ovsdb native: ajo also has a patch bumping the ovs timeout: https://review.openstack.org/#/c/425623/ though afaiu otherwiseguy has reservations about the direction
16:16:33 there are some interesting findings from this morning by iwamoto
16:16:39 he claims the whole system freezes
16:16:54 as dstat doesn't give any output by the time the probe times out
16:17:03 it's supposed to update every second
16:17:22 vm progress locked by the hypervisor?
16:18:17 could be the reason why no one is able to reproduce it locally
16:18:39 but do we see 10 sec hangs?
16:18:42 or shorter?
16:19:11 let's have a look
16:19:36 btw, speaking of timeouts, another class of functional test failures that I saw in recent runs could be described as 'random tests failing with a test case timeout', even those not touching ovs, like test_migration
16:20:05 but the per-test-case timeout is a lot longer than the ovsdb 10 secs
16:21:01 interesting.
16:21:29 I see a 5 sec lock in the dstat output that iwamoto linked to
16:23:55 interestingly, we see the functional job at ~10% failure rate at the moment, which is a drastic reduction from what we saw even on Friday
16:24:56 not sure what could be the reason
16:27:11 we don't have dstat in the functional job, so it's hard to say if we see the same hypervisor locks
16:27:24 the logs that iwamoto linked to are for neutron-full
16:27:56 I will check if we can easily collect those in the scope of functional tests
16:28:05 what's interesting is that it didn't cause any harm in those tests
16:28:08 #action ihrachys check if we can enable dstat logs for functional job
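Once dstat logs are collected for the functional job (per the #action above), a quick way to test iwamoto's "whole system freezes" theory is to scan the dstat output for gaps between samples, since dstat normally emits one line per second. A minimal sketch, assuming the job runs dstat with an epoch timestamp column (e.g. -T) and --output to a CSV file; the gap threshold and invocation details are illustrative:

```python
import csv
import sys

GAP_THRESHOLD = 3.0  # seconds; anything well above 1s hints at a stall


def find_stalls(path):
    """Return (start, end, gap) tuples for gaps between dstat samples."""
    stalls = []
    prev = None
    with open(path) as f:
        for row in csv.reader(f):
            # Skip dstat's banner/header rows; keep only rows whose first
            # column is a numeric epoch timestamp.
            try:
                ts = float(row[0])
            except (ValueError, IndexError):
                continue
            if prev is not None and ts - prev > GAP_THRESHOLD:
                stalls.append((prev, ts, ts - prev))
            prev = ts
    return stalls


if __name__ == "__main__":
    for start, end, gap in find_stalls(sys.argv[1]):
        print("stall of %.1fs between %.0f and %.0f" % (gap, start, end))
```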
16:29:10 otherwiseguy: so what's your take on bumping the timeout for the test env?
16:30:03 ihrachys, it wouldn't hurt, but I have no idea if it would help.
16:31:03 ok, I guess it's worth a try then, though the latest reduction in failure rate may lower the severity of the issue and also make it harder to spot whether it's the fix that helps.
16:31:28 otherwiseguy: apart from that, any other ideas on how we could help debug or fix the situation from the ovs library side?
16:32:49 ihrachys: during the xenial switch, I noticed ovs 2.6 is more prone to reproducing the issue
16:33:15 ihrachys, right now I'm writing some scripts that spawn multiple processes and just create and delete a bunch of bridges, occasionally also restarting ovsdb-server, etc.
16:33:21 so maybe having a patch that disables ovs compilation for functional tests and leaves the one that's packaged for ubuntu could improve the repro rate
16:33:24 jlibosva: the ovs python library 2.6, or the openvswitch service 2.6?
16:33:33 just trying to reproduce.
16:33:33 ihrachys: the service
16:33:49 the ovsdb server itself probably
16:34:08 jlibosva: improve the rate as in 'raise' or as in 'lower'?
16:34:28 ihrachys: raise :) so we can test patches or add more debug messages etc
16:34:28 just to understand, xenial is 2.5 or 2.6?
16:34:38 IIRC it should be 2.6
16:34:42 let me check
16:34:46 oh, and we compile 2.6.1?
16:35:37 * jlibosva is confused
16:36:34 maybe it's vice versa: 2.5 is worse and we compile 2.6.1
16:37:06 yeah, so xenial packages 2.5 but we compile 2.6.1 on xenial nodes
16:37:09 ok, I guess it should not be hard to spin up the patch and see how it fails
16:37:32 #action jlibosva to spin up a test-only patch to disable ovs compilation to improve reproduce rate
16:38:18 done :)
16:38:32 link
16:39:07 ... some network issues with sending :-/
16:39:28 nevermind, let's move on
16:39:42 sure
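For reference, the kind of reproducer otherwiseguy describes at 16:33:15 could look roughly like the sketch below: several processes churning bridges through ovs-vsctl to put concurrent load on ovsdb-server (run as root on a throwaway host; the worker count, iteration count and bridge names are made up, and the occasional ovsdb-server restart is left out):

```python
import subprocess
from multiprocessing import Process

WORKERS = 8       # illustrative
ITERATIONS = 200  # illustrative


def churn(worker_id):
    """Repeatedly add and delete one bridge to load ovsdb-server."""
    bridge = "br-stress-%d" % worker_id
    for _ in range(ITERATIONS):
        subprocess.check_call(
            ["ovs-vsctl", "--timeout=10", "--may-exist", "add-br", bridge])
        subprocess.check_call(
            ["ovs-vsctl", "--timeout=10", "--if-exists", "del-br", bridge])


if __name__ == "__main__":
    procs = [Process(target=churn, args=(i,)) for i in range(WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```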
16:39:52 I mentioned several tests failing with test case timeouts before
16:40:06 when they do, they fail with an AttributeError on the __str__ call for WaitTimeout
16:40:30 there is a patch by tmorin to fix the error: https://review.openstack.org/#/c/425924/2
16:40:46 while it won't fix the timeout root cause, it's still worth attention
16:41:12 yeah, the gate is giving the patch a hard time
16:41:57 closing the topic of func tests, I see jlibosva added https://bugs.launchpad.net/neutron/+bug/1659965 to the etherpad
16:41:57 Launchpad bug 1659965 in neutron "test_get_root_helper_child_pid_returns_first_child gate failure" [Undecided,In progress] - Assigned to Jakub Libosvar (libosvar)
16:42:07 jlibosva: is it some high impact failure?
16:42:18 ihrachys: no, I don't think so
16:42:21 or do you just have a patch in place that would benefit from review attention?
16:43:06 I added it there as it's a legitimate functional failure. It's kinda new, so I don't know how burning it is
16:43:15 the cause is pstree segfaulting
16:43:19 ok, still seems like something to look at, thanks for pointing it out
16:45:13 that's it for functional tests. as for other jobs, we had oom-killers that we hoped would be fixed by the swappiness tweak: https://review.openstack.org/#/c/425961/
16:45:30 ajo mentioned, though, that we still see the problem happening in gate.
16:45:59 :[
16:47:10 yeah, I see that mentioned in the comments on https://bugs.launchpad.net/neutron/+bug/1656386
16:47:10 Launchpad bug 1656386 in neutron "Memory leaks on Neutron jobs" [Critical,Confirmed] - Assigned to Darek Smigiel (smigiel-dariusz)
16:48:14 armax: I see the strawman patch proposing putting mysql on a diet was abandoned. was there any discussion before that?
16:48:39 ihrachys: not that I am aware of
16:48:46 :-o
16:48:54 ihrachys: we should check the openstack-qa channel
16:50:30 I don't see anything relevant there, probably worth talking to Monty
16:50:57 as for the libvirtd malloc crashes, they're also not fixed, and I don't think we can help it
16:51:17 we also have a new issue with the linuxbridge job: https://bugs.launchpad.net/neutron/+bug/1660612
16:51:17 Launchpad bug 1660612 in neutron "gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial times out on execution" [Undecided,New]
16:51:32 the global timeout kills the test run as it takes more than an hour
16:52:31 and how long does it generally take?
16:53:05 I don't think we have an equivalent with ovs, so it's hard to compare
16:53:22 in another job, I see 40m for all tests
16:53:39 could be a slowdown, hard to say. I guess we have a single data point?
16:54:05 with a successful linuxbridge job, the whole job takes around an hour
16:54:46 so it's around 43 mins in a successful linuxbridge job
16:55:18 weird, ok, let's monitor and see if it shows more impact
16:55:39 one final thing I want to touch base on before closing the meeting is gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv
16:55:40 100% failure rate
16:56:04 jlibosva: do you know what happens there (seems like legit connectivity issues in some tests)?
16:56:18 ihrachys: no, I haven't investigated it
16:56:24 the trend seems to be 100% for almost a week
16:56:54 I think it passed a while ago; we need to understand what broke and fix it, and have a plan to make it voting.
16:56:55 ihrachys: yeah, I'm working on this one. SSH fails there, but we don't collect console logs
16:57:26 ihrachys: it might be related to the ubuntu image, as they update it on their site
16:57:34 jlibosva: oh, don't we? how come? isn't it controlled by generic devstack infra code?
16:57:48 ihrachys: no, it's tempest code
16:57:58 ihrachys: and we have our own Neutron in-tree code
16:58:04 jlibosva: don't we freeze a specific past version of the image?
16:58:05 which doesn't have this capability
16:58:25 ihrachys: that's the problem, they have a 'current' dir and they don't store the images with timestamps
16:58:38 they store maybe the 4 latest, but they get wiped eventually
16:58:43 hm, then maybe we should store it somewhere ourselves?
16:58:59 anyway, even when I fetch the same image as in gate, the job passes in my environment
16:59:01 classic
16:59:16 #action jlibosva to explore what broke scenario job
16:59:17 that would be best, then we would need someone to maintain the storage
16:59:41 jlibosva: well, if it's a one-time update per cycle, it's not a huge deal
16:59:57 ok, thanks jlibosva for joining, I would feel lonely without you :)
17:00:06 I hope next time we will have better attendance
17:00:21 if not, maybe we will need to consider another time slot
17:00:25 thanks again
17:00:27 #endmeeting
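A note on the missing console logs mentioned at 16:56:55: tempest's own scenario tests dump the guest console when SSH fails, and the in-tree scenario framework could do something similar. A rough sketch of the idea; ssh_check, servers_client and server_id are placeholders for whatever the in-tree code actually provides, and get_console_output() stands in for the nova console-output API action:

```python
import logging

LOG = logging.getLogger(__name__)


def check_ssh_with_console_dump(ssh_check, servers_client, server_id):
    """Run an SSH connectivity check; on failure, try to capture the guest
    console before re-raising, so the gate logs show whether the instance
    ever booted and configured its network.
    """
    try:
        ssh_check()
    except Exception:
        try:
            console = servers_client.get_console_output(server_id)
            LOG.error("SSH to %s failed, console output:\n%s",
                      server_id, console.get('output', console))
        except Exception:
            LOG.exception("SSH to %s failed and console output could "
                          "not be fetched", server_id)
        raise
```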