16:00:42 <ihrachys> #startmeeting neutron_ci
16:00:46 <openstack> Meeting started Tue Jan 31 16:00:42 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:47 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:50 <openstack> The meeting name has been set to 'neutron_ci'
16:01:28 <ihrachys> hello everyone, assuming there is anyone :)
16:01:32 <jlibosva> o/
16:01:37 <ihrachys> armax: kevinbenton: jlibosva: ding ding
16:02:39 <ihrachys> jlibosva: looks more like it's you and me :)
16:02:44 <jlibosva> :_
16:02:58 <jlibosva> ihrachys: what is the agenda?
16:03:32 <ihrachys> well I was going to present the initiative and go through the etherpad that we already have, listing patches up for review and such.
16:03:43 <ihrachys> and maybe later some brainstorming on current issues
16:03:53 <jlibosva> ok, I looked at the etherpad and put some comments on the ovs failure
16:04:02 <jlibosva> yesterday
16:04:03 <ihrachys> if it's just me and you, it may not make much sense
16:04:23 <jlibosva> at least we would have meeting minutes
16:04:53 <jlibosva> if we come up with any action items
16:04:54 <ihrachys> right. ok. so, this is the first CI team meeting, spurred by the latest issues in the gate
16:05:09 <ihrachys> there was a discussion before on the issues that was captured in https://etherpad.openstack.org/p/neutron-upstream-ci
16:05:25 <ihrachys> we will use the etherpad to capture new details on gate problems in the future
16:05:46 <ihrachys> there were several things to follow up on, so let's walk through the list
16:06:53 <ihrachys> 1. checking all things around elastic-recheck, whether queries can target the check queue and such. I still have to follow up on that with the e-r cores, but it looks like they accepted a query targeting functional tests yesterday, so we should hopefully be able to classify func test failures.
16:07:22 <ihrachys> 2. "banning rechecks without a bug number" - again, I still have to check with infra on that point
16:07:44 <ihrachys> 3. armax added a func tests periodic job to grafana: https://review.openstack.org/#/c/426308/
16:08:04 <ihrachys> sadly, I don't see it showing up in the periodic dashboard, see http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=4&fullscreen
16:08:28 <ihrachys> I see 3 trend lines while there are supposed to be 5 of those as per the dashboard definition
16:09:02 <ihrachys> this will need a revisit I guess
16:09:05 <jlibosva> was there a run already?
16:09:26 <ihrachys> the patch landed 32 hours ago, and it's supposed to trigger daily (?)
16:09:39 <ihrachys> I guess we can give it some time and see if it heals itself
16:09:59 <ihrachys> but the fact that another job in the dashboard does not show up either is suspicious
16:10:12 <jlibosva> oh, sorry :D I thought it's still 30th Jan today
16:10:31 <ihrachys> #action armax to make sure periodic functional test job shows up in grafana
16:10:49 <ihrachys> #action ihrachys to follow up on elastic-recheck with e-r cores
16:11:09 <ihrachys> #action ihrachys to follow up with infra on forbidding bare gerrit rechecks
16:11:54 <ihrachys> there is an action item on adding a CI deputy role, but I believe it's not critical and should be decided on by the next ptl
16:12:11 <jlibosva> agreed
16:12:22 <ihrachys> also, I was going to map all recent functional test failures, which I did (the report is at line 21+ in the pad)
16:12:51 <ihrachys> the short story is that though different tests are failing, most of them turn out to be the same ovs native failure
16:13:33 <ihrachys> seems like bug 1627106 is our sole enemy right now in terms of func tests
16:13:33 <openstack> bug 1627106 in neutron "TimeoutException while executing test_post_commit_vswitchd_completed_no_failures" [High,In progress] https://launchpad.net/bugs/1627106 - Assigned to Miguel Angel Ajo (mangelajo)
16:14:08 <ihrachys> kevinbenton landed a related patch for the bug: https://review.openstack.org/426032 We will need to track some more to see if that fixes anything
16:14:25 <armax> ihrachys: I think it’s because it might not have a failure yet
16:14:42 <ihrachys> armax: wouldn't we see a 0% failure rate?
16:14:47 <armax> ihrachys: it’s getting built here: http://logs.openstack.org/periodic/periodic-neutron-dsvm-functional-ubuntu-xenial/
16:14:56 <armax> ihrachys: strangely I see that postgres is missing too
16:15:08 <armax> but I have yet to find the time to bug infra about seeing what is actually happening
16:15:09 <ihrachys> right
16:15:16 <ihrachys> aye, sure
16:15:24 <armax> but two builds so far
16:15:25 <armax> no errorr
16:15:27 <armax> errors
16:16:02 <ihrachys> back to ovsdb native, ajo also has a patch bumping the ovs timeout: https://review.openstack.org/#/c/425623/ though afaiu otherwiseguy has reservations about the direction
16:16:33 <jlibosva> there are some interesting findings from this morning by iwamoto
16:16:39 <jlibosva> he claims the whole system freezes
16:16:54 <jlibosva> as dstat doesn't give any output by the time the probe times out
16:17:03 <jlibosva> it's supposed to update every second
16:17:22 <ihrachys> vm progress locked by the hypervisor?
16:18:17 <jlibosva> could be the reason why no one is able to reproduce it locally
16:18:39 <ihrachys> but do we see 10sec hangs?
16:18:42 <ihrachys> or shorter?
16:19:11 <jlibosva> let's have a look
16:19:36 <ihrachys> btw speaking of timeouts, another class of functional test failures that I saw in late runs could be described as 'random tests failing with test case timeout', even those not touching ovs, like test_migration
16:20:05 <ihrachys> but the per-test-case timeout is a lot longer than the ovsdb 10secs
16:21:01 <otherwiseguy> interesting.
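[Editor's note: for context on the timeout discussed above, the OVSDB native interface commands in neutron run with a 10-second timeout, and ajo's patch is about raising it. The snippet below is only a minimal sketch of that kind of change, assuming the option is the [OVS] group's ovsdb_timeout; how https://review.openstack.org/#/c/425623/ actually registers and wires the new value into the functional tests may differ.]

```python
# Minimal sketch only: raise the OVSDB command timeout from its 10 s default.
# The option name/group mirror neutron's [OVS] ovsdb_timeout setting; the real
# patch may register and override it through different fixtures.
from oslo_config import cfg

OVS_OPTS = [
    cfg.IntOpt('ovsdb_timeout', default=10,
               help='Timeout in seconds for OVSDB commands'),
]

cfg.CONF.register_opts(OVS_OPTS, group='OVS')

# A functional test (or a deployment config) could then relax the timeout on
# slow CI nodes, e.g. from 10 s to 30 s:
cfg.CONF.set_override('ovsdb_timeout', 30, group='OVS')
```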
16:21:29 <ihrachys> I see a 5sec lock in the dstat output that Iwamoto linked to
16:23:55 <ihrachys> interestingly, we see the functional job at ~10% failure rate at the moment, which is a drastic reduction from what we saw even on Friday
16:24:56 <ihrachys> not sure what could be the reason
16:27:11 <ihrachys> we don't have dstat in the functional job, so it's hard to say if we see the same hypervisor locks
16:27:24 <ihrachys> the logs that Iwamoto linked to are for neutron-full
16:27:56 <ihrachys> I will check if we can easily collect those in the scope of functional tests
16:28:05 <jlibosva> what's interesting is that it didn't cause any harm in those tests
16:28:08 <ihrachys> #action ihrachys check if we can enable dstat logs for functional job
16:29:10 <ihrachys> otherwiseguy: so what's your take on bumping the timeout for the test env?
16:30:03 <otherwiseguy> ihrachys, it wouldn't hurt, but I have no idea if it would help.
16:31:03 <ihrachys> ok I guess it's worth a try then. though the latest reduction in failure rate may relax the severity of the issue and also make it harder to spot if it's the fix that helps.
16:31:28 <ihrachys> otherwiseguy: apart from that, any other ideas how we could help debug or fix the situation from the ovs library side?
16:32:49 <jlibosva> ihrachys: During the xenial switch, I noticed ovs 2.6 is more prone to reproducing the issue
16:33:15 <otherwiseguy> ihrachys, right now I'm writing some scripts that spawn multiple processes and just create and delete a bunch of bridges, occasionally restarting the ovsdb-server, etc.
16:33:21 <jlibosva> so maybe having a patch that disables ovs compilation for functional and leaves the one that's packaged for ubuntu could improve the repro rate
16:33:24 <ihrachys> jlibosva: ovs python library 2.6, or openvswitch service 2.6?
16:33:33 <otherwiseguy> just trying to reproduce.
16:33:33 <jlibosva> ihrachys: service
16:33:49 <jlibosva> the ovsdb server itself probably
16:34:08 <ihrachys> jlibosva: improve rate as in 'raise' or as in 'lower'?
16:34:28 <jlibosva> ihrachys: raise :) so we can test patches or add more debug messages etc
16:34:28 <ihrachys> just to understand, xenial is 2.5 or 2.6?
16:34:38 <jlibosva> IIRC it should be 2.6
16:34:42 <jlibosva> let me check
16:34:46 <ihrachys> oh and we compile 2.6.1?
16:35:37 * jlibosva is confused
16:36:34 <jlibosva> maybe it's vice-versa. 2.5 is worse and we compile 2.6.1
16:37:06 <jlibosva> yeah, so xenial contains 2.5 packages but we compile 2.6.1 on xenial nodes
16:37:09 <ihrachys> ok, I guess it should not be hard to spin up the patch and see how it fails
16:37:32 <ihrachys> #action jlibosva to spin up a test-only patch to disable ovs compilation to improve reproduce rate
16:38:18 <jlibosva> done :)
16:38:32 <ihrachys> link
16:39:07 <jlibosva> ...
some network issues with sending :-/
16:39:28 <ihrachys> nevermind, let's move on
16:39:42 <jlibosva> sure
16:39:52 <ihrachys> I mentioned several tests failing with test case timeouts before
16:40:06 <ihrachys> when they do, they fail with an AttributeError on the __str__ call for WaitTimeout
16:40:30 <ihrachys> there is a patch by tmorin to fix the error: https://review.openstack.org/#/c/425924/2
16:40:46 <ihrachys> while it won't fix the timeout root cause, it's still worth attention
16:41:12 <jlibosva> yeah, the gate is giving the patch a hard time
16:41:57 <ihrachys> closing the topic of func tests, I see jlibosva added https://bugs.launchpad.net/neutron/+bug/1659965 to the etherpad
16:41:57 <openstack> Launchpad bug 1659965 in neutron "test_get_root_helper_child_pid_returns_first_child gate failure" [Undecided,In progress] - Assigned to Jakub Libosvar (libosvar)
16:42:07 <ihrachys> jlibosva: is it some high impact failure?
16:42:18 <jlibosva> ihrachys: no, I don't think so
16:42:21 <ihrachys> or do you just have the patch in place that would benefit from review attention
16:43:06 <jlibosva> I added it there as it's a legitimate functional failure. It's kinda new so I don't know how burning that is
16:43:15 <jlibosva> the cause is pstree segfaulting
16:43:19 <ihrachys> ok, still seems like something to look at, thanks for pointing it out
16:45:13 <ihrachys> that's it for functional tests. as for other jobs, we had oom-killers that we hoped would be fixed by the swappiness tweak: https://review.openstack.org/#/c/425961/
16:45:30 <ihrachys> ajo mentioned though that we still see the problem happening in the gate.
16:45:59 <jlibosva> :[
16:47:10 <ihrachys> yeah, I see that mentioned in the https://bugs.launchpad.net/neutron/+bug/1656386 comments
16:47:10 <openstack> Launchpad bug 1656386 in neutron "Memory leaks on Neutron jobs" [Critical,Confirmed] - Assigned to Darek Smigiel (smigiel-dariusz)
16:48:14 <ihrachys> armax: I see the strawman patch proposing putting mysql on a diet was abandoned. was there any discussion before that?
16:48:39 <armax> ihrachys: not that I am aware of
16:48:46 <ihrachys> :-o
16:48:54 <armax> ihrachys: we should check the openstack-qa channel
16:50:30 <ihrachys> I don't see anything relevant there, probably worth talking to Monty
16:50:57 <ihrachys> as for the libvirtd malloc crashes, they're also not fixed, and I don't think we can help it
16:51:17 <jlibosva> we also have a new issue with the linuxbridge job: https://bugs.launchpad.net/neutron/+bug/1660612
16:51:17 <openstack> Launchpad bug 1660612 in neutron "gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial times out on execution" [Undecided,New]
16:51:32 <jlibosva> the global timeout kills the test run as it runs for more than an hour
16:52:31 <ihrachys> and how long does it generally take?
16:53:05 <jlibosva> I don't think we have an equivalent with ovs so it's hard to compare
16:53:22 <ihrachys> in another job, I see 40m for all tests
16:53:39 <ihrachys> could be a slowdown, hard to say. I guess we have a single data point?
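[Editor's note: on the WaitTimeout AttributeError mentioned a few lines up, the sketch below only illustrates that class of bug; the class shown is a hypothetical simplification, not the actual neutron code or tmorin's patch. The failure mode is an exception whose __str__ depends on an attribute that is not always set, so rendering the timeout itself raises AttributeError and masks the real test failure.]

```python
# Hypothetical sketch of the failure mode, not the actual neutron code:
# __str__ references an attribute that may never have been set, so printing
# the timeout raises AttributeError and hides the real test failure.
class WaitTimeout(Exception):
    def __init__(self, seconds=None):
        self.seconds = seconds

    def __str__(self):
        return self.message  # AttributeError if 'message' was never set


# A defensive variant only renders what it knows it has:
class SafeWaitTimeout(Exception):
    def __init__(self, seconds=None):
        super(SafeWaitTimeout, self).__init__(seconds)
        self.seconds = seconds

    def __str__(self):
        return 'Timed out after %s seconds' % self.seconds
```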
16:54:05 <jlibosva> with a successful linuxbridge job, the whole job takes around an hour
16:54:46 <jlibosva> so it's around 43mins in a successful linuxbridge job
16:55:18 <ihrachys> weird, ok let's monitor and see if it shows more impact
16:55:39 <ihrachys> one final thing I want to touch base on before closing the meeting is gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv
16:55:40 <ihrachys> 100% failure rate
16:56:04 <ihrachys> jlibosva: do you know what happens there (seems like legit connectivity issues in some tests)?
16:56:18 <jlibosva> ihrachys: no, I haven't investigated it
16:56:24 <ihrachys> the trend seems to be 100% for almost a week
16:56:54 <ihrachys> I think it passed a while ago; we need to understand what broke and fix it, and have a plan to make it voting.
16:56:55 <jlibosva> ihrachys: yeah, I'm working on this one. SSH fails there but we don't collect console logs
16:57:26 <jlibosva> ihrachys: it might be related to the ubuntu image as they update it on their site
16:57:34 <ihrachys> jlibosva: oh don't we? how come? isn't it controlled by generic devstack infra code?
16:57:48 <jlibosva> ihrachys: no, it's tempest code
16:57:58 <jlibosva> ihrachys: and we have our own Neutron in-tree code
16:58:04 <ihrachys> jlibosva: don't we freeze a specific past version of the image?
16:58:05 <jlibosva> which doesn't have this capability
16:58:25 <jlibosva> ihrachys: that's the problem, they have a 'current' dir and they don't store those with timestamps
16:58:38 <jlibosva> they store maybe the 4 latest but they get wiped eventually
16:58:43 <ihrachys> hm, then maybe we should store it somewhere ourselves?
16:58:59 <jlibosva> anyway, even when I fetch the same image as in the gate, the job passes on my environment
16:59:01 <jlibosva> classic
16:59:16 <ihrachys> #action jlibosva to explore what broke scenario job
16:59:17 <jlibosva> that would be best, then we would need someone to maintain the storage
16:59:41 <ihrachys> jlibosva: well if it's a one-time update per cycle, it's not a huge deal
16:59:57 <ihrachys> ok thanks jlibosva for joining, I would feel lonely without you :)
17:00:06 <ihrachys> I hope next time we will have better attendance
17:00:21 <ihrachys> if not, maybe we will need to consider another time
17:00:25 <ihrachys> thanks again
17:00:27 <ihrachys> #endmeeting