15:01:49 #startmeeting XenAPI
15:01:49 Meeting started Wed Aug 27 15:01:49 2014 UTC and is due to finish in 60 minutes. The chair is johnthetubaguy. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:50 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:52 The meeting name has been set to 'xenapi'
15:02:13 BobBall: hello
15:02:24 and hello to anyone else who is around and interested :)
15:02:29 #topic CI
15:02:43 yay the CI
15:02:44 we're having fun
15:02:52 BobBall: seem the CI has got upsted
15:02:56 failures and no idea why :)
15:02:59 upset?
15:03:07 yeah, I meant upset
15:03:11 Let's use this meeting to try and look into one of the failures
15:03:15 see if you can help :)
15:03:19 OK, sure
15:03:21 http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/86/114286/12/23562/results.html
15:03:27 Just a random example
15:03:33 http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/86/114286/12/23562/testr_results.html.gz
15:03:41 shows that we have a failure in tearDownClass (tempest.scenario.test_server_advanced_ops
15:03:45 InternalServerError: Lock Timeout occurred for key, os-revoke-events (Disable debug mode to suppress these details.) (HTTP 500)
15:03:50 which apparently comes from keystone
15:04:10 http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/86/114286/12/23562/screen-key.txt.gz are the keystone logs and don't have any ERROR or TRACE
15:04:54 there are thousands of os-revoke-events because the lock is acquired lots and released lots
15:05:27 So.... ideas? :)
15:05:49 This issue seems to be failing maybe 30% of the XenServer CI tests ATM
15:06:02 OK, was looking at shelved I guess
15:06:13 Want a cleaner example?
15:06:37 http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/78/95278/9/23521/results.html
15:06:39 That only has the one failure
15:06:44 which is the os-revoke-events issue
15:17:42 hmm, OK
15:17:43 have you searched launchpad yet?
15:17:43 yup
15:17:43 I didn't find anything
15:18:00 agree it looks like delete user failed somehow
15:18:01 So - any thoughts?
15:18:01 why might we be seeing this here rather than in -infra?
15:18:04 different OS version changes some lock thing, no idea really
15:18:05 maybe you are running debug, but gate is not?
15:18:05 it does say disable debug or something in the error message
15:18:05 no - I think gate is running debug
15:18:05 but disabling it just supresses the details
15:18:05 the failure still occurs afaict
15:18:05 [Wed Aug 27 14:35:07.591400 2014] [:error] [pid 1875:tid 140390208882432] 1875 WARNING keystone.common.wsgi [-] Lock Timeout occurred for key, os-revoke-events (Disable debug mode to suppress these details.)
15:18:05 in the keystone logs
15:18:05 ohhhh
15:18:05 how did I not find that?!
15:18:05 maybe the system is just running really slow
15:18:05 which log file?
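
A minimal sketch (not part of the meeting log) of how the warning being discussed can be pulled out of the archived keystone log from the command line; the base URL and the screen-key.txt.gz filename come from the links pasted above, and the grep pattern assumes the WARNING text quoted at 15:18:05.

    # Fetch the archived keystone log from the CI run linked above and look for
    # the lock-timeout warning; adjust LOG_BASE for other runs.
    LOG_BASE=http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/86/114286/12/23562
    curl -s "$LOG_BASE/screen-key.txt.gz" | zcat \
        | grep -n "Lock Timeout occurred for key, os-revoke-events"
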
15:18:05 http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/86/114286/12/23562/screen-key.txt.gz doesn't show that
15:18:05 I do sometimes see it where the whole VM locks up inside Xen for a little while, it could be that
15:18:05 http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/86/114286/12/23562/screen-key.txt.gz
15:18:05 is where I am
15:18:06 *blink* that's the one I'm in
15:18:06 maybe that's a firefox bug failing to search
15:18:06 each for Lock Timeout
15:18:06 search
15:18:06 I am in chrome if that helps
15:18:06 I found it from the timestamp
15:18:06 hmm, does look like it locked up for a few mins
15:18:07 seconds
15:18:07 yeah, oops
15:18:24 OK - so the issue here is that we _got_ the lock, but we didn't release it within the timeout
15:18:24 yeah, sounds like the box is running too slow for some reason
15:18:24 so we can blame RAX? ;)
15:18:24 We could increase the timeout, but that's a horrible solution
15:18:24 and might not solve it - if it's a timeout
15:18:24 not*
15:18:24 I blame the hypervisor, but you know
15:18:24 yeah, its not a great solution
15:18:24 :P
15:18:24 its odd it suddenly started happening
15:18:24 Could be because I'm so backed up ATM I'm running loads of jobs
15:18:24 did we hit some memory maximum?
15:18:40 so maybe they are sitting on the same ... ohhhh good idea
15:18:40 KiB Mem: 4097680 total, 4070128 used, 27552 free, 1436 buffers
15:18:41 Bingo
15:18:56 We're uber-swapping
15:18:56 that would do it
15:18:57 using 2G of swap
15:18:59 eek
15:18:59 that's the problem
15:19:10 OK, now, how to solve....
15:19:11 :P
15:19:13 so turn down the number of a few workers
15:19:21 I would start with API
15:19:37 how many do you think we need?
15:19:53 i duno
15:19:54 tempest can only run 4 jobs, so maybe 4 as a max?
15:20:09 yeah, sounds sensible, what is it at the moment?
15:20:09 we have 19...
15:20:11 oh...
15:22:25 yeah, 4
15:22:25 (we run less than that on our production nodes, from memory)
15:22:25 although we have quite a few production nodes...
15:22:25 how do you set the number of workers?
15:22:26 its a nova conf things
15:22:26 osapi_workers or something like that
15:22:26 oh wow
15:22:26 the default is the number of CPUIs available apparently
15:22:26 which is wrong
15:22:26 we have 6 CPUs so how come we have 19 workers?
15:22:26 ah... does nova-api include other types of worker such as metadata workers?
15:22:26 yeah
15:22:26 os-volume, etc
15:22:31 although maybe thats gone now, but yeah
15:23:38 how many conductors?
15:23:56 well 4 would do I guess, whats now, two processes maybe?
15:24:23 2014-08-27 13:36:15.699 4402 INFO nova.openstack.common.service [-] Starting 6 workers
15:24:58 yeah, I guess it has 6 right now, 4 seems safe enough for now
15:24:58 if not less
15:25:09 but then why do we have 19 instances of nova-api? :)
15:25:16 in ps
15:25:51 hmmm... 18
15:26:00 3 lots of 6? (6 CPUs)
15:26:17 not 100% sure, just looking at the log
15:26:44 it claims only starting 6 workers
15:27:22 did you mean to be running trove and sahara?
15:27:25 2014-08-27 14:40:17.063 4315 INFO nova.openstack.common.service [-] Starting 6 workers
15:27:29 2014-08-27 14:40:18.084 4315 INFO nova.openstack.common.service [-] Starting 6 workers
15:27:31 2014-08-27 14:40:18.524 4315 INFO nova.openstack.common.service [-] Starting 6 workers
15:27:42 that makes 18
15:27:49 e2c, metadata, os-api?
15:28:56 yes
15:29:00 yeah, you are running trove, sahara and ceilometer, is that required now?
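
A rough sketch (not from the meeting) of how the diagnosis above could be reproduced on the node and how the worker counts might be capped; the nova.conf option names below are assumed from the Juno-era defaults and should be checked against the deployed release.

    # Confirm the memory pressure / swapping that top reported above.
    free -m

    # Count the nova-api worker processes (the "19 instances" seen in ps).
    ps -eo pid,cmd | grep -c '[n]ova-api'

    # To cap workers at roughly 4 instead of one per CPU, something like the
    # following in nova.conf (option names assumed, not confirmed in the log):
    #   [DEFAULT]
    #   osapi_compute_workers = 4
    #   metadata_workers = 4
    #   ec2_workers = 4
    #   [conductor]
    #   workers = 4
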
15:29:21 dunno - does tempest depend on them running? :)
15:30:06 That's all in the default devstack-gate
15:30:06 there is no easy way to disable them
15:30:06 Ok, so they get requests it seems
15:30:10 celiometer seems to have three api workers, which seems, excessive
15:30:27 anyways, I guess trim down a few, and things should improve a tidy bit?
15:30:34 trove shouldn't be running actually
15:30:36 that's weird
15:30:43 DEVSTACK_GATE_TROVE should be set to 0
15:30:52 which should mean it doesn't run
15:30:58 0 or '0'?
15:31:06 # Set to 1 to run trove
15:31:06 export DEVSTACK_GATE_TROVE=${DEVSTACK_GATE_TROVE:-0}
15:31:16 the default is don't run
15:31:36 I guess thats not working in your branch
15:32:07 ZUUL_URL=https://review.openstack.org ZUUL_REF=refs/changes/97/108097/19 CHANGED_PROJECT=openstack/nova PYTHONUNBUFFERED=true DEVSTACK_GATE_TEMPEST=1 DEVSTACK_GATE_TEMPEST_FULL=1 DEVSTACK_GATE_VIRT_DRIVER=xenapi DEVSTACK_GATE_TIMEOUT=180 APPLIANCE_NAME=devstack ENABLED_SERVICES=g-api,g-reg,key,n-api,n-crt,n-obj,n-cpu,n-sch,horizon,mysql,rabbit,sysstat,dstat,pidstat,s-proxy,s-account,s-container,s-object,n-cond /home/jenkins/xenapi-os
15:32:07 is how we run the job...
15:32:07 I'm confused why is trove running
15:32:22 ++ export DEVSTACK_GATE_TROVE=0
15:32:23 ++ DEVSTACK_GATE_TROVE=0
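
For context on the DEVSTACK_GATE_TROVE exchange, a small sketch (not part of the log) of how the ${VAR:-default} expansion quoted at 15:31:06 behaves, and how the flag could be pinned explicitly before devstack-gate runs; only the DEVSTACK_GATE_TROVE variable itself comes from the discussion.

    # ${VAR:-0} only supplies a default when the variable is unset or empty;
    # it does NOT override a value exported earlier in the job environment.
    unset DEVSTACK_GATE_TROVE
    export DEVSTACK_GATE_TROVE=${DEVSTACK_GATE_TROVE:-0}   # -> 0 (default applied)

    DEVSTACK_GATE_TROVE=1
    export DEVSTACK_GATE_TROVE=${DEVSTACK_GATE_TROVE:-0}   # -> 1 (default ignored)

    # To be certain trove stays off, pin it unconditionally before the gate script runs:
    export DEVSTACK_GATE_TROVE=0
    echo "DEVSTACK_GATE_TROVE=$DEVSTACK_GATE_TROVE"
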