15:00:40 <johnthetubaguy> #startmeeting XenAPI
15:00:41 <openstack> Meeting started Wed Feb 19 15:00:40 2014 UTC and is due to finish in 60 minutes. The chair is johnthetubaguy. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:42 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:44 <openstack> The meeting name has been set to 'xenapi'
15:00:53 <johnthetubaguy> hello, who is around this week?
15:01:11 <matel> I am here for a short time.
15:01:19 <leifz> I am around.
15:01:26 <sandywalsh_> o/
15:01:27 <BobBall> I am also here
15:01:43 <johnthetubaguy> OK, so let's jump to matel
15:01:53 <johnthetubaguy> #topic XenServer CI
15:02:03 <johnthetubaguy> matel BobBall: tell me good news :D
15:02:17 <BobBall> #link http://paste.openstack.org/show/67274/
15:02:18 <johnthetubaguy> … and any bad news
15:02:27 <BobBall> That's the good news
15:02:34 <matel> Okay, I guess I'll do it with Bob.
15:02:35 <BobBall> #link http://paste.openstack.org/show/67277/ is the bad news
15:02:50 <matel> So we are commenting on successful runs
15:03:05 <BobBall> {'Collected': 31, 'Finished': 142, 'Running': 14, 'Queued': 62} are the current queue stats
15:03:28 <BobBall> difference between "Collected" and "Finished" is that collected has got the results, and finished has posted them
15:03:38 <johnthetubaguy> OK, cool, are we getting complete within the correct timeframe?
15:03:38 <BobBall> we are not posting about failures ATM
15:03:51 <BobBall> It's complete now.
15:03:55 <johnthetubaguy> BobBall: totally makes sense right now
15:03:56 <BobBall> I can post failures now if we want
15:04:07 <johnthetubaguy> well, do we know what they are yet?
15:04:19 <BobBall> but I personally want verification of most of the failures before I turn on auto failure posting
15:04:22 <BobBall> That was the bad news
15:04:27 <BobBall> #link http://paste.openstack.org/show/67277/
15:04:31 <BobBall> Those are the 30 failures
15:04:39 <BobBall> some are real (e.g. 73539)
15:04:43 <johnthetubaguy> right
15:04:45 <BobBall> some are not real
15:04:55 <BobBall> BUT all that we've looked at so far have corresponding defects in the gate
15:05:07 <johnthetubaguy> ah, OK
15:05:19 <johnthetubaguy> I guess it would be good to see that Jenkins signature match
15:05:19 <BobBall> I do not know yet whether we are suffering a higher hit rate of those defects - and if we are, whether they are related to the environment
15:05:24 <leifz> Was wondering if this was just current stability ranking. Is that what you are saying Bob?
15:05:46 <BobBall> sorry leifz? not sure I understand?
15:05:56 <johnthetubaguy> {'': 14, 'Failed': 30, None: 62, 'Passed': 139, 'Aborted: Unknown': 4}
15:05:57 <leifz> Is the failure rate in line with the current gate failure rate?
15:06:00 <johnthetubaguy> is what Bob posted
15:06:05 <BobBall> ah yes - I don't know leifz
15:06:17 <johnthetubaguy> oh, it's usually under a few percent, good question though
15:06:19 <BobBall> I asked yesterday what the current rate was but didn't get an answer and haven't asked again / chased
15:06:35 <leifz> Thanks.
15:06:45 <BobBall> There are other issues we've been fixing in the last couple of days to get the stability of the system fixed
15:07:04 <johnthetubaguy> Do we think it's just code issues at this point?
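
[For readers following the queue numbers above: a minimal sketch of how per-state counts like {'Collected': 31, 'Finished': 142, ...} might be tallied from a list of job records. The job structure and state names here are illustrative assumptions, not the actual XenServer CI code.]

    # Minimal sketch (Python): tally CI jobs by state, as in the summary
    # BobBall pasted above. The job records are hypothetical.
    from collections import Counter

    def queue_stats(jobs):
        """Return per-state counts, e.g. {'Queued': 62, 'Running': 14, ...}."""
        return dict(Counter(job.get('state') for job in jobs))

    jobs = [{'state': 'Queued'}, {'state': 'Running'}, {'state': 'Finished'}]
    print(queue_stats(jobs))  # e.g. {'Queued': 1, 'Running': 1, 'Finished': 1}
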
15:07:15 <BobBall> everything is just code john :)
15:07:42 <BobBall> We're certainly at the point where we should be trying to track down the failures that I've listed in that page
15:08:03 <BobBall> We seem to be hitting test_basic_scenario frequently
15:08:04 <leifz> LOL stepped into that one. How close do we need to be to be non-voting reporting?
15:08:28 <johnthetubaguy> I would rather we don't report crud, people just assume the thing is broken then
15:08:32 <BobBall> and while it's an acknowledged gate bug I suspect we're hitting it more, possibly because of slower volume provisioning or something
15:09:03 <johnthetubaguy> OK, so, can we look at getting the gate bug patterns tested against a failed run?
15:09:11 <johnthetubaguy> see if we hit a gate bug signature, etc
15:09:25 <BobBall> I'm not going to look at automating that now
15:09:32 <BobBall> that's a nice-to-have
15:09:54 <BobBall> Manual inspection of the tests for why they are failing is what we should do to get the rates up
15:10:16 <johnthetubaguy> sure, for now it makes sense
15:10:37 <johnthetubaguy> so do we have a public source for this data you are generating?
15:10:38 <BobBall> Anyway - I'd be happy to argue that we've satisfied what's needed for I-3
15:10:50 <BobBall> All passed tests get voted on
15:10:53 <BobBall> all logs are public
15:11:00 <BobBall> but these lists are not public, no
15:11:04 <johnthetubaguy> OK, one more request...
15:11:18 <BobBall> We could easily create a cronjob to post them somewhere if you want the latest details
15:11:19 <johnthetubaguy> can we just report the errors as "hmm, we found a problem, we are checking it out"
15:11:28 <johnthetubaguy> until we have more confidence?
15:12:17 <johnthetubaguy> BobBall: a cron job of stats would be ideal, just so people can check the queue length / status
15:12:26 <BobBall> We can do that, but I think the current volume means that we will not be able to check out + post on each test - I suspect "we're checking it out" could be seen as a suggestion that they will get another update.
15:12:41 <BobBall> An alternative would be to automatically requeue failed jobs once or twice
15:12:48 <BobBall> but that'll take ages to report on failures
15:13:05 <johnthetubaguy> OK, I reckon, "hmm, we found a bug, we are rechecking"
15:13:10 <BobBall> Perhaps the first failure we comment on it, then say we'll requeue and re-test.
15:13:16 <johnthetubaguy> "hmm, we still found a bug, we will look into this for you soon"
15:13:20 <johnthetubaguy> yeah
15:13:26 <BobBall> No - not the second one - that doesn't scale
15:13:35 <BobBall> the patch submitter must be the one who looks into any failures
15:13:44 <johnthetubaguy> well, I don't mind us not looking into any of those right now
15:13:46 <BobBall> on a patch-by-patch basis
15:13:56 <matel> I don't think we are offering any kind of service like "look into this for you soon"
15:13:58 <BobBall> we can look into more common failures
15:14:10 <johnthetubaguy> so the issue is, if we don't have the gate bot to tell us about errors, then we can't ask the patch submitter to do it yet
15:14:12 <BobBall> One thing that I would like to add is "Patch failed tests XYZ" which should be easy to grep
15:14:25 <johnthetubaguy> OK
15:14:26 <BobBall> and if we get that info - even if it's just internal - then we can easily group failures.
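
[To make the "did we hit a known gate bug signature?" idea above concrete: a rough sketch in the spirit of elastic-recheck, scanning a failed run's console log against a table of regex signatures tied to known bugs. The signature, bug key, and log layout below are illustrative assumptions, not real elastic-recheck queries.]

    # Rough sketch (Python): match a failed run's console log against known
    # gate-bug signatures, elastic-recheck style. The signature shown is
    # hypothetical, not a real e-r query.
    import re

    SIGNATURES = {
        'lp-bug-XXXX': re.compile(r'SSHTimeout: Connection .* timed out'),  # hypothetical
    }

    def match_gate_bugs(log_path):
        hits = set()
        with open(log_path) as log:
            for line in log:
                for bug, pattern in SIGNATURES.items():
                    if pattern.search(line):
                        hits.add(bug)
        # An empty set suggests a failure that matches no known gate bug
        # and so deserves manual inspection.
        return hits
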
15:15:00 <johnthetubaguy> So the big thing I would love is to just warn people we are still testing the system, and we found an error, just it might not be an error
15:15:02 <BobBall> I don't understand that sentence john? ^^
15:15:13 <leifz> Dumb question: do we run current trunk (no patches) on any period?
15:15:16 <matel> I think, instead of throwing in ideas, we would need to really ask what needs to be done to protect XenAPI's place in the trunk.
15:15:27 <johnthetubaguy> leifz: who is we?
15:15:28 <BobBall> No leifz, not currently
15:15:36 <BobBall> but patches continue to pass
15:15:49 <BobBall> if all patches start to fail then it'll be a trunk thing
15:15:49 <johnthetubaguy> BobBall: let me try again
15:15:51 <leifz> Any of the reporting tests.
15:16:36 <johnthetubaguy> So the big thing I would love is: warn people we are still testing the system, but still tell them we found an error, just it might not be an error, it could be the test system that is a bit funny
15:16:43 <BobBall> 15:14 < johnthetubaguy> so the issue is, if we don't have the gate bot to tell us about errors, then we can't ask the patch submitter to do it yet
15:16:46 <BobBall> That sentence
15:16:51 <leifz> I agree with matel on ^^
15:17:06 <johnthetubaguy> oh I see
15:17:14 <BobBall> Agreed with matel too - just didn't see his msg. Sorry matel.
15:17:18 <johnthetubaguy> BobBall: it's the stuff that tells you which bug you hit
15:17:24 <BobBall> ATM we need to be focused on the minimum.
15:17:31 <BobBall> e-r makes people lazy
15:17:42 <BobBall> It's not unreasonable to expect people to look at logs without e-r.
15:17:48 <matel> It's OK, and I think it's fun to spend time with CI systems, but we really have to align our efforts with the requirements.
15:18:02 <johnthetubaguy> right, I am about setting expectations
15:18:07 <johnthetubaguy> we need to report errors
15:18:17 <BobBall> e-r doesn't comment on any other third party systems does it?
15:18:20 <johnthetubaguy> but I would rather we told people we are not sure about them
15:18:32 <johnthetubaguy> until the point where we are more sure it is an error
15:18:39 <johnthetubaguy> hang on, let me re-read the wiki page
15:18:49 <BobBall> Where does any other third party system do that?
15:19:09 <BobBall> #link http://ci.openstack.org/third_party.html#requirements for others
15:19:19 <johnthetubaguy> that's not the Nova one
15:19:36 <johnthetubaguy> https://wiki.openstack.org/wiki/HypervisorSupportMatrix/DeprecationPlan
15:20:06 <johnthetubaguy> So, to meet #1 you need to report errors
15:20:32 <johnthetubaguy> for #2 we need a cron job to show the status of our queue
15:20:33 <BobBall> Fine - so we can do all of that now.
15:20:38 <BobBall> No we don't
15:20:43 <BobBall> Cron job is extra
15:20:54 <BobBall> If it does it then that satisfies the requirement.
15:21:12 <BobBall> But I agree that we should have a cron job so you at RAX can monitor the queue too.
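
[A sketch of the cron job agreed above: publish the queue status somewhere world-readable so anyone can monitor queue length. The output path, the schedule, and the queue_stats() helper (sketched earlier) are assumptions about a possible setup, not the system as built.]

    # Sketch (Python): dump current queue stats as JSON to a web-served path.
    # Run from cron, e.g.: */10 * * * * python publish_queue_stats.py
    # (the schedule and path are illustrative).
    import json
    import time

    def publish_stats(jobs, out_path='/var/www/ci-status/queue.json'):  # hypothetical path
        stats = {'generated': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())}
        stats.update(queue_stats(jobs))  # per-state counts, as sketched earlier
        with open(out_path, 'w') as f:
            json.dump(stats, f, indent=2)
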
15:21:17 <johnthetubaguy> so, let's go through those requirements, just to check
15:21:36 <matel> The job need not be voting, but must be informational so that cores have an increased level of confidence in the patch
15:21:36 <matel> Results should come no later than four hours after patch submission at peak load
15:21:36 <matel> Tests should include a full run of Tempest at a minimum, but may include other tests as appropriate
15:21:36 <matel> Results should be accessible to the world and include log files archived for at least six months
15:21:38 <matel> The tempest configuration being used must be published
15:21:45 <johnthetubaguy> if we don't report errors, we don't meet #1, right?
15:22:02 <BobBall> I can turn on reporting of errors immediately
15:22:12 <johnthetubaguy> how can we prove #2 without some kind of health-of-queue status page?
15:22:23 <BobBall> by looking at the times we reported on the patch.
15:22:31 <johnthetubaguy> BobBall: that sounds good, just can we make a note saying we are not sure yet?
15:22:39 <BobBall> it's very obvious from the patch whether we met the 4 hours or not.
15:22:41 <johnthetubaguy> BobBall: that's a bit nuts though
15:23:03 <johnthetubaguy> do we publish our localrc config and list of tempest skips (assuming there are none)?
15:23:21 <BobBall> #link http://ca.downloads.xensource.com/OpenStack/xenserver-ci/refs/changes/00/73000/2/
15:23:29 <BobBall> We publish the same logs collected by the gate.
15:23:40 <johnthetubaguy> that's not what they mean
15:23:54 <johnthetubaguy> do we have our localrc and list of tempest tests anywhere?
15:24:05 <BobBall> Yes - check that URL I just posted.
15:24:05 <johnthetubaguy> oh, hang on
15:24:15 <johnthetubaguy> I am blind
15:24:44 <johnthetubaguy> OK, we just need a wiki page describing how we meet all those points
15:24:50 <johnthetubaguy> then we can remove that dodgy log message
15:25:00 <johnthetubaguy> then we are good
15:25:22 <johnthetubaguy> Before the nova meeting tomorrow would be awesome
15:25:36 <johnthetubaguy> BobBall matel: life savers by the way, this is awesome stuff
15:25:44 <BobBall> The thing that we really need is help looking into the failures
15:25:53 <BobBall> I don't want to say "We're not sure about the failures"
15:26:01 <johnthetubaguy> #help need help to look into the failures
15:26:11 <johnthetubaguy> BobBall: but it's true, right? we are not sure?
15:26:34 <johnthetubaguy> I would rather say we are not sure for a few weeks while we prove the stability, so people don't just ignore the xenapi test results
15:26:37 <BobBall> I'm saying I don't know if the failures are more likely in XenAPI
15:26:39 <BobBall> they are all failures
15:26:53 <BobBall> but just like every gate failure isn't related to the patch, the same is true of XenAPI failures here.
15:27:04 <BobBall> Gate doesn't say "might not be your fault"
15:27:15 <johnthetubaguy> right, but it's not new
15:27:19 <BobBall> nor do other CIs that I've seen?
15:27:26 <johnthetubaguy> sure
15:27:42 <johnthetubaguy> I just want to be sure they are not new false positives
15:27:48 <johnthetubaguy> anyways, go with what you think is best
15:28:01 <BobBall> Perhaps if I phrased it differently....
15:28:19 <BobBall> Every failure that we have should correspond to a bug in launchpad - and one that should be fixed.
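
[BobBall's point that the four-hour requirement can be verified "by looking at the times we reported on the patch" amounts to a simple timestamp comparison; a minimal sketch, with the timestamp format assumed:]

    # Sketch (Python): check the four-hour turnaround from submission and
    # report timestamps. The timestamp format here is an assumption.
    from datetime import datetime, timedelta

    def within_four_hours(submitted, reported, fmt='%Y-%m-%d %H:%M:%S'):
        delta = datetime.strptime(reported, fmt) - datetime.strptime(submitted, fmt)
        return timedelta(0) <= delta <= timedelta(hours=4)

    print(within_four_hours('2014-02-19 15:00:00', '2014-02-19 18:30:00'))  # True (3.5 hours)
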
15:28:35 <BobBall> I don't know if we are hitting more bugs or bugs more often than KVM or A.N.Other driver
15:28:40 <BobBall> but they are all real
15:29:21 <johnthetubaguy> I don't disagree with you, I just don't want people to start ignoring the XenAPI tests
15:29:37 <BobBall> Which they will if we put a comment saying "Might not be you"
15:29:53 <johnthetubaguy> right, my hope is we prove the system, then remove that phrase
15:30:11 <BobBall> People who get used to the phrase will not notice when it's gone
15:30:19 <johnthetubaguy> when we have more confidence that it's gate bugs, probably via using the gate bug signature thingy
15:30:23 <johnthetubaguy> OK
15:30:34 <matel> Okay, I need to go, sorry.
15:30:44 <BobBall> Thanks matel
15:30:44 <johnthetubaguy> no worries, top work
15:30:57 <BobBall> Have fun with people poking in your mouth.
15:31:41 <johnthetubaguy> nice
15:31:54 <BobBall> Anyway
15:32:01 <BobBall> I've asked Ant for an increase in our quota
15:32:03 <johnthetubaguy> OK, so help with the failures, I would love to jump on that asap
15:32:21 <johnthetubaguy> Oh, cool, he is the right guy for that, makes sense
15:32:23 <BobBall> we're currently restricted to 128GB RAM which, at 8GB instances, is 15 total (it's just under 128G I think)
15:32:42 <BobBall> I know that we've got a giant queue at the moment, but I've been keen to re-process jobs that failed
15:32:49 <BobBall> so I'm a long way from hitting the 4 hour rate
15:33:05 <BobBall> with 50% more or double the VMs we'll get back very quickly.
15:33:15 <johnthetubaguy> right, totally makes sense
15:33:27 <BobBall> and I think that while we're catching up ATM we won't cope with 15 VMs under peak load
15:33:43 <johnthetubaguy> certainly will not, we should increase that for you
15:34:05 <johnthetubaguy> are we spreading across regions yet?
15:34:14 <johnthetubaguy> that might help a little
15:34:16 <BobBall> No - but we've had that at one point
15:34:22 <BobBall> it's easy to do and I've suggested as much to Ant
15:34:28 <BobBall> currently all in IAD
15:34:34 <johnthetubaguy> I think your quota is per region, but I could be wrong
15:34:35 <BobBall> but we've also had it working with DFW
15:34:47 <johnthetubaguy> LON is the other good choice
15:34:48 <BobBall> Oh? in which case I might have misunderstood
15:34:56 <BobBall> I'll try setting up multi-region as another job for me
15:35:04 <BobBall> that might resolve our quota issue today
15:35:14 <BobBall> Can I access LON in the same way?
15:35:20 <BobBall> I know it's separate from the web interface...
15:35:36 <johnthetubaguy> oh, different account still, bummer, maybe ask for one of those from Ant too
15:35:58 <BobBall> Where are performance flavors currently?
15:36:00 <johnthetubaguy> I would do IAD, DFW, ORD as a starting point anyway
15:36:06 <johnthetubaguy> most places now
15:36:06 <BobBall> OK - I'll add ORD too.
15:36:20 <johnthetubaguy> not HKG and SYD
15:36:23 <BobBall> If it's per-region then adding DFW and ORD would more than make up for the issues
15:36:41 <johnthetubaguy> it's worth a whirl, I think it is per region, but I could be wrong
15:36:53 <BobBall> OK
15:37:31 <BobBall> So - tasks so far...
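
[For the multi-region idea above, a sketch of how worker clients might be created per region, assuming the 2014-era python-novaclient v1_1 API; the credentials are placeholders, and whether quotas are per region is, as the discussion notes, unconfirmed.]

    # Sketch (Python): one nova client per region. Spreading jobs across
    # IAD/DFW/ORD should raise the effective instance ceiling if quotas
    # are enforced per region (unconfirmed above).
    from novaclient.v1_1 import client

    REGIONS = ['IAD', 'DFW', 'ORD']  # the starting set suggested above

    def clients_for_regions(username, api_key, project_id, auth_url):
        return dict((region,
                     client.Client(username, api_key, project_id,
                                   auth_url, region_name=region))
                    for region in REGIONS)
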
15:37:44 <BobBall> Bob: Cron job, post -ve comments, multi-region
15:38:07 <johnthetubaguy> yep, that sounds good
15:38:08 <antonym> BobBall: you should have gotten the quota increase btw
15:38:09 <BobBall> John: Investigate some of the failures from http://paste.openstack.org/show/67277/ to match against bugs / or ideally propose fixes to reduce failure rate
15:38:17 <BobBall> perfect! thanks ant!
15:38:22 <BobBall> I'll try to go multi-region first
15:38:26 <antonym> think i shot a mail over yesterday
15:38:29 <BobBall> since that'll be lighter on you
15:38:39 <antonym> if they only did it in one region, let me know and i'll get it set for the others
15:38:52 <BobBall> sorry - I may have missed it with the fun we've been having
15:38:58 <johnthetubaguy> yeah, more independent failures too, when we do a deploy, etc
15:38:59 <antonym> no problem :)
15:39:28 <johnthetubaguy> Bob: write up wiki page with links to tempest config, etc
15:39:35 <johnthetubaguy> add wiki page into:
15:39:35 <BobBall> Failed deploys show up as "Aborted" :)
15:39:38 <BobBall> Good point.
15:39:57 <johnthetubaguy> the above wiki
15:40:14 <BobBall> Will do.
15:40:23 <johnthetubaguy> awesome
15:40:28 <johnthetubaguy> so, sounds like we are almost there
15:40:44 <johnthetubaguy> I am going to dig into errors tomorrow I am afraid, got blueprints to sort out this afternoon
15:41:37 <BobBall> OK
15:41:51 <BobBall> well I won't have time to look at them based on the list of things I've got to do :D
15:42:00 <johnthetubaguy> indeed
15:42:11 <BobBall> OK - that's all for the CI I think
15:42:23 <johnthetubaguy> it's awesome to see it going
15:42:46 <leifz> Real quick: is there a quick link to look at errors in general?
15:42:48 <johnthetubaguy> I actually found a team that might help maintain it in Rackspace, once we have it proven, if that's helpful
15:43:18 <BobBall> What do you mean leifz?
15:43:43 <johnthetubaguy> #action look into XenAPI build errors: http://paste.openstack.org/show/67277/
15:43:44 <BobBall> Should be easy to do johnthetubaguy - it's all up in github
15:43:53 <leifz> you said you needed help looking at failures. Was curious if that's easy to look at.
15:44:07 <BobBall> Ah yes - the link that John gave includes the log files for the errors
15:44:27 <johnthetubaguy> OK, so let's move on..
15:44:32 <johnthetubaguy> #topic AOB
15:44:41 <johnthetubaguy> anyone else got anything to talk about?
15:44:48 <BobBall> We want to add to those log files to include the host logs (matel is working on this) as there is important info in those for some errors
15:44:59 <johnthetubaguy> sounds good
15:45:16 <BobBall> No AOB from me
15:45:42 <BobBall> And I have to jump away now
15:45:46 <johnthetubaguy> it's blueprint cut-off day, patch up today, else your blueprint gets deferred
15:45:51 <johnthetubaguy> OK, thanks BobBall
15:45:52 <BobBall> I'll be back in a few minutes
15:45:54 <johnthetubaguy> nothing from me
15:46:04 <johnthetubaguy> #endmeeting