15:00:40 <johnthetubaguy> #startmeeting XenAPI
15:00:41 <openstack> Meeting started Wed Feb 19 15:00:40 2014 UTC and is due to finish in 60 minutes. The chair is johnthetubaguy. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:42 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:44 <openstack> The meeting name has been set to 'xenapi'
15:00:53 <johnthetubaguy> hello, who is around this week?
15:01:11 <matel> I am here for a short time.
15:01:19 <leifz> I am around.
15:01:26 <sandywalsh_> o/
15:01:27 <BobBall> I am also here
15:01:43 <johnthetubaguy> OK, so let's jump to matel
15:01:53 <johnthetubaguy> #topic XenServer CI
15:02:03 <johnthetubaguy> matel BobBall: tell me good news :D
15:02:17 <BobBall> #link http://paste.openstack.org/show/67274/
15:02:18 <johnthetubaguy> … and any bad news
15:02:27 <BobBall> That's the good news
15:02:34 <matel> Okay, I guess I'll do it with Bob.
15:02:35 <BobBall> #link http://paste.openstack.org/show/67277/ is the bad news
15:02:50 <matel> So we are commenting on successful runs
15:03:05 <BobBall> {'Collected': 31, 'Finished': 142, 'Running': 14, 'Queued': 62} are the current queue stats
15:03:28 <BobBall> difference between "Collected" and "Finished" is that collected has got the results, and finished has posted them
15:03:38 <johnthetubaguy> OK, cool, are we getting complete within the correct timeframe?
15:03:38 <BobBall> we are not posting about failures ATM
15:03:51 <BobBall> It's complete now.
15:03:55 <johnthetubaguy> BobBall: totally makes sense right now
15:03:56 <BobBall> I can post failures now if we want
15:04:07 <johnthetubaguy> well, do we know what they are yet?
15:04:19 <BobBall> but I personally want verification of most of the failures before I turn on auto failure posting
15:04:22 <BobBall> That was the bad news
15:04:27 <BobBall> #link http://paste.openstack.org/show/67277/
15:04:31 <BobBall> Those are the 30 failures
15:04:39 <BobBall> some are real (e.g. 73539)
15:04:43 <johnthetubaguy> right
15:04:45 <BobBall> some are not real
15:04:55 <BobBall> BUT all that we've looked at so far have corresponding defects in the gate
15:05:07 <johnthetubaguy> ah, OK
15:05:19 <johnthetubaguy> I guess it would be good to see that Jenkins signature match
15:05:19 <BobBall> I do not know yet whether we are suffering a higher hit rate of those defects - and if we are, whether they are related to the environment
15:05:24 <leifz> Was wondering if this was just current stability ranking. Is that what you are saying Bob?
15:05:46 <BobBall> sorry leifz? not sure I understand?
15:05:56 <johnthetubaguy> {'': 14, 'Failed': 30, None: 62, 'Passed': 139, 'Aborted: Unknown': 4}
15:05:57 <leifz> Is the failure rate in line with the current gate failure rate?
15:06:00 <johnthetubaguy> is what Bob posted
15:06:05 <BobBall> ah yes - I don't know leifz
15:06:17 <johnthetubaguy> oh, it's usually under a few percent, good question though
15:06:19 <BobBall> I asked yesterday what the current rate was but didn't get an answer and haven't asked again / chased
15:06:35 <leifz> Thanks.
15:06:45 <BobBall> There are other issues we've been fixing in the last couple of days to get the stability of the system fixed
15:07:04 <johnthetubaguy> Do we think it's just code issues at this point?
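
[For readers following the queue numbers above: a minimal sketch of how per-state counts like {'Collected': 31, 'Finished': 142, ...} might be tallied from a list of job records. The job structure and state names here are illustrative assumptions, not the actual XenServer CI code.]

    # Minimal sketch (Python): tally CI jobs by state, as in the summary
    # BobBall pasted above. The job records are hypothetical.
    from collections import Counter

    def queue_stats(jobs):
        """Return per-state counts, e.g. {'Queued': 62, 'Running': 14, ...}."""
        return dict(Counter(job.get('state') for job in jobs))

    jobs = [{'state': 'Queued'}, {'state': 'Running'}, {'state': 'Finished'}]
    print(queue_stats(jobs))  # e.g. {'Queued': 1, 'Running': 1, 'Finished': 1}
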
15:07:15 <BobBall> everything is just code john :)
15:07:42 <BobBall> We're certainly at the point where we should be trying to track down the failures that I've listed in that page
15:08:03 <BobBall> We seem to be hitting test_basic_scenario frequently
15:08:04 <leifz> LOL stepped into that one. How close do we need to be to be non-voting reporting?
15:08:28 <johnthetubaguy> I would rather we don't report crud, people just assume the thing is broken then
15:08:32 <BobBall> and while it's an acknowledged gate bug I suspect we're hitting it more, possibly because of slower volume provisioning or something
15:09:03 <johnthetubaguy> OK, so, can we look at getting the gate bug patterns tested against a failed run?
15:09:11 <johnthetubaguy> see if we hit a gate bug signature, etc
15:09:25 <BobBall> I'm not going to look at automating that now
15:09:32 <BobBall> that's a nice-to-have
15:09:54 <BobBall> Manual inspection of the tests for why they are failing is what we should do to get the rates up
15:10:16 <johnthetubaguy> sure, for now it makes sense
15:10:37 <johnthetubaguy> so do we have a public source for this data you are generating?
15:10:38 <BobBall> Anyway - I'd be happy to argue that we've satisfied what's needed for I-3
15:10:50 <BobBall> All passed tests get voted on
15:10:53 <BobBall> all logs are public
15:11:00 <BobBall> but these lists are not public, no
15:11:04 <johnthetubaguy> OK, one more request...
15:11:18 <BobBall> We could easily create a cronjob to post them somewhere if you want the latest details
15:11:19 <johnthetubaguy> can we just report the errors as "hmm, we found a problem, we are checking it out"
15:11:28 <johnthetubaguy> until we have more confidence?
15:12:17 <johnthetubaguy> BobBall: a cron job of stats would be ideal, just so people can check the queue length / status
15:12:26 <BobBall> We can do that, but I think the current volume means that we will not be able to check out + post on each test - I suspect "we're checking it out" could be seen as a suggestion that they will get another update.
15:12:41 <BobBall> An alternative would be to automatically requeue failed jobs once or twice
15:12:48 <BobBall> but that'll take ages to report on failures
15:13:05 <johnthetubaguy> OK, I reckon, "hmm, we found a bug, we are rechecking"
15:13:10 <BobBall> Perhaps the first failure we comment on it, then say we'll requeue and re-test.
15:13:16 <johnthetubaguy> "hmm, we still found a bug, we will look into this for you soon"
15:13:20 <johnthetubaguy> yeah
15:13:26 <BobBall> No - not the second one - that doesn't scale
15:13:35 <BobBall> the patch submitter must be the one who looks into any failures
15:13:44 <johnthetubaguy> well, I don't mind us not looking into any of those right now
15:13:46 <BobBall> on a patch-by-patch basis
15:13:56 <matel> I don't think we are offering any kind of service like "look into this for you soon"
15:13:58 <BobBall> we can look into more common failures
15:14:10 <johnthetubaguy> so the issue is, if we don't have the gate bot to tell us about errors, then we can't ask the patch submitter to do it yet
15:14:12 <BobBall> One thing that I would like to add is "Patch failed tests XYZ" which should be easy to grep
15:14:25 <johnthetubaguy> OK
15:14:26 <BobBall> and if we get that info - even if it's just internal - then we can easily group failures.
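
[To make the "did we hit a known gate bug signature?" idea above concrete: a rough sketch in the spirit of elastic-recheck, scanning a failed run's console log against a table of regex signatures tied to known bugs. The signature, bug key, and log layout below are illustrative assumptions, not real elastic-recheck queries.]

    # Rough sketch (Python): match a failed run's console log against known
    # gate-bug signatures, elastic-recheck style. The signature shown is
    # hypothetical, not a real e-r query.
    import re

    SIGNATURES = {
        'lp-bug-XXXX': re.compile(r'SSHTimeout: Connection .* timed out'),  # hypothetical
    }

    def match_gate_bugs(log_path):
        hits = set()
        with open(log_path) as log:
            for line in log:
                for bug, pattern in SIGNATURES.items():
                    if pattern.search(line):
                        hits.add(bug)
        # An empty set suggests a failure that matches no known gate bug
        # and so deserves manual inspection.
        return hits
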
15:15:00 <johnthetubaguy> So the big thing I would love is to just warn people we are still testing the system, and we found an error, just it might not be an error
15:15:02 <BobBall> I don't understand that sentence john? ^^
15:15:13 <leifz> Dumb question: do we run current trunk (no patches) on any period?
15:15:16 <matel> I think, instead of throwing in ideas, we would need to really ask what needs to be done to protect XenAPI's place in the trunk.
15:15:27 <johnthetubaguy> leifz: who is we?
15:15:28 <BobBall> No leifz, not currently
15:15:36 <BobBall> but patches continue to pass
15:15:49 <BobBall> if all patches start to fail then it'll be a trunk thing
15:15:49 <johnthetubaguy> BobBall: let me try again
15:15:51 <leifz> Any of the reporting tests.
15:16:36 <johnthetubaguy> So the big thing I would love is: warn people we are still testing the system, but still tell them we found an error, just it might not be an error, it could be the test system that is a bit funny
15:16:43 <BobBall> 15:14 < johnthetubaguy> so the issue is, if we don't have the gate bot to tell us about errors, then we can't ask the patch submitter to do it yet
15:16:46 <BobBall> That sentence
15:16:51 <leifz> I agree with matel on ^^
15:17:06 <johnthetubaguy> oh I see
15:17:14 <BobBall> Agreed with matel too - just didn't see his msg. Sorry matel.
15:17:18 <johnthetubaguy> BobBall: it's the stuff that tells you which bug you hit
15:17:24 <BobBall> ATM we need to be focused on the minimum.
15:17:31 <BobBall> e-r makes people lazy
15:17:42 <BobBall> It's not unreasonable to expect people to look at logs without e-r.
15:17:48 <matel> It's OK, and I think it's fun to spend time with CI systems, but we really have to align our efforts with the requirements.
15:18:02 <johnthetubaguy> right, I am about setting expectations
15:18:07 <johnthetubaguy> we need to report errors
15:18:17 <BobBall> e-r doesn't comment on any other third party systems does it?
15:18:20 <johnthetubaguy> but I would rather we told people we are not sure about them
15:18:32 <johnthetubaguy> until the point where we are more sure it is an error
15:18:39 <johnthetubaguy> hang on, let me re-read the wiki page
15:18:49 <BobBall> Where does any other third party system do that?
15:19:09 <BobBall> #link http://ci.openstack.org/third_party.html#requirements for others
15:19:19 <johnthetubaguy> that's not the Nova one
15:19:36 <johnthetubaguy> https://wiki.openstack.org/wiki/HypervisorSupportMatrix/DeprecationPlan
15:20:06 <johnthetubaguy> So, to meet #1 you need to report errors
15:20:32 <johnthetubaguy> for #2 we need a cron job to show the status of our queue
15:20:33 <BobBall> Fine - so we can do all of that now.
15:20:38 <BobBall> No we don't
15:20:43 <BobBall> Cron job is extra
15:20:54 <BobBall> If it does it then that satisfies the requirement.
15:21:12 <BobBall> But I agree that we should have a cron job so you at RAX can monitor the queue too.
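
[A sketch of the cron job agreed above: publish the queue status somewhere world-readable so anyone can monitor queue length. The output path, the schedule, and the queue_stats() helper (sketched earlier) are assumptions about a possible setup, not the system as built.]

    # Sketch (Python): dump current queue stats as JSON to a web-served path.
    # Run from cron, e.g.: */10 * * * * python publish_queue_stats.py
    # (the schedule and path are illustrative).
    import json
    import time

    def publish_stats(jobs, out_path='/var/www/ci-status/queue.json'):  # hypothetical path
        stats = {'generated': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())}
        stats.update(queue_stats(jobs))  # per-state counts, as sketched earlier
        with open(out_path, 'w') as f:
            json.dump(stats, f, indent=2)
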
15:21:17 <johnthetubaguy> so, let's go through those requirements, just to check
15:21:36 <matel> The job need not be voting, but must be informational so that cores have an increased level of confidence in the patch
15:21:36 <matel> Results should come no later than four hours after patch submission at peak load
15:21:36 <matel> Tests should include a full run of Tempest at a minimum, but may include other tests as appropriate
15:21:36 <matel> Results should be accessible to the world and include log files archived for at least six months
15:21:38 <matel> The tempest configuration being used must be published
15:21:45 <johnthetubaguy> if we don't report errors, we don't meet #1, right?
15:22:02 <BobBall> I can turn on reporting of errors immediately
15:22:12 <johnthetubaguy> how can we prove #2 without some kind of health-of-queue status page?
15:22:23 <BobBall> by looking at the times we reported on the patch.
15:22:31 <johnthetubaguy> BobBall: that sounds good, just can we make a note saying we are not sure yet?
15:22:39 <BobBall> it's very obvious from the patch whether we met the 4 hours or not.
15:22:41 <johnthetubaguy> BobBall: that's a bit nuts though
15:23:03 <johnthetubaguy> do we publish our localrc config and list of tempest skips (assuming there are none)?
15:23:21 <BobBall> #link http://ca.downloads.xensource.com/OpenStack/xenserver-ci/refs/changes/00/73000/2/
15:23:29 <BobBall> We publish the same logs collected by the gate.
15:23:40 <johnthetubaguy> that's not what they mean
15:23:54 <johnthetubaguy> do we have our localrc and list of tempest tests anywhere?
15:24:05 <BobBall> Yes - check that URL I just posted.
15:24:05 <johnthetubaguy> oh, hang on
15:24:15 <johnthetubaguy> I am blind
15:24:44 <johnthetubaguy> OK, we just need a wiki page describing how we meet all those points
15:24:50 <johnthetubaguy> then we can remove that dodgy log message
15:25:00 <johnthetubaguy> then we are good
15:25:22 <johnthetubaguy> Before the nova meeting tomorrow would be awesome
15:25:36 <johnthetubaguy> BobBall matel: life savers by the way, this is awesome stuff
15:25:44 <BobBall> The thing that we really need is help looking into the failures
15:25:53 <BobBall> I don't want to say "We're not sure about the failures"
15:26:01 <johnthetubaguy> #help need help to look into the failures
15:26:11 <johnthetubaguy> BobBall: but it's true, right? we are not sure?
15:26:34 <johnthetubaguy> I would rather say we are not sure for a few weeks while we prove the stability, so people don't just ignore the xenapi test results
15:26:37 <BobBall> I'm saying I don't know if the failures are more likely in XenAPI
15:26:39 <BobBall> they are all failures
15:26:53 <BobBall> but just like every gate failure isn't related to the patch, the same is true of XenAPI failures here.
15:27:04 <BobBall> Gate doesn't say "might not be your fault"
15:27:15 <johnthetubaguy> right, but it's not new
15:27:19 <BobBall> nor do other CIs that I've seen?
15:27:26 <johnthetubaguy> sure
15:27:42 <johnthetubaguy> I just want to be sure they are not new false positives
15:27:48 <johnthetubaguy> anyways, go with what you think is best
15:28:01 <BobBall> Perhaps if I phrased it differently....
15:28:19 <BobBall> Every failure that we have should correspond to a bug in launchpad - and one that should be fixed.
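
[BobBall's point that the four-hour requirement can be verified "by looking at the times we reported on the patch" amounts to a simple timestamp comparison; a minimal sketch, with the timestamp format assumed:]

    # Sketch (Python): check the four-hour turnaround from submission and
    # report timestamps. The timestamp format here is an assumption.
    from datetime import datetime, timedelta

    def within_four_hours(submitted, reported, fmt='%Y-%m-%d %H:%M:%S'):
        delta = datetime.strptime(reported, fmt) - datetime.strptime(submitted, fmt)
        return timedelta(0) <= delta <= timedelta(hours=4)

    print(within_four_hours('2014-02-19 15:00:00', '2014-02-19 18:30:00'))  # True (3.5 hours)
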
15:28:35 <BobBall> I don't know if we are hitting more bugs or bugs more often than KVM or A.N.Other driver
15:28:40 <BobBall> but they are all real
15:29:21 <johnthetubaguy> I don't disagree with you, I just don't want people to start ignoring the XenAPI tests
15:29:37 <BobBall> Which they will if we put a comment saying "Might not be you"
15:29:53 <johnthetubaguy> right, my hope is we prove the system, then remove that phrase
15:30:11 <BobBall> People who get used to the phrase will not notice when it's gone
15:30:19 <johnthetubaguy> when we have more confidence that it's gate bugs, probably via using the gate bug signature thingy
15:30:23 <johnthetubaguy> OK
15:30:34 <matel> Okay, I need to go, sorry.
15:30:44 <BobBall> Thanks matel
15:30:44 <johnthetubaguy> no worries, top work
15:30:57 <BobBall> Have fun with people poking in your mouth.
15:31:41 <johnthetubaguy> nice
15:31:54 <BobBall> Anyway
15:32:01 <BobBall> I've asked Ant for an increase in our quota
15:32:03 <johnthetubaguy> OK, so help with the failures, I would love to jump on that asap
15:32:21 <johnthetubaguy> Oh, cool, he is the right guy for that, makes sense
15:32:23 <BobBall> we're currently restricted to 128GB RAM which, at 8GB instances, is 15 total (it's just under 128G I think)
15:32:42 <BobBall> I know that we've got a giant queue at the moment, but I've been keen to re-process jobs that failed
15:32:49 <BobBall> so I'm a long way from hitting the 4 hour rate
15:33:05 <BobBall> with 50% more or double the VMs we'll get back very quickly.
15:33:15 <johnthetubaguy> right, totally makes sense
15:33:27 <BobBall> and I think that while we're catching up ATM we won't cope with 15 VMs under peak load
15:33:43 <johnthetubaguy> certainly will not, we should increase that for you
15:34:05 <johnthetubaguy> are we spreading across regions yet?
15:34:14 <johnthetubaguy> that might help a little
15:34:16 <BobBall> No - but we've had that at one point
15:34:22 <BobBall> it's easy to do and I've suggested as much to Ant
15:34:28 <BobBall> currently all in IAD
15:34:34 <johnthetubaguy> I think your quota is per region, but I could be wrong
15:34:35 <BobBall> but we've also had it working with DFW
15:34:47 <johnthetubaguy> LON is the other good choice
15:34:48 <BobBall> Oh? in which case I might have misunderstood
15:34:56 <BobBall> I'll try setting up multi-region as another job for me
15:35:04 <BobBall> that might resolve our quota issue today
15:35:14 <BobBall> Can I access LON in the same way?
15:35:20 <BobBall> I know it's separate from the web interface...
15:35:36 <johnthetubaguy> oh, different account still, bummer, maybe ask for one of those from Ant too
15:35:58 <BobBall> Where are performance flavors currently?
15:36:00 <johnthetubaguy> I would do IAD, DFW, ORD as a starting point anyway
15:36:06 <johnthetubaguy> most places now
15:36:06 <BobBall> OK - I'll add ORD too.
15:36:20 <johnthetubaguy> not HKG and SYD
15:36:23 <BobBall> If it's per-region then adding DFW and ORD would more than make up for the issues
15:36:41 <johnthetubaguy> it's worth a whirl, I think it is per region, but I could be wrong
15:36:53 <BobBall> OK
15:37:31 <BobBall> So - tasks so far...
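
[For the multi-region idea above, a sketch of how worker clients might be created per region, assuming the 2014-era python-novaclient v1_1 API; the credentials are placeholders, and whether quotas are per region is, as the discussion notes, unconfirmed.]

    # Sketch (Python): one nova client per region. Spreading jobs across
    # IAD/DFW/ORD should raise the effective instance ceiling if quotas
    # are enforced per region (unconfirmed above).
    from novaclient.v1_1 import client

    REGIONS = ['IAD', 'DFW', 'ORD']  # the starting set suggested above

    def clients_for_regions(username, api_key, project_id, auth_url):
        return dict((region,
                     client.Client(username, api_key, project_id,
                                   auth_url, region_name=region))
                    for region in REGIONS)
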
15:37:44 <BobBall> Bob: Cron job, post -ve comments, multi-region
15:38:07 <johnthetubaguy> yep, that sounds good
15:38:08 <antonym> BobBall: you should have gotten the quota increase btw
15:38:09 <BobBall> John: Investigate some of the failures from http://paste.openstack.org/show/67277/ to match against bugs / or ideally propose fixes to reduce failure rate
15:38:17 <BobBall> perfect! thanks ant!
15:38:22 <BobBall> I'll try to go multi-region first
15:38:26 <antonym> think i shot a mail over yesterday
15:38:29 <BobBall> since that'll be lighter on you
15:38:39 <antonym> if they only did it in one region, let me know and i'll get it set for the others
15:38:52 <BobBall> sorry - I may have missed it with the fun we've been having
15:38:58 <johnthetubaguy> yeah, more independent failures too, when we do a deploy, etc
15:38:59 <antonym> no problem :)
15:39:28 <johnthetubaguy> Bob: write up wiki page with links to tempest config, etc
15:39:35 <johnthetubaguy> add wiki page into:
15:39:35 <BobBall> Failed deploys show up as "Aborted" :)
15:39:38 <BobBall> Good point.
15:39:57 <johnthetubaguy> the above wiki
15:40:14 <BobBall> Will do.
15:40:23 <johnthetubaguy> awesome
15:40:28 <johnthetubaguy> so, sounds like we are almost there
15:40:44 <johnthetubaguy> I am going to dig into errors tomorrow I am afraid, got blueprints to sort out this afternoon
15:41:37 <BobBall> OK
15:41:51 <BobBall> well I won't have time to look at them based on the list of things I've got to do :D
15:42:00 <johnthetubaguy> indeed
15:42:11 <BobBall> OK - that's all for the CI I think
15:42:23 <johnthetubaguy> it's awesome to see it going
15:42:46 <leifz> Real quick: is there a quick link to look at errors in general?
15:42:48 <johnthetubaguy> I actually found a team that might help maintain it in Rackspace, once we have it proven, if that's helpful
15:43:18 <BobBall> What do you mean leifz?
15:43:43 <johnthetubaguy> #action look into XenAPI build errors: http://paste.openstack.org/show/67277/
15:43:44 <BobBall> Should be easy to do johnthetubaguy - it's all up in github
15:43:53 <leifz> you said you needed help looking at failures. Was curious if that's easy to look at.
15:44:07 <BobBall> Ah yes - the link that John gave includes the log files for the errors
15:44:27 <johnthetubaguy> OK, so let's move on..
15:44:32 <johnthetubaguy> #topic AOB
15:44:41 <johnthetubaguy> anyone else got anything to talk about?
15:44:48 <BobBall> We want to add to those log files to include the host logs (matel is working on this) as there is important info in those for some errors
15:44:59 <johnthetubaguy> sounds good
15:45:16 <BobBall> No AOB from me
15:45:42 <BobBall> And I have to jump away now
15:45:46 <johnthetubaguy> it's blueprint cut-off day, patch up today, else your blueprint gets deferred
15:45:51 <johnthetubaguy> OK, thanks BobBall
15:45:52 <BobBall> I'll be back in a few minutes
15:45:54 <johnthetubaguy> nothing from me
15:46:04 <johnthetubaguy> #endmeeting