17:00:56 #startmeeting qa
17:00:57 Meeting started Thu Jun 5 17:00:56 2014 UTC and is due to finish in 60 minutes. The chair is mtreinish. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:58 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:01:01 The meeting name has been set to 'qa'
17:01:10 Hi, who's here today?
17:01:13 Hi
17:01:18 hi
17:01:33 #link https://wiki.openstack.org/wiki/Meetings/QATeamMeeting
17:01:38 ^^^ Today's agenda
17:02:15 Hi
17:02:43 dkranz, sdague, mkoderer, afazekas: are you guys around?
17:02:53 I will add my spec review to the agenda
17:02:55 Hi
17:02:59 well anyway let's get started
17:03:09 #topic Mid-Cycle Meetup (mtreinish)
17:03:30 * afazekas o/
17:03:30 So I just wanted to remind everyone about the qa/infra mid-cycle we're going to be having
17:03:40 the details can be found here https://wiki.openstack.org/wiki/Qa_Infra_Meetup_2014
17:04:07 I expect that the schedule for that week will change a bit more over the next week
17:04:17 #link https://wiki.openstack.org/wiki/Qa_Infra_Meetup_2014
17:04:50 I really didn't have much else for this topic, unless anyone had any questions about it
17:05:22 hi folks.
Sorry, I am late
17:05:46 hi all :)
17:05:52 ok, well I guess we'll move to the next topic
17:06:06 #topic Unstable API testing in Tempest (mtreinish)
17:06:19 so this is something that came up the other day in a review
17:06:29 and I know we talked about it with regards to v3 testing
17:06:55 but since we've moved to branchless tempest I feel that we really can't support testing unstable apis in tempest
17:07:14 mtreinish: +1
17:07:28 if the api doesn't conform to the stability guidelines because it's still in development then we really can't be testing it
17:07:43 mtreinish: Neutron's model of testing in tree and then promoting to tempest sounds like a good option for unstable APIs
17:08:08 andreaf: yeah I'd like to see that, but we have yet to see it implemented
17:08:24 it would definitely solve this problem once it's in full swing
17:08:32 oh for reference the review this came up in was an ironic api change:
17:08:34 #link https://review.openstack.org/95789
17:09:03 I'm thinking we should codify this policy somewhere
17:09:17 mtreinish: ++
17:09:27 mtreinish: If we are talking about a whole service api, like ironic, why not just set the enable flag to false in the gate?
17:09:47 mtreinish: I mean as a temporary measure, not a substitute for the other ideas.
17:09:59 dkranz: well that doesn't really address the unstable thing
17:10:19 sdague: It means that for gating purposes, it is out of tree.
17:10:20 mtreinish: the alternative here would be micro-versions in ironic :)
17:10:32 andreaf: they'd have to implement that
17:10:42 which I don't think they've talked about yet
17:10:53 sdague: Same as if it were actually out-of-tree which was the other proposal
17:10:56 andreaf: yeah, but this example is actually a field removal so it'd be a major version...
17:11:00 sdague: yes of course - but it seems to be a problem common to everyone
17:11:44 dkranz: what I'm proposing is just drawing the line in the sand now, and we can figure out how to deal with the unstable apis we have in tree after that
17:11:56 there shouldn't be too many
17:12:08 sdague: perhaps micro versions is something every project should have - also it would bring a more consistent experience
17:12:08 yeh, I agree
17:12:09 dkranz: I'm also not sure it's the entire api surface for ironic
17:12:23 andreaf: it's a good goal, we don't have any that do it yet
17:12:36 so do we think a new section in the readme is good enough
17:12:45 or should we start storing these things somewhere else
17:12:46 ?
17:13:14 new readme section is probably the right starting point
17:13:33 ok, then I'll push out a patch adding that
17:13:47 #action mtreinish to add a readme section about only testing stable apis in tempest
17:13:47 if we have a consistent topic for just docs changes, we could pop that to the top of the dashboard
17:13:57 to have people review docs changes faster
17:13:57 mtreinish, sdague: the problem I see when reviewing is how we define a stable API
17:14:14 andreaf: example?
17:14:21 meaning until we have a test for something we don't know
17:14:22 andreaf: if it's in tempest it's stable...
17:14:28 there is no turning back
17:14:36 andreaf: Right
17:15:05 hi
17:15:26 ok fine so we let an API become progressively stable test by test ... as long as something does not have a test it's not stable
17:15:56 andreaf: No, I don't think that's right.
17:15:57 andreaf: that's a separate problem, which is really just api surface coverage.
This is more to prevent projects that explicitly say their apis will be changing but are adding tests to tempest anyway
17:16:14 right, I'm more concerned about the second thing
17:16:28 ok
17:16:30 if a project doesn't believe the api is stable, and they tell us that, then we don't land those interfaces
17:16:59 because they've said explicitly that they don't believe it to be a contract
17:17:03 sdague: marun's proposal is the only clean way to address this I think.
17:17:08 dkranz: sure
17:17:49 sdague: or they don't tell us, ignore the new readme section, something gets merged and then they're locked into the stability guidelines :)
17:17:59 sure
17:18:00 I'm just thinking about an easy way to be consistent in reviews on this - so if there is a place where projects publish the fact that an API is stable, we can -2 directly any test for unstable stuff
17:18:51 andreaf: well I think this only really applies to new major api versions and incubated projects
17:19:00 yeh, I had some vague thoughts about all of that. But my brain is too gate focused right now to take the context switch.
17:19:08 andreaf: maybe propose something as a readme patch?
17:19:14 and we can sift it there
17:19:32 it honestly might be nice to have a REVIEWING.rst
17:19:44 sdague: +1
17:19:47 or do more in it
17:20:02 oh, I have a local file that I started for that
17:20:05 yeah I was thinking about doing that as a wiki page like nova does
17:20:24 I personally prefer it in the tree
17:20:35 because then changing it is reviewed, and it's there when you check things out
17:20:41 mtreinish: or a link in tree to a wiki page
17:20:44 :D
17:20:46 I find stuff in the wiki ends up logically too far away
17:20:52 and no one notices when it changes
17:21:02 so we mentally fork
17:21:06 you can subscribe to a page...
17:21:07 +1 for in the tree
17:21:11 but I get what you're saying
17:21:23 the location doesn't really matter as long as we have it
17:21:53 sdague: do you want an action to start that?
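[Editor's note] dkranz's suggestion above, setting a whole unstable service's enable flag to false in the gate, could look roughly like the sketch below: a config-driven skip so the tests stay in tree but never run where the API is disabled. This is a hypothetical illustration only; `CONF`, `service_available`, and `requires_service` mirror Tempest's config style but are not its actual API.

```python
import unittest


class _ServiceAvailable:
    # gate jobs would set this False while the ironic API is unstable
    ironic = False


class CONF:
    # stand-in for a parsed tempest.conf in this sketch
    service_available = _ServiceAvailable


def requires_service(name):
    """Skip a whole test class unless the named service is enabled."""
    def decorator(cls):
        if not getattr(CONF.service_available, name, False):
            return unittest.skip("%s API is disabled/unstable" % name)(cls)
        return cls
    return decorator


@requires_service('ironic')
class IronicNodesTest(unittest.TestCase):
    def test_list_nodes(self):
        self.assertTrue(True)


result = unittest.TestResult()
unittest.TestLoader().loadTestsFromTestCase(IronicNodesTest).run(result)
```

With the flag off, the class is reported as skipped rather than removed, which matches the "temporary measure" framing in the discussion.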
17:22:16 sure
17:22:27 #action sdague to start a REVIEWING.rst file for tempest
17:22:37 ok then is there anything else on this topic?
17:23:27 when we get to open talk, I want to discuss a couple suggestions after debugging some gate things
17:23:30 #topic Specs Review
17:23:32 sdague: ok
17:23:52 sdague: actually I'll make that a topic because I'm not sure we'll get to open
17:24:02 dkranz: I think you posted the first one on the agenda
17:24:06 #link https://review.openstack.org/#/c/94473/
17:24:14 do we -1 patches that didn't have a merged spec?
17:24:27 mtreinish: Yes, boris-42 put in this spec but there is not enough detail
17:24:30 (I added mine to the agenda https://review.openstack.org/#/c/97589/)
17:24:58 mkoderer: yeah if the bp isn't approved yet we shouldn't merge the patches as part of it
17:25:03 mtreinish: He did not respond to the comment yet and I am not sure if he intends to move forward with this or not.
17:25:34 dkranz: yeah it doesn't really have any design in it, it just explains what the script will be used for
17:25:59 mtreinish: The last comment from him was May 23
17:26:06 dkranz: well I guess ping boris-42 after the meeting, and if he doesn't respond we can open a parallel one up
17:26:09 mtreinish: yep we need to have a look at this... couldn't jenkins do a -1 automatically
17:26:10 mtreinish: I guess I will follow up with him
17:26:39 mkoderer: I think that's over-optimizing at this point
17:26:51 sdague: ok :)
17:26:58 especially as, as a team, we're kind of terrible about tagging commit messages
17:27:21 I know I am...
17:27:35 ok the next spec on the agenda is asselin's:
17:27:35 we'll let another project build that infrastructure first, then see if we want to use it :)
17:27:47 #link https://review.openstack.org/#/c/97589/
17:28:00 asselin: so go ahead
17:28:14 Hi, this is something that hemna and sdague talked about in Hong Kong.
17:28:31 Where the API call successes and failures are automatically tracked during stress tests
17:28:55 asselin: so you want to enhance the statistics?
17:29:18 yes, there's an implementation out for review.
17:29:33 asselin: ok cool I will have a look
17:29:36 link?
17:29:36 sample output is available here at the end: http://paste.openstack.org/show/73469/
17:29:48 Stress Test API Tracking: https://review.openstack.org/#/c/90449/
17:30:09 #link https://review.openstack.org/#/c/90449/
17:30:19 in the above output, lines 18-32 were 'manually' added to the stress action.
17:30:44 With the new code, lines 34-85 show which api calls were made, and how many passed or failed.
17:31:14 asselin: so I haven't looked at this in detail yet, but the way you're tracking api calls
17:31:24 could that also be used in the non-stress case too?
17:31:38 yes, I believe so
17:31:58 asselin: I think we don't need the manual way..
17:32:11 asselin: ok that may be useful as part of tracking leaks in general tempest runs too
17:32:20 I'll look at your spec and comment there
17:32:20 mkoderer, yes, that's exactly the point: no need to do the manual way anymore
17:32:20 just tracking the api calls looks sufficient to me
17:32:29 asselin: ok cool
17:32:52 there's another patch for cinder tests here: Cinder Stress/CHO Test: https://review.openstack.org/#/c/94690/
17:33:01 this one previously had the manual calls.
17:33:09 they are all now removed
17:33:36 ok, so take a look at this spec proposal to give it some feedback. It seems reasonable to me. :)
17:33:37 and by previously, I meant before it was submitted for review.
17:33:43 mtreinish, thanks!
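[Editor's note] The automatic API-call tracking asselin describes above could be sketched as a decorator that wraps client methods and tallies each call as a pass or a fail, replacing the per-action manual instrumentation. `ApiTracker` and `FakeServerClient` are illustrative names, not the code under review.

```python
import functools
from collections import Counter


class ApiTracker:
    """Count successes and failures per API method name."""

    def __init__(self):
        self.passed = Counter()
        self.failed = Counter()

    def track(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:
                # the call still raises; we only record the failure
                self.failed[func.__name__] += 1
                raise
            self.passed[func.__name__] += 1
            return result
        return wrapper


tracker = ApiTracker()


class FakeServerClient:
    @tracker.track
    def create_server(self, name):
        return {'id': 1, 'name': name}

    @tracker.track
    def delete_server(self, server_id):
        raise RuntimeError("delete raced with another worker")


client = FakeServerClient()
client.create_server('s1')
try:
    client.delete_server(1)
except RuntimeError:
    pass
```

Because the counters live outside any single stress action, the same wrapper could feed statistics in non-stress runs too, which is the leak-tracking use mtreinish raises at 17:32:11.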
17:33:53 ok are there any other specs people want to bring up
17:33:56 yes
17:34:02 andreaf: ok go ahead
17:34:02 #link https://review.openstack.org/#/c/86967/
17:34:08 non-admin one
17:34:20 I submitted an update on dkranz's work
17:35:04 from the discussions at the summit it seems that the only kind of admin work we can avoid for now is tenant isolation
17:35:18 andreaf: right, but I don't want to give that up
17:35:26 which means I still feel like this is part 2
17:35:34 and part 1 is preallocated ids
17:35:49 until we have hierarchical multitenancy we won't be able to do things like list all vms in a domain
17:35:52 at which point we delete the non tenant isolation case
17:36:17 thereby simplifying tempest in the process
17:37:03 sdague: +1, although the non tenant isolation case would just be a list len of 1 for the preallocated ids
17:37:12 mtreinish: sure
17:37:17 sdague: so I think we should remove user and user_alt and just have a user provider that either uses tenant isolation or an array of configured users
17:37:27 andreaf: yes exactly
17:37:30 andreaf: +1
17:37:48 I'm still not sure how we pass the user in from calls to testr in that case
17:37:51 now we just need someone who wants to do that
17:38:06 dkranz: there might be some trickiness there
17:38:12 dkranz: yeah there is an interesting problem to determine how many threads we're safe for
17:38:20 but, honestly, we have to solve it
17:38:31 even if we decide the solution is a tempest binary
17:38:31 or we could just allow overcommit and just lock on having available creds
17:38:34 to wrap it
17:38:43 sdague: I agree, but feel a lack of testr expertise
17:38:50 mtreinish: or just die
17:39:07 heh, yeah that's probably a better idea :)
17:39:10 if you try to run concurrency of more than your users, die
17:39:27 sdague: the fuzziness is the alt user
17:39:32 we could do a config consistency check before starting the tests
17:39:39 I'm not sure how a list of users would actually work in terms of tempest
deciding which one to use. It would have to be per-thread
17:39:41 mtreinish: more than user +1
17:39:50 At class load time a process can lock on one demo and alt_demo user
17:39:53 users[-1] is the alt user
17:40:00 well it's user + n
17:40:14 because if you need an alt user for all the workers at once
17:40:29 I think we typically only need 1 alt user globally
17:40:30 mtreinish: Doesn't each worker need its own alt_user?
17:40:40 but anyway we can figure that out later
17:40:44 we're down to 20 min
17:40:50 so let's move on
17:40:51 ok, so who's spearheading this one?
17:41:07 mtreinish: so we need a spec, I can start a bp
17:41:14 andreaf: cool
17:41:47 #action andreaf start bp on static users
17:41:57 ok, then let's move on
17:42:05 mtreinish: I had another spec I wanted to mention
17:42:08 #link https://review.openstack.org/#/c/96163/
17:42:31 this is the test server, client, gui
17:42:33 andreaf: ok, do you want to save that for next week when masayukig is around?
17:42:42 ok sure
17:42:57 ok cool
17:43:00 #topic how to save the gate (sdague)
17:43:14 sdague: you've got the floor
17:43:23 man, that's a much bigger thing than I was going for
17:43:25 :)
17:43:47 ok, so a couple of things when dealing with failures
17:44:05 first off, we explode during teardown sometimes on resource deletes
17:44:26 and whether or not the test author thought delete was part of their path, exploding on teardown sucks
17:44:53 so I think we need a concerted effort to make sure that our teardown paths are safe
17:45:02 and just log a warning if they leak a thing
17:45:03 sdague: I proposed a safe teardown mechanism https://review.openstack.org/#/c/84645/
17:45:15 but I need to write a spec before ;)
17:45:23 not sure if this helps
17:45:25 mkoderer: oh, cool
17:45:25 sdague: but explode how, like if it's an unexpected error on a delete call shouldn't that be a big issue that causes a failure?
17:46:00 mtreinish: honestly...
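[Editor's note] The static-users discussion above (a user provider backed by an array of configured users, locking so workers never share credentials, and "if you try to run concurrency of more than your users, die") could be sketched as below. All names are illustrative, not the eventual Tempest interface from the bp.

```python
import threading


class StaticCredentialProvider:
    """Hand out preconfigured users to test workers, one at a time."""

    def __init__(self, users, concurrency):
        # the "config consistency check before starting the tests":
        # refuse to run more workers than we have users for
        if concurrency > len(users):
            raise RuntimeError(
                "concurrency %d exceeds the %d configured users"
                % (concurrency, len(users)))
        self._available = list(users)
        self._lock = threading.Lock()

    def acquire(self):
        # lock so parallel workers never end up sharing a user
        with self._lock:
            if not self._available:
                raise RuntimeError("no free credentials")
            return self._available.pop(0)

    def release(self, user):
        with self._lock:
            self._available.append(user)


provider = StaticCredentialProvider(
    users=['demo1', 'demo2', 'alt_demo'], concurrency=2)
user = provider.acquire()
provider.release(user)
```

The alt-user question (users[-1] as the alt user, or one alt user globally) is deliberately left out here, since the meeting defers it.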
I'm pretty mixed on that
17:46:30 because I have a feeling that if we treated all deletes like that
17:46:33 we'd never pass
17:46:42 the only reason tempest passes is because we leak
17:46:47 explode on teardown is ok, and still fails the related test
17:47:04 afazekas: honestly, I don't think it is ok
17:47:15 if you want to test delete, do it explicitly
17:47:40 sdague: +1, but the leaks are still bugs
17:47:44 dkranz: sure
17:47:57 but we can solve those orthogonally
17:48:14 sdague: but isn't it kind of the same thing as a setup failure? We're not explicitly testing those calls but something failed
17:48:19 sdague afazekas: failures in fixtures are something we should perhaps write warnings for and collect stats on
17:48:24 sdague: Yes, at this point I think we need to aggressively stop failures that are due to "known" bugs
17:48:26 I just think moving on if a delete explodes is going to mask the real failure
17:48:27 sdague: explicitly by addCleanUp?
17:48:33 and cause something elsewhere
17:48:41 afazekas: no, explicitly by calling delete
17:48:48 mtreinish: When the gate becomes reliable again we can add failures back
17:49:03 the problem is there is too much implicit
17:49:22 so even people familiar with the tempest code take a long time to unwind this
17:49:23 sdague: and where do we want to reallocate the resources allocated before the delete?
17:49:23 afazekas: No, by calling delete not in addCleanUp
17:49:33 deallocate
17:49:44 afazekas: in the test function
17:49:50 dkranz: that's what I'm saying, I don't think this will necessarily make fixing the gate easier
17:49:52 test_foo_* is where you make calls
17:50:11 mtreinish: It might not. But it might.
17:50:12 sdague: is resource leak on failure ok?
17:50:26 afazekas: if it has WARN in the log about it
17:50:41 I don't think it is ok but we are drowning in failures, right?
17:50:42 then we are tracking it at least
17:50:54 dkranz: yes, very much so
17:51:04 every single tempest fix to remove a race I did yesterday
17:51:08 was failed by another race
17:51:24 sdague: So I was just suggesting we stop "known" failures until we get it under control
17:51:39 Do we want to avoid this kind of issue? https://bugs.launchpad.net/tempest/+bug/1257641
17:51:41 Launchpad bug 1257641 in tempest "Quota exceeded for instances: Requested 1, but already used 10 of 10 instances" [Medium,Confirmed]
17:51:43 dkranz: sure, I'm also suggesting a different pattern here
17:52:09 sdague: So make sure whatever is actually testing delete does so explicitly and don't fail on cleanup delete failures, just warn.
17:52:11 I think this topic would be nice for the mid-cycle meetup, to work on it together
17:52:16 sdague: I think that is what you are saying, right?
17:52:21 dkranz: yep, exactly
17:52:31 the test_* should be explicit about what it tests
17:52:42 sdague: Yes, that was my review comment as well.
17:52:42 and teardown is for reclamation and shouldn't be fatal
17:52:57 sdague: I would say more that it should be, but we suspend that for now.
17:52:59 sdague: +1
17:53:23 sdague: for api tests specifically - in scenario tests I would often include the cleanup in the test itself
17:53:24 dkranz: well once tempest is actually keeping track of 100% of its resources, I might agree
17:53:41 andreaf: sure, and that's fine
17:53:47 just make it explicit
17:53:47 How will we see if jobs randomly just WARN on a failed delete?
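[Editor's note] The pattern dkranz summarizes at 17:52:09 (test delete explicitly in the test body; in cleanup, downgrade a failed delete to a warning so the leak is visible in the indexed logs without failing the run) could be sketched like this. `safe_cleanup`, the fake `delete_server`, and the `warnings` list are all hypothetical scaffolding, not Tempest code or the mechanism from review 84645.

```python
import logging

LOG = logging.getLogger(__name__)
warnings = []  # stand-in for the indexed gate logs, so the sketch is checkable


def safe_cleanup(delete_func, *args):
    """Run a resource delete, downgrading any failure to a warning.

    Intended for addCleanup/tearDown paths only; tests that actually
    exercise delete should call it directly and assert on the result.
    """
    try:
        delete_func(*args)
    except Exception as exc:
        msg = "leaked resource: %s%r failed: %s" % (
            delete_func.__name__, args, exc)
        LOG.warning(msg)
        warnings.append(msg)


def delete_server(server_id):
    # simulate the racy deletes discussed above
    raise RuntimeError("409 Conflict: delete raced with another operation")


safe_cleanup(delete_server, 42)
```

This keeps teardown non-fatal while leaving a WARN trail, which is the trade-off sdague and dkranz agree to as a short-term measure; afazekas's objection (who reads the logs of passing jobs?) still applies and is what the log-analysis tooling talk below is about.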
17:53:57 afazekas: it's in the logs
17:54:01 and the logs are indexed
17:54:22 sdague: nobody reads them if the test suite passed
17:54:25 sdague: I think we need tools to analyse the logs and
17:54:33 sure, we need all those things
17:54:47 but if we can't ever land a tempest patch again, then it's kind of useless :)
17:54:47 trigger warnings perhaps to the DL or in IRC
17:54:49 afazekas: I think the point is that there are race bugs around delete, and we know that, but we can't keep failing because of them
17:55:09 dkranz: bug link?
17:55:16 afazekas: We now have to accept the risk of a regression around this issue to get things working.
17:55:27 afazekas: and realistically people have been talking about tracking resources forever, and no one ever did that work
17:55:40 afazekas: There is no bug link because no one has any idea of why the deletes fail as far as I know.
17:56:05 dkranz: no bug, no issue to solve
17:56:06 Unless I am wrong about that.
17:56:28 afazekas: sorry, some of us have been too busy fighting fires in real time to write up all the bugs
17:56:46 so step 1: we need to "ignore" failed deletes and get the gate back, and then we can go from there?
17:57:12 andreaf: yeh
17:57:20 though honestly, this was only one of 2 items
17:57:30 andreaf: What other choice is there?
17:57:32 the other is we need to stop being overly clever and reusing servers
17:57:43 so I proposed that patch for promote
17:57:46 I'm still not convinced just switching exceptions to be log warns in the short term is going to make it easier to debug, because of all the shared state between tests, but I'll defer to the consensus
17:57:54 sdague: +1 on the second point
17:58:00 https://review.openstack.org/#/c/97842/
17:58:10 we do it in image snapshots as well
17:58:18 mtreinish: I don't think the point was about being easier to debug, just to be able to get patches through.
I'll propose a patch for that later today
17:58:33 dkranz: but that too, I think it'll just shift fails to other places
17:58:39 mtreinish: it might
17:58:50 mtreinish: and we will see that if it happens and learn something
17:58:54 I would like to see several logs/jobs where we had just delete issues.
17:58:55 the other thing that would be really awesome
17:59:15 afazekas: well, dive in and fix gate bugs, and you'll see them
17:59:46 so I think we have 1 min left?
17:59:53 yeah we're basically at time
17:59:57 Before we go, could I have some core reviews for https://review.openstack.org/#/c/83627 and https://review.openstack.org/#/c/47816? There was some overlap in these two patchsets, but I fixed it this past week. So they are good to go
18:00:00 We can move to the qa channel
18:00:03 yeh, so one parting thought
18:00:08 sdague: I frequently check the logs after failures, and I can't recall
18:00:28 well there is a meeting after us I think
18:00:33 so I'm going to call it
18:00:39 ok, parting in -qa
18:00:41 #endmeeting