17:00:12 <andreaf> #startmeeting qa
17:00:13 <openstack> Meeting started Thu Mar  2 17:00:12 2017 UTC and is due to finish in 60 minutes.  The chair is andreaf. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:00:17 <openstack> The meeting name has been set to 'qa'
17:00:27 <andreaf> hello, who's around today?
17:00:31 <blancos> o/
17:00:39 <chandankumar> andreaf: hello
17:00:52 <andreaf> Today's agenda: #link https://wiki.openstack.org/wiki/Meetings/QATeamMeeting#Agenda_for_March_2nd_2017_.281700_UTC.29
17:00:55 <rodrigods> o/
17:00:55 <sdague> o/
17:01:05 <dustins> \o
17:01:45 <andreaf> mtreinish, oomichi, afazekas: around?
17:01:58 <oomichi> andreaf: hi
17:02:14 <mtreinish> o/
17:02:42 <andreaf> ok we have quite a long dense agenda so let's get started
17:02:46 <andreaf> #topic PTG
17:03:26 <andreaf> About PTG I just wanted to thank everyone who attended
17:03:34 <andreaf> the team pictures are in the ML :)
17:03:43 <tosky> o/
17:03:50 <oomichi> hehe, nice pic
17:03:57 <andreaf> we set priorities for Pike: #link https://etherpad.openstack.org/p/pike-qa-priorities
17:04:24 <andreaf> if you want to review what happened the list of all etherpads is at #link https://etherpad.openstack.org/p/qa-ptg-pike
17:05:27 <andreaf> that's all I had on PTG, so moving on
17:05:28 <andreaf> #topic Gate Status
17:06:03 <andreaf> We are at the beginning of the Pike cycle now, and we must make sure that the gate is helping our community and not stopping people from getting work done
17:06:32 <andreaf> The failure rate is currently too high in the gate, so we've been discussing how to get things back under control
17:06:42 <andreaf> Top recheck issues #link http://status.openstack.org/elastic-recheck/gate.html, http://lists.openstack.org/pipermail/openstack-dev/2017-February/113052.html
17:06:58 <andreaf> One issue was already fixed: #link https://review.openstack.org/#/c/439638/
17:07:27 <andreaf> it turns out Tempest did not close the ssh connection on failure, which caused the ssh banner issue
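A minimal sketch of the idea behind that fix (illustrative only, not the actual tempest patch, assuming a paramiko-based helper): always close the ssh connection even when the command fails, so half-open sessions don't pile up on the guest's sshd and trigger protocol banner errors.

    import paramiko

    def run_remote(host, username, pkey, cmd):
        # simplified stand-in for tempest's real remote client code
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username=username, pkey=pkey)
        try:
            _, stdout, stderr = client.exec_command(cmd)
            return stdout.read(), stderr.read(), stdout.channel.recv_exit_status()
        finally:
            client.close()  # the missing close is what led to the banner errors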
17:08:06 <andreaf> but we still have our SUT running under high load for a large part of test runs, and we suspect that may be the actual underlying issue behind a lot of the flakiness we've experienced
17:08:43 <andreaf> we discussed a lot in the past about pruning the scenario tests that we should not really have in tempest
17:08:53 <andreaf> and now it's a good time to move ahead on that plan
17:09:04 <andreaf> we made an ethercalc: #link https://ethercalc.openstack.org/nu56u2wrfb2b
17:09:27 <andreaf> proposing which scenarios to keep in the gate - for now we would simply skip a number of scenarios
17:09:38 <andreaf> not actually remove them
17:09:51 <andreaf> but we will need eventually to move some out of tempest
17:10:05 <sdague> yeh, the concrete patch I have up there just moves them into the slow bucket
17:10:18 <sdague> which means that you can run them under -e all if you want
17:10:25 <andreaf> #link https://review.openstack.org/#/c/439698/ the patch
17:10:50 <andreaf> and also sdague's patch runs scenario tests serially
17:11:11 <oomichi> andreaf: the patch above seems to run the scenario tests serially, right?
17:11:16 <sdague> oomichi: yes
17:11:20 <oomichi> andreaf: yeah,
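For reference, the "slow bucket" works through test selection: tests tagged as slow carry a "[slow]" attribute in their test id, and the regular job's regex filters them out while an "all"-style run does not. A rough sketch of the kind of selection involved (illustrative; the real expressions live in tempest's tox.ini):

    # run everything except slow-tagged tests (roughly what the normal job does)
    tempest run --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))'
    # run only the slow-tagged scenario tests
    tempest run --regex '(?=.*\[.*\bslow\b.*\])(^tempest\.scenario)'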
17:11:27 <andreaf> I would ask folks to review the list and if you have any concern with any specific test please put a comment on the ethercalc
17:11:35 <andreaf> we should try to get that patch approved by end of day tomorrow
17:11:49 <sdague> it is roughly jordanP's list of tests from the ethercalc
17:12:10 <andreaf> one thing that approach does not cover though is the load from API tests - some of which can be quite heavy
17:12:25 <andreaf> so another patch I proposed reduces concurrency to three #link https://review.openstack.org/#/c/439879/
17:12:35 <sdague> the inspiration came from some of the nova conversation with the ceph folks about their failures, where they think a lot of it is load related, and we said they should just trim back to the important stuff and run with less parallelism to get things under control
17:12:36 <andreaf> my proposal would be to actually combine the two things
17:12:48 <sdague> and that seems like a reasonable starting point for us as well
17:13:16 <sdague> andreaf: well, the concurrency drop should be in d-g right? not https://review.openstack.org/#/c/439879/
17:13:28 <andreaf> in parallel I'm working on identifying which tests are doing more resource allocation so we can verify if they really need to do so
17:13:31 <tosky> can we switch the default concurrency level to 3 or 4 also in ostestr and tempest run?
17:13:59 <tosky> this load problem has also hit other places (like the RDO CI)
17:14:13 <tosky> yes, it can be patched in every place, but if the default is saner...
17:14:22 <sdague> tosky: yeh, it's definitely worth the conversation to figure out how to back it down
17:14:43 <andreaf> sdague: yeah I guess concurrency is specific to the sizing of the SUT, which is managed by d-g in our case, so it would be more appropriate in there
17:14:46 <sdague> tosky: especially if it turns out that load reduction really makes the world a lot better
17:14:51 <andreaf> sdague: only it will have to go in a number of jobs
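For context, concurrency can be dialed down at a couple of layers; a sketch of what that looks like (the devstack-gate variable name below is an assumption, check d-g before relying on it):

    # job level, via devstack-gate
    export TEMPEST_CONCURRENCY=3
    # or directly on the runners
    ostestr --concurrency 3
    tempest run --concurrency 3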
17:15:03 <oomichi> I am ok with keeping the scenario tests serial if we can get the gate stable again with sdague's patch
17:15:07 <sdague> part of the current problem is that the data set we can get off a single patch is limited
17:16:02 <sdague> so my feeling is to move forward with the serial scenario patch, see what the macro gate effect ends up being after a week, evaluate if it was effective, and if so, what other changes should be made with that data
17:16:03 <andreaf> so, any concerns on this from anyone?
17:16:40 <andreaf> sdague: well the thing is that we've seen already a failure on API tests in your patch
17:16:50 <oomichi> andreaf: will we keep a heavy-load job like the current one as non-voting?
17:17:20 <oomichi> it would be helpful to investigate the root problem
17:17:25 <sdague> I think the only other part of it is communication about the fact that the scenario tests got trimmed by default, so projects that want to test some of those need an -e all in their gate somewhere
17:17:30 <andreaf> oomichi: heh I think for tempest we should have an extra non-voting job which still runs all the tests, maybe serially, so we don't break them
17:17:30 <sdague> probably on experimental
17:17:55 <sdague> andreaf: yeh, or honestly make an all scenarios job
17:18:06 <oomichi> andreaf: sdague: cool, I don't have any objection
17:18:09 <andreaf> sdague: yeah something like this
17:18:09 <sdague> and just run all the scenarios in serial, but only those
17:18:21 <sdague> so there isn't the 2.5 hour job issue
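Such a scenario-only job could boil down to a single serial run over the scenario tree; a minimal sketch, not an actual job definition:

    tempest run --serial --regex '^tempest\.scenario'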
17:18:26 <andreaf> ok to summarize
17:18:34 <andreaf> - email the ML with the plan
17:18:52 <andreaf> - comment on the ethercalc, and tomorrow we merge sdague's patch taking the comments into account
17:19:27 <andreaf> - prepare a patch on d-g to reduce concurrency, on hold for now
17:19:30 <andreaf> - prepare a new scenario only job to run on tempest, to be merged tomorrow as well
17:19:41 <sdague> andreaf: all seems reasonable
17:19:46 <andreaf> did I miss anything?
17:20:05 <oomichi> andreaf: good summary :)
17:20:45 <andreaf> ok, they are all pretty small tasks I can follow up on tomorrow, but I need people to look at the ethercalc and raise comments if appropriate
17:20:57 <andreaf> apart from that I have two other points
17:21:01 <bknudson_> any effect on project tempest plugins?
17:21:20 <sdague> bknudson_: good question
17:21:29 <sdague> andreaf: are they typically run in -e full?
17:21:36 <chandankumar> andreaf: this plan will also help to catch issues in tempest plugins
17:22:08 <andreaf> bknudson_ , sdague: I'm not quite sure
17:22:21 <andreaf> and I think checking all jobs might take a bit
17:22:36 <andreaf> I mean I can grep for tox -e full across project-config to get an idea
17:22:42 <sdague> bknudson_: I think if they are run under -e full by default, it means that the run times of those jobs will get longer if they add a number of scenario tests
17:22:59 <mtreinish> andreaf: most plugins do not use the tox -efull job
17:23:04 <mtreinish> in fact I can't think of any
17:23:13 <sdague> so if they happened to be right up against their timer, they could go over
17:23:19 <sdague> mtreinish: they are using -e all?
17:23:52 <mtreinish> sdague: yeah -eall or -eall-plugin which has system site-packages enabled (hopefully I'll be able to remove that eventually)
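Roughly, the difference between the two envs is whether system site-packages are visible (so plugins installed system-wide can be discovered); a sketch of their shape, not the real tox.ini:

    [testenv:all]
    commands = tempest run --regex {posargs}

    [testenv:all-plugin]
    sitepackages = True
    commands = tempest run --regex {posargs}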
17:23:56 <bknudson_> the keystone tests are scenario tests (although that shouldn't be a problem since not much is tested)
17:24:25 <sdague> mtreinish: and do they typically set their own concurrency?
17:24:26 <bknudson_> oh, we've got both api and scenario
17:24:39 <andreaf> mtreinish: ok, would you mind checking on that just to be sure?
17:24:40 <rodrigods> bknudson_, the api ones are only there to test the clients used in the scenarios
17:24:44 <sdague> bknudson_: so I think given that, there will be no impact
17:25:05 <sdague> however we may collect data which suggests how you might want to tweak things
17:25:09 <mtreinish> sdague: they all rely on the d-g variable for setting concurrency
17:25:25 <oomichi> octavia sets OS_TESTR_CONCURRENCY=1, as an example
17:25:40 <sdague> however, I don't expect that keystone is going to have the same load/iowait issues as jobs that have lvm volumes getting allocated
17:25:50 <mtreinish> andreaf: I haven't checked every single job in project-config, but I'd be really surprised if any project used the full job on a plugin
17:25:53 <andreaf> heh right
17:26:04 <andreaf> mtreinish: heh I agree
17:26:14 <sdague> I really think that volumes + qemu boots is where we get the really heavy load that's hurting us
17:27:04 <andreaf> #action everyone - review the ethercalc @ https://ethercalc.openstack.org/nu56u2wrfb2b
17:27:27 <andreaf> #action andreaf email ML with our plan on scenario tests
17:27:35 <andreaf> #action andreaf setup a scenario only job for tempest
17:27:55 <tosky> anecdotally, in sahara we use _REGEXP (with -eall), and probably other plugins do the same
17:27:55 <andreaf> two more things related to gate instability
17:28:18 * dustins makes note to look at the Manila plugin
17:28:19 <andreaf> tosky: ok so that would not be affected which is good
17:28:31 <andreaf> thanks dustins
17:28:44 <dustins> Of course!
17:29:27 <andreaf> I would like to propose a temp no-new-test merge policy until we are confident the gate is stable
17:29:32 <andreaf> of course we can discuss tests on a case by case basis
17:29:50 <andreaf> but in general we should be very careful about getting anything in until things settle down
17:30:15 <chandankumar> the no-new-test merge policy is for scenario tests only, right?
17:30:18 <andreaf> and then we should document criteria for new scenarios as jordanP proposed
17:30:40 <andreaf> chandankumar: well mostly for scenario, but there are API tests that can be pretty heavy
17:31:13 <sdague> right, it would be nice to let the dust settle to the point where a gate fail of the full job on random project is unexpected
17:31:14 <oomichi> andreaf: is there any draft of the criteria?
17:31:25 <sdague> vs. just part of normal business
17:31:27 <andreaf> so I don't mind a negative API test or a keystone one, but a nova API test for migration would make me think
17:32:03 <andreaf> oomichi: we have something in docs already
17:32:33 <andreaf> oomichi: but I think we need to get something more specific in terms of resource utilisation and reviewing patches
17:33:02 <andreaf> oomichi: given the variance in run time and system load it's hard to judge from those, but we can check how many servers / volumes are created :)
17:33:36 <andreaf> oomichi: if it's not something for interop and it could go into functional tests it would be nice to have it there
17:34:00 <oomichi> andreaf: yeah, I see. but I do mind negative ones anyway ;)
17:34:01 <andreaf> oomichi: but we can discuss these details on an etherpad or gerrit patch - it's not urgent
17:34:36 <oomichi> andreaf: can I see the link of current doc?
17:34:41 <andreaf> so can we agree on a "please be very careful about getting any new test in Tempest until things settle down" ?
17:34:47 <oomichi> as the criteria?
17:35:21 <oomichi> I could not find it in REVIEWING.rst
17:35:48 <andreaf> oomichi: so for instance #link https://docs.openstack.org/developer/tempest/field_guide/scenario.html
17:36:06 <andreaf> oomichi: as a temporary measure until the gate is back into shape
17:36:32 <oomichi> andreaf: I see, thanks. yeah, we need more detail and it would be nice to discuss it
17:37:27 <andreaf> #agreed until further notice we shall think twice before letting any new test in Tempest (until the gate settles down)
17:37:40 <andreaf> ok I'm not even sure if that's a meeting bot command :D
17:37:53 <sdague> heh
17:37:56 <andreaf> so one last thing is about versions we test
17:38:04 <andreaf> API versions
17:38:33 <andreaf> we took cinder v1 out of the gate and have a job to test those on demand I think
17:39:04 <andreaf> but we may need to review / document the API version that we want to exercise in the gate
17:39:21 <sdague> andreaf: I noticed that volumes v1 admin actions didn't get pulled with that
17:39:30 <sdague> I think because the tests are structured differently
17:39:33 <andreaf> sdague: oh ok
17:39:40 <sdague> https://github.com/openstack/tempest/blob/13a7fec7592236a1b2a5510d819181d8fe3f694e/tempest/api/volume/admin/test_volumes_actions.py#L58
17:39:44 <sdague> probably a good todo
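The usual pattern for pulling a versioned test out of the default run is a config-driven skip; a hedged sketch of what that could look like for the v1 admin actions (the class name is made up and the option name should be double-checked against the sample config):

    from tempest.api.volume import base
    from tempest import config

    CONF = config.CONF

    class VolumesV1ActionsAdminTest(base.BaseVolumeAdminTest):
        _api_version = 1

        @classmethod
        def skip_checks(cls):
            super(VolumesV1ActionsAdminTest, cls).skip_checks()
            if not CONF.volume_feature_enabled.api_v1:
                raise cls.skipException("Volume API v1 testing is disabled")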
17:39:58 <andreaf> any volunteer to look into that?
17:40:08 <andreaf> sdague?
17:40:11 <sdague> it definitely feels like for per-commit pre gating we should only be testing the most recent major API
17:40:20 <sdague> andreaf: I'll see what I can do
17:40:26 <andreaf> sdague: thanks!
17:41:03 <sdague> testing deprecated APIs seems like the role of the project, or at least being done not on master per-commit pre-gating
17:41:27 <andreaf> sdague: yeah my only concern is whether it's enough to drive tests via an API version, or if we need to ensure that all services are talking that same version between them as well
17:41:54 <sdague> andreaf: good question, I don't know
17:42:27 <oomichi> sdague: I agree. cinder v3 is current, and v2 is supported. So it would be nice to test v3 as priority on the gate
17:42:36 <sdague> oomichi: yep
17:42:56 <sdague> especially as nova is going to require v3 shortly
17:43:02 <andreaf> ok I guess there is some digging to be done to track which versions we are testing in which job, and to propose a plan for how we want things to look
17:43:07 <mtreinish> oomichi: just update the endpoint in the config, it should be the same
17:43:20 <sdague> mtreinish: yep
17:43:25 <oomichi> mtreinish: yeah, v3 = v2 + microversions :)
17:43:33 <andreaf> #action: sdague to look into cinder v1 admin tests
17:43:40 <sdague> but the base should just work sliding across
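In other words, the switch would mostly be a tempest.conf change; a sketch under the assumption that the option names below match the sample config:

    [volume]
    catalog_type = volumev3

    [volume-feature-enabled]
    api_v1 = False
    api_v2 = True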
17:43:50 <andreaf> any volunteer to look at API versions planning?
17:44:31 <oomichi> andreaf: I can help
17:44:39 <andreaf> oomichi: great, thanks!
17:44:57 <andreaf> #action oomichi to look at API versions for test jobs
17:45:07 <andreaf> ok that's all I had on the gate issues
17:45:28 <andreaf> anything else anyone?
17:45:34 <andreaf> #topic Specs Reviews
17:45:43 <andreaf> #link https://review.openstack.org/#/q/status:open+project:openstack/qa-specs,n,z
17:46:10 <andreaf> anything on specs?
17:46:15 <andreaf> 3...
17:46:16 <andreaf> 2..
17:46:19 <andreaf> 1.
17:46:20 <sdague> I think we can probably dump the 2 grenade specs.
17:46:37 <sdague> I had a good talk with luce? (sp) at PTG
17:46:52 <sdague> and I think the new idea is to build a purpose built tool for the zero downtime keystone testing
17:46:56 <andreaf> luzC
17:47:04 <sdague> andreaf: that's it
17:47:31 <andreaf> sdague: yeah I'm not sure if that's going to be new specs or what, probably yes
17:47:43 <andreaf> I'll ping luzC about those
17:47:44 <sdague> so, i'd just double check with her, and close those out unless there is a reason they want to keep them up
17:48:06 <andreaf> #action andreaf check with luzC about grenade specs
17:48:27 <andreaf> #topic Tempest
17:49:26 <andreaf> oomichi: did you add this? https://review.openstack.org/#/c/389318/3
17:49:40 <mtreinish> andreaf: I put that on there
17:50:05 <mtreinish> andreaf: I had a discussion the other day in the puppet channel about how that broke the ceilo plugin
17:50:11 <oomichi> that broke ceilometer gate
17:50:20 <oomichi> mtreinish: yeah, that is
17:50:56 <oomichi> I replaced the service client code with tempest.lib in the ceilometer repo after that, to avoid it happening again
17:51:09 <mtreinish> I just thought it was a good discussion point, since it's a private interface in lib
17:51:22 <mtreinish> although we clearly document the lib stable contract on public interfaces: https://docs.openstack.org/developer/tempest/library.html#tempest-library-documentation
17:51:48 <oomichi> mtreinish: yeah, that is really private one
17:52:14 <andreaf> yes so that's clearly documented already
17:52:44 <mtreinish> but it makes me wonder what the gap was in the stable interfaces
17:53:32 <sdague> so... for some definition of clearly documented :)
17:54:08 <oomichi> I guess people never read other projects' docs, even when the docs are clear.
17:54:10 <sdague> honestly, until the documentation includes a bunch of example usage and makes it not worth people's time to open the code, people will keep opening the code, and once they are in there they are going to find other methods they want to use
17:54:30 <sdague> I would not, for instance, call https://docs.openstack.org/developer/tempest/library/cli.html clear doc
17:55:19 <mtreinish> sdague: nor would I
17:55:27 <andreaf> sdague: uhm sure, that's not quite the point under discussion though - I thought the part about stability was quite clear
17:55:28 <andreaf> sdague: but then again I may be too involved in tempest to see the gap
17:55:57 <sdague> andreaf: so, I think it's easy to say "if you click through these 20 pages you can find the functions we consider stable"
17:56:03 <andreaf> sdague: we have documentation as one of the high prio things in pike; I agree we need examples and the like
17:56:17 <sdague> but, that's not really super clear or discoverable. You basically want an SDK doc
17:56:33 <sdague> and SDK needs examples for every usage
17:56:46 <andreaf> yeah ok that's in the todo list
17:57:02 <sdague> it might also be interesting to figure out if there was a way to emit some warnings from tempest run if stuff gets inherited in places that are unexpected
17:57:09 <sdague> to help people realize they did the wrong thing
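A purely hypothetical sketch of that idea (nothing like this exists in tempest; names are made up): warn whenever a plugin class inherits from something under tempest that is not tempest.lib.

    import warnings

    def warn_on_private_bases(cls):
        # hypothetical helper: flag inheritance from non-lib tempest modules,
        # which are not covered by the stability promise
        for base in cls.__mro__[1:]:
            mod = base.__module__
            if mod.startswith('tempest.') and not mod.startswith('tempest.lib'):
                warnings.warn(
                    '%s inherits from private interface %s.%s; only tempest.lib '
                    'is a stable interface' % (cls.__name__, mod, base.__name__))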
17:57:27 <andreaf> yeah that's the other topic I had on the agenda for today
17:57:35 <andreaf> but I guess we are running out of time
17:57:54 <andreaf> we can continue in the QA channel or next meeting
17:58:05 <chandankumar> sdague: andreaf i would like to help on this.
17:58:17 <oomichi> sdague: I did try it with hacking, but it was difficult to get agreement
17:58:20 <andreaf> so it's a pity it takes two weeks to meet the same group again in a meeting
17:58:24 <andreaf> chandankumar: thank you!!
17:58:26 <bknudson_> I've got a review that's stuck - https://review.openstack.org/#/c/388897/ -- I answered the comment but the reviewer seems to have left.
17:58:31 <mtreinish> andreaf: the docs stuff? I did get a start to some of that yesterday: https://review.openstack.org/439830
17:58:44 <andreaf> mtreinish: thank you!
17:59:06 <andreaf> #link https://review.openstack.org/#/c/388897/ for review
17:59:10 <mtreinish> castulo: did you have a follow up on bknudson_'s patch ^^^
18:00:06 <andreaf> ok thanks everyone!
18:00:11 <andreaf> #endmeeting