15:04:25 <scottda> #startmeeting cinder_testing
15:04:26 <openstack> Meeting started Wed Aug 31 15:04:25 2016 UTC and is due to finish in 60 minutes.  The chair is scottda. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:04:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:04:30 <openstack> The meeting name has been set to 'cinder_testing'
15:05:00 <eharney> hi
15:05:09 <patrickeast> o/
15:05:15 <smcginnis> In another meeting as usual, but will kinda be paying attention. :)
15:05:26 <e0ne> hi
15:05:34 <dulek> hello!
15:05:53 <scottda> We've nothing on the agenda. I've been trying to narrow down the oom_killer issue...
15:06:07 <akerr> same here, sprint demo meeting every other wednesday :)
15:06:25 <xyang> hi
15:06:25 <scottda> I cannot repro oom on stable/mitaka with 'tox -epy34' and 2GB VM (as opposed to Master, which repros every time)
15:06:54 <e0ne> I would like to ask smcginnis and all cores to take a look at https://review.openstack.org/#/c/348449/ - fake drivers integration for devstack
15:07:11 <smcginnis> e0ne: Opened
15:07:22 <scottda> I'm doing a loose binary search on commits to see if I can find a point-in-time where the problem starts.
15:07:23 <scottda> e0ne: I'll look
15:07:39 <e0ne> smcginnis: thanks. IMO PTL vote will be useful for it
15:07:57 <openstackgerrit> xiexs proposed openstack/cinder: Convert InvalidVolumeMetadataSize to webob.exc.*  https://review.openstack.org/356213
15:08:06 <smcginnis> scottda: My concern is that it's just been a gradual increase in memory needed to run the UTs, so I'm worried we won't find a smoking gun.
15:08:11 <akerr> I would like to ask that we try to figure out a way to test manage/unmanage in the gate.  cFouts has been mentioning it a lot lately, but we have internal tests for those functions that have been broken about 3 times so far this release due to a lack of functional gate tests
15:08:16 <scottda> We could ask people to review patrickeast's devstack patches, but I see one with +2 and 8 +1's and yet no reviews in 4 weeks!
15:08:21 <eharney> i'm planning to cycle back to the oom bug this week myself, and look for general issues in that regard
15:08:29 <scottda> smcginnis: Yes, that's my worry as well.
15:08:52 <akerr> This causes us lots of headaches trying to get patches through our internal review process :(
15:08:56 <geguileo> eharney: Are you able to reproduce it?
15:09:03 <dulek> akerr: +1
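
A rough sketch of the kind of functional coverage akerr is asking for, driven through python-cinderclient against a devstack. The endpoint, credentials, backend host string, and the 'manage-me' source volume are illustrative, and the exact client call signatures should be double-checked against the installed cinderclient.

    from cinderclient import client

    # Illustrative credentials/endpoint; a gate job would pull these from
    # the usual devstack/openrc environment instead.
    cinder = client.Client('2', 'admin', 'secret', 'admin',
                           'http://127.0.0.1:5000/v2.0')

    # Adopt an existing backend volume (e.g. an LV created out of band)
    # into Cinder under the given host@backend#pool ...
    vol = cinder.volumes.manage(host='devstack@lvmdriver-1#lvmdriver-1',
                                ref={'source-name': 'manage-me'})

    # ... and release it again without deleting the backing storage.  A
    # real gate test would assert the volume reaches 'available' in between.
    cinder.volumes.unmanage(vol)
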
15:09:06 <eharney> geguileo: i did last week
15:09:19 <geguileo> eharney: What are the requirements to trigger it?
15:09:39 <scottda> Any help is welcome. I've tried a few approaches, but I'm sure there's ideas I'm missing...
15:09:40 <eharney> easy route is to run unit tests on a smaller VM, i think i was using 1500MB of RAM
15:09:47 <eharney> i'm going to go at it this week w/ profiling tools
15:09:57 <scottda> geguileo: I trigger with 2GB vm, running tox -epy34, every time
15:10:12 <geguileo> eharney: scottda OK, maybe I'll give it a go
15:10:18 <smcginnis> It occasionally hits with 4GB RAM, so I think anything under that increases the likelihood that it will happen.
15:10:27 <scottda> I see a slow, steady increase in memory usage as the unit tests run.
15:11:06 <scottda> strace doesn't show me anything stuck or looping in system calls.
15:11:07 <geguileo> scottda: So it's not a couple of tests' fault, but a general issue
15:11:23 <smcginnis> geguileo: That's my feel.
15:11:30 <scottda> geguileo: That's the current hypothesis
15:11:43 <scottda> I tried removing the zonemanager tests, but it still reproduced.
15:12:02 <scottda> I might try removing other entire directories.
15:12:16 <dulek> scottda: +1 - that can help us narrow the issue.
15:12:25 <dulek> scottda: You should be able to automate that. :)
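
A rough sketch of how that narrowing could be automated: run each unit-test subtree on its own under GNU time and record the peak RSS, so any subtree that dominates memory use stands out. The paths, tox env name, and test-selection syntax are assumptions to adjust for the local checkout and tox.ini.

    #!/usr/bin/env python
    import os
    import re
    import subprocess

    TEST_ROOT = 'cinder/tests/unit'

    for entry in sorted(os.listdir(TEST_ROOT)):
        if not os.path.isdir(os.path.join(TEST_ROOT, entry)):
            continue
        # Run only this subtree; the selection syntax depends on tox.ini.
        proc = subprocess.Popen(
            ['/usr/bin/time', '-v', 'tox', '-epy34', '--',
             'cinder.tests.unit.%s' % entry],
            stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
        _, err = proc.communicate()
        # GNU time prints the peak resident set size to stderr.
        match = re.search(r'Maximum resident set size \(kbytes\): (\d+)',
                          err.decode())
        peak_mb = int(match.group(1)) // 1024 if match else -1
        print('%-25s exit=%s peak~%s MB' % (entry, proc.returncode, peak_mb))
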
15:12:53 <geguileo> We didn't use to hit these issues before
15:13:17 <geguileo> Has anybody tried to git bisect it?
15:13:35 <scottda> geguileo: Yeah, but akerr said he had to up internal VMs for unit tests from 2GB-> 4GB in January
15:13:57 <scottda> geguileo: No, I just ran stable/mitaka without repro this morning. I can try some bisect next.
15:14:19 <geguileo> scottda: I think that could at least give us an idea
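
If the bisect route pans out, a helper along these lines could drive it mechanically on a small (e.g. 2GB) VM where bad commits reliably OOM, e.g. `git bisect start; git bisect bad master; git bisect good origin/stable/mitaka; git bisect run python bisect_oom.py`. This is only a sketch: it treats any failed run as "bad", so unrelated test failures would skew the result.

    #!/usr/bin/env python
    # "git bisect run" helper: exit 0 if the unit tests complete, 1 otherwise
    # (e.g. when the OOM killer reaps the test workers on a small VM).
    import subprocess
    import sys

    # Recreate the venv so each bisected commit runs against its own deps.
    ret = subprocess.call(['tox', '-epy34', '--recreate'])
    sys.exit(0 if ret == 0 else 1)
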
15:14:20 <openstackgerrit> xiexs proposed openstack/cinder: Make examples consistent with actul API  https://review.openstack.org/326852
15:15:14 <akerr> scottda: geguileo: and we had to up it from 1G to 2G a while before that, so it has been a gradual thing for a long while now
15:15:40 <geguileo> akerr: That's great to know
15:15:59 <geguileo> Because this could be related to the number of tests more than anything else
15:16:12 <openstackgerrit> Ivan Kolodyazhny proposed openstack/cinder: RBD Thin Provisioning stats  https://review.openstack.org/178262
15:16:15 <geguileo> So it's probably something in the base class
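
One cheap way to test the base-class theory is per-test memory accounting in the shared TestCase via tracemalloc (Python 3 only, which matches the tox -epy34 runs where this reproduces). A rough sketch, assuming it gets mixed into cinder.test.TestCase; the 1 MB threshold is arbitrary.

    import tracemalloc


    class MemoryGrowthMixin(object):
        """Log tests that retain a noticeable amount of memory."""

        def setUp(self):
            super(MemoryGrowthMixin, self).setUp()
            if not tracemalloc.is_tracing():
                tracemalloc.start()
            self._mem_before = tracemalloc.take_snapshot()
            self.addCleanup(self._report_memory_growth)

        def _report_memory_growth(self):
            after = tracemalloc.take_snapshot()
            stats = after.compare_to(self._mem_before, 'lineno')
            growth = sum(s.size_diff for s in stats)
            if growth > 1024 * 1024:  # only flag tests retaining > 1 MB
                print('%s retained ~%d KB' % (self.id(), growth // 1024))
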
15:17:24 <scottda> OK, so there's that issue...
15:17:44 <scottda> Anything we need to talk about regarding upcoming release?
15:17:45 <scottda> Are tests subject to FeatureFreeze?
15:18:04 <geguileo> scottda: I don't think they should be
15:18:13 <ntpttr> I have a few things related to rolling upgrades testing, not sure if related to upcoming release
15:18:19 <scottda> no, me neither. just asking
15:19:04 <ntpttr> I've been fiddling with some ansible scripts to deploy on multiple nodes at mitaka combined with some python scripts for setting up a volume / reading and writing to it and backing it up https://github.com/ntpttr/rolling-upgrades-devstack
15:19:05 <scottda> ntpttr: That's stuff in devstack or infra? It'd be nice to get rolling upgrades in before the release, I would think.
15:19:30 <geguileo> scottda: +1
15:19:35 <ntpttr> I've come across a few things that we might want to take a look at, or at least have in some kind of upgrade documentation
15:19:41 <dulek> scottda: Actually multinode grenade is in check queue now, non-voting currently.
15:19:52 <ntpttr> scottda, I've been doing manual tests w/ devstack, not infra
15:20:23 <geguileo> ntpttr: What are those issues?
15:26:25 <dulek> ntpttr: Oh, upgrade docs would be cool. I've wanted to write them, but stopped when I noticed that there isn't a good place to put them.
15:20:30 <smcginnis> Catching up - tests will not be subject to feature freeze.
15:20:52 <ntpttr> One thing is, after upgrading the DB and API service to master from mitaka, it's possible that no volume creation will work, because in mitaka a volume_type of 'None' was okay, but recently an exception has started to be thrown for that
15:21:13 <ntpttr> so people will need to add a default_volume_type before they upgrade
15:21:20 <eharney> ntpttr: i thought we fixed that
15:21:35 <geguileo> eharney: I thought so too...
15:21:46 <ntpttr> eharney: I ran into the issue yesterday after upgrading the DB and just the API service, with the rest running on mitaka
15:21:46 <eharney> ntpttr: https://review.openstack.org/#/c/353665/
15:22:13 <eharney> ntpttr: please file a bug if it still throws exceptions
15:22:24 <ntpttr> eharney: I'll give it another test today and file if it does
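
For anyone hitting it in the meantime, the workaround ntpttr describes comes down to a couple of lines in cinder.conf on the upgraded API node; the type name here is illustrative and has to exist already (e.g. created with `cinder type-create lvm`):

    [DEFAULT]
    # Type applied when a create request doesn't specify one; must refer
    # to an existing volume type or creates will still fail.
    default_volume_type = lvm
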
15:22:52 <geguileo> ntpttr: Thanks, what other issues?
15:23:06 <scottda> ntpttr: dulek We'd like some upgrade docs. I'm guessing we can figure out somewhere to put them.
15:23:16 <ntpttr> Another thing I ran into had to do with the backup service - it seems like when Cinder is out of sync with the rest of the deployment (in my case cinder was at mitaka and the other services at master), creating a backup results in an error
15:23:40 <scottda> ntpttr: Yes, I've seen that. I'm not sure what the resolution was...
15:23:57 <ntpttr> I thought it might have had to do with cinder and swift being at different branches, but it looked like the problem was with rootwrap or something before any calls to swift were even made
15:24:09 <scottda> I think it was a version pinning thing in the DB
15:24:19 <dulek> ntpttr: What error? I've tested that manually and fixed a bug related to that.
15:24:46 <dulek> ntpttr: https://review.openstack.org/#/c/350534/
15:24:48 <ntpttr> dulek: I don't have a paste of the error handy, but it was rootwrap throwing an unauthorized action exception I think
15:25:14 <dulek> ntpttr: Oh, so maybe rootwrap filters weren't updated.
15:25:40 <dulek> ntpttr: If there's such requirement when upgrading, we need to signal it in the release notes.
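
If the rootwrap-filter update does turn out to be required, the usual way to signal it would be a reno note in releasenotes/notes/ along these lines (file name and wording purely illustrative):

    upgrade:
      - Before restarting an upgraded cinder-backup service, copy the
        updated rootwrap filters from etc/cinder/rootwrap.d/ into
        /etc/cinder/rootwrap.d/, otherwise backup operations may be
        rejected by rootwrap.
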
15:25:41 <geguileo> ntpttr: I think you should create bugs for all the issues with instructions and maybe an Etherpad so we can nail those down asap
15:25:52 <ntpttr> geguileo: Will do
15:25:58 <geguileo> ntpttr: Thanks!!
15:26:03 <ntpttr> np!
15:26:07 <dulek> ntpttr: Thanks a lot!
15:26:13 <dulek> Okay, so can I use a moment here?
15:26:19 <geguileo> ntpttr: Please let us know when you have a list of them so we can give them priority
15:26:21 <ntpttr> I had one other question related to this, but go ahead dulek
15:26:34 <dulek> ntpttr: Oh, sorry, go on. :)
15:26:35 <ntpttr> geguileo: Sure, I'll try and nail it down today and tomorrow
15:26:49 <geguileo> ntpttr: Awesome!
15:27:06 <scottda> ntpttr: Go with your question.
15:28:02 <ntpttr> dulek: Okay thanks :). I was just wondering if it would maybe be good to have some kind of API that an admin could call when they're planning an upgrade, something that could check to see if there are any pending tasks like a backup or volume being created, so that services don't go down in the middle of the process
15:28:28 <ntpttr> Maybe also to send a pause signal to stop new processes from starting until the upgrade has begun
15:29:34 <ntpttr> a deployment tool could use it to wait to shut down until a running process is complete
15:29:35 <dulek> ntpttr: Good thinking. So I've always assumed this is addressed by the fact that services finish the jobs they're already running when they shut down
15:29:37 <geguileo> ntpttr: If the service is properly configured then they won't just go down in the middle
15:29:55 <geguileo> ntpttr: There is a timeout that must be set
15:30:05 <dulek> Only SIGKILL destroys the service immediately.
15:30:06 <geguileo> ntpttr: But nobody sets it up
15:30:11 <dulek> And now the fun part!
15:30:25 <ntpttr> geguileo: oh really? That's good to know - I assumed though that a deployment tool like kolla, which destroys containers and brings up new ones, wouldn't be checking that
15:30:27 <dulek> I think that in DevStack service may get killed immediately.
15:30:33 <ntpttr> kolla just being one example
15:30:56 <dulek> Because oslo.service has some strange stuff related to whether the process is a daemon or not.
15:31:18 <geguileo> ntpttr: It's done by oslo.service; dulek and I will be talking about that, among other things, in our OpenStack Barcelona talk
15:31:36 <ntpttr> geguileo: awesome, I'll be sure to check that out
15:31:37 <dulek> ntpttr: Yup, in case of kolla that sucks. But IMO that should be addressed in Kolla.
15:31:57 <ntpttr> dulek: I agree - if the services can handle it deployment tools can make use of it
15:32:13 <ntpttr> that was sort of my idea w/ the api, just something for deployment tools to make use of
15:32:19 <ntpttr> if it already exists that's great
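
For reference, the knob geguileo mentions is oslo.service's graceful shutdown timeout, set in the service's cinder.conf; with it, SIGTERM lets in-flight work finish (only SIGKILL is immediate). The 300 below is just an example value:

    [DEFAULT]
    # Seconds to wait after SIGTERM for in-flight operations to finish
    # before the service exits anyway; 0 means wait indefinitely.
    graceful_shutdown_timeout = 300
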
15:33:31 <geguileo> dulek: What did you want to talk about?
15:33:43 <ntpttr> I think that answers my question, I'll be sure to watch your talk in person or via internet if I can't make it to the summit
15:34:00 <ntpttr> thanks!
15:34:19 <dulek> Just wanted to note that the multinode grenade job is in the check queue as non-voting right now. I wonder what the requirements on job stability are to make it voting.
15:34:29 <dulek> And how to actually check job stability.
15:34:47 <dulek> Because goo.gl/g6GO7t isn't really clear. ;)
15:37:43 <dulek> Hm, did I just get disconnected? ;)
15:37:45 <scottda> e0ne: Do you have thoughts? You've looked at moving various jobs to voting in the past.
15:37:58 <geguileo> dulek: No, I just have no idea on those
15:38:25 <e0ne> scottda: AFAIR, we don't have requirements, but usually ask to have it stable for 1-3 months
15:38:49 <geguileo> e0ne: And how do we check its stability?
15:39:02 <scottda> yeah, and then we ask at a cinder meeting about people's opinions to move to voting, IIRC
15:39:12 <e0ne> geguileo: http://graphite.openstack.org/
15:39:30 <e0ne> geguileo: and compare it to the devstack+lvm job
15:39:47 <geguileo> e0ne: Oh, OK, comparing it
15:39:52 <dulek> e0ne: Thanks!
15:39:59 <e0ne> dulek: np
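
As a rough, concrete version of the comparison e0ne suggests: graphite's render API can return a job's SUCCESS/FAILURE counters as JSON, and a few lines turn them into a failure rate. The metric path and job name below are assumptions about how zuul publishes its stats; the exact names should be taken from the graphite tree or the existing grafana dashboards.

    import json
    import urllib.request

    GRAPHITE = 'http://graphite.openstack.org/render'
    # Assumed metric layout; verify the real path in the graphite UI.
    JOB = 'stats_counts.zuul.pipeline.check.job.gate-grenade-dsvm-multinode.%s'


    def total(result):
        url = ('%s?target=summarize(%s,"30days","sum")&from=-30days&format=json'
               % (GRAPHITE, JOB % result))
        data = json.loads(urllib.request.urlopen(url).read().decode())
        points = data[0]['datapoints'] if data else []
        return sum(value for value, _ts in points if value)


    runs = {r: total(r) for r in ('SUCCESS', 'FAILURE')}
    if sum(runs.values()):
        print('failure rate over 30 days: %.1f%%'
              % (100.0 * runs['FAILURE'] / sum(runs.values())))
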
15:40:12 <dulek> So - we'll have voting upgrade testing in a few months. :)
15:40:23 <dulek> (unless it gets unstable :P)
15:40:44 <geguileo> lol
15:40:59 <e0ne> dulek: let's talk about it on design session:)
15:41:00 <geguileo> I have a question related to the HA A/A testing
15:41:50 <geguileo> Will anybody with a storage backend that supports CGs be able to test that part?
15:42:14 <scottda> patrickeast: Are you already doing this with your CI?
15:42:59 <geguileo> I have no access to that kind of storage for tests
15:43:09 <scottda> me neither
15:43:10 <geguileo> And I'm afraid that with the amount of changes going in they'll break something
15:43:21 <xyang> geguileo: if we get e0ne's patch in https://review.openstack.org/#/c/348449/, we can use the fake gate driver
15:43:37 <xyang> that uses LVM
15:44:01 <dulek> geguileo: Fake driver has CG support.
15:44:01 <dulek> geguileo: Or GateDriver.
15:44:01 <dulek> geguileo: It mocks CGs on LVM.
15:44:25 <geguileo> I was hoping for some real tests...
15:44:40 <xyang> geguileo: that is real test
15:44:54 <xyang> geguileo: it creates volumes on LVM
15:44:56 <geguileo> xyang: Then it's not mocking them?
15:44:56 <tommylikehu_> hey guys, does anyone have any idea about the Jenkins issues?
15:45:03 <xyang> not mocking
15:45:14 <geguileo> xyang: Aaaaaah, awesome!!!
15:45:16 <xyang> geguileo: it is called fake just because it is not for production
15:45:42 <geguileo> xyang: OK, then I'll try to use that one on my tests
15:45:45 <xyang> geguileo: because it can't guarantee consistency on the snapshots, but we are not testing that part anyway
15:45:49 <geguileo> And see if I can try it with it
15:45:52 <dulek> geguileo: It mocks CGs consistency, not Cinder resources. ;)
15:46:07 <geguileo> dulek: Cool
15:46:17 <xyang> dulek: to be accurate, it does not mock consistency
15:46:20 <geguileo> Then I will be able to test it myself
15:46:32 <xyang> dulek: it claims that it is not consistent
15:46:46 <geguileo> xyang: dulek Thanks for the answers  :-)
15:46:47 <xyang> but definitely good enough for testing
15:47:08 <dulek> xyang: I think we had same thing in mind. :)
15:47:20 <xyang> dulek: sure:)
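
For reference, once the devstack change lands this should amount to pointing a backend section at the gate driver. A cinder.conf sketch with illustrative values; the module path is a guess at where FakeGateDriver lives and should be checked against e0ne's patch:

    [DEFAULT]
    enabled_backends = fake-gate

    [fake-gate]
    # LVM-backed driver that reports CG support without real consistency;
    # meant for gate testing only, never production.
    volume_driver = cinder.tests.unit.fake_driver.FakeGateDriver
    volume_group = stack-volumes-fake-gate
    volume_backend_name = fake-gate
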
15:47:38 <scottda> Anything else on that topic geguileo ? I'm another HA testing question...
15:47:51 <scottda> s/I'm/I've
15:48:10 <scottda> dulek: Do we have any multi-node testing other than for upgrades?
15:49:04 <scottda> I'm thinking we could start in on the next steps for HA: either 2 c-vol with Ceph, or work on a shared LVM solution, or both in parallel.
15:50:53 <scottda> I think we could/should try to wrap as much as possible into one multi-node configuration, to avoid having to get multiple jobs and changes through the infra/devstack review process. I don't know if there's other multi-node testing that could run on the same config?
15:51:12 <DuncanT> Sorry, only just back to my desk. Lots talked about today, I'll read the log.
15:51:29 <dulek> scottda: I think some devstack-gate changes will be required to make configurations different on primary and sub.
15:51:37 <dulek> scottda: But yes - that should be next step.
15:51:46 <dulek> Maybe testing migrate/retype between nodes?
15:52:20 <scottda> dulek: Yes, that's a good idea. I'll start with that, since I've got in-flight patches already for the single-node case.
15:52:25 <geguileo> I think we would need to run 2 cinder-volume services in each node
15:52:38 <geguileo> One of those with LVM and out of the cluster
15:52:38 <dulek> geguileo: Does DevStack allow that?
15:52:57 <geguileo> And the other with a storage that can be clustered
15:53:09 <geguileo> dulek: Well, if you use a custom local.sh you can do it
15:53:48 <dulek> geguileo: local.sh, or localrc?
15:54:05 <scottda> geguileo: OK, we'll keep that in mind as we set this up. Thanks.
15:54:16 <geguileo> dulek: local.sh
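
A partial local.conf sketch of the layout being discussed: two backends under one c-vol, with ceph as the clusterable one and LVM left outside the cluster. The genuinely separate second cinder-volume service geguileo describes would still need the custom local.sh; values are illustrative.

    [[local|localrc]]
    # Ceph comes from the devstack plugin; LVM is the built-in backend.
    enable_plugin devstack-plugin-ceph https://git.openstack.org/openstack/devstack-plugin-ceph
    CINDER_ENABLED_BACKENDS=lvm:lvmdriver-1,ceph:ceph
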
15:54:18 <patrickeast> sry, was afk for a bit, scottda geguileo: yea pure's ci is testing cg's with HA stuff
15:54:29 <geguileo> patrickeast: AWESOME!!!
15:54:37 <scottda> BTW, 5 minutes before cinder meeting.  Let's wrap this up...
15:54:55 <patrickeast> dulek: geguileo: i've got a job that does pure + ceph in AA on two nodes, but the ceph plugin doesn't seem to like it
15:55:00 <scottda> Anything else today?
15:55:08 <patrickeast> so ceph won't just be an out-of-the-box clustered kind of deal
15:55:12 <geguileo> patrickeast: lol, we'll have to look at that
15:55:36 <geguileo> patrickeast: Deploying with devstack you mean?
15:55:41 <patrickeast> geguileo: yea
15:55:52 <patrickeast> geguileo: ceph itself is fine, just the devstack setup scripts
15:56:11 <geguileo> patrickeast: You have to deploy Ceph outside of devstack
15:56:23 <geguileo> patrickeast: And not let devstack do the deployment of Ceph
15:56:50 <patrickeast> ah, that's unfortunate
15:56:53 <geguileo> patrickeast: Or let it do it and then do some ceph config copying and have a different local.conf for the other node
15:57:05 <patrickeast> geguileo: we probably need to change that if we're going to use that for our gate tests with AA
15:58:00 <scottda> ok, time to move to the next meeting. Thanks everyone
15:58:00 <scottda> #endmeeting