15:04:25 <scottda> #startmeeting cinder_testing
15:04:26 <openstack> Meeting started Wed Aug 31 15:04:25 2016 UTC and is due to finish in 60 minutes. The chair is scottda. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:04:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:04:30 <openstack> The meeting name has been set to 'cinder_testing'
15:05:00 <eharney> hi
15:05:09 <patrickeast> o/
15:05:15 <smcginnis> In another meeting as usual, but will kinda be paying attention. :)
15:05:26 <e0ne> hi
15:05:34 <dulek> hello!
15:05:53 <scottda> We've nothing on the agenda. I've been trying to narrow down the oom_killer issue...
15:06:07 <akerr> same here, sprint demo meeting every other wednesday :)
15:06:25 <xyang> hi
15:06:25 <scottda> I cannot repro oom on stable/mitaka with 'tox -epy34' and a 2GB VM (as opposed to master, which repros every time)
15:06:54 <e0ne> I would like to ask smcginnis and all cores to take a look at https://review.openstack.org/#/c/348449/ - fake drivers integration for devstack
15:07:11 <smcginnis> e0ne: Opened
15:07:22 <scottda> I'm doing a loose binary search on commits to see if I can find a point in time where the problem starts.
15:07:23 <scottda> e0ne: I'll look
15:07:39 <e0ne> smcginnis: thanks. IMO a PTL vote would be useful for it
15:07:57 <openstackgerrit> xiexs proposed openstack/cinder: Convert InvalidVolumeMetadataSize to webob.exc.* https://review.openstack.org/356213
15:08:06 <smcginnis> scottda: My concern is that it's just been a gradual increase in memory needed to run the UTs, so I'm worried we won't find a smoking gun.
15:08:11 <akerr> I would like to ask that we try to figure out a way to test manage/unmanage in the gate. cFouts has been mentioning it a lot lately, but we have internal tests for those functions that have been broken about 3 times so far this release due to a lack of functional gate tests
15:08:16 <scottda> We could ask people to review patrickeast's devstack patches, but I see one with a +2 and 8 +1's and yet no reviews in 4 weeks!
15:08:21 <eharney> i'm planning to cycle back to the oom bug this week myself, and look for general issues in that regard
15:08:29 <scottda> smcginnis: Yes, that's my worry as well.
15:08:52 <akerr> This causes us lots of headaches trying to get patches through our internal review process :(
15:08:56 <geguileo> eharney: Are you able to reproduce it?
15:09:03 <dulek> akerr: +1
15:09:06 <eharney> geguileo: i did last week
15:09:19 <geguileo> eharney: What are the requirements to trigger it?
15:09:39 <scottda> Any help is welcome. I've tried a few approaches, but I'm sure there are ideas I'm missing...
15:09:40 <eharney> easy route is to run the unit tests on a smaller VM, i think i was using 1500MB of RAM
15:09:47 <eharney> i'm going to go at it this week w/ profiling tools
15:09:57 <scottda> geguileo: I trigger it with a 2GB VM, running tox -epy34, every time
15:10:12 <geguileo> eharney: scottda OK, maybe I'll give it a go
15:10:18 <smcginnis> It occasionally hits with 4GB RAM, so I think anything under that increases the likelihood that it will happen.
15:10:27 <scottda> I see a slow, steady increase in memory usage as the unit tests run.
15:11:06 <scottda> strace doesn't show me anything stuck or looping in system calls.
15:11:07 <geguileo> scottda: So it's not a couple of tests' fault, but a general issue
15:11:23 <smcginnis> geguileo: That's my feel.
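
[For anyone who wants to try reproducing this, a minimal sketch of the setup described above: run the unit tests on a memory-constrained VM and sample the workers' resident memory while they run. The sampling loop and the python3 process name are assumptions for illustration, not the exact setup scottda and eharney used.]

    # Rough repro sketch, assuming a VM (or container) already limited to ~2GB RAM
    # and a cinder checkout on master; stable/mitaka reportedly does not trigger it.
    cd cinder
    tox -epy34 &
    TOX_PID=$!

    # Sample the total resident memory of the python3 test workers every 10 seconds;
    # the reported symptom is a slow, steady climb rather than a single spiking test.
    while kill -0 "$TOX_PID" 2>/dev/null; do
        ps -C python3 -o rss= | awk '{s += $1} END {printf "%d MiB\n", s / 1024}'
        sleep 10
    done
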
15:11:30 <scottda> geguileo: That's the current hypothesis
15:11:43 <scottda> I tried removing the zonemanager tests, but it still reproduced.
15:12:02 <scottda> I might try removing other entire directories.
15:12:16 <dulek> scottda: +1 - that can help us narrow down the issue.
15:12:25 <dulek> scottda: You should be able to automate that. :)
15:12:53 <geguileo> We didn't use to hit these issues before
15:13:17 <geguileo> Has anybody tried to git bisect it?
15:13:35 <scottda> geguileo: Yeah, but akerr said he had to up internal VMs for unit tests from 2GB -> 4GB in January
15:13:57 <scottda> geguileo: No, I just ran stable/mitaka without a repro this morning. I can try some bisect next.
15:14:19 <geguileo> scottda: I think that could at least give us an idea
15:14:20 <openstackgerrit> xiexs proposed openstack/cinder: Make examples consistent with actul API https://review.openstack.org/326852
15:15:14 <akerr> scottda: geguileo: and we had to up it from 1G to 2G a while before that, so it has been a gradual thing for a long while now
15:15:40 <geguileo> akerr: That's great to know
15:15:59 <geguileo> Because this could be related to the number of tests more than anything else
15:16:12 <openstackgerrit> Ivan Kolodyazhny proposed openstack/cinder: RBD Thin Provisioning stats https://review.openstack.org/178262
15:16:15 <geguileo> So it's probably something in the base class
15:17:24 <scottda> OK, so there's that issue...
15:17:44 <scottda> Anything we need to talk about regarding the upcoming release?
15:17:45 <scottda> Are tests subject to FeatureFreeze?
15:18:04 <geguileo> scottda: I don't think they should be
15:18:13 <ntpttr> I have a few things related to rolling upgrades testing, not sure if they're related to the upcoming release
15:18:19 <scottda> no, me neither. just asking
15:19:04 <ntpttr> I've been fiddling with some ansible scripts to deploy on multiple nodes at mitaka, combined with some python scripts for setting up a volume / reading and writing to it and backing it up https://github.com/ntpttr/rolling-upgrades-devstack
15:19:05 <scottda> ntpttr: Is that stuff in devstack or infra? It'd be nice to get rolling upgrades in before the release, I would think.
15:19:30 <geguileo> scottda: +1
15:19:35 <ntpttr> I've come across a few things that we might want to take a look at, or at least have in some kind of upgrade documentation
15:19:41 <dulek> scottda: Actually multinode grenade is in the check queue now, non-voting currently.
15:19:52 <ntpttr> scottda, I've been doing manual tests w/ devstack, not infra
15:20:23 <geguileo> ntpttr: What are those issues?
15:20:25 <dulek> ntpttr: Oh, upgrade docs would be cool. I've wanted to write them, but stopped when I noticed that there isn't a good place to put them.
15:20:30 <smcginnis> Catching up - tests will not be subject to feature freeze.
15:20:52 <ntpttr> One thing is, after upgrading the DB and API service to master from mitaka, it's possible that no volume creation will work, because in mitaka having a volume_type of 'None' is okay, but recently an exception has started to be thrown for that
15:21:13 <ntpttr> so people will need to add a default_volume_type before they upgrade
15:21:20 <eharney> ntpttr: i thought we fixed that
15:21:35 <geguileo> eharney: I thought so too...
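
[A sketch of the workaround ntpttr describes just above: setting a default volume type before cinder-api is upgraded. The type name "lvmdriver-1" is only a placeholder and crudini is just one way to edit the option; note that eharney links the actual fix for the underlying bug right below.]

    # Hypothetical example: give requests that omit a volume type a real default
    # before upgrading cinder-api past Mitaka. "lvmdriver-1" is a placeholder; it
    # must be a volume type that already exists in the deployment.
    cinder type-list                    # confirm which volume types exist
    crudini --set /etc/cinder/cinder.conf DEFAULT default_volume_type lvmdriver-1
    # restart cinder-api (and the scheduler) for the new default to take effect
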
15:21:46 <ntpttr> eharney: I ran into the issue yesterday after upgrading the DB and just the API service, with the rest running on mitaka
15:21:46 <eharney> ntpttr: https://review.openstack.org/#/c/353665/
15:22:13 <eharney> ntpttr: please file a bug if it still throws exceptions
15:22:24 <ntpttr> eharney: I'll give it another test today and file one if it does
15:22:52 <geguileo> ntpttr: Thanks, what other issues?
15:23:06 <scottda> ntpttr: dulek We'd like some upgrade docs. I'm guessing we can figure out somewhere to put them.
15:23:16 <ntpttr> Another thing I ran into had to do with the backup service - it seems like when Cinder is out of sync with the rest of the deployment (in my case it was cinder at mitaka and the other services at master), creating a backup results in an error
15:23:40 <scottda> ntpttr: Yes, I've seen that. I'm not sure what the resolution was...
15:23:57 <ntpttr> I thought it might have had to do with cinder and swift being at different branches, but it looked like the problem was with rootwrap or something before any calls to swift were even made
15:24:09 <scottda> I think it was a version pinning thing in the DB
15:24:19 <dulek> ntpttr: What error? I've tested that manually and fixed a bug related to that.
15:24:46 <dulek> ntpttr: https://review.openstack.org/#/c/350534/
15:24:48 <ntpttr> dulek: I don't have a paste of the error handy, but it was rootwrap throwing an unauthorized action exception I think
15:25:14 <dulek> ntpttr: Oh, so maybe rootwrap filters weren't updated.
15:25:40 <dulek> ntpttr: If there's such a requirement when upgrading, we need to signal it in the release notes.
15:25:41 <geguileo> ntpttr: I think you should create bugs for all the issues, with instructions, and maybe an Etherpad so we can nail those down asap
15:25:52 <ntpttr> geguileo: Will do
15:25:58 <geguileo> ntpttr: Thanks!!
15:26:03 <ntpttr> np!
15:26:07 <dulek> ntpttr: Thanks a lot!
15:26:13 <dulek> Okay, so can I use a moment here?
15:26:19 <geguileo> ntpttr: Please let us know when you have a list of them so we can give them priority
15:26:21 <ntpttr> I had one other question related to this, but go ahead dulek
15:26:34 <dulek> ntpttr: Oh, sorry, go on. :)
15:26:35 <ntpttr> geguileo: Sure, I'll try and nail it down today and tomorrow
15:26:49 <geguileo> ntpttr: Awesome!
15:27:06 <scottda> ntpttr: Go ahead with your question.
15:28:02 <ntpttr> dulek: Okay thanks :). I was just wondering if it would maybe be good to have some kind of API that an admin could call when they're planning an upgrade, something that could check whether there are any pending tasks like a backup or volume being created, so that services don't go down in the middle of the process
15:28:28 <ntpttr> Maybe also to send a pause signal to stop new processes from starting until the upgrade has begun
15:29:34 <ntpttr> a deployment tool could use it to wait to shut down until a running process is complete
15:29:35 <dulek> ntpttr: Good thinking. I've always assumed this is addressed by the fact that services finish all their running jobs when they are shut down
15:29:37 <geguileo> ntpttr: If the service is properly configured then it won't just go down in the middle
15:29:55 <geguileo> ntpttr: There is a timeout that must be set
15:30:05 <dulek> Only SIGKILL destroys the service immediately.
15:30:06 <geguileo> ntpttr: But nobody sets it up
15:30:11 <dulek> And now the fun part!
15:30:25 <ntpttr> geguileo: oh really? That's good to know - I assumed though that a deployment tool like kolla, which destroys containers and brings up new ones, wouldn't be checking that
15:30:27 <dulek> I think that in DevStack the service may get killed immediately.
15:30:33 <ntpttr> kolla just being one example
15:30:56 <dulek> Because oslo.service has some strange stuff related to whether the process is a daemon or not.
15:31:18 <geguileo> ntpttr: It's done by oslo.service; dulek and I will be talking about that, among other things, in our OpenStack Barcelona talk
15:31:36 <ntpttr> geguileo: awesome, I'll be sure to check that out
15:31:37 <dulek> ntpttr: Yup, in the case of kolla that sucks. But IMO that should be addressed in Kolla.
15:31:57 <ntpttr> dulek: I agree - if the services can handle it, deployment tools can make use of it
15:32:13 <ntpttr> that was sort of my idea w/ the api, just a tool for deployment tools to make use of
15:32:19 <ntpttr> if it already exists that's great
15:33:31 <geguileo> dulek: What did you want to talk about?
15:33:43 <ntpttr> I think that answers my question, I'll be sure to watch your talk in person, or via the internet if I can't make it to the summit
15:34:00 <ntpttr> thanks!
15:34:19 <dulek> Just wanted to note that the multinode grenade job is in the check queue as non-voting right now. I wonder what the requirements on job stability are to make it voting.
15:34:29 <dulek> And how to actually check job stability.
15:34:47 <dulek> Because goo.gl/g6GO7t isn't really clear. ;)
15:37:43 <dulek> Hm, did I just get disconnected? ;)
15:37:45 <scottda> e0ne: Do you have thoughts? You've looked at moving various jobs to voting in the past.
15:37:58 <geguileo> dulek: No, I just have no idea on those
15:38:25 <e0ne> scottda: AFAIR, we don't have formal requirements, but we usually ask for it to be stable for 1-3 months
15:38:49 <geguileo> e0ne: And how do we check its stability?
15:39:02 <scottda> yeah, and then we ask for people's opinions at a cinder meeting about moving it to voting, IIRC
15:39:12 <e0ne> geguileo: http://graphite.openstack.org/
15:39:30 <e0ne> geguileo: and compare it to the devstack+lvm job
15:39:47 <geguileo> e0ne: Oh, OK, comparing it
15:39:52 <dulek> e0ne: Thanks!
15:39:59 <e0ne> dulek: np
15:40:12 <dulek> So - we'll have voting upgrade testing in a few months. :)
15:40:23 <dulek> (unless it gets unstable :P)
15:40:44 <geguileo> lol
15:40:59 <e0ne> dulek: let's talk about it at the design session :)
15:41:00 <geguileo> I have a question related to the HA A/A testing
15:41:50 <geguileo> Will anybody with a storage that supports CGs be able to test that part?
15:42:14 <scottda> patrickeast: Are you already doing this with your CI?
15:42:59 <geguileo> I have no access to that kind of storage for tests
15:43:09 <scottda> me neither
15:43:10 <geguileo> And I'm afraid that with the amount of changes that are going in they'll break something
15:43:21 <xyang> geguileo: if we get e0ne's patch in https://review.openstack.org/#/c/348449/, we can use the fake gate driver
15:43:37 <xyang> that uses LVM
15:44:01 <dulek> geguileo: Fake driver has CG support.
15:44:01 <dulek> geguileo: Or GateDriver.
15:44:01 <dulek> geguileo: It mocks CGs on LVM.
15:44:25 <geguileo> I was hoping for some real tests...
15:44:40 <xyang> geguileo: that is a real test
15:44:54 <xyang> geguileo: it creates volumes on LVM
15:44:56 <geguileo> xyang: Then it's not mocking them?
15:44:56 <tommylikehu_> hey guys, does anyone have any idea about the Jenkins issues?
15:45:03 <xyang> not mocking
15:45:14 <geguileo> xyang: Aaaaaah, awesome!!!
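
[For context, the kind of CG exercise geguileo is after could look roughly like the following once a CG-capable backend is configured, whether real storage or the gate driver being discussed. The volume type name "fake" and the resource names are made-up placeholders; the consisgroup-*/cgsnapshot-* commands are the standard python-cinderclient ones of this era.]

    # Hypothetical CG smoke test, assuming a volume type named "fake" is backed by
    # a CG-capable driver (the type name is a placeholder, not from the discussion).
    CG_ID=$(cinder consisgroup-create --name ha-cg-test fake | awk '$2 == "id" {print $4}')

    # Put a volume in the group, snapshot the group, and list both to verify
    cinder create --consisgroup-id "$CG_ID" --volume-type fake --name ha-cg-vol1 1
    cinder cgsnapshot-create --name ha-cg-snap "$CG_ID"
    cinder consisgroup-list
    cinder cgsnapshot-list
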
15:45:16 <xyang> geguileo: it is called fake just because it is not for production
15:45:42 <geguileo> xyang: OK, then I'll try to use that one in my tests
15:45:45 <xyang> geguileo: because it can't guarantee consistency on the snapshots, but we are not testing that part anyway
15:45:49 <geguileo> And see if I can try it with it
15:45:52 <dulek> geguileo: It mocks CG consistency, not Cinder resources. ;)
15:46:07 <geguileo> dulek: Cool
15:46:17 <xyang> dulek: to be accurate, it does not mock consistency
15:46:20 <geguileo> Then I will be able to test it myself
15:46:32 <xyang> dulek: it claims that it is not consistent
15:46:46 <geguileo> xyang: dulek Thanks for the answers :-)
15:46:47 <xyang> but definitely good enough for testing
15:47:08 <dulek> xyang: I think we had the same thing in mind. :)
15:47:20 <xyang> dulek: sure :)
15:47:38 <scottda> Anything else on that topic geguileo? I'm another HA testing question...
15:47:51 <scottda> s/I'm/I've
15:48:10 <scottda> dulek: Do we have any multi-node testing other than for upgrades?
15:49:04 <scottda> I'm thinking we could start in on the next steps for HA: either 2 c-vol with Ceph, or work on a shared LVM solution, or both in parallel.
15:50:53 <scottda> I think we could/should try to wrap as much as possible into one multi-node configuration, to avoid having to get multiple jobs and changes through the infra/devstack review process. I don't know if there's other multi-node testing that could run on the same config?
15:51:12 <DuncanT> Sorry, only just back to my desk. Lots talked about today, I'll read the log.
15:51:29 <dulek> scottda: I think some devstack-gate changes will be required to make the configurations different on the primary and the subnode.
15:51:37 <dulek> scottda> But yes - that should be the next step.
15:51:46 <dulek> Maybe testing migrate/retype between nodes?
15:52:20 <scottda> dulek: Yes, that's a good idea. I'll start with that, since I've got in-flight patches already for the single-node case.
15:52:25 <geguileo> I think we would need to run 2 cinder-volume services on each node
15:52:38 <geguileo> One of those with LVM and out of the cluster
15:52:38 <dulek> geguileo: Does DevStack allow that?
15:52:57 <geguileo> And the other with a storage that can be clustered
15:53:09 <geguileo> dulek: Well, if you use a custom local.sh you can do it
15:53:48 <dulek> geguileo: local.sh, or localrc?
15:54:05 <scottda> geguileo: OK, we'll keep that in mind as we set this up. Thanks.
15:54:16 <geguileo> dulek: local.sh
15:54:18 <patrickeast> sry, was afk for a bit, scottda geguileo: yea, Pure's CI is testing CGs with the HA stuff
15:54:29 <geguileo> patrickeast: AWESOME!!!
15:54:37 <scottda> BTW, 5 minutes before the cinder meeting. Let's wrap this up...
15:54:55 <patrickeast> dulek: geguileo: i've got a job that does Pure + Ceph in A/A on two nodes, but the ceph plugin doesn't seem to like it
15:55:00 <scottda> Anything else today?
15:55:08 <patrickeast> so Ceph won't just be an out-of-the-box clustered kind of deal
15:55:12 <geguileo> patrickeast: lol, we'll have to look at that
15:55:36 <geguileo> patrickeast: Deploying with devstack you mean?
15:55:41 <patrickeast> geguileo: yea
15:55:52 <patrickeast> geguileo: ceph itself is fine, just the devstack setup scripts
15:56:11 <geguileo> patrickeast: You have to deploy Ceph outside of devstack
15:56:23 <geguileo> patrickeast: And not let devstack do the deployment of Ceph
15:56:50 <patrickeast> ah, that's unfortunate
15:56:53 <geguileo> patrickeast: Or let it do it, and then do some ceph config copying and have a different local.conf for the other node
15:57:05 <patrickeast> geguileo: we probably need to change that if we're going to use that for our gate tests with A/A
15:58:00 <scottda> ok, time to move to the next meeting. Thanks everyone
15:58:00 <scottda> #endmeeting
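
[Following up on that last exchange, one possible shape of the "copy the ceph config and use a different local.conf on the other node" approach geguileo describes. The REMOTE_CEPH flag and the backend string are devstack-plugin-ceph conventions assumed here, not something stated in the meeting; verify them against the plugin before relying on this.]

    # Hypothetical subnode setup for an A/A job: reuse the Ceph cluster deployed by
    # the primary node instead of letting the subnode's devstack deploy its own.
    scp -r primary:/etc/ceph /etc/            # copy ceph.conf and keyrings first

    cat >> local.conf <<'EOF'
    [[local|localrc]]
    enable_plugin devstack-plugin-ceph https://git.openstack.org/openstack/devstack-plugin-ceph
    REMOTE_CEPH=True
    CINDER_ENABLED_BACKENDS=ceph:ceph
    EOF
    ./stack.sh
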