16:00:24 #startmeeting cinder-nova-api-changes
16:00:24 Meeting started Thu Sep 21 16:00:24 2017 UTC and is due to finish in 60 minutes. The chair is ildikov. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:25 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:28 The meeting name has been set to 'cinder_nova_api_changes'
16:00:38 johnthetubaguy jaypipes e0ne jgriffith hemna mriedem patrickeast smcginnis diablo_rojo xyang1 raj_singh lyarwood jungleboyj stvnoyes
16:00:52 @!
16:00:52 <_pewp_> jungleboyj \(-_- )
16:00:58 o/
16:01:33 let's wait a tiny bit, I guess people are slower joining after having the PTG last week :)
16:02:50 o/
16:03:02 o/
16:03:05 o/
16:03:07 ok, let's start
16:03:08 pheww
16:03:17 jgriffith: just in time :)
16:03:39 so we agreed on a few things last week and are about to move forward
16:03:41 jgriffith: Slides into home.
16:03:58 I saw recent comments on the live_migration patch and test
16:04:25 mriedem: there's only one service version check in the new attach patch that I'm aware of
16:04:28 looks like i have a grenade issue there.
16:04:39 mriedem: which is not supposed to make the Grenade test fail
16:05:01 it could be a problem in the test, i'm not sure
16:05:18 i'll look into it to see why it's failing
16:05:28 but yes, we shouldn't be using any new style attachments if any computes are not running the latest code
16:05:33 so i'm not sure why grenade would have issues
16:05:44 i am guessing it's a test issue
16:05:55 maybe, i haven't grokked your tempest change
16:07:31 while we're on the topic of tempest tests...
16:07:33 the attach patch should be up to date with the compute service version
16:07:45 but let me know if something went wrong there
16:08:27 I've been working on debugging the migrate iscsi test. In my local env I am seeing very occasional detach timeouts - DeviceDetachFailed: Device detach failed for vdb: Unable to detach from guest transient domain.
16:08:54 happened twice yesterday, but after a bunch of debug, hasn't happened yet today
16:09:18 i've been updating this bug - https://bugs.launchpad.net/nova/+bug/1696125
16:09:18 Launchpad bug 1696125 in OpenStack Compute (nova) "Detach interface failed - timeout waiting to detach tap device in linuxbridge job (pike)" [High,In progress] - Assigned to Matt Riedemann (mriedem)
16:09:21 fyi
16:10:02 seems to go down into libvirt and libvirt never responds with detach complete
16:10:16 stvnoyes I ran into that at one point a while back, but with the old code. Problem "went away" and I never got it back
16:10:27 stvnoyes and never figured out what it was doing either :(
16:10:56 stvnoyes: are you using ovs or linuxbridge?
16:11:03 ovs
16:11:19 are you running latest nova?
16:11:26 there was a recent change that went into that code not too long ago
16:11:35 sorta, as of Aug 31, after the last change listed in that bug
16:12:04 https://review.openstack.org/#/q/I8cd056fa17184a98c31547add0e9fb2d363d0908
16:12:14 ok you should have that then
16:13:00 melwitt, mdbooth, lyarwood and kashyap are the people to ask about that weird persistent vs transient domain stuff
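For context on the persistent vs transient distinction being discussed: with libvirt-python, detaching a device from a running guest is two operations, one against the persistent config (which generally succeeds immediately) and one against the live, i.e. transient, domain (which only requests the detach and relies on the guest to cooperate, which is where the timeout in stvnoyes' test shows up). A minimal sketch only, not Nova's actual driver code; the device name, poll interval and retry count are made up:

    import time
    import libvirt

    def detach_device(dom: libvirt.virDomain, device_xml: str,
                      dev_name: str = 'vdb', retries: int = 8) -> None:
        # Persistent config: synchronous, normally just works.
        if dom.isPersistent():
            dom.detachDeviceFlags(device_xml, libvirt.VIR_DOMAIN_AFFECT_CONFIG)
        # Live/transient domain: libvirt only *asks* the guest to release
        # the device, so we have to poll until it actually disappears.
        for _ in range(retries):
            try:
                dom.detachDeviceFlags(device_xml, libvirt.VIR_DOMAIN_AFFECT_LIVE)
            except libvirt.libvirtError:
                pass  # detach may already be in flight or complete
            time.sleep(2)
            if dev_name not in dom.XMLDesc(0):  # crude check; Nova parses the XML
                return
        raise RuntimeError('Device detach failed for %s: unable to detach '
                           'from guest transient domain' % dev_name)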
16:13:38 ok thx. the persistent detach seems to work ok, it dies on the transient detach
16:14:14 i am hoping to get a failure with my latest debug code, should learn more when that happens
16:14:32 unless of course it changes things enough to prevent the failure
16:16:13 stvnoyes: thanks for looking into it
16:16:26 I haven't run into this one outside of the Gerrit reviews
16:17:38 is there anything else to add to this topic?
16:17:49 all set here
16:18:01 stvnoyes: cool, thanks
16:18:29 so we're working on a couple of things to get the multi-attach prerequisites covered as well before merging the new attach patch
16:18:52 the reason is to wait and have the latest microversion out so we don't need to deal with bumps later if possible
16:19:24 we said we will have a uses_shared_connections flag
16:19:36 * johnthetubaguy hides in the back row
16:19:48 looking more into it, the recent plan is to have that in the volume info
16:19:58 as opposed to the attachment or the connection_info
16:20:13 mriedem: johnthetubaguy: any objections to that? ^^
16:20:23 johnthetubaguy: glad you could join :)
16:21:05 johnthetubaguy mriedem for a little more detail, the idea being that you'd want *something* to use as a lock string even during attachment-reserve/create
16:21:12 as long as when we do attachment_update and attachment_delete we have the info, I think it is fine
16:21:21 So the proposal is that we have a bool added in the volume that says "uses_shared_targets"
16:21:22 jgriffith: why on reserve?
16:21:39 patrickeast raised a concern stating it needed to be there
16:21:52 do we know why?
16:22:00 I'm not 100% convinced, but didn't feel like rat holing
16:22:03 and as it is a capability it doesn't change over attachments, so it makes more sense as part of the volume info
16:22:19 so a lock on reserve isn't really implementable in Nova right now
16:22:23 jgriffith: didn't he say attachment_update?
16:22:27 ildikov +1 it made sense to associate it with the volume regardless
16:22:50 ildikov well, yes; the problem being that he'd need the info DURING the attachment_update
16:22:54 yeah, in volume_info seems to make sense, from a data model POV
16:22:59 which means ideally he'd have it on the attachment_reserve
16:23:25 I am guessing non-Nova users will appreciate it before attachment_create/reserve etc
16:23:30 that would mean having it in the return value from reserve at the latest, and not being able to use it for reserve itself
16:23:30 So my proposal is to just put that flag in the volume, as well as the volume_backend_name which could then be used for the lock
16:23:38 jgriffith: +1
16:24:03 yeah, we want it for attachment_update and _delete, I think that is the main bit
16:24:24 johnthetubaguy yeah, and this way you can have it for *all the things* if something comes up
16:24:35 we retrieve the volume info before those calls I think in all cases, so it should be ok
16:24:37 logically it makes sense, and the testing so far looks bueno
16:24:44 we could still save it in the BDM if we wanted to
16:24:52 hope to have it posted later this morning
16:24:55 what happens if the volume is retyped?
16:25:01 and the backend name changes
16:25:03 if I can figure out which Instance I had all the code on :)
16:25:07 jgriffith: yeah, all sounds fine, my worry was there was something happening in reserve/create we don't understand
16:25:15 mriedem then that info gets updated if it's a backend migration
16:25:43 johnthetubaguy no, I don't think there is, it's more about when/how to get the info on the caller side
16:26:02 jgriffith: That plan makes sense to me.
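To make the proposal concrete: the caller-side locking jgriffith describes might look roughly like this, with oslo.concurrency providing the lock. The two volume fields ('uses_shared_targets' and 'volume_backend_name') are taken straight from the proposal above and the cinderclient call is sketched; this is the idea, not the eventual implementation:

    from oslo_concurrency import lockutils

    def update_attachment(cinder, attachment_id, connector, volume):
        # 'uses_shared_targets' / 'volume_backend_name' are the proposed
        # volume fields from the discussion above, not a shipped API.
        if volume.get('uses_shared_targets'):
            # Serialize connect/disconnect against this backend so two
            # instances on one host can't tear down a shared iSCSI target
            # out from under each other.
            with lockutils.lock(volume['volume_backend_name'], external=True):
                return cinder.attachments.update(attachment_id, connector)
        return cinder.attachments.update(attachment_id, connector)

Since the flag is a backend capability rather than per-attachment state, keying the lock on the backend name means all attach/detach flows for volumes on that backend serialize against each other, which is exactly what a shared target requires.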
16:26:04 mriedem did that address our question?
16:26:06 i don't really want to think about retyping a multi-attach volume
16:26:14 mriedem me neither :(
16:26:15 the info makes even more sense on the volume level considering back end migration IMHO
16:26:35 There's a reason we used to not allow retype of an in-use volume :)
16:26:48 yeah, let's not ruin this nice morning with that thought... :/
16:26:57 jgriffith: cool
16:27:14 mriedem +1 on that
16:28:54 mriedem: One step at a time. :-)
16:28:58 ok, it seems to me that we're in agreement here
16:29:40 so if anyone wants to rain on this parade please do it now
16:29:46 or don't do it at all :)
16:30:26 shared connections are backend specific,
16:30:31 and the volume is typed per backend,
16:30:37 so it makes sense to model that in the volume
16:30:39 so +1
16:31:25 great, so we will add the info to the volume and do the microversion bumps accordingly
16:32:12 * johnthetubaguy nods
16:32:21 next update
16:32:34 is there a spec to +1 around all that, I maybe missed the link?
16:32:53 johnthetubaguy: I will add this to the multi-attach spec after the meeting
16:33:09 OK
16:33:20 johnthetubaguy: I added the info about the lock there, but wanted to get to an agreement on where we put the extra info before updating the spec again
16:33:30 +1
16:33:56 johnthetubaguy: I also added a note there about the policies, I believe, where I will need some double checks and suggestions as well
16:34:06 cool
16:34:21 johnthetubaguy: so highly appreciated if you can check the spec once I update it again shortly
16:34:26 you guys don't want a cinder spec for any of that?
16:34:29 I think the policy should be mostly cinder side, but I should read through that
16:34:36 mriedem: I was wondering that too
16:34:39 mriedem yeah, we do/will
16:34:52 I'll have it posted prior to proposing the changes
16:34:52 nova was going to have a policy rule for multi-attach boot from volume i think
16:35:02 cinder was going to have a policy rule for multi-attach in general
16:35:30 it was in my nova/cinder ptg recap email
16:35:37 covering that on both sides would be good, I agree
16:36:04 I'm happy to type it up, but my knowledge on policies is fairly weak, so I'll need some help
16:36:38 some of this could be called "configuration" depending on how you expose this in the API
16:37:07 I guess I was meaning giving some control to the operators, however you feel is best
16:37:51 mriedem: Yeah, we were going to have policies for multi-attach, read-only and read/write
16:39:31 jungleboyj we have some discussing to do around some of that on the Cinder side FWIW
16:39:33 but anyway
16:40:24 ok, so let's discuss what we need to before the next meeting
16:40:25 jgriffith: Ok.
16:40:41 stvnoyes: yay https://photos.app.goo.gl/Q8JdpjM0PZhAzsv32
16:40:54 I will update the Nova side spec and we can add a reference to any Cinder side document we will have
16:40:54 required every time i review any live migration code
16:41:26 nice, so simple :-)
16:42:05 mriedem: nice! in the sense that it exists
16:42:40 mriedem: that first one is a cast depending on the microversion now
16:42:43 we should have more of these honestly, I wish I weren't too lazy to draw some when I go and re-understand these flows...
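Returning for a moment to the policy rules mentioned above: in-tree OpenStack policies are declared with oslo.policy, so the Cinder-side multi-attach rule and the Nova-side boot-from-multi-attach-volume rule could be sketched like this. Rule names and defaults here are illustrative guesses based on the discussion, not the rules that eventually merged:

    from oslo_policy import policy

    # Hypothetical names/defaults only.
    multiattach_rules = [
        policy.RuleDefault(
            name='volume:multiattach',
            check_str='rule:admin_or_owner',
            description='Create or attach a volume in multi-attach mode.'),
        policy.RuleDefault(
            name='os_compute_api:servers:create:boot_from_multiattach_volume',
            check_str='rule:admin_or_owner',
            description='Boot a server from a multi-attach volume.'),
    ]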
16:42:59 johnthetubaguy: that's from the API, which is not in this diagram
16:43:09 i'm starting at the conductor live migration task
16:43:13 mriedem: good point
16:44:39 ok
16:44:53 further items
16:45:18 I updated the new attach patch to remove the tiny race conditions by changing the order of attachment_delete and attachment_create
16:45:24 to have create first
16:45:45 +1
16:45:49 with jgriffith we're now in a smaller rabbit hole on tracking the volume and attachment state on the Cinder side
16:46:04 that's this right? https://review.openstack.org/#/c/504467/
16:46:12 mriedem yes
16:46:15 i was confused how that played into the changes we needed on the nova side
16:46:19 i guess they are separate issues?
16:46:43 partially
16:46:52 it's a multi-attach issue for the most part
16:47:14 But there's a problem with reordering those things while making sure we have the right "states" on things
16:47:16 mriedem: the problem with having create come first is that it changes the volume state to 'reserved', which messes with delete, which then changes the volume state to 'available' which also doesn't work, and that's not even multi-attach yet...
16:47:59 what I mean is not the enforcement/check of things, but at the end of actions, figuring out the disaster that is all the various states stashed here and there when you have multiple attachment records
16:48:05 and you don't treat them as the primary resource
16:48:08 ok
16:48:24 so if a volume is reserved or in-use because there is an existing attachment,
16:48:39 and we create another attachment to keep it reserved, that changes the volume status from in-use to reserved?
16:48:46 mriedem correct
16:48:48 rather than just leaving it in-use
16:48:49 ok
16:48:54 oh...
16:48:56 no, opposite
16:49:06 what happens currently is it will get updated to reserved
16:49:11 I don't think that's what we want
16:49:17 we want to keep the in-use status
16:49:17 and once we delete the old attachment, it should leave the volume reserved, but not in-use if that remaining attachment doesn't have a connector
16:49:19 mriedem: exactly
16:49:33 something like that
16:49:46 mriedem the last one you pointed out yes
16:49:52 mriedem: yes and right now if you delete an attachment it will change the volume state to 'available' blindly
16:49:53 ok
16:49:57 jgriffith: Agreed. Seems like in-use is the right status.
16:50:11 jungleboyj: in-use depends on whether the attachment has a connector
16:50:22 from nova's pov, in-use or reserved are basically the same
16:50:23 mriedem: +1
16:50:25 https://review.openstack.org/#/c/504467/5/cinder/volume/api.py L#2048
16:50:43 I'm open to suggestions/discussion if people think that ranking is incorrect
16:51:02 mriedem: Right, but it shouldn't get switched to reserved if there is a connector and it is still in-use in at least one location. Right?
16:51:16 jungleboyj: correct
16:51:20 at least that's what i'd expect
16:51:20 jgriffith: I couldn't finish my comments before the meeting, but will add them shortly
16:51:27 mriedem: ++
16:51:35 honestly I'd like to propose a complete rewrite of all of the status mess in Cinder, but I think that's a bigger problem and don't want to pause the Nova attachment changes any more than we have to
16:51:47 mriedem: and more importantly we also want to get the 'attaching' and 'detaching' statuses right, would be my thinking
16:51:50 mriedem: Granted I am new to this and trying to catch up. If I make no sense you can tell me to go away. ;-)
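The rule everyone converges on above, sketched as a hypothetical helper (this is not the ranking code in the review linked at 16:50:25): derive the volume status from the full set of attachment records instead of letting the last attachment_create/attachment_delete blindly overwrite it.

    def derive_volume_status(attachments):
        # No attachment records left: the volume really is free.
        if not attachments:
            return 'available'
        # Any attachment with a connector means a host can do I/O, so the
        # volume stays in-use even while other attachments are being
        # created or deleted.
        if any(a.get('connector') for a in attachments):
            return 'in-use'
        # Only connector-less attachments remain: just reservations.
        return 'reserved'

Under this rule, creating a second attachment on an in-use volume leaves it in-use, and deleting one of two attachments leaves it in-use or reserved depending on whether the survivor has a connector, matching what mriedem and jgriffith agree on above.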
16:52:14 jungleboyj no!! You don't get to "go away"
16:52:28 jgriffith: Yeah, that would be a much larger effort and we should let this work finish and stabilize first.
16:52:29 and if something doesn't make sense, ping me or somebody else to figure it out
16:52:32 jgriffith: that makes sense, and as we don't want to reintroduce much status checking in Nova, I would hope we can rewrite this later without any big mess
16:52:33 We have made so much progress.
16:52:47 awww, jgriffith wants me around! :-)
16:52:54 * jgriffith didn't say that
16:52:56 there is no attaching status right?
16:53:01 * jungleboyj laughs
16:53:03 mriedem there is
16:53:07 it's available, reserved, in-use, detaching, available
16:53:25 jungleboyj: we're just eager to share the pain with as many fellow Stackers as possible ;)
16:53:48 :-)
16:53:49 ok, i'll just leave this alone...
16:53:55 mriedem: attachment_update puts the volume in 'attaching' state
16:54:05 well, that's the attach_status yeah?
16:54:07 and it's attachment_complete that moves it to 'in-use'
16:54:08 there is status, and attach_status
16:54:15 like vm and task state in nova
16:54:26 a vm can be ACTIVE and 'migrating'
16:54:34 anywho
16:54:51 mriedem: ah, ok
16:55:06 btw,
16:55:12 i should not know this much about the internal state machine in cinder
16:55:22 mriedem and now you're asking for my 20 minute rant on how F'd up all of this is on the Cinder side :)
16:55:22 mriedem this is why I feel the whole thing is "broken" in Cinder and needs fixing
16:55:31 mriedem: I wouldn't volunteer right now to write up all the possible variations of the 4 statuses we have...
16:55:33 mriedem agreed
16:55:50 anyway... I do plan to "fix" it
16:56:00 just not before the nova attach changes
16:56:05 ok, i'll go back to reviewing steve's live migration change
16:56:07 mriedem: correct, Nova shouldn't play on that field in an ideal world
16:56:19 The fact that there is status and attach_status blew my mind when I realized it.
16:56:46 jungleboyj don't forget the attachment records (new and old) also have an independent attach_status column as well
16:56:55 ok, so I think we all have homework that we can move forward with
16:57:05 we really want to make sure we record attach_status!!
16:57:31 jgriffith: it's very important! :)
16:57:47 FWIW, my intent was that attach-related statuses on the volume go away; the attachment records are the only ones that matter
16:58:00 1 source of truth
16:58:13 jgriffith: ++ That sounds like the right goal.
16:58:17 anyway... we don't want to go into that right now probably
16:58:24 nope
16:58:28 we have 2 minutes left
16:58:57 so I wonder whether we have any other topics we need to discuss this week and haven't yet?
16:59:53 ok, it seems that was it for this week
16:59:59 thank you all!
17:00:14 let's continue the chats on the channel(s)
17:00:24 see you here next week
17:00:28 #endmeeting
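Extending the earlier sketch to illustrate the status vs attach_status split and jgriffith's "1 source of truth" intent: both volume-level fields could in principle become pure projections of the attachment records, analogous to Nova's vm_state/task_state split where a VM can be ACTIVE while 'migrating'. This is only a sketch of the direction discussed, with an assumed set of per-attachment attach_status values; it is not the code that merged:

    def project_volume_state(attachments):
        """Derive (status, attach_status) from attachment records alone."""
        if not attachments:
            return 'available', 'detached'
        attach_statuses = {a['attach_status'] for a in attachments}
        if 'attached' in attach_statuses:
            # At least one fully connected attachment: the volume is
            # in-use, even if another attachment is still mid-attach.
            return 'in-use', 'attached'
        if 'attaching' in attach_statuses:
            return 'reserved', 'attaching'
        if 'detaching' in attach_statuses:
            return 'detaching', 'detaching'
        # Only bare reservations (no connector yet) remain.
        return 'reserved', 'reserved'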