12:00:13 #startmeeting nova api
12:00:14 Meeting started Tue Feb 23 12:00:13 2016 UTC and is due to finish in 60 minutes. The chair is alex_xu. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:00:16 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
12:00:18 The meeting name has been set to 'nova_api'
12:00:24 'allo
12:00:38 who is here today?
12:00:43 o/
12:01:06 o/
12:01:07 hi there
12:01:24 o/
12:01:45 o/
12:01:45 welcome everyone, let's start the meeting
12:02:02 #topic API futures - patches for approved specs
12:02:11 any API patches we want to discuss?
12:02:32 so eliqiao_ is looking at the live-migrate ones
12:02:41 I think it's worth a quick catch-up
12:02:46 yup
12:02:53 #link https://review.openstack.org/#/c/259319
12:03:12 * bauzas lurks
12:03:35 basically, I was talking about not returning values from that API, because if we do return things, we are forced to keep the API synchronous (it's not actually enforced today)
12:03:58 yup
12:03:58 so the plan was to only look at making the arguments we want to remove optional
12:04:41 johnthetubaguy: agreed, so we think about how to return that info later?
12:05:05 probably a proposal in Newton?
12:05:05 yeah, I think so
12:05:13 given the conversation about the network API, I'm more inclined to stop having optional parameters here
12:05:14 it's really part of the general migration status API, I feel
12:05:21 couldn't we do 'host': 'auto'
12:05:37 sdague: that's not a bad idea
12:05:53 I do wonder if we can actually *remove* the other params, rather than keep them
12:06:28 as an aside, I think this is an API that needs to fail until the upgrade has completed
12:06:48 I think we need to ensure we have the correct min_version of the service being reported by all compute nodes
12:07:12 basically, we need the source and destination to be able to auto-compute the values, before we can make those arguments optional in the API
12:07:13 johnthetubaguy: self.compute_api.live_migrate is a sync call, I think we can catch the exception from it and fail.
12:07:50 johnthetubaguy, do we really need to do that?
12:07:52 eliqiao_: no, I want this to be a blanket ban, in a standard way
12:07:52 johnthetubaguy: so if the compute RPC API raises any exception, the REST API gets notified and can re-raise it.
12:07:57 johnthetubaguy: I'm confused, what needs to be computed on the target?
12:08:06 johnthetubaguy: actually the new microversion API is only enabled when all nodes are upgraded, just like this: https://github.com/openstack/nova/blob/master/nova/compute/rpcapi.py#L1042
12:08:13 sdague: the is_block_migration stuff, etc
12:08:39 sdague: if block_migration is None, we compute it on the compute node.
12:09:06 there is the work bauzas is hitting on, that eventually leads to a forced host, but this is separate from that, as I understand it
12:09:07 eliqiao_: ok, and that only works with newer RPC?
12:09:17 so during the upgrade process, we cannot do live-migration if we pass block_migration=None.
12:09:34 it needs a new compute RPC, well, service version, as I understand it
12:09:39 sdague: yes, only works for new RPC.
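As background on the rpcapi.py pinning eliqiao_ links above, here is a minimal sketch of that RPC version-gating pattern. The version number, method body, and exception class are illustrative assumptions, not nova's actual code:

```python
# Sketch only: gate a new argument behind the compute RPC version, so the
# call fails fast during a rolling upgrade. The version '4.8' and the
# UpgradeNotComplete exception are illustrative assumptions.
import oslo_messaging as messaging


class UpgradeNotComplete(Exception):
    pass


class ComputeAPI(object):
    def __init__(self, client):
        # client is an oslo.messaging RPCClient for the compute topic
        self.client = client

    def live_migration(self, ctxt, instance, dest, block_migration, host):
        version = '4.8'  # hypothetical version that understands None/'auto'
        if block_migration is None and not self.client.can_send_version(version):
            # An old compute cannot auto-compute block_migration, so refuse
            # outright rather than do something half-upgraded and wrong.
            raise UpgradeNotComplete()
        cctxt = self.client.prepare(server=host, version=version)
        return cctxt.call(ctxt, 'live_migration', instance=instance,
                          dest=dest, block_migration=block_migration)
```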
12:09:43 this is the same discussion we had for multi-attach, basically
12:10:09 I think it is: if the 'None' passes down, then an exception returns back, which means the new API version won't work until all nodes are upgraded
12:10:13 for simplicity, block the new API until the upgrade has completed, as we know things will work, rather than trying to test all combinations of problems we hit in the middle
12:10:13 ok, has anyone written this flow down somewhere in detail? This seems like a lot of moving parts to get right
12:10:37 it does feel like it's time for a new spec revision
12:10:39 johnthetubaguy: blocking an API is new infrastructure we don't have
12:10:54 johnthetubaguy: for now, can we use this way to block the new API? https://github.com/openstack/nova/blob/master/nova/compute/rpcapi.py#L1042
12:10:59 johnthetubaguy: correct, but that will be a Newton-ish thing
12:11:02 however alex_xu pointed to how the crash dump bits work, which seems sane
12:11:11 sdague: well, we do have the DB method here: https://github.com/openstack/nova/blob/master/nova/objects/service.py#L333, but that will require a change to add the caching infrastructure for APIs, not just the conductor
12:11:35 so for a single-node thing, that's OK
12:11:40 just using the destination compute node
12:11:45 johnthetubaguy: right, but it's a thing we've never done before, and where the edge cases are is kind of unclear
12:11:49 but this involves compute<->compute things
12:12:10 so we went through this with multi-attach, hence the blocking on min_service_version
12:12:27 how about we just need to make sure the new API blocks for any compute to any compute
12:12:31 you are right, it's the first time
12:12:53 johnthetubaguy: that still feels weird honestly, and will have unintended consequences if a node gets wedged somewhere
12:13:02 all the auto-compute bits are new
12:13:04 the edge case we hit is that there is no way to atomically update the min_service_version cache in all the API nodes, although you can do a drain and move behind an LB to avoid that niggle
12:13:33 OK, I thought we agreed this at the mid-cycle for multi-attach, so it's just copying that pattern
12:13:33 I guess, here is the question
12:13:52 so the big question, I guess, is what do we do for Mitaka with this change?
12:13:53 do we think this is landing in Mitaka
12:14:00 right
12:14:20 because blocking the API on a service min version calculation seems like a lot of new moving parts
12:14:29 looking at my watch, my head says no chance, my heart says folks need this
12:15:11 * eliqiao_ :-)
12:15:11 honestly...
12:15:16 this is an admin API
12:15:43 so, I think if we think this API is needed in Mitaka, we can do it without the safety net
12:16:02 and just doc that under a partially upgraded cloud this is going to potentially break
12:16:12 the admin should then just not use it
12:16:26 and work on safety-net infrastructure in Newton
12:16:43 the problem is lots of folks use live-migrate to do their upgrades, otherwise I would agree with you
12:16:53 sure, and those folks won't be able to use this
12:17:06 those folks will never be able to use this, honestly
12:17:13 * alex_xu lost the point
12:17:17 johnthetubaguy, that's why I asked about insisting everything has a min version
12:17:29 why will those folks never be able to use this?
12:17:41 right, because if you insist everything has a min version, you *can't* live-migrate to do upgrades
12:17:49 the API should work for the old version API? (ignore me if it looks like I totally missed the context)
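For reference, the DB-backed check johnthetubaguy links (objects/service.py) could gate the whole microversion along these lines. This is a rough sketch; the threshold constant and the exception class are made up for illustration:

```python
# Sketch: refuse the new microversion until every nova-compute reports a
# high enough service version (the multi-attach pattern mentioned above).
# The threshold value and exception class are illustrative assumptions.
from nova import objects


class UpgradeInProgress(Exception):
    """Placeholder; the real API would map this to a proper HTTP error."""


MIN_COMPUTE_FOR_AUTO_LIVE_MIGRATE = 7  # hypothetical service version


def assert_cloud_upgraded(context):
    # get_minimum_version() is the linked DB method: the lowest version
    # currently reported by any 'nova-compute' service record.
    minimum = objects.Service.get_minimum_version(context, 'nova-compute')
    if minimum < MIN_COMPUTE_FOR_AUTO_LIVE_MIGRATE:
        raise UpgradeInProgress(
            'new-style live-migrate is blocked until all computes upgrade')
```

The edge case raised at 12:13:04 lives exactly here: if that minimum is cached per API node, the caches cannot all flip atomically.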
12:17:56 alex_xu: right, exactly
12:18:40 so that's why I want the check in there really, so they don't hit a nasty live-migrate error by trying the new API
12:18:49 reno: if you use live migration to upgrade your cloud, you should not use microversions > 2.21 until the cloud is fully upgraded. Unexpected results will happen.
12:18:57 johnthetubaguy: right, but a microversion is explicit
12:19:19 they had to take an ACTIVE action to say they wanted to use this new live migration API
12:19:30 and we can tell them, don't do that until you are upgraded
12:19:44 sdague: we can raise an exception like trigger_crash_dump, so it won't be an unexpected result
12:19:52 so we do use this infrastructure in the conductor already
12:20:36 johnthetubaguy: I 100% don't understand why you believe this is an issue
12:21:11 microversions used by clients don't magically increment
12:21:28 well, not unless they use the CLI, but yes
12:21:29 the concern is the novaclient CLI?
12:22:05 I see 2 options based on the calendar: we can fix this in docs and land it in Mitaka, or we can miss Mitaka and build safeguards in code.
12:22:39 we could add the safeguards in Newton and backport them, maybe?
12:22:54 we're not going to get this into the nova CLI before release anyway
12:23:02 maybe, they seem high risk
12:23:10 honestly, documentation is fine
12:24:26 it only affects folks calling the new microversion, the new safety things, that is, and the infrastructure is already being used in the conductor, so it seems lower risk to me, but I am probably just too heads-down in that stuff at this point
12:24:41 OK, I will not block it, let's see what we can land without the safeguard
12:24:59 johnthetubaguy, just want to be clear on one thing
12:25:02 mostly because it's an admin API
12:25:35 johnthetubaguy, this means block_migration=None would be the thing that shouldn't be done?
12:25:38 we could file a reno issue
12:25:49 during upgrade
12:25:58 PaulMurray: basically, yes
12:25:58 so it would be super clear and not hidden in an upgrade or feature note
12:26:23 bauzas: not a bad idea
12:26:25 johnthetubaguy, is that the same as a normal migration at the moment (i.e. if you don't specify block_migration at all)
12:26:41 like a "known bug" or rather a limitation
12:26:52 PaulMurray: if you don't specify, the API fails, I believe
12:26:54 so, to be clear, what is the semantic intent of the current code for block_migration not being there?
12:27:02 is that also 'block_migration': 'auto' ?
12:27:17 johnthetubaguy, so any tooling that does a migration without block migration may fail
12:27:21 for the current code, you must specify block_migration in the API
12:27:27 sdague: yes, although frankly I think we should remove the arg altogether, so auto is the only valid option
12:27:28 I think so
12:27:44 PaulMurray: right, that tooling would fail today
12:28:11 PaulMurray: you have to explicitly ask for the new microversion, and change your API call to not include that argument, to make this work
12:28:12 johnthetubaguy, a lot of people who use live migration do it with shared storage or cinder volumes, so most would fail
12:28:22 ??
12:28:23 johnthetubaguy: ok, there is no valid option other than 'auto' ?
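If the docs-only route wins, alex_xu's draft at 12:18:49 would land as a reno release note, roughly like this sketch (the file name and the choice of the "issues" section are assumptions):

```yaml
# releasenotes/notes/live-migrate-during-upgrade-<hash>.yaml -- sketch only
issues:
  - |
    If you use live migration to upgrade your cloud, you should not use
    microversions > 2.21 of the live-migrate API until the cloud is fully
    upgraded, otherwise unexpected results may happen.
```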
12:28:29 I'm fine with it being removed then
12:28:31 optional seems bad
12:28:43 sdague: yeah, I think we should just remove it
12:29:02 johnthetubaguy, sdague: agreed
12:29:07 we should remove it after the scheduler can track the shared storage, I think
12:29:11 (I agree)
12:29:22 ok, so block_migration should be removed, not made optional
12:29:37 and host should be made to support 'auto', not made optional
12:29:38 alex_xu: oh... so that's the issue, we can't get it correct yet?
12:29:40 hmm..
12:30:03 johnthetubaguy: yeah, the scheduler may choose a node which isn't in the shared storage pool
12:30:30 so we give the user a chance to choose block_migration or not when they have multiple storage pools in the cluster
12:31:12 I am thinking we are just not ready to make this change yet...
12:31:22 after we have resource providers, we probably can track that
12:31:23 ++
12:31:50 even if we do provide the resource providers infra, it will require Cinder to tell us the usage
12:31:55 do we have to deal with the block thing right now? host='auto' seems pretty useful
12:32:14 ok, who is the primary driver of this change?
12:32:33 sdague: do you mean host='auto' is equal to host=None (current)?
12:32:40 eliqiao_ is working on those patches
12:32:44 eliqiao_: yes
12:32:57 sdague, eliqiao_ is the driver (right?)
12:33:00 how about block_migration? block_migration='auto' also?
12:33:04 so the block_migration=auto is probably the more useful bit, the rest is no change, I think?
12:33:10 yup
12:33:27 and I plan to modify the 'host' behaviour in my check-dests spec
12:33:31 what's the use case here, right? It's basically to stop asking the operator questions that we can answer better than they can
12:33:34 * alex_xu needs to think more about the Cinder point bauzas mentioned
12:34:09 because I feel like if we think this is needed to go in, I need a new write-up about what's going in, what are the failure conditions the code is going to cover, and what are the scenarios the admin needs to just not do
12:34:21 i.e. failure conditions owned by the admin
12:34:50 yeah, me too
12:34:56 I think we have to kill this one for now
12:35:10 or give folks 24 hrs to build that
12:35:15 and see if we're comfortable
12:35:21 ok, looks like we have agreement on that
12:35:27 that would work
12:35:33 just because I can't get it all in my head right this second doesn't mean it's bad
12:35:51 eliqiao_ / alex_xu - can you all get this written up for tomorrow this time?
12:35:54 sdague: what does 'write up' point to, do you mean eliqiao_ needs to update the spec?
12:36:03 I have a feeling we just identified that the use case we want to enable doesn't actually work for the deployments where it's really useful
12:36:13 spec, etherpad, email
12:36:23 sdague: ok, cool, got it
12:36:27 it doesn't really matter, it just needs to be a fresh look at the whole thing
12:36:51 yeah, need the whole picture described neatly
12:36:53 honestly, I think an updated spec is probably the worst option. Start with a clean page and write out the change and the failure circumstances
12:36:57 ok, let's move on, looks like we are cool here
12:37:09 okay, got it.
12:37:11 I want to give another live-migration patch update here...
12:37:14 yep
12:37:39 #link https://review.openstack.org/#/c/258771/
12:38:01 we said we only return in-progress migrations for the migrations sub-resource of servers
12:38:42 but we added a ref-link in /os-migrations later. So this patch changes the show method to return any migration, not only in-progress migrations anymore; does this look cool to everyone?
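To make the 12:29:22 conclusion concrete (before the shared-storage objection put it on hold): the os-migrateLive action body would change roughly as sketched below, with Python dicts standing in for the JSON. The "old" parameter set matches the 2.1 contract discussed; the new microversion number itself was not settled:

```python
# Sketch of the request-body change discussed above; illustrative only.
# Current API: the operator must answer questions nova can work out itself.
current_body = {
    'os-migrateLive': {
        'host': 'compute-02',      # must name a destination host
        'block_migration': False,  # must know the storage layout
        'disk_over_commit': False,
    }
}

# Proposed microversion: host supports 'auto', and block_migration is
# removed entirely (auto-computed from source and destination instead).
proposed_body = {
    'os-migrateLive': {
        'host': 'auto',
    }
}
```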
12:39:05 404 means it's completed?
12:39:23 johnthetubaguy: yes, then the ref-link looks useless
12:39:29 404 means not active
12:39:40 true, it's not active, not completed
12:40:03 so is it ok to keep show returning 404 for a not-active migration?
12:40:06 I think that's a problem with the old API, not the new API
12:40:40 yeh, it's all a little gorpy, but I think that's the best we've got right now
12:40:53 as an operator, I know it's often useful to know where an instance has moved, really to double check billing related things, hence /os-migrations
12:41:06 so the main bit is, the API is going in the right direction
12:41:16 johnthetubaguy: and that's not stored in instance_actions?
12:41:20 alex_xu, FYI - the cancel live migration uses delete to mean cancel - does that make sense?
12:41:21 and the old API linking to the new stuff is handy for active
12:41:40 i.e. delete migration => cancel it
12:41:43 PaulMurray: yeh, DELETE on the resource is right
12:41:45 PaulMurray: yes, I think that is agreed by people on the spec
12:41:54 I think that's why we went with the sub-resource, to make that more clear
12:42:01 alex_xu, sdague: yes, it's agreed in the spec, just making sure
12:42:27 sdague: well, migrations neatly just describe all the moves, which is what you want in that use case, the previous locations of the server (I have a feeling we don't add instance_actions for things with migrations, at least not yet)
12:42:27 ok, cool, that sounds like it makes sense to me, I will feed that back to the developer
12:42:28 in that case it doesn't make sense to return migrations that are not on-going
12:42:31 so I feel like this & cancel are more critical to land
12:42:42 PaulMurray: right
12:42:50 so I think we are happy with the new API
12:42:56 so we should go for it
12:43:02 ok, cool
12:43:09 so we can move on
12:43:23 johnthetubaguy, alex_xu - hold on
12:43:26 one thought
12:43:27 where is the cancel API change?
12:43:38 PaulMurray: sure, please go ahead
12:43:43 can we look at why a migration finished?
12:43:58 sdague: https://review.openstack.org/277971
12:44:10 PaulMurray: right now, no
12:44:18 so not currently, and well, I think the idea is instance-actions should really tell you about that stuff
12:44:34 yeh, instance-actions seems best for all of that
12:44:43 johnthetubaguy, sdague: ok - I'm cool with that
12:45:15 if no more patches are brought up, we will move on
12:45:16 ok, so we should try to land https://review.openstack.org/#/c/258771/ & https://review.openstack.org/277971 this week
12:45:28 because those would be really good to have in Mitaka.
12:45:35 +1
12:45:38 so aggressive review appreciated
12:45:39 yea, they are pretty close
12:45:56 hmm, do they not depend on each other?
12:46:07 they conflict with each other :)
12:46:19 um... that seems bad
12:46:26 yeah... we should fix that
12:46:32 yeh, they are both trying to get 2.23
12:46:38 I think cancel goes on top, logically?
12:46:46 the status is first?
12:46:54 johnthetubaguy: yes, that's what I assumed
12:47:11 that sounds ok
12:47:13 PaulMurray: can you reach out to get that rebase done?
12:47:13 PaulMurray: can you poke andrearosa to restack his patch?
12:47:31 johnthetubaguy, sure - does it really matter which goes first?
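The two reviews being landed together give the server-migrations sub-resource roughly the semantics below, sketched as a client would use them. The endpoint, token, server/migration IDs, and the exact microversion split (status first at 2.23, cancel stacked on top) are assumptions based on the discussion:

```python
# Sketch of the agreed sub-resource semantics; all values are placeholders.
import requests

NOVA = 'http://controller:8774/v2.1'  # placeholder endpoint
HEADERS = {
    'X-Auth-Token': 'PLACEHOLDER_TOKEN',
    # clients must opt in to the new behaviour explicitly (see 12:21:11)
    'X-OpenStack-Nova-API-Version': '2.23',
}
base = NOVA + '/servers/SERVER_UUID/migrations'

# GET shows a migration of the server; per the patch discussed above, show
# would return any migration, with the 404 semantics still a bit gorpy.
requests.get(base + '/42', headers=HEADERS)

# DELETE on the same resource means "cancel the in-progress live migration".
requests.delete(base + '/42', headers=HEADERS)
```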
12:47:55 (wondering if one is closer to being done)
12:48:14 deleting a resource you can't show is kind of weird
12:48:22 +1 that
12:48:27 I would rather not do that
12:48:43 never mind, will do - we have the same thing with pause, which is already in
12:48:56 but I'll get andrearosa to go second
12:49:07 oh my, so yeah, get the list in ASAP
12:49:49 cool, we can move on
12:49:55 #topic Nova Microversion testing in Tempest
12:50:07 sdague: I guess this is your item
12:50:40 gmann has some patches up to make this better, I haven't gotten back to them yet because I've been working on some nova bugs. I'll try to get that on my review radar this week
12:51:08 cool
12:51:11 I guess this is technically OK post-FF
12:51:18 as it's testing, not features
12:51:38 although it's a distraction from critical bugs, but it might find some
12:51:38 the tempest team is going to spin tempest-lib back into tempest to make things easier here. The split-out actually made it really hard to stay current with API testing
12:52:10 oh, to stop the land-it, then release-it cycle
12:52:47 yeh
12:53:04 a lot of things got kind of hard to fix with the REST interface in tempest being in a separate tree
12:53:16 anyway, that's a different thing, we can move on
12:53:45 ok
12:53:52 #topic open
12:53:59 there is one item: "Remove get servers by DB index with microversions."
12:54:06 looks like we already did that
12:54:54 get servers by DB index was already removed, without a microversion, a few days ago
12:55:10 so, any more open items to bring up?
12:55:54 emm.. I'm surprised we can still end the meeting a little early today
12:55:58 There's the problem with creating a server without supplying a network
12:56:24 speller: what's the problem?
12:56:27 the documentation says the server will be connected to all isolated networks in the tenant, and that is not the API's behavior
12:56:56 so this relates a little to the other conversation on the ML
12:57:06 The document says that if I don't supply a network when creating the server "the API provisions the server instance with all isolated networks for the tenant". But if you don't supply a network you get a message from nova: "Multiple possible networks found, use a Network ID to be more specific."
12:57:26 the "default" API behaviour around the networking is incorrectly documented, it turns out
12:57:33 right
12:57:42 so we should fix the doc
12:57:45 it's also madness, but that's a different thing, that we spoke about on the ML
12:57:48 yes
12:57:54 to both things
12:58:40 who can take the action to fix that up?
12:58:56 and/or create the bug
12:59:12 I can
12:59:13 speller: can you create a bug for this, referencing the docs and what you see?
12:59:16 thanks
12:59:19 speller: thanks
12:59:20 speller: awesome
12:59:24 we can work on a fix from there
12:59:32 I can do that too if you want
12:59:37 speller: sure
12:59:41 would be appreciated
12:59:46 sure thing
12:59:49 I think we only auto-add networks, when using neutron, if there is only one of them, but yeah, need to check that
12:59:53 ask in #openstack-nova if you need any help
13:00:04 johnthetubaguy: that sounds about right
13:00:10 sorry, it's time to close the meeting
13:00:22 #endmeeting
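As a postscript on the last open item: johnthetubaguy's 12:59:49 summary of the observed boot-without-network behaviour boils down to something like this sketch. It is illustrative logic only, not nova's implementation, and the zero-network case was not covered in the discussion:

```python
# Illustrative sketch of the observed behaviour with neutron, per the
# discussion above -- not nova's actual code.
def pick_default_network(tenant_networks):
    if len(tenant_networks) == 1:
        # exactly one candidate network: nova auto-attaches it
        return tenant_networks[0]
    # the documented "provision with all isolated networks" behaviour does
    # not happen; instead the request fails with the message speller quoted
    raise ValueError('Multiple possible networks found, use a Network ID '
                     'to be more specific.')
```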