14:02:43 #startmeeting Nova Live Migration
14:02:44 Meeting started Tue Feb 23 14:02:43 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:02:45 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:02:47 The meeting name has been set to 'nova_live_migration'
14:02:53 hi
14:02:55 o/
14:02:57 o/
14:03:04 o/
14:03:12 o/
14:03:12 o/
14:03:24 agenda here: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:03:44 git diff
14:03:47 sorry
14:03:53 #topic Priority reviews
14:04:14 Feature freeze is effectively the end of this week
14:04:17 o/
14:04:47 let's go through the few patches we have remaining, as on the agenda
14:05:00 starting with live-migration-progress-report
14:05:09 this was discussed in the API meeting earlier
14:05:22 https://review.openstack.org/#/c/258771
14:05:38 can you briefly write here what's the conclusion?
14:05:40 i couldn't attend
14:05:52 There is a new patchset needed
14:06:02 so it only shows in-progress migrations
14:06:29 for both index and get?
14:06:43 yes, for the first version
14:06:44 I have a question - does anyone know if a novaclient change has been done for this?
14:07:05 okay, well, this is what we agreed on in the spec
14:07:06 https://review.openstack.org/#/c/281335/
14:07:12 PaulMurray: ^^ python-novaclient
14:07:15 it's almost good to go
14:07:23 needs a small fix
14:08:03 (Andrey's comment)
14:08:06 good - didn't see it on the etherpad - I'll add it
14:08:30 this needs to be done this week too - there are two other python-novaclient changes coming as well
14:08:47 johnthetubaguy, would it be best to chain all these novaclient changes
14:09:08 PaulMurray: per Andrey's comment, we should chain it because of microversions
14:09:26 yeah, needs to be in a chain
14:09:36 it gets confusing when we have holes in the support
14:09:44 it should stop them all conflicting all the time, as well
14:10:42 so shall we go progress-report <- force-complete <- abort
14:10:50 (left first)
14:11:16 maybe, force and abort can be in either order, I guess
14:11:36 actually the first two are already there - yes, so only andrearosa to add his
14:11:40 on the end of the change
14:11:46 s/change/chain/
14:11:49 ack
14:11:50 PaulMurray: force is +W'd already
14:12:08 shouldn't we keep microversion order?
14:12:28 pkoniszewski, yes, of course - I'll catch up don't
14:12:46 andrearosa last and all is well I think
14:13:00 i.e. abort comes last
14:13:36 For abort-migrations we have
14:13:38 https://review.openstack.org/#/c/277971
14:13:50 Again - this needs to be rebased on
14:13:58 the report-migrations patch
14:14:00 I have a request and 2 open questions
14:14:07 go ahead
14:14:11 the request is please help me in testing it
14:14:25 the first open question is about
14:14:54 the exception we want to raise in the API, if HTTPNotFound
14:15:05 or HTTPBadRequest in the event the migration is not found
14:15:18 I put in the code HTTPNotFound but it seems
14:15:23 if it's +W already, then let's just leave it, it's a shame, but that's OK.
14:15:43 ppl prefer to have HTTPBadRequest to be consistent with what we
14:15:49 have for force_complete
14:16:06 so we have an API guideline on that, let me dig it up
14:16:06 https://review.openstack.org/#/c/277971/8/nova/api/openstack/compute/server_migrations.py
14:16:50 http://specs.openstack.org/openstack/api-wg/guidelines/http.html#failure-code-clarifications
14:17:02 so the thing that is missing is in the URL I assume?
14:17:10 rather than in the body of the request
14:17:20 it's part of the URL
14:17:29 right, so it should be 404 if it's in the URL
14:17:37 it's 400 if it's in the body of the request
14:17:47 okay, so i will fix force complete to be consistent
14:17:59 because it is against the guideline right now
14:18:11 oh dear, yes, it should be consistent
14:18:28 let's do that before it merges, ideally
14:18:48 force complete has merged - abort hasn't
14:18:54 oh
14:19:05 I would need to check if that's another microversion, technically...
14:19:22 I think it is
14:19:31 because we change a return code
14:21:05 so, I think users always expected 404 being possible, so we might get away without the bump
14:21:17 http://docs.openstack.org/developer/nova/api_microversion_dev.html#when-do-i-need-a-new-microversion
14:21:23 but that's kinda bending the rules, really
14:21:44 johnthetubaguy, shall we do the right thing
14:21:45 I will submit a fix and let's try to agree in reviews, is that ok?
14:21:47 anyways, don't let this stop the more general conversation here
14:21:58 pkoniszewski: +1
14:22:02 andrearosa, did you have a second question?
14:22:09 ok yes I did
14:22:25 there is a discussion about returning a more detailed error
14:22:26 #action pkoniszewski to make force-complete return code consistent with guidelines
14:22:33 from the libvirt driver
14:22:59 let me find the right link with the discussion
14:23:15 is it not an asynchronous API, so that's not possible?
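[editor's note] The 404-vs-400 rule from the API-WG guideline linked above can be summarised in a small sketch. The helper below is hypothetical (it is not Nova's actual code, which raises webob's HTTPNotFound/HTTPBadRequest); it only illustrates the rule agreed in the meeting: a migration id missing from the request URL is 404, a bad reference inside the request body is 400.

```python
# Hypothetical helper illustrating the API-WG failure-code rule discussed
# above; Nova itself raises webob exceptions rather than returning codes.
from http import HTTPStatus


def status_for_missing_migration(location):
    """Return the HTTP status when a migration cannot be found.

    location: "url" if the migration id was part of the request URL
    (e.g. DELETE /servers/{server_id}/migrations/{migration_id}),
    "body" if the bad reference only appeared in the request body.
    """
    if location == "url":
        # The resource addressed by the URL does not exist -> 404
        return HTTPStatus.NOT_FOUND
    # A bad value inside the request payload -> 400
    return HTTPStatus.BAD_REQUEST
```

This is why abort (migration id in the URL) should use 404, while force-complete as merged used 400 and was flagged as inconsistent with the guideline.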
14:23:49 sorry I didn't explain very well, to log a more detailed error
14:24:04 to try to understand if the abort failed for a generic reason or if it failed
14:24:18 because we tried to abort a job which was finished by the time we call the abort
14:24:40 https://review.openstack.org/#/c/277971/7/nova/virt/libvirt/driver.py
14:24:57 andrearosa, that's a race condition between getting past the check in the api and actually trying to do it in the compute manager?
14:24:58 that is not the latest PS but it is the PS with the discussion
14:25:00 is the compute manager RPC a call or a cast?
14:25:14 it's async
14:25:28 so force completing a complete thing, is probably counted as a success, in my book
14:25:50 like removing something that's already been removed, the thing you asked for has been completed.
14:26:19 now if it's cancelled, by the time you do force, then that's a different thing
14:26:32 I mean this is an admin operation, so let's focus on good logging, honestly
14:27:04 can we distinguish between a complete migration and one in error?
14:27:13 or does error also count as complete
14:27:17 we are open to race conditions in any case and what I do is to catch a general exception and reraise it and log it
14:27:18 ?
14:27:50 when we abort a live-migration we mark the migration as cancelled
14:28:05 if the live-migration doesn't complete because of an error
14:28:09 it is marked as in error
14:28:36 so does libvirt tell us what happened here?
14:28:50 that sounds like another race condition - so we need to go from libvirt's view
14:29:10 in my understanding it just tells us 0 good -1 error, pkoniszewski am I right?
14:29:30 it's what the libvirt docs are saying
14:29:33 in which case, we need to get libvirt changed to help us with this
14:29:33 nope, the migration status will be set based on the outcome detected by the monitor thread in the driver
14:30:03 so if it is running and someone aborts it then it will complete as cancelled
14:30:05 so that's racey too, can we not double check the job status after the exception has been raised?
14:30:20 if the job actually completed, we have our answer
14:30:31 if it finishes before the abort gets to take action it shows up as ended
14:31:10 oh wait, this is abort not force complete... so technically the abort failed
14:31:15 the issue raised in the review was what is the outcome of the abort op and how should that be reported
14:31:50 is the migration marked as error if the abort fails?
14:31:59 johnthetubaguy: yes the abort fails if the job was not running
14:32:00 yes, if the abort tries to call abortJob on a libvirt domain that has no job running you get an exception which is reported
14:32:08 PaulMurray:
14:32:19 nope if the failure was because
14:32:22 in the action-list and instance fault objects/tables
14:32:25 the live-migration was finished
14:32:44 the live-migration is reported as completed and the abort as failed
14:32:54 surely if the abort does not take effect it should not change the state of the migration?
14:33:08 correct
14:33:10 so if we get a failure we just log a message
14:33:12 PaulMurray, yes, that is what happens
14:33:47 so I'm not sure what the problem is?
14:33:56 yes, it logs an error if the abort fails
14:34:11 yes the discussion point is to see if we want to try to report the error and try to see if we can distinguish between a failed abort because the job was not running
14:34:18 or because of a general error in libvirt
14:34:36 at the moment we just log a generic error
14:34:49 and it's up to the operators to dig and try to understand what was wrong
14:35:01 so back to the beginning: can we distinguish the cases?
14:35:13 without a race
14:35:41 in my understanding no we can't, but Timofey has a different opinion
14:35:43 yes
14:36:02 and he is not here
14:36:10 right :(
14:36:16 we can, you need to call job abort directly
14:36:24 not via wrapper
14:36:27 ?
14:36:45 but it will still return 0/-1
14:36:49 so if we are talking about correctness of a log, let's ignore this, and get back to it
14:36:51 or call jobInfo first and then not call abortJob if there is no job
14:37:09 johnthetubaguy, yes, this can be improved
14:37:11 later
14:37:15 if we call job info, after the exception, would that not tell us if it had completed?
14:37:20 paul-carlton: even in that scenario we are still open to a race
14:37:45 doing it after should avoid the race
14:37:58 so if it really matters, you can call info, abort, then if it fails, info again
14:38:09 right, that
14:38:21 well, actually, just the one info, after it fails, would be enough
14:38:26 I think
14:38:45 then you know for sure that if the second info says the job is active the abort failed to abort the migration rather than failed because the migration was complete
14:38:51 right
14:39:15 andrearosa: would that work?
14:39:16 johnthetubaguy: Randomly chiming in, yes - job info means: "get active job information for the specified disk"
14:39:33 info before doesn't cost much and saves you trying to abort something that has completed
14:39:51 kashyap: OK, so you can only get active jobs, not completed ones
14:39:59 paul-carlton: or failed
14:40:05 johnthetubaguy: I have to test it, I am a bit worried that this could take more time than expected. I wonder if we can go with this solution ATM and put in an improvement, if we find one is possible, later
14:40:16 so this feels like a follow on patch
14:40:18 let's move on
14:40:27 johnthetubaguy: I agree
14:40:27 johnthetubaguy: That should also inform if the job is completed
14:40:34 abort doesn't care if it completed or failed, our task was to abort it, we can only do that if it is running
14:40:48 andrearosa, I think you have a way forward now
14:41:01 let's move on
14:41:08 unless you have any other questions
14:41:08 yes thanks, basically I do not need to change anything, so please review it!
14:41:30 well, please add a follow up patch with ideas to test a fix, but yeah, let's not block on that for the merge
14:41:32 and test it, if you can - I did on a multinode devstack installation
14:41:51 johnthetubaguy: ack
14:41:54 "check experimental" should run the live-migrate jobs
14:42:06 johnthetubaguy, it needs to abort during a migration
14:42:07 I mean it doesn't test the new stuff, but checks if we broke anything
14:42:30 agreed - we should all use the CI job on all patches
14:42:40 we are the live migration subteam!
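[editor's note] The "one extra jobInfo after a failed abort" pattern johnthetubaguy proposed can be sketched as below. The `abortJob`/`jobInfo` method names follow the real libvirt-python `virDomain` API; `FakeDomain`, the job-type constants' use, and the returned strings are stand-ins for illustration, not Nova's driver code.

```python
# Sketch of the race-free classification discussed above: try the abort,
# and only if it fails, look at jobInfo() to decide why it failed.
VIR_DOMAIN_JOB_NONE = 0       # libvirt's value for "no job active"
VIR_DOMAIN_JOB_UNBOUNDED = 2  # e.g. a live migration in progress


class FakeDomain:
    """Stand-in for libvirt's virDomain, for illustration only."""

    def __init__(self, job_type):
        self.job_type = job_type

    def abortJob(self):
        # Real libvirt raises libvirtError when there is no active job.
        if self.job_type == VIR_DOMAIN_JOB_NONE:
            raise RuntimeError("Requested operation is not valid: no job is active")

    def jobInfo(self):
        # Real libvirt returns a list whose first element is the job type.
        return [self.job_type]


def classify_abort_failure(dom):
    """Try to abort the migration job; on failure, say why it failed."""
    try:
        dom.abortJob()
        return "aborted"
    except RuntimeError:
        # One jobInfo *after* the failure: if a job is still active, the
        # abort genuinely failed; if not, the migration had already
        # finished (or was never running) when we tried to abort it.
        if dom.jobInfo()[0] == VIR_DOMAIN_JOB_NONE:
            return "job already finished"
        return "abort failed while job still running"
```

Checking after the failure (rather than before the abort) is what avoids the race: a job that completes between the two calls still classifies correctly as "already finished".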
14:42:42 ok
14:42:49 next was going to be pause-vm-during-live-migration
14:42:53 johnthetubaguy: Just to note, found this via (QEMU's query-events), so it'll emit a "BLOCK_JOB_COMPLETED" event for finished jobs
14:43:48 but pause is now complete except for the result consistency thing we discussed earlier
14:44:04 Making the live-migration API friendly
14:44:10 was discussed in the API meeting
14:44:18 eliqiao_, can you summarise?
14:44:26 PaulMurray: sure
14:44:43 call for the team to help review, starting from #link https://review.openstack.org/#/c/275585/
14:45:15 in the previous API meeting, we discussed not making host and block_migration optional.
14:45:38 #link for API meeting http://eavesdrop.openstack.org/meetings/nova_api/2016/nova_api.2016-02-23-12.00.html
14:45:48 so my take: there be many dragons, let's summarise all the use cases and issues by tomorrow, so we can find a path forward
14:46:24 yes, I will do a write-up by tomorrow to find all the cases.
14:46:44 especially while upgrading.
14:47:09 I am worried that we don't have enough supporting things around shared storage for the new API to be fully useful yet, and yeah, upgrade worries
14:48:13 We should prevent new API requests while doing an upgrade, this can be done in the rpc API layer and give a 400 bad request.
14:49:12 At the moment I am not sure what we are looking to get done this week
14:49:23 sdague suggested not having optional parameters for host and block_migration and using 'auto' instead.
14:49:47 I think what we can do this week is part of what should be clear in the write-up
14:49:55 is that what you were thinking johnthetubaguy
14:50:37 I will try to finish the write-up by tomorrow, I will call on alex_xu for help.
14:50:53 concentrate on defining the problem for now
14:51:05 Besides, I will update the patch (compute api layer and REST API layer) for review.
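[editor's note] For illustration only: the design was still open at this point, but along the lines of sdague's suggestion, a live-migrate request body might look something like the fragment below, with explicit 'auto' values rather than optional parameters. The exact field names and the final microversion were not settled in this meeting.

```json
{
    "os-migrateLive": {
        "host": null,
        "block_migration": "auto"
    }
}
```

Here `null` for host would mean "let the scheduler pick", and `"auto"` for block_migration would mean "detect shared storage and choose block migration accordingly", instead of the caller having to know either up front.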
14:51:22 it's worth reviewing problems with some solutions
14:51:31 but don't worry about picking at this point
14:51:32 johnthetubaguy: sure, I will mention it in the write-up.
14:52:08 we only have this week to do something in mitaka, so I want to understand when we should know if we are doing anything this week
14:52:17 because we all have to have read and understood
14:52:23 by that time
14:53:25 eliqiao_, where will you put the write-up?
14:53:27 it's worth reading the original spec which was proposed by alex_xu. I don't get why my patch hasn't been reviewed for about 6 weeks.
14:53:42 PaulMurray: perhaps I will send it to the ML.
14:53:51 PaulMurray: or do you prefer an etherpad?
14:54:32 eliqiao_, I don't have a preference - perhaps etherpad and announce on the ML?
14:54:43 PaulMurray: okay, I can do that.
14:54:47 so I think we need to chat on IRC this time tomorrow, and go through our options
14:54:49 you could create one now and give us the url
14:54:56 well, ideally earlier actually
14:55:24 how about 11am UTC
14:55:52 johnthetubaguy, ^^?
14:57:15 let's try for that
14:57:34 Before we finish I'd like to plug this series: https://review.openstack.org/#/c/282432/
14:57:47 It's part of libvirt storage pools
14:57:57 I should probably tag that in the commits
14:57:59 #action eliqiao_ to do write-up for Making the live-migration API friendly in etherpad and we can talk about it at 11am UTC on IRC
14:58:16 PaulMurray: ack.
14:58:26 mdbooth: do we want to land any of that in mitaka?
14:58:29 Looks like it needs a rebase, also there's a (genuine! not false positive) failure in the middle which I've just worked out and need to fix.
14:58:37 johnthetubaguy: Ideally all of it, yes.
14:58:50 But I still need to write the upgrade-related fix.
14:59:26 OK, that blueprint is marked as deferred currently, based on the midcycle discussion
14:59:28 johnthetubaguy: The end of the series makes the changes to the driver which actually use it.
14:59:29 mdbooth, is that a bug fix?
14:59:37 johnthetubaguy: Oh, fun.
14:59:52 johnthetubaguy: Well, I haven't been deferring it :) I definitely won't finish it, though.
15:00:05 However, if I'm not allowed to make any progress, I will never finish it.
15:00:28 mdbooth, storage pools were deferred - but I understood there was something you wanted in mitaka to prepare
15:00:33 PaulMurray: Is what a bug?
15:00:47 the title of the patch says "Fix...."
15:00:55 wondered if feature freeze applies :)
15:01:06 get out of my meeting room! :)
15:01:06 #endmeeting