14:02:43 <PaulMurray> #startmeeting Nova Live Migration
14:02:44 <openstack> Meeting started Tue Feb 23 14:02:43 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:02:45 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:02:47 <openstack> The meeting name has been set to 'nova_live_migration'
14:02:53 <andrearosa> hi
14:02:55 <mdbooth> o/
14:02:57 <eliqiao_> o/
14:03:04 <davidgiluk> o/
14:03:12 <pkoniszewski> o/
14:03:12 <jlanoux> o/
14:03:24 <PaulMurray> agenda here: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:03:44 <jlanoux> git diff
14:03:47 <jlanoux> sorry
14:03:53 <PaulMurray> #topic Priority reviews
14:04:14 <PaulMurray> Feature freeze is effectively the end of this week
14:04:17 <paul-carlton> o/
14:04:47 <PaulMurray> lets go through the few patches we have remaining as on agenda
14:05:00 <PaulMurray> starting with live-migration-progress-report
14:05:09 <PaulMurray> this was discussed in the API meeting earlier
14:05:22 <PaulMurray> https://review.openstack.org/#/c/258771
14:05:38 <pkoniszewski> can you briefly write here what's the conclusion?
14:05:40 <pkoniszewski> i couldn't attend
14:05:52 <PaulMurray> There is a new patchset needed
14:06:02 <PaulMurray> so it only shows in-progress migrations
14:06:29 <pkoniszewski> for both, index and get?
14:06:43 <johnthetubaguy> yes, for the first version
14:06:44 <PaulMurray> I have a question - does anyone know if a novaclient change has been done for this?
14:07:05 <pkoniszewski> okay, well, this is what we agreed on in spec
14:07:06 <pkoniszewski> https://review.openstack.org/#/c/281335/
14:07:12 <pkoniszewski> PaulMurray: ^^ python-novaclient
14:07:15 <pkoniszewski> it's almost good to go
14:07:23 <pkoniszewski> needs a small fix
14:08:03 <pkoniszewski> (Andrey's comment)
14:08:06 <PaulMurray> good - didn't see it on the etherpad - I'll add it
14:08:30 <PaulMurray> this needs to be done this week too - there are two other python-novaclient changes coming as well
14:08:47 <PaulMurray> johnthetubaguy, would it be best to chain all these novaclient changes
14:09:08 <pkoniszewski> PaulMurray: per Andrey's comment, we should chain it because of microversions
14:09:26 <johnthetubaguy> yeah, needs to be in a chain
14:09:36 <johnthetubaguy> it gets confusing when we have holes in the support
14:09:44 <johnthetubaguy> it should stop them all conflicting all the time, as well
14:10:42 <PaulMurray> so shall we go progress-report <- force-complete <- abort
14:10:50 <PaulMurray> (left first)
14:11:16 <johnthetubaguy> maybe, force and abort can be in either order, I guess
14:11:36 <PaulMurray> actually the first two are already there - yes, so only andrearosa to add his
14:11:40 <PaulMurray> on the end of the change
14:11:46 <PaulMurray> s/change/chain/
14:11:49 <andrearosa> ack
14:11:50 <pkoniszewski> PaulMurray: force is +W'ed already
14:12:08 <pkoniszewski> shouldn't we keep microversion order?
14:12:28 <PaulMurray> pkoniszewski, yes, of course - I'll catch up
14:12:46 <PaulMurray> andrearosa last and all is well I think
14:13:00 <PaulMurray> i.e.
abort comes last
14:13:36 <PaulMurray> For abort-migrations we have
14:13:38 <PaulMurray> https://review.openstack.org/#/c/277971
14:13:50 <PaulMurray> Again - this needs to be rebased on
14:13:58 <PaulMurray> the report-migrations patch
14:14:00 <andrearosa> I have a request and 2 open questions
14:14:07 <PaulMurray> go ahead
14:14:11 <andrearosa> the request is please help me in testing it
14:14:25 <andrearosa> the first open question is about
14:14:54 <andrearosa> the exception we want to raise in the API, if HTTPNotFound
14:15:05 <andrearosa> or HTTPBadRequest in the event the migration is not found
14:15:18 <andrearosa> I put in the code HTTPNotFound but it seems
14:15:23 <johnthetubaguy> if its +W already, then lets just leave it, its a shame, but thats OK.
14:15:43 <andrearosa> ppl prefer to have HTTPBadRequest to be consistent with what we
14:15:49 <andrearosa> have for force_complete
14:16:06 <johnthetubaguy> so we have an API guideline on that, let me dig it up
14:16:06 <andrearosa> https://review.openstack.org/#/c/277971/8/nova/api/openstack/compute/server_migrations.py
14:16:50 <johnthetubaguy> http://specs.openstack.org/openstack/api-wg/guidelines/http.html#failure-code-clarifications
14:17:02 <johnthetubaguy> so the thing that is missing is in the URL I assume?
14:17:10 <johnthetubaguy> rather than in the body of the request
14:17:20 <pkoniszewski> it's part of URL
14:17:29 <johnthetubaguy> right, so it should be 404 if its in the URL
14:17:37 <johnthetubaguy> its 400 if its in the body of the request
14:17:47 <pkoniszewski> okay, so i will fix force complete to be consistent
14:17:59 <pkoniszewski> because it is against guideline right now
14:18:11 <johnthetubaguy> oh dear, yes, it should be consistent
14:18:28 <johnthetubaguy> lets do that before it merges, ideally
14:18:48 <PaulMurray> force complete has merged - abort hasn't
14:18:54 <johnthetubaguy> oh
14:19:05 <johnthetubaguy> I would need to check if thats another microversion, technically...
14:19:22 <andrearosa> I think it is
14:19:31 <andrearosa> because we change a return code
14:21:05 <johnthetubaguy> so, I think users always expected 404 being possible, so we might get away without the bump
14:21:17 <johnthetubaguy> http://docs.openstack.org/developer/nova/api_microversion_dev.html#when-do-i-need-a-new-microversion
14:21:23 <johnthetubaguy> but thats kinda bending the rules, really
14:21:44 <PaulMurray> johnthetubaguy, shall we do the right thing
14:21:45 <pkoniszewski> I will submit a fix and let's try to agree in reviews, is that ok?
14:21:47 <johnthetubaguy> anyways, don't let this stop the more general conversation here
14:21:58 <johnthetubaguy> pkoniszewski: +1
14:22:02 <PaulMurray> andrearosa, did you have a second question?
14:22:09 <andrearosa> ok yes I did
14:22:25 <andrearosa> there is a discussion about returning a more detailed error
14:22:26 <PaulMurray> #action pkoniszewski to make force-complete return code consistent with guidelines
14:22:33 <andrearosa> from the libvirt driver
14:22:59 <andrearosa> let me find the right link with the discussion
14:23:15 <johnthetubaguy> is it not an asynchronous API, so thats not possible?
14:23:49 <andrearosa> sorry I didn't explain very well, to log a more detailed error
14:24:04 <andrearosa> to try to understand if the abort failed for a generic reason or if it failed
14:24:18 <andrearosa> because we tried to abort a job which was finished by the time we call the abort
14:24:40 <andrearosa> https://review.openstack.org/#/c/277971/7/nova/virt/libvirt/driver.py
14:24:57 <PaulMurray> andrearosa, that's a race condition between getting past the check in the api and actually trying to do it in the compute manager ?
14:24:58 <andrearosa> that is not the latest PS but it is the PS with the discussion
14:25:00 <johnthetubaguy> is the compute manager RPC a call or a cast?
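The 404-vs-400 rule johnthetubaguy cites from the API-WG guideline can be sketched as a tiny helper. This is purely illustrative (the function name and its string arguments are hypothetical, not Nova's actual code, which raises webob exceptions): an identifier that cannot be found in the URL path maps to 404, while a bad value inside the request body maps to 400.

```python
# Illustrative sketch of the API-WG failure-code guideline discussed above.
# Nova's real handlers raise webob.exc.HTTPNotFound / HTTPBadRequest; the
# helper name and 'url'/'body' labels here are hypothetical.

def error_status_for_missing_migration(param_location: str) -> int:
    """Pick the HTTP status when the referenced migration cannot be found."""
    if param_location == "url":
        # e.g. /servers/{server_id}/migrations/{migration_id}
        return 404  # Not Found: the resource named in the URL is missing
    if param_location == "body":
        return 400  # Bad Request: a value inside the request body is invalid
    raise ValueError(f"unknown parameter location: {param_location}")
```

Since the migration id for both abort and force-complete is part of the URL, both should return 404, which is why the already-merged force-complete behaviour was flagged as against the guideline.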
14:25:14 <andrearosa> it's async
14:25:28 <johnthetubaguy> so force completing a complete thing, is probably counted as a success, in my book
14:25:50 <johnthetubaguy> like removing something thats already been removed, the thing you asked for has been completed.
14:26:19 <johnthetubaguy> now if its cancelled, by the time you do force, then thats a different thing
14:26:32 <johnthetubaguy> I mean this is an admin operation, so lets focus on good logging, honestly
14:27:04 <PaulMurray> can we distinguish between a complete migration and one in error ?
14:27:13 <PaulMurray> or does error also count as complete
14:27:17 <andrearosa> we are open to race conditions in any case and what I do is to catch a general exception and reraise it and log it
14:27:18 <PaulMurray> ?
14:27:50 <andrearosa> when we abort a live-migration we mark the migration as cancelled
14:28:05 <andrearosa> if the live-migration doesn't complete because of an error
14:28:09 <andrearosa> is marked as in error
14:28:36 <johnthetubaguy> so does libvirt tell us what happened here?
14:28:50 <PaulMurray> that sounds like another race condition - so we need to go from libvirts view
14:29:10 <andrearosa> in my understanding it just tells us 0 good -1 error, pkoniszewski am I right?
14:29:30 <pkoniszewski> it's what libvirt docs are saying
14:29:33 <johnthetubaguy> in which case, we need to get libvirt changed to help us with this
14:29:33 <paul-carlton> nope, the migration status will be set based on the outcome detected by the monitor thread in the driver
14:30:03 <paul-carlton> so if it is running and someone aborts it then it will complete as cancelled
14:30:05 <johnthetubaguy> so thats racey too, can we not double check the job status after the exception has raised?
14:30:20 <johnthetubaguy> if the job actually completed, we have our answer
14:30:31 <paul-carlton> if it finishes before the abort gets to take action it shows up as ended
14:31:10 <johnthetubaguy> oh wait, this is abort not force complete... so technically the abort failed
14:31:15 <paul-carlton> the issue raised in the review was what is the outcome of the abort op and how should that be reported
14:31:50 <PaulMurray> is the migration marked as error if the abort fails ?
14:31:59 <andrearosa> johnthetubaguy: yes the abort fails if the job was not running
14:32:00 <paul-carlton> yes, if the abort tries to call jobAbort on a libvirt domain that has no job running you get an exception which is reported
14:32:08 <andrearosa> PaulMurray:
14:32:19 <andrearosa> nope if the failure was because
14:32:22 <paul-carlton> in the action-list and instance fault objects/tables
14:32:25 <andrearosa> the live-migration was finished
14:32:44 <andrearosa> the live-migration is reported as completed and the abort as failed
14:32:54 <PaulMurray> surely if the abort does not take effect it should not change the state of the migration ?
14:33:08 <andrearosa> correct
14:33:10 <PaulMurray> so if we get a failure we just log a message
14:33:12 <paul-carlton> PaulMurray, yes, that is what happens
14:33:47 <PaulMurray> so I'm not sure what the problem is ?
14:33:56 <paul-carlton> yes, it logs an error if the abort fails
14:34:11 <andrearosa> yes the discussion point is to see if we want to try to report the error and try to see if we can distinguish between a failed abort because the job was not running
14:34:18 <andrearosa> or because a general error in the libvirt
14:34:36 <andrearosa> at the moment we just log a generic error
14:34:49 <andrearosa> and it's up to the operators to dig and try to understand what was wrong
14:35:01 <PaulMurray> so back to the beginning : can we distinguish the cases ?
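The state semantics the team converges on above can be summarised in a small sketch (names are hypothetical, not Nova's actual code): a successful abort marks the migration 'cancelled', while an abort that does not take effect leaves the migration's own status untouched (whatever the monitor thread recorded: 'completed', 'error', ...) and only the abort operation itself is reported as failed, e.g. via the action list and instance fault tables.

```python
# Illustrative summary of the abort semantics agreed in the discussion.
# Function and status names are hypothetical, not taken from Nova's source.

def resolve_abort(migration_status: str, abort_succeeded: bool) -> dict:
    """Return the migration status and abort outcome after an abort attempt."""
    if abort_succeeded:
        # a running migration that was successfully aborted ends up cancelled
        return {"migration": "cancelled", "abort": "succeeded"}
    # the abort did not take effect: the migration keeps the status the
    # driver's monitor thread assigned it; only the abort op is marked failed
    return {"migration": migration_status, "abort": "failed"}
```

This captures PaulMurray's point that a failed abort must never change the state of the migration itself.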
14:35:13 <PaulMurray> without a race
14:35:41 <andrearosa> in my understanding no we can't but Timofey has a different opinion
14:35:43 <paul-carlton> yes
14:36:02 <PaulMurray> and he is not here
14:36:10 <andrearosa> right :(
14:36:16 <paul-carlton> we can, you need to call job abort directly
14:36:24 <paul-carlton> not via wrapper
14:36:27 <andrearosa> ?
14:36:45 <pkoniszewski> but it will still return 0/-1
14:36:49 <johnthetubaguy> so if we are talking about correctness of a log, lets ignore this, and get back to it
14:36:51 <paul-carlton> or call infoJob first and then not call abortJob if there is no job
14:37:09 <PaulMurray> johnthetubaguy, yes, this can be improved
14:37:11 <PaulMurray> later
14:37:15 <johnthetubaguy> if we call info job, after the exception, would that not tell us if it had completed?
14:37:20 <andrearosa> paul-carlton: even in that scenario we are still open to race
14:37:45 <johnthetubaguy> doing it after should avoid the race
14:37:58 <paul-carlton> so if it really matters, you can call info, abort, then if it fails info again
14:38:09 <johnthetubaguy> right, that
14:38:21 <johnthetubaguy> well, actually, just the one info, after it fails would be enough
14:38:26 <johnthetubaguy> I think
14:38:45 <paul-carlton> then you know for sure that if the second info says the job is active the abort failed to abort the migration rather than failed because migration was complete
14:38:51 <johnthetubaguy> right
14:39:15 <johnthetubaguy> andrearosa: would that work?
14:39:16 <kashyap> johnthetubaguy: Randomly chiming in, yes - job info means: "get active job information for the specified disk"
14:39:33 <paul-carlton> info before doesn't cost much and saves you trying to abort something that has completed
14:39:51 <johnthetubaguy> kashyap: OK, so you can only get active jobs, not completed ones
14:39:59 <pkoniszewski> paul-carlton: or failed
14:40:05 <andrearosa> johnthetubaguy: I have to test it, I am a bit worried that this could take more time than expected, I wonder if we can go with this solution ATM and put in an improvement, if we find one is possible, later
14:40:16 <johnthetubaguy> so this feels like a follow on patch
14:40:18 <johnthetubaguy> lets move on
14:40:27 <andrearosa> johnthetubaguy: I agree
14:40:27 <kashyap> johnthetubaguy: That should also inform if the job is completed
14:40:34 <paul-carlton> abort doesn't care if it completed or failed, our task was to abort it, we can only do that if it is running
14:40:48 <PaulMurray> andrearosa, I think you have a way forward now
14:41:01 <PaulMurray> lets move on
14:41:08 <PaulMurray> unless you have any other questions
14:41:08 <andrearosa> yes thanks, basically I do not need to change anything, so please review it!
14:41:30 <johnthetubaguy> well, please add a follow up patch with ideas to test a fix, but yeah, lets not block on that for the merge
14:41:32 <andrearosa> and test it if you can, I did on a multinode devstack installation
14:41:51 <andrearosa> johnthetubaguy: ack
14:41:54 <johnthetubaguy> "check experimental" should run the live-migrate jobs
14:42:06 <PaulMurray> johnthetubaguy, it needs to abort during a migration
14:42:07 <johnthetubaguy> I mean it doesn't test the new stuff, but check if we broke anything
14:42:30 <PaulMurray> agreed - all should always use the CI job on all patches
14:42:40 <PaulMurray> we are the live migration subteam !
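The "info, abort, then info again on failure" idea paul-carlton and johnthetubaguy settle on above could look roughly like this. This is a sketch only: `dom` stands in for a libvirt virDomain exposing `jobInfo()` (whose first element is the job type, with libvirt's VIR_DOMAIN_JOB_NONE meaning no job) and `abortJob()`; the return labels and helper name are hypothetical, not the patch's actual code.

```python
# Sketch of the race-narrowing check discussed above. `dom` mimics a libvirt
# virDomain; JOB_NONE mirrors libvirt's VIR_DOMAIN_JOB_NONE constant.
# Return labels are illustrative, chosen for this example only.

JOB_NONE = 0

def try_abort(dom) -> str:
    """Abort a migration job, distinguishing 'job already gone' from errors."""
    if dom.jobInfo()[0] == JOB_NONE:
        return "no-job"  # nothing to abort: the migration already ended
    try:
        dom.abortJob()
        return "aborted"
    except Exception:
        # abort failed; re-check whether the job simply finished in the
        # window between the first jobInfo() and the abortJob() call
        if dom.jobInfo()[0] == JOB_NONE:
            return "finished-before-abort"
        # the job is still active, so this is a genuine libvirt failure
        return "abort-error"
```

The second `jobInfo()` after the failure is the key: if it reports no active job, the abort failed because the migration completed, not because of a real error, which is exactly the distinction the operators wanted in the logs.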
14:42:42 <andrearosa> ok
14:42:49 <PaulMurray> next was going to be pause-vm-during-live-migration
14:42:53 <kashyap> johnthetubaguy: Just to note, found this via (QEMU's query-events), so it'll emit a "BLOCK_JOB_COMPLETED" event for finished jobs
14:43:48 <PaulMurray> but pause is now complete except for the result consistency thing we discussed earlier
14:44:04 <PaulMurray> Making the live-migration API friendly
14:44:10 <PaulMurray> was discussed in the API meeting
14:44:18 <PaulMurray> eliqiao_, can you summarise ?
14:44:26 <eliqiao_> PaulMurray: sure
14:44:43 <eliqiao_> call for team to help review, starting from #link https://review.openstack.org/#/c/275585/
14:45:15 <eliqiao_> in the previous API meeting, we discussed not making host and block_migration optional.
14:45:38 <eliqiao_> #link for API meeting http://eavesdrop.openstack.org/meetings/nova_api/2016/nova_api.2016-02-23-12.00.html
14:45:48 <johnthetubaguy> so my take: there be many dragons, lets summarise all the use cases and issues by tomorrow, so we can find a path forward
14:46:24 <eliqiao_> yes, I will do a write down by tomorrow to find all cases.
14:46:44 <eliqiao_> especially while upgrading.
14:47:09 <johnthetubaguy> I am worried that we don't have enough supporting things around shared storage for the new API to be fully useful yet, and yeah, upgrade worries
14:48:13 <eliqiao_> We should prevent new API requests while doing an upgrade, this can be done in the rpc API layer and give a 400 bad request.
14:49:12 <PaulMurray> At the moment I am not sure what we are looking to get done this week
14:49:23 <eliqiao_> sdague suggested not to have optional parameters for host and block_migration and use 'auto' instead.
14:49:47 <PaulMurray> I think what we can do this week is part of what should be clear in the write up
14:49:55 <PaulMurray> is that what you were thinking johnthetubaguy
14:50:37 <eliqiao_> I will try to finish the write up by tomorrow, I will call alex_xu for help.
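sdague's suggestion mentioned above replaces optional parameters with an explicit 'auto' value in the request. A hypothetical sketch of how an API layer might normalise such a request (this is not the merged Nova API; the function and the idea of mapping 'auto' to "let Nova decide" are illustrative assumptions drawn from the discussion):

```python
# Hypothetical normalisation of a live-migration request body under the
# 'auto' proposal: host and block_migration are always present, and 'auto'
# delegates the decision to Nova. Not the final API; names are illustrative.

def normalize_live_migrate_request(body: dict) -> dict:
    """Map explicit 'auto' values to scheduler/driver-decided defaults."""
    req = dict(body)
    if req.get("host") == "auto":
        req["host"] = None             # let the scheduler pick a destination
    if req.get("block_migration") == "auto":
        req["block_migration"] = None  # let the driver detect shared storage
    return req
```

The advantage over optional parameters is that callers must say what they want, so an omitted field can be rejected outright instead of silently defaulting.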
14:50:53 <johnthetubaguy> concentrate on defining the problem for now
14:51:05 <eliqiao_> Besides, will update patches (compute api layer and REST API layer) for review.
14:51:22 <johnthetubaguy> its worth reviewing problems with some solutions
14:51:31 <johnthetubaguy> but don't worry about picking at this point
14:51:32 <eliqiao_> johnthetubaguy: sure, I will mention it in the write up.
14:52:08 <PaulMurray> we only have this week to do something in mitaka, so I want to understand when we should know if we are doing anything this week
14:52:17 <PaulMurray> because we all have to have read and understood
14:52:23 <PaulMurray> by that time
14:53:25 <PaulMurray> eliqiao_, where will you put the write up?
14:53:27 <eliqiao_> it's worth reading the original spec which was proposed by alex_xu. I don't get why my patch hasn't had reviews for about 6 weeks.
14:53:42 <eliqiao_> PaulMurray: perhaps I will send it to the ML.
14:53:51 <eliqiao_> PaulMurray: or do you prefer an etherpad?
14:54:32 <PaulMurray> eliqiao_, I don't have a preference - perhaps etherpad and announce on ML ?
14:54:43 <eliqiao_> PaulMurray: okay, I can do that.
14:54:47 <johnthetubaguy> so I think we need to chat on IRC this time tomorrow, and go through our options
14:54:49 <PaulMurray> you could create one now and give us the url
14:54:56 <johnthetubaguy> well, ideally earlier actually
14:55:24 <PaulMurray> how about 11am UTC
14:55:52 <PaulMurray> johnthetubaguy, ^^?
14:57:15 <johnthetubaguy> lets try for that
14:57:34 <mdbooth> Before we finish I'd like to plug this series: https://review.openstack.org/#/c/282432/
14:57:47 <mdbooth> It's part of libvirt storage pools
14:57:57 <mdbooth> I should probably tag that in the commits
14:57:59 <PaulMurray> #action eliqiao_ to do writeup for Making the live-migration API friendly in etherpad and we can talk about it at 11am UTC on IRC
14:58:16 <eliqiao_> PaulMurray: ack.
14:58:26 <johnthetubaguy> mdbooth: do we want to land any of that in mitaka?
14:58:29 <mdbooth> Looks like it needs a rebase, also there's a (genuine! not false positive) failure in the middle which I've just worked out and need to fix.
14:58:37 <mdbooth> johnthetubaguy: Ideally all of it, yes.
14:58:50 <mdbooth> But I still need to write the upgrade-related fix.
14:59:26 <johnthetubaguy> OK, that blueprint is marked as deferred currently, based on the midcycle discussion
14:59:28 <mdbooth> johnthetubaguy: The end of the series makes the changes to the driver which actually use it.
14:59:29 <PaulMurray> mdbooth, is that a bug fix?
14:59:37 <mdbooth> johnthetubaguy: Oh, fun.
14:59:52 <mdbooth> johnthetubaguy: Well, I haven't been deferring it :) I definitely won't finish it, though.
15:00:05 <mdbooth> However, if I'm not allowed to make any progress, I will never finish it.
15:00:28 <PaulMurray> mdbooth, storage pools was deferred - but understood there was something you wanted in mitaka to prepare
15:00:33 <mdbooth> PaulMurray: Is what a bug?
15:00:47 <PaulMurray> the title of the patch says "Fix...."
15:00:55 <PaulMurray> wondered if feature freeze applies :)
15:00:58 <mriedem> get out of my meeting room! :)
15:01:06 <PaulMurray> #endmeeting