14:02:43 <PaulMurray> #startmeeting Nova Live Migration
14:02:44 <openstack> Meeting started Tue Feb 23 14:02:43 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:02:45 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:02:47 <openstack> The meeting name has been set to 'nova_live_migration'
14:02:53 <andrearosa> hi
14:02:55 <mdbooth> o/
14:02:57 <eliqiao_> o/
14:03:04 <davidgiluk> o/
14:03:12 <pkoniszewski> o/
14:03:12 <jlanoux> o/
14:03:24 <PaulMurray> agenda here: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:03:44 <jlanoux> git diff
14:03:47 <jlanoux> sorry
14:03:53 <PaulMurray> #topic Priority reviews
14:04:14 <PaulMurray> Feature freeze is effectively the end of this week
14:04:17 <paul-carlton> o/
14:04:47 <PaulMurray> lets go through the few patches we have remaining as on agenda
14:05:00 <PaulMurray> starting with live-migration-progress-report
14:05:09 <PaulMurray> this was discussed in the API meeting earlier
14:05:22 <PaulMurray> https://review.openstack.org/#/c/258771
14:05:38 <pkoniszewski> can you briefly write here what's the conclusion?
14:05:40 <pkoniszewski> i couldn't attend
14:05:52 <PaulMurray> There is a new patchset needed
14:06:02 <PaulMurray> so it only shows in-progress migrations
14:06:29 <pkoniszewski> for both, index and get?
14:06:43 <johnthetubaguy> yes, for the first version
14:06:44 <PaulMurray> I have a question - does anyone know if a novaclient change has been done for this?
14:07:05 <pkoniszewski> okay, well, this is what we agreed on in spec
14:07:06 <pkoniszewski> https://review.openstack.org/#/c/281335/
14:07:12 <pkoniszewski> PaulMurray: ^^ python-novaclient
14:07:15 <pkoniszewski> it's almost good to go
14:07:23 <pkoniszewski> needs a small fix
14:08:03 <pkoniszewski> (Andrey's comment)
14:08:06 <PaulMurray> good - didn't see it on the etherpad - I'll add it
14:08:30 <PaulMurray> this needs to be done this week too - there are two other python-novaclient changes coming as well
14:08:47 <PaulMurray> johnthetubaguy, would it be best to chain all these novaclient changes
14:09:08 <pkoniszewski> PaulMurray: per Andrey's comment, we should chain it because of microversions
14:09:26 <johnthetubaguy> yeah, needs to be in a chain
14:09:36 <johnthetubaguy> it gets confusing when we have holes in the support
14:09:44 <johnthetubaguy> it should stop them all conflicting all the time, as well
14:10:42 <PaulMurray> so shall we go progress-report <- force-complete <- abort
14:10:50 <PaulMurray> (left first)
14:11:16 <johnthetubaguy> maybe, force and abort can be in either order, I guess
14:11:36 <PaulMurray> actually the first two are already there - yes, so only andrearosa to add his
14:11:40 <PaulMurray> on the end of the change
14:11:46 <PaulMurray> s/change/chain/
14:11:49 <andrearosa> ack
14:11:50 <pkoniszewski> PaulMurray: force is +W'ed already
14:12:08 <pkoniszewski> shouldn't we keep microversion order?
14:12:28 <PaulMurray> pkoniszewski, yes, of course - I'll catch up
14:12:46 <PaulMurray> andrearosa last and all is well I think
14:13:00 <PaulMurray> i.e.
abort comes last
14:13:36 <PaulMurray> For abort-migrations we have
14:13:38 <PaulMurray> https://review.openstack.org/#/c/277971
14:13:50 <PaulMurray> Again - this needs to be rebased on
14:13:58 <PaulMurray> the report-migrations patch
14:14:00 <andrearosa> I have a request and 2 open questions
14:14:07 <PaulMurray> go ahead
14:14:11 <andrearosa> the request is please help me in testing it
14:14:25 <andrearosa> the first open question is about
14:14:54 <andrearosa> the exception we want to raise in the API, if HTTPNotFound
14:15:05 <andrearosa> or HTTPBadRequest in the event the migration is not found
14:15:18 <andrearosa> I put in the code HTTPNotFound but it seems
14:15:23 <johnthetubaguy> if its +W already, then lets just leave it, its a shame, but thats OK.
14:15:43 <andrearosa> ppl prefer to have HTTPBadRequest to be consistent with what we
14:15:49 <andrearosa> have for force_complete
14:16:06 <johnthetubaguy> so we have an API guideline on that, let me dig it up
14:16:06 <andrearosa> https://review.openstack.org/#/c/277971/8/nova/api/openstack/compute/server_migrations.py
14:16:50 <johnthetubaguy> http://specs.openstack.org/openstack/api-wg/guidelines/http.html#failure-code-clarifications
14:17:02 <johnthetubaguy> so the thing that is missing is in the URL I assume?
14:17:10 <johnthetubaguy> rather than in the body of the request
14:17:20 <pkoniszewski> it's part of URL
14:17:29 <johnthetubaguy> right, so it should be 404 if its in the URL
14:17:37 <johnthetubaguy> its 400 if its in the body of the request
14:17:47 <pkoniszewski> okay, so i will fix force complete to be consistent
14:17:59 <pkoniszewski> because it is against guideline right now
14:18:11 <johnthetubaguy> oh dear, yes, it should be consistent
14:18:28 <johnthetubaguy> lets do that before it merges, ideally
14:18:48 <PaulMurray> force complete has merged - abort hasn't
14:18:54 <johnthetubaguy> oh
14:19:05 <johnthetubaguy> I would need to check if thats another microversion, technically...
14:19:22 <andrearosa> I think it is
14:19:31 <andrearosa> because we change a return code
14:21:05 <johnthetubaguy> so, I think users always expected 404 being possible, so we might get away without the bump
14:21:17 <johnthetubaguy> http://docs.openstack.org/developer/nova/api_microversion_dev.html#when-do-i-need-a-new-microversion
14:21:23 <johnthetubaguy> but thats kinda bending the rules, really
14:21:44 <PaulMurray> johnthetubaguy, shall we do the right thing
14:21:45 <pkoniszewski> I will submit a fix and let's try to agree in reviews, is that ok?
14:21:47 <johnthetubaguy> anyways, don't let this stop the more general conversation here
14:21:58 <johnthetubaguy> pkoniszewski: +1
14:22:02 <PaulMurray> andrearosa, did you have a second question?
14:22:09 <andrearosa> ok yes I did
14:22:25 <andrearosa> there is a discussion about returning a more detailed error
14:22:26 <PaulMurray> #action pkoniszewski to make force-complete return code consistent with guidelines
14:22:33 <andrearosa> from the libvirt driver
14:22:59 <andrearosa> let me find the right link with the discussion
14:23:15 <johnthetubaguy> is it not an asynchronous API, so thats not possible?
14:23:49 <andrearosa> sorry I didn't explain very well, to log a more detailed error
14:24:04 <andrearosa> to try to understand if the abort failed for a generic reason or if it failed
14:24:18 <andrearosa> because we tried to abort a job which was finished by the time we call the abort
14:24:40 <andrearosa> https://review.openstack.org/#/c/277971/7/nova/virt/libvirt/driver.py
14:24:57 <PaulMurray> andrearosa, that's a race condition between getting past the check in the api and actually trying to do it in the compute manager ?
14:24:58 <andrearosa> that is not the latest PS but it is the PS with the discussion
14:25:00 <johnthetubaguy> is the compute manager RPC a call or a cast?
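The 404-vs-400 rule johnthetubaguy cites from the API-WG guideline can be sketched as a tiny helper. This is purely illustrative (the function name and its string arguments are hypothetical, not Nova's actual code, which raises webob exceptions): an identifier that cannot be found in the URL path maps to 404, while a bad value inside the request body maps to 400.

```python
# Illustrative sketch of the API-WG failure-code guideline discussed above.
# Nova's real handlers raise webob.exc.HTTPNotFound / HTTPBadRequest; the
# helper name and 'url'/'body' labels here are hypothetical.

def error_status_for_missing_migration(param_location: str) -> int:
    """Pick the HTTP status when the referenced migration cannot be found."""
    if param_location == "url":
        # e.g. /servers/{server_id}/migrations/{migration_id}
        return 404  # Not Found: the resource named in the URL is missing
    if param_location == "body":
        return 400  # Bad Request: a value inside the request body is invalid
    raise ValueError(f"unknown parameter location: {param_location}")
```

Since the migration id for both abort and force-complete is part of the URL, both should return 404, which is why the already-merged force-complete behaviour was flagged as against the guideline.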
14:25:14 <andrearosa> it's async
14:25:28 <johnthetubaguy> so force completing a complete thing, is probably counted as a success, in my book
14:25:50 <johnthetubaguy> like removing something thats already been removed, the thing you asked for has been completed.
14:26:19 <johnthetubaguy> now if its cancelled, by the time you do force, then thats a different thing
14:26:32 <johnthetubaguy> I mean this is an admin operation, so lets focus on good logging, honestly
14:27:04 <PaulMurray> can we distinguish between a complete migration and one in error ?
14:27:13 <PaulMurray> or does error also count as complete
14:27:17 <andrearosa> we are open to race conditions in any case and what I do is to catch a general exception and reraise it and log it
14:27:18 <PaulMurray> ?
14:27:50 <andrearosa> when we abort a live-migration we mark the migration as cancelled
14:28:05 <andrearosa> if the live-migration doesn't complete because of an error
14:28:09 <andrearosa> is marked as in error
14:28:36 <johnthetubaguy> so does libvirt tell us what happened here?
14:28:50 <PaulMurray> that sounds like another race condition - so we need to go from libvirts view
14:29:10 <andrearosa> in my understanding it just tells us 0 good -1 error, pkoniszewski am I right?
14:29:30 <pkoniszewski> it's what libvirt docs are saying
14:29:33 <johnthetubaguy> in which case, we need to get libvirt changed to help us with this
14:29:33 <paul-carlton> nope, the migration status will be set based on the outcome detected by the monitor thread in the driver
14:30:03 <paul-carlton> so if it is running and someone aborts it then it will complete as cancelled
14:30:05 <johnthetubaguy> so thats racey too, can we not double check the job status after the exception has raised?
14:30:20 <johnthetubaguy> if the job actually completed, we have our answer
14:30:31 <paul-carlton> if it finishes before the abort gets to take action it shows up as ended
14:31:10 <johnthetubaguy> oh wait, this is abort not force complete... so technically the abort failed
14:31:15 <paul-carlton> the issue raised in the review was what is the outcome of the abort op and how should that be reported
14:31:50 <PaulMurray> is the migration marked as error if the abort fails ?
14:31:59 <andrearosa> johnthetubaguy: yes the abort fails if the job was not running
14:32:00 <paul-carlton> yes, if the abort tries to call jobAbort on a libvirt domain that has no job running you get an exception which is reported
14:32:08 <andrearosa> PaulMurray:
14:32:19 <andrearosa> nope if the failure was because
14:32:22 <paul-carlton> in the action-list and instance fault objects/tables
14:32:25 <andrearosa> the live-migration was finished
14:32:44 <andrearosa> the live-migration is reported as completed and the abort as failed
14:32:54 <PaulMurray> surely if the abort does not take effect it should not change the state of the migration ?
14:33:08 <andrearosa> correct
14:33:10 <PaulMurray> so if we get a failure we just log a message
14:33:12 <paul-carlton> PaulMurray, yes, that is what happens
14:33:47 <PaulMurray> so I'm not sure what the problem is ?
14:33:56 <paul-carlton> yes, it logs an error if the abort fails
14:34:11 <andrearosa> yes the discussion point is to see if we want to try to report the error and try to see if we can distinguish between a failed abort because the job was not running
14:34:18 <andrearosa> or because a general error in the libvirt
14:34:36 <andrearosa> at the moment we just log a generic error
14:34:49 <andrearosa> and it's up to the operators to dig and try to understand what was wrong
14:35:01 <PaulMurray> so back to the beginning : can we distinguish the cases ?
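The state semantics the team converges on above can be summarised in a small sketch (names are hypothetical, not Nova's actual code): a successful abort marks the migration 'cancelled', while an abort that does not take effect leaves the migration's own status untouched (whatever the monitor thread recorded: 'completed', 'error', ...) and only the abort operation itself is reported as failed, e.g. via the action list and instance fault tables.

```python
# Illustrative summary of the abort semantics agreed in the discussion.
# Function and status names are hypothetical, not taken from Nova's source.

def resolve_abort(migration_status: str, abort_succeeded: bool) -> dict:
    """Return the migration status and abort outcome after an abort attempt."""
    if abort_succeeded:
        # a running migration that was successfully aborted ends up cancelled
        return {"migration": "cancelled", "abort": "succeeded"}
    # the abort did not take effect: the migration keeps the status the
    # driver's monitor thread assigned it; only the abort op is marked failed
    return {"migration": migration_status, "abort": "failed"}
```

This captures PaulMurray's point that a failed abort must never change the state of the migration itself.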
14:35:13 <PaulMurray> without a race
14:35:41 <andrearosa> in my understanding no we can't but Timofey has a different opinion
14:35:43 <paul-carlton> yes
14:36:02 <PaulMurray> and he is not here
14:36:10 <andrearosa> right :(
14:36:16 <paul-carlton> we can, you need to call job abort directly
14:36:24 <paul-carlton> not via wrapper
14:36:27 <andrearosa> ?
14:36:45 <pkoniszewski> but it will still return 0/-1
14:36:49 <johnthetubaguy> so if we are talking about correctness of a log, lets ignore this, and get back to it
14:36:51 <paul-carlton> or call infoJob first and then not call abortJob if there is no job
14:37:09 <PaulMurray> johnthetubaguy, yes, this can be improved
14:37:11 <PaulMurray> later
14:37:15 <johnthetubaguy> if we call info job, after the exception, would that not tell us if it had completed?
14:37:20 <andrearosa> paul-carlton: even in that scenario we are still open to race
14:37:45 <johnthetubaguy> doing it after should avoid the race
14:37:58 <paul-carlton> so if it really matters, you can call info, abort, then if it fails info again
14:38:09 <johnthetubaguy> right, that
14:38:21 <johnthetubaguy> well, actually, just the one info, after it fails would be enough
14:38:26 <johnthetubaguy> I think
14:38:45 <paul-carlton> then you know for sure that if the second info says the job is active the abort failed to abort the migration rather than failed because migration was complete
14:38:51 <johnthetubaguy> right
14:39:15 <johnthetubaguy> andrearosa: would that work?
14:39:16 <kashyap> johnthetubaguy: Randomly chiming in, yes - job info means: "get active job information for the specified disk"
14:39:33 <paul-carlton> info before doesn't cost much and saves you trying to abort something that has completed
14:39:51 <johnthetubaguy> kashyap: OK, so you can only get active jobs, not completed ones
14:39:59 <pkoniszewski> paul-carlton: or failed
14:40:05 <andrearosa> johnthetubaguy: I have to test it, I am a bit worried that this could take more time than expected, I wonder if we can go with this solution ATM and put in an improvement, if we find one is possible, later
14:40:16 <johnthetubaguy> so this feels like a follow on patch
14:40:18 <johnthetubaguy> lets move on
14:40:27 <andrearosa> johnthetubaguy: I agree
14:40:27 <kashyap> johnthetubaguy: That should also inform if the job is completed
14:40:34 <paul-carlton> abort doesn't care if it completed or failed, our task was to abort it, we can only do that if it is running
14:40:48 <PaulMurray> andrearosa, I think you have a way forward now
14:41:01 <PaulMurray> lets move on
14:41:08 <PaulMurray> unless you have any other questions
14:41:08 <andrearosa> yes thanks, basically I do not need to change anything, so please review it!
14:41:30 <johnthetubaguy> well, please add a follow up patch with ideas to test a fix, but yeah, lets not block on that for the merge
14:41:32 <andrearosa> and test it if you can, I did on a multinode devstack installation
14:41:51 <andrearosa> johnthetubaguy: ack
14:41:54 <johnthetubaguy> "check experimental" should run the live-migrate jobs
14:42:06 <PaulMurray> johnthetubaguy, it needs to abort during a migration
14:42:07 <johnthetubaguy> I mean it doesn't test the new stuff, but check if we broke anything
14:42:30 <PaulMurray> agreed - all should always use the CI job on all patches
14:42:40 <PaulMurray> we are the live migration subteam !
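The "info, abort, then info again on failure" idea paul-carlton and johnthetubaguy settle on above could look roughly like this. This is a sketch only: `dom` stands in for a libvirt virDomain exposing `jobInfo()` (whose first element is the job type, with libvirt's VIR_DOMAIN_JOB_NONE meaning no job) and `abortJob()`; the return labels and helper name are hypothetical, not the patch's actual code.

```python
# Sketch of the race-narrowing check discussed above. `dom` mimics a libvirt
# virDomain; JOB_NONE mirrors libvirt's VIR_DOMAIN_JOB_NONE constant.
# Return labels are illustrative, chosen for this example only.

JOB_NONE = 0

def try_abort(dom) -> str:
    """Abort a migration job, distinguishing 'job already gone' from errors."""
    if dom.jobInfo()[0] == JOB_NONE:
        return "no-job"  # nothing to abort: the migration already ended
    try:
        dom.abortJob()
        return "aborted"
    except Exception:
        # abort failed; re-check whether the job simply finished in the
        # window between the first jobInfo() and the abortJob() call
        if dom.jobInfo()[0] == JOB_NONE:
            return "finished-before-abort"
        # the job is still active, so this is a genuine libvirt failure
        return "abort-error"
```

The second `jobInfo()` after the failure is the key: if it reports no active job, the abort failed because the migration completed, not because of a real error, which is exactly the distinction the operators wanted in the logs.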
14:42:42 <andrearosa> ok
14:42:49 <PaulMurray> next was going to be pause-vm-during-live-migration
14:42:53 <kashyap> johnthetubaguy: Just to note, found this via (QEMU's query-events), so it'll emit a "BLOCK_JOB_COMPLETED" event for finished jobs
14:43:48 <PaulMurray> but pause is now complete except for the result consistency thing we discussed earlier
14:44:04 <PaulMurray> Making the live-migration API friendly
14:44:10 <PaulMurray> was discussed in the API meeting
14:44:18 <PaulMurray> eliqiao_, can you summarise ?
14:44:26 <eliqiao_> PaulMurray: sure
14:44:43 <eliqiao_> call for team to help review, starting from #link https://review.openstack.org/#/c/275585/
14:45:15 <eliqiao_> in the previous API meeting, we discussed not making host and block_migration optional.
14:45:38 <eliqiao_> #link for API meeting http://eavesdrop.openstack.org/meetings/nova_api/2016/nova_api.2016-02-23-12.00.html
14:45:48 <johnthetubaguy> so my take: there be many dragons, lets summarise all the use cases and issues by tomorrow, so we can find a path forward
14:46:24 <eliqiao_> yes, I will do a write down by tomorrow to find all cases.
14:46:44 <eliqiao_> especially while upgrading.
14:47:09 <johnthetubaguy> I am worried that we don't have enough supporting things around shared storage for the new API to be fully useful yet, and yeah, upgrade worries
14:48:13 <eliqiao_> We should prevent new API requests while doing an upgrade, this can be done in the rpc API layer and give a 400 bad request.
14:49:12 <PaulMurray> At the moment I am not sure what we are looking to get done this week
14:49:23 <eliqiao_> sdague suggested not to have optional parameters for host and block_migration and use 'auto' instead.
14:49:47 <PaulMurray> I think what we can do this week is part of what should be clear in the write up
14:49:55 <PaulMurray> is that what you were thinking johnthetubaguy
14:50:37 <eliqiao_> I will try to finish the write up by tomorrow, I will call alex_xu for help.
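sdague's suggestion mentioned above replaces optional parameters with an explicit 'auto' value in the request. A hypothetical sketch of how an API layer might normalise such a request (this is not the merged Nova API; the function and the idea of mapping 'auto' to "let Nova decide" are illustrative assumptions drawn from the discussion):

```python
# Hypothetical normalisation of a live-migration request body under the
# 'auto' proposal: host and block_migration are always present, and 'auto'
# delegates the decision to Nova. Not the final API; names are illustrative.

def normalize_live_migrate_request(body: dict) -> dict:
    """Map explicit 'auto' values to scheduler/driver-decided defaults."""
    req = dict(body)
    if req.get("host") == "auto":
        req["host"] = None             # let the scheduler pick a destination
    if req.get("block_migration") == "auto":
        req["block_migration"] = None  # let the driver detect shared storage
    return req
```

The advantage over optional parameters is that callers must say what they want, so an omitted field can be rejected outright instead of silently defaulting.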
14:50:53 <johnthetubaguy> concentrate on defining the problem for now
14:51:05 <eliqiao_> Besides, will update patches (compute api layer and REST API layer) for review.
14:51:22 <johnthetubaguy> its worth reviewing problems with some solutions
14:51:31 <johnthetubaguy> but don't worry about picking at this point
14:51:32 <eliqiao_> johnthetubaguy: sure, I will mention it in the write up.
14:52:08 <PaulMurray> we only have this week to do something in mitaka, so I want to understand when we should know if we are doing anything this week
14:52:17 <PaulMurray> because we all have to have read and understood
14:52:23 <PaulMurray> by that time
14:53:25 <PaulMurray> eliqiao_, where will you put the write up?
14:53:27 <eliqiao_> it's worth reading the original spec which was proposed by alex_xu. I don't get why my patch hasn't had reviews for about 6 weeks.
14:53:42 <eliqiao_> PaulMurray: perhaps I will send it to the ML.
14:53:51 <eliqiao_> PaulMurray: or do you prefer an etherpad?
14:54:32 <PaulMurray> eliqiao_, I don't have a preference - perhaps etherpad and announce on ML ?
14:54:43 <eliqiao_> PaulMurray: okay, I can do that.
14:54:47 <johnthetubaguy> so I think we need to chat on IRC this time tomorrow, and go through our options
14:54:49 <PaulMurray> you could create one now and give us the url
14:54:56 <johnthetubaguy> well, ideally earlier actually
14:55:24 <PaulMurray> how about 11am UTC
14:55:52 <PaulMurray> johnthetubaguy, ^^?
14:57:15 <johnthetubaguy> lets try for that
14:57:34 <mdbooth> Before we finish I'd like to plug this series: https://review.openstack.org/#/c/282432/
14:57:47 <mdbooth> It's part of libvirt storage pools
14:57:57 <mdbooth> I should probably tag that in the commits
14:57:59 <PaulMurray> #action eliqiao_ to do writeup for Making the live-migration API friendly in etherpad and we can talk about it at 11am UTC on IRC
14:58:16 <eliqiao_> PaulMurray: ack.
14:58:26 <johnthetubaguy> mdbooth: do we want to land any of that in mitaka?
14:58:29 <mdbooth> Looks like it needs a rebase, also there's a (genuine! not false positive) failure in the middle which I've just worked out and need to fix.
14:58:37 <mdbooth> johnthetubaguy: Ideally all of it, yes.
14:58:50 <mdbooth> But I still need to write the upgrade-related fix.
14:59:26 <johnthetubaguy> OK, that blueprint is marked as deferred currently, based on the midcycle discussion
14:59:28 <mdbooth> johnthetubaguy: The end of the series makes the changes to the driver which actually use it.
14:59:29 <PaulMurray> mdbooth, is that a bug fix?
14:59:37 <mdbooth> johnthetubaguy: Oh, fun.
14:59:52 <mdbooth> johnthetubaguy: Well, I haven't been deferring it :) I definitely won't finish it, though.
15:00:05 <mdbooth> However, if I'm not allowed to make any progress, I will never finish it.
15:00:28 <PaulMurray> mdbooth, storage pools was deferred - but understood there was something you wanted in mitaka to prepare
15:00:33 <mdbooth> PaulMurray: Is what a bug?
15:00:47 <PaulMurray> the title of the patch says "Fix...."
15:00:55 <PaulMurray> wondered if feature freeze applies :)
15:00:58 <mriedem> get out of my meeting room! :)
15:01:06 <PaulMurray> #endmeeting