14:04:28 <PaulMurray> #startmeeting Nova Live Migration
14:04:29 <openstack> Meeting started Tue Dec  1 14:04:28 2015 UTC and is due to finish in 60 minutes.  The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:04:29 <jlanoux> good meeting :)
14:04:30 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:04:33 <openstack> The meeting name has been set to 'nova_live_migration'
14:04:35 <pkoniszewski> o/
14:04:36 <mdbooth> o/
14:04:38 <eliqiao> o/
14:04:39 <jlanoux> o/
14:04:39 <davidgiluk> o/
14:04:39 <alex_xu> o/
14:04:41 <paul-carlton1> o/
14:04:49 <tdurakov> again?)
14:04:52 <andrearosa> hi
14:04:58 <PaulMurray> ok - sorry, got distracted for a moment
14:05:08 <PaulMurray> hi everyone
14:05:25 <PaulMurray> as usual the agenda is here https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:05:50 <shaohe_feng> hi PaulMurray
14:06:00 <PaulMurray> We need specs in this week I think so lets spend some time on that
14:06:09 <johnthetubaguy> +1
14:06:14 <PaulMurray> #topic Specs Status
14:06:18 <paul-carlton1> #link https://review.openstack.org/#/c/228828 is updated following API meeting, +1's pls
14:06:46 <PaulMurray> Thank you paul-carlton1, is there anything to add?
14:07:11 <pkoniszewski> so i will update pause spec to make it consistent
14:07:19 <eliqiao> sure and this one #link https://review.openstack.org/248472
14:07:24 <paul-carlton1> nope, hopefully this will be ok with all
14:07:43 <tdurakov> https://review.openstack.org/#/c/225910/ - need more time for this
14:07:59 <tdurakov> is it possible to get freeze-exception
14:08:03 <tdurakov> ?
14:08:07 <paul-carlton1> pkoniszewski, do we need to address the name of the command and POST body
14:08:12 <PaulMurray> #info The cancel migration spec was discussed in the Nova API meeting and agreed to use the migration sub-resource - pause spec to be made consistent
14:08:40 <PaulMurray> tdurakov, we'll come to that one
14:08:41 <johnthetubaguy> tdurakov: I wouldn't bet on getting an exception, they are... exceptional
14:08:41 <pkoniszewski> paul-carlton1: ok, i will update everything needed
14:09:18 <johnthetubaguy> so yeah, the API meeting was productive, seem to have reached consensus on the API format
14:09:31 <tdurakov> johnthetubaguy, ok, will update it now
14:09:39 <alex_xu> I saw it only lists live-migration migrations, I think it should list all migrations. But I don't remember whether we nailed that down in the api meeting...
14:09:48 <paul-carlton1> johnthetubaguy, so all specs need to be approved by Thursday?
14:09:50 <tdurakov> could we align things about this refactoring?
14:10:10 <paul-carlton1> Yep, it lists them all now
14:10:33 <alex_xu> paul-carlton1: ok, cool, will re-read your spec again
14:10:57 <shaohe_feng> alex_xu: yes.  only list now.
14:11:24 <johnthetubaguy> paul-carlton1: ideally, yes
14:11:43 <alex_xu> one more note, we should return 400 if the migration didn't support cancel
14:12:06 <paul-carlton1> we can't, we won't know till it gets to the driver
14:12:11 <pkoniszewski> alex_xu: this is async, can we do that?
14:12:27 <alex_xu> paul-carlton1: I mean for cancel cold-migration and resize
14:12:54 <johnthetubaguy> pkoniszewski: I think you can, for the resizes you can't cancel; there are a few (will leave others to decide between conflict and bad request).
14:12:57 <paul-carlton1> The DELETE /server/id/migration/id will return 202 if the instance and active migration are present
14:13:28 <pkoniszewski> got it
14:13:42 <paul-carlton1> Then if it turns out it can't be done an instance action is created to report this
14:14:29 <alex_xu> true, that is a problem.
14:14:36 <johnthetubaguy> yeah, always need an instance action for the API call, but yeah, that can report the failure
14:15:15 <alex_xu> johnthetubaguy: oh, so this is still an instance action?
14:15:22 <paul-carlton1> alex_xu, No reason why a cold migration or resize could not be cancelled in the future but we don't have the technology to do it today
14:15:50 <alex_xu> paul-carlton1: yes, but for now, we should tell user the cold migration and resize can't be cancelled
14:16:12 <paul-carlton1> so you can try to cancel anything, all the API call will do is check the migration is running
14:16:14 <johnthetubaguy> alex_xu: I think so, maybe thats where it gets sketchy?
14:16:44 <PaulMurray> does the migration record what sort of action initiated it?
14:16:50 <paul-carlton1> alex_xu, so you want a specific error response for those types of migration for now?
14:16:57 <tdurakov> PaulMurray, yes
14:17:05 <tdurakov> there is a type for migration
14:17:10 <alex_xu> johnthetubaguy: at least that is the only way to report that the virt driver didn't implement that action
14:17:33 <mdbooth> Incidentally, note that there is no constraint that there is only a single migration per instance
14:17:33 <alex_xu> paul-carlton1: that is part of API contract, I prefer 400
14:17:40 <eliqiao> alex_xu: hard code them in rpc_api?
14:18:06 <johnthetubaguy> alex_xu: right, its the best way I know
14:18:12 <alex_xu> eliqiao: yes
14:18:22 <alex_xu> eliqiao: but it isn't in rpc_api
14:18:33 <eliqiao> alex_xu: so where?
14:18:42 <eliqiao> alex_xu: compute_api?
14:18:44 <johnthetubaguy> mdbooth: I thought there ways, via the task_state checking?
14:18:48 <alex_xu> eliqiao: yep
14:18:49 <paul-carlton1> alex_xu, happy to do so but I think resize can be cancelled, it is just a live migration
14:19:02 <johnthetubaguy> s/there ways/there was/
14:19:27 <tdurakov> paul-carlton1, resize == cold migration not live
14:19:39 <johnthetubaguy> tdurakov: +1
14:19:41 <alex_xu> yea, resize == cold migration
14:20:04 <mdbooth> johnthetubaguy: If there is, it requires a ton of context and it would be hard to prove.
14:20:05 <shaohe_feng> paul-carlton1: why can't the cold migration be cancelled? does the hypervisor not support it?
14:20:07 <johnthetubaguy> there is a spec to do a live-resize, but thats a different story
14:20:10 <paul-carlton1> I thought it was implemented as a live migration with different flavor?
14:20:41 <tdurakov> paul-carlton1, but it is still cold one
14:20:43 <alex_xu> paul-carlton1: no, it's different
14:20:50 <paul-carlton1> is that spec approved for Mitaka?
14:20:55 <johnthetubaguy> mdbooth: I mean there is a race there, but thats the check: https://github.com/openstack/nova/blob/master/nova/compute/api.py#L3253
14:21:16 <alex_xu> shaohe_feng: that is future thing, we just say we didn't implement it yet
14:21:20 <mdbooth> johnthetubaguy: Ah, migrations are created in several places. For example, evacuations.
14:21:23 <PaulMurray> shaohe_feng, cold migration is changing due to storage pools (hopefully) so best to deal with it later
14:21:32 <PaulMurray> at least in libvirt
14:21:42 <johnthetubaguy> mdbooth: ah, we are talking at cross purposes then
14:21:46 <pkoniszewski> paul-carlton1: as far as i know it is not (hot resize)
14:22:07 <pkoniszewski> paul-carlton1: https://blueprints.launchpad.net/nova/+spec/hot-resize
14:22:16 <johnthetubaguy> paul-carlton1: the live-resize spec is not yet approved
14:22:25 <mdbooth> johnthetubaguy: No, I don't think so.
14:22:43 <PaulMurray> This is expanding beyond "cancel live migration" I would think it best to deal with live migration and then consider more later
14:22:47 <paul-carlton1> there can only be one migration of an instance active at any time, if not that is surely a bug?
14:23:00 <johnthetubaguy> it is, yes
14:23:05 <shaohe_feng> PaulMurray: whose storage pools? libvirt's?
14:23:08 <johnthetubaguy> it is possible
14:23:17 <PaulMurray> shaohe_feng, yes
14:23:18 <mdbooth> paul-carlton1: Yes, that's definitely a bug.
14:24:19 <paul-carlton1> so we should fix that! Possibly something for https://review.openstack.org/#/c/225910/
14:25:03 <PaulMurray> lets leave that now - do we need to sort it out in this meeting or can you get something in the spec paul-carlton1
14:25:06 <johnthetubaguy> paul-carlton1: mdbooth: getting too distracted now, but you can create multiple active ones (or used to be possible) when you reset state of VMs after killing nova-computes halfway through a resize, I suspect there are still holes in there... but we should get back to the specs
14:25:15 <johnthetubaguy> lets be clear
14:25:21 <johnthetubaguy> the spec needs to agree the direction
14:25:29 <johnthetubaguy> it doesn't need to be a detailed implementation plan
14:25:35 <alex_xu> johnthetubaguy: +1
14:26:03 <johnthetubaguy> it should define the user experience, from an API point of view, where possible, but we can revise that, if it turns out we made bad assumptions
14:26:14 <johnthetubaguy> anyways, we should move forward...
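A minimal Python sketch of the response codes discussed above for cancelling a migration, assuming alex_xu's preference for 400 on cold migration and resize. The Migration class, the cancel_response_code helper, the type and status strings, and the 404 case are hypothetical names and assumptions for illustration only; this is not Nova code, and the choice between 400 and 409 was left open in the meeting.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: migration types that cannot be cancelled today (cold migration and resize).
NON_CANCELLABLE_TYPES = frozenset({"migration", "resize"})


@dataclass
class Migration:
    """Hypothetical stand-in for a migration record."""
    instance_uuid: str
    migration_type: str  # e.g. "live-migration", "migration" (cold), "resize"
    status: str          # e.g. "running"


def cancel_response_code(instance_uuid: str, migration: Optional[Migration]) -> int:
    """Return the HTTP status the API layer would answer with (sketch only)."""
    # Assumption: no active migration for this instance is reported as 404.
    if (migration is None
            or migration.instance_uuid != instance_uuid
            or migration.status != "running"):
        return 404
    # Known up front from the migration type, so it can be rejected synchronously.
    if migration.migration_type in NON_CANCELLABLE_TYPES:
        return 400
    # Accepted; a later failure in the driver is reported via an instance action.
    return 202
```

With this shape the API only rejects what it can know from the migration record; anything the virt driver cannot do is still reported asynchronously.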
14:26:22 <PaulMurray> ok - next
14:26:23 <PaulMurray> is
14:26:34 <PaulMurray> #link https://review.openstack.org/#/c/245543/
14:26:43 <PaulMurray> Making the live-migration API friendly
14:27:11 <PaulMurray> alex_xu, where are you with this?
14:27:24 <PaulMurray> there is a +1 from mikal
14:27:25 <alex_xu> I updated some implementation details last week
14:27:46 <PaulMurray> sorry +2 from mikal
14:27:55 <PaulMurray> just needs reviews?
14:28:07 <alex_xu> yea, I think so
14:28:18 <PaulMurray> and a failing test
14:28:20 <alex_xu> I don't have any concerns for now
14:28:30 <mdbooth> It's going to lose that +2 when the line limit is fixed, though :/
14:28:52 <tdurakov> mdbooth, i'm not sure it's critical
14:28:58 <alex_xu> ah, yea, really didn't want to lose +2 at this moment
14:29:13 <PaulMurray> I'm sure it will come back - just tell mikal
14:29:14 <mdbooth> tdurakov: It's a bot. The bot doesn't care.
14:29:29 <tdurakov> alex_xu, if you update it later you could lose both +2
14:29:29 <alex_xu> PaulMurray: ok, let me fix it
14:29:32 <tdurakov> no?
14:29:37 <PaulMurray> lets move on
14:29:41 <PaulMurray> #link https://review.openstack.org/#/c/225910/
14:29:43 <johnthetubaguy> having a +2 previously is a good excuse for an exception, just do the right thing
14:29:51 <PaulMurray> Split different live-migration types in nova
14:30:06 <tdurakov> need to get agreement for approach?
14:30:07 <PaulMurray> tdurakov, this had no update for a month
14:30:26 <tdurakov> PaulMurray, yep, because i worked on ci task
14:30:27 <johnthetubaguy> tdurakov, I will be honest, I am not sure I understand where this is going right now, or what the aim is here
14:30:56 <tdurakov> well, there are 2 approaches
14:30:59 <johnthetubaguy> it feels a bit libvirt specific right now
14:31:46 <tdurakov> need to pick one: tasks or just method split for that
14:32:03 <tdurakov> what is more preferable?
14:33:16 <johnthetubaguy> my preference is to correctly report what is happening via instance actions, as a starting point.
14:34:23 <tdurakov> johnthetubaguy, during this spec?
14:35:15 <PaulMurray> I'm not entirely sure how this fits with all the other work at the moment
14:35:37 <johnthetubaguy> tdurakov: I was really speaking generally there I guess
14:36:13 <PaulMurray> I see this cycle as: fix bugs + add necessary features
14:36:30 <tdurakov> PaulMurray, there is concern for this
14:36:34 <PaulMurray> Maybe some refactoring is necessary in that - but I don't see it as an objective in itself
14:36:49 * mdbooth would like to talk about shared storage detection and the libvirt storage pools spec if there's time, btw
14:36:52 <paul-carlton1> tdurakov, looks like an attempt to do some much needed re-factoring that aligns things for the Task API
14:36:55 <PaulMurray> So can you motivate why it should be done now
14:37:09 <tdurakov> new features could make things harder to maintain
14:37:45 <mdbooth> +1
14:37:49 <johnthetubaguy> in terms of refactoring the libvirt driver's live-migrate code, yeah, that probably makes sense, the storage pools work might be the tipping point there
14:38:15 <johnthetubaguy> I was seeing this as changing the virt driver layer, and I can see that causing big problems in other drivers
14:38:30 <mdbooth> image cache + image handling + shared storage + migration (cold + live) are currently an incredibly tightly coupled mountain of technical debt.
14:39:10 <tdurakov> my point here is to make some orchestration for live migration, and split all of this by underlying storage type
14:39:45 <johnthetubaguy> tdurakov: I think that's doing too much at once
14:40:03 <tdurakov> we could move it out of scope, but adding new features could increase technical debt on this
14:40:26 <johnthetubaguy> tdurakov: so this feels like it needs in person high bandwidth discussion to resolve
14:40:41 <PaulMurray> johnthetubaguy, agree
14:40:57 <PaulMurray> tdurakov, I am not against this - I just
14:41:00 <eliqiao> how about doing the refactor one method at a time?
14:41:11 <PaulMurray> want to see very focused work
14:41:14 <eliqiao> at least I don't like resize() in compute api.
14:41:27 <eliqiao> #link https://github.com/openstack/nova/blob/master/nova/compute/api.py#L2594
14:41:31 <mdbooth> PaulMurray: The problem with that focus is tight coupling.
14:41:42 <tdurakov> johnthetubaguy, well, so whats the plan for this then?
14:41:49 <johnthetubaguy> so, when we add features, like storage pools, we need to look at refactoring to keep the code maintained properly, if we end up splitting that out to a separate blueprint that's fine, but let's not get too bogged down
14:42:23 <PaulMurray> johnthetubaguy, I agree.
14:42:27 <mdbooth> e.g. It's going to be very hard to tightly focus the libvirt storage pools plan, when other modules assume not only a filesystem, but the specific layout.
14:42:57 <johnthetubaguy> tdurakov: we don't have any kind of consensus on the orchestration piece, for me the task focus has moved towards error reporting, and applying instance actions consistently, based on the summit discussion, although not sure I have seen those specs yet :(
14:42:58 <tdurakov> johnthetubaguy, ok, what about cleaning compute code without bp at all, as we'll have full test suite for that?
14:43:54 <PaulMurray> johnthetubaguy, is there a path for proposing cleanup work outside a spec?
14:44:01 <johnthetubaguy> tdurakov: possibly, but I think we should move on
14:44:26 <johnthetubaguy> PaulMurray: mostly they are just patches with a developer raised bug attached, we sometimes use blueprints to track them
14:44:46 <PaulMurray> ok - time is very tight
14:45:08 <mdbooth> I'd like eyes on this:
14:45:11 <mdbooth> https://review.openstack.org/#/c/248705/
14:45:23 <PaulMurray> there were three more by shaohe_feng but I am not sure they are likely to make it?
14:45:31 <PaulMurray> https://review.openstack.org/#/c/248358/4
14:45:37 <PaulMurray> https://review.openstack.org/#/c/248465
14:45:43 <PaulMurray> https://review.openstack.org/#/c/248472
14:46:06 <PaulMurray> I'm not sure we have time to discuss each in any detail
14:46:12 <mdbooth> Specifically, I'm interested in comments on the proposal to detect shared storage by testing for the existence of the storage on the target. This would also be relevant for live migration.
14:46:31 <shaohe_feng> PaulMurray:  the last one https://review.openstack.org/#/c/248472 are close to
14:46:55 <PaulMurray> report more live migration progress detail
14:46:56 <pkoniszewski> mdbooth: +1, i'm currently trying to detect shared storage somehow for ephemerals, root disks etc.
14:47:15 <PaulMurray> shaohe_feng, I think we just continue to look for reviews on that then
14:47:46 <PaulMurray> johnthetubaguy, the spec mdbooth listed is an update
14:47:47 <mdbooth> pkoniszewski: I'm hoping that with well thought out storage pool operations, you can just test that it exists, and use it if it does. No heuristics required.
14:48:01 <PaulMurray> johnthetubaguy, how does that sit with the spec freeze?
14:48:32 <shaohe_feng> PaulMurray: OK. These  days, many reviews on it.
14:48:37 <mdbooth> PaulMurray: I also haven't re-read the libvirt storage pools spec, but as I mentioned the other day, we can't completely move to libvirt storage pools
14:48:40 <johnthetubaguy> we can merge that one post freeze, its just tricky getting eyes on it.
14:48:47 <mdbooth> That inevitably means a change to that spec.
14:49:01 <tdurakov> mdbooth, is it about file-based shared storage?
14:49:13 <mdbooth> tdurakov: What is?
14:49:21 <tdurakov> shared storage detection
14:49:39 <mdbooth> tdurakov: No, it shouldn't be. The backend should be irrelevant.
14:50:22 <mdbooth> tdurakov: Although in practice it can't be lvm
14:50:34 <mdbooth> So shared filesystem, and rbd right now
14:50:45 <mdbooth> There are multiple races in rbd, btw
14:50:48 <mdbooth> currently
14:50:51 <paul-carlton1> mdbooth, what about the ploop issue
14:50:56 <tdurakov> mdbooth, there are different ways for this right now, for nfs, there is temp file creation
14:51:05 <tdurakov> for example
14:51:06 <mdbooth> That's why we can't move exclusively to libvirt storage pools
14:51:44 <mdbooth> We need a higher abstraction. We'll use them for as much as possible, but not everything. Also, we need apis they don't currently provide, so even for libvirt storage pools we'll need per-backend extensions.
14:52:11 <tdurakov> last meeting johnthetubaguy said about spec for shared storage aggregates right now
14:52:25 <mdbooth> tdurakov: I'm talking about all the 'if is_shared_storage()' all over the place
14:52:45 <tdurakov> mdbooth, do we really need such detection?
14:52:56 <mdbooth> With my proposal that would go away, because you would know in advance what storage needs to be transferred.
14:53:07 <PaulMurray> mdbooth, so given johnthetubaguy's comment above, this one is not time critical (for freeze) so lets continue outside the meeting
14:53:18 <paul-carlton1> so it can't be default in Mitaka but we can implement it with caveats and work to address scenarios it can't support ?
14:53:23 <shaohe_feng> tdurakov: I remember some issues about nfs for storage pool. nfs will block if something is wrong with it.
14:53:29 <johnthetubaguy> yeah, I might regret saying that, but anyways
14:53:38 <PaulMurray> :)
14:53:45 <PaulMurray> just quickly
14:53:49 <johnthetubaguy> mdbooth: I like the "does the correct thing at the correct time" part of that change
14:53:51 <PaulMurray> #topic CI status
14:53:56 <tdurakov> shaohe_feng, mdbooth let's discuss it in openstack-nova
14:53:56 <PaulMurray> any update tdurakov
14:54:21 <tdurakov> ci status: working on ceph right now, have some progress
14:54:45 <PaulMurray> ok - good - any help needed?
14:55:01 <tdurakov> PaulMurray, nope, i think
14:55:12 <PaulMurray> let us know if you do
14:55:26 <PaulMurray> #topic open reviews
14:55:32 <PaulMurray> two I want to mention
14:55:36 <PaulMurray> https://review.openstack.org/#/c/227278/
14:55:53 <PaulMurray> This needs a trivial update and then it can probably merge
14:55:59 <PaulMurray> it has an approved follow on patch
14:56:00 <johnthetubaguy> tdurakov: it would be good to get something voting soon, if we have additional coverage, but let's catch up about that separately
14:56:01 <pkoniszewski> PaulMurray: i'm actually working on it
14:56:13 <PaulMurray> pkoniszewski, great - saw your comment
14:56:23 <PaulMurray> this is a big win to get in early in the cycle
14:56:25 <pkoniszewski> i have some concerns with NFS and shared devices
14:57:03 <PaulMurray> does it miss them
14:57:27 <PaulMurray> well - we can review
14:57:33 <PaulMurray> Another series:
14:57:35 <PaulMurray> https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bug/1417667,n,z
14:57:44 <tdurakov> pkoniszewski, feel free to 'check experimental' to trigger ci
14:58:03 <tdurakov> nfs is already there
14:58:04 <pkoniszewski> I'm not familiar with each nova option in nova.conf - can, for instance, ephemeral device be deployed on NFS and VM outside of this NFS?
14:58:20 <PaulMurray> This is ndipanov's
14:58:30 <pkoniszewski> so we can still trigger block live migration but some devices are shared through NFS
14:58:39 <PaulMurray> it is a series of 6 patches waiting for review if anyone has time
14:58:47 <tdurakov> pkoniszewski, i'm not sure it's possible
14:58:56 <alex_xu> PaulMurray: yea, that is important, we should have claim for live-migration
14:59:13 <pkoniszewski> if it is not then im ready with my change, just need to update tests
14:59:27 <paul-carlton1> other than the obvious idea of writing a file on node a then looking for it on node b?
14:59:46 <PaulMurray> we are right at the end now - need to vacate the channel
14:59:53 <tdurakov> paul-carlton1, this check is already in nova :)
14:59:55 <PaulMurray> sorry for the slightly late start
15:00:08 <PaulMurray> feel free to continue on the other side
15:00:11 <pkoniszewski> thanks everyone!
15:00:12 <PaulMurray> #endmeeting
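
The "write a file on node a, then look for it on node b" check that paul-carlton1 and tdurakov mention near the end can be sketched roughly as below. This is an illustrative Python sketch only, not the check that exists in Nova: the function names are hypothetical, and it assumes both hosts see the instance directory at a path they each pass in. The meeting also noted this style of check is file-based and does not cover every backend.

```python
import os
import tempfile


def create_marker(instance_dir: str) -> str:
    """Run on the source host: create a uniquely named marker file, return its name."""
    fd, path = tempfile.mkstemp(dir=instance_dir, prefix="shared-check-")
    os.close(fd)
    return os.path.basename(path)


def marker_visible(instance_dir: str, marker_name: str) -> bool:
    """Run on the destination host: the storage is shared only if the marker shows up."""
    return os.path.exists(os.path.join(instance_dir, marker_name))


def remove_marker(instance_dir: str, marker_name: str) -> None:
    """Run on the source host afterwards so the marker does not linger."""
    os.unlink(os.path.join(instance_dir, marker_name))
```

The unique name from mkstemp avoids mistaking a leftover file from an earlier check for a fresh marker.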