14:04:28 <PaulMurray> #startmeeting Nova Live Migration
14:04:29 <openstack> Meeting started Tue Dec 1 14:04:28 2015 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:04:29 <jlanoux> good meeting :)
14:04:30 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:04:33 <openstack> The meeting name has been set to 'nova_live_migration'
14:04:35 <pkoniszewski> o/
14:04:36 <mdbooth> o/
14:04:38 <eliqiao> o/
14:04:39 <jlanoux> o/
14:04:39 <davidgiluk> o/
14:04:39 <alex_xu> o/
14:04:41 <paul-carlton1> o/
14:04:49 <tdurakov> again?)
14:04:52 <andrearosa> hi
14:04:58 <PaulMurray> ok - sorry, got distracted for a moment
14:05:08 <PaulMurray> hi everyone
14:05:25 <PaulMurray> as usual the agenda is here https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:05:50 <shaohe_feng> hi PaulMurray
14:06:00 <PaulMurray> We need specs in this week I think so lets spend some time on that
14:06:09 <johnthetubaguy> +1
14:06:14 <PaulMurray> #topic Specs Status
14:06:18 <paul-carlton1> #link https://review.openstack.org/#/c/228828 is updated following API meeting, +1's pls
14:06:46 <PaulMurray> Thanks you paul-carlton1 is there anything to add
14:07:11 <pkoniszewski> so i will update pause spec to make it consistent
14:07:19 <eliqiao> sure and this one #link https://review.openstack.org/248472
14:07:24 <paul-carlton1> nope, hopefully this will be ok with all
14:07:43 <tdurakov> https://review.openstack.org/#/c/225910/ - need more time for this
14:07:59 <tdurakov> is it possible to get freeze-exception
14:08:03 <tdurakov> ?
14:08:07 <paul-carlton1> pkoniszewski, do we need to address the name of the command and POST body
14:08:12 <PaulMurray> #info The cancel migration spec was discussed in the Nova API meeting and agreed to use the migration sub-resource - pause sepc to be made consistent
14:08:40 <PaulMurray> tdurakov, we'll come to that one
14:08:41 <johnthetubaguy> tdurakov: I wouldn't bet on getting an exception, they are... exceptional
14:08:41 <pkoniszewski> paul-carlton1: ok, i will update everything needed
14:09:18 <johnthetubaguy> so yeah, the API meeting was productive, seem to have reached consensus on the API format
14:09:31 <tdurakov> johnthetubaguy, ok, will update it now
14:09:39 <alex_xu> I saw it only list live-migration migration, I saw it should list all migrations. But I didn't remember we nail down that in the api meeting...
14:09:48 <paul-carlton1> johnthetubaguy, so all specs need to be approved by Thursday?
14:09:50 <tdurakov> could we align things about this refactoring?
14:10:10 <paul-carlton1> Yep, it lists them all now
14:10:33 <alex_xu> paul-carlton1: ok, cool, will re-read your spec again
14:10:57 <shaohe_feng> alex_xu: yes. only list now.
14:11:24 <johnthetubaguy> paul-carlton1: ideally, yes
14:11:43 <alex_xu> one more note, we should return 400 if the migration didn't support cancel
14:12:06 <paul-carlton1> we can't ,we won't know till it gets to driver
14:12:11 <pkoniszewski> alex_xu: this is async, can we do that?
14:12:27 <alex_xu> paul-carlton1: I mean for cancel cold-migration and resize
14:12:54 <johnthetubaguy> pkoniszewski: I think you can for resizes you can't cancel, there are a few (will leave others to decide between conflict and bad request).
14:12:57 <paul-carlton1> The DELETE /server/id/migrstion/id will return 202 if the instance and active migration are present
14:13:28 <pkoniszewski> got it
14:13:42 <paul-carlton1> Then if it turns out it can't be done an instance action us created to report htis
14:13:45 <paul-carlton1> this
14:14:29 <alex_xu> true, that is a problem.
14:14:36 <johnthetubaguy> yeah, always need an instance action for the API call, but yeah, that can report the failure
14:15:15 <alex_xu> johnthetubaguy: oh, so this still a instance action?
14:15:22 <paul-carlton1> alex_xu, No reason why a cold migration or resize could not be cancelled in the future but we don't have the technology to do it today
14:15:50 <alex_xu> paul-carlton1: yes, but for now, we should tell user the cold migration and resize can't be cancelled
14:16:12 <paul-carlton1> so you can try to cancel anything, all the API call will do is check the migration is running
14:16:14 <johnthetubaguy> alex_xu: I think so, maybe thats where it gets sketchy?
14:16:44 <PaulMurray> does the migration record what sort of action initiated it?
14:16:50 <paul-carlton1> alex_xu, so you want a specific error response for those types of migration for now?
14:16:57 <tdurakov> PaulMurray, yes
14:17:05 <tdurakov> there is a type for migration
14:17:10 <alex_xu> johnthetubaguy: at least that is only way to report the virt driver didn't implement that action
14:17:33 <mdbooth> Incidentally, note that there is no constraint that there is only a single migration per instance
14:17:33 <alex_xu> paul-carlton1: that is part of API contract, I prefer 400
14:17:40 <eliqiao> alex_xu: hard code them in rpc_api?
14:18:06 <johnthetubaguy> alex_xu: right, its the best way I know
14:18:12 <alex_xu> eliqiao: yes
14:18:22 <alex_xu> eliqiao: but it isn't in rpc_api
14:18:33 <eliqiao> alex_xu: so where?
14:18:42 <eliqiao> alex_xu: compute_api/
14:18:44 <johnthetubaguy> mdbooth: I thought there ways, via the task_state checking?
14:18:48 <alex_xu> eliqiao: yep
14:18:49 <paul-carlton1> alex_xu, happy to do so but I think resize can be cancelled, it is just a live migration
14:19:02 <johnthetubaguy> s/there ways/there was/
14:19:27 <tdurakov> paul-carlton1, resize == cold migration not live
14:19:39 <johnthetubaguy> tdurakov: +1
14:19:41 <alex_xu> yea, resize == cold migration
14:20:04 <mdbooth> johnthetubaguy: If there is, it requires a ton of context and it would be hard to prove.
14:20:05 <shaohe_feng> paul-carlton1: why the cold migration can not cancel? the hypervisor does not support it?
14:20:07 <johnthetubaguy> there is a spec to do a live-resize, but thats a different story
14:20:10 <paul-carlton1> I thought it was implemented as a live migration with different flavor?
14:20:41 <tdurakov> paul-carlton1, but it is still cold one
14:20:43 <alex_xu> paul-carlton1: no, it's different
14:20:50 <paul-carlton1> is that spec approved for Mikata
14:20:55 <johnthetubaguy> mdbooth: I mean there is a race there, but thats the check: https://github.com/openstack/nova/blob/master/nova/compute/api.py#L3253
14:21:16 <alex_xu> shaohe_feng: that is future thing, we just say we didn't implement it yet
14:21:20 <mdbooth> johnthetubaguy: Ah, migrations are created in several places. For example, evacuations.
14:21:23 <PaulMurray> shaohe_feng, cold migration is changing due to storage pools (hopefully) so best to deal with it later
14:21:32 <PaulMurray> at least in libvirt
14:21:42 <johnthetubaguy> mdbooth: ah, we are talking at cross purposes then
14:21:46 <pkoniszewski> paul-carlton1: as far as i know it is not (hot resize)
14:22:07 <pkoniszewski> paul-carlton1: https://blueprints.launchpad.net/nova/+spec/hot-resize
14:22:16 <johnthetubaguy> paul-carlton1: the live-resize spec is not yet approved
14:22:25 <mdbooth> johnthetubaguy: No, I don't think so.
14:22:43 <PaulMurray> This is expanding beyond "cancel live migration" I would think it best to deal with live migration and then consider more later
14:22:47 <paul-carlton1> there can only be one migration of an instance active at any time, if not that is surely a bug?
14:23:00 <johnthetubaguy> it is, yes
14:23:05 <shaohe_feng> PaulMurray: who's storage pools? libvirt?
14:23:08 <johnthetubaguy> it is possible
14:23:17 <PaulMurray> shaohe_feng, yes
14:23:18 <mdbooth> paul-carlton1: Yes, that's definitely a bug.
14:24:19 <paul-carlton1> so we should fix that! Possibly something for https://review.openstack.org/#/c/225910/
14:25:03 <PaulMurray> lets leave that now - do we need to sort it out in this meeting or can you get something in the spec paul-carlton1
14:25:06 <johnthetubaguy> paul-carlton1: mdbooth: getting too distracted now, but you can create mutliple active ones (or used to be possible) when you reset state of VMs after killing nova-computes half way through a resize, I suspect there are still holes in there... but we should get back to the specs
14:25:15 <johnthetubaguy> lets be clear
14:25:21 <johnthetubaguy> the spec needs to agree the direction
14:25:29 <johnthetubaguy> it doesn't need to be a detailed implementation plan
14:25:35 <alex_xu> johnthetubaguy: +1
14:26:03 <johnthetubaguy> it should define the user experience, from an API point of view, where possible, but we can revise that, if it turns out we made bad assumptions
14:26:14 <johnthetubaguy> anyways, we should move forward...
14:26:22 <PaulMurray> ok - next
14:26:23 <PaulMurray> is
14:26:34 <PaulMurray> #link https://review.openstack.org/#/c/245543/
14:26:43 <PaulMurray> Making the live-migration API friendly
14:27:11 <PaulMurray> alex_xu, where are you with this?
14:27:24 <PaulMurray> thre is a +1 from mikal
14:27:25 <alex_xu> i updated some implement detail last week
14:27:46 <PaulMurray> sorry +2 from mikal
14:27:55 <PaulMurray> just needs reviews?
14:28:07 <alex_xu> yea, I think so
14:28:18 <PaulMurray> and a failing test
14:28:20 <alex_xu> I didn't any concern for now
14:28:30 <mdbooth> It's going to lose that +2 when the line limit is fixed, though :/
14:28:52 <tdurakov> mdbooth, i'm not sure it critical
14:28:58 <alex_xu> ah, yea, really didn't want to lose +2 at this moment
14:29:13 <PaulMurray> I'm sure it will come back - just tell mikal
14:29:14 <mdbooth> tdurakov: It's a bot. The bot doesn't care.
14:29:29 <tdurakov> alex_xu, if you update it latter you could lose both +2
14:29:29 <alex_xu> PaulMurray: ok, let me fix it
14:29:32 <tdurakov> no?
14:29:37 <PaulMurray> lets move on
14:29:41 <PaulMurray> #link https://review.openstack.org/#/c/225910/
14:29:43 <johnthetubaguy> having a +2 previously is a good excuse to an exception, just do the right thing
14:29:51 <PaulMurray> Split different live-migration types in nova
14:30:06 <tdurakov> need to get agreement for approach?
14:30:07 <PaulMurray> tdurakov, this had no update for a month
14:30:26 <tdurakov> PaulMurray, yep, because i worked on ci task
14:30:27 <johnthetubaguy> tdurakov, I will be honest, I am not sure I understand where this is going right now, or what the aim is here
14:30:56 <tdurakov> well, there are 2 approaches
14:30:59 <johnthetubaguy> it feels a bit libvirt specific right now
14:31:46 <tdurakov> need to pick one: tasks or just method split for that
14:32:03 <tdurakov> what is more preferable?
14:33:16 <johnthetubaguy> my preference is to correctly report what is happening via instance actions, as a starting point.
14:34:23 <tdurakov> johnthetubaguy, during this spec?
14:35:15 <PaulMurray> I'm not entirely sure how this fits with all the other work at the moment
14:35:37 <johnthetubaguy> tdurakov: I was really speaking generally there I guess
14:36:13 <PaulMurray> I see this cycle as: fix bugs + add necessary features
14:36:30 <tdurakov> PaulMurray, there is concern for this
14:36:34 <PaulMurray> Maybe some refactoring is necessary in that - but I don't see it as an objective in itself
14:36:49 * mdbooth would like to talk about shared storage detection and the libvirt storage pools spec if there's time, btw
14:36:52 <paul-carlton1> tdurakov, looks like an attempt to do some much needed re-factoring that aligns thinks for Task API
14:36:55 <PaulMurray> So can you motivate why it should be done now
14:37:09 <tdurakov> new features could make things harder to maintain
14:37:45 <mdbooth> +1
14:37:49 <johnthetubaguy> in terms of refactoring the libvirt drivers live-migrate code, yeah, that probably makes sense, the storage pools work might be the tipping point there
14:38:15 <johnthetubaguy> I was seeing this as changing the virt driver layer, and I can see that causing big problems in other drivers
14:38:30 <mdbooth> image cache + image handling + shared storage + migration (cold + live) are currently an incredibly tightly coupled mountain of technical debt.
14:39:10 <tdurakov> my point here to make some orchestration for live migration, and split all of this by underlying storage type
14:39:45 <johnthetubaguy> tdurakov: I think thats doing too much at once
14:40:03 <tdurakov> we could move it out of scope, but adding new features could increase technical debt on this
14:40:26 <johnthetubaguy> tdurakov: so this feels like it needs in person high bandwidth discussion to resolve
14:40:41 <PaulMurray> johnthetubaguy, agree
14:40:57 <PaulMurray> tdurakov, I am not against this - I just
14:41:00 <eliqiao> how about do the refactor one method one time?
14:41:11 <PaulMurray> want to see very focused work
14:41:14 <eliqiao> at least I don't like resize() in compute api.
14:41:27 <eliqiao> #link https://github.com/openstack/nova/blob/master/nova/compute/api.py#L2594
14:41:31 <mdbooth> PaulMurray: The problem with that focus is tight coupling.
14:41:42 <tdurakov> johnthetubaguy, well, so whats the plan for this then?
14:41:49 <johnthetubaguy> so, when we add features, like storage pools, we need to look at refactoring to keep the code maintained properly, if we end up splitting that out to a separate blueprint thats fine, but lets not get too bogged down
14:42:23 <PaulMurray> johnthetubaguy, I agree.
14:42:27 <mdbooth> e.g. It's going to be very hard to tightly focus the libvirt storage pools plan, when other modules assume not only a filesystem, but the specific layout.
14:42:57 <johnthetubaguy> tdurakov: we don't have any kind of consensus on the orchestration piece, for me the task focus has moved towards error reporting, and applying instance actions consistently, based on the summit discussion, although not sure I have seen those specs yet :(
14:42:58 <tdurakov> johnthetubaguy, ok, what about cleaning compute code without bp at all, as we'll have full test suite for that?
14:43:54 <PaulMurray> johnthetubaguy, is there a path for proposing cleanup work outside a spec?
14:44:01 <johnthetubaguy> tdurakov: possibly, but I think we should move on
14:44:26 <johnthetubaguy> PaulMurray: mostly they are just patches with a developer raised bug attached, we sometimes use blueprints to track them
14:44:46 <PaulMurray> ok - time is very tight
14:45:08 <mdbooth> I'd like eyes on this:
14:45:11 <mdbooth> https://review.openstack.org/#/c/248705/
14:45:23 <PaulMurray> there were three more by shaohe_feng but I am not sure they are likely to make it?
14:45:31 <PaulMurray> https://review.openstack.org/#/c/248358/4
14:45:37 <PaulMurray> https://review.openstack.org/#/c/248465
14:45:43 <PaulMurray> https://review.openstack.org/#/c/248472
14:46:06 <PaulMurray> I'm not sure we have time to discuss each in any deailt
14:46:12 <mdbooth> Specifically, I'm interested in comments on the proposal to detect shared storage by testing for the existence of the storage on the target. This would also be relevant for live migration.
14:46:31 <shaohe_feng> PaulMurray: the last one https://review.openstack.org/#/c/248472 are close to
14:46:55 <PaulMurray> report more live migration progress detail
14:46:56 <pkoniszewski> mdbooth: +1, i'm currently trying to detect shared storage somehow for ephemerals, root disks etc.
14:47:15 <PaulMurray> shaohe_feng, I think we just continue to look for reviews on that then
14:47:46 <PaulMurray> johnthetubaguy, the spec mdbooth listed is an update
14:47:47 <mdbooth> pkoniszewski: I'm hoping that with well thought out storage pool operations, you can just test that it exists, and use it if it does. No heuristics required.
14:48:01 <PaulMurray> johnthetubaguy, how does that sit with the spec freeze?>
14:48:32 <shaohe_feng> PaulMurray: OK. These days, many reviews on it.
14:48:37 <mdbooth> PaulMurray: I also haven't re-read the libvirt storage pools spec, but as I mentioned the other day, we can't completely move to libvirt storage pools
14:48:40 <johnthetubaguy> we can merge that one post freeze, its just tricky getting eyes on it.
14:48:47 <mdbooth> That inevitably means a change to that spec.
14:49:01 <tdurakov> mdbooth, is it about file-based shared storage?
14:49:13 <mdbooth> tdurakov: What is?
14:49:21 <tdurakov> shared storage detection
14:49:39 <mdbooth> tdurakov: No, it shouldn't be. The backend should be irrelevant.
14:50:22 <mdbooth> tdurakov: Although in practice it can't be lvm
14:50:34 <mdbooth> So shared filesystem, and rbd right now
14:50:45 <mdbooth> There are multiple races in rbd, btw
14:50:48 <mdbooth> currently
14:50:51 <paul-carlton1> mdbooth, what about the ploop issue
14:50:56 <tdurakov> mdbooth, there is different ways for this right now, for nfs, there is temp file creation
14:51:05 <tdurakov> for example
14:51:06 <mdbooth> That's why we can't move exclusively to libvirt storage pools
14:51:44 <mdbooth> We need a higher abstract. We'll use them for as much as possible, but not everything. Also, we need apis they don't currently provide, so even for libvirt storage pools we'll need per-backend extensions.
14:52:11 <tdurakov> last meeting johnthetubaguy said about spec for shared storage aggregates right now
14:52:25 <mdbooth> tdurakov: I'm talking about all the 'if is_shared_storage() all over the place'
14:52:45 <tdurakov> mdbooth, do we really need such detection?
14:52:56 <mdbooth> With my proposal that would go away, because you would know in advance what storage needs to be transferred.
14:53:07 <PaulMurray> mdbooth, so given johnthetubaguy's comment above, this one is not time critical (for freeze) so lets continue outside the meeting
14:53:18 <paul-carlton1> so it can't be default in Mitaka but we can implement it with caveats and work to address scenarios it can't support ?
14:53:23 <shaohe_feng> tdurakov: I remember some issues about nfs for storage pool. nfs will block is something wrong with it.
14:53:29 <johnthetubaguy> yeah, I might regret saying that, but anyways
14:53:38 <PaulMurray> :)
14:53:45 <PaulMurray> just quickly
14:53:49 <johnthetubaguy> mdbooth: I like the "does the correct thing at the correct time" part of that change
14:53:51 <PaulMurray> #topic CI status
14:53:56 <tdurakov> shaohe_feng, mdbooth let's discuss it in openstack-nova
14:53:56 <PaulMurray> any update tdurakov
14:54:21 <tdurakov> ci status: working on ceph right now, have some progress
14:54:45 <PaulMurray> ok - good - any help needed?
14:55:01 <tdurakov> PaulMurray, nope, i think
14:55:12 <PaulMurray> let us know if you do
14:55:26 <PaulMurray> #topic open reviews
14:55:32 <PaulMurray> two I want to mention
14:55:36 <PaulMurray> https://review.openstack.org/#/c/227278/
14:55:53 <PaulMurray> This needs a trivial update and then it can probably merge
14:55:59 <PaulMurray> it has an approved follow on patch
14:56:00 <johnthetubaguy> tdurakov: it would be good to get something voting soon, if we have additional coverage, but lets catch up about that separatly
14:56:01 <pkoniszewski> PaulMurray: i'm actually working on it
14:56:13 <PaulMurray> pkoniszewski, great - saw your comment
14:56:23 <PaulMurray> this is a big win to get in early in the cycle
14:56:25 <pkoniszewski> i have some concerns with NFS and shared devices
14:57:03 <PaulMurray> does it miss them
14:57:27 <PaulMurray> well - we can review
14:57:33 <PaulMurray> Another series:
14:57:35 <PaulMurray> https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bug/1417667,n,z
14:57:44 <tdurakov> pkoniszewski, feel free to 'check experimental' to trigger ci
14:58:03 <tdurakov> nfs is already there
14:58:04 <pkoniszewski> I'm not familiar with each nova option in nova.conf - can, for instance, ephemeral device be deployed on NFS and VM outside of this NFS?
14:58:20 <PaulMurray> This is ndipanov's
14:58:30 <pkoniszewski> so we can still trigger block live migration but some devices are shared through NFS
14:58:39 <PaulMurray> it is a series of 6 patches waiting for review if anyone has time
14:58:47 <tdurakov> pkoniszewski, i'm not sure it's possible
14:58:56 <alex_xu> PaulMurray: yea, that is important, we should have claim for live-migration
14:59:13 <pkoniszewski> if it is not then im ready with my change, just need to update tests
14:59:27 <paul-carlton1> other the obvious idea of writing a file on node a then looking for it on node b?
14:59:46 <PaulMurray> we are right at the end now - need to vacate the channel
14:59:53 <tdurakov> paul-carlton1, this check already in nova:)
14:59:55 <PaulMurray> sorry for the slightly late start
15:00:08 <PaulMurray> feel free to continue on the other side
15:00:11 <pkoniszewski> thanks everyone!
15:00:12 <PaulMurray> #endmeeting
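
Editor's note: the shared storage check mentioned near the end of the meeting ("writing a file on node a then looking for it on node b") can be illustrated with a minimal sketch. This is not Nova's actual code. The function names `create_marker`, `marker_visible` and `is_shared` are hypothetical, and the two directory arguments only stand in for each host's view of its instances directory. In a real deployment the two steps would run on different hosts, which this single-process sketch does not model.

```python
import os
import tempfile


def create_marker(instances_dir):
    """Create an empty, uniquely named marker file; return its file name.

    Hypothetical helper: stands in for the step run on the source host.
    """
    fd, path = tempfile.mkstemp(prefix="shared-check-", dir=instances_dir)
    os.close(fd)
    return os.path.basename(path)


def marker_visible(instances_dir, marker_name):
    """Return True if a marker created elsewhere is visible in this directory.

    Hypothetical helper: stands in for the step run on the destination host.
    """
    return os.path.exists(os.path.join(instances_dir, marker_name))


def is_shared(src_dir, dst_dir):
    """Assume storage is shared if a file written via src_dir shows up in dst_dir."""
    marker = create_marker(src_dir)
    try:
        return marker_visible(dst_dir, marker)
    finally:
        # Always remove the marker so the check leaves no residue behind.
        os.remove(os.path.join(src_dir, marker))
```

A check like this only answers the question for one directory at one moment, which is part of why the discussion above leans towards knowing in advance which storage needs to be transferred.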