14:00:28 <PaulMurray> #startmeeting Nova Live Migration
14:00:29 <openstack> Meeting started Tue Nov 24 14:00:28 2015 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:30 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:33 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:38 <shaohe_feng> Hi PaulMurray
14:00:52 <alex_xu> o/
14:00:54 <mdbooth> o/
14:00:58 <andrearosa> hi
14:01:00 <PaulMurray> Hi, who's here for live migration
14:01:00 <tdurakov> o/
14:01:06 <jlanoux_> o/
14:01:08 <davidgiluk> hi
14:01:29 <yuntongjin> Hi PaulMurray
14:01:37 <PaulMurray> hi yuntongjin
14:01:41 <pkoniszewski> o/
14:01:58 <PaulMurray> I'll wait a minute as usual to give people a chance to join
14:02:05 <shaohe_feng> yuntongjin: are you here?
14:02:58 <PaulMurray> ok - let's get started
14:03:00 <shaohe_feng> PaulMurray: it does not matter. I work with yuntongjin. Anything, I can convey to him
14:03:23 <PaulMurray> There is an agenda here: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:03:24 * davidgiluk will try to be here each week - I work on QEMU migration at RH
14:03:41 <PaulMurray> davidgiluk, thank you, that will be helpful
14:03:43 <yuntongjin> shaohe_feng: yes, i'm here
14:04:00 <PaulMurray> #topic Specs status
14:04:16 <PaulMurray> We are getting close to the spec cutoff
14:04:42 <PaulMurray> johnthetubaguy mentioned the other day that the cutoff is intended to be there for priority specs too
14:05:06 <mdbooth> https://review.openstack.org/#/c/248705/
14:05:06 <PaulMurray> most are there
14:05:15 <mdbooth> That has a constructive -1 from danpb
14:05:24 <mdbooth> I need to address those comments
14:05:31 <eliqiao> hi
14:05:58 <PaulMurray> mdbooth, yes, saw that - do you want to discuss something about it?
14:06:10 <shaohe_feng> PaulMurray: I'd like to talk about auto converge. https://review.openstack.org/#/c/248358/
14:06:25 <shaohe_feng> there are 2 things.
14:06:52 <shaohe_feng> 1. we decided to set auto converge as the default value, right?
14:06:52 <PaulMurray> shaohe_feng, hold on a moment
14:07:02 <shaohe_feng> PaulMurray: OK.
14:07:17 <mdbooth> Well, the meat of it is in the spec. I'd only say that we focussed on just the transfer mechanism last time. However, when I looked at the code to flesh out the detail, it became clear to me that the much harder problem was getting the code flow right.
14:08:02 <PaulMurray> mdbooth, do you have a plan together for that?
14:08:13 <mdbooth> PaulMurray: Yup. It's in that spec :)
14:08:30 <PaulMurray> so we just need to review it then?
14:08:36 <mdbooth> So, I agree with everything which was approved originally.
14:08:52 <mdbooth> I just think it glosses over some pretty hard stuff which we need to solve before we can do that.
14:09:08 <mdbooth> PaulMurray: Yup.
14:09:54 <PaulMurray> mdbooth, ok, so let's see how it goes - I will certainly go over it today, but may not be best placed to judge
14:10:31 <PaulMurray> shaohe_feng, we will come to your spec in a minute - there was one on the agenda I want to go to first (because it was on the agenda)
14:10:31 <pkoniszewski> mdbooth: I will also take a look
14:10:44 <PaulMurray> https://review.openstack.org/#/c/245543/
14:11:00 <PaulMurray> alex_xu, are you there?
14:11:09 <alex_xu> PaulMurray: hi, I'm here
14:11:19 <PaulMurray> I think that is yours
14:11:49 <alex_xu> Actually I already brought this to the api meeting, and got agreement that we can remove the disk_over_commit flag directly with a microversion, without deprecating first
14:11:52 <alex_xu> PaulMurray: thanks
14:12:08 <alex_xu> next thing I want to ask is what the API behaviour is after removing disk_over_commit
14:12:26 <pkoniszewski> i agree with this one, but what about the block migration flag?
14:12:32 <alex_xu> do we stop checking the disk size entirely, or keep checking the virtual_size as with disk_over_commit=True?
14:13:23 <tdurakov> i'd prefer the second
14:13:24 <alex_xu> pkoniszewski: the proposal is to make the block migration flag optional, with a default value of None; then nova will do the right thing based on whether the storage is shared or not
14:13:46 <alex_xu> Ideally the resource tracker should check the disk usage as we do in other actions, but that should be another bug fix.
14:14:01 <PaulMurray> mdbooth, did you see the check alex_xu is talking about when you were looking at the code for your spec
14:14:06 * mdbooth would strongly prefer that. We can update the scheduler to be shared-storage aware separately.
14:14:14 <mdbooth> PaulMurray: Which one specifically?
14:14:21 <alex_xu> That is a reason I began to think that after removing disk_over_commit we remove the disk size check totally, as apart from the libvirt driver, no other driver checks the disk size
14:14:38 <mdbooth> My new spec wants to do away with the current shared storage detection.
14:14:43 <johnthetubaguy> yeah, I like stopping the current check, and making the scheduler/resource tracker do the right thing instead
14:14:45 <pkoniszewski> tdurakov: I thought that your spec (split LM code) would remove the block migration flag
14:15:07 <mdbooth> It will instead just check on the dest if the storage already exists. If it does, it assumes it's because of shared storage.
14:15:10 <tdurakov> pkoniszewski, nope, i'm not going to touch the rest api
14:15:18 <pkoniszewski> ok, that makes sense then
14:15:36 <PaulMurray> mdbooth, I think the check is there because libvirt creates a file at the disk size
14:15:49 <tdurakov> what about smth like aggregates for shared storages?
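[Editor's note: mdbooth's proposed detection above — check on the destination whether the instance's storage already exists and, if so, assume shared storage — can be sketched roughly as below. This is an illustrative sketch, not Nova's actual code; the function name and paths are invented.]

```python
import os
import tempfile

def looks_like_shared_storage(instance_path):
    """Hypothetical dest-side check: if the instance's directory is
    already visible on the destination host before anything has been
    copied, assume it lives on shared storage."""
    return os.path.isdir(instance_path)

# A directory already present on the "destination" reads as shared storage...
shared_dir = tempfile.mkdtemp()
on_shared = looks_like_shared_storage(shared_dir)
# ...while a path that does not exist yet implies block migration is needed.
on_block = looks_like_shared_storage(os.path.join(shared_dir, 'does-not-exist'))
```

The appeal of this approach, per the discussion, is that it replaces the current distributed shared-storage detection with a single local test on the destination.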
14:15:55 <PaulMurray> mdbooth, not sure if that is just to check shared storage or is part of the copy
14:16:11 <PaulMurray> i was thinking about your comment to do with sparse copy
14:16:32 <johnthetubaguy> tdurakov: shared storage is going to be modelled by the resource provider pools stuff soon, AFAIK
14:16:52 <tdurakov> johnthetubaguy, sounds good, any spec for that?
14:16:55 <alex_xu> johnthetubaguy: +1
14:17:25 <johnthetubaguy> tdurakov: jaypipes has one somewhere, I think it's this one: https://review.openstack.org/#/c/225546/
14:17:53 <jaypipes> johnthetubaguy: pushing up the pci-generate-stats one first, then a new rev on that.
14:18:02 <jaypipes> johnthetubaguy: meetings, meetings, meetings...
14:18:15 <jaypipes> johnthetubaguy: will definitely have both BPs pushed again today.
14:18:50 <PaulMurray> mdbooth, did any of that answer your question? I got lost in the discussion
14:19:32 <tdurakov> https://review.openstack.org/#/c/225910/ - live-migration refactoring, need some time to update this, comments are welcome
14:19:39 <mdbooth> PaulMurray: I'm also lost :) Not sure which specific check we're talking about. The shared storage detection is kinda distributed, which is a large part of its problem.
14:20:14 <PaulMurray> mdbooth, never mind - I might be confusing things
14:20:35 <alex_xu> yea...I'm also lost...the shared storage detection is the next step after we have resource providers. So are people ok with the current proposal?
14:20:40 <PaulMurray> alex_xu, did you get what you need
14:21:00 <mdbooth> alex_xu: Please look at my spec, though, because I want to do shared storage detection quite differently.
14:21:23 <PaulMurray> alex_xu, I think I saw some people agree
14:21:25 <mdbooth> If I've missed something obvious, it would be good to know about that sooner.
14:21:40 <alex_xu> mdbooth: ok, no problem, I will take a look
14:21:44 <PaulMurray> Does anyone disagree with what alex_xu proposed?
14:22:14 <PaulMurray> I.e.: remove the disk size check?
14:22:28 <alex_xu> yea
14:22:41 <alex_xu> if no response, I guess it means agreement :)
14:22:46 <eliqiao> seems it depends on mdbooth for the shared storage part?
14:22:48 <PaulMurray> I think so
14:22:48 <johnthetubaguy> mdbooth: sorry, I lost the link, which is your new spec?
14:23:02 <tdurakov> mdbooth, starred spec too https://review.openstack.org/#/c/248705/
14:23:36 <PaulMurray> ok - now I have made shaohe_feng wait - so he is up next
14:23:50 <PaulMurray> https://review.openstack.org/#/c/248358/
14:24:04 <mdbooth> The disk size check is orthogonal to my stuff
14:24:06 <alex_xu> whatever, shared storage detection is the next step
14:24:11 <PaulMurray> shaohe_feng, over to you
14:24:28 <johnthetubaguy> I think the specific check I was talking about was this one: https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L5340
14:24:44 <mdbooth> johnthetubaguy: https://review.openstack.org/#/c/248705/
14:24:58 <shaohe_feng> PaulMurray: OK
14:25:10 <shaohe_feng> 2 things
14:25:13 <johnthetubaguy> mdbooth: ah, OK
14:25:36 <shaohe_feng> 1. most want to set auto converge in nova.conf
14:26:01 <shaohe_feng> if this flag is not set in nova.conf
14:26:13 <shaohe_feng> is the default value true or false?
14:26:46 <shaohe_feng> 2. qemu version > 1.6 supports auto converge.
14:26:50 <pkoniszewski> shaohe_feng: if it is not set then it is False
14:26:54 <tdurakov> +1
14:27:25 <yuntongjin> +1
14:27:32 <pkoniszewski> shaohe_feng: the point is we need newer libvirt
14:27:35 <johnthetubaguy> seems like we could default to auto converge being on? at least when you have a new enough QEMU, or am I missing a bit?
14:27:41 <shaohe_feng> but libvirt version (1.2.3) supports an API to set auto converge.
14:28:10 <pkoniszewski> shaohe_feng: what about python bindings? I thought that it is included since 1.2.8
14:28:17 <johnthetubaguy> #link https://wiki.openstack.org/wiki/LibvirtDistroSupportMatrix
14:28:17 <alex_xu> we also need to check the libvirt version
14:28:17 <shaohe_feng> so if qemu 1.8, and libvirt 1.2.2, do we still need to support auto converge?
14:28:40 <pkoniszewski> IMO we need to bump the min libvirt version before we move on there
14:28:51 <alex_xu> I guess no, we only support the things the libvirt api exposes.
14:29:02 <shaohe_feng> pkoniszewski: but we can use libvirt to send a qmp command to enable it.
14:29:04 <yuntongjin> +1 with alex
14:29:12 <pkoniszewski> not really, bindings are generated in a weird way
14:29:20 <pkoniszewski> and i'm not sure that libvirt 1.2.3 will support this
14:29:27 <johnthetubaguy> so we already add support for things that need a higher version of libvirt than the minimum, but I guess it can't be on by default
14:29:36 <davidgiluk> shaohe_feng: QEMU 2.5 has some (experimental) parameters for changing the throttling levels of autoconverge, not in libvirt yet, but it's something to keep in mind
14:29:49 <johnthetubaguy> and needs clear errors when it fails to work due to the libvirt version
14:30:20 <shaohe_feng> davidgiluk: yes.
14:30:44 <shaohe_feng> davidgiluk: will we support setting the throttling level?
14:31:14 <davidgiluk> shaohe_feng: I think so, they are migration parameters (x-cpu-throttle-increment, x-cpu-throttle-initial)
14:31:42 <PaulMurray> shaohe_feng, this is qemu specific isn't it?
14:31:51 <shaohe_feng> so the same case as xbzrle compression
14:32:39 <yuntongjin> this is qemu specific
14:32:40 <shaohe_feng> PaulMurray: not sure other hypervisors support this feature.
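[Editor's note: the gating discussed above — only enable auto-converge when both QEMU and libvirt are new enough, and fail clearly otherwise — boils down to a version comparison. A minimal sketch, using the thresholds mentioned in the discussion (QEMU 1.6, libvirt 1.2.3); the helper name is illustrative, not Nova's actual capability check.]

```python
# Minimum versions mentioned in the discussion (illustrative thresholds).
MIN_QEMU_AUTO_CONVERGE = (1, 6, 0)
MIN_LIBVIRT_AUTO_CONVERGE = (1, 2, 3)

def supports_auto_converge(qemu_version, libvirt_version):
    """Auto-converge is usable only when BOTH the hypervisor and the
    management layer are new enough; otherwise the caller should raise
    a clear error rather than silently ignoring the flag."""
    return (qemu_version >= MIN_QEMU_AUTO_CONVERGE and
            libvirt_version >= MIN_LIBVIRT_AUTO_CONVERGE)
```

Under this sketch, shaohe_feng's example of QEMU 1.8 with libvirt 1.2.2 would be rejected, matching alex_xu's position that nova should only support what the libvirt API exposes.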
14:32:50 <eliqiao> auto-converge requires qemu>=1.6
14:33:09 <PaulMurray> I was thinking about what gets exposed - if these are qemu specific config parameters that is ok
14:33:20 <PaulMurray> If it's visible in the API then I think it is not ok
14:33:52 <eliqiao> davidgiluk: can you tell where "x-cpu-throttle-increment x-cpu-throttle-initial" come from?
14:33:54 <shaohe_feng> PaulMurray: so how do we support setting the throttling level, if it is qemu specific
14:34:04 <mdbooth> PaulMurray: +1
14:34:21 <PaulMurray> shaohe_feng, that has been a long discussion in lots of other features
14:34:24 <johnthetubaguy> I am very much against us leaking these implementation details out of the API, or in image properties, etc
14:34:34 <davidgiluk> eliqiao: They are migration parameters, set using migrate_set_parameter (that's the HMP name)
14:34:57 <yuntongjin> shaohe_feng: i'd suggest auto-setting "x-cpu-throttle-increment x-cpu-throttle-initial"
14:34:57 <johnthetubaguy> Nova is aiming to provide a single API experience across all Nova installations, let's keep that in mind
14:35:13 <shaohe_feng> davidgiluk: like the xbzrle set cache size.
14:35:29 <johnthetubaguy> now there may be some technology specific tunables that make sense to go in nova.conf, say if your network is a certain capacity you need to change the throttle, etc
14:35:48 <davidgiluk> shaohe_feng: No, xbzrle cache size is annoyingly special - it's a migrate_set_cache_size command, the hope is that all future numerical parameters are set using migrate_set_parameter
14:36:00 <eliqiao> yeah, I am curious about whether we should set the live migration throttling level in nova?
14:36:23 <davidgiluk> eliqiao: No one quite knows what value to set it to yet though, kind of depends on the workload
14:36:48 <johnthetubaguy> workload dependence is the nasty part, that's where I think pause is more useful here
14:36:57 <eliqiao> davidgiluk: okay, then should nova decide it or expose a new api to let the operator do it?
14:37:30 <shaohe_feng> davidgiluk: then we need libvirt to support a new API for the xbzrle cache_size
14:37:40 <davidgiluk> johnthetubaguy: The old autoconverge code didn't try very hard, the new one tries a lot harder if you set those parameters up, of course pause guarantees it will be quiet - or postcopy is the alternate guarantee
14:38:26 <eliqiao> as Daniel Berrange commented on that spec, some low level throttling parameters shouldn't be exposed to nova; how do we go with these libvirt features?
14:38:37 <eliqiao> a stupid nova or a smart one?
14:38:47 <mdbooth> At the highest level, some instances may prefer to be killed rather than to run slowly
14:38:57 <shaohe_feng> eliqiao: yes. ^ what should we do?
14:39:01 <PaulMurray> eliqiao, could start stupid and then improve
14:39:11 <mdbooth> Because they will recover quickly if killed, but will miss deadlines if running slowly
14:39:16 <PaulMurray> We are starting to build some features to help migration
14:39:17 <pkoniszewski> +1 for starting with the stupid one
14:39:25 <pkoniszewski> and this should be imo configurable through nova.conf
14:39:26 <pkoniszewski> not the api
14:39:26 <PaulMurray> pause is a stupid one
14:39:34 <paul-carlton1> It is tempting to add lots of knobs to live migration. This could be exposed in a non driver specific way using key value pairs, but it would be much better to just express a policy, ranging from pausing the instance to let it complete, through various degrees of impact on instance performance, to allowing the migration to complete
14:39:43 <mdbooth> I think some level of per-instance tunable is inevitable
14:39:44 <eliqiao> PaulMurray: how do we get smarter if we don't want to know the low level throttling stuff?
14:40:04 <pkoniszewski> eliqiao: watcher should step in
14:40:05 <tdurakov> pkoniszewski, agree
14:40:29 <shaohe_feng> PaulMurray: got it. So for xbzrle, we can set the enable flag and cache size in nova.conf
14:40:32 <shaohe_feng> right?
14:40:34 <eliqiao> pkoniszewski: yeah, but watcher depends on nova, and nova has no API to be called at all.
14:40:43 <PaulMurray> shaohe_feng, sounds like a good start
14:40:48 <johnthetubaguy> policies around downtime are the better way to approach this, in terms of per image/instance
14:40:53 <pkoniszewski> do we really want to have XBZRLE enabled by default?
14:40:59 <pkoniszewski> it has some impact on memory...
14:41:37 <yuntongjin> and cpu
14:41:38 <shaohe_feng> PaulMurray: thank you. then I need to know how to do it.
14:41:46 <alex_xu> no, it shouldn't be the default
14:41:48 <davidgiluk> pkoniszewski: No!
14:41:54 <davidgiluk> pkoniszewski: It really sucks CPU
14:42:12 <johnthetubaguy> so I have been thinking enable by default to help the testing of these features, but if it's not available in the min_version, and has resource requirements, that doesn't seem like a good move
14:42:17 <eliqiao> davidgiluk: +1
14:42:35 <eliqiao> davidgiluk: later we expect multi-thread compression
14:42:37 <paul-carlton1> So we should have pause, cancel and a dial we can set on an in-progress instance migration, as well as an initial setting for all migrations, ranging from "don't put instance performance/integrity at risk at all" to "allow significant impact"
14:42:42 <johnthetubaguy> if it uses so many resources, should we both supporting it?
14:42:48 <davidgiluk> eliqiao: Yes, I've had that running as well
14:42:49 <mdbooth> When optimising disk transfer in another context, I found that compression reduced transfer performance on a fast network in practice.
14:42:57 <johnthetubaguy> s/both/bother/
14:43:04 <pkoniszewski> one question - have we decided which version of libvirt will be supported in the O-release already?
14:43:04 * mdbooth would ignore compression.
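[Editor's note: shaohe_feng's plan above — an XBZRLE enable flag plus cache size in nova.conf, off by default given its CPU and memory cost — might look like the sketch below. The option names here are hypothetical, invented for illustration; they are not real Nova configuration options.]

```python
import configparser

# Hypothetical nova.conf snippet for the knobs discussed above; these
# option names are illustrative, not actual Nova options.
NOVA_CONF = """
[libvirt]
live_migration_compression = xbzrle
live_migration_xbzrle_cache_size_mb = 64
"""

cfg = configparser.ConfigParser()
cfg.read_string(NOVA_CONF)
# Compression stays off unless the operator opts in, since XBZRLE costs
# both CPU and memory, as noted in the discussion.
xbzrle_enabled = cfg.get('libvirt', 'live_migration_compression',
                         fallback='none') == 'xbzrle'
cache_mb = cfg.getint('libvirt', 'live_migration_xbzrle_cache_size_mb',
                      fallback=64)
```

This keeps the tunable operator-facing in nova.conf rather than leaking a qemu-specific knob through the API, in line with pkoniszewski's and johnthetubaguy's comments.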
14:43:22 <davidgiluk> paul-carlton1: I'm not sure it's a linear line of different features like that
14:43:34 <pkoniszewski> I think that is a clue to the problems that we have now
14:43:37 <johnthetubaguy> pkoniszewski: this is all we have right now: https://wiki.openstack.org/wiki/LibvirtDistroSupportMatrix
14:43:42 <shaohe_feng> yes. we are working on multi-thread compression in libvirt. and maybe we need to think about how we should support the API to set multi-thread compression parameters.
14:43:46 <eliqiao> mdbooth: what's your concern about compression?
14:43:56 <mdbooth> eliqiao: It makes it slower.
14:44:02 <mdbooth> Whilst also using more cpu.
14:44:21 <mdbooth> That was my experience, anyway.
14:44:26 <davidgiluk> mdbooth: That's what my measurements show, although I think there's some hope that if you have a compression accelerator etc it might help
14:44:27 <eliqiao> mdbooth: but it may be required in some specific cases
14:44:28 <johnthetubaguy> unless you have a very constrained network, which seems like the less likely case
14:44:39 <PaulMurray> eliqiao, mdbooth, shaohe_feng I think we need to draw this discussion to a close for now
14:44:49 <PaulMurray> time is running on
14:44:57 <PaulMurray> Perhaps it should go on the mailing list?
14:44:59 <shaohe_feng> PaulMurray: OK
14:45:04 <eliqiao> PaulMurray: okay, thx.
14:45:12 <pkoniszewski> 15 minutes left
14:45:15 <PaulMurray> sorry to do that, but time won't stop
14:45:17 <bauzas> oh snap, missed the meeting
14:45:27 <PaulMurray> #topic CI status
14:45:36 <PaulMurray> tdurakov, do you have an update for us
14:45:38 <PaulMurray> ?
14:45:58 <tdurakov> yep
14:46:15 <tdurakov> https://review.openstack.org/#/c/247081/ - need to merge this
14:46:22 <tdurakov> enables nfs for live-migration
14:46:33 <tdurakov> started to work on ceph
14:46:50 <PaulMurray> I see jaypipes has a +2 on that review already
14:47:05 <tdurakov> example of job execution
14:47:06 <tdurakov> http://logs.openstack.org/81/247081/14/experimental/gate-tempest-dsvm-multinode-live-migration/a098ca9/logs/devstack-gate-post_test_hook.txt.gz
14:47:47 <PaulMurray> tdurakov, good progress
14:48:06 <tdurakov> that's all about ci for today
14:48:17 <alex_xu> tdurakov: nice work
14:48:19 <PaulMurray> the last time we spoke the plan was to add ceph and some tests
14:48:31 <eliqiao> tdurakov: the scripts seem cool, but a little hard to understand, as jaypipes commented.
14:48:33 <PaulMurray> then ask about moving to the check queue - right?
14:48:36 <tdurakov> tests in tempest are on review already
14:48:42 <pkoniszewski> tdurakov: good progress! Thanks!
14:48:52 <tdurakov> yep
14:49:09 <jlanoux_> tdurakov: did you start working on project-config?
14:49:17 <tdurakov> eliqiao, it's all about context, will add several comments
14:49:32 <eliqiao> tdurakov: thanks, that would be great!
14:49:42 <tdurakov> jlanoux_, yes, it's in progress too
14:49:48 <jlanoux_> tdurakov: cool!
14:50:10 <PaulMurray> tdurakov, we should be thinking about adding tests to this job to add coverage for things we do
14:50:24 <PaulMurray> when do you think it will be a good idea to start doing that?
14:50:28 <PaulMurray> after it's in the check queue?
14:50:56 <tdurakov> yep, we need to merge the tempest test, and then other multinode stuff could be added
14:51:04 <tdurakov> tests*
14:51:12 <eliqiao> PaulMurray: it's cool to use the gate to test new features. thanks gate!
14:51:23 <PaulMurray> eliqiao, :)
14:51:30 <tdurakov> :)
14:51:45 <PaulMurray> thanks tdurakov - need to move to the next topic
14:51:55 <PaulMurray> #topic Bugs
14:51:57 <alex_xu> yea, thanks tdurakov
14:51:59 <tdurakov> as the patch for nfs gets merged, feel free to check experimental on patches
14:52:01 <tdurakov> sure
14:52:07 <PaulMurray> https://review.openstack.org/#/c/215483/
14:52:09 <rajesht> hi
14:52:16 <PaulMurray> rajesht, over to you
14:52:32 <rajesht> I have added a comment about approaches to fix this issue to LP bug https://bugs.launchpad.net/nova/+bug/1470420
14:52:32 <openstack> Launchpad bug 1470420 in OpenStack Compute (nova) "Set migration status to 'error' instead of 'failed' during live-migration" [Low,In progress] - Assigned to Rajesh Tailor (rajesh-tailor)
14:53:22 <PaulMurray> rajesht, sorry I didn't get around to reviewing this again - I had a quick look
14:53:25 <rajesht> IMO to remove instance files on live-migration failure we need to set the migration status to 'error' so that the periodic task cleanup_incomplete_migration will delete the instance files.
14:53:49 <PaulMurray> I saw ndipanov wasn't happy about that approach
14:53:57 <PaulMurray> did you understand his comment - he is not here
14:54:09 <alex_xu> I also remember that
14:54:36 <rajesht> Yes, he mentioned that it is reasonable to have retry logic but that logic is not there..
14:54:44 <rajesht> in the _do_live_migration method
14:55:27 <rajesht> there were a total of three places where the migration status should be changed from 'failed' to 'error'.
14:56:20 <alex_xu> sorry, I missed the context too, need to take more of a look at the patch
14:56:20 <johnthetubaguy> did we resolve the worry about 'failed' vs 'error' being used in the periodic tasks for cleanup?
14:56:27 <PaulMurray> rajesht, I said I would look closer at that because I thought one of them should be 'failed' - but was not completely sure
14:56:41 <johnthetubaguy> also, I am a bit worried that this changes the public API a little bit
14:56:45 <PaulMurray> so I don't think that is resolved
14:57:17 <rajesht> In that case, would you please review it again and comment with your suggestions.
14:57:53 <PaulMurray> rajesht, I will do that - but I think we need nikola and others with more background in this error handling to comment
14:58:09 <PaulMurray> as well
14:58:28 <PaulMurray> so i think we have to leave it there.
14:58:37 <rajesht> PaulMurray: sure, will contact nikola for his feedback.
14:58:43 <PaulMurray> #topic Open
14:58:47 <PaulMurray> we have no time for open
14:58:49 <PaulMurray> sorry
14:58:54 <PaulMurray> but thank you for coming
14:59:02 <PaulMurray> #help please review the specs
14:59:14 <alex_xu> #link https://review.openstack.org/#/c/247016/
14:59:18 <PaulMurray> we need to end now
14:59:28 <alex_xu> please help review this great doc by PaulMurray ^
14:59:50 <PaulMurray> well done for sneaking the quick link in :)
14:59:53 <PaulMurray> #endmeeting