14:00:28 <PaulMurray> #startmeeting Nova Live Migration
14:00:29 <openstack> Meeting started Tue Nov 24 14:00:28 2015 UTC and is due to finish in 60 minutes.  The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:30 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:33 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:38 <shaohe_feng> Hi PaulMurray
14:00:52 <alex_xu> o/
14:00:54 <mdbooth> o/
14:00:58 <andrearosa> hi
14:01:00 <PaulMurray> Hi, who's here for live migration
14:01:00 <tdurakov> o/
14:01:06 <jlanoux_> o/
14:01:08 <davidgiluk> hi
14:01:29 <yuntongjin> Hi PaulMurray
14:01:37 <PaulMurray> hi yuntongjin
14:01:41 <pkoniszewski> o/
14:01:58 <PaulMurray> I'll wait a minute as usual to give people a chance to join
14:02:05 <shaohe_feng> yuntongjin:  are you here?
14:02:58 <PaulMurray> ok - lets get started
14:03:00 <shaohe_feng> PaulMurray: it does not matter. I work with yuntongjin; anything, I can convey to him
14:03:23 <PaulMurray> There is an agenda here: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:03:24 * davidgiluk will try to be here each week - I work on QEMU migration at RH
14:03:41 <PaulMurray> davidgiluk, thank you, that will be helpful
14:03:43 <yuntongjin> shaohe_feng:yes, i'm here
14:04:00 <PaulMurray> #topic Specs status
14:04:16 <PaulMurray> We are getting close to the spec cutoff
14:04:42 <PaulMurray> johnthetubaguy, mentioned the other day that the cutoff is intended to be there for priority specs too
14:05:06 <mdbooth> https://review.openstack.org/#/c/248705/
14:05:06 <PaulMurray> most are there
14:05:15 <mdbooth> That has a constructive -1 from danpb
14:05:24 <mdbooth> I need to address those comments
14:05:31 <eliqiao> hi
14:05:58 <PaulMurray> mdbooth, yes, saw that - do you want to discuss something about it?
14:06:10 <shaohe_feng> PaulMurray:  I'd like to talk about auto converge. https://review.openstack.org/#/c/248358/
14:06:25 <shaohe_feng> there are 2 things .
14:06:52 <shaohe_feng> 1. we decided to set auto converge as the default value, right?
14:06:52 <PaulMurray> shaohe_feng, hold on a moment
14:07:02 <shaohe_feng> PaulMurray: OK.
14:07:17 <mdbooth> Well, the meat of it is in the spec. I'd only say that we focussed on just the transfer mechanism last time. However, when I looked at the code to flesh out the detail, it became clear to me that the much harder problem was getting the code flow right.
14:08:02 <PaulMurray> mdbooth, do you have a plan together for that?
14:08:13 <mdbooth> PaulMurray: Yup. It's in that spec :)
14:08:30 <PaulMurray> so we just need to review it then ?
14:08:36 <mdbooth> So, I agree with everything which was approved originally.
14:08:52 <mdbooth> I just think it glosses over some pretty hard stuff which we need to solve before we can do that.
14:09:08 <mdbooth> PaulMurray: Yup.
14:09:54 <PaulMurray> mdbooth, ok, so lets see how it goes - I will certainly go over it today, but may not be best place to judge
14:10:31 <PaulMurray> shaohe_feng, we will come to your spec in a minute - there was one on the agenda I want to go to first (because it was on the agenda)
14:10:31 <pkoniszewski> mdbooth: I will also take a look
14:10:44 <PaulMurray> https://review.openstack.org/#/c/245543/
14:11:00 <PaulMurray> alex_xu, are you there?
14:11:09 <alex_xu> PaulMurray: hi, I'm here
14:11:19 <PaulMurray> I think that is yours
14:11:49 <alex_xu> Actually I already brought this to the API meeting, and got agreement that we can remove the disk_over_commit flag directly with a microversion, without deprecating it first
14:11:52 <alex_xu> PaulMurray: thanks
14:12:08 <alex_xu> next thing I want to ask is what is the API behaviour after remove disk_over_commit
14:12:26 <pkoniszewski> i agree with this one, but what about block migration flag?
14:12:32 <alex_xu> do we not check the disk size anymore, or keep checking the virtual_size as if disk_over_commit=True?
14:13:23 <tdurakov> i'd prefer second
14:13:24 <alex_xu> pkoniszewski: the proposal is to make the block migration flag optional, with a default value of None; then nova will do the right thing based on whether the storage is shared or not
14:13:46 <alex_xu> Ideally the resource tracker should check the disk usage as we do in other actions, but that should be another bug fix.
14:14:01 <PaulMurray> mdbooth, did you see the check alex_xu is talking about when you were looking at the code for your spec
14:14:06 * mdbooth would strongly prefer that. We can update the scheduler to be shared-storage aware separately.
14:14:14 <mdbooth> PaulMurray: Which one specifically?
14:14:21 <alex_xu> That is the reason I began to think that after removing disk_over_commit, we remove the disk size check totally, as no driver except the libvirt driver checks the disk size
14:14:38 <mdbooth> My new spec wants to do away with the current shared storage detection.
14:14:43 <johnthetubaguy> yeah, I like the stop the current check, and make the scheduler/resource tracker do the right thing instead
14:14:45 <pkoniszewski> tdurakov: I thought that your spec (split LM code) will remove block migration flag
14:15:07 <mdbooth> It will, instead just check on the dest if the storage already exists. If it does, it assumes it's because of shared storage.
14:15:10 <tdurakov> pkoniszewski, nope, I'm not going to touch the REST API
14:15:18 <pkoniszewski> ok, that makes sense then
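alex_xu's proposal and mdbooth's heuristic can be sketched roughly as below. This is a hypothetical helper, not actual Nova code: with block_migration defaulting to None, Nova picks the mode itself, treating a disk that already exists on the destination as evidence of shared storage.

```python
def decide_block_migration(requested, dest_has_disk):
    """Pick the live migration mode when the API flag is optional.

    requested: True/False if the caller forced a mode, None otherwise.
    dest_has_disk: whether the instance disk already exists on the
    destination host (mdbooth's shared-storage heuristic).
    """
    if requested is not None:
        return requested          # caller explicitly chose a mode
    # Disk already present on the destination => assume shared storage,
    # so a plain (non-block) live migration is the right thing.
    return not dest_has_disk
```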
14:15:36 <PaulMurray> mdbooth, I think the check is there because libvirt creates a file at the disk size
14:15:49 <tdurakov> what about something like aggregates for shared storage?
14:15:55 <PaulMurray> mdbooth, not sure if that is just to check shared storage or as part of copy
14:16:11 <PaulMurray> i was thinking about your comment to do with sparse copy
14:16:32 <johnthetubaguy> tdurakov: shared storage is going to be modelled by resource provider pools stuff soon, AFAIK
14:16:52 <tdurakov> johnthetubaguy, sounds good, any spec for that?
14:16:55 <alex_xu> johnthetubaguy: +1
14:17:25 <johnthetubaguy> tdurakov: jaypipes has one somewhere, I think it's this one: https://review.openstack.org/#/c/225546/
14:17:53 <jaypipes> johnthetubaguy: pushing up the pci-generate-stats one first, then a new rev on that.
14:18:02 <jaypipes> johnthetubaguy: meetings, meetings, meetings...
14:18:15 <jaypipes> johnthetubaguy: will definitely have both BPs pushed again today.
14:18:50 <PaulMurray> mdbooth, did any of that answer your question? I got lost in the discussion
14:19:32 <tdurakov> https://review.openstack.org/#/c/225910/ - live-migration refactoring, need some time to update this, comments are welcome
14:19:39 <mdbooth> PaulMurray: I'm also lost :) Not sure which specific check we're talking about. The shared storage detection is kinda distributed, which is a large part of its problem.
14:20:14 <PaulMurray> mdbooth, never mind - I might be confusing things
14:20:35 <alex_xu> yea...I'm also lost...the shared storage detection is the next step after we have resource providers. So are people ok with the current proposal?
14:20:40 <PaulMurray> alex_xu, did you get what you need
14:21:00 <mdbooth> alex_xu: Please look at my spec, though, because I want to do shared storage detection quite differently.
14:21:23 <PaulMurray> alex_xu, I think I saw some people agree
14:21:25 <mdbooth> If I've missed something obvious, it would be good to know about that sooner.
14:21:40 <alex_xu> mdbooth: ok, no problem, I will take a look
14:21:44 <PaulMurray> Does anyone disagree with what alex_xu proposed?
14:22:14 <PaulMurray> I.e.: remove the disk size check?
14:22:28 <alex_xu> yea
14:22:41 <alex_xu> if no response, I guess it means agree :)
14:22:46 <eliqiao> seems it depends on mdbooth's shared storage part?
14:22:48 <PaulMurray> I think so
14:22:48 <johnthetubaguy> mdbooth: sorry, I lost the link, which is your new spec?
14:23:02 <tdurakov> mdbooth, starred spec too https://review.openstack.org/#/c/248705/
14:23:36 <PaulMurray> ok - now I have made shaohe_feng wait - so he is up next
14:23:50 <PaulMurray> https://review.openstack.org/#/c/248358/
14:24:04 <mdbooth> The disk size check is orthogonal to my stuff
14:24:06 <alex_xu> anyway, shared storage detection is the next step?
14:24:11 <PaulMurray> shaohe_feng, over to you
14:24:28 <johnthetubaguy> I think the specific check I was talking about was this one: https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L5340
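The check johnthetubaguy links computes how much destination disk the migration needs. Roughly, disk_over_commit=True counts only the bytes actually allocated, while False reserves each disk's full virtual size. A simplified stand-in (dict keys are illustrative; see the linked driver code for the real logic):

```python
def required_disk_bytes(disk_infos, disk_over_commit):
    # Each disk_info is a dict like {'disk_size': ..., 'virtual_size': ...}
    # (keys are illustrative, not Nova's exact field names). With
    # over-commit allowed we only need the space currently allocated;
    # otherwise reserve the full virtual size of every disk.
    key = 'disk_size' if disk_over_commit else 'virtual_size'
    return sum(d[key] for d in disk_infos)
```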
14:24:44 <mdbooth> johnthetubaguy: https://review.openstack.org/#/c/248705/
14:24:58 <shaohe_feng> PaulMurray: OK
14:25:10 <shaohe_feng> 2 things
14:25:13 <johnthetubaguy> mdbooth: ah, OK
14:25:36 <shaohe_feng> 1. most people want to set auto converge in nova.conf
14:26:01 <shaohe_feng> if this flag is not set in nova.conf
14:26:13 <shaohe_feng> the default value is true or false?
14:26:46 <shaohe_feng> 2. qemu versions >= 1.6 support auto converge.
14:26:50 <pkoniszewski> shaohe_feng: if it is not set then it is False
14:26:54 <tdurakov> +1
14:27:25 <yuntongjin> +1
14:27:32 <pkoniszewski> shaohe_feng: the point is we need newer libvirt
14:27:35 <johnthetubaguy> seems like we could default to auto converge being on? at least when you have new enough QEMU, or am I missing a bit?
14:27:41 <shaohe_feng> but libvirt version (1.2.3) supports an API to set auto converge.
14:28:10 <pkoniszewski> shaohe_feng: what about python bindings? I thought that it is included since 1.2.8
14:28:17 <johnthetubaguy> #link https://wiki.openstack.org/wiki/LibvirtDistroSupportMatrix
14:28:17 <alex_xu> we also need check libvirt version
14:28:17 <shaohe_feng> so if qemu is 1.8 and libvirt is 1.2.2, do we still need to support auto converge?
14:28:40 <pkoniszewski> IMO we need to bump min libvirt version before we move on there
14:28:51 <alex_xu> I guess no, we only support the things the libvirt API exposes.
14:29:02 <shaohe_feng> pkoniszewski: but we can use libvirt to send a qmp command to enable it.
14:29:04 <yuntongjin> +1 with alex
14:29:12 <pkoniszewski> not really, bindings are generated in a weird way
14:29:20 <pkoniszewski> and i'm not sure that libvirt 1.2.3 will support this
14:29:27 <johnthetubaguy> so we already add support for things that need a higher version of libvirt than the minimum, but I guess it can't be on by default
14:29:36 <davidgiluk> shaohe_feng: QEMU 2.5 has some (experimental) parameters for changing the throttling levels of autoconverge, not in libvirt yet, but it's something to keep in mind
14:29:49 <johnthetubaguy> and needs clear errors when it fails to work due to the libvirt version
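A minimal sketch of what "off by default, gated on versions, clear error otherwise" could look like. The flag value matches libvirt's VIR_MIGRATE_AUTO_CONVERGE constant; the helper and the shape of the version checks are assumptions, not Nova's actual implementation.

```python
VIR_MIGRATE_AUTO_CONVERGE = 1 << 13   # libvirt's auto-converge flag value

MIN_QEMU = (1, 6, 0)      # auto converge appeared in QEMU 1.6
MIN_LIBVIRT = (1, 2, 3)   # libvirt API support discussed above

def live_migration_flags(base_flags, permit_auto_converge,
                         qemu_version, libvirt_version):
    """Add the auto-converge flag only when configured AND supported."""
    if not permit_auto_converge:      # default False per the discussion
        return base_flags
    if qemu_version < MIN_QEMU or libvirt_version < MIN_LIBVIRT:
        # Fail loudly rather than silently migrating without throttling,
        # per johnthetubaguy's point about clear errors.
        raise RuntimeError("auto converge requires QEMU >= 1.6 "
                           "and libvirt >= 1.2.3")
    return base_flags | VIR_MIGRATE_AUTO_CONVERGE
```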
14:30:20 <shaohe_feng> davidgiluk:  yes.
14:30:44 <shaohe_feng> davidgiluk: will we support setting the throttling level?
14:31:14 <davidgiluk> shaohe_feng: I think so, they are migration parameters (x-cpu-throttle-increment, x-cpu-throttle-initial)
14:31:42 <PaulMurray> shaohe_feng, this is qemu specific isn't it ?
14:31:51 <shaohe_feng> so the same case with xbzrle compress
14:32:39 <yuntongjin> this is qemu specific
14:32:40 <shaohe_feng> PaulMurray: not sure other hypervisor support this feature.
14:32:50 <eliqiao> auto-converge requires qemu>=1.6
14:33:09 <PaulMurray> I was thinking about what gets exposed - if these are qemu specific config parameters that is ok
14:33:20 <PaulMurray> If its visible in the API then I think it is not ok
14:33:52 <eliqiao> davidgiluk: can you tell where "x-cpu-throttle-increment x-cpu-throttle-initial" come from?
14:33:54 <shaohe_feng> PaulMurray: so how do we support setting the throttling level, if it is qemu specific?
14:34:04 <mdbooth> PaulMurray: +1
14:34:21 <PaulMurray> shaohe_feng, that has been a long discussion in lots of other features
14:34:24 <johnthetubaguy> I am very much against us leaking these implementation details out the API, or in image properties, etc
14:34:34 <davidgiluk> eliqiao: They are migration parameters, set using migrate_set_parameter (that's the HMP name)
14:34:57 <yuntongjin> shaohe_feng: i'd suggest auto-setting "x-cpu-throttle-increment x-cpu-throttle-initial"
14:34:57 <johnthetubaguy> Nova is aiming to provide a single API experience across all Nova installations, lets keep that in mind
14:35:13 <shaohe_feng> davidgiluk: like the xbzrle set cache size.
14:35:29 <johnthetubaguy> now there may be some technology-specific tunables that make sense to go in nova.conf, say if your network is a certain capacity you need to change the throttle, etc
14:35:48 <davidgiluk> shaohe_feng: No, xzbrle cache size is annoyingly special - it's a migrate_set_cache_size command,  the hope is that all future numerical parameters are set using migrate_set_parameter
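davidgiluk's migrate_set_parameter is the HMP name; over QMP the equivalent message would look roughly like the sketch below. The command name and the x-prefixed keys are assumptions drawn from this discussion of experimental QEMU 2.5 parameters — check the QMP schema of the QEMU version in use before relying on them.

```python
import json

def qmp_set_throttle(initial_pct, increment_pct):
    """Build a QMP message setting the experimental auto-converge
    throttling knobs. Command and key names are assumptions based on
    the discussion (x- prefix marks them as experimental in QEMU 2.5).
    """
    return json.dumps({
        "execute": "migrate-set-parameters",
        "arguments": {
            "x-cpu-throttle-initial": initial_pct,
            "x-cpu-throttle-increment": increment_pct,
        },
    })
```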
14:36:00 <eliqiao> yeah, I am curious whether we should set the live migration throttling level in nova?
14:36:23 <davidgiluk> eliqiao: No one quite knows what value to set it to yet though, kind of depends on the workload
14:36:48 <johnthetubaguy> workload dependence is the nasty part, thats where I think pause is more useful here
14:36:57 <eliqiao> davidgiluk: okay, then should nova decide it, or expose a new API to let the operator do it?
14:37:30 <shaohe_feng> davidgiluk: then we need libvirt to support a new API for the xbzrle cache_size
14:37:40 <davidgiluk> johnthetubaguy: The old autoconverge code didn't try very hard, the new one tries a lot harder if you set those parameters up, of course pause guarantees it will be quiet - or postcopy is the alternate guarantee
14:38:26 <eliqiao> as Daniel Berrange commented on that spec, some low-level throttling parameters shouldn't be exposed to nova; how do we go with these libvirt features?
14:38:37 <eliqiao> a stupid nova or a smart one ?
14:38:47 <mdbooth> At the highest level, some instances may prefer to be killed than to run slowly
14:38:57 <shaohe_feng> eliqiao:  yes. ^ what should we do?
14:39:01 <PaulMurray> eliqiao, could start stupid and then improve
14:39:11 <mdbooth> Because they will recover quickly if killed, but will miss deadlines if running slowly
14:39:16 <PaulMurray> We are starting to build some features to help migration
14:39:17 <pkoniszewski> +1 for starting with stupid one
14:39:25 <pkoniszewski> and this should be imo configurable through nova.conf
14:39:26 <pkoniszewski> not the api
14:39:26 <PaulMurray> pause is a stupid one
14:39:34 <paul-carlton1> It is tempting to add lots of knobs to live migration. This could be exposed in a non-driver-specific way using key-value pairs, but it's much better to just express a policy, ranging from pausing the instance through various degrees of impact on instance performance, to allow the migration to complete
14:39:43 <mdbooth> I think some level of per-instance tunable is inevitable
14:39:44 <eliqiao> PaulMurray: how do we get smarter if we don't want to know the low-level throttling stuff?
14:40:04 <pkoniszewski> eliqiao: watcher should step in
14:40:05 <tdurakov> pkoniszewski, agree
14:40:29 <shaohe_feng> PaulMurray:  got it. So for xbzrle, we can set an enable flag and cache size in nova.conf
14:40:32 <shaohe_feng> right?
14:40:34 <eliqiao> pkoniszewski: yeah, but watcher depends on nova, and nova has no API to be called at all.
14:40:43 <PaulMurray> shaohe_feng, sounds like a good start
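The nova.conf approach shaohe_feng describes could look like the sketch below: an enable flag that defaults to off (given the CPU and memory cost raised just after this), plus a cache size. Both option names are hypothetical, modelled on Nova's config style, not actual merged options.

```python
# Hypothetical [libvirt] section defaults for XBZRLE, per the discussion:
# disabled unless an operator opts in, with a tunable cache size.
DEFAULTS = {
    "live_migration_permit_xbzrle": False,  # opt-in: it costs CPU/memory
    "xbzrle_cache_size_mb": 64,
}

def xbzrle_cache_bytes(conf):
    """Return the cache size in bytes to hand to libvirt, or None
    when the feature should stay disabled."""
    opts = dict(DEFAULTS)
    opts.update(conf)
    if not opts["live_migration_permit_xbzrle"]:
        return None            # feature stays off by default
    return opts["xbzrle_cache_size_mb"] * 1024 * 1024
```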
14:40:48 <johnthetubaguy> polices around downtime, are the better way to approach this, in terms of per image/instance
14:40:53 <pkoniszewski> do we really want to have XBZRLE enabled by default?
14:40:59 <pkoniszewski> it has some impact on memory...
14:41:37 <yuntongjin> and cpu
14:41:38 <shaohe_feng> PaulMurray:  thank you. then I need to know how to do it.
14:41:46 <alex_xu> no, it shouldn't be the default
14:41:48 <davidgiluk> pkoniszewski: No!
14:41:54 <davidgiluk> pkoniszewski: It really sucks CPU
14:42:12 <johnthetubaguy> so I have been thinking of enabling it by default to help the testing of these features, but if it's not available in the min_version, and has resource requirements, that doesn't seem like a good move
14:42:17 <eliqiao> davidgiluk: +1
14:42:35 <eliqiao> davidgiluk: later we are expect multi-thread compression
14:42:37 <paul-carlton1> So we should have pause, cancel, and a dial we can set on an in-progress migration, as well as an initial setting for all migrations, ranging from "don't put instance performance/integrity at risk at all" to "allow significant impact"
14:42:42 <johnthetubaguy> if it uses so many resources, should we both supporting it?
14:42:48 <davidgiluk> eliqiao: Yes, I've had that running as well
14:42:49 <mdbooth> When optimising disk transfer in another context, I found that compression reduced transfer performance on a fast network in practice.
14:42:57 <johnthetubaguy> s/both/bother/
14:43:04 <pkoniszewski> one question - have we decided which version of libvirt will be supported in O-release already?
14:43:04 * mdbooth would ignore compression.
14:43:22 <davidgiluk> paul-carlton1: I'm not sure it's a linear type of line of different features like that
14:43:34 <pkoniszewski> I think that it is a clue for our problems that we have now
14:43:37 <johnthetubaguy> pkoniszewski: this is all we have right now: https://wiki.openstack.org/wiki/LibvirtDistroSupportMatrix
14:43:42 <shaohe_feng> yes. we are working on multi-thread compression in libvirt, and maybe we need to think about how we should support the API to set multi-thread compression parameters.
14:43:46 <eliqiao> mdbooth: what's your concern about compression?
14:43:56 <mdbooth> eliqiao: It makes it slower.
14:44:02 <mdbooth> Whilst also using more cpu.
14:44:21 <mdbooth> That was my experience, anyway.
14:44:26 <davidgiluk> mdbooth: That's what my measurements show, although I think there's some hope that if you have a compression accelerator etc it might help
14:44:27 <eliqiao> mdbooth: but it may be required in some specific cases
14:44:28 <johnthetubaguy> unless you have a very constrained network, which seems like the less likely case
14:44:39 <PaulMurray> eliqiao, mdbooth shaohe_feng I think we need to draw this discussion to a close for now
14:44:49 <PaulMurray> time is running on
14:44:57 <PaulMurray> Perhaps it should go on the mailing list?
14:44:59 <shaohe_feng> PaulMurray: OK
14:45:04 <eliqiao> PaulMurray: okay, thx.
14:45:12 <pkoniszewski> 15 minutes left
14:45:15 <PaulMurray> sorry to do that, but time won't stop
14:45:17 <bauzas> oh snap, missed the meeting
14:45:27 <PaulMurray> #topic CI status
14:45:36 <PaulMurray> tdurakov, do you have an update for us
14:45:38 <PaulMurray> ?
14:45:58 <tdurakov> yep
14:46:15 <tdurakov> https://review.openstack.org/#/c/247081/ - need to merge this
14:46:22 <tdurakov> enables nfs for live-migration
14:46:33 <tdurakov> started to work on ceph
14:46:50 <PaulMurray> I see jaypipes has a +2 on that review already
14:47:05 <tdurakov> example of job execution
14:47:06 <tdurakov> http://logs.openstack.org/81/247081/14/experimental/gate-tempest-dsvm-multinode-live-migration/a098ca9/logs/devstack-gate-post_test_hook.txt.gz
14:47:47 <PaulMurray> tdurakov, good progress
14:48:06 <tdurakov> that's all about ci for today
14:48:17 <alex_xu> tdurakov: nice work
14:48:19 <PaulMurray> the last time we spoke the plan was to add ceph and some tests
14:48:31 <eliqiao> tdurakov: the scripts seem cool, but a little hard to understand, as jaypipes commented.
14:48:33 <PaulMurray> then ask about moving to check queue - right?
14:48:36 <tdurakov> tests in tempest on review already
14:48:42 <pkoniszewski> tdurakov: good progress! Thanks!
14:48:52 <tdurakov> yep
14:49:09 <jlanoux_> tdurakov: did you start working on project-config?
14:49:17 <tdurakov> eliqiao, it's all about context, will add several comments
14:49:32 <eliqiao> tdurakov: thanks, that would be great !
14:49:42 <tdurakov> jlanoux_, yes, it's in progress too
14:49:48 <jlanoux_> tdurakov: cool!
14:50:10 <PaulMurray> tdurakov, we should be thinking about adding tests to this job to add coverage for things we do
14:50:24 <PaulMurray> when do you think it will be a good idea to start doing that?
14:50:28 <PaulMurray> after in check queue?
14:50:56 <tdurakov> yep, we need to merge the tempest tests, and then other multinode stuff could be added
14:51:12 <eliqiao> PaulMurray: it's cool to use the gate to test new features. thanks gate!
14:51:23 <PaulMurray> eliqiao, :)
14:51:30 <tdurakov> :)
14:51:45 <PaulMurray> thanks tdurakov - need to move to next topic
14:51:55 <PaulMurray> #topic Bugs
14:51:57 <alex_xu> yea, thanks tdurakov
14:51:59 <tdurakov> once the patch for nfs is merged, feel free to check experimental on patches
14:52:01 <tdurakov> sure
14:52:07 <PaulMurray> https://review.openstack.org/#/c/215483/
14:52:09 <rajesht> hi
14:52:16 <PaulMurray> rajesht, over to you
14:52:32 <rajesht> I have added a comment about approaches to fix this issue to LP bug https://bugs.launchpad.net/nova/+bug/1470420
14:52:32 <openstack> Launchpad bug 1470420 in OpenStack Compute (nova) "Set migration status to 'error' instead of 'failed' during live-migration" [Low,In progress] - Assigned to Rajesh Tailor (rajesh-tailor)
14:53:22 <PaulMurray> rajesht, sorry I didn't get around to review this again - I had a quick look
14:53:25 <rajesht> IMO, to remove instance files on live-migration failure we need to set the migration status to 'error' so that the periodic task cleanup_incomplete_migration will delete the instance files.
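Why the status value matters can be sketched in a few lines: the cleanup periodic task only reclaims files for migrations marked 'error', so a live migration left in 'failed' is never cleaned up. This is a simplified stand-in for illustration, not the actual Nova periodic task.

```python
def files_to_clean_up(migrations):
    # Only migrations whose status is 'error' are picked up by the
    # cleanup periodic task; ones left in 'failed' keep their instance
    # files around forever, which is the bug under discussion.
    return [m["instance"] for m in migrations if m["status"] == "error"]
```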
14:53:49 <PaulMurray> I saw ndipanov wasn't happy about that approach
14:53:57 <PaulMurray> did you understand his comment - he is not here
14:54:09 <alex_xu> I also remember that
14:54:36 <rajesht> Yes, he mentioned that it is reasonable to have retry logic, but that logic is not there
14:54:44 <rajesht> in the _do_live_migration method
14:55:27 <rajesht> there are three places in total where the migration status should be changed from 'failed' to 'error'.
14:56:20 <alex_xu> sorry, I missed the context also, need to take more of a look at the patch
14:56:20 <johnthetubaguy> did we resolve the worry about failed vs error being used in the periodic tasks for clean up tasks?
14:56:27 <PaulMurray> rajesht, I said I would look closer at that because I thought one of them should be failed - but was not completely sure
14:56:41 <johnthetubaguy> also, I am a bit worried that changes the public API a little bit
14:56:45 <PaulMurray> so I don't think that is resolved
14:57:17 <rajesht> In that case, would you please review it again and comment with your suggestions.
14:57:53 <PaulMurray> rajesht, I will do that - but I think we need nikola and others with more background in this error handling to comment
14:58:09 <PaulMurray> as well
14:58:28 <PaulMurray> so i think we have to leave it there.
14:58:37 <rajesht> PaulMurray: sure, Will contact nikola for his feedback.
14:58:43 <PaulMurray> #topic Open
14:58:47 <PaulMurray> we have no time for open
14:58:49 <PaulMurray> sorry
14:58:54 <PaulMurray> but thank you for coming
14:59:02 <PaulMurray> #help please review the specs
14:59:14 <alex_xu> #link https://review.openstack.org/#/c/247016/
14:59:18 <PaulMurray> we need to end now
14:59:28 <alex_xu> please help on review this great doc by PaulMurray ^
14:59:50 <PaulMurray> well done for sneaking the quick link  in :)
14:59:53 <PaulMurray> #endmeeting