14:00:47 <PaulMurray> #startmeeting Nova Live Migration
14:00:48 <openstack> Meeting started Tue Nov 17 14:00:47 2015 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:49 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:53 <openstack> The meeting name has been set to 'nova_live_migration'
14:01:01 * bauzas lurks
14:01:11 * johnthetubaguy lurks with intent
14:01:17 <PaulMurray> Hi, anyone here for live migration?
14:01:21 <pkoniszewski> o/
14:01:23 <paul-carlton> yep
14:01:23 <rdopiera> o/
14:01:24 <mdbooth> o/
14:01:24 <alex_xu> o/
14:01:34 * kashyap waves
14:01:44 <shaohe_feng1> hi PaulMurray
14:01:46 <eliqiao> hi PaulMurray
14:01:46 <kashyap> *
14:01:47 <andrearosa> hi
14:01:50 <eliqiao> o/
14:02:01 <eliqiao> hi pkoniszewski
14:02:18 <shaohe_feng1> hi pkoniszewski
14:02:25 <shaohe_feng1> o/
14:02:28 <PaulMurray> Hi everyone, just wait one minute in case someone is late
14:02:34 <jlanoux> o/
14:03:17 <johnthetubaguy> looks like a good turnout, great to see :)
14:03:30 <eliqiao> I thought the time was 1300 UTC, as polled. Okay, anyway.
14:03:30 <PaulMurray> ok - that's long enough - thanks for coming everyone
14:04:08 <PaulMurray> eliqiao, interesting, you are the second to say that - the poll was definitely 1400 UTC, but we're here now anyway
14:04:31 <PaulMurray> I assume you have all seen the meeting page here: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:04:37 <PaulMurray> it has an agenda on it
14:04:41 <johnthetubaguy> daylight savings changed between now and the poll, that's always fun
14:04:59 <kashyap> And the 6 etherpads it has URLs to :P
14:05:00 <PaulMurray> I will try to post the agenda far enough in advance that everyone gets to see it beforehand
14:05:26 <eliqiao> johnthetubaguy: got it :)
14:05:49 <PaulMurray> I'll also try to go through things promptly in the meeting so we can finish early if there is nothing in particular to discuss
14:05:55 <shaohe_feng1> johnthetubaguy: got it. I was also puzzled by the time.
14:06:21 <PaulMurray> Did the poll adjust the times for local time zones?
14:06:42 <pkoniszewski> it did, I had no problems
14:06:56 <PaulMurray> Ah, I will remember that in future - sorry everyone
14:07:06 <pkoniszewski> but I was also affected by the daylight saving thing
14:07:37 <PaulMurray> #topic Specs status
14:07:50 <johnthetubaguy> I like this part of the year when I live in UTC, life is so much simpler!
14:07:54 <shaohe_feng1> PaulMurray: there is no daylight in our country.
14:08:15 <PaulMurray> shaohe_feng1, not at all, how do you see?
14:08:22 <kashyap> PaulMurray: :-)
14:08:25 <pkoniszewski> :D
14:08:33 <PaulMurray> #link https://etherpad.openstack.org/p/mitaka-nova-spec-review-tracking
14:08:43 <PaulMurray> This is the priority spec review page
14:09:11 <PaulMurray> We have a few specs under live migrate
14:09:19 <PaulMurray> just scroll down a little from the top
14:09:27 <shaohe_feng1> PaulMurray: daylight saving. :)
14:09:32 <PaulMurray> Two are already merged
14:10:06 <PaulMurray> I added the last one for alex_xu earlier today
14:10:16 <alex_xu> PaulMurray: thanks
14:10:32 <alex_xu> sorry for joining the party late
14:10:51 <PaulMurray> I want to discuss the pause-VM-during-live-migration spec for a moment; any others people want to mention?
14:11:25 <PaulMurray> #link https://review.openstack.org/#/c/229040/
14:11:48 <mdbooth> PaulMurray: libvirt volumes may have hit an issue requiring a minor rethink
14:11:55 <PaulMurray> This is the pause VM one - it is very close to being done, but has a couple of points to clear
14:12:03 <PaulMurray> mdbooth, ok - we'll come back to that in a mo
14:12:07 <mdbooth> +1
14:12:15 <eliqiao> I think it will be hard to control the status while the live migration is running
14:12:28 <bzhou> "As an operator of an OpenStack cloud, I would like the ability to pause VM
14:12:28 <bzhou> during live migration." Why "during live migration"?
14:12:49 <PaulMurray> bzhou, it's really a way to push a migration through
14:13:00 <pkoniszewski> bzhou: to get rid of dirty pages
14:13:11 <johnthetubaguy> so, for me, the key thing here is that live-migrate can literally take forever; let's give operators a way to force the end of that process
14:13:14 <johnthetubaguy> yeah, that
14:13:20 <eliqiao> does pause == cancel?
14:13:32 <PaulMurray> eliqiao, no
14:13:38 <PaulMurray> we will do both separately
14:13:44 <PaulMurray> as two features
14:13:48 <pkoniszewski> no, cancel is a different operation which will literally cancel this process
14:14:17 <PaulMurray> So the question on the spec is about naming
14:14:19 <mdbooth> Can we support post-copy yet?
14:14:24 <PaulMurray> but also about future intent
14:14:31 <pkoniszewski> not yet, post-copy should be there around the N release of OpenStack
14:14:33 <bzhou> do we need to resume it after LM is done?
14:14:36 <mdbooth> Another option for an operator might be to cancel and post-copy
14:14:45 <pkoniszewski> bzhou: we don't, libvirt does the job
14:14:52 <eliqiao> [danpb] NB, when we make use of post-copy migration, cancel will be impossible once post-copy starts.
14:15:05 <eliqiao> I saw this on the LM etherpad
14:15:12 <paul-carlton> I thought post-copy was pending changes in qemu that are still being tested?
14:15:20 <pkoniszewski> paul-carlton: exactly
14:15:29 <shaohe_feng1> PaulMurray: converge can also help to reduce the dirty page rate
14:15:36 <pkoniszewski> a change to libvirt should be simple, the tricky part will be OpenStack, e.g., networking
14:16:01 <johnthetubaguy> we should get back to Mitaka stuff I think
14:16:04 <pkoniszewski> y
14:16:05 <PaulMurray> The question I would like to see answered is whether we intend this to only be the pause version
14:16:11 <bzhou> are we able to resume even if LM is NOT done?
14:16:16 <andrearosa> post-copy is not an option now, do not discuss it. Converge can help; the idea of pausing the LM is a kind of last resort
14:16:34 <PaulMurray> or whether we intend that there could be other options that will come under the same action
14:16:45 <johnthetubaguy> so I think we can add options later
14:16:50 <shaohe_feng1> is post-copy ready in qemu?
14:16:57 <PaulMurray> If the second, we need a different name - that is all, I think
14:17:02 <pkoniszewski> PaulMurray: I don't like the idea of having different actions under the same action
14:17:09 <pkoniszewski> it brings confusion
14:17:10 <eliqiao> shaohe_feng1: no
14:17:30 <kashyap> shaohe_feng1: It is merged in upstream QEMU
14:17:35 <kashyap> And the relevant kernel part.
14:17:38 <johnthetubaguy> thing is, we don't need to decide that now, except if we put pause in the API name, I guess
14:18:24 <PaulMurray> johnthetubaguy, agreed - so perhaps change the name - it makes it distinctly different from the normal pause operation
14:18:32 <johnthetubaguy> just for clarity, we generally don't want features in Nova to depend on unreleased QEMU or libvirt
14:18:53 <eliqiao> johnthetubaguy: +1
14:19:07 <PaulMurray> Let's move the discussion of the naming to the spec and move on
14:19:14 <johnthetubaguy> so yes, the current list of names
14:19:15 <andrearosa> I like johnthetubaguy's proposal to call it "live-migrate-force-end". That name doesn't have any implementation details and gives us the flexibility to change the actual actions in future
14:19:33 <PaulMurray> mdbooth, you had something to discuss
14:19:39 <kashyap> johnthetubaguy: The post-copy developer in QEMU is on this channel - davidgiluk. Maybe he can comment on which QEMU release it'll be in.
14:20:02 <mdbooth> PaulMurray: Yes, although I posted review comments.
14:20:03 <quintela____> 2.5
14:20:06 <eliqiao> but force-end is kind of like cancel
14:20:07 <quintela____> code is already upstream
14:20:12 <mdbooth> https://review.openstack.org/#/c/232053/
14:20:28 <pkoniszewski> andrearosa: if we change the entire action in the future, e.g. to something that dynamically throttles the VM, +1 from me
14:20:31 <PaulMurray> mdbooth, does it need discussion here or should we just highlight the review?
14:20:39 <pkoniszewski> andrearosa: if we are going to hide two distinct operations under one action - huge -1
14:21:00 <mdbooth> The concern is that libvirt's storage copy method will probably be a severe performance regression from rsync
14:21:06 <mdbooth> in almost all cases
14:21:22 <johnthetubaguy> mdbooth: seems like we might have to assume rsync lives until that bug is fixed then?
14:21:23 <davidgiluk> kashyap: The libvirt work is still in progress; there is also an experimental OpenStack set that the guys at UMU have done
14:21:24 <mdbooth> So while I think we should go ahead with it, rsync needs to remain a first-class citizen for the moment
14:21:43 <johnthetubaguy> mdbooth: that sounds like a good approach
14:21:46 <mdbooth> Yes. It can live differently, using libvirt storage pools.
14:22:06 <mdbooth> But I don't think we can deprecate it, because some people do use it, and they're unlikely to be happy.
14:22:07 <paul-carlton> mdbooth, rsync is not an option for security reasons
14:22:22 <PaulMurray> paul-carlton, that's not true for everyone
14:22:28 <johnthetubaguy> paul-carlton: well, you can choose performance vs security, right?
14:22:41 <paul-carlton> we really need the libvirt perf improvement danpb proposed before this is viable
14:23:10 <mdbooth> paul-carlton: I don't think we need to go that far. The storage pools stuff looks great. We'll just have to retain some cruft for a bit.
14:23:20 <andrearosa> paul-carlton: it is an option now. I think that the recap made by danpb in his comment is a reasonable approach
14:23:27 <johnthetubaguy> mdbooth: +1
14:23:29 <paul-carlton> true, we should do it, but a lot of people (including HPE) will not use it if the choice is rsync or slow
14:23:56 <johnthetubaguy> paul-carlton: agreed, we just don't have an alternative, and it seems like slow might get fixed soonish
14:24:39 <PaulMurray> maybe we can find people to fix the slow option - we could look into that
14:24:50 <johnthetubaguy> yeah, I think that's the correct approach here
14:24:51 <PaulMurray> is anyone working on it now?
14:24:55 <mdbooth> I think it's vaguely in motion.
14:24:55 <paul-carlton> if so then fine, I'm not arguing against doing the work, I'm saying that there will be a class of customers who won't use it till the libvirt fix is done
14:25:09 <mdbooth> danpb sounded interested. I've been looking at the code this morning.
14:25:40 <PaulMurray> #info storage pools version of migration will be slow - need to keep rsync version and look to improve performance
14:25:57 <paul-carlton> I'd prefer to implement based on the assumption that libvirt will fix performance issues in due course
14:26:04 <johnthetubaguy> we should get that detail in the spec, but I think the existing comments make that clear
14:26:35 <PaulMurray> paul-carlton, mdbooth I think this all sounds ok
14:26:56 <PaulMurray> do the change - keep the rsync option - organise getting the performance improvement
14:27:02 <PaulMurray> that can be done
14:27:07 <mdbooth> Thrash out the detail in the spec?
14:27:14 <PaulMurray> unless anyone thinks I'm being naive
14:27:24 <mdbooth> Sounds good to me
14:27:39 <PaulMurray> let's do that then
14:27:57 <PaulMurray> Any other specs - they are the main thing at the moment
14:28:08 <johnthetubaguy> I am curious about the query one
14:28:09 <PaulMurray> Any others with urgent issues to go over, that is
14:28:10 <johnthetubaguy> and cancel
14:28:30 <PaulMurray> johnthetubaguy, go ahead
14:29:06 <johnthetubaguy> do we agree on how that API should look? I assume similar to the pause/force-end one?
14:29:15 <johnthetubaguy> the cancel action, I mean
14:29:39 <PaulMurray> paul-carlton, ? ~^^
14:29:45 <johnthetubaguy> looks like the latest version is going that way, which is cool for me
14:29:59 <paul-carlton> yep
14:30:24 <johnthetubaguy> so the current version doesn't have much query in it
14:30:33 <johnthetubaguy> by which I mean the progress reporting
14:31:01 <paul-carlton> I took it out and changed the name
14:31:01 <pkoniszewski> I thought that they decided that progress is already reported somehow
14:31:19 <paul-carlton> yes, because progress is reported in the instance details
14:31:29 <johnthetubaguy> ah, with the percentages
14:31:40 <mdbooth> percentage of what?
14:31:43 <mdbooth> Disk transfer?
14:31:44 <PaulMurray> paul-carlton, it still says query in the title
14:31:46 <mdbooth> Convergence?
14:32:00 <johnthetubaguy> mdbooth: good question
14:32:04 <mdbooth> How can you tell the difference between the transfer of a big disk and a VM which is slow to converge?
14:32:27 <shaohe_feng1> percentage of RAM, disk?
14:32:42 <johnthetubaguy> mdbooth: yeah, I think that's what I would love to see, disk transfer vs convergence step
14:33:05 <paul-carlton> I'll fix the title
14:33:38 <PaulMurray> Maybe that should be split out as a separate issue, as there is something already there for query
14:33:48 <PaulMurray> just fix the title as paul-carlton says
14:33:50 <johnthetubaguy> PaulMurray: +1 for a separate spec on this
14:33:51 <paul-carlton> I think if it is slow to converge you should see the percentage complete going up and down
14:34:25 <paul-carlton> There is only one spec, abort; I'll purge it of any mention of query, I missed the title
14:34:44 <PaulMurray> paul-carlton, good
14:34:53 <johnthetubaguy> It would be nice to be explicit about this at some point, but yeah, let's track that separately
14:35:03 <johnthetubaguy> does anyone want to talk that spec?
14:35:14 <johnthetubaguy> oops
14:35:15 <johnthetubaguy> take
14:35:22 <PaulMurray> I'd rather move to the next topic
14:35:30 <PaulMurray> But first a note to everyone
14:36:01 <PaulMurray> please do review the specs on that list. We can target getting things in for the spec cutoff
14:36:12 <PaulMurray> even though it is not absolute for priorities
14:36:23 <PaulMurray> I really want to make progress as fast as possible
14:36:28 <shaohe_feng1> johnthetubaguy: do we need smart tuning (such as migration, pause) when convergence is slow, i.e. when the percentage complete is going up and down?
14:36:42 <johnthetubaguy> #help would be good for someone to take on writing up better live-migrate progress
14:36:55 <PaulMurray> #topic CI status
14:37:02 <johnthetubaguy> shaohe_feng1: I think the pause and cancel specs have a lot of that covered already
14:37:10 <PaulMurray> I don't see tdurkov
14:37:23 <PaulMurray> Does anyone know the status of CI?
14:37:40 <PaulMurray> I know there is a job he has a review for
14:37:44 <shaohe_feng1> such as migration, pause/such as compress, pause
14:38:16 <PaulMurray> I also don't have the review links to hand
14:38:20 <davidgiluk> shaohe_feng1: QEMU 2.5 has an improved autoconverge
14:38:22 <kashyap> PaulMurray: For the CI to work, multi-node CI ought to be working successfully first, no?
14:38:40 <PaulMurray> kashyap, the plan is to create a new CI job for live migration
14:38:44 <PaulMurray> then add coverage to it
14:38:54 <johnthetubaguy> yeah, the discussion was to create a new one
14:38:56 <PaulMurray> The idea is to separate it from other instabilities
14:39:09 <kashyap> Sure.
14:39:47 <paul-carlton> johnthetubaguy, I'll look at the way progress is calculated and propose a solution that meets everyone's needs, if you like
14:39:53 <PaulMurray> I know Timofei Durakov is part way there
14:40:07 <shaohe_feng1> davidgiluk: so nova does not need a tuning strategy, right?
14:40:25 <PaulMurray> So I guess there is nothing more to say on this one
14:40:32 <johnthetubaguy> paul-carlton: sounds good
14:40:39 <pkoniszewski> one more thing
14:40:48 <kashyap> shaohe_feng1: I think topics are being mixed up, probably good for open floor discussion.
14:40:50 <PaulMurray> pkoniszewski, on CI?
14:41:06 <pkoniszewski> oh, thought it's open discussion already, sorry! :)
14:41:16 <shaohe_feng1> kashyap: sorry. go ahead with CI.
14:41:23 <PaulMurray> no - going on with the agenda
14:41:25 <davidgiluk> shaohe_feng1: The qemu stuff is experimental and still needs the libvirt stuff wiring up, but it should do appropriate CPU limiting - it would be good to check that out before trying to write a new one
14:41:38 <PaulMurray> #topic Bugs
14:41:53 <PaulMurray> shaohe_feng1, this is your spot I think?
14:42:12 <PaulMurray> You have been looking at bugs in Intel and Rackspace, right?
14:42:29 <PaulMurray> https://docs.google.com/spreadsheets/d/19MFatOpjePS4JtkVHXCh6Qa8XUf6T2t0Igy1PucZ3Zk/edit#gid=2127877307
14:42:44 <shaohe_feng1> PaulMurray: maybe we can discuss pkoniszewski's bug fix.
14:43:04 <pkoniszewski> this list is a bit outdated
14:43:16 <PaulMurray> Do you have a link?
14:43:24 <shaohe_feng1> PaulMurray: https://review.openstack.org/#/c/168916/
14:43:34 <PaulMurray> #link https://review.openstack.org/#/c/168916/
14:44:34 <shaohe_feng1> PaulMurray: https://review.openstack.org/#/c/235994/
14:44:35 <PaulMurray> I haven't looked at this yet, can you tell us something about it?
14:45:05 <PaulMurray> #link https://review.openstack.org/#/c/235994/
14:46:02 <shaohe_feng1> PaulMurray: yes. we need to set the correct VM status when the live-migration RPC call times out
14:46:32 <shaohe_feng1> pkoniszewski is working on it.
14:46:57 <PaulMurray> Do you need anything, or just reviews?
14:47:00 <bauzas> less than 15 mins left
14:47:06 <PaulMurray> bauzas, thanks
14:47:32 <shaohe_feng1> PaulMurray: we need a little discussion.
14:48:00 <PaulMurray> Can I suggest it goes on the ML this time?
14:48:14 <shaohe_feng1> PaulMurray: yes.
14:48:18 <pkoniszewski> ML should be way better for this problem
14:48:22 <pkoniszewski> it's a bit weird
14:48:26 <PaulMurray> My understanding is that there are a few people going through the live migration bugs
14:48:36 <PaulMurray> Is that still true
14:48:38 <PaulMurray> ?
14:49:23 <andrearosa> no replies, so probably you are right
14:49:31 <PaulMurray> Maybe it's best if I contact you about this separately
14:49:35 <PaulMurray> moving on
14:49:40 <PaulMurray> #topic Open
14:49:45 <mdbooth> Did I miss the bug list?
14:50:03 <mdbooth> Ah, the Google spreadsheet
14:50:16 <PaulMurray> mdbooth, there are links on the meeting page and on the Google sheet
14:50:34 <PaulMurray> mikal has put reviews on the priority review page:
14:50:50 <PaulMurray> #link https://etherpad.openstack.org/p/mitaka-nova-priorities-tracking
14:51:05 <PaulMurray> Please return to this regularly to get subteam reviews done
14:51:23 <PaulMurray> then when reviews are ready we can bring them to the attention of cores
14:51:24 <shaohe_feng1> PaulMurray: do you mean this Google link? https://docs.google.com/spreadsheets/d/19MFatOpjePS4JtkVHXCh6Qa8XUf6T2t0Igy1PucZ3Zk/edit#gid=2127877307
14:52:15 <PaulMurray> If you have code ready to review, put it on the page in the subteam section
14:53:25 <johnthetubaguy> +1 for that etherpad
14:53:31 <johnthetubaguy> would be great to get the bug fixes listed in there
14:53:36 <johnthetubaguy> if the code is up for review
14:53:37 <shaohe_feng1> +1.
14:54:00 <PaulMurray> johnthetubaguy, do you want the individual reviews in there?
14:54:15 <mdbooth> paul-carlton: Do you have code already for anything relating to libvirt storage pools?
14:55:17 <PaulMurray> Anything that needs to be brought up in this meeting now?
14:55:19 <paul-carlton> mdbooth, there is lots of code from Solly Ross's attempt at this but I've not done anything yet
14:55:26 <PaulMurray> (as opposed to discussed after)
14:55:38 <mdbooth> paul-carlton: Ok, let's chat offline, as I'm interested and don't want to tread on your toes.
14:55:56 <paul-carlton> I welcome the help, we should talk
14:55:58 <PaulMurray> mdbooth, thanks for stepping up for work to do
14:56:15 <PaulMurray> Thank you everyone for coming - please help me be as organised
14:56:29 <PaulMurray> as I can be - I will settle into chairing
14:56:43 <kashyap> PaulMurray: Good work keeping things on-topic :-)
14:56:45 <PaulMurray> Same time next week, 1400 UTC
14:56:57 <PaulMurray> kashyap, doing my best :)
14:57:03 <PaulMurray> #endmeeting