14:00:47 #startmeeting Nova Live Migration
14:00:48 Meeting started Tue Nov 17 14:00:47 2015 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:49 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:53 The meeting name has been set to 'nova_live_migration'
14:01:01 * bauzas lurks
14:01:11 * johnthetubaguy lurks with intent
14:01:17 Hi, Anyone here for live migration ?
14:01:21 o/
14:01:23 yep
14:01:23 o/
14:01:24 o/
14:01:24 o/
14:01:34 * kashyap waves
14:01:44 hi PaulMurray
14:01:46 hi PaulMurray
14:01:46 *
14:01:47 hi
14:01:50 o/
14:02:01 hi pkoniszewski
14:02:18 hi pkoniszewski
14:02:25 o/
14:02:28 Hi Everyone, just wait one minute in case someone is late
14:02:34 O/
14:03:17 looks like a good turn out, great to see :)
14:03:30 I thought the time is utc 1300 as the polled . okay anyway.
14:03:30 ok - that's long enough - thanks for coming everyone
14:04:08 eliqiao, interesting, you are second to say that - the poll was definitely 1400UTC, but we're here now anyway
14:04:31 I assume you have all seen the meeting page here: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:04:37 it has an agenda on it
14:04:41 daylight savings changed between now and the poll, that's always fun
14:04:59 And, the 6 etherpads it has URLs to :P
14:05:00 I will try to make sure the agenda far enough in advance that everyone gets to see it before
14:05:26 johnthetubaguy: get it :)
14:05:49 I'll also try to go through things promptly in the meeting so we can go early if there is nothing in particular to discuss
14:05:55 johnthetubaguy: got it. I'm also puzzled with the time.
14:06:21 Did the poll adjust the times for local time zones?
14:06:42 it did, i had no problems
14:06:56 Ah, I will remember that in future - sorry everyone
14:07:06 but i was also affected by daylight saving thing
14:07:37 #topic Specs status
14:07:50 I like this part of the year when I live in UTC, life is so much simpler!
14:07:54 PaulMurray: there is no daylight in our country.
14:08:15 shaohe_feng1, not at all, how do you see?
14:08:22 PaulMurray: :-)
14:08:25 :D
14:08:33 #link https://etherpad.openstack.org/p/mitaka-nova-spec-review-tracking
14:08:43 This is the priority spec review page
14:09:11 We have a few specs under live migrate
14:09:19 just scroll a little from the top
14:09:27 PaulMurray: daylight saving. :)
14:09:32 Two are already merged
14:10:06 I added the last one for alex_xu earlier today
14:10:16 PaulMurray: thanks
14:10:32 sorry join the party late
14:10:51 I want to discuss the pause VM during live migration for a moment, any others people want to mention?
14:11:25 #link https://review.openstack.org/#/c/229040/
14:11:48 PaulMurray: libvirt volumes may have hit an issue requiring a minor rethink
14:11:55 This is the pause VM one - it is very close to being done, but has a couple of points to clear
14:12:03 mdbooth, ok - come back to that in a mo
14:12:07 +1
14:12:15 I think it will be hard to control the status when the live-migration is running
14:12:28 " As an operator of an OpenStack cloud, I would like the ability to pause VM 40
14:12:28 during live migration. " why "during live migration"?
14:12:49 bzhou, it's really a way to push a migration through
14:13:00 bzhou to get rid of dirty pages
14:13:11 so, for me, the key thing here, is live-migrate can literally take forever, let's give operators a way to force the end of that process
14:13:14 yeah, that
14:13:20 does pause == cancel?
14:13:32 eliqiao, no
14:13:38 we will do both separately
14:13:44 as two features
14:13:48 no, cancel is different operation which will literally cancel this process
14:14:17 So the question on the spec is about naming
14:14:19 Can we support post-copy yet?
14:14:24 but also about future intent
14:14:31 not yet, post-copy should be there around N-release of openstack
14:14:33 do we need to resume it after LM is done?
14:14:36 Another option for an operator might be to cancel and post-copy
14:14:45 bzhou: we don't, libvirt does the job
14:14:52 [danpb] NB, when we make use of post-copy migration, cancel will be impossible once post-copy starts.
14:15:05 I see this from LM etherpad
14:15:12 I thought post copy was pending changes in qemu that are still being tested?
14:15:20 paul-carlton: exactly
14:15:29 PaulMurray: converge can also help to reduce the rate of dirty pages
14:15:36 a change to libvirt should be simple, tricky part will be OpenStack, e.g., networking
14:16:01 we should get back to Mitaka stuff I think
14:16:04 y
14:16:05 The question I would like to see an answer to is do we intend this only be the pause version
14:16:11 are we able to resume even if LM is NOT done?
14:16:16 post copy is not an option now, do not discuss it. Converge can help the idea of pausing the LM is a kind of last resort
14:16:34 or do we intend that there could be other options that will come under the same action
14:16:45 so I think we can add options later
14:16:50 post copy is ready for qemu?
14:16:57 If the second we need a different name - that is all I think
14:17:02 PaulMurray: I don't like this idea to have different actions under the same action
14:17:09 it brings confusions
14:17:10 shaohe_feng1: no
14:17:30 shaohe_feng1: It is merged upstream QEMU
14:17:35 And the relevant Kernel part.
14:17:38 thing is, we don't need to decide that now, except if we add pause in the API name, I guess
14:18:24 johnthetubaguy, agreed - so perhaps change name - makes it distinctly different to normal pause operation
14:18:32 just for clarity, we generally don't want features in Nova to depend on unreleased QEMU or libvirt
14:18:53 johnthetubaguy: +1
14:19:07 Let's move the discussion of the naming to the spec and move on
14:19:14 so yes, the current list of names
14:19:15 I like the johnthetubaguy's proposal to call it "live-migrate-force-end". That name doesn't have any implementation details and gives us the flexibility to change the actual actions in future
14:19:33 mdbooth, you had something to discuss
14:19:39 johnthetubaguy: Post copy dev in QEMU is on this channel - davidgiluk. Maybe he can comment which release it'll be in QEMU.
14:20:02 PaulMurray: Yes, although I posted review comments.
14:20:03 2.5
14:20:06 but force-end is kind of like cancel
14:20:07 code is already upstream
14:20:12 https://review.openstack.org/#/c/232053/
14:20:28 andrearosa: if we will change entire action in the future, e.g. to something that will dynamically throttle VM, +1 from me
14:20:31 mdbooth, does it need discussion here or just highlight the review?
14:20:39 andrearosa: if we gonna hide two distinct operations under one action - huge -1
14:21:00 The concern is that libvirt's storage copy method will probably be a severe performance regression from rsync
14:21:06 in almost all cases
14:21:22 mdbooth: seems like we might have to assume rsync lives until that bug is fixed then?
14:21:23 kashyap: The libvirt work is still in progress; there is also an experimental openstack set that the guys at UMU have done
14:21:24 So while I think we should go ahead with it, rsync needs to remain a first class citizen for the moment
14:21:43 mdbooth: that sounds like a good approach
14:21:46 Yes. It can live differently, using libvirt storage pools.
14:22:06 But I don't think we can deprecate it, because some people do use it, and they're unlikely to be happy.
14:22:07 mdbooth, rsync is not an option for security reasons
14:22:22 paul-carlton, that's not true for everyone
14:22:28 paul-carlton: well you can choose performance vs security right?
14:22:41 we really need the libvirt perf improvement danpb proposed before this is viable
14:23:10 paul-carlton: I don't think we need to go that far. The storage pools stuff looks great. We'll just have to retain some cruft for a bit.
14:23:20 paul-carlton: it is an option now. I think that the recap made by danp in his comment is a reasonable approach
14:23:27 mdbooth: +1
14:23:29 true, we should do it but a lot of people (including HPE) will not use it if the choice is rsync or slow
14:23:56 paul-carlton: agreed, we just don't have an alternative, and it seems like slow might get fixed soonish
14:24:39 maybe we can find people to fix the slow option - we could look into that
14:24:50 yeah, I think that's the correct approach here
14:24:51 is anyone working on it now?
14:24:55 I think it's vaguely in motion.
14:24:55 if so then fine, I'm not arguing against doing the work, I'm saying that there will be a class of customers who won't use it till the libvirt fix is done
14:25:09 danpb sounded interested. I've been looking at the code this morning.
14:25:40 #info storage pools version of migration will be slow - need to keep rsync version and look to improve performance
14:25:57 I'd prefer to implement based on the assumption that libvirt will fix performance issues in due course
14:26:04 we should get that detail in the spec, but I think the existing comments make that clear
14:26:35 paul-carlton, mdbooth I think this all sounds ok
14:26:56 do the change - keep rsync option - organise getting performance improvement
14:27:02 that can be done
14:27:07 Thrash out the detail in the spec?
14:27:14 unless anyone thinks I'm being naive
14:27:24 Sounds good to me
14:27:39 let's do that then
14:27:57 Any other specs - they are the main thing at the moment
14:28:08 I am curious about the query one
14:28:09 Any other with urgent issue to go over that is
14:28:10 and cancel
14:28:30 johnthetubaguy, go ahead
14:29:06 do we agree how that API should look, I assume similar to that pause/force-end one?
14:29:15 the action of cancel I mean
14:29:39 paul-carlton, ? ~^^
14:29:45 looks like the latest version is going that way, which is cool for me
14:29:59 yep
14:30:24 so the current version doesn't have much query in it
14:30:33 by which I mean, the progress reporting
14:31:01 I took it out and changed the name
14:31:01 i thought that they decided that progress is already reported somehow
14:31:19 yes, because progress is reported on instance details
14:31:29 ah, with the percentages
14:31:40 percentage of what?
14:31:43 Disk transfer?
14:31:44 paul-carlton, it still says query in the title
14:31:46 Convergence?
14:32:00 mdbooth: good question
14:32:04 How can you tell the difference between transfer of a big disk, and a vm which is slow to converge?
14:32:27 percentage of ram, disk?
14:32:42 mdbooth: yeah, I think that's what I would love to see, disk transfer vs convergence step
14:33:05 I'll fix the title
14:33:38 Maybe that should be split out as a separate issue as there is something already there for query
14:33:48 just fix the title as paul-carlton says
14:33:50 PaulMurray: +1 for a separate spec on this
14:33:51 I think if it is slow to converge you should see percentage complete going up and down
14:34:25 There is only one spec, abort, I'll purge it of any mention of query, missed the title
14:34:44 paul-carlton, good
14:34:53 It would be nice to be explicit about this at some point, but yeah, let's track that separately
14:35:03 does anyone want to talk that spec?
14:35:14 oops
14:35:15 take
14:35:22 I'd rather move to the next topic
14:35:30 But first a note to everyone
14:36:01 please do review the specs on that list. We can target getting things in for the spec cutoff
14:36:12 even though it is not absolute for priorities
14:36:23 I really want to make progress as fast as possible
14:36:28 johnthetubaguy: do we need a smart tune(such as migration, pause) when converge is slow when percentage complete going up and down
14:36:42 #help would be good for someone to take on writing up better live-migrate progress
14:36:55 #topic CI status
14:37:02 shaohe_feng1: I think the pause and cancel spec have a lot of that covered already
14:37:10 I don't see tdurkov
14:37:23 Does anyone know the status of CI?
14:37:40 I know there is a job he has a review for
14:37:44 such as migration, pause/such as compress, pause
14:38:16 I also don't have the review links to hand
14:38:20 shaohe_feng1: QEMU 2.5 has an improved autoconverge
14:38:22 PaulMurray: For the CI to work, first multi-node CI ought to be working successfully, no?
14:38:40 kashyap, the plan is to create a new CI job for live migration
14:38:44 then add coverage to it
14:38:54 yeah, the discussion was create a new one
14:38:56 The idea is to separate it from other instabilities
14:39:09 Sure.
14:39:47 johnthetubaguy, I'll look at the way progress is calculated and propose a solution that meets everyone's needs if you like
14:39:53 I know Timofei Durakov is part way there
14:40:07 davidgiluk: so nova do not need an strategy tune, right?
14:40:25 So I guess there is nothing more to say on this one
14:40:32 paul-carlton: sounds good
14:40:39 one more thing
14:40:48 shaohe_feng1: I think topics are being mixed up, probably good for open floor discussion.
14:40:50 pkoniszewski, on CI?
14:41:06 oh, thought it's open discussion already, sorry! :)
14:41:16 kashyap: sorry. go ahead CI.
14:41:23 no - going on with agenda
14:41:25 shaohe_feng1: The qemu stuff is experimental and still needs the libvirt stuff wiring up, but it should do appropriate CPU limiting - it would be good to check that out before trying to write a new one
14:41:38 #topic Bugs
14:41:53 shaohe_feng1, this is your spot I think?
14:42:12 You have been looking at bugs in Intel and Rackspace right?
14:42:29 https://docs.google.com/spreadsheets/d/19MFatOpjePS4JtkVHXCh6Qa8XUf6T2t0Igy1PucZ3Zk/edit#gid=2127877307
14:42:44 PaulMurray: maybe we can discuss the pkoniszewski bug fix.
14:43:04 this list is a bit outdated
14:43:16 Do you have a link?
14:43:24 PaulMurray: https://review.openstack.org/#/c/168916/
14:43:34 #link https://review.openstack.org/#/c/168916/
14:44:34 PaulMurray: https://review.openstack.org/#/c/235994/
14:44:35 I haven't looked at this yet, can you tell us something?
14:45:05 #link https://review.openstack.org/#/c/235994/
14:46:02 PaulMurray: yes. we need to set the correct VM status when live-migration RPC call timeout
14:46:32 pkoniszewski is working on it.
14:46:57 Do you need anything, or just reviews?
14:47:00 less than 15 mins left
14:47:06 bauzas, thanks
14:47:32 PaulMurray: we need a little discussion.
14:48:00 Can I suggest it goes on the ML this time
14:48:14 PaulMurray: yes.
14:48:18 ML should be way better for this problem
14:48:22 it's a bit weird
14:48:26 My understanding is that there are a few people going through the live migration bugs
14:48:36 Is that still true
14:48:38 ?
14:49:23 no replies so probably you are right
14:49:31 Maybe it's best if I contact you about this separately
14:49:35 moving on
14:49:40 #topic Open
14:49:45 Did I miss the bug list?
14:50:03 Ah, the google spreadsheet
14:50:16 mdbooth, there are links on the meeting page and on the google sheet
14:50:34 mikal has put reviews on the priority review page:
14:50:50 #link https://etherpad.openstack.org/p/mitaka-nova-priorities-tracking
14:51:05 Please return to this regularly to get subteam reviews done
14:51:23 then when reviews are ready we can bring them to the attention of cores
14:51:24 PaulMurray: do you mean this google link? https://docs.google.com/spreadsheets/d/19MFatOpjePS4JtkVHXCh6Qa8XUf6T2t0Igy1PucZ3Zk/edit#gid=2127877307
14:52:15 If you have code ready to review put it on the page in the subteam section
14:53:25 +1 for that etherpad
14:53:31 would be great to get the bug fixes listed in there
14:53:36 if the code is up for review
14:53:37 +1.
14:54:00 johnthetubaguy, do you want the individual reviews in there - there?
14:54:15 paul-carlton: Do you have code already for anything relating to libvirt storage pools?
14:55:17 Anything that needs to be brought up in this meeting now?
14:55:19 mdbooth, there is lots of code from Solly Ross's attempt at this but I've not done anything yet
14:55:26 (as opposed to discussed after)
14:55:38 paul-carlton: Ok, let's chat offline as I'm interested and don't want to tread on your toes.
14:55:56 Welcome the help, we should talk
14:55:58 mdbooth, thanks for stepping up for work to do
14:56:15 Thank you everyone for coming - please help me be as organised
14:56:29 as I can be - I will settle into chairing
14:56:43 PaulMurray: Good work, keeping things on-topic :-)
14:56:45 Same time next week 1400UTC
14:56:57 kashyap, doing my best :)
14:57:03 #endmeeting