14:00:00 #startmeeting Nova Live Migration
14:00:00 Meeting started Tue May 10 14:00:00 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:02 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:04 The meeting name has been set to 'nova_live_migration'
14:00:13 o/
14:00:15 o/
14:00:15 o/
14:00:16 hi all
14:00:22 o/
14:00:25 hi!
14:00:32 o/
14:00:34 o/
14:00:41 o/
14:00:42 * kashyap waves
14:01:04 For those with short memory: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:01:10 Agenda ^^
14:01:10 o/
14:01:21 o/
14:01:35 hi
14:01:42 #topic CI
14:01:55 tdurakov has been away - so welcome back
14:02:16 tdurakov, any update?
14:02:46 PaulMurray: as discussed, mostly working on my spec now, but happy to help with storage pools for this
14:03:03 any details on that?
14:03:14 What was the experimental stability like?
14:03:43 the same, going to check whether there are xenial nodes in nodepool already
14:03:47 i noticed the experimental live migration job wasn't running with latest libvirt/qemu yet?
14:03:55 mriedem: true
14:04:12 i thought that job was going to use that new repo that installs latest libvirt/qemu?
14:04:14 haven't seen xenial multinode yet
14:04:31 the one that markus and tonyb worked on
14:04:32 tdurakov: What about the coverage of the Mitaka features? Are you done?
14:04:58 mriedem: is that finished?
14:05:04 mitaka features aren't covered, mostly
14:05:06 jlanoux: ^
14:05:14 I'm just getting back to the CI, did markus get it working?
14:05:27 I mean with the latest versions of libvirt etc.
14:05:33 yeah i thought so, awhile ago
14:05:35 pkoniszewski: ok
14:05:49 mriedem, ok, so we are playing catchup then
14:05:53 mriedem: will add this to the job today then
14:05:58 talk to markus_z and/or tonyb about it
14:06:08 mriedem: ok
14:06:15 last i knew they just wanted to move the git repo under the openstack namespace
14:06:23 but they had something in project-config for using this
14:06:51 mriedem: will ask them after this meeting
14:07:10 #action tdurakov to follow up with markus_z/tonyb about the trunk libvirt/qemu repo
14:07:20 yup
14:07:50 tdurakov, let me know if you need help
14:08:11 mriedem: On a slightly related note, there's now a new DevStack plugin up for review (revived from an old review) that lets one install custom libvirt / QEMU from tar releases -- https://review.openstack.org/#/c/313568/
14:08:19 It's an external plugin though.
14:08:40 PaulMurray: as a plan B for that, who could help with adding multinode xenial to nodepool?
14:08:41 ok, but those are like daily builds?
14:09:08 jlanoux, do you know anything about nodepool?
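For context, a DevStack plugin like the one in review 313568 is enabled from local.conf. A minimal sketch of what using it might look like - the plugin name, repo URL, and version variables below are assumptions for illustration, not taken from the review:

    [[local|localrc]]
    # Hypothetical plugin name and repo URL - the review has the real ones.
    enable_plugin custom-libvirt-qemu https://git.example.org/custom-libvirt-qemu
    # Hypothetical knobs selecting which official release tarballs to build.
    LIBVIRT_VERSION=1.3.4
    QEMU_VERSION=2.6.0

The enable_plugin line itself is standard DevStack syntax; the rest would come from the plugin's own README.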
14:09:24 PaulMurray: nope
14:09:25 tdurakov: i wonder if pabelanger could help with that in infra
14:09:28 if not I will go to andreaf
14:09:53 (I hope that's the right nick)
14:10:02 and see if he can help
14:10:05 smth like that
14:10:06 https://github.com/openstack-infra/project-config/blob/a755ee6d0257faafb4204a843f5265e935689639/nodepool/nodepool.yaml#L73
14:10:51 mriedem: If there are daily tarballs available, yes
14:11:09 kashyap: ok, i guess i'd rather gate on actual releases, rather than daily builds
14:11:14 we already have problems with stability
14:11:20 we don't need daily tarballs here i think
14:11:24 mriedem: It by default uses official releases
14:11:49 we're going to need multinode xenial regardless, so i think that's time well spent
14:12:22 pkoniszewski: It's not daily, it uses official tarballs, from here (for QEMU) and a similar URL for libvirt: http://wiki.qemu-project.org/download/
14:12:53 got it
14:13:30 #link Devstack plugin for qemu/libvirt versions: https://review.openstack.org/#/c/313568/
14:13:50 I like the title of that ^^ "First version"
14:14:24 PaulMurray: :-) Yeah, they could improve the commit messages
14:14:30 moving on slightly
14:14:39 #topic Libvirt Storage Pools
14:14:52 What do we need for CI for storage pools?
14:15:06 for the dependent refactor,
14:15:07 I addressed that in the recent spec update
14:15:14 lvm, rbd, ploop
14:15:16 i'd like at least an lvm-backed job
14:15:19 * mdbooth digs it out
14:15:26 we have rbd and ploop ci already
14:15:40 plus shared storage and non shared
14:15:43 we're missing lvm - it might start as an experimental job for nova
14:15:50 ceph is shared storage
14:15:55 Note that Jenkins currently only tests the Qcow2 and Rbd(ceph) backends
14:15:56 in the gate. All current libvirt tempest jobs run by Jenkins use the
14:15:56 default Qcow2 backend except gate-tempest-dsvm-full-devstack-plugin-ceph, which
14:15:56 uses Rbd. We additionally have coverage of the ploop backend in
14:15:56 check-dsvm-tempest-vz7-exe-minimal run by Virtuozzo CI. This means that we
14:15:56 currently have no gate coverage of the Raw and Lvm backends.
14:15:58 PaulMurray: I would ask in the -infra room, there is already a xenial single node image https://github.com/openstack-infra/project-config/blob/a755ee6d0257faafb4204a843f5265e935689639/nodepool/nodepool.yaml#L88 so it shouldn't be too difficult to set up a multinode env based on xenial - clarkb was working a lot on setting up the original multinode environment I believe
14:16:00 lvm/ephemeral is non-shared
14:16:07 mriedem, what about adding this to the existing live-migration job instead?
14:16:14 nfs shared would be good too, different from ceph
14:16:25 thanks andreaf
14:16:33 shared nfs migration tests have been quite good for finding bugs
14:16:49 We're only interested in a limited set of tests for these backends
14:16:53 tdurakov: does the live migration job test nfs?
14:16:58 yes
14:17:02 and ceph too
14:17:07 We're not going to run the full suite against each backend, are we?
14:17:22 can the live migration job also use lvm?
14:18:05 mdbooth: i was more concerned with the big refactor
14:18:09 mdbooth: I'd prefer to start with live-migration, it would be expensive to have full multinode jobs for all backends
14:18:46 i'm not talking about a multinode job for lvm, just an experimental queue job that runs on nova and could run against these refactor changes
14:18:46 mriedem: Right. We need coverage, but how complete?
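For reference, the backend coverage being discussed corresponds to the libvirt images_type option in nova.conf, which selects how instance ephemeral disks are stored. A minimal sketch (option names are real as of Mitaka; the volume group name is a placeholder):

    [libvirt]
    # One of: default, raw, qcow2, lvm, rbd, ploop.
    # qcow2 is what jobs get when nothing is set explicitly; rbd is the
    # ceph job; ploop is covered by Virtuozzo CI; raw and lvm have no
    # gate coverage yet, per the discussion above.
    images_type = lvm
    # lvm additionally needs a volume group to carve instance disks from:
    images_volume_group = <vg-name>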
14:18:47 mriedem: I thought this was the plan
14:18:50 one question - won't it take too much time to execute all tests in gate if we also add storage pools to the existing CI? I mean, we still need to cover all mitaka features there
14:19:30 * mdbooth is thinking about our poor testing resources
14:19:32 well, the live migration job isn't going to test snapshots, right?
14:19:42 the experimental queue is on-demand
14:19:51 mriedem: it's not testing yet
14:20:07 but we could enable this later, after fixing stability issues
14:20:22 enable what later?
14:20:34 test snapshots
14:20:41 snapshots in LM CI? that's a totally different thing, isn't it?
14:20:50 pkoniszewski: yes
14:21:04 that's kind of my point, we don't need to test snapshots in the LM job
14:21:05 renaming will help..
14:21:32 I thought the plan was to split live migration tests from other things - no point in extending it out from there
14:21:38 yeah, +1
14:21:49 but i think it's useful to have a job, in the experimental queue, that runs with lvm, which we can run on mdbooth's refactor series and which will test the compute api and virt driver for things that the LM job won't test
14:21:54 can't find a reason to mix live migration with other things, it is complex enough
14:22:07 mriedem: Yup. Also the 'Raw' backend, don't forget.
14:22:18 mdbooth: yeah
14:22:20 Bizarrely we don't currently have coverage of that, either
14:22:21 we have a job that already tests 3 different configs, we could expand it with lvm, and add all multinode actions there
14:22:34 mdbooth: we could maybe change the ceph job to use raw...
14:23:00 mriedem: ? Then it wouldn't be the ceph job. Have I misunderstood?
14:23:09 oh right, hehe
14:23:12 tdurakov, I would expect the LM job to test enough back ends
14:23:24 mdbooth: forgot that has its own special imagebackend
14:23:32 so it seems good to put lvm there if it's not already
14:23:36 It's special
14:23:45 well we have gate-tempest-dsvm-full (n-net) and gate-tempest-dsvm-neutron-full, those both use qcow2 right?
14:24:02 Everything which doesn't use something explicitly uses qcow2
14:24:13 so maybe we make one of those use raw
14:24:15 * mdbooth audited them the other day
14:24:22 mriedem: Makes sense
14:24:44 and then we just have 1 new experimental queue job for lvm
14:24:55 now having said this, changing one of the integrated gate jobs is i think branchless,
14:25:07 mriedem: We could switch one of the other jobs to lvm, right?
14:25:08 so if changing it to use raw introduces a bunch of failures....that would be bad
14:25:14 There are plenty of them
14:25:25 i have a feeling lvm will be racey
14:25:45 just based on what i've seen with the lxc job that uses it
14:25:53 Hmm, ok. Of course we want to know that, but yeah...
14:26:05 anyway, i think i can hack up a devstack-gate change to test lvm and see how it looks
14:26:21 same for raw
14:26:53 mriedem, shall we call that an action - or just thinking out loud?
14:26:59 sign me up
14:27:27 #action mriedem to hack up a devstack-gate change to test lvm
14:27:50 (and Raw)
14:28:18 #undo
14:28:31 #action mriedem to hack up devstack-gate changes to test lvm and raw image backends
14:28:54 i'll also review https://review.openstack.org/#/c/302117/ after this meeting
14:29:31 mriedem: Appreciated, thanks
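As a rough illustration of the experimental-queue idea mriedem describes: in the Zuul v2 layout used by openstack-infra/project-config at the time, an on-demand job is listed under the project's experimental pipeline and only runs when someone leaves a "check experimental" comment on a change. The job name below is hypothetical:

    # zuul/layout.yaml (sketch)
    projects:
      - name: openstack/nova
        experimental:
          - gate-tempest-dsvm-nova-lvm-nv

The job definition itself would still need a Jenkins Job Builder template plus a devstack-gate hook that sets the lvm image backend.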
14:29:47 paul-carlton, mdbooth: are the two follow-on specs ready for broader review? looks like they still need subteam review
14:29:56 yep
14:30:17 We discussed 1 aspect of libvirt storage pools this morning
14:30:27 #link https://review.openstack.org/#/c/310538/
14:30:51 * mdbooth hasn't looked at that one in detail, yet
14:30:52 #link https://review.openstack.org/#/c/310505/
14:31:18 paul-carlton: Ah, I see you've updated it. Need to re-review.
14:31:50 good - so it's in hand - anyone else can help review too
14:32:18 #topic Specs
14:32:41 for specs: https://etherpad.openstack.org/p/newton-nova-priorities-tracking
14:32:55 if your spec isn't there just add it
14:33:41 Does anyone have one they want to mention?
14:34:12 can we briefly discuss this: https://review.openstack.org/#/c/301509/
14:34:35 go ahead
14:34:49 andrearosa suggested including information about the migration type in the migration object
14:35:09 something like "postcopy-status" --> disabled/enabled/active
14:35:13 luis5tb: it is not mandatory
14:35:17 luis5tb: Is that spec related to https://review.openstack.org/#/c/306561/ ?
14:35:19 what is your view about that?
14:35:37 something I'd like to have more opinions on
14:35:45 me too
14:35:54 I think it could be a good idea
14:36:12 migration works not only for libvirt - would it be valid to expose post-copy over the migration entity?
14:36:20 also, I think we need to at least include information about the memory iterations, besides the remaining data and the other stats already there
14:36:24 tdurakov: I think not
14:36:29 i don't think that we want to expose low level details through the API
14:36:43 I don't think we should expose iterations, either
14:36:44 mdbooth: so, here is the answer
14:36:44 nope, better to let the driver code work out if abort is allowed
14:37:03 i.e. have we switched to post copy
14:37:38 * mdbooth has a slightly better understanding of post-copy this week than last week
14:38:08 paul-carlton: what about saving such switches to instance-actions instead?
14:38:11 In my view, it's something which should just happen when the user requests force completion
14:38:33 as a separate instance-action-event step
14:38:35 paul-carlton: Be a little careful of races around that; if you do 'have we switched to postcopy? No - ok, abort' then you might switch to postcopy between asking and aborting
14:38:52 that would be ok I guess but the current code doesn't look at that
14:39:22 libvirt will deny the abort if the switch to postcopy has already been triggered anyway
14:39:34 davidgiluk: races story https://review.openstack.org/#/c/287997/
14:39:34 well, we don't save an instance action when we increase downtime during LM
14:39:37 davidgiluk, yep, you'd need to take a lock, check, then proceed
14:39:52 pkoniszewski: what about starting to do this?
14:39:57 why?
14:40:21 instance actions are pretty explicit, they allow storing even tracebacks
14:40:34 would be useful to get details on migrations
14:40:38 thoughts?
14:40:44 but checking that would not prevent the race
14:40:48 tdurakov: Yes, especially if they go wrong
14:41:26 tdurakov: yes that was my idea. I'd like to have something tell me what happened. I do not have any real use cases atm but I bet that for debugging purposes it could be handy
14:41:39 slightly confused, are you thinking of having lots of instance actions for a migration?
14:41:45 aren't logs enough for debugging purposes?
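To make the race davidgiluk points out concrete, here is a rough Python sketch of the check-then-abort problem and the locking approach discussed, using the libvirt-python API (migrateStartPostCopy exists as of libvirt 1.3.3; the lock and surrounding structure are illustrative, not Nova's actual code):

    import threading

    import libvirt

    # Illustrative per-migration lock; Nova would use its own synchronization.
    _migration_lock = threading.Lock()
    _switched_to_postcopy = False

    def switch_to_postcopy(dom):
        """Called by the migration monitor on a force-complete request."""
        global _switched_to_postcopy
        with _migration_lock:
            # Requires the migration to have been started with
            # libvirt.VIR_MIGRATE_POSTCOPY in its flags.
            dom.migrateStartPostCopy()
            _switched_to_postcopy = True

    def abort_migration(dom):
        """Called on an abort request; must not race with the switch."""
        with _migration_lock:
            if _switched_to_postcopy:
                # After the switch the destination holds the authoritative
                # memory state, so aborting is no longer safe; newer libvirt
                # denies the abort anyway.
                raise RuntimeError('migration already switched to post-copy')
            dom.abortJob()

Without the lock, checking the flag and then calling abortJob() leaves exactly the window davidgiluk describes.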
14:41:48 better to have the migration monitor thread in the driver get a lock on the instance before switching to post copy so another thread can't abort it
14:41:57 there are already separate steps for migration
14:42:09 we will save tons of instance actions per migration if we go this way
14:42:21 pkoniszewski: why tons?
14:42:33 save only several events from libvirt
14:42:35 not all
14:42:48 I'm not proposing to store the whole progress
14:42:50 i just mentioned downtime, which is increased iteratively during the LM process
14:42:54 do we need to save it?
14:43:05 pkoniszewski: worth discussing
14:43:09 an admin will look at instance actions to see what happened to a finished migration
14:43:17 needs to be decipherable for the admin
14:43:19 saving the switch to post copy is a reasonable thing to do, but as user information
14:43:33 paul-carlton, agreed
14:43:35 I was thinking of saving the switch
14:43:42 paul-carlton: live-migration is an admin action
14:43:43 but not lots of progress stuff
14:43:45 not user
14:43:48 but not the progress
14:43:59 surely no progress
14:44:13 but for the switch we need to save all the data
14:44:23 like memory remaining, how many cycles
14:44:29 why?
14:44:31 I would expose live migration started, live migration ended, and live migration aborted
14:44:31 pkoniszewski: do we?
14:44:34 To the user
14:44:34 because if we save only the switch, it means nothing
14:44:36 And nothing else
14:44:37 the switch could be just based on the memory iteration
14:44:48 okay, lm switched to post-copy, but what does it really mean?
14:45:03 and how would it help?
14:45:15 exposing the post copy switch, as a force complete action, to be driver agnostic is reasonable
14:45:31 it is informative for user and admin
14:45:34 pkoniszewski: it would be explicit to the operator what steps were done to converge the migration
14:45:36 If the user is trying to debug why their performance went a bit funky for a bit, knowing that the funkiness occurred during a live migration is sufficient
14:45:42 They don't need to know every detail of it.
14:46:03 mdbooth: Do they have an easy way to get the detail if they need it?
14:46:18 * PaulMurray is thinking about the time....
14:46:44 Shall we continue on the spec and move on?
14:46:44 davidgiluk: I believe checking logs is the only way
14:46:46 so once we start using auto converge we will save all auto converge steps, all downtime steps, and the post-copy switch?
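For what it's worth, the instance actions being discussed are already queryable today via python-novaclient, which is roughly where any extra migration events would surface (the server and request-id values are placeholders):

    # List the actions recorded against a server, then drill into one.
    nova instance-action-list <server>
    nova instance-action <server> <request-id>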
14:46:47 Can we talk about https://review.openstack.org/#/c/301509/
14:46:52 https://review.openstack.org/#/c/301561
14:46:56 sounds like a small book in the DB for a single migration
14:47:19 paul-carlton, we need to cover the agenda
14:47:25 let's come back at the end if there's time
14:47:26 ok
14:47:38 pkoniszewski: not all, but let's discuss this in the spec instead, will leave a comment
14:47:46 #topic Review Requests
14:47:47 tdurakov: +1
14:48:00 https://review.openstack.org/#/c/310352/
14:48:03 eliqiao,
14:48:35 PaulMurray: https://review.openstack.org/#/c/287997/
14:48:49 still not merged
14:49:31 Also mechanical cleanup: https://review.openstack.org/#/c/308876/
14:49:39 tdurakov, noted - let's see if we can get help with it
14:50:17 https://review.openstack.org/#/c/310707/
14:50:31 mdbooth, you've been doing reviews - well done
14:50:57 your name's on most things I look at
14:51:21 eliqiao is not here
14:51:30 that explains it
14:51:38 ...the silence I mean
14:51:39 we need some eyes here https://review.openstack.org/#/c/310707/
14:51:46 it's a regression in mitaka
14:51:50 can I get reviews on https://review.openstack.org/#/c/307131/ and https://review.openstack.org/#/c/306561/ please, as well as https://review.openstack.org/#/c/310505/
14:51:54 and it requires a backport
14:52:03 pkoniszewski, yes - I noticed that (ref above too)
14:52:19 that made me think about the
14:52:30 CI with latest versions discussion earlier
14:52:56 1.3.1 does not work for selective block migration on tunnelled connections
14:54:14 ok
14:54:27 #topic Open Discussion
14:54:35 only a few minutes left
14:54:46 anything else to cover?
14:54:56 (quickly)
14:54:58 yeah
14:54:59 one question
14:55:21 paul-carlton: Can you figure out how your spec for automatic live migration completion goes together with luis5tb's postcopy spec?
14:55:24 do we really need a spec for that? https://review.openstack.org/#/c/248358/
14:55:39 davidgiluk paul-carlton +1
14:55:45 I'm also confused about that
14:55:57 i mean, this is something that is already supported in nova, right now everyone can use auto converge by adding a flag to live_migration_flags
14:56:07 davidgiluk, I think it is dependent on luis5tb's spec
14:56:19 because we want to remove live_migration_flags, this new flag is just to keep a way to turn auto converge on, nothing more
14:56:26 mriedem_meeting: ^^
14:56:35 I'd like to see all of these merged into a single spec
14:56:48 'How do I force my live migration to complete' is 1 topic
14:57:18 mdbooth: yeh
14:57:35 not a good idea to mix them up
14:57:44 post copy is one thing
14:57:46 mdbooth: these are two different topics
14:57:51 post-copy is a way to force completion
14:57:58 paul-carlton: I think it would be good to see how yours, postcopy and autoconverge go together; whether they're really 3 specs or 1
14:58:01 the auto completion stuff builds on it
14:58:03 Right, and so is auto converge
14:58:07 auto converge and compression are just to increase the chances to converge
14:58:10 But they're both parts of the same problem
14:58:22 Treating them separately is confusing
14:58:34 auto converge will never force completion, really
14:58:43 even if you cut 99% of cpu cycles
14:58:58 talking to danpb, auto-converge basically needs to stop the instance to get it done
14:59:00 This is going to go over the end of the meeting
14:59:07 do we want another time to talk about it?
14:59:10 post-copy is much more effective
14:59:23 we can discuss on the specs?
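For reference, "adding a flag to live_migration_flags" refers to the Mitaka-era live_migration_flag option in nova.conf's [libvirt] section; turning auto-converge on is just appending libvirt's flag to that list. A sketch - the first three flags shown are typical defaults, but check your deployment's value before copying:

    [libvirt]
    live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_AUTO_CONVERGE

Per the discussion above, the spec under review (248358) is about keeping a dedicated way to enable auto-converge once the raw live_migration_flags option is removed.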
14:59:24 post-copy is a way to force completion
14:59:26 auto converge is not
14:59:27 paul-carlton: Agreed. We need to discuss that in no more than 1 spec :)
14:59:52 I'll have to cut off now, so let's continue in the nova room
14:59:59 #endmeeting