14:00:21 <tdurakov> #startmeeting Nova Live Migration
14:00:22 <openstack> Meeting started Tue Jan 19 14:00:21 2016 UTC and is due to finish in 60 minutes. The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:23 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:26 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:32 <mdbooth> o/
14:00:33 <andrearosa> hi
14:00:45 <eliqiao> o/
14:00:48 <tdurakov> hi
14:00:49 <pkoniszewski> o/
14:01:08 <tdurakov> agenda: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:01:38 <tdurakov> so, let's start
14:01:47 <tdurakov> #topic priority reviews
14:01:50 <shaohe_feng> hi all
14:02:45 <tdurakov> any reviews that need to be discussed?
14:03:07 <shaohe_feng> tdurakov: one, please
14:03:56 <tdurakov> shaohe_feng, sure, which one?
14:04:15 <shaohe_feng> tdurakov: https://review.openstack.org/#/c/258813/
14:05:01 <shaohe_feng> I'd like to discuss the interval for writing to the DB
14:05:48 <shaohe_feng> tdurakov: https://review.openstack.org/#/c/258813/4/nova/virt/libvirt/driver.py
14:05:54 <shaohe_feng> right now I set it to 0.5s
14:05:54 <pkoniszewski> shaohe_feng: I think that this is a good topic for our mailing list
14:06:26 <shaohe_feng> pkoniszewski: OK, let's discuss it on the mailing list.
14:06:34 <tdurakov> so, frequency of db writes, right?
14:06:45 <shaohe_feng> tdurakov: yes.
14:06:56 <pkoniszewski> shaohe_feng: i think we need folks familiar with the DB to discuss it
14:07:10 <shaohe_feng> tdurakov: any suggestions?
14:07:19 <tdurakov> +1 for ml
14:07:30 <andrearosa> pkoniszewski: not just that, we need to understand if it makes sense to have an update every 0.5s on a task which can last hours...
14:07:38 <pkoniszewski> yup
14:07:58 <pkoniszewski> exactly, just wanted to write that my question is who will poll the API every 0.5 seconds for 15 hours or more
14:08:00 <eliqiao> andrearosa: that won't take hours.
14:08:02 <andrearosa> btw +1 for ml discussion
14:08:07 <tdurakov> btw, compute can't make writes to db by itself, right?
14:08:23 <pkoniszewski> it can't, it needs to go through conductor
14:08:40 <tdurakov> yep, so there is additional load for messaging
14:08:40 <mdbooth> I'm pretty sure it can. However, in this case it would go through conductor.
14:09:57 <tdurakov> shaohe_feng please start a thread on the ML.
14:10:10 <shaohe_feng> tdurakov: yes. I will.
14:10:38 <tdurakov> let's move on, any reviews?
14:10:47 <eliqiao> tdurakov: I got one
14:11:10 <eliqiao> tdurakov: no need to discuss, just want to get some eyes for review #link https://review.openstack.org/#/q/topic:bp/making-live-migration-api-friendly
14:11:48 <eliqiao> tdurakov: thx, we can move on to the next.
14:12:01 <tdurakov> eliqiao, acked, added to the todo list
14:12:36 <tdurakov> more reviews?
14:12:56 <tdurakov> ok, moving to the next topic
14:13:09 <tdurakov> #topic bugs
14:13:09 <shaohe_feng> tdurakov: need more reviews https://review.openstack.org/#/c/258797/ and https://review.openstack.org/#/c/258771/
14:14:07 * mdbooth has found a couple, but I haven't written them up, yet. Mean to use them to help get reviews for my refactor series.
14:14:34 <mdbooth> They are: you can't have ephemeral disks on Rbd or Raw backends.
14:14:59 <mdbooth> And the backend holds the wrong locks whilst writing to both the image cache and disks.
14:15:06 <tdurakov> mdbooth, could you file a bug please for this?
14:15:08 <mdbooth> Meaning that writes to all backends are basically unlocked.
14:15:30 <mdbooth> Yeah, will do. I have a note here.
14:15:31 <eliqiao> +1 for filing a bug.
14:16:01 <mdbooth> Ulterior motive for not doing that: if anybody fixed it before me, it would conflict with my patch series :)
14:16:03 <tdurakov> sounds strange, as I've tested ceph for nova, worked fine. need more info
14:16:14 <eliqiao> (please also tag it live-migration)
14:16:28 <mdbooth> tdurakov: It works, but the locking is wrong.
14:16:45 <mdbooth> But the ephemeral disks don't work. That raises an exception.
14:16:48 <tdurakov> ah
14:17:07 * andrearosa just triggered fire alarm, leaving the meeting :(
14:17:21 * eliqiao would like to see the reproduction steps and will give it a try.
14:17:39 <mdbooth> But please, don't submit patches.
14:17:46 <tdurakov> another one from me: https://bugs.launchpad.net/nova/+bug/1535232
14:17:47 <openstack> Launchpad bug 1535232 in OpenStack Compute (nova) "live-migration ci failure on nfs shared storage" [Undecided,New]
14:18:32 <tdurakov> found during ci jobs failure on nfs, reproduces from time to time
14:19:09 <mdbooth> qemu: terminating on signal 15 from pid 8062
14:19:20 <mdbooth> Any idea what that means at a high level?
14:19:21 <tdurakov> working on it with kashyap
14:19:57 <tdurakov> destination node failed, and it seems to be a qemu-level issue
14:21:13 <tdurakov> so, if you triggered a job and found such failures, please recheck
14:21:19 <davidgiluk> tdurakov: Yeh kashyap asked me about that; if possible try with a newer qemu, I keep adding more debug to the failure path so it might tell you a bit more
14:21:54 <eliqiao> tdurakov: are these logs grabbed from the gate?
14:21:55 <pkoniszewski> tdurakov: have you checked on which nodes it fails?
14:22:25 <pkoniszewski> tdurakov: there is a similar issue reported and miedem found that it primarily fails on OVH nodes (the slowest ones)
14:22:33 <pkoniszewski> tdurakov: FYI https://bugs.launchpad.net/nova/+bug/1524898
14:22:34 <openstack> Launchpad bug 1524898 in OpenStack Compute (nova) "Volume based live migration aborted unexpectedly" [High,Confirmed]
14:22:45 <tdurakov> davidgiluk, I haven't reproduced it locally yet, speaking about packages, there are plans for a new job with latest packages
14:22:52 <tdurakov> eliqiao, yep
14:22:59 <pkoniszewski> tdurakov: it might be the same issue actually
14:23:31 <eliqiao> pkoniszewski: that one is volume based.
14:23:32 <tdurakov> pkoniszewski, will check after meeting, but yes, possible duplicate
14:24:05 <tdurakov> I think it's a live-migration problem as a whole, so it could affect all scenarios
14:24:05 <pkoniszewski> eliqiao: it fails occasionally so we can't be sure that it does not affect volume based LM also
14:24:34 <tdurakov> pkoniszewski, yes
14:24:35 <eliqiao> pkoniszewski: tdurakov that may all be a low level issue
14:25:05 <pkoniszewski> anyway +1 for a job with latest packages
14:25:25 <tdurakov> pkoniszewski, it's markus_z's plan
14:25:29 <eliqiao> +1 also would like to search for qemu bugs
14:25:55 <tdurakov> ok, let's move on
14:26:00 <tdurakov> any issues?
14:26:53 <tdurakov> #topic Mid cycle meetup
14:27:59 * eliqiao testing, am i still online?
14:28:06 <pkoniszewski> eliqiao: yes ;)
14:28:07 <tdurakov> any ideas to be discussed during mid-cycle? please add to https://etherpad.openstack.org/p/mitaka-nova-midcycle
14:29:26 * eliqiao thanks pkoniszewski :)
14:30:13 <tdurakov> well, next topic
14:30:24 <tdurakov> #topic CI status
14:31:15 <tdurakov> ceph, job is on review, thanks for reviewing, also patch in project-config that enables triggering got +2
14:31:31 <tdurakov> so, hope everything will be merged soon
14:31:35 <eliqiao> tdurakov: cool, thanks for the hard work.
14:31:52 <tdurakov> as said before, found one bug in new job
14:32:34 <mdbooth> Is there any CI which touches LVM?
14:32:52 <eliqiao> tdurakov: one question, do you have any guide on how to set it up in a local environment for testing?
14:33:00 <tdurakov> mdbooth, lvm for what?
14:33:11 <mdbooth> tdurakov: For nova storage
14:33:14 <eliqiao> lvm for qemu backend?
14:33:17 <mdbooth> non-volume
14:33:33 <eliqiao> does that use local lvm as the qemu backend?
14:33:38 <tdurakov> eliqiao, good question, added this to the todo
14:33:50 <tdurakov> mdbooth, haven't seen one
14:33:51 <mdbooth> eliqiao: I guess you could describe it like that
14:34:08 <eliqiao> tdurakov: I would like to get those docs, so developers can do more testing locally.
14:34:09 <mdbooth> tdurakov: It's significantly different to the other backends, and I'm making lots of changes
14:34:15 <mdbooth> which makes me nervous
14:34:47 <tdurakov> mdbooth, acked, will start an ml thread about adding lvm to this job, thank you for the idea
14:35:27 <tdurakov> btw, ansible 2.0 released
14:35:36 <tdurakov> this broke the hook last week
14:35:48 <eliqiao> tdurakov: hmm. can you please point me to the docs on how to do local LM testing when you've finished?
14:36:08 <tdurakov> eliqiao, sure
14:36:14 <eliqiao> tdurakov: thanks in advance :)
14:36:40 <eliqiao> tdurakov: what's the issue with ansible when upgrading to 2.0?
14:36:47 <tdurakov> fyi https://github.com/ansible/ansible/issues/13862
14:36:56 <tdurakov> eliqiao, ^
14:37:10 <eliqiao> tdurakov: thx.
14:37:29 <tdurakov> anything about ci?
14:38:24 <tdurakov> then next topic
14:38:43 <tdurakov> #topic Open Discussion
14:39:38 <eliqiao> just FYI #link http://docs.openstack.org/releases/schedules/mitaka.html
14:40:01 <eliqiao> I am kind of worried about our schedule, there are lots of patches to be reviewed.
14:40:15 <davidgiluk> tdurakov: How repeatable is that migration failure with nfs?
14:40:36 <tdurakov> davidgiluk, not often
14:41:01 <tdurakov> I'd say rare
14:41:28 <davidgiluk> tdurakov: Hmm, that makes it a bit trickier; it's important to make sure we can tell whether the migration failure on the destination comes before or after the error 15 (i.e. someone killed the source qemu)
14:41:49 <davidgiluk> tdurakov: i.e. is the destination just complaining because the source died, or did the destination dying cause the problem
14:42:02 <tdurakov> davidgiluk, I'll provide more detailed stats in a day for this using logstash.openstack.org
14:42:43 <davidgiluk> tdurakov: OK, thanks - the other thing I've done in the past is use tracing or systemtap to watch the receive side, but it's tricky if it's very rare
14:43:17 <tdurakov> davidgiluk, to highlight, haven't reproduced it locally yet
14:44:14 <tdurakov> so yes, everything is tricky here
14:45:05 <tdurakov> any other things to be discussed during meeting?
14:46:50 <tdurakov> so, thanks to all
14:46:58 <pkoniszewski> thanks!
14:47:09 <tdurakov> #endmeeting