14:00:21 #startmeeting Nova Live Migration
14:00:22 Meeting started Tue Jan 19 14:00:21 2016 UTC and is due to finish in 60 minutes. The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:23 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:26 The meeting name has been set to 'nova_live_migration'
14:00:32 o/
14:00:33 hi
14:00:45 o/
14:00:48 hi
14:00:49 o/
14:01:08 agenda: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:01:38 so, let's start
14:01:47 #topic priority reviews
14:01:50 hi all
14:02:45 any reviews that need to be discussed?
14:03:07 tdurakov: one, please
14:03:56 shaohe_feng, sure, which one?
14:04:15 tdurakov: https://review.openstack.org/#/c/258813/
14:05:01 I'd like to discuss the interval for writing to the DB
14:05:48 tdurakov: https://review.openstack.org/#/c/258813/4/nova/virt/libvirt/driver.py
14:05:54 now I set it to 0.5s
14:05:54 shaohe_feng: I think that this is a good topic for our mailing list
14:06:26 pkoniszewski: OK, let's talk about it on the mailing list.
14:06:34 so, frequency of db writes, right?
14:06:45 tdurakov: yes.
14:06:56 shaohe_feng: i think we need folks familiar with the DB to discuss it
14:07:10 tdurakov: any suggestion?
14:07:19 +1 for ml
14:07:30 pkoniszewski: not just that, we need to understand if it makes sense to have an update every 0.5s on a task which can last hours...
14:07:38 yup
14:07:58 exactly, I just wanted to write that; my question is who will poll the API every 0.5 seconds for 15 hours or more
14:08:00 andrearosa: that won't take hours.
14:08:02 btw +1 for ml discussion
14:08:07 btw, compute can't make writes to the db by itself, right?
14:08:23 it can't, it needs to go through conductor
14:08:40 yep, so there is additional load on messaging
14:08:40 I'm pretty sure it can. However, in this case it would go through conductor.
14:09:57 shaohe_feng please start a thread on the ML.
14:10:10 tdurakov: yes. I will.
14:10:38 let's move on, any reviews?
14:10:47 tdurakov: I got one
14:11:10 tdurakov: no need to discuss, just want to get some eyes for review #link https://review.openstack.org/#/q/topic:bp/making-live-migration-api-friendly
14:11:48 tdurakov: thx, we can move on to the next one.
14:12:01 eliqiao, acked, added to the todo list
14:12:36 more reviews?
14:12:56 ok, moving to the next topic
14:13:09 #topic bugs
14:13:09 tdurakov: need more reviews https://review.openstack.org/#/c/258797/ and https://review.openstack.org/#/c/258771/
14:14:07 * mdbooth has found a couple, but I haven't written them up yet. I mean to use them to help get reviews for my refactor series.
14:14:34 They are: you can't have ephemeral disks on Rbd or Raw backends.
14:14:59 And the backend holds the wrong locks whilst writing to both the image cache and disks.
14:15:06 mdbooth, could you please file a bug for this?
14:15:08 Meaning that writes to all backends are basically unlocked.
14:15:30 Yeah, will do. I have a note here.
14:15:31 +1 for filing a bug.
14:16:01 Ulterior motive for not doing that: if anybody fixed it before me, it would conflict with my patch series :)
14:16:03 sounds strange, as I've tested ceph for nova and it worked fine. need more info
14:16:14 (please also tag it live-migration)
14:16:28 tdurakov: It works, but the locking is wrong.
14:16:45 But the ephemeral disks don't work. That raises an exception.
14:16:48 ah
14:17:07 * andrearosa just triggered the fire alarm, leaving the meeting :(
14:17:21 * eliqiao would like to see the reproduction steps and will have a try.
14:17:39 But please, don't submit patches.
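(To make the image backend locking point above concrete: a minimal sketch of the kind of per-image external lock being discussed, using oslo.concurrency. The function name, lock prefix and lock path are illustrative only, not Nova's actual code paths, and mdbooth's refactor series may address the problem differently.)

    # Illustrative only: serialize writers of one cached image across processes
    # with an external (file-based) lock; names and paths are made up.
    from oslo_concurrency import lockutils

    def fetch_to_image_cache(image_id, fetch_func, cache_path):
        # external=True takes a file lock, so concurrent nova-compute workers
        # on the same host serialize on the same image instead of writing to
        # the cache (or to disks created from it) at the same time.
        with lockutils.lock(image_id,
                            lock_file_prefix='nova-imagecache-',
                            external=True,
                            lock_path='/var/lib/nova/locks'):
            fetch_func(target=cache_path)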
14:17:46 another one from me: https://bugs.launchpad.net/nova/+bug/1535232
14:17:47 Launchpad bug 1535232 in OpenStack Compute (nova) "live-migration ci failure on nfs shared storage" [Undecided,New]
14:18:32 found during ci job failures on nfs, it reproduces from time to time
14:19:09 qemu: terminating on signal 15 from pid 8062
14:19:20 Any idea what that means at a high level?
14:19:21 working on it with kashyap
14:19:57 the destination node failed, and it seems to be a qemu-level issue
14:21:13 so, if you triggered the job and found such failures, please recheck
14:21:19 tdurakov: Yeah, kashyap asked me about that; if possible try with a newer qemu, I keep adding more debug to the failure path so it might tell you a bit more
14:21:54 tdurakov: these logs are grabbed from the gate?
14:21:55 tdurakov: have you checked on which nodes it fails?
14:22:25 tdurakov: there is a similar issue reported and mriedem found that it primarily fails on OVH nodes (the slowest ones)
14:22:33 tdurakov: FYI https://bugs.launchpad.net/nova/+bug/1524898
14:22:34 Launchpad bug 1524898 in OpenStack Compute (nova) "Volume based live migration aborted unexpectedly" [High,Confirmed]
14:22:45 davidgiluk, I haven't reproduced it locally yet; speaking about packages, there are plans for a new job with the latest packages
14:22:52 eliqiao, yep
14:22:59 tdurakov: it might be the same issue actually
14:23:31 pkoniszewski: that one is volume based.
14:23:32 pkoniszewski, will check after the meeting, but yes, possible duplicate
14:24:05 I think it's a live-migration problem as a whole, so it could affect all scenarios
14:24:05 eliqiao: it fails occasionally so we can't be sure that it does not affect volume based LM also
14:24:34 pkoniszewski, yes
14:24:35 pkoniszewski: tdurakov that may all be a low level issue
14:25:05 anyway, +1 for a job with the latest packages
14:25:25 pkoniszewski, that's in markus_z's plans
14:25:29 +1, also I'd like to search for qemu bugs
14:25:55 ok, let's move on
14:26:00 any issues?
14:26:53 #topic Mid cycle meetup
14:27:59 * eliqiao testing, am I still online?
14:28:06 eliqiao: yes ;)
14:28:07 any ideas to be discussed during the mid-cycle? please add them to https://etherpad.openstack.org/p/mitaka-nova-midcycle
14:29:26 * eliqiao thanks pkoniszewski :)
14:30:13 well, next topic
14:30:24 #topic CI status
14:31:15 the ceph job is under review, thanks for reviewing; also the patch in project-config that enables triggering got a +2
14:31:31 so, hope everything will be merged soon
14:31:35 tdurakov: cool, thanks for the hard work.
14:31:52 as said before, found one bug in the new job
14:32:34 Is there any CI which touches LVM?
14:32:52 tdurakov: one question, do you have any guide on how to set it up in a local environment to do testing?
14:33:00 mdbooth, lvm for what?
14:33:11 tdurakov: For nova storage
14:33:14 lvm for the qemu backend?
14:33:17 non-volume
14:33:33 is that using local lvm as the qemu backend?
14:33:38 eliqiao, good question, added this to the todo
14:33:50 mdbooth, haven't seen one
14:33:51 eliqiao: I guess you could describe it like that
14:34:08 tdurakov: I would like to get those docs, so developers can do more testing locally.
14:34:09 tdurakov: It's significantly different to the other backends, and I'm making lots of changes
14:34:15 which makes me nervous
14:34:47 mdbooth, acked, will start an ML thread about adding lvm to this job, thank you for the idea
14:35:27 btw, ansible 2.0 was released
14:35:36 this broke the hook last week
14:35:48 tdurakov: hmm. can you please let me know the docs on how to do local LM testing when you've finished?
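(On the LVM backend and local-testing questions above: a hypothetical devstack local.conf fragment along these lines should switch the libvirt image backend to LVM for local experiments. It is an untested sketch, not the guide tdurakov agreed to provide, and 'stack-volumes' is a placeholder volume group name that must already exist on the host.)

    [[post-config|$NOVA_CONF]]
    [libvirt]
    images_type = lvm
    images_volume_group = stack-volumes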
14:36:08 eliqiao, sure
14:36:14 tdurakov: thanks in advance :)
14:36:40 tdurakov: what was the issue with ansible when upgrading to 2.0?
14:36:47 fyi https://github.com/ansible/ansible/issues/13862
14:36:56 eliqiao, ^
14:37:10 tdurakov: thx.
14:37:29 anything else about ci?
14:38:24 then next topic
14:38:43 #topic Open Discussion
14:39:38 just FYI #link http://docs.openstack.org/releases/schedules/mitaka.html
14:40:01 I am kind of worried about our schedule, there are lots of patches to be reviewed.
14:40:15 tdurakov: How repeatable is that migration failure with nfs?
14:40:36 davidgiluk, not often
14:41:01 I'd say rare
14:41:28 tdurakov: Hmm, that makes it a bit trickier; it's important to make sure we can tell whether the migration failure on the destination comes before or after the signal 15 error (i.e. someone killed the source qemu)
14:41:49 tdurakov: i.e. is the destination just complaining because the source died, or did the destination dying cause the problem
14:42:02 davidgiluk, I'll provide more detailed stats on this in a day using logstash.openstack.org
14:42:43 tdurakov: OK, thanks - the other thing I've done in the past is use tracing or systemtap to watch the receive side, but it's tricky if it's very rare
14:43:17 davidgiluk, to highlight, I haven't reproduced it locally yet
14:44:14 so yes, everything is tricky here
14:45:05 any other things to be discussed during the meeting?
14:46:50 so, thanks to all
14:46:58 thanks!
14:47:09 #endmeeting
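(On the logstash stats mentioned above: one plausible starting point is an elastic-recheck style query such as the one below, assuming the log file that carries the qemu message is indexed for the live-migration job; the build_name value is a placeholder to be replaced with the actual job name.)

    message:"terminating on signal 15 from pid" AND build_name:"<live-migration job name>"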