14:00:21 <tdurakov> #startmeeting Nova Live Migration
14:00:22 <openstack> Meeting started Tue Jan 19 14:00:21 2016 UTC and is due to finish in 60 minutes.  The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:23 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:26 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:32 <mdbooth> o/
14:00:33 <andrearosa> hi
14:00:45 <eliqiao> o/
14:00:48 <tdurakov> hi
14:00:49 <pkoniszewski> o/
14:01:08 <tdurakov> agenda: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:01:38 <tdurakov> so, let's start
14:01:47 <tdurakov> #topic priority reviews
14:01:50 <shaohe_feng> hi all
14:02:45 <tdurakov> any reviews that need to be discussed?
14:03:07 <shaohe_feng> tdurakov:  one, please
14:03:56 <tdurakov> shaohe_feng, sure, which one?
14:04:15 <shaohe_feng> tdurakov:  https://review.openstack.org/#/c/258813/
14:05:01 <shaohe_feng> I'd like to discuss the interval for writing to the DB
14:05:48 <shaohe_feng> tdurakov: https://review.openstack.org/#/c/258813/4/nova/virt/libvirt/driver.py
14:05:54 <shaohe_feng> now I set it 0.5s
14:05:54 <pkoniszewski> shaohe_feng: I think that this is a good topic for our mailing list
14:06:26 <shaohe_feng> pkoniszewski: OK, let's discuss it on the mailing list.
14:06:34 <tdurakov> so, frequency of db writes, right?
14:06:45 <shaohe_feng> tdurakov: yes.
14:06:56 <pkoniszewski> shaohe_feng: i think we need folks familiar with the DB to discuss it
14:07:10 <shaohe_feng> tdurakov: any suggestion?
14:07:19 <tdurakov> +1 for ml
14:07:30 <andrearosa> pkoniszewski: not just that, need to understand if it makes sense to have an update every 0.5s on a task which can last hours...
14:07:38 <pkoniszewski> yup
14:07:58 <pkoniszewski> exactly, just wanted to write that - my question is who will poll the API every 0.5 seconds for 15 hours or more
14:08:00 <eliqiao> andrearosa: that won't take hours.
14:08:02 <andrearosa> btw +1 for ml discussion
14:08:07 <tdurakov> btw, compute can't make writes to db by itself, right?
14:08:23 <pkoniszewski> it can't, it needs to go through conductor
14:08:40 <tdurakov> yep, so there is additional load for messaging
14:08:40 <mdbooth> I'm pretty sure it can. However, in this case it would go through conductor.
14:09:57 <tdurakov> shaohe_feng, please start a thread on the ML.
14:10:10 <shaohe_feng> tdurakov: yes. I will.
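For context, a minimal sketch of what throttling those progress writes could look like, assuming a Guest.get_job_info()-style helper and a Migration object with a save() method modeled loosely on Nova's libvirt driver; this is illustrative, not the patch under review:

```python
import time

DB_WRITE_INTERVAL = 0.5  # seconds between DB writes, the value under discussion


def monitor_migration(guest, migration, poll_interval=0.5):
    """Poll libvirt for job progress, but throttle how often it is persisted.

    Each migration.save() is an RPC to conductor, which performs the DB
    write, so saving on every poll adds both DB and messaging load.
    """
    last_write = 0.0
    while True:
        info = guest.get_job_info()   # assumed helper wrapping the libvirt job stats call
        if info.type == 'completed':  # placeholder check for a finished job
            break
        now = time.time()
        if now - last_write >= DB_WRITE_INTERVAL:
            migration.memory_processed = info.memory_processed
            migration.memory_remaining = info.memory_remaining
            migration.save()          # goes through conductor, not straight to the DB
            last_write = now
        time.sleep(poll_interval)
```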
14:10:38 <tdurakov> let's move on, any reviews?
14:10:47 <eliqiao> tdurakov: I got one
14:11:10 <eliqiao> tdurakov: no need to discuss, just want to get some eyes for review #link https://review.openstack.org/#/q/topic:bp/making-live-migration-api-friendly
14:11:48 <eliqiao> tdurakov: thx, we can move on next.
14:12:01 <tdurakov> eliqiao, acked, added to the todo list
14:12:36 <tdurakov> more reviews?
14:12:56 <tdurakov> ok, moving to the next topic
14:13:09 <tdurakov> #topic bugs
14:13:09 <shaohe_feng> tdurakov:  need more reviews https://review.openstack.org/#/c/258797/ and https://review.openstack.org/#/c/258771/
14:14:07 * mdbooth has found a couple, but I haven't written them up yet. Meant to use them to help get reviews for my refactor series.
14:14:34 <mdbooth> They are: you can't have ephemeral disks on Rbd or Raw backends.
14:14:59 <mdbooth> And the backend holds the wrong locks whilst writing to both the image cache and disks.
14:15:06 <tdurakov> mdbooth, could you file a bug please for this?
14:15:08 <mdbooth> Meaning that writes to all backends are basically unlocked.
14:15:30 <mdbooth> Yeah, will do. I have a note here.
14:15:31 <eliqiao> +1 for file a bug.
14:16:01 <mdbooth> Ulterior motive for not doing that: if anybody fixed it before me, it would conflict with my patch series :)
14:16:03 <tdurakov> sounds strange, as I've tested ceph for nova, worked fine. need more info
14:16:14 <eliqiao> (please also tag it live-migration)
14:16:28 <mdbooth> tdurakov: It works, but the locking is wrong.
14:16:45 <mdbooth> But the ephemeral disks don't work. That raises an exception.
14:16:48 <tdurakov> ah
14:17:07 * andrearosa just triggered the fire alarm, leaving the meeting :(
14:17:21 * eliqiao would like to see the reproduction steps and will have a try.
14:17:39 <mdbooth> But please, don't submit patches.
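To illustrate the locking issue mdbooth is describing, a rough sketch with oslo_concurrency; the function and lock names are hypothetical, not Nova's actual imagebackend code. The point is that the shared image-cache entry and the per-instance disk are separate resources, and a lock for each must be held for the full duration of the write:

```python
from oslo_concurrency import lockutils

LOCK_PATH = '/var/lib/nova/locks'  # illustrative lock directory


def create_disk_from_cache(cache_key, fetch_image, copy_to_instance_disk):
    # Hold a lock on the shared cache entry while it is populated, so two
    # instances spawning from the same image do not race on the cache file.
    with lockutils.lock(cache_key, external=True, lock_path=LOCK_PATH):
        fetch_image(cache_key)

    # Hold a separate lock while writing the instance disk itself. Taking
    # the wrong lock here, or releasing it too early, is what leaves the
    # writes effectively unsynchronized.
    with lockutils.lock('disk-' + cache_key, external=True, lock_path=LOCK_PATH):
        copy_to_instance_disk(cache_key)
```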
14:17:46 <tdurakov> another one from me: https://bugs.launchpad.net/nova/+bug/1535232
14:17:47 <openstack> Launchpad bug 1535232 in OpenStack Compute (nova) "live-migration ci failure on nfs shared storage" [Undecided,New]
14:18:32 <tdurakov> found during ci job failures on nfs, reproduces from time to time
14:19:09 <mdbooth> qemu: terminating on signal 15 from pid 8062
14:19:20 <mdbooth> Any idea what that means at a high level?
14:19:21 <tdurakov> working on it with kashyap
14:19:57 <tdurakov> destination node failed, and it seems to be qemu-level issue
14:21:13 <tdurakov> so, if you triggered the job and found such failures, please recheck
14:21:19 <davidgiluk> tdurakov: Yeh kashyap asked me about that; if possible try with a newer qemu, I keep adding more debug to the failure path so it might tell you a bit more
14:21:54 <eliqiao> tdurakov: these logs are grabbed from gate?
14:21:55 <pkoniszewski> tdurakov: have you checked on which nodes it fails?
14:22:25 <pkoniszewski> tdurakov: there is a similar issue reported and mriedem found that it primarily fails on OVH nodes (the slowest ones)
14:22:33 <pkoniszewski> tdurakov: FYI https://bugs.launchpad.net/nova/+bug/1524898
14:22:34 <openstack> Launchpad bug 1524898 in OpenStack Compute (nova) "Volume based live migration aborted unexpectedly" [High,Confirmed]
14:22:45 <tdurakov> davidgiluk, I haven't reproduced it locally yet; speaking about packages, there are plans for a new job with the latest packages
14:22:52 <tdurakov> eliqiao, yep
14:22:59 <pkoniszewski> tdurakov: it might be the same issue actually
14:23:31 <eliqiao> pkoniszewski: that one is volumed based.
14:23:32 <tdurakov> pkoniszewski, will check after meeting, but yes, possible duplicate
14:24:05 <tdurakov> I think it's a live-migration problem as a whole, so it could affect all scenarios
14:24:05 <pkoniszewski> eliqiao: it fails occasionally so we can't be sure that it does not affect volume based LM also
14:24:34 <tdurakov> pkoniszewski, yes
14:24:35 <eliqiao> pkoniszewski: tdurakov: that may all be a low-level issue
14:25:05 <pkoniszewski> anyway +1 for a job with latest packages
14:25:25 <tdurakov> pkoniszewski, it's in markus_z's plans
14:25:29 <eliqiao> +1, also would like to search for qemu bugs
14:25:55 <tdurakov> ok, let's move on
14:26:00 <tdurakov> any issues?
14:26:53 <tdurakov> #topic Mid cycle meetup
14:27:59 * eliqiao testing, am i still on line ?
14:28:06 <pkoniszewski> eliqiao: yes ;)
14:28:07 <tdurakov> any ideas to be discussed during mid-cycle? please add to https://etherpad.openstack.org/p/mitaka-nova-midcycle
14:29:26 * eliqiao thanks pkoniszewski :)
14:30:13 <tdurakov> well, next topic
14:30:24 <tdurakov> #topic CI status
14:31:15 <tdurakov> ceph job is on review, thanks for reviewing; also the patch in project-config that enables triggering got +2
14:31:31 <tdurakov> so, hope everything will be merged soon
14:31:35 <eliqiao> tdurakov: cool, thanks for the hard efforts.
14:31:52 <tdurakov> as said before, found one bug in new job
14:32:34 <mdbooth> Is there any CI which touches LVM?
14:32:52 <eliqiao> tdurakov: one question, do you have any guide on how to set it up in a local environment to do testing?
14:33:00 <tdurakov> mdbooth, lvm for what?
14:33:11 <mdbooth> tdurakov: For nova storage
14:33:14 <eliqiao> lvm for qemu backend ?
14:33:17 <mdbooth> non-volume
14:33:33 <eliqiao> is that using local lvm as the qemu backend?
14:33:38 <tdurakov> eliqiao, good question, added this for todo
14:33:50 <tdurakov> mdbooth, haven't seen one
14:33:51 <mdbooth> eliqiao: I guess you could describe it like that
14:34:08 <eliqiao> tdurakov: I would like to get those docs, so developers can do more testing locally.
14:34:09 <mdbooth> tdurakov: It's significantly different to the other backends, and I'm making lots of changes
14:34:15 <mdbooth> which makes me nervous
14:34:47 <tdurakov> mdbooth, acked, will start ml for adding lvm for this job, thank you for idea
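For anyone who wants to exercise the LVM backend locally in the meantime, it is selected via the compute node's nova.conf; the volume group name below is just an example:

```ini
[libvirt]
# use LVM logical volumes for instance disks instead of the file-based backend
images_type = lvm
# example volume group; it must already exist on the compute node
images_volume_group = nova-vg
```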
14:35:27 <tdurakov> btw, ansible 2.0 released
14:35:36 <tdurakov> this broke hook last week
14:35:48 <eliqiao> tdurakov: hmm, can you please point me to the docs on how to do local LM testing when you finish?
14:36:08 <tdurakov> eliqiao, sure
14:36:14 <eliqiao> tdurakov: thanks in advance :)
14:36:40 <eliqiao> tdurakov: what was the issue with ansible when upgrading to 2.0?
14:36:47 <tdurakov> fyi  https://github.com/ansible/ansible/issues/13862
14:36:56 <tdurakov> eliqiao, ^
14:37:10 <eliqiao> tdurakov: thx.
14:37:29 <tdurakov> anything about ci?
14:38:24 <tdurakov> then next topic
14:38:43 <tdurakov> #topic Open Discussion
14:39:38 <eliqiao> just FYI #link http://docs.openstack.org/releases/schedules/mitaka.html
14:40:01 <eliqiao> I am kind of worried about our schedule, there are lots of patches to be reviewed.
14:40:15 <davidgiluk> tdurakov: How repeatable is that migration failure with nfs?
14:40:36 <tdurakov> davidgiluk, not often
14:41:01 <tdurakov> I'd say rare
14:41:28 <davidgiluk> tdurakov: Hmm, that makes it a bit trickier; it's important to make sure we can tell whether the migration failure on the destination comes before or after the signal 15 (i.e. someone killed the source qemu)
14:41:49 <davidgiluk> tdurakov: i.e. is the destination just complaining because the source died, or did the destination dying cause the problem
14:42:02 <tdurakov> davidgiluk, I'll provide more detailed stats in a day for this using logstash.openstack.org
14:42:43 <davidgiluk> tdurakov: OK, thanks - the other thing I've done in the past is use tracing or systemtap to watch the receive side, but it's tricky if it's very rare
14:43:17 <tdurakov> davidgiluk, to highlight, haven't reproduced it locally yet
14:44:14 <tdurakov> so yes, everything is tricky here
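As an example of the kind of logstash.openstack.org query that could back those stats (the message string and file tag are assumptions based on the failure discussed above, not a confirmed signature):

```
message:"Live Migration failure" AND tags:"screen-n-cpu.txt"
```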
14:45:05 <tdurakov> any other things to be discussed during meeting?
14:46:50 <tdurakov> so, thanks to all
14:46:58 <pkoniszewski> thanks!
14:47:09 <tdurakov> #endmeeting