13:59:59 #startmeeting Nova Live Migration
14:00:00 Meeting started Tue Aug 9 13:59:59 2016 UTC and is due to finish in 60 minutes. The chair is tdurakov. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:01 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:03 The meeting name has been set to 'nova_live_migration'
14:00:12 hi everyone
14:00:16 * kashyap waves
14:01:19 just wait a minute, so everyone joins
14:01:53 agenda https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:02:16 o/
14:02:20 o/
14:02:35 so, let's start
14:02:43 #topic Libvirt image backend
14:02:52 any updates?
14:03:19 I've reviewed several patches on that, it looks like chain reduces
14:03:36 o/
14:04:16 I'd expect almost the same for the next topic...
14:04:27 #topic Storage pools
14:05:12 any updates on that?
14:05:55 okay, let's move then to CI one...
14:06:03 #topic CI
14:06:09 first
14:06:32 wanted to mention, markus_z is working on the serial consoles for live-migration
14:06:50 #link https://review.openstack.org/#/c/347471/
14:07:18 so, it's improve coverage for that
14:08:03 https://bugs.launchpad.net/nova/+bug/1524898 - this one is still hanged...
14:08:03 Launchpad bug 1524898 in OpenStack Compute (nova) "Volume based live migration aborted unexpectedly" [High,Confirmed]
14:08:09 mriedem: anything on that^
14:08:35 nope
14:08:49 we had talked about skipping it
14:08:55 last week
14:09:01 at least in the multinode job
14:09:04 to see if that stabilizes that job
14:09:25 mriedem: ok, I'll submit patch today for that
14:09:30 ok
14:10:13 there is also potential issue with nfs https://bugs.launchpad.net/nova/+bug/1535232
14:10:13 Launchpad bug 1535232 in OpenStack Compute (nova) "live-migration ci failure on nfs shared storage" [Medium,Confirmed]
14:10:56 I expected it will work properly on xenial, but looks like we have it on xenial either
14:11:20 tdurakov: Is that a new bug?
14:11:22 * kashyap clicks
14:11:41 kashyap: it's not a new, but it turns out to be valid for xenial too
14:11:45 Duh, no, I even seem to have commented on it, a bit old (5 months old)
14:12:05 yes
14:12:33 tdurakov: Currently we seem to be stuck at understanding
14:12:48 Why block I/O errors would cause live mig to abort
14:13:08 davidgiluk: Later, do you have time to take a quick look at? - https://bugs.launchpad.net/nova/+bug/1535232/comments/6
14:13:08 Launchpad bug 1535232 in OpenStack Compute (nova) "live-migration ci failure on nfs shared storage" [Medium,Confirmed]
14:13:14 kashyap: looks like another underlying issue
14:13:25 * davidgiluk looks
14:14:51 davidgiluk: The hint is at the end of the comment.
14:16:09 kashyap: The IO errors are interesting, but that suggests it's something related to the storage again
14:16:24 tdurakov: Is it reproducible for you locally? (I don't have an NFS env on my local env, should set one up.)
14:16:32 kashyap: same
14:16:34 davidgiluk: Ah, you mean the host storage env?
14:16:49 I haven't managed to reproduce it
14:16:51 Hmm, I have no clue what the underlying storage setup is like
14:17:07 kashyap: Yes
14:17:26 tdurakov: Incidentally, I'm doing a drive-mirror+NBD storage migration test with QEMU itself, I'll keep an eye to see if I can trigger this
14:17:42 kashyap: ok, thanks
14:17:53 kashyap: I'd guess at NFS permissions
14:17:56 (I'll test it w/ libvirt APIs, too, which is more useful in Nova's case.)
14:18:29 davidgiluk: Yeah, without 'sosreport'-style logs, can't be certain that they're NFS permissions, can we...?
14:18:46 kashyap: But also we need to see libvirt source logs to know why it decided to shut it down
14:19:13 tdurakov: Do you have a handy link for the latest hits for this failure?
14:19:20 davidgiluk: the strange thing here that there are tests passed on that, during one run
14:19:40 kashyap: http://logs.openstack.org/66/329466/4/check/gate-tempest-dsvm-multinode-live-migration/586c6be/
14:20:15 tdurakov: Hmm, the failure is not deterministic?
14:20:24 kashyap: looks so
14:21:09 tdurakov: Deterministic failures would be too easy
14:21:13 tdurakov: 'subnode-2' is the destination, isn't it?
14:21:22 no
14:21:29 Ah, it's the source?
14:21:33 subnode-2 is just compute only
14:21:39 in that case it's source
14:21:59 tdurakov: I regularly do NFS qemu only migrations - so it normally works at that level
14:22:05 tdurakov: Okay, I got confused because, for some instances on 'subnode-2', I see the QEMU command-line parameter '-incoming', for some not
14:22:28 E.g. this (instance-00000009) has the "-incoming defer" (which means, it's the destination) - http://logs.openstack.org/66/329466/4/check/gate-tempest-dsvm-multinode-live-migration/586c6be/logs/subnode-2/libvirt/qemu/instance-00000009.txt.gz
14:23:01 davidgiluk: You don't test drive-mirror + NBD, do you? I presume you still do: 'migrate -b'?
14:23:26 Or 'migrate tcp:localhost:4444'
14:23:37 kashyap: No, I rarely do migrate -b either, I just use migrate with NFS shared storage most of the time
14:24:00 I'll check one thing, whether it's always fails on subnode, or there is no correlation
14:24:28 davidgiluk: That's the source libvirt daemon log - logs.openstack.org/66/329466/4/check/gate-tempest-dsvm-multinode-live-migration/586c6be/logs/subnode-2/libvirt/libvirtd.txt.gz
14:24:29 my understanding for that was, that things are unrelated...
14:25:07 http://logs.openstack.org/66/329466/4/check/gate-tempest-dsvm-multinode-live-migration/586c6be/logs/subnode-2/libvirt/libvirtd.txt.gz#_2016-08-03_00_45_27_923
14:25:12 davidgiluk: Oh, you seem to sound right...
14:25:12 kashyap, davidgiluk ^
14:25:13 2016-08-03 00:45:21.456+0000: 18816: info : qemuMonitorIOProcess:423 : QEMU_MONITOR_IO_PROCESS: mon=0x7f7008012390 buf={"timestamp": {"seconds": 1470185121, "microseconds": 456577}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "nospace": false, "reason": "Permission denied", "operation": "write", "action": "report"}}
14:25:22 davidgiluk: "Permission denied"
14:25:33 kashyap: Yeh that looks like a simple perm problem on the NFS
14:26:21 davidgiluk: Also, checking out the URL from tdurakov w/ the "load of migration failed: Input/output error"
14:26:47 davidgiluk, about perm issue, http://logs.openstack.org/66/329466/4/check/gate-tempest-dsvm-multinode-live-migration/586c6be/console.html.gz#_2016-08-03_00_45_50_882922 - there are several tests that passes
14:26:49 tdurakov: We can move this discussion out of the meeting; maybe you have other issues to discuss.
14:27:15 I'd spend 5 more minutes on that
14:27:20 Yeah, sure
14:27:44 tdurakov: Hmm, so this succeeds 'test_live_block_migration_paused'
14:27:54 kashyap: yes
14:28:22 so if it's perm issue, I'd expect to see all tests to fail, right?
14:29:51 well, let's take a timeout for that, so everyone interested could grep through the code, and get back for that later
14:30:28 tdurakov: I wonder if the one that is passing is actually NFS shared storage; its name is live_block_migration - so is it actually using NFS shared?
14:30:40 davidgiluk: it's nfs
14:30:44 ok
14:30:49 just bad tempest naming...
14:31:10 tdurakov: When the storage is shared, what is the point of doing a *block* migration?
14:31:26 it's not block migration
14:31:37 Ah, bad tempest test name you say above
14:31:43 right
14:31:53 another ci thing to mention
14:31:56 https://bugs.launchpad.net/nova/+bug/1592015
14:31:56 Launchpad bug 1592015 in OpenStack Compute (nova) "libvirt: cleanup of a volume backed instance resize leaves behind the instance directory" [Medium,In progress] - Assigned to Lee Yarwood (lyarwood)
14:32:37 I'm going to add tempest test that do cold migration, and then live-migration after that
14:33:06 once the bug is fixed
14:33:53 mriedem1, thoughts on that^
14:35:19 tdurakov: Do those systems use apparmor or anything like that? I know there's a couple of selinux tricks to get NFS working, but I guess it's not using that
14:35:45 davidgiluk: does ubuntu use apparmor?
14:35:57 tdurakov: It does, I don't know if it's used in this case
14:36:16 I mean by default
14:36:39 tdurakov: the more tests the better
14:37:22 ok let's move on then
14:38:05 pkoniszewski: added http://lists.openstack.org/pipermail/openstack-dev/2016-August/100657.html this thread to agenda
14:38:40 so, the point is to not using tunneled for live-migration by default
14:39:33 so, I support that idea
14:40:09 if you folks, missed the thread feel free to respond on that
14:40:25 also there is a patch on review already https://review.openstack.org/#/c/350480/
14:41:30 #topic: Open discussion
14:41:36 tdurakov: I've read the thread, and agree with danpb's comment there
14:41:50 kashyap: ok
14:41:52 tdurakov: And, yours
14:42:10 You're also "thumbs up" for default=False, for good reasons.
14:42:10 cool, review the patch then, thanks:)
14:42:28 Yeah, looking
14:43:14 anything else to discuss on live-migration?
14:43:55 Nothing from me as of now.
14:44:22 ok, thank you for coming, It was kind of sub-sub-team meeting:)
14:45:01 #endmeeting
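
Follow-up note (not part of the meeting log): the BLOCK_IO_ERROR event pasted at 14:25:13 can be picked out of a libvirtd debug log mechanically rather than by eye. Below is a minimal sketch, assuming every monitor line carries the event as a JSON object right after "buf=", as in the pasted line. The function name iter_qemu_events and the SAMPLE constant are hypothetical and exist only for illustration; they are not part of libvirt, QEMU, or Nova.

```python
import json

# The log line pasted in the meeting at 14:25:13, used here as sample input.
SAMPLE = (
    '2016-08-03 00:45:21.456+0000: 18816: info : qemuMonitorIOProcess:423 : '
    'QEMU_MONITOR_IO_PROCESS: mon=0x7f7008012390 buf={"timestamp": '
    '{"seconds": 1470185121, "microseconds": 456577}, "event": "BLOCK_IO_ERROR", '
    '"data": {"device": "drive-virtio-disk0", "nospace": false, '
    '"reason": "Permission denied", "operation": "write", "action": "report"}}'
)


def iter_qemu_events(lines, event_name="BLOCK_IO_ERROR"):
    """Yield decoded monitor events named event_name (hypothetical helper).

    Assumption: the JSON object starts immediately after "buf=".
    Lines without that marker, or with undecodable JSON, are skipped.
    """
    decoder = json.JSONDecoder()
    for line in lines:
        marker = line.find("buf={")
        if marker == -1:
            continue
        try:
            # raw_decode tolerates trailing text after the JSON object.
            payload, _ = decoder.raw_decode(line, marker + len("buf="))
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and payload.get("event") == event_name:
            yield payload


if __name__ == "__main__":
    for event in iter_qemu_events([SAMPLE]):
        data = event["data"]
        print(data["device"], data["operation"], data["reason"])
```

To run it against a real log, pass the lines of a decompressed libvirtd.txt file in place of [SAMPLE]. Comparing the "reason" and "device" fields between failing and passing test runs is one way to check whether the permission error correlates with a particular node.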