15:00:45 #startmeeting libvirt
15:00:46 Meeting started Tue Aug 19 15:00:45 2014 UTC and is due to finish in 60 minutes. The chair is danpb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:47 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:49 The meeting name has been set to 'libvirt'
15:01:13 ..if anyone is here & wants to talk about libvirt issues make yourselves known...
15:01:41 o/
15:02:06 o/
15:02:23 o/
15:02:26 o/
15:02:35 o/
15:02:52 o/
15:04:04 welcome folks, so the only item added to the etherpad is from rcleere
15:04:11 #topic Libvirt ulimit patch
15:04:24 rcleere: ..your turn :-)
15:04:28 ok
15:04:46 So, when we boot containers the container inherits ulimits from the libvirt_lxc process
15:05:07 one of those being no_file (the max file descriptors a process can have)
15:05:36 ok, standard unix inheritance which is probably quite unhelpful here :-)
15:05:37 this introduces xml to set it in the instance XML that would get applied when executing the init process of the container.
15:06:32 rcleere: what is your end goal here - is it simply to avoid inheriting a possibly stupid limit from libvirt_lxc
15:06:39 with this patch, eventually we could apply flavor-specific open file limits
15:06:43 The other hacky way I was doing it was to have a shell wrapper around libvirt_lxc that set it, then passed all args to the real libvirt_lxc
15:06:57 rcleere: or is it to use this for some kind of explicit resource control policy
15:07:19 explicit resource control.
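The "shell wrapper" workaround rcleere mentions can be sketched roughly as follows. The install path of the real emulator binary and the /tmp location are assumptions for illustration (Fedora/RHEL ship it as /usr/libexec/libvirt_lxc; other distributions differ):

```shell
# Hypothetical interim workaround: a wrapper that raises the soft
# no_file limit to the inherited hard limit, then passes all arguments
# through unchanged to the real libvirt_lxc binary.
cat > /tmp/libvirt_lxc-wrapper <<'EOF'
#!/bin/sh
# Raise the soft open-files limit as far as the hard limit allows.
ulimit -S -n "$(ulimit -H -n)"
# Hand over to the real binary; path is an assumption.
exec /usr/libexec/libvirt_lxc "$@"
EOF
chmod +x /tmp/libvirt_lxc-wrapper
```

Because ulimits are inherited across exec, the container's init process would then start with the raised soft limit, without any XML involved.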
15:07:58 I should mention we are using user namespaces, so as root in the container you cannot increase the no_file, only decrease it
15:08:03 hmm, so that raises questions of just how the mechanism works, particularly wrt user namespaces
15:08:24 ah, that's what i was going to ask - does user namespace give you complete unrestricted control over the ulimits
15:08:31 nope
15:08:34 just decrease
15:09:11 so semantically is the num files limit of the container a global limit that will apply to the sum of all child processes it starts
15:09:22 so as it is in our environment the kernel sets a soft limit of 1024 and a hard limit of 4096. we don't change those in our running environment. libvirt_lxc inherits that, and so does container init.
15:09:32 unfortunately not
15:09:40 it is the max open files per process.
15:09:41 i recall previously seeing discussions about whether ulimits needed to be mapped into cgroups to apply to an entire process set
15:09:57 they really should be mapped to cgroups and/or namespaces
15:10:05 Seems like it'd make more sense that way
15:10:13 yeah so if it is per-process it isn't very useful as a resource limiting mechanism
15:10:21 as you can just spawn more processes to escape the limit
15:10:33 There had been some talk on the kernel mailing list and some patches, but none have made it in.
15:10:43 i like that ulimit inside the container reports correct limits with rcleere's approach
15:11:20 given the current impl, why would we not simply set the container init to have (hard_limit == soft_limit) == hard limit of libvirt-lxc ?
15:11:23 danpb: true it is not a real limit on the container, BUT it does allow things like mysql, nginx, haproxy to set the higher limits for incoming tcp connections that they need.
15:11:45 it isn't offering any resource control - so the only useful thing we can do is avoid the accidental capping of the soft limit
15:11:52 danpb: not sure I understand your question
15:12:16 so say libvirt_lxc has soft=1024, hard=8192
15:12:35 why would we not simply set the init process to have hard=8192, soft=8192 and avoid any XML config
15:12:43 it already does.
15:13:09 the XML config doesn't seem to add any real value here until we can set num_files limits against the cgroup so they actually work as a global cap on the container
15:13:20 we'd eventually like to customize the number of open files value based on flavor
15:13:38 and having an XML directive seemed to be a good place to define it
15:13:42 But I don't want to blindly set ALL containers to an arbitrarily high no_files limit if they don't need it.
15:14:05 sew: sure i understand that, but it feels like that needs the cgroups integration for ulimits
15:14:20 like the haproxy example. if we were going to use that as a LB, then I would probably want the no_files limit at 10000+ but I don't really want that for all of my containers.
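For reference, the soft/hard pair being discussed can be read from procfs for any process on a Linux host; this is just an inspection one-liner, not part of the proposed patch:

```shell
# Print the per-process soft and hard open-file limits of pid 1;
# the same /proc/<pid>/limits file exists for every process.
awk '/Max open files/ {print "soft=" $4, "hard=" $5}' /proc/1/limits
```

Running the same command inside and outside the container is an easy way to confirm what init actually inherited.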
15:15:00 the idea behind this change is to allow good processes to work properly, not to disallow bad processes from behaving badly
15:15:14 (which they could anyway)
15:15:14 apmelton: very good point
15:15:35 +1 on apmelton's point
15:16:13 apmelton: yep, but my concern is around the long term implications of a short term decision
15:16:52 apmelton: eg if we add a config setting for the current per-process ulimit vs a later per-container ulimit
15:17:10 and whether it makes sense to support both approaches in the long term or not
15:17:17 danpb: why not both once they're available
15:17:21 ah
15:17:46 so, being able to set ulimit and have it appear correctly in a container will be useful for processes that aren't cgroup aware
15:17:52 i guess you could say that if you had a per-container ulimit of 100,000 files
15:18:25 would it still make sense to be able to say that the init process only had an 8192 per-process limit
15:19:11 does anyone know what limits a real pid1 will get from the kernel
15:19:23 1024/4096
15:20:05 but pid1 still has the ability to raise its own hard limit i guess
15:20:15 which we do not have in containers even with user namespaces
15:20:37 as root any process can raise its hard limit
15:21:12 any process with CAP_SYS_RESOURCE i guess
15:21:31 that I don't know
15:22:46 oh well, i guess I'd say you should make a patch proposal to the libvirt list and we can discuss the finer details there
15:23:42 ok
15:26:25 #topic qemu-nbd mounting
15:26:30 s1rp: ^^
15:27:26 Howdy danpb. s1rp and I were trying to get libvirt-lxc working with a simple tempest test in devstack over the past few days...
15:27:51 yeah, so just started digging into that at the end of the day yesterday; but i don't think it's a 'bug' in libvirt per se; just the fact that AMIs + qemu-nbd + mount aren't working off the bat
15:28:41 we've been using LVM for a few months, so the qemu-nbd stuff is very unfamiliar, so not even sure if i can ask the right questions at the moment :-)
15:29:19 The effect is basically that the rootfs filesystem on the disk seems to fail to populate correctly. Correct, s1rp?
15:29:24 the idea behind the qemu-nbd stuff is simply that it lets us use qcow2 files directly without having to flatten them into raw files
15:29:52 assuming qemu-nbd is not buggy, you aren't supposed to notice any difference vs plain raw files
15:30:14 does qemu-nbd/qemu-img even support ami?
15:30:36 danpb: yeah the qcow2 part does work fine...
15:31:07 apmelton: oh, wait, what do you mean by 'ami' here ?
15:31:28 so we're using the cirros image from devstack
15:31:29 danpb: I'm wondering the same thing, s1rp, what AMI are you talking about?
15:31:38 this isn't just a plain disk image in raw or qcow2 format ?
15:31:57 i thought AMI was just a term to refer to the 3 files (kernel, initrd, disk image)
15:32:25 must admit i've not looked at cirros in any great detail though
15:32:40 danpb: right, which is why it's not going to work... we're just going to need to use a different image for the LXC tempest tests
15:34:00 so you can't just boot a regular cirros disk image without the kernel/initrd part being used
15:34:20 naively i'd hope the cirros disk image would just let you run the init inside it
15:34:58 s1rp: does tempest not use the normal cirros image that comes with devstack?
15:34:59 danpb: yeah so we're attempting that, but it's not working off the bat... still digging in to figure out exactly why....
sorry, not too many more details i can offer
15:35:12 we end up with an empty filesystem mounted to rootfs right now
15:35:13 yeah, i'd be interested in hearing what you find out
15:35:14 I could have sworn I got it to at least start a cirros container
15:35:39 there are two cirros images in devstack
15:35:46 apmelton: yeah i vaguely recall that as well, but i'm not sure what's different this go around
15:35:49 one being raw format for libvirt-lxc
15:35:52 did we have to recompile qemu-nbd
15:35:52 ?
15:36:00 s1rp: nope
15:36:03 hmm
15:36:08 the only thing I ever had to do was modprobe nbd
15:36:18 for some reason devstack won't load it when running lxc
15:36:25 apmelton: I also got it running containers, but that was months ago.
15:36:33 yeah, the module is loaded
15:36:37 but they consistently worked fine in Devstack
15:37:06 ohhh, hang on, the cirros image is weird
15:37:14 oh really?
15:37:45 so when i try to qemu-nbd/mount it, i just end up w/ 'lost+found'
15:37:49 so the disk image just looks like an empty ext3 filesystem
15:37:51 i'd expect a full filesystem
15:37:58 yeah that's what i'm seeing
15:38:02 i wonder if all the real data is in the initrd cpio file
15:38:11 and just gets copied into this empty filesystem by the initrd
15:38:11 so that's what i was starting to think...
15:38:16 haven't gotten to confirm that though
15:38:53 so if that's the case, how did we get cirros working before?
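The qemu-nbd/mount inspection described above looks roughly like the commands below. They need root and the nbd kernel module; the device node, image filename, and mount point are example names, not values taken from the meeting:

```shell
# Expose a qcow2 image as a block device and look inside it.
modprobe nbd max_part=8
qemu-nbd --connect=/dev/nbd0 cirros.qcow2
mount /dev/nbd0 /mnt/rootfs
ls /mnt/rootfs        # on the uec-style cirros image: only lost+found
umount /mnt/rootfs
qemu-nbd --disconnect /dev/nbd0
```

If the filesystem sits in a partition rather than spanning the whole disk, the mount source would be a partition node such as /dev/nbd0p1 instead.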
15:38:54 perhaps we'll just have to make devstack able to convert the default cirros image into a normal filesystem when it sees it is configured for libvirt LXC
15:38:56 apmelton ^^^
15:39:25 danpb: yeah a shim shouldn't be too hard to write, i hope
15:39:32 s1rp: must be something new, I know I had cirros working at least to the point where I could virsh console into it
15:40:04 i might still have the devstack vm around that I can poke in
15:40:14 yeah, wonder if the cirros upstream changed
15:40:28 yeah, so i'm going to look in the initrd to see what's hiding in there
15:40:46 hopefully that answers it, and we can just do a copy into the rootfs, and then mount for lxc
15:41:31 would really like the gate to use the 'cirros' image though
15:41:43 that way we're literally just flipping virt_type from qemu -> lxc
15:42:03 even if behind the scenes we're doing some hacky stuff to make that work
15:42:53 s1rp, nelsnelson: may I see the local.conf for your devstack?
15:43:34 you mean localrc?
15:43:56 s1rp: Sure, it's local.conf now
15:44:02 http://devstack.org/
15:44:05 localrc is the old way
15:44:23 s1rp: danpb: cirros image from july 16th devstack contains a single rootfs image
15:44:24 ah right, so i'm just using the prompts to generate .local.auto
15:44:40 https://github.com/openstack-dev/devstack/blob/a6a45467c412476b2cddb8526a97e421b0b74db7/stackrc#L354-L356
15:44:43 s1rp: when you switched to lxc, did you just flip the config, or did you rebuild?
15:45:04 thomasem: ah ha
15:45:10 apmelton: just flipped the config
15:45:32 thomasem: ahhh
15:46:19 thomasem: well that's the answer, thanks!
15:46:24 my pleasure!
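The "flip" being discussed is a one-line devstack setting; based on the stackrc of that era (linked above), the variable name is as follows, but treat the exact spelling as an assumption:

```shell
# local.conf / localrc fragment: switch nova's libvirt virt type from
# the default (qemu/kvm) to lxc; devstack then picks a matching
# default cirros image based on this value.
LIBVIRT_TYPE=lxc
```

As thomasem points out, the catch is that devstack fetches a different cirros image for lxc (a single rootfs image) than for qemu (the uec-style kernel/initrd/disk triple).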
15:46:39 tempest tests, one step closer, 999 more to go
15:46:45 :P
15:48:05 apmelton: the image i looked at is from just a day or two ago
15:48:23 anyway, this doesn't sound like it is a libvirt problem per se, rather a cirros problem
15:48:35 danpb: there are two different images devstack will pull down, the uec style or the rootfs style
15:48:51 ah,
15:48:58 finally my 2 months in devstack hell pay off
15:48:59 :)
15:49:09 :)
15:49:19 Well, they did back then too. Now it's just icing.
15:49:43 lol
15:49:52 #topic idmap
15:50:15 so i know we discussed the idmap and block devices issue a while back, but i was curious if there are any new thoughts?
15:50:32 sew: can you refresh my mind ?
15:50:51 is the problem that we need to find a way to mount the block dev before we run id remapping ?
15:50:57 yes, i think so
15:51:17 when using lvm with idmap for example
15:51:28 yeah, guess that does make sense really
15:51:54 i would imagine anyone attempting to run containers with user namespaces enabled and block backed filesystems would be hitting this
15:52:41 hmm, oh, type=block is only used when we boot from a cinder volume, right ?
15:53:14 cinder would be block device
15:53:29 but we're seeing this problem with just local logical volumes
15:53:42 danpb: we'd like to move to type=block for lvm images as well
15:54:43 right now though, for non-volume based instances we're always mounting the image (whether lvm, raw or qcow2) in the host and then using type=mount so idmap can run without trouble
15:54:51 it'd be nice to be able to start VMs using libvirt, rather than have nova mount the drive, then start the guest
15:55:11 reboot inside the container also fails under the mounted approach
15:55:14 if we change so that libvirt is responsible for all the mounting, then we'll face this problem for all types of root disk: lvm, raw, qcow2
15:55:48 danpb: moving the responsibility to libvirt would allow the removal of a bunch of extra lxc code paths in nova
15:55:58 it seems that somewhere in the guest startup process we will always need to mount the images or volumes in the host, run the idmap, and then unmount again and hand off to libvirt
15:56:12 apmelton: yeah, i'd like to see that happen
15:56:16 danpb: this isn't idmapshift
15:56:28 danpb: this is the setting up of the actual user namespace
15:56:51 oh, you mean libvirt itself is broken with type=block and user namespaces
15:56:52 it appears that the mounting of the block device is done from within the user namespace
15:56:56 danpb: yup
15:56:58 yes
15:57:03 sorry, ok, now i see
15:57:52 guess someone will need to poke around in the code to see how we might fix it - presumably we need to chown the block device before starting the user namespace, or possibly even mount it beforehand ?
15:58:33 i played with changing device ownership a bit, but didn't find a winning combo
15:59:07 afraid we're out of time here today
15:59:18 have to make way for the next meeting to use this channel
15:59:36 sounds good. thx danpb
15:59:38 best take this to the mailing list or bug tracker
15:59:52 #endmeeting
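For context on the user-namespace setup discussed in the idmap topic, libvirt's domain XML expresses the uid/gid mapping with an idmap element; the ranges below are example values, not ones taken from the meeting:

```xml
<domain type='lxc'>
  <!-- Map container uids/gids 0..65535 onto host ids 100000..165535;
       root inside the container is unprivileged on the host. -->
  <idmap>
    <uid start='0' target='100000' count='65536'/>
    <gid start='0' target='100000' count='65536'/>
  </idmap>
</domain>
```

The bug described above is that with type=block disks this mapping is applied before the block device is mounted, so the mount happens inside the user namespace and fails against host-owned device nodes.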