#openstack-meeting-3 log

15:00:45 <danpb> #startmeeting libvirt
15:00:46 <openstack> Meeting started Tue Aug 19 15:00:45 2014 UTC and is due to finish in 60 minutes.  The chair is danpb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:47 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:49 <openstack> The meeting name has been set to 'libvirt'
15:01:13 <danpb> ..if anyone is here & wants to talk about libvirt issues make yourselves known...
15:01:41 <sew> o/
15:02:06 <thomasem> o/
15:02:23 <apmelton> o/
15:02:26 <mtesauro> o/
15:02:35 <rcleere> o/
15:02:52 <nelsnelson> o/
15:04:04 <danpb> welcome folks, so the only item added to the etherpad is from rcleere
15:04:11 <danpb> #topic Libvirt ulimit patch
15:04:24 <danpb> rcleere: ..your turn :-)
15:04:28 <rcleere> ok
15:04:46 <rcleere> So, when we boot containers the container inherits ulimits from the libvirt_lxc process
15:05:07 <rcleere> one of those being no_file (max file descriptors a process can have)
15:05:36 <danpb> ok, standard unix inheritance which is probably quite unhelpful here :-)
15:05:37 <rcleere> this introduces xml to set it in the instance XML that would get applied when executing the init process of the container.
15:06:32 <danpb> rcleere: what is your end goal here - is it simply to avoid inheriting a possibly stupid limit from libvirt_lxc
15:06:39 <sew> with this patch, eventually we could apply flavor specific open file limits
15:06:43 <rcleere> The other hacky way I was doing it was to have a shell wrapper around libvirt_lxc that set it then passed all args to the real libvirt_lxc
15:06:57 <danpb> rcleere: or is it to use this for some kind of explicit resource control policy
15:07:19 <rcleere> explicit resource control.
15:07:58 <rcleere> I should mention we are using user namespace, so as root in the container you cannot increase the no_file only decrease it
15:08:03 <danpb> hmm, so that raises questions of just how the mechanism works, particularly wrt user namespaces
15:08:24 <danpb> ah, that's what i was going to ask - does user namespace give you complete unrestricted control over the ulimits
15:08:31 <rcleere> nope
15:08:34 <rcleere> just decrease
15:09:11 <danpb> so semantically is the num files limit of the container a global limit that will apply to the sum of all child processes it starts
15:09:22 <rcleere> so as it is in our environment the kernel sets a soft limit of 1024 and a hard limit of 4096. we don't change those in our running environment. libvirt_lxc inherits that, and so does container init.
15:09:32 <rcleere> unfortunately not
15:09:40 <rcleere> it is the max open files per process.
15:09:41 <danpb> i recall previously seeing discussions about whether  ulimits needed to be mapped into cgroups to apply to an entire process set
15:09:57 <rcleere> they really should be mapped to cgroups and/or namespaces
15:10:05 <thomasem> Seems like it'd make more sense that way
15:10:13 <danpb> yeah so if it is per-process it isn't very useful as a resource limiting mechanism
15:10:21 <danpb> as you can just spawn more processes to escape the limit
15:10:33 <rcleere> There had been some talk on the kernel mailing list and some patches, but none have made it in.
15:10:43 <sew> i like that ulimit inside the container reports correct limits with rcleere's approach
15:11:20 <danpb> given that current impl, why would we not simply set the container init to have (hard_limit == soft_limit) == hard limit of libvirt-lxc ?
15:11:23 <rcleere> danpb: true it is not a real limit on the container, BUT it does allow things like mysql, nginx, haproxy to set the higher limits they need for incoming tcp connections.
15:11:45 <danpb> it isn't offering any resource control - so only useful thing we can do is avoid the accidental capping of the soft limit
15:11:52 <rcleere> danpb: not sure I understand your question
15:12:16 <danpb> so say libvirt_lxc has soft=1024, hard=8192
15:12:35 <danpb> why would we not simply set the init process to have hard=8192, soft=8192 and avoid any XML config
15:12:43 <rcleere> it already does.
15:13:09 <danpb> the XML config doesn't seem to add any real value here until we can set num_files limits against the cgroup so they actually work as a global cap on the container
15:13:20 <sew> we'd eventually like to customize the number of open files value based on flavor
15:13:38 <sew> and having an XML directive seemed to be a good place to define it
15:13:42 <rcleere> But I don't want to blindly set ALL containers to an arbitrarily high no_files limit if they don't need it.
15:14:05 <danpb> sew: sure i understand that, but it feels like that needs the cgroups integration for ulimits
15:14:20 <rcleere> like the haproxy example. if we were going to use that as a LB, then I would probably want the no_files limit at 10000+ but I don't really want that for all of my containers.
15:15:00 <apmelton> the idea behind this change is to allow good processes to work properly, not to disallow bad processes from behaving badly
15:15:14 <thomasem> (which they could anyway)
15:15:14 <rcleere> apmelton: very good point
15:15:35 <mtesauro> +1 on apmelton's point
15:16:13 <danpb> apmelton: yep, but my concern is around the long term implications of  a short term decision
15:16:52 <danpb> apmelton: eg if we add config setting for the current per-process ulimit vs later  per-container ulimit
15:17:10 <danpb> and whether it makes sense to support both approaches in the long term or not
15:17:17 <apmelton> danpb: why not both once they're available
15:17:21 <apmelton> ah
15:17:46 <apmelton> so, being able to set ulimit and have it appear correctly in a container will be useful for processes that aren't cgroup aware
15:17:52 <danpb> i guess you could say that if you had a per-container ulimit of  100,000 files
15:18:25 <danpb> would it still make sense to be able to say that the init process only had 8192 per-process limit
15:19:11 <danpb> does anyone know what limits a real pid1 will get from the kernel
15:19:23 <rcleere> 1024/4096
15:20:05 <danpb> but pid1 still has the ability to raise its own hard limit i guess
15:20:15 <danpb> which we do not have in containers even with user namespace
15:20:37 <rcleere> as root any process can raise its hard limit
15:21:12 <danpb> any process with CAP_SYS_RESOURCE i guess
15:21:31 <rcleere> that I don't know
15:22:46 <danpb> oh well, i guess I'd say you should make a patch proposal to the libvirt list and we can discuss the finer details there
15:23:42 <rcleere> ok
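
A minimal sketch of the inheritance behaviour discussed in this topic, using Python's standard `resource` module rather than libvirt itself. The helper name `nofile_limits_in_child` is hypothetical. It sets RLIMIT_NOFILE between fork and exec, which is roughly the point where a container launcher could apply a configured value before running init, and then reports what the child process sees. It assumes a POSIX host.

```python
import resource
import subprocess
import sys


def nofile_limits_in_child(soft, hard):
    """Start a child with the given RLIMIT_NOFILE and return the (soft, hard) it sees."""

    def apply_limits():
        # Runs in the child after fork() and before exec(). The limits
        # survive the exec, just as init inherits them from its launcher.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

    out = subprocess.check_output(
        [
            sys.executable,
            "-c",
            "import resource; print(*resource.getrlimit(resource.RLIMIT_NOFILE))",
        ],
        preexec_fn=apply_limits,
    )
    seen_soft, seen_hard = out.split()
    return int(seen_soft), int(seen_hard)
```

Calling `nofile_limits_in_child(128, 128)` should return `(128, 128)` as long as the caller's own hard limit is at least 128. The value applies to each process separately, so it does not cap the total number of files opened by a group of processes, which is the gap danpb raises above.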
15:26:25 <danpb> #topic qemu-nbd mounting
15:26:30 <danpb> s1rp: ^^
15:27:26 <nelsnelson> Howdy danpb.  s1rp and I were trying to get libvirt-lxc working with a simple tempest test in devstack over the past few days...
15:27:51 <s1rp> yeah, so just started digging into that at the end of the day yesterday; but i don't think it's a 'bug' in libvirt per se; just the fact that AMIs + qemu-nbd + mount don't work off the bat
15:28:41 <s1rp> we've been using LVM for a few months, so qemu-nbd stuff is very unfamiliar, so not even sure if i can ask the right questions at the moment :-)
15:29:19 <nelsnelson> The effect is basically that the rootfs filesystem on the disk seems to fail to pupulate correctly.  Correct, s1rp?
15:29:24 <danpb> the idea behind the qemu-nbd stuff is simply that it lets us use qcow2 files directly without having to flatten them into raw files
15:29:25 <nelsnelson> populate*
15:29:52 <danpb> assuming qemu-nbd is not buggy, you aren't supposed to notice any difference vs plain raw files
15:30:14 <apmelton> does qemu-nbd/qemu-img even support ami?
15:30:36 <s1rp> danpb: yeah the qcow2 part does work fine...
15:31:07 <danpb> apmelton: oh, wait, what do you mean by 'ami' here ?
15:31:28 <s1rp> so we're using the cirros image from devstack
15:31:29 <apmelton> danpb: I'm wondering the same thing, s1rp, what AMI are you talking about?
15:31:38 <danpb> this isn't just a plain disk image in raw or qcow2 format ?
15:31:57 <danpb> i thought AMI was just a term to refer to the 3 files (kernel, initrd, disk image)
15:32:25 <danpb> must admit i've not looked at cirros in any great detail though
15:32:40 <s1rp> danpb: right, which is why it's not going to work... we're just going to need to use a different image for the LXC tempest tests
15:34:00 <danpb> so you can't just boot a regular cirros disk image without the  kernel/initrd part being used
15:34:20 <danpb> naively i'd hope the cirros disk image would just let you run the init inside it
15:34:58 <apmelton> s1rp: does tempest not use the normal cirros image that comes with devstack?
15:34:59 <s1rp> danpb: yeah so we're attempting that, but it's not working off the bat... still digging in to figure out exactly why.... sorry not too many more details i can offer
15:35:12 <s1rp> we end up with an empty filesystem mounted to rootfs right now
15:35:13 <danpb> yeah, i'd be interested in hearing what you find out
15:35:14 <apmelton> I could have sworn I got it to at least start a cirros container
15:35:39 <thomasem> there are two cirros images in devstack
15:35:46 <s1rp> apmelton: yeah i vaguely recall that as well, but i'm not sure what's different this go around
15:35:49 <thomasem> one being raw format for libvirt-lxc
15:35:52 <s1rp> did we have to recompile qemu-nbd
15:35:52 <s1rp> ?
15:36:00 <apmelton> s1rp: nope
15:36:03 <s1rp> hmm
15:36:08 <apmelton> the only thing I ever had to do was modprobe nbd
15:36:18 <apmelton> for some reason devstack won't load it when running lxc
15:36:25 <thomasem> apmelton: I also got it running containers, but that was months ago.
15:36:33 <s1rp> yeah, the module is loaded
15:36:37 <thomasem> but they consistently worked fine in Devstack
15:37:06 <danpb> ohhh, hang on, the cirros image is weird
15:37:14 <s1rp> oh really?
15:37:45 <s1rp> so when i try to qemu-nbd/mount it, i just end up w/ 'lost+found'
15:37:49 <danpb> so the disk image  just looks like an empty ext3 filesystem
15:37:51 <s1rp> i'd expect a full filesystem
15:37:58 <s1rp> yeah that's what im seeing
15:38:02 <danpb> i wonder if all the real data is in the initrd cpio file
15:38:11 <danpb> and just gets copied into this empty filesystem by the initrd
15:38:11 <s1rp> so that's what i was starting to think...
15:38:16 <s1rp> haven't gotten to confirm that though
15:38:53 <s1rp> so if that's the case, how did we get cirros working before?
15:38:54 <danpb> perhaps we'll just have to make devstack able to convert the default cirros image into a normal filesystem when it sees it is configured for libvirt LXC
15:38:56 <s1rp> apmelton ^^^
15:39:25 <s1rp> danpb: yeah a shim shouldn't be too hard to write, i hope
15:39:32 <apmelton> s1rp: must be something new, I know I had cirros working at least to the point where I could virsh console into it
15:40:04 <apmelton> i might still have the devstack vm around that I can poke in
15:40:14 <danpb> yeah wonder if the cirros upstream changed
15:40:28 <s1rp> yeah, so im going to look in the initrd to see what's hiding in there
15:40:46 <s1rp> hopefully that answers it, and we can just do a copy into the rootfs, and then mount for lxc
15:41:31 <s1rp> would really like the gate to use 'cirros' image though
15:41:43 <s1rp> that way we're literally just flipping virt_type from qemu -> lxc
15:42:03 <s1rp> even if behind the scenes we're doing some hacky stuff to make that work
15:42:53 <thomasem> s1rp, nelsnelson: may I see the local.conf for your devstack?
15:43:34 <s1rp> you mean locarc?
15:43:42 <s1rp> *localrc?
15:43:56 <thomasem> s1rp: Sure, it's local.conf now
15:44:02 <thomasem> http://devstack.org/
15:44:05 <thomasem> localrc is the old way
15:44:23 <apmelton> s1rp: danpb: cirros image from july 16th devstack contains a single rootfs image
15:44:24 <s1rp> ah right, so i'm just using the prompts to generate .local.auto
15:44:40 <thomasem> https://github.com/openstack-dev/devstack/blob/a6a45467c412476b2cddb8526a97e421b0b74db7/stackrc#L354-L356
15:44:43 <apmelton> s1rp: when you switched to lxc, did you just flip the config, or did you rebuild?
15:45:04 <apmelton> thomasem: ah ha
15:45:10 <s1rp> apmelton: just flipped the config
15:45:32 <s1rp> thomasem: ahhh
15:46:19 <s1rp> thomasem: well that's the answer, thanks!
15:46:24 <thomasem> my pleasure!
15:46:39 <s1rp> tempest tests, one step closer, 999 more to go
15:46:45 <thomasem> :P
15:48:05 <danpb> apmelton: the image i looked at is from just a day or two ago
15:48:23 <danpb> anyway, this doesn't sound like it is a libvirt problem per se, rather a cirros problem
15:48:35 <apmelton> danpb: there are two different images devstack will pull down, the uec style or rootfs style
15:48:51 <danpb> ah,
15:48:58 <thomasem> finally my 2 months in devstack hell pay off
15:48:59 <thomasem> :)
15:49:09 <nelsnelson> :)
15:49:19 <thomasem> Well, they did back then too. Now it's just icing.
15:49:43 <danpb> lol
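
The symptom described in this topic (a mounted image that contains only 'lost+found') can be checked for before a container is started. The following is a hedged sketch, not code from nova or devstack; the function name `looks_like_empty_rootfs` is made up for illustration, and it assumes the image has already been mounted at `mount_point` by some other means.

```python
import os


def looks_like_empty_rootfs(mount_point):
    """Return True if the mounted filesystem holds nothing except lost+found."""
    entries = set(os.listdir(mount_point))
    # A freshly created ext filesystem has only lost+found at the top level,
    # which is what the UEC-style cirros disk image looked like here.
    return entries <= {"lost+found"}
```

A check like this would flag the UEC-style disk image and pass a rootfs-style image that has a populated top-level directory.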
15:49:52 <danpb> #topic idmap
15:50:15 <sew> so i know we discussed the idmap and block devices issue a while back, but i was curious if there are any new thoughts?
15:50:32 <danpb> sew: can you refresh my mind ?
15:50:51 <danpb> is the problem that we need to find a way to mount the block dev before we run id remapping ?
15:50:57 <sew> yes, i think so
15:51:17 <sew> when using lvm with idmap for example
15:51:28 <danpb> yeah, guess that does make sense really
15:51:54 <sew> i would imagine anyone attempting to run containers with user namespaces enabled and block backed filesystems would be hitting this
15:52:41 <danpb> hmm, oh,  type=block is only used when we boot from cinder volume, right ?
15:53:14 <sew> cinder would be block device
15:53:29 <sew> but we're seeing this problem with just local logical volumes
15:53:42 <apmelton> danpb: we'd like to move to type=block for lvm images as well
15:54:43 <danpb> right now though, for non-volume based instances we're always mounting the image (whether lvm, raw or qcow2) in the host and then using  type=mount so idmap can run without trouble
15:54:51 <s1rp> it'd be nice to be able to start VMs using libvirt, rather than have to nova mount the drive, then start the guest
15:55:11 <sew> reboot inside the container also fails under the mounted approach
15:55:14 <danpb> if we change so that libvirt is responsible for all the mounting, then we'll face this problem for all types of root disk,  lvm., raw, qcow2
15:55:48 <apmelton> danpb: moving the responsibility to libvirt would allow the removal of a bunch of extra lxc code paths in nova
15:55:58 <danpb> it seems that somewhere in the guest startup process we will always need to mount the images or volumes in the host, run the idmap, and then unmount again and handoff to libvirt
15:56:12 <danpb> apmelton: yeah, i'd like to see that happen
15:56:16 <apmelton> danpb: this isn't idmapshift
15:56:28 <apmelton> danpb: this is the setting up of the actual user namespace
15:56:51 <danpb> oh, you mean  libvirt itself is broken with type=block and user namespaces
15:56:52 <apmelton> it appears that the mounting of the block device is done from within the user namespace
15:56:56 <apmelton> danpb: yup
15:56:58 <sew> yes
15:57:03 <danpb> sorry, ok, now i see
15:57:52 <danpb> guess someone will need to poke around in the code to see how we might fix it - presumably we need to chown the block device before starting the user namespace, or possibly even mount it beforehand ?
15:58:33 <sew> i played with changing device ownership a bit , but didn't find a winning combo
15:59:07 <danpb> afraid we're out of time here today
15:59:18 <danpb> have to make way for the next meeting to use this channel
15:59:36 <sew> sounds good.  thx danpb
15:59:38 <danpb> best take this to the mailing list or bug tracker
15:59:52 <danpb> #endmeeting
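
For readers unfamiliar with the id mapping behind the idmap topic: a user namespace maps contiguous ranges of ids inside the container onto ids on the host, each range given as (start inside, start on host, count). The sketch below shows only that arithmetic. The function name `container_to_host_id` and the value 100000 are illustrative assumptions, not taken from nova or libvirt.

```python
def container_to_host_id(container_id, idmap):
    """Translate an id inside the container to the host id, or None if unmapped.

    idmap is a list of (inside_start, host_start, count) tuples.
    """
    for inside_start, host_start, count in idmap:
        if inside_start <= container_id < inside_start + count:
            return host_start + (container_id - inside_start)
    return None
```

With a single range `(0, 100000, 65536)`, root inside the container (id 0) corresponds to host id 100000. Files on the root filesystem need to be owned by the mapped host ids for the container to use them, which is why the order of mounting, ownership changes and namespace setup matters in the discussion above.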