*** auk has joined #kata-dev | 02:33 | |
*** auk has quit IRC | 06:27 | |
*** lpetrut has joined #kata-dev | 07:16 | |
*** jodh has joined #kata-dev | 07:52 | |
*** sgarzare has joined #kata-dev | 08:06 | |
*** jodh has quit IRC | 08:42 | |
*** jodh has joined #kata-dev | 08:44 | |
*** davidgiluk has joined #kata-dev | 08:55 | |
*** sameo has joined #kata-dev | 09:05 | |
*** gwhaley has joined #kata-dev | 09:07 | |
*** pohly has joined #kata-dev | 10:38 | |
pohly | stefanha: hello. I am trying to understand how (and how well) virtio-fs supports mmap. Background: I work on PMEM-CSI, a driver which enables the use of PMEM in Kubernetes. Ultimately the goal is that an application can do mmap(MAP_SYNC) and then do byte read/writes directly to the underlying hardware. That works without kata-containers involved. I now looked at kata-containers 1.9.1 with the kata-qemu-virtiofs. I can see that this | 10:42 |
---|---|---|
pohly | passes the dax-capable filesystem (XFS, in case that this matters) into the qemu instance with virtiofs. A test program can do mmap(MAP_SYNC) on a file. | 10:42 |
pohly | But... it can also do that with 9p as file system and with the container root filesystem served by virtio-fs although that filesystem on the host does not support dax (hosted by plain SSD). | 10:43 |
pohly | I was under the (perhaps mistaken) impression that virtio-fs would somehow support mmap. I thought I had read that somewhere. Is that really true? | 10:45 |
pohly | I checked the /proc/<pid>/maps for the /opt/kata/bin/qemu-virtiofs-system-x86_64 process that runs the pod. It doesn't have any entry for the file that currently is mapped inside the container. | 10:46 |
brtknr | pohly: following this discussion | 10:51 |
davidgiluk | pohly: is the mount mounted with DAX? | 10:57 |
pohly | Yes: kataShared on /data type virtio_fs (rw,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other,dax) | 10:58 |
pohly | That is inside qemu. | 10:58 |
pohly | And also outside of it: /dev/mapper/ndbus0region0fsdax-e7660acd0fd86e6aea32589af51903654f6a4e41 on /var/lib/kubelet/pods/6576fed5-5488-4ee4-a6a2-578c5519ae9c/volumes/kubernetes.io~csi/my-csi-volume/mount type xfs (rw,relatime,attr2,dax,inode64,noquota) | 10:59 |
stefanha | pohly: virtio-fs isn't intended for pmem. QEMU won't use MAP_SYNC. | 11:00 |
stefanha | pohly: If you need MAP_SYNC semantics then QEMU's nvdimm device can do that. | 11:01 |
stefanha | pohly: MAP_SYNC support could be added to virtio-fs but today it doesn't do that. | 11:01 |
pohly | stefanha: if virtio-fs doesn't support MAP_SYNC, shouldn't it then reject the mmap call? | 11:01 |
stefanha | pohly: Probably. Inside the guest the virtio-fs and FUSE code isn't doing anything that violates MAP_SYNC, | 11:03 |
stefanha | but the problem is that the host side doesn't necessarily honor those semantics. | 11:03 |
pohly | But plain mmap works? | 11:03 |
stefanha | pohly: Yep, plain mmap is supported. | 11:03 |
pohly | Should I then see a /proc/*/maps entry for the file? I don't have that. | 11:04 |
pohly | Or am I checking the wrong process? I looked at qemu-virtiofs-system-x86_64, because that is where the code runs. | 11:04 |
stefanha | pohly: There isn't necessarily a 1:1 mmap relationship between guest application mmaps and host qemu-virtiofs-system-x86_64 mmaps. | 11:05 |
stefanha | pohly: What are you trying to confirm by looking at qemu-virtiofs-system-x86_64 mmaps? | 11:05 |
pohly | Looking more closely I do see one entry that has at least the right size: 7f2f1bffe000-7f2f1bfff000 ---p 00000000 00:00 0 | 11:06 |
pohly | But it doesn't have a file name associated with it. Should it have that? | 11:06 |
pohly | I am trying to verify that a file on the host has indeed been mapped into the address space of the process running inside qemu. | 11:07 |
pohly | If that isn't the case, then how does mmap support work? | 11:07 |
stefanha | pohly: The lack of filename could be due to file descriptor passing | 11:07 |
stefanha | The file is opened by virtiofsd and passed to QEMU. Maybe that's why no name is reported. | 11:08 |
stefanha | But that's just a guess. | 11:08 |
pohly | That might be it. Let me remove the mapping inside qemu... | 11:08 |
davidgiluk | the name normally does show up | 11:08 |
davidgiluk | pohly: Have you accessed the mmap'd area, or just done the mmap? | 11:08 |
pohly | Just the mmap. So it's waiting for a page fault before doing anything on the host side? I can add that. | 11:10 |
stefanha | Yes, that sounds likely. | 11:11 |
stefanha | pohly: But again, if your goal is to get pmem semantics then virtio-fs in its current state doesn't guarantee that. | 11:11 |
davidgiluk | pohly: Yes, I think so - remember for virtiofs we only have a fixed sized cache window, so we can't guarantee to mmap the whole region | 11:12 |
stefanha | pohly: QEMU has -device nvdimm and -device virtio-pmem-pci for that. | 11:12 |
pohly | Using those for a mounted filesystem in kata-containers isn't going to be easy. | 11:13 |
pohly | virtio-fs looked much more promising ;-} | 11:14 |
*** pcaruana has joined #kata-dev | 11:14 | |
davidgiluk | stefanha: What stops us passing the MAP_SYNC all the way through? | 11:15 |
pohly | davidgiluk: even if you do, "fixed size cache window" sounds like another big roadblock. PMEM comes in higher capacity than DRAM, that's partly why it is appealing for some workloads. | 11:17 |
pohly | MAP_SYNC isn't even needed for all workloads. In fact, most apps currently don't depend on it. | 11:18 |
pohly | So virtio-fs may already be a good step forward and sufficient. | 11:18 |
pohly | OTOH, if it needs to set up and tear down mappings on the host side often, then that may affect performance. | 11:19 |
pohly | memcached uses PMEM as DRAM replacement and stores its data there. Predictable access times for that data probably is important. | 11:20 |
davidgiluk | pohly: Right; if you've got a single PMEM device to pass through then as stefan says using the -device stuff is the right way; if you're trying to pass through files that on the host are mounted on a filesystem that's backed by pmem, then virtiofs might be interesting | 11:20 |
pohly | davidgiluk: we are trying the former. PMEM-CSI basically splits up a single PMEM device and hands out portions of it to individual apps. We cannot assume that only a single app uses that device; that would be rather limiting. | 11:22 |
pohly | Ahem, I meant "we are trying the latter"... | 11:22 |
davidgiluk | pohly: But do the PMEM-CSI portions look like individual block devices that you then put a filesystem on, and is that filesystem built in the host or the guest? | 11:23 |
pohly | davidgiluk: it is a block device. But applications in Kubernetes typically will ask for a filesystem, so PMEM-CSI formats and mounts that device. | 11:26 |
*** sameo has quit IRC | 11:26 | |
pohly | And then Kubernetes passes the directory name of the mounted FS to the runtime. | 11:26 |
pohly | I heard that kata-containers sometimes does tricks like then passing the device into qemu and mounting again inside. | 11:27 |
pohly | That's a bit dirty, because there are two Linux kernels which both might write to the same block device. | 11:27 |
davidgiluk | pohly: OK, if it's a device+filesystem just for that container then it does feel like passing that block device into the container is right rather than passing the filesystem through virtiofs | 11:27 |
pohly | davidgiluk: yes, that would be the better alternative, except for the "is already mounted" part. | 11:28 |
pohly | Also, does it have to be some actual device? Currently the block devices are either LVM logical volumes or PMEM namespaces (/dev/pmem*). | 11:29 |
pohly | We can't use PCI device pass-through - it's not even on the PCI bus. | 11:30 |
pohly | Nor do we want to pass in the entire NVDIMM. | 11:30 |
*** openstack has joined #kata-dev | 11:39 | |
*** ChanServ sets mode: +o openstack | 11:39 | |
*** openstack has joined #kata-dev | 11:51 | |
*** ChanServ sets mode: +o openstack | 11:51 | |
davidgiluk | pohly: Yeh probably best to make an issue; I'm also not sure the best way to wire it through - but if it looks like a block device, and that block device is intended just for this container, then treat it as a block device and let the guest handle it | 11:51 |
*** irclogbot_1 has quit IRC | 11:52 | |
*** irclogbot_2 has joined #kata-dev | 11:52 | |
gwhaley | pohly: include 'devimc' on that Issue, if not already - he'll have a good idea I think of what knitting would be required. | 11:52 |
gwhaley | yes, the hard bit is how to annotate that volume/mount/device to ensure it ends up mapped via the correct route. It may be that 'annotations' are the route. | 11:53 |
gwhaley | oh, amshinde might have good input as well | 11:53 |
gwhaley | so, historically we've always noted that nvdimm/dax could be used to pass items in (kata uses it for iirc the kernel image, or is it the rootfs....) - but, I don't believe there is a defined mechanism to set that all up via the orchestrators and runtime, and I don't think I've ever seen anybody actually using an nvdimm/dax mount/map for themselves ... yet.... | 11:55 |
pohly | gwhaley: /opt/kata/share/kata-containers/kata-containers-image_clearlinux_1.9.1_agent_d4bbd8007f.img is passed via "-object" + "-device nvdimm". | 11:59 |
pohly | Looks like the rootfs. There's also "root=/dev/pmem0p1". | 12:00 |
gwhaley | pohly: right, the rootfs for the VM (I can never remember if it is the rootfs or the kernel we do it with ;-) )... so, we use it, we know it works.... now it would be how do we enable 'users' to do it... | 12:00 |
pohly | davidgiluk: to get closure on this: when actually writing into the memory mapped region via virtio-fs, I do see map entries on the host side, including the file name. | 12:03 |
pohly | davidgiluk: how large is this "fixed size cache window"? | 12:03 |
davidgiluk | pohly: It's configurable via an option, normally a few GB | 12:04 |
gwhaley | https://github.com/kata-containers/runtime/blob/master/cli/config/configuration-qemu-virtiofs.toml.in#L118-L131 :-) | 12:05 |
* davidgiluk disappears for 2 hours | 12:05 | |
* gwhaley goes for lunch... | 12:05 | |
pohly | So a lot less than the hundreds of GB that people may have as PMEM. Might be worth testing how that affects performance. Thanks! | 12:06 |
* pohly too | 12:06 | |
* pohly lunch... | 12:06 | |
*** lpetrut has joined #kata-dev | 12:53 | |
pohly | stefanha: should I also file a bug about rejecting MAP_SYNC? Where? | 13:46 |
pohly | Oh, in case someone wants to follow, the issue about adding PMEM support is here: https://github.com/kata-containers/runtime/issues/2262 | 13:46 |
*** canyounot has joined #kata-dev | 14:02 | |
stefanha | pohly: Sorry, I was offline. Please file it here: https://gitlab.com/groups/virtio-fs/-/issues | 14:10 |
pohly | stefanha: which project? "libfuse"? | 14:13 |
pohly | Note that 9p has the same issue, so it might be common to fuse-based filesystems. | 14:14 |
stefanha | pohly: linux please | 14:16 |
stefanha | pohly: virtio-9p is not FUSE-based. | 14:16 |
pohly | Oh, okay. | 14:16 |
*** devimc has joined #kata-dev | 14:22 | |
*** fuentess has joined #kata-dev | 14:36 | |
*** sameo has joined #kata-dev | 14:36 | |
pohly | stefanha: never mind. I made a slight mistake in my test program (MAP_SHARED instead of MAP_SHARED_VALIDATE) and the effect is that MAP_SYNC gets silently ignored, as specified in the man page. | 14:39 |
stefanha | aha! :) | 14:46 |
stefanha | So now mmap(2) rejects the flag? | 14:46 |
pohly | Yes. | 15:02 |
*** lpetrut has quit IRC | 16:51 | |
*** dklyle has quit IRC | 16:53 | |
*** dklyle has joined #kata-dev | 16:54 | |
*** sgarzare has quit IRC | 17:04 | |
*** igordc has joined #kata-dev | 17:18 | |
*** devimc has quit IRC | 17:33 | |
*** devimc has joined #kata-dev | 17:34 | |
*** devimc has quit IRC | 17:40 | |
*** devimc has joined #kata-dev | 17:41 | |
*** devimc has quit IRC | 17:57 | |
*** devimc has joined #kata-dev | 17:58 | |
*** jodh has quit IRC | 18:05 | |
*** gwhaley has quit IRC | 18:13 | |
*** igordc has quit IRC | 18:39 | |
*** noahm has joined #kata-dev | 18:49 | |
*** igordc has joined #kata-dev | 19:55 | |
*** davidgiluk has quit IRC | 20:09 | |
*** sameo has quit IRC | 20:42 | |
*** igordc has quit IRC | 20:47 | |
*** igordc has joined #kata-dev | 20:53 | |
*** pcaruana has quit IRC | 21:38 | |
*** pohly has quit IRC | 21:55 | |
*** canyounot has quit IRC | 22:07 | |
*** devimc has quit IRC | 23:07 |
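
Editor's sketch: the check pohly describes in the log (looking in /proc/<pid>/maps of the QEMU process for an entry naming the mapped file) can be approximated with a small parser. This is only an illustration, assuming the line format documented in proc(5); `MapEntry`, `parse_maps_line`, `mappings_for` and `read_maps` are hypothetical helper names, not part of kata-containers, QEMU or virtiofsd.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class MapEntry:
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    pathname: Optional[str]  # None for anonymous mappings

    @property
    def size(self) -> int:
        # Length of the mapping in bytes.
        return self.end - self.start


def parse_maps_line(line: str) -> MapEntry:
    """Parse one line in the /proc/<pid>/maps format (see proc(5))."""
    # The pathname is optional and may contain spaces, so split at most 5 times.
    fields = line.split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    start, end = (int(x, 16) for x in addr.split("-"))
    pathname = fields[5].strip() if len(fields) > 5 else None
    return MapEntry(start, end, perms, int(offset, 16), dev, int(inode),
                    pathname or None)


def mappings_for(lines: Iterable[str], path: str) -> List[MapEntry]:
    """Return the entries whose pathname equals the given file path."""
    entries = (parse_maps_line(l) for l in lines if l.strip())
    return [e for e in entries if e.pathname == path]


def read_maps(pid: int) -> List[str]:
    """Read the raw maps lines of a process (needs permission to read them)."""
    with open(f"/proc/{pid}/maps") as f:
        return f.read().splitlines()
```

Fed the entry pasted at 11:06, `parse_maps_line` yields a 4096-byte mapping with no pathname, which matches the observation that the region had the right size but no file name. As noted at 11:08 to 11:10 and confirmed at 12:03, a named entry only showed up on the host after the guest actually wrote into the mapped region, so an empty result from `mappings_for` before the first access does not by itself mean mmap is unsupported.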
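
Editor's sketch: the resolution at 14:39 (MAP_SYNC is silently ignored with MAP_SHARED and only checked with MAP_SHARED_VALIDATE) can be probed roughly as follows. This assumes Linux with the generic flag values from the kernel headers (as on x86-64); the constants are defined locally because the Python mmap module may not export them, and `try_map_sync` is a hypothetical helper name. It is not pohly's actual test program.

```python
import errno
import mmap
import os

# Flag values copied from the Linux uapi headers (assumption: generic/x86-64
# values; a few architectures use different numbers).
MAP_SHARED_VALIDATE = 0x03
MAP_SYNC = 0x80000


def try_map_sync(path: str, validate: bool) -> bool:
    """Try to mmap `path` with MAP_SYNC; return True if mmap(2) accepted it.

    validate=False uses plain MAP_SHARED, where the kernel ignores flags it
    does not act on, so success says nothing about MAP_SYNC being honored.
    validate=True uses MAP_SHARED_VALIDATE, where an unsupported MAP_SYNC
    makes the call fail instead (EOPNOTSUPP, or EINVAL on kernels that do
    not know MAP_SHARED_VALIDATE).
    """
    base = MAP_SHARED_VALIDATE if validate else mmap.MAP_SHARED
    fd = os.open(path, os.O_RDWR)
    try:
        length = os.fstat(fd).st_size  # the file must not be empty
        try:
            m = mmap.mmap(fd, length, flags=base | MAP_SYNC,
                          prot=mmap.PROT_READ | mmap.PROT_WRITE)
        except OSError as e:
            if e.errno in (errno.EOPNOTSUPP, errno.EINVAL):
                return False
            raise
        m.close()
        return True
    finally:
        os.close(fd)
```

Only the `validate=True` result is meaningful as a check: a True from `try_map_sync(path, validate=False)` is what made 9p and the non-DAX root filesystem appear to support MAP_SYNC earlier in the log.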