There are many different aspects at play here, and the right answer also
depends on what kind of storage you want to deal with. Ignoring k8s, for network
block devices using the in-QEMU client is the preferred way of dealing with
RBD volumes (as opposed to in-kernel client + block device access). It is
simpler to manage, as you don't need to play waiting games on asynchronous
operations such as waiting for udev to finish creating device nodes, which
has been a constant source of bugs in the past. It avoids the situation
where you get an unkillable QEMU process when storage hangs, because the
kernel syscall is stuck in an uninterruptible wait state. It also has better
performance characteristics by removing layers from the I/O stack both
in QEMU and the kernel. Similar benefits apply to Gluster, iSCSI
and NFS, all of which have in-QEMU clients. iSCSI is not quite so clear
cut, though, because there is no multipath support in QEMU at this time.
With recent work, QEMU can also support LUKS encryption natively with
these in-QEMU clients.
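As an illustration, attaching an RBD volume via the in-QEMU client is done with
a <disk type='network'> element in the libvirt domain XML, roughly as sketched
below (the monitor hostname, pool/image name and secret UUIDs are placeholders,
not real values):

```xml
<disk type='network' device='disk'>
  <!-- driver name='qemu' selects QEMU's built-in librbd client, so no
       kernel RBD block device or udev processing is involved -->
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='example-pool/example-image'>
    <host name='mon1.example.org' port='6789'/>
  </source>
  <!-- cephx credentials, referencing a previously defined libvirt secret -->
  <auth username='libvirt'>
    <secret type='ceph' uuid='11111111-2222-3333-4444-555555555555'/>
  </auth>
  <!-- native LUKS decryption by QEMU, passphrase held in a libvirt secret -->
  <encryption format='luks'>
    <secret type='passphrase' uuid='66666666-7777-8888-9999-aaaaaaaaaaaa'/>
  </encryption>
  <target dev='vda' bus='virtio'/>
</disk>
```

With this configuration QEMU speaks the RBD protocol directly, so there is no
/dev/rbd* device node on the host at all.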
So in general, using the in-QEMU clients is the right way to integrate,
regardless of any other k8s-specific aspects. It happens to be the
case, though, that using the in-QEMU client is also a better fit for k8s.
As it stands today, if you map a PVC into a POD, k8s wants to format a
filesystem on it and mount it in the POD, even if you don't request
any mounts into the containers. IOW you can't actually pass the RBD,
iSCSI, etc block device into the POD - you just get a filesystem on
the block device.
There may be times where you want a filesystem with a qcow2/raw file
inside it, which is then exposed to QEMU, but I don't expect that to
be the common case. For a start, you cannot then do live migration,
since you can't mount that volume on multiple hosts at once due to it
using ext3. So you would have to do a full storage copy on migration.
It also adds many more layers into the I/O path, so it will have worse
performance than when QEMU uses an RBD client directly.
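For comparison, the filesystem-in-the-middle approach would look something
like the sketch below, with QEMU pointed at a qcow2 file sitting on the
mounted PVC (the path here is purely hypothetical):

```xml
<disk type='file' device='disk'>
  <!-- I/O now traverses: QEMU -> qcow2 format layer -> host filesystem
       -> kernel RBD client, instead of simply QEMU -> librbd -->
  <driver name='qemu' type='qcow2'/>
  <source file='/mnt/pvc-example/disk.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>
```

Every one of those extra layers adds latency, and the file path only exists
on whichever host has the volume mounted, which is what blocks live migration.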
Then of course there is the namespace problem. The QEMU processes
need to be in the same namespace as libvirtd, and while k8s does have
these mounts exposed in the host namespace where libvirtd & QEMU
could see them, this is a private implementation detail of k8s right now.
Eventually we do need to figure out a way to deal with block devices
and local files, which will probably involve accessing these mounts
or devices in the host namespace, but we'll need to try and get k8s
to provide some kind of guarantees that this is allowed.
Regards,
Daniel
--
|: https://berrange.com      -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org       -o- https://fstop138.berrange.com           :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange    :|