mounting persistent volume claims in vm pod?

dvo...@redhat.com

Jun 15, 2017, 3:36:55 PM
to kubevirt-dev
Hey,

I've been reading up on how the k8s persistent volumes work and found it curious that we don't mount the claim in the vm pod. Instead we're mapping the persistent volume claim into libvirt domain arguments.

Why did we decide to approach the problem this way instead of allowing k8s to bind the volume into the pod?
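For illustration, the two approaches differ roughly like this; the YAML below is a sketch, and the KubeVirt field names in it are illustrative rather than the exact API:

```yaml
# Option 1 (what k8s does for pods): the PVC is bound to the pod and
# surfaces as a mounted filesystem inside the container.
kind: Pod
spec:
  containers:
    - name: virt-launcher
      volumeMounts:
        - name: disk0
          mountPath: /disks        # kubelet mounts a filesystem here
  volumes:
    - name: disk0
      persistentVolumeClaim:
        claimName: my-vm-disk

# Option 2 (what KubeVirt does): the claim is referenced by the VM object
# and translated into libvirt domain arguments instead of a pod mount.
kind: VirtualMachine
spec:
  domain:
    devices:
      disks:
        - name: disk0
          claimName: my-vm-disk    # mapped to a libvirt <disk> element
```

In the first case the kubelet formats and mounts the volume before the container starts; in the second, KubeVirt itself resolves the claim and hands the result to libvirt.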

-- Vossel


Stuart Gott

Jun 15, 2017, 7:15:06 PM
to David Vossel, kubevirt-dev
Short answer is: mounted filesystem != raw device

If we associated the PVC with the virt-launcher pod, then Kubernetes would bind the volume (and mount it) to the virt-launcher pod. What we need is an un-mounted raw device that libvirt can use. By claiming the PV with the VM resource, we end up with that--since Kubernetes can't sensibly assume a third party resource even has a filesystem.
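To illustrate the distinction, here is roughly what libvirt needs to be given, sketched as a domain XML fragment (the device path is made up):

```xml
<!-- An unmounted raw device that libvirt can hand straight to qemu -->
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/sdb'/>   <!-- no filesystem assumed on the device -->
  <target dev='vda' bus='virtio'/>
</disk>
```

A PVC bound to the pod would instead surface as a mounted directory inside the pod's filesystem, which qemu cannot use as a raw disk.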

Incidentally (and I can't remember the reason why off the top of my head), the qemu emulator script cannot enter the mount namespace of the virt-launcher pod.

Stu

--
You received this message because you are subscribed to the Google Groups "kubevirt-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubevirt-dev+unsubscribe@googlegroups.com.
To post to this group, send email to kubevi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubevirt-dev/35b80afc-cf55-4a7e-a5fc-4fd3bb4ea3db%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ih...@redhat.com

Jun 15, 2017, 9:19:28 PM
to kubevirt-dev
Could we treat block and file differently, and get Kubernetes to connect to the block device but pass it to the container as a raw block device? This is not special to VMs; some applications used to perform better with raw devices, IIRC, and some used a shared raw block device to manage their clustering, etc.



Fabian Deutsch

Jun 16, 2017, 3:40:48 AM
to Stuart Gott, David Vossel, kubevirt-dev
On Fri, Jun 16, 2017 at 1:15 AM, Stuart Gott <sg...@redhat.com> wrote:
> Short answer is: mounted filesystem != raw device

Yes - this was the beginning :)

To Kubernetes, today, volumes are file-systems.
However, for VMs, qemu needs disks to be files (like qcow2), block
devices, or connection details (i.e. for iSCSI or Ceph, whatever
built-in support qemu has).

Because there was no sane way to expose volumes as block devices
to pods, we skipped this for now - at least when it comes to
implementation. This is why we don't attach volumes to a pod.

We also did not work too much on using files as disks, because here
we need a strong story for how we lay out files and how they relate
to a volume (i.e. is a single disk a single file in a single volume?
Or is a volume a pool of disks, …).

Incidentally, Kubernetes volumes are also supported for iSCSI and
Ceph (among others). The nice thing about these two (and Gluster,
IIRC) is that qemu has built-in support to use these storage types
directly.
So what we did was reuse the PV/PVC mechanism to assign PVCs to
VMs, and then use the connection details to let qemu access those
storage types directly.
The nice part about this approach is that it was easy to implement,
it reuses the existing Kubernetes objects, and it is simple.
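As a sketch of what this mapping produces for an iSCSI-backed PVC: the PV's connection details end up in a libvirt network disk that qemu opens directly (target and portal values here are invented):

```xml
<!-- libvirt network disk built from the PV's iSCSI connection details;
     qemu's built-in initiator talks to the target, no host mount involved -->
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='iscsi' name='iqn.2017-06.com.example:storage/1'>
    <host name='iscsi.example.com' port='3260'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
```

No kubelet mount is involved; qemu's built-in iSCSI initiator talks to the portal itself.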

However, even if we have this option today - we still want to find
solutions for the block device and file-system based approaches.

- fabian


Daniel P. Berrange

Jun 16, 2017, 4:38:36 AM
to dvo...@redhat.com, kubevirt-dev
There are many different aspects at play here, and the right answer
depends on what kind of storage you want to deal with too. Ignoring
k8s, for network block devices using the in-QEMU client is the
preferred way of dealing with RBD volumes (as opposed to the
in-kernel client + block device access). It is simpler to manage, as
you don't need to play waiting games on asynchronous operations like
waiting for udev to finish creating device nodes, which have been a
constant source of bugs in the past. It avoids the situation where
you can get an unkillable QEMU process on a storage hang, due to the
kernel syscall being in an uninterruptible wait state. It also has
better performance characteristics, by removing layers from the I/O
stack both in QEMU and the kernel. Similar benefits apply to
Gluster, iSCSI and NFS, all of which have in-QEMU clients. iSCSI is
not quite so clear-cut though, because there is no multipath support
in QEMU at this time. With recent work, QEMU can also support LUKS
encryption natively with these in-QEMU clients.
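As a rough illustration of the two paths for RBD (pool, image and auth names invented):

```
# In-kernel client: map the image, wait for the device node, pass it to QEMU
rbd map mypool/vm-disk                 # eventually yields e.g. /dev/rbd0
qemu-system-x86_64 ... -drive file=/dev/rbd0,format=raw,if=virtio

# In-QEMU client: QEMU speaks the RBD protocol itself; no device node, no udev
qemu-system-x86_64 ... -drive file=rbd:mypool/vm-disk:id=myuser,format=raw,if=virtio
```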

So in general, using the in-QEMU clients is the right way to
integrate, regardless of any other k8s-specific aspects. It happens
to be the case, though, that using the in-QEMU client is also a
better fit for k8s. As it stands today, if you map a PVC into a pod,
k8s wants to format a filesystem on it and mount it in the pod, even
if you don't request any mounts into the containers. IOW, you can't
actually pass the RBD, iSCSI, etc. block device into the pod - you
just get a filesystem on the block device.

There may be times where you want a filesystem with a qcow2/raw
file inside it, which is then exposed to QEMU, but I don't expect
that to be the common case. For a start, you cannot then do
migration, since you can't mount that volume on multiple hosts at
once due to it using ext3. So you would have to do a full storage
copy on migrate. It also adds many more layers into the I/O path,
so it will have worse performance than when QEMU directly uses an
RBD client.

Then of course there is the namespace problem. The QEMU processes
need to be in the same namespace as libvirtd, and while k8s does have
these mounts exposed in the host namespace where libvirtd & QEMU
could see them, this is a private impl detail of k8s right now.

Eventually we do need to figure out a way to deal with block devices
and local files, which will probably involve accessing these mounts
or devices in the host namespace, but we'll need to try and get k8s
to provide some kind of guarantees that this is allowed.

Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Daniel P. Berrange

Jun 16, 2017, 4:43:12 AM
to ih...@redhat.com, kubevirt-dev
On Thu, Jun 15, 2017 at 06:19:27PM -0700, ih...@redhat.com wrote:
> Could we treat block and file differently, and get Kubernetes to
> connect to the block device but pass it to the container as a raw
> block device? This is not special to VMs; some applications used to
> perform better with raw devices, IIRC, and some used a shared raw
> block device to manage their clustering, etc.

If k8s exposes raw block devices to containers, it exposes the
entire host to dangerous security problems. Currently k8s can be
sure that all storage has a safe filesystem on it (because it runs
mkfs). If you pass a raw block device to a container, then the
container has the ability to write malicious filesystem structures
to the block device. If k8s then later mounts this malicious FS, it
is possible to exploit Linux kernel bugs to crash the host kernel.
Combine that with a replication controller and you can use this
malicious FS to exploit a kernel bug to crash every single host in
your cloud. These kernel bugs are not theoretical - there have been
quite a few over the years, and while people have done fuzzing to
try to identify and fix bugs, the kernel FS maintainers say outright
that the kernel FS driver code is written assuming non-malicious
filesystems. This is a key reason why libguestfs was created: to
avoid mounting guest filesystems on the host kernel.
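A minimal sketch of the first step of that attack surface, in plain Python against an ordinary file standing in for a raw device: anything with raw write access can plant plausible filesystem metadata, here the ext2/3/4 superblock magic, which the kernel would then have to parse on mount. (The offsets are the real ext superblock layout; the file name and everything else are illustrative.)

```python
import struct

# The ext* superblock starts 1024 bytes into the device; the s_magic
# field sits at offset 56 within it, little-endian 0xEF53.
EXT_MAGIC_OFFSET = 1024 + 56

# Stand-in for a raw block device a container was handed.
with open("fake.img", "wb") as f:
    f.write(b"\x00" * 4096)

# Write just the magic number: the "device" now looks like an ext
# filesystem to anything probing it, with every other structure
# fully attacker-controlled.
with open("fake.img", "r+b") as f:
    f.seek(EXT_MAGIC_OFFSET)
    f.write(struct.pack("<H", 0xEF53))
```

The dangerous part is of course not the magic number itself but the attacker-controlled structures behind it, which the in-kernel FS drivers were never hardened against.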

Fabian Deutsch

Jun 16, 2017, 5:07:44 AM
to Itamar Heim, kubevirt-dev
On Fri, Jun 16, 2017 at 3:19 AM, <ih...@redhat.com> wrote:
> Could we treat block and file differently, and get Kubernetes to
> connect to the block device but pass it to the container as a raw
> block device? This is not special to VMs; some applications used to
> perform better with raw devices, IIRC, and some used a shared raw
> block device to manage their clustering, etc.

There are some thoughts in this PR [1] about how to expose a volume
as a block device to a container.

- fabian

[1] https://github.com/kubernetes/community/pull/306