RFC: What happens if the Host OS can't see container mounts?


Jim Ramsay

Mar 5, 2021, 9:33:49 AM
to Kubernetes developer/contributor discussion
Hello everyone! I'm looking into creating a KEP to allow Kubelet and the container-runtime to operate in a distinct mount namespace from the host OS, and I'm looking for any real-world use-cases that might be broken by this change. This was originally brought up to sig-node, but it was suggested that I raise it here too for wider consideration.

TL;DR
I want to be able to hide all container-specific mount points from the host OS, and I want to know if any known tools or use-cases would be adversely affected.

What are we proposing to change?

Today, Kubelet and the Container Runtime both run in the same mount namespace as the host OS (systemd, login shells, system daemons).

See fig 1: Original mount propagation [1]

In fact, the documentation for the mountPropagation volumeMount parameter and the "Mount Propagation" e2e test in kubernetes/kubernetes both enforce a set of 3 guarantees about mountpoint visibility:

1. Mounts originating in the host OS are visible to the Container Runtime, OCI hooks, Kubelet, and container volumeMounts with mountPropagation: HostToContainer (or Bidirectional)

2. Mounts originating in the Container Runtime, OCI hooks, Kubelet, and container volumeMounts with mountPropagation: Bidirectional are visible to each other and to container volumeMounts with mountPropagation: HostToContainer

3. Mounts originating in the Container Runtime, OCI hooks, Kubelet, and container volumeMounts with mountPropagation: Bidirectional are visible to the host OS
Of those 3 guarantees, the first 2 are critical to Kubernetes and container-native operations and would not change, but the 3rd would be demoted to an optional configuration choice.
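For anyone who hasn't used the field, here is a minimal sketch of how mountPropagation appears on hostPath volumeMounts; the pod name, image, and host paths are placeholders, purely for illustration:

# Sketch only: one volumeMount per propagation mode discussed above.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: propagation-demo
spec:
  containers:
  - name: demo
    image: busybox              # placeholder image
    command: ["sleep", "3600"]
    securityContext:
      privileged: true          # Bidirectional is only allowed for privileged containers
    volumeMounts:
    - name: host-data
      mountPath: /host-data
      # Guarantee 1: mounts made by the host OS under /mnt/data show up here
      mountPropagation: HostToContainer
    - name: host-shared
      mountPath: /host-shared
      # Guarantees 2 and 3: mounts made here propagate back toward Kubelet,
      # the Container Runtime, and (today) the host OS
      mountPropagation: Bidirectional
  volumes:
  - name: host-data
    hostPath:
      path: /mnt/data
  - name: host-shared
    hostPath:
      path: /mnt/shared
EOF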

See fig 2: Proposed support for hidden mount propagation [2]

Why make a change?  There are 3 main reasons, though the 1st one is the most pressing concern today.

- Systemd Efficiency

Today systemd is adversely affected by a large number of mount points. This appears to be rooted in the lack of granularity in the events the kernel sends to systemd about mountpoint changes, which necessitates a full rescan of all mountpoints even on a minor change or single mountpoint addition. (See Red Hat BZ1819869 [3] for more details)

Even a fairly small OpenShift deployment today has upwards of 400 mountpoints on every node owned by Kubelet, the Container Runtime, and infrastructure containers like OVN or CSI operators. Hiding the container mount points reduces the steady-state systemd CPU utilization from ~10% to ~0% on every worker node, and the improvement would be greater on more heavily-loaded systems with more container-specific mountpoints.  I expect the same can be said for vanilla k8s as well.
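As a rough way to gauge where a given node stands today (just a sanity check, not part of the proposal):

# Count the mountpoints visible in the host mount namespace
findmnt --list | wc -l

# And take a rough look at how busy systemd (PID 1) is at steady state
top -b -n 1 -p 1 | tail -n 2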

- Tidiness and Encapsulation

These container overlay mounts, secret and configmap mounts, and namespace mounts are all internal implementation details of the way Kubernetes and the Container Runtime allocate resources to containers. For normal day-to-day operation, there is no need for these special-use mounts to be visible in the host OS's top-level 'mount' or 'findmnt' output.

- Security

While access to container-specific mounts in Kubernetes (like secrets) is protected today by filesystem permissions, the fact that the locations are visible to unprivileged users via the 'mount' command can provide information that could be used in conjunction with other vulnerabilities to decipher secrets or other protected container mount contents. Putting even the ability to list these mountpoints behind entering the mount namespace (which in turn requires CAP_SYS_CHROOT and CAP_SYS_ADMIN) removes one level of potential access to container-specific mounts.
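To make the exposure concrete (the paths below are typical kubelet defaults and may vary by distribution; the nsenter example assumes the separate-namespace setup described below):

# Today, any unprivileged user can at least enumerate container-specific
# mountpoints, including the tmpfs mounts backing Secret volumes (the contents
# are still protected by file permissions, but the locations leak):
findmnt -t tmpfs | grep 'kubernetes.io~secret'

# With a separate mount namespace, the same listing from the host namespace
# comes back empty; seeing these mounts requires entering the namespace, e.g.:
sudo nsenter --target "$(pidof kubelet)" --mount -- findmnt -t tmpfs | grep 'kubernetes.io~secret'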

What exactly would change?

From the end-user's perspective:

The original "everything-in-one-namespace" mode of operation would still be allowed, but users or distributions would be allowed (or maybe even encouraged?) to set up their init systems so that the Container Runtime and Kubelet are in a separate mount namespace separate from the host OS, provided this uses 'slave/shared' propagation so that host OS-mounted filesystems are still available to the Container Runtime, Kubelet, and the containers they spawn, and any changes in the 2nd-level namespace is propagated down to lower-level namespaces.

For example, a user on a Linux system with systemd could use a systemd service like container-mount-namespace.service[4] to create the mount namespace, then a drop-in like 20-container-mount-namespace.conf[5] for both kubelet.service and the appropriate container runtime service so they enter that namespace before execution. (See the extractExecStart[6] helper script and the OpenShift-based proof-of-concept page[7] for more details.)
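For anyone who wants the flavor of this without clicking through, here is a heavily simplified sketch. The /run/container-mount-namespace/mnt path, the unshare/nsenter invocations, and the hard-coded kubelet command line are assumptions of the sketch rather than exactly what the PoC does, so see [4]-[7] for the real thing (which rewrites the existing ExecStart with the extractExecStart helper instead of hard-coding it):

# Oneshot unit that pins a mount namespace which is a slave of the host
# namespace but rshared internally, so container mounts still propagate
# among Kubelet, the runtime, and containers.
cat > /etc/systemd/system/container-mount-namespace.service <<'EOF'
[Unit]
Description=Persistent mount namespace for Kubelet and the container runtime

[Service]
Type=oneshot
RemainAfterExit=yes
RuntimeDirectory=container-mount-namespace
ExecStartPre=/usr/bin/touch /run/container-mount-namespace/mnt
ExecStart=/usr/bin/unshare --mount=/run/container-mount-namespace/mnt --propagation slave /usr/bin/mount --make-rshared /
ExecStop=/usr/bin/umount /run/container-mount-namespace/mnt

[Install]
WantedBy=multi-user.target
EOF

# Drop-in that re-execs kubelet inside that namespace on every (re)start.
# The kubelet command line here is a placeholder.
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/20-container-mount-namespace.conf <<'EOF'
[Unit]
Requires=container-mount-namespace.service
After=container-mount-namespace.service

[Service]
ExecStart=
ExecStart=/usr/bin/nsenter --mount=/run/container-mount-namespace/mnt -- /usr/bin/kubelet $KUBELET_ARGS
EOF

systemctl daemon-reload

The key properties are that the namespace is created once, pinned to a file so it survives restarts of the services that join it, and is a slave of the host namespace while being rshared internally.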

An administrator wanting to examine the container mounts can easily run nsenter to spawn a shell or other command inside the container mount namespace, and likewise any 3rd-party service that needs access to these mounts could use nsenter or a drop-in like the example above to do the same.
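For example (assuming the namespace is pinned at the path used in the sketch above; targeting the kubelet PID works even if it is pinned elsewhere):

# Spawn a shell inside the container mount namespace pinned by the unit above
sudo nsenter --mount=/run/container-mount-namespace/mnt -- /bin/bash

# Or join whatever mount namespace the running kubelet is actually in,
# without needing to know where (or whether) the namespace is pinned
sudo nsenter --target "$(pidof kubelet)" --mount -- findmnt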

From the Kubernetes perspective:

The "Mount Propagation" e2e test needs alteration, to detect when a separate container mount namespace is present and if so, enforce the slightly different propagation rules. The documentation for the mountPropagation property needs a similar change. An initial attempt at this implementation is in this branch [8].

The installation guide for Kubernetes would also need a small addition, to outline and demonstrate this optional environmental change.

In conclusion, I'd like to write this KEP, but I'd also like to understand ahead of time whether there are any existing tools or use-cases you may know about that rely on something running in the host OS being able to see mountpoints owned by Kubelet, the Container Runtime, or other containers.

Thanks for your time!

References:

Benjamin Elder

Mar 5, 2021, 4:48:07 PM
to Jim Ramsay, Kubernetes developer/contributor discussion
With Kubernetes Code Freeze looming I've only had time to skim this, but FWIW I think the KIND project might be pretty interested in this feature (which is not to say we should implement it because of this, but when I have time later we might be interested in enabling it / testing it).


Ben Swartzlander

Mar 6, 2021, 12:02:30 AM
to kuberne...@googlegroups.com
This could conceivably break some CSI plugins, depending on how they
perform their mounts. I haven't looked at exactly what your proposal
does, and I'm not aware of any specific problem, but I recommend testing
this change with a wide variety of CSI plugins to make sure nothing breaks.

-Ben Swartzlander

Jim Ramsay

Mar 9, 2021, 10:14:41 AM
to Kubernetes developer/contributor discussion
Thanks, Ben!

I don't know a lot about CSI plugins, but am happy to learn!  I agree trying a variety of CSI plugins is a critical test for this proposal.

My suspicion is that most CSI plugins should be fine:
 - If the CSI plugin relies on a container to perform a mount, and passes it to other containers via 'mountPropagation: Bidirectional', this still works as before with no change needed.
 - If the CSI plugin relies on a service running on the host OS to perform a mount, those mounts are still made available to containers exactly as before, with no change needed (since under my proposal, the container-specific mount namespace MUST be a slave to the host OS mount namespace).
 - The OpenShift proof-of-concept I have set up with this namespace change does execute and pass a set of CSI tests, though I haven't dug in too far to understand exactly what this entails or how rigorously it's checking CSI conformance.

The only kind of CSI plugin that could be adversely affected by this proposal would be one that relies on (1) a container performing a mount, and then (2) some service in the host OS seeing it and acting upon it after the fact. I hope this would be the minority case, or maybe even prohibited by the CSI spec, though I haven't been able to tease out exactly what the CSI spec says about this.

Do you happen to know:
 - Would that kind of plugin be allowed under the CSI spec?
 - Are there any CSI plugins that actually do this kind of thing?

As an aside, if there were a CSI plugin that relied on this behavior, it would be easy to change it to work under my proposal by wrapping any such host OS services in 'nsenter' so they run in the same mount namespace as Kubelet and the container runtime.
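A rough sketch of that workaround, reusing the same drop-in pattern as above (the 'legacy-agent' service name, binary path, and namespace path are all placeholders):

# Hypothetical example: make a legacy host daemon ("legacy-agent.service")
# start inside the container mount namespace so it keeps seeing CSI/kubelet mounts.
mkdir -p /etc/systemd/system/legacy-agent.service.d
cat > /etc/systemd/system/legacy-agent.service.d/20-container-mount-namespace.conf <<'EOF'
[Unit]
Requires=container-mount-namespace.service
After=container-mount-namespace.service

[Service]
ExecStart=
ExecStart=/usr/bin/nsenter --mount=/run/container-mount-namespace/mnt -- /usr/bin/legacy-agent
EOF
systemctl daemon-reload && systemctl restart legacy-agent.service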

Ben Swartzlander

Mar 9, 2021, 11:25:59 AM
to kuberne...@googlegroups.com
Replies inline

>  - Would that kind of plugin be allowed under the CSI spec?

The CSI spec is intentionally vague about how plugins might perform
their functions, so as to maximize compatibility. The effective limits
of what plugins can do are "whatever they can get away with". Since this
proposal seems to be shrinking the space of what plugins can get away
with, that's why I see the potential for issues.

>  - Are there any CSI plugins that actually do this kind of thing?

On the plus side, I'm not aware of any plugins that rely on the
functionality you say is going away. But there are currently 100 (!) CSI
drivers listed on this page:

https://kubernetes-csi.github.io/docs/drivers.html

It's anyone's guess whether some of those plugins might run into trouble
with this change.

> As an aside, if there was a CSI plugin that relied on the behavior, it
> would be easy to change it to work under my proposal by wrapping any
> such host OS services in 'nsenter' so they run in the same mount
> namespace as Kubelet and the container runtime.

This is something I think we should explore further, because if there's
a generalized workaround then it would alleviate most of the concerns.
It seems like what you describe would be hard to do inside an individual
CSI plugin, especially without changes to the CSI spec. If there's a way
to ensure that the right thing happens by default, that would be ideal.

The important thing to understand about CSI plugins is that the node
plugin always runs with elevated privileges of some sort, because you
need elevated privileges to mount things. Most, if not all, set
privileged to true and enable CAP_SYS_ADMIN in the daemonset definition,
which results in pods sharing a mount namespace with the system. I don't
know which namespace this would end up being after the change you
propose, but if it's the correct one, then nothing further is needed.
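For illustration, the pattern I mean looks roughly like the sketch below (driver name, image, and paths are placeholders, not any specific driver):

# Privileged node plugin with the kubelet directory mounted Bidirectional so
# the mounts it makes propagate back out of the pod.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-csi-node
spec:
  selector:
    matchLabels:
      app: example-csi-node
  template:
    metadata:
      labels:
        app: example-csi-node
    spec:
      containers:
      - name: csi-node-plugin
        image: example.com/csi-node-plugin:latest   # placeholder
        securityContext:
          privileged: true
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: kubelet-dir
          mountPath: /var/lib/kubelet
          mountPropagation: Bidirectional
      volumes:
      - name: kubelet-dir
        hostPath:
          path: /var/lib/kubelet
EOF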

-Ben

Michelle Au

Mar 10, 2021, 12:59:42 AM
to Ben Swartzlander, Kubernetes developer/contributor discussion, Jan Safranek
The container runtime needs to see the mounts that are made by the CSI driver in order to bind-mount them into the container.

We also unfortunately have a few places left in the kubelet code that check whether things are mounts. We are slowly trying to chip away at removing those assumptions.

Did your CSI testing use a "real" driver, or a dev driver like hostpath or mock that doesn't actually make mounts?


Jan Safranek

Mar 10, 2021, 3:05:42 AM
to Michelle Au, Ben Swartzlander, Kubernetes developer/contributor discussion
I think the proposal here is to "allow Kubelet and the container-runtime
to operate in a distinct mount namespace from the host". So, kubelet
will see the same mounts as the container runtime and containers it
started, as today. The host (systemd?) may not see them though.

From the CSI perspective it's IMO OK, unless CSI drivers want to
propagate the mount all the way up to the host for some reason (e.g.
some storage daemon sitting there). I haven't seen such a daemon, though.

One thing we need to ensure is that this mount namespace is stable - not
destroyed when either the kubelet or the container runtime service is
restarted.

Jan

Noah Kantrowitz

Mar 10, 2021, 3:12:46 AM
to Kubernetes developer/contributor discussion
From the SRE side, I've definitely seen a fair number of system daemons
that want to gather data on mounts, usually either legacy metrics
systems or security tools. The native answer would be "run those as a
DaemonSet instead", but we should make sure kubeadm and other "official"
installer tools explain how to change the default back, or at least
explain what kinds of things might be impacted.

--Noah


Jim Ramsay

Mar 18, 2021, 1:58:27 PM
to Kubernetes developer/contributor discussion
More discussion can follow in the corresponding issue, too: https://github.com/kubernetes/kubernetes/issues/100259

Jim Ramsay

Mar 18, 2021, 2:05:31 PM
to Kubernetes developer/contributor discussion
That's a good point about the stability of the namespace! This would be an important aspect to document about the namespace requirements, in addition to the fact that it must be 'slave/shared' and not just 'slave' relative to the top-level namespace.
In my proof-of-concept work in OpenShift, I achieve this by having a separate systemd service that creates the namespace, plus a drop-in for both Kubelet and CRI-O that starts them within that namespace whenever they restart.

Jim Ramsay

Mar 18, 2021, 2:35:41 PM
to Kubernetes developer/contributor discussion
Yes, that's exactly the kind of use-case I'm looking for to understand how widely this might be used. Thanks! Do you have any specific examples of such SRE tools that you could point me at to investigate further?

The way I see it, on a system with this change there are 3 mitigations possible for legacy tools:
- Making these tools cloud-native and running them as a daemonset is one.  Encouraging tools to go this way is probably valuable in and of itself, but not something everyone will be able to do immediately.
- Disabling the "separate mount namespace" environment entirely is another, but it's a fairly big hammer.
- A middle road would be to use a systemd drop-in or edit the service to wrap a legacy tool inside 'nsenter' so it starts in the same namespace as the container runtime, though this more surgical workaround may not be appropriate for all tools.

In the short term, I'm just asking to have the k8s e2e test relaxed to allow this new mode of operation, not to have it required or even on by default. With that small change in place, installer tools or downstream providers could then start making the switch at their leisure if they decide the benefits are worth the cost. You've rightly pointed out that part of this cost is ensuring there's an off switch and/or appropriate documentation on how to adapt legacy tools to fit.

I have a proof-of-concept under way for OpenShift that has both installer-provided mitigation modes covered: a way to put things back to the existing namespace mechanism, plus an example drop-in and script to make it easy for services and administrators to jump into the new namespace.