RFC: What happens if the Host OS can't see container mounts?

Jim Ramsay

Mar 4, 2021, 11:44:29 AM
to kubernetes-sig-node
Hello everyone! As I introduced in this past sig-node meeting, I'm looking into creating a KEP to allow Kubelet and the Container Runtime to operate in a mount namespace distinct from the host OS.

TL;DR
I want to be able to hide all container-specific mount points from the host OS, and I want to know if any known tools or use-cases would be adversely affected.

What are we proposing to change?

Today, Kubelet and the Container Runtime both run in the same mount namespace as the host OS (systemd, login shells, system daemons).

See fig1: Original mount propagation [1]

In fact, the documentation for the mountPropagation volumeMount parameter, as well as the "Mount Propagation" e2e test in kubernetes/kubernetes both enforce a set of 3 guarantees about mountpoint visibility:

1. Mounts originating in the host OS are visible to the Container Runtime, OCI hooks, Kubelet, and container volumeMounts with mountPropagation: HostToContainer (or Bidirectional)

2. Mounts originating in the Container Runtime, OCI hooks, Kubelet, and container volumeMounts with mountPropagation: Bidirectional are visible to each other and container volumeMounts with mountPropagation: HostToContainer

3. Mounts originating in the Container Runtime, OCI hooks, Kubelet, and container volumeMounts with mountPropagation: Bidirectional are visible to the host OS

Of those 3 guarantees, the first 2 are critical to Kubernetes and container-native operations and would not change, but the 3rd guarantee would be demoted to an optional configuration choice.
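
To make the three guarantees concrete, here is a toy visibility model in Python. It is only a sketch of the rules as stated above, not code from Kubernetes; the names `Party` and `visible` are made up for illustration, and the behaviour for mounts that originate inside a HostToContainer volumeMount (not covered by the guarantees) is my assumption based on rslave semantics.

```python
from enum import Enum


class Party(Enum):
    HOST_OS = "host OS"                          # systemd, login shells, daemons
    CONTAINER_INFRA = "runtime/hooks/kubelet"    # Container Runtime, OCI hooks, Kubelet
    BIDIRECTIONAL = "Bidirectional volumeMount"
    HOST_TO_CONTAINER = "HostToContainer volumeMount"


# Origins whose mounts propagate under guarantees 2 and 3.
_CONTAINER_SIDE_ORIGINS = {Party.CONTAINER_INFRA, Party.BIDIRECTIONAL}


def visible(origin: Party, observer: Party, hidden: bool) -> bool:
    """Return True if a mount created by `origin` is visible to `observer`.

    `hidden=True` selects the proposed mode in which guarantee 3 is dropped.
    """
    if origin is observer:
        return True
    if origin is Party.HOST_OS:
        # Guarantee 1: host OS mounts reach everything on the container side.
        return True
    if origin in _CONTAINER_SIDE_ORIGINS:
        if observer is Party.HOST_OS:
            # Guarantee 3: holds only in the original single-namespace mode.
            return not hidden
        # Guarantee 2: container-side mounts are visible to each other and
        # to HostToContainer volumeMounts.
        return True
    # Assumption: mounts made inside a HostToContainer volumeMount do not
    # propagate outward.
    return False
```

The only entries that differ between the two modes are those where the observer is the host OS and the origin is on the container side.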

See fig 2: Proposed support for hidden mount propagation [2]

Why make a change?  There are 3 main reasons, though the 1st one is the most pressing concern today.

- Systemd Efficiency

Today systemd is adversely affected by a large number of mount points. This appears to be rooted in the lack of granularity in the events the kernel sends to systemd about mountpoint changes, which necessitates a full rescan of all mountpoints even on a minor change or single mountpoint addition. (See Red Hat BZ1819868 [3] for more details)

Even a fairly small OpenShift deployment today has upwards of 400 mountpoints on every node, owned by Kubelet, the Container Runtime, and infrastructure containers like OVN or CSI operators. Hiding the container mount points reduces the steady-state systemd CPU utilization from ~10% to ~0% on every worker node, and the improvement would be greater on more heavily-loaded systems with more container-specific mountpoints. I expect the same can be said for vanilla k8s as well.
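
For anyone who wants to check the mountpoint count on their own nodes, a rough Python sketch follows. It reads text in the /proc/<pid>/mountinfo format and counts how many mount points sit under container-related directories. The path prefixes are assumptions for illustration and will differ between runtimes and distributions.

```python
from collections import Counter

# Hypothetical prefixes; adjust for the runtime and distribution in use.
CONTAINER_PREFIXES = (
    "/var/lib/kubelet/",
    "/var/lib/containers/",
    "/run/containers/",
)


def mount_points(mountinfo_text: str) -> list[str]:
    """Extract the mount point (5th field) from each mountinfo line."""
    points = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            points.append(fields[4])
    return points


def count_container_mounts(mountinfo_text: str) -> Counter:
    """Count mount points under CONTAINER_PREFIXES versus everything else."""
    counts = Counter()
    for point in mount_points(mountinfo_text):
        is_container = point.startswith(CONTAINER_PREFIXES)
        counts["container" if is_container else "other"] += 1
    return counts
```

Feeding it the contents of /proc/1/mountinfo (as seen from the host OS) before and after hiding the container mounts should show the "container" count drop while "other" stays the same.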

- Tidiness and Encapsulation

These container overlay mounts, secret and configmap mounts, and namespace mounts are all internal implementation details around the way Kubernetes and the Container Runtime allocate resources to containers. For normal day-to-day operation, there is no need for these special-use mounts to be visible in the host OS's top-level 'mount' or 'findmnt' output.

- Security

While access to container-specific mounts in Kubernetes (like secrets) today is protected by filesystem permissions, the fact that the locations are visible to unprivileged users via the 'mount' command can provide information which could be used in conjunction with other vulnerabilities to decipher secrets or other protected container mount contents. Locking access to even listing the mountpoints behind the ability to enter the mount namespace (which in turn requires CAP_SYS_CHROOT and CAP_SYS_ADMIN) removes one level of potential access to container-specific mounts.

What exactly would change?

From the end-user's perspective:

The original "everything-in-one-namespace" mode of operation would still be allowed, but users or distributions would be allowed (or maybe even encouraged?) to set up their init systems so that the Container Runtime and Kubelet run in a mount namespace separate from the host OS. This must use 'slave/shared' propagation, so that host OS-mounted filesystems are still available to the Container Runtime, Kubelet, and the containers they spawn, and so that any changes in the 2nd-level namespace are propagated down to lower-level namespaces.
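
One way to check that a mount really has 'slave/shared' propagation is to look at the optional fields of its /proc/<pid>/mountinfo line, where a mount that receives events from a master and is also shared with peers carries both a `master:N` and a `shared:N` tag. The Python below is a minimal sketch of that check; the function names are made up for illustration.

```python
def propagation_tags(mountinfo_line: str) -> set[str]:
    """Return the propagation tags of one mountinfo line.

    Optional fields sit between the 6th field (mount options) and the
    "-" separator, e.g. "shared:30", "master:1", "unbindable".
    """
    fields = mountinfo_line.split()
    separator = fields.index("-", 6)  # ValueError if the line is malformed
    return {field.split(":", 1)[0] for field in fields[6:separator]}


def is_slave_shared(mountinfo_line: str) -> bool:
    """True if the mount is both a slave of another peer group and shared."""
    tags = propagation_tags(mountinfo_line)
    return "master" in tags and "shared" in tags
```

A private mount has no optional fields at all, so `propagation_tags` returns an empty set for it.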

For example, a user on a Linux system with systemd could use a systemd service like container-mount-namespace.service[4] to create the mount namespace, then a drop-in like 20-container-mount-namespace.conf[5] for both kubelet.service and the appropriate container runtime service so they enter that namespace before execution. (See the extractExecStart[6] helper script and the OpenShift-based proof-of-concept page[7] for more details.)

An administrator wanting to examine the container mounts can easily run nsenter to spawn a shell or other command inside the container mount namespace, and likewise any 3rd-party service that needs access to these mounts could use nsenter or a drop-in like the example above to do the same.
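
As a sketch of what that looks like from a script, the Python helper below builds an nsenter command line that runs a command inside the container mount namespace. The namespace file path is a hypothetical placeholder, since the real location depends on how the namespace-creating service pins it, and actually running the result needs sufficient privileges.

```python
# Hypothetical location of the bind-mounted namespace file; the real path
# depends on how the namespace-creating service was configured.
DEFAULT_NS_FILE = "/run/container-mount-namespace/mnt"


def nsenter_argv(command: list[str], ns_file: str = DEFAULT_NS_FILE) -> list[str]:
    """Build an argv that runs `command` in the mount namespace at `ns_file`."""
    if not command:
        raise ValueError("command must not be empty")
    # "--" stops nsenter from interpreting options meant for `command`.
    return ["nsenter", f"--mount={ns_file}", "--", *command]
```

For example, `subprocess.run(nsenter_argv(["findmnt"]), check=True)` run as root would list the mounts as Kubelet and the Container Runtime see them.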

From the Kubernetes perspective:

The "Mount Propagation" e2e test needs alteration, to detect when a separate container mount namespace is present and if so, enforce the slightly different propagation rules. The documentation for the mountPropagation property needs a similar change. An initial attempt at this implementation is in this branch [8].
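
To illustrate what "detect when a separate container mount namespace is present" could mean, here is one possible approach in Python: compare the mount namespace identifiers of a host OS process (such as PID 1) and of Kubelet. This is only a sketch and not necessarily how the branch above does it; the function names and the injectable `readlink` parameter are my own, and reading another process's /proc/<pid>/ns/mnt link normally requires privileges.

```python
import os
import re
from typing import Callable

# Link targets under /proc/<pid>/ns/ look like "mnt:[4026531840]".
_NS_LINK = re.compile(r"^mnt:\[(\d+)\]$")


def mount_ns_inode(link_target: str) -> int:
    """Parse the namespace inode number out of a /proc/<pid>/ns/mnt target."""
    match = _NS_LINK.match(link_target)
    if match is None:
        raise ValueError(f"not a mount namespace link: {link_target!r}")
    return int(match.group(1))


def has_separate_mount_ns(
    host_pid: int,
    kubelet_pid: int,
    readlink: Callable[[str], str] = os.readlink,
) -> bool:
    """True if the two processes are in different mount namespaces."""
    host = mount_ns_inode(readlink(f"/proc/{host_pid}/ns/mnt"))
    kubelet = mount_ns_inode(readlink(f"/proc/{kubelet_pid}/ns/mnt"))
    return host != kubelet
```

A test could then pick which set of propagation rules to enforce based on the result.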

The installation guide for Kubernetes would also need a small addition, to outline and demonstrate this optional environmental change.

In conclusion, I'd like to write this KEP but I'd also like to understand ahead of time if there are any existing tools or use-cases you may know about that rely on this ability of something running in the Host OS to see mountpoints owned by Kubelet, the Container Runtime, or other containers.

Thanks for your time!

References:

Fox, Kevin M

Mar 4, 2021, 12:09:13 PM
to Jim Ramsay, kubernetes-sig-node
Some folks are using containers to poke things onto the host or view things. For example, managing the root password, installing CNI drivers, monitoring root disk space usage, even occasionally installing systemd services. I think Kata even installs container runtime bits that way.

These use cases should be considered.

Thanks,
Kevin

________________________________________
From: kubernete...@googlegroups.com <kubernete...@googlegroups.com> on behalf of Jim Ramsay <jra...@redhat.com>
Sent: Thursday, March 4, 2021 8:44 AM
To: kubernetes-sig-node
Subject: RFC: What happens if the Host OS can't see container mounts?

--
You received this message because you are subscribed to the Google Groups "kubernetes-sig-node" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-sig-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-sig-node/5c716b70-3f76-490d-8348-6b9a7bebab73n%40googlegroups.com.

Jim Ramsay

Mar 4, 2021, 1:05:50 PM
to kubernetes-sig-node
Thanks for the comments!

I completely agree that these use cases should be considered, and I believe that containers of that sort should be unaffected by my proposal.  As a case in point, the MachineConfigOperator in OpenShift does most of the things you mention (edits host OS files, installs systemd services, sets the root password, etc) and I have yet to observe any change in its behavior with my proof-of-concept.

Basically this is because the new 2nd-level container-specific mount namespace would be a "slave" mount to the top-level host OS mount namespace. This means that any mountpoints in the host OS would still propagate down and be visible to the Container Runtime, and from there, according to the existing mountPropagation logic, these are made available to containers exactly as they are in the current implementation.

To put it another way, if a container pokes things onto the host or views things by virtue of it being granted access to a host-owned mount via the 'mountPropagation: HostToContainer' property, it would see no change under my proposal. If it uses 'mountPropagation: Bidirectional' and mounts filesystems for use by the container runtime, OCI hooks, Kubelet, or other containers, it would see no change under my proposal. If it uses 'mountPropagation: Bidirectional' and mounts filesystems for the host OS (or services there) to use, that would be prevented by my proposal, and that's the kind of case I'm trying to uncover.
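
A quick way to find candidates for that last case is to scan pod specs for volumeMounts that use Bidirectional propagation. The Python below is a minimal sketch that works on a pod object already parsed into a dict (for example from `kubectl get pod -o json`); the function name is made up for illustration. It only narrows the search, since a pod spec cannot tell you whether something in the host OS actually depends on the resulting mounts.

```python
def bidirectional_mounts(pod: dict) -> list[tuple[str, str]]:
    """Return (container name, mountPath) pairs using Bidirectional propagation."""
    spec = pod.get("spec") or {}
    found = []
    for key in ("initContainers", "containers"):
        for container in spec.get(key) or []:
            for volume_mount in container.get("volumeMounts") or []:
                if volume_mount.get("mountPropagation") == "Bidirectional":
                    found.append(
                        (container.get("name", ""), volume_mount.get("mountPath", ""))
                    )
    return found
```

Pods that come back with an empty list fall under the first case above and should see no change.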

Fox, Kevin M

Mar 4, 2021, 1:58:01 PM
to Jim Ramsay, kubernetes-sig-node
Ah. yeah.... slave mount propagation from the root should let it handle all the things I can think of at the moment. Interesting.

Thanks,
Kevin

________________________________________
From: kubernete...@googlegroups.com <kubernete...@googlegroups.com> on behalf of Jim Ramsay <jra...@redhat.com>
Sent: Thursday, March 4, 2021 10:05 AM
To: kubernetes-sig-node
Subject: Re: RFC: What happens if the Host OS can't see container mounts?
[1] https://github.com/lack/redhat-notes/raw/main/crio_unshare_mounts/images/Original%20k8s%20mount%20propagation.png
[2] https://github.com/lack/redhat-notes/raw/main/crio_unshare_mounts/images/Proposed%20hidden%20k8s%20mount%20propagation.png
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1819868
[4] https://github.com/lack/redhat-notes/blob/main/crio_unshare_mounts/container-private-mounts/container-mount-namespace.service
[5] https://github.com/lack/redhat-notes/blob/main/crio_unshare_mounts/container-private-mounts/20-container-mount-namespace.conf
[6] https://github.com/lack/redhat-notes/blob/main/crio_unshare_mounts/container-private-mounts/extractExecStart
[7] https://github.com/lack/redhat-notes/tree/main/crio_unshare_mounts
[8] https://github.com/lack/kubernetes/tree/hide_container_mountpoints-k8s

Jim Ramsay

Mar 18, 2021, 1:57:21 PM
to kubernetes-sig-node
More discussion can follow in the corresponding issue, too: https://github.com/kubernetes/kubernetes/issues/100259