Following on from the discussion today, it seems like the main criticisms of the Sandbox boolean are:
- It's hard to define what a sandboxed runtime gives you in terms of security guarantees
- It's hard to test or assert that a sandbox is going to give you any particular security guarantees
- How can the tradeoffs be simply expressed in such a way that a user can intuitively make a choice?
So maybe we shouldn't be thinking about it in terms of security guarantees or implementation details, but rather in policy terms. In that respect, sandboxing can be a policy convention that can be explicitly tested for. For example:
1 - A sandboxed pod has no access to the Kubernetes control plane
2 - A sandboxed pod cannot write any state to the host on which it's running that could be readable by other pods (or the host itself?)
- Limit volume types. A sandboxed pod's persistent state should not be readable in the event of a privilege escalation to the host. Can use block device / encryption
- This would have an impact on ephemeral storage such as log storage. Again, this could be overcome by encryption / block devices.
3 - A sandboxed pod inherits no configuration or state from the host on which it runs
- Malicious modification of a secret or configmap on the host would otherwise be picked up by pods. Secrets, configmaps etc. should be explicitly copied in
4 - A privileged container within a sandboxed pod can only access the sandbox itself, not the host
5 - No local exec or attach
- Rather than disabling exec and attach entirely, there should be a way to allow them only via the Kubelet and prevent exec and attach locally on the CRI socket
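As a rough illustration that conventions like these really can be explicitly tested for, here's a sketch of a checker (in Python, with field names mirroring the Kubernetes PodSpec; the rule set itself is made up for this example and is far from complete):

```python
# Hypothetical checker: validates that a pod spec (expressed as a dict
# mirroring the Kubernetes PodSpec) follows the sandbox conventions above.
# The field names follow the real PodSpec; the rules are illustrative only.

# Volume types that write unencrypted state to the node, readable on host escalation.
HOST_READABLE_VOLUMES = {"hostPath", "emptyDir"}

def sandbox_violations(pod_spec):
    """Return a list of convention violations for a would-be sandboxed pod."""
    violations = []

    # Convention 1: no access to the Kubernetes control plane.
    # An auto-mounted service account token is the most common path in.
    if pod_spec.get("automountServiceAccountToken", True):
        violations.append("service account token is auto-mounted")

    # Convention 2: no state written to the host that other pods could read.
    for vol in pod_spec.get("volumes", []):
        for vol_type in HOST_READABLE_VOLUMES:
            if vol_type in vol:
                violations.append(f"volume {vol['name']!r} uses {vol_type}")

    # Convention 3: no configuration or state inherited from the host.
    if any(pod_spec.get(k) for k in ("hostNetwork", "hostPID", "hostIPC")):
        violations.append("pod shares a host namespace")

    return violations
```

A pod mounting a `hostPath` volume, say, would fail the check regardless of which runtime backs the sandbox, which is the point: the test targets the policy, not the implementation.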
This really only considers runtime and storage. It may not make sense to include conventions around network isolation given all the different providers and options out there. However, a list like this is portable, avoids plumbing / implementation details, and can be explicitly tested for.
I guess a criticism of this - other than it being over-simplistic - is that you could apply these constraints to a regular container runtime - you wouldn't necessarily need gVisor or Kata. In that respect, maybe it's not enough of a guarantee of security for some. You could add something about the kernel not being shared if there was a consensus about that. On the flip side, though, maybe that's a good thing. The benefits of introducing abstraction layers - be they x86 or system calls - are harder to assert or test for.
And that brings us to the question of allowing mixed sandbox types on a single node. There's an argument to be made that allowing un-sandboxed containers on a node presents a potential risk to the sandboxed ones - mostly through privilege escalation and subsequent control plane manipulation.
If there were a complete list such as the one above that could be agreed upon, it wouldn't necessarily negate the usefulness of being able to select a particular container runtime for reasons such as performance, hardware isolation, etc. But to me it seems so much simpler to keep that as a node-level attribute for now and use taints and tolerations for scheduling.
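To sketch what the node-level approach looks like in practice (the `sandbox=gvisor` taint key/value is made up for illustration): the runtime becomes a node attribute expressed as a NoSchedule taint, and only pods that explicitly tolerate it land there. A simplified model of the match:

```python
# Illustrative model of scheduling sandboxed workloads via taints and
# tolerations, keeping the runtime a node-level attribute. This is a
# simplified version of the Kubernetes rule: a pod may schedule onto a
# node only if every NoSchedule taint on it is matched by a toleration.

def tolerates(taint, tolerations):
    """True if any toleration matches the taint's key and value."""
    return any(t.get("key") == taint["key"] and t.get("value") == taint["value"]
               for t in tolerations)

def schedulable(pod, node_taints):
    """True if the pod tolerates every NoSchedule taint on the node."""
    return all(tolerates(t, pod.get("tolerations", []))
               for t in node_taints
               if t.get("effect") == "NoSchedule")

# A node dedicated to sandboxed workloads (example values):
node_taints = [{"key": "sandbox", "value": "gvisor", "effect": "NoSchedule"}]

sandboxed_pod = {"tolerations": [{"key": "sandbox", "value": "gvisor"}]}
regular_pod = {"tolerations": []}
```

Here `schedulable(sandboxed_pod, node_taints)` holds while `schedulable(regular_pod, node_taints)` doesn't, so un-sandboxed containers stay off the node entirely - which also sidesteps the mixed-node risk above.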
Ben