Hey everyone.
We are looking to converge on solutions for resource tracking for various Trusted Execution Environment (TEE) technologies like TDX, SEV-SNP, and more. In the Confidential Computing Meeting yesterday, I talked about what corresponds to solution #2 below.
Background of the Problem:
* There is a limited number of TEEs you can launch on a single host.
* On TDX, SEV-SNP, and other TEEs, the source of truth for tracking this capacity and usage is the kernel. The kernel exposes this information in `/sys/fs/cgroup/misc.capacity` for capacity and `/sys/fs/cgroup/misc.current` for current usage (see the small parsing sketch after this list). These files expose capacity and usage in a format that looks like this:
```
tdx 31
sev_es 127
```
* We want to accurately track this capacity in KubeVirt to ensure we do not attempt to launch or schedule more confidential VMs than the hardware can support.
* Based on previous discussions, the favored approach to this capacity tracking was to use device plugins (partly because we already need to mount a device into virt-launcher, at least in the case of TDX).
* But here is where things get tricky and interesting: there are entirely realistic scenarios where other operators in Kubernetes (and in theory even outside Kubernetes) may be consuming a part of this limit; the most notable example of this is Kata Containers, which in particular we want to ensure compatibility with.
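For reference, here is a minimal sketch of what reading these files could look like. The `tdx` resource name is taken from the example above, and error handling in `main` is elided:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readMiscResource parses a misc cgroup file ("misc.capacity" or
// "misc.current") and returns the value for the given resource name,
// e.g. "tdx" or "sev_es". Lines look like: "tdx 31".
func readMiscResource(path, resource string) (int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 2 && fields[0] == resource {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	if err := scanner.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("resource %q not found in %s", resource, path)
}

func main() {
	capacity, _ := readMiscResource("/sys/fs/cgroup/misc.capacity", "tdx")
	current, _ := readMiscResource("/sys/fs/cgroup/misc.current", "tdx")
	fmt.Printf("tdx: %d of %d keys in use\n", current, capacity)
}
```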
How does Kata track resource usage for TDX?
Kata has settled on using NFD alongside extended resource capacity to track usage of this limited resource. The NFD operator has logic to read the `/sys/fs/cgroup/misc.capacity` file when it initializes and save the resource capacity to a feature attribute called @cpu.security.tdx.total_keys. Kata then sets an NFD rule that uses this value to create an extended resource named tdx.intel.com/keys. My understanding is that Kata is pretty set on this approach (though if I am wrong please correct me; also see https://github.com/kubevirt/kubevirt/pull/16262#discussion_r2623986885 and the messages below it for more context).
The open question for us:
So coming back to KubeVirt, given that we do want to maintain full support for running alongside other operators, including at least Kata Containers, how do we implement resource tracking on our end? Should we stick with device plugins? Use DRA? Also use NFD? Adopt a hybrid approach?
One thing that was mentioned is that we definitely do not want KubeVirt to have a hard dependency on NFD.
Here is a summary of some options we have discussed:
1.
Proposal: Use the same approach as Kata by depending on NFD and sharing the extended resource with Kata when NFD is available (i.e. an optional dependency). When NFD is not available, we fall back to a standard device plugin approach.
This approach has the advantage of being stable and the least race-prone, but of course requires a non-trivial amount of work to deal with edge cases like NFD+Kata getting installed later or deleted later, and dealing with running VMs in such situations. Not super pretty...
—————
2.
Proposal: Use the kernel, the source of truth, to mediate any discrepancies between how many TEE “keys” KubeVirt thinks it has allocated and the true amount. Then, using the device plugin framework, we can mark devices that are being used by something other than us as “unhealthy”, preventing the scheduler from scheduling onto those devices.
Example: Suppose you have 16 available TDX “keys”, so your device plugin is initialized with a capacity of 16. Now KubeVirt is using 4. You read `/sys/fs/cgroup/misc.current` (the source of truth) and see that the true allocated amount of TDX keys is 6. That means 2 of the keys we think are available are actually already allocated by something else, so we set 2 of our 16 devices as “unhealthy”.
The advantage of this approach is that KubeVirt always knows the correct number of available TEE keys regardless of whatever else is running on that node, since we always reference the single source of truth: the almighty kernel!
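For illustration, here is a rough sketch (not KubeVirt's actual device plugin code) of how a plugin could mark a number of unallocated devices as unhealthy via the standard `k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1` API. `externallyUsed` and `allocatedIDs` are hypothetical inputs; computing them is exactly what the rest of this option is about:

```go
package teedevices

import (
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// buildDeviceList advertises `capacity` virtual "TEE key" devices and marks
// `externallyUsed` of the currently unallocated ones as unhealthy, so the
// scheduler stops counting them. allocatedIDs holds the device IDs we believe
// are in use by KubeVirt pods; those must stay healthy.
func buildDeviceList(capacity, externallyUsed int, allocatedIDs map[string]bool) []*pluginapi.Device {
	devices := make([]*pluginapi.Device, 0, capacity)
	toMark := externallyUsed
	for i := 0; i < capacity; i++ {
		id := fmt.Sprintf("tee-key-%d", i)
		health := pluginapi.Healthy
		// Only mark devices that are NOT currently allocated to our pods.
		if toMark > 0 && !allocatedIDs[id] {
			health = pluginapi.Unhealthy
			toMark--
		}
		devices = append(devices, &pluginapi.Device{ID: id, Health: health})
	}
	return devices
}

// In ListAndWatch, the plugin re-sends the device list whenever the
// bookkeeping changes, e.g.:
//
//	stream.Send(&pluginapi.ListAndWatchResponse{Devices: buildDeviceList(cap, used, ids)})
```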
Problem: The device plugin framework does not let us know how many devices are currently allocated; it is a stateless framework. While the framework provides us with an Allocate() callback (when a device is first allocated), there is no way to know when a device is deallocated, making it impossible to track how many devices are currently in use within the confines of this framework. Another problem is that even if you could know how many devices are allocated, you also need to know which devices are allocated, because the device plugin framework identifies each device uniquely using an ID. So if you need to reduce available capacity by two (for example), you must ensure the specific devices you mark unhealthy are currently NOT allocated. And this is non-trivial.
Solution: Use informers. We can register an asynchronous informer when the device plugin is first created to watch for the creation of new pods that request our device plugin resource, and use this information to maintain a counter. The informer can likewise watch for pods requesting this resource being deleted and decrement the counter. At the same time, we can watch `/sys/fs/cgroup/misc.current` for changes to the current usage reported by the source of truth (the kernel). I’m skipping some finer details, but essentially, any time the counter we maintain != the kernel’s counter, something other than KubeVirt is using a TEE key, and we must mark the difference as unhealthy devices. Importantly, we also need to identify which device ID is being allocated and deallocated (because the devices we mark as unhealthy MUST currently be unallocated), and this information is known only to the kubelet. There are two ways the kubelet exposes this information to us:
* When Allocate() is called, one piece of information it gives us is the device ID being allocated. Unfortunately, we don’t know whom it is being allocated to; we could push this information into a queue (“hey, I’m being allocated”) and have the informer connect that allocation with an event it received. However, this is going to be very racy and brittle logic, and likely unreliable, because we could associate the wrong pod with the wrong device.
* The kubelet exposes a gRPC endpoint (https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#grpc-endpoint-list) for listing the devices allocated to pods on the node. So instead, once the informer sees an event for a pod being started with our device plugin resource, it makes a gRPC call to get this information and saves it to a map (see the sketch below).
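For reference, a minimal sketch of that lookup against the kubelet's pod-resources socket, using the published `k8s.io/kubelet/pkg/apis/podresources/v1` API. The socket path is the usual default, and the resource name here is a hypothetical placeholder:

```go
package teedevices

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

const (
	// Default kubelet pod-resources socket; may differ per distro.
	podResourcesSocket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"
	// Hypothetical device plugin resource name, used here for illustration only.
	teeResourceName = "devices.kubevirt.io/tdx"
)

// listAllocatedTEEDevices returns a map of "namespace/pod" -> device IDs for
// our TEE resource, as reported by the kubelet's PodResourcesLister List().
func listAllocatedTEEDevices(ctx context.Context) (map[string][]string, error) {
	conn, err := grpc.NewClient(podResourcesSocket, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		return nil, err
	}

	allocated := map[string][]string{}
	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, dev := range container.GetDevices() {
				if dev.GetResourceName() == teeResourceName {
					key := pod.GetNamespace() + "/" + pod.GetName()
					allocated[key] = append(allocated[key], dev.GetDeviceIds()...)
				}
			}
		}
	}
	return allocated, nil
}
```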
Unresolved Problems:
There is still a slight race here: if KubeVirt and something else try to allocate a TEE key at the same time, there is a small window where one of the TEEs will fail to start because we wouldn’t have had enough time to update capacities and health. This is an intermittent failure and is not dangerous.
While this solution lets KubeVirt always track the correct amount of available TEE capacity straight from the kernel, it still doesn’t fix the fact that Kata can misallocate (though I don’t know if we consider that our problem).
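To tie option 2 together, here is a tiny sketch of the reconciliation step itself. It assumes a periodic poll of `misc.current` (I would not rely on inotify delivering change events for cgroupfs files), and the loop in the trailing comment refers to the hypothetical helpers from the earlier sketches:

```go
package teedevices

// externallyUsedKeys computes how many TEE keys are consumed by something
// other than KubeVirt: the kernel's misc.current value for the resource minus
// the number of keys our pod informer believes KubeVirt currently holds.
// The result is the number of unallocated devices to mark unhealthy.
func externallyUsedKeys(kernelCurrent, kubevirtAllocated int) int {
	if diff := kernelCurrent - kubevirtAllocated; diff > 0 {
		return diff
	}
	// The kernel should never report less usage than we account for; treat
	// that case as "nothing external" rather than failing.
	return 0
}

// A periodic loop would then re-send the device list whenever this value
// changes, e.g. (using the hypothetical helpers sketched earlier):
//
//	for range time.Tick(5 * time.Second) {
//		current, _ := readMiscResource("/sys/fs/cgroup/misc.current", "tdx")
//		unhealthy := externallyUsedKeys(int(current), informerCounter())
//		stream.Send(&pluginapi.ListAndWatchResponse{
//			Devices: buildDeviceList(capacity, unhealthy, allocatedIDs()),
//		})
//	}
```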
—————
Hi Aseef and Zhenchao,
Thanks for this write-up and for extracting this conversation to its own thread.
I'm concerned about creating an implicit dependency on NFD (even if optional); I'd prefer to avoid it.
On the configurable CR idea (secureGuestResources in the KubeVirt CR): it's a nice way to make the resource handling pluggable without hardcoding NFD, but it doesn't fully address the attestation handling we agreed on in https://github.com/kubevirt/enhancements/pull/113. In that VEP and in the ongoing TDX attestation PR, we're already tying QGS socket mounting and health checks to the device plugin.
TDX and SNP require different implementation approaches. Since @Aseef Imran is already addressing TDX, my PR will focus on SNP. @Aseef Imran Does this alignment work for you?
Also, please note that I am excluding SEV/SEV-ES for now as they are experimental; SNP is our target.
From my perspective, we should prioritize DRA support while staying mindful of timelines. For this release, I suggest ignoring Kata entirely.
KubeVirt is a separate project and doesn't consider Kata when it allocates resources like GPUs/vGPUs or other devices.
Let's first focus on making the device plugin approach fully functional, then shift to DRA in the next release.