[Confidential Computing WG] Solutions for TEE Resource Tracking


Aseef Imran

Jan 28, 2026, 12:56:57 PM
to kubevirt-dev

Hey everyone.


We are looking to converge on a solution for resource tracking across the various Trusted Execution Environment (TEE) technologies like TDX, SEV-SNP, and more. In the Confidential Computing meeting yesterday, I presented what corresponds to solution #2 below.


Background of the Problem:

* There is a limited number of TEEs you can launch on a single host. 

* On TDX, SEV-SNP, and other TEEs, the source of truth for tracking this capacity and usage is the kernel. The kernel exposes this information in `/sys/fs/cgroup/misc.capacity` for capacity and `/sys/fs/cgroup/misc.current` for current usage. These files expose capacity and usage in a format that looks like this (a minimal parsing sketch follows this list):

```
tdx 31
sev_es 127
```

* We want to accurately track this capacity in KubeVirt to ensure we do not attempt to launch or schedule more confidential VMs than the hardware can support.

* Based on previous discussions, the favored approach for this capacity tracking was device plugins (partly because we already need to mount a device into virt-launcher, at least in the case of TDX).

* But here is where things get tricky and interesting: there are entirely realistic scenarios where other operators in Kubernetes (and in theory even outside Kubernetes) may be consuming part of this limit. The most notable example is Kata Containers, which we particularly want to ensure compatibility with.
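
To make the file format concrete, here is a minimal Go sketch for reading these counters. This is my own illustration, assuming only the `<resource> <count>` line format shown above, not taken from any existing implementation:

```
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMiscFile reads a misc cgroup file such as
// /sys/fs/cgroup/misc.capacity or /sys/fs/cgroup/misc.current,
// where each line is "<resource> <count>", e.g. "tdx 31".
func parseMiscFile(path string) (map[string]int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	out := map[string]int64{}
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) != 2 {
			continue // skip malformed lines
		}
		n, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %q: %w", s.Text(), err)
		}
		out[fields[0]] = n
	}
	return out, s.Err()
}

func main() {
	capacity, err := parseMiscFile("/sys/fs/cgroup/misc.capacity")
	if err != nil {
		panic(err)
	}
	fmt.Println(capacity["tdx"]) // e.g. 31
}
```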


How does Kata track resource usage for TDX?

Kata has settled on using NFD alongside extended resource capacity to track usage of this limited resource. The NFD operator has logic to read the `/sys/fs/cgroup/misc.capacity` file when it initializes and saves the resource capacity to a feature attribute referenced as @cpu.security.tdx.total_keys. Kata then sets an NFD rule to use this value to create an extended resource named tdx.intel.com/keys. My understanding is Kata is pretty set on this approach (though if I am wrong please correct me; also see https://github.com/kubevirt/kubevirt/pull/16262#discussion_r2623986885 and the messages below it for more context).
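
For illustration, such a rule would look roughly like the sketch below. This is my reconstruction from the NFD NodeFeatureRule API and the attribute name above, not the actual rule Kata ships:

```
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: tdx-keys
spec:
  rules:
    - name: "advertise TDX key capacity"
      extendedResources:
        # "@" tells NFD to take the value from the named feature attribute
        tdx.intel.com/keys: "@cpu.security.tdx.total_keys"
      matchFeatures:
        - feature: cpu.security
          matchExpressions:
            tdx.enabled: {op: IsTrue}
```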


The open question for us:

So coming back to KubeVirt: given that we do want to maintain full support for running alongside other operators, including at least Kata Containers, how do we implement resource tracking on our end? Should we stick with device plugins? Use DRA? Also use NFD? Adopt a hybrid approach?


One thing that was mentioned is that we definitely do not want KubeVirt to have a hard dependency on NFD.


Here is a summary of some options we have discussed:


1.

Proposal: Use the same approach as Kata by depending on NFD and sharing the extended resource with Kata when NFD is available (i.e. an optional dependency). When NFD is not available, we fall back to a standard device plugin approach.


This approach has the advantage of being stable and the least race-prone, but it of course requires a non-trivial amount of work to deal with edge cases like NFD+Kata getting installed or deleted later, and handling running VMs in such situations. Not super pretty...


—————


2.

Proposal: Use the kernel, the source of truth, to mediate any discrepancies between how many TEE “keys” KubeVirt thinks it has allocated and the true amount. Then, using the device plugin framework, we can mark devices that are being used by something other than us as “unhealthy”, preventing the scheduler from handing out those devices.


Example: Suppose you have 16 available TDX “keys”, so your device plugin is initialized with a capacity of 16, and KubeVirt is currently using 4. You read `/sys/fs/cgroup/misc.current` (the source of truth) and see that the true allocated amount of TDX keys is 6. That means 2 of the keys we think are available are actually already allocated by something else, so we mark 2 of our 16 devices as “unhealthy”.
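
To make the bookkeeping concrete, here's a minimal Go sketch of that reconciliation against the device plugin API types (k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1). The function names and the polling are my own illustration, not the actual implementation; note that, as far as I know, cgroupfs files don't emit inotify events, so the “watcher” here is a poll loop:

```
package main

import (
	"time"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// reconcileHealth recomputes device health from the kernel's view.
//   - ourUsage:    keys KubeVirt believes it allocated (4 in the example)
//   - kernelUsage: keys allocated host-wide per misc.current (6)
//   - allocated:   device IDs the kubelet has handed out; those must stay
//     healthy, so only currently idle devices get poisoned
func reconcileHealth(devices []*pluginapi.Device, ourUsage, kernelUsage int, allocated map[string]bool) {
	external := kernelUsage - ourUsage // keys consumed by someone else (2)
	for _, d := range devices {
		d.Health = pluginapi.Healthy
	}
	for _, d := range devices {
		if external <= 0 {
			break
		}
		if !allocated[d.ID] {
			d.Health = pluginapi.Unhealthy
			external--
		}
	}
	// The updated device list is then pushed to the kubelet via the
	// plugin's ListAndWatch stream.
}

// pollMiscCurrent re-reads the kernel's usage counter once a second and
// fires onChange when it moves.
func pollMiscCurrent(read func() int, onChange func(int)) {
	last := -1
	for range time.Tick(time.Second) {
		if cur := read(); cur != last {
			last = cur
			onChange(cur)
		}
	}
}
```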


The advantage of this approach is that KubeVirt can always know the correct number of available TEE keys regardless of whatever else is running on the node, since we always reference the single source of truth: the almighty kernel!


Problem: The device plugin framework does not let us know how many devices are currently allocated; it is a stateless framework. While the framework provides an Allocate() callback (called when devices are allocated to a container), there is no way to know when a device is deallocated, making it impossible to track how many devices are currently in use within the confines of the framework. Another problem is that even if you could know how many devices are allocated, you also need to know which devices are allocated, because the framework identifies each device uniquely by an ID. So if you need to reduce the available capacity by two (for example), you must ensure the specific devices you mark unhealthy are currently NOT allocated. And this is non-trivial.


Solution: Use informers. We can register an asynchronous informer when the device plugin is first created to watch for the creation of new pods that request our device plugin, and maintain a counter from this information. Similarly, the informer can watch for pods requesting this resource being deleted and decrement the counter. At the same time, we can watch `/sys/fs/cgroup/misc.current` for changes to the current usage reported by the source of truth (the kernel). I'm skipping some finer details, but essentially, any time the counter we maintain != the kernel's counter, something other than KubeVirt is using a TEE key, and we must mark the difference as unhealthy devices. Importantly, we also need to identify which device ID is being allocated and deallocated (because the devices we mark as unhealthy MUST currently be unallocated), and this information is known only to the kubelet. There are two ways the kubelet exposes this information to us:

  1. When Allocate() is called, one piece of information it gives us is the device ID being allocated. Unfortunately, we don't know to whom it is being allocated; however, we can push this information into a queue (“hey, I'm being allocated”) and the informer can connect the allocation with an event it received. But this would be very racy and brittle logic, and likely not reliable, because we could associate the wrong pods with the wrong devices.

  2. The kubelet exposes a gRPC endpoint (https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#grpc-endpoint-list) to list the resources allocated to running pods. So instead, once the informer sees a pod with our resource starting, it makes a gRPC call to get this information and saves it to a map (see the sketch below).
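
Here's a rough Go sketch of what option 2 could look like, assuming the kubelet's default pod-resources socket path and a hypothetical resource name (devices.kubevirt.io/tdx):

```
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc" // use grpc.Dial on older grpc-go
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

const (
	kubeletSocket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock" // default path
	resourceName  = "devices.kubevirt.io/tdx"                            // hypothetical name
)

// listTDXDeviceIDs asks the kubelet which of our device IDs are
// currently assigned, and to which pod.
func listTDXDeviceIDs(ctx context.Context) (map[string][]string, error) {
	conn, err := grpc.NewClient(kubeletSocket,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		return nil, err
	}

	// Map "namespace/pod" -> device IDs of our resource.
	byPod := map[string][]string{}
	for _, pod := range resp.GetPodResources() {
		for _, c := range pod.GetContainers() {
			for _, dev := range c.GetDevices() {
				if dev.GetResourceName() == resourceName {
					key := pod.GetNamespace() + "/" + pod.GetName()
					byPod[key] = append(byPod[key], dev.GetDeviceIds()...)
				}
			}
		}
	}
	return byPod, nil
}

func main() {
	ids, err := listTDXDeviceIDs(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Println(ids)
}
```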


Unresolved Problems:

  • There is still a slight race here: if KubeVirt and something else try to allocate a TEE key at the same time, there is a small window where one of the TEEs will fail to start because we wouldn't have had enough time to update capacities and health. This is an intermittent failure and is not dangerous.

  • While this solution lets KubeVirt always track the correct amount of available TEE capacity straight from the kernel, it still doesn't fix the fact that Kata can misallocate (though I don't know if we consider that our problem).


—————


3.

Proposal: Given the problem, DPs are clearly not the best fit for the job. It is likely better to use DRA since, from what I understand, it doesn't have most of the limitations of DPs. With DRA we can still reference the kernel (the source of truth) and always know the correct amount of available capacity much more easily.

Problem: DRA support is not ready in KubeVirt. However, adding resource capacity tracking is not urgent, and support for confidential computing is not GA yet.
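
To give a flavor of what a DRA driver could publish, here's a rough sketch against the structured-parameters API (resource.k8s.io/v1beta1 types, as of recent Kubernetes); the driver and pool names are hypothetical, not a settled design:

```
package main

import (
	"fmt"

	resourcev1beta1 "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildResourceSlice advertises one DRA device per TDX key reported by
// /sys/fs/cgroup/misc.capacity. A reconciler would re-publish the slice
// whenever the kernel's view changes.
func buildResourceSlice(nodeName string, tdxKeys int64) *resourcev1beta1.ResourceSlice {
	devices := make([]resourcev1beta1.Device, 0, tdxKeys)
	for i := int64(0); i < tdxKeys; i++ {
		devices = append(devices, resourcev1beta1.Device{
			Name:  fmt.Sprintf("tdx-key-%d", i),
			Basic: &resourcev1beta1.BasicDevice{},
		})
	}
	return &resourcev1beta1.ResourceSlice{
		ObjectMeta: metav1.ObjectMeta{Name: nodeName + "-tdx"},
		Spec: resourcev1beta1.ResourceSliceSpec{
			Driver:   "tee.kubevirt.io", // hypothetical driver name
			NodeName: nodeName,
			Pool: resourcev1beta1.ResourcePool{
				Name:               nodeName,
				ResourceSliceCount: 1,
			},
			Devices: devices,
		},
	}
}
```

Since the scheduler allocates devices from the slice itself, the deallocation-tracking gymnastics from option #2 should largely go away.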

Zhenchao Liu

Jan 29, 2026, 3:44:02 AM
to Aseef Imran, kubevirt-dev
Hi Aseef,

Thanks for raising this.

Regarding the DRA proposal, I am curious how well it integrates with extended resources. 
For example, if DRA reports a schedulable node with only one resource available, but a Kata pod starts on that node and consumes the resource before the KubeVirt pod is scheduled, would the KubeVirt pod simply remain in a pending state?

I am also considering a solution:

1. Configurable External Resources: 
We could make external resources configurable in the KubeVirt CR. For example:
```
secureGuestResources:
  tdx: tdx.intel.com/keys
  snp: sev.amd.com/esids
```

- When secureGuestResources is set, KubeVirt uses the configured resources
- When it is not set (the default), KubeVirt starts the device plugin to advertise the resources itself.

2. Handling Kata Removal:
- If there are running confidential VMs, NFD/extended resources must not be removed, as KubeVirt will continue to depend on them.
- If there are no running confidential VMs, NFD/extended resources can be removed. At that point, secureGuestResources can be unset, allowing KubeVirt to start its own device plugin to advertise the resources.

It is straightforward, but we need to properly document the configuration process and ensure Kata's removal is handled carefully.

Thanks,
Zhenchao


Vladik Romanovsky

Jan 29, 2026, 5:13:36 PM
to Zhenchao Liu, Aseef Imran, kubevirt-dev
Hi Aseef and Zhenchao,

Thanks for this write-up and for extracting this conversation to its own thread.

I'm concerned about creating an implicit dependency on NFD (even if optional); I'd prefer to avoid it.

On the configurable CR idea (secureGuestResources in the KubeVirt CR), it's a nice way to make the resource handling pluggable without hardcoding NFD, but it doesn't fully address attestation handling as we agreed on in https://github.com/kubevirt/enhancements/pull/113. In that VEP and in the ongoing TDX attestation PR, we're already tying QGS socket mounting and health checks to the device plugin.

From my perspective, we should prioritize DRA support while staying mindful of timelines. 
For this release, I suggest ignoring Kata entirely. 
KubeVirt is a separate project and doesn't consider Kata when it allocates resources like GPUs/vGPUs or other devices.
Let's first focus on making the device plugin approach fully functional, then shift to DRA in the next release.
If we have enough time left in this cycle, we can start prototyping the DRA solution early.

There are several parallel efforts to integrate DRA in KubeVirt. It therefore makes sense to develop a DRA driver for TEEs (especially TDX) and incorporate vsock support too, as a follow-up. DRA will eventually replace device plugins; in the short term, the device plugin can serve as a bridge.

That's my take :)

Thanks,
Vladik

Zhenchao Liu

Jan 29, 2026, 11:41:08 PM
to Vladik Romanovsky, Aseef Imran, kubevirt-dev
Hi Vladik and Aseef,

On Fri, Jan 30, 2026 at 6:13 AM Vladik Romanovsky <vrom...@redhat.com> wrote:
> Hi Aseef and Zhenchao,
>
> Thanks for this write-up and for extracting this conversation to its own thread.
>
> I'm concerned about creating an implicit dependency on NFD (even if optional); I'd prefer to avoid it.
>
> On the configurable CR idea (secureGuestResources in the KubeVirt CR), it's a nice way to make the resource handling pluggable without hardcoding NFD, but it doesn't fully address attestation handling as we agreed on in https://github.com/kubevirt/enhancements/pull/113. In that VEP and in the ongoing TDX attestation PR, we're already tying QGS socket mounting and health checks to the device plugin.

TDX and SNP require different implementation approaches. Since @Aseef Imran is already addressing TDX, my PR will focus on SNP. @Aseef Imran, does this alignment work for you?

Also, please note that I am excluding SEV/SEV-ES for now as they are experimental; SNP is our target.

 

> From my perspective, we should prioritize DRA support while staying mindful of timelines.
> For this release, I suggest ignoring Kata entirely.
> KubeVirt is a separate project and doesn't consider Kata when it allocates resources like GPUs/vGPUs or other devices.
> Let's first focus on making the device plugin approach fully functional, then shift to DRA in the next release.
Agreed. Let's proceed with the device plugin implementation and leave Kata out of scope for now.

Thanks,
Zhenchao

Aseef Imran

Jan 30, 2026, 5:22:18 PM
to Zhenchao Liu, Vladik Romanovsky, kubevirt-dev
Zhenchao, that sounds good!

As for the TDX device plugin implementation, I have a draft you can view here for context: https://github.com/Aseeef/kubevirt/compare/device-plugins-4...tdx-qgs-2.
It currently depends on https://github.com/kubevirt/kubevirt/pull/16073, so I'm hoping that can get merged soon; otherwise I may un-rebase it and submit it as its own PR...

Thanks!
--
Muhammad Aseef Imran (Aseef)
Software Engineer
https://source.redhat.com/.profile/aimran