Per-Pod CPU Frequency (P-State) Control — Any Existing Work?


Thiruveedula, Bharath

Mar 25, 2026, 4:08:38 PM
to sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
Hi SIG Node,

I'd like to raise a discussion around per-pod CPU frequency management in Kubernetes and check if this is already being explored by the community.

Kubernetes clusters commonly run a mix of critical and non-critical workloads on the same nodes — latency-sensitive services alongside background pods like log collectors, monitoring agents, and batch housekeeping. Today, CPU performance states (P-states) can only be configured at the node level, which means every pod on a node runs at the same CPU frequency regardless of its priority. There is no way for a Kubernetes admin to say "this pod is low priority and should run at a reduced CPU frequency" without affecting all other workloads on that node.

This matters for Kubernetes admins managing power-constrained environments or dense deployments, where staying within node or rack power budgets is important. Being able to cap CPU frequency for non-critical pods would also help with thermal management — reducing heat on cores running background work and preserving headroom for co-located critical workloads. Today's QoS classes (Guaranteed, Burstable, BestEffort) control CPU shares and limits, but they have no influence over the hardware frequency at which those cycles execute.

I'm curious whether there is any existing KEP, proposal, or active work in this area. I'd also appreciate any thoughts on whether this is something that belongs in the kubelet's resource management or would be better suited as an external component like an operator or device plugin. If anyone else in the community is exploring this space, I'd love to connect.

Happy to join a SIG Node meeting to discuss further if there's interest.

Thanks,
Bharath Thiruveedula

John Belamaric

Mar 26, 2026, 3:24:01 AM
to Thiruveedula, Bharath, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
This is very in line with the capabilities we are making available by using the DRA CPU Driver. Take a look at  and join us in the #wg-device-management Kubernetes Slack channel to learn and discuss more. 

John

--
You received this message because you are subscribed to the Google Groups "sig-node" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sig-node+u...@kubernetes.io.
To view this discussion visit https://groups.google.com/a/kubernetes.io/d/msgid/sig-node/CAOX3LY%3D6%2By7vGSMMC0v6LM3Kqs39CWxCN-Gkj%3D4k48BfrRRfQg%40mail.gmail.com.

Feruz

Mar 26, 2026, 4:45:57 AM
to John Belamaric, Thiruveedula, Bharath, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian

One way to achieve this is by using NRI (Node Resource Interface). For example, the NRI Balloons plugin[1] allows you to create “balloons” and assign specific workloads to them. This effectively partitions a node, enabling you to tune hardware parameters for each partition independently. For instance, you can configure options such as minFreq, maxFreq, uncoreMinFreq, and uncoreMaxFreq to control CPU frequency behavior within a given balloon. Below is an example snapshot of a Balloons CR:

  control:
    cpu:
      classes:
        ultra-low-latency:
          minFreq: 3500000
          maxFreq: 3900000
          uncoreMinFreq: 2400000
          uncoreMaxFreq: 2400000
          disabledCstates: [C6, C7, C8, C10]
        normal:
          minFreq: 800000
          maxFreq: 2500000
        powersave:
          minFreq: 800000
          maxFreq: 800000

[1] https://containers.github.io/nri-plugins/stable/docs/resource-policy/policy/balloons.html# 

Best regards,
Feruz


Thiruveedula, Bharath

Mar 27, 2026, 2:50:02 PM
to Feruz, John Belamaric, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
Thank you both for the pointers — I've spent some time looking at both the DRA CPU Driver and the NRI Balloons plugin in detail.

The DRA CPU Driver is focused on topology-aware CPU allocation and placement, which is valuable but doesn't address CPU frequency or P-state control. There are no frequency-related attributes or controls, and I don't see it on the roadmap for 0.2.0 either. Correct me if I'm wrong.

The NRI Balloons plugin is closer to what I described. The per-balloon CPU class configuration with minFreq, maxFreq, and C-state controls is exactly the kind of knob I was looking for. However, after reading through the implementation, I see that it works by partitioning CPUs into exclusive groups — each balloon owns its cores, and frequency is set via direct sysfs writes to scaling_min_freq/scaling_max_freq on those physical cores. This means two pods in different balloons with different frequency targets cannot share the same core.
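To make the distinction concrete, the mechanism described above boils down to per-core sysfs writes. Here is a minimal Python sketch of that mechanism (illustrative only, not the plugin's actual code; it assumes the standard cpufreq sysfs layout and would need root on a real node):

```python
from pathlib import Path

def cap_core_frequency(cpu_id: int, min_khz: int, max_khz: int,
                       sysfs_root: str = "/sys/devices/system/cpu") -> None:
    """Pin one physical core's frequency range via the cpufreq sysfs knobs.

    Direct writes to scaling_min_freq/scaling_max_freq are per-core, which
    is exactly why this style of control applies to the whole core rather
    than to any one cgroup or task scheduled on it.
    """
    cpufreq = Path(sysfs_root) / f"cpu{cpu_id}" / "cpufreq"
    (cpufreq / "scaling_min_freq").write_text(str(min_khz))
    (cpufreq / "scaling_max_freq").write_text(str(max_khz))
```

Since the knob lives under `cpuN/cpufreq/`, two pods time-sharing core N necessarily see the same frequency range — hence the partitioning requirement.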

That works well for Guaranteed pods with dedicated CPUs, but the gap I'm trying to address is specifically around burstable and best-effort pods that share cores. In a typical cluster, non-critical background pods (monitoring, log shipping, housekeeping) run as burstable workloads without exclusive CPU allocation. There's currently no way to tell the scheduler or the kubelet "this pod should run at a lower CPU frequency" without dedicating physical cores to it, which defeats the density benefit of shared-core scheduling.

The underlying question is whether there's any existing work or proposal around per-pod frequency control that operates at the cgroup or task level rather than the physical core level - something that would work for pods sharing cores without requiring CPU partitioning. Has anyone in the community explored this direction, or is there any ongoing work I might have missed?

Thanks,
Bharath

Eric Tune

Mar 27, 2026, 3:05:56 PM
to Thiruveedula, Bharath, Feruz, John Belamaric, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
AIUI, setting the CPU control registers to change the P-state does not take very long. So, you could, I suppose, change the P-state when you context switch between a "critical" task and a "non-critical" task. However, it takes some time (on the order of 1 ms) for the processor to actually ramp the operating voltage up or down to the target for the new P-state, and this happens asynchronously.
Meanwhile, there can be several context switches between critical and non-critical threads, so the P-states wouldn't necessarily line up with the task that is running.
If you block on the P-state actuation completing, then you introduce unwanted delays.
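A back-of-the-envelope model of the mismatch Eric describes (the ~1 ms ramp figure is from his message; the scheduling-slice lengths below are assumptions for illustration):

```python
def stale_freq_fraction(slice_us: float, ramp_us: float) -> float:
    """Fraction of a scheduling slice spent before the frequency settles.

    If the voltage/frequency ramp takes ramp_us but the task runs only
    slice_us before the next context switch, the core spends
    min(ramp_us, slice_us) of that slice at the previous (stale)
    operating point.
    """
    return min(ramp_us, slice_us) / slice_us

# With ~1 ms ramps and sub-millisecond slices, the P-state never
# catches up with the currently running task:
assert stale_freq_fraction(slice_us=500, ramp_us=1000) == 1.0
# Even with generous 10 ms slices, 10% of each slice still runs
# at the stale frequency:
assert stale_freq_fraction(slice_us=10_000, ramp_us=1000) == 0.1
```

This is why per-context-switch P-state flipping tends to either miss its target or (if you wait for actuation) add latency.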


Sergey Kanzhelev

Mar 27, 2026, 3:43:49 PM
to Eric Tune, Thiruveedula, Bharath, Feruz, John Belamaric, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
> The underlying question is whether there's any existing work or proposal around per-pod frequency control that operates at the cgroup or task level rather than the physical core level - something that would work for pods sharing cores without requiring CPU partitioning. Has anyone in the community explored this direction, or is there any ongoing work I might have missed?

Most of the work I saw is based on dynamically adjusting cpusets to align with the "demand". As Eric mentioned above, there will be challenges in trying to time-partition CPU performance.

/Sergey


John Belamaric

Mar 28, 2026, 5:34:18 AM
to Thiruveedula, Bharath, Feruz, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
On Fri, Mar 27, 2026 at 6:49 PM Thiruveedula, Bharath <bharath.th...@verizon.com> wrote:
Thank you both for the pointers — I've spent some time looking at both the DRA CPU Driver and the NRI Balloons plugin in detail.

The DRA CPU Driver is focused on topology-aware CPU allocation and placement, which is valuable but doesn't address CPU frequency or P-state control. There are no frequency-related attributes or controls, and I don't see it on the roadmap for 0.2.0 either. Correct me if I'm wrong.

DRA allows users to attach configuration to the device request and/or device class. This provides the API mechanism to send the per-Pod intent through to the driver. While there is nothing specific on the roadmap, we can take a look at your requirements and I suspect it may be fairly simple to add. Is this a vendor-specific configuration or does the kernel provide this level of control?

Francesco Romani

Mar 30, 2026, 3:25:25 AM
to Thiruveedula, Bharath, Feruz, John Belamaric, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
On Fri, Mar 27, 2026 at 7:50 PM 'Thiruveedula, Bharath' via sig-node
<sig-...@kubernetes.io> wrote:
>
> Thank you both for the pointers — I've spent some time looking at both the DRA CPU Driver and the NRI Balloons plugin in detail.
>
> The DRA CPU Driver is focused on topology-aware CPU allocation and placement, which is valuable but doesn't address CPU frequency or P-state control. There are no frequency-related attributes or controls, and I don't see it on the roadmap for 0.2.0 either. Correct me if I'm wrong.

Hi, this is an interesting feature to think about, thanks for sharing.
Echoing John's answer, there's nothing on the roadmap of the DRA CPU
Driver in this area, mostly because we haven't heard about the use case
yet.
We plan to distinguish and manage different core types
(efficiency/performance) in
https://github.com/kubernetes-sigs/dra-driver-cpu/issues/11 , but that
is all regarding power-related areas so far.
I think a DRA driver is the right place to enable power management,
and I see custom (or tunable) allocation logic as a building block, in
order to enable dynamic partitioning by core frequency, for example.

[...]
> The underlying question is whether there's any existing work or proposal around per-pod frequency control that operates at the cgroup or task level rather than the physical core level - something that would work for pods sharing cores without requiring CPU partitioning. Has anyone in the community explored this direction, or is there any ongoing work I might have missed?

There is the Kubernetes Power Manager project; I don't know if it is
still active, and I don't recall whether it covered this angle. Offhand,
I believe this form of power management would require cooperation with
the cgroup manager, so the kubelet.

--
Francesco Romani -- software engineer @ Red Hat
https://github.com/ffromani

Thiruveedula, Bharath

Apr 7, 2026, 12:23:00 AM
to Francesco Romani, John Belamaric, sig-...@kubernetes.io, Feruz, skanz...@google.com, et...@google.com, Myron Eugene (Gene) Bagwell, Sujith Cherian, Jamie Dietsch
Hi everyone,

Thank you for the detailed responses.

Eric and Sergey, I agree that setting P-states for burstable or best-effort workloads is problematic, as a task might yield before the change takes effect, potentially affecting the next task.

While userspace controls may not be viable, I have been exploring kernel-level features. I specifically came across the uclamp feature (https://docs.kernel.org/scheduler/sched-util-clamp.html), which might offer a different approach to this problem.

Uclamp allows setting cpu.uclamp.max and cpu.uclamp.min per cgroup in cgroup v2. Instead of directly flipping P-states, it caps the utilization signal seen by the schedutil governor, which in turn influences frequency selection. This provides a declarative, static hint for the cgroup, avoiding the need for per-context-switch changes. Since the kernel aggregates uclamp values across the runnable tasks on a core, the highest cap among active tasks takes precedence.
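A sketch of what this could look like from a node agent's perspective (hypothetical helper names; per the kernel cgroup v2 docs, cpu.uclamp.max takes a percentage string such as "50.00" or the literal "max", and the max-aggregation behavior is as described above):

```python
from pathlib import Path
from typing import Optional

# Kernel-internal utilization scale (SCHED_CAPACITY_SCALE).
UCLAMP_FULL = 1024

def set_cgroup_uclamp_max(cgroup_dir: str, percent: float) -> None:
    """Cap the utilization signal for one pod cgroup.

    Writes cpu.uclamp.max as a percentage of max capacity; write "max"
    instead to remove the cap. Requires a cgroup v2 hierarchy and a
    kernel built with UTIL_CLAMP.
    """
    (Path(cgroup_dir) / "cpu.uclamp.max").write_text(f"{percent:.2f}")

def effective_cap(runnable_task_caps: list[Optional[int]]) -> int:
    """Model the kernel's max-aggregation across runnable tasks on a core.

    An uncapped task (None -> full scale) lifts the whole core, so a
    capped background pod cannot drag down a co-scheduled critical pod —
    but it also means the cap only bites while no uncapped task is runnable.
    """
    caps = [UCLAMP_FULL if c is None else c for c in runnable_task_caps]
    return max(caps, default=UCLAMP_FULL)
```

The `effective_cap` model captures both the appeal (no per-switch actuation, safe for co-located critical work) and the limit (the cap on a shared core is only effective during windows when only capped tasks are runnable).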

On paper, this seems like it could address the per-pod frequency differentiation use case, though I am uncertain how reliably it handles Burstable or BestEffort workloads. Has anyone experimented with uclamp in this context?

Additionally, for those using the NRI Balloons policy, have you observed any measurable reduction in power consumption that you can share?


Thanks,
Bharath


