Per-Pod CPU Frequency (P-State) Control — Any Existing Work?


Thiruveedula, Bharath

Mar 25, 2026, 4:08:38 PM
to sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
Hi SIG Node,

I'd like to raise a discussion around per-pod CPU frequency management in Kubernetes and check if this is already being explored by the community.

Kubernetes clusters commonly run a mix of critical and non-critical workloads on the same nodes — latency-sensitive services alongside background pods like log collectors, monitoring agents, and batch housekeeping. Today, CPU performance states (P-states) can only be configured at the node level, which means every pod on a node runs at the same CPU frequency regardless of its priority. There is no way for a Kubernetes admin to say "this pod is low priority and should run at a reduced CPU frequency" without affecting all other workloads on that node.

This matters for Kubernetes admins managing power-constrained environments or dense deployments, where staying within node or rack power budgets is important. Being able to cap CPU frequency for non-critical pods would also help with thermal management — reducing heat on cores running background work and preserving headroom for co-located critical workloads. Today's QoS classes (Guaranteed, Burstable, BestEffort) control CPU shares and limits, but they have no influence over the hardware frequency at which those cycles execute.

I'm curious whether there is any existing KEP, proposal, or active work in this area. I'd also appreciate any thoughts on whether this is something that belongs in the kubelet's resource management or would be better suited as an external component like an operator or device plugin. If anyone else in the community is exploring this space, I'd love to connect.

Happy to join a SIG Node meeting to discuss further if there's interest.

Thanks,
Bharath Thiruveedula

John Belamaric

Mar 26, 2026, 3:24:01 AM
to Thiruveedula, Bharath, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
This is very in line with the capabilities we are making available by using the DRA CPU Driver. Take a look at  and join us in the #wg-device-management Kubernetes Slack channel to learn and discuss more. 

John

--
You received this message because you are subscribed to the Google Groups "sig-node" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sig-node+u...@kubernetes.io.
To view this discussion visit https://groups.google.com/a/kubernetes.io/d/msgid/sig-node/CAOX3LY%3D6%2By7vGSMMC0v6LM3Kqs39CWxCN-Gkj%3D4k48BfrRRfQg%40mail.gmail.com.

Feruz

Mar 26, 2026, 4:45:57 AM
to John Belamaric, Thiruveedula, Bharath, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian

One way to achieve this is by using NRI (Node Resource Interface). For example, the NRI Balloons plugin[1] allows you to create “balloons” and assign specific workloads to them. This effectively partitions a node, enabling you to tune hardware parameters for each partition independently. For instance, you can configure options such as minFreq, maxFreq, uncoreMinFreq, and uncoreMaxFreq to control CPU frequency behavior within a given balloon. Below is an example snapshot of a Balloons CR:

  control:
    cpu:
      classes:
        ultra-low-latency:
          minFreq: 3500000
          maxFreq: 3900000
          uncoreMinFreq: 2400000
          uncoreMaxFreq: 2400000
          disabledCstates: [C6, C7, C8, C10]
        normal:
          minFreq: 800000
          maxFreq: 2500000
        powersave:
          minFreq: 800000
          maxFreq: 800000

[1] https://containers.github.io/nri-plugins/stable/docs/resource-policy/policy/balloons.html# 

Best regards,
Feruz


Thiruveedula, Bharath

Mar 27, 2026, 2:50:02 PM
to Feruz, John Belamaric, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
Thank you both for the pointers — I've spent some time looking at both the DRA CPU Driver and the NRI Balloons plugin in detail.

The DRA CPU Driver is focused on topology-aware CPU allocation and placement, which is valuable but doesn't address CPU frequency or P-state control. There are no frequency-related attributes or controls, and I don't see it on the roadmap for 0.2.0 either. Correct me if I'm wrong.

The NRI Balloons plugin is closer to what I described. The per-balloon CPU class configuration with minFreq, maxFreq, and C-state controls is exactly the kind of knob I was looking for. However, after reading through the implementation, I see that it works by partitioning CPUs into exclusive groups — each balloon owns its cores, and frequency is set via direct sysfs writes to scaling_min_freq/scaling_max_freq on those physical cores. This means two pods in different balloons with different frequency targets cannot share the same core.
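To make the distinction concrete, the mechanism described above boils down to per-core sysfs writes. Here is a minimal Python sketch of that mechanism (illustrative only, not the plugin's actual code; it assumes the standard cpufreq sysfs layout and would need root on a real node):

```python
from pathlib import Path

def cap_core_frequency(cpu_id: int, min_khz: int, max_khz: int,
                       sysfs_root: str = "/sys/devices/system/cpu") -> None:
    """Pin one physical core's frequency range via the cpufreq sysfs knobs.

    Direct writes to scaling_min_freq/scaling_max_freq are per-core, which
    is exactly why this style of control applies to the whole core rather
    than to any one cgroup or task scheduled on it.
    """
    cpufreq = Path(sysfs_root) / f"cpu{cpu_id}" / "cpufreq"
    (cpufreq / "scaling_min_freq").write_text(str(min_khz))
    (cpufreq / "scaling_max_freq").write_text(str(max_khz))
```

Since the knob lives under `cpuN/cpufreq/`, two pods time-sharing core N necessarily see the same frequency range — hence the partitioning requirement.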

That works well for Guaranteed pods with dedicated CPUs, but the gap I'm trying to address is specifically around burstable and best-effort pods that share cores. In a typical cluster, non-critical background pods (monitoring, log shipping, housekeeping) run as burstable workloads without exclusive CPU allocation. There's currently no way to tell the scheduler or the kubelet "this pod should run at a lower CPU frequency" without dedicating physical cores to it, which defeats the density benefit of shared-core scheduling.

The underlying question is whether there's any existing work or proposal around per-pod frequency control that operates at the cgroup or task level rather than the physical core level - something that would work for pods sharing cores without requiring CPU partitioning. Has anyone in the community explored this direction, or is there any ongoing work I might have missed?

Thanks,
Bharath

Eric Tune

Mar 27, 2026, 3:05:56 PM
to Thiruveedula, Bharath, Feruz, John Belamaric, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
AIUI, setting the CPU control registers to change the P-state does not take very long. So, you could, I suppose, change the P-state when you context switch between a "critical" task and a "non-critical" task. However, it takes some time (on the order of 1 ms) for the processor to actually ramp the operating voltage up or down to the target for the new P-state, and this happens asynchronously.
Meanwhile, there can be several context switches between critical and non-critical threads, so the P-states wouldn't necessarily line up with the task that is running.
If you block on the P-state actuation completing, then you introduce unwanted delays.
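A back-of-the-envelope model of the mismatch Eric describes (the ~1 ms ramp figure is from his message; the scheduling-slice lengths below are assumptions for illustration):

```python
def stale_freq_fraction(slice_us: float, ramp_us: float) -> float:
    """Fraction of a scheduling slice spent before the frequency settles.

    If the voltage/frequency ramp takes ramp_us but the task runs only
    slice_us before the next context switch, the core spends
    min(ramp_us, slice_us) of that slice at the previous (stale)
    operating point.
    """
    return min(ramp_us, slice_us) / slice_us

# With ~1 ms ramps and sub-millisecond slices, the P-state never
# catches up with the currently running task:
assert stale_freq_fraction(slice_us=500, ramp_us=1000) == 1.0
# Even with generous 10 ms slices, 10% of each slice still runs
# at the stale frequency:
assert stale_freq_fraction(slice_us=10_000, ramp_us=1000) == 0.1
```

This is why per-context-switch P-state flipping tends to either miss its target or (if you wait for actuation) add latency.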


Sergey Kanzhelev

Mar 27, 2026, 3:43:49 PM
to Eric Tune, Thiruveedula, Bharath, Feruz, John Belamaric, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
> The underlying question is whether there's any existing work or proposal around per-pod frequency control that operates at the cgroup or task level rather than the physical core level - something that would work for pods sharing cores without requiring CPU partitioning. Has anyone in the community explored this direction, or is there any ongoing work I might have missed?

Most of the work I saw is based on dynamically adjusting cpusets to align with the "demand". As Eric mentioned above, there will be challenges in trying to time-partition CPU performance.

/Sergey


John Belamaric

Mar 28, 2026, 5:34:18 AM
to Thiruveedula, Bharath, Feruz, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
On Fri, Mar 27, 2026 at 6:49 PM Thiruveedula, Bharath <bharath.th...@verizon.com> wrote:
Thank you both for the pointers — I've spent some time looking at both the DRA CPU Driver and the NRI Balloons plugin in detail.

The DRA CPU Driver is focused on topology-aware CPU allocation and placement, which is valuable but doesn't address CPU frequency or P-state control. There are no frequency-related attributes or controls, and I don't see it on the roadmap for 0.2.0 either. Correct me if I'm wrong.

DRA allows users to attach configuration to the device request and/or device class. This provides the API mechanism to send the per-Pod intent through to the driver. While there is nothing specific on the roadmap, we can take a look at your requirements and I suspect it may be fairly simple to add. Is this a vendor-specific configuration or does the kernel provide this level of control?

Francesco Romani

Mar 30, 2026, 3:25:25 AM
to Thiruveedula, Bharath, Feruz, John Belamaric, sig-...@kubernetes.io, Myron Eugene (Gene) Bagwell, Sujith Cherian
On Fri, Mar 27, 2026 at 7:50 PM 'Thiruveedula, Bharath' via sig-node
<sig-...@kubernetes.io> wrote:
>
> Thank you both for the pointers — I've spent some time looking at both the DRA CPU Driver and the NRI Balloons plugin in detail.
>
> The DRA CPU Driver is focused on topology-aware CPU allocation and placement, which is valuable but doesn't address CPU frequency or P-state control. There are no frequency-related attributes or controls, and I don't see it on the roadmap for 0.2.0 either. Correct me if I'm wrong.

Hi, this is an interesting feature to think about, thanks for sharing.
Echoing John's answer, there's nothing on the roadmap of the DRA CPU
Driver in this area, mostly because we haven't heard about the use case
yet.
We plan to distinguish and manage different core types
(efficiency/performance) in
https://github.com/kubernetes-sigs/dra-driver-cpu/issues/11 , but that
is all regarding power-related areas so far.
I think a DRA driver is the right place to enable power management,
and I see custom (or tunable) allocation logic as a building block, in
order to enable dynamic partitioning by core frequency, for example.

[...]
> The underlying question is whether there's any existing work or proposal around per-pod frequency control that operates at the cgroup or task level rather than the physical core level - something that would work for pods sharing cores without requiring CPU partitioning. Has anyone in the community explored this direction, or is there any ongoing work I might have missed?

There is the Kubernetes Power Manager project; I don't know if it is
still active, and I don't recall whether it covered this angle. Offhand,
I believe this form of power management would require cooperation with
the cgroup manager, so the kubelet.

--
Francesco Romani -- software engineer @ Red Hat
https://github.com/ffromani

Thiruveedula, Bharath

Apr 7, 2026, 12:23:00 AM
to Francesco Romani, John Belamaric, sig-...@kubernetes.io, Feruz, skanz...@google.com, et...@google.com, Myron Eugene (Gene) Bagwell, Sujith Cherian, Jamie Dietsch
Hi everyone,

Thank you for the detailed responses.

Eric and Sergey, I agree that setting P-states for burstable or best-effort workloads is problematic, as a task might yield before the change takes effect, potentially affecting the next task.

While userspace controls may not be viable, I have been exploring kernel-level features. I specifically came across the uclamp feature (https://docs.kernel.org/scheduler/sched-util-clamp.html), which might offer a different approach to this problem.

Uclamp allows setting cpu.uclamp.max and cpu.uclamp.min per cgroup in cgroup v2. Instead of directly flipping P-states, it caps the utilization signal seen by the schedutil governor, which in turn influences frequency selection. This provides a declarative, static hint for the cgroup, avoiding the need for per-context-switch changes. Since the kernel aggregates uclamp values across the runnable tasks on a core, the highest cap among active tasks takes precedence.
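A sketch of what this could look like from a node agent's perspective (hypothetical helper names; per the kernel cgroup v2 docs, cpu.uclamp.max takes a percentage string such as "50.00" or the literal "max", and the max-aggregation behavior is as described above):

```python
from pathlib import Path
from typing import Optional

# Kernel-internal utilization scale (SCHED_CAPACITY_SCALE).
UCLAMP_FULL = 1024

def set_cgroup_uclamp_max(cgroup_dir: str, percent: float) -> None:
    """Cap the utilization signal for one pod cgroup.

    Writes cpu.uclamp.max as a percentage of max capacity; write "max"
    instead to remove the cap. Requires a cgroup v2 hierarchy and a
    kernel built with UTIL_CLAMP.
    """
    (Path(cgroup_dir) / "cpu.uclamp.max").write_text(f"{percent:.2f}")

def effective_cap(runnable_task_caps: list[Optional[int]]) -> int:
    """Model the kernel's max-aggregation across runnable tasks on a core.

    An uncapped task (None -> full scale) lifts the whole core, so a
    capped background pod cannot drag down a co-scheduled critical pod —
    but it also means the cap only bites while no uncapped task is runnable.
    """
    caps = [UCLAMP_FULL if c is None else c for c in runnable_task_caps]
    return max(caps, default=UCLAMP_FULL)
```

The `effective_cap` model captures both the appeal (no per-switch actuation, safe for co-located critical work) and the limit (the cap on a shared core is only effective during windows when only capped tasks are runnable).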

On paper, this seems like it could address the per-pod frequency differentiation use case, though I am uncertain how reliably it handles Burstable or BestEffort workloads. Has anyone experimented with uclamp in this context?

Additionally, for those using the NRI Balloons policy, have you observed any measurable reduction in power consumption that you can share?


Thanks,
Bharath


