k8s cpumanager smt awareness and noisy neighbor prevention


Francesco Romani

Apr 5, 2021, 1:26:43 PM
to kubernete...@googlegroups.com, Lehtonen, Markus, Kanevskiy, Alexander, Kevin Klues, gergely...@nokia.com, Poyhonen, Petteri (Nokia - FI/Espoo), c.zuk...@samsung.com, k.wia...@samsung.com
Hi sig-node!


I've been researching how to make Kubernetes an even better
platform for running a certain class of very low-latency applications:
(near-)realtime or packet-processing-intensive (DPDK) applications.

On SMT platforms (e.g. with Hyper-Threading enabled), due to how the current
`static` cpumanager policy allocates cores, there is no reliable way to
guarantee that a container gets full physical cores.

If different containers share the same physical core, we can get
unpredictable latency: a classic case of noisy neighbors.
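
To make the problem concrete, here is a rough, purely illustrative sketch (in Go, using a hypothetical 4-core / 8-thread SMT-2 layout; this is not actual cpumanager code) of the invariant an SMT-aware policy would have to enforce: an allocation is only safe if it contains either all or none of the hardware thread siblings of every physical core it touches.

package main

import "fmt"

// siblings maps each logical CPU to the hardware threads sharing its
// physical core. On Linux this comes from
// /sys/devices/system/cpu/cpuN/topology/thread_siblings_list; the layout
// below is made up: CPU i and CPU i+4 are siblings.
var siblings = map[int][]int{
	0: {0, 4}, 1: {1, 5}, 2: {2, 6}, 3: {3, 7},
	4: {0, 4}, 5: {1, 5}, 6: {2, 6}, 7: {3, 7},
}

// coversFullCores reports whether every physical core touched by the
// allocation is owned entirely by it, i.e. no sibling thread is left
// free for another container to land on.
func coversFullCores(alloc map[int]bool) bool {
	for cpu := range alloc {
		for _, sib := range siblings[cpu] {
			if !alloc[sib] {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println(coversFullCores(map[int]bool{0: true, 4: true})) // true: one full physical core
	fmt.Println(coversFullCores(map[int]bool{0: true, 1: true})) // false: two half cores, noisy neighbors possible
}

Nothing in the current `static` policy enforces this invariant; closing that gap is exactly what the new policies are about.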


Careful ordering of container creation and careful sizing of container
resource requests can work around this behaviour, but that is bad UX and
covers only the simplest cases. To really provide these guarantees, I
think we should add new cpumanager policies.

I believe it is better to add new policies, since they are a very natural
extension point: users can opt in, while current users of the 'static'
policy remain completely unaffected.


The concepts here are not new: OpenStack has similar capabilities, which
I believe should also find their way into the cpumanager.

https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/virt-driver-cpu-thread-pinning.html


I've created a small introductory slide deck to summarize the issue I'm
addressing and to show what the proposed new policies would look like:
https://github.com/fromanirh/fromanirh/blob/main/docs/presentations/k8s-cpumanager-smtawareness/smtawareness-intro.pdf

I'll be presenting the full version of the above at the sig-node meeting
on April 13 to describe the entire proposal, and I'll have a draft KEP
ready by early next week (well before the meeting).


I'm sending this mail to gather some early feedback, CC'ing a few people
who, given past sig-node activities, may be interested.


Thanks for any feedback and best regards,


--
Francesco Romani
SWE @ Red Hat
github: @fromanirh

Kevin Klues

Apr 6, 2021, 5:40:08 AM
to Francesco Romani, kubernetes-sig-node, Lehtonen, Markus, Kanevskiy, Alexander, gergely...@nokia.com, Poyhonen, Petteri (Nokia - FI/Espoo), c.zuk...@samsung.com, k.wia...@samsung.com
I'm wondering if it's time to actually recast the CPUManager as a device plugin so that people can customize it with whatever policy they want. I haven't thought much about the details of how this would work exactly, but the CPUManager was written before the device plugin interface existed, and it actually performs a lot of the same functionality that device plugins do. If we decide to go this route, then the existing static CPUManager could remain available as a built-in policy, and a new "external" policy could be added to direct the CPUManager to delegate policy decisions to the plugin.
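
Very roughly, and purely as a strawman (none of these types exist in the kubelet today; all the names below are made up), the kubelet side of such an "external" policy could be a thin shim that forwards allocation decisions to the plugin:

// Strawman only: a hypothetical interface the CPUManager could call into
// when the "external" policy is selected, e.g. served over a local gRPC
// socket the same way device plugins are today.
package external

import "context"

// CPUAllocationRequest describes one container that needs exclusive CPUs.
type CPUAllocationRequest struct {
	PodUID        string
	ContainerName string
	MilliCPUs     int64
	AvailableCPUs []int // logical CPU IDs the kubelet still considers free
}

// CPUAllocationResult is the plugin's decision.
type CPUAllocationResult struct {
	CPUs []int // logical CPU IDs to pin the container to
}

// PolicyPlugin is what an out-of-tree policy would implement.
type PolicyPlugin interface {
	Allocate(ctx context.Context, req *CPUAllocationRequest) (*CPUAllocationResult, error)
	RemoveContainer(ctx context.Context, podUID, containerName string) error
}

The built-in static policy would keep doing exactly what it does today; the shim would only be exercised when the external policy is configured.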

Kevin
--
~Kevin

aluk...@redhat.com

Apr 6, 2021, 7:13:11 AM
to kubernetes-sig-node
I like the idea of external policies, but it looks like a long-term solution and will require a redesign of the CPU manager.

Francesco Romani

Apr 6, 2021, 7:24:42 AM
to Kevin Klues, kubernetes-sig-node, Lehtonen, Markus, Kanevskiy, Alexander, gergely...@nokia.com, Poyhonen, Petteri (Nokia - FI/Espoo), c.zuk...@samsung.com, k.wia...@samsung.com


Hey Kevin

I would very much like this approach, and I think it would also play nice with the longer-term future plans we were discussing in sig-node in the last few months, like https://github.com/container-orchestrated-devices/resource-management-improvements-wg/issues/1

I'm aware of the https://github.com/nokia/CPU-Pooler project, which seems to be very close to the goal and could be a very good basis for this work. I'm not sure what the path forward would look like, however. For example, where should this device plugin live? Should it be part of the Kubernetes core?

I'll mention this option in my session next week (April 13, I cannot attend today) so we can keep the discussion open on this subject.

Maciej Iwanowski

Apr 6, 2021, 7:41:08 AM
to Kevin Klues, Francesco Romani, kubernetes-sig-node, Lehtonen, Markus, Kanevskiy, Alexander, gergely...@nokia.com, Poyhonen, Petteri (Nokia - FI/Espoo), c.zuk...@samsung.com, k.wia...@samsung.com



It looks like the way to go to me. It would allow us to create more robust policies that could control cache allocation or big.LITTLE cores. I would worry a bit about other device plugins that may be affected by CPU allocation decisions (e.g. which NUMA node your GPU is attached to).

--
Pozdrawiam/Regards,
Maciej Iwanowski

Csatari, Gergely (Nokia - FI/Espoo)

Apr 6, 2021, 7:45:23 AM
to Francesco Romani, Kevin Klues, Kale, Levente (Nokia - HU/Budapest), kubernetes-sig-node, Lehtonen, Markus, Kanevskiy, Alexander, Poyhonen, Petteri (Nokia - FI/Espoo), c.zuk...@samsung.com, k.wia...@samsung.com

Hi,

 

Let me add Levente, the author of CPU Pooler, to the discussion.

 

Br,

Gerg0

Kevin Klues

Apr 6, 2021, 8:24:35 AM
to Csatari, Gergely (Nokia - FI/Espoo), Francesco Romani, Kale, Levente (Nokia - HU/Budapest), kubernetes-sig-node, Lehtonen, Markus, Kanevskiy, Alexander, Poyhonen, Petteri (Nokia - FI/Espoo), c.zuk...@samsung.com, k.wia...@samsung.com
@alukiano
This is definitely a longer term solution, but I also don't want to spend time adding a bunch of new built-in policies if this is the long-term strategy.
That said, having worked quite a bit with both the CPUManager and the DeviceManager code, I'm fairly certain it wouldn't take much effort to do this.
The bigger question for me is whether this is the direction we actually want to go, or do people prefer having built-in policies instead.

@fromani
I think we can address the question of "where does the plugin live" at a later stage. As I mentioned before, I think we would leave the `static` policy as a built-in policy, and only allow the use of a plugin if the special `external` policy was specified. In that sense, there would be no "canonical" plugin that needed to be part of the k8s core. The `static` policy would still be the default, built-in one, and others could maintain their own plugins elsewhere.

@maciej.iwanowski
So long as one of these new plugins shares its topology information via the Device message and implements the GetPreferredAllocation() call of the device plugin interface, alignment by NUMA node shouldn't be a concern: https://github.com/kubernetes/enhancements/pull/1121
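
To make that a bit more concrete, here is a partial sketch (only the topology-related pieces; registration, ListAndWatch and Allocate are omitted, the sibling/NUMA data would come from sysfs, and it assumes SMT-2, i.e. exactly one distinct sibling per CPU) of a hypothetical CPU plugin that advertises each logical CPU as a Device with NUMA info and prefers handing out full sibling pairs:

package cpuplugin

import (
	"context"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// cpuPlugin is a fragment of a hypothetical device plugin exposing logical
// CPUs as devices.
type cpuPlugin struct {
	numaNode map[string]int64  // device ID -> NUMA node (read from sysfs in a real plugin)
	sibling  map[string]string // device ID -> device ID of its SMT sibling
}

// devices builds the Device list advertised via ListAndWatch, attaching
// NUMA topology so the TopologyManager can align CPUs with other devices.
func (p *cpuPlugin) devices() []*pluginapi.Device {
	var devs []*pluginapi.Device
	for id, node := range p.numaNode {
		devs = append(devs, &pluginapi.Device{
			ID:       id,
			Health:   pluginapi.Healthy,
			Topology: &pluginapi.TopologyInfo{Nodes: []*pluginapi.NUMANode{{ID: node}}},
		})
	}
	return devs
}

// GetPreferredAllocation steers the kubelet towards full physical cores:
// whenever a CPU is picked, its SMT sibling is picked with it, so no core
// ends up split between containers.
func (p *cpuPlugin) GetPreferredAllocation(ctx context.Context, req *pluginapi.PreferredAllocationRequest) (*pluginapi.PreferredAllocationResponse, error) {
	resp := &pluginapi.PreferredAllocationResponse{}
	for _, cr := range req.ContainerRequests {
		available := map[string]bool{}
		for _, id := range cr.AvailableDeviceIDs {
			available[id] = true
		}
		picked := append([]string{}, cr.MustIncludeDeviceIDs...)
		for _, id := range cr.MustIncludeDeviceIDs {
			delete(available, id)
		}
		for _, id := range cr.AvailableDeviceIDs {
			sib := p.sibling[id]
			if available[id] && available[sib] && int32(len(picked))+2 <= cr.AllocationSize {
				picked = append(picked, id, sib)
				delete(available, id)
				delete(available, sib)
			}
		}
		// A real plugin would also top up with single threads when the
		// requested size is odd or no full core is left; omitted here.
		resp.ContainerResponses = append(resp.ContainerResponses,
			&pluginapi.ContainerPreferredAllocationResponse{DeviceIDs: picked})
	}
	return resp, nil
}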

Kevin
--
~Kevin

Alexey Perevalov

Apr 6, 2021, 8:38:09 AM
to Csatari, Gergely (Nokia - FI/Espoo), Francesco Romani, Kevin Klues, Kale, Levente (Nokia - HU/Budapest), kubernetes-sig-node, Lehtonen, Markus, Kanevskiy, Alexander, Poyhonen, Petteri (Nokia - FI/Espoo), c.zuk...@samsung.com, k.wia...@samsung.com
I'm also not so happy with extending the built-in policies, since from a maintainability point of view it's not the best solution. It would be better to introduce a new interface and support it in the existing open-source solutions as well as downstream.

BR
Alexey


Kale, Levente (Nokia - HU/Budapest)

Apr 6, 2021, 9:02:30 AM
to Kevin Klues, Csatari, Gergely (Nokia - FI/Espoo), Francesco Romani, kubernetes-sig-node, Lehtonen, Markus, Kanevskiy, Alexander, Poyhonen, Petteri (Nokia - FI/Espoo), c.zuk...@samsung.com, k.wia...@samsung.com

Hi,

 

Thanks for the add!

I will try to attend next Monday's call too; hopefully I can provide some insight relevant to the discussion, considering we have been using both the DP approach and the two mentioned policies in the field for some time already.

 

If you don’t mind, a couple of points from me related to what has been discussed so far:

1. External API vs internal API

I don’t think anyone cares about who implements the policy; the important thing is the API via which users ask for the resources.

I.e. the resources API. The biggest pain point for my existing “customers” is that they need to use a different syntax depending on whether they want to run on a CPU Manager node or a CPU Pooler node (spec.resources.cpu vs spec.resources.POOL_NAME).

If the user-facing API could be the same for both built-in and outsourced policies (either just .cpu, or just the pool-like nomenclature with .cpu becoming a reserved pool type), that would be fantastic.

Note that outsourcing policies is not a new idea; it was previously rejected by the community.

 

2. NUMA alignment

I can confirm this is indeed not an issue; Pooler does it 😊 (it doesn’t implement the preferred allocation API atm, but it reports the socket info during device discovery, and since users usually ask for these resources together with SR-IOV VFs, the default alignment done by the topology manager and the DM is good enough).

 

3. Shortcomings

While in theory outsourcing CPU management policies in such a way is indeed not a big issue, there are some minor pitfalls which need to be addressed.

A: automatic reconciliation currently forces us to entirely disable the CPU Manager on these nodes, so that’s something which would probably need to be looked at if the community goes this way.

B: cpuset cgroup creation: I’m not sure who would be responsible for creating the cpuset (and cpu,cpuacct) cgroup(s) in this design.

Regardless of whether it is the kubelet, or whether this is also outsourced to the plugin, container/Pod information now needs to reach the DP in the Allocate() call, which is currently not a thing in the DPAPI.

In the current design I’m retroactively rewriting the cgroups, which causes all kinds of timing issues I need to deal with.

In an official design it would be good if either the plugin could create the cgroup for the containers, or it could tell the kubelet what to create it with (maybe the mount path in the Allocate response could be "abused" for this without needing a DPAPI change?).
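
Just to illustrate what I mean by "abusing" the response (a sketch only; the annotation key is made up, and something else would still have to translate it into an actual cpuset cgroup), these are about the only channels the device plugin API gives us back towards the kubelet today:

package main

import (
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// buildAllocateResponse shows the per-container fields a device plugin can
// return from Allocate(): env vars, annotations, mounts and device nodes.
// None of them makes the kubelet write the cpuset cgroup, so a plugin has
// to smuggle its decision through one of them and fix the cgroup up
// afterwards (or hope some other component honours the hint).
func buildAllocateResponse(cpus string) *pluginapi.ContainerAllocateResponse {
	return &pluginapi.ContainerAllocateResponse{
		Envs: map[string]string{
			// Visible inside the container, but nothing pins anything with it.
			"EXCLUSIVE_CPUS": cpus,
		},
		Annotations: map[string]string{
			// Hypothetical key: only useful if the container runtime is
			// taught to turn it into the container's cpuset.
			"example.com/exclusive-cpus": cpus,
		},
		// Mounts/Devices could likewise be "abused" to expose the decision
		// as a file, but the kubelet does not act on those either.
	}
}

func main() {
	fmt.Println(buildAllocateResponse("2,6").Annotations["example.com/exclusive-cpus"])
}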

 

That’s all; I don’t want to hijack the thread, just thought I’d share some field experience with you! See you on Monday 😊

 

Br,

Levent

Francesco Romani

Apr 6, 2021, 9:08:04 AM
to Kale, Levente (Nokia - HU/Budapest), Kevin Klues, Csatari, Gergely (Nokia - FI/Espoo), kubernetes-sig-node, Lehtonen, Markus, Kanevskiy, Alexander, Poyhonen, Petteri (Nokia - FI/Espoo), c.zuk...@samsung.com, k.wia...@samsung.com


Hey! Thanks for chiming in, lots of great points I'll need to think about. Just a quick note: the sig-node meeting is every TUESDAY (not Monday): https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.d9zp2j5jvkke

I'll make sure to book a time slot in the coming days so we can discuss.

fro...@redhat.com

Apr 7, 2021, 6:00:07 AM
to kubernetes-sig-node
Thanks everyone for the lively discussion so far.
It seems a very important topic will be how we extend the cpumanager in general (new policies vs refactoring towards a new, more extensible API). I was wondering, however, whether there are any questions or comments regarding the behaviour of the specific policies I'm proposing.

Thanks!

Tim Xu

Apr 7, 2021, 10:10:08 PM
to kubernetes-sig-node
In our production K8s version, we added a few new CPU manager policies for colocation and isolation. The in-tree implementation makes it difficult for us to extend. An external policy design would be a very useful idea. I'd like to be involved in the discussion.

Krisztian Litkey

Apr 13, 2021, 2:58:21 PM
to kubernetes-sig-node
  Hi,

Since there is a lively discussion about making Kubernetes CPU resource management more flexible, would there be interest among this crowd in discussing resource management in a broader sense, beyond just the CPU aspects?

Based on this thread, it looks to me like most folks lean towards externally plugged CPU allocation policies being a necessity, and I'm not questioning that. I mean a broader/holistic view of where people would like to see resource management evolving, potentially also touching on some more controversial/painful subjects. For instance,

- Is turning the CPU manager into more of a device plugin going to solve the long-term problems folks are having?
- Is CPU allocation so special that it warrants/requires externally plugged policies while other resources or aspects of resource management do not? Or is this just one symptom of a more general need? Will we find the next similar problem right around the corner once CPU policies can be plugged in externally (for instance, folks wanting to override the built-in topology manager policy)?
- If externally plugged algorithms are a more generic need for resource management, should the current architecture (of siloed allocations in isolated resource domains, later recombined by the topology manager) be imposed on the external policies?
- Is kubelet the only/best place to plug in resource allocation algorithms?
- Any other topic people feel there is the right audience gathered to discuss...

Many of these topics have been discussed earlier to some extent. However, there never seemed to be enough people around with relevant experience and use cases for the discussions to really gain traction and they always petered out. Maybe this time is different.

  Cheers,
    Krisztian

Marlow Warnicke

Aug 10, 2021, 12:50:25 PM
to kubernetes-sig-node
Has there been any movement or further discussion on any of this? I would like to start making progress on how we handle resources generally going forward. Our assumptions about CPUs within the CPU management plane seem limited, and it would be useful to have a more flexible model to work from. For instance, power management must be done differently from CPU management, but it seems like there should be a simple way to have the two linked. Another example is the case where a particular container needs SOME dedicated CPUs, while others could be shared with the pod.

Thanks,
--Marlow
