Hi all, I've been following the CPU pinning discussion on GitHub, and it wasn't quite clear to me whether what we're trying to do is possible at this point. We have nodes with multiple GPUs and NICs where not only NUMA awareness but also the PCIe topology matters. Since the topology-aware scheduler is not quite ready, we wrote our own scheduler that, using information provided by a pod, attempts to place it onto specific physical cores, NIC interfaces, and GPU devices.

This works reasonably well, but there is still an issue with other kube pods in both the burstable and besteffort cgroups getting moved onto the cores used by the data plane application. This was unexpected at first, since we're using isolcpus, and the cpusets documentation seems to hint that isolated CPUs will *not* take part in the load balancing algorithm and thus should not have workloads placed onto them. However, there doesn't seem to be a transparent way to detect whether that's actually true, since all of the kube cpusets have cpuset.cpus set to every CPU in the system. As a result, things like kube-proxy, calico, prometheus, etc., end up on those isolated cores.

Kubelet has some flags to control certain cgroup properties, but in my testing there doesn't seem to be a way to force everything in, say, the kube-system namespace onto a non-isolated cpuset. Furthermore, I believe cpusets are inherited down the hierarchy, so setting the runtime's (Docker's) cpuset to the non-isolated cores would mean that a data plane application launched under it would also inherit that restriction.
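To make the cpuset.cpus observation above concrete, here is roughly how I've been checking it on a node. It's just a sketch: it assumes cgroup v1 with the cgroupfs driver, so the pod hierarchy sits at /sys/fs/cgroup/cpuset/kubepods (it would be kubepods.slice or similar under the systemd driver), and it reads the isolated-core list from /sys/devices/system/cpu/isolated.

```python
#!/usr/bin/env python3
"""Quick check: what does cpuset.cpus look like for every kube pod cgroup?

A minimal sketch, assuming cgroup v1 with the cgroupfs driver; adjust the
paths for your cgroup driver. Run directly on the node.
"""
import os

CPUSET_ROOT = "/sys/fs/cgroup/cpuset/kubepods"  # kubepods.slice under systemd driver
ISOLATED = "/sys/devices/system/cpu/isolated"   # populated by isolcpus=


def read(path):
    with open(path) as f:
        return f.read().strip()


def main():
    isolated = read(ISOLATED)
    print(f"isolcpus: {isolated or '(none)'}")
    for dirpath, _dirnames, filenames in os.walk(CPUSET_ROOT):
        if "cpuset.cpus" in filenames:
            cpus = read(os.path.join(dirpath, "cpuset.cpus"))
            rel = os.path.relpath(dirpath, CPUSET_ROOT)
            print(f"{rel:60s} cpuset.cpus={cpus}")


if __name__ == "__main__":
    main()
```

On our nodes, every pod cgroup this prints shows the full CPU range, isolated cores included.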
I looked into both the K8s CPU Manager and the Intel CMK, but neither seems to allow you to choose specific CPU cores, which is what I'm after. I also tried getting tricky and moving the non-real-time tasks on the host into a cgroup comprising the non-isolated cores, but of course, that's unmanageable.
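For concreteness, the kind of manual shuffling I mean is roughly the following: carve out a "housekeeping" cpuset on the non-isolated cores and push non-real-time host tasks into it. This is only a sketch of the experiment, not a recommendation; the cgroup name, the core list, and the NUMA node are made up, and it has to be re-run as tasks come and go, which is exactly why it's unmanageable.

```python
#!/usr/bin/env python3
"""Corral non-real-time host tasks into a 'housekeeping' cpuset (cgroup v1).

Illustrative only: the cgroup name, core list, and NUMA node are assumptions.
"""
import os

HOUSEKEEPING = "/sys/fs/cgroup/cpuset/housekeeping"  # hypothetical cgroup
NON_ISOLATED_CPUS = "0-3"                            # hypothetical core list


def write(path, value):
    with open(path, "w") as f:
        f.write(value)


def is_realtime(pid):
    # SCHED_FIFO/SCHED_RR show up as policy values 1/2; policy is the 41st
    # field of /proc/<pid>/stat (counting from 1, after the "(comm)" field).
    try:
        with open(f"/proc/{pid}/stat") as f:
            fields = f.read().rsplit(")", 1)[1].split()
        return int(fields[38]) in (1, 2)
    except (OSError, IndexError, ValueError):
        return True  # when in doubt, leave the task alone


def main():
    os.makedirs(HOUSEKEEPING, exist_ok=True)
    # Both cpus and mems must be set before a v1 cpuset can accept tasks.
    write(os.path.join(HOUSEKEEPING, "cpuset.cpus"), NON_ISOLATED_CPUS)
    write(os.path.join(HOUSEKEEPING, "cpuset.mems"), "0")
    for pid in (p for p in os.listdir("/proc") if p.isdigit()):
        if not is_realtime(int(pid)):
            try:
                write(os.path.join(HOUSEKEEPING, "tasks"), pid)
            except OSError:
                pass  # bound kernel threads and exited tasks can't be moved


if __name__ == "__main__":
    main()
```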
Am I overthinking this? Is there any way that I can have pods in a certain class (kube-system) use some pre-defined shared group of cores, while data plane pods are given the entire core set of the system? In the latter case the data plane pods already know their physical mapping thanks to the custom scheduler, so they will only use the subset of isolated cores they've been assigned. Thanks!