Hi all, I've been following the CPU pinning discussion on GitHub, and it wasn't quite clear to me whether what we're trying to do is possible at this point. We have nodes with multiple GPUs and NICs where not only NUMA awareness but also the PCIe topology matters. Since the topology-aware scheduler is not quite ready, we wrote our own scheduler that, using information provided by a pod, attempts to place it onto specific physical cores, NIC interfaces, and GPU devices.

This works reasonably well, but there is still an issue with other kube pods in both the burstable and besteffort cgroups getting moved onto the cores used by the data plane application. This was unexpected at first, since we're using isolcpus, and the cpusets documentation seems to hint that isolated CPUs will *not* take part in the load balancing algorithm and thus should not have workloads placed onto them. However, there doesn't seem to be a transparent way to detect whether that's actually true, since all of the kube cpusets have cpuset.cpus set to every CPU in the system. As a result, things like kube-proxy, calico, prometheus, etc., end up on those isolated cores.

Kubelet has some flags to control certain cgroup properties, but in my testing there doesn't seem to be a way to force everything in, say, the kube-system namespace onto a non-isolated cpuset. Furthermore, I believe cpusets are inherited down the hierarchy, so setting the runtime's (Docker's) cpuset to the non-isolated cores would mean that a data plane application launched under it would also inherit that restriction.
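To make the cpuset.cpus observation above concrete, here is roughly how I've been checking it on a node. It's just a sketch: it assumes cgroup v1 with the cgroupfs driver, so the pod hierarchy sits at /sys/fs/cgroup/cpuset/kubepods (it would be kubepods.slice or similar under the systemd driver), and it reads the isolated-core list from /sys/devices/system/cpu/isolated.

```python
#!/usr/bin/env python3
"""Quick check: what does cpuset.cpus look like for every kube pod cgroup?

A minimal sketch, assuming cgroup v1 with the cgroupfs driver; adjust the
paths for your cgroup driver. Run directly on the node.
"""
import os

CPUSET_ROOT = "/sys/fs/cgroup/cpuset/kubepods"  # kubepods.slice under systemd driver
ISOLATED = "/sys/devices/system/cpu/isolated"   # populated by isolcpus=


def read(path):
    with open(path) as f:
        return f.read().strip()


def main():
    isolated = read(ISOLATED)
    print(f"isolcpus: {isolated or '(none)'}")
    for dirpath, _dirnames, filenames in os.walk(CPUSET_ROOT):
        if "cpuset.cpus" in filenames:
            cpus = read(os.path.join(dirpath, "cpuset.cpus"))
            rel = os.path.relpath(dirpath, CPUSET_ROOT)
            print(f"{rel:60s} cpuset.cpus={cpus}")


if __name__ == "__main__":
    main()
```

On our nodes, every pod cgroup this prints shows the full CPU range, isolated cores included.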
I looked into both the K8s CPU Manager and the Intel CMK, but neither seems to allow you to choose specific CPU cores, which is what I'm after. I also tried getting tricky and moving the non-real-time tasks on the host into a cgroup comprising the non-isolated cores, but of course, that's unmanageable.
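For concreteness, the kind of manual shuffling I mean is roughly the following: carve out a "housekeeping" cpuset on the non-isolated cores and push non-real-time host tasks into it. This is only a sketch of the experiment, not a recommendation; the cgroup name, the core list, and the NUMA node are made up, and it has to be re-run as tasks come and go, which is exactly why it's unmanageable.

```python
#!/usr/bin/env python3
"""Corral non-real-time host tasks into a 'housekeeping' cpuset (cgroup v1).

Illustrative only: the cgroup name, core list, and NUMA node are assumptions.
"""
import os

HOUSEKEEPING = "/sys/fs/cgroup/cpuset/housekeeping"  # hypothetical cgroup
NON_ISOLATED_CPUS = "0-3"                            # hypothetical core list


def write(path, value):
    with open(path, "w") as f:
        f.write(value)


def is_realtime(pid):
    # SCHED_FIFO/SCHED_RR show up as policy values 1/2; policy is the 41st
    # field of /proc/<pid>/stat (counting from 1, after the "(comm)" field).
    try:
        with open(f"/proc/{pid}/stat") as f:
            fields = f.read().rsplit(")", 1)[1].split()
        return int(fields[38]) in (1, 2)
    except (OSError, IndexError, ValueError):
        return True  # when in doubt, leave the task alone


def main():
    os.makedirs(HOUSEKEEPING, exist_ok=True)
    # Both cpus and mems must be set before a v1 cpuset can accept tasks.
    write(os.path.join(HOUSEKEEPING, "cpuset.cpus"), NON_ISOLATED_CPUS)
    write(os.path.join(HOUSEKEEPING, "cpuset.mems"), "0")
    for pid in (p for p in os.listdir("/proc") if p.isdigit()):
        if not is_realtime(int(pid)):
            try:
                write(os.path.join(HOUSEKEEPING, "tasks"), pid)
            except OSError:
                pass  # bound kernel threads and exited tasks can't be moved


if __name__ == "__main__":
    main()
```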
Am I overthinking this? Is there any way that I can have pods in a certain class (kube-system) use some pre-defined shared group of cores, while data plane pods are given the entire core set of the system? In the latter case the data plane pods already know their physical mapping thanks to the custom scheduler, so they will only use the subset of isolated cores they've been assigned. Thanks!