Isolating low latency application on CPU-0?


Michael Mattoss

Mar 4, 2015, 9:38:22 AM3/4/15
to mechanica...@googlegroups.com
Hi guys,

I'm in the process of setting up a new dual-socket server for a low latency workload.
The application will run exclusively on one CPU and everything else (i.e. OS, non-critical processes) will run on the other CPU to avoid cache pollution.
I was wondering if it makes any difference which of the two CPUs is chosen for the workload.
Theoretically, there should be no difference, but I was wondering if there is some low-level stuff (e.g. core OS code, system management interrupt handlers) that is statically allocated to CPU-0, since every system has at least one CPU.
Of course, if that's the case then CPU-1 is the better choice.

Any thoughts/suggestions?

Thanks,
Michael

Matt Godbolt

Mar 4, 2015, 9:50:39 AM3/4/15
to mechanica...@googlegroups.com
Though it depends on your setup, IRQs are probably distributed evenly on both CPUs. Check /proc/interrupts and you'll see where they're going. (I run "watch --diff cat /proc/interrupts" to see this somewhat graphically).

If you're running irqbalance, it is in theory trying to steer IRQs to keep them balanced across cores. If you want to manually steer IRQs away from your chosen core, you'll need to either tell irqbalance about this ('man irqbalance' and look for IRQBALANCE_BANNED_CPUS), or else stop it entirely and move the interrupts yourself. For the latter take a look at /proc/irq/<NUM>/smp_affinity to get the mask of CPUs that IRQ can be delivered to.
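As a rough sketch (the IRQ number 42 and the masks below are placeholders for whatever your box actually has; the irqbalance config lives in /etc/sysconfig/irqbalance on RHEL-ish distros and /etc/default/irqbalance on Debian-ish ones):

# grep BANNED /etc/sysconfig/irqbalance      # IRQBALANCE_BANNED_CPUS is a hex mask of CPUs irqbalance must avoid
# cat /proc/irq/42/smp_affinity              # hex mask of CPUs IRQ 42 may be delivered to
# echo 2 > /proc/irq/42/smp_affinity         # steer IRQ 42 onto CPU 1 only (mask 0x2)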

Anecdotally, I've noticed CPU 0 tends to get more IRQs anyway, so I tend to use CPU 0 as my "junk core" when worrying about this kind of thing.

If you have access to Red Hat support, their tuning guide is a pretty good place to look for information (http://developerblog.redhat.com/2015/02/11/low-latency-performance-tuning-rhel-7/).

Hope that helps,

-matt


Siddhartha Jana

Mar 4, 2015, 9:57:12 AM3/4/15
to mechanica...@googlegroups.com
One thing to consider is the physical proximity of the network-card to the sockets.
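If you want to check which socket a given NIC actually hangs off (a quick sketch; replace eth0 with your real interface name):

# cat /sys/class/net/eth0/device/numa_node        # NUMA node the NIC's PCIe slot is attached to (-1 if not reported)
# cat /sys/class/net/eth0/device/local_cpulist    # CPUs local to that device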


Michael Mattoss

Mar 4, 2015, 10:19:37 AM3/4/15
to mechanica...@googlegroups.com
Hi Matt,

I'll take a closer look at that, but you seem to confirm my suspicion.

Thanks,
Michael

Michael Mattoss

Mar 4, 2015, 10:27:31 AM3/4/15
to mechanica...@googlegroups.com
Absolutely. Different PCIe slots can be closer to different CPU sockets, just like memory (NUMA).

Matt Godbolt

Mar 4, 2015, 10:39:01 AM3/4/15
to mechanica...@googlegroups.com
lstopo is your friend for this kind of stuff.
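For anyone who hasn't used it: lstopo comes with hwloc, and its text output is enough to see which socket/NUMA node each PCIe device (NICs included) sits behind. A sketch, assuming your hwloc build has I/O support:

# lstopo --of console                 # prints sockets, cores, caches and PCI devices as a tree
# lstopo --of console --whole-io      # on older hwloc versions, forces all I/O objects to be shown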


Rishi

Mar 11, 2015, 7:55:53 AM3/11/15
to mechanica...@googlegroups.com
Related to this question: do Xeon processors require a "symmetric" memory config? I was thinking of putting only about 4 GB on the junk core, as you call it, and then about 64 GB connected to the second socket.

Matt Godbolt

Mar 11, 2015, 8:01:24 AM3/11/15
to mechanica...@googlegroups.com
On Wed, Mar 11, 2015 at 6:55 AM, Rishi <ris...@gmail.com> wrote:
Related to this question: do Xeon processors require a "symmetric" memory config? I was thinking of putting only about 4 GB on the junk core, as you call it, and then about 64 GB connected to the second socket.

It really depends what you mean. When I refer to "cores" I'm referring to the individual processors on a socket. So, on a given socket, one of the cores is a "junk" core.

As NUMA memory is connected to the sockets (not their individual cores) it doesn't make sense to talk about plugging memory into a "junk core".

However, in general, as best I understand it, you cannot have asymmetric memory on a multi-socket system. At least, none of the server setups I have access to is configured this way.
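If you want to sanity-check how the installed memory is actually split across the sockets of a running box (nothing vendor-specific here, just numactl):

# numactl --hardware      # lists each NUMA node with its CPUs and local memory size
# numastat                # per-node allocation hit/miss counters, handy for spotting cross-node traffic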

On a related note, I was recently in a call with Intel and asked about the roadmap for asymmetric processors: having a cheaper 4-core processor in one socket and an 18-core in the other. This apparently is not a planned option for normal processors, despite the Xeon Phi/Knights Landing etc. setup clearly moving in that direction.

-matt

Rishi Dhupar

Mar 11, 2015, 10:39:58 AM3/11/15
to mechanica...@googlegroups.com
Yes, I meant to say socket, not core. Glad to hear other people have thought about this; it's unfortunate that Intel is not looking into this type of architecture. Thanks for the info.


Himanshu Sharma

May 22, 2017, 3:59:42 AM5/22/17
to mechanical-sympathy
Hi Michael

Did you find a satisfactory reason for not isolating CPU 0, perhaps some low-level OS code that is bound to run on core 0? I am also stuck on this question right now and am thinking you might have an answer.

Thanks
Himanshu

Wojciech Kudla

May 22, 2017, 4:38:25 AM5/22/17
to mechanical-sympathy

There are a number of kernel tasks that are implicitly bound to cpu0. For an example of one, have a look at RCU offloading and its restrictions.
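A quick way to see this for yourself (just an illustration; the exact thread names vary with kernel version and whether you boot with rcu_nocbs=):

# ps -eo pid,psr,comm | grep -E 'rcu|ksoftirqd/0'      # psr = the CPU each kernel thread last ran on

Even with callbacks offloaded to rcuo* kthreads, you'll typically find the core RCU kthreads running on the housekeeping CPUs (often cpu0).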



Himanshu Sharma

May 22, 2017, 6:41:02 AM5/22/17
to mechanica...@googlegroups.com
Thanks, Wojciech, for the quick reply. I read about RCU offloading and, from my testing, can confirm that the kernel RCU threads are scheduled on core 0 even if core 0 is isolated. Another thing I observed is that some kworker threads run on isolated CPUs other than 0. Is this expected behavior? I used to think that isolated CPUs are not touched by the kernel. These kworker threads will definitely lead to context switches and hamper performance a little, and I am afraid we can do nothing to get rid of them.

Himanshu Sharma



Wojciech Kudla

May 22, 2017, 9:03:53 AM5/22/17
to mechanica...@googlegroups.com

There's lots of work being scheduled even on isolated cpus. If you are not running a tickless kernel, you should see around 1000 local timer interrupts per second (by default). You will also see soft irqs (if you haven't affinitized them to some housekeeping cpu), non-maskable interrupts/machine check errors, work queue tasks, etc.
As for RCU, even with offloading you will see the isolated cores performing the work required to schedule the callbacks on the offloaded cpus. You can solve that by switching to RCU callback polling, but my point here is that there are a number of different types of tasks that will run on isolated cpus.
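To make that concrete, the knobs usually involved are kernel command-line parameters; this is only a sketch (the CPU list 2-7 is a placeholder, and which of these you actually want depends on kernel version and workload):

isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7 rcu_nocb_poll

# cat /proc/cmdline                         # confirm what the running kernel was actually booted with
# cat /sys/devices/system/cpu/nohz_full     # on newer kernels: CPUs currently in adaptive-ticks mode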



Adrien Mahieux

May 23, 2017, 3:13:21 AM5/23/17
to mechanical-sympathy
Hi guys,

This is my first post, so I assume we are talking about Linux here :) I'll take kernel 3.10.105 as an example.


On Monday, May 22, 2017 at 3:03:53 PM UTC+2, Wojciech Kudla wrote:

There's lots of work being scheduled even on isolated cpus.

Indeed, the isolcpus option only changes the default scheduler affinity mask: parsing is at kernel/sched/core.c:5768 and it is used at scheduler init in kernel/sched/core.c:6713.
Kernel threads that don't use the default mask, and user tasks whose affinity has been re-set, can still run on the isolated CPUs.
 

If you are not running a tickless kernel, you should see around 1000 local timer interrupts per second (by default).

This is the HZ constant of the kernel. On most distributions it's indeed 1000. You can check this with a grep on your /proc/config.gz or /boot/config-$(uname -r):
# zgrep ^CONFIG_HZ= /proc/config.gz

Or you can check the "LOC" line in /proc/interrupts: the amount it increases each second is the value of HZ (which can be 100, 250, 300, or 1000).
# watch -d -n1 grep LOC: /proc/interrupts
(if you are lazy like me, check my ratethis script)

You will also see soft irqs (if you haven't affinitized them with some housekeeping cpu), non maskable interrupts/machine check errors, work queue tasks, etc.

Are you talking about bottom-halves (kernel irq-related threads)?

For NMI and MCE, you shouldn't be seeing them delivered to anything other than CPU 0. NMIs can be generated by watchdogs, but it's better to disable those for the sake of avoiding jitter.

 

As for rcu, even with offloading you will see the isolated cores performing work required to schedule the callbacks on the offloaded cpus. You can solve that by switching to rcu callback polling, but my point here is, there's a number of different types of tasks that will run on isolated cpus.



In Linux, you can't fully isolate CPU0, and you can't put it in nohz_full mode: it's used for timekeeping and other management work.
At each tick, there's a lot of work done: scheduling, jiffies, loadavg, timekeeping, timers, RCU callbacks...

Also, for each CPU there are a number of kernel threads that cannot be moved (you can list them with the sketch below):
- ksoftirqd: a generic handler for "softirq" work, i.e. it executes the "action" function of registered softirq vectors (check __do_softirq in kernel/softirq.c); this is mostly used by network drivers and tasklets.
- migration: the only task running at realtime priority 99; it is responsible for migrating tasks from one CPU to another (to balance the load).
- watchdog: detects soft and hard lockups (if CONFIG_LOCKUP_DETECTOR is set in the kernel config).
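You can see these per-cpu threads and where they sit with something like this (output will obviously differ per kernel/config):

# ps -eo pid,psr,rtprio,comm | grep -E 'ksoftirqd|migration|watchdog'    # psr = CPU, rtprio = realtime priority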


Also, since you can't fully isolate CPU0, socket 0 will have parts of its LLC trashed, impacting other threads on the same socket and inducing jitter.
There is some vendor technology to reduce this, but it's vendor- and model-dependent.

Adrien. 

 

Ross Bencina

May 24, 2017, 7:49:41 PM5/24/17
to mechanica...@googlegroups.com
On 22/05/2017 5:47 PM, Himanshu Sharma wrote:
> Did you find a satisfactory reason for not isolating cpu 0, maybe some
> low level OS code that is bound to run on core 0?

Throwing this out there for comment:

In addition to Linux kernel internals, you might want to consider which
CPU your IO is connected to. Since Sandy Bridge at least, each CPU has
its own PCIe interface. Presumably, if you're doing user-space kernel
bypass IO you want your workload on the same CPU that your IO devices
are connected to. Otherwise you want the kernel running on the CPU that
is directly connected to IO.

Or you could work out the CPU isolation first then connect IO as
appropriate.

Ross.

Wojciech Kudla

May 25, 2017, 1:58:00 AM5/25/17
to mechanica...@googlegroups.com

> Since Sandy Bridge at least, each CPU has its own PCIe interface. Presumably, if you're doing user-space kernel bypass IO you want your workload on the same CPU that your IO devices are connected to.

I think you meant the whole socket here. Yes, this is one of the reasons why many shops move away from 4-socket rigs, as it sometimes gets really challenging to partition PCIe/CPU/memory resources when running multiple latency-critical processes.



Adrien Mahieux

May 25, 2017, 3:40:38 PM5/25/17
to mechanical-sympathy
To be precise, each socket has three x16 PCIe links. The other widths (x1, x2, x4, x8) go through the PCH before reaching the CPU, as usual.
"lspci -tv" to view the PCI topology.

Sarunas Vancevicius

May 26, 2017, 6:10:01 AM5/26/17
to mechanical-sympathy


On Monday, 22 May 2017 16:03:53 UTC+3, Wojciech Kudla wrote:

There's lots of work being scheduled even on isolated cpus. If you are not running a tickless kernel, you should see around 1000 local timer interrupts per second (by default). You will also see soft irqs (if you haven't affinitized them to some housekeeping cpu), non-maskable interrupts/machine check errors, work queue tasks, etc.
As for RCU, even with offloading you will see the isolated cores performing the work required to schedule the callbacks on the offloaded cpus. You can solve that by switching to RCU callback polling, but my point here is that there are a number of different types of tasks that will run on isolated cpus.


The kernel can run work-queue tasks on isolated cores quite often; you can observe them via:

# perf record -C 1 -e workqueue:workqueue_execute_start -e workqueue:workqueue_execute_end -o wrk_start_$(date "+%Y-%m-%d_%H%M") -- sleep 300

Sometimes things like cursor blink or EDAC can be avoided/reduced.
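To dig into what was captured afterwards (replace the file name with whatever the date expansion actually produced):

# perf script -i wrk_start_2017-05-26_1200    # prints each workqueue_execute_start/end event with the function being run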

Wojciech Kudla

May 26, 2017, 6:12:42 AM5/26/17
to mechanical-sympathy

Yes, that's why blacklisting workqueues from critical CPUs should be on the jitter-elimination checklist.
They can be affinitized just like IRQs.
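For unbound workqueues there is a global cpumask knob on reasonably recent kernels (a sketch; individual workqueues only get their own cpumask file if they were created with WQ_SYSFS):

# cat /sys/devices/virtual/workqueue/cpumask         # CPUs that unbound workqueues are allowed to use
# echo 3 > /sys/devices/virtual/workqueue/cpumask    # restrict them to CPUs 0-1 (mask 0x3)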

