Maximizing inter-core memory bandwidth/minimizing latency in Broadwell Xeon v4?


Simon Thornington

Jul 19, 2016, 7:55:53 PM
to mechanical-sympathy
Hi all,

I'm implementing some inter-core IPC using Aeron and other shared-memory-based techniques, and I am noticing that as the SPMC ratio shifts towards more consumers (up to 1->18, with a core reserved for Aeron's media driver, for instance), the throughput of the producer goes down.

I don't have any graphs yet, but I'd like to ask if anyone has any good blog posts/references/white papers on the topology and bottlenecks of two-socket Broadwell Xeon v4 systems. Even primer-level material on NUMA with Intel Xeon/Linux would be helpful, to better understand the trade-offs of various BIOS NUMA settings with respect to intra-socket inter-core IPC vs inter-socket IPC and so forth. Any advice on diagnosing these sorts of bottlenecks would be helpful too; currently I'm recording faults and cache misses etc. using `perf`, but perhaps there are more effective techniques?

Thanks in advance,

Simon.

Philip Haynes

Jul 24, 2016, 7:35:20 PM
to mechanical-sympathy
Hi Simon,

Early last year I spent a few weeks profiling a cache-line-based SPSC shared memory queue between Java and C++. I wrote this whilst also developing an atomics library, parts of which eventually made their way into Aeron. My experience was that it was not always "bottlenecks" that got in the way, but also timing. For example, the C++ folk here got all hot and bothered when the C++ queue only exchanged 141M messages per second versus 168M messages per second for the equivalent Java implementation on an i7. The difference turned out to be slight timing differences between the offer and poll in the two implementations, and the fraction of times a queue entry was found. It also took quite a bit of effort to get repeatable results isolating this.
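For readers unfamiliar with the shape being described, here is a minimal sketch of a cache-line-padded SPSC ring buffer. This is illustrative only, not Philip's library or Aeron's actual code; the names and the cached-index optimisation are my own assumptions about a typical design.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Illustrative single-producer/single-consumer ring buffer. The head and
// tail counters sit on separate cache lines (alignas(64)) so producer and
// consumer do not false-share, and each side keeps a local cached copy of
// the other's counter to avoid re-reading the contended line every call.
template <typename T, size_t Capacity>
class SpscQueue {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");
public:
    bool offer(const T& v) {
        const uint64_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_cache_ >= Capacity) {              // looks full: refresh view of head
            head_cache_ = head_.load(std::memory_order_acquire);
            if (t - head_cache_ >= Capacity) return false;
        }
        buffer_[t & (Capacity - 1)] = v;
        tail_.store(t + 1, std::memory_order_release); // publish the element
        return true;
    }

    bool poll(T& out) {
        const uint64_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_cache_) {                        // looks empty: refresh view of tail
            tail_cache_ = tail_.load(std::memory_order_acquire);
            if (h == tail_cache_) return false;
        }
        out = buffer_[h & (Capacity - 1)];
        head_.store(h + 1, std::memory_order_release); // free the slot
        return true;
    }

private:
    alignas(64) std::atomic<uint64_t> head_{0};  // written by consumer
    uint64_t tail_cache_ = 0;                    // consumer-local view of tail
    alignas(64) std::atomic<uint64_t> tail_{0};  // written by producer
    uint64_t head_cache_ = 0;                    // producer-local view of head
    alignas(64) T buffer_[Capacity];
};
```

Note how the offer/poll pairing matters: whether poll finds an entry on its first check or has to refresh its cached counter changes the coherence traffic, which is the kind of timing sensitivity Philip describes.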

If I were to repeat this sort of testing today for a new CPU, before looking at the more complex multi-consumer scenario I would first characterise and performance-model the CPU and the experimental setup carefully, so you can be sure the computer is actually doing what you think it is. (I had thought of repeating these sorts of tests but ran out of time, for example.) With that done, then increase the number of consumers in different circumstances to work out what is going on, be it more contention, cache misses, or whatever.

But without a decent performance model of how your system should behave and repeatable experimental setup, you can often end up blowing a lot of time making endless tweaks without being sure of why.

HTH.
Philip


Martin Thompson

Jul 25, 2016, 8:08:31 AM
to mechanical-sympathy
Out of curiosity are you using C++ or Java, and does the behaviour change with 1.0 of Aeron?

In addition to Philip's comments on a performance model there are some things worth exploring:

- Turbo Boost: With increased active cores the clock rate can go down. Using x86 PAUSE in spin loops can help but best to frequency lock all cores.
- Bandwidth Limitations: If all cores are accessing the same L3 cache slice then the port on that slice can become a bottleneck.  Cache coherence traffic for invalidation and then re-fetching of all cores needs to be considered as the publisher gets exclusive access before modification.
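A hedged sketch of the PAUSE-in-spin-loop point above: the helper names here are my own, and the fallback for non-x86 ISAs is a no-op for illustration. PAUSE reduces power draw and pipeline flushes in the spinning core, but it does not prevent Turbo Boost from lowering the clock as more cores go active; that needs frequency locking outside the code.

```cpp
#include <atomic>

// Issue the CPU's spin-loop hint. On x86 this is the PAUSE instruction;
// on other ISAs a local equivalent (e.g. ARM YIELD) would be substituted.
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__)
#include <immintrin.h>
static inline void cpu_pause() { _mm_pause(); }
#else
static inline void cpu_pause() {}  // illustrative no-op fallback
#endif

// Bounded busy-wait on a flag, pausing between polls. Returns true if the
// flag was observed set within max_spins iterations, false otherwise so
// the caller can fall back to yielding or parking the thread.
inline bool spin_until_set(const std::atomic<bool>& flag, long max_spins) {
    for (long i = 0; i < max_spins; ++i) {
        if (flag.load(std::memory_order_acquire)) return true;
        cpu_pause();  // hint: spin-wait in progress
    }
    return false;
}
```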

Are you seeing back pressure on the publisher? If so, then you are likely waiting for the publisher flow control window to be updated. This could be either the driver conductor or one or more of the subscribers being starved out and thus holding everyone else back.

There is so much to look at: cache misses, starvation, setup for NUMA and CoD (Cluster on Die - effectively NUMA on socket). Best to have a model of what you expect, then measure what is being observed and see if the experimental evidence fits the model. You need to model the flow rates and dependencies. To have parallel in-flight cache misses you need to ensure you are avoiding data-dependent loads, and even then you only have ~10 line fill buffers per core to keep the cache misses operating concurrently. If you graph the scale-up then limitations like bandwidth, buffers, etc. become obvious as the queuing effects kick in.
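The data-dependent-loads point can be made concrete with a small sketch (my own illustrative code, not from the thread): chasing a linked list serialises misses because each load's address comes from the previous load, while independent array indexing lets the out-of-order core keep several misses in flight, bounded by the per-core line fill buffers.

```cpp
#include <cstddef>
#include <vector>

struct Node {
    Node* next;
    long  value;
};

// Data-dependent loads: the address of each load depends on the result of
// the previous one, so cache misses cannot overlap -- one miss at a time.
long sum_list(const Node* n) {
    long s = 0;
    for (; n != nullptr; n = n->next)
        s += n->value;
    return s;
}

// Independent loads: all addresses are computable up front, so the core can
// issue several misses concurrently (limited by the line fill buffers).
long sum_array(const std::vector<long>& v) {
    long s = 0;
    for (size_t i = 0; i < v.size(); ++i)
        s += v[i];
    return s;
}
```

Both functions compute the same sum; on large, cache-cold data the array version tends to run several times faster purely from memory-level parallelism.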

Martin...

GAURAV KAUL

Aug 1, 2016, 6:33:38 AM
to mechanical-sympathy
Some good replies here. One thing I have noticed is the sensitivity to NUMA, esp. on the recent versions of Xeon E5. There are some tools such as the one from PV, but this is a commercial tool.

Specifically on NUMA this is a bit of a black art. 

Please check this white paper from Dell.