I’m doing some optimization of low-level C++ code in a database engine, which is currently single-threaded. I’m using the Instruments time profiler, which has been very useful so far. One thing I’m wondering about is CPU-bound vs. RAM-bandwidth-bound code, and how this affects the ability to parallelize.
I’m not an expert on this, but my understanding is that the bandwidth between the CPU and DRAM is a major bottleneck in current systems. I’ve heard people say “RAM is the new disk.” And this does manifest in the code I’m profiling, in that memcpy is definitely one of the hot functions.
I’m assuming that RAM-bound code would not parallelize well, since it doesn’t make a difference how many threads I divide the RAM bandwidth between. A parallel approach would probably even make things worse by lowering locality of reference.
But how can I tell the difference between RAM-bound and genuinely CPU-bound (not to mention disk-I/O-bound) code? I know that the regular time profiler is a pretty blunt tool, but I stick with it because I know how it works and how to use it, whereas I’ve failed to get anything meaningful out of many of the other instruments. I’d appreciate any tips, or pointers to good documentation (as opposed to the Instruments manual!)
—Jens
_______________________________________________
Do not post admin requests to the list. They will be ignored.
PerfOptimization-dev mailing list (PerfOptimi...@lists.apple.com)
Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/perfoptimization-dev/perfoptimization-dev-garchive-8409%40googlegroups.com
This email sent to perfoptimization-...@googlegroups.com
> On 1 Oct 2016, at 21:12, Jens Alfke <je...@mooseyard.com> wrote:
>
> I hope this list is still alive…
“resting”?
> I’m doing some optimization of low-level C++ code in a database engine, which is currently single-threaded. I’m using the Instruments time profiler, which has been very useful so far. One thing I’m wondering about is CPU-bound vs. RAM-bandwidth-bound code, and how this affects the ability to parallelize.
>
> I’m not an expert on this, but my understanding is that the bandwidth between the CPU and DRAM is a major bottleneck in current systems. I’ve heard people say “RAM is the new disk.” And this does manifest in the code I’m profiling, in that memcpy is definitely one of the hot functions.
Yes. It used to be that we put multiplication tables in RAM to save computation; nowadays it would be worthwhile to perform tens of multiplications to avoid a single main-memory access.
>
> I’m assuming that RAM-bound code would not parallelize well, since it doesn’t make a difference how many threads I divide the RAM bandwidth between. A parallel approach would probably even make things worse by lowering locality of reference.
I think your intuition is generally correct, though every bit of code can be different.
>
> But how can I tell the difference between RAM-bound and genuinely CPU-bound (not to mention disk-I/O-bound) code? I know that the regular time profiler is a pretty blunt tool, but I stick with it because I know how it works and how to use it, whereas I’ve failed to get anything meaningful out of many of the other instruments. I’d appreciate any tips, or pointers to good documentation (as opposed to the Instruments manual!)
One technique that can be helpful with just the time profiler is varying your data-set size so that it fits in L1, L2, L3, or only in RAM. If you are memory-bound, you should typically see step functions in the time-vs.-size curve. With modern DRAM configurations there is also a huge difference between accessing main memory sequentially and randomly (roughly 100x in my measurements), so that's a factor too.
In terms of Instruments, the “Counters” instrument can give you raw counts of events such as cache hits and misses, and nowadays that instrument actually works, so yay! You will probably need to calibrate against code where you know whether it’s CPU-bound or memory-bound in order to make sense of those numbers. Also, L3 isn’t called “L3” there, but “Last Level Cache”.
Cheers,
Marcel
Code that has a problem will generally have samples concentrated in one or a few places. Usually you will see a lot of samples attributed either to the instruction after the one causing the stall (reorder window full) or to the first instruction that depends on the result of the problem instruction. If the problem instruction is a load from memory, that is a good sign that memory is the issue.