I’m doing some optimization of low-level C++ code in a database engine, which is currently single-threaded. I’m using the Instruments time profiler, which has been very useful so far. One thing I’m wondering about is CPU-bound vs. RAM-bandwidth-bound code, and how this affects the ability to parallelize.
I’m not an expert on this, but my understanding is that the bandwidth between the CPU and DRAM is a major bottleneck in current systems. I’ve heard people say “RAM is the new disk.” And this does manifest in the code I’m profiling, in that memcpy is definitely one of the hot functions.
I’m assuming that RAM-bound code would not parallelize well, since it doesn’t make a difference how many threads I divide the RAM bandwidth between. A parallel approach would probably even make things worse by lowering locality of reference.
But how can I tell the difference between RAM-bound and genuinely CPU-bound (not to mention disk-I/O-bound) code? I know that the regular time profiler is a pretty blunt tool, but I stick with it because I know how it works and how to use it, whereas I’ve failed to get anything meaningful out of many of the other instruments. I’d appreciate any tips, or pointers to good documentation (as opposed to the Instruments manual!)
—Jens
_______________________________________________
Do not post admin requests to the list. They will be ignored.
PerfOptimization-dev mailing list (PerfOptimi...@lists.apple.com)
Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/perfoptimization-dev/perfoptimization-dev-garchive-8409%40googlegroups.com
This email sent to perfoptimization-...@googlegroups.com
> On 1 Oct 2016, at 21:12, Jens Alfke <je...@mooseyard.com> wrote:
>
> I hope this list is still alive…
“resting”?
> I’m doing some optimization of low-level C++ code in a database engine, which is currently single-threaded. I’m using the Instruments time profiler, which has been very useful so far. One thing I’m wondering about is CPU-bound vs. RAM-bandwidth-bound code, and how this affects the ability to parallelize.
>
> I’m not an expert on this, but my understanding is that the bandwidth between the CPU and DRAM is a major bottleneck in current systems. I’ve heard people say “RAM is the new disk.” And this does manifest in the code I’m profiling, in that memcpy is definitely one of the hot functions.
Yes. It used to be that we put multiplication tables in RAM to save computation; nowadays it would be worthwhile to perform tens of multiplications to avoid a single main-memory access.
>
> I’m assuming that RAM-bound code would not parallelize well, since it doesn’t make a difference how many threads I divide the RAM bandwidth between. A parallel approach would probably even make things worse by lowering locality of reference.
I think your intuition is generally correct, though every bit of code can be different.
>
> But how can I tell the difference between RAM-bound and genuinely CPU-bound (not to mention disk-I/O-bound) code? I know that the regular time profiler is a pretty blunt tool, but I stick with it because I know how it works and how to use it, whereas I’ve failed to get anything meaningful out of many of the other instruments. I’d appreciate any tips, or pointers to good documentation (as opposed to the Instruments manual!)
One technique that can be helpful with just the time profiler is varying your data-set size so that it fits in L1, L2, L3, or only in RAM. If you are memory-bound, you should typically see step functions in the time-vs.-size curve. With modern DRAM configurations there is also a huge difference between accessing main memory sequentially and randomly (roughly 100x in my measurements), so that's a factor too.
In terms of Instruments, the “Counters” instrument can give you raw counts of events such as cache hits and misses, and nowadays that instrument actually works, so yay! You will probably need to calibrate against code where you know whether it’s CPU-bound or memory-bound in order to make sense of those numbers. Also, L3 isn’t called “L3” there, but “Last Level Cache”.
Cheers,
Marcel
Code that has a problem will generally have samples concentrated in one or a few places. Usually you will see a lot of samples attributed either to the instruction after the one causing the stall (reorder window full) or to the first instruction that depends on the result of the problem instruction. If the problem instruction is a load from memory, that is a good sign that memory is the issue.