On 24.10.16 at 18:53, Stefan Ram wrote:
> I took »escape« from Chandler Carruth and wrote a small
> microbenchmark around it to benchmark the operation »%«
> (gcc or compatible assumed):
>
> { int i = 1; int j = 1; int k = 1;
> for( long l = 0; l < 1'000'000'000; ++l )
> { escape( &i );
> escape( &j );
> k = i % j;
> escape( &k ); }}
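In case someone does not know the trick: Chandler Carruth's »escape« is
usually an empty inline-asm statement that makes the optimizer assume
the pointed-to object may be read or modified by code it cannot see.
A sketch of the common GCC/Clang idiom; Stefan's exact definition may
differ:

  // Emits no instructions, but tells the optimizer that *p may be
  // observed or changed by code outside its view, so the stores to
  // i, j and k must actually happen.
  static void escape( void* p )
  {
      asm volatile( "" : : "g"(p) : "memory" );
  }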
What are you trying to benchmark: the arithmetic operators +, -, *, %?
The main memory bandwidth? The optimization quality of your compiler?
> cltd means "Convert Long To Double". It does not appear
> for »+« and »*«, but it also appears for »/«. I'm not sure
> why it is done that way, first "cltd" and then "idivl".
> Is this a float operation or an integer operation?
As explained by Scott, the x86 integer multiply and divide instructions
work with a product or dividend that is twice as wide as the operands:
32 bit * 32 bit -> 64 bit, and 64 bit / 32 bit -> 32 bit quotient plus
32 bit remainder. That is unlike C, where operands and result of »*«,
»/« and »%« all have the same width. cltd (»Convert Long To
Double-long«, the AT&T name for cdq) simply sign-extends the 32-bit
value in EAX into the register pair EDX:EAX to form that 64-bit
dividend for idivl; it is a purely integer operation, nothing
floating-point.
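For a plain »int % int« this is what you typically get. The function
name is made up and the exact instructions depend on compiler version
and flags, so treat it as an illustration only:

  // With gcc -O2 on x86-64 (System V ABI: a in EDI, b in ESI) the
  // body compiles to roughly:
  //   movl  %edi, %eax    # dividend into EAX
  //   cltd                # sign-extend EAX into EDX:EAX (a.k.a. cdq)
  //   idivl %esi          # EDX:EAX / b -> quotient EAX, remainder EDX
  //   movl  %edx, %eax    # return the remainder
  int modulo( int a, int b )
  {
      return a % b;
  }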
> Now, is this a reasonable microbenchmark?
No. For a modern CPU it is actually very hard to create full load. The
reason is that the CPUs are superscalar, i.e. they execute multiple
instructions in parallel where possible, and that main memory runs at a
different speed than the CPU, which is only partly compensated by
several levels of caches.
In your benchmark you are forcing the compiler to store and load every
operand to and from memory (by taking its address). This defeats
register allocation, which is one of the most important optimizations.
I suspect you are mostly measuring the speed of the first-level cache.
If you only want to prevent the compiler from removing the dead code,
you should use the result, i.e. print k at the end of the program.
Additionally, make the upper bound of the loop or the initial values of
the operands (i and j) depend on input read at program start. A very
clever compiler can in some cases still derive a closed formula for
such a loop, but you will see that in the assembly output.
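A minimal sketch of what I mean; the defaults are made up, and note
that the division is still loop-invariant, so do check the generated
assembly:

  #include <cstdio>
  #include <cstdlib>

  int main( int argc, char** argv )
  {
      // Bound and operands come from the command line, so the
      // compiler cannot precompute the result at compile time.
      long n = argc > 1 ? std::atol( argv[1] ) : 1'000'000'000;
      int  i = argc > 2 ? std::atoi( argv[2] ) : 1;
      int  j = argc > 3 ? std::atoi( argv[3] ) : 1;

      int k = 0;
      for( long l = 0; l < n; ++l )
          k = i % j;             // same body as before, but no escape()

      std::printf( "%d\n", k );  // use the result so the loop is not dead
      return 0;
  }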
Another thing to consider: CPUs have vector units and can perform
multiple operations in parallel (SIMD). So for a different loop that,
say, divides 1000 integers that do not depend on each other, an
autovectorizer can create code that is 2x, 4x or 8x faster.
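A sketch of such a loop (the function name is made up). Note that x86
has no SIMD instruction for integer division by a runtime divisor, so
the divisor is a constant here; with floats, plain element-wise
division vectorizes as well. Whether gcc actually vectorizes this
depends on the target and compiler version:

  // 1000 independent divisions. With -O3 gcc can turn the division
  // by a constant into a multiply-high sequence and process several
  // elements per SIMD instruction.
  void div_all_by_7( int* dst, const int* src )
  {
      for( int n = 0; n < 1000; ++n )
          dst[n] = src[n] / 7;   // element n does not depend on n-1
  }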
A notable example of the influence of caches, vector units etc. are the
linear algebra subroutines (BLAS). A highly tuned library (such as
OpenBLAS or Intel MKL) can achieve a speedup of a factor of 1000 over
naively written C code.
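»Naive« here means the textbook triple loop, roughly like this sketch
(my own illustration); the tuned BLAS kernels get their speedup from
cache blocking, vectorization and multithreading, none of which this
version has:

  // C = A * B for n x n matrices in row-major order.
  // The innermost access b[k*n + j] strides through memory,
  // which is exactly what blocked BLAS kernels avoid.
  void matmul_naive( int n, const double* a, const double* b, double* c )
  {
      for( int i = 0; i < n; ++i )
          for( int j = 0; j < n; ++j )
          {
              double sum = 0.0;
              for( int k = 0; k < n; ++k )
                  sum += a[i*n + k] * b[k*n + j];
              c[i*n + j] = sum;
          }
  }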
Christian