
Re: Too many CppCon videos


Scott Lurndal

Oct 24, 2016, 1:59:18 PM
r...@zedat.fu-berlin.de (Stefan Ram) writes:
>
> I took »escape« from Chandler Carruth and wrote a small
> microbenchmark around it to benchmark the operation »%«
> (gcc or compatible assumed):
>
>{ int i = 1; int j = 1; int k = 1;
> for( long l = 0; l < 1'000'000'000; ++l )
> { escape( &i );
> escape( &j );
> k = i % j;
> escape( &k ); }}
>
> Here is the assembler generated for the loop:
>
>.L3:
> movl 36(%rsp), %eax
> cltd
> idivl 40(%rsp)
> movl %edx, 44(%rsp)
> subl $1, %ecx
> jne .L3
>
> It seems that one can say
>
>i = 36(%rsp)
>j = 40(%rsp)
>k = 44(%rsp)
>l = %ecx
>
> cltd means "Convert Long To Double". It does not appear
> for »+« and »*«, but it also appears for »/«. I'm not sure
> why it is done that way, first "cltd" and then "idivl".
> Is this a float operation or an integer operation?
>
> Now, is this a reasonable microbenchmark?
>
> Without the »escape«, the loop and all is just removed
> by the compiler according to the assembly output.
>

The "escape" is simply a compiler barrier. If the compiler
removes code, the compiler (the optimizer, generally) has determined
that the code is unnecessary (because k is never used); the compiler
barrier tells the compiler not to touch that code.

#define barrier() asm volatile ("":::"memory")

is sufficient in gcc.

I'm not sure why you'd expect division to be comparable to
addition from a performance perspective[*]. The CLTD instruction
sign-extends the 32-bit value in EAX into EDX, forming the 64-bit
dividend EDX:EAX that the IDIVL instruction requires. It has nothing
to do with floating point.

[*] See the appropriate vendor optimization guide for instruction
cycle counts (hint: idivl takes many more cycles than addl).

Christian Gollwitzer

Oct 25, 2016, 7:28:08 AM
On 24.10.16 at 18:53, Stefan Ram wrote:
> I took »escape« from Chandler Carruth and wrote a small
> microbenchmark around it to benchmark the operation »%«
> (gcc or compatible assumed):
>

> { int i = 1; int j = 1; int k = 1;
> for( long l = 0; l < 1'000'000'000; ++l )
> { escape( &i );
> escape( &j );
> k = i % j;
> escape( &k ); }}

What are you trying to benchmark? The operations +, -, *, %?
The main memory bandwidth? The optimization quality of your compiler?

> cltd means "Convert Long To Double". It does not appear
> for »+« and »*«, but it also appears for »/«. I'm not sure
> why it is done that way, first "cltd" and then "idivl".
> Is this a float operation or an integer operation?

As explained by Scott, in x86 assembly the integer multiplication and
division use results twice as wide as their operands: 32 bit * 32 bit
-> 64 bit, and 64 bit / 32 bit -> 32 bit quotient plus 32 bit
remainder - as opposed to C, where the operators are defined on
operands and results of the same width.

> Now, is this a reasonable microbenchmark?

No. For a modern CPU, it is actually very hard to create full load.
The reason is that CPUs are superscalar, i.e. they perform multiple
instructions in parallel where possible, and that the memory runs at a
different speed than the CPU, which is partly compensated by several
levels of caches.

In your benchmark, you are forcing the compiler to store and load
every operand to memory (by taking its address). This defeats register
allocation, which is one of the most important optimizations. I guess
you are mostly measuring the speed of the first-level cache.

If you only want to prevent the compiler from removing the dead code,
you should use the result, i.e. print k at the end of the program.
Additionally, make the upper loop bound or the initial values depend
on input at program start. A very clever compiler can still - in some
cases - compute a closed formula for such a loop, but you will see
that in the assembly output.

Another thing to consider: CPUs have vector units and can perform
multiple operations in parallel (SIMD). So for a different loop which
divides 1000 integers that are not interdependent, an autovectorizer
can create code that is 2x, 4x or 8x faster.

A notable example of the influence of caches, vector units etc. is
linear algebra subroutines. A high-performance library (such as
OpenBLAS or Intel MKL) can achieve a speedup of a factor of 1000 over
naively written C code.

Christian