
Invert every 2nd byte in a container of raw data


Frederick Gotham

Mar 29, 2020, 11:40:13 AM

size_t constexpr g_LEN = 16u;

array<uint8_t, g_LEN> data;

auto const my_range =
    data
    | boost::adaptors::sliced(1,15)
    | boost::adaptors::strided(2);

boost::range::transform(my_range,
    my_range.begin(),
    std::bind2nd(std::bit_xor<uint8_t>(),0xFF)
);

It's annoying that there isn't a form of "boost::range::transform" that only takes two arguments.

Can anyone think of a cleaner way of doing this that will work with any kind of container of objects of type "uint8_t"? I use "bind2nd" with "bit_xor" and "0xFF" in this snippet because I can't find the unary "bit_not" class. Is there such a thing?



Alf P. Steinbach

Mar 29, 2020, 11:48:34 AM
Well, assuming that your code is always dealing with 8-bit bytes, try

for( int i = 0; i < g_LEN; ++i ) { if( i%2 ) { data[i] = ~data[i]; } }

Oh look, it's a one-liner. ;-)


- Alf

James Kuyper

Mar 29, 2020, 11:54:20 AM
On 3/29/20 11:40 AM, Frederick Gotham wrote:
...
>I use "bind2nd" with "bit_xor" and "0xFF" in this snippet because I can't find the unary "bit_not" class. Is there such a thing?

The bit_not class is described in 23.14.9.4, immediately after
23.14.9.3, which describes the bit_xor class.
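
For reference, here is a minimal sketch of the original snippet rewritten with the unary std::bit_not from <functional> (available since C++14) in place of the bind2nd/bit_xor pair. The container and the Boost adaptors are taken from the original post; using sliced(1, g_LEN) instead of sliced(1, 15) is my assumption about the intent, so that the last odd index is covered too.

#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <boost/range/adaptor/sliced.hpp>
#include <boost/range/adaptor/strided.hpp>
#include <boost/range/algorithm/transform.hpp>

int main()
{
    constexpr std::size_t g_LEN = 16u;
    std::array<std::uint8_t, g_LEN> data{};

    auto my_range = data
        | boost::adaptors::sliced( 1, g_LEN )
        | boost::adaptors::strided( 2 );

    // Overwrite the strided view in place with the bitwise complement.
    boost::range::transform( my_range, my_range.begin(),
                             std::bit_not<std::uint8_t>() );
}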

Bonita Montero

Mar 29, 2020, 11:56:22 AM
>     for( int i = 0; i < g_LEN; ++i ) { if( i%2 ) { data[i] = ~data[i]; } }

Assuming we have two's complement, this would be faster:

void invertSecond( uint8_t *p, size_t n )
{
    for( size_t i = 0; i != n; ++i )
        p[i] ^= -(int8_t)(i & 1);
}
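
To spell out the trick (my own annotation, not part of the original post): on two's-complement targets, -(int8_t)(i & 1) is 0x00 for even i and 0xFF for odd i, so the XOR is either a no-op or a full byte inversion, with no branch. A tiny self-check:

#include <cassert>
#include <cstdint>

int main()
{
    for( int i = 0; i < 4; ++i )
    {
        // mask alternates 0x00, 0xFF, 0x00, 0xFF, ...
        std::uint8_t mask = (std::uint8_t)-(std::int8_t)(i & 1);
        assert( mask == (i % 2 ? 0xFF : 0x00) );
    }
}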

Bonita Montero

Mar 29, 2020, 12:01:05 PM
>>     for( int i = 0; i < g_LEN; ++i ) { if( i%2 ) { data[i] = ~data[i]; } }

> Assuming we have two's complement, this would be faster:

Or maybe not, because the branch prediction is able
to predict the regular pattern of your branches.

Bonita Montero

Mar 29, 2020, 12:27:02 PM
> Or maybe not, because the branch prediction is able
> to predict the regular pattern of your branches.

I tested it; my variant is faster, even though the loop body is
much larger:

#include <iostream>
#include <chrono>
#include <cstdint>
#include <vector>
#include <algorithm>

using namespace std;
using namespace chrono;

void invertSecondA( uint8_t *p, size_t n )
{
    for( size_t i = 0; i != n; ++i )
        p[i] ^= -(int8_t)(i & 1);
}

void invertSecondB( uint8_t *p, size_t n )
{
    for( size_t i = 0; i != n; ++i )
        if( i % 2 )
            p[i] = ~p[i];
}

int main()
{
    size_t const SIZE   = 1024,        // fits in L1
                 ROUNDS = 1'000'000;
    vector<uint8_t> v( SIZE, 0 );
    time_point<high_resolution_clock> start = high_resolution_clock::now();
    for( size_t round = ROUNDS; round; --round )
        invertSecondA( &v[0], SIZE );
    double sA = (double)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count() / 1.0E9;
    start = high_resolution_clock::now();
    for( size_t round = ROUNDS; round; --round )
        invertSecondB( &v[0], SIZE );
    double sB = (double)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count() / 1.0E9;
    cout << sA << endl << sB << endl;
}

These are the MSVC 2019 times:
0.533812
1.1952
And these are the g++ 6.3.0 times:
0.190201
0.401627

Christian Gollwitzer

Mar 29, 2020, 12:39:57 PM
On 29.03.20 18:26, Bonita Montero wrote:
>> Or maybe not, because the branch prediction is able
>> to predict the regular pattern of your branches.
>
> I tested it; my variant is faster, even though the loop body is
> much larger:
>
> #include <iostream>
> #include <chrono>
> #include <cstdint>
> #include <vector>
> #include <algorithm>
>
> using namespace std;
> using namespace chrono;
>
> void invertSecondA( uint8_t *p, size_t n )
> {
>     for( size_t i = 0; i != n; ++i )
>         p[i] ^= -(int8_t)(i & 1);
> }
>
> void invertSecondB( uint8_t *p, size_t n )
> {
>     for( size_t i = 0; i != n; ++i )
>         if( i % 2 )
>             p[i] = ~p[i];
> }
>

How about:

    for( size_t i = 0; i < n; i += 2 )
        p[i] = ~p[i];

?

Christian
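
Tying this back to the original question about "any kind of container of uint8_t" (a sketch of mine, not part of the thread): the same loop can be written once against a pair of forward iterators and then used with vector, array, deque, list, and so on.

#include <cstdint>
#include <list>
#include <vector>

// Invert every 2nd byte (indices 1, 3, 5, ...) of any sequence of uint8_t.
template<typename FwdIt>
void invert_every_second( FwdIt first, FwdIt last )
{
    bool odd = false;
    for( ; first != last; ++first, odd = !odd )
        if( odd )
            *first = static_cast<std::uint8_t>( ~*first );
}

int main()
{
    std::vector<std::uint8_t> v( 16, 0 );
    std::list<std::uint8_t>   l( 16, 0 );
    invert_every_second( v.begin(), v.end() );
    invert_every_second( l.begin(), l.end() );
}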

Bonita Montero

Mar 29, 2020, 12:49:13 PM
> How about:
>     for( size_t i = 0; i < n; i += 2 )
>         p[i] = ~p[i];

That's too simple. ;-)

Christian Gollwitzer

Mar 29, 2020, 2:27:52 PM
On 29.03.20 18:49, Bonita Montero wrote:
Indeed, that version is faster than what Alf posted, but not nearly as
fast as yours. Here is the output (I'm on macOS):

Apfelkiste:Tests chris$ clang++ -O2 --std=c++17 invert.cpp -march=native
Apfelkiste:Tests chris$ ./a.out
0.193238
1.62945
1.03101
Apfelkiste:Tests chris$ clang++ -v
Apple LLVM version 10.0.0 (clang-1000.11.45.5)

Looking at the assembly output, your version is vectorized by the
compiler, and that is the secret behind the 5x speed improvement
(compared to my version): the whole thing is memory bound, and
vectorizing significantly reduces the number of memory accesses.

Christian


#include <iostream>
#include <chrono>
#include <cstdint>
#include <vector>
#include <algorithm>

using namespace std;
using namespace chrono;

void invertSecondA( uint8_t *p, size_t n )
{
    for( size_t i = 0; i != n; ++i )
        p[i] ^= -(int8_t)(i & 1);
}

void invertSecondB( uint8_t *p, size_t n )
{
    for( size_t i = 0; i != n; ++i )
        if( i % 2 )
            p[i] = ~p[i];
}

void invertSecondC( uint8_t *p, size_t n )
{
    for( size_t i = 0; i < n; i += 2 )
        p[i] = ~p[i];
}

int main()
{
    size_t const SIZE   = 1024,        // fits in L1
                 ROUNDS = 5'000'000;
    vector<uint8_t> v( SIZE, 0 );
    time_point<high_resolution_clock> start = high_resolution_clock::now();
    for( size_t round = ROUNDS; round; --round )
        invertSecondA( &v[0], SIZE );
    double sA = (double)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count() / 1.0E9;
    start = high_resolution_clock::now();
    for( size_t round = ROUNDS; round; --round )
        invertSecondB( &v[0], SIZE );
    double sB = (double)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count() / 1.0E9;
    start = high_resolution_clock::now();
    for( size_t round = ROUNDS; round; --round )
        invertSecondC( &v[0], SIZE );
    double sC = (double)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count() / 1.0E9;
    cout << sA << endl << sB << endl << sC << endl;
}
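
As an aside (not from the thread): both compilers can report which loops they vectorized, which is an easy way to confirm this kind of explanation. For example, with the file name used above:

clang++ -O2 -std=c++17 -Rpass=loop-vectorize invert.cpp
g++ -O2 -std=c++17 -fopt-info-vec invert.cpp

The first is clang's optimization-remark option, the second the corresponding gcc option; both print a note for each loop the optimizer vectorized.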

Bonita Montero

Mar 30, 2020, 2:32:25 AM
> void invertSecondC( uint8_t *p, size_t n )
> {
>     for( size_t i = 0; i < n; i+=2 )
>         p[i] = ~p[i];
> }

If I didn't make a mistake this should be faster:

void invertSecondD( uint8_t *p, size_t n )
{
    if( !n-- )
        return;
    ++p;
    size_t head = (((intptr_t)p + 7) & (intptr_t)-8) - (intptr_t)p;
    head = head >= n ? head : n;
    for( uint8_t *w = p, *end = p + head; w < end; w += 2 )
        *w = ~*w;
    if( head <= n )
        return;
    int64_t odd = (int64_t)p & 1;
    n -= head;
    p += head;
    if( n / 8 )
    {
        union
        {
            uint8_t u8Mask[8] = { 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00 };
            uint64_t mask;
        };
        uint64_t *w   = (uint64_t *)p,
                 *end = w + n / 8;
        mask ^= -odd;
        do
            *w ^= mask;
        while( w != end );
        p = (uint8_t *)w;
        n = n % 8;
    }
    if( n <= (size_t)odd )
        return;
    p += (size_t)odd;
    n -= (size_t)odd;
    for( uint8_t *end = p + n; p < n; p += 2 )
        p = ~*p;
}

Bonita Montero

Mar 30, 2020, 2:38:27 AM
>     for( uint8_t *end = p + n; p < n; p += 2 )
>         p = ~*p;

    for( uint8_t *end = p + n; p < end; p += 2 )
        *p = ~*p;

Bonita Montero

Mar 30, 2020, 5:10:15 AM
This is a blocked version that adapts to 32- and 64-bitness:

void invertSecondBlocked( uint8_t *p, size_t n )
{
    if( !n-- )
        return;
    ++p;
    size_t head = (((intptr_t)p + 7) & (intptr_t)-8) - (intptr_t)p;
    head = head <= n ? head : n;
    for( uint8_t *w = p, *end = p + head; w < end; w += 2 )
        *w = ~*w;
    if( n == head )
        return;
    size_t odd = (size_t)p & 1; // assume size_t or ptrdiff_t is our register width
    n -= head;
    p += head;
    if constexpr( sizeof(size_t) == 8 )
    {
        if( n / 8 )
        {
            union
            {
                uint8_t u8Mask[8] = { 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00 };
                size_t  mask;
            };
            size_t *w   = (size_t *)p,
                   *end = w + n / 8;
            mask ^= -(ptrdiff_t)odd;
            do
                *w ^= mask;
            while( ++w != end );
            p = (uint8_t *)w;
            n = n % 8;
        }
    }
    else if constexpr( sizeof(size_t) == 4 )
        if( n / 4 )
        {
            union
            {
                uint8_t u8Mask[4] = { 0xFF, 0x00, 0xFF, 0x00 };
                size_t  mask;
            };
            size_t *w   = (size_t *)p,
                   *end = w + n / 4;
            mask ^= -(ptrdiff_t)odd;
            do
                *w ^= mask;
            while( ++w != end );
            p = (uint8_t *)w;
            n = n % 4;
        }
    if( n <= odd )
        return;
    p += odd;
    n -= odd;
    uint8_t *end = p + n;
    do
        *p = ~*p;
    while( (p += 2) < end );
}

Melzzzzz

Mar 30, 2020, 5:39:38 AM
What does this do?

--
press any key to continue or any other to quit...
"I don't enjoy anything as much as my DISABILITY status" -- Zli Zec
"We are all witnesses - about 3 years of intensive propaganda is enough to drive a nation mad" -- Zli Zec
"There wasn't really that much violence in the Wild West, precisely because everyone
was armed." -- Mladen Gogala

Bonita Montero

Mar 30, 2020, 5:55:26 AM
It inverts every second byte: first the head up to the first aligned
uint64_t block, then the uint64_t blocks, and then the tail. It's more
than 10 times faster than the fastest algorithm so far.
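
For readers who want the idea without the pointer arithmetic, here is a simplified sketch of the same head/body/tail scheme (my own rewrite, not the code posted above): it aligns the blocks to the index rather than to the address and uses memcpy to sidestep alignment and strict-aliasing concerns; modern compilers typically turn the fixed-size memcpy calls into single loads and stores.

#include <cstddef>
#include <cstdint>
#include <cstring>

// Invert the bytes at odd indices, processing 8 bytes at a time where possible.
void invertOddBytes( std::uint8_t *p, std::size_t n )
{
    // Build the 8-byte XOR mask 00 FF 00 FF ... byte-wise (endian-independent).
    std::uint8_t const maskBytes[8] = { 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF };
    std::uint64_t mask;
    std::memcpy( &mask, maskBytes, 8 );

    std::size_t i = 0;
    // Head: byte-wise until the index is a multiple of 8.
    for( ; i < n && i % 8 != 0; ++i )
        if( i & 1 )
            p[i] = static_cast<std::uint8_t>( ~p[i] );
    // Body: whole 8-byte blocks, each flipped with one 64-bit XOR.
    for( ; i + 8 <= n; i += 8 )
    {
        std::uint64_t block;
        std::memcpy( &block, p + i, 8 );   // memcpy avoids unaligned-access UB
        block ^= mask;
        std::memcpy( p + i, &block, 8 );
    }
    // Tail: remaining bytes.
    for( ; i < n; ++i )
        if( i & 1 )
            p[i] = static_cast<std::uint8_t>( ~p[i] );
}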

Melzzzzz

Mar 30, 2020, 6:00:01 AM
Have you measured? It looks pretty complicated for the task at hand.
Also, I really doubt it is faster than an SSE2 solution...

Bonita Montero

Mar 30, 2020, 6:03:29 AM
> Have you measured? It looks pretty complicated for the task at hand.

There's no way to make it simpler.

> Also, I really doubt it is faster than an SSE2 solution...

An SSE solution would be a bit more complex, and it wouldn't
be portable.


Bonita Montero

Mar 30, 2020, 6:37:24 AM
Here's an SSE-enabled version:

void invertSecondBlocked( uint8_t *p, size_t n )
{
    size_t const BLOCKSIZE = SSE_BLOCKS ? 16 : sizeof(size_t);
    if( !n-- )
        return;
    ++p;
    size_t head = (((intptr_t)p + (BLOCKSIZE - 1)) & -(intptr_t)BLOCKSIZE) - (intptr_t)p;
    head = head <= n ? head : n;
    for( uint8_t *w = p, *end = p + head; w < end; w += 2 )
        *w = ~*w;
    if( n == head )
        return;
    size_t odd = (size_t)p & 1; // assume size_t or ptrdiff_t is our register width
    n -= head;
    p += head;
    if constexpr( SSE_BLOCKS )
    {
        if( n / 16 )
        {
            union
            {
                uint8_t u8Mask[16] = { 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00,
                                       0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00 };
                __m128  mask;
            };
            static const union
            {
                uint8_t u8Invert[16] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
                                         0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };
                __m128  invert;
            };
            __m128 *w   = (__m128 *)p,
                   *end = w + n / 16;
            if( odd )
                mask = _mm_xor_ps( mask, invert );
            do
                *w = _mm_xor_ps( mask, *w );
            while( ++w != end );
            p = (uint8_t *)w;
            n = n % 16;
        }
    }
    else if constexpr( sizeof(size_t) == 8 )
        // ... (continues as in the previous, non-SSE version)

It's almost twice as fast on my CPU.
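
The SSE_BLOCKS switch is not defined anywhere in the post; one plausible way to define it (my assumption, using the compilers' usual predefined macros; the AVX_BLOCKS flag used in the follow-up post can be handled the same way) would be:

// Hypothetical feature switches, not part of the original post.
#if defined(__AVX__)
    #define AVX_BLOCKS 1
#else
    #define AVX_BLOCKS 0
#endif
#if defined(__SSE2__) || defined(_M_X64)
    #define SSE_BLOCKS 1
#else
    #define SSE_BLOCKS 0
#endif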

Bonita Montero

Mar 30, 2020, 7:57:38 AM
This is SSE- as well as AVX-optimized.
Unfortunately it runs only slightly faster on my old Ryzen 1800X,
because the first two generations of Ryzen CPUs split a 256-bit
AVX operation into two 128-bit operations.

void invertSecondBlocked( uint8_t *p, size_t n )
{
    size_t const BLOCKSIZE = AVX_BLOCKS ? 32 : SSE_BLOCKS ? 16 : sizeof(size_t);
    if( !n-- )
        return;
    ++p;
    size_t head = (((intptr_t)p + BLOCKSIZE - 1) & -(intptr_t)BLOCKSIZE) - (intptr_t)p;
    head = head <= n ? head : n;
    for( uint8_t *w = p, *end = p + head; w < end; w += 2 )
        *w = ~*w;
    if( n == head )
        return;
    size_t odd = (size_t)p & 1; // assume size_t or ptrdiff_t is our register width
    n -= head;
    p += head;
#if AVX_BLOCKS != 0
    if( n / 32 )
    {
        union
        {
            uint8_t u8Mask[32] = { 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00,
                                   0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00,
                                   0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00,
                                   0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00 };
            __m256  mask;
        };
        static const union
        {
            uint8_t u8Invert[32] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
                                     0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
                                     0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
                                     0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };
            __m256  invert;
        };
        __m256 *w   = (__m256 *)p,
               *end = w + n / 32;
        if( odd )
            mask = _mm256_xor_ps( mask, invert );
        do
            *w = _mm256_xor_ps( *w, mask );
        while( ++w != end );
        p = (uint8_t *)w;
        n = n % 32;
    }
#elif SSE_BLOCKS != 0
    if( n / 16 )
    {
        union
        {
            uint8_t u8Mask[16] = { 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00,
                                   0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00 };
            __m128  mask;
        };
        static const union
        {
            uint8_t u8Invert[16] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
                                     0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };
            __m128  invert;
        };
        __m128 *w   = (__m128 *)p,
               *end = w + n / 16;
        if( odd )
            mask = _mm_xor_ps( mask, invert );
        do
            *w = _mm_xor_ps( *w, mask );
        while( ++w != end );
        p = (uint8_t *)w;
        n = n % 16;
    }
#else
#endif

Melzzzzz

Mar 30, 2020, 10:53:48 AM
On 2020-03-30, Bonita Montero <Bonita....@gmail.com> wrote:
> This is SSE as well as AVX-optimized.
> Unfortunately this runs only slightly faster on my old Ryzen 1800X
> because the first two generations of Ryzen-CPUs split an 256 bit
> AVX-operation into two 128 bit operations.

Nope, they do full 256 bit, but they can execute only one 256-bit FMA
instruction per clock. I would think that this is because of slow memory
rather than that. Same reason why AVX-512 does not blow AVX out of the
water... There is one more thing: Ryzen first and second gen can execute
four 128-bit instructions per clock, and Intel only two,
so 128-bit SSE2 is fast on Ryzens...

Bonita Montero

Mar 30, 2020, 1:39:25 PM
>> This is SSE as well as AVX-optimized.
>> Unfortunately this runs only slightly faster on my old Ryzen 1800X
>> because the first two generations of Ryzen-CPUs split an 256 bit
>> AVX-operation into two 128 bit operations.
>
> Nope, they do full 256 bit, ...

That's not true. AVX operations on Ryzens before the Ryzen 3xxx
generation have a throughput of 2 clock cycles. Here's a little
test:

C++ part:

#include <iostream>
#include <chrono>
#include <immintrin.h>

using namespace std;
using namespace chrono;

size_t __vectorcall fMul();
size_t __vectorcall fFma();

int main()
{
    time_point<high_resolution_clock> start = high_resolution_clock::now();
    size_t rounds = fMul();
    double ns = (double)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count();
    cout << "ns per avx-mul: " << ns / rounds << endl;
    start = high_resolution_clock::now();
    rounds = fFma();
    ns = (double)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count();
    cout << "ns per avx-fma: " << ns / rounds << endl;
}

Asm part:

PUBLIC ?fMul@@YQ_KXZ
PUBLIC ?fFma@@YQ_KXZ

_TEXT SEGMENT

?fMul@@YQ_KXZ PROC
    vpxor  xmm0, xmm0, xmm0
    vpxor  xmm1, xmm1, xmm1
    mov    rcx, 1000000000 / 10
avxMulLoop:
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    dec    rcx
    jnz    avxMulLoop
    mov    rax, 1000000000
    ret
?fMul@@YQ_KXZ ENDP

?fFma@@YQ_KXZ PROC
    vpxor  xmm0, xmm0, xmm0
    vpxor  xmm1, xmm1, xmm1
    vpxor  xmm2, xmm2, xmm2
    mov    rcx, 1000000000 / 10
avxFmaALoop:
    vfmadd231pd ymm0, ymm1, ymm2
    vfmadd231pd ymm0, ymm1, ymm2
    vfmadd231pd ymm0, ymm1, ymm2
    vfmadd231pd ymm0, ymm1, ymm2
    vfmadd231pd ymm0, ymm1, ymm2
    vfmadd231pd ymm0, ymm1, ymm2
    vfmadd231pd ymm0, ymm1, ymm2
    vfmadd231pd ymm0, ymm1, ymm2
    vfmadd231pd ymm0, ymm1, ymm2
    vfmadd231pd ymm0, ymm1, ymm2
    dec    rcx
    jnz    avxFmaALoop
    mov    rax, 1000000000
    ret
?fFma@@YQ_KXZ ENDP

_TEXT ENDS
END

Both benchmarks report about 1 ns per operation. Assuming that under
no load the CPU reaches about its boost frequency of 4 GHz, that's a
throughput of 2 clock cycles, as I said. With FMA, the throughput is
a bit lower, i.e. the clock seems to be lowered a bit because the
adding and multiplying units are working at the same time.

Melzzzzz

Mar 30, 2020, 1:48:50 PM
On 2020-03-30, Bonita Montero <Bonita....@gmail.com> wrote:
>>> This is SSE as well as AVX-optimized.
>>> Unfortunately this runs only slightly faster on my old Ryzen 1800X
>>> because the first two generations of Ryzen-CPUs split an 256 bit
>>> AVX-operation into two 128 bit operations.
>>
>> Nope, they do full 256 bit, ...
>
> That's not true. AVX-operations on Ryzens before the Ryzen 3xxx
> generation have a throughput of 2 clock-cycles. Here's a little
> test:

Bull. divpd is more than twice as fast on Ryzen as on Haswell,
because until Skylake Intel split divpd and sqrtpd into two 128-bit
operations...

Bonita Montero

Mar 30, 2020, 2:37:40 PM
>> That's not true. AVX-operations on Ryzens before the Ryzen 3xxx
>> generation have a throughput of 2 clock-cycles. Here's a little
>> test:

> Bull. divpd is more then twice faster on Ryzen then on Haswell
> because until Skylake Intel divide divpd and sqrtpd in two 128
> bit operations...

We were talking about the AVX throughput. I said that Ryzen doesn't
have a full 256-bit AVX unit (instead all operations are split into
two pipelined 128-bit halves) and you said that is wrong. I proved
that you were wrong.
Boy, you're so stupid.

red floyd

Mar 30, 2020, 3:59:00 PM
On 3/30/20 3:03 AM, Bonita Montero wrote:
>> Have you measured? It looks pretty complicated for task at hand?
>
> There's no way to make it simpler.
>

Well, you could just go back to the original code, and let the
compiler's optimizer unroll the loop for you...


Bonita Montero

Mar 30, 2020, 10:10:27 PM
>> There's no way to make it simpler.

> Well, you could just go back to the original code, and let the
> compiler's optimizer unroll the loop for you...

The discussion was about the simplicity of the source and not
of the compiled code.

Melzzzzz

Mar 31, 2020, 12:38:24 AM
On 2020-03-30, Bonita Montero <Bonita....@gmail.com> wrote:
I am talking about AVX throughput. 256-bit vdivpd is twice
as fast as on Haswell... my err.

Jorgen Grahn

Mar 31, 2020, 3:34:09 AM
On Tue, 2020-03-31, Melzzzzz wrote:
> On 2020-03-30, Bonita Montero <Bonita....@gmail.com> wrote:
>>>> That's not true. AVX-operations on Ryzens before the Ryzen 3xxx
>>>> generation have a throughput of 2 clock-cycles. Here's a little
>>>> test:
>>
>>> Bull. divpd is more then twice faster on Ryzen then on Haswell
>>> because until Skylake Intel divide divpd and sqrtpd in two 128
>>> bit operations...
>>
>> We were talking about the AVX-thoughput. I said that Ryzen hasn't
>> a full 256 bit AVX-unit (but instead all operations are split in
>> two pipelined 128 bit halves) and you said that is is wrong. I
>> proved that you were wrong.
>> Boy, youre so stupid.
>
> I am talking about AVX throughput. 256-bit vdivpd is twice
> as fast as on Haswell... my err.

Twice as fast if you disregard the data cache, surely?

I haven't paid attention to this thread, but won't memory bandwidth
be the bottleneck in the end anyway?

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Bonita Montero

Mar 31, 2020, 3:51:23 AM
>> I am talking about AVX throughput. 256-bit vdivpd is twice
>> as fast as on Haswell... my err.

> Twice as fast if you disregard the data cache, surely?

Here are the VDIVPD numbers from agner.org for the first-generation
Ryzen and for Haswell: Ryzen: 8-13 cycles latency, 8-9 cycles
throughput; Haswell: 19-35 cycles latency, 18-28 cycles throughput.
But Coffee Lake has much higher performance than Haswell: 13-14
cycles latency, 8 cycles throughput.

Melzzzzz

Mar 31, 2020, 3:59:08 AM
Sure. I am talking about the case when the data is in the cache...
>
> /Jorgen

Melzzzzz

Mar 31, 2020, 4:01:20 AM
This is because since Skylake, Intel has a 256-bit divpd unit.
Ryzen has 128-bit units, but in pairs, so that 256 bits are a single op.
The only thing that holds first-gen Ryzen back is FMA instructions,
as it can execute only one per cycle...

Bonita Montero

Mar 31, 2020, 4:06:46 AM
> This is because since Skylake, Intel has 256 bit divpd.
> Ryzen has 128 bit units but in pairs, so that 256
> bits are single op. ..

Do I have to write a benchmark comparing DIVPD and VDIVPD
on my 1800X?

Melzzzzz

Mar 31, 2020, 4:22:35 AM
On 2020-03-31, Bonita Montero <Bonita....@gmail.com> wrote:
If you wish:
~/.../examples/assembler >>> ./latency
recip1
15.833327168900179712 0.063157919326280312 0.063157919326280328
700.059641597344125328 0.001428449721395557 0.001428449721395557
860.050613320340289648 0.001162722268331821 0.001162722268331821
12.280964395431137600 0.081426829994884368 0.081426829994884448
144.000000 16.920612
108.000000 16.134408
108.000000 16.479540
108.000000 16.828776
144.000000 17.158536
144.000000 17.091432
108.000000 17.163324
144.000000 16.072596
108.000000 17.177688
144.000000 12.160980
72.000000 10.093536
recip2
15.833327168900179712 0.063157919326280328 0.063157919326280328
700.059641597344125328 0.001428449721395557 0.001428449721395557
860.050613320340289648 0.001162722268331821 0.001162722268331821
12.280964395431137600 0.081426829994884448 0.081426829994884448
72.000000 13.325616
72.000000 13.353768
36.000000 13.353624
72.000000 13.296960
72.000000 13.292208
72.000000 13.476024
72.000000 13.329972
72.000000 13.335264
72.000000 13.297500
72.000000 13.315464
72.000000 14.205312
recip3
15.833327168900179712 0.063157919326280328 0.063157919326280328
700.059641597344125328 0.001428449721395557 0.001428449721395557
860.050613320340289648 0.001162722268331821 0.001162722268331821
12.280964395431137600 0.081426829994884448 0.081426829994884448
72.000000 9.000108
72.000000 9.066672
72.000000 9.042948
72.000000 9.023184
72.000000 9.018360
108.000000 9.027612
72.000000 9.032760
72.000000 9.024768
72.000000 9.034740
72.000000 9.000072
72.000000 9.023256


This is a latency bench on my 2700X. recip3 is pure divpd, while recip1
and recip2 are Newton-Raphson approximations.
As you can see, divpd is fastest, unlike on Intel where recip1 is 8
cycles and recip2 is 12 cycles (slow FMA on Ryzen).

~/.../examples/assembler >>> cat latency.asm
; latency test
format elf64
public recip
public recip1
public recip2
public recip3
public _rdtsc
section '.text' executable
N = 1000000
recip:
recip1:
; Load constants and input
vbroadcastsd ymm1, [one]
vpbroadcastq ymm4, [magic]
mov eax, N
.loop:
vmovdqu ymm0, [rdi]
vpsubq ymm2, ymm4, ymm0
vfnmadd213pd ymm0, ymm2, ymm1
vfmadd132pd ymm2, ymm2, ymm0
vmulpd ymm0, ymm0, ymm0
vfmadd132pd ymm2, ymm2, ymm0
vmulpd ymm0, ymm0, ymm0
vfmadd132pd ymm2, ymm2, ymm0
vmulpd ymm0, ymm0, ymm0
vfmadd132pd ymm0, ymm2, ymm2
dec eax
jnz .loop
vmovups [rdi], ymm0
ret

recip2:
; Load constants and input
vbroadcastsd ymm1, [one]
mov eax, N
.loop:
vmovdqu ymm0, [rdi]
vcvtpd2ps xmm2,ymm0
vrcpps xmm2,xmm2
vcvtps2pd ymm2,xmm2
vfnmadd213pd ymm0, ymm2, ymm1
vfmadd132pd ymm2, ymm2, ymm0
vmulpd ymm0, ymm0, ymm0
vfmadd132pd ymm2, ymm2, ymm0
vmulpd ymm0, ymm0, ymm0
vfmadd132pd ymm2, ymm2, ymm0
vmulpd ymm0, ymm0, ymm0
vfmadd132pd ymm0, ymm2, ymm2
dec eax
jnz .loop
vmovups [rdi], ymm0
ret

recip3:
; Load constants and input
vbroadcastsd ymm1, [one]
mov eax, N
.loop:
vmovdqu ymm0, [rdi]
vdivpd ymm0,ymm1,ymm0
dec eax
jnz .loop
vmovups [rdi], ymm0
ret

_rdtsc:
rdtscp
shl rdx, 32
or rax, rdx
ret

section '.data' writeable align 16
align 16
one dq 3FF0000000000000h
magic dq 7FDE6238502484BAh

Bonita Montero

Mar 31, 2020, 4:44:03 AM
Here's my code:

#define NOMINMAX
#if defined(_MSC_VER)
    #include <Windows.h>
#endif
#include <iostream>
#include <chrono>
#include <vector>
#include <random>
#include <limits>
#include <immintrin.h>

using namespace std;
using namespace chrono;

uint64_t DIVPD( uint64_t rounds, __m128d *results, __m128d *dividends,
                __m128d *divisors, size_t n )
{
    for( uint64_t r = rounds; r; --r )
        for( size_t i = 0; i != n; ++i )
            results[i] = _mm_div_pd( dividends[i], divisors[i] );
    return rounds * n;
}

uint64_t VDIVPD( uint64_t rounds, __m256d *results, __m256d *dividends,
                 __m256d *divisors, size_t n )
{
    for( uint64_t r = rounds; r; --r )
        for( size_t i = 0; i != n; ++i )
            results[i] = _mm256_div_pd( dividends[i], divisors[i] );
    return rounds * n;
}

int main()
{
#if defined(_MSC_VER)
    SetPriorityClass( GetCurrentProcess(), HIGH_PRIORITY_CLASS );
    SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_HIGHEST );
#endif
    uint64_t const ROUNDS = 1'000'000;
    size_t   const SIZE   = 1'000;
    random_device rd;
    normal_distribution<double> nd( 0.0,
        (double)numeric_limits<int64_t>::max() / 2 );
    vector<__m128d> fResults,
                    fDividends,
                    fDivisors;
    vector<__m256d> dResults,
                    dDividends,
                    dDivisors;
    double v0, v1, v2, v3;
    fResults.resize( SIZE );
    fDividends.resize( SIZE );
    fDivisors.resize( SIZE );
    dResults.resize( SIZE );
    dDividends.resize( SIZE );
    dDivisors.resize( SIZE );
    for( size_t i = 0; i != SIZE; ++i )
        v0 = nd( rd ),
        v1 = nd( rd ),
        v2 = nd( rd ),
        v3 = nd( rd ),
        fDividends[i].m128d_f64[0] = v0,
        fDividends[i].m128d_f64[1] = v1,
        fDivisors[i].m128d_f64[0]  = v2,
        fDivisors[i].m128d_f64[1]  = v3,
        dDividends[i].m256d_f64[0] = v0,
        dDividends[i].m256d_f64[1] = v1,
        dDividends[i].m256d_f64[2] = v2,
        dDividends[i].m256d_f64[3] = v3,
        dDivisors[i].m256d_f64[0]  = nd( rd ),
        dDivisors[i].m256d_f64[1]  = nd( rd ),
        dDivisors[i].m256d_f64[2]  = nd( rd ),
        dDivisors[i].m256d_f64[3]  = nd( rd );
    time_point<high_resolution_clock> start;
    uint64_t rounds;
    uint64_t ns;
    start  = high_resolution_clock::now();
    rounds = DIVPD( ROUNDS, &fResults[0], &fDividends[0],
                    &fDivisors[0], SIZE );
    ns = (uint64_t)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count();
    cout << (double)ns / rounds << endl;
    start  = high_resolution_clock::now();
    rounds = VDIVPD( ROUNDS, &dResults[0], &dDividends[0],
                     &dDivisors[0], SIZE );
    ns = (uint64_t)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count();
    cout << (double)ns / rounds << endl;
}

On my 1800X, VDIVPD is exactly half as fast as DIVPD because
a 256 bit AVX data word is pumped through a single 128 bit FPU.

Melzzzzz

Mar 31, 2020, 4:46:40 AM
On 2020-03-31, Bonita Montero <Bonita....@gmail.com> wrote:
>
> On my 1800X, VDIVPD is exactly half as fast as DIVPD because
> a 256 bit AVX data word is pumped through a single 128 bit FPU.

Hm, they changed something on the 2700X then.
Did you know that FMA4 also works on Ryzen first/second gen? ;)

Melzzzzz

Mar 31, 2020, 4:50:00 AM
On 2020-03-31, Bonita Montero <Bonita....@gmail.com> wrote:
Another thing: Ryzen first/second gen can execute four 128-bit SSE
instructions per clock (four 128-bit units) but only two 256-bit ones.
That can explain why your benchmark shows a difference...

Bonita Montero

Mar 31, 2020, 4:51:20 AM
>> On my 1800X, VDIVPD is exactly half as fast as DIVPD because
>> a 256 bit AVX data word is pumped through a single 128 bit FPU.

> Hm, they changed something on 2700X then.

No, the change came with the Zen2 / Ryzen 3xxx.

> Did you know that FMA4 also works on Ryzen fst/snd gen? ;)

I just used it, but it's unofficial; CPUID doesn't report support.

Bonita Montero

Mar 31, 2020, 4:53:46 AM
> Another thing. Ryzen fst/send gen can execute 4 128 bit SSE instructions
> per clock (4 128 bit units) and 2 256 bit.

But only one 128 bit mul and add in parallel.

Melzzzzz

Mar 31, 2020, 4:54:46 AM
On 2020-03-31, Bonita Montero <Bonita....@gmail.com> wrote:
Try it. It works 100% ;)

Bonita Montero

Mar 31, 2020, 4:56:20 AM
>> I just used it, but it's inofficial; CPUID doesn't report support.

> Try it. It works 100% ;)

For experiments it is ok, but for real software I won't rely on it.

Melzzzzz

Mar 31, 2020, 4:57:08 AM
On 2020-03-31, Bonita Montero <Bonita....@gmail.com> wrote:
Nope, you execute 1000 in a loop. Try calling them in pairs.

Melzzzzz

Mar 31, 2020, 4:58:09 AM
On 2020-03-31, Bonita Montero <Bonita....@gmail.com> wrote:
For real software, that is what #ifdefs are for ;)

Bonita Montero

Mar 31, 2020, 5:01:26 AM
>> For experiments it is ok, but for real software I won't rely on it.

> For real software that is what #ifdef s are for ;)

No, real software would do a CPUID fork at runtime.
But a feature that isn't officially supported isn't
worth thinking about for real software.
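
A minimal sketch of such a runtime CPUID fork (my illustration, not code from the thread; it uses GCC/Clang's __builtin_cpu_supports on x86, other compilers would query CPUID directly):

#include <cstddef>
#include <cstdint>

static void invertSecondScalar( std::uint8_t *p, std::size_t n )
{
    for( std::size_t i = 1; i < n; i += 2 )
        p[i] = static_cast<std::uint8_t>( ~p[i] );
}

void invertSecondDispatch( std::uint8_t *p, std::size_t n )
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    // Runtime check of the CPUID feature bits.
    if( __builtin_cpu_supports( "avx" ) )
    {
        // An AVX implementation (e.g. the intrinsics version posted earlier)
        // would be called here; this sketch just falls through to scalar.
    }
#endif
    invertSecondScalar( p, n );
}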

Bonita Montero

Mar 31, 2020, 5:02:46 AM
>> But only one 128 bit mul and add in parallel.

> Nope you execute 1000 in a loop. Try calling in pairs.

Change the code yourself; it doesn't matter.
There are not two 128-bit units of the same type on Ryzens.

Melzzzzz

Mar 31, 2020, 5:03:40 AM
On 2020-03-31, Bonita Montero <Bonita....@gmail.com> wrote:
Look, that's your opinion. CPUs have had undocumented features
since the beginning of time...

Melzzzzz

Mar 31, 2020, 5:04:17 AM
On 2020-03-31, Bonita Montero <Bonita....@gmail.com> wrote:
Again wrong. 2 additions and 2 multiplies.

Bonita Montero

Mar 31, 2020, 5:05:41 AM
>> No, real software would make a CPUID-fork at runtime.
>> But a not officialy supported feature isn't worth to
>> think about for real sofware.

> Look that's your opinion. CPU's had undocumented features
> seens begining of time...

Using undocumented features in software which needs to be
reliable is stupid.

Bonita Montero

Mar 31, 2020, 5:07:54 AM
>> Change the code yourself; it doesn't matter.
>> There are no two 128 bit units of the same type with Ryzens.

> Again wrong. 2 additions and 2 multiplies.

There are not two 128-bit units. There are two add and mul units for
single values; both are used in parallel when a 128-bit bundle is
calculated. I have proved this with my benchmark in this thread.

Melzzzzz

Mar 31, 2020, 5:27:30 AM
On 2020-03-31, Bonita Montero <Bonita....@gmail.com> wrote:
Four 128-bit units per core: two adds and two multiplies. You can freely
check at Agner's site...

Melzzzzz

Mar 31, 2020, 5:28:43 AM
On 2020-03-31, Bonita Montero <Bonita....@gmail.com> wrote:
An undocumented CPU feature is reliable. No CPU in production differs
from another of the same model...

Bonita Montero

Mar 31, 2020, 5:31:10 AM
> 4 128 bit units per core. 2 adds and 2 multiplies. You can freely check
> at Agner's site...

That's wrong. Ryzens before Zen2 have 2 * 64-bit add and 2 * 64-bit mul.
My benchmarks prove this. The asm mul benchmark I've given in this
thread is unrolled such that, if you were correct, you would get twice
the throughput it actually gives. But it doesn't.

Bonita Montero

Mar 31, 2020, 5:32:46 AM
> Undocumented CPU feature is reliable. ...

No, it's undocumented and thereby not reliable.

Bonita Montero

Mar 31, 2020, 5:36:30 AM
> 4 128 bit units per core. 2 adds and 2 multiplies. You can freely check
> at Agner's site...

In this code ...

?fMul@@YQ_KXZ PROC
    vpxor  xmm0, xmm0, xmm0
    vpxor  xmm1, xmm1, xmm1
    mov    rcx, 1000000000 / 10
avxMulLoop:
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    vmulpd ymm0, ymm0, ymm1
    dec    rcx
    jnz    avxMulLoop
    mov    rax, 1000000000
    ret
?fMul@@YQ_KXZ ENDP

... the CPU would alternately dispatch the VMULPD instructions
to the alleged two 128-bit mul execution units. But the timings
say otherwise.