Most efficient prefetching distance


Bonita Montero

Oct 1, 2021, 1:48:20 PM

Today's x86 CPUs have a prefetch instruction which loads a
cacheline into a cache level chosen by an instruction parameter.
But I often wondered what the most appropriate prefetching
distance is. So I wrote a program: you give it a maximum block
size and it scans blocks linearly, starting at a block size of
4kB and doubling up to a default of 64MB; you can choose a
larger maximum (a number with a k, m or g suffix; the parameter
can be a float). The prefetching is done with an increasing
distance, from zero (a special case without prefetching) up to
256 cachelines (assuming an L1 cacheline size of 64 bytes,
which has held for all x86 CPUs for decades). It first
CLFLUSHes the cachelines it scans afterwards. It only tests a
prefetching distance up to one fourth of the block size. The
trailing part of the block, where prefetching would reach
beyond the block, is scanned without prefetching to simulate a
common optimization. Each test runs multiple times and the
fastest timing is taken. On Windows it sets the thread affinity
to CPU 0 and raises the priority as high as possible; on Linux
it sets only the affinity, so it is best to run the benchmark
with "nice -20 ./a.out".

So here's the source (C++17):

#if defined(_MSC_VER)
#define NOMINMAX
#include <Windows.h>
#elif defined(__unix__)
#include <unistd.h>
#include <sched.h>
#include <pthread.h>
#endif
#include <iostream>
#include <charconv>
#include <cstdlib>
#include <vector>
#include <cstdint>
#include <limits>
#include <cstring>
#include <cmath>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__)
#include <x86intrin.h>
#endif

using namespace std;

size_t parseSize( char const *str );

int main( int argc, char **argv )
{
static size_t const DEFAULT_SIZE = (size_t)64 * 1024 * 1024;
size_t blockSize = argc >= 2 ? parseSize( argv[1] ) : DEFAULT_SIZE;
if( blockSize == -1 )
return EXIT_FAILURE;
#if defined(_MSC_VER)
// incrementally get more priority until we're denied
SetPriorityClass( GetCurrentProcess(), HIGH_PRIORITY_CLASS );
SetPriorityClass( GetCurrentProcess(), REALTIME_PRIORITY_CLASS );
SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL );
SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_HIGHEST );
SetThreadAffinityMask( GetCurrentThread(), 1 );
#elif defined(__unix__)
cpu_set_t cpuSet;
CPU_ZERO(&cpuSet);
CPU_SET(0, &cpuSet);
pthread_setaffinity_np( pthread_self(), sizeof cpuSet, &cpuSet );
#endif
using vchar_t = vector<char>;
using vchar_it = vchar_t::iterator;
vector<char> block( blockSize );
static size_t const CACHELINE_SIZE = 64;
size_t size = 4096;
do
{
if( size > blockSize )
size = blockSize;
uint64_t fastestTicks = numeric_limits<uint64_t>::max();
unsigned fastestDistance = 0;
size_t nTests = (ptrdiff_t)((double)(8.0 * 1024) / (ptrdiff_t)size *
25.0 + 0.5);
nTests = nTests >= 3 ? nTests : 3;
for( unsigned nClDistance = 0; nClDistance <= 256; ++nClDistance )
{
size_t distance = (size_t)nClDistance * CACHELINE_SIZE;
if( distance > size / 4 )
continue;
for( size_t t = nTests; t; --t )
{
vchar_it it = block.begin();
for( vchar_it end = it + size; it != end; it += CACHELINE_SIZE )
_mm_clflush( &*it );
uint64_t start = __rdtsc();
if( nClDistance )
{
it = block.begin();
for( vchar_it beforeEnd = it + (size - distance); it != beforeEnd;
it += CACHELINE_SIZE )
_mm_prefetch( &*it + distance, _MM_HINT_NTA ),
*(char volatile *)&*it;
for( vchar_it end = block.begin() + size; it != end; it +=
CACHELINE_SIZE )
*(char volatile *)&*it;
}
else
{
it = block.begin();
for( vchar_it end = block.begin() + size; it != end; it +=
CACHELINE_SIZE )
*(char volatile *)&*it;
}
uint64_t ticks = __rdtsc() - start;
if( ticks < fastestTicks )
fastestTicks = ticks,
fastestDistance = nClDistance;
}
}
if( fastestTicks != numeric_limits<uint64_t>::max() )
cout << "block-size: " << size << " fastest distance: " <<
fastestDistance << " cachelines (" << nTests << ")" << endl;
} while( (size *= 2) <= blockSize );
}

size_t parseSize( char const *str )
{
double dSize;
from_chars_result fcr = from_chars( str, str + strlen( str ), dSize,
chars_format::general );
if( fcr.ec != errc() )
return -1;
if( !*(str = fcr.ptr) || str[1] )
return -1;
static const
struct suffix_t
{
char suffix;
size_t mult;
} suffixes[]
{
{ 'k', 1024 },
{ 'm', (size_t)1024 * 1024 },
{ 'g', (size_t)1024 * 1024 * 1024 }
};
char cSuf = tolower( *str );
for( suffix_t const &suf : suffixes )
if( suf.suffix == cSuf )
{
dSize = trunc( dSize * (ptrdiff_t)suf.mult );
if( dSize < 1.0 || dSize >= numeric_limits<ptrdiff_t>::max() )
return -1;
return (ptrdiff_t)dSize;
}
return -1;
}

It would be nice to see your results; please also mention your
CPU and RAM type.

Scott Lurndal

Oct 1, 2021, 2:36:17 PM

Bonita Montero <Bonita....@gmail.com> writes:
>On today's x86-CPUs there is a prefetching-instruction which loads
>cacheline into an cache-level chosable by a parameter for this in-
>struction.

Modern CPUs for the last decade have included automatic prefetchers
in the cache subsystems. Usually a mix of stride-based and/or predictive
fetchers.

It's very seldom necessary for an application to provide an explicit
prefetching hint except in very unusual circumstances. And most
programmers trying to insert hints manually will get it wrong.
The behavior of such is also heavily microarchitecture dependent,
so what works on one chip may really slow things down on another.

Note that they are, after all, hints. The processor need not
actually do anything for a prefetch instruction.

Let the hardware handle it.

Branimir Maksimovic

Oct 1, 2021, 2:58:32 PM

Let her try:
format elf64 executable 3
include 'import64.inc'
interpreter '/lib64/ld-linux-x86-64.so.2'
needed 'libc.so.6'
import printf,atoi,exit

segment executable
entry $
mov r8,100
mov r10,100000
cmp dword[rsp],2
jl .skip
mov rdi, [rsp+16]
call [atoi]
movsxd r8,eax
test r8,r8
jz .errzero
xor edx,edx
mov rax,4000000
idiv r8
mov r10,rax
test r10,r10
jz .errexit
js .errsexit
.skip:
; warm up
imul r9,r8,128
mov rcx,r9
mov rdi,outbuf
mov rsi,inbuf
rep movsb

rdtscp
shl rdx,32
or rax,rdx
mov [r1],rax
mov rbx,r10
@@:
imul r9,r8,128
mov rcx,r9
mov rdi,outbuf
mov rsi,inbuf
rep movsb
dec rbx
jnz @b

rdtscp
shl rdx,32
or rax,rdx
sub rax,[r1]
cvtsi2sd xmm0,rax
cvtsi2sd xmm1,r10
mulsd xmm1,qword[clock]
divsd xmm0,xmm1
movsd [r1],xmm0

rdtscp
shl rdx,32
or rax,rdx
mov [r2],rax
mov rbx,r10
@@:
imul r9,r8,128/8
mov rcx,r9
mov rdi,outbuf
mov rsi,inbuf
rep movsq
dec rbx
jnz @b

rdtscp
shl rdx,32
or rax,rdx
sub rax,[r2]
cvtsi2sd xmm0,rax
cvtsi2sd xmm1,r10
mulsd xmm1,qword[clock]
divsd xmm0,xmm1
movsd [r2],xmm0

rdtscp
shl rdx,32
or rax,rdx
mov [r3],rax
mov rbx,r10
@@:
mov rcx,r8
mov rdi,outbuf
mov rsi,inbuf
.L0:
movdqa xmm0,[rsi]
movdqa xmm1,[rsi+0x10]
movdqa xmm2,[rsi+0x20]
movdqa xmm3,[rsi+0x30]
movdqa xmm4,[rsi+0x40]
movdqa xmm5,[rsi+0x50]
movdqa xmm6,[rsi+0x60]
movdqa xmm7,[rsi+0x70]
movntdq [rdi],xmm0
movntdq [rdi+0x10],xmm1
movntdq [rdi+0x20],xmm2
movntdq [rdi+0x30],xmm3
movntdq [rdi+0x40],xmm4
movntdq [rdi+0x50],xmm5
movntdq [rdi+0x60],xmm6
movntdq [rdi+0x70],xmm7
add rsi,128
add rdi,128
dec rcx
jnz .L0
dec rbx
jnz @b

rdtscp
shl rdx,32
or rax,rdx
sub rax,[r3]
cvtsi2sd xmm0,rax
cvtsi2sd xmm1,r10
mulsd xmm1,qword[clock]
divsd xmm0,xmm1
movsd [r3],xmm0

rdtscp
shl rdx,32
or rax,rdx
mov [r4],rax
mov rbx,r10
@@:
mov rcx,r8
mov rdi,outbuf
mov rsi,inbuf
prefetch [rsi]
prefetch [rsi+0x40]
.L1:
prefetch [rsi+0x80]
prefetch [rsi+0xc0]
movdqa xmm0,[rsi]
movdqa xmm1,[rsi+0x10]
movdqa xmm2,[rsi+0x20]
movdqa xmm3,[rsi+0x30]
movdqa xmm4,[rsi+0x40]
movdqa xmm5,[rsi+0x50]
movdqa xmm6,[rsi+0x60]
movdqa xmm7,[rsi+0x70]
movntdq [rdi],xmm0
movntdq [rdi+0x10],xmm1
movntdq [rdi+0x20],xmm2
movntdq [rdi+0x30],xmm3
movntdq [rdi+0x40],xmm4
movntdq [rdi+0x50],xmm5
movntdq [rdi+0x60],xmm6
movntdq [rdi+0x70],xmm7
add rsi,128
add rdi,128
dec rcx
jnz .L1
dec rbx
jnz @b

rdtscp
shl rdx,32
or rax,rdx
sub rax,[r4]
cvtsi2sd xmm0,rax
cvtsi2sd xmm1,r10
mulsd xmm1,qword[clock]
divsd xmm0,xmm1
movsd [r4],xmm0

rdtscp
shl rdx,32
or rax,rdx
mov [r5],rax
mov rbx,r10
@@:
mov rcx,r8
mov rdi,outbuf
mov rsi,inbuf
prefetch [rsi]
prefetch [rsi+0x40]
.L2:
prefetch [rsi+0x80]
prefetch [rsi+0xc0]
vmovdqa ymm0,[rsi]
vmovdqa ymm1,[rsi+0x20]
vmovdqa ymm2,[rsi+0x40]
vmovdqa ymm3,[rsi+0x60]
vmovntdq [rdi],ymm0
vmovntdq [rdi+0x20],ymm1
vmovntdq [rdi+0x40],ymm2
vmovntdq [rdi+0x60],ymm3
add rsi,128
add rdi,128
dec rcx
jnz .L2
dec rbx
jnz @b

rdtscp
shl rdx,32
or rax,rdx
sub rax,[r5]
cvtsi2sd xmm0,rax
cvtsi2sd xmm1,r10
mulsd xmm1,qword[clock]
divsd xmm0,xmm1
movsd [r5],xmm0

mov rdi,fmth
mov rsi,r8
mov rdx,r10
xor eax,eax
call [printf]

mov rdi,fmt
mov rsi,fmtmovsb
movsd xmm0, [r1]
mov eax,1
call [printf]

mov rdi,fmt
mov rsi,fmtmovsq
movsd xmm0, [r2]
mov eax,1
call [printf]

mov rdi,fmt
mov rsi,fmtmovntdq
movsd xmm0, [r3]
mov eax,1
call [printf]

mov rdi,fmt
mov rsi,fmtmovntdqp
movsd xmm0, [r4]
mov eax,1
call [printf]

mov rdi,fmt
mov rsi,fmtmovntdqy
movsd xmm0, [r5]
mov eax,1
call [printf]

call [exit]
.errexit:
mov rdi,fmtbig
jmp .next
.errsexit:
mov rdi,fmtsgn
jmp .next
.errzero:
mov rdi,fmtzero
.next:
mov rsi,r8
xor eax,eax
call [printf]
xor edi,edi
call [exit]

segment readable
fmt db '%-32s%16.14f',0ah,0
fmtmovsb db 'rep movsb',0
fmtmovsq db 'rep movsq',0
fmtmovntdq db 'movntdq',0
fmtmovntdqp db 'movntdq prefetch',0
fmtmovntdqy db 'movntdq prefetch ymm',0
fmth db '%d 128 byte blocks, loops:%d',0ah,0
fmtbig db 'value of %d is too big, maximum value is 4000000',0ah,0
fmtsgn db 'value of %d is negative, should be positive',0ah,0
fmtzero db 'nothing to do 0',0ah,0
align 8
clock dq 3.8e9
segment writeable
align 32
inbuf rb 4000000*128
outbuf rb 4000000*128
r1 rq 1
r2 rq 1
r3 rq 1
r4 rq 1
r5 rq 1


--

7-77-777
Evil Sinner!

Bonita Montero

Oct 2, 2021, 1:09:26 AM

> Modern CPUs for the last decade have included automatic prefetchers
> in the cache subsystems. Usually a mix of stride-based and/or predictive
> fetchers.

If they were better, my program would report zero prefetching
as the best result. And there would be no prefetch instructions
at all.

> It's very seldom necessary for an application to provide an
> explicit prefetching hint except in very unusual circumstances.

Automatic prefetchers are dumb.

Chris M. Thomasson

Oct 2, 2021, 2:15:34 AM

Oh.... shit. You make me feel like a full blown moron for even
responding to you, Bonita. YIKES! Let me guess, you agree with me, and
say I am stupid for responding to you. ;^)

lol.

Bonita Montero

Oct 2, 2021, 2:40:37 AM

This is my improved probing algorithm. It also compares no
prefetching against the fastest prefetching distance. I get an
average improvement of 10% and a best improvement of 30% when I
test up to 256m:

#if defined(_MSC_VER)
#define NOMINMAX
#include <Windows.h>
#elif defined(__unix__)
#include <unistd.h>
#include <sched.h>
#include <pthread.h>
#endif
#include <iostream>
#include <charconv>
#include <cstdlib>
#include <vector>
#include <cstdint>
#include <limits>
#include <cstring>
#include <cmath>
#include <sstream>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__)
#include <x86intrin.h>
#endif

using namespace std;

size_t parseSize( char const *str );
string blockSizeStr( size_t blockSize );

int main( int argc, char **argv )
{
static size_t const DEFAULT_SIZE = (size_t)64 * 1024 * 1024;
size_t blockSize = argc >= 2 ? parseSize( argv[1] ) : DEFAULT_SIZE;
if( blockSize == -1 )
return EXIT_FAILURE;
#if defined(_MSC_VER)
// incrementally get more priority until we're denied
SetPriorityClass( GetCurrentProcess(), HIGH_PRIORITY_CLASS );
SetPriorityClass( GetCurrentProcess(), REALTIME_PRIORITY_CLASS );
SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL );
SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_HIGHEST );
SetThreadAffinityMask( GetCurrentThread(), 1 );
#elif defined(__unix__)
cpu_set_t cpuSet;
CPU_ZERO(&cpuSet);
CPU_SET(0, &cpuSet);
pthread_setaffinity_np( pthread_self(), sizeof cpuSet, &cpuSet );
#endif
using vchar_t = vector<char>;
vector<char> block( blockSize );
char *begin = &*block.begin();
static size_t const CACHELINE_SIZE = 64;
size_t size = 4096;
double fastestImprovement = 0.0, avgImprovement = 0.0;
int avgDiv = 0;
do
{
if( size > blockSize )
size = blockSize;
uint64_t fastestTicks = numeric_limits<uint64_t>::max();
unsigned fastestDistance = 0;
uint64_t fastestZeroTicks = fastestTicks;
size_t nTests = (ptrdiff_t)((double)(8.0 * 1024) / (ptrdiff_t)size *
25.0 + 0.5);
nTests = nTests >= 3 ? nTests : 3;
bool hadTest = false;
for( unsigned nClDistance = 0; nClDistance <= 256; ++nClDistance )
{
size_t distance = (size_t)nClDistance * CACHELINE_SIZE;
if( distance > size / 4 )
continue;
hadTest = true;
for( size_t t = nTests; t; --t )
{
for( char *p = begin, *end = p + size; p != end; p += CACHELINE_SIZE )
_mm_clflush( p );
uint64_t start = __rdtsc();
if( nClDistance )
{
char *p = begin;
for( char *end = begin + size - distance; p < end; p +=
CACHELINE_SIZE )
_mm_prefetch( p, _MM_HINT_NTA ),
*(char volatile *)p;
for( char *end = begin + size; p != end; p += CACHELINE_SIZE )
*(char volatile *)p;
}
else
{
for( char *p = begin, *end = begin + size; p != end; p +=
CACHELINE_SIZE )
*(char volatile *)p;
}
uint64_t ticks = __rdtsc() - start;
if( ticks < fastestTicks )
fastestTicks = ticks,
fastestDistance = nClDistance;
if( !nClDistance && ticks < fastestZeroTicks )
fastestZeroTicks = ticks;
}
}
double improvement = (double)(int64_t)fastestZeroTicks /
(int64_t)fastestTicks - 1.0;
if( fastestTicks != numeric_limits<uint64_t>::max() )
cout << "block-size: " << blockSizeStr( size ),
cout << " fastest distance: " << fastestDistance,
cout << " cachelines (" << nTests << ") (",
cout << improvement * 100.0 << "%)" << endl;
avgImprovement += improvement;
fastestImprovement = improvement > fastestImprovement ? improvement :
fastestImprovement;
avgDiv += hadTest;
} while( (size *= 2) <= blockSize );
avgImprovement /= (double)avgDiv;
cout << "fastest improvement: " << fastestImprovement * 100.0 << "%" <<
endl;
cout << "avg. improvement: " << avgImprovement * 100.0 << "%" << endl;
}

size_t parseSize( char const *str )
{
double dSize;
from_chars_result fcr = from_chars( str, str + strlen( str ), dSize,
chars_format::general );
if( fcr.ec != errc() )
return -1;
if( !*(str = fcr.ptr) || str[1] )
return -1;
static const
struct suffix_t
{
char suffix;
size_t mult;
} suffixes[]
{
{ 'k', 1024 },
{ 'm', (size_t)1024 * 1024 },
{ 'g', (size_t)1024 * 1024 * 1024 }
};
char cSuf = tolower( *str );
for( suffix_t const &suf : suffixes )
if( suf.suffix == cSuf )
{
dSize = trunc( dSize * (ptrdiff_t)suf.mult );
if( dSize < 1.0 || dSize >= (double)numeric_limits<ptrdiff_t>::max() )
return -1;
return (ptrdiff_t)dSize;
}
return -1;
}

string blockSizeStr( size_t blockSize )
{
ostringstream oss;
if( blockSize < 1024 )
oss << blockSize;
else if( blockSize < (size_t)1024 * 1024 )
oss << (double)blockSize / 1024.0 << "kB";
else if( blockSize < (size_t)1024 * 1024 * 1024 )
oss << (double)blockSize / 1024.0 / 1024.0 << "MB";
else
oss << (double)blockSize / 1024.0 / 1024.0 / 1024.0 << "GB";
return oss.str();
}

Branimir Maksimovic

Oct 2, 2021, 7:02:25 AM

On 2021-10-02, Chris M. Thomasson <chris.m.t...@gmail.com> wrote:
>> Automatic prefetchers are dumb.
>
> Oh.... shit. You make me feel like a full blown moron for even
> responding to you, Bonita. YIKES! Let me guess, you agree with me, and
> say I am stupid for responding to you. ;^)
>
> lol.
Enlightenment is when you realise that everything that happens to you comes
from your own beliefs, good or bad, and when you realise that you transfer
that to others by convincing them, good or bad, you start to convince only
in good, or stop completely, which is even better :P ME

--

7-77-777
Evil Sinner!
https://github.com/rofl0r/chaos-pp

Marcel Mueller

Oct 3, 2021, 9:34:19 AM

Am 01.10.21 um 20:36 schrieb Scott Lurndal:
> It's very seldom necessary for an application to provide an explicit
> prefetching hint except in very unusual circumstances. And most
> programmers trying to insert hints manually will get it wrong.
> The behavior of such is also heavily microarchitecture dependent,
> so what works on one chip may really slow things down on another.

I can confirm this.
I did several tests with __builtin_prefetch to reduce the collision rate
in lock-free algorithms. While this worked on some platforms, it did not
work or was even counterproductive on others. I doubt that prefetching
is of any use in /platform-independent code/.
Of course, if you are coding only for a specific set of similar
platforms the situation changes.


Marcel

Bonita Montero

Oct 3, 2021, 10:00:31 AM

Am 03.10.2021 um 15:33 schrieb Marcel Mueller:
> Am 01.10.21 um 20:36 schrieb Scott Lurndal:
>> It's very seldom necessary for an application to provide an explicit
>> prefetching hint except in very unusual circumstances.  And most
>> programmers trying to insert hints manually will get it wrong.
>> The behavior of such is also heavily microarchitecture dependent,
>> so what works on one chip may really slow things down on another.
>
> I can confirm this.
> I did several tests with __builtin_prefetch to reduce the collision rate
> in lock free algorithms. ...

Why should a lock-free algorithm employ prefetching?

Bonita Montero

Oct 4, 2021, 10:31:01 AM

There's the Unix command wc, which counts words and lines. The
wc implementation in the current GNU core utilities contains an
optional, very tricky AVX implementation. This improves the
speed of wc on my Linux computer by a factor of 29.
I improved this algorithm further by partitioning the data into
three parts which I handle interleaved, i.e. I process 32-byte
chunks from each part in lockstep and then increment the common
offset by 32. The original algorithm, in contrast, has a
dependency chain which limits out-of-order execution. I
partitioned the data into three rather than four parts because
there wouldn't be enough integer registers - I need 14 in my
interleaving loop. With four parts I'd get a lot of register
spilling and reloading, which gives performance similar to the
original algorithm.
The speedup of the interleaved code over the original
wc algorithm is about 60%.
The reason I mention this is that I benchmarked the code under
different conditions. I also improved the trivial algorithm
with prefetching and partitioning, and either can be switched
on like in the AVX code. These are the results on my Linux
Ryzen 7 1800X:

trivial / non-interleaved / non-prefetched
1 thread: 468MB/s
trivial / non-interleaved / prefetched
1 thread: 492MB/s
trivial / interleaved / non-prefetched
1 thread: 778MB/s
trivial / interleaved / prefetched
1 thread: 694MB/s
AVX / non-interleaved / non-prefetched
1 thread: 13731MB/s
AVX / non-interleaved / prefetched
1 thread: 13757MB/s
AVX / interleaved / non-prefetched
1 thread: 19722MB/s
AVX / interleaved / prefetched
1 thread: 23558MB/s

As you can see, manual prefetching gives only a small gain for
the trivial non-interleaved code, and the trivial interleaved /
prefetched code is even slower than the trivial interleaved /
non-prefetched code. But for the AVX code there's a significant
speedup of the interleaved / prefetched code over the
interleaved / non-prefetched code. So there are cases where
prefetching gives a significant speedup.
With interleaving there are more complex memory access patterns,
and I suspect that the hardware prefetcher doesn't work as well
under such conditions.

If you're interested in the code: the relevant functions are the
lambdas trivialSpaceCount and avxSpaceCount. The code compiles
only as C++20.

#if defined(_MSC_VER)
#include <Windows.h>
#elif defined(__unix__)
#include <pthread.h>
#endif
#include <iostream>
#include <utility>
#include <fstream>
#include <vector>
#include <cstdint>
#include <algorithm>
#include <chrono>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <atomic>
#include <vector>
#include <cstdlib>
#include <charconv>
#include <cmath>
#include <sstream>
#include <limits>
#include <cctype>
#include <functional>
#include <array>
#include <string.h>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__)
#include <x86intrin.h>
#include <cpuid.h>
#endif

#if defined(_MSC_VER)
#pragma warning(disable: 26495)
#endif

using namespace std;
using namespace chrono;

struct cmline_params
{
char const *fileName;
size_t blockSize;
unsigned nCPUs;
bool invert;
enum class priority_t : unsigned
{
UNSET, NORMAL, HIGH, REALTIME, BEST_AS_CAN
} priority;
vector<string> parse( int argc, char const *const *argv );
};

static void setThreadAffinity( thread::native_handle_type handle,
unsigned affinity );
static unsigned popCnt32( uint32_t value );
static vector<char> readFileRepeated( char const *fileName, size_t
blockSize );
static int xstricmp( char const *a, char const *b );

int main( int argc, char **argv )
{
cmline_params params;
vector<string> errs( params.parse( argc, argv ) );
if( errs.size() )
{
for( string &err : errs )
cout << err << endl;
return EXIT_FAILURE;
}
#if defined(_MSC_VER)
if( params.priority != cmline_params::priority_t::UNSET )
{
auto setPriority = []( DWORD dwPriorityClass )
{
// SetPriorityClass always returns false !
SetPriorityClass( GetCurrentProcess(), dwPriorityClass );
return GetPriorityClass( GetCurrentProcess() ) == dwPriorityClass;
};
static const
struct prio_map_t
{
cmline_params::priority_t priority;
DWORD dwPriorityClass;
} prioMappings[] =
{
{ cmline_params::priority_t::NORMAL, NORMAL_PRIORITY_CLASS },
{ cmline_params::priority_t::HIGH, HIGH_PRIORITY_CLASS },
{ cmline_params::priority_t::REALTIME, REALTIME_PRIORITY_CLASS }
};
DWORD dwPriorityClass = -1;
for( prio_map_t const &pm : prioMappings )
if( pm.priority == params.priority )
{
dwPriorityClass = pm.dwPriorityClass;
break;
}
if( dwPriorityClass != -1 )
if( !setPriority( dwPriorityClass ) )
return EXIT_FAILURE;
else;
else
{
ptrdiff_t p = 2;
bool succ;
do
succ = setPriority( prioMappings[p].dwPriorityClass );
while( !succ && --p >= 0 );
}
}
#endif
vector<char> block;
try
{
if( !(block = readFileRepeated( params.fileName, params.blockSize
)).size() )
throw 123;
}
catch( ... )
{
cout << "error reading file" << endl;
return EXIT_FAILURE;
}
struct words_and_lines
{
size_t words, lines;
words_and_lines( size_t words = 0, size_t lines = 0 ) :
words( words ), lines( lines )
{
}
};
using count_fn_t = void (*)( words_and_lines &, char *, size_t, bool,
bool * );
struct state_t
{
char *mem;
bool wasSpace;
words_and_lines counters;
state_t( char *mem, bool wasSpace, words_and_lines const &counters ) :
mem( mem ),
wasSpace( wasSpace ),
counters( counters )
{
}
};
static size_t const PREFETCH_DISTANCE = 32 * 64;
static
auto trivialSpaceCount = []( bool interleave, bool prefetch,
words_and_lines &counters, char *mem, size_t count, bool extend, bool
*pWasSpace )
{
bool wasSpace = pWasSpace ? *pWasSpace : false;
if( !count )
{
counters.words += !extend && !wasSpace;
return;
}
auto stateBlock = [&]<bool prefetch>( state_t &state, size_t offset )
{
if constexpr( prefetch )
_mm_prefetch( &state.mem[offset] + PREFETCH_DISTANCE, _MM_HINT_NTA );
bool isSpace = (unsigned char)state.mem[offset] <= ' ';
state.counters.words += isSpace && !state.wasSpace;
state.counters.lines += state.mem[offset] == '\n';
state.wasSpace = isSpace;
};
if( interleave && count >= 3 )
{
size_t partitionSize = count / 3;
char *ends[] = { mem + partitionSize, mem + partitionSize * 2 };
state_t states[] =
{
state_t( mem, wasSpace, words_and_lines() ),
state_t( ends[0], (unsigned char)ends[0][-1] <= ' ', words_and_lines() ),
state_t( ends[1], (unsigned char)ends[1][-1] <= ' ', words_and_lines() ),
};
size_t offset = 0;
if( prefetch )
for( ; (ptrdiff_t)offset < (ptrdiff_t)(partitionSize -
PREFETCH_DISTANCE); ++offset )
stateBlock.operator ()<true>( states[0], offset ),
stateBlock.operator ()<true>( states[1], offset ),
stateBlock.operator ()<true>( states[2], offset );
for( ; offset != partitionSize; ++offset )
stateBlock.operator ()<false>( states[0], offset ),
stateBlock.operator ()<false>( states[1], offset ),
stateBlock.operator ()<false>( states[2], offset );
mem += partitionSize * 3;
count -= partitionSize * 3;
counters.words += states[0].counters.words + states[1].counters.words
+ states[2].counters.words;
counters.lines += states[0].counters.lines + states[1].counters.lines
+ states[2].counters.lines;
wasSpace = states[2].wasSpace;
}
if( count )
{
state_t state( mem, wasSpace, counters );
size_t offset = 0;
if( prefetch )
for( ; (ptrdiff_t)offset < (ptrdiff_t)(count - PREFETCH_DISTANCE);
++offset )
stateBlock.operator ()<true>( state, offset );
for( ; offset != count; ++offset )
stateBlock.operator ()<false>( state, offset );
counters = state.counters;
}
if( pWasSpace )
*pWasSpace = wasSpace;
};
static
auto avxSpaceCount = []( bool interleave, bool prefetch,
words_and_lines &counters, char *mem, size_t count, bool extend, bool
*pWasSpace )
{
bool wasSpace = pWasSpace ? *pWasSpace : false;
if( !count )
{
counters.words += !extend && !wasSpace;
return;
}
size_t prefix = ((size_t)mem + 31 & -32) - (size_t)mem;
prefix = prefix <= count ? prefix : count;
trivialSpaceCount( interleave, prefetch, counters, mem, prefix, count
> prefix || extend, &wasSpace );
mem += prefix;
count -= prefix;
if( count >= 32 )
{
__m256i spaces = _mm256_set1_epi8( ' ' + 1 ),
newlines = _mm256_set1_epi8( '\n' );
auto stateBlock = [&]<bool prefetch>( state_t &state, size_t offset )
{
if constexpr( prefetch )
_mm_prefetch( &state.mem[offset] + PREFETCH_DISTANCE, _MM_HINT_NTA );
__m256i chunk = _mm256_load_si256( (__m256i *)&state.mem[offset] );
uint32_t isSpaceMask = _mm256_movemask_epi8( _mm256_andnot_si256(
chunk, _mm256_sub_epi8( chunk, spaces ) ) ),
wasSpaceMask = isSpaceMask << 1 | (uint32_t)state.wasSpace,
newlineMask = _mm256_movemask_epi8( _mm256_cmpeq_epi8(
chunk, newlines ) );
state.counters.words += popCnt32( isSpaceMask & ~wasSpaceMask );
state.counters.lines += popCnt32( newlineMask );
state.wasSpace = (int32_t)isSpaceMask < 0 ? 1 : 0;
};
if( interleave && count >= (3 * 32) )
{
size_t partitionSize = count / (3 * 32) * 32;
char *ends[] = { mem + partitionSize, mem + partitionSize * 2 };
state_t states[] =
{
state_t( mem, wasSpace, words_and_lines() ),
state_t( ends[0], (unsigned char)ends[0][-1] <= ' ', words_and_lines() ),
state_t( ends[1], (unsigned char)ends[1][-1] <= ' ', words_and_lines() ),
};
size_t offset = 0;
if( prefetch )
// with prefetching
for( ; (ptrdiff_t)offset < (ptrdiff_t)(partitionSize -
PREFETCH_DISTANCE); offset += 32 )
stateBlock.operator ()<true>( states[0], offset ),
stateBlock.operator ()<true>( states[1], offset ),
stateBlock.operator ()<true>( states[2], offset );
// without prefetching
for( ; offset != partitionSize; offset += 32 )
stateBlock.operator ()<false>( states[0], offset ),
stateBlock.operator ()<false>( states[1], offset ),
stateBlock.operator ()<false>( states[2], offset );
mem += partitionSize * 3;
count -= partitionSize * 3;
counters.words += states[0].counters.words +
states[1].counters.words + states[2].counters.words;
counters.lines += states[0].counters.lines +
states[1].counters.lines + states[2].counters.lines;
wasSpace = states[2].wasSpace;
}
if( count >= 32 )
{
state_t state( mem, wasSpace, counters );
size_t offset = 0;
do
stateBlock.operator ()<false>( state, offset );
while( (offset += 32) != (count & -32) );
mem += count & -32;
count %= 32;
counters = state.counters;
wasSpace = state.wasSpace;
}
}
trivialSpaceCount( interleave, prefetch, counters, mem, count, extend,
&wasSpace );
if( pWasSpace )
*pWasSpace = wasSpace;
};
using spacecount_fn_t = function<void( bool, bool, words_and_lines &,
char *, size_t, bool, bool * )>;
struct descr_count_fn
{
char const *descr;
spacecount_fn_t countFn;
};
static
array<descr_count_fn, 2> const descrCountFns(
{
{ "trivial", bind( trivialSpaceCount, placeholders::_1,
placeholders::_2, placeholders::_3, placeholders::_4, placeholders::_5,
placeholders::_6, placeholders::_7 ) },
{ "AVX", bind( avxSpaceCount, placeholders::_1, placeholders::_2,
placeholders::_3, placeholders::_4, placeholders::_5, placeholders::_6,
placeholders::_7 ) }
} );
mutex mtx;
unsigned ready;
condition_variable cvRready;
bool run;
condition_variable cvRun;
atomic_int64_t sumDur;
auto theThread = [&]( bool interleave, bool prefetch, spacecount_fn_t
const &countFn, char *mem, size_t blockSize, size_t repeats )
{
unique_lock<mutex> lock( mtx );
if( !--ready )
cvRready.notify_one();
cvRun.wait( lock, [&]() -> bool { return run; } );
lock.unlock();
auto start = high_resolution_clock::now();
size_t volatile sum = 0;
words_and_lines wordsAndLines;
for( size_t r = repeats; r; --r )
{
wordsAndLines = words_and_lines();
sum = 0;
countFn( interleave, prefetch, wordsAndLines, mem, blockSize, false,
nullptr );
sum += wordsAndLines.words + wordsAndLines.lines;
}
sumDur += (int64_t)duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count();
};
vector<thread> threads;
threads.reserve( params.nCPUs );
#if defined(NDEBUG)
double const MBS = 256.0;
#else
double const MBS = 1.0;
#endif
size_t repeats = (ptrdiff_t)(MBS * 1000 * 1000 /
(ptrdiff_t)params.blockSize + 0.5);
repeats += repeats == 0;
unsigned hc = thread::hardware_concurrency();
using vresult_t = vector<double>;
using vvresult_t = vector<vresult_t>;
for( descr_count_fn const &dfn : descrCountFns )
{
for( unsigned interleave = 0; interleave <= 1; ++interleave )
for( unsigned prefetch = 0; prefetch <= 1; ++prefetch )
{
std::cout << dfn.descr;

cout << (!interleave ? " / non-interleaved" : " / interleaved");
cout << (!prefetch ? " / non-prefetched" : " / prefetched ") << endl;
for( unsigned nThreads = 1; nThreads <= params.nCPUs; ++nThreads )
{
ready = nThreads;
run = false;
sumDur = 0;
threads.resize( 0 );
for( unsigned t = 0; t != nThreads; ++t )
{
threads.emplace_back( theThread, (bool)interleave, (bool)prefetch,
ref( dfn.countFn ), &block[0], params.blockSize, repeats );
unsigned affinity = !params.invert ? t : (t % 2) * (hc / 2) + t / 2;
setThreadAffinity( threads.back().native_handle(), affinity );
}
unique_lock<mutex> lock( mtx );
cvRready.wait( lock, [&]() -> bool { return !ready; } );
run = true;
cvRun.notify_all();
lock.unlock();
for( thread &thr : threads )
thr.join();
static double const MEGABYTE = 1000.0 * 1000.0;
double secs = sumDur / (1.0e9 * nThreads),
mbsPerSec = ((double)nThreads * (ptrdiff_t)params.blockSize *
(ptrdiff_t)repeats / MEGABYTE) / secs;
std::cout << "\t\t" << nThreads << (nThreads > 1 ? " threads: " :
" thread: ") << (int64_t)(mbsPerSec + 0.5) << "MB/s";
cout << endl;
}
}
}
}

inline
unsigned popCnt32( uint32_t value )
{
#if defined(_MSC_VER)
return __popcnt( value );
#elif defined(__GNUC__)
return __builtin_popcount( value );
#endif
}

vector<string> cmline_params::parse( int argc, char const *const *argv )
{
vector<string> errs;
unsigned hc = thread::hardware_concurrency();
if( hc )
{
fileName = nullptr;
blockSize = (size_t)256 * 1024 * 1024;
nCPUs = hc;
invert = false;
priority = priority_t::UNSET;
}
else
errs.emplace_back( "thread::hardware_concurrency() == 0" );
char const *const *param = argv + 1,
*const *paramEnd = argv + argc;
auto addErrString = [&]( char const *prefix, char const *param )
{
ostringstream oss;
oss << ": \"" << param << "\"";
errs.emplace_back( prefix + oss.str() );
};
while( param < paramEnd )
{
if( xstricmp( *param, "--file" ) == 0 )
{
if( ++param == paramEnd )
{
errs.emplace_back( "supply filename !" );
goto ret;
}
fileName = *param++;
continue;
}
if( xstricmp( *param, "--size" ) == 0 )
{
if( ++param == paramEnd )
{
errs.emplace_back( "supply size !" );
goto ret;
}
char const *sizeParam = *param;
double dSizeParam;
from_chars_result fcr = from_chars( sizeParam, sizeParam + strlen(
sizeParam ), dSizeParam, chars_format::general );
auto invalidSize = [&]()
{
addErrString( "invalid size", *param );
};
if( fcr.ec == errc() && (dSizeParam = trunc( dSizeParam )) >= 1.0 )
{
char const *suffixPtr = fcr.ptr;
static const
struct suffix_mult
{
char suffix;
size_t mult;
} sms[]
{
{ 'g', (size_t)1000 * 1000 * 1000 },
{ 'm', (size_t)1000 * 1000 },
{ 'k', (size_t)1000},
{ 'b', (size_t)1 }
};
suffix_mult const *pSm = nullptr;
if( *suffixPtr )
{
char suffix = tolower( *suffixPtr );
for( suffix_mult const &sm : sms )
if( suffix == sm.suffix )
{
pSm = &sm;
break;
}
}
static
auto dblSizeCvt = []( double dbl, size_t &st ) -> bool
{
if( dbl >= (double)((int64_t)1 << 53) )
return false;
st = (ptrdiff_t)dbl;
return true;
};
if( pSm )
{
if( suffixPtr[1] )
invalidSize();
else if( !dblSizeCvt( dSizeParam * (ptrdiff_t)pSm->mult, blockSize ) )
invalidSize();
}
else if( *suffixPtr )
invalidSize();
else if( !dblSizeCvt( dSizeParam, blockSize ) )
invalidSize();
}
else
invalidSize();
++param;
continue;
}
if( xstricmp( *param, "--bound" ) == 0 )
{
if( ++param == paramEnd )
{
errs.emplace_back( "supply CPU-bound !" );
goto ret;
}
unsigned cpuBound = -1;
from_chars_result fcr = from_chars( *param, *param + strlen( *param
), cpuBound );
if( fcr.ec == errc() && !*fcr.ptr )
nCPUs = nCPUs <= cpuBound ? nCPUs : cpuBound;
else
addErrString( "invalid CPU-bound", *param );
++param;
continue;
}
#if defined(_MSC_VER)
static const
struct str_prio
{
char const *prioStr;
priority_t priority;
} prios[] =
{
{ "--normal", priority_t::NORMAL },
{ "--high", priority_t::HIGH },
{ "--realtime", priority_t::REALTIME },
{ "--best", priority_t::BEST_AS_CAN },
};
bool prioSet = false;
for( str_prio const &strPrio : prios )
if( xstricmp( *param, strPrio.prioStr ) == 0 )
{
priority = strPrio.priority;
++param;
prioSet = true;
break;
}
if( prioSet )
continue;
#endif
if( xstricmp( *param, "--invert" ) == 0 )
{
++param;
int cpuIdRegs[2][4];
#if defined(_MSC_VER)
__cpuid( cpuIdRegs[0], 0 );
__cpuid( cpuIdRegs[1], 1 );
#elif defined(__GNUC__)
__cpuid(0, cpuIdRegs[0][0], cpuIdRegs[0][1], cpuIdRegs[0][2],
cpuIdRegs[0][3]);
__cpuid(1, cpuIdRegs[1][0], cpuIdRegs[1][1], cpuIdRegs[1][2],
cpuIdRegs[1][3]);
#endif
if( (unsigned)cpuIdRegs[0][0] < 1 || !((unsigned)cpuIdRegs[1][3] & 1
<< 28) )
{
errs.emplace_back( "inversion impossible - CPU has no SMT" );
continue;
}
invert = true;
continue;
}
addErrString( "invalid option", *param++ );
}
ret:
if( !fileName )
errs.insert( errs.begin(), "supply filename !" );
return errs;
}

static
vector<char> readFileRepeated( char const *fileName, size_t blockSize )
{
if( !blockSize )
return vector<char>();
ifstream ifs;
ifs.exceptions( ifstream::failbit | ifstream::badbit );
ifs.open( fileName, ifstream::binary );
ifs.seekg( 0, ios_base::end );
streampos fileSize = ifs.tellg();
if( !fileSize || fileSize > (size_t)-1 )
return vector<char>();
ifs.seekg( 0, ios_base::beg );
vector<char> block( blockSize, 0 );
size_t repSize = (size_t)fileSize <= blockSize ? (size_t)fileSize :
blockSize;
ifs.read( &*block.begin(), repSize );
bool lastNewline = block[repSize - 1] == '\n';
size_t remaining = block.size() - repSize;
do
{
size_t cpy = remaining >= repSize ? repSize : remaining;
copy( block.begin(), block.begin() + cpy, block.end() - remaining );
remaining -= cpy;
if( !lastNewline && remaining )
block.end()[-(ptrdiff_t)remaining--] = '\n';
} while( remaining );
return block;
}

static
void setThreadAffinity( thread::native_handle_type handle, unsigned
affinity )
{
#if defined(_MSC_VER)
SetThreadAffinityMask( handle, (DWORD_PTR)1 << affinity );
#elif defined(__unix__)
cpu_set_t cpuSet;
CPU_ZERO(&cpuSet);
CPU_SET(affinity, &cpuSet);
pthread_setaffinity_np( handle, sizeof cpuSet, &cpuSet );
#endif
}

inline
int xstricmp( char const *a, char const *b )
{
using uchar_t = unsigned char;
uchar_t lA, lB;
for( size_t i = 0; a[i] | b[i]; ++i )
if( (lA = tolower( (uchar_t)a[i] )) != (lB = tolower( (uchar_t)b[i] )) )
return (int)lA - (int)lB;
return 0;
}

Branimir Maksimovic

unread,
Oct 4, 2021, 12:36:40 PMOct 4
to
On 2021-10-04, Bonita Montero <Bonita....@gmail.com> wrote:
> There's the Unix-command wc which counts words and lines. And the
> wc-implementation from the current GNU core utilities contain an
> optional very tricky AVX-implementation. This improves the speed
> of wc on my Linux-computer by factor 29.
> I improved this algorithm further to partition the data in three
> parts which I handle interleaved, i.e. 32-byte-chunks synchronously
>
Talking about efficiency :P
Who will pay you for overcomplicating simple things?

--

7-77-777
Evil Sinner!
to weak you should be meek, and you should brainfuck stronger
https://github.com/rofl0r/chaos-pp

Bonita Montero

unread,
Oct 4, 2021, 12:55:47 PMOct 4
to
Why should this be overcomplicated ? I'm repeatedly copying a
file into a buffer until it is full; maybe not even once
fully if the file doesn't fit in the buffer's maximum size.
That's the most direct way.

Marcel Mueller

unread,
Oct 4, 2021, 3:59:38 PMOct 4
to
On 03.10.21 at 16:00, Bonita Montero wrote:
Prefetch can access invalid memory. So prefetching a shared memory area
behind a pointer can significantly decrease the probability of failed
CAS when implementing strong thread safety. But on some platforms I
observed excessive cachline hopping with this strategy.


Marcel

Branimir Maksimovic

unread,
Oct 4, 2021, 4:23:06 PMOct 4
to
take a look at this simple and professionally done
program that does all that :P
(Ian Collins is the author, I think :P)
#include <map>
#include <unordered_map>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <iomanip>
using namespace std;
using Pairs = unordered_map<string,int>;

void fill( Pairs& pairs, char c )
{
static string word;

if( ispunct(c) ) return;

if( isspace(c) )
{
if( word.size() )
{
pairs[word]++;
word.clear();
}
}
else
{
word += tolower(c);
}
}

int main()
{
ifstream bible {"bible.txt"};

using citerator = istreambuf_iterator<char>;

Pairs pairs;

for_each( citerator(bible.rdbuf()), citerator(),
[&pairs]( char c ){ fill( pairs, c ); } );

multimap<unsigned,string> sorted;

// Sort the {word, count} pairs.
//
for_each( pairs.begin(), pairs.end(),
[&sorted]( const Pairs::value_type& p )
{ sorted.insert(make_pair(p.second,p.first)); } );

// Print the top 20.
//
auto item = sorted.rbegin();

for( auto n = 0; n < 20; ++n, ++item )
{
cout << "Position " << setw(2) << n+1
<< ": count = " << setw(6) << item->first
<< " " << item->second << '\n';
}

return 0;
}

Bonita Montero

unread,
Oct 5, 2021, 1:08:31 AMOct 5
to
Ok, you don't understand what I do.

Bonita Montero

unread,
Oct 5, 2021, 1:11:30 AMOct 5
to
On 04.10.2021 at 21:59, Marcel Mueller wrote:

> Prefetch can access invalid memory. So prefetching a shared memory area
> behind a pointer can significantly decrease the probability of failed
> CAS when implementing strong thread safety. But on some platforms I
> observed excessive cachline hopping with this strategy.

That doesn't make sense. When you prefetch you usually process a lot of
data before the point you prefetched. When you have CASes you repeatedly
process the same data; prefetching here is nonsense.

Branimir Maksimovic

unread,
Oct 5, 2021, 5:55:25 AMOct 5
to
On 2021-10-05, Bonita Montero <Bonita....@gmail.com> wrote:
>
> Ok, you don't understand what I do.
>
Seems so :P

Branimir Maksimovic

unread,
Oct 5, 2021, 5:56:01 AMOct 5
to
On 2021-10-05, Bonita Montero <Bonita....@gmail.com> wrote:
Prefetching is nonsense in HLL :P

Bonita Montero

unread,
Oct 5, 2021, 7:23:19 AMOct 5
to
On 05.10.2021 at 11:55, Branimir Maksimovic wrote:

>>> Prefetch can access invalid memory. So prefetching a shared memory area
>>> behind a pointer can significantly decrease the probability of failed
>>> CAS when implementing strong thread safety. But on some platforms I
>>> observed excessive cachline hopping with this strategy.

>> That doesn't make sense. When you prefetch you usually process a lot of
>> data before the point you prefetched. When you have CASes you rotatedy
>> process the same data; prefetching here is nonsense.

> Prefetching is nonsense in HLL :P

No, automatic prefetching is dumb and there are a lot of patterns
it's unable to predict. With my 3-way interleaved access I've
even shown a very simple pattern where manual prefetching helps.
