Melzzzzz wrote:
> [bmaxa@maxa-pc assembler]$ ./rdtscp 4000000
> 4000000 128 byte blocks, loops:1
> rep movsb 0.04352539184211
> rep movsq 0.02895878605263
> movntdq 0.02523812921053
> movntdq prefetch 0.02508215763158
> movntdq prefetch ymm 0.02417047026316
> [bmaxa@maxa-pc assembler]$ ./rdtscp 400000
> 400000 128 byte blocks, loops:10
> rep movsb 0.00311163213158
> rep movsq 0.00244263126316
> movntdq 0.00251265031579
> movntdq prefetch 0.00257390510526
> movntdq prefetch ymm 0.00242973521053
> [bmaxa@maxa-pc assembler]$ ./rdtscp 4000
> 4000 128 byte blocks, loops:1000
> rep movsb 0.00001444596763
> rep movsq 0.00001314468553
> movntdq 0.00002107178763
> movntdq prefetch 0.00002129352158
> movntdq prefetch ymm 0.00002099912526
> [bmaxa@maxa-pc assembler]$ ./rdtscp 40000
> 40000 128 byte blocks, loops:100
> rep movsb 0.00021878483684
> rep movsq 0.00018026386579
> movntdq 0.00023630260263
> movntdq prefetch 0.00024114757105
> movntdq prefetch ymm 0.00023099385000
Okay - the task was to copy 16 MB from one memory location
to another, where the two blocks do not overlap... ;)
I guess your results are times in seconds, but they probably
are not really reliable if the processor runs each core at a
variable clock speed. RDTSCP returns reliable measurements,
even if the processor changes clock speed or switches to a
power saving mode. The speed of AVX moves only depends on
the number of 64-bit buses between processor and memory. It
cannot be faster than the same task performed with SSE (XMM)
registers - the memory interface (not the register size!) is
the bottleneck.
          ST Test   Melzzz 1  Melzzz 2  Melzzz 3  Melzzz 4
MOVSB     153.29 %  173.53 %  120.89 %   67.84 %   90.73 %
MOVSQ     160.73 %  115.46 %   94.90 %   61.73 %   74.75 %
MOVDQA    113.26 %  100.62 %   97.62 %   98.96 %   97.99 %
PREFETCH  100.00 %  100.00 %  100.00 %  100.00 %  100.00 %
My test results vary as well, but the overall error is less
than five percent. Your results for REP MOVSx vary between
67.84 and 173.53 (REP MOVSB) or 61.73 and 115.46 (REP MOVSQ)
percent. Do you believe these results are reliable enough to
decide which copy algorithm shall be implemented for the new
superfast memcpy()?
<snip>
...
> @@:
> mov rcx,r8
> mov rdi,outbuf
> mov rsi,inbuf
> prefetch [rsi]
> prefetch [rsi+0x40]
These prefetches are not required, but
> .L1:
> prefetch [rsi+0x40]
> prefetch [rsi+0x80]
these should be [rsi+0x80] and [rsi+0xC0]. The same applies
to the AVX version later on. Prefetching cache lines only
speeds up execution if it is done early enough, so that the
prefetched memory is already present in L1 when the next
iteration issues its read accesses. Writes are not as
crucial - write combining collects multiple stores to one
and the same cache line, which (partially) hides write
latencies.
> movdqa xmm0,[rsi]
> movdqa xmm1,[rsi+0x10]
> movdqa xmm2,[rsi+0x20]
> movdqa xmm3,[rsi+0x30]
> movdqa xmm4,[rsi+0x40]
> movdqa xmm5,[rsi+0x50]
> movdqa xmm6,[rsi+0x60]
> movdqa xmm7,[rsi+0x70]
> movntdq [rdi],xmm0
> movntdq [rdi+0x10],xmm1
> movntdq [rdi+0x20],xmm2
> movntdq [rdi+0x30],xmm3
> movntdq [rdi+0x40],xmm4
> movntdq [rdi+0x50],xmm5
> movntdq [rdi+0x60],xmm6
> movntdq [rdi+0x70],xmm7
> add rsi,128
> add rdi,128
> dec rcx
> jnz .L1
> dec rbx
> jnz @b