I have two memory blocks p and q, both n=8192 bytes, allocated using C's
malloc routine.
I then copy 5000 bytes from p to q, repeating this one million times.
This takes 3.2 seconds with n=8192. But if I change n to 8000, it takes only
0.6 seconds!
I can understand that malloc adds extra information to the block so that it
is a bit bigger than my 8192 bytes, but what could it be about the block
addresses that would cause rep:movsb to take 4 times as long to execute?
Using n=9000 also results in a 0.6 second timing, and anyway I'm always
accessing only 5000 bytes in the loop.
Here's the code; p,q,count are 32-bit dwords. And it runs under 32-bit Vista
on (I think) an Intel 2140:
n equ 8192
extern _malloc
push dword n
call _malloc
add esp,4
mov [p],eax
push dword n
call _malloc
add esp,4
mov [q],eax
mov dword [count],1000000
L3:
mov esi,[q]
mov edi,[p]
mov ecx,5000
rep
movsb
dec dword [count]
jnz near L3
--
Bartc
...
Well, I don't know but when I first read your description I thought it
sounded like cache effects. For the size of memory you are copying
these should be L1 to L1 transfers and your caches should be something
like eight-way. Is it possible that for some offsets (i.e. offsets
between source and destination) your destination cache lines occupy
cache memory that are soon after needed for source lines? That would
require the destinations to be written back to L2 to make way for the
new L1 data (which would have to be read from memory).
James
malloc() may have a memory info block or header somewhere. Supposedly, it's
sometimes placed prior to the allocated memory block. I.e., it's not "added
to" the allocated block the user receives, but may precede it. Except for
address alignment, I don't see where this would affect your code.
> ... but what could it be about the block
> addresses that would cause rep:movsb to take 4 times as long to execute?
> Using n=9000 also results in a 0.6 second timing, and anyway I'm always
> accessing only 5000 bytes in the loop.
>
Just looking at your code, I don't know.
Seth Abraham of Intel goes into issues with REP MOVS:
http://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/51402/reply/34703/
http://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/51402/
He mentions alignment as a problem for REP MOVS. That may be an issue since
you're using byte sized moves. He also mentions something called "fast
strings"...
Or, maybe it's a bug. There are many bugs related to REP MOVS on both Intel
and AMD cpu's: failure to update SMI info with port I/O, skipping execution
of nearby instructions, failure to update debug registers, intermediate
update of CR2 register, and etc.:
"REP MOVS/STOS Executing with Fast Strings Enabled and Crossing Page
Boundaries with Inconsistent Memory Types may use an Incorrect Data Size or
Lead to Memory-Ordering Violations"
> Here's the code; p,q,count are 32-bit dwords. And it runs under 32-bit
Vista
> on (I think) an Intel 2140:
>
> n equ 8192
>
> extern _malloc
> push dword n
> call _malloc
> add esp,4
> mov [p],eax
>
> push dword n
> call _malloc
> add esp,4
> mov [q],eax
>
> mov dword [count],1000000
> L3:
> mov esi,[q]
> mov edi,[p]
> mov ecx,5000
> rep
> movsb
>
> dec dword [count]
> jnz near L3
>
Why are the "add esp,4" instructions needed? Is malloc() pushing an arg?
Why? malloc() should pop "n", yes? (No?) But, AFAIK, malloc() doesn't
return an arg...
Why did you put "count" in memory instead of ecx with push/pop or in edx?
What happens if you make all variables register based?
What happens if you rescale 5000 and use "rep movsd"? Aligned moves? "fast
strings" crossing a page boundary?
Rod Pemberton
"bartc" wrote:
> I've encountered a very strange problem with the timing of a block
> transfer demonstrated by the following loop.
> I have two memory blocks p and q, both n=8192 bytes, allocated using C's
> malloc routine.
> I then copy 5000 bytes from p to q, repeating this one million times.
> This takes 3.2 seconds with n=8192. But if I change n to 8000, it takes
> only 0.6 seconds!
> I can understand that malloc adds extra information to the block so that
> it is a bit bigger than my 8192 bytes, but what could it be about the
> block addresses that would cause rep:movsb to take 4 times as long to
> execute? Using n=9000 also results in a 0.6 second timing, and anyway I'm
> always accessing only 5000 bytes in the loop.
I really can't tell what C may do here, but if your p and q aren't
aligned to dw- or even page-bounds, then I'd see a bug in this _malloc.
Debugging could reveal the truth ...
btw: I'd use REP MOVSB (byte variant) only if really required
ie: overlapping ranges 'and' dest.-addr > source-address,
this need to work from the end with a set direction bit.
Even Pre- and/or Post-Alignment of REP MOVSD-loops with single
byte/word- moves isn't possible if both block-addresses weren't
aligned, it will gain speed if the REP-loop works just on MOVSD.
Best would be to have all data stored at aligned (dw/qw/cache-line) bounds
and (often impossible ie: text, but) use cache bounds matching blocksize.
__
wolfgang
I know it would be a pain, but can you find where n changes the
timing? What happens if n=8191, 8193, 8192+256, 8192-1024, etc.?
Sometimes knowing the size of the memory block that causes the break
can help determine what is going on.
Just a thought -
Gil
> I have two memory blocks p and q, both n=8192 bytes, allocated using C's
> malloc routine.
>
> I then copy 5000 bytes from p to q, repeating this one million times.
>
> This takes 3.2 seconds with n=8192. But if I change n to 8000, it takes
> only 0.6 seconds!
> mov dword [count],1000000
> L3:
> mov esi,[q]
> mov edi,[p]
> mov ecx,5000
> rep
> movsb
>
> dec dword [count]
> jnz near L3
This problem can also be shown up using a C example (a memcpy loop), and I
posted also in comp.lang.c.
The consensus there was that this problem is cache-related in some obscure
way. One poster could duplicate some of the effects, with up to 10x slowdown
when the two blocks differed by (IIRC) 4K-8 or 8K-8 bytes and so on, and
smaller slowdowns with differences in that neighbourhood.
My example had a difference of 2008H between p and q when n was 8192; when I
changed n to 8176, and got a 1FF8H difference, the slowdown *was* even more
dramatic.
My application relies on power-of-two allocations, so for now I've just
added an extra 16 bytes to those, to temporarily move out of the problem
area.
However this slowdown of up to 10x for block transfers, when source and
destination vary by 4nK-8 or so, seems a very real problem and difficult to
avoid in general.
--
Bartc
"Rod Pemberton" <do_no...@nohavenot.cmm> wrote in message
news:4b2d9268$0$5102$9a6e...@unlimited.newshosting.com...
>
> "bartc" <ba...@MUNGED.microcosmotalk.com> wrote in message
> news:4b2a7d01$0$4971$9a6e...@unlimited.newshosting.com...
>>
>> I've encountered a very strange problem with the timing of a block
> transfer
>> demonstrated by the following loop.
> Or, maybe it's a bug. There are many bugs related to REP MOVS on both
> Intel
> and AMD cpu's: failure to update SMI info with port I/O, skipping
> execution
> of nearby instructions, failure to update debug registers, intermediate
> update of CR2 register, and etc.:
It does seem to me rather like a 'bug', if the speed of a block transfer
depends on the distance between source and destination (see my other post).
>> Here's the code; p,q,count are 32-bit dwords. And it runs under 32-bit
> Vista
>> on (I think) an Intel 2140:
>>
>> n equ 8192
>>
>> extern _malloc
>> push dword n
>> call _malloc
>> add esp,4
>> mov [p],eax
> Why are the "add esp,4" instructions needed? Is malloc() pushing an arg?
> Why? malloc() should pop "n", yes? (No?) But, AFAIK, malloc() doesn't
> return an arg...
That's how calls to C runtime functions work on my machine. C functions
don't adjust the stack on returning so it must be done by the caller.
> Why did you put "count" in memory instead of ecx with push/pop or in edx?
> What happens if you make all variables register based?
Some of this code was translated from a high-level language so I kept the
instructions as they were generated. Only the main block transfer code was
in inline assembler. (Of course I retested the code as shown to make sure it
still highlighted the problem.)
> What happens if you rescale 5000 and use "rep movsd"? Aligned moves?
> "fast
> strings" crossing a page boundary?
Using variations such as movsd sometimes made the difference you would
expect (but was still slower on some block sizes). But movsb showed the
problem more effectively.
Playing around with alignments sometimes helped, but was not the problem
because the same alignments could sometimes be fast too!
--
Bartc
<snip>
MOVSx is quite outdated. Try this instead:
sub esp,8
nop ; resolve dependency (amd + intel)
nop ; (amd only)
mov [esp],n
call _malloc
mov edi,eax
call _malloc
mov esi,eax
mov [esp],edi
mov [esp+4],eax
mov ebp,1000000
L0:mov edx,416 ; reduce loop to 1/12th
L1:mov eax,[esi] ; feeds three Athlon pipes
mov ebx,[esi + 4] ; (on intel machines two or
mov ecx,[esi + 8] ; four moves per loop may
mov [edi],eax ; throw better results)
mov [edi + 4],ebx
mov [edi + 8],ecx
add esi,12
add edi,12
dec edx ; inner loop--
jns near L1 ; moves 5004 byte...
mov edi,[esp] ; restore addresses
mov esi,[esp+4]
dec ebp ; outer loop--
jne near L0
add esp,8 ; IMPORTANT last step!
Sorry for syntax errors, I'm used to write code
with AS... ;) The NOP(s) at the beginning feed
idle pipes for the time ESP needs to update its
content. We could prefetch the first bytes from
the arrays here, but their addresses are not
known, yet.
Do not use EBP if this code is called from C. C
depends on the base pointer! Because there are
no registers left, subtract 12 from ESP and re-
place EBP with [ESP + 8]. Alternatively, you
might store EBP on the stack and use it for the
loop. (DEC EBP=1, DEC [MEM]=4 clocks (Athlon) *
1e6 saves 3e6 clocks ~ 0.0015 s @ 2 GHz.)
Note that the three moves *to* memory trigger WC
(write combining). It's faster than three single
MOV instructions executed one after another (one
simultaneous write vs. three times one write per
cycle). However, it doesn't work if writes cross
a cache line (about every 4th inner loop).
Same scheme works much faster with four or eight
XMM registers and prefetching as shown on AMD's
website.
If this code shows the same behaviour than your
code, it's probably caused by _malloc. Your 0.6
seconds result seems reasonable for MOVSB.
Greetings from Augsburg
Bernhard Schornak
Are you really claiming that this loop will beat a "rep movsd", which runs
at one cycle per dword?
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.
AMD 40546.pdf (software optimisation guide) says:
> ---------------------------------------------------------
>
> *REP MOVSB/W/D/Q VectorPath 5 clocks*
>
> 8.3 Repeated String Instructions
>
> Optimization
>
> Use the REP prefix judiciously when performing string
> operations.
>
> Rationale
>
> In general, using the REP prefix to repeatedly perform
> string instructions is less efficient than other methods,
> especially when copying blocks of memory. Even though
> using the REP prefix may seem attractive due to its small
> code size, a loop may yield better performance due to its
> minimal overhead, compared to the setup overhead of using
> the REP prefix. However, certain string operations can
> benefit from using the REP prefix when the increased
> throughput compared to that of a loop makes up for its
> setup overhead for any specific repeat count.
>
> ---------------------------------------------------------
Test different versions of the loop with RDTSC before and
after the loop body to determine the real execution time!
I tested a lot of stuff with my RDTSC test tools, so I am
sure a 1 clock MOVSx is an urban legend. No processor can
do a copy mem -> mem in one cycle. It always has to fetch
data before they can be written. Hence, it lasts at least
two clocks to perform a mem -> mem copy. It is physically
impossible to write data at the same time it is read. Not
to mention that any read or write consumes time, as well.
VectorPath instructions block all execution pipes for the
time they are executed. As mentioned in my posting, using
eight XMM registers with a prefetch mechanism outperforms
any other memcopy, see e.g.:
<http://archive.netbsd.se/?ml=firebird-devel&a=2009-10&m=11811666>
"Bernhard Schornak" <scho...@MUNGED.microcosmotalk.com> wrote in message
news:4b377657$0$4877$9a6e...@unlimited.newshosting.com...
>
> Tim Roberts wrote:
>>>
>>> ...
>>> L0:mov edx,416 ; reduce loop to 1/12th
>>> L1:mov eax,[esi] ; feeds three Athlon pipes
>>> mov ebx,[esi + 4] ; (on intel machines two or
>>> mov ecx,[esi + 8] ; four moves per loop may
>>> mov [edi],eax ; throw better results)
>>> mov [edi + 4],ebx
>>> mov [edi + 8],ecx
>>> add esi,12
>>> add edi,12
>>> dec edx ; inner loop--
>>> jns near L1 ; moves 5004 byte...
>>
>> Are you really claiming that this loop will beat a "rep movsd", which
>> runs
>> at one cycle per dword?
> > *REP MOVSB/W/D/Q VectorPath 5 clocks*
> >
> > 8.3 Repeated String Instructions
> >
> > Optimization
> >
> > Use the REP prefix judiciously when performing string
> > operations.
On my Intel processor, executing the main body of your loop 1 million times
took about 1 second.
Using REP:MOVSD took about 0.6 seconds (both copying 6000 bytes).
But even if the more complex loop was faster, the main problem with it is
dividing N by 12, if N is a runtime value.
--
Bartc
There are certain architectural differences between CPUs. As Bernhard
said in an earlier post you may need to use a different number of 32-
bit words per iteration.
>
> But even if the more complex loop was faster, the main problem with it is
> dividing N by 12, if N is a runtime value.
N is divided by 12 (or 8 or 16 or whatever) only outside the main loop
so performance should not be a concern. Any extra bytes left over
after the divide at the beginning or the end can be dealt with
separately. If one can only align source or destination and not both
the rule used to be to align the destination. All such decisions
should be made in a library routine - and that routine may differ per
CPU.
Did you get any resolve to your earlier issue - of the difference
between source and destination addresses? That's probably also an
architectural issue but it would be good to tie it down.
James
Use of rep movsd is normally fast. There are sometimes (often?,
normally?) faster ways but the specific method which is fastest will
depend on the CPU architecture and which cache level the source and
destination are in. (Since Bart is running his test repeatedly and the
memory size fits in L1 cache this should be L1 to L1 but that is not
the case in general.)
Some references on optimisation: the Intel and AMD optimisation
manuals, Inner Loops - A Sourcebook for Fast 32-Bit Software
Development (dated but a good grounding and definitely the easiest to
read), and
http://www.agner.org/optimize/
James
Just subtract 12 from the counter
--
ArarghMail912 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html
To reply by email, remove the extra stuff from the reply address.
But N might be small, if this is a general memory copy routine. But then,
ArarghMail912 suggested counting down in 12's so wouldn't have been a big
deal.
> Did you get any resolve to your earlier issue - of the difference
> between source and destination addresses? That's probably also an
> architectural issue but it would be good to tie it down.
No, there seems to be a genuine issue with up to 10x slowdown when the
difference is the region of 4nK-8 bytes. I'm just avoiding that at the
minute by making sure my larger objects are allocated at least 4nK+8 bytes
apart.
It can still happen in arbitrary situations, but hopefully with less impact.
--
Bartc
...
> > Did you get any resolve to your earlier issue - of the difference
> > between source and destination addresses? That's probably also an
> > architectural issue but it would be good to tie it down.
>
> No, there seems to be a genuine issue with up to 10x slowdown when the
> difference is the region of 4nK-8 bytes. I'm just avoiding that at the
> minute by making sure my larger objects are allocated at least 4nK+8 bytes
> apart.
>
> It can still happen in arbitrary situations, but hopefully with less impact.
It's a huge slowdown. Was hoping someone may have derived a set of
rules to predict it but I guess you don't have a need for that much
detail. I may have in the future. I'll add it to my to-do list.
Because of the difference it seems worth adding to a memcpy-type
library function that implements some processor-specific techniques.
James
"James Harris" <james.h...@MUNGED.microcosmotalk.com> wrote in message
news:4b383802$0$4954$9a6e...@unlimited.newshosting.com...
(The problem occurs when copying between two blocks already in the cache; if
you are accessing fresh main memory, then it will likely be that slow
anyway.
But it did come up for me in situations like the following:
do
a := a + b
end
where a,b are objects such as strings or lists. Implemented naively, new
memory is allocated for a+b (rounded to 2^N bytes in my program). The old
memory for a is freed (and reused in the next iteration), and a then takes
ownership of the new block.
In a loop, this effectively means copying between the same two blocks over
and over again. When the size exceeds 2047 bytes, block sizes become 4nK+8
(using C's malloc). The temporary fix was to use a 2^N+16 allocation, ie.
4nK+24 (not +8 as I'd stated)
4nK+8 is not the slowest (that was 4nK-8), but it was enough!)
--
Bartc
Reloading... done:
test:mov eax,[esp+0x04] ; EAX = mem size
pusha
sub esp,0x3C
mov ebx,eax ; EBX = mem size
mov ebp,0x000F4240 ; EBP = loop count
mov [esp+0x00],eax
call _malloc
mov edi,eax
call _malloc
mov esi,eax
mov [esp+0x00],edi ; 00[ESP] = source
mov [esp+0x04],eax ; 04[ESP] = target
rdtsc ; get current TSC
mov [esp+0x08],eax ; store low DD
mov [esp+0x0C],edx ; high DD
L0:mov ecx,ebx ; loop count
mov edi,[esp+0x00] ; restore addresses
mov esi,[esp+0x04]
rep movsb
mov ecx,ebx ; restore loop count
dec ebp ; outer loop--
jne near L0 ; moves 5000 byte...
rdtsc
sub eax,[esp+0x08] ; EAX = low DD
sbb edx,[esp+0x0C] ; EDX = high DD
mov [esp+0x10],eax ; store low result 1
mov [esp+0x14],edx ; high result 1
shr ebx,0x02 ; mem / 4
mov ebp,0x000F4240 ; EBP = loop count
rdtsc ; get current TSC
mov [esp+0x08],eax ; store low DD
mov [esp+0x0C],edx ; high DD
L1:mov ecx,ebx ; loop count
mov edi,[esp+0x00] ; restore addresses
mov esi,[esp+0x04]
rep movsd
dec ebp ; outer loop--
jne near L1 ; moves 5000 byte...
rdtsc
sub eax,[esp+0x08] ; EAX = low DD
sbb edx,[esp+0x0C] ; EDX = high DD
mov [esp+0x18],eax ; store low result 2
mov [esp+0x1C],edx ; high result 2
shr ebx ; mem / 8
mov ebp,0x000F4240 ; EBP = loop count
rdtsc ; get current TSC
mov [esp+0x08],eax ; store low DD
mov [esp+0x0C],edx ; high DD
L2:mov ecx,ebx ; loop count
mov edi,[esp+0x00] ; restore addresses
mov esi,[esp+0x04]
L3:mov eax,[esi+0x00] ; copy
mov edx,[esi+0x04]
mov [edi+0x00],eax
mov [edi+0x04],edx
add esi,0x08
add edi,0x08
dec ecx ; inner loop--
jne near L3 ; moves 5000 byte...
dec ebp ; outer loop--
jne near L2
rdtsc
sub eax,[esp+0x08] ; EAX = low DD
sbb edx,[esp+0x0C] ; EDX = high DD
mov [esp+0x20],eax ; store low result 3
mov [esp+0x24],edx ; high result 3
add [esp+0x00],0x0F ; align memory
add [esp+0x04],0x0F ;
and [esp+0x00],0xFFFFFFF0
and [esp+0x04],0xFFFFFFF0
shr ebx,0x03 ; mem / 64
mov ebp,0x000F4240 ; EBP = loop count
rdtsc ; get current TSC
mov [esp+0x08],eax ; store low DD
mov [esp+0x0C],edx ; high DD
L4:mov ecx,ebx ; loop count
mov edi,[esp+0x00] ; restore addresses
mov esi,[esp+0x04]
L5:movdqa xmm0,[esi+0x00]
movdqa xmm1,[esi+0x10]
movdqa xmm2,[esi+0x20]
movdqa xmm3,[esi+0x30]
movdqa [edi+0x00],xmm0
movdqa [edi+0x10],xmm1
movdqa [edi+0x20],xmm2
movdqa [edi+0x30],xmm3
add esi,0x40
add edi,0x40
dec ecx ; inner loop--
jne near L5 ; moves 4992 byte...
mov eax,[esi+0x00] ; copy 8 byte
mov edx,[esi+0x04]
mov [edi+0x00],eax
mov [edi+0x04],edx
dec ebp ; outer loop--
jne near L4
rdtsc
sub eax,[esp+0x08] ; EAX = low DD
sbb edx,[esp+0x0C] ; EDX = high DD
mov [esp+0x28],eax ; store low result 4
mov [esp+0x2C],edx ; high result 4
;
; DISPLAY OR STORE RESULTS
;
add esp,0x3c
popa
ret
The function aquires execution times of four memcpy()
versions. The results (in clock cycles) are stored on
the stack at locations 0x10[ESP]...0x2C[ESP]. You may
insert an output function or store the results as raw
stack dump to view them with a hex editor (all stored
results are qwords).
Versions are
1 REP MOVSB
2 REP MOVSD
3 8 byte MOV (INT) (to make iNTEL machines happy)
4 64 byte MOV (XMM) (not optimised)
If the RDTSC timer wrapped around (EDX negative), use
add eax,0xFFFFFFFF
adc edx,0xFFFFFFFF
to get the proper count.
BTW: To test the performance of a copyloop, 1,000,000
repetitions are ways too much. Reduce them to 16
or 256 to prevent interruption by task switching
mechanisms of the OS. A task switch adds several
hundreds (sometimes thousands) of clocks to your
results. These are not related to your function,
at all, but, unfortunately, they're added to the
execution time of the function. Keeping the loop
small enough to return before the time slice ex-
pired throws accurate (comparable) results.
Yes, but your OS may also provide functions to get the time, that is
really consumed by the thread. This eliminates time that the CPU gives
other tasks or threads.
The Redmond OSs have a function GetThreadTimes i.e.
I used them successfully for several performance monitoring tasks.
Helge
The way that you are accessing rdtsc is quite susceptible
to noise. The instruction is not sequentialised and you
have not flushed the pipelines with a cpuinfo instruction
to ensure that you are measuring from a fixed point. For
the length of sequence that you are timing this may not
make much difference.
Could you post the raw results from a few runs of your
code on your hardware to see what kind of distribution
the timing measurements have?
Amoss
I was working with OS/2 (eCS) until now, hence I'm not
familiar with Window's API. Well, I'm going to port my
stuff to Win7 at the moment, learning the one or other
thing.
>> BTW: To test the performance of a copyloop, 1,000,000
>> repetitions are ways too much. Reduce them to 16
>> or 256 to prevent interruption by task switching
>> mechanisms of the OS. A task switch adds several
>> hundreds (sometimes thousands) of clocks to your
>> results. These are not related to your function,
>> at all, but, unfortunately, they're added to the
>> execution time of the function. Keeping the loop
>> small enough to return before the time slice ex-
>> pired throws accurate (comparable) results.
>
> The way that you are accessing rdtsc is quite susceptible
> to noise. The instruction is not sequentialised and you
> have not flushed the pipelines with a cpuinfo instruction
> to ensure that you are measuring from a fixed point. For
> the length of sequence that you are timing this may not
> make much difference.
Right, a CPUID should preceed each RDTSC. On OS/2 it
doesn't make a difference if the CPUID is missing.
> Could you post the raw results from a few runs of your
> code on your hardware to see what kind of distribution
> the timing measurements have?
I try to write a small Win program on the fly. Might
take a day (no access to OS/2 at the moment - denies
to work on my new machine).