fast memory copy

Akis Tzortzis

unread,

Jul 22, 2003, 5:55:58 AM7/22/03

to

Hello

I am looking to see if there is a way to take advantage of any Pentium
III-IV functionality or other technical advancements to perform a faster
memory copy. Currently, for properly aligned memory areas, it seems that
"rep movsd" is the fastest. I tried using MMX registers "movq mm0, [esi+0]"
in blocks of code that move 64 bytes at a time (using all MMX registers) but
it is slower than the age old "rep movsd".

Of course the code is intended for Pentium CPUs, preferrably Pentium II
(common denominator), but of course we could check what CPU we are on and
use different techniques.

Thanks in anticipation

Akis Tzortzis
Developer
London UK

Slava M. Usov

unread,

Jul 22, 2003, 7:10:07 AM7/22/03

to

"Akis Tzortzis" <a...@jesqa.com> wrote in message
news:Osdl0dDU...@TK2MSFTNGP11.phx.gbl...

> I am looking to see if there is a way to take advantage of any Pentium
> III-IV functionality or other technical advancements to perform a faster
> memory copy. Currently, for properly aligned memory areas, it seems that
> "rep movsd" is the fastest. I tried using MMX registers
> "movq mm0, [esi+0]" in blocks of code that move 64 bytes at a time (using
> all MMX registers) but it is slower than the age old "rep movsd".

The most significant factor in memory copying is the combination of the
write buffer and the CPU cache. These will ensure that the system bus sees
mostly bursts of memory writes and reads, each a cache line wide, even if
you use rep movsd. On the other hand, when you do not use rep movsd, the CPU
will not be able to use whatever internal optimization it has for rep movsd
and will have to follow your code, which is rather complex and therefore
slow and which must be read from i-cache, which may steal bandwidth.

In general, you should use memcpy() as it is supposed to be written by those
who know the fastest way :-)

The MMX and the SSE data movement instructions are optimized to load and
save registers when you need to apply some transformations, not copy
memory.

S

Akis Tzortzis

unread,

Jul 22, 2003, 8:28:39 AM7/22/03

to

My latest tests show that I can shave 9% off performance to MS memmove
(essentially a "rep movsd") by using an unrolled 64 byte trasnfer invlolving
32 bit quanitites (ie without using the MMX registers).

I cannot get it to go any faster. Also I suspect that on an AMD the MMX
instructions may be faster than on the Pentium III . I will test now.

Akis

"Slava M. Usov" <stripit...@gmx.net> wrote in message
news:OAFnQHEU...@tk2msftngp13.phx.gbl...

Akis Tzortzis

unread,

Jul 22, 2003, 10:50:20 AM7/22/03

to

My tests on an Athlon XP 1900+ show that the "rep movsd" instruction is
faster than manual copying in all sorts of buffer sizes.
On Intel Xeon 2GHz and Pentium III 1GHz unrolled memory copy instructions
are faster than the "rep movsd".
On both types of CPUs the "movq" MMX instructions were slower.

The only thing remaining now would be perhaps to take advantage of any cache
instructions or some other exotic ways of doing it. How about DMA? Does
anyone use DMA anymore? Is there a Win32 API for it?

Thanks

Akis

"Akis Tzortzis" <a...@jesqa.com> wrote in message

news:#FNTJzEU...@tk2msftngp13.phx.gbl...

Slava M. Usov

unread,

Jul 22, 2003, 11:58:49 AM7/22/03

to

"Akis Tzortzis" <a...@jesqa.com> wrote in message

news:#o63TCGU...@TK2MSFTNGP12.phx.gbl...

[...]

> The only thing remaining now would be perhaps to take advantage of any
> cache instructions or some other exotic ways of doing it.

The default memory type, 'write-back', is the best performing for memory
copying, so changing that will only make it slower. Besides, you can't do
that in user mode.

> How about DMA? Does anyone use DMA anymore? Is there a Win32 API for it?

There is no API in user mode for that. But I doubt that DMA will outperform
CPU at that, and the overhead required can be greater for moderately sized
buffers.

What are you trying to do? If it is the very nature of your application to
copy large chunks of memory, then memcpy() is easily the best, as you have
found out. But if it is not, and you have most of the time spent copying
memory, then you need to think about the general design of your application.

S

mike

unread,

Jul 22, 2003, 1:21:32 PM7/22/03

to

Are you using prefetchnta ?

mike

unread,

Jul 22, 2003, 1:36:19 PM7/22/03

to

Me again. I did some sprite routines
way back and measured them with intel vtune.
Since I am processing and not doing a simple
copy its a little different, but I think the
best I was able to get that applies to a plain
memcpy involves movntq, movq and prefetchnta .
Google those for more details.

mike

Seregy Karpov

unread,

Jul 23, 2003, 9:33:07 AM7/23/03

to

CPUs from different vendors (Intel, AMD) has differnt
optimisation. Even if you find "exotic" way to speed up
copy on one CPU it may suffer once it comes to next CPU
generation.

Adobe used some tricks to speed up copy on PIII/PII
platform. The side effect was that Adobe Photoshop didn't
work well on many motherboards with i440BX chipset.

The bottom line:
If you do general application use memcpy().
If you do special type application like hi-res video
processing try to find right hardware (Motherboard with
RAMBUS memory and PIV, motherboard with double bank DDR
etc) and still use memcpy().

BTW you may speed up copy by loading and storing several
registers at once to benefit from pipe lines, like:
mov eax , DWORD PTR[esi]
mov edx , DWORD PTR[esi + 4]
mov ebx , DWORD PTR[esi + 8]
add esi , 12
The order of unrelated instruction may be crucial.
Also and check current optimization polices on Intel/AMD
web site. You may find that simple instructions like:
mov eax , DWORD PTR[esi]
add esi , TYPE DWORD
are given advantage over complicate like:
lodsd

-Sergey Karpov

>.
>

Slava M. Usov

unread,

Jul 23, 2003, 10:04:50 AM7/23/03

to

"Seregy Karpov" <serge...@yahoo.com> wrote in message
news:0be101c3511e$f3a98df0$a501...@phx.gbl...

[...]

> BTW you may speed up copy by loading and storing several
> registers at once to benefit from pipe lines, like:
> mov eax , DWORD PTR[esi]
> mov edx , DWORD PTR[esi + 4]
> mov ebx , DWORD PTR[esi + 8]
> add esi , 12

That will not produce the desired effect, at least not for all CPUs. This
sequence, especially in a loop, is quite complex for a CPU, because it
involves a flow of data between memory and registers, register allocations
from the registry engine, speculative data reads and speculative execution.
All that can steal CPU cycles, the speculative stuff may even go wrong and
do what is not necessary at all. The code itself may have to be read
occasionally, stealing bus bandwidth. rep movsd should not have these
problems.

The original poster has actually tested that on various CPUs and found that
it gives only a slight improvement with only one particular CPU model.

Finally, rep movsd is the engineered way to copy 'strings'. In C/C++,
memcpy() is the best choice. If there is a need to optimize _that_, in 99
cases out of 100 it indicates a general design flaw. In 0.9 cases out of the
remaining 1 it indicates that the hardware is inadequate.

S

Sergey Karpov

unread,

Jul 23, 2003, 11:25:27 AM7/23/03

to

Hi Slava,

You can see from my original post that I vote for using
memcpy().
The main reason is compatibility across existing and new
comming CPUs.

I admit - these 4 lines oversimplify the issue. It is an
example how this problem may be approached. The actual
implementation depends on CPU.

Here is example how people did it in past:
http://now.cs.berkeley.edu/Td/bcopy.html

-Sergey Karpov

>.
>

Slava M. Usov

unread,

Jul 23, 2003, 12:20:49 PM7/23/03

to

"Sergey Karpov" <serge...@yahoo.com> wrote in message
news:0e9201c3512e$a50109c0$a401...@phx.gbl...

> You can see from my original post that I vote for using
> memcpy().

Yes, I saw it. And I had said it before myself. But as you concluded with a
different statement, I thought it would not hurt to repeat it.

[...]

> I admit - these 4 lines oversimplify the issue. It is an
> example how this problem may be approached. The actual
> implementation depends on CPU.
>
> Here is example how people did it in past:
> http://now.cs.berkeley.edu/Td/bcopy.html

Too old, given their hardware. But even when they tried apparently the most
modern hardware they could get, the P6 200, Step-B, memcpy() was the top
performer.

S

Akis Tzortzis

unread,

Jul 23, 2003, 1:56:01 PM7/23/03

to

My recent message was incorrect: when the memory blocks exceed the size of
the L2 cache (which runs synchronously with the CPU) then "rep movsd" is
TWICE as slow as using bulk, unrolled MMX instructions (on a PIII). In other
cases and on Intel CPUs (PIII, PIV and Xeon) the following has 10% advantage
over "rep movsd" :
mov eax, [esi+edx]
mov [edi+edx], eax
add edx, ebx
...
Also having 20-50 instructions in a loop does not really matter : the CPU
has adequate code cache to keep the "loop" completely in the cache.

More tests are needed , for example when the memory blocks are far apart,
perhaps farther than the cache sizes, when they overlap, when they are very
small, or very large etc...

On Athlon XP1900+ CPU the "movsd" instruction works better than on the Intel
so that no benefits are seen.

memcpy and memmove use "rep movsd" . Assuming that the FPU registers (of
which MMX is just an alias) are available on all modern x86 compatible CPUs
then I can see no reason to use memmove and memcpy.

I have not tried the "prefetch" or "movnti" instructions yet. More tests to
follow...

Akis.

"Slava M. Usov" <stripit...@gmx.net> wrote in message

news:#F2oelTU...@tk2msftngp13.phx.gbl...

Slava M. Usov

unread,

Jul 23, 2003, 4:31:42 PM7/23/03

to

"Akis Tzortzis" <a...@jesqa.com> wrote in message

news:eolMvOUU...@TK2MSFTNGP10.phx.gbl...

> My recent message was incorrect: when the memory blocks exceed the size of
> the L2 cache (which runs synchronously with the CPU) then "rep movsd" is
> TWICE as slow as using bulk, unrolled MMX instructions (on a PIII). In
> other cases and on Intel CPUs (PIII, PIV and Xeon) the following has 10%
> advantage

Since your statements vary from "no" to "yes" on the same subject, I would
like to see your code.

[...]

> Also having 20-50 instructions in a loop does not really matter : the CPU
> has adequate code cache to keep the "loop" completely in the cache.

It still has to access the cache.

> More tests are needed , for example when the memory blocks are far apart,
> perhaps farther than the cache sizes, when they overlap, when they are
> very small, or very large etc...

... for all future CPU models, for all other CPU architectures...

[...]

> Assuming that the FPU registers (of which MMX is just an alias) are
> available on all modern x86 compatible CPUs then I can see no reason to
> use memmove and memcpy.

Blessed shall he be who reinventh the wheel.

S

mike

unread,

Jul 24, 2003, 2:14:28 AM7/24/03

to

In article <uELgwlVU...@TK2MSFTNGP12.phx.gbl>,

>
> Blessed shall he be who reinventh the wheel.

Really. I am getting all misty eyed remembering the
days I followed this discussion in several other news
groups. Google fast mem-copy.

BTW, last time I new, back-to-back mov's do not
a pipeline fill.

m

Akis Tzortzis

unread,

Jul 24, 2003, 5:19:47 AM7/24/03

to

> Since your statements vary from "no" to "yes" on the same subject, I would
> like to see your code.

Well yes, there are so many parameters and target CPUs I am trying on, and I
change the code so often that I am also a bit confused myself :-)
When I finish the code I will send it if you wish, in the meantime I suggest
you look at Microsoft's implementation of memmove() - since you advocate
using it you should at least know what and how it does it.

> > Also having 20-50 instructions in a loop does not really matter : the
CPU
> > has adequate code cache to keep the "loop" completely in the cache.
>
> It still has to access the cache.

What really matters is that "Code X" is quicker than "Code Y" - the fact
that an unrolled loop of 20 or 60 instrunctions performs better than a 2
byte instruction "rep movsd" is something for Intel to worry about (on AMD
it is different).

> ... for all future CPU models, for all other CPU architectures...

We are working with a known group of CPUs which our customers use today. We
test on those platforms and try to offer the best we can.

> Blessed shall he be who reinventh the wheel.

It depends on how deep down you examine 3rd party code before you adopt it
in your projets. When I started 15 years ago we were writing our own "C"
libraries - for technical reasons not for a laugh. Sure you cannot examine
the Win32 API and you cannot do much about the way the VC++ generates code,
but there are other elements over which you have more control. For example
the STL library as shipped by VS. Excluding the I/O stuff, you can getter
MUCH better mileage by doing it yourself. What I am trying to say is that
the process of investigation and experimentation is not "reinventing the
wheel" but rather good software practice. For example you can write a much
better malloc()/free() - in fact there are companies out there doing just
that.

Slava M. Usov

unread,

Jul 24, 2003, 7:20:24 AM7/24/03

to

"Akis Tzortzis" <a...@jesqa.com> wrote in message

news:uO6L7ScU...@TK2MSFTNGP10.phx.gbl...

> When I finish the code I will send it if you wish

I prefer that you post it here, because you made your statements here.

> in the meantime I suggest you look at Microsoft's implementation of
> memmove() - since you advocate using it

I'm not advocating Microsoft's memmove() and not even anybody else's
memmove(). I was advocating the language's standard memcpy().

> you should at least know what and how it does it.

<g> No I'll continue discussing things that I haven't a slightest clue about
</g>

> What really matters is that "Code X" is quicker than "Code Y" - the fact
> that an unrolled loop of 20 or 60 instrunctions performs better than a 2
> byte instruction "rep movsd" is something for Intel to worry about (on
> AMD it is different).

That has to be demonstrated yet. This is why I wanted to see the code that
gave you this impression.

[...]

> We test on those platforms and try to offer the best we can.

I believe I have asked what your application does _in general_. I don't
think it all boils down to memcpy() performance.

> For example the STL library as shipped by VS. Excluding the I/O stuff,
> you can getter MUCH better mileage by doing it yourself.

The real question is, do you care about a library that can be, say, 100%
more efficient than another library, if the net gain is 1%?

[...]

> For example you can write a much better malloc()/free() - in fact there
> are companies out there doing just that.

In C++ you can write an allocator for some particular class that outperforms
any generic allocator in some particular application, but see above.

S

mike

unread,

Jul 24, 2003, 10:10:50 AM7/24/03

to

See SGI's Article 'Optimizing CPU to Memory Accesses
on the SGI Visual Workstations 320 and 540'

http://www.joryanick.com/memcpySGI.htm

Slava M. Usov

unread,

Jul 24, 2003, 1:22:26 PM7/24/03

to

"mike" <winte...@operamail.com> wrote in message
news:MPG.1989a4d8f...@msnews.microsoft.com...

> See SGI's Article 'Optimizing CPU to Memory Accesses
> on the SGI Visual Workstations 320 and 540'
>
> http://www.joryanick.com/memcpySGI.htm

Unfortunately, it's not so clear cut as they say there. I only compared the
most advanced version of theirs against the plain "rep movsd". The relative
performance varies significantly with the size of the blocks copied. I had
these results:

nbytes, in K rep movsd prftch+L1+ntq ratio

PIII
2 289 510 0.567
4 241 505 0.477
8 454 1098 0.413
16 1046 2975 0.352
32 1039 2967 0.350
64 1035 3001 0.345
128 1093 3190 0.343
256 4738 3601 1.32
512 5472 3106 1.76
1024 5873 2873 2.04
2048 6055 2748 2.20
4096 6182 2677 2.31
8192 6232 2640 2.36
16384 6262 2620 2.39

P4 Xeon
2 664 1760 0.377
4 595 1825 0.326
8 617 1842 0.335
16 647 1848 0.350
32 662 1855 0.357
64 648 1872 0.346
128 660 1870 0.353
256 693 1936 0.358
512 2702 2201 1.23
1024 3078 2062 1.49
2048 3260 1991 1.64
4096 3362 1955 1.72
8192 3412 1942 1.76
16384 3427 1933 1.77

The second and the third columns are CPU cycles / 1K, the less is the
better. As you can see, for all sensible buffer sizes rep movsd has much
better performance, it is only past 256K when it makes sense to use the
other method. Another thing worth noticing is the benefit is smaller on P4,
and happens later, and happens only because rep movsd becomes 5 times slower
past 256K.

Anyway, I don't think that copying megabytes of memory makes any sense at
all, so rep movsd and thus the "default" memcpy() should be used.

S

Sergey Karpov

unread,

Jul 24, 2003, 3:21:18 PM7/24/03

to

Hi guys,
At this point it is not clear what are you talking about.

1. If you are about to establish new record lets talk
about particular CPU, chipset and RAM. There is no
general solution that will fit all AMDs and Intels /except
rep movs :-)/. I am sure that we will be able outperform
rep movs. In this case we should move away from win32
kernel to hardware and assembly.
2. If we are talking about particular project lets see
why speed is so important and discuss:
Performance vs. compatibility and support. I welcome idea
of reinventing the bicycle but it must be justified.

Nobody says that rep movs is the fastest way to copy
memory.
We say that memory should be copied using STANDART call
memcpy to preserve compatibility. It just happens that
most compilers on ix86 platform implement memcpy using rep
movs. For the simple reason – it always works.

-Sergey Karpov

>.
>

Slava M. Usov

unread,

Jul 25, 2003, 6:02:34 AM7/25/03

to

"Sergey Karpov" <serge...@yahoo.com> wrote in message

news:031d01c35218$c25a5ae0$a301...@phx.gbl...

[...]

> Nobody says that rep movs is the fastest way to copy
> memory.

I do. It's fastest or pretty close to fastest in all important cases. Since
it is also universally supported, the choice is clear. There is probably an
infinitesimal number of applications that copy memory "on the large scale",
so for them optimizing memcpy() might be important, yet I maintain that in
most cases that can be eliminated by using different data structures and
different algorithms.

S

mike

unread,

Jul 27, 2003, 1:52:17 PM7/27/03

to

Nice job with the numbers Slava.

I think people wanted to get a handle on the possibility
for enhancement of memory transfers, rather than take
somebody's word that memcpy is the best way to go.
As usual, there is no easy answer. AMD has had MMX since
1997, so compatability may not be an issue if one
decides to tinker with MMX.

mike

In article <031d01c35218$c25a5ae0$a301...@phx.gbl>,
serge...@yahoo.com says...

mike

unread,

Jul 27, 2003, 2:36:00 PM7/27/03

to

In article <Ob1boggU...@tk2msftngp13.phx.gbl>,
stripit...@gmx.net says...

> Anyway, I don't think that copying megabytes of
> memory makes any sense at all, so rep movsd and
> thus the "default" memcpy() should be used.

What about off sceen buffers?

Slava M. Usov

unread,

Jul 28, 2003, 6:06:05 AM7/28/03

to

"mike" <winte...@operamail.com> wrote in message

news:MPG.198dcd3e9...@msnews.microsoft.com...

[...]

> I think people wanted to get a handle on the possibility
> for enhancement of memory transfers, rather than take
> somebody's word that memcpy is the best way to go.
> As usual, there is no easy answer. AMD has had MMX since
> 1997, so compatability may not be an issue if one
> decides to tinker with MMX.

I did admit that rep movsd performed worse when transferring very large
buffers. Yet it was the top performer, overall, on the dual Xeon. Should I
have said, instead, "OK, we have rep movsd that for all practical reasons
outperforms anything else, but because it isn't cool, is sooo boring and
isn't clever, we'll use the MMX stuff. Wait, there's also 128 bit registers,
so let's use them." Then you'd have been satisfied, right?

S

Slava M. Usov

unread,

Jul 28, 2003, 6:06:22 AM7/28/03

to

"mike" <winte...@operamail.com> wrote in message

news:MPG.198dd77af...@msnews.microsoft.com...

Don't they have special chips in the modern video cards just for that? Then
you'd be better off using bitblt() or OpenGL or whatever that knows how to
use that.

S

mike

unread,

Jul 28, 2003, 11:07:04 AM7/28/03

to

In article <#LVhn$OVDHA...@TK2MSFTNGP12.phx.gbl>,
stripit...@gmx.net says...

A while back I was writing a spectral processor where
every pixel mapped to a data point. I wanted to be able
to print spectra or sub-spectra without losing data.
I also wanted to be able to edit/move notations and figures
transparently. I thought this was a good opportunity to
brushup on my asm. So, I did the thing using optimized
mmx. It worked pretty well - very little tearing while
dragging things around the screen. The embarrassing part
is that it used a ton of virtual memory. I saved a
snapshot of the background without the object I was moving
onto a giant dibsection the size of the window area. On the
same dibsection were areas for both the object and an equal
sized mask area. This was tested on a pentium pro and PIII.
I tried OpenGL, but printing was a real mess. I didn't like
DirectX. I saw your comment on copying megabytes of memory
and I wondered again how I could have done this effectively
with less memory.

mike

Slava M. Usov

unread,

Jul 29, 2003, 6:22:21 AM7/29/03

to

"mike" <winte...@operamail.com> wrote in message

news:MPG.198ef8146...@msnews.microsoft.com...

> A while back I was writing a spectral processor where
> every pixel mapped to a data point. I wanted to be able
> to print spectra or sub-spectra without losing data.
> I also wanted to be able to edit/move notations and figures
> transparently. I thought this was a good opportunity to
> brushup on my asm. So, I did the thing using optimized
> mmx. It worked pretty well - very little tearing while
> dragging things around the screen. The embarrassing part
> is that it used a ton of virtual memory.

Say, when you were moving objects "transparently", did you create an
overlapped image and then blit to the screen's DC? So where did you
implement memory _copying_? To superimpose objects with background you
needed more than just a copy even if you did everything yourself, and then
you had to use bitblt(), right?

[...]

> I tried OpenGL, but printing was a real mess. I didn't like
> DirectX. I saw your comment on copying megabytes of memory
> and I wondered again how I could have done this effectively
> with less memory.

So, did you use bitblt() or did you access video memory "directly" in some
way?

S

mike

unread,

Jul 30, 2003, 12:23:34 PM7/30/03

to

In article <u$5GNtbVD...@TK2MSFTNGP12.phx.gbl>,
stripit...@gmx.net says...

As object is moving to the right, I bitblt the
newly exposed background on the left to the screen
from my saved background dib. I then do the masking
operations in my giant dib which contains the
background, MaskOfCurItem, CurItem. This is where
I am copying data from one part of the dib to another.
It really is off topic for this thread of discussion
since I'm proccessing the data before the copy. Probably
the wrong newsgroup too.

The result, which leaves the CurItem,MaskOfCurItem intact
for later operations, is copied to the screen with a simple
BitBlt.

I was also able to do all of this with opcodes
and BitBlt in case there was no MMX. I tested it for
the first time on my new p4 and it really looks good.
Task manager says I'm using about 10 Mb when maximized.
The debug exe is 390kb. My display is set at 1600x1200x32.
I'm dragging a square donut which is about 500x300. The
transparent bitblt version takes about 25 cycles/byte.
The mmx version takes about 4 cycles/byte.

mike

Slava M. Usov

unread,

Jul 31, 2003, 6:10:35 AM7/31/03

to

"mike" <winte...@operamail.com> wrote in message

news:MPG.1991acff2...@msnews.microsoft.com...

[...]

> I am copying data from one part of the dib to another.
> It really is off topic for this thread of discussion
> since I'm proccessing the data before the copy.

That was my point. Processing and moving data is one thing, just copying is
another. I indeed find it hard to imagine a situation when somebody
legitimately wants to copy a couple of megs from one memory location to
another without changing them in transit. I think that should be avoided
like the plague, because you end up using twice the memory you need and
wasting time copying it, irrespective of how fast you can do that.

S

Slobodan Brcin

unread,

Jul 31, 2003, 11:05:11 AM7/31/03

to

This is all a fun, but it is a real question what do you want to do.

Before you can talk about copying from memory to memory you must define what
memory means. I will assume that we do sequential reads and writes not
random access.

Video memory (Memory on graphic adapter)
System memory (Main memory)
Card memory (Memory on some device)

Performances are significantly affected by the type of source or destination
memory, and by the chipset, as well and as processor and instruction set
involved.

Video memory uses cache type write combine, and you cant change this. This
means that if you write in one wc buffer (same 64 byte block) data don't go
to video memory until you change wc buffer, then processor burst data to
video memory, achieving maximum write speed. On NVIDIA AGP 8x cards you can
achieve 2GB/s transfer with almost every processor function movb, movw, movd
or with SSE2 functions this is few percent slower than maximum agp 8x
allows.
Owners of ATI card will be disappointed maximum is 1GB/s on APG 8x cards,
and below 250MB/s on AGP 4x cards.
And one more thing random writing to video memory is very very slow, so you
should avoid it.

If you try to read from graphics card you can't read more than 100-300 MB/s
depends on type of graphic card, because read in not cached.

If you have system with Intel P4 @ 800 Mhz FSB and 2x400 Mhz DDR memory you
could achieve real transfers above 5GB/s.

System memory depending of cache type specified:
If cache type is write combined then it is like for video memory you can
write fast with all instructions, but the read from this type of memory is
slooow.
No cache it is show for read and write and you can't do a thing to make it
faster.
Cached for optimal write performance you must bypass normal caching and
write and use nontemporal write with instructions like movntq, movnti, etc.
You should have write at theoretical maximum.
To read fast there are many issues it is not simple to make a fast read. It
depends on chipset, memory, processors, and so on. One setting optimal for
one processor or chipset is not optimal for other and so on. But the fastest
read for large block transfers is to use prefetchnta to give processor a
hint what will be needed later (depends on many factors but let say 1KB
ahead of current read).
Instead of prefetch instructions you can use mov instructions to bring one
memory block in cache and then read all the data, sometimes it's faster.

Device memory you should really use some DMA transfer that is available on
that device. Otherwise is it really slooooooow.

> That was my point. Processing and moving data is one thing, just copying
is
> another. I indeed find it hard to imagine a situation when somebody
> legitimately wants to copy a couple of megs from one memory location to
> another without changing them in transit. I think that should be avoided
> like the plague, because you end up using twice the memory you need and
> wasting time copying it, irrespective of how fast you can do that.

You can do some last data computation while copying, yes.

But if you want to draw many small objects on screen purely in software.
Than you should consider drawing in system memory, and then making fast
software copy to video memory.

You can do that at overall speed of 1.5GB/S sometimes even more.

To conclude:

rep movsd is really simple and in most cases very very slow, but get
work done.

If you need speed you must spend great deal of time tuning your software for
yours system, or making even different copy routines for different systems.

"Slava M. Usov" <stripit...@gmx.net> wrote in message

news:eNNb9v0V...@tk2msftngp13.phx.gbl...

Slava M. Usov

unread,

Jul 31, 2003, 11:26:20 AM7/31/03

to

"Slobodan Brcin" <sbr...@ptt.yu> wrote in message
news:u2VuqU3V...@TK2MSFTNGP09.phx.gbl...

> This is all a fun, but it is a real question what do you want to do.

I recommend that you read the whole thread before you write. You have said
what has been said many times already. Besides, some of your conclusions are
wrong, as has been demonstrated in the thread, too.

S

Slava M. Usov

unread,

Jul 31, 2003, 1:25:57 PM7/31/03

to

"Slobodan Brcin" <sbr...@ptt.yu> wrote in message

news:OLaoEF4V...@TK2MSFTNGP09.phx.gbl...
> Sorry if I have offended you or anyone else in any way.
> I have already read all postings.

I was not offended, don't worry. If you have read all the postings, then you
should know that "rep movsd" is not the slowest in most cases, it is
actually the fastest in most cases. It is true that we only measured its
performance with the normal [i.e., not video] memory.

You did not measure the speed of "plain" memory copy, and so you concluded
that "rep movsd" was the slowest. I performed direct tests of the speed of
memory copying, and some results were posted. In my tests, a number of
machines, PIII and P4 alike, all behaved in the same way: "rep movsd" was
the fastest way to copy memory up to 256K blocks. For example, on a P4 Xeon
2.4GHz, I had ~650 CPU cycles per kilobyte, that makes ~3.7 GB/s [I repeat
that is COPYING, not just reading or "filling"]. With bigger blocks, the
speed dropped to ~700MB/s. On the other hand, the best performance that I
had with the non-temporal prefetched and "locally buffered" memory copying
was almost constant for all block sizes, at ~1900 CPU cycles per kilobyte,
which makes ~1.3 GB/s.

[...]

> And please tell me except I repeated some things that other said already,

Things like "rep movsd works everywhere".

> what conclusion is wrong.

That rep movsd is the slowest at memory copying.

I actually appreciate your posting the test data, but they do not comprise
raw "memory copy" performance figures. You measure memory read, memory fill
and memory read + combine + memory write. This is a not what we have been
discussing.

S

Slava M. Usov

unread,

Aug 1, 2003, 6:56:22 AM8/1/03

to

"Slobodan Brcin" <sbr...@ptt.yu> wrote in message

news:#qst6H7V...@TK2MSFTNGP10.phx.gbl...
> I have made elementary test for sys mem. Well, must be honest I didn't
> expected to see this high numbers for rep movsd. I have included the test
> source in attachment.

That does not prove anything. I've mentioned that rep movsd was faster up to
256K, measuring that for 8MB is hardly relevant.

> On Intel 875 Chipset P4 2400 MHz @ 533 Mhz DualChannel 333MHz Memory I got
> for movsd ~989 MB/s with SSE2 instructions ~1400 MB/s.

Which is similar to my results for blocks bigger than 256K. So what?

[...]

> Results that you have mentioned ~3.7 GB/s is not possible even
> theoretically than would mean that you can read 3.7GB/s and write
> 3.7 GB/s that is around 7.4 GB/s.
> You have used the same memory for successive tests. You should first
> pollute cache just before the test. This should give you the real
> memory transfer results.

I do not care about such theories and about such "real memory transfer
results". If we talk about reality, then in real world code it is 20% of
memory that is used 80% of total execution time, that means repeated access
to the same memory, that means the memory is cached; that also means such
memory is not copied by megabytes. I specifically exclude video and device
memory here.

S

Slobodan Brcin

unread,

Aug 1, 2003, 3:09:36 PM8/1/03

to

> That does not prove anything. I've mentioned that rep movsd was faster up
to
> 256K, measuring that for 8MB is hardly relevant.

Sorry, for that piece of code with 8MB test. But results are in favor or
SSE2 and for smaller that 64KB tests if we don't use already cached data.
Intel did a very good job in optimizing rep movsd instruction. And of course
if I could get with some weird optimization up to 25% better results in
99.99% cases I would rather used rep movds or RtlCopyMemory I don't care
which. Because in most of the times I'm not in hurry for small memory
transfers.

Today I have tested MMX based copying for XP processor on NFORCE2 chipset,
using only movq instruction for loading and storing, mem block was 8 KB
using the same cached memory, and I got about 20% improvement over the
movsd, but also on Intel platform it was much slower than movds, so there
are no absolute winners in speed.

> I do not care about such theories and about such "real memory transfer
> results". If we talk about reality, then in real world code it is 20% of
> memory that is used 80% of total execution time, that means repeated
access
> to the same memory, that means the memory is cached; that also means such
> memory is not copied by megabytes. I specifically exclude video and device
> memory here.

That is all true but I care about "real memory transfers" between every
piece of hardware, because I don't deal with theory. And I must support XP
and P4 processors and variety of chipsets that are so different.

In my very real and practical situation I have 4 video grabber card that my
company developed. Every card is capable of capturing 4 different composite
sources (cameras). Every card captures 100 fields in resolution 704x288x2.
Two cards with all overhead uses around 108 MB/s of PCI bandwidth. I have
two PCI busses filled with two cards each, although it can work on three
buses. Every card possesses 7 different DMA processors, 2 of them we
dedicated for video transfer. On each card two RISC processors are directing
the flow of video pictures in system memory. Every picture is 396 KB and
when arrives in memory it is in UYVY format (U0 Y0 V0 Y1 U1 Y2 V1 Y3). U,
V - Crominance, Y - Lumminance. Operation that must be done on this is UYVY
to YUV or translated in English, one stream must be copied to three streams
in three different places in memory. On every picture 400 Pictures per
second. Then I must resize chosen pictures to some size. And then I must
send them to video overlay in format UYVY, so YUV to UYVY conversion is
required.

And of course there is, a question how to copy desired YUV pictures to some
memory that wont be overwritten. This all is happening in driver.

So I don't have source pictures in cache, because they are arriving trough
the DMA. I'm copying pictures with or without format change on different
places. And because in user mode there are many processes for compressing
video material, I simply don't have the luxury of using cache for one time
operations.

BTW:
I don't know how I got myself in this discussion, my only intention was to
make a point what can be achieved with direct access to video memory, what
is something that Mike asked.
I know what can be achieved when you transfer data from system memory to
video memory. How to read fast and how to write fast is what I know.
Everything else is mystery and is more likely for divination.

Why memory to memory transfers are so slow, when partially read and writes
are at theoretical limits, this is a real question?

I don't know maybe because transfers take place in the same memory banks so
we must wait for all those CAS, RAS, and other timings. Maybe if we copied
from one physical memory module to the other then we might have a
difference.
I don't have time or will to try this, and it would be only for theory
because it's very hard to certainly allocate physical memory. But I am
willing to discus on this matter, and not on the matter what is better movsd
or MMX or SSE2, it's irrelevant and we all saw the numbers.

"Slava M. Usov" <stripit...@gmx.net> wrote in message

news:OeeVMuB...@TK2MSFTNGP09.phx.gbl...

Slobodan Brcin

unread,

Aug 1, 2003, 3:47:51 PM8/1/03

to

Hi Mike,

If you are not using 3D you should consider DirectDraw v7. Because there is
everything that you need for direct access to video memory.
If you like I can provide you with code that will give you ptr to primary
display surface.

You have two choices you can use exclusive access to video memory but then
you will not have user interface, but only what you draw. In this mode you
can easily flip (swap surfaces synchronously with VSync) one surface to
become the primary (visible) surface, so you don't need copying at all,
except to copy data to backbuffer surface. Tearing does not exists.

Copying to video memory is very very fast. And if you have some Geforce 4 8x
(not tested on MX) you can fill video memory at speed of exactly 2GB/s. And
copying from buffer in system memory to video memory should go above
1.5GB/s. If we know that the picture size in 1600x1280x4 is ~7.8 MB you can
do the math.

I have results 1.1 GB/s but this is because I read from three sides in
memory and use extensive MMX transformations before I write it to video
memory.

Second more difficult but yet simple option is to use hardware overlay. You
create overlay surface, get pointer to it, and to make it simple make it
overlay the whole desktop.
You designate one color that will be transparent. And when graphics card
encounter such color instead of her you will see pixel from overlay. This
will allow you to use all GDI functions and forms. And to draw in separate
memory regardless of their state.
Also you can make overlay back buffer, so you can tell overlay to
automatically flip when VSYNC starts.

If you have small amount of pixels to write you can white them directly in
video memory so you wont need any system memory.

If you have large amount of pixels fasted method would be to write them to
system memory and then to do copying. And don't use blits because software
copying is faster. Only use it if you can use hardware blits that are not
working very well for most memory surfaces.
Overlays are a bit tricky because you must use one of the YUV formats, but
this is not a big problem for most applications (video applications). :(
There should be RGB overlays but I haven't used them.

If you want I can send you few lines that will give you required interfaces
and pointer to primary surface memory.

Hope this will help,

Best regards,
Slobodan

"mike" <winte...@operamail.com> wrote in message

news:MPG.1991acff2...@msnews.microsoft.com...

Slava M. Usov

unread,

Aug 1, 2003, 5:05:33 PM8/1/03

to

"Slobodan Brcin" <sbr...@ptt.yu> wrote in message

news:O0UpVXGW...@tk2msftngp13.phx.gbl...

> Sorry, for that piece of code with 8MB test. But results are in favor or
> SSE2 and for smaller that 64KB tests if we don't use already cached data.

That may be the case. But in most applications, at least those that we
normally discuss in this newsgroup, it is usually cached transfers of less
than 1KB and very infrequently of a few KBs that happen most of the time.
And even they do not dominate the execution time, so optimizing them is a
pure waste of time, especially because it will break six ways from Sunday as
soon as a new CPU or chipset or memory chips are out.

I've been saying that for almost two weeks. I'm glad that you agree.

[...]

> I don't know how I got myself in this discussion, my only intention was to
> make a point what can be achieved with direct access to video memory, what
> is something that Mike asked.

Then I'm sorry, because I did not realize you were talking about sys. memory
to vid. memory. We were talking about sys. memory to sys. memory [because
this is not a video/GUI forum]. That is why I asked Mike where he copied and
whether he used any blit().

[...]

> Why memory to memory transfers are so slow, when partially read and writes
> are at theoretical limits, this is a real question?

When you only need to write to it [when you stream from some external
source] then it is a long burst in one direction, and the same holds for
streaming from sys. memory. But when you copy then it is a lot more complex.
There are two different kinds of transactions so it has to switch from one
mode to another. Besides, the CPU uses short bursts [a cache line or a write
buffer]. Initiation of each burst is expensive with a typical memory chip.
Also, because the chips for the sys. memory are the cheapest possible :-)
that constant switching is all the more expensive. That is what I think
happens.

I agree with the rest of your message.

S

Slobodan Brcin

unread,

Aug 1, 2003, 5:42:49 PM8/1/03

to

Why wouldn't I agree? Everything that you have said is true.
memcpy can be hardly outperformed, but with great price to pay, loosing
portability, going to asm, etc. And gain for most applications is few
microseconds that end user won't feel :)

Best regards,
Slobodan Brcin

"Slava M. Usov" <stripit...@gmx.net> wrote in message

news:OS$4lCHWD...@TK2MSFTNGP12.phx.gbl...

Carl Appellof

unread,

Aug 1, 2003, 7:18:22 PM8/1/03

to

"Slava M. Usov" <stripit...@gmx.net> wrote in message

news:OS$4lCHWD...@TK2MSFTNGP12.phx.gbl...

> That may be the case. But in most applications, at least those that we
> normally discuss in this newsgroup, it is usually cached transfers of less
> than 1KB and very infrequently of a few KBs that happen most of the time.
> And even they do not dominate the execution time, so optimizing them is a
> pure waste of time, especially because it will break six ways from Sunday
as
> soon as a new CPU or chipset or memory chips are out.
>

I worked on an application once where memory copying DID dominate the
execution time. The program took an input buffer and had to add headers and
trailers to reformat it into something nice for a device. Despite having an
8-way Alpha SMP system with extremely fast processors, we saw nearly 100%
CPU utilization. Turned out to be memory bandwidth that was the limit, not
CPU speed itself. The only way out was to come up with a new format that
didn't require the copy.

It wouldn't surprise me if Slava's observations about rep movsd being
fastest only up to 256KB blocks had something to do with the processor's
cache algorithm. I'll bet that above 256KB, the source and target memory
blocks might use the same cache lines in some cases, so each access to
source or target block would flush the other block's cache.

Carl

Slava M. Usov

unread,

Aug 1, 2003, 7:54:53 PM8/1/03

to

"Carl Appellof" <carl.a...@nospam.veritas.com> wrote in message
news:O76qSMIW...@TK2MSFTNGP10.phx.gbl...

[...]

> The only way out was to come up with a new format that didn't require
> the copy.

See! I've been saying "if you have memory copying galore, then something is
already wrong".

In your case, though, it is partially so, because you're dealing with
devices and you could [in principle] use DMA and you had to feed the data to
the device anyway. But the double buffering could and should be eliminated.
It is a pity that even today the DMA controllers are often broken and you
have to double buffer. On the positive side, though, MS has done much to
help fight that, especially in the network stack.

> It wouldn't surprise me if Slava's observations about rep movsd being
> fastest only up to 256KB blocks had something to do with the processor's
> cache algorithm.

For P4, it can be easily explained: the L2 cache size is 256K. For PIII it
can be something else, but there are actually models that have 128K and 256K
L2 caches; chances are I used one of those for testing. Hmm, I need to be
more discriminative about those things :-)

S

Carl Appellof

unread,

Aug 5, 2003, 2:04:30 PM8/5/03

to

"Slava M. Usov" <stripit...@gmx.net> wrote in message

news:uoiUPhI...@tk2msftngp13.phx.gbl...

> In your case, though, it is partially so, because you're dealing with
> devices and you could [in principle] use DMA and you had to feed the data
to
> the device anyway. But the double buffering could and should be
eliminated.
> It is a pity that even today the DMA controllers are often broken and you
> have to double buffer. On the positive side, though, MS has done much to
> help fight that, especially in the network stack.
>

Well, yes, I would have liked to use scatter/gather DMA to output my
header/data/trailer stuff to a device. But the only option for mass storage
devices is to do it on a sector boundary, and my header and trailer were
smaller than a complete sector. With network devices, it's probably a
different story.

Carl