Mov question

Jerome H. Fine

unread,

Apr 25, 2007, 11:02:14 PM4/25/07

to

I want to move a large saved block of bytes into a work
area many times (at least thousands, perhaps millions).
The saved area is 58,642,669 bytes in length - an odd
number of bytes. At that point, I also want to duplicate
the saved area 15 more times for a total of 938,282,704
bytes, i.e. the work area will end up having 16 contiguous
copies of the saved area. BOTH the saved area and the
work area start at a byte address aligned at 16 bytes.

I can't see that either L1 or L2 cache might be a factor.
If I am incorrect, please advise what aspect L1 or L2
cache might play in this regard.

After I move the first 58,642,669 bytes in groups of 16
using mmx instructions, I must obviously start at the
beginning of the saved area and at byte 58,642,669 within
the work area. I can either:

(a) Move 58,642,669 bytes a byte at a time.

(b) Continue to use the mmx instructions, but suffer the
lack of alignment when the mov into memory is not
aligned.

Is it possible to predict which method should be used
with a Pentium 4 CPU? The only criteria is that the
complete move be finished ASAP. Extra instructions
are unlikely to be a factor, although the code will
obviously be much easier if (b) is the choice.

The choice of (a) or (b) will obviously indicate the
choice of the third 58,642,669 bytes when the mov into
memory is now word aligned, but not yet 8 byte aligned.

Have I missed other methods of solving this question?

Jerome Fine

Terje Mathisen

unread,

Apr 26, 2007, 2:00:46 AM4/26/07

to

Jerome H. Fine wrote:
> I want to move a large saved block of bytes into a work
> area many times (at least thousands, perhaps millions).
> The saved area is 58,642,669 bytes in length - an odd
> number of bytes. At that point, I also want to duplicate
> the saved area 15 more times for a total of 938,282,704
> bytes, i.e. the work area will end up having 16 contiguous
> copies of the saved area. BOTH the saved area and the
> work area start at a byte address aligned at 16 bytes.
>
> I can't see that either L1 or L2 cache might be a factor.
> If I am incorrect, please advise what aspect L1 or L2
> cache might play in this regard.

You can store the same L1/L2/TLB sized block to all of the target areas
at the same time, before going on to the next.

>
> After I move the first 58,642,669 bytes in groups of 16
> using mmx instructions, I must obviously start at the
> beginning of the saved area and at byte 58,642,669 within
> the work area. I can either:
>
> (a) Move 58,642,669 bytes a byte at a time.
>
> (b) Continue to use the mmx instructions, but suffer the
> lack of alignment when the mov into memory is not
> aligned.

(c) shift the source blocks around so asto make all the writes aligned.
It is possible that it will be faster to use misalgned reads instead of
misaligned writes.

You are trying to use cache-bypassing stores,right?

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Vinoj

unread,

Apr 26, 2007, 7:42:04 AM4/26/07

to

Try REP MOVSD to move the bytes and that executes faster.

Load ECX with the number of dwords to move and then load ESI with
source and EDI with destination pointers. With just few instructions
the REP MOVSD will be a lot faster!!

Regards,
Vinoj

Author of "Classic Utilities using Assembly Language", 1995, India.

Jerome H. Fine

unread,

Apr 26, 2007, 9:41:35 PM4/26/07

to

I can certainly use the "REP MOVSD" instructions and repeat the whole process
a total of 16 times. I wonder whether or not the misaligned dwords should be on
the source side or the destination side?

The other question is if I can start at the beginning of the work area after
the first 4 times and finish the rest of the mov with a final "REP MOVSD"
that has the source overlap the destination after 234,570,676 bytes
or 58,642,669 dwords with ECX set for 234,570,676 times in the loop?

In addition, what about Terje's suggestion to break the move into areas
that allow the source to fit and remain in L1 cache and repeat each
L1 cache portion 16 times? Assuming that L1 data cache is about
32 KBytes, splitting the saved data area into about 2000 portions
would allow all the subsequent (15 of them) reads to be in L1 data cache.

Jerome Fine

Jerome H. Fine

unread,

Apr 26, 2007, 9:39:47 PM4/26/07

to

Jerome Fine replies:

(a) I am attempting to initialize a large work area.  How
the cache is managed is probably beyond my control.
Assuming that cache-bypassing stores are possible, can
they be selected or is that usage automatic if stores
are consistently overloading the cache?

(b) As for misaligned reads, that would be possible by moving
a few bytes until the next store byte is aligned.  Are you
suggesting that as long as the data being read is in L1 cache,
the fact that the read is not aligned can be compensated for
by the delay in waiting for the aligned store to complete?

(c) By the way, do you or anyone know if the L1 cache size on
P4 is a standard size which the L2 cache size varies over
a wide range depending on the price of the CPU?  In any
case, it looks like the L1 data cache size is 32 KBytes -
can anyone confirm this?

If the answer to (b) is yes, then 16 repeats of the first
30 KBytes of the saved area would seem to be a useful test.

On the other hand, maybe just as useful might be a loop
which reads 8 bytes at a time and stores those 8 bytes
16 times each with an increasing offset of 58,642,669 bytes
for each succeeding store.  While only 2 of the stores
will be aligned on 8 bytes, perhaps the lack of needing
to perform repeated reads will overcome the reduced speed
of the misaligned stores?  Or might the L1 cache / L2 cache
hold the stores temporarily and perform an aligned write-through
during the flush (after all, the stores are contiguous even
though 14/16 of them are not 8 byte aligned) which would
overcome the initial lack of alignment?

It seems like the best solution would be to attempt different
solutions and try them all.  The last question I might have is
if it is possible to depend on a given solution across different
P4 CPUs?

Jerome Fine

Vinoj

unread,

Apr 27, 2007, 1:36:04 AM4/27/07

to

The L1 or L2 cache is the implementation of Intel. So you can forget
the "beyond your hands" problems. In the latest trends the REP MOVSD
will take mostly 2 clock cycles and least 1 clock cycle. First move
the bytes that are not 32-bit aligned which are negligible in the
beginning and end and then move the aligned dwords, which forms most
of the moves, in the 32-bit boundaries and it will warp.

If you have source and destination overlap then you can do a
calculation in your program whether the dwords are moved from the
bottom to top or top to bottom before the REP MOVSD. So its gonna
take 58,642,669 clock cycles or 58.6MHz at the least or *2 at the
most. Be sure to use Direction Flag for this movement. Currently we
have GHz CPUs so this amount is negligible. If you are doing it
repeatedly then it can be threaded and the main application can be in
CPU 1 and the movement thread can be in CPU 2.

Regards,
Vinoj

Author of "Classic Utilities Using Assembly Language", 1995, India.

> >>Have I missed other methods of solving this question?- Hide quoted text -
>
> - Show quoted text -

Terje Mathisen

unread,

Apr 27, 2007, 2:25:19 AM4/27/07

to

Jerome H. Fine wrote:
> (a) I am attempting to initialize a large work area. How
> the cache is managed is probably beyond my control.
> Assuming that cache-bypassing stores are possible, can
> they be selected or is that usage automatic if stores
> are consistently overloading the cache?

Use asm code or maybe compiler intrinsics.

>
> (b) As for misaligned reads, that would be possible by moving
> a few bytes until the next store byte is aligned. Are you
> suggesting that as long as the data being read is in L1 cache,
> the fact that the read is not aligned can be compensated for
> by the delay in waiting for the aligned store to complete?

Yes.

>
> (c) By the way, do you or anyone know if the L1 cache size on
> P4 is a standard size which the L2 cache size varies over
> a wide range depending on the price of the CPU? In any
> case, it looks like the L1 data cache size is 32 KBytes -
> can anyone confirm this?

It is _not_ 32 K on all cpus, use the CPUID opcode to retrieve the real
number.

You could use 8 KB as a safe lower limit.

>
> If the answer to (b) is yes, then 16 repeats of the first
> 30 KBytes of the saved area would seem to be a useful test.

Yes, except try a litle under 8 and 16 kB as well.

>
> On the other hand, maybe just as useful might be a loop
> which reads 8 bytes at a time and stores those 8 bytes
> 16 times each with an increasing offset of 58,642,669 bytes
> for each succeeding store. While only 2 of the stores
> will be aligned on 8 bytes, perhaps the lack of needing
> to perform repeated reads will overcome the reduced speed
> of the misaligned stores? Or might the L1 cache / L2 cache
> hold the stores temporarily and perform an aligned write-through
> during the flush (after all, the stores are contiguous even
> though 14/16 of them are not 8 byte aligned) which would
> overcome the initial lack of alignment?

No, it will not: Misaligned writes are pretty bad in fact, and might not
allow you to use the NT (Non Temporal) versions of the store opcodes.

>
> It seems like the best solution would be to attempt different
> solutions and try them all.

YES!

The last question I might have is
> if it is possible to depend on a given solution across different
> P4 CPUs?

NO!

Not at all, you have to optimize for the actual HW.

Use CPUID to retrieve cache/tlb sizes&layout.

spam...@crayne.org

unread,

Apr 27, 2007, 12:56:36 PM4/27/07

to

On Apr 26, 10:36 pm, Vinoj <spamt...@crayne.org> wrote:
> The L1 or L2 cache is the implementation of Intel. So you can forget
> the "beyond your hands" problems.

AMD also offers this. It's not like you can't count on cache being
present. Nevertheless, as Terje points out, the CPUID instruction will
tell you all about the cache on the system, so you can adapt your
algorithm accordingly.

> In the latest trends the REP MOVSD
> will take mostly 2 clock cycles and least 1 clock cycle.

This assumes that both the source and destination addresses are
already sitting in cache. You pay a heavy penalty if data has to be
moved from main memory to cache (or cache to main memory) or even
between the L1 and L2 caches.

> First move
> the bytes that are not 32-bit aligned which are negligible in the
> beginning and end and then move the aligned dwords, which forms most
> of the moves, in the 32-bit boundaries and it will warp.

Good advice, of course. But for the memory blocks that the OP is
describing, non-cached move operations are probably a better choice --
at least for the memory write operations.

>
> If you have source and destination overlap then you can do a
> calculation in your program whether the dwords are moved from the
> bottom to top or top to bottom before the REP MOVSD. So its gonna
> take 58,642,669 clock cycles or 58.6MHz at the least or *2 at the
> most.

Your calculations don't include cache latencies. And that's where most
of the time will probably wind up being spent.

> Be sure to use Direction Flag for this movement.

Ouch!
*Dimly* I seem to recall reading something in the Intel literature
that MOVSD was optimized for the case where the direction flag was
clear, but that such optimizations did not apply when the direction
flag was set. This is a *real shaky* memory, so I could be completely
mistaken here. But I'd measure using MOVSD with the direction flag set
on a modern x86 CPU before making this recommendation. Again, it's
quite possible I'm wrong on this, but I'd check it out.

> Currently we
> have GHz CPUs so this amount is negligible.

Not when the OP claimed that speed was very important here. Also keep
in mind that he stated that he has to make multiple copies of the data
(IOW read once, write many).

> If you are doing it
> repeatedly then it can be threaded and the main application can be in
> CPU 1 and the movement thread can be in CPU 2.

That's a great suggestion if the main thread doesn't require the
memory move to be complete before it can proceed. Also, keep in mind
that a memory move of the size the OP is talking about is going to
saturate the bus, therefore the main thread is going to run a bit
slower unless everything it does fits within the cache and the cores
have *different* caches (if they don't, then you wind up with cache
thrashing and running the main thread defeats the purpose of carefully
controlling how the memory move uses the cache).

As usual, I strongly recommend reading the AMD manual here:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf
Chapter Five goes into detail concerning memory move operations.
The manufacturer's literature isn't always the absolute best way to do
the job, but it's a good starting place.
hLater,
Randy Hyde

spam...@crayne.org

unread,

Apr 27, 2007, 12:36:17 PM4/27/07

to

On Apr 26, 6:41 pm, "Jerome H. Fine" <spamt...@crayne.org> wrote:
> I can certainly use the "REP MOVSD" instructions and repeat the whole
> process
> a total of 16 times. I wonder whether or not the misaligned dwords
> should be on
> the source side or the destination side?

Source, absolutely.

>
> The other question is if I can start at the beginning of the work area after
> the first 4 times and finish the rest of the mov with a final "REP MOVSD"
> that has the source overlap the destination after 234,570,676 bytes
> or 58,642,669 dwords with ECX set for 234,570,676 times in the loop?

I would check out the AMD optimization manual. Block moves is an
optimization task that has been studied to death on the x86. IIRC,
they have three different suggestions based on the size of the block
you want to move. For large blocks (and your application definitely
falls into that group), I seem to recall that they suggested using the
XMM registers to move the data 16 bytes at a time in an unrolled loop
(load up all the registers, store all the registers, repeat). OTOH,
I've not looked at this problem for a couple of years so I could be
forgetting something (or maybe something has changed with the latest
crop of CPUs).

>
> In addition, what about Terje's suggestion to break the move into areas
> that allow the source to fit and remain in L1 cache and repeat each
> L1 cache portion 16 times? Assuming that L1 data cache is about
> 32 KBytes, splitting the saved data area into about 2000 portions
> would allow all the subsequent (15 of them) reads to be in L1 data cache.

Unless your write instructions skip going through the L1 cache, I'd
recommend using less than *half* the cache for your source data. Don't
forget that the writes will consume cache, too (unless you use
explicit instructions that write directly to memory and bypass the
cache). Of course, it's far better to avoid using writes that go
through the cache.

If your CPU has a unified cache (code+data, most x86s do, IIRC), then
don't forget to leave a small amount of room in the cache for the code
that is executing. And leave room for locals/globals if you use any of
those during your memory move operations.

As Terje points out, the CPUID instruction will give you information
about the cache size, etc. If you want your algorithm to work across
all x86 platforms (and run optimized on each of those), then I'd
recommend following that suggestion.

Of course, if your CPU supports the instruction, I would highly
recommend the use of the MOVNTDQ instruction to store the data away in
memory. This bypasses the cache entirely and lets you use the entire
L1 cache (sans code/locals/globals) for the source data. It also
spares you having to load cache lines for the write operations. I
gather from the discussion that you are aware of this instruction,
right?

Note that the MOVSD instruction is a bad choice explicitly because
both reads and writes go through the L1 (and L2) cache. Thus, you're
going to pay a heavy latency penalty everytime the CPU has to load a
cache line with destination data, only to immediately overwrite all
that destination data.

hLater,
Randy Hyde

Jerome H. Fine

unread,

May 4, 2007, 5:41:06 PM5/4/07

to

Jerome Fine replies:

I very much appreciate the suggestions that you and Terje made.
THANK YOU!

The following two links also provide significant information.

Detailed Platform Analysis Memory Analyzer. Intel Pentium M Processor
http://www.digit-life.com/articles2/rmma/rmma-dothan2.html
More
http://www.digit-life.com/articles2/cpu/rmma-prescott2.html

It seems that the suggestion made by Terje to repeat the read portion
which fits into L1 cache and skip forward to each write area that
corresponds to that repeated read portion. When combined with the
MOVNTDQ for the portion that is 16 byte aligned for the write (but not
for the read), that might be the best possible solution.

Also, unrolling the loop and loading xmm0 -> xmm7 all at once before
storing xmm0 -> xmm7 via 8 MOVNTDQ instructions by writing to
16 bit aligned memory is a suggestion I just found via a google of
the MOVNTDQ instruction. This should be a snap when the load
portion is also 16 byte aligned. When the load portion is not aligned,
there should still be ample time to do the 8 loads when all of the data is
in L1 cache memory and still keep up the the 8 write to memory - or
that is what I would hope for. Alternatively, if there is sufficient memory
available, the saved portion would be duplicated 16 times before making
a copy to the work area. that would also solve the alignment problem.
Unfortunately, the work area is best being as large as possible, so any
memory "wasted" by duplicating the saved area will slow down other
portions of the program.

Another question with respect to initialization of the work area is the
probable combination of the saved area with a secondary saved area
using a logical OR operation. Fortunately, although the OR operation
seems to be restricted to 32 bits or 4 bytes at a time, my initial method
would be to read 8 bytes at a time of the primary and secondary work
areas, perform the OR operation within registers followed by moving
the next 8 "ORed" bytes back to an xmm register which would still allow
the MOVNTDQ instruction to write the data out to the work area. Can
anyone verify that the OR must be done 32 bits at a time (which means
that the xmm registers must be split and then both results placed back
into the xmm)? Also, would it be just as fast to read 4 bytes at a time
from the 2 saved areas, OR the data and then combine each pair of
4 bytes into an xmm register for the MOVNTDQ instruction? I suspect
that I will have to try both ways.

Once the work area has been processed, the only thing remaining is to
count the number of zero bits in each byte. From what I have read, the
likely method is to use each byte as an index into a 256 byte table which
contains the corresponding number of zero bits in each byte. I could
attempt to speed up this process by obtaining the count of the number
of zero bits in each 16 bit word, but the 65536 word table would greatly
exceed L1 data cache. Maybe that would not be a problem since the
word table still fits into L2 data cache and reading each successive word
from memory might still be the limiting factor. Can anyone comment as
to whether or not there is a best way to count zero bits (or one bits since
zero bits = 8 - one bits) in each byte (or larger) of a very large work area
(many times L2 data cache)? I have not found a Pentium 4 instruction
which does this - did I miss it?

Jerome Fine