I can't see that either L1 or L2 cache might be a factor.
If I am incorrect, please advise what aspect L1 or L2
cache might play in this regard.
After I move the first 58,642,669 bytes in groups of 16
using mmx instructions, I must obviously start at the
beginning of the saved area and at byte 58,642,669 within
the work area. I can either:
(a) Move 58,642,669 bytes a byte at a time.
(b) Continue to use the mmx instructions, but suffer the
lack of alignment when the mov into memory is not
aligned.
Is it possible to predict which method should be used
with a Pentium 4 CPU? The only criteria is that the
complete move be finished ASAP. Extra instructions
are unlikely to be a factor, although the code will
obviously be much easier if (b) is the choice.
The choice of (a) or (b) will obviously indicate the
choice of the third 58,642,669 bytes when the mov into
memory is now word aligned, but not yet 8 byte aligned.
Have I missed other methods of solving this question?
Jerome Fine
You can store the same L1/L2/TLB sized block to all of the target areas
at the same time, before going on to the next.
>
> After I move the first 58,642,669 bytes in groups of 16
> using mmx instructions, I must obviously start at the
> beginning of the saved area and at byte 58,642,669 within
> the work area. I can either:
>
> (a) Move 58,642,669 bytes a byte at a time.
>
> (b) Continue to use the mmx instructions, but suffer the
> lack of alignment when the mov into memory is not
> aligned.
(c) shift the source blocks around so asto make all the writes aligned.
It is possible that it will be faster to use misalgned reads instead of
misaligned writes.
You are trying to use cache-bypassing stores,right?
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Load ECX with the number of dwords to move and then load ESI with
source and EDI with destination pointers. With just few instructions
the REP MOVSD will be a lot faster!!
Regards,
Vinoj
Author of "Classic Utilities using Assembly Language", 1995, India.
Jerome Fine replies: (a) I am attempting to initialize a large work area. How the cache is managed is probably beyond my control. Assuming that cache-bypassing stores are possible, can they be selected or is that usage automatic if stores are consistently overloading the cache? (b) As for misaligned reads, that would be possible by moving a few bytes until the next store byte is aligned. Are you suggesting that as long as the data being read is in L1 cache, the fact that the read is not aligned can be compensated for by the delay in waiting for the aligned store to complete? (c) By the way, do you or anyone know if the L1 cache size on P4 is a standard size which the L2 cache size varies over a wide range depending on the price of the CPU? In any case, it looks like the L1 data cache size is 32 KBytes - can anyone confirm this? If the answer to (b) is yes, then 16 repeats of the first 30 KBytes of the saved area would seem to be a useful test. On the other hand, maybe just as useful might be a loop which reads 8 bytes at a time and stores those 8 bytes 16 times each with an increasing offset of 58,642,669 bytes for each succeeding store. While only 2 of the stores will be aligned on 8 bytes, perhaps the lack of needing to perform repeated reads will overcome the reduced speed of the misaligned stores? Or might the L1 cache / L2 cache hold the stores temporarily and perform an aligned write-through during the flush (after all, the stores are contiguous even though 14/16 of them are not 8 byte aligned) which would overcome the initial lack of alignment? It seems like the best solution would be to attempt different solutions and try them all. The last question I might have is if it is possible to depend on a given solution across different P4 CPUs? Jerome Fine
If you have source and destination overlap then you can do a
calculation in your program whether the dwords are moved from the
bottom to top or top to bottom before the REP MOVSD. So its gonna
take 58,642,669 clock cycles or 58.6MHz at the least or *2 at the
most. Be sure to use Direction Flag for this movement. Currently we
have GHz CPUs so this amount is negligible. If you are doing it
repeatedly then it can be threaded and the main application can be in
CPU 1 and the movement thread can be in CPU 2.
Regards,
Vinoj
Author of "Classic Utilities Using Assembly Language", 1995, India.
> >>Have I missed other methods of solving this question?- Hide quoted text -
>
> - Show quoted text -
Use asm code or maybe compiler intrinsics.
>
> (b) As for misaligned reads, that would be possible by moving
> a few bytes until the next store byte is aligned. Are you
> suggesting that as long as the data being read is in L1 cache,
> the fact that the read is not aligned can be compensated for
> by the delay in waiting for the aligned store to complete?
Yes.
>
> (c) By the way, do you or anyone know if the L1 cache size on
> P4 is a standard size which the L2 cache size varies over
> a wide range depending on the price of the CPU? In any
> case, it looks like the L1 data cache size is 32 KBytes -
> can anyone confirm this?
It is _not_ 32 K on all cpus, use the CPUID opcode to retrieve the real
number.
You could use 8 KB as a safe lower limit.
>
> If the answer to (b) is yes, then 16 repeats of the first
> 30 KBytes of the saved area would seem to be a useful test.
Yes, except try a litle under 8 and 16 kB as well.
>
> On the other hand, maybe just as useful might be a loop
> which reads 8 bytes at a time and stores those 8 bytes
> 16 times each with an increasing offset of 58,642,669 bytes
> for each succeeding store. While only 2 of the stores
> will be aligned on 8 bytes, perhaps the lack of needing
> to perform repeated reads will overcome the reduced speed
> of the misaligned stores? Or might the L1 cache / L2 cache
> hold the stores temporarily and perform an aligned write-through
> during the flush (after all, the stores are contiguous even
> though 14/16 of them are not 8 byte aligned) which would
> overcome the initial lack of alignment?
No, it will not: Misaligned writes are pretty bad in fact, and might not
allow you to use the NT (Non Temporal) versions of the store opcodes.
>
> It seems like the best solution would be to attempt different
> solutions and try them all.
YES!
The last question I might have is
> if it is possible to depend on a given solution across different
> P4 CPUs?
NO!
Not at all, you have to optimize for the actual HW.
Use CPUID to retrieve cache/tlb sizes&layout.
AMD also offers this. It's not like you can't count on cache being
present. Nevertheless, as Terje points out, the CPUID instruction will
tell you all about the cache on the system, so you can adapt your
algorithm accordingly.
> In the latest trends the REP MOVSD
> will take mostly 2 clock cycles and least 1 clock cycle.
This assumes that both the source and destination addresses are
already sitting in cache. You pay a heavy penalty if data has to be
moved from main memory to cache (or cache to main memory) or even
between the L1 and L2 caches.
> First move
> the bytes that are not 32-bit aligned which are negligible in the
> beginning and end and then move the aligned dwords, which forms most
> of the moves, in the 32-bit boundaries and it will warp.
Good advice, of course. But for the memory blocks that the OP is
describing, non-cached move operations are probably a better choice --
at least for the memory write operations.
>
> If you have source and destination overlap then you can do a
> calculation in your program whether the dwords are moved from the
> bottom to top or top to bottom before the REP MOVSD. So its gonna
> take 58,642,669 clock cycles or 58.6MHz at the least or *2 at the
> most.
Your calculations don't include cache latencies. And that's where most
of the time will probably wind up being spent.
> Be sure to use Direction Flag for this movement.
Ouch!
*Dimly* I seem to recall reading something in the Intel literature
that MOVSD was optimized for the case where the direction flag was
clear, but that such optimizations did not apply when the direction
flag was set. This is a *real shaky* memory, so I could be completely
mistaken here. But I'd measure using MOVSD with the direction flag set
on a modern x86 CPU before making this recommendation. Again, it's
quite possible I'm wrong on this, but I'd check it out.
> Currently we
> have GHz CPUs so this amount is negligible.
Not when the OP claimed that speed was very important here. Also keep
in mind that he stated that he has to make multiple copies of the data
(IOW read once, write many).
> If you are doing it
> repeatedly then it can be threaded and the main application can be in
> CPU 1 and the movement thread can be in CPU 2.
That's a great suggestion if the main thread doesn't require the
memory move to be complete before it can proceed. Also, keep in mind
that a memory move of the size the OP is talking about is going to
saturate the bus, therefore the main thread is going to run a bit
slower unless everything it does fits within the cache and the cores
have *different* caches (if they don't, then you wind up with cache
thrashing and running the main thread defeats the purpose of carefully
controlling how the memory move uses the cache).
As usual, I strongly recommend reading the AMD manual here:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf
Chapter Five goes into detail concerning memory move operations.
The manufacturer's literature isn't always the absolute best way to do
the job, but it's a good starting place.
hLater,
Randy Hyde
Source, absolutely.
>
> The other question is if I can start at the beginning of the work area after
> the first 4 times and finish the rest of the mov with a final "REP MOVSD"
> that has the source overlap the destination after 234,570,676 bytes
> or 58,642,669 dwords with ECX set for 234,570,676 times in the loop?
I would check out the AMD optimization manual. Block moves is an
optimization task that has been studied to death on the x86. IIRC,
they have three different suggestions based on the size of the block
you want to move. For large blocks (and your application definitely
falls into that group), I seem to recall that they suggested using the
XMM registers to move the data 16 bytes at a time in an unrolled loop
(load up all the registers, store all the registers, repeat). OTOH,
I've not looked at this problem for a couple of years so I could be
forgetting something (or maybe something has changed with the latest
crop of CPUs).
>
> In addition, what about Terje's suggestion to break the move into areas
> that allow the source to fit and remain in L1 cache and repeat each
> L1 cache portion 16 times? Assuming that L1 data cache is about
> 32 KBytes, splitting the saved data area into about 2000 portions
> would allow all the subsequent (15 of them) reads to be in L1 data cache.
Unless your write instructions skip going through the L1 cache, I'd
recommend using less than *half* the cache for your source data. Don't
forget that the writes will consume cache, too (unless you use
explicit instructions that write directly to memory and bypass the
cache). Of course, it's far better to avoid using writes that go
through the cache.
If your CPU has a unified cache (code+data, most x86s do, IIRC), then
don't forget to leave a small amount of room in the cache for the code
that is executing. And leave room for locals/globals if you use any of
those during your memory move operations.
As Terje points out, the CPUID instruction will give you information
about the cache size, etc. If you want your algorithm to work across
all x86 platforms (and run optimized on each of those), then I'd
recommend following that suggestion.
Of course, if your CPU supports the instruction, I would highly
recommend the use of the MOVNTDQ instruction to store the data away in
memory. This bypasses the cache entirely and lets you use the entire
L1 cache (sans code/locals/globals) for the source data. It also
spares you having to load cache lines for the write operations. I
gather from the discussion that you are aware of this instruction,
right?
Note that the MOVSD instruction is a bad choice explicitly because
both reads and writes go through the L1 (and L2) cache. Thus, you're
going to pay a heavy latency penalty everytime the CPU has to load a
cache line with destination data, only to immediately overwrite all
that destination data.
hLater,
Randy Hyde