Can anyone help me understand what the PREFETCHNTA
instruction is used for? What does it actually do?
Best regards,
Amal
Are you asking about prefetching in general, or trying to contrast
prefetchnta to the other prefetch instructions?
Prefetching in general tries to move data from main memory into the
cache early, so that it's there when the program actually does
reference it. If you know that 50 instructions from now your program is
likely to want to access location 12345, issuing a prefetch on 12345
will likely move that data into the cache, and the processor won't
stall waiting for memory at the actual instruction using the data.
The processor also does prefetching automatically, trying to predict
future references (although this is somewhat limited).
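The general idea described above can be sketched in C with GCC/Clang's portable prefetch builtin. This is only an illustration, not anyone's posted code; the lookahead distance of 64 elements is an assumption that real code would tune for the target machine:

```c
#include <stddef.h>
#include <assert.h>

/* Minimal software-prefetch sketch: each iteration requests the element
   PF_DIST slots ahead, so the cache line is hopefully resident by the
   time the loop reaches it.  PF_DIST = 64 is an assumed tuning value. */
enum { PF_DIST = 64 };

long sum_with_prefetch(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], /*rw=*/0, /*locality=*/3);
        s += a[i];  /* the actual reference the prefetch was hiding */
    }
    return s;
}
```

Whether this beats the hardware prefetcher on a simple linear scan is exactly the question debated further down the thread.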
The prefetches don't really do anything as far as the visible state of
the processor and program execution are concerned, and may in fact be
true no-ops on a given implementation. They may also fail to do
anything in specific instances (perhaps the processor already has too
many memory references outstanding, and can't add the prefetch request
to the queue). The only thing a prefetch (might) accomplish, when used
correctly, is moving data from main memory into the cache earlier than
it would otherwise arrive, potentially improving performance by
avoiding (or reducing) a memory stall.
Prefetching can also be very bad for performance, if you prefetch a
bunch of data that you're *not* going to use, and thus evict useful
data from the cache.
The non-temporal prefetch is intended as a prefetch hint that you will
use the prefetched data only once, unlike the temporal prefetches which
imply that you're going to use the data repeatedly. To some extent the
processor might adjust its cache management/replacement policy based on
the knowledge that the data will be used once or repeatedly - exactly
what happens is implementation dependent (for example, exactly what the
PIII and P4 do with those is different - see the Intel docs, and AMD is
different too).
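As a rough illustration of the two flavours, GCC and Clang expose them through the third ("locality") argument of `__builtin_prefetch`; on x86, locality 0 typically compiles to PREFETCHNTA and locality 3 to PREFETCHT0, and on other targets the builtin degrades harmlessly. A sketch, not a benchmark:

```c
/* Hint flavours via GCC/Clang's builtin.  On x86, locality 0 usually
   emits PREFETCHNTA (use-once data) and locality 3 emits PREFETCHT0
   (data expected to be reused at all cache levels). */
void prefetch_use_once(const void *p)
{
    __builtin_prefetch(p, 0, 0);  /* non-temporal: e.g. a streaming copy */
}

void prefetch_reused(const void *p)
{
    __builtin_prefetch(p, 0, 3);  /* temporal: repeated calculations */
}
```

Either call is purely a hint; it changes no visible program state.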
Code:-
_asm
{
MOV ESI, source
MOV EDI, destination
MOV ECX, count
MOV EDX, 64
SHR ECX, 6
TOP:
PREFETCHNTA 64[ESI]
// Copy data from source(L1 cache)
MOVQ MM0, 0[ESI]
MOVQ MM1, 8[ESI]
MOVQ MM2, 16[ESI]
MOVQ MM3, 24[ESI]
MOVQ MM4, 32[ESI]
MOVQ MM5, 40[ESI]
MOVQ MM6, 48[ESI]
MOVQ MM7, 56[ESI]
// Save the data from MM registers to Destination
MOVNTQ 0[EDI], MM0
MOVNTQ 8[EDI], MM1
MOVNTQ 16[EDI], MM2
MOVNTQ 24[EDI], MM3
MOVNTQ 32[EDI], MM4
MOVNTQ 40[EDI], MM5
MOVNTQ 48[EDI], MM6
MOVNTQ 56[EDI], MM7
ADD ESI, EDX
ADD EDI, EDX
DEC ECX
JNZ TOP
EMMS
}
Can anyone suggest a faster method than this? It runs almost
40% faster than the ordinary C++ memcpy function, but that speed is not
enough for my application.
spam...@crayne.org wrote:
> Hi,
>
> Can anyone help me understanding what is the use of PREFETCHNTA
> instruction? What does it actually do?
The PREFETCHNTA instruction hints to the processor that the data should
be fetched non-temporally (i.e. the data is not going to be reused, or
will be used only once). For example, when you're copying data from one
location to another, you can use this instruction. The PREFETCHTn
instructions, on the other hand, hint to the processor that the data
will be needed repeatedly, e.g. when you're doing calculations on the
same data.
>
> Best regards,
> Amal
Thanx
Ashish Shukla alias Wah Java !!
--
http://wahjava.blogspot.com/
First, and foremost, there is the paging issue, which if mishandled can
dominate anything else. But, I presume, that you have both the source
and destination in main memory, with nothing paged out and no need for
page-allocations in the middle of the memory copy.
Second, there is the issue of caching. It's been my experience that the
hardware prefetcher makes any software prefetching of data virtually a
nil gain, no matter how you structure it, with this type of memory
stride. A single linear constant-stride access is pretty much what the
hardware prefetcher was designed to handle best, and you probably can't
do much better. In other words, the prefetch instruction isn't telling
the processor to do anything new, since the hardware is already doing it.
Lastly, your choice of move instruction is pretty important.
Non-temporal stores are very obviously the way to go, but I am curious
why you chose 64-bit MMX instructions instead of 128-bit SSE
instructions. Did you actually check whether that gave you any
performance gain or loss? I suspect this operation is entirely capped
by memory bandwidth, so all other things being equal, using SSE over
MMX probably isn't going to make that large a difference, but it is
worth trying.
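For what it's worth, a 128-bit SSE2 variant of the poster's loop might look like the following in C intrinsics. This is a hypothetical sketch, not tested against the original: it assumes 16-byte-aligned buffers, a size that is a multiple of 64 bytes, and a 512-byte prefetch distance that would need tuning:

```c
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <stddef.h>

/* 128-bit aligned loads plus non-temporal (streaming) stores,
   mirroring the MOVQ/MOVNTQ structure of the MMX loop above. */
void copy_stream(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i += 4) {
        /* prefetch well ahead of the current line (distance is a guess) */
        _mm_prefetch((const char *)(s + i) + 512, _MM_HINT_NTA);
        __m128i x0 = _mm_load_si128(s + i);
        __m128i x1 = _mm_load_si128(s + i + 1);
        __m128i x2 = _mm_load_si128(s + i + 2);
        __m128i x3 = _mm_load_si128(s + i + 3);
        _mm_stream_si128(d + i,     x0);
        _mm_stream_si128(d + i + 1, x1);
        _mm_stream_si128(d + i + 2, x2);
        _mm_stream_si128(d + i + 3, x3);
    }
    _mm_sfence();  /* order the streaming stores before later accesses */
}
```

If the copy really is bandwidth-bound, this should land close to the MMX version, which is the point being made above.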
There are some other arcane tricks (like TLB priming) that might make a
minor difference.
As coded, your prefetch isn't doing anything much, since there's not
enough time between the prefetch and the use. You've got to prefetch
much further ahead - try prefetching 300-500 bytes ahead in your loop.
Other than that, a minor issue is that you should move the updates to
the registers higher in the loop (update edi at the very top, update
esi and ecx in the middle).
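The larger-distance idea can be sketched in C like this; the 384-byte distance is just an arbitrary pick from the suggested 300-500 byte range, and the per-line memcpy stands in for the MOVQ/MOVNTQ body of the original loop:

```c
#include <stddef.h>
#include <string.h>

/* Prefetch several cache lines ahead of the line currently being
   copied.  PF_AHEAD = 384 is an assumed value in the 300-500 range. */
enum { PF_AHEAD = 384 };

void copy_blocks(char *dst, const char *src, size_t n)
{
    for (size_t off = 0; off + 64 <= n; off += 64) {
        if (off + PF_AHEAD < n)
            __builtin_prefetch(src + off + PF_AHEAD, 0, 0);  /* NTA-style */
        memcpy(dst + off, src + off, 64);  /* copy one cache line */
    }
}
```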
For a rather different approach, try a technique called cache blocking.
Fill the cache with much more data on the input side, and then copy
that chunk. Fetch about 4KB of data into the cache by doing a read of
locations from [esi+4032] to [esi+0] in steps of -64, and then execute
the loop you've written above for that same 4KB block. You have to go
backwards to avoid triggering hardware prefetch, which will probably
get in the way. Also, you don't need any prefetch instructions in the
copy loop, since the data's in cache anyway.
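The cache-blocking scheme described above might be sketched in C as follows. The 4 KB block and 64-byte line sizes follow the post, the backwards touch loop plays the role of the fill pass, and the tail handling is omitted for brevity:

```c
#include <stddef.h>
#include <string.h>

enum { BLOCK = 4096, LINE = 64 };

void copy_cache_blocked(char *dst, const char *src, size_t n)
{
    for (size_t base = 0; base + BLOCK <= n; base += BLOCK) {
        volatile char sink;
        /* Pass 1: read [base+4032] down to [base+0] in steps of -64,
           backwards to stay out of the hardware prefetcher's way. */
        for (long off = BLOCK - LINE; off >= 0; off -= LINE)
            sink = src[base + off];
        (void)sink;
        /* Pass 2: copy the now-cached block; no prefetch needed here. */
        memcpy(dst + base, src + base, BLOCK);
    }
    /* Tail for n not a multiple of BLOCK left out for brevity. */
}
```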
So you get 1100 MB/s bandwidth. Not too bad, but this
depends on memory speed.
> of data. If I delete line (1), the program still takes 9 ms.
> My major aim is to minimize the time taken by this copying.
> If there are any other possible ways, can you suggest them?
> PREFETCHNTA 64[ESI]
Far too small a stride. Try much larger, say 640[esi]
[ full MMX block copy with MOVNTQ stores snipped ]
This is a pretty good algorithm. It can be improved by
cache streaming (read ~2 KB, then write 2 KB) to take
advantage of cacheline bursting.
Fast memcpy is a perennially important ASM subject.
AMD has a nice write-up in their Optimization manual.
-- Robert
I am not sure about this, but since PREFETCHNTA is a non-temporal cache
load, you should use the data only once. But you read the same cache
line (64 bytes) eight times with those MOVQ loads, so maybe one of the
other prefetch instructions would be worth trying.
There is a memcpy() replacement in the AMD documentation, but AFAIK it
does not rely on the prefetchnta instruction, so don't expect any
advantage there.
I wasn't aware that hardware prefetchers were that intelligent.
You are absolutely right that PREFETCH will not help something
as simple as a block move. The case that prefetching is
designed to help is where there are calculations that can
run simultaneously with the prefetch. The maximum speedup
is the latency of the fetch or the calcs, whichever is smaller.
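A concrete case is indexed (gather-style) access, which the hardware prefetcher cannot predict; there, a software prefetch a few iterations ahead can overlap the memory fetch with the arithmetic on the current element. A hedged sketch, with the lookahead of 8 as an assumption:

```c
#include <stddef.h>

/* Irregular access pattern: a[idx[i]] defeats hardware prefetch, so
   we request the element needed 8 iterations from now while doing the
   computation on the current one. */
long gather_sum(const long *a, const size_t *idx, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[idx[i + 8]], 0, 3);
        s += a[idx[i]] * 3;  /* the "calcs" that hide the fetch latency */
    }
    return s;
}
```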
-- Robert
Best regards,
Amal P
Sure! The fastest method is just to pass a pointer :)
If you actually want to physically copy the data,
then check the AMD Optimization Manual. It discusses
mem copies at length. It might work (with mods) on
Intel CPUs.
A lot depends on the memory subsystems.
-- Robert
You mention _real_time_ which makes me wonder what operating system you
are using, and what the memory model is. Do you mean a 'true'
real-time system? Is this VxWorks or some other highly custom system
that is actually meant to be "real-time"? Or is this your basic
off-the-shelf linux box?
I suspect that you are running into a hardware wall, more than a
software one. Furthermore, and this may be a stupid suggestion, but are
you sure that, algorithmically, this is actually a necessary operation?
What performance are you expecting? There are certain physical limits that
you simply cannot overcome, such as the cycle time of your memory and the
overhead of your memory bus. Once you get it as fast as possible, you
can't get it any faster.
There is no zero-cycle copy.
--
- Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.
> There is no zero-cycle copy.
I found a wire into the future, to a tomorrow-CPU.
But I have a little problem with my cache,
the data ...oh, one moment... oops...
...the copy ends before it begins to start.
Dirk