
What is the use of PREFETCHNTA?


spam...@crayne.org

Feb 2, 2006, 11:04:17 PM
Hi,

Can anyone help me understand what the PREFETCHNTA instruction is
used for? What does it actually do?

Best regards,
Amal

spam...@crayne.org

Feb 3, 2006, 2:44:51 AM

spam...@crayne.org wrote:
> Hi,
>
> Can anyone help me understand what the PREFETCHNTA
> instruction is used for? What does it actually do?


Are you asking about prefetching in general, or trying to contrast
prefetchnta to the other prefetch instructions?

Prefetching in general tries to move data from main memory into the
cache early, so that it's there when the program actually does
reference it. If you know that 50 instructions from now your program is
likely to want to access location 12345, issuing a prefetch on 12345
will likely move that data into the cache, and the processor won't
stall waiting for memory at the actual instruction using the data.

The processor also does prefetching automatically, trying to predict
future references (although this is somewhat limited).

The prefetches don't really do anything as far as the visible state of
the processor and program execution are concerned, and may in fact be
true no-ops on a given implementation. They may also fail to do
anything in specific instances (perhaps the processor already has too
many memory references outstanding, and can't add the prefetch request
to the queue). The only thing a prefetch might accomplish, if it's used
correctly, is moving data from main memory into the cache earlier than
it would otherwise arrive, potentially improving performance by avoiding
(or reducing) a memory stall.

Prefetching can also be very bad for performance, if you prefetch a
bunch of data that you're *not* going to use, and thus evict useful
data from the cache.

The non-temporal prefetch is intended as a prefetch hint that you will
use the prefetched data only once, unlike the temporal prefetches which
imply that you're going to use the data repeatedly. To some extent the
processor might adjust its cache management/replacement policy based on
the knowledge that the data will be used once or repeatedly - exactly
what happens is implementation dependent (for example, exactly what the
PIII and P4 do with those is different - see the Intel docs, and AMD is
different too).

spam...@crayne.org

Feb 3, 2006, 6:23:36 AM
When I run the code given below, it takes 9 ms to copy 10MB of data.
If I delete line (1), the program still takes 9 ms. My main aim is to
minimize the time this copying takes. If there are any other possible
ways, can you suggest them?

Code:

_asm
{
    MOV ESI, source
    MOV EDI, destination
    MOV ECX, count
    MOV EDX, 64
    SHR ECX, 6                  // count / 64 = number of cache lines
TOP:
    PREFETCHNTA 64[ESI]         // (1)

    // Copy data from source (L1 cache)
    MOVQ MM0, 0[ESI]
    MOVQ MM1, 8[ESI]
    MOVQ MM2, 16[ESI]
    MOVQ MM3, 24[ESI]
    MOVQ MM4, 32[ESI]
    MOVQ MM5, 40[ESI]
    MOVQ MM6, 48[ESI]
    MOVQ MM7, 56[ESI]

    // Save the data from MM registers to destination
    MOVNTQ 0[EDI], MM0
    MOVNTQ 8[EDI], MM1
    MOVNTQ 16[EDI], MM2
    MOVNTQ 24[EDI], MM3
    MOVNTQ 32[EDI], MM4
    MOVNTQ 40[EDI], MM5
    MOVNTQ 48[EDI], MM6
    MOVNTQ 56[EDI], MM7

    ADD ESI, EDX
    ADD EDI, EDX
    DEC ECX
    JNZ TOP
    EMMS
}

Can anyone suggest any faster method than this? It runs almost
40% faster than the ordinary C++ memcpy function, but this speed is not
enough for my application.

WahJava

Feb 3, 2006, 5:54:28 AM
Hi amal,

spam...@crayne.org wrote:
> Hi,
>
> Can anyone help me understand what the PREFETCHNTA
> instruction is used for? What does it actually do?

The PREFETCHNTA instruction hints to the processor that the data should
be fetched non-temporally (i.e. the data is not going to be used again,
or will be used only once), e.g. when you're copying data from one
location to another. The PREFETCHTn instructions hint to the processor
that the data will be needed repeatedly, e.g. when you're doing
calculations on the same data.

>
> Best regards,
> Amal

Thanx
Ashish Shukla alias Wah Java !!
--
http://wahjava.blogspot.com/

ldb

Feb 3, 2006, 4:57:44 PM
When moving that much data, there are several considerations...

First, and foremost, there is the paging issue, which if mishandled can
dominate anything else. But, I presume, that you have both the source
and destination in main memory, with nothing paged out and no need for
page-allocations in the middle of the memory copy.

Second, there is the issue of caching. It's been my experience that the
hardware prefetcher makes any prefetching of data virtually a nil
gain, no matter how you structure it, with this type of memory stride. A
single linear constant-stride access is pretty much what the hardware
prefetcher was designed to handle best, and you probably can't do
much better. In other words, the prefetch instruction isn't telling the
processor to do anything new, since the hardware is already doing it.

Lastly, your choice of movement instruction is pretty important.
Non-temporal storing is very obviously the way to go, but I am curious
why you chose to go with 64-bit MMX instructions instead of 128-bit SSE
instructions. Did you actually check whether that made any performance
difference either way? I suspect that this operation is entirely capped
by memory bandwidth, so if all other things are equal, using SSE over
MMX probably isn't going to make that large a difference, but it is
worth trying.

There are some other arcane tricks (like TLB priming) that might make a
minor difference.

spam...@crayne.org

Feb 3, 2006, 5:41:39 PM


As coded, your prefetch isn't doing anything much, since there's not
enough time between the prefetch and the use. You've got to prefetch
much further ahead - try prefetching 300-500 bytes ahead in your loop.

Other than that, a minor issue is that you should move the updates to
the registers higher in the loop (update edi at the very top, update
esi and ecx in the middle).


For a rather different approach, try a technique called cache blocking.
Fill the cache with much more data on the input side, and then copy
that chunk. Fetch about 4KB of data into the cache by doing a read of
locations from [esi+4032] to [esi+0] in steps of -64, and then execute
the loop you've written above for that same 4KB block. You have to go
backwards to avoid triggering hardware prefetch, which will probably
get in the way. Also, you don't need any prefetch instructions in the
copy loop, since the data's in cache anyway.

Robert Redelmeier

Feb 3, 2006, 6:12:24 PM
enjoy...@gmail.com <spam...@crayne.org> wrote in part:

> When I run the code given below, it takes 9 ms to copy 10MB

So you get 1100 MB/s bandwidth. Not too bad, but this
depends on memory speed.

> of data. If I delete line (1), the program still takes 9 ms.
> My main aim is to minimize the time this copying takes. If there
> are any other possible ways, can you suggest them?

> PREFETCHNTA 64[ESI]

Far too small a stride. Try much larger, say 640[esi]

[full MMX block copy w/ MOVNTQ stores snipped]

This is a pretty good algorithm. It can be improved by
cache streaming (read ~2 KB, then write 2 KB) to take
advantage of cacheline bursting.

Fast memcopy is a perennially important ASM subject.
AMD has a nice write-up in their Optimisation manual.

-- Robert

Zdenek Sojka

Feb 3, 2006, 9:00:24 PM

"enjoy...@gmail.com" <spam...@crayne.org> wrote in message
news:1138965816.8...@g14g2000cwa.googlegroups.com...

> When I run the code given below, it takes 9 ms to copy 10MB of data.
> If I delete line (1), the program still takes 9 ms. My main aim is to
> minimize the time this copying takes. If there are any other possible
> ways, can you suggest them?
>
> Code:-
> _asm
> {
>
> MOV ESI, source
> MOV EDI, destination
> MOV ECX, count
> MOV EDX, 64
> SHR ECX, 6
> TOP:
> PREFETCHNTA 64[ESI]

I am not sure about this, but since PREFETCHNTA is a non-temporal cache
load, you should use it only once - yet you read the same cache line
(64 bytes) 8 times, so maybe I would try some other prefetch
instruction. There is a memcpy() replacement in the AMD documentation,
but AFAIK AMD handles the prefetch instructions differently, so don't
expect the same advantages there.


ldb

Feb 3, 2006, 11:43:06 PM
I'm going to say that no matter how you prefetch, you aren't going to
notice much difference on a Pentium 4 or equivalent AMD processor. As I
said in the other post, these processors do have hardware prefetchers
which will pick up on this stride pattern and be all over it within a
few iterations (of which there are 5+ orders of magnitude more). If you
do get a noticeable speedup with prefetching, let me know, because I'm
interested in seeing how and why the hardware prefetcher is
outperformed in this instance, but I suspect that with the data sizes
you are talking about, you will not.

Robert Redelmeier

Feb 5, 2006, 10:29:07 PM
ldb <spam...@crayne.org> wrote in part:

I wasn't aware that hardware prefetchers were that intelligent.

You are absolutely right that PREFETCH will not help something
as simple as a block move. The case that prefetching is
designed to help is where there are calculations that can
run simultaneously with the prefetch. The maximum speedup
is the latency of the fetch or the calcs, whichever is smaller.

-- Robert

spam...@crayne.org

Feb 6, 2006, 4:29:47 AM
Hi,
Can anyone suggest any faster way to copy a block of data? The speed
of the current memcopy is not enough for my realtime application, and I
want to achieve more speed. Is there any other faster method?

Best regards,
Amal P

Robert Redelmeier

Feb 6, 2006, 1:37:29 PM
> Can anyone suggest any faster way to copy block of data? Because
> the speed of current memcopy is not enough for my realtime
> application. I want to achieve more speed. Is there any other
> faster method?

Sure! The fastest method is just to pass a pointer :)

If you actually want to physically copy the data,
then check the AMD Optimization Manual. It discusses
mem copies at length. It might work (with mods) on
Intel CPUs.

A lot depends on the memory subsystems.

-- Robert


ldb

Feb 6, 2006, 3:56:02 PM
Have you tried using SSE instructions instead of MMX? Have you tried
priming the translation lookaside buffer? Are you sure that there are
no paging issues on this machine? The Intel "IA-32 Architecture
Optimization Reference Manual", on page 6-47, has an example of a
memory copy, but I don't think it's going to be much faster than yours
unless you are getting page faults or page-lookup misses.

You mention _real_time_ which makes me wonder what operating system you
are using, and what the memory model is. Do you mean a 'true'
real-time system? Is this VxWorks or some other highly custom system
that is actually meant to be "real-time"? Or is this your basic
off-the-shelf linux box?

I suspect that you are running into a hardware wall, more than a
software one. Furthermore, and this may be a stupid suggestion, but are
you sure that, algorithmically, this is actually a necessary operation?

Tim Roberts

Feb 7, 2006, 1:35:08 AM

What performance are you expecting? There are certain physical limits that
you simply cannot overcome, such as the cycle time of your memory and the
overhead of your memory bus. Once you get it as fast as possible, you
can't get it any faster.

There is no zero-cycle copy.
--
- Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.

Dirk Wolfgang Glomp

Feb 7, 2006, 3:21:12 AM
Tim Roberts wrote:

> There is no zero-cycle copy.

I found a wire into the future to a tomorrow-CPU.
But I have a little problem with my cache,
the data... oh, one moment... oops...
...the copy ends before it begins to start.

Dirk
