
software prefetch overhead


Vince Weaver

Sep 23, 2015, 10:00:12 AM
So software prefetch instructions should execute fairly quickly, correct?
I know that poorly done prefetches can affect overall cache performance, but
shouldn't the actual instruction itself retire quickly?

This is actually a homework question, but since I'm the one who assigned it,
and the students didn't really have any useful insights, I thought I'd ask
here.

The actual question involved using the "perf record" tool on a Linux machine
to do some profiling. The tool by default samples the PC every 4000 cycles
and generates profiles based on that. I'm running LINPACK compiled
against OpenBLAS on a 4-core (8-thread) Haswell desktop system.

On other systems (most recently an ARM system) the results would be
straightforward: the hotspot in the code would usually be something
unexpected (often an integer add instruction), but this is due to skid,
and if you look back a few instructions it usually points to a vector
multiply instruction in an inner loop.

On the Haswell machine, though, the result is more like this, in dgemm_kernel,
where the first column is % of execution time.

1.42 │ d0: prefetcht0 0x200(%rdi)
1.68 │ prefetcht0 0x200(%r15)
1.61 │ prefetcht0 0x200(%rbp)
...
0.01 │ │ add $0x40,%rdi
0.01 │ │ add $0x40,%r15
0.02 │ │ add $0x40,%rbp
0.01 │ │ add $0xc0,%rsi
│ │ dec %rax
0.01 │ └──jne d0

so very odd. This is consistent too. So why are prefetch instructions
accounting for so many cycles?
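(For reference, the inner loop in the disassembly above corresponds roughly to the following prefetch-ahead pattern. This is an illustrative C sketch using GCC's __builtin_prefetch, not the actual OpenBLAS kernel; the 0x200-byte distance is taken from the listing.)

```c
#include <stddef.h>

/* Illustrative sketch of the prefetch-ahead pattern shown above.
 * 0x200 bytes ahead = 8 cache lines = 64 doubles. Prefetch hints
 * never fault, so running the hint past the end of the array is
 * harmless. Not the actual OpenBLAS code. */
double dot_with_prefetch(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* prefetcht0-style hints: read access, keep in all cache levels */
        __builtin_prefetch(&a[i + 64], 0, 3);
        __builtin_prefetch(&b[i + 64], 0, 3);
        sum += a[i] * b[i];
    }
    return sum;
}
```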

I tried enabling PEBS to reduce the skid, no change. I tried using a prime
number as the sampling frequency, also no change.

The only thing I can think of is possibly the branch is being mispredicted
and that is being accounted weird, but it's a simple loop so that seems
unlikely.

Of course it could be a bug in the perf counter implementation.

Anyway just wondering if anyone had any ideas.

If you use an instruction event instead of a cycles event you get more
believable results, but that doesn't account for latency, just frequency:

0.73 │ vfmadd %ymm0,%ymm3,%ymm15
0.12 │ vmovup -0x20(%rsi),%ymm3
0.68 │ vmovup -0x60(%rdi),%ymm0
0.45 │ vfmadd %ymm0,%ymm1,%ymm4
0.69 │ vfmadd %ymm0,%ymm2,%ymm8
0.31 │ vfmadd %ymm0,%ymm3,%ymm12
0.46 │ vpermp $0xb1,%ymm0,%ymm0

Vince

John D. McCalpin

Sep 23, 2015, 4:44:53 PM
Software prefetch instructions normally execute and retire very quickly,
but there are a couple of cases where they may not.

(1) TLB miss: I think that all mainstream Intel processors will perform
a table walk on a TLB miss due to a software prefetch. My most recent
measurements on Haswell suggest a hardware page table walker latency
of 14 cycles plus the load latency of whatever level of the memory
hierarchy contains the requested page table entry.

(2) Running out of Line Fill Buffers: Each Haswell core can only track
10 outstanding L1 cache misses. It looks to me like subsequent loads
that don't map to one of those 10 cache lines will stall (or reject &
retry) until a Line Fill Buffer opens up. It is certainly possible
that software prefetches behave the same way.
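(Case 2 can be pictured with a sketch like this; the 64-byte line size and the stall-or-queue behavior are stated assumptions, and the function name is made up for illustration.)

```c
#include <stddef.h>

/* Sketch of the Line Fill Buffer case: requesting more distinct cache
 * lines than the core can track at once (10 outstanding L1 misses on
 * Haswell). With nlines > 10, the later requests may stall or queue
 * until a fill buffer frees up. The 64-byte line size is an assumption
 * about the target machine. */
enum { LINE = 64 };

long sum_first_byte_of_lines(const unsigned char *buf, size_t nlines)
{
    /* Issue one prefetch per line up front; the core cannot track
     * more than its fill-buffer count simultaneously. */
    for (size_t i = 0; i < nlines; i++)
        __builtin_prefetch(buf + i * LINE, 0, 3);

    long sum = 0;
    for (size_t i = 0; i < nlines; i++)
        sum += buf[i * LINE];
    return sum;
}
```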

Mike Stump

Sep 23, 2015, 7:00:13 PM
In article <slrnn05c73...@pianoman.cluster.toy>,
Vince Weaver <vi...@pianoman.cluster.toy> wrote:
>So software prefetch instructions should execute fairly quickly, correct?
>I know that poorly done prefetches can affect overall cache performance, but
>shouldn't the actual instruction itself retire quickly?

> 1.42 │ d0: prefetcht0 0x200(%rdi)
> 1.68 │ prefetcht0 0x200(%r15)
> 1.61 │ prefetcht0 0x200(%rbp)

>so very odd. This is consistent too. So why are prefetch instructions
>accounting for so many cycles?

So, think of it this way. There is a very long dynamic chain of
dependencies that end at prefetch. Along this chain are resources
used and the chain ends when all the resources run out. Prefetch uses
something that is used very often on this chain. Now, exactly what,
you'd have to have visibility into the chain and everything
outstanding in that chain or a nice view of the performance counters,
and/or some intuition or experience with the code.

I'd recommend an application to view the performance counters; it can
give a lot more insight into the dynamic behavior of your application.
One of my favorites is the number of L3 cache misses divided by the
number of instructions retired. Exceptionally low numbers are common
in pointer-chasing code, for example.

https://software.intel.com/sites/default/files/article/478273/ipdps13-hpl-final.pdf
might be interesting reading for you, if you've not seen it before.

https://www.pugetsystems.com/labs/articles/Xeon-E5-v3-Haswell-EP-Performance----Linpack-595/
has Haswell performance numbers, and they claim near theoretical peak on the
compute side, so what you are observing may just be the prefetcher
exhausting all available forward-looking space on the fetch side. Once the
compute side is backlogged and in cache, the prefetcher can run
ahead, but it is limited by the compute side making use of the data it
has already brought in, so when the forward-looking space runs out,
it has to stall and wait for the compute side to catch up and
be done with the data. You can think of it as stalled, awaiting a
free cache line to start fetching into. That can't happen until the
compute side reads the line, computes an answer, and puts the answer
away into memory; once the last such store for that line is done,
the line is free to be reused for the next load.

The best in depth view would be on a micro-architectural simulator,
then you can see all the detail. The trick would be finding one.

Vince Weaver

Sep 24, 2015, 1:16:04 AM
On 2015-09-23, Mike Stump <m...@kithrup.com> wrote:
>
> I'd recommend an application to view the performance counters, it can
> give a lot more insight into the dynamic behavior of your application.
> One of my favorite was the number of L3 cache miss divided by the
> number of instructions retired. Exceptionally low numbers are common
> in pointer chasing code for example.

So I ran some tests using some other counters, including the ones
suggested by John McCalpin.

It turns out that the prefetch instructions didn't turn up at all in TLB or
L1-cache measurements.

They did turn up when measuring LLC-load-misses:

LLC-load-misses

0.03 │ d0: prefetcht0 0x200(%rdi)
6.10 │ prefetcht0 0x200(%r15)
5.94 │ prefetcht0 0x200(%rbp)
5.30 │ prefetcht0 0x200(%rsi)
0.19 │ vmovup (%rdi),%ymm1

but I guess that's hardly surprising and doesn't necessarily mean
the instructions are stalling waiting for the LLC misses to be handled.

It turns out though that the instructions also appear if you profile
on branch-misses:

branch-misses

0.17 │ d0: prefet 0x200(%rdi)
4.02 │ prefet 0x200(%r15)
9.58 │ prefet 0x200(%rbp)
0.85 │ prefet 0x200(%rsi)
0.00 │ vmovup (%rdi),%ymm1

Obviously this is a skid issue; if instead I enable PEBS I get:

branch-misses:pp

│ add $0x40,%rbp
│ add $0xc0,%rsi
44.23 │ dec %rax
0.62 │ ↑ jne d0

although that's 44% of the branch misses in dgemm_kernel, not 44% overall.
For the whole program the branch mispredict rate is 0.41% so it seems like
this should have a small effect overall.

I tried using the "perf mem" routines which tap into the advanced memory
latency/source measurement stuff in newer Intel chips but I couldn't seem
to gain any useful info from the results.

Vince

Vince Weaver

Sep 24, 2015, 11:00:00 AM
And it turns out that having prefetch show up in a profile isn't that
uncommon; for example, it has happened in Linux in the past:
https://lwn.net/Articles/444336/
I somehow missed that article when originally trying to figure this out.

Although in the Linux-kernel case it sounds like the problem was their
pointer-chasing prefetches would try to prefetch the NULL pointer, and
this was triggering the TLB miss handler every time.

So yes, I guess my misconception was that modern architectures would use
some sort of heuristic to squash prefetches if resources were unavailable,
rather than stalling (sort of like how nops are more or less ignored
these days). It makes me understand more why people still try to avoid
software prefetches unless strictly necessary.

Vince

Vince Weaver

Sep 24, 2015, 4:59:43 PM
Further research in Agner Fog's documentation seems to show that Ivy Bridge
is known to have introduced awful prefetch latencies. I wonder if that
carried through to Haswell. It would explain why I wasn't seeing the issue
before on a Sandy Bridge machine.

Vince

Mike Stump

Sep 24, 2015, 11:00:04 PM
In article <slrnn08438...@pianoman.cluster.toy>,
Vince Weaver <vi...@pianoman.cluster.toy> wrote:
>So yes, I guess my misconception was that modern architectures would use
>some sort of heuristic to squash prefetches if resources were unavailable,

No. The program knows it, in the future, will be useful. The right
approach is to wait until the resources to complete the start of the
prefetch are available and then start it.

Now, why do that? Easy: if you ignore it, you will pay the price
later, and then you will hose performance on the latency. Stalling on
prefetch is fine, as we are merely waiting for some of the outstanding
instructions to complete.

>It makes me understand more why people still try to avoid software
>prefetches unless strictly necessary.

First, you have to do a better job than the hardware will do. The
hardware has 4 prefetchers and one of them is quite smart. Since you
get them for free, and since prefetch instructions consume DRAM
bandwidth, cache lines, and so on, you should only use them if they
help. If one can engineer their code to be hardware-prefetch
friendly, that will do better than the same code with software
prefetches that isn't.

Anton Ertl

Sep 25, 2015, 6:20:21 AM
m...@kithrup.com (Mike Stump) writes:
>In article <slrnn08438...@pianoman.cluster.toy>,
>Vince Weaver <vi...@pianoman.cluster.toy> wrote:
>>So yes, I guess my misconception was that modern architectures would use
>>some sort of heuristic to squash prefetches if resources were unavailable,
>
>No. The program knows it, in the future, will be useful.

If the program knows it, it can use a load instruction, and there
would not have been a need for a prefetch instruction. The prefetch
instructions are there for cases where the program does not know, but
is pretty sure about needing the data. That's why prefetch
instructions don't produce exceptions.
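(That no-exceptions property is easy to demonstrate with GCC's builtin; a minimal sketch, where the function name is made up for illustration:)

```c
/* Demonstrates that prefetch hints never raise exceptions: a prefetch
 * of an address that would fault as an ordinary load is silently
 * dropped by the hardware. Dereferencing `bad` would segfault;
 * prefetching it does not. */
int prefetch_invalid(void)
{
    const char *bad = (const char *)0;  /* deliberately invalid pointer */
    __builtin_prefetch(bad, 0, 3);      /* dropped, no fault */
    return 0;  /* reaching here means no fault was raised */
}
```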

>First, you have to do a better job than the hardware will do.

Also, you don't want to prefetch data that's in the cache already.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

John D. McCalpin

Sep 25, 2015, 11:33:13 AM
On Friday, September 25, 2015 at 5:20:21 AM UTC-5, Anton Ertl wrote:
> (Mike Stump) writes:
>
> >First, you have to do a better job than the hardware will do.
>
> Also, you don't want to prefetch data that's in the cache already.

Interestingly, the L2 streaming hardware prefetcher on recent Intel
processors (I tested on both Sandy Bridge and Haswell) is activated
by sequences of accesses, not by sequences of cache misses.

For reasons that I can't even begin to understand, the L2 hardware
prefetcher will attempt to fetch anywhere between 0 and >90% of the
cache lines in an L2-contained block of memory that is read repeatedly.

(The aggressiveness of the prefetcher is not too hard to understand, but
the variability is quite confusing....)

Anton Ertl

Sep 25, 2015, 1:11:22 PM
"John D. McCalpin" <mcca...@tacc.utexas.edu> writes:
>On Friday, September 25, 2015 at 5:20:21 AM UTC-5, Anton Ertl wrote:
>> (Mike Stump) writes:
>>
>> >First, you have to do a better job than the hardware will do.
>>
>> Also, you don't want to prefetch data that's in the cache already.
>
>Interestingly, the L2 streaming hardware prefetcher on recent Intel
>processors (I tested on both Sandy Bridge and Haswell) is activated
>by sequences of accesses, not by sequences of cache misses.

It's pretty obvious to me why that is: You want the prefetcher to see
strided accesses and do its predictions based on that, while it can be
much harder to see strided accesses looking only at cache misses.

My comment was about where you would insert software prefetch
instructions (which also cost instruction bandwidth).

Robert Wessel

Sep 25, 2015, 2:40:38 PM
On Fri, 25 Sep 2015 10:14:31 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>m...@kithrup.com (Mike Stump) writes:
>>In article <slrnn08438...@pianoman.cluster.toy>,
>>Vince Weaver <vi...@pianoman.cluster.toy> wrote:
>>>So yes, I guess my misconception was that modern architectures would use
>>>some sort of heuristic to squash prefetches if resources were unavailable,
>>
>>No. The program knows it, in the future, will be useful.
>
>If the program knows it, it can use a load instruction, and there
>would not have been a need for a prefetch instruction. The prefetch
>instructions are there for cases where the program does not know, but
>is pretty sure about needing the data. That's why prefetch
>instructions don't produce exceptions.


Well, only if I have an (architectural) register free to use as the
target of the load. In ideal* cases you can issue prefetches hundreds
of cycles in advance (if, for example, you can predict misses to main
memory); you'd need a register free for those hundreds of cycles.


*In practice, that's quite hard, of course.

Quadibloc

Sep 26, 2015, 1:48:18 AM
On Wednesday, September 23, 2015 at 8:00:12 AM UTC-6, Vince Weaver wrote:
> So software prefetch instructions should execute fairly quickly, correct?
> I know that poorly done prefetches can affect overall cache performance, but
> shouldn't the actual instruction itself retire quickly?

Prefetching means that when it comes time to decode and execute the
instruction, there's no need to wait to dig it out of DRAM with its huge
latency; you get it from the cache.

But that is the sum total of the gains therefrom. It won't make your
floating-point divide instruction do its division any faster. It won't even
make the logic required to determine that it _is_ a floating-point divide go
any faster.

But surely you already knew *that*, so what was the point of your post which
clearly I must have totally missed?

John Savard

Quadibloc

Sep 26, 2015, 1:57:44 AM
On Friday, September 25, 2015 at 11:48:18 PM UTC-6, Quadibloc wrote:
> On Wednesday, September 23, 2015 at 8:00:12 AM UTC-6, Vince Weaver wrote:
> > So software prefetch instructions should execute fairly quickly, correct?
> > I know that poorly done prefetches can affect overall cache performance, but
> > shouldn't the actual instruction itself retire quickly?

> But surely you already knew *that*, so what was the point of your post which
> clearly I must have totally missed?

Ah, you weren't referring to instructions that had been prefetched. You were
referring to a special kind of instruction used to request that data be
prefetched. That was my misunderstanding.

A prefetch is still a read from memory, so if the read has to take place
before the instruction can be retired, it might take a while. However, if
it's just queuing the prefetch, for optional execution if the bus happens
to be available in the future, then indeed one would expect the
instruction to retire quickly.

Now, what has me curious, though, is whether there's any kind of expiration
time on these prefetches; that is, if the future computation that is expected
to use the data to be prefetched finishes without needing it, can that be
indicated?

On the other hand, maybe one does only use a prefetch (instead of a load) on
data one is absolutely sure will be needed... because at the moment, one does
not have a register available to put it in, so a prefetch instead of a load
could well be valid.

John Savard

Ivan Godard

Sep 26, 2015, 3:08:42 AM
On 9/25/2015 10:57 PM, Quadibloc wrote:
> On Friday, September 25, 2015 at 11:48:18 PM UTC-6, Quadibloc wrote:

> A prefetch is still a read from memory, so if the read has to take
> place before the instruction can be retired, it might take a while.
> However, if it's just queuing the prefetch, for optional execution
> if the bus happens to be available in the future, then indeed one
> would expect the instruction to retire quickly.
>
> Now, what has me curious, though, is whether there's any kind of
> expiration time on these prefetches; that is, if the future
> computation that is expected to use the data to be prefetched
> finishes without needing it, can that be indicated?

Quashing an in-flight request is difficult; the number of states that
the request could be in is high, and a quash would need to clean up all
the ancillary structures that can be in temporary use while the request
is pending. Generally you would let the request complete normally even
if pointlessly; at worst you will have muddied a cache line. Particular
implementations of the hierarchy may be able to intercept and quash a
request at a selection of other points, but that is idiosyncratic.


Anton Ertl

Sep 26, 2015, 12:03:09 PM
Robert Wessel <robert...@yahoo.com> writes:
>>If the program knows it, it can use a load instruction, and there
>>would not have been a need for a prefetch instruction. The prefetch
>>instructions are there for cases where the program does not know, but
>>is pretty sure about needing the data. That's why prefetch
>>instructions don't produce exceptions.
>
>
>Well, only if I have an (architectural) register free to use as the
>target of the load. In ideal* cases you can issue prefetches hundreds
>of cycles in advance (if, for example, you can predict misses to main
>memory), you'd need a register free for those hundreds of cycles.

The compiler can choose any free architectural register as the target of
the intended-to-be-known-good prefetch load, and then continue to
treat it as a free register. If there is no free register available
right now, the load can be inserted at the next place where there is a
free register (which is typically a few instructions later).

So that's not a reason for introducing prefetch instructions in the
architecture.
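(That "preload" could be sketched like this in C; the helper name is hypothetical, and the volatile cast stands in for the compiler's decision to keep a dead load:)

```c
/* Sketch of the "preload" alternative: an ordinary load whose result is
 * discarded, so the register is immediately free for reuse afterwards.
 * Unlike a prefetch hint, the address must be valid (the load can
 * fault), and the load occupies a load buffer until the data arrives.
 * `preload` is a hypothetical name; the volatile cast keeps the
 * compiler from deleting the otherwise-dead load. */
static inline void preload(const void *p)
{
    (void)*(const volatile char *)p;  /* load performed, value discarded */
}
```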

Mike Stump

Sep 28, 2015, 8:45:06 PM
In article <80728d06-2f29-415d...@googlegroups.com>,
Quadibloc <jsa...@ecn.ab.ca> wrote:
>Now, what has me curious, though, is whether there's any kind of
>expiration time on these prefetches; that is, if the future
>computation that is expected to use the data to be prefetched
>finishes without needing it, can that be indicated?

I'm not aware of too many mechanisms to do this. Cache flush, cache
eviction, and context switch seem the most likely ways, but I doubt
these would actually help to reduce memory usage on most architectures.

>On the other hand, maybe one does only use a prefetch (instead of a
>load) on data one is absolutely sure will be needed... because at the
>moment, one does not have a register available to put it in, so a
>prefetch instead of a load could well be valid.

If you have 32 byte cache lines, and 4MB of cache, that's 131 thousand
prefetch possibilities. Most architectures have fewer registers.

Robert Wessel

Sep 28, 2015, 11:05:27 PM
On Sat, 26 Sep 2015 15:55:11 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>Robert Wessel <robert...@yahoo.com> writes:
>>>If the program knows it, it can use a load instruction, and there
>>>would not have been a need for a prefetch instruction. The prefetch
>>>instructions are there for cases where the program does not know, but
>>>is pretty sure about needing the data. That's why prefetch
>>>instructions don't produce exceptions.
>>
>>
>>Well, only if I have an (architectural) register free to use as the
>>target of the load. In ideal* cases you can issue prefetches hundreds
>>of cycles in advance (if, for example, you can predict misses to main
>>memory), you'd need a register free for those hundreds of cycles.
>
>The compiler can choose any free architectural register as the target of
>the intended-to-be-a-known-good prefetch load, and then continue to
>treat it as free register. If there is no free register available
>right now, this load can be inserted at the next place when there is a
>free register (which is typically a few instructions later).
>
>So that's not a reason for introducing prefetch instructions in the
>architecture.


Ah. You're suggesting doing the load and then discarding the result
by reusing the register for something else.

Anton Ertl

Sep 29, 2015, 5:18:07 AM
Right. That's what an architect would probably suggest to someone who
asks for adding a prefetch instruction for known-to-be-needed memory
locations.

EricP

Sep 29, 2015, 11:25:37 AM
Anton Ertl wrote:
> Robert Wessel <robert...@yahoo.com> writes:
>>> If the program knows it, it can use a load instruction, and there
>>> would not have been a need for a prefetch instruction. The prefetch
>>> instructions are there for cases where the program does not know, but
>>> is pretty sure about needing the data. That's why prefetch
>>> instructions don't produce exceptions.
>>
>> Well, only if I have an (architectural) register free to use as the
>> target of the load. In ideal* cases you can issue prefetches hundreds
>> of cycles in advance (if, for example, you can predict misses to main
>> memory), you'd need a register free for those hundreds of cycles.
>
> The compiler can choose any free architectural register as the target of
> the intended-to-be-a-known-good prefetch load, and then continue to
> treat it as free register. If there is no free register available
> right now, this load can be inserted at the next place when there is a
> free register (which is typically a few instructions later).
>
> So that's not a reason for introducing prefetch instructions in the
> architecture.
>
> - anton

There are a number of reasons:

1) Intel has now added PREFETCHW Prefetch-For-Write instruction
which was on AMD but was a noop on Intel. PrefetchW cannot be
simulated with a bogus store because it would change memory.

By prefetching for write it loads the cache line exclusive,
skipping first going to the shared state then upgrading later,
saving delays and bus traffic.

2) Intel calls using loads to prefetch "preloading".

According to the Intel Optimization manual, prefetch seems
to act similarly to async I/O in that it queues the request but
does not wait for completion, and the prefetch instruction
can retire before the cache line arrives.
A preload would stall its retirement until the data arrived.

3) The Intel Optimization manual 7.4.3 says in part:

"Currently, PREFETCH provides greater performance than preloading because:
- Has no destination register, it only updates cache lines.
- Does not stall the normal instruction retirement.
- Does not affect the functional behavior of the program.
- Has no cache line split accesses.
- Does not complete its own execution if that would cause a fault.

Currently, the advantage of PREFETCH over preloading instructions
are processor-specific. This may change in the future.

There are cases where a PREFETCH will not perform the data prefetch.
These include:
- PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss.
This applies to Pentium 4 processors with CPUID signature (blah blah...)
- An access to the specified address that causes a fault/exception.
- If the memory subsystem runs out of request buffers between the
first-level cache and the second-level cache.
- PREFETCH targets an uncacheable memory region (for example, USWC and UC).
"
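(For point 1, GCC exposes the write-intent variant through the second argument of __builtin_prefetch; a sketch, where the function name and the 8-cache-line prefetch distance are illustrative choices:)

```c
#include <stddef.h>

/* Write-intent prefetch a la PREFETCHW: __builtin_prefetch's second
 * argument (1 = write) asks for the line in exclusive state, so the
 * later store needs no shared-to-exclusive upgrade. The 64-double
 * (8 cache line) distance is an arbitrary illustrative choice, and
 * prefetching past the end of the array is harmless since hints
 * never fault. */
void scale_in_place(double *a, size_t n, double k)
{
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 64], 1, 3);  /* 1 = prefetch for write */
        a[i] *= k;
    }
}
```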

Eric

John D. McCalpin

Sep 29, 2015, 2:59:40 PM
On Tuesday, September 29, 2015 at 10:25:37 AM UTC-5, EricP wrote:
>
> There are cases where a PREFETCH will not perform the data prefetch.
> These include:
> - PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss.
> This applies to Pentium 4 processors with CPUID signature (blah blah...)
> - An access to the specified address that causes a fault/exception.
> - If the memory subsystem runs out of request buffers between the
> first-level cache and the second-level cache.
> - PREFETCH targets an uncacheable memory region (for example, USWC and UC).

Software prefetches are also dropped on Xeon Phi ("Knights Corner") if they
require a table walk.

The behavior on Xeon Phi is unusual for three reasons:
1. Most Intel processors will do a table walk on a software prefetch.
2. Xeon Phi does not cache 4KiB page table entries in the STLB, so
a DTLB miss on a 4KiB page will cause a table walk. (The STLB caches
PDEs for 4KiB pages on the Xeon Phi to make the table walks faster.)
3. Unlike Nehalem/Westmere and Sandy Bridge/Ivy Bridge, Xeon Phi does
cache 16MiB page table entries in the STLB. Haswell supports
both 4KiB and 2MiB/4MiB pages in the STLB, while Broadwell supports
4KiB, 2MiB, and 1GiB translations in the STLB.

MitchAlsup

Sep 29, 2015, 7:59:18 PM
On Tuesday, September 29, 2015 at 1:59:40 PM UTC-5, John D. McCalpin wrote:
> On Tuesday, September 29, 2015 at 10:25:37 AM UTC-5, EricP wrote:
> >
> > There are cases where a PREFETCH will not perform the data prefetch.
> > These include:
> > - PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss.
> > This applies to Pentium 4 processors with CPUID signature (blah blah...)
> > - An access to the specified address that causes a fault/exception.
> > - If the memory subsystem runs out of request buffers between the
> > first-level cache and the second-level cache.
> > - PREFETCH targets an uncacheable memory region (for example, USWC and UC).
>
> Software prefetches are also dropped on Xeon Phi ("Knights Corner") if they
> require a table walk.

There are other processors where the prefetch can be dropped if some
non-ISA-visible structure is over-committed; things like miss buffers,
table walkers (JDMc), network queues... basically anything that gets
overloaded can result in the prefetch disappearing.

John D. McCalpin

Sep 30, 2015, 9:33:05 AM
Ditto for hardware prefetches (for the same sorts of reasons).

Whether dropping prefetches is a good idea is hard to tell. It is
relatively easy to come up with examples that demonstrate slowdowns
(or speedups) with virtually any policy compared to any other policy.