more control on CPU cache


ymo

Apr 20, 2015, 6:07:08 PM
to mechanica...@googlegroups.com
With main memory becoming the slowest link, do you think we will one day have more control over prefetching on the CPU? It would also be nice to be able to stop (or lock) the CPU's cache from being polluted... So far all of this feels like playing Russian roulette!





Vitaly Davidovich

Apr 20, 2015, 6:20:09 PM
to mechanica...@googlegroups.com
There are existing facilities for both of the things you mention.  Some processors have prefetch instructions that software can emit, and there are also non-temporal load/store instructions that bypass the cache.  Both come with a sizable YMMV label.  Or did you mean something else?


Michael Barker

Apr 20, 2015, 6:26:09 PM
to mechanica...@googlegroups.com
Sometimes more control is not necessarily a good thing.  Below is an interesting article about how misuse of prefetch hurt performance in the kernel.



Rajiv Kurian

Apr 20, 2015, 9:20:22 PM
to mechanica...@googlegroups.com
Sure, they often lead to poor performance, but I still think it is good to have such control. Ultimately, good measurement is required to classify changes as performance improvements or regressions. Non-temporal store instructions are often used in graphics programming and games. The prefetch instruction is also heavily used in software routers where multiple packets are looked up in a hash map. These use cases would suffer quite significantly without these facilities.
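As a rough illustration of that batched-lookup pattern (my own sketch, not taken from any router codebase; the hash function and table layout are invented for the example), using GCC's __builtin_prefetch:

#include <stddef.h>
#include <stdint.h>

struct bucket { uint64_t key; uint32_t value; };

/* Hypothetical open-addressed table: a power-of-two array of buckets. */
static inline size_t bucket_of(uint64_t key, size_t mask) {
    return (key * 0x9E3779B97F4A7C15ULL) & mask;   /* simple multiplicative hash */
}

/* Look up a small batch of keys: compute and prefetch every bucket first,
   then probe.  By the time the second loop runs, most lines are on their way. */
void lookup_batch(const struct bucket *table, size_t mask,
                  const uint64_t *keys, uint32_t *out, size_t n /* n <= 16 */)
{
    size_t idx[16];
    for (size_t i = 0; i < n; i++) {
        idx[i] = bucket_of(keys[i], mask);
        __builtin_prefetch(&table[idx[i]], 0 /* read */, 1 /* low temporal locality */);
    }
    for (size_t i = 0; i < n; i++)
        out[i] = (table[idx[i]].key == keys[i]) ? table[idx[i]].value : 0;
}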



ymo

Apr 20, 2015, 10:25:03 PM
to mechanica...@googlegroups.com
OK, I have to admit I am only interested in Intel for the near (and maybe far) future. Can you elaborate on which CPU instructions (if any on Intel) you are referring to?

Vitaly Davidovich

Apr 20, 2015, 10:40:38 PM
to mechanica...@googlegroups.com

prefetcht0, prefetcht1, prefetcht2, prefetchnta, prefetchw - these are prefetch hint instructions.

The movntXXX family of non-temporal move instructions for writes that bypass the cache.
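For reference, from C these are usually reached through compiler intrinsics rather than raw assembly. A minimal sketch, assuming x86 with SSE2; the function below is purely illustrative:

#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _mm_sfence */
#include <emmintrin.h>   /* _mm_setzero_si128, _mm_stream_si128 */

/* Zero a 16-byte-aligned buffer with non-temporal stores, after hinting that
   the input for some later work should be pulled towards the caches. */
void nt_zero(void *dst, size_t bytes, const char *next_input)
{
    _mm_prefetch(next_input, _MM_HINT_T0);      /* emits a prefetcht0 hint */

    __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i + 16 <= bytes; i += 16)
        _mm_stream_si128((__m128i *)((char *)dst + i), zero);  /* movntdq */

    _mm_sfence();   /* order the non-temporal stores before later stores */
}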

sent from my phone


Rajiv Kurian

Apr 20, 2015, 11:05:12 PM
to mechanica...@googlegroups.com
Ulrich Drepper's paper https://people.freebsd.org/~lstewart/articles/cpumemory.pdf has sections on using prefetch as well as non-temporal store/load instructions. He uses intrinsics instead of assembly, if I remember correctly. Here is the section on bypassing the cache - http://lwn.net/Articles/255364/



Rajiv Kurian

Apr 21, 2015, 12:25:20 AM
to mechanica...@googlegroups.com
If you are only interested in Intel processors, then this talk on programming a Xeon processor is fascinating - https://www.youtube.com/watch?v=m9dRPnfKTxs

It showcases the latest technologies in Haswell and how you can use them to reduce latency and increase throughput.

Martin Thompson

Apr 21, 2015, 7:28:02 AM
to mechanica...@googlegroups.com
I like how he says he is only covering the simple or really basic stuff. In my experience only a tiny minority of the development community knows any of this.

It is not the best presentation style, but it does give a good overview of what is important in the latest Xeon processors.


Vitaly Davidovich

Apr 21, 2015, 8:59:50 AM
to mechanica...@googlegroups.com

Yeah, I've mostly seen people using compiler intrinsics for issuing prefetches (which makes sense from a portability perspective).  The HotSpot JVM actually had prefetch intrinsics in its native Unsafe class, but never exposed them in Java.  In fact, I think they took them out of the native Unsafe recently, so Java isn't going to get the same love here.

sent from my phone


Gary Mulder

Apr 21, 2015, 11:00:38 AM
to mechanica...@googlegroups.com

The take-away seems to be to align one's data to 64-byte boundaries (the cache line size), use "structures of arrays" rather than "arrays of structures", and use the vectorisation capabilities of the instruction set.
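A tiny sketch of that AoS vs SoA distinction in C (the particle layout is invented for the example):

/* Array of structures: each particle's x, y and z sit together, so a loop
   that only touches x still drags y and z through the cache with it. */
struct particle_aos { float x, y, z; };
struct particle_aos aos[1024];

/* Structure of arrays: all the x's are contiguous, which maps nicely onto
   64-byte cache lines and onto SIMD loads over a single field. */
struct particles_soa { float x[1024], y[1024], z[1024]; };
struct particles_soa soa;

void move_x_aos(float dx) { for (int i = 0; i < 1024; i++) aos[i].x += dx; }
void move_x_soa(float dx) { for (int i = 0; i < 1024; i++) soa.x[i] += dx; }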

Does Java have the ability to align data to boundaries? Does it support native vectorisation operations?

Gary

Vitaly Davidovich

Apr 21, 2015, 11:06:54 AM
to mechanica...@googlegroups.com
Java doesn't have explicit alignment/layout control, although there are some indirect ways to achieve it with variable success.

As for vectorization, Hotspot currently only uses vector instructions for memory copies.  However, a few days ago I posted a link to some recent Intel patches that aim to expand the use of automatic vectorization (e.g. int/fp reductions in loops).


ymo

Apr 21, 2015, 11:25:31 AM
to mechanica...@googlegroups.com
This is not news to you, but the software prefetch instructions have major problems, so much so that everyone says don't use them. The major one is that you can prefetch only for reads, not writes, on anything before the Phi or Broadwell. Hardware prefetch is a lottery: you win or lose depending on your use case. I asked here before and was told the same )))

Now my dream is Intel coming up with a way for me to "instruct" the CPU ahead of time which loops I will be executing and then optimize for me... I was looking at the recent SPIR-V specification for GPUs and just could not stop wondering why the same thing cannot be done at the CPU level. All the loops and data structures are known before the program is loaded. Why can't the CPU and cache work at a higher level like that???

I am sure it's easier said than done... but one cannot help but wonder!



Vitaly Davidovich

Apr 21, 2015, 11:46:59 AM
to mechanica...@googlegroups.com
I think the other trick here is that prefetch effectiveness will also vary across CPU generations, so it's something that would need to be re-evaluated.  This is definitely one of those things you measure and only use if it yields gains significant enough to warrant the maintenance headache.

What do you mean by CPU optimizing your loops for you?


ymo

Apr 21, 2015, 12:33:14 PM
to mechanica...@googlegroups.com
The dilemma here is that the CPU instructions are too low level for the CPU to infer which place in memory you will be asking for next, especially with code branches. So you could have a high-level language that says "here is what I am trying to do" (like SPIR-V), and I want the CPU to *plan* the cache filling ahead of time for me, since it already knows what I am trying to do. For example, if I had more control myself, I would translate that into finding:
1) What is the *optimum* number of cache lines to fetch from main memory, given the number of instructions that will get executed and the speed of execution?
2) When is the *optimum* time to prefetch them, again given all the variables in flight?

In both cases the CPU knows ahead of time (read: before execution starts) which memory locations I am going to touch, since that was given to it explicitly, so it could make educated guesses here. I am not saying that this will all happen in the CPU itself (obviously). But there could be some sort of compiler (aka optimizer) that has enough control over the CPU/cache instructions to disable the things that don't make sense in this scenario. Maybe hardware prefetch could be disabled here altogether, since the generated *plan* will include explicit prefetches. But this optimizer would need much better control than what is available right now. The GPUs can schedule each thread efficiently because the execution and scheduling are done ahead of time, not during execution. These devices are much dumber, but they are great workers!

So in the same way that you have a query optimizer for a query, you could have a "cache utilization" optimizer. The major difference is that instead of inferring the memory and cache use at execution time, the optimizer would plan it ahead of time and measure the execution afterwards. If it is not fast enough, change one of the variables above (1 or 2) and try again. At some point it will converge on something that cannot be made better.

TL;DR: an ahead-of-time scheduler/optimizer that lets the CPU cores and cache work in the most efficient way!



Vitaly Davidovich

Apr 21, 2015, 2:09:49 PM
to mechanica...@googlegroups.com

If you know a priori exactly the code path and memory you're going to touch, which is what you're hinting at I think, how does the dynamic "planning" that the CPU does fail? After all, it does plan execution via speculation and out-of-order execution, including loop detection.

sent from my phone


ymo

Apr 21, 2015, 4:18:02 PM
to mechanica...@googlegroups.com
Because you cannot avoid a stall when it is dynamic. The CPU will make its best effort, but ultimately when it has to stall, it will stall. It's not like in software, where a try-lock can return "would block" so that you can go do other things. As far as I know, I don't have a way of knowing when a particular block has arrived in the cache without paying for the wait. And it's not as if you can go and do more things anyway without risking trashing this coveted cache. So assuming both you and the CPU have to make an educated guess about when to fetch data, I would say that you would know more than the CPU most of the time, except in some corner cases.

For example, let's say that I have this loop:

loopStart:
for x = 1 to 1000 {
    <enough instructions>
}

if a = 1 do loopA
if b = 1 do loopB

loopA:
for x = 1 to 1000 {
    <enough instructions>
}

loopB:
for x = 1 to 1000 {
    <enough instructions>
}

If you assume that the variables a and b are (among many that are) on the same cache line (packed in proximity), you could prefetch that line before you get out of loopStart.

Assuming that loopA and loopB are known not to be in your instruction cache (based on your algorithm), you could also prefetch those instructions from main memory before getting out of loopStart. Is this something that the hardware prefetcher can predict?
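In C-ish terms, the kind of explicit data-side hint I have in mind would look roughly like this (a sketch only; the names and the "prefetch a few hundred iterations early" choice are invented, and whether it actually wins is exactly the measurement question above):

extern int a, b;                         /* assumed packed onto the same cache line */
extern void enough_instructions(int x);
extern void loopA(void);
extern void loopB(void);

void loop_start(void)
{
    for (int x = 1; x <= 1000; x++) {
        if (x == 900)                        /* well before the loop exits...       */
            __builtin_prefetch(&a, 0, 3);    /* ...request the line holding a and b */
        enough_instructions(x);
    }
    if (a == 1) loopA();
    if (b == 1) loopB();
}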


Vitaly Davidovich

Apr 21, 2015, 4:41:19 PM
to mechanica...@googlegroups.com
Hardware already does this via speculation; while a certain iteration of loopStart still has unretired instructions in flight, the processor can be speculatively executing subsequent iterations, including issuing memory loads.  The goal is simply to keep the CPU from stalling, so if issuing speculative loads for a and b on the 999th iteration of loopStart achieves that, then it's all good -- it doesn't need to (and shouldn't) issue them any earlier.  Arguably, requesting additional resources as close to their use as possible is, generally speaking, better than requesting them very early on, because you may be wasting them.  It's like that analogy of your office desk with boxes of files and the remote warehouse containing the rest of the boxes: http://gameprogrammingpatterns.com/data-locality.html

In your example, a likely issue is that the loop exit will cause a branch misprediction (if the compiler doesn't assist), throwing out the speculation.

Also, as Mike's link to the LWN article points out, Linux kernel devs tried to use software prefetch for seemingly random (in terms of memory layout) accesses (after all, they were traversing a linked list), and it was slower than not doing it.  Now, some of that could be an implementation artifact of prefetch on that particular hardware, but the point still stands that the hardware was clearly able to figure this out on its own (and do it better).




ymo

Apr 21, 2015, 5:37:09 PM
to mechanica...@googlegroups.com
You might disagree with me, but all I am saying is that the only reason processors like GPUs and even the Phi are able to do what they do is that they took out all those "features" that are only meant to help when the CPU has no clue what is coming down the pipeline and has to make educated guesses. That is meant for the masses, basically. Taking it out gives more control to the people writing the code, instead of the CPU managing everything, and makes the devices less complicated but greater in number. I think the new trend is to go for an army of minions instead of a behemoth!



Vitaly Davidovich

Apr 21, 2015, 5:50:38 PM
to mechanica...@googlegroups.com
I agree with your statement on GPUs, except I don't think it's "meant for the masses" -- IMHO, it's meant for very specific workloads that are throughput-oriented and highly parallel (in the sense that you can partition the bigger problem into many smaller independent bits), whereas CPUs are optimized for single-threaded execution.  They're basically compute monsters with higher memory bandwidth than CPUs, but they're hardly general purpose (i.e. for the masses).  They're simply geared towards different workload profiles, and one isn't "better" than the other as a whole.  For instance, you wouldn't write a high-performance web server on a GPU; you wouldn't implement a database engine on a GPU; you wouldn't implement a kernel on a GPU; you wouldn't implement a JVM on a GPU; etc.  Now, you may be able to find bits in the execution pipeline that could benefit from a GPU-style processor, but I surmise that wouldn't be the bulk of the execution.

The ideal setup, I think, would be performant GPU offload and tooling around it, such that if you have a mixed workload where some things benefit from the GPU (and others from the CPU), you could partition your work manually.


Richard Warburton

Apr 22, 2015, 2:07:19 AM
to mechanica...@googlegroups.com

Hi,

The army-of-minions approach works well in the GPU case, where the expectation is that tasks will be mainly maths-oriented and you are looking to exploit data parallelism.

It's not so good for the category of applications where you are focused on performing small amounts of computational work while doing a lot of messaging, or if your main bottleneck is a single-threaded network IO dispatcher.

regards,

  Richard


Martin Thompson

Apr 22, 2015, 3:37:21 AM
to mechanica...@googlegroups.com

ymo

Apr 22, 2015, 7:42:10 AM
to mechanica...@googlegroups.com
You are right for now, because frankly speaking we have not figured out how (extreme) data parallelism can be applied to all parts of the application stack.  Said another way, at the application level we are still working with arrays of structures. I think it is only a matter of time before we figure out how to convert a whole application's data into the structures of arrays that these little minions can feed on. Here is a quote from Mike Acton on data parallelism that I really like: "Rule of thumb: Where there is one, there are many. Try looking on the time axis."

Obviously, the hardest part we have not figured out so far is how to do it in a low-latency manner. This is why applications where latency is not important (for example, batch processors) are the best fit for this type of processing... for now! But if you think about it, it is not impossible.

The tooling is another area that is stopping us from moving in this direction. The IDEs we have today are just very badly suited to multiprocessing and data-oriented applications. We are still using idioms we learned in IDEs about 40 years ago. Nothing in the latest IDEs has changed to reflect the latest changes in hardware.

Scott Carey

Apr 22, 2015, 9:21:58 PM
to mechanica...@googlegroups.com



It will never be the case that everything can be converted into an embarrassingly parallel problem.

Many things can be converted into a somewhat parallel problem.

Some things can be made embarrassingly parallel.

Some things can never be made even partially parallel.

Consider, for example, compression algorithms.  LZ-style algorithms are amenable to some chunking and parallelism when decoding, though it can come at the cost of some compression ratio.  At the low level these are still very branch-intensive, and your CPU is faster than your GPU.  Prefetching does not help, but larger caches and lower-latency memory do.  On the compression side, things are even more difficult to parallelize (efficiently), and memory access patterns are even more random and completely unpredictable.

Unfortunately, an army of minions (GPU) cannot beat Flash Gordon (CPU) every time.
 

Gary Mulder

Apr 23, 2015, 5:40:22 AM
to mechanica...@googlegroups.com
On 23 April 2015 at 02:21, Scott Carey <scott...@gmail.com> wrote:

It will never be the case that everything can be converted into an embarrassingly parallel problem.

Many things can be converted into a somewhat parallel problem.

Some things can be made embarrassingly parallel.

Some things can never be made even partially parallel.

Martin Thompson

Apr 23, 2015, 10:35:27 AM
to mechanica...@googlegroups.com
Even on single threads we should be able to extract more ILP, given the common things we do like copying, searching, sorting and pattern matching. SIMD can provide the parallelism here.

In Java, String.indexOf() gets some SIMD love under the covers. It would be great to see similar love for searching for a pattern in any type of primitive array (bytes, shorts, ints, longs, etc.).
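For what it's worth, the kind of search that benefits looks roughly like this when hand-written with SSE2 intrinsics (a sketch for a single-byte needle; not what HotSpot actually emits):

#include <emmintrin.h>
#include <stddef.h>

/* Find the first occurrence of byte c in buf[0..n), or -1 if absent. */
long find_byte(const unsigned char *buf, size_t n, unsigned char c)
{
    __m128i needle = _mm_set1_epi8((char)c);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask)
            return (long)(i + (size_t)__builtin_ctz(mask));   /* first matching lane */
    }
    for (; i < n; i++)                                        /* scalar tail */
        if (buf[i] == c) return (long)i;
    return -1;
}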

Also, when it comes to copying and zeroing memory there is likely more to do. Does the compiler generate a "REP MOVSD" in all possible cases on x86, or use the latest XMM features if available? It would also be great to have an assembly instruction that provides a zeroed cache line without fetching its existing contents from main memory. Gil has pointed out how useful that can be, having done it on Vega, i.e. it greatly saves on memory traffic and object allocation latency.

Nitsan introduced me to this interesting paper on Super Word Parallelism which is worth a read: http://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf

My experience from the field has repeatedly taught me that the quest for parallelism often results in a decrease in performance. Not just Amdahl's law but the USL hunts you down with no chance of escape. Nearly every application I see can be sped up, and greatly clarified, by reducing rather than adding concurrency.

Vitaly Davidovich

Apr 23, 2015, 11:14:25 AM
to mechanica...@googlegroups.com
Also when it comes to copying and zero'ing memory there is likely more to do. Does the compiler generate a "REP MOVSD" in all possible cases on x86, or use the latest XMM features if available? It would also be great to have a assembly instruction that provides a cache line zero'ed without fetching its existing contents from main memory. Gil has pointed out how useful that can be having done it on Vega, i.e. greatly save on memory traffic and object allocation latency.

Yes, it does use vector operations for some cases.  Nitsan did a little expose on this recently: http://psy-lob-saw.blogspot.com/2015/04/on-arraysfill-intrinsics-superword-and.html

As for speeding up allocations, I believe the C2 compiler issues prefetchnta hints as part of an allocation to prefetch further out in the allocation buffer so that memory is in cache once the *next* allocation is made, which I think is what you're sort of looking for? But zeroing is an interesting topic, which was discussed in this paper: http://users.cecs.anu.edu.au/~steveb/downloads/pdf/zero-oopsla-2011.pdf.  There are a couple of styles to choose from, each with its own pros/cons.  There's also an optimization in Hotspot that avoids zeroing arrays when the array is filled by user code post-allocation.  In addition, I believe you can request a zero'd TLAB (-XX:ZeroTLAB, disabled by default).  There're also attempts to minimize field zeroing (-XX:ReduceFieldZeroing, on by default).

As for rep movs, I think the general idea is using inline explicit movs for small blocks, rep movs for medium, and then SSE/AVX + prefetch for large blocks.



Daniel Lemire

Apr 23, 2015, 1:46:27 PM
to mechanica...@googlegroups.com

Even on single threads we should be able to extract more ILP given the common things we do like copying, searching, sorting, pattern matching. SIMD can provide the parallelism here.

Absolutely. Copying, searching, sorting, compression... all of these stand to benefit greatly from more vectorization and fewer genuine branches. My impression is that people get used to solving problems with branching, whereas a little bit of engineering and SIMD expertise can produce superior branchless algorithms.

I have never considered the vectorization of LZ compression, but I would not bet against it. 

Martin Thompson

Apr 23, 2015, 1:51:36 PM
to mechanica...@googlegroups.com

As for speeding up allocations, I believe the C2 compiler issues prefetchnta hints as part of an allocation to prefetch further out in the allocation buffer so that memory is in cache once the *next* allocation is made, which I think is what you're sort of looking for? But zeroing is an interesting topic, which was discussed in this paper: http://users.cecs.anu.edu.au/~steveb/downloads/pdf/zero-oopsla-2011.pdf.  There are a couple of styles to choose from, each with its own pros/cons.  There's also an optimization in Hotspot that avoids zeroing arrays when the array is filled by user code post-allocation.  In addition, I believe you can request a zero'd TLAB (-XX:ZeroTLAB, disabled by default).  There're also attempts to minimize field zeroing (-XX:ReduceFieldZeroing, on by default).

Interesting paper. I'd not seen that one before. I noticed the call-out to the CLZ instruction on Azul Vega, but no real detail.

The work is a little old, being from 2011. With the Ivy Bridge copy changes, and Haswell now doubling the bandwidth to L1 and L2, it could be a different picture today. These both particularly help the inline case, as opposed to non-temporal cache-bypass techniques.

I wonder if anyone has got Intel to consider the equivalent of the CLZ instruction? Gil?

Martin... 

 

Vitaly Davidovich

Apr 23, 2015, 2:11:09 PM
to mechanica...@googlegroups.com
Aren't prefetch + non-temporal bulk/AVX moves (of zero, in this case) effectively providing that? I'm not intimately familiar with Vega's CLZ either, so I'd be interested to know what it helps with that the prefetch + movnt combination doesn't.


Martin Thompson

Apr 23, 2015, 3:35:15 PM
to mechanica...@googlegroups.com
I'm not sure how the bulk non-temporal side would fit in, given the fencing then needed for it to pay off. Prefetch will incur the memory bandwidth, and some cache bandwidth, that CLZ should be able to avoid. The memory bandwidth alone is likely to be in the 10-30% range.

Just thinking that other things can come into play these days. Sandy Bridge introduced "zeroing idiom" support, i.e. xor'ing a value with itself, and since Ivy Bridge those can be zero latency by pre-allocating from the register file.

I don't know the answer but it feels like an area worth some current research to see what is best.

Vitaly Davidovich

Apr 23, 2015, 3:58:09 PM
to mechanica...@googlegroups.com
Just thinking other things can come into play these days. Sandy Bridge introduced the "zeroing idioms" support, i.e. xor'ing a value with itself, and since Ivy Bridge those can be zero latency by pre allocating from the register file.

I'm pretty sure Sandy Bridge handled xor via register renaming, and thus 0 execution units.  Ivy Bridge added the ability to handle register-to-register moves via renaming, which is perhaps what you're thinking of.

Compilers have been using xor to clear registers for a long time; even before these became nops at the execution-unit level, they were used because (1) CPUs since roughly the P4 have understood this instruction as having no data dependence, so they wouldn't stall, and (2) compilers would use it to "artificially" break a data dependence on a register.  But sometimes CPUs mess up here -- e.g. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011, which is SB/IB/Haswell having a bogus data dependency in its popcnt instruction.  What did GCC do to "fix" this? Add an artificial XOR :).
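To make that concrete, the workaround amounts to something like this hand-rolled version (illustration only; in practice the compiler inserts the xor for you once the fix is in):

#include <stdint.h>

/* Count set bits while breaking the false output dependency that
   SB/IB/Haswell attach to popcnt's destination register. */
static inline uint64_t popcnt64(uint64_t x)
{
    uint64_t r;
    __asm__("xorq %0, %0\n\t"        /* zeroing idiom: handled at rename, kills the dependency */
            "popcntq %1, %0"
            : "=&r"(r)               /* early clobber so r and x land in distinct registers */
            : "r"(x));
    return r;
}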


Tomasz Borek

Apr 23, 2015, 6:01:58 PM
to mechanica...@googlegroups.com
2015-04-21 13:27 GMT+02:00 Martin Thompson <mjp...@gmail.com>:
I like how he says he is only covering the simple or really basic stuff. I my experience only a tiny minority of the development community know any of this.

A friend of mine is at CraftConf right now and said that when Paul Butcher asked who knows what the Java Memory Model is, out of 700 people three hands shot up.

regards,
LAFK