Can we have access to pause and prefetch from Java?


ymo

Apr 17, 2014, 4:09:32 PM
to mechanica...@googlegroups.com
I would not mind if this was on x86 only !!!

Nitsan Wakart

Apr 18, 2014, 10:45:27 AM
to mechanica...@googlegroups.com
No and no :(
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


ymo

Apr 18, 2014, 6:23:52 PM
to mechanica...@googlegroups.com, Nitsan Wakart
I know we can't today ... anything on the radar about why, or whether it would be made available one day?



Dan Eloff

Apr 24, 2014, 10:52:48 PM
to mechanica...@googlegroups.com


On Thu, Apr 17, 2014 at 3:09 PM, ymo <ymol...@gmail.com> wrote:
I would not mind if this was on x86 only !!!


ymo

Apr 25, 2014, 12:32:18 PM
to mechanica...@googlegroups.com
Can you confirm which JDK this Christmas present was put in?

The real test now is to see if this is going to make a difference .. but still, thanks a lot for the link )))



Jimmy Jia

Apr 25, 2014, 12:37:36 PM
to mechanica...@googlegroups.com

Not for long:

http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-March/026031.html


ymo

Apr 25, 2014, 3:02:08 PM
to mechanica...@googlegroups.com
wtf nooooooooooooooooooooooooooooo !!!!!



ymo

Apr 25, 2014, 3:09:25 PM
to mechanica...@googlegroups.com
They don't even mention why on earth they need to remove it .. because it's unused ... seriously!!!



Vitaly Davidovich

Apr 25, 2014, 3:55:25 PM
to mechanica...@googlegroups.com

Just curious - what is your use case where you want to do software prefetch?

Sent from my phone


ymo

Apr 25, 2014, 4:06:28 PM
to mechanica...@googlegroups.com
I am doing tight loops running on dedicated hardware that could benefit from prefetching the data stored in queues. All the data is stored as an array that the CPU runs through as fast as possible. When I say dedicated I mean that no other process is accessing the shared caches.

All of this is done in C++ since it is impossible to do in Java ... unfortunately, so I can't move this code over to Java ((

Vitaly Davidovich

Apr 25, 2014, 4:09:39 PM
to mechanica...@googlegroups.com

And the hardware prefetcher isn't picking up on this automatically? Not saying the prefetch in C++ is useless, but what you describe sounds like something hardware should handle well (I'm assuming the run through the array is done linearly).

Sent from my phone


Martin Thompson

Apr 25, 2014, 4:13:19 PM
to mechanica...@googlegroups.com
Have you considered implementing the queue as sparse and then doing a write/read ahead to an unused address next to what you need on the next iteration? With this technique you can even effectively prefetch for write, which you cannot with "explicit" prefetching.

Martin...

ymo

Apr 25, 2014, 4:16:48 PM
to mechanica...@googlegroups.com
1) You know how you are using your data better than the hw prefetch guessing game. That is why they have these instructions in the first place!
2) You want to do some sort of pipeline with multiple stages where you prefetch the next item in the queue while working on the current item. All within the same cache lines, of course.
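For what it's worth, the pipelined pattern in (2) can be sketched with GCC/Clang's `__builtin_prefetch` (a hypothetical summing loop, not code from the thread; as Vitaly notes, a truly linear scan is something the hardware prefetcher will often handle on its own):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Process items[i] while hinting the cache to pull in items[i + 1].
// __builtin_prefetch(addr, rw, locality): rw = 0 means prefetch for
// read; locality = 3 means keep in all cache levels (this maps to
// prefetcht0 on x86).
std::uint64_t sum_with_prefetch(const std::vector<std::uint64_t>& items) {
    std::uint64_t total = 0;
    const std::size_t n = items.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 1 < n) {
            __builtin_prefetch(&items[i + 1], /*rw=*/0, /*locality=*/3);
        }
        total += items[i];  // work on the current item
    }
    return total;
}
```

Whether the hint helps or hurts here is exactly the kind of thing that has to be benchmarked per workload.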

Ariel Weisberg

Apr 25, 2014, 4:29:14 PM
to mechanica...@googlegroups.com
Hi,

How do you avoid having the store reordered or optimized in some unexpected way? Is it a non-issue?

Ariel

ymo

Apr 25, 2014, 4:29:51 PM
to mechanica...@googlegroups.com
Martin, sparse data is not possible for now, but we still prefetch based on the cache lines. The C++ code is explicitly calling prefetcht{0,1,2}. Can you explain or give me pointers to what you mean by "you can even effectively prefetch for write which you cannot with 'explicit' prefetching"?

However, it's a nice trick/hack and I will try it for sure!


Martin Thompson

Apr 25, 2014, 4:37:07 PM
to mechanica...@googlegroups.com
Depending on what structure you want to prefetch, say it is an array of structs in C/C++, have a field in the struct that you can read or write ahead of needing the other fields. Similar can work for a sparse array, i.e. using 1 in N slots. Prefetch only works for getting the cache line shared on x86; if you want to write, you need the cache line exclusive under the MESI model. If you write ahead, you will get the cache line exclusive.
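A minimal C++ sketch of this write-ahead trick (the Slot layout and field names are illustrative, not from the thread):

```cpp
#include <cstddef>
#include <cstdint>

// One queue slot per 64-byte cache line. Writing the scratch field of
// slot i+1 forces that line into the cache in exclusive/modified MESI
// state before we need to write its payload, which a plain prefetch
// instruction cannot do (it only requests the line shared).
struct alignas(64) Slot {
    std::uint64_t payload;
    std::uint64_t scratch;  // touched only to claim the line for writing
};

void fill(Slot* slots, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 1 < n) {
            slots[i + 1].scratch = 0;  // write-ahead: claim next line exclusive
        }
        slots[i].payload = i;  // real work on the current slot
    }
}
```

The same shape works in plain Java with a padded slot class or a wide array stride, which is what makes it attractive when no prefetch intrinsic is available.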



Martin Thompson

Apr 25, 2014, 4:42:53 PM
to mechanica...@googlegroups.com
Experiment and make sure it works for your algorithm. Prefetching can be tricky, as other things can happen that you don't expect. Linux ended up removing the prefetch on linked-list traversal as it hurt more than it helped.


For ordering on the same thread you need to have data dependencies. A compiler will ensure program order within a thread.



ymo

Apr 25, 2014, 4:47:08 PM
to mechanica...@googlegroups.com
I think I will end up using both look-ahead techniques you mention, depending on whether I want to write to something or only read it. Thank you so much for sharing that!

Ariel Weisberg

Apr 25, 2014, 4:56:33 PM
to mechanica...@googlegroups.com
Hi,

I guess what I am getting at is that a blind store would have no dependencies. What would be the right way to introduce a dependency so that it executes at the beginning of the loop body and not at the middle/end, or isn't moved out of the loop body entirely?

It seems like the CPU could choose to reorder execution as well, although maybe at that granularity it isn't a concern for a prefetch optimization?

Ariel

Martin Thompson

Apr 25, 2014, 5:08:13 PM
to mechanica...@googlegroups.com
I see what you are getting at. If you have no data dependency then it could be reordered. On x86, loads and stores are not reordered given its memory model; only loads are reordered with older stores to different locations.

That being said, the compiler could reorder. I tend to check that I get the results I expect, but I have not used this on a large scale. Prefetch can be considered a hint. It is probably worth checking whether prefetch is an ordered instruction like CPUID is. Does anyone know? Because it would have the same issue.


ymo

Apr 25, 2014, 5:24:41 PM
to mechanica...@googlegroups.com
Maybe this explains why they have "asm volatile" in front of the prefetch code! To stop the compiler from reordering, I guess?

Huge lights went on in one village .. somewhere !
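For reference, a typical wrapper of this kind looks roughly like the following sketch (the function name is made up; the exact asm constraints vary between codebases):

```cpp
#include <cstdint>

// A GCC/Clang-style wrapper around the x86 prefetcht0 instruction.
// "volatile" stops the compiler from deleting the statement as dead
// code, and the "m" memory-operand input ties the asm to the address
// so it is not hoisted arbitrarily far from the surrounding accesses.
static inline void prefetch_t0(const void* addr) {
#if defined(__x86_64__) || defined(__i386__)
    asm volatile("prefetcht0 %0" : : "m"(*static_cast<const char*>(addr)));
#else
    __builtin_prefetch(addr, /*rw=*/0, /*locality=*/3);  // portable fallback
#endif
}
```

Note the hardware itself still treats the prefetch as an unordered hint, so this only constrains the compiler, not the CPU.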



Vitaly Davidovich

Apr 25, 2014, 6:36:26 PM
to mechanica...@googlegroups.com

Prefetch is unordered (it's just a hint anyway). I wonder if ymo's C++ code that's doing prefetch is some old relic to help on older hardware; modern CPUs are pretty good at prefetching on their own, and putting in manual hints may hurt by impeding the "normal" prefetch. Either way, I'd benchmark this before asking for its inclusion in Java :).

Sent from my phone

Richard Warburton

Apr 25, 2014, 6:54:11 PM
to mechanica...@googlegroups.com
Hi,

1) you know how you are using your data better than the hw prefetch guessing game. That is why they have these instructions in the first place !

This is the kind of assumption which I'm pretty sure we all make at some point in time. I'm not sure that it's an assumption that will necessarily hold. What is the performance difference in your C++ code with and without your prefetch instructions? What do your CPU's performance event counters tell you about the difference in behaviour?

Answering these questions will give you a convincing case as to the value of the prefetch in the context of your application and also a case either way as to whether Java's lack of prefetch support is actually a problem. I'm also being selfish here, because I would also be interested to know the answer ;)

regards,

  Richard Warburton

ymo

Apr 25, 2014, 7:06:26 PM
to mechanica...@googlegroups.com
The jury is still out on prefetch in general applications, but when it comes to controlled and dedicated environments like 10G network card handling, I am told it's the norm. Again, it's a simple thing to prove or disprove given the right hw and ... time )

Dan Eloff

Apr 26, 2014, 1:57:40 PM
to mechanica...@googlegroups.com
Doh! So what's the best way to prefetch in Java then? My use case is iterating over 256-byte blocks arranged in a linked list. Hardware auto-prefetch doesn't help with that because it's unpredictable, and prefetching the next block before processing the current block essentially eliminates the memory latency and the inevitable TLB-miss latency. It seems like an ideal use case for prefetch (but I'll admit, at this point I have not timed it).

I could put in volatile reads, but that might introduce membars which are not needed? Also, even if I fool the compiler so those reads can't be optimized away, I'm not sure it will fool the CPU. It might also spill some things out of registers.

Thoughts?
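In C++ terms (where an intrinsic is available), the pattern Dan describes would look something like this hypothetical sketch; the open question in the thread is precisely that Java offers no equivalent:

```cpp
#include <cstdint>

// Walk a linked list of 256-byte blocks, issuing a prefetch for the
// next block before processing the current one, so the pointer-chase
// cache (and TLB) miss overlaps with useful work on this block.
struct Block {
    Block* next;
    std::uint8_t data[256];
};

std::uint64_t process(const Block* head) {
    std::uint64_t total = 0;
    for (const Block* b = head; b != nullptr; b = b->next) {
        if (b->next != nullptr) {
            __builtin_prefetch(b->next, /*rw=*/0, /*locality=*/3);
        }
        for (std::uint8_t byte : b->data) {  // stand-in "work": sum the block
            total += byte;
        }
    }
    return total;
}
```

Unlike a linear array scan, the hardware prefetcher cannot guess `b->next`, which is why pointer-chasing workloads are the classic argument for an explicit prefetch hint.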

ymo

Apr 26, 2014, 4:17:15 PM
to mechanica...@googlegroups.com
Martin, according to the link below it seems that you can also add a "hints field" when you are doing a prefetch for write usage.


Martin Thompson

Apr 27, 2014, 11:58:48 AM
to mechanica...@googlegroups.com
This link is for the Phi co-processor and not a standard x86 CPU. Am I missing something?

3DNow!, introduced by AMD, added two instructions for prefetch-for-read and prefetch-for-write, but this was a bit of a disaster.

    PREFETCH/PREFETCHW – Prefetch at least a 32-byte line into L1 data cache

SSE added the following 4 prefetch instructions. I don't know of any others.

    PREFETCHT0 – Prefetch data from address into all cache levels
    PREFETCHT1 – Prefetch data from address into all cache levels EXCEPT L1
    PREFETCHT2 – Prefetch data from address into all cache levels EXCEPT L1 and L2
    PREFETCHNTA – Prefetch data from address into a non-temporal cache structure

Are there others as part of standard x86?

Martin...

Gil Tene

May 1, 2014, 11:55:16 AM
to mechanica...@googlegroups.com
Looks like we'll get PREFETCHW in Broadwell, finally (http://en.wikipedia.org/wiki/Broadwell_(microarchitecture) )

When writes are streamed and the hardware can detect a stride, the hardware prefetchers do a good job, but the current lack of an explicit prefetch-for-write gets in the way in various situations where hardware prefetchers are useless, like prefetching the start of a short target stream in a memcpy, or prefetching for an upcoming lightly contended CAS or SWAP operation when you can see it coming miles away in software, but the CPU can't.
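As a sketch of that last case, GCC/Clang already expose a write-prefetch hint through `__builtin_prefetch(addr, 1, ...)`; it only lowers to an actual PREFETCHW on hardware/compiler combinations that support it, and otherwise degrades to a read prefetch or a no-op (function and variable names below are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// If software can see a CAS coming "miles away", a prefetch-for-write
// hint (rw = 1) can request the cache line in exclusive state early,
// so the later compare-exchange does not stall acquiring ownership.
bool claim(std::atomic<std::uint64_t>& ticket, std::uint64_t expected) {
    __builtin_prefetch(&ticket, /*rw=*/1, /*locality=*/3);
    // ... other work here that the prefetch latency can overlap with ...
    return ticket.compare_exchange_strong(expected, expected + 1);
}
```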