Theoretical lowest latency across same socket

André Monteiro

Jun 11, 2016, 5:38:41 PM
to mechanical-sympathy
Hi,

How would one get the theoretical lowest latency achievable for inter-thread communication or IPC for a 50-byte message on a modern server, with 2 threads exchanging data on 2 separate cores of the same CPU socket?

Martin Thompson

Jun 12, 2016, 4:13:11 AM
to mechanica...@googlegroups.com
I'll assume x86_64.

Align messages to 128-byte boundaries to account for the MLC prefetcher. Use one field in the message header to signal completion from the producer with a MOV instruction that has a release fence inserted before it by the compiler. The consumer busy-spins on the completion field with a PAUSE instruction inside the busy-spin loop. The read of the completion field needs to have an acquire fence after it.
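A minimal Java sketch of the handoff described above, under stated assumptions: the class `HandoffSketch`, the `Slot` type, and the field `ready` are hypothetical names, and Java 9+ is assumed for `VarHandle` and `Thread.onSpinWait()`. Java gives no direct control over object alignment, so the 128-byte alignment is not shown here. On x86-64 the release store typically compiles to a plain MOV, and `Thread.onSpinWait()` is the hook the JIT may turn into PAUSE.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.Arrays;

public class HandoffSketch {
    // Hypothetical single slot: a completion field plus a 50-byte payload.
    static final class Slot {
        final byte[] payload = new byte[50];
        int ready; // completion field, accessed only through READY
    }

    static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup().findVarHandle(Slot.class, "ready", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Producer: plain stores for the payload, then a release store of the flag,
    // so the payload writes cannot be reordered after the flag write.
    static void publish(Slot slot, byte[] message) {
        System.arraycopy(message, 0, slot.payload, 0, slot.payload.length);
        READY.setRelease(slot, 1);
    }

    // Consumer: busy-spin on the flag with an acquire load, hinting the CPU
    // on each failed poll, then read the payload.
    static byte[] awaitMessage(Slot slot) {
        while ((int) READY.getAcquire(slot) == 0) {
            Thread.onSpinWait();
        }
        return slot.payload.clone();
    }

    public static void main(String[] args) {
        byte[] message = new byte[50];
        Arrays.fill(message, (byte) 7);
        Slot slot = new Slot();
        new Thread(() -> publish(slot, message)).start();
        byte[] received = awaitMessage(slot);
        System.out.println(Arrays.equals(message, received)); // true
    }
}
```

This is a one-shot slot for illustration only; a real queue would also need to reset or sequence the completion field so slots can be reused.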

On the Haswell+ Xeon processors with 10 or more cores you need to be aware of the SBox switch and the Cluster On Die (COD) setup to get the best latency out of the socket.

Martin...

Ross Bencina

Jun 12, 2016, 5:49:49 AM
to mechanica...@googlegroups.com
On 12/06/2016 6:13 PM, Martin Thompson wrote:
> I'll assume x86_64.
>
> Align messages to 128-byte boundaries to account for the MLC prefetcher.
> Use one field in the message header to signal completion from the
> producer with a MOV instruction that has a release fence inserted
> before it by the compiler. The consumer busy-spins on the completion field
> with a PAUSE instruction inside the busy-spin loop. The read of the
> completion field needs to have an acquire fence after it.

I've been wondering about this setup for a while. Something that bugs me:

Let's assume that the producer writes data to the message incrementally
and finally sets the completion field.

If the completion field is on the same cache line as the rest of the
message data, and the consumer is polling the field, won't that cause
the cache line to ping-pong between <producer-exclusive> and <shared>
each time the producer's data writes interleave with the consumer's polls?

Would it not be better if the completion field was on its own cache
line? Then there is only one <shared> --> <exclusive> --> <shared>
transition when the producer signals completion. (Rather than
potentially one transition every time the producer sets a non-completion
field in the message).

I assume that this would be more of a problem between sockets, but
doesn't it make a difference between cores on the same die too?
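To make the two layouts being compared concrete, here is a small arithmetic sketch. The 64-byte cache line, the 4-byte completion field, and all names in `SlotLayoutSketch` are assumptions for illustration; the 128-byte spacing follows the alignment suggested earlier in the thread.

```java
public class SlotLayoutSketch {
    static final int CACHE_LINE = 64;    // assumed x86_64 cache line size
    static final int SLOT_ALIGN = 128;   // alignment suggested earlier in the thread
    static final int FLAG_BYTES = 4;     // hypothetical 4-byte completion field
    static final int PAYLOAD_BYTES = 50; // message size from the original question

    // Round value up to the next multiple of a power-of-two alignment.
    static int alignUp(int value, int alignment) {
        return (value + alignment - 1) & -alignment;
    }

    // Layout A: completion field and payload packed into the same slot.
    static int inlineSlotSize() {
        return alignUp(FLAG_BYTES + PAYLOAD_BYTES, SLOT_ALIGN);
    }

    // Layout B: completion field in its own aligned unit, payload in the next.
    static int separateSlotSize() {
        return alignUp(FLAG_BYTES, SLOT_ALIGN) + alignUp(PAYLOAD_BYTES, SLOT_ALIGN);
    }

    // Number of cache lines covered by [offset, offset + length).
    static int linesTouched(int offset, int length) {
        return (offset + length - 1) / CACHE_LINE - offset / CACHE_LINE + 1;
    }

    public static void main(String[] args) {
        System.out.println(inlineSlotSize());   // 128
        System.out.println(separateSlotSize()); // 256
        // Layout A: flag and payload fit in one cache line.
        System.out.println(linesTouched(0, FLAG_BYTES + PAYLOAD_BYTES)); // 1
        // Layout B: one line for the flag plus one for the payload.
        System.out.println(linesTouched(0, FLAG_BYTES)
                + linesTouched(SLOT_ALIGN, PAYLOAD_BYTES)); // 2
    }
}
```

Under these assumptions the consumer pulls one cache line per message with the packed layout and two with the separated one, which is the trade-off against the possible ping-pong on the shared line.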

Thanks,

Ross.

Martin Thompson

Jun 12, 2016, 5:58:23 AM
to mechanica...@googlegroups.com

The best case for a dirty hit between cores is 60 cycles; it can be a lot more with large L3 caches. This should be fewer cycles than writing a 50-byte message plus header. The alternative is two cache misses.

Write a test to see how they compare. It is not a difficult experiment. :) Use MSRs to dig in.

Vitaly Davidovich

Jun 12, 2016, 12:35:44 PM
to mechanica...@googlegroups.com
Without store buffering and write combining, yes, your scheme might be better. For a 50-byte message, it's likely write combining will turn the individual writes into one cache-to-cache transaction anyway, and the consumer will see the entire (completed) message in one transfer.

André Monteiro

Jun 13, 2016, 7:44:38 PM
to mechanical-sympathy
Thanks a lot! Am I right to say that the PAUSE instruction is not available in Java?

Are there other limitations Java imposes on this design?

I will get the numbers and let you know.

Martin Thompson

Jun 14, 2016, 2:42:14 AM
to mechanical-sympathy
The PAUSE instruction will be available on x86 through Thread.onSpinWait(), which arrives in Java 9.

Gil Tene

Jun 14, 2016, 9:37:41 AM
to mechanical-sympathy
For both a numeric example of the impact of PAUSE on minimum round-trip times and actual measurements of those round-trip times between two threads on the same core (so the absolute minimum latency possible between two threads on current Intel Xeon processors), see JEP 285 (http://openjdk.java.net/jeps/285). There is some specific example test code at https://github.com/giltene/GilExamples/tree/master/SpinWaitTest which you can use as a starting point.
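As a rough idea of what such a round-trip measurement looks like, here is a minimal ping-pong sketch. It is not the code from the linked repository; `PingPongSketch` and its members are hypothetical names, Java 9+ is assumed, and thread pinning, warm-up, and padding between the two counters are deliberately left out, so the numbers it prints are only indicative.

```java
import java.util.concurrent.atomic.AtomicLong;

public class PingPongSketch {
    // Two counters used as one-way signals between the threads.
    // No padding is applied, so they may share a cache line.
    static final AtomicLong ping = new AtomicLong(-1);
    static final AtomicLong pong = new AtomicLong(-1);

    // Returns the average round-trip time in nanoseconds over the given
    // number of iterations between the calling thread and an echo thread.
    static long averageRoundTripNanos(int iterations) {
        ping.set(-1);
        pong.set(-1);
        Thread echo = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                while (ping.getAcquire() != i) {
                    Thread.onSpinWait(); // spin hint, PAUSE on x86 where supported
                }
                pong.setRelease(i);      // echo the sequence number back
            }
        });
        echo.start();

        long start = System.nanoTime();
        for (long i = 0; i < iterations; i++) {
            ping.setRelease(i);
            while (pong.getAcquire() != i) {
                Thread.onSpinWait();
            }
        }
        long elapsed = System.nanoTime() - start;

        try {
            echo.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return elapsed / iterations;
    }

    public static void main(String[] args) {
        averageRoundTripNanos(1_000_000); // crude warm-up pass
        System.out.println("avg round trip (ns): " + averageRoundTripNanos(10_000_000));
    }
}
```

To compare placements (same core, different cores on one socket), the threads would need to be pinned with an external tool such as `taskset` or a thread-affinity library, which this sketch does not do.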

-- Gil.

André Monteiro

Jun 16, 2016, 10:01:26 PM
to mechanical-sympathy
Thanks. That's exactly what I am looking for.

André