Invalid -> Shared -> Exclusive(Modified) transition


Roman Leventov

Sep 17, 2015, 3:24:33 PM
to mechanica...@googlegroups.com
In this thread https://groups.google.com/forum/#!topic/mechanical-sympathy/kMwNObjHMrE Vitaly stated:

I know the cores will try avoiding additional RFO traffic by bringing the line into exclusive mode upfront

This is interesting, because recently I observed that reading a cache line just before CAS-ing it adds 12 ns of latency, compared to performing the CAS without the preceding read.


My assumption was that the CPU fails to recognize this pattern and doubles the bus traffic, making an Invalid -> Shared -> Exclusive transition instead of just Invalid -> Exclusive.

However, testing was done on Sandy Bridge; I don't know how this performs on newer architectures.
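For concreteness, the two variants being compared look roughly like this (a hedged sketch with made-up names, not the actual Chronicle Map code):

import java.util.concurrent.atomic.AtomicLong;

class CasVariants {
    static final long UNLOCKED = 0L, LOCKED = 1L;

    // Variant A: go straight for the CAS (Invalid -> Exclusive/Modified).
    static boolean lockDirect(AtomicLong lockWord) {
        return lockWord.compareAndSet(UNLOCKED, LOCKED);
    }

    // Variant B: plain read first, then CAS. The read brings the line in Shared,
    // so the CAS may have to upgrade it -- the suspected source of the extra 12 ns.
    static boolean lockAfterRead(AtomicLong lockWord) {
        if (lockWord.get() != UNLOCKED)
            return false;
        return lockWord.compareAndSet(UNLOCKED, LOCKED);
    }
}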

Martin Thompson

Sep 17, 2015, 3:40:39 PM
to mechanica...@googlegroups.com
I've observed similar results on Sandy Bridge and Ivy Bridge. Not tested on Haswell Xeons yet.

However, on multi-socket systems I've found that algorithms which test before CAS'ing often scale better under contention. For example, do other work in a duty cycle if you see something is already part way through some algorithm. Have you tested your lock code on a multi-socket server? In the pathological case, if the word being CAS'ed straddles a cache line, you need to take out the big bus lock over QPI. Need to be careful off heap :-)
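The "test before CAS" idea is essentially the classic test-and-test-and-set pattern; a minimal sketch (hypothetical names, not taken from any of the projects mentioned):

import java.util.concurrent.atomic.AtomicLong;

class TtasLock {
    static final long UNLOCKED = 0L, LOCKED = 1L;
    final AtomicLong state = new AtomicLong(UNLOCKED);

    void lock() {
        for (;;) {
            // Spin on a plain read: the line stays Shared across all waiters
            // instead of every waiter issuing an RFO on each attempt.
            while (state.get() != UNLOCKED) {
                // under contention, do other useful work here (the "duty cycle" above)
            }
            if (state.compareAndSet(UNLOCKED, LOCKED))
                return;
        }
    }

    void unlock() {
        state.set(UNLOCKED);
    }
}

Under low contention this pays the extra Shared -> Exclusive upgrade discussed above; under heavy multi-socket contention it tends to generate far less coherency traffic.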

Roman Leventov

Sep 17, 2015, 3:46:37 PM
to mechanica...@googlegroups.com
Yes, the test that showed the 12 ns difference has the following layout: 8 threads, 4 on one socket and 4 on the other, all affined to cores (the machine has 8 cores, 16 HW threads), performing random ops on 1000 segments. There are typically so many segments in Chronicle Map that there is no contention.
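In outline, the setup is something like this (a simplified on-heap sketch with made-up names; the real lock words live off-heap in Chronicle Map and are padded, which this ignores):

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

class SegmentedCasBench {
    static final int SEGMENTS = 1000;
    // one lock word per segment; with 8 threads spread over 1000 segments,
    // two threads almost never touch the same segment at the same time
    static final AtomicLongArray locks = new AtomicLongArray(SEGMENTS);

    static void randomOp() {
        int seg = ThreadLocalRandom.current().nextInt(SEGMENTS);
        while (!locks.compareAndSet(seg, 0L, 1L)) { }  // essentially uncontended
        // ... do the segment operation ...
        locks.set(seg, 0L);
    }
}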

Roman Leventov

Sep 17, 2015, 3:52:26 PM
to mechanica...@googlegroups.com
The idea following this observation was that it would be nice to have something like Unsafe.read[Int, Long, Object]ForWrite(long address, [int, long, Object] unrealisticValue), which should compile (on x86) to:

mov $unrealisticValue, %rax       // cmpxchg compares against the accumulator
mov $unrealisticValue, %rbx       // the source operand must be a register, not an immediate
cmpxchg %rbx, (address)           // note: without lock prefix
// here we have the value at (address) in %rax (the compare never matches, so nothing is written)

The idea is that the CPU should read-for-exclusive for cmpxchg, but it never actually writes if unrealisticValue is chosen right.

For patterns like:

long lockState = getLongForWrite(...);
// ... some checks on the lock state ...
if (cas(lockState, newLockState)) ...
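A compilable sketch of how such a primitive might be used in a tryLock; getLongForWrite here is only a placeholder for the hypothetical intrinsic (today it can be nothing more than a plain read from Java):

import sun.misc.Unsafe;
import java.lang.reflect.Field;

class ReadForWriteSketch {
    static final Unsafe UNSAFE = unsafe();
    static final long LOCK_BIT = 1L;

    // Proposed primitive: a JIT intrinsic would emit cmpxchg (no lock prefix) with an
    // expected value the word can never hold, so the line comes in Exclusive but is
    // never written. This placeholder just does a plain read.
    static long getLongForWrite(long address, long unrealisticValue) {
        return UNSAFE.getLong(address);
    }

    static boolean tryLock(long address) {
        long lockState = getLongForWrite(address, Long.MIN_VALUE);
        if ((lockState & LOCK_BIT) != 0)
            return false;  // already locked, no CAS issued
        return UNSAFE.compareAndSwapLong(null, address, lockState, lockState | LOCK_BIT);
    }

    static Unsafe unsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }
}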

Martin Thompson

Sep 17, 2015, 3:56:26 PM
to mechanica...@googlegroups.com
If you have virtually no contention then going straight for the CAS will be the lowest latency. A nice win on the Broadwell processor then would be to issue a PREFETCHW, which could be scheduled earlier in the pipeline if we could emit such a hint from Java.

Vitaly Davidovich

Sep 17, 2015, 4:31:43 PM
to mechanical-sympathy
This is something that I don't have tangible detail on, unfortunately.  You could try asking on Intel's forum and see if anyone pipes up with implementation details.

Keep in mind, though, that a normal store will just hit the store buffer while it waits for the cacheline to reach the proper state, allowing execution to proceed (and if the store is store-forwardable, even better). There will be an additional QPI transaction for the RFO, but not necessarily a latency penalty. However, you're testing with an atomic operation, which would suppress this.

Vitaly Davidovich

Sep 17, 2015, 4:42:20 PM
to mechanical-sympathy
Personally, I wouldn't bother with this.  If you believe that a preliminary test would avoid a more expensive operation such as CAS, just let the compiler and/or CPU do the load scheduling.  If you know you're going to write anyway, let the CPU sort out the scheduling and don't issue manual prefetch-like instructions.  The processors are way too complicated for the vast majority of manual hints (IMHO).

Vitaly Davidovich

Sep 17, 2015, 5:37:08 PM
to mechanical-sympathy
Ok, so to elaborate a bit.

Certain read-and-modify instructions, like cmpxchg, immediately issue an RFO (if needed) -- they do not first request the line in shared state.  A plain store instruction also requests ownership of the line, so the line arrives ready for writes.  If the line is not yet available, that's ok -- the store sits in the store buffer, and can be store-forwarded if the subsequent load is forwardable (empirically, it looks like each new generation of Intel chips makes more and more store-to-load cases forwardable).

Roman, I think you were expecting the initial load before the explicit CAS to already request the line via an RFO, speculating that the CAS will execute.  So this is what I don't know about -- we're talking about the OoO execution engine doing a look-ahead to see whether the line will be modified (or speculating on it).  I highly doubt this is done; apart from other possible reasons, who's to say the line hasn't been "stolen"/invalidated by the time you're ready to write to it?
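In Java terms, the distinction drawn above is roughly the following (a hedged sketch; the comments describe typical x86 behaviour underneath, not the Java-level semantics):

import java.util.concurrent.atomic.AtomicLong;

class StoreVsCas {
    long plain;                                  // ordinary field
    final AtomicLong atomic = new AtomicLong();

    void plainStore(long v) {
        // Compiles to a plain mov: the value sits in the store buffer while the RFO
        // for the line is outstanding, and execution carries on. A following read of
        // 'plain' on this thread can be forwarded from the store buffer.
        plain = v;
    }

    boolean casStore(long expected, long v) {
        // Compiles to lock cmpxchg: the RFO is issued immediately and the instruction
        // cannot complete until the core owns the line, so the coherency latency shows
        // up directly in the operation's latency.
        return atomic.compareAndSet(expected, v);
    }
}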

Roman Leventov

Sep 17, 2015, 6:17:48 PM
to mechanica...@googlegroups.com
Vitaly, I interpreted your original statement as saying CPUs are able to do such multi-instruction analysis and speculative RFOs ahead of reads. It seems you meant this just for single read-modify-write instructions themselves; that's fine, and in that case my test result and latency improvement are not surprising at all (and they weren't for me, before I read and misinterpreted your message).

On 18 September 2015 at 00:37, Vitaly Davidovich <vit...@gmail.com> wrote:

who's to say the line hasn't been "stolen"/invalidated by the time you're ready to write to it?

Theoretically, it could operate as on a normal miss and issue another RFO. The whole difference would be that, for the initial read, the core sends a different message in the cache coherency protocol. IIUC

Vitaly Davidovich

Sep 17, 2015, 6:43:25 PM
to mechanical-sympathy

Yes, I can see how it was confusing, particularly since we started talking about true inter-instruction OoO execution, speculative loads, etc.  I think part of the confusion is also that that thread talks about an "extra read before writing" to the cacheline without elaborating that it means a single instruction.

Ok, with that out of the way ... I don't know how the lookahead would work; right now the execution engine "looks backwards" to determine what's already in flight, detect dependencies, etc. This is going the other way: the CPU sees a load, and now wants to know whether there's a store to the same line later on - not possible, since we're processing instructions as a stream.  Ok, we'll issue the load right now and have it enter the load queue for servicing.  At some point we see a store to a cacheline - do we check whether there are pending loads to it? Do we check every store for this? Mark those load requests as exclusive? Or the loads have already left (possibly retired), in which case we're back to today's scenario - issue the RFO and drop the store into the store buffer.

I'm not a hardware engineer, so take this with a grain of salt, but I don't see this being practical or efficient.  We do want to minimize interconnect traffic, but we must not create other hazards/stalls/inefficiencies in the process.  Perhaps hardware will have better answers/implementations as core counts increase, but I doubt having many writers to the same memory will ever scale well.

sent from my phone

Gil Tene

Sep 19, 2015, 3:06:39 PM
to mechanical-sympathy
The reason it would be "wrong" for the CPU to generally (and unconditionally) interpret a read ahead of a CAS as a wish to bring the cache line into exclusive mode is that many concurrent algorithms rely on such reads to filter out modifications and keep a hotly read-from line in a shared state. E.g. HotSpot's -XX:+UseCondCardMark does exactly that to avoid false sharing on the card table (since each cache line in the card table covers 32KB of heap address space, any reference stores to the same 32KB range will "dirty" cards in the same cache line). If the CPU interpreted the read of the card ahead of the conditional use of CAS as a wish to bring the cache line in as exclusive, the false-sharing-driven thrashing would remain.
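In code, the conditional card mark described above is essentially the following (a sketch of the idea only; the constants and layout are illustrative, not HotSpot's actual barrier):

class CondCardMarkSketch {
    static final byte DIRTY = 0;                           // illustrative value
    static final byte[] CARD_TABLE = new byte[1 << 20];    // one byte per 512-byte card

    // Write barrier after a reference store into the object at objAddress.
    static void conditionalCardMark(long objAddress) {
        int card = (int) ((objAddress >>> 9) & (CARD_TABLE.length - 1));
        // UseCondCardMark idea: read first, store only if not already dirty.
        // The plain read lets the card-table line stay Shared across cores;
        // the store (and the invalidation it causes) happens only the first time.
        if (CARD_TABLE[card] != DIRTY) {
            CARD_TABLE[card] = DIRTY;
        }
    }
}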

You could probably try for something like "a read followed by a CAS on the same address, with no conditional branches in between" being interpreted as wanting the line to come in in exclusive mode...

Roman Leventov

Sep 19, 2015, 6:44:10 PM
to mechanica...@googlegroups.com
Another option - using counters to estimate how often a read is followed by a successful write/CAS, the way the branch prediction unit works. If that's possible.
