Currently, on Intel x86, there are no locked operations on operands larger than one cacheline.
Locked operations whose operand crosses a cacheline boundary are allowed, but they take 4-6K
clock cycles, probably implemented as "stop the world" (emulation of the legacy #LOCK pin).
AFAIK, locked ops on a single cacheline are implemented like so:

0. flush "load buffers", if needed
1. get the cacheline C1 in E state
2. stop listening to coherency traffic on C1
3. do the op
4. flush "store buffers" if needed (cacheline may go to M state)
5. start listening to coherency traffic on C1

In principle, a locked instruction involving 2 cachelines could be implemented (?) without taking any global "stop the world" locks, similarly:

0. flush "load buffers", if needed
1. get cacheline C1 with the lower address in E state
2. stop listening to coherency traffic on C1
3. get cacheline C2 with the higher address in E state
4. stop listening to coherency traffic on C2
5. do the op
6. flush "store buffers" if needed (cachelines may go to M state)
7. start listening to coherency traffic on C1, C2

(Acquiring the lines in address order means two cores contending for the same pair of lines can never deadlock.)

I am not entirely sure why nobody ever (?) implemented it like that. It's useful not so much for operands crossing cachelines, but for dealing with larger operands (sorta like CMPXCHG16B). This could be used instead of the common case (?) of a small struct protected by an intrusive lock (making things faster and getting rid of the lock). Unlike RTM/HLE, there is no need for a fallback mechanism, because it can't be "permanently failing": the "critical section" is well-defined in advance. On x86 the opcode could have been a LOCKed string operation or a LOCKed SIMD operation (which may be a good idea (?), even if aligned).
My best guess is that it is due to the high cost of indivisible
instructions with multiple memory operands (x86 doesn't have any). The
cost is high because a lot of expensive state has to be kept for
speculative-execution rollback. This is why:
1. string operations are divisible (visible as separate loads and stores);
2. HLE/RTM transactions abort on page faults, etc.