Currently, on Intel x86, there are no locked operations on operands larger than one cacheline.
Locked operations whose operand crosses a cacheline boundary are allowed, but they take 4-6K
clock cycles, probably implemented as "stop the world" (emulation of the legacy #LOCK pin).
AFAIK, locked ops on a single cacheline are implemented like so:

0. flush "load buffers", if needed
1. get the cacheline C1 in E state
2. stop listening to coherency traffic on C1
3. do the op
4. flush "store buffers" if needed (cacheline may go to M state)
5. start listening to coherency traffic on C1

In principle, a locked instruction involving 2 cachelines could be implemented (?) without taking any global "stop the world" locks, similarly:

0. flush "load buffers", if needed
1. get cacheline C1 with the lower address in E state
2. stop listening to coherency traffic on C1
3. get cacheline C2 with the higher address in E state
4. stop listening to coherency traffic on C2
5. do the op
6. flush "store buffers" if needed (cachelines may go to M state)
7. start listening to coherency traffic on C1, C2

(Acquiring the lines in address order means two cores contending for the same pair of lines can never deadlock.)

I am not entirely sure why nobody ever (?) implemented it like that. It's useful not so much for operands crossing cachelines, but for dealing with larger operands (sorta like CMPXCHG16B). This could be used instead of the common case (?) of a small struct protected by an intrusive lock (making things faster and getting rid of the lock). Unlike RTM/HLE, there is no need for a fallback mechanism, because it can't be "permanently failing": the "critical section" is well-defined in advance. On x86 the opcode could have been a LOCKed string operation or a LOCKed SIMD operation (which may be a good idea (?), even if aligned).
My best guess is that it is due to the high cost of indivisible
instructions with multiple memory operands (x86 doesn't have any). The
cost is high because a lot of expensive state has to be kept for
speculative-execution rollback. This is why:
1. string operations are divisible (visible as separate loads and stores);
2. HLE/RTM transactions abort on page faults, etc.