lock cmpxchg - 100 cycles
mfence - 104 cycles
So I conclude that they are nearly identical in terms of cycles
consumed. But is there any difference between them in terms of overall
system performance, especially on modern multicore processors
(Core 2 Duo, Core 2 Quad)?
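
For reference, here is one way per-instruction figures like the ones above could be measured. This is only a sketch and not necessarily how those numbers were obtained. It assumes an x86 target and GCC or Clang; `g_word`, `op_lock_cmpxchg`, `op_mfence` and `ticks_per_op` are hypothetical names chosen for illustration. Note that `rdtsc` counts reference ticks, which need not equal core clock cycles, and the loop overhead is included in the result.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <x86intrin.h>   // __rdtsc (GCC/Clang, x86 only)

// Hypothetical shared word that the locked instruction operates on.
static std::atomic<int> g_word{0};

// One locked compare-and-swap; on x86 this is typically "lock cmpxchg".
inline void op_lock_cmpxchg() {
    int expected = 0;
    g_word.compare_exchange_strong(expected, 0, std::memory_order_seq_cst);
}

// One full fence, written as inline asm so that mfence itself is measured.
inline void op_mfence() {
    __asm__ __volatile__("mfence" ::: "memory");
}

// Average time-stamp-counter ticks per call of op, over iters calls.
template <class Op>
double ticks_per_op(Op op, int iters) {
    assert(iters > 0);
    const std::uint64_t start = __rdtsc();
    for (int i = 0; i < iters; ++i) {
        op();
    }
    const std::uint64_t stop = __rdtsc();
    return static_cast<double>(stop - start) / iters;
}
```

Calling `ticks_per_op(op_lock_cmpxchg, 1000000)` and `ticks_per_op(op_mfence, 1000000)` from a single thread gives only the uncontended, local cost of each instruction. It says nothing about the effect on other cores, which is the point of the question.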
Is the following assumption correct: the LOCK prefix involves bus/cache
locking, so it has an impact on total system performance, whereas
mfence has only a local impact on the current core?
Or, more practically: suppose I have 2 algorithms, one using the lock
prefix and the other using mfence. Other things being equal, which one
should I prefer?
For example:
A program uses a large number of mutexes. Each particular mutex
synchronizes only 2 threads.
I can implement the mutex with either:
1. The "traditional" scheme, based on "lock xchg" in the acquire
operation and a "naked store" in the release operation.
2. Peterson's algorithm, based on a #StoreLoad memory barrier (mfence)
in the acquire operation and a "naked store" in the release operation.
So the net difference is LOCK vs. MFENCE.
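
To make the two schemes concrete, here is a minimal sketch of both. It is an illustration under stated assumptions, not a definitive implementation: the type names `XchgMutex` and `PetersonMutex` are hypothetical, the code targets x86 only, and the fence uses GCC/Clang inline asm. The Peterson variant relies on x86 store ordering plus the explicit mfence; it is not claimed to be correct under the portable C++ memory model or on other architectures.

```cpp
#include <atomic>
#include <cassert>

#if !defined(__x86_64__) && !defined(__i386__)
#error "This sketch relies on x86 memory ordering and the mfence instruction."
#endif

// Scheme 1: spin lock built on an atomic exchange.
// On x86, exchange() becomes xchg with a memory operand, which is
// implicitly locked.
struct XchgMutex {
    std::atomic<int> state{0};   // 0 = free, 1 = held

    void lock() {
        while (state.exchange(1, std::memory_order_acquire) != 0) {
            // spin until the current owner stores 0
        }
    }
    void unlock() {
        state.store(0, std::memory_order_release);   // "naked store"
    }
};

// Scheme 2: Peterson's algorithm for exactly two threads (self is 0 or 1).
struct PetersonMutex {
    std::atomic<int> flag[2];
    std::atomic<int> turn;

    PetersonMutex() : turn(0) {
        flag[0].store(0);
        flag[1].store(0);
    }

    void lock(int self) {
        assert(self == 0 || self == 1);
        const int other = 1 - self;
        flag[self].store(1, std::memory_order_relaxed);   // I want to enter
        turn.store(other, std::memory_order_release);     // but you go first
        // #StoreLoad barrier: both stores above must be globally visible
        // before the loads below are performed.
        __asm__ __volatile__("mfence" ::: "memory");
        while (flag[other].load(std::memory_order_acquire) != 0 &&
               turn.load(std::memory_order_acquire) == other) {
            // spin while the other thread is interested and has priority
        }
    }
    void unlock(int self) {
        assert(self == 0 || self == 1);
        flag[self].store(0, std::memory_order_release);   // "naked store"
    }
};
```

In both cases the release path is an ordinary store, so the acquire paths differ only in the implicitly locked xchg versus two plain stores followed by mfence.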
The question is: will there be any difference in system performance on
a quad-core machine?
Thanks in advance
Dmitriy V'jukov