Here's a brief tutorial, in the hope that it helps.
Optimistic strategies are used in many places besides synchronization:
network protocols (Ethernet and friends), hash tables, memory and
translation caches, and many other places. In all cases, a plot of cost
vs. load factor shows a very low cost at low load, rising slowly until a
knee is reached, with very high cost past the knee as thrashing sets in.
Where the knee is varies with the details, but the curve shape is
characteristic.
For most situations in which an optimistic strategy is available, a
pessimistic strategy is available as well. These too have a
characteristic cost vs. load curve: relatively high at low load, rising
slowly but steadily with load in a reasonably linear fashion, up to a
maximum load at which load is constrained by other factors. There is no
knee, but below the optimistic knee the optimistic strategy is cheaper,
often significantly cheaper, than the pessimistic. After all, were it
not cheaper, optimistic would never be used at all; in fact you use
such strategies all the time without thinking about it.
However, there may be non-cost reasons for choosing one strategy over
the other. In much real-time work, the actual cost is irrelevant so
long as it never exceeds a certain fixed bound. In such a case, the
higher cost but greater certainty of pessimistic may be preferable to
the lower average cost but higher variability of optimistic. IBM
invented Token Ring based on
just such an argument (see
https://en.wikipedia.org/wiki/Token_ring for
the fate of Token Ring). I ignore such specialized situations herein.
Nearly all optimistic strategies are subject to a phenomenon called
"convoying". This is a local, emergent effect, in which minor
happenstantial variations in load cause retries that increase the
effective local load factor, leading to more failure, retries, and more
load. When convoying is present, the global load factor may be well
below the knee but local convoying can raise the local load above the
knee and performance drops sharply until the local effect works itself
out. The convoying effect has been studied mathematically and is akin to
local onset of chaotic regions in the phase space.
There are two possible responses to the appearance of convoying in a
system. In one, the global load factor is kept low enough, and the
statistical character of the load is adjusted, so as to lessen the
probability of a local load cluster over some period of time. Thus, in
Ethernet you might add bandwidth while retaining the optimistic
algorithm unchanged.
The second approach is to prevent happenstantial failure from clustering
locally and hence thrashing. In hash tables, rather than just looking in
the next cell on a collision, you rehash the key and reprobe to
someplace completely different in the table. Uniform reprobing gives
better performance at higher loads because it avoids miss clustering.
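To make the reprobing idea concrete, here is a minimal sketch (hypothetical names, not any particular implementation) of double hashing: each key gets its own probe stride, so two keys that collide at one cell diverge on their very next probe instead of marching through the same cluster.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of double hashing. On a collision, each key follows its own
// probe stride, so keys that collide at one cell diverge afterward
// instead of clustering the way linear probing does.
constexpr std::size_t TABLE_SIZE = 101;  // prime, so every stride cycles the whole table

std::size_t h1(std::uint64_t key) { return key % TABLE_SIZE; }
std::size_t h2(std::uint64_t key) { return 1 + key % (TABLE_SIZE - 1); }  // stride, never zero

// Position of the i-th probe for this key.
std::size_t probe(std::uint64_t key, std::size_t i) {
    return (h1(key) + i * h2(key)) % TABLE_SIZE;
}
```

With linear probing both keys would have probed the same next cell; here their strides differ, which is exactly the "far from here and different from your collider" property.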
The "try again someplace else far from here and different from what your
collider will do" way of avoiding local collision clusters can be
applied to most optimistic strategies. In the time domain, the common
approach is called "exponential backoff". It is used in Ethernet and
many other protocols. The same idea appears in the allocation domain,
where it achieves amortized constant overhead in C++ vectors: each
time the vector runs out of space its capacity is doubled.
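The doubling claim is easy to check with a toy count (a hypothetical helper, assuming a starting capacity of one and one copy per moved element): over n appends the total number of element copies stays below 2n, i.e. amortized constant per append.

```cpp
#include <cassert>
#include <cstddef>

// Count the element copies a grow-on-demand buffer performs over n
// appends when it doubles its capacity whenever it fills. The total
// is a geometric series summing to less than 2n: amortized constant
// cost per append.
std::size_t copies_with_doubling(std::size_t n) {
    std::size_t cap = 1, size = 0, copies = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (size == cap) {   // out of space: reallocate, copy everything over
            copies += size;
            cap *= 2;
        }
        ++size;
    }
    return copies;
}
```

Growing by a fixed increment instead of doubling would make the same total quadratic in n, which is why the exponential policy is the standard one.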
These retry algorithms do not remove the knee that is inherent in
optimistic strategies, but they do remove statistical effects when
load is "bunchy", and tend to move the knee toward lower cost at
higher load. Measuring an optimistic strategy without a backoff is
attacking a straw man.
Both optimistic and pessimistic show statistical variation in the cost
of any single access. You may think that pessimistic does not, but
with fixed bandwidth, if requests locally bunch above that bandwidth
then some requests must be queued even though no retry occurs; that
queuing is the source of the rising cost curve of pessimistic. For any
given load factor, the mean of these request costs can be computed
from the shape of the respective curves and the underlying zero-load
cost.
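As one hedged illustration of such a computation — under the simplest queueing assumptions (a single server with random arrivals, M/M/1; my assumption, not anything claimed above) — the mean request cost is just the zero-load cost scaled by 1/(1 - rho), where rho is the load factor:

```cpp
// Illustration only, under M/M/1 queueing assumptions: with load
// factor rho in [0, 1), the mean request cost is the zero-load cost
// scaled by 1/(1 - rho). The curve rises smoothly with load and blows
// up as rho approaches saturation, which is the pessimistic curve
// shape described above.
double mean_cost(double zero_load_cost, double rho) {
    return zero_load_cost / (1.0 - rho);
}
```

Real systems have different arrival statistics, but any reasonable model produces the same qualitative shape: finite at zero load, rising with load, unbounded at saturation.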
The cost of synchronization is but one part of the overall constraints
experienced by a program. Thus synchronization may be completely hidden
by a lack of memory bandwidth, insufficient floating-point units, and
so on. The designer must fit within externally imposed resource limits -
it might be nice to have 25 floating point units, but they won't fit in
the power and area budgets. Within those limits, the goal is to provide
minimal average cost over the universe of applications.
This goal is difficult to achieve; every application hits one limit or
another, and every application user feels that his limit (whatever it
is) is the only one that matters and has been unfairly shirked. He feels
mightily aggrieved.
To turn to the case at hand, the cost of synchronization is irrelevant if
the application is not sync-bound. The costs must be measured in
meaningful units. In this case you favor RMW because it is "one
instruction". The number of instructions only matters to decode, which
is not the concern here; that one instruction will promptly become
several micro-ops anyway. Instruction count is not a meaningful measure.
The appropriate measure here is the number of syncs per minute, or,
equivalently, the average number of cycles per sync. Note "average": if
you are doing hard real time and truly require fixed maximum delay you
should not be using any conventional CPU; timing variations from cache,
TLB, memory refresh, and the scheduler will completely swamp the
variations from any kind of sync.
The hardware can only do one RMW at a time. If there are 20 cores
contending, each core will have a sync bandwidth that is a 20th of what
is available without contention. So you cannot know what that single RMW
costs, until you are aware of what else is happening, and the RMW cost
will certainly not be "one instruction" by a long shot, at any load
approaching the optimistic knee.
Your comparison here is not only flawed by treating RMW as
instantaneous, it is also flawed by a failing implementation of
optimistic. As written, your loop will convoy because you have no
backoff, but that is a fault of your code and not of the strategy. These
things are subtle - any multicore aspect is - and it's easy to get it
wrong. That's why it's better to use a well-written synchronization
library rather than roll your own loops.
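For concreteness, here is a sketch (mine, not the code under discussion) of an optimistic CAS retry loop with capped exponential backoff, so that threads that collide once do not immediately re-collide:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sketch: an optimistic fetch-add built from compare-and-swap, with
// capped exponential backoff on failure so colliding threads spread
// out in time instead of convoying. The spin loop is a stand-in for
// a proper pause/yield on real hardware.
void add_with_backoff(std::atomic<std::uint64_t>& counter, std::uint64_t delta) {
    std::uint64_t spins = 1;
    std::uint64_t seen = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(seen, seen + delta,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // Back off before retrying; double the wait each failure.
        for (volatile std::uint64_t i = 0; i < spins; i = i + 1) { }
        if (spins < 1024) spins *= 2;
        // `seen` was refreshed by the failed CAS; just retry.
    }
}
```

Delete the backoff loop and you have exactly the convoying version: every loser retries at once, raising the local load factor and driving further failures.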
Depending on the shape of the cost curves, how good a backoff is used,
and the phase of the moon, there may be a region of the load space in
which the system is sync bound and pessimistic (RMW) is less costly than
optimistic; that is far from certain, but it may be. Being close to
saturation, that region of the load space will be rarely occupied by
applications - very few applications get anywhere near sync bound in any
strategy. Short of that region (i.e. at lighter contention), optimistic
is less costly than pessimistic, which benefits the mass of applications
that are in that lighter load region.
It is a designer's choice whether to benefit the great mass at the
expense of the extreme and rare, or vice versa. The Mill design attempts
to have it both ways and benefit both, but sometimes you can't. In this
case it's not even clear that a dilemma exists; the examples you have
offered so far do not in fact support the claim made. Our analysis of
the hardware, including the Mill extensions, has convinced us that our
optimistic will be less costly on average than RMW would be, at any load
factor up to saturation caused by reasons other than sync.
Consequently we feel that there is no load at which RMW would be
better. True, RMW will probably have a more stable response time, but
the Mill is not suitable for hard real time in any case, so that
doesn't matter to us.
It may well be that we face surprises when the design reaches silicon -
I'm sure there will be a few, and sync cost may be one. If our analysis
turns out to have been wrong then we'll fix it then. But we won't change
it now on the basis of a misunderstood straw man.
YMMV.
This thread has probably gone on longer than is profitable to either of
us, so I'm dropping it at this point.
Ivan