
Re: Core i7 CMPXCHG-performance


David Schwartz

Dec 29, 2008, 10:36:45 AM
On Dec 27, 2:49 pm, Elcaro Nosille <ElcaroNosi...@akapost.com> wrote:
> I heard that the Core i7 CPU has improved performance of
> LOCK CMPXCHG over Core 2 CPUs. Does anyone here have some numbers?
> Or is there an official throughput/latency table from Intel?

You won't get official throughput/latency numbers from Intel anymore
because there are so many possible cases. However, I have heard from
Intel sources that LOCK CMPXCHG will be 40% less expensive (in terms
of clock cycles) on Core i7 than it is on Core 2. The same source says
that LOCK CMPXCHG is 65% less expensive on Core 2 than it is on
late-model P4s.

DS

Chris M. Thomasson

Dec 29, 2008, 2:01:34 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:da247bcf-2446-4e45...@a29g2000pra.googlegroups.com...

Great!


David Schwartz

Dec 29, 2008, 6:12:22 PM
On Dec 29, 1:42 pm, Elcaro Nosille <ElcaroNosi...@akapost.com> wrote:

> David Schwartz wrote:

> > You won't get official throughput/latency numbers from Intel anymore
> > because there are so many possible cases. However, I have heard from
> > Intel sources that LOCK CMPXCHG will be 40% less expensive (in terms
> > of clock cycles) on Core i7 than it is on Core 2. The same source says
> > that LOCK CMPXCHG is 65% less expensive on Core 2 than it is on
> > late-model P4s.

> I've written a little program that repeatedly does a LOCK CMPXCHG of
> the same value (zero exchanged for zero). On a friend's Core i7,
> each LOCK CMPXCHG takes about 22 cycles; on my QX6850 (3 GHz, 65 nm
> quad-core) it takes about 25 cycles. Not a big difference.

That's about 12%, which is a far cry from the claimed 40%.

However, testing in a tight loop is not always fair. For one thing,
you're probably only testing the case where the cache line is already
in the L1 cache. That may or may not be the case you care most about.
For another thing, you may not be testing the effect of the locked
instruction on the pipeline status.

It's also possible that your test is accurate.
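
(For reference, the kind of tight-loop test being discussed looks
roughly like this -- a sketch using GCC's __sync builtins and RDTSC,
not Elcaro's actual program; the iteration count is arbitrary:)

#include <stdint.h>
#include <stdio.h>

static volatile long x = 0;

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main() {
    const long N = 10000000;
    uint64_t t0 = rdtsc();
    for (long i = 0; i < N; ++i)
        (void)__sync_val_compare_and_swap(&x, 0L, 0L); // LOCK CMPXCHG, 0 for 0
    uint64_t t1 = rdtsc();
    // Loop overhead is included in the figure; the line stays hot in L1.
    printf("%.1f cycles per iteration\n", (double)(t1 - t0) / (double)N);
    return 0;
}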

DS


rani_s...@hotmail.com

Jan 2, 2009, 11:49:39 AM

I heard, mainly from the STM camp, that the biggest problem with these
CPU locked operations is that they scale poorly across many cores due to
memory synchronization (and not CPU performance). Maybe Intel improved
this, and so there is potential for a big gain in high-contention
scenarios.

Rani

David Schwartz

Jan 2, 2009, 11:11:38 PM
On Jan 2, 8:49 am, rani_shar...@hotmail.com wrote:

> I heard, mainly from the STM camp, that the biggest problem with these
> CPU locked operations is that they scale poorly across many cores due to
> memory synchronization (and not CPU performance). Maybe Intel improved
> this, and so there is potential for a big gain in high-contention
> scenarios.

Ultimately, I think the solution is simply not to have high contention.
The idea is to have the tasks done by your cores be as logically
independent as possible. Focus on reducing the amount of contention
rather than on making the contention do less harm.

This is largely why I have argued that lock-free approaches are
movements in the wrong direction. Use locks intelligently --
so that threads that contend with each other will not run concurrently
unless there is no other way to keep a core busy.

DS

rani_s...@hotmail.com

Jan 2, 2009, 11:59:58 PM

I think that one of the challenges for hardware vendors is to
transparently improve the current state of things, for which better
scalability of such atomic operations is more than welcome (e.g. even
making state-of-the-art lock-free algorithms freer).

Rani


Chris M. Thomasson

Jan 4, 2009, 10:13:58 AM
"Elcaro Nosille" <Elcaro...@akapost.com> wrote in message
news:gjqj2b$jsa$1...@hoshi.visyn.net...
> David Schwartz wrote:

>
>> This is largely why I have argued that lock-free approaches are
>> movements in the wrong direction. Use locks intelligently --
>> so that threads that contend with each other will not run concurrently
>> unless there is no other way to keep a core busy.
>
> 100% ACK! When you deal intelligently with your locks, you can in most
> cases be more efficient than with lock-free programming.

Be sure to factor ___intelligently___ designed non-blocking algorithms into
your calculus.

Chris M. Thomasson

Jan 4, 2009, 10:16:17 AM
"Elcaro Nosille" <Elcaro...@akapost.com> wrote in message
news:gjqj2b$jsa$1...@hoshi.visyn.net...
> David Schwartz wrote:
>
>> This is largely why I have argued that lock-free approaches are
>> movements in the wrong direction. Use locks intelligently --
>> so that threads that contend with each other will not run concurrently
>> unless there is no other way to keep a core busy.
>
> 100% ACK! When you deal intelligently with your locks, you can in most
> cases be more efficient than with lock-free programming.

Sometimes, it's almost "impossible" for lock-based algorithms to "fairly"
compete with certain non-blocking algorithms.

David Schwartz

Jan 4, 2009, 10:42:29 AM
On Jan 4, 7:13 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > 100% ACK! When you deal intelligently with your locks, you can in most
> > cases be more efficient than with lock-free programming.

> Be sure to factor ___intelligently___ designed non-blocking algorithms into
> your calculus.

The problem is not so much the design as understanding the
characteristics and intelligently selecting the right algorithm.

For example, if the system has nothing else to do, a lock-free
algorithm will beat a lock-based algorithm every time. However, if
there are uncontending threads, a lock-based algorithm will beat a
lock-free algorithm every time.

DS

Chris M. Thomasson

Jan 4, 2009, 10:30:11 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:4a2a0bc6-88db-4304...@y1g2000pra.googlegroups.com...

On Jan 4, 7:13 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > > 100% ACK! When you deal intelligently with your locks, you can in most
> > > cases be more efficient than with lock-free programming.

> > Be sure to factor ___intelligently___ designed non-blocking algorithms
> > into
> > your calculus.

> The problem is not so much the design as understanding the
> characteristics and intelligently selecting the right algorithm.

Fair enough indeed.


> For example, if the system has nothing else to do, a lock-free
> algorithm will beat a lock-based algorithm every time.

Humm, well, I know that certain lock-free algorithms (e.g., a multiple-
producer/consumer unbounded queue) are so infested with expensive
operations that they provide absolutely no benefit over an uncontended
lock-based queue.


Here is a simple rule that I generally try to follow:


1. If a lock-free algorithm has more than 1 atomic RMW and 1 membar on a
per-operation basis, it will give no clear advantage over a clever locking
scheme that provides mostly uncontended access to critical sections.


> However, if
> there are uncontending threads, a lock-based algorithm will beat a
> lock-free algorithm every time.

Are you 100% sure about that? Take a simple thread-safe LIFO data-structure
with the following basic interface:


struct lifo {
    void push(void*);
    void* pop();
};


An uncontended lock-based version of the push and pop operations will
usually have 2 atomic RMWs and 2 membars each. Therefore, it will usually
take 4 atomic RMWs and 4 membars to complete a push-to-pop transition.

The non-blocking version will use 1 membar and 1 atomic RMW per
operation. Therefore, it will take 2 atomic RMWs and 2 membars to
complete a push-to-pop transition.

AFAICT, the non-blocking version still has an "advantage" over the
uncontended lock-based algorithm. It takes fewer synchronization
instructions to get the job done on a per-operation basis.
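
(To make the instruction counting concrete: a sketch of such a CAS-based
LIFO -- a Treiber-style stack using GCC's __sync builtins. The naive pop
shown is ABA-prone and is included only to count the synchronization
operations; note that each operation uses exactly one atomic RMW, which
on x86 also acts as the membar:)

struct node {
    node* next;
    void* data;
};

struct lifo {
    node* head; // starts as 0; needs volatile/atomic reads in real code

    void push(node* n) {
        node* h;
        do {
            h = head;            // read the current top
            n->next = h;         // link the new node above it
        } while (!__sync_bool_compare_and_swap(&head, h, n)); // one atomic RMW
    }

    node* pop() {                // ABA-prone as written; counting only
        node* h;
        do {
            h = head;
            if (!h) return 0;    // empty
        } while (!__sync_bool_compare_and_swap(&head, h, h->next)); // one RMW
        return h;
    }
};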


Also, there are wait-free algorithms that do not need any atomics or
membars for any of their operations... AFAICT, locks generally cannot
compete with these types of non-blocking algorithms...

David Schwartz

Jan 5, 2009, 11:44:11 AM
On Jan 4, 7:30 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > However, if
> > there are uncontending threads, a lock-based algorithm will beat a
> > lock-free algorithm every time.

> Are you 100% sure about that? Take a simple thread-safe LIFO data-structure
> with the following basic interface:

Yes.

> struct lifo {
>   void push(void*);
>   void* pop();
> };

> An uncontended lock-based version of the push and pop operations will
> usually have 2 atomic RMWs and 2 membars each. Therefore, it will usually
> take 4 atomic RMWs and 4 membars to complete a push-to-pop transition.
>
> The non-blocking version will use 1 membar and 1 atomic RMW per
> operation. Therefore, it will take 2 atomic RMWs and 2 membars to
> complete a push-to-pop transition.
>
> AFAICT, the non-blocking version still has an "advantage" over the
> uncontended lock-based algorithm. It takes fewer synchronization
> instructions to get the job done on a per-operation basis.

Except the non-blocking version will have contending threads, so each
of these operations will be cache misses. The blocking version will
not have contending threads running, so the data will stay hot in the
cache.

> Also, there are wait-free algorithms that do not need any atomics or
> membars for any of their operations... AFAICT, locks generally cannot
> compete with these types of non-blocking algorithms...

Yes, they can, because locks avoid contention. (Again, assuming there
are threads that won't contend that can be scheduled.)

It is commonly thought, incorrectly, that locks provide a tiny window of
exclusion to keep other threads out; we imagine them doing their job when
contending threads each 'pause' for a split second to let the other
threads through. But this is bad even if the threads never block on the
lock, since they're fighting over the same cache lines as they run
similar code manipulating the same objects.

Locks actually work primarily by de-scheduling threads that tend to
manipulate the same data. By finding the set of running threads that
contend the least, locks do their job mostly by changing which threads
run concurrently.

Forgetting this can lead to optimizing completely the wrong thing: you
get an algorithm that works great when there is nothing else to do, but
otherwise maximizes FSB/memory bandwidth consumption and slows the
system to a crawl.

DS

rani_s...@hotmail.com

Jan 5, 2009, 12:43:17 PM
On Jan 5, 8:44 am, David Schwartz <dav...@webmaster.com> wrote:
> Locks actually work primarily by de-scheduling threads that tend to
> manipulate the same data. By finding the set of running threads that
> contend the least, locks do their job mostly by changing which threads
> run concurrently.

Not sure whether this can scale to many cores, where free
processing power is available for the de-scheduled threads anyway.

From what I heard, avoiding shared state is best for scaling (though
sequentially slower), and STM is meant to be a transitional technique
until we get there.

Fortunately, only a few of us are responsible for super-hot locks (e.g.
the heap or the scheduler), so performance loss due to high contention is
not an issue.

Rani

David Schwartz

Jan 5, 2009, 12:46:00 PM
On Jan 5, 9:43 am, rani_shar...@hotmail.com wrote:
> On Jan 5, 8:44 am, David Schwartz <dav...@webmaster.com> wrote:
>
> > Locks actually work primarily by de-scheduling threads that tend to
> > manipulate the same data. By finding the set of running threads that
> > contend the least, locks do their job mostly by changing which threads
> > run concurrently.

> Not sure whether this can scale to many cores, where free
> processing power is available for the de-scheduled threads anyway.

It cannot. That's why lock-free algorithms tend to work best in
operating systems and threading libraries. There isn't going to be a
thread that doesn't need the scheduler or the VM subsystem.

> Fortunately, only a few of us are responsible for super-hot locks (e.g.
> the heap or the scheduler), so performance loss due to high contention
> is not an issue.

Exactly. But botching your design so that heavily-contending threads
suffer in parallel, because some toy benchmark said the lock-free
design was faster, is an issue.

DS

rani_s...@hotmail.com

Jan 5, 2009, 8:10:47 PM
On Jan 5, 9:46 am, David Schwartz <dav...@webmaster.com> wrote:
> > > Locks actually work primarily by de-scheduling threads that tend to
> > > manipulate the same data. By finding the set of running threads that
> > > contend the least, locks do their job mostly by changing which threads
> > > run concurrently.
> > Not sure whether this can scale to many cores, where free
> > processing power is available for the de-scheduled threads anyway.
>
> It cannot. That's why lock-free algorithms tend to work best in
> operating systems and threading libraries. There isn't going to be a
> thread that doesn't need the scheduler or the VM subsystem.

My original point, after listening to too much STM marketing hype,
was that those lock-free algorithms that rely on atomic CMPXCHG-like
operations and were designed for optimal many-core scalability in
fact scale poorly due to the HW implementation, so they are not much
better than lock-based algorithms.
Maybe Intel did something to address this issue as well, hence their
claims.

Another issue that I had in mind is the widespread COM-style (Windows)
ref-counting that uses atomic operations, which probably also hurts
performance on many cores due to memory-coherency guarantees (i.e.
an interlocked add is also a full memory barrier).

Rani

David Schwartz

Jan 5, 2009, 8:47:03 PM
On Jan 5, 5:10 pm, rani_shar...@hotmail.com wrote:

> Another issue that I had in mind is the widespread COM-style (Windows)
> ref-counting that uses atomic operations, which probably also hurts
> performance on many cores due to memory-coherency guarantees (i.e.
> an interlocked add is also a full memory barrier).
>
> Rani

I agree. Most of that is really silly since it's almost always
immediately associated with an operation that is protected by a lock
anyway. For example, "find an object and increment its reference
count" or "reduce the reference count on an object and stop working on
it". It seems silly to make the reference count an expensive atomic
operation when you're working on other shared data that's protected by
a lock anyway.
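
(A sketch of the pattern being described, assuming pthreads;
`table_lookup' and `destroy' are hypothetical helpers. The count can be
a plain integer because every mutation already happens under the table's
lock:)

#include <pthread.h>

struct object {
    long refs;   // plain integer: only ever touched under table_lock
    /* ... payload ... */
};

object* table_lookup(const char* key);   // hypothetical helpers
void destroy(object* o);

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

object* find_and_ref(const char* key) {  // "find and increment its count"
    pthread_mutex_lock(&table_lock);
    object* o = table_lookup(key);
    if (o) ++o->refs;                    // no atomic needed: lock is held
    pthread_mutex_unlock(&table_lock);
    return o;
}

void release(object* o) {                // "drop the count, stop using it"
    pthread_mutex_lock(&table_lock);
    bool dead = (--o->refs == 0);
    pthread_mutex_unlock(&table_lock);
    if (dead) destroy(o);
}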

DS

rani_s...@hotmail.com

Jan 6, 2009, 1:29:05 AM
On Jan 5, 5:47 pm, David Schwartz <dav...@webmaster.com> wrote:
> On Jan 5, 5:10 pm, rani_shar...@hotmail.com wrote:
> > Another issue that I had in mind is the widespread COM-style (Windows)
> > ref-counting that uses atomic operations, which probably also hurts
> > performance on many cores due to memory-coherency guarantees (i.e.
> > an interlocked add is also a full memory barrier).
>
> I agree. Most of that is really silly since it's almost always
> immediately associated with an operation that is protected by a lock
> anyway. For example, "find an object and increment its reference
> count" or "reduce the reference count on an object and stop working on
> it". It seems silly to make the reference count an expensive atomic
> operation when you're working on other shared data that's protected by
> a lock anyway.

That's true. Naively speaking, the memory barrier associated with the
atomic increment/decrement is essential for visibility of the fully
constructed object when it is used by a non-constructing thread
(e.g. created in one thread and destroyed in another). OTOH, there is
some synchronization between the constructing thread and the other
threads in order to make the object visible in the first place. This
synchronization (e.g. using condition variables, events, completion
queues, or the creation of new threads) already guarantees full
visibility (just as with non-ref-counted objects).

Besides convenient abstract lifetime management and reference
semantics (i.e. no copying), we use ref-counting and immutable state to
minimize contention, typically by acquiring the related lock only to
take a reference. This technique is great for simplifying the code
(e.g. compared with using obscure flags to manage lifetime), but due to
the poor HW implementation, mainly on multi-core machines, we are
noticing a counter-intuitive performance hit in which the code spends
a lot of time on atomic increments/decrements.

Rani

Chris M. Thomasson

Jan 6, 2009, 6:14:44 AM
<rani_s...@hotmail.com> wrote in message
news:3047a48f-185a-4441...@t26g2000prh.googlegroups.com...

On Jan 5, 9:46 am, David Schwartz <dav...@webmaster.com> wrote:
> > > > Locks actually work primarily by de-scheduling threads that tend to
> > > > manipulate the same data. By finding the set of running threads that
> > > > contend the least, locks do their job mostly by changing which
> > > > threads run concurrently.
> > > Not sure whether this can scale to many cores, where free
> > > processing power is available for the de-scheduled threads anyway.
> >
> > It cannot. That's why lock-free algorithms tend to work best in
> > operating systems and threading libraries. There isn't going to be a
> > thread that doesn't need the scheduler or the VM subsystem.

> My original point, after listening to too much STM marketing hype,
> was that those lock-free algorithms that rely on atomic CMPXCHG-like
> operations and were designed for optimal many-core scalability in
> fact scale poorly due to the HW implementation, so they are not much
> better than lock-based algorithms.

IMVHO, atomic/membar operations should be avoided, which of course includes
the ones in the mutex implementation; the less locking a thread needs to do,
the better...


> Maybe Intel did something to address this issue as well, hence their
> claims.


> Another issue that I had in mind is the widespread COM-style (Windows)
> ref-counting that uses atomic operations, which probably also hurts
> performance on many cores due to memory-coherency guarantees (i.e.
> an interlocked add is also a full memory barrier).

Indeed. IMHO, you don't want any threads to frequently execute atomic RMW
and/or membar instructions. This includes accessing uncontended locks,
which usually consist of the exact same expensive synchronization
operations on the fast-path anyway.

FWIW, there are non-blocking algorithms that attempt to heavily amortize
the use of atomics and membars. In fact, there are some algorithms that
are completely free of expensive operations. One quick example: a single-
producer/consumer unbounded queue. The push operation of the queue can be
implemented with a handful of `MOV' instructions, no branching... The pop
operation has a `CMP' and a handful of `MOV' instructions; a single
branch... No `LOCK' operations, no `MFENCE'. It's extremely efficient. I,
and others, can use the queue to create scalable, distributed wait-free
message-passing networks for high-performance intra-process concurrent
communication schemes. This system needs to scale up to very many
processors. Luckily, there are no expensive operations on the fast-paths!
This design works well. I can discuss some more if you want, anyway...
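
(A sketch of the single-producer/consumer idea under x86/TSO
store-ordering assumptions; illustrative only, not Thomasson's actual
queue. Nodes are allocated with plain `new' for brevity -- a real version
caches nodes, which is where the pure-MOV push comes from -- and compiler
reordering is ignored, so real code needs compiler barriers or volatile:)

struct qnode {
    qnode* next;
    void* data;
};

struct spsc_queue {
    qnode* head;   // written only by the producer
    qnode* tail;   // written only by the consumer

    spsc_queue() {
        qnode* stub = new qnode(); // dummy node; queue starts empty
        stub->next = 0;
        head = tail = stub;
    }

    void push(void* d) {           // producer only: plain stores
        qnode* n = new qnode();
        n->next = 0;
        n->data = d;
        head->next = n;            // publish; TSO keeps the stores in order
        head = n;
    }

    void* pop() {                  // consumer only: one compare, a few moves
        qnode* next = tail->next;
        if (!next) return 0;       // empty
        void* d = next->data;
        delete tail;               // retire the old dummy
        tail = next;
        return d;
    }
};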

I really don't think I could create a message-passing system with
similar performance and scalability characteristics using a 100%*
lock-based synchronization scheme. First of all, it would not be
wait-free anymore, which is not all that friendly to real-time systems.
It would hit scalability walls because the act of acquiring and releasing
a lock usually involves 2 atomics and 2 membars; btw, one of those
membars is a really expensive `#StoreLoad | #StoreStore'...


* BTW, I don't have anything against locks. I would suggest making use of
both blocking _and_ non-blocking algorithms. They can be used quite
effectively together. For instance, one can use locking schemes to handle
the slow-path of a non-blocking algorithm. One can use locks to add
conditional waits (e.g., a lock-based eventcount) to non-blocking
algorithms. AFAICT, locks can heavily enrich the overall usefulness of
non-blocking algorithms.

Chris M. Thomasson

Jan 6, 2009, 6:23:18 AM
<rani_s...@hotmail.com> wrote in message
news:64c5ad7a-ee39-4ced...@w1g2000prm.googlegroups.com...

Yup. Naive non-distributed reference counting schemes, such as COM, have
problems scaling. It does not matter if one uses locks or atomics to
mutate the counter; they both have the same problems. Frequent lock-based
or lock-free counter mutations use expensive operations; period.

Well, there are reference counting schemes that can scale __very__ well
indeed, but that is another story...

rani_s...@hotmail.com

Jan 6, 2009, 11:36:06 AM
On Jan 6, 3:23 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> <rani_shar...@hotmail.com> wrote in message

>> On Jan 5, 5:47 pm, David Schwartz <dav...@webmaster.com> wrote:
> > > It seems silly to make the reference count an expensive atomic
> > > operation when you're working on other shared data that's protected by
> > > a lock anyway.
[...]

> > This technique is great for simplifying the code
> > (e.g. compared with using obscure flags to manage lifetime), but due to
> > the poor HW implementation, mainly on multi-core machines, we are
> > noticing a counter-intuitive performance hit in which the code spends
> > a lot of time on atomic increments/decrements.
>
> Yup. Naive non-distributed reference counting schemes, such as COM, have
> problems scaling. It does not matter if one uses locks or atomics to
> mutate the counter; they both have the same problems. Frequent lock-based
> or lock-free counter mutations use expensive operations; period.

Is this (COM-style thread-safe ref-counting) a conceptual scaling
problem or merely a result of poor HW support?
It seems that, given a lightweight atomic operation without full-
barrier semantics, such a ref-counting scheme should have optimal
many-core scaling, since contention is very rare (e.g. IA64's
InterlockedDecrementAcquire?).

> Well, there are reference counting schemes that can scale __very__ well
> indeed, but that is another story...

I'm not aware of other lifetime-management schemes, besides GC, that
provide both functionality and *usability* like COM-style ref-
counting. I don't think that the huge success of such ref-counting was
merely an oversight.

Rani

rani_s...@hotmail.com

Jan 6, 2009, 11:44:33 AM
On Jan 6, 3:14 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> FWIW, there are non-blocking algorithms that attempt to heavily amortize
> the use of atomics and membars. In fact, there are some algorithms that
> are completely free of expensive operations. One quick example: a single-
> producer/consumer unbounded queue. The push operation of the queue can be
> implemented with a handful of `MOV' instructions, no branching... The pop
> operation has a `CMP' and a handful of `MOV' instructions; a single
> branch... No `LOCK' operations, no `MFENCE'. It's extremely efficient.

How does such an interlock-free technique guarantee memory visibility?
Is this just a property of the conservative x86 arch?

Let's say that thread A creates an object and places it in the queue,
and thread B pops it. Without any memory barrier, it seems that
thread B might see a partially constructed object...

Rani

Dmitriy V'jukov

Jan 7, 2009, 6:48:44 AM
On Jan 6, 04:10, rani_shar...@hotmail.com wrote:
> On Jan 5, 9:46 am, David Schwartz <dav...@webmaster.com> wrote:
>
> > > > Locks actually work primarily by de-scheduling threads that tend to
> > > > manipulate the same data. By finding the set of running threads that
> > > > contend the least, locks do their job mostly by changing which threads
> > > > run concurrently.
> > > Not sure whether this can scale to many cores, where free
> > > processing power is available for the de-scheduled threads anyway.
>
> > It cannot. That's why lock-free algorithms tend to work best in
> > operating systems and threading libraries. There isn't going to be a
> > thread that doesn't need the scheduler or the VM subsystem.
>
> My original point, after listening to too much STM marketing hype,
> was that those lock-free algorithms that rely on atomic CMPXCHG-like
> operations and were designed for optimal many-core scalability in
> fact scale poorly due to the HW implementation, so they are not much
> better than lock-based algorithms.

Exactly.
But STM is much worse than both lock-based and lock-free from this
point of view, because most current STM designs tend to execute tons
of atomic RMW operations on extremely contended data (internal STM
structures). This way STM makes otherwise good, distributed
applications centralized and non-scalable. However, hardware support
will help.


> Maybe Intel did something to address this issue as well, hence their
> claims.

Probably. But I believe it's still hundreds of cycles.


> Another issue that I had in mind is the widely spread COM (windows)
> like ref-counting that uses atomic operations that probably also hurt
> performance on many cores due to memory coherency guarantees (i.e.
> interlocked add is also full memory barrier).

What I was asking Intel for is fine-grained (relaxed, release,
acquire) atomic operations:
http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures/topic/60984/

This way applications will be able to transparently benefit from those
operations once the system libraries are optimized.

--
Dmitriy V'jukov

Dmitriy V'jukov

Jan 7, 2009, 7:02:17 AM
On Jan 6, 09:29, rani_shar...@hotmail.com wrote:

> Besides convenient abstract lifetime management and reference
> semantics (i.e. no copying), we use ref-counting and immutable state to
> minimize contention, typically by acquiring the related lock only to
> take a reference. This technique is great for simplifying the code
> (e.g. compared with using obscure flags to manage lifetime), but due to
> the poor HW implementation, mainly on multi-core machines, we are
> noticing a counter-intuitive performance hit in which the code spends
> a lot of time on atomic increments/decrements.


Especially for such situations I've designed my distributed reference-
counting algo:
http://groups.google.ru/group/lock-free/browse_frm/thread/46d00a0f543a758e
http://groups.google.ru/group/comp.programming.threads/browse_frm/thread/dab22a17c32c6b13

Every acquire/release operation is no more than a few plain instructions
on local data. Periodically, threads synchronize their local state, but
this is a relatively rare batch operation. This algo requires minimal
cooperation from worker threads.
There are also a couple of improvements to this algo, but they are
currently unpublished.
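
(A crude sketch of the general distributed-counting idea only --
per-thread deltas folded into a shared count in rare batches -- and NOT
the algorithm from the links above; the hard part, detecting when the
true count reaches zero, is exactly what those posts address and is
omitted here. `__thread' is GCC thread-local storage:)

static long global_count = 0;

static __thread long local_delta = 0; // per-thread; pad against false sharing

void acquire_ref() { ++local_delta; } // a plain instruction on local data
void release_ref() { --local_delta; }

void flush_local() {                  // rare, batched synchronization
    long d = local_delta;
    local_delta = 0;
    __sync_fetch_and_add(&global_count, d); // the only atomic RMW
}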

Btw, please describe in more detail what kind of performance hit you
have experienced. What percentage of time was associated with atomic
increments/decrements? On what hardware were you testing? etc.

--
Dmitriy V'jukov

Dmitriy V'jukov

Jan 7, 2009, 7:42:25 AM


Ok. But can you estimate the percentage of systems which have
sufficiently uncontending threads? Is it 99%, 90%, 50%, 10%, 1%?
If it's 99%, then, well, non-lock-based synchronization can be thrown
out. But if it's 1%, then your initial statement is basically
senseless...
From my experience there are always some centralized (contended,
accessed-by-all-threads) resources (memory management, object lifetime
management, logging, settings, database caches, queues, the scheduler,
etc). So if I apply your scheme, I have to run threads one by one...
Non-centralized resources sometimes don't need synchronization at all.
Consider a 'request' object: it is accessed only by the thread which
processes it.
And the access pattern to non-centralized resources can be chaotic.
Consider the following example. A banking system processes requests to
change a balance, query a balance, or transfer a balance from one
account to another. Account numbers are basically random; there is
neither temporal nor spatial locality. So de-scheduling threads based on
conflicts when accessing account objects is senseless here - you will
not prevent further conflicts...


--
Dmitriy V'jukov

rani_s...@hotmail.com

Jan 7, 2009, 1:08:11 PM
> > Another issue that I had in mind is the widespread COM-style (Windows)
> > ref-counting that uses atomic operations, which probably also hurts
> > performance on many cores due to memory-coherency guarantees (i.e.
> > an interlocked add is also a full memory barrier).
>
> What I was asking Intel for is fine-grained (relaxed, release,
> acquire) atomic operations:http://software.intel.com/en-us/forums/threading-on-intel-parallel-ar...

I was reading Intel's x32/x64 manual and wondered whether read-modify-
write operations are not atomic anyway with respect to their operand.
For example:
1) thread A - INC X ; i.e. X = X + 1
2) thread B - INC X
3) threads A & B exited

On x86, are there cases in which X != 2 (say X was zero) because of
cache inconsistency?

If not, then it seems easy to implement thread-safe ref-counting
using non-locking instructions (except for the final decrement):
AddRef - simple increment
Release - decrement; if zero, then atomically exchange to some value
and destroy if that succeeded.

Rani

Chris M. Thomasson

Jan 7, 2009, 1:13:14 PM
<rani_s...@hotmail.com> wrote in message
news:b1e17b63-f404-4912...@g3g2000pre.googlegroups.com...

Perhaps I totally misunderstand what you mean... However, multiple
threads cannot concurrently execute an atomic RMW operation on the same
location in memory without a LOCK on x86. There will be race conditions
and the actual result of the reference counting will be screwed up;
corrupt the count, and then you're screwed...

rani_s...@hotmail.com

Jan 7, 2009, 1:51:22 PM
> > I was reading Intel's x32/x64 manual and wondered whether read-modify-
> > write operations are not atomic anyway with respect to their operand.
> > For example:
> > 1) thread A - INC X ; i.e. X = X + 1
> > 2) thread B - INC X
> > 3) threads A & B exited
>
> > On x86, are there cases in which X != 2 (say X was zero) because of
> > cache inconsistency?
>
> > If not, then it seems easy to implement thread-safe ref-counting
> > using non-locking instructions (except for the final decrement):
> > AddRef - simple increment
> > Release - decrement; if zero, then atomically exchange to some value
> > and destroy if that succeeded.
>
> Perhaps I totally misunderstand what you mean... However, multiple
> threads cannot concurrently execute an atomic RMW operation on the same
> location in memory without a LOCK on x86. There will be race conditions
> and the actual result of the reference counting will be screwed up;
> corrupt the count, and then you're screwed...

From the x86 HW POV, how can the counter get corrupted (e.g. per my
example)?
x86 seems to be conservative enough to prevent that.

Rani

Anthony Williams

Jan 7, 2009, 3:13:29 PM
rani_s...@hotmail.com writes:

As I understand it, without the LOCK prefix, the read and write parts of
an RMW operation are entirely separate on x86. Consequently, two INCs on
separate CPUs can read the same value and write the same value back
(e.g. both read 1, both write 2).

Anthony
--
Anthony Williams
Author of C++ Concurrency in Action | http://www.manning.com/williams
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Just Software Solutions Ltd, Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK

rani_s...@hotmail.com

Jan 7, 2009, 3:25:27 PM
> > From the x86 HW POV, how can the counter get corrupted (e.g. per my
> > example)?
> > x86 seems to be conservative enough to prevent that.
>
> As I understand it, without the LOCK prefix, the read and write parts of
> an RMW operation are entirely separate on x86. Consequently, two INCs on
> separate CPUs can read the same value and write the same value back
> (e.g. both read 1, both write 2).

The theoretical thing that I wrote about the non-locked Release is bogus
regardless of the HW.
I just checked, on my 8-core machine, the single-op RMW `inc [mem]', and
it's indeed *not* atomic.
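
(A test along these lines -- a sketch with pthreads and GCC inline asm,
not necessarily the exact program -- makes the lost updates visible:)

#include <pthread.h>
#include <stdio.h>

static volatile int x = 0;

static void* bump(void*) {
    for (int i = 0; i < 1000000; ++i)
        __asm__ __volatile__("incl %0" : "+m"(x)); // no LOCK prefix
    return 0;
}

int main() {
    pthread_t a, b;
    pthread_create(&a, 0, bump, 0);
    pthread_create(&b, 0, bump, 0);
    pthread_join(a, 0);
    pthread_join(b, 0);
    printf("%d (expected 2000000)\n", x); // typically falls short
    return 0;
}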

OTOH, x86 locked operations still seem too expensive for ref-counting:
besides being atomic, they also serialize other memory operations
(being a full barrier).

Thanks,
Rani

rani_s...@hotmail.com

Jan 8, 2009, 3:07:25 AM
On Jan 7, 12:25 pm, rani_shar...@hotmail.com wrote:
> OTOH, x86 locked operations still seem too expensive for ref-counting:
> besides being atomic, they also serialize other memory operations
> (being a full barrier).

Sorry for whining too much. x86 has such a key role in the grand scheme
of things that maybe something will prevail.

Rani

Chris M. Thomasson

Jan 8, 2009, 6:08:06 AM
<rani_s...@hotmail.com> wrote in message
news:c9ddb7fc-f469-4e3d...@l33g2000pri.googlegroups.com...

You can get so-called "naked" atomic operations on the SPARC; no implicit
memory barrier. The ISA clearly separates atomic instructions from membar
instructions. You get to place membars exactly where you need them
because nothing is implied in RMO mode. AFAICT, it's way more
fine-grained than Intel's design...

Chris M. Thomasson

Jan 8, 2009, 6:14:23 AM
<rani_s...@hotmail.com> wrote in message
news:c14986b8-d2f5-4561...@u18g2000pro.googlegroups.com...

On Jan 6, 3:14 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > FWIW, there are non-blocking algorithms that attempt to heavily
> > amortize the use of atomics and membars. In fact, there are some
> > algorithms that are completely free of expensive operations. One quick
> > example: a single-producer/consumer unbounded queue. The push operation
> > of the queue can be implemented with a handful of `MOV' instructions,
> > no branching... The pop operation has a `CMP' and a handful of `MOV'
> > instructions; a single branch... No `LOCK' operations, no `MFENCE'.
> > It's extremely efficient.

> How does such an interlock-free technique guarantee memory visibility?
> Is this just a property of the conservative x86 arch?

http://www.intel.com/products/processor/manuals/318147.pdf


> Let's say that thread A creates an object and places it in the queue,
> and thread B pops it. Without any memory barrier, it seems that
> thread B might see a partially constructed object...

The x86 has more than enough implied membars to get the job done. The
implementation on a TSO SPARC is basically identical. However, on an RMO
SPARC, I need an explicit `#StoreStore' barrier in the producer before
making a new item "fully" visible to the queue's internal data
structures. The consumer can still stay barrier-free in RMO mode because
it relies only on the arch performing an implicit data-dependent load
barrier trailing every atomic load. So far, the only arch that does not
do this is the Alpha.
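
(Concretely, against the queue push sketched earlier in the thread;
`alloc_node()' and `membar_store_store()' are hypothetical wrappers, the
latter around SPARC's `membar #StoreStore':)

void push(void* d) {         // producer, RMO SPARC variant
    qnode* n = alloc_node(); // hypothetical cached-node allocator
    n->next = 0;
    n->data = d;
    membar_store_store();    // `membar #StoreStore': init before publish
    head->next = n;          // now safe to make the node visible
    head = n;
}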

Chris M. Thomasson

Jan 8, 2009, 6:29:13 AM
<rani_s...@hotmail.com> wrote in message
news:ad9cbe00-9929-4983...@s9g2000prm.googlegroups.com...

On Jan 6, 3:23 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > <rani_shar...@hotmail.com> wrote in message
> >> On Jan 5, 5:47 pm, David Schwartz <dav...@webmaster.com> wrote:
> > > > It seems silly to make the reference count an expensive atomic
> > > > operation when you're working on other shared data that's protected
> > > > by
> > > > a lock anyway.
[...]
> > This technique is great for simplifying the code
> > (e.g. compared with using obscure flags to manage lifetime), but due to
> > the poor HW implementation, mainly on multi-core machines, we are
> > noticing a counter-intuitive performance hit in which the code spends
> > a lot of time on atomic increments/decrements.
> >
> > Yup. Naive non-distributed reference counting schemes, such as COM,
> > have problems scaling. It does not matter if one uses locks or atomics
> > to mutate the counter; they both have the same problems. Frequent
> > lock-based or lock-free counter mutations use expensive operations;
> > period.

> Is this (COM-style thread-safe ref-counting) a conceptual scaling
> problem or merely a result of poor HW support?

Threads that frequently mutate the reference counts of a plurality of COM
objects will generate a lot of overhead in the form of thrashing the
cache and FSB. This can have a negative impact on the whole application.
As for poor HW support, well, implementing atomic operations in hardware
is always going to incur overhead. It's probably best to try to avoid the
overhead altogether via various amortization techniques in software...


> It seems that, given a lightweight atomic operation without full-
> barrier semantics, such a ref-counting scheme should have optimal
> many-core scaling, since contention is very rare (e.g. IA64's
> InterlockedDecrementAcquire?).

Humm... Well, I would not consider `InterlockedDecrementAcquire' to be a
lightweight atomic operation. I would be in favor of something a bit more
fine-grained:

Interlocked<RMW>Acquire/Release/Relaxed


Also:

Interlocked<Load/Store>Acquire/Relaxed/Depends
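
(For comparison, a sketch of what such fine-grained operations later
became in C++11's std::atomic -- which post-dates this thread -- applied
to the ref-count case: a relaxed increment, and a release decrement
paired with an acquire fence before destruction:)

#include <atomic>

struct object {
    std::atomic<long> refs;
    object() : refs(1) {}
    /* ... payload ... */
};

void add_ref(object* o) {
    // No ordering needed: the caller already holds a valid reference.
    o->refs.fetch_add(1, std::memory_order_relaxed);
}

void release(object* o) {
    if (o->refs.fetch_sub(1, std::memory_order_release) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete o; // all prior accesses are ordered before destruction
    }
}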


> > Well, there are reference counting schemes that can scale __very__ well
> > indeed, but that is another story...

> I'm not aware of other lifetime-management schemes, besides GC, that
> provide both functionality and *usability* like COM-style ref-
> counting. I don't think that the huge success of such ref-counting was
> merely an oversight.

So be it.

rani_s...@hotmail.com

Jan 8, 2009, 11:30:00 AM
On Jan 8, 3:14 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> <rani_shar...@hotmail.com> wrote in message

> > How does such an interlock-free technique guarantee memory visibility?
> > Is this just a property of the conservative x86 arch?
>
> http://www.intel.com/products/processor/manuals/318147.pdf
>
> > Let's say that thread A creates an object and places it in the queue,
> > and thread B pops it. Without any memory barrier, it seems that
> > thread B might see a partially constructed object...
>
> The x86 has more than enough implied membars to get the job done. The
> implementation on a TSO SPARC is basically identical.

Given such implicit barriers, how do you know that such (x86)
interlock-free solutions have good many-core scalability? It seems
that such a conservative arch (i.e. with implicit barriers) should have
the same scalability characteristics for fully locked operations, which
are simply also atomic, and Intel is improving their perf anyway.

> However, on an RMO
> SPARC, I need an explicit `#StoreStore' barrier in the producer before
> making a new item "fully" visible to the queue's internal data
> structures. The consumer can still stay barrier-free in RMO mode because
> it relies only on the arch performing an implicit data-dependent load
> barrier trailing every atomic load. So far, the only arch that does not
> do this is the Alpha.

It seems that in general, when working with shared data, memory-
visibility constraints require some serialization (i.e. HW locking) that
will hurt scalability in high-contention scenarios (in which lock-free
often claims to be scalable). OTOH, avoiding high contention should make
the perf advantage of lock-free insignificant in most cases (AKA
premature optimization).

Thanks,
Rani

rani_s...@hotmail.com

Jan 8, 2009, 11:40:44 AM
On Jan 8, 3:29 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> <rani_shar...@hotmail.com> wrote in message
> > Is this (COM-style thread-safe ref-counting) a conceptual scaling
> > problem or merely a result of poor HW support?
>
> Threads that frequently mutate the reference counts of a plurality of COM
> objects will generate a lot of overhead in the form of thrashing the
> cache and FSB. This can have a negative impact on the whole application.
> As for poor HW support, well, implementing atomic operations in hardware
> is always going to incur overhead. It's probably best to try to avoid the
> overhead altogether via various amortization techniques in software...

AFAICT, the common use case for ref-counting is single-threaded (i.e.
create, use, release), and even for multi-threaded usage there is
typically low contention, so, IMHO, HW could provide some support to
avoid the high penalty in all cases (i.e. pay only when needed).

Rani

Chris M. Thomasson

Jan 8, 2009, 12:44:55 PM
<rani_s...@hotmail.com> wrote in message
news:53c664ad-33ab-421e...@w1g2000prm.googlegroups.com...

Well, I have seen 100% lock-based synchronization schemes take several
minutes to complete some simple test routines which artificially generate
very high contention and worst-case scenarios. The clever non-blocking
analog completed its tests in a couple of seconds. I have seen
non-blocking algorithms get hundreds of thousands of reads per second per
thread, where a read was an iteration of a shared linked list. The
lock-based analog got thousands of reads per second per thread. Another
example of how a non-blocking algorithm can scale well beyond "100%"
lock-based systems is RCU:


(please read...)


http://www.rdrop.com/users/paulmck/RCU/lockperf.2004.01.17a.pdf
(rcu vs locks)

http://www.rdrop.com/users/paulmck/RCU

http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce


How can locks possibly compete with the scalability that algorithms like
RCU provide wrt handling the synchronization of read-mostly shared
data structures? BTW, RCU can be implemented on Windows... ;^D


> OTOH, avoiding high contention
> should make the perf advantage of lock-free insignificant in most
> cases (AKA premature optimization).

Non-blocking algorithms have their place. By all means, try to use locks
first. However, if you're designing applications that simply demand high
scalability and hard-core throughput, well, there are solutions out there.

Also, there are some things that don't work with locks but can work well
with non-blocking algorithms. For instance, one can use them within
interrupt handlers. Also, locks don't tend to work well with real-time
systems. For hard-core real time, IMVHO, you want efficient wait-free
synchronization, like the single producer/consumer queue I mentioned:


(please read...)

http://groups.google.com/group/comp.sys.super/msg/642b6b608294e426

http://groups.google.com/group/comp.arch/msg/5dba6fee05aa07ea

The algorithm is loopless and all of its operations comprise a handful of
instructions; perfect for real-time. On the other hand, if contention is
hit with a lock, the contending threads are going to have to wait, which
is generally not a good idea in a real-time system...
