Re: Reconciling Garbage Collection with Deterministic Finalization


Sergey P. Derevyago

Mar 21, 2006, 7:12:20 AM
"Andrei Alexandrescu (See Website For Email)" wrote:
> Garbage collection is desirable because:
>
> (3) Can improve performance, particularly in multithreaded environments
>
Explain, please!

As it was stated by David Butenhof: "multithreading is defined by an
application design that ALLOWS FOR concurrent or simultaneous execution". And
if we're designing a C++ MT program, then:
1. The application must have several threads of control that work almost
independently. They use synchronized message queues to pass messages from one
thread to another and virtually NO SYNCHRONIZATION is supposed outside these
message queues!
2. Note please, this design is not about "lock-free programming". The point
is that different threads don't have to modify shared data outside the opaque
message queues. These threads really "ALLOW FOR concurrent or simultaneous
execution".
3. As a result, every thread has its own memory allocator that works (almost)
independently from the allocators of other threads. In particular, the "thread
A allocator" doesn't have to deallocate blocks of memory allocated by the
"thread B allocator" because the data between threads is copied via the
message queues.
4. That is, a data structure is serialized into a queue. The sequence of bytes
is passed to another thread. The bytes are deserialized and the structure
gets reconstructed in that thread.
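
As a hedged illustration of the design in points 1-4 (modern C++ for
brevity; the class and its names are illustrative, not from any real
library), the only shared, synchronized object is a queue of serialized
bytes:

#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// A synchronized queue of serialized messages: the only place where two
// threads ever touch the same data.
class msg_queue {
    std::mutex mtx_;
    std::condition_variable cv_;
    std::deque<std::vector<std::uint8_t>> msgs_;
public:
    void push(std::vector<std::uint8_t> bytes) {
        {
            std::lock_guard<std::mutex> lk(mtx_);
            msgs_.push_back(std::move(bytes));
        }
        cv_.notify_one();
    }
    std::vector<std::uint8_t> pop() {
        std::unique_lock<std::mutex> lk(mtx_);
        cv_.wait(lk, [this] { return !msgs_.empty(); });
        std::vector<std::uint8_t> bytes = std::move(msgs_.front());
        msgs_.pop_front();
        return bytes;  // the receiver deserializes with its own allocator
    }
};

The receiving thread reconstructs the structure from the bytes using its
own thread-bound allocator, so no memory block ever has to be freed by a
foreign thread.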

Now your GC/C++:
1. Doesn't have the guarantee that allocated memory blocks are always
deallocated in the same thread. So it must implement some kind of "stop the
world" algorithm.
2. "Stop the world" algorithms definitely kill the concurrency. In
particular, they have to cross the thread bounds so they don't "ALLOW FOR
concurrent or simultaneous execution".

The bottom line is clear: GC/C++ _degrades_ the performance of well-designed
C++ MT programs!
--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

kanze

Mar 21, 2006, 9:10:51 PM
Sergey P. Derevyago wrote:
> "Andrei Alexandrescu (See Website For Email)" wrote:
> > Garbage collection is desirable because:

> > (3) Can improve performance, particularly in multithreaded environments

> Explain, please!

> As it was stated by David Butenhof: "multithreading is
> defined by an application design that ALLOWS FOR concurrent or
> simultaneous execution". And if we're designing a C++ MT
> program, then:

> 1. The application must have several threads of control
> that work almost independently. They use synchronized message
> queues to pass messages from one thread to another and
> virtually NO SYNCHRONIZATION is supposed outside these message
> queues!

This depends on the application. Other applications use highly
synchronized threads to parallelize the work.

> 2. Note please, this design is not about "lock-free
> programming". The point is that different threads don't have
> to modify shared data outside the opaque message queues. These
> threads really "ALLOW FOR concurrent or simultaneous
> execution".

It depends on what you mean by "shared data". In most of the
applications I've worked on, objects migrate from one thread to
another. Via message queues, but what is copied is a pointer,
not the entire object. (I don't see any way to do it otherwise
when the objects are polymorphic, which is generally the case.)

> 3. As a result, every thread has its own memory
> allocator that works (almost) independently from the
> allocators of other threads. In particular, the "thread A
> allocator" doesn't have to deallocate blocks of memory
> allocated by the "thread B allocator" because the data between
> threads is copied via the message queues.

In which system? Under what conditions? This is NOT the way
things work under Solaris, for example, where there is a common
allocator for all of the threads.

In practice, it's true that a large number of objects are
allocated and freed in the same thread. But there are also a
certain number which are allocated in one thread, and freed in
another. All of the message objects that you pass in the queue,
for example. (Not necessarily, of course. I tend to recycle
the message object for the response, so a message will be
created in thread A, live most of its life in thread B, and then
be sent back to thread A shortly before it dies.)

> 4. That is, a data structure is serialized into a queue.
> The sequence of bytes is passed to another thread. The bytes
> are deserialized and the structure gets reconstructed in that
> thread.

That sounds like a horribly inefficient way of doing things.
The whole point of using threads, instead of separate processes,
is that we don't have to serialize complex objects in order to
pass them between threads.

> Now your GC/C++:

> 1. Doesn't have the guarantee that allocated memory
> blocks are always deallocated in the same thread.

Of course it doesn't, since that doesn't correspond to any
realistic real-life scenario.

> So it must
> implement some kind of "stop the world" algorithm.

> 2. "Stop the world" algorithms definitely kill the
> concurrency. In particular, they have to cross the thread
> bounds so they don't "ALLOW FOR concurrent or simultaneous
> execution".

> The bottom line is clear: GC/C++ _degrades_ the
> performance of well-designed C++ MT programs!

That's your theory. The actual experimental facts disagree. Do
we change the facts, or your theory?

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Andrei Alexandrescu (See Website For Email)

Mar 21, 2006, 9:15:49 PM
Sergey P. Derevyago wrote:
> "Andrei Alexandrescu (See Website For Email)" wrote:
>
>>Garbage collection is desirable because:
>>
>>(3) Can improve performance, particularly in multithreaded environments
>>
>
> Explain, please!
>
> As it was stated by David Butenhof: "multithreading is defined by an
> application design that ALLOWS FOR concurrent or simultaneous execution". And
> if we're designing a C++ MT program, then:
> 1. The application must have several threads of control that work almost
> independently. They use synchronized message queues to pass messages from one
> thread to another and virtually NO SYNCHRONIZATION is supposed outside these
> message queues!

Ehm, yah, fine. Why the exclamation? No need to get overexcited around
here :o).

> 2. Note please, this design is not about "lock-free programming". The point
> is that different threads don't have to modify shared data outside the opaque
> message queues. These threads really "ALLOW FOR concurrent or simultaneous
> execution".

There is the concept of "almost non-blocking" where the amount of
synchronization is very low. It's not a black-and-white issue.

> 3. As a result, every thread has its own memory allocator that works (almost)
> independently from the allocators of other threads. In particular, the "thread
> A allocator" doesn't have to deallocate blocks of memory allocated by the
> "thread B allocator" because the data between threads is copied via the
> message queues.

Stop right there. This is but one allocator design, and has been
discussed a lot in MT circles. It has the disadvantage that memory
allocated but freed by one thread cannot be used by anyone else, so the
design can (in the worst case) consume k times more
memory than a design with a shared allocator, where k is the number of
threads.

Some applications can use the model successfully due to the nature of
their threads and their allocation patterns. Some can't. But you can't
possibly claim that that's the only possible design of an MT allocator.
The variety of MT allocator designs is hard proof.

> The bottom line is clear: GC/C++ _degrades_ the performance of well-designed
> C++ MT programs!

I think the way this argument was constructed is bogus. In order to
avoid spreading myself too thin, however, I won't comment on MT aspects
of GC any further.


Andrei

David Schwartz

Mar 21, 2006, 9:25:12 PM

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:441FE502...@iobox.com...

> The bottom line is clear: GC/C++ _degrades_ the performance of
> well-designed
> C++ MT programs!

For your very peculiar notion of "well-designed". I could not disagree
with this notion more strongly. In the vast majority of cases, having
objects 'belong' to threads is very bad design.

For example, if object 'A' belongs to thread 'A', and object 'A'
associates with some real-world thing that needs services, and only thread
'A' can manipulate object 'A', that means that if thread 'A' *ever* blocks,
object 'A' will be starved. There are reasons threads might block that are
beyond your control (such as page faults), so I would argue that for most
real-world applications, this type of design is actually very bad.

DS

Sergey P. Derevyago

Mar 22, 2006, 10:56:26 AM
kanze wrote:
> > 1. The application must have several threads of control
> > that work almost independently. They use synchronized message
> > queues to pass messages from one thread to another and
> > virtually NO SYNCHRONIZATION is supposed outside these message
> > queues!
>
> This depends on the application. Other applications use highly
> synchronized threads to parallelize the work.
>
If some code creates threads that don't allow for parallel simultaneous
execution, then it's bad MT code (by definition).
Some other technique (say, a state machine) should be applied.

> > 2. Note please, this design is not about "lock-free
> > programming". The point is that different threads don't have
> > to modify shared data outside the opaque message queues. These
> > threads really "ALLOW FOR concurrent or simultaneous
> > execution".
>
> It depends on what you mean by "shared data".
>

The definition is standard and comes from the "Memory Synchronization"
section:
"Applications shall ensure that access to any memory location by more than
one thread of control (threads or processes) is restricted such that no thread
of control can read or modify a memory location while another thread of
control may be modifying it. Such access is restricted using functions that
synchronize thread execution and also synchronize memory with respect to other
threads."

> In most of the
> applications I've worked on, objects migrate from one thread to
> another. Via message queues, but what is copied is a pointer,
> not the entire object. (I don't see any way to do it otherwise
> when the objects are polymorphic, which is generally the case.)
>

The answer is serialization and reconstruction.

> > 3. As a result, every thread has its own memory
> > allocator that works (almost) independently from the
> > allocators of other threads. In particular, the "thread A
> > allocator" doesn't have to deallocate blocks of memory
> > allocated by the "thread B allocator" because the data between
> > threads is copied via the message queues.
>
> In which system? Under what conditions? This is NOT the way
> things work under Solaris, for example, where there is a common
> allocator for all of the threads.
>

I'm talking about (thread-bound) user-defined memory allocators.
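
For illustration, a minimal sketch of such a thread-bound allocator (the
class is hypothetical; C++11 thread_local is used for brevity):

#include <cstddef>
#include <cstdlib>

// Each thread owns a private free list of small blocks. Nothing here is
// ever touched by another thread, so no synchronization is needed.
class thread_allocator {
    struct node { node* next; };
    node* free_;
    enum { block_size = 64 };
public:
    thread_allocator() : free_(0) {}
    void* alloc(std::size_t n) {
        if (n <= block_size && free_) {      // reuse a cached block
            node* b = free_;
            free_ = b->next;
            return b;
        }
        return std::malloc(n < block_size ? block_size : n);
    }
    void dealloc(void* p, std::size_t n) {
        if (n <= block_size) {               // cache small blocks for reuse
            node* b = static_cast<node*>(p);
            b->next = free_;
            free_ = b;
        } else {
            std::free(p);
        }
    }
};

thread_local thread_allocator tls_alloc;     // one allocator per thread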

> In practice, it's true that a large number of objects are
> allocated and freed in the same thread. But there are also a
> certain number which are allocated in one thread, and freed in
> another. All of the message objects that you pass in the queue,
> for example. (Not necessarily, of course. I tend to recycle
> the message object for the response, so a message will be
> created in thread A, live most of its life in thread B, and then
> be sent back to thread A shortly before it dies.)
>
> > 4. That is, a data structure is serialized into a queue.
> > The sequence of bytes is passed to another thread. The bytes
> > are deserialized and the structure gets reconstructed in that
> > thread.
>
> That sounds like a horribly inefficient way of doing things.
>

The impression is misleading. The data that really have to cross thread
boundaries are pretty rare. The copying and reconstructing can cost less than
one system call.

> The whole point of using threads, instead of separate processes,
> is that we don't have to serialize complex objects in order to
> pass them between threads.
>

This statement is also misleading.

> > The bottom line is clear: GC/C++ _degrades_ the
> > performance of well-designed C++ MT programs!
>
> That's your theory. The actual experimental facts disagree. Do
> we change the facts, or your theory?
>

It's not a theory.
The point is that your definition of a well-designed MT program is not "code
that allows for parallel simultaneous execution".


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Sergey P. Derevyago

Mar 22, 2006, 10:57:14 AM
"Andrei Alexandrescu (See Website For Email)" wrote:
> > 3. As a result, every thread has its own memory allocator that works (almost)
> > independently from the allocators of other threads. In particular, the "thread
> > A allocator" doesn't have to deallocate blocks of memory allocated by the
> > "thread B allocator" because the data between threads is copied via the
> > message queues.
>
> Stop right there.
>
No problem.

> This is but one allocator design, and has been
> discussed a lot in MT circles. It has the disadvantage that memory
> allocated but freed by one thread cannot be used by anyone else, so the
> design can (in the worst case) consume proportional with k times more
> memory than a design with a shared allocator, where k is the number of
> threads.
>

It seems like you have a wrong answer to the question "How many threads
should my application create?"

> Some applications can use the model successfully due to the nature of
> their threads and their allocation patterns. Some can't. But you can't
> possibly claim that that's the one possible design of an MT allocator.
> The variety of MT allocator designs is hard proof.
>

Please tell me about a different allocator design that also allows for
parallel simultaneous execution but imposes less overhead.

> > The bottom line is clear: GC/C++ _degrades_ the performance of well-designed
> > C++ MT programs!
>
> I think the way this argument was constructed is bogus. In order to
> avoid spreading myself too thin, however, I won't comment on MT aspects
> of GC any further.
>

Pity.


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Chris Thomasson

Mar 22, 2006, 10:55:29 AM
Sergey P. Derevyago wrote:
> "Andrei Alexandrescu (See Website For Email)" wrote:
>
>>Garbage collection is desirable because:
>>
>>(3) Can improve performance, particularly in multithreaded environments
>>
>
> Explain, please!
>
> As it was stated by David Butenhof: "multithreading is defined by an
> application design that ALLOWS FOR concurrent or simultaneous execution". And
> if we're designing a C++ MT program, then:
> 1. The application must have several threads of control that work almost
> independently. They use synchronized message queues to pass messages from one
> thread to another and virtually NO SYNCHRONIZATION is supposed outside these
> message queues!

There is a way to achieve a highly distributed, virtually zero-overhead
messaging environment!

;)


No atomic operations... Simple loads and stores coupled with a
StoreStore style memory barrier for producers, and dependent-load
ordering for consumers:

http://appcore.home.comcast.net/

One could implement a distributed messaging service by using sets of
per-thread single producer/consumer queues. A set of queues can allow
for virtually zero-overhead, bidirectional communications between two
threads. If a thread has more than one set of queues, it can organize
bidirectional communications with a plurality of application threads by
simply applying a multiplexing algorithm...
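
A minimal sketch of one such single-producer/single-consumer queue
(C++11 atomics stand in for the raw membars: the release store is the
producer's #StoreStore, and the acquire load stands in for the
consumer's dependent-load ordering; the names are illustrative):

#include <atomic>
#include <cstddef>

template <class T, std::size_t N>
class spsc_ring {
    T buf_[N];
    std::atomic<std::size_t> head_;          // consumer index
    std::atomic<std::size_t> tail_;          // producer index
public:
    spsc_ring() : head_(0), tail_(0) {}
    bool push(const T& v) {                  // producer thread only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if ((t + 1) % N == head_.load(std::memory_order_acquire))
            return false;                    // full
        buf_[t] = v;                         // write the slot first...
        tail_.store((t + 1) % N,
                    std::memory_order_release);  // ...then publish (#StoreStore)
        return true;
    }
    bool pop(T& out) {                       // consumer thread only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                    // empty
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};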


> 2. Note please, this design is not about "lock-free programming". The point
> is that different threads don't have to modify shared data outside the opaque
> message queues. These threads really "ALLOW FOR concurrent or simultaneous
> execution".

There are algorithms and usage patterns that can provide the means for
efficient data-sharing throughout a plurality of application threads.
For instance, certain lock-free reader patterns coupled with an
efficient solution to the reader/writer problem ( VZOOM or Joe Seigh's
SMR-RCU hybrid ) can provide virtually zero-overhead access to critical
shared data-structures. IMHO, this can be a very important feature:

http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce

IMO, addressing these issues is essential for achieving outstanding
scalability and throughput characteristics on the new multicore systems
that are now available. Algorithms that force threads to frequently use
tons of StoreLoad style memory barriers will not scale well at all on
the new stuff thats out there now. Like the UltraSPARC T1 which has an
on-chip high-speed mechanism for enforcing data-coherency throughout its
processing cores; SUN calls the mechanism "The Crossbar". An application
that frequently saturates The Crossbar with atomic operation and/or
StoreLoad membar coherency traffic would be a damn shame, IMHO of course...


> 3. As a result, every thread has its own memory allocator that works (almost)
> independently from the allocators of other threads. In particular, the "thread
> A allocator" doesn't have to deallocate blocks of memory allocated by the
> "thread B allocator" because the data between threads is copied via the
> message queues.

There are algorithms out there that can allow for the implementation of
an efficient distributed per-thread allocation scheme. The memory
allocated in one thread can be deallocated in another thread, no problem:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?q=lock-free+slab+allocator&rnum=1#e3efa5628aad4a82

Once you apply the fixes that I documented, my algorithm happens to
work. IMHO, it can be an excellent solution for per-thread allocation
schemes in general. Also, has anybody seen a reference counting
algorithm out there that is similar to the one I used for my algorithm?


> 4. That is, a data structure is serialized into a queue. The sequence of bytes
> is passed to another thread. The bytes are deserialized and the structure
> gets reconstructed in that thread.
>
> Now your GC/C++:
> 1. Doesn't have the guarantee that allocated memory blocks are always
> deallocated in the same thread. So it must implement some kind of "stop the
> world" algorithm.

You should not say "must implement" because that is simply a false
assertion wrt this subject. You do not have to use any type of
stop-the-world algorithm to solve this "problem".


> 2. "Stop the world" algorithms definitely kill the concurrency. In
> particular, they have to cross the thread bounds so they don't "ALLOW FOR
> concurrent or simultaneous execution".

They do allow for parallel execution when they're not "stopping the
world", so to speak...

;)

I am definitely not a fan of the technique in general.

Sergey P. Derevyago

Mar 22, 2006, 6:40:01 PM
Chris Thomasson wrote:
>> 3. As a result, every thread has its own memory allocator
>> that works (almost)
>> independently from the allocators of other threads. In particular,
>> the "thread
>> A allocator" doesn't have to deallocate blocks of memory allocated
>> by the
>> "thread B allocator" because the data between threads is copied
>> via the
>> message queues.
>
> There are algorithms out there that can allow for the
> implementation of
> an efficient distributed per-thread allocation scheme. The memory
> allocated in one thread can be deallocated in another thread, no
> problem:
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?q=lock-free+slab+allocator&rnum=1#e3efa5628aad4a82
>
No way: this implementation crosses the thread boundaries, so it must
serialize the access and synchronize memory, while per-thread allocation
needs no synchronization at all.
For example, here is your synchronization point:

void slab_sys_shared_push( block_t *_this )
{
    block_t *cmp;

    do
    {
        /* load the owning thread's shared LIFO head */
        cmp = _this->thread->shared.front;

        /* link in front of it, masking off the low "dying" tag bit */
        _this->next = (block_t *)( (uintptr_t)cmp & ~(uintptr_t)1 );
    }
    while ( ! CAS( &_this->thread->shared.front,
                   cmp,
                   ( (uintptr_t)cmp & 1 )
                       ? (block_t *)( (uintptr_t)_this | 1 )
                       : _this ) );

    /* the owning thread is dying: drop our reference and destroy
       it when the count reaches zero */
    if ( ( (uintptr_t)cmp & 1 ) && XADD( &_this->refs, -1 ) == 1 )
    {
        /* destroy _this->thread */
    }
}

>> 1. Doesn't have the guarantee that allocated memory blocks
>> are always
>> deallocated in the same thread. So it must implement some kind of
>> "stop the
>> world" algorithm.
>
> You should not say "must implement" because that is simply a false
> assertion wrt this subject.
>

I believe you must examine (at least):
1. The registers of all threads.
2. The stack of all threads.
3. The whole memory.
to make sure that some previously allocated memory block is now
inaccessible.

Please show your solution without "stop the world" action.


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Andrei Alexandrescu (See Website For Email)

Mar 23, 2006, 6:04:12 AM
Sergey P. Derevyago wrote:
> Please tell me about a different allocator design that also allows for
> parallel simultaneous execution but imposes less overhead.

http://www.research.ibm.com/people/m/michael/pldi-2004.pdf

Andrei

Chris Thomasson

Mar 23, 2006, 6:29:17 AM
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>
>>> 3. As a result, every thread has its own memory allocator
>>>that works (almost)
>>>independently from the allocators of other threads. In particular,
>>>the "thread
>>>A allocator" doesn't have to deallocate blocks of memory allocated
>>>by the
>>>"thread B allocator" because the data between threads is copied
>>>via the
>>>message queues.
>>
>>There are algorithms out there that can allow for the
>>implementation of
>>an efficient distributed per-thread allocation scheme. The memory
>>allocated in one thread can be deallocated in another thread, no
>>problem:
>>
>>http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?q=lock-free+slab+allocator&rnum=1#e3efa5628aad4a82
>>
>
> No way: this implementation crosses the thread boundaries, so it must
> serialize the access and synchronize memory, while per-thread allocation
> needs no synchronization at all.

Well, you're correct if the allocation activity is "guaranteed" to be
kept local. However, IMHO, a per-thread allocation scheme should
probably be robust enough to be able to address allocations that
frequently cross thread boundaries. I guess you could augment the
presented algorithm with something like the distributed messaging system
that I briefly described in my previous post. That could reduce overhead
down to #StoreStore and possibly dependent load barriers. That is fairly
lightweight...


> For example, here is your synchronization point:
>

[...]

Indeed. BTW, thank you for taking the time to take a look at the
pseudo-code. As you can see, an implementation of this "particular"
piece of pseudo-code would require #StoreStore and
#LoadStore|#StoreStore membars and CAS instructions. However, it only
uses the expensive barrier/atomic-op when a thread frees a memory block
that it does not own. You could apply an internal usage pattern to the
algorithm that could reduce the number of times a thread would actually
need to linearize itself at this synchronization point. You could set up
a scheme where threads could push an arbitrary number of memory blocks
at a time. A thread could decide to fill a cache of memory blocks up to
a desired limit "before" it decides to "push the entire cache" onto the
owner thread's lock-free LIFO in a single atomic-op. As you can see, this
simple usage pattern has the potential to reduce synchronizations
throughout the presented algorithm in general.
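
A hedged sketch of that batching idea, in the same pseudo-code style as
the function above (CAS as before; the chain from first to last was
built locally with no synchronization, and the "dying thread" tag-bit
handling is omitted for clarity):

void slab_sys_shared_push_batch( block_t **shared_front,
                                 block_t *first, block_t *last )
{
    block_t *cmp;

    do
    {
        cmp = *shared_front;   /* current head of the owner's LIFO      */
        last->next = cmp;      /* splice the whole local chain in front */
    }
    while ( ! CAS( shared_front, cmp, first ) );
}

One atomic-op now amortizes over the entire cache of blocks instead of
being paid per block.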

If you apply this usage pattern to a per-thread allocation
implementation that uses low-overhead queues as a means of
communication, you could possibly reduce synchronization overhead down
to virtually zero. Therefore, there could be a marked improvement in
scalability...


>>> 1. Doesn't have the guarantee that allocated memory blocks
>>>are always
>>>deallocated in the same thread. So it must implement some kind of
>>>"stop the
>>>world" algorithm.
>>
>>You should not say "must implement" because that is simply a false
>>assertion wrt this subject.
>>
>
> I believe you must examine (at least):
> 1. The registers of all threads.
> 2. The stack of all threads.
> 3. The whole memory.

This is not true for all types of garbage collection algorithms.


> to make sure that some previously allocated memory block is now
> inaccessible.
>
> Please show your solution without "stop the world" action.

Well, there are some proxy garbage collectors that I developed that do
not use stop the world:


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/1216861ac568b2be/b28bdb0051e2c2c6?lnk=st&q=vzoom&rnum=1#b28bdb0051e2c2c6

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/dc39b17aed1220c3/4bb0a3fb0339550c?q=proxy+gc&rnum=1#4bb0a3fb0339550c

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/69a72519fc302eff/de500aec6ee67dea?q=proxy+gc&rnum=2#de500aec6ee67dea

http://groups.google.com/group/comp.programming.threads/msg/c9ddb529ad35199c

http://home.comcast.net/~vzoom/demos/pc_sample.c
(one of my lock-free proxy gc impls)

http://groups.google.com/group/comp.programming.threads/msg/59e9b6e427b4a144

http://groups.google.com/group/comp.programming.threads/msg/b7814ee86cd4b54d

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7dda9ad554a976a6/60110614c07b0170?lnk=st&q=vzoom&rnum=6#60110614c07b0170

http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/65c9c2673682d4cf/bc1506fb6d0e3ba7?lnk=st&q=vzoom&rnum=3#bc1506fb6d0e3ba7


Please note that the presented algorithms, usage patterns, and virtually
every other algorithm that has anything to do with this kind of stuff
are covered by patents or pending patent applications... Patent activity
in this area of computing is remarkable.

Joe Seigh

Mar 23, 2006, 9:49:17 AM
Andrei Alexandrescu (See Website For Email) wrote:
> Sergey P. Derevyago wrote:
>
>> Please tell me about a different allocator design that also allows for
>> parallel simultaneous execution but imposes less overhead.
>
>
> http://www.research.ibm.com/people/m/michael/pldi-2004.pdf
Also some stuff on RCU here
http://www.rdrop.com/users/paulmck/RCU/

There's a prototype SMR+RCU implementation here, and also one for a
prescient implementation of lock-free reference counting (see patent
application 20060037026, "Lightweight reference counting using
single-target synchronization"; the illustrations are particularly
illuminating):

http://atomic-ptr-plus.sourceforge.net/


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Sergey P. Derevyago

Mar 23, 2006, 9:54:12 AM
Chris Thomasson wrote:
> > I believe you must examine (at least):
> > 1. The registers of all threads.
> > 2. The stack of all threads.
> > 3. The whole memory.
>
> This is not true for all types of garbage collection algorithms.
>
Here we talk about conventional C++ programs.
Explain please, why you don't have to examine the registers, the stacks and
the whole memory.

> > Please show your solution without "stop the world" action.
>
> Well, there are some proxy garbage collectors that I developed that do
> not use stop the world:
>

Can they be applied to conventional C++ programs?


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Sergey P. Derevyago

Mar 23, 2006, 9:54:56 AM
"Andrei Alexandrescu (See Website For Email)" wrote:
> > Please tell me about different allocator design that also allows for parallel
> > simultaneous execution but imposes less overhead.
>
> http://www.research.ibm.com/people/m/michael/pldi-2004.pdf
>
The paper is flawed. In particular, it states:

"For example, it is trivial to design a wait-free allocator with pure
per-thread private heaps. That is, each thread allocates from its own heap and
also frees blocks to its own heap."

Yes, this is the way to go.

"However, this is hardly an acceptable general-purpose solution, as it can
lead to unbounded memory consumption (e.g., under a producer-consumer pattern
[3]), even when the program's memory needs are in fact very small."

This statement is incorrect: if you pass messages via serialization then no
"unbounded memory consumption (e.g., under a producer-consumer pattern)" is
possible.

P.S. I'm going to read further. Probably more comments will appear in a few
days...


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Sergey P. Derevyago

Mar 23, 2006, 5:15:04 PM
Joe Seigh wrote:
> Also some stuff on RCU here
> http://www.rdrop.com/users/paulmck/RCU/
>
Yet another flaw from the very beginning:

"RCU OVERVIEW" section of
http://www.rdrop.com/users/paulmck/RCU/whatisRCU.html tells:

"The basic idea behind RCU is to split updates into "removal" and
"reclamation" phases. The removal phase removes references to data items
within a data structure (possibly by replacing them with references
to new
versions of these data items), and can run concurrently with readers.
The
reason that it is safe to run the removal phase concurrently with
readers is
the semantics of modern CPUs guarantee that readers will see either
the old or
the new version of the data structure rather than a partially updated
reference."

That is atomic pointers is a must.
You can't find the atomic pointers guarantee in POSIX so the show is
over.


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Chris Thomasson

Mar 24, 2006, 7:58:58 AM
Andrei Alexandrescu (See Website For Email) wrote:
> Sergey P. Derevyago wrote:
>
>> Please tell me about a different allocator design that also allows for
>> parallel simultaneous execution but imposes less overhead.
>
>
> http://www.research.ibm.com/people/m/michael/pldi-2004.pdf
>

Here is one of my contributions to this:

http://home.comcast.net/~vzoom/demos/pc_sample.c

You can apply this 100% working implementation to a proxy garbage
collection pattern which turns out to be compatible with RCU; RCU is a
proxy collector at heart. Basically, the general pattern boils down to
something like this:

/* readers */
acquire pc_ref from collector
access nodes of shared data-structures
release pc_ref to collector


/* writers */
atomically make node unreachable
defer removed node to the collector


/* collector callback app */
void pc_callback( void *s )
{
    user_state_t *us = s;

    /* it is now safe to destroy us */
}


Interestingly, the presented collector code (pc_sample.c) is very
flexible. There are a number of small tweaks that if applied to the
presented code could provide the means for:


Lock-Free Reference Counting
Lock-Free Process Level Smart Pointers
Lock-Free Software Transactional Memory
Lock-Free GC-Memory Allocation


I can provide some examples if you want, just ask. There are a couple
more things you can do with it, but first of all, are you familiar with
proxy garbage collection techniques and usage patterns in general? If
not, Joe Seigh and I can probably answer any questions you, or others,
may have. It kind of seems like we are the only two researchers out
there that are developing these kinds of algorithms. Does anybody know
of any development going on that involves a firm understanding of proxy gc?

Chris Thomasson

Mar 24, 2006, 11:17:26 AM
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>
>>> I believe you must examine (at least):
>>>1. The registers of all threads.
>>>2. The stack of all threads.
>>>3. The whole memory.
>>
>>This is not true for all types of garbage collection algorithms.
>>
>
> Here we talk about conventional C++ programs.
> Explain please, why you don't have to examine the registers, the stacks and
> the whole memory.

RCU, VZOOM, SMR, RCU+SMR all use techniques that do not involve
registers, stacks, and the whole memory:

RCU - tracks per-thread quiescent states.


VZOOM - can track per-thread arrays of persistent reference counters
after quiescent states. I created this one.


SMR - tracks per-thread static array of "hazard pointers".


RCU+SMR - tracks per-thread static array of hazard pointers after
quiescent states. Joe Seigh combined RCU and SMR into hybrid.

All of the tracking and bookkeeping can be executed in parallel with the
application threads; no stop-the-world, etc...

Chris Thomasson

Mar 24, 2006, 11:14:55 AM
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>
>>> I believe you must examine (at least):
>>>1. The registers of all threads.
>>>2. The stack of all threads.
>>>3. The whole memory.
>>
>>This is not true for all types of garbage collection algorithms.
>>
>
> Here we talk about conventional C++ programs.
> Explain please, why you don't have to examine the registers, the stacks and
> the whole memory.

Proxy garbage collectors don't need to track registers, stacks or the
whole memory. Instead, they track the individual states of a plurality
of per-thread "collected regions". The regions can be checked in a
number of different ways, none of which include tracking registers,
stacks or the whole memory. Compatible forms of atomic reference
counting, or separate polling logic, can be used to determine when a
plurality of collected regions are "quiescent", at which point the
associated generations of deferred requests can be processed.


>
>
>>> Please show your solution without "stop the world" action.
>>
>>Well, there are some proxy garbage collectors that I developed that do
>>not use stop the world:
>>
>
> Can they be applied to conventional C++ programs?

Yes. They can be applied to C++ as it exists today. No modifications
are necessary. I have a patent pending on a solution that can use POSIX
mutexes to create an environment that supports virtually zero-overhead
object management (VZOOM).

Dag-Erling Smørgrav

Mar 24, 2006, 11:20:58 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> writes:
> Yet another flaw [in RCU] from the very beginning:
> [quote from http://www.rdrop.com/users/paulmck/RCU/whatisRCU.html]

> You can't find the atomic pointers guarantee in POSIX so the show is
> over.

Given that the article you quote describes RCU within the Linux
kernel, POSIX is irrelevant.

DES
--
Dag-Erling Smørgrav - d...@des.no

Sergey P. Derevyago

Mar 24, 2006, 6:33:39 PM
Chris Thomasson wrote:
>>>> I believe you must examine (at least):
>>>> 1. The registers of all threads.
>>>> 2. The stack of all threads.
>>>> 3. The whole memory.
>>>
>> Here we talk about conventional C++ programs.
>> Explain please, why you don't have to examine the
>> registers, the stacks and
>> the whole memory.
>
> Proxy garbage collectors don't need to track registers, stacks or the
> whole memory.
>
Sorry, don't understand how this could be possible.
Suppose you have a regular C++ program that heavily uses regular raw
pointers. It keeps them in the registers, on the stacks and on the
free store.
Now you can take Boehm GC and plug it instead of malloc/free.

The question is: Can you take XXX proxy GC and plug it just like
Boehm GC?
Where can I get this XXX proxy GC and use in my C++ code?


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Joe Seigh

Mar 25, 2006, 6:14:52 AM
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>>Proxy garbage collectors don't need to track registers, stacks or the
>>whole memory.
>>
>
> Sorry, don't understand how this could be possible.
> Suppose you have a regular C++ program that heavily uses regular raw
> pointers. It keeps them in the registers, on the stacks and on the
> free store.

For proxy GC like RCU there are two kinds of references: global references
in shared storage (free store or the heap), and local references (stack
or registers). The former are handled conventionally; when a writer
removes the last global reference, so that an object isn't reachable from
shared memory, it waits until all threads have executed a quiescent state
at least once, and then frees the object. Since by RCU convention threads
won't hold a reference to an RCU-managed object across a quiescent state,
it's safe to free the object at that point.
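
A minimal user-space sketch of that convention (C++11 atomics for the
global reference; rcu_read_lock/rcu_read_unlock/synchronize_rcu are
assumed to be supplied by an RCU implementation, with names borrowed
from the Linux kernel API; node and use() are illustrative):

#include <atomic>

struct node { int payload; };

void rcu_read_lock();     /* assumed primitives: enter/leave a region */
void rcu_read_unlock();   /* that contains no quiescent state         */
void synchronize_rcu();   /* wait until all threads have passed one   */
void use(node*);          /* hypothetical reader work                 */

std::atomic<node*> root;  /* the single global reference */

void reader() {
    rcu_read_lock();      /* no quiescent state until unlock */
    node* p = root.load(std::memory_order_consume);
    if (p) use(p);        /* p cannot be freed out from under us */
    rcu_read_unlock();
}

void writer(node* new_version) {
    /* removal phase: unhook the old version atomically */
    node* old = root.exchange(new_version, std::memory_order_acq_rel);
    /* reclamation phase: after a grace period, no thread can still
       hold a local reference to the old version */
    synchronize_rcu();
    delete old;
}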

> Now you can take Boehm GC and plug it instead of malloc/free.

No. Code that uses malloc/free doesn't have to zero pointers. Simply
"plugging" in Boehm GC won't work.


>
> The question is: Can you take XXX proxy GC and plug it just like
> Boehm GC?

No, things like RCU, etc... aren't used for storage management. They're
used for read lock-free solutions to the reader/writer problem in a
multi-threaded environment.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.


Chris Thomasson

Mar 27, 2006, 9:39:23 AM
Joe Seigh wrote:
> Sergey P. Derevyago wrote:
>
>> Chris Thomasson wrote:
>>
>>> Proxy garbage collectors don't need to track registers, stacks or
>>> the
>>> whole memory.
>>>
>>
>> Sorry, don't understand how this could be possible.
>> Suppose you have a regular C++ program that heavily uses
>> regular raw
>> pointers. It keeps them in the registers, on the stacks and on the
>> free store.
>
>
> For proxy GC like RCU there are two kinds of references: global
> references in shared storage (free store or the heap), and local
> references (stack or registers). The former are handled conventionally;
> when a writer removes the last global reference, so that an object isn't
> reachable from shared memory, it waits until all threads have executed a
> quiescent state at least once, and then frees the object.

You don't really have to track all of the threads. You can choose to
track quiescent states per-data-structure...


> Since by RCU convention threads won't hold a reference to an RCU-managed
> object across a quiescent state, it's safe to free the object at that
> point.

I feel that I should point out that there are algorithms that allow for
an arbitrary number of references to persist across quiescent states.
This ability can simplify creating content for proxy collectors by not
having to worry whether a function call is going to swing the thread into
a quiescent state or not...

Chris Thomasson

Mar 27, 2006, 9:43:03 AM
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>
>>>>> I believe you must examine (at least):
>>>>> 1. The registers of all threads.
>>>>> 2. The stack of all threads.
>>>>> 3. The whole memory.
>>>>
>>> Here we talk about conventional C++ programs.
>>> Explain please, why you don't have to examine the
>>> registers, the stacks and
>>> the whole memory.
>>
>> Proxy garbage collectors don't need to track registers, stacks or the
>> whole memory.
>>
>
> Sorry, don't understand how this could be possible.
> Suppose you have a regular C++ program that heavily uses regular raw
> pointers. It keeps them in the registers, on the stacks and on the
> free store.
> Now you can take Boehm GC and plug it instead of malloc/free.
>
> The question is: Can you take XXX proxy GC and plug it just like
> Boehm GC?

No. You can usually use them to replace read/write mutexes with lock-free
reader patterns. They can be used for other stuff as well. However,
before I get into any of that, I would suggest that you familiarize
yourself with basic RCU usage patterns. Then try to apply a solution to
an experimental test case.

I posted a link to my collector code in a previous post. It has the same
basic pattern as RCU. I really do need to post some documentation on
it... Anyway, after you study up on RCU, if you're still interested, we
can help you set up a solution...

Sergey P. Derevyago

Mar 27, 2006, 1:02:43 PM
Joe Seigh wrote:
> > The question is: Can you take XXX proxy GC and plug it just like
> > Boehm GC?
>
> No, things like RCU, etc... aren't used for storage management. They're
> used for read lock-free solutions to the reader/writer problem in a
> multi-threaded environment.
>
So I was right: there is no GC/C++ that doesn't "stop the world".
And it seems like it's not possible for any GC/C++ not to "stop the world".

--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Joe Seigh

Mar 28, 2006, 10:07:37 AM
Sergey P. Derevyago wrote:
> Herb Sutter wrote:
>
>>> 1. Atomic pointers should not be assumed available.
>>
>> I'm stunned. Of course they certainly should be.
>>
>
> This one is a key point of our miscommunication.
> POSIX is the standard in MT area. It allows you to write pretty good
> _portable_ MT programs and POSIX doesn't require atomic pointers.
>
>
>> Their availability is a
>> basic assumption; all lock-free work relies on this kind atomicity
>> guarantee, plus an operation like CAS (or LL/SC).
>>
>
> I don't speak about lock-free tricks. Lock-free tricks are not as
> useful as one can suppose. Also they are not required for well-designed
> MT code to be good in both the performance and scalability.


Well, you're perfectly welcome to submit a solution to the "Global lock
contention problem" on c.p.t. That's where, to reduce lock contention, a
global mutex on a large data structure is replaced with finer granularity
locking on the data structure nodes. Except that you still need to
globally lock the large data structure to safely access the nodes, get a
node lock, and release the global lock. Even a rwlock as a global lock
may not work well, especially if the rwlock uses mutexes internally in
its implementation.

What some programmers do is just not bother with a global lock and go
with no locking or synchronization at all. So the program blows up once
in a while on a race condition. Just close the problem report as "non
reproducible". At least the code has good "performance and scalability".

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.


Sergey P. Derevyago

Mar 28, 2006, 2:36:44 PM
Joe Seigh wrote:
> > I don't speak about lock-free tricks. Lock-free tricks are not as
> > useful as one can suppose. Also they are not required for well-designed
> > MT code to be good in both the performance and scalability.
> >
> Well, you're perfectly welcome to submit a solution to the "Global
> lock contention problem" on c.p.t.
>
Sorry, that's not the problem I started to discuss:

1. Andrei Alexandrescu wrote:
-----------------------------


Garbage collection is desirable because:
(3) Can improve performance, particularly in multithreaded environments

2. I asked him to explain:
--------------------------
Now your GC/C++:


1. Doesn't have the guarantee that allocated memory blocks are always
deallocated in the same thread. So it must implement some kind of "stop the
world" algorithm.

2. "Stop the world" algorithms definitely kill the concurrency. In
particular, they have to cross the thread bounds so they don't "ALLOW FOR
concurrent or simultaneous execution".

The bottom line is clear: GC/C++ _degrades_ the performance of
well-designed C++ MT programs!

3. He replied:
--------------


> 3. As a result, every thread has its own memory allocator that works (almost)
> independently from the allocators of other threads. In particular, the "thread
> A allocator" doesn't have to deallocate blocks of memory allocated by the
> "thread B allocator" because the data between threads is copied via the
> message queues.

Stop right there. This is but one allocator design, and has been

discussed a lot in MT circles. It has the disadvantage that memory
allocated but freed by one thread cannot be used by anyone else, so the
design can (in the worst case) consume k times more
memory than a design with a shared allocator, where k is the number of
threads.

Some applications can use the model successfully due to the nature of

their threads and their allocation patterns. Some can't. But you can't
possibly claim that that's the only possible design of an MT allocator.
The variety of MT allocator designs is hard proof.

4. I replied:
-------------


Please tell me about a different allocator design that also allows for
parallel simultaneous execution but imposes less overhead.

5. He replied:
--------------
http://www.research.ibm.com/people/m/michael/pldi-2004.pdf

6. I replied:
-------------


The paper is flawed. In particular, it states:

"For example, it is trivial to design a wait-free allocator with pure
per-thread private heaps. That is, each thread allocates from its own heap and
also frees blocks to its own heap."

Yes, this is the way to go.

"However, this is hardly an acceptable general-purpose solution, as it can
lead to unbounded memory consumption (e.g., under a producer-consumer pattern
[3]), even when the program's memory needs are in fact very small."

This statement is incorrect: if you pass messages via serialization
then no "unbounded memory consumption (e.g., under a producer-consumer
pattern)" is possible.

-------------

OK, what we have now is:
1. A per-thread C++ allocator (which doesn't have to deallocate memory blocks
allocated in other threads) definitely beats any possible GC/C++ collector
(one that can be plugged into regular C++ programs and therefore has to "stop
the world").
2. So "GC/C++ _degrades_ the performance of well-designed C++ MT programs".


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Herb Sutter

Mar 28, 2006, 3:13:16 PM
>>>Sergey P. Derevyago wrote:
>>>> 1. Atomic pointers should not be assumed available.

>> Herb Sutter wrote:
>>> I'm stunned. Of course they certainly should be.

>Sergey P. Derevyago wrote:
>> This one is a key point of our miscommunication.

I agree. I'm glad we found the disconnect.

>> POSIX is the standard in MT area. It allows you to write pretty good
>> _portable_ MT programs and POSIX doesn't require atomic pointers.

Citing POSIX as the authority won't necessarily get you far. :-) Now that
Joe has cross-posted this to c.p.l you might get more details on both
topics (lock-free and POSIX).

>> Herb Sutter wrote:
>>> Their availability is a
>>> basic assumption; all lock-free work relies on this kind
>>> atomicity guarantee, plus an operation like CAS (or LL/SC).

>Sergey P. Derevyago wrote:
>> I don't speak about lock-free tricks. Lock-free tricks
>> are not as useful as one can suppose. Also they are not
>> required for well-designed MT code to be
>> good in both the performance and scalability.

For the information of those just tuning in, this thread is about Sergey
making and defending the assertion that ref counted smart pointers should
_not_ use atomic inc/dec of the ref count, and that instead users should
be required to externally lock code like this:

// acquire lock for the object p1 points to

shared_ptr<T> p2 = p1; // p1 is a private shared_ptr<T> instance

// release lock for the object p1 points to

That is, for a given mutable shared T object t, the contention was that
the user is supposed to take a lock not only on every use of t but also
take the same lock on every use of every shared_ptr<T> that points to t.

I asserted that this contention is wrong because it a) is harder for the
programmer, b) is more expensive than the atomic inc/dec would be, and c)
needlessly increases uses of locks when we should be decreasing lock use.
Even though there could be other shared_ptr<T>'s pointing to the same T,
the p1 object itself is private and unshared (no other thread sees p1,
though another thread might have another p3 that refers to the same T as
p1) and so per general good practice the user should not be required to
lock code that just takes a copy of a private object (here the smart
pointer).
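
For contrast, a hedged sketch of the atomic inc/dec approach being
defended (C++11 atomics; illustrative only, not shared_ptr's actual
implementation):

#include <atomic>

// Copying a private counted_ptr touches only the shared count, and does
// so atomically, so "p2 = p1" above needs no user-level lock even while
// other threads hold pointers to the same T.
template <class T>
class counted_ptr {
    struct ctrl { T* obj; std::atomic<long> refs; };
    ctrl* c_;
public:
    explicit counted_ptr(T* obj) : c_(new ctrl{obj, {1}}) {}
    counted_ptr(const counted_ptr& o) : c_(o.c_) {
        c_->refs.fetch_add(1, std::memory_order_relaxed);   // atomic inc
    }
    ~counted_ptr() {
        if (c_->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete c_->obj;                                 // last owner
            delete c_;
        }
    }
    T* operator->() const { return c_->obj; }
    // (assignment omitted for brevity)
};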


Joe Seigh <jsei...@xemaps.com> responded to Sergey:


>Well, you're perfectly welcome to submit a solution to the "Global lock
>contention problem" on c.p.t. That's where, to reduce lock contention, a
>global mutex on a large data structure is replaced with finer granularity
>locking on the data structure nodes. Except that you still need to
>globally lock the large data structure to safely access the nodes, get a
>node lock, and release the global lock. Even a rwlock as a global lock
>may not work well, especially if the rwlock uses mutexes internally in
>its implementation.
>
>What some programmers do is just not bother with a global lock and go
>with no locking or synchronization at all. So the program blows up once
>in a while on a race condition. Just close the problem report as "non
>reproducible". At least the code has good "performance and scalability".

:-)

Herb

---
Herb Sutter (www.gotw.ca) (www.pluralsight.com/blogs/hsutter)

Convener, ISO WG21 (C++ standards committee) (www.gotw.ca/iso)
Architect, Developer Division, Microsoft (www.gotw.ca/microsoft)

Chris Thomasson

Mar 28, 2006, 5:53:22 PM
Joe Seigh wrote:
> Sergey P. Derevyago wrote:
>
>>Herb Sutter wrote:
>>
>>
>>>> 1. Atomic pointers should not be assumed available.
>>>
>>>I'm stunned. Of course they certainly should be.
>>>
>>
>> This one is a key point of our miscommunication.
>> POSIX is the standard in MT area. It allows you to write pretty good
>>_portable_ MT programs and POSIX doesn't require atomic pointers.
>>
>>
>>
>>>Their availability is a
>>>basic assumption; all lock-free work relies on this kind atomicity
>>>guarantee, plus an operation like CAS (or LL/SC).
>>>
>>
>> I don't speak about lock-free tricks. Lock-free tricks are not as
>> useful as one can suppose. Also they are not required for well-designed
>> MT code to be good in both the performance and scalability.

Some of those tricks can be extremely useful. Lock-free reader patterns
are an example... Zero-overhead signaling via eventcounts is another,
etc...

You can also use POSIX mutexes to create a scalable environment that
supports lock-free reader patterns, and much more. The only thing the
environment relies on is dependent-load ordering, which is supported
everywhere except DEC Alpha; it needs a rmb. Funny how you can get
low-overhead lock-free algorithms from locks...

:)


BTW, have you ever seen a read/write mutex outperform "any" type of
lock-free reader pattern? If so, I would really appreciate a link or
some quick facts on the algorithm...

David Schwartz

Mar 29, 2006, 5:10:12 AM

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:4427C981...@iobox.com...

> So I was right: there is no GC/C++ that doesn't "stop the world".
> And it seems like it's not possible for any GC/C++ not to "stop the
> world".

The ability of operating systems to garbage collect not-recently-used
physical memory from an application without stopping the world would seem to
show that what you are claiming does not exist most certainly can exist.

DS

Chris Thomasson

Mar 29, 2006, 6:06:21 AM
Sergey P. Derevyago wrote:
> Joe Seigh wrote:
>
>>> The question is: Can you take XXX proxy GC and plug it just like
>>>Boehm GC?
>>
>>No, things like RCU, etc... aren't used for storage management. They're
>>used for read lock-free solutions to the reader/writer problem in a
>>multi-threaded environment.
>>
>
> So I was right: there is no GC/C++ that doesn't "stop the world".
> And it seems like it's not possible for any GC/C++ not to "stop the world".
> --

You don't need to garbage collect everything. Proxy collection can allow
an application to get away from traditional GC, and all of the overhead
that comes along with it. You don't need to "stop the world" to solve
the reader/writer problem...

Sean Kelly

Mar 29, 2006, 7:19:11 AM
Sergey P. Derevyago wrote:
> Joe Seigh wrote:
> > > I don't speak about lock-free tricks. Lock-free tricks are not as
> > > useful as one can suppose. Also they are not required for well-designed
> > > MT code to be good in both the performance and scalability.
> > >
> > Well, you're perfectly welcome to submit a solution to the "Global
> > lock contention problem" on c.p.t.
> >
> Sorry, it's not a problem I've started to discuss:
>
> 1. Andrei Alexandrescu wrote:
> -----------------------------
> Garbage collection is desirable because:
> (3) Can improve performance, particularly in multithreaded environments
>
> 2. I asked him to explain:
> --------------------------
> Now your GC/C++:
> 1. Doesn't have the guarantee that allocated memory blocks are always
> deallocated in the same thread. So it must implement some kind of "stop the
> world" algorithm.
> 2. "Stop the world" algorithms definitely kill the concurrency. In
> particular, they have to cross the thread bounds so they don't "ALLOW FOR
> concurrent or simultaneous execution".

Incremental GC doesn't have a "stop the world" phase, as far as I know,
but it also requires compiler support to achieve. Still, I've read
papers describing hard-realtime incremental garbage collectors, so
collection doesn't have to be quite the unpredictable bottleneck you
describe... though this is perhaps more relevant for languages with
in-language GC support.

> The bottom line is clear: GC/C++ _degrades_ the performance of
> well-designed C++ MT programs!

How does this logically follow from what you said above? Isn't there a
cost associated with the techniques used in traditional non-GCed
applications? Is this cost demonstrably less than garbage collection?

> 3. He replied:
> --------------
> > 3. As a result, every thread has its own memory allocator that works (almost)
> > independently from the allocators of other threads. In particular, the "thread
> > A allocator" doesn't have to deallocate blocks of memory allocated by the
> > "thread B allocator" because the data between threads is copied via the
> > message queues.
>
> Stop right there. This is but one allocator design, and has been
> discussed a lot in MT circles. It has the disadvantage that memory
> allocated but freed by one thread cannot be used by anyone else, so the
> design can (in the worst case) consume proportional with k times more
> memory than a design with a shared allocator, where k is the number of
> threads.

Not true. While HOARD does have a substantially higher cost for
freeing memory allocated in another thread, there's no reason that this
must be so. Assuming lock-free facilities are available (CAS, LL/SC),
it's possible to perform such a delete with a single atomic operation
by maintaining per-thread lock-free deletion queues.
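
A hedged sketch of such a per-thread deletion queue (C++11 atomics; the
names are illustrative): a foreign thread pushes the block onto the
owning thread's lock-free stack with one CAS, and the owner drains the
stack during its next call into the allocator:

#include <atomic>

struct block { std::atomic<block*> next; };

void remote_free(std::atomic<block*>& owner_queue, block* b) {
    block* head = owner_queue.load(std::memory_order_relaxed);
    do {
        b->next.store(head, std::memory_order_relaxed);
    } while (!owner_queue.compare_exchange_weak(
                 head, b,
                 std::memory_order_release,
                 std::memory_order_relaxed));  // one CAS per free
}

block* drain(std::atomic<block*>& owner_queue) {   // owner thread only
    return owner_queue.exchange(nullptr, std::memory_order_acquire);
}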

> Some applications can use the model successfully due to the nature of
> their threads and their allocation patterns. Some can't. But you can't
> possibly claim that that's the one possible design of an MT allocator.
> The variety of MT allocator designs is hard proof.
>
> 4. I replied:
> -------------
> Please tell me about different allocator design that also allows for
> parallel simultaneous execution but imposes less overhead.
>
> 5. He replied:
> --------------
> http://www.research.ibm.com/people/m/michael/pldi-2004.pdf
>
> 6. I replied:
> -------------
> The paper is flawed. In particular, it stands:
>
> "For example, it is trivial to design a wait-free allocator with pure
> per-thread private heaps. That is, each thread allocates from its own heap and
> also frees blocks to its own heap."
>
> Yes, this is the way to go.

All blocks or just its own blocks? I would argue that blocks should be
returned to the heap they are allocated from, or cache contention and
such could be an issue. See my last statement above for a fairly
simple way to accomplish this.

> "However, this is hardly an acceptable general-purpose solution, as it can
> lead to unbounded memory consumption (e.g., under a producer-consumer pattern
> [3]), even when the program's memory needs are in fact very small."
>
> This statement is incorrect: if you pass messages via serialization
> then no "unbounded memory consumption (e.g., under a producer-consumer
> pattern)" is possible.

But the paper is about memory allocation strategies, not alternative
methods for passing data between threads. I believe the paper is
correct with respect to passing pointers to allocated memory between
threads in a producer/consumer model.

> -------------
>
> OK, what we have now is:
> 1. A per-thread C++ allocator (which doesn't have to deallocate memory blocks
> allocated in other threads) definitely beats any possible GC/C++ collector
> (that can be plugged into regular C++ programs and therefore has to "stop the
> world").

That's a rather bold statement to make considering the assumptions
you've made. And while I don't feel qualified to make such an
assertion one way or the other, I don't think things are nearly as
clear as you suggest. I'm working from memory so forgive me if I've
got the wrong paper, but I believe Hans Boehm's paper entitled "Mostly
Parallel Garbage Collection"
(http://www.hpl.hp.com/personal/Hans_Boehm/gc/papers/pldi91.ps.Z) may
be relevant.

> 2. So "GC/C++ _degrades_ the performance of well-designed C++ MT programs".

Do you have data to support this claim?


Sean

Casper H.S. Dik

unread,
Mar 29, 2006, 10:57:53 AM3/29/06
to
Chris Thomasson <cristom_nospam@comcast._nospam_net> writes:

>Some of those tricks can be extremely useful. Lock-free reader patterns
>are an example... Zero-overhead signaling via eventcounts is another,
>etc...

There are also other "contention free" algorithms that use distributed
locking; e.g., the Solaris kernel memory allocator has little or no
contention because it uses per-CPU caches of data; only when such a cache
runs out does the CPU need to go and get some more.

As a side note: before the new kernel memory allocator was implemented
in Solaris 2.4 (I think), the old memory allocator performed poorly and
some people thought it was a good idea to have their own freelists
protected with a single lock.

As the (artificial) workload showed, that approach stops scaling at one
CPU (allocating memory was apparently performance-critical for the
auditing subsystem).

After modifying the code to use the contentionless default memory
allocator, no memory contention could be observed (I only had a 12-way
system at my disposal).
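
In outline, the distributed scheme looks roughly like the sketch below --
not the actual Solaris code, and with a per-thread cache (C++11
thread_local) standing in for a per-CPU one, since portable user-level code
cannot easily bind itself to CPUs:

#include <cstddef>
#include <mutex>
#include <vector>

struct Depot {
    std::mutex lock;               // touched only on refill/overflow
    std::vector<void*> blocks;     // the shared pool
};

constexpr std::size_t kBlockSize = 64;  // one size class, for brevity
constexpr int kBatch = 32;              // blocks moved per depot visit

thread_local std::vector<void*> tl_cache;  // private: no lock needed

void* cached_alloc(Depot& depot) {
    if (tl_cache.empty()) {        // slow path: refill a batch at once
        std::lock_guard<std::mutex> g(depot.lock);
        for (int i = 0; i < kBatch && !depot.blocks.empty(); ++i) {
            tl_cache.push_back(depot.blocks.back());
            depot.blocks.pop_back();
        }
        if (tl_cache.empty())
            tl_cache.push_back(::operator new(kBlockSize));
    }
    void* p = tl_cache.back();     // fast path: purely thread-local
    tl_cache.pop_back();
    return p;
}

void cached_free(Depot& depot, void* p) {
    tl_cache.push_back(p);
    if (tl_cache.size() > 2 * kBatch) {  // give surplus back to the depot
        std::lock_guard<std::mutex> g(depot.lock);
        while (tl_cache.size() > kBatch) {
            depot.blocks.push_back(tl_cache.back());
            tl_cache.pop_back();
        }
    }
}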

>You can also use POSIX mutexes to create a scalable environment that
>supports lock-free reader patterns, and much more. The only thing the
>environment relies on is dependent load ordering, which is supported
>everywhere except DEC Alpha; there it needs an rmb (read memory barrier).
>Funny how you can get low-overhead lock-free algorithms from locks...

There are times when you don't even need that as is the case with
the Solaris kernel memory allocator; it just doesn't need resources
owned by other CPUs.

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Sergey P. Derevyago

unread,
Mar 29, 2006, 10:57:04 AM3/29/06
to
Herb Sutter wrote:
> For the information of those just tuning in, this thread is about Sergey
> making and defending the assertion that ref counted smart pointers should
> _not_ use atomic inc/dec of the ref count, and that instead users should
> be required to externally lock code like this:
>
> // acquire lock for the object p1 points to
>
> shared_ptr<T> p2 = p1; // p1 is a private shared_ptr<T> instance
>
> // release lock for the object p1 points to
>
> That is, for a given mutable shared T object t, the contention was that
> the user is supposed to take a lock not only on every use of t but also
> take the same lock on every use of every shared_ptr<T> that points to t.
>
Herb, I've written the following statements several times:
1. Shared smart pointer objects (ones where several threads concurrently
modify the same memory locations) are pretty rare in well-designed MT
applications. Their use in MT applications is almost always unacceptable (in
particular, your ...p2=p1... example illustrates one of the drawbacks).
2. If you decide to use a shared smart pointer object then you definitely
have to synchronize the access (but these cases are pretty rare).
3. As a result, smart pointers that synchronize every shared operation will
definitely degrade whole-application performance, because almost all smart
pointer objects aren't shared and require no synchronization (atomic
operations on SMP require memory synchronization, so they are substantially
expensive).
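
To make point 3 concrete, here is a bare sketch (hypothetical code, in
post-2006 C++ for brevity) of a refcounted pointer whose count is a plain
integer. Copying it costs no atomic operations or memory barriers -- exactly
the cost a synchronize-everything smart pointer imposes on the unshared
majority. It is safe only under points 1-2, i.e. while all copies stay
within one thread or behind an external lock:

// Bare sketch (hypothetical): a refcounted pointer with a plain, non-atomic
// counter. Copies are cheap, but every copy must stay within one thread (or
// be guarded by an external lock), per points 1-2 above.
template <class T>
class local_shared_ptr {
    struct Rep { T* obj; long count; };
    Rep* rep_;
public:
    explicit local_shared_ptr(T* p) : rep_(new Rep{p, 1}) {}
    local_shared_ptr(const local_shared_ptr& o) : rep_(o.rep_) {
        ++rep_->count;                 // plain increment: no LOCK prefix
    }
    local_shared_ptr& operator=(const local_shared_ptr& o) {
        ++o.rep_->count;               // increment first: self-assignment safe
        release();
        rep_ = o.rep_;
        return *this;
    }
    ~local_shared_ptr() { release(); }
    T* operator->() const { return rep_->obj; }
private:
    void release() {
        if (--rep_->count == 0) { delete rep_->obj; delete rep_; }
    }
};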

--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Sergey P. Derevyago

unread,
Mar 29, 2006, 10:59:37 AM3/29/06
to
David Schwartz wrote:
> > So I was right: there is no GC/C++ that doesn't "stop the world".
> > And it seems like it's not possible for any GC/C++ not to "stop the
> > world".
>
> The ability of operating systems to garbage collect not-recently-used
> physical memory from an application without stopping the world would seem
> to show that what you claim cannot exist most certainly can.
>
Explain, please.

--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Sergey P. Derevyago

unread,
Mar 29, 2006, 10:59:59 AM3/29/06
to
Chris Thomasson wrote:
> Funny how you can get low-overhead lock-free algorithms from locks...
>
I've explained the point several times:
1. Well-designed MT code allows for parallel simultaneous execution.
2. Most of the time, threads work with their private data structures, so no
synchronization is required.
3. In those rare cases where you really have to synchronize some operations,
the standard POSIX synchronization primitives do well.

Let's consider the following simplified examples.

1. Suppose your code works with the thread-private data 95% of the time and
spends 5% on the synchronization. Suppose MagicLockFreeMutex is 2 times faster
than StandardMutex. Then if you replace the StandardMutex with
MagicLockFreeMutex, your application will take 97.5% of the previous time,
so you will get a 2.5% performance improvement.

The bottom line is: well-designed MT applications don't get significant gains
from faster lock operations.

2. Suppose your code works with the thread-private data 20% of the time and
spends 80% on the synchronization. That is you have a usual _bad_ MT design.
Then if you replace the StandardMutex with MagicLockFreeMutex, your
application will take 60% of the previous time, so you will get a 40%
performance improvement, which is substantial.

The bottom line is: bad MT applications can get certain gains from faster lock
operations.
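
The arithmetic behind both examples is just Amdahl-style accounting: only
the synchronized fraction benefits from a faster lock. A tiny sketch, using
the (made-up) numbers above:

#include <cstdio>

// Amdahl-style accounting for the two examples above: only the fraction
// spent synchronizing benefits from a faster lock.
double relative_runtime(double sync_frac, double lock_speedup) {
    return (1.0 - sync_frac) + sync_frac / lock_speedup;
}

int main() {
    std::printf("%.3f\n", relative_runtime(0.05, 2.0)); // 0.975 -> 2.5% gain
    std::printf("%.3f\n", relative_runtime(0.80, 2.0)); // 0.600 -> 40% gain
}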

In principle, you can talk about some MT application design where all threads
work on the same data structure (say, some matrix transformation). Yes,
well-chosen lock-free algorithms are a must in this particular case.
My point is that regular MT design is "several threads that work almost
independently". These threads must not be forced to perform some hidden
synchronization under the covers. A good example of this bad hidden
synchronization is a global memory allocator that has to cross the thread
boundaries...


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Sergey P. Derevyago

unread,
Mar 30, 2006, 2:12:35 AM3/30/06
to
Sean Kelly wrote:
> Incremental GC doesn't have a "stop the world" phase, as far as I know,
>
"The devil is in the details".

> > The bottom line is clear: GC/C++ _degrades_ the performance of
> > well-designed C++ MT programs!
>
> How does this logically follow from what you said above?
>

Please read the whole message rather than the bottom line:
http://groups.google.com/group/comp.lang.c++.moderated/msg/9fe3ee93d3729321

> But the paper is about memory allocation strategies, not alternative
> methods for passing data between threads. I believe the paper is
> correct with respect to passing pointers to allocated memory between
> threads in a producer/consumer model.
>

The paper explains the motivation: "it is trivial to design a wait-free
allocator with pure per-thread private heaps" but "this is hardly an
acceptable general-purpose solution, as it can lead to unbounded memory
consumption".
I state: per-thread private heaps don't lead to unbounded memory consumption
if allocation/deallocation doesn't cross the thread boundaries (as I've
proposed). So per-thread private heaps are superior to any heap that crosses
the thread boundaries (and therefore has to synchronize access), and they
don't have the "unbounded memory consumption" drawback. Per-thread private
heaps are the way to go as an excellent, generally available, conventional
C++ solution (and it is GC/C++ we are talking about here, BTW).
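
In code, the shape of the proposed design is roughly the following bare
sketch. The queue is the single synchronized component; for brevity its byte
buffers are std::strings from the global heap, whereas in the real design
the queue itself would own and recycle them:

#include <condition_variable>
#include <cstring>
#include <mutex>
#include <queue>
#include <string>

// Only bytes cross the thread boundary, so each thread's heap stays private.
struct MsgQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::string> q;

    void push(std::string bytes) {
        { std::lock_guard<std::mutex> g(m); q.push(std::move(bytes)); }
        cv.notify_one();
    }
    std::string pop() {
        std::unique_lock<std::mutex> g(m);
        cv.wait(g, [this] { return !q.empty(); });
        std::string bytes = std::move(q.front());
        q.pop();
        return bytes;
    }
};

struct Point { int x, y; };  // stand-in for a real message type

std::string serialize(const Point& p) {
    return std::string(reinterpret_cast<const char*>(&p), sizeof p);
}
Point deserialize(const std::string& bytes) {
    Point p;
    std::memcpy(&p, bytes.data(), sizeof p);
    return p;  // reconstructed entirely in the receiver's own heap
}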


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


David Schwartz

unread,
Mar 30, 2006, 2:45:42 AM3/30/06
to

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:442A5E82...@iobox.com...

> 2. Suppose your code works with the thread-private data 20% of the time and
> spends 80% on the synchronization. That is you have a usual _bad_ MT design.
> Then if you replace the StandardMutex with MagicLockFreeMutex, your
> application will take 60% of the previous time, so you will get a 40%
> performance improvement, which is substantial.
>
> The bottom line is: bad MT applications can get certain gains from faster
> lock operations.

(I'm assuming an SMP P4 machine, but things are similar on most modern
machines.)

Quite the reverse, this type of application would *suffer* *terribly*
from lock-free algorithms. With a lock, two threads would never be permitted
concurrent access to the same data, so it would never enter a state where
data was ping-ponging back and forth from cache to cache burning heavy CPU
with no significant forward progress. However, with a lock-free structure,
it would be allowed to do this horrible thing 80% of the time!

DS

Ulrich Eckhardt

unread,
Mar 30, 2006, 3:07:41 AM3/30/06
to
Sergey P. Derevyago wrote:
> 1. Shared smart pointer objects (that concurrently modify the same
> memory locations) are pretty rare in well-designed MT applications.
> Their use in MT applications are almost unacceptable (in particular,
> your ...p2=p1... example illustrates one of the drawbacks).

Just wondering, but what is a 'shared smart pointer object'? Are you
referring to a shared_ptr object or do you mean an object pointed to (and
shared) by several shared_ptrs?

Uli


chris noonan

unread,
Mar 30, 2006, 6:48:35 AM3/30/06
to
Sergey P. Derevyago wrote:
> 1. Suppose your code works with the thread-private data 95% of the time and
> spends 5% on the synchronization. Suppose MagicLockFreeMutex is 2 times faster
> than StandardMutex. Then if you replace the StandardMutex with
> MagicLockFreeMutex your application will work for 97.5% of the previous time
> so you will get 2.5% performance improvement.

But be careful to distinguish between 5% on a lines-of-code
basis and 5% on a time basis. StandardMutex will use a
bus-locking instruction, costing about the same number of clocks
as perhaps 100 ordinary instructions; no real problem. But
StandardMutex may also cause a thread switch: thousands of clocks
on current architectures (no idea why; how difficult can it be
to swap a few registers over?). To keep the cost of thread
synchronisation down to 5% on a time basis, the extent
of the communication must be slight.

> My point is that regular MT design is "several threads that work almost
> independently". These threads must not be forced to perform some hidden
> synchronization under the covers. A good example of this bad hidden
> synchronization is a global memory allocator that has to cross the thread
> boundaries...

I couldn't agree more.

Regards,
Chris

Joe Seigh

unread,
Mar 30, 2006, 7:26:05 AM3/30/06
to
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>
>>Funny how you can get low-overhead lock-free algorithms from locks...
>>
>
> I've explained the point several times:
> 1. Well-designed MT code allows for parallel simultaneous execution.
> 2. Most of the time, threads work with their private data structures, so no
> synchronization is required.
> 3. In those rare cases where you really have to synchronize some operations,
> the standard POSIX synchronization primitives do well.
>
> Let's consider the following simplified examples.
>
> 1. Suppose your code works with the thread-private data 95% of the time and
> spends 5% on the synchronization. Suppose MagicLockFreeMutex is 2 times faster
> than StandardMutex. Then if you replace the StandardMutex with
> MagicLockFreeMutex, your application will take 97.5% of the previous time,
> so you will get a 2.5% performance improvement.
>
> The bottom line is: well-designed MT applications don't get significant gains
> from faster lock operations.

[...]


>
> The bottom line is: bad MT applications can get certain gains from faster lock
> operations.

You're missing the point. In a preemptive threading environment, one of
the major benefits of lock-free algorithms is less overhead from suspending
and resuming threads on blocking operations. It's faster for the same reason
highway interchanges with
overpasses are less congested than intersections with traffic lights for
the same volume of traffic. Lock-free also scales better and doesn't
suffer from that hysteresis effect where the application bogs down so badly
that it can't recover even under reduced load and has to be restarted.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.


Joe Seigh

unread,
Mar 30, 2006, 8:57:29 AM3/30/06
to
David Schwartz wrote:
> "Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
> news:442A5E82...@iobox.com...
>
>
>>2. Suppose your code works with the thread-private data 20% of the time and
>>spends 80% on the synchronization. That is you have a usual _bad_ MT design.
>>Then if you replace the StandardMutex with MagicLockFreeMutex, your
>>application will take 60% of the previous time, so you will get a 40%
>>performance improvement, which is substantial.
>>
>>The bottom line is: bad MT applications can get certain gains from faster
>>lock operations.
>
>
> (I'm assuming an SMP P4 machine, but things are similar on most modern
> machines.)
>
> Quite the reverse, this type of application would *suffer* *terribly*
> from lock-free algorithms. With a lock, two threads would never be permitted
> concurrent access to the same data, so it would never enter a state where
> data was ping-ponging back and forth from cache to cache burning heavy CPU
> with no significant forward progress. However, with a lock-free structure,
> it would be allowed to do this horrible thing 80% of the time!
>

This makes no sense. Most lock-free algorithms are of the read lock-free
variety, and ones like RCU and SMR hazard pointers actually reduce
the amount of cache traffic. RCU was created at Sequent, where this kind
of thing was important since it was a NUMA environment.

For regular write lock-free things like lock-free queues, you're going to
have to explain how queue operations guarded by a mutex cause less cache
activity than lock-free queue operations do, without slowing things down.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.


Chris Thomasson

unread,
Mar 30, 2006, 9:56:21 AM3/30/06
to
"David Schwartz" <dav...@webmaster.com> wrote in message
news:e0eqoc$43m$1...@nntp.webmaster.com...

What about the mutex implementation's "internal state"? A typical mutex
usually costs 2 atomic ops and 2 "expensive" memory barriers for the
uncontended case... Unfortunately, when you apply load to those 4
instructions, the basic end result will be the lock's internal state
ping-ponging all over the place; the lock state is a "proxy". This happens
to be true for "frequent accesses to uncontended locks". Now, if you
experience contention under load, the threads may need to frequently park
in the kernel: context switches, user-kernel transitions, lock convoys,
"possible" priority inversion, etc., etc...


> However, with a lock-free structure,
> it would be allowed to do this horrible thing 80% of the time!

Well, this could happen if you apply "persistent" heavy load to highly
contended "CAS-Loop" based lock-free algorithms... IMHO, I prefer to use
"loopless" algorithms whenever I can. Also, it can be beneficial to join
lock-free and lock-based solutions together... Something like the algorithm
I posted here:

http://groups.google.de/group/comp.programming.threads/msg/632b6bdc2a137459

http://groups.google.de/group/comp.programming.threads/msg/fd508d99ee2b8a70


Also, the overhead of taking locks can "adversely affect other threads"
that are running "on the same core as the lock-taker". This assertion holds
true for some of the new high-end multicore processors that are now
available. Frequent use of memory barriers (e.g., mutex usage) will flood
the core(s) with coherency traffic and bog down other threads. Overall
scalability and throughput can be reduced...

{ It looks like this conversation is drifting away from the topic of C++.
Please remove comp.lang.c++.moderated from the Newsgroups line if
continued. Thanks, -mod/dv }

David Hopwood

unread,
Mar 30, 2006, 11:10:37 AM3/30/06
to
[posting via Google because of news server problems -- sorry if this
is a duplicate]

David Schwartz wrote:


> "Sergey P. Derevyago" <non-ex...@iobox.com> wrote:
>
>>The bottom line is clear: GC/C++ _degrades_ the performance of
>>well-designed C++ MT programs!
>

> For your very peculiar notion of "well-designed". I could not disagree
> with this notion more strongly. In the vast majority of cases, having
> objects 'belong' to threads is very bad design.
>
> For example, if object 'A' belongs to thread 'A', and object 'A'
> associates with some real-world thing that needs services, and only thread
> 'A' can manipulate object 'A', that means that if thread 'A' *ever* blocks,
> object 'A' will be starved.

The previous poster was speaking in the context of systems using
message passing concurrency. In such systems, each object is
associated with a thread of control (although they are usually not
called "threads") by definition. If one were to assume that associating
objects with threads is the only change relative to a model using
shared state and locks, one might conclude that the result will be
a bad design, but in fact it is very far from being the only change.

For instance, some message passing languages have constructs
that make it easy to avoid blocking while causing minimal change
to the logical structure of the program. In E, a call that could
potentially block can always be rewritten as an asynchronous
message send that returns a promise for the call's result, followed
by a "when-catch" clause (see
<http://www.erights.org/talks/promises/paper/tgc05.pdf>).
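
Standard C++ has no direct analogue of when-catch (its futures lack a
non-blocking continuation), but the shape of the transformation can be
sketched. In this rough sketch the wait is merely deferred to the final
get() rather than replaced by a continuation, as E would do:

#include <future>
#include <iostream>

// A call that could block, turned into an async "message send" that
// immediately returns a promise-like handle for the eventual result.
int slow_lookup() { return 42; }  // hypothetical blocking operation

int main() {
    std::future<int> pending =
        std::async(std::launch::async, slow_lookup);
    // ... the caller continues doing useful work here, unblocked ...
    std::cout << pending.get() << '\n';  // settle the result when needed
}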

> There are reasons threads might block that are
> beyond your control (such as page faults), so I would argue that for most
> real-world applications, this type of design is actually very bad.

Except for hard real-time applications, page faults are not much
of a problem, since they are guaranteed to only block for a short
time (even if there is a disk error, since then the process will be
terminated). There is obviously some latency associated with
servicing a page fault, but this is unavoidable regardless of the
concurrency design.

In a lock-based model, however, because a thread synchronously
blocks when it cannot immediately acquire a lock, operations
that do not logically depend on having acquired the lock can be
delayed. In a model with asynchronous message passing and
promises, each operation will take place as soon as it is logically
possible, which generally allows a higher degree of concurrency
for a comparable complexity and effort in design and validation.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

unread,
Mar 30, 2006, 11:23:02 AM3/30/06
to
[posting via Google because of news server problems -- sorry if this is
a duplicate]

Chris Thomasson wrote:
> Sergey P. Derevyago wrote:
>>Joe Seigh wrote:
>>
>>>> The question is: Can you take XXX proxy GC and plug it just like
>>>>Boehm GC?
>>>
>>>No, things like RCU, etc... aren't used for storage management. They're
>>>used for read lock-free solutions to the reader/writer problem in a
>>>multi-threaded environment.
>>
>> So I was right: there is no GC/C++ that doesn't "stop the world".
>> And it seems like it's not possible for any GC/C++ not to "stop the world".
>

> You don't need to garbage collect everything. Proxy collection can allow
> an application to get away from traditional GC, and all of the overhead
> that comes along with it.

I always find it odd when people talk about the "overhead of GC" as though
other memory management techniques had no overheads.

You shouldn't assume that a hybrid of GC and some other memory management
technique will have lower overhead than applying GC to all objects. If the
other technique is some form of reference counting (as is commonly the case
when using proxy GC), then it definitely won't; refcounting is very
inefficient.

> You don't need to "stop the world" to solve
> the reader/writer problem...

Sergey is mistaken in thinking that global GC implies "stop the world"
(even for C++, in principle, although there are no state-of-the-art
concurrent GC implementations for C++, and it would be difficult to write
one that would be link-compatible with existing C++ implementations).

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Sergey P. Derevyago

unread,
Mar 30, 2006, 11:57:09 AM3/30/06
to
David Hopwood wrote:
> Sergey is mistaken in thinking that global GC implies "stop the world"
>
Please explain how this could be possible.
Don't you have to examine the registers, the stacks and so on? (We are
talking about conventional C++ programs and a pluggable GC.)

David Schwartz

unread,
Mar 30, 2006, 2:25:28 PM3/30/06
to

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:Y-mdnXDUsrOAfLbZ...@comcast.com...

> What about the mutex implementation's "internal state"? A typical mutex
> usually costs 2 atomic ops and 2 "expensive" memory barriers for the
> uncontended case... Unfortunately, when you apply load to those 4
> instructions, the basic end result will be the lock's internal state
> ping-ponging all over the place; the lock state is a "proxy". This happens
> to be true for "frequent accesses to uncontended locks". Now, if you
> experience contention under load, the threads may need to frequently park
> in the kernel: context switches, user-kernel transitions, lock convoys,
> "possible" priority inversion, etc., etc...

I have no objection to optimizing mutexes to decrease the cost in the
uncontended case. However, in the contended case, the cost of rescheduling
is well worth it -- a thread that won't conflict gets to run at full speed
rather than a thread that will conflict getting to fight at low speed.

DS


David Schwartz

unread,
Mar 30, 2006, 7:24:54 PM3/30/06
to

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:U6SdnUIS2vs...@comcast.com...

> This makes no sense. Most lock-free algorithms are of the read lock-free
> variety and for things like RCU and SMR hazard pointers actually reduce
> the amount of cache traffic. RCU was created at Sequent where this kind
> of thing was important since it was in a NUMA environment.

If the algorithm is purely read, then this argument does not apply. But
there are very few lock-free algorithms that don't involve allowing threads
to modify things concurrently.

> For regular write lock-free things like lock-free queues, you're going to
> have
> to explain how a queue operations guarded with a mutex causes less cache
> activity than lock-free queue operations do without slowing things down.

Suppose you have two CPUs and four threads that want to run. Two of the
threads want to access queue A and two of the threads want to access queue
B. If each queue is protected by a lock, the scheduler will have to schedule
one thread using queue A and one using queue B. Both threads will run at
full speed and will not interfere with each other. If the threads are
lock-free, it is equally likely that two threads will run using the same queue as
using different queues, the queue structures will ping-pong violently across
the FSB and the two threads will run at low speed.

Locks serve an extremely important purpose -- they deschedule threads
that will compete for the same cache lines and allow threads that have no
such dependencies to run concurrently at full speed. Lock-free algorithms
work well only when they don't disturb this, for example, when they are
embedded inside the operating system or in very low-level libraries.
However, when lock-free structures are used at higher levels, this important
property of locks is lost, and performance suffers.

Contention is bad. Lock-free algorithms allow threads that contend to
run at the same time, contending all the while. Locks deschedule contending
threads, which is great if there are threads ready to run that will not
contend.

DS

chris noonan

unread,
Mar 30, 2006, 7:47:09 PM3/30/06
to
[apologies if this is a near-duplicate; the first post seemed to
disappear]

David Schwartz wrote:

>> Quite the reverse, this type of application would *suffer* *terribly*
>> from lock-free algorithms. With a lock, two threads would never be
>> permitted concurrent access to the same data, so it would never enter a
>> state where data was ping-ponging back and forth from cache to cache
>> burning heavy CPU with no significant forward progress. However, with a
>> lock-free structure, it would be allowed to do this horrible thing 80% of
>> the time!


A real-world example, more information available at:
group: comp.unix.internals
topic: Kernel Memory Allocator
date: 6 November 2003

A simple program (benchmark type) that divides a fixed
amount of work up among a variable number of threads,
so that with perfect scaling all runs take the same time.
Can use either a locking or lock-free algorithm.

Several runs on a 4-way machine. Each result sequence
starts at one thread and doubles successively up to
128 (1,2,4,8,16,32,64,128). Timings are in seconds
(so the smaller the better).

Locking: 265,289,280,400,555,456,604,575
Lock-free: 245,122,80,70,70,58,75,82

In this case lock-free scales better.

Regards
Chris Noonan

David Hopwood

unread,
Mar 30, 2006, 10:08:57 PM3/30/06
to
David Schwartz wrote:
> Locks serve an extremely important purpose -- they deschedule threads
> that will compete for the same cache lines and allow threads that have no
> such dependencies to run concurrently at full speed.

Note that locks are not the only way of doing that. Any method that achieves
mutual exclusion between "threads that will compete for the same cache lines"
will do it. Besides locks, another approach is based on active objects which
use an event loop to respond to messages. This message passing may be either
asynchronous, as in Erlang or E, or synchronous, as in Concurrent ML.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Schwartz

unread,
Mar 30, 2006, 11:13:00 PM3/30/06
to
[possible duplicate post, sorry]

The descheduling effect of locks that I described above is often missed in
artificial benchmarking, because only the performance of a single aspect of
the system is measured. The effect of FSB-inefficient code on threads
running on other CPUs, for example, is seldom noticed. If there is nothing
else for the system to do, running a thread very slowly and trashing the
FSB may be better than nothing. But if a real-world system or application
could do other things effectively, allowing it to do something slowly is a
losing proposition.

DS


David Schwartz

unread,
Mar 31, 2006, 1:41:18 AM3/31/06
to

"chris noonan" <use...@leapheap.co.uk> wrote in message
news:1143765082.0...@g10g2000cwb.googlegroups.com...

> Several runs on a 4-way machine. Each result sequence
> starts at one thread and doubles successively up to
> 128 (1,2,4,8,16,32,64,128). Timings are in seconds
> (so the smaller the better).
>
> Locking: 265,289,280,400,555,456,604,575
> Lock-free: 245,122,80,70,70,58,75,82
>
> In this case lock-free scales better.

The results don't seem to make any sense. How can 6 threads go faster
than 4 on a 4-way machine? I thought this was a pure CPU load.

DS


David Schwartz

unread,
Mar 31, 2006, 1:43:10 AM3/31/06
to

"David Schwartz" <dav...@webmaster.com> wrote in message
news:e0iius$elp$1...@nntp.webmaster.com...

Sorry, that should be: How can 8 threads go faster than 4 on a 4-way
machine? Seems that 4 threads taking 80 seconds when 2 threads took 122 on a
4-way machine is pretty mediocre scalability. That you found some other
algorithm that scales even worse doesn't really say anything.

DS