
Re: Reconciling Garbage Collection with Deterministic Finalization


Sergey P. Derevyago

Mar 21, 2006, 7:12:20 AM
"Andrei Alexandrescu (See Website For Email)" wrote:
> Garbage collection is desirable because:
>
> (3) Can improve performance, particularly in multithreaded environments
>
Explain, please!

As David Butenhof stated: "multithreading is defined by an application
design that ALLOWS FOR concurrent or simultaneous execution". And if we're
designing a C++ MT program, then:
1. The application must have several threads of control that work almost
independently. They use synchronized message queues to pass messages from one
thread to another, and virtually NO SYNCHRONIZATION is expected outside these
message queues!
2. Please note, this design is not about "lock-free programming". The point
is that different threads don't have to modify shared data outside the opaque
message queues. Such threads really do "ALLOW FOR concurrent or simultaneous
execution".
3. As a result, every thread has its own memory allocator that works (almost)
independently of the allocators of other threads. In particular, the "thread
A allocator" doesn't have to deallocate blocks of memory allocated by the
"thread B allocator", because the data between threads is copied via the
message queues.
4. That is, a data structure is serialized into a queue. The sequence of bytes
is passed to another thread, where it is deserialized and the structure is
reconstructed (see the sketch below).
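For illustration, a minimal sketch of the kind of synchronized queue this
design relies on, assuming POSIX threads; the class name and the use of
std::string as the carrier of serialized bytes are illustrative choices:

#include <pthread.h>
#include <deque>
#include <string>

// The ONLY synchronization point between two threads in this design:
// serialized messages (byte sequences) are copied in and copied out.
class msg_queue {
    std::deque<std::string> q_;
    pthread_mutex_t mtx_;
    pthread_cond_t  cond_;
public:
    msg_queue()  { pthread_mutex_init(&mtx_, 0); pthread_cond_init(&cond_, 0); }
    ~msg_queue() { pthread_cond_destroy(&cond_); pthread_mutex_destroy(&mtx_); }

    void push(const std::string& bytes) {      // called by the producer
        pthread_mutex_lock(&mtx_);
        q_.push_back(bytes);                   // the bytes are COPIED across
        pthread_cond_signal(&cond_);
        pthread_mutex_unlock(&mtx_);
    }

    std::string pop() {                        // called by the consumer
        pthread_mutex_lock(&mtx_);
        while (q_.empty())
            pthread_cond_wait(&cond_, &mtx_);
        std::string bytes = q_.front();
        q_.pop_front();
        pthread_mutex_unlock(&mtx_);
        return bytes;                          // deserialize on this side
    }
};

Only the flat byte sequence crosses the thread boundary; each thread
reconstructs structured data using its own allocator.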

Now your GC/C++:
1. Doesn't have the guarantee that allocated memory blocks are always
deallocated in the same thread. So it must implement some kind of "stop the
world" algorithm.
2. "Stop the world" algorithms definitely kill the concurrency. In
particular, they have to cross the thread bounds so they don't "ALLOW FOR
concurrent or simultaneous execution".

The bottom line is clear: GC/C++ _degrades_ the performance of well-designed
C++ MT programs!
--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

kanze

Mar 21, 2006, 9:10:51 PM
Sergey P. Derevyago wrote:
> "Andrei Alexandrescu (See Website For Email)" wrote:
> > Garbage collection is desirable because:

> > (3) Can improve performance, particularly in multithreaded environments

> Explain, please!

> As it was stated by David Butenhof: "multithreading is
> defined by an application design that ALLOWS FOR concurrent or
> simultaneous execution". And if we're designing a C++ MT
> program, then:

> 1. The application must have several threads of control
> that work almost independently. They use synchronized message
> queues to pass messages from one thread to another and
> virtually NO SYNCHRONIZATION is supposed outside these message
> queues!

This depends on the application. Other applications use highly
synchronized threads to parallelize the work.

> 2. Note please, this design is not about "lock-free
> programming". The point is that different threads don't have
> to modify shared data outside the opaque message queues. These
> threads really "ALLOW FOR concurrent or simultaneous
> execution".

It depends on what you mean by "shared data". In most of the
applications I've worked on, objects migrate from one thread to
another. Via message queues, but what is copied is a pointer,
not the entire object. (I don't see any way to do it otherwise
when the objects are polymorphic, which is generally the case.)
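For illustration, a minimal sketch of this style; sync_queue is a
hypothetical synchronized queue of pointers (interface only), and the
message types are mine:

template<class T> struct sync_queue { void push(T v); T pop(); };

struct Message { virtual ~Message() {} virtual void handle() = 0; };
struct Tick : Message { void handle() { /* react to the tick */ } };

void producer(sync_queue<Message*>& q) {
    q.push(new Tick);          // allocated in thread A
}

void consumer(sync_queue<Message*>& q) {
    Message* m = q.pop();      // only the pointer crossed the boundary
    m->handle();               // full polymorphic object, no reconstruction
    delete m;                  // freed in thread B: the allocator must cope
}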

> 3. As a result, every thread has its own memory
> allocator that works (almost) independently from the
> allocators of other threads. In particular, the "thread A
> allocator" doesn't have to deallocate blocks of memory
> allocated by the "thread B allocator" because the data between
> threads is copied via the message queues.

In which system? Under what conditions? This is NOT the way
things work under Solaris, for example, where there is a common
allocator for all of the threads.

In practice, it's true that a large number of objects are
allocated and freed in the same thread. But there are also a
certain number which are allocated in one thread, and freed in
another. All of the message objects that you pass in the queue,
for example. (Not necessarily, of course. I tend to recycle
the message object for the response, so a message will be
created in thread A, live most of its life in thread B, and then
be sent back to thread A shortly before it dies.)

> 4. That is a data structure is serialized into a queue.
> The sequence of bytes are passed to another thread. The bytes
> are deserialized and the structure gets reconstructed in this
> thread.

That sounds like a horribly inefficient way of doing things.
The whole point of using threads, instead of separate processes,
is that we don't have to serialize complex objects in order to
pass them between threads.

> Now your GC/C++:

> 1. Doesn't have the guarantee that allocated memory
> blocks are always deallocated in the same thread.

Of course it doesn't, since that doesn't correspond to any
realistic real-life scenario.

> So it must
> implement some kind of "stop the world" algorithm.

> 2. "Stop the world" algorithms definitely kill the
> concurrency. In particular, they have to cross the thread
> bounds so they don't "ALLOW FOR concurrent or simultaneous
> execution".

> The bottom line is clear: GC/C++ _degrades_ the
> performance of well-designed C++ MT programs!

That's your theory. The actual experimental facts disagree. Do
we change the facts, or your theory?

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Andrei Alexandrescu (See Website For Email)

Mar 21, 2006, 9:15:49 PM
Sergey P. Derevyago wrote:
> "Andrei Alexandrescu (See Website For Email)" wrote:
>
>>Garbage collection is desirable because:
>>
>>(3) Can improve performance, particularly in multithreaded environments
>>
>
> Explain, please!
>
> As it was stated by David Butenhof: "multithreading is defined by an
> application design that ALLOWS FOR concurrent or simultaneous execution". And
> if we're designing a C++ MT program, then:
> 1. The application must have several threads of control that work almost
> independently. They use synchronized message queues to pass messages from one
> thread to another and virtually NO SYNCHRONIZATION is supposed outside these
> message queues!

Ehm, yah, fine. Why the exclamation? No need to get overexcited around
here :o).

> 2. Note please, this design is not about "lock-free programming". The point
> is that different threads don't have to modify shared data outside the opaque
> message queues. These threads really "ALLOW FOR concurrent or simultaneous
> execution".

There is the concept of "almost non-blocking" where the amount of
synchronization is very low. It's not a black-and-white issue.

> 3. As a result, every thread has its own memory allocator that works (almost)
> independently from the allocators of other threads. In particular, the "thread
> A allocator" doesn't have to deallocate blocks of memory allocated by the
> "thread B allocator" because the data between threads is copied via the
> message queues.

Stop right there. This is but one allocator design, and has been
discussed a lot in MT circles. It has the disadvantage that memory
freed to one thread's private heap cannot be used by any other thread, so
the design can (in the worst case) consume k times more memory than a
design with a shared allocator, where k is the number of threads.

Some applications can use the model successfully due to the nature of
their threads and their allocation patterns. Some can't. But you can't
possibly claim that that's the one possible design of an MT allocator.
The variety of MT allocator designs is hard proof.

> The bottom line is clear: GC/C++ _degrades_ the performance of well-designed
> C++ MT programs!

I think the way this argument was constructed is bogus. In order to
avoid spreading myself too thin, however, I won't comment on MT aspects
of GC any further.


Andrei

David Schwartz

Mar 21, 2006, 9:25:12 PM

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:441FE502...@iobox.com...

> The bottom line is clear: GC/C++ _degrades_ the performance of
> well-designed
> C++ MT programs!

For your very peculiar notion of "well-designed". I could not disagree
with this notion more strongly. In the vast majority of cases, having
objects 'belong' to threads is very bad design.

For example, if object 'A' belongs to thread 'A', and object 'A'
associates with some real-world thing that needs services, and only thread
'A' can manipulate object 'A', that means that if thread 'A' *ever* blocks,
object 'A' will be starved. There are reasons threads might block that are
beyond your control (such as page faults), so I would argue that for most
real-world applications, this type of design is actually very bad.

DS

Sergey P. Derevyago

Mar 22, 2006, 10:56:26 AM
kanze wrote:
> > 1. The application must have several threads of control
> > that work almost independently. They use synchronized message
> > queues to pass messages from one thread to another and
> > virtually NO SYNCHRONIZATION is supposed outside these message
> > queues!
>
> This depends on the application. Other applications use highly
> synchronized threads to parallelize the work.
>
If some code creates threads that don't allow for parallel simultaneous
execution, then it's bad MT code (by definition).
Some other technique (say, a state machine) should be applied.

> > 2. Note please, this design is not about "lock-free
> > programming". The point is that different threads don't have
> > to modify shared data outside the opaque message queues. These
> > threads really "ALLOW FOR concurrent or simultaneous
> > execution".
>
> It depends on what you mean by "shared data".
>

The definition is standard and comes from the "Memory Synchronization"
section:
"Applications shall ensure that access to any memory location by more than
one thread of control (threads or processes) is restricted such that no thread
of control can read or modify a memory location while another thread of
control may be modifying it. Such access is restricted using functions that
synchronize thread execution and also synchronize memory with respect to other
threads."

> In most of the
> applications I've worked on, objects migrate from one thread to
> another. Via message queues, but what is copied is a pointer,
> not the entire object. (I don't see any way to do it otherwise
> when the objects are polymorphic, which is generally the case.)
>

The answer is serialization and reconstruction.

> > 3. As a result, every thread has its own memory
> > allocator that works (almost) independently from the
> > allocators of other threads. In particular, the "thread A
> > allocator" doesn't have to deallocate blocks of memory
> > allocated by the "thread B allocator" because the data between
> > threads is copied via the message queues.
>
> In which system? Under what conditions? This is NOT the way
> things work under Solaris, for example, where there is a common
> allocator for all of the threads.
>

I'm talking about (thread-bound) user-defined memory allocators.

> In practice, it's true that a large number of objects are
> allocated and freed in the same thread. But there are also a
> certain number which are allocated in one thread, and freed in
> another. All of the message objects that you pass in the queue,
> for example. (Not necessarily, of course. I tend to recycle
> the message object for the response, so a message will be
> created in thread A, live most of its life in thread B, and then
> be sent back to thread A shortly before it dies.)
>
> > 4. That is a data structure is serialized into a queue.
> > The sequence of bytes are passed to another thread. The bytes
> > are deserialized and the structure gets reconstructed in this
> > thread.
>
> That sounds like a horribly inefficient way of doing things.
>

The impression is misleading. Data that really have to cross thread
boundaries are pretty rare, and the copying and reconstruction can cost less
than one system call.

> The whole point of using threads, instead of separate processes,
> is that we don't have to serialize complex objects in order to
> pass them between threads.
>

This statement is also misleading.

> > The bottom line is clear: GC/C++ _degrades_ the
> > performance of well-designed C++ MT programs!
>
> That's your theory. The actual experimental facts disagree. Do
> we change the facts, or your theory?
>

It's not a theory.
The point is that your definition of a well-designed MT program is not "code
that allows for parallel simultaneous execution".


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Sergey P. Derevyago

Mar 22, 2006, 10:57:14 AM
"Andrei Alexandrescu (See Website For Email)" wrote:
> > 3. As a result, every thread has its own memory allocator that works (almost)
> > independently from the allocators of other threads. In particular, the "thread
> > A allocator" doesn't have to deallocate blocks of memory allocated by the
> > "thread B allocator" because the data between threads is copied via the
> > message queues.
>
> Stop right there.
>
No problem.

> This is but one allocator design, and has been
> discussed a lot in MT circles. It has the disadvantage that memory
> allocated but freed by one thread cannot be used by anyone else, so the
> design can (in the worst case) consume proportional with k times more
> memory than a design with a shared allocator, where k is the number of
> threads.
>

It seems like you have the wrong answer to the question "How many threads
should my application create?"

> Some applications can use the model successfully due to the nature of
> their threads and their allocation patterns. Some can't. But you can't
> possibly claim that that's the one possible design of an MT allocator.
> The variety of MT allocator designs is hard proof.
>

Please tell me about a different allocator design that also allows for
parallel simultaneous execution but imposes less overhead.

> > The bottom line is clear: GC/C++ _degrades_ the performance of well-designed
> > C++ MT programs!
>
> I think the way this argument was constructed is bogus. In order to
> avoid spreading myself too thin, however, I won't comment on MT aspects
> of GC any further.
>

Pity.


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Chris Thomasson

Mar 22, 2006, 10:55:29 AM
Sergey P. Derevyago wrote:
> "Andrei Alexandrescu (See Website For Email)" wrote:
>
>>Garbage collection is desirable because:
>>
>>(3) Can improve performance, particularly in multithreaded environments
>>
>
> Explain, please!
>
> As it was stated by David Butenhof: "multithreading is defined by an
> application design that ALLOWS FOR concurrent or simultaneous execution". And
> if we're designing a C++ MT program, then:
> 1. The application must have several threads of control that work almost
> independently. They use synchronized message queues to pass messages from one
> thread to another and virtually NO SYNCHRONIZATION is supposed outside these
> message queues!

There is a way to achieve a highly distributed virtually zero-overhead
messaging environment!

;)


No atomic operations... Simple loads and stores coupled with a
StoreStore style memory barrier for producers, and dependent-load
ordering for consumers:

http://appcore.home.comcast.net/

One could implement a distributed messaging service by using sets of
per-thread single producer/consumer queues. A set of queues can allow
for virtually zero-overhead, bidirectional communications between two
threads. If a thread has more than one set of queues, it can organize
bidirectional communications with a plurality of application threads by
simply applying a multiplexing algorithm...
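For a flavor of what such a queue looks like, here is a toy
single-producer/single-consumer ring; C++11 atomics (which postdate this
thread) are used only to express the StoreStore/dependent-load ordering
portably, and all names are illustrative:

#include <atomic>
#include <cstddef>

template<class T, std::size_t N>
class spsc_ring {
    T buf_[N];
    std::atomic<std::size_t> head_;   // written by the producer only
    std::atomic<std::size_t> tail_;   // written by the consumer only
public:
    spsc_ring() : head_(0), tail_(0) {}

    bool push(const T& v) {           // producer thread only, no atomic RMW
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N)
            return false;             // full
        buf_[h % N] = v;
        head_.store(h + 1, std::memory_order_release); // "StoreStore" here
        return true;
    }

    bool pop(T& v) {                  // consumer thread only, no atomic RMW
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (head_.load(std::memory_order_acquire) == t)
            return false;             // empty
        v = buf_[t % N];
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
};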


> 2. Note please, this design is not about "lock-free programming". The point
> is that different threads don't have to modify shared data outside the opaque
> message queues. These threads really "ALLOW FOR concurrent or simultaneous
> execution".

There are algorithms and usage patterns that can provide the means for
efficient data-sharing throughout a plurality of application threads.
For instance, certain lock-free reader patterns coupled with an
efficient solution to the reader/writer problem ( VZOOM or Joe Seigh's
SMR-RCU hybrid ) can provide virtually zero-overhead access to critical
shared data-structures. IMHO, this can be a very important feature:

http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce

IMO, addressing these issues is essential for achieving outstanding
scalability and throughput characteristics on the new multicore systems
that are now available. Algorithms that force threads to frequently use
tons of StoreLoad style memory barriers will not scale well at all on
the new stuff that's out there now, like the UltraSPARC T1, which has an
on-chip high-speed mechanism for enforcing data coherency throughout its
processing cores; Sun calls the mechanism "the Crossbar". An application
that frequently saturates the Crossbar with atomic-operation and/or
StoreLoad membar coherency traffic would be a damn shame, IMHO of course...


> 3. As a result, every thread has its own memory allocator that works (almost)
> independently from the allocators of other threads. In particular, the "thread
> A allocator" doesn't have to deallocate blocks of memory allocated by the
> "thread B allocator" because the data between threads is copied via the
> message queues.

There are algorithms out there that can allow for the implementation of
an efficient distributed per-thread allocation scheme. The memory
allocated in one thread can be deallocated in another thread, no problem:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?q=lock-free+slab+allocator&rnum=1#e3efa5628aad4a82

Once you apply the fixes that I documented, my algorithm happens to
work. IMHO, it can be an excellent solution for per-thread allocation
schemes in general. Also, has anybody seen a reference counting
algorithm out there that is similar to the one I used for my algorithm?


> 4. That is a data structure is serialized into a queue. The sequence of bytes
> are passed to another thread. The bytes are deserialized and the structure
> gets reconstructed in this thread.
>
> Now your GC/C++:
> 1. Doesn't have the guarantee that allocated memory blocks are always
> deallocated in the same thread. So it must implement some kind of "stop the
> world" algorithm.

You should not say "must implement" because that is simple a false
assertion wrt this subject. You do not have to use any type of
stop-the-world algorithm to solve this "problem".


> 2. "Stop the world" algorithms definitely kill the concurrency. In
> particular, they have to cross the thread bounds so they don't "ALLOW FOR
> concurrent or simultaneous execution".

They do allow for parallel execution when they're not "stopping the
world", so to speak...

;)

I am definitely not a fan of the technique in general.

Sergey P. Derevyago

Mar 22, 2006, 6:40:01 PM
Chris Thomasson wrote:
>> 3. As a result, every thread has its own memory allocator that works
>> (almost) independently from the allocators of other threads. In
>> particular, the "thread A allocator" doesn't have to deallocate blocks
>> of memory allocated by the "thread B allocator" because the data
>> between threads is copied via the message queues.
>
> There are algorithms out there that can allow for the implementation of
> an efficient distributed per-thread allocation scheme. The memory
> allocated in one thread can be deallocated in another thread, no problem:
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?q=lock-free+slab+allocator&rnum=1#e3efa5628aad4a82
>
No way: this implementation crosses thread boundaries, so it must serialize
access and synchronize memory, while per-thread allocation needs no
synchronization at all.
For example, here is your synchronization point:

void slab_sys_shared_push( block_t *_this )
{
    block_t *cmp;

    do
    {
        cmp = _this->thread->shared.front;
        /* strip the low "owner is dying" tag bit before linking in */
        _this->next = (block_t*)( (uintptr_t)cmp & ~(uintptr_t)1 );
    }
    while ( ! CAS( &_this->thread->shared.front,
                   cmp,
                   ( (uintptr_t)cmp & 1 )
                       ? (block_t*)( (uintptr_t)_this | 1 )
                       : _this ) );

    /* owner thread already exited: drop our reference, maybe reap it */
    if ( ( (uintptr_t)cmp & 1 ) && XADD( &_this->refs, -1 ) == 1 )
    {
        /* destroy _this->thread */
    }
}

>> 1. Doesn't have the guarantee that allocated memory blocks are always
>> deallocated in the same thread. So it must implement some kind of
>> "stop the world" algorithm.
>
> You should not say "must implement", because that is simply a false
> assertion wrt this subject.
>

I believe you must examine (at least)
1. the registers of all threads,
2. the stacks of all threads, and
3. the whole memory
to make sure that some previously allocated memory block is now
inaccessible.
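For a flavor of what that examination amounts to in a conservative
collector (illustrative only; the two helper functions are hypothetical,
and real collectors such as Boehm's use platform-specific code for the
register and stack roots):

extern bool looks_like_heap_pointer( void* p );  /* hypothetical */
extern void mark_reachable( void* p );           /* hypothetical */

void scan_range( void** lo, void** hi )
{
    /* any word in the range might be a disguised pointer */
    for ( void** p = lo; p < hi; ++p )
        if ( looks_like_heap_pointer( *p ) )
            mark_reachable( *p );
}

/* roots = every thread's saved registers + every thread's stack + all
   static data; the heap is then traced transitively, and the threads
   must not mutate pointers while the scan is in progress. */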

Please show your solution without "stop the world" action.


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Andrei Alexandrescu (See Website For Email)

Mar 23, 2006, 6:04:12 AM
Sergey P. Derevyago wrote:
> Please tell me about different allocator design that also allows for parallel
> simultaneous execution but imposes less overhead.

http://www.research.ibm.com/people/m/michael/pldi-2004.pdf

Andrei

Chris Thomasson

Mar 23, 2006, 6:29:17 AM
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>
>>> 3. As a result, every thread has its own memory allocator that works
>>> (almost) independently from the allocators of other threads. In
>>> particular, the "thread A allocator" doesn't have to deallocate blocks
>>> of memory allocated by the "thread B allocator" because the data
>>> between threads is copied via the message queues.
>>
>> There are algorithms out there that can allow for the implementation of
>> an efficient distributed per-thread allocation scheme. The memory
>> allocated in one thread can be deallocated in another thread, no problem:
>>
>> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?q=lock-free+slab+allocator&rnum=1#e3efa5628aad4a82
>>
>
> No way: this implementation crosses thread boundaries, so it must
> serialize access and synchronize memory, while per-thread allocation
> needs no synchronization at all.

Well, you're correct if the allocation activity is "guaranteed" to be
kept local. However, IMHO, a per-thread allocation scheme should
probably be robust enough to address allocations that frequently cross
thread boundaries. I guess you could augment the presented algorithm
with something like the distributed messaging system that I briefly
described in my previous post. That could reduce overhead down to
#StoreStore and possibly dependent-load barriers. That is fairly
lightweight...


> For example, here is your synchronization point:
>

[...]

Indeed. BTW, thank you for taking the time to look at the pseudo-code.
As you can see, an implementation of this "particular" piece of
pseudo-code would require #StoreStore and #LoadStore|#StoreStore membars
and CAS instructions. However, it only uses the expensive
barrier/atomic-op when a thread frees a memory block that it does not
own. You could apply an internal usage pattern to the algorithm that
reduces the number of times a thread actually needs to linearize itself
at this synchronization point: set up a scheme where threads push an
arbitrary number of memory blocks at a time. A thread could fill a cache
of memory blocks up to a desired limit "before" it decides to "push the
entire cache" onto the owner thread's lock-free LIFO in a single
atomic op. As you can see, this simple usage pattern has the potential
to reduce synchronizations throughout the presented algorithm in general.

If you apply this usage pattern to a per-thread allocation
implementation that uses low-overhead queues as a means of
communication, you could possibly reduce synchronization overhead down
to virtually zero. Therefore, there could be a marked improvement in
scalability...
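A toy sketch of that cache-then-flush pattern (C++11 atomics are
anachronistic here, and all names are illustrative):

#include <atomic>
#include <cstddef>

struct block { block* next; /* payload elided */ };

// One cache per (freeing thread, owner thread) pair; built with NO
// synchronization, then handed over in a single atomic operation.
struct remote_cache {
    block* head;
    block* tail;
    std::size_t count;

    remote_cache() : head(0), tail(0), count(0) {}

    void add(block* b) {              // every remote free is thread-local
        b->next = head;
        if (!head) tail = b;
        head = b;
        ++count;
    }

    // push the ENTIRE cache onto the owner's lock-free LIFO in one CAS loop
    void flush(std::atomic<block*>& owner_lifo) {
        if (!head) return;
        block* top = owner_lifo.load(std::memory_order_relaxed);
        do {
            tail->next = top;         // splice our list in front of the top
        } while (!owner_lifo.compare_exchange_weak(
                     top, head,
                     std::memory_order_release, std::memory_order_relaxed));
        head = tail = 0;
        count = 0;
    }
};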


>>> 1. Doesn't have the guarantee that allocated memory blocks are always
>>> deallocated in the same thread. So it must implement some kind of
>>> "stop the world" algorithm.
>>
>> You should not say "must implement", because that is simply a false
>> assertion wrt this subject.
>>
>
> I believe you must examine (at least):
> 1. The registers of all threads.
> 2. The stack of all threads.
> 3. The whole memory.

This is not true for all types of garbage collection algorithms.


> to make sure that some previously allocated memory block is now
> inaccessible.
>
> Please show your solution without "stop the world" action.

Well, there are some proxy garbage collectors that I developed that do
not use stop the world:


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/1216861ac568b2be/b28bdb0051e2c2c6?lnk=st&q=vzoom&rnum=1#b28bdb0051e2c2c6

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/dc39b17aed1220c3/4bb0a3fb0339550c?q=proxy+gc&rnum=1#4bb0a3fb0339550c

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/69a72519fc302eff/de500aec6ee67dea?q=proxy+gc&rnum=2#de500aec6ee67dea

http://groups.google.com/group/comp.programming.threads/msg/c9ddb529ad35199c

http://home.comcast.net/~vzoom/demos/pc_sample.c
(one of my lock-free proxy gc impls)

http://groups.google.com/group/comp.programming.threads/msg/59e9b6e427b4a144

http://groups.google.com/group/comp.programming.threads/msg/b7814ee86cd4b54d

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7dda9ad554a976a6/60110614c07b0170?lnk=st&q=vzoom&rnum=6#60110614c07b0170

http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/65c9c2673682d4cf/bc1506fb6d0e3ba7?lnk=st&q=vzoom&rnum=3#bc1506fb6d0e3ba7


Please note that the presented algorithms, usage patterns, and virtually
every other algorithm that has anything to do with this kind of stuff
has patents, or patent applications that are pending... Patent activity
in this area of computing is remarkable.

Joe Seigh

Mar 23, 2006, 9:49:17 AM
Andrei Alexandrescu (See Website For Email) wrote:
> Sergey P. Derevyago wrote:
>
>> Please tell me about different allocator design that also allows for parallel
>>simultaneous execution but imposes less overhead.
>
>
> http://www.research.ibm.com/people/m/michael/pldi-2004.pdf
Also some stuff on RCU here
http://www.rdrop.com/users/paulmck/RCU/

There's a prototype SMR+RCU implementation here. Also one for
a prescient implementation of lock-free reference counting.
(see patent application
20060037026 "Lightweight reference counting using single-target synchronization",
the illustrations are particularly illuminating)

http://atomic-ptr-plus.sourceforge.net/


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Sergey P. Derevyago

Mar 23, 2006, 9:54:12 AM
Chris Thomasson wrote:
> > I believe you must examine (at least):
> > 1. The registers of all threads.
> > 2. The stack of all threads.
> > 3. The whole memory.
>
> This is not true for all types of garbage collection algorithms.
>
Here we're talking about conventional C++ programs.
Please explain why you don't have to examine the registers, the stacks,
and the whole memory.

> > Please show your solution without "stop the world" action.
>
> Well, there are some proxy garbage collectors that I developed that do
> not use stop the world:
>

Can they be applied to conventional C++ programs?


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Sergey P. Derevyago

Mar 23, 2006, 9:54:56 AM
"Andrei Alexandrescu (See Website For Email)" wrote:
> > Please tell me about different allocator design that also allows for parallel
> > simultaneous execution but imposes less overhead.
>
> http://www.research.ibm.com/people/m/michael/pldi-2004.pdf
>
The paper is flawed. In particular, it states:

"For example, it is trivial to design a wait-free allocator with pure
per-thread private heaps. That is, each thread allocates from its own heap and
also frees blocks to its own heap."

Yes, this is the way to go.

"However, this is hardly an acceptable general-purpose solution, as it can
lead to unbounded memory consumption (e.g., under a producer-consumer pattern
[3]), even when the program's memory needs are in fact very small."

This statement is incorrect: if you pass messages via serialization then no
"unbounded memory consumption (e.g., under a producer-consumer pattern)" is
possible.

P.S. I'm going to read further. Probably more comments will appear in a few
days...


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Sergey P. Derevyago

Mar 23, 2006, 5:15:04 PM
Joe Seigh wrote:
> Also some stuff on RCU here
> http://www.rdrop.com/users/paulmck/RCU/
>
Yet another flaw from the very beginning:

"RCU OVERVIEW" section of
http://www.rdrop.com/users/paulmck/RCU/whatisRCU.html tells:

"The basic idea behind RCU is to split updates into "removal" and
"reclamation" phases. The removal phase removes references to data items
within a data structure (possibly by replacing them with references
to new
versions of these data items), and can run concurrently with readers.
The
reason that it is safe to run the removal phase concurrently with
readers is
the semantics of modern CPUs guarantee that readers will see either
the old or
the new version of the data structure rather than a partially updated
reference."

That is, atomic pointers are a must.
You can't find an atomic-pointer guarantee in POSIX, so the show is over.


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Chris Thomasson

Mar 24, 2006, 7:58:58 AM
Andrei Alexandrescu (See Website For Email) wrote:
> Sergey P. Derevyago wrote:
>
>> Please tell me about different allocator design that also allows for parallel
>>simultaneous execution but imposes less overhead.
>
>
> http://www.research.ibm.com/people/m/michael/pldi-2004.pdf
>

Here is one of my contributions to this:

http://home.comcast.net/~vzoom/demos/pc_sample.c

You can apply this 100% working implementation to a proxy garbage
collection pattern that turns out to be compatible with RCU. RCU is a
proxy collector at heart. Basically, the general pattern boils down to
something like this:

/* readers */
acquire pc_ref from collector
access nodes of shared data-structures
release pc_ref to collector


/* writers */
atomically make node unreachable
defer removed node to the collector


/* collector callback app */
void pc_callback( void *s )
{
user_state_t *us = s;

/* free to destroy us */
}


Interestingly, the presented collector code (pc_sample.c) is very
flexible. There are a number of small tweaks that, if applied to the
presented code, could provide the means for:


Lock-Free Reference Counting
Lock-Free Process Level Smart Pointers
Lock-Free Software Transactional Memory
Lock-Free GC-Memory Allocation


I can provide some examples if you want; just ask. There are a couple
more things you can do with it, but first of all, are you familiar with
proxy garbage collection techniques and usage patterns in general? If
not, Joe Seigh and I can probably answer any questions you, or others,
may have. It kind of seems like we are the only two researchers out
there developing these kinds of algorithms. Does anybody know of any
development going on that involves a firm understanding of proxy gc?

Chris Thomasson

Mar 24, 2006, 11:17:26 AM
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>
>>> I believe you must examine (at least):
>>>1. The registers of all threads.
>>>2. The stack of all threads.
>>>3. The whole memory.
>>
>>This is not true for all types of garbage collection algorithms.
>>
>
> Here we talk about conventional C++ programs.
> Explain please, why you don't have to examine the registers, the stacks and
> the whole memory.

RCU, VZOOM, SMR, RCU+SMR all use techniques that do not involve
registers, stacks, and the whole memory:

RCU - tracks per-thread quiescent states.


VZOOM - can track per-thread arrays of persistent reference counters
after quiescent states. I created this one.


SMR - tracks per-thread static array of "hazard pointers".


RCU+SMR - tracks per-thread static array of hazard pointers after
quiescent states. Joe Seigh combined RCU and SMR into hybrid.

All of the tracking and bookkeeping can be executed in parallel with the
application threads; no stop-the-world, etc...
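As one concrete flavor, a toy sketch of the hazard-pointer idea behind
SMR (heavily simplified: fixed thread count, one hazard slot per thread,
C++11 atomics for brevity; see Michael's papers for the real algorithm):

#include <atomic>

const int MAX_THREADS = 16;
std::atomic<void*> hazard[MAX_THREADS];   // published per-thread pointers

// Reader: publish the pointer before using it, then re-validate it.
void* acquire(std::atomic<void*>& shared, int tid) {
    void* p;
    do {
        p = shared.load(std::memory_order_acquire);
        hazard[tid].store(p, std::memory_order_seq_cst); // announce hazard
    } while (shared.load(std::memory_order_seq_cst) != p); // re-check
    return p;                   // protected until the hazard slot is cleared
}

// Reclaimer: a retired block may be freed only when no thread's hazard
// slot still points at it -- no scan of stacks or registers required.
bool may_free(void* p) {
    for (int i = 0; i < MAX_THREADS; ++i)
        if (hazard[i].load(std::memory_order_seq_cst) == p)
            return false;
    return true;
}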

Chris Thomasson

Mar 24, 2006, 11:14:55 AM
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>
>>> I believe you must examine (at least):
>>>1. The registers of all threads.
>>>2. The stack of all threads.
>>>3. The whole memory.
>>
>>This is not true for all types of garbage collection algorithms.
>>
>
> Here we talk about conventional C++ programs.
> Explain please, why you don't have to examine the registers, the stacks and
> the whole memory.

Proxy garbage collectors don't need to track registers, stacks, or the
whole memory. Instead they track the individual states of a plurality of
threads' "collected regions". The regions can be checked in a number of
different ways, none of which involve tracking registers, stacks, or the
whole memory. Compatible forms of atomic reference counting, or separate
polling logic, can be used to determine when a plurality of
collected regions are "quiescent" and the associated generations of
deferred requests can be processed.


>
>
>>> Please show your solution without "stop the world" action.
>>
>>Well, there are some proxy garbage collectors that I developed that do
>>not use stop the world:
>>
>
> Can they be applied to conventional C++ programs?

Yes. They can be applied to C++ as it exists today. No modifications
are necessary. I have a patent pending on a solution that can use POSIX
mutexes to create an environment that supports virtually zero-overhead
object management (VZOOM).

Dag-Erling Smørgrav

Mar 24, 2006, 11:20:58 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> writes:
> Yet another flaw [in RCU] from the very beginning:
> [quote from http://www.rdrop.com/users/paulmck/RCU/whatisRCU.html]

> You can't find the atomic pointers guarantee in POSIX so the show is
> over.

Given that the article you quote describes RCU within the Linux
kernel, POSIX is irrelevant.

DES
--
Dag-Erling Smørgrav - d...@des.no

Sergey P. Derevyago

Mar 24, 2006, 6:33:39 PM
Chris Thomasson wrote:
>>>> I believe you must examine (at least):
>>>> 1. The registers of all threads.
>>>> 2. The stack of all threads.
>>>> 3. The whole memory.
>>>
>> Here we're talking about conventional C++ programs.
>> Please explain why you don't have to examine the registers, the
>> stacks, and the whole memory.
>
> Proxy garbage collectors don't need to track registers, stacks or the
> whole memory.
>
Sorry, I don't understand how this could be possible.
Suppose you have a regular C++ program that heavily uses regular raw
pointers. It keeps them in the registers, on the stacks, and on the
free store. Now you can take Boehm GC and plug it in instead of
malloc/free.

The question is: Can you take XXX proxy GC and plug it in just like
Boehm GC? Where can I get this XXX proxy GC to use in my C++ code?


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Joe Seigh

Mar 25, 2006, 6:14:52 AM
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>>Proxy garbage collectors don't need to track registers, stacks or the
>>whole memory.
>>
>
> Sorry, don't understand how this could be possible.
> Suppose you have a regular C++ program that heavily uses regular raw
> pointers. It keeps them in the registers, on the stacks and on the
> free store.

For proxy GC like RCU there are two kinds of references: global references
in shared storage (free store or the heap), and local references (stack
or registers). The former are handled conventionally, and when a writer
removes the last global reference, so that an object isn't reachable from
shared memory, it waits until all threads have executed a quiescent state
at least once, and then frees the object. Since by RCU convention threads
won't hold a reference to an RCU-managed object across a quiescent state,
it's safe to free the object at that point.
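A minimal sketch of that convention, borrowing the Linux kernel's RCU
primitive names for illustration (the list handling and writer-side
locking are elided; this is not portable user-space code):

struct item { int key; int value; item* next; };

int reader_find(item* head, int key, int* out)
{
    rcu_read_lock();                   /* enter read-side critical section */
    for (item* p = rcu_dereference(head); p; p = rcu_dereference(p->next))
        if (p->key == key) {
            *out = p->value;           /* copy out while still protected */
            rcu_read_unlock();
            return 1;
        }
    rcu_read_unlock();                 /* a quiescent state may follow */
    return 0;
}

/* caller holds the writer lock and has found the victim's predecessor */
void writer_remove(item** prev_next, item* victim)
{
    rcu_assign_pointer(*prev_next, victim->next); /* removal phase */
    synchronize_rcu();  /* wait until every thread has passed through a
                           quiescent state at least once */
    free(victim);       /* reclamation phase: no reader can still see it */
}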

> Now you can take Boehm GC and plug it instead of malloc/free.

No. Code that uses malloc/free doesn't have to zero pointers. Simply
"plugging" in Boehm GC won't work.


>
> The question is: Can you take XXX proxy GC and plug it just like
> Boehm GC?

No, things like RCU, etc... aren't used for storage management. They're
used for read lock-free solutions to the reader/writer problem in a
multi-threaded environment.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.


Chris Thomasson

Mar 27, 2006, 9:39:23 AM
Joe Seigh wrote:
> Sergey P. Derevyago wrote:
>
>> Chris Thomasson wrote:
>>
>>> Proxy garbage collectors don't need to track registers, stacks or the
>>> whole memory.
>>
>> Sorry, I don't understand how this could be possible.
>> Suppose you have a regular C++ program that heavily uses regular raw
>> pointers. It keeps them in the registers, on the stacks, and on the
>> free store.
>
>
> For proxy GC like RCU there are two kinds of references: global
> references in shared storage (free store or the heap), and local
> references (stack or registers). The former are handled conventionally,
> and when a writer removes the last global reference, so that an object
> isn't reachable from shared memory, it waits until all threads have
> executed a quiescent state at least once, and then frees the object.

You don't really have to track all of the threads. You can choose to
track quiescent states per-data-structure...


> Since by RCU convention threads
> won't hold a reference to an RCU-managed object across a quiescent
> state, it's safe to free the object at that point.

I feel that I should point out that there are algorithms that allow for
an arbitrary number of references to persist across quiescent states.
This ability can simplify creating content for proxy collectors by not
having to worry whether a function call is going to swing the thread into
a quiescent state or not...

Chris Thomasson

Mar 27, 2006, 9:43:03 AM
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>
>>>>> I believe you must examine (at least):
>>>>> 1. The registers of all threads.
>>>>> 2. The stack of all threads.
>>>>> 3. The whole memory.
>>>>
>>> Here we're talking about conventional C++ programs.
>>> Please explain why you don't have to examine the registers, the
>>> stacks, and the whole memory.
>>
>> Proxy garbage collectors don't need to track registers, stacks or the
>> whole memory.
>>
>
> Sorry, don't understand how this could be possible.
> Suppose you have a regular C++ program that heavily uses regular raw
> pointers. It keeps them in the registers, on the stacks and on the
> free store.
> Now you can take Boehm GC and plug it instead of malloc/free.
>
> The question is: Can you take XXX proxy GC and plug it just like
> Boehm GC?

No. You can usually use them to replace read/write mutexes with lock-free
reader patterns. They can be used for other stuff as well. However,
before I get into any of that, I would suggest that you familiarize
yourself with basic RCU usage patterns. Then try to apply a solution to
an experimental test case.

I posted a link to my collector code in a previous post. It has the same
basic pattern as RCU. I really do need to post some documentation on
it... Anyway, after you study up on RCU, if you're still interested, we
can help you set up a solution...

Sergey P. Derevyago

Mar 27, 2006, 1:02:43 PM
Joe Seigh wrote:
> > The question is: Can you take XXX proxy GC and plug it just like
> > Boehm GC?
>
> No, things like RCU, etc... aren't used for storage management. They're
> used for read lock-free solutions to the reader/writer problem in a
> multi-threaded environment.
>
So I was right: there is no GC/C++ that doesn't "stop the world".
And it seems like it's not possible for any GC/C++ not to "stop the world".

--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Joe Seigh

Mar 28, 2006, 10:07:37 AM
Sergey P. Derevyago wrote:
> Herb Sutter wrote:
>
>>> 1. Atomic pointers should not be assumed available.
>>
>> I'm stunned. Of course they certainly should be.
>>
>
> This one is a key point of our miscommunication.
> POSIX is the standard in MT area. It allows you to write pretty good
> _portable_ MT programs and POSIX doesn't require atomic pointers.
>
>
>> Their availability is a
>> basic assumption; all lock-free work relies on this kind atomicity
>> guarantee, plus an operation like CAS (or LL/SC).
>>
>
> I don't speak about lock-free tricks. Lock-free tricks are not as
> useful as one can suppose. Also they are not required for well-designed
> MT code to be good in both performance and scalability.


Well, you're perfectly welcome to submit a solution to the "Global lock
contention problem" on c.p.t. That's where, to reduce lock contention, a
global mutex on a large data structure is replaced with finer-granularity
locking on the data structure nodes. Except that you still need to globally
lock the large data structure to safely access the nodes, get a node lock,
and release the global lock. Even a rwlock as a global lock may not work
well, especially if the rwlock uses mutexes internally in its
implementation.
What some programmers do is just not bother with a global lock and go
with no locking or synchronization at all. So the program blows up once in
a while on a race condition. Just close the problem report as "non
reproducible". At least the code has good "performance and scalability".

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.


Sergey P. Derevyago

Mar 28, 2006, 2:36:44 PM
Joe Seigh wrote:
> > I don't speak about lock-free tricks. Lock-free tricks are not as
> > useful as one can suppose. Also they are not required for well-designed
> > MT code to be good in both the performance and scalability.
> >
> Well, you're perfectly welcome to submit a solution to the "Global
> lock contention problem" on c.p.t.
>
Sorry, that's not the problem I started to discuss:

1. Andrei Alexandrescu wrote:
-----------------------------


Garbage collection is desirable because:
(3) Can improve performance, particularly in multithreaded environments

2. I asked him to explain:
--------------------------
Now your GC/C++:


1. Doesn't have the guarantee that allocated memory blocks are always
deallocated in the same thread. So it must implement some kind of "stop the
world" algorithm.

2. "Stop the world" algorithms definitely kill the concurrency. In
particular, they have to cross the thread bounds so they don't "ALLOW FOR
concurrent or simultaneous execution".

The bottom line is clear: GC/C++ _degrades_ the performance of
well-designed C++ MT programs!

3. He replied:
--------------


> 3. As a result, every thread has its own memory allocator that works (almost)
> independently from the allocators of other threads. In particular, the "thread
> A allocator" doesn't have to deallocate blocks of memory allocated by the
> "thread B allocator" because the data between threads is copied via the
> message queues.

Stop right there. This is but one allocator design, and has been

discussed a lot in MT circles. It has the disadvantage that memory
allocated but freed by one thread cannot be used by anyone else, so the
design can (in the worst case) consume proportional with k times more
memory than a design with a shared allocator, where k is the number of
threads.

Some applications can use the model successfully due to the nature of

their threads and their allocation patterns. Some can't. But you can't
possibly claim that that's the one possible design of an MT allocator.
The variety of MT allocator designs is hard proof.

4. I replied:
-------------


Please tell me about different allocator design that also allows for
parallel simultaneous execution but imposes less overhead.

5. He replied:
--------------
http://www.research.ibm.com/people/m/michael/pldi-2004.pdf

6. I replied:
-------------


The paper is flawed. In particular, it states:

"For example, it is trivial to design a wait-free allocator with pure
per-thread private heaps. That is, each thread allocates from its own heap and
also frees blocks to its own heap."

Yes, this is the way to go.

"However, this is hardly an acceptable general-purpose solution, as it can
lead to unbounded memory consumption (e.g., under a producer-consumer pattern
[3]), even when the program's memory needs are in fact very small."

This statement is incorrect: if you pass messages via serialization
then no "unbounded memory consumption (e.g., under a producer-consumer
pattern)" is possible.

-------------

OK, what we have now is:
1. A per-thread C++ allocator (which doesn't have to deallocate memory blocks
allocated in other threads) definitely beats any possible GC/C++ collector
(one that can be plugged into regular C++ programs and therefore has to "stop
the world").
2. So "GC/C++ _degrades_ the performance of well-designed C++ MT programs".


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net


Herb Sutter

Mar 28, 2006, 3:13:16 PM
>>>Sergey P. Derevyago wrote:
>>>> 1. Atomic pointers should not be assumed available.

>> Herb Sutter wrote:
>>> I'm stunned. Of course they certainly should be.

>Sergey P. Derevyago wrote:
>> This one is a key point of our miscommunication.

I agree. I'm glad we found the disconnect.

>> POSIX is the standard in MT area. It allows you to write pretty good
>> _portable_ MT programs and POSIX doesn't require atomic pointers.

Citing POSIX as the authority won't necessarily get you far. :-) Now that
Joe has cross-posted this to c.p.l you might get more details on both
topics (lock-free and POSIX).

>> Herb Sutter wrote:
>>> Their availability is a
>>> basic assumption; all lock-free work relies on this kind
>>> atomicity guarantee, plus an operation like CAS (or LL/SC).

>Sergey P. Derevyago wrote:
>> I don't speak about lock-free tricks. Lock-free tricks
>> are not as useful as one can suppose. Also they are not
>> required for well-designed MT code to be
>> good in both the performance and scalability.

For the information of those just tuning in, this thread is about Sergey
making and defending the assertion that ref counted smart pointers should
_not_ use atomic inc/dec of the ref count, and that instead users should
be required to externally lock code like this:

// acquire lock for the object p1 points to

shared_ptr<T> p2 = p1; // p1 is a private shared_ptr<T> instance

// release lock for the object p1 points to

That is, for a given mutable shared T object t, the contention was that
the user is supposed to take a lock not only on every use of t but also
take the same lock on every use of every shared_ptr<T> that points to t.

I asserted that this contention is wrong because it a) is harder for the
programmer, b) is more expensive than the atomic inc/dec would be, and c)
needlessly increases uses of locks when we should be decreasing lock use.
Even though there could be other shared_ptr<T>'s pointing to the same T,
the p1 object itself is private and unshared (no other thread sees p1,
though another thread might have another p3 that refers to the same T as
p1) and so per general good practice the user should not be required to
lock code that just takes a copy of a private object (here the smart
pointer).
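A minimal sketch of the copy path under the atomic-count scheme (C++11
atomics shown for brevity; this toy class is mine, not the standard
shared_ptr):

#include <atomic>

template<class T>
class counted_ptr {
    struct ctrl { T* obj; std::atomic<long> refs; };
    ctrl* c_;
public:
    explicit counted_ptr(T* p) : c_(new ctrl{p, {1}}) {}

    // Copying a PRIVATE counted_ptr needs no external lock, even if other
    // threads hold counted_ptrs to the same T: the count itself is atomic.
    counted_ptr(const counted_ptr& other) : c_(other.c_) {
        c_->refs.fetch_add(1, std::memory_order_relaxed);
    }

    ~counted_ptr() {
        if (c_->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete c_->obj;              // last owner reclaims
            delete c_;
        }
    }

    T* operator->() const { return c_->obj; }

    counted_ptr& operator=(const counted_ptr&) = delete; // elided here
};

Under the contention Herb describes, the increment would instead be a
plain ++refs, and every caller would have to hold the pointee's lock
around each copy.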


Joe Seigh <jsei...@xemaps.com> responded to Sergey:


>Well, you're perfectly welcome to submit a solution to the "Global lock
>contention problem" on c.p.t. That's where, to reduce lock contention, a
>global mutex on a large data structure is replaced with finer-granularity
>locking on the data structure nodes. Except that you still need to
>globally lock the large data structure to safely access the nodes, get a
>node lock, and release the global lock. Even a rwlock as a global lock
>may not work well, especially if the rwlock uses mutexes internally in
>its implementation.
>
>What some programmers do is just not bother with a global lock and go
>with no locking or synchronization at all. So the program blows up once
>in a while on a race condition. Just close the problem report as "non
>reproducible". At least the code has good "performance and scalability".

:-)

Herb

---
Herb Sutter (www.gotw.ca) (www.pluralsight.com/blogs/hsutter)

Convener, ISO WG21 (C++ standards committee) (www.gotw.ca/iso)
Architect, Developer Division, Microsoft (www.gotw.ca/microsoft)

Chris Thomasson

Mar 28, 2006, 5:53:22 PM
Joe Seigh wrote:
> Sergey P. Derevyago wrote:
>
>>Herb Sutter wrote:
>>
>>
>>>> 1. Atomic pointers should not be assumed available.
>>>
>>>I'm stunned. Of course they certainly should be.
>>>
>>
>> This one is a key point of our miscommunication.
>> POSIX is the standard in MT area. It allows you to write pretty good
>>_portable_ MT programs and POSIX doesn't require atomic pointers.
>>
>>
>>
>>>Their availability is a
>>>basic assumption; all lock-free work relies on this kind atomicity
>>>guarantee, plus an operation like CAS (or LL/SC).
>>>
>>
>> I don't speak about lock-free tricks. Lock-free tricks are not as
>> useful as one can suppose. Also they are not required for well-designed
>> MT code to be good in both performance and scalability.

Some of those tricks can be extremely useful. Lock-free reader patterns
are an example... Zero-overhead signaling via eventcounts is another,
etc...

You can also use POSIX mutexes to create a scalable environment that
supports lock-free reader patterns, and much more. The only thing the
environment relies on is dependent load ordering, which is supported
everywhere except DEC Alpha; it needs a rmb. Funny how you can get
low-overhead lock-free algorithms from locks...

:)


BTW, have you ever seen a read/write mutex outperform "any" type of
lock-free reader pattern? If so, I would really appreciate a link or
some quick facts on the algorithm...

David Schwartz

Mar 29, 2006, 5:10:12 AM

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:4427C981...@iobox.com...

> So I was right: there is no GC/C++ that doesn't "stop the world".
> And it seems like it's not possible for any GC/C++ not to "stop the
> world".

The ability of operating systems to garbage collect not-recently-used
physical memory from an application without stopping the world would seem
to show that what you claim cannot exist most certainly can.

DS

Chris Thomasson

Mar 29, 2006, 6:06:21 AM
Sergey P. Derevyago wrote:
> Joe Seigh wrote:
>
>>> The question is: Can you take XXX proxy GC and plug it just like
>>>Boehm GC?
>>
>>No, things like RCU, etc... aren't used for storage management. They're
>>used for read lock-free solutions to the reader/writer problem in a
>>multi-threaded environment.
>>
>
> So I was right: there is no GC/C++ that doesn't "stop the world".
> And it seems like it's not possible for any GC/C++ not to "stop the world".
> --

You don't need to garbage collect everything. Proxy collection can allow
an application to get away from traditional GC and all of the overhead
that comes along with it. You don't need to "stop the world" to solve
the reader/writer problem...

Sean Kelly

Mar 29, 2006, 7:19:11 AM
Sergey P. Derevyago wrote:
> Joe Seigh wrote:
> > > I don't speak about lock-free tricks. Lock-free tricks are not as
> > > useful as one can suppose. Also they are not required for well-designed
> > > MT code to be good in both the performance and scalability.
> > >
> > Well, you're perfectly welcome to submit a solution to the "Global
> > lock contention problem" on c.p.t.
> >
> Sorry, it's not a problem I've started to discuss:
>
> 1. Andrei Alexandrescu wrote:
> -----------------------------
> Garbage collection is desirable because:
> (3) Can improve performance, particularly in multithreaded environments
>
> 2. I asked him to explain:
> --------------------------
> Now your GC/C++:
> 1. Doesn't have the guarantee that allocated memory blocks are always
> deallocated in the same thread. So it must implement some kind of "stop the
> world" algorithm.
> 2. "Stop the world" algorithms definitely kill the concurrency. In
> particular, they have to cross the thread bounds so they don't "ALLOW FOR
> concurrent or simultaneous execution".

Incremental GC doesn't have a "stop the world" phase, as far as I know,
but it also requires compiler support to achieve. Still, I've read
papers describing hard-realtime incremental garbage collectors, so
collection doesn't have to be quite the unpredictable bottleneck you
describe... though this is perhaps more relevant for languages with
in-language GC support.
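A toy illustration of the compiler support in question: incremental
collectors typically instrument every pointer store with a write barrier
so collection can interleave with the program (all names here are
hypothetical, and this shows only the tri-color-invariant flavor):

extern bool  collecting;            /* hypothetical collector state */
extern bool  is_black(void* obj);   /* hypothetical color queries   */
extern bool  is_white(void* obj);
extern void* owner_of(void** slot);
extern void  shade_grey(void* obj);

/* the compiler emits a call like this for every pointer store */
void write_barrier(void** slot, void* newval)
{
    if (collecting && is_black(owner_of(slot)) && is_white(newval))
        shade_grey(newval);   /* keep the invariant: a black object must
                                 never point to a white one */
    *slot = newval;
}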

> The bottom line is clear: GC/C++ _degrades_ the performance of
> well-designed C++ MT programs!

How does this logically follow from what you said above? Isn't there a
cost associated with the techniques used in traditional non-GCed
applications? Is this cost demonstrably less than garbage collection?

> 3. He replied:
> --------------
> > 3. As a result, every thread has its own memory allocator that works (almost)
> > independently from the allocators of other threads. In particular, the "thread
> > A allocator" doesn't have to deallocate blocks of memory allocated by the
> > "thread B allocator" because the data between threads is copied via the
> > message queues.
>
> Stop right there. This is but one allocator design, and has been
> discussed a lot in MT circles. It has the disadvantage that memory
> allocated but freed by one thread cannot be used by anyone else, so the
> design can (in the worst case) consume proportional with k times more
> memory than a design with a shared allocator, where k is the number of
> threads.

Not true. While HOARD does have a substantially higher cost for
freeing memory allocated in another thread, there's no reason that this
must be so. Assuming lock-free facilities are available (CAS, LL/SC)
it's possible to perform such a delete with a single atomic operation
by maintaining per-thread lock-free deletion queues.

> Some applications can use the model successfully due to the nature of
> their threads and their allocation patterns. Some can't. But you can't
> possibly claim that that's the one possible design of an MT allocator.
> The variety of MT allocator designs is hard proof.
>
> 4. I replied:
> -------------
> Please tell me about different allocator design that also allows for
> parallel simultaneous execution but imposes less overhead.
>
> 5. He replied:
> --------------
> http://www.research.ibm.com/people/m/michael/pldi-2004.pdf
>
> 6. I replied:
> -------------
> The paper is flawed. In particular, it stands:
>
> "For example, it is trivial to design a wait-free allocator with pure
> per-thread private heaps. That is, each thread allocates from its own heap and
> also frees blocks to its own heap."
>
> Yes, this is the way to go.

All blocks or just its own blocks? I would argue that blocks should be
returned to the heap they were allocated from, or cache contention and
such could be an issue. See my last statement above for a fairly
simple way to accomplish this.

> "However, this is hardly an acceptable general-purpose solution, as it can
> lead to unbounded memory consumption (e.g., under a producer-consumer pattern
> [3]), even when the program's memory needs are in fact very small."
>
> This statement is incorrect: if you pass messages via serialization
> then no "unbounded memory consumption (e.g., under a producer-consumer
> pattern)" is possible.

But the paper is about memory allocation strategies, not alternative
methods for passing data between threads. I believe the paper is
correct with respect to passing pointers to allocated memory between
threads in a producer/consumer model.

> -------------
>
> OK, what we have now is:
> 1. A per-thread C++ allocator (which doesn't have to deallocate memory blocks
> allocated in other threads) definitely beats any possible GC/C++ collector
> (that can be plugged into regular C++ programs and therefore has to "stop the
> world").

That's a rather bold statement to make considering the assumptions
you've made. And while I don't feel qualified to make such an
assertion one way or the other, I don't think things are nearly as
clear as you suggest. I'm working from memory so forgive me if I've
got the wrong paper, but I believe Hans Boehm's paper entitled "Mostly
Parallel Garbage Collection"
(http://www.hpl.hp.com/personal/Hans_Boehm/gc/papers/pldi91.ps.Z) may
be relevant.

> 2. So "GC/C++ _degrades_ the performance of well-designed C++ MT programs".

Do you have data to support this claim?


Sean

Casper H.S. Dik

Mar 29, 2006, 10:57:53 AM
Chris Thomasson <cristom_nospam@comcast._nospam_net> writes:

>Some of those tricks can be extremely useful. Lock-free reader patterns
>are an example... Zero-overhead signaling via eventcounts is another,
>etc...

There are also other "contention free" algorithms that use
distributed locking; e.g., the Solaris kernel memory allocator
has little or no contention because it uses per-CPU caches of data;
only when such caches run out does the CPU need to go and get some
more.
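
The shape of that design, as a hedged sketch (illustrative only; the real
Solaris magazine layer is described in Bonwick's allocator papers):

#include <mutex>
#include <vector>

struct GlobalPool {
    std::mutex         lock;    // touched only when a per-CPU cache runs dry
    std::vector<void*> blocks;
};

struct CpuCache {               // one instance per CPU
    std::mutex         lock;    // almost never contended
    std::vector<void*> blocks;  // this CPU's cached free blocks
};

void* allocate(CpuCache& c, GlobalPool& g) {
    std::lock_guard<std::mutex> guard(c.lock);
    if (c.blocks.empty()) {
        // Slow path: refill a batch from the global pool.
        std::lock_guard<std::mutex> g_guard(g.lock);
        for (int i = 0; i < 64 && !g.blocks.empty(); ++i) {
            c.blocks.push_back(g.blocks.back());
            g.blocks.pop_back();
        }
        if (c.blocks.empty()) return 0;  // caller asks the OS for more
    }
    void* b = c.blocks.back();
    c.blocks.pop_back();
    return b;
}

The per-CPU lock stays in its own CPU's cache, so the common path never
generates cross-CPU coherency traffic.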

As a side note: before the new kernel memory allocator was implemented
in Solaris 2.4 (I think), the old memory allocator performed poorly and
some people thought it was a good idea to have their own freelists
protected with a single lock.

That stopped scaling beyond one CPU, as the (artificial) workload showed
(allocating memory was apparently performance critical for the
auditing subsystem).

After modifying the code to use the contentionless default memory
allocator, no memory contention could be observed (I only had a 12-way
system at my disposal).

>You can also use POSIX mutexes to create a scalable environment that
>supports lock-free reader patterns, and much more. The only thing the
>environment relies on is dependent load ordering, which is supported
>everywhere except DEC Alpha; it needs a rmb. Funny how you can get
>low-overhead lock-free algorithms from locks...

There are times when you don't even need that as is the case with
the Solaris kernel memory allocator; it just doesn't need resources
owned by other CPUs.

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Sergey P. Derevyago

Mar 29, 2006, 10:57:04 AM
Herb Sutter wrote:
> For the information of those just tuning in, this thread is about Sergey
> making and defending the assertion that ref counted smart pointers should
> _not_ use atomic inc/dec of the ref count, and that instead users should
> be required to externally lock code like this:
>
> // acquire lock for the object p1 points to
>
> shared_ptr<T> p2 = p1; // p1 is a private shared_ptr<T> instance
>
> // release lock for the object p1 points to
>
> That is, for a given mutable shared T object t, the contention was that
> the user is supposed to take a lock not only on every use of t but also
> take the same lock on every use of every shared_ptr<T> that points to t.
>
Herb, I've written the following statements several times:
1. Shared smart pointer objects (that concurrently modify the same memory
locations) are pretty rare in well-designed MT applications. Their use in MT
applications is almost unacceptable (in particular, your ...p2=p1... example
illustrates one of the drawbacks).
2. If you decided to use a shared smart pointer object then you definitely
have to synchronize the access (but these cases are pretty rare).
3. As a result, smart pointers that synchronize every shared operation will
definitely degrade the whole application's performance, because almost all
smart pointer objects aren't shared and require no synchronization (atomic
operations on SMP require memory synchronization, so they are substantially
expensive).

--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Sergey P. Derevyago

Mar 29, 2006, 10:59:37 AM
David Schwartz wrote:
> > So I was right: there is no GC/C++ that doesn't "stop the world".
> > And it seems like it's not possible for any GC/C++ not to "stop the
> > world".
>
> The ability of operating systems to garbage collect not-recently-used
> physical memory from an application without stopping the world would seem to
> show that what you are claiming does not exist most certainly can exist.
>
Explain, please.

--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Sergey P. Derevyago

Mar 29, 2006, 10:59:59 AM
Chris Thomasson wrote:
> Funny how you can get low-overhead lock-free algorithms from locks...
>
I've explained the point several times:
1. Well-designed MT code allows for parallel simultaneous execution.
2. Most of the time threads work with their private data structures, so no
synchronization is required.
3. In those rare cases where you really have to synchronize some operations,
the standard POSIX synchronization primitives do well.

Let's consider the following simplified examples.

1. Suppose your code works with the thread-private data 95% of the time and
spends 5% on the synchronization. Suppose MagicLockFreeMutex is 2 times faster
than StandardMutex. Then if you replace the StandardMutex with
MagicLockFreeMutex your application will work for 97.5% of the previous time
so you will get 2.5% performance improvement.

The bottom line is: well-designed MT applications don't get significant gains
from faster lock operations.

2. Suppose your code works with the thread-private data 20% of the time and
spends 80% on the synchronization. That is you have a usual _bad_ MT design.
Then if you replace the StandardMutex with MagicLockFreeMutex your application
will work for 60% of the previous time so you will get 40% performance
improvement which is substantial.

The bottom line is: bad MT applications can get certain gains from faster lock
operations.
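
Both figures are just Amdahl's law applied to the synchronized fraction p,
with a speedup factor s on that fraction:

    T_new / T_old = (1 - p) + p/s

which gives 0.95 + 0.05/2 = 0.975 in the first example and
0.20 + 0.80/2 = 0.60 in the second.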

In principle, you can talk about some MT application design where all threads
work on the same data structure (say, some matrix transformation). Yes,
well-chosen lock-free algorithms are a must in this particular case.
My point is that regular MT design is "several threads that work almost
independently". These threads must not be forced to perform some hidden
synchronization under the covers. A good example of this bad hidden
synchronization is a global memory allocator that has to cross the thread
boundaries...


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Sergey P. Derevyago

Mar 30, 2006, 2:12:35 AM
Sean Kelly wrote:
> Incremental GC doesn't have a "stop the world" phase, as far as I know,
>
"The devil is in the details".

> > The bottom line is clear: GC/C++ _degrades_ the performance of
> > well-designed C++ MT programs!
>
> How does this logically follow from what you said above?
>

Please read the whole message rather than the bottom line:
http://groups.google.com/group/comp.lang.c++.moderated/msg/9fe3ee93d3729321

> But the paper is about memory allocation strategies, not alternative
> methods for passing data between threads. I believe the paper is
> correct with respect to passing pointers to allocated memory between
> threads in a producer/consumer model.
>

The paper explains the motivation: "it is trivial to design a wait-free
allocator with pure per-thread private heaps" but "this is hardly an
acceptable general-purpose solution, as it can lead to unbounded memory
consumption".
I state: per-thread private heaps don't lead to unbounded memory consumption
if the memory allocation/deallocation doesn't cross the thread boundaries
(as I've proposed). So per-thread private heaps are superior to any possible
heaps (that cross the thread boundaries and therefore have to synchronize
the access) and don't have the "unbounded memory consumption" drawback.
Per-thread private heaps are the way to go as an excellent, generally
available, conventional C++ solution (here we talk about GC/C++, BTW).
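
A minimal sketch of the copy-through-the-queue discipline this argument
rests on (illustrative names; a real message would also carry a type tag):

#include <cstring>
#include <string>

struct Point { int x, y; };

// Sender side: flatten into bytes using the sender's own allocations;
// the byte buffer is copied by value into the message queue.
std::string to_bytes(const Point& p) {
    std::string bytes(sizeof p, '\0');
    std::memcpy(&bytes[0], &p, sizeof p);
    return bytes;
}

// Receiver side: rebuild the structure from the bytes; any allocation
// here comes from the receiver's heap, so no memory block ever crosses
// a thread boundary.
Point from_bytes(const std::string& bytes) {
    Point p;
    std::memcpy(&p, bytes.data(), sizeof p);
    return p;
}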


--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

David Schwartz

Mar 30, 2006, 2:45:42 AM

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:442A5E82...@iobox.com...

> 2. Suppose your code works with the thread-private data 20% of the time
> and spends 80% on the synchronization. That is you have a usual _bad_ MT
> design. Then if you replace the StandardMutex with MagicLockFreeMutex
> your application will work for 60% of the previous time so you will get
> 40% performance improvement which is substantial.
>
> The bottom line is: bad MT applications can get certain gains from
> faster lock operations.

(I'm assuming an SMP P4 machine, but things are similar on most modern
machines.)

Quite the reverse, this type of application would *suffer* *terribly*
from lock-free algorithms. With a lock, two threads would never be permitted
concurrent access to the same data, so it would never enter a state where
data was ping-ponging back and forth from cache to cache burning heavy CPU
with no significant forward progress. However, with a lock-free structure,
it would be allowed to do this horrible thing 80% of the time!

DS

Ulrich Eckhardt

Mar 30, 2006, 3:07:41 AM
Sergey P. Derevyago wrote:
> 1. Shared smart pointer objects (that concurrently modify the same
> memory locations) are pretty rare in well-designed MT applications.
> Their use in MT applications are almost unacceptable (in particular,
> your ...p2=p1... example illustrates one of the drawbacks).

Just wondering, but what is a 'shared smart pointer object'? Are you
referring to a shared_ptr object or do you mean an object pointed to (and
shared) by several shared_ptrs?

Uli


chris noonan

Mar 30, 2006, 6:48:35 AM
Sergey P. Derevyago wrote:
> 1. Suppose your code works with the thread-private data 95% of the time and
> spends 5% on the synchronization. Suppose MagicLockFreeMutex is 2 times faster
> than StandardMutex. Then if you replace the StandardMutex with
> MagicLockFreeMutex your application will work for 97.5% of the previous time
> so you will get 2.5% performance improvement.

But be careful to distinguish between 5% on a lines-of-code
basis and 5% on a time basis. StandardMutex will use a
bus-locking instruction; same number of clocks as perhaps
100 ordinary instructions; no real problem. StandardMutex
may cause a thread switch; thousands of clocks on
current architectures (no idea why, how difficult can it be
to swap a few registers over?). To have thread synchronisation
and to keep its cost down to 5% on a time basis, the extent
of the communication must be slight.

> My point is that regular MT design is "several threads that work almost
> independently". These threads must not be forced to perform some hidden
> synchronization under the covers. A good example of this bad hidden
> synchronization is a global memory allocator that has to cross the thread
> boundaries...

I couldn't agree more.

Regards,
Chris

Joe Seigh

Mar 30, 2006, 7:26:05 AM
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>
>>Funny how you can get low-overhead lock-free algorithms from locks...
>>
>
> I've explained the point several times:
> 1. Well-designed MT code allows for parallel simultaneous execution.
> 2. Most of the time threads work with their private data structures so no
> synchronization is required.
> 3. In those rare cases where you really have to synchronize some operations,
> the standard POSIX synchronization primitives do well.
>
> Let's consider the following simplified examples.
>
> 1. Suppose your code works with the thread-private data 95% of the time and
> spends 5% on the synchronization. Suppose MagicLockFreeMutex is 2 times faster
> than StandardMutex. Then if you replace the StandardMutex with
> MagicLockFreeMutex your application will work for 97.5% of the previous time
> so you will get 2.5% performance improvement.
>
> The bottom line is: well-designed MT applications don't get significant gains
> from faster lock operations.

[...]


>
> The bottom line is: bad MT applications can get certain gains from faster lock
> operations.

You're missing the point. In a preemptive threading environment, one of
the major benefits is less overhead suspending and resuming threads on
blocking operations. It's faster for the same reason highway interchanges with
overpasses are less congested than intersections with traffic lights for
the same volume of traffic. Lock-free also scales better and doesn't
suffer from that hysteresis effect where the application bogs down so badly
that it can't recover even under reduced load and has to be restarted.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Joe Seigh

Mar 30, 2006, 8:57:29 AM
David Schwartz wrote:
> "Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
> news:442A5E82...@iobox.com...
>
>
>>2. Suppose your code works with the thread-private data 20% of the time
>>and
>>spends 80% on the synchronization. That is you have a usual _bad_ MT
>>design.
>>Then if you replace the StandardMutex with MagicLockFreeMutex your
>>application
>>will work for 60% of the previous time so you will get 40% performance
>>improvement which is substantial.
>>
>>The bottom line is: bad MT applications can get certain gains from faster
>>lock
>>operations.
>
>
> (I'm assuming an SMP P4 machine, but things are similar on most modern
> machines.)
>
> Quite the reverse, this type of application would *suffer* *terribly*
> from lock-free algorithms. With a lock, two threads would never be permitted
> concurrent access to the same data, so it would never enter a state where
> data was ping-ponging back and forth from cache to cache burning heavy CPU
> with no significant forward progress. However, with a lock-free structure,
> it would be allowed to do this horrible thing 80% of the time!
>

This makes no sense. Most lock-free algorithms are of the read lock-free
variety, and things like RCU and SMR hazard pointers actually reduce
the amount of cache traffic. RCU was created at Sequent where this kind
of thing was important since it was in a NUMA environment.

For regular write lock-free things like lock-free queues, you're going to
have to explain how queue operations guarded with a mutex cause less cache
activity than lock-free queue operations do, without slowing things down.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Chris Thomasson

Mar 30, 2006, 9:56:21 AM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:e0eqoc$43m$1...@nntp.webmaster.com...

What about the mutex implementation's "internal state"? A typical mutex
usually costs 2 atomic ops and 2 "expensive" memory barriers for the
uncontended case... Unfortunately, when you apply load to those 4
instructions the basic end result will be the lock's internal state
ping-ponging all over the place; the lock state is a "proxy". This happens
to be true for "frequent accesses to uncontended locks". Now, if you
experience contention under load, the threads may need to frequently park
in the kernel: context switches, user-kernel transitions, lock convoys,
"possible" priority inversion, etc., etc...


> However, with a lock-free structure,
> it would be allowed to do this horrible thing 80% of the time!

Well, this could happen if you apply "persistent" heavy load to highly
contended "CAS-loop" based lock-free algorithms... IMHO, I prefer to use
"loopless" algorithms whenever I can. Also, it can be beneficial to join
lock-free and lock-based solutions together... Something like the algorithm
I posted here:

http://groups.google.de/group/comp.programming.threads/msg/632b6bdc2a137459

http://groups.google.de/group/comp.programming.threads/msg/fd508d99ee2b8a70

?


Also, the overhead of taking locks can "adversely affect other threads"
that are running "on the same core as the lock-taker". This assertion holds
true for some of the new high-end multicore processors that are now
available. Frequent use of memory barriers (e.g., mutex usage) will flood
the core(s) with coherency traffic and bog down other threads. Overall
scalability and throughput can be reduced...

{ It looks like this conversation is drifting away from the topic of C++.
Please remove comp.lang.c++.moderated from the Newsgroups line if
continued. Thanks, -mod/dv }

David Hopwood

Mar 30, 2006, 11:10:37 AM
[posting via Google because of news server problems -- sorry if this
is a duplicate]

David Schwartz wrote:


> "Sergey P. Derevyago" <non-ex...@iobox.com> wrote:
>
>>The bottom line is clear: GC/C++ _degrades_ the performance of
>>well-designed C++ MT programs!
>

> For your very peculiar notion of "well-designed". I could not disagree
> with this notion more strongly. In the vast majority of cases, having
> objects 'belong' to threads is very bad design.
>
> For example, if object 'A' belongs to thread 'A', and object 'A'
> associates with some real-world thing that needs services, and only thread
> 'A' can manipulate object 'A', that means that if thread 'A' *ever* blocks,
> object 'A' will be starved.

The previous poster was speaking in the context of systems using
message passing concurrency. In such systems, each object is
associated with a thread of control (although they are usually not
called "threads") by definition. If one were to assume that associating
objects with threads is the only change relative to a model using
shared state and locks, one might conclude that the result will be
a bad design, but in fact it is very far from being the only change.

For instance, some message passing languages have constructs
that make it easy to avoid blocking while causing minimal change
to the logical structure of the program. In E, a call that could
potentially block can always be rewritten as an asynchronous
message send that returns a promise for the call's result, followed
by a "when-catch" clause (see
<http://www.erights.org/talks/promises/paper/tgc05.pdf>).

> There are reasons threads might block that are
> beyond your control (such as page faults), so I would argue that for most
> real-world applications, this type of design is actually very bad.

Except for hard real-time applications, page faults are not much
of a problem, since they are guaranteed to only block for a short
time (even if there is a disk error, since then the process will be
terminated). There is obviously some latency associated with
servicing a page fault, but this is unavoidable regardless of the
concurrency design.

In a lock-based model, however, because a thread synchronously
blocks when it cannot immediately acquire a lock, operations
that do not logically depend on having acquired the lock can be
delayed. In a model with asynchronous message passing and
promises, each operation will take place as soon as it is logically
possible, which generally allows a higher degree of concurrency
for a comparable complexity and effort in design and validation.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Mar 30, 2006, 11:23:02 AM
[posting via Google because of news server problems -- sorry if this is
a duplicate]

Chris Thomasson wrote:
> Sergey P. Derevyago wrote:
>>Joe Seigh wrote:
>>
>>>> The question is: Can you take XXX proxy GC and plug it just like
>>>>Boehm GC?
>>>
>>>No, things like RCU, etc... aren't used for storage management. They're
>>>used for read lock-free solutions to the reader/writer problem in a
>>>multi-threaded environment.
>>
>> So I was right: there is no GC/C++ that doesn't "stop the world".
>> And it seems like it's not possible for any GC/C++ not to "stop the world".
>

> You don't need to garbage collect everything. Proxy collection can allow
> an application to get away from traditional GC, and all of the overhead
> that comes along with them.

I always find it odd when people talk about the "overhead of GC" as though
other memory management techniques had no overheads.

You shouldn't assume that a hybrid of GC and some other memory management
technique will have lower overhead than applying GC to all objects. If the
other technique is some form of reference counting (as is commonly the case
when using proxy GC), then it definitely won't; refcounting is very
inefficient.

> You don't need to "stop the world" to solve
> the reader/writer problem...

Sergey is mistaken in thinking that global GC implies "stop the world"
(even for C++, in principle, although there are no state-of-the-art
concurrent GC implementations for C++, and it would be difficult to write
one that would be link-compatible with existing C++ implementations).

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Sergey P. Derevyago

Mar 30, 2006, 11:57:09 AM
David Hopwood wrote:
> Sergey is mistaken in thinking that global GC implies "stop the world"
>
Explain, please, how this could be possible.
Don't you have to examine the registers, the stacks and so on? (We're
talking about conventional C++ programs and pluggable GC.)

David Schwartz

Mar 30, 2006, 2:25:28 PM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:Y-mdnXDUsrOAfLbZ...@comcast.com...

> What about the mutex implementation's "internal state"? A typical mutex
> usually costs 2 atomic ops and 2 "expensive" memory barriers for the
> uncontended case... Unfortunately, when you apply load to those 4
> instructions the basic end result will be the lock's internal state
> ping-ponging all over the place; the lock state is a "proxy". This happens
> to be true for "frequent accesses to uncontended locks". Now, if you
> experience contention under load, the threads may need to frequently park
> in the kernel: context switches, user-kernel transitions, lock convoys,
> "possible" priority inversion, etc., etc...

I have no objection to optimizing mutexes to decrease the cost in the
uncontended case. However, in the contended case, the cost of rescheduling
is well worth it -- a thread that won't conflict gets to run at full speed
rather than a thread that will conflict getting to fight at low speed.

DS


David Schwartz

Mar 30, 2006, 7:24:54 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:U6SdnUIS2vs...@comcast.com...

> This makes no sense. Most lock-free algorithms are of the read lock-free
> variety and for things like RCU and SMR hazard pointers actually reduce
> the amount of cache traffic. RCU was created at Sequent where this kind
> of thing was important since it was in a NUMA environment.

If the algorithm is purely read, then this argument does not apply. But
there are very few lock-free algorithms that don't involve allowing threads
to modify things concurrently.

> For regular write lock-free things like lock-free queues, you're going to
> have
> to explain how a queue operations guarded with a mutex causes less cache
> activity than lock-free queue operations do without slowing things down.

Suppose you have two CPUs and four threads that want to run. Two of the
threads want to access queue A and two of the threads want to access queue
B. If each queue is protected by a lock, the scheduler will have to schedule
one thread using queue A and one using queue B. Both threads will run at
full speed and will not interfere with each other. If the threads are
lock-free, it is equally likely that two threads will run using the same queue as
using different queues, the queue structures will ping-pong violently across
the FSB and the two threads will run at low speed.

Locks serve an extremely important purpose -- they deschedule threads
that will compete for the same cache lines and allow threads that have no
such dependencies to run concurrently at full speed. Lock-free algorithms
work well only when they don't disturb this, for example, when they are
embedded inside the operating system or in very low-level libraries.
However, when lock-free structures are used at higher levels, this important
property of locks is lost, and performance suffers.

Contention is bad. Lock-free algorithms allow threads that contend to
run at the same time, contending all the while. Locks deschedule contending
threads, which is great if there are threads ready to run that will not
contend.

DS

chris noonan

Mar 30, 2006, 7:47:09 PM
[apologies if this is a near-duplicate; the first post seemed to
disappear]

David Schwartz wrote:

>> Quite the reverse, this type of application would *suffer* *terribly*
>> from lock-free algorithms. With a lock, two threads would never be
>> permitted concurrent access to the same data, so it would never enter a
>> state where data was ping-ponging back and forth from cache to cache
>> burning heavy CPU with no significant forward progress. However, with a
>> lock-free structure, it would be allowed to do this horrible thing 80%
>> of the time!


A real-world example, more information available at:
group: comp.unix.internals
topic: Kernel Memory Allocator
date: 6 November 2003

A simple program (benchmark type) that divides a fixed
amount of work up among a variable number of threads,
so that with perfect scaling all runs take the same time.
Can use either a locking or lock-free algorithm.

Several runs on a 4-way machine. Each result sequence
starts at one thread and doubles successively up to
128 (1,2,4,8,16,32,64,128). Timings are in seconds
(so the smaller the better).

Locking: 265,289,280,400,555,456,604,575
Lock-free: 245,122,80,70,70,58,75,82

In this case lock-free scales better.

Regards
Chris Noonan


David Hopwood

Mar 30, 2006, 10:08:57 PM
David Schwartz wrote:
> Locks serve an extremely important purpose -- they deschedule threads
> that will compete for the same cache lines and allow threads that have no
> such dependencies to run concurrently at full speed.

Note that locks are not the only way of doing that. Any method that achieves
mutual exclusion between "threads that will compete for the same cache lines"
will do it. Besides locks, another approach is based on active objects which
use an event loop to respond to messages. This message passing may be either
asynchronous, as in Erlang or E, or synchronous, as in Concurrent ML.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Schwartz

Mar 30, 2006, 11:13:00 PM
[possible duplicate post, sorry]

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:U6SdnUIS2vs...@comcast.com...

> This makes no sense. Most lock-free algorithms are of the read lock-free


> variety and for things like RCU and SMR hazard pointers actually reduce
> the amount of cache traffic. RCU was created at Sequent where this kind
> of thing was important since it was in a NUMA environment.

If the algorithm is purely read, then this argument does not apply. But


there are very few lock-free algorithms that don't involve allowing threads
to modify things concurrently.

> For regular write lock-free things like lock-free queues, you're going to


> have
> to explain how a queue operations guarded with a mutex causes less cache
> activity than lock-free queue operations do without slowing things down.

Suppose you have two CPUs and four threads that want to run. Two of the


threads want to access queue A and two of the threads want to access queue
B. If each queue is protected by a lock, the scheduler will have to schedule
one thread using queue A and one using queue B. Both threads will run at
full speed and will not interfere with each other. If the threads are

lock-free,


it is equally likely that two threads will run using the same queue as
using different queues, the queue structures will ping-pong violently across
the FSB and the two threads will run at low speed.

Locks serve an extremely important purpose -- they deschedule threads


that will compete for the same cache lines and allow threads that have no

such dependencies to run concurrently at full speed. Lock-free algorithms
work well only when they don't disturb this, for example, when they are
embedded inside the operating system or in very low-level libraries.
However, when lock-free structures are used at higher levels, this important
property of locks is lost, and performance suffers.

Contention is bad. Lock-free algorithms allow threads that content to
run at the same time, contending all the while. Locks deschedule contending
threads, which is great if there are threads ready to run that will not
contend.

This is often missed in artificial benchmarking because only the
performance of a single aspect of the system is measured. The affect of
FSB-inefficient code on threads running on other CPUs, for example, is
seldom noticed. If there is nothing else for the system to do, running a
thread very slowly and trashing the FSB may be better than nothing. But if a
real-world system or application could do other things effectively, allowing
it to do something slowly is a losing proposition.

DS


David Schwartz

Mar 31, 2006, 1:41:18 AM

"chris noonan" <use...@leapheap.co.uk> wrote in message
news:1143765082.0...@g10g2000cwb.googlegroups.com...

> Several runs on a 4-way machine. Each result sequence
> starts at one thread and doubles successively up to
> 128 (1,2,4,8,16,32,64,128). Timings are in seconds
> (so the smaller the better).
>
> Locking: 265,289,280,400,555,456,604,575
> Lock-free: 245,122,80,70,70,58,75,82
>
> In this case lock-free scales better.

The results don't seem to make any sense. How can 6 threads go faster
than 4 on a 4-way machine? I thought this was a pure CPU load.

DS


David Schwartz

Mar 31, 2006, 1:43:10 AM

"David Schwartz" <dav...@webmaster.com> wrote in message
news:e0iius$elp$1...@nntp.webmaster.com...

Sorry, that should be: How can 8 threads go faster than 4 on a 4-way
machine? Seems that 4 threads taking 80 seconds when 2 threads took 122 on a
4-way machine is pretty mediocre scalability. That you found some other
algorithm that scales even worse doesn't really say anything.

DS


Joe Seigh

Mar 31, 2006, 6:59:31 AM
David Schwartz wrote:
> [possible duplicate post, sorry]
>
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:U6SdnUIS2vs...@comcast.com...
>
>
>>This makes no sense. Most lock-free algorithms are of the read lock-free
>>variety and for things like RCU and SMR hazard pointers actually reduce
>>the amount of cache traffic. RCU was created at Sequent where this kind
>>of thing was important since it was in a NUMA environment.
>
>
> If the algorithm is purely read, then this argument does not apply. But
> there are very few lock-free algorithms that don't involve allowing threads
> to modify things concurrently.
>

That's like saying rwlocks aren't synchronization, just mutexes are.

>
>>For regular write lock-free things like lock-free queues, you're going to
>>have to explain how queue operations guarded with a mutex cause less cache
>>activity than lock-free queue operations do without slowing things down.
>
>
> Suppose you have two CPUs and four threads that want to run. Two of the
> threads want to access queue A and two of the threads want to access queue
> B. If each queue is protected by a lock, the scheduler will have to schedule
> one thread using queue A and one using queue B. Both threads will run at
> full speed and will not interfere with each other. If the threads are
> lock-free,
> it is equally likely that two threads will run using the same queue as
> using different queues, the queue structures will ping-pong violently across
> the FSB and the two threads will run at low speed.
>
> Locks serve an extremely important purpose -- they deschedule threads
> that will compete for the same cache lines and allow threads that have no
> such dependencies to run concurrently at full speed. Lock-free algorithms
> work well only when they don't disturb this, for example, when they are
> embedded inside the operating system or in very low-level libraries.
> However, when lock-free structures are used at higher levels, this important
> property of locks is lost, and performance suffers.

Yeah, that's the ticket!

>
> Contention is bad. Lock-free algorithms allow threads that contend to
> run at the same time, contending all the while. Locks deschedule contending
> threads, which is great if there are threads ready to run that will not
> contend.
>
> This is often missed in artificial benchmarking because only the
> performance of a single aspect of the system is measured. The affect of
> FSB-inefficient code on threads running on other CPUs, for example, is
> seldom noticed. If there is nothing else for the system to do, running a
> thread very slowly and trashing the FSB may be better than nothing. But if a
> real-world system or application could do other things effectively, allowing
> it to do something slowly is a losing proposition.
>

Let's see. How would this work? So you have a bunch of threads using a lock.
The other threads wanting to use the resource wait on the lock. If the
running thread is preempted while holding the lock, that would be bad for
performance. But if all the other threads are waiting on the lock, then the
thread will just execute again until it is eventually preempted while not
holding the lock, and one of the other threads can run.

So basically, you've turned a multi-threaded application into a
single-threaded one. No matter how many threads get added, performance
doesn't improve. Not very good scalability.

dav...@webmaster.com

Mar 31, 2006, 10:06:16 AM
>You're missing the point. In a preemptive threading environment, one of
>the major benefits is less overhead suspending and resuming threads on
>blocking operations.

The thing is, suspending threads that would contend is *good*. Better
to switch to a thread that doesn't contend than one that does.

>It's faster for the same reason highway interchanges with
>overpasses are less congested than intersections with traffic lights for
>the same volume of traffic.

Locks are like traffic lights: you have to stop every once in a while,
but when you go, you go fast. Lock-free is like everyone driving slowly
at every intersection, not stopping, not letting others go quickly. You
always get to move forward, but if there's a lot of traffic, nobody
goes at full speed.

>Lock-free also scales better and doesn't
>suffer from that hysteresis effect where the application bogs down so badly
>that it can't recover even under reduced load and has to be restarted.

Quite the reverse, lock-free algorithms can get you into a state where
all your threads are ping-ponging the same cache lines across the FSB
when there are threads that wouldn't contend at all that would love to
run. (And even threads that don't contend crawl because the FSB is
saturated.) Lock-free algorithms are generally only suitable for very
low-level code, such as in kernels and in the implementations of
threading libraries and memory allocators.

DS

Chris Friesen

Mar 31, 2006, 11:51:59 AM
Joe Seigh wrote:
> David Schwartz wrote:

>> Locks deschedule
>> contending
>> threads, which is great if there are threads ready to run that will not
>> contend.

> So basically, you've turned a multi-threaded application into a single
> threaded
> one. No matter how many threads get added, performance doesn't
> improve. Not
> very good scalability.

I think you ignored David's statement that I highlighted above. If
there are non-contending threads ready to run then it is a net gain to
deschedule threads that will contend.

If all runnable threads will contend, then yes, you're serializing the
application. However, even that may not be a bad thing. If there are
other applications running on the system at the same time, then the
*system* throughput may increase even though each individual application
runs slower.

Chris

David Schwartz

Mar 31, 2006, 5:22:57 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:qLmdna81Ios...@comcast.com...

> Let's see. How would this work? So you have a bunch of threads using a
> lock.
> The other threads wanting to use the resource wait on the lock. If the
> running
> thread preempts holding the lock, that would be bad for performance. But
> if all
> other threads are waiting on the lock then the thread will just execute
> again until
> it eventually preempts not holding the lock and one of the other threads
> can run.

You have to create a situation where there's one data structure that
every single thread that could do useful work needs to access and where each
of those threads must hold the lock for a significant fraction of their
execution time. The best advice is to avoid such situations rather than
design code so that it doesn't suffer quite so much in such a pathological
situation.

In exchange for better behavior in this pathological, avoidable
situation, you get much worse behavior in more typical situations. That is,
threads that could run at full speed wait in favor of threads that can
suffer through major contention.

This is only a good tradeoff in the particular cases where the
pathological behavior is unavoidable. This includes inside the kernel, in
memory allocators, and in the implementation of synchronization primitives.
These special cases are, however, by far the exception.

The point is, contention is bad. Locks prevent contention by scheduling
threads that will not contend over threads that will contend. Lock-free
algorithms make contention slightly less bad, but at the cost of not
avoiding it even when you can avoid it.

In most applications, the cost of synchronization is very low because
locks rapidly arrange it so that threads that won't contend are scheduled
concurrently.

DS


chris noonan

Mar 31, 2006, 7:45:00 PM
Chris Friesen wrote:
> I think you ignored David's statement that I highlighted above. If
> there are non-contending threads ready to run then it is a net gain to
> deschedule threads that will contend.
>
> If all runnable threads will contend, then yes, you're serializing the
> application. However, even that may not be a bad thing. If there are
> other applications running on the system at the same time, then the
> *system* throughput may increase even though each individual application
> runs slower.

If you have other, unrelated applications running on your
system you have a small problem. Buy another machine
and move those applications onto it.

I thought there was tacit agreement that we were discussing
a different problem: you have one application running on your
system; you need more processing power; however you are
already at the point on the price/performance curve where
doubling processing power will multiply hardware cost tenfold.
That is the big problem and has been for the last forty years.

Other Chris

chris noonan

Mar 31, 2006, 9:10:27 PM
David Schwartz wrote:
>> Several runs on a 4-way machine. Each result sequence
>> starts at one thread and doubles successively up to
>> 128 (1,2,4,8,16,32,64,128). Timings are in seconds
>> (so the smaller the better).

>> Locking: 265,289,280,400,555,456,604,575
>> Lock-free: 245,122,80,70,70,58,75,82

>How can 8 threads go faster than 4 on a 4-way machine?

Several possible reasons: cache effects, competition
with system threads. Really I don't know. I have
performed many trials like this and normally there
is some magic number of threads that beats (repeatably)
its neighbours for no discernible reason.

>Seems that 4 threads taking 80
>seconds when 2 threads took 122 on a
>4-way machine is pretty mediocre scalability.

Why don't you ask Microsoft Corporation and
Intel Corporation how a 4-way machine can run
four threads slower than one (280 seconds versus
265)? I am sure most people will find that
comparison the striking one.

>That you found some other algorithm that scales
>even worse doesn't really say anything.

Consider the common case where a thread frequently
enters a critical region to access shared data, with no
complications like nested locks.

All such thread functions may be treated the same
as regards scalability. The important factors are:
1. The type of lock.
2. The frequency with which the lock is taken.
3. The time spent holding the lock, as calculable
from the number of instructions, the instruction
timings and the expected incidence of cache
misses.
Nothing else is likely to make much difference.

Both the algorithms compared in the trial are
large-block allocators with O(1) time complexity.
Both navigate a linked list looking for a free
block of suitable size (though the structure
of the lists differs markedly between the two).

So how can the locking algorithm be made more
scalable?

Factor 1: you're not going to change the lock type;
critical section is the fastest thing Microsoft offer.
Factor 2: you can't change the frequency because
when a client calls malloc() (let's ignore free())
you have to find a memory block.

That leaves factor 3, the time spent inside the lock.
The scope of the lock can't be changed, because you
can't navigate a linked list if writers may be present.
The locking algorithm is a performance-sensitive
routine, coded in C by the sort of people who
write books about coding in C, and compiled with
a world-class optimising compiler. So it must be
good ;). Further, navigating a linked list comparing
two sizes doesn't sound like the sort of code
that could be dramatically improved.

But you never know.
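
For reference, factor 3 here is essentially the cost of a walk like the
following (a hedged sketch, not the actual allocator code under discussion):

#include <cstddef>
#include <mutex>

struct FreeBlock {
    std::size_t size;
    FreeBlock*  next;
};

std::mutex heap_lock;
FreeBlock* free_list = 0;

// First-fit search; the whole walk must happen inside the lock,
// because other threads may be splicing the list concurrently.
void* find_block(std::size_t size) {
    std::lock_guard<std::mutex> guard(heap_lock);
    for (FreeBlock** p = &free_list; *p; p = &(*p)->next) {
        if ((*p)->size >= size) {
            FreeBlock* b = *p;
            *p = b->next;       // unlink the block
            return b;
        }
    }
    return 0;                   // caller falls back to the OS
}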

Chris

Chris Thomasson

Apr 1, 2006, 5:43:49 AM
"Casper H.S. Dik" <Caspe...@Sun.COM> wrote in message
news:442a668c$0$11069$e4fe...@news.xs4all.nl...
> Chris Thomasson <cristom_nospam@comcast._nospam_net> writes:
>
>>Some of those tricks can be extremely useful. Lock-free reader patterns
>>are an example... Zero-overhead signaling via eventcounts is another,
>>etc...
>
> There are also other "contention free" algorithms by using
> distributed locking; e.g., the Solaris kernel memory allocator
> has little or no contention because it uses per-CPU caches of data;
> only once such caches run out, the CPU will need to go and get some
> more.

Yeah, kind of similar to Hoard's per-CPU scheme, right?


> That stopped scaling beyond one CPU, as the (artificial) workload showed
> (allocating memory was apparently performance critical for the
> auditing subsystem).
>
> After modifying the code to use the contentionless default memory
> allocator, no memory contention could be observed (I only had a 12-way
> system at my disposal)

Well, maybe no contention for the data structures that make up the "memory",
but what about the lock that the allocator apparently uses to synchronize
access? Frequent accesses to uncontended locks can generate a lot of
overhead... Does the allocator make use of per-thread memory pools? As you
may know, you can avoid using locks to synchronize access to a distributed
per-thread based allocation scheme... Here is a rough pseudo-code example of
a very simple setup:


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?q=lock-free+slab+allocator&rnum=1#e3efa5628aad4a82


This 100% lock-free per-thread allocator "should" outperform a "100%"
lock-based per-CPU allocator... Hmm, what do you think?


>>You can also use POSIX mutexes to create a scalable environment that
>>supports lock-free reader patterns, and much more. The only thing the
>>environment relies on is dependent load ordering, which is supported
>>everywhere except DEC Alpha; it needs a rmb. Funny how you can get


>>low-overhead lock-free algorithms from locks...
>

> There are times when you don't even need that as is the case with
> the Solaris kernel memory allocator; it just doesn't need resources
> owned by other CPUs.

Well, I was really addressing lock-free reader patterns. I guess you could
apply one to the internals of a memory allocator... You could have multiple
threads searching for memory blocks that exist in a common shared data
structure without using any synchronization instructions. As stated before,
dependent load-ordering is all that is required. To remove blocks from the
shared data, you can use a simple CAS. Lock-free algorithms can be beneficial
when used in the implementations of memory allocators...
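
As a deliberately simplified sketch of such a reader pattern, in modern C++
atomics: readers traverse with plain ordered loads and no interlocked
instructions, while writers publish with a CAS. Safely *reclaiming* removed
nodes is the hard part that proxy collectors address, and it is elided here:

#include <atomic>

struct Node {
    int                value;
    std::atomic<Node*> next { nullptr };
};

std::atomic<Node*> head { nullptr };

// Writer: publish a fully-constructed node with one CAS; the release
// ordering pairs with the readers' acquire loads.
void push(Node* n) {
    Node* h = head.load(std::memory_order_relaxed);
    do {
        n->next.store(h, std::memory_order_relaxed);
    } while (!head.compare_exchange_weak(h, n,
                 std::memory_order_release,
                 std::memory_order_relaxed));
}

// Reader: no atomic RMW instructions at all, just ordered loads.
bool contains(int v) {
    for (Node* n = head.load(std::memory_order_acquire); n;
         n = n->next.load(std::memory_order_acquire))
        if (n->value == v) return true;
    return false;
}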


Chris Thomasson

Apr 1, 2006, 6:08:48 AM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:e0hbb2$e6$1...@nntp.webmaster.com...

>
> "Chris Thomasson" <cri...@comcast.net> wrote in message
> news:Y-mdnXDUsrOAfLbZ...@comcast.com...
>
>> What about the mutex implementation's "internal state"? A typical mutex
>> usually costs 2 atomic ops and 2 "expensive" memory barriers for the
>> uncontended case... Unfortunately, when you apply load to those 4
>> instructions the basic end result will be the lock's internal state
>> ping-ponging all over the place; the lock state is a "proxy". This
>> happens to be true for "frequent accesses to uncontended locks". Now, if
>> you experience contention under load, the threads may need to frequently
>> park in the kernel: context switches, user-kernel transitions, lock
>> convoys, "possible" priority inversion, etc., etc...
>
> I have no objection to optimizing mutexes to decrease the cost in the
> uncontended case.

Agreed. However, the point I was really trying to get at was that the
coherency traffic generated by "frequent access to uncontended locks" is
usually greater than the traffic generated by "frequent access to
conventional write-based lock-free algorithms" (e.g., IBM freelist,
lock-free queue, etc...). Agreed?

Before you answer, think of this... An IBM freelist usually requires one
release barrier and a CAS instruction to push a node, and one DWCAS
instruction and an acquire barrier to pop a node... A lock-based freelist
requires 2 CAS operations and acquire/release barriers to push a node,
and 2 CAS operations and acquire/release barriers to pop a node. Which
one is going to win?


> However, in the contended case, the cost of rescheduling is well worth
> it -- a thread that won't conflict gets to run at full speed rather than a
> thread that will conflict getting to fight at low speed.

What is the point of adaptive mutexes then?... They try to achieve a
lock-free fast-path on the mutex by spinning, no? Do you have a problem with
adaptive mutexes? If not, why spin on the lock, when you could be making
guaranteed forward progress on each spin with a lock-free algorithm? I say
avoid the kernel at all costs, even if it means using a locking pattern like
this:

for (;;)
{
    if (! try_to_lock())
    {
        // Lock is busy: try to get some other useful work done
        // instead of spinning or parking in the kernel.
        if (try_to_do_something_else()) { continue; }

        // Nothing else to do; block on the lock after all.
        lock();
    }

    // ... critical section ...

    unlock();

    try_to_do_something_else(); // more useful work outside the lock
}

...

Any thoughts?


David Schwartz

Apr 1, 2006, 10:45:26 PM

"chris noonan" <use...@leapheap.co.uk> wrote in message
news:1143852300.4...@v46g2000cwv.googlegroups.com...

> If you have other, unrelated applications running on your
> system you have a small problem. Buy another machine
> and move those applications onto it.

Who said they were "unrelated"?

> I thought there was tacit agreement that we were discussing
> a different problem: you have one application running on your
> system; you need more processing power; however you are
> already at the point on the price/performance curve where
> doubling processing power will multiply hardware cost tenfold.
> That is the big problem and has been for the last forty years.

Really?!

DS


Chris Thomasson

Apr 2, 2006, 4:34:29 AM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:e0hb84$e0$1...@nntp.webmaster.com...

>
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:U6SdnUIS2vs...@comcast.com...
>
>> This makes no sense. Most lock-free algorithms are of the read lock-free
>> variety and for things like RCU and SMR hazard pointers actually reduce
>> the amount of cache traffic. RCU was created at Sequent where this kind
>> of thing was important since it was in a NUMA environment.
>
> If the algorithm is purely read, then this argument does not apply.

No. Not purely read, it's "mostly" read.


* The argument most certainly does apply *


Some lock-free reader patterns are ideal for NUMA-based environments.
Remember, it's mostly read... There are also some reader patterns that can
handle periods of prolonged heavy write activity... No problem.


> But
> there are very few lock-free algorithms that don't involve allowing
> threads
> to modify things concurrently.

Not too sure what you're getting at here... You can use mutexes, or apply
basic write-based lock-free algorithms, for the "writer" side of a lock-free
reader pattern... No problem.


> Suppose you have two CPUs and four threads that want to run. Two of the
> threads want to access queue A and two of the threads want to access queue
> B. If each queue is protected by a lock, the scheduler will have to
> schedule
> one thread using queue A and one using queue B. Both threads will run at
> full speed and will not interfere with each other. If the threads are lock
> free, it is equally likely that two threads will run using the same queue
> as
> using different queues, the queue structures will ping-pong violently
> across
> the FSB and the two threads will run at low speed.

What about a single-producer/consumer loopless lock-free queue algorithm
that doesn't use any atomic operations or StoreLoad membars? Can you design
a lock-based solution that can defeat it? You can try to get a mutex-condvar
based design to compete with an algorithm that is 100% loopless and has no
atomics or hefty membars:

http://appcore.home.comcast.net/

Good Luck!

;)
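
Not the AppCore code itself, but a minimal modern-C++ sketch of the kind of
queue meant here: each index is written by exactly one thread, so neither
side ever executes an interlocked RMW instruction, and on most hardware the
acquire/release operations below compile to plain loads and stores with no
StoreLoad fence:

#include <atomic>
#include <cstddef>

// Single-producer/single-consumer ring buffer; N must be a power of two.
template <typename T, std::size_t N>
class SpscQueue {
    T                        buf[N];
    std::atomic<std::size_t> head { 0 };   // written only by the consumer
    std::atomic<std::size_t> tail { 0 };   // written only by the producer
public:
    bool push(const T& v) {                // producer only
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N)
            return false;                  // full
        buf[t % N] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& v) {                       // consumer only
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                  // empty
        v = buf[h % N];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};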


> Locks serve an extremely important purpose -- they deschedule threads
> that will compete for the same cache lines and allow threads that have no
> such dependencies to run concurrently at full speed. Lock-free algorithms
> work well only when they don't disturb this, for example, when they are
> embedded inside the operating system or in very low-level libraries.
> However, when lock-free structures are used at higher levels, this
> important
> property of locks is lost, and performance suffers.
>
> Contention is bad. Lock-free algorithms allow threads that content to
> run at the same time, contending all the while.

Hummm... "Certain" lock-free algorithms can allow for contentionless,
synchronization-free reads... You seem to be totally missing the point that
lock-free algorithms can be used to remove the need for synchronization
instructions in a number of different algorithms.
?


Chris Thomasson

Apr 2, 2006, 8:07:35 PM
"David Hopwood" <david.nosp...@blueyonder.co.uk> wrote in message
news:1143735782....@i39g2000cwa.googlegroups.com...

I agree, atomic lock-free refcounting can be very expensive. However, you
can simply amortize the cost of reference counting, if you're using
ref-counting to begin with, when you're implementing a proxy collector. You
don't reference count everything, "just" the collector structures. IMO,
using lock-free atomic reference counting to protect each node of a
data structure would generate massive overheads. However, as you probably
know, there are other methods that can amortize the cost of actually
reference counting the collector structures. I created a straightforward
usage pattern for my VZOOM collector that is based on per-thread
periodic/episodic synchronizations...


- You don't need to activate the collector every time you access a shared
data structure... You can set up a runtime-configurable dynamic ratio of
accesses:syncs... Unfortunately, algorithms like RCU are not compatible with
this setup because it simply requires an arbitrary number of references to
persist over quiescent states. Your application will drive right off a cliff
if you try to do this with RCU! IMO, even RCU-SMR can't really fit this
bill because it was not really designed to allow the user to grab an
arbitrary number of references...


- Also, the collector I came up with can be implemented with POSIX and
dependent load-ordering. So, an application can use it to implement
lock-free reader patterns on many different operating systems and processor
combinations. Can a traditional GC compete with the flexibility of a
portable user-space proxy collector wrt solving the reader/writer problem in
lock-free algorithms?
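
To make the shape of this concrete, here is a toy quiescent-state scheme
(deliberately simplified, and closer to classic RCU than to VZOOM, whose
details aren't public): each reader bumps a per-thread counter at its chosen
sync points, and a reclaimer frees a retired object once every reader has
passed at least one sync point since the retirement snapshot:

#include <atomic>
#include <cstddef>

enum { MAX_READERS = 16 };

struct ReaderSlot {
    std::atomic<unsigned long> count { 0 };  // bumped at each sync point
};

ReaderSlot slots[MAX_READERS];               // one slot per reader thread

// Reader: called periodically/episodically, NOT on every access; the
// reader must not hold references to retired nodes across this call.
void sync_point(std::size_t me) {
    slots[me].count.fetch_add(1, std::memory_order_acq_rel);
}

// Reclaimer: snapshot all counters when an object is retired...
void snapshot(unsigned long snap[MAX_READERS]) {
    for (std::size_t i = 0; i < MAX_READERS; ++i)
        snap[i] = slots[i].count.load(std::memory_order_acquire);
}

// ...and free the object once every reader has synced since the snapshot.
bool all_passed(const unsigned long snap[MAX_READERS]) {
    for (std::size_t i = 0; i < MAX_READERS; ++i)
        if (slots[i].count.load(std::memory_order_acquire) == snap[i])
            return false;   // reader i hasn't reached a sync point yet
    return true;
}

A reader that stops calling sync_point() stalls reclamation indefinitely;
handling inactive threads is where the real engineering effort goes.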


Chris Thomasson

Apr 2, 2006, 8:17:48 PM
"David Hopwood" <david.nosp...@blueyonder.co.uk> wrote in message
news:1143735782....@i39g2000cwa.googlegroups.com...
> [posting via Google because of news server problems -- sorry if this is
> a duplicate]
>
> Chris Thomasson wrote:
>> Sergey P. Derevyago wrote:
>>>Joe Seigh wrote:
>>>
>>>>> The question is: Can you take XXX proxy GC and plug it just like
>>>>>Boehm GC?
>>>>
>>>>No, things like RCU, etc... aren't used for storage management. They're
>>>>used for read lock-free solutions to the reader/writer problem in a
>>>>multi-threaded environment.
>>>
>>> So I was right: there is no GC/C++ that doesn't "stop the world".
>>> And it seems like it's not possible for any GC/C++ not to "stop the
>>> world".
>>
>> You don't need to garbage collect everything. Proxy collection can allow
>> an application to get away from traditional GC, and all of the overhead
>> that comes along with them.
>
> I always find it odd when people talk about the "overhead of GC" as though
> other memory management techniques had no overheads.

Ahh, however, there are virtually zero-overhead proxy collection algorithms
out there. No kidding...

VZOOM and Joe Seigh's RCU-SMR can be implemented in a highly non-portable
way that can dissect quiescent states out of operating-system-provided
information...

David Schwartz

Apr 3, 2006, 5:09:25 PM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:OrGdnYZrUO9...@comcast.com...

> What about a single-producer/consumer loopless lock-free queue algorithm
> that dosen't use any atomic operations or StoreLoad membars? Can you
> design
> a lock-based solution that can defeat it? You can try to get a
> mutex-condvar
> based design to compete with an algorithm that is 100% loopless, has no
> atomics, or hefty membars:

If the performance of a queue algorithm in an application significantly
affects the performance of that application, I would argue that the
application is probably incredibly badly designed. In any realistic case,
the complexity, fragility, non-portability, and greater likelihood of bugs
in the lock-free algorithm would rule it out entirely.

>Hummm.... "Certain" lock-free algorithms can allow for contention less
>synchronization free reads... You seem to be totally missing the point that
>lock-free algorithms can be used to remove the need for synchronization
>instructions on a number of different algorithms.

These contentionless, synchronization-free reads increase the cost of the
next write because the shared cache line has to be made exclusive. If that
happens a lot, you wind up better off using locks.

The whole point is this -- don't concurrently schedule threads that
contend. That's the secret to high-performance multi-threaded applications.
Really. The exceptions are in very low-level code (like synchronization
primitives, memory allocators, and so on) where there's not much else you
can do or sharing is forced by the architecture.

DS


Chris Thomasson

unread,
Apr 3, 2006, 9:01:12 PM4/3/06
to
> "Chris Thomasson" <cri...@comcast.net> wrote in message
> news:OrGdnYZrUO9...@comcast.com...
>
>> What about a single-producer/consumer loopless lock-free queue algorithm
>> that doesn't use any atomic operations or StoreLoad membars? Can you
>> design
>> a lock-based solution that can defeat it? You can try to get a
>> mutex-condvar
>> based design to compete with an algorithm that is 100% loopless, has no
>> atomics, or hefty membars:
>
> If the performance of a queue algorithm in an application significantly
> affects the performance of that application, I would argue that the
> application is probably incredibly badly designed.

The frequent use of atomic operations and membars in a queue
implementation can have a marked negative effect on other threads running on
the same core. I have stated this before. The ability to remove
synchronization instructions from a queue can drastically reduce the
coherency traffic generated per-call to the queue API. IMHO, virtually
zero-overhead user-space queues are the way to go for scalable
message-passing. Also, these queues are great for NUMA environments where
atomic operation/membar usage is usually frowned upon...
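
For reference, here is a minimal sketch of the kind of
single-producer/single-consumer queue being referred to. This is the classic
Lamport-style ring buffer, not Chris's actual algorithm: it is loopless and
uses no CAS and no #StoreLoad barrier. In real code the marked publishing
store needs release (#StoreStore) ordering and the loads need #LoadLoad or
dependent ordering, which plain C does not guarantee by itself:

#define QSIZE 1024U  /* must be a power of two */

typedef struct {
    void *slot[QSIZE];
    volatile unsigned head;  /* advanced only by the consumer */
    volatile unsigned tail;  /* advanced only by the producer */
} spsc_queue_t;

/* Producer only. Returns 0 if the queue is full. */
int spsc_push(spsc_queue_t *q, void *p)
{
    unsigned t = q->tail;
    if (t - q->head == QSIZE)
        return 0;                /* full */
    q->slot[t % QSIZE] = p;      /* store the payload first...         */
    q->tail = t + 1;             /* ...then publish it (needs release) */
    return 1;
}

/* Consumer only. Returns NULL if the queue is empty. */
void *spsc_pop(spsc_queue_t *q)
{
    unsigned h = q->head;
    if (h == q->tail)
        return NULL;             /* empty */
    void *p = q->slot[h % QSIZE];
    q->head = h + 1;             /* hand the slot back to the producer */
    return p;
}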


> In any realistic case, the complexity, fragility, non-portability, and
> greater likelihood of bugs in the lock-free algorithm would rule it out
> entirely.

http://groups.google.com/group/comp.programming.threads/msg/e7a5514f6c3306d9

Humm, for lock-free algorithms that use pointers with DWCAS, non-portability
is definitely an issue. However, how non-portable is a word-based lock-free
algorithm? Remember, word-based means that it can use normal CAS... Also, how
non-portable is using POSIX mutexes to create an environment that supports
lock-free reader patterns?


>>Hummm.... "Certain" lock-free algorithms can allow for contention less
>>synchronization free reads... You seem to be totally missing the point
>>that
>>lock-free algorithms can be used to remove the need for synchronization
>>instructions on a number of different algorithms.
>
> These contentionless synchronization free reads increase the cost of
> the next write because the shared cache line has to be made exclusive. If
> that happens a lot, you wind up better off using locks.

Using locks for the readers? Tell that to Mr. McKenney...

http://www.rdrop.com/users/paulmck/RCU/lockperf.2004.01.17a.pdf


IMO, I do not want my writers and readers to be mutually excludable... No
way...


BTW, I can get hundreds of thousands more reads per second per thread with
lock-free readers. I remember that I was having trouble getting a lock-based
solution to even get close to a hundred thousand reads. The numbers get even
more dramatic when an ultra-low-overhead proxy GC is put to good use...

http://groups.google.com/group/comp.programming.threads/msg/6bd87eef7e071d0a


Any thoughts?


( David, I think I accidentally mailed this response to you, if I did:
please ignore it. Sorry! )


dav...@webmaster.com

unread,
Apr 5, 2006, 5:36:31 AM4/5/06
to
>Why don't you ask Microsoft Corporation and
>Intel Corporation how a 4-way machine can run
>four threads slower than one (289 seconds versus
>265)? I am sure most people will find that
>comparison the striking one.

You can always blame the machine. You can always blame the programmer.
(The programmer should have known that his algorithm would suck on this
machine, thus he shouldn't have used it.) There's some truth to both
positions.

My bet would be that if a 4-way machine runs four threads slower than
one, you have an algorithm that is killing the FSB. Note that lock-free
algorithms will let those four threads continue to run at a ludicrously
low speed. A lock will deschedule all but one of those four threads.
With luck, uncontending threads will get the other CPUs, but in this
example, you're better off even if they don't.

>The locking algorithm is a performance-sensitive
>routine, coded in C by the sort of people who
>write books about coding in C, and compiled with
>a world-class optimising compiler. So it must be
>good

I am very suspicious of this type of code. It's generally
micro-optimized in artificial benchmark circumstances, where the
benefit of allowing uncontended threads to run isn't measured but the
'benefit' of allowing contending threads to run slowly is.

Tests are done on a machine that is doing nothing else, and results are
valid only if the tested structure is a hideous, inexcusable
bottleneck. The sort of monstrosity that should just be avoided,
except where you can't (memory allocators, synchronization primitives,
and so on).

DS

Joe Seigh

unread,
Apr 5, 2006, 7:24:37 AM4/5/06
to
dav...@webmaster.com wrote:
>>Why don't you ask Microsoft Corporation and
>>Intel Corporation how a 4-way machine can run
>>four threads slower than one (289 seconds versus
>>265)? I am sure most people will find that
>>comparison the striking one.
>
[...]

>
> Tests are done on a machine that is doing nothing else, and results are
> valid only if the tested structure is a hideous, inexcusable
> bottleneck. The sort of monstrosity that should just be avoided,
> except where you can't (memory allocators, synchronization primitives,
> and so on).
>

There could very well be problems with the testcase cited here,
but have you actually run or know of any benchmarks or testcases
that support your position, or are you just saying all this? Seems like
there would be tradeoffs with your approach, e.g. extra suspend/resume
overhead and lack of concurrency. The latter needs to be explained:
why would you design a program with more threads than can be
utilized? Were they needed in the first place?

David Schwartz

unread,
Apr 5, 2006, 5:45:09 PM4/5/06
to

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:f_6dnQe90L5...@comcast.com...

>> Tests are done on a machine that is doing nothing else, and results are
>> valid only if the tested structure is a hideous, inexcusable
>> bottleneck. The sort of monstrosity that should just be avoided,
>> except where you can't (memory allocators, synchronization primitives,
>> and so on).

> There could very well be problems with the testcase cited here,
> but have you actually run or know of any benchmarks or testcases
> that support your position, or are you just saying all this?

Almost every benchmark I have ever seen supports my position. I'm kind
of puzzled why you'd ask me that considering we're talking about a case
where the benchmark provided supports my point!

> Seems like
> there would be tradeoffs with your approach, e.g. extra suspend/resume
> overhead and lack of concurrency. The latter needs to be explained:
> why would you design a program with more threads than can be
> utilized? Were they needed in the first place?

That would be the argument *against* lock-free. Why do you need all
those threads to access the same collection? My argument is that in the real
world, most of the threads will be accessing different collections and in
the rare case where two threads try to manipulate the same structure,
descheduling one is better than letting them slowly suffer through it.

You create as many threads as you can keep busy, plus a few extras in
case of unexpected blocking. Contention should be the exception rather than
the rule. Again, the exceptions are in memory allocators and in the
implementation of synchronization primitives where contention is
unavoidable.

DS


David Hopwood

unread,
Apr 5, 2006, 9:33:38 PM4/5/06
to
Sergey P. Derevyago wrote:
> David Hopwood wrote:
>
>>Sergey is mistaken in thinking that global GC implies "stop the world"
>
> Explain please how this could be possible.

The first fully worked-out solution was given in this paper:

Damien Doligez, Georges Gonthier,
"Portable, Unobtrusive Garbage Collection for Multiprocessor Systems,"
21st Symp. on Principles of Programming Languages, pp. 70-83, Jan 1994.
<http://citeseer.ist.psu.edu/doligez94portable.html>

and here is a later paper applying it to Java and loosening some assumptions
(e.g. not assuming sequential consistency):

Tamar Domani, Elliot K. Kolodner, Ethan Lewis, Eliot E. Salant,
Katherine Barabash, Itai Lahan, Yossi Levanoni, Erez Petrank, Igor Yanover,
"Implementing an On-the-fly Garbage Collector for Java,"
In Proceedings of the ACM SIGPLAN Symposium on Memory Management,
pp. 155--165, 2000.
<http://citeseer.ist.psu.edu/domani00implementing.html>

Note that, as I said, it would be difficult to apply this in a way that
would be link-compatible with existing C++ implementations. It is a precise
GC which requires update barriers. This is not incompatible with the
semantics of C++ (except possibly for some misfeatures inherited from C:
type punning via untagged unions, 'memcpy', etc.), but it is incompatible
with existing C++ ABIs.
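
To make "update barrier" concrete: in such collectors every pointer store the
mutator performs is compiled into a small hook rather than a raw assignment,
which is exactly the ABI change referred to above. A rough illustrative sketch
(the shading policy varies between collectors; this is not the DLG barrier
itself, and gc_shade is a hypothetical stub standing in for the collector's
marking routine):

typedef struct obj obj_t;

static void gc_shade(obj_t *o)
{
    (void)o;  /* stub: the collector would mark o grey here */
}

/* Every mutator store of a heap pointer goes through this hook. */
static inline void write_ref(obj_t **field, obj_t *newval)
{
    if (*field)  gc_shade(*field);   /* Yuasa-style: shade the old target */
    *field = newval;
    if (newval)  gc_shade(newval);   /* Dijkstra-style: shade the new one */
}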

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

unread,
Apr 5, 2006, 10:03:57 PM4/5/06
to
David Schwartz wrote:
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:f_6dnQe90L5...@comcast.com...
>
>
>>>Tests are done on a machine that is doing nothing else, and results are
>>>valid only if the tested structure is a hideous, inexcusable
>>>bottleneck. The sort of monstrosity that should just be avoided,
>>>except where you can't (memory allocators, synchronization primitives,
>>>and so on).
>
>
>>There could very well be problems with the testcase cited here,
>>but have you actually run or know of any benchmarks or testcases
>>that support your position, or are you just saying all this?
>
>
> Almost every benchmark I have ever seen supports my position. I'm kind
> of puzzled why you'd ask me that considering we're talking about a case
> where the benchmark provided supports my point!
>
This one? That's the one I thought was being discussed.

chris noonan wrote:
> A real-world example, more information available at:
> group: comp.unix.internals
> topic: Kernel Memory Allocator
> date: 6 November 2003
>
> A simple program (benchmark type) that divides a fixed
> amount of work up among a variable number of threads,
> so that with perfect scaling all runs take the same time.
> Can use either a locking or lock-free algorithm.
>

> Several runs on a 4-way machine. Each result sequence
> starts at one thread and doubles successively up to
> 128 (1,2,4,8,16,32,64,128). Timings are in seconds
> (so the smaller the better).
>
> Locking: 265,289,280,400,555,456,604,575
> Lock-free: 245,122,80,70,70,58,75,82
>

> In this case lock-free scales better.
>

dav...@webmaster.com

unread,
Apr 6, 2006, 7:01:32 AM4/6/06
to

Joe Seigh wrote:
> David Schwartz wrote:

> > Almost every benchmark I have ever seen supports my position. I'm kind
> > of puzzled why you'd ask me that considering we're talking about a case
> > where the benchmark provided supports my point!

> This one? That's the one I thought was being discussed.

> > Locking: 265,289,280,400,555,456,604,575


> > Lock-free: 245,122,80,70,70,58,75,82
> >
> > In this case lock-free scales better.

This one definitely supports my claim.

First, the statement "lock-free scales better" is nonsense. This
particular lock free algorithm scaled better than this particular
locking algorithm. There's lots of evidence to suggest that the locking
algorithm was atrocious. (For example, switching to two threads
actually *slowed* it down. That's really only possible if you're doing
something really bad, like perhaps holding the locks for too little
time).

Second, this clearly shows serious problems with the lock-free
algorithm of exactly the type I'm talking about. For example, on a
4-CPU machine, 2 threads took 122 and 4 took 80. That's *atrocious*
scaling.

DS

Sergey P. Derevyago

unread,
Apr 7, 2006, 4:22:29 AM4/7/06
to
David Hopwood wrote:
> >>Sergey is mistaken in thinking that global GC implies "stop the world"
> >
> > Explain please how this could be possible.
>
> The first fully worked-out solution was given in this paper:
>
> Damien Doligez, Georges Gonthier,
> "Portable, Unobtrusive Garbage Collection for Multiprocessor Systems,"
> 21st Symp. on Principles of Programming Languages, pp. 70-83, Jan 1994.
> <http://citeseer.ist.psu.edu/doligez94portable.html>
>
Unfortunately:
1. This paper doesn't give a solution to my original question.
2. The paper presents an algorithm which is broken if a relaxed memory model
requires both the reader and the writer to issue memory barriers.

The details are:

1. The last paragraph of 2.3 "The Doligez-Leroy design" says: "...clearly
the global collector cannot trace the local heap without the cooperation of
the mutator thread."
That is, they "clearly" agree with my POV: you _must_ "stop the world" if you
don't cooperate with the user code. But the cooperation means _manual_
management. And this manual management is, in some sense, equal to manual
memory management. That is, it's not the GC I'm talking about.
BTW, the cooperation required is quite tricky and easy to get wrong.

2. The paper's 1 "Introduction" says: "...the published proven algorithms
often contain simplifying assumptions that cannot be met in practice in a
multiprocessor system, because this would either impose unbearable overhead on
the mutator processes, or require a degree of hardware and/or operating system
support that compromises portability. Implemented systems that do not fall in
the latter two categories often rely on incompletely formalized algorithms,
which generally means buggy algorithms, given the subtleness of the
correctness proofs. To our knowledge, and as we shall argue below, all
published concurrent collectors fall in one of the above categories, and thus
fail to meet at least one of the basic requirements for portable, effective
garbage collection on multiprocessors."
Bravo!
Unfortunately, the paper doesn't mention modern relaxed memory models, so it
definitely "contains simplifying assumptions that cannot be met in
practice...".
The show is over.

> and here is a later paper applying it to Java and loosening some assumptions
> (e.g. not assuming sequential consistency):
>

No time for Java, sorry.
I asked for a pluggable (automatic!) GC/C++ that doesn't "stop the world".
The paper above just confirmed that it's impossible to implement such a
beast (on modern multiprocessors that use relaxed memory models).
--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net

David Hopwood

unread,
Apr 7, 2006, 9:39:34 AM4/7/06
to
Sergey P. Derevyago wrote:
> David Hopwood wrote:
>
>>>>Sergey is mistaken in thinking that global GC implies "stop the world"
>>>
>>> Explain please how this could be possible.
>>
>>The first fully worked-out solution was given in this paper:
>>
>> Damien Doligez, Georges Gonthier,
>> "Portable, Unobtrusive Garbage Collection for Multiprocessor Systems,"
>> 21st Symp. on Principles of Programming Languages, pp. 70-83, Jan 1994.
>> <http://citeseer.ist.psu.edu/doligez94portable.html>
>>
> Unfortunately:
> 1. This paper doesn't give a solution to my original question.

I answered the question you asked above.

> 2. The paper presents an algorithm which is broken if a relaxed memory model
> requires both the reader and the writer to issue memory barriers.

>>and here is a later paper applying it to Java and loosening some assumptions
>>(e.g. not assuming sequential consistency):
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


>
> No time for Java, sorry.

Pay attention. Why do you think I referenced both papers, anticipating the
point about relaxed memory models?

> The details are:
>
> 1. The last paragraph of 2.3 "The Doligez-Leroy design" says: "...clearly
> the global collector cannot trace the local heap without the cooperation of
> the mutator thread."
> That is, they "clearly" agree with my POV: you _must_ "stop the world" if you
> don't cooperate with the user code. But the cooperation means _manual_
> management.

Nonsense.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Sergey P. Derevyago

unread,
Apr 7, 2006, 10:11:02 AM4/7/06
to
David Hopwood wrote:
> > 1. The last paragraph of 2.3 "The Doligez-Leroy design" says: "...clearly
> > the global collector cannot trace the local heap without the cooperation of
> > the mutator thread."
> > That is, they "clearly" agree with my POV: you _must_ "stop the world" if you
> > don't cooperate with the user code. But the cooperation means _manual_
> > management.
>
> Nonsense.
>
Not at all.
If user code must carefully schedule the calls to some GC functions, then we
are dealing with some hidden form of _manual_ memory management rather than
automatic GC.
In particular, if some thread stops calling the Cooperate procedure (defined
in the paper), then the whole (process-wide) Collector will stop collecting
garbage! Nonsense, if we're talking about any reasonable GC.

Chris Thomasson

unread,
Apr 7, 2006, 5:02:42 PM4/7/06
to

Indeed...

The mutator cooperation scheme described in the latter paper is kind of
similar to the per-thread/cpu periodic/episodic synchronization pattern
my stuff uses. I only briefly skimmed through the documentation;
however, they seem to be relying on release semantics along with load
barriers... I wonder exactly what type of load barrier they are talking
about: an explicit #LoadLoad, or a dependent load? I assume they are
using dependent load ordering... Since they are using release barriers
for the mutator cooperation, subsequent loads can rise above them. So
far, this is compatible with epoch-based proxy collection.

I bet they could get rid of the explicit release barriers... However, I
can't really think of a way for them to get rid of periodic thread
suspensions for register/stack snapshots. As you know, proxy gc doesn't
need to track this information. That's one of the benefits of not having
to "collect-the-world"...

;)


Another benefit of proxy collection is the ability of the algorithm to
recognize both writer and reader threads. A simple example of this would
be how a proxy gc can handle periods of heavy write activity. If the
collector gets overloaded with deferred requests it can choose to block
the writer threads only; reader threads continue along at full-speed.
The writers get signaled when the deferred request level drops back down
to an acceptable level...
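
A minimal sketch of that block-the-writers-only policy. Everything here is
hypothetical (the names, the limit, and the assumption of a collector thread
that reclaims deferred requests in batches); readers never touch this path at
all:

#include <pthread.h>

#define DEFER_LIMIT 1024  /* high-water mark for pending deferred requests */

static pthread_mutex_t defer_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  defer_cnd = PTHREAD_COND_INITIALIZER;
static unsigned deferred;  /* deferred requests not yet reclaimed */

/* Called by a writer after it defers a node to the collector. */
void defer_throttle_enter(void)
{
    pthread_mutex_lock(&defer_mtx);
    while (deferred >= DEFER_LIMIT)  /* collector overloaded: block writer */
        pthread_cond_wait(&defer_cnd, &defer_mtx);
    ++deferred;
    pthread_mutex_unlock(&defer_mtx);
}

/* Called by the collector each time it reclaims a batch of requests.
   Unblocking depends only on collector progress, never on other
   application threads, so no deadlock can be introduced. */
void defer_throttle_exit(unsigned reclaimed)
{
    pthread_mutex_lock(&defer_mtx);
    deferred -= reclaimed;
    if (deferred < DEFER_LIMIT)
        pthread_cond_broadcast(&defer_cnd);
    pthread_mutex_unlock(&defer_mtx);
}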

Sergey P. Derevyago

unread,
Apr 8, 2006, 6:51:22 AM4/8/06
to
Chris Thomasson wrote:
> If the collector gets overloaded with deferred requests it can choose to
> block the writer threads only; reader threads continue along at full-speed.
>
A run-time environment that is allowed to stop arbitrary threads at arbitrary
points is deadlock-prone and must not be used in reliable applications.
Yes, "stop the world" approach is definitely bad. But it seems to be better
than the alternatives proposed so far (for real-world GC).

David Hopwood

unread,
Apr 8, 2006, 7:37:28 AM4/8/06
to
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>
>>If the collector gets overloaded with deferred requests it can choose to
>>block the writer threads only; reader threads continue along at full-speed.
>
> A run-time environment that is allowed to stop arbitrary threads at arbitrary
> points is deadlock-prone and must not be used in reliable applications.

No, this is not deadlock-prone. No correct GC or proxy collector arbitrarily
terminates threads; concurrent collectors only cause a thread to run collector
code for a short time before resuming the application code, and the collector
code does not block on any conditions that might create the possibility of
deadlock. Deadlock can only occur if there are cyclic dependencies between
conditions on which threads might block.

> Yes, "stop the world" approach is definitely bad. But it seems to be better
> than the alternatives proposed so far (for real-world GC).

There is nothing wrong with the DLG collector or any of its variants in this
respect; they do not stop the world, and they do not create deadlocks. Neither
does the proxy collector optimization that Chris Thomasson was referring to.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Sergey P. Derevyago

unread,
Apr 8, 2006, 7:39:43 AM4/8/06
to
David Hopwood wrote:
> >>and here is a later paper applying it to Java and loosening some assumptions
> >>(e.g. not assuming sequential consistency):
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > No time for Java, sorry.
>
> Pay attention. Why do you think I referenced both papers, anticipating the
> point about relaxed memory models?
>
Well. As long as the previous paper is found to be flawed, let's look at

"Implementing an On-the-fly Garbage Collector for Java"
http://citeseer.ist.psu.edu/domani00implementing.html

1. An artificial/managed Java run-time is substantively different from a native
C++ run-time. The JVM is able to emit on the fly any utility calls into the
user code, so it has a degree of control you don't have over conventional C++
programs. So the JVM can really implement something that is impossible in the
case of regular native C++ programs.
As a result, the paper is irrelevant to my "show me please a pluggable GC/C++
that doesn't stop the world" question.

2. Nevertheless, the paper doesn't show a GC that doesn't stop the world even
for Java!
In particular, 8 "Future directions" tells: "Even when using an on-the-fly GC
algorithm, for a highly multithreaded program running on a multiprocessor with
many processors, the garbage collector thread may be overwhelmed by mutators
rapidly creating and discarding objects. Boosting the priority of the
collector thread can help, but only up to a certain degree. If the collector
is not able to clear garbage at the rate that it is being produced, then
eventually the heap will become depleted; the garbage collection will be
synchronous, so that threads attempting to allocate will have to wait until
the collector sweeps memory and begins freeing objects before they will be
able to continue. In this case the collector is essentially reduced to working
in stop-the-world mode, but with the extra overhead of the write barriers.".

3. Next. You also wrote "...here is a later paper applying it to Java and
loosening some assumptions (e.g. not assuming sequential consistency)". Your
"some assumptions" note is misleading: the paper explicitly states that a
rather big subset of important modern architectures is out of scope.
In particular, 3.1 "Memory coherency" says: "There are architectures that
employ weaker forms of memory coherency, where any pair of memory accesses can
be exchanged (assuming no data dependency). In order to run the on-the-fly
collector on these architectures, the load-load, load-store, and store-store
dependencies would also have to be removed from the algorithm. This is an item
for future work.".

4. The bottom line is rather frustrating. You've referenced two papers, and
both of them are obviously irrelevant and flawed.

Sergey P. Derevyago

unread,
Apr 8, 2006, 8:10:58 AM4/8/06
to
David Hopwood wrote:
> > A run-time environment that is allowed to stop arbitrary threads at arbitrary
> > points is deadlock-prone and must not be used in reliable applications.
>
> No, this is not deadlock-prone.
>
Yes, it is. Consider:

pthread_mutex_lock(&mutex);
//...
void* ptr=malloc(size); // can block while mutex is held
//...
pthread_mutex_unlock(&mutex);

> No correct GC or proxy collector arbitrarily
> terminates threads; concurrent collectors only cause a thread to run collector
> code for a short time before resuming the application code, and the collector
> code does not block on any conditions that might create the possibility of
> deadlock. Deadlock can only occur if there are cyclic dependencies between
> conditions on which threads might block.
>

In the code above I shouldn't consider the case where malloc blocks for an
unpredictable time because some GC decided that my thread is a "writer"...
Should I?

> > Yes, "stop the world" approach is definitely bad. But it seems to be better
> > than the alternatives proposed so far (for real-world GC).
>
> There is nothing wrong with the DLG collector or any of its variants in this
> respect;
>

Yes, it is.
Please read my previous replies.

> they do not stop the world,
>

They do "stop the world" if "the garbage collector thread is overwhelmed by


mutators rapidly creating and discarding objects".

Joe Seigh

unread,
Apr 8, 2006, 8:50:12 AM4/8/06
to
Sergey P. Derevyago wrote:
> David Hopwood wrote:
[...]

>>they do not stop the world,
>>
>
> They do "stop the world" if "the garbage collector thread is overwhelmed by
> mutators rapidly creating and discarding objects".

I don't think anything will meet your definition of not stopping the world. So
yes, according to your definition, there are no forms of GC that do not "stop
the world".

Sergey P. Derevyago

unread,
Apr 8, 2006, 9:05:42 AM4/8/06
to
Joe Seigh wrote:
> > They do "stop the world" if "the garbage collector thread is
> > overwhelmed by mutators rapidly creating and discarding objects".
>
> I don't think anything will meet your definition of not stopping the world.
>
Sorry, that's an exaggeration.
"My definition" is obvious and straightforward: the GC intentionally stops all
user threads to collect the memory (or mark, or do whatever else is required
to be able to collect).
What's wrong with it?

> So yes, according to your definition, there are no forms of GC that do not
> "stop the world".
>

That is, you've said: according to the obvious and straightforward definition,
there are no forms of GC that do not "stop the world".

Well. I've seen the following kinds of GC:
1. GC that stops the world whenever necessary.
2. GC that dynamically traces the references (substantial overhead and
scalability issues are the result) and tries not to stop the world whenever
possible. But it does intentionally stop the world under heavy load.

Did I miss any substantial point?

Chris Thomasson

unread,
Apr 8, 2006, 2:08:29 PM4/8/06
to
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:4437A852...@iobox.com...

> David Hopwood wrote:
>> > A run-time environment that is allowed to stop arbitrary threads
>> > at arbitrary
>> > points is deadlock-prone and must not be used in reliable applications.
>>
>> No, this is not deadlock-prone.
>>
> Yes, it is. Consider:
>
> pthread_mutex_lock(&mutex);
> //...
> void* ptr=malloc(size); // can block while mutex is held
> //...
> pthread_mutex_unlock(&mutex);

First of all, proxy collectors don't really have to be involved with the
implementation of malloc, because they are usually explicitly used for
solving the reader/writer problem in lock-free algorithms, not for tracking
pointers returned by calls to malloc... Although proxy gc can be designed
this way...

Anyway, do you even know whether the malloc you are using now is lock-free? It
may take locks and need to block for a little while... Is a malloc
implementation that takes a lock, and may block, completely out of the
question? I can't see any deadlocks because there are no cyclic
dependencies...


>> No correct GC or proxy collector arbitrarily
>> terminates threads; concurrent collectors only cause a thread to run
>> collector
>> code for a short time before resuming the application code, and the
>> collector
>> code does not block on any conditions that might create the possibility
>> of
>> deadlock. Deadlock can only occur if there are cyclic dependencies
>> between
>> conditions on which threads might block.
>>
> In the code above I shouldn't consider the case where malloc blocks for an
> unpredictable time because some GC decided that my thread is a "writer"...
> Should I?

No. This is really not the place where the GC would determine whether a thread
is a writer or not. The "defer" function that is common in many proxy
collectors is the place where a writer thread can usually be differentiated
from a reader thread. Here is a simple example of how the defer function can
be utilized:


/* Writer Thread - Push */
node_t *n = malloc(...);

mutex_lock(...);

/* init new node */
n->nx = stack->front;

/* release barrier */

/* make new node visible to readers */
stack->front = n;

mutex_unlock(...);


/* Writer Thread - Pop */
node_t *n;

mutex_lock(...);

n = stack->front;
if ( n ) { stack->front = n->nx; }

mutex_unlock(...);

if ( n )
{
    /* Defer to the proxy collector */
    PGC_DEFER( n, PGC_CALLBACK );
}


/* Threads And The Lock-Free Reader Pattern */

/* read stack anchor */
node_t *nx, *n = stack->front;

/* dependent load-barrier -
   read_barrier_depends() for Linux. */

/* Iterate */
while ( n )
{
    /* read next node */
    nx = n->nx;

    /* dependent load-barrier -
       read_barrier_depends() for Linux. */

    /* do whatever */
    do_whatever( n );

    n = nx;
}

/* Periodic/Episodic Synchronizations For Readers can go here */
(...);


/* Proxy Callback */
void PGC_CALLBACK( void *s )
{
    node_t *n = s;

    /* n can now be freed or reused. No readers are accessing it */
}


>> > Yes, "stop the world" approach is definitely bad. But it seems to
>> > be better
>> > than the alternatives proposed so far (for real-world GC).
>>
>> There is nothing wrong with the DLG collector or any of its variants in
>> this
>> respect;
>>
> Yes, it is.
> Please read my previous replies.
>
>> they do not stop the world,
>>
> They do "stop the world" if "the garbage collector thread is overwhelmed
> by
> mutators rapidly creating and discarding objects".

They don't have to stop the world... IMO, stopping the world means stopping
all threads at the same time... If an implementation does not have to stop
all threads at the same time, then it's not stopping the world...


Chris Thomasson

unread,
Apr 8, 2006, 2:37:32 PM4/8/06
to
> /* Writer Thread - Pop */
> node_t *n;
>
> mutex_lock(...);
>
> n = stack->front;
> if ( n ) { stack->front = n->nx; }
>
> mutex_unlock(...);
>
> if ( n )
> {
> /* Defer to the proxy collector */
> PGC_DEFER( n, PGC_CALLBACK );
> }

Even if the defer function was called, and blocked, in the writer's critical
section, there is still no possibility of deadlock. The blocking mechanism
that can be used in the defer function does not rely on another application
thread doing something in order to unblock. There are no dependencies
between the blocking and the application logic; it's all internal to the
collector...

If you're interested, Sergey, I can give you a more detailed description of a
simple deadlock-free blocking scheme that I use for a couple of my collector
implementations... No deadlock, no way, no how...

:)


David Hopwood

unread,
Apr 8, 2006, 3:54:22 PM4/8/06
to
Sergey P. Derevyago wrote:
> David Hopwood wrote:
>
>>> A run-time environment that is allowed to stop arbitrary threads at arbitrary
>>>points is deadlock-prone and must not be used in reliable applications.
>>
>>No, this is not deadlock-prone.
>
> Yes, it is. Consider:
>
> pthread_mutex_lock(&mutex);
> //...
> void* ptr=malloc(size); // can block while mutex is held
> //...
> pthread_mutex_unlock(&mutex);

This cannot deadlock. malloc either succeeds or fails in a short time. You seem
to be confusing temporary blocking with deadlock.

>>> Yes, "stop the world" approach is definitely bad. But it seems to be better
>>>than the alternatives proposed so far (for real-world GC).
>>
>>There is nothing wrong with the DLG collector or any of its variants in this
>>respect; they do not stop the world,
>
> They do "stop the world" if "the garbage collector thread is overwhelmed by
> mutators rapidly creating and discarding objects".

I'll address that point in another reply.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

chris noonan

unread,
Apr 9, 2006, 7:08:58 PM4/9/06
to
Joe Seigh wrote:
> David Schwartz wrote:
> > Almost every benchmark I have ever seen supports my position. I'm kind
> > of puzzled why you'd ask me that considering we're talking about a case
> > where the benchmark provided supports my point!
> >
> This one? That's the one I though was being discussed.
>
> chris noonan wrote:
> > Several runs on a 4-way machine. Each result sequence
> > starts at one thread and doubles successively up to
> > 128 (1,2,4,8,16,32,64,128). Timings are in seconds
> > (so the smaller the better).
> >
> > Locking: 265,289,280,400,555,456,604,575
> > Lock-free: 245,122,80,70,70,58,75,82
> >
> > In this case lock-free scales better.

Don't waste your time, Joe.

Chris

dav...@webmaster.com

unread,
Apr 9, 2006, 8:53:42 PM4/9/06
to
>Except for hard real-time applications, page faults are not much
>of a problem, since they are guaranteed to only block for a short
>time (even if there is a disk error, since then the process will be
>terminated). There is obviously some latency associated with
>servicing a page fault, but this is unavoidable regardless of the
>concurrency design.

I don't see how you can argue that the latency of a page fault is
unavoidable. A page fault only stalls the faulting thread. A design
that permanently associates resources with threads, for example, will be
more crippled by a page fault than a design that allows whatever thread
happens to be running to grab a resource and handle it.

Consider, as an extreme example, two designs for the same server. They
both handle 10,000 clients with 100 threads. In one design, each client
is permanently assigned to a thread, 10 to each. In another design, the
clients that need work are put on a queue and a thread grabs a client
whenever it's free.

In the first design, if one client does something really unusual that
causes code that isn't resident in memory to fault in, nine other
arbitrary connections will be stalled until the code faults in, as well
as the client that triggered the fault. In the second design, the fault
will only affect the client that caused it (which did something rare
and unusual anyway), and other clients can be serviced by other
threads.

Consider also code that handles unusual error conditions. This type of
code is more likely to encounter page faults because the code is less
frequently used and less likely to be in memory. You can enter this
code while holding critical locks (thus stalling other threads while
the page fault is serviced) or you can carefully avoid doing so
(perhaps even dropping the lock and then re-acquiring it in the rarely
used code if appropriate).

So the argument that the latency of page faults is unavoidable
regardless of the design is misleading. Good design can minimize the
harm that page faults do and bad design can magnify it.
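
A sketch of that last point about dropping the lock around cold code, assuming
the lock's invariants permit releasing it at this boundary. All names are
hypothetical and the two helpers are stubs:

#include <pthread.h>

struct ctx {
    pthread_mutex_t lock;
    /* ...state protected by lock... */
};

static int  do_common_work(struct ctx *c)    { (void)c; return 0; } /* hot  */
static void handle_rare_error(struct ctx *c) { (void)c; }           /* cold */

void handle_request(struct ctx *c)
{
    pthread_mutex_lock(&c->lock);
    int rc = do_common_work(c);          /* resident, frequently-run code  */
    if (rc != 0) {
        pthread_mutex_unlock(&c->lock);  /* don't hold the lock across the */
        handle_rare_error(c);            /* cold path that may page-fault  */
        pthread_mutex_lock(&c->lock);    /* real code must re-validate the */
    }                                    /* protected state here           */
    pthread_mutex_unlock(&c->lock);
}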

DS

Sergey P. Derevyago

unread,
Apr 10, 2006, 5:15:00 AM4/10/06
to
Chris Thomasson wrote:
> They don't have to stop the world... IMO, stopping the world means stopping
> all threads at the same time... If am implementation does not have to stop
> all threads at the same time, then its not stopping the world...
>
"Even when using an on-the-fly GC
algorithm, for a highly multithreaded program running on a multiprocessor with
many processors, the garbage collector thread may be overwhelmed by mutators
rapidly creating and discarding objects. Boosting the priority of the
collector thread can help, but only up to a certain degree. If the collector
is not able to clear garbage at the rate that it is being produced, then
eventually the heap will become depleted; the garbage collection will be
synchronous, so that threads attempting to allocate will have to wait until
the collector sweeps memory and begins freeing objects before they will be
able to continue. In this case the collector is essentially reduced to working
in stop-the-world mode, but with the extra overhead of the write barriers."

Please pay attention to "the collector is essentially reduced to working in
stop-the-world mode, but with the extra overhead". I.e., it not only stops the
world but also imposes significant overhead when the world is not stopped...

Sergey P. Derevyago

unread,
Apr 10, 2006, 5:19:48 AM4/10/06
to
David Hopwood wrote:
> > pthread_mutex_lock(&mutex);
> > //...
> > void* ptr=malloc(size); // can block while mutex is held
> > //...
> > pthread_mutex_unlock(&mutex);
>
> This cannot deadlock.
>
Sorry, I wasn't quite correct. It can provoke a deadlock or some other fatal
effect.
The point is that the program logic is unaffected only if all the threads are
stopped simultaneously.

David Hopwood

unread,
Apr 10, 2006, 8:56:16 AM4/10/06
to
dav...@webmaster.com wrote:
>>Except for hard real-time applications, page faults are not much
>>of a problem, since they are guaranteed to only block for a short
>>time (even if there is a disk error, since then the process will be
>>terminated). There is obviously some latency associated with
>>servicing a page fault, but this is unavoidable regardless of the
>>concurrency design.
>
> I don't see how you can argue that the latency of a page fault is
> unavoidable. A page fault only stalls the faulting thread.

... and any other threads that logically depend on something that the
faulting thread would do after the fault. That is the latency that is
unavoidable.

> A design that permanently associates resources to threads, for example,
> will be more crippled by a page fault than a design that allows whatever
> thread happens to be running to grab a resource and handle it.
>
> Consider, as an extreme example, two designs for the same server. They
> both handle 10,000 clients with 100 threads. In one design, each client
> is permanently assigned to a thread, 10 to each. In another design, the
> clients that need work are put on a queue and a thread grabs a client
> whenever it's free.

Who is arguing for the first design? I'm certainly not.

> So the argument that the latency of page faults is unavoidable
> regardless of the design is misleading.

I didn't argue that the effect on other threads is independent of the
design. I argued that *some* latency is unavoidable regardless of the
design. I also argued that this effect is "not much of a problem" -- and
it isn't. Designs that would show a significant performance loss from
this effect are seriously broken for much more important reasons.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

unread,
Apr 10, 2006, 9:01:52 AM4/10/06
to
Sergey P. Derevyago wrote:
> David Hopwood wrote:
>
>>> pthread_mutex_lock(&mutex);
>>> //...
>>> void* ptr=malloc(size); // can block while mutex is held
>>> //...
>>> pthread_mutex_unlock(&mutex);
>>
>>This cannot deadlock.
>
> Sorry, I wasn't quite correct. It can provoke a dead lock or some other fatal
> effect.

It can only "provoke a deadlock" if the program already had a potential deadlock,
which happens to be exposed by timing issues. Otherwise, it can't.

"Some other fatal effect" is not specific enough for me to respond to, but in
the absence of any supporting technical argument it just sounds like FUD.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Sergey P. Derevyago

unread,
Apr 10, 2006, 10:16:00 AM4/10/06
to
David Hopwood wrote:
> > Sorry, I wasn't quite correct. It can provoke a dead lock or some other fatal
> > effect.
>
> It can only "provoke a deadlock" if the program already had a potential deadlock,
> which happens to be exposed by timing issues. Otherwise, it can't.
>
OK, let's consider the following scenario:
1. There are several WorkerThreads which use DispatcherThread to communicate
with each other.
2. DispatcherThread calls malloc to allocate memory for the messages.
3. The "Smart GC" has discovered that it is "overloaded with deferred
requests" and decided to stop DispatcherThread as a "writer". Now
DispatcherThread must wait while "readers" are allowed to proceed.
4. That is, DispatcherThread is blocked while the WorkerThreads pointlessly
loop (possibly updating some statistics), not receiving any data to work on.

Yes, this design isn't perfect, but it is quite possible. It relies on a key
requirement that DispatcherThread is always up and running. Otherwise the
whole subsystem isn't able to process external requests and only works on
some other tasks (irrelevant to the request processing).
That is the "fatal effect" I'm talking about.

David Hopwood

unread,
Apr 10, 2006, 12:42:21 PM4/10/06
to
Sergey P. Derevyago wrote:
> David Hopwood wrote:
>
>>> Sorry, I wasn't quite correct. It can provoke a dead lock or some other fatal
>>>effect.
>>
>>It can only "provoke a deadlock" if the program already had a potential deadlock,
>>which happens to be exposed by timing issues. Otherwise, it can't.
>
> OK, let's consider the following scenario:
> 1. There are several WorkerThreads which use DispatcherThread to communicate
> with each other.
> 2. DispatcherThread calls malloc to allocate memory for the messages.
> 3. The "Smart GC" has discovered that it is "overloaded with deferred
> requests" and decided to stop DispatcherThread as a "writer". Now
> DispatcherThread must wait while "readers" are allowed to proceed.
> 4. That is, DispatcherThread is blocked while the WorkerThreads pointlessly
> loop (possibly updating some statistics), not receiving any data to work on.

If malloc has side effects that cause other threads to block permanently, then
it is incorrect. If it only causes some threads to block temporarily, then no
deadlocks will be introduced that could not otherwise occur. That is only a
quality-of-implementation issue.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

dick.dunbar

unread,
Apr 10, 2006, 1:58:51 PM4/10/06
to
> If malloc has side effects that cause other threads to block permanently,
> then it is incorrect.

malloc does have documented side effects that might cause other threads
to block permanently.

malloc, and any system call that might invoke it, is not
async-signal-safe.

The implementations may be changing now to be lock-free, but malloc is
not re-entrant on most POSIX platforms, and you absolutely will
deadlock threads if you use it in a signal handler (well, you might
get lucky, but processor speeds change the odds).
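
To illustrate that failure mode, here is a deliberately broken sketch (do not
write handlers like this): if SIGALRM is delivered while the main loop is
inside malloc/free holding the allocator's internal lock, the handler's own
malloc tries to take the same lock and the thread deadlocks against itself:

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void bad_handler(int sig)
{
    (void)sig;
    free(malloc(64));  /* may re-enter the allocator that the interrupted
                          malloc/free below is still holding */
}

int main(void)
{
    signal(SIGALRM, bad_handler);
    alarm(1);
    for (;;)
        free(malloc(128));  /* window in which the signal can land */
}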

Otherwise, I agree with your point that malloc is one of those
"quality-of-implementation" pieces of code that has a huge effect on
programming languages like C++.

David Hopwood

unread,
Apr 10, 2006, 9:02:17 PM4/10/06
to
dick.dunbar wrote:

> David Hopwood wrote:
>> If malloc has side effects that cause other threads to block permanently, then
>> it is incorrect.
>
> malloc does have documented side effects that might cause other threads
> to block permanently.
>
> malloc, and any system call that might invoke it, is not async-signal
> safe.

<http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html>

[snip table of guaranteed async-signal-safe functions]

# All functions not in the above table are considered to be unsafe with respect
# to signals. In the presence of signals, all functions defined by this volume
# of IEEE Std 1003.1-2001 shall behave as defined when called from or interrupted
# by a signal-catching function, with a single exception: when a signal interrupts
# an unsafe function and the signal-catching function calls an unsafe function,
# the behavior is undefined.

IOW, while it is nonportable to call malloc from a signal-catching function,
there is no other constraint on how malloc may be called, and no other reason
why it might have side effects that could cause other threads to block permanently.

(Well, unless the application uses an API that explicitly imposes extra constraints
on use of non-async-signal-safe functions, as fork() does for example -- but
fork() is almost incompatible with threads anyway.)

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Chris Thomasson

unread,
Apr 14, 2006, 8:43:57 PM4/14/06
to
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:443A68A0...@iobox.com...

> David Hopwood wrote:
>> > Sorry, I wasn't quite correct. It can provoke a dead lock or some
>> > other fatal
>> > effect.
>>
>> It can only "provoke a deadlock" if the program already had a potential
>> deadlock,
>> which happens to be exposed by timing issues. Otherwise, it can't.
>>
> OK, let's consider the following scenario:
> 1. There are several WorkerThreads which use DispatcherThread to
> communicate
> with each other.
> 2. DispatcherThread calls malloc to allocate memory for the messages.
> 3. The "Smart GC" has discovered that it is "overloaded with deferred
> requests" and decided to stop DispatcherThread as a "writer". Now
> DispatcherThread must wait while "readers" are allowed to proceed.

As I stated before, proxy collectors usually do not concern themselves with
the implementation of malloc, because they are usually explicitly used for
solving the reader/writer problem in lock-free algorithms, not for tracking
pointers returned by calls to malloc... Although proxy gc can be designed
this way...

> 4. That is, DispatcherThread is blocked while the WorkerThreads pointlessly
> loop (possibly updating some statistics), not receiving any data to work on.
>
> Yes, this design isn't perfect, but it is quite possible. It relies on a key
> requirement that DispatcherThread is always up and running. Otherwise the
> whole subsystem isn't able to process external requests and only works on
> some other tasks (irrelevant to the request processing).
> That is the "fatal effect" I'm talking about.

Remember, proxy collectors can give you non-dependent temporary blocking on
the "defer" function, with absolutely no chance of deadlock. Also, there is
nothing to stop an implementation of a proxy gc from providing a "trydefer"
function, or even a "timeddefer" function... Those simple functions can
allow individual application threads to pick and choose when, and for how
long, they may block... That's another benefit of using proxy gc...


- Proxy GC can achieve a fairly high degree of flexibility by not having to
be tied to a specific malloc implementation, and by not having to
collect-the-world. IMHO, proxy collectors are an attractive alternative to
traditional garbage collection...


Sergey P. Derevyago

unread,
Apr 15, 2006, 8:38:36 AM4/15/06
to
Chris Thomasson wrote:
> As I stated before, proxy gc usually does not concern itself with the
> implementation of malloc; they can exist "completely outside" the memory
> allocator. Therefore, proxy collectors can be "freely interchanged with
> different types of allocators". That's another benefit of using proxy gc...
>
Please give me a URL to a neat description of the proxy gc you're talking
about.

Joe Seigh

unread,
Apr 15, 2006, 10:07:20 AM4/15/06
to
Sergey P. Derevyago wrote:
> Chris Thomasson wrote:
>
>>As I stated before, proxy gc usually does not concern itself with the
>>implementation of malloc; they can exist "completely outside" the memory
>>allocator. Therefore, proxy collectors can be "freely interchanged with
>>different types of allocators". That's another benefit of using proxy gc...
>>
>
> Please give me a URL to a neat description of the proxy gc you're talking
> about.

I coined the phrase to describe something that wasn't in the current lexicon.

Instead of using GC on individual objects, proxy GC is the technique of using
a proxy garbage-collected object to guard designated objects within a critical
region. When a designated object is deleted, its deallocation is tied to that
of the current proxy object, which is then (atomically, usually) replaced with a
new proxy object. When all previous references to the old proxy object are
dropped by threads leaving the critical region, the old proxy object is deallocated
along with any designated objects tied to it.

Proxy GC is a mechanism to reduce the overhead of some GC methods, e.g.
reference counting, with a trade-off of somewhat reduced GC granularity.

There are some optimizations, such as reusing the designated objects to be
freed as the new proxy object, to reduce the allocation of specialized proxy
objects. There are also some deallocation-ordering requirements to ensure the
safety of collection traversal.
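
A minimal sketch of that mechanism, with one mutex serializing everything
purely to keep the example short and obviously correct (the packages below do
the reader side with lock-free atomics instead). All names are hypothetical:

#include <stdlib.h>
#include <pthread.h>

typedef struct node { struct node *next_retired; /* ...payload... */ } node_t;

typedef struct proxy {
    int     refs;     /* readers still inside the guarded region */
    node_t *retired;  /* deleted objects tied to this proxy      */
} proxy_t;

static pthread_mutex_t g = PTHREAD_MUTEX_INITIALIZER;
static proxy_t *current;

void pgc_init(void) { current = calloc(1, sizeof *current); }

static void proxy_free(proxy_t *p)
{
    node_t *n = p->retired;
    while (n) { node_t *nx = n->next_retired; free(n); n = nx; }
    free(p);
}

/* Reader: reference the current proxy when entering the critical region. */
proxy_t *pgc_enter(void)
{
    pthread_mutex_lock(&g);
    proxy_t *p = current;
    p->refs++;
    pthread_mutex_unlock(&g);
    return p;
}

/* Reader: drop the reference on leaving; the last reader out of a
   replaced proxy frees it along with the objects tied to it. */
void pgc_exit(proxy_t *p)
{
    pthread_mutex_lock(&g);
    int dead = (--p->refs == 0 && p != current);
    pthread_mutex_unlock(&g);
    if (dead) proxy_free(p);
}

/* Writer: tie a deleted node to the current proxy, then replace the
   proxy so that it drains as its readers leave. */
void pgc_defer(node_t *n)
{
    proxy_t *fresh = calloc(1, sizeof *fresh);
    pthread_mutex_lock(&g);
    proxy_t *old = current;
    n->next_retired = old->retired;
    old->retired = n;
    current = fresh;
    int dead = (old->refs == 0);  /* no reader was inside */
    pthread_mutex_unlock(&g);
    if (dead) proxy_free(old);
}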

There are some examples of proxy GC at
http://atomic-ptr-plus.sourceforge.net/
in the atomic-ptr-plus, fastsmr, and rcpc packages.

RCU is also a form of proxy GC, although that's not so clear since there doesn't
appear to be a proxy object referenced at the beginning of a critical region
and dropped at the end, but there is. The RCU "proxy object" is the current
quiescent-state tracking interval for that thread or processor. When a thread
enters a quiescent state, it drops the reference to the previous interval.
RCU doesn't track references to individual objects, just the intervals in
which those objects can be referenced. You could also call the intervals
non-quiesced states.

Proxy GC is also used in this paper I mentioned in the discussion about Sun's
patent application for lock-free reference counting:
http://research.sun.com/scalable/pubs/main-8.pdf
There's a backwards-linked FIFO list used to GC the objects used in their
lock-free synchronization techniques.

--

Sergey P. Derevyago

unread,
Apr 15, 2006, 10:37:35 AM4/15/06
to
Joe Seigh wrote:
> Instead of using GC on individual objects, proxy GC is the technique of using
> a proxy garbage-collected object to guard designated objects within a critical
> region. When a designated object is deleted, its deallocation is tied to that
> of the current proxy object, which is then (atomically, usually) replaced with a
> new proxy object. When all previous references to the old proxy object are
> dropped by threads leaving the critical region, the old proxy object is deallocated
> along with any designated objects tied to it.
>
Don't understand, sorry. Could you please give a short real-world example
to illustrate the idea?

BTW, if some on-the-fly reference modification is supposed, then (on some
modern hardware) both the reader and the writer must issue membars, so
unnecessary contention is the result (as compared to the private per-thread
memory allocators used in well-designed MT applications).

Joe Seigh

unread,
Apr 15, 2006, 11:09:34 AM4/15/06
to
Sergey P. Derevyago wrote:
> Joe Seigh wrote:
>
>>Instead of using GC on individual objects, proxy GC is the technique of using
>>a proxy garbage-collected object to guard designated objects within a critical
>>region. When a designated object is deleted, its deallocation is tied to that
>>of the current proxy object, which is then (atomically, usually) replaced with a
>>new proxy object. When all previous references to the old proxy object are
>>dropped by threads leaving the critical region, the old proxy object is deallocated
>>along with any designated objects tied to it.
>>
>
> Don't understand, sorry. Could you please give some short real-world example
> to illustrate the idea?

Well, from the fastsmr sample code (not the most current version)
the writer does

pthread_mutex_lock(&mutex);
p2 = lfq_dequeue(&q);

if (p2) {
    p = proxy;
    proxy = p2;    // this should be atomic_store I think
    node = containerof(p, node_t, link);
    node->seqnum++;

    //
    // free item
    //
    node->defer.func = &defer_free;
    node->defer.arg = p;
    smr_defer(&(node->defer));
}
pthread_mutex_unlock(&mutex);

Readers do

smrload(local, &proxy);
for (p = atomic_load_depends(&q.tail);
     p != NULL;
     p = atomic_load_depends(&(p->next)))
{
    numNodes++;

    node = containerof(p, node_t, link);
    seqnum0 = node->seqnum;
    sched_yield();
    seqnum1 = node->seqnum;
    if ((seqnum1 - seqnum0) > 1)    // too stale
        abort();    // this never happens if fastsmr implementation is correct
}
smrnull(local);    // clear hazard pointer


>
> BTW, if some on-the-fly reference modification is supposed, then (on some
> modern hardware) both the reader and the writer must issue membars, so
> unnecessary contention is the result (as compared to the private per-thread
> memory allocators used in well-designed MT applications).

Memory model is important, but you haven't been paying attention, since
reducing memory barrier overhead is part of some of the lock-free techniques.
There's taking advantage of dependent load ordering, and RCU+SMR eliminates the
need for the store/load membar in the hazard pointer load. Although with proxy
GC that's not so critical, since you're amortizing the synchronization overhead
over the whole critical region: one hazard pointer or refcounted pointer load
vs. one for each node in the linked list in this example.
