
Ref-counted strings not thread-safe -- looking for clarification


Carlos Moreno

Aug 28, 2002, 6:03:05 PM

Hi,

I remember reading about ref-counted strings and the fact that
they cannot be made thread-safe (I *think* it was in one of the
GOTW items).

Not remembering the details well, I wonder: what does *not
thread-safe* refer to? Will I stay out of trouble if I use
locks to synchronize access to any string and guarantee that
no two threads will be "simultaneously" accessing any given
string? For instance, using g++ (in which strings are ref-
counted), if I do something like this:

Fragment of main thread:

    list<string> requests; // this list is shared between the main
                           // thread and the worker thread (say
                           // that the worker thread holds a reference
                           // to it, or that it is a global object)

    lock_the_mutex();      // the mutex is also shared by the threads

    requests.push_back (" ... ");
    // (or maybe use some string variable instead of a literal)

    unlock_the_mutex();

Fragment of worker thread:

    while (true)
    {
        string next_request = "";

        lock_the_mutex();

        if (! requests.empty())
        {
            next_request = requests.front();
            requests.pop_front();
        }

        unlock_the_mutex();

        if (next_request != "")
        {
            // process the request...
        }
    }


I assume that everything should be OK (at least concerning the
string accesses -- I didn't really pay attention to the locks
and those details, so let's assume all that is fine). I mean,
the string is necessarily deep-copied, regardless of whether the
string implementation is ref-counted or not... But the copy is
made from a string that is shared between two threads, and since
I don't know why it is that ref-counting implies a problem with
thread-safety, I'm not sure if something like the above could
end up giving some trouble.

Any words from the experts and gurus? :-)

Thanks in advance for any thoughts!

Carlos
--

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]
[ about comp.lang.c++.moderated. First time posters: do this! ]

Bo Persson

Aug 29, 2002, 10:41:18 AM

"Carlos Moreno" <moreno_at_mo...@xx.xxx> skrev i meddelandet
news:3D6CEDC2...@xx.xxx...

>
> Hi,
>
> I remember reading about ref-counted strings and the fact that
> they cannot be made thread-safe (I *think* it was in one of the
> GOTW items).
>
> Not remembering the details well, I wonder: what does *not
> thread-safe* refer to? Will I stay out of trouble if I use
> locks to synchronize access to any string and guarantee that
> no two threads will be "simultaneously" accessing any given
> string? For instance, using g++ (in which strings are ref-
> counted), if I do something like this:
>
code removed :-)

>
> I assume that everything should be OK (at least concerning the
> string accesses -- I didn't really pay attention to the locks
> and those details, so let's assume all that is fine). I mean,
> the string is necessarily deep-copied, regardless of whether the
> string implementation is ref-counted or not... But the copy is
> made from a string that is shared between two threads, and since
> I don't know why it is that ref-counting implies a problem with
> thread-safety, I'm not sure if something like the above could
> end up giving some trouble.
>

One problem is that the locking and the sharing occur at different levels.
You lock at the string level, while the sharing is at the implementation
level:

std::string s1 = "abc";

std::string s2 = s1;

ThreadA = create_thread(s1);
ThreadB = create_thread(s2);


Now, even though the two threads act on different strings, in a
reference-counted implementation they *might* share the strings'
character buffer. You cannot lock that from the outside, so the string
class has to do it internally. On the other hand, the library
implementor does not know which strings are actually used by different
threads, so *all* string modifications have to be locked, whether the
strings are shared or not. The net "efficiency" seems to be negative...
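
To see why external locking cannot work, here is a minimal sketch of
hypothetical COW internals (illustration only, not any vendor's actual
code; `Rep` and `CowString` are invented names):

    // Hypothetical COW internals -- a sketch, not a real implementation.
    struct Rep { char* data; int refs; };    // the shared representation

    struct CowString {
        Rep* rep;                            // s1.rep == s2.rep after a copy
        CowString( const CowString& o ) : rep( o.rep ) { ++rep->refs; } // (a)
        ~CowString() {
            if( --rep->refs == 0 ) { delete[] rep->data; delete rep; }  // (b)
        }
    };

    // Thread A holds s1's mutex and runs (a); thread B holds s2's mutex
    // and runs (b). Both locks are held, yet (a) and (b) still race on
    // the one shared rep->refs: the two locks protect two different
    // string objects, not the Rep they have in common. Only
    // synchronization inside the string class, at the Rep level, can
    // serialize this.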

This, and more, was discussed in GOTWs #43-45, see
http://www.gotw.ca/gotw/index.htm


Bo Persson
bo...@telia.com

Attila Feher

Aug 29, 2002, 12:00:47 PM

Carlos Moreno wrote:
>
> Hi,
>
> I remember reading about ref-counted strings and the fact that
> they can not be made thread-safe (I *think* it was in one of the
> GOTW items).

I doubt it.

> Not remembering well the details, I wonder: what does the *not
> thread-safe* refers to?

Nothing. Nobody has ever said anything like that. COW constructs need
special attention to be thread-safe, because they hide sharing of data.

> Will I stay out of trouble if a use
> locks to synchronize access to any string and guarantee that
> no two threads will be "simultaneously" accessing any given
> string?

Read More Exceptional C++ Appendix A, which describes the situation.
COW strings cannot be made thread-safe _from_the_outside_!

> For instance, using g++ (on which strings are ref-
> counted),

Then you just do not care that they are. If your program uses threads,
the string implementation in the g++ standard headers will take care of
serializing access to the reference counts, so the string will look
(to you) like a string with a non-shared implementation (I mean data).
Except for speed in MT - but for that, really read MEXC++

> Any words from the experts and gurus? :-)

My 2c is there, and if you want guru answer get the book!

Attila

Alexander Terekhov

Aug 29, 2002, 2:50:54 PM

Attila Feher wrote:
[...]

> Except for speed in MT - but for that really read MEXC++

You might also want to take a look at the following {entire} thread(s):

http://groups.google.com/groups?as_usubject=MT%20design%20advice
("...C) [``Consider:'' part] Atomic operations/gotw 43/44/45...")

> > Any words from the experts and gurus? :-)
>
> My 2c is there, and if you want guru answer get the book!

Yeah, but don't believe everything written in the "Appendix A". ;-)

http://groups.google.com/groups?selm=7209q4%24p1p%241%40shell7.ba.best.com
(Subject: Re: Guru of the Week #45: Solution)

regards,
alexander.

Attila Feher

Aug 30, 2002, 5:42:10 AM

Alexander Terekhov wrote:
[SNIP]

> You might also want to take a look at following {entire} thread(s):
>
> http://groups.google.com/groups?as_usubject=MT%20design%20advice
> ("...C) [``Consider:'' part] Atomic operations/gotw 43/44/45...")

This did not tell me much that was new. I have tested (well-designed)
applications using strings, and it turned out that in MT, COW _was_
killing performance and even deep copy was better. I did not test
atomic ops :-(

With the mutexes I don't find the locking expensive. It is negligible.
However, _making_ and _destroying_ them is death on Sun Solaris. It
seems they stop all other threads from doing any of the above on any
mutex. Now imagine an app which works with strings heavily and in
nearly all threads... The end result: _any_ number of CPUs is
virtually reduced to 1, and that one (in the best case) is 90%
utilized working on the real task.

> > > Any words from the experts and gurus? :-)
> >
> > My 2c is there, and if you want guru answer get the book!
>
> Yeah, but don't believe everything written in the "Appendix A". ;-)
>
> http://groups.google.com/groups?selm=7209q4%24p1p%241%40shell7.ba.best.com
> (Subject: Re: Guru of the Week #45: Solution)

I meant about the "problems" involved, which prevent us from making a
COW string thread-safe from the outside! Boo, I say. :-)

Oh Alexander, The Guru of MT, The Holder of The Truth of How Threads Can
Get Along! ;-))) I am just teasing, but I see from your posts that you
really know what you are talking about.

I saw in the second thread that you refer to some code where you have
used atomic ops for COW. I was looking for Herb's code, but it is
nowhere to be found. Actually, I have some doubts about his
implementation (I would rather say there are some parts I just don't
know how they would work - therefore I don't believe they would :-)))
so I wanted to take a look (placing some sleeps in a few places to
force thread switching). But anyway, it isn't there, and I would like
to see _once_ a proven, working atomic-op COW _with_ comments on what
the _minimum_ required atomic ops are and _why_. Until then, I would
be happy to see some code. Can you tell us a place where such code
(simple! not part of a 10^9-line STL implementation) can be seen?

Attila

Carlos Moreno

Aug 30, 2002, 6:25:06 AM

Attila Feher wrote:
>
> My 2c is there, and if you want guru answer get the book!

Whoa whoa! Show some respect, here!! I *do* have the
book!! ;-)

I know, I know, shame on me that I haven't read that book
entirely!! :-(

Kidding aside... Thanks a bunch for your comments! I'll
take a look at the book and the threads (no pun intended :-))
where these issues were discussed.

{I have allowed some latitude for the non-native users of English. Where
I come from 'Thanks a bunch' has very strong negative undertones meaning
something akin to 'Why have you been wasting my time with trivia?' But I
am sure Carlos does not mean it that way -mod/fwg}

Thanks,

Carlos
--

James Kanze

Aug 30, 2002, 12:25:46 PM

Alexander Terekhov <tere...@web.de> wrote in message
news:<3D6E4A87...@web.de>...

> Attila Feher wrote:
> [...]
> > Except for speed in MT - but for that really read MEXC++

> You might also want to take a look at following {entire} thread(s):

> http://groups.google.com/groups?as_usubject=MT%20design%20advice
> ("...C) [``Consider:'' part] Atomic operations/gotw 43/44/45...")

> > > Any words from the experts and gurus? :-)

> > My 2c is there, and if you want guru answer get the book!

> Yeah, but don't believe everything written in the "Appendix A". ;-)

You might really want to consider what was said in that message:

What his results do imply is that a COW string implemented using
mutex-guarded reference counts really stinks. Any other differences
are too small to generalize about.

Now, I'm not too sure that even this isn't more than Herb said, and I'm
not sure that there aren't cases where even mutex-guarded reference
counts are a performance win. What I am sure of, however, is that at
least under Posix, there are in general no possible thread safe
implementations which don't use mutex-guards around the reference count.
I'm sure of it, because it is what the Posix standard explicitly says:

Applications shall ensure that access to any memory location by more
than one thread of control (threads or processes) is restricted such
that no thread of control can read or modify a memory location while
another thread of control may be modifying it. Such access is
restricted using functions that synchronize thread execution and
also synchronize memory with respect to other threads. The following
functions synchronize memory with respect to other threads:

(Followed by a list of functions.)

Some implementations may make additional guarantees, but I've yet to see
a thread-safe implementation of reference counting that doesn't use
mutex-guards on a Posix machine. (The most recent one I've looked at,
in g++ 3.1 for Sparc Solaris, has a deadlock, for example.)
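
To illustrate what the quoted rule forces, here is a minimal sketch of
the mutex-guarded pattern (the pthread calls are real; everything else
is an invented name):

    #include <pthread.h>

    struct Rep {
        pthread_mutex_t lock; // guards refs *and* provides the memory
        int refs;             // synchronization POSIX requires (4.10)
    };

    // Returns true when the caller was the last owner and must free Rep.
    bool unref( Rep* r ) {
        pthread_mutex_lock( &r->lock );
        bool last = ( --r->refs == 0 );
        pthread_mutex_unlock( &r->lock );
        return last;
    }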

Note that any proof must be based on contractual guarantees, like the
one above. Intel architecture, for example, provides the hooks which
can be used to implement the necessary synchronization (the lock
prefix). It's not sufficient that the processor provides the hooks,
however; all of the surrounding hardware (the caches, etc.) have to be
designed to use them correctly. Do Dell or Compaq guarantee that the
use of lock will ensure cache consistency across multiple processors?
(What does Intel say about it, for that matter? The lock prefix dates
from much simpler days.)

--
James Kanze mailto:jka...@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

Attila Feher

Aug 30, 2002, 12:35:26 PM

Carlos Moreno wrote:
>
> Attila Feher wrote:
> >
> > My 2c is there, and if you want guru answer get the book!
>
> Whoa whoa! Show some respect, here!! I *do* have the
> book!! ;-)

RRR EEE Ss PP EEE Cc TTT
R R E S P P E C T
RRR EE S PP EE C T
R R E S P E C T
R R EEE Ss P EEE cC T

Boy, that took long ;-)


> I know, I know, shame on me that I haven't read that book
> entirely!! :-(

Well, it happens. I owe myself the Stroustrup book. :-((( And many
more, not only C++.

> Kidding aside... Thanks a bunch for your comments! I'll
> take a look at the book and the threads (no pun intended :-))
> where these issues were discussed.

You can also look at:

GOTW #45 at http://www.gotw.ca

The "Ref counted std::string and thread safety" thread at comp.std.c++, or on Google:

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&safe=off&frame=right&th=6ba23d58354acfdf&seekm=ngtnrss098oppa68sjltog0vvup106j13i%404ax.com#link1

I know, you have to unwrap it :-(

The "Thread safe reference counted pointers and copy on write" thread of this NG:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&safe=off&frame=right&th=bb0f822b3e4d4ec4&seekm=6ascvh%24f1i%40netlab.cs.rpi.edu#link1

The "COW (copy-on-write) strings & multithreading" thread of this NG:

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&safe=off&frame=right&th=e67c921b307a3609&seekm=9nhir3%24ai8%241%40newsreader.mailgate.org#link1

There are more, but I think from these (esp. the std one, with Herb in
it) you will get a pretty clear idea.

Attila

Joshua Lehrer

Aug 30, 2002, 10:39:22 PM

We wrote our own reference counted COW multi-thread safe String class.
I can't give you our code, but I can answer any questions you might
have about other, similar code. I can tell you that it can be done
with just atomic_increment and atomic_decrement.

joshua lehrer
factset research systems

Attila Feher <Attila...@lmf.ericsson.se> wrote in message news:<3D6F2F42...@lmf.ericsson.se>...

Alexander Terekhov

Aug 30, 2002, 10:52:52 PM

James Kanze wrote:
[...]

> > http://groups.google.com/groups?selm=7209q4%24p1p%241%40shell7.ba.best.com
> > (Subject: Re: Guru of the Week #45: Solution)
>
> You might really want to consider what was said in that message:
>
> What his results do imply is that a COW string implemented using
> mutex-guarded reference counts really stinks. Any other differences
> are too small to generalize about.
>
> Now, I'm not too sure that even this isn't more than Herb said,

In his MXC++ book, Herb said this:

"Getting rid of COW implementations falls into ``good optimizations''
category in most cases (the general-purpose libraries care most about
``most cases''). .... a few major vendors have already decided to
abandon possible false optimizations like copy-on-write. To me, and
to many, this is Good Thing. It is to be hoped that the rest of the
vendors will follow suit."

> and I'm
> not sure that there aren't cases where even mutex-guarded reference
> counts are a performance win.

Yeah, especially using REAL mutexes -- NOT MS-like "entirely" kernel
based beasts... with attached "security" silliness. ;-)

> What I am sure of, however, is that at
> least under Posix, there are in general no possible thread safe
> implementations which don't use mutex-guards around the reference count.

http://lists.boost.org/MailArchives/boost/msg30364.php
([boost] Fw: Re: boost::shared_ptr and atomic_count_linux.hpp)

> I'm sure of it, because it is what the Posix standard explicitly says:
>
> Applications shall ensure that access to any memory location by more
> than one thread of control (threads or processes) is restricted such
> that no thread of control can read or modify a memory location while
> another thread of control may be modifying it. Such access is
> restricted using functions that synchronize thread execution and
> also synchronize memory with respect to other threads. The following
> functions synchronize memory with respect to other threads:

Yeah, and, perhaps, you might also want to check out the
"memory location" thread referenced below...

> (Followed by a list of functions.)
>
> Some implementations may make additional guarantees, but I've yet to see
> a thread-safe implementation of reference counting that doesn't use
> mutex-guards on a Posix machine. (The most recent one I've looked at,
> in g++ 3.1 for Sparc Solaris, has a deadlock, for example.)
>
> Note that any proof must be based on contractual guarantees, like the
> one above. Intel architecture, for example, provides the hooks which
> can be used to implement the necessary synchronization (the lock
> prefix). It's not sufficient that the processor provides the hooks,
> however; all of the surrounding hardware (the caches, etc.) have to be
> designed to use them correctly. Do Dell or Compaq guarantee that the
> use of lock will ensure cache consistency across multiple processors?
> (What does Intel say about it, for that matter? The lock prefix dates
> from much simpler days.)

Well, "Intel" says this:

http://groups.google.com/groups?selm=3C3C9B63.DFDC9920%40web.de
(Subject: Re: Multi-Processor Concurrency Problem)

"Dave Hansen wrote:
[...]
> Of course, I don't really think the OP would really want to worry
> about this particular target, especially with a C++ string library...

OP should really worry about undefined/illegal behavior
and consequences of breaking the memory synchronization/
visibility rules (POSIX/4.10), and should rather carefully
consider the following recommendation from a well known
chip manufacturer:

"IA-32 Intel ® Architecture
Software Developer's
Manual
.
.
.
Despite the fact that Pentium 4, Intel Xeon, and P6 family
processors support processor ordering, Intel does not guarantee
that future processors will support this model. To make software
portable to future processors, it is recommended that operating
systems provide critical region and resource control constructs
and API's (application program interfaces) based on I/O, locking,
and/or serializing instructions be used to synchronize access to
shared areas of memory in multiple-processor systems. "
...."

http://groups.google.com/groups?selm=c29b5e33.0202150632.6d579f24%40posting.google.com
(Subject: Re: Can multiple threads set a global variable simultaneously?)

"josh <jo...@xtreme.net> wrote in message news:<1103_1013701471@js3owsj2-2j2-2j>...
> On 14 Feb 2002 05:41:59 -0800, tere...@web.de (Alexander Terekhov) wrote:
> > josh <jo...@xtreme.net> wrote in message news:<1108_1013618263@js3owsj2-2j2-2j>...
> > > Use InterlockedBlahBlah... functions.
> >
> > Well, but *what* (if any) MEMORY SYNCHRONIZATION semantics do they
> > provide?
> Atomic updates.

That's rather easy to comprehend... but there are VISIBILITY
issues w.r.t. DEPENDENT objects/data as well:

> I didn't get the rest of your message.

Read this:

http://www.primenet.com/~jakubik/mpsafe/MultiprocessorSafe.pdf
"Relaxed multiprocessor memory models..." -- think of IA64/Windows,
for example, if you just refuse to accept Intel's lack
of strong/processor ordering guarantees w.r.t. future IA32
multiprocessors... despite the fact that IA32 ALREADY has
memory barriers/fence instructions:

"11.4.4.3. MEMORY ORDERING INSTRUCTIONS

The SSE2 extensions introduce two new fence instructions
(LFENCE and MFENCE) as companions to the SFENCE instruction
introduced with the SSE extensions. The LFENCE instruction
establishes a memory fence for loads. It guarantees ordering
between two loads and prevents speculative loads from passing
the load fence (that is, no speculative loads are allowed
until all loads specified before the load fence have been
carried out).

The MFENCE instruction combines the functions of the LFENCE
and SFENCE instructions by establishing a memory fence for
both loads and stores. It guarantees that all loads and stores
specified before the fence are globally observable prior to any
loads or stores being carried out after the fence."

Also, you might want to read this:

http://altair.cs.oswego.edu/pipermail/concurrency-interest/2002-February/000023.html

http://altair.cs.oswego.edu/pipermail/concurrency-interest/2002-February/000024.html

http://groups.yahoo.com/group/Boost-Users/message/278
(atomic operations and ref.counting/thread_shared_ptr)

...."

As for the rest...

http://groups.google.com/groups?selm=496D85.5AC9CE91%40web.de
(Subject: Re: "memory location", comp.std.c)

"....please agree and provide me a new portable interface for
non-blocking/lock-less reference counting with optimizations
for immutable objects, please...."

http://groups.google.com/groups?selm=2AA730.1F22435C%40web.de
(Subject: Re: about the InterlockedCompareExchange function)

"....
Mike Mowbray wrote:
[...]
> 1) It's a bit dangerous to think of these functions as "cool".
> To implement them properly, you also need to know a lot
> about memory barriers.

Yep. However, if all you need is 'just' reference counting for
*immutable* [while shareable, like in COW] objects, you really
want 'naked' atomic operations [w/o membars] since membars would
just slow things down and won't bring anything useful in exchange
at all... unless I'm just missing something here, of course.

[...]
> And if you want portability amongst
> other major processors, you'll need to implement equivalent
> functionality on each of them. I'm a bit lucky in this respect
> because I only need to worry about sparc and alpha.

I personally see NO reasons why we don't have portable interface
for reference counting *specifically* [optimization for immutable
objects including] and some 'general' interface ala proposed java
'atomic' classes [with built-in memory barriers: acquire/release
semantics -- sort of implicit lock/unlock] for other fancy things.

Perhaps someone could enlighten me... TIA.
...."

Heck, http://groups.google.com/groups?selm=3CFB3BE8.2BA4E404%40web.de
(Subject: Re: C++0x Wish list (memory management), comp.std.c++)

"John Nagle wrote:
[...]
> The overhead of reference counting can be brought down down.

Yup, I think so too. Perhaps somewhat relevant stuff w.r.t. this
topic/discussion:
...."

regards,
alexander.

Kip Rugger

Aug 31, 2002, 7:38:27 AM

James Kanze <ka...@gabi-soft.de> wrote:
[snip]

> Intel architecture, for example, provides the hooks which
>can be used to implement the necessary synchronization (the lock
>prefix). It's not sufficient that the processor provides the hooks,
>however; all of the surrounding hardware (the caches, etc.) have to be
>designed to use them correctly. Do Dell or Compaq guarantee that the
>use of lock will ensure cache consistency across multiple processors?
>(What does Intel say about it, for that matter? The lock prefix dates
>from much simpler days.)

In even simpler days, you could count on a=b; x=y; to appear on the
bus as fetch b, store a, fetch y, store x. This is no longer the case
as compilers can do code movement, and hardware can re-order operations.

The usual solution is expressed in terms of `barriers', which are
both software and hardware in nature. A read barrier ensures that
neither software nor hardware performs a memory read prior to the
read barrier; a write barrier ensures that neither software nor
hardware delays a memory write operation past the barrier. So, if
we wanted to completely order the simple example, we would need
read_barrier(); a=b; barrier(); x=y; write_barrier(); (where
barrier() is understood to do both read and write barriers).

My reading of the Standard is that declaring a variable volatile
provides a write barrier at the next sequence point, at least as
far as that variable is concerned. For example, the updated value
should not be kept in a register past the sequence point.

Read barriers do not seem to be addressed by the Standard. Linux
relies on the fact that gcc will not do code movement (either
direction) through an asm statement -- even an empty one. This
is crucial in many critical sections, and is handled by declaring
barrier() as a macro which generates an empty asm statement.
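
For reference, the barrier() macro in question is essentially this
one-liner from the Linux headers of that era (gcc-specific; shown here
as an illustration):

    /* Compiler-level barrier only: it emits no instructions, but gcc
       will not move memory accesses across an asm with a "memory"
       clobber, and spills registers back to memory. It does nothing
       to order the hardware. */
    #define barrier() __asm__ __volatile__("" : : : "memory")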

At the hardware level, there are a variety of solutions. Certain
varieties of mips rely entirely on software solutions such as Lamport's
mutual exclusion algorithm. Even when specialized locking hardware
is provided, it may be slower than Lamport. The ppc offers the
`eieio' instruction that completely clears the processor pipelines,
which could offer a significant delay.

You asked about the Intel: the lock prefix is indeed a barrier.
Additionally, Intel CPUs obey `processor order' on writes; that is,
writes will not be re-ordered as observed on the system bus.

Caches are not usually a problem because the cache can see writes
occurring on the system bus, and invalidate or update the cached line.

JKB

Aug 31, 2002, 2:27:15 PM

Aren't most strings used entirely within one thread? And isn't that
basically Herb's point in the GoTW article - that COW implies paying mutex
prices on strings that don't need it? That, plus the annoying fact that
mutexes turn out to be expensive.

Seems like one could tag strings with a thread ID, and then restrict the
shared-storage part so it was local to one thread. Then the only time you'd
need locking would be when copying a string across threads. That would be
an expensive but presumably rare operation.
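
A rough sketch of that idea (all names hypothetical; note that the
replies below point out a memory-visibility hole in the unlocked read
of the owner field):

    struct Rep {
        thread_id owner; // thread the representation belongs to
        int refs;        // touched only by `owner`, so no atomics
        char* data;
    };

    CowString::CowString( const CowString& o ) {
        if( o.rep->owner == current_thread_id() ) {
            rep = o.rep;    // same thread: plain, unlocked ++refs
            ++rep->refs;
        }
        else {              // cross-thread copy: rare, expensive,
            rep = deep_copy( o.rep );          // and needs real locking
            rep->owner = current_thread_id();
        }
    }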

Anybody have experience with this approach?
-- jkb

Alexander Terekhov

Sep 1, 2002, 5:59:51 AM

Attila Feher wrote:
[...]

> With the mutexes I don't find the locking expensive. It is negligible.
> However, _making_ and _destroying_ them is death on Sun Solaris. It
> seems they stop all other threads from doing any of the above on any
> mutex. Now imagine an app which works with strings heavily and in
> nearly all threads... The end result: _any_ number of CPUs is
> virtually reduced to 1, and that one (in the best case) is 90%
> utilized working on the real task.

Stay away from SUNny stuff; buy and use more prosaic PowerPC{&IPF}->AIX! ;-)

[...]


> I saw in the second thread that you refer to some code where you have
> used atomic ops for COW. I was looking for Herb's code, but it is
> nowhere to be found.

It's here: http://www.gotw.ca/gotw/045code.zip

> Actually, I have some doubts about his implementation (I would rather
> say there are some parts I just don't know how they would work -
> therefore I don't believe they would :-)))

Do you mean: < from common.cpp that can be found in Sutter's 045code.zip >

inline StringBuf::StringBuf( const StringBuf& other, size_t n )
    : buf(0), len(0), used(0), refs(1)
                                ^^^^^^^
{
    Reserve( max( other.len, n ) );
    memcpy( buf, other.buf, used );
                            ^^^^
    used = other.used;
    ^^^^^^^^^^^^^^^^^
}

<?> ;-) ;-)  (Hint: at the memcpy, `used` is still the freshly
initialized zero, yet the next line sets it to other.used as if the
bytes had been copied.)

> so I wanted to take a look (placing some sleeps in a few places to
> force thread switching). But anyway, it isn't there, and I would like
> to see _once_ a proven, working atomic-op COW _with_ comments on what
> the _minimum_ required atomic ops are and _why_. Until then, I would
> be happy to see some code. Can you tell us a place where such code
> (simple! not part of a 10^9-line STL implementation) can be seen?

Well, I wrote this:

http://groups.google.com/groups?selm=356AB6.AFF57BB3%40web.de
(Subject: Re: MT design advice, comp.lang.c++.moderated)

Surely, there is more than 'one bug' in it, but FWIW, here is the
stuff from my archive:

namespace COW_AtomicInt3 {

#undef NAME
#undef BAGGAGE
#define NAME "COW_AtomicInt3"
#define BAGGAGE

struct StringBuf;

class AppendStringBufPtr {
    const StringBuf* ptr;
public:
    AppendStringBufPtr( const StringBuf* p ) : ptr( p ) {}
    const StringBuf* operator->() const { return ptr; }
};

class CharRefStringBufPtr {
    const StringBuf* ptr;
public:
    CharRefStringBufPtr( const StringBuf* p ) : ptr( p ) {}
    const StringBuf* operator->() const { return ptr; }
};

struct StringBuf {

    StringBuf();
    ~StringBuf();
    StringBuf( const StringBuf& other );
    StringBuf( AppendStringBufPtr other );
    StringBuf( CharRefStringBufPtr other );

    void Clear();
    void Reserve( size_t n );

    char*  buf;
    size_t len;
    size_t used;
    long   refs;
    BAGGAGE;

    void* operator new( size_t n );
    void  operator delete( void* p );

};

static FastArena fa( NAME, sizeof(StringBuf) );
void* StringBuf::operator new( size_t n ) { return fa.Allocate( n ); }
void StringBuf::operator delete( void* p ) { fa.Deallocate( p ); }


inline StringBuf::StringBuf() : buf(0), len(0), used(0), refs(1) { }

inline StringBuf::~StringBuf() { delete[] buf; }

inline StringBuf::StringBuf( const StringBuf& other )
    : buf(0), len(0), used(0), refs(1)
{
    Reserve( other.len );
    memcpy( buf, other.buf, used = other.used );
}

inline StringBuf::StringBuf( AppendStringBufPtr other )
    : buf(0), len(0), used(0), refs(1)
{
    Reserve( max( other->len, other->used+1 ) );
    memcpy( buf, other->buf, used = other->used );
}

inline StringBuf::StringBuf( CharRefStringBufPtr other )
    : buf(0), len(0), used(0), refs(0)
{
    Reserve( other->len );
    memcpy( buf, other->buf, used = other->used );
}

inline void StringBuf::Clear() {
    delete[] buf;
    buf  = 0;
    len  = 0;
    used = 0;
}

class String {
public:
    String();
    ~String();
    String( const String& );
    void Clear();
    void Append( char );
    size_t Length() const;
    char& operator[](size_t);

    static int nCopies;
    static int nAllocs;
private:
    StringBuf* data_;
};

int String::nCopies;
int String::nAllocs;

inline void StringBuf::Reserve( size_t n ) {

    <snip>

}

inline String::String() : data_(new StringBuf) { }

inline String::~String() {
    if( 2 > IntAtomicGet( data_->refs ) ||
        1 > IntAtomicDecrement( data_->refs ) ) {
        delete data_;
    }
}

inline String::String( const String& other )
{
    int other_refs = IntAtomicGet( other.data_->refs );
    if( 0 != other_refs ) {
        data_ = other.data_;
        if( 1 != other_refs ) { // 'optimize' non-shared case
            IntAtomicIncrement( data_->refs );
        }
        else {
            data_->refs = 2;
        }
    }
    else { // 'unshareable'
        data_ = new StringBuf( *other.data_ );
    }
    ++nCopies;
}

inline void String::Append( char c ) {
    if( 1 < IntAtomicGet( data_->refs ) ) {
        StringBuf* newdata = new StringBuf( AppendStringBufPtr( data_ ) );
        if( 0 == IntAtomicDecrement( data_->refs ) ) {
            delete data_;
        }
        data_ = newdata;
    }
    else {
        data_->Reserve( data_->used+1 );
        data_->refs = 1; // mutation -> 'shareable'
    }
    data_->buf[data_->used++] = c;
}

inline size_t String::Length() const {
    return data_->used;
}

inline char& String::operator[]( size_t n ) {
    if( 1 < IntAtomicGet( data_->refs ) ) {
        StringBuf* newdata = new StringBuf( CharRefStringBufPtr( data_ ) );
        if( 0 == IntAtomicDecrement( data_->refs ) ) {
            delete data_;
        }
        data_ = newdata;
    }
    else {
        data_->refs = 0; // 'unshareable'
    }
    return data_->buf[n];
}

inline void String::Clear() {
    if( 2 > IntAtomicGet( data_->refs ) ||
        1 > IntAtomicDecrement( data_->refs ) ) {
        data_->Clear();
        data_->refs = 1; // mutation -> 'shareable'
    }
    else {
        data_ = new StringBuf;
    }
}

}


#define TEST_CONST_COPY 1

E:\045code\GURU\test045>test
Preparing for clean timing runs... done.
Running 18000000 iterations with strings of length 100:

Plain_FastAlloc 3671ms copies:18000000 allocs:18000000
Plain 9500ms copies:18000000 allocs:18000000
COW_Unsafe 578ms copies:18000000 allocs: 0
COW_AtomicInt 2516ms copies:18000000 allocs: 0
COW_AtomicInt2 2187ms copies:18000000 allocs: 0
COW_AtomicInt3 1234ms copies:18000000 allocs: 0
COW_CritSec 4157ms copies:18000000 allocs: 0
COW_Mutex 71640ms copies:18000000 allocs: 0

Plain_FastAlloc 3672ms copies:18000000 allocs:18000000
Plain 9531ms copies:18000000 allocs:18000000
COW_Unsafe 562ms copies:18000000 allocs: 0
COW_AtomicInt 2531ms copies:18000000 allocs: 0
COW_AtomicInt2 2203ms copies:18000000 allocs: 0
COW_AtomicInt3 1235ms copies:18000000 allocs: 0
COW_CritSec 4141ms copies:18000000 allocs: 0
COW_Mutex 72031ms copies:18000000 allocs: 0


#define TEST_APPEND 1

E:\045code\GURU\test045>test
Preparing for clean timing runs... done.
Running 18000000 iterations with strings of length 100:

Plain_FastAlloc 844ms copies: 0 allocs: 1425744
Plain 1296ms copies: 0 allocs: 1425744
COW_Unsafe 1704ms copies: 0 allocs: 1425744
COW_AtomicInt 2453ms copies: 0 allocs: 1425744
COW_AtomicInt2 2203ms copies: 0 allocs: 1603962
COW_AtomicInt3 1797ms copies: 0 allocs: 1425744
COW_CritSec 3265ms copies: 0 allocs: 1425744
COW_Mutex 37703ms copies: 0 allocs: 1425744

Plain_FastAlloc 844ms copies: 0 allocs: 1425744
Plain 1328ms copies: 0 allocs: 1425744
COW_Unsafe 1735ms copies: 0 allocs: 1425744
COW_AtomicInt 2453ms copies: 0 allocs: 1425744
COW_AtomicInt2 2281ms copies: 0 allocs: 1603962
COW_AtomicInt3 1781ms copies: 0 allocs: 1425744
COW_CritSec 3344ms copies: 0 allocs: 1425744
COW_Mutex 37969ms copies: 0 allocs: 1425744


#define TEST_OPERATOR 1

E:\045code\GURU\test045>test
Preparing for clean timing runs... done.
Running 18000000 iterations with strings of length 100:

Plain_FastAlloc 0ms copies: 0 allocs: 0
Plain 0ms copies: 0 allocs: 0
COW_Unsafe 922ms copies: 0 allocs: 0
COW_AtomicInt 1500ms copies: 0 allocs: 0
COW_AtomicInt2 1171ms copies: 0 allocs: 0
COW_AtomicInt3 406ms copies: 0 allocs: 0
COW_CritSec 2313ms copies: 0 allocs: 0
COW_Mutex 36390ms copies: 0 allocs: 0

Plain_FastAlloc 0ms copies: 0 allocs: 0
Plain 0ms copies: 0 allocs: 0
COW_Unsafe 922ms copies: 0 allocs: 0
COW_AtomicInt 1516ms copies: 0 allocs: 0
COW_AtomicInt2 1125ms copies: 0 allocs: 0
COW_AtomicInt3 406ms copies: 0 allocs: 0
COW_CritSec 2312ms copies: 0 allocs: 0
COW_Mutex 36219ms copies: 0 allocs: 0


#define TEST_MUTATING_COPY_2A 1

E:\045code\GURU\test045>test
Preparing for clean timing runs... done.
Running 18000000 iterations with strings of length 100:

Plain_FastAlloc 3922ms copies:18000000 allocs:18000000
Plain 9906ms copies:18000000 allocs:18000000
COW_Unsafe 9953ms copies:18000000 allocs:12000000
COW_AtomicInt 13141ms copies:18000000 allocs:12000000
COW_AtomicInt2 10656ms copies:18000000 allocs:12000000
COW_AtomicInt3 10625ms copies:18000000 allocs:12000000
COW_CritSec 27641ms copies:18000000 allocs:12000000
COW_Mutex 203109ms copies:18000000 allocs:12000000

Plain_FastAlloc 3938ms copies:18000000 allocs:18000000
Plain 9828ms copies:18000000 allocs:18000000
COW_Unsafe 9828ms copies:18000000 allocs:12000000
COW_AtomicInt 13000ms copies:18000000 allocs:12000000
COW_AtomicInt2 10531ms copies:18000000 allocs:12000000
COW_AtomicInt3 10469ms copies:18000000 allocs:12000000
COW_CritSec 29438ms copies:18000000 allocs:12000000
COW_Mutex 202391ms copies:18000000 allocs:12000000


#define TEST_MUTATING_COPY_2B 1

E:\045code\GURU\test045>test
Preparing for clean timing runs... done.
Running 18000000 iterations with strings of length 100:

Plain_FastAlloc 4188ms copies:18000000 allocs:18000000
Plain 9828ms copies:18000000 allocs:18000000
COW_Unsafe 8453ms copies:18000000 allocs: 9000000
COW_AtomicInt 11969ms copies:18000000 allocs: 9000000
COW_AtomicInt2 9563ms copies:18000000 allocs: 9000000
COW_AtomicInt3 9109ms copies:18000000 allocs: 9000000
COW_CritSec 24000ms copies:18000000 allocs: 9000000
COW_Mutex 206797ms copies:18000000 allocs: 9000000

Plain_FastAlloc 4218ms copies:18000000 allocs:18000000
Plain 9922ms copies:18000000 allocs:18000000
COW_Unsafe 8516ms copies:18000000 allocs: 9000000
COW_AtomicInt 12094ms copies:18000000 allocs: 9000000
COW_AtomicInt2 9687ms copies:18000000 allocs: 9000000
COW_AtomicInt3 9219ms copies:18000000 allocs: 9000000
COW_CritSec 24344ms copies:18000000 allocs: 9000000
COW_Mutex 208171ms copies:18000000 allocs: 9000000

regards,
alexander.

Carlos Moreno

Sep 1, 2002, 6:19:21 AM

Carlos Moreno wrote:
>
> Kidding aside... Thanks a bunch for your comments! I'll
> take a look at the book and the threads (no pun intended :-))
> where these issues were discussed.
>
> {I have allowed some latitude for the non-native users of English. Where
> I come from 'Thanks a bunch' has very strong negative undertones meaning
> something akin to 'Why have you been wasting my time with trivia?' But I
> am sure Carlos does not mean it that way

Oops! Not at all (i.e., no, of course I didn't mean it
that way).

I always see the expression "thanks a bunch" on the net,
and I always thought it was an exact synonym for "thanks
a lot" or "thanks a million" -- I always thought that,
if anything, it would be the tone (when spoken, as opposed
to written) that could give any of those expressions the
sarcastic sense of "why have you been wasting my time?"

Anyway, thanks for the clarification, Francis! I would
never have imagined :-( I'll avoid the expression in
the future!

Artem Livshits

Sep 1, 2002, 7:16:41 AM

"JKB" <nospam-b...@nospam.seanet.com> wrote in message
news:un1p1kf...@corp.supernews.com...

> Aren't most strings used entirely within one thread?

I think so. I would go even further and say that most objects are used
entirely within one thread; otherwise there is absolutely no point in
multithreading: if several threads are constantly synchronizing with each
other, it may be a good idea to merge them into one thread (and the code
will be simpler, faster, and more reliable).

> Then the only time you'd
> need locking would be when copying a string across threads. That would be
> an expensive but presumably rare operation.

Why would you need locking for ref counting at all? Just always do a deep
copy with shared strings and there will be no race conditions with ref
counting. E.g.

static std::string sharedStr;
static Mutex mx;

std::string foo(std::string s)
{
    const Lock ___(mx);
    const std::string result(sharedStr.c_str());
    sharedStr = s.c_str();
    return result;
}


Artem Livshits
Brainbench MVP for C++
http://www.brainbench.com

Pete Becker

Sep 1, 2002, 7:23:34 AM

JKB wrote:
>
> Seems like one could tag strings with a thread ID, and then restrict the
> shared-storage part so it was local to one thread. Then the only time you'd
> need locking would be when copying a string across threads. That would be
> an expensive but presumably rare operation.
>

Each time you copy you must first check the thread ID to determine
whether you're copying across threads. Before you check the ID you must
lock a mutex in order to ensure memory visibility. So you end up locking
a mutex in order to determine that you don't need to lock a mutex.

When it comes to thread synchronization, don't trust your intuition.

--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)

Artem Livshits

Sep 1, 2002, 7:25:35 AM

"Kip Rugger" <rug...@mts.net> wrote in message
news:akom0h$bh$1...@rugger.nodomain.ca...

> In even simpler days, you could count on a=b; x=y; to appear on the
> bus as fetch b, store a, fetch y, store x. This is no longer the case
> as compilers can do code movement, and hardware can re-order operations.
>
<snip>

> My reading of the Standard is that declaring a variable volatile
> provides a write barrier at the next sequence point, at least as
> far as that variable is concerned. For example, the updated value
> should not be kept in a register past the sequence point.
>
> Read barriers do not seem to be addressed by the Standard. Linux
> relies on the fact that gcc will not do code movement (either
> direction) through an asm statement -- even an empty one. This
> is crucial in many critical sections, and is handled by declaring
> barrier() as a macro which generates an empty asm statement.

The compiler is free to move code around (or remove it altogether) as long
as it doesn't affect observable behavior (a.k.a. the "as if" rule). But
1.9/6 says:

The observable behavior of the abstract machine is its sequence of
reads and writes to volatile data and calls to library I/O
functions.

and my reading is that this basically means that the compiler can't reorder
(nor eliminate) a sequence of reads and writes to volatile data.

So even in these complicated days, if the code looks like

volatile int a, b, x, y;
a = b;
x = y;

you can be pretty sure that the generated code would be "fetch b, store a,
fetch y, store x". (After all a volatile variable could represent a memory
mapped COM port in which case any reordering/eliminating would really affect
observable behavior).

The other question is that on modern hardware executing "fetch b, store
a, fetch y, store x" doesn't guarantee that the bus will see it that way,
but that has nothing to do with the compiler.

gcc doesn't move code around an asm definition most likely because the
module that does code transformation isn't asm-aware, so it pessimistically
assumes that moving code around an asm definition might change observable
behavior, so it doesn't do that. But it doesn't have to be that way, so
Linux'd better rely on the standard way instead of something that happens to
work by accident.


Artem Livshits
Brainbench MVP for C++
http://www.brainbench.com

Kip Rugger

Sep 1, 2002, 5:56:45 PM

Artem Livshits <a_e_li...@yahoo.com> wrote:
>The compiler is free to move code around (or remove it altogether) as long
>as it doesn't affect observable behavior (a.k.a. the "as if" rule). But
>1.9/6 says:
>
> The observable behavior of the abstract machine is its sequence of
> reads and writes to volatile data and calls to library I/O
> functions.
>
>and my reading is that this basically means that the compiler can't reorder
>(nor eliminate) a sequence of reads and writes to volatile data.

This is necessary, but not sufficient in the presence of asm statements.

[snip]


>gcc doesn't move code around an asm definition most likely because the
>module that does code transformation isn't asm-aware, so it pessimistically
>assumes that moving code around an asm definition might change observable
>behavior, so it doesn't do that. But it doesn't have to be that way, so
>Linux'd better rely on the standard way instead of something that happens to
>work by accident.

Your wording suggests that a compiler could be constructed which
is `asm-aware'. That is complete nonsense if you mean that the
compiler is able to deduce at compile time the complete run-time
state and thence the precise implications of an asm sequence.

The gcc code-generation process is built around a register-transfer
description of the target machine, and the asm facility is able to
interact with this to a certain degree. In particular, the compiler
gives `asm volatile("":::"memory");' the same aliasing semantics as
a function call; that is, registers are spilled back to memory.
This is documented behaviour of the compiler and not an `accident'.

The use of such an extension is easily isolated to compiler- and
processor-specific header files. Although the files themselves
aren't portable, the functionality certainly is. This is about
as standard as you're going to get with today's tools.

JKB

Sep 1, 2002, 6:05:07 PM

> JKB wrote:
> > Seems like one could tag strings with a thread ID, and then restrict
> > the shared-storage part so it was local to one thread. Then the only
> > time you'd need locking would be when copying a string across threads.

"Pete Becker" <peteb...@acm.org> wrote:
> Each time you copy you must first check the thread ID to determine
> whether you're copying across threads. Before you check the ID you must
> lock a mutex in order to ensure memory visibility. So you end up locking
> a mutex in order to determine that you don't need to lock a mutex.

Yeah. You don't need a lock to read the object, because the thread ID is
immutable. And if it turns out to be in your own thread, you don't need a
lock at all. But you do need to know that the object isn't being deleted
out from under you.

What I'm trying to reason through is whether the strings can now be
synchronized from outside, the way you would if they weren't COW at all.
Holding a lock on the outside string should guarantee that the internal
refcount doesn't go to zero, which is enough for the other thread to read
it. I haven't found any holes in this argument, but haven't quite been able
to prove it either.


> When it comes to thread synchronization, don't trust your intuition.

I have excellent intuition. It said "this is a hard problem; maybe I should
ask for help".
-- jkb

Attila Feher

Sep 2, 2002, 7:09:21 AM

Joshua Lehrer wrote:
>
> We wrote our own reference counted COW multi-thread safe String class.
> I can't give you our code, but I can answer any questions you might
> have about other, similar code. I can tell you that it can be done
> with just atomic_increment and atomic_decrement.


Interestring (pun intended - although it started its life as a typo).
How can you make it work without atomic reading of integers?

JKB

Sep 2, 2002, 7:16:13 AM

So I went to look up how it's done by particular vendors. VC7 uses a COW
implementation for the MFC class CString, and a non-COW implementation for
std::string. The funny part is the sheer number of layers one has to dig
through to get there.

Here's a (fairly impenetrable) summary of the declarations for MFC's
CString:
ChTraitsBase is a template providing a typedef for PXSTR
ChTraitsCRT is a template that derives from ChTraitsBase
StrTraitMFC is a template that derives from ChTraitsCRT
CSimpleStringT is a template that stores a PXSTR
CStringT is a template that derives from CSimpleStringT
CString is a typedef, specializing CStringT and using StrTraitMFC
... where the PXSTR is a char* or wchar_t* pointing beyond the edge of a
CStringData object, which declares all the data members that do not store
text.

The STL declarations aren't much better:
_String_base is a class with no data, just some methods
_String_val is a template that derives from _String_base
basic_string is a template that derives from _String_val
char_traits is a templatized struct
allocator is a templatized class
string is a typedef, specializing basic_string,
and using char_traits and allocator specializations
... where the data is stored by basic_string, in a union that's either
inline characters or an external pointer.
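
For concreteness, the last step of that chain is just the standard's
own typedef:

    // The final typedef, as the standard spells it:
    typedef basic_string<char, char_traits<char>, allocator<char> > string;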

All this just for strings...
-- jkb

Pete Becker

Sep 2, 2002, 7:16:44 AM

JKB wrote:
>
> > JKB wrote:
> > > Seems like one could tag strings with a thread ID, and then restrict
> > > the shared-storage part so it was local to one thread. Then the only
> > > time you'd need locking would be when copying a string across threads.
>
> "Pete Becker" <peteb...@acm.org> wrote:
> > Each time you copy you must first check the thread ID to determine
> > whether you're copying across threads. Before you check the ID you must
> > lock a mutex in order to ensure memory visibility. So you end up locking
> > a mutex in order to determine that you don't need to lock a mutex.
>
> Yeah. You don't need a lock to read the object, because the thread ID is
> immutable.

Immutability is not sufficient to guarantee visibility.

--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

James Kanze

Sep 2, 2002, 10:36:28 AM

rug...@mts.net (Kip Rugger) wrote in message
news:<akom0h$bh$1...@rugger.nodomain.ca>...

> My reading of the Standard is that declaring a variable volatile
> provides a write barrier at the next sequence point, at least as
> far as that variable is concerned.

I can find nothing in the standard which suggests this, and I know of no
compiler, offhand, which implements it as such.

> For example, the updated value should not be kept in a register past
> the sequence point.

> Read barriers do not seem to be addressed by the Standard.

The standard doesn't address anything concerning multiple threads, much
less multiple processes or multiple processors.

> Linux relies on the fact that gcc will not do code movement (either
> direction) through an asm statement -- even an empty one. This is
> crucial in many critical sections, and is handled by declaring
> barrier() as a macro which generates an empty asm statement.

That might cover compiler code movement, but it doesn't do anything to
ensure cache coherence.

> At the hardware level, there are a variety of solutions. Certain
> varieties of mips rely entirely on software solutions such as
> Lamport's mutual exclusion algorithm. Even when specialized locking
> hardware is provided, it may be slower than Lamport. The ppc offers
> the `eieio' instruction that completely clears the processor
> pipelines, which could offer a significant delay.

> You asked about the Intel: the lock prefix is indeed a barrier.

I suspected as much. It wouldn't be much use otherwise.

> Additionally, Intel CPUs obey `processor order' on writes; that is,
> writes will not be re-ordered as observed on the system bus.

Currently. Apparently, from documentation posted in another response,
Intel does not guarantee that this will remain the case in the future.

At any rate, if you program correctly, once you've got the barrier, you
don't need additional ordering on writes.

> Caches are not usually a problem because the cache can see writes
> occurring on the system bus, and invalidate or update the cached line.

It can, but from what I have been able to ascertain, it doesn't on
certain systems. The earlier multiprocessor systems I used did this, and
wrote through the cache, so that all writes were immediately visible in
main memory. This sure made life easier. But there is a performance
issue involved, and if two processors are actively working in the same
page (albeit not on the same objects), the performance hit for doing
this can be significant.

--
James Kanze mailto:jka...@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

James Kanze

Sep 2, 2002, 10:37:02 AM

> Aren't most strings used entirely within one thread? And isn't that
> basically Herb's point in the GoTW article - that COW implies paying
> mutex prices on strings that don't need it? That, plus the annoying
> fact that mutexes turn out to be expensive.

> Seems like one could tag strings with a thread ID, and then restrict
> the shared-storage part so it was local to one thread. Then the only
> time you'd need locking would be when copying a string across threads.
> That would be an expensive but presumably rare operation.

The problem is how to transfer a string from one thread to another.

--
James Kanze mailto:jka...@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Alexander Terekhov

Sep 2, 2002, 10:56:57 AM

JKB wrote:
>
> > JKB wrote:
> > > Seems like one could tag strings with a thread ID, and then restrict
> > > the shared-storage part so it was local to one thread. Then the only
> > > time you'd need locking would be when copying a string across threads.
>
> "Pete Becker" <peteb...@acm.org> wrote:
> > Each time you copy you must first check the thread ID to determine
> > whether you're copying across threads. Before you check the ID you must
> > lock a mutex in order to ensure memory visibility. So you end up locking
> > a mutex in order to determine that you don't need to lock a mutex.
>
> Yeah. You don't need a lock to read the object, because the thread ID is
> immutable. And if it turns out to be in your own thread, you don't need a
> lock at all.

< check out followup(s) as well >

http://groups.google.com/groups?threadm=mkm9mu8uuilemevuusvfvkph17q66bocu7%404ax.com
(Subject: Re: Mutex implementation with semaphores, comp.realtime)

regards,
alexander.

P.S. http://groups.google.com/groups?threadm=dilvgcqbfxj.fsf%40isolde.research.att.com
(Subject: Re: crash in std::string on dual processor, comp.programming.threads)

P.J. Plauger

Sep 3, 2002, 6:39:19 AM

"JKB" <nospam-b...@nospam.seanet.com> wrote in message
news:un54to3...@corp.supernews.com...

>
> So I went to look up how it's done by particular vendors. VC7 uses a COW
> implementation for the MFC class CString, and a non-COW implementation for
> std::string. The funny part is the sheer number of layers one has to dig
> through to get there.
>
> Here's a (fairly impenetrable) summary of the declarations for MFC's
> CString:
> ChTraitsBase is a template providing a typedef for PXSTR
> ChTraitsCRT is a template that derives from ChTraitsBase
> StrTraitMFC is a template that derives from ChTraitsCRT
> CSimpleStringT is a template that stores a PXSTR
> CStringT is a template that derives from CSimpleStringT
> CString is typedef, specializing CStringT and using StrTraitMFC
> ... where the PXSTR is a char* or wchar_t* pointing beyond the edge of a
> CStringData object, which declares all the data members that do not store
> text.
>
> The STL declarations aren't much better:

You mean it would be ``better'' to be easier for you to read and
understand than to be conforming, correct, and efficient?

> _String_base is a class with no data, just some methods
> _String_val is a template that derives from _String_base
> basic_string is a template that derives from _String_val
> char_traits is a templatized struct
> allocator is a templatized class
> string is a typedef, specializing basic_string,
> and using char_traits and allocator specializations
> ... where the data is stored by basic_string, in a union that's either
> inline characters or an external pointer.

There's a reason for every one of those layers.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Artem Livshits

Sep 3, 2002, 8:26:39 AM

"Kip Rugger" <rug...@mts.net> wrote in message
news:aktmc9$9e$1...@rugger.nodomain.ca...

> >and my reading is that this basically means that the compiler can't
> >reorder (nor eliminate) a sequence of reads and writes to volatile data.
>
> This is necessary, but not sufficient in the presence of asm statements.

Can you point at a place in the standard where this is mentioned?

> Your wording suggests that a compiler could be constructed which
> is `asm-aware'. That is complete nonsense if you mean that the
> compiler is able to deduce at compile time the complete run-time
> state and thence the precise implications of an asm sequence.

Any pointers to where the standard says anything about being 'asm-aware'
or not 'asm-aware'?

1.9/1 says:

The semantic descriptions in this International Standard define a
parameterized nondeterministic abstract machine. This International
Standard places no requirement on the structure of conforming imple-
mentations. In particular, they need not copy or emulate the struc-
ture of the abstract machine. Rather, conforming implementations are
required to emulate (only) the observable behavior of the abstract
machine as explained below.

I don't see any specific provisions w.r.t. asm definitions here, which
allows me to reasonably conclude that the compiler is free to generate
any code as long as the generated code produces the required observable
behavior (which is a sequence of reads and writes to volatile data and
calls to library I/O functions [1.9/6]). So if the compiler can prove
that it can reorder/eliminate an asm definition without affecting the
observable behavior of the program, it's allowed to do so.

> The gcc code-generation process is built around a register-transfer
> description of the target machine, and the asm facility is able to
> interact with this to a certain degree. In particular, the compiler
> gives `asm volitile("":::"memory");' the same aliasing semantics as
> a function call; that is, registers are spilled back to memory.
> This is documented behaviour of the compiler and not an `accident'.

Using an empty asm definition as a "compiler optimization barrier" is an
implementation detail of some particular compiler and from the point of view
of the standard it's an accident. Another compiler may not work that way; or
future versions of gcc may not work that way, you never know.

BTW, the standard doesn't make any special provisions w.r.t. function calls
either; so although in most implementations the compiler can't see the
function definition at compile time if it's defined in a different TU
(so it has to be conservative and assume that reordering might change
observable behavior), in implementations with link-time code generation
the compiler may see the function definition (even if it's defined in
another TU) and can reorder/eliminate a function call if it can prove
that the observable behavior is not changed by doing that.

> The use of such an extension is easily isolated to compiler- and
> processor-specific header files. Although the files themselves
> aren't portable, the functionality certainly is. This is about
> as standard as you're going to get with today's tools.

Well, I guess you can get more: using volatile data is a standard way to
control compiler optimizations (w.r.t. code elimination/reordering), so I
see no need for compiler-specific solutions here.


Artem Livshits
Brainbench MVP for C++
http://www.brainbench.com

JKB

Sep 3, 2002, 12:19:06 PM

> "JKB" <nospam-b...@nospam.seanet.com> wrote:
> > So I went to look up how it's done by particular vendors. VC7 uses
> > a COW implementation for the MFC class CString, and a non-COW
> > implementation for std::string. The funny part is the sheer number
> > of layers one has to dig through to get there.
> [snip list of classes and templates for CString]

> > The STL declarations aren't much better:
> [snip list for std::string]

"P.J. Plauger" <p...@dinkumware.com> wrote :


> You mean it would be ``better'' to be easier for you to read and
> understand than to be conforming, correct, and efficient?

That's a false choice. Ease of comprehension is a good thing, as are
correctness and efficiency. Conformance is good too, once it's clear what
to conform to. Are you suggesting that these factors conflict?

I should perhaps have said "the STL declarations aren't much simpler", but
I'll stand by the implication that simpler *is* better. It is a Good
Thing - not the final word, but definitely something to consider.


> There's a reason for every one of those layers.

Maybe so. And maybe MS has good reasons for how CString is built. But I've
watched a lot of people struggle with the STL. Competent people, with
plenty of brain cells. It's a great library with many advantages, but ease
of reading and understanding is not one of them.

-- jkb

Joshua Lehrer

unread,
Sep 4, 2002, 5:33:09 AM9/4/02
to
Attila Feher <Attila...@lmf.ericsson.se> wrote in message news:<3D731B57...@lmf.ericsson.se>...

> Joshua Lehrer wrote:
> >
> > We wrote our own reference counted COW multi-thread safe String class.
> > I can't give you our code, but I can answer any questions you might
> > have about other, similar code. I can tell you that it can be done
> > with just atomic_increment and atomic_decrement.
>
>
> Interestring (pun intended - although it started its life as a typo).
> How can you make it work without atomic reading of integers?
>

atomic_decrement needs to return the new value. If you decrement the
reference count and atomic_decrement returns 0, then you know you are
the last person to have a handle, and you can safely dispose of the
characters without fear of someone else referencing the String.
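
In sketch form (atomic_decrement here stands for whatever your platform
provides; the only requirement is that the decrement and the read of the
new value happen as one indivisible operation):

    struct Rep { int refcount; /* ... the characters live here ... */ };

    int atomic_decrement(int* p); // platform primitive: decrements *p and
                                  // returns the new value atomically

    void release(Rep* rep)
    {
        // a plain decrement followed by a separate read would have a
        // race window; returning the value produced by *our* decrement
        // closes it
        if (atomic_decrement(&rep->refcount) == 0)
            delete rep; // we held the last reference
    }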

Kip Rugger

unread,
Sep 4, 2002, 5:51:23 AM9/4/02
to
James Kanze <ka...@gabi-soft.de> wrote:
[snip]
>> Caches are not usually a problem because the cache can see writes
>> occurring on the system bus, and invalidate or update the cached line.
>
>It can, but from what I have been able to ascertain, it doesn't on
>certain systems. The earlier multiprocessor systems I used did this, and
>wrote through the cache, so that all writes were immediately visible in
>main memory. This sure made life easier. But there is a performance
>issue involved, and if two processors are actively working in the same
>page (albeit not on the same objects), the performance hit for doing
>this can be significant.

Actually, some of the earlier systems were the worst culprits.
Consider those designs that managed to get a cache between the
processor and the MMU: the contents of the cache would be tagged
with virtual addresses. The resulting aliasing made it very
difficult (or inefficient) to implement shared memory segments.

Modern operating systems depend on caches appearing to be write
through. Although there is no standard you can reference, there
are market forces. So, if a certain processor architecture could
not support shared memory (or had to multiplex its mapping in
real time), it's unlikely the architecture could compete in the
general-purpose arena.

Caches are generally organized into `lines', and a common line size
is 128 bits. The CPU may negotiate transfers with the cache in
byte-sized units, but the cache and memory system use the larger
line-sized units. On a system with a 32-bit data bus, all non-
cached memory references will entail the movement of four 32-bit
words. The pre-fetching of neighboring data is often an overall
win (eg instructions). The downside happens when a line `ping pongs'
between caches.

Attila Feher

unread,
Sep 4, 2002, 11:17:15 AM9/4/02
to
Joshua Lehrer wrote:
[SNIP]

> atomic_decrement needs to return the new value. If you decrement the
> reference count and atomic_decrement returns 0, then you know you are
> the last person to have a handle, and you can safely dispose of the
> characters without fear of someone else referencing the String.

And what if someone's (another thread's) atomic increment is just waiting
there to make it 1 again? Isn't that possible?

Attila

Christian Vogler

unread,
Sep 4, 2002, 9:49:46 PM9/4/02
to
Attila Feher (Attila...@lmf.ericsson.se) wrote:
: > atomic_decrement needs to return the new value. If you decrement the
: > reference count and atomic_decrement returns 0, then you know you are
: > the last person to have a handle, and you can safely dispose of the
: > characters without fear of someone else referencing the String.

: And what if someone's (another thread's) atomic increment is just waiting
: there to make it 1 again? Isn't that possible?

Not if the reference counting mechanism is correctly implemented. The
reference counts can only be modified if the thread actually held a
live reference to the object. A value of 0 after the atomic decrement
implies a previous value of 1, hence the decrementing thread held the
last live reference.

If another thread increments this value at the same time, it must have
held a live reference, too, and hence the previous reference count
must have been at least 2. IOW, the scenario that you describe is
impossible with a correct implementation.

Regards
- Christian

Shannon Barber

unread,
Sep 5, 2002, 5:43:48 AM9/5/02
to
ka...@gabi-soft.de (James Kanze) wrote in message
> Note that any proof must be based on contractual guarantees, like the
> one above. Intel architecture, for example, provides the hooks which

> can be used to implement the necessary synchronization (the lock
> prefix). It's not sufficient that the processor provides the hooks,
> however; all of the surrounding hardware (the caches, etc.) have to be
> designed to use them correctly. Do Dell or Compaq guarantee that the
> use of lock will ensure cache consistency across multiple processors?
> (What does Intel say about it, for that matter? The lock prefix dates
> from much simpler days.)

I'm hardly an expert, but from what I have read about 386+
architecture, the lock prefix guarantees correct behavior (of atomic
inc/dec) on multi-processor machines, because it locks the entire
memory bus for the duration of the transaction. (If the cache works
at all with SMP, it will have to work after such a write is complete.)
On single-processor machines, on 386+ hardware, inc and dec (without
the lock prefix) are already atomic operations!

So, how long does it take to lock a mutex on a SPARC? ;0)

Wil Evers

unread,
Sep 5, 2002, 6:23:19 AM9/5/02
to
In article <un1p1kf...@corp.supernews.com>, JKB wrote:

> Aren't most strings used entirely within one thread? And isn't that
> basically Herb's point in the GoTW article - that COW implies paying mutex
> prices on strings that don't need it? That, plus the annoying fact that
> mutexes turn out to be expensive.

OK, here's the final word on this subject :-).

Just for the fun of it, I wrote a simple smart_string template, which
takes the string representation class to use as a template argument. I
implemented the following representation classes (sketched below):

nonshared_rep - one instance per string object
shared_rep<no_mutex> - COW, unprotected reference count
shared_rep<pthread_mutex> - COW, mutexed reference count
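
To give an idea of the shape of the thing, here is a stripped-down sketch
(not the code I actually measured):

    #include <cstring>

    struct nonshared_rep {              // deep-copies on every clone
        char* data;
        static nonshared_rep* create(const char* s) {
            nonshared_rep* r = new nonshared_rep;
            r->data = new char[std::strlen(s) + 1];
            std::strcpy(r->data, s);
            return r;
        }
        nonshared_rep* clone() const { return create(data); }
        void release() { delete [] data; delete this; }
    };

    // shared_rep<Mutex> keeps a reference count instead: clone() bumps
    // the count under the Mutex policy's lock and returns 'this', and
    // release() deletes only when the count drops to zero.

    template <class Rep>
    class smart_string {
    public:
        explicit smart_string(const char* s) : rep_(Rep::create(s)) {}
        smart_string(const smart_string& other)
            : rep_(other.rep_->clone()) {}
        ~smart_string() { rep_->release(); }
        // assignment omitted in this sketch
    private:
        Rep* rep_;
    };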

The test case I used consisted of ten million calls to a function that
returns a copy of a statically allocated smart_string<Rep> of length 1.
This resulted in the following execution times (Linux, g++-3.2, glibc
2.1.3, PentiumII 350 Mhz):

Rep class used without -O2 with -O2

nonshared_rep 24.247s 21.382s
shared_rep<no_mutex> 2.831s 0.540s
shared_rep<pthread_mutex> 13.058s 10.284s

I read these results as follows:

(*) The thread-unaware COW implementation (shared_rep<no_mutex>) is
about an order of magnitude faster than the non-COW (nonshared_rep)
implementation. Furthermore, it provides much better opportunities for
inlining (the -O2 case).

(*) Contrary to popular belief, the thread-safe COW implementation
(shared_rep<pthread_mutex>) is about twice as fast as the non-COW
implementation. Although mutexed reference counting is expensive, it is
considerably cheaper than the extra calls to the heap manager required
to avoid it. This is actually no surprise, taking into account that a
thread-safe heap manager must do its own locking anyway - which is a
simple fact of life that is often overlooked.

Obviously, the tests I ran just measure the performance of the string
class's copy constructor and destructor, and ignore the impact of
(mutexed) reference counting on writes to the rep's character buffer. The
benefit of COW depends on the way strings are used by the program: if all
you do is change the characters in existing strings, and you never copy the
strings themselves, COW is not the way to go.

However, the numbers above do indicate that a single, one-size-fits-all
standard string class is not the answer, especially for multi-threaded
applications. Of course, JKB's suggestion that most strings are never
copied between threads is correct. The potential performance benefits
of the ability to specify which sharing policy to use are simply too big to
be ignored.

- Wil

--
Wil Evers, DOOSYS R&D, Utrecht, Holland
[wil underscore evers at doosys dot com]

Attila Feher

unread,
Sep 5, 2002, 9:26:14 AM9/5/02
to
Wil Evers wrote:
[SNIP]

> (*) Contrary to popular belief, the thread-safe COW implementaion
> (shared_rep<pthread_mutex>) is about twice as fast as the non-COW
> implementation. Although mutexed reference counting is expensive, it is

Popular belief tested in real-life systems. It is not the locking of
the MUTEX that is expensive, but creating and destroying it! If you
create and destroy enough COW strings in an MT system you can render
_any_ number of parallel CPUs virtually _one_. I have seen this
happen: 25% CPU use max on a 4 CPU system, 12.5% on an 8 CPU system and so
on.

Yeah, and please repeat your tests with some nice thing being used -
like libHoard. libHoard speeds up memory allocation in MT (and OO)
systems quite dramatically. Using libHoard one can gain a 100%
performance increase - if those MUTEX creation/destroy beasts don't jump
in.

A normal program (written properly) rarely just copies strings around
without touching them. So I believe that the tests you ran exercise
a very specific case - which does not reflect most (and certainly not
all) real-life scenarios. And basically that is behind the philosophy
of _not_ doing optimization before it is proven that you need it.

Attila

Francis Glassborow

unread,
Sep 5, 2002, 9:26:52 AM9/5/02
to
In article <de001473.02090...@posting.google.com>, Shannon
Barber <shannon...@myrealbox.com> writes

> On single-processor machines, on 386+ hardware, inc and dec (without
>the lock prefix) are already atomic operations!

Really? With multiple pipelines and floating point units? And I seem to
remember that the latest Intel architectures go even further with
interweaving instructions from different threads on the same pipeline.


--
Francis Glassborow ACCU
64 Southfield Rd
Oxford OX4 1PA +44(0)1865 246490
All opinions are mine and do not represent those of any organisation

Ross Smith

unread,
Sep 5, 2002, 1:06:06 PM9/5/02
to
Wil Evers wrote:
>
> Just for the fun of it, I wrote a simple smart_string template, which
> takes the string representation class to use as a template argument.
> I implemented the following representation classes :
>
> nonshared_rep - one instance per string object
> shared_rep<no_mutex> - COW, unprotected reference count
> shared_rep<pthread_mutex> - COW, mutexed reference count
>
> The test case I used consisted of ten million calls to a function that
> returns a copy of a statically allocated smart_string<Rep> of length
> 1. This resulted in the following execution times (Linux, g++-3.2,
> glibc 2.1.3, PentiumII 350 Mhz):
>
> Rep class used without -O2 with -O2
>
> nonshared_rep 24.247s 21.382s
> shared_rep<no_mutex> 2.831s 0.540s
> shared_rep<pthread_mutex> 13.058s 10.284s

It would be interesting to add the results for an unprotected _but
guaranteed atomic_ reference count (i.e. thread safe with no mutex).
This is easy on Linux -- see man atomic_add.

> (*) Contrary to popular belief, the thread-safe COW implementation
> (shared_rep<pthread_mutex>) is about twice as fast as the non-COW
> implementation. Although mutexed reference counting is expensive, it
> is considerably cheaper than the extra calls to the heap manager
> required to avoid it.

I think your test program is too narrowly specialised to draw such a
wide-sweeping conclusion. Yes, I know you qualified it further down,
but the above still sounds a bit too definite to me. The unshared vs
mutex-protected figures differ by only a factor of 2; given the
enormous variation in the things people do with strings, a factor of 2
is going to be lost in the noise.

> This is actually no surprise, taking into account that a
> thread-safe heap manager must do its own locking anyway - which is a
> simple fact of life that is often overlooked.

This, however, is an excellent point.

--
Ross Smith ..................................... Auckland, New Zealand
r-s...@ihug.co.nz ...................................................

"A specter is haunting Wall Street; it is the specter
of honest accounting." -- Arthur D. Hlavaty

Alexander Terekhov

unread,
Sep 5, 2002, 1:12:34 PM9/5/02
to

Wil Evers wrote:
>
> In article <un1p1kf...@corp.supernews.com>, JKB wrote:
>
> > Aren't most strings used entirely within one thread? And isn't that
> > basically Herb's point in the GoTW article - that COW implies paying mutex
> > prices on strings that don't need it? That, plus the annoying fact that
> > mutexes turn out to be expensive.

Rather: when there is almost no contention for them, *Microsoft
Windows/Win32* mutexes turn out to be annoyingly expensive. So
(instead of trying to convince Microsoft engineers to adopt [in
Win32/64 too(*)] POSIX mutexes "model" -- shared pthread_mutex_t
objects), Herb Sutter (C++ community liaison at Microsoft now)
is simply bashing COW based C++ string implementations... and
COW (and "internal" resource sharing) in general. ;-) ;-)

>
> OK, here's the final word on this subject :-).

[...]


> However, the numbers above do indicate that a single, one-size-fits-all
> standard string class is not the answer, especially for multi-threaded
> applications. Of course, JKB's suggestion that most strings are never
> copied between threads is correct. The potential performance benefits
> of the ability to specify which sharing policy to use are simply too big to
> be ignored.

http://groups.google.com/groups?selm=3C7615B7.2FA347D7%40web.de
(Subject: Re: crash in std::string on dual processor)

---
Matthew Austern wrote:
[...]
> (The main reason some implementations avoid reference counting
> is that, in the presence of threads, it's much harder to get
> things to work correctly if you use refcounting. As you've
> found.)

Just a few ideas/food for thought:

immutable_string< ...,
class thread_sharing_policy_for_representation =
/* -pthread option: shared/private by default */ >

string< ...,
class thread_sharing_policy_for_representation =
/* -pthread option: shared/private by default */ >

http://groups.google.com/groups?threadm=3BC8463D.6E23002A%40web.de
(note that unref() needs to inject write/"release" mem.bars too IF
threads could unref() mutable objects INSIDE synchronized regions
for mutation/updates)
---

http://groups.google.com/groups?selm=c29b5e33.0201281112.10db0944%40posting.google.com
(Subject: Re: An idea on fast multithreaded refcounted strings)

---
David Schwartz <dav...@webmaster.com> wrote in message news:<3C557385...@webmaster.com>...
> > > 2. When copying the string, compare the creator's thread id saved at point 1
> > > with the current thread id. If the thread ids are the same, increment the
> > > reference count. Otherwise (and here's the whole point), lock the string,
> > > perform a deep copy of it, and unlock it.
>
> This sounds impossible to me. If the thread that used it last didn't
> lock it, what good is locking it in this thread going to do?

ummm... I do not think that he really meant to lock the string
to perform a deep copy of it. The shared string is most likely
meant to be immutable (const) with respect to all threads (while
being shared, or read/write access fully synchronized on some
higher level). I guess that what he wants to achieve is that
copies of such string(s) in *other* (non-owner) threads would
NOT need synchronization being private to those other (non-owner)
threads... thread_shared_string/thread_private_string might work
better... ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
---
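
A sketch of that copy path (pthreads; the names are invented, and the
shared rep must of course stay immutable - or be fully synchronized on
some higher level - while it is shared):

    tss_string::tss_string(const tss_string& other)
    {
        if (pthread_equal(other.rep_->owner, pthread_self())) {
            rep_ = other.rep_;  // same thread created the rep: a plain
            ++rep_->refcount;   // increment is safe, since no other
        }                       // thread ever touches this count
        else {
            rep_ = deep_copy(*other.rep_); // crossing threads: deep copy;
            rep_->owner = pthread_self();  // the new rep is private to us
            rep_->refcount = 1;
        }
    }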

regards,
alexander.

(*) http://www.microsoft.com/technet/treeview/default.asp?url=/technet/prodtechnol/windows2000serv/deploy/sfu/sfuposix.asp

James Kanze

unread,
Sep 5, 2002, 5:44:11 PM9/5/02
to
rug...@mts.net (Kip Rugger) wrote in message
news:<al2vg9$79$1...@rugger.nodomain.ca>...

> Modern operating systems depend on caches appearing to be write
> through.

On at least some modern machines, the caches are NOT write through.

The operating system is the last element which needs write through. The
operating system knows which threads are on what machines, when it will
switch threads, etc., and can issue a barrier instruction when needed.

> Although there is no standard you can reference, there are market
> forces. So, if a certain processor architecture could not support
> shared memory (or had to multiplex its mapping in real time), it's
> unlikely the architecture could compete in the general-purpose arena.

I wonder if this is the reason why the Digital (Compaq, HP) Alpha
architecture is not more successful. (For that matter, do modern Intel
processors write through? I'd be surprised, but then, Intel does
sometimes surprise me.)

--
James Kanze mailto:jka...@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

James Kanze

unread,
Sep 5, 2002, 5:50:25 PM9/5/02
to
Wil Evers <bou...@dev.null> wrote in message
news:<3d7657c5$0$91246$e4fe...@dreader4.news.xs4all.nl>...

> In article <un1p1kf...@corp.supernews.com>, JKB wrote:

> > Aren't most strings used entirely within one thread? And isn't
> > that basically Herb's point in the GoTW article - that COW implies
> > paying mutex prices on strings that don't need it? That, plus the
> > annoying fact that mutexes turn out to be expensive.

> OK, here's the final word on this subject :-).

> Just for the fun of it, I wrote a simple smart_string template, which
> takes the string representation class to use as a template argument.
> I implemented the following representation classes :

> nonshared_rep - one instance per string object
> shared_rep<no_mutex> - COW, unprotected reference count
> shared_rep<pthread_mutex> - COW, mutexed reference count

> The test case I used consisted of ten million calls to a function that
> returns a copy of a statically allocated smart_string<Rep> of length
> 1. This resulted in the following execution times (Linux, g++-3.2,
> glibc 2.1.3, PentiumII 350 Mhz):

> Rep class used without -O2 with -O2

> nonshared_rep 24.247s 21.382s
> shared_rep<no_mutex> 2.831s 0.540s
> shared_rep<pthread_mutex> 13.058s 10.284s

> I read these results as follows:

You haven't given the critical figure: how many times, on the average,
must a string be copied before reference counting is a win? While your
test is interesting, I don't know of many programs which make ten
million copies of a single string.

I don't think that there is any doubt that copying a reference counted
string is cheaper than deep copying, even in a multi-threaded
environment. After all, the allocation of the deep copy must be
protected by a mutex as well.

The difference comes because reference counting increases the cost of
other operations. How much depends on the interface being implemented,
and one of the points being made is that the interface imposed by the
standard makes the other operations very expensive, and effectively
prevents sharing in many cases where we'd like it anyway. (I'm not sure I
agree with the arguments, and I'd like to see measurements. But your test
doesn't provide a counter argument.)

The argument hinges on the fact that there are so many ways to modify a
string (e.g. through an iterator, a reference to a character, etc., as
well as through the non-const functions). Any time you might modify a
string, you must ensure that its copy is unique: i.e. check the
reference count. Checking the reference count requires a lock; if you
don't use a shared representation, you know the copy is unique, so you
don't need the lock.
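
In other words, every potentially mutating entry point has to start with
something like this (a sketch; lock()/unlock() stand for whatever protects
the count):

    void string::make_unique() // called by operator[], begin(), end(), ...
    {
        rep_->lock();                        // the count must be read
        if (rep_->refcount > 1) {            // under the lock
            rep* fresh = rep_->deep_copy();  // shared: split off our own
            --rep_->refcount;                // private copy
            rep_->unlock();
            rep_ = fresh;                    // count == 1; writing is safe
        }
        else {
            rep_->unlock();                  // already unique
        }
    }

With a non-shared representation the count is always 1 by construction, and
the whole function - lock included - disappears.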

If I'm not really convinced by the argument, it's probably because in my
own code, I make almost no use of the non-const functions.

> (*) The thread-unaware COW implementation (shared_rep<no_mutex>) is
> about an order of magnitude faster than the non-COW (nonshared_rep)
> implementation. Furthermore, it provides much better opportunities for
> inlining (the -O2 case).

> (*) Contrary to popular belief, the thread-safe COW implementation
> (shared_rep<pthread_mutex>) is about twice as fast as the non-COW
> implementation. Although mutexed reference counting is expensive, it is
> considerably cheaper than the extra calls to the heap manager required
> to avoid it. This is actually no surprise, taking into account that a
> thread-safe heap manager must do its own locking anyway - which is a
> simple fact of life that is often overlooked.

> Obviously, the tests I ran just measure the performance of the string
> class's copy constructor an destructor, and ignore the impact of
> (mutexed) reference counting on writes to the rep's character buffer.
> The benefit of COW depends on the way strings are used by the program:
> if all you do is change the characters in existing strings, and you
> never copy the strings themselves, COW is not the way to go.

It's more subtle than that. If you are using std::string, and you have
a non-const instance, then you'll need to grab a mutex lock in begin()
and end() when passing the iterators to an algorithm. Even if the
algorithm doesn't modify the string.
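
For instance (next_word() is just some function returning a string):

    std::string s = next_word();             // non-const instance
    std::ptrdiff_t n = std::count(s.begin(), s.end(), 'x');
    // begin() and end() pick the non-const overloads here, so a COW
    // implementation must unshare (and therefore lock) - even though
    // count() never writes a single character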

> However, the numbers above do indicate that a single,
> one-size-fits-all standard string class is not the answer, especially
> for multi-threaded applications. Of course, JKB's suggestion that
> most strings are never copied between threads is correct. The
> potential performance benefits of the ability to specify which sharing
> policy to use are simply too big to be ignored.

As are the benefits of an intelligent interface to string, which doesn't
allow modification (except for assignment operators), and above all,
which doesn't allow modification through external references or
iterators (which inhibits copying in many cases even when no
modification is intended).

--
James Kanze mailto:jka...@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

James Kanze

unread,
Sep 5, 2002, 5:52:18 PM9/5/02
to
shannon...@myrealbox.com (Shannon Barber) wrote in message
news:<de001473.02090...@posting.google.com>...

> ka...@gabi-soft.de (James Kanze) wrote in message
> > Note that any proof must be based on contractual guarantees, like
> > the one above. Intel architecture, for example, provides the hooks
> > which can be used to implement the necessary synchronization (the
> > lock prefix). It's not sufficient that the processor provides the
> > hooks, however; all of the surrounding hardware (the caches, etc.)
> > have to be designed to use them correctly. Do Dell or Compaq
> > guarantee that the use of lock will ensure cache consistency across
> > multiple processors? (What does Intel say about it, for that
> > matter? The lock prefix dates from much simpler days.)

> I'm hardly an expert, but from what I have read about 386+
> architecture the lock prefix guarantees correctly behavior (of atomic
> inc/dec) on multi-processor machines, because it locks the entire
> memory bus for the duration of the transaction.

That was the situation in the old days, at any rate. (Of course, the
chip never actually locked the memory bus. It just generated a signal
saying that it wanted the memory bus to be locked.) The question today
is: what memory bus? In order to guarantee correct operation, it must
lock the memory bus, ensure write-through, and also notify the other
processors that their cache is dirty.

I believe that it is the intention of Intel that this be the case. At
any rate, another poster has indicated that on modern processors, the
lock prefix DOES ensure a memory barrier -- any values written by
instructions before the locked instruction will be written to main
memory before the locked instruction starts, any values read from memory
by the locked instruction will come from main memory, and not the cache,
any values written by the locked instruction will be written through to
main memory, and any values read from memory after the locked
instruction will be read from main memory, and not the cache. If
they've gone to that much effort (and given the lock prefix that much
runtime overhead), the only possible goal is to make it work correctly
in a multiprocessor environment.

> (If the cache works at all with SMP, it will have to work after such a
> write is complete.) On single-processor machines, on 386+ hardware,
> inc and dec (without the lock prefix) are already atomic operations!

> So, how long does it take to lock a mutex on a SPARC? ;0)

Too long. I've not measured the actual lock request, but a couple of
benchmarks using the thread safe version of a reference counted string,
and the single thread version, show a factor 10 difference in execution
times.

--
James Kanze mailto:jka...@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

JKB

unread,
Sep 6, 2002, 4:46:54 AM9/6/02
to

"Attila Feher" <Attila...@lmf.ericsson.se> wrote:

> Yeah, and please repeat your tests with some nice thing
> being used - like libHoard.

Oh? Then he'd have libHoard-specific results instead of results specific to
whatever memory system he's using now. We need the comparison across string
implementations, while holding the rest of the system constant. And that's
not possible, because different string implementations use the heap
differently. Benchmarking is difficult.

> Using libHoard one can gain 100% performance increase

That's so cool. So if my test takes 25.03 seconds to run, then libHoard
will save me 100% of that, or approximately 25.03 seconds! Let's see now,
that leaves my final running time at zero, which is nice.


> A normal program (written properly) rarely just copies strings around
> without touching them. So I believe that the tests you did run test for
> a very specific case - which does not reflect most (and certainly not
> all) real-life scennarios.

A very important point here. All these tests, from Herb Sutter's article to
what's been posted in this group, test particular patterns of behavior. One
could probably produce any desired test result just by choosing what member
functions to stress.

-- jkb

Francis Glassborow

unread,
Sep 6, 2002, 8:40:59 AM9/6/02
to
In article <d6651fb6.02090...@posting.google.com>, James
Kanze <ka...@gabi-soft.de> writes

>As are the benefits of an intelligent interface to string, which doesn't
>allow modification (except for assignment operators), and above all,
>which doesn't allow modification through external references or
>iterators (which inhibits copying in many cases even when no
>modification is intented).

Actually, I think there are major benefits to having immutable strings.
And, it seems to me, that string literals should be of that type.


--
Francis Glassborow ACCU
64 Southfield Rd
Oxford OX4 1PA +44(0)1865 246490
All opinions are mine and do not represent those of any organisation

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Joshua Lehrer

unread,
Sep 6, 2002, 10:00:17 AM9/6/02
to
Ross Smith <r-s...@ihug.co.nz> wrote in message news:<al7hb0$nro$1...@lust.ihug.co.nz>...

> > This is actually no surprise, taking into account that a
> > thread-safe heap manager must do its own locking anyway - which is a
> > simple fact of life that is often overlooked.
>
> This, however, is an excellent point.

Unfortunately, I don't believe it to be true.

Just as a ref_counted_string can be written using atomic operations
instead of a mutex, so can a thread-safe heap manager.

On the other hand, I have written a thread-safe COW-string class using
atomic operations. I have never written a heap manager (except on the
whiteboard at MSFT during an interview), let alone a thread-safe one
using only atomic operations.

joshua lehrer
factset research systems

James Kanze

unread,
Sep 6, 2002, 12:51:35 PM9/6/02
to
Francis Glassborow <francis.g...@ntlworld.com> wrote in message
news:<QOzRnKkY...@robinton.demon.co.uk>...

> In article <de001473.02090...@posting.google.com>, Shannon
> Barber <shannon...@myrealbox.com> writes
> > On single-processor machines, on 386+ hardware, inc and dec (without
> >the lock prefix) are already atomic operations!

> Really? With multiple pipelines and floating point units? And I seem
> to remember that the latest Intel architectures go even further with
> interweaving instructions from different threads on the same pipeline.

This is getting rather off topic, but strictly speaking, even on the
original 8086, you could get cases where the value used disagreed with
the last value written, even with the lock instruction. Of course, it
would only happen in self modifying code, where you modified the
instruction less than six bytes ahead of the instruction doing the
modification.

As you say, the modern versions of the processor make much more
extensive use of pipelining. Which on one hand increases the risk of
such problems. On the other hand, because the increased risk makes it
more likely that the problem will affect real code, I consider it highly
likely that the Intel engineers have addressed the problem.

--
James Kanze mailto:jka...@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Attila Feher

unread,
Sep 6, 2002, 11:36:14 PM9/6/02
to
JKB wrote:
>
> "Attila Feher" <Attila...@lmf.ericsson.se> wrote:
>
> > Yeah, and please repeat your tests with some nice thing
> > being used - like libHoard.
>
> Oh? Then he'd have libHoard-specific results instead of results specific to
> whatever memory system he's using now. We need the comparison across string
> implementations, while holding the rest of the system constant. And that's
> not possible, because different string implementations use the heap
> differently. Benchmarking is difficult.

Yep. So for example by running a test on a certain platform one may
say: look ma', COW is faster! - while all he has seen is that he has a
sh*tt* OS/C allocator... So to make any measurements the idea is to
move every slider to see what you really measure!

> > Using libHoard one can gain 100% performance increase
>
> That's so cool. So if my test takes 25.03 seconds to run, then libHoard
> will save me 100% of that, or approximately 25.03 seconds! Let's see now,
> that leaves my final running time at zero, which is nice.

100% performance increase means: twice as fast. I said performance
(crashes per second :-) and not time! But nice try. ;-)

> > A normal program (written properly) rarely just copies strings around
> > without touching them. So I believe that the tests you did run test for
> > a very specific case - which does not reflect most (and certainly not
> > all) real-life scennarios.
>
> A very important point here. All these tests, from Herb Sutter's article to
> what's been posted in this group, test particular patterns of behavior. One
> could probably produce any desired test result just by choosing what member
> functions to stress.

So measuring is a science and an art. And statistics and therefore
politics. :-)

Attila

Attila Feher

unread,
Sep 6, 2002, 11:37:19 PM9/6/02
to
Ross Smith wrote:
[SNIP]

> > This is actually no surprise, taking into account that a
> > thread-safe heap manager must do its own locking anyway - which is a
> > simple fact of life that is often overlooked.
>
> This, however, is an excellent point.

Except that it is not necessarily true. Unfortunately it is for most
OSes I know, but if you look at libHoard, you will see it is not
actually needed.

Attila

Wil Evers

unread,
Sep 7, 2002, 6:04:54 AM9/7/02
to
In article <3D7737B0...@lmf.ericsson.se>, Attila Feher wrote:

> Wil Evers wrote:
> [SNIP]
>> (*) Contrary to popular belief, the thread-safe COW implementation
>> (shared_rep<pthread_mutex>) is about twice as fast as the non-COW
>> implementation. Although mutexed reference counting is expensive, it is
>
> Popular belief tested in real-life systems. It is not the locking of
> the MUTEX that is expensive, but creating and destroying it! If you
> create and destroy enough COW strings in an MT system you can render
> _any_ number of parallel CPUs virtually _one_. I have seen this
> happen: 25% CPU use max on a 4 CPU system, 12.5% on an 8 CPU system and so
> on.

Tried that on the only SMP system I have access to: a dual-CPU Pentium III
(500 MHz), linux-2.4.18, g++-2.95.2 (-O2), glibc-2.1.3. This time, the job
to do is to construct a string object from a C-style string of length 1 (no
copy ctor involved) and destroy it again, repeated ten million times. Here
are the results:

string class used #threads 1 2 4

smart_string<nonshared_rep> 13.515s 8.298s 7.579s
smart_string<shared_rep<no_mutex> > 13.594s 8.322s 7.711s
smart_string<shared_rep<pthread_mutex> > 17.729s 11.070s 10.127s

As expected, creating and destroying a rep object containing a mutex is
more expensive than creating one without (by about 30%). At least on this
system, this extra 30% appears to be indepedent of the number of threads
involved. I also took a look at the source code for pthread_mutex_init()
and pthread_mutex_destroy() in glibc; I couldn't find any code that
interferes with other threads.

It seems like the problems you're having with creating and destroying
mutexes are a peculiarity of the platform you're using.

Regards,

- Wil

--
Wil Evers, DOOSYS R&D, Utrecht, Holland
[wil underscore evers at doosys dot com]

Francis Glassborow

unread,
Sep 7, 2002, 6:41:43 AM9/7/02
to
In article <3D78DEA2...@lmf.ericsson.se>, Attila Feher
<Attila...@lmf.ericsson.se> writes

>> That's so cool. So if my test takes 25.03 seconds to run, then libHoard
>> will save me 100% of that, or approximately 25.03 seconds! Let's see now,
>> that leaves my final running time at zero, which is nice.
>
>100% performance increase means: twice as fast. I said performance
>(crashes per second :-) and not time! But nice try. ;-)

The mathematical pedant in me is itching. 100% should mean infinitely
faster:-) Just as a 100% reduction for sale price means zero cost. About
20 years ago we managed to persuade UK Trading Standards Officers of the
latter. Percentages are calculated on the original, so 100% means the
whole original value. The best answer is to stop using percentages; most
people's understanding of them is even more flawed than their
understanding of common fractions.

The problem of COW is that it is highly dependent on the mix of
operations that cannot mutate the object (either directly or indirectly)
versus those that can. Please note that const member functions are not
immune to problems if there is either a mutable member, or a member
function that hands out access to a member. In multi-threaded environments
that is close to killing COW implementations in many cases. Of course OS level
close to killing COW implementations in many cases. Of course OS level
facilities may make it viable on some platforms.

I wonder how often a const string really is an immutable string.

--
Francis Glassborow ACCU
64 Southfield Rd
Oxford OX4 1PA +44(0)1865 246490
All opinions are mine and do not represent those of any organisation

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Wil Evers

unread,
Sep 7, 2002, 10:29:01 AM9/7/02
to
In article <3D78DEA2...@lmf.ericsson.se>, Attila Feher wrote:

> JKB wrote:

>> A very important point here. All these tests, from Herb Sutter's article
>> to what's been posted in this group, test particular patterns of
>> behavior.
>> One could probably produce any desired test result just by choosing what
>> member functions to stress.
>
> So measuring is a science and an art. And statistics and therefore
> politics. :-)

What the numbers that I posted do is show that on a *specific* platform,
with a *specific* usage pattern, changing the implementation of the string
class can have a dramatic impact on performance. I did not suggest these
numbers would apply to other platforms, other applications or other
programming styles. Now *that* would have been political indeed.

You have a fast heap, I have a fast mutex; you manipulate strings in place,
I copy them around all the time. Obviously, your preferred implementation
differs from mine. I have no problem with that.

If there's anything political about what I'm trying to say, it is probably
that we should stop arguing about how the current std::string is best
implemented. Instead, users should have the option to specify the
representation sharing policy they want, as an Alexandrescu-style template
parameter. That way, we can both have our cake and eat it too.

- Wil

--
Wil Evers, DOOSYS R&D, Utrecht, Holland
[wil underscore evers at doosys dot com]

Carlos Moreno

unread,
Sep 7, 2002, 10:41:32 AM9/7/02
to
Christian Vogler wrote:
> Attila Feher (Attila...@lmf.ericsson.se) wrote:
> : > atomic_decrement needs to return the new value. If you decrement the
> : > reference count and atomic_decrement returns 0, then you know you are
> : > the last person to have a handle, and you can safely dispose of the
> : > characters without fear of someone else referencing the String.
>
> : And what if someones' (another thread) atomic increment is just waiting
> : there to make it 1 again? Isn't that possible?
>
> Not if the reference counting mechanism is correctly implemented. The
> reference counts can only be modified if the thread actually held a
> live reference to the object. A value of 0 after the atomic decrement
> implies a previous value of 1, hence the decrementing thread held the
> last live reference.
>
> If another thread increments this value at the same time, it must have
> held a live reference, too, and hence the previous reference count
> must have been at least 2. IOW, the scenario that you describe is
> impossible with a correct implementation.

I still don't see it. After you decrement, you have to check if
the value is 0; that portion is not atomic; the value does not
remain locked, so that right after the atomic decrement took place,
a thread switch might take place that will increment the value...

What am I missing?

Carlos
--

White Wolf

unread,
Sep 7, 2002, 10:45:19 AM9/7/02
to
Francis Glassborow wrote:
> In article <3D78DEA2...@lmf.ericsson.se>, Attila Feher
> <Attila...@lmf.ericsson.se> writes
>
>>>That's so cool. So if my test takes 25.03 seconds to run, then libHoard
>>>will save me 100% of that, or approximately 25.03 seconds! Let's see now,
>>>that leaves my final running time at zero, which is nice.
>>
>>100% performance increase means: twice as fast. I said performance
>>(crashes per second :-) and not time! But nice try. ;-)
>
>
> The mathematical pedant in me is itching. 100% should mean infinitely
> faster:-) Just as 100% reduction for sale price means zero cost.

OK. So guy #1 makes 10 hamburgers per minute. The other makes 20 per
minute. So if I have to express, using a percentage, how many more
hamburgers guy #2 makes, I would say.... YES 100% more, so he has 100%
more performance. The rest is deep philosophy, which I leave with you.

> About
> 20 years ago we managed to persuade UK Trading Standards Officers of the
> latter. Percentages are calculated on the original, so 100% means the
> whole original value.

Yeaaah! Finally you have understood! So if my original performance was
100 calls per second and then I got 200 calls per second, my
performance was increased by 100%! And you have just proven that.

> The problem of COW is that it is highly dependant on the mix of
> operations that cannot mutate the object (either directly or indirectly)
> versus those that can.

Yep. Therefore we can take Herb's advice and _not_ use COW until we
have proven by profiling that our speed suffers because of copying strings.

WW aka Attila

Carlos Moreno

unread,
Sep 7, 2002, 11:07:29 AM9/7/02
to
Francis Glassborow wrote:
>
> 100% should mean infinitely
> faster:-)

I disagree on this.

> Percentages are calculated on the original, so 100% means the
> whole original value.

Which is precisely why 100% increase means twice! 100% *increase*,
as in "the increase is 100%" -- 100% of what? As you said, 100%
*of the original value*. So, the final value is the original
value plus 100% of the original value. I know this is off-topic,
but I have to express my disagreement on this one.

> The best answer is to stop using percentages, most
> peoples understanding of them is even more flawed than their
> understanding of common fractions.

I agree that 50% of the times (oops... sorry! :-)), people
misuse the percentages, for their convenience (taking advantage
of the fact that people will misunderstand them). Say, someone
wants to inflate the benefits of some product they're selling
and they use the percentages the wrong way to make it sound
better to unsuspecting persons...

Carlos
--

White Wolf

unread,
Sep 7, 2002, 9:42:55 PM9/7/02
to
Wil Evers wrote:
>>So measuring is a science and an art. And statistics and therefore
>>politics. :-)

Please observe how I started my sentence. The "So" you can drop, it is
just there to look sophisticated. So ;-) it starts with Measuring.
Not "your measurements", but simply measuring.

> What the numbers that I posted do, is show that on a *specific* platform,
> with a *specific* usage pattern, changing the implementation of the string
> class can have a dramatic impact on performance. I did not suggest these
> numbers would apply to other platforms, other applications or other
> programming styles. Now *that* would have been political indeed.

And I did not suggest you did. If I implied suggesting that - it was by
mistake.

> You have a fast heap, I have a fast mutex; you manipulate strings in place,
> I copy them around all the time. Obviously, your preferred implementation
> differs from mine. I have no problem with that.

Well, I have yet to see C++ code where non-modifying copies cannot be
eliminated by using constant references - but that may just be one more
thing to learn. And I still believe that your application is not _only_
copying strings around (AFAIR your test program did just that), so you
will pay for MUTEX creation and destruction - which is far worse than a
mutex lock...

> If there's anything political about what I'm trying to say, it is probably
> that we should stop arguing aboout how the current std::string is best
> implemented.

Well, AFAIU we were "arguing" about which std::string implementation
(given the current standard) is not good for us - because it causes us
serious trouble in MT programs. The idea is that _no_ premature
optimizations should be done in a standard string class - one should
only avoid pessimizations.

> Instead, users should have the option to specify the
> representation sharing policy they want, as an Alexandrescu-style template
> parameter. That way, we can both have our cake and eat it too.

However so far (like for the next 6 years until the new standard is out)
we have to make it known that _many_ of us are badly hit by COW in MT.
And for no good reason: it could be implemented a faster way _and_ we
don't really need it either. :-((( So for example I would _really_ like
to see atomic ops in the Sun/RogueWave strings _or_ that the "magic
define" to get rid off the COW string would actually *work*.

So I believe that we are on the same side here. :-) I would like to see
an std::string which uses policies. However for now all I can cry out
for is an std::string where I have the choice.

--
WW aka Attila

Christian Vogler

unread,
Sep 8, 2002, 6:58:34 AM9/8/02
to
Carlos Moreno (moreno_at_mo...@xx.xxx) wrote:
: I still don't see it. After you decrement, you have to check if
: the value is 0; that portion is not atomic; the value does not
: remain locked, so that right after the atomic decrement took place,
: a thread switch might take place that will increment the value...

: What am I missing?

You are missing that in this scenario the reference count can never
become less than one - even for a brief instant. This sequence of
operations applies only for reference counts > 1.

To see why, consider the invariant of reference counting: The
reference count for a particular piece of shared data is at least
equal to the number of fully constructed objects that share it.

By "fully constructed" I mean that the object's constructor has
returned and the destructor has not been called. Given atomic
increment and decrement it is relatively easy to verify that this
invariant holds, so I will not go into this here.

For simplicity's sake, assume that only destroying an object decreases
the reference count by one, and only copying an object increases it by
one.

Suppose that thread A decrements the reference count, and that
thread B increments it. So, thread A does something like this:

delete s1;

and thread B does something like this:

string *s3 = new string(*s2);

s1 and s2 have to be fully constructed; otherwise these fragments of
code would invoke undefined behavior. Now there are three possibilities:

(1) s1 and s2 are two distinct objects that refer to distinct pieces of data.

(2) s1 and s2 are two distinct objects that share the same piece of data.

(3) s1 and s2 are the same object.

In case (1), different reference counts are accessed, so there is no
problem.

In case (2), it follows from the invariant that the reference count is
at least 2, so it cannot become 0 after the decrementing.

In case (3), one thread modifies an object, while another thread
attempts to access the same object at the same time. This case exposes
a concurrency bug, plain and simple, which bites no matter whether
strings share data or not. At the very minimum, the string needs to be
protected by a mutex.

The arguments for the copy-on-write operations are analogous.

Regards
- Christian

Wil Evers

unread,
Sep 8, 2002, 10:57:26 AM9/8/02
to
In article <al7hb0$nro$1...@lust.ihug.co.nz>, Ross Smith wrote:

> Wil Evers wrote:

>> The test case I used consisted of ten million calls to a function that
>> returns a copy of a statically allocated smart_string<Rep> of length
>> 1. This resulted in the following execution times (Linux, g++-3.2,
>> glibc 2.1.3, PentiumII 350 Mhz):
>>
>> Rep class used without -O2 with -O2
>>
>> nonshared_rep 24.247s 21.382s
>> shared_rep<no_mutex> 2.831s 0.540s
>> shared_rep<pthread_mutex> 13.058s 10.284s
>
> It would be interesting to add the results for an unprotected _but
> guaranteed atomic_ reference count (i.e. thread safe with no mutex).
> This is easy on Linux -- see man atomic_add.

To be able to handle that, I changed the template parameter type for
shared_rep; it now takes a reference count implementation type:

unprotected_rc thread-unaware
mutexed_rc uses pthread_mutex_*()
atomic_rc uses ops in <bits/atomicity.h>

Here are the results:

Rep class used without -O2 with -O2

nonshared_rep 24.846s 21.846s
shared_rep<unprotected_rc> 2.512s 0.546s
shared_rep<mutexed_rc> 12.477s 10.258s
shared_rep<atomic_rc> 4.883s 2.222s

As expected, the Intel- and gcc-specific hacks in gcc's <bits/atomicity.h>
are significantly faster than the official pthread_mutex_*() facilities,
but still significantly slower than thread-unaware reference counting.
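
For reference, atomic_rc is little more than this (gcc-specific:
<bits/atomicity.h> is an internal header, these entry points are not part
of any standard, and they may move or change in later releases):

    #include <bits/atomicity.h> // gcc-internal: _Atomic_word,
                                // __atomic_add, __exchange_and_add

    struct atomic_rc {
        _Atomic_word count;
        atomic_rc() : count(1) {}
        void acquire() { __atomic_add(&count, 1); }
        bool release() // true => the caller dropped the last reference
        {
            // __exchange_and_add returns the *previous* value, so a
            // return of 1 means our decrement took the count to zero
            return __exchange_and_add(&count, -1) == 1;
        }
    };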

Regards,

- Wil

--
Wil Evers, DOOSYS R&D, Utrecht, Holland
[wil underscore evers at doosys dot com]

Wil Evers

unread,
Sep 9, 2002, 9:16:04 AM9/9/02
to
In article <d6651fb6.02090...@posting.google.com>, James Kanze wrote:

> You haven't given the critical figure: how many times, on the average,
> must a string be copied before reference counting is a win? While your
> test is interesting, I don't know of many programs which make ten
> million copies of a single string.

Let me try to answer that. Since this morning, I use four different rep
classes for my smart_string<Rep> template tests:

nonshared_rep no COW, just clones itself
shared_rep<unprotected_rc> COW, thread-unaware rc ops
shared_rep<mutexed_rc> COW, uses pthread_mutex_*() for locking
shared_rep<atomic_rc> COW, uses ops in gcc's <bits/atomicity.h>

Test 1 - construct and destroy a single smart_string<Rep> object from a
C-style string of length 1 ten million times (Linux-2.4.18, single
processor 350 MHz Pentium II, g++-3.2, glibc 2.1.3; the numbers in
parentheses are running times relative to nonshared_rep). Results:

Rep class used without -O2 with -O2

nonshared_rep 24.504s (1.00) 21.649s (1.00)
shared_rep<unprotected_rc> 24.867s (1.01) 21.815s (1.01)
shared_rep<mutexed_rc> 31.614s (1.29) 29.233s (1.35)
shared_rep<atomic_rc> 25.525s (1.04) 22.315s (1.03)

Test 2 - call a function that returns a copy of a smart_string<Rep> from a
statically allocated instance of length 1 ten million times (same
platform). This is a slightly modified version of the tests I posted
about before. Results:

Rep class used without -O2 with -O2

nonshared_rep 24.530s (1.00) 21.438s (1.00)
shared_rep<unprotected_rc> 2.502s (0.10) 0.545s (0.03)
shared_rep<mutexed_rc> 12.428s (0.50) 10.371s (0.48)
shared_rep<atomic_rc> 4.818s (0.20) 2.226s (0.10)

The usual disclaimers apply: these numbers are highly platform-, compiler-
and hardware-specific, so YMMV. Furthermore, I'm ignoring the impact of
COW on the performance of modifications to the rep's embedded character
buffer. Here I go anyway:

(*) depending on the selected COW implementation, constructing a
smart_string object from a C-style string (test 1) is between 1 and 35%
slower with COW than it is without COW.
(*) depending on the selected COW implementation, returning a copy of a
smart_string object (test 2) with COW saves us between 50 and 97% of the
time required without COW.

For nonshared_rep, the C-style string ctor (test 1) essentially does the
same thing as the copy ctor (test 2); they both create a fresh rep object
on the heap, and the numbers above appear to reflect this.

So, getting back to your question: in these tests, COW always yields a
performance benefit for each string object that is ever copied, even if
copied only once. Obviously, if the string is not copied at all, COW is
more expensive than non-COW.

[big snip]

>> However, the numbers above do indicate that a single,
>> one-size-fits-all standard string class is not the answer, especially
>> for multi-threaded applications. Of course, JKB's suggestion that
>> most strings are never copied between threads is correct. The
>> potential performance benefits of the ability to specify which sharing
>> policy to use are simply too big to be ignored.
>
> As are the benefits of an intelligent interface to string, which doesn't
> allow modification (except for assignment operators)

I'm not so sure about that. A few months ago, after reading one of your
previous posts, I implemented such a beast. When it was done, I realized
I could add the usual modifying operations without changing the
implementation of the const member functions at all. It seems the only
benefit of not allowing modifications is that it eliminates some of the
performance surprises of COW.

> and above all,
> which doesn't allow modification through external references or
> iterators (which inhibits copying in many cases even when no
> modification is intented).

That's an essential prerequisite for what I just said, so no arguments
here, of course.

Regards,

- Wil

--
Wil Evers, DOOSYS R&D, Utrecht, Holland
[wil underscore evers at doosys dot com]

Mark Charsley

unread,
Sep 9, 2002, 9:46:22 PM9/9/02
to
In article <3D774CA9...@web.de>, tere...@web.de (Alexander
Terekhov) wrote:

>
> Wil Evers wrote:
> >
> > In article <un1p1kf...@corp.supernews.com>, JKB wrote:
> >
> > > Aren't most strings used entirely within one thread? And isn't that
> > > basically Herb's point in the GoTW article - that COW implies paying
> > > mutex prices on strings that don't need it? That, plus the annoying
> > > fact that mutexes turn out to be expensive.
>
> Rather: when there is almost no contention for them, *Microsoft
> Windows/Win32* mutexes turn out to be annoyingly expensive. So

At the risk of getting too platform specific for the mods...

Mutexes are inter-process locks on Win32. Unless you really want to
share strings between different processes, a critical section
(inter-thread lock) will be significantly faster than a mutex -
especially if it's not contended.
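
In code, the difference is just this (Win32; error handling omitted):

    #include <windows.h>

    void with_critical_section()
    {
        // process-local; cheap when there is no contention (no kernel
        // transition on the uncontended fast path)
        CRITICAL_SECTION cs;
        InitializeCriticalSection(&cs);
        EnterCriticalSection(&cs);
        // ... touch the reference count ...
        LeaveCriticalSection(&cs);
        DeleteCriticalSection(&cs);
    }

    void with_mutex()
    {
        // a kernel object: it can synchronize across processes, but
        // every lock and unlock is a round trip into the kernel
        HANDLE m = CreateMutex(0, FALSE, 0);
        WaitForSingleObject(m, INFINITE);
        // ... touch the reference count ...
        ReleaseMutex(m);
        CloseHandle(m);
    }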

--

Mark - personal opinion only, could well be wrong, not representing
company, don't sue us etc. etc.

James Kanze

unread,
Sep 9, 2002, 9:51:15 PM9/9/02
to
Wil Evers <bou...@dev.null> wrote in message
news:<3d7af4af$0$12321$e4fe...@dreader3.news.xs4all.nl>...

> > Wil Evers wrote:

> Here are the results:

And do be aware that the gcc-specific hacks do not work for all
processors. The Sparc implementation (at least in 3.1), for example,
deadlocks in specific cases. (It will sort of work if all of the
threads use the default parameters, but even then, it can be very
inefficient if there is serious contention for a single counter.)

--
James Kanze mailto:jka...@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Hans Boehm

unread,
Sep 10, 2002, 6:02:00 AM9/10/02
to
"Wil Evers" <bou...@dev.null> wrote in message
news:3d7bb637$0$12357$e4fe...@dreader3.news.xs4all.nl...

> In article <d6651fb6.02090...@posting.google.com>, James Kanze wrote:
>
> Let me try to answer that. Since this morning, I use four different rep
> classes for my smart_string<Rep> template tests:
>
> nonshared_rep no COW, just clones itself
> shared_rep<unprotected_rc> COW, thread-unaware rc ops
> shared_rep<mutexed_rc> COW, uses pthread_mutex_*() for locking
> shared_rep<atomic_rc> COW, uses ops in gcc's <bits/atomicity.h>
>
How about a nonshared representation that includes a 16-byte buffer (for
some value of 16 :-) ) for short strings? A string object would still
contain a pointer to the array of characters in the representation. But
for short strings it would simply point at the array in the string object
itself. Thus a copy of a short string would just copy the object (including
the extra 16 character buffer), fixing up the pointer in the process. For a
longer string, you'd just copy the object (excluding the buffer) and
allocate and copy the separately allocated character array, roughly as the
nonshared representation does now.

You then end up with a representation in which:

1) Strings may be a bit larger. Or maybe not. In most cases you save a
malloc header. In some cases, you save a pthread_mutex_t, which is
likely to be larger than the buffer, at least in 64-bit environments.

2) Short strings can be copied with no memory allocation or locking. I
conjecture this can be done roughly at the same speed as the COW version
without locking, but the operation is thread-safe. (It takes 2 extra 64-bit
loads plus 2 stores to do the copy. But you save the reference count
operations.) I think it has been argued in the past that the large majority
of C++ strings are short. (I suspect this is highly application dependent,
but probably true in most cases. And if most strings are long, you should
probably be using something like ropes anyway.)

3) Long strings take roughly the same amount of time to copy as they do in
the plain nonshared_rep. (There is an extra test, but that can be
overlapped with the copy of the string object.)

4) You get clean multithreaded semantics in which the client never has to
lock to prevent concurrent reads, etc.
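
In sketch form (the names are invented; capacity handling, assignment and
the actual string operations are left out):

    #include <cstring>

    class sso_string {
        enum { BUF = 16 };
        char*       ptr; // points at buf for short strings, heap otherwise
        std::size_t len;
        char        buf[BUF];
    public:
        sso_string(const sso_string& o) : len(o.len) {
            if (o.ptr == o.buf) {             // short: copy the object's
                std::memcpy(buf, o.buf, BUF); // own buffer - no allocation,
                ptr = buf;                    // no locking - and fix up
            }                                 // the self-pointer
            else {
                ptr = new char[len + 1];      // long: same work as the
                std::memcpy(ptr, o.ptr, len + 1); // plain nonshared rep
            }
        }
        ~sso_string() { if (ptr != buf) delete [] ptr; }
    };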

Hans .

P.J. Plauger

unread,
Sep 10, 2002, 5:55:35 PM9/10/02
to
"Hans Boehm" <Hans_...@hp.com> wrote in message news:alij3b$d52$1...@hplms2.hpl.hp.com...

> How about a nonshared representation that includes a 16-byte buffer (for
> some value of 16 :-) ) for short strings? A string object would still
> contain a pointer to the array of characters in the representation. But
> for short strings it would simply point at the array in the string object
> itself. Thus a copy of a short string would just copy the object (including
> the extra 16 character buffer), fixing up the pointer in the process. For a
> longer string, you'd just copy the object (excluding the buffer) and
> allocate and copy the separately allocated character array, roughly as the
> nonshared representation does now.

This is almost exactly the implementation we've used for the past several
years. (You'll find strings like these in VC++ V7, aka .NET, in our upgrade
library for VC++ V6, and in our recently released Dinkum Unabridged Library.)
The only difference is that we overlap the pointer with the buffer to save a
little space in the string object.

> You then end up with a representation in which:
>
> 1) Strings may be a bit larger. Or maybe not. In most cases you save a
> malloc header. In some cases, you save a pthread_mutex_lock, which is
> likely to be larger than the buffer, at least in 64-bit environments.
>
> 2) Short strings can be copied with no memory allocation or locking. I
> conjecture this can be done roughly at the same speed as the COW version
> without locking, but the operation is thread-safe. (It takes 2 extra 64-bit
> loads plus 2 stores to do the copy. But you save the reference count
> operations.) I think it has been argued in the past that the large majority
> of C++ strings are short. (I suspect this is highly application dependent,
> but probably true in most cases. And if most strings are long, you should
> probably be using something like ropes anyway.)
>
> 3) Long strings take roughly the same amount of time to copy as they do in
> the plain nonshared_rep. (There is an extra test, but that can be
> overlapped with the copy of the string object.)
>
> 4) You get clean multithreaded semantics in which the client never has to
> lock to prevent concurrent reads, etc.

Yes, these are the properties we've seen in most common usages of string
objects.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Hans Bos

unread,
Sep 11, 2002, 8:23:46 AM9/11/02
to
"Hans Boehm" <Hans_...@hp.com> wrote in message
news:alij3b$d52$1...@hplms2.hpl.hp.com...
> How about a nonshared representation that includes a 16-byte buffer (for
> some value of 16 :-) ) for short strings? A string object would still
> contain a pointer to the array of characters in the representation. But
> for short strings it would simply point at the array in the string object
> itself. Thus a copy of a short string would just copy the object (including
> the extra 16 character buffer), fixing up the pointer in the process. For a
> longer string, you'd just copy the object (excluding the buffer) and
> allocate and copy the separately allocated character array, roughly as the
> nonshared representation does now.
>

Of course you can use the same trick for reference counted strings.
In that case the behaviour of the reference counted strings will be exactly
the same as the nonshared case for strings with length 16 or less.
Now only longer strings will be reference counted.

As I learned from previous postings in this group, the big win for reference
counted strings is that no copy is needed for the strings.
So when you compare refcounted strings with nonrefcounted strings with this
short string optimization (for both versions), the reference counted strings
will be relatively faster (compared to the situation without the short string
optimization), because you are now comparing situations with longer strings.

Of course it still depends on the application and the system which string
version will be faster.

Unfortunately, in my application strings larger than 16 characters are not
uncommon :-(.

Greetings,
Hans Bos.

James Kanze

unread,
Sep 11, 2002, 2:49:19 PM9/11/02
to
Wil Evers <bou...@dev.null> wrote in message
news:<3d7bb637$0$12357$e4fe...@dreader3.news.xs4all.nl>...

> In article <d6651fb6.02090...@posting.google.com>, James
> Kanze wrote:

> > You haven't given the critical figure: how many times, on the
> > average, must a string be copied before reference counting is a win?
> > While your test is interesting, I don't know of many programs which
> > make ten million copies of a single string.

> Let me try to answer that. Since this morning, I use four different
> rep classes for my smart_string<Rep> template tests:

> nonshared_rep no COW, just clones itself
> shared_rep<unprotected_rc> COW, thread-unaware rc ops
> shared_rep<mutexed_rc> COW, uses pthread_mutex_*() for locking
> shared_rep<atomic_rc> COW, uses ops in gcc's <bits/atomicity.h>

> Test 1 - construct and destroy a single smart_string<Rep> object from
> a C-style string of length 1 ten million times (Linux-2.4.18, single
> processor 350 MHz Pentium II, g++-3.2, glibc 2.1.3; the numbers in
> parentheses are running times relative to nonshared_rep). Results:

> Rep class used without -O2 with -O2
>
> nonshared_rep 24.504s (1.00) 21.649s (1.00)
> shared_rep<unprotected_rc> 24.867s (1.01) 21.815s (1.01)
> shared_rep<mutexed_rc> 31.614s (1.29) 29.233s (1.35)
> shared_rep<atomic_rc> 25.525s (1.04) 22.315s (1.03)

> Test 2 - call a function that returns a copy of a smart_string<Rep>
> from a statically allocated instance of length 1 ten million times
> (same platform).

A statically allocated instance of smart_string<Rep>, or a statically
allocated C-style string? If the first, this is really no different
than your preceding case.

I think that it is generally true that using a shared representation
always reduces the cost of the copy constructor. It is also true that
it increases the cost of most other non-const operations, and of the
non-copy constructors. In a single-threaded environment, this increase in
cost is normally very low; in a multithreaded environment, it is likely
to involve acquiring a mutex lock, at least for the non-const operations
other than the constructor.

In his "C++ Strategies and Tactics", Robert Murray finds a win using COW
after about two copies, if I remember right. This is, of course, for
his environment, with his string class (this is pre-standard), and
without threads or locks.

A lot depends on how strings are used. In my own code, I generally
construct a string (using a constructor, either directly or through
functions like operator+, which must also use a constructor, since they
return a new string), then pass it around. Since the reference counting
doesn't need a lock in the constructor, and I call no other non-const
functions, I suspect that COW is a real win for me, and would still be a
win with threads.

> [big snip]

> >> However, the numbers above do indicate that a single,
> >> one-size-fits-all standard string class is not the answer,
> >> especially for multi-threaded applications. Of course, JKB's
> >> suggestion that most strings are never copied between threads is
> >> correct. The potential performance benefits of the ability to
> >> specify which sharing policy to use are simply too big to be
> >> ignored.

> > As are the benefits of an intelligent interface to string, which
> > doesn't allow modification (except for assignment operators)

> I'm not so sure about that. A few months ago, after reading one of
> your previous posts, I implemented such a beast. When it was done, I
> realized I could add the usual modifying operations without changing
> the implementation of the const member functions at all.

True. I should have been more precise, since the non-const member
functions don't have to cost anything if you don't use them:-).

> It seems the only benefit of not allowing modifications is that it
> eliminates some of the performance surprises of COW.

> > and above all, which doesn't allow modification through external
> > references or iterators (which inhibits copying in many cases even
> > when no modification is intended).

> That's an essential prerequisite for what I just said, so no arguments
> here, of course.

Note, however, that it doesn't allow a non-const operator[] (unless you
use proxies). Although I have yet to find a use for such a beast, it
seems popular with others.

With std::string, of course, it is probably the iterators which kill
you, although in general, the fact that non-const operator[] exists, and
that it will be used if the actual string object is not const, even if
no modification is intended, will also play a role.

--
James Kanze mailto:jka...@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Francis Glassborow

unread,
Sep 11, 2002, 2:56:42 PM9/11/02
to
In article <3d7e5c6b$0$13215$e4fe...@dreader4.news.xs4all.nl>, Hans Bos

<hans...@xelion.nl> writes


>Of course you can use the same trick for reference counted strings.
>In that case the behaviour of the reference counted strings will be
>exactly the same as the nonshared case for strings with length 16 or less.
>Now only longer strings will be reference counted.

Not really, every mutating and potentially mutating member has to check
the type of representation. And, yes, it has to lock first because some
other thread may be about to change it from short enough to be non shared
to long enough to be ref counted.

--
Francis Glassborow ACCU
64 Southfield Rd
Oxford OX4 1PA +44(0)1865 246490
All opinions are mine and do not represent those of any organisation

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Bill Wade

unread,
Sep 12, 2002, 8:41:57 AM9/12/02
to

"Francis Glassborow" <francis.g...@ntlworld.com> wrote

> Not really, every mutating and potentially mutating member has to check
> the type of representation.

Yes.

> And, yes, it has to lock first

No. The representation check does not need to be locked. It will be made
by looking entirely at the portion of the string which is in the unshared
body of the class, and so never shared except when external locking is in
place.

You wouldn't want to have to look on the heap to determine if the heap were
being used ;-)

> because some
> other thread may be about to change it from short enough to be non shared
> to long enough to be ref counted.

If other threads are doing this, even the non-COW string is in trouble. It
means you are sharing a single logical object without external locks.
Non-COW does not begin to be sufficient in that case.

Hans Bos

unread,
Sep 12, 2002, 8:44:38 AM9/12/02
to
"Francis Glassborow" <francis.g...@ntlworld.com> wrote in message
news:NXsGRdXt...@robinton.demon.co.uk...

> In article <3d7e5c6b$0$13215$e4fe...@dreader4.news.xs4all.nl>, Hans Bos
>
> <hans...@xelion.nl> writes
> >Of course you can use the same trick for reference counted strings.
> >In that case the behaviour of the reference counted strings will be
> >exactly the same as the nonshared case for strings with length 16 or less.
> >Now only longer strings will be reference counted.
>
> Not really, every mutating and potentially mutating member has to check
> the type of representation. And, yes, it has to lock first because some
> other thread may be about to change it from short enough to be non shared
> to long enough to be ref counted.
>

I don't see this.

Suppose a string class has a member "bool shared", that is true if a
reference counted representation is used and false if the string is stored
locally.
This member is not shared between strings and only changes when the size of
the string gets larger: when the string is actively modified (i.e. not
merely copied).

I assume here, that a string is not accessed in another thread while the
length may change. So when you call string::erase, string::insert,
string::resize, etc, you must make sure that you don't read the string in
another thread.
If you don't assume this, you have to protect all accesses with a lock even
for non reference counted strings (otherwise one thread could be reading
memory that is being deleted inside another thread).

The "shared" member only changes when the string is being modified (grows)
and then no other threads access the string.
So you can inspect the "shared" member without a lock, and if "shared" is
false you know that the object has a private copy of the string, so no lock
is needed to access or change the privately stored string.
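
In code, the idea (ignoring the local short-string buffer itself) might
look like this - invented names, with an assumed atomic-decrement
primitive standing in for whatever the platform provides:

#include <string.h>
#include <stddef.h>

// Assumed primitive, like the IntAtomic* ops discussed elsewhere in
// this thread: atomically decrements v and returns the new value.
extern long atomic_decrement( long& v );

struct string_rep {
    long   refs;
    size_t len;
    char*  chars;
};

class hybrid_string {
    bool        shared_;  // false => rep_ is private to this object
    string_rep* rep_;

    void unshare() {      // deep-copy the rep, drop the old reference
        string_rep* r = new string_rep;
        r->refs  = 1;
        r->len   = rep_->len;
        r->chars = new char[r->len];
        memcpy( r->chars, rep_->chars, r->len );
        if ( 0 == atomic_decrement( rep_->refs ) ) {
            delete[] rep_->chars;
            delete rep_;
        }
        rep_ = r;
        shared_ = false;
    }
public:
    void set_char( size_t i, char c ) {
        if ( shared_ )       // unlocked test: shared_ only changes while
            unshare();       // this thread mutates the string, and the
                             // caller already keeps readers away then
        rep_->chars[i] = c;  // rep_ is private now: no lock needed
    }
};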

Regards,
Hans.

Hans Boehm

unread,
Sep 12, 2002, 8:49:24 AM9/12/02
to
"Hans Bos" <hans...@xelion.nl> wrote in message
news:3d7e5c6b$0$13215$e4fe...@dreader4.news.xs4all.nl...

> "Hans Boehm" <Hans_...@hp.com> wrote in message
> news:alij3b$d52$1...@hplms2.hpl.hp.com...
> > How about a nonshared representation that includes a 16-byte buffer (for
> > some value of 16 :-) ) for short strings ...

>
> Of course you can use the same trick for reference counted strings.
> In that case the behaviour of the reference counted strings will be
> exactly the same as the nonshared case for strings with length 16 or less.
> Now only longer strings will be reference counted.

Right. But it doesn't get you back the clean semantics you get in the
nonshared case.

>
> As I learned from previous postings in this group, the big win for
> reference counted strings is that no copy is needed for the strings. ...

Right, so long as you don't apply non-const operator[] or the like to look
at the string. As was discussed earlier, a lot of the tradeoff here is
between atomic (or locked) reference count updates and locked memory
allocation. I think it's actually quite possible to avoid locking on every
memory allocation. Java implementations generally don't. They instead keep
per-thread allocation regions. Standard C/C++ malloc/new implementations
generally lock, though I'm not completely sure why. Explicit deallocation
makes things a bit harder, but not that much. (Or so I claim, having only
implemented the garbage-collected version.)

In my mind, the main down sides of reference counting are the messy
invalidation/concurrency semantics and the fact that I don't know of a
clean (any?) set of rules that tells me when client code is correct, even
for single-threaded code. There is one attempt at this in the 5th paragraph
of 21.3 in (my slightly dated version of) the standard. I'm still not sure
what this paragraph means (does "any of the above" include operator[]? is
"s[0] == s[1] && s[0] == c" legal?), or whether there exists a
reference-counted implementation for which it says the right thing. I am
sure that anyone who actually remembers these rules while programming has a
far better memory than I do. See the thread entitled "Why basic_string !=
vector ? " in comp.std.c++ around Mar 1997 for more discussion. There is
also a nontrivial one in Jan. 1999 entitled "basic_string: s[0] = s[1]
legal?".

Thus I would conclude that if we uniformly apply the short string
optimization, we get two implementations which are very similar in
performance for the kind of input (short strings) for which they were really
designed. But I know how to write reliable code with only one of them,
making it fairly clear which one I would advocate ...

Hans

Alexander Terekhov

unread,
Sep 12, 2002, 8:49:47 AM9/12/02
to

Francis Glassborow wrote:
>
> In article <3d7e5c6b$0$13215$e4fe...@dreader4.news.xs4all.nl>, Hans Bos
>
> <hans...@xelion.nl> writes
> >Of course you can use the same trick for reference counted strings.
> >In that case the behaviour of the reference counted strings will be
> exactly
> >the same as the nonshared case for strings with length 16 or less.
> >Now only longer strings will be reference counted.
>
> Not really, every mutating and potentially mutating member has to check
> the type of representation.

Uhmm. Every "mutating and potentially mutating member" (assignments
aside) has to UNSHARE a shared representation.

> And, yes, it has to lock first because some
> other thread may be about to change it from short enough to be non shared
> to long enough to be ref counted.

Uhmm. Mutations ought to be synchronized by CALLERS, I guess.

regards,
alexander.

--
inline void String::Append( char c ) {
    switch ( IntAtomicGet( data_->refs ) ) {
    case 0: data_->refs = 1; // mutation -> 'shareable'
    case 1: data_->Reserve( data_->used+1 ); break;
    default:
        StringBuf* newdata =
            new StringBuf( AppendStringBufPtr( data_ ) );
        if( 0 == IntAtomicDecrement( data_->refs ) )
            delete data_;
        data_ = newdata;
    }
    data_->buf[data_->used++] = c;
}

inline char& String::operator[]( size_t n ) {
    switch ( IntAtomicGet( data_->refs ) ) {
    case 1: data_->refs = 0; // 'unshareable'
    case 0: break;
    default:
        StringBuf* newdata =
            new StringBuf( CharRefStringBufPtr( data_ ) );
        if( 0 == IntAtomicDecrement( data_->refs ) )
            delete data_;
        data_ = newdata;
    }
    return data_->buf[n];
}

Wil Evers

unread,
Sep 12, 2002, 2:20:18 PM9/12/02
to
In article <NXsGRdXt...@robinton.demon.co.uk>, Francis Glassborow
wrote:

> In article <3d7e5c6b$0$13215$e4fe...@dreader4.news.xs4all.nl>, Hans Bos
>
> <hans...@xelion.nl> writes
>>Of course you can use the same trick for reference counted strings.
>>In that case the behaviour of the reference counted strings will be
>>exactly the same as the nonshared case for strings with length 16 or less.
>>Now only longer strings will be reference counted.
>
> Not really, every mutating and potentially mutating member has to check
> the type of representation. And, yes, it has to lock first because some
> other thread may be about to change it from short enough to be non shared
> to long enough to be ref counted.

That would depend on which lock we're talking about. In general, *any*
single object that is potentially modified from one thread while being
accessed from another thread must be locked. This is true for all string
class implementations discussed in this thread, including the 'short
strings embedded, longer strings on the heap, but never shared'
implementation Hans Boehm (as opposed to Hans Bos) described. Current
wisdom is that in such cases, it is up to the user to supply the necessary
locking; putting the locking in the class implementation would
unnecessarily slow it down for instances that are not accessed from
multiple threads.

The problem with COW is that the representation object may be shared
between threads, even when the string objects holding on to it are not,
which is why the rep object must use its own lock to protect its reference
count (or otherwise make sure it is handled atomically). The contract
between the string object and the rep object says that the string object
can only modify the rep object when it is not shared; otherwise, it must
create its own, separate rep.

Getting back to the scenario you describe: a string object that is about to
modify its *embedded* buffer can simply go ahead and do so without worrying
about threads at all. If some other thread concurrently changes the
string's length, then this is caused by the user, who failed to supply the
external locking. This is also true if we have a non-shared external rep
(as suggested by Hans Boehm) for a string that is too long to fit in the
embedded buffer.

If we have an external buffer managed by COW (as suggested by Hans Bos),
the string object must first (atomically) check the reference count. If
the reference count is 1, the rep is not shared, so it can be modified
without further locking worries. The reference count can only go up when
the string object is copied or assigned to another string object; if this
happens while the rep is being modified, then this is again due to the
user, who failed to stop other threads from accessing the string object
while a modifying operation is in progress.

Regards,

- Wil

--
Wil Evers, DOOSYS R&D, Utrecht, Holland
[wil underscore evers at doosys dot com]

Wil Evers

unread,
Sep 12, 2002, 8:37:09 PM9/12/02
to
In article <d6651fb6.02091...@posting.google.com>, James Kanze
wrote:

> Wil Evers <bou...@dev.null> wrote in message
> news:<3d7bb637$0$12357$e4fe...@dreader3.news.xs4all.nl>...

>> nonshared_rep no COW, just clones itself


>> shared_rep<unprotected_rc> COW, thread-unaware rc ops
>> shared_rep<mutexed_rc> COW, uses pthread_mutex_*() for locking
>> shared_rep<atomic_rc> COW, uses ops in gcc's <bits/atomicity.h>
>
>> Test 1 - construct and destroy a single smart_string<Rep> object from
>> a C-style string of length 1 ten million times (Linux-2.4.18, single
>> processor 350 MHz Pentium II, g++-3.2, glibc 2.1.3; the numbers in
>> parentheses are running times relative to nonshared_rep). Results:
>
>> Rep class used without -O2 with -O2
>>
>> nonshared_rep 24.504s (1.00) 21.649s (1.00)
>> shared_rep<unprotected_rc> 24.867s (1.01) 21.815s (1.01)
>> shared_rep<mutexed_rc> 31.614s (1.29) 29.233s (1.35)
>> shared_rep<atomic_rc> 25.525s (1.04) 22.315s (1.03)
>
>> Test 2 - call a function that returns a copy of a smart_string<Rep>
>> from a statically allocated instance of length 1 ten million times
>> (same platform).
>
> A statically allocated instance of smart_string<Rep>, or a statically
> allocated C-style string. If the first, this is really no different
> than your preceding case.

I presented two tests; one constructs and destroys a string object from a
statically allocated C-style string, the other one calls a function that
returns a copy of a statically allocated string object. The part you
quoted only contains the results for the first test; the second test is
almost the same as some previous tests I posted about.

[snip]

> In his "C++ Strategies and Tactics", Robert Murray finds a win using COW
> after about two copies, if I remember right. This is, of course, for
> his environment, with his string class (this is pre-standard), and
> without threads or locks.

On my system, the test results indicate that just making a single copy of a
string object is more than sufficient to make up for the initial extra
overhead of creating a representation object that supports COW. This is
for the slowest (MT-safe, pthreads-portable) COW rep class:
shared_rep<mutexed_rc>. The other (less safe, less portable) COW
implementations perform even better.

[snip]

>> > and above all, which doesn't allow modification through external
>> > references or iterators (which inhibits copying in many cases even
>> > when no modification is intended).
>
>> That's an essential prerequisite for what I just said, so no arguments
>> here, of course.
>
> Note, however, that it doesn't allow a non-const operator[] (unless you
> use proxies). Although I have yet to find a use for such a beast, it
> seems popular with others.
>
> With std::string, of course, it is probably the iterators which kill
> you, although in general, the fact that non-const operator[] exists, and
> that it will be used if the actual string object is not const, even if
> no modification is intended, will also play a role.

I don't think the iterators themselves are the problem; it's the
requirements the standard imposes on the return type of operator* applied
to a non-const iterator. And, as you say, the return type of the
non-const operator[] on a string object, which is probably the same thing.

While it would certainly help if these requirements were relaxed to the
point where returning a proxy object would be conforming, I guess my tests
show that it would help even more if users were able to specify the
representation class to use on a case-by-case basis.
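
For what it's worth, such a proxy might look roughly like this - a
sketch with invented names, using a plain std::string as a stand-in for
the COW rep:

#include <stddef.h>
#include <string>

class cow_string {
    std::string data_;   // stand-in for the real COW rep
public:
    cow_string( const char* s ) : data_( s ) {}

    char read_char( size_t i ) const { return data_[i]; }
    void write_char( size_t i, char c ) { /* unshare, then: */ data_[i] = c; }

    class proxy {
        cow_string& s_;
        size_t      i_;
    public:
        proxy( cow_string& s, size_t i ) : s_( s ), i_( i ) {}
        operator char() const                   // read: never unshares
            { return s_.read_char( i_ ); }
        proxy& operator=( char c )              // write: pays for COW
            { s_.write_char( i_, c ); return *this; }
    };

    proxy operator[]( size_t i ) { return proxy( *this, i ); }
    char  operator[]( size_t i ) const { return data_[i]; }
};

Reads through the proxy convert to char without unsharing anything;
only assignment pays the copy-on-write price. The obvious catch is that
&s[0] no longer yields a char*, which is exactly what the current
requirements rule out.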

Regards,

- Wil

--
Wil Evers, DOOSYS R&D, Utrecht, Holland
[wil underscore evers at doosys dot com]

Hans Bos

unread,
Sep 12, 2002, 8:48:47 PM9/12/02
to
"Hans Boehm" <Hans_...@hp.com> wrote in message
news:alo88o$hvg$1...@hplms2.hpl.hp.com...

> "Hans Bos" <hans...@xelion.nl> wrote in message
> news:3d7e5c6b$0$13215$e4fe...@dreader4.news.xs4all.nl...
> > "Hans Boehm" <Hans_...@hp.com> wrote in message
> > news:alij3b$d52$1...@hplms2.hpl.hp.com...
> > > How about a nonshared representation that includes a 16-byte buffer
> (for
> > > some value of 16 :-) ) for short strings ...
> >
> > Of course you can use the same trick for reference counted strings.
> >

You can take this a bit further.

You can store short strings directly in the string class.
Medium sized strings use a nonshared allocated buffer.
Long strings use reference counting.

By tuning the different sizes you can get an optimal string class.
This will finally end the reference-count/non-reference-count discussions
;-)

> Right, so long as you don't apply non-const operator[] or the like to look
> at the string. As was discussed earlier, a lot of the tradeoff here is
> between atomic (or locked) reference count updates and locked memory
> allocation. I think it's actually quite possible to avoid locking on every
> memory allocation. Java implementations generally don't. They instead keep
> per-thread allocation regions. Standard C/C++ malloc/new implementations
> generally lock, though I'm not completely sure why. Explicit deallocation
> makes things a bit harder, but not that much. (Or so I claim, having only
> implemented the garbage-collected version.)

Of course this comes with a price. In effect each thread has its own heap,
so more memory will be requested from the operating system. And I think the
deletion process is more complicated, because you must support deallocating
from another thread.

And if the string gets large enough, copying a string will take longer than
locking it.

>
> In my mind, the main down sides of reference counting are the messy
> invalidation/concurrency semantics and the fact that I don't know of a
> clean (any?) set of rules that tells me when client code is correct, even
> for single-threaded code. There is one attempt at this in the 5th paragraph
> of 21.3 in (my slightly dated version of) the standard. I'm still not sure
> what this paragraph means (does "any of the above" include operator[]? is
> "s[0] == s[1] && s[0] == c" legal?), or whether there exists a
> reference-counted implementation for which it says the right thing. I am
> sure that anyone who actually remembers these rules while programming has a
> far better memory than I do

There are some problems with the interface of std::string. Since operator[]
and the iterator functions return raw pointers (references) into the inner
parts of the string, you should be careful when using these functions.

I think it is better to encapsulate the string class in a class of your own.
E.g. you can define your own operator[] that returns a character and not a
reference (and calls the const version of operator[] on std::string).
With the encapsulation you can hide most surprises in std::string and you
don't have to remember the complex rules.
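
For example, a minimal sketch of such a wrapper (the name and the
choice of forwarded members are invented):

#include <stddef.h>
#include <string>

// operator[] returns a char by value, so no reference into the rep
// ever escapes; since the member is const, the const overload of
// std::string::operator[] is selected and nothing is ever unshared.
class safe_string {
    std::string s_;
public:
    safe_string() {}
    safe_string( const char* s ) : s_( s ) {}

    char operator[]( size_t i ) const { return s_[i]; }
    size_t length() const { return s_.length(); }
    void append( char c ) { s_ += c; }
    const char* c_str() const { return s_.c_str(); }
    // forward whatever else is needed, always by value or const reference
};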

Wil Evers

unread,
Sep 13, 2002, 7:24:22 AM9/13/02
to
In article <alo88o$hvg$1...@hplms2.hpl.hp.com>, Hans Boehm wrote:

> I think it's actually quite possible to avoid locking on every
> memory allocation. Java implementations generally don't. They
> instead keep per-thread allocation regions. Standard C/C++
> malloc/new implementations generally lock, though I'm not completely
> sure why. Explicit deallocation makes things a bit harder, but not
> that much. (Or so I claim, having only implemented the
> garbage-collected version.)

That sounds interesting. Any pointers to how this is implemented?

Thanks,

- Wil

--
Wil Evers, DOOSYS R&D, Utrecht, Holland
[wil underscore evers at doosys dot com]

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Rob

unread,
Sep 13, 2002, 7:27:10 AM9/13/02
to
"Alexander Terekhov" <tere...@web.de> wrote in message
news:3D7FA479...@web.de...

>
> > And, yes, it has to lock first because some
> > other thread may be about to change it from short enough to be non shared
> > to long enough to be ref counted.
>
> Uhmm. Mutations ought to be synchronized by CALLERS, I guess.
>

I disagree. If a string implementation chooses to share internal
representation among individual string instances, it is up to the string
implementation to keep track of that. That's true whether we're playing
with multiple threads or not.

In a multi-threaded environment, the caller is responsible for
synchronisation if multiple threads are able to access the *same* string
object. If the representation is shared under the covers between string
objects, the caller has no realistic way of preventing a collision between
two threads that are actually accessing different string objects. [It is
possible, but it makes the callers responsible for checking the actual
implementation details of string objects; at the least, that results in
very poor encapsulation and strange dependencies between modules].

Alexander Terekhov

unread,
Sep 13, 2002, 2:04:24 PM9/13/02
to

Rob wrote:
>
> "Alexander Terekhov" <tere...@web.de> wrote in message
> news:3D7FA479...@web.de...
> >
> > > And, yes, it has to lock first because some
> > > other thread may be about to change it from short enough to be non shared
> > > to long enough to be ref counted.
> >
> > Uhmm. Mutations ought to be synchronized by CALLERS, I guess.
> >
>
> I disagree. If a string implementation chooses to share internal
> representation among individual string instances, it is up to the
> string implementation to keep track of that. ....

Show me some problem with respect to failing "...to keep track of
that" (presuming safety of atomic operations[1]) in the following kinda
"string implementation":

namespace COW_AtomicInt3 {

#undef NAME
#undef BAGGAGE
#define NAME "COW_AtomicInt3"
#define BAGGAGE

class String;
struct StringBuf;

class AppendStringBufPtr {
    StringBuf* ptr;
public:
    AppendStringBufPtr( StringBuf* p ) : ptr( p ) {}
    StringBuf* operator->() const { return ptr; }
    void destruct() { delete ptr; }
};

class CharRefStringBufPtr {
    StringBuf* ptr;
public:
    CharRefStringBufPtr( StringBuf* p ) : ptr( p ) {}
    StringBuf* operator->() const { return ptr; }
    void destruct() { delete ptr; }
};

struct StringBuf {
    StringBuf();
    ~StringBuf();
    StringBuf( const StringBuf& other );
    StringBuf( AppendStringBufPtr other );
    StringBuf( CharRefStringBufPtr other );

    void Clear();
    void Reserve( size_t n );

    size_t len;
    size_t used;
    long refs;
    char* buf;
    BAGGAGE;

    void* operator new( size_t n );
    void operator delete( void* p );
};

class String {
public:
    String();
    ~String();
    String( const String& );

    void Clear();
    void Append( char );
    size_t Length() const;
    char& operator[]( size_t );

    static int nCopies;
    static int nAllocs;
private:
    StringBuf* data_;
};

int String::nCopies;
int String::nAllocs;

static FastArena fa( NAME, sizeof(StringBuf) );
void* StringBuf::operator new( size_t n ) { return fa.Allocate( n ); }
void StringBuf::operator delete( void* p ) { fa.Deallocate( p ); }

inline StringBuf::StringBuf() : len(0), used(0), refs(1), buf(0) { }

inline StringBuf::~StringBuf() { delete[] buf; }

inline StringBuf::StringBuf( const StringBuf& other )
    : len(other.len), used(other.used), refs(1), buf(new char[len])
{
    ++String::nAllocs;
    memcpy( buf, other.buf, used );
}

inline StringBuf::StringBuf( AppendStringBufPtr other )
    : len(0), used(0), refs(1), buf(0)
{
    Reserve( max( other->len, other->used+1 ) );
    memcpy( buf, other->buf, used = other->used );
    if ( 0 == IntAtomicDecrement( other->refs ) )
        other.destruct();
}

inline StringBuf::StringBuf( CharRefStringBufPtr other )
    : len(other->len), used(other->used), refs(0), buf(new char[len])
{
    ++String::nAllocs;
    memcpy( buf, other->buf, used );
    if ( 0 == IntAtomicDecrement( other->refs ) )
        other.destruct();
}

inline void StringBuf::Clear() {
    delete[] buf;
    buf = 0;
    len = 0;
    used = 0;
}

inline void StringBuf::Reserve( size_t n ) {
    <snip>
}

inline String::String() : data_(new StringBuf) { }

inline String::~String() {
    if ( 2 > IntAtomicGet( data_->refs ) ||
         1 > IntAtomicDecrement( data_->refs ) )
        delete data_;
}

inline String::String( const String& other )
{
    switch ( IntAtomicGet( other.data_->refs ) ) {
    case 00: data_ = new StringBuf( *other.data_ ); break;
    case 01: (data_ = other.data_)->refs = 2; break;
    default: IntAtomicIncrement( (data_ = other.data_)->refs );
    }
    ++nCopies;
}

inline void String::Append( char c ) {
    switch ( IntAtomicGet( data_->refs ) ) {
    case 00: data_->refs = 1; // mutation -> 'shareable'
    case 01: data_->Reserve( data_->used+1 ); break;
    default: data_ = new StringBuf( AppendStringBufPtr( data_ ) );
    }
    data_->buf[data_->used++] = c;
}

inline char& String::operator[]( size_t n ) {
    switch ( IntAtomicGet( data_->refs ) ) {
    case 01: data_->refs = 0; // 'unshareable'
    case 00: break;
    default: data_ = new StringBuf( CharRefStringBufPtr( data_ ) );
    }
    return data_->buf[n];
}

inline size_t String::Length() const {
    return data_->used;
}

inline void String::Clear() {
    if ( 2 > IntAtomicGet( data_->refs ) ||
         1 > IntAtomicDecrement( data_->refs ) ) {
        data_->Clear();
        data_->refs = 1; // mutation -> 'shareable'
    }
    else {
        data_ = new StringBuf;
    }
}

} // namespace COW_AtomicInt3

regards,
alexander.

[1] http://groups.google.com/groups?threadm=3BC8463D.6E23002A%40web.de
(note that unref() needs to inject write/"release" mem.bars too IF
threads could unref() mutable objects INSIDE synchronized regions
for mutation/updates)

JKB

unread,
Sep 13, 2002, 7:55:04 PM9/13/02
to

> Hans Boehm wrote:
>
> > I think it's actually quite possible to avoid locking on every
> > memory allocation. Java implementations generally don't. They
> > instead keep per-thread allocation regions. Standard C/C++
> > malloc/new implementations generally lock, though I'm not completely
> > sure why. Explicit deallocation makes things a bit harder, but not
> > that much. (Or so I claim, having only implemented the
> > garbage-collected version.)
>
"Wil Evers" <bou...@dev.null> wrote :

> That sounds interesting. Any pointers to how this is implemented?


An obvious way is to keep a medium-size heap in thread-specific storage,
which is used only by that thread. No cross-thread sharing, so no locking.

Doing this would have the effect that you can only delete from within the
allocating thread. Or you could be fancy: run the destructor explicitly
and then put the memory in a global list of reclaimable blocks. Every now
and then a thread would check the list and free up the parts it owns.
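
In pthreads terms, the "fancy" variant might look roughly like this -
invented names, no error handling, and the allocation/refill path
elided:

#include <pthread.h>
#include <stddef.h>

// Each block records which thread's private heap it belongs to.
struct block {
    pthread_t owner;
    block*    next;
    // payload follows
};

static pthread_key_t   heap_key;  // per-thread free-list head; created
                                  // with pthread_key_create() at startup
static pthread_mutex_t orphan_lock = PTHREAD_MUTEX_INITIALIZER;
static block*          orphans = 0;  // blocks freed by the "wrong" thread

void thread_free( block* b ) {
    if ( pthread_equal( b->owner, pthread_self() ) ) {
        // common case: the freeing thread owns the block - no lock
        b->next = static_cast<block*>( pthread_getspecific( heap_key ) );
        pthread_setspecific( heap_key, b );
    } else {
        // cross-thread free: park it on the global reclaimable list
        pthread_mutex_lock( &orphan_lock );
        b->next = orphans;
        orphans = b;
        pthread_mutex_unlock( &orphan_lock );
    }
}

// Called "every now and then" by each thread to take back its own blocks.
void drain_orphans() {
    pthread_mutex_lock( &orphan_lock );
    block** pp = &orphans;
    while ( *pp ) {
        if ( pthread_equal( (*pp)->owner, pthread_self() ) ) {
            block* b = *pp;
            *pp = b->next;   // unlink from the global list
            b->next = static_cast<block*>( pthread_getspecific( heap_key ) );
            pthread_setspecific( heap_key, b );
        } else {
            pp = &(*pp)->next;
        }
    }
    pthread_mutex_unlock( &orphan_lock );
}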

Disclaimer: I've never actually implemented this, so I am Making This Up.
Seems sound offhand...
-- jkb

Hans Boehm

unread,
Sep 14, 2002, 8:40:04 AM9/14/02
to
"Alexander Terekhov" <tere...@web.de> wrote in message
news:3D81E7B3...@web.de...

>
> > I disagree. If a string implementation chooses to share internal
> > representation among individual string instances, it is up to the
> > string implementation to keep track of that. ....
>
> Show me some problem with respect to failing "...to keep track of
> that" (presuming safety of atomic operations[1]) in following kinda
> "string implementation":
> ...

> inline String::String( const String& other )
> {
>     switch ( IntAtomicGet( other.data_->refs ) ) {
>     case 00: data_ = new StringBuf( *other.data_ ); break;
>     case 01: (data_ = other.data_)->refs = 2; break;
>     default: IntAtomicIncrement( (data_ = other.data_)->refs );
>     }
>     ++nCopies;
> }
>
I may be missing something, but this looks completely broken to me. If I
have two threads simultaneously making copies of the same string "other",
the reference count may well end up wrong. For example, they may both see a
reference count of 1, and store back a count of 2, when the real count
should be 3.

Note that the only data being shared between threads is the string being
read. Thus I claim it's not reasonable to expect the client threads to lock
around this, at least if we are following the rules in
http://www.sgi.com/tech/stl/thread_safety.html .

I think there are ways to get this right. (The other one I just checked in
a slightly obsolete Linux distribution seems to not even try to do the
increment atomically, for some reason. The one in the current gcc source
tree looks a lot better, though rather complex.) The main problem here,
aside from the fact that so many reference counted string implementations
seem to get it wrong, is that for synchronization purposes, the user has to
treat operator[] as a write to the string, even if it's only used to read a
character. I claim this is very unnatural, inconsistent with other
containers, and leads to obscure application bugs. It's possible to change
the interface to the standard string class to fix this. But unless you go
to immutable strings, those changes introduce other complications.
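
To make the trap concrete (assuming a COW std::string, as in the g++
implementations discussed above):

#include <string>

// Looks read-only, but s is a non-const lvalue, so the non-const
// operator[] is selected. A COW implementation must then unshare the
// rep (or mark it unshareable) - a *write* to state that may be
// shared with string objects owned by other threads.
char first_char( std::string& s )
{
    return s[0];   // a mutating call in disguise
}

// Forcing the const overload avoids the hidden write:
char first_char_safe( const std::string& s )
{
    return s[0];   // truly read-only
}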

Hans

Hans Boehm

unread,
Sep 14, 2002, 8:41:44 AM9/14/02
to
"Wil Evers" <bou...@dev.null> wrote in message
news:3d816ee6$0$12004$e4fe...@dreader1.news.xs4all.nl...

> In article <alo88o$hvg$1...@hplms2.hpl.hp.com>, Hans Boehm wrote:
>
> > I think it's actually quite possible to avoid locking on every
> > memory allocation. Java implementations generally don't. They
> > instead keep per-thread allocation regions. Standard C/C++
> > malloc/new implementations generally lock, though I'm not completely
> > sure why. Explicit deallocation makes things a bit harder, but not
> > that much. (Or so I claim, having only implemented the
> > garbage-collected version.)
>
> That sounds interesting. Any pointers to how this is implemented?
>
There is some discussion of what I did in section 4 of
http://www.hpl.hp.com/techreports/2000/HPL-2000-165.html . (I recommend the
PostScript version if you have a viewer installed.)

Compacting collectors generally just give each thread a chunk of the global
heap to allocate from. This is a bit trickier than it sounds, since the
individual threads need to be able to store the size and other
meta-information for the objects they allocate, so that the garbage
collector can later trace them, etc. But it seems not to be too hard.

I think a malloc/free implementation would essentially need to batch
allocation and deallocation requests, so that it could update the shared
heap data structures only occasionally, when it acquires the appropriate
lock. On the allocation side, this probably looks very similar to the
scheme described in the technical report. On the deallocation side, in the
simplest case you just wait until you have deallocated N bytes, and then
acquire a lock once to restore them to the global free list data structure.
In reality, you probably want to combine this with some of the techniques
Hoard uses in order to get scalability.
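
A sketch of the deallocation side (invented names; the global
free-list plumbing is elided):

#include <pthread.h>
#include <stddef.h>

struct freed { freed* next; size_t size; };

// One of these per thread (e.g. in thread-specific storage).
struct dealloc_batch {
    freed* head;
    size_t bytes;
};

static pthread_mutex_t global_heap_lock = PTHREAD_MUTEX_INITIALIZER;

// Walks the batch and splices it into the global structures (elided).
void return_batch_to_global_heap( freed* head );

void batched_free( dealloc_batch& batch, freed* f, size_t n ) {
    const size_t N = 64 * 1024;   // flush threshold, "N bytes"
    f->size = n;
    f->next = batch.head;         // no lock: the batch is thread-local
    batch.head = f;
    batch.bytes += n;
    if ( batch.bytes >= N ) {     // amortized: one lock per batch
        pthread_mutex_lock( &global_heap_lock );
        return_batch_to_global_heap( batch.head );
        pthread_mutex_unlock( &global_heap_lock );
        batch.head  = 0;
        batch.bytes = 0;
    }
}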

Hans

Alexander Terekhov

unread,
Sep 16, 2002, 6:47:32 AM9/16/02
to

Agreed. It's broken, indeed. Thanks.

> I think there are ways to get this right.

inline String::String( const String& other )
{
    switch ( IntAtomicGet( other.data_->refs ) ) {
    case 00: data_ = new StringBuf( *other.data_ ); break;

    // case 01: (data_ = other.data_)->refs = 2; break;

    default: IntAtomicIncrement( (data_ = other.data_)->refs );
    }
    ++nCopies;
}

regards,
alexander.
