
Guru of the Week #45: Solution


Herb Sutter
Oct 26, 1998
.--------------------------------------------------------------------.
| Guru of the Week problems and solutions are posted regularly on    |
| news:comp.lang.c++.moderated. For past problems and solutions      |
| see the GotW archive at http://www.cntc.com.                       |
| Is there a topic you'd like to see covered? mailto:he...@cntc.com  |
`--------------------------------------------------------------------'
_______________________________________________________

GotW #45: Reference Counting - Part III

Difficulty: 9 / 10
_______________________________________________________


-------------------------------------------------------

Congratulations to Nathan Myers, the Guru of the Week!

Last month, in comp.std.c++, I asserted that using
copy-on-write (COW) in a thread-safe way necessarily
incurred performance penalties of an order of
magnitude or two on popular platforms, making COW
decidedly a pessimization. Thanks to Nathan for
correctly pointing out that, if atomic integer
operations are available, this performance overhead
can be reduced.

Unfortunately, even with atomic integer operations,
a thread-safe COW implementation is still
substantially slower than a non-COW implementation,
and COW indeed remains in general a "false
optimization" in code that might be used in a
multithreaded environment. The worst part is that
the performance hit applies even if the code is
actually used in only a single-threaded program.

-------------------------------------------------------


Introduction
------------

Standard C++ is silent on the subject of threads.
Unlike Java, C++ has no built-in support for threads
and gives no consideration to thread safety issues. So
why a GotW issue on threads? Simply because more and
more of us are writing multithreaded (MT) programs, and
no discussion of reference-counted String
implementations is complete if it does not cover thread
safety issues.

I won't go into a long discussion of what's involved
in writing thread-safe code; see a good book on
threads for the details.


>JG Question
>-----------
>
>Why is Optimized::String not thread-safe? Give examples.

It is the responsibility of code that uses an object to
ensure that access to the object is serialized as
needed. For example, if a certain String object could
be modified by two different threads, it's not the poor
String object's job to defend itself from abuse; it's
the job of the code that uses the String to make sure
that two threads never try to modify the same String
object at the same time.

The problem with the code in GotW #44 is twofold:
First, the reference-counted (copy-on-write, COW)
implementation is designed to hide the fact that two
different visible String objects could be sharing a
common hidden state; hence it's the String class's
responsibility to ensure that the calling code never
modifies a String whose representation is shared. The
String code shown already does that, by performing a
deep copy ("un-sharing" the representation) if
necessary the moment a caller tries to modify the
String. This part is fine in general.

Unfortunately, it now brings us to the second problem:
The code in String that "un-shares" the representation
isn't thread-safe. Imagine that there are two String
objects s1 and s2:

a) s1 and s2 happen to share the same representation
under the covers (okay, because String is designed
for this);

b) thread 1 tries to modify s1 (okay, because thread 1
knows that no one else is trying to modify s1);

c) thread 2 tries to modify s2 (okay, because thread 2
knows that no one else is trying to modify s2);

d) both (b) and (c) happen at the same time (error)

The problem is (d): At the same time, both s1 and s2
will attempt to "un-share" their shared representation,
and the code to do that is not thread-safe.
Specifically, consider the very first line of code in
String::AboutToModify():

void String::AboutToModify(
  size_t n,
  bool bMarkUnshareable /* = false */
) {
  if( data_->refs > 1 && data_->refs != Unshareable ) {
    /* ... etc. ... */

This if-condition is not thread-safe. For one thing,
evaluating even "data_->refs > 1" may not be atomic;
if it isn't, then when thread 1 evaluates
"data_->refs > 1" while thread 2 is updating the value
of refs, the value read from data_->refs could be
anything -- 1, 2, or even something that's neither the
original value nor the new value. The problem is that
String isn't following the basic thread-safety
requirement that code that uses an object must ensure
that access to the object is serialized as needed.
Here, String has to ensure that no two threads are
going to use the same "refs" value in a dangerous way
at the same time. The traditional way to accomplish
this is to serialize access to the shared StringBuf (or
just its .refs member) using a critical section or
semaphore. In this case, at minimum the "comparing an
int" operation must be guaranteed to be atomic.

This brings us to the second issue: Even if reading
and updating "refs" were atomic, there are two parts to
the if condition. The problem is that the thread
executing the above code could be interrupted after it
evaluates the first part of the condition but before it
evaluates the second part. In our example:

Thread 1                             Thread 2

>> enter s1's AboutToModify()

evaluate "data_->refs > 1"
(true, because data_->refs is 2)

        ******** context switch ********

                                     >> enter s2's AboutToModify()

                                     (runs all the way to completion,
                                     including that it decrements
                                     data_->refs to the value 1)

                                     << exit s2's AboutToModify()

        ******** context switch ********

evaluate "data_->refs != Unshareable"
(true, because data_->refs is now 1)

enters AboutToModify's "I'm shared
and need to unshare" block, which
clones the representation, decrements
data_->refs to the value 0, and gives
up the last pointer to the StringBuf...
poof, we have a memory leak because
the StringBuf that had been shared by
s1 and s2 can now never be deleted
Having covered that, we're ready to see how to solve
these safety problems.


>Guru Questions
>--------------
>
>1. Demonstrate how to make Optimized::String thread-safe:
>
> a) assuming that there are atomic operations to get,
> set, and get-and-set integer values; and

> b) assuming that there aren't.

I'm going to answer b) first because it's more general.
What's needed here is a lock-management device like a
critical section or a mutex. The code is equivalent in
either case, so below I'll use a critical section,
which is usually a more efficient synchronization
device than a general-purpose mutex. The key to what
we're about to do is quite simple: It turns out that if
we do things right we only need to lock access to the
reference count itself.

Before doing anything else, we need to add a
CriticalSection member object into
Optimized::StringBuf. Let's call the member cs:

namespace Optimized {

struct StringBuf {
  StringBuf();              // start off empty
  ~StringBuf();             // delete the buffer
  StringBuf( const StringBuf& other, size_t n = 0 );
                            // initialize to copy of other,
                            // and ensure len >= n

  void Reserve( size_t n ); // ensure len >= n

  char*    buf;             // allocated buffer
  size_t   len;             // length of buffer
  size_t   used;            // # chars actually used
  unsigned refs;            // reference count
  CriticalSection cs;       // serialize work on this object
};
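The CriticalSection type itself is left to the reader
(the GotW test harness expects you to supply one). A
minimal sketch of such a wrapper, exposing the
Lock()/Unlock() interface the code above assumes --
here built on std::mutex as a modern stand-in for a
platform primitive like Win32 CRITICAL_SECTION --
might look like:

```cpp
#include <mutex>

// Hypothetical CriticalSection: a thin wrapper exposing the
// Lock()/Unlock() interface the StringBuf code assumes. This is a
// sketch only; a 1998 implementation would have wrapped a platform
// primitive (e.g. Win32 CRITICAL_SECTION or a pthreads mutex).
class CriticalSection {
public:
    void Lock()   { m_.lock(); }    // acquire exclusive access
    void Unlock() { m_.unlock(); }  // release it
private:
    std::mutex m_;                  // noncopyable, as it should be
};
```

Note that making the wrapper noncopyable (std::mutex
already is) matters here, since StringBuf objects are
copied via their copy constructor.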

The only function that necessarily operates on two
StringBuf objects at once is the copy constructor.
String only calls StringBuf's copy constructor from two
places (from String's own copy constructor, and from
AboutToModify()). Note that String only needs to
serialize access to the reference count, because by
definition no String will do any work on a StringBuf
that's shared (if it is shared, it will be read in
order to take a copy, but we don't need to worry about
anyone else trying to change or Reserve() or otherwise
alter/move the buffer).

The default constructor needs no locks:

String::String() : data_(new StringBuf) { }

The destructor need only lock its interrogation and
update of the refs count:

String::~String() {
  bool bDelete = false;
  data_->cs.Lock(); //---------------------------
  if( data_->refs == Unshareable || --data_->refs < 1 ) {
    bDelete = true;
  }
  data_->cs.Unlock(); //-------------------------
  if( bDelete ) {
    delete data_;
  }
}

For the String copy constructor, note that we can
assume that the other String's data buffer won't be
modified or moved during this operation, since it's the
responsibility of the caller to serialize access to
visible objects. We must still, however, serialize
access to the reference count itself, as we did above:

String::String( const String& other )
{
  bool bSharedIt = false;
  other.data_->cs.Lock(); //---------------------
  if( other.data_->refs != Unshareable ) {
    bSharedIt = true;
    data_ = other.data_;
    ++data_->refs;
  }
  other.data_->cs.Unlock(); //-------------------

  if( !bSharedIt ) {
    data_ = new StringBuf( *other.data_ );
  }
}

So making the String copy constructor safe wasn't very
hard at all. This brings us to AboutToModify(), which
turns out to be very similar. Notice that this sample
code actually holds the lock during the entire deep
copy operation. Strictly, the lock is needed only when
looking at the refs value, and again when updating it
at the end, but I decided to lock the whole operation
rather than gain slightly better concurrency by
releasing the lock during the deep copy and
reacquiring it just to update refs:

void String::AboutToModify(
  size_t n,
  bool bMarkUnshareable /* = false */
) {
  bool bCopiedIt = false;
  data_->cs.Lock(); //---------------------------
  if( data_->refs > 1 && data_->refs != Unshareable ) {
    bCopiedIt = true;
    StringBuf* newdata = new StringBuf( *data_, n );
    --data_->refs;   // now all the real work is
    data_ = newdata; // done, so take ownership
  }
  data_->cs.Unlock(); //-------------------------

  if( !bCopiedIt ) {
    data_->Reserve( n );
  }
  data_->refs = bMarkUnshareable ? Unshareable : 1;
}

None of the other functions need to be changed.
Append() and operator[]() don't need locks because once
AboutToModify() completes we're guaranteed that we're
not using a shared representation. Length() doesn't
need a lock because by definition we're okay if our
StringBuf is not shared (there's no one else to change
the used count on us), and we're okay if it is shared
(the other thread would take its own copy before doing
any work and hence still wouldn't modify our used count
on us):

void String::Append( char c ) {
  AboutToModify( data_->used+1 );
  data_->buf[data_->used++] = c;
}

size_t String::Length() const {
  return data_->used;
}

char& String::operator[]( size_t n ) {
  AboutToModify( data_->len, true );
  return *(data_->buf+n);
}

}

Again, note the interesting thing in all of this: The
only locking we needed to do involved the refs count
itself.

With that observation and the above general-purpose
solution under our belts, let's look back to the a)
part of the question:


> a) assuming that there are atomic operations to get,
> set, and get-and-set integer values;

Some operating systems provide these kinds of
functions.
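The original leaves these functions platform-specific.
As an illustration only (the names come from the
article; the bodies are one possible implementation,
using C++11 std::atomic as a modern stand-in for
primitives like Win32 InterlockedIncrement), they can
be sketched like this, assuming the refs count is
stored in an atomic integer:

```cpp
#include <atomic>

// Sketch of the three atomic helpers the article assumes.
// IntAtomicGet returns the current value; the increment/decrement
// helpers atomically adjust the count and return the NEW value,
// which is what the String code below relies on.
inline int IntAtomicGet( std::atomic<int>& i ) {
    return i.load();              // atomic read
}
inline int IntAtomicIncrement( std::atomic<int>& i ) {
    return i.fetch_add( 1 ) + 1;  // fetch_add returns the old value
}
inline int IntAtomicDecrement( std::atomic<int>& i ) {
    return i.fetch_sub( 1 ) - 1;  // fetch_sub returns the old value
}
```

With this approach the "unsigned refs" member of
StringBuf would become an atomic integer type.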

NOTE: These functions are usually much more efficient
than general-purpose synchronization primitives like
critical sections and mutexes. It is, however, a
fallacy to say that we can use atomic integer
operations "instead of locks" because locking is
still required -- the locking is just generally less
expensive than other alternatives, but it's not free
by a long shot.

Here is a thread-safe implementation of String that
assumes we have three functions: IntAtomicGet, plus
IntAtomicDecrement and IntAtomicIncrement, which
safely return the new value. We'll do essentially the
same thing we did above, but use only atomic integer
operations to serialize access to the refs count:

namespace Optimized {

String::String() : data_(new StringBuf) { }

String::~String() {
  if( IntAtomicGet( data_->refs ) == Unshareable ||
      IntAtomicDecrement( data_->refs ) < 1 ) {
    delete data_;
  }
}

String::String( const String& other )
{
  if( IntAtomicGet( other.data_->refs ) != Unshareable ) {
    data_ = other.data_;
    IntAtomicIncrement( data_->refs );
  }
  else {
    data_ = new StringBuf( *other.data_ );
  }
}

void String::AboutToModify(
  size_t n,
  bool bMarkUnshareable /* = false */
) {
  int refs = IntAtomicGet( data_->refs );
  if( refs > 1 && refs != Unshareable ) {
    StringBuf* newdata = new StringBuf( *data_, n );
    if( IntAtomicDecrement( data_->refs ) < 1 ) {
      delete newdata;  // just in case two threads
    }                  // are trying this at once
    else {             // now all the real work is
      data_ = newdata; // done, so take ownership
    }
  }
  else {
    data_->Reserve( n );
  }
  data_->refs = bMarkUnshareable ? Unshareable : 1;
}

void String::Append( char c ) {
  AboutToModify( data_->used+1 );
  data_->buf[data_->used++] = c;
}

size_t String::Length() const {
  return data_->used;
}

char& String::operator[]( size_t n ) {
  AboutToModify( data_->len, true );
  return *(data_->buf+n);
}

}


>2. What are the effects on performance? Discuss.

In a word: Bad. Without atomic integer operations,
copy-on-write typically incurs a performance slowdown
of an order of magnitude or two. Even with atomic
integer operations, COW can make common String
operations take nearly 50% longer -- even in
single-threaded programs.

In general, copy-on-write is a bad idea in
multithread-ready code. In short, the reason is that
the calling code can no longer know whether two
distinct String objects actually share the same
representation under the covers, and so String must
incur overhead to do enough serialization that calling
code can take its normal share of responsibility for
thread safety. Depending on the availability of
more-efficient options like atomic integer operations,
the impact on performance ranges from "moderately
heavy" to "profound."


Some Empirical Results
----------------------

Enough theory; here are some actual test cases. I put
four versions of String/StringBuf through a test
harness that primarily calls String::operator[] in a
timed loop. The four versions I tested are:

Name            Code Used
-------------   ------------------------------------
Unsafe COW      GotW #44 answer
CritSec COW     GotW #45 1(b) answer above
Mutex COW       GotW #45 1(b) answer, with
                CriticalSection replaced by Mutex
AtomicInt COW   GotW #45 1(a) answer above

All results were compared to Unsafe COW. In short,
AtomicInt COW averaged 44% slower in this particular
test code, CritSec COW averaged nearly 8 times slower,
and Mutex COW averaged 46 times slower. (Under
Windows, a critical section is more efficient than a
mutex, because the latter offers more services and can
also be used to synchronize across processes.) I will
make the test code available on the GotW website.

# Output of two runs of 1,000,000 iterations each.
# The test harness was compiled with MSVC 5.0 SP3.

Unsafe COW:      6794 ms   100%
CritSec COW:    53436 ms   786%
Mutex COW:     317174 ms  4668%
AtomicInt COW:   9806 ms   144%

Unsafe COW:      6798 ms   100%
CritSec COW:    53464 ms   786%
Mutex COW:     316900 ms  4661%
AtomicInt COW:   9805 ms   144%


A few important points about this test harness:

0. CAVEAT LECTOR: Take this for what it is: A first
cut at a test harness. Comments and corrections are
welcome. I'm showing raw performance numbers here;
I haven't inspected the compiled code, and I've made
no attempt to determine the impact of cache
hits/misses and other secondary effects. (Even so,
this GotW took much more effort than usual to
produce, and I guarantee that the next few issues
will feature simpler topics!)

1. The test harness is designed to exercise possibly-
mutating operations, but it is otherwise designed to
be as favourable as possible to the locking
operations: it operates only on unshared strings,
it does not actually modify the strings, and there
is no lock contention because there's only one
thread (so it never has one thread waiting while
another is performing an atomic operation). In
other words, this code isn't trying to unfairly
stack the deck against the locking operations in an
attempt to magnify the performance difference.

2. Inside the timed loop, the code does not modify the
string; that is, operator[] is called but the
returned reference is not used to change the string.
This demonstrates that the overhead is indeed
incurred "for every possibly mutating operation
whether the string is shared or not," and not just
"for every mutating operation on a shared string" as
is sometimes suggested. Note that operator[] is
just like the standard begin() and end() in this
respect, since all three functions return an
iterator or reference that can later be used to
modify the string.
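The shape of such a loop can be sketched as follows.
This is a hypothetical reconstruction, not the actual
GotW harness (which is on the GotW site); the function
name and details are inventions for illustration:

```cpp
#include <string>

// Hypothetical sketch of the harness's inner loop: call the
// possibly-mutating (non-const) operator[] n times on an unshared
// string, but never write through the returned reference. A COW
// string must still pay its unshare check on every iteration.
// Returning a checksum keeps the accesses from being optimized away.
char ExerciseString( std::string& s, long n ) {
    char sink = 0;
    for( long i = 0; i < n; ++i ) {
        sink ^= s[ i % s.size() ];  // possibly-mutating access, no actual mutation
    }
    return sink;
}
```

The timed runs above would wrap a call like this in a
timer, once per String implementation under test.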

3. TANSTAAFL ("there ain't no such thing as a free
lunch" -R.A.Heinlein). Even with atomic integer
operations, it is incorrect to say "there's no
locking required" because the atomic integer
operations clearly do perform serialization, and do
incur significant overhead.

4. The test harness itself is SINGLE-threaded. A
thread-safe COW implementation incurs overhead even
in programs that are not themselves multithreaded.
At best, COW could be a true optimization only when
the COW code does not need to be made thread-safe at
all (even then, see Rob Murray's "C++ Strategies and
Tactics" book for empirical tests that show COW is
only beneficial in certain situations). If thread
safety is required, COW imposes a significant
performance penalty on all users, even users running
only single-threaded code.

For more details, see the upcoming article on this
topic in... (later I'll announce which magazine).

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]
[ about comp.lang.c++.moderated. First time posters: do this! ]

sa...@bear.com
Oct 26, 1998
[snipped interesting discussion]

Thanks to the STL team at SGI for doing the right thing for string -
that is, choosing a non-shared, vector-like implementation. Digital
C++ also had a similar implementation (I do not know the current
implementation, but I'd think they still do the right thing). VC++
also provides a way to disable reference counting.

For those readers who do not follow the comp.programming.threads
group, I'd like to quote Dave Butenhof of Compaq Corporation:
"Mutex is a badly chosen name; we should have called it 'bottleneck'.
It is a necessary evil."

And for strings, it is not necessary at all (I am talking in the MT
context only).


-- Saroj Mahapatra


Nathan Myers
Oct 26, 1998
Herb Sutter<he...@cntc.com> wrote:
> GotW #45: Reference Counting - Part III
> Congratulations to Nathan Myers, the Guru of the Week!

Thanks, Herb, for the kudos, but I'm afraid that this doesn't
let you off the hook. Your thread-safe string implementation
is still far from optimal, so any conclusions drawn from it
are (still) bogus.

{I'll follow up tomorrow, but I'd be sincerely happy to
see you demonstrate an implementation that supports your
position below. -hps}

> Last month, in comp.std.c++, I asserted that using
> copy-on-write (COW) in a thread-safe way necessarily
> incurred performance penalties of an order of
> magnitude or two on popular platforms, making COW
> decidedly a pessimization. Thanks to Nathan for
> correctly pointing out that, if atomic integer
> operations are available, this performance overhead
> can be reduced.

Actually, what I said was "eliminated," not "reduced."
In fact, the overhead is negated, because it's more
than made up for by reduced copying.

> Unfortunately, even with atomic integer operations,
> a thread-safe COW implementation is still
> substantially slower than a non-COW implementation,
> and COW indeed remains in general a "false
> optimization" in code that might be used in a
> multithreaded environment. The worst part is that
> the performance hit applies even if the code is
> actually used in only a single-threaded program.

This "reduced" statement of overhead is still false, because it
is based on a false assumption. [I *really* wish Herb had checked
with me before posting.] In fact a correctly-coded thread-safe
COW implementation is not appreciably slower than a non-thread-
safe one, given atomic reference-count operations: load, increment,
decrement-and-test.

Where did Herb go wrong? Let's explore:

> if( data_->refs > 1 && data_->refs != Unshareable ) {
>

> This if-condition is not thread-safe. For one thing, ...

HERE is the problem.

On several postings to this newsgroup it was pointed out
that the "Unshareable" condition can be encoded into the
"refs" member as a special value. What Herb neglected to
take into account was that this can be discerned by a single
(atomic) test. Hence, no locking is required to make this
test thread-safe, given an atomic integer "load" operation.

In particular, the conclusion of the following paragraph is false:

> NOTE: These functions are usually much more efficient
> than general-purpose synchronization primitives like
> critical sections and mutexes. It is, however, a
> fallacy so say that we can use atomic integer
> operations "instead of locks" because locking is
> still required -- the locking is just generally less
> expensive than other alternatives, but it's not free
> by a long shot.

Given atomic operations, locks are in fact _not_ required in
a thread-safe COW string class implementation. It is true that
the memory allocator itself must serialize access to its own data
structures when a deep-copy is performed (COW or otherwise).
However, a COW implementation does many fewer deep-copies, thus
many fewer allocations, and need not lock when it doesn't allocate.
Fewer deep-copies is why COW wins.

Therefore, Herb's conclusion below is also wrong:

> 2. What are the effects on performance? Discuss.
>

> In a word: Bad. [...] Even with atomic integer operations,

> COW can make common String operations take nearly 50% longer

> -- even in single-threaded programs. [...] In general,

> copy-on-write is a bad idea in multithread-ready code.

... and thus his empirical results are flawed:

>Some Empirical Results
>----------------------

In any case, his tests showed only a 44% overhead -- much less than
a factor of two. Even if we gave him this, it is not enough to
justify the conclusions cited above. A fair comparison would, like
any real program, do more copies than modifications. A non-COW
string suffers miserably in comparison when copying is required.
Even with the unnecessary locks, Herb's COW-string would have come
out miles ahead.

-------

As a result, Herb's measurements don't really tell you anything
about the performance of a well-written thread-safe string
implementation in a real program. I'm afraid his magazine article
on the subject will need to wait, again.

--
Nathan Myers
n...@nospam.cantrip.org http://www.cantrip.org/

Herb Sutter
Oct 27, 1998
On 26 Oct 1998 18:58:16 -0500, n...@nospam.cantrip.org (Nathan Myers)
wrote:

>Your thread-safe string implementation is still far from optimal,

The implementation was intended to be tutorial, not optimal, but it's
still quite good (thanks to your observations in comp.std.c++ in
August). Of course, I am as always willing to be corrected and
convinced.

>so any conclusions drawn from it are (still) bogus.

A request: I have taken the time to write and illuminate testable code
to help educate people about an important issue. If you disagree so
strongly, I think it's only fair that you likewise support your
comments with compilable, measurable code. Your most convincing
rebuttal would be an implementation of this GotW's String that is both
thread-safe and penalty-free in my test harness.

I'm only asking for one page of code that would benefit all readers.
My code is openly available on our website for review and correction
by anyone, and you just have to provide implementations for
CriticalSection, Mutex, and IntAtomicXxx to run the same tests I did.
The links are on www.peerdirect.com/resources/gotw045.html.

>In fact a correctly-coded thread-safe
>COW implementation is not appreciably slower than a non-thread-
>safe one, given atomic reference-count operations: load, increment,
>decrement-and-test.

Great... please show a "correctly-coded" answer to GotW #45, question
1, part a), since you seem to feel that mine wasn't. :)

>Where did Herb go wrong? Let's explore:
>

>> if( data_->refs > 1 && data_->refs != Unshareable ) {

[...]


>HERE is the problem.
>
>On several postings to this newsgroup it was pointed out

[actually, it was comp.std.c++]


>that the "Unshareable" condition can be encoded into the
>"refs" member as a special value. What Herb neglected to
>take into account was that this can be discerned by a single
>(atomic) test. Hence, no locking is required to make this
>test thread-safe, given an atomic integer "load" operation.

It's always good to read what you're criticizing. Before you say what
I neglected, you may want to recheck the "atomic int operations"
portion of my solution posting:

int refs = IntAtomicGet( data_->refs );
if( refs > 1 && refs != Unshareable ) {


Here's where you went wrong:

>the overhead is negated, because it's more than made up for by reduced copying.

[...]


>However, a COW implementation does many fewer deep-copies,

[...]


>A fair comparison would, like
>any real program, do more copies than modifications.

This is simply wrong, as are most generalizations, although I have no
doubt that you have worked with code where it was true.

I think I see your problem: In the "string/cow/algorithm" thread on
comp.std.c++ you called passing strings by const& "unnatural" -- yet
this is exactly what I do, and what all programmers _should_ do (Matt
showed why quite eloquently in his reply). If you are indeed used to
passing strings around by value -- in effect, _relying_ on reference
counting -- then of course it's no wonder that reference-counted
strings perform much better for your special case.

What you must understand is that the same is not true for everyone. In
fact, it's not even a good idea. It is simply wrong to say things like
'any real program does more copies than modifications.'
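To make the copies-versus-modifications point
concrete, here is a small hypothetical illustration
(the type and function names are inventions for this
example, not from the thread): a class that counts its
deep copies shows what pass-by-value pays and
pass-by-const-reference avoids.

```cpp
#include <cstddef>

// Hypothetical copy counter: CountedBuf stands in for a string's
// representation. Passing by value invokes the copy constructor;
// passing by const& does not. This is the cost COW tries to elide,
// and the cost a const&-passing style never incurs in the first place.
struct CountedBuf {
    static int copies;                 // number of deep copies made
    CountedBuf() {}
    CountedBuf( const CountedBuf& ) { ++copies; }
};
int CountedBuf::copies = 0;

std::size_t ByValue( CountedBuf b )           { return sizeof b; } // forces a copy
std::size_t ByConstRef( const CountedBuf& b ) { return sizeof b; } // no copy
```

A program that passes strings by const& (as Herb
advocates) makes far fewer copies, so COW's saved
copies may never materialize, while its per-operation
serialization cost always does.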

See also Rob Murray's interesting empirical statistics in C++
Strategies and Tactics (AW, 1993, pp 70-72). Rob and I wrote actual
code to measure our results.

>... and thus his empirical results are flawed:

Not so. I ask again: Show me code, and I will believe.

Herb

P.S.: Here are a few small things that I've moved out of the main
text:

>Given atomic operations, locks are in fact _not_ required in
>a thread-safe COW string class implementation.

Of course mutex locks are not required. Atomic integer operations are
implicitly 'locking' operations since they do in fact both deliver
serialization and incur overhead (just like mutexes, only less-general
serialization and less overhead).

What I demonstrated is that COW indeed forces a significant
performance penalty on even the simplest of operations: an actually
nonmutating operation on an unshared string, such as a call to
operator[] whose returned reference is not used to modify the string.
Clearly there is no penalty on can't-possibly-mutate (or, const)
operations, but there is a penalty on every single possibly-mutating
operation whether it actually mutates or not, and whether the string
is actually shared or not.

Atomic integer operations just make the penalty smaller; they do not
eliminate it, because they are slower than "normal" integer
operations, whether you want to call them locking operations or not.

>Even with the unnecessary locks, Herb's COW-string would have come
>out miles ahead.

Please specifically point out the unnecessary locks. I do not doubt
that they may be there, but I prefer to be educated with concrete
examples.


---
Herb Sutter (mailto:he...@cntc.com)

Current Network Technologies Corp 2695 North Sheridan Way, Suite 150
www.cntc.com www.peerdirect.com Mississauga Ontario Canada L5K 2N6

Paul Grealish
Oct 27, 1998
Herb Sutter wrote:
>
> GotW #45: Reference Counting - Part III
>
> (snip)

>
> I won't go into a long discussion about what's involved
> with writing thread-safe code in detail; see a good
> book on threads for more details.

Any recommendations? I've been planning
to purchase a book on the subject.

--
+---------------------------------+
| Paul Grealish |
| GEOPAK-TMS Limited |
| Cambridge, England |
| paul.g...@uk.geopak-tms.com |
+---------------------------------+

Bill Wade
Oct 27, 1998

Herb Sutter wrote in message <3633cca5...@nr1.toronto.istar.net>...

> Unfortunately, even with atomic integer operations,
> a thread-safe COW implementation is still
> substantially slower than a non-COW implementation,
> and COW indeed remains in general a "false
> optimization" in code that might be used in a
> multithreaded environment. The worst part is that
> the performance hit applies even if the code is
> actually used in only a single-threaded program.


It is certainly true that thread-safe COW implementations take a
performance hit compared to single-threaded COW (as shown by Herb's
numbers). However, "thread-safe COW ... slower than non-COW" remains
strongly dependent on usage patterns. In a parsing environment where
the most common operations are copy and access to const objects
(which don't need to worry about thread issues) the issue seems less
clear.

Access to const objects will give essentially the same performance
for both COW and non-COW (there is a slim possibility that COW would
gain a benefit from a smaller memory footprint). For instance, if
string always maintains a trailing '\0', string::c_str() does not
require locking. Neither does the const version of [].

Most such strings will require just two locks during their lifetime:
one when the string is created and another when it is destroyed.

A string used by a lexer may require an additional lock, in order to
be marked unshareable during the period it is being grown. In this
case there should be a convenient mechanism to toggle the "shareable"
bit.

So strings in COW implementations in this environment will typically
require two or three locks during their lifetime. If used in an
environment where copy by value is common (say, STL) COW may save
nearly a complete new/copy/delete per string (and new/delete
typically each require some kind of lock) at the cost of some
bookkeeping overhead.

Obviously COW is a loser for manipulating small, non-const strings in
a multithreaded environment. For larger, const strings, measure.

Nathan Myers
Oct 27, 1998
Bill Wade<bill...@stoner.com> wrote:
>
>Obviously COW is a loser for manipulating small, non-const strings in a
>multithreaded environment. For larger, const strings, measure.

Even this statement is too strong. COW may lose if you make very few
copies, and if your manipulations use non-const operator[] and begin()
rather than replace() or append().

Incidentally, there are many handy techniques for toggling the
"unshareable" bit. Anything that invalidates iterators suffices,
such as assignment: "s = s;" should clear it, or appending "".

Herb Sutter
Oct 27, 1998
On 27 Oct 1998 09:50:41 -0500, "Bill Wade" <bill...@stoner.com>
wrote:

>Herb Sutter wrote in message <3633cca5...@nr1.toronto.istar.net>...
>
>> Unfortunately, even with atomic integer operations,
>> a thread-safe COW implementation is still
>> substantially slower than a non-COW implementation,

Here I should perhaps add "for all possibly-mutating operations,"
since const operations are unaffected.

>> and COW indeed remains in general a "false
>> optimization" in code that might be used in a
>> multithreaded environment. The worst part is that
>> the performance hit applies even if the code is
>> actually used in only a single-threaded program.
>

>It is certainly true that thread-safe COW implementations take a performance
>hit compared to single-threaded COW (as shown by Herb's numbers). However
>"thread-safe COW ... slower than non-COW" remains strongly dependent on
>usage patterns.

Definitely. Whether COW is a win always depends on usage, and my point
was that requiring thread safety just pushes the "win boundary" closer
to non-COW... that is, a thread-safe COW has to save more copies to
justify itself than does a thread-unsafe COW.

That's what I meant by "in general" above... my experience has been
that "in general" most programs don't make enough read-only copies of
strings to justify COW if thread safety is also required. This is
because, as I showed, every possibly-mutating operation on a COW
string must do serialization of some sort, even if it's just an
IntAtomicCompare, and that adds up quickly in most real-world
programs.
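To make the point concrete, here is a minimal sketch of that unavoidable check, using std::atomic as a modern stand-in for the thread's IntAtomicGet/IntAtomicCompare primitives; StringBuf and needs_deep_copy are illustrative names, not the GotW code itself:

```cpp
#include <atomic>
#include <cassert>

// Sketch only: even when the buffer turns out to be unshared, every
// possibly-mutating operation must first read the count atomically
// before it may skip the deep copy -- that load is the serialization
// point that a non-COW string never pays for.
struct StringBuf {
    std::atomic<int> refs{1};
    // ... buffer, length, etc. elided ...
};

// Returns true if the count shows sharing, i.e. a deep copy is needed.
inline bool needs_deep_copy(const StringBuf& b) {
    return b.refs.load(std::memory_order_acquire) > 1;
}
```

The point is not that the load is expensive in isolation, but that it sits on the hot path of every mutator.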

>Obviously COW is a loser for manipulating small, non-const strings in a
>multithreaded environment. For larger, const strings, measure.

Herb


---
Herb Sutter (mailto:he...@cntc.com)

Current Network Technologies Corp 2695 North Sheridan Way, Suite 150
www.cntc.com www.peerdirect.com Mississauga Ontario Canada L5K 2N6

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Nathan Myers
Oct 28, 1998

Herb Sutter<he...@cntc.com> wrote:
> n...@nospam.cantrip.org (Nathan Myers) wrote:
>> Your thread-safe string implementation is still far from optimal,
>> so any conclusions drawn from it are (still) bogus.
>
> The implementation was intended to be tutorial, not optimal, but it's
> still quite good (thanks to your observations in comp.std.c++ in
> August). Of course, I am as always willing to be corrected and
> convinced.

Is controversy resulting from bald assertions ultimately retracted
more "tutorial" than boringly correct postings? Probably.

Anyway, a poor implementation intended to be "tutorial, not optimal"
is a bad basis for a sweeping condemnation of a very effective and
mature implementation technique.

> A request: I have taken the time to write and illuminate testable
> code to help educate people about an important issue. If you disagree
> so strongly, I think it's only fair that you likewise support your
> comments with compilable, measurable code. Your most convincing
> rebuttal would be an implementation of this GotW's String that is
> both thread-safe and penalty-free in my test harness.

I'm quite busy this week preparing to release a snapshot of
work-in-progress on a C++ standard library. It includes a
basic_string, implemented as I described. When the snapshot
is out it will be available to try, although of course it is
not mature.

I would also expect to see measurements of operations other than op[]
before any very interesting point could be considered demonstrated.
I see no point in submitting code to a test which is insanely biased
in favor of one approach.

> Before you say what I neglected, you may want to recheck the
> "atomic int operations" portion of my solution posting:
>
> int refs = IntAtomicGet( data_->refs );
> if( refs > 1 && refs != Unshareable ) {

OK, no locks. However, this still adds _bags_ of unnecessary
overhead, because it generates a(nother!) function call on every
op[] call. The test should be inlined, and the encoding for
member refs should allow both conditions to be identified on
a single comparison.

Some operations always set "unshareable", and some always clear it;
it's known at compile time which, so passing it as a runtime argument
to the deep-copy operation just adds overhead.

Anyway, operator[] should itself be inlined. Otherwise
function-call overhead to operator[] swamps what we're trying
to measure.

E.g. something more like the following would be better.

    inline char&
    String::operator[]( size_t n ) {
        if (IntAtomicGet(data_->refs) != Unshareable)
            MakeUnshareable(); // might require a deep-copy.
        return *(data_->buf+n);
    }

Here, a function call happens only the first time op[] is called on
a string instance; after that it's just a test, branch, and fetch.
(Of course this is slower than the simple "return *(data_->buf+n);"
possible with a non-COW string, but overhead in other operations
must be compared too.)

Furthermore, if the string is implemented like this:

    String
      data_ -> StringBuf
                 buf  -> char[...]
                 refs
                 used
                 etc.

a deep-copy involves two allocations, which is a horrendous overhead.
Of course a non-COW string can put the StringBuf members right in the
String object:


    String
      buf -> char[...]
      refs
      used
      etc.


I don't know if your benchmark does that, but if not it should, to
make the comparison more fair to non-COW implementations. However,
if it does, then the COW string should also use a single allocated
block, like so:

               etc.
               used
    String     refs
      data_ -> char[...]

Using a single allocation for each alternative amplifies the relative
effect of any non-allocation overhead. Comparisons when the number
of allocated blocks used in a string differs are simply meaningless.

Finally, it makes a big difference in real programs if your
"deep copy" can allocate the right size for the intended operation.
Doing a copy followed by another reallocation to grow the string is
bad for performance, which would be evident in a test that (e.g.)
appends strings. Of course this effect would be similar for either
implementation, if both used only the same number of secondary
memory blocks.
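A sketch of that sizing rule (the helper name and signature are hypothetical, not from any of the implementations under discussion): the deep copy takes the final intended capacity as a parameter, so a copy made on behalf of an append never has to reallocate a second time to grow.

```cpp
#include <cstring>
#include <cstddef>

// Sketch only: copy 'used' bytes of 'src', but size the new block for
// the operation that triggered the copy ('want' characters), so the
// caller's subsequent append fits without another allocation.
char* deep_copy(const char* src, std::size_t used, std::size_t want) {
    std::size_t cap = (want > used) ? want : used;  // room for the coming growth
    char* dst = new char[cap + 1];
    std::memcpy(dst, src, used);
    dst[used] = '\0';
    return dst;
}
```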



>> the overhead is negated, because it's more than made up for by
>> reduced copying. ... However, a COW implementation does many
>> fewer deep-copies ... A fair comparison would, like any real
>> program, do more copies than modifications.
>
> This is simply wrong, as are most generalizations, although I have no
> doubt that you have worked with code where it was true.

When I worked at Rogue Wave Software, we were told in no uncertain
terms when we harmed users' performance. The overwhelming influence
on performance, swamping anything else by a huge margin, was the
number of allocations performed. COW does fewer.


> I think I see your problem: In the "string/cow/algorithm" thread on
> comp.std.c++ you called passing strings by const& "unnatural"

That was not me. Of course I pass strings by reference whenever
possible. Frequently it *is* possible, but that is not a substitute
for assignment, nor for copy-on-return. The overwhelming majority
of applications of strings never apply operator[] at all; they use
read, copy/assign, append, write, and sometimes substring extraction,
to none of which COW adds any overhead at all. However, a non-COW
implementation adds overwhelming overhead to copy/assign operations.

Herb! Do you really maintain that real programs don't (or shouldn't)
assign strings? This would be a _remarkable_ statement, either way.

Any comparison of techniques which avoids measuring copy operations
stacks the deck to an almost-unscrupulous degree.



> Atomic integer operations are
> implicitly 'locking' operations since they do in fact both deliver
> serialization and incur overhead (just like mutexes, only less-general
> serialization and less overhead).

The "overhead" for the atomic operations is counted as the time for
other CPUs to retrieve the count again from main memory, next time
they need it, because their cached copy has been marked as invalid.
This is in a different league from a function call, never mind a
system call or a memory allocation.



> What I demonstrated is that COW indeed forces a significant
> performance penalty on even the simplest of operations: an actually
> nonmutating operation on an unshared string, such as a call to
> operator[] whose returned reference is not used to modify the string.

Herb did not demonstrate even that, for two reasons. First, he
measured a seriously suboptimal implementation. Second, he chose
the "simplest of operations" to favor one technique. Real string
types offer many "simple" operations to measure, and a comparison
that ignores all the ones that don't favor your own bias is neither
fair nor meaningful.

Paul D. DeRocco
Oct 28, 1998

Herb Sutter wrote:
>
> The implementation was intended to be tutorial, not optimal, but it's
> still quite good (thanks to your observations in comp.std.c++ in
> August). Of course, I am as always willing to be corrected and
> convinced.

One small optimization: use the value zero to indicate the unshared
condition, rather than the maximum integer value. This puts the unshared
value in the same direction as the minimum reference count of one, so
that both can be checked for in a single test, in the second statement
of AboutToModify.

Another small optimization: drop the second argument to AboutToModify,
and let the caller simply set refs afterward to either 0 (unshared) or
1.
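Paul's encoding can be sketched like this (the names are illustrative): with 0 meaning "unshareable" and 1 the minimum reference count, both the unshareable case and the uniquely-owned case fall on the same side of a single `refs > 1` comparison, so AboutToModify needs only one test to know no deep copy is required.

```cpp
// Sketch of the encoding: 0 = unshareable, 1 = one owner, >1 = shared.
enum { Unshareable = 0, One = 1 };

// One comparison answers "must I deep-copy?" for both the unshareable
// string (count 0) and the uniquely-owned string (count 1).
inline bool must_deep_copy(int refs) {
    return refs > One;
}
```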

By the way, on a 486+ CPU, an atomic decrement and fetch can be done by
loading -1 into a register, and using LOCK XADD to add it to memory and
retrieve the previous value. And on most machines, IntAtomicGet boils
down to nothing, and a LOCK INC will do for IntAtomicIncrement. Thus, in
the real world, making the ref counting operations atomic often imposes
no cost at all.
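For readers without an x86 manual handy, the three primitives Paul names can be sketched portably with the GCC/Clang `__atomic` builtins (the builtin names are the compiler's, not anything from this thread; the return-the-new-value convention is one plausible choice):

```cpp
// Sketch: portable stand-ins for LOCK XADD / LOCK INC / a plain load.
inline int IntAtomicDecrement(int& i) {
    return __atomic_sub_fetch(&i, 1, __ATOMIC_ACQ_REL);  // new value
}
inline int IntAtomicIncrement(int& i) {
    return __atomic_add_fetch(&i, 1, __ATOMIC_ACQ_REL);  // new value
}
inline int IntAtomicGet(int& i) {
    return __atomic_load_n(&i, __ATOMIC_ACQUIRE);  // ordinary load on x86
}
```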

--

Ciao,
Paul

Andrei Alexandrescu
Oct 28, 1998

Herb Sutter wrote in message <3633cca5...@nr1.toronto.istar.net>...
[snip]

Guys,

I've read an article some time ago that even though integer operations
may be atomic on a single processor machine, they AREN'T on a
multiprocessor machine sharing the same memory bus (even though the
same processor type is used in both cases). This would blow everything
up. The race condition was something like (P1 and P2 are processors
willing to do a RMW increment operation):

P1 locks the bus
P1 loads the integer in a register
P1 unlocks the bus
.... thread switching
P2 locks the bus
P2 loads the integer in a register
P2 unlocks the bus
P2 increments the register
P2 locks the bus
P2 writes the integer back
P2 unlocks the bus
... thread switching. During this time maybe P1 already incremented the
register so it waits for the bus
P1 locks the bus
P1 writes the integer back
P1 unlocks the bus
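What this sequence describes is a non-atomic read-modify-write: each bus access is locked individually, but the increment as a whole is not, so one CPU's store can overwrite the other's. In code, the contrast looks something like this (sketch only, using the GCC/Clang `__atomic` builtin as a stand-in for a single locked RMW instruction):

```cpp
// The pattern Andrei describes: load, modify, store as three separate
// steps. Each step may be atomic by itself, yet another CPU can slip
// its own increment in between, and this store overwrites it.
inline void broken_increment(int& refs) {
    int tmp = refs;  // load into a register
    ++tmp;           // modify in the register
    refs = tmp;      // store back -- may clobber a concurrent update
}

// The fix: one indivisible read-modify-write (e.g. LOCK XADD on x86).
inline void atomic_increment(int& refs) {
    __atomic_add_fetch(&refs, 1, __ATOMIC_ACQ_REL);
}
```

Single-threaded, both produce the same result; the difference only appears under concurrency, which is exactly why this bug is so easy to miss in testing.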

Andrei

Bill Wade
Oct 29, 1998


Paul D. DeRocco wrote in message <3636A04B...@ix.netcom.com>...

>One small optimization: use the value zero to indicate the unshared
>condition, rather than the maximum integer value. This puts the unshared
>value in the same direction as the minimum reference count of one, so
>that both can be checked for in a single test, in the second statement
>of AboutToModify.


If refs is unsigned, use 1 for Unshareable and 2 for One. This allows the
destructor to be written as

~String(){ if(IntAtomicDecrement(data_->refs) < One) delete data_; }

Of course if refs is signed you can use this destructor with Paul's
suggestion.

Note that this optimization applies to both the single-threaded and
multi-threaded COW implementations. It allows all of the refs test to be
performed with a single comparison. The only place multi-threaded COW
requires an extra test compared to single-threaded COW is in AboutToModify.
When it decides to perform a deep copy it has to check (again) to see if it
is the only holder of the old buffer. In this case the time for the extra
test is swamped by the time for the deep copy.

As Nathan and Paul have pointed out, on many machines the Atomic operations
cost about the same as access to a volatile int. Depending on architecture,
this will typically cost between 1 and 10 times as much as access to a
non-volatile int. I can imagine architectures where the penalty is much
higher. AtomicSomething() needs to be inlined to get this performance.

Nathan points out that COW sometimes saves an allocation and that when he
was at RW he saw performance dominated by the cost of allocations. Until
recently a malloc/free pair would typically cost hundreds of instructions
and sometimes thousands. With the widespread use of efficient pool
allocators this is reduced to a few dozen (say 40) instructions for the
typical case. This includes the overhead for the non-inline function calls.
In addition a deep copy needs to copy the string itself which is likely to
require something between 4 and 1000000 instructions.

I believe someone said that COW is a big win for the common, empty string
case (which can use a single program-wide buffer), however the same
optimization can be used for non-COW implementations.
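That shared-empty-buffer trick can be sketched as follows (the layout and names are hypothetical, not any particular vendor's string): every default-constructed string points at one static, never-freed representation, so constructing and destroying empty strings allocates nothing, COW or not.

```cpp
#include <cstddef>

namespace sketch {

// One program-wide representation for the empty string.
inline const char* empty_rep() {
    static const char rep[1] = { '\0' };
    return rep;
}

// Hypothetical non-COW string using the same empty-string optimization.
struct String {
    const char* buf;
    std::size_t len;
    String() : buf(empty_rep()), len(0) {}           // no allocation
    ~String() { if (buf != empty_rep()) delete[] buf; }
};

} // namespace sketch
```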

Herb implied that when using COW, most strings eventually get a deep copy
anyway, because if no copy was going to occur the string should have been
passed by reference. Nathan pointed out return-by-value and implied other
circumstances where ownership issues rule out simple references.

Herb points out that AboutToModify has a cost even when a string is not
shared. I'll point out that if you really care about performance
AboutToModify will be rare. For append operations where I don't know the
final size, I'll use a back_inserter, which knows it has an unshared buffer.
For in-place operations I'll use a non-const iterator derived from begin(),
which also knows it has an unshared buffer. I'll also point out that I
don't dare use both at once, which is a convenience/safety issue relative to
[] and push_back.

Herb rightly points out that if you don't have fast atomic operations it
becomes much more difficult for COW to win.

So who is right? I think they both are. In other words it depends on the
relative cost of atomic operations and deep copies (which in turn depends on
the string length), and the behavior of the program.

Yesterday I needed a queue of variable length strings (one to a few hundred
characters) in a piece of code that needed a performance boost. For this
application dumb pointers would give the best performance, but I'm spoiled.
In particular I was using a queue with value semantics. A string with
auto_ptr copy semantics would have worked well, but I didn't have one. I
suppose I could have used a queue which used swap() semantics instead of
value semantics, but I didn't have one of those either. What I did have
were an existing template queue class with value semantics and a string with
COW. They certainly weren't optimal, but they did work very well. Did I
need the extra performance of a COW string? In this case probably not
(using the queue changed an O(N*N) operation to O(N)), but I have no doubt
that this particular section of code is faster than it would have been with
unshared strings, even in many (most?) multi-threaded environments.

In this regard my experience is closer to Nathan's. Strings get copied much
more often than they get "AboutToModified", and in many cases COW prevents a
deep copy from ever occurring.

I appreciate Herb's insight, and truly value his contributions to the C++
community. It was good of him to provide code for others to see (and
perhaps attack). I don't believe his benchmark reflects the way I typically
use strings, and as a result some of his conclusions don't apply (at least
right now) to my environment.

Peace brothers.

Herb Sutter
Oct 29, 1998

On 28 Oct 1998 05:13:53 -0500, "Paul D. DeRocco"
<pder...@ix.netcom.com> wrote:
>By the way, on a 486+ CPU, an atomic decrement and fetch can be done by
>loading -1 into a register, and using LOCK XADD to add it to memory and
>retrieve the previous value. And on most machines, IntAtomicGet boils
>down to nothing, and a LOCK INC will do for IntAtomicIncrement. Thus, in
>the real world, making the ref counting operations atomic often imposes
>no cost at all.

Unfortunately, this is insufficient. Manipulating the CPU directly
will not always work, and you should always use the OS's atomic
integer functions. My test results were based on using Win32's
InterlockedDecrement and related functions, and as you saw there is
indeed measurable overhead (although much less than for a mutex or a
critical section). One reason there must be overhead is that there may
be multiple CPUs.

If you wrote it as you suggest above, using native machine
instructions instead of OS calls, it would not work at all on some
machines. (Note that I haven't talked at all about portability,
because that's not the main issue here, but inline asm is even less
portable than native OS calls.)

Herb

Nathan Myers
Oct 29, 1998

Andrei Alexandrescu<alexan...@micromodeling.com> wrote:
>I've read an article some time ago that even though integer operations
>may be atomic on a single processor machine, they AREN'T on a
>multiprocessor machine sharing the same memory bus (even though the
>same processor type is used in both cases).

Of course. Every modern CPU has an atomic "decrement-and-test"
or "test-and-swap" instruction that can be used safely, and that
is what we have been talking about. Herb mistakenly referred to
this as something offered by "some operating systems" rather than
"some CPUs".

Nathan Myers
Oct 30, 1998

Herb Sutter<he...@cntc.com> wrote:
>... My test results were based on using Win32's
>InterlockedDecrement and related functions, and as you saw there is
>indeed measurable overhead (although much less than for a mutex or a
>critical section).

As noted previously, the overhead you measured was from a wasted
function call or two per op[] operation, not from the atomic
operations.

Herb Sutter
Nov 3, 1998

Here's an interesting interim thought, as I prepare to release the
results of the latest (greatly enhanced) cut of my COW performance
test suite:

One disadvantage of COW is that it's more difficult than you think to
get it right. As Rob Murray puts it: "A use-counted class is more
complicated than a non-use-counted equivalent, and all of this horsing
around with use counts takes a significant amount of processing time.
... Code that manages use counts is also a good place for a careful
code inspection. Bugs involving messed up use counts can be hard to
find." [C++ Strategies and Tactics, pp. 71-2]

GotW solutions are not exempt. As it turns out, there was a bad bug
in my originally-posted code that nobody has yet reported (and that
miraculously never failed through a week of single-threaded testing,
only because the related underlying functions were quite forgiving).
The COW_Critsec and COW_Mutex AboutToModify (now called EnsureUnique,
with a complementary EnsureUnshareable) functions looked like this:

    inline void String::EnsureUnique( size_t n ) {
        bool bUnsharedIt = false;
        data_->m.Lock(); //---------------------------
        if( data_->refs > 1 ) {
            bUnsharedIt = true;
            StringBuf* newdata = new StringBuf( *data_, n );
            --data_->refs;   // now all the real work is
            data_ = newdata; // done, so take ownership
        }
        data_->m.Unlock(); //-------------------------

        if( !bUnsharedIt ) {
            data_->Reserve( n );
            data_->refs = 1; // shareable again
        }
    }

Stop here for a moment. Do you see the serious problem?

OK, here it is:

If we do the deep copy, data_ is reset, and when we perform the Unlock
we are unlocking the wrong object (oops). Nobody pointed this out,
including several people who gave detailed comments about the code,
even though my originally posted solution specifically talked about
the length of time this lock was being kept, to cover both
refs-related operations.

Instead, the code should read:

    inline void String::EnsureUnique( size_t n ) {
        bool bUnsharedIt = false;
        data_->m.Lock(); //---------------------------
        if( data_->refs > 1 ) {
            bUnsharedIt = true;
            StringBuf* newdata = new StringBuf( *data_, n );
            --data_->refs;     // now all the real work is
            data_->m.Unlock(); //-----------------------
            data_ = newdata;   // done, so take ownership
        }
        else {
            data_->m.Unlock(); //-----------------------
        }

        if( !bUnsharedIt ) {
            data_->Reserve( n );
            data_->refs = 1; // shareable again
        }
    }

COW can save time in some situations, but it does add some complexity
-- both performance and maintenance -- of its own. In fact, one
prominent library writer with several dozen years of experience told
me that, in his experience, reference-counting code has proved to be a
consistent source of bugs. Of course COW can work correctly when it's
coded correctly, but I do want to point out this often-overlooked cost
of adding complexity in the name of efficiency... especially since I
just got burned with it myself (and I'm glad I'm the one who found it
instead of having to have it pointed out to me later).

Herb


[Aside: I am not ducking my performance critique and moving off into
maintenance and other costs. Over the past few days I've considerably
expanded and updated the test harness, and Nathan and I are discussing
the new code and numbers now. My current interim results still show
COW_AtomicInt losing even when 33% of copies are never modified and
the rest are mutated only once, but I'm trying a few last
optimizations to see if I can swing it in COW's favour even more. Stay
tuned, and fasten your seatbelts for the avalanche of
continually-refined numbers.]


---
Herb Sutter (mailto:hsu...@peerdirect.com)

PeerDirect Inc.    2695 North Sheridan Way, Suite 150
www.peerdirect.com Mississauga Ontario Canada L5K 2N6


Nathan Myers
Nov 3, 1998

Herb Sutter<he...@cntc.com> wrote:
>... Over the past few days I've considerably
>expanded and updated the test harness, and Nathan and I are discussing
>the new code and numbers now. My current interim results still show
>COW_AtomicInt losing even when 33% of copies are never modified and
>the rest are mutated only once ...

Let me get another few licks in before the onslaught of numbers:

I maintain that "33% of copies are never modified" is a gross
_under_estimate. Even in programs that use op[] heavily on strings
(though most that do shouldn't) you generally find that happening
only in a few places; elsewhere the results are simply copied, or
more-meaningful operations are used, such as appending. (I.e. does
the result of op[] really mean much of anything, for most strings?)

Most programs that use strings never call op[] at all; they
read strings in, store them, compare them, hash them, once in a
while append them, and write them back out. Think of all the
database queries you've ever seen.

However, all this remains academic until we measure real, useful
programs that use string objects. There are more than enough
to establish a firm foundation of statistics.

Herb's numbers will establish, though, that certain ways of
implementing COW are indefensible by any measure.

Bill Wade
Nov 4, 1998


Herb Sutter wrote in message <363e0a39...@nr1.toronto.istar.net>...

>One disadvantage of COW is that it's more difficult than you think to
>get it right.

Certainly true. I ran some performance tests to get some numbers for this
discussion. The numbers showed that reference counts were incremented (or
initialized to 1) more often than they were decremented. There is a global
StringBuffer used for the empty string. One of the obscure factories would
reference that StringBuffer and then decide it needed to make a brand new
StringBuffer. It forgot to decrement the global buffer's reference
count when this occurred. Obviously given enough time (this is in a
program which runs for
months) it would be possible for the reference count in the global buffer to
overflow.
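The overflow risk can be defused with a saturating count, a common defensive trick (not something proposed in this thread): once the count pins at its maximum, the buffer is simply treated as immortal and leaked, rather than being wrapped around and freed while still referenced.

```cpp
#include <climits>

// Sketch of a saturating reference count: INT_MAX is a sticky ceiling.
inline void ref(int& refs) {
    if (refs != INT_MAX) ++refs;       // pin at the top instead of wrapping
}

// Returns true if the caller must free the buffer.
inline bool unref(int& refs) {
    if (refs == INT_MAX) return false; // pinned buffers are never freed
    return --refs == 0;
}
```

A leaked shared buffer after two billion increments is a far better failure mode than a double free.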

In this particular test reference counts were accessed about 22M times.
About 3M deep copies were avoided entirely (this doesn't count deep copies
of empty strings, which are cheap even for non-COW implementations).

For this test COW is cheaper, but not by much. The COW code is obviously
much more complex.

Jerry Leichter
Nov 4, 1998

| ... Every modern CPU has an atomic "decrement-and-test"
| or "test-and-swap" instruction that can be used safely, and that
| is what we have been talking about....

Really? You haven't looked at very many architectures.

HP PA RISC 2.0 has support only for a pair of semaphore operations which
atomically load and clear a word or doubleword.

SPARC V9 has Compare and Swap - but the number of V9 machines out there
is still limited. Before V9, all you had was LDSTUB, which atomically
loads a byte and sets it to all 1's.

Alpha has Load Locked/Store Conditional, which can be *used* to
implement all kinds of powerful primitives - but if you want to use
asm() instructions, you may have to be able to generate a loop. MIPS
and PowerPC are basically the same.

The Intel X86 series is very richly endowed with interlocked
instructions; no problems there. No hints I've seen yet about what IA64
will have.

In general, the more aggressive designs (everything except x86) allow
for some degree of weak memory consistency. This requires that you get
the code sequences for interlocked code *exactly* right, with the
appropriate sync's and memory barriers and such in all the right places.
This is extremely non-portable, highly sensitive code for which no even
vaguely portable API's exist. Worse, some architectures allow for
machines with different memory consistency models. If you don't know
what machine you will run on, you have to assume the weakest - even if
no current hardware actually implements it, as is the case with SPARC,
for example. This can give you code that's quite a bit more expensive
than you expect.

So: In *principle*, you can use cheap hardware primitives to implement
reference counts. In *prudent engineering practice*, it's very hard to
justify the effort to develop and maintain such code on even a single
platform, not to mention across multiple platforms.

It would really be nice if *some* standard defined an API for cheap
atomic operations. Things like atomic add, atomic compare and swap,
even perhaps atomic add to and remove from list ... The burden for
implementing these in the most efficient way possible should fall on
those who actually work with the particular hardware - compiler and OS
library developers.

Unfortunately, no such API exists. (Maybe there's something like it in
the POSIX Real-Time extensions; I haven't looked, and I have yet to use
a system that implemented them.) Bits and pieces of such an API appear
here and there, with all kinds of variations. For example, NT gives you
an atomic add - but it returns only the *sign* of the value before - or
is it after? - the operation.

For the effort required to do an efficient, reliable, multi-platform
reference-counted string implementation, I'd rather have a garbage
collector, which does away with reference counts entirely (among many
other advantages).
-- Jerry

Andrei Alexandrescu
Nov 4, 1998

Herb Sutter wrote in message <363e0a39...@nr1.toronto.istar.net>...
>GotW solutions are not exempt. As it turns out, there was a bad bug
>in my originally-posted code that nobody has yet reported (and that
>miraculously never failed through a week of single-threaded testing,
>only because the related underlying functions were quite forgiving).

I haven't followed this discussion in great detail, but about this
particular bug, it seems like one should have applied the adage: "If you
have to call functions in exact pairs, make an object that does it for you
in the constructor and in the destructor." Then you'll have language rules
enforcing correctness instead of your own discipline.

Maybe in this case the author thought things are small enough to be easily
manageable, so he didn't make this effort. As always, it turns out that
things are never "small enough".
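The adage is just RAII. A minimal guard along these lines (DemoMutex is a hypothetical stand-in matching the thread's m.Lock()/m.Unlock() interface) would have made the EnsureUnique bug impossible, because the guard stays bound to the mutex it locked even after data_ is rebound to the new buffer:

```cpp
// Stand-in for the mutex type used in the thread's String::EnsureUnique.
struct DemoMutex {
    int locks = 0, unlocks = 0;
    void Lock()   { ++locks; }
    void Unlock() { ++unlocks; }
};

// Scoped lock: the destructor always unlocks the same object the
// constructor locked, no matter what pointers are reassigned in between.
template<class Mutex>
class LockGuard {
    Mutex& m_;
public:
    explicit LockGuard(Mutex& m) : m_(m) { m_.Lock(); }
    ~LockGuard() { m_.Unlock(); }
    LockGuard(const LockGuard&) = delete;
    LockGuard& operator=(const LockGuard&) = delete;
};
```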

I love this game.

Andrei

sa...@bear.com
Nov 4, 1998

In article <71n0b7$kqa$1...@shell7.ba.best.com>,
n...@nospam.cantrip.org (Nathan Myers) wrote:
> Herb Sutter<he...@cntc.com> wrote:
> >... Over the past few days I've considerably
> >expanded and updated the test harness, and Nathan and I are discussing
> >the new code and numbers now. My current interim results still show
> >COW_AtomicInt losing even when 33% of copies are never modified and
>
> Let me get another few licks in before the onslaught of numbers:
>
> I maintain that "33% of copies are never modified" is a gross
> _under_estimate. Even in programs that use op[] heavily on strings
> (though most that do shouldn't) you generally find that happening
> only in a few places; elsewhere the results are simply copied, or
> more-meaningful operations are used, such as appending. (I.e. does
> the result of op[] really mean much of anything, for most strings?)
>
> Most programs that use strings never call op[] at all; they
> read strings in, store them, compare them, hash them, once in a
> while append them, and write them back out.

I agree that not all programs use op[] heavily. But it is also my
experience that if one passes a string like a vector and uses swap
properly, one can avoid a lot of unnecessary copies. In my earlier
postings I have shown how many times a ref-count increment happens
just to be decremented immediately, where a swap would be more
general (it will work for COW or non-COW string, vector or any STL
container) and equally efficient (or more, because no increment and
decrement op).

I have used and implemented both COW and non-COW versions of string
(USL string, RWString, home-grown COW and non-COW, SGI STL string,
DEC C++ string) in both business applications (telecom and finance)
and system applications (such as preprocessor, make, other code
generator tools), in single and multi-threaded code, for a couple of
years now. I prefer non-COW now (see Jerry Leichter's post for some
more reasons).

> Think of all the
> database queries you've ever seen.
>


What about database queries? In a database query program, one would
use a fixed-length char array and do the bind. After the fetch one
would convert the char array to a string attribute of an object.
Whether COW or non-COW, a copy will take place from the char array to
the string.

example:

struct DbEmployee {
    int age;
    char name[30]; // DbEmployee struct can be automatically generated
                   // with the correct size (for static queries, which
                   // is most prevalent in applications).
    Syb::Statement::Length name_len;

    void bind(Syb::Statement* s) {
        s->bind(&age);
        s->bind(name, sizeof(name), &name_len);
    }
};

void get_employees(Syb::Statement* s, std::list<Employee>* emp_list) {
    DbEmployee db_emp;
    s->execute("select age, name from Employee");
    db_emp.bind(s);
    while (s->fetch()) {
        emp_list->push_back(Employee(db_emp));
        // copy from char array to string will take place here
        // (COW or non-COW).
    }
}

Before somebody tells me that you can bind to a string (like RogueWave
DBTools.h++), let me tell you that we have been through that, and
DBTools.h++ is one of the slowest implementations around (compared to
our home-grown implementations).


> However, all this remains academic until we measure real, useful
> programs that use string objects. There are more than enough
> to establish a firm foundation of statistics.
>
> Herb's numbers will establish, though, that certain ways of
> implementing COW are indefensible by any measure.
>

Cheers,
Saroj Mahapatra



Nathan Myers
Nov 5, 1998

Bill Wade<bill...@stoner.com> wrote:
>I ran some performance tests to get some numbers for this
>discussion. ...

Oh good.

>In this particular test reference counts were accessed about 22M times.
>About 3M deep copies were avoided entirely (this doesn't count deep copies
>of empty strings, which are cheap even for non-COW implementations).

It would be more interesting to learn how many times the reference
counts were changed, rather than just referenced. It is the
increment-and-decrement cycle that costs. (You don't say how
many more increments there were than decrements, but anyway did
the program terminate without destroying all its strings? That
would have been an optimization, not a bug.)
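Nathan's point about the increment-and-decrement cycle can be sketched in a few lines. This is a minimal illustration, not code from the thread: std::atomic stands in for the Win32 InterlockedXxx calls being discussed, and RcBuf/Handle are made-up names.

```cpp
#include <atomic>

// Minimal sketch of the increment-and-decrement cycle Nathan is pointing
// at, using std::atomic in place of the Win32 InterlockedXxx calls.
struct RcBuf {
    std::atomic<int> refs{1};
};

struct Handle {
    RcBuf* buf;
    Handle() : buf(new RcBuf) {}
    Handle(const Handle& other) : buf(other.buf) {
        // The cheap half of the cycle: one atomic increment per copy.
        buf->refs.fetch_add(1, std::memory_order_relaxed);
    }
    ~Handle() {
        // The costly half: the decrement must synchronize, because the
        // thread dropping the last reference has to see all writes made
        // through other handles before it frees the buffer.
        if (buf->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete buf;
    }
    Handle& operator=(const Handle&) = delete;  // omitted for brevity
    int count() const { return buf->refs.load(std::memory_order_relaxed); }
};
```

Every copy/destroy pair thus costs one increment plus one synchronizing decrement, which is exactly the count Nathan asks for above.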

A. G. McDowell

Nov 5, 1998
With regard to the various comments about cheap hardware synchronisation
primitives: are there any hidden costs? I mean, it's all very well
to time a compare+swap instruction and see that it's fast, but is it
possible for such things to invalidate the cache and slow the next 1000
instructions down to a crawl? I believe the Java Virtual Machine is
allowed a lot more caching and memory optimisations between
synchronisation-related memory accesses than it is across them, and I
wonder if this reflects experience in the hardware world. I could
believe in a machine that only guaranteed that writes done by CPU A
would show up to CPU B if at least one of them had made some sort of
attempt at synchronisation in the meantime.
--
A. G. McDowell

Jerry Leichter

Nov 5, 1998
| With regard to the various comments about cheap hardware
| synchronisation primitives: are there any hidden costs? I mean,
| it's all very well to time a compare+swap instruction and see that
| it's fast, but is it possible for such things to invalidate the cache
| and slow the next 1000 instructions down to a crawl?

There are *always* costs; as Dave Butenhof has said about the pthreads
interface: Perhaps pthread_mutex_lock() should have been called
pthread_bottleneck() to make it clear what effect it *could* have on
your program. Interlocked instructions, one way or the other, have to
synchronize at least across processors, and often with memory. This is
always more expensive than operations within a single CPU.

That said: If you want to write multi-threaded code, synchronization is
a necessary evil. You want to minimize it; and you want to avoid it
entirely when the costs exceed the benefits; but sometimes you just have
to pay the piper.

The hardware primitives on decently-designed modern machines are pretty
efficient. They are highly unlikely to invalidate an entire cache.
Yes, they might slow down a couple of subsequent instructions - but it's
more likely they'll slow down 10 than 100, and no reasonable machine
will slow down 1000! Getting the most out of the hardware, however,
requires that you use the more sophisticated primitives (like compare
and swap) when they are available, rather than building everything out
of simple locks.

The generally-available software primitives can be *much* slower than
the underlying hardware. Further, the API's out there generally drop
you back to the least-common-denominator of mutex and condition
variables. It's impossible to fully utilize modern hardware using these
primitives. If you really need high-performance concurrent code, even
in 1998, you're talking machine-specific code.
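The "more sophisticated primitives" point can be illustrated with a compare-and-swap retry loop. This is a hedged modern sketch (std::atomic, not any 1998 API); real code would simply call an atomic-add primitive, but CAS is what lets you build such operations portably.

```cpp
#include <atomic>

// Sketch of building an atomic increment out of compare-and-swap, the
// "more sophisticated primitive" mentioned above. Real code would just
// call fetch_add; the point is the retry loop CAS enables.
inline int cas_increment(std::atomic<int>& n) {
    int old = n.load(std::memory_order_relaxed);
    // Retry until no other thread changed n between our load and our swap.
    while (!n.compare_exchange_weak(old, old + 1,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed)) {
        // compare_exchange_weak reloaded 'old' with the current value.
    }
    return old + 1;  // the value this thread installed
}
```

The same loop shape works for any read-modify-write operation, which is why CAS (or load-locked/store-conditional) is the primitive worth exposing.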

| I believe the Java Virtual
| Machine is allowed a lot more caching and memory optimisations between
| synchronisation-related memory accesses than it is across them, and I
| wonder if this reflects experience in the hardware world. I could
| believe in a machine that only guaranteed that writes done by CPU A
| would show up to CPU B if at least one of them had made some sort of
| attempt at synchronisation in the meantime.

Yes, the JVM is attempting to allow for so-called weak memory
consistency models. That should come as no surprise, since Sun's SPARC
allows for them! However, some of the resulting constraints can be a
bit weird!
-- Jerry

Herb Sutter

Nov 6, 1998
Here's an update on my ongoing "COW performance
measurements" project. The test harness is getting
more sophisticated, but please feel free to post or
email any further performance optimizations or
test-case refinements as you spot them.

Major differences from last time:

- implemented all performance enhancements
suggested earlier in this thread, which improved
all cases but did not materially affect the
relative results

- implemented two other enhancements that made a more
substantial difference: a) replaced copy() with
memcpy(); and b) implemented a fast allocator test
case

- included non-COW benchmarks (last time I just
compared different flavours of COW)

- added copying and mutating-and-lengthening test
cases (last time I just measured operator[])

Here is one example of a suggested performance
enhancement that actually made no difference: The
AboutToModify function in the original version took a
bool flagging whether to mark the now-unique string as
unshareable. The objection was that this runtime
decision ought to be made at compile time, since we
always call AboutToModify with either "true" or "false"
hardcoded. The current code version splits
AboutToModify into two functions -- EnsureUnique and
EnsureUnshareable -- and it actually made no
performance difference, because in the original version
(plus inlining) the compiler could see enough to do the
optimization automatically.

MORAL: Don't guess. Measure.

Never worry about micro-optimization until a
profiler or other measurement tells you to. In this
case, we have ended up with slightly more complex
hand-optimized code that has no advantage over the
original simpler compiler-optimizable code.
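The AboutToModify split described above can be sketched as follows. Buf and Str are illustrative stand-ins (not Herb's actual classes), with reference counting and cleanup reduced to the bare minimum needed to show the shape; deep_copies is instrumentation added for the sketch.

```cpp
// Sketch of the refactoring discussed above: the original
// AboutToModify(bool) versus the split EnsureUnique/EnsureUnshareable
// pair. With inlining, a compiler can fold the hardcoded bool, which is
// why the split gained nothing.
struct Buf {
    int  refs      = 1;
    bool shareable = true;
};

struct Str {
    Buf* buf         = new Buf;   // ownership/cleanup omitted for brevity
    int  deep_copies = 0;         // instrumentation for the sketch

    // Original form: a runtime flag, though every call site hardcodes it.
    void AboutToModify(bool markUnshareable) {
        if (buf->refs > 1) {      // shared: detach with a deep copy
            buf->refs--;
            buf = new Buf;
            ++deep_copies;
        }
        buf->shareable = !markUnshareable;
    }
    // Split form: the compile-time decision moves into the function name.
    void EnsureUnique()      { AboutToModify(false); }
    void EnsureUnshareable() { AboutToModify(true);  }
};
```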

See the source code header for more comments.

See the accompanying followup for the results summary
and detailed numbers.

Herb


---
Herb Sutter (mailto:hsu...@peerdirect.com)

PeerDirect Inc. 2695 North Sheridan Way, Suite 150
www.peerdirect.com Mississauga Ontario Canada L5K 2N6

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Herb Sutter

Nov 6, 1998

Here's an update on my ongoing "COW performance
measurements" project. The moral of the story is
one that GotW readers will be familiar with by now:
When it comes to performance, never guess. Measure.


RESULTS SUMMARY
---------------

Thread-safe COW is a pessimization in many real
programs.

I focused on comparing Plain with COW_AtomicInt,
which was generally the most efficient thread-safe
COW implementation. The results were as follows:

1. For all mutating and possibly-mutating operations,
COW_AtomicInt was always worse than Plain. This is
natural and expected.

2. COW should shine when there are many unmodified
copies, but for an average string length of 50:

a) When 33% of all copies were never modified,
and the rest were modified only once each,
COW_AtomicInt was still slower than Plain.

b) When 50% of all copies were never modified,
and the rest were modified only thrice each,
COW_AtomicInt was still slower than Plain.

This result may be more surprising to many --
particularly that COW_AtomicInt is slower in cases
where there are more copy operations than mutating
operations in the entire system!

Note that, in both cases, traditional thread-unsafe
COW did perform better than Plain. This shows that
indeed COW can be an optimization for purely single-
threaded environments, but it is less often
appropriate for thread-safe code.

3. It is a myth that COW's principal advantage lies in
avoiding memory allocations. Especially for longer
strings, COW's principal advantage is that it avoids
copying the characters in the string.

4. Optimized allocation, not COW, was a consistent true
speed optimization in all cases (but note that it
does trade off space). Here is perhaps the most
important conclusion from the Detailed Measurements
section:

"* Most of COW's primary advantage for small strings could be
gained without COW by using a more efficient allocator.
(Of course, you could also do both -- use COW and an
efficient allocator.)"

In this test environment I tested six main
flavours of string implementations:

Name Description
--------------- ---------------------------------------------------
Plain Non-use-counted string; all others modeled on this
COW_Unsafe Plain + COW, no thread safety
COW_AtomicInt Plain + COW + thread-safe (Win32 atomic int ops)
COW_AtomicInt2 COW_AtomicInt + StringBuf in same buffer as data
COW_CritSec Plain + COW + thread-safe (Win32 critical sections)
COW_Mutex Plain + COW + thread-safe (Win32 mutexes)

Q: Why measure something as inefficient as
COW_CritSec? A: Because at least one popular
commercial basic_string implementation used this method
as recently as a year ago (and perhaps still does, I
haven't seen their code lately), despite the fact that
COW_CritSec is nearly always a pessimization. Be sure
to check whether your library vendor is doing this,
because if the library is built for possible
multithreaded use then you will bear the performance
cost all the time -- even if your program is
single-threaded.

COW_AtomicInt2 is the same as COW_AtomicInt except that,
instead of allocating a StringBuf and then separately
allocating the data buffer, the StringBuf and data are
in the same allocated block of memory. Note that all
other COW_* variants use a fast allocator for the
StringBuf object (so that there's no unfair "double
allocation" cost), and the purpose of COW_AtomicInt2 is
mainly to demonstrate that I have actually addressed
that concern... COW_AtomicInt2 is actually slightly
slower than COW_AtomicInt for most operations (because
of the extra logic).
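The single-block layout behind COW_AtomicInt2 can be sketched like this. Field and function names are illustrative, not from Herb's harness; the point is simply that the refcount header and the character data come from one allocation, so a deep copy costs one allocation instead of two.

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>
#include <new>

// Sketch of the COW_AtomicInt2 layout: refcount header and character
// data share one allocated block.
struct SharedBuf {
    std::atomic<int> refs{1};
    std::size_t      len{0};

    // The characters live immediately after the header in the same block.
    char* chars() { return reinterpret_cast<char*>(this + 1); }

    static SharedBuf* create(const char* s, std::size_t n) {
        void* raw = ::operator new(sizeof(SharedBuf) + n + 1);
        SharedBuf* b = new (raw) SharedBuf;   // refs starts at 1
        b->len = n;
        std::memcpy(b->chars(), s, n);
        b->chars()[n] = '\0';
        return b;
    }
    static void destroy(SharedBuf* b) {
        b->~SharedBuf();
        ::operator delete(b);
    }
};
```

The "extra logic" Herb mentions is visible here: every access to the characters goes through the pointer arithmetic in chars(), and create/destroy must pair placement new with explicit destruction.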

I also threw in a seventh flavour to measure the result
of optimizing memory allocation instead of optimizing
copying:

Plain_FastAlloc Plain + an optimized memory allocator

NOTE: Yes, Plain_FastAlloc is a fair comparison case.
My test implementation of a FastArena allocator
includes thread-safety overhead via IntAtomic-
Increment/Decrement functions, and these are
included in the timings.
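The kind of fixed-size "fast allocator" being compared here can be sketched as a simple free list that recycles equal-sized blocks, so most allocations are a pointer pop rather than a trip to the general-purpose heap. This is an illustrative single-threaded sketch; as the NOTE says, Herb's FastArena additionally pays for thread safety with atomic operations.

```cpp
#include <cstddef>
#include <new>

// Free-list allocator for equal-sized blocks (single-threaded sketch).
template <std::size_t BlockSize>
class FixedPool {
    union Node { Node* next; unsigned char storage[BlockSize]; };
    Node* free_ = nullptr;
public:
    ~FixedPool() {                   // release whatever was recycled
        while (free_) {
            Node* n = free_;
            free_ = n->next;
            ::operator delete(n);
        }
    }
    void* allocate() {
        if (free_) {                 // fast path: pop a recycled block
            Node* n = free_;
            free_ = n->next;
            return n;
        }
        return ::operator new(sizeof(Node));  // slow path: real allocation
    }
    void deallocate(void* p) {       // push the block back on the free list
        Node* n = static_cast<Node*>(p);
        n->next = free_;
        free_ = n;
    }
    bool has_free() const { return free_ != nullptr; }
};
```

In a string workload that churns through many same-sized StringBufs, nearly every allocation hits the fast path, which is where Plain_FastAlloc's advantage comes from.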

I also tested the relative performance of various
integer operations (incrementing int, incrementing
volatile int, and incrementing int using the Win32
atomic integer operations), to ensure that
COW_AtomicInt results were not unfairly skewed by poor
implementations or function call overhead.

APPROACH
--------

To assess COW, I performed measurements of three kinds
of functions:

- copying (where COW shines, its raison d'etre)

- mutating operations that could trigger reallocation
(represented by Append, which gradually lengthens;
this is to make sure any conclusions drawn can
take into account periodic reallocation overhead
due to normal string use)

- possibly-mutating operations that do not change
length enough to trigger reallocation, or that do
not actually mutate the string at all (represented
by operator[])

It turns out that the last two both incur a constant
(and similar, within ~20%) cost per operation, and can
be roughly considered together. Assuming for
simplicity that mutating-and-extending operations like
Append (235ms overhead) and possibly-mutating
operations like operator[] (280ms overhead) will be
about equally frequent, the COW_AtomicInt overhead for
mutating and possibly-mutating operations is about
260ms per 1,000,000 operations in this implementation.

Finally, for each of 2(a) and 2(b), I first used the
"Raw Measurements" section below to hand-calculate a
rough prediction of expected relative performance, then
ran the test to check actual performance.

SUMMARY FOR CASE 2(a):

PREDICTION

COW_AtomicInt Cost Plain Cost
------------------------- ----------------------
1M shallow copies 1M deep copies
and dtors 400 and dtors 1600
667K mutations ??? ???
667K deep copies 1060
extra overhead on
667K deep copies ???
extra overhead on
667K mutations 175
----- -----
1635+ 1600+

TEST
(program that makes copies in a tight loop, and
modifies 33% of them with a single Append and
another 33% of them with a single op[])

Running 1000000 iterations with strings of length 50:
Plain_FastAlloc 642ms copies: 1000000 allocs: 1000007
Plain 1726ms copies: 1000000 allocs: 1000007
COW_Unsafe 1629ms copies: 1000000 allocs: 666682
COW_AtomicInt 1949ms copies: 1000000 allocs: 666682
COW_AtomicInt2 1925ms copies: 1000000 allocs: 666683
COW_CritSec 10660ms copies: 1000000 allocs: 666682
COW_Mutex 33298ms copies: 1000000 allocs: 666682

SUMMARY FOR CASE 2(b):

PREDICTION

COW_AtomicInt Cost Plain Cost
------------------------- ----------------------
1M shallow copies 1M deep copies
and dtors 400 and dtors 1600
1.5M mutations ??? ???
500K deep copies 800
extra overhead on
500K deep copies ???
extra overhead on
1.5M mutations 390
----- -----
1590+ 1600+

TEST
(program that makes copies in a tight loop, and
modifies 25% of them with three Appends and
another 25% of them with three operator[]s)

Running 1000000 iterations with strings of length 50:
Plain_FastAlloc 683ms copies: 1000000 allocs: 1000007
Plain 1735ms copies: 1000000 allocs: 1000007
COW_Unsafe 1407ms copies: 1000000 allocs: 500007
COW_AtomicInt 1838ms copies: 1000000 allocs: 500007
COW_AtomicInt2 1700ms copies: 1000000 allocs: 500008
COW_CritSec 8507ms copies: 1000000 allocs: 500007
COW_Mutex 31911ms copies: 1000000 allocs: 500007


RAW MEASUREMENTS
----------------


TESTING CONST COPYING + DESTRUCTION: The target case of COW

Notes:
- COW_AtomicInt always took over twice as long to create and
destroy a const copy as did plain thread-unsafe COW.
- For every copy of a string that was later modified,
COW_AtomicInt added constant unrecoverable overhead
(400ms per 1,000,000) not counting the overhead on other
operations.
* Most of COW's primary advantage for small strings could be
gained without COW by using a more efficient allocator.
(Of course, you could also do both -- use COW and an
efficient allocator.)
* COW's primary advantage for large strings lay, not in
avoiding the allocations, but in avoiding the char copying.

Running 1000000 iterations with strings of length 10:
Plain_FastAlloc 495ms copies: 1000000 allocs: 1000003
Plain 1368ms copies: 1000000 allocs: 1000003
COW_Unsafe 160ms copies: 1000000 allocs: 3
COW_AtomicInt 393ms copies: 1000000 allocs: 3
COW_AtomicInt2 433ms copies: 1000000 allocs: 4
COW_CritSec 428ms copies: 1000000 allocs: 3
COW_Mutex 14369ms copies: 1000000 allocs: 3

Running 1000000 iterations with strings of length 50:
Plain_FastAlloc 558ms copies: 1000000 allocs: 1000007
Plain 1598ms copies: 1000000 allocs: 1000007
COW_Unsafe 165ms copies: 1000000 allocs: 7
COW_AtomicInt 394ms copies: 1000000 allocs: 7
COW_AtomicInt2 412ms copies: 1000000 allocs: 8
COW_CritSec 433ms copies: 1000000 allocs: 7
COW_Mutex 14130ms copies: 1000000 allocs: 7

Running 1000000 iterations with strings of length 100:
Plain_FastAlloc 708ms copies: 1000000 allocs: 1000008
Plain 1884ms copies: 1000000 allocs: 1000008
COW_Unsafe 171ms copies: 1000000 allocs: 8
COW_AtomicInt 391ms copies: 1000000 allocs: 8
COW_AtomicInt2 412ms copies: 1000000 allocs: 9
COW_CritSec 439ms copies: 1000000 allocs: 8
COW_Mutex 14129ms copies: 1000000 allocs: 8

Running 1000000 iterations with strings of length 250:
Plain_FastAlloc 1164ms copies: 1000000 allocs: 1000011
Plain 5721ms copies: 1000000 allocs: 1000011 [*]
COW_Unsafe 176ms copies: 1000000 allocs: 11
COW_AtomicInt 393ms copies: 1000000 allocs: 11
COW_AtomicInt2 419ms copies: 1000000 allocs: 12
COW_CritSec 443ms copies: 1000000 allocs: 11
COW_Mutex 14118ms copies: 1000000 allocs: 11

Running 1000000 iterations with strings of length 1000:
Plain_FastAlloc 2865ms copies: 1000000 allocs: 1000014
Plain 4945ms copies: 1000000 allocs: 1000014
COW_Unsafe 173ms copies: 1000000 allocs: 14
COW_AtomicInt 390ms copies: 1000000 allocs: 14
COW_AtomicInt2 415ms copies: 1000000 allocs: 15
COW_CritSec 439ms copies: 1000000 allocs: 14
COW_Mutex 14059ms copies: 1000000 allocs: 14

Running 1000000 iterations with strings of length 2500:
Plain_FastAlloc 6244ms copies: 1000000 allocs: 1000016
Plain 8343ms copies: 1000000 allocs: 1000016
COW_Unsafe 174ms copies: 1000000 allocs: 16
COW_AtomicInt 397ms copies: 1000000 allocs: 16
COW_AtomicInt2 413ms copies: 1000000 allocs: 17
COW_CritSec 446ms copies: 1000000 allocs: 16
COW_Mutex 14070ms copies: 1000000 allocs: 16

TESTING APPEND: An always-mutating periodically-reallocating operation

Notes:
- Plain always outperformed COW.
- The overhead of COW_AtomicInt compared to Plain did not
vary greatly with string lengths: It was fairly constant
at about 235ms per 1,000,000 operations.
- The overhead of COW_AtomicInt compared to COW_Unsafe did not
vary greatly with string lengths: It was fairly constant
at about 110ms per 1,000,000 operations.
* The overall ever-better performance for longer strings was
due to the allocation strategy (see GotW #43), not COW vs.
Plain issues.

Running 1000000 iterations with strings of length 10:
Plain_FastAlloc 302ms copies: 0 allocs: 272730
Plain 565ms copies: 0 allocs: 272730
COW_Unsafe 683ms copies: 0 allocs: 272730
COW_AtomicInt 804ms copies: 0 allocs: 272730
COW_AtomicInt2 844ms copies: 0 allocs: 363640
COW_CritSec 825ms copies: 0 allocs: 272730
COW_Mutex 8419ms copies: 0 allocs: 272730

Running 1000000 iterations with strings of length 50:
Plain_FastAlloc 218ms copies: 0 allocs: 137262
Plain 354ms copies: 0 allocs: 137262
COW_Unsafe 474ms copies: 0 allocs: 137262
COW_AtomicInt 588ms copies: 0 allocs: 137262
COW_AtomicInt2 536ms copies: 0 allocs: 156871
COW_CritSec 607ms copies: 0 allocs: 137262
COW_Mutex 7614ms copies: 0 allocs: 137262

Running 1000000 iterations with strings of length 100:
Plain_FastAlloc 182ms copies: 0 allocs: 79216
Plain 257ms copies: 0 allocs: 79216
COW_Unsafe 382ms copies: 0 allocs: 79216
COW_AtomicInt 492ms copies: 0 allocs: 79216
COW_AtomicInt2 420ms copies: 0 allocs: 89118
COW_CritSec 535ms copies: 0 allocs: 79216
COW_Mutex 7810ms copies: 0 allocs: 79216

Running 1000000 iterations with strings of length 250:
Plain_FastAlloc 152ms copies: 0 allocs: 43839
Plain 210ms copies: 0 allocs: 43839
COW_Unsafe 331ms copies: 0 allocs: 43839
COW_AtomicInt 438ms copies: 0 allocs: 43839
COW_AtomicInt2 366ms copies: 0 allocs: 47825
COW_CritSec 485ms copies: 0 allocs: 43839
COW_Mutex 7358ms copies: 0 allocs: 43839

Running 1000000 iterations with strings of length 1000:
Plain_FastAlloc 123ms copies: 0 allocs: 14000
Plain 149ms copies: 0 allocs: 14000
COW_Unsafe 275ms copies: 0 allocs: 14000
COW_AtomicInt 384ms copies: 0 allocs: 14000
COW_AtomicInt2 299ms copies: 0 allocs: 15000
COW_CritSec 421ms copies: 0 allocs: 14000
COW_Mutex 7265ms copies: 0 allocs: 14000

Running 1000000 iterations with strings of length 2500:
Plain_FastAlloc 122ms copies: 0 allocs: 6416
Plain 148ms copies: 0 allocs: 6416
COW_Unsafe 279ms copies: 0 allocs: 6416
COW_AtomicInt 380ms copies: 0 allocs: 6416
COW_AtomicInt2 304ms copies: 0 allocs: 6817
COW_CritSec 405ms copies: 0 allocs: 6416
COW_Mutex 7281ms copies: 0 allocs: 6416

TESTING OPERATOR[]: A possibly-mutating operation, never does mutate

Notes:
- Plain always vastly outperformed COW.
- Results were independent of string lengths.
- The overhead of COW_AtomicInt compared to Plain was
constant at about 280ms per 1,000,000 operations.
- COW_AtomicInt2 fared better in this test case, but
COW_AtomicInt did better overall and so I am focusing
on comparing that with Plain.

[10x iterations] Running 10000000 iterations with strings of length 10:
Plain_FastAlloc 3ms copies: 0 allocs: 3 [*]
Plain 2ms copies: 0 allocs: 3 [*]
COW_Unsafe 1698ms copies: 0 allocs: 3
COW_AtomicInt 2833ms copies: 0 allocs: 3
COW_AtomicInt2 2112ms copies: 0 allocs: 4
COW_CritSec 3587ms copies: 0 allocs: 3
COW_Mutex 71787ms copies: 0 allocs: 3

[*] within measurement error margin, both varied from 0ms to 9ms

TESTING VARIOUS INTEGER INCREMENT/DECREMENT OPERATIONS

Test Summary:
- "plain" performs the operations on normal
nonvolatile ints
- "volatile" is the only case to use volatile ints
- "atomic" uses the Win32 InterlockedXxx operations
- "atomic_ass" uses inline x86 assembler locked
integer operations

Notes:
- ++atomic took only three times as long as either
++volatile or unoptimized ++plain
- ++atomic does not incur function call overhead

[100x iterations] Running 100000000 iterations for integer operations:

++plain 2404ms, counter=100000000
--plain 2399ms, counter=0

++volatile 2400ms, counter=100000000
--volatile 2405ms, counter=0

++atomic 7480ms, counter=100000000
--atomic 9340ms, counter=0

++atomic_ass 8881ms, counter=100000000
--atomic_ass 10964ms, counter=0


Here are a few extra notes on the relative timings of
various flavours of x86 assembler implementations of
IntAtomicIncrement (these timings were taken under the
same conditions as above and can be compared directly):

Instructions Timing
--------------------------- --------
__asm mov eax, 1
__asm lock xadd i, eax
__asm mov result, eax ~11000ms

__asm mov eax, 1
__asm lock xadd i, eax ~10400ms

__asm lock inc i ~8900ms
(this is the one actually used above)

Note that the non-atomic versions are much better, and
map directly onto the "plain" timings:

__asm inc i ~2400ms

Conclusion: So there is indeed overhead introduced by
the x86 LOCK instruction, even on a single-CPU machine.
This is natural and to be expected, but I point it out
because some people said there should be no difference.
Moral: Never guess. Measure.
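The shape of this micro-benchmark can be sketched as follows. This is a hedged modern reconstruction, not Herb's harness: std::atomic stands in for InterlockedIncrement / "lock inc", and on typical x86 hardware the atomic loop runs several times slower than the plain one, though exact ratios vary by CPU, so measure.

```cpp
#include <atomic>
#include <chrono>

// Time a tight loop of plain increments against the same loop done
// atomically.
template <class Fn>
long long time_ms(Fn body) {
    auto t0 = std::chrono::steady_clock::now();
    body();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0)
        .count();
}

int run_plain(int n) {
    volatile int c = 0;              // volatile so the loop isn't elided
    for (int i = 0; i < n; ++i)
        c = c + 1;
    return c;
}

int run_atomic(int n) {
    std::atomic<int> c{0};
    for (int i = 0; i < n; ++i)
        c.fetch_add(1, std::memory_order_relaxed);  // locked RMW each pass
    return c.load();
}
```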

I am very impressed that Win32's InterlockedIncrement
is actually faster at 765ms than my hand-coded
assembler at 900ms, even though my hand-coded version
actually does less work (only a single instruction!)
because it doesn't return the new value. Of course,
I'm no x86 assembler expert; the explanation is
certainly that the OS's wrapper is using a different
opcode than my hand-coded version.

Finally, of course, note that the Win32 atomic int
functions clearly are not incurring function-call
overhead. Moral: Never guess. Measure.

Herb


---
Herb Sutter (mailto:hsu...@peerdirect.com)

PeerDirect Inc. 2695 North Sheridan Way, Suite 150
www.peerdirect.com Mississauga Ontario Canada L5K 2N6

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Herb Sutter

Nov 6, 1998
On 30 Oct 1998 22:38:17 -0500, n...@nospam.cantrip.org (Nathan Myers)
wrote:

>Herb Sutter<he...@cntc.com> wrote:
>>... My test results were based on using Win32's
>>InterlockedDecrement and related functions, and as you saw there is
>>indeed measurable overhead (although much less than for a mutex or a
>>critical section).
>
>As noted

c/noted/assumed/

>previously, the overhead you measured was from a wasted
>function call or two per op[] operation, not from the atomic
>operations.

Never assume. Measure.

See the latest results posting. There is no function-call overhead.
The Win32 operations are as efficient as hand-coded inline assembler.

The code is available for anyone to rip apart at their leisure.

Herb


---
Herb Sutter (mailto:hsu...@peerdirect.com)

PeerDirect Inc. 2695 North Sheridan Way, Suite 150

Nathan Myers

Nov 6, 1998
Jerry Leichter <leic...@smarts.com> wrote:
>| ... Every modern CPU has an atomic "decrement-and-test"
>| or "test-and-swap" instruction that can be used safely, and that
>| is what we have been talking about....

Summarizing:
>HP PA RISC 2.0 [is crippled.]


>SPARC V9 has Compare and Swap

>SPARC Vn, n<9 is semi-crippled
>Alpha, MIPS, PowerPC have Load Locked/Store Conditional
>Intel X86 series [has] interlocked instructions


>
>So: In *principle*, you can use cheap hardware primitives to implement
>reference counts. In *prudent engineering practice*, it's very hard to
>justify the effort to develop and maintain such code on even a single
>platform, not to mention across multiple platforms.

It is hard to justify it for one application, easy to justify in a
widely-used library. The GNU Glibc is a reasonable place to put such
operations, and in fact it provides some for HPPA, Alpha, MIPS, PPC,
SPARC, and x86.

>It would really be nice if *some* standard defined an API for cheap
>atomic operations. Things like atomic add, atomic compare and swap,
>even perhaps atomic add to and remove from list ...
>The burden for
>implementing these in the most efficient way possible should fall on
>those who actually work with the particular hardware - compiler and OS
>library developers.

Or on those most interested in portability.

>For the effort required to do an efficient, reliable, multi-platform
>reference-counted string implementation, I'd rather have a garbage
>collector, which does away with reference counts entirely (among many
>other advantages).

Garbage collection can be handy, but this reminds me of the wishing
game: why wish for a ham sandwich when you can wish for the moon?
We have the beginnings of a portable (and widely ported) library
as Jerry describes. To be precise, it provides spinlock for many
architectures, and atomic add and compare-and-swap for a few.
Presumably you would use these things in your garbage collector
anyway.

Bill Wade

Nov 6, 1998

Nathan Myers wrote in message <71qi1f$9mk$1...@shell7.ba.best.com>...

>It would be more interesting to learn how many times the reference
>counts were changed, rather than just referenced. It is the
>increment-and-decrement cycle that costs.

Almost all the references were increments or decrements.

Some reference count counts:

Decremented: 10.5M
Incremented: 10.0M
Explicit reference of "" 6.0M (this is included in the 10.0M above)
StringBuffer Construction 2.2M (sets reference count to 1)
Read, but don't change 1.7M
AboutToModify 0.8M

AboutToModify does two reads (this was debug code and the second read was an
assert(refs==1) before returning) and possibly a decrement and buffer
construction pair.

The explicit reference of "" is used by various constructors when they
determine they're making an empty string.

Copy assignment occurred 3.9M times. This is a deferred deep copy.
Copy construction occurred 14.1k times, the other case of a deferred deep copy.

Some of these assignments would have involved assignment to a StringBuffer
which was already big enough, and would not need a new/delete even for
non-COW. They would need a memcpy in this case. I don't know how many fit
in this category. I'd need better instrumentation, and I don't have time to
do that right now.

Some of these copies would have been hit with AboutToModify before the
source was destroyed, resulting in a deep copy and no COW savings. I expect
that situation to be rare, probably less than one in four of the
AboutToModify calls.

String constructors were called as follows:
String::String() 4.3M
String::String(const char*) .4M
Copy constructor .014M
Other 1.8M

"Other" is used to convert strings from some legacy (Fortran) formats. Some
of these referenced the empty string before determining that the result
would be non-empty.

>(You don't say how
>many more increments there were than decrements, but anyway did
>the program terminate without destroying all its strings? That
>would have been an optimization, not a bug.)

String constructors exactly matched String destructors at 6.6M each.
StringBuffer constructors exactly matched StringBuffer destructors at 2.2M
each.

Herb Sutter

Nov 6, 1998
An enhanced version of my test harness, incorporating the suggestions
to date, has now been published here. Let me add a few notes by way of
reply to earlier posts... note that these are intended primarily to
defend my testing approach (I'm refraining from repeating comments
about COW vs. non-COW in this article; see the "results" article for
that).


On 28 Oct 1998 01:11:51 -0500, n...@nospam.cantrip.org (Nathan Myers)
wrote:


>Anyway, a poor implementation intended to be "tutorial, not optimal"
>is a bad basis for a sweeping condemnation of a very effective and
>mature implementation technique.

...for traditional single-threaded code, and which adds overhead
(which must be considered) when thread safety is important.

Of course, but it would be wrong to assume that my code was poor.
Several of the objections about it were invalid, as could have been
seen by simply reading the code.

After implementing all of your and others' suggested changes, overall
performance was improved for all cases without affecting relative
results. Note that two optimizations that actually did make a
significant difference were ones nobody suggested: a) replacing copy()
with memcpy(); and b) adding a test case that optimizes the allocator
rather than copying. Together, these indicated that COW's copying
advantage primarily comes from avoiding the cost of copying the chars,
not avoiding an allocation.

Good point about having test cases for copying and appending, not just
op[]. They are now there.

>>>What Herb neglected to take into account
[...]


>> Before you say what I neglected, you may want to recheck

[...]
>OK, no locks.

I think "OK, sorry, I should have read your code before telling you
what it didn't do" would be closer. :)

>However, this still adds _bags_ of unnecessary
>overhead, because it generates a(nother!) function call on every
>op[] call. The test should be inlined, and the encoding for
>member refs should allow both conditions to be identified on
>a single comparison.

Too many assumptions.

Done, but the difference wasn't substantial. There are no function
calls anywhere in sight. Note that my implementations of IntAtomicXxx
are inlined (with the same performance when #define'd) to call the
Win32 InterlockedXxx's directly. And the Win32 operations are
efficient... InterlockedIncrement takes only three times as long as a
plain preincrement of an int, for example, and is as fast as inline
assembler "LOCK INC".

>Some operations always set "unshareable", and some always clear it;
>it's known at compile time which, so passing it as a runtime argument
>to the deep-copy operation just adds overhead.

Done, but this didn't affect performance at all, after inlining.
Presumably this was because any optimizer would optimize "if(false)"
automatically (I admit that I didn't read the assembly to check that
it was doing this, though).

>Anyway, operator[] should itself be inlined.

Done. All functions are now inline. This improved all versions about
the same across the board.

>Furthermore, if the string is implemented like this:
>
> String
> data_ -> StringBuf
> buf -> char[...]
> refs
> used
> etc.
>
>a deep-copy involves two allocations, which is a horrendous overhead.

<dig to point out mutually contradictory claims>
It shouldn't, if you're right about the amount of the time that COW
avoids deep copies. :-)
</dig>

>the COW string should also use a single allocated block, like so:
>
> etc.
> used
> String refs
> data_ -> char [ ... ]

For all but one of the COW versions, I still use a distinct StringBuf,
but with a fast (fixed-size) allocator. This eliminates most of the
double-allocation effect you're worried about.

I have added a COW_AtomicInt2 that performs the above optimization,
and it is marginally slower in some cases (because of some extra
fiddling to make this work) and somewhat faster for operator[]. Mostly
a wash, but I included it throughout the results.

>Finally, it makes a big difference in real programs if your
>"deep copy" can allocate the right size for the intended operation.
>Doing a copy followed by another reallocation to grow the string is
>bad for performance

This is another "Yes, that would be dumb, and my code doesn't do it."
Please see AboutToModify, in cooperation with the StringBuf copy ctor.
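The point about sizing the deep copy for the intended operation can be sketched like this (illustrative names, not the actual AboutToModify/StringBuf code): pass the target capacity into the copy, so growth and copy happen in one allocation instead of copy-then-reallocate.

```cpp
#include <cstddef>
#include <cstring>

// A buffer whose constructor copies into storage sized for the operation
// that triggered the copy.
struct Buf {
    std::size_t len, cap;
    char* data;
    Buf(const char* s, std::size_t n, std::size_t capacity)
        : len(n), cap(capacity < n ? n : capacity), data(new char[cap + 1]) {
        std::memcpy(data, s, n);
        data[n] = '\0';
    }
    ~Buf() { delete[] data; }
    Buf(const Buf&) = delete;        // keep the sketch single-owner
};

// A deep copy sized for the append we are about to perform: one
// allocation, not copy-then-reallocate.
Buf* copy_for_append(const Buf& src, std::size_t extra) {
    return new Buf(src.data, src.len, src.len + extra);
}
```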

>When I worked at [...], we were told in no uncertain


>terms when we harmed users' performance.

(See that vendor's thread-safe implementation of basic_string, which
had strikingly poor performance because it locked a critical section
on every access to the reference count -- at least, it did so until
recently, but maybe it's been changed since I last saw it.)

>The overwhelming influence
>on performance, swamping anything else by a huge margin, was the
>number of allocations performed. COW does fewer.

Not quite. At least in my test harness, the overwhelming influence on
performance was the number of char copies (not allocations) performed.
COW does fewer.

>a non-COW
>implementation adds overwhelming overhead to copy/assign operations.

See my latest test results. For strings of length 50, "non-COW copy +
dtor" took about four times as long as "COW_AtomicInt shallow copy +
dtor" -- substantial, but not an order of magnitude or anything as you
seem to imply. For that string length, simply using a fast allocator
nearly equalled COW's performance (without adding COW's complexity).
And in my tests, even when 50% of the copies were never changed (and
the others changed only thrice each), COW_AtomicInt came out behind.

>Herb! Do you really maintain that real programs don't (or shouldn't)
>assign strings? This would be a _remarkable_ statement, either way.

That would be remarkable, and it's not what I said. I was merely
questioning the earlier statement that:

>A fair comparison would, like any real program, do more copies
>than modifications.

In a program like that, COW can win (and lose; see the results). It
does not seem reasonable to me, however, to say that "any real
program" does so (unless there are numbers -- could you cite a
study?). When I have time, I want to take some of my commercial
programs and instrument basic_string to record counts of all
operations -- the results should make a very interesting database, and
who knows, they might prove me wrong. I would like to encourage all
readers who are interested to take such measurements of their programs
and publish the results: How many copies are performed? How many
mutating operations? How many copies are never modified (and have
their originals unmodified too, of course; i.e., how many don't ever
need a deep copy)?
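A minimal sketch of that instrumentation might be a wrapper that tallies copies and mutations, so a real program can report its own copy-vs-modify mix (names and scope are illustrative, not an actual instrumented basic_string):

```cpp
#include <cassert>
#include <string>

// Sketch: count copies and mutating operations on a string wrapper.
struct CountingString {
    static long copies;     // copy ctor + copy assignment
    static long mutations;  // mutating operations
    std::string s;

    explicit CountingString(const char* p) : s(p) {}
    CountingString(const CountingString& o) : s(o.s) { ++copies; }
    CountingString& operator=(const CountingString& o) {
        s = o.s; ++copies; return *this;
    }
    void append(const char* p) { s += p; ++mutations; }
};
long CountingString::copies = 0;
long CountingString::mutations = 0;
```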

>Any comparison of techniques which avoids measuring copy operations
>stacks the deck

I agree, and you'll like my revised test harness better.

>Herb [...] measured a seriously suboptimal implementation.

See above. All suggested changes were made, and some made an
across-the-board difference while others made no difference at all.
(Moral: Don't guess. Measure.)

The two changes that made a lot of difference were not suggested by
anyone. This shows the difficulty involved in guessing about
performance bottlenecks absent real measurements, even in a domain
where one has great experience (and you definitely have great
experience with commercial string library implementations).

>Second, he chose
>the "simplest of operations" to favor one technique. Real string
>types offer many "simple" operations to measure, and a comparison
>that ignores all the ones that don't favor your own bias is neither
>fair nor meaningful.

A valid point; done. Incidentally, I have no bias other than
performance... this whole thing grew out of my annoyance over two
years ago with one particular vendor's poor implementation (using
critical sections every time the reference count was accessed -- and
wouldn't listen to my problem report).

Herb


---
Herb Sutter (mailto:hsu...@peerdirect.com)

PeerDirect Inc. 2695 North Sheridan Way, Suite 150
www.peerdirect.com Mississauga Ontario Canada L5K 2N6

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Matt Austern

Nov 6, 1998, 3:00:00 AM
he...@cntc.com (Herb Sutter) writes:

> A valid point; done. Incidentally, I have no bias other than
> performance... this whole thing grew out of my annoyance over two
> years ago with one particular vendor's poor implementation (using
> critical sections every time the reference count was accessed -- and
> wouldn't listen to my problem report).

If it was a Win95 program, there might have been reasons for using
critical sections instead of Interlocked[Inc|Dec]rement. I was
surprised to find, looking at the Win32 documentation, that
InterlockedDecrement is not an atomic version of --i. On Win95
and NT 3.5, it's more like an atomic version of
    long tmp = --i;
    if (tmp < 0)
        return some_negative_number;
    else if (tmp > 0)
        return some_positive_number;
    else
        return 0;

Admittedly, this is usually good enough for reference counting. (All
you usually care about is whether it goes to zero.) I've come across
other cases, though, where you really need to know the exact answer.
In those cases, if you want to write code that's portable across all
Win32 systems, there's no sensible choice but to use critical sections.
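The safety of the reference-counting case can be seen in a small simulation (this is not the Win32 API itself, just a model of the weaker semantics described above):

```cpp
#include <cassert>

// Simulation of the pre-NT4 semantics: the return value reliably
// carries only the *sign* of the decremented counter, not its value.
long WeakInterlockedDecrement(long* p) {
    long tmp = --*p;            // pretend this step is atomic
    if (tmp < 0) return -999;   // some unspecified negative number
    if (tmp > 0) return  999;   // some unspecified positive number
    return 0;
}

// Reference-count release logic tests only for zero, so it still
// works correctly under the weaker semantics.
bool ReleaseRef(long* refs) {
    return WeakInterlockedDecrement(refs) == 0;  // true => last owner
}
```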

Herb Sutter

Nov 6, 1998, 3:00:00 AM
On 4 Nov 1998 05:12:35 -0500, "Bill Wade" <bill...@stoner.com>
wrote:

>Herb Sutter wrote in message <363e0a39...@nr1.toronto.istar.net>...
>>One disadvantage of COW is that it's more difficult than you think to
>>get it right.
>
>Certainly true. I ran some performance tests to get some numbers for this
>discussion.

Thanks for the followup, but thanks even more for taking the time to
actually write code and take measurements.

>In this particular test reference counts were accessed about 22M times.
>About 3M deep copies were avoided entirely (this doesn't count deep copies
>of empty strings, which are cheap even for non-COW implementations).
>
>For this test COW is cheaper, but not by much. The COW code is obviously
>much more complex.

What was the test? (Described in simple terms, that is... e.g., one of
mine was "33% of copies never changed, the rest changed only once.")

Herb



Herb Sutter

Nov 6, 1998, 3:00:00 AM
On 26 Oct 1998 18:58:16 -0500, n...@nospam.cantrip.org (Nathan Myers)
wrote:
>[I *really* wish Herb had checked
>with me before posting.] In fact a correctly-coded thread-safe
>COW implementation is not appreciably slower than a non-thread-
>safe one, given atomic reference-count operations: load, increment,
>decrement-and-test.

To sum up: In fact, a correctly-coded thread-safe COW implementation
is both measurably and unavoidably slower than a non-thread-safe one,
as I have demonstrated since.

Herb Sutter

Nov 6, 1998, 3:00:00 AM
On 4 Nov 1998 05:18:39 -0500, "Andrei Alexandrescu"
<alexan...@micromodeling.com> wrote:
>Maybe in this case the author thought things are small enough to be easily
>manageable, so he didn't make this effort. As always, it turns out that
>things are never "small enough".

Yup. That particular function was also not exception-safe (for
releasing the lock) if the new throws, which would also be addressed
by having a lock-owning object. Thanks also to Lisa Lippincott who
pointed this out privately; I encouraged her to post those same
comments publicly too.

I originally didn't write it that way because I was testing
performance, but it was trivial to do and I should have done it -- I
normally try to present "good stylistic example" code, not just a hack
that sorta works.

Nathan Myers

Nov 6, 1998, 3:00:00 AM
A. G. McDowell<newsRe...@mcdowella.demon.co.uk> wrote:
>With regard to the various comments about cheap hardware synchronisation
>primitives: are there any hidden costs?

The general answer is: yes, there are often hidden costs. How large
are these costs? It varies. On machines intended for SMP operation
it is minimized, but may still be on the scale of bus operations rather
than single instructions. (On a single-CPU system, there's no need
for "interlock" as such; you just need an uninterruptible instruction,
which runs as fast as any other.)

With a 333MHz CPU on a 66MHz bus, a bus operation is five times
as long as a (minimal) instruction cycle. Does this mean an
interlocked decrement should take the time of 2*5 ordinary
instructions? Maybe. Anyway you shouldn't be surprised if
it does. Maybe it is fast, but it invalidates a "line" in
the cache of each other CPU, possibly slowing them down if
they happen to be looking at memory nearby.

Thus, the question of whether a "plain copy" or "COW" string
implementation is better really depends on details of operation
mix, inherent target-hardware overhead, and degree of actual
contention among CPUs for the particular problem being solved.

You can't make any general statement without stating your assumptions
about these variables, and any such set of assumptions will be wrong
for the majority of applications. The GNU Standard C++ Library will
come with both implementations so that if you find the default to be
a bottleneck you can use the other.

Bill Wade

Nov 7, 1998, 3:00:00 AM

Herb Sutter wrote in message <3646878a...@nr1.toronto.istar.net>...

>What was the test? (Described in simple terms, that is... e.g., one of
>mine was "33% of copies never changed, the rest changed only once.")


The test was a function-count profile of the startup portion of an
engineering simulator (transient hydraulic pipeline). This is code which
was converted from Ratfor (a Fortran dialect) to C++ by a largely automated
program several years ago ('94). Much of the code remains non OO, but it is
getting better.

The Ratfor code had three string representations which were widely used (and
a couple of others that were occasionally used). A String class was
introduced which could be constructed from any of the Ratfor representations
or from c-strings. Conversion from String to one of the other
representations always goes through c_str(). The main considerations for
String were that it be easy to use, correct (no memory leaks) and at least
as fast as the Ratfor versions (easy to beat). The program still spends
more time working with other representations than it does with the new
String, so the class has not been heavily optimized. It uses
single-threaded COW with the array of characters pointed at by StringBuffer
(rather than using struct-hack).

I expect (I don't have measurements for this program) that about half of the
non-empty strings were ten characters or less. Most of the rest would be
less than fifty characters, and at most a few were over a couple of hundred
characters.

In this implementation StringBuffer has private data (and String is not a
friend) so it is easy to count how many times the reference count is read,
incremented or decremented. There is a shared global StringBuffer for the
empty string. The default constructor, among others, references that
StringBuffer rather than create a new one.

Here are the most pertinent numbers (function names changed to match String
or std::string conventions).

Count Function

34630609 c_str()
19441381 size()
15854955 operator[]() const
10502960 --refcount
9957813 ++refcount
6561602 ~String() (this matches sum of constructor calls)
6015124 ReferenceEmptyString() (calls ++refcount)
4348823 String::String(void)
3928590 copy assignment
3745965 Allocate a new or larger character array.
2199857 ~StringBuffer() (== sum of all StringBuffer constructors)
1696545 GetRefCount()
1255004 String::String(proprietary)
841895 AboutToModify()
426842 String::String(const char*)
420883 operator[]() non-const
408307 MakeUpper(void)
408233 String::String(proprietary)
108601 String::String(proprietary)
14099 copy constructor
12755 operator=(char const *)
12685 operator+=(const String &)
12685 String operator+(String, const String &) // uses +=

All other functions were called less than 10k times.

Note that almost all of the AboutToModify() calls come from op[], MakeUpper
and operator+=.

In this (debug) version, AboutToModify() calls GetRefCount() twice. The
second call, just before returning, is an assert that refcount==1.

I don't know how many of the calls to c_str() could have used
std::string::data() if it were available. The large number of c_str() calls
might suggest that Herb's Plain::String should use a shared representation
for empty strings (speeds up c_str() at the cost of making most mutators
(other than op[]) slightly more expensive).
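In outline, the shared-empty-string representation might look like this (illustrative names only; the static buffer is simply never freed):

```cpp
#include <cassert>
#include <cstring>

// Sketch: every default-constructed string references one static
// buffer, so default construction and c_str() on an empty string
// never allocate.
struct EmptyBuf {
    long refs;
    char data[1];
};

inline EmptyBuf* ReferenceEmptyString() {
    static EmptyBuf empty = { 1, { '\0' } };
    ++empty.refs;               // shared by all empty strings
    return &empty;
}

class MiniString {
    EmptyBuf* buf_;
public:
    MiniString() : buf_(ReferenceEmptyString()) {}  // no allocation here
    const char* c_str() const { return buf_->data; }
};
```

As noted, the cost shows up elsewhere: every mutator must first check for, and detach from, the shared empty representation.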

Christian Vogler

Nov 7, 1998, 3:00:00 AM
[To mod: This is slightly off-topic, but I think that these are
important considerations for getting unbiased measurements. ]

Herb Sutter (he...@cntc.com) wrote:

: Conclusion: So there is indeed overhead introduced by
: the x86 LOCK instruction, even on a single-CPU machine.
: This is natural and to be expected, but I point it out
: because some people said there should be no difference.
: Moral: Never guess. Measure.

This is actually not surprising, because at least the Pentium
invalidates the L1 cache line that contains the modified memory
location after it has executed a locked instruction. Don't know about
the Pentium II yet.

: I am very impressed that Win32's InterlockedIncrement
: is actually faster at 765ms than my hand-coded
: assembler at 900ms, even though my hand-coded version
: actually does less work (only a single instruction!)
: because it doesn't return the new value. Of course,
: I'm no x86 assembler expert; the explanation is
: certainly that the OS's wrapper is using a different
: opcode than my hand-coded version.

This difference could actually be due to the invalidation of the L1
cache line as stated above. If you would like to make absolutely sure
that the results are comparable, you have to make sure that the lower
4 bits of the reference count's memory address are the same in both test
cases.

To maximize performance of the COW implementation, I suggest making
sure that nothing else is in the same 16 byte block as the reference
count. This will make sure that no additional cache misses occur as a
result of manipulating the reference count. I would be interested to
see how much of a difference this makes in the performance of the
COW implementation.
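In today's C++ (well after this 1998 thread) the suggestion can be expressed directly; 64 bytes is an assumption for modern CPUs, and the right figure is the target's actual line size (32 bytes on the Pentiums discussed here):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: give the reference count a cache line to itself, so locked
// updates to it cannot evict neighboring data from other CPUs' caches.
struct alignas(64) PaddedRefCount {
    long refs;
    // remainder of the line is implicit padding, courtesy of alignas
};
```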

I once tested Win32 critical sections on a Pentium 166, and the
difference between a proper and an improper memory alignment was a
factor of 2 for the EnterCriticalSection/LeaveCriticalSection
pairs. Cache misses seem to be that expensive.

Regards,
Christian

apje...@my-dejanews.com

Nov 7, 1998, 3:00:00 AM
In article <36427f9...@nr1.toronto.istar.net>,

Then all you need to do is find out what's in InterlockedIncrement and
move it inline, and it'll be even faster. Anyone with a debugger can
take a look at how InterlockedIncrement is implemented. Here it is,
from VC5 on an x86 under NT:

    mov     ecx,dword ptr [esp+4]
    mov     eax,1
    lock xadd dword ptr [ecx],eax
    inc     eax
    ret     4

Note that it uses the xadd method, which is claimed to be slower than
the inc method. And the statement

    j = InterlockedIncrement(&i);

is compiled to

    lea     eax,[i]
    push    eax
    call    dword ptr [__imp__InterlockedIncrement@4(0x004050e0)]
    mov     dword ptr [j],eax

Note that 4 instructions are involved in calling the function, plus 5
instructions inside the function itself. If I move the exact code
that's in InterlockedIncrement inline, I can shorten it to

    __asm {
        mov  eax,1
        lea  ecx,i
        lock xadd [ecx],eax
        mov  j,eax
    }

How could this possibly be _slower_? It's using the same code as
InterlockedIncrement; the only difference is it's inline, so I've
gotten rid of the code to push i's address onto the stack, then take
it back off the stack, plus two jumps have been gotten rid of. There
has to be something else affecting your results. I might believe it
if you just said that it didn't make enough difference to matter.
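For comparison, the same xadd-based increment can be expressed portably with std::atomic (which postdates this 1998 thread): fetch_add returns the old value, and the "inc eax" in the listing above recovers the new one.

```cpp
#include <atomic>
#include <cassert>

// Portable restatement of the lock xadd sequence shown above.
long InterlockedIncrementPortable(std::atomic<long>& i) {
    return i.fetch_add(1) + 1;  // old value + 1 == incremented value
}
```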

--
Adam P. Jenkins
ajenkins at javanet dot com

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own


apje...@my-dejanews.com

Nov 7, 1998, 3:00:00 AM
In article <fxtww58...@isolde.engr.sgi.com>,
Matt Austern <aus...@sgi.com> wrote:
> If it was a Win95 program, there might have been reasons for using
> critical sections instead of Interlocked[Inc|Dec]rement. I was
> surprised to find, looking at the Win32 documentation, that
> InterlockedDecrement is not an atomic version of --i
.. snip ..

> In those cases, if you want to write code that's portable across all
> Win32 systems, there's no sensible choice but to use critical sections.

Just use InterlockedIncrement with a negative number to get an atomic --i if
that's the case. Anyway, InterlockedDecrement does the right thing on NT4.

--
Adam P. Jenkins
ajenkins at javanet dot com


Christian Vogler

Nov 7, 1998, 3:00:00 AM
Christian Vogler (cvo...@gradine.cis.upenn.edu) wrote:

: This difference could actually be due to the invalidation of the L1
: cache line as stated above. If you would like to make absolutely sure
: that the results are comparable, you have to make sure that the lower
: 4 bits of the reference count's memory address are the same in both test
: cases.

Oops. I was a little too quick here. A L1 cache line is 32 bytes
wide. That should have read "the lower 5 bits."

: To maximize performance of the COW implementation, I suggest making
: sure that nothing else is in the same 16 byte block as the reference
: count.

Likewise, this should have read "in the same 32 byte block."

Regards,
- Christian

Nathan Myers

Nov 7, 1998, 3:00:00 AM
Herb Sutter<he...@cntc.com> wrote:

> n...@nospam.cantrip.org (Nathan Myers) wrote:
>>In fact a correctly-coded thread-safe
>>COW implementation is not appreciably slower than a non-thread-
>>safe one, given atomic reference-count operations: load, increment,
>>decrement-and-test.
>
>To sum up: In fact, a correctly-coded thread-safe COW implementation
>is both measurably and unavoidably slower than a non-thread-safe one,
>as I have demonstrated since.

No, Herb has demonstrated that (as we already knew) certain operations
on COW strings are measurably slower. His results do not imply that
using COW strings in multi-threaded programs makes your program slower,
or that using non-COW strings would make it faster, or vice versa.

What his results do imply is that a COW string implemented using
mutex-guarded reference counts really stinks. Any other differences
are too small to generalize about.

Bill Wade

Nov 7, 1998, 3:00:00 AM

Matt Austern wrote in message ...

>I was
>surprised to find, looking at the Win32 documentation, that
>InterlockedDecrement is not an atomic version of --i.

Which documentation?

The help for VC5 says "The return value is the resulting decremented
value."

Roman Lechtchinsky

Nov 7, 1998, 3:00:00 AM
This thread has really got me wondering about the typical usage of
strings. My experience has been that in most programs string performance
doesn't really matter. I realize very well that there are applications
in which strings play a significant role. However, IMO in such
applications you wouldn't want to use basic_string ( or any general
purpose string class ) anyway, at least not always. If there are enough
strings which are never supposed to change, a simple immutable string
class could probably improve performance much more than an elaborate
basic_string implementation. If strings are frequently concatenated, a
string class with O(1) concatenation might be appropriate. IMO such
alternatives should be considered if we are doing tests on real world
programs.
Another point which I haven't seen mentioned so far is memory usage. I
don't think a general purpose string class should be optimized for
performance only. I won't use a basic_string implementation that wastes
a lot of memory in most of my programs even if its performance is
optimal simply because string performance isn't important for me but the
wasted memory (e.g. because of unnecessary deep copies) sometimes is.

Herb Sutter wrote:
> On 28 Oct 1998 01:11:51 -0500, n...@nospam.cantrip.org (Nathan Myers)
> wrote:
> >A fair comparison would, like any real program, do more copies
> >than modifications.

IMO a fair comparison would also take into account that the measured
parameters are only significant for a special class of programs, i.e.
those for which string performance really matters. Such programs would
probably use specialized classes for immutable strings etc. anyway. This
might change the usage patterns dramatically.

> In a program like that, COW can win (and lose; see the results). It
> does not seem reasonable to me, however, to say that "any real
> program" does so (unless there are numbers -- could you cite a
> study?). When I have time, I want to take some of my commercial
> programs and instrument basic_string to record counts of all
> operations -- the results should make a very interesting database, and
> who knows, they might prove me wrong. I would like to encourage all
> readers who are interested to take such measurements of their programs
> and publish the results: How many copies are performed? how many
> mutating operations? how many copies are never modified (and have
> their originals unmodified too, of course; i.e., how many don't ever
> need a deep copy)?

I'd add the following questions ( even though I don't know how to
measure some of these parameters ): how important is string performance
vs. memory usage? How much memory is wasted by the string
implementation? How many strings could be replaced by a specialized
string class (e.g. immutable strings)?

> A valid point; done. Incidentally, I have no bias other than
> performance...

I do :) Really, if we are talking about general purpose classes then
other parameters should be considered, too. If not, then maybe the test
patterns should be reconsidered.

Bye

Roman

Nathan Myers

Nov 7, 1998, 3:00:00 AM
Herb Sutter<he...@cntc.com> wrote:
>Here's an update on my ongoing "COW performance
>measurements" project. ...

>
>Here is one example of a suggested performance
>enhancement that actually made no difference: The
>AboutToModify function in the original version took a
>bool flagging whether to mark the now-unique string as
>unshareable. The objection was that this runtime
>decision ought to be made at compile time, since we
>always call AboutToModify with either "true" or "false"
>hardcoded. The current code version splits
>AboutToModify into two functions -- EnsureUnique and
>EnsureUnshareable -- and it actually made no
>performance difference, because in the original version
>(plus inlining) the compiler could see enough to do the
>optimization automatically.
>
>MORAL: Don't guess. Measure.

In the code I saw, AboutToModify wasn't inline. Considering
the amount of code in it, it shouldn't be, unless it were
split into two functions, a small inline part and a large
non-inline part. If the small inline part took an argument,
then of course the compiler could optimize it out.
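The split Nathan describes can be sketched as a tiny inlinable fast path that only tests the reference count, delegating to an out-of-line slow path when a deep copy is really needed (names are illustrative):

```cpp
#include <cassert>

// Sketch: small inline part + large non-inline part.
struct Buf { long refs; int value; };

Buf* DeepCopySlowPath(Buf* b);          // the large, non-inline part

inline Buf* EnsureUnique(Buf* b) {      // the small, inline part
    return (b->refs == 1) ? b : DeepCopySlowPath(b);
}

Buf* DeepCopySlowPath(Buf* b) {
    --b->refs;                          // detach from the shared buffer
    return new Buf{1, b->value};
}
```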

Nathan Myers

Nov 7, 1998, 3:00:00 AM
Herb Sutter<he...@cntc.com> wrote:
>
>Here's an update on my ongoing "COW performance
>measurements" project. The moral of the story is
>one that GotW readers will be familiar with by now:
>When it comes to performance, never guess. Measure.

Herb has clearly done a great deal of work, and is to
be commended for it. I have two comments:

1. Herb has not identified the limits of validity of his results.
Herb tested only on an unidentified branch of the Win32 tree,
and only on an unidentified branch of the x86 tree. There is
little reason to believe that his results apply to other branches,
never mind entirely-other environments. Did he test on Win95,
Win95-SP1, -SP2, Win98, NT4, NT4-SP1, -SP2, -SP3, -SP4, NT5beta?
On 486, Pentium, PPro, PII, K6, Cyrix586, 686?

There is every reason to believe that instruction timings vary
among the CPUs listed, and that allocator performance varies
among the OS versions listed. Outside the Wintel world, there
is even wider variation.

2. "Never guess, measure" is easy to say, but impractical in
normal cases. For all Herb's work, his results don't really
apply beyond his own machine, although most real programs he
writes probably must run on others. Furthermore, measuring
the wrong thing may lead you as far astray as not measuring at
all. The solution is not to give up measuring, but to measure
when you find a problem.

In Herb's results, only two differences really stand out:
First, protecting the reference count with a shared mutex is
an absolute lose, and second, his results comparing just op[]
for COW and plain strings are frankly suspect:

>TESTING OPERATOR[]: A possibly-mutating operation, never does mutate
>
> Notes:
> - Plain always vastly outperformed COW.
> - Results were independent of string lengths.
> - The overhead of COW_AtomicInt compared to Plain was
> constant at about 280ms per 1,000,000 operations.
> - COW_AtomicInt2 fared better in this test case, but
> COW_AtomicInt did better overall and so I am focusing
> on comparing that with Plain.
>
>[10x iterations] Running 10000000 iterations with strings of length
>10:
> Plain_FastAlloc 3ms copies: 0 allocs: 3 [*]
> Plain 2ms copies: 0 allocs: 3 [*]
> COW_Unsafe 1698ms copies: 0 allocs: 3
> COW_AtomicInt 2833ms copies: 0 allocs: 3
> COW_AtomicInt2 2112ms copies: 0 allocs: 4
> COW_CritSec 3587ms copies: 0 allocs: 3
> COW_Mutex 71787ms copies: 0 allocs: 3
>
> [*] within measurement error margin, both varied from 0ms to 9ms

What are these allocations needed for doing an op[]? Why does
it take a thousand times as long to do an inline op[] with one
extra test-and-branch? This is very suspicious indeed, and
smacks of a testing artifact.

Herb Sutter

Nov 8, 1998, 3:00:00 AM
On 7 Nov 1998 17:24:50 -0500, n...@nospam.cantrip.org (Nathan Myers)
wrote:

>Herb Sutter<he...@cntc.com> wrote:
>> n...@nospam.cantrip.org (Nathan Myers) wrote:
>>>In fact a correctly-coded thread-safe
>>>COW implementation is not appreciably slower than a non-thread-
>>>safe one, given atomic reference-count operations: load, increment,
>>>decrement-and-test.
>>
>>To sum up: In fact, a correctly-coded thread-safe COW implementation
>>is both measurably and unavoidably slower than a non-thread-safe one,
>>as I have demonstrated since.
>
>No, Herb has demonstrated that (as we already knew) certain operations
>on COW strings are measurably slower.

(Actually, some people were claiming that nothing would be measurably
slower -- not you, I think, but others.) OK, I'm willing to rephrase
slightly:

A thread-safe COW implementation is both measurably and unavoidably
slower than a non-thread-safe one, except only for const operations
and non-copy ctors. So thread-safe COW will be slower than
non-thread-safe COW in all programs, except programs that only ever
directly construct read-only objects and never copy or assign them
(pretty rare, shall we agree?).

>His results do not imply that
>using COW strings in multi-threaded programs makes your program slower,
>or that using non-COW strings would make it faster, or vice versa.

Granted. It depends on usage. As Rob Murray put it in C++ Strategies
and Tactics: "Always do some performance measurements when making this
kind of change [speaking specifically of COW strings] to convince
yourself that this optimization is not really a pessimization!" But
neither do my results support the broad assertion with which I was
contradicted earlier, namely that 'COW is faster than non-COW in any
real program.'

>What his results do imply is that a COW string implemented using
>mutex-guarded reference counts really stinks.

Granted. Note, however, that this _would_ be required if the
implementation supported sharing strings between _processes_ (at least
under Win32), because Win32 critical sections do not guarantee
serialization between processes, only between threads.

This makes me think of another interesting potential drawback to COW:
I know of no commercial COW implementation that supports sharing
strings between processes, but non-COW strings handle that fine by
their nature. Hmm. Interesting.

That ability may not be needed often (in fact, I'd expect it to be
exceedingly rare), but it is indicative of the general cost of COW,
where by "cost" I mean not just performance but also maintenance,
usability, etc. of any technique that attempts complex optimizations
under the covers. This is not to say that to do so is wrong -- it
clearly can be right, depending entirely on usage -- but only to point
out that it's not without tradeoffs, that's all.

Herb



Herb Sutter

Nov 8, 1998, 3:00:00 AM
On 6 Nov 1998 22:50:31 -0500, n...@nospam.cantrip.org (Nathan Myers)
wrote:

>A. G. McDowell<newsRe...@mcdowella.demon.co.uk> wrote:
>>With regard to the various comments about cheap hardware synchronisation
>>primitives: are there any hidden costs?
[...]

>(On a single-CPU system, there's no need
>for "interlock" as such; you just need an uninterruptible instruction,
>which runs as fast as others.)

This is true. Of course, if you're writing a commercial library and/or
application, you can't necessarily know the target environment at
build time... it might be SMP, so you really ought to use the
interlocked instructions if you don't want support calls when users'
code starts to break mysteriously when they run on SMP machines.

Lisa Lippincott

Nov 8, 1998, 3:00:00 AM

Andrei Alexandrescu wrote:
> Maybe in this case the author thought things are small enough to
> be easily manageable, so he didn't make this effort. As always,
> it turns out that things are never "small enough".

Herb Sutter replied:


> Yup. That particular function was also not exception-safe (for
> releasing the lock) if the new throws, which would also be addressed
> by having a lock-owning object. Thanks also to Lisa Lippincott who
> pointed this out privately; I encouraged her to post those same
> comments publicly too.

Well, Usenet waits for no one. But I'll reinforce Herb's point on
correctness: the non-COW string manages only one resource: memory.
Herb's multithreaded implementation of a COW string manages three
resources: memory, counted references, and locks. The former code
is certain to be more maintainable.

More stylistically pleasant versions of the string objects would
separate all three resource management issues from each other and
from the strings themselves.

But such a division would impede optimization by both humans and
compilers, and in my estimation tend to bias the tests in favor of
the simpler non-COW code. For example, Nathan Myers suggests that
treating locking and reference counting as separate operations
is a weakness of Herb's COW code.

--Lisa Lippincott

Herb Sutter

Nov 8, 1998, 3:00:00 AM
[BTW, several people have asked whether my test harness source code is
available. Answer: Yes, via the GotW #45 page at
www.peerdirect.com/resources, and please post the results if you try
it on your system.]

On 7 Nov 1998 21:22:50 -0500, n...@nospam.cantrip.org (Nathan Myers)
wrote:


>Herb has clearly done a great deal of work, and is to
>be commended for it. I have two comments:

Thanks. I appreciate this. I can think of further improvements to my
test harness, but I'll have to put it aside for at least a few months
until I have more time.

I want to make sure everyone understands why I went to the trouble
(and many late nights) of updating my test harness with no less than
seven different comparative implementations of String, and several
more test cases right down to some inline assembler. I did it for two
reasons:

a) Nathan in particular correctly pointed out some serious
omissions in my initial test harness.

b) I wanted to do justice to the topic.

I'm glad my initial results were further supported, but the motive was
to gather useful measurements, not to find support for or against a
particular position.

I definitely won't do this quantity of freebie work "to prove
<someone> wrong." The only major disagreement I have with anything
that was said was with the assertion that COW is always better in 'any
real program,' and aside from that I found the many comments to be
generally both insightful and useful, especially Nathan's, Bill's,
and Jerry's.

The exact results will clearly vary from platform to platform. That's
why I made the test harness code available and requested that anyone
who was interested download it, play with it, and let me know the
results on their platforms/environments. Interestingly, however, one
might have expected the relative results to change when I went from my
original implementation(s) to my more highly optimized
implementations -- the absolute performance did improve markedly
across all test cases, but the relative results were unchanged.


>1. Herb has not identified the limits of validity of his results.

[which CPU, which OS]

>2. "Never guess, measure" is easy to say, but impractical in
> normal cases. For all Herb's work, his results don't really
> apply beyond his own machine

Both true as charged. That's why I've done the work of providing the
complete source code for seven sample (but decent) implementations,
and why I hope that people will run the harness on their
machines/compilers and let us all know the results. In particular,
Nathan, I'm hoping that you have a spare hour or two to run it under
GCC/Linux and report the results (if you don't have the time, that's
cool, I know you're up to your eyeballs in the GNU lib project).

To all: If you do run the harness on your platform and post results,
please also post your implementations of the platform-specific
functions in the file "test.h" if you change them, for comparative
purposes.


>>[10x iterations] Running 10000000 iterations with strings of length 10:
>>  Plain_FastAlloc     3ms  copies: 0  allocs: 3  [*]
>>  Plain               2ms  copies: 0  allocs: 3  [*]
>>  COW_Unsafe       1698ms  copies: 0  allocs: 3
>>  COW_AtomicInt    2833ms  copies: 0  allocs: 3
>>  COW_AtomicInt2   2112ms  copies: 0  allocs: 4
>>  COW_CritSec      3587ms  copies: 0  allocs: 3
>>  COW_Mutex       71787ms  copies: 0  allocs: 3
>>
>> [*] within measurement error margin, both varied from 0ms to 9ms

>What are these allocations needed for doing an op[]? Why does
>it take a thousand times as long to do an inline op[] with one
>extra test-and-branch? This is very suspicious indeed, and
>smacks of a testing artifact.

Again, the code is available. For your first question, inspection will
show that the #copies and #allocs include some test setup, and
actually happen outside the timed test loop. For the second, it's
presumably because of the additional refcount comparison (actually,
that and a test inside Reserve that might be avoidable with a little
extra work). Note, however, that for this particular test I had to
crank up the iterations by a factor of 10 (compared to the other tests)
because the differences were smaller.

I could give a longer answer, but I'll withhold it for now... the code
is available for inspection, and I don't want to bias that inspection.
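
To make that "additional refcount comparison" concrete, here is a
minimal, hypothetical COW string (not the harness code; all names are
invented) whose mutating operator[] must test the count and possibly
deep-copy before returning a writable reference:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical minimal COW string: a mutating op[] must check the
// refcount and deep-copy before handing out a mutable reference.
class CowString {
    struct Rep { int refs; char* data; size_t len; };
    Rep* rep_;
    void unshare() {
        if (rep_->refs > 1) {                 // the extra test-and-branch
            Rep* r = new Rep{1, new char[rep_->len + 1], rep_->len};
            std::memcpy(r->data, rep_->data, rep_->len + 1);
            --rep_->refs;
            rep_ = r;
        }
    }
public:
    explicit CowString(const char* s)
        : rep_(new Rep{1, new char[std::strlen(s) + 1], std::strlen(s)})
    { std::memcpy(rep_->data, s, rep_->len + 1); }
    CowString(const CowString& o) : rep_(o.rep_) { ++rep_->refs; }
    ~CowString() { if (--rep_->refs == 0) { delete[] rep_->data; delete rep_; } }
    char& operator[](size_t i) { unshare(); return rep_->data[i]; }  // mutating
    char get(size_t i) const { return rep_->data[i]; }               // read-only
    int refs() const { return rep_->refs; }
};
```

A non-COW op[] is a plain array index; this one always pays a
test-and-branch and sometimes an allocation plus a full copy, which is
where the gap in the table above comes from.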

Matt Austern

unread,
Nov 9, 1998, 3:00:00 AM11/9/98
to
apje...@my-dejanews.com writes:

> In article <fxtww58...@isolde.engr.sgi.com>,
> Matt Austern <aus...@sgi.com> wrote:
> > If it was a Win95 program, there might have been reasons for using
> > critical sections instead of Interlocked[Inc|Dec]rement. I was
> > surprised to find, looking at the Win32 documentation, that
> > InterlockedDecrement is not an atomic version of --i
> .. snip ..
> > In those cases, if you want to write code that's portable across all
> > Win32 systems, there's no sensible choice but to use critical
> > sections.
>
> Just use InterlockedIncrement with a negative number to get an atomic
> --i if that's the case. Anyway, InterlockedDecrement does the right
> thing on NT4.

The same caveat applies to InterlockedIncrement on Windows 95: it is
not simply an atomic version of ++i, and you should not rely on
anything more than the sign. If you don't intend to target Windows
95, of course, then perhaps you don't need to care about any of this.

This does have lessons that apply beyond a single operating system,
though.

(1) When it comes to thread safety, you shouldn't make any
assumptions beyond what your vendor's documentation says. Most
of the time, presumably, InterlockedIncrement does behave as ++i
and InterlockedDecrement does behave as --i. If you just test
it and see if it works, you'll probably see that it does---you
might never find the rare cases where Microsoft's caveat
applies. You'll wind up with a program that occasionally fails
in mysterious ways and that's very hard to debug.

(2) Writing portable multithreaded code is hard. We've seen that
there are serious issues even if all you're doing is trying to
target the two most recent releases of Microsoft's most popular
operating system. Throw a few more operating systems into the
mix, and you've got a fair amount of careful reading to do.
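
Lesson (1) can be sketched in portable terms (std::atomic stands in
for the Win32 calls here; the names are invented): wrap the primitive
so callers can observe only the sign of the result, which is all the
weaker contract guarantees.

```cpp
#include <atomic>
#include <cassert>

// Simulate an interlocked decrement whose exact return value is not
// trustworthy: only zero and the sign are guaranteed, as on the older
// Win32 systems discussed above.
static std::atomic<long> g_count{3};

long interlocked_decrement_sign_only(std::atomic<long>& v) {
    long r = v.fetch_sub(1) - 1;   // exact value on modern hardware...
    return (r > 0) - (r < 0);      // ...but expose only -1, 0, or +1
}

// Portable caller: branch only on the sign, never on the magnitude.
bool release_last_reference() {
    return interlocked_decrement_sign_only(g_count) == 0;
}
```

Code written against this narrower interface works whether or not the
underlying primitive returns the true decremented value.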

Matt Austern

unread,
Nov 9, 1998, 3:00:00 AM11/9/98
to
"Bill Wade" <bill...@stoner.com> writes:

> Matt Austern wrote in message ...
> >I was
> >surprised to find, looking at the Win32 documentation, that
> >InterlockedDecrement is not an atomic version of --i.
>
> Which documentation?

The version of MSDN that came with VC++6. (July, 1998)

This reads to me like the sort of thing that would appear in the BUGS
section of a traditional Unix man page: my guess is that
Interlocked[Inc|Dec]rement was intended to be an atomic version of ++i
and --i, that it almost always does behave that way, and that caveat
was added to the documentation when Microsoft discovered some rare
cases when it does something else.

Bill Wade

unread,
Nov 9, 1998, 3:00:00 AM11/9/98
to

Herb Sutter wrote in message <36450d79...@nr1.toronto.istar.net>...
>Note, however, that [mutex guarded reference counts] _would_ be
>required if the implementation supported sharing strings between
>_processes_ (at least under Win32), because Win32 critical sections do
>not guarantee serialization between processes, only between threads.


Just in case the stuff I put in "[]" is an improper expansion of "this",
I'll apologize in advance.

From Win32 InterlockedIncrement documentation:

"The threads of different processes can use this mechanism if the
variable is in shared memory."

So it appears that the atomic integer operations can be used in shared
memory.

Christopher Eltschka

unread,
Nov 10, 1998, 3:00:00 AM11/10/98
to
Jerry Leichter <leic...@smarts.com> writes:

[...]

>For the effort required to do an efficient, reliable, multi-platform
>reference-counted string implementation, I'd rather have a garbage
>collector, which does away with reference counts entirely (among many
>other advantages).

If you use COW: How would you determine if a copy is needed, if you
don't have a reference count?

And if you always copy: What would GC give you in this respect?
(the only difference I can see is that you don't need a delete
in the destructor - but that's not the problem anyway)

Scott R. Turner

unread,
Nov 10, 1998, 3:00:00 AM11/10/98
to
I wanted to de-lurk momentarily to express my thanks to Herb and
all the other participants for the Guru discussions. I consider
myself a fairly expert programmer, but I'm constantly learning new
things from these discussions. The back-and-forth and in-depth
analysis of the thread-safe COW string discussion has been particularly
interesting.

At times, posting to Usenet is like throwing a stone into the Grand
Canyon. You're never sure what impact you have. I just wanted to let
you know on behalf of myself (and doubtless countless other lurkers)
that your efforts are not unappreciated.

{Thanks. It's also nice to have top-notch people like Nathan
and Matt contribute... makes it better for everyone. -hps}

-- Scott T.

apje...@my-dejanews.com

unread,
Nov 11, 1998, 3:00:00 AM11/11/98
to
In article <fxtpvaw...@isolde.engr.sgi.com>,
Matt Austern <aus...@sgi.com> wrote:
> apje...@my-dejanews.com writes:
>
> > In article <fxtww58...@isolde.engr.sgi.com>,
> > Matt Austern <aus...@sgi.com> wrote:
> > > If it was a Win95 program, there might have been reasons for using
> > > critical sections instead of Interlocked[Inc|Dec]rement. I was
> > > surprised to find, looking at the Win32 documentation, that
> > > InterlockedDecrement is not an atomic version of --i
> > .. snip ..
> > > In those cases, if you want to write code that's portable across all
> > > Win32 systems, there's no sensible choice but to use critical
> > > sections.
> >
> > Just use InterlockedIncrement with a negative number to get an atomic
> > --i if that's the case. Anyway, InterlockedDecrement does the right
> > thing on NT4.
>
> The same caveat applies to InterlockedIncrement on Windows 95: it is
> not simply an atomic version of ++i, and you should not rely on
> anything more than the sign. If you don't intend to target Windows
> 95, of course, then perhaps you don't need to care about any of this.
>

Where are you getting this information? The online documentation that
comes with VC5 documents InterlockedIncrement as being precisely an
atomic ++i. Here it is, cut-and-pasted. And InterlockedDecrement is
documented as being an atomic --i. I can only assume that you either
misread the documentation or are looking at out-of-date documentation
for Win32.

<QUOTE>
The InterlockedIncrement function both increments (increases by one)
the value of the specified 32-bit variable and checks the resulting
value. The function prevents more than one thread from using the same
variable simultaneously.

LONG InterlockedIncrement(
    LPLONG lpAddend   // address of the variable to increment
);

Parameters

lpAddend

Points to the 32-bit variable to increment.

Return Values

The return value is the resulting incremented value.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Remarks

The functions InterlockedIncrement, InterlockedCompareExchange,
InterlockedDecrement, InterlockedExchange, and InterlockedExchangeAdd
provide a simple mechanism for synchronizing access to a variable that
is shared by multiple threads. The threads of different processes can
use this mechanism if the variable is in shared memory.

The variable pointed to by the lpAddend parameter must be aligned on a
32-bit boundary; otherwise, this function will fail on multiprocessor
x86 systems.
</QUOTE>

Also note that it says the Interlocked* functions can be used
across process boundaries for variables in shared memory.

--
Adam P. Jenkins
ajenkins at javanet dot com

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Andrei Alexandrescu

unread,
Nov 11, 1998, 3:00:00 AM11/11/98
to
Hi guys,

This ongoing debate is fascinating. Listen, I have an idea for us all.

I think we all agree that thread-safe COW strings are less efficient than
non-thread-safe ones. Then let's put *two* of them in place. A plain string
and a multithreaded one. The decision will be taken by the programmer and
specified as a third template argument of type bool passed to basic_string.
For instance:

template <class E, class T = char_traits<E>, class A = allocator<E>,
          bool Multithreaded = false>
class basic_string
{
    // ...
};

Looks like a lot of code bloat, doesn't it? Not quite. We can derive the MT
string from the plain string by using Partial Template Specialization (PTS)
magic:

template <class E, class T, class A>
class basic_string<E, T, A, true> : public basic_string<E, T, A, false>
{
// ...
};

This way we'll have to override only the methods we need to get the
multithreaded behavior.

Think of the advantages:
- You have two strings available, and you can use the best one for the job
- The string will still be standard, since the standard doesn't prohibit
adding template parameters as long as you provide defaults for them.
- Only the programmer who wants to have MT strings will have to type more,
which is good since MT programmers usually know what they're doing.

Egcs supports PTS so it can be used as a test platform.

What do you think? Any takers?

Andrei
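
A toy sketch of the proposed knob, using an explicit specialization of
a small helper rather than basic_string itself (all names invented; a
real string variant would specialize far more than the count):

```cpp
#include <atomic>
#include <cassert>

// A bool template parameter, defaulted to false, selects
// single-threaded vs. thread-safe reference-counting machinery.
template <bool Multithreaded = false>
struct RefCount {                 // single-threaded: a plain int
    int n = 1;
    void inc() { ++n; }
    bool dec() { return --n == 0; }   // true when the last ref is gone
};

template <>
struct RefCount<true> {           // multithreaded: an atomic int
    std::atomic<int> n{1};
    void inc() { n.fetch_add(1, std::memory_order_relaxed); }
    bool dec() { return n.fetch_sub(1, std::memory_order_acq_rel) == 1; }
};
```

A COW string templated on the same bool could then hold a
RefCount<Multithreaded> and let the programmer pay for atomicity only
when asked.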

Jerry Leichter

unread,
Nov 11, 1998, 3:00:00 AM11/11/98
to
| >For the effort required to do an efficient, reliable, multi-platform
| >reference-counted string implementation, I'd rather have a garbage
| >collector, which does away with reference counts entirely (among many
| >other advantages).
|
| If you use COW: How would you determine if a copy is needed, if you
| don't have a reference count?
|
| And if you always copy: What would GC give you in this respect?
| (the only difference I can see is that you don't need a delete
| in the destructor - but that's not the problem anyway)

If you want to alter an existing string in place, then you have to copy
it. On the other hand, if you want to *extend* an existing string, you
can use a string pointer object that includes the length. You can then
safely extend an existing string, regardless of how many references
there are to it (assuming you've reserved enough extra space - but
that's true regardless of reference counts). You can even go further
and use an offset to the beginning of a substring in a shared string.
Then you can extract substrings without copying, too.
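
A sketch of such a scheme (invented names; shared_ptr here only
manages buffer lifetime, it is not a reference count consulted for
copying): each handle carries its own offset and length into a shared
buffer, substrings share storage, and only the handle that ends
exactly at the buffer's high-water mark may extend in place.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <string>

struct SharedBuf {
    std::unique_ptr<char[]> data;
    size_t used = 0, cap = 0;     // high-water mark and capacity
};

class StrRef {
    std::shared_ptr<SharedBuf> buf_;
    size_t off_ = 0, len_ = 0;
public:
    explicit StrRef(const char* s) {
        size_t n = std::strlen(s);
        buf_ = std::make_shared<SharedBuf>();
        buf_->cap = n + 32;                       // reserve extra space
        buf_->data = std::make_unique<char[]>(buf_->cap);
        std::memcpy(buf_->data.get(), s, n);
        buf_->used = len_ = n;
    }
    void append(const char* s) {  // in-place only for the "tail owner"
        size_t n = std::strlen(s);
        if (off_ + len_ != buf_->used || buf_->used + n > buf_->cap)
            return;               // a full version would copy out here
        std::memcpy(buf_->data.get() + buf_->used, s, n);
        buf_->used += n;
        len_ += n;
    }
    StrRef substr(size_t pos, size_t n) const {   // no copying at all
        StrRef r(*this);
        r.off_ += pos;
        r.len_ = n;
        return r;
    }
    std::string str() const { return {buf_->data.get() + off_, len_}; }
};
```

The tail-owner check is essential: without it, two sharers appending
at their own lengths would silently clobber each other's data.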

You lose when you end up copying a string in order to change something
within it when, had you known there was only a single reference, you
could have changed it in place. If this is a common operation in your
program, the obvious thing is to define a "string_buffer" object,
which is optimized for this purpose. It can freely interconvert back
and forth with string. You can have it retain the "atomic" semantics of
strings by specifying that any attempt to create a second reference
causes a copy; or you can give it array semantics, so that all
references point to a single object which can be changed through any of
them.

COW attempts to automate the choice between string and string_buffer. I
can think of almost no programs I've ever written or worked on in which
the task of programming would in any way have been helped by the
automation of this choice. In practice, any given program tends to
treat strings in one of two styles: As atomic objects, operated on as
units with append and substring; or as arrays of characters, subject to
internal modification. It's pretty obvious when you write the code
which is the preferred mode. Switching between modes is pretty cheap,
and is rarely done that often anyway.

BTW, if you really want to optimize, it helps to consider what you
really want to do with your strings. I see many cases where a final
string is built out of multiple concatenated pieces. Unless you're willing
to reserve - and potentially waste - extra space, this usually ends up
doing a *lot* of copying, COW or no. This is easily avoided, *without*
COW: Create a new datatype, concatenated_string, which is an ordered
list of string's (and, preferably, char *'s - since there are no
compile-time pre-built constant string objects). The only operations
allowed on concatenated_string are concatenation - which adds to the
list - and conversion to string (which walks the list to determine the
total space needed, allocates it *once*, and copies all the pieces into
the allocated space *once*.) I contend that the use of such a class
would, on average, save significantly more time and space across typical
programs than even the most efficient COW implementation. (Yes, they
can be used together - but which would you do *first* if my contention
is correct?)
-- Jerry
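
Jerry's concatenated_string can be sketched in a few lines (a
simplification: std::string pieces instead of a mixed string/char*
list):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Pieces are only collected; all measuring, allocation, and copying
// happens once, at conversion time.
class concatenated_string {
    std::vector<std::string> pieces_;
public:
    concatenated_string& operator+=(std::string s) {
        pieces_.push_back(std::move(s));        // just adds to the list
        return *this;
    }
    std::string str() const {
        size_t total = 0;
        for (const auto& p : pieces_) total += p.size();
        std::string out;
        out.reserve(total);                     // allocate *once*
        for (const auto& p : pieces_) out += p; // copy each piece *once*
        return out;
    }
};
```

Building a string from N pieces this way does one allocation and one
copy per byte, instead of the O(N) reallocations naive repeated
concatenation can cost.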

Herb Sutter

unread,
Nov 11, 1998, 3:00:00 AM11/11/98
to
On 9 Nov 1998 21:17:31 -0500, "Bill Wade" <bill...@stoner.com>
wrote:

>Herb Sutter wrote in message <36450d79...@nr1.toronto.istar.net>...
>>[mutex guarded reference counts] _would_ be required if the
>>implementation supported sharing strings between _processes_ (at
>>least under Win32), because Win32 critical sections do not
>>guarantee serialization between processes, only between threads.

>From Win32 InterlockedIncrement documentation:


>"The threads of different processes can use this mechanism if the
>variable is in shared memory."

Good point, I jumped the gun while musing aloud. COW can be made both
multithread-safe and multiprocess-safe by serializing a single int
only (since the only data accessed/changed by multiple threads is the
refcount), so my comment doesn't apply.

What I said would be true of other shared-representation optimizations
where the data that needs to be serialized together is more than a
single int (or group of individually serializable ints).

Thanks,

Herb


---
Herb Sutter (mailto:hsu...@peerdirect.com)

PeerDirect Inc. 2695 North Sheridan Way, Suite 150
www.peerdirect.com Mississauga Ontario Canada L5K 2N6


Jay Krell

unread,
Nov 11, 1998, 3:00:00 AM11/11/98
to
(Posted and email cc'ed)
(Mod: somewhat [ENVIRONMENT SPECIFIC])

>bill...@stoner.com
>Matt Austern wrote in message ...
>>I was
>>surprised to find, looking at the Win32 documentation, that
>>InterlockedDecrement is not an atomic version of --i.
>
>Which documentation?
>
>The help for VC5 says "The return value is the resulting decremented
>value."
>

>leic...@smarts.com
>For example, NT gives you
>an atomic add - but it returns only the *sign* of the value before - or
>is it after? - the operation.

October '98 MSDN:

>InterlockedIncrement
Return Values
Windows 98, Windows NT 4.0 and later: The return value is the
resulting incremented value.

Windows 95, Windows NT 3.51 and earlier: If the result of the
operation is zero, the return value is zero. If the result of the
operation is less than zero, the return value is negative, but it is
not necessarily equal to the result. If the result of the operation is
greater than zero, the return value is positive, but it is not
necessarily equal to the result.

>InterlockedAdd
Return Values
The return value is the initial value of the variable pointed to by
the Addend parameter.

QuickInfo
Windows NT: Requires version 4.0 or later.
Windows: Requires Windows 98 or later.

Me: Windows 95 and NT 3.x run on '386, Windows 98 and NT4+ require
'486. I think that's why there are these differences. Anyone know the
relevant diff between '386 and '486? Anything even better in Pentium,
etc.? I guess, if you want to work on Win95 but require a '486, you
can implement these yourself.

- Jay

Roman Lechtchinsky

unread,
Nov 12, 1998, 3:00:00 AM11/12/98
to
Andrei Alexandrescu wrote:
> I think we all agree that thread-safe COW strings are less efficient
> than non-thread-safe ones. Then let's put *two* of them in place. A
> plain string and a multithreaded one. The decision will be taken by
> the programmer and specified as a third template argument of type
> bool passed to basic_string.

Actually, I expect a good implementation to provide two different
versions of the entire standard library for single-threaded and
multithreaded programs. They would be selected by a compiler option.
Incidentally, this would probably also allow the compiler to generate
more efficient code for single-threaded programs in some cases. Of
course, this means quite a lot of additional work for the implementers (
and thus additional bugs ), which in turn means that most of the time I
would use the multithreaded version so that I have only one set of
compiler/library bugs to deal with.

Bye

Roman

sa...@bear.com

unread,
Nov 12, 1998, 3:00:00 AM11/12/98
to
In article <364877...@smarts.com>,
Jerry Leichter <leic...@smarts.com> wrote:

> I see many cases where a final
> string is built out of multiple concatenated pieces. Unless you're willing
> to reserve - and potentially waste - extra space, this usually ends up
> doing a *lot* of copying, COW or no. This is easily avoided, *without*
> COW: Create a new datatype, concatenated_string, which is an ordered
> list of string's (and, preferably, char *'s - since there are no
> compile-time pre-built constant string objects). The only operations
> allowed on concatenated_string are concatenation - which adds to the
> list - and conversion to string (which walks the list to determine the
> total space needed, allocates it *once*, and copies all the pieces into
> the allocated space *once*.) I contend that the use of such a class
> would, on average, save significantly more time and space across typical
> programs than even the most efficient COW implementation. (Yes, they
> can be used together - but which would you do *first* if my contention
> is correct?)


Use SGI STL rope. You can use rope and string together to best suit your
needs.

-- Saroj


Matt Austern

unread,
Nov 12, 1998, 3:00:00 AM11/12/98
to
apje...@my-dejanews.com writes:

> Where are you getting this information? The online documentation that
> comes with VC5 documents InterlockedIncrement as being precisely an
> atomic ++i.

I'm getting it from the online documentation that comes with VC6.
Either the behavior changed between VC5 and VC6 or (more likely)
Microsoft changed the documentation because they discovered some rare
cases where it does not work they way they thought it did when they
wrote the VC5 documentation.

The VC6 documentation is quite explicit.

Doug Harrison

unread,
Nov 12, 1998, 3:00:00 AM11/12/98
to
Matt Austern wrote:

>This reads to me like the sort of thing that would appear in the BUGS
>section of a traditional Unix man page: my guess is that
>Interlocked[Inc|Dec]rement was intended to be an atomic version of ++i
>and --i, that it almost always does behave that way, and that caveat
>was added to the documentation when Microsoft discovered some rare
>cases when it does something else.

I've been told that restriction is due to support for the 386 in Win95
and in NT 3.51 and earlier; the 386 lacks an instruction that would
quickly accomplish the atomic ++i or --i, but it does have an
instruction that performs the atomic increment while returning only
the sign. NT4 and Win98 do lift this restriction, but each requires at
least a 486.

--
Doug Harrison
dHar...@worldnet.att.net

Nathan Myers

unread,
Nov 12, 1998, 3:00:00 AM11/12/98
to
Jerry Leichter <leic...@smarts.com> wrote:
> You can then
>safely extend an existing string, regardless of how many references
>there are to it (assuming you've reserved enough extra space - but
>that's true regardless of reference counts). You can even go further
>and use an offset to the beginning of a substring in a shared string.
>Then you can extract substrings without copying, too.

This was true in drafts of the standard up until very late.

Then someone who thus far remains nameless slipped through a proposal
changing operator[] to require (for no good reason that I can discern)
that s[s.size()] must return 0.

Previously it was only necessary to tack on a 0 when s.c_str()
was called; now it must be done after most modifications.

George V. Reilly

unread,
Nov 12, 1998, 3:00:00 AM11/12/98
to
Jay Krell <jay....@cornell.edu> wrote:
) >bill...@stoner.com
) >Matt Austern wrote in message ...
) >>I was
) >>surprised to find, looking at the Win32 documentation, that
) >>InterlockedDecrement is not an atomic version of --i.
) >
) >Which documentation?
) >
) >The help for VC5 says "The return value is the resulting decremented
) >value."
)
) >leic...@smarts.com
) >For example, NT gives you
) >an atomic add - but it returns only the *sign* of the value before - or
) >is it after? - the operation.
)
) October '98 MSDN:
)
) >InterlockedIncrement
) Return Values
) Windows 98, Windows NT 4.0 and later: The return value is the
) resulting incremented value.
)
) Windows 95, Windows NT 3.51 and earlier: If the result of the
) operation is zero, the return value is zero. If the result of the
) operation is less than zero, the return value is negative, but it is
) not necessarily equal to the result. If the result of the operation
) is greater than zero
)
) >InterlockedAdd
) Return Values
) The return value is the initial value of the variable pointed to by
) the Addend parameter.
)
) QuickInfo
) Windows NT: Requires version 4.0 or later.
) Windows: Requires Windows 98 or later.
)
) Me: Windows 95 and NT 3.x run on '386, Windows 98 and NT4+ require
) '486. I think that's why there are these differences. Anyone know the
) relevant diff between '386 and '486? Anything even better in Pentium,
) etc.? I guess, if you want to work on Win95 but require a '486, you
) can implement these yourself.

The newer implementation of InterlockedIncrement looks something
like this:

    mov  ecx, lpAddend
    mov  eax, 1
    lock xadd [ecx], eax  ; atomic{temp = *ecx; *ecx += eax; eax = temp}
    inc  eax              ; update, since it holds the old value of *ecx
    ret

However, the xadd instruction was not introduced until the 486.
The older implementation of InterlockedIncrement is implemented
in terms of "lock inc", and there's an obvious race condition
since there's no atomic instruction to read the value into a
register while simultaneously updating it. You're guaranteed
that the sign of the result is correct (<0, ==0, >0), but
that's it.

If you have code like this:

    Foo::Foo() : m_cRefs(0) {}

    void Foo::AddRef()
    {
        LONG l = InterlockedIncrement(&m_cRefs);
        if (l == 1) { /* first increment, so do some initialization... */ }
    }

and you want it to work correctly on all Win32 platforms, you need to
recalibrate it so that m_cRefs starts out at -1:

    Foo::Foo() : m_cRefs(-1) {}

    void Foo::AddRef()
    {
        LONG l = InterlockedIncrement(&m_cRefs);
        if (l == 0) { /* first increment, so do some initialization... */ }
    }

    void Foo::Release()
    {
        LONG l = InterlockedDecrement(&m_cRefs);
        if (l < 0) { /* last decrement, so release all resources... */ }
    }

assuming Releases correctly balance AddRefs, of course.
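
The same recalibration, restated with std::atomic as a runnable
sketch (not part of the original post) that relies only on the sign of
the result, which is the weaker guarantee discussed above:

```cpp
#include <atomic>
#include <cassert>

// Start the count at -1 and branch only on zero and the sign, which is
// all the old Interlocked* contract promised.
class Foo {
    std::atomic<long> m_cRefs{-1};
public:
    bool AddRef() {                 // true on the *first* AddRef
        return m_cRefs.fetch_add(1) + 1 == 0;
    }
    bool Release() {                // true on the *last* Release
        return m_cRefs.fetch_sub(1) - 1 < 0;
    }
};
```

The first increment takes -1 to 0 (reliably reported as zero), and the
last decrement takes 0 to -1 (reliably reported as negative), so the
magnitude of the count is never consulted.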

Windows 98 supports the new style InterlockedIncrement, even
though it runs on 386s, as well the other new Interlocked
APIs, such as InterlockedCompareExchange, which also
make use of instructions not available on 386s (cmpxchg).
Windows 98 uses the new instructions if it's running on a
sufficiently modern processor. On a 386, it has to fall back on
some horrors to ensure that the Interlocked APIs work correctly
with the advertised semantics, such as disabling interrupts to
ensure atomicity.
--
/George V. Reilly mailto:g...@halcyon.com
ATL, Vim (Vi IMproved), Active Server Pages: http://www.halcyon.com/gvr/
Co-author, "Beginning ATL COM Programming", Wrox Press, 1998

Roman Lechtchinsky

unread,
Nov 12, 1998, 3:00:00 AM11/12/98
to
Jerry Leichter wrote:
> BTW, if you're really want to optimize, it helps to consider what you
> really want to do with your strings.

Indeed.

> doing a *lot* of copying, COW or no. This is easily avoided, *without*
> COW: Create a new datatype, concatenated_string, which is an ordered
> list of string's (and, preferably, char *'s - since there are no
> compile-time pre-built constant string objects).

This doesn't allow concatenating two concatenated strings easily,
though ( of course, this could be added ). What I have found to work
quite well are binary trees, where each string ( let's call them
bstrings to avoid confusion ) is either a pair (bstring,bstring) or an
internal string representation like basic_string or preferably char*.
Concatenation of two bstrings is obviously O(1). Besides, you get
"unconcatenation" for free which can be useful sometimes. Other
interesting properties include a relatively fast operator[] ( the const
one - though now that I come to think of it, the mutating version could
be made quite fast, too, since you don't need to copy the entire tree )
and in some cases a rather efficient equality test. And you could even
reuse an old Lisp garbage collector for it.
Has anyone done any research on nontrivial string representations? I'd
be really interested in it.

Bye

Roman
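
Roman's bstring is essentially a rope; a minimal sketch (invented
names, shared_ptr for structure sharing) with O(1) concatenation:

```cpp
#include <cassert>
#include <memory>
#include <string>

// A node is either a leaf holding flat text or a pair of sub-bstrings,
// so concatenation is one node allocation and the pieces are shared,
// never copied.
struct BString {
    std::string leaf;                             // used if no children
    std::shared_ptr<const BString> left, right;   // used if a pair
    size_t len;

    explicit BString(std::string s) : leaf(std::move(s)), len(leaf.size()) {}
    BString(std::shared_ptr<const BString> l, std::shared_ptr<const BString> r)
        : left(std::move(l)), right(std::move(r)), len(left->len + right->len) {}

    char at(size_t i) const {                     // O(depth) indexing
        if (!left) return leaf[i];
        return i < left->len ? left->at(i) : right->at(i - left->len);
    }
    std::string flatten() const {                 // copy only on demand
        return left ? left->flatten() + right->flatten() : leaf;
    }
};

std::shared_ptr<const BString> cat(std::shared_ptr<const BString> a,
                                   std::shared_ptr<const BString> b) {
    return std::make_shared<const BString>(std::move(a), std::move(b));
}
```

SGI STL's rope, mentioned earlier in the thread, is a production-grade
version of this structure with balancing added.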

Paul D. DeRocco

unread,
Nov 12, 1998, 3:00:00 AM11/12/98
to
Jay Krell wrote:
>
> Me: Windows 95 and NT 3.x run on '386, Windows 98 and NT4+ require
> '486. I think that's why there are these differences. Anyone know the
> relevant diff between '386 and '486? Anything even better in Pentium,
> etc.? I guess, if you want to work on Win95 but require a '486, you
> can implement these yourself.

Yes. The 386 didn't have the XADD instruction.

--

Ciao,
Paul

Christopher Eltschka

unread,
Nov 12, 1998, 3:00:00 AM11/12/98
to
Jerry Leichter <leic...@smarts.com> writes:

>| >For the effort required to do an efficient, reliable, multi-platform
>| >reference-counted string implementation, I'd rather have a garbage
>| >collector, which does away with reference counts entirely (among many
>| >other advantages).
>|
>| If you use COW: How would you determine if a copy is needed, if you
>| don't have a reference count?
>|
>| And if you always copy: What would GC give you in this respect?
>| (the only difference I can see is that you don't need a delete
>| in the destructor - but that's not the problem anyway)

>If you want to alter an existing string in place, then you have to copy
>it. On the other hand, if you want to *extend* an existing string, you
>can use a string pointer object that includes the length. You can then
>safely extend an existing string, regardless of how many references
>there are to it (assuming you've reserved enough extra space - but
>that's true regardless of reference counts). You can even go further
>and use an offset to the beginning of a substring in a shared string.
>Then you can extract substrings without copying, too.

Then look at the following code:

string a("Hello");
string b(a);
// -- (1) --
a+=" world!";
// -- (2) --
b+=", too.";
// -- (3) --

Your supposed optimisation would look like

At (1):

a: (5, ->X), b: (5, ->X) X: "Hello___________"

("_" denotes pre-allocated space)

Therefore a and b are "Hello". Ok.

At (2):

a: (12, ->X), b: (5, ->X) X: "Hello world!____"

Now a is "Hello world!", b is "Hello". Still Ok.

At (3):

a: (12, ->X), b: (11, ->X) X: "Hello, too.!____"

Now a is "Hello, too.!", which is bad (b is "Hello, too.": correct).

The reason is that b can't know that there's another (longer)
reference to the string data, since there's no ref count.

[...]
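
The failure above is easy to reproduce with a deliberately naive
length-plus-shared-buffer string (a sketch; all names invented, no
reference count by design): b's append clobbers the tail that a
already wrote.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <string>

struct NaiveStr {
    std::shared_ptr<char> buf;    // shared, over-allocated buffer
    size_t len;                   // this handle's own length
    explicit NaiveStr(const char* s, size_t cap = 32)
        : buf(new char[cap], std::default_delete<char[]>()),
          len(std::strlen(s)) {
        std::memcpy(buf.get(), s, len);
    }
    void append(const char* s) {  // writes at our *own* length
        size_t n = std::strlen(s);
        std::memcpy(buf.get() + len, s, n);
        len += n;
    }
    std::string str() const { return {buf.get(), len}; }
};
```

With no count of sharers, b cannot know that a's longer view of the
same buffer exists, so a is silently corrupted.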

Jerry Leichter

unread,
Nov 12, 1998, 3:00:00 AM11/12/98
to
| >You can then safely extend an existing string, regardless of how many
| >references there are to it (assuming you've reserved enough extra
| >space - but that's true regardless of reference counts). You can
| >even go further and use an offset to the beginning of a substring in
| >a shared string. Then you can extract substrings without copying,
| >too.
|
| This was true in drafts of the standard up until very late.
|
| Then someone who thus far remains nameless slipped through a proposal
| changing operator[] to require (for no good reason that I can discern)
| that s[s.size()] must return 0.

Oh, yuck.

(Actually, just to make the interface even less comprehensible, this is
only true for the const version - at least in CD2. For the non-const
version, such a reference is undefined - with good reason, as otherwise
you'd have the impossible problem of deciding what happens if someone
assigns something non-0 (well, other than traits::eos()) to
s[s.size()].)

| Previously it was only necessary to tack on a 0 when s.c_str()
| was called; now it must be done after most modifications.

This is, I'm afraid, another example of an attempt to bundle in to one
and the same object support for several different styles of coding which
are almost never used simultaneously.

Personally, I long ago decided that in almost all cases, bounds checking
is a worthwhile overhead to pay. So I'd probably implement operator[]
with bounds checking, and handle s[s.size()] as a special case. (We're
talking about the advantages of a garbage-collected system here, and the
hazards from random memory scribling in such a system are two painful to
contemplate.) The gains from being able to share representations would
outweigh the costs of bounds checking in all programs I would want to
write - and, after all, if I really need to do that kind of thing,
there's always s.c_str().
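
The approach described here can be sketched as follows. This is a hypothetical `checked_string` wrapper (not any real library class): the const accessor treats `s[s.size()]` as a legitimate read of the terminator, while mutable access is strictly bounds-checked.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative only: a bounds-checked string where s[s.size()] is a
// special case for const access, as suggested above.
class checked_string {
public:
    checked_string(const char* s) : data_(s) {}

    // const access: s[s.size()] legitimately yields the terminating '\0';
    // anything past that is a hard error rather than a silent scribble.
    char operator[](std::size_t i) const {
        if (i > data_.size())
            throw std::out_of_range("checked_string::operator[]");
        return i == data_.size() ? '\0' : data_[i];
    }

    // mutable access: s[s.size()] is NOT writable, so the bound is strict.
    char& at(std::size_t i) {
        if (i >= data_.size())
            throw std::out_of_range("checked_string::at");
        return data_[i];
    }

    std::size_t size() const { return data_.size(); }

private:
    std::string data_;
};
```

The asymmetry mirrors the CD2 wording quoted earlier: reading one past the end is defined (const), writing there is not (non-const).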
-- Jerry

apje...@my-dejanews.com

unread,
Nov 12, 1998, 3:00:00 AM11/12/98
to
In article <fxt7lx2...@isolde.engr.sgi.com>,

Matt Austern <aus...@sgi.com> wrote:
> apje...@my-dejanews.com writes:
>
> > Where are you getting this information? The online documentation that
> > comes with VC5 documents InterlockedIncrement as being precisely an
> > atomic ++i.
>
> I'm getting it from the online documentation that comes with VC6.
> Either the behavior changed between VC5 and VC6 or (more likely)
> Microsoft changed the documentation because they discovered some rare
> cases where it does not work they way they thought it did when they
> wrote the VC5 documentation.
>
> The VC6 documentation is quite explicit.

Thanks for letting me know, but that s*cks! I have a lot of code
which uses Interlocked{In,De}crement, and depends on the
behaviour as documented in VC5. It runs on NT4. Does anyone know on
which systems InterlockedIncrement doesn't behave like ++i? Maybe on
the Unix port of MFC or something?

If they really discovered a problem, it seems very strange (though
MS-like) to just change the documentation instead of fixing the code.
I can't imagine it's difficult to fix on any system. The worst case
would be that on some certain system, Interlocked{In,De}crement
couldn't be implemented as efficiently as on most systems, but I would
think that by most people's standards, "correct" is better than "fast"
if you have to choose between the two.

--
Adam P. Jenkins
ajenkins at javanet dot com

-----------== Posted via Deja News, The Discussion Network ==----------


http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Jerry Leichter

unread,
Nov 12, 1998, 3:00:00 AM11/12/98
to
| > BTW, if you're really want to optimize, it helps to consider what
| >you really want to do with your strings.
|
| Indeed.
|
| > doing a *lot* of copying, COW or no. This is easily avoided,
| > *without* COW: Create a new datatype, concatenated_string, which is
| > an ordered list of string's (and, preferably, char *'s - since there
| are no compile-time pre-built constant string objects).

|
| This doesn't allow concatenating two concatenated strings easily,
| though ( of course, this could be added ). ...

You're forgetting what you "Indeed"'d to above. I'm suggesting
concatenated_string as an implementation class. It could actually be
private to the string implementation - users would not be able to get
their hands on one.

If you write:

s = s1 + s2 + s3 + s4;

the + concatenation operator associates to the left, so you never get
two concatenated_strings to concatenate. You'd have to go out of your
way to write:

s = (s1 + s2) + (s3 + s4);

Then the unparenthesized "+" would nominally be trying to concatenate
two concatenated_strings. If you really want to provide a definition
for CS::operator+(CS) (CS==concatenated_string; my fingers are getting
tired), you can; but why bother? How often would you expect this to
come up? Given the assumed conversion from CS to string, the outer "+"
would instead end up being CS::operator+(string) - a bit less efficient,
but it's unlikely to matter in any real program.
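
A minimal sketch of the idea, with illustrative names (a toy `str` class rather than `std::string`, to keep the overloads unambiguous): `operator+` builds an ordered list of pieces, and one flat buffer is allocated only when the result decays back to a string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class str;

// The hidden helper: holds pointers to its operands. This assumes the
// operands outlive the CS, which is true within one full expression.
class concatenated_string {
public:
    concatenated_string(const str& a, const str& b);
    concatenated_string& append(const str& s);
    operator str() const;   // "decays" to a flat string on use
private:
    std::vector<const str*> pieces_;
};

class str {
public:
    str(const char* s = "") : rep_(s) {}
    str(const std::string& s) : rep_(s) {}
    const std::string& rep() const { return rep_; }
    bool operator==(const char* s) const { return rep_ == s; }
private:
    std::string rep_;
};

inline concatenated_string::concatenated_string(const str& a, const str& b) {
    pieces_.push_back(&a); pieces_.push_back(&b);
}
inline concatenated_string& concatenated_string::append(const str& s) {
    pieces_.push_back(&s); return *this;
}
inline concatenated_string::operator str() const {
    std::size_t n = 0;                       // one pass to size...
    for (const str* p : pieces_) n += p->rep().size();
    std::string out;
    out.reserve(n);                          // ...one allocation, one copy
    for (const str* p : pieces_) out += p->rep();
    return str(out);
}

// left-associative chain s1+s2+s3+s4 accumulates into a single CS
inline concatenated_string operator+(const str& a, const str& b) {
    return concatenated_string(a, b);
}
inline concatenated_string operator+(concatenated_string cs, const str& b) {
    cs.append(b); return cs;
}
```

With this, `s = s1 + s2 + s3 + s4;` performs exactly one allocation and one copy of each operand, COW or no COW.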

| Other interesting properties [of a tree-based implementation] include
| a relatively fast operator[] ....

Again, if CS is an implementation-only object, the only way this could
come up is with (s1 + s2)[n]. Unlikely. Without a CS::operator[],
this expression would use the implicit conversion and
string::operator[].

| Has anyone done any research on nontrivial string representations? I'd
| be really interested in it.

I suggested CS as an implementation technique for optimizing a
conventional string implementation. Other string representations are
possible. I suggest you look at SGI's rope implementation
(http://www.sgi.com/Technology/STL/Rope.html). Ropes are intended to
allow for the efficient support of strings of any size - it's reasonable
for an editor to store the entire document being edited in a rope. On
the other hand, some operations on ropes can be pretty expensive, so you
probably wouldn't want to use rope as a replacement for string.

-- Jerry

Matt Austern

unread,
Nov 13, 1998, 3:00:00 AM11/13/98
to
Roman Lechtchinsky <wol...@cs.tu-berlin.de> writes:

> This doesn't allow concatenating two concatenated strings easily,

> though ( of course, this could be added ). What I have found to work
> quite well are binary trees, where each string ( let's call them
> bstrings to avoid confusion ) is either a pair (bstring,bstring) or a
> internal string representation like basic_string or preferably char*.
> Concatenation of two bstrings is obviously O(1). Besides, you get
> "unconcatenation" for free which can be useful sometimes. Other
> interesting properties include a relatively fast operator[] ( the const
> one - though now that I come to think of it, the mutating version could
> be made quite fast, too, since you don't need to copy the entire tree )
> and in some cases a rather efficient equality test. And you could even
> reuse an old Lisp garbage collector for it.

> Has anyone done any research on nontrivial string representations? I'd
> be really interested in it.

That's more or less what SGI calls ropes. See
http://www.sgi.com/Technology/STL/Rope.html for details, and see
H.-J. Boehm, R. Atkinson, and M. Plass, "Ropes: An Alternative to
Strings", Software Practice and Experience 25(12):1315, 1995, for
a discussion of the data structures and algorithms.
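
The tree representation described in the quoted post can be sketched like this (illustrative names; real rope implementations add rebalancing and leaf merging, omitted here): a bstring is either a leaf holding flat text or a node holding two subtrees, so concatenation is O(1).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// A toy rope/bstring node: leaf (left/right null) or concatenation pair.
struct bstring {
    std::shared_ptr<const bstring> left, right;  // null for a leaf
    std::string leaf;                            // used only for leaves
    std::size_t len;

    explicit bstring(std::string s) : leaf(std::move(s)), len(leaf.size()) {}
    bstring(std::shared_ptr<const bstring> l, std::shared_ptr<const bstring> r)
        : left(std::move(l)), right(std::move(r)), len(left->len + right->len) {}

    char at(std::size_t i) const {               // O(depth) lookup
        if (!left) return leaf[i];
        return i < left->len ? left->at(i) : right->at(i - left->len);
    }
    std::string flatten() const {
        return left ? left->flatten() + right->flatten() : leaf;
    }
};

using bptr = std::shared_ptr<const bstring>;

inline bptr make_leaf(std::string s) {
    return std::make_shared<bstring>(std::move(s));
}
inline bptr concat(bptr a, bptr b) {             // O(1): one new node
    return std::make_shared<bstring>(std::move(a), std::move(b));
}
```

Note that "unconcatenation" really is free: the two children of a node are themselves valid bstrings.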

Robert O'Dowd

unread,
Nov 13, 1998, 3:00:00 AM11/13/98
to
Roman Lechtchinsky wrote:
>
> Andrei Alexandrescu wrote:
> > I think we all agree that thread-safe COW strings are less efficient than
> > non-thread-safe ones. Then let's put *two* of them in place. A plain
> > string and a multithreaded one. The decision will be taken by the
> > programmer and specified as a third template argument of type bool
> > passed to basic_string.
>
> Actually, I expect a good implementation to provide two different
> versions of the entire standard library for single-threaded and
> multithreaded programs. They would be selected by a compiler option.
> Incidentally, this would probably also allow the compiler to generate
> more efficient code for single-threaded programs in some cases. Of
> course, this means quite a lot of additional work for the implementers (
> and thus additional bugs ), which in turn means that most of the time I
> would use the multithreaded version so that I have only one set of
> compiler/library bugs to deal with.
>

I'm inclined this way myself, unless the language standard starts
including specific support for multi-threading. At the moment,
multithreading in any form is beyond the scope of the standard. To do it
properly, multithreading concepts would need to be tied down explicitly
in the language and in the library. I'm hoping that the discussion in
this news thread (which I've been reading with great interest) will act
as a starting point for getting multithreading support in a future
standard. If that happens, I will agree with Andrei.


-<Automagically included trailer>
Robert O'Dowd Ph +61 (8) 9553 3618
DSTO Bldg A51 Fax +61 (8) 9553 3577
HMAS Stirling Email:
robert...@dsto.defence.gov.au
Rockingham, Western Australia, 6958

Disclaimer: Opinions above are mine and may be worth what you paid for
them

Oleg Zabluda

unread,
Nov 13, 1998, 3:00:00 AM11/13/98
to
Nathan Myers <n...@nospam.cantrip.org> wrote:
: Jerry Leichter <leic...@smarts.com> wrote:
: > You can then

: >safely extend an existing string, regardless of how many references
: >there are to it (assuming you've reserved enough extra space - but
: >that's true regardless of reference counts). You can even go further
: >and use an offset to the beginning of a substring in a shared string.
: >Then you can extract substrings without copying, too.

: This was true in drafts of the standard up until very late.

: Then someone who thus far remains nameless slipped through a proposal
: changing operator[] to require (for no good reason that I can discern)
: that s[s.size()] must return 0.

Plauger?

: Previously it was only necessary to tack on a 0 when s.c_str()


: was called; now it must be done after most modifications.

It's a mess. It would be better to eliminate operator[]
altogether, or at the very least make it return something
other than char&.

P.S. I'll send you my iterators Real Soon Now. It proved to be
a much deeper change than I expected, and I have very little
time right now. Currently the library compiles, but
the testsuite doesn't link. Should take another couple of
hours.

Oleg.
--
Life is a sexually transmitted, 100% lethal disease.

Michael Rubenstein

unread,
Nov 13, 1998, 3:00:00 AM11/13/98
to

According to the documentation, the return value may not be the
incremented value under Windows 95 or NT 3.51 or earlier. It will
have the correct sign, including being zero if the result is zero, but
the value may not be correct if the new value is nonzero. Under
Windows 98 or NT 4.0 and later, the returned value is the incremented
value.
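
The portable moral of this weaker guarantee: a reference-count release may rely only on the *sign* of the decremented value, never its exact magnitude. A sketch of that pattern, simulated with `std::atomic` (the hypothetical `interlocked_dec_sign` stands in for `InterlockedDecrement` as documented for the older systems):

```cpp
#include <atomic>
#include <cassert>

std::atomic<long> g_refs{0};

// Simulates the weak contract: returns only the sign of the new value
// (-1, 0, or +1), not the value itself.
long interlocked_dec_sign(std::atomic<long>& n) {
    long v = n.fetch_sub(1) - 1;        // exact new value...
    return (v > 0) - (v < 0);           // ...collapsed to its sign
}

// Safe release pattern: destroy only when the sign says we hit zero.
// Comparing the result against a specific nonzero count would be wrong
// on systems where only the sign is guaranteed.
bool release(std::atomic<long>& refs) {
    return interlocked_dec_sign(refs) <= 0;   // true => last reference gone
}
```

Code written this way keeps working whether the platform returns the exact decremented value or only its sign.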
--
Michael M Rubenstein

Jerry Leichter

unread,
Nov 13, 1998, 3:00:00 AM11/13/98
to
| >If you want to alter an existing string in place, then you have to
| >copy it. On the other hand, if you want to *extend* an existing
| >string, you can use a string pointer object that includes the length.
|
| Then look at the following code:
|
| string a("Hello");
| string b(a);
| // -- (1) --
| a+=" world!";
| // -- (2) --
| b+=", too.";
| // -- (3) --
|
| [problem when b's addition to "Hello" overwrites a's addition.]

Oh, come now; this is silly - "Hey, I can propose an implementation that
doesn't work". The amount of space actually allocated to the string is
going to be stored *somewhere*, no? You wouldn't want to duplicate it
in every pointer to the string data - you'd store it with the string's
header. Similarly, you'd want to store a pointer to the first free byte
within that space. The modification to a would succeed in place - the
first byte after a's value is still free. The modification to b would
find that the first free byte after b's value was *not* free, and so
would copy b before appending to it. (Actually, a clever implementation
can get by with a single "bytes remaining unused" field.)
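
This scheme, including why a's append proceeds in place while b's must copy, can be sketched as follows (illustrative names, single-threaded for clarity; the locking or atomic frontier update discussed below is omitted):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>

// Shared storage; data.size() serves as the "first free byte" marker.
struct buffer {
    std::string data;
};

// A (buffer, start, length) view: many views may share one buffer.
struct strview {
    std::shared_ptr<buffer> buf;
    std::size_t start = 0, len = 0;

    static strview make(const char* s) {
        strview v{std::make_shared<buffer>(), 0, std::strlen(s)};
        v.buf->data = s;
        return v;
    }

    void append(const char* s) {
        if (start + len != buf->data.size()) {        // frontier taken:
            auto fresh = std::make_shared<buffer>();  // copy first
            fresh->data = value();
            buf = fresh;
            start = 0;
        }
        buf->data += s;                               // extend in place
        len += std::strlen(s);
    }

    std::string value() const { return buf->data.substr(start, len); }
};
```

Running the Hello/world example from the quoted code: a's append finds its end at the frontier and extends in place; b's append finds the frontier occupied and copies before appending, so neither result is corrupted.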

A multithread-safe implementation would have to lock the header before
checking and manipulating the "first free byte" pointer, but that's no
surprise. (If you have access to hardware that can do, say, fetch-and-
add or compare-and-swap, you don't even need the lock - you can test and
update the free space pointer atomically in one instruction.) At least
it only has to do atomic operations when *modifying* the string, not on
every reference!
-- Jerry

Roman Lechtchinsky

unread,
Nov 13, 1998, 3:00:00 AM11/13/98
to
Jerry Leichter wrote:
> | > doing a *lot* of copying, COW or no. This is easily avoided,
> | > *without* COW: Create a new datatype, concatenated_string, which is
> | > an ordered list of string's (and, preferably, char *'s - since there
> | > are no compile-time pre-built constant string objects).
> |
> | This doesn't allow concatenating two concatenated strings easily,
> | though ( of course, this could be added ). ...
>
> You're forgetting what you "Indeed"'d to above. I'm suggesting
> concatenated_string as an implementation class. It could actually be
> private to the string implementation - users would not be able to get
> their hands on one.

I didn't realize that. Obviously, if you want to use it in a
basic_string implementation you have to make sure it doesn't show up in
the interface ( i.e. operator+(string,string) has to return a string and
not a CS ) which means that a string can now be one of two things:
either a pointer to a conventional string representation or a pointer to
a concatenated string. This will probably slow down every single string
operation except concatenation. Besides, you still have to define when
exactly a CS representation is converted to the default one. And if you
are suggesting it as a not-quite-conforming-but-close-enough drop-in
replacement for basic_string then I can think of better alternatives (
and I doubt if it really is "close enough" ).
Actually, I'm not sure if such optimizations to basic_string are useful
in general. For a program to be truly portable, it must not only compile
and run across a set of platforms, it must also meet its specifications
on every single member of that set. If these specifications somehow
imply very fast string concatenation basic_string can't be used anyway
because the standard makes no guarantees about its performance. If they
don't then you don't really need this optimization, either.

> If you write:
>
> s = s1 + s2 + s3 + s4;
>
> the + concatenation operator associates to the left, so you never get
> two concatenated_strings to concatenate. You'd have to go out of your
> way to write:
>
> s = (s1 + s2) + (s3 + s4);

This is just too subtle for my taste. I don't want my program to break
just because someone added a couple of parentheses to a supposedly
associative operation. BTW, this is not as unlikely as it seems. If
s1+s2 is known to be rather long and s4 is short then one might try to
help the compiler by making it evaluate s3+s4 first and append the
result to s1+s2 which amounts to about (s1+s2).length()-s4.length()
characters less to copy. Not that I would really recommend it.

> | Has anyone done any research on nontrivial string representations? I'd
> | be really interested in it.
>

> (http://www.sgi.com/Technology/STL/Rope.html). Ropes are intended to
> allow for the efficient support of strings of any size - it's reasonable
> for an editor to store the entire document being edited in a rope.

That's what bothers me. I'm not sure if it is really fast for rather
short strings ( after a quick look at the implementation I think it
isn't ). Does anyone have some data on it? Thanks for the link.

Bye

Roman

Alex Martelli

unread,
Nov 13, 1998, 3:00:00 AM11/13/98
to
[Warning--topic drift. --mod]

apje...@my-dejanews.com wrote in message <71vvj6$u0f$1...@nnrp1.dejanews.com>...
[snip]


>> InterlockedDecrement is not an atomic version of --i
>.. snip ..
>> In those cases, if you want to write code that's portable across all
>> Win32 systems, there's no sensible choice but to use critical sections.
>
>Just use InterlockedIncrement with a negative number to get an atomic --i if
>that's the case. Anyway, InterlockedDecrement does the right thing on NT4.


InterlockedIncrement is specified *exactly the same way* as
InterlockedDecrement -- on Win95 and NT3.51, both return
a result that is only guaranteed to be correct as to sign; on
Win98 and NT4, the correct result is obtained. There is no
advantage in using one versus the other function.


Alex

Nathan Myers

unread,
Nov 14, 1998, 3:00:00 AM11/14/98
to
Oleg Zabluda <zab...@math.psu.edu> wrote:
>Nathan Myers <n...@nospam.cantrip.org> wrote:
>: Jerry Leichter <leic...@smarts.com> wrote:
>: > You can then
>: >safely extend an existing string, regardless of how many references
>: >there are to it (assuming you've reserved enough extra space - but
>: >that's true regardless of reference counts).
>
>: This was true in drafts of the standard up until very late.
>: Then someone who thus far remains nameless slipped through a proposal
>: changing operator[] to require (for no good reason that I can discern)
>: that s[s.size()] must return 0.
>
>Plauger?

Probably not. Bill is responsible for very few defects in the
standard. The ones I know about are operator>> failing to report
overflow (scheduled to be fixed) and the char_traits<> members
being obliged to save one of their arguments to return, uselessly
(probably won't be fixed).

My own defects are too numerous to list here. :-)

Christopher Eltschka

unread,
Nov 16, 1998, 3:00:00 AM11/16/98
to
Jerry Leichter <leic...@smarts.com> writes:

>| >If you want to alter an existing string in place, then you have to
>| >copy it. On the other hand, if you want to *extend* an existing
>| >string, you can use a string pointer object that includes the length.
>|
>| Then look at the following code:
>|
>| string a("Hello");
>| string b(a);
>| // -- (1) --
>| a+=" world!";
>| // -- (2) --
>| b+=", too.";
>| // -- (3) --
>|
>| [problem when b's addition to "Hello" overwrites a's addition.]

>Oh, come now; this is silly - "Hey, I can propose an implementation that
>doesn't work". The amount of space actually allocated to the string is
>going to be stored *somewhere*, no? You wouldn't want to duplicate it
>in every pointer to the string data - you'd store it with the string's
>header.

Then I'll quote what you've written yourself:

---- 8< ---- 8< ---- 8< ---- 8< ---- 8< ----


If you want to alter an existing string in place, then you have to copy
it. On the other hand, if you want to *extend* an existing string, you

can use a string pointer object that includes the length. You can then
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


safely extend an existing string, regardless of how many references
there are to it (assuming you've reserved enough extra space - but

that's true regardless of reference counts). You can even go further
and use an offset to the beginning of a substring in a shared string.
Then you can extract substrings without copying, too.

---- 8< ---- 8< ---- 8< ---- 8< ---- 8< ----

(Marking ^^^^^ by me).

So, please be careful with words like "silly"...

Also note that if you want to reference *substrings*, the idea
of storing the length in the beginning would not work, since you
don't point to the beginning of the buffer (unless, of course,
you add an additional pointer to the string object ["string
pointer"] representation).

> Similarly, you'd want to store a pointer to the first free byte
>within that space.

Which space - the "string pointer" space (i.e. the "string object")
or the "string data" space?

> The modification to a would succeed in place - the
>first byte after a's value is still free. The modification to b would
>find that the first free byte after b's value was *not* free, and so
>would copy b before appending to it.

Ok, obviously you want to store the "first-free" (=one-past-end)
in the "string pointer". So, instead of my pair
(pointer-to-start, length) you have a pair
(pointer-to-start, pointer-to-one-past-end). But both are
completely equivalent (Ok, one may allow more efficient code,
and depending on the implementation, the size/alignment may differ).

However, I now see how your implementation would work:

You store the length in *two* places: The "real" length
together with the string data, and the "local" length together
with the pointer. This way, you can avoid copying if you
see that you end at the "real" end of data.

Did I now get that right? (Maybe it would have been more helpful
if you'd pointed that very detail out)

Now, I understand this is an advantage (some copying avoided);
however, this can easily be combined with reference counting
(the string data can have a ref count *as well* as a length
(or end) indicator. If combined with reference counting, you
get the best of both worlds: Modification needs only copy
if you have more than one reference, and extending needs
only copy if your end has additional data *and in addition*
you are not the only one referencing the data (because if you
are, the extra data is not used any more, and you can safely
overwrite it).
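
The combined rule argued for here reduces to one decision function, sketched below with illustrative parameter names: copy before appending only when data lies past our end *and* someone else still references the buffer.

```cpp
#include <cassert>
#include <cstddef>

// Best of both worlds: the end indicator says whether the tail past our
// view is occupied; the ref count says whether we are the sole owner and
// may therefore overwrite that tail safely.
bool must_copy_before_append(std::size_t view_end,     // start + length
                             std::size_t buffer_used,  // "first free byte"
                             long refcount) {
    bool tail_in_use = view_end != buffer_used;  // someone extended past us
    return tail_in_use && refcount > 1;          // sole owner may overwrite
}
```

Without the ref count, the `tail_in_use` test alone forces a copy even when the tail data belongs to no live string.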

So, I've learned about another optimisation for strings, but
the original question remains unanswered: What do you gain
with GC without RC versus RC without GC?
As I've shown above, even in the case of appending, RC can
avoid some copies you'd have with GC.

> (Actually, a clever implementation
>can get by with a single "bytes remaining unused" field.)

>A multithread-safe implemen- tation would have to lock the header before
>checking and manipulating the "first free byte" pointer, but that's no
>surprise. (If you have access to hardware that can do, say, fetch-and-
>add or compare-and-swap, you don't even need the lock - you can test and
>update the free space pointer atomically in one instruction.) At least
>it only has to do atomic operations when *modifying* the string, not on
>every reference!

If we are talking about implementations of the standard string
class, it still will have to do atomic operations with taking
(non-const) references - a non-RC-implementation will have to
make a deep copy in *every* case there, which requires locking.

If we are talking about implementation of a specialized "appender"
string where modification of single characters is not allowed,
you'll not get references to the string data at all (it would
not make sense to do so). And therefore the question of atomic
operations on references doesn't even make sense.


The standard demands a full string class, and a library that
wants to give a maximum performance string class will have
to use RC anyway - so it's no substantial overhead to use that
for memory management as well.

The specialized append_string can profit with RC as well,
in certain cases. The question remains if those cases occur
often enough to justify the overhead of RC - in that specialized
case I admit GC may win.


However, I agree with the standard that the default should be a
general-purpose string class - if this doesn't fit your
performance needs, you still can write specialized classes for
your needs (as you can do with containers, algorithms, streambufs,
etc.).
As an example, look at real numbers. For multiplication, division
and power, the most efficient representation would be just the
logarithm; this would allow those operations to be implemented
by addition, subtraction and multiplication. Therefore you might
argue that you shouldn't use a general float type, but
specialized types which support multiplication etc., and another
set optimized for addition/subtraction. Given that many
applications of real numbers use lots of multiplications/powers
first, and after that add up the results, this might increase
the performance of numerical calculations. But I never heard
such a proposal - instead, we have the float types, which are
reasonably optimized for both types of operation.

Jerry Leichter

unread,
Nov 17, 1998, 3:00:00 AM11/17/98
to
| > I'm suggesting concatenated_string as an implementation class. It
| > could actually be private to the string implementation - users would
| > not be able to get their hands on one.
|
| I didn't realize that. Obviously, if you want to use it in a
| basic_string implementation you have to make sure it doesn't show up
| in the interface ( i.e. operator+(string,string) has to return a
| string and not a CS )

Hmm. This brings up an interesting and subtle point: I was suggesting
that, indeed, operator+(string,string) return a CS, not a string.
Suppose we provide:

CS operator+(string,string)
CS operator+(CS,string)
(Could be member functions, of course; and there
are probably others to handle char*'s directly
and so on.)
operator string(const CS& cs)
Converts a CS to the equivalent string
string* CS::operator&()
Converts this to the equivalent string and
returns its address

We make the constructor for CS private, and make the appropriate
functions in string friends.

Now, CS appears in the interface, but it's impossible for the user to
create a CS - because the constructor is private - and he can't even get
the address of a CS created by the interface - because of the operator&.
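
The private-constructor part of this interface compiles as sketched below (illustrative names; the `operator&` trick and the char* overloads are omitted, and the CS-to-string decay is done with a converting constructor rather than a conversion operator, since providing both at once would make the conversion ambiguous):

```cpp
#include <cassert>
#include <string>

class str;

// The hidden helper: no public constructor, so only str's friend
// operator+ can ever create one.
class CS {
private:
    CS(const std::string& a, const std::string& b) : text_(a + b) {}
    std::string text_;
    friend CS operator+(const str&, const str&);
    friend class str;               // str may read text_ when converting
};

class str {
public:
    str(const char* s = "") : rep_(s) {}
    str(const CS& cs) : rep_(cs.text_) {}   // the CS-to-string "decay"
    const std::string& rep() const { return rep_; }
private:
    std::string rep_;
};

// The only creator of CS objects; users cannot construct one themselves.
inline CS operator+(const str& a, const str& b) {
    return CS(a.rep(), b.rep());
}
```

A user writing `CS cs(...)` gets a compile-time error, yet `str s = s1 + s2;` works transparently, with the CS visible in the interface but unusable on its own.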

As written, such a string class would not be consistent with the library
interface. However, it's very close to following the "as if" rule used
in the rest of the standard - in almost all code, it's impossible to
tell that operator+ is really returning a CS which "decays" to a string
when you try to use it. I can think of two ways the difference can
become visible:

1. If the result of operator+ is used as a template argument.
For most purposes, the worst that will happen is that
you'll get duplicated code; but you *can* write a
template that checks whether two types are the same.
2. If you define a conversion operator from string to some
other class (or do the equivalent with a non-explicit
constructor), then the "no two user-defined conversions"
rule can come into play.
3. Really, there's a third: sizeof(CS) and sizeof(string)
may differ. You can, of course, design your classes
carefully to avoid this.

There are other little oddities, like throwing a CS.

C++ *almost* lets you hide the CS helper class. It's really unfortunate
that you can't hide it completely, as this would allow for the
application of an "as if" rule to libraries in a way you can't really do now.
A big part of it could be done by having a way to directly prevent
anyone (but friends, say) from declaring either a pointer or reference
to a class. (This generalizes the notion of having only private
constructors.)

A question for language lawyers: If the standard says that something
returns an A, when - if ever - can it actually return a publicly
derived subclass? (In modeling terms, if B is publicly derived from
A, we say B isa A. However, it *is* possible to write code - like the
examples above - which accept an A but not a B.)

| > If you write:
| >
| > s = s1 + s2 + s3 + s4;
| >
| > the + concatenation operator associates to the left, so you never
| > get two concatenated_strings to concatenate. You'd have to go out
| > of your way to write:
| >
| > s = (s1 + s2) + (s3 + s4);
|
| This is just too subtle for my taste. I don't want my program to break
| just because someone added a couple of parentheses to a supposedly
| associative operation.

I never suggested that the above wouldn't work - just that it might not
work exactly as a naive reading would predict. If there were an
operator+(CS,string), the result of (s3 + s4) would be converted to a
string, and then concatenated on. If you think you're going to see many
expressions like this, by all means define an operator+(CS,CS) which
handles it more optimally.
-- Jerry

t...@wagner.princeton.edu

unread,
Nov 18, 1998, 3:00:00 AM11/18/98
to
In article <3651B4...@smarts.com>, Jerry Leichter

<leic...@smarts.com> writes:
>
> As written, such a string class would not be consistent with the
> library interface. However, it's very close to following the "as if"
> rule used in the rest of the standard - in almost all code, it's
> impossible to tell that operator+ is really returning a CS which
> "decays" to a string when you try to use it. I can think of two ways
> the difference can become visible:
>
> [...]

It seems to me that you can run into problems with much simpler code:

string (*f)(string, string);

f = operator+;

yes, I believe this is legal! :-)

---------------------------------------------------------------------------
Tim Hollebeek | "Everything above is a true
email: t...@wfn-shop.princeton.edu | statement, for sufficiently
URL: http://wfn-shop.princeton.edu/~tim | false values of true."

Jaewoo Kim

unread,
Dec 1, 1998, 3:00:00 AM12/1/98
to
>It's a mess. It would be better to eliminate operator[]
>altogether, or at the very least make it return something
>other than char&.


I think that's the point: operator[] shouldn't have been provided for
std::string at all. Because a string is not a string buffer - a mutable
character aggregate - that's a definite defect of std::string.

-- Kizoo

0 new messages