
Const reference worse than a copy (passing by value)?


Allan Odgaard

Jan 16, 2004, 8:17:29 AM
I noticed that in my implementation (gcc 3.3/Mac OS X) std::copy is
implemented as a chain of functions, each using traits and argument
overloading to specialize the implementation, similar to the following
(simplified to illustrate the point -- it does not compile!):

_OutIter copy (_Iter f, _Iter l, _OutIter out)
{
    return help_copy(f, l, out, is_simple_iter<_Iter>::value);
}

_OutIter help_copy (_Iter f, _Iter l, _OutIter out, true_type)
{
    return pod_copy(f, l, out, is_simple_iter<_OutIter>::value);
}

_OutIter help_copy (_Iter f, _Iter l, _OutIter out, false_type)
{
    return class_copy(f, l, out, is_simple_iter<_OutIter>::value);
}

In the real implementation there are five functions in this chain --
what puzzled me was that the iterator arguments are declared to be
passed by value.

This means that the iterators will be copied five times (and I
checked, they are, even with full optimizations).

I only know of two reasons not to prefer a const reference: one is the
penalty related to accessing the value through the reference (and not
being able to cache it without full alias analysis); the other is that
there is no need for alias analysis with by-value arguments, which may
allow for more aggressive optimization.

But I do not see this as relevant in the intermediate helper
functions, which just call another function, each passing its own
arguments verbatim to the next.

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Dhruv

Jan 16, 2004, 10:22:39 PM
On Fri, 16 Jan 2004 08:17:29 -0500, Allan Odgaard wrote:

[...]

> In the real implementation there are five functions in this chain --
> what puzzled me was, that the iterator arguments are declared to be
> by-value.
>
> This means that the iterators will be copied five times (and I
> checked, they are, even with full optimizations).
>
> I only know of two reasons for why not to prefer const reference, one
> being the penalty related to accessing the value (and not being able
> to cache it w/o full alias analysis), and the other being that there
> is no need for alias analysis with by-value, which may allow for more
> aggressive optimization.

Basically, in loopy functions such as copy, find, and for_each, the value of
an iterator is queried and the iterator is incremented/decremented very
frequently. So if you pass the iterator by reference, cache locality
may get disturbed badly. Also, a const reference cannot be used, because
then the iterators' dereferenced value may not be passed to functors that
take non-const references. (Usually, operator*() for the iterators is
not a const member.) And pass by non-const reference cannot be used,
because code such as this would become invalid:

find (v.begin(), v.end(), Some_Value);


Also, generally iterators for vector/list are small, and fit into a
hardware register, so with optimizations, and if the compiler is ready to
change the function calling convention partially, or for inlines, it may
optimize away the passing altogether. However, for map/multimap/deque the
iterators are quite large. Do sizeof on the iterator to find out the size
of the iterators on your implementation. After all, iterators are
implementation defined entities.


> But I do not see this as relevant in the intermediate helper
> functions, which just call another function, each passing its own
> arguments verbatim to the next.

I haven't quite followed you here. Could you please elaborate?


Regards,
-Dhruv.

Geno Rice

Jan 17, 2004, 6:02:25 AM


Perhaps they don't bother to pass references because passing
an iterator, which, after all, is just a pointer, is the same (at
the assembly code level) as passing a reference, which itself is just a pointer.

What surprises me is that the implementation uses 5 nested function
calls. Doesn't that screw up code cache locality, branch prediction, etc?
Are you sure these don't get turned into inlines, and then get
optimized away?

Geno Rice

Ivan Vecerina

Jan 17, 2004, 6:08:45 AM
"Allan Odgaard" <Du...@DIKU.DK> wrote in message
news:689e217.04011...@posting.google.com...

> I noticed that in my implementation (gcc 3.3/Mac OS X) std::copy is
> implemented as a chain of functions, each use traits and argument
> overloading to specialize the implementation, similar to the following
> (simplified to illustrate point -- does not compile!):
...

> This means that the iterators will be copied five times (and I
> checked, they are, even with full optimizations).
I would hope that, once the helper functions are inlined (as they should
be), the intermediate iterator copies are optimized out...

> I only know of two reasons for why not to prefer const reference, one
> being the penalty related to accessing the value (and not being able
> to cache it w/o full alias analysis), and the other being that there
> is no need for alias analysis with by-value, which may allow for more
> aggressive optimization.

I see other reasons:
- readability (author didn't bother, wanted to rely on the compiler)
- the standard does define std::copy and other algorithms as taking
their iterator arguments by value. Using a const-reference instead
could create portability issues (e.g. if using function pointers).
And of course, by-const-ref is pointless if the function wants to
modify its (local) arguments.

> But I do not see this as relevant in the intermediate helper
> functions, which just call another function, each passing its own
> arguments verbatim to the next.

I would guess that intermediate functions that are not visible to
the user could use const-references. But this might be unnecessarily
complicated.

I would rather expect that a decent compiler would optimize out, as
allowed by the standard, the unnecessary parameter copies in all
inlined functions.


Regards,
Ivan
--
http://ivan.vecerina.com/contact/?subject=NG_POST <- e-mail contact form

Allan Odgaard

Jan 17, 2004, 6:09:07 AM
"Dhruv" <dhru...@gmx.net> wrote in message news:<pan.2004.01.16....@gmx.net>...

> Basically, in loopy functions such as copy, find, for_each, the value of
> an iterator is queried and the iterator is incremented/decremented very
> frequently. So, if you pass the iterator by reference, the cache locality
> may get disturbed badly.

I doubt locality of reference plays any role here -- the iterator is
most likely to be found on the calling function's stack frame. I do not
see why moving it 32 bytes or so should make a difference in
performance (with an associative cache).

However, this is not related to my letter, as I was not talking about
mutating values passed by non-const reference.

> Also, const reference can not be used because
> then, the iterators' dereferenced value may not be passed to functors that
> take non-const references. (Usually, the operator*() for the iterators is
> not a const member)

Yes, assuming the calling function wants to mutate the value (without
making a copy), it cannot take it by const reference, but I do not
think this was implied in my letter -- also, the end iterator is
rarely used for anything other than as an argument to
iterator::operator!=, which takes it as a const reference anyway, so why
not always pass it as a const reference to the "algorithm"?

> Also, pass by reference may not be used because
> then, the code such as this will become invalid:
> find (v.begin(), v.end(), Some_Value);

That would only be invalid if you assume that find wants to call
non-const member functions on the iterators. It may make a copy of the
first iterator (or take it by value) and still take the end iterator
by const reference.

> Also, generally iterators for vector/list are small, and fit into a
> hardware register, so with optimizations, and if the compiler is ready to
> change the function calling convention partially, or for inlines, it may
> optimize away the passing altogether.

But why should we rely on all these factors (which often do not hold),
when we can explicitly avoid the extra copy by using const reference?
i.e. I was interested in the downside of const reference, and I listed
two myself, which I did not see apply to the case I was speaking about
(std::copy), and would be interested if there are other hidden "costs"
of const reference.

btw: copying an iterator is often not just a raw memory copy; it will
need to invoke the copy constructor (and the destructor when done).

> > But I do not see this as relevant in the intermediate helper
> > functions, which just call another function, each passing its own
> > arguments verbatim to the next.
> I haven't quite followed you here. Please could you elaborate.

I am afraid this applies to my entire letter ;)

As explained in the beginning of my previous letter, std::copy (for
gcc) is in essence implemented like this:

OutIter copy (Iter f, Iter l, OutIter o)
{ return copy_1(f, l, o, xyz); }

OutIter copy_1 (Iter f, Iter l, OutIter o, magic)
{ return copy_2(f, l, o, xyz); }

OutIter copy_2 (Iter f, Iter l, OutIter o, magic)
{ return copy_3(f, l, o, xyz); }

OutIter copy_3 (Iter f, Iter l, OutIter o, magic)
{ return copy_4(f, l, o, xyz); }

OutIter copy_4 (Iter f, Iter l, OutIter o, magic)
{ /* do actual copy */ }

Here 'xyz' and 'magic' are stand-ins for some traits machinery which
resolves various properties of the iterators passed in, to specialize
the copy (and thus make it more efficient).

So calling std::copy will go through a chain of five functions, before
it arrives at the proper one, and for each step in this chain, it will
invoke the copy constructor for the iterators, since these are passed
by-value.

As I see it, there is no reason not to make the arguments const
references in all but the last function, and in the last function the
end iterator could still be a const reference, since it is probably
only used as the rhs argument to operator!=.

John Potter

Jan 17, 2004, 8:24:39 PM
On 16 Jan 2004 08:17:29 -0500, Du...@DIKU.DK (Allan Odgaard) wrote:

> I noticed that in my implementation (gcc 3.3/Mac OS X) std::copy is
> implemented as a chain of functions, each use traits and argument
> overloading to specialize the implementation,

> In the real implementation there are five functions in this chain --


> what puzzled me was, that the iterator arguments are declared to be
> by-value.

> This means that the iterators will be copied five times (and I
> checked, they are, even with full optimizations).

I don't know what you are using. I tried -S and got five calls, but
-O -S inlined the whole mess and called memmove directly.

Maybe you should post your code.

John

Allan Odgaard

Jan 18, 2004, 6:20:01 AM
Geno Rice <ge...@monmouth.com> wrote in message news:<dj0Ob.762$4P6...@nwrdny01.gnilink.net>...

> > This means that the iterators will be copied five times (and I
> > checked, they are, even with full optimizations).

> Perhaps they don't bother to pass references because passing
> an iterator, which, after all, is just a pointer, is the same (at
> the assembly code level) as passing a reference, which itself is just a pointer.

For std::vector and std::string, the normal iterator *could* be a
pointer, but it generally isn't.

> What surprises me is that the implementation uses 5 nested function
> calls. Doesn't that screw up code cache locality, branch prediction, etc?
> Are you sure these don't get turned into inlines, and then get
> optimized away?

They most likely will be inline, but the iterators are still being
copied. As stated, I did test this!

Allan Odgaard

Jan 18, 2004, 6:25:37 AM
John Potter <jpo...@falcon.lhup.edu> wrote in message news:<s8ei009hbteoom9f9...@4ax.com>...

> > This means that the iterators will be copied five times (and I
> > checked, they are, even with full optimizations).
> I don't know what you are using. I tried -S and got five calls, but
> -O -S inlined the whole mess and called memmove directly.

Probably the optimizer is better at removing redundant temporaries
when they are POD. I was using non-POD iterators.

> Maybe you should post your code.

The code is dependent on other things, but here's a minimal slip of
code to illustrate it:

struct count_iterator : public std::iterator<std::input_iterator_tag,
                                             int, size_t, int*, int&>
{
    typedef count_iterator self;
    static int copy_count;
    int value;
    count_iterator (int i) : value(i)
    { }
    count_iterator (self const& rhs) : value(rhs.value)
    { ++copy_count; }
    self& operator= (self const& rhs)
    { ++copy_count; value = rhs.value; return *this; }
    int& operator* ()
    { return value; }
    self& operator++ ()
    { ++value; return *this; }
    bool operator== (self const& rhs) const
    { return value == rhs.value; }
    bool operator!= (self const& rhs) const
    { return value != rhs.value; }
};

int count_iterator::copy_count = 0; // the static member needs a definition

Doing:

    std::vector<int> v;
    std::copy<count_iterator>(5, 15, std::back_inserter(v));
    printf("%d\n", count_iterator::copy_count);

will output 8, i.e. 4 copies per iterator. This is *with*
optimizations.
--
Private Mails To: Allan At Top-House Dot DK
http://www.top-house.dk/~aae0030/

John Potter

Jan 18, 2004, 7:28:49 PM
On 18 Jan 2004 06:25:37 -0500, Du...@DIKU.DK (Allan Odgaard) wrote:

> John Potter <jpo...@falcon.lhup.edu> wrote in message news:<s8ei009hbteoom9f9...@4ax.com>...

> > > This means that the iterators will be copied five times (and I
> > > checked, they are, even with full optimizations).
> > I don't know what you are using. I tried -S and got five calls, but
> > -O -S inlined the whole mess and called memmove directly.

> Probably the optimizer is better at removing redundant temporaries
> when they are POD. I was using non-POD iterators.

> > Maybe you should post your code.

> The code is dependent on other things, but here's a minimal slip of
> code to illustrate it:

Yes. It illustrates that you need to trust your compiler.

> std::vector<int> v;
> std::copy<count_iterator>(5, 15, back_inserter(v));
> printf("%d\n", count_iterator::copy_count);

> Will output 8, i.e. 4 copies pr. iterator. This is *with*
> optimizations.

Yep. Now look at the assembler code. That 8 is observable
behavior. It optimized out all of the calls to the copy
ctor and just added 8 to the static int. Beautiful use of
the as-if rule. Because each function call introduces a
new variable, there are no temporaries to remove using the
special rules in 12.8/15. However, it is allowed to not
make the calls and copies if it can fool you into thinking
that it did.

A good magician whose sleight of code is faster than printf.

Remove the instrumentation and save the compiler the time
needed to fool you.

John

Allan Odgaard

Jan 19, 2004, 12:26:18 PM
John Potter <jpo...@falcon.lhup.edu> wrote in message news:<62al00dcskv8vqjqb...@4ax.com>...

> > The code is dependent on other things, but here's a minimal slip of
> > code to illustrate it [...]

> Yes. It illustrates that you need to trust your compiler.

I would think that it illustrates that the compiler will not optimize
away the copying of iterators. It will inline the copy constructor and
then fold the four repeated increments into a single add, so my
example may have been bad at illustrating that there is a price
attached to copying the iterators -- but the copying is done
nonetheless, hence my original question about why not to prefer const
reference over passing by value. The compiler will often be able to
optimize the code so that the copy is virtually free, but there are
exceptions, which AFAIK there are not with const references.

Tom

Jan 20, 2004, 6:09:26 AM
Du...@DIKU.DK (Allan Odgaard) wrote:

> Geno Rice <ge...@monmouth.com> wrote:

> > > This means that the iterators will be copied five times (and I
> > > checked, they are, even with full optimizations).
> > Perhaps they don't bother to pass references because passing
> > an iterator, which, after all, is just a pointer, is the same (at
> > the assembly code level) as passing a reference, which itself is just a
> > pointer.
>
> For std::vector and std::string, the normal iterator *could* be a
> pointer, but it generally isn't.

Although an iterator for std::vector or std::string may or may not be
an ordinary pointer, for a non-debug implementation sizeof such an
iterator should be the same as for an ordinary pointer, which I think is
what Geno Rice meant. In other words, copying is no more expensive
than passing by reference for such iterators.

Best regards,

Tom

ka...@gabi-soft.fr

Jan 21, 2004, 10:48:01 AM
Thomas...@yahoo.com (Tom) wrote in message
news:<7b68d58f.04011...@posting.google.com>...
> Du...@DIKU.DK (Allan Odgaard) wrote:

> > Geno Rice <ge...@monmouth.com> wrote:
> > > > This means that the iterators will be copied five times (and
> > > > I checked, they are, even with full optimizations).

> > > Perhaps they don't bother to pass references because passing an
> > > iterator, which, after all, is just a pointer, is the same (at
> > > the assembly code level) as passing a reference, which itself is
> > > just a pointer.

> > For std::vector and std::string, the normal iterator *could* be a
> > pointer, but it generally isn't.

> Although an iterator for std::vector and std::string may or may not be
> an ordinary pointer, for a non-debug implementation, size_of such
> iterators should be the same as an ordinary pointer, which I think is
> what Geno Rice meant. In other words, copying is no more expensive
> than passing by reference for such iterators.

This depends very much on the compiler. With the compilers I use, a
pointer or an int is passed in a register, whereas, regardless of the
size, a class is passed as a temporary in memory.

More important than performance, however, is perhaps the semantics. In
many cases, the algorithm modifies one or more of the iterators passed
to it. And the semantics of all of the algorithms are that the
iterator in the calling function is NOT modified. So there must be
a copy somewhere.

Also, one of the key elements of the two-iterator idiom is that you do
copy iterators. You search for a first element, save the iterator, then
search for a second, to define a sub-interval, for example. The entire STL
is based on the premise that iterators, items in collections and
function objects are copyable at a reasonable price. (What "reasonable"
means, of course, depends on the application.)

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

Tom

Jan 22, 2004, 4:00:38 AM
ka...@gabi-soft.fr wrote:

> > Although an iterator for std::vector and std::string may or may not be
> > an ordinary pointer, for a non-debug implementation, size_of such
> > iterators should be the same as an ordinary pointer, which I think is
> > what Geno Rice meant. In other words, copying is no more expensive
> > than passing by reference for such iterators.
>
> This depends very much on the compiler. With the compilers I use, a
> pointer or an int is passed in a register, whereas, regardless of the
> size, a class is passed as a temporary in memory.

You are certainly correct that I should have acknowledged that this is
an implementation detail. Still, I'm curious. Which compilers do you
use? If I understand you correctly, on your compilers, passing struct
{ int* p; } by value is more expensive than passing a simple int* by
value. That suggests that the Stepanov abstraction penalty must be
significant, so that for time-critical operations, there is a
disincentive to using the STL. If you have a few spare moments, I
would be interested in the results of running Stepanov's benchmark
tests on your compiler. (If you don't have a copy of the Stepanov
tests, you can find a version on Scott Ladd's website at
http://www.coyotegulch.com/reviews/intel_comp/icc_gcc_benchmarks.tar.gz
)
For comparison purposes, gcc 3.3.1's Stepanov abstraction penalty is
less than 10% with full optimizations, and you probably know of other
compilers that do even better.

> More important than performance, however, is perhaps the semantics. In
> many cases, the algorthme modifies one or more of the iterators passed
> in the algorithm. And the semantics of all of the algorithms is that
> the iterator in the calling function is NOT modified. So there must be
> a copy somewhere.

Agreed.

> The entire STL
> is based on the premise that iterators, items in collections and
> functional objects are copiable at a reasonable price. (What reasonable
> means, of course, depends on the application.)

Agreed.

Best regards,

Tom

Samuel Krempp

Jan 22, 2004, 4:03:20 AM
On Monday 19 January 2004 18:26, Du...@DIKU.DK wrote:

> John Potter <jpo...@falcon.lhup.edu> wrote in message
> news:<62al00dcskv8vqjqb...@4ax.com>...
>> > The code is dependent on other things, but here's a minimal slip of
>> > code to illustrate it [...]
>> Yes. It illustrates that you need to trust your compiler.
>
> I would think that it illustrates that the compiler will not optimize
> away the copying of iterators. It will inline the copy-constructor and
> then fold the four repeated increments into a single add, and thus my

I suggest using iterators with adequately big vector members, and using only
execution time as a hint of whether the compiler was able to avoid the
copying overhead.

If it does not, it demonstrates the point.

(and if it does avoid the overhead, I'm impressed :) )

--
Samuel.Krempp
cout << "@" << "crans." << (is_spam ? "trucs.en.trop." : "" )
<< "ens-cachan.fr" << endl;

ka...@gabi-soft.fr

Jan 23, 2004, 4:01:30 PM
Thomas...@yahoo.com (Tom) wrote in message
news:<7b68d58f.04012...@posting.google.com>...
> ka...@gabi-soft.fr wrote:

> > > Although an iterator for std::vector and std::string may or may
> > > not be an ordinary pointer, for a non-debug implementation,
> > > size_of such iterators should be the same as an ordinary pointer,
> > > which I think is what Geno Rice meant. In other words, copying
> > > is no more expensive than passing by reference for such
> > > iterators.

> > This depends very much on the compiler. With the compilers I use,
> > a pointer or an int is passed in a register, whereas, regardless of
> > the size, a class is passed as a temporary in memory.

> You are certainly correct that I should have acknowledged that this is
> an implementation detail. Still, I'm curious. Which compilers do you
> use? If I understand you correctly, on your compilers, passing struct
> { int* p; } by value is more expensive than passing a simple int* by
> value.

I tried the following code:

class S
{
    int* p ;
public:
    S() : p( 0 ) {}
    S( int* pp ) : p( pp ) {}
    int& operator*() const { return *p; }
    S& operator++() { ++ p ; return *this ; }
} ;

typedef int* P ;

S
func_sss( S s )
{
    ++ s ;
    return s ;
}

P
func_ppp( P p )
{
    ++ p ;
    return p ;
}

I tried it with Sun CC (version 5.1, -O4), g++ (version 3.3.1, -O3,
under Solaris Sparc) and VC++ 6.0 (-Ox, under Windows NT). In all three
cases, the class is copied into a temporary in memory, and returned via
another temporary in memory. To give an idea, the generated code with
g++ was:

func_sss:
ld [%o0], %g1
ld [%sp+64], %o0
add %g1, 4, %g1
jmp %o7+12
st %g1, [%o0]

func_ppp:
retl
add %o0, 4, %o0

Sun CC was very similar. The difference is less striking on the Intel
platform, because even pure pointers are passed on the stack, but still:

func_sss:
mov eax, dword ptr 12[esp-4]
lea ecx, dword ptr [eax+4]
mov eax, dword ptr 8[esp-4]
mov dword ptr [eax], ecx
ret 0

func_ppp:
mov eax, dword ptr 8[esp-4]
add eax, 4
ret 0

Differences at the call site were similar -- Sun CC and g++ never even
allocated memory for P, whereas they did for S.

> That suggests that the Stepanov abstraction penalty must be
> significant, so that for time-critical operations, there is a
> disincentive to using the STL. If you have a few spare moments, I
> would be interested in the results of running Stepanov's benchmark
> tests on your compiler. (If you don't have a copy of the Stepanov
> tests, you can find a version on Scott Ladd's website at
> http://www.coyotegulch.com/reviews/intel_comp/icc_gcc_benchmarks.tar.gz
> ) For comparison purposes, gcc 3.3.1's Stepanov abstraction penalty is
> less than 10% with full optimizations, and you probably know of other
> compilers that do even better.

I gave it a try. I'm not sure what the output means. It displayed an
abstraction penalty of 3.1 with Sun CC, and 1.23 with g++. Is this
percent, or does it mean that the abstract version takes 3.1/1.23 times
more time?

At any rate, I'm not sure that the benchmark signifies much. There is
only one module -- when I compiled my test with g++ or Sun CC, I had to
put the calls in a separate module for either of the compilers to ever
call them. Both Unix compilers inline functions defined in the same
module, whether they are declared inline or not, and once inlined, both
of the functions are optimized to a simple incrementation (which is then
further optimized according to the surrounding context).

You might want to try the following benchmark:

s.hh:
-----
class S
{
    int* p ;
public:
    S() : p( 0 ) {}
    S( int* pp ) : p( pp ) {}
    int& operator*() const { return *p; }
    S& operator++() { ++ p ; return *this ; }
} ;

typedef int* P ;

extern S incr( S ) ;
extern P incr( P ) ;

x.cc:
-----
#include "s.hh"

S
incr( S s )
{
    ++ s ;
    return s ;
}

P
incr( P p )
{
    ++ p ;
    return p ;
}

bench.cc:
---------
#include "gb/CommandLine.hh"
#include "gb/NumericOption.hh"
#include "gb/ProgramStatus.hh"
#include "gb/BenchHarness.hh"
#include <ostream>
#include <iostream>

#include "s.hh"

template< typename Ptr >
class Test : public GB_BenchHarness
{
public:
    virtual void operator()() ;
private:
    int myData[ 10 ] ;
} ;

template< typename Ptr >
void
Test< Ptr >::operator()()
{
    Ptr p( myData ) ;
    *incr( p ) = 1 ;
}

int
main( int argc, char** argv )
{
    GB_BoundNumericOption count( 'c', 1000000 ) ;
    GB_CommandLine::instance().parse( argc, argv ) ;
    GB_BenchHarness::setLoopCount( count ) ;
    Test< P >().run(
        std::cout,
        GB_BenchHarness::iterationTime,
        "Pointer" ) ;
    Test< S >().run(
        std::cout,
        GB_BenchHarness::iterationTime,
        "Class" ) ;
    return GB_ProgramStatus::instance().returnCode() ;
}
-------

(The necessary includes and libraries are available at my site,
www.gabi-soft.fr. I think they are pretty portable, but I've only
actually built them for Sun CC and g++, under Solaris, and g++ under
Linux.)

I find a difference of a factor more than 10 with Sun CC, and around 7
with g++.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


Thomas Tutone

Jan 25, 2004, 5:41:22 AM

<ka...@gabi-soft.fr> wrote:

> > That suggests that the Stepanov abstraction penalty must be
> > significant, so that for time-critical operations, there is a
> > disincentive to using the STL. If you have a few spare moments, I
> > would be interested in the results of running Stepanov's benchmark
> > tests on your compiler. (If you don't have a copy of the Stepanov
> > tests, you can find a version on Scott Ladd's website at
> > http://www.coyotegulch.com/reviews/intel_comp/icc_gcc_benchmarks.tar.gz
> > ) For comparison purposes, gcc 3.3.1's Stepanov abstraction penalty is
> > less than 10% with full optimizations, and you probably know of other
> > compilers that do even better.
>
> I gave it a try. I'm not sure what the output means. It displayed an
> abstraction penalty of 3.1 with Sun CC, and 1.23 with g++. Is this
> percent, or does it mean that the abstract version takes 3.1/1.23 times
> more time?

The latter. Alex Stepanov wrote the benchmark as a rough measure of how
well a compiler optimizes away various iterator abstractions, such as making
naked pointers part of a simple class, the Standard Library's mechanism for
creating reverse_iterators, etc. You might have to increase the number of
iterations to get accurate results, because when the test was written,
computers were slower... But on your platform, gcc has about a 23% Stepanov
abstraction penalty, and Sun CC has about a 231% Stepanov abstraction
penalty, if your results are typical.

In sec. 10.3 of TC++PL, Stroustrup talks about the importance of "efficient
user-defined types." A 231% penalty for using user-defined types is too
much, IMHO.

> At any rate, I'm not sure that the benchmark signifies much. There is
> only one module -- when I compiled my test with g++ or Sun CC, I had to
> put the calls in a separate module for either of the compilers to ever
> call them. Both Unix compilers inline functions defined in the same
> module, whether they are declared inline or not, and once inlined, both
> of the functions are optimized to a simple incrementation (which is then
> further optimized according to the surrounding context).

I'm not sure I follow you. Did you not compile the Stepanov benchmark as a
single module? It is intended to be so compiled. If not, your results are
not representative.

> You might want to try the following benchmark:

{Quite unnecessary quoting of two screens of source code snipped.
-mod/fwg}

I will try it. Thanks.

Best regards,

Tom

Gabriel Dos Reis

Jan 25, 2004, 5:21:24 PM
"Thomas Tutone" <Thomas8675...@yahoo.com> writes:

| In sec, 10.3 of TC++PL, Stroustrup talks about the importance of "efficient
| user-defined types." A 231% penalty for using user-defined types is too
| much, IMHO.

The fact that you get something less than 230% with other compilers
is an indication that the "penalty" is more a matter of compiler issues
than of the language. (BTW, "abstraction penalty" is a very polite phrase.)

I think many compilers out there have a very curious and prehistoric
view of "classes" or "structures". I've seen somewhat antiquated
ABIs where passing a value of struct type as an argument in a function
call systematically causes it to be pushed into memory, no matter how
small it is. Clearly, such an implementation strategy ignores the
purpose of types in programming.

--
Gabriel Dos Reis
g...@integrable-solutions.net

ka...@gabi-soft.fr

Jan 26, 2004, 5:43:44 AM
Gabriel Dos Reis <g...@integrable-solutions.net> wrote in message news:<m3llnw5...@uniton.integrable-solutions.net>...

> "Thomas Tutone" <Thomas8675...@yahoo.com> writes:

> | In sec, 10.3 of TC++PL, Stroustrup talks about the importance of "efficient
> | user-defined types." A 231% penalty for using user-defined types is too
> | much, IMHO.

> The fact that you get something less than 230% with other compilers
> is an indication that "penalty" is more or compiler issues than with
> the language.

Totally agreed. But the 230% is nothing compared to the 1500% that I
see when I use separate modules. All that I'm measuring is the
difference in cost between passing a pointer, and passing a (trivial,
completely inlined) iterator.

> (BTW, "abstraction penalty" is a very polite phrase).

You're being punished for doing it right:-).

You do realize, of course, that if people start using abstraction
systematically, we'll get to a point where even average programmers will
be able to understand and maintain complex systems, and then we gurus
will no longer be able to justify our outrageously high rates. So it
isn't all bad:-). (Seriously, of course, there are enough really
difficult things that need doing that I see no need in making simple
things unnecessarily complicated.)

> I think many compilers out there have a very curious and prehistoric
> view of "classes" or "structures". I've seen somewhat antiquated
> ABIs where passing a value of struct type as an argument in a function
> call systematically causes it to be pushed into memory, no matter how
> small it is. Clearly, such an implementation strategy ignores the
> purpose of types in programming.

I suspect that the problem comes from the fact that at some early point
in the compilation process, the compiler took the address of the object,
in order to get a this pointer. And that this "taking of the address"
forces the object into memory, regardless of what happens later.

But I'm just guessing.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

ka...@gabi-soft.fr

Jan 26, 2004, 5:44:24 AM
"Thomas Tutone" <Thomas8675...@yahoo.com> wrote in message
news:<toAQb.9563$6O4.2...@bgtnsc04-news.ops.worldnet.att.net>...

> <ka...@gabi-soft.fr> wrote:

Quite. In fact, it was the fact that it was so high that made me wonder
whether it was a percent, or a multiple.

Given the little time necessary to execute the benchmarks, I suspect
that I should increase the loop count.

No, just checked. Same results, more or less.

> > At any rate, I'm not sure that the benchmark signifies much. There is
> > only one module -- when I compiled my test with g++ or Sun CC, I had to
> > put the calls in a separate module for either of the compilers to ever
> > call them. Both Unix compilers inline functions defined in the same
> > module, whether they are declared inline or not, and once inlined, both
> > of the functions are optimized to a simple incrementation (which is then
> > further optimized according to the surrounding context).

> I'm not sure I follow you. Did you not compile the Stepanov benchmark as a
> single module? It is intended to be so compiled. If not, your results are
> not representative.

I compiled it as a single module. I was just interpreting the results
as a percent, and not a multiple, and found them surprisingly low:-).
Thus, I thought I'd try using separate modules. In fact, it looks like
the problem is worse than I'd thought. Which is somewhat strange; in
earlier tests, the fact that I'd found no significant difference between
++i and i++, even when the iterators were class types, led me to believe
that everything was getting inlined correctly, as long as we stayed
within a single module (and thus, that the abstraction penalty in such
cases was low).

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


Gabriel Dos Reis

Jan 26, 2004, 9:37:35 AM
ka...@gabi-soft.fr writes:

| > I think many compilers out there have a very curious and prehistoric
| > view of "classes" or "structures". I've seen somewhat antiquated
| > ABIs where passing a value of struct type as an argument in a function
| > call systematically causes it to be pushed into memory, no matter how
| > small it is. Clearly, such an implementation strategy ignores the
| > purpose of types in programming.
|
| I suspect that the problem comes from the fact that at some early point
| in the compilation process, the compiler took the address of the object,
| in order to get a this pointer. And that this "taking of the address"
| forces the object into memory, regardless of what happens later.

That is also true. But in the case of the ABI I'm thinking of, a
structure is systematically passed in memory -- whether you take its
address or not. The SPARC psABI is such an example. Of course, there
are exceptions. The PowerPC psABI tries to take the structure's size
into consideration, and that is better :-). The next step would be for
compilers to recognize that the C++ calling convention need not be
equated with the C calling convention. Those considerations are crucial
for effective/efficient support of things like iterators and smart
pointers. Types are highly valuable compile-time comments that compilers
read; they should not be thrown away that way.

--
Gabriel Dos Reis
g...@integrable-solutions.net


Keith H Duggar

Jan 26, 2004, 3:22:54 PM
> What surprises me is that the implementation uses 5 nested function
> calls. Doesn't that screw up code cache locality, branch prediction, etc?
> Are you sure these don't get turned into inlines, and then get
> optimized away?

Would you be surprised to learn that a great deal of the gcc
implementation of the STL uses multiple nested function calls, often
through one or more pointers? I was. For the most part, it is a large
nested wrapper around the standard C library. This makes historical sense
considering its evolution and the evolution of the C++ language and
standard; but it doesn't make for the best implementation.

For example, try tracking down the implementation of the stream insertion
operator for a simple integer (grep comes in handy here). After some
work, you will find that, after going through a few function calls and
a pointer, it finally calls sprintf. In other words, just to insert an
integer we need to pay for nested functions, dereferencing, and then,
to top it off, the workhorse sprintf. All of this could have been
handled by a simple itoa equivalent and a character buffer write.

I am very thankful for GCC and to all the people who helped
implement it. I will be even more thankful when a group takes the time
to clean out the GCC attic and create a sharp, clean, and more
efficient version of G++ or some other open source C++-only compiler.

P.S. We rely too much on compilers to do our optimization for us.
Compilers are good, but they aren't as good as many seem to think. They
are constrained by correctness and conformance and simply cannot do
what we sometimes dream they can do.

Bjarne Stroustrup

Jan 26, 2004, 3:24:27 PM
ka...@gabi-soft.fr :

> > But on your platform, gcc has about a 23% Stepanov
> > abstraction penalty, and Sun cc has about a 231% Stepanov abstraction
> > penalty, if your results are typical.
>
> > In sec, 10.3 of TC++PL, Stroustrup talks about the importance of "efficient
> > user-defined types." A 231% penalty for using user-defined types is too
> > much, IMHO.

This is *far* too much. I designed C++ with the "zero overhead
principle" as the ideal ("not one extra cycle and not one extra
byte"). I considered a 3% overhead compared to hand-written C
unacceptable. I suspect the 231% figure to be the result of a
disregard for C++ optimization needs and a regression compared to
earlier results. I have not checked recently, but 231% is consistent
with results for compilers that do hardly any inlining. To have a
low "abstraction penalty" (say 3% as is achieved by some current
compilers), you need a good inliner. Simply obeying the user's intent
as expressed in the code (using "inline" and in-class member
definitions) will do that provided the compiler knows how to put small
structs (e.g. one pointer member) into registers.

- Bjarne Stroustrup; http://www.research.att.com/~bs

Tom

Jan 26, 2004, 3:28:24 PM
Gabriel Dos Reis wrote:

> | In sec, 10.3 of TC++PL, Stroustrup talks about the importance of "efficient
> | user-defined types." A 231% penalty for using user-defined types is too
> | much, IMHO.
>
> The fact that you get something less than 230% with other compilers
> is an indication that "penalty" is more or compiler issues than with
> the language.

I couldn't agree more, and I guess that was really my point. This
subthread started after Mr. Kanze said:

> > > > With the compilers I use, a
> > > > pointer or an int is passed in a register,
> > > > whereas, regardless of the size, a class
> > > > is passed as a temporary in memory.

Which surprised me and led me to assume that the abstraction penalty on
his platform must be quite large - and it sure is.

Best regards,

Tom

ka...@gabi-soft.fr

Jan 28, 2004, 4:57:30 AM
b...@research.att.com (Bjarne Stroustrup) wrote in message
news:<188b3370.0401...@posting.google.com>...
> ka...@gabi-soft.fr :

> > > But on your platform, gcc has about a 23% Stepanov
> > > abstraction penalty, and Sun cc has about a 231% Stepanov abstraction
> > > penalty, if your results are typical.

> > > In sec, 10.3 of TC++PL, Stroustrup talks about the importance of
> > > "efficient user-defined types." A 231% penalty for using
> > > user-defined types is too much, IMHO.

> This is *far* too much. I designed C++ with the "zero overhead
> principle" as the ideal ("not one extra cycle and not one extra
> byte"). I considered a 3% overhead compared to hand-written C
> unacceptable. I suspect the 231% figure to be the result of a
> disregard for C++ optimization needs and a regression compared to
> earlier results. I have not checked recently, but 231% is consistent
> with results for compilers that does hardly any inlining. To have a
> low "abstraction penalty" (say 3% as is achieved by some current
> compilers), you need a good inliner. Simply obeying the user's intent
> as expressed in the code (using "inline" and in-class member
> definitions) will do that provided the compiler knows how to put small
> structs (e.g. one pointer member) into registers.

Well, I think for once all of us participating in the discussion are in
violent agreement:-). Although I think we are talking about two
different abstraction penalties -- and I find that I'm usually paying
both. I recognize that if I don't inline the actual functions (say
operator++), some form of abstraction penalty is inevitable for this
function. On the other hand, I don't expect to pay more to simply copy
the object around (supposing copy constructor and destructor are
trivial).

What we are seeing is 231% abstraction penalty for the inlined
functions, in a single module (where the zero overhead principle should
hold), and 1500% (!!) penalty for just copying. G++ does better, but
it is still 23% and around 800% (and it starts from a higher base --
g++ runs slower when the built-in types are used). While I'm not
usually one to get upset about performance (most of my applications are
IO bound), performance loss at this level is really showing a complete
disregard for the client.

I think it worthwhile to point out, too, that while I've not analysed
Stepanov's benchmark, my own was about as trivial as you can get. I
can understand a compiler having some problems with inlining
complicated functions, but a simple ++ or an immediate value return?
In my benchmark, in fact, a simple look at the generated code showed
that *all* of the penalty was due to pass by value -- of a type with
trivial (compiler generated) copy constructor, assignment operator and
destructor.

With regard to passing the parameter in a register, and Gabriel's
response to my posting: I recognize that there are a number of issues
involved. A priori, an object of class type needs a this pointer, a
this pointer means taking the address, and taking the address requires
the object to be in memory. And when specifying an API for separately
compiled modules, you cannot assume any real work on the part of the
optimizer -- you have to specify exactly when an object is passed in a
register, and when not. Still, I think that as a minimum: if an object
has a trivial copy constructor, a trivial destructor and it fits in a
register, that's where it should be passed. Since the compiler handles
the trivial copy constructor immediately, there should be no problem
(and of course, a trivial destructor does nothing). Personally, I'd
like even more, but none of the compilers I have access to (Sun CC, g++
and VC++) support even this.

I can't remember for sure, but I'm almost sure that it was something you
wrote many years ago (maybe 15) which suggested already at the time that
small classes could be passed and returned in registers. So it's not
like we are suggesting something that no one had ever thought of before.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


Balog Pal

Jan 28, 2004, 3:31:03 PM
<ka...@gabi-soft.fr> wrote in message
news:d6652001.04012...@posting.google.com...

> I can't remember for sure, but I'm almost sure that it was something you
> wrote many years ago (maybe 15) which suggested already at the time that
> small classes could be passed and returned in registers. So it's not
> like we are suggesting something that no one had ever thought of before.

Still, there can be an 'object compatibility' problem. If you write your
V1.0 compiler, it's easy to start with an 'I pass small objects in
registers' convention. But you can't just change to this in V4.0 -- at
least not if you want to keep link-compatibility with the previous version.
Many compilers allow tweaking calling conventions, using defaults from a
command line switch, and keywords like __cdecl, __stdcall and __thiscall at
the function level. And that is reflected in the mangled function name too.

A new passing convention would likely introduce a new such keyword. Say
__thiscall meant the this pointer is passed in R66; then __newthiscall
would mean the same for objects bigger than a register, while small
objects would reside directly in R66, and would not be preserved at exit.
The new default would be __newthiscall.
If the programmer wants to link with old stuff he will face link errors,
and can then either use the general command line switch to fall back to
__thiscall, or qualify the individual functions tied to the old object
files. (Or recompile everything, eliminating the problem.)

Doesn't sound like a killer task, but still enough fuss to inhibit the
change.

Paul

Bjarne Stroustrup

Jan 28, 2004, 3:33:33 PM
ka...@gabi-soft.fr wrote in message
>
> I can't remember for sure, but I'm almost sure that it was something you
> wrote many years ago (maybe 15) which suggested already at the time that
> small classes could be passed and returned in registers. So it's not
> like we are suggesting something that no one had ever thought of before.

Correct. Sometime in the very early days, 1983 or so. I convinced
Steve Johnson to modify PCC to put small structures into registers. It
made quite a difference.

It seems that C-oriented ABI makers didn't follow up because their C
benchmark didn't show a need.

Note that putting small class objects into registers can be
non-trivial (dependent on the machine architecture) because data in
registers is typically addressed quite different from data in memory;
i.e. "the this pointer" for a member function must somehow become a
register address iff an object is loaded into a register.

-- Bjarne Stroustrup; http://www.research.att.com/~bs

ka...@gabi-soft.fr

Jan 29, 2004, 1:53:25 PM
b...@research.att.com (Bjarne Stroustrup) wrote in message
news:<188b3370.04012...@posting.google.com>...
> ka...@gabi-soft.fr wrote in message

> > I can't remember for sure, but I'm almost sure that it was something
> > you wrote many years ago (maybe 15) which suggested already at the
> > time that small classes could be passed and returned in registers.
> > So it's not like we are suggesting something that no one had ever
> > thought of before.

> Correct. Sometime in the very early days, 1983 or so. I convinced
> Steve Johnson to modify PCC to put small structures into registers. It
> made quite a difference.

I can believe it.

> It seems that C-oriented ABI makers didn't follow up because their C
> benchmark didn't show a need.

The usual problem. Benchmarks sell compilers, so compiler implementors
optimize for the specific benchmarks. And most benchmarks don't test
abstractions.

> Note that putting small class objects into registers can be
> non-trivial (dependent on the machine architecture) because data in
> registers is typically addressed quite different from data in memory;
> i.e. "the this pointer" for a member function must somehow become a
> register address iff an object is loaded into a register.

That's one of the reasons why my minimal suggestion only involved
classes with trivial constructors.

I suspect that as a minimum, inline constructors will be necessary,
since the compiler can't put anything in a register if the address leaks
(e.g. if the constructor "registers" the address somewhere). Again, my
minimal suggestion involved classes with trivial constructors because the
compiler "knows" that a trivial constructor doesn't leak the address. I
think that the key for this to work is that a memcpy'd image of the
object can be used in its place, with no difference in observable
behavior.

The way I envisaged the optimization was that the API specified that if
a class had a trivial copy constructor (and maybe destructor, I'm not
sure), and its size was less than N bytes, it would be passed in a
register (or several registers). Typically, to do this, the compiler
would start by generating the parameter in memory, in the usual fashion,
then copying the bytes to registers; at the other end, the compiler
would copy the registers to memory, and use that memory as the class.
After that, classical optimization techniques would suppress the memory
image at the call site (almost certainly) and in the called function
(depending on what the function did with the class -- if the compiler
couldn't determine that the address didn't leak, it would have to keep
the copy in memory).

I'm pretty sure that such an implementation would be conforming.
Provided, of course, that the double, word-by-word copy (into and out of
the register) was acceptable to the class, and the copy constructor
didn't leak the address. Both are guaranteed in the case of a trivial
copy constructor.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


ka...@gabi-soft.fr

Jan 29, 2004, 1:53:49 PM
"Balog Pal" <pa...@lib.hu> wrote in message
news:<4017...@andromeda.datanet.hu>...

> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.04012...@posting.google.com...

> > I can't remember for sure, but I'm almost sure that it was something
> > you wrote many years ago (maybe 15) which suggested already at the
> > time that small classes could be passed and returned in registers.
> > So it's not like we are suggesting something that no one had ever
> > thought of before.

> Still, there can be an 'object compatibility' problem. If you write
> your V1.0 compiler, it's easy to start with an 'I pass small objects
> in registers' convention. But you can't just change to this in
> V4.0 -- at least not if you want to keep link-compatibility with the
> previous version. Many compilers allow tweaking calling conventions,
> using defaults from a command line switch, and keywords like
> __cdecl, __stdcall and __thiscall at the function level. And that is
> reflected in the mangled function name too.

True, if you change the API, you break object compatibility. Which you
shouldn't do on a whim. But most compilers had to break object
compatibility to conform to the C++ standard anyway -- you can't link
g++ 2.95.2 with g++ 3.0, nor Sun CC 4.2 with Sun CC 5.0. And as I said,
both the technique and the necessity have been known for at least 15
years. Sun has broken object compatibility twice in that time, and g++
at least twice. (The idea of using special compiler specific keywords
is interesting. On one hand, I'm suspicious of such non-portable
constructs. On the other... I certainly don't like the fact that we
have to continue using Sun CC in the 4.2 compatibility mode because we
have some third party libraries that haven't been upgraded.)

Part of the problem may be C compatibility -- but I always thought that
that's what we have "extern "C"" for.

> A new passing convention would likely introduce a new such keyword.
> Say __thiscall meant the this pointer is passed in R66; then
> __newthiscall would mean the same for objects bigger than a register,
> while small objects would reside directly in R66, and would not be
> preserved at exit. The new default would be __newthiscall.
> If the programmer wants to link with old stuff he will face link
> errors, and can then either use the general command line switch to
> fall back to __thiscall, or qualify the individual functions tied to
> the old object files. (Or recompile everything, eliminating the problem.)

> Doesn't sound like a killer task, but still enough fuss to inhibit the
> change.

I agree that we certainly don't want compiler implementors to break
object compatibility with each new release. On the other hand, it is a
hurdle that must be faced every once in a while. A hurdle that does
require compiler switches, etc., in order to be palatable for the
compiler users. But there have been enough occasions.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


Gabriel Dos Reis

Jan 30, 2004, 4:33:35 AM
ka...@gabi-soft.fr writes:

| Part of the problem may be C compatibility -- but I always thought that
| that's what we have "extern "C"" for.

Exactly! That is why C++ defines at least two calling conventions: one
for "C", one for "C++ native". I would expect that compilers are
going to pay more and more attention to the use of types in software
building. They usually contain invaluable information.

--
Gabriel Dos Reis
g...@integrable-solutions.net


Gabriel Dos Reis

Jan 30, 2004, 4:34:21 AM
ka...@gabi-soft.fr writes:

| I suspect that as a minimum, inline constructors will be necessary,
| since the compiler can't put anything in a register if the address leaks
| (e.g. if the constructor "registers" the address somewhere). Again, my

With the old ARM rules (and under a mild assumption, i.e. that a copy
constructor and the destructor are used by a function that takes a
parameter of class type), that "inline" requirement is sufficient to
guide the compiler to set the appropriate calling convention. But
that may sound too complicated (and anyway, the standard weakened the
requirements on inline functions); so trivial copy constructors and
trivial destructors are probably good enough for the most common cases.
It just implies that people should not be hand-writing "trivial" copy
constructors or destructors if they want to take advantage of passing
small objects in registers.

[...]

| The way I envisaged the optimization was that the API specified that if
| a class had a trivial copy constructor (and maybe destructor, I'm not
| sure),

I guess you want to know what is going on in the destructor too
because, in general, the destructor could be fiddling with "this".

| and its size was less than N bytes, it would be passed in a
| register (or several registers). Typically, to do this, the compiler
| would start by generating the parameter in memory, in the usual fashion,
| then copying the bytes to registers; at the other end, the compiler
| would copy the registers to memory, and use that memory as the class.
| After that, classical optimization techniques would suppress the memory
| image at the call site (almost certainly) and in the called function
| (depending on what the function did with the class -- if the compiler
| couldn't determine that the address didn't leak, it would have to keep
| the copy in memory).

You don't want to change the ABI depending on the level of
optimisation, therefore you don't want to depend on what "classical
optimization techniques" would do. You want to set the ABI depending
only on the type information. Here, the type information tell you
that the copy constructor and the destructor are trivial, therefore
you know there is no fiddling with 'this'.

| I'm pretty sure that such an implementation would be conforming.

If you depend on the cleverness of the optimization algorithms, then
you may mess with the calling sequences. Typically, this is a
decision you have to make *regardless* of the optimization level --
because of the genuine separate compilation model.

--
Gabriel Dos Reis
g...@integrable-solutions.net


Francis Glassborow

Jan 30, 2004, 1:03:16 PM
In message <m3r7xi1...@uniton.integrable-solutions.net>, Gabriel Dos
Reis <g...@integrable-solutions.net> writes

>It just implies that people should not be writing "trivial" copy
>constructors or destructors if they want to take advantages of passing
>small objects in register.

Thanks for mentioning that. It would seem to strengthen the case for my
proposal to provide for explicit declaration for compiler generation of
the big four.


--
Francis Glassborow ACCU
Author of 'You Can Do It!' see http://www.spellen.org/youcandoit

Howard Hinnant

Jan 30, 2004, 8:54:05 PM
In article <SYf1wpAA...@robinton.demon.co.uk>,
Francis Glassborow <fra...@robinton.demon.co.uk> wrote:

> Thanks for mentioning that. It would seem to strengthen the case for my
> proposal to provide for explicit declaration for compiler generation of
> the big four.

I think you have a strong case for:

http://anubis.dkuug.dk/jtc1/sc22/wg21/docs/papers/2003/n1445.htm

and would like to see this proposal (or something like it) accepted. In
addition to being able to explicitly ask for compiler generated special
members, I would also like to see the ability to explicitly disable
special members. This could be so much more expressive than declaring
them private and not defining them:

explicit struct not_copyable
{
    // no special members implicitly generated here
    // you can manually declare/define them
    // or you can explicitly request them to be compiler generated
    // (as in n1445.htm)
};

as opposed to today's:

struct not_copyable
{
private:
    not_copyable(const not_copyable&);
    not_copyable& operator=(const not_copyable&);
};


The use of the "explicit" syntax is just an off the cuff suggested
syntax. I don't really care what the syntax is. I just want to be able
to say: this special member does not exist (not even privately).

When combined with a compiler-supported is_copyable<T> traits class,
this could become extremely useful. The proposed version would fit well
into today's language because is_copyable would be computed purely based
on lookup rules. Today's typical not_copyable hack is not ameanable to
an is_copyable<T> trait because it is based on name access instead of
name lookup. And it is also concievable that a class really is
copyable, but only for itself and friends. Getting an is_copyable<T>
traits to work correctly for such a class is at best context sensitive
and at worst impossible.

I strongly suspect a sound is_copyable<T> trait would aid in several
varied applications from move semantics to advanced iterator concepts.

-Howard

Glen Low

Feb 2, 2004, 3:41:55 AM
> I would think that it illustrates that the compiler will not optimize
> away the copying of iterators. It will inline the copy-constructor and
> then fold the four repeated increments into a single add, and thus my
> example may have been bad at illustrating that there is a price
> connected with copying the iterators, but it is done, nonetheless,
> hence my original question about why not to prefer const reference
> over passing by value (even though the compiler will often be able to
> optimize the code so that the copy is virtually free -- but there are
> exceptions, which there AFAIK is not with const reference).

I have the same compiler and OS as you, and in general I do not see
the multiple copies persisting into the object code. Having hand-tuned
macstl precisely to avoid the load/store penalty within [SIMD-ized]
STL code, I'm not sure what the problem is.

If at all possible, use the compiler-generated copy constructor. Some
compilers do not properly elide copy construction otherwise.

In theory, if the compiler inlines the chain of 5 function calls, it
should elide the implicit copy constructors, and then if the object is
small enough and there are enough registers, it should enregister it.
You should be left with a single memory load for each (outermost)
parameter and a single memory store for the (outermost) function
result.

In lieu of actual timing of the loops, a good profiler, such as the
Shark profiler that comes with Mac OS X, will point out if the code is
spending excessive time doing copying. Also, you may want to turn up
the inline limit on your compiler, e.g. -finline-limit=10000 on gcc, if
the compiler refuses to inline that part of your code.

Cheers,
Glen Low, Pixelglow Software
www.pixelglow.com

ka...@gabi-soft.fr

Feb 2, 2004, 10:52:59 AM
Gabriel Dos Reis <g...@integrable-solutions.net> wrote in message
news:<m3r7xi1...@uniton.integrable-solutions.net>...
> ka...@gabi-soft.fr writes:

> | I suspect that as a minimum, inline constructors will be necessary,
> | since the compiler can't put anything in a register if the address
> | leaks (e.g. if the constructor "registers" the address somewhere).
> | Again, my

> With the old ARM rules (and under mild assumption, i.e. a copy
> constructor and the destructor are used by function that takes a
> parameter of class-type), that "inline" requirement is sufficient to
> guide the compiler to set the appropriate calling convention. But that
> may sound too complicated (and anyway, the standard weakened the
> requirements on inline functions); so trivial copy-constructors and
> trivial destructor are probably good enough for the most common cases.
> It just implies that people should not be writing "trivial" copy
> constructors or destructors if they want to take advantages of passing
> small objects in register.

I don't think the changes in standard, compared to the ARM, have much
effect here. The problem is that the same calling convention must be
adopted at both the call site and in the callee. And there is no
requirement (either in the ARM or in the standard) that the callee know
that the copy constructor is inlined. (It's also more difficult to
specify, since even an inline constructor may leak the this pointer.)

If the optimizer can see the call site, the callee, and the contents of
the constructors (inline or not), it can do whatever it wants, as long
as the observable behavior is respected. But if the call site and the
callee are in different translation units, that's a pretty good
optimizer, and I don't expect to see it universally available in the
near future.

> [...]

> | The way I envisaged the optimization was that the API specified that
> | if a class had a trivial copy constructor (and maybe destructor, I'm
> | not sure),

> I guess you want to know what is going on in the destructor too
> because, in general, the destructor could be fiddling with "this".

Probably. I haven't analysed everything in detail.

> | and its size was less than N bytes, it would be passed in a register
> | (or several registers). Typically, to do this, the compiler would
> | start by generating the parameter in memory, in the usual fashion,
> | then copying the bytes to registers; at the other end, the compiler
> | would copy the registers to memory, and use that memory as the
> | class. After than, classical optimization techniques would suppress
> | the memory image at the call site (almost certainly) and in the
> | called function (depending on what the function did with the class
> | -- if the compiler couldn't determine that the address didn't leak,
> | it would have to keep the copy in memory).

> You don't want to change the ABI depending on the level of
> optimisation, therefore you don't want to depend on what "classical
> optimization techniques" would do.

Totally agreed. The ABI defines whether the object is passed in
register or not. The "classical optimization techniques" are used to
suppress any in memory copies that may have preceded or followed the in
register copy.

Note that if the called function calls any non-inlined functions, it
will probably have to establish an in memory copy in the function, in
order to pass a this pointer. My intuitive feeling is that the way to
handle this is to systematically generate the in memory copy in the
first pass, and count on optimization to suppress it.

> You want to set the ABI depending only on the type information.

For a sufficiently broad definition of "type information". I wouldn't
normally consider whether the copy constructor was user defined or
generated by the compiler part of the type.

> Here, the type information tell you that the copy constructor and the
> destructor are trivial, therefore you know there is no fiddling with
> 'this'.

> | I'm pretty sure that such an implementation would be conforming.

> If you depend on the cleverness of the optimization algorithms, then
> you may mess with the calling sequences. Typically, this is a decision
> you have to make *regardless* of the optimization level -- because of
> the genuine separate compilation model.

I don't think I was very clear concerning my suggestion. It is clear to
me that the API must define pass by register in some very concrete
cases, independently of how the code would later be generated. On the
other hand, the simplest way of adapting an existing compiler to the new
API is to simply generate the object in memory, as it currently does,
then, instead of putting the address of the object in a register, to put
the bytes of the object itself in the register, using the moral
equivalent of memcpy. And to do the same thing in the called function:
use the moral equivalent of memcpy to obtain an in memory copy, and then
continue as usual.

At this point, we have a pessimization, rather than an optimization.
BUT... classical optimization techniques will almost certainly eliminate
the in memory copy at the call site (supposing a trivial constructor, of
course). And if none of the uses in the called function require a this
pointer outside the translation unit, classical optimization techniques
should be able to eliminate this in memory copy as well.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Gabriel Dos Reis

Feb 2, 2004, 2:46:33 PM
ka...@gabi-soft.fr writes:

| Gabriel Dos Reis <g...@integrable-solutions.net> wrote in message
| news:<m3r7xi1...@uniton.integrable-solutions.net>...
| > ka...@gabi-soft.fr writes:
|
| > | I suspect that as a minimum, inline constructors will be necessary,
| > | since the compiler can't put anything in a register if the address
| > | leaks (e.g. if the constructor "registers" the address somewhere).
| > | Again, my
|
| > With the old ARM rules (and under mild assumption, i.e. a copy
| > constructor and the destructor are used by function that takes a
| > parameter of class-type), that "inline" requirement is sufficient to
| > guide the compiler to set the appropriate calling convention. But that
| > may sound too complicated (and anyway, the standard weakened the
| > requirements on inline functions); so trivial copy-constructors and
| > trivial destructor are probably good enough for the most common cases.
| > It just implies that people should not be writing "trivial" copy
| > constructors or destructors if they want to take advantages of passing
| > small objects in register.
|
| I don't think the changes in standard, compared to the ARM, have much
| effect here.

See below for an explanation.

The ARM rules say that it is an error to use an inline function
before its definition. The direct implication is that when one uses
an inline function, the body of that function ought to be known to the
compiler. In the case of an inline copy-constructor or destructor,
the ARM rules would let the compiler (with no excessive effort) figure
out whether those are playing with "this" or not.

| The problem is that the same calling convention must be
| adopted at both the call site and in the callee.

That is what I have been saying, in particular that is why you don't
want to depend on optimization level to decide whether you're going
to pass in register or not.

| And there is no requirement (either in the ARM or in the standard)
| that the callee know that the copy constructor is inlined. (It's
| also more difficult to specify, since even an inline constructor may
| leak the this pointer.)

Let me re-quote the assumptions:

With the old ARM rules (and under mild assumption, i.e. a copy
constructor and the destructor are used by function that takes a
parameter of class-type)


(1) ARM rules: if you use an inline function, then its definition
must have been seen.

(2) mild assumption: In a function definition that declares a
parameter of class type, both the copy-constructor and the
destructor are used.

It follows that both the callee and the caller know exactly whether
"this" is "leaked" or not.

The standard has removed the ARM rules (1).

| If the optimizer can see the call site, the callee, and the contents of
| the constructors (inline or not), it can do whatever it wants, as long

Under the ARM rules, the compiler knows at the call site what the body
of the inline function looks like.

| as the observable behavior is respected. But if the call site and the
| callee are in different translation units, that's a pretty good
| optimizer, and I don't expect to see it universally available in the
| near future.

The mild assumption (2) -- that the copy-constructor and the
destructor are used in the definition of a function taking a value
of class-type -- (when "applied" with the ARM rules) ensures that the
compiler establishes the same calling convention at the call sites
as in the callee *without requiring optimizers to look at every call site*
in every translation unit.

[...]

| Note that if the called function calls any non-inlined functions, it
| will probably have to establish an in memory copy in the function, in
| order to pass a this pointer.

No, that does not follow. The only thing that matters is how
copy-constructor and destruction are performed.


| > You want to set the ABI depending only on the type information.
|
| For a sufficiently broad definition of "type information". I wouldn't
| normally consider whether the copy constructor was user defined or
| generated by the compiler part of the type.

That would probably fit the traditional C-view of a type.

If, however, you consider that a type is any collection of constraints
and operations that applies to a set of values, then it is clear that
a copy constructor is part of the type.

That a user-declared copy constructor does not necessarily have
the same semantics as a compiler-synthesized copy constructor is
trivial.

Therefore, you don't get the same type information from a
user-declared copy constructor as from a compiler-synthesized copy
constructor. QED.


--
Gabriel Dos Reis
g...@integrable-solutions.net


Tony

Feb 3, 2004, 6:10:57 AM
Just reading the subject line, I was scared.
I am using const reference a lot to specify an objct doesn't change in
the called function. If const reference is worse than passing by value,
I will get into real trouble.
So I would like to know whether the subject is ONLY relavent to STL copy
or STL copy similar senoriors ??
Thanks
Tony

Francis Glassborow

Feb 3, 2004, 9:38:10 AM
In message <9396fe12.04020...@posting.google.com>, Tony
<tongr...@netzero.com> writes

>Just reading the subject line, I was scared.
>I am using const reference a lot to specify an objct doesn't change in
>the called function. If const reference is worse than passing by value,
>I will get into real trouble.
>So I would like to know whether the subject is ONLY relavent to STL copy
>or STL copy similar senoriors ??
[I think you meant scenarios]

Does your code run fast enough for the user? If so stop worrying.

Are you using const references for fundamental types (built ins,
pointers etc.)? If so consider not doing so as most of those would be
passed in registers and will not benefit from using a const reference.
Do you use lots of very small objects? Then if performance is an issue
check your compiler options as well as benchmarking the alternatives.

Otherwise you have more important things to use your time on.

--
Francis Glassborow ACCU
Author of 'You Can Do It!' see http://www.spellen.org/youcandoit

For project ideas and contributions: http://www.spellen.org/youcandoit/projects

Andrea Griffini

Feb 3, 2004, 1:46:34 PM
On 3 Feb 2004 06:10:57 -0500, tongr...@netzero.com (Tony) wrote:

>So I would like to know whether the subject is ONLY relavent to STL copy
>or STL copy similar senoriors ??

In the past I made a few experiments examining the performance of

inline P2d operator+(const P2d& a, const P2d& b)
{
    return P2d(a.x + b.x, a.y + b.y);
}

against

inline P2d operator+(P2d a, const P2d& b)
{
    return a += b;
}

and the second version was getting better (both smaller and
faster) assembly code. I did the test some time ago; maybe
now the situation is different, but I wouldn't bet on it ...

Good software was killed by good hardware: IMO less and less
people in the world are investigating on those low-level
performance issues... for example in my test I found that
C++ was still not reaching the performance of the explicit
C approach in this context. I don't think this is going to
change in the future (change in better, I mean).

If you have a critical point (e.g. code size or speed) then
do not trust anyone. Do a test and compare results.
If your software doesn't work or doesn't sell I don't think
that adding "Ok, it sucks. But it's just a QOI issue." on
the box will help.

IMO, from a high-level logical point of view, operator+ should
accept two values; it has nothing to do with the representations
of those values (with their addresses, for example). Passing a
"const &" is an ugly implementation detail that just adds noise
to the interface. But like it or not, this is the C++ language.

HTH
Andrea

ka...@gabi-soft.fr

Feb 3, 2004, 1:57:45 PM
Gabriel Dos Reis <g...@integrable-solutions.net> wrote in message
news:<m3vfmpj...@uniton.integrable-solutions.net>...
> ka...@gabi-soft.fr writes:

I wasn't aware that there was a difference here. The standard still
says (§7.1.2/4): "An inline function shall be defined in every
translation unit in which it is used [...] If a function with external
linkage is declared inline in one translation unit, it shall be declared
inline in all translation units in which it appears; [...]" I'll admit
that I would have expected (and preferred) "[...] shall be defined
before its first use in every translation unit in which it is used [...]".

> | The problem is that the same calling convention must be adopted at
> | both the call site and in the callee.

> That is what I have been saying, in particular that is why you don't
> want to depend on optimization level to decide whether you're going to
> pass in register or not.

I think we agree here.

> | And there is no requirement (either in the ARM or in the standard)
> | that the callee know that the copy constructor is inlined. (It's
> | also more difficult to specify, since even an inline constructor may
> | leak the this pointer.)

> Let me re-quote the assumptions:

> With the old ARM rules (and under mild assumption, i.e. a copy
> constructor and the destructor are used by function that takes a
> parameter of class-type)

Where does the mild assumption come from?

> (1) ARM rules: if you use an inline function, then its definition
> must have been seen.

> (2) mild assumption: In a functionn definition that declares a
> parameter of class type, both the copy-constructor and the
> destructor are used.

This is where I don't follow you. Where does this "mild assumption"
come from? The standard certainly doesn't require it, and while it
would always be true given the way I organize my code, I can conceive
of other organizations. I might not like them, but they are certainly
legal, and a compiler must support them.

> It follows that both the callee and the caller knows exactly whether
> "this" is "leaked" or not.

Independently of my problem with your "mild assumption": if we are
defining an API, I'd hate to have to specify what you can or cannot do
in an inlined constructor for the compiler to pass by register.
Intuitively, I can imagine what an optimizer might do which could allow
passing in a register, but I think we both agree that we don't want the
API to depend on optimization. You'd have to specify very precisely
what operations in the inline constructor do not inhibit pass by
register.

> The standard has removed the ARM rules (1).

> | If the optimizer can see the call site, the callee, and the contents
> | of the constructors (inline or not), it can do whatever it wants, as
> | long

> Under the ARM rules, the compiler knows at the call site what the body
> of the inline function looks like.

And now you only have to know it within the translation unit. I suspect
that the standard has made life harder for implementors, with no real
gain for users. But I still don't think it really affects us here; I
don't think you can assume that the inline definition is present in the
translation unit which contains the called function.

> | as the observable behavior is respected. But if the call site and
> | the callee are in different translation units, that's a pretty good
> | optimizer, and I don't expect to see it universally available in the
> | near future.

> The mild assumption (2) -- that the copy-constructor and the
> destructor are used in the definition of a function taking a value of
> class-type -- (when "applied" with the ARM rules) ensures that the
> compiler establishes the same calling convention at the call sites
> as in the callee *without requiring optimizers to look at every call
> site* in every translation unit.

The "mild assumption" doesn't hold. As far as I can remember, it didn't
hold under the ARM either.

> [...]

> | Note that if the called function calls any non-inlined functions, it
> | will probably have to establish an in memory copy in the function,
> | in order to pass a this pointer.

> No, that does not follow. The only thing that matters is how
> copy-constructor and destruction are performed.

For the specification of the API. Suppose I have a class like:

struct C
{
    int i ;
    void f() ;
} ;

and I use it:

void
g( C c )
{
    c.f() ;
}

We are agreed (I think) that this class should be passed in a register.
Within g, however, the compiler will have to call C::f on an instance.
To do so, it will have to pass the address of c as a parameter
(supposing C::f is not inline, and the compiler isn't doing inter-module
optimization). To get its address, it needs an object in memory (on
most hardware).

The obvious solution here is to "memcpy" the class from the registers
into memory. If a later version of C provides an inline version of f,
and that version doesn't leak the this pointer, I think that any good
optimizer should be able to suppress the in memory copy.

And I would stress the fact that at this point, I am talking about
implementation techniques. I do NOT think that whether C::f() is inline
should affect the API. The API says that C is passed in a register,
period.

> | > You want to set the ABI depending only on the type information.

> | For a sufficiently broad definition of "type information". I
> | wouldn't normally consider whether the copy constructor was user
> | defined or generated by the compiler part of the type.

> That would probably fit the traditional C-view of a type.

> If, however, you consider that a type is any collection of constraints
> and operations that applies to a set of values, then it is clear that
> a copy constructor is part of the type.

Agreed. The word "type" has a more general meaning. I was limiting it
to its meaning within the (current) C++ standard, which is also fairly
close to the C view. (C++ uses name equivalence, so two classes named X
are the same type: class X.)

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


Dylan Nicholson

Feb 4, 2004, 10:38:31 PM
Gabriel Dos Reis <g...@integrable-solutions.net> wrote in message
news:<m3r7xi1...@uniton.integrable-solutions.net>...
>
> You don't want to change the ABI depending on the level of
> optimisation, therefore you don't want to depend on what "classical
> optimization techniques" would do.

Ideally no, but even Microsoft, who are quite happy to forego
standards compliance for the sake of backwards compatibility, don't
seem to have a problem with having incompatible ABIs between release
and debug versions (admittedly most of the problems occur because of
#ifdef _DEBUG's, not optimization level, but I know of more than one
example where just changing the optimization level meant that an
application didn't link).

As far as controlling the calling convention via keywords, I would
have thought this could just be done with extern "C" or, perhaps
better, extern "C++" *
Any function declared inside the scope of an extern "C" or "C++" must
conform to the platform's standard C calling convention; anything else
can be optimized according to whatever method the compiler sees best --
and could potentially change across optimization levels (for a start,
if you don't allow this, it makes life for debuggers that allow
displaying and modifying variable contents rather messy).

Dylan


* extern "C" of course has the problem that you can't overload any
functions declared at translation-unit level (outside of any
namespace). Once inside a namespace, it would seem reasonable to
allow overloading, but the few compilers I tested don't support it.

From comeau:

extern "C" namespace bar
{
    void foo();
    void foo(int);
}

"ComeauTest.c", line 4: error: more than one instance of overloaded function
          "bar::foo" has "C" linkage
      void foo(int);

Ben Hutchings

Feb 6, 2004, 3:45:59 PM
Dylan Nicholson wrote:
<snip>
> * extern "C" of course has the problem that you can't overload any
> functions declared at translation-unit level (outside of any
> namespace). Once inside a namespace, it would seem reasonable to
> allow overloading, but the few compilers I tested don't support it.

You can overload the name of a function that is extern "C", but the
other functions sharing the name must be extern "C++". extern "C"
names are independent of namespace, according to standard section
7.5 paragraph 6:

"At most one function with a particular name can have C language
linkage. Two declarations for a function with C language linkage
with the same function name (ignoring the namespace names that
qualify it) that appear in different namespace scopes refer to
the same function. Two declarations for an object with C language
linkage with the same name (ignoring the namespace names that
qualify it) that appear in different namespace scopes refer to
the same object."

This allows the <cxxxx> headers to declare C library functions in
the std namespace even though their definitions are written in C.
