I would like to verify if the following [undoubtedly common] trick for
deliberately circumventing compiler copy/assignment assistance is
legal. I am particularly concerned about using sizeof() inside a class
whose declaration is not yet complete.
I have a class 'Frame' that is about 9,000 bytes big. It contains
several member classes:
struct Frame
{
class Yada1 {...no pointers, no references, recursively} yada1;
class Yada2 {...no pointers, no references, recursively} yada2;
class Packet {...no pointers, no references, recursively} packet;
class Yada3 {...no pointers, no references, recursively} yada3;
} ;
The Yada's and Packet are "pointer-less" and "reference-less". For
each of them, the entirety of their run-time state moves with them as
they move throughout the system.
I want to preempt the compiler's assistance with assignment and copy
construction as I do:
Frame f1, f2;
f1 = f2;
-or-
Frame f3(f1);
So I use a dummy struct containing a character array whose size is the
same size as my Frame. Then when I do assignment, f1 = f2, I take the
address of f1 and of f2, cast each address to a pointer to my struct-
array thingy, do an assignment of the respective dereferences of these
new pointers, and hopefully enjoy the best that the compiler has to
offer in terms of performing the copy/construction on a particular
CPU.
Before I show the code, one question:
If a POD has no members requiring constructors, will it also be devoid
of a default constructor?
Code:
#include <iostream>
struct Foo
{
char a, b, c, d;
Foo () : a('#'), b('#'), c('#'), d('#') {}
Foo & operator = (const Foo &that)
{
if (this == &that)
return *this;
struct Thunk
{
char dummy_buffer[sizeof(Foo)];
} ;
std::cout << "\nInside assignment operator ...\n" << std::endl;
*reinterpret_cast<Thunk *>(const_cast<Foo *>(this)) =
*reinterpret_cast<const Thunk *>(&that);
return *this;
}
} ;
int main ()
{
using namespace std;
cout << "sizeof(Foo) == " << sizeof(Foo) << endl << endl;
Foo f1, f2;
cout << "Before change:" << endl;
cout << "f1: " << f1.a << f1.b << f1.c << f1.d << endl;
cout << "f2: " << f2.a << f2.b << f2.c << f2.d << endl;
f1.a = f1.b = f1.c = f1.d = '@';
f2 = f1;
cout << "After change:" << endl;
cout << "f1: " << f1.a << f1.b << f1.c << f1.d << endl;
cout << "f2: " << f2.a << f2.b << f2.c << f2.d << endl;
return 0;
}
TIA,
-Le Chaud Lapin-
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
> Hi All,
>
> I would like to verify if the following [undoubtedly common] trick for
> deliberately circumventing compiler copy/assignment assistance is
> legal.
What makes you believe this is common? Sounds more like a bad premature
optimisation to me.
> I am particularly concerned about using sizeof() inside a class
> whose declaration is not yet complete.
It is in the body of a function, so it should be okay.
> I have a class 'Frame' that is about 9,000 bytes big. It contains
> several member classes:
>
> struct Frame
> {
> class Yada1 {...no pointers, no references, recursively} yada1;
> class Yada2 {...no pointers, no references, recursively} yada2;
> class Packet {...no pointers, no references, recursively} packet;
> class Yada3 {...no pointers, no references, recursively} yada3;
> } ;
>
> The Yada's and Packet are "pointer-less" and "reference-less".
That doesn't really matter. User declared copy constructors are really
the heart of the issue.
> I want to preempt the compiler's assistance with assignment and copy
> construction as I do:
Why? What compiler (with optimisations turned on, of course) is not
outputting good code in this case?
> If a POD has no members requiring constructors, will it also be devoid
> of a default constructor?
Aggregates (section 8.5.1 of the standard) by definition:
1. Have no user declared constructors
2. No private or protected non-static data members
3. No base classes
4. No virtual functions
A POD-struct is an aggregate with additional constraints (section 9):
1. No non-static data members or arrays of type non-POD-struct
2. No non-static data members or arrays of type non-POD-union
3. No user defined copy assignment operator
4. No user defined destructor
> struct Foo
> {
> char a, b, c, d;
> Foo () : a('#'), b('#'), c('#'), d('#') {}
> Foo & operator = (const Foo &that)
Foo is not a POD, so there are no guarantees that your hack will work.
If we define a compiler-copyable-type as:
1. No user-defined copy constructor or copy assignment operator
2. All non-static data members are compiler-copyable-types
I can't imagine any modern compiler where your hack generates any better
code than the compiler does in this case. What compiler (and
optimisation level) is generating suboptimal code for you in this case?
And if you have a user defined copy constructor or copy assignment
operator, either:
A. It is mimicking the compiler generated one, in which case you don't
need it (private/protected notwithstanding)
B. Your hack has broken semantics.
--
Nevin ":-)" Liber <mailto:ne...@eviloverlord.com> 773 961-1620
I am near the end of my project, so one might say that the moment of
evil has arrived. ;)
> I can't imagine any modern compiler where your hack generates any better
> code than the compiler does in this case. What compiler (and
> optimisation level) is generating suboptimal code for you in this case?
I'm using Microsoft's Visual Studio 2008 stock C++ compiler in
Release mode with all "regular" optimizations turned on, but with
debug information included. "Suboptimal" in this case is function
calls to memcpy, even though truly inline code would be shorter and
faster than the prolog/epilogue code of the memcpy function.
> And if you have a user defined copy constructor or copy assignment
> operator, either:
> A. It is mimicking the compiler generated one, in which case you don't
> need it (private/protected notwithstanding)
> B. Your hack has broken semantics.
Well, it looks like it is about to become even more broken. I am ready
to use _asm on x86 at least. The compiler requires platform-specific
__forceinline to force inlining of constructors whether they were
declared-inline-explicitly+defined-in-class-declaration or not. I
realize that inline is only a hint to the compiler. I was hoping that
Microsoft's compiler would forgive the 10 or so instructions for a
memory move and not go to memcpy, which is bigger and slower.
For example, in the following code, class Frame has no .cpp file,
only .hpp, with copy constructor/assignment using the trick that I
wrote about in my OP. As can be seen, without programmer help at the
command line, the compiler ignores implicit inlining of the
constructor, not taking into consideration how trivial construction
would be if truly inlined, which would be a simple move of 9KB of
information:
Frame frame1, frame2, frame3;
004B175E lea ecx,[ebp-62E4h]
004B1764 call Frame::Frame (4A6E37h)
004B1769 lea ecx,[ebp-85A4h]
004B176F call Frame::Frame (4A6E37h)
004B1774 lea ecx,[ebp-0A864h]
004B177A call Frame::Frame (4A6E37h)
Now for assignment, which is equally trivial using the technique that I
mentioned in my OP, the compiler still makes a call to memcpy:
frame1 = frame2;
004B177F push 22B8h
004B1784 lea eax,[ebp-85A4h]
004B178A push eax
004B178B lea ecx,[ebp-62E4h]
004B1791 push ecx
004B1792 call @ILT+4040(_memcpy) (4A5FCDh)
004B1797 add esp,0Ch
And for copy construction, again it makes a call to memcpy:
Frame frame4(frame3);
004B179A push 22B8h
004B179F lea eax,[ebp-0A864h]
004B17A5 push eax
004B17A6 lea ecx,[ebp-0CB24h]
004B17AC push ecx
004B17AD call @ILT+4040(_memcpy) (4A5FCDh)
004B17B2 add esp,0Ch
return 0;
So in all these cases, there are pushes, invocations of memcpy [which
is surprisingly large, btw], and stack cleanup.
This is a bit too much for this particular area of my project, so I
plan to use __asm, __declspec(naked) and __forceinline on all three
functions:
1. constructor
2. copy constructor
3. assignment operator
...to get the performance that I need.
Certain x86 memory movement instructions are much faster than calls to
memcpy, which simply employs those same instructions internally along
with unnecessary overhead.
-Le Chaud Lapin-
--
The sizeof is safe: by the time the body of operator= is compiled, the
class layout is known. (Like any other member function, operator= may
even reference members declared textually later in the class than the
function itself.)
> I have a class 'Frame' that is about 9,000 bytes big. It contains
> several member classes:
>
> ....
>
> The Yada's and Packet are "pointer-less" and "reference-less". For
> each of them, the entirety of their run-time state moves with them as
> they move throughout the system.
Ok.
> I want to preempt the compiler's assistance with assignment and copy
> construction as I do:
>
> Frame f1, f2;
> f1 = f2;
>
> -or-
>
> Frame f3(f1);
So we're targeting a compiler with an ineffective optimizer, or else
want -O0 to be performant. Fine.
> ....
> Before I show the code, one question:
>
> If a POD has no members requiring constructors, will it also be devoid
> of a default constructor?
It will be devoid of a non-trivial default constructor, which is close
enough.
> Code:
>
> #include <iostream>
> struct Foo
> {
> char a, b, c, d;
> Foo () : a('#'), b('#'), c('#'), d('#') {}
It would be better to use a C++0X compiler. Then this would be a
standard-layout class, and this approach guaranteed to work. (That
said, the point of this particular proposed change in the standards
seems to be "most compilers already work this way".)
> Foo & operator = (const Foo &that)
> {
> if (this == &that)
> return *this;
>
> struct Thunk
> {
> char dummy_buffer[sizeof(Foo)];
> } ;
>
> cout << "\nInside assignment operator ...\n" << endl;
> *reinterpret_cast<Thunk *>(const_cast<Foo *>(this)) =
> *reinterpret_cast<const Thunk *>(&that);
I think (on compilers where this works) you can get away with
*reinterpret_cast<Thunk * const>(this) =
*reinterpret_cast<const Thunk *>(&that);
> return *this;
> }
>
> } ;
I can see how performance testing might indicate that self-assignment
is a big deal with this class with the default operator=, even if the
default is just an (optimized) memset. Unless there was a requirement
to avoid #include <string.h>, I'd prefer to avoid the reinterpret_cast
with
Foo & operator = (const Foo &that)
{
if (this == &that)
return *this;
cout << "\nInside assignment operator ...\n" << endl;
memset(this,&that,sizeof(Foo));
return *this;
}
> Foo & operator = (const Foo &that)
> {
> if (this == &that)
> return *this;
>
> cout << "\nInside assignment operator ...\n" << endl;
> memset(this,&that,sizeof(Foo));
> return *this;
>
> }
s/memset/memcpy/ , of course.
However (considering what else has been mentioned up-thread), I'm
unsurprised you're having to work around non-working optimization with
assembly language.
> So in all these cases, there are pushes, invocations of memcpy [which
> is suprisingly large, btw], and stack cleanup.
>
> This is a bit to much for this particular area of my project, so I
> plan to use __asm, __declspec(naked) and __forceinline on all three
> functions:
>
> 1. constructor
> 2. copy constructor
> 3. assignment operator
>
> ...to get the performance that I need.
>
> Certain x86 memory movement instructions are much faster than calls to
> memcpy, which simply employs those same instructions internally along
> with unnecessary overhead.
(I find it hard to believe that your concern is code size, it's about
speed, right? If so...)
Are you sure about that overhead? I just made a smallest possible
memmove function I could think of (cld, init esi/edi/ecx, rep movsd).
I compared speed of that (inlined and noninlined), with stock
operator=.
Speed-wise, I see only statistically irrelevant differences, inlined
version being (in some, not all, runs) less than 5% faster than the
other two. IOW, I think you have fallen in the trap of optimization
without measurement and this whole discussion is HORRIBLY irrelevant.
Goran.
I was quite surprised that Microsoft's compiler would not optimize a
simple copy. If one has:
struct Foo
{
char buffer[8192];
} f1, f2;
...and does:
f1 = f2;
...the compiler, at least on x86, will ~not~ use the very old, built-
in x86 memory-copying instructions, some of which can operate at
essentially full bus bandwidth, and are far faster and shorter than
any call to memcpy(). Additionally, memcpy() must do a bit of
housework before doing the actual copy, like determining alignment,
whether source overlaps target, relative order of source and target,
stack manipulation, etc.
But when compiler sees...
f1 = f2;
...it knows all of the answers to these questions immediately.
There must be some underlying explanation why such a good compiler will
not simply pick such low-hanging fruit.
On a related note...if Bjarne, while designing C++, had made possible:
char a1[64];
char a2[64];
a1 = a2; // copy of all 64 bytes using fastest CPU instructions
// available, not memcpy.
...I would have lost no sleep over any resulting incompatibilities
with C.
IMHO, such a feature, with high utility/low breach-of-regularity,
would have been a great candidate for absorption.
-Le Chaud Lapin-
M$ went through a period where they were so crazed about correctness
that performance almost didn't matter. I don't remember the details
offhand, but I believe they finally decided that it was too
complicated to generate custom inline code for all possible cases and
that the correct general function was too big to inline.
Anyway, when I do an optimized compile of
struct Foo
{
char buffer[8192];
} f1, f2;
int main( void )
{
strcpy( f1.buffer, "Hi!" );
f2 = f1;
cout << f2.buffer << endl;
return 0;
}
with VC++03 (with /G7), I get:
:
00011 b9 00 08 00 00 mov ecx, 2048 ; 00000800H
00016 be 00 00 00 00 mov esi, OFFSET FLAT:?f1@@3UFoo@@A ; f1
0001b bf 00 00 00 00 mov edi, OFFSET FLAT:?f2@@3UFoo@@A ; f2
:
00025 f3 a5 rep movsd
:
which is recognizable as an inline double word string copy.
However, with VC++08, I get:
:
00002 68 00 20 00 00 push 8192 ; 00002000H
00007 68 00 00 00 00 push OFFSET ?f1@@3UFoo@@A
0000c 68 00 00 00 00 push OFFSET ?f2@@3UFoo@@A ; f2
00011 c7 05 00 00 00
00 48 69 21 00 mov DWORD PTR ?f1@@3UFoo@@A, 2189640 ; 00216948H
0001b e8 00 00 00 00 call _memcpy
:
which is a call to a function (though I'm not sure whether the MOV at
0x11 is actually part of it).
IMO, always calling the function is a cop out ... generating correct
inline code really isn't that hard and you can always limit the
inlined cases to those that are aligned and simple to count and fall
back on the general function for everything else. This is one case
where I think M$ really screwed up.
George
I was going on the assumption that code A that is 15 times larger than
code B is generally slower than code B.
> Are you sure about that overhead? I just made a smallest possible
> memmove function I could think of (cld, init esi/edi/ecx, rep movsd).
> I compared speed of that (inlined and noninlined), with stock
> operator=.
> Speed-wise, I see only statistically irrelevant differences, inlined
> version being (in some, not all, runs) less than 5% faster than the
> other two. IOW, I think you have fallen in the trap of optimization
> without measurement and this whole discussion is HORRIBLY irrelevant.
Having written quite a bit of x86 assembly in Ye Olden Days, I find it
hard to believe that the difference is "statistically irrelevant"
between a movs and the 200+ instructions in the full version of memcpy, at
least 95 of which get executed for a stock operator =. That's
excluding stack manipulation and function calls.
-Le Chaud Lapin-
Back when I actually tested this (Win16, ~1996): it's DLL-imported
memmove that acts like it's using those instructions (and even then
the DLL import overhead was barely measurable). The DLL imported
version of memcpy caused a 1.5x slowdown relative to memmove.
So I changed all of my code targeting Windows to always use memmove,
rather than mess with assembly programming. It makes trying to port
to *NIX much easier not having to think about assembly language.
> (I find it hard to believe that your concern is code size, it's about
> speed, right? If so...)
>
> Are you sure about that overhead? I just made a smallest possible
> memmove function I could think of (cld, init esi/edi/ecx, rep movsd).
Reads like memcpy to me, as memmove has to be safe when the source and
destination memory blocks overlap while memcpy doesn't. When I
checked, the speed difference between assembly-handwritten memcpy and
assembly-handwritten memmove for Intel wasn't measurable.
That is essentially my sentiment.
What surprises me is this: so often the compiler designer is unable to
employ an optimization because there exists the potential of a
rare-but-treacherous code sequence that would negate its validity.
In this case, not only is the optimization always employable, but the
_memcpy alternative actually results in less contextual information
than existed before the call to it.
But I really like M$ compilers, so I have to tell myself that they
have a good reason for doing this other than laziness. :D
-Le Chaud Lapin-
Did you try to code it faster? It's been a couple of days since this
started ;-).
It's clearly the question of data size. If it's big enough, memcpy or
asm won't matter, because time needed for rep movsd (which is what
memcpy of MS CRT uses) will swamp all else.
But, now that you voiced your disbelief of my utterly opaque, yet
highly scientific test ;-), I thought, perhaps my struct size was too
big (~22K). I tried with smaller, ~12K. Nope, still the same. (Stock
PC, 32-bit code on 64-bit Windows). ~8k, same. (Again, inline or not,
doesn't matter).
And finally, I saw a strange thing when I approached 4k: suddenly, stock
operator= (which __is__ memcpy) and manual memcpy became faster than
my asm! That's something I didn't expect. Must be related to some
hardware effect memcpy knows about and I don't. (Either that, or
there's a flaw in my test.)
But now that I have gone through all the motions, I am convinced: you
should leave your optimization idea aside. It's __false__. Try applying
the first rule of optimization:
1. make code faster by changing the design to eliminate hotspots
(preceded by: find hotspots)
Goran.
> > Are you sure about that overhead? I just made a smallest possible
> > memmove function I could think of (cld, init esi/edi/ecx, rep movsd).
> > I compared speed of that (inlined and noninlined), with stock
> > operator=.
> > Speed-wise, I see only statistically irrelevant differences, inlined
> > version being (in some, not all, runs) less than 5% faster than the
> > other two. IOW, I think you have fallen in the trap of optimization
> > without measurement and this whole discussion is HORRIBLY irrelevant.
>
> Having written quite a bit of x86 assembly Ye Olden Days, I find it
> hard to believe that the difference is "statiscally irrelevant"
> between a movs and the 200+ instructions in full version of memcpy, at
> least 95 of which gets executed for a stock operator =. That's
> excluding stack manipulaation and function calls.
Goran openly stated he *wasn't* testing the stock memcpy for whatever
unnamed compiler was being used.
The test that actually matters, is whether stock MSVC memcpy is
competent. (But you already ran that test....)
> It's clearly the question of data size. If it's big enough, memcpy or
> asm won't matter, because time needed for rep movsd (which is what
> memcpy of MS CRT uses) will swamp all else.
Agreed. (Aside: apologies for taking your initial reports too
literally; you make it clear you're doing a three-way test here.)
> But, now that you voiced your disbelief of my utterly opaque, yet
> highly scientific test ;-), I thought, perhaps my struct size was too
> big (~22K). I tried with smaller, ~12K. Nope, still the same. (Stock
> PC, 32-bit code on 64-bit Windows). ~8k, same. (Again, inline or not,
> doesn't matter).
> And finally, I saw a strange thing when I approach 4k: suddenly, stock
> operator= (which __is__ memcpy) and manual memcpy become faster than
> mine asm! That's something I didn't expect. Must be related to some
> hardware effect memcpy knows about and I don't. (Either that, or
> there's a flaw in my test.
I suspect cache locality. The explicit memcpy call explicitly creates
a hotspot which would tend to be in cache rather than out in general
RAM. It looks like (on the target hardware) 4K should be near the
transition point where the cache locality for memcpy overcompensates
for the function call overhead. [Le Chaud Lapin is operating near
~8K.]
Probably Microsoft's development team sees this too, and decided to
gamble on the normal use case being fairly small.
Speculating:
* if the memory buffers being memcpy'd are all over the place, this
would be harder to keep in cache than a few memory buffers being
repeatedly memcpy'ed.
* it may be harder to keep memcpy's body in cache in a real program,
than a test driver.
It really matters what the "whole application" size and timing results
are, and what the priorities are.