I apologize in advance if this is not the appropriate place to post
this message -- I had posted this on a GCC group and a suggestion was
made to post it here as well.
I am experiencing an unexplainable issue with my C++ application and
after weeks of trying to figure it out, I thought it was time to
enlist some other brains to think about it.
In a nutshell, I have a very simple multi-threaded application that I
run on a 4-CPU (dual-core) AMD Opteron (64-bit) Linux server (both
RedHat 4 and SUSE 9). Given what I know about these processors and
cores, I would expect the performance of this application to increase
as I increase the number of simultaneously executing threads from 1 to
8 (the thinking being one thread per core). I do see this behavior
when I run a "debug" compile (i.e. without any optimization flags).
However, when I introduce optimization flags (e.g. -O1, -O2, or -O3),
the performance is faster overall, but deteriorates as the number of
running threads approaches 8.
Here is some additional pertinent information: I am using GCC 3.3.6.
I notice that something about the optimization flags is causing the
system to do more work. Here is the general user-versus-system CPU
breakdown reported by mpstat when running with 8 threads, for the
optimized code and the debug code respectively:
optimized: 30/70 (user/system, per CPU)
debug: 60/40 (user/system, per CPU)
As you can see, system CPU usage is significantly higher with the
optimized code, and I can't figure out why.
Here are some things I have already done: I have looked into I/O
bottlenecks. I don't think I/O is an issue because the only file
reading/writing occurs during initialization and shutdown of the
application, in the main thread (non-multi-threaded code). The amount
of time spent in I/O remains constant whether 4 threads are run or 8.
I have also tried to isolate which optimization flags might be causing
this behavior by slowly introducing the flags that map to -O1, -O2,
and -O3 (instead of adding the whole collection of flags at once via
one of the -Ox flags) according to the GCC documentation. This has
been fruitless, however, because for some reason manually adding these
flags has had no effect on performance. That is, when I add -O1 to my
compiler flags, the overall performance speeds up but then
deteriorates as we approach 8 threads. However, if I add the
following flags to my compiler options instead of -O1 (which
technically should be the same thing), nothing changes (it is the same
as if I had never added -O1 or any other flags):
-fdefer-pop -fmerge-constants -fthread-jumps -floop-optimize
-fcrossjumping -fif-conversion -fif-conversion2 -fdelayed-branch
-fguess-branch-probability -fcprop-registers
Does anybody know why this might be?
Now, somebody has already suggested that this issue with the
optimization flags may very well be a red herring, and it's just that
the more optimized code is going to involve more clashes between
threads. However, I have absolutely no locking going on -- the only
shared memory between the threads is read-only memory. Other than
this, the threads should be independent and relatively CPU-bound, so
that's what makes this more confusing.
If anybody has any insight into this, I would REALLY appreciate any
help you could give me. Thank you so much in advance for your time...
The optimized code probably uses fewer user-mode CPU
cycles to do its work, so the "increase" may just be a fixed
amount of system-mode CPU effort divided by a smaller grand
total. For example, suppose the unoptimized code uses 210
and 140 milliquivers of CPU time in user and system modes for
a total of 350 milliquivers; that'd produce the 60/40 split
you see. If the optimized code uses 60 user-mode milliquivers
while the system-mode effort is unchanged, you'd have a total
of 200 milliquivers and a 30/70 split.
In other words, the user/system ratio *may* be pretty
much meaningless. Take a look at the total time spent in
each mode rather than at the ratio; there's probably no
point in fretting if the total system-mode time is pretty
much the same.
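In case it's useful, here's a minimal sketch of grabbing those
totals from inside the program itself (Linux-specific; it assumes
NPTL, where RUSAGE_SELF aggregates all threads, and the function
name is just something I made up):
#include <sys/resource.h>
#include <cstdio>
// Print absolute user/system CPU totals for the whole process.
// Call once before starting the workers and once after joining
// them; compare the deltas across builds instead of the ratio.
static void print_cpu_totals(const char* label)
{
    rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    std::printf("%s: user %ld.%06lds sys %ld.%06lds\n", label,
                (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
                (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
}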
I don't know what's going on with your scalability
problem, but it seems likely that user/system is a red
herring.
> Here is some additional pertinent information: I am using GCC 3.3.6.
> I notice that something about the optimization flags is causing the
> system to do more work. Here is the general user-versus-system CPU
> breakdown reported by mpstat when running with 8 threads, for the
> optimized code and the debug code respectively:
>
> optimized: 30/70 (user/system, per CPU)
> debug: 60/40 (user/system, per CPU)
>
> Does anybody know why this might be?
This can be caused by false cache-line sharing (cache-line ping-pong).
Under optimization, objects can be packed more tightly, so objects
that sat on different cache lines in the unoptimized build may end up
on a single cache line.
Cache-line ping-pong is extremely expensive and can cause performance
degradation of 100x.
Try padding your main objects:
struct X
{
    // as before
    char pad[64];
};
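To get a feel for how expensive the ping-pong is, here's a toy
benchmark sketch (pthreads; it assumes a 64-byte cache line, which
is what the Opteron uses; time it with and without the pad):
#include <pthread.h>
// Two threads each increment "their own" counter. Without the pad,
// a and b typically share one cache line that ping-pongs between
// cores; with the pad they are guaranteed to sit on different
// lines, and the run is typically several times faster.
struct counters_t
{
    volatile long a;   // volatile so the loop really does 10^8 stores
    char pad[64];      // remove this to see the ping-pong
    volatile long b;
};
static counters_t g;
static void* bump_a(void*) { for (long i = 0; i < 100000000; ++i) g.a++; return 0; }
static void* bump_b(void*) { for (long i = 0; i < 100000000; ++i) g.b++; return 0; }
int main()
{
    pthread_t ta, tb;
    pthread_create(&ta, 0, bump_a, 0);
    pthread_create(&tb, 0, bump_b, 0);
    pthread_join(ta, 0);
    pthread_join(tb, 0);
    return 0;
}
(Compile with g++ -O2 -lpthread and compare `time ./a.out` for the
two layouts.)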
Dmitriy V'jukov
[..snip..]
mpstat/iostat and basically everything that uses /proc/stat or
/proc/uptime to get the CPU accounting information is affected
by the way the information is gathered inside the kernel.
See http://www.kernel.org/doc/Documentation/cpu-load.txt for
details.
If you have a relatively recent vanilla kernel (2.6.21 is not OK,
2.6.23 is), then the information from /proc/<PID>/stat can be used
with a much higher degree of confidence. One can just run the code
in question and monitor the application with `top -p $(pidof app)`.
Using this approach has its own problems (no per-CPU breakdown, for
instance).
>
> Does anybody know why this might be?
>
> Now, somebody has already suggested that this issue with the
> optimization flags may very well be a red herring, and it's just that
> the more optimized code is going to involve more clashes between
> threads. However, I have absolutely no locking going on -- the only
> shared memory between the threads is read-only memory. Other than
> this, the threads should be independent and relatively CPU-bound, so
> that's what makes this more confusing.
>
> If anybody has any insight into this, I would REALLY appreciate any
> help you could give me. Thank you so much in advance for your time...
I'd suggest first verifying the numbers with an accurate load monitor
or profiler, and reading the aforementioned text.
--
mailto:av1...@comtv.ru
[...]
In addition to padding your objects, you really should make sure to align
them on L2 cache-line boundaries. You can't be 100% sure that a padded
structure is not being falsely shared with another if they are not
properly aligned. Imagine if the following represents three cache lines:
[XXXXXXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXXXXXX]
This example shows a cacheline size of 20 X's.
Imagine padded objects A & B residing within that memory space:
[XXXXAAAAAAAAAAAAAAAA-AAAABBBBBBBBBBBBBBBB-BBBBXXXXXXXXXXXXXXXX]
There is false sharing between A & B despite the fact that they are
padded. However, once you align them on cache-line boundaries, the
layout looks like:
[AAAAAAAAAAAAAAAAAAAA-BBBBBBBBBBBBBBBBBBBB-XXXXXXXXXXXXXXXXXXXX]
The false sharing is totally eliminated...
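For completeness, here's a sketch of both halves of that advice with
GCC on Linux (64-byte line assumed; note that plain `new` only
promises malloc's 8- or 16-byte alignment, so heap objects need an
explicitly aligned allocation):
#include <stdlib.h>   // posix_memalign
#include <new>        // placement new
struct X
{
    // ... real members ...
    char pad[64];
} __attribute__((aligned(64)));   // GCC extension: aligns statics/locals
X* new_aligned_X()
{
    void* p = 0;
    if (posix_memalign(&p, 64, sizeof(X)) != 0)
        return 0;                 // allocation failed
    return new (p) X;             // construct in the aligned storage
}
(Destroy such an object with x->~X() followed by free(x).)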
I'm not saying that isn't possible but...
I've never seen a compiler (gcc included) change the layout of structures
as a result of adding -On type options. Maybe I've just never noticed it--
but I'd appreciate seeing an example where structure sizes or offsets
change due to optimization options.
IMO there may well be false sharing occurring in the OP's code, but I doubt
it's being *introduced* by the optimization.
GH
Can you cite an example where this might happen? I can't see anywhere
in the C standard where such an optimisation is forbidden, but on a
hosted environment it would be worse than useless (consider structures
declared in system headers).
Such an optimisation would only be used (in free-standing
implementations) where the compiler supported aggressive space-saving
optimisations. On some architectures it would be impossible; on most
it would have a significant negative impact on performance, due to
misaligned accesses requiring multiple reads.
--
Ian Collins.
When one allocates objects from the heap (free store), in a debug
build the run-time usually reserves some additional space before and
after each object. So in the following example:
X* x1 = new X;
X* x2 = new X;
in a debug build x1 and x2 can end up in different cache lines, while
in a release build they may share one cache line.
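A quick illustrative check (struct X here is just a stand-in for the
real type): print the two addresses in each build and see whether
they fall within the same 64-byte span:
#include <cstdio>
struct X { int v; };
int main()
{
    X* x1 = new X;
    X* x2 = new X;
    // The objects share a cache line iff the two addresses are
    // equal after dividing by the 64-byte line size.
    std::printf("x1=%p x2=%p\n", (void*)x1, (void*)x2);
    delete x2;
    delete x1;
    return 0;
}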
Also there can be some tricks like:
class X
{
    //...
#ifdef _DEBUG
    DWORD owner_thread_id_;
    uint64_t unique_id_;
#endif
    //...
};
Dmitriy V'jukov
> > This can be caused by false cache-line sharing (cache-line ping-pong).
> > Under optimization, objects can be packed more tightly, so objects
> > that sat on different cache lines in the unoptimized build may end
> > up on a single cache line.
>
> Can you cite an example where this might happen? I can't see anywhere
> in the C standard where such an optimisation is forbidden, but on a
> hosted environment it would be worse than useless (consider structures
> declared in system headers).
We are talking about C++. In the general case, C++ doesn't give any
guarantees about object layout, member order, or even the contiguity
of an object.
A C++ compiler can reorder members, add extra members for debug
purposes, or apply optimizations like the EBO (empty base optimization).
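For instance, the empty base optimization by itself changes layout
(the sizes printed below are typical for GCC, not guaranteed):
#include <cstdio>
struct Empty {};
struct AsBase : Empty { int x; };     // EBO: the empty base adds no storage
struct AsMember { Empty e; int x; };  // a member subobject must occupy space
int main()
{
    // Typically prints "4 8" with GCC on x86-64.
    std::printf("%u %u\n",
                (unsigned)sizeof(AsBase), (unsigned)sizeof(AsMember));
    return 0;
}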
> Such an optimisation would only be used (in free-standing
> implementations) where the compiler supported aggressive space saving
> optimisations. On some architectures it would be impossible, on most it
> would make a significant negative impact on performance due to
> misaligned accesses requiring multiple reads.
I am not talking about alignment; C++ fully respects alignment.
Dmitriy V'jukov
> Can you cite an example where this might happen?
Consider a 'malloc' implementation that keeps debug information inside
the block in a 'debug' build but not in a release build. So consider:
int *foo1=malloc(sizeof(int));
int *foo2=malloc(sizeof(int));
These may share a cache line in the release build but not in the debug
build.
> I can't see anywhere
> in the C standard where such an optimisation is forbidden, but on a
> hosted environment it would be worse than useless (consider structures
> declared in system headers).
The padding can just as easily be between structures as within them.
> Such an optimisation would only be used (in free-standing
> implementations) where the compiler supported aggressive space saving
> optimisations. On some architectures it would be impossible, on most it
> would make a significant negative impact on performance due to
> misaligned accesses requiring multiple reads.
The platform might use completely different 'malloc' implementations
in debug builds compared to release builds. This can cause all kinds
of performance differences.
DS
> A C++ compiler can reorder members, add extra members for debug
> purposes, or apply optimizations like the EBO (empty base optimization).
>
Assuming either a POD struct or a class where all member variables have
the same access specifier (very common), no it can't.
--
Ian Collins.
> >>> This can be caused by false cache-line sharing (cache-line ping-pong).
> >>> Under optimization, objects can be packed more tightly, so objects
> >>> that sat on different cache lines in the unoptimized build may end
> >>> up on a single cache line.
> >> Can you cite an example where this might happen? I can't see anywhere
> >> in the C standard where such an optimisation is forbidden, but on a
> >> hosted environment it would be worse than useless (consider structures
> >> declared in system headers).
>
> > We are talking about C++. In the general case, C++ doesn't give any
> > guarantees about object layout, member order, or even the contiguity
> > of an object.
>
> Member order in C++ is fixed (within an access specifier).
Unless I'm missing something, this means that member order in C++ is
*not* fixed in the *general* case. Right?
> > A C++ compiler can reorder members, add extra members for debug
> > purposes, or apply optimizations like the EBO (empty base optimization).
>
> Assuming either a POD struct or a class where all member variables have
> the same access specifier (very common), no it can't.
Indeed, in this particular case. But not in the general case.
If a class has a base class or a constructor/destructor, it's already
not POD.
Dmitriy V'jukov
>>> A C++ compiler can reorder members, add extra members for debug
>>> purposes, or apply optimizations like the EBO (empty base optimization).
>> Assuming either a POD struct or a class where all member variables have
>> the same access specifier (very common), no it can't.
>
> Indeed, in this particular case. But not in the general case.
The above are very common cases. Any gains from swapping the order of
access specifier delimited blocks would be outweighed by rendering code
compiled with different optimisation levels incompatible.
> If a class has a base class or a constructor/destructor, it's already
> not POD.
>
That makes no difference to member ordering.
--
Ian Collins.
> > Indeed, in this particular case. But not in the general case.
>
> The above are very common cases.
Are you trying to say that I am wrong because 99.9999% of classes in
most C++ programs are POD types, and the author of the topic just
can't be dealing with non-POD types in his C++ program?
> Any gains from swapping the order of
> access specifier delimited blocks would be outweighed by rendering code
> compiled with different optimisation levels incompatible.
Release and debug C++ binaries are incompatible anyway.
> > If a class has a base class or a constructor/destructor, it's
> > already not POD.
>
> That makes no difference to member ordering
This makes a difference because members of the base and derived
classes can be reordered.
Anyway there are other reasons:
http://groups.google.ru/group/comp.programming.threads/msg/764284be7778aef5
Also, stack layout can be different because of function inlining and
omission of stack frames, so two objects on the stack can end up in
one cache line.
Dmitriy V'jukov
>
>>> If a class has a base class or a constructor/destructor, it's
>>> already not POD.
>> That makes no difference to member ordering
>
>
> This makes a difference because members of the base and derived
> classes can be reordered.
>
Only in access specifier delimited blocks.
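To pin down the rule we're both pointing at, in code form (the
citation is C++03 9.2/12):
struct S
{
    int a;   // a and b are not separated by an access specifier,
    int b;   // so a must precede b in memory...
private:
    int c;   // ...but where c lands relative to a and b is unspecified.
};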
>
> Anyway there are other reasons:
> http://groups.google.ru/group/comp.programming.threads/msg/764284be7778aef5
>
> Also, stack layout can be different because of function inlining and
> omission of stack frames, so two objects on the stack can end up in
> one cache line.
>
That has nothing to do with class member ordering.
--
Ian Collins.
> >>> Indeed, in this particular case. But not in the general case.
> >> The above are very common cases.
>
> > Are you trying to say that I am wrong because 99.9999% of classes in
> > most C++ programs are POD types, and the author of the topic just
> > can't be dealing with non-POD types in his C++ program?
>
> Don't be daft.
The fact that POD is a common case doesn't matter here, because
non-POD is a common case too.
> >> Any gains from swapping the order of
> >> access specifier delimited blocks would be outweighed by rendering code
> >> compiled with different optimisation levels incompatible.
>
> > Release and debug C++ binaries are incompatible anyway.
>
> Not on any platform I know.
Hmmm... MSVC++ with dynamic runtime.
> >>> If a class has a base class or a constructor/destructor, it's
> >>> already not POD.
> >> That makes no difference to member ordering
>
> > This makes a difference because members of the base and derived
> > classes can be reordered.
>
> Only in access specifier delimited blocks.
>
> > Anyway there are other reasons:
> >http://groups.google.ru/group/comp.programming.threads/msg/764284be77...
>
> > Also, stack layout can be different because of function inlining and
> > omission of stack frames, so two objects on the stack can end up in
> > one cache line.
>
> That has nothing to do with class member ordering.
Well, actually here:
http://groups.google.ru/group/comp.programming.threads/msg/f6b12e9a1108f805
I meant mainly 'external packing', not 'internal packing'. The most
probable real-life case of different 'internal packing' I can think
of is:
class X
{
    //...
#ifdef _DEBUG
    DWORD owner_thread_id_;
    uint64_t unique_id_;
#endif
    //...
};
>>> Release and debug C++ binaries are incompatible anyway.
>> Not on any platform I know.
>
>
> Hmmm... MSVC++ with dynamic runtime.
>
I didn't realise that ran on Linux....
>> That has nothing to do with class member ordering.
>
> Well, actually here:
> http://groups.google.ru/group/comp.programming.threads/msg/f6b12e9a1108f805
> I meant mainly 'external packing', not 'internal packing'. The most
> probable real-life case of different 'internal packing' I can think
> of is:
>
> class X
> {
> //...
> #ifdef _DEBUG
> DWORD owner_thread_id_;
> uint64_t unique_id_;
> #endif
> //...
> };
>
Well doh! Here you've explicitly changed the class based on the
preprocessor token _DEBUG.
--
Ian Collins.
> Gil Hamilton <gil_hamil...@hotmail.com> wrote:
>>
>> [Apparent performance degradation with optimization]
>>
>> > This can be caused by false cache-line sharing (cache-line
>> > ping-pong). Under optimization, objects can be packed more tightly.
>> > Try padding your main objects:
>>
>> I'm not saying that isn't possible but...
>> I've never seen a compiler (gcc included) change the layout of
>> structures as a result of adding -On type options. Maybe I've just
>> never noticed it--
>> but I'd appreciate seeing an example where structure sizes or offsets
>> change due to optimization options.
> When one allocates objects from the heap (free store), in a debug
> build the run-time usually reserves some additional space before and
> after each object. So in the following example:
> X* x1 = new X;
> X* x2 = new X;
> in a debug build x1 and x2 can end up in different cache lines, while
> in a release build they may share one cache line.
>
> Also there can be some tricks like:
>
> class X
> {
> //...
> #ifdef _DEBUG
> DWORD owner_thread_id_;
> uint64_t unique_id_;
> #endif
> //...
> };
But this has nothing to do with adding -O3 to the gcc options.
Obviously the OP could well have a completely different set of code for
his "debug mode" build, in which case--without seeing the code--none of
us can really guess what might be different between the two sets.
However, the OP specifically asked about "when I introduce optimization
flags (e.g. -O1, -O2, or -O3)" and your answer was misleading at best
given this as background.
GH
> But this has nothing to do with adding -O3 to the gcc options.
The OP was not sure his optimization flags were the issue; at least,
not directly.
> Obviously the OP could well have a completely different set of code for
> his "debug mode" build, in which case--without seeing the code--none of
> us can really guess what might be different between the two sets.
> However, the OP specifically asked about "when I introduce optimization
> flags (e.g. -O1, -O2, or -O3)" and your answer was misleading at best
> given this as background.
The OP couldn't narrow it down to particular flags or even consistent
combinations of flags. Given that, a general discussion of things
that can result in performance differences across different builds of
the same code base seems appropriate to me.
DS
That's fine. However, Dmitriy's post was clearly implying that the
optimization options given to the compiler might affect the compiler's
layout of the data structures when he said:
>>> Under optimization, objects can be packed more tightly.
And that simply isn't true. (Or at least--I will again qualify--I've
never heard of a C or C++ compiler that changes structure layouts at
different levels of optimization and I don't believe it's true of gcc in
particular.)
GH
> > The OP couldn't narrow it down to particular flags or even consistent
> > combinations of flags. Given that, a general discussion of things
> > that can result in performance differences across different builds of
> > the same code base seems appropriate to me.
>
> That's fine. However, Dmitriy's post was clearly implying that the
> optimization options given to the compiler might affect the compiler's
> layout of the data structures when he said:
>
> >>> Under optimization, objects can be packed more tightly.
>
> And that simply isn't true. (Or at least--I will again qualify--I've
> never heard of a C or C++ compiler that changes structure layouts at
> different levels of optimization and I don't believe it's true of gcc in
> particular.)
I mean 'external object packing', not 'internal object packing'.
Objects allocated on the heap (free store), as well as objects
allocated on the stack, can end up closer to each other in a release
build:
void f();   // forward declaration

int main()
{
    thread_param_t tp1 = ...;
    start_thread(&thread_func1, &tp1);
    f();
}

void f()
{
    thread_param_t tp2 = ...;
    start_thread(&thread_func2, &tp2);
}
If f() is inlined, then tp1 and tp2 will end up closer to each other.
Sorry for any confusion, I'm not a native English speaker.
Dmitriy V'jukov
I made a small compute-bound test program and multi-threaded it, and
I got good scaling (I measure elapsed time on an otherwise empty
computer; I can't figure out how to get any other decent measure of
CPU time and thread efficiency). Anyway, 4 threads ran about 4x on
this small program (on a 4-CPU machine, of course).
Then I took my large application, which is thousands of lines of C,
and threaded it so that there was no locking going on. I did not come
close to attaining scaled performance. With 2 CPUs, instead of running
in 1/2 the time, it ran in perhaps 70% of the time. More CPUs gave
even worse efficiency. I was thinking, "Wow, I've got this one nailed,
the threads are totally independent (that took some work)". Oh well...
Speculation: maybe it is the shared memory architecture. Caches and
all that should help, but I can't think of anything else. My
application is grabbing various variables from a bunch of shared data
structures.
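One thing worth checking, per the earlier discussion: if any of those
shared structures are ever written (even just a statistics counter),
each write drags the cache line away from every reader. A minimal
sketch of the usual fix, giving each thread its own padded slot (the
names are made up; a 64-byte line is assumed):
const int kMaxThreads = 8;   // hypothetical upper bound

struct thread_state
{
    long hits;               // whatever this thread updates
    // ...
    char pad[64];            // keep other threads' slots off this line
};

static thread_state g_state[kMaxThreads];   // one slot per thread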
Dick