Why Cliff Click used -O2 instead of -O3 when comparing C++ with Java for speed?


rick.ow...@gmail.com

Jul 5, 2015, 2:14:34 AM7/5/15
to mechanica...@googlegroups.com


I am writing equivalent Java and C++ programs to compare the two languages for speed.

In his excellent article, Cliff Click did the same, but he compiled the C++ with -O2 instead of -O3.

On my various benchmarks, Java loses to C++ compiled with -O3 but beats C++ compiled with -O2.

Which one should I use to reach a reasonable conclusion? It looks like Cliff Click chose -O2. Does anyone know why?


Vitaly Davidovich

Jul 5, 2015, 12:41:13 PM7/5/15
to mechanica...@googlegroups.com

Can't speak for Cliff Click, but he mentioned in one of his presentations that the JIT (HotSpot, at least) is roughly equivalent to GCC -O2 in terms of optimizations; perhaps that's the motivation there.

I've also encountered C++ devs who were under the impression that -O3 is for experimental/unstable optimizations and wouldn't enable it.  That's not the case, though, so IMHO -O3 is the right comparison for peak performance (e.g. -O3 is where a lot of the aggressive vectorization takes place).

sent from my phone

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Todd Lipcon

Jul 5, 2015, 1:11:54 PM7/5/15
to mechanica...@googlegroups.com

In my experience, for a lot of applications, -O2 beats -O3 significantly in terms of code size. In the non-hot parts of the code, that matters more than fast straight-line performance, since it avoids instruction cache misses, etc.

At least in the project I'm currently working on, building all my third-party dependencies with -O2 resulted in a couple of percent speedup in the resulting binary. Selectively enabling -O3 on the hot code paths, on the other hand, does make sense to get vectorization, etc.

Todd

Vitaly Davidovich

Jul 5, 2015, 1:36:01 PM7/5/15
to mechanica...@googlegroups.com

It's true that -O3 will almost always increase code size, primarily due to more aggressive unrolling and other loop transforms (and subsequent vectorization).  For microbenchmarks this is almost always a win, since the code is hot by definition and there's not a lot of it; in a "real" app it could cause issues, very true.  My default is -O3, selectively decreasing when warranted.

The ideal way to avoid instruction bloat from expansion of non-hot paths is to either use PGO or at least manually mark unlikely paths (when known a priori).  The benchmarks that benefit from vectorization will be interesting to re-check with Java 9, given a few Intel superword enhancements.

sent from my phone

Gil Tene

Jul 5, 2015, 2:15:11 PM7/5/15
to mechanica...@googlegroups.com
I think it's simply a matter of "that's how most people compile their C/C++ apps and libs". That was certainly true 6+ years ago, and it seems to still be mostly true now.

However, your best bet is probably to test with both -O2 and -O3 if you want to cover your bases. Just remember to ask the "am I willing to turn this optimization on by default for everything I compile?" question when you use -O3. If you are looking at the "I'll use -O3 for some code and not for others" option, you should consider the same hand-select-optimizations-per-method approach in HotSpot...

-- Gil.

Rajiv Kurian

Jul 5, 2015, 2:39:56 PM7/5/15
to mechanica...@googlegroups.com
Ditto here. -O3 almost always gives a better outcome for microbenchmarks, since the code size, number of branches, etc. don't saturate the limits of the machine you're running on. But many "real world" applications I have worked on run better with a mix of -O3 and -Os/-O2.



Vitaly Davidovich

Jul 5, 2015, 2:40:30 PM7/5/15
to mechanica...@googlegroups.com

Hand-selecting optimizations per method (I'm assuming you're referring to -XX:CompileCommand) is very cumbersome in HotSpot, and I've yet to see it widely used; I've only seen it used to turn off compilation of a method entirely when a JIT bug is suspected.

sent from my phone


Vitaly Davidovich

Jul 5, 2015, 2:58:47 PM7/5/15
to mechanica...@googlegroups.com

Yeah, this also makes intuitive sense; most large workloads are dominated by branchy logic and/or cache misses.  There may not even be much code to vectorize.  This is of course not true of certain types of libs, but it generally holds IME.  Inevitably, Java vs C++ microbenchmarks tend to contain lots of loops and use arrays of primitives - lots of opportunity for vectorizing.  IMHO, the difference maker (wrt Java vs C++) for typical real-world apps is better locality and lower abstraction costs in C++.

sent from my phone


ricardo.dan...@gmail.com

Jul 5, 2015, 3:38:25 PM7/5/15
to mechanica...@googlegroups.com

-O3 will certainly beat -O2 in microbenchmarks, but when you benchmark a more realistic application (such as a FIX engine) you will see that -O2 beats -O3 in terms of performance.

As far as I know, -O3 does a very good job compiling small, mathematical pieces of code, but for more realistic and larger applications it can actually be slower than -O2. By trying to aggressively optimize everything (i.e. inlining, vectorization, etc.), the compiler can produce huge binaries that lead to CPU cache misses (especially instruction cache misses). That's one of the reasons the HotSpot JIT chooses not to optimize big methods and/or non-hot methods.

Well, if you use -O3 for microbenchmarks you will get amazingly fast results that are unrealistic for larger and more complex applications. That's why I think the judges use -O2 instead of -O3. For example, our garbage-free Java FIX engine (http://www.coralblocks.com/index.php/category/coralfix) is faster than C++ FIX engines, and I have no idea whether they are compiled with -O2, -O3 or a mix of both through executable linking.

In theory it is possible to selectively compartmentalize an entire C++ application into executable pieces and then choose which ones get compiled with -O2 and which with -O3, then link everything into an ideal binary executable. But in practice, how feasible is that?

The approach HotSpot takes is much simpler. It says:

Listen, I am going to consider each method as an independent unit of execution instead of any block of code anywhere. If that method is hot enough (i.e. called often) and is not too big, I will try to aggressively optimize it.

That of course has the drawback of requiring code warmup, but it is much simpler and produces the best results most of the time for realistic/large/complex applications.

And last but not least, you should probably consider this question if you want to compile your entire application with -O3: http://stackoverflow.com/questions/14850593/when-can-i-confidently-compile-program-with-o3

Remi Forax

Jul 5, 2015, 4:15:44 PM7/5/15
to mechanica...@googlegroups.com

Vitaly Davidovich

Jul 5, 2015, 4:28:34 PM7/5/15
to mechanica...@googlegroups.com

Yeah, that's fairly new and maybe works for Oracle SQE :).  JEP 165 may make this a bit more accessible for end users, but I don't think that's its stated goal.

sent from my phone

Vitaly Davidovich

Jul 5, 2015, 4:45:37 PM7/5/15
to mechanica...@googlegroups.com

You can see what GCC enables at -O3: https://gcc.gnu.org/viewcvs/gcc/trunk/gcc/opts.c?view=markup#l522.  Most of it is for vectorization (loops) and generally shouldn't have a negative effect on other code shapes (it may not improve them either, of course).  HotSpot is actually pretty aggressive about inlining frequent code (the bytecode-size limit and node count are large), and more often than not inlining helps on a hot code path.  The big advantage HotSpot has over typical static compilation is profile info (though the way it's collected can create perf problems sometimes), but if you at least tell the static compiler which paths are uncommon (a good chunk of these are known a priori by the developer), it won't bloat the code unnecessarily.  Best case, you use PGO with a representative and consistent profile, but that's problematic in big apps.

As for FIX handling, how do you know your Java impl is faster due to better icache utilization? I suspect it's really for other reasons.

sent from my phone
