What kind of optimizations are possible in a JIT that are difficult to do in a PGO AOT binary?


Rajiv Kurian

Apr 2, 2014, 4:13:33 AM4/2/14
to mechanica...@googlegroups.com
I am trying to understand what kind of optimizations a JITed runtime like the JVM could provide over a C/C++ app compiled using, say, GCC's PGO. With PGO, inlining of virtual functions (the poster boy of JIT-superiority claims) should be possible. With PGO + -march=native, machine-specific optimizations should be possible too. Link-time optimization (the -flto flag) should also, in theory, allow for holistic optimization.
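To make the devirtualization terms concrete, here is a minimal, hypothetical Java sketch (names invented for illustration) of the transformation both a JIT and a PGO build aim for: replacing a virtual dispatch with a type-guarded direct call (a "guarded inline"):

```java
interface Shape { double area(); }

final class Circle implements Shape {
    final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

public class Devirt {
    // What the source says: a plain virtual dispatch through the interface.
    static double areaVirtual(Shape s) {
        return s.area();
    }

    // What the optimizer emits after profiling shows the receiver is almost
    // always a Circle: a cheap type check guarding an inlined copy of
    // Circle.area(), with a fallback to the generic virtual dispatch.
    static double areaGuarded(Shape s) {
        if (s.getClass() == Circle.class) {
            Circle c = (Circle) s;
            return Math.PI * c.r * c.r;   // inlined body, guard succeeded
        }
        return s.area();                  // slow path: full virtual dispatch
    }

    public static void main(String[] args) {
        Shape s = new Circle(2.0);
        System.out.println(areaVirtual(s) == areaGuarded(s)); // prints true
    }
}
```

A JIT performs this rewrite at runtime from observed receiver types and can undo it (deoptimize) if a new Shape implementation shows up later; a PGO compiler has to bake the guess in at build time.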

The one thing I can think of is that it is difficult to test an instrumented binary with "real data", especially for a long-running program like a server - for example, an app with hundreds of command-line flags. Also, if there is no typical profile, i.e. it varies by usage, then one might do more harm than good with PGO. One could in theory trust the JIT to work out the best optimizations given this kind of variance.

Finally, a sweeping, religious question full of assumptions ;) - in spite of these advantages, why does Java or C# not beat C/C++ in all but a few benchmarks (almost always benchmarks of virtual functions with a single implementation)? Any opinions on why we don't see JIT-enabled runtimes for native languages with explicit memory management?

Thanks,
Rajiv

Peter Lawrey

Apr 2, 2014, 4:32:24 AM4/2/14
to mechanica...@googlegroups.com
In spite of these advantages, why does Java or C# not beat C/C++ in all but a few benchmarks?

The advantage of C and C++ is that you have many ways of doing the same thing, and if you really know what you are doing you can heavily optimise the code. When you have a simple set of requirements which are well understood and don't change, C/C++ is very fast compared with Java. Benchmarks tend to be simple problems where someone with experience can spend an unrealistic amount of time optimising the code to death.

However, in the real world you often (but not always) have changing requirements, problems which are not as well understood as you would like, teams with mixed technical ability (it is useful to have team members who have strong non-technical skills), and access to open-source third-party libraries. As yet, there isn't a good benchmark where the code is written by someone with one year of experience, who is more creative than technical, in a limited time frame, for multiple platforms, using mostly third-party libraries.

Note: the same arguments can be applied to not using Java but instead using a language like Scala, Groovy or JRuby.  For my projects, Java is the right balance.

Any opinions on why we don't see JIT-enabled runtimes for native languages with explicit memory management?

I have libraries written in Java which use explicit memory management for very large data sets, i.e. GB to TB. Just because Java has memory management doesn't mean you have to use it for everything.
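For flavour, "explicit memory management in Java" usually means keeping data off the GC heap. A minimal sketch (not Peter's actual library - class and method names are invented) using a direct ByteBuffer as a fixed-record store:

```java
import java.nio.ByteBuffer;

public class OffHeapStore {
    // Records are (long id, double value) pairs, 16 bytes each, laid out
    // contiguously in memory that the garbage collector never scans or moves.
    static final int RECORD_SIZE = Long.BYTES + Double.BYTES;
    private final ByteBuffer buf;

    OffHeapStore(int records) {
        buf = ByteBuffer.allocateDirect(records * RECORD_SIZE);
    }

    void put(int i, long id, double value) {
        buf.putLong(i * RECORD_SIZE, id);                   // absolute indexed
        buf.putDouble(i * RECORD_SIZE + Long.BYTES, value); // writes, no position state
    }

    long id(int i)      { return buf.getLong(i * RECORD_SIZE); }
    double value(int i) { return buf.getDouble(i * RECORD_SIZE + Long.BYTES); }

    public static void main(String[] args) {
        OffHeapStore store = new OffHeapStore(1024);
        store.put(0, 42L, 3.5);
        System.out.println(store.id(0) + " " + store.value(0)); // prints 42 3.5
    }
}
```

The "freeing" here is simply letting the buffer (or a memory-mapped file in the TB case) go; the point is that record data never becomes individual objects for the GC to track.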



--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Rajiv Kurian

Apr 2, 2014, 4:40:46 AM4/2/14
to mechanica...@googlegroups.com


On Wednesday, April 2, 2014 1:32:24 AM UTC-7, Peter Lawrey wrote:
In spite of these advantages, why does Java or C# not beat C/C++ in all but a few benchmarks?

The advantage of C and C++ is that you have many ways of doing the same thing, and if you really know what you are doing you can heavily optimise the code. When you have a simple set of requirements which are well understood and don't change, C/C++ is very fast compared with Java. Benchmarks tend to be simple problems where someone with experience can spend an unrealistic amount of time optimising the code to death.

However, in the real world you often (but not always) have changing requirements, problems which are not as well understood as you would like, teams with mixed technical ability (it is useful to have team members who have strong non-technical skills), and access to open-source third-party libraries. As yet, there isn't a good benchmark where the code is written by someone with one year of experience, who is more creative than technical, in a limited time frame, for multiple platforms, using mostly third-party libraries.

Note: the same arguments can be applied to not using Java but instead using a language like Scala, Groovy or JRuby.  For my projects, Java is the right balance.

Any opinions on why we don't see JIT-enabled runtimes for native languages with explicit memory management?

I have libraries written in Java which use explicit memory management for very large data sets, i.e. GB to TB. Just because Java has memory management doesn't mean you have to use it for everything.
Good point, but did you really write this code in Java instead of C because of the potential JIT benefits? What has Java gotten you there w.r.t. C/C++, besides the joys of writing small parts of a modern C/C++ compiler ;)



On 2 April 2014 03:13, Rajiv Kurian <geet...@gmail.com> wrote:
I am trying to understand what kind of optimizations a JITed runtime like the JVM could provide over a C/C++ app compiled using, say, GCC's PGO. With PGO, inlining of virtual functions (the poster boy of JIT-superiority claims) should be possible. With PGO + -march=native, machine-specific optimizations should be possible too. Link-time optimization (the -flto flag) should also, in theory, allow for holistic optimization.

The one thing I can think of is that it is difficult to test an instrumented binary with "real data", especially for a long-running program like a server - for example, an app with hundreds of command-line flags. Also, if there is no typical profile, i.e. it varies by usage, then one might do more harm than good with PGO. One could in theory trust the JIT to work out the best optimizations given this kind of variance.

Finally, a sweeping, religious question full of assumptions ;) - in spite of these advantages, why does Java or C# not beat C/C++ in all but a few benchmarks (almost always benchmarks of virtual functions with a single implementation)? Any opinions on why we don't see JIT-enabled runtimes for native languages with explicit memory management?

Thanks,
Rajiv


Michael Barker

Apr 2, 2014, 4:43:57 AM4/2/14
to mechanica...@googlegroups.com
I would expect that, in theory, any specific optimisation would be available to both PGO+AOT and JIT compilation. I think the differences are more workflow/productivity related. Here are some random thoughts:

- The build, deploy-to-performance-environment, test, profile, rebuild (rinse/repeat) flow could be quite lengthy compared to just dropping it in and running...
- Obviously JIT pays some initial warm-up cost.
- How correct is the load scenario being applied in the performance test? If it varies from production (even by a small amount) you could end up with some nasty performance issues.
- Some smart JITs have the ability to re-optimise if the load profile changes while the system is running - I'm not sure how widely that is implemented.

I could see it working well for apps like games, where you could have a tester run through a number of levels and get quite a good spread of coverage across the code. However, I'd be very reluctant to apply something like that to a system like LMAX, as I would worry about over-biasing it toward our performance environment.

Mike.

Gil Tene

Apr 2, 2014, 9:05:50 AM4/2/14
to mechanica...@googlegroups.com
Instead of focusing on memory management, I focus on code management when it comes to these kinds of optimization questions. I like MSFT's CLR classification, which separates "managed code" from "managed data". Java and C# (and Scala and Clojure and F# and Groovy, etc.) are all "managed code, managed data" languages. But there are pure "managed code" environments that do not include managed data. They just aren't very common (e.g. Java + Unsafe would qualify for this, probably).

It's the managed code quality that allows for many optimizations. The fact that C allows for " x = (this_struct *) (*f)((that_struct *) y); " makes certain optimizations that rely on knowledge of a limited type and target-function space "hard". And while full-program static analysis by a PGO compiler can often approximate "managed code", many real-world behaviors make real-world full-program static analysis impractical for things that are not purely embedded.

On Wednesday, April 2, 2014 1:13:33 AM UTC-7, Rajiv Kurian wrote:
I am trying to understand what kind of optimizations a JITed runtime like the JVM could provide over a C/C++ app compiled using, say, GCC's PGO. With PGO, inlining of virtual functions (the poster boy of JIT-superiority claims) should be possible. With PGO + -march=native, machine-specific optimizations should be possible too. Link-time optimization (the -flto flag) should also, in theory, allow for holistic optimization.

"should be possible" is the key phrase here.

GCC, LLVM, Intel and Microsoft compilers for C/C++ have been around for a while now, and keep evolving in optimization, but we still do not see PGO inlining of virtual functions in library code. Whether it's inline caching (the changing of a virtual dispatch to a static one guarded by a type check), guarded inlining (the inlining of a virtual method body guarded by a type check), or unguarded, class-hierarchy-analysis (CHA) based versions of both, we just don't see much of this in the C/C++ world. The same can be said for constant propagation based on runtime-environment-discovered constants. This has multiple real-world reasons.

For example, in practice, C/C++ code predominantly uses dynamic libraries as a form of "portability runtime", since few people write their own malloc or BSD socket code, or get-time-of-day, or find it practical to promise to only ever use the current binary version of libc. And many high-performance machine features (e.g. alternative low-latency NICs and network stacks, mallocs that improve over time) and ongoing bug fixes and security improvements (e.g. "oops, should probably fix that stack overflow vulnerability in libXYZ") use this fact to allow deployment-time injection of newly implemented functionality. This makes it hard, in practice, for a PGO compiler to inline library code (or gain from other program+library combo optimizations), whether it's libc, or libsocket, or my-cool-math library. Choosing to use purely static binaries can improve on this to some degree, but it is not commonly practiced (and for good reason).

Similarly, managed data allows us to do things like constant propagation of static final values set at initialization time. Whether it's code that determines (at runtime) whether to use a specific machine feature, adapts to specific launch flags or early-program-discovered parameters, or esoteric things like code that optimizes for the current month (and restarts daily), managed runtimes are full of examples of static finals giving you speed that can't be easily found with PGO. Again, with exhaustive full-program static analysis it should be possible to prove the "constantness" of a memory location for a specific complete program, and if we combined that with just-ahead-of-time, last-minute compilation+linking (ala LLVM), you could maybe get a similar benefit, but we just don't see this done in the real world. Yet.
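A hedged sketch of the static-final pattern (field and method names invented for illustration): because the JIT compiles sum() only after the class initializer has run, it can treat the field as a true constant and delete the untaken branch, something an AOT binary can only guess at:

```java
public class RuntimeConstant {
    // Discovered once at class-initialization time, then immutable. The JIT
    // compiles methods that read this field *after* it is set, so the field
    // folds to a constant and the dead branch below can disappear entirely.
    static final boolean FORWARD_SCAN =
            !Boolean.getBoolean("demo.reverse"); // runtime-discovered constant

    static int sum(int[] a) {
        int s = 0;
        if (FORWARD_SCAN) {
            for (int x : a) s += x;                  // kept if the constant is true
        } else {
            for (int i = a.length - 1; i >= 0; i--)  // removed if it is false
                s += a[i];
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[]{1, 2, 3})); // prints 6 either way
    }
}
```

A PGO build can predict which branch is hot, but it must keep both and re-test the flag on every call, because the property could differ on the next launch.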

On the other hand, if we compare what should be possible, then JITs still have a lot of untapped "possible". E.g. on the managed-code (or managed code + managed data) side, it should be possible for escape analysis to make Java just as good as C (or even better) at making temporary variables that can be localized to the stack stay there. This should allow (in theory) for seemingly heap-allocated objects to be just as fast as stack-based variables. With JITs it should be possible to "auto-discover" situations where C is currently forced to use off-stack allocation but Java could (speculatively?) keep the same temporary state on the stack for better speed. We just haven't evolved JITs to that level yet, and it's hard to tell if we will in the next few years.
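The escape-analysis point can be sketched as follows (a hypothetical example; whether any given JVM actually eliminates this allocation is not guaranteed): 'p' never leaves distance(), so an escape-analysis-capable JIT may scalar-replace it into two doubles in registers and never touch the heap:

```java
public class EscapeDemo {
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    // 'p' is never stored to a field, passed out, or returned, so it does
    // not escape. A JIT with escape analysis can replace the allocation
    // with two stack/register doubles, matching a C struct on the stack.
    static double distance(double x, double y) {
        Point p = new Point(x, y);   // candidate for scalar replacement
        return Math.sqrt(p.x * p.x + p.y * p.y);
    }

    public static void main(String[] args) {
        System.out.println(distance(3.0, 4.0)); // prints 5.0
    }
}
```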
 

The one thing I can think of is that it is difficult to test an instrumented binary with "real data", especially for a long-running program like a server - for example, an app with hundreds of command-line flags. Also, if there is no typical profile, i.e. it varies by usage, then one might do more harm than good with PGO. One could in theory trust the JIT to work out the best optimizations given this kind of variance.

Finally, a sweeping, religious question full of assumptions ;) - in spite of these advantages, why does Java or C# not beat C/C++ in all but a few benchmarks (almost always benchmarks of virtual functions with a single implementation)?

Putting aside the futility of arguing over benchmarks, there are some strong performance-affecting things around memory layout and memory access patterns for which current Java JDKs are seemingly still inherently inferior in speed compared to C variants. I say "seemingly" because I believe it has more to do with library semantics than it does with language semantics, and I think it will be changing very soon.

Specifically, C variants allow for the following three commonly used memory layout schemes that the Java ecosystem does not currently support with speed:

- An array of structs.
 
- A struct with a struct inside.

- A struct with a variable size array at the end.

Note that I'm not specifically talking about structs. You can replace "struct" with "object" above and keep my same meaning. It's just that the common language environments that use the term "object" to describe their structs typically don't support the above memory layout relationships.

For each of these cases, equivalent Java implementation logic using today's JDKs will require extra data-indirection (de-referencing) steps to access the same information. Even more importantly, in the first case, and in combinations of the first or third case with the second, C variants will achieve linear streaming through memory where Java will end up with potentially random access, which has a huge impact on how hardware prefetchers work and how well memory is streamed into the cache for similar program semantics.
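The indirection is easy to demonstrate with today's Java. An array of objects is an array of references, so a linear scan becomes pointer chasing; the common hand-rolled workaround is to flatten the fields into contiguous primitive arrays. A sketch (class names invented for illustration):

```java
public class LayoutDemo {
    static final class Point {
        double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    // Java today: Point[] holds references; every element access is an
    // extra dereference to a separately allocated heap object, in whatever
    // order the allocator and GC happened to place them.
    static double sumXBoxed(Point[] pts) {
        double s = 0;
        for (Point p : pts) s += p.x;
        return s;
    }

    // Hand-flattened layout: the x coordinates sit contiguously, so the
    // scan streams linearly through memory the way a C array of structs does.
    static double sumXFlat(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s;
    }

    public static void main(String[] args) {
        Point[] pts = { new Point(1, 9), new Point(2, 9), new Point(3, 9) };
        double[] xs = { 1, 2, 3 };
        System.out.println(sumXBoxed(pts) + " " + sumXFlat(xs)); // prints 6.0 6.0
    }
}
```

Both methods compute the same answer; the difference is purely in the memory traffic the hardware sees.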

But as I note, this is not inherent to a language, to managed code, or to managed data, and I believe it can be powerfully resolved by a combination of some well-defined library constructs and JVM-internal optimizations. ObjectLayout is an example of what I mean by this, with StructuredArray specifically addressing the first of the three cases above, while fitting well into Java's object model and "playing nice" with all the existing code that knows how to deal with objects.
 
Any opinions on why we don't see JIT-enabled runtimes for native languages with explicit memory management?

My opinion: it's because (per the examples above) it's been hard to make them produce (in the current "real world") the same sorts of optimizations that managed-code and managed-data runtimes seem to be able to produce in their current real world.
 

Thanks,
Rajiv

Peter Lawrey

Apr 2, 2014, 11:15:03 AM4/2/14
to mechanica...@googlegroups.com
The main advantage of writing small pieces of low-level Java is that it integrates with natural Java very well. It is still cross-platform and works with all the normal Java tools. If 90% of your project is going to be in Java, IMHO it makes sense to use low-level Java instead of C for the low-level coding.



Rajiv Kurian

Apr 2, 2014, 2:18:31 PM4/2/14
to mechanica...@googlegroups.com
Thank you for that very detailed answer. I have a few follow up questions.


On Wednesday, April 2, 2014 6:05:50 AM UTC-7, Gil Tene wrote:
Instead of focusing on memory management, I focus on code management when it comes to these kinds of optimization questions. I like MSFT's CLR classification, which separates "managed code" from "managed data". Java and C# (and Scala and Clojure and F# and Groovy, etc.) are all "managed code, managed data" languages. But there are pure "managed code" environments that do not include managed data. They just aren't very common (e.g. Java + Unsafe would qualify for this, probably).
Thank you, that's a much clearer classification.

It's the managed code quality that allows for many optimizations. The fact that C allows for " x = (this_struct *) (*f)((that_struct *) y); " makes certain optimizations that rely on knowledge of a limited type and target-function space "hard". And while full-program static analysis by a PGO compiler can often approximate "managed code", many real-world behaviors make real-world full-program static analysis impractical for things that are not purely embedded.

On Wednesday, April 2, 2014 1:13:33 AM UTC-7, Rajiv Kurian wrote:
I am trying to understand what kind of optimizations a JITed runtime like the JVM could provide over a C/C++ app compiled using, say, GCC's PGO. With PGO, inlining of virtual functions (the poster boy of JIT-superiority claims) should be possible. With PGO + -march=native, machine-specific optimizations should be possible too. Link-time optimization (the -flto flag) should also, in theory, allow for holistic optimization.

"should be possible" is the key phrase here.

GCC, LLVM, Intel and Microsoft compilers for C/C++ have been around for a while now, and keep evolving in optimization, but we still do not see PGO inlining of virtual functions in library code. Whether it's inline caching (the changing of a virtual dispatch to a static one guarded by a type check), guarded inlining (the inlining of a virtual method body guarded by a type check), or unguarded, class-hierarchy-analysis (CHA) based versions of both, we just don't see much of this in the C/C++ world. The same can be said for constant propagation based on runtime-environment-discovered constants. This has multiple real-world reasons.
But PGO in G++ and MSVC does employ inline caching for your own virtual functions. Am I correct that your point is that they cannot inline virtual functions ONLY if they are in other libraries?

For example, in practice, C/C++ code predominantly uses dynamic libraries as a form of "portability runtime", since few people write their own malloc or BSD socket code, or get-time-of-day, or find it practical to promise to only ever use the current binary version of libc. And many high-performance machine features (e.g. alternative low-latency NICs and network stacks, mallocs that improve over time) and ongoing bug fixes and security improvements (e.g. "oops, should probably fix that stack overflow vulnerability in libXYZ") use this fact to allow deployment-time injection of newly implemented functionality. This makes it hard, in practice, for a PGO compiler to inline library code (or gain from other program+library combo optimizations), whether it's libc, or libsocket, or my-cool-math library. Choosing to use purely static binaries can improve on this to some degree, but it is not commonly practiced (and for good reason).
Doesn't the JVM itself use dynamic libraries for some of the same functionality? And if it doesn't, then it suffers from the same problems - a new JVM is required to address some minor problem. It seems like the bulk of optimizations not available to C/C++ programs are the ones due to not having all the source code at the same time (i.e. dynamically linking in a bunch of .so files). I could just get the sources for my libraries and use whole-program optimization to get the same kinds of optimizations, instead of linking in a dynamic library. I realize this isn't done a lot, but I don't see any reason besides closed-source dependencies and general make/cmake script awfulness. This will (in practice, not just in theory) lead to inlining of virtual functions, constant folding, optimized code layout, dead-block elimination, etc. (using the proper cross-module optimization incantations: -flto, -fripa, section-ordering files passed via -Wl, etc.).

Similarly, managed data allows us to do things like constant propagation of static final values set at initialization time. Whether it's code that determines (at runtime) whether to use a specific machine feature, adapts to specific launch flags or early-program-discovered parameters, or esoteric things like code that optimizes for the current month (and restarts daily), managed runtimes are full of examples of static finals giving you speed that can't be easily found with PGO. Again, with exhaustive full-program static analysis it should be possible to prove the "constantness" of a memory location for a specific complete program, and if we combined that with just-ahead-of-time, last-minute compilation+linking (ala LLVM), you could maybe get a similar benefit, but we just don't see this done in the real world. Yet.
Right - optimizations stemming from runtime-discovered constants do seem like a great missed opportunity for even PGO-aided AOT binaries. Why is managed data required for this to work, though? A managed-code environment alone should be able to do the same optimizations (from runtime-discovered constants - eliminating dead branches, inlining the ones that can be taken, etc.). Hot-swapping code seems to be the main capability required, not updating pointers to data. I've only seen a few projects use LLVM's JIT facilities, and I am yet to find successful ones.

On the other hand, if we compare what should be possible, then JITs still have a lot of untapped "possible". E.g. on the managed-code (or managed code + managed data) side, it should be possible for escape analysis to make Java just as good as C (or even better) at making temporary variables that can be localized to the stack stay there. This should allow (in theory) for seemingly heap-allocated objects to be just as fast as stack-based variables. With JITs it should be possible to "auto-discover" situations where C is currently forced to use off-stack allocation but Java could (speculatively?) keep the same temporary state on the stack for better speed. We just haven't evolved JITs to that level yet, and it's hard to tell if we will in the next few years.
What about the cost of running the JIT logic itself? I am purely speculating, but instrumented binaries in GCC are much slower (though sample-based profiling works fine), and I am guessing that JITs need to pay a similar price. Also, the JIT logic itself occupies icache. I don't know how expensive hot-swapping code is, but my feeble understanding is that it requires safepoints, which bring their own issues. These factors are more pronounced on less powerful devices like smartphones.
 

The one thing I can think of is that it is difficult to test an instrumented binary with "real data", especially for a long-running program like a server - for example, an app with hundreds of command-line flags. Also, if there is no typical profile, i.e. it varies by usage, then one might do more harm than good with PGO. One could in theory trust the JIT to work out the best optimizations given this kind of variance.

Finally, a sweeping, religious question full of assumptions ;) - in spite of these advantages, why does Java or C# not beat C/C++ in all but a few benchmarks (almost always benchmarks of virtual functions with a single implementation)?

Putting aside the futility of arguing over benchmarks, there are some strong performance-affecting things around memory layout and memory access patterns for which current Java JDKs are seemingly still inherently inferior in speed compared to C variants. I say "seemingly" because I believe it has more to do with library semantics than it does with language semantics, and I think it will be changing very soon.

Specifically, C variants allow for the following three commonly used memory layout schemes that the Java ecosystem does not currently support with speed:

- An array of structs.
 
- A struct with a struct inside.

- A struct with a variable size array at the end.

Note that I'm not specifically talking about structs. You can replace "struct" with "object" above and keep my same meaning. It's just that the common language environments that use the term "object" to describe their structs typically don't support the above memory layout relationships.

For each of these cases, equivalent Java implementation logic using today's JDKs will require extra data-indirection (de-referencing) steps to access the same information. Even more importantly, in the first case, and in combinations of the first or third case with the second, C variants will achieve linear streaming through memory where Java will end up with potentially random access, which has a huge impact on how hardware prefetchers work and how well memory is streamed into the cache for similar program semantics.

But as I note, this is not inherent to a language, to managed code, or to managed data, and I believe it can be powerfully resolved by a combination of some well-defined library constructs and JVM-internal optimizations. ObjectLayout is an example of what I mean by this, with StructuredArray specifically addressing the first of the three cases above, while fitting well into Java's object model and "playing nice" with all the existing code that knows how to deal with objects.
C# seems to be way ahead of Java here and already has struct support. In theory, with a JIT + controlled memory layout, should it be able to beat C++?
 
Any opinions on why we don't see JIT-enabled runtimes for native languages with explicit memory management?

My opinion: it's because (per the examples above) it's been hard to make them produce (in the current "real world") the same sorts of optimizations that managed-code and managed-data runtimes seem to be able to produce in their current real world.
D and Rust are modern languages that shed some of the legacy problems of C/C++, but they still produce AOT-compiled binaries.
 

Thanks,
Rajiv

Rajiv Kurian

Apr 2, 2014, 2:56:23 PM4/2/14
to mechanica...@googlegroups.com


On Wednesday, April 2, 2014 8:15:03 AM UTC-7, Peter Lawrey wrote:
The main advantage of writing small pieces of low-level Java is that it integrates with natural Java very well. It is still cross-platform and works with all the normal Java tools. If 90% of your project is going to be in Java, IMHO it makes sense to use low-level Java instead of C for the low-level coding.
Certainly. My point was that low-level Java is not written to gain the JIT benefits missing in C/C++, but for other, very valid reasons.


On 2 April 2014 03:40, Rajiv Kurian <geet...@gmail.com> wrote:


On Wednesday, April 2, 2014 1:32:24 AM UTC-7, Peter Lawrey wrote:
In spite of these advantages, why does Java or C# not beat C/C++ in all but a few benchmarks?

The advantage of C and C++ is that you have many ways of doing the same thing, and if you really know what you are doing you can heavily optimise the code. When you have a simple set of requirements which are well understood and don't change, C/C++ is very fast compared with Java. Benchmarks tend to be simple problems where someone with experience can spend an unrealistic amount of time optimising the code to death.

However, in the real world you often (but not always) have changing requirements, problems which are not as well understood as you would like, teams with mixed technical ability (it is useful to have team members who have strong non-technical skills), and access to open-source third-party libraries. As yet, there isn't a good benchmark where the code is written by someone with one year of experience, who is more creative than technical, in a limited time frame, for multiple platforms, using mostly third-party libraries.

Note: the same arguments can be applied to not using Java but instead using a language like Scala, Groovy or JRuby.  For my projects, Java is the right balance.

Any opinions on why we don't see JIT-enabled runtimes for native languages with explicit memory management?

I have libraries written in Java which use explicit memory management for very large data sets, i.e. GB to TB. Just because Java has memory management doesn't mean you have to use it for everything.
Good point, but did you really write this code in Java instead of C because of the potential JIT benefits? What has Java gotten you there w.r.t. C/C++, besides the joys of writing small parts of a modern C/C++ compiler ;)
On 2 April 2014 03:13, Rajiv Kurian <geet...@gmail.com> wrote:
I am trying to understand what kind of optimizations a JITed runtime like the JVM could provide over a C/C++ app compiled using, say, GCC's PGO. With PGO, inlining of virtual functions (the poster boy of JIT-superiority claims) should be possible. With PGO + -march=native, machine-specific optimizations should be possible too. Link-time optimization (the -flto flag) should also, in theory, allow for holistic optimization.

The one thing I can think of is that it is difficult to test an instrumented binary with "real data", especially for a long-running program like a server - for example, an app with hundreds of command-line flags. Also, if there is no typical profile, i.e. it varies by usage, then one might do more harm than good with PGO. One could in theory trust the JIT to work out the best optimizations given this kind of variance.

Finally, a sweeping, religious question full of assumptions ;) - in spite of these advantages, why does Java or C# not beat C/C++ in all but a few benchmarks (almost always benchmarks of virtual functions with a single implementation)? Any opinions on why we don't see JIT-enabled runtimes for native languages with explicit memory management?

Thanks,
Rajiv


Peter Lawrey

Apr 2, 2014, 3:59:47 PM4/2/14
to mechanica...@googlegroups.com
My responses are inline.


On 2 April 2014 13:18, Rajiv Kurian <geet...@gmail.com> wrote:
Thank you for that very detailed answer. I have a few follow up questions.
GCC, LLVM, Intel and Microsoft compilers for C/C++ have been around for a while now, and keep evolving in optimization, but we still do not see PGO inlining of virtual functions in library code. Whether it's inline caching (the changing of a virtual dispatch to a static one guarded by a type check), guarded inlining (the inlining of a virtual method body guarded by a type check), or unguarded, class-hierarchy-analysis (CHA) based versions of both, we just don't see much of this in the C/C++ world. The same can be said for constant propagation based on runtime-environment-discovered constants. This has multiple real-world reasons.
But PGO in G++ and MSVC does employ inline caching for your own virtual functions. Am I correct that your point is that they cannot inline virtual functions ONLY if they are in other libraries?

Say you have a dozen possible implementations; these implementations are not available at compile time, and possibly not available when the program starts. The JVM can inline those methods when they are used, un-inline them if the class is unloaded, and inline a new version when it gets updated. For C++ to inline virtual methods it would have to inline all possible methods, and this must limit further optimisation. In Java, if you inline a dozen methods it can be substantially slower than inlining one or two.


For example, in practice, C/C++ code predominantly uses dynamic libraries as a form of a "portability runtime", since few people write their own malloc or bsd socket code, get time of day, or find it practical to promise to only ever use the current binary version of libc. And many high performance machine features (e.g. alternative low latency NICs and network stacks, mallocs that improve over time) and ongoing bug fixes and security improvements (e.g. "oops, should probably fix that stack overflow vulnerability in libXYZ") use this fact to allow deployment-time injection of newly implemented functionality. This makes it hard, in practice, for a PGO to inline library code (or gain from other program+library combo optimization), whether it's libc, or lib socket, or my-cool-math library. Choosing to use purely static binaries can improve on this to some degree, but it is not commonly practiced (and for good reason).
Doesn't the JVM itself use dynamic libraries for some of the same functionality?

Some, but in general, as little as possible.
 
And if it doesn't then it suffers from the same problems - new JVM required to address some minor problem.

This does happen; there have been a lot of updates of Java 6 and 7 since they were first released (note the update numbers jump values for no apparent reason: 7 update 25 was followed by update 40 and then update 45).
 
It seems like the bulk of optimizations not available to C/C++ programs are the ones due to not having all the source code at the same time  

It also cannot know which methods will be called at runtime.  The compiler would need to know what configuration you are using as well, e.g. you might load values from a database which determine which implementation is called.  I.e. there are some things you cannot know until run time, and they can change over the life of one run of a program.
 
(i.e. dynamically linking in a bunch of .so files). I could just get the sources for my libraries and use whole program optimization to have the same kinds of optimizations, instead of linking in a dynamic library. I realize this isn't done a lot, but I don't see any reason besides closed source dependencies and general make/cmake script awfulness.

Simple example: you have a line which takes an option from a user, and that option determines which sub-class is used repeatedly in later code.  Your compiler cannot know which option you are going to choose.
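A minimal sketch of that example (names are hypothetical): the concrete implementation is picked from user input at startup, so an AOT compiler cannot know which body to inline, while a JIT can observe the actual receiver type at run time:

```java
import java.util.Map;

interface Codec {
    String encode(String s);
}

public class OptionSelected {
    // The Codec in use is decided by a runtime option, not at compile time.
    static final Map<String, Codec> CODECS = Map.of(
            "upper", s -> s.toUpperCase(),
            "lower", s -> s.toLowerCase());

    public static void main(String[] args) {
        String option = args.length > 0 ? args[0] : "upper";
        Codec codec = CODECS.get(option);
        // After warmup, the JIT can inline the one encode() actually chosen.
        System.out.println(codec.encode("Hello"));
    }
}
```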
 

Similarly, managed data allows us to do things like constant propagation of static final values set at initialization time. Whether it's code that determines (at runtime) whether to use a specific machine functionality, adapts to specific launch flags or early-program-discovered parameters, or esoteric things like code that optimizes for the current month (and restarts daily), managed runtimes are full of examples of static finals giving you speed that can't be easily found with PGO. Again, with exhaustive full-program static analysis it should be possible to prove the "constantness" of a memory location for a specific complete program, and if we combined that with Just Ahead Of Time, last-minute compilation+linking (a la LLVM), some cool stuff could be done and you could maybe get a similar benefit, but we just don't see this done in the real world. Yet.
Right, optimizations stemming from runtime discovered constants do seem like a great missed opportunity for even PGO aided AOT binaries. Why is managed data required for this to work though? A managed code environment alone should be able to do the same optimizations (from runtime discovered constants - eliminating dead branches, inlining the ones that can be taken etc)? Hot-swapping code seems to be the main capability required and not updating pointers to data. I've only seen a few projects use LLVM's JIT facilities and am yet to find successful ones. 

I would say that Java has limited options for memory management in code, e.g. no structs.  C# does have these and a managed heap, so it is possible.


On the other hand, if we compare what should be possible, then JITs still have a lot of untapped "possible". E.g. on the managed code (or managed code + managed data) side it should be possible for escape analysis to make Java just as good as C (or better even) at making temporary variables that can be localized to the stack stay there. This should allow (in theory) for seemingly heap-allocated objects to be just as fast as stack based variables are. With JITs it should be possible to "auto-discover" situations where C is currently forced to use off-stack allocation but Java could (speculatively?) keep the same temporary state on the stack for better speed. We just haven't evolved JITs to that level yet, and it's hard to tell if we will in the next few years.
What about the cost of running the JIT logic itself? I am purely speculating, but instrumented binaries in GCC are much slower (though sample-based profiling works fine) and I am guessing that JITs need to pay a similar price. Also, the JIT logic occupies icache. I don't know how expensive hot-swapping code is, but my feeble understanding is it requires safe points, which bring their own issues. These factors are more pronounced on less powerful devices like smart phones.

GCC has to optimise all your code, whereas the JIT can optimise just the code you actually use a lot.

Safe points bring many issues, but in general it's not something 95+% of Java developers are even aware of.
 
 

The one thing I can think of is it is difficult to test an instrumented binary with "real data", especially for a long running program like a server. For example an app with hundreds of command line flags. Also if there is no typical profile, i.e. it varies by usage, then one might do more harm than good with PGO. One could in theory trust the JIT to work out the best optimizations given this kind of variance.

Finally a sweeping, religious question full of assumptions ;) - In spite of these advantages why does Java or C# not beat C/C++ in all but a few benchmarks (almost always virtual functions with single implementation benchmarks)?

Putting aside the futility of using benchmarks, there are some strong performance-affecting things around memory layout and memory access patterns, for which current Java JDKs are seemingly still inherently inferior in speed compared to C variants. I say "seemingly" because I believe it has more to do with library semantics than it does with language semantics, and I think it will be changing very soon.

I can say from real experience, having two trading systems in the same organisation, that the C++ system was significantly slower than a Java system of similar complexity.  The C++ system also had many more developers.  This could be an exception rather than the rule, but using C++ is no guarantee of having a faster system in real applications.

C is inherently harder to code for than Java, and this often means you can spend more time optimising Java, e.g. replacing O(n) with O(log n) algos, than you do in C (unless you have a really strong team of C developers).



Specifically, C variants allow for the following three commonly used memory layout schemes that the Java ecosystem does not currently support with speed:

- An array of structs.

These are not that common IMHO, however when you need them, you really need them.
 
 
- A struct with a struct inside.
This is a common use case, but it rarely has much impact in Java programs.
 

- A struct with a variable size array at the end.
This is a very rare use case IMHO.
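For the first case (an array of structs), the usual Java-side workaround today is to flatten the fields into parallel primitive arrays so the data sits contiguously in memory. A hedged sketch, with hypothetical names:

```java
// Instead of Point[] (an array of pointers to separately allocated
// objects), keep one primitive array per field: the xs values are
// contiguous in memory, so a scan over them is cache-friendly.
public class PointArray {
    final double[] xs;
    final double[] ys;

    PointArray(int n) {
        xs = new double[n];
        ys = new double[n];
    }

    void set(int i, double x, double y) {
        xs[i] = x;
        ys[i] = y;
    }

    double sumX() {
        double s = 0;
        for (double x : xs) s += x; // sequential, predictable access pattern
        return s;
    }

    public static void main(String[] args) {
        PointArray pts = new PointArray(3);
        pts.set(0, 1.0, 2.0);
        pts.set(1, 3.0, 4.0);
        pts.set(2, 5.0, 6.0);
        System.out.println(pts.sumX());
    }
}
```

This is the layout idea that StructuredArray (mentioned below in the thread) aims to provide without giving up the object model.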
 
C# seems to be way ahead of Java here and already has struct support. In theory, with JIT + controlled memory layout, should it be able to beat C++?

In general C# is slower than Java for other reasons, the language is ahead in features, but the VM is behind. e.g. you can't tune the GC to the same degree.

On Wednesday, April 2, 2014 8:15:03 AM UTC-7, Peter Lawrey wrote:
The main advantage of writing small pieces of low level Java is that it integrates with natural Java very well.  It is still cross platform and works with all the normal Java tools.  If 90% of your project is going to be in Java, IMHO, it makes sense to use low level Java instead of C for the low level coding.
> Certainly. My point was that low level Java is not written to gain the JIT benefits missing in C/C++ but for other very valid reasons. 

I would say that cross platform is a benefit of the JIT.  Someone can use the library without compiling it or knowing if it was compiled on Windows, Linux or Mac, 32-bit or 64-bit.

Gil Tene

unread,
Apr 2, 2014, 4:42:51 PM4/2/14
to mechanica...@googlegroups.com
More discussion inline.


On Wednesday, April 2, 2014 11:18:31 AM UTC-7, Rajiv Kurian wrote:
Thank you for that very detailed answer. I have a few follow up questions.
...

On Wednesday, April 2, 2014 6:05:50 AM UTC-7, Gil Tene wrote:
On Wednesday, April 2, 2014 1:13:33 AM UTC-7, Rajiv Kurian wrote:
I am trying to understand what kind of optimizations a JITed runtime like the JVM could provide over a C/C++ app compiled using say GCC's PGO. With PGO inlining of virtual functions (the poster boy of JIT superiority claimers) should be possible. With PGO + march=native, machine specific optimizations should be possible too. Link time optimization (lto flag) should also in theory allow for holistic optimization.

"should be possible" is the key phrase here.

GCC, LLVM, Intel and Microsoft compilers for C/C++ have been around for a while now, and keep evolving in optimization, but we still do not see PGO inlining of virtual functions in library code. Whether it's inline caching (the changing of a virtual dispatch to a static one guarded by a type check), or guarded inlining (the inlining of a virtual method body guarded by a type check), or unguarded, class-hierarchy-analysis (CHA) based versions of both, we just don't see much of this in the C/C++ world. The same can be said for constant propagation based on runtime environment discovered constants. This has multiple real-world reasons.
But PGO on G++ and MSVC do employ inline caching for virtual functions of your own. Am I correct about your point being that they cannot inline virtual functions ONLY if they are in other libraries?

Yes, but "ONLY if they are in other libraries" is the same as saying "only for code you didn't write yourself". Like collections, for example. Or Boost. That's 95%+ of the code you use and run through in most non-trivial applications. A better way to look at it is to say that they can only inline virtual functions within your own code...
 
For example, in practice, C/C++ code predominantly uses dynamic libraries as a form of a "portability runtime", since few people write their own malloc or bsd socket code, get time of day, or find it practical to promise to only ever use the current binary version of libc. And many high performance machine features (e.g. alternative low latency NICs and network stacks, mallocs that improve over time) and ongoing bug fixes and security improvements (e.g. "oops, should probably fix that stack overflow vulnerability in libXYZ") use this fact to allow deployment-time injection of newly implemented functionality. This makes it hard, in practice, for a PGO to inline library code (or gain from other program+library combo optimization), whether it's libc, or lib socket, or my-cool-math library. Choosing to use purely static binaries can improve on this to some degree, but it is not commonly practiced (and for good reason).
Doesn't the JVM itself use dynamic libraries for some of the same functionality? And if it doesn't then it suffers from the same problems - new JVM required to address some minor problem.

All the JDK libraries, and the JVM's intrinsics (i.e. that other 95%+ of the code), are just as inline-able for a JIT as the code you write yourself. Java applications spend a minute % of their time in non-generated code.
 
It seems like the bulk of optimizations not available to C/C++ programs are the ones due to not having all the source code at the same time  (i.e. dynamically linking in a bunch of .so files). I could just get the sources for my libraries and use whole program optimization to have the same kinds of optimizations, instead of linking in a dynamic library. I realize this isn't done a lot, but I don't see any reason besides closed source dependencies and general make/cmake script awfulness.

There is a very real-world reason for this, and it's not script awfulness. It's because operationally people expect binaries to run, not to have to be recompiled ahead of every run on every system after every deployment or every patch. When you do that, you're pretty much in JIT-land already (with a lot more pain).

Try and talk your sysadmins into recompiling all the code on each system every time they apply a RHEL patch, and see what they say to that. See what your QA folks think about it, too. Alternatively, try to explain to them why you insist on carrying the bug in version X of system library Y forward with your statically compiled and linked code...
 
This will (in practice, not just theory) lead to inlining of virtual functions, constant folding, optimized code layout, dead block elimination etc. (using the proper cross-module optimization incantations: -flto, -fripa, -Wl, section-order-file etc.).

Agreed. That's what I see as the other side of "things that are not purely embedded." In embedded systems, you can afford to do this. Reality outside of embedded systems is simply that people don't do this. Why? I think it's because of the nightmares that follow. But the why doesn't really matter...
 

Similarly, managed data allows us to do do things like constant propagation of static final values set at initialization time. Whether it's code that determines (at runtime) whether to use a specific machine functionality, adapt to specific launch flags or early-program-discovered parameters, or esoteric things like code that optimizes for the current month (and restarts daily), managed runtimes are full of examples of static finals giving you speed that can't be easily found with PGO. Again, with exhustive full-program-static analysis it should be possible to prove the "constantness" of a memory location for a specific complete program, and if we combined that with Just Ahead Of Time last minute compilation+linking some cool stuff could be done (ala LLVM) you could maybe get a similar benefit, but we just don't see this done in the real world. Yet.
Right, optimizations stemming from runtime discovered constants do seem like a great missed opportunity for even PGO aided AOT binaries. Why is managed data required for this to work though? A managed code environment alone should be able to do the same optimizations (from runtime discovered constants - eliminating dead branches, inlining the ones that can be taken etc)? Hot-swapping code seems to be the main capability required and not updating pointers to data. I've only seen a few projects use LLVM's JIT facilities and am yet to find successful ones.

To start with, you need a way to describe the notion that a variable that is set once at runtime is never changed after being set that one time, and the expression of this concept is missing in many non-runtime languages. But even in cases where you can express this notion, the managed data part is what gives the runtime confidence that the code discipline around data access is actually going to adhere to the concept you expressed. In C/C++, you'll easily be able to get a pointer to such a "supposedly runtime constant" field and mess with it, and that's something you just can't do in a managed data environment.
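A small sketch of the kind of run-time constant described above (class and field names are hypothetical): the static final is set once from the environment at class-initialization time, after which the JIT is free to fold it and eliminate the dead branch:

```java
public class RuntimeConstant {
    // Set once at class-initialization time from a runtime property; the
    // JIT can then treat it as a constant and compile only the live branch.
    static final boolean FAST_PATH =
            !"slow".equals(System.getProperty("mode", "fast"));

    static int compute(int x) {
        if (FAST_PATH) {
            return x << 1;   // optimized code keeps only one of these paths
        } else {
            return x * 2;
        }
    }

    public static void main(String[] args) {
        System.out.println(compute(21));
    }
}
```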

Another way to demonstrate this need is this: in JITed managed code environments, the JIT will commonly find itself in a situation where it needs to compile code that refers to state that is not yet resolvable at compile time (e.g. a static final field of a class that has not yet been initialized). This can happen for example in this code:

class Foo {
  static final long startTime = System.nanoTime();

  void doFoo() {
    for (int i = 0; i < 1000000; i++) {
      if (i > 500000) {
        System.out.println("Foo start time " + Foo.startTime);
      } else {
        System.out.println("Doof start time " + Doof.startTime);
      }
    }
  }
}

class Doof {
  static final long startTime = System.nanoTime();
}
 

Where the code will get hot and compiled before Doof is ever initialized. So at the time of first compilation, the Doof.startTime data is not yet initialized, and cannot be propagated as a constant. It could be coded as a runtime field access instead (with the proper "make sure it's initialized first" guards), but most JITs basically deal with this by simply leaving a "trap" in the unresolved part. When the code first gets to the latter half of the if statement, the compiled code will deoptimize, force the initialization of the Doof class (if it had not already been initialized), and reoptimize with the now-initialized value, at which point the optimization can be applied. There is a bunch of "managed data" needed to make these transitions safe and to guard against access to uninitialized data (forcing initialization ahead of any such access).


On the other hand, if we compare what should be possible, then JITs still have a lot of untapped "possible". E.g. on the managed code (or managed code + managed data) side it should be possible for escape analysis to make Java just as good as C (or better even) at making temporary variables that can be localized to the stack stay there. This should allow (in theory) for seemingly heap-allocated objects to be just as fast as stack based variables are. With JITs it should be possible to "auto-discover" situations where C is currently forced to use off-stack allocation but Java could (speculatively?) keep the same temporary state on the stack for better speed. We just haven't evolved JITs to that level yet, and it's hard to tell if we will in the next few years.
What about the cost of running the JIT logic itself? I am purely speculating, but instrumented binaries in GCC are much slower (though sample-based profiling works fine) and I am guessing that JITs need to pay a similar price.

Most JITs only instrument during warmup, with the fully optimized code either not instrumenting at all, or instrumenting very lightly (e.g. via sampling). So the runtime cost for the optimized code is effectively zero.
 
Also the JIT logic occupies  icache.

Once the optimized code is hot, there is zero cache footprint. There is nothing left in the code other than the optimized result.
 
I don't know how expensive hot-swapping code is, but my feeble understanding is it requires safe points which bring their own issues. These factors are more pronounced in less powerful devices like smart phones etc.

Safepointing is an interesting issue. 

Safepoint polling itself is extremely cheap. E.g. HotSpot uses a blind load from a constant address, and relies on page protection to trigger an exception if the safepoint needs to be taken. And safepoints are generally only polled for at loop back edges and method entry or exit.

It's the fact that the machine state can be modified at the safepoint that reduces the ability to apply some optimizations across loop back edge safepoints. Loop unrolling helps alleviate that, but safepoints are certainly optimization-opportunity-reducers to some degree.

There is a whole bunch of cool art around figuring out how to rely on the limited nature of the state change in a safepoint to allow safepoint-crossing optimizations. E.g. while safepoints may modify the bitwise representation of a reference, they do not modify its logical meaning: a null cannot become a non-null at a safepoint, and vice versa. And for most practical GC algorithms, two references that are equal to each other will remain equal to each other after a safepoint, and the same holds for two references that are not equal to each other.
 
...

But as I note, this is not inherent to a language, to managed code, or to managed data, and I believe it can be powerfully resolved by a combination of some well defined library constructs and JVM-internal optimizations. ObjectLayout is an example of what I mean by this, with StructuredArray specifically addressing the first of the three cases above while fitting well into Java's object model and "playing nice" with all the existing code that knows how to deal with objects.
C# seems to be way ahead of Java here and already has struct support. In theory, with JIT + controlled memory layout, should it be able to beat C++?

Yes. And the same can be said for stack variables in C#. (StructuredArray is a way of avoiding the need for structs, and staying with regular objects, but that's beside the point).

In theory, practice should meet theory, and C# should be able to beat C++. I'm sure there are examples where it does, and examples where it doesn't...

Rajiv Kurian

unread,
Apr 2, 2014, 8:51:51 PM4/2/14
to mechanica...@googlegroups.com
Thank you for the very detailed follow up Gil. I've got a few more questions to ask :)


On Wednesday, April 2, 2014 1:42:51 PM UTC-7, Gil Tene wrote:
More discussion inline.

On Wednesday, April 2, 2014 11:18:31 AM UTC-7, Rajiv Kurian wrote:
Thank you for that very detailed answer. I have a few follow up questions.
...
On Wednesday, April 2, 2014 6:05:50 AM UTC-7, Gil Tene wrote:
On Wednesday, April 2, 2014 1:13:33 AM UTC-7, Rajiv Kurian wrote:
I am trying to understand what kind of optimizations a JITed runtime like the JVM could provide over a C/C++ app compiled using say GCC's PGO. With PGO inlining of virtual functions (the poster boy of JIT superiority claimers) should be possible. With PGO + march=native, machine specific optimizations should be possible too. Link time optimization (lto flag) should also in theory allow for holistic optimization.

"should be possible" is the key phrase here.

GCC, LLVM, Intel and Microsoft compilers for C/C++ have been around for a while now, and keep evolving in optimization, but we still do not see PGO inlining of virtual functions in library code. Whether it's inline caching (the changing of a virtual dispatch to a static one guarded by a type check), or guarded inlining (the inlining of a virtual method body guarded by a type check), or unguarded, class-hierarchy-analysis (CHA) based versions of both, we just don't see much of this in the C/C++ world. The same can be said for constant propagation based on runtime environment discovered constants. This has multiple real-world reasons.
But PGO on G++ and MSVC do employ inline caching for virtual functions of your own. Am I correct about your point being that they cannot inline virtual functions ONLY if they are in other libraries?

Yes, but "ONLY if they are in other libraries" is the same as saying "only for code you didn't write yourself". Like collections, for example. Or Boost. That's 95%+ of the code you use and run through in most non-trivial applications. A better way to look at it is to say that they can only inline virtual functions within your own code...
 
For example, in practice, C/C++ code predominantly uses dynamic libraries as a form of a "portability runtime", since few people write their own malloc or bsd socket code, get time of day, or find it practical to promise to only ever use the current binary version of libc. And many high performance machine features (e.g. alternative low latency NICs and network stacks, mallocs that improve over time) and ongoing bug fixes and security improvements (e.g. "oops, should probably fix that stack overflow vulnerability in libXYZ") use this fact to allow deployment-time injection of newly implemented functionality. This makes it hard, in practice, for a PGO to inline library code (or gain from other program+library combo optimization), whether it's libc, or lib socket, or my-cool-math library. Choosing to use purely static binaries can improve on this to some degree, but it is not commonly practiced (and for good reason).
Doesn't the JVM itself use dynamic libraries for some of the same functionality? And if it doesn't then it suffers from the same problems - new JVM required to address some minor problem.

All the JDK libraries, and the JVM's intrinsics (i.e. that other 95%+ of the code), are just as inline-able for a JIT as the code you write yourself. Java applications spend a minute % of their time in non-generated code.
If the JVM is dependent on other native libraries written by others, then how is this possible? For example, if a JVM uses libgmp for its arbitrary precision arithmetic, then it needs to be linked, just like in a C++ app, and it couldn't be inlined with user code. So from what I understand you are just saying that there are almost no dynamically linked libraries used in the JVM (and the JDK is pure Java). If so, then this benefit is conferred to any C/C++ program that doesn't dynamically link to any library. Of course I understand your point that practically it's tougher to do so in C/C++ because you don't write your own malloc using brk/mmap.
 
It seems like the bulk of optimizations not available to C/C++ programs are the ones due to not having all the source code at the same time  (i.e. dynamically linking in a bunch of .so files). I could just get the sources for my libraries and use whole program optimization to have the same kinds of optimizations, instead of linking in a dynamic library. I realize this isn't done a lot, but I don't see any reason besides closed source dependencies and general make/cmake script awfulness.

There is a very real-world reason for this, and it's not script awfulness. It's because operationally people expect binaries to run, not to have to be recompiled ahead of every run on every system after every deployment or every patch. When you do that, you're pretty much in JIT-land already (with a lot more pain).

Try and talk your sysadmins into recompiling all the code on each system every time they apply a RHEL patch, and see what they say to that. See what your QA folks think about it, too. Alternatively, try to explain to them why you insist on carrying the bug in version X of system library Y forward with your statically compiled and linked code...
Hmm. I can see that for client code, but for server code that one is running in their own environment, recompilation isn't that big of an issue. A lot of libraries are already header-only C++ template code, so a re-compilation is already required for the tiniest of changes. In my experience the biggest reason people do not want to recompile on slight changes in a server program is the long, long, long time it takes to compile a decently sized C++ project. So as a developer I'd rather link to libs that don't change much than recompile every time, because I cannot stand the hour-long compile times. This is mainly down to header legacy and C++ syntax complexity though, and more modern languages (Rust crates etc.) with module support don't have to deal with this problem. Even C++ is supposed to get module support, which will make this a non-issue. A lot of open source software written in C/C++ (Nginx, ImageMagick etc.) already recommends compilation on the destination machine instead of using conservatively optimized universal binaries, and since such software is not updated very regularly, a proper modular compilation (one missing from C++ at the moment) wouldn't be the worst thing. If whole-program compilations didn't take the absurd amounts of time they take now, it would be a very valid choice for developers looking to trade compilation time for better code.
Also I don't see the point in your comparison to statically compiled and linked code. Java must suffer from the same problem. If your Java code uses a collections library that has a bug, you need to ship a new version of your code, using the updated collections library, for every program that uses it, instead of shipping a single dynamically linked one. The JVM cannot just escape the cost of complete compilation for these whole-program optimizations we are talking about. It's just that it does it in an amortized way, but pays in delayed startup time and complicated warm-up procedures.
 
This will (in practice, not just theory) lead to inlining of virtual functions, constant folding, optimized code layout, dead block elimination etc. (using the proper cross-module optimization incantations: -flto, -fripa, -Wl, section-order-file etc.).

Agreed. That's what I see as the other side of "things that are not purely embedded." In embedded systems, you can afford to do this. Reality outside of embedded systems is simply that people don't do this. Why? I think it's because of the nightmares that follow. But the why doesn't really matter...
 

Similarly, managed data allows us to do things like constant propagation of static final values set at initialization time. Whether it's code that determines (at runtime) whether to use a specific machine functionality, adapts to specific launch flags or early-program-discovered parameters, or esoteric things like code that optimizes for the current month (and restarts daily), managed runtimes are full of examples of static finals giving you speed that can't be easily found with PGO. Again, with exhaustive full-program static analysis it should be possible to prove the "constantness" of a memory location for a specific complete program, and if we combined that with Just Ahead Of Time, last-minute compilation+linking (a la LLVM), some cool stuff could be done and you could maybe get a similar benefit, but we just don't see this done in the real world. Yet.
Right, optimizations stemming from runtime discovered constants do seem like a great missed opportunity for even PGO aided AOT binaries. Why is managed data required for this to work though? A managed code environment alone should be able to do the same optimizations (from runtime discovered constants - eliminating dead branches, inlining the ones that can be taken etc)? Hot-swapping code seems to be the main capability required and not updating pointers to data. I've only seen a few projects use LLVM's JIT facilities and am yet to find successful ones.

To start with, you need a way to describe the notion that a variable that is set once at runtime is never changed after being set that one time, and the expression of this concept is missing in many non-runtime languages. But even in cases where you can express this notion, the managed data part is what gives the runtime confidence that the code discipline around data access is actually going to adhere to the concept you expressed. In C/C++, you'll easily be able to get a pointer to such a "supposedly runtime constant" field and mess with it, and that's something you just can't do in a managed data environment.
A very real issue with C++, but not with a language like Rust, which has support for immutable constructs and statically ensures the discipline you are talking about.
 

Another way to demonstrate this need is this: in JITed managed code environments, the JIT will commonly find itself in a situation where it needs to compile code that refers to state that is not yet resolvable at compile time (e.g. a static final field of a class that has not yet been initialized). This can happen for example in this code:

class Foo {
  static final long startTime = System.nanoTime();

  void doFoo() {
    for (int i = 0; i < 1000000; i++) {
      if (i > 500000) {
        System.out.println("Foo start time " + Foo.startTime);
      } else {
        System.out.println("Doof start time " + Doof.startTime);
      }
    }
  }
}

class Doof {
  static final long startTime = System.nanoTime();
}
 

Where the code will get hot and compiled before Doof is ever initialized. So at the time of first compilation, the Doof.startTime data is not yet initialized, and cannot be propagated as a constant. It could be coded as a runtime field access instead (with the proper "make sure it's initialized first" guards), but most JITs basically deal with this by simply leaving a "trap" in the unresolved part. When the code first gets to the latter half of the if statement, the compiled code will deoptimize, force the initialization of the Doof class (if it had not already been initialized), and reoptimize with the now-initialized value, at which point the optimization can be applied. There is a bunch of "managed data" needed to make these transitions safe and to guard against access to uninitialized data (forcing initialization ahead of any such access).
I see, so you are calling stuff like "has X been initialized?" managed data, since the runtime needs to keep track of it? If I am correct, C++11's static keyword is similar to Java's static keyword and will do thread-safe initialization etc. I thought you meant automatic memory management.


On the other hand, if we compare what should be possible, then JITs still have a lot of untapped "possible". E.g. on the managed code (or managed code + managed data) side it should be possible for escape analysis to make Java just as good as C (or better even) at making temporary variables that can be localized to the stack stay there. This should allow (in theory) for seemingly heap-allocated objects to be just as fast as stack-based variables are. With JITs it should be possible to "auto-discover" situations where C is currently forced to use off-stack allocation but Java could (speculatively?) keep the same temporary state on the stack for better speed. We just haven't evolved JITs to that level yet, and it's hard to tell if we will in the next few years.
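To make the escape-analysis point concrete, here is a minimal self-contained sketch (the class and method names are mine, not from the thread): the Point below never escapes its frame, so an escape-analysis-capable JIT (e.g. HotSpot, where -XX:+DoEscapeAnalysis is on by default) can scalar-replace it and skip the heap allocation entirely.

```java
// Illustration only: 'Point' never escapes sum(), so the JIT can keep
// x and y in registers/on the stack instead of allocating on the heap.
public class EscapeDemo {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static long sum(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point(i, i + 1); // candidate for scalar replacement
            total += p.x + p.y;            // p never leaves this frame
        }
        return total;
    }

    public static void main(String[] args) {
        // sum of (2i + 1) for i in [0, n) is n^2
        System.out.println(sum(1_000_000));
    }
}
```

Whether the allocation is actually elided can be checked with -XX:+PrintGCDetails (allocation rate) or -XX:+PrintEliminateAllocations on a debug build; the source shape is identical either way, which is exactly the point.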
What about the cost of running the JIT logic itself? I am purely speculating, but instrumented binaries in GCC are much slower (though sample-based profiling works fine) and I am guessing that JITs need to pay a similar price.

Most JITs only instrument during warmup, with the fully optimized code either not instrumenting at all, or instrumenting very lightly (e.g. via sampling). So the runtime cost for the optimized code is effectively zero.
 
Also the JIT logic occupies icache.

Once the optimized code is hot, there is zero cache footprint. There is nothing left in the code other than the optimized result.
Where is the code to analyze samples, run heuristics, de-optimize and re-optimize, then? A heavy duty compiler is running with your application after all, and it will occupy icache, right? If it's running on the plenty of "free" cores that we have right now, then this is a non-negligible price to pay depending on your environment and application, just like GC threads running in the background don't have zero cost. 
I don't know how expensive hot-swapping code is, but my feeble understanding is that it requires safepoints, which bring their own issues. These factors are more pronounced in less powerful devices like smart phones etc.

Safepointing is an interesting issue. 

Safepoint polling itself is extremely cheap. E.g. HotSpot uses a blind load from a constant address, and relies on page protection to trigger an exception if the safepoint needs to be taken. And safepoints are generally only polled for at loop back edges and method entry or exit. 

It's the fact that the machine state can be modified at the safepoint that reduces the ability to apply some optimizations across loop back edge safepoints. Loop unrolling helps alleviate that, but safepoints are certainly optimization-opportunity-reducers to some degree.

There is a whole bunch of cool art around figuring out how to rely on the limited nature of the state change in a safepoint to allow safepoint-crossing optimizations. E.g. while safepoints may modify the bitwise representation of references, they do not modify their logical meaning: a null cannot become a non-null at a safepoint, and vice versa. And for most practical GC algorithms, two references that are equal to each other will remain equal after a safepoint, and the same holds for two references that are not equal to each other.
Thank you for the insight into safepoints. Besides the cross-safepoint optimization opportunities that are lost, isn't there unpredictable stalling when trying to bring different threads to a safepoint?
 
...
But as I note, this is not inherent to a language, to managed code, or to managed data, and I believe it can be powerfully resolved by a combination of some well defined library constructs and JVM-internal optimizations. ObjectLayout is an example of what I mean by this, with StructuredArray specifically addressing the first of the three cases above while fitting well into Java's object model and "playing nice" with all the existing code that knows how to deal with objects.
C# seems to be way ahead of Java here and already has struct support. In theory, with JIT + controlled memory layout, should it be able to beat C++?

Yes. And the same can be said for stack variables in C#. (StructuredArray is a way of avoiding the need for structs, and staying with regular objects, but that's beside the point).

In theory, practice should meet theory, and C# should be able to beat C++. I'm sure there are examples where it does, and examples where it doesn't...
:) This begs a question, and I promise that it is out of curiosity and not to be a dick. Why do you not use your own bootstrapped JVM + JDK to write future versions of your JVM (I assume you use C/C++)? Is it because of the lack of the struct features you talked about (all available through Unsafe hacking, but uglier than standard C or C++, IMHO)? Is it because of the lack of availability of many instructions (SIMD, PAUSE) required for high performance applications? What would need to change with Java or other managed languages for you to use them for Zing or a project of comparable needs?

Sanjoy Das

unread,
Apr 3, 2014, 3:02:50 AM4/3/14
to mechanica...@googlegroups.com
(Hope I'm not racing anyone to a response here.)

> I see, so you are calling stuff like "has X been initialized?" managed data
> since the runtime needs to keep track of it? If I am correct C++11 already
> has the static keyword which is similar to the Java static keyword and will
> do threadsafe initialization etc. I thought you meant automatic memory
> management.

Another example illustrating the same concept is this:

class Foo {
static final long startTime = System.nanoTime();

void doFoo() {
for (int i = 0; i < 1000000; i++) {
long v = 0;
if (i < 500000) {
v = Foo.startTime;
} else {
v = Doof.startTime;
}
System.out.println(v * 32);
}
}
}

class Doof {
static final long startTime = System.nanoTime();
}

The first time doFoo is optimized, a JIT will compile it to

void doFoo() {
for (int i = 0; i < 1000000; i++) {
long v = 0;
if (i < 500000) {
// only assignment to v
v = Foo.startTime; // constant, say 10
} else {
trapHenceUnreachable();
}
System.out.println(v * 32); // folds to 320
}
}

and thus will be able to constant fold the v * 32 (since v is final).

Once i >= 500000, the function deoptimizes and the second time it
JIT compiles, we get

void doFoo() {
for (int i = 0; i < 1000000; i++) {
long v = 0;
if (i < 500000) {
trapHenceUnreachable(); // also statically unreachable in this case
} else {
// only assignment to v
v = Doof.startTime; // constant, say 20
}
System.out.println(v * 32); // folds to 640
}
}

and we get the constant folding for (v * 32) again. This is something
a static compiler cannot do in general since it cannot know the values
of Doof.startTime and Foo.startTime (static or not) at compile time.
Even the JIT doesn't know the value of Doof.startTime when compiling
doFoo the first time, which is what I assume Gil meant
by "in JITed managed code environments, the JIT will commonly find
itself in a situation where it needs to compile code that refers to
state that is not yet resolvable at compile time".

To actually be able to take the trap and do the deoptimization, you
are required to maintain some metadata relating the actual machine state at that
trap site to what the abstract machine state (the java expression
stack, locals etc.) would have been. That is the managed data you need.

-- Sanjoy

Martin Grajcar

unread,
Apr 3, 2014, 3:32:50 AM4/3/14
to mechanica...@googlegroups.com
I took this example to the extreme and unfortunately, the JVM is not as smart as it could be. This loop

    for (int i=0; i<LIMIT; ++i) {
        if (i < LIMIT/2) {
            result += A.x;
        } else {
            result += B.x;
        }
    }

could be optimized to something like

    result += LIMIT/2 * A.x + LIMIT/2 * B.x;

but it isn't. Maybe because of it making too little sense?
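For reference, the equivalence of the two forms is easy to check; here is a self-contained sketch, where A.x, B.x, and LIMIT are stand-ins for the fields in Martin's benchmark (the names and values below are mine):

```java
// Checks that the naive loop and the folded form agree.
// AX/BX stand in for A.x and B.x; LIMIT is even so LIMIT/2 divides cleanly.
public class FoldDemo {
    static final int LIMIT = 1_000_000;
    static final long AX = 3, BX = 7; // stand-ins for A.x and B.x

    static long naive() {
        long result = 0;
        for (int i = 0; i < LIMIT; i++) {
            if (i < LIMIT / 2) result += AX; else result += BX;
        }
        return result;
    }

    static long folded() {
        // The closed form the JIT could (but doesn't) derive:
        return (long) (LIMIT / 2) * AX + (long) (LIMIT / 2) * BX;
    }

    public static void main(String[] args) {
        System.out.println(naive() == folded()); // prints true
    }
}
```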

My benchmark and results:

Gil Tene

unread,
Apr 3, 2014, 3:52:04 AM4/3/14
to mechanica...@googlegroups.com
On Wednesday, April 2, 2014 5:51:51 PM UTC-7, Rajiv Kurian wrote:
Thank you for the very detailed follow up Gil. I've got a few more questions to ask :)

More inline. It's fun to have this kind of back and forth with non-trolls... 
 
On Wednesday, April 2, 2014 1:42:51 PM UTC-7, Gil Tene wrote:
More discussion inline.

On Wednesday, April 2, 2014 11:18:31 AM UTC-7, Rajiv Kurian wrote:
Thank you for that very detailed answer. I have a few follow up questions.
...
On Wednesday, April 2, 2014 6:05:50 AM UTC-7, Gil Tene wrote:
On Wednesday, April 2, 2014 1:13:33 AM UTC-7, Rajiv Kurian wrote:
I am trying to understand what kind of optimizations a JITed runtime like the JVM could provide over a C/C++ app compiled using say GCC's PGO. With PGO inlining of virtual functions (the poster boy of JIT superiority claimers) should be possible. With PGO + march=native, machine specific optimizations should be possible too. Link time optimization (lto flag) should also in theory allow for holistic optimization.

"should be possible" is the key phrase here.

GCC, LLVM, Intel and Microsoft compilers for C/C++ have been around for a while now, and keep evolving in optimization, but we still do not see PGO inlining of virtual functions in library code. Whether it's inline caching (the changing of a virtual dispatch to a static one guarded by a type check), or guarded inlining (the inlining of a virtual method body guarded by a type check), or unguarded, class-hierarchy-analysis (CHA) based versions of both, we just don't see much of this in the C/C++ world. The same can be said for constant propagation based on runtime environment discovered constants. This has multiple real-world reasons.
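To illustrate what guarded inlining means in practice, here is a hand-written Java sketch (the class names are invented, not from any real library): the "compiled" method spells out the guard-plus-inlined-body shape a JIT would emit at a monomorphic call site, with the slow path standing in for deoptimization.

```java
// What guarded inlining conceptually does to a hot virtual call site when
// profiling shows a single receiver type. The transform is written out by
// hand here; a JIT emits the equivalent machine code.
public class GuardDemo {
    interface Shape { long area(); }

    static final class Square implements Shape {
        final long side;
        Square(long side) { this.side = side; }
        public long area() { return side * side; }
    }

    // Original code: a plain virtual dispatch.
    static long interpreted(Shape s) { return s.area(); }

    // Hand-written equivalent of the JIT's guarded inline: a cheap type
    // check, then the inlined Square.area() body on the fast path.
    static long compiled(Shape s) {
        if (s.getClass() == Square.class) {
            Square sq = (Square) s;
            return sq.side * sq.side;  // inlined body, no dispatch
        }
        return s.area();               // slow path (a JIT would deopt here)
    }

    public static void main(String[] args) {
        Shape s = new Square(9);
        System.out.println(interpreted(s) == compiled(s)); // prints true
    }
}
```

The guard is what makes the optimization speculative rather than unsound: if a second Shape implementation ever shows up, the slow path fires and the runtime can recompile.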
But PGO on G++ and MSVC do employ inline caching for virtual functions of your own. Am I correct about your point being that they cannot inline virtual functions ONLY if they are in other libraries?

Yes, but "ONLY" if they are in other libraries is the same as saying "only" for code you didn't write yourself. Like collections, for example. Or Boost. That's 95%+ of the code you use and run through in most non-trivial applications. A better way to look at it is to say that they can only inline virtual functions within your own code...
 
For example, in practice, C/C++ code predominantly uses dynamic libraries as a form of a "portability runtime", since few people write their own malloc or bsd socket code, get time of day, or find it practical to promise to only ever use the current binary version of libc. And many high performance machine features (e.g. alternative low latency NICs and network stacks, mallocs that improve over time) and ongoing bug fixes and security improvements (e.g. "oops, should probably fix that stack overflow vulnerability in libXYZ") use this fact to allow deployment-time injection of newly implemented functionality. This makes it hard, in practice, for a PGO to inline library code (or gain from other program+library combo optimization), whether it's libc, or lib socket, or my-cool-math library. Choosing to use purely static binaries can improve on this to some degree, but it is not commonly practiced (and for good reason).
Doesn't the JVM itself use dynamic libraries for some of the same functionality? And if it doesn't then it suffers from the same problems - new JVM required to address some minor problem.

All the JDK libraries, and the JVM's intrinsics (i.e. that other 95%+ of the code), are just as inline-able for a JIT as the code you write yourself. Java applications spend a minute % of time in non-generated code.
If the JVM is dependent on other native libraries written by others, then how is this possible? For example, if a JVM uses libgmp for its arbitrary precision arithmetic, then it needs to be linked, just like in a C++ app, and it couldn't be inlined with user code. So from what I understand, you are just saying that there are almost no dynamically linked libraries used in the JVM (and the JDK is pure Java).

To be specific, I'm saying that in Java applications running on JDKs and the myriad of Java libraries and frameworks out there the vast majority of CPU time is spent in generated code that gets to leverage JIT optimizations. Only tiny percentage is spent in native library code, and only out of necessity. The JDK would never use a native dynamic binary for something it wants performance optimizations applied to. You'll see native code used for bulk operations, where it is locally and statically optimized, but opaque to optimizations coming in from the calling code. It will also use native libraries for interfacing to system functions (like i/o). But for anything you want speed for, especially when each individual operation is fairly light, you'll use byte codes and intrinsics, because those can have powerful optimization applied.
 
If so then this benefit is conferred to any C/C++ program that doesn't dynamically link to any library. Of course I understand your point that practically it's tougher to do so in C/C++ because you don't write your own malloc using brk/mmap.

Right, but the difference goes far beyond the basics (like malloc) and extends to data structures, collections, parsers, transformations, and pretty much any piece of code you don't write on your own. And Java (as an ecosystem) has tons of code that you can use instead of writing it on your own. In Java, all those system and third party libraries get to be fully optimized along with your code, where with native libraries you don't get those optimizations unless you statically link and use PGO (which, for whatever reason, people don't seem to be doing).
 
It seems like the bulk of optimizations not available to C/C++ programs are the ones due to not having all the source code at the same time  (i.e. dynamically linking in a bunch of .so files). I could just get the sources for my libraries and use whole program optimization to have the same kinds of optimizations, instead of linking in a dynamic library. I realize this isn't done a lot, but I don't see any reason besides closed source dependencies and general make/cmake script awfulness.

There is a very real-world reason for this, and it's not script awfulness. It's because, operationally, people expect binaries to run. Not to have to be recompiled ahead of every run on every system after every deployment or every patch. When you do that, you're pretty much in JIT-land already (with a lot more pain).

Try and talk your sysadmins into recompiling all the code on each system every time they apply a RHEL patch, and see what they say to that. See what your QA folks think about it, too. Alternatively, try to explain to them why you insist on carrying the bug in version X of system library Y forward with your statically compiled and linked code...
Hmm. I can see that for client code, but for server code that one is running in their own environment, recompilation isn't that big of an issue. A lot of libraries already are header-only C++ template code, so a recompilation is already required for the tiniest of changes. In my experience, the biggest reason people do not want to recompile on slight changes in a server program is the long, long time it takes to compile a decently sized C++ project. So as a developer I'd rather link to libs that don't change that much than recompile every time, because I cannot stand the hour-long compile times. This is mainly down to header legacy and C++ syntax complexity though, and more modern languages (Rust crates etc.) with module support don't have to deal with this problem. Even C++ is supposed to get module support, which will make this a non-issue. A lot of open source software written in C/C++ (Nginx, ImageMagick etc.) already recommends compilation on the destination machine instead of using conservatively optimized universal binaries, and since such software isn't updated very regularly, a proper modular compilation (one missing from C++ at the moment) wouldn't be the worst thing. If whole-program compilations didn't take the absurd amounts of time they take now, it would be a very valid choice for developers looking to trade compilation time for better code.

We have this saying where I grew up that roughly translates to: "Yeh, and if my grandmother had wheels, she'd be a bicycle..."

Build-time (as in ahead of QA time and distribution time) PGO is a real thing. Deployment-time PGO is not. Unless you mean JIT in runtimes (which is basically deployment time PGO per run).
 
 Also I don't see the point in your comparison to statically compiled and linked code. Java must suffer from the same problem. If your Java code uses a collections library that has a bug, you need to ship a new version of your code, using the updated collections library for every program that uses it, instead of shipping a single one that is dynamically linked.

If your java collections have a bug, you change the JDK on the system to the next version of the JDK that has a fix for the collections package, and the bug goes away. Same as it would if you updated the dynamically linked native collections library you may have installed.

The difference is that with Java (and other managed runtimes), the new, fixed library gets just as intimately optimized with your code as the old one would have (think inlining and constant propagation across the library call line, for example), where a replaceable native dynamic library cannot be optimized along with your statically compiled code that calls it.
 
The JVM cannot just escape the cost of complete compilation for these whole program optimizations we are talking about. It's just that it does it in an amortized way, but pays in delayed startup time and complicated warm up procedures.

I'm not arguing that the cost of optimization is reduced. I'm arguing that the resulting optimization is better. The cost of JIT optimization is obviously bigger (you pay it again for every run instead of once per produced binary), but at least for server side code that runs for more than a few minutes, that tradeoff is very much worth it.

This will (in practice, not just theory) lead to inlining of virtual functions, constant folding, optimized code layout, dead block elimination etc. (using the proper cross-module optimization incantations: -flto, -fripa, linker section-order files, etc.).

Agreed. that's what I see as the other side of "things that are not purely embedded." In embedded systems, you can afford to do this. Reality outside of embedded systems is simply that people don't do this. Why? I think it's because of the nightmares that follow. But why doesn't really matter... 
 

Similarly, managed data allows us to do things like constant propagation of static final values set at initialization time. Whether it's code that determines (at runtime) whether to use a specific machine functionality, adapts to specific launch flags or early-program-discovered parameters, or esoteric things like code that optimizes for the current month (and restarts daily), managed runtimes are full of examples of static finals giving you speed that can't be easily found with PGO. Again, with exhaustive full-program static analysis it should be possible to prove the "constantness" of a memory location for a specific complete program, and if we combined that with just-ahead-of-time, last-minute compilation+linking (a la LLVM), you could maybe get a similar benefit, but we just don't see this done in the real world. Yet.
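A small sketch of the runtime-discovered-constant case (the property name and class below are invented for illustration): the flag is fixed at class initialization from a launch-time setting, yet once JIT-compiled the dead branch folds away exactly as if the flag had been a compile-time constant, something an AOT compiler cannot do for a value only known at launch.

```java
// 'USE_FAST_PATH' is fixed once at class initialization from a launch-time
// property; after JIT compilation the branch in work() folds to one arm.
public class FlagDemo {
    static final boolean USE_FAST_PATH =
            Boolean.parseBoolean(System.getProperty("demo.fastPath", "true"));

    static long work(long x) {
        if (USE_FAST_PATH) {
            return x << 1;          // the JIT keeps only this arm...
        } else {
            return slowMultiply(x); // ...and deletes this one entirely
        }
    }

    // Deliberately roundabout x * 2, standing in for a "slow path".
    static long slowMultiply(long x) {
        long r = 0;
        for (int i = 0; i < 2; i++) r += x;
        return r;
    }

    public static void main(String[] args) {
        System.out.println(work(21)); // prints 42 with the default flag
    }
}
```

Run with -Ddemo.fastPath=false and the other arm becomes the constant one; either way, the branch is gone from the compiled code.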
Right, optimizations stemming from runtime discovered constants do seem like a great missed opportunity for even PGO aided AOT binaries. Why is managed data required for this to work though? A managed code environment alone should be able to do the same optimizations (from runtime discovered constants - eliminating dead branches, inlining the ones that can be taken etc)? Hot-swapping code seems to be the main capability required and not updating pointers to data. I've only seen a few projects use LLVM's JIT facilities and am yet to find successful ones.

To start with, you need a way to describe the notion that a variable that is set once at runtime is never changed after being set that one time, and the expression of this concept is missing in many non-runtime languages. But even in cases where you can express this notion, the managed data part is what gives the runtime confidence that the code discipline around data access is actually going to adhere to the concept you expressed. In C/C++, you'll easily be able to get a pointer to such a "supposedly runtime constant" field and mess with it, and that's something you just can't do in a managed data environment.
A very real issue with C++, but not with a language like Rust, which has support for immutable constructs and statically enforces the discipline you are talking about.
 

Another way to demonstrate this need is this: in JITed managed code environments, the JIT will commonly find itself in a situation where it needs to compile code that refers to state that is not yet resolvable at compile time (e.g. a static final field of a class that has not yet been initialized). This can happen for example in this code:

class Foo {
    static final long startTime = System.nanoTime();

    void doFoo() {
        for (int i = 0; i < 1000000; i++) {
            if (i < 500000) {
                System.out.println("Foo start time " + Foo.startTime);
            } else {
                System.out.println("Doof start time " + Doof.startTime);
            }
        }
    }
}

class Doof {
    static final long startTime = System.nanoTime();
}
 

Where the code will get hot and compiled before Doof is ever initialized. So at the time of first compilation, the Doof.startTime data is not yet initialized, and cannot be propagated as a constant. It could be coded as a runtime field access instead (with the proper "make sure it's initialized first" guards), but most JITs basically deal with this by simply leaving a "trap" in the unresolved part. When the code first gets to the latter half of the if statement, the compiled code will deoptimize, force the initialization of the Doof class (if it had not already been initialized), and reoptimize with the now-initialized value, at which point the optimization can be applied. There is a bunch of "managed data" needed to make these transitions safe and to guard against access to uninitialized data (forcing initialization ahead of any such access).
 
I see, so you are calling stuff like "has X been initialized?" managed data, since the runtime needs to keep track of it? If I am correct, C++11's static keyword is similar to Java's static keyword and will do thread-safe initialization etc. I thought you meant automatic memory management.

Yes. Managed data is not the same as automatic memory management (AMM). It's hard(er?) to do AMM without the "managed data" quality, but there are other things that managed data allows you to do, beyond AMM. You lose the managed data quality when your code can do arbitrary pointer manipulation and unchecked/unanalyzable memory access through arbitrarily manipulated pointers. You lose the managed code quality when your code can create arbitrary execution paths that you can't analyze (like being able to call a function through an arbitrarily computed pointer). The above example relies on being able to prove that certain accesses cannot happen to a certain memory location, which [to me] is a managed data (not managed code) property. 
 


On the other hand, if we compare what should be possible, then JITs still have a lot of untapped "possible". E.g. on the managed code (or managed code + managed data) side it should be possible for escape analysis to make Java just as good as C (or better even) at making temporary variables that can be localized to the stack stay there. This should allow (in theory) for seemingly heap-allocated objects to be just as fast as stack-based variables are. With JITs it should be possible to "auto-discover" situations where C is currently forced to use off-stack allocation but Java could (speculatively?) keep the same temporary state on the stack for better speed. We just haven't evolved JITs to that level yet, and it's hard to tell if we will in the next few years.
What about the cost of running the JIT logic itself? I am purely speculating, but instrumented binaries in GCC are much slower (though sample-based profiling works fine) and I am guessing that JITs need to pay a similar price.

Most JITs only instrument during warmup, with the fully optimized code either not instrumenting at all, or instrumenting very lightly (e.g. via sampling). So the runtime cost for the optimized code is effectively zero.
 
Also the JIT logic occupies icache.

Once the optimized code is hot, there is zero cache footprint. There is nothing left in the code other than the optimized result.
Where is the code to analyze samples, run heuristics, de-optimize and re-optimize, then?

It's sitting there, cold, in memory. Not in your icache.
 
A heavy duty compiler is running with your application after all, and it will occupy icache, right? If it's running on the plenty of "free" cores that we have right now, then this is a non-negligible price to pay depending on your environment and application, just like GC threads running in the background don't have zero cost. 

Normally (e.g. on HotSpot), once you've stabilized your hot code, no more compilations happen, and there is no background work being done. So nothing in your icache.

If your JIT scheme is one that would keep doing some sample-based instrumentation (which e.g. HotSpot doesn't keep doing), that will eat up some tiny fraction of CPU, and yes, it will show up in your cache, but much less often and with a much smaller footprint and disruption than the much more frequent interrupts you get from e.g. your OS's low level timer interrupts, which are considered nothing more than background noise.
 
I don't know how expensive hot-swapping code is, but my feeble understanding is that it requires safepoints, which bring their own issues. These factors are more pronounced in less powerful devices like smart phones etc.

Safepointing is an interesting issue. 

Safepoint polling itself is extremely cheap. E.g. HotSpot uses a blind load from a constant address, and relies on page protection to trigger an exception if the safepoint needs to be taken. And safepoints are generally only polled for at loop back edges and method entry or exit. 

It's the fact that the machine state can be modified at the safepoint that reduces the ability to apply some optimizations across loop back edge safepoints. Loop unrolling helps alleviate that, but safepoints are certainly optimization-opportunity-reducers to some degree.

There is a whole bunch of cool art around figuring out how to rely on the limited nature of the state change in a safepoint to allow safepoint-crossing optimizations. E.g. while safepoints may modify the bitwise representation of references, they do not modify their logical meaning: a null cannot become a non-null at a safepoint, and vice versa. And for most practical GC algorithms, two references that are equal to each other will remain equal after a safepoint, and the same holds for two references that are not equal to each other.
Thank you for the insight into safepoints. Besides the cross-safepoint optimization opportunities that are lost, isn't there unpredictable stalling when trying to bring different threads to a safepoint?

Well, that assumes you have long time-to-safepoint paths in your code ;-). While many runtimes still have those, some (those that have already taken care of the much bigger GC issues, hint hint) have mostly gotten rid of them, making TTSP a much smaller nuisance. E.g. when *we* see high ttsps, they are usually dominated by system level issues like swapping, paging, or CPU scheduling delays, as actual executing code paths have dense enough safepoint polling opportunities to reach a safepoint within microseconds of being asked to. When they exist and are not tuned away, those same system delays would have also hit an equivalent functionality non-managed-runtime system...

 
...
But as I note, this is not inherent to a language, to managed code, or to managed data, and I believe it can be powerfully resolved by a combination of some well defined library constructs and JVM-internal optimizations. ObjectLayout is an example of what I mean by this, with StructuredArray specifically addressing the first of the three cases above while fitting well into Java's object model and "playing nice" with all the existing code that knows how to deal with objects.
C# seems to be way ahead of Java here and already has struct support. In theory, with JIT + controlled memory layout, should it be able to beat C++?

Yes. And the same can be said for stack variables in C#. (StructuredArray is a way of avoiding the need for structs, and staying with regular objects, but that's beside the point).

In theory, practice should meet theory, and C# should be able to beat C++. I'm sure there are examples where it does, and examples where it doesn't...
:) This begs a question and I promise that it is out of curiosity and not to be a dick. Why do you not use your own bootstrapped JVM + JDK to write future versions of your JVM (I assume you use C/C++)?

History, momentum, and investment. No other reason. If I were to build a new JVM today, from scratch, I would probably make it meta-circular, code the vast majority of it in Java, and only have about 2-3K lines of code of native stuff that allows the remaining Java code to do anything it needs to.

In practice the only things I wouldn't do in Java are the ones I just don't want to write again from scratch, i.e. ones where relatively good, stable, existing, working code exists in C or C++, and where I have no (good enough) reason to replace it. That pretty much describes the entire HotSpot JVM code base.
 
Is it because of the lack of the struct features you talked about (all available through Unsafe hacking, but uglier than standard C or C++, IMHO)? Is it because of the lack of availability of many instructions (SIMD, PAUSE) required for high performance applications?

No. Those 2-3K lines would provide all I need, and the rest would be pure Java. See the Jikes RVM for an example of the concept in practice. And the code (outside of the 2-3K lines) is not ugly at all.
 
What would need to change with Java or other managed languages for you to use them for Zing or a project of comparable needs?

Nothing really. In fact, there has been a long standing trend of converting JVM C++ code to Java to gain speed in the JDK. Anything that is actually "hot" in the JVM tends to go that way over time. The JIT compilers and the GC code are the main major CPU consumer holdouts so far, and those too might shift over to being coded in Java over time (e.g. look at Graal).

Rajiv Kurian

unread,
Apr 3, 2014, 3:53:01 AM4/3/14
to mechanica...@googlegroups.com
Thank you, that makes it much clearer for me. This is no different than the case where a run-time constant flag would lead to optimizations - in general, a discovery of any run-time-determined constant. I already conceded that AOT compilers wouldn't be able to handle this. The remark was more about the requirement of managed data for JIT compilers for native languages (in addition to managed code). Of course, to hot-swap code in general, one would need to keep track of the stack etc. My understanding of what managed data meant was just wrong - I equated it with garbage collection.

Gil Tene

unread,
Apr 3, 2014, 4:04:21 AM4/3/14
to mechanica...@googlegroups.com
:-) If you want the JVM to get smart enough to do that, all you have to do is convince someone to stick this loop in SPECjbb2018 or some such. That would virtually guarantee that a JIT optimization will then pop up to fold it the way you suggest, even if it's just for the specific pattern you get put in the test...

Biased locking is a great example of such an optimization. Especially the way it delays biasing for a magical period of 4 seconds (see BiasedLockingStartupDelay).
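For illustration, a minimal sketch (the class and method names here are made up) of the kind of uncontended lock that biased locking targets: once the bias kicks in, the first-owning thread re-enters the monitor without an atomic operation on each call.

```java
class BiasedLockDemo {
    private final Object lock = new Object();
    private long counter;

    // Uncontended synchronized block: with biased locking enabled
    // (-XX:+UseBiasedLocking, and after BiasedLockingStartupDelay has
    // elapsed), the lock biases to the first acquiring thread, which
    // can then re-acquire it without a CAS on every call.
    long increment() {
        synchronized (lock) {
            return ++counter;
        }
    }
}
```

Whether biasing actually applies depends on the JVM version and flags; the functional behavior is the same either way.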

Rajiv Kurian

unread,
Apr 3, 2014, 12:28:41 PM4/3/14
to mechanica...@googlegroups.com
Thank you Gil.  More comments inline.


On Thursday, April 3, 2014 12:52:04 AM UTC-7, Gil Tene wrote:
On Wednesday, April 2, 2014 5:51:51 PM UTC-7, Rajiv Kurian wrote:
Thank you for the very detailed follow up Gil. I've got a few more questions to ask :)

More inline. It's fun to have this kind of back and forth with non-trolls... 
 
On Wednesday, April 2, 2014 1:42:51 PM UTC-7, Gil Tene wrote:
More discussion inline.

On Wednesday, April 2, 2014 11:18:31 AM UTC-7, Rajiv Kurian wrote:
Thank you for that very detailed answer. I have a few follow up questions.
...
On Wednesday, April 2, 2014 6:05:50 AM UTC-7, Gil Tene wrote:
On Wednesday, April 2, 2014 1:13:33 AM UTC-7, Rajiv Kurian wrote:
I am trying to understand what kind of optimizations a JITed runtime like the JVM could provide over a C/C++ app compiled using say GCC's PGO. With PGO inlining of virtual functions (the poster boy of JIT superiority claimers) should be possible. With PGO + march=native, machine specific optimizations should be possible too. Link time optimization (lto flag) should also in theory allow for holistic optimization.

"should be possible" is the key phrase here.

GCC, LLVM, Intel and Microsoft compilers for C/C++ have been around for a while now, and keep evolving in optimization, but we still do not see PGO inlining of virtual functions in library code. Whether it's inline caching (the changing of a virtual dispatch to a static one guarded by a type check), or guarded inlining (the inlining of a virtual method body guarded by a type check), or unguarded, class-hierarchy-analysis (CHA) based versions of both, we just don't see much of this in the C/C++ world. The same can be said for constant propagation based on runtime environment discovered constants. This has multiple real-world reasons.
But PGO in G++ and MSVC does employ inline caching for your own virtual functions. Am I correct that your point is that they cannot inline virtual functions ONLY if they are in other libraries?

Yes, but "ONLY" if they are in other libraries is the same as saying "only" for code you didn't write yourself. Like collections, for example. Or Boost. That's 95%+ of the code you use and run through in most non-trivial applications. A better way to look at it is to say that they can only inline virtual functions within your own code...
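A hand-written Java sketch (hypothetical Shape/Square/Circle classes) of what guarded inlining effectively turns a virtual call into - a cheap type check, the inlined body for the profiled-hot receiver type, and a fallback virtual dispatch:

```java
abstract class Shape {
    abstract int area();
}

final class Square extends Shape {
    final int side;
    Square(int side) { this.side = side; }
    int area() { return side * side; }
}

class Circle extends Shape {
    final int r;
    Circle(int r) { this.r = r; }
    int area() { return 3 * r * r; } // crude integer pi, for illustration
}

class GuardedInline {
    // Roughly what a JIT emits for "s.area()" when profiling shows
    // Square is the overwhelmingly common receiver type:
    static int areaGuarded(Shape s) {
        if (s instanceof Square) {       // the guard: a cheap type check
            Square sq = (Square) s;
            return sq.side * sq.side;    // inlined body, no virtual dispatch
        }
        return s.area();                 // uncommon path: real virtual call
    }
}
```

A JIT does this automatically from runtime type profiles (and can deoptimize if the profile changes); an AOT compiler can only do it where PGO data and visibility into the callee's source both exist.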
 
For example, in practice, C/C++ code predominantly uses dynamic libraries as a form of a "portability runtime", since few people write their own malloc or BSD socket code, or get-time-of-day, or find it practical to promise to only ever use the current binary version of libc. And many high-performance machine features (e.g. alternative low-latency NICs and network stacks, mallocs that improve over time) and ongoing bug fixes and security improvements (e.g. "oops, should probably fix that stack overflow vulnerability in libXYZ") use this fact to allow deployment-time injection of newly implemented functionality. This makes it hard, in practice, for PGO to inline library code (or gain from other program+library combo optimizations), whether it's libc, or libsocket, or my-cool-math library. Choosing to use purely static binaries can improve on this to some degree, but it is not commonly practiced (and for good reason).
Doesn't the JVM itself use dynamic libraries for some of the same functionality? And if it doesn't then it suffers from the same problems - new JVM required to address some minor problem.

All the JDK libraries, and the JVM's intrinsics (i.e. that other 95%+ of the code), are just as inline-able for a JIT as the code you write yourself. Java applications spend a minute % in non-generated code.
If the JVM is dependent on native libraries written by others, then how is this possible? For example, if a JVM uses libgmp for its arbitrary-precision arithmetic, then it needs to be linked, just like in a C++ app, and it couldn't be inlined with user code. So from what I understand, you are just saying that there are almost no dynamically linked libraries used in the JVM (and the JDK is pure Java).

To be specific, I'm saying that in Java applications running on JDKs, and in the myriad of Java libraries and frameworks out there, the vast majority of CPU time is spent in generated code that gets to leverage JIT optimizations. Only a tiny percentage is spent in native library code, and only out of necessity. The JDK would never use a native dynamic binary for something it wants performance optimizations applied to. You'll see native code used for bulk operations, where it is locally and statically optimized, but opaque to optimizations coming in from the calling code. It will also use native libraries for interfacing to system functions (like I/O). But for anything you want speed for, especially when each individual operation is fairly light, you'll use bytecodes and intrinsics, because those can have powerful optimizations applied.
Yes, but there is a fine line here. Not every piece of JDK code will be optimal, nor will the opportunities of whole-program optimization be great enough to trump the dedicated efforts put into a standalone piece. libgmp (vs BigDecimal?) was just one example. I am sure there are many more, like libjpeg-turbo etc., where the JDK solutions just don't match up, and neither will someone's hand-coded solutions, even if they were written in pure Java. Again, my point is that whole-program optimizations are not so much because of JIT as because of having the entire source available. I'll definitely concede that in C/C++ this is impractical, mainly because of slow header-induced compilation times, but it is not a fate every AOT language is consigned to. More modern AOT languages (Rust etc.) have a good cross-module optimization story.
 
If so then this benefit is conferred to any C/C++ program that doesn't dynamically link to any library. Of course I understand your point that practically it's tougher to do so in C/C++ because you don't write your own malloc using brk/mmap.

Right, but the difference goes far beyond the basics (like malloc) and extends to data structures, collections, parsers, transformations, and pretty much any piece of code you don't write on your own. And Java (as an ecosystem) has tons of code that you can use instead of writing it on your own. In Java, all those system and third party libraries get to be fully optimized along with your code, where with native libraries you don't get those optimizations unless you statically link and use PGO (which, for whatever reason, people don't seem to be doing).
 
It seems like the bulk of optimizations not available to C/C++ programs are the ones due to not having all the source code at the same time  (i.e. dynamically linking in a bunch of .so files). I could just get the sources for my libraries and use whole program optimization to have the same kinds of optimizations, instead of linking in a dynamic library. I realize this isn't done a lot, but I don't see any reason besides closed source dependencies and general make/cmake script awfulness.

There is a very real-world reason for this, and it's not script awfulness. It's because, operationally, people expect binaries to run - not to have to be recompiled ahead of every run, on every system, after every deployment or every patch. When you do that, you're pretty much in JIT-land already (with a lot more pain).

Try to talk your sysadmins into recompiling all the code on each system every time they apply a RHEL patch, and see what they say to that. See what your QA folks think about it, too. Alternatively, try to explain to them why you insist on carrying the bug in version X of system library Y forward with your statically compiled and linked code...
Hmm. I can see that for client code, but for server code that one runs in one's own environment, recompilation isn't that big of an issue. A lot of libraries are already header-only C++ template code, so a recompilation is already required for the tiniest of changes. In my experience, the biggest reason people do not want to recompile on slight changes to a server program is the long, long time it takes to compile a decently sized C++ project. So as a developer I'd rather link to libs that don't change much than recompile every time, because I cannot stand the hour-long compile times. This is mainly down to header legacy and C++ syntax complexity, though, and more modern languages with module support (Rust crates etc.) don't have to deal with this problem. Even C++ is supposed to get module support, which will make this a non-issue. A lot of open source software written in C/C++ (Nginx, ImageMagick etc.) already recommends compilation on the destination machine instead of using conservatively optimized universal binaries, and since such software isn't updated very regularly, a proper modular compilation (one missing from C++ at the moment) wouldn't be the worst thing. If whole-program compilations didn't take the absurd amounts of time they take now, it would be a very valid choice for developers looking to trade compilation time for better code.

We have this saying where I grew up that roughly translates to: "Yeh, and if my grandmother had wheels, she'd be a bicycle..."

Build-time (as in ahead-of-QA-time and distribution-time) PGO is a real thing. Deployment-time PGO is not. Unless you mean JIT in runtimes (which is basically deployment-time PGO, per run).
Sure it's a problem in C++, but not of the AOT/PGO design. 
 
Also, I don't see the point of your comparison to statically compiled and linked code. Java must suffer from the same problem. If your Java code uses a collections library that has a bug, you need to ship a new version of your code, using the updated collections library, for every program that uses it, instead of shipping a single dynamically linked fix.

If your Java collections have a bug, you change the JDK on the system to the next version of the JDK that has a fix for the collections package, and the bug goes away. Same as it would if you updated the dynamically linked native collections library you may have installed.

The difference is that with Java (and other managed runtimes), the new, fixed library gets just as intimately optimized with your code as the old one would have (think inlining and constant propagation across the library call line, for example), whereas a replaceable native dynamic library cannot be optimized along with the statically compiled code that calls it.
Aah, because changing your JDK is such a simple decision. People seem to be stuck on older versions forever. Also, not every library resides in the JDK. Java devs, especially the ones on this list, use their high-performance alternatives all the time. When a new version of the Disruptor shows up, every Java program that uses it must ship a new version of the program with the new Disruptor sources. You do gain the benefit of whole-program optimization, but pay the same costs that dog static linking for native code. You can't just ask users to get a new Disruptor jar and have all programs working against the shinier, bug-free, faster version. At the risk of sounding like a broken record: if you have all your sources and ship with them, you reap the benefits and suffer the problems, whether it's AOT or JIT. Java of course makes it a lot easier for you to supply all your dependency sources than C++ does, but this is not something future AOT+PGO languages must suffer from.
 
The JVM cannot just escape the cost of complete compilation for these whole-program optimizations we are talking about. It just does it in an amortized way, and pays in delayed startup time and complicated warm-up procedures.

I'm not arguing that the cost of optimization is reduced. I'm arguing that the resulting optimization is better. The cost of JIT optimization is obviously bigger (you pay it again for every run instead of once per produced binary), but at least for server side code that runs for more than a few minutes, that tradeoff is very much worth it.

This will (in practice, not just theory) lead to inlining of virtual functions, constant folding, optimized code layout, dead-block elimination etc. (using the proper cross-module optimization incantations: -flto, -fripa, -Wl, section-order-file, etc.).

Agreed. That's what I see as the other side of "things that are not purely embedded." In embedded systems, you can afford to do this. Reality outside of embedded systems is simply that people don't do this. Why? I think it's because of the nightmares that follow. But the why doesn't really matter... 
 

Similarly, managed data allows us to do things like constant propagation of static final values set at initialization time. Whether it's code that determines (at runtime) whether to use a specific machine feature, adapts to specific launch flags or early-program-discovered parameters, or esoteric things like code that optimizes for the current month (and restarts daily), managed runtimes are full of examples of static finals giving you speed that can't easily be found with PGO. Again, with exhaustive full-program static analysis it should be possible to prove the "constantness" of a memory location for a specific complete program, and if you combined that with just-ahead-of-time, last-minute compilation+linking (a la LLVM), you could maybe get a similar benefit, but we just don't see this done in the real world. Yet.
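A minimal sketch of this kind of static-final folding (the Flags class and the "fastPath" property name are made up for illustration): once the class is initialized, the JIT can treat the field as a compile-time constant and eliminate the dead branch entirely.

```java
class Flags {
    // Set once at startup, e.g. from a launch flag or system property;
    // never changes after class initialization, so the JIT may
    // propagate it as a constant into every compiled caller.
    static final boolean FAST_PATH =
        Boolean.parseBoolean(System.getProperty("fastPath", "true"));

    static int compute(int x) {
        if (FAST_PATH) {
            return x << 1;         // with FAST_PATH == true, only this survives
        }
        return slowCompute(x);     // dead code after constant propagation
    }

    static int slowCompute(int x) { return x * 2; }
}
```

An AOT+PGO binary can't fold this, because the value isn't known until this particular run starts.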
Right, optimizations stemming from runtime-discovered constants do seem like a great missed opportunity for even PGO-aided AOT binaries. Why is managed data required for this to work, though? A managed-code environment alone should be able to do the same optimizations (from runtime-discovered constants: eliminating dead branches, inlining the ones that can be taken, etc.)? Hot-swapping code seems to be the main capability required, not updating pointers to data. I've only seen a few projects use LLVM's JIT facilities and have yet to find successful ones.

To start with, you need a way to describe the notion that a variable that is set once at runtime is never changed after being set that one time, and the expression of this concept is missing in many non-runtime languages. But even in cases where you can express this notion, the managed-data part is what gives the runtime confidence that the code discipline around data access will actually adhere to the concept you expressed. In C/C++, you'll easily be able to get a pointer to such a "supposedly runtime constant" field and mess with it, and that's something you just can't do in a managed-data environment.
A very real issue with C++, but not with a language like Rust, which has support for immutable constructs and statically ensures the discipline you are talking about.
 

Another way to demonstrate this need is this: in JITed managed code environments, the JIT will commonly find itself in a situation where it needs to compile code that refers to state that is not yet resolvable at compile time (e.g. a static final field of a class that has not yet been initialized). This can happen for example in this code:

class Foo {
  static final long startTime = System.nanoTime();

  void doFoo() {
    for (int i = 0; i < 1000000; i++) {
      if (i > 500000) {
        System.out.println("Doof start time " + Doof.startTime);
      } else {
        System.out.println("Foo start time " + Foo.startTime);
      }
    }
  }
}

class Doof {
  static final long startTime = System.nanoTime();
}
 

Where the code will get hot and compiled before Doof is ever initialized. So at the time of first compilation, the Doof.startTime data is not yet initialized, and cannot be propagated as a constant. It could be coded as a runtime field access instead (with the proper "make sure it's initialized first" guards), but most JITs basically deal with this by simply leaving a "trap" in the unresolved part. When the code first reaches the not-yet-executed branch of the if statement, the compiled code will deoptimize, force the initialization of the Doof class (if it had not already been initialized), and reoptimize with the now-initialized value, at which point the optimization can be applied. There is a bunch of "managed data" needed to make these transitions safe and to guard against access to uninitialized data (forcing initialization ahead of any such access).
 
I see, so you are calling stuff like "has X been initialized?" managed data, since the runtime needs to keep track of it? If I am correct, C++11's static keyword (for function-local statics) is similar to the Java static keyword and will do thread-safe initialization etc. I thought you meant automatic memory management.

Yes. Managed data is not the same as automatic memory management (AMM). It's hard(er?) to do AMM without the "managed data" quality, but there are other things that managed data allows you to do beyond AMM. You lose the managed-data quality when your code can do arbitrary pointer manipulation and unchecked/unanalyzable memory access through arbitrarily manipulated pointers. You lose the managed-code quality when your code can create arbitrary execution paths that you can't analyze (like being able to call a function through an arbitrarily computed pointer). The above example relies on being able to prove that certain accesses cannot happen to a certain memory location, which [to me] is a managed-data (not managed-code) property.
Got it. 
 
 


On the other hand, if we compare what should be possible, then JITs still have a lot of untapped "possible". E.g. on the managed-code (or managed code + managed data) side, it should be possible for escape analysis to make Java just as good as C (or even better) at making temporary variables that can be localized to the stack stay there. This should allow (in theory) seemingly heap-allocated objects to be just as fast as stack-based variables. With JITs it should be possible to "auto-discover" situations where C is currently forced to use off-stack allocation but Java could (speculatively?) keep the same temporary state on the stack for better speed. We just haven't evolved JITs to that level yet, and it's hard to tell if we will in the next few years.
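A small sketch (hypothetical Point class) of the kind of case being described: an object that never escapes its method, which an escape-analysis-capable JIT may scalar-replace so that no heap allocation happens at all.

```java
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

class EaDemo {
    // 'p' never escapes compute(): it isn't stored to a field, passed
    // out, or returned. Escape analysis can therefore scalar-replace
    // it - the fields live in registers, the allocation disappears,
    // and the "heap" object is as cheap as a C stack local.
    static int compute(int a, int b) {
        Point p = new Point(a, b);
        return p.x * p.y;
    }
}
```

Whether the allocation is actually eliminated depends on the JIT and on inlining decisions; the semantics are identical either way.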
What about the cost of running the JIT logic itself? I am purely speculating, but instrumented binaries in GCC are much slower (though sample-based profiling works fine) and I am guessing that JITs need to pay a similar price.

Most JITs only instrument during warmup, with the fully optimized code either not instrumenting at all or instrumenting very lightly (e.g. via sampling). So the runtime cost for the optimized code is effectively zero.
 
Also, the JIT logic occupies icache.

Once the optimized code is hot, there is zero cache footprint. There is nothing left in the code other than the optimized result.
Where is the code to analyze samples, run heuristics, and de-optimize/re-optimize based on them, then?

It's sitting there, cold, in memory. Not in your icache.
 
A heavy-duty compiler is running with your application, after all, and it will occupy icache, right? If it's running on the plenty of "free" cores that we have right now, then this is a non-negligible price to pay, depending on your environment and application, just like GC threads running in the background don't have zero cost.

Normally (e.g. on HotSpot), once you've stabilized your hot code, no more compilations happen, and there is no background work being done. So nothing in your icache.
So safepoints (polled mostly on method entry/exit?) are just used to decide whether to de-optimize/re-optimize methods, instead of a background process doing the same? You'll still need to load this heavy-duty compiler code, interleaving it with your own application code either in the background or otherwise, which if nothing else causes unpredictability in your own code.

If your JIT scheme is one that keeps doing some sample-based instrumentation (which e.g. HotSpot doesn't keep doing), that will eat up some tiny fraction of CPU, and yes, it will show up in your cache, but much less often and with a much smaller footprint and disruption than the much more frequent interrupts you get from e.g. your OS's low-level timer interrupts, which are considered nothing more than background noise.
Point taken. 
 
I don't know how expensive hot-swapping code is, but my feeble understanding is that it requires safepoints, which bring their own issues. These factors are more pronounced on less powerful devices like smartphones.

Safepointing is an interesting issue. 

Safepoint polling itself is extremely cheap. E.g. HotSpot uses a blind load from a constant address and relies on page protection to trigger an exception if the safepoint needs to be taken. And safepoints are generally only polled for at loop back edges and method entry or exit.

It's the fact that the machine state can be modified at the safepoint that reduces the ability to apply some optimizations across loop-back-edge safepoints. Loop unrolling helps alleviate that, but safepoints are certainly optimization-opportunity reducers to some degree.

There is a whole bunch of cool art around figuring out how to rely on the limited nature of the state change at a safepoint to allow safepoint-crossing optimizations. E.g. while safepoints may modify the bitwise representation of a reference, they do not modify its logical meaning: a null cannot become a non-null at a safepoint, and vice versa. And for most practical GC algorithms, two references that are equal to each other will remain equal after a safepoint, and the same holds for two references that are not equal to each other.
Thank you for the insight into safepoints. Besides the cross-safepoint optimization opportunities that are lost, isn't there unpredictable stalling when trying to bring different threads to a safepoint?

Well, that assumes you have long time-to-safepoint (TTSP) paths in your code ;-). While many runtimes still have those, some (those that have already taken care of the much bigger GC issues, hint hint) have mostly gotten rid of them, making TTSP a much smaller nuisance. E.g. when *we* see high TTSPs, they are usually dominated by system-level issues like swapping, paging, or CPU scheduling delays, as actually-executing code paths have dense enough safepoint polling opportunities to reach a safepoint within microseconds of being asked to. When those system delays exist and are not tuned away, they would also have hit an equivalent-functionality non-managed-runtime system...
Sure this does cost in terms of predictability, but I see your point regarding OS issues causing more jitter. 

Thank you Gil for providing so many insights into the possibilities of JIT and the internals of JIT compilers. Some takeaways for me:
i) My original question was asked with the purpose of figuring out what exactly JIT buys on top of a theoretically modern AOT+PGO language. I've gotten lots of illustrative examples here.
ii) I asked it with the intention of also figuring out what a modern AOT language like Rust would have to lose/gain in order to get JIT benefits. So far your answers tell me that a sound type system like the one Rust has (unlike C++'s) would allow a JIT mode without having to resort to automatic memory management. Of course, it is a lot of work to build something like the HotSpot JIT compiler. Before this conversation, JIT sounded to me like more of a benefit for dynamic languages, where the JIT mechanism would unveil types, leading to good optimization opportunities. I've seen good examples here of how it could help strongly typed languages too.
iii) The two big places where Java has more theoretical benefits seem to me to be whole-program optimization and runtime constant discovery. The first one stems from having all your sources at the same time. Rust thankfully already has a similar story here, and hopefully this will be an option once (will it ever?) C++ gets modules. GCC and co. can already do all of this when you have your sources, but C/C++ depends on dynamic linking for a multitude of reasons, like you pointed out. The second (runtime optimizations) cannot be achieved reliably with the AOT+PGO combination, especially in programs where runtime constants do exist but vary from run to run. The LLVM JIT also really seems like an AOT compiler with some extra linker magic - it won't be able to perform these dynamic optimizations at all. With LLVM-JIT not an option, there doesn't seem to exist any toolchain to help with such optimizations for native languages at the moment. Of course, I still don't know (and am not convinced) whether the trade-off in predictability and the CPU power spent running compiler logic is worth it. It depends on each program, I guess. For short, predictable programs it definitely doesn't seem worth it. For long-running programs it very well could be.

 
...
But as I noted, this is not inherent to a language, to managed code, or to managed data, and I believe it can be powerfully resolved by a combination of some well-defined library constructs and JVM-internal optimizations. ObjectLayout is an example of what I mean by this, with StructuredArray specifically addressing the first of the three cases above while fitting well into Java's object model and "playing nice" with all the existing code that knows how to deal with objects.
C# seems to be way ahead of Java here and already has struct support. In theory, with JIT + controlled memory layout, should it be able to beat C++?

Yes. And the same can be said for stack variables in C#. (StructuredArray is a way of avoiding the need for structs, and staying with regular objects, but that's beside the point).

In theory practice should meet theory, and C# should be able to beat C++. I'm sure there are examples where it does, and examples where it doesn't...
:) This begs a question and I promise that it is out of curiosity and not to be a dick. Why do you not use your own bootstrapped JVM + JDK to write future versions of your JVM (I assume you use C/C++)?

History, momentum, and investment. No other reason. If I were to build a new JVM today, from scratch, I would probably make it meta-circular, code the vast majority of it in Java, and only have about 2-3K lines of code of native stuff that allows the remaining Java code to do anything it needs to.

In practice the only things I wouldn't do in Java are the ones I just don't want to write again from scratch, i.e. ones where relatively good, stable, existing, working code exists in C or C++, and where I have no (good enough) reason to replace it. That pretty much describes the entire HotSpot JVM code base.
 
Is it because of the lack of the struct features you talked about (All available through unsafe hacking, but looks uglier than std C or C++ IMHO)? Is it because of the lack of availability of many instructions (SIMD, PAUSE) required for high performance applications?

No. Those 2-3K lines would provide all I need, and the rest would be pure Java. See the Jikes RVM for an example of the concept in practice. And the code (outside of the 2-3K lines) is not ugly at all.
 
What would need to change with Java or other managed languages for you to use them for Zing or a project of comparable needs?

Nothing really. In fact, there has been a long-standing trend of converting JVM C++ code to Java to gain speed in the JDK. Anything that is actually "hot" in the JVM tends to go that way over time. The JIT compilers and the GC code are the main major CPU-consumer holdouts so far, and those too might shift over to being coded in Java over time (e.g. look at Graal).
But is the transition to Java done to gain JIT-related speed benefits, or more as a way to avoid FFI costs from crossing Java/C++ boundaries?

Rüdiger Möller

unread,
Apr 3, 2014, 4:01:54 PM4/3/14
to
Just my 2 cents besides a lot of very real practical reasons brought up above,

Dynamic optimization makes use of the difference between what a program "could do" and what it "actually does". Naturally, a JIT has better information about this gap, so it will have better optimization opportunities. I'd expect C++ JITs to emerge in the not-so-distant future.

Kirk Pepperdine

unread,
Apr 3, 2014, 4:23:35 AM4/3/14
to mechanica...@googlegroups.com
A bigger problem: change int to long, and most of the optimizations that are there go away.

Regards,
Kirk

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

Martin Thompson

unread,
Apr 4, 2014, 4:45:32 AM4/4/14
to mechanica...@googlegroups.com
I used to use long variables in my loops for longer runs, but kept getting ranted at by Cliff Click, who pointed out that so many optimisations did not apply then :-)
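The int-vs-long loop-counter effect being described looks like this (a sketch; which optimizations actually apply depends on the HotSpot version):

```java
class LoopCounters {
    // int-indexed counted loop: the shape HotSpot recognizes for its
    // counted-loop optimizations (unrolling, range-check elimination,
    // hoisting safepoint polls out of the loop body).
    static long sumInt(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) {
            s += i;
        }
        return s;
    }

    // The same loop with a long induction variable: historically not
    // recognized as a counted loop, so many of those optimizations
    // did not apply, despite the identical semantics.
    static long sumLong(long n) {
        long s = 0;
        for (long i = 0; i < n; i++) {
            s += i;
        }
        return s;
    }
}
```

Both methods compute the same result; the difference shows up only in the machine code the JIT can generate for each.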