--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
> In spite of these advantages why does Java or C# not beat C/C++ in all but a few benchmarks
The advantage of C and C++ is that you have many ways of doing the same thing, and if you really know what you are doing you can heavily optimise the code. When you have a simple set of requirements which are well understood and don't change, C/C++ is very fast compared with Java. Benchmarks tend to be simple problems where someone with experience can spend an unrealistic amount of time optimising the code to death.

However, in the real world you often (but not always) have changing requirements, problems which are not as well understood as you would like, teams with mixed technical ability (it is useful to have team members who have strong non-technical skills), and access to open source third-party libraries. As yet, there isn't a good benchmark where you have code written by someone with one year of experience, who is more creative than technical, in a limited time frame, for multiple platforms, using mostly third-party libraries.

Note: the same arguments can be applied to not using Java but instead using a language like Scala, Groovy or JRuby. For my projects, Java is the right balance.

> Any opinions on why we don't see JIT enabled run times for native languages with explicit memory management?

I have libraries written in Java which use explicit memory management for very large data sets, i.e. GB to TB. Just because Java has memory management doesn't mean you have to use it for everything.
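As a minimal sketch of what "explicit memory management in Java" can mean in practice (the class, field layout and names here are illustrative, not taken from any particular library): a direct ByteBuffer used as a manually managed slab of fixed-size records, so no per-record objects ever reach the garbage collector.

```java
import java.nio.ByteBuffer;

// A manually managed off-heap "array of records", each record a fixed
// 16-byte layout (8-byte long id, 8-byte double value). The GC never
// sees the records themselves, only the single slab reference.
class OffHeapRecords {
    private static final int RECORD_SIZE = 16;
    private final ByteBuffer slab;

    OffHeapRecords(int capacity) {
        // Direct buffers live outside the Java heap and are zero-filled.
        slab = ByteBuffer.allocateDirect(capacity * RECORD_SIZE);
    }

    void put(int index, long id, double value) {
        int base = index * RECORD_SIZE;
        slab.putLong(base, id);          // absolute put: no position state
        slab.putDouble(base + 8, value);
    }

    long id(int index)      { return slab.getLong(index * RECORD_SIZE); }
    double value(int index) { return slab.getDouble(index * RECORD_SIZE + 8); }
}
```

The records are freed all at once when the buffer becomes unreachable, which is the "explicit" part: lifetime is managed by the slab, not object by object.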
On 2 April 2014 03:13, Rajiv Kurian <geet...@gmail.com> wrote:
I am trying to understand what kind of optimizations a JITed runtime like the JVM could provide over a C/C++ app compiled using, say, GCC's PGO. With PGO, inlining of virtual functions (the poster boy of JIT superiority claimers) should be possible. With PGO + march=native, machine-specific optimizations should be possible too. Link-time optimization (the lto flag) should also in theory allow for holistic optimization.

The one thing I can think of is that it is difficult to test an instrumented binary with "real data", especially for a long-running program like a server, for example an app with hundreds of command line flags. Also, if there is no typical profile, i.e. it varies by usage, then one might do more harm than good with PGO. One could in theory trust the JIT to work out the best optimizations given this kind of variance.

Finally, a sweeping, religious question full of assumptions ;) - in spite of these advantages, why does Java or C# not beat C/C++ in all but a few benchmarks (almost always virtual-functions-with-a-single-implementation benchmarks)? Any opinions on why we don't see JIT-enabled runtimes for native languages with explicit memory management?

Thanks,
Rajiv
Instead of focusing on memory management, I focus on code management when it comes to these kinds of optimization questions. I like MSFT's CLR classification, which separates "managed code" from "managed data". Java and C# (and Scala and Clojure and F# and Groovy, etc.) are all "managed code, managed data" languages. But there are pure "managed code" environments that do not include managed data. They just aren't very common (e.g. Java + Unsafe would qualify for this, probably).
It's the managed code quality that allows for many optimizations. The fact that C allows for " x = (this_struct *) (*f)((that_struct *) y); " makes certain optimizations that would rely on knowledge of limited type and target-function spaces "hard". And while full-program static analysis by a PGO can often approximate "managed code", many real-world behaviors make real-world full-program static analysis impractical for things that are not purely embedded.

On Wednesday, April 2, 2014 1:13:33 AM UTC-7, Rajiv Kurian wrote:
> I am trying to understand what kind of optimizations a JITed runtime like the JVM could provide over a C/C++ app compiled using say GCC's PGO. With PGO inlining of virtual functions (the poster boy of JIT superiority claimers) should be possible. With PGO + march=native, machine specific optimizations should be possible too. Link time optimization (lto flag) should also in theory allow for holistic optimization.

"Should be possible" is the key phrase here.

GCC, LLVM, Intel and Microsoft compilers for C/C++ have been around for a while now, and keep evolving in optimization, but we still do not see PGO inlining of virtual functions in library code. Whether it's inline caching (the changing of a virtual dispatch to a static one guarded by a type check), or guarded inlining (the inlining of a virtual method body guarded by a type check), or unguarded, class-hierarchy-analysis (CHA) based versions of both, we just don't see much of this in the C/C++ world. The same can be said for constant propagation based on runtime-environment-discovered constants. This has multiple real-world reasons.
For example, in practice, C/C++ code predominantly uses dynamic libraries as a form of a "portability runtime", since few people write their own malloc or BSD socket code, or get-time-of-day, or find it practical to promise to only ever use the current binary version of libc. And many high-performance machine features (e.g. alternative low-latency NICs and network stacks, mallocs that improve over time) and ongoing bug fixes and security improvements (e.g. "oops, should probably fix that stack overflow vulnerability in libXYZ") use this fact to allow deployment-time injection of newly implemented functionality. This makes it hard, in practice, for a PGO to inline library code (or gain from other program+library combo optimization), whether it's libc, or libsocket, or my-cool-math library. Choosing to use purely static binaries can improve on this to some degree, but it is not commonly practiced (and for good reason).
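To make the earlier vocabulary concrete, here is roughly what a guarded-inline transformation looks like, written out by hand in Java; the Shape/Circle types are invented for illustration, and a real JIT would deoptimize and recompile on guard failure rather than fall back to a virtual call:

```java
// Illustrative types: a classic virtual-dispatch hierarchy.
abstract class Shape {
    abstract double area();
}

final class Circle extends Shape {
    final double r;
    Circle(double r) { this.r = r; }
    double area() { return Math.PI * r * r; }
}

class Dispatch {
    // What the source code says: an ordinary virtual call.
    static double areaVirtual(Shape s) {
        return s.area();
    }

    // What guarded inlining conceptually produces when profiling shows
    // the call site only ever sees Circle: a type-check guard around
    // the inlined method body.
    static double areaGuarded(Shape s) {
        if (s.getClass() == Circle.class) {
            Circle c = (Circle) s;
            return Math.PI * c.r * c.r;  // inlined Circle.area()
        }
        return s.area();                  // slow path
    }
}
```

The guard turns an indirect branch through a vtable into a cheap, predictable compare, and exposes the inlined body to further optimization.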
Similarly, managed data allows us to do things like constant propagation of static final values set at initialization time. Whether it's code that determines (at runtime) whether to use specific machine functionality, code that adapts to specific launch flags or early-program-discovered parameters, or esoteric things like code that optimizes for the current month (and restarts daily), managed runtimes are full of examples of static finals giving you speed that can't easily be found with PGO. Again, with exhaustive full-program static analysis it should be possible to prove the "constantness" of a memory location for a specific complete program, and if we combined that with just-ahead-of-time, last-minute compilation+linking (a la LLVM) you could maybe get a similar benefit, but we just don't see this done in the real world. Yet.
On the other hand, if we compare what should be possible, then JITs still have a lot of untapped "possible". E.g. on the managed code (or managed code + managed data) side, it should be possible for escape analysis to make Java just as good as C (or even better) at making temporary variables that can be localized to the stack stay there. This should allow (in theory) for seemingly heap-allocated objects to be just as fast as stack-based variables. With JITs it should be possible to "auto-discover" situations where C is currently forced to use off-stack allocation but Java could (speculatively?) keep the same temporary state on the stack for better speed. We just haven't evolved JITs to that level yet, and it's hard to tell if we will in the next few years.
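A sketch of the kind of code escape analysis targets (types invented for illustration): the temporary produced by `minus` never escapes `distance`, so a JIT that proves this after inlining can scalar-replace it, keeping the two fields in registers instead of allocating on the heap. Whether HotSpot actually does so in a given case depends on inlining depth and the escape-analysis implementation.

```java
class Vec {
    final double x, y;
    Vec(double x, double y) { this.x = x; this.y = y; }
    Vec minus(Vec o) { return new Vec(x - o.x, y - o.y); }
    double length() { return Math.sqrt(x * x + y * y); }
}

class Geometry {
    // 'a.minus(b)' allocates a temporary Vec that never escapes this
    // method. After inlining, escape analysis can replace it with two
    // local doubles: no heap allocation, no GC pressure.
    static double distance(Vec a, Vec b) {
        Vec tmp = a.minus(b);
        return tmp.length();
    }
}
```

In C you would write the two subtractions into stack locals by hand; the point above is that a JIT can, in principle, recover the same shape automatically.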
> The one thing I can think of is it is difficult to test an instrumented binary with "real data", especially for a long running program like a server. For example an app with hundreds of command line flags. Also if there is no typical profile i.e. it varies by usage, then one might do more harm than good with PGO. One could in theory trust the JIT to work out the best optimizations given this kind of variance. Finally a sweeping, religious question full of assumptions ;) - In spite of these advantages why does Java or C# not beat C/C++ in all but a few benchmarks (almost always virtual functions with single implementation benchmarks)?

Putting aside the futility of using benchmarks, there are some strong performance-affecting things around memory layout and memory access patterns for which current Java JDKs are seemingly still inherently inferior in speed compared to C variants. I say "seemingly" because I believe it has more to do with library semantics than it does with language semantics, and I think it will be changing very soon.

Specifically, C variants allow for the following three commonly used memory layout schemes that the Java ecosystem does not currently support with speed:

- An array of structs.
- A struct with a struct inside.
- A struct with a variable-size array at the end.

Note that I'm not specifically talking about structs. You can replace "struct" with "object" above and keep my meaning. It's just that the common language environments that use the term "object" to describe their structs typically don't support the above memory layout relationships.

For each of these cases, equivalent Java implementation logic using today's JDKs will require extra data indirection (de-referencing) steps to access the same information.
Even more importantly, in the first case (and in combinations of the first or third case with the second), C variants will achieve linear streaming through memory where Java will end up with potentially random access, which has a huge impact on how hardware prefetchers work and how well memory is streamed into a cache for similar program semantics.

But as I note, this is not inherent to a language, to managed code, or to managed data, and I believe it can be powerfully resolved by a combination of some well-defined library constructs and JVM-internal optimizations. ObjectLayout is an example of what I mean by this, with StructuredArray specifically addressing the first of the three cases above while fitting well into Java's object model and "playing nice" with all the existing code that knows how to deal with objects.
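To see the indirection cost concretely: a Java `Point[]` today is an array of references to objects scattered wherever the GC placed them, while a C `struct point[]` is one contiguous block. A common hand-rolled workaround, and roughly the layout a StructuredArray-style library automates behind an object API, is to flatten the fields into a primitive array (names here are illustrative):

```java
// Reference layout: points[i] is a pointer, and each Point lives
// wherever the allocator/GC put it, so iteration can become random
// access through the heap.
class Point {
    double x, y;
    Point(double x, double y) { this.x = x; this.y = y; }
}

// Flattened layout: one contiguous double[], giving C-like linear
// streaming through memory at the cost of manual index arithmetic.
class PointArray {
    private final double[] data; // [x0, y0, x1, y1, ...]

    PointArray(int n) { data = new double[2 * n]; }

    void set(int i, double x, double y) {
        data[2 * i] = x;
        data[2 * i + 1] = y;
    }

    double x(int i) { return data[2 * i]; }
    double y(int i) { return data[2 * i + 1]; }
}
```

The flattened form trades away the object model (no per-element identity, no polymorphism) for the streaming access pattern the prefetcher rewards.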
> Any opinions on why we don't see JIT enabled run times for native languages with explicit memory management?

My opinion: it's because (per the examples above) it's been hard to make them produce, in the current "real world", the same sort of optimizations that managed code and managed data runtimes seem to be able to produce in their current real world.
The main advantage of writing small pieces of low level Java is that it integrates with natural Java very well. It is still cross platform and works with all the normal Java tools. If 90% of your project is going to be in Java, IMHO, it makes sense to use low level Java instead of C for the low level coding.
On 2 April 2014 03:40, Rajiv Kurian <geet...@gmail.com> wrote:
On Wednesday, April 2, 2014 1:32:24 AM UTC-7, Peter Lawrey wrote:
> I have libraries written in Java which use explicit memory management for very large data sets i.e. GB to TB. Just because Java has memory management doesn't mean you have to use it for everything.

Good point, but did you really write this code in Java instead of C because of the potential JIT benefits? What has Java gotten you there w.r.t. C/C++, besides the joys of writing small parts of a modern C/C++ compiler? ;)
Thank you for that very detailed answer. I have a few follow up questions.
...
On Wednesday, April 2, 2014 6:05:50 AM UTC-7, Gil Tene wrote:
> GCC, LLVM, Intel and Microsoft compilers for C/C++ have been around for a while now, and keep evolving in optimization, but we still do not see PGO inlining of virtual functions in library code. Whether it's inline caching (the changing of a virtual dispatch to a static one guarded by a type check), or guarded inlining (the inlining of a virtual method body guarded by a type check), or unguarded, class-hierarchy-analysis (CHA) based versions of both, we just don't see much of this in the C/C++ world.

But PGO on G++ and MSVC does employ inline caching for virtual functions of your own. Am I correct that your point is that they cannot inline virtual functions ONLY if they are in other libraries?
> For example, in practice, C/C++ code predominantly uses dynamic libraries as a form of a "portability runtime" ... This makes it hard, in practice, for a PGO to inline library code (or gain from other program+library combo optimization), whether it's libc, or libsocket, or my-cool-math library. Choosing to use purely static binaries can improve on this to some degree, but it is not commonly practiced (and for good reason).

Doesn't the JVM itself use dynamic libraries for some of the same functionality? And if it doesn't, then it suffers from the same problem: a new JVM is required to address some minor problem.
It seems like the bulk of optimizations not available to C/C++ programs are the ones due to not having all the source code at the same time (i.e. dynamically linking in a bunch of .so files). I could just get the sources for my libraries and use whole program optimization to have the same kinds of optimizations, instead of linking in a dynamic library. I realize this isn't done a lot, but I don't see any reason besides closed source dependencies and general make/cmake script awfulness.
This will (in practice, not just in theory) lead to inlining of virtual functions, constant folding, optimized code layout, dead block elimination, etc. (using the proper cross-module optimization incantations: -flto, -fripa, -Wl with a section-ordering file, and so on).
> Similarly, managed data allows us to do things like constant propagation of static final values set at initialization time. ... managed runtimes are full of examples of static finals giving you speed that can't easily be found with PGO. Again, with exhaustive full-program static analysis it should be possible to prove the "constantness" of a memory location for a specific complete program ... but we just don't see this done in the real world. Yet.

Right, optimizations stemming from runtime-discovered constants do seem like a great missed opportunity for even PGO-aided AOT binaries. Why is managed data required for this to work, though? A managed code environment alone should be able to do the same optimizations (from runtime-discovered constants: eliminating dead branches, inlining the ones that can be taken, etc.). Hot-swapping code seems to be the main capability required, not updating pointers to data. I've only seen a few projects use LLVM's JIT facilities and have yet to find successful ones. For example:
class Foo {
    static final long startTime = System.nanoTime();

    void doFoo() {
        for (int i = 0; i < 1000000; i++) {
            if (i > 500000) {
                System.out.println("Foo start time " + Foo.startTime);
            } else {
                System.out.println("Doof start time " + Doof.startTime);
            }
        }
    }
}

class Doof {
    static final long startTime = System.nanoTime();
}
> On the other hand, if we compare what should be possible, then JITs still have a lot of untapped "possible". ... We just haven't evolved JITs to that level yet, and it's hard to tell if we will in the next few years.

What about the cost of running the JIT logic itself? I am purely speculating, but instrumented binaries in GCC are much slower (though sample-based profiling works fine) and I am guessing that JITs need to pay a similar price. Also, the JIT logic occupies icache. I don't know how expensive hot-swapping code is, but my feeble understanding is that it requires safepoints, which bring their own issues. These factors are more pronounced in less powerful devices like smartphones.
...
> But as I note, this is not inherent to a language, to managed code, or to managed data, and I believe it can be powerfully resolved by a combination of some well-defined library constructs and JVM-internal optimizations. ObjectLayout is an example of what I mean by this, with StructuredArray specifically addressing the first of the three cases above while fitting well into Java's object model and "playing nice" with all the existing code that knows how to deal with objects.

C# seems to be way ahead of Java here and already has struct support. In theory, with JIT + controlled memory layout, should it be able to beat C++?
More discussion inline.
On Wednesday, April 2, 2014 11:18:31 AM UTC-7, Rajiv Kurian wrote:
> Thank you for that very detailed answer. I have a few follow up questions.
...
> But PGO on G++ and MSVC do employ inline caching for virtual functions of your own. Am I correct about your point being that they cannot inline virtual functions ONLY if they are in other libraries?

Yes, but "ONLY if they are in other libraries" is the same as saying "only" for code you didn't write yourself. Like collections, for example. Or Boost. That's 95%+ of the code you use and run through in most non-trivial applications.
A better way to look at it is to say that they can only inline virtual functions within your own code.

> Doesn't the JVM itself use dynamic libraries for some of the same functionality? And if it doesn't then it suffers from the same problems - new JVM required to address some minor problem.

All the JDK libraries, and the JVM's intrinsics (i.e. that other 95%+ of the code), are just as inline-able for a JIT as the code you write yourself. Java applications spend a minute percentage of their time in non-generated code.
It seems like the bulk of optimizations not available to C/C++ programs are the ones due to not having all the source code at the same time (i.e. dynamically linking in a bunch of .so files). I could just get the sources for my libraries and use whole-program optimization to get the same kinds of optimizations, instead of linking in a dynamic library. I realize this isn't done a lot, but I don't see any reason besides closed-source dependencies and general make/cmake script awfulness.

There is a very real-world reason for this, and it's not script awfulness. It's because, operationally, people expect binaries to run. Not to have to be recompiled ahead of every run, on every system, after every deployment or every patch. When you do that, you're pretty much in JIT-land already (with a lot more pain).

Try to talk your sysadmins into recompiling all the code on each system every time they apply a RHEL patch, and see what they say to that. See what your QA folks think about it, too. Alternatively, try to explain to them why you insist on carrying the bug in version X of system library Y forward with your statically compiled and linked code...
This will (in practice, not just theory) lead to inlining of virtual functions, constant folding, optimized code layout, dead-block elimination etc. (using the proper cross-module optimization incantations: -flto, -fripa, --Wl, section-order-file etc.).

Agreed. That's what I see as the other side of "things that are not purely embedded." In embedded systems, you can afford to do this. Reality outside of embedded systems is simply that people don't do this. Why? I think it's because of the nightmares that follow. But why doesn't really matter...
Similarly, managed data allows us to do things like constant propagation of static final values set at initialization time. Whether it's code that determines (at runtime) whether to use a specific machine feature, adapts to specific launch flags or early-program-discovered parameters, or esoteric things like code that optimizes for the current month (and restarts daily), managed runtimes are full of examples of static finals giving you speed that can't be easily found with PGO. Again, with exhaustive full-program static analysis it should be possible to prove the "constantness" of a memory location for a specific complete program, and if we combined that with just-ahead-of-time, last-minute compilation+linking (à la LLVM), some cool stuff could be done and you could maybe get a similar benefit, but we just don't see this done in the real world. Yet.

Right, optimizations stemming from runtime-discovered constants do seem like a great missed opportunity for even PGO-aided AOT binaries. Why is managed data required for this to work, though? A managed code environment alone should be able to do the same optimizations (from runtime-discovered constants: eliminating dead branches, inlining the branches that can be taken, etc.). Hot-swapping code seems to be the main capability required, not updating pointers to data. I've only seen a few projects use LLVM's JIT facilities and am yet to find successful ones.

To start with, you need a way to describe the notion that a variable that is set once at runtime is never changed after being set that one time, and the expression of this concept is missing in many non-runtime languages. But even in cases where you can express this notion, the managed data part is what gives the runtime confidence that the code's discipline around data access is actually going to adhere to the concept you expressed.
In C/C++, you'll easily be able to get a pointer to such a "supposedly runtime constant" field and mess with it, and that's something you just can't do in a managed data environment.
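A minimal Java sketch of the pattern being described: a value discovered once from the live environment at class-initialization time becomes a true constant for the JIT, so one side of the branch can be eliminated outright. The class, field, and environment-variable names here are invented for illustration:

```java
// Illustrative sketch: a runtime-discovered value becomes a JIT constant.
public class RuntimeConstantSketch {
    // Set exactly once, at class initialization, from the live environment.
    static final boolean USE_WIDE_VECTORS =
            System.getenv("FORCE_NARROW") == null;

    static int process(int x) {
        // Once this method is JIT-compiled, USE_WIDE_VECTORS is a constant:
        // the untaken branch is dead code and is removed entirely, leaving
        // no runtime test. An AOT binary would have to keep the check
        // (or be rebuilt per environment).
        if (USE_WIDE_VECTORS) {
            return x * 2;
        } else {
            return x + x; // dead branch in any given run, eliminated by the JIT
        }
    }

    public static void main(String[] args) {
        System.out.println(process(21)); // 42 either way
    }
}
```

The managed-data point above is what makes this sound: nothing in the program can later take a pointer to USE_WIDE_VECTORS and flip it behind the compiler's back.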
Another way to demonstrate this need is this: in JITed managed code environments, the JIT will commonly find itself in a situation where it needs to compile code that refers to state that is not yet resolvable at compile time (e.g. a static final field of a class that has not yet been initialized). This can happen, for example, in this code:

class Foo {
    static final long startTime = System.nanoTime();

    void doFoo() {
        for (int i = 0; i < 1000000; i++) {
            if (i > 500000) {
                System.out.println("Foo start time " + Foo.startTime);
            } else {
                System.out.println("Doof start time " + Doof.startTime);
            }
        }
    }
}

class Doof {
    static final long startTime = System.nanoTime();
}
Where the code will get hot and compiled before Doof is ever initialized. So at the time of first compilation, the Doof.startTime data is not yet initialized, and cannot be propagated as a constant. It could be coded as a runtime field access instead (with the proper "make sure it's initialized first" guards), but most JITs basically deal with this by simply leaving a "trap" in the unresolved part. When the code first gets to the latter half of the if statement, the compiled code will deoptimize, force the initialization of the Doof class (if it had not already been initialized), and reoptimize with the now-initialized value, at which point the constant propagation can be applied. There is a bunch of "managed data" machinery needed to make these transitions safe and to guard against access to uninitialized data (forcing initialization ahead of any such access).
What about the cost of running the JIT logic itself? I am purely speculating, but instrumented binaries in GCC are much slower (though sample-based profiling works fine) and I am guessing that JITs need to pay a similar price.
On the other hand, if we compare what should be possible, then JITs still have a lot of untapped "possible". E.g. on the managed code (or managed code + managed data) side, it should be possible for escape analysis to make Java just as good as C (or even better) at keeping temporary variables that can be localized to the stack on the stack. This should allow (in theory) for seemingly heap-allocated objects to be just as fast as stack-based variables. With JITs it should be possible to "auto-discover" situations where C is currently forced to use off-stack allocation but Java could (speculatively?) keep the same temporary state on the stack for better speed. We just haven't evolved JITs to that level yet, and it's hard to tell if we will in the next few years.

Most JITs only instrument during warmup, with the fully optimized code either not instrumenting at all, or instrumenting very lightly (e.g. via sampling). So the runtime cost for the optimized code is effectively zero.

Also, the JIT logic occupies icache.

Once the optimized code is hot, there is zero cache footprint. There is nothing left in the code other than the optimized result.
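A short Java sketch of the escape-analysis case being discussed: the temporary below never leaves its method, so a JIT may scalar-replace it (no heap allocation, just two ints in registers), which is exactly the "heap-looking object at stack speed" scenario. Whether this actually happens depends on the JVM and version; the names here are illustrative:

```java
// Hedged sketch: a temporary object that escape analysis can keep off the heap.
public class EscapeAnalysisSketch {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static long sumOfSquares(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            // 'p' never escapes this method, so the JIT's escape analysis
            // may scalar-replace it: no allocation, no GC pressure,
            // effectively a C-style struct local.
            Point p = new Point(i, i + 1);
            sum += (long) p.x * p.x + (long) p.y * p.y;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(3)); // (0+1) + (1+4) + (4+9) = 19
    }
}
```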
I don't know how expensive hot-swapping code is, but my feeble understanding is that it requires safepoints, which bring their own issues. These factors are more pronounced on less powerful devices like smartphones etc.

Safepointing is an interesting issue.

Safepoint polling itself is extremely cheap. E.g. HotSpot uses a blind load from a constant address, and relies on page protection to trigger an exception if the safepoint needs to be taken. And safepoints are generally only polled for at loop back edges and method entry or exit.

It's the fact that the machine state can be modified at the safepoint that reduces the ability to apply some optimizations across loop-back-edge safepoints. Loop unrolling helps alleviate that, but safepoints are certainly optimization-opportunity reducers to some degree.

There is a whole bunch of cool art around figuring out how to rely on the limited nature of the state change in a safepoint to allow safepoint-crossing optimizations. E.g. while safepoints may modify the bitwise representation of references, they do not modify their logical meaning: a null cannot become a non-null at a safepoint, and vice versa. And for most practical GC algorithms, two references that are equal to each other will remain equal after a safepoint, and the same holds for two references that are not equal to each other.
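As a hedged illustration of the back-edge trade-off mentioned above: HotSpot has (in many versions) omitted the safepoint poll inside "counted" int-indexed loops, which makes the loop body faster but can stretch the whole JVM's time-to-safepoint while such a loop spins. The code below only sketches the two loop shapes involved; the observable pause behavior depends on the JVM version and flags:

```java
// Sketch of safepoint-poll placement in loops (behavior is JVM-dependent).
public class SafepointSketch {
    static long spin(int n) {
        long acc = 0;
        // Counted int loop: the JIT may compile this with no safepoint poll
        // on the back edge, trading time-to-safepoint for loop speed.
        for (int i = 0; i < n; i++) {
            acc += i;
        }
        return acc;
    }

    static long spinLong(long n) {
        long acc = 0;
        // A long-indexed loop typically keeps its back-edge poll, so a
        // thread stuck here reaches a requested safepoint promptly.
        for (long i = 0; i < n; i++) {
            acc += i;
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(spin(1000) == spinLong(1000)); // same result either way
    }
}
```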
...
But as I note, this is not inherent to a language, to managed code, or to managed data, and I believe it can be powerfully resolved by a combination of some well-defined library constructs and JVM-internal optimizations. ObjectLayout is an example of what I mean by this, with StructuredArray specifically addressing the first of the three cases above while fitting well into Java's object model and "playing nice" with all the existing code that knows how to deal with objects.

C# seems to be way ahead of Java here and already has struct support. In theory, with JIT + controlled memory layout, should it be able to beat C++?

Yes. And the same can be said for stack variables in C#. (StructuredArray is a way of avoiding the need for structs and staying with regular objects, but that's beside the point.)

In theory, practice should meet theory, and C# should be able to beat C++. I'm sure there are examples where it does, and examples where it doesn't...
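For readers unfamiliar with the layout problem that structs and StructuredArray-style constructs address, here is a hedged Java sketch (names invented): a `Point[]` is an array of references to scattered objects, so iterating it chases pointers, while the common manual workaround is flat primitive storage with struct-array-like cache behavior:

```java
// Sketch of the memory-layout problem; names are illustrative.
public class LayoutSketch {
    // Pointer-chasing version: an array of references to heap objects.
    static final class Point { double x, y; }

    // Flat version: the same data as two dense primitive arrays, giving
    // the sequential-access behavior a C struct array (or C# struct[])
    // gets for free.
    static final class FlatPoints {
        final double[] xs, ys;
        FlatPoints(int n) { xs = new double[n]; ys = new double[n]; }
        double sumX() {
            double s = 0;
            for (double x : xs) s += x; // sequential, prefetch-friendly
            return s;
        }
    }

    public static void main(String[] args) {
        FlatPoints pts = new FlatPoints(4);
        for (int i = 0; i < 4; i++) pts.xs[i] = i;
        System.out.println(pts.sumX()); // 0+1+2+3 = 6.0
    }
}
```

ObjectLayout's StructuredArray aims to get the flat layout while keeping the regular-object programming model, per the discussion above.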
Thank you for the very detailed follow up Gil. I've got a few more questions to ask :)
On Wednesday, April 2, 2014 1:42:51 PM UTC-7, Gil Tene wrote:More discussion inline.
On Wednesday, April 2, 2014 11:18:31 AM UTC-7, Rajiv Kurian wrote:Thank you for that very detailed answer. I have a few follow up questions.
...
All the JDK libraries, and the JVM's intrinsics (i.e. that other 95%+ of the code), are just as inline-able for a JIT as the code you write yourself. Java applications spend a minute percentage of their time in non-generated code.

If the JVM depends on native libraries written by others, then how is this possible? For example, if a JVM uses libgmp for its arbitrary-precision arithmetic, then that library needs to be linked, just like in a C++ app, and it couldn't be inlined with user code. So from what I understand, you are just saying that there are almost no dynamically linked libraries used in the JVM (and the JDK is pure Java).
If so, then this benefit is conferred on any C/C++ program that doesn't dynamically link to any library. Of course, I understand your point that practically it's tougher to do so in C/C++ because you don't write your own malloc using brk/mmap.
Hmm. I can see that for client code, but for server code that one runs in one's own environment, recompilation isn't that big of an issue. A lot of libraries are already header-only C++ template code, so a recompilation is already required for the tiniest of changes. In my experience, the biggest reason people do not want to recompile on slight changes to a server program is the long, long time it takes to compile a decently sized C++ project. So as a developer I'd rather link to libs that don't change much than recompile every time, because I cannot stand the hour-long compile times. This is mainly down to header legacy and C++ syntax complexity, though, and more modern languages with module support (Rust crates etc.) don't have to deal with this problem.
Even C++ is supposed to get module support, which will make this a non-issue. A lot of open-source software written in C/C++ (Nginx, ImageMagick etc.) already recommends compilation on the destination machine instead of conservatively optimized universal binaries, and since such software isn't updated very regularly, a proper modular compilation (one missing from C++ at the moment) wouldn't be the worst thing. If whole-program compilation didn't take the absurd amount of time it takes now, it would be a very valid choice for developers looking to trade compilation time for better code.
Also, I don't see the point of your comparison to statically compiled and linked code. Java must suffer from the same problem. If your Java code uses a collections library that has a bug, you need to ship a new version of your code, using the updated collections library, for every program that uses it, instead of shipping a single dynamically linked copy.
The JVM cannot just escape the cost of complete compilation for the whole-program optimizations we are talking about. It just pays it in an amortized way, in the form of delayed startup and complicated warm-up procedures.
In C/C++, you'll easily be able to get a pointer to such a "supposedly runtime constant" field and mess with it, and that's something you just can't do in a managed data environment.

A very real issue with C++, but not with a language like Rust, which has support for immutable constructs and statically ensures the discipline you are talking about.

...
I see, so you are calling stuff like "has X been initialized?" managed data, since the runtime needs to keep track of it? If I am correct, C++11's static keyword is similar to the Java static keyword in that it does thread-safe initialization etc. I thought you meant automatic memory management.
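The closest Java analogue to C++11's thread-safe function-local statics is the specification's lazy, thread-safe class initialization, which is exactly the "managed" machinery the initialization traps discussed earlier lean on. A sketch using the standard holder idiom (the names are illustrative):

```java
// Java's analogue of a C++11 thread-safe local static: class initialization
// is lazy and guarded by the JVM, and runs at most once.
public class LazyInitSketch {
    static final class Holder {
        // Initialized when Holder is first touched; concurrent first use
        // is serialized by the JVM, like a C++11 function-local static.
        static final long START_TIME = System.nanoTime();
    }

    static long startTime() {
        return Holder.START_TIME; // triggers Holder's initialization on first call
    }

    public static void main(String[] args) {
        long a = startTime();
        long b = startTime();
        System.out.println(a == b); // true: initialized exactly once
    }
}
```

The difference from C++ is the earlier point about managed data: after initialization, nothing in the program can alias and mutate START_TIME, so the JIT may fold it as a constant.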
Once the optimized code is hot, there is zero cache footprint. There is nothing left in the code other than the optimized result.
Where, then, is the code that analyzes samples, runs heuristics, and de-optimizes and re-optimizes based on them?
A heavy-duty compiler is running alongside your application, after all, and it will occupy icache, right? Even if it's running on the plenty of "free" cores that we have right now, it is a non-negligible price to pay depending on your environment and application, just like GC threads running in the background don't have zero cost.
Thank you for the insight into safepoints. Besides the cross-safepoint optimization opportunities that are lost, isn't there unpredictable stalling when trying to bring different threads to a safepoint?
...
:) This begs a question, and I promise that it is out of curiosity and not to be a dick: why do you not use your own bootstrapped JVM + JDK to write future versions of your JVM (I assume you use C/C++)?
Is it because of the lack of the struct features you talked about (all available through Unsafe hacking, but uglier than standard C or C++, IMHO)? Is it because of the lack of availability of many instructions (SIMD, PAUSE) required for high-performance applications?
What would need to change with Java or other managed languages for you to use them for Zing or a project of comparable needs?
On Wednesday, April 2, 2014 5:51:51 PM UTC-7, Rajiv Kurian wrote:

Thank you for the very detailed follow up Gil. I've got a few more questions to ask :)

More inline. It's fun to have this kind of back and forth with non-trolls...

...
So from what I understand, you are just saying that there are almost no dynamically linked libraries used in the JVM (and the JDK is pure Java).

To be specific, I'm saying that in Java applications running on JDKs and the myriad Java libraries and frameworks out there, the vast majority of CPU time is spent in generated code that gets to leverage JIT optimizations. Only a tiny percentage is spent in native library code, and only out of necessity. The JDK would never use a native dynamic binary for something it wants performance optimizations applied to. You'll see native code used for bulk operations, where it is locally and statically optimized but opaque to optimizations coming in from the calling code. It will also use native libraries for interfacing to system functions (like I/O). But for anything you want speed for, especially when each individual operation is fairly light, you'll use bytecodes and intrinsics, because those can have powerful optimizations applied.
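A small Java sketch of the "bytecodes and intrinsics" point: these JDK methods look like ordinary Java calls, but HotSpot recognizes many of them as intrinsics and emits single machine instructions (e.g. POPCNT for `Long.bitCount` on CPUs that have it), fully inlined into the caller, which a dynamically linked native helper could not get. The wrapper method name is invented for illustration:

```java
// Sketch: a JDK intrinsic inlined into user code.
public class IntrinsicSketch {
    static int setBitsTimesTwo(long word) {
        // Long.bitCount is a recognized HotSpot intrinsic on common CPUs:
        // compiled as a popcount instruction, not an out-of-line call.
        return Long.bitCount(word) * 2;
    }

    public static void main(String[] args) {
        System.out.println(setBitsTimesTwo(0b1011L)); // 3 bits set -> 6
    }
}
```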
If so, then this benefit is conferred on any C/C++ program that doesn't dynamically link to any library. Of course, I understand your point that practically it's tougher to do so in C/C++ because you don't write your own malloc using brk/mmap.

Right, but the difference goes far beyond the basics (like malloc) and extends to data structures, collections, parsers, transformations, and pretty much any piece of code you don't write on your own. And Java (as an ecosystem) has tons of code that you can use instead of writing it yourself. In Java, all those system and third-party libraries get to be fully optimized along with your code, whereas with native libraries you don't get those optimizations unless you statically link and use PGO (which, for whatever reason, people don't seem to be doing).
> I can see that for client code, but for server code that one is running on their own environment, recompilation isn't that big of an issue. A lot of libraries already are header-only C++ template code, so a re-compilation is already required for the tiniest of changes. In my experience the biggest reason people do not want to recompile on slight changes in a server program is the long, long time it takes to compile a decently sized C++ project. So as a developer I'd rather link to libs that don't change that much than recompile every time, because I cannot stand the hour-long compile times. This is mainly down to header legacy and C++ syntax complexity though, and more modern languages (Rust crates etc.) with module support don't have to deal with this problem. Even C++ is supposed to get module support, which will make this a non-issue. A lot of open source software written in C/C++ (Nginx, ImageMagick etc.) already recommends compilation on the destination instead of using conservatively optimized universal binaries, and since it's not updated very regularly, a proper modular compilation (one missing from C++ atm) wouldn't be the worst thing. If whole program compilations didn't take the absurd amounts of time they take now, it would be a very valid choice for developers looking to trade compilation time for better code.

We have this saying where I grew up that roughly translates to: "Yeah, and if my grandmother had wheels, she'd be a bicycle..." Build-time (as in ahead of QA time and distribution time) PGO is a real thing. Deployment-time PGO is not. Unless you mean JIT in runtimes (which is basically deployment-time PGO per run).
> Also I don't see the point in your comparison to statically compiled and linked code. Java must suffer from the same problem. If your Java code uses a collections library that has a bug, you need to ship a new version of your code, using the updated collections library, for every program that uses it, instead of shipping a single one that is dynamically linked.

If your Java collections have a bug, you change the JDK on the system to the next version of the JDK that has a fix for the collections package, and the bug goes away. Same as it would if you updated the dynamically linked native collections library you may have installed. The difference is that with Java (and other managed runtimes), the new, fixed library gets just as intimately optimized with your code as the old one would have (think inlining and constant propagation across the library call line, for example), where a replaceable native dynamic library cannot be optimized along with your statically compiled code that calls it.
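As a concrete (hypothetical) illustration of that cross-library inlining and constant propagation, imagine a class like this living in a third-party jar; all names here are made up:

```java
// Sketch: why JIT optimization crosses "library" boundaries. If LibConfig
// shipped in a third-party jar, the JIT still sees the resolved constants at
// compile time and can fold shardOf() down to an inlined mask operation in
// every caller. A call into a replaceable native dynamic library could not
// be folded this way.
class LibConfig {                          // imagine this lives in a jar
    static final int SHARD_BITS = 4;       // constant after class init
    static final int SHARD_MASK = (1 << SHARD_BITS) - 1;

    static int shardOf(int key) { return key & SHARD_MASK; }
}

public class InlineDemo {
    public static void main(String[] args) {
        // After warmup, calls like this compile down to `key & 15`, inline.
        // Here we just check the functional result: 37 & 15 == 5.
        System.out.println(LibConfig.shardOf(37));
    }
}
```

If the jar is swapped for a fixed version, the JIT simply re-optimizes against the new code on the next run, which is the point being made above.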
> The JVM cannot just escape the cost of complete compilation for these whole program optimizations we are talking about. It's just that it does it in an amortized way, but pays in delayed startup time and complicated warm up procedures.

I'm not arguing that the cost of optimization is reduced. I'm arguing that the resulting optimization is better. The cost of JIT optimization is obviously bigger (you pay it again for every run instead of once per produced binary), but at least for server side code that runs for more than a few minutes, that tradeoff is very much worth it.

> This will (in practice, not just theory) lead to inlining of virtual functions, constant folding, optimized code layout, dead block elimination etc. (using the proper cross module optimization incantations - -flto, -fripa, --Wl, section-order-file etc).

Agreed. That's what I see as the other side of "things that are not purely embedded." In embedded systems, you can afford to do this. Reality outside of embedded systems is simply that people don't do this. Why? I think it's because of the nightmares that follow. But why doesn't really matter...

Similarly, managed data allows us to do things like constant propagation of static final values set at initialization time. Whether it's code that determines (at runtime) whether to use a specific machine functionality, adapts to specific launch flags or early-program-discovered parameters, or esoteric things like code that optimizes for the current month (and restarts daily), managed runtimes are full of examples of static finals giving you speed that can't be easily found with PGO.

> Again, with exhaustive full-program static analysis it should be possible to prove the "constantness" of a memory location for a specific complete program, and if we combined that with just-ahead-of-time, last-minute compilation+linking, some cool stuff could be done (à la LLVM) and you could maybe get a similar benefit, but we just don't see this done in the real world.
Yet.

> Right, optimizations stemming from runtime discovered constants do seem like a great missed opportunity for even PGO aided AOT binaries. Why is managed data required for this to work though? A managed code environment alone should be able to do the same optimizations (from runtime discovered constants - eliminating dead branches, inlining the ones that can be taken etc.)? Hot-swapping code seems to be the main capability required, and not updating pointers to data. I've only seen a few projects use LLVM's JIT facilities and am yet to find successful ones.

To start with, you need a way to describe the notion that a variable that is set once at runtime is never changed after being set that one time, and the expression of this concept is missing in many non-runtime languages. But even in cases where you can express this notion, the managed data part is what gives the runtime confidence that the code discipline around data access is actually going to adhere to the concept you expressed. In C/C++, you'll easily be able to get a pointer to such a "supposedly runtime constant" field, and mess with it, and that's something you just can't do in a managed data environment.

> A very real issue with C++, but not with a language like Rust, which has support for immutable constructs and statically ensures the discipline you are talking about.

Another way to demonstrate this need is this: in JITed managed code environments, the JIT will commonly find itself in a situation where it needs to compile code that refers to state that is not yet resolvable at compile time (e.g. a static final field of a class that has not yet been initialized). This can happen, for example, in this code:

class Foo {
    static final long startTime = System.nanoTime();

    void doFoo() {
        for (int i = 0; i < 1000000; i++) {
            if (i > 500000) {
                System.out.println("Foo start time " + Foo.startTime);
            } else {
                System.out.println("Doof start time " + Doof.startTime);
            }
        }
    }
}

class Doof {
    static final long startTime = System.nanoTime();
}
where the code will get hot and compiled before Doof is ever initialized. So at the time of first compilation, the Doof.startTime data is not yet initialized, and cannot be propagated as a constant. It could be coded as a runtime field access instead (with the proper "make sure it's initialized first" guards), but most JITs basically deal with this by simply leaving a "trap" in the unresolved part. When the code first gets to the latter half of the if statement, the compiled code will deoptimize, force the initialization of the Doof class (if it had not already been initialized), and reoptimize with the now-initialized value, at which point the optimization can be applied. There is a bunch of "managed data" needed to make these transitions safe and to guard against access to uninitialized data (forcing initialization ahead of any such access).

> I see, so you are calling stuff like "has X been initialized?" managed data, since the runtime needs to keep track of it? If I am correct, C++11 already has the static keyword, which is similar to the Java static keyword and will do thread-safe initialization etc. I thought you meant automatic memory management.

Yes. Managed data is not the same as automatic memory management (AMM). It's hard(er?) to do AMM without the "managed data" quality, but there are other things that managed data allows you to do, beyond AMM. You lose the managed data quality when your code can do arbitrary pointer manipulation and unchecked/unanalyzable memory access through arbitrarily manipulated pointers. You lose the managed code quality when your code can create arbitrary execution paths that you can't analyze (like being able to call a function through an arbitrarily computed pointer). The above example relies on being able to prove that certain accesses cannot happen to a certain memory location, which [to me] is a managed data (not managed code) property.
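The lazy class initialization that forces such a trap can be observed directly in plain Java (a small illustrative sketch; the class and field names are made up):

```java
// Sketch: Doof's static initializer does not run until the first actual
// access to Doof.startTime. Until that access, a JIT compiling code that
// reads the field has no constant to propagate, so it must leave a trap
// (or a guarded runtime access) in its place.
public class LazyInit {
    static final StringBuilder log = new StringBuilder();

    static class Doof {
        static final long startTime;
        static {
            log.append("Doof-init;");        // runs on first access only
            startTime = System.nanoTime();   // not a compile-time constant
        }
    }

    public static void main(String[] args) {
        log.append("before;");
        long t = Doof.startTime;             // first touch forces <clinit>
        log.append("after;");
        System.out.println(log);             // prints before;Doof-init;after;
        if (t < 0) throw new AssertionError();
    }
}
```

Note that a compile-time constant (e.g. `static final long x = 42;`) would not trigger initialization here; it is precisely the runtime-computed final that creates the "initialized yet?" bookkeeping being called managed data.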
> What about the cost of running the JIT logic itself? I am purely speculating, but instrumented binaries in GCC are much slower (though sample-based profiling works fine), and I am guessing that JITs need to pay a similar price.
On the other hand, if we compare what should be possible, then JITs still have a lot of untapped "possible". E.g. on the managed code (or managed code + managed data) side, it should be possible for escape analysis to make Java just as good as C (or better, even) at making temporary variables that can be localized to the stack stay there. This should allow (in theory) for seemingly heap-allocated objects to be just as fast as stack based variables are. With JITs it should be possible to "auto-discover" situations where C is currently forced to use off-stack allocation but Java could (speculatively?) keep the same temporary state on the stack for better speed. We just haven't evolved JITs to that level yet, and it's hard to tell if we will in the next few years.

Most JITs only instrument during warmup, with the fully optimized code either not instrumenting at all, or instrumenting very lightly (e.g. via sampling). So the runtime cost for the optimized code is effectively zero.

> Also the JIT logic occupies icache.

Once the optimized code is hot, there is zero cache footprint. There is nothing left in the code other than the optimized result.

> Where is the code to analyze samples, run heuristics, de-optimize and re-optimize, then?

It's sitting there, cold, in memory. Not in your icache.

> A heavy duty compiler is running with your application after all, and it will occupy icache, right? If it's running on the plenty of "free" cores that we have right now, then this is a non-negligible price to pay depending on your environment and application, just like GC threads running in the background don't have a zero cost.

Normally (e.g. on HotSpot) once you've stabilized your hot code, no more compilations happen, and there is no background work being done. So nothing in your icache.
If your JIT scheme is one that would keep doing some sample based instrumentation (which e.g. HotSpot doesn't keep doing), that will eat up some tiny fraction of CPU, and yes, it will show up in your cache, but much less often and with a much smaller footprint and disruption than the much more frequent interrupts you get from e.g. your OS's low level timer interrupts, which are considered nothing more than background noise.
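Going back to the escape-analysis point above, here is a sketch of the kind of code where scalar replacement can make a heap-looking allocation cost nothing (names are illustrative; HotSpot's -XX:+DoEscapeAnalysis is on by default in current releases):

```java
// Sketch: Point instances in sumOfNorms() never escape the method, so an
// escape-analysis-capable JIT can scalar-replace them: no heap allocation,
// no GC pressure, comparable to a C stack variable. Whether this actually
// happens depends on the JIT and inlining decisions; the functional result
// is the same either way.
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
    int normSq() { return x * x + y * y; }
}

public class EscapeDemo {
    static long sumOfNorms(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point(i, i);   // never escapes this method
            sum += p.normSq();           // candidate for scalar replacement
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumOfNorms(1000));
    }
}
```

Running with -XX:-DoEscapeAnalysis and comparing allocation profiles is one way to observe the effect on a real HotSpot build.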
> I don't know how expensive hot-swapping code is, but my feeble understanding is it requires safepoints, which bring their own issues. These factors are more pronounced in less powerful devices like smart phones etc.

Safepointing is an interesting issue. Safepoint polling itself is extremely cheap. E.g. HotSpot uses a blind load from a constant address, and relies on page protection to trigger an exception if the safepoint needs to be taken. And safepoints are generally only polled for at loop back edges and method entry or exit. It's the fact that the machine state can be modified at the safepoint that reduces the ability to apply some optimizations across loop back edge safepoints. Loop unrolling helps alleviate that, but safepoints are certainly optimization-opportunity-reducers to some degree.

There is a whole bunch of cool art around figuring out how to rely on the limited nature of the state change in a safepoint to allow safepoint-crossing optimizations. E.g. while safepoints may modify the bitwise representation of a reference, they do not modify its logical meaning: a null cannot become a non-null at a safepoint, and vice versa. And for most practical GC algorithms, two references that are equal to each other will remain equal to each other after a safepoint, and the same for two references that are not equal to each other.

> Thank you for the insight into safepoints. Besides the cross-safepoint optimization opportunities that are lost, isn't there unpredictable stalling when trying to bring different threads to a safepoint?

Well, that assumes you have long time-to-safepoint paths in your code ;-). While many runtimes still have those, some (those that have already taken care of the much bigger GC issues, hint hint) have mostly gotten rid of them, making TTSP a much smaller nuisance. E.g.
when *we* see high TTSPs, they are usually dominated by system level issues like swapping, paging, or CPU scheduling delays, as actual executing code paths have dense enough safepoint polling opportunities to reach a safepoint within microseconds of being asked to. When they exist and are not tuned away, those same system delays would have also hit an equivalent-functionality non-managed-runtime system...
...
But as I note, this is not inherent to a language, to managed code, or to managed data, and I believe it can be powerfully resolved by a combination of some well defined library constructs and JVM-internal optimizations. ObjectLayout is an example of what I mean by this, with StructuredArray specifically addressing the first of the three cases above while fitting well into Java's object model and "playing nice" with all the existing code that knows how to deal with objects.

> C# seems to be way ahead of Java here and already has struct support. In theory, with JIT + controlled memory layout, should it be able to beat C++?

Yes. And the same can be said for stack variables in C#. (StructuredArray is a way of avoiding the need for structs, and staying with regular objects, but that's beside the point.) In theory, practice should meet theory, and C# should be able to beat C++. I'm sure there are examples where it does, and examples where it doesn't... :)

> This begs a question, and I promise that it is out of curiosity and not to be a dick. Why do you not use your own bootstrapped JVM + JDK to write future versions of your JVM (I assume you use C/C++)?

History, momentum, and investment. No other reason. If I were to build a new JVM today, from scratch, I would probably make it meta-circular, code the vast majority of it in Java, and only have about 2-3K lines of code of native stuff that allows the remaining Java code to do anything it needs to. In practice the only things I wouldn't do in Java are the ones I just don't want to write again from scratch, i.e. ones where relatively good, stable, existing, working code exists in C or C++, and where I have no (good enough) reason to replace it. That pretty much describes the entire HotSpot JVM code base.

> Is it because of the lack of the struct features you talked about (all available through unsafe hacking, but looks uglier than std C or C++ IMHO)?
> Is it because of the lack of availability of many instructions (SIMD, PAUSE) required for high performance applications?

No. Those 2-3K lines would provide all I need, and the rest would be pure Java. See the Jikes RVM for an example of the concept in practice. And the code (outside of the 2-3K lines) is not ugly at all.

> What would need to change with Java or other managed languages for you to use them for Zing or a project of comparable needs?

Nothing really. In fact, there has been a long standing trend of converting JVM C++ code to Java to gain speed in the JDK. Anything that is actually "hot" in the JVM tends to go that way over time. The JIT compilers and the GC code are the main major CPU consumer holdouts so far, and those too might shift over to being coded in Java over time (e.g. look at Graal).
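For what it's worth, the kind of dense memory layout StructuredArray is after can be approximated today with flat primitive arrays. This sketch is not the real ObjectLayout API; it shows a struct-of-arrays variant of the idea, trading the normal object model for contiguous, cache-friendly field storage:

```java
// Sketch (not the ObjectLayout API): emulating struct-like layout with flat
// primitive arrays. Fields of logically adjacent elements are contiguous in
// memory, so iteration streams linearly through the cache instead of chasing
// one pointer per element.
final class FlatPoints {
    private final int[] xs, ys;          // one backing array per "field"

    FlatPoints(int n) { xs = new int[n]; ys = new int[n]; }

    void set(int i, int x, int y) { xs[i] = x; ys[i] = y; }

    long sumNormSq() {                   // walks memory linearly
        long s = 0;
        for (int i = 0; i < xs.length; i++) {
            s += (long) xs[i] * xs[i] + (long) ys[i] * ys[i];
        }
        return s;
    }
}

public class LayoutDemo {
    public static void main(String[] args) {
        FlatPoints p = new FlatPoints(3);
        p.set(0, 1, 2);
        p.set(1, 3, 4);
        p.set(2, 5, 6);
        // 1+4 + 9+16 + 25+36 = 91
        System.out.println(p.sumNormSq());
    }
}
```

StructuredArray's goal, as described above, is to get this locality while keeping regular objects and element references, rather than exposing raw index-per-field accessors like this sketch does.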