Why (most) high level languages are slow


Vitaly Davidovich

unread,
Apr 13, 2015, 10:28:07 AM4/13/15
to mechanica...@googlegroups.com
Came across this today: http://sebastiansylvan.com/2015/04/13/why-most-high-level-languages-are-slow/

Nothing that hasn't been stated before, but what's entertaining is actually the ensuing reddit discussion: http://www.reddit.com/r/programming/comments/32f4as/why_most_high_level_languages_are_slow/.  I must say it's somewhat annoying to hear people "talk up" the JVM as if it's some magic pixie dust, without ever mentioning practical/existing limitations.


Martin Thompson

unread,
Apr 13, 2015, 12:25:55 PM4/13/15
to mechanica...@googlegroups.com
It is an interesting read. I find it annoying when people think the JVM can do magic and optimise their code that is full of stupid idioms. Most people do not get the difference between instruction stream optimisation and data structure optimisation. Compilers can do wonderful things with the former and not so much with the latter.

I think so much can be done for object graphs if we can better express relationships and types. For example, when modeling a relationship, the relationship can be an association, aggregation, or composition. The semantic meaning is very useful in a domain. It can also be very useful to the runtime. What Gil and I have been discussing for ObjectLayout is fundamentally about how to express aggregation in a model because the language has no support for this. I actually don't like the term ObjectLayout because it does not specify the layout of objects, it specifies the layout of a graph of objects representing some significant aspects of aggregate associations.

It is also nice that we are discussing the difference between entities and value objects. Value types as proposed for future Java versions are a really nice implementation of value objects from a modelling perspective. That is, objects that do not have identity beyond their field values and thus can be copied or replaced, allowing for nice stack-based semantics. So much of what goes on the heap can be addressed with these.

I think high level languages do not need to be slow. I don't think well written code in Java or C# is what you would call slow at all. I see so much C and C++ that is very slow because of bad design and poor data structures. With good GC, object graph layout semantics, and value objects, Java could be really fast and usable. The big issue I see way more often is library and API design. We need to be thinking of data-friendliness as a first-class concern in design.

Martin...


On 13 April 2015 at 15:28, Vitaly Davidovich <vit...@gmail.com> wrote:
Came across this today: http://sebastiansylvan.com/2015/04/13/why-most-high-level-languages-are-slow/

Nothing that hasn't been stated before, but what's entertaining is actually the ensuing reddit discussion: http://www.reddit.com/r/programming/comments/32f4as/why_most_high_level_languages_are_slow/.  I must say it's somewhat annoying to hear people "talk up" the JVM as if it's some magic pixie dust, and never mentioning practical/existing limitations.



Vitaly Davidovich

unread,
Apr 13, 2015, 12:58:26 PM4/13/15
to mechanica...@googlegroups.com

I think high level languages do not need to be slow. I don't think well written code in Java or C# is what you would call slow at all. I see so much C and C++ that is very slow because of bad design and poor data structures. With good GC, object graph layout semantics, and value objects then Java could be really fast and usable. The big issue I see way more often is library and API design. We need to be thinking data friendly as first class in design. 

Add "proper generics support" to that list please :).  As of right now, generics in java are borderline useless in java in terms of performance.  Hopefully valhalla fixes that.  There are also other micro level issues currently with Hotspot compiler that impact performance, which would be nice to address (a few examples off hand: profile pollution due to profile being collected in callee, switch statements not using profile in code generation, escape analysis being brittle (does not take control flow into account -- graal apparently fixes that), very poor SIMD utilization although I saw a few recent jdk 9 commits by Intel around this area).

Unfortunately, some language semantics will always impose a tax.  Speculative inlining works great with a single receiver loaded in the VM (zero cost, best case), but starts to degrade when the call site goes bimorphic, and becomes a full-out virtual call when megamorphic.  The thing that's somewhat frustrating when people compare this to, say, C++ is they say "well, C++ would be a virtual call right from the start" (which isn't always true with modern compilers that support devirtualization), but they fail to realize that performance sensitive C++ code would avoid virtual dispatch and use other language tools to accomplish the same thing (e.g. template/parametric polymorphism) as much as possible.
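To make the degradation concrete, a minimal sketch (interface and classes invented for illustration): the call site below is inlined while only one implementation has been loaded, gets a guarded bimorphic inline with two, and falls back to a plain virtual dispatch once the profile sees three or more receiver types.

interface Shape { double area(); }

final class Circle implements Shape {
    final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

final class Square implements Shape {
    final double side;
    Square(double side) { this.side = side; }
    public double area() { return side * side; }
}

class Shapes {
    static double total(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area();   // monomorphic: inlined; bimorphic: guarded inline; megamorphic: virtual call
        }
        return sum;
    }
}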

Rust is a language that I'm quite excited about as it has features typically found in high-level languages, yet allows for efficient abstractions.

Vitaly Davidovich

unread,
Apr 13, 2015, 1:22:34 PM4/13/15
to mechanica...@googlegroups.com

Oh, the other current limitation that irks me is that the JIT doesn't trust final fields and so doesn't treat them as constants (a prime example where this stinks is Enum.ordinal usage).  I know the rationale behind it, but it could've been done differently.

sent from my phone

Jan Kotek

unread,
Apr 13, 2015, 2:48:47 PM4/13/15
to mechanica...@googlegroups.com
After reading a lot from Peter L., I think of Java as a low-level language. GC is optional ;-)

Jan

Vitaly Davidovich

unread,
Apr 13, 2015, 3:00:34 PM4/13/15
to mechanica...@googlegroups.com

Even if you manage to avoid GC you're still paying some price to support it (e.g. write barrier/card marking, some bits in the mark word of an object header).

sent from my phone

Kevin Burton

unread,
Apr 13, 2015, 3:16:26 PM4/13/15
to mechanica...@googlegroups.com
Ha. I'm going to post a comment on reddit and tell them we're making fun of them ;)

Richard Warburton

unread,
Apr 13, 2015, 3:47:55 PM4/13/15
to mechanica...@googlegroups.com
Hi,

Oh, the other current limitation that irks me is that JIT doesn't trust final fields and so doesn't treat them as constants (prime example where this stinks is Enum.ordinal usage).  I know the rationale behind it, but could've been done differently.


In what sense do you mean "could've been done differently"? In one sense you're making the other tradeoff, i.e. just breaking frameworks that don't respect the language semantics, but that has obvious downsides. Or are you proposing an alternative implementation approach that doesn't have this downside?

regards,

  Richard Warburton

Marshall Pierce

unread,
Apr 13, 2015, 3:53:54 PM4/13/15
to mechanica...@googlegroups.com

-Marshall

Sent from my iPhone

Vitaly Davidovich

unread,
Apr 13, 2015, 3:56:51 PM4/13/15
to mechanica...@googlegroups.com
The one I had in mind is to simply invert the default: by default, the compiler trusts final fields and treats them as constants.  Arguably, most java apps will likely not be reflectively changing final fields.  For those that want it, add a JVM option to either (a) disable trusting of final fields globally or (b) allow specifying classes where this optimization should be disabled.  Well-behaved java code should not be penalized for frameworks that heavily use reflection.

At any rate, there's some work that oracle is doing to add support for trusting final fields.  We'll see when/if this lands and what exactly it'll look like (last I saw, it was a flag to enable trusting final fields globally).


Vitaly Davidovich

unread,
Apr 13, 2015, 4:04:01 PM4/13/15
to mechanica...@googlegroups.com
By the way, I suppose the more complex way to actually solve this is by tracking field accesses while profiling and detecting reflective accesses to them.  If no reflective access has been seen when the JIT is compiling a field access, then treat it as a constant.  The trick would be to piggyback on deopt/dependency tracking and modify reflection to effectively act as an uncommon trap when it writes to a final field: if an app all of a sudden writes to a final field via reflection, the app traps and the dependent compiled methods are thrown away.  If you see too many traps like this, then ditch the optimization altogether (to not severely penalize apps that do use this capability).  The downside, of course, is having yet more dependency tracking in the deopt machinery.
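For reference, the kind of reflective final-field write such a scheme would have to trap on looks like this (class and field invented for illustration):

import java.lang.reflect.Field;

class Config {
    private final int limit;
    Config(int limit) { this.limit = limit; }
    int limit() { return limit; }
}

class ReflectiveWrite {
    public static void main(String[] args) throws Exception {
        Config c = new Config(10);
        Field f = Config.class.getDeclaredField("limit");
        f.setAccessible(true);
        // If the JIT had constant-folded reads of 'limit', this write would have to act as
        // an uncommon trap and invalidate the compiled methods that depend on the old value.
        f.setInt(c, 42);
        System.out.println(c.limit());
    }
}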

Vitaly Davidovich

unread,
Apr 13, 2015, 4:20:01 PM4/13/15
to mechanica...@googlegroups.com
I recall seeing that set of posts by Cliff Click (by the way, he's no longer with Azul so I don't know whether that actually went anywhere -- Gil may know).  Although his basic premise of "who cares about the 2nd load as it'll hit L1" is sound in theory, it's not quite as simple as that in practice (as demonstrated by his two examples from Doug Lea and Charles Nutter).

Here's a quick example.  Suppose you have an enum E with 10 members: ONE, TWO, THREE ... TEN.  There's a big difference in generated asm between these two constructions:

private static final int MASK = (1 << E.ONE.ordinal()) | (1 << E.TWO.ordinal()) | ...; // say we use 3-4 of them

static boolean isMatch(E arg) {
    return ((1 << arg.ordinal()) & MASK) != 0;
}

VS

static boolean isMatch(E arg) {
    return ((1 << arg.ordinal()) & ((1 << E.ONE.ordinal()) | (1 << E.TWO.ordinal()) | ...)) != 0;
}

In the first case, MASK is a (JIT) compile time constant and emitted as an immediate in the assembly.  In the 2nd case (which is arguably what one wants to write since there's otherwise no need for the static field), you get a bunch of code to load the enum members' ordinal field, then the resulting shift and OR instructions.

I'm sure everyone's intuitive understanding would be that enum's ordinal() is just like a static constant (semantically speaking), but you're going to get gross code generated for this.  To boot, if this code is hot but runs in the context of a lot more code, we end up with quite a few "unnecessary" memory accesses to grab the ordinal of the enum members, which may miss.

Martin Thompson

unread,
Apr 13, 2015, 4:22:23 PM4/13/15
to mechanica...@googlegroups.com
+1 on the card marking. I've worked with low/no-GC apps and got bitten by card marking causing unpredictable behaviour and false sharing on the card table. Another great reason not to have too much indirection, even in low/no-GC apps. For high-thread-count apps I often see benefits from turning on conditional card marking with -XX:+UseCondCardMark.
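For anyone who hasn't looked at the mechanism, a rough sketch of what the barrier amounts to (the constants mirror the usual 512-byte cards and a byte-per-card table; this is illustrative, not HotSpot's actual code):

class CardMarkSketch {
    static final byte DIRTY = 0;
    static final byte[] CARD_TABLE = new byte[1 << 20];   // one byte per 512-byte card

    // plain card marking: an unconditional store on every reference write
    static void writeBarrier(long objAddress) {
        CARD_TABLE[(int) (objAddress >>> 9)] = DIRTY;
    }

    // conditional card marking (-XX:+UseCondCardMark): check before storing, trading a
    // load and branch for far fewer writes, which avoids false sharing on the card table
    static void conditionalWriteBarrier(long objAddress) {
        int card = (int) (objAddress >>> 9);
        if (CARD_TABLE[card] != DIRTY) {
            CARD_TABLE[card] = DIRTY;
        }
    }
}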

Rajiv Kurian

unread,
Apr 14, 2015, 7:01:04 PM4/14/15
to mechanica...@googlegroups.com
Yeah, the whole "A sufficiently advanced VM could do this" is such a cop out. You cannot even keep a few gigs of objects cached in memory without massive pauses. Most high performance Java code I see rarely allocates objects. Otherwise they resort to flyweights over ByteBuffers or Unsafe (not without its own performance caveats).  Even Gil, who keeps talking about how cheap allocations are, has zero allocations in HdrHistogram. I totally understand the good-enough-performance-with-short-lead-times promise of Java, but the heroic efforts made by people to make it do ungodly things just seem like the wrong choice of technology.
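(For anyone who hasn't seen the style: a minimal sketch of a flyweight over a ByteBuffer, with the record layout and names entirely made up.)

import java.nio.ByteBuffer;

final class QuoteFlyweight {
    private ByteBuffer buffer;
    private int offset;

    void wrap(ByteBuffer buffer, int offset) {   // point the flyweight at a record; no allocation
        this.buffer = buffer;
        this.offset = offset;
    }

    long price()    { return buffer.getLong(offset); }      // bytes [0, 8) of the record
    int  quantity() { return buffer.getInt(offset + 8); }   // bytes [8, 12) of the record
}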

I think a lot of people have been burnt by the poor debugging and compiler infrastructure for C/C++. I am also really excited about Rust because it has started from scratch. It is not handicapped by the header-file expansion issues of C, nor the 100-page template errors from C++. Given that the compiler prevents aliasing, it can actually generate better code than C++ compilers, which always assume aliasing.



Scott Carey

unread,
Apr 14, 2015, 7:57:14 PM4/14/15
to mechanica...@googlegroups.com


On Tuesday, April 14, 2015 at 4:01:04 PM UTC-7, Rajiv Kurian wrote:
Yeah the whole "A sufficiently advanced VM could do this" is such a cop out. You cannot even keep a few gigs of objects cached in memory without massive pauses . Most high performance Java code I see rarely allocates objects. Otherwise they resort to flyweights over ByteBuffers or Unsafe (not without its own performance caveats).  Even Gil who keeps talking about how cheap allocations are, has zero allocations in HdrHistogram. I totally understand the good enough performance with short lead times promise of Java, but the heroic efforts made by people to make it do ungodly things just seems like a wrong choice of technology.

I think a lot of people have been burnt by the poor debugging and compiler infrastructure for C/C++. I am also really excited about Rust because it has started from scratch. It is not handicapped by the header file expansion issues of C, nor the 100 pages template errors from C++. Given the compiler prevents aliasing it can actually generate better code than C++ compilers that always assume aliasing.

What I'm curious about with Rust is how it will perform memory-allocation-wise.  I was once part of a project that ported ~300,000 lines of C++ code to java 1.3 (nearly class for class, method per method, some automated).   The result was 30% faster! Why?  Because a crap-load of fine-grained multi-threaded object allocation and deletion with a malloc-style allocator is dog slow.  GC, even with hotspot in the 1.3 days, is a lot faster.  Allocations are amazingly fast, and GC's amortized cost can be surprisingly low.  We went from >30% of time spent in malloc/free or smart pointer management, to about 2 percent of time spent in GC, with the cost being pause times.

Of course, you can use custom allocators, memory arenas, or other tricks with C / C++ with various tradeoffs, but shoving that into a large legacy / messy code-base is no picnic.  There is a reason why the inner core of Oracle's db engine has multiple of its own custom allocation schemes and only a handful of developers are allowed to get near it.

I have not gone much past the overview and tutorial on Rust, and it looks excellent after a Saturday afternoon poking around with it, but if it compiles everything down into fine-grained malloc/free and smart pointers, then many classes of software will be spending most of their time searching for holes in the heap to allocate memory and causing lots of random memory writes and cache line pings.

I guess from my perspective, yes, GC can go wrong, but malloc/free, new/delete, and smart pointers can be abused and be just as bad.  Rust might prevent a whole host of errors from ever happening (relative to C/C++), but does it come at the cost of being a slave to the compiler's allocation strategy?  I have found a few discussions like this one related to the topic: https://github.com/rust-lang/rfcs/issues/538 so there is some hope.

Rajiv Kurian

unread,
Apr 14, 2015, 9:15:31 PM4/14/15
to mechanica...@googlegroups.com
My experience with even good allocators like jemalloc has been that if your code base does Java-esque allocations and they somehow don't match up with the arena sizes in jemalloc, things will be slow and, like you said, probably slower than Java. jemalloc has been maturing a lot though, so it handles moderate malloc/free pretty well. They are also experimenting with asynchronous collection (GC, anyone?) to improve efficiency and the cross-thread malloc/free use case.

As for Rust, given its strong aliasing rules, using malloc as a way of dealing with concurrency is just not needed. You can allocate on the stack with a lot of confidence and pass pointers to stack-allocated objects around, knowing that the compiler will catch lifetime issues (unlike C++). Even John Carmack finds stack-allocation lifetime issues in his code. You can even pass a pointer to a stack-allocated struct from thread A to thread B, and it won't let you exit thread A unless you join thread B first. But I still don't think you can get away without arena allocators, preallocation, stack allocators, eating some internal fragmentation, and all the other C++ tricks to prevent allocations.

My point was that all high-performance Java projects go off heap and limit allocation (on or off heap), so there are better tools than using Unsafe, which to my mind is even lower level than C++. Rust allows for all of that, but with better compile-time guarantees and quite possibly better performance thanks to the strict aliasing rules, and all of this without going to C++ and its complicated compiler/linker toolchain or whatever other reasons people rightly hate C++ for. It is definitely not going to magically transform code that does a lot of mallocs and frees. As for custom allocators a la C++, personally I don't like them, so I am fine preallocating and managing struct arrays/buffers like one would in C. In my experience, classes that bake allocation/deallocation strategies in at the lowest level (i.e. all the C++ std collections and all libraries that use them) end up being too complicated to use even with placement new/custom allocators (more here - https://deplinenoise.wordpress.com/2012/10/20/toollibrary-memory-management-youre-doing-it-wrong/).

Kirk Pepperdine

unread,
Apr 15, 2015, 4:25:39 AM4/15/15
to mechanica...@googlegroups.com

On Apr 15, 2015, at 1:01 AM, Rajiv Kurian <geet...@gmail.com> wrote:

> Yeah the whole "A sufficiently advanced VM could do this" is such a cop out. You cannot even keep a few gigs of objects cached in memory without massive pauses . Most high performance Java code I see rarely allocates objects. Otherwise they resort to flyweights over ByteBuffers or Unsafe (not without its own performance caveats). Even Gil who keeps talking about how cheap allocations are, has zero allocations in HdrHistogram.

This is because, quite frankly, Gil is right about many, many things, but in this case he's wrong. Allocations are not expensive, but they are also not cheap. If you take into account the life-cycle costs, it can get downright expensive.


> I totally understand the good enough performance with short lead times promise of Java, but the heroic efforts made by people to make it do ungodly things just seems like a wrong choice of technology.

Agreed… though memory management in C/C++ is more expensive than it is in Java; it's just a hidden cost, or a cost that is harder to measure. However, a lot of other stuff can be done in a less expensive way, so on balance…

Regards,
Kirk

Martin Thompson

unread,
Apr 15, 2015, 4:39:01 AM4/15/15
to mechanica...@googlegroups.com
On 15 April 2015 at 09:25, Kirk Pepperdine <ki...@kodewerk.com> wrote:

On Apr 15, 2015, at 1:01 AM, Rajiv Kurian <geet...@gmail.com> wrote:

> Yeah the whole "A sufficiently advanced VM could do this" is such a cop out. You cannot even keep a few gigs of objects cached in memory without massive pauses . Most high performance Java code I see rarely allocates objects. Otherwise they resort to flyweights over ByteBuffers or Unsafe (not without its own performance caveats).  Even Gil who keeps talking about how cheap allocations are, has zero allocations in HdrHistogram.

This is because quite frankly, Gil is right about many many things.. many things but in this case he’s wrong. They are not expensive but they are also not cheap. If you take in the life-cycle costs then that can get downright expensive.

It is my "fault" that HdrHistogram does not allocate. Gil's original version did. I pointed out to Gil that most people who use it will be running Hotspot and not Zing and we do not want something allocating in there that will cause pauses. HdrHistogram needs to be used in measurements and it should not impact the measurements itself, which it would if it did allocate.

Vitaly Davidovich

unread,
Apr 15, 2015, 9:00:44 AM4/15/15
to mechanica...@googlegroups.com

+1

There's probably no debate that GC allocation throughput (especially bump-the-pointer in a thread-local buffer) will beat native allocators.  But the people that make this statement as an argument for GC fail to realize that even semi-performance-conscious native code doesn't allocate at the same rate as typical java code.  The real issue is that java is a very heap-heavy language - there's very little useful work that can be done without causing allocations (assuming one doesn't go out of their way to contort the code).  The one glaring difference is that native code uses the stack quite a bit for temp allocations.

The other thing is that typical C++ code isn't written in the same manner as java.  In particular, templates allow for compile time polymorphism whereas java relies on the JIT and profile/classload info to devirtualize, which is much more brittle and subject to subtle performance regressing changes (e.g. going from one subclass of an abstract class loaded to two).

The "frustrating" thing is that java IS pretty fast and it does have some advantages over AOT native compilers (even ones being compiled with PGO), and yet it's not quite where it could be.  For example, having a tiered compilation system is great in that it doesn't put unnecessary pressure on the highest tier compiler to finish compilation quickly.  But then there are implementation artifacts that reduce some of that potential.  One example in mind is the whole "profile pollution" problem.

The other "dropped the ball" aspect is generics - the incessant type checks in the generated code is just silly.  I'll leave it at that as I'm sure we're all well aware of this issue, and hopefully Valhalla fixes it (and value types).

As for Rust, I also quite like that performance is a paramount requirement for its engineers/community.  They also have a complex, performance-sensitive "dogfooding" project in Servo, which should help to keep the performance train on track.

sent from my phone

Martin Thompson

unread,
Apr 15, 2015, 9:20:57 AM4/15/15
to mechanica...@googlegroups.com
On 15 April 2015 at 14:00, Vitaly Davidovich <vit...@gmail.com> wrote:

+1

There's probably no debate that GC allocation throughput (especially bump the pointer in a thread local buffer) will beat native allocators.  But, the people that make this statement as an argument for GC fail to realize that even semi performance conscious native code doesn't allocate at the same rate as typical java code.  The real issue is that java is a very heap heavy language - there's very little useful work that can be done without causing allocations (assuming one doesn't jump out of their way and contort the code).  The one glaring difference is native code uses the stack quite a bit for temp allocations.

I've found that I often end up allocating or contorting my own code in Java due to the lack of value types, so that I can return a multi-value structure on the stack. We can hopefully get this addressed for a big bump in a future release. I've spent some time over the last few years tracking the perf issues I came across due to restrictions in Java vs C. Value types are right up there, next to the ability to co-locate a small object in a larger one as an aggregate, and having arrays of objects that are not accessed via an array of references. Most other issues I tracked in real-world apps are simple API issues, like ByteBuffer and String missing key methods.
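A tiny sketch of what I mean (names invented): without value types, returning two values from a method means a heap allocation per call, or some contortion like reusing a mutable holder.

final class MinMax {                 // would ideally be a value type living on the stack
    final long min;
    final long max;
    MinMax(long min, long max) { this.min = min; this.max = max; }
}

class Stats {
    static MinMax minMax(long[] values) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long v : values) {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        return new MinMax(min, max); // heap allocation on every call unless escape analysis happens to kick in
    }
}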

The other thing is that typical C++ code isn't written in the same manner as java.  In particular, templates allow for compile time polymorphism whereas java relies on the JIT and profile/classload info to devirtualize, which is much more brittle and subject to subtle performance regressing changes (e.g. going from one subclass of an abstract class loaded to two).

The "frustrating" thing is that java IS pretty fast and it does have some advantages over AOT native compilers (even ones being compiled with PGO), and yet it's not quite where it could be.  For example, having a tiered compilation system is great in that it doesn't put unnecessary pressure on the highest tier compiler to finish compilation quickly.  But then there are implementation artifacts that reduce some of that potential.  One example in mind is the whole "profile pollution" problem.

The other "dropped the ball" aspect is generics - the incessant type checks in the generated code is just silly.  I'll leave it at that as I'm sure we're all well aware of this issue, and hopefully Valhalla fixes it (and value types).

The type checks are an interesting problem. They are the wrong way round in some ways. For example, we don't check when putting something into a HashMap due to erasure but we do when getting it out via a check cast.  Maps are often read much more than written so the cost feels wrongly apportioned.
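Concretely (types invented for illustration), the cost lands on the read path:

import java.util.HashMap;
import java.util.Map;

class ReadSideCheck {
    static long lookup() {
        Map<String, Long> prices = new HashMap<>();
        prices.put("ABC", 100L);        // erasure: no type check on the write path
        return prices.get("ABC");       // checkcast to Long (plus unboxing) on the read path
    }
}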

These branches don't typically miss but they do have an impact. In tight loops we are restricted to 28 uops being fed from the loop stream detector (LSD). Within these 28 uops we can only have 8 branches, plus no calls or returns. With all the checkcasts we can end up with more branches than is obvious and blow this 8-branch limit. Not only does this slow down the code, it greatly increases the power requirements on the front end of the processor, which has to decode instructions that cannot be re-fed from the uop cache.

Vitaly Davidovich

unread,
Apr 15, 2015, 9:45:32 AM4/15/15
to mechanica...@googlegroups.com

Modern intel chips will fuse cmp+jmp into 1 uop, if I recall correctly.  But yes, irrespective of fusion, it's a waste in the instruction stream and uses up a BTB entry.  This is particularly annoying when the actual code using the object is cheaper than the typecheck.

The other issue is that if you end up reading data from the loaded object that's a cacheline away from the header, the type check will either cause an unnecessary cache miss or keep a cacheline resident that may not otherwise be needed.

sent from my phone


Michael Barker

unread,
Apr 15, 2015, 4:54:32 PM4/15/15
to mechanica...@googlegroups.com
I wish those were my issues; right now I'm still being kicked by the perfect storm of safepoints and major page faults.

Remi Forax

unread,
Apr 15, 2015, 5:35:02 PM4/15/15
to mechanica...@googlegroups.com

On 04/15/2015 03:20 PM, Martin Thompson wrote:
On 15 April 2015 at 14:00, Vitaly Davidovich <vit...@gmail.com> wrote:

+1

There's probably no debate that GC allocation throughput (especially bump the pointer in a thread local buffer) will beat native allocators.  But, the people that make this statement as an argument for GC fail to realize that even semi performance conscious native code doesn't allocate at the same rate as typical java code.  The real issue is that java is a very heap heavy language - there's very little useful work that can be done without causing allocations (assuming one doesn't jump out of their way and contort the code).  The one glaring difference is native code uses the stack quite a bit for temp allocations.

If found that I often end up allocating or contorting my own code in Java due to the lack of value types so that I can return a multi value structure on the stack. We can hopefully get this addressed for a big bump for a future release. I've spent some time over the last few years tracking when I came perf issues due to restrictions in Java vs C. Value types is right up there next to  the ability to co-locate a small object in a larger one as an aggregate and having arrays of objects that are not via reference array. Most other issues I tracked in real world apps are simple API issues like with ByteBuffer and String missing key methods.

The other thing is that typical C++ code isn't written in the same manner as java.  In particular, templates allow for compile time polymorphism whereas java relies on the JIT and profile/classload info to devirtualize, which is much more brittle and subject to subtle performance regressing changes (e.g. going from one subclass of an abstract class loaded to two).

The "frustrating" thing is that java IS pretty fast and it does have some advantages over AOT native compilers (even ones being compiled with PGO), and yet it's not quite where it could be.  For example, having a tiered compilation system is great in that it doesn't put unnecessary pressure on the highest tier compiler to finish compilation quickly.  But then there are implementation artifacts that reduce some of that potential.  One example in mind is the whole "profile pollution" problem.

The other "dropped the ball" aspect is generics - the incessant type checks in the generated code is just silly.  I'll leave it at that as I'm sure we're all well aware of this issue, and hopefully Valhalla fixes it (and value types).

The type checks are an interesting problem. They are the wrong way round in some ways. For example, we don't check when putting something into a HashMap due to erasure but we do when getting it out via a check cast.  Maps are often read much more than written so the cost feels wrongly apportioned.

[...]

Even without generics, there is a little bit of trivia around java.util.HashMap: the boot process of the OpenJDK VM loads at least one LinkedHashMap (which inherits from HashMap), and that defeats CHA, so each time you do a get/put on a HashMap the VM first has to check at runtime whether the HashMap is a HashMap or a LinkedHashMap.

cheers,
Rémi

Vitaly Davidovich

unread,
Apr 15, 2015, 6:25:58 PM4/15/15
to mechanica...@googlegroups.com
But, if you use a static final HashMap reference, the type checks are gone :).  Some of this comes back to the JIT not trusting final instance fields.  In a lot of cases I've seen, the instance constructor will instantiate a HashMap and store it in a final field.  If the JIT were to trust that information, it would not need to generate any type checks when that field is used inside the class (and it could also then propagate that information to anything inlined from this class).  As of right now, you get silliness like this:

final class Foo {
   private final HashMap<...> _m = new HashMap<>();

   public Object get() {
      return _m.get(null); // type check here
   }
}

Switch HashMap for LinkedHashMap, and type check is gone.

Vitaly Davidovich

unread,
Apr 15, 2015, 6:41:04 PM4/15/15
to mechanica...@googlegroups.com
And, along the same lines:

final class Foo {
    private final Object[] _arr = new Object[6];
    private static final Object[] _sarr = new Object[6];

    public Object getInstance() {
        return _arr[1];  // range check emitted: the array's length isn't trusted as a constant
    }

    public Object getStatic() {
        return _sarr[1]; // no range check, direct load of 2nd element
    }

    public Object guardedInstance() {
        if (_arr.length > 5) return null;
        return _arr[1];  // an explicit length check against 5 is emitted
    }

    public Object guardedStatic() {
        if (_sarr.length > 5) return null;
        return _arr[1];  // code eliminated; the method returns null outright
    }
}

Martin Thompson

unread,
Apr 16, 2015, 2:01:54 AM4/16/15
to mechanica...@googlegroups.com
On 15 April 2015 at 22:34, Remi Forax <fo...@univ-mlv.fr> wrote:

Even without generics, there is a little trivia around java.util.HashMap, the boot process of the OpenJDK VM loads at least one LinkedHashMap (that inherits from HashMap) which defeat CHA so each time you do a get/put on a HashMap, the VM as first to check at runtime if the HashMap is a HashMap or a LinkedHashMap. 

That is at least relatively simple to fix by breaking the inheritance. So often DRY can cause too many design issues and restrictions when it comes to inheritance. I'd have no inheritance between the two and thus keep the implementations decoupled, in addition to the benefits to CHA.

Vitaly Davidovich

unread,
Apr 16, 2015, 7:21:10 AM4/16/15
to mechanica...@googlegroups.com

Oracle will never make such a change as it breaks backcompat.  And really, they'd have to make HashMap final or else someone else (JDK or another lib) can load their own subclass of HashMap and that will deopt your uses and recompile with the type check.

I suspect though that if final fields were trusted, enough occurrences of this would be squashed to make the remainder irrelevant.

sent from my phone


Aistis Raulinaitis

unread,
Apr 16, 2015, 1:30:07 PM4/16/15
to mechanica...@googlegroups.com
This to me seems more like "Why C# is slow" rather than the actual title. MLton has had a whole-program optimizing compiler for years now: http://mlton.org/WholeProgramOptimization

Vitaly Davidovich

unread,
Apr 16, 2015, 3:37:31 PM4/16/15
to mechanica...@googlegroups.com
I'm not familiar with MLton, but the blog mentions this upfront:

The reason most high level languages are slow is usually because of two reasons:
  1. They don’t play well with the cache.
  2. They have to do expensive garbage collections. 
But really, both of these boil down to a single reason: the language heavily encourages too many allocations. 


C# has a decent JIT compiler (Hotspot is even better for java), so I don't exactly see what WPO has to do with this.  Can you expand on how MLton addresses those concerns?


Aistis Raulinaitis

unread,
Apr 16, 2015, 4:18:25 PM4/16/15
to mechanica...@googlegroups.com
Well, you don't need a JIT; there is higher-order control-flow analysis that is done at compile time. On top of that, WPO will unroll loops, monomorphize functions, and inline dependencies, among many other things, all very good for cache coherence. Unless you are doing hard real-time, GC isn't your biggest trouble. Actually, it's quite easy to write programs that operate in constant space, so I don't see where the problem is.


Vitaly Davidovich

unread,
Apr 16, 2015, 4:42:42 PM4/16/15
to mechanica...@googlegroups.com

On top of that WPO will unroll loops, monomorphize functions, inline dependencies among many other things, all very good for cache coherence

C# (CLR) does all of this as well, but I think there may be a misunderstanding.  The blog is talking about "data" not playing well with the cache (nothing to do with coherence, though), not code -- what you're mentioning in the above sentence is mostly about saving CPU cycles that are wasted on things *other* than waiting on memory loads (or stores, for that matter).

The gripes in the blog basically come down to (a) poor/low use/support of stack allocations and prevalent use of references, (b) libraries written with heap allocations as if they're zero cost, and as a result, (c) GC causing latency spikes (this guy appears to be a game dev, so they're particularly sensitive there).  I did not see any concrete mention of compiler code quality there (apart from a few sentences talking about "smart compilers" in passing).

 Actually it's quite easy to write programs that operate in constant space, so I don't see where the problem is.

Trivial programs, perhaps.

pron

unread,
Apr 16, 2015, 5:38:22 PM4/16/15
to mechanica...@googlegroups.com
If I may chime in, I'd like to say why -- in spite of some shortcomings -- I see the JVM as the best bet for the future.

Some people here are fans of the single-threaded, as-fast-as-possible, as-streaming-as-possible approach, but when I think of performance I think of other kinds of systems. While current software trends (in some sub-industries) favor a distributed, micro-service architecture, it seems to be going against hardware trends. It's hard to make exact comparisons, but it seems to me (and Martin, correct me if I'm wrong) that data-center IO throughput is growing much more slowly than (multi-core) processing throughput, and the micro-service architecture requires -- in addition to data marshalling -- a fan-in/fan-out (or multiplexing/demultiplexing) of concurrency at each end. I'm thinking of interactive (as opposed to batch) systems where all processing as well as data-management are in-process (though modular), where each node has its own slice of a distributed data store contained in an in-memory transactional DB (or whatever kind of atomicity/isolation the application requires), and the inter-node IO is reduced as much as possible -- kind of like the tuple-space of old, only better. Each node has hundreds of cores and a few TB of RAM. Scaling such an architecture requires nonblocking data structures (or at least partially blocking), and that means a good GC. Combine that with next-gen JITs (that also exploit GPUs), and I think that the JVM is in the best position to take advantage of that kind of hardware. It is not necessarily geared towards the most latency-sensitive environments, but those where low(ish) latency is required while accessing huge amounts of in-memory data under high-concurrency workloads. I can't see such requirements addressed with any arena/thread-local allocation approach, and shared-nothing/actors can only address the logic parts of the application.

That's why I'm patient with respect to value-types, tail calls, continuations and the like, as adding all of those to the JVM is still much easier than getting any platform/language out there to where the JVM is today with respect to addressing those near-future systems.

Richard Warburton

unread,
Apr 16, 2015, 5:49:10 PM4/16/15
to mechanica...@googlegroups.com
Hi,

Unless you are doing hard-realtime, GC isn't your biggest trouble. Actually it's quite easy to write programs that operate in constant space, so I don't see where the problem is.

I don't find sweeping generalisations such as this particularly constructive to the discussion. The bounds you can put on your memory consumption are often very related to the kind of problem that you're trying to solve. Some systems have very clear cut bounds that can be put on their memory consumption (eg: simple data processing pipelines built on top of fixed size queues), in other cases its much harder to reason about. The amount of work being done can differ per task.

Not only that, but sometimes you get algorithms which can be implemented in constant memory space, but only if you apply a cunning trick. Depth-first search is an example that comes to mind. Oh, it's trivially easy to implement once you know the trick, but I doubt many developers would figure it out on their own.

Vitaly Davidovich

unread,
Apr 16, 2015, 6:17:45 PM4/16/15
to mechanica...@googlegroups.com
Modern cores are amazingly quick single-threaded if they're utilized well (that's the trick!).  If we're talking about machines with hundreds of cores and TBs of RAM, you probably want something close to a shared-nothing design; the simple reason is that even with non-blocking data structures, sharing memory amongst that many threads is going to create a ton of coherence traffic.  In these systems, you'll want to minimize cross-core/thread chatter, which means either sharing nothing or doing "bulk" messaging between them.  Also, let's not forget that one could leverage a highly parallel machine by utilizing separate processes (with their own additional isolation benefits) and doing IPC; the IPC protocol, if extremely chatty, is going to run slower than sharing memory, but either way, you wouldn't want to do that.  Nginx and postgres are two examples off the top of my head that use multiple procs rather than threads.  Single-threaded code is also much easier to reason about, develop, debug, etc, so even if you have GC to facilitate non-blocking data structures, it's still not easy.  Also, with 100+ core machines, it's not even clear whether current JVMs can scale to that without substantial additional engineering effort, never mind the underlying OS kernel.  My point here is that it all sounds great in theory, but reality is rarely that simple/clear.

As for various native allocation strategies, again, it's been done already many times (e.g. nginx, postgres, redis, memcached, etc as a few examples of high-perf servers).  As for GC, it's nice to punt all the hard work to it.  But, if you don't already, look at the hotspot-gc-use mailing list, search on Google, etc and you'll see people spending a ton of time trying to tune the darn thing -- it's a black art in and of itself.  If the language simply wasn't so heap-allocation heavy, a lot of those issues would simply go away on their own.  These java vs c++/rust/c/etc arguments that always talk about GC throughput are missing the point a bit -- systems written in those languages (I'm including Rust even though it's not mainstream yet, as it's got the same philosophy) simply don't allocate at nearly the same rate as java (or C#, although it has better facilities here).  C++ and Rust allow for zero-cost abstractions, whereas in Java that's much harder to achieve and lots of people resort to "ugly"/unidiomatic code and/or design to avoid those pitfalls.

Having said all that, I also really like the JVM (it's a serious piece of engineering) -- its heroics are the only reason why java code is somewhat competitive with c/c++.  But in some ways, it *has* to be heroic given language semantics: you need a fast GC if everything is heap allocated, you need a great JIT if everything is virtual by default, etc.  If one could marry C# (well, its VM type system) with the JIT compiler/interpreter combo from Hotspot, it would be a serious beast.


pron

unread,
Apr 16, 2015, 6:52:39 PM4/16/15
to mechanica...@googlegroups.com
Well, shared data-structures don't necessarily create a lot of coherence traffic, as they can exploit hierarchies that perfectly cancel out cache/memory/NUMA/distributed hierarchies; I did the math. It gets better if the DB runs in-process and actually assists in scheduling the logic (move code to data etc). 

And sure, people struggle with GC tuning because they've given up on struggling with C++ (I'm talking about the industry at large; not specific niches). The goal (again, for a general-purpose platform) is not to squeeze every ounce of performance, but to reach the sweet spot between performance and developer effort/cognition/psychology, and I think that's precisely what the JVM does best. It is much better for the platform itself to be a heroic feat than each program written on top of it. The way people struggle to get even simple Rust programs to compile means that those abstractions are far from zero-cost. The trick is to find those abstractions whose performance cost can be "scaled away" as much as possible. Mental effort does not scale at all. RAM overhead scales beautifully. Latency does not scale away, but if you're fine with a few milliseconds worth, then it's not going to get worse. Shared-nothing scales perfectly but is bought for a mental cost that doesn't. Shared data (with required atomicity/transaction -- and I'm not talking about general-purpose STM, which certainly doesn't scale) might or might not scale well -- we don't really know yet...


Vitaly Davidovich

unread,
Apr 16, 2015, 7:15:43 PM4/16/15
to mechanica...@googlegroups.com

I agree with a lot of what you say, but those are somewhat general and subjective statements, such as "find the balance between performance and maintenance".  Certainly if 100ms+ GC pauses don't breach SLAs or lose you money (likewise, if 100+ns cache miss penalties don't add up to a number that hurts), the conversation shifts purely to "what language/platform with good enough baseline perf suits my non-perf requirements"; for some it's java/Scala/clojure/another JVM language, for others it's Go, for someone else it's Haskell/Ocaml, etc.

The original blog though is talking about a performance pain in the domain where the above things hurt; you'll find similar thinking in any domain where request/frame response time budgets are in the millis/micros.  Simply different things.

As for Rust, I have high hopes for it.  It seems to have a well thought out design, performance is a paramount requirement (not a nice to have), and is being shaped by a perf sensitive project (servo).  There's going to be a learning curve for people as it's got some novel techniques (borrow checker, lifetimes, etc) and depending on which background people are coming from, ramp up time will vary.  There's nothing wrong with that.  The zero cost I was referring to was runtime cost, not anything else.

sent from my phone


Vitaly Davidovich

unread,
Apr 16, 2015, 7:40:42 PM4/16/15
to mechanica...@googlegroups.com

There's also the common saying/belief that simple languages move complexity into the application code.  Java being simple has its upsides, but it's definitely true that this simplicity makes some code/design more complex and bug prone (e.g. overuse of primitives to avoid heap objects, flyweight patterns over plain bytes, GC tuning wizardry - people make a business out of consulting in this space!, custom collections implemented million times, off heap storage, various hacks for erasure, Unsafe usage, cacheline padding via inheritance hacks, custom layout proposals such as ObjectLayout, nevermind the complexity of the JVM/JIT).
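(As one concrete example from that list, the cacheline padding via inheritance hack usually looks something like this sketch:)

class LhsPadding { protected long p1, p2, p3, p4, p5, p6, p7; }        // padding before the hot field
class HotValue extends LhsPadding { protected volatile long value; }
class RhsPadding extends HotValue { protected long p9, p10, p11, p12, p13, p14, p15; }  // padding after

final class PaddedCounter extends RhsPadding {   // use the outermost class so both pads survive field layout
    long get() { return value; }
    void set(long v) { value = v; }
}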

sent from my phone

Gary Mulder

unread,
Apr 17, 2015, 7:51:29 AM4/17/15
to mechanica...@googlegroups.com
On 16 April 2015 at 23:52, pron <ron.pr...@gmail.com> wrote:
Well, shared data-structures don't necessarily create a lot of coherence traffic, as they can exploit hierarchies that perfectly cancel out cache/memory/NUMA/distributed hierarchies; I did the math. It gets better if the DB runs in-process and actually assists in scheduling the logic (move code to data etc). 

It does seem however that both sides of the Von Neumann bottleneck are increasing geometrically in performance while main memory speeds are still only increasing linearly:


On the CPU side we have of course multi-core and larger L3 caches. On the persistent storage side we have flash, and flash is moving from the PCIe bus to the memory bus:


Regards,
Gary 

Daniel Lemire

unread,
Apr 17, 2015, 11:15:25 AM4/17/15
to mechanica...@googlegroups.com
On the CPU side we have of course multi-core and larger L3 caches.

Does anyone have a nice presentation on how the size of L3 caches has evolved and where they are headed? My impression is that we are nowhere near exponential growth, especially on a per-core basis.

Martin Thompson

unread,
Apr 17, 2015, 12:56:32 PM4/17/15
to mechanica...@googlegroups.com
The typical size of the L3 cache is one slice of 1-3MB per core on x86. This is staying fairly constant. We are likely to get approximately ~128MB of eDRAM per socket as an L4 in the not so distant future.

The interesting thing is that things are going NUMA even on socket with switches between rings that group the cores.

On 17 April 2015 at 16:15, Daniel Lemire <lem...@gmail.com> wrote:
On the CPU side we have of course multi-core and larger L3 caches.

Does anyone has a nice presentation on how the size of L3 caches have evolved and where they are headed? My impression is that we are nowhere near having an exponential growth, especially on a per core basis.


Jimmy Jia

unread,
Apr 17, 2015, 1:12:19 PM4/17/15
to mechanica...@googlegroups.com
Do you expect the on-package memory on the new Knights Landing chips to be more like higher-bandwidth main memory than like a Crystalwell-style L4 cache? If it's at all like an L4 cache, then that's at least one solution that gives you substantially more than 128 MB.

Martin Thompson

unread,
Apr 17, 2015, 1:34:32 PM4/17/15
to mechanica...@googlegroups.com
The memory that comes with Knights Landing is higher bandwidth, and much more of it. Though this is not a mainstream CPU.

Jimmy Jia

unread,
Apr 17, 2015, 3:32:31 PM4/17/15
to mechanica...@googlegroups.com
Yeah - it might be quite interesting to see what the latency characteristics look like, though. It's definitely not a mainstream CPU, but if that 16 GB is accessible at latencies more similar to Crystalwell than to main memory, it could be extremely interesting.