Cliff Click's talk at JVMLS made it obvious that even for typical Java
or C applications, memory bandwidth and cache locality account for the
lion's share of lost performance. JVMs and compilers can do various
tricks to improve this, but at the end of the day you're limited by how
much you can fit into cache and how much back-and-forth you need across
threads and between the CPU and main memory. Given that knowledge...
* Dynamic languages on the JVM have to use boxed numerics most of the
time, which means that we're creating a lot of numeric objects. Some
of these may be nearly free, immediately-collectable. Some may be
eliminated by escape analysis (EA) in newer JVMs (e.g. current JDK 7
builds, which have EA on by default). But even with the best tricks
and best GC, the use of objects for numerics is still going to be
slower (on average) than primitives. How to cope with this?
* JVM languages that use closures are forced to heap-allocate
structures in which to hold closed-over values. That means every
instantiation of those closures allocates transient objects,
populates them with data, and passes them down-stack for other code
bodies to use. These objects are likely to live longer than boxed
numerics, though potentially still within the "youngest" generation.
More troublesome, however, is that no current JVMs can inline closures
through a megamorphic intermediate call, so we lose almost all
inlining-based optimization opportunities, like escape analysis or
reducing throw/catch to a jump.
* Languages like JRuby, which have to maintain "out of band" call
stack data (basically a synthetic frame stack for cross-call data),
have to either keep a large pre-allocated frame stack in memory or
allocate frames for each call. Both have obvious GC/allocation/cache
effects.
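To make the three allocation sources above a bit more concrete, here is
a rough Java sketch. It is only an illustration under assumptions: the
names (AllocationSketch, boxedSum, counterFrom, Frame, FrameStack) are
made up for this example, the lambda syntax is modern Java rather than
anything a dynamic-language compiler emits, and none of it is meant to
show how JRuby or any other runtime actually implements these things.

```java
import java.util.function.IntSupplier;

final class AllocationSketch {
    // Boxed arithmetic: each += unboxes, adds, and re-boxes via Long.valueOf.
    // Results outside the small Long cache (-128..127) need a fresh object
    // unless the JIT manages to eliminate it.
    static long boxedSum(int n) {
        Long sum = 0L;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    // Same loop on primitives: no objects involved at all.
    static long primitiveSum(int n) {
        long sum = 0L;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    // Closure capture: mutable closed-over state has to live in a heap cell
    // (here a one-element array), and the lambda holds a reference to it.
    static IntSupplier counterFrom(int start) {
        int[] cell = { start };
        return () -> cell[0]++;
    }

    // Hypothetical out-of-band frame holding cross-call data.
    static final class Frame {
        Object self;
        String methodName;
    }

    // Pre-allocated frame stack: no allocation per call, but the whole
    // array of frames stays resident in memory. No overflow handling here.
    static final class FrameStack {
        private final Frame[] frames;
        private int depth;

        FrameStack(int capacity) {
            frames = new Frame[capacity];
            for (int i = 0; i < capacity; i++) {
                frames[i] = new Frame();
            }
        }

        Frame push(Object self, String methodName) {
            Frame f = frames[depth++];
            f.self = self;
            f.methodName = methodName;
            return f;
        }

        void pop() {
            Frame f = frames[--depth];
            f.self = null;       // clear references so the frame does not
            f.methodName = null; // keep otherwise-dead objects reachable
        }

        int depth() {
            return depth;
        }
    }

    public static void main(String[] args) {
        System.out.println(boxedSum(1_000) == primitiveSum(1_000));
        IntSupplier next = counterFrom(5);
        System.out.println(next.getAsInt() + ", " + next.getAsInt());
        FrameStack stack = new FrameStack(16);
        stack.push("self", "foo");
        stack.pop();
    }
}
```

The alternative to FrameStack is simply `new Frame()` on every call,
which trades the resident array for a steady stream of young garbage.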
It seems like the current state-of-the-art GC for languages with these
issues would be something G1ish, where both "newborn" and "youthful"
objects can be collected en masse. As far as allocation goes, EA is
the only solution that seems likely to help, but it depends on being
able to inline, which is hard to do for many calls in dynamic
languages and also for non-closure-converted calls in any of these
languages.
Another wrinkle is the use of immutable structures, as in Clojure. I'm
curious whether such systems generate more garbage than those that
permit the use of in-place-mutable structures (seems to me that they
would) and how that plays into memory bandwidth, allocation and GC
rates, and whether the bulk of the extra garbage is young or gets
tenured in typical usage.
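As a rough picture of where the extra garbage would come from, here is
a deliberately naive Java sketch. The names (UpdateGarbageSketch,
ImmutableVec, of, with) are hypothetical, and this is not how Clojure
does it: Clojure's persistent collections share structure between
versions, so an update copies far less than the whole-array copy shown
here. The point is only that an immutable update has to allocate
something, while an in-place update allocates nothing.

```java
import java.util.Arrays;

final class UpdateGarbageSketch {
    // Naive immutable vector: every update copies the whole backing array.
    static final class ImmutableVec {
        private final int[] items;

        private ImmutableVec(int[] owned) {
            this.items = owned;
        }

        static ImmutableVec of(int... values) {
            return new ImmutableVec(values.clone());
        }

        // Returns a new vector; the old one is untouched and becomes
        // garbage as soon as nothing references it any more.
        ImmutableVec with(int index, int value) {
            int[] copy = items.clone();
            copy[index] = value;
            return new ImmutableVec(copy);
        }

        int get(int index) {
            return items[index];
        }
    }

    public static void main(String[] args) {
        ImmutableVec v1 = ImmutableVec.of(1, 2, 3);
        ImmutableVec v2 = v1.with(1, 42); // allocates a wrapper and an array

        int[] mutable = { 1, 2, 3 };
        mutable[1] = 42;                  // no allocation

        System.out.println(v1.get(1) + " " + v2.get(1) + " " + Arrays.toString(mutable));
    }
}
```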
We have been trying to explore the next steps for JRuby performance,
and we have begun to suspect that we're hitting memory/alloc/GC
bottlenecks for many (most?) cases. This is obviously harder to
investigate than straight-up execution performance, since even looking
at the HotSpot assembly output doesn't always make it clear what
memory effects a piece of code will have.
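One crude way to get allocation numbers from inside the JVM, offered
only as a sketch: it assumes a HotSpot-based JVM that provides the
com.sun.management.ThreadMXBean extension, which is not part of the
standard java.lang.management API and may be missing or disabled
elsewhere. AllocationProbe and allocatedBytes are names invented for
this example, and the figure returned is an approximation.

```java
import java.lang.management.ManagementFactory;

final class AllocationProbe {
    // Returns the approximate number of bytes the current thread allocated
    // while running `work`, or -1 if this JVM cannot report it.
    static long allocatedBytes(Runnable work) {
        java.lang.management.ThreadMXBean standard = ManagementFactory.getThreadMXBean();
        if (!(standard instanceof com.sun.management.ThreadMXBean)) {
            return -1L; // assumption: extension is HotSpot-specific
        }
        com.sun.management.ThreadMXBean bean = (com.sun.management.ThreadMXBean) standard;
        if (!bean.isThreadAllocatedMemorySupported() || !bean.isThreadAllocatedMemoryEnabled()) {
            return -1L;
        }
        long id = Thread.currentThread().getId();
        long before = bean.getThreadAllocatedBytes(id);
        work.run();
        return bean.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String[] args) {
        // Store the array somewhere reachable so the allocation cannot be
        // optimized away.
        byte[][] sink = new byte[1][];
        long bytes = allocatedBytes(() -> sink[0] = new byte[1 << 20]);
        System.out.println("allocated bytes (approximate): " + bytes);
    }
}
```

This only counts bytes allocated; it says nothing about how long the
objects live or what they do to the cache, so it is at best a first
filter before looking at GC logs or a profiler.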
So what have you all been seeing, and what tools are you using to
investigate the memory/alloc/GC impact of your languages?
- Charlie