GC, allocation, cache effects, immutables, numerics!

106 views
Skip to first unread message

Charles Oliver Nutter

unread,
Nov 16, 2009, 5:27:36 PM11/16/09
to JVM Languages
I wanted to start a discussion about how various aspects of these new
JVM languages are affected by (or how they affect) GC and allocation
performance, and what the state-of-the-art answers are for making
these languages run well.

Cliff Click's talk at JVMLS made it obvious that even for typical Java
or C applications, the limitations of memory bandwidth and cache
locality represent the lion's share of performance loss in modern
applications. JVMs and compilers can do various tricks to improve
this, but at the end of the day you're limited by how much you can fit
into cache and how much back-and-forth you need across threads and
between the CPU and main memory. Given that knowledge...

* Dynamic languages on the JVM have to use boxed numerics most of the
time, which means that we're creating a lot of numeric objects. Some
of these may be nearly free, immediately-collectable. Some may be
eliminated by escape analysis in future versions of the JVM (e.g.
current JDK 7, which has EA on by default). But even with the best
tricks and best GC, the use of objects for numerics is still going to
be slower (on average) than primitives. How to cope with this?
* JVM languages that use closures are forced to heap-allocate
structures in which to hold closed-over values. That means every
instantiation of those closures allocates purely-transient objects,
populates them with data, and passes them down-stack for other code
bodies to use. In this case, the objects are likely to be
longer-lived, though still potentially "youngest" generation. More
troublesome, however, is that no current JVMs can inline closures
through a megamorphic intermediate call, so we lose almost all
inlining-based optimization opportunities like escape analysis or
throw/catch reduction to a jump.
* Languages like JRuby, which have to maintain "out of band" call
stack data (basically a synthetic frame stack for cross-call data)
have to either keep a large pre-allocated frame stack in memory or
allocate frames for each call. Both have obvious GC/allocation/cache
effects.

It seems like the current state-of-the-art GC for languages with these
issues would be something G1ish, where both "newborn" and "youthful"
objects can be collected en masse. As far as allocation goes, EA is
the only solution that seems likely to help, but it depends on being
able to inline...something that's hard to do for many calls in dynamic
languages but also for non-closure-converted calls in any of these
languages.

Another wrinkle is the use of immutable structures, as in Clojure. I'm
curious whether such systems generate more garbage than those that
permit the use of in-place-mutable structures (seems to me that they
would) and how that plays into memory bandwidth, allocation and GC
rates, and whether the bulk of the extra garbage is young or gets
tenured in typical usage.

We have been trying to explore the next steps for JRuby performance,
and we have begun to suspect that we're hitting memory/alloc/GC
bottlenecks for many (most?) cases. This is obviously harder to
investigate than straight-up execution performance, since even looking
at the Hotspot assembly output doesn't always make it clear what
memory effects a piece of code will have.

So what have you all been seeing, and what tools are you using to
investigate the memory/alloc/GC impact of your languages?

- Charlie

John Cowan

unread,
Nov 16, 2009, 7:45:16 PM11/16/09
to jvm-la...@googlegroups.com
On Mon, Nov 16, 2009 at 5:27 PM, Charles Oliver Nutter
<hea...@headius.com> wrote:

> * Dynamic languages on the JVM have to use boxed numerics most of the
> time, which means that we're creating a lot of numeric objects. Some
> of these may be nearly free, immediately-collectable. Some may be
> eliminated by escape analysis in future versions of the JVM (e.g.
> current JDK 7, which has EA on by default). But even with the best
> tricks and best GC, the use of objects for numerics is still going to
> be slower (on average) than primitives. How to cope with this?

In the end the answer will be fixnums. As things stand, Integers are
so slow that BigIntegers aren't much slower: consequently, I decided,
since I needed unlimited size integers, to use BigIntegers
exclusively. This eliminated a lot of annoying conversions, which now
happen only when talking to Java methods.

> * JVM languages that use closures are forced to heap-allocate
> structures in which to hold closed-over values. That means every
> instantiation of those closures allocates purely-transient objects,
> populates them with data, and passes them down-stack for other code
> bodies to use.

The key in this case is, I think, to create flat closures that contain
only the necessary variables, after boxing all assigned-to variables
so that you can copy the box pointers.

> Another wrinkle is the use of immutable structures, as in Clojure. I'm
> curious whether such systems generate more garbage than those that
> permit the use of in-place-mutable structures (seems to me that they
> would) and how that plays into memory bandwidth, allocation and GC
> rates, and whether the bulk of the extra garbage is young or gets
> tenured in typical usage.

My guess would be that most people only care about the most recent
state, which means that older states become young garbage.

--
GMail doesn't have rotating .sigs, but you can see mine at
http://www.ccil.org/~cowan/signatures

Jon Harrop

unread,
Nov 16, 2009, 8:33:52 PM11/16/09
to jvm-la...@googlegroups.com
On Monday 16 November 2009 22:27:36 Charles Oliver Nutter wrote:
> So what have you all been seeing, and what tools are you using to
> investigate the memory/alloc/GC impact of your languages?

FWIW, I'm using value types to reduce stress on the GC and inlining of
higher-order functions and their function arguments (to remove closures) in
F#.

I see allocation as contention because the heap is a global shared resource.
If you want to recover the performance of the previous generation of
standalone languages with GCs optimized for rapid recycling on a single
thread (e.g. OCaml) then you need to reimplement a mini-heap local to your
thread and that is left up to the user on VMs like the JVM and CLR.

JVM-based languages could perform the same inlining and use monomorphization
to avoid type erasure but the lack of value types is an insurmountable on the
JVM.

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e

Charles Oliver Nutter

unread,
Nov 16, 2009, 11:09:51 PM11/16/09
to jvm-la...@googlegroups.com
On Mon, Nov 16, 2009 at 6:45 PM, John Cowan <johnw...@gmail.com> wrote:
> In the end the answer will be fixnums.  As things stand, Integers are
> so slow that BigIntegers aren't much slower: consequently, I decided,
> since I needed unlimited size integers, to use BigIntegers
> exclusively.  This eliminated a lot of annoying conversions, which now
> happen only when talking to Java methods.

Fixnums are certainly the final answer, or even better, support for
general-purpose value types on the JVM. But in the interim, we need
other solutions. And of course, we need someone willing to work on
Fixnums. Maybe we need to make more noise about it? :)

I'm very intrigued by your idea to just use BigInteger everywhere. In
JRuby, we currently have Ruby Fixnum implemented using a Java long,
and then all math operations check for overflow to potentially return
a Bignum (wrapping a Java BigInteger). RubyFixnum itself is a Java
object with one reference field (Ruby metaclass), one int field (Ruby
object flags), and one long field (the Fixnum value). For every math
operation, we're allocating one of these (though like JDK and Clojure
we have a cache, in our case from -128..127) plus checking overflow. I
make all of this logic inline, but have not seen EA eliminate any of
it so far.

When we turn off some compatibility-mode features of JRuby (like the
artificial frames we maintain) we perform basically as well as Clojure
for math and dynamic calls...which I think is impressive since our
call pipeline is substantially more complicated. But pushing
performance any further requires something more.

Have you compared Integer/Long allocation plus overflow checks to just
using BigInteger? One of the irreducible pieces of this logic (which I
can see make it into the final assembly, even when inlining all logic)
is the overflow check...probably 15 x86 instructions for every math
operation, not counting typechecks and object allocations. I'd love to
find a way to eliminate that.

>> * JVM languages that use closures are forced to heap-allocate
>> structures in which to hold closed-over values. That means every
>> instantiation of those closures allocates purely-transient objects,
>> populates them with data, and passes them down-stack for other code
>> bodies to use.
>
> The key in this case is, I think, to create flat closures that contain
> only the necessary variables, after boxing all assigned-to variables
> so that you can copy the box pointers.

I know this is what Groovy does. In JRuby, we support this mode, but
there are some features of Ruby that complicate heap-allocating only
the local variables we can statically see are being accessed.
Ultimately though I'm not sure the size of the heap structure
allocated makes a great deal of difference...it's the allocation and
GC of that structure that are costly, regardless of the size. Plus the
difference between putting all potentially closed-over variables on
the heap versus a couple of them is only a few bytes of contiguous
memory. Peanuts.

>> Another wrinkle is the use of immutable structures, as in Clojure. I'm
>> curious whether such systems generate more garbage than those that
>> permit the use of in-place-mutable structures (seems to me that they
>> would) and how that plays into memory bandwidth, allocation and GC
>> rates, and whether the bulk of the extra garbage is young or gets
>> tenured in typical usage.
>
> My guess would be that most people only care about the most recent
> state, which means that older states become young garbage.

That's certainly true, but depending on when the older states become
garbage you could have a much higher turnover rate than for algorithms
that use a mutable structure in-place rather than spinning new
immutable structure for every mutation. At any rate, the problems
faced by immutable data structures on current JVMs are very similar to
those faced by JRuby, so I'm interested in finding a solution for both
of us.

- Charlie

Charles Oliver Nutter

unread,
Nov 16, 2009, 11:14:10 PM11/16/09
to jvm-la...@googlegroups.com
On Mon, Nov 16, 2009 at 7:33 PM, Jon Harrop <j...@ffconsultancy.com> wrote:
> I see allocation as contention because the heap is a global shared resource.
> If you want to recover the performance of the previous generation of
> standalone languages with GCs optimized for rapid recycling on a single
> thread (e.g. OCaml) then you need to reimplement a mini-heap local to your
> thread and that is left up to the user on VMs like the JVM and CLR.

On the JVM, at least on Hotspot, threads get their own thread-local
allocation buffers (TLABs), which reduces most allocations to
pointer-bumping. This reduces the contention for the heap, but it does
nothing to reduce the cost of rapidly burning through pointer
addresses, which leads to cache page misses more quickly than for
lower allocation rates. I suppose value types help remedy this problem
by localizing certain object allocations on the stack, so there's only
a single pointer bump entering a call and a single pointer bump
exiting it. Value types or stack-allocated data structures would
certainly be welcome on the JVM. I believe at the moment the only
answer we have been given is that escape analysis "should do this for
us", but so far I've seen only mixed results.

> JVM-based languages could perform the same inlining and use monomorphization
> to avoid type erasure but the lack of value types is an insurmountable on the
> JVM.

I'm not sure it's totally insurmountable, but the current answer (EA)
does not seem to be helping enough (not to mention being *heavily*
dependent on the JVM inlining the code you want to EA-optimize...a
special challenge for any languages with complicated or deep call
logic).

- Charlie

Attila Szegedi

unread,
Nov 17, 2009, 5:46:44 AM11/17/09
to jvm-la...@googlegroups.com
On 2009.11.16., at 23:27, Charles Oliver Nutter wrote:

> Another wrinkle is the use of immutable structures, as in Clojure. I'm
> curious whether such systems generate more garbage than those that
> permit the use of in-place-mutable structures (seems to me that they
> would)

Certainly - computation of any new value requires creation of at least one object.

> and how that plays into memory bandwidth, allocation and GC
> rates, and whether the bulk of the extra garbage is young or gets
> tenured in typical usage.

That'd be an interesting thing to measure, and I'd have to measure it as I can't guess at the right answer by just thinking about it :-)

On a side note, I had an interesting discussion with Cam Purdy and Rich Hickey back in Aarhus this year about VM design. One thing that came up is that it'd be really beneficial if GC could know that an object is immutable, as it could optimize certain operational aspects for immutable objects. Namely, if you move an immutable object as part of a compacting phase, you don't have to lock it down to prevent concurrent mutation, since there is no mutation at all. Unfortunately, this doesn't go as far as giving the GC a leisure to slowly update the references to the moved objects, because that would break pointer-based identity comparison. Nevertheless, you could have a separate compact phase of a compacting GC that'd compact immutable objects first and not worry about locking anything while it does it, then only do reference updates atomically under some lock once it's done compacting. Of course, you could also have a VM that doesn't use pointers for identity comparison...

This immutability information is currently not available (or insufficient) in JVM - final fields are a compiler construct not enforced on the bytecode level, and furthermore there are no immutable arrays.

> We have been trying to explore the next steps for JRuby performance,
> and we have begun to suspect that we're hitting memory/alloc/GC
> bottlenecks for many (most?) cases. This is obviously harder to
> investigate than straight-up execution performance, since even looking
> at the Hotspot assembly output doesn't always make it clear what
> memory effects a piece of code will have.

I know someone who's writing a tool that might help you with that. I'll check up on him whether/when he can (or wants to) come public with it.

Attila.

Rich Hickey

unread,
Nov 17, 2009, 7:48:40 AM11/17/09
to JVM Languages


On Nov 16, 11:09 pm, Charles Oliver Nutter <head...@headius.com>
wrote:
> On Mon, Nov 16, 2009 at 6:45 PM, John Cowan <johnwco...@gmail.com> wrote:
> > In the end the answer will be fixnums. As things stand, Integers are
> > so slow that BigIntegers aren't much slower: consequently, I decided,
> > since I needed unlimited size integers, to use BigIntegers
> > exclusively. This eliminated a lot of annoying conversions, which now
> > happen only when talking to Java methods.
>
> Fixnums are certainly the final answer, or even better, support for
> general-purpose value types on the JVM. But in the interim, we need
> other solutions. And of course, we need someone willing to work on
> Fixnums. Maybe we need to make more noise about it? :)
>

I think it is essential to separate the *two* problems with boxed
numbers. One problem is arithmetic with boxed numbers. EA, inlining,
lever compilers etc can certainly do something about this. Frederik
Öhrström gave an impressive demonstration at the JVM language summit
showing how even large structures used as numbers could provide
sufficient information to the compiler to enable their complete
removal in arithmetic.

The second, significant, and currently unavoidable problem with
objects as numbers is that they do, in fact, escape. All the time. In
order to be stored in our data structures. Obtaining their value
requires a pointer deref, and when incorporated into the long-term
information maintained by an application, they move out of the young
gen. They might increase the number of active older objects by an
order of magnitude or more vs a comparable Java program using
primitives as fields and in collections. It is this latter problem
that is bigger, and that fixnums address in a way I think nothing else
can. Until references can be made to hold both an object reference on
the heap and a number not on the heap, we are doomed to have to choose
one. And choosing objects yields a terrible memory profile, in pointer
nav, cache misses and GC.

> Have you compared Integer/Long allocation plus overflow checks to just
> using BigInteger? One of the irreducible pieces of this logic (which I
> can see make it into the final assembly, even when inlining all logic)
> is the overflow check...probably 15 x86 instructions for every math
> operation, not counting typechecks and object allocations. I'd love to
> find a way to eliminate that.
>

I think you'll find the actual time cost of overflow checking is
dwarfed by other things. CPU parallelism seems to do a decent job with
overflow checks.

> >> Another wrinkle is the use of immutable structures, as in Clojure. I'm
> >> curious whether such systems generate more garbage than those that
> >> permit the use of in-place-mutable structures (seems to me that they
> >> would) and how that plays into memory bandwidth, allocation and GC
> >> rates, and whether the bulk of the extra garbage is young or gets
> >> tenured in typical usage.
>
> > My guess would be that most people only care about the most recent
> > state, which means that older states become young garbage.
>
> That's certainly true, but depending on when the older states become
> garbage you could have a much higher turnover rate than for algorithms
> that use a mutable structure in-place rather than spinning new
> immutable structure for every mutation. At any rate, the problems
> faced by immutable data structures on current JVMs are very similar to
> those faced by JRuby, so I'm interested in finding a solution for both
> of us.
>

Immutable data structures have the very nice GC property that old
things are never pointing to younger things. In addition, never being
updated means they need not be volatile or have their access wrapped
in locks. Yes, there is ephemeral garbage, and making that cheaper is
always good. As Attila mentions in another message, being able to
communicate about immutability to the runtime, and have it leverage
that information, would be a huge win.

Rich

Charles Oliver Nutter

unread,
Nov 17, 2009, 11:17:28 AM11/17/09
to jvm-la...@googlegroups.com
On Tue, Nov 17, 2009 at 4:46 AM, Attila Szegedi <szeg...@gmail.com> wrote:
> On a side note, I had an interesting discussion with Cam Purdy and Rich Hickey back in Aarhus this year about VM design. One thing that came up is that it'd be really beneficial if GC could know that an object is immutable, as it could optimize certain operational aspects for immutable objects. Namely, if you move an immutable object as part of a compacting phase, you don't have to lock it down to prevent concurrent mutation, since there is no mutation at all. Unfortunately, this doesn't go as far as giving the GC a leisure to slowly update the references to the moved objects, because that would break pointer-based identity comparison. Nevertheless, you could have a separate compact phase of a compacting GC that'd compact immutable objects first and not worry about locking anything while it does it, then only do reference updates atomically under some lock once it's done compacting. Of course, you could also have a VM that doesn't use pointers for identity comparison...
>
> This immutability information is currently not available (or insufficient) in JVM - final fields are a compiler construct not enforced on the bytecode level, and furthermore there are no immutable arrays.

Yes, I've been communicating with folks on the Hotspot team about this
exact issue. We have a number of places in JRuby where we access final
fields repeatedly. Without Hotspot treating those fields as "really
final", those accesses are irreducible, so we do them over and over
and over again. Any special treatment for immutability would certainly
need to also treat finals as final, at the very least for purposes of
optimizing several inlined bodies that access the same final fields.

Our canonical example is the preamble to most code bodies that
retrieves a final org.jruby.Ruby instance off a thread-local object,
to avoid accessing it multiple times for other purposes. Even when I
can get multiple dyncalls to inline into the same compilation unit,
those accesses remain. Then we have finals for method caches, serial
numbers, and so on...all of which get accessed and re-accessed
repeatedly.

> I know someone who's writing a tool that might help you with that. I'll check up on him whether/when he can (or wants to) come public with it.

That would be spectacular.

- Charlie

Charles Oliver Nutter

unread,
Nov 17, 2009, 11:30:28 AM11/17/09
to jvm-la...@googlegroups.com
On Tue, Nov 17, 2009 at 6:48 AM, Rich Hickey <richh...@gmail.com> wrote:
> I think it is essential to separate the *two* problems with boxed
> numbers. One problem is arithmetic with boxed numbers. EA, inlining,
> lever compilers etc can certainly do something about this. Frederik
> Öhrström gave an impressive demonstration at the JVM language summit
> showing how even large structures used as numbers could provide
> sufficient information to the compiler to enable their complete
> removal in arithmetic.

And I have seen, for very limited cases, that some boxed numerics are
eliminated in the final assembly. But not enough of them to make a
difference when *all* numerics are boxed values, as in JRuby or
Clojure :(

Because of this I've resigned myself that without JVM help we're not
going to be able to compete with other Ruby implementations on raw
numeric performance, since all the C/C++ based Ruby impls use real
fixnums. And that makes me sad.

FWIW, we can come within 2x the performance of the fastest native
impl's numeric perf (on simple things like fib) so the JVM is
certainly doing a cracker-jack job making boxed numerics as fast as
possible. It's just not fast enough.

> The second, significant, and currently unavoidable problem with
> objects as numbers is that they do, in fact, escape. All the time. In
> order to be stored in our data structures. Obtaining their value
> requires a pointer deref, and when incorporated into the long-term
> information maintained by an application, they move out of the young
> gen. They might increase the number of active older objects by an
> order of magnitude or more vs a comparable Java program using
> primitives as fields and in collections. It is this latter problem
> that is bigger, and that fixnums address in a way I think nothing else
> can. Until references can be made to hold both an object reference on
> the heap and a number not on the heap, we are doomed to have to choose
> one. And choosing objects yields a terrible memory profile, in pointer
> nav, cache misses and GC.

This is one I have worried about. With all those boxed numbers
floating around, we eventually end up tenuring *numbers*, for pete's
sake, consuming older generations' space and increasing GC costs for
those generations. As in Clojure, we have a cache for a small range of
fixnums, but it's not good enough for anything but the simplest
iterations and algorithms. The value of caching larger fixnums rapidly
diminishes, and also can affect startup time since we need to
pre-populate that cache on startup. I agree there's no good answer
here, right now.

Maybe we should start a pool to hire someone to implement fixnums.

> I think you'll find the actual time cost of overflow checking is
> dwarfed by other things. CPU parallelism seems to do a decent job with
> overflow checks.

Assuming the overflow checks that eventually make it to asm are being
done in the most efficient way possible. I'll try to dig up the actual
x86 instructions later and we can discuss whether it's as quick as it
ought to be.

> Immutable data structures have the very nice GC property that old
> things are never pointing to younger things. In addition, never being
> updated means they need not be volatile or have their access wrapped
> in locks. Yes, there is ephemeral garbage, and making that cheaper is
> always good. As Attila mentions in another message, being able to
> communicate about immutability to the runtime, and have it leverage
> that information, would be a huge win.

I think it would be helpful even for mutable systems, since even then
we have a lot of immutable state and no way to tell the JVM about it.

It seems a lot of this comes down to not having (or not knowing) the
right tools to investigate hidden GC/memory/alloc/cache effects
resulting from the way our languages are put together and the way the
JVM compiles them. My tools of choice have been PrintGCDetails for
rough information, PrintOptoAssembly to see what asm is actually
coming out, and now starting to get into the ideal graph data and
visualizer to understand the optimization decisions Hotspot is making.
But none of those go very far toward investigating the impact of
irreducible allocations of heap scopes and numerics, but the
less-than-100% CPU utilization for most numeric algorithms in JRuby
tells me there's a lot of room to improve.

- Charlie

logi...@gmail.com

unread,
Nov 17, 2009, 2:37:31 PM11/17/09
to jvm-la...@googlegroups.com
One old fixnum solution proposed by Per Bothner many years ago
(he is on this mailing list so he might know the PDF he wrote I am referring to?)
Whenever I read something,my mind often over paraphrases things
(even makes up it's own version!) but I walked away with this

The primitive types that the JVM feels the need to box. Int/float/byte etc

All could be put into a "TAGGED" structure. This means

(jobject && BYTE_TAG) >> TAG_LSR_BYTE == jbyte value
(jobject && INT_TAG) >> TAG_LSR_INT == jint value

That basically many jobjects would actually point to places in memory that are never actually valid.
One bit is used to tell us that the type is TAGGED_AT_ALL specially that is a valid boxing of a primitive.
Some more bits are used to decide how the what type of primitive is boxed.
The remainder of the bits is the value. some types are totally subsumed Short/Byte.. some wont fit totally like
Long/Float/Long/Double/Int etc.. but very common values for them will still fit.
Some like Byte and Short don't need to use their full value range and use their impossible values to help host other primitives
types that would have fallen out of the possible value range.
Still others like Long that fall outside of available data bit range are stuck using their old legacy impl (the current impl!)

This makes boxing and unboxing most primitives "most often" a very tiny operation.
The boxed versions still pretend to be jobjects giving compatibility to most part of the code.. however every part that is used to
consuming a jobject will take on slight overhead to see if its a boxed value first.. but I think this would be worth it.

Is this correct interpretation?



Charles Oliver Nutter

unread,
Nov 17, 2009, 11:55:40 PM11/17/09
to jvm-la...@googlegroups.com
On Tue, Nov 17, 2009 at 11:37 AM, <logi...@gmail.com> wrote:
> One old fixnum solution proposed by Per Bothner many years ago
> (he is on this mailing list so he might know the PDF he wrote I am referring to?)
> Whenever I read something,my mind often over paraphrases things
> (even makes up it's own version!) but I walked away with this
>
> The primitive types that the JVM feels the need to box.  Int/float/byte etc
>
> All could be put into a "TAGGED" structure.   This means
>
> (jobject &&   BYTE_TAG) >> TAG_LSR_BYTE == jbyte value
> (jobject &&   INT_TAG) >> TAG_LSR_INT == jint value

Yes, this sounds like a typical "tagged pointer" way of representing
numerics as pseudo-references. John Rose also posted about Fixnums in
the JVM here:

http://blogs.sun.com/jrose/entry/fixnums_in_the_vm

As I understand it from talking to John and others, it would not be
particularly "difficult" to get fixnums into the JVM, but there's
nobody working on it at present. We have folks doing dynamic
invocation, interface injection, tail calls, continuations/coroutines
and more...but no fixnums yet.

- Charlie

Andrew Oliver

unread,
Nov 17, 2009, 11:58:55 PM11/17/09
to jvm-la...@googlegroups.com
This is something that might be a good candidate for openjdk and/or
icedtea. The only problem is the build seems to be regularly broken
in the repository. It would be really interesting to have a few
research forks with various improvements to primitives.
> --
>
> You received this message because you are subscribed to the Google Groups "JVM Languages" group.
> To post to this group, send email to jvm-la...@googlegroups.com.
> To unsubscribe from this group, send email to jvm-language...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/jvm-languages?hl=.
>
>
>

logi...@gmail.com

unread,
Nov 18, 2009, 8:41:58 AM11/18/09
to jvm-la...@googlegroups.com
Yes http://blogs.sun.com/jrose/entry/fixnums_in_the_vm
describes it exactly! I had thought it was Per Botner..
I must have read both Per's KAWA fixnums and that blog on the same day.
Although if my memory serves (which it hasn't),
there was someone starting to do this on the JikesRVM.

Charles Oliver Nutter

unread,
Nov 18, 2009, 11:39:31 AM11/18/09
to daizi sheng, JVM Languages
I'm replying to you and JVM-L, so let me know if you have further
trouble posting. Responses inline below!

On Tue, Nov 17, 2009 at 11:39 PM, daizi sheng <daizi...@gmail.com> wrote:
> I replied this post on JVML group, but it did not appeared there.
> So I directly mailed you for the following questions.
>
> Have you ever tried to use group of stack values instead of
> box primitive values and put them on heap? In CRuby and
> other c/c++ implemented script engines, tagged pointer (or
> union structure) are used. In JVM, this is not allowed, but we
> can still use group of JVM stack values to represent a script value.
>
> For example, to represent a local variable, the following group
> of value can be used, instead of a single reference to heap boxed
> value.
>
> byte the_current_type;
> int the_current_int_value;
> double the_current_double_value;
> Object other_case;
>
> We have tried this idea in a JVM based JavaScript engine.
> Both performance improvement and less memory consumption
> can be seed.

Our current plan for reducing fixnum overhead is to build a better
compiler. We've been working with Subbu Sastry on a new compiler for
JRuby that can locally reduce fixnums and fixnum operations to
primitive math, as well as inlining various pieces of logic *before*
we emit JVM bytecode. There are many details to work out (maintaining
a useful stack trace, for example) but the approach has promise. Some
of the most problematic performance areas for us are heavy local math,
math across only a couple Ruby methods, or numeric loops. In all these
cases, a "better compiler" should be able to reduce that logic to
primitive math rather than constructing fixnum objects for every math
operation.

It may very well be that using a "stack" of values like you suggest
would help too, but it obviously complicates compilation a bit. And if
we hoped to value-double along the call path, so we could pass a long
and a RubyFixnum reference and possibly leave the RubyFixnum reference
null, that means a lot of global changes.

One thing we do right now is to have separate paths for literal
integers on the LHS or RHS of a call. So for example, if 'a' is a
Fixnum, 'a + 1' never stands up the 1 as a RubyFixnum object; it just
passes it straight through to a RubyFixnum "op_plus" method that
accepts a long. This reduces the number of Fixnum objects we have to
have in flight at any given time, but it does not do anything to avoid
the return value being a regular boxed fixnum.

- Charlie

Daniel Hicks

unread,
Nov 19, 2009, 9:30:13 PM11/19/09
to JVM Languages


On Nov 16, 7:33 pm, Jon Harrop <j...@ffconsultancy.com> wrote:

> I see allocation as contention because the heap is a global shared resource.
> If you want to recover the performance of the previous generation of
> standalone languages with GCs optimized for rapid recycling on a single
> thread (e.g. OCaml) then you need to reimplement a mini-heap local to your
> thread and that is left up to the user on VMs like the JVM and CLR.

I don't know of any JVM that doesn't implement a thread-local heap, at
least in the "server" implementations.

Jon Harrop

unread,
Nov 19, 2009, 11:03:22 PM11/19/09
to jvm-la...@googlegroups.com
You mean a thread-local first generation, not a thread local heap.

Daniel Hicks

unread,
Nov 20, 2009, 7:23:56 PM11/20/09
to JVM Languages


On Nov 19, 10:03 pm, Jon Harrop <j...@ffconsultancy.com> wrote:
> On Friday 20 November 2009 02:30:13 Daniel Hicks wrote:
>
> > On Nov 16, 7:33 pm, Jon Harrop <j...@ffconsultancy.com> wrote:
> > > I see allocation as contention because the heap is a global shared
> > > resource. If you want to recover the performance of the previous
> > > generation of standalone languages with GCs optimized for rapid recycling
> > > on a single thread (e.g. OCaml) then you need to reimplement a mini-heap
> > > local to your thread and that is left up to the user on VMs like the JVM
> > > and CLR.
>
> > I don't know of any JVM that doesn't implement a thread-local heap, at
> > least in the "server" implementations.
>
> You mean a thread-local first generation, not a thread local heap.

The two most critical factors are having a lock-free allocation
protocol, and maintaining cache locality for relatively "new"
objects. Sun's algorithms do both of these fairly well, and there are
other algorithms in use that probably do them better. True local
heaps with read barriers to "promote" objects to global when globally
referenced have been implemented experimentally, but I don't think
there are any practical implementations.

Jon Harrop

unread,
Nov 20, 2009, 11:54:51 PM11/20/09
to jvm-la...@googlegroups.com
On Tuesday 17 November 2009 04:14:10 Charles Oliver Nutter wrote:
> On Mon, Nov 16, 2009 at 7:33 PM, Jon Harrop <j...@ffconsultancy.com> wrote:
> > I see allocation as contention because the heap is a global shared
> > resource. If you want to recover the performance of the previous
> > generation of standalone languages with GCs optimized for rapid recycling
> > on a single thread (e.g. OCaml) then you need to reimplement a mini-heap
> > local to your thread and that is left up to the user on VMs like the JVM
> > and CLR.
>
> On the JVM, at least on Hotspot, threads get their own thread-local
> allocation buffers (TLABs), which reduces most allocations to
> pointer-bumping.

Only for the mutator. For the GC, the object is then reallocated on the shared
heap and copied. That is expensive and it doesn't scale because the shared
heap is a global resource.

I was referring to a thread local *heap* rather than a generation. For
example, linked lists held as an array of cons cells where tails are indexes
into the array. This offers substantial performance improvements on the CLR
because it doesn't introduce any contention for the shared heap at all
(beyond the single allocation of the whole array) but it requires the array
elements (head+tail pairs) to be value types.

> This reduces the contention for the heap, but it does
> nothing to reduce the cost of rapidly burning through pointer
> addresses, which leads to cache page misses more quickly than for
> lower allocation rates. I suppose value types help remedy this problem
> by localizing certain object allocations on the stack, so there's only
> a single pointer bump entering a call and a single pointer bump
> exiting it.

Exactly. The benefits can be really huge in the context of complex numbers, 2D
vectors, vertex data for transfer to the hardware and unboxed array.

Perhaps the most compelling example of value types is hash tables: inserting
10M float->float bindings into a hash table is 32x faster in F# than in Java.
I suspect the reason is primarily that the JVM is boxing those floats whereas
the CLR is not.

> Value types or stack-allocated data structures would
> certainly be welcome on the JVM. I believe at the moment the only
> answer we have been given is that escape analysis "should do this for
> us", but so far I've seen only mixed results.

Does relying upon escape analysis make it impossible to pass unboxed arrays
through the FFI?

> > JVM-based languages could perform the same inlining and use
> > monomorphization to avoid type erasure but the lack of value types is an
> > insurmountable on the JVM.
>
> I'm not sure it's totally insurmountable, but the current answer (EA)
> does not seem to be helping enough (not to mention being *heavily*
> dependent on the JVM inlining the code you want to EA-optimize...a
> special challenge for any languages with complicated or deep call
> logic).

Yes.

Daniel Hicks

unread,
Nov 21, 2009, 8:08:25 AM11/21/09
to JVM Languages


On Nov 20, 10:54 pm, Jon Harrop <j...@ffconsultancy.com> wrote:
> On Tuesday 17 November 2009 04:14:10 Charles Oliver Nutter wrote:

> > On the JVM, at least on Hotspot, threads get their own thread-local
> > allocation buffers (TLABs), which reduces most allocations to
> > pointer-bumping.
>
> Only for the mutator. For the GC, the object is then reallocated on the shared
> heap and copied. That is expensive and it doesn't scale because the shared
> heap is a global resource.

I haven't studied Sun's algorithms in great detail, but my
understanding was that (depending on the specific algorithm chosen)
objects remain in the local heap until they are promoted, which might
not be for 2-3 GCs.

The "server" mode on IBM iSeries "classic" Java maintains objects in
"chunks" which are initially thread-allocated and only made global
when they "fill up" with long-lived objects. (This is a non-copying
algorithm.)

Charles Oliver Nutter

unread,
Nov 30, 2009, 1:48:36 PM11/30/09
to jvm-languages
On Fri, Nov 20, 2009 at 10:54 PM, Jon Harrop <j...@ffconsultancy.com> wrote:
> I was referring to a thread local *heap* rather than a generation. For
> example, linked lists held as an array of cons cells where tails are indexes
> into the array. This offers substantial performance improvements on the CLR
> because it doesn't introduce any contention for the shared heap at all
> (beyond the single allocation of the whole array) but it requires the array
> elements (head+tail pairs) to be value types.

I'm not sure I understand how CLR's support for stack-based value
types constitutes a thread-local heap. My understanding is that CLR
does not have any support for thread-local heaps, though the
stack-based stuff helps reduce GC pressure.

> Perhaps the most compelling example of value types is hash tables: inserting
> 10M float->float bindings into a hash table is 32x faster in F# than in Java.
> I suspect the reason is primarily that the JVM is boxing those floats whereas
> the CLR is not.

This also has nothing to do with thread-local heaps, and is a result
of CLR having fully-reified generics that can support primitive
values. A hand-written float->float map in Java should perform as well
or better than the generic version in CLR.

Of course not having to write the primitive version by hand would be
nice. I've written several myself over the years.

> Does relying upon escape analysis make it impossible to pass unboxed arrays
> through the FFI?

If you're referring to the FFI for calling into Java from Ruby, then
we're already pretty well stuck since there's a lot of megamorphic
code in the reflection call chain that would never inline enough for
EA to eliminate those arrays. This is a cost paid for all Ruby to Java
calls in JRuby and paid for even more calls in Groovy since most calls
end up being reflective. Method handles would make it possible for EA
to work across Java FFI, since the actual call target could inline all
the way back to the dynamic call site.

- Charlie

Robert Fischer

unread,
Nov 30, 2009, 2:04:54 PM11/30/09
to jvm-la...@googlegroups.com
Charles Oliver Nutter wrote:
> This also has nothing to do with thread-local heaps, and is a result
> of CLR having fully-reified generics that can support primitive
> values. A hand-written float->float map in Java should perform as well
> or better than the generic version in CLR.
>
> Of course not having to write the primitive version by hand would be
> nice. I've written several myself over the years.
>
>
This is what Trove is there for. It's not an actively maintained
project, but that's because it's pretty "done".

~~ Robert.

ijuma

unread,
Nov 30, 2009, 2:39:33 PM11/30/09
to JVM Languages
On Nov 30, 7:04 pm, Robert Fischer <robert.fisc...@smokejumperit.com>
wrote:
> This is what Trove is there for.  It's not an actively maintained
> project, but that's because it's pretty "done".

It is actively maintained and version 3.0 is in alpha now. The API is
much closer to the Java collections now. If you look at Trove, it uses
two primitive arrays since Java doesn't have support for struct-like
things. That means loss of data locality, which can be a big deal
these days. This could be fixed for int -> int maps, but it's more
complicated for other cases (e.g. int -> long, etc.).

Best,
Ismael

Robert Fischer

unread,
Nov 30, 2009, 3:05:03 PM11/30/09
to jvm-la...@googlegroups.com
Really? Awesome!

~~ Robert.
> --
>
> You received this message because you are subscribed to the Google Groups "JVM Languages" group.
> To post to this group, send email to jvm-la...@googlegroups.com.
> To unsubscribe from this group, send email to jvm-language...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/jvm-languages?hl=en.
>
>
>

ijuma

unread,
Nov 30, 2009, 3:34:45 PM11/30/09
to JVM Languages
On Nov 30, 8:05 pm, Robert Fischer <robert.fisc...@smokejumperit.com>
wrote:
> Really?  Awesome!

Yeap, 3.0a2 is here:

https://sourceforge.net/projects/trove4j/files/

Ismael

Jon Harrop

unread,
Nov 30, 2009, 5:49:01 PM11/30/09
to jvm-la...@googlegroups.com
I haven't benchmarked a hand-rolled JVM-based float->float hash table
against .NET's default one but the JVM is incapable of expressing value types
including entries in a hash table so you're probably still taking an
unnecessary performance hit.

Jon Harrop

unread,
Nov 30, 2009, 5:51:50 PM11/30/09
to jvm-la...@googlegroups.com
On Monday 30 November 2009 18:48:36 Charles Oliver Nutter wrote:
> On Fri, Nov 20, 2009 at 10:54 PM, Jon Harrop <j...@ffconsultancy.com> wrote:
> > I was referring to a thread local *heap* rather than a generation. For
> > example, linked lists held as an array of cons cells where tails are
> > indexes into the array. This offers substantial performance improvements
> > on the CLR because it doesn't introduce any contention for the shared
> > heap at all (beyond the single allocation of the whole array) but it
> > requires the array elements (head+tail pairs) to be value types.
>
> I'm not sure I understand how CLR's support for stack-based value
> types constitutes a thread-local heap. My understanding is that CLR
> does not have any support for thread-local heaps, though the
> stack-based stuff helps reduce GC pressure.

The CLR just facilitates this, it doesn't support it. You're essentially
writing your own pool allocator by hand in a .NET language.

> > Perhaps the most compelling example of value types is hash tables:
> > inserting 10M float->float bindings into a hash table is 32x faster in F#
> > than in Java. I suspect the reason is primarily that the JVM is boxing
> > those floats whereas the CLR is not.
>
> This also has nothing to do with thread-local heaps, and is a result
> of CLR having fully-reified generics that can support primitive
> values. A hand-written float->float map in Java should perform as well
> or better than the generic version in CLR.

Yes, exactly. I was talking generally about the advantages of value types
there and not specifically about thread-local heaps. Hash table performance
is probably the most compelling example for general interest in case you ever
need one.

> Of course not having to write the primitive version by hand would be
> nice. I've written several myself over the years.

Next best thing is to write code to write the primitive versions by hand, i.e.
Greenspun type-specializing generics. Clojure's macros might be good for
this. I'd hope Scala tries its hardest to do this itself but I haven't
actually looked.

> > Does relying upon escape analysis make it impossible to pass unboxed
> > arrays through the FFI?
>
> If you're referring to the FFI for calling into Java from Ruby, then
> we're already pretty well stuck since there's a lot of megamorphic
> code in the reflection call chain that would never inline enough for
> EA to eliminate those arrays. This is a cost paid for all Ruby to Java
> calls in JRuby and paid for even more calls in Groovy since most calls
> end up being reflective. Method handles would make it possible for EA
> to work across Java FFI, since the actual call target could inline all
> the way back to the dynamic call site.

Yes. Except for graphics, the performance of the FFI is growing less important
as high-level languages overtake low-level ones in terms of speed due to
parallelism.

Marcelo Fukushima

unread,
Nov 30, 2009, 4:55:43 PM11/30/09
to jvm-la...@googlegroups.com
trove uses 2 arrays (of float in this case) one holding the keys and
another holding the values, so it probably performs very well

i remember seeing somewhere that scala has a @specialized annotation
that makes type parameters acts as templates. I suppose they box /
unbox primitives in method signatures that uses the type parameters
though
> --
>
> You received this message because you are subscribed to the Google Groups "JVM Languages" group.
> To post to this group, send email to jvm-la...@googlegroups.com.
> To unsubscribe from this group, send email to jvm-language...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/jvm-languages?hl=en.
>
>
>



--
http://mapsdev.blogspot.com/
Marcelo Takeshi Fukushima

Jon Harrop

unread,
Nov 30, 2009, 7:48:27 PM11/30/09
to jvm-la...@googlegroups.com
On Monday 30 November 2009 21:55:43 Marcelo Fukushima wrote:
> trove uses 2 arrays (of float in this case) one holding the keys and
> another holding the values, so it probably performs very well

That's still 2x as many cache misses.

> i remember seeing somewhere that scala has a @specialized annotation
> that makes type parameters acts as templates. I suppose they box /
> unbox primitives in method signatures that uses the type parameters
> though

I see.

Marcelo Fukushima

unread,
Nov 30, 2009, 10:36:34 PM11/30/09
to jvm-la...@googlegroups.com
On Mon, Nov 30, 2009 at 10:48 PM, Jon Harrop <j...@ffconsultancy.com> wrote:
> On Monday 30 November 2009 21:55:43 Marcelo Fukushima wrote:
>> trove uses 2 arrays (of float in this case) one holding the keys and
>> another holding the values, so it probably performs very well
>
> That's still 2x as many cache misses.

pretty much, altough as Ismael mentioned before, you could implement
it better for ints and floats (and the shorter primitives)

>
>> i remember seeing somewhere that scala has a @specialized annotation
>> that makes type parameters acts as templates. I suppose they box /
>> unbox primitives in method signatures that uses the type parameters
>> though
>
> I see.

while i havent read it entirely yet, here's the first pointer i found
on this feature

http://lamp.epfl.ch/~dragos/files/scala-spec.pdf

and it'll be in the 2.8 release

>
> --
> Dr Jon Harrop, Flying Frog Consultancy Ltd.
> http://www.ffconsultancy.com/?e
>

ijuma

unread,
Dec 1, 2009, 2:57:21 AM12/1/09
to JVM Languages
On Dec 1, 3:36 am, Marcelo Fukushima <takesh...@gmail.com> wrote:
> while i havent read it entirely yet, here's the first pointer i found
> on this feature
>
> http://lamp.epfl.ch/~dragos/files/scala-spec.pdf

There is also:

http://www.scala-lang.org/sid/9

Best,
Ismael

Christian Vest Hansen

unread,
Dec 1, 2009, 4:37:39 AM12/1/09
to jvm-la...@googlegroups.com
On Tue, Dec 1, 2009 at 4:36 AM, Marcelo Fukushima <take...@gmail.com> wrote:
> On Mon, Nov 30, 2009 at 10:48 PM, Jon Harrop <j...@ffconsultancy.com> wrote:
>> On Monday 30 November 2009 21:55:43 Marcelo Fukushima wrote:
>>> trove uses 2 arrays (of float in this case) one holding the keys and
>>> another holding the values, so it probably performs very well
>>
>> That's still 2x as many cache misses.
>
> pretty much, altough as Ismael mentioned before, you could implement
> it better for ints and floats (and the shorter primitives)

I don't think the bit-width of the types have much to do with it,
although it certainly do have an effect on how many elements fit in
the caches.

As I understand things, a greater advantage comes from the fact that
same-typed keys and values can be kept in a single array. Searching a
single block of memory instead of two distinct blocks has potential
benefits from better pre-fetching and cache-locality. The hope is to
reduce two reads from main memory, to one read from memory and one
from cache.
Venlig hilsen / Kind regards,
Christian Vest Hansen.

ijuma

unread,
Dec 1, 2009, 5:02:35 AM12/1/09
to JVM Languages
On Dec 1, 9:37 am, Christian Vest Hansen <karmazi...@gmail.com> wrote:
> I don't think the bit-width of the types have much to do with it,
> although it certainly do have an effect on how many elements fit in
> the caches.
>
> As I understand things, a greater advantage comes from the fact that
> same-typed keys and values can be kept in a single array. Searching a
> single block of memory instead of two distinct blocks has potential
> benefits from better pre-fetching and cache-locality. The hope is to
> reduce two reads from main memory, to one read from memory and one
> from cache.

Right, that was my point indeed.

Best,
Ismael
Reply all
Reply to author
Forward
0 new messages