We had a great meeting, excellent speaker, and many first-time
attendees. There were many good questions asked too because the audience
included folks working on VMs or JIT compilers for Perl, Ruby, Smalltalk
and so on. While the presentation was not strictly on FP, it touched
upon many challenges shared by FP language implementors.
I've asked Chromatic (lead developer of Parrot) to take another set of
notes, so please defer to those in case of divergence because his
knowledge of this field is superior to mine. My notes are not direct
quotes, but are interpretations. Feel free to post corrections or additions.
Jim Blandy of Mozilla presented "Trace-Based Just-In-Time Compilation
For Highly-Dynamic Languages". The techniques presented have made major
performance improvements to Mozilla's JavaScript
interpreter. These techniques can be applied to many other languages.
Some techniques rely on specialized hardware and exploit its extra
memory, cores, caches and processor cycles.
Slides, bibliography, project and additional links
- Jim's homepage: http://www.red-bean.com/jimb/
- Presentation slides: http://www.red-bean.com/jimb/pdxfunc-tjit/
- Compiling notes:
- Related article: http://andreasgal.com/2008/08/22/tracing-the-web/
Jim has spent his whole career contributing to the Free Software
movement, including GNU Emacs, Guile, GDB, EGLIBC, Mozilla, Subversion,
and more.
The techniques discussed are relatively new, and their possibilities
haven't been thoroughly explored. Thus some may have edge cases that are
counterproductive, and these need to be guarded against.
- "For programming languages, good performance effectively improves
- Algorithms are vital. In the past, we often heard people say, "My
Perl program is faster than the C program it replaced!" This shouldn't
have been possible because Perl is written in C and thus should be
slower. However, because of Perl's optimized built-in features and its
ability to easily express better algorithms, programmers were able to
easily write faster code.
- Expressiveness is vital. Programs like `grep` make it easy to
perform useful operations in seemingly inefficient ways. Thus optimizing
the way real people use programming languages is necessary.
- Humans seem to counteract each step forward in technology by
adding slow new features, gimmicks, and sloppiness. Thus net gains in
performance are necessary yet often short-lived.
- Lessons have been learned from trying to write compilers for Scheme.
- Flow analysis is challenging to do with dynamic code.
- High-level type inference is needed to optimize code down to
efficient underlying machine types.
- Correctness may lead to an inefficient implementation (e.g., using
`bignums` in a loop that never needs more than an `int`).
- JWZ declared "WE MUST SHIP NAVIGATOR 2.0 IN TWO WEEKS!" and
Brendan Eich threw something together based on "Self" to provide scripting.
- There is now a performance race between Mozilla's SpiderMonkey,
Apple's SquirrelFish Extreme, and Google's V8.
- Performance is significant to users, with Apple emphasizing
"blindingly fast" as a marketing message.
- SpiderMonkey is a library, designed to be embedded easily into
other programs.
- It compiles JavaScript source to bytecode rather than interpreting
through each line.
- There are basic optimizations done to the bytecode, such as
sequencing instructions in such a way as to minimize the need for jumps.
- "Trace" + "SpiderMonkey" = "TraceMonkey", the tracing JIT stuff.
- Uses Adobe's "Nanojit", from their open-sourced "Tamarin" project,
to generate native code.
- The internals of SpiderMonkey are highly-optimized and rather
opaque. This isn't a criticism, but rather an inevitability.
Unfortunately, this also makes it hard to add optimizations into the
SpiderMonkey code directly.
- Rather than inspecting source code, the trace-based JIT compiler
observes the behavior of a program as it executes to identify hot
loops worth optimizing.
- Not everything is JIT'ed. This minimizes the number of potential
optimizations that must be checked, focuses on the most noticeable
optimizations, and avoids tricky counterproductive traps where
"optimized" code runs more slowly than the original.
TraceMonkey generated code
- The tracer records progress through code as it's executed...
- ...and when it sees the same code about to be run the second time,
often due to a jump, it classifies the enclosed area as a loop...
- ...and compiles the bytecode collected by the tracer for this loop...
- ...and executes it the third and subsequent times.
- Larger loops are more efficient for JIT compiling, and these
larger sections of code are collected thanks to the tracer watching the
generated bytecode to locate the loops, rather than naively stopping at
boundaries such as function calls.
- Code is generated for the types that are seen, making it
inexpensive to generate code for both an `int` and `bignum`, and bail
out from `int` to a `bignum` mid-execution if there's an overflow, thus
using the more-efficient-yet-possibly-incorrect type first, and then
falling back to the less-efficient-but-always-correct type when needed.
- Loop unrolling becomes trivial with trace-based JIT compilation
because you simply keep tracing until you see a jump back to code you've
already seen, and then JIT compile it. True loop unrolling seems
unnecessary on modern hardware.
- Jim showed the assembly code produced from the bytecode of the
traced loop.
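The record-on-the-second-pass, compile, then run-natively cycle described above can be sketched in miniature. This is a toy illustration, not TraceMonkey's actual design: the bytecode format, the opcodes, and "compiling" to a Python closure are all invented for the example.

```python
def exec_op(op, env):
    """Execute one non-jump instruction against the environment."""
    kind = op[0]
    if kind == "set":       # set dst to a constant
        env[op[1]] = op[2]
    elif kind == "add":     # dst = a + b
        env[op[1]] = env[op[2]] + env[op[3]]
    elif kind == "addc":    # dst = a + constant
        env[op[1]] = env[op[2]] + op[3]
    elif kind == "lt":      # dst = a < b
        env[op[1]] = env[op[2]] < env[op[3]]

def compile_trace(body, guard_var):
    """'Compile' a recorded loop body into a closure that runs the whole
    loop, bailing out when the loop guard fails."""
    def compiled(env):
        while True:
            for op in body:
                exec_op(op, env)
            if not env[guard_var]:
                return
    return compiled

def run(prog, env):
    traces = {}        # loop-header pc -> compiled loop
    recording = None   # (header pc, recorded ops) while tracing
    pc = 0
    while pc < len(prog):
        if pc in traces:
            traces[pc](env)   # third and later iterations run compiled code
            break             # in this toy, loop exit ends the program
        op = prog[pc]
        if recording and op[0] != "jt":
            recording[1].append(op)   # second iteration: record the body
        if op[0] == "jt":             # jump to target if var is true
            _, var, target = op
            if env[var]:
                if recording and target == recording[0]:
                    # second pass closed the loop: compile the recorded body
                    traces[target] = compile_trace(recording[1], var)
                    recording = None
                elif target < pc and target not in traces:
                    recording = (target, [])  # backward jump: start tracing
                pc = target
            else:
                pc += 1
        else:
            exec_op(op, env)
            pc += 1
    return env

# Sum the integers 0..n-1 with a counted loop; the loop header is pc 2.
prog = [
    ("set", "i", 0),
    ("set", "s", 0),
    ("add", "s", "s", "i"),   # pc 2: loop header
    ("addc", "i", "i", 1),
    ("lt", "t", "i", "n"),
    ("jt", "t", 2),           # jump back while i < n
]
print(run(prog, {"n": 10})["s"])   # prints 45
```

The first iteration is purely interpreted, the second is interpreted while being recorded, and from the third on the compiled closure runs until its guard fails, mirroring the three-stage cycle in the notes.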
Static single assignment (SSA) form
- SSA form describes a program where every variable is assigned to
only once.
- Useful because it's possible to create a functionally-equivalent
intermediate representation of code using SSA form where a larger
section of code is identical and can be JIT compiled.
- E.g., assume that if a variable can be assigned by both the `if`
and `then` sides of a conditional, then we could create an intermediate
representation that features two versions of this variable that can be
compiled together more efficiently.
- Implementors of Java Mobile Edition used this technique to create
a fast, efficient JIT that could operate in a fraction of memory
available on a cellphone by reducing the amount of code that needed to
be compiled and stored separately.
- Generating SSA form code takes computation, but thanks to
ingenious algorithms and some extra processor cycles, it's possible to
create significantly faster code as a result.
- Further details on SSA:
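Because a recorded trace is straight-line code with no control-flow joins, putting it into SSA form reduces to simple renaming: each assignment gets a fresh version of its target variable. A minimal sketch, with an invented instruction format:

```python
def to_ssa(trace):
    """Rename a straight-line trace so each variable is assigned exactly
    once. With no joins, no phi functions are needed."""
    version = {}                          # variable -> latest version number
    def use(v):                           # current SSA name for a read
        return f"{v}{version.get(v, 0)}"
    out = []
    for dst, op, *args in trace:
        new_args = [use(a) if isinstance(a, str) else a for a in args]
        version[dst] = version.get(dst, 0) + 1
        out.append((f"{dst}{version[dst]}", op, *new_args))
    return out

# s = s + i; i = i + 1; s = s * 2  becomes versioned assignments:
trace = [("s", "add", "s", "i"),
         ("i", "add", "i", 1),
         ("s", "mul", "s", 2)]
for ins in to_ssa(trace):
    print(ins)
# ('s1', 'add', 's0', 'i0')
# ('i1', 'add', 'i0', 1)
# ('s2', 'mul', 's1', 2)
```

Once every value has a single definition, optimizations like constant propagation and dead-code elimination become straightforward bookkeeping.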
Tracing function calls
- Inlining functions becomes easy by tracing across boundaries and
thus identifying common blocks.
Tracing virtual function calls
- JavaScript objects use prototype-based delegation. Each object is
really just an associative array (AKA dict, hash). Each method call is
really just looking up an element in the associative array and calling
the function assigned to that named element. If a method isn't found in
the object, it asks the object's prototype for the item.
- As objects are dynamically altered, their "shape" changes in terms
of their internal representation.
- The shape versions are stored in a tree to provide a way to
efficiently add properties to pre-existing shapes.
- Checking the object's shape is vital to identifying the appropriate
compiled code to run.
- Keep multiple trace paths in a common tree with branches for each
observed variation...
- ...and JIT compile and store optimized code in the branches...
- ...and run native code as loops are identified.
- Folding common branch tails within the tree makes it possible to
graft them back together to reduce total native code generated, vital in
situations where there can be many tails.
- Nested trees provide a way to split out code into separate trees
that can then be joined as needed to efficiently bounce execution
between a series of compiled sections.
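The "shape" idea above can be sketched as a tree of shape transitions: objects that gain the same properties in the same order end up sharing one shape, so compiled code can guard on cheap shape identity instead of doing a full associative-array lookup. A toy model, with class names and layout invented for illustration:

```python
class Shape:
    """A node in the shape-transition tree; `slots` maps each property
    of objects with this shape to a storage index."""
    def __init__(self, parent=None, prop=None):
        self.transitions = {}                  # property name -> child Shape
        self.slots = dict(parent.slots) if parent else {}
        if prop is not None:
            self.slots[prop] = len(self.slots)

    def add(self, prop):
        """Find or create the shape reached by adding `prop`."""
        if prop not in self.transitions:
            self.transitions[prop] = Shape(self, prop)
        return self.transitions[prop]

EMPTY_SHAPE = Shape()

class Obj:
    def __init__(self):
        self.shape, self.storage = EMPTY_SHAPE, []
    def set(self, prop, value):
        if prop in self.shape.slots:           # existing property: overwrite
            self.storage[self.shape.slots[prop]] = value
        else:                                  # new property: shape changes
            self.shape = self.shape.add(prop)
            self.storage.append(value)
    def get(self, prop):
        return self.storage[self.shape.slots[prop]]

a, b = Obj(), Obj()
for obj, x, y in [(a, 1, 2), (b, 3, 4)]:
    obj.set("x", x); obj.set("y", y)
print(a.shape is b.shape)   # True: same construction order, same shape
```

A compiled trace specialized for `a`'s shape can reuse the fixed slot index for each property, with a single `obj.shape is expected_shape` check as its guard.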
- Fancy constant propagation.
- If an object can be detected as immutable, it'd be possible to
eliminate many otherwise-wasted checks of its type.
- E.g., if you know that a variable will remain an `int` for the
duration of a loop, you don't need to have code called within the loop
checking if it's still an `int`.
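A minimal sketch of that last point: per-iteration type guards in a recorded loop body can be hoisted into a run-once preamble when the guarded variable is never reassigned inside the loop. The op format here is invented for illustration.

```python
def hoist_guards(body):
    """Split a loop body into (preamble, loop): guards on variables the
    loop never reassigns only need to run once, before entry."""
    assigned = {op[1] for op in body if op[0] == "assign"}
    preamble = [op for op in body
                if op[0] == "guard_int" and op[1] not in assigned]
    loop = [op for op in body if op not in preamble]
    return preamble, loop

body = [("guard_int", "n"),              # n never changes: hoistable
        ("guard_int", "i"),              # i is reassigned: must stay
        ("assign", "i", "i + 1"),
        ("assign", "s", "s + i")]
pre, loop = hoist_guards(body)
print(pre)    # [('guard_int', 'n')]
print(loop)   # [('guard_int', 'i'), ('assign', 'i', 'i + 1'), ('assign', 's', 's + i')]
```

The check on `n` runs once before the compiled loop starts, while the loop itself keeps only the guards it genuinely needs.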
Some seemingly obvious optimizations turn out to be
counterproductive. Therefore, it's useful to build rules that recognize
problematic cases using blacklists to prevent any attempt to optimize them.
GC throws away all JIT compiled code because the addresses and such
change. Thus knowing how the GC works and working with it is vital.
JIT in Haskell
- Value type reduction replaces types with optimized versions,
apparently because the compiler knows variables can't change types.
- Haskell typically replaces entire functions.
- New hot-spot optimizer can deal with smaller sections.
JIT and multiple cores
- New hardware will have more cores, and using them efficiently is a
major challenge.
- A linear piece of code in SSA form could easily be parallelized
efficiently if the sections are large enough.
- Demo of Squeak application on a Tilera TILE64 board with 64
processors was able to run quickly by populating core caches and
bouncing execution between them rapidly.
- Major challenges include limited bus bandwidth between cores,
dissimilar performance from various caches, and limits on how far
hardware/software can look ahead and optimize.
- Future changes include GPGPU (general-purpose computing on
graphics processing units), where hardware like nVidia's CUDA & Tesla
(marketing slogan: "960 cores for under $9,999"), AMD's FireStream, and
Intel's Larrabee provides APIs for accessing fast, multi-core,
multi-pipeline, multi-threaded video cards that can deliver superior
computing performance by offloading work from the machine's main CPU
cores.
- Dave Ungar presenting Squeak running on the TILE64:
- Tilera TILE64: http://www.tilera.com/products/TILE64.php
- Discussion on TILE64:
- Larrabee: http://en.wikipedia.org/wiki/Larrabee_(GPU)
JIT in Java
- Has proven the value of JIT to many by making Java way faster than
it was with interpretation alone.
- ARM's Jazelle processors provide hardware-optimized execution of
some, but not all, Java bytecodes. The system can bounce a program
between an incomplete hardware implementation of the JVM, to provide
fast operations, and a complete software implementation, as needed. The
software gives the hardware the memory address of the bytecode and lets
it take over execution until it bails out back to the software JVM. This
caused a spontaneous outburst from an audience member: "That's horrific
on so many levels!!"
this optimization more valuable, and more easily accessible.
So many optimizations are being added to the code that it's tricky
to adapt TraceMonkey to these changes.
"Stalin" was an aggressive whole-program optimizing Scheme
implementation. By analyzing the entire program during compilation, as
opposed to just bits of it during runtime, it could produce
highly-optimized executables that were often significantly faster than
those produced from hand-written C code. Unfortunately, it could take
days to create these executables.
We retired to Produce Row, where discussion topics included:
- Discussion of "State of Portland Tech":
- Discussion on OurPDX "Historical Perspective" post:
- Scientific management of Type 1 Diabetes.
- Tests for determining right/left brain preference, e.g.,:
- Psycho-olfactory quantum perfume analysis, e.g.,:
- What we'd see when traveling faster than the speed of light.
- Sci-fi book club discussion on stories by Isaac Asimov, Larry
- Much more.
Thanks --- I had a great time!
The notes look great in general, but there was one part where I didn't
get across what I'd intended:
> Static single assignment (SSA) form
> - SSA form describes a program where every variable is assigned to
> only once.
> - Useful because it's possible to create a functionally-equivalent
> intermediate representation of code using SSA form where a larger
> section of code is identical and can be JIT compiled.
> - E.g., assume that if a variable can be assigned by both the `if`
> and `then` sides of a conditional, then we could create an intermediate
> representation that features two versions of this variable that can be
> compiled together more efficiently.
> - Implementors of Java Mobile Edition used this technique to create
> a fast, efficient JIT that could operate in a fraction of memory
> available on a cellphone by reducing the amount of code that needed to
> be compiled and stored separately.
> - Generating SSA form code takes computation, but thanks to
> ingenious algorithms and some extra processor cycles, it's possible to
> create significantly faster code as a result.
> - Further details on SSA:
Here's what I was trying to say:
Once a program has been put in Static Single Assignment form, many
optimizations become much easier. There are many papers describing
optimizations that begin, "First, put the code in SSA form. Then..."
It's been known for quite a while that, if you could get a program
into SSA form, then you'd be in a good position, but SSA still wasn't
widely used even in highly optimizing, traditional ahead-of-time
compilers because there was no good algorithm for getting code into
SSA in the first place. In 1991, Ron Cytron et al found a good
algorithm, and SSA has since become widely used. However, even Cytron's
algorithm is a bit heavy for embedded devices, and I don't believe
it's widely used there.
The reason I brought up SSA at the talk is that, by their nature,
tracing JITs only compile either single loop traces, or trace trees
--- not arbitrary control flow graphs. This restriction happens to
mean that it is always trivial to put code traces in SSA --- the cases
that make Cytron's algorithm necessary simply don't arise. Thus, we
can apply SSA-based optimizations without incurring the cost of
converting to SSA. If, in the future, our trace collection techniques
became more sophisticated and we began to collect traces that were
more general in shape (not just trees whose leaves jump back to the
root), then it might become expensive to put the tree in SSA, and we'd
be in the same position as everyone else. We're just protected at the
moment by our naivete.
So: SSA is a technique that (since 1991, anyway) is very practical for
use in ahead-of-time compilers with lots of room (memory, time). It
is not, in its general form, practical on mobile devices, as far as I
know. However, because we compile traces, not source programs, it
happens to be trivial for us to put code in SSA form, so we can take
advantage of its properties.
Thanks for the clarification!