Learning about VMs: syllabus

David Griswold

unread,

Oct 14, 2006, 9:24:43 AM10/14/06

to strongtal...@googlegroups.com

Hi all,

So, for those of you who want to work on the Strongtalk VM, you will need to
have a solid background in modern object-oriented virtual machine design,
and the reasons why it is done that way. I am assuming you have some kind
of system programming experience, knowledge of C and C++, and have at least
taken a course or read a book on basic compiler design. At the highest
level, here is a checklist of some of the basic principles you will need to
have a good understanding of:

- how method dispatch has traditionally been done in Smalltalk
(inline-caching) and why it works (most sends have low dynamic polymorphism)

- tagging in unified data-model languages (allowing unboxed small integers)

- closures: what they are and the three basic types of blocks in a
full-closure Smalltalk: open (pure functions with no closure), copying
(constant closure can be constructed independent of context), and full
(closure is mutable so context must be reified).

- how modern garbage-collectors work (specifically, generation scavenging
and write-barriers, card marking vs. remembered sets, and compacting
mark-sweep).

- the disappointing history of efforts to make Smalltalk faster
(Deutsch-Schiffman dynamic translation, SOAR), and why just compiling
Smalltalk in a traditional way without inlining produces very little speedup
(high call-frequency defeats optimization without inlining, polymorphism
defeats simple inlining schemes).

- some basic facts about modern CPU architecture: why self-modifying code is
very slow, why computed and indirect branches are slow.

- how type-feedback and PICs (polymorphic inline-caches) work

Fortunately, there is an online syllabus for an excellent course on OO
virtual-machine design that was given at Aarhus university by Lar Bak, who
was the lead engineer on the Strongtalk VM.

My suggestion is that you work through the curriculum for the course,
focusing on the non-Java related stuff (Java has quite a few differences
because of the manifest typesystem and non-OO basic types). That will get
you a long way towards understanding the basic principles.

The course syllabus is at: http://www.daimi.au.dk/~ups/OOVM/. There are
links to most of the important papers you will need to read there.

Cheers,
Dave

Thiago Silva

unread,

Oct 14, 2006, 10:03:15 AM10/14/06

to strongtal...@googlegroups.com

Hi Dave,

Thanks for the checklist. I'm going to study ASAP :-)

Thiago Silva

tim Rowledge

unread,

Oct 14, 2006, 2:16:24 PM10/14/06

to strongtal...@googlegroups.com

Thanks for that pointer Dave; if people want a lighter start I would
offer my own chapter in the squeak 'nublue' book which is available
online as http://www.rowledge.org/tim/squeak/OE-Tour.html

Since it is a simple, short, description of the squeak vm workings it
might get readers off to a gentle start and prepare them for the rest.

tim
--
tim Rowledge; t...@rowledge.org; http://www.rowledge.org/tim
Oxymorons: Legally drunk

br...@kampjes.demon.co.uk

unread,

Oct 14, 2006, 6:14:22 PM10/14/06

to strongtal...@googlegroups.com

David Griswold writes:
> - some basic facts about modern CPU architecture: why self-modifying code is
> very slow, why computed and indirect branches are slow.

A good book on modern CPU architecture is "Computer Architecture: A
Quantitative Approach" by John L. Hennessy and David A. Patterson
published by Morgan Kaufmann. ISBN: 1558807242. It's about 900 pages
but the chapters can be read individually. It's a tour through current
hardware design from CPU pipelines to building clusters. For the
fundamentals today it's probably the best thing available.

For details on the instruction sets the guides produced by AMD and
Intel are fine both are freely available on the web. They include
everything that you'll need to know about the x86 instruction set.
Either companies instruction set guides are good, I think Intel's are
better. AMD's optimization guide is better, it has a lot more detail
on what's going on. Intel's optimization guide has too much space
dedicated to top tips and too little to how things work.

http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_7044,00.html

http://www.intel.com/design/pentium4/manuals/index_new.htm

Branch predictor information is harder to find. There's definitely
information available but I don't have links handy. "Computer
Architecture" covers the techniques.

Bryce

David Griswold

unread,

Oct 15, 2006, 4:11:31 AM10/15/06

to strongtal...@googlegroups.com

Thanks Bryce, those look like great references.

As for the details on branch target prediction, the great thing about
inline-caches
and uncommon traps is that they depend much less on branch target prediction
than do
C++ vtable calls. Inline caches effectively act like an infinite branch
target
prediction table.

This is such an important effect that I recently noticed that Microsoft has
apparently
actually gotten rid of vtable dispatch in C++ on .NET and is now using
something much more
like what Strongtalk uses (but without PICs). Apparently, monomorphic calls
are done
with an inline-cache, and polymorphic calls are patched to a hashed lookup,
which is
almost exactly what we do minus PICs and inlining:
http://blogs.msdn.com/vancem/archive/2006/03/13/550529.aspx

I wonder where they got that idea from!?
Cheers,
-Dave

eliot....@gmail.com

unread,

Oct 16, 2006, 3:31:00 PM10/16/06

to Strongtalk-general

Hi David,

let me chip in on closures...

David Griswold wrote:
> Hi all,
>
>
> So, for those of you who want to work on the Strongtalk VM, you will need to
> have a solid background in modern object-oriented virtual machine design,
> and the reasons why it is done that way. I am assuming you have some kind
> of system programming experience, knowledge of C and C++, and have at least
> taken a course or read a book on basic compiler design. At the highest
> level, here is a checklist of some of the basic principles you will need to
> have a good understanding of:
>
> - how method dispatch has traditionally been done in Smalltalk
> (inline-caching) and why it works (most sends have low dynamic polymorphism)
>
> - tagging in unified data-model languages (allowing unboxed small integers)
>
> - closures: what they are and the three basic types of blocks in a
> full-closure Smalltalk: open (pure functions with no closure), copying
> (constant closure can be constructed independent of context), and full
> (closure is mutable so context must be reified).

If one uses a typical Lisp implemetation approach one can avoid fully
reifying the enclosing context; this is the main point of
"http://wiki.cs.uiuc.edu/VisualWorks/DOWNLOAD/papers/oopsla99-contexts.pdf"
which sped up VW significantly (exception handling, non-local return,
block intensive code speeds up by around a factor of two).

The idea is simple; mutable closed-over variables are stored in a
separate temp vector on the heap. Hence the mutable state is accessed
by copying the temp vector. The key advantage here is that one no
longer has to synchronise the stack frame holding the mutable state
with the heap (i.e. on return from the stack frame when a closure
outlives its defining context).

An example can clarify. Here's the canonical Collection>>inject:into:
:
inject: thisValue into: binaryBlock
| nextValue |
nextValue := thisValue.
self do: [:each | nextValue := binaryBlock value: nextValue value:
each].
^nextValue

sicne nextvalue is mutated inside the do: [:each| block the traditional
implementation makes that block a "full" block. But if one rewrites to
oput nextValue in an explicitly alloated heap array that's no longer
true:
inject: thisValue into: binaryBlock
| tempVector | "nextValue is tempVector at: 1"
tempVector := Array new: 1.
tempVector at: 1 put: thisValue.
self do: [:each |
tempVector at: 1 put: (binaryBlock value: (tempVector at: 1) value:
each)].
^tempVector at: 1

So in VW we have a bytecode to allocate a tempVector of a given size
and a pair of bytecodes to fetch/store into a given tempVector slot,
i.e. these bytecodes take a temp index and an index into the temp at
that index. There is at most one tempVector per method/block scope.
More info in the paper.

Now the enclosing context only gets reified for non-local return, i.e.
we only reify its identity, not its stack contents, meaning we don't
ever have to copy the onstack contents into the heap.

HTH

eliot....@gmail.com

unread,

Oct 16, 2006, 3:35:04 PM10/16/06

to Strongtalk-general

David Griswold wrote:
> Thanks Bryce, those look like great references.
>
> As for the details on branch target prediction, the great thing about
> inline-caches
> and uncommon traps is that they depend much less on branch target prediction
> than do
> C++ vtable calls. Inline caches effectively act like an infinite branch
> target
> prediction table.
>
> This is such an important effect that I recently noticed that Microsoft has
> apparently
> actually gotten rid of vtable dispatch in C++ on .NET and is now using
> something much more
> like what Strongtalk uses (but without PICs). Apparently, monomorphic calls
> are done
> with an inline-cache, and polymorphic calls are patched to a hashed lookup,
> which is
> almost exactly what we do minus PICs and inlining:
> http://blogs.msdn.com/vancem/archive/2006/03/13/550529.aspx
>
> I wonder where they got that idea from!?

:)

I suspect that it was George Bosworth who introduc ed the idea to them.
When I interviewed there in late 2002 it came up as one of the
questions. George had gone to MS a year or two before and told me he'd
done a lot of performance measurements on vtables and PICs on arriving.
And yes, George was/is more than aware of the Strongtalk work and the
work at UCSB.

:)

Reply all

Reply to author

Forward