So, for those of you who want to work on the Strongtalk VM, you will need to have a solid background in modern object-oriented virtual machine design, and the reasons why it is done that way. I am assuming you have some kind of system programming experience, knowledge of C and C++, and have at least taken a course or read a book on basic compiler design. At the highest level, here is a checklist of some of the basic principles you will need to have a good understanding of:
- how method dispatch has traditionally been done in Smalltalk (inline-caching) and why it works (most sends have low dynamic polymorphism)
- tagging in unified data-model languages (allowing unboxed small integers)
- closures: what they are and the three basic types of blocks in a full-closure Smalltalk: open (pure functions with no closure), copying (constant closure can be constructed independent of context), and full (closure is mutable so context must be reified).
- how modern garbage-collectors work (specifically, generation scavenging and write-barriers, card marking vs. remembered sets, and compacting mark-sweep).
- the disappointing history of efforts to make Smalltalk faster (Deutsch-Schiffman dynamic translation, SOAR), and why just compiling Smalltalk in a traditional way without inlining produces very little speedup (high call-frequency defeats optimization without inlining, polymorphism defeats simple inlining schemes).
- some basic facts about modern CPU architecture: why self-modifying code is very slow, why computed and indirect branches are slow.
- how type-feedback and PICs (polymorphic inline-caches) work
Fortunately, there is an online syllabus for an excellent course on OO virtual-machine design that was given at Aarhus university by Lar Bak, who was the lead engineer on the Strongtalk VM.
My suggestion is that you work through the curriculum for the course, focusing on the non-Java related stuff (Java has quite a few differences because of the manifest typesystem and non-OO basic types). That will get you a long way towards understanding the basic principles.
The course syllabus is at: http://www.daimi.au.dk/~ups/OOVM/. There are links to most of the important papers you will need to read there.
Thanks for that pointer Dave; if people want a lighter start I would offer my own chapter in the squeak 'nublue' book which is available online as http://www.rowledge.org/tim/squeak/OE-Tour.html
Since it is a simple, short, description of the squeak vm workings it might get readers off to a gentle start and prepare them for the rest.
> - some basic facts about modern CPU architecture: why self-modifying code is > very slow, why computed and indirect branches are slow.
A good book on modern CPU architecture is "Computer Architecture: A Quantitative Approach" by John L. Hennessy and David A. Patterson published by Morgan Kaufmann. ISBN: 1558807242. It's about 900 pages but the chapters can be read individually. It's a tour through current hardware design from CPU pipelines to building clusters. For the fundamentals today it's probably the best thing available.
For details on the instruction sets the guides produced by AMD and Intel are fine both are freely available on the web. They include everything that you'll need to know about the x86 instruction set. Either companies instruction set guides are good, I think Intel's are better. AMD's optimization guide is better, it has a lot more detail on what's going on. Intel's optimization guide has too much space dedicated to top tips and too little to how things work.
Branch predictor information is harder to find. There's definitely information available but I don't have links handy. "Computer Architecture" covers the techniques.
As for the details on branch target prediction, the great thing about inline-caches and uncommon traps is that they depend much less on branch target prediction than do C++ vtable calls. Inline caches effectively act like an infinite branch target prediction table.
This is such an important effect that I recently noticed that Microsoft has apparently actually gotten rid of vtable dispatch in C++ on .NET and is now using something much more like what Strongtalk uses (but without PICs). Apparently, monomorphic calls are done with an inline-cache, and polymorphic calls are patched to a hashed lookup, which is almost exactly what we do minus PICs and inlining: http://blogs.msdn.com/vancem/archive/2006/03/13/550529.aspx
I wonder where they got that idea from!? Cheers, -Dave
> -----Original Message----- > From: strongtalk-general@googlegroups.com > [mailto:strongtalk-general@googlegroups.com]On Behalf Of > br...@kampjes.demon.co.uk > Sent: Sunday, October 15, 2006 12:14 AM > To: strongtalk-general@googlegroups.com > Subject: Learning about VMs: syllabus
> David Griswold writes: > > - some basic facts about modern CPU architecture: why > self-modifying code is > > very slow, why computed and indirect branches are slow.
> A good book on modern CPU architecture is "Computer Architecture: A > Quantitative Approach" by John L. Hennessy and David A. Patterson > published by Morgan Kaufmann. ISBN: 1558807242. It's about 900 pages > but the chapters can be read individually. It's a tour through current > hardware design from CPU pipelines to building clusters. For the > fundamentals today it's probably the best thing available.
> For details on the instruction sets the guides produced by AMD and > Intel are fine both are freely available on the web. They include > everything that you'll need to know about the x86 instruction set. > Either companies instruction set guides are good, I think Intel's are > better. AMD's optimization guide is better, it has a lot more detail > on what's going on. Intel's optimization guide has too much space > dedicated to top tips and too little to how things work.
> Branch predictor information is harder to find. There's definitely > information available but I don't have links handy. "Computer > Architecture" covers the techniques.
> So, for those of you who want to work on the Strongtalk VM, you will need to > have a solid background in modern object-oriented virtual machine design, > and the reasons why it is done that way. I am assuming you have some kind > of system programming experience, knowledge of C and C++, and have at least > taken a course or read a book on basic compiler design. At the highest > level, here is a checklist of some of the basic principles you will need to > have a good understanding of:
> - how method dispatch has traditionally been done in Smalltalk > (inline-caching) and why it works (most sends have low dynamic polymorphism)
> - tagging in unified data-model languages (allowing unboxed small integers)
> - closures: what they are and the three basic types of blocks in a > full-closure Smalltalk: open (pure functions with no closure), copying > (constant closure can be constructed independent of context), and full > (closure is mutable so context must be reified).
If one uses a typical Lisp implemetation approach one can avoid fully reifying the enclosing context; this is the main point of "http://wiki.cs.uiuc.edu/VisualWorks/DOWNLOAD/papers/oopsla99-contexts..." which sped up VW significantly (exception handling, non-local return, block intensive code speeds up by around a factor of two).
The idea is simple; mutable closed-over variables are stored in a separate temp vector on the heap. Hence the mutable state is accessed by copying the temp vector. The key advantage here is that one no longer has to synchronise the stack frame holding the mutable state with the heap (i.e. on return from the stack frame when a closure outlives its defining context).
An example can clarify. Here's the canonical Collection>>inject:into: : inject: thisValue into: binaryBlock | nextValue | nextValue := thisValue. self do: [:each | nextValue := binaryBlock value: nextValue value: each]. ^nextValue
sicne nextvalue is mutated inside the do: [:each| block the traditional implementation makes that block a "full" block. But if one rewrites to oput nextValue in an explicitly alloated heap array that's no longer true: inject: thisValue into: binaryBlock | tempVector | "nextValue is tempVector at: 1" tempVector := Array new: 1. tempVector at: 1 put: thisValue. self do: [:each | tempVector at: 1 put: (binaryBlock value: (tempVector at: 1) value: each)]. ^tempVector at: 1
So in VW we have a bytecode to allocate a tempVector of a given size and a pair of bytecodes to fetch/store into a given tempVector slot, i.e. these bytecodes take a temp index and an index into the temp at that index. There is at most one tempVector per method/block scope. More info in the paper.
Now the enclosing context only gets reified for non-local return, i.e. we only reify its identity, not its stack contents, meaning we don't ever have to copy the onstack contents into the heap.
David Griswold wrote: > Thanks Bryce, those look like great references.
> As for the details on branch target prediction, the great thing about > inline-caches > and uncommon traps is that they depend much less on branch target prediction > than do > C++ vtable calls. Inline caches effectively act like an infinite branch > target > prediction table.
> This is such an important effect that I recently noticed that Microsoft has > apparently > actually gotten rid of vtable dispatch in C++ on .NET and is now using > something much more > like what Strongtalk uses (but without PICs). Apparently, monomorphic calls > are done > with an inline-cache, and polymorphic calls are patched to a hashed lookup, > which is > almost exactly what we do minus PICs and inlining: > http://blogs.msdn.com/vancem/archive/2006/03/13/550529.aspx
> I wonder where they got that idea from!?
:)
I suspect that it was George Bosworth who introduc ed the idea to them. When I interviewed there in late 2002 it came up as one of the questions. George had gone to MS a year or two before and told me he'd done a lot of performance measurements on vtables and PICs on arriving. And yes, George was/is more than aware of the Strongtalk work and the work at UCSB.
> > -----Original Message----- > > From: strongtalk-general@googlegroups.com > > [mailto:strongtalk-general@googlegroups.com]On Behalf Of > > br...@kampjes.demon.co.uk > > Sent: Sunday, October 15, 2006 12:14 AM > > To: strongtalk-general@googlegroups.com > > Subject: Learning about VMs: syllabus
> > David Griswold writes: > > > - some basic facts about modern CPU architecture: why > > self-modifying code is > > > very slow, why computed and indirect branches are slow.
> > A good book on modern CPU architecture is "Computer Architecture: A > > Quantitative Approach" by John L. Hennessy and David A. Patterson > > published by Morgan Kaufmann. ISBN: 1558807242. It's about 900 pages > > but the chapters can be read individually. It's a tour through current > > hardware design from CPU pipelines to building clusters. For the > > fundamentals today it's probably the best thing available.
> > For details on the instruction sets the guides produced by AMD and > > Intel are fine both are freely available on the web. They include > > everything that you'll need to know about the x86 instruction set. > > Either companies instruction set guides are good, I think Intel's are > > better. AMD's optimization guide is better, it has a lot more detail > > on what's going on. Intel's optimization guide has too much space > > dedicated to top tips and too little to how things work.
> > Branch predictor information is harder to find. There's definitely > > information available but I don't have links handy. "Computer > > Architecture" covers the techniques.