v8 first impressions


Eliot Miranda

Sep 3, 2008, 12:57:34 PM
to strongtal...@googlegroups.com
Hi David,

    I spent some time reading the code yesterday.  Here's what I could glean


Thread model: the standard, multi-threaded with only one active thread at a time, i.e. the standard combination of a green thread VM architecture with native threads, as seen in Strongtalk.  Different threads must use a lock on the entire VM to use it.  There is a preemptive scheduler to share the VM between threads waiting on the lock.
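
In outline, that locking discipline looks something like the sketch below (invented names and a simplified lock; this is not V8's actual embedding API):

    // Sketch only: many native threads may hold references to the VM,
    // but a single global mutex means only one thread is ever executing
    // inside it at a time.
    #include <cstdio>
    #include <mutex>

    class VirtualMachine {
     public:
      void Execute(const char* script) {          // caller must hold the VM lock
        std::printf("running: %s\n", script);     // stand-in for real execution
      }
    };

    std::mutex vm_mutex;                           // one lock for the whole VM

    void WorkerThread(VirtualMachine* vm) {
      std::lock_guard<std::mutex> guard(vm_mutex); // block until the VM is free
      vm->Execute("f();");                         // no other thread is inside
    }                                              // lock released; next thread runs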


Parses direct to an AST.


Generates native code directly from AST.


No bytecode intermediate form.


In-lines simple arithmetic ops (see codegen-ia32.cc/codegen-arm.cc) when one side or the other is literal.
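
Concretely, the fast path emitted for something like "x + 1" amounts to the following (a sketch only; the smi-tagging scheme and every name here are my assumptions, not V8's actual layout):

    // Assume small integers ("smis") are tagged with a low 0 bit, so a
    // tagged value is 2*n. Adding a literal smi is then one tagged add
    // plus a couple of checks; anything else calls out to the slow path.
    #include <cstdint>

    typedef intptr_t Value;

    inline bool IsSmi(Value v) { return (v & 1) == 0; }   // low bit 0 => smi

    Value AddOneSlowPath(Value x) {    // placeholder for the generic send
      return x;                        // (a real VM would dispatch '+' here)
    }

    Value AddOneFastPath(Value x) {
      if (IsSmi(x)) {
        Value result = x + (1 << 1);   // the literal 1, pre-tagged as a smi
        if (result > x)                // no overflow (real code tests the CPU flag)
          return result;               // common case stays inline
      }
      return AddOneSlowPath(x);        // non-smi operand or overflow
    }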


Has monomorphic & megamorphic inline caches, but no polymorphic inline caches (see e.g. MONOMORPHIC in globals.h).


I see no evidence of adaptive optimization/speculative inlining, performance counters et al.  Seems to generate stubs to defer code generation.  So perhaps some deferred optimization is done at send/link time, but it's not obvious to me.
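
Those stubs presumably work along these lines (a hypothetical sketch, not V8's actual mechanism):

    // Lazy compilation via a stub: a function's entry point initially
    // targets CompileStub; the first call compiles the real code and
    // patches the entry point, so codegen is deferred until first use.
    struct Function {
      void (*entry)(Function* self);   // current entry point
      const char* source;              // what to compile on first call
    };

    void CompiledCode(Function*) { /* ...generated native code... */ }

    void CompileStub(Function* self) {
      // A real VM would run its code generator on self->source here.
      self->entry = &CompiledCode;     // patch: later calls skip the stub
      self->entry(self);               // resume in the freshly built code
    }

    void Call(Function* f) { f->entry(f); }  // callers never see the difference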


There are counters, but I think this is for understanding the dynamic behaviour, not yet for optimization.  


Colin Putney

Sep 3, 2008, 6:57:12 AM
to strongtal...@googlegroups.com

On 3-Sep-08, at 9:57 AM, Eliot Miranda wrote:

> Thread model: the standard, multi-threaded with only one active
> thread at a time, i.e. the standard combination of a green thread VM
> architecture with native threads, as seen in Strongtalk. Different
> threads must use a lock on the entire VM to use it. There is a
> preemptive scheduler to share the VM between threads waiting on the
> lock.

One thing to note here is that the semantics of Javascript don't
include concurrency at all. Javascript code only runs in response to
sequential events generated from outside the Javascript context, so it
makes perfect sense for V8 to be implemented this way.

Colin

talksmall

Sep 3, 2008, 2:43:12 PM
to Strongtalk-general
On Sep 3, 11:57 am, Colin Putney <cput...@wiresong.ca> wrote:

> One thing to note here is that the semantics of Javascript don't  
> include concurrency at all. Javascript code only runs in response to  
> sequential events generated from outside the Javascript context, so it  
> makes perfect sense for V8 to be implemented this way.

True ... except that you can spawn off timeouts and intervals that act
in some ways similarly to threads, except (at least in current
implementations) with no parallel execution - the VM schedules them
sequentially - and with no suspension or synchronization mechanism, so
one task (event handling function) needs to complete before another
can be scheduled.
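
A rough model of that discipline (all names here are invented): a single-threaded loop drains a time-ordered queue, so one handler must return before the next can start; no preemption, and hence no synchronization.

    #include <functional>
    #include <queue>
    #include <vector>

    struct Task {
      double due;                            // fire time, in ms
      std::function<void()> handler;
    };

    struct Later {                           // earliest 'due' first
      bool operator()(const Task& a, const Task& b) const { return a.due > b.due; }
    };

    std::priority_queue<Task, std::vector<Task>, Later> tasks;

    void SetTimeout(std::function<void()> fn, double delay_ms, double now_ms) {
      tasks.push(Task{now_ms + delay_ms, std::move(fn)});
    }

    void RunToCompletion() {
      while (!tasks.empty()) {
        Task t = tasks.top();                // (a real loop would also wait
        tasks.pop();                         //  until t.due before firing)
        t.handler();                         // runs to completion; nothing
      }                                      // else can interleave with it
    }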

It is not hard to see how this could be expanded into a thread, though
obviously that would be outside of the scope of the current ECMAScript
language definition.

Steve

Colin Putney

Sep 4, 2008, 1:53:32 AM
to strongtal...@googlegroups.com

On 3-Sep-08, at 11:43 AM, talksmall wrote:

>
> On Sep 3, 11:57 am, Colin Putney <cput...@wiresong.ca> wrote:
>
>> One thing to note here is that the semantics of Javascript don't
>> include concurrency at all. Javascript code only runs in response to
>> sequential events generated from outside the Javascript context, so
>> it
>> makes perfect sense for V8 to be implemented this way.
>
> True ... except that you can spawn off timeouts and intervals that act
> in some ways similarly to threads, except (at least in current
> implementations) with no parallel execution - the VM schedules them
> sequentially - and with no suspension or synchronization mechanism, so
> one task (event handling function) needs to complete before another
> can be scheduled.

True, you can schedule events from Javascript. But I think the
differences you point out are more important than the similarity -
which is only that you can schedule code to be executed without having
to do an immediate synchronous function call.

Not to belabour it too much, but I actually think this nitpicky
distinction is quite important. All the action in computer science
these days is in figuring out how to deal with multiprocessing.
Shared-state concurrency, as traditionally used by OO languages like
Smalltalk, Java, Objective-C, C++ or Ruby, can be problematic, so I'm
interested in alternatives.

What's cool about V8 is that it's designed for multi-threading on the
outside, and single threading on the inside. That is, the VM supports
being used in a multithreaded application, but it doesn't attempt to
execute Javascript concurrently. At the same time, it supports
multiple Javascript execution contexts within the same VM. Maybe that
could be used to implement the same sort of concurrency model that's
used in E and Croquet.

Colin

David Griswold

Sep 5, 2008, 3:09:41 PM
to strongtal...@googlegroups.com
Hi Eliot,  sorry for the delay in answering, I am traveling overseas.

On Wed, Sep 3, 2008 at 9:57 AM, Eliot Miranda <eliot....@gmail.com> wrote:
> Hi David,
>
>     I spent some time reading the code yesterday.  Here's what I could glean
>
> Thread model: the standard, multi-threaded with only one active thread at a time, i.e. the standard combination of a green thread VM architecture with native threads, as seen in Strongtalk.  Different threads must use a lock on the entire VM to use it.  There is a preemptive scheduler to share the VM between threads waiting on the lock.

That's interesting; green threads I expected, but the preemptive scheduler for sharing the VM sounds cool.  I hope that doesn't mean they can't actually schedule different VM instances concurrently.
 


> Parses direct to an AST.
>
> Generates native code directly from AST.
>
> No bytecode intermediate form.
>
> In-lines simple arithmetic ops (see codegen-ia32.cc/codegen-arm.cc) when one side or the other is literal.
>
> Has monomorphic & megamorphic inline caches, but no polymorphic inline caches (see e.g. MONOMORPHIC in globals.h).


I'm not that bothered by the lack of PICs; for inlining most of the benefit comes from just detecting the monomorphic case, but it sounds like they aren't doing that anyway, at least yet.  Having just monomorphic and megamorphic is interesting; perhaps they don't have inline caches at all (other than the degenerate monomorphic case).  The monomorphic form would still give the speed advantages of inline-caching for the common case, and if it ever changes they could lock and update to the megamorphic form, which would be reasonable because it only ever happens once per send site, and it would avoid the need to do any locking for the small polymorphic case if the VM is shared.  PICs don't buy much performance anyway; I've always thought they were a bit overrated.  In fact I have often thought that Strongtalk might be improved by eliminating PICs and doing basically what they are doing.  It would be slightly slower, but would eliminate a lot of complexity.


> I see no evidence of adaptive optimization/speculative inlining, performance counters et al.  Seems to generate stubs to defer code generation.  So perhaps some deferred optimization is done at send/link time, but it's not obvious to me.
>
> There are counters, but I think this is for understanding the dynamic behaviour, not yet for optimization.


Lack of a bytecode intermediate form is disappointing.  The big question is whether they can still do any kind of mixed-mode execution without it.  Perhaps they interpret the AST directly and use the counters you saw to do 'hotspot' compilation.  If not, that sounds like it would be a problem for big code bodies, since if they have to compile any code that is executed, they would have the same kind of code blowup that was a big problem in Self, although that was exacerbated by the verbose code generated by the non-optimizing compiler.  Even if space isn't a problem, compiling everything that runs would take time, although perhaps they think that is hidden by the page load pause.
 
Well, it sounds like they aren't doing anything radically advanced or magical.  Kind of mind-boggling that Strongtalk is more advanced than anything else after nearly 15 years.  I keep expecting people to catch up, but I guess the complexity is the obstacle.

-Dave

talksmall

Sep 7, 2008, 8:47:33 PM
to Strongtalk-general
I ran a comparison of the Richards benchmark for which both Javascript
and Smalltalk versions are available.

The version released with V8 isn't exactly comparable to the standard
version since it uses a count of 1000 rather than 10000. With that
tweaked to match the standard count I ran a comparison of V8 against a
selection of Smalltalk VMs - VisualWorks NC, Squeak, GNU Smalltalk and
Strongtalk.

I customised the Smalltalk tests to follow the same methodology as the
V8 benchmarks - iterating the test for a second and then dividing by
the number of iterations.
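
In outline, the harness does something like this (my sketch of the methodology, not the suite's actual code):

    #include <chrono>

    // Run the benchmark repeatedly for ~1 second of wall-clock time, then
    // report the average elapsed time per iteration, as the V8 suite does.
    double MillisecondsPerRun(void (*benchmark)()) {
      using clock = std::chrono::steady_clock;
      const auto start = clock::now();
      int runs = 0;
      while (std::chrono::duration<double>(clock::now() - start).count() < 1.0) {
        benchmark();                                  // one full Richards run
        ++runs;
      }
      const double elapsed_ms =
          std::chrono::duration<double, std::milli>(clock::now() - start).count();
      return elapsed_ms / runs;                       // ms per iteration
    }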

The results are interesting:

Strongtalk: 7.335ms
VW: 13.013ms
V8: 13.648ms
GNUSmalltalk: 91.090ms
Squeak: 102.600ms

All of the tests were run on a Vista 64 Quad core machine at 2.66 GHz,
though in each case I was running a 32 bit VM (since not all the VMs
have 64 bit builds).

Regards, Steve

David Griswold

Sep 8, 2008, 4:13:04 AM
to strongtal...@googlegroups.com
Hi Steve,

That's interesting.  On Strongtalk did you run the test multiple times with the top level loop in a method (not a doit) to avoid the on-stack replacement problem?   If not, Strongtalk might be even faster than that, since in general it is more than twice as fast as VisualWorks (or at least used to be).  But at least it confirms my intuition that V8 would be on the same order of performance as VW.
-Dave

talksmall

Sep 8, 2008, 5:03:58 AM
to Strongtalk-general
Hi Dave,
Yes, I did. I emulated the "iterate for a second" loop from the V8
test and put it into a method. I then ran the test several times, with
only a marginal improvement (< 1ms). I took the same approach for all
of the Smalltalk tests, though it shouldn't make any difference to the
others since they don't implement adaptive optimization.

Regards, Steve

On Sep 8, 9:13 am, "David Griswold" <david.griswold....@gmail.com>
wrote:
> Hi Steve,
> That's interesting.  On Strongtalk did you run the test multiple times with
> the top level loop in a method (not a doit) to avoid the on-stack
> replacement problem?   If not, Strongtalk might be even faster than that,
> since in general it is more than twice as fast as VisualWorks (or at least
> used to be).  But at least it confirms my intuition that V8 would be on the
> same order of performance as VW.
> -Dave
>

prunedtree

Sep 9, 2008, 4:12:29 AM
to Strongtalk-general
Hello

I gave a quick look at the V8 source code (I was curious about some
architectural choices) and it is a lot less interesting than what I'd
read on the web led me to expect.

V8 is clearly a VM made to be embedded and run short Javascript code
that interacts with a large number of built-in objects and functions
(like the DOM). Compilation straight to native code does make perfect
sense for this specific use, as do many other implementation
choices in V8 (I suppose the complexity of the write barrier also
comes from the desire to simplify embedding, compared to simpler
cardmarking). It's fine software and deserves credit for its
performance as a JS engine for web browsers.

However, if your goal is to run a complete application/system on a VM
(not embedding) or if your language is not based on Javascript, then
I'm afraid using this VM doesn't make much sense. V8 and Strongtalk
(among others) have been made with totally different goals in mind,
and it's hard for me to see a meaningful way to compare them in
practice.

So it doesn't really surprise me to see Strongtalk crush V8 when it
does what it has been made for. In fact, there's probably more room
for Strongtalk to improve than for V8 here: Strongtalk has a cheaper
write barrier, less indirection, and more potential for optimisation
due to type-feedback (escape analysis...).

Btw, a simple optimisation to Strongtalk's GC (using a sparse and
precise representation for the remembered set during scavenges) would
remove any advantage that V8's more expensive write barrier might give
to its scavenging speed.
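
For contrast, the two barrier styles look roughly like this (invented structures and an assumed card size; neither is the real V8 or Strongtalk code):

    #include <cstdint>
    #include <vector>

    // Card marking: every store marks a byte for the 512-byte card
    // (size assumed) containing the slot; the scavenger later scans
    // whole cards to find old->young pointers.
    const int kCardShift = 9;
    unsigned char card_table[1 << 20];

    void CardMarkingBarrier(void** slot, void* value) {
      *slot = value;
      card_table[reinterpret_cast<uintptr_t>(slot) >> kCardShift] = 1;
    }

    // Sparse, precise remembered set: record the exact slot, but only
    // for interesting (old -> young) stores; the scavenge then visits
    // just those slots instead of scanning cards.
    std::vector<void**> remembered_set;

    bool IsInYoungGeneration(void*) { return true; }  // placeholder predicate

    void PreciseBarrier(void** slot, void* value) {
      *slot = value;
      if (IsInYoungGeneration(value))
        remembered_set.push_back(slot);
    }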

Dave: regarding PICs, I'm pretty sure it's critical that you have a
monomorphic send attempt followed by a megamorphic send for the case
where a call site is nearly-monomorphic. In essence, this is a
degenerate PIC of length 1, and I guess it seems natural to allow
bigger PICs. I think this mainly made sense in the Self system, where
the efficiency of ifTrue:/ifFalse:/etc. depended highly on PICs of
arity 2. This is irrelevant in Strongtalk (or in any system that
implements booleans without polymorphism). I suspect larger PICs are
useless because of recompilation (it makes code more monomorphic -
PICs for non-inlining JITs (VW or V8) are another story). It would
actually be interesting to benchmark Strongtalk with various settings
(no PICs, PICs limited to various sizes). If PICs of length 1 are
found to be enough, then the PICs code could indeed be much simpler.

David Griswold

Sep 11, 2008, 12:48:08 PM
to strongtal...@googlegroups.com
Hi Marc,

On Tue, Sep 9, 2008 at 1:12 AM, prunedtree <prune...@gmail.com> wrote:

> [...]

> Dave: regarding PICs, I'm pretty sure it's critical that you have a
> monomorphic send attempt followed by a megamorphic send for the case
> where a call site is nearly-monomorphic.

I'm not sure how that would work.  What you suggest sounds pretty much like a standard inline-cache.  With type-feedback the form of the send needs to record the encountered polymorphism of the send; the kind of send you suggest can't be distinguished from a truly megamorphic send.  Such a send would also be slower for megamorphic sends, since the cache will usually miss; that wasted time, and the updating of the cache, are both eliminated in Strongtalk's megamorphic form.  The question is, when the cache misses, what do you do?  If you don't convert such sends into the megamorphic form (with no inline-cache), how do you detect megamorphic sends?

I am sure you are right that slightly polymorphic sends would be slower without PICs or an inline cache, but I have my doubts about how important they are statistically.  As you pointed out, many of them become monomorphic after inlining and/or customization; my intuition is that the resulting distribution is highly bi-modal, dominated by monomorphic and megamorphic sends with not much in-between.  As you also pointed out, since boolean control structures are hardcoded, that eliminates the biggest source of slightly polymorphic sends.

> In essence, this is a
> degenerate PIC of length 1, and I guess it seems natural to allow
> bigger PICs.

As I said above, I don't think it really is like a PIC, since a PIC upgrades itself to the next higher arity send when a cache miss occurs, rather than just doing a megamorphic send and updating the cache.  I think allowing bigger PICs of variable size was a big mistake we made in Strongtalk, since suddenly you get a lot of extra complexity for very little payoff.  Variable size PICs cause fragmentation in the PIC area, requiring compaction (which isn't done in the current system but would eventually be necessary).

I think eliminating variable size PICs would be a great improvement.  But there are two different ways that could be done, depending on how important sends of arity 2 are, which as I said is not clear. 

If they are not that important, which I suspect is true, then separately allocated PICs could be eliminated entirely, and we could use a degenerate inline 1 element PIC for the monomorphic case, which patches to a cache-less megamorphic send on cache miss.  Nothing else would be needed.

If it turns out that sends of arity 2 are too important not to optimize, then 2 element allocated PICs could be retained.  That is still much simpler than now, since all PICs would have the same size, and no fragmentation would occur, and thus no compaction would be needed.
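
A fixed-size layout along those lines might look like the following (hypothetical sketch, not the actual Strongtalk sources):

    // Every PIC is exactly two entries, so all PICs in the PIC area are
    // the same size: no fragmentation, hence no compaction ever needed.
    struct Map {};
    typedef void (*Code)(void*);

    void SomeMethod(void*) {}                        // placeholder target
    Code FullLookup(const Map*) { return &SomeMethod; }

    struct TwoEntryPIC {
      const Map* maps[2] = {nullptr, nullptr};
      Code targets[2] = {nullptr, nullptr};
      bool megamorphic = false;                      // set after a 3rd map
    };

    Code Dispatch(TwoEntryPIC* pic, const Map* map) {
      if (!pic->megamorphic) {
        for (int i = 0; i < 2; ++i) {
          if (pic->maps[i] == map) return pic->targets[i];   // hit
          if (pic->maps[i] == nullptr) {                     // free entry:
            pic->maps[i] = map;                              // extend in place
            pic->targets[i] = FullLookup(map);
            return pic->targets[i];
          }
        }
        pic->megamorphic = true;     // third distinct map: stop caching
      }
      return FullLookup(map);        // cache-less megamorphic send
    }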
 
> [...] It would actually be
> interesting to benchmark Strongtalk with various settings (no PICs,
> PICs limited to various sizes). If PICs of length 1 are found to be
> enough, then the PICs code could indeed be much simpler.

In fact, that is the first experiment I tried when Strongtalk went open source, but it didn't appear to work the easy way, and I didn't follow up on it.  I tried just lowering the constant that determines the max PIC size to 1 and 2, but I got no change at all in the benchmarks I tried, and the inlined code structure didn't appear to change.  I suspect that the constant is probably also hardcoded somewhere, perhaps in the assembly code, so just changing the constant wasn't enough :-(.  But I agree that actually getting the experiment to work shouldn't be very hard and would be very, very interesting.

-Dave

John Cowan

Sep 11, 2008, 1:42:19 PM
to strongtal...@googlegroups.com
On Thu, Sep 11, 2008 at 12:48 PM, David Griswold wrote:

> [...] suggest can't be distinguished from a truly megamorphic send. Such a send
> would be slower for megamorphic sends [...]

By the way, I don't know who coined the word "megamorphic", but he was
clearly a barbarian (originally, someone who doesn't speak Greek).
"Monomorphic" and "polymorphic" are fine, but "megamorphic" would mean
having a big form, as in "You've grown rather megamorphic since you
left school."

Can we please switch to "perissomorphic"? It's better Greek and only
a little bit longer. We already have the rare words "perissology"
(too many words) and "perissosyllabic" (having too many syllables), as
well as the more common word "perissodactyl" (ungulates that have an
odd number of toes, such as horses, from a second sense of Greek
_perissos_, 'odd (in mathematics)').

"Megamorphic": kill it before it spreads!

--
GMail doesn't have rotating .sigs, but you can see mine at
http://www.ccil.org/~cowan/signatures

talksmall

Sep 11, 2008, 2:04:10 PM
to Strongtalk-general
On Sep 11, 6:42 pm, "John Cowan" <johnwco...@gmail.com> wrote:
> On Thu, Sep 11, 2008 at 12:48 PM, David Griswold wrote:
> > [...] suggest can't be distinguished from a truly megamorphic send.  Such a send
> > would be slower for megamorphic sends [...]
>
> By the way, I don't know who coined the word "megamorphic", but he was
> clearly a barbarian (originally, someone who doesn't speak Greek).
> "Monomorphic" and "polymorphic" are fine, but "megamorphic" would mean
> having a big form, as in "You've grown rather megamorphic since you
> left school."
>
> Can we please switch to "perissomorphic"?  It's better Greek and onlly
> a little bit longer.  We already have the rare words "perissology"
> (too many words) and "perissosyllabic" (having too many syllables), as
> well as the more common word "perissodactyl" (ungulates that have an
> odd number of toes, such as horses, from a second sense of Greek
> _perissos_, 'odd (in mathematics')'.
>
> "Megamorphic": kill it before it spreads!

Too late. It seems to be in fairly widespread usage already. While I
bow to your superior classical Greek, as I have none, I think
megamorphic will be hard to change. Inaccurate though it may be, it
seems to me more expressive than your possibly more correct
alternative. Also, the negative connotation implied by "perissology"
and "perissosyllabic" - i.e. too many words/syllables - doesn't really
apply to the notion of what we currently refer to as megamorphic
sends, except in the narrow sense of "too many possible targets of
this send to fit in this PIC".

talksmall

Sep 22, 2008, 5:15:15 PM
to Strongtalk-general
Just a slight update to my earlier post about the relative performance
of various VMs.

With the new back-end in place the performance of the Strongtalk VM
improves substantially. I didn't include it in my previous test,
because I honestly didn't think it was stable enough to complete a
run. Still, I gave it a try and it surprised me.

A 100-run average with the new back-end comes to 4.36ms. That's
about 3 times as fast as the VW and V8 figures.

Running with the new back-end is pretty unstable (I had to try a
couple of times to get a complete run and paste the results in a
spreadsheet without the VM crashing with an access violation), but it
does offer the prospect of substantial performance improvements.

Regards, Steve

David Griswold

Sep 22, 2008, 5:30:31 PM
to strongtal...@googlegroups.com
Yes, it's really a shame that original development on the system stopped just as the new back-end was starting to run programs about the size of Richards.  It can do quite a bit more optimization than the current Strongtalk back-end, although it still isn't a real optimizing compiler by any means.  But there is a lot of low-hanging fruit that it can take advantage of.

Unfortunately, as I recall it is still far from finished, and it would be a serious project for a real compiler guy to get it done.  But kudos for getting any numbers out of it at all!  Maybe at some point someone will do something with it.
-Dave