Cajita optimization strategy proposal


ihab...@gmail.com

Jul 15, 2008, 10:49:21 PM
to Google Caja Discuss
Hi folks,

The following is a writeup of a Cajita optimization strategy that Mike Stay and I worked out. This sets out to solve the main performance issue with Cajita -- the fact that creating (the moral equivalent of) an object with N methods causes O(N) object allocations since each method, in the absence of 'this', is a closure.

  http://google-caja.googlecode.com/svn/trunk/doc/html/cajitaOptimization/index.html
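
For concreteness, the pattern in question looks roughly like this (an
illustrative sketch with made-up names, not an excerpt from the doc):

  function makePoint(x, y) {
    // No 'this': every method is a closure over x and y.
    return {
      getX: function () { return x; },
      getY: function () { return y; },
      toString: function () { return '(' + x + ', ' + y + ')'; }
    };
  }
  // Each call to makePoint allocates the record plus three fresh
  // closures: O(N) allocations for an object with N methods.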

Comments welcome. Regards,

Ihab

--
Ihab A.B. Awad, Palo Alto, CA

ihab...@gmail.com

Jul 28, 2008, 6:07:43 PM
to Google Caja Discuss
Hi folks,

Mike Stay and I have posted an updated version of the Cajita optimization proposal here:

   http://google-caja.googlecode.com/svn/trunk/doc/html/cajitaOptimization/index.html

Please let us know how this looks. Thanks,

Felix

Jul 28, 2008, 7:02:18 PM
to google-ca...@googlegroups.com
is there a reason to use numeric slots instead of symbolic slots?
rather than
s[0][0][1] // a (grandparent scope)
how about
s.up__.up__.a

mixed calling convention is a little more complex, but debuggability is
improved a lot. not sure about impact on performance.
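
to make the comparison concrete (illustrative frame layouts only, not
the actual proposal):

  // numeric frames: slot 0 links to the enclosing frame
  var grandparent = [null, 'a-value'];   // slot 1 holds 'a'
  var s = [[grandparent]];               // s -> parent -> grandparent
  s[0][0][1];                            // read 'a' numerically

  // symbolic frames: an up__ property links upward
  var g2 = { up__: null, a: 'a-value' };
  var s2 = { up__: { up__: g2 } };
  s2.up__.up__.a;                        // the same read, by name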

ihab...@gmail.com

Jul 28, 2008, 7:04:38 PM
to google-ca...@googlegroups.com
On Mon, Jul 28, 2008 at 4:02 PM, Felix <fel...@gmail.com> wrote:
> how about
>        s.up__.up__.a
> mixed calling convention is a little more complex, but debuggability is
> improved a lot.  not sure about impact on performance.

Yes, nifty idea, but I ran an informal micro-benchmark a while ago that convinced me that (at least on FF3, which is where I ran it), numeric accesses in an array are way fast. I could dig that out and try some more things but, fwiw, this is the reason: use numbers all the way for speed.

Ihab

ihab...@gmail.com

Jul 28, 2008, 7:34:22 PM
to google-ca...@googlegroups.com
Hey again,

On Mon, Jul 28, 2008 at 4:02 PM, Felix <fel...@gmail.com> wrote:
> is there a reason to use numeric slots instead of symbolic slots?

Ok, so the plot thickens.

Creating a new slot in an array is way faster (382 μs) than doing the same in an object literal (1061 μs).

Creating a new 3-element array is also noticeably faster (1572 μs) than creating a 3-member object literal (2136 μs).

Reading a grandparent scope using the numeric scheme is only marginally better than the symbolic scheme (489 μs versus 531 μs).

Writing a grandparent scope using the numeric scheme is also only marginally better than the symbolic scheme (443 μs versus 491 μs).

== So ==

Access to a function parameter is a use of an existing slot; pushing a local variable, by contrast, creates a new slot; and we create a new instance of a stack frame structure each time we call a function.

My inclination is to go with numeric indices for the time being, barring further evidence, since, among other things, they give us automatic alpha renaming of variables -- something we wanted to do anyway to get around the hokey scoping weirdnesses of the JS implementations.
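
For reference, each measurement is a loop of roughly this shape (a
simplified sketch, not the attached page itself):

  // frame chain: s -> parent -> grandparent; 'a' lives in slot 1
  var s = [[[null, 42]]];
  var n = 1000000, t;
  var start = new Date().getTime();
  for (var k = 0; k < n; ++k) { t = s[0][0][1]; }
  var elapsedMs = new Date().getTime() - start;
  alert('readGrandparentNumerically: ' + elapsedMs + ' ms for ' + n + ' reads');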
(attachment: scope-timing.html)

Felix

Jul 28, 2008, 7:58:42 PM
to google-ca...@googlegroups.com
in nightly webkit (the new squirrelfish bytecode engine),
reading foo[3] is about 2x faster than reading foo.bar.

ok, for me, speed beats debuggability, especially since cajita programs
are runnable without translation.

maybe add a few sentences about this to the proposal?

ihab...@gmail.com

Jul 28, 2008, 8:26:04 PM
to google-ca...@googlegroups.com
On Mon, Jul 28, 2008 at 4:58 PM, Felix <fel...@gmail.com> wrote:
> maybe add a few sentences about this to the proposal?

Will do, and thanks for the feedback and for running the test! Cheers, -- Ihab

Felix

Jul 28, 2008, 8:43:47 PM
to google-ca...@googlegroups.com
Felix wrote:
> ok, for me, speed beats debuggability, especially since cajita
> programs are runnable without translation.

no wait, I take that back. the typical use of cajita will be within a
complex container, and it might be hard to run the cajita code usefully
outside the container. debuggability beats performance. hm.

I guess it isn't too hard to keep the names of things as comments near
the translated code; that'll help some. but it will still be hard to
read in object-tree browsers, which could be solved by emitting debug
symbols and teaching firebug about them.

I'm just wary of that complexity. symbolic slots are potentially more
usable, more quickly.

ihab...@gmail.com

Jul 28, 2008, 8:46:17 PM
to google-ca...@googlegroups.com
On Mon, Jul 28, 2008 at 5:43 PM, Felix <fel...@gmail.com> wrote:
> no wait, I take that back.

Just when you had me convinced. :)
 
> I'm just wary of that complexity.  symbolic slots are potentially more
> usable, more quickly.

Yeah, but we've been talking about doing an alpha renaming step anyway to avoid scoping weirdness. I guess that could be done by just, say, prefixing user variable names rather than rewriting them entirely. So I don't know ... hm. :/
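
Something like this, purely for illustration (the 'u_' prefix is
hypothetical):

  // user code, before renaming:
  function f(x) { var tmp = x + 1; return tmp; }
  // after a prefixing alpha-rename, which keeps names readable while
  // avoiding collisions with translator-introduced identifiers:
  function u_f(u_x) { var u_tmp = u_x + 1; return u_tmp; }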

Ihab

ihab...@gmail.com

Jul 28, 2008, 8:58:14 PM
to google-ca...@googlegroups.com
Ok, here's what I think I think --

On Mon, Jul 28, 2008 at 5:43 PM, Felix <fel...@gmail.com> wrote:
> debuggability beats performance.

It's pretty important that there be some way of targeting Caj[a,ita] in which performance is the maximum possible. JS interpreters are not the fastest things around.

Paranoids like you and us are willing to trade performance for security, on the assumption that the worst performance of all is when you have turned off your computer and are calling your credit card company to cancel all the fraudulent charges. But our case is most persuasive to the extent that we keep this cost as low as possible. A 2x slowdown is not insignificant.

We can always run un-optimized for debugging. In this case, we would have to maintain the Cajita array-based calling convention at the boundary (since containers would be assuming that) but add a little bit of code to un-optimize the internals of the functions. (This calling convention business is the main pain in the neck about this whole scheme ... I'll keep thinking about this more.)

Ihab

Felix

Jul 28, 2008, 10:00:19 PM
to google-ca...@googlegroups.com
ok, how about this argument.

in IE7, the speed of numeric vs symbolic slots is indistinguishable.
in IE8b1, numeric wins slightly but it's still mostly indistinguishable.
in opera 9.51, symbolic slots are faster.
in nightly firefox and nightly webkit, numeric slots are faster.

IE is the most important browser, and it evolves slowly.

in the other browsers, performance is a moving target and is changing
rapidly. optimizing for their current performance characteristics is a
mistake. the other browsers are likely to improve their runtimes to match
opera, since making these basic operations faster benefits everyone, not
just cajita.

therefore, performance of these micro-operations is a red herring.
symbolic slots beat numeric slots, due to ease of implementation and
ease of debugging.

does that seem sensible?

here are the performance numbers I got:

ie7 (in a vm):

readGrandparentNumerically - 1250 μs
writeGrandparentNumerically - 1095 μs
readGrandparentSymbolically - 1170 μs
writeGrandparentSymbolically - 1095 μs
writeNewArraySlot - 1640 μs
writeNewObjectSlot - 1875 μs
new3ElementArray - 18905 μs
new3MemberObject - 19220 μs

ie8 (in a vm):

readGrandparentNumerically - 1095 μs
writeGrandparentNumerically - 1015 μs
readGrandparentSymbolically - 1175 μs
writeGrandparentSymbolically - 1015 μs
writeNewArraySlot - 1325 μs
writeNewObjectSlot - 1800 μs
new3ElementArray - 17110 μs
new3MemberObject - 17655 μs

opera 9.51:

readGrandparentNumerically - 227.5 μs
writeGrandparentNumerically - 302.5 μs
readGrandparentSymbolically - 275 μs
writeGrandparentSymbolically - 205 μs
writeNewArraySlot - 315 μs
writeNewObjectSlot - 295 μs
new3ElementArray - 1362.5 μs
new3MemberObject - 1095 μs

nightly firefox:

readGrandparentNumerically - 647.5 μs
writeGrandparentNumerically - 577.5 μs
readGrandparentSymbolically - 705 μs
writeGrandparentSymbolically - 620 μs
writeNewArraySlot - 675 μs
writeNewObjectSlot - 1392.5 μs
new3ElementArray - 1762.5 μs
new3MemberObject - 3912.5 μs

nightly webkit:

readGrandparentNumerically - 135 μs
writeGrandparentNumerically - 110 μs
readGrandparentSymbolically - 162.5 μs
writeGrandparentSymbolically - 157.5 μs
writeNewArraySlot - 77.5 μs
writeNewObjectSlot - 1125 μs
new3ElementArray - 1317.5 μs
new3MemberObject - 1562.5 μs

my earlier benchmarking got different numbers than yours, because yours
looks like
for (k = 0; k < n; ++k) { t = a[0][0][1]; }
and mine looks like:
for (k = 0; k < n; ++k) { a[0]; a[1]; a[3]; a[4]; }

Felix

Jul 28, 2008, 10:53:09 PM
to google-ca...@googlegroups.com
one more set of timings. ie6, in a vm:

readGrandparentNumerically - 1095 μs
writeGrandparentNumerically - 935 μs
readGrandparentSymbolically - 1095 μs
writeGrandparentSymbolically - 1015 μs
writeNewArraySlot - 1875 μs
writeNewObjectSlot - 1875 μs
new3ElementArray - 16485 μs
new3MemberObject - 16640 μs

David-Sarah Hopwood

Jul 29, 2008, 7:06:53 AM
to google-ca...@googlegroups.com
ihab...@gmail.com wrote:
> Hi folks,
>
> Mike Stay and I have a new copy of the Cajita optimization proposal here:
>
> http://google-caja.googlecode.com/svn/trunk/doc/html/cajitaOptimization/index.html
>
> Please let us know how this looks.

I'm confused as to why you are expecting this to reduce overall allocation.
Doesn't it end up allocating a bunch of array objects that would not
otherwise have been needed? Note that these can't be deallocated immediately
when a stack frame exits, since they might be captured in closures created
by the function. I.e. you're essentially using a spaghetti heap. That can
be a perfectly reasonable approach *if* the whole language implementation
is designed around it, but existing JS implementations are not. You're also
replacing local parameter accesses with array accesses, which should be
less efficient.

--
David-Sarah Hopwood

ihab...@gmail.com

Jul 29, 2008, 11:57:43 AM
to google-ca...@googlegroups.com
Hi David-Sarah,

Thanks for reading the doc and responding.

On Tue, Jul 29, 2008 at 4:06 AM, David-Sarah Hopwood <david....@industrial-designers.co.uk> wrote:
> I'm confused as to why you are expecting this to reduce overall allocation.
> Doesn't it end up allocating a bunch of array objects that would not
> otherwise have been needed?

Yes, but they don't necessarily hang around any longer than a machine stack frame would, *unless* they are captured by closures.

However, we are making a deal with the devil in some sense. We reduce the number of per-object allocations for objects that have a large number of closures. In return, we incur one heap allocation every time we call a function. Is this the tradeoff you are calling into question?

> Note that these can't be deallocated immediately
> when a stack frame exits, since they might be captured in closures created
> by the function. I.e. you're essentially using a spaghetti heap.

I assume "spaghetti heap" == allocating stack frames on the heap, aka a stackless interpreter? Or am I missing something?

In any case, yes, closures will pin the stack frames, which is precisely what we want. Is your point that the whole stack frame will be pinned, rather than simply a structure containing only those slots which the closure refers to?
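
Concretely, I imagine the worry is something like this (hypothetical
sketch):

  function f(outerFrame) {
    // slots: [0] = scope link, [1] = a small value, [2] = a ~1MB string
    var frame = [outerFrame, 'small', new Array(1000000).join('x')];
    // The closure reads only frame[1], but it pins the whole frame,
    // keeping the big string in frame[2] alive too.
    return function () { return frame[1]; };
  }
  var g = f(null);   // the frame lives as long as g does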
 
> That can be a perfectly reasonable approach *if* the whole language implementation
> is designed around it, but existing JS implementations are not.

Yes, I think we are gambling that we can implement 10% of a VM inside the other 90% and get better performance. This is subject to benchmarking.
 
> You're also replacing local parameter accesses with array accesses, which should be
> less efficient.

Well so I think we are trading time for space and hoping that we don't make things too bad.

Ihab

Felix

Jul 29, 2008, 12:25:16 PM
to google-ca...@googlegroups.com
ihab...@gmail.com wrote:
> Yes, I think we are gambling that we can implement 10% of a VM inside
> the other 90% and get better performance. This is subject to benchmarking.

I think you need to concentrate on IE. for the other browsers, it seems
more useful to put the effort into improving the underlying js engine,
especially if cajita.eval() becomes a native function.

ihab...@gmail.com

Jul 29, 2008, 12:36:49 PM
to google-ca...@googlegroups.com

Ok, imho that is very wise advice. :)

So maybe the next step is to benchmark a piece of code against a manual transformation of that code to see what we gain or lose. It should be really easy to do.

Here's the whole reason for performing this exercise right now:

We would really like to narrow down our supported languages to (Valija == new Caja) and Cajita, and implement all of new Caja outside our TCB by translation to Cajita. This would allow us to remove a lot of the Caja complexity from our rewrite rules, which would simplify our work quite a bit.

On the other hand, folks have expressed trepidation about the Cajita pattern due to the extra allocations. For example, the Lively Kernel people stated that their Point class contains simple state but has dozens of methods. If it were implemented as a Cajita object, they would be making huge numbers of allocations per object for a fine-grained thing they use everywhere.
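
Roughly, the direction we're exploring is to make methods shared
functions over an explicit frame, so instances stop paying per-method
closure costs (an illustrative sketch, not the actual rewrite):

  // Shared: allocated once per program rather than once per instance.
  function getX(frame) { return frame[1]; }
  function getY(frame) { return frame[2]; }
  function makePoint(x, y) {
    return [null, x, y];   // slot 0 reserved for the scope link
  }
  var p = makePoint(3, 4);
  getX(p);   // 3 -- one small array per point, no per-instance closures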

So what we are doing is really a feasibility study to figure out what the most reasonable course of action seems to be. Comments on the specific optimization strategy we wrote up are greatly appreciated, but so are more broad ideas regarding the design constraints we're trying to weasel our way around.

Ihab

David-Sarah Hopwood

Jul 29, 2008, 4:44:30 PM
to google-ca...@googlegroups.com
ihab...@gmail.com wrote:
> Hi David-Sarah,
>
> Thanks for reading the doc and responding.
>
> On Tue, Jul 29, 2008 at 4:06 AM, David-Sarah Hopwood <
> david....@industrial-designers.co.uk> wrote:
>
>> I'm confused as to why you are expecting this to reduce overall allocation.
>> Doesn't it end up allocating a bunch of array objects that would not
>> otherwise have been needed?
>
> Yes, but they don't necessarily hang around any longer than a machine stack
> frame would, *unless* they are captured by closures.

Yes they do. Existing JS interpreters will not be able to see that the
use of these array objects follows stack discipline. They will just leave
it to the GC, and so the GC will run more often.

(It's not a difficult optimization -- just straightforward escape analysis
-- but JS interpreters are *painfully* unoptimized.)

> However, we are making a deal with the devil in some sense. We reduce the
> number of per-object allocations for objects that have a large number of
> closures. In return, we incur one heap allocation every time we call a
> function. Is this the tradeoff you are calling into question?
>
>> Note that these can't be deallocated immediately
>> when a stack frame exits, since they might be captured in closures created
>> by the function. I.e. you're essentially using a spaghetti heap.
>
> I assume "spaghetti heap" == allocating stack frames on the heap, aka a
> stackless interpreter?

Yes. Actually I meant to say "spaghetti stack":
<http://en.wikipedia.org/wiki/Spaghetti_stack>.

> Or am I missing something?
>
> In any case, yes, closures will pin the stack frames, which is precisely
> what we want. Is your point that the whole stack frame will be pinned,
> rather than simply a structure containing only those slots which the closure
> refers to?

That's another problem, yes: you will get potentially huge space leaks.

>> That can be a perfectly reasonable approach *if* the whole language
>> implementation is designed around it,

... for example, the escape analysis for stack frames mentioned above is
essential ...

>> but existing JS implementations are not.
>
> Yes, I think we are gambling that we can implement 10% of a VM inside the
> other 90% and get better performance. This is subject to benchmarking.
>
>> You're also replacing local parameter accesses with array accesses, which
>> should be less efficient.
>
> Well so I think we are trading time for space and hoping that we don't make
> things too bad.

I'm still extremely skeptical; I suspect that you will lose in both time and
space.

--
David-Sarah Hopwood
