If the template system was built on a data format readable by both
JavaScript and C++, such as JSON, we'd have the foundation for
sharing an important part of both projects. Additionally, a
template-based system should give us greater output flexibility and
enhance separation of concerns in our code-output routines. The
potential disadvantages seem mainly to be additional complexity and
coordination between the two code bases. Any thoughts?
David
Not that I'm saying merging the two projects is a bad
idea - I don't have an opinion on that yet, because I
don't know enough about llvm-js-backend. That's one
reason I'm very curious to see examples of the generated
code it produces on things like the benchmarks in the
emscripten test cases, etc., so we can do some comparisons.
Towards that goal, here is some (somewhat outdated, but
still relevant) generated code from emscripten:
http://www.syntensity.com/static/raytrace.js
This is the original C++ code:
http://code.google.com/p/emscripten/source/browse/demos/raytrace.cpp
Can you perhaps run llvm-js-backend on that, and show
the generated code? The comparison would help us understand
how feasible it would be to share the template code.
- azakai
emscripten is reconstructing the original C control flow from the IR's
basic blocks. ljb didn't go this route because it didn't seem possible
in the general case without using some sneaky dynamic control flow
mechanisms like throwing exceptions. Of course, the while-switch
construction ljb generates isn't particularly pretty either, but I
thought the straightforward translation was worth the cost of
obscuring the high-level features. Have you hit any issues so far
with this method?
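[For readers following along: the while-switch form being discussed has roughly this shape. This is a hand-written illustrative sketch, not actual ljb or Emscripten output; the function and label names are made up.]

```javascript
// Each LLVM basic block becomes one switch case; branches are
// expressed by assigning the next block label. This flattens all
// native control flow into one state machine.
function sumTo(n) {
  var label = 0; // entry block
  var i, acc;
  while (true) {
    switch (label) {
      case 0: // entry: initialize locals
        i = 0; acc = 0;
        label = 1; break;
      case 1: // loop condition: branch to body or exit
        label = (i < n) ? 2 : 3; break;
      case 2: // loop body, then back-edge to the condition
        acc += i; i++;
        label = 1; break;
      case 3: // exit block
        return acc;
    }
  }
}
```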
Probably the biggest difference I noted was in memory management. In
ljb, pointers to values are created by calling _p(<value>) -- this
wraps up the value in an object that allows you to index across the
value or overwrite it. It's not going to properly emulate any code
that assumes some knowledge of the underlying memory space as all
pointers end up being "islands". ljb also isn't doing anything special
to emulate the stack, we're just piggy-backing on the javascript
stack. That makes intrinsics like llvm.stacksave and llvm.stackrestore
basically impossible to implement. I'd love to hear an overview of how
emscripten is emulating memory management in general.
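[The real `_p` implementation isn't shown in this thread; a guess at the general shape of such an object-wrapping pointer, purely for illustration:]

```javascript
// Hypothetical sketch of a pointer-as-object scheme like ljb's _p:
// the value is boxed so it can be overwritten or indexed through
// the pointer. Note each pointer is an "island" -- two pointers
// never alias the same storage, so code that compares pointers or
// does arithmetic across allocations breaks.
function _p(value) {
  return {
    slots: [value],
    get: function (i) { return this.slots[i || 0]; },
    set: function (v, i) { this.slots[i || 0] = v; }
  };
}
```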
Also, what's emscripten's overall strategy mapping C/LLVM code into
canvas calls? I'm really interested in pursuing something like this
next with ljb so it would be great if we ended up taking a similar
path.
David
Thanks, very interesting!
>
> emscripten is reconstructing the original C control flow from the IR's
> basic blocks.
Emscripten first generates a representation similar to yours,
basically a 'soup' of code fragments with control flow done by a
switch in a loop. In unoptimized mode, it is left that way,
and control flow looks similar to ljb's in that respect. Only in
optimizing mode does Emscripten additionally work to recreate
native control flow structures from that original representation.
I added examples of both optimized and unoptimized code to
this page:
http://code.google.com/p/emscripten/wiki/GeneratedCodeComparison
> ljb didn't go this route because it didn't seem possible
> in the general case without using some sneaky dynamic control flow
> mechanisms like throwing exceptions.
No need for exceptions necessarily - Emscripten uses just
ifs, loops and labelled breaks. So far this works for all
the code I've tested it on. In theory, if it can't handle
a code fragment, it will fall back to implementing that
particular fragment using the switch-in-a-loop approach.
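[An illustrative sketch of the ifs/loops/labelled-breaks style described here, hand-written rather than generated; the function name is made up:]

```javascript
// The same kind of control flow as a state machine, but expressed
// with a native loop plus a labelled break. The label lets an inner
// conditional exit the construct without a state variable, which
// keeps the code in a form JS engines can optimize.
function firstDivisor(n) {
  var d = 2;
  search: while (true) {
    if (d * d > n) { d = n; break search; } // n is prime
    if (n % d === 0) break search;          // found a divisor
    d++;
  }
  return d;
}
```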
> Of course, the while-switch
> construction ljb generates isn't particularly pretty either, but I
> thought the straightforward translation was worth obfuscating the
> high-level features. Have you hit any issues so far with this method?
The only issue is speed. The switch-in-a-loop approach is
much, much slower, both because it adds a lot of overhead
in itself, and also because it makes it much harder to
apply other optimizations (if 2 code blocks use the same
variables, and you see them one after another, you can
optimize that. But if they are just random blocks in a
switch, you can't).
Emscripten's optimized code is 10-20X faster than its
unoptimized code. A large part of that is due to
using native loop structures.
>
> Probably the biggest difference I noted was in memory management. In
> ljb, pointers to values are created by calling _p(<value>) -- this
> wraps up the value in an object that allows you to index across the
> value or overwrite it.
Emscripten started out that way, and I may implement
an option to use that again sometime, since it's kind
of neat. But, calling a function/creating an object/
defining a closure/etc., for every pointer creation
or copy, is very slow. You may also end up with
a lot of GC overhead.
> ljb also isn't doing anything special
> to emulate the stack, we're just piggy-backing on the javascript
> stack. That makes intrinsics like llvm.stacksave and llvm.stackrestore
> basically impossible to implement. I'd love to hear an overview of how
> emscripten is emulating memory management in general.
Basically, by implementing something as close as possible
to what C/C++ compilers generate. Pointers are just
integers, indexes into memory. So creating them and copying
them is as fast as it can be.
Some more details are (or will be) here:
http://code.google.com/p/emscripten/wiki/RuntimeArchitecture
>
> Also, what's emscripten's overall strategy mapping C/LLVM code into
> canvas calls? I'm really interested in pursuing something like this
> next with ljb so it would be great if we ended up taking a similar
> path.
I'm working on an implementation of SDL, that does canvas
calls. Still figuring out the details, but you can see
it work in the raytracing web demo,
http://www.syntensity.com/static/raytrace.html
For the SDL implementation itself, see
http://code.google.com/p/emscripten/source/browse/src/library_sdl.js
--
Ok, after going over ljb's output, it looks like
the main differences between the two projects, in
terms of generated code, are:
* Memory implementation. To be able to run arbitrary
code, I'd expect you'd need to implement things in a
much more C/C++-like way, both in terms of memory
model as well as how to write into that memory
(the layout of structures, etc.). There are various
gotchas with things like nested structures,
polymorphism/vtables, pointer/integer
conversions, ensuring malloc/memcpy/etc. work
properly, and so forth.
* Speed. To get good performance, you'll likely
need to do the things Emscripten does - generate
native control flow structures, use native JS
variables as much as possible, emulate a stack
efficiently, use typed arrays at runtime if the
JS engine supports it, implement pointers as
integers, etc.
I'm not sure what your goals with ljb are. If you
care about only a subset of C/C++, and don't care
too much about performance, then you might already
be close to that goal. (My goals are different -
one specific thing I want to achieve is to port
3D game engines to the web.)
Or, if you do care about running almost all
C/C++ code, and about good performance, then I
suspect you'll need to implement a lot of stuff
Emscripten already does, and basically our two
projects are needless duplication of effort. You
are of course welcome to use Emscripten's code,
and we can collaborate on specific features,
but I am starting to think that we should
consider merging our two projects. What are your
thoughts on the matter?
- azakai
Using while-switch as a fallback and regenerating native control flow
when possible is a *great* idea. I'm curious how you've measured the
overhead of the while-switch form. I did some micro-benchmarking
myself and I don't recall the overhead being as high as you have
found. That's a side-point, though: native control is surely more
amenable to browser optimizations than the state machine
representation.
>
>>
>> Probably the biggest difference I noted was in memory management. In
>> ljb, pointers to values are created by calling _p(<value>) -- this
>> wraps up the value in an object that allows you to index across the
>> value or overwrite it.
>
> Emscripten started out that way, and I may implement
> an option to use that again sometime, since it's kind
> of neat. But, calling a function/creating an object/
> defining a closure/etc., for every pointer creation
> or copy, is very slow. You may also end up with
> a lot of GC overhead.
I can buy that. I debated about how to approach pointers for a long
time and I'm still not happy with this solution. One thing that I'd
really like to investigate is situations where pointers can safely be
transformed to js references. I'd imagine even indices into a
simulated heap a la emscripten would be more expensive than plain
variables. And because LLVM forces all globals to be pointers, I
imagine there are a number of cases where the pointer dereferences are
an unnecessary artifact of the bitcode and not the program semantics.
>
>> ljb also isn't doing anything special
>> to emulate the stack, we're just piggy-backing on the javascript
>> stack. That makes intrinsics like llvm.stacksave and llvm.stackrestore
>> basically impossible to implement. I'd love to hear an overview of how
>> emscripten is emulating memory management in general.
>
> Basically, by implementing something as close as possible
> to what C/C++ compilers generate. Pointers are just
> integers, indexes into memory. So creating them and copying
> them is as fast as it can be.
>
> Some more details are (or will be) here:
>
> http://code.google.com/p/emscripten/wiki/RuntimeArchitecture
>
I think you've convinced me that explicitly representing the heap is
less overhead than pointer objects. I don't quite follow your point
about polymorphism and vtables, though. Aren't those language features
eliminated by the conversion to LLVM? What is the advantage of
reconstructing them?
>>
>> Also, what's emscripten's overall strategy mapping C/LLVM code into
>> canvas calls? I'm really interested in pursuing something like this
>> next with ljb so it would be great if we ended up taking a similar
>> path.
>
> I'm working on an implementation of SDL, that does canvas
> calls. Still figuring out the details, but you can see
> it work in the raytracing web demo,
>
> http://www.syntensity.com/static/raytrace.html
>
> For the SDL implementation itself, see
>
> http://code.google.com/p/emscripten/source/browse/src/library_sdl.js
>
That's really cool. Have you considered a WebGL implementation?
Clearly, it's not as standards-compliant a solution, but it would
have some pretty big performance advantages on platforms that support
it.
> --
>
> Ok, after going over ljb's output, it looks like
> the main differences between the two projects, in
> terms of generated code, are:
>
> * Memory implementation. To be able to run arbitrary
> code, I'd expect you'd need to implement things in a
> much more C/C++-like way, both in terms of memory
> model as well as how to write into that memory
> (the layout of structures, etc.). There are various
> gotchas with things like nested structures,
> polymorphism/vtables, pointer/integer
> conversions, ensuring malloc/memcpy/etc. work
> properly, and so forth.
>
> * Speed. To get good performance, you'll likely
> need to do the things Emscripten does - generate
> native control flow structures, use native JS
> variables as much as possible, emulate a stack
> efficiently, use typed arrays at runtime if the
> JS engine supports it, implement pointers as
> integers, etc.
Speed is definitely one of my primary goals and everything you mention
here seems valid to me.
>
> I'm not sure what your goals with ljb are. If you
> care about only a subset of C/C++, and don't care
> too much about performance, then you might already
> be close to that goal. (My goals are different -
> one specific thing I want to achieve is to port
> 3D game engines to the web.)
>
> Or, if you do care about running almost all
> C/C++ code, and about good performance, then I
> suspect you'll need to implement a lot of stuff
> Emscripten already does, and basically our two
> projects are needless duplication of effort. You
> are of course welcome to use Emscripten's code,
> and we can collaborate on specific features,
> but I am starting to think that we should
> consider merging our two projects. What are your
> thoughts on the matter?
I'm shooting for full C/C++ support. However, I was envisioning ljb
being used by new code more often than taking existing code and
compiling to the web. So my short term goals were more around exposing
javascript's "I/O" so that interesting new coding could begin without
having to correctly emulate the full C runtime necessary to get
existing code operating correctly. Also, I'm really excited by the
possibility of compiling other LLVM languages into JS, so I had been
concentrating more on breadth of bitcode support and saving the
optimization for later.
I think we'd be doing our shared objective a disservice if we didn't
seriously consider merging the two projects. Today, it seems to me you
have a significant lead in implementation. I do still feel like there
are some pretty big advantages to implementing this functionality as
an LLVM backend, however, so I'm torn about how to approach this. Did you
have any ideas in mind?
David
>
> - azakai
>
Hmm, running the Fannkuch benchmark now with and without
generating native loops: without them, it is 50% slower.
> That's a side-point, though: native control is surely more
> amenable to browser optimizations than the state machine
> representation.
Yeah, exactly. For example, tracing will not be
done at all for switch/loop code, but it can
speed up native loops very significantly.
(Probably the Fannkuch inner loops aren't being
traced for some reason.)
>
> >
> >>
> >> Probably the biggest difference I noted was in memory management.
> >> In
> >> ljb, pointers to values are created by calling _p(<value>) -- this
> >> wraps up the value in an object that allows you to index across the
> >> value or overwrite it.
> >
> > Emscripten started out that way, and I may implement
> > an option to use that again sometime, since it's kind
> > of neat. But, calling a function/creating an object/
> > defining a closure/etc., for every pointer creation
> > or copy, is very slow. You may also end up with
> > a lot of GC overhead.
>
>
> I can buy that. I debated about how to approach pointers for a long
> time and I'm still not happy with this solution. One thing that I'd
> really like to investigate is situations where pointers can safely be
> transformed to js references. I'd imagine even indices into a
> simulated heap a la emscripten would be more expensive than plain
> variables.
Pointers to non-structure variables can be 'nativized'
into JS variables. Emscripten does that for all stack
variables (except ones whose address is taken). Such
variables are much faster than heap accesses, for sure.
Structures are harder to deal with. For one thing, it's
much harder to implement properly, as pointers to
fields in the structure can be taken. So you need
closures or objects to represent such pointers. Also,
using native structures means you will be using GC.
But, this is definitely something worth investigating
later on.
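[A hand-written illustration of the "nativization" idea azakai describes, comparing the two forms side by side; the stack emulation here is a made-up minimal sketch:]

```javascript
// Two versions of the same function. The first keeps its local on an
// emulated stack (as if its address could be taken); the second
// promotes the local to a plain JS variable, which is legal only
// because its address is never taken.
var STACK = new Array(256).fill(0);
var SP = 0;

function countHeapStyle(n) {
  var iPtr = SP++;          // "alloca" one stack slot for i
  STACK[iPtr] = 0;
  while (STACK[iPtr] < n) STACK[iPtr]++; // every access hits the heap
  SP--;                     // pop the frame
  return STACK[iPtr];
}

function countNativized(n) {
  var i = 0;                // same local, now a native JS variable
  while (i < n) i++;        // JIT can keep this in a register
  return i;
}
```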
> >
> >> ljb also isn't doing anything special
> >> to emulate the stack, we're just piggy-backing on the javascript
> >> stack. That makes intrinsics like llvm.stacksave and
> >> llvm.stackrestore
> >> basically impossible to implement. I'd love to hear an overview of
> >> how
> >> emscripten is emulating memory management in general.
> >
> > Basically, by implementing something as close as possible
> > to what C/C++ compilers generate. Pointers are just
> > integers, indexes into memory. So creating them and copying
> > them is as fast as it can be.
> >
> > Some more details are (or will be) here:
> >
> > http://code.google.com/p/emscripten/wiki/RuntimeArchitecture
> >
>
> I think you've convinced me that explicitly representing the heap is
> less overhead than pointer objects. I don't quite follow your point
> about polymorphism and vtables, though. Aren't those language features
> eliminated by the conversion to LLVM? What is the advantage of
> reconstructing them?
Sorry, I guess I wasn't clear. All I meant was that
the pointer and memory implementation needs to support
the code that LLVM generates for polymorphism and
vtables. Also stuff like pointer-to-int conversions
and back. Had to write a lot of unit tests for those
things.
>
> >>
> >> Also, what's emscripten's overall strategy mapping C/LLVM code into
> >> canvas calls? I'm really interested in pursuing something like this
> >> next with ljb so it would be great if we ended up taking a similar
> >> path.
> >
> > I'm working on an implementation of SDL, that does canvas
> > calls. Still figuring out the details, but you can see
> > it work in the raytracing web demo,
> >
> > http://www.syntensity.com/static/raytrace.html
> >
> > For the SDL implementation itself, see
> >
> > http://code.google.com/p/emscripten/source/browse/src/library_sdl.js
> >
>
> That's really cool. Have you considered a WebGL implementation?
> Clearly, it's not as standards-compliant a solution, but it would
> have some pretty big performance advantages on platforms that support
> it.
>
Yes, definitely something should be done for
WebGL. Not sure what, though. I was hoping it's
close enough to OpenGL ES that it could be
done fairly straightforwardly. But, not sure.
>
> I'm shooting for full C/C++ support. However, I was envisioning ljb
> being used by new code more often than taking existing code and
> compiling to the web. So my short term goals were more around exposing
> javascript's "I/O" so that interesting new coding could begin without
> having to correctly emulate the full C runtime necessary to get
> existing code operating correctly. Also, I'm really excited by the
> possibility of compiling other LLVM languages into JS, so I had been
> concentrating more on breadth of bitcode support and saving the
> optimization for later.
Interesting. I have been focusing on existing
C++ code for now, mainly in order to do demos ;)
But actually most of the C runtime is now done
anyhow. Would definitely be neat to do stuff like
other LLVM languages.
>
> I think we'd be doing our shared objective a disservice if we didn't
> seriously consider merging the two projects. Today, it seems to me you
> have a significant lead in implementation. I do still feel like there
> are some pretty big advantages to implementing this functionality as
> an LLVM backend, however, so I'm torn about how to approach this. Did you
> have any ideas in mind?
>
Well, my first question is, what do you see
as the benefits of implementing as an LLVM
backend? Let's start by discussing that.
Note that if we think about it, and decide that
being an LLVM backend is worthwhile, we can still
base off of the Emscripten code. Emscripten has
3 parts, the parser, the analyzer, and the
JS generator. The parser could be replaced
with a C++ LLVM backend, that calls the other
two parts (through an embedded JS engine).
We might be able to get the best of both
worlds that way.
As for why writing the other 2 parts in JS
makes sense, I chose that path for several
reasons:
1. Compiling into the same language as the
compiler is very useful. Emscripten can
generate objects, like for example
index offsets for a particular structure,
and use that during compilation. If it is
also needed at runtime, JSON.stringify()
and it's there. Can also test a lot of
optimizations during compile time, using
eval (does eval-ing something get us a
constant value? then replace it with
that value, etc.). Likewise, a lot of
runtime functions are used both during
compile time, and runtime - the exact
same code is run in both cases. It's just
pasted into the generated output.
In the farther future, to really get good
performance and all the language features,
we may need to rewrite code during runtime.
For example, to emulate threads, we can
do something like cooperative multitasking
using generators (credit for that idea goes
to bcrowder). But we'd likely want to change
how those work at runtime - inserting
yields and removing them as necessary.
Having the compiler in the same language
as the generated code will allow using
compiler functionality at runtime, making
this much easier.
2. Lots of experimental stuff here, like how
to reconstruct native loops etc. So a lot
of testing of various ideas is to be expected
(and needed). Dynamic languages, IMHO,
are faster at that. Also, code size:
converting the existing Emscripten code
into C++ would end up in a much larger
and more complex codebase (I used a lot
of JS language features to make it as
compact as it is). A lot of Emscripten's
quick progress is due to JavaScript being
used.
- azakai
The demos are pretty fantastic, btw :)
This is a *very* cool idea. Have you considered basing it on web
workers, though? It would only work for FF3.5+ and Safari/Chrome but
that would be a concession I could live with. Not to get off on too
much of a tangent, but did you have any thoughts about what browsers
you intended to support?
On the larger point about the advantage of writing your compiler in
your target language, I think you're right on the money. The ability
to share generated and compiler code is a bonus I hadn't even thought
of.
>
> 2. Lots of experimental stuff here, like how
> to reconstruct native loops etc. So a lot
> of testing of various ideas is to be expected
> (and needed). Dynamic languages, IMHO,
> are faster at that. Also, code size:
> converting the existing Emscripten code
> into C++ would end up in a much larger
> and more complex codebase (I used a lot
> of JS language features to make it as
> compact as it is). A lot of Emscripten's
> quick progress is due to JavaScript being
> used.
Given the option, I'd choose javascript over C++ for almost any
project :) I started ljb with the notion of making it easy-to-consume
from within LLVM and the idea of embedding a js runtime didn't occur
to me. So I definitely concede there are huge advantages to writing the
compiler in javascript but I'd say there's a complementary set of
advantages to tight integration with the LLVM toolchains:
* LLVM provides detailed semantic analysis of the parsed bitcode.
Compile time constants and literals are labelled. All references to
variables are available and traversable. On top of the basic stuff,
LLVM has a number of analysis passes that can be run before the
backend is invoked for more in-depth information. I think this sort of
functionality will enable some really interesting optimizations later
on and recreating it all would be a lot of effort.
* Being built into LLVM opens the project up to a wider audience and
allows us to simplify the experience for the end-user. For instance,
there's a large number of code transformations already available that
may make sense to apply before we even touch anything. Writing the
code generator as a backend allows us to register those
transformations as a prerequisite or at one of the predefined
optimization levels so we can relieve users of the burden of
determining and selecting the appropriate options from the front-end.
That's ignoring the most obvious advantage: "-march=js" is likely to
be much easier for most people than downloading a separate project and
learning how to run it. Of course, this whole point is moot if there
isn't enough interest in the LLVM community to integrate the backend
one day.
At first blush, I'm really excited by the-best-of-both-worlds
approach. Let me know what you think.
David
>
> - azakai
>
I believe the current spec for web workers doesn't allow
shared state. So that's a problem for a simple approach,
but I was thinking about providing an API for non-shared
state stuff. (Something like pthreads, but explicit about
what memory region the thread can access. Maybe one of
the existing multiprocessing C/C++ APIs?)
> It would only work for FF3.5+ and Safari/Chrome but
> that would be a concession I could live with. Not to get off on too
> much of a tangent, but did you have any thoughts about what browsers
> you intended to support?
I believe the code should be standard JavaScript, so
it can run perfectly on all standards-compliant
web browsers. With that said, supporting additional
features as an option is fine. Emscripten has an
option to utilize typed arrays, for example, which
few browsers support now, but it makes the ones that
do much faster.
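[A sketch of the optional typed-array heap mentioned above: feature-detect at startup and fall back to a plain array on engines without support. The helper names are made up for illustration:]

```javascript
// Use Int32Array when the engine supports typed arrays (fast, dense,
// and naturally clamps to 32-bit ints); otherwise fall back to a
// plain JS array, with | 0 providing the same int semantics.
var HEAP32 = (typeof Int32Array !== 'undefined')
  ? new Int32Array(1024)
  : new Array(1024);

function storeI32(ptr, v) { HEAP32[ptr] = v | 0; }
function loadI32(ptr) { return HEAP32[ptr] | 0; }
```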
>
> * LLVM provides detailed semantic analysis of the parsed bitcode.
> Compile time constants and literals are labelled. All references to
> variables are available and traversable. On top of the basic stuff,
> LLVM has a number of analysis passes that can be run before the
> backend is invoked for more in-depth information. I think this sort of
> functionality will enable some really interesting optimizations later
> on and recreating it all would be a lot of effort.
I remember looking over the passes and being
disappointed, maybe because I was hoping to
find something like "reconstruct high-level loop
structures" ;) But, I think you're right, there
are surely things that can help here (some of the LLVM
passes in development looked interesting, so even if
not today, this stuff will be useful later).
> * Being built into LLVM opens the project up to a wider audience and
> allows us to simplify the experience for the end-user. For instance,
> there's a large number of code transformations already available that
> may make sense to apply before we even touch anything. Writing the
> code generator as a backend allows us to register those
> transformations as a prerequisite or at one of the predefined
> optimization levels so we can relieve users of the burden of
> determining and selecting the appropriate options from the front-end.
> That's ignoring the most obvious advantage: "-march=js" is likely to
> be much easier for most people than downloading a separate project and
> learning how to run it.
I fully agree that if in the long run the backend
gets upstream, then that is by far the optimal
situation.
(Side note, another benefit of not being an LLVM backend
is that it allows people that write things that
generate LLVM bytecode *without* LLVM itself, to use
Emscripten to generate JS directly. In other words, just
the LLVM format would be used, but none of its code.
That would allow lightweight compilation into JS, without
installing LLVM at all. I'm imagining something like
a Ruby-to-LLVM convertor - maybe written in Ruby -
combined with Emscripten.)
Bottom line, I agree with the reasons you mentioned,
and after thinking about it, it seems there are good
arguments for both approaches. Good news is I think
we can really get the best of both worlds, how about
this direction:
* We write an LLVM backend, that generates JSON in a
format Emscripten's 2nd&3rd passes can use. That
backend simply invokes Emscripten and passes it
that data (no need to even integrate a JS engine,
can just run a commandline JS engine in a child
process).
* We'll keep (at least for now?), Emscripten's 1st
pass, which will generate the same format of JSON.
It's just 676 lines of code anyhow.
So, we will have two interchangeable frontends to
Emscripten's core (the 2nd&3rd passes - the analyzer
and the code generator). The LLVM backend one will
be able to provide additional data, from LLVM passes,
that the Emscripten optimizer might use. That would
be optional, just as there is an option now to either
generate optimized code or not optimize at all.
What do you think?
- azakai
> * We write an LLVM backend, that generates JSON in a
> format Emscripten's 2nd&3rd passes can use. That
> backend simply invokes Emscripten and passes it
> that data (no need to even integrate a JS engine,
> can just run a commandline JS engine in a child
> process).
>
> * We'll keep (at least for now?), Emscripten's 1st
> pass, which will generate the same format of JSON.
> It's just 676 lines of code anyhow.
>
> So, we will have two interchangeable frontends to
> Emscripten's core (the 2nd&3rd passes - the analyzer
> and the code generator). The LLVM backend one will
> be able to provide additional data, from LLVM passes,
> that the Emscripten optimizer might use. That would
> be optional, just as there is an option now to either
> generate optimized code or not optimize at all.
>
> What do you think?
This sounds like a good strategy. So I'm happy to take on the task of
getting LLVM to start emitting JSON in an Emscripten-friendly format
and we can sync back up once that seems to be working correctly and
figure out where to go next. Would that work for you?
David
>
> - azakai
>
Sounds good! We should talk about how that JSON
data would look though. A starting point is the
current JSON Emscripten's first pass generates,
but this was never meant to be a public
interface, so it isn't very cleanly designed. I'll
try to document it on the wiki soon. In general
though, it's a fairly straightforward reflection
of the LLVM bytecode.
- azakai
Added some details to the wiki:
http://code.google.com/p/emscripten/wiki/InternalDataFormat
- azakai
That's great, thanks. I haven't had a chance to dive into it too deeply yet but this weekend is looking good. We should have something concrete to go over shortly thereafter.
David
> > For example, to emulate threads, we can
> > do something like cooperative multitasking
> > using generators (credit for that idea goes
> > to bcrowder).
> This is a *very* cool idea. Have you considered basing it on web
> workers, though?
I believe the current spec for web workers doesn't allow
shared state.
----- Original Message -----
> From: "Jonathan Toland" <tol...@dnalot.com>
> To: llvm-js...@googlegroups.com
> Cc: emscripte...@googlegroups.com
> Sent: Sunday, October 2, 2011 2:34:18 AM
> Subject: Re: Towards Deeper Integration
> I'm late to this bandwagon. I've been researching compiling and
> tooling a more expressive inferred or optionally typed language for
> the Flash platform. LLVM would be an ideal IL but not at the expense
> of language or tooling e.g. interoperability and debugging. Our
> problem domain is similar regardless.
>
>
>
> > > For example, to emulate threads, we can
> > > do something like cooperative multitasking
> > > using generators (credit for that idea goes
> > > to bcrowder).
>
> Citation please? In your paper you implied additional threading
> concepts? Scott Peterson of Adobe simulated threads for Alchemy (see
> Flash C Compiler) by emulating a SPARC architecture (its code
> generation looks like unoptimised Emscripten) and managing his own
> call stacks (but might preclude Relooper).
>
I don't have anything to cite, this is just an
idea. But basically you can use relooping and
so forth, but make each function a generator,
so that it can yield and be resumed later. Then
you can maintain the list of called functions
on a stack and do cooperative multithreading that
way.
This approach would at least let your functions
have fast inner loops (since you can use the
relooper). But maintaining a stack of called
functions would add some overhead. So hard
to say what performance would be like. (Maybe
with a lot of inlining it would be ok?)
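[The generator idea sketched in code. Note this uses `function*` generator syntax, which postdates this discussion; the names and the round-robin scheduler are illustrative assumptions, not a real implementation:]

```javascript
// Each "thread" is a generator that yields at safe points; a
// scheduler round-robins over the live generators, giving
// cooperative multitasking. Relooped fast inner loops can run
// between yields.
function* worker(name, steps, log) {
  for (var i = 0; i < steps; i++) {
    log.push(name + ':' + i);
    yield; // cooperative yield point
  }
}

function runAll(threads) {
  var live = threads.slice();
  while (live.length) {
    // resume each thread once; drop the ones that have finished
    live = live.filter(function (t) { return !t.next().done; });
  }
}
```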
>
>
> > This is a *very* cool idea. Have you considered basing it on web
> > workers, though?
>
> I believe the current spec for web workers doesn't allow
> shared state.
> Have you looked at shared workers?
I don't think they're implemented in any browsers
yet, but yes, they might help. I suspect though
that they are not meant for very fast performance. No
current JS engine has good support for letting multiple
threads access the same objects AFAIK, so I assume the
shared state will be synchronized with messages or
something similar. This would probably be much too
slow for a shared memory space which is updated on
each write to memory.
- azakai