[{ "ident": "_OC_str", "intertype": "globalVariable", "lineNum": 0,
"type": "", "value": { "text": ""hello, world!\n"" }},
{ "ident": "main", "intertype": "function", "lineNum": 0, "params":
[], "returnType": "" },
{ "ident": "v0", "intertype": "alloca", "lineNum": 0, "operands": [{
"value": "1", "type": "" }],
{ "ident": "v1", "intertype": "store", "lineNum": 0, "operands": [{
"value": "0", "type": "" }, { "value": "v0", "type": "" }],
{ "ident": "v2", "intertype": "call", "lineNum": 0, "operands": [{
"value": "(_a(_OC_str(0)))", "type": "" }, { "value": "printf",
"type": "" }],
{ "intertype": "return", "lineNum": 0, "type": """operands": [{
"value": "0", "type": "" }]},
{ "intertype": "functionEnd", "lineNum": 0 }]
You'll notice there are a couple of things missing (e.g. types,
lineNums) and some structural differences ("operands" vs. emscripten's
richer instruction operand serialization), but I thought now was a good
time
to pause and ask questions and get some feedback. First off, I'm
confused by the ordering of the example output on the wiki. It looks
like lines 9 and 11 get shuffled around in the output order and I
can't think of a good reason why that would be. If it's an unnecessary
artifact of the parsing process I can certainly ignore it. Also, I
notice the internal format often stores the textual representation of
the bitcode alongside the other attributes of nodes in the AST. Is
this actually used later on in the compilation process and if so, how?
It's possible to reconstruct some facsimile of the original bitcode
input but it's something I'd prefer not to have to code up.
I also made a pretty interesting discovery while doing this: there
*is* some form of higher-level loop info in LLVM's AST. I haven't dug
into it yet but I'm excited that we may have some help reconstructing
control flow. Finally, I noted on the wiki that the internal data
format wasn't yet set in stone. I'd like to propose some changes to
bring it closer into line with the class hierarchy LLVM uses
internally to represent the IR. Yes, a big part of this is making my
job easier :) But I think it may also bear fruit in the long run if we
decide to look at annotating the internal format with metadata from
LLVM's analysis passes. The current code is up at my github repo
(https://github.com/dmlap/llvm-js-backend) for anyone who'd like to
check it out.
David
The order of lines in the JSON doesn't matter; it can be
ignored. What matters is the lineNum, which is used to order
them later.
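For example (just a sketch, assuming the JSON has been parsed into a
plain array, that every node carries a numeric lineNum, and with
jsonText being a made-up variable holding the text above), a consumer
could restore the ordering like this:

  // Sketch: sort the parsed nodes by lineNum before processing them.
  // The order they appear in the JSON array is not significant.
  var nodes = JSON.parse(jsonText);
  nodes.sort(function (a, b) { return a.lineNum - b.lineNum; });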
> Also, I
> notice the internal format often stores the textual representation of
> the bitcode alongside the other attributes of nodes in the AST. Is
> this actually used later on in the compilation process and if so, how?
The main idea behind that is being able to provide useful
debugging info, so if there is a problem with the
generated code, you can more easily figure it out.
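For instance (only a sketch; the "text" field name and emitWithDebug
are made up for illustration), the code generator could echo the
stored representation as a trailing comment on each line it emits, so
bad output is easy to trace back to the bitcode that produced it:

  // Sketch: append the node's original bitcode text as a comment.
  function emitWithDebug(node, js) {
    return node.text ? js + ' // ' + node.text : js;
  }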
> It's possible to reconstruct some facsimile of the original bitcode
> input but it's something I'd prefer not to have to code up.
If this is hard to do, we can live without it for
the time being, in my opinion.
>
> I also made a pretty interesting discovery while doing this: there
> *is* some form of higher-level loop info in LLVM's AST.
Really? Heh. Wish I knew that before I wrote all
the Relooper code - twice ;)
> I haven't dug
> into it yet but I'm excited that we may have some help reconstructing
> control flow. Finally, I noted on the wiki that the internal data
> format wasn't yet set in stone. I'd like to propose some changes to
> bring it closer into line with the class hierarchy LLVM uses
> internally to represent the IR. Yes, a big part of this is making my
> job easier :) But I think it may also bear fruit in the long run if we
> decide to look at annotating the internal format with metadata from
> LLVM's analysis passes.
Yeah, that makes sense to me too. The current
format was just created out of convenience; we
should definitely improve it as necessary.
- azakai