Compiling from source code vs. byte code

Scott Blum

Jun 15, 2007, 12:02:04

to Google Web Toolkit Contributors

I think we should move this conversation to GWT-Contrib. My follow-up to follow.

Forwarded Conversation

From: Toby Reyelts <to...@google.com>

To: Joel Webber < j...@google.com>

Date: Fri, Jun 15, 2007 at 11:00 AM

Generic typing information is present in Java 1.5 class files, and I'd like to see even a single optimization that you can perform from a source parse that you can't from a bytecode parse.

In general, I have trouble seeing advantages to parsing Java source code over Java bytecode, whereas I can name several disadvantages off the top of my head.

--------
From: Joel Webber <j...@google.com>

To: Toby Reyelts <to...@google.com >

Date: Fri, Jun 15, 2007 at 11:06 AM

Perhaps I know not what I'm talking about. I thought all the generic type information was elided in the bytecode, but I didn't think about the fact that it could be available in the headers or something like that.

As far as source vs. bytecode is concerned, I've always been of the impression that you could get most everything from parsing bytecode. At times, though, I've seen decompilers generate lots of weird labels and jumps, turning for loops into while loops, etc. I don't know if this is actually a theoretical issue or a practical one. Miguel has a lot of experience trying to do this on .NET, so he probably knows more about it than I do.

joel.

--------
From: John Tamplin <j...@google.com>

To: Toby Reyelts <to...@google.com>

Date: Fri, Jun 15, 2007 at 11:09 AM

On 6/15/07, Toby Reyelts <to...@google.com > wrote:

I don't see how either of these statements are true. Generic typing information is present in Java 1.5 class files, and I'd like to see even a single optimization that you can perform from a source parse that you can't from a bytecode parse.

As I understand it, the generated bytecode loses generic type information via type erasure (yes it is still present in method signatures etc).

In general, I have trouble seeing advantages to parsing Java source code over Java bytecode, whereas I can name several disadvantages off the top of my head.

You are operating at a lower level, dealing with smaller operations. Given that JS is roughly the same level of abstraction as the Java source, to get efficient JS generation you would have to essentially piece together bytecode ops (which theoretically could be intermixed from different statements complicating this, although they don't seem to be in practice) to get back to the corresponding Java source. Essentially, we would be adding the complexity and fragility of a bytecode decompiler to the JS compile process.

--------
From: Toby Reyelts <to...@google.com>

To: John Tamplin <j...@google.com>

Date: Fri, Jun 15, 2007 at 11:44 AM

On 6/15/07, John Tamplin < j...@google.com> wrote:

On 6/15/07, Toby Reyelts <to...@google.com> wrote:

I don't see how either of these statements are true. Generic typing information is present in Java 1.5 class files, and I'd like to see even a single optimization that you can perform from a source parse that you can't from a bytecode parse.

As I understand it, the generated bytecode loses generic type information via type erasure (yes it is still present in method signatures etc).

If you have a generic type declaration in your source code, it's available in the class file. This holds for classes, methods, fields, parameters, and locals.

In general, I have trouble seeing advantages to parsing Java source code over Java bytecode, whereas I can name several disadvantages off the top of my head.

You are operating at a lower level, dealing with smaller operations. Given that JS is roughly the same level of abstraction as the Java source, to get efficient JS generation you would have to essentially piece together bytecode ops (which theoretically could be intermixed from different statements complicating this, although they don't seem to be in practice) to get back to the corresponding Java source.

Java bytecode is nearly the same level of abstraction as source code, just generally more explicit. For example, instead of having to do complicated name-resolution lookup involving scopes and import statements to determine which type an identifier represents, java bytecode contains a fully-qualified reference to the type. This is simpler - not more complex. Rather than hand-waving in the abstract, why not give some concrete examples of specific optimizations that can only occur from a source code parse?

Essentially, we would be adding the complexity and fragility of a bytecode decompiler to the JS compile process.

Java bytecode has been more stable than the Java language. I think that speaks for itself. It also sounds like you believe that the goal of parsing bytecode would be to generate the same parse tree we'd have gotten had we parsed the original source, which is inaccurate.

My primary interest in responding to this thread was to point out the misinformation, rather than getting into a full-blown debate about the two approaches (although I'm up for that at any point). I'd just like us to have our facts straight, particularly on issues that we could likely end up commenting about in public.

Scott Blum

Jun 15, 2007, 12:05:38

to Google Web Toolkit Contributors

On 6/15/07, Toby Reyelts <to...@google.com> wrote:

If you have a generic type declaration in your source code, it's available in the class file. This holds for classes, methods, fields, parameters, and locals.

Good to know.

Java bytecode is nearly the same level of abstraction as source code, just generally more explicit. For example, instead of having to do complicated name-resolution lookup involving scopes and import statements to determine which type an identifier represents, java bytecode contains a fully-qualified reference to the type. This is simpler - not more complex.

It's not really our problem at the moment, as JDT does all this for us. It also does additional things for us, like ensure the code is error free, that all necessary classes are available, that the set of classes are internally consistent (for example, the user didn't use JRE APIs that aren't in our JRE). We'd need alternative solutions for these.

But I can see in principle how compiling from bytecode might be faster.

Rather than hand-waving in the abstract, why not give some concrete examples of specific optimizations that can only occur from a source code parse?

I don't think we could answer this question without really diving into it.

Java bytecode has been more stable than the Java language. I think that speaks for itself. It also sounds like you believe that the goal of parsing bytecode would be to generate the same parse tree we'd have gotten had we parsed the original source, which is inaccurate.

I'm not sure I follow. Are you saying that the parse tree we get from source is inaccurate (I'm not sure how it could be) or that there's no need to get the same AST from byte code that we could have gotten from source?

My primary interest in responding to this thread was to point out the misinformation, rather than getting into a full-blown debate about the two approaches (although I'm up for that at any point). I'd just like us to have our facts straight, particularly on issues that we could likely end up commenting about in public.

There are some nontrivial issues to be solved if we were to go that route. For example, we currently support annotations in javadoc, which don't get preserved into the class file. While this will matter far less when 1.5 support is there, we would have to consider the backwards compatibility case. Native method bodies are another thing to consider. If we require a "special" kind of compilation from source to generate class files with enough information, then we've sort of just pushed the problem around.

Scott

John Tamplin

Jun 15, 2007, 12:19:23

to Google-Web-Tool...@googlegroups.com

From: Toby Reyelts < to...@google.com>

If you have a generic type declaration in your source code, it's available in the class file. This holds for classes, methods, fields, parameters, and locals.

As I understand it, and I haven't actually looked at generated bytecode for 1.5 constructs, the actual bytecode for the method will not include generic information. Instead, the compiler simply inserts the appropriate cast instructions. I certainly agree that the information is present in the type signatures of methods/etc, and that you can figure out the original code from the combination of the generated bytecode and that metainformation, but it is harder than already having it when you parse the source.

Java bytecode is nearly the same level of abstraction as source code, just generally more explicit. For example, instead of having to do complicated name-resolution lookup involving scopes and import statements to determine which type an identifier represents, java bytecode contains a fully-qualified reference to the type. This is simpler - not more complex. Rather than hand-waving in the abstract, why not give some concrete examples of specific optimizations that can only occur from a source code parse?

I didn't say it wasn't possible to do it, I said it was more work to have to essentially do what a Java decompiler would do first. If the bytecode is at the same level as the source, then why is the decompiler more complicated and less accurate than a disassembler? A disassembler will produce the exact sequence of instructions that were fed into the assembler, yet a Java decompiler will generally not produce identical source.

Also, as Scott points out, JSNI methods are a big issue since the compiler will simply eliminate them.

--
John A. Tamplin
Software Engineer, Google
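
Both halves of the exchange above can be checked with plain reflection. The following is a minimal sketch, not part of the original thread: the class names GenericSignatureDemo and Holder are made up for illustration. It reads the generic signatures that the class file records for a field and a method, and then shows that two differently parameterized lists still share a single runtime class, which is the erasure John describes.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;
import java.util.List;

public class GenericSignatureDemo {
    // Hypothetical holder class, used only for this illustration.
    static class Holder {
        List<Integer> numbers = new ArrayList<Integer>();

        List<String> names(List<Integer> ids) {
            return new ArrayList<String>();
        }
    }

    /** Returns the declared generic type of a field, as recorded in the class file. */
    static String fieldSignature(Class<?> owner, String fieldName) {
        try {
            Field f = owner.getDeclaredField(fieldName);
            return f.getGenericType().toString();
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Returns the first type argument of a method's generic return type. */
    static Type returnTypeArgument(Class<?> owner, String methodName, Class<?>... params) {
        try {
            Method m = owner.getDeclaredMethod(methodName, params);
            ParameterizedType pt = (ParameterizedType) m.getGenericReturnType();
            return pt.getActualTypeArguments()[0];
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** True when two differently parameterized lists share one runtime class (erasure). */
    static boolean sameRuntimeClass() {
        List<Integer> ints = new ArrayList<Integer>();
        List<String> strings = new ArrayList<String>();
        return ints.getClass() == strings.getClass();
    }

    public static void main(String[] args) {
        System.out.println(fieldSignature(Holder.class, "numbers"));
        System.out.println(returnTypeArgument(Holder.class, "names", List.class));
        System.out.println(sameRuntimeClass());
    }
}
```

The sketch only covers fields and method signatures. It says nothing about local variables, which reflection does not expose.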

Ray Cromwell

Jun 15, 2007, 12:39:40

to Google-Web-Tool...@googlegroups.com

I see several issues with compiling from bytecode:

1) IIRC, generic type information is only available for certain things (method declarations; not fields, not return types, not method locals)

2) type erasure generates spurious bridge methods that GWT either does not need or would handle differently

3) inner class/local class/anonymous inner class transformation is not necessarily how GWT wants to do it

4) JSNI information won't be preserved (now what, force people to move all JSNI code to external .js files or XML?) This is the deal breaker.

5) autoboxing!

I also think it's not going to speed anything up. Right now, GWT parses source, and runs visitors over the AST to perform its optimizations. Now if you feed it bytecode, it will essentially need to rebuild an intermediate representation in memory amenable to optimization passes, which essentially means parsing bytecode and constructing trees and graphs from it. Not only that, but all the optimizations will have to be rewritten to deal with a new, lower-level IR format.

I'd rather have the GWT team work on increasing GWT's capabilities, fixing bugs, and improving performance in other areas (like JS code size/performance) rather than reworking the guts of the compiler on the premise that somehow running JavaC/Jikes + GWT in succession will result in superior performance.

Has it even been determined that GWT spends most of its time parsing?

-Ray


Sandy McArthur

Jun 15, 2007, 12:39:52

to Google-Web-Tool...@googlegroups.com

On 6/15/07, Scott Blum <sco...@google.com> wrote:

I think we should move this conversation to GWT-Contrib. My follow-up to follow.

Forwarded Conversation

From: John Tamplin <j...@google.com>
To: Toby Reyelts <to...@google.com>
Date: Fri, Jun 15, 2007 at 11:09 AM

On 6/15/07, Toby Reyelts <to...@google.com > wrote:

In general, I have trouble seeing advantages to parsing Java source code over Java bytecode, whereas I can name several disadvantages off the top of my head.

You are operating at a lower level, dealing with smaller operations. Given that JS is roughly the same level of abstraction as the Java source, to get efficient JS generation you would have to essentially piece together bytecode ops (which theoretically could be intermixed from different statements complicating this, although they don't seem to be in practice) to get back to the corresponding Java source. Essentially, we would be adding the complexity and fragility of a bytecode decompiler to the JS compile process.

Also, javac introduces synthetic methods to get around some access-related differences between Java bytecode and Java source. For example, an inner class accessing a private member of the containing class will usually create a synthetic access method with a more accessible scope.

JavaScript doesn't have the enforced scoping rules the JVM has but unless you introduced decompiler inference logic I think you're more likely to produce less optimal JavaScript because of the "quirks" javac introduced.

--
Sandy McArthur

"He who dares not offend cannot be honest."
- Thomas Paine

Toby Reyelts

Jun 15, 2007, 13:33:26

to Google-Web-Tool...@googlegroups.com

On 6/15/07, Ray Cromwell <cromw...@gmail.com> wrote:

I see several issues with compiling from bytecode:

1) IIRC, generic type information is only available for certain things (method declarations; not fields, not return types, not method locals)

No, generic type information is available on all of those entities. (It might help to review sections 4.4.4 and 4.8.13 of the JVMS).

2) type erasure generates spurious bridge methods that are not needed or different than GWT

Bridge methods are marked as synthetic, as are other compiler-generated entities.

3) inner class/local class/anonymous inner class transformation is not necessarily how GWT wants to do it

How is that relevant? We don't do a 1 to 1 transformation of the source, why should we be doing a 1 to 1 transformation of the bytecode?

4) JSNI information won't be preserved (now what, force people to move all JSNI code to external .js files or XML?) This is the deal breaker.

No, we can continue to do a source-code parse for JSNI. JSNI would be stored in class files as an additional attribute.

5) autoboxing!

I'm sorry, what?

I also think it's not going to speed anything up. Right now, GWT parses source, and runs visitors over the AST to perform its optimizations. Now if you feed it bytecode, it will essentially need to rebuild an intermediate representation in memory amenable to optimization passes, which essentially means parsing bytecode and constructing trees and graphs from it. Not only that, but all the optimizations will have to be rewritten to deal with a new, lower-level IR format.

I think you're really missing context here, which I whole-heartedly blame Scott for. As such, there's a false dichotomy being drawn here. Arguments for bytecode parsing have to deal with other issues, such as speed of hosted mode, compatibility with JVMTI, being able to deliver your libraries to clients as binaries, etc...

Has it even been determined that GWT spends most of its time parsing?

Yes, in fact, I've performed several profiling runs against both the GWT compiler and hosted mode with important results that I'd like to use towards speeding up both hosted mode and unit tests. I encourage anybody who's interested to also post their own results. Please be sure to include your JVM version + flags, operating system, hardware configuration, profiler, etc...

Toby Reyelts

Jun 15, 2007, 13:36:13

to Google-Web-Tool...@googlegroups.com

On 6/15/07, Sandy McArthur <sand...@gmail.com> wrote:

On 6/15/07, Scott Blum <sco...@google.com> wrote:

I think we should move this conversation to GWT-Contrib. My follow-up to follow.

Forwarded Conversation
From: John Tamplin <j...@google.com>
To: Toby Reyelts <to...@google.com>
Date: Fri, Jun 15, 2007 at 11:09 AM

On 6/15/07, Toby Reyelts < to...@google.com > wrote:

In general, I have trouble seeing advantages to parsing Java source code over Java bytecode, whereas I can name several disadvantages off the top of my head.

You are operating at a lower level, dealing with smaller operations. Given that JS is roughly the same level of abstraction as the Java source, to get efficient JS generation you would have to essentially piece together bytecode ops (which theoretically could be intermixed from different statements complicating this, although they don't seem to be in practice) to get back to the corresponding Java source. Essentially, we would be adding the complexity and fragility of a bytecode decompiler to the JS compile process.

Also, javac introduces synthetic methods to get around some access-related differences between Java bytecode and Java source. For example, an inner class accessing a private member of the containing class will usually create a synthetic access method with a more accessible scope.

Synthetic members are marked as such.

JavaScript doesn't have the enforced scoping rules the JVM has but unless you introduced decompiler inference logic I think you're more likely to produce less optimal JavaScript because of the "quirks" javac introduced.

Sounds like hand-waving to me. What do we need to "infer"? Please be concrete.
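
The claim that compiler-generated members are flagged can be observed through reflection. This is a small sketch under stated assumptions, not code from the thread: BridgeMethodDemo and Version are hypothetical names. Implementing Comparable<Version> makes the compiler emit a second compareTo(Object) method, and the reflection API reports it as both bridge and synthetic.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class BridgeMethodDemo {
    // Hypothetical class for illustration: implementing Comparable<Version>
    // causes the compiler to emit a bridge method compareTo(Object).
    static class Version implements Comparable<Version> {
        final int major;

        Version(int major) {
            this.major = major;
        }

        @Override
        public int compareTo(Version other) {
            return Integer.compare(major, other.major);
        }
    }

    /** Parameter type names of every compareTo overload flagged as bridge and synthetic. */
    static List<String> bridgeParameterTypes(Class<?> type) {
        List<String> result = new ArrayList<String>();
        for (Method m : type.getDeclaredMethods()) {
            if (m.getName().equals("compareTo") && m.isBridge() && m.isSynthetic()) {
                result.add(m.getParameterTypes()[0].getName());
            }
        }
        return result;
    }

    /** Number of compareTo overloads that were written by hand (not synthetic). */
    static int handWrittenCount(Class<?> type) {
        int count = 0;
        for (Method m : type.getDeclaredMethods()) {
            if (m.getName().equals("compareTo") && !m.isSynthetic()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(bridgeParameterTypes(Version.class));
        System.out.println(handWrittenCount(Version.class));
    }
}
```

A bytecode consumer could use the same flags to skip such members. Whether that is enough to recover the source-level intent in every case is exactly what the thread is debating.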

Toby Reyelts

Jun 15, 2007, 14:38:13

to Google-Web-Tool...@googlegroups.com

On 6/15/07, Scott Blum <sco...@google.com> wrote:

On 6/15/07, Toby Reyelts <to...@google.com> wrote:

If you have a generic type declaration in your source code, it's available in the class file. This holds for classes, methods, fields, parameters, and locals.

Good to know.

Java bytecode is nearly the same level of abstraction as source code, just generally more explicit. For example, instead of having to do complicated name-resolution lookup involving scopes and import statements to determine which type an identifier represents, java bytecode contains a fully-qualified reference to the type. This is simpler - not more complex.

It's not really our problem at the moment, as JDT does all this for us. It also does additional things for us, like ensure the code is error free, that all necessary classes are available, that the set of classes are internally consistent (for example, the user didn't use JRE APIs that aren't in our JRE). We'd need alternative solutions for these.

I don't really think we need "alternative solutions" for most of these, because the user's own compiler has done this for us before the class file reaches us.

Rather than hand-waving in the abstract, why not give some concrete examples of specific optimizations that can only occur from a source code parse?

I don't think we could answer this question without really diving into it.

I think your suggestion to compare the GWT compiler output from 2 different sources (1 original, 1 decompiled) is a good idea. It's not formal, but it's a nice starter step.

Java bytecode has been more stable than the Java language. I think that speaks for itself. It also sounds like you believe that the goal of parsing bytecode would be to generate the same parse tree we'd have gotten had we parsed the original source, which is inaccurate.

I'm not sure I follow. Are you saying that the parse tree we get from source is inaccurate (I'm not sure how it could be) or that there's no need to get the same AST from byte code that we could have gotten from source?

The latter. I'm just saying that syntactical differences shouldn't affect the compiler output, otherwise our compiler is broken to begin with.

If we require a "special" kind of compilation from source to generate class files with enough information, then we've sort of just pushed the problem around.

I don't think so, because I think "the problem" is a bit ill-defined based on the way this discussion got started. There are several inter-related things that all need to be addressed, and it is my feeling that class file parsing is the appropriate solution for several of those problems - not that it necessarily makes sense for it to supersede source parsing.

Ray Cromwell

Jun 15, 2007, 14:49:48

to Google-Web-Tool...@googlegroups.com

On 6/15/07, Toby Reyelts <to...@google.com> wrote:

No, generic type information is available on all of those entities. (It might help to review sections 4.4.4 and 4.8.13 of the JVMS).

I agree that the class file format allows for it, the question is, does javac actually do anything with these optional attributes?

I just compiled the following to test before posting my original message.

import java.util.*;

public class t {
    public static void main(String arg[]) {
        ArrayList<Integer> al = new ArrayList<Integer>();
        System.out.println(al);
    }
}

An examination of the compiled bytecode shows that the type java.lang.Integer appears nowhere.

3) inner class/local class/anonymous inner class transformation is not necessarily how GWT wants to do it

How is that relevant? We don't do a 1 to 1 transformation of the source, why should we be doing a 1 to 1 transformation of the bytecode?

Perhaps because working with the source AST is easier than working with the byte code in this regard. With the source, you don't need to detect, infer, and remove the hidden constructor and accessor fields. Sure, the actual source relationships are preserved in the byte code (plus some transformations and synthetic stuff), but I think the issue is how to retrofit this into the existing compiler infrastructure.

4) JSNI information won't be preserved (now what, force people to move all JSNI code to external .js files or XML?) This is the deal breaker.

No, we can continue to do a source-code parse for JSNI. JSNI would be stored in class files as an additional attribute.

Are you asking that GWT or some other modified compiler produce bytecode as output (instead of Javascript) to store this attribute?

Perhaps I'm misunderstanding. I thought you wanted GWT to compile bytecode to JS (that had already been compiled using a traditional Java compiler). But the assumption that the bytecode contains JSNI strings seems to imply that some other, non-javac compiler must first transform source to bytecode as a front-end to GWT?

Has it even been determined that GWT spends most of its time parsing?

Yes, in fact, I've performed several profiling runs against both the GWT compiler and hosted mode with important results that I'd like to use towards speeding up both hosted mode and unit tests. I encourage anybody who's interested to also post their own results. Please be sure to include your JVM version + flags, operating system, hardware configuration, profiler, etc...

The question is, how would this speed up GWT, as it is written now? It seems it would practically require a rewrite of the compiler, as all of the compiler guts are predicated on JDT ASTs. This hypothetical new compiler would have to deal with what is essentially a three-address-style IR instead of a tree, and all of the optimizations would either a) go back to old school Dragon Book structures or b) require inserting a bytecode -> AST parser into the loop. And at that stage, is generating an AST from bytecode faster than generating an AST from a grammar?

If we're going to use GWT to generate special .class files that have JSNI methods encoded, why don't we just dispense with the class file format and just dump the GWT AST into a special .gwt file. Then, if someone hands you a JAR file with one of these .gwt files, GWT can startup faster by immediately loading the AST into memory without parsing.

This sounds a lot easier to implement in the existing compiler for fast-start than java+jsni -> bytecode | bytecode parse->AST | GWT

I'm just not sure what problem is being solved. If Java source is first compiled with javac, you lose JSNI methods and must parse the source anyway. If it's compiled with a special GWT pass which inserts JSNI attributes into the .class file, it seems like you've just added work. And if you don't want to rewrite the guts of the GWT visitors which perform optimizations and code generation, then you need a new internal compiler IR based on bytecode, and rewrite practically everything, or, you need to insert a bytecode->AST parser, which I think would require research to determine if it is a big win or not.

What are the benefits? Maybe faster hosted mode startup, and maybe being able to distribute 'binary' GWT modules commercially, but then again, if you don't want source code theft, you're going to run an obfuscator on the bytecode to prevent decompilation which will probably inhibit the compiler.

There would be a certain elegance to being able to deal with bytecode, since all other Java tools deal with it, and currently, GWT seems like the 'odd man out' by requiring the source and other metadata. But unless someone is willing to put in the work doing this, I'd vote for Scott and others to continue work adding more optimizations (side effect analysis, SSA, CSE/Copy Prop) and 1.5 features.

This seems like GWT 2.0 and not GWT 1.5. Perhaps as a research project, someone can engineer a bytecode parser which reconstructs a JDT AST so that the existing compiler can deal with it. Then you can benchmark performance to see how much skipping the source parsing step helps.

-Ray
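
To make the ".gwt file" idea above concrete, here is a rough sketch that uses standard Java serialization on a toy expression tree. Everything in it is an assumption made for illustration: the Node, Literal, and Add classes are hypothetical and do not model GWT's real AST, and a byte array stands in for the file on disk. It only shows the shape of the idea, which is to write the tree once and later reload it without running a parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class AstDumpDemo {
    // Hypothetical toy AST; GWT's real AST classes are not modeled here.
    static abstract class Node implements Serializable {
        private static final long serialVersionUID = 1L;

        abstract int eval();
    }

    static final class Literal extends Node {
        private static final long serialVersionUID = 1L;
        final int value;

        Literal(int value) {
            this.value = value;
        }

        @Override
        int eval() {
            return value;
        }
    }

    static final class Add extends Node {
        private static final long serialVersionUID = 1L;
        final Node left;
        final Node right;

        Add(Node left, Node right) {
            this.left = left;
            this.right = right;
        }

        @Override
        int eval() {
            return left.eval() + right.eval();
        }
    }

    /** Serializes a tree to bytes, standing in for writing a ".gwt" file. */
    static byte[] dump(Node root) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(root);
            out.close();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Rebuilds the tree from bytes without any source parsing. */
    static Node load(byte[] data) {
        try {
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data));
            try {
                return (Node) in.readObject();
            } finally {
                in.close();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node tree = new Add(new Literal(2), new Literal(3));
        Node reloaded = load(dump(tree));
        System.out.println(reloaded.eval());
    }
}
```

Whether reloading a serialized tree is actually faster than parsing source would need measurement, as the post itself notes for the bytecode route.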

Toby Reyelts

Jun 15, 2007, 15:14:34

to Google-Web-Tool...@googlegroups.com

On 6/15/07, Ray Cromwell <cromw...@gmail.com> wrote:

On 6/15/07, Toby Reyelts < to...@google.com> wrote:

No, generic type information is available on all of those entities. (It might help to review sections 4.4.4 and 4.8.13 of the JVMS).

I agree that the class file format allows for it, the question is, does javac actually do anything with these optional attributes?

I just compiled the following to test before posting my original message.

import java.util.*;

public class t {
    public static void main(String arg[]) {
        ArrayList<Integer> al = new ArrayList<Integer>();
        System.out.println(al);
    }
}

An examination of the compiled bytecode shows that the type java.lang.Integer appears nowhere.

The information for locals is provided as additional debug information, Ray. Unlike all of the other type signature information, you won't get this unless you turn it on in your compiler.

I'm going to pretty much sidestep the rest of your post by saying that I have some very specific use cases in mind (mostly revolving around speeding up hosted mode), and I had even begun a design document that addressed the specific use cases and problems - for example, problems with multiple re-parses, caching (including staleness), startup speed, JVMTI compatibility (ever wonder why you can't hotswap correctly?), unconditional generator invocations, reproduced generator state, embedded http server startup, etc...

It turns out bytecode parsing probably makes sense as part of a solution for several of those issues and (potentially) binary distribution. This discussion doesn't seem to be very effective, because it's discussing parsing in a vacuum. I think we'd be better off waiting for a design proposal, and debating the drawbacks and merits of very specific features of that proposal (with perhaps a prototype implementation with measurements) than the idleness present here.

Alex Tkachman

Jun 15, 2007, 15:31:32

to Google-Web-Tool...@googlegroups.com

I was thinking about some intermediate binary format for the parse tree. We can distribute standard libraries in this format and win on compilation and recompilation time. It is kind of like a class file, but with all the information we need and in terms which are better for JavaScript generation. From my point of view, bytecode is a bit too low-level an abstraction for JavaScript generation.


Miguel Méndez

Jun 18, 2007, 13:01:19

to Google-Web-Tool...@googlegroups.com

Sorry to jump into this thread so late. FWIW....

I have done bytecode-based compilation before. Specifically, taking a set of compiled bytecodes and cross-compiling them to target another framework and virtual machine.

The compilation of source code to bytecode is not one-to-one. By that I mean that certain source level constructs are expanded into sets of instructions or additional synthetic constructs and some constructs are removed entirely. Most of the time, this is not a big deal. It does become a big deal when the use case forces you to try to get back to the source level concept that it came from.

A simple example that comes to mind is the generation of useful error messages. There were instances where a source level construct would cause a synthetic construct to be produced which used unsupported features. The synthetic construct may or may not have file and line number information depending on its nature even when the bytecode was compiled with debugging turned on. In some cases, the best that you could do was to generate an error message associated with the starting file and line number for the class. The user is then left wondering which part of his or her class caused the problem.

Net, net: it can be done, you just need to consider the trade offs.

Toby Reyelts

Jun 18, 2007, 13:08:20

to Google-Web-Tool...@googlegroups.com

That's a great point Miguel.

Bruce Johnson

Jun 18, 2007, 18:05:13

to Google-Web-Tool...@googlegroups.com

I'll add 2 cents.

We shouldn't make any architectural inferences, conclusions, or future decisions based on the speed of the compiler today. The speed of compiling Java into JavaScript was simply not a consideration in its design. The current compiler does a ton of unnecessary work. For example, we currently parse and compile multiple times during a single hosted mode refresh: once to populate the TypeOracle, then again to generate bytecode in the CompilingClassLoader (and maybe even more...I forget). All that you can conclude from the compiler today is that the compiler is implemented suboptimally, not that compiling from source is intrinsically too slow.

As Joel would say, we have cart/horse inversion here. We've jumped into talking about implementation details when in fact we should be discussing (1) the actual functionality we want to deliver, (2) any additional non-functional requirements (which could indirectly lead to a discussion of how they would be affected by the source vs. bytecode decision), and (3) how these differ from the behavior we already have today, if at all.

This thread can't converge meaningfully until we have defined the goals more clearly. And the way to do that is to (1) decide what the feature is that we're trying to achieve (e.g. reducing compile times? speeding up hosted mode? initial startup? refreshes?), (2) decide which release it is going into, and (3) begin with at least a strawman design doc based on the requirements. Looks like we've done it exactly backwards so far :-)

-- Bruce
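
The repeated parsing described above (once for the TypeOracle, again for the CompilingClassLoader) is the kind of work a shared cache could avoid, whichever input format is used. The sketch below is purely illustrative and is not a description of GWT's design: ParseCache, its stamp parameter (a stand-in for something like a last-modified time), and the parser function are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ParseCacheDemo {
    /** Caches one parse result per unit and re-parses only when the stamp changes. */
    static final class ParseCache<T> {
        private static final class Entry<T> {
            final long stamp;
            final T result;

            Entry(long stamp, T result) {
                this.stamp = stamp;
                this.result = result;
            }
        }

        private final Map<String, Entry<T>> entries = new HashMap<>();
        private final Function<String, T> parser;
        private int parseCount;

        ParseCache(Function<String, T> parser) {
            this.parser = parser;
        }

        /** The stamp is a hypothetical staleness marker, e.g. a last-modified time. */
        T get(String unitName, long stamp) {
            Entry<T> cached = entries.get(unitName);
            if (cached == null || cached.stamp != stamp) {
                parseCount++; // count real parses so callers can see the savings
                cached = new Entry<>(stamp, parser.apply(unitName));
                entries.put(unitName, cached);
            }
            return cached.result;
        }

        int parseCount() {
            return parseCount;
        }
    }

    public static void main(String[] args) {
        // The "parser" here just returns the name length; a real one would build a tree.
        ParseCache<Integer> cache = new ParseCache<>(String::length);
        cache.get("Foo.java", 1L); // first consumer triggers a parse
        cache.get("Foo.java", 1L); // second consumer reuses the result
        cache.get("Foo.java", 2L); // a changed stamp forces a re-parse
        System.out.println(cache.parseCount());
    }
}
```

The design choice shown is simply to key results by unit and invalidate on a staleness marker, so that several consumers within one refresh share a single parse.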
