Rewriting (refactoring) of the compiler


VladD2

Feb 17, 2009, 11:37:46 AM
to nemerle-dev
Below are some of my thoughts regarding the way the process of
rewriting the compiler should proceed.

Please, do not judge too harshly. This is only a first draft. I could
have missed something or gone over the top somewhere.

All comments are welcome.

One exception: please avoid proposing new features that would affect
the language design. All of that will come at stage three or four.

At stage one (performed with the current version of Nemerle):

1. Make the compiler code multithreaded, reentrant, clear, and easy to
modify (see the sketch after this list).

* Remove mutable static variables.
* Make Token, PExpr, TExpr immutable.
* Reduce mutable fields to a minimum.
* Remove internal and public fields (replace them with properties and
functions).
* Remove all properties with side effects.
* Let only the methods required by macros remain public.
* Move each type into a separate file (with the exception of delegates
and type aliases).
* Clean up type and member names.
* Avoid controlling logic flow with exceptions.
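
For illustration, here is a minimal sketch of the intended style (the
type and member names are made up, not the real compiler types): an
immutable class with read-only properties instead of public mutable
fields.

class SourceToken
{
  public this(line : int, column : int, text : string)
  {
    _line   = line;
    _column = column;
    _text   = text;
  }

  // Non-mutable private fields, assigned only in the constructor.
  private _line   : int;
  private _column : int;
  private _text   : string;

  // Read-only properties instead of public fields.
  public Line   : int    { get { _line } }
  public Column : int    { get { _column } }
  public Text   : string { get { _text } }
}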

2. Remove hacks.

3. Rework the representation of function types and tuples.

* Tuples should support zero or more elements (including exactly ONE).
* MType.Fun should always take a tuple:

variant MType
{
  ...
  | Fun { from : MType.Tuple; to : MType.Tuple }
  ...
}

* Conversion of tuples into parameter lists should be performed only
when a tuple is passed to a function explicitly.
* In expressions we should treat a one-element tuple as plain
parentheses, but in PExpr it should still be represented as a tuple
(for the benefit of macros).
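
A minimal sketch of what this buys us (assuming the Tuple option
exposes its element list as an "args" field, which is an assumption,
not the current API): function arity handling becomes uniform, with no
special case for one argument.

// Hypothetical: with tuples on both sides of Fun, all arities share one shape:
//   int -> string           ==>  Fun(Tuple([int]), Tuple([string]))
//   (int, string) -> void   ==>  Fun(Tuple([int, string]), Tuple([void]))
//   () -> void              ==>  Fun(Tuple([]), Tuple([void]))
def parameterCount(t : MType) : int
{
  match (t)
  {
    | MType.Fun(from, _) => from.args.Length  // no one-argument special case
    | _                  => 0
  }
}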

4. Separate custom attributes and modifiers into two distinct
entities. Remove the duplication in names: modifiers such as public
and static should be called modifiers, and attributes should be called
custom attributes. Simplify the quasi-quotation syntax for modifiers
and custom attributes.

5. Remove macro limitations:

• Allow commas and semicolons everywhere.
• Allow introduction of new syntax everywhere (at file level, in
namespaces, in types).
• Support macros in pattern matching patterns.
• Make macro syntax similar to EBNF.
• Allow using macros as arguments to other macros. This way the macro
"class" can be described as:

macro CustomAttribute ...
macro Modifiers ...
submacro Member = Method | Field | Property | Class | Struct | Variant;

macro Class(customAttribute : CustomAttribute, modifiers : Modifiers,
            name, body : list[Member])
syntax (option(customAttribute), option(modifiers), "class", name,
        "{", option(body), "}")
{
  ...
}

6. Take into account the needs of IDE integration.
* Exact location tracking.
* Ability to partially update project information (partial tree
restructuring, partial token recalculation, partial reparsing of
individual methods).
* First compile the class hierarchy, and only after that compile all
method bodies.
* No failures (asserts, exceptions, or ICEs) on incorrect syntax; only
error reports plus returning PExpr.Error() or TExpr.Error() (see the
sketch after this list).
* No "FatalError". The compiler should continue at all costs.
* No two-step typing. All errors should be reported synchronously.
* Error messages should be attached to methods (AST or TypeBuilder).

7. Make sure that functionality does not have to be duplicated in code
for delayed typing. I have lately run into the fact that the majority
of errors occur during delayed typing. Such errors are very difficult
to find and even to comprehend. The delayed-typing protocol must be
clearly thought through and implemented in such a way that the
programmer does not need to duplicate code. We should get the same
behavior in all cases (in both direct and delayed typing). For this we
should carefully unify the types used for delayed-typing objects.
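
For example (a purely hypothetical interface, only to illustrate the
unification):

// One small interface could cover both immediate and delayed typing,
// so typing logic is written once and simply re-run when a delayed
// object finally becomes resolvable.
interface ITypingTask
{
  // Returns Some(typedExpr) when typing can complete, None() otherwise;
  // the typer re-invokes pending tasks as new type information arrives.
  TryType() : option[TExpr];
}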

8. The compiler should be divided into separate components that can be
changed independently. Connections between the components must be made
exclusively through interfaces; no peeking inside. An approximate list
of components (assemblies):

* Interface assembly. Contains the public interfaces of the other
components. This assembly (component) should change as little as
possible. It will also contain the data structures that are passed
between compiler components, and it is the assembly that macro
libraries and other components will reference. It will also contain
class factories for the other components (which will dynamically load
the other assemblies); see the sketch after this list.
* Lexer
* Preparser
* Parser
* Typer
* Solver
* Macro Expander
* Optimizer
* Code generator
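
A minimal sketch of the interface-plus-factory idea (ILexer and
ComponentFactory are made-up names, not a real contract):

// The interface assembly declares the contract...
public interface ILexer
{
  Tokenize(sourceText : string) : list[string];
}

// ...and a factory that loads a concrete component assembly
// dynamically, so callers never reference the implementation directly.
public module ComponentFactory
{
  public CreateLexer(assemblyPath : string, typeName : string) : ILexer
  {
    def asm = System.Reflection.Assembly.LoadFrom(assemblyPath);
    asm.CreateInstance(typeName) :> ILexer
  }
}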

Every component must be implemented as a separate DLL. The exact
composition is up for debate. In principle, it could be possible to
combine the Lexer, Preparser, and Parser assemblies. But combine them
later, when they stop changing frequently.

This will:

1. Speed up builds of different parts of the project, since only the
modified component needs to be rebuilt. Currently, compiling the
Nemerle solution takes about 2 minutes on a Core 2 Duo 2.5 GHz.
2. Make the project easier to understand.
3. Reduce code coupling. Components enable easy definition and
tracking of dependencies.
4. Allow developers to evolve separate parts of the project in
parallel.
5. Simplify retargeting of separate modules. For instance, one could
create different code generators and optimizers.

At stage two:

1. Bootstrap the compiler.
2. Rewrite the compiler to use the new (more powerful) macro system.
At this point, much of the parser will be rewritten declaratively with
macros.
3. Add new features to the compiler.

It may also be possible to make modifications to the macro system in
the old compiler, so that the new one can be written at a higher level
from the start.

Ķaмĭļ ๏ Şκaļşκĭ

Feb 20, 2009, 1:45:56 PM
to nemer...@googlegroups.com
2009/2/17 VladD2 <v...@rsdn.ru>:
>
> Below are some of my thoughts regarding the way the process of
> rewriting the compiler should proceed.
>
> Please, do not judge too harshly. This is only a first draft. I could
> have missed something or gone over the top somewhere.

Well, I can see your plan is quite broad, but it would be really nice
to have all of these enhancements.

>
> All comments are welcome.

One obvious comment is that many of the problems / hacks you are
describing were originally introduced to optimize performance. I can
easily see that eliminating static state and mutable fields will make
the compiler much slower. We already did one such refactoring when you
initially asked about it; we didn't take a huge performance hit then,
but most of the code that was changed was not on the critical path.
Similarly, separating delayed typing from immediate typing, with error
reporting as a separate pass, is a way to make compilation of the most
common cases (code not requiring hard type inference and without
errors) fast.

It will also require touching lots of code (e.g. MType needs a Solver
object to operate, so either you pass it in the constructor or you
pass it in every method you call, etc.). Some of this could
potentially be simplified by macros, e.g. one that changes every call
of a given method into a call to another method with an additional
argument taken from the current context: mtype.Require(ty) would be
rewritten by the macro to mtype.Require(solver, ty).
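
Something like this rough sketch (RequireWithSolver and the call-site
`solver` variable are made up for illustration):

// Hypothetical rewriting macro: callers write RequireWithSolver(m, t)
// and it expands to m.Require(solver, t), where $("solver" : usesite)
// splices in the `solver` visible at the expansion site.
macro RequireWithSolver(mtype, ty)
{
  <[ $mtype.Require($("solver" : usesite), $ty) ]>
}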

But of course many problems exist simply because the compiler was
written without much consideration for OO programming style and best
practices, so with some effort I guess it can be made better. Your
plan is a very good start.
If you do this you might as well eliminate the Preparser, I guess. It
was created mainly to group code parts into {} () [] chunks, which
could then be used in macros. If you don't constrain yourself to those
groups and just pass something like a TokenStream to macros, then I
think you don't need the preparse stage.
I think this is what you are planning, since with macros able to kick
in at the lexer stage you will be able to move most of the parsing
into macros.

> • Support macros in pattern matching patterns.

Oh, this is kind of a strange feature, but why not... Any example of
why this could be useful?

> • Make macro syntax similar to EBNF.

This is going to be hard at the implementation stage (you will need to
come up with a parser for general grammars and additionally make it
very dynamic with respect to modifications, e.g. every time you import
some namespace you need to rebuild the parser). I'm not sure how you
would make it work with macros processing an arbitrary token stream;
e.g. the EBNF rules specified by users will definitely contain things
like:
E1 => A Arbitrary1 C
E2 => A Arbitrary2 D

and even if normally you could decide between E1 and E2 by looking
ahead at C vs D, now you won't be able to, since you first need to
choose between Arbitrary1 and Arbitrary2; i.e. you can't look ahead
past an arbitrary (macro-eaten) stream, because you cannot know where
it ends. Adding full EBNF support is just a huge task (the people
writing tools like ANTLR spent years developing them).

> • Allow using macros as arguments to other macros. This way the macro
> "class" can be described as:
>
> macro CustomAttribute ...
> macro Modifiers ...
> submacro Member = Method | Field | Property | Class | Struct | Variant;
>
> macro Class(customAttribute : CustomAttribute, modifiers : Modifiers,
>             name, body : list[Member])
> syntax (option(customAttribute), option(modifiers), "class", name,
>         "{", option(body), "}")
> {
>   ...
> }

You will probably want to get rid of the notion of a "keyword": e.g.
"class" would no longer be a special keyword recognized everywhere;
instead, at each moment of parsing you would have a map of the
currently expected tokens, which might terminate the current parsing
rule.
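
A minimal sketch of that idea (all names made up):

using Nemerle.Collections;

module ExpectedTokensSketch
{
  public Run() : void
  {
    // Each parsing context carries its own table of tokens that may
    // start or terminate a rule, instead of a global keyword set.
    def expected = Hashtable.[string, string]();
    expected["class"]   = "starts a type-declaration rule";
    expected["variant"] = "starts a type-declaration rule";

    // "class" is special only while the current table says so.
    when (expected.ContainsKey("class"))
      System.Console.WriteLine(expected["class"]);
  }
}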
--
Kamil Skalski
http://kamil-skalski.pl

VladD2

Feb 21, 2009, 12:53:47 PM
to nemer...@googlegroups.com
2009/2/20 Ķaмĭļ ๏ Şκaļşκĭ <kamil....@gmail.com>

> One obvious comment is that many of the problems / hacks you are
> describing were originally introduced to optimize performance. I can
> easily see that eliminating static state and mutable fields will make
> the compiler much slower. We already did one such refactoring when you
> initially asked about it; we didn't take a huge performance hit then,
> but most of the code that was changed was not on the critical path.

I understand your fears.
But, in my opinion, you are looking for the compiler's performance
problems in the wrong place.

The basic performance problems of the compiler are caused by the
following factors:
1. Use of unsuitable algorithms:
1.1. Simple defects: using loops to search for elements instead of
table lookups (use of improper data structures, for example lists
instead of hash tables).
1.2. Use of universal (but terribly heavy) algorithms. For example,
the overload resolution algorithm is rather inefficient in the case of
extension-method overload resolution (and it can easily be improved).
1.3. Repeating the typing process many times during speculative
typing. Today the compiler cannot reuse the typed expressions obtained
during speculative typing.
2. Functional overhead (closures and virtual calls cost much more than
plain loops).
3. Generation of suboptimal code, for example, not using switch.
The current compiler code is very difficult to analyze and optimize
because of its high coupling and fogginess.
After the refactoring (or during it) this will become much easier.


> Similarly, separating delayed typing from immediate typing, with error
> reporting as a separate pass, is a way to make compilation of the most
> common cases (code not requiring hard type inference and without
> errors) fast.

We need to introduce the concept of «speculative compilation». Hiding
errors is necessary only during speculative compilation. The presence
of a «silent mode» only complicates the compiler code. This technique
is very difficult for other programmers to understand. I spent a lot
of time understanding this silent mode, and I found many bugs in its
implementation.

At the very least, we should reconsider it.
Two passes are an obvious evil. The two-pass scheme is badly
compatible with macros.
Macros should always work in the normal mode, and macros should not
have to take into account the problems of error-message handling
(including the errors which appear during speculative typing).
I think that by rewriting the compiler from scratch while taking care
of performance, we will get a much faster compiler that is free of
hacks.


> It will also require touching lots of code (e.g. MType needs a Solver
> object to operate, so either you pass it in the constructor or you
> pass it in every method you call, etc.).

Good example! TyVar (from which MType derives) already holds a
reference to the Manager (ManagerClass). TyVar gets its reference to
the Solver through the Manager. That is a big design bug! We should
store a direct reference to the Solver in TyVar; that alone will
improve performance.
It would be better to think about how to optimize the type graph and
the Solver.
By the way, we also need to add a Location to TyVar. Today we simply
lose the location information for type annotations.

> Some of this could potentially be simplified by macros, e.g. one that
> changes every call of a given method into a call to another method
> with an additional argument taken from the current context:
> mtype.Require(ty) would be rewritten by the macro to
> mtype.Require(solver, ty).

There is simply no need for that. In such cases it is better not to
hide reality from the programmer.

> If you do this you might as well eliminate the Preparser, I guess. It
> was created mainly to group code parts into {} () [] chunks, which
> could then be used in macros. If you don't constrain yourself to those
> groups and just pass something like a TokenStream to macros, then I
> think you don't need the preparse stage.
> I think this is what you are planning, since with macros able to kick
> in at the lexer stage you will be able to move most of the parsing
> into macros.

I think we can treat the preparser as a lexer macro that emits the
processed stream of tokens. If we want to support Python-like syntax,
this feature is equally necessary for us.

DSLs that are not compatible with Nemerle syntax can be expressed in
the form of lexer macros.

> > • Support macros in pattern matching patterns.
> Oh, this is kind of a strange feature, but why not... Any example of
> why this could be useful?

Pattern matching on XML. For example:
| <a><b>$x</b></a> => doSomething(x)
or
| /a/b/$x => doSomething(x)

> > • Make macro syntax similar to EBNF.
> This is going to be hard at the implementation stage (you will need to
> come up with a parser for general grammars and additionally make it
> very dynamic with respect to modifications, e.g. every time you import
> some namespace you need to rebuild the parser). I'm not sure how you
> would make it work with macros processing an arbitrary token stream;
> e.g. the EBNF rules specified by users will definitely contain things
> like:
>
> E1 => A Arbitrary1 C
> E2 => A Arbitrary2 D
>
> and even if normally you could decide between E1 and E2 by looking
> ahead at C vs D, now you won't be able to, since you first need to
> choose between Arbitrary1 and Arbitrary2; i.e. you can't look ahead
> past an arbitrary (macro-eaten) stream, because you cannot know where
> it ends. Adding full EBNF support is just a huge task (the people
> writing tools like ANTLR spent years developing them).

I have some ideas about this. For building the parser dynamically (a
parser that merges a number of grammars at runtime) it is possible to
use the algorithm known as the "Earley parser":

http://en.wikipedia.org/wiki/Earley_parser

But I agree that this is a serious change which does not have to be
made at this stage. First we need to refactor the existing compiler.

> You will probably want to get rid of the notion of a "keyword": e.g.
> "class" would no longer be a special keyword recognized everywhere;
> instead, at each moment of parsing you would have a map of the
> currently expected tokens, which might terminate the current parsing
> rule.

Yes. But we can convert the standard keywords into macros and move
them into the global namespace.
In that case we could disable the standard keywords with a
preprocessor directive or some similar mechanism. That would allow
creating full-fledged DSLs on top of Nemerle.

This improvement can probably also be postponed until better times.
What is important now is to make the compiler code:
1) clear (simpler and more straightforward);
2) loosely coupled;
3) fast (so that we do not lose performance).
