ASTs and Type Check Analysis

CranCran77

unread,

Aug 20, 2008, 3:47:55 PM8/20/08

to

I am presently involved in a compiler project in C++ that is mirroring
the BASIC grammar extensively.

The early stages of this project is more of a proof of concept more
than diving too deeply into many extremities that will certainly prove
very challenging as this project moves forward. So far I've
developed a Lexer object that returns tokens to my Parser class. The
parser class currently reads these tokens and creates a very
simplistic AST tree.

For this proof of concept, I am using a very basic source file that
contains the following:

PRINT 2+3*6

As I indicated above, we're using BASIC like grammar. The above
example would simply print to the screen the value of "20" and end if
it were to be compiled and executed.

We've taken the Parser to the point where it is creating a very
simplistic AST tree which would look something much like the following
for our example:

CASTFunctionCall
+- CASTSimpleName
+- Name: PRINT
+- CASTExprList
+- List Size: 1
+- CASTBinaryExpr
+- Operation: '+'
+- CASTIntegerLiteral
+- Number: 2
+- CASTBinaryExpr
+- Operation: *
+- CASTIntegerLiteral
+- Number: 3
+- CASTIntegerLiteral
+- Number: 6

Naturally once the AST tree is created above, I need to use visitor
patterns to do a number of passes. Since my example is extremely
simple, I can see three passes required:

- Pass #1 : Type Analysis
- Pass #2 : Constant Folding
- Pass #3 : Code Generation

Am I off base here ?

Where I am currently getting caught is in the first two phases. What
is the ideal way to layer type information into the AST tree, followed
by constant folding, followed by any other checks prior to compile
time?

Keep in mind of course the next step will be to include variable
declarations and how to assign values to those variables and even
working with very simple function/procedure calls (both code specific
and external library calls to things like DLL functions).

Chris

Hans-Peter Diettrich

unread,

Aug 23, 2008, 7:11:27 PM8/23/08

to

CranCran77 schrieb:

> simple, I can see three passes required:
>
> - Pass #1 : Type Analysis
> - Pass #2 : Constant Folding
> - Pass #3 : Code Generation
>
> Am I off base here ?
>
> Where I am currently getting caught is in the first two phases. What
> is the ideal way to layer type information into the AST tree, followed
> by constant folding, followed by any other checks prior to compile
> time?

The efforts may depend on the actual brand of BASIC you're targeting.
When a variable can hold any type at any time, depending on the last
assignment, or depending on the last declaration passed in control
flow, a type analysis IMO will be quite fruitless.

Otherwise, when the type of a variable is known at compile time, you
have the usual choice of up-/downcasting to the next applicable type in
expression evaluation. That's easy when numerical and textual (string)
data cannot be mixed. Otherwise you'll have to specify what the result
of e.g. 1+"1" shall be - it might be "11", 2, or some even more weird
result. That's one of the common pitfalls also in Java :-(

Even Microsoft couldn't find *valid* optimizations for Visual Basic,
that would result in significantly faster execution of compiled vs.
interpreted code.

DoDi

Arargh...@arargh.com

unread,

Aug 24, 2008, 3:47:01 PM8/24/08

to

On Sun, 24 Aug 2008 01:11:27 +0200, Hans-Peter Diettrich
<DrDiet...@aol.com> wrote:

<snip>

>Even Microsoft couldn't find *valid* optimizations for Visual Basic,
>that would result in significantly faster execution of compiled vs.
>interpreted code.

I am not sure that it is possible for any implementation of Basic,
considering the large number of run-time calls.

Most of the optimizations in BCET are because of the lousy code
generator. It was easier to do it that way. :-)
--
ArarghMail808 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html

CranCran77

unread,

Aug 25, 2008, 11:04:44 AM8/25/08

to

On Aug 23, 6:11 pm, Hans-Peter Diettrich <DrDiettri...@aol.com> wrote:

In the simple example I provided, I know my end result (without
optimizations) would result in the following:

PUSHI 2
PUSHI 3
PUSHI 6
IMUL
IADD
PRINTI

Had the code been optimized by the compiler at compile time, the
resulting code would have been:

PUSHI 20
PRINTI

But I'm not at the "optimization" stage just yet; that is still to
come.

The goal presently is to understand how I should handle type
assignments to nodes within the AST tree.

As you see in my example, I had several nodes that were
CASTIntegerLiteral, but I question if that is the right approach or
not. In my opinion, the parser which is creating the AST tree should
not be responsible for determining the number type of Integer, Real,
or Long. A simple CASTNumberLiteral node seems more appropriate to me
and allow the type analysis check phase when traversing the tree
determine whether its Integer, Real, or Long right?

If my thought process is correct, then when my type check visitor
traverses the nodes of the tree, certain AST nodes need to be
decorated with type information, such as Integer, Real, or Long. When
examining expressions such as binary ones, both sides to the operation
need to be examined to make sure they both conform to the same type,
and if not handle that appropriately.

For example, "PRINT 2.5*3" would throw an exception with type mismatch
because 3 would be evaluated to integer where-as 2.5 would be viewed
as a real. A different method could be that it would turn the
statement into "PRINT 2.5*3.0" so that both sides are viewed as real
values for the computation to happen correctly.

In this case during the binary expression type check, the right side
of the operation would become the child of another another AST node
that handles converting the result into a given type by an internal
function call. So the tree would resemble the following:

CASTFunctionCall
+- CASTSimpleName
+- Name: PRINT
+- CASTExprList
+- List Size: 1
+- CASTBinaryExpr

+- Operation: '*'
+- CASTNumberInteger
+- Number: 2.5
+- CASTFunctionCall
+- CASTSimpleName
+- Name: INT2REAL

+- CASTExprList
+- List Size: 1

+- CASTNumberLiteral
+- Number: 3

I suppose what I see is that I could use a class object called
CASTType with derived classes for Integer, Long, Real, and String.
Each derived class would contain information about the size of the
type for storage purposes and/or reference pointer for strings in the
constant pool.

So each CASTNumberLiteral and CASTBinaryExpr objects would be assigned
a CASTType object during the type check phase. Each CASTFunctionCall
I also believe would get assigned a type as well in keeping with the
assembly example above.

Am I on the right track?

Bartc

unread,

Aug 25, 2008, 6:19:32 PM8/25/08

to

"CranCran77" <cran...@gmail.com> wrote in message

> On Aug 23, 6:11 pm, Hans-Peter Diettrich <DrDiettri...@aol.com> wrote:
>
> In the simple example I provided, I know my end result (without
> optimizations) would result in the following:
>
> PUSHI 2
> PUSHI 3
> PUSHI 6
> IMUL
> IADD
> PRINTI
>
> Had the code been optimized by the compiler at compile time, the
> resulting code would have been:
>
> PUSHI 20
> PRINTI
>
> But I'm not at the "optimization" stage just yet; that is still to
> come.

This 'optimisation' is actually fairly easy and is worth doing
anyway. You need to do your type analysis first, then call something
like Evaluate() on each node, which for a binary op:

Evaluate(left operand)
Evaluate(right operand)
If left and right operands are now immediate values, do the calculation and
convert this node to a terminal.

> The goal presently is to understand how I should handle type
> assignments to nodes within the AST tree.
>
> As you see in my example, I had several nodes that were
> CASTIntegerLiteral, but I question if that is the right approach or
> not. In my opinion, the parser which is creating the AST tree should
> not be responsible for determining the number type of Integer, Real,
> or Long. A simple CASTNumberLiteral node seems more appropriate to me
> and allow the type analysis check phase when traversing the tree
> determine whether its Integer, Real, or Long right?

Depends on your implementation language, which may store all these
differently (but all my parsers differentiate between integer and real
literals). How would you know otherwise whether the value is integer or
real?

> If my thought process is correct, then when my type check visitor
> traverses the nodes of the tree, certain AST nodes need to be
> decorated with type information, such as Integer, Real, or Long.

Is this Basic language dynamically typed with regard to numeric values?
Because if you're using tagged variables that makes things awkward, unless
you forget about integer and real and just have a Numeric type.

> When
> examining expressions such as binary ones, both sides to the operation
> need to be examined to make sure they both conform to the same type,
> and if not handle that appropriately.

This can get tricky for complex expressions. For binary ops, traverse each
operand subtree, leaving the required type open. Then choose the dominant
type of the two, and insert a cast if needed to one operand (or generate a
type error if required). This would also have been done at lower levels. But
it means that in:

3.5*4 + 5*6

which ends up as real, the 5*6 term uses integer arithmetic, in this case
not a problem, but you may prefer that all subops are treated as real.
Perhaps then re-traverse the right-hand subtree and require the type to be
real.

>
> For example, "PRINT 2.5*3" would throw an exception with type mismatch

I don't think your users would be happy with that.

I put your PRINT 2.5*3 into one of my (rather ancient) compilers, and it
produced this equivalent to ASTs:

After parsing:
0005: Sysfn <> #: 1 [This is PRINT]
0005: Opc <> Mul
0005: Value <Real*8> [Val:Real*8] 2.500000
0005: Value <Int*1> [Val:Int*4] 3

After type analysis:
0005: Sysfn <> #: 1
0005: Opc <Real*8> Mul
0005: Value <Real*8> [Val:Real*8] 2.500000
0005: Value <Real*8> [Val:Real*8] 3.000000

After evaluating constants:
0005: Sysfn <> #: 1
0005: Value <Real*8> [Val:Real*8] 7.500000

No casts are used here, but using instead PRINT X*I (X is real, I is
integer) it gives:

After parsing:
0007: Sysfn <> #: 1
0007: Opc <> Mul
0007: Value <Real*8> [Val:Var](x 3C520AH)
0007: Value <Int*4> [Val:Var](i 3C5256H)

After type analysis:
0007: Sysfn <> #: 1
0007: Opc <Real*8> Mul
0007: Mem <Real*8> [Val:Var](x 3C520AH)
0007: Softconv <Real*8> [ie. a Cast]
0007: Mem <Int*4> [Val:Var](i 3C5256H)

--
Bartc

Hans-Peter Diettrich

unread,

Aug 25, 2008, 6:24:12 PM8/25/08

to

CranCran77 schrieb:

> The goal presently is to understand how I should handle type
> assignments to nodes within the AST tree.
>
> As you see in my example, I had several nodes that were
> CASTIntegerLiteral, but I question if that is the right approach or
> not. In my opinion, the parser which is creating the AST tree should
> not be responsible for determining the number type of Integer, Real,
> or Long. A simple CASTNumberLiteral node seems more appropriate to me
> and allow the type analysis check phase when traversing the tree
> determine whether its Integer, Real, or Long right?

I agree in so far, as the compiler can know better about the most
appropriate representation of a numerical constant, in preparation of
the compiled code for expressions. Even with known types (of variables),
type conversions may be required in the evaluation of any expression.
Then the "cost" of the conversion of an literal is zero, in contrast to
the conversion of some predefined value. Likewise the cost of operations
with all constant arguments is zero, because it can be done at compile
time. Here a reordering of a whole (sub-)expression may result in
smaller and faster code. Also have a look at Common Subexpression
Elimination (CSE).

But I'd not simply neglect the exact type of a literal, because
sometimes a real constant of e.g. 2.0 can be a hint to the compiler, to
use real arithmetic, where integer arithmetic could result in an
unwanted overflow in some intermediate result. The same for explicit
type casts, i.e. Long(2) should really result in Long arithmetic, even
if Byte arithmetic would fit together with the other operands.

> If my thought process is correct, then when my type check visitor
> traverses the nodes of the tree, certain AST nodes need to be
> decorated with type information, such as Integer, Real, or Long. When
> examining expressions such as binary ones, both sides to the operation
> need to be examined to make sure they both conform to the same type,
> and if not handle that appropriately.

Remember that every operator node will have an assigned result type, as
determined from the involved operands, and that required dynamic type
conversions may have to be inserted into the AST. I had to implement
similar tree transformations in my C decompiler, in order to throw out
type consersions and temporary variables, before emitting the
"optimized" source code.

> For example, "PRINT 2.5*3" would throw an exception with type mismatch
> because 3 would be evaluated to integer where-as 2.5 would be viewed
> as a real. A different method could be that it would turn the
> statement into "PRINT 2.5*3.0" so that both sides are viewed as real
> values for the computation to happen correctly.

I'd not insist in too many type casts in source code, when the intention
is clear. Only when e.g. the result of a division can differ, between an
integer and a real DIV, or when an overflow shall be prevented by
extending an value to a certain byte count, the according type casts
should be required and respected in the source code.

> I suppose what I see is that I could use a class object called
> CASTType with derived classes for Integer, Long, Real, and String.
> Each derived class would contain information about the size of the
> type for storage purposes and/or reference pointer for strings in the
> constant pool.

I'd use as few node classes as possible, with properties indicating e.g.
the kind of an unary or binary operator node, and the result type. How
else would you transform an node "operator+" into "IntADD" or "RealADD"
later? IMO all nodes with computational values should have a common base
class, with possible subclasses for literals, variables, operators and
functions.

DoDi