RFC: language beginnings

tho...@gmail.com

Nov 11, 2009, 1:23:26 AM

to pp-d...@googlegroups.com

Description:
This includes most of a grammar for a PP domain-specific language.
Syntactically and behaviorally it is very much like C, with reduced types
(int, string, bool, no structs, no enums). Most of the PP-specific actions
are simple function calls.

Major departures from C are:
- dynamic typing, with optional runtime checking
- function literals and anonymous functions
- list/tuple literals
- symbols are private unless declared public
- no headers

I have converted all of the existing device code to the new syntax, with
a few bits of hand-waving to work out still. The actual interpreter for
the language is yet to be written. :)

I'm sending this change as a chance to get some feedback.

Tim

Please review this at http://codereview.appspot.com/154047

Affected files:
A language/Makefile
A language/grammar.l
A language/grammar.y
A language/identifier.h
A language/language.cpp
A language/language.h
A language/lexer_test.cpp
A language/main_lex.cpp
A language/main_parse.cpp
A language/pipe_file.h
A language/pp-files/amd_k8.pp
A language/pp-files/cpu.pp
A language/pp-files/cpuid.pp
A language/pp-files/msr.pp
A language/pp-files/pci.pp
A language/pp-files/pp.pp
A language/string_file.h
A language/variable.h
A language/variable_test.cpp

mjte...@gmail.com

Nov 11, 2009, 10:59:59 PM

to tho...@gmail.com, pp-d...@googlegroups.com, tho...@gmail.com

I just glanced at the lex/yacc stuff. I'll read through more later.

http://codereview.appspot.com/154047/diff/1/10
File language/grammar.l (right):

http://codereview.appspot.com/154047/diff/1/10#newcode2
language/grammar.l:2: DEC [0-9]
[:digit:] ?

http://codereview.appspot.com/154047/diff/1/10#newcode5
language/grammar.l:5: SPACE [ \t\n\v\f\r]
Why can't you use [:space:] ?

http://codereview.appspot.com/154047/diff/1/10#newcode69
language/grammar.l:69: /* Support C-style comments, even nested. */
Argh. Don't do this. :)

Why deviate from the C-style rule for comments? It's what people are
used to.

http://codereview.appspot.com/154047/diff/1/10#newcode86
language/grammar.l:86: 0[bB]{DEC}+ { dump(yyscanner); return
int_literal(yyscanner); }
What is this token? This is weird to me.
I don't think you want to match 0b34.

http://codereview.appspot.com/154047/diff/1/19
File language/grammar.y (right):

http://codereview.appspot.com/154047/diff/1/19#newcode88
language/grammar.y:88: | string_literal { fprintf(stderr, "%d
primary_expression <- string_literal\n", lex_lineno()); }
Agh; why is the casing different for different literals :(

It's annoying to have rules "string_literal" and "STRING_LITERAL".

http://codereview.appspot.com/154047/diff/1/19#newcode112
language/grammar.y:112: // TODO: '123()' parses to a
function_call_expression. Make sure to validate
Do you really want this to allow <primary_expression>()?
I even find "{return 0;}()" questionable above; I was expecting this to
be restricted to identifiers for a symbol table lookup.

Note: You also allow 123()().

http://codereview.appspot.com/154047/diff/1/19#newcode122
language/grammar.y:122: // We allow dangling commas for convenience.
Blech. Okay, I guess.

http://codereview.appspot.com/154047/diff/1/19#newcode129
language/grammar.y:129: | IDENTIFIER ':' assignment_expression {
fprintf(stderr, "%d argument <- IDENTIFIER ':' assignment_expression\n",
lex_lineno()); }
Why do you want to allow this?
(b:a=5) ? Clearly I'm missing something here.

http://codereview.appspot.com/154047/diff/1/19#newcode161
language/grammar.y:161: : '+' { fprintf(stderr, "%d unary_operator <-
'+'\n", lex_lineno()); }
Why allow + as a unary operator?

http://codereview.appspot.com/154047/diff/1/19#newcode173
language/grammar.y:173: : cast_expression { fprintf(stderr, "%d
multiplicative_expression <- cast_expression\n", lex_lineno()); }
Presumably this is for order of expressions? It seems like there must be
a better way.

http://codereview.appspot.com/154047

Tim Hockin

Nov 12, 2009, 12:09:56 AM

to tho...@gmail.com, pp-d...@googlegroups.com, mjte...@gmail.com

Wow! Thanks!

On Wed, Nov 11, 2009 at 7:59 PM, <mjte...@gmail.com> wrote:
> I just glanced at the lex/yacc stuff. I'll read through more later.
>
>
> http://codereview.appspot.com/154047/diff/1/10
> File language/grammar.l (right):
>
> http://codereview.appspot.com/154047/diff/1/10#newcode2
> language/grammar.l:2: DEC [0-9]
> [:digit:] ?
>
> http://codereview.appspot.com/154047/diff/1/10#newcode5
> language/grammar.l:5: SPACE [ \t\n\v\f\r]
> Why can't you use [:space:] ?

I didn't realize lex had built-in character classes.
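
For reference, flex spells these classes inside a bracket expression, as in [[:digit:]] and [[:space:]], and documents them as equivalent to the matching C isXXX functions. The sketch below is not from the change. It only checks that, in the default "C" locale, the hand-written DEC and SPACE classes accept the same bytes as isdigit() and isspace(). The helper names are made up for illustration.

```cpp
#include <cctype>
#include <string>

// Hand-written classes quoted from grammar.l in this thread:
//   DEC   [0-9]
//   SPACE [ \t\n\v\f\r]
// Helper names below are hypothetical, not part of the change.
inline bool manual_dec(unsigned char c) { return c >= '0' && c <= '9'; }

inline bool manual_space(unsigned char c) {
    return std::string(" \t\n\v\f\r").find(static_cast<char>(c)) != std::string::npos;
}

// Returns true if the manual classes agree with isdigit()/isspace()
// for every byte value. Assumes the program runs in the "C" locale.
inline bool classes_agree_with_ctype() {
    for (int c = 0; c < 256; ++c) {
        unsigned char uc = static_cast<unsigned char>(c);
        if (manual_dec(uc) != (std::isdigit(c) != 0)) return false;
        if (manual_space(uc) != (std::isspace(c) != 0)) return false;
    }
    return true;
}
```

If that holds, swapping the definitions for the POSIX classes should not change what the lexer accepts, at least under the "C" locale.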

> http://codereview.appspot.com/154047/diff/1/10#newcode69
> language/grammar.l:69: /* Support C-style comments, even nested. */
> Argh. Don't do this. :)
>
> Why deviate from the C-style rule for comments? It's what people are
> used to.

This was a very late addition. I originally had only // line comments.
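
To make the difference concrete, here is a small sketch (not the code in grammar.l) of a block-comment scanner that can run in either mode. The function name and the `nested` flag are assumptions for illustration. With nesting off it follows the C rule, where the first "*/" ends the comment.

```cpp
#include <cstddef>
#include <string>

// Returns the index just past the comment that begins at `start` (which must
// point at "/*"), or std::string::npos if the comment is never closed.
// nested == false: C rule, the first "*/" terminates the comment.
// nested == true:  every inner "/*" must be matched by its own "*/".
inline std::size_t skip_block_comment(const std::string& src,
                                      std::size_t start, bool nested) {
    int depth = 1;
    std::size_t i = start + 2;  // step over the opening "/*"
    while (i + 1 < src.size()) {
        if (nested && src[i] == '/' && src[i + 1] == '*') {
            ++depth;
            i += 2;
        } else if (src[i] == '*' && src[i + 1] == '/') {
            i += 2;
            if (--depth == 0) return i;
        } else {
            ++i;
        }
    }
    return std::string::npos;
}
```

On the input "/* a /* b */ c */ x", the C rule stops after "b */" and leaves "c */ x" to be lexed as code, while the nested rule consumes everything up to the final "*/". That mismatch is what people used to C would trip over.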

> http://codereview.appspot.com/154047/diff/1/10#newcode86
> language/grammar.l:86: 0[bB]{DEC}+ { dump(yyscanner); return
> int_literal(yyscanner); }
> What is this token? This is weird to me.
> I don't think you want to match 0b34.

I want to match it and then issue a useful error from int_literal().
Defining it more tightly results in "0b34" parsing as INT_LITERAL "0"
+ IDENTIFIER "b34". I tested C (using octal "08") and it behaves
similarly.
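
A rough sketch of that approach: match loosely in the lexer, then validate the digits and report a specific error. The real int_literal() is not shown in this thread, so the result struct, the function name, and the error wording here are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical result type; the actual int_literal() may look different.
struct IntLiteralResult {
    bool ok;
    std::uint64_t value;
    std::string error;
};

// Validates text already matched by the loose pattern 0[bB]{DEC}+.
inline IntLiteralResult parse_binary_literal(const std::string& text) {
    IntLiteralResult r{false, 0, ""};
    if (text.size() < 3 || text[0] != '0' || (text[1] != 'b' && text[1] != 'B')) {
        r.error = "not a binary literal: " + text;
        return r;
    }
    std::uint64_t value = 0;
    for (std::size_t i = 2; i < text.size(); ++i) {
        char c = text[i];
        if (c != '0' && c != '1') {
            r.error = std::string("invalid digit '") + c + "' in binary literal " + text;
            return r;
        }
        if (value >> 63) {  // the next shift would drop the top bit
            r.error = "binary literal too large: " + text;
            return r;
        }
        value = (value << 1) | static_cast<std::uint64_t>(c - '0');
    }
    r.ok = true;
    r.value = value;
    return r;
}
```

With this, "0b34" is rejected as one token with a message naming the bad digit, instead of silently splitting into "0" followed by the identifier "b34".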

> http://codereview.appspot.com/154047/diff/1/19
> File language/grammar.y (right):
>
> http://codereview.appspot.com/154047/diff/1/19#newcode88
> language/grammar.y:88: | string_literal { fprintf(stderr, "%d
> primary_expression <- string_literal\n", lex_lineno()); }
> Agh; why is the casing different for different literals :(
>
> It's annoying to have rules "string_literal" and "STRING_LITERAL".

string_literal is a non-terminal rule that enables quoted string
joining, like C. If you prefer, I could define int_literal :
INT_LITERAL for the other literals. Would that be easier to read?
You never want to use STRING_LITERAL, just string_literal.

> http://codereview.appspot.com/154047/diff/1/19#newcode112
> language/grammar.y:112: // TODO: '123()' parses to a
> function_call_expression. Make sure to validate
> Do you really want this to allow <primary_expression>()?

This is how precedence is defined here, by chaining different classes
of expressions. Because my function decl syntax is simpler than C, I
got rid of the complicated declarator, abstract_declarator, etc. rules.
This is a case where the grammar is flexible enough to accept
constructs that are semantically invalid. The TODO is just a reminder
to me to validate that the called symbol actually evaluates to a
function, which "123" obviously does not.
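
Since the interpreter does not exist yet, the following is only a guess at what that semantic check could look like. The Value type, its fields, and call() are invented for this sketch and are not taken from variable.h.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical dynamically typed value; not the type from variable.h.
struct Value {
    enum class Kind { Int, String, Bool, Function };
    Kind kind = Kind::Int;
    long int_value = 0;
    // Set only when kind == Function.
    Value (*fn)(const std::vector<Value>&) = nullptr;
};

inline Value make_int(long v) {
    Value x;
    x.kind = Value::Kind::Int;
    x.int_value = v;
    return x;
}

inline Value make_function(Value (*f)(const std::vector<Value>&)) {
    Value x;
    x.kind = Value::Kind::Function;
    x.fn = f;
    return x;
}

// The grammar accepts any primary expression followed by "()", so the
// callee has to be checked when the call is evaluated.
inline Value call(const Value& callee, const std::vector<Value>& args) {
    if (callee.kind != Value::Kind::Function || callee.fn == nullptr) {
        throw std::runtime_error("called value is not a function");
    }
    return callee.fn(args);
}
```

Under this scheme "123()" parses but fails at evaluation, and "123()()" fails at the first call, while a function literal that returns a function can still be called twice.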

> I even find "{return 0;}()" questionable above; I was expecting this to
> be restricted to identifiers for a symbol table lookup.

You were expecting what "this" to be restricted to identifiers?
function literals? I just can't parse your sentence :)

> Note: You also allow 123()().
>
> http://codereview.appspot.com/154047/diff/1/19#newcode122
> language/grammar.y:122: // We allow dangling commas for convenience.
> Blech. Okay, I guess.

C does, Python does. For list literals it's convenient.

> http://codereview.appspot.com/154047/diff/1/19#newcode129
> language/grammar.y:129: | IDENTIFIER ':' assignment_expression {
> fprintf(stderr, "%d argument <- IDENTIFIER ':' assignment_expression\n",
> lex_lineno()); }
> Why do you want to allow this?
> (b:a=5) ? Clearly I'm missing something here.

It's another case of precedence rules. I took the expression rule
hierarchy from a C99 grammar. It is valid to call a function as
"foo(a=5);". The result of an assignment expression is the value
assigned, so changing it would be a departure from C.

> http://codereview.appspot.com/154047/diff/1/19#newcode161
> language/grammar.y:161: : '+' { fprintf(stderr, "%d unary_operator <-
> '+'\n", lex_lineno()); }
> Why allow + as a unary operator?

Because C does.

> http://codereview.appspot.com/154047/diff/1/19#newcode173
> language/grammar.y:173: : cast_expression { fprintf(stderr, "%d
> multiplicative_expression <- cast_expression\n", lex_lineno()); }
> Presumably this is for order of expressions? It seems like there must be
> a better way.

I could make one great big expression rule and then define precedence,
but I actually find this cleaner.
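
To illustrate why the chained style encodes precedence without any explicit declarations, here is a tiny recursive-descent evaluator with one function per level. It is a sketch only: it handles integers, drops the cast_expression level, and uses loops where the yacc grammar uses left recursion. The class name is made up.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// One function per grammar level, lowest precedence first:
//   additive       : multiplicative (('+' | '-') multiplicative)*
//   multiplicative : unary (('*' | '/') unary)*
//   unary          : ('+' | '-') unary | primary
//   primary        : INT | '(' additive ')'
class ExprParser {
public:
    explicit ExprParser(std::string text) : text_(std::move(text)) {}

    long parse() {
        long v = additive();
        skip_space();
        if (pos_ != text_.size()) throw std::runtime_error("unexpected trailing input");
        return v;
    }

private:
    void skip_space() {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
    }
    bool accept(char c) {
        skip_space();
        if (pos_ < text_.size() && text_[pos_] == c) { ++pos_; return true; }
        return false;
    }
    long additive() {
        long v = multiplicative();
        for (;;) {
            if (accept('+')) v += multiplicative();
            else if (accept('-')) v -= multiplicative();
            else return v;
        }
    }
    long multiplicative() {
        long v = unary();
        for (;;) {
            if (accept('*')) {
                v *= unary();
            } else if (accept('/')) {
                long d = unary();
                if (d == 0) throw std::runtime_error("division by zero");
                v /= d;
            } else {
                return v;
            }
        }
    }
    long unary() {
        if (accept('+')) return unary();   // unary plus, as in C
        if (accept('-')) return -unary();
        return primary();
    }
    long primary() {
        if (accept('(')) {
            long v = additive();
            if (!accept(')')) throw std::runtime_error("expected ')'");
            return v;
        }
        skip_space();
        if (pos_ >= text_.size() ||
            !std::isdigit(static_cast<unsigned char>(text_[pos_]))) {
            throw std::runtime_error("expected integer");
        }
        long v = 0;
        while (pos_ < text_.size() &&
               std::isdigit(static_cast<unsigned char>(text_[pos_]))) {
            v = v * 10 + (text_[pos_++] - '0');
        }
        return v;
    }

    std::string text_;
    std::size_t pos_ = 0;
};
```

Because additive() can only reach its operands through multiplicative(), "1 + 2 * 3" groups as 1 + (2 * 3) with no precedence table. The alternative mentioned above, a single expression rule plus precedence declarations, gets the same grouping from the parser generator instead of from the rule structure.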