parsing a block of code

Jelle Feringa

Nov 14, 2013, 6:22:22 AM
to ply-...@googlegroups.com

Hi,

I'm writing a parser for the RAPID robot language.
I'd like to know what is a good way to approach the following
parsing problem. Something specific for RAPID is that
for loops, conditionals, function and module definitions
all use a familiar MODULE < module block > ENDMODULE
or PROC < procedure block > ENDPROC structure.

My question is what is the right way to go about this?
Here we have an example of a procedure defined in RAPID.

Intuitively I would write a regex that matches the
name of the procedure, its argument and the procedure block.

Another way would be to drop into a state when such a
START / END block is found, but that feels unnecessarily
complex.


PROC top_front( string strNoStepIn )
! procedure block
MoveL ...;
ENDPROC

Since this pattern is so present in the language, I'd like to get it
right and in a p(l)ythonic manner. Thing is that I'm too new to the
parsing to really see that.

Thanks,

-jelle



A.T.Hofkamp

Nov 14, 2013, 6:55:52 AM
to ply-...@googlegroups.com
On 11/14/2013 12:22 PM, Jelle Feringa wrote:
> My question is what is the right way to go about this?
> Here we have an example of a procedure defined in RAPID.

The example seems to be missing, but in general, you don't start with the parser, you start with the
scanner, identifying the individual words that you should recognize.

> PROC top_front( string strNoStepIn )
> ! procedure block
> MoveL ...;
> ENDPROC

becomes a sequence of tokens, listed one per line below. Empty lines and // text are added to
clarify what you read, and token names are written in all uppercase.

PROC
IDENTIFIER(top_front)
PARENTHESIS_OPEN
STRING // if "string" is not a built-in, it would become an IDENTIFIER
IDENTIFIER(strNoStepIn)
PARENTHESIS_CLOSE

// Assuming ! means 'comment', skipped it.

MoveL

// skipped some

SEMICOLON
ENDPROC

You break down your input text in these small elementary words with the scanner. I didn't do it, but
it's often useful to add a suffix or prefix to keywords (I use ...KW, eg PROCKW), and other tokens
(I use ...TK), it makes the parser rules below more readable, and avoids name conflicts between
different tokens that are closely related, like the keyword string denoting a type and a literal
string like "abcd".
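As a rough illustration of the scanner step, here is a minimal sketch that uses only Python's standard `re` module rather than PLY itself (in PLY the same patterns would be written as `t_`-prefixed token rules). The token names follow the list above; the keyword set, the case-insensitive keyword match, the `STRING_LITERAL` name, and the treatment of `!` as a comment are assumptions made for this example, not the RAPID specification.

```python
import re

# Assumed keyword set for this sketch; keywords are matched case-insensitively.
KEYWORDS = {"PROC", "ENDPROC", "MODULE", "ENDMODULE", "STRING"}

# Order matters: comments and string literals are tried before identifiers.
TOKEN_SPEC = [
    ("COMMENT", r"![^\n]*"),            # assumes '!' starts a line comment
    ("STRING_LITERAL", r'"[^"\n]*"'),   # literal such as "abcd"
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("PARENTHESIS_OPEN", r"\("),
    ("PARENTHESIS_CLOSE", r"\)"),
    ("COMMA", r","),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_type, value) pairs, dropping comments and whitespace."""
    tokens = []
    for m in MASTER.finditer(text):
        kind, value = m.lastgroup, m.group()
        if kind in ("COMMENT", "SKIP"):
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {value!r}")
        # An identifier that spells a keyword becomes that keyword's token.
        if kind == "IDENTIFIER" and value.upper() in KEYWORDS:
            kind = value.upper()
        tokens.append((kind, value))
    return tokens


SOURCE = "PROC top_front( string strNoStepIn )\n! procedure block\nENDPROC\n"
for kind, value in tokenize(SOURCE):
    print(kind, value)
```

Because the string literal is matched as a single token, a keyword spelled inside quotes never reaches the parser as a keyword.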



The parser takes this stream of tokens, and reconstructs the parts you want to keep together, with
grammar rules, like

Procedure : PROC IDENTIFIER PARENTHESIS_OPEN FormalParameters PARENTHESIS_CLOSE Statements ENDPROC ;
Procedure : PROC IDENTIFIER PARENTHESIS_OPEN PARENTHESIS_CLOSE Statements ENDPROC ;

A "Procedure" is a thing that starts with the keyword PROC and ends with the keyword ENDPROC. There
are 2 variants, one with and one without FormalParameters.

FormalParameters : FormalParameter
| FormalParameters COMMA FormalParameter
;

FormalParameter : Type IDENTIFIER ;

Type : STRING
| ...
;

FormalParameters is one or more FormalParameter, separated by COMMA. The latter is a sequence of
Type and IDENTIFIER.
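To make the grammar rules concrete, here is a small hand-written recursive-descent sketch over a list of (type, value) token pairs. It is not PLY code (in PLY each rule would sit in the docstring of a `p_`-prefixed function); the `Parser` class and its method names are hypothetical. Two assumptions: only STRING is accepted as a Type, since the other alternatives are elided above, and Statements is treated as an opaque run of tokens up to ENDPROC, since its rule is not given.

```python
class Parser:
    """Recursive-descent sketch of the Procedure / FormalParameters rules."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        # Type of the next token, or None at end of input.
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, found {self.peek()}")
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def procedure(self):
        # Procedure : PROC IDENTIFIER ( [FormalParameters] ) Statements ENDPROC
        self.expect("PROC")
        name = self.expect("IDENTIFIER")
        self.expect("PARENTHESIS_OPEN")
        params = []
        if self.peek() != "PARENTHESIS_CLOSE":
            params = self.formal_parameters()
        self.expect("PARENTHESIS_CLOSE")
        body = self.statements()
        self.expect("ENDPROC")
        return {"name": name, "params": params, "body": body}

    def formal_parameters(self):
        # The left-recursive rule is written as a loop over COMMA here.
        params = [self.formal_parameter()]
        while self.peek() == "COMMA":
            self.expect("COMMA")
            params.append(self.formal_parameter())
        return params

    def formal_parameter(self):
        # FormalParameter : Type IDENTIFIER  (Type is only STRING in this sketch)
        type_name = self.expect("STRING")
        return (type_name, self.expect("IDENTIFIER"))

    def statements(self):
        # Assumption: collect raw tokens until ENDPROC instead of parsing them.
        body = []
        while self.peek() not in ("ENDPROC", None):
            body.append(self.tokens[self.pos])
            self.pos += 1
        return body


TOKENS = [
    ("PROC", "PROC"), ("IDENTIFIER", "top_front"), ("PARENTHESIS_OPEN", "("),
    ("STRING", "string"), ("IDENTIFIER", "strNoStepIn"), ("PARENTHESIS_CLOSE", ")"),
    ("ENDPROC", "ENDPROC"),
]
result = Parser(TOKENS).procedure()
print(result)
```

The two Procedure variants collapse into one method with an optional parameter list, which is a common simplification when the rules are written by hand.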

> Intuitively I would write a regex that matches the
> name of the procedure, its argument and the procedure block.

In general, regex is not powerful enough to handle programming languages. Consider the case

string x = "endproc";

in the middle of a proc. Good luck detecting the right 'endproc' word. Similar cases exist when a
user comments away a part of a proc.

You may get it working for a set of cases, but covering every case that the RAPID compiler accepts
is probably impossible.
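The "endproc" case can be reproduced with a short experiment. The pattern below is a hypothetical whole-procedure regex of the kind described in the question, and it assumes keywords are matched case-insensitively; the input reuses the string assignment from the example above.

```python
import re

SOURCE = '''PROC top_front( string strNoStepIn )
string x = "endproc";
ENDPROC
'''

# Naive attempt: capture name, arguments, and everything up to the first ENDPROC.
NAIVE = re.compile(
    r"PROC\s+(\w+)\s*\((.*?)\)(.*?)ENDPROC",
    re.IGNORECASE | re.DOTALL,
)

match = NAIVE.search(SOURCE)
body = match.group(3)

# The captured body stops inside the string literal, not at the real ENDPROC.
print(repr(body))
```

The regex has no notion of "inside a string" or "inside a comment", so it ends the procedure at the first spelling of the keyword it meets. A scanner that emits string literals as single tokens avoids the problem.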

> Since this pattern is so present in the language, I'd like to get it
> right and in a p(l)ythonic manner. Thing is that I'm too new to the
> parsing to really see that.

The pattern is not really special, { .. } or BEGIN .. END are mostly the same thing, although they
group different things.

Good luck with your parsing adventure,
Albert

Jelle Feringa

Nov 18, 2013, 10:57:03 AM
to ply-...@googlegroups.com
Dear Albert,

Thanks so much for your constructive comments.
I first completed the tokenization of the RAPID grammar, and
when I print the tok.type and tok.value of parts of RAPID code, it
becomes really obvious how to parse the code, since to some extent
the parsing code is hinted at by the "type" attribute.

So parsing remains a challenging field, but I managed to
move a lot further, also thanks to your comments!

So thanks again,

-jelle
