How to handle 'free form' text alongside grammar defined statements


john...@jptechnical.co.uk

Jun 16, 2013, 2:45:10 PM
to ply-...@googlegroups.com
I've just completed a PLY project to do some preprocessing on a 'C' source file for a PIC microcontroller (specifically, to handle my version of the #rom directive and convert it to the compiler's version). This is working just as I want, but I have only defined the grammar for the specific #rom statements I wish to process.

It would be convenient, in the same file, to have various other 'C' statements (free-form text), #define or typedefs for example. Can someone suggest a way to do this without having to put the complete 'C' grammar, or subsections of it, into my preprocessor? That is to say, I want the preprocessor to copy all non-#rom statements (i.e. lines of other 'C' code) 'as is' to the output file. It is only the #rom statements I need to process on the way to the output file.

Needless to say, the free-form text and the #rom statements overlap in terms of some tokens. The free-form text may include 'C' expressions, as may a #rom statement, as well as literals such as { } ( ) @ etc.

Any ideas would be appreciated.

Regards,

John

Bruce Frederiksen

Jun 16, 2013, 3:09:09 PM
to ply-...@googlegroups.com
One approach would be to have two different scanner states.  To keep this simple, two conditions would have to be met:
  1. A preprocessing statement can be recognized by its first token.  I.e., the first token in any preprocessing statement doesn't appear in any C code.
  2. The end of each preprocessing statement can be determined without having to look ahead at the next token (which would be the first C token to copy after the preprocessing statement).
If these conditions can be met for your preprocessing statements, then the two scanner states would be:
  • A "copy" state that only recognizes and returns the first tokens for each of the preprocessing statements and copies everything else.
  • A "full scan" state that recognizes and returns all tokens.
The parser could then switch the scanner back and forth between these two states.  It would be written to recognize a series of preprocessing statements and nothing else.  The scanner would start in the "copy" state, so none of the C code would be seen by the parser.  When the scanner reaches the first token of a preprocessing statement, it returns it to the parser; the parser then immediately switches the scanner to the "full scan" state and parses the rest of that preprocessing statement.  When it encounters the end of the statement, it switches the scanner back to the "copy" state.
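To make that concrete, here is a minimal PLY sketch of the lexer half; the token names, the #rom-only focus and the output stream are my own assumptions, not code from your project:

import sys
import ply.lex as lex

# Illustrative token set -- a real preprocessor would define more.
tokens = ('ROM', 'ID', 'NUMBER', 'COMMA', 'LBRACE', 'RBRACE')

# 'INITIAL' doubles as the "full scan" state; 'copy' is the state that
# copies ordinary C code straight through to the output.
states = (('copy', 'exclusive'),)

output = sys.stdout          # wherever the preprocessed file is written

# "copy" state
def t_copy_ROM(t):
    r'\#rom'
    t.lexer.begin('INITIAL') # first token of a preprocessing statement:
    return t                 # hand it to the parser and start full scanning

def t_copy_CCODE(t):
    r'[^\#]+|\#'
    output.write(t.value)    # ordinary C code: copy it, return no token

t_copy_ignore = ''

def t_copy_error(t):
    t.lexer.skip(1)

# "full scan" state (INITIAL)
t_COMMA  = r','
t_LBRACE = r'\{'
t_RBRACE = r'\}'
t_ID     = r'[A-Za-z_][A-Za-z0-9_]*'
t_NUMBER = r'\d+'
t_ignore = ' \t\n'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.begin('copy')          # start out copying

Here I've let the #rom rule itself switch to the full-scan state; the switch back to "copy" at the end of each preprocessing statement would come from a parser rule action calling lexer.begin('copy').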

Hope this helps.

BTW, are you making this code available?  I've been thinking that it would be nice to have a general macro-processing preprocessor that would be easier to use than m4.  Wondering if what you're doing might be a start for such a thing?

-Bruce



john...@jptechnical.co.uk

Jun 16, 2013, 3:25:33 PM
to ply-...@googlegroups.com
Thanks for the ideas so far. The grammar I've defined to date is:

#rom ID, int8 { int8, int8 … }

#rom ID, int16 { int16, int16 … }

#rom ID, "Quoted text string"

#rom ID, """

Block text a

la Python

"""

#rom (locateLo|locateHi)@addr


IDs are general identifiers so each rom block can be referred to by the rest of the 'C' code. int8, int16, locateLo and locateHi are reserved words; the int8 and int16 values inside the braces, and addr, are general 'C' expressions.
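For illustration only, a toy PLY encoding of these forms might look like the sketch below. The token and rule names are invented, the lexer is crude and the 'expression' rule is cut down to numbers, identifiers and '+', so it is only the rough shape of the real thing:

import ply.lex as lex
import ply.yacc as yacc

reserved = {'int8': 'INT8KW', 'int16': 'INT16KW',
            'locateLo': 'LOCATELO', 'locateHi': 'LOCATEHI'}

tokens = ['ROM', 'ID', 'NUMBER', 'STRING', 'BLOCKSTRING',
          'COMMA', 'LBRACE', 'RBRACE', 'AT', 'PLUS'] + list(reserved.values())

t_ROM         = r'\#rom'
t_COMMA       = r','
t_LBRACE      = r'\{'
t_RBRACE      = r'\}'
t_AT          = r'@'
t_PLUS        = r'\+'
t_BLOCKSTRING = r'"""[\s\S]*?"""'
t_STRING      = r'"[^"\n]*"'
t_NUMBER      = r'\d+'
t_ignore      = ' \t\n'

def t_ID(t):
    r'[A-Za-z_][A-Za-z0-9_]*'
    t.type = reserved.get(t.value, 'ID')   # keywords come back as their own token types
    return t

def t_error(t):
    t.lexer.skip(1)

precedence = (('left', 'PLUS'),)

def p_rom_data(p):
    """rom_stmt : ROM ID COMMA INT8KW LBRACE expr_list RBRACE
                | ROM ID COMMA INT16KW LBRACE expr_list RBRACE"""
    p[0] = ('data', p[2], p[4], p[6])

def p_rom_string(p):
    """rom_stmt : ROM ID COMMA STRING
                | ROM ID COMMA BLOCKSTRING"""
    p[0] = ('string', p[2], p[4])

def p_rom_locate(p):
    """rom_stmt : ROM LOCATELO AT expression
                | ROM LOCATEHI AT expression"""
    p[0] = ('locate', p[2], p[4])

def p_expr_list(p):
    """expr_list : expression
                | expr_list COMMA expression"""
    p[0] = [p[1]] if len(p) == 2 else p[1] + [p[3]]

def p_expression(p):
    """expression : NUMBER
                | ID
                | expression PLUS expression"""
    p[0] = p[1] if len(p) == 2 else ('+', p[1], p[3])

def p_error(p):
    print('syntax error at', p)

lexer = lex.lex()
parser = yacc.yacc(start='rom_stmt')
print(parser.parse('#rom table1, int8 { 1, 2, x + 3 }', lexer=lexer))

Run as-is, the last line prints ('data', 'table1', 'int8', ['1', '2', ('+', 'x', '3')]).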

I'll certainly make the source available, but it's rather specific compared to something like M4. What's the best way to do that? It's ~640 lines at the moment.


Bruce Frederiksen

Jun 16, 2013, 4:21:34 PM
to ply-...@googlegroups.com
It looks like all of the statements satisfy the first requirement (they start with a non-C token), since #rom is not a normal C identifier.

All but the last satisfy the second requirement (the end of the statement can be recognized without looking ahead).  Taking the last statement as an example, after the parser sees:

#rom locateLo@A + B

it would need to peek at the next token to see if it is one that would continue the expression (such as +, -, *, etc), or something else such as the 'void' token for a following function definition:

#rom locateLo@A + B + C

vs.

#rom locateLo@A + B

void foo(int x) {
    ...
}

In the second case, the 'void' token would be scanned in the "full scan" state, so would not be copied to the output.  Having seen 'void', the parser would return the scanner to the "copy" state, where it would pick up with "foo"... and start copying from there, missing 'void'.

If it is possible to change the syntax, something like:

#rom locateLo(addr)

or

#rom locateLo@addr;

would correct this.  Then the ) or ; would signal the end of the statement with no need to look any further.

As to generalizing this, I would think that most of the scanner tokens would be the same for most languages (various literals, identifiers, returning any single special character as a token; though comments are a bit different).  Thus, from the scanner perspective, it should be relatively easy to use this as a preprocessor for Python, or Ruby, or C, or Fortran, or many other languages.

You want C expressions in your preprocessor statements.  Full C expressions include multi-special-character tokens such as ++ or +=.  I don't know what these would mean in your context, or whether you would want to include them since they have side effects; but if so, you'd need special scanner rules for them, together with a single catch-all rule that returns any other single special character as a token.
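For example (token names invented here), the multi-character operators get their own function rules, defined before a catch-all that returns any remaining special character as a one-character token:

import ply.lex as lex

tokens = ('INCR', 'PLUSEQ', 'NAME', 'NUMBER', 'PUNCT')

def t_INCR(t):               # function rules are tried in definition order,
    r'\+\+'                  # so '++' and '+=' win over the catch-all below
    return t

def t_PLUSEQ(t):
    r'\+='
    return t

def t_PUNCT(t):
    r'[^\sA-Za-z0-9_]'       # any other single special character
    return t

t_NAME   = r'[A-Za-z_][A-Za-z0-9_]*'
t_NUMBER = r'\d+'
t_ignore = ' \t\n'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('a += b++ @ {c}')
for tok in lexer:
    print(tok.type, tok.value)   # NAME a, PLUSEQ +=, NAME b, INCR ++, PUNCT @, ...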

Then the parser rules would be specific to your preprocessing directives.  If somebody wants to use this for another purpose, they would need to define their own preprocessor statement syntax.

If you are familiar with a source code control system, such as Mercurial, you can put your code on code.google.com for free.  There are several other free open source code hosting sites if you don't like Google.  No rush on any of this, if you just want to play with the idea for a bit to see if you can get it working.

If you aren't familiar with a common source code control system, and don't mind sending a copy to me, and don't mind if I put it up on Google Code, I would appreciate it!  Again, no rush...

Thanks!

-Bruce



A.T.Hofkamp

Jun 17, 2013, 2:59:02 AM
to ply-...@googlegroups.com
On 06/16/2013 09:25 PM, john...@jptechnical.co.uk wrote:
> Thanks for the ideas so far. The grammar I've defined to date is:
>
> *#rom* /ID/, *int8* { /int8/, /int8/ … }
>
> *#rom* /ID/, *int16* { /int16/, /int16/ … }
>
> *#rom* /ID/, "/Quoted text string/"
>
> *#rom* /ID/, """
>
> /Block text a/
>
> /la Python/
>
> """
>
> *#rom* (*locateLo*|*locateHi*)@/addr/

It looks like a two-stage problem to me, where PLY could play a role.

First, perform a line-based search to separate the #rom sections from the other sections.
This can be fairly high level: you are not interested in the precise content at this stage, you are
just looking for the end of a #rom section.

Ideally you can do that with some simple string operations and/or re matching. If you have nested
delimiters (e.g. "{ { } }"), you need to do some counting.
More complicated options are inspecting a token stream (possibly with counting), or even a parse,
where most of the input would map to an "unimportant text" token. If you want to preserve
white-space, also generate whitespace tokens.

At the end, you should get a list of #rom and non-#rom sections.


Now that you know which text is a #rom section, you can address the parsing of the separate #rom
sections. I suspect you'll need C expression parsing, to avoid counting "{ 1, f(2,3,4) }" as 4 values
(which happens if you do line.split(',')).
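As a sketch of that first stage (this assumes a #rom section starts at the beginning of a line and, when braces are used, counts them to find the end; strings, comments and the triple-quoted form are ignored, so it is illustrative only):

def split_sections(lines):
    """Group source lines into ('c', [...]) and ('rom', [...]) sections."""
    sections, current, kind, depth = [], [], 'c', 0
    for line in lines:
        if kind == 'c' and line.lstrip().startswith('#rom'):
            if current:
                sections.append(('c', current))
            current, kind = [line], 'rom'
            depth = line.count('{') - line.count('}')
            if depth <= 0:                      # single-line #rom statement
                sections.append(('rom', current))
                current, kind = [], 'c'
        elif kind == 'rom':
            current.append(line)
            depth += line.count('{') - line.count('}')
            if depth <= 0:                      # matching '}' found
                sections.append(('rom', current))
                current, kind = [], 'c'
        else:
            current.append(line)                # ordinary C code
    if current:
        sections.append((kind, current))
    return sections

Each ('rom', ...) chunk can then go to the PLY-based #rom parser, and each ('c', ...) chunk is copied through untouched.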


Albert

john...@jptechnical.co.uk

Jun 19, 2013, 8:23:07 AM
to ply-...@googlegroups.com
Thanks for all the replies and ideas, plenty to think about.

I've just found and read section 4.18, 'Conditional lexing and start conditions', in 'ply.html'. Lexer states look like the way to go. My initial thought is to have one state that matches a 'token' up to a #rom token (easy enough with a Python re), pass all this other 'C' code to the parser as a single token, and switch the lexer state to a #rom tokenising state. At the end of the #rom statement a parser rule action will have to switch the lexer state back. This approach has the advantage that all output goes through the parser and all interaction with the lexer is at the published interface level.
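In rough outline (the state and token names are just my working guesses, and the #rom-statement rules are trimmed right down):

import ply.lex as lex

tokens = ('CCODE', 'ROM', 'ID', 'COMMA')    # #rom-statement tokens trimmed

# 'ccode' handles the free-form C text; 'INITIAL' tokenises #rom statements.
states = (('ccode', 'exclusive'),)

def t_ccode_ROM(t):
    r'\#rom'
    t.lexer.begin('INITIAL')    # from here on, tokenise the #rom statement
    return t

def t_ccode_CCODE(t):
    r'[\s\S]+?(?=\#rom)|[\s\S]+'
    return t                    # everything up to the next #rom (or to end
                                # of file) goes to the parser as one token

t_ccode_ignore = ''

def t_ccode_error(t):
    t.lexer.skip(1)

# INITIAL-state rules for the #rom grammar (only a couple shown)
t_COMMA  = r','
t_ID     = r'[A-Za-z_][A-Za-z0-9_]*'
t_ignore = ' \t\n'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.begin('ccode')            # start in the free-form text state

A grammar rule for the CCODE token would just write its value to the output, and a parser rule action at the end of each #rom statement would call lexer.begin('ccode') to drop back into the free-form state, as above.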

I'll definitely make the '#rom locateHi|Lo @ addr' statement explicitly terminated; ';' is 'C'-like and seems good to me. Hopefully this will stop the parser having to read a lookahead token, as dangyogi pointed out.

Will do these mods but not sure when.

I don't know if it would be feasible to turn it into a general-purpose preprocessor as an m4 alternative, but it could certainly be a customisable framework.

Thanks again everyone.
John

john...@jptechnical.co.uk

Jul 8, 2013, 11:24:55 AM
to ply-...@googlegroups.com
Made all the mods to the preprocessor so it handles free-form text between #rom statements. I went to put the code on code.google.com as dangyogi suggested; in creating the project it asks which license to use. Suggestions as to the best one to use would be appreciated, as I'm new to open source publishing.
Many thanks,
John

Bruce Frederiksen

Jul 8, 2013, 2:37:34 PM
to ply-...@googlegroups.com
I'm certainly not an expert on the various licenses.  My limited understanding is that if you don't care what anybody does with it, including selling it as a commercial product, then BSD or MIT are two that are short and sweet (I generally use BSD, but forget why).

If you want to limit its use to other open source projects, then use the GNU GPL or GNU Lesser GPL.  These both require anybody redistributing the code to keep it under the original license.  The Lesser license is for libraries, allowing them to be used in other programs without restriction while the library's own source code keeps the limitations.  I think that the GNU licenses have some gotchas that aren't well understood when a user wants to use your GPL project along with some other non-GPL project, so they might scare some people off.  I don't believe that the GPL prevents somebody from selling your project commercially, so long as they offer the sources to anybody who asks for them, for free, under the same license.  Not sure how anybody could make money this way, except maybe by selling support?

The Apache license is another popular one.  Not sure what its limitations are.

Sorry I couldn't be more helpful.  :-/

-Bruce



john...@jptechnical.co.uk

Jul 9, 2013, 7:50:29 PM
to ply-...@googlegroups.com
The preprocessor is good enough to use now and I've annotated it with plenty of comments. The source is available at http://code.google.com/p/ccs-preprocessor/. Usage notes and source file format comments are extracted and put in the wiki.

I used an extra, exclusive state, 'ccode', in the lexer to handle the free-form text between #rom statements. To switch between the lexer states 'ccode' and 'INITIAL' (which handles the #rom statements), empty productions are used immediately before the terminating token of each state. The action of these productions is to switch lexer states. That way the lexer state is changed as soon as the terminating token of a state has been read by the parser.
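A cut-down, self-contained illustration of the idea (it knows only one #rom form, and all the names are simplified stand-ins for those in the real source):

import sys
import ply.lex as lex
import ply.yacc as yacc

tokens = ('CCODE', 'ROM', 'LOCATELO', 'AT', 'ID', 'PLUS', 'SEMI')

states = (('ccode', 'exclusive'),)      # 'ccode' = free-form text state

# lexer, 'ccode' state
def t_ccode_ROM(t):
    r'\#rom'
    return t                            # state switching is left to the parser

def t_ccode_CCODE(t):
    r'[\s\S]+?(?=\#rom)|[\s\S]+'
    return t                            # a lump of ordinary C code

t_ccode_ignore = ''

def t_ccode_error(t):
    t.lexer.skip(1)

# lexer, 'INITIAL' state (inside a #rom statement)
def t_LOCATELO(t):
    r'locateLo'
    return t

t_AT     = r'@'
t_PLUS   = r'\+'
t_SEMI   = r';'
t_ID     = r'[A-Za-z_][A-Za-z0-9_]*'
t_ignore = ' \t\n'

def t_error(t):
    t.lexer.skip(1)

# parser
def p_file(p):
    """file : chunk
            | file chunk"""

def p_chunk_ccode(p):
    "chunk : CCODE"
    sys.stdout.write(p[1])              # free-form text is copied through as-is

def p_chunk_rom(p):
    "chunk : to_initial ROM LOCATELO AT expr to_ccode SEMI"
    sys.stdout.write('/* locateLo of (%s) would be emitted here */' % p[5])

def p_expr(p):
    """expr : ID
            | expr PLUS ID"""
    p[0] = p[1] if len(p) == 2 else p[1] + ' + ' + p[3]

# Empty production reduced just before ROM (the last token the 'ccode'
# state returns), so everything after ROM is scanned in 'INITIAL'.
def p_to_initial(p):
    "to_initial :"
    p.lexer.begin('INITIAL')

# Empty production reduced just before the terminating ';', so the token
# after the #rom statement is scanned back in 'ccode'.
def p_to_ccode(p):
    "to_ccode :"
    p.lexer.begin('ccode')

def p_error(p):
    print('syntax error at', p)

lexer = lex.lex()
lexer.begin('ccode')
parser = yacc.yacc(start='file')
parser.parse('int x;\n#rom locateLo @ base + off;\nvoid f(void) {}\n',
             lexer=lexer)

The surrounding C code comes out untouched, the #rom statement is replaced by whatever the rule action emits, and both lexer-state switches happen through the empty productions as described above.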

Interestingly, the code that does the command-line decoding and the #rom statement processing (the RomHandler class) is relatively large compared to the lexer and parser code. Considering what the parser is doing, that seems to me a credit to lex/yacc, PLY and, of course, Python.

I plan to convert the module into a more Pythonic OO framework suitable for building such preprocessors more quickly than starting from scratch. But that will have to be a couple of months away.

Comments invited,
John

