PyPgen

Franck Pommereau

Jan 8, 2010, 2:58:32 AM

to myt...@googlegroups.com

Hi all,

I recently discovered PyPgen, which is exactly the tool I've been
looking for for years! I've extracted the relevant parts from basil and
bundled everything into a single source file with several changes.

New features:
- possibility to add user-defined tokens (simple ones specified as
strings, not regexps)
- possibility to generate Python modules with no dependencies on any
non-standard library
- probably simpler API

At code level:
- a token class replaces the 3-tuples (type, name, lineno) and keeps
all the information from the original token
- removed many intermediary classes and functions (less code) that
were not needed anymore
- replaced SyntaxError with a specially defined error
- use Python's warnings
- some more documentation

The resulting module is attached. I'll try to backport my changes to
basil source and break them down into several patches that I'll send
here. Note that it is likely that many changes will impact parts of basil
that I didn't have to look at; I hope that the patches will be usable anyway.
By the way: is there any standard test that I can run in order to check
for regressions when I change basil's source? Thanks!

Cheers,
Franck

pgen.py

Jon Riehl

Jan 8, 2010, 5:48:12 PM

to myt...@googlegroups.com

Hi Franck,

I'm glad you found this useful. I'm sorry I don't publicize this
more, since I've never cut a formal release of Basil. More comments
inline.

On Fri, Jan 8, 2010 at 1:58 AM, Franck Pommereau
<franck.p...@gmail.com> wrote:
> New features:
> - possibility to add user-defined tokens (simple ones specified as
> strings, not regexps)

I forget how I do this for Mython, but this is certainly useful.
Mython adds a "quote" keyword, for instance.

> - possibility to generate Python modules with no dependencies on any
> non-standard library

Yeah. Basil is oriented more as a framework, and pays for it in
complexity. I've always talked about making framework-to-script
extraction tools, but never looked too far into it.

> - probably simpler API

One note about the API: I was aiming to be compliant to a non-existent
CPython pgen API (PEP-269). Ideally I wanted my version of PyPgen to
generate parsers that could use the C parser.

> At code level:
> - a token class replaces the 3-tuples (type, name, lineno) and keeps
> all the information from the original token

I'd prefer to make this a switch. My preference for tuples over
classes comes from the olden days when object construction time could
kill parse times (by an order of magnitude). Maybe a timing test
might be in order?
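
[A minimal sketch of such a timing test, not taken from Basil or from the attached module. The `TokenObject` class and the `time_construction` helper are hypothetical names, and the class only stands in for whatever token class ends up being used. Results will vary by machine and Python version.]

```python
import timeit


class TokenObject(object):
    """Hypothetical class-based token holding the same three fields."""

    def __init__(self, kind, text, lineno):
        self.kind = kind
        self.text = text
        self.lineno = lineno


def time_construction(number=100000):
    """Return (tuple_seconds, class_seconds) for building `number` tokens."""
    # Use variables rather than literals so the tuple is really built
    # on each call instead of being folded into a constant.
    kind, text, lineno = 1, "name", 42
    tuple_time = timeit.timeit(lambda: (kind, text, lineno), number=number)
    class_time = timeit.timeit(lambda: TokenObject(kind, text, lineno),
                               number=number)
    return tuple_time, class_time


if __name__ == "__main__":
    tuple_time, class_time = time_construction()
    print("tuples:  %.4f s" % tuple_time)
    print("classes: %.4f s" % class_time)
```

[This measures construction cost only; a fuller test would time a complete parse of a large source file with each representation.]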

> - removed many intermediary classes and functions (less code) that
> were not needed anymore
> - replaced SyntaxError with a specially defined error
> - use Python's warnings

At present, I don't have any issues with these decisions. I have a
personal preference for Python's SyntaxError, but I can't come up with
a good technical reason.

> - some more documentation

On trivial inspection, your documentation seems to be quite an improvement.

> The resulting module is attached. I'll try to backport my changes to
> basil source and break them down into several patches that I'll send
> here. Note that it is likely that many changes will impact parts of basil
> that I didn't have to look at; I hope that the patches will be usable anyway.
> By the way: is there any standard test that I can run in order to check
> for regressions when I change basil's source? Thanks!

I'd be agreeable to a proposal that makes PyPgen both a standalone
script and a replacement for basil.parsing.pgen (or similar). Looking
at PEP-8 and the standard library design directions, it seems that
starting out using the (NASA-originated) one file per class rule of
thumb just isn't Pythonic.

I'm working on building up a testing framework, since I'm going to be
swapping out Mython's parser soon, and want to verify that I'm not
breaking anything. Have you tested your code with various versions of
Python? I ask because it looks like it might work in both 2 and 3,
but I'm not sure.

Thanks for your interest and work!
-Jon

Franck Pommereau

Jan 9, 2010, 5:07:23 AM

to myt...@googlegroups.com

>> - possibility to add user-defined tokens (simple ones specified as
>> strings, not regexps)
>
> I forget how I do this for Mython, but this is certainly useful.
> Mython adds a "quote" keyword, for instance.

'quote' is a NAME token and is recognized by the standard Python
lexer. For my usage, I needed to be able to add tokens like '!', '?'
and so on. That's done by intercepting each ERRORTOKEN and checking
whether it is one of the user-defined tokens.

For instance, I've extended the pgen grammar with the following rule:

'$' NAME STRING NEWLINE

This allows new tokens to be specified in the pgen source.
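
[A minimal sketch of the interception idea, not the code from the attached module. The `USER_TOKENS` table, the token names and the `remap_error_tokens` function are hypothetical. Tokens are assumed to be tuples whose first two items are the numeric type and the text, as produced by the standard `tokenize` module. Which characters the standard lexer reports as ERRORTOKEN may differ between Python versions.]

```python
import token

# Hypothetical table of user-defined tokens: text -> (numeric type, name).
# The numbers start above token.N_TOKENS so that they cannot clash with
# the standard token types.
USER_TOKENS = {
    "!": (token.N_TOKENS + 1, "BANG"),
    "?": (token.N_TOKENS + 2, "QUESTION"),
}


def remap_error_tokens(tokens, user_tokens=USER_TOKENS):
    """Yield tokens, retyping each ERRORTOKEN whose text is user-defined."""
    for tok in tokens:
        tok_type, text = tok[0], tok[1]
        if tok_type == token.ERRORTOKEN and text in user_tokens:
            # Keep the text and any position information, change the type.
            yield (user_tokens[text][0], text) + tuple(tok[2:])
        else:
            yield tuple(tok)
```

[With a real lexer, the input stream would typically come from `tokenize.generate_tokens`, and any ERRORTOKEN that is not in the table would still be reported as an error by the parser.]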

> One note about the API: I was aiming to be compliant to a non-existent
> CPython pgen API (PEP-269). Ideally I wanted my version of PyPgen to
> generate parsers that could use the C parser.

I have no such requirement. In fact, I'd probably like to have
something completely standalone that I could maintain. Indeed, my
library has been designed using the compiler module, which has been
deprecated with Python 3. This makes porting my library to Python 3 a
lot of effort... This bad experience leads me to a solution that
depends as little as possible on Python's low-level libraries.

By the way, I noticed recently that starting from Python 2.6, there is
a package lib2to3.pgen2 that is a Python pgen implementation that
generates Python code. However, I had problems making it work and it
seems to be more complex than PyPgen.

> I'd prefer to make this a switch. My preference for tuples over
> classes comes from the olden days when object construction time could
> kill parse times (by an order of magnitude). Maybe a timing test
> might be in order?

This is also linked to my usage: in order to quickly retrieve the
source code for a part of an ST, I must not lose information. I'll
look at the execution times. If timing is really an issue, I could
probably quickly code a native object that does the same (using
Pyrex/Cython).

Notice also that my Token class is based on str and coded so that
there is also the possibility to use 3-tuples (int, Token, int) as an
immediate replacement for (int, str, int).
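
[A minimal sketch of a str-based token, assuming hypothetical attribute names (`kind`, `lineno`, `line`); the real class in the attached module may differ. It shows why a tuple holding such a token can stand in for the old (int, str, int) tuple.]

```python
class Token(str):
    """A string that also remembers where it came from (hypothetical fields)."""

    def __new__(cls, text, kind=None, lineno=None, line=None):
        self = str.__new__(cls, text)
        self.kind = kind      # numeric token type
        self.lineno = lineno  # line number in the source
        self.line = line      # full source line, kept to recover source text
        return self


# Old-style 3-tuple and its replacement carrying a Token.
old = (1, "spam", 3)
new = (1, Token("spam", kind=1, lineno=3, line="spam = 1\n"), 3)

# The new tuple unpacks and compares exactly like the old one.
kind, text, lineno = new
```

[Code that only treats the middle item as a string keeps working, while code that knows about `Token` can read the extra attributes. Note that string methods such as `upper()` return plain `str` objects, so the extra information is lost on derived strings.]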

> At present, I don't have any issues with these decisions. I have a
> personal preference for Python's SyntaxError, but I can't come up with
> a good technical reason.

My experience in maintaining a (small) compiler is that it's better
not to rely on standard Python exceptions in order to signal errors in
the compiled source. Otherwise, it becomes hard to distinguish whether
the errors come from the compiled source or from your compiler itself.
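
[A minimal sketch of this policy. `ParseError`, `compile_source` and `classify` are hypothetical names, and the "compiler" is a toy that only rejects the text `??`.]

```python
class ParseError(Exception):
    """Error in the source being compiled (hypothetical class)."""

    def __init__(self, message, lineno=None, column=None):
        Exception.__init__(self, message)
        self.lineno = lineno
        self.column = column


def compile_source(source):
    """Toy compiler: reject any line containing '??'."""
    for lineno, line in enumerate(source.splitlines(), 1):
        column = line.find("??")
        if column >= 0:
            raise ParseError("unexpected '??'", lineno, column)
    return "ok"


def classify(source):
    """Report user errors; let every other exception propagate."""
    try:
        return compile_source(source)
    except ParseError as err:
        return "user error at line %d" % err.lineno
    # Anything else, including a SyntaxError, is not caught here and
    # therefore shows up as a bug in the compiler itself.
```

[Because only `ParseError` is caught, a `SyntaxError` or `TypeError` raised by a mistake inside the compiler is never mistaken for an error in the user's source.]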

> Have you tested your code with various versions of
> Python? I ask because it looks like it might work in both 2 and 3,
> but I'm not sure.

Not yet, but for sure it doesn't run with Python 3 since I've included
print statements.
What versions of Python 2 would you like to support?

Cheers,
Franck

Jon Riehl

Jan 13, 2010, 5:49:04 PM

to myt...@googlegroups.com

Hi Franck,

Sorry for the delay. You sent out a lot of stuff over the weekend,
and I wanted to process it all at once. Anyway...

On Sat, Jan 9, 2010 at 4:07 AM, Franck Pommereau
<franck.p...@gmail.com> wrote:
>>> - possibility to add user-defined tokens (simple ones specified as
>>> strings, not regexps)
>>
>> I forget how I do this for Mython, but this is certainly useful.
>> Mython adds a "quote" keyword, for instance.
>
> 'quote' is a NAME token and is recognized by the standard Python
> lexer. For my usage, I needed to be able to add tokens like '!', '?'
> and so on. That's done by intercepting each ERRORTOKEN and checking
> whether it is one of the user-defined tokens.
>
> For instance, I've extended the pgen grammar with the following rule:
>
> '$' NAME STRING NEWLINE
>
> This allows new tokens to be specified in the pgen source.

This is certainly handy. Mython 3000 should have a compile-time
expression that uses '!'. I'll be interested in seeing what your
patches do.

>> One note about the API: I was aiming to be compliant to a non-existent
>> CPython pgen API (PEP-269). Ideally I wanted my version of PyPgen to
>> generate parsers that could use the C parser.
>
> I have no such requirement. In fact, I'd probably like to have
> something completely standalone that I could maintain. Indeed, my
> library has been designed using the compiler module, which has been
> deprecated with Python 3. This makes porting my library to Python 3 a
> lot of effort... This bad experience leads me to a solution that
> depends as little as possible on Python's low-level libraries.

Got it. I don't know what the logistics of keeping everything in sync
looks like, but I certainly understand the allure of having a required
component under your control. At one point I wrote a Python compiler
(about the same time as the compiler module was originated by Bill
Tutt and Greg Stein), and I wouldn't be scared to roll another one.
This may be a requirement, since there is a bug in the compiler module
that is messing with Mython development. What are you doing with it,
if I may ask?

> By the way, I noticed recently that starting from Python 2.6, there is
> a package lib2to3.pgen2 that is a Python pgen implementation that
> generates Python code. However, I had problems making it work and it
> seems to be more complex than PyPgen.

Guido mentioned writing one, but I don't know if that's the same one.

>> I'd prefer to make this a switch. My preference for tuples over
>> classes comes from the olden days when object construction time could
>> kill parse times (by an order of magnitude). Maybe a timing test
>> might be in order?
>
> This is also linked to my usage: in order to quickly retrieve the
> source code for a part of an ST, I must not lose information. I'll
> look at the execution times. If timing is really an issue, I could
> probably quickly code a native object that does the same (using
> Pyrex/Cython).

I'll try to come up with something more concrete that might suit both
of us. I'm hesitant to encourage a fork in the code, but you've
already done the work. *smirk*

> Notice also that my Token class is based on str and coded so that
> there is also the possibility to use 3-tuples (int, Token, int) as an
> immediate replacement for (int, str, int).

This is a good point. When I get to the patches, I'm going to see how
this would impact the unit tests that I decide are required. I need
to get weaned off the tokenize module anyway.

>> At present, I don't have any issues with these decisions. I have a
>> personal preference for Python's SyntaxError, but I can't come up with
>> a good technical reason.
>
> My experience in maintaining a (small) compiler is that it's better
> not to rely on standard Python exceptions in order to signal errors in
> the compiled source. Otherwise, it becomes hard to distinguish whether
> the errors come from the compiled source or from your compiler itself.

That's a real issue in the Mython language (where errors can be in
Python, the Mython implementation, or in a user's embedded language).
I'm going to have to think about this more, but it might be better to
move to something like your policy.

>> Have you tested your code with various versions of
>> Python? I ask because it looks like it might work in both 2 and 3,
>> but I'm not sure.
>
> Not yet, but for sure it doesn't run with Python 3 since I've included
> print statements.
> What versions of Python 2 would you like to support?

I'm still using 2.5 for the majority of my development work, but I'm
trying to catch up to 2.6, and apparently 2.7 is in alpha already. If
I come up with something that is 3.x compatible, you'll know from this
list.

Thanks,
-Jon

Franck Pommereau

Jan 14, 2010, 2:51:05 AM

to myt...@googlegroups.com

Hi Jon,

> This may be a requirement, since there is a bug in the compiler module
> that is messing with Mython development. What are you doing with it,
> if I may ask?

I'm developing a Petri net library to implement the models I'm working
on for research (http://lacl.univ-paris12.fr/pommereau/soft/snakes/).
These Petri nets are annotated with Python code and I need to be able to
do basic queries and transformations, like listing the variables
involved in an expression, or renaming some of them, etc. This can
basically be done using Python's builtin libraries. But my library also has a
compiler to transform a kind of process algebra syntax into Petri nets,
with Python code embedded in the syntax. So that's why I need to have a
parser that I can extend.
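
[A minimal sketch of the two queries mentioned above, listing the variables of an expression and renaming some of them. It assumes the standard `ast` module; the message does not say which builtin library is actually used, and the helper names `variables`, `Renamer` and `rename` are hypothetical.]

```python
import ast


def variables(expr):
    """Return the sorted names that occur in a Python expression."""
    tree = ast.parse(expr, mode="eval")
    return sorted({node.id for node in ast.walk(tree)
                   if isinstance(node, ast.Name)})


class Renamer(ast.NodeTransformer):
    """Replace names according to a mapping of old name -> new name."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        new_id = self.mapping.get(node.id, node.id)
        return ast.copy_location(ast.Name(id=new_id, ctx=node.ctx), node)


def rename(expr, mapping):
    """Return a code object for `expr` with its variables renamed."""
    tree = Renamer(mapping).visit(ast.parse(expr, mode="eval"))
    ast.fix_missing_locations(tree)
    return compile(tree, "<renamed>", "eval")
```

[Function names such as `f` in `f(y)` are also `Name` nodes, so `variables` reports them too. This only covers plain Python expressions; the extended process algebra syntax is what requires a custom parser.]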

I used to use a PLY-based parser, based on Andrew Dalke's Python4Ply
(http://www.dalkescientific.com/Python/python4ply.html). But this is
actually hard to maintain because PLY does not support repetitions and
options in the grammar definition.

So, after discovering your PyPgen, I decided to see what I can do with
it. And I'm now quite satisfied since I could get a pgen compiler to
produce parsers, I've also written an asdl compiler to produce AST
classes, and I'm currently fixing an ST to AST translator that is
compatible with Python's ast module.

With these three tools, it should be easy to reimplement my compiler and
it should be much easier to keep up to date with changes to Python's
grammar.

> I'll try to come up with something more concrete that might suit both
> of us. I'm hesitant to encourage a fork in the code, but you've
> already done the work. *smirk*

I just submit patches, you decide what you want to integrate. :-)

> I'm still using 2.5 for the majority of my development work, but I'm
> trying to catch up to 2.6, and apparently 2.7 is in alpha already. If
> I come up with something that is 3.x compatible, you'll know from this
> list.