I've measured the performance penalty of introducing Token objects
instead of using 3-tuples: tokenizing is 1.52 times slower when Token
objects are used. A first attempt to recode class Token using Cython/
Pyrex results in a 1.11 slowdown instead of 1.52 (at the price of
simple portability).
However, I profiled a version with pure Python Token objects and it
appears that building new Token instances consumes only 5% of the
total parsing time. So, a 1.52 penalty on this aspect is probably
acceptable overall.
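For context, instance-creation overhead of this kind can be reproduced with a small timeit sketch (the class below is a hypothetical stand-in, not the actual Token from the patch, and the measured ratio depends on the interpreter):

```python
import timeit

# Hypothetical stand-in for Token: a str subclass carrying position attributes.
class Tok (str) :
    def __new__ (cls, text, srow, scol) :
        self = str.__new__(cls, text)
        self.srow, self.scol = srow, scol
        return self

# Compare building plain 3-tuples against tuples wrapping a Tok instance.
tuple_time = timeit.timeit(lambda: (1, "name", 1), number=100000)
class_time = timeit.timeit(lambda: (1, Tok("name", 1, 0), 1), number=100000)
print("slowdown: %.2fx" % (class_time / tuple_time))
```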
What is your feeling about that?
Should I submit a patch to use class Token?
If yes, would you prefer it as a Python class or as a new built-in?
(I prefer to know if it is worth introducing Token before doing it,
because this involves modifying many lines...)
Thanks for doing the leg work on testing the difference between
lexical token representations!
On Mon, Jan 11, 2010 at 7:52 AM, Franck <franck.p...@gmail.com> wrote:
> I've measured the performance penalty of introducing Token objects
> instead of using 3-tuples: tokenizing is 1.52 times slower when Token
> objects are used. A first attempt to recode class Token using Cython/
> Pyrex results in a 1.11 slowdown instead of 1.52 (at the price of
> simple portability).
> However, I profiled a version with pure Python Token objects and it
> appears that building new Token instances consumes only 5% of the
> total parsing time. So, a 1.52 penalty on this aspect is probably
> acceptable overall.
> What is your feeling about that?
I agree that the penalty isn't so bad. I'm afraid that I went to an
all-tuple implementation without paying too much attention to its
overall benefit. At least we seem to be in agreement that the
concrete parse tree shouldn't use classes.
> Should I submit a patch to use class Token?
> If yes, would you prefer it as a Python class or as a new built-in?
Let me digest the LineList and Tokenizer stuff, and I'll get back to
you. If you are agreeable, I might do the work myself, porting from
your pgen.py. This work shouldn't be too far out of my way since it would
be part of my proposal that the Basil PyPgen application have a switch
for selecting the scanner used by the generated source.
As you like. I don't think there is much left to backport. It's mainly
the introduction of the Token class. (I assume that you will not remove
the intermediary classes/functions that I've dropped; you probably have
them for a good reason of your own.) As a first step, I had just changed
the tokenizer so it produces (int, Token, int) tuples instead of (int,
str, int). This avoids impacting the parser while still transferring
full lexical information to the parse tree.
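For reference, a minimal sketch of what such a Token class could look like (this is an assumption, not the actual patch; the attribute names srow/scol/erow/ecol/filename are taken from the Location docstring below):

```python
# Hypothetical sketch: subclassing str means code that treats the token
# as its text keeps working unchanged, while the position attributes
# travel with it through the parser.
class Token (str) :
    def __new__ (cls, text, srow, scol, erow, ecol, filename) :
        self = str.__new__(cls, text)
        self.srow, self.scol = srow, scol
        self.erow, self.ecol = erow, ecol
        self.filename = filename
        return self

t = Token("def", 1, 0, 1, 3, "example.py")
```

Since Token is a str, existing comparisons like `t == "def"` still hold, which is what keeps the parser unaffected.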
Moreover, I recently introduced a new class Location to aggregate
line/column numbers from a parse sub-tree, and I now use Location
instances instead of None in the internal nodes of parse trees. So, at any
stage of a parse tree traversal, one can provide location information
about the parsed text, which is useful for error messages. :-)
class Location (str) :
    """A position in a parsed file.

    Used to aggregate the positions of all the tokens in a parse
    (sub)tree. The following attributes are available:
     - self.srow: start row
     - self.scol: start column
     - self.erow: end row
     - self.ecol: end column
     - self.filename: input file from which the token comes
    """
    def __new__ (cls, first, last) :
        """Create a new instance.

        - first: the first Token or Location instance in the region
        - last: the last one
        """
        self = str.__new__(cls, "%s[%s:%s-%s:%s]"
                           % (first.filename, first.srow, first.scol,
                              last.erow, last.ecol))
        self.srow, self.scol = first.srow, first.scol
        self.erow, self.ecol = last.erow, last.ecol
        self.filename = first.filename
        return self
def _fix_locations (st) :
    """Replaces None in non-terminal nodes by a Location instance.

    - st: a syntax tree as returned by the parser
    """
    # name the local 'tok' to avoid shadowing the standard token module
    (kind, tok, lineno), children = st
    children = [_fix_locations(c) for c in children]
    if kind >= token.NT_OFFSET :
        # each child is ((kind, tok, lineno), children), so the first/last
        # Token or Location instances are found at child[0][1]
        tok = Location(children[0][0][1], children[-1][0][1])
    return ((kind, tok, lineno), children)
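To illustrate, here is a self-contained sketch exercising this scheme on a tiny hand-built tree (the Token class is a hypothetical stand-in with the attributes Location expects; token.NT_OFFSET and token.NAME come from the standard token module):

```python
import token  # standard module: NT_OFFSET separates terminals from non-terminals

# Hypothetical stand-in for the Token class described in the thread.
class Token (str) :
    def __new__ (cls, text, srow, scol, erow, ecol, filename) :
        self = str.__new__(cls, text)
        self.srow, self.scol = srow, scol
        self.erow, self.ecol = erow, ecol
        self.filename = filename
        return self

class Location (str) :
    """Aggregates the positions of the first and last item in a region."""
    def __new__ (cls, first, last) :
        self = str.__new__(cls, "%s[%s:%s-%s:%s]"
                           % (first.filename, first.srow, first.scol,
                              last.erow, last.ecol))
        self.srow, self.scol = first.srow, first.scol
        self.erow, self.ecol = last.erow, last.ecol
        self.filename = first.filename
        return self

def _fix_locations (st) :
    (kind, tok, lineno), children = st
    children = [_fix_locations(c) for c in children]
    if kind >= token.NT_OFFSET :
        tok = Location(children[0][0][1], children[-1][0][1])
    return ((kind, tok, lineno), children)

# A tiny tree: one non-terminal node over two terminal tokens.
leaf1 = ((token.NAME, Token("x", 1, 0, 1, 1, "ex.py"), 1), [])
leaf2 = ((token.NAME, Token("y", 2, 4, 2, 5, "ex.py"), 2), [])
tree = ((token.NT_OFFSET + 1, None, 1), [leaf1, leaf2])

(kind, loc, lineno), _ = _fix_locations(tree)
print(loc)  # prints ex.py[1:0-2:5]
```

After the pass, the non-terminal's None has been replaced by a Location spanning from the first to the last token, which is exactly what error messages can report during any later traversal.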