I've measured the performance penalty of introducing Token objects
instead of using 3-tuples: tokenizing is 1.52 times slower when Token
objects are used. A first attempt to recode class Token using Cython/
Pyrex results in a 1.11 slowdown instead of 1.52 (at the price of
simple portability).
However, I profiled a version with pure Python Token objects and it
appears that building new Token instances consumes only 5% of the
total parsing time. So, a 1.52 penalty on this aspect is probably
acceptable overall.
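For context, instance-creation overhead of this kind can be reproduced with a small timeit sketch (the class below is a hypothetical stand-in, not the actual Token from the patch, and the measured ratio depends on the interpreter):

```python
import timeit

# Hypothetical stand-in for Token: a str subclass carrying position attributes.
class Tok (str) :
    def __new__ (cls, text, srow, scol) :
        self = str.__new__(cls, text)
        self.srow, self.scol = srow, scol
        return self

# Compare building plain 3-tuples against tuples wrapping a Tok instance.
tuple_time = timeit.timeit(lambda: (1, "name", 1), number=100000)
class_time = timeit.timeit(lambda: (1, Tok("name", 1, 0), 1), number=100000)
print("slowdown: %.2fx" % (class_time / tuple_time))
```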
What is your feeling about that?
Should I submit a patch to use class Token?
If yes, would you prefer it as a Python class or as a new built-in?
(I prefer to know if it is worth introducing Token before doing it,
because this involves modifying many lines...)
Thanks for doing the leg work on testing the difference between
lexical token representations!
On Mon, Jan 11, 2010 at 7:52 AM, Franck <franck.p...@gmail.com> wrote:
> I've measured the performance penalty of introducing Token objects
> instead of using 3-tuples: tokenizing is 1.52 times slower when Token
> objects are used. A first attempt to recode class Token using Cython/
> Pyrex results in a 1.11 slowdown instead of 1.52 (at the price of
> simple portability).
> However, I profiled a version with pure Python Token objects and it
> appears that building new Token instances consumes only 5% of the
> total parsing time. So, a 1.52 penalty on this aspect is probably
> acceptable overall.
> What is your feeling about that?
I agree that the penalty isn't so bad. I'm afraid that I went to an
all-tuple implementation without paying too much attention to its
overall benefit. At least we seem to be in agreement that the
concrete parse tree shouldn't use classes.
> Should I submit a patch to use class Token?
> If yes, would you prefer it as a Python class or as a new built-in?
Let me digest the LineList and Tokenizer stuff, and I'll get back to
you. If you are agreeable, I might do the work myself, porting from
your pgen.py. This work shouldn't be too far out of my way since it would
be part of my proposal that the Basil PyPgen application have a switch
for selecting the scanner used by the generated source.
As you like. I don't think there is much left to backport. It's mainly
the introduction of the Token class. (I assume that you will not remove
the intermediary classes/functions that I've dropped; you probably have
them for a good reason of your own.) As a first step, I had just changed
the tokenizer so it produces (int, Token, int) tuples instead of (int,
str, int). This avoids impacting the parser while still transferring
full lexical information to the parse tree.
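For reference, a minimal sketch of what such a Token class could look like (this is an assumption, not the actual patch; the attribute names srow/scol/erow/ecol/filename are taken from the Location docstring below):

```python
# Hypothetical sketch: subclassing str means code that treats the token
# as its text keeps working unchanged, while the position attributes
# travel with it through the parser.
class Token (str) :
    def __new__ (cls, text, srow, scol, erow, ecol, filename) :
        self = str.__new__(cls, text)
        self.srow, self.scol = srow, scol
        self.erow, self.ecol = erow, ecol
        self.filename = filename
        return self

t = Token("def", 1, 0, 1, 3, "example.py")
```

Since Token is a str, existing comparisons like `t == "def"` still hold, which is what keeps the parser unaffected.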
Moreover, I recently introduced a new class Location to aggregate
line/column numbers from a parse sub-tree, and I now use Location
instances instead of None in the internal nodes of parse trees. So, at any
stage of a parse tree traversal, one can provide location information
about the parsed text, which is useful for error messages. :-)
class Location (str) :
    """A position in a parsed file.

    Used to aggregate the positions of all the tokens in a parse
    (sub)tree. The following attributes are available:
     - self.srow: start row
     - self.scol: start column
     - self.erow: end row
     - self.ecol: end column
     - self.filename: input file from which the token comes
    """
    def __new__ (cls, first, last) :
        """Create a new instance.

        - first: the first Token or Location instance in the region
        - last: the last one
        """
        self = str.__new__(cls, "%s[%s:%s-%s:%s]"
                           % (first.filename, first.srow, first.scol,
                              last.erow, last.ecol))
        self.srow, self.scol = first.srow, first.scol
        self.erow, self.ecol = last.erow, last.ecol
        self.filename = first.filename
        return self
def _fix_locations (st) :
    """Replaces None in non-terminal nodes by a Location instance.

    - st: a syntax tree as returned by the parser
    """
    # name the local 'tok' to avoid shadowing the standard token module
    (kind, tok, lineno), children = st
    children = [_fix_locations(c) for c in children]
    if kind >= token.NT_OFFSET :
        # each child is ((kind, tok, lineno), children), so the first/last
        # Token or Location instances are found at child[0][1]
        tok = Location(children[0][0][1], children[-1][0][1])
    return ((kind, tok, lineno), children)
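To illustrate, here is a self-contained sketch exercising this scheme on a tiny hand-built tree (the Token class is a hypothetical stand-in with the attributes Location expects; token.NT_OFFSET and token.NAME come from the standard token module):

```python
import token  # standard module: NT_OFFSET separates terminals from non-terminals

# Hypothetical stand-in for the Token class described in the thread.
class Token (str) :
    def __new__ (cls, text, srow, scol, erow, ecol, filename) :
        self = str.__new__(cls, text)
        self.srow, self.scol = srow, scol
        self.erow, self.ecol = erow, ecol
        self.filename = filename
        return self

class Location (str) :
    """Aggregates the positions of the first and last item in a region."""
    def __new__ (cls, first, last) :
        self = str.__new__(cls, "%s[%s:%s-%s:%s]"
                           % (first.filename, first.srow, first.scol,
                              last.erow, last.ecol))
        self.srow, self.scol = first.srow, first.scol
        self.erow, self.ecol = last.erow, last.ecol
        self.filename = first.filename
        return self

def _fix_locations (st) :
    (kind, tok, lineno), children = st
    children = [_fix_locations(c) for c in children]
    if kind >= token.NT_OFFSET :
        tok = Location(children[0][0][1], children[-1][0][1])
    return ((kind, tok, lineno), children)

# A tiny tree: one non-terminal node over two terminal tokens.
leaf1 = ((token.NAME, Token("x", 1, 0, 1, 1, "ex.py"), 1), [])
leaf2 = ((token.NAME, Token("y", 2, 4, 2, 5, "ex.py"), 2), [])
tree = ((token.NT_OFFSET + 1, None, 1), [leaf1, leaf2])

(kind, loc, lineno), _ = _fix_locations(tree)
print(loc)  # prints ex.py[1:0-2:5]
```

After the pass, the non-terminal's None has been replaced by a Location spanning from the first to the last token, which is exactly what error messages can report during any later traversal.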