This Engineering Notebook post will be referenced in an upcoming announcement to the
python-dev list.
Executive summary
I have "discovered" a spectacular replacement for Untokenizer.untokenize in python's tokenize library module. The wretched, buggy, and impossible-to-fix add_whitespace method is gone. The new code has no significant 'if' statements, and knows almost nothing about tokens! This is the way untokenize is written in The Book.
The new code should put an end to a
long series of issues against untokenize code in python's
tokenize library module. Some closed issues were blunders arising from dumbing-down the TestRoundtrip.check_roundtrip method in
test_tokenize.py. The docstring says, in part:
When untokenize bugs are fixed, untokenize with 5-tuples should
reproduce code that does not contain a backslash continuation
following spaces. A proper test should test this.
Imo, the way is now clear for proper unit testing of python's Untokenize class.
The new code passes all of python's related unit tests using a proper (rigorous) version of check_roundtrip. The new code also passes a new unit test for
python issue 38663, which fails with python's library code, even with the fudged version of check_roundtrip.
This is a long post. I recommend skimming it, unless you are a python dev interested in understanding why the new code works.
This post also tells how I "discovered" the new code. It is mainly of historical interest.
Background
This post discusses in detail why tokenize.untokenize is important to me. To summarize, a simple, correct untokenizer would form the foundation of token-based tools.
I have written token-based versions of python code beautifiers. The present work started because I wanted to write a token-based version of fstringify, which at present doesn't do anything on much of Leo's code base. Based on deep study of both black and fstringify, it is my strong opinion that python devs underestimate the difficulties of using asts (parse trees) and overestimate the difficulties of using tokens.
At present, Leo's NullTokenBeautifier class (in
leoBeautify.py in the
beautify2 branch) uses a
lightly modified version of the original untokenize as the basis of NullTokenBeautifier.make_input_tokens. It is far from fun, elegant, simple or clear :-) I'll soon rewrite make_input_tokens using the new untokenize code.
First principles
1. Code overrides documentation.
2. Don't believe code comments.
The module-level docstring for tokenize.py says: "tokenize(readline)...is designed to match the working of the Python tokenizer exactly, except that it produces COMMENT tokens for comments and gives type OP for all operators." To my knowledge, this assertion is nowhere justified, much less proven. Let's hope I am missing something.
So tokenize.py is the ground truth, not
tokenizer.c, and certainly not any document.
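The COMMENT/OP half of the docstring's claim is at least easy to observe. A quick stdlib-only check (my own, not from tokenize.py's tests):

```python
import io
import tokenize

# Comments survive as COMMENT tokens, and operators (here '+=')
# come back with the generic type OP.
toks = list(tokenize.generate_tokens(io.StringIO("x += 1  # hi\n").readline))
kinds = [tokenize.tok_name[t.type] for t in toks]
assert 'COMMENT' in kinds
assert any(t.type == tokenize.OP and t.string == '+=' for t in toks)
```

The "matches the C tokenizer exactly" half is the part that remains unproven.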
Breakthrough: understanding row numbers and physical lines
The breakthroughs came from reading tokenize.py, and in particular the _tokenize function.
I was trying to figure out just what the heck row numbers are. Are they 1-based? To what, exactly, do they refer? Neither the docs nor the docstrings are of any help at all. Yeah, I know they are indices. Carefully documenting that fact in the module's docstring is unhelpful :-)
Imagine my surprise when I discovered that the lnum var is set only once, at the start of the main loop. This means that nothing fancy is going on. Row numbers are simply indices into code.splitlines(True)!
In other words: code.splitlines(True) are the so-called physical lines mentioned in the Language reference.
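This is easy to verify with a short script (my own check, not part of the original post): for every single-line token, indexing code.splitlines(True) with the token's 1-based start row recovers the token's text.

```python
import io
import tokenize

code = "x = 1\ny = 2\n"
lines = code.splitlines(True)  # the physical lines, line endings included

# Every single-line token's text appears at [s_col:e_col] of physical
# line s_row, where s_row is a 1-based index into `lines`.
toks = list(tokenize.generate_tokens(io.StringIO(code).readline))
for tok in toks:
    (s_row, s_col), (e_row, e_col) = tok.start, tok.end
    if s_row == e_row and 1 <= s_row <= len(lines):
        assert lines[s_row - 1][s_col:e_col] == tok.string
```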
Collapsing untokenize
Armed with this new understanding, I wrote a dump_range method. This was supposed to recover token text from 5-tuples. It is hardly elegant, but unlike add_whitespace it actually has a chance of working:
def dump_range(contents, start, end):
    lines = contents.splitlines(True)
    result = []
    s_row, s_col = start
    e_row, e_col = end
    if s_row == e_row == 0:
        return ''
    if s_row > len(lines):
        return ''
    col1 = s_col
    row = s_row
    if s_row == e_row:
        line = lines[row-1]
        return line[col1:e_col]
    # More than one line.
    while row <= e_row:
        line = lines[row-1]
        col2 = e_col if row == e_row else len(line)
        part = line[col1:col2]
        result.append(part)
        col1 = 0
        row += 1
    return ''.join(result)
This code may be buggy, but that's moot because...
Breakthrough: scanning is never needed
At some point I realized that the code above is needlessly complex. If row numbers are indices into contents.splitlines, then we can convert row/column numbers directly to character offsets! All we need is an array of indices to the start of each row in contents.splitlines.
Traces had shown me that row zero is a special case, and that the first "real" token might have a row-number of 1, not zero. We can handle that without effort by saying that line zero has zero length. This resolves the confusion about indexing. Indices are zero-based, but line zero has zero length.
So the code to compute indices is:
# Create the physical lines.
self.lines = self.contents.splitlines(True)
# Create the list of character offsets of the start of each physical line.
last_offset, self.offsets = 0, [0]
for line in self.lines:
    last_offset += len(line)
    self.offsets.append(last_offset)
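For example, here is the same computation on a two-line string, in standalone form (the variable names are mine):

```python
contents = "a = 1\nb = 22\n"
lines = contents.splitlines(True)  # ['a = 1\n', 'b = 22\n']

# offsets[r-1] is the character offset of the start of physical line r.
# The leading 0 is the zero-length "line zero" mentioned above.
last_offset, offsets = 0, [0]
for line in lines:
    last_offset += len(line)
    offsets.append(last_offset)

assert offsets == [0, 6, 13]
# The token 'b' starts at row 2, col 0, so its offset into contents is:
assert offsets[2 - 1] + 0 == 6
assert contents[6] == 'b'
```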
Given this offset array, it is trivial to discover the actual text of a token, and any between-token whitespace:
# Unpack.
tok_type, val, start, end, line = token
s_row, s_col = start
e_row, e_col = end
kind = token_module.tok_name[tok_type].lower()
# Calculate the token's start/end offsets: character offsets into contents.
s_offset = self.offsets[max(0, s_row-1)] + s_col
e_offset = self.offsets[max(0, e_row-1)] + e_col
# Add any preceding between-token whitespace.
ws = self.contents[self.prev_offset:s_offset]
if ws:
    self.results.append(ws)
# Add the token, if it contributes any real text.
tok_s = self.contents[s_offset:e_offset]
if tok_s:
    self.results.append(tok_s)
# Update the ending offset.
self.prev_offset = e_offset
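Putting the offsets and the slicing together in a standalone sketch (mine; it starts prev_offset at 0 rather than -1, which comes to the same thing here), a round trip over a source string with extra between-token whitespace reproduces it exactly:

```python
import io
import tokenize

contents = "x  =  1\n"
lines = contents.splitlines(True)
last_offset, offsets = 0, [0]
for line in lines:
    last_offset += len(line)
    offsets.append(last_offset)

results, prev_offset = [], 0
for tok in tokenize.generate_tokens(io.StringIO(contents).readline):
    (s_row, s_col), (e_row, e_col) = tok.start, tok.end
    s_offset = offsets[max(0, s_row - 1)] + s_col
    e_offset = offsets[max(0, e_row - 1)] + e_col
    ws = contents[prev_offset:s_offset]
    if ws:
        results.append(ws)     # between-token whitespace
    tok_s = contents[s_offset:e_offset]
    if tok_s:
        results.append(tok_s)  # the token's actual text
    prev_offset = e_offset

assert ''.join(results) == contents
```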
The Post Script shows the complete code, giving the context for the snippet above.
Summary
The new untokenize code is elegant, fast, sound, and easy to understand.
The code knows nothing about tokens themselves, only about token indices :-)
The way is now clear for proper unit testing of python's Untokenize class.
Edward
class Untokenize:

    def __init__(self, contents, trace=False):
        self.contents = contents  # A unicode string.
        self.trace = trace

    def untokenize(self, tokens):
        # Create the physical lines.
        self.lines = self.contents.splitlines(True)
        # Create the list of character offsets of the start of each physical line.
        last_offset, self.offsets = 0, [0]
        for line in self.lines:
            last_offset += len(line)
            self.offsets.append(last_offset)
        # Trace lines & offsets.
        self.show_header()
        # Handle each token, appending tokens and between-token whitespace to results.
        self.prev_offset, self.results = -1, []
        for token in tokens:
            self.do_token(token)
        # Print results when tracing.
        self.show_results()
        # Return the concatenated results.
        return ''.join(self.results)

    def do_token(self, token):
        """Handle the given token, including between-token whitespace."""

        def show_tuple(aTuple):
            s = f"{aTuple[0]}..{aTuple[1]}"
            return f"{s:8}"

        # Unpack.
        tok_type, val, start, end, line = token
        s_row, s_col = start
        e_row, e_col = end
        kind = token_module.tok_name[tok_type].lower()
        # Calculate the token's start/end offsets: character offsets into contents.
        s_offset = self.offsets[max(0, s_row-1)] + s_col
        e_offset = self.offsets[max(0, e_row-1)] + e_col
        # Add any preceding between-token whitespace.
        ws = self.contents[self.prev_offset:s_offset]
        if ws:
            self.results.append(ws)
            if self.trace:
                print(
                    f"{'ws':>10} {ws!r:20} "
                    f"{show_tuple((self.prev_offset, s_offset)):>26} "
                    f"{ws!r}")
        # Add the token, if it contributes any real text.
        tok_s = self.contents[s_offset:e_offset]
        if tok_s:
            self.results.append(tok_s)
            if self.trace:
                print(
                    f"{kind:>10} {val!r:20} "
                    f"{show_tuple(start)} {show_tuple(end)} {show_tuple((s_offset, e_offset))} "
                    f"{tok_s!r:15} {line!r}")
        # Update the ending offset.
        self.prev_offset = e_offset
Typical driver code (from within Leo) would be something like:
import io
import tokenize
import imp
import leo.core.leoBeautify as leoBeautify
imp.reload(leoBeautify)
contents = r'''print ( 'aa \
bb')
print('xx \
yy')
'''
tokens = tokenize.tokenize(io.BytesIO(contents.encode('utf-8')).readline)
results = leoBeautify.Untokenize(contents, trace=True).untokenize(tokens)
if results != contents:
    print('FAIL')
EKR