ENB: A much better untokenizer


Edward K. Ream

Nov 3, 2019, 4:20:27 AM
to leo-editor
This Engineering Notebook post will be referenced in an upcoming announcement to the python-dev list.

Executive summary

I have "discovered" a spectacular replacement for Untokenizer.untokenize in python's tokenize library module. The wretched, buggy, and impossible-to-fix add_whitespace method is gone. The new code has no significant 'if' statements, and knows almost nothing about tokens!  This is the way untokenize is written in The Book.

The new code should put an end to a long series of issues against untokenize code in python's tokenize library module.  Some closed issues were blunders arising from dumbing-down the TestRoundtrip.check_roundtrip method in test_tokenize.py.  The docstring says, in part:

When untokenize bugs are fixed, untokenize with 5-tuples should
reproduce code that does not contain a backslash continuation
following spaces.  A proper test should test this.

Imo, the way is now clear for proper unit testing of python's Untokenize class.

The new code passes all of python's related unit tests, using a proper (rigorous) version of check_roundtrip. The new code also passes a new unit test for python issue 38663, which fails with python's library code, even with the fudged version of check_roundtrip.

This is a long post.  I recommend skimming it, unless you are a python dev interested in understanding why the new code works.

This post also tells how I "discovered" the new code.  It is mainly of historical interest.

Background

This post discusses in detail why tokenize.untokenize is important to me. To summarize, a simple, correct untokenizer would form the foundation of token-based tools.

I have written token-based versions of python code beautifiers.  The present work started because I wanted to write a token-based version of fstringify, which at present doesn't do anything on much of Leo's code base. Based on deep study of both black and fstringify, it is my strong opinion that python devs underestimate the difficulties of using asts (parse trees) and overestimate the difficulties of using tokens.

At present, Leo's NullTokenBeautifier class (in leoBeautify.py in the beautify2 branch) uses a lightly modified version of the original untokenize as the basis of NullTokenBeautifier.make_input_tokens. It is far from fun, elegant, simple or clear :-)  I'll soon rewrite make_input_tokens using the new untokenize code.

First principles

1. Code overrides documentation.


2. Don't believe code comments.

The module-level docstring for tokenize.py says: "tokenize(readline)...is designed to match the working of the Python tokenizer exactly, except that it produces COMMENT tokens for comments and gives type OP for all operators." To my knowledge, this assertion is nowhere justified, much less proven.  Let's hope I am missing something.

So tokenize.py is the ground truth, not tokenizer.c, and certainly not any document.

Breakthrough: understanding row numbers and physical lines

The breakthroughs came from reading tokenize.py, and in particular the _tokenize function.

I was trying to figure out just what the heck row numbers are.  Are they 1-based?  To what, exactly, do they refer? Neither the docs nor the docstrings are of any help at all.  Yeah, I know they are indices.  Carefully documenting that fact in the module's docstring is unhelpful :-)

Imagine my surprise when I discovered that the lnum var is set only once, at the start of the main loop.  This means that nothing fancy is going on.  Row numbers are simply indices into code.splitlines(True)!

In other words: code.splitlines(True) are the so-called physical lines mentioned in the Language reference.
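
Here is a quick standalone check of that claim (my sketch, not from the original code):

import io
import tokenize

# Check: a token's row numbers are 1-based indices into the
# physical lines returned by code.splitlines(True).
code = "x = 1\ny = 2\n"
lines = code.splitlines(True)   # ['x = 1\n', 'y = 2\n']
for tok in tokenize.tokenize(io.BytesIO(code.encode('utf-8')).readline):
    s_row, s_col = tok.start
    if 0 < s_row <= len(lines):
        # tok.line is exactly the physical line containing the token.
        assert tok.line == lines[s_row - 1]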

Collapsing untokenize

Armed with this new understanding, I wrote a dump_range method.  This was supposed to recover token text from 5-tuples. It is hardly elegant, but unlike add_whitespace it actually has a chance of working:

def dump_range(contents, start, end):
    lines = contents.splitlines(True)
    result = []
    s_row, s_col = start
    e_row, e_col = end
    if s_row == e_row == 0:
        return ''
    if s_row > len(lines):
        return ''
    col1 = s_col
    row = s_row
    if s_row == e_row:
        line = lines[row-1]
        return line[col1:e_col]
    # More than one line.
    while row <= e_row:
        line = lines[row-1]
        col2 = e_col if row == e_row else len(line)
        part = line[col1:col2]
        result.append(part)
        col1 = 0
        row += 1
    return ''.join(result)

This code may be buggy, but that's moot because...

Breakthrough: scanning is never needed

At some point I realized that the code above is needlessly complex.  If row numbers are indices into contents.splitlines, then we can convert row/column numbers to character offsets directly!  All we need is an array of character offsets of the start of each row in contents.splitlines.

Traces had shown me that row zero is a special case, and that the first "real" token might have a row-number of 1, not zero.  We can handle that without effort by saying that line zero has zero length.  This resolves the confusion about indexing.  Indices are zero-based, but line zero has zero length.

So the code to compute indices is:

# Create the physical lines.
self.lines = self.contents.splitlines(True)
# Create the list of character offsets of the start of each physical line.
last_offset, self.offsets = 0, [0]
for line in self.lines:
    last_offset += len(line)
    self.offsets.append(last_offset)
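
For concreteness, a tiny worked example (mine, not from leoBeautify.py) shows what the offsets array looks like and how a (row, col) pair becomes a character offset:

contents = "a = 1\nb = 22\n"
lines = contents.splitlines(True)        # ['a = 1\n', 'b = 22\n']
last_offset, offsets = 0, [0]
for line in lines:
    last_offset += len(line)
    offsets.append(last_offset)
assert offsets == [0, 6, 13]
# A token with start=(2, 0) and end=(2, 1) is the 'b' in contents[6:7].
# Rows 0 and 1 both map to offset 0: "line zero has zero length".
s_offset = offsets[max(0, 2 - 1)] + 0    # 6
e_offset = offsets[max(0, 2 - 1)] + 1    # 7
assert contents[s_offset:e_offset] == 'b'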

Given this offset array, it is trivial to discover the actual text of a token, and any between-token whitespace:

# Unpack..
tok_type, val, start, end, line = token
s_row, s_col = start
e_row, e_col = end
kind = token_module.tok_name[tok_type].lower()
# Calculate the token's start/end offsets: character offsets into contents.
s_offset = self.offsets[max(0, s_row-1)] + s_col
e_offset = self.offsets[max(0, e_row-1)] + e_col
# Add any preceding between-token whitespace.
ws = self.contents[self.prev_offset:s_offset]
if ws:
    self.results.append(ws)
# Add the token, if it contributes any real text.
tok_s = self.contents[s_offset:e_offset]
if tok_s:
    self.results.append(tok_s)
# Update the ending offset.
self.prev_offset = e_offset

The Post Script shows the complete code, giving the context for the snippet above.
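
For readers who want to try the idea without Leo, here is a self-contained, function-style sketch of the same technique (my illustration; the helper name untokenize_by_offsets is not from leoBeautify.py). It also shows why the new unit test for issue 38663 passes: the backslash continuation comes back verbatim as between-token text.

import io
import tokenize

def untokenize_by_offsets(contents, tokens):
    """Recover the original source purely from row/col coordinates."""
    lines = contents.splitlines(True)
    last_offset, offsets = 0, [0]
    for line in lines:
        last_offset += len(line)
        offsets.append(last_offset)
    results, prev_offset = [], 0
    for tok in tokens:
        s_row, s_col = tok.start
        e_row, e_col = tok.end
        s_offset = offsets[max(0, s_row - 1)] + s_col
        e_offset = offsets[max(0, e_row - 1)] + e_col
        results.append(contents[prev_offset:s_offset])   # between-token text
        results.append(contents[s_offset:e_offset])      # the token itself
        prev_offset = e_offset
    return ''.join(results)

# The backslash continuation from issue 38663 survives untouched.
contents = "print \\\n    ('abc')\n"
tokens = tokenize.tokenize(io.BytesIO(contents.encode('utf-8')).readline)
assert untokenize_by_offsets(contents, tokens) == contents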

Summary

The new untokenize code is elegant, fast, sound, and easy to understand.

The code knows nothing about tokens themselves, only about token indices :-)

The way is now clear for proper unit testing of python's Untokenize class.

Edward

P. S. Here is Leo's present Untokenize class, in leoBeautify.py:

class Untokenize:

    def __init__(self, contents, trace=False):
        self.contents = contents  # A unicode string.
        self.trace = trace

    def untokenize(self, tokens):
        # Create the physical lines.
        self.lines = self.contents.splitlines(True)
        # Create the list of character offsets of the start of each physical line.
        last_offset, self.offsets = 0, [0]
        for line in self.lines:
            last_offset += len(line)
            self.offsets.append(last_offset)
        # Trace lines & offsets.
        self.show_header()
        # Handle each token, appending tokens and between-token whitespace to results.
        self.prev_offset, self.results = -1, []
        for token in tokens:
            self.do_token(token)
        # Print results when tracing.
        self.show_results()
        # Return the concatenated results.
        return ''.join(self.results)

    def do_token(self, token):
        """Handle the given token, including between-token whitespace"""

        def show_tuple(aTuple):
            s = f"{aTuple[0]}..{aTuple[1]}"
            return f"{s:8}"

        # Unpack..
        tok_type, val, start, end, line = token
        s_row, s_col = start
        e_row, e_col = end
        kind = token_module.tok_name[tok_type].lower()
        # Calculate the token's start/end offsets: character offsets into contents.
        s_offset = self.offsets[max(0, s_row-1)] + s_col
        e_offset = self.offsets[max(0, e_row-1)] + e_col
        # Add any preceding between-token whitespace.
        ws = self.contents[self.prev_offset:s_offset]
        if ws:
            self.results.append(ws)
            if self.trace:
                print(
                    f"{'ws':>10} {ws!r:20} "
                    f"{show_tuple((self.prev_offset, s_offset)):>26} "
                    f"{ws!r}")
        # Add the token, if it contributes any real text.
        tok_s = self.contents[s_offset:e_offset]
        if tok_s:
            self.results.append(tok_s)
        if self.trace:
            print(
                f"{kind:>10} {val!r:20} "
                f"{show_tuple(start)} {show_tuple(end)} {show_tuple((s_offset, e_offset))} "
                f"{tok_s!r:15} {line!r}")
        # Update the ending offset.
        self.prev_offset = e_offset
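
The class calls self.show_header and self.show_results, which are not shown above. They are trace-only helpers; something along these lines (my guess at their shape, not the actual leoBeautify.py code) is enough to run the class:

    def show_header(self):
        """Trace the physical lines and their offsets."""
        if self.trace:
            print(f"lines: {len(self.lines)}, offsets: {self.offsets}")

    def show_results(self):
        """Trace the accumulated results."""
        if self.trace:
            print('results:', ''.join(self.results))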

Typical driver code (from within Leo) would be something like:

import io
import tokenize
import imp
import leo.core.leoBeautify as leoBeautify
imp.reload(leoBeautify)

contents = r'''print ( 'aa \
bb')
print('xx \
yy')
'''

tokens = tokenize.tokenize(io.BytesIO(contents.encode('utf-8')).readline)
results = leoBeautify.Untokenize(contents, trace=True).untokenize(tokens)
if results != contents:
    print('FAIL')

EKR

Edward K. Ream

Nov 3, 2019, 8:07:18 AM
to leo-editor
On Sunday, November 3, 2019 at 3:20:27 AM UTC-6, Edward K. Ream wrote:

> The new code should put an end to a long series of issues against untokenize code in python's tokenize library module. 

A few more words about the testing that I have done. 

The python tests in test_tokenize.py are quite rightly careful about unicode. My test code takes similar care.
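
For example, a round trip over source containing non-ASCII text exercises the encode/decode path end to end. A minimal sketch using only the standard library (this simple case round-trips exactly even with python's current untokenize):

import io
import tokenize

# Source containing non-ASCII text; tokenize wants a bytes readline.
code = "s = 'héllo wörld'\n"
tokens = list(tokenize.tokenize(io.BytesIO(code.encode('utf-8')).readline))
# untokenize returns bytes, encoded per the leading ENCODING token.
result = tokenize.untokenize(tokens).decode('utf-8')
assert result == code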

leoTest.leo contains unit tests from test_tokenize.py, adapted to run within Leo. leoTest.leo contains this new unit test:

# Test https://bugs.python.org/issue38663.
import leo.core.leoBeautify as leoBeautify
check_roundtrip = leoBeautify.check_roundtrip

check_roundtrip(
    "print \\\n"
    "    ('abc')\n",
    expect_failure=True)

Something similar should be added to test_tokenize.py.

Here is the testing code from leoBeautify.py.  As you will see, the code runs stricter tests than those in test_tokenize.py:

import unittest

def check_roundtrip(f, expect_failure=False):
    """
    Called from unit tests in unitTest.leo.

    Test python's tokenize.untokenize method and Leo's Untokenize class.
    """
    check_python_roundtrip(f, expect_failure)
    check_leo_roundtrip(f)

def check_leo_roundtrip(code, trace=False):
    """Check Leo's Untokenize class"""
    # pylint: disable=import-self
    import leo.core.leoBeautify as leoBeautify
    assert isinstance(code, str), repr(code)
    tokens = tokenize.tokenize(io.BytesIO(code.encode('utf-8')).readline)
    u = leoBeautify.Untokenize(code, trace=trace)
    results = u.untokenize(tokens)
    unittest.TestCase().assertEqual(code, results)

def check_python_roundtrip(f, expect_failure):
    """
    This is tokenize.TestRoundtrip.check_roundtrip, without the wretched fudges.
    """
    if isinstance(f, str):
        code = f.encode('utf-8')
    else:
        code = f.read()
        f.close()
    readline = iter(code.splitlines(keepends=True)).__next__
    tokens = list(tokenize.tokenize(readline))
    bytes = tokenize.untokenize(tokens)
    readline5 = iter(bytes.splitlines(keepends=True)).__next__
    result_tokens = list(tokenize.tokenize(readline5))
    if expect_failure:
        unittest.TestCase().assertNotEqual(result_tokens, tokens)
    else:
        unittest.TestCase().assertEqual(result_tokens, tokens)

Summary

I've done the heavy lifting on issue 38663. Python devs should handle the details of testing and packaging.

Leo's tokenizing code in leoBeautify.py can use the new code immediately, without waiting for python to improve tokenize.untokenize.

Edward