ENB: About tokens and related commands


Edward K. Ream

Nov 1, 2019, 9:33:14 AM
to leo-editor
I started this Engineering Notebook post to clarify my thinking. It has shown me the way forward.

Part of it has turned into a Theory of Operation for Leo's token-based code. In the new documentation framework, this is a Discussion.  That is, it is oriented towards understanding.

This post explores picky issues related to python tokens. Feel free to skip any or all of it.

Background

Imo, python devs are biased in favor of parse trees for programs that manipulate text.  The "real" black and fstringify tools would be significantly simpler, clearer, and faster if they used python's tokenize module instead of python's ast module.  Leo's own black and fstringify commands prove my contention to my satisfaction.

I would like to "go public" with my opinion.  This opinion will be controversial, so I want to make the strongest possible case. I need to prove that handling tokens can be done simply and correctly in all cases.  This is a big ask, because python's tokens are complicated.  See the Lexical Analysis section of the Python Language Reference.

The beautify2 branch is intended to provide the required proof.

Strategy

The beautify2 branch bases all token handling on the untokenize function in python's tokenize module.

Given the stream of tokens (5-tuples) that tokenize.generate_tokens produces from code, untokenize reproduces code, the original source code. This round-tripping property of untokenize is the basis for the required proof.
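Here is a minimal demonstration of the round trip, using only the standard library. (Note that generate_tokens takes a readline callable rather than a raw string.)

import io
import tokenize

code = "a = b  +  1  # odd spacing survives the round trip\n"
tokens = tokenize.generate_tokens(io.StringIO(code).readline)  # 5-tuples
assert tokenize.untokenize(tokens) == code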

Recreating the source code within each token is straightforward. The hard part is recreating the between-token whitespace. That is exactly what tokenize.untokenize is guaranteed to do!

So the strategy is simple.  All commands will create input tokens based on tokenize.untokenize. This will guarantee that token handling is sound, that is, that the list of input tokens will contain exactly the correct token data.

Classes

The beautify2 branch defines several classes that use tokens. Each class does the following:

1. Creates a list (not an iterator) of input tokens.  Using real lists allows lookahead, which is impossible with iterators.

2. Defines one or more input token handlers. Handlers produce zero or more output tokens.
 
A straightforward concatenation of all output tokens produces the result of each command.
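Before listing the actual classes, here is a hedged sketch of that pattern. The class, method, and attribute names below are illustrative stand-ins, not the actual beautify2 code.

from dataclasses import dataclass

@dataclass
class Tok:
    kind: str   # e.g. 'name', 'op', 'string', 'ws'
    value: str  # the token's text

class TokenHandlerSketch:
    # Illustrative only: consume a list of input tokens, emit output tokens.

    def beautify(self, input_tokens):
        self.results = []            # the output tokens
        tokens = list(input_tokens)  # a real list, so handlers can look ahead
        for i, token in enumerate(tokens):
            handler = getattr(self, 'do_' + token.kind, self.do_default)
            handler(i, token, tokens)  # a handler may inspect tokens[i+1:]
        # A straightforward concatenation of the output tokens is the result.
        return ''.join(t.value for t in self.results)

    def do_default(self, i, token, tokens):
        self.results.append(token)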

Here are the actual classes:

- class BeautifierToken:
  Input and output tokens are instances of this class.

- class NullTokenBeautifier:
  The base class for actual commands. This class is the natural place to test round-tripping.

- class FStringifyTokens(NullTokenBeautifier):
  Implements Leo's token-based fstringify commands. It defines a handler only for string input tokens.

- class PythonTokenBeautifier(NullTokenBeautifier):
  Implements Leo's token-based beautify commands. It defines handlers for all input tokens.

Tokenizing and token hooks

NullTokenBeautifier.make_input_tokens creates a list of input tokens from a sequence of 5-tuples produced by tokenize.generate_tokens.  There is no need for subclasses to override make_input_tokens, because...

make_input_tokens is exactly the same as tokenize.untokenize, except that it calls token hooks in various places.  These hooks allow subclasses to modify the tokens returned by make_input_tokens. The null hooks (in NullTokenBeautifier) make make_input_tokens work exactly the same as tokenize.untokenize.

This scheme is the simplest thing that could possibly work. Subclasses may adjust the input tokens to make token handling easier:

1. The null hooks create pseudo "ws" tokens (in the proper places!) that carry the between-token whitespace. Nothing more needs to be done!

2. Extra "ws" tokens would complicate token-related parsing in the FStringifyTokens and PythonTokenBeautifier classes. Instead, the token hooks in these two classes "piggyback" between-token whitespace on already-created tokens. It's quite simple. See the token_hook methods of these two classes. Note that these two hooks are similar, but not exactly the same.
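A hedged sketch of the scheme, reusing the illustrative Tok class from the sketch above. The hook signature is an assumption, and the whitespace computation is deliberately simplified (it ignores line breaks and continuations).

import io
import tokenize

class NullSketch:
    # Illustrative only: an untokenize-style loop that delegates the
    # between-token whitespace to an overridable hook.
    def make_input_tokens(self, code):
        self.tokens, prev_row, prev_col = [], 1, 0
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            (srow, scol), (erow, ecol) = tok.start, tok.end
            ws = ' ' * (scol - prev_col) if srow == prev_row else ''
            self.token_hook(ws)  # hand the between-token gap to the hook
            self.tokens.append(Tok(tokenize.tok_name[tok.type].lower(), tok.string))
            prev_row, prev_col = erow, ecol
        return self.tokens

    def token_hook(self, ws):
        # Null hook: carry the whitespace as a pseudo 'ws' token.
        if ws:
            self.tokens.append(Tok('ws', ws))

class PiggybackSketch(NullSketch):
    def token_hook(self, ws):
        # Subclass hook: piggyback the whitespace on the previous token.
        if ws and self.tokens:
            self.tokens[-1].value += ws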

Alas, there is an itsy bitsy problem...

A bug in untokenize

Alas, tokenize.untokenize does not properly "round trip" this valid python program:

print \
   
("abc")

The result is:

print\
   
("abc")

The whitespace before the backslash is not preserved.
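For the record, the failure is easy to see from the interpreter with a slightly simplified variant of the program above:

import io
import tokenize

source = 'print \\\n("abc")\n'   # a space precedes the backslash
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
print(repr(tokenize.untokenize(tokens)))  # 'print\\\n("abc")\n' -- the space is gone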

Does the bug matter?

I have given this question considerable thought.  It's a case of theory vs practice.

In practice, this bug doesn't matter:

1. The odds of a programmer writing the code above are small.  Crucially, backslash-newlines within strings are always handled correctly.

2. Even if the buggy case did happen, Leo's beautify and fstringify commands would carry on without incident.

3. It's highly unlikely that anyone would complain about the diffs.

4. The bug could even be called a feature :-)

In theory, this bug is much more troubling.  I want to argue publicly that:

1. Basing token-based tools on tokenize.untokenize is absolutely sound.  Alas, because of this bug, it is not.

2. tokenize.untokenize is easy to understand.  Alas, it is not.

untokenize's helper, Untokenizer.add_whitespace, is a faux-clever hack. After hours of study and tracing, I see no obvious way to fix the bug.
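For anyone who wants to look, here is a paraphrase of the relevant logic in Untokenizer.add_whitespace. It is a sketch, not the exact CPython source, but it shows where the space goes missing.

def add_whitespace(self, start):
    row, col = start
    row_offset = row - self.prev_row
    if row_offset:
        # A row change is rebuilt as bare backslash-newlines, so any
        # spaces that preceded the original backslash are never restored.
        self.tokens.append('\\\n' * row_offset)
        self.prev_col = 0
    col_offset = col - self.prev_col
    if col_offset:
        self.tokens.append(' ' * col_offset)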

Summary

This post contains a high-level theory of operation for flexible token-based classes.

The code in the beautify2 branch has much to recommend it:

- It is demonstrably as sound as tokenize.untokenize, a big advance over previous code.
- It could easily serve as the basis for a public exhortation to base text-based tools on tokens, not parse trees.

For now, I'll ignore the bug, except for filing a Python bug report and asking for guidance about fixing it.

Edward

Edward K. Ream

Nov 1, 2019, 11:43:35 AM
to leo-editor
On Friday, November 1, 2019 at 8:33:14 AM UTC-5, Edward K. Ream wrote:

> For now, I'll ignore the bug, except for filing a Python bug report and asking for guidance about fixing it.

There's a better way, which is to rewrite null_tok_b.make_input_tokens, using the legacy PythonTokenBeautifier.run as a template.

I didn't think of doing this before, because PythonTokenBeautifier.run contains code idiosyncratic to that class.  But aha: I can encapsulate all class-specific code in one or more new token hooks! The way is open for rewriting untokenize. This is worth doing.

The next step will be to devise an easy way of running python's relevant unit tests, in test_tokenize.py, on the new code.

Edward

Edward K. Ream

Nov 1, 2019, 12:51:47 PM
to leo-editor
On Friday, November 1, 2019 at 10:43:35 AM UTC-5, Edward K. Ream wrote:

> The next step will be to devise an easy way of running python's relevant unit tests, in test_tokenize.py, on the new code.

The normal Anaconda distro doesn't include much in the lib/test folder, so I downloaded the tests from here.

I created a file, test_ekr_untokenize.py, that includes only the untokenize-related tests from test_tokenize.py.

I then added a test that should fail.

Imagine my surprise when it didn't. Here's why:

def check_roundtrip(self, f):
    """
    Test roundtrip for `untokenize`. `f` is an open file or a string.
    The source code in f is tokenized to both 5- and 2-tuples.
    Both sequences are converted back to source code via
    tokenize.untokenize(), and the latter tokenized again to 2-tuples.
    The test fails if the 3 pair tokenizations do not match.

    When untokenize bugs are fixed, untokenize with 5-tuples should
    reproduce code that does not contain a backslash continuation
    following spaces.  A proper test should test this.
    """

Well, this is just dandy.  What is the point of unit tests if they are fudged?  I'll have to think on this...

Edward

Edward K. Ream

Nov 1, 2019, 1:37:42 PM
to leo-editor
On Friday, November 1, 2019 at 11:51:47 AM UTC-5, Edward K. Ream wrote:

> What is the point of unit tests if they are fudged?  I'll have to think on this...

Wow.  Here is the list of open issues against untokenize.  I have just created Python issue 38663.

Edward

rengel

Nov 3, 2019, 3:04:40 AM
to leo-editor
I doubt if they can reproduce your example in Python issue 38663, because it contains Leo's global 'g'.

Reinhard

Edward K. Ream

Nov 3, 2019, 8:09:28 AM
to leo-editor
On Sun, Nov 3, 2019 at 2:04 AM rengel <reinhard...@gmail.com> wrote:

> I doubt if they can reproduce your example in Python issue 38663, because it contains Leo's global 'g'.

Good catch.  I'll update the issue.

Edward