I started this Engineering Notebook post to clarify my thinking. It has shown me the way forward.
Part of it has turned into a Theory of Operation for Leo's token-based code. In the new
documentation framework, this is a Discussion. That is, it is oriented towards understanding.
This post explores picky issues related to Python tokens. Feel free to skip any or all of it.
Background
In my opinion, Python devs are biased in favor of parse trees in programs involving text manipulation. The "real" black and fstringify tools would be significantly simpler, clearer, and faster if they used Python's tokenize module instead of its ast module. Leo's own black and fstringify commands prove this contention to my satisfaction.
I would like to "go public" with my opinion. This opinion will be controversial, so I want to make the strongest possible case. I need to
prove that handling tokens can be done simply and correctly in
all cases. This is a big ask, because python's tokens are complicated. See the
Lexical Analysis section of the
Python Language Reference.
The beautify2 branch is intended to provide the required proof.
Strategy
The beautify2 branch bases all token handling on the untokenize function in python's tokenize module.
Given the stream of tokens (5-tuples) that tokenize.generate_tokens produces for some source code, untokenize reproduces that source exactly. This round-tripping property of untokenize is the basis for the required proof.
Recreating the source code within each token is straightforward. The hard part is recreating the between-token whitespace. That is exactly what tokenize.untokenize is guaranteed to do!
So the strategy is simple. All commands will create input tokens based on tokenize.untokenize. This will guarantee that token handling is sound, that is, that the list of input tokens will contain exactly the correct token data.
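Here is a minimal sketch (not Leo's code) of the round-trip property the strategy relies on:

```python
import io
import tokenize

def round_trips(code: str) -> bool:
    """Return True if tokenize.untokenize exactly reproduces `code`."""
    tokens = tokenize.generate_tokens(io.StringIO(code).readline)
    return tokenize.untokenize(tokens) == code

# For ordinary code the round trip is exact.
assert round_trips("def spam(a, b):\n    return a + b\n")
```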
Classes
The beautify2 branch defines several classes that use tokens. Each class does the following:
1. Creates a list (not an iterator) of input tokens. Using real lists allows lookahead, which plain iterators do not support.
2. Defines one or more input token handlers. Handlers produce zero or more output tokens.
A straightforward concatenation of all output tokens produces the result of each command.
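Roughly, the pattern looks like this. The names below (Tok, Command, handle) are placeholders, not the branch's actual API:

```python
from typing import List, NamedTuple

class Tok(NamedTuple):
    """A stand-in for class BeautifierToken (hypothetical fields)."""
    kind: str
    value: str

class Command:
    """A stand-in for the dispatch-and-concatenate pattern described above."""

    def run(self, input_tokens: List[Tok]) -> str:
        out: List[Tok] = []
        for i, token in enumerate(input_tokens):
            # Each handler sees the *list* of input tokens, so it can look
            # ahead, and it returns zero or more output tokens.
            out.extend(self.handle(i, token, input_tokens))
        # The command's result is just the concatenation of all output tokens.
        return ''.join(t.value for t in out)

    def handle(self, i: int, token: Tok, tokens: List[Tok]) -> List[Tok]:
        # Null handler: copy the token through unchanged. Subclasses override
        # handlers to rewrite, drop, or insert tokens.
        return [token]

# Usage: the null command simply reassembles its input.
print(Command().run([Tok('name', 'print'), Tok('op', '('),
                     Tok('string', "'hi'"), Tok('op', ')')]))  # -> print('hi')
```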
Here are the actual classes:
- class BeautifierToken:
Input and output tokens are instances of this class.
- class NullTokenBeautifier:
The base class for actual commands. This class is the natural place to test round-tripping.
- class FStringifyTokens(NullTokenBeautifier):
Implements Leo's token-based fstringify commands. It defines a handler only for string input tokens.
- class PythonTokenBeautifier(NullTokenBeautifier):
Implements Leo's token-based beautify commands. It defines handlers for all input tokens.
Tokenizing and token hooks
NullTokenBeautifier.make_input_tokens creates a list of input tokens from a sequence of 5-tuples produced by tokenize.generate_tokens. There is no need for subclasses to override make_input_tokens, because...
make_input_tokens is exactly the same as tokenize.untokenize, except that it calls token hooks in various places. These hooks allow subclasses to modify the tokens returned by make_input_tokens. The null hooks (in NullTokenBeautifier) make make_input_tokens work exactly the same as tokenize.untokenize.
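Here is a rough sketch of what I mean. Only the names make_input_tokens and token_hook come from the description above; the hook's signature, the whitespace computation, and everything else are my own simplifications:

```python
import io
import tokenize

class NullTokenBeautifier:  # sketch only: the real class handles many more cases
    def make_input_tokens(self, code):
        """Walk generate_tokens' 5-tuples the way untokenize does,
        calling a token hook instead of appending strings directly."""
        self.results = []
        prev_row, prev_col = 1, 0
        tokens = tokenize.generate_tokens(io.StringIO(code).readline)
        for kind, value, (srow, scol), (erow, ecol), line in tokens:
            # Between-token whitespace, computed much as untokenize computes it
            # (greatly simplified: indentation after the first line of a block,
            # continuation lines, etc. are not handled here).
            ws = line[prev_col:scol] if srow == prev_row and scol > prev_col else ''
            self.token_hook(tokenize.tok_name[kind], value, ws)
            prev_row, prev_col = erow, ecol
        return self.results

    def token_hook(self, kind, value, ws):
        # The null hook: emit a pseudo "ws" token carrying the between-token
        # whitespace, then the real token (see point 1 below).
        if ws:
            self.results.append(('ws', ws))
        self.results.append((kind, value))
```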
This scheme is the simplest thing that could possibly work. Subclasses may adjust the input tokens to make token handling easier:
1. The null hooks create pseudo "ws" tokens (in the proper places!) that carry the between-token whitespace. Nothing more needs to be done!
2. Extra "ws" tokens would complicate token-related parsing in the FStringifyTokens and PythonTokenBeautifier. Instead, the token hooks in these two classes "piggyback" between-token whitespace on already-created tokens. It's quite simple. See the token_hook methods of these two classes. Note that these two hooks are similar, but not exactly the same.
Alas, there is an itsy bitsy problem...
A bug in untokenize
Alas, tokenize.untokenize does not properly "round trip" every valid Python program. When a logical line is continued with a backslash, the whitespace before the backslash is not preserved in the output.
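The branch's test case isn't reproduced here, but a program of the following kind (my own reconstruction, assuming a space before a backslash continuation) shows the problem:

```python
import io
import tokenize

code = "a = 1 + \\\n    2\n"  # note the space between '+' and the backslash
tokens = tokenize.generate_tokens(io.StringIO(code).readline)
result = tokenize.untokenize(tokens)
print(repr(code))    # 'a = 1 + \\\n    2\n'
print(repr(result))  # 'a = 1 +\\\n    2\n' (observed at the time of writing)
# The space before the backslash has been dropped, so the round trip fails.
```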
Does the bug matter?
I have given this question considerable thought. It's a case of theory vs practice.
In practice, this bug doesn't matter:
1. The odds of a programmer writing code like the example above are small. Crucially, backslash-newlines within strings are always handled correctly.
2. Even if the buggy case did happen, Leo's beautify and fstringify commands would carry on without incident.
3. It's highly unlikely that anyone would complain about the diffs.
4. The bug could even be called a feature :-)
In theory, this bug is much more troubling. I want to argue publicly that:
1. Basing token-based tools on tokenize.untokenize is absolutely sound. Alas, it is not.
2. tokenize.untokenize is easy to understand. Alas, it is not.
untokenize's helper, the Untokenizer.add_whitespace method, is a faux-clever hack. After hours of study and tracing, I see no obvious way to fix the bug.
Summary
This post contains a high-level theory of operation for flexible token-based classes.
The code in the beautify2 branch has much to recommend it:
- It is demonstrably as sound as tokenize.untokenize, a big advance over previous code.
- It could easily serve as the basis for a public exhortation to base text-based tools on tokens, not parse trees.
For now, I'll ignore the bug, except to file a Python bug report and ask for guidance about fixing it.
Edward