ENB: The unification of the token and ast worlds


Edward K. Ream

Nov 10, 2019, 4:37:19 AM
to leo-editor
A few days ago a new goal appeared: to define and create a token-order tree traversal. This would be yet another kind of tree traversal, whose purpose would be to unify the token and ast worlds. This goal started a chain of discovery and experimentation.  Iirc, this is the first time I have stated this goal.

This post builds on a primer of the token and ast worlds. Please study that post if you have no idea what "unifying the token and ast worlds" might mean :-)

Yesterday I spent many happy hours exploring the excellent, elegant asttokens package. At first I thought asttokens might be an elegant way to define token-order traversals.  But then problems arose...

Last night, in conversation with Rebecca, I had two Ahas, each of which will forever change my view of the world:

Aha 1. Any token-order tree traversal must be isomorphic to Leo's AstFormatter class, in leoAst.py.

At last, token-order tree traversals are well defined.

Aha 2. Tree-based code could be an alternative front-end to Leo's token-based beautifier.

I am writing this post now, before the old world view becomes completely inaccessible. This post is an important part of Leo's history.  It will also be pre-writing for a new theory of operation.

Background

Three previous threads record the background of the Ahas:

Thread 1: October 25: A small pause for a better fstringify states:

"It is my strong (informed) opinion that parse trees are inappropriate for text-based manipulations such as black and fstringify".

The two new Ahas alter that opinion in large and small ways.

Thread 2: November 1: ENB about tokens and related commands reiterates my contention that python programmers often downplay the significance of token-based code.  That's still my opinion.

Thread 3: November 3: ENB A much better untokenizer discusses a spectacular replacement for the untokenize function in tokenize.py, part of Python's standard library.  The new code is the foundation of Leo's fstringify commands, and all other token-based code in Leo.

The python devs responsible for tokenize.py were underwhelmed by the new untokenize, but no matter :-)  It was part of the background for the new Ahas, and the new untokenize is the foundation of Leo's new fstringify commands.

About asttokens

The asttokens package embeds new data into the 5-tuples created by Python's tokenize function.  They become 8-tuples. The new data contain links to the ast nodes "responsible" for the tokens.
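For concreteness, the underlying 5-tuples can be inspected with nothing but Python's standard tokenize module. This sketch shows stdlib behavior only, not asttokens' extra fields:

```python
import io
import tokenize

# Python's tokenize module yields 5-tuples: (type, string, start, end, line).
# asttokens extends these with additional fields, including node links.
source = "x = (1, 2)\n"
five_tuples = list(tokenize.generate_tokens(io.StringIO(source).readline))
for tok in five_tuples:
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```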

Alas, the new data do not suffice to create a two-way mapping between tokens and ast nodes. 

Rebecca asked whether such a two-way mapping was possible, and both Ahas appeared immediately!

Any proper tree-to-token mapping must have two parts:

1. Links from each token to exactly one tree node, the node that "generates" the token.

2. Links from tree nodes to zero or more tokens, in the proper token order.
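As a sketch of the two parts (the Token class and link helper below are hypothetical illustrations, not Leo's actual code):

```python
import ast

class Token:
    """A minimal token that can point at the node that 'generates' it."""
    def __init__(self, kind, value):
        self.kind = kind
        self.value = value
        self.node = None  # Part 1: each token links to exactly one tree node.

def link(token, node):
    token.node = node  # Part 1: token -> node.
    # Part 2: node -> zero or more tokens, in proper token order.
    if not hasattr(node, 'token_list'):
        node.token_list = []
    node.token_list.append(token)

# Demo: link a 'name' token to the ast.Name node for x.
tree = ast.parse("x = 1")
name_node = tree.body[0].targets[0]
tok = Token('name', 'x')
link(tok, name_node)
```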

Aha 1: clever code has no chance of working

I saw that I was trying to "cheat" yesterday.  That is, I was trying to make asttokens do more than it possibly could.  This was becoming clearer as I wrestled with the new TokenOrderTraverser class in leoAst.py.  The present version of this class will move to the attic.

There is no real need to discuss the problems in detail.  Instead, let's just consider the AstFormatter class in leoAst.py.  Aha 1 is simply the realization that the AstFormatter class already defines token order!

If I want a two-way mapping between tokens and tree nodes, something that works exactly like the visitors in AstFormatter is not only the simplest thing that could possibly work, it is the only thing that could possibly work.  Indeed, there is exactly one traversal that will format text properly. But (Aha!) the formatted text must be (except for easily handled special cases) isomorphic to the stream of tokens!

For the first time, token-order is well defined.
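A toy illustration of Aha 1 (the class below is a sketch, not Leo's AstFormatter): the order in which the visitors emit text is the token order.

```python
import ast

class MiniFormatter(ast.NodeVisitor):
    """Toy formatter: the order of put() calls below *is* the token order."""
    def __init__(self):
        self.results = []

    def put(self, s):
        self.results.append(s)

    def visit_Assign(self, node):
        # The visit order defines token order: targets, then '=', then value.
        for target in node.targets:
            self.visit(target)
        self.put('=')
        self.visit(node.value)

    def visit_Name(self, node):
        self.put(node.id)

    def visit_Constant(self, node):
        self.put(repr(node.value))

f = MiniFormatter()
f.visit(ast.parse("x = 1").body[0])
print(''.join(f.results))  # -> x=1
```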

Aha 2: AstFormatter could be the front end to Leo's beautify commands

This is a corollary to the first Aha.  If a tree traversal can produce tokens in the same order that tokenize does, then it could create the so-called input tokens used by Leo's beautify and fstringify commands.

Strategy

I'll modify the AstFormatter class so that it injects two-way links between tokens and nodes.  Aha 1 proves that this is possible!

The AstFormatter class contains one "visitor" function for every single ast node that could possibly generate output text.  Crucially, these visitors must call other visitors in the correct order.  That order defines token order.

Instead of creating output text, the rewritten TokenOrderTraverser class will insert links.  So simple.  The answer was staring at me all this time.

There are a few complications that are easily handled.  Commas after tuples with two or more elements are optional.  Therefore, the do_Tuple visitor must test the token stream.

The old TokenOrderTraverser class tried to use a dict to specify token order.  But cruft tokens doomed that approach.  Examples:

- The trailing colon in class and def statements.
- The commas in lists and tuples.

Only explicit code, exactly as in the visitors in AstFormatter, could possibly inject links into cruft tokens.
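The colon is easy to see with the stdlib tokenizer: it appears in the token stream even though no ast field accounts for it.

```python
import io
import tokenize

# The trailing colon in 'def test():' is a cruft token: it appears in the
# token stream, but no field of the ast.FunctionDef node mentions it.
source = "def test():\n    pass\n"
toks = list(tokenize.generate_tokens(io.StringIO(source).readline))
ops = [t.string for t in toks if t.type == tokenize.OP]
print(ops)  # -> ['(', ')', ':']
```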

Summary

It's getting very late, so I'll be brief here.

I don't remember exactly how the notion of a token-order traversal appeared.  In retrospect, this was a crucial "invention".

Initially, all details were fuzzy.  It wasn't entirely clear whether the notion could be well defined, though I strongly suspected that it could be.

Now, everything is clear. The code for a token-order traversal class must be isomorphic to the code in AstFormatter class.  All the messy, picky, details of that class define token-order traversals!

Lest anyone doubt the importance of these Ahas, consider the status quo ante:

- The old (deprecated) TokenSync class in leoAst.py.
- The horrendous code in the "real" black and fstringify tools.
- The token-level parsing in Leo's fstringify commands.

A proper tree-to-token mapping would be of great value to any tool that munges text. It allows tools to use both tree and token representations interchangeably in the same program.

That's all for now.  I've covered enough here to make sure the crucial details behind the Ahas don't fade away.

Edward

Edward K. Ream

Nov 11, 2019, 5:10:39 PM
to leo-e...@googlegroups.com
​On Sun, Nov 10, 2019 at 3:37 AM Edward K. Ream <edre...@gmail.com> wrote:

> A few days ago a new goal appeared: to define and create a token-order tree traversal.

​A status report.

I have been obsessed with this project.  Everything else is taking a back seat, including sleep :-)  I'll announce Leo 6.1 final soon, I promise :-)

This is the most consequential project I have done outside of Leo in the last 20 years.  It surely is feasible, but it is a juicy, non-trivial, fascinating programming problem.

After about 10 hours of work, starting very early this morning, I realized that my initial approach to "syncing" tokens with ast nodes needed a rethink.​ The initial idea was to verify that tokens matched ast nodes.  But that's too late.

After a nap I saw that the ast visitors should use the "ground truth" (the tokens themselves) to ensure that tokens match the tree when each ast node is visited.  To do this, the ast visitors will call two conditional token generators, put_conditional_comma and put_conditional_newline. The former is needed because tuples with more than one element may be followed by an optional comma. The latter is needed because of difficulties handling "newline" and "indent" tokens. I anticipated both problems.

These conditional methods have access to all tokens in the list of input tokens, not just the "current" token, self.tokens[self.token_index].  So these conditional generators should never need to "guess", and neither will the ast node visitors that call them.
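A minimal sketch of the idea (the token representation is simplified; only the name put_conditional_comma comes from the post):

```python
class Syncer:
    """Sketch: a conditional generator may inspect the whole token list."""
    def __init__(self, tokens):
        self.tokens = tokens      # Simplified tokens: (kind, value) pairs.
        self.token_index = 0

    def put_conditional_comma(self):
        """Eat a comma if, and only if, it is the next significant token."""
        i = self.token_index
        while i < len(self.tokens) and self.tokens[i][0] == 'ws':
            i += 1                # Look ahead freely: no guessing needed.
        if i < len(self.tokens) and self.tokens[i] == ('op', ','):
            self.token_index = i + 1  # Eat through the comma.
            return True
        return False              # No comma: no complaint.

s = Syncer([('ws', ' '), ('op', ','), ('name', 'x')])
assert s.put_conditional_comma()
assert s.tokens[s.token_index] == ('name', 'x')
```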

It is my present opinion that the tree must be traversed in a single pass.  If it can't be done properly from the get-go, it can't be done at all.  But surely it can be done.  All necessary data are present.

​This will be a spectacularly powerful tool.  ​For example, the test runner shows the results as follows:
 
print('Result...\n')
print(''.join([z.to_string() for z in tot.results]))

This suggests that we could define a TokenOrderFormatter class that replaces the existing AstFormatter class.  The code (completely untested) would look something like this:

class TokenOrderFormatter(TokenOrderTraverser):

    def format(self, contents):
        """
        Format the tree into a string guaranteed to be generated in token order.
        """
        self.tokens = self.make_tokens(contents)
        tree = parse_ast(contents)
        self.visit(tree)
        return ''.join([z.to_string() for z in self.results])

The base class will do all the work!  The converse is not true.  There is no way that the TokenOrderTraverser class could be a subclass of the present AstFormatter class. 

Summary

This is one of the best, juiciest, most consequential, programming challenges ever.  Conditional token generators (the put* methods) are the next pieces of the puzzle. They might be the last pieces.  We shall see.

The TokenOrderTraverser class is a re-imagining of the AstFormatter class.  A trivial, rigorous formatter could be built on top of TokenOrderTraverser.

As an experiment, I'll be rewriting Leo's fstringify commands, basing them on TokenOrderTraverser.  The idea is to avoid token-level "parsing" entirely.  This experiment will show whether this class is truly as revolutionary as I think it is :-)

Edward

P.S.  Development has been very easy.  My test runner is an @command node. The contents, though short and simple, suffice to drive development. They have revealed lots of bugs and larger problems.  Here is the test runner:

import imp
import leo.core.leoAst as leoAst
imp.reload(leoAst)

def check(contents, tokens):
    result = ''.join([z.to_string() for z in tokens])
    ok = result == contents
    if not ok:
        g.printObj(result)
    return ok

# This is more than enough to test syncing.
contents = r'''
class TestClass:
    def test(a, b=2):
        if a:
            print('hi')
            pass
        print('done')
'''.strip() + '\n'

print('Ctrl-2: leoAst tests...\n')
print('Contents...\n')
print(contents)
tot = leoAst.TokenOrderTraverser()
tokens = tot.make_tokens(contents)
ok = check(contents, tokens)
if ok:
    tree = leoAst.parse_ast(contents)
    print('Tree...\n')
    print(leoAst.AstDumper().dump(tree))
    print('')
    try:
        tot.verify_token_order(tokens, tree)
    except AssertionError as e:
        print(e)
    print('Result...\n')
    print(''.join([z.to_string() for z in tot.results]))
    print('')
    # tot.report_coverage(report_missing=False)
print('Ctrl-2:', ('PASS' if ok else 'FAIL'))
print('')

EKR

Edward K. Ream

Nov 12, 2019, 3:15:09 AM
to leo-editor
On Monday, November 11, 2019 at 4:10:39 PM UTC-6, Edward K. Ream wrote:

After about 10 hours of work, starting very early this morning, I realized that my initial approach to "syncing" tokens with ast nodes needed a rethink.​ The initial idea was to verify that tokens matched ast nodes.  But that's too late.

Note:  this continues an Engineering Notebook post.  Feel free to ignore.

This post is a milestone.  It records notes to myself.  It also celebrates a clearing of my mental fog.  I want to write this as an important historical note.  I also record it so I can go back to sleep ;-)

When I awoke very early this morning I realized that I had been making things much more complicated than they need to be, because I had forgotten what I was trying to do :-)

The present work is not supposed to generate new tokens; it is supposed to annotate the existing tokens created by make_all_tokens.  This collapses the complexity of the crucial code.  Here are some principles that I awoke with:

1. Add fields to Token class: index, level, node.

The purpose of the TokenOrderTraverser is to add these links.  Nothing more!
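Principle 1 in sketch form (a simplified Token, not Leo's actual class):

```python
import ast
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    """Sketch of principle 1: three new annotation fields per token."""
    kind: str
    value: str
    index: int = 0                  # Position in the token list.
    level: int = 0                  # Indentation level of the block.
    node: Optional[ast.AST] = None  # The node that 'generates' the token.

# Annotating is just filling in the three fields.
tree = ast.parse("x = 1")
tokens = [Token('name', 'x'), Token('op', '='), Token('number', '1')]
for i, tok in enumerate(tokens):
    tok.index, tok.level, tok.node = i, 0, tree.body[0]
```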
 
2. Improve the visit method

- assert isinstance(node, ast.AST): None, list, and tuple are not valid.

Unlike all "elegant" traversal classes, the TokenOrderTraverser class must explicitly handle all fields that are lists or tuples, and must check for empty fields.  There is absolutely no choice about this.  It's the only way to retain the correct traversal order.

- Inject parent, ordered_children fields in ast nodes.

These are not needed to annotate the tokens, but they will be of great value for the clients of the TokenOrderTraverser class.

- Compute max_level and max_stack_level.

These are important data for development.  max_level is the max indentation level of Python blocks.  max_stack_level is the max recursion level in tot.visit.  The visit method can easily update these data.

The asttokens tool supposedly uses generators to avoid overflowing python's runtime stack.  I want to make sure we never come close to this.
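The explicit-field requirement can be sketched generically (the real TokenOrderTraverser uses per-node visitors, but the ordering rule is the same):

```python
import ast

class OrderedVisitor:
    """Sketch: handle list/tuple fields explicitly, preserving source order."""
    def __init__(self):
        self.visited = []

    def visit(self, node):
        # Only genuine ast nodes are valid: None, list, and tuple are not.
        assert isinstance(node, ast.AST), repr(node)
        self.visited.append(node.__class__.__name__)
        for name, value in ast.iter_fields(node):
            if isinstance(value, (list, tuple)):
                # Visit each element explicitly; empty fields fall through.
                for item in value:
                    if isinstance(item, ast.AST):
                        self.visit(item)
            elif isinstance(value, ast.AST):
                self.visit(value)

v = OrderedVisitor()
v.visit(ast.parse("x = 1"))
print(v.visited)  # e.g. ['Module', 'Assign', 'Name', 'Store', 'Constant']
```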

3. The following three items look innocuous.  In fact, they are supremely important. They arise because the task is now clearer:

- Remove all calls to put_indent and put_dedent. Replace them with self.level += 1 and self.level -= 1.
- Remove all "speculative" calls to do_newline.
- Remove conditional_newline.

You could say that all of the code above is a brain spike :-)  Again, the code is not creating new tokens; it is annotating existing tokens.  This is actually a huge Aha:

   The put* methods simply "eat" zero or more tokens in the token list, adding fields to those tokens in the process.

Most put methods will eat "ws" tokens if they are next, and then eat the "matching" token. The put_newline method will also eat any following "indent" token.  It's totally simple!  There should be no such thing as a conditional_newline!

I'm not sure how "dedent" tokens will be eaten, but it shouldn't be a big deal to do so.
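Here is the eating idea in miniature (the token representation and class name are simplified stand-ins for Leo's code):

```python
class Eater:
    """Sketch: eating a token = advancing a pointer + injecting the node."""
    def __init__(self, tokens):
        self.tokens = tokens  # Simplified tokens: mutable dicts.
        self.index = 0
        self.node = None      # The 'current' ast node, set by visit().

    def eat(self, kind):
        """Eat leading 'ws' tokens, then the matching token of this kind."""
        while self.index < len(self.tokens) and self.tokens[self.index]['kind'] == 'ws':
            self.tokens[self.index]['node'] = self.node
            self.index += 1
        tok = self.tokens[self.index]
        assert tok['kind'] == kind, (tok['kind'], kind)
        tok['node'] = self.node  # Eating links exactly one node to the token.
        self.index += 1

e = Eater([{'kind': 'ws', 'value': ' ', 'node': None},
           {'kind': 'name', 'value': 'x', 'node': None}])
e.node = 'Name-node'  # Stand-in for an ast.Name instance.
e.eat('name')
```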

And now there is a second huge Aha:

    Eating a token naturally associates exactly one ast node with the token.

Indeed, the self.node ivar (carefully set and restored in tot.visit, using a stack) will be injected into the token's node field.  That's all there is to it!

And one last Aha:

    Newlines are associated with statements, not blocks.

This ends some massive confusion, and will simplify the code considerably.

Summary

The task of the TokenOrderTraverser class is merely to annotate already existing tokens.

The put methods will "eat" one or more tokens by advancing a pointer to the token list and by injecting data into the eaten tokens.  There is no need for complex synchronization!

put_newline will eat a "newline" or "nl" token and any following "indent" token, and probably any preceding "dedent" token.  The put_conditional_comma method is still required.  It will eat a comma if it exists, but will issue no complaint if it does not.

Eating a token naturally associates a token with the correct ast node.  At last I clearly and fully understand the two-way correspondence.

The self.level ivar represents indentation level, and will be injected into all tokens. That's all that needs to be done regarding indentation! There is no need to generate "indent" and "dedent" tokens!

It is rare for a to-do list to have such import, but these are wonderful times :-)

Edward

P. S. Hehe. The TokenOrderFormatter is trivial because it doesn't do anything. True, a proper code beautifier or fstringifier would be a subclass of TokenOrderTraverser, but those tools would be anything but trivial.

EKR