Success! High-level overview and status report

39 views
Skip to first unread message

Edward K. Ream

unread,
Nov 18, 2019, 11:09:53 AM11/18/19
to leo-e...@googlegroups.com
The most risky phase of this project is complete. Significant tests pass.

The crucial Linker class is unexpectedly simple and elegant. It is a linear time diff of two lists of tokens.

The rest of this post is a high-level overview of the code and algorithms. It will be pre-writing for a Theory of Operation chapter, to become part of LeoDocs.leo.

High-level overview

The contents are a string, a snippet of python code.

Given a contents string, TokenOrderGenerator (tog) class does the following:

1. Tokenize:

Use python's tokenize module to create the tokens list, a list of Token objects.

2. Create the parse tree:

Use python's ast module to create the parse tree, a tree of ast.AST nodes.

3. Generate the results list:

The tog traverses the parse tree, generating the results list, again a list of Token objects.

Crucially, this traversal happens in token order. This is the one and only order that guarantees the token and results lists will match exactly, except for optional tokens. Optional tokens may appear in either the tokens list or results list.

To generate result tokens in the proper order, the tog class must have a separate visitor for every kind of parse tree node.

All visitors are generators, ensuring that the traversal can't overflow python's run-time stack. Visitors use "yield" or "yield from" to generate results from sub-trees. These yields must be done in token order, which is why a separate visitor is required for each kind of parse tree node.

All visitors end with begin_visitor and end with the end_visitor method. Among other things, these methods save, set and restore the tog.node ivar. This ivar is the tree node corresponding to each visitor.

Visitors call tog.put to create new result tokens. Crucially, the tog.node ivar tells tog.put which parse-tree node is responsible for the token. The node ivar of each result token remembers that node.

4. Assigns two-way links between input tokens and parse-tree nodes...

Syncing and linking

The Linker class is the heart of the entire enterprise. It must be bullet proof.

To review, the tree traversal generates results list, a list of Token objects. All result tokens contain a link (token.node) to the parse tree node that generated the token.

The linker runs in a post pass. The linker syncs both lists, looking ahead in either list as necessary.

Syncing is easy because significant (non-optional) tokens appear in the same order in the two lists! The linker need only ignore optional tokens. When syncing two nodes, the linker transfers the token.node data from the result list to the corresponding node in the tokens list.

I am thrilled to report that the linker is a complete success:

- It runs in time proportional to the number of tokens.

- It is a gem. It is unexpectedly simple, elegant and flexible.

- It passes significant tests. It is now revealing bugs elsewhere.

The tokens and results lists are both required

The tokens list, not the results list, must be the primary "product" of the tog class.

Indeed, the tokens list is the "ground truth", containing details that are not (at present) present in the parse trees. These details include comment tokens (missing entirely from the parse tree) and the spelling of whitespace tokens (more reliable in the tokens list).

Note: Despite being essential for the algorithm, the results list is unlikely to be otherwise useful. For most apps, filtering the tokens list should almost always be simpler and better. Or so I say now.

Executive summary

At last we can view what happens from the highest vantage point:

- The tokens list contains the ground truth. Its data are more reliable than data in the results list.

- The tog class creates the results list by visiting the tree in token order.

- The results list exists only to hold data to be transferred to the tokens list.

- Syncing the tokens and results lists is the only way make this transfer!

- Syncing is feasible and fast because significant (non-optional) tokens appear in exactly the same order in the two lists. The linker can look ahead in either list because the linker runs in a post pass, after all tokens have been generated.

That's all anyone needs to know.

Testing

Much testing remains. Many tree visitors are buggy. Happily, testing is easy...

The linker issues detailed error messages if the token and results lists can't be properly synced.

Using difflib, a diff reporter shows the diffs between the token and results list. These diffs usually highlight which visitor is the culprit.

Summary

The notion of token-order tree traversals got this project started.

The TokenOrderGenerator class is nothing but straightforward infrastructure. Almost all of it is already elegant and simple. I'll improve the remaining ugly bits next.

Everything depends on the Linker class. I am thrilled to report that this class is dead simple and very fast.

Testing is now easy. Errors in visitors cause the Linker class to fail, as it should. The diff reporter then clearly shows why the token and results list can't be synced.

The remaining phases of the project look straightforward:

- Create unit tests covering all visitors.
- Tweak the code to handle weird special cases involving tokens and parse trees.
- Verify that the code works on all of Leo's source files.
- The acid test: simplify Leo's fstringify commands using this project's classes.
- Document everything and package the code for use outside of Leo.

Edward
Reply all
Reply to author
Forward
0 new messages