Brief status report; long ENB re Ahas and code


Edward K. Ream

Nov 22, 2019, 10:48:27 AM
to leo-editor
For the last few days, ideas, Ahas, and progress have come so fast that I have not had time to document them properly. Now is the time. A reminder: all work is being done in the fstringify branch.

Short summary: There have been big Aha's, the code keeps getting simpler, and further collapses in complexity beckon. The testing framework continues to evolve. I won't speculate on schedule. Any "delays" have been worthwhile.

This is a long post. Feel free to skim or skip entirely.

The rest of this post discusses status and documents important design and coding details. It will be pre-writing for a revised Theory of Operation, now in LeoDocs.leo.

Status, visualized

I have spent a lot of time on testing and visualization tools. The following are the untouched results from a unit test in unitTest.leo. This test is a proof of concept for a tree-based fstringify.

The 'contents' dump shows the incoming source string, the 'tree' dump shows the resulting (annotated) tree.

Contents...

1    """DS."""
2    print('test %s=%s'%(a, 2))
3    print(f"test {a}={2}")

Patched tree...

parent           lines    node                               tokens
======           =====    ====                               ======
                         
0   Module
0   Module                1     Expr
1   Expr         0        2       Str: s='DS.'               string(DS.)
0   Module                3     Expr
3   Expr                  4       Call
4   Call         1..2     5         Name: id='print'         string("""DS.""") newline(1:10) name(print)

4   Call         2        6         BinOp: %                 op( string('test %s=%s') op%
6   BinOp        0        7           Str: s='test %s=%s'    string(test %s=%s)
6   BinOp                 8           Tuple
8   Tuple        2        9             Name: id='a'         op( name(a)
8   Tuple        2        10            Num: n=2             op, number(2)
0   Module                11    Expr
11  Expr                  12      Call
12  Call         2..3     13        Name: id='print'         op) op) newline(2:27) name(print)
12  Call                  14        JoinedStr

Note: parentheses are in the wrong place. Fixing should be straightforward, if not pretty ;-)

About the results array

An earlier post was ambivalent about the results array.  Would it ever be useful on its own? The answer is now completely clear:

    The results array is for internal use only.
    It exists only to transfer data between the tree and the incoming tokens list.

Indeed, the name 'results' is a bit of a misnomer, but I can't think of a better name.

The immediate consequence:

    Visitors should not add non-syncing (insignificant) tokens to the results array.

This greatly simplifies the visitors.  In particular, visitors never add comma tokens. Visitors only add required parentheses, such as the parens that appear in def statements.
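For context, Python's stdlib tokenize module makes the distinction concrete. A minimal sketch (my own simplification, not the TOG's exact classification): names, numbers, and strings carry content the tree also knows about, while commas and parentheses are pure punctuation:

```python
import io
import token
import tokenize

source = "print('test %s=%s' % (a, 2))\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Keep only the "significant" tokens; operators such as ',' '(' ')' '%'
# are filtered out here (a simplification of the syncing/non-syncing idea).
significant = [t.string for t in tokens
               if t.type in (token.NAME, token.NUMBER, token.STRING)]
print(significant)  # ['print', "'test %s=%s'", 'a', '2']
```

The visitors only have to account for the significant tokens; the punctuation can be recovered from the token stream itself.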

About the asttokens tool

Early yesterday morning I suddenly wondered whether all the work I have done so far might have been a waste.  Could the asttokens tool be a better foundation? The short answer is "no", but I looked again at what asttokens does, and how it does it.

Again, let's start with a dump.  Here is the exact same unit test as above, with the 'asttokens' switch enabled. So the test uses asttokens to annotate the tree, not the TokenOrderGenerator class:

Contents...

1    """DS."""
2    print('test %s=%s'%(a, 2))
3    print(f"test {a}={2}")

Patched tree...

     
Module      0..16
    Expr    0..0
        Str    0..0
    Expr    2..11   'print' '(' "'test %s=%s'" '%' '(' 'a' ',' '2' ')'
        Call    2..11   'print' '(' "'test %s=%s'" '%' '(' 'a' ',' '2' ')'
            Name    2..2
            BinOp    4..10   "'test %s=%s'" '%' '(' 'a' ',' '2'
                Str    4..4
                Tuple    6..10   '(' 'a' ',' '2'
                    Name    7..7
                    Num    9..9
    Expr   13..16   'print' '(' 'f"test {a}={2}"'
        Call   13..16   'print' '(' 'f"test {a}={2}"'
            Name   13..13
            JoinedStr   15..15

The problems are apparent: the annotations aren't very useful. However, parentheses are in better places.

Pros and cons of the asttokens tool

Pros:
- Thoroughly debugged.
- Uses generators everywhere.
- The code is concise.
- Arguably it is elegant, though arguments are possible ;-)
- Works with trees built by ast and with astroid, but this is a nit, for this project.

Cons:
- Imo, the present TOG generators are clearer. They are certainly more flexible.
- asttokens.MarkTokens class doesn't do what is needed, nor can it easily be made to do so.
- It's still not clear whether asttokens traverses the tree in token order.

Neither the asttokens sources nor the asttokens docs are clear on this last point. I have created a unit test to investigate this question, but results are not yet conclusive. Further (easy) unit tests will eventually answer this question.

I continue to study the asttokens sources.  After all the work I have done, I know what to look for :-) Some parts are clever, perhaps too clever. Other parts are essential hacks that the TOG will have to emulate, as discussed below.

To summarize: Imo, the TOG classes are clearer and more flexible than the asttokens code. YMMV.

A new code pattern for visitors

Yesterday I spent several happy hours revising the visitors.  They now all use a common pattern. For example:

def do_Call(self, node):

    yield from self.gen(node.func)
    yield from self.gen_op('(')
    yield from self.gen(node.args)
    yield from self.gen(node.keywords)
        # The visitor puts the '**' if there is no name field.
    if hasattr(node, 'starargs'):
        # The visitor puts the '*'.
        yield from self.gen(node.starargs)
    if hasattr(node, 'kwargs'):
        # The visitor puts the '**'.
        yield from self.gen(node.kwargs)
    yield from self.gen_op(')')

Visitors must follow three simple rules:

Rule 1. Visitors always use 'yield from', never 'yield'.

This allows subclasses to change visitors or other members at will.  I'll give an important example later.

Rule 2. Visitors call self.gen_op, self.gen_name, etc. to add tokens to the results list.

Rule 3. Visitors call self.gen to generate results from subtrees of the parse trees.
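To make the three rules concrete, here is a minimal, self-contained toy (my own sketch, not Leo's actual code; the names gen, gen_op, and gen_name merely mimic the post). Instead of building a real results array, the visitors yield (kind, value) pairs, which keeps the example short:

```python
import ast

class MiniTOG:
    """Toy illustration of the three visitor rules. Not Leo's code."""

    def gen(self, z):
        # Rule 3: generate results from a subtree (or list of subtrees).
        yield from self.visitor(z)

    def gen_op(self, val):
        # Rule 2: emit a significant operator token.
        yield ('op', val)

    def gen_name(self, val):
        yield ('name', val)

    def visitor(self, node):
        if node is None:
            return
        if isinstance(node, (list, tuple)):
            for z in node:
                yield from self.visitor(z)
            return
        method = getattr(self, 'do_' + node.__class__.__name__)
        yield from method(node)

    # Visitors: Rule 1 -- always 'yield from', never 'yield'.
    def do_Module(self, node):
        yield from self.gen(node.body)

    def do_Expr(self, node):
        yield from self.gen(node.value)

    def do_Call(self, node):
        yield from self.gen(node.func)
        yield from self.gen_op('(')
        yield from self.gen(node.args)
        yield from self.gen(node.keywords)  # Empty here.
        yield from self.gen_op(')')

    def do_Name(self, node):
        yield from self.gen_name(node.id)

tree = ast.parse("print(a)")
print(list(MiniTOG().visitor(tree)))
# [('name', 'print'), ('op', '('), ('name', 'a'), ('op', ')')]
```

Because every visitor is a generator delegating via 'yield from', a subclass can override any do_* method, or visitor itself, and the whole pipeline picks up the change.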

Making the pattern work

The code has been generalized.  This eliminates a lot of cruft.

The gen* methods are new.  They wrap calls to self.visitor, and usually also self.put* methods...

def gen(self, z):
    yield from self.visitor(z)

def gen_blank(self):
    yield from self.visitor(self.put_blank())
...


The all-important visitor method has been generalized:

def visitor(self, node):
    """Given an ast node, return a *generator* from its visitor."""
    # This saves a lot of tests.
    if node is None:
        return
    # More general, more convenient.
    if isinstance(node, (list, tuple)):
        for z in node or []:
            if isinstance(z, ast.AST):
                yield from self.visitor(z)
            else:
                # Some fields contain ints or strings.
                assert isinstance(z, (int, str)), z.__class__.__name__
        return
    # We *do* want to crash if the visitor doesn't exist.
    method = getattr(self, 'do_' + node.__class__.__name__)
    # Allow begin/end visitors to be generators.
    val = self.begin_visitor(node)
    if isinstance(val, types.GeneratorType):
        yield from val
    # method(node) is a generator, not a recursive call!
    val = method(node)
    if isinstance(val, types.GeneratorType):
        yield from val
    else:
        raise ValueError(f"Visitor is not a generator: {method!r}")
    val = self.end_visitor(node)
    if isinstance(val, types.GeneratorType):
        yield from val

As you can see, all visitors must actually be generators. But the special tests at the end mean that the begin/end_visitor methods can be either regular methods or generators. This is an important generalization. For example...

The TokenOrderNodeGenerator class

This class is the foundation for a unit test investigating whether asttokens actually traverses the tree in token order.  Here it is:

class TokenOrderNodeGenerator(TokenOrderGenerator):
    """A class that yields a stream of nodes."""

    def generate_nodes(self, tree):
        """Entry: yield a stream of nodes."""
        yield from self.visitor(tree)

    # Overrides...

    def begin_visitor(self, node):
        if node:
            yield node

    def end_visitor(self, node):
        pass

    def put_token(self, kind, val):
        pass

Nothing could possibly be more elegant.  The begin_visitor method has become a generator(!).

A reminder about speed

A note to myself, and any other interested party ;-) Please remember that speed is of tertiary importance. Moreover, the TOG classes are about 30 percent faster than the concise, supposedly elegant, classes in the asttokens tool.

Simplicity and correctness of code in the visitors are infinitely more important than speed. The simplified visitor pattern has no special cases.  The new tests in the visitor method allow the TokenOrderNodeGenerator class to be dead simple.

Remaining problems

I'll just mention the problems. Solutions should be fairly straightforward.  I thought I had found a brilliant solution to some of them, but serious doubts arose while writing this post, so I'll investigate further...

1. Allocating non-syncing tokens, especially parentheses.

asttokens has a method that replaces tokens in a (tree of) ast nodes. This will only work if all tokens are properly allocated to the correct node.  Clearly this can be done, perhaps in a post pass.

2. The If visitor can't distinguish between "else if" and "elif" (!!).

It appears that in some cases these two constructs result in exactly the same parse tree. If that's true (I'm still investigating, but I think it is true), then Linker.check will have to do a late fixup.  It should be no big deal.  It's certainly not a show stopper.
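The stdlib ast module can confirm the ambiguity. In this sketch, 'elif' and a nested 'else: if' produce structurally identical trees; only the line and column attributes differ, which is exactly what a late fixup could exploit:

```python
import ast

elif_src = "if a:\n    pass\nelif b:\n    pass\n"
nested_src = "if a:\n    pass\nelse:\n    if b:\n        pass\n"

t1, t2 = ast.parse(elif_src), ast.parse(nested_src)

# Structurally, the two trees are identical...
print(ast.dump(t1) == ast.dump(t2))  # True

# ...but the inner If's lineno/col_offset attributes differ.
print(ast.dump(t1, include_attributes=True) ==
      ast.dump(t2, include_attributes=True))  # False
```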

3. 'comment' and 'string' tokens are difficult special cases.

Comments are not easily accessible in the tree, and neither are strings arising from docstrings.  Furthermore, only tokens contain reliable "spellings" of comments and strings. Finally, the Str visitor relies on special case code in Linker.set_links. 
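A quick stdlib demonstration of the first point (my own sketch): comments never appear in the parse tree at all, but the token stream preserves their exact spelling:

```python
import ast
import io
import tokenize

source = "x = 1  # a comment\n"

# The parse tree contains no comment nodes of any kind...
kinds = {node.__class__.__name__ for node in ast.walk(ast.parse(source))}
print('Comment' in kinds)  # False

# ...but the token stream preserves the comment's exact spelling.
comments = [t.string
            for t in tokenize.generate_tokens(io.StringIO(source).readline)
            if t.type == tokenize.COMMENT]
print(comments)  # ['# a comment']
```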

Summary

The project is going well.  The code continues to improve. Some questions and problems remain, but I see no gotchas.

Edward