ENB: Two Ahas re unifying the token and parse-tree worlds

Edward K. Ream

Nov 15, 2019, 6:30:02 AM

to leo-editor

This Engineering Notebook post records what may be the last Ahas required to complete #1440. This will be pre-writing for a how-to guide.

Aha 1: Use difflib

Yesterday I saw that using difflib could make testing easier. This was the first big Aha.

I played around with diffing the results of the tree traversal against the incoming tokens. This immediately revealed some problems. More importantly, it showed that I had misunderstood what the "eat" method must do:

- Eat must use comment tokens from the token list. Comments do not exist in any easy-to-find form in parse trees!

- Eat probably should take the "spellings" of whitespace from the token list. Those spellings are unreliable/different in parse trees.

- Eat might optionally use conditional results from the parse tree.
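
The first point is easy to check with the standard library alone. The sketch below is not project code; it only shows that Python's tokenize module reports a comment as a token, while the ast dump of the same source contains no trace of it:

```python
import ast
import io
import tokenize

source = "x = 1  # set x\n"

# The tokenizer reports the comment as its own COMMENT token...
token_pairs = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]
comments = [value for kind, value in token_pairs if kind == 'COMMENT']

# ...but the parse tree built from the same source does not contain it.
tree_dump = ast.dump(ast.parse(source))

print(comments)
print('set x' in tree_dump)
```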

It all seemed complicated, so I took a break.

Aha 2: Replace eat with a post pass

When I awoke this morning I saw how to eliminate tot.eat using difflib. This is likely the last Aha needed to complete this project.

A new post_pass method will use difflib to check the results and perform any other "late" adjustments. Something like this:

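import difflib

def post_pass(self):
    """
    Use difflib to test self.results, adjusting the parse tree and creating
    output tokens as required.

    Subclasses should override this method.
    """
    # difflib.ndiff compares lists of strings, so format each pair as one.
    tokens = [f"{z.kind}:{z.value!r}" for z in self.tokens]
    results = [f"{kind}:{val!r}" for kind, val in self.results]
    for z in difflib.ndiff(tokens, results):
        print(z)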
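
To see what such a diff reports, here is a minimal standalone sketch. The Token namedtuple and the diff_tokens helper are hypothetical stand-ins, not the project's classes. The results list deliberately lacks the comment, so the diff flags it:

```python
import difflib
from collections import namedtuple

# Hypothetical stand-in for the project's token objects.
Token = namedtuple('Token', 'kind value')

def diff_tokens(tokens, results):
    """Return ndiff lines comparing tokens with (kind, value) results."""
    a = [f"{t.kind}:{t.value!r}" for t in tokens]
    b = [f"{kind}:{value!r}" for kind, value in results]
    return list(difflib.ndiff(a, b))

tokens = [
    Token('name', 'x'), Token('op', '='), Token('number', '1'),
    Token('comment', '# set x'),
]
# The tree traversal never sees the comment, so it is missing here.
results = [('name', 'x'), ('op', '='), ('number', '1')]

lines = diff_tokens(tokens, results)
for line in lines:
    print(line)
```

Lines that start with "- " exist only in the token list, and lines that start with "+ " exist only in the results. Those are exactly the places where a post pass would have to act.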

I'll override this method in the TokenOrderInjector (TOI) class, the class I use for testing. TOI injects parent/child links into each node of the parse tree. Note that children appear in token-traversal order, something that no code based on ast.walk can possibly do.
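
A small standard-library check, separate from the TOI class, illustrates the point about ast.walk. For a conditional expression, ast.walk visits the test before the body, although the body comes first in the source. The token order below is recovered from the nodes' position attributes, purely for comparison:

```python
import ast

source = "x if c else y"
tree = ast.parse(source, mode='eval')

names = [z for z in ast.walk(tree) if isinstance(z, ast.Name)]

# The order in which ast.walk happens to yield the Name nodes.
walk_order = [z.id for z in names]

# Source (token) order, recovered from line and column offsets.
token_order = [
    z.id for z in sorted(names, key=lambda z: (z.lineno, z.col_offset))
]

print(walk_order)
print(token_order)
```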

The TOI class will be the base class for a roster of example classes. Each example will tailor the TOI class for a particular real-world application.

Important: As its name implies, the post pass happens "late", after everything has been generated. Unlike the ill-fated "get" method, the post pass can look ahead in both the token and results arrays. This is a big deal.

You can think of the post pass as a simple peephole optimizer, made even simpler by the ability to look ahead as well as behind.

The put method will remain

The put method no longer calls eat. Instead, it simply appends values to self.results:

def put(self, kind, val):
    """Handle a token whose kind & value are given."""
    val2 = val if isinstance(val, str) else str(val)
    self.results.append((kind, val2))

The computation of val2 ensures that self.results will match self.tokens as much as possible.

We could even eliminate the put method entirely. Tree visitors would call `yield (x, y)` instead of `yield self.put(x, y)`. But this would be a big mistake. Subclasses should be free to override the put method!
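
As a rough illustration of why put is worth keeping, the sketch below wraps it in a hypothetical Recorder class and overrides it in a hypothetical NoWhitespace subclass. Neither class exists in the project. They only show the string coercion and the override hook:

```python
class Recorder:
    """Hypothetical minimal host for the put method shown above."""

    def __init__(self):
        self.results = []

    def put(self, kind, val):
        """Handle a token whose kind & value are given."""
        val2 = val if isinstance(val, str) else str(val)
        self.results.append((kind, val2))


class NoWhitespace(Recorder):
    """Hypothetical subclass that never records whitespace."""

    def put(self, kind, val):
        if kind != 'ws':
            super().put(kind, val)


base = Recorder()
base.put('number', 1)  # The int is stored as the string '1'.
base.put('ws', ' ')

sub = NoWhitespace()
sub.put('name', 'x')
sub.put('ws', ' ')  # Ignored by the override.

print(base.results)
print(sub.results)
```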

About conditional results

The tree node visitors make a "generalized best guess" about calling self.put. Some examples:

- The visitors call/yield put_blank() as needed to "ensure" whitespace appears around 'name' tokens.

I put "ensure" in quotes, because subclasses may eliminate whitespace later.

- The do_Tuple visitor calls put_conditional_comma() to put the optional comma after tuples with more than one element.

The post pass makes it easy to handle such details:

- put_blank could append ('blank', ' ') to the results list instead of ('ws', ' '). The pseudo "blank" kind is a flag for the post pass.

- Similarly, put_conditional_comma could append ('conditional-op', ', ') to the results list. Again, the 'conditional-op' kind is a flag for the post pass.

Important: the calls to put_blank() may be ignored later. The 'blank' op is only a potentially useful optional feature. Subclasses can define do-nothing versions of put_blank if they like. Furthermore, subclasses may define a post pass that ignores whitespace entirely:

def post_pass(self):
    # Drop whitespace (and the 'blank' flag) from both sides before diffing.
    tokens = [f"{z.kind}:{z.value!r}" for z in self.tokens
        if z.kind != 'ws']
    results = [f"{kind}:{val!r}" for kind, val in self.results
        if kind not in ('ws', 'blank')]
    for z in difflib.ndiff(tokens, results):
        print(z)
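
One possible way for a post pass to settle the 'blank' and 'conditional-op' flags is to align the results with the tokens and keep a flagged entry only where the tokens confirm it. The sketch below is an assumption about how that might look, not the project's code. The concrete and resolve_flags helpers are hypothetical, and tokens are plain (kind, value) tuples here:

```python
import difflib

FLAG_KINDS = ('blank', 'conditional-op')  # Hypothetical flag kinds.

def concrete(kind, value):
    """Return the (kind, value) a flagged entry would have if it were kept."""
    if kind == 'conditional-op':
        return ('op', value.strip())
    if kind == 'blank':
        return ('ws', value)
    return (kind, value)

def resolve_flags(tokens, results):
    """Keep flagged results only where they line up with real tokens."""
    candidates = [concrete(kind, value) for kind, value in results]
    matcher = difflib.SequenceMatcher(None, tokens, candidates, autojunk=False)
    kept = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for j in range(j1, j2):
            flagged = results[j][0] in FLAG_KINDS
            # Unflagged entries always survive; flagged ones need a match.
            if tag == 'equal' or not flagged:
                kept.append(candidates[j])
    return kept

# Tokens for the source text "(a, b)".
tokens = [('op', '('), ('name', 'a'), ('op', ','), ('ws', ' '),
          ('name', 'b'), ('op', ')')]
# Generalized results: a flagged blank and a flagged trailing comma.
results = [('op', '('), ('name', 'a'), ('op', ','), ('blank', ' '),
           ('name', 'b'), ('conditional-op', ', '), ('op', ')')]

resolved = resolve_flags(tokens, results)
print(resolved)
```

Here the flagged blank survives because the token list has whitespace at that spot, while the conditional comma is dropped because the source has no trailing comma.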

Non issues

Generators are required only to ensure that Python's run-time stack doesn't overflow. There is no harm whatever in having the results array be a true array.

The code will be extremely fast, but that's a tertiary issue. GC issues are likewise of no great concern. Being able to use difflib is crucial.

Summary

The way forward is now completely clear. No difficult parts remain.

Using difflib has already accelerated development, and will continue to do so. difflib has highlighted details that would otherwise have been difficult to spot.

A post pass, based on difflib, will replace the infamous "eat" method. The post pass is, in effect, a simple peephole optimizer that can look both behind and ahead. Most importantly, the post pass can easily be seen to be correct.

The post pass will allow subclasses to:

- Verify that the parse tree is in reasonable accord with the list of incoming tokens.

- Make arbitrary adjustments (specialization) to the "generalized" results in self.results.

- Make any needed adjustments to the parse tree. (There are two-way links between tokens and tree nodes.)

- Create a list of output tokens, if desired.

I'll create several example classes showing how to subclass TokenOrderInjector for real-world applications. These examples will contain nothing but simple overrides of base class methods such as post_pass and put. Example classes will form the bulk of the "marketing" for this project.

Edward
