ENB: still a juicy project

37 views

Skip to first unread message

Edward K. Ream

unread,

Nov 16, 2019, 9:23:32 AM11/16/19

to leo-editor

And an all-consuming one. For the last week it has taken two or more hours just to process all the ideas I have awakened with. Today was no exception.

This Engineering Notebook discusses yesterday's work and the remaining issues. I am writing this ENB to prime my mental pump.

tl;dr: There are two strategies for creating two-way links between tokens and ast nodes. Neither will be easy. The goal is to find a feasibly simple solution.

Infrastructure

Much of my work involves developing infrastructure: the test runner, diffs and dumps. Making these clean and powerful pays off handsomely:

- Diffs keep reminded me yesterdays of the complexities. Their clarity helps prime my mental pump.

- Dumps show various parts of the ground truth. Yesterday, brief_dump reminded me that the eat method was indirectly responsible for the two-way links. Naturally, not calling eat killed all those links. I knew then it was time for a break, after 15+ hours of work.

A bug in difflib.ndiff

TOG.verify now produces a diff of the token list against a new results list. This was less than straightforward because difflib.ndiff barfs on all Leo files!

File "...\Anaconda3\lib\difflib.py", line 1017, in _fancy_replace
TypeError: can only concatenate str (not "tuple") to str

Perhaps ndiff trips over Leo sentinels. Googling reveals the same error in other parts of Python's standard library. I'll report this bug when I can isolate it to just a few lines. Happily, difflib.Differ().compare works just fine.

The juicy problem

After yesterday's post about diffs I naively assumed that diff would make everything easy. Alas, it does not.

Recall that the task is to sync (somehow!) the tokens produced by tokenize.tokenize with the list of results produced by traversing the tree. This can never be a trivial task, for two reasons:

1. Ast nodes provide no data about comment tokens and the spelling of whitespace. There will always be gaps in the results list.

The tree traversal can never generate 'comment' and 'nl' tokens. Diffs will never be empty.

2. Tokens know nothing (initially) about ast nodes.

So it seems there are only two ways to create the two-way links:

Strategy 1: Generate the links immediately, when traversing the parse tree.

This is what the eat method attempted to do. Previous attempts where not feasibly simple, but I haven't entirely given up yet.

Strategy 2: Generate the links after all nodes have been created, using the diffs. There are three cases to consider:

Case A. The easy case: the token and result match.

The token's kind and value fields match the result, which is a tuple(kind, val), created by tog.put.

tog.begin_visitor sets the tog.node ivar, so this ivar is correct at the time the visitor calls tog.put. put could create tuples (kind, val, ast_node), which means that all data could be made available to the diff handler. With a bit more work, the diff handler can set the two-way links.

Case B: a token corresponds to no result.

Case C: a result corresponds to no token.

Let's lump these two together, because these often come in matching pairs!

This happens when the spelling of a 'string' token does not match the spelling of 'string' result. In that case, the ground truth is the spelling of the token. The result will contain the ast_node link. We use the ast_node link from the result, and the spelling from the token.

Other mismatches are possible, because there must be gaps in the results. These are the hardest cases of all, because no ast_node link is directly available.

This is not a gotcha. This just means that it's not so clear what node should be associated with the "extra" tokens. This creates an association problem. A better term might be the association policy.

Happily, the diffs, tokens and results are real lists (not generators), so the diff handler can look both ahead and behind in all the lists. Whether that can be made feasibly simple is an open question.

We could imagine combining strategies 1 and 2, that won't solve either problem! It would only be a check, a kind of double-entry accounting.

Conflicting intuitions

One of my math professors said that when working on a problem one can often be guided by the problem's inherent difficulty. One should not look for elementary solutions to deep problems, nor for deep solutions to elementary problems.

In this case, however, I have conflicting intuitions about how hard the underlying problem really is. The difflib docs say:

"The Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear."

This means that relying on diff will be slow. It will, however, be accurate :-)

Otoh, eat knows a lot more more than diff about tokens and results!

Summary

The goal is to find a feasibly simple way of associating two-way links between tokens and tree nodes. There are two competing strategies:

1. Create the links in tog.eat, as results are generated. This strategy is worth further work. It promises linear time performance.

2. Create the links in a post pass, using diffs. This will be much slower.

Significant bookkeeping issues bedevil either strategy.

Edward

Edward K. Ream

unread,

Nov 16, 2019, 10:20:03 AM11/16/19

to leo-editor

On Saturday, November 16, 2019 at 8:23:32 AM UTC-6, Edward K. Ream wrote:

> I am writing this ENB to prime my mental pump.

It worked. There is a third strategy.

A "delayed eat" can run in a post pass after all results have been produced. That allows the delayed eat to look ahead in the results. This involves adding ast_node entries to the result tuples, as discussed in this post.