The intuition wars are over. I expect all two-way links between tokens and ast nodes can be generated in linear time, that is, O(N), where where N is the number
of tokens.
Aha: There are very few alternative to consider.
Let T be the number of kinds of tokens. Let R be the number of kinds of results. L < R < about 10.
Unlike the general diff algorithms, there are at most L * R cases to consider. A handler will handle one or more cases. Furthermore, probably only L + R handlers (or less) will be needed, not L * R handlers.
My guess is that, on average, tokens and/or results will be
examined roughly once or twice. That guarantees linear running
time. The actual stats will depend on the actual code.
The plan
The assign_links methods does the work. It will run after all results have been produced. Something like this:
<< define handler dict >>
def assign_links(self):
"""Assign two-way links between tokens and results."""
# The present ast_node, and indices into results and tokens.
node, rx, tx = None, 0, 0
while tx < len(self.tokens) and rx < self.results:
progress = rx, tx
t = self.tokens[tx]
r = self.results[rx]
key = f"{t.kind}:{r.kind}"
handler = self.handler_dict [key]
node, rx, tx = handler(node, rx, tx)
assert progress < rx, tx
This is a simple, flexible scheme:
1. The handler dict specifies a handler for every possible combination of present token and result. The dict will have about 100 entries. Many keys will share the handler. It may turn out that only keys for tokens are needed, in which case this dict will have about 10 entries.
2. Each handler method provides a nature place for discussing policy questions and for recording debugging data or statistics. Crucially, each handler can look ahead as needed, both in the tokens list and in the results list.
Many other tweaks are possible. No need to discuss them here.
Edward