Neither the results list nor the Linker class is needed! This is a stunning collapse in complexity.
The tree generator already exists. It consists of all the tree visitors in the TOG class, and their helpers. The tree generator will stay exactly as it is, except for slight mods to tog.do_If visitor. The tree generator visits each tree node in token order.
The following will define the token generator:
self.token_gen = filter(is_significant, tokens)
Aha! The two generators are completely parallel. At present, the tree generator calls tog.put to generate tokens in the results list. But, at present, tog.put generates significant result tokens in exactly the same order as does the token generator shown above.
This was staring me right in the face. Linker.assign_links contains:
sig_tokens = list(filter(is_significant, tokens))
sig_results = list(filter(is_significant, results))
for r, t in zip(sig_results, sig_tokens):
self.set_links(r, t, tokens)
I thought I needed to convert the generators to lists, but yesterday I saw that it was hopeless to add or subtract tokens from the results list. (See below). So the linker could have been:
sig_tokens = filter(is_significant, tokens)
sig_results = filter(is_significant, results)
for r, t in zip(sig_results, sig_tokens):
self.set_links(r, t, tokens)
Do you see? Linker.set_links only sees significant tokens. It could migrate to tog.put! This means that tog.put can call a new tog.sync method to the token generator whenever it sees a significant token. Nothing else is needed! Furthermore, tog.sync will contain just a few lines of code, very similar to Linker.set_links.
History of the Aha
Early yesterday I wrote: "I thought I had found a brilliant solution to some [problems], but serious
doubts arose while writing this post, so I'll investigate further..."
The initial idea was to sync only 'string' tokens. This was to be done in the do_Str visitor. The "serious doubts" were whether 'string' tokens where significant. It turns out that 'string' tokens are significant. In fact, only 'comment' tokens generate non-whitespace text but are missing from the parse tree. Remember too that comma and parentheses must be insignificant because the parse tree doesn't contain enough data to match them exactly with the token list.
Two days ago I had already glimpsed the potential of this pattern. If we can sync string tokens this way, why not do the same with all statement visitors? And yes, I dared hope that the same could be done with all visitors.
Yesterday afternoon I returned to the problem of distinguishing 'elif' from 'else' followed by 'if' in the do_If visitor. First, I convinced myself, once again, that the parse trees are identical for these two examples:
if 1: if 1:
pass pass
else: elif 2:
if 2: pass
pass
Sad but true. So the do_If visitor needs help.
My first thought was to change 'else' to 'elif' as need in the Linker. But this fails badly. If do_If mistakenly generates "else" tokens, it will put a colon in the wrong place, namely before an arbitrarily long serious of tokens for the "if" condition, rather than after the condition. This pretty much dooms the Linker class, but I wasn't to know that immediately.
So it was time for the heavy guns. First, I added the following to tog.create_links, the main line of the TOG class:
self.if_list = [z for z in self.tokens
if z.kind == 'name' and z.value in ('if', 'elif', 'else')]
self.if_list_index = 0
Second, I rewrote the do_If visitor as follows:
def do_If(self, node):
<< How to disambiguate between 'elif' and 'else' followed by 'if' >>
# If or elif line...
# if %s:\n
# elif %s: \n
if_value = self.if_list[self.if_list_index].value
self.if_list_index += 1
assert if_value in ('if', 'elif'), if_value
yield from self.gen_name(if_value)
yield from self.gen(node.test)
yield from self.gen_op(':')
yield from self.gen_newline()
# Body...
self.level += 1
yield from self.gen(node.body)
self.level -= 1
# Else and elif clauses...
if node.orelse:
self.level += 1
if_value = self.if_list[self.if_list_index].value
if if_value == 'else':
# Consume one if-list entry.
self.if_list_index += 1
yield from self.gen_name('else')
yield from self.gen_op(':')
yield from self.gen_newline()
yield from self.gen(node.orelse)
else:
# Sanity checks.
assert if_value in ('if', 'elif'), if_value
node1 = node.orelse[0]
assert isinstance(node1, ast.If), repr(node1)
# Don't consume an if-list entry here.
# Call ourselves recursively.
yield from self.gen(node.orelse)
self.level -= 1
Instantly, all my unit tests succeeded.
That was the end of a long day's work. However, as I was talking to Rebecca about the success, all the pieces fell into place.
I'll never forget it. I was brushing my teeth and saying wow. A pause. Another wow. Another pause. Another wow.
I realized that the "if-list" should be a generator, that a token generator could yield all significant tokens, and that the tree and token generators would work in parallel.
I explained it to Rebecca as follows. I held both arms over my head. My right hand represented the tree generator. It moved back and forth as it moved from top to bottom, simulating traversing the tree. My left hand represented the token generator. It moved straight down, because it was just moving from one token to the next in the token list. Crucially, the height of the two hands always matched. The generators work in parallel.
Implementation details
It's just a bit tricky to look ahead in a generator. So instead of the code shown above in do_If, the visitor call self.peek() and self.advance(), defined something like this (completely untested):
# Use lists for debugging, generators for production.
peek_list = [] # Must be a list.
def peek(self):
if instance(self.token_gen, list):
return self.token_gen[self.token_gen_index]
if not self.peek_list:
self.peek_list = [self.token_gen.next] # or is it next()?
return self.peek_list[-1]
def advance(self):
if instance(self.token_gen, list):
self.token_gen_index += 1
self.peek_list = []
You get the idea. Finally, tog.sync will maintain a list of insignificant tokens it encounters while syncing to the next significant token. Should be no problem.
Oops. tog.sync will need to see all the tokens in the token list, so it will need a generator that yields all tokens. It's no big deal.
Summary
Ahas such as this happen rarely, and only as the result of many long hours of experimentation.
Last night's Aha justifies all the work I have done on this project.
The tog.sync method could "reassign" parentheses as needed, and can assign the MIA 'comment' tokens. Alternatively, a post-pass can do the same, working only on the tokens list, which will have links from each significant token to the corresponding tree node.
Besides the tree generator, three token generators will probably be needed. The "if" visitor still needs an if generator, and token sync will need a generator yielding all tokens. A generator that yields only significant tokens will probably also be needed.
I've said all along that speed doesn't matter much, but the new code will be spectacularly fast.
I'm way short of sleep. That's all for now.
Edward