ENB: Wow: the entire project will be nothing but two parallel generators


Edward K. Ream

Nov 23, 2019, 5:46:08 AM
to leo-editor
Neither the results list nor the Linker class is needed! This is a stunning collapse in complexity.

The tree generator already exists.  It consists of all the tree visitors in the TOG class, and their helpers.  The tree generator will stay exactly as it is, except for slight mods to the tog.do_If visitor.  The tree generator visits each tree node in token order.

The following will define the token generator:

self.token_gen = filter(is_significant, tokens)

Aha!  The two generators are completely parallel.  At present, the tree generator calls tog.put to generate tokens in the results list.  But tog.put generates significant result tokens in exactly the same order as does the token generator shown above.

This was staring me right in the face.  Linker.assign_links contains:

sig_tokens = list(filter(is_significant, tokens))
sig_results = list(filter(is_significant, results))
for r, t in zip(sig_results, sig_tokens):
    self.set_links(r, t, tokens)

I thought I needed to convert the generators to lists, but yesterday I saw that it was hopeless to add or subtract tokens from the results list. (See below). So the linker could have been:

sig_tokens = filter(is_significant, tokens)
sig_results = filter(is_significant, results)
for r, t in zip(sig_results, sig_tokens):
    self.set_links(r, t, tokens)

Do you see?  Linker.set_links only sees significant tokens.  It could migrate to tog.put!  This means that tog.put can call a new tog.sync method, which steps the token generator, whenever it sees a significant token.  Nothing else is needed!  Furthermore, tog.sync will contain just a few lines of code, very similar to Linker.set_links.
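In outline, the new sync could look something like this (a minimal sketch: the Token class, the attribute names, and the link-setting line are stand-ins of mine, not the actual TOG code):

```python
class Token:
    # Stand-in for Leo's token class, for illustration only.
    def __init__(self, kind, value):
        self.kind, self.value = kind, value
        self.node = None  # The two-way link, filled in by sync.

def is_significant(token):
    # The predicate that appears later in this thread.
    return (
        token.kind in ('name', 'number', 'string') or
        token.kind == 'op' and token.value not in ',;()')

class TOG:
    def __init__(self, tokens):
        # The token generator: yields only the significant tokens.
        self.token_gen = filter(is_significant, tokens)

    def sync(self, node):
        # Called whenever tog.put generates a significant result token:
        # step the parallel token generator and link the pair.
        token = next(self.token_gen)
        token.node = node  # Stand-in for Linker.set_links.
        return token
```

Because tog.put emits significant tokens in exactly the order of the token generator, one call to next() per significant token keeps the two streams in lockstep.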

History of the Aha

Early yesterday I wrote: "I thought I had found a brilliant solution to some [problems], but serious doubts arose while writing this post, so I'll investigate further..."

The initial idea was to sync only 'string' tokens.  This was to be done in the do_Str visitor.  The "serious doubts" were whether 'string' tokens were significant.  It turns out that they are.  In fact, only 'comment' tokens generate non-whitespace text but are missing from the parse tree.  Remember too that commas and parentheses must be insignificant, because the parse tree doesn't contain enough data to match them exactly with the token list.

Two days ago I had already glimpsed the potential of this pattern.  If we can sync string tokens this way, why not do the same with all statement visitors?  And yes, I dared hope that the same could be done with all visitors.

Yesterday afternoon I returned to the problem of distinguishing 'elif' from 'else' followed by 'if' in the do_If visitor. First, I convinced myself, once again, that the parse trees are identical for these two examples:

if 1:            if 1:
    pass             pass
else:            elif 2:
    if 2:            pass
        pass

Sad but true. So the do_If visitor needs help.
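A quick check with Python's ast module confirms this (not project code, just a demonstration):

```python
import ast

# 'else' followed by a nested 'if'...
SRC1 = '''if 1:
    pass
else:
    if 2:
        pass
'''
# ...versus 'elif': the parse trees are indistinguishable.
SRC2 = '''if 1:
    pass
elif 2:
    pass
'''
# ast.dump flattens each tree to a string for comparison.
print(ast.dump(ast.parse(SRC1)) == ast.dump(ast.parse(SRC2)))  # True
```

So no code that sees only the parse tree can tell the two forms apart; the token list is required.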

My first thought was to change 'else' to 'elif' as needed in the Linker.  But this fails badly.  If do_If mistakenly generates "else" tokens, it will put a colon in the wrong place, namely before an arbitrarily long series of tokens for the "if" condition, rather than after the condition.  This pretty much dooms the Linker class, but I wasn't to know that immediately.

So it was time for the heavy guns.  First, I added the following to tog.create_links, the main line of the TOG class:

self.if_list = [z for z in self.tokens
    if z.kind == 'name' and z.value in ('if', 'elif', 'else')]
self.if_list_index = 0

Second, I rewrote the do_If visitor as follows:

def do_If(self, node):
    << How to disambiguate between 'elif' and 'else' followed by 'if' >>
    # If or elif line...
    #   if %s:\n
    #   elif %s:\n
    if_value = self.if_list[self.if_list_index].value
    self.if_list_index += 1
    assert if_value in ('if', 'elif'), if_value
    yield from self.gen_name(if_value)
    yield from self.gen(node.test)
    yield from self.gen_op(':')
    yield from self.gen_newline()
    # Body...
    self.level += 1
    yield from self.gen(node.body)
    self.level -= 1
    # Else and elif clauses...
    if node.orelse:
        self.level += 1
        if_value = self.if_list[self.if_list_index].value
        if if_value == 'else':
            # Consume one if-list entry.
            self.if_list_index += 1
            yield from self.gen_name('else')
            yield from self.gen_op(':')
            yield from self.gen_newline()
            yield from self.gen(node.orelse)
        else:
            # Sanity checks.
            assert if_value in ('if', 'elif'), if_value
            node1 = node.orelse[0]
            assert isinstance(node1, ast.If), repr(node1)
            # Don't consume an if-list entry here.
            # Call ourselves recursively.
            yield from self.gen(node.orelse)
        self.level -= 1

Instantly, all my unit tests succeeded.

That was the end of a long day's work.  However, as I was talking to Rebecca about the success, all the pieces fell into place.

I'll never forget it.  I was brushing my teeth and saying wow.  A pause.  Another wow.  Another pause.  Another wow.

I realized that the "if-list" should be a generator, that a token generator could yield all significant tokens, and that the tree and token generators would work in parallel.

I explained it to Rebecca as follows. I held both arms over my head.  My right hand represented the tree generator.  It moved back and forth as it moved from top to bottom, simulating traversing the tree.  My left hand represented the token generator.  It moved straight down, because it was just moving from one token to the next in the token list. Crucially, the height of the two hands always matched. The generators work in parallel.

Implementation details

It's just a bit tricky to look ahead in a generator.  So instead of the code shown above in do_If, the visitor calls self.peek() and self.advance(), defined something like this (completely untested):

# Use lists for debugging, generators for production.

peek_list = []  # Must be a list.

def peek(self):
    if isinstance(self.token_gen, list):
        return self.token_gen[self.token_gen_index]
    if not self.peek_list:
        self.peek_list = [next(self.token_gen)]
    return self.peek_list[-1]

def advance(self):
    if isinstance(self.token_gen, list):
        self.token_gen_index += 1
    self.peek_list = []

You get the idea. Finally, tog.sync will maintain a list of insignificant tokens it encounters while syncing to the next significant token. Should be no problem.

Oops.  tog.sync will need to see all the tokens in the token list, so it will need a generator that yields all tokens.  It's no big deal.

Summary

Ahas such as this happen rarely, and only as the result of many long hours of experimentation.

Last night's Aha justifies all the work I have done on this project.

The tog.sync method could "reassign" parentheses as needed, and could assign the MIA 'comment' tokens.  Alternatively, a post-pass can do the same, working only on the tokens list, which will have links from each significant token to the corresponding tree node.

Besides the tree generator, three token generators will probably be needed.  The "if" visitor still needs an if generator, and token sync will need a generator yielding all tokens. A generator that yields only significant tokens will probably also be needed.

I've said all along that speed doesn't matter much, but the new code will be spectacularly fast.

I'm way short of sleep.  That's all for now.

Edward

Edward K. Ream

Nov 23, 2019, 9:49:31 AM
to leo-editor
On Sat, Nov 23, 2019 at 4:46 AM Edward K. Ream <edre...@gmail.com> wrote:

> Finally, tog.sync will maintain a list of insignificant tokens it encounters while syncing to the next significant token. Should be no problem...tog.sync will need to see all the tokens in the token list, so it will need a generator that yields all tokens.

In retrospect, the distinction between significant and insignificant tokens keeps popping up as a source of invention.  So the distinction itself looks like the real driver of the collapse in complexity.

Edward

Edward K. Ream

Nov 23, 2019, 10:24:51 AM
to leo-editor
On Saturday, November 23, 2019 at 4:46:08 AM UTC-6, Edward K. Ream wrote:

It's just a bit tricky to look ahead in a generator.  So instead of the code shown above in do_If, the visitor calls self.peek() and self.advance(), defined something like this (completely untested):

# Use lists for debugging, generators for production.

peek_list = []  # Must be a list.

def peek(self):
    if isinstance(self.token_gen, list):
        return self.token_gen[self.token_gen_index]
    if not self.peek_list:
        self.peek_list = [next(self.token_gen)]
    return self.peek_list[-1]

def advance(self):
    if isinstance(self.token_gen, list):
        self.token_gen_index += 1
    self.peek_list = []

Heh.  The code I wrote at 3 in the morning looks better than any of the Stack Overflow answers.

True, the code above works only for a single generator. The methods will be renamed to if_gen_peek and if_gen_next.

Similarly, the tog.sync method must peek arbitrarily many times into the full token generator, so it will use the all_tokens_peek and all_tokens_next methods.  Likewise, if the string_tokens and significant_tokens generators require lookahead I'll define corresponding peek and next methods.

In short, a general solution is not needed.  A per-generator solution seems simpler and better.

Edward

Edward K. Ream

Nov 23, 2019, 11:36:28 PM
to leo-editor
On Saturday, November 23, 2019 at 9:24:51 AM UTC-6, Edward K. Ream wrote:

> It's just a bit tricky to look ahead in a generator.

Correction, it's amazingly tricky.  My supposedly simple code fails in mysterious ways.  So do the Stack Overflow answers.

For now, I'll just use helper lists:

def is_if(token):
    return token.kind == 'name' and token.value in ('if', 'elif', 'else')

def is_significant(token):
    return (
        token.kind in ('name', 'number', 'string') or
        token.kind == 'op' and token.value not in ',;()')

def is_string(token):
    return token.kind == 'string'

self.if_tokens = list(filter(is_if, self.tokens))
self.significant_tokens = list(filter(is_significant, self.tokens))
self.string_tokens = list(filter(is_string, self.tokens))
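For illustration, here is how the predicates behave on a stand-in Token class (the attribute names match the snippets above, but the real tokens are Leo's, so this is an assumption):

```python
# Stand-in Token class, for illustration only.
class Token:
    def __init__(self, kind, value):
        self.kind, self.value = kind, value

def is_if(token):
    return token.kind == 'name' and token.value in ('if', 'elif', 'else')

def is_significant(token):
    return (
        token.kind in ('name', 'number', 'string') or
        token.kind == 'op' and token.value not in ',;()')

# Tokens roughly corresponding to: if 1: (pass) else:
tokens = [
    Token('name', 'if'), Token('number', '1'), Token('op', ':'),
    Token('op', '('), Token('name', 'pass'), Token('op', ')'),
    Token('name', 'else'), Token('op', ':'),
]
# 'if' and 'else' land in the if-list; '(' and ')' are insignificant.
assert [t.value for t in filter(is_if, tokens)] == ['if', 'else']
assert [t.value for t in filter(is_significant, tokens)] == [
    'if', '1', ':', 'pass', 'else', ':']
```

Note that ':' is significant while commas and parentheses are not, exactly as the earlier discussion of the parse tree requires.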

Edward

Edward K. Ream

Nov 24, 2019, 7:58:01 AM
to leo-editor
On Saturday, November 23, 2019 at 10:36:28 PM UTC-6, Edward K. Ream wrote:

> > It's just a bit tricky to look ahead in a generator.

>  Correction, it's amazingly tricky...For now, I'll just use helper lists:

Shortly after writing this, the last piece of this particular puzzle appeared.  There is no need for either generators or helper lists.

Doh!  The so-called generators are nothing but indices into the token list. There is no need for helper lists! peek and advance need only maintain an integer index into the token list.  This index monotonically advances, guaranteeing O(N) performance, as explained in more detail in the Post Script.

The code is still tricky. The following completely debugged functions are defined within the do_If method.

def is_if(token):
    return token.kind == 'name' and token.value in ('if', 'elif', 'else')

def find_next_if_token(i):
    # Careful: there may be no 'if', 'elif' or 'else' tokens in the token list.
    while i < len(self.tokens):
        if is_if(self.tokens[i]):
            break
        i += 1
    return i

_index = find_next_if_token(0)

def advance():
    nonlocal _index
    _index = find_next_if_token(_index + 1)

def peek():
    nonlocal _index
    # The possibility of an IndexError is a sanity check.
    return self.tokens[_index]

Similar functions appear in the all-important tog.sync method.
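Lifted out of do_If and made standalone (the Token class and the module-level globals here are stand-ins of mine, for illustration only), the pattern works like this:

```python
# Stand-in Token class, for illustration only.
class Token:
    def __init__(self, kind, value):
        self.kind, self.value = kind, value

# Tokens roughly corresponding to: ...: if x ... else: ...
tokens = [
    Token('op', ':'), Token('name', 'if'), Token('name', 'x'),
    Token('name', 'else'), Token('op', ':'),
]

def is_if(token):
    return token.kind == 'name' and token.value in ('if', 'elif', 'else')

def find_next_if_token(i):
    # Advance i to the next 'if'/'elif'/'else' token, or to len(tokens).
    while i < len(tokens):
        if is_if(tokens[i]):
            break
        i += 1
    return i

_index = find_next_if_token(0)

def advance():
    # The index only ever moves forward, guaranteeing O(N) behavior.
    global _index
    _index = find_next_if_token(_index + 1)

def peek():
    return tokens[_index]
```

Here peek() first reports the 'if' token; after one advance() it reports the 'else' token, which is exactly the lookahead do_If needs.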

Timing

The project is very fast. Indeed, TestRunner.run_tests reports the following stats for some of Leo's source files:

description: file: core..leoFind.py
 setup time: 0.17 sec.
  link time: 0.09 sec.

description: file: core..leoGlobals.py
 setup time: 0.58 sec.
  link time: 0.41 sec.

description: file: core..runLeo.py
 setup time: 0.00 sec.
  link time: 0.00 sec.

The setup time is the time to tokenize the file with Python's tokenize module, plus the time to parse the file with Python's ast module.

The link time is the time to traverse the entire tree. This is a figure of merit for the entire project.  As you can see, it is extremely fast.

Summary

Neither generators nor helper lists are required.  It suffices to define local helper functions within visitor methods.

These helper functions are non-trivial, but all will be variants on the debugged functions defined within do_If.

The entire traversal runs in time linear in the number of tokens and parse tree nodes.

Edward

P.S. The code demonstrates that the running time of the helper functions is O(N), where N is the number of tokens.  Indeed, do_If will call peek and advance a bounded number of times for each ast.If node in the tree. In fact, the upper bound is 2 calls to peek and 2 calls to advance.

Finally, the _index pointer monotonically increases, so the bound on the total number of iterations in find_next_if_token is N.

EKR

Edward K. Ream

Nov 24, 2019, 9:22:42 AM
to leo-editor
On Saturday, November 23, 2019 at 4:46:08 AM UTC-6, Edward K. Ream wrote:
Neither the results list nor the Linker class is needed! This is a stunning collapse in complexity.

To summarize recent work:

The project creates the absolute minimum of variable-length data structures:

- The token list, created by Python's tokenize module.

- The parse tree, created by Python's ast module.

- The node stack.  This stack saves and restores the node being visited.  It is essential: it ensures that the Python call stack contains a bounded number of entries, about 10 max.
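As a sketch of the idea (mine, not the actual TOG traversal), an explicit stack visits every node while keeping the Python call stack shallow, no matter how deeply the source is nested:

```python
import ast

def visit_all(tree):
    # An explicit node stack replaces deep recursion: the Python call
    # stack never grows beyond this one frame.
    stack = [tree]
    while stack:
        node = stack.pop()
        yield node
        # Push children in reverse so they are visited in source order.
        stack.extend(reversed(list(ast.iter_child_nodes(node))))

kinds = [n.__class__.__name__ for n in visit_all(ast.parse('x = 1\ny = 2\n'))]
```

The traversal is a single generator, and its memory use is bounded by the depth of the tree rather than by recursion limits.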

So the project makes absolutely minimal demands on the GC.  Moreover, the project is extremely fast.  It takes less time to create two-way links between the token list and the parse tree than it takes to create the token list and the parse tree!

Summary of the summary

The only significant global data structures are those mentioned above.

There is now only a single generator.  This generator traverses the parse tree without exploding Python's call stack.

Syncing tokens to parse tree nodes involves stepping an integer index through the token list using "peek" and "advance" functions.  At least two separate syncing operations are required:

- The do_If visitor uses peek and advance to determine whether to generate 'if' or 'elif' tokens.

- The upcoming tog.sync method will use peek and advance to associate significant tokens with their respective parse tree nodes.

All "peek" and "advance" functions will be local to their respective methods.

The only "products" of this project are the two-way links themselves.

Edward