ENB: Recent studies and thoughts

116 views
Skip to first unread message

Edward K. Ream

unread,
Dec 24, 2021, 10:36:21 AM12/24/21
to leo-editor

This Engineering Notebook post discusses recent studies and thoughts.

I slept all day yesterday, perhaps with a mild case of covid. I've been vaccinated and boosted, so I'm not too worried. I feel better this morning. Thank goodness for immune systems! And now I have an excuse if my writing sounds like ravings :-)

The goal of my thinking is to find my next project. Leo is complete as it is. Nothing urgently needs to be added. But studying other problems might suggest additions to Leo.

Creativity requires a juicy problem. Not necessarily a big problem, just a problem that seems interesting!

Type inference

If I could wave a magic wand, I would wish for a faster pylint. And maybe a faster mypy, though mypy is surprisingly fast. Both pylint and mypy attempt heavy-duty type inference, so I decided (once again) to try to understand Hindley-Milner type inference.

Previous attempts were unsuccessful. The notation is cryptic. This time around, google yielded much better results. I struck paydirt with a great lecture by Adam Doupe at Arizona State University. The second part (of the same lecture) is here.

I enjoyed the way Prof. Doupe would ask questions and patiently wait. It's the style of a relaxed, confident teacher. Very rare, in my experience. Here's what I learned:

First, the algorithm is just common sense! It simply traverses the parse tree (in any order), recording inferences along the way. Nothing could be simpler.

Second, the algorithm is more of a roadmap than a proper algorithm. The possible inferences depend on the language, and the algorithm omits any mention of the myriad housekeeping details that form the bulk of the actual code! 

Armed with my new confidence about type inference, I briefly reviewed the source code for mypy and pylint. Mypy and pylint follow this algorithm in very different ways! These programs are large and complex. There is no obvious way to speed them up!

Leonine linting

If improving pylint is out of the question, perhaps we could find a simple way to ensure that Leo's c, g, and p vars always mean what we think they should mean.

The simplest prototype of such an approach is to use cff to search for assignments. I used four cffs with these regexs:

\bg\b\s*= # g =

\bp[0-9]?\b\s*=\s*[A-Z][_\w]+\s*\( # p = Class name

\b[_\w]\s*=\s*(.*?)\bPosition\b    # var = Position

\b[_\w]\s*=\s*(.*?)\bCommands\b    # var = Commands

The results provide an intuitive view of how Leo defines c, g, and p!

The next step would be to transliterate these text patterns into searches on the parse trees. The transliteration should be straightforward! Indeed, one searches (in any order) for Assign, AugAssign, or AnnAssign nodes. Having found the assignment, one can traverse the subtree looking for c, g, p, Position, etc. Maybe one could even generalize the transliteration, but that seems unnecessary.

A puzzle

I'm relating my recent journey out of order. The first puzzle I set myself was to traverse (parse) trees using neither recursion (including ast.walk) nor generators. Yes, this puzzle seems quixotic, but I do like puzzles :-)

I don't remember exactly why I first wanted to solve this puzzle. I remember telling Rebecca that all the generators in leoAst.py seemed like the Hufflepuff way. I wanted a "Ravenclaw way :-) But this is a bit unfair to myself. There are lots of clever tricks in leoAst.py, as I documented years ago.

Anyway, after several days (!!) of thought, I saw the answer. Indeed, Leo already traverses outlines using neither generators nor recursion. The key is Leo's Position class.

Maybe the original problem was contrived, but the solution is notable. For any traversal, there is a Position class that yields that traversal. For example, Leo's c.all_positions generator is:

def all_positions(self, copy=True):
    c = self
    p = c.rootPosition()
    while p:
        yield p.copy() if copy else p
        p.moveToThreadNext()

p.moveToThreadNext defines "outline order" iteratively, without generators or recursion. Take a look. So for any desired traversal, we can define a generalized Position class with a new moveToThreadNext method. Do you see? 

moveToThreadNext might be complex. For example, for ast trees, moveToThreadNext would examine the type of each tree node, and traverse the tree accordingly. Note: Position's stack eliminates the need for recursion. The generalized stack must contain enough data to move through the entire tree.

moveToThreadNext (roughly) corresponds to all the visitors (generators) in leoAst.py. This approach is, at the least, a revealing change in point of view.

Aside: TokenOrderTraverser.traverse (in leoAst.py) corresponds exactly to p.moveToThreadNext. TOT.traverse manipulates a stack explicitly. So a generalized moveToThreadNext can be a function. Defining a generalized Position class is optional.

Summary

The Hindley-Milner algorithm is common sense dressed up in forbidding notation. One traverses the tree (in any) order, recording inferences.

For most languages, the Hindley-Milner algorithm is merely a roadmap. Bookkeeping is the hard part. Pylint and mypy do the bookkeeping in very different ways.

Straightforward scripts (transliterations of regexs) can probably check the types of all uses of c, g, and p within Leo's code base.

It is straightforward to define any conceivable tree traversal using neither recursion nor generators. Just define a generalized moveToThreadNext function. Optionally, moveToThreadNext can be a member of a generalized Position class.

That's all for today. Stay tuned.

Edward

Edward K. Ream

unread,
Dec 27, 2021, 7:30:04 AM12/27/21
to leo-editor
On Friday, December 24, 2021 at 9:36:21 AM UTC-6 Edward K. Ream wrote:

A puzzle

...The first puzzle I set myself was to traverse (parse) trees using neither recursion (including ast.walk) nor generators.

Wow. This is a tricky puzzle. All my instincts are wrong, wrong, wrong.

I was feeling smug yesterday, putting the finishing touches on a fairly simple approach, with simple visitors and a simple supporting infrastructure. Then I realized that the whole approach had no chance of working!

When I awoke this morning I had several new thoughts:

1. I would be happy enough with a two-pass algorithm. The first pass would insert parent/child and thread next/back links in some desired order. The second pass would "deliver" nodes using the thread next links.

2. The code that creates the links could visit the tree nodes in any order, provided that the links themselves imply the desired order. So I could use ast.walk as a prototype. (ast.walk can't be part of the actual solution because ast.walk uses recursion.)

3. There is no way typical node visitors could possibly work. Indeed, visiting in the desired order implies recursion. No clever  infrastructure can change that fact.

As I write this, I realize that the puzzle is even trickier than I thought.  Performing the desired actions in the correct order would be difficult even with correctly ordered threading links between nodes.  What we probably need are threading links that include actions.  Heh, I think something like these thoughts were behind the scheme that blew up so spectacularly yesterday.

What do I mean by actions, you ask?  Well, visitors don't just visit nodes. Visitors do something. In fact, a better way to state the problem is call specific functions/methods in the "proper" order.

Summary

Any solution must execute arbitrary actions in some desired order. An elegant solution seems tantalizingly close, but my usual instincts lead me astray.

Visitors are a dead end.  There is no way to execute visitors without simulating python's call (or generator) machinery. Instead, at minimum, we must have a table of desired traversal orders for actions, one entry per ast node type.

A multi-pass algorithm is acceptable. The doomed method I explored yesterday was very fast, even on an old laptop.

That's all for now. Anyone who thinks this is an easy problem is invited to solve it, hehe.

Edward

tbp1...@gmail.com

unread,
Dec 27, 2021, 8:10:15 AM12/27/21
to leo-editor
I'm not clear what problem you are trying to solve that isn't already being handled in Leo's code, but don't forget that there is depth-first traversal and also width-first traversal.  There has to be an immense body of literature out there on tree traversal, and much if not most of it probably does not depend on generators and recursion.  There is also lots of literature on traversing graphs, and that might be more suitable for some of Leo's structures that we often call trees but might better be termed graphs.

Edward K. Ream

unread,
Dec 27, 2021, 8:46:17 AM12/27/21
to leo-editor
On Monday, December 27, 2021 at 7:10:15 AM UTC-6 tbp1...@gmail.com wrote:
There has to be an immense body of literature out there on tree traversal, and much if not most of it probably does not depend on generators and recursion. 

The typical "algorithm" implies recursion.  For example, you can describe depth first traversal as:

- Visit the children.
- Visit the node.

The problem I'm trying to solve is to do implement this description iteratively.

Edward

Edward K. Ream

unread,
Dec 27, 2021, 9:44:09 AM12/27/21
to leo-editor
On Monday, December 27, 2021 at 6:30:04 AM UTC-6 Edward K. Ream wrote:

Summary

Any solution must execute arbitrary actions in some desired order. An elegant solution seems tantalizingly close, but my usual instincts lead me astray.

The solution is starting to take shape:

1. Replace visitors with an action dictionary.  Keys are node.__class__.__name__; values are lists of actions.

2. Actions are like closures: a binding of functions to arguments. But the actual args are not available in the table.  Instead, each action is a tuple (action_function, arg), where the arg will be a field name.

For example, we can replace the ast.Module visitor with:

   [(visit_list, 'body')]

As a more complex example, we could replace the TokenOrderGenerator.If visitor with something like:
[
  (visit, 'value'),
  (visit, 'test'),
  (sync_op, ':'),
  (visit, 'body'),
  (visit_if, 'orelse', (
    (sync_name, 'else'),
    (sync_op, ':'),
    (visit, 'orelse'),
  )),
]

Maybe you get the idea. The 'sync_op' and 'sync_name' actions are particular to the TokenOrderGenerator class, but the visit and visit_if actions are more general.

The visit action is the key. The present node is an ast.If node. Let's say the argument is 'value'. If node.value exists, the visit action will:

- Push the present node a node_stack.
- Make child = getattr(node, 'value') the present node.
- Push action_dictionary [child.__node__.__name__] on an action_stack.

The main iterative loop looks at the next action, which is action_stack[-1][0]

When action_stack[-1] becomes the empty list, the main loop will pop the action stack and the node stack, making the top of the node stack the present node. The traversal is complete when the action stack becomes empty.

Summary

The action_dictionary will contain the action list for each kind of ast node. Each action list will contain a list of actions.

Each action is a tuple (function, arg), where "arg" must be a string. This tuple is a kind of closure. More complex actions, like visit_if, may have the form: (function, arg, inner tuple)

The iterative main line must use at least a node_stack and an action stack. The visit action moves to the next node, pushing entries on the node_stack and action stacks. The main line pops both stacks when an action list becomes empty.

Probably other housekeeping details will be necessary. This whole scheme is a prototype.

Edward

tbp1...@gmail.com

unread,
Dec 27, 2021, 11:43:31 AM12/27/21
to leo-editor

Edward K. Ream

unread,
Dec 28, 2021, 9:57:49 AM12/28/21
to leo-editor
Thanks for the link. I wasn't clear enough about the problem I am trying to solve. I want to duplicate the TokenOrderGenerator class in leoAst.py without recursion or generators.

The graph theory literature may be vast, but it almost certainly does not include this exact problem. Anyway, I'm having fun with the puzzle, even if it may have been solved before :-)

Edward

vitalije

unread,
Dec 29, 2021, 12:13:29 PM12/29/21
to leo-editor

Thanks for the link. I wasn't clear enough about the problem I am trying to solve. I want to duplicate the TokenOrderGenerator class in leoAst.py without recursion or generators.

I don't think it is possible what you are trying to achieve. To totally avoid any kind of recursion, you must know some facts about the AST in advance, for example how deep it is. If you were able to know this information, you could write an algorithm without any kind of recursion. But that would limit what AST your code can process. If you wish that your could can accept any valid Python code, then you have to use some kind of recursion, because your code would operate on the recursive data structure.

Now, there are two kinds of recursion. One is when a recursive function calls itself or two or more functions call each other recursively. This kind of recursion can reach Python limit for the number of recursive calls or it can fill the stack. The other kind of recursion is using generators. Calling a function which yields returns immediately, before the first iteration has started. This means that the stack space is not going to be exhausted, and instead all necessary data for executing recursion is going to be in the main memory and not on the stack. Python generators are the Pythonesque  way of automatically transforming recursive code written in Python into the imperative code which is guaranteed to produce the same result as would the recursive one. This feature is implemented in C. Now, if you want to avoid using generators, theoretically you can transform recursive code which uses generators into the equivalent imperative code, but your implementation of such code written in Python must be much slower than the same code with generators. If you were able to write faster implementation in Python, that would mean that the C implementation of Python generators is not optimized, and that is highly unlikely because CPython source is pretty much optimized.

You said earlier that Position class is not using recursion, but it isn't entirely  true. The _stack field of the Position instances is where the recursion is hidden. One could say that by using _stack, Position class is just an implementation of the generators written in Python without using Python generators. This is the reason why I almost never use position methods for iteration and instead almost always write my own iterators using ordinary Python generators. In all my experiments and measurements using Python generators produced faster code than the equivalent code using position methods.

It is O.K. if you just want to have some fun to play around with these ideas, but you should know that you can't avoid recursion because the data is recursive, and generators actually produce the fastest code for solving recursive problems in Python. It wouldn't be an easy task to get faster solution even in C, because Python C sources are already well optimized.

My 2 cents.

So, if you are looking to improve the speed, generators are your best choice.

vitalije

unread,
Dec 29, 2021, 12:17:15 PM12/29/21
to leo-editor


> If you wish that your could can accept any valid Python code,
I meant to write: "If you wish that your code can accept any valid Python code"

tbp1...@gmail.com

unread,
Dec 29, 2021, 1:30:57 PM12/29/21
to leo-editor
I believe that it is always possible to replace recursion with lists, but since a generator can be looked on as a lazily evaluated list, you are essentially back to generators.

Edward K. Ream

unread,
Dec 29, 2021, 4:18:43 PM12/29/21
to leo-editor
On Wed, Dec 29, 2021 at 11:13 AM vitalije <vita...@gmail.com> wrote:

Thanks for the link. I wasn't clear enough about the problem I am trying to solve. I want to duplicate the TokenOrderGenerator class in leoAst.py without recursion or generators.

I don't think it is possible what you are trying to achieve. To totally avoid any kind of recursion, you must know some facts about the AST in advance, for example how deep it is.

Thanks for this comment. I am quite sure the approach I am using will work.  However, the working code will be ugly, and at least as complex as the present code in leoAst.py. I agree with Thomas that the final result would be a simulation of generators. For this reason I am considering abandoning the project.

However, I should at least explain why an iterative simulation is feasible.  Here is the top-level iterative loop:

def traverse(self, contents):
    """Completely traverse the ast tree for the given contents."""
    << define action_dict >>
    self.node = root = ast.parse(contents)
    action_list = self.action_dict.get(root.__class__.__name__, []).copy()
    self.stack = [StackEntry(None, action_list)]  # (parent_node, action_list)
    while self.stack:
        if action_list:
            action = action_list.pop(0)
            action.func(action.args)
        else:
            entry = self.stack.pop()
            self.node = entry.parent
            action_list = entry.action_list

Action functions execute in the context of a specific ast node. The most general action function is visit. Here is a slightly simplified version of the present code:

def visit(self, field):
    child = getattr(self.node, field, None)
    if not child:
        return
    if isinstance(child, (list, tuple)):
        # Do *not* change self.node.
        for z in child:
            self.stack.append(StackEntry(self.node, [Action(self.visit_list, z)]))
        return
    assert isinstance(child, ast.AST), repr(child)
    action_list = self.action_dict.get(child.__class__.__name__, None)
    if not action_list:
        if action_list is None:
            print(f"visit: Missing: {child.__class__.__name__}")
        return
    self.stack.append(StackEntry(self.node, action_list.copy()))
    # *Do* change self.node.
    self.node = child


Let us consider the difficulties of handling ast.If nodes. The TOG visitor, which uses generators, is:

def do_If(self, node):
    # Use the next significant token to distinguish between 'if' and 'elif'.
    token = self.find_next_significant_token()
    yield from self.gen_name(token.value)
    yield from self.gen(node.test)
    yield from self.gen_op(':')
    # Body...
    self.level += 1
    yield from self.gen(node.body)
    self.level -= 1
    #
    # Else and elif clauses...
    if node.orelse:
        self.level += 1
        token = self.find_next_significant_token()
        if token.value == 'else':
            yield from self.gen_name('else')
            yield from self.gen_op(':')
            yield from self.gen(node.orelse)
        else:
            yield from self.gen(node.orelse)
        self.level -= 1

As a prototype, we could consider the following general action list for any ast.If node.

'If': [
    Action(visit, 'value',),
    Action(visit, 'test'),
    Action(visit, 'body'),
    Action(visit, 'orelse'),
]

These actions will suffice to visit the subtree of any ast.If node (regardless of complexity) because these actions will be "bound" to an actual node during the traversal of the tree.

However, such general actions will not suffice to simulate all the complications in TOG.do_if.  Instead, we need, for almost every kind of ast node, a list of node-specific actions. Something like this;

'If': [
    Action(visit_if_value)
    Action(visit_if_test)
    Action(visit_if_body)
    Action(visit_if_else),
]

These action methods can generate the required tests and calls because they are aware that they are executing in the context of an ast.If node, just as the TOG.do_If visitor is.

Summary

Every action method executes in the context of a single, specific, ast node in the parse tree.  The algorithm creates new action methods as it traverses the tree.  Each action method handles exactly one child field of the parent ast node. There is no doubt in my mind that this scheme is sound.

General action methods suffice to traverse the tree, but we must have more node-specific actions to simulate each visitor of the TOG class. In effect, the node-specific action methods split each TOG visitor into pieces. The effect is to simulate generators (coroutines) "by hand."

Edward

Edward K. Ream

unread,
Dec 29, 2021, 4:24:55 PM12/29/21
to leo-editor
On Wed, Dec 29, 2021 at 11:13 AM vitalije <vita...@gmail.com> wrote:

It is O.K. if you just want to have some fun to play around with these ideas, but you should know that you can't avoid recursion because the data is recursive

I agree that you can't avoid stacks. However, stacks on the heap can be much larger than python's call stack.
generators actually produce the fastest code for solving recursive problems in Python.

I agree. Any simulation of coroutines will be significantly slower than generators.

In short, it is challenging to solve this puzzle, but the solution isn't going to make it into leoAst.py :-)

Edward
Reply all
Reply to author
Forward
0 new messages