Aha: How to find patterns in python parse trees

45 views
Skip to first unread message

Edward K. Ream

unread,
May 8, 2023, 9:16:08 AM5/8/23
to leo-editor
Yesterday I discovered a dead simple design pattern for finding code patterns in python parse trees.

This pattern is a milestone. It changes forever how I'll use python's ast module. Old techniques now seem like trying to swat flies with a sledgehammer!

The Aha is as surprising as the Aha that created @clean. I'll never forget it.

tl;dr: Convert parts of parse trees to strings using ast.unparse.

Background

I have been revisiting an old problem: how to analyze Leo's code. The previous state of the art was unsatisfactory. There were two approaches, each flawed:

1. Use regex patterns. This approach works for prototyping, but there is no way to ignore false matches in strings, docstrings, and comments.

2. Use parse trees. This approach is sound, but finding patterns in ast nodes was cumbersome.

How to get the best of both approaches?

Finally, my work with the rope project made me familiar with a new design pattern: small, special-purpose tree traversers based on ast.NodeVisitor.

The Problem

I wanted a program to find all chains of ast Attribute nodes within Leo. Examples are g.app.gui and c.frame.body.wrapper.widget.

I packaged this program as the TestChains.test_all_paths unit test in leo/unittests/test_design.py. This is a new file: see PR #3319.

Early attempts were clumsy, as before. ast.Attribute trees can contain much more than ast.Name and ast.Attribute nodes.

The Aha: use ast.unparse

I had vaguely planned to experiment with ast.unparse. Given an ast node/tree, ast.unparse returns a string with code that would produce an equivalent node/tree.

Aha: This string is exactly what we need to discover patterns in parse trees!!!

Here is the complete traverser class that discovers all chains:

class ChainsTraverser(NodeVisitor):
    chains_set = set()

    def visit_Attribute(self, node):
        """
        Add only top-level Attribute chains to chains_set.
        Do *not* call generic_visit!
        """
        chain = ast.unparse(node)
        self.chains_set.add(chain)


OMG! This is amazing:

- unparse returns the entire chain, no matter how complex.
- visit_Attribute handles only top-level attributes because it doesn't call ast.generic_visit.

A straightforward unit test (TestChains.test_all_paths) instantiates ChainsTraverser once, calls the traverser for most of Leo's core files, and prints/tests ChainsTraverser.chains_set. Nothing could be simpler.

Summary

Finding patterns in ast trees is straightforward:

- Use ast visitors to find the roots of ast trees containing potential patterns.
- Convert those roots to strings with ast.unparse.
- Use regexs to filter the results. See the PostScript.

This Aha changes my whole approach to analyzing ast trees.

Edward

P.S. TestChains.test_all_paths filters the resulting chains as follows:

array_pat = re.compile(r'(\[).*?(\])')
call_pat = re.compile(r'(\().*?(\))')
string_pat1 = re.compile(r"(\').*?(\')")
string_pat2 = re.compile(r'(\").*?(\")')
patterns = (array_pat, call_pat, string_pat1, string_pat2)

def filter_chain(s: str) -> str:
       
    def repl(m):
        return m.group(1) + m.group(2)

    for pattern in patterns:
        s = re.sub(pattern, repl, s)
    return s

This kind of post-processing would be impossible if the operands were ast trees.

Edward

Thomas Passin

unread,
May 8, 2023, 9:42:46 AM5/8/23
to leo-editor
I'm not surprised by this.  I found the same thing in developing my bookmarks manager.  Conceptually the structure of the bookmarks collection is tree-like, but it can be much easier to flatten that structure into text and use string methods on the flattened structure.

OTOH, there has been theoretical and (somewhat?) practical work on regular expressions for trees.  E.g.,


If you can convert the AST trees to XML form, then there are XSLT, XPATH, and perhaps best of all, XQuery.  And if you can convert trees into RDF graphs, some implementations of SPARQL can efficiently query and transform documents with millions of RDF triplets.

Edward K. Ream

unread,
May 8, 2023, 10:03:58 AM5/8/23
to leo-e...@googlegroups.com
On Mon, May 8, 2023 at 8:42 AM Thomas Passin <tbp1...@gmail.com> wrote:

there has been theoretical and (somewhat?) practical work on regular expressions for trees.  E.g.,


The Aha sidesteps the need for all such tools.

Edward

Edward K. Ream

unread,
May 9, 2023, 7:54:48 AM5/9/23
to leo-editor
On Monday, May 8, 2023 at 8:16:08 AM UTC-5 Edward K. Ream wrote:
Yesterday I discovered a dead simple design pattern for finding code patterns in python parse trees.

The  AnnotationsTraverser.test_annotation method revealed several surprising annotations. PR #3319 fixes some of them. test_annotation ignores (with comments) several special cases.

Surprisingly, these changes are likely to be the only result of the Aha. Once the technical problem disappeared it became clear that static checks of Leo's code are not likely to be of great value!

In short, the Aha has encouraged me to move on :-)

Edward

Thomas Passin

unread,
May 9, 2023, 10:32:54 AM5/9/23
to leo-editor
On Tuesday, May 9, 2023 at 7:54:48 AM UTC-4 Edward K. Ream wrote:
On Monday, May 8, 2023 at 8:16:08 AM UTC-5 Edward K. Ream wrote:
Once the technical problem disappeared it became clear that static checks of Leo's code are not likely to be of great value!

There has been many words written about how important static typing is, and how therefore languages like Python cannot really be good production languages.  To the contrary, I have read that only say 10% of bugs in actual working systems would have been caught by static type checking.  So yes, you can't ignore getting the types right, but that's only the beginning.

Edward K. Ream

unread,
May 10, 2023, 5:15:11 AM5/10/23
to leo-editor
On Tuesday, May 9, 2023 at 9:32:54 AM UTC-5 tbp1...@gmail.com wrote:

There has been many words written about how important static typing is, and how therefore languages like Python cannot really be good production languages.  To the contrary, I have read that only say 10% of bugs in actual working systems would have been caught by static type checking.  So yes, you can't ignore getting the types right, but that's only the beginning.

True, but consider codon. It's built on type checking!

Edward
Reply all
Reply to author
Forward
0 new messages