ENB: Theory of operation for Leo's importers

Edward K. Ream

Nov 30, 2021, 6:43:35 AM
to leo-editor
This Engineering Notebook post reviews the workings of Leo's importers so that I will be clear about the details as I revise the python importers.  The architecture of the importers is surprisingly clever, as I'll now explain.

Overview

gen_lines, the main importer loop, splits the incoming lines into nodes, allocating nodes as necessary.  gen_lines calls add_line to add a line to a node. The post-pass calls undent on each node to adjust leading whitespace.

Adding lines to nodes

Importers only ever add entire lines to nodes.  In other words, add_line never removes leading whitespace! This clever policy ensures that gen_lines only needs to detect where nodes begin and end, a major simplification!
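A toy sketch of this division of labor may help. It is not Leo's code: the dict-based nodes, toy_gen_lines, and the "start a node at every unindented class or def line" rule are assumptions made purely for illustration. It shows that, because add_line only ever appends whole lines, concatenating the nodes reproduces the source exactly.

```python
def add_line(node, line):
    """Append the entire line, leading whitespace included."""
    node['lines'].append(line)

def toy_gen_lines(lines):
    """Toy main loop (hypothetical): open a new node at each
    unindented class or def line, then add the line to the current node."""
    nodes = [{'headline': 'org', 'lines': []}]
    for line in lines:
        if line.startswith(('def ', 'class ')):
            nodes.append({'headline': line.strip().rstrip(':'), 'lines': []})
        add_line(nodes[-1], line)
    return nodes

source = [
    "import os\n",
    "\n",
    "def f():\n",
    "    return 1\n",
    "class A:\n",
    "    def g(self):\n",
    "        pass\n",
]
nodes = toy_gen_lines(source)

# No line was altered, so the nodes concatenate back to the source.
round_trip = ''.join(line for node in nodes for line in node['lines'])
```

Any whitespace adjustment is left to a separate pass over the finished nodes, which is why the main loop can stay this simple.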

Removing leading whitespace

The undent method adjusts a node's lws independently of gen_lines. The python importer overrides the base Importer.undent method (i.undent), which is complex, possibly buggy, and clearly unsuitable for python.

Py_Importer.undent (py_i.undent) removes the lws of the first non-blank line of the node from the node's lines. I shall soon change py_i.undent so that it never generates Leo's escape convention:

- It will "promote" underindented comment lines.
- It will cause unit tests to fail for any underindented non-comment line.

Note: neither i.undent nor py_i.undent is the same as textwrap.dedent!
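The difference from textwrap.dedent can be seen in a small sketch. The function undent_by_first_line below is hypothetical, not Leo's py_i.undent: it assumes space-only indentation and simply strips all lws from any underindented line, a simplification of the "promote" behavior described above. textwrap.dedent, by contrast, removes only the whitespace common to every non-blank line.

```python
import textwrap

def lws_width(line):
    """Number of leading spaces (tabs are ignored in this sketch)."""
    return len(line) - len(line.lstrip(' '))

def undent_by_first_line(lines):
    """Hypothetical sketch: remove the lws of the first non-blank line
    from every line; underindented lines lose all of their lws."""
    first = next((s for s in lines if s.strip()), '')
    n = lws_width(first)
    result = []
    for s in lines:
        if not s.strip():
            result.append(s)              # blank lines are kept as-is
        elif lws_width(s) >= n:
            result.append(s[n:])          # normal case: strip n spaces
        else:
            result.append(s.lstrip(' '))  # underindented: "promoted"
    return result

lines = [
    "    def f():\n",
    "        pass\n",
    "  # underindented comment\n",
]
undented = undent_by_first_line(lines)
dedented = textwrap.dedent(''.join(lines))
```

Here undented starts with "def f():" at column 0, while dedented strips only the two spaces shared by all three lines, so the def line keeps two leading spaces.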

Splitting lines into nodes

gen_lines splits lines into nodes, generating nodes as necessary. Unlike in other importers, indentation drives py_i.gen_lines. Here, a node's indentation means vnode_info[p.v]['indent']. Similarly for a node's kind.

Case 1: Organizer nodes: kind == 'org'

Organizer nodes contain lines outside of classes and defs. Organizer nodes also handle unusual indentation, including unusually indented class and def lines.

Rule 1: Organizer nodes never contain @others.  Naturally, their ancestor nodes could contain @others. So add_line and py_i.undent should just work for org nodes!

Rule 2: Org nodes never contain children.

gen_lines sets the indentation of an org node to the indentation of its first non-blank line.

- A class or def line whose lws is less than or equal to the org node's indentation will end the org node.

- A class or def whose lws is greater than the indentation of the org node must reside completely within the org node. This rule is likely the only way of handling unusual indentation!
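The two bullets above reduce to a single comparison. The helper below is a hypothetical sketch, not part of Leo: it assumes space-only indentation and recognizes only plain "class " and "def " prefixes (no "async def").

```python
def lws_width(line):
    """Number of leading spaces (tabs are ignored in this sketch)."""
    return len(line) - len(line.lstrip(' '))

def ends_org_node(line, org_indent):
    """True if a class or def line is indented no deeper than the org
    node, so the org node ends; deeper lines stay inside the org node."""
    is_class_or_def = line.lstrip(' ').startswith(('class ', 'def '))
    return is_class_or_def and lws_width(line) <= org_indent
```

For an org node with indentation 0, an unindented def ends the node, while a def indented under an if statement remains part of it.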

Case 2: class nodes: kind == 'class'

Most class definitions will occur outside of org nodes. All class nodes will contain an @others directive. The first non-blank line within the class determines:

- the lws of the @others directive and
- the indentation of the class node!

Rule 3: An org node must contain the entire range of an indented class or def that appears outside the range of any class or def node.

As a consequence of python's syntax rules, a parent org node must already exist. For example, an indented def or class line would be invalid syntax unless it were already contained in a (top-level) complex statement such as 'if', 'while', etc.

Case 3: def nodes: kind in ('function', 'method')

Rule 4: def lines without lws will generate function nodes.

Rule 5: Indented def lines appearing with the same indentation as a parent 'class' node will generate method nodes.

Rule 6: Indented def lines appearing at a greater indentation than a parent class node will be included within the containing method node. Imo, this is the only reasonable way of handling inner function definitions.

Rule 7: def lines appearing at a lesser indentation than a parent class node will terminate the class node. In most cases, the def line will then become a method of another parent class node. Per rule 3, if there is no such class node, the def line must be allocated to an already existing parent org node.
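Rules 4 through 7 can be summarized as a comparison between the lws of the def line and the indentation of the parent class node. The function classify_def and its return labels are hypothetical names chosen for this sketch; the real importer creates or closes nodes rather than returning strings.

```python
def classify_def(def_lws, class_indent=None):
    """Hypothetical sketch of rules 4-7.
    def_lws: width of the def line's leading whitespace.
    class_indent: indentation of the parent class node, or None
    when there is no parent class node."""
    if def_lws == 0:
        return 'function'        # Rule 4: no lws
    if class_indent is None:
        return 'in-org-node'     # Rule 3: belongs to an existing org node
    if def_lws == class_indent:
        return 'method'          # Rule 5: same indentation as the class node
    if def_lws > class_indent:
        return 'inner-def'       # Rule 6: stays within the containing method
    return 'ends-class'          # Rule 7: terminates the class node
```

After an 'ends-class' result, the same def line would be classified again against the next enclosing class node, if any.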

Summary

add_line and undent should work for all nodes, regardless of the node's kind. Happily, the vnode_info dict is available to all methods of the post-pass, if special cases are necessary.

gen_lines assigns an indentation to all generated nodes:

- For org nodes, the indentation is the lws of the first non-blank line.

- For class and def nodes, the indentation is the lws of the first non-blank line following the class or def line.
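As a minimal sketch of these two bullets, assuming space-only indentation and that a class or def node's first line is the class or def line itself (node_indent is a hypothetical helper, not Leo's API):

```python
def lws_width(line):
    """Number of leading spaces (tabs are ignored in this sketch)."""
    return len(line) - len(line.lstrip(' '))

def node_indent(kind, lines):
    """Indentation assigned to a node of the given kind.
    For org nodes, scan from the first line; for class and def nodes,
    skip the class or def line and scan the lines that follow it."""
    body = lines if kind == 'org' else lines[1:]
    for line in body:
        if line.strip():         # first non-blank line decides
            return lws_width(line)
    return 0                     # no non-blank line found
```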

Unindented class or def lines always generate top-level class or function nodes.

Indented class lines generate class nodes if their lws match the indentation of a parent class node. Otherwise, the class must appear within an already existing org node.

Indented def lines generate method nodes if their lws match the indentation of a parent class node. Otherwise, the def line will appear in an enclosing function or method node. As a last resort, the def line must appear within an already existing org node.

These complex rules are likely buggy. I'll revise them as needed.

Edward

tbp1...@gmail.com

Nov 30, 2021, 9:21:35 AM
to leo-editor
Would it be feasible to use Python's tokenizer?  That would eliminate the problem of unusual whitespace, as well as continuation lines, in a perfectly Python-compatible way.


Edward K. Ream

Nov 30, 2021, 10:41:39 AM
to leo-editor
On Tuesday, November 30, 2021 at 8:21:35 AM UTC-6 tbp1...@gmail.com wrote:
Would it be feasible to use Python's tokenizer?  That would eliminate the problem of unusual whitespace, as well as continuation lines, in a perfectly Python-compatible way.

The short answer is: feasible, yes, useful, no.

Edward

Edward K. Ream

Nov 30, 2021, 10:49:27 AM
to leo-editor
On Tuesday, November 30, 2021 at 5:43:35 AM UTC-6 Edward K. Ream wrote:
This Engineering Notebook post reviews the workings of Leo's importers so that I will be clear about the details as I revise the python importers. 

After a bit more thought, here is a higher level summary:

1. All incoming lines will have an ancestor that is not the top-level (kind == 'outer') node. This means that the outer node will contain only the following text:

@others
@language python
@tabwidth -4

One could imagine different tabwidth values, but in practice -4 will suffice.

2. Crucially, no complex adjustments of indentation values are needed.  The lws of @others directives, combined with the rules for creating nodes, ensure that this is so.

In other words, gen_lines need only consider the relation between the indentation of the present line and the indentation of the relevant parent node.

That's all!  The rules described today are likely still valid, but the above two rules should guide the remaining coding effort.

Edward

Edward K. Ream

Nov 30, 2021, 4:53:00 PM
to leo-editor
A slightly longer answer.  The front end of Leo's importers is solid. gen_lines calls scan_line to determine the scan state. For python, a line is "in a state" if the end of the line is in a string or docstring. In that case, the python importer simply adds the line to the present node.
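To make "in a state" concrete, here is a minimal sketch of such a scanner. It is not Leo's scan_line: the name scan_line_sketch, its signature, and its return value are assumptions for illustration, and it ignores string prefixes and backslash line continuations. It returns the string delimiter that is still open at the end of the line, or '' if none is.

```python
def scan_line_sketch(line, context=''):
    """Hypothetical scanner. context is the delimiter open at the start
    of the line (''' or \"\"\"), or '' when outside any string."""
    i = 0
    while i < len(line):
        if context:
            if line[i] == '\\':
                i += 2                   # skip an escaped character
            elif line.startswith(context, i):
                i += len(context)        # the string closes here
                context = ''
            else:
                i += 1
        elif line[i] == '#':
            break                        # the rest of the line is a comment
        elif line.startswith(('"""', "'''"), i):
            context = line[i:i + 3]      # a triple-quoted string opens
            i += 3
        elif line[i] in '"\'':
            context = line[i]            # a single-quoted string opens
            i += 1
        else:
            i += 1
    # Simplification: single-quoted strings never span lines.
    return context if len(context) == 3 else ''

state = ''
states = []
for s in ['x = """start\n', 'still inside # not a comment\n', 'end """ # done\n']:
    state = scan_line_sketch(s, state)
    states.append(state)
```

A main loop in this style would add any line whose incoming state is non-empty straight to the present node, without looking for class or def lines.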

The difficulties lie in the back end, not the front end. gen_lines assigns lines to nodes (creating new nodes as needed), based on a combination of:

- The indentation and 'kind' fields of node p and all its ancestors.
- The indentation of the to-be-assigned line.

The difficulty is that of an explosion of cases. In this respect (and others) the python importer resembles a compiler's code generator.

I think I have discovered a way to drastically reduce the number of "code generation" cases.  Perhaps later today.

Edward

tbp1...@gmail.com

Nov 30, 2021, 5:21:00 PM
to leo-editor
I've been thinking that the import operation sounds like it could be viewed as a transformation from a python file to a python file with sentinels (and altered indentation).  The node construction would be replaced by nesting of node sentinels.  When the transform is done, the file could be loaded like any other file with sentinels.

It's not easy to tell if this would be a simpler approach without actually trying it, though!

Edward K. Ream

Dec 1, 2021, 11:46:56 AM
to leo-editor
On Tue, Nov 30, 2021 at 4:21 PM tbp1...@gmail.com <tbp1...@gmail.com> wrote:
I've been thinking that the import operation sounds like it could be viewed as a transformation from a python file to a python file with sentinels (and altered indentation).  The node construction would be replaced by nesting of node sentinels.  When the transform is done, the file could be loaded like any other file with sentinels.

As in xslt?  It's a strange thought.  Happily, everything is now working!

Edward