ENB: About the python importer


Edward K. Ream

Nov 23, 2021, 6:52:18 AM
to leo-editor
According to PR #2331, I started work on the new python importer 9 days ago.  This Engineering Notebook post will discuss what I have done and the remaining difficulties.

vnode_info dictionary

All importers now use a vnode_info dict instead of injecting the _import_lines ivar into vnodes.  Keys are vnodes; values are inner dictionaries.

The inner dictionary contains at least one key/value pair:

    "lines": <list of lines for the vnode>.

VNodes use slots, so the vnode_info dict slightly reduces the descriptor memory required in all vnodes. More importantly, the vnode_info dict allows the python importer to store additional key/value pairs in the inner dictionary.
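
A minimal sketch of the data structure described above. The VNode class here is a hypothetical stand-in, not Leo's real class; only the vnode_info name and the "lines" key come from the post, and the 'kind' key is taken from the python importer's inner dict described later.

```python
class VNode:
    """Hypothetical stand-in for Leo's VNode; the real class has many more slots."""
    __slots__ = ('h',)

    def __init__(self, h):
        self.h = h  # The headline.

# Keys are vnodes; values are inner dictionaries.
vnode_info = {}

root = VNode('@file example.py')
child = VNode('class Example')

# Every inner dict has at least a 'lines' entry.
vnode_info[root] = {'lines': ['@others\n']}
vnode_info[child] = {'lines': ['class Example:\n', '    pass\n']}

# An importer can add its own keys without touching the vnode itself.
vnode_info[child]['kind'] = 'class'
```

Because the extra data lives in the dict rather than on the vnode, no per-importer slot has to be declared on every vnode.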

Stackless python importer

Previously, all importers, including the python importer, used a stack that mirrored the structure of the imported nodes that the importers created.  Keeping the stack in sync with created nodes is tricky. Aha! Maybe the stack isn't needed! The vnode_info dict may suffice.  The python importer uses an inner dict with these keys:

{
     '@others': <True: lines contains @others>,
     'indent': <The node's indentation, see below>,
     'kind': <one of 'outer', 'org', 'class', 'def'>,
     'lines': <list of lines for the vnode>,
}

Instead of getting these values from the stack, the importer will get them from the generated nodes.  For example, in the main importer loop the p var points at the node being generated. So info_dict[p.parent().v] contains the data for p's parent and info_dict[p.back().v] contains the data for p's previous sibling, if any.
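
The lookups above can be sketched as follows. This is only an illustration under stated assumptions: the Node class is a hypothetical stand-in for a Leo position (here p.v is simply the node itself), and new_node is an invented helper, not part of the importer.

```python
class Node:
    """Hypothetical stand-in for a Leo position with parent() and back()."""

    def __init__(self, h, parent=None):
        self.h = h
        self.v = self  # Assumption: in Leo, p.v is the position's vnode.
        self._parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def parent(self):
        return self._parent

    def back(self):
        """Return the previous sibling, or None if there is none."""
        if self._parent is None:
            return None
        siblings = self._parent.children
        i = siblings.index(self)
        return siblings[i - 1] if i > 0 else None

info_dict = {}

def new_node(h, parent, kind, indent):
    """Hypothetical helper: create a node and its inner dict together."""
    p = Node(h, parent)
    info_dict[p.v] = {'@others': False, 'indent': indent, 'kind': kind, 'lines': []}
    return p

root = new_node('@file x.py', None, 'outer', 0)
cls = new_node('class A', root, 'class', 0)
m1 = new_node('def f', cls, 'def', 4)
m2 = new_node('def g', cls, 'def', 4)

# No stack: ask the generated nodes themselves.
p = m2
parent_kind = info_dict[p.parent().v]['kind']
back_indent = info_dict[p.back().v]['indent'] if p.back() else None
```

The outline itself plays the role of the stack, so there is nothing to keep in sync.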

I think this new organization will work, but there are no guarantees. If necessary, I'll revert to the old stack-based architecture, with all of its complexities.

The python importer is inherently complex

Aha! The python importer is intrinsically at least as complex as the javascript importer, and perhaps more so! This complexity has been quite a shock!

How can this be? Doesn't python impose strict standards for indentation and structure?

Strangely indented lines

Alas, the answer is "yes and no." :-)  Most of the time python classes, methods, and functions follow a simple format.  But not always!  For example, the following is a valid python program! Try it!

if 1:
 print('indent 1')
if 2:
  print('indent 2')
if 3:
   print('indent 3')
if 4:
    print('indent 4')
if 5:
     print('indent 5')

Who would do such a thing, you ask?  Well, mypy unit tests, for one. Those unit tests contain other strange (valid!) constructions.
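
One way to confirm that such code is valid without running it is to compile it and list the INDENT tokens. This is a minimal sketch using only the standard library; the source string is a shortened copy of the example above.

```python
import io
import tokenize

source = (
    "if 1:\n"
    " print('indent 1')\n"
    "if 2:\n"
    "  print('indent 2')\n"
    "if 3:\n"
    "   print('indent 3')\n"
)

# compile() raises SyntaxError (or its subclass IndentationError) for invalid source.
compile(source, '<example>', 'exec')

# The string of each INDENT token is the leading whitespace that opened the block.
indent_widths = [
    len(tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type == tokenize.INDENT
]
```

Each "if" opens its own block, so each block is free to choose its own indentation.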

Furthermore, one could replace the "print" statements above with "class" or "def" statements, and one could imagine similar strange "if" statements within the range of a class definition!

Important: strangely-indented lines can only happen within the range of compound statements such as "if", "for", "while", and "with".  But "class" and "def" statements are also compound statements in this sense!  It's quite a mess.

Keeping track of indentation

In short, the python importer cannot assume anything about what indentation may be in effect in the range of a class definition!

As noted above, the python importer assigns a kind to each generated vnode. The valid (string) values are outer, org, class, and def. Hmm. As I write this, perhaps the importer should use "method" and "function" kinds instead of the generic "def" kind.

The "org" kind should allow the python importer to handle strangely-indented lines. Indeed, python does not allow complete chaos! For example, the following is a syntax error:

class Class1:
    def method1():  # 4-space indentation
        pass  # 8-space indentation.
      def method2():  # 6-space indentation.
          pass

Python gives this error:

    def method2():  # 6-space indentation.
                                          ^
IndentationError: unindent does not match any outer indentation level

That is, the first statement in the range of the class determines the allowed indentation for all other statements of the class, including compound statements.  Presumably, the 'indent' value for "class" nodes will be the allowed indentation, but perhaps the vnode_info dict should contain two indent-related keys.  See below.
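
The difference between odd-but-valid and invalid indentation can be checked with compile(). A minimal sketch: check_source is a hypothetical helper, and good_source is an invented variant of the example above in which both methods use the same 6-space indentation.

```python
bad_source = (
    "class Class1:\n"
    "    def method1():\n"
    "        pass\n"
    "      def method2():\n"  # 6 spaces: matches neither 0, 4, nor 8.
    "          pass\n"
)

good_source = (
    "class Class1:\n"
    "      def method1():\n"  # 6 spaces, set by the first statement.
    "        pass\n"
    "      def method2():\n"  # 6 spaces again: consistent.
    "          pass\n"
)

def check_source(source):
    """Hypothetical helper: return None if source compiles, else the error's class name."""
    try:
        compile(source, '<example>', 'exec')
        return None
    except SyntaxError as e:  # IndentationError is a subclass of SyntaxError.
        return type(e).__name__

bad_result = check_source(bad_source)
good_result = check_source(good_source)
```

Note that the bodies of the two methods in good_source are indented differently from each other, which is fine: each "def" opens its own block.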

Underindented lines

A further complication involves so-called underindented lines, that is, lines that Leo cannot represent properly using the natural node structure.  Leo uses an ugly escape convention to represent such lines.  Most Leonistas probably have never seen the escape convention, but Leo does support it.

At present, the python importer's perfect-import check allows leading whitespace to be added to otherwise underindented comment lines (only). Imo, adding this extra whitespace is preferable to using the underindented convention, but I might change my mind.

Removing common leading whitespace

Importer.undent removes leading whitespace from generated nodes.  i.undent calculates the greatest leading whitespace in the entire node and removes this whitespace from all lines of the node, inserting the underindented escape sequence as necessary!

The python importer will likely override i.undent (python_i.undent) so as to never insert the underindented escape sequence. Perhaps textwrap.dedent can be used, but that assumes that all strangely-indented nodes are under the range of an `@others` directive that is indented by exactly the amount that textwrap.dedent will (eventually) remove!
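
A short sketch of what textwrap.dedent does and does not do. It only illustrates the standard-library function; it is not the importer's undent method, and the sample lines are invented.

```python
import textwrap

# Strangely-indented lines under a 4-space base indentation.
strange = (
    "    if 1:\n"
    "     print('indent 1')\n"
    "    if 2:\n"
    "          print('indent 2')\n"
)
# dedent removes only the whitespace common to all lines,
# so the relative (strange) indentation survives.
dedented = textwrap.dedent(strange)

# An underindented comment limits what dedent can remove.
underindented = (
    "    def f():\n"
    "  # underindented comment\n"
    "        pass\n"
)
# Only 2 spaces are common to all lines, so 'def' keeps 2 leading spaces.
partly_dedented = textwrap.dedent(underindented)
```

In the second case the amount removed is set by the comment line, not by the "def" line, which is why the amount removed must agree with the indentation of the enclosing `@others` directive.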

So there are a lot of constraints involved in generating nodes!

Aha! The post pass can use the vnode_info dict

As I write this, I see that the vnode_info dict has another advantage over the stack-based architecture. The vnode_info dict is available to the (possibly overridden) undent method. Perhaps the vnode_info dict might have two indentation-related keys. We shall see.

Summary

Surprisingly, the python importer is inherently the most complex importer of all.

Organizer nodes will allow the importer to handle even the most bizarre strangely-indented nodes.  However, generating the necessary organizer nodes has stumped me for several days. The task is far from easy.

The base Importer class defines the architecture of all importers. There is no need to improve this architecture! In particular, the line-by-line nature of the gen_lines method ensures that all importers, including the python importer, will be close to as fast as possible. There is no need to worry about the speed of the python importer!

To sum up: the task is to ensure the perfect import of all valid python programs, regardless of indentation quirks.

Edward

P.S. As I write this I see that the underindented escape convention seems not to be documented.  Searching for "underindentEscapeString" in leoPy.leo will show the relevant code.

EKR

tbp1...@gmail.com

Nov 23, 2021, 8:22:50 AM
to leo-editor
After reading this, a few things came to mind that I hadn't thought about before.  The big one is what should the importer do when finding incorrect Python code, or at least incorrect whitespace?  Should it correct it - at least try to fix up the whitespace? Refuse to complete the import?  One thing might be to treat it as if the user had typed in the text - refuse to write back to the external file until all the errors get fixed, but save the outline when asked.  

Another point is what the importer should do about mixed leading indentation - tabs and spaces together.  Should it convert tabs to spaces?  Presumably it should because that's what would happen when a user tried to type in the same text.  I don't know what the current importer does here.

The few times in the past that I've written little Python importers, I have tried different tactics.  The most general way was to handle a mix of tabs and spaces by using the count of space and tab characters even if their order changed. That would amount to changing the tabs to single spaces, for the purposes of identifying new indentation - I used four spaces per tab for output.  I used the first line that contained a new indent as the template for that indentation level.  It worked pretty well most of the time and wasn't too hard to code.

The easier but less general way was to just replace all tabs with four spaces.  If the original file's editor used a different number of spaces for a tab, it might not have worked so well, although one could build in a little slop, so that if an indentation were say one space over or under it would be accepted (and fixed).
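
The two tactics above might look roughly like this. This is a sketch under assumptions, not the code actually used: the function names and the tab_width parameter are invented for illustration.

```python
def lws_count(line):
    """Tactic 1: count leading spaces and tabs, each tab counting as one character."""
    return len(line) - len(line.lstrip(' \t'))

def lws_expanded(line, tab_width=4):
    """Tactic 2: replace each leading tab with tab_width spaces, then count."""
    n = lws_count(line)
    return len(line[:n].replace('\t', ' ' * tab_width))

# The same mix of whitespace characters in a different order.
a = "\t  x = 1\n"
b = "  \tx = 1\n"

counts = (lws_count(a), lws_count(b))
expanded = (lws_expanded(a), lws_expanded(b))
```

Both tactics treat a and b as having the same indentation, because neither depends on the order of the tab and space characters.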

This reminds me that I have never looked into exactly how Python figures out the whitespace.  Might be interesting.

Edward K. Ream

Nov 23, 2021, 11:13:09 AM
to leo-editor
On Tue, Nov 23, 2021 at 7:22 AM tbp1...@gmail.com <tbp1...@gmail.com> wrote:
> After reading this, a few things came to mind that I hadn't thought about before.  The big one is what should the importer do when finding incorrect Python code, or at least incorrect whitespace?

Iirc, we can assume that there is consistent whitespace.  I don't recall whether the importer corrects the whitespace or refuses to continue.

Edward