Engineering notebook: code generation for new importers

Edward K. Ream

Oct 30, 2016, 12:25:18 PM
to leo-editor
Here, I'll be thinking out loud about how to make code generation work when using the new-style importers.

This is an engineering notebook post.  Feel free to ignore.

importers/python.py contains both the old and new python importers. The new_scanner switch enables the new importer. The new python importer fails miserably when new_scanner is True. Everything works as before (including all unit tests) when new_scanner is False.

Background

In another thread I wrote

If we were to convert the Python importer to use the new scheme, the entire ScanState class would have to be rewritten.  The reason should be clear--Python uses indentation levels to indicate structure, not curly brackets.

Rev 9755cf introduces the PythonScanState class.  It also moves the scan_block method out of the ScanState class and into the BaseLineScanner (BLS) class where it belongs.

The PythonScanState class is surprisingly simple. In particular, it handles backslash-newlines more simply than does the old-style importer. This is tricky to get exactly right.

Happily, rewriting the ScanState class is all that would be required.  The BLS class would remain completely unchanged, and the importer would be just as simple as the perl and javascript importers.

This statement was wildly optimistic. It has gradually dawned on me that there are serious problems with the code generation in the BLS class.
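
To make the indentation point concrete, here is a minimal sketch of the kind of state a Python-aware scanner has to track.  It is not the actual PythonScanState class: the class name, the update method, and the treatment of continuation lines are all assumptions made for illustration, and the sketch ignores strings and brackets entirely.

```python
class IndentScanState:
    """Hypothetical stand-in for a Python scan state (not Leo's PythonScanState)."""

    def __init__(self):
        self.indent = 0          # indentation of the current logical line
        self.continued = False   # True if the previous line ended with a backslash

    def update(self, line):
        """Scan one physical line and return the indentation now in effect."""
        text = line.rstrip('\n')
        stripped = text.strip()
        is_code = bool(stripped) and not stripped.startswith('#')
        if self.continued:
            # This line continues the previous logical line, so its own
            # leading whitespace says nothing about structure.
            self.continued = text.endswith('\\')
        elif is_code:
            self.indent = len(text) - len(text.lstrip(' '))
            self.continued = text.endswith('\\')
        # Blank lines and comment-only lines leave the state unchanged.
        return self.indent


lines = [
    'def f():\n',
    '    x = 1 + \\\n',
    '2\n',
    '# comment\n',
    '    return x\n',
]
state = IndentScanState()
levels = [state.update(line) for line in lines]
print(levels)
```

The continuation line and the comment both start at column 0, yet neither changes the indentation that the scanner reports.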

Code Generation

Code generation for javascript is easier than for python because nodes may contain multiple section references.  For the python (and perl) importers, only one @others directive is allowed per node.  This has important implications. The entire algorithm for breaking the input file into nodes may have to be revised.

As a practical matter, I have found the block scanning and rescanning code to be almost impossible to understand.  This is surprising, but not distressing.  The algorithm was always going to be complex.

I have derided the old-style importers as way too complicated.  I may have to revise that assessment :-)

The great advantage of the old-style code generators is that they handle indentation correctly in all situations.  In particular, they handle underindented python comment lines properly.  Such comments do not terminate defs or classes.  I am willing to add extra indentation for such lines (with a warning), but even doing that has repercussions throughout the code.
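
For readers who have not met this case, here is a small example of an underindented comment (the function name is made up for illustration).  Python ignores comment-only lines when it tracks indentation, so the def continues after the comment:

```python
def spam():
    x = 1
# An underindented comment.  Python ignores comment-only lines when
# tracking indentation, so the def continues on the next code line.
    return x + 1

print(spam())
```

An importer that ends a block at the first line with less indentation than the def would cut spam short at the comment, which is exactly the failure that must be avoided.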

I plan to study the old code generators today, to remind myself how they work. But before doing that, let's see what the code generators must do.  In fact, the answer is relatively straightforward.  Each generated node, including the top-level node, will look like this:

    One or more leading lines
    @others, indented as discussed below
    zero or more trailing lines

The top-level node will be

    @language python
    @others

Nodes that have no children will consist only of the properly indented body of the class or def.  This indentation depends on the cumulative indentation of all @others nodes in the node's parents.

Nodes that do have children are the hard case.  To repeat, they will look like:

    One or more leading lines
    @others, properly indented
    zero or more trailing lines

There are three problems that must be solved completely:

1. Determining the leading lines.
2. Determining the indentation of the @others directive.
3. Determining the trailing lines.

None of these tasks is trivial.  Furthermore, the post pass may move lines around from the end of one block to the start of the next. Alas, this could affect the proper indentation of the @others directive!
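
The three tasks above can be sketched as follows.  This is only an illustration under simplifying assumptions: the function names are hypothetical, the caller already knows which lines belong to the children, indentation uses spaces only, and @others simply takes the indentation of the first child line.

```python
def lws(line):
    """Return the width of the leading whitespace (spaces only, for simplicity)."""
    return len(line) - len(line.lstrip(' '))


def split_parent(lines, first_child, end_children):
    """
    Split a parent's lines into the three parts of a generated node.

    lines[first_child:end_children] are the lines that move to child nodes.
    Returns the parent's body lines and the indentation of @others.
    """
    leading = lines[:first_child]
    trailing = lines[end_children:]
    # Assumption: @others takes the indentation of the first child line.
    ref_indent = lws(lines[first_child])
    body = leading + [' ' * ref_indent + '@others'] + trailing
    return body, ref_indent


source = [
    'class MyClass:',
    '    """Docstring."""',
    '    def a(self):',
    '        pass',
    '    def b(self):',
    '        pass',
    '    x = 1',
]
body, indent = split_parent(source, 2, 6)
print('\n'.join(body))
```

If a post pass later moves lines across the first_child boundary, the indentation computed here may no longer be right, so it would have to be recomputed after the move.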

The way forward

Clearly, the new-style code generators can do as well as the old code generators. In fact, the task of the new-style generators is easier than for the old-style code generators because the new code generators work on whole lines.

In the worst case, the new importers can simply mirror the old code generators. Having said that, doing code generation the "old" way may require a complete rewrite of the code that allocates lines to nodes. Happily, adapting the old code generators to a line-oriented scheme must surely simplify them.

Summary

Code generation is much more challenging than I first imagined.

The ScanState class is not the problem.  It is a brilliant invention, if I do say so myself. It completely eliminates the need to parse the imported language. It will remain a foundation of the BLS class.

Much of the BLS class may have to be rewritten, including BLS.scan and many of its helpers.

The new code generators may be based on the old. No changes whatever will be tolerated in the old code generators.  Instead, I'll copy any needed code from the BaseScanner class to the BLS class.

Rewriting the old generators to work with the line-by-line scanner will simplify them. I relish such tasks.

The BLS class is a fundamentally important part of Leo. It should be used for all of Leo's importers.  It is worth any amount of work to make the new importers as beautiful and accurate as possible.

Edward

Edward K. Ream

Nov 1, 2016, 12:08:56 PM
to leo-editor
On Sunday, October 30, 2016 at 11:25:18 AM UTC-5, Edward K. Ream wrote:

Here, I'll be thinking out loud about how to make code generation work when using the new-style importers.

This is also an Engineering Notebook post.  Feel free to ignore.

Status Report

As expected, a line-oriented approach simplifies much of the code.  However, keeping the details straight is still challenging.  Recursive parsing is tricky, but I have lots of experience with such code.

Code generation appears to drive all aspects of the code, including parsing. The simplest thing that could possibly work may be:

1. Ref lines (@others or section references) will never be indented in the top-level nodes.
2. All other ref lines will be indented by the amount implied by @tabwidth.
3. When computing the text of a block, all lines will be unindented by the cumulative value of all ancestor refs.  Underindented lines will be represented by Leo's (ugly) underindented escapes, possibly only for "strict" languages like python.

The advantage of this scheme is that it should usually (always?) produce a perfect import of all lines. The disadvantage is that it could produce lots of escapes.
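
A minimal sketch of the three rules, assuming space-only indentation.  The names TAB_WIDTH, ref_indent, cumulative_indent and unindent_block are hypothetical, and the escape marker is a placeholder rather than Leo's real underindent escape syntax.

```python
TAB_WIDTH = 4  # stands in for the value implied by @tabwidth


def ref_indent(level):
    """Rules 1 and 2: no indentation at the top level, one tab width below it."""
    return 0 if level == 0 else TAB_WIDTH


def cumulative_indent(depth):
    """Sum the ref indentation of all ancestors of a node at the given depth."""
    return sum(ref_indent(level) for level in range(depth))


def unindent_block(lines, cumulative, escape='<underindent:%d>'):
    """Rule 3: strip `cumulative` columns, marking lines that have fewer."""
    out = []
    for line in lines:
        if not line.strip():
            out.append('')  # blank lines carry no indentation
            continue
        width = len(line) - len(line.lstrip(' '))
        if width >= cumulative:
            out.append(line[cumulative:])
        else:
            # Placeholder marker, NOT Leo's actual escape: it records how
            # many columns the line falls short of the expected indentation.
            out.append(escape % (cumulative - width) + line.lstrip(' '))
    return out


method = [
    '    def spam(self):',
    '        x = 1',
    '# underindented comment',
    '        return x',
]
# A method node sits below the top-level node (depth 0) and a class node (depth 1).
result = unindent_block(method, cumulative_indent(2))
print('\n'.join(result))
```

Only the comment line picks up an escape.  The other lines lose exactly the four columns contributed by the class node's @others.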

Alternatively, it would be possible to eliminate all escapes by looking ahead to see whether indenting a ref line is possible.  This will eliminate escapes, but could cause generated code to look like:

    class MyClass:
    @others # no indentation.

This will "penalize" all methods that are properly indented.  All lines of properly indented nodes will start with "extra" whitespace.

This scheme would require three-pass code generation.  The first pass produces blocks. The second computes the indentations needed for ref lines.  The third produces outline nodes. I'll try hard to avoid this way.
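
The second pass might look something like this.  It is a sketch only: the function name and the `preferred` parameter are assumptions, and the first and third passes (producing blocks and outline nodes) are not shown.

```python
def lookahead_ref_indent(child_lines, preferred):
    """
    Return the largest indentation for a ref line that needs no escapes:
    the preferred indentation, reduced to the smallest leading whitespace
    found on any non-blank child line.
    """
    widths = [len(s) - len(s.lstrip(' ')) for s in child_lines if s.strip()]
    return min([preferred] + widths)


child = [
    '    def spam(self):',
    '        x = 1',
    '# underindented comment',
    '        return x',
]
indent = lookahead_ref_indent(child, 4)
body = [s[indent:] for s in child]
print(indent)
print('\n'.join(body))
```

Because of the single comment at column 0, the ref line gets no indentation at all, and every properly indented line of the method keeps its four "extra" leading spaces.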

Summary

Parsing input files into nodes is tricky, especially when using @others. Using section refs is easier.

The simplest code generation when using @others will be to use "standard" indentation for @others. This implies generating escapes for underindented lines.  I'll attempt this next.

Edward

Edward K. Ream

Nov 2, 2016, 8:02:53 AM
to leo-editor
On Tuesday, November 1, 2016 at 11:08:56 AM UTC-5, Edward K. Ream wrote:

tl;dr: Way ahead of schedule.  I now aim to generate better code than ever before.

Status Report

As expected, a line-oriented approach simplifies much of the code.

Unexpectedly so.  Parsing has collapsed in complexity.  I have completely retired the old scan method, all its helpers, many other now-unused methods, and the infamous Block class.  Hurray!  They have been moved to the attic (leo/doc/leoNotes.txt) as a reminder of the bad old days ;-)

The real trick was to determine which code can be used at the top level and during rescan.  The gen_lines method turned out to be that common code.

Now that parsing is handled, I am attempting to generate perfect code in all cases.  That is, the BLS.check method requires exact matches of leading whitespace for all languages.
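
To illustrate what an exact match means here, the following sketch compares the original file with the text regenerated from the outline, line by line.  It is not the actual BLS.check method: the function name, its arguments and its return value are assumptions.

```python
def check_exact(original, generated):
    """
    Return (ok, message).  Lines must match exactly, leading whitespace included.
    """
    a, b = original.splitlines(), generated.splitlines()
    for i, (s1, s2) in enumerate(zip(a, b), start=1):
        if s1 != s2:
            return False, f'line {i}: expected {s1!r}, got {s2!r}'
    if len(a) != len(b):
        return False, f'line count differs: {len(a)} != {len(b)}'
    return True, 'ok'


print(check_exact('if (x) {\n    y();\n}\n', 'if (x) {\n  y();\n}\n'))
```

A looser check might strip leading whitespace before comparing, which would hide precisely the kind of indentation mistake discussed here.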

This is not yet working for some of my javascript examples.  Some section references have incorrect leading whitespace. Hacks in the undent code don't work in all cases.

The code generator must somehow get smarter.  Imo, this is worth a lot of work.

Edward