Here, I'll be thinking out loud about how to make code generation work when using the new-style importers.
This is an engineering notebook post. Feel free to ignore.
importers/python.py contains both the old and new python importers. The new_scanner switch enables the new importer. The new python importer fails miserably when new_scanner is True. Everything works as before (including all unit tests) when new_scanner is False.
Background
In another thread I wrote:
"If we were to convert the Python importer to use the new scheme, the entire ScanState class would have to be rewritten. The reason should be clear--Python uses indentation levels to indicate structure, not curly brackets."
Rev 9755cf introduces the PythonScanState class. It also moves the scan_block method out of the ScanState class and into the BaseLineScanner (BLS) class where it belongs.
The PythonScanState class is surprisingly simple. In particular, it handles backslash-newlines more simply than does the old-style importer. Handling them is tricky to get exactly right.
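To make the idea concrete, here is a minimal sketch of a line-oriented scan state that tracks indentation and backslash-newlines. It is not the code from rev 9755cf: the class name SketchPythonScanState, the update method, and the logical_indents helper are all hypothetical, and the sketch deliberately ignores strings, comments, brackets, and tabs.

```python
class SketchPythonScanState:
    """Hypothetical line-oriented state; NOT Leo's actual PythonScanState."""

    def __init__(self):
        self.indent = 0         # Indentation of the current logical line.
        self.continued = False  # True if the previous line ended with a backslash.

    def update(self, line):
        """Scan one physical line. Return True if it starts a logical line."""
        starts_logical = not self.continued
        stripped = line.rstrip('\n')
        if starts_logical and stripped.strip():
            # Only the first physical line of a logical line sets the indentation.
            self.indent = len(stripped) - len(stripped.lstrip(' '))
        # Simplification: any trailing backslash joins this line to the next,
        # even inside a comment, where real Python would not join lines.
        self.continued = stripped.endswith('\\')
        return starts_logical


def logical_indents(lines):
    """Return the indentation of each non-blank logical line (assumed helper)."""
    state = SketchPythonScanState()
    result = []
    for line in lines:
        if state.update(line) and line.strip():
            result.append(state.indent)
    return result
```

The point of the sketch is that a continuation line never changes the recorded indentation, so an oddly indented second half of a backslash-joined statement cannot be mistaken for the end of a block.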
"Happily, rewriting the ScanState class is all that would be required. The BLS class would remain completely unchanged, and the importer would be just as simple as the perl and javascript importers."
This statement was wildly optimistic. It has gradually dawned on me that there are serious problems with the code generation in the BLS class.
Code Generation
Code generation for javascript is easier than for python because nodes may contain multiple section references. For the python (and perl) importers, only one @others directive is allowed per node. This has important implications. The entire algorithm for breaking the input file into nodes may have to be revised.
As a practical matter, I have found the block scanning and rescanning code to be almost impossible to understand. This is surprising, but not distressing. The algorithm was always going to be complex.
I have derided the old-style importers as way too complicated. I may have to revise that assessment :-)
The great advantage of the old-style code generators is that they handle indentation correctly in all situations. In particular, they handle underindented python comment lines properly. Such comments do not terminate defs or classes. I am willing to add extra indentation for such lines (with a warning), but even doing that has repercussions throughout the code.
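As an illustration of the rule that underindented comments do not terminate defs or classes, here is a small sketch that finds the end of a block. The function name block_end is hypothetical and this is not the old importer's code. It assumes space-only indentation and treats any line whose first non-blank character is '#' as a comment, which would be wrong inside a multi-line string.

```python
def block_end(lines, start):
    """Return the index just past the def/class block starting at lines[start].

    A sketch only: blank lines and comment lines never end the block,
    whatever their indentation.
    """
    def indent_of(s):
        return len(s) - len(s.lstrip(' '))

    base = indent_of(lines[start])
    end = start + 1
    for i in range(start + 1, len(lines)):
        s = lines[i]
        stripped = s.strip()
        if not stripped or stripped.startswith('#'):
            continue  # Blank and comment lines cannot terminate the block.
        if indent_of(s) <= base:
            break  # Real code at or below the def's indentation ends the block.
        end = i + 1  # The block extends through this line of code.
    return end
```

With this rule, an underindented comment between two statements of a def stays inside the def, while comments and blank lines after the last statement are left for the following block.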
I plan to study the old code generators today, to remind myself how they work. But before doing that, let's see what the code generators must do. In fact, the answer is relatively straightforward. Each generated node, including the top-level node, will look like this:
One or more leading lines
@others, indented as discussed below
zero or more trailing lines
The top-level node will be
@language python
@others
Nodes that have no children will consist only of the properly indented body of the class or def. This indentation depends on the cumulative indentation of all @others nodes in the node's parents.
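A small sketch may clarify what "cumulative indentation" means here. The function strip_cumulative_indent and its others_indents parameter are hypothetical names, not part of the BLS class. The sketch assumes space-only indentation and that each ancestor contributes the indentation of its own @others line.

```python
def strip_cumulative_indent(lines, others_indents):
    """Remove the summed @others indentation of all ancestors from each line.

    others_indents lists the indentation of the @others directive in each
    ancestor node, outermost first (an assumption of this sketch).
    """
    width = sum(others_indents)
    result = []
    for line in lines:
        if line.strip() == '':
            # Blank lines carry no indentation.
            result.append('\n' if line.endswith('\n') else '')
        elif line[:width].strip() == '':
            # Normal case: the line is indented at least `width` columns.
            result.append(line[width:])
        else:
            # Underindented line: keep the text and drop its leading spaces.
            result.append(line.lstrip(' '))
    return result
```

For example, a method of a class nested inside another class sits at column 8 in the file. If the top-level @others is at column 0 and each class node indents its @others by 4, the method's body is stored with 8 columns removed.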
Nodes that do have children are the hard case. To repeat, they will look like:
One or more leading lines
@others, properly indented
zero or more trailing lines
There are three problems that must be solved completely:
1. Determining the leading lines.
2. Determining the indentation of the @others directive.
3. Determining the trailing lines.
None of these tasks is trivial. Furthermore, the post pass may move lines around from the end of one block to the start of the next. Alas, this could affect the proper indentation of the @others directive!
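The three tasks can be sketched as follows, under strong simplifying assumptions. The function node_body and its parameters are hypothetical. The sketch assumes the caller already knows where the first child starts and where the last child ends, and that block_lines has already had the parents' cumulative indentation removed. It says nothing about the post pass.

```python
def node_body(block_lines, first_child, last_child_end):
    """Build the body of a node that has children.

    block_lines: the lines of the class or def, already de-indented.
    first_child: index of the first line of the first child block.
    last_child_end: index just past the last line of the last child block.
    """
    def indent_of(s):
        return len(s) - len(s.lstrip(' '))

    # 1. The leading lines are everything before the first child.
    leading = block_lines[:first_child]
    # 2. @others gets the indentation of the first child's first line.
    others = ' ' * indent_of(block_lines[first_child]) + '@others\n'
    # 3. The trailing lines are everything after the last child.
    trailing = block_lines[last_child_end:]
    return leading + [others] + trailing
```

Even in this toy form the dependency is visible: if a later pass moves lines across first_child, both the leading lines and the indentation of @others can change.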
The way forward
Clearly, the new-style code generators can do as well as the old code generators. In fact, their task is easier than that of the old-style code generators because the new code generators work on whole lines.
In the worst case, the new importers can simply mirror the old code generators. Having said that, doing code generation the "old" way may require a complete rewrite of the code that allocates lines to nodes. Happily, adapting the old code generators to a line-oriented scheme must surely simplify them.
Summary
Code generation is much more challenging than I first imagined.
The ScanState class is not the problem. It is a brilliant invention, if I do say so myself. It completely eliminates the need to parse the imported language. It will remain a foundation of the BLS class.
Much of the BLS class may have to be rewritten, including BLS.scan and many of its helpers.
The new code generators may be based on the old. No changes whatever will be tolerated in the old code generators. Instead, I'll copy any needed code from the BaseScanner class to the BLS class.
Rewriting the old generators to work with the line-by-line scanner will simplify them. I relish such tasks.
The BLS class is a fundamentally important part of Leo. It should be used for all of Leo's importers. It is worth any amount of work to make the new importers as beautiful and accurate as possible.
Edward