The following describes the code as of rev 4288779e2. This posting documents how the new importers work. Feel free to skip it.
I'm not going to show much code in this post, so it would be best if you are looking at importers/basescanner.py and importers/perl.py in LeoPlugins.leo as you read this.
I hope to convince you that the new code is clearly the simplest thing that could possibly work.
Executive Overview
1. The javascript and perl importers simply copy entire lines from a text file to Leo nodes. This makes the new importers much less error prone than the legacy (character-by-character) importers.
2. These importers know nothing about parsing javascript or perl. They know only how to scan tokens accurately. Again, this makes the new importers simpler and more robust than the legacy importers.
3. Importers are simple to write because base classes handle all complex details. Importers override just three methods. The scan_line method is the most important of these. It encapsulates all language-specific details. It is typically about one page of straightforward token-scanning code.
Overview of the code
leo/plugins/basescanner.py now contains two new classes: BaseLineScanner (BLS) and ScanState. The BLS class replaces the horribly complex BaseScanner class.
The ScanState class encapsulates all knowledge relating to scan state, that is, tokens.
Using ScanState methods, the BLS.scan method breaks input lines into Blocks: runs of contiguous input lines, each of which will end up in its own Leo node. This is necessarily a complex algorithm.
Happily, importers use the BLS class without knowing anything about how it works, or even that BLS.scan exists! Leo's import infrastructure, in leoImport.py, calls BLS.run, which calls BLS.scan.
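The real BLS.scan is considerably more involved, and nothing below is taken from it. As a toy illustration of cutting whole lines into blocks using only a running bracket count, here is a sketch. The helper name split_blocks is hypothetical, and the counter naively ignores strings and comments, which is exactly what a real scan_line must not do.

```python
def split_blocks(lines):
    """Group whole lines into blocks using a naive curly-bracket count."""
    blocks = []   # finished blocks, each a list of whole lines
    current = []  # lines of the block being built
    depth = 0     # net count of unclosed '{'
    for line in lines:
        opens, closes = line.count('{'), line.count('}')
        if depth == 0 and opens > closes and current:
            # A new top-level construct starts: flush the preceding lines.
            blocks.append(current)
            current = []
        current.append(line)
        depth += opens - closes
        if depth == 0 and closes > 0:
            # The top-level construct just closed: end its block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


lines = ['use strict;', 'sub a {', '  return 1;', '}', 'sub b {', '}', 'print a();']
for block in split_blocks(lines):
    print(block)
```

Note that lines are only ever copied whole into a block. No line is split or rewritten, so joining the blocks gives back the original input.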
Writing a new importer
The perl and javascript importers now consist of nothing but:
1. A subclass of the BaseLineScanner class. This subclass will typically override only two methods:
- The ctor. The ctor just sets a few easy-to-understand language-dependent options.
- An optional clean_headline method. If this method exists, it tells how to massage headlines.
2. A subclass of the ScanState class that just overrides ScanState.scan_line.
ScanState.scan_line updates the net number of curly brackets and parens at the end of each line. scan_line must compute these numbers accurately, taking into account constructs such as multi-line comments, strings and regular expressions that might contain curly brackets or parens.
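As a rough illustration of this bookkeeping, here is a minimal, self-contained sketch. It is not the code in basescanner.py or perl.py. The class name SketchScanState, its attributes (curlies, parens, in_block_comment) and the simplified token rules (C-style /* */ and // comments, single- and double-quoted strings, no regex handling at all) are assumptions made purely for illustration.

```python
class SketchScanState:
    """Hypothetical scan state: running counts of unclosed brackets."""

    def __init__(self):
        self.curlies = 0               # net count of unclosed '{'
        self.parens = 0                # net count of unclosed '('
        self.in_block_comment = False  # inside an unterminated /* ... */

    def scan_line(self, line):
        """Update the counts for one line, skipping comments and strings."""
        i, n = 0, len(line)
        while i < n:
            if self.in_block_comment:
                end = line.find('*/', i)
                if end == -1:
                    return  # The comment continues on the next line.
                self.in_block_comment = False
                i = end + 2
                continue
            ch = line[i]
            if line.startswith('/*', i):
                self.in_block_comment = True
                i += 2
            elif line.startswith('//', i):
                return  # The rest of the line is a comment.
            elif ch in ('"', "'"):
                i = self.skip_string(line, i)
            else:
                if ch == '{':
                    self.curlies += 1
                elif ch == '}':
                    self.curlies -= 1
                elif ch == '(':
                    self.parens += 1
                elif ch == ')':
                    self.parens -= 1
                i += 1

    def skip_string(self, line, i):
        """Return the index just past the string that starts at line[i]."""
        delim = line[i]
        i += 1
        while i < len(line):
            if line[i] == '\\':
                i += 2  # Skip the escaped character.
            elif line[i] == delim:
                return i + 1
            else:
                i += 1
        return i  # Unterminated string: treat it as ending with the line.
```

The state object is fed one line at a time. The brackets inside the string and inside the comments are never counted, which is the whole point of scanning tokens instead of raw characters.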
The Perl importer
The perl importer consists of the PerlScanState and PerlScanner classes. See leo/importers.perl.py
PerlScanner.__init__ just sets a few arguments for the BaseLineScanner class.
Here is PerlScanner.clean_headline:
    def clean_headline(self, p):
        '''Return a cleaned up headline for p, or None for no change.'''
        m = re.match(r'sub\s+(\w+)', p.h)
        return 'sub ' + m.group(1) if m else None
This replaces "sub name whatever" by "sub name".
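To see the effect of the pattern without a running Leo, the following sketch applies the same regex to plain strings. The helper name clean is hypothetical. The real method takes a position p and reads the headline from p.h.

```python
import re

def clean(headline):
    """Apply the same pattern as clean_headline to a plain string."""
    m = re.match(r'sub\s+(\w+)', headline)
    return 'sub ' + m.group(1) if m else None

print(clean('sub parse_args {'))  # sub parse_args
print(clean('my $x = 1;'))        # None
```

Because re.match only matches at the start of the string, a headline with leading whitespace before "sub" is left unchanged.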
PerlScanState.scan_line is straightforward, but it absolutely must be accurate. Take a look at it.
Importing Python
If we were to convert the Python importer to use the new scheme, the entire ScanState class would have to be rewritten. The reason should be clear--Python uses indentation levels to indicate structure, not curly brackets.
Happily, rewriting the ScanState class is all that would be required. The BLS class would remain completely unchanged, and the importer would be just as simple as the perl and javascript importers.
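What such a rewritten class might track can be sketched as follows. This is speculation, not existing Leo code. The class name IndentScanState and its indent attribute are hypothetical, and the sketch ignores tabs, continuation lines and multi-line strings, all of which a real importer would have to handle.

```python
class IndentScanState:
    """Hypothetical scan state for an indentation-based language."""

    def __init__(self):
        self.indent = 0  # indentation of the most recent significant line

    def scan_line(self, line):
        """Record the indentation of line, ignoring blank and comment-only lines."""
        stripped = line.lstrip(' ')
        if not stripped.strip() or stripped.startswith('#'):
            return  # Blank or comment-only lines do not change structure.
        self.indent = len(line) - len(stripped)
```

The state here is an indentation level instead of bracket counts, but it is still updated one whole line at a time.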
Summary
Writing a new importer is easy because the ScanState and BaseLineScanner classes hide the details of a complex algorithm. Importers don't even know that BLS.scan exists. Leo's import machinery in leoImport.py calls BLS.run, which calls BLS.scan.
The overridden scan_line method encapsulates all language-specific knowledge. This is token-level knowledge. No parsing is ever done anywhere.
Overriding the scan_line method is the simplest possible way of providing all needed language-specific knowledge to the BLS class. The entire scheme is the simplest thing that could possibly work.
The perl and javascript importers are just a couple of pages of code each. All new importers will be easy to write. A Python importer would use a completely rewritten ScanState class.
All comments and questions are welcome.
Edward