Documentation: the new perl and js importers

Edward K. Ream

Oct 28, 2016, 4:29:28 PM
to leo-e...@googlegroups.com
The following describes the code as of rev 4288779e2. This posting documents how the new importers work. Feel free to skip it.

I'm not going to show much code in this post, so it would be best if you are looking at importers/basescanner.py and importers/perl.py in LeoPlugins.leo as you read this.

I hope to convince you that the new code is clearly the simplest thing that could possibly work.

Executive Overview

1. The javascript and perl importers simply copy entire lines from a text file to Leo nodes. This makes the new importers much less error-prone than the legacy (character-by-character) importers.

2. These importers know nothing about parsing javascript or perl.  They know only how to scan tokens accurately.  Again, this makes the new importers simpler and more robust than the legacy importers.

3. Importers are simple to write because base classes handle all complex details. Importers override just three methods. The scan_line method is the most important of these.  It encapsulates all language-specific details.  It is typically about one page of straightforward token-scanning code.

Overview of the code

leo/plugins/importers/basescanner.py now contains two new classes: BaseLineScanner (BLS) and ScanState. The BLS class replaces the horribly complex BaseScanner class.

The ScanState class encapsulates all knowledge relating to scan state, that is, tokens.

Using ScanState methods, the BLS.scan method breaks input lines into Blocks: groups of contiguous input lines that will end up in separate Leo nodes.  This is necessarily a complex algorithm.

Happily, importers use the BLS class without knowing anything about how it works, or even that BLS.scan exists!  Leo's import infrastructure, in leoImport.py, calls BLS.run, which calls BLS.scan.

Writing a new importer

The perl and javascript importers now consist of nothing but:

1. A subclass of the BaseLineScanner class. This subclass will typically override only two methods:

- The ctor.  The ctor just sets a few easy-to-understand language-dependent options.

- An optional clean_headline method.  If this method exists, it tells how to massage headlines.

2. A subclass of the ScanState class that just overrides ScanState.scan_line.

ScanState.scan_line updates the net number of curly brackets and parens at the end of each line.  scan_line must compute these numbers accurately, taking into account constructs such as multi-line comments, strings and regular expressions that might contain curly brackets or parens.
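As a rough illustration of what such a method has to do, here is a toy scanner for a javascript-like syntax. It is a sketch under stated assumptions, not the code in perl.py or the javascript importer: ToyScanState and its attribute names are hypothetical, and regular expressions are deliberately left out to keep the example short.

```python
class ToyScanState:
    """Hypothetical stand-in for a ScanState subclass (not Leo's class)."""

    def __init__(self):
        self.curlies = 0         # Net count of { and } seen so far.
        self.parens = 0          # Net count of ( and ) seen so far.
        self.in_comment = False  # True inside a /* ... */ comment.

    def scan_line(self, line):
        """Update the counts for one line, skipping strings and comments."""
        quote = None  # The open quote character; strings end with the line.
        i = 0
        while i < len(line):
            ch = line[i]
            if self.in_comment:
                if line.startswith('*/', i):
                    self.in_comment = False
                    i += 1  # Skip '*'; the final increment skips '/'.
            elif quote:
                if ch == '\\':
                    i += 1  # Skip the escaped character.
                elif ch == quote:
                    quote = None
            elif line.startswith('/*', i):
                self.in_comment = True
                i += 1
            elif line.startswith('//', i):
                break  # The rest of the line is a comment.
            elif ch in '"\'':
                quote = ch
            elif ch == '{':
                self.curlies += 1
            elif ch == '}':
                self.curlies -= 1
            elif ch == '(':
                self.parens += 1
            elif ch == ')':
                self.parens -= 1
            i += 1


state = ToyScanState()
state.scan_line('function f(a) { // {\n')
after_first = (state.curlies, state.parens)
for line in ['  var s = "}) {";\n', '  /* ( {\n', '  } */\n', '}\n']:
    state.scan_line(line)
print(after_first, (state.curlies, state.parens))
```

The brackets inside the string and the two comments are ignored, so the counts return to zero at the closing curly bracket. A real scan_line needs the same care for every construct of the language that can hide a bracket.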

The Perl importer

The perl importer consists of the PerlScanState and PerlScanner classes.  See leo/plugins/importers/perl.py.

PerlScanner.__init__ just sets a few arguments for the BaseLineScanner class.

Here is PerlScanner.clean_headline:

import re  # Imported at module level.

def clean_headline(self, p):
    '''Return a cleaned up headline for p, or None for no change.'''
    # Match "sub <name>" at the start of the headline, p.h.
    m = re.match(r'sub\s+(\w+)', p.h)
    return 'sub ' + m.group(1) if m else None

This replaces "sub name whatever" by "sub name".
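The effect is easy to check outside Leo. In the sketch below the method is rewritten as a plain function, and types.SimpleNamespace stands in for a Leo position; that stand-in is an assumption made for illustration, since only the .h attribute is used.

```python
import re
from types import SimpleNamespace


def clean_headline(p):
    '''Return a cleaned up headline for p, or None for no change.'''
    m = re.match(r'sub\s+(\w+)', p.h)
    return 'sub ' + m.group(1) if m else None


# SimpleNamespace stands in for a Leo position; only .h is read.
cleaned = clean_headline(SimpleNamespace(h='sub parse_args {'))
unchanged = clean_headline(SimpleNamespace(h='package Foo::Bar;'))
print(cleaned)    # sub parse_args
print(unchanged)  # None
```

Note that re.match anchors at the start of the string, so a headline that does not begin with "sub" is left unchanged.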

PerlScanState.scan_line is straightforward, but it absolutely must be accurate. Take a look at it.

Importing Python

If we were to convert the Python importer to use the new scheme, the entire ScanState class would have to be rewritten.  The reason should be clear--Python uses indentation levels to indicate structure, not curly brackets.

Happily, rewriting the ScanState class is all that would be required.  The BLS class would remain completely unchanged, and the importer would be just as simple as the perl and javascript importers.
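To suggest what an indentation-based scan state might track, here is a toy sketch. It is purely hypothetical: no such class exists in the code described here, the names are invented, and it ignores continuation lines and multi-line strings, which a real Python importer would have to handle.

```python
class ToyPythonScanState:
    """Hypothetical stand-in: tracks indentation instead of bracket counts."""

    def __init__(self, tab_width=4):
        self.tab_width = tab_width
        self.indent = 0  # Indentation of the most recent significant line.

    def scan_line(self, line):
        """Update self.indent from one line.

        Blank lines and comment-only lines leave the level unchanged.
        """
        stripped = line.lstrip(' \t')
        if not stripped.strip() or stripped.startswith('#'):
            return
        leading = line[:len(line) - len(stripped)]
        self.indent = len(leading.expandtabs(self.tab_width))


state = ToyPythonScanState()
levels = []
for line in [
    'class A:\n',
    '    def f(self):\n',
    '\n',
    '            # a comment\n',
    '        return 1\n',
    'x = 1\n',
]:
    state.scan_line(line)
    levels.append(state.indent)
print(levels)
```

A line scanner could then start a new block whenever the indentation drops back to the level at which a class or def line appeared, much as the curly-bracket scheme ends a block when the net count returns to its starting value.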

Summary

Writing a new importer is easy because the ScanState and BaseLineScanner classes hide the details of a complex algorithm.  Importers don't even know that BLS.scan exists. Leo's import machinery in leoImport.py calls BLS.run, which calls BLS.scan.

The overridden scan_line method encapsulates all language-specific knowledge.  This is token-level knowledge.  No parsing is ever done anywhere.

Overriding the scan_line method is the simplest possible way of providing all needed language-specific knowledge to the BLS class. The entire scheme is the simplest thing that could possibly work.

The perl and javascript importers are just a couple of pages of code each.  All new importers will be easy to write. A Python importer would use a completely rewritten ScanState class.

All comments and questions are welcome.

Edward