Important documentation for devs: Leo's new importers

Edward K. Ream

Nov 6, 2016, 4:43:57 AM
to leo-e...@googlegroups.com
This post will be pre-writing for a theory of operation of Leo's new (V2) import code.  It will be of interest only to those who might want to create their own importer.  Feel free to skip. If you read on, I hope to convince you that creating new importers is straightforward.

This is a long post. tl;dr: as usual, skip to the summary.

While writing this post, I realized that details get in the way of understanding.  So I am going to keep them to a minimum here.  Really :-) Otoh, this post really does tell you everything you need to know to write a new state scanner.

Progress Report

The V2 code is already a great success. The javascript importer is working.  It's the easiest case because it generates section references instead of @others. The perl adapter does generate @others.  I expect to complete it soon. Python is the hardest case. It may take a day or two more.

v2_gen_lines is the crucial method. It is even simpler than envisioned in the middle-of-the-night post.

The latest version of v2_gen_lines is here on GitHub. Once on the page, search for def v2_gen_lines. If you have access to Leo, you can study this code in leoPluginsRef.leo: Plugins-->Importer plugins-->@file importers/basescanner.py

Executive Overview

On Oct 28 I documented the V1 code here. The big picture remains unchanged:

1. Both the V1 and V2 importers copy entire lines from an input file to Leo nodes. This makes the new importers much less error prone than the legacy (character-by-character) importers.

2. These importers know nothing about the language being imported. They know only how to scan tokens accurately.  This makes the line-oriented importers simple and robust.

3. Importers are simple to write because hidden infrastructure in importers/basescanner.py handles most details.

The scanning machine

Leo's scanning code is like a mechanical contraption containing three (different-sized) slots in various places. Each slot holds a different cassette that controls part of the machine's operation. The picture is this: to adapt the machine to a new task, you remove the old cassettes from their slots and replace them with new cassettes.

To change what the machine does, you don't have to understand the innards of the machine! You only have to know how to create new cassettes.

Similarly, Leo importers consist of three simple adapter classes: a controller, a scanner, and a state, all defined in the same file. These classes are the cassettes that modify the infrastructure.

You can easily define an importer for a new language X by using the classes from an existing importer as a template. Just create controller, scanner and state classes in leo/plugins/importers/x.py.

Leo's scanning machine is not a pipeline, but it is easily customizable.

The following sections discuss the three adapter classes in more detail. Please study the javascript importer on GitHub as you read along. If you have access to Leo, the javascript importer is in leoPluginsRef.leo: Plugins-->Importer plugins-->@file importers/javascript.py

The controller

The javascript controller is JS_ImportController.  It consists only of a ctor that inits the base class with various keyword arguments. Most will go away once the changeover to the V2 code is complete. The only important argument is:

    scanner = JS_Scanner(c)

The argument tells the infrastructure what the scanner class is.
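A minimal sketch of that wiring, with stand-in classes so that it runs on its own. The name LineScanner comes from this post, but its body here is assumed. ImportController is a hypothetical stand-in for the real controller base class, and c is only a placeholder for Leo's commander object.

```python
class LineScanner:
    """Stand-in for the infrastructure's scanner base class (body assumed)."""
    def __init__(self, c):
        self.c = c

class ImportController:
    """Hypothetical stand-in for the controller base class."""
    def __init__(self, scanner, **kwargs):
        self.scanner = scanner  # the infrastructure learns the scanner here
        self.kwargs = kwargs    # other keyword arguments, slated to go away

class JS_Scanner(LineScanner):
    def __init__(self, c):
        super().__init__(c)  # for V2, the ctor just inits the base class

class JS_ImportController(ImportController):
    def __init__(self, c):
        # The only important argument: which scanner to use.
        super().__init__(scanner=JS_Scanner(c))

controller = JS_ImportController(c=None)
```

The controller carries no logic of its own in this sketch. It exists only to hand a scanner instance to the infrastructure.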

The state

The javascript state is JS_ScanState.  This state should not be a subclass of any other state class.

States contain:

1. State data.  A context, and one or more counts.

Important: the context is non-empty if and only if the line being scanned is contained in a multiline string or comment or some other special case.

The javascript importer needs to keep track of both curly brackets and parens, so this class contains .context, .curlies and .parens ivars.

2. Rich comparisons, __eq__, __gt__, etc.

v2_gen_lines's helper, cut_stack, uses these to compare the new state to the states on a stack.  As shown in 3. below, the starts_block and continues_block methods often use these comparisons.

These comparisons are a bit tricky for javascript because there are two counts involved.  The count of curly brackets overrides the paren count.

Important: The __eq__ method must return True if self.context is non-empty.  This ensures that we never change blocks in the middle of a multi-line construct.

3. starts_block and continues_block methods. As their names imply, these methods tell v2_gen_lines whether the just-scanned line should remain in the present block (node), start a new node, or terminate one or more nodes.

For most (all?) languages except Python, these methods can be defined in terms of the rich comparisons:

def v2_continues_block(self, prev_state):
    '''Return True if the just-scanned lines should be placed in the inner block.'''
    return self == prev_state

def v2_starts_block(self, prev_state):
    '''Return True if the just-scanned line starts an inner block.'''
    return self > prev_state
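Putting the three pieces together, here is a rough sketch of what a complete state class might look like. It is not the code in importers/javascript.py. Two things are assumptions: comparing (curlies, parens) as a tuple is one way to make the curly count override the paren count, and having __gt__ and __lt__ return False inside a context is a choice made here to stay consistent with the __eq__ rule.

```python
class JS_ScanState:
    """Sketch of a state: a context plus two counts (not Leo's actual code)."""

    def __init__(self, context='', curlies=0, parens=0):
        self.context = context  # non-empty inside a multi-line string/comment
        self.curlies = curlies
        self.parens = parens

    def _key(self):
        # Assumed ordering: curlies are compared first, so they override parens.
        return (self.curlies, self.parens)

    def __eq__(self, other):
        # Never change blocks in the middle of a multi-line construct.
        return bool(self.context) or self._key() == other._key()

    def __ne__(self, other):
        return not self == other

    def __gt__(self, other):
        return not self.context and self._key() > other._key()

    def __lt__(self, other):
        return not self.context and self._key() < other._key()

    def __ge__(self, other):
        return self == other or self > other

    def __le__(self, other):
        return self == other or self < other

    def v2_continues_block(self, prev_state):
        return self == prev_state

    def v2_starts_block(self, prev_state):
        return self > prev_state

top = JS_ScanState()
inner = JS_ScanState(curlies=1)
in_comment = JS_ScanState(context='/*', curlies=1)
```

With these definitions, inner starts a block relative to top, while in_comment continues the present block regardless of its counts.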

The scanner

The javascript scanner is JS_Scanner.  It must be a subclass of LineScanner so that it can init and access the infrastructure. For V2, the ctor just inits the base class.

The javascript scanner defines the v2_scan_line method, and its helper, skip_possible_regex. v2_scan_line computes a new state, given the previous state and the next input line.
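To make the "previous state plus line gives new state" idea concrete, here is a much-simplified, stand-alone sketch. The function scan_line and the State namedtuple are hypothetical names for illustration, not the real v2_scan_line. The sketch handles block comments, // comments, quoted strings, and backtick strings. It deliberately ignores regex literals, which the real scanner handles via skip_possible_regex.

```python
from collections import namedtuple

# Hypothetical, simplified state: a context plus two counts.
State = namedtuple('State', 'context curlies parens')

def scan_line(prev_state, s):
    """Return the state at the end of line s, given the state before it."""
    context, curlies, parens = prev_state
    i = 0
    while i < len(s):
        ch = s[i]
        if context == '/*':
            if s.startswith('*/', i):
                context = ''
                i += 1  # skip '*'; the shared increment below skips '/'
        elif context:  # inside a quoted string
            if ch == '\\':
                i += 1  # skip the escaped character
            elif ch == context:
                context = ''
        elif s.startswith('//', i):
            break  # the rest of the line is a comment
        elif s.startswith('/*', i):
            context = '/*'
            i += 1
        elif ch in '"\'`':
            context = ch
        elif ch == '{':
            curlies += 1
        elif ch == '}':
            curlies -= 1
        elif ch == '(':
            parens += 1
        elif ch == ')':
            parens -= 1
        i += 1
    if context in ('"', "'"):
        context = ''  # simplification: treat ordinary strings as single-line
    return State(context, curlies, parens)

lines = [
    'function f(a) {',
    '  /* a { comment',
    '     still } */',
    '  return "}";  // }',
    '}',
]
state = State('', 0, 0)
states = []
for line in lines:
    state = scan_line(state, line)
    states.append(state)
```

Brackets inside the comment and the string leave the counts untouched, and the context is non-empty only after the second line, which ends inside the block comment.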

Summary

- importers/javascript.py is the javascript importer.

- importers/basescanner.py contains all the import infrastructure.

- To create an importer for a new language, use the classes from an existing importer as a template.

- All importers define three classes: a controller, a scanner, and a state.

- The controller class tells the infrastructure what scanner to use.

- The scanner class inits the infrastructure. This class must define the all-important v2_scan_line method.  This method returns a new state, given the previous state and the next input line.

- The state class defines a context and one or more counts.  The context is non-empty whenever a line ends inside a multi-line comment or string or starts a continued line.

The state class must also define a full set of rich comparison operators and the starts_block and continues_block methods. The starts/continues_block methods are usually defined in terms of rich comparisons.

- Recent work has simplified the interfaces between the infrastructure and the controller, scanner and state classes.  That work seems complete, so the interfaces (and these docs) are not likely to change significantly.

- The code will collapse further once the changeover to the V2 base is complete. In the code, ### comments highlight code that will disappear after we transition to V2.

We have come a long long way since the legacy character-based parser code. Again, I'm writing this in the middle of the night.  It's hard to contain my excitement.

And that's it. All questions and comments welcome.

Edward

Edward K. Ream

Nov 6, 2016, 8:28:19 AM
to leo-editor
On Sunday, November 6, 2016 at 3:43:57 AM UTC-6, Edward K. Ream wrote:

This post will be pre-writing for a theory of operation of Leo's new (V2) import code.

I have made two corrections to the original post to keep it up to date.  However, the corrections won't appear in emails, so here they are:

1. Broken link. Here is the correct link to the infrastructure, importers/basescanner.py.

2. I was wrong about not being able to change the "state" keyword argument to controller classes. It was straightforward.  The new code is in rev 1de52f0.  So the corrected text says:

The javascript controller is JS_ImportController.  It consists only of a ctor that inits the base class with various keyword arguments...The only important argument is:


    scanner = JS_Scanner(c)

The argument tells the infrastructure what the scanner class is.

EKR

Edward K. Ream

Dec 3, 2016, 6:22:41 PM
to leo-editor
On Sun, Nov 6, 2016 at 3:43 AM, Edward K. Ream <edre...@gmail.com> wrote:
This post will be pre-writing for a theory of operation of Leo's new (V2) import code.  It will be of interest only to those who might want to create their own importer

This post is obsolete. The new importers are substantially simpler than the code presented here.  See importers.md for the real scoop.

EKR
 