This post will be pre-writing for a theory of operation of Leo's new (V2) import code.
It will be of interest only to those who might want to create their own importer. Feel free to skip. If you read on, I hope to convince you that creating new importers is straightforward.
This is a long post. tl;dr: as usual, skip to the summary.
While writing this post, I realized that
details get in the way of understanding. So I am going to keep them to a minimum here. Really :-) Otoh, this post really does tell you
everything you need to know to write a new state scanner.
Progress Report
The V2 code is already a great success. The javascript importer is working. It's the
easiest case because it generates section references instead of @others. The perl adapter does generate @others. I expect to complete it soon. Python is the hardest case. It may take a day or two more.
v2_gen_lines is a crucial method. It is even simpler than envisioned in the middle-of-the-night post.
The latest version of v2_gen_lines is here
on GitHub. Once on the page, search for def v2_gen_lines. If you have access to Leo, you can study this code in leoPluginsRef.leo: Plugins-->Importer plugins-->@file importers/basescanner.py
Executive Overview
On Oct 28 I documented the V1 code here. The big picture remains unchanged:
1. Both the V1 and V2 importers copy entire lines from an input file to Leo nodes. This makes the new importers much less error prone than the legacy (character-by-character) importers.
2. These importers know nothing about the language being imported. They know only how to scan tokens accurately. This makes the line-oriented importers simple and robust.
3. Importers are simple to write because hidden infrastructure in importers/basescanner.py handles most details.
The scanning machine
Leo's scanning code is like a mechanical contraption containing three (different-sized) slots in various places. Each slot holds a different cassette that controls part of the machine's operation. The picture is this: to adapt the machine to a new task, you remove the old cassettes from their slots and replace them with new cassettes.
To change what the machine does, you
don't have to understand the innards of the machine! You only have to know how to create new cassettes.
Similarly, Leo importers consist of three simple adapter classes, a controller, a scanner, and a state, all defined in the same file. These classes are the cassettes that modify the infrastructure.
You can easily define an importer for a new language X
by using the classes from an existing importer as a template. Just create controller, scanner and state classes in leo/plugins/importers/x.py.
Leo's scanning machine is
not a pipeline, but it
is easily customizable.
The following sections discuss the three adapter classes in more detail.
Please study the
javascript importer on GitHub as you read along. If you have access to Leo, the javascript importer is in leoPluginsRef.leo: Plugins-->Importer plugins-->@file importers/javascript.py
The controller
The javascript controller is JS_ImportController. It consists only of a ctor that inits the base class with various keyword arguments. Most will go away once the changeover to the V2 code is complete. The only important argument is:
scanner = JS_Scanner(c)
The argument tells the infrastructure what the scanner class is.
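Here is a minimal sketch of that shape for a hypothetical language X. The base classes below (BaseController, BaseScanner) are stand-ins invented so the example runs without Leo; the real base classes live in importers/basescanner.py and accept more keyword arguments. Only the scanner keyword comes from the text above.

```python
class BaseScanner:
    '''Hypothetical stand-in for the real scanner base class.'''
    def __init__(self, c):
        self.c = c  # In Leo, c is the commander for the outline.

class BaseController:
    '''Hypothetical stand-in for the real controller base class.'''
    def __init__(self, c, scanner=None, **kwargs):
        self.c = c
        self.scanner = scanner  # The infrastructure drives this scanner.

class X_Scanner(BaseScanner):
    '''Scanner for the hypothetical language X.'''

class X_ImportController(BaseController):
    '''Controller for language X: a ctor and nothing else.'''
    def __init__(self, c, **kwargs):
        # The one important argument: tell the base class which scanner to use.
        super().__init__(c, scanner=X_Scanner(c), **kwargs)

controller = X_ImportController(c=None)  # A real call would pass a commander.
print(type(controller.scanner).__name__)
```

The controller holds no logic of its own; everything language-specific lives in the scanner and state classes.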
The state
The javascript state is JS_ScanState. This state should not be a subclass of any other state class.
States contain:
1. State data. A context, and one or more counts.
Important: the context is non-empty if and only if the line being scanned is contained in a multiline string or comment or some other special case.
The javascript importer needs to keep track of both curly brackets and parens, so this class contains .context, .curlies and .parens ivars.
2. Rich comparisons, __eq__, __gt__, etc.
cut_stack, a helper of v2_gen_lines, uses these to compare the new state to the states on a stack. As shown in item 3 below, the starts_block and continues_block methods often use these methods.
These comparisons are a bit tricky for javascript because
there are two counts involved. The count of curly brackets overrides
the paren count.
Important: The __eq__ method must return True if self.context is non-empty. This ensures that we never change blocks in the middle of a multi-line construct.
3. The starts_block and continues_block methods. As their names imply, these methods tell v2_gen_lines whether the just-scanned line should remain in the present block (node), start a new node, or terminate one or more nodes.
For most (all?) languages except Python, these methods can be defined in terms of the rich comparisons:
def v2_continues_block(self, prev_state):
    '''Return True if the just-scanned lines should be placed in the inner block.'''
    return self == prev_state

def v2_starts_block(self, prev_state):
    '''Return True if the just-scanned line starts an inner block.'''
    return self > prev_state
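To make the three requirements concrete, here is a self-contained sketch of a state class with a context and two counts. The class name X_ScanState is hypothetical, and the exact way the curly-bracket count overrides the paren count (a tuple comparison) is my assumption for illustration; see JS_ScanState in importers/javascript.py for the real rules.

```python
class X_ScanState:
    '''A hypothetical state: a context plus two counts.'''

    def __init__(self, context='', curlies=0, parens=0):
        self.context = context  # Non-empty inside a multi-line string or comment.
        self.curlies = curlies
        self.parens = parens

    def _key(self):
        # Assumption: curlies dominate; parens only break ties.
        return (self.curlies, self.parens)

    def __eq__(self, other):
        # Always "equal" inside a multi-line construct,
        # so we never change blocks in the middle of one.
        return bool(self.context) or self._key() == other._key()

    def __ne__(self, other):
        return not self == other

    def __gt__(self, other):
        return not self.context and self._key() > other._key()

    def __lt__(self, other):
        return not self.context and self._key() < other._key()

    def __ge__(self, other):
        return self == other or self > other

    def __le__(self, other):
        return self == other or self < other

    def v2_continues_block(self, prev_state):
        return self == prev_state

    def v2_starts_block(self, prev_state):
        return self > prev_state

prev = X_ScanState(curlies=1)
opened = X_ScanState(curlies=2)
in_comment = X_ScanState(context='/*', curlies=2)

print(opened.v2_starts_block(prev))         # True: one more open curly.
print(in_comment.v2_continues_block(prev))  # True: the context forces equality.
print(in_comment.v2_starts_block(prev))     # False: no new block inside a comment.
```

Note that the context check appears in every comparison, not just __eq__. Otherwise a state inside a comment could compare both equal to and greater than the previous state.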
The scanner
The javascript scanner is JS_Scanner. It must be a subclass of LineScanner so that it can init and access the infrastructure.
For V2, the ctor just inits the base class.
The javascript scanner defines the v2_scan_line method and its helper, skip_possible_regex. v2_scan_line computes a new state, given the previous state and the next input line.
Summary
- importers/javascript.py is the javascript importer.
- importers/basescanner.py contains all the import infrastructure.
- To create an importer for a new language, use the classes from an existing importer as a template.
- All importers define three classes: a controller, a scanner, and a state.
- The controller class tells the infrastructure what scanner to use.
- The scanner class inits the infrastructure.
This class must define the all-important v2_scan_line method. This method returns a new state, given the previous state and the next input line.
- The state class defines a context and one or more counts.
The context is non-empty whenever a line ends inside a multi-line comment or string or starts a continued line.
The state class must also define a full set of rich comparison operators and the starts_block and continues_block methods. The starts/continues_block methods are usually defined in terms of rich comparisons.
- Recent work has simplified the interfaces between the infrastructure and the controller, scanner and state classes. That work seems complete, so the interfaces (and these docs) are not likely to change significantly.
- The code will collapse further once the changeover to the V2 base is complete. ### comments highlight code that will disappear after we transition to V2.
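To illustrate the scanner bullet above, here is a rough sketch of a line scanner for a javascript-like language. It is not Leo's v2_scan_line: the names State and scan_line are hypothetical, the state is a bare namedtuple, and regex literals (which the real scanner handles via skip_possible_regex) are ignored. I also assume that only backtick strings and /* */ comments can span lines.

```python
from collections import namedtuple

# Hypothetical minimal state: context is '' outside multi-line constructs.
State = namedtuple('State', 'context curlies parens')

def scan_line(prev_state, line):
    '''Return the state at the end of line, given the state before it.'''
    context, curlies, parens = prev_state
    i = 0
    while i < len(line):
        ch = line[i]
        if context == '/*':
            # Inside a block comment: look only for its end.
            if line.startswith('*/', i):
                context = ''
                i += 2
                continue
        elif context:
            # Inside a string: context holds the opening quote character.
            if ch == '\\':
                i += 2  # Skip the escaped character.
                continue
            if ch == context:
                context = ''
        elif line.startswith('//', i):
            break  # The rest of the line is a comment.
        elif line.startswith('/*', i):
            context = '/*'
            i += 2
            continue
        elif ch in '"\'`':
            context = ch
        elif ch == '{':
            curlies += 1
        elif ch == '}':
            curlies -= 1
        elif ch == '(':
            parens += 1
        elif ch == ')':
            parens -= 1
        i += 1
    if context in ('"', "'"):
        context = ''  # Assumption: ordinary quoted strings end with the line.
    return State(context, curlies, parens)

s0 = State('', 0, 0)
s1 = scan_line(s0, 'function f(a) { // open {\n')
s2 = scan_line(s1, '  var s = "}"; /* start\n')
s3 = scan_line(s2, '  } */ }\n')
print(s1, s2, s3, sep='\n')
```

The scanner never looks at keywords or grammar. It only tracks where strings and comments begin and end, so that brackets inside them are not counted.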
We have come a long long way since the legacy character-based parser code. Again, I'm writing this in the middle of the night. It's hard to contain my excitement.
And that's it. All questions and comments welcome.
Edward