importer proposal

tfer

Sep 21, 2016, 12:11:06 PM

to leo-editor

There are always requests for new importers or changes to existing importers, but few people other than Edward have wrapped their minds around how to do this for themselves. This is a proposal to make it easier for people to create their own importers, or to customize any importer to create nodes at whatever level of detail they want.

This would work by borrowing the Unix principle of writing little programs that each do one thing, then composing a chain of them to accomplish the task at hand. Each language would have its own default chain, but you could add your own chains, parameters, or programs in your file's "settings" node for each language whose defaults you want to override.

I imagine some of these tools would be general and then made specific to the language at hand by passing them things like a keyword list, regexes, and the like. I'm not sure whether this would need an intermediate format that gets turned into nodes by a "nodeMaker" program at the end (this seems the likely scenario, as the Unix approach is to keep things in text as long as possible), or whether each program would more directly refine the nodes created by the previous program in the chain.

Python has support for this sort of thing in the "included batteries": the io module includes tools for working with strings. To make things look more Unix-like, there is pipetools: https://pypi.python.org/pypi/pipetools.
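As a rough illustration of the chain idea, here is a minimal sketch using only the standard library. Every name in it (compose, split_lines, group_by_keyword, node_maker) is hypothetical and made up for this example; none of them exist in Leo. Each stage does one thing, and the last stage plays the role of the "nodeMaker".

```python
from functools import reduce

def compose(*stages):
    """Return a function that feeds its input through each stage in order."""
    return lambda text: reduce(lambda value, stage: stage(value), stages, text)

def split_lines(text):
    # Stage 1: split into lines, keeping line endings so nothing is lost.
    return text.splitlines(keepends=True)

def group_by_keyword(keywords):
    # Stage 2 (factory): made language-specific by passing a keyword list.
    # A new chunk starts at every unindented line that begins with a keyword.
    starts = tuple(keyword + ' ' for keyword in keywords)
    def stage(lines):
        chunks = [[]]
        for line in lines:
            if line.startswith(starts) and chunks[-1]:
                chunks.append([])
            chunks[-1].append(line)
        return chunks
    return stage

def node_maker(chunks):
    # Stage 3: turn each non-empty chunk into a (headline, body) pair.
    return [(chunk[0].strip(), ''.join(chunk)) for chunk in chunks if chunk]

# A per-language default chain; a settings node could swap in another one.
python_chain = compose(split_lines, group_by_keyword({'def', 'class'}), node_maker)

source = "import os\n\ndef a():\n    pass\n\nclass B:\n    x = 1\n"
nodes = python_chain(source)
```

Because every stage keeps whole lines, joining the bodies of the resulting nodes gives back the original text.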

Tom

Edward K. Ream

Sep 22, 2016, 7:30:04 AM

to leo-editor

On Wed, Sep 21, 2016 at 11:11 AM, 'tfer' via leo-editor <leo-e...@googlegroups.com> wrote:

> There are always requests for new importers or changes to existing importers, but few people other than Edward have wrapped their minds around how to do this for themselves. This is a proposal to make it easier for people to create their own importers, or to customize any importer to create nodes at whatever level of detail they want.

This is a very long reply, but I want to explain exactly why the present code is as it is.

> This would work by borrowing the Unix principle of writing little programs that each do one thing, then composing a chain of them to accomplish the task at hand.

I am aware of this design pattern. The present code uses a different pattern, namely having individual importers/writers override base importer and writer classes.

The writer base class is relatively simple. Writing is much simpler than reading.

The importer base class is complex for three distinct reasons:

1. Parsing is inherently a per-language process. We could use a different parsing tool for each language, but that will lead to duplicate code. I discuss parsing in greater detail below.

2. Given a proper parse of the language, splitting the code into separate Leo nodes (what the importer calls code generation) is a tricky process. We want to preserve line breaks and whitespace wherever possible.

3. We typically want to verify that the result of the import, when written back out, reproduces the original file. There is a separate (extremely complex) phase that does this, based on several switches in the overridden importer classes.
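The idea behind the third point can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual checking code: the function names and the strict_whitespace switch are hypothetical, and node bodies are modeled as plain strings.

```python
def write_nodes(bodies):
    """Writing is the easy direction: concatenate the node bodies in order."""
    return ''.join(bodies)

def check_round_trip(original, bodies, strict_whitespace=True):
    """Return True if writing the imported nodes reproduces the original text.

    strict_whitespace is a hypothetical switch: when False, trailing
    whitespace on each line is ignored in the comparison.
    """
    result = write_nodes(bodies)
    if strict_whitespace:
        return result == original

    def normalize(text):
        return [line.rstrip() for line in text.splitlines()]

    return normalize(result) == normalize(original)

original = "x = 1  \ny = 2\n"
exact = ["x = 1  \n", "y = 2\n"]   # an import that preserved everything
lossy = ["x = 1\n", "y = 2\n"]     # an import that dropped trailing blanks
```

A per-language switch of this kind lets an importer decide how much whitespace drift it is willing to tolerate before reporting a failed import.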

So yes, one could turn each of these areas of code into a separate process, but the actual code would not change much.

Take a look at the CScanner class in leo/plugins/importers/c.py. It consists only of a ctor that sets various switches. All the real work is done in leo/plugins/importers/basescanner.py. There is no way to make the C scanner simpler.
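The pattern looks roughly like the sketch below. The class names, attribute names, and the top_level_blocks method are all hypothetical stand-ins chosen for this example; they are not the contents of c.py or basescanner.py.

```python
class SketchBaseScanner:
    """Hypothetical base class: all the real work lives here."""

    def __init__(self):
        # Switches that subclasses are expected to set.
        self.line_comment = None
        self.block_open = None
        self.block_close = None

    def top_level_blocks(self, text):
        """Count blocks opened at nesting depth zero, ignoring line comments."""
        depth, count = 0, 0
        for line in text.splitlines():
            if self.line_comment:
                line = line.split(self.line_comment, 1)[0]
            for ch in line:
                if ch == self.block_open:
                    if depth == 0:
                        count += 1
                    depth += 1
                elif ch == self.block_close:
                    depth -= 1
        return count


class SketchCScanner(SketchBaseScanner):
    """Hypothetical C scanner: nothing but a ctor that sets switches."""

    def __init__(self):
        super().__init__()
        self.line_comment = '//'
        self.block_open = '{'
        self.block_close = '}'


c_source = "int f() { if (x) { y(); } } // }{\nint g() { }\n"
block_count = SketchCScanner().top_level_blocks(c_source)
```

The subclass stays trivial precisely because the base class absorbs the complexity.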

Imo, your proposal amounts to a request to refactor basescanner.py. Perhaps that could be done, but I don't see any advantage to doing so.

> Each language would have its own default chain, but you could add your own chains, parameters, or programs in your file's "settings" node for each language whose defaults you want to override.

That's exactly the situation at present. Each importer uses settings to modify the operation of basescanner.py, but importers are free to override various methods as needed. For example, see importers/python.py.

Yes, the BaseScanner.Parsing methods are hairy. But there is a reason: the parsing code, especially the scan and scanHelper methods, must preserve line breaks and whitespace. Traditional parsers don't do this. They simply create a parse tree.

I have lots of experience with Python's AST parse trees. The lack of any way to annotate the tree with comments and whitespace is a big hole in the Python API. I describe the workaround in this Stack Overflow page. As you can see, there are substantial difficulties involved.
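The hole is easy to demonstrate with the standard library. The sketch below is not the workaround from the Stack Overflow page; it only shows that ast.parse discards comments, and that the tokenize module still sees them, which is the usual starting point for putting them back. The helper name extract_comments is made up for this example.

```python
import ast
import io
import tokenize

def extract_comments(source):
    """Return (line_number, comment_text) pairs found by the tokenizer."""
    readline = io.StringIO(source).readline
    return [
        (tok.start[0], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.COMMENT
    ]

source = "x = 1  # the answer\n\n# standalone comment\ny = x + 1\n"

# The parse tree has no trace of either comment.
tree_dump = ast.dump(ast.parse(source))

# The token stream still carries both, with their line numbers.
comments = extract_comments(source)
```

Matching those tokens back to the right tree nodes is where the substantial difficulties begin.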

But our problem is even harder: to create a character-oriented parser for every imported language. The present parsing code works, though it is ugly behind the scenes. Rather than using the base scanHelper method, individual importers can override scanHelper. That's a feasible approach, and it may be that some importers actually do that, but it doesn't change the overall situation.

> I imagine some of these tools would be general and then made specific to the language at hand by passing them things like a keyword list, regexes, and the like. I'm not sure whether this would need an intermediate format that gets turned into nodes by a "nodeMaker" program at the end (this seems the likely scenario, as the Unix approach is to keep things in text as long as possible), or whether each program would more directly refine the nodes created by the previous program in the chain.

The tools already exist. They are all the methods in basescanner.py.

> Python has support for this sort of thing in the "included batteries": the io module includes tools for working with strings. To make things look more Unix-like, there is pipetools: https://pypi.python.org/pypi/pipetools.

To summarize, your proposal amounts to a request to redesign or refactor basescanner.py. I'm not going to do this, absent proof that the actual problems to be solved are much easier than I know them to be :-)

Edward

Edward K. Ream

Sep 22, 2016, 9:55:22 AM

to leo-editor

On Thursday, September 22, 2016 at 6:30:04 AM UTC-5, Edward K. Ream wrote:

> To summarize, your proposal amounts to a request to redesign or refactor basescanner.py.

In fact, the BaseScanner code already has three phases, as shown by the tree structure of the BaseScanner class's code. The top levels of the phases are:

1. The bs.scan method, in Parsing, called from bs.run.
2. The bs.put methods, in Code generation, called from bs.scan and helpers.
3. The bs.check method, in Checking, called from bs.run.

One could imagine separating these phases by passing dicts of information between them, but nothing much would change. The code would still be exactly as complex.
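For concreteness, here is a toy version of that separation, with each phase handing a dict to the next. The function names echo the three phases, but the bodies are simplified stand-ins written for this example (a "unit" is just any unindented line and what follows it); they are not the BaseScanner code.

```python
def scan(text):
    """Phase 1 (parsing): record the line index where each top-level unit starts."""
    lines = text.splitlines(keepends=True)
    starts = [i for i, line in enumerate(lines)
              if line.strip() and not line[0].isspace()]
    return {'lines': lines, 'starts': starts}

def put(parsed):
    """Phase 2 (code generation): cut the lines into node bodies at each start."""
    lines, starts = parsed['lines'], parsed['starts']
    bounds = sorted(set([0] + starts)) + [len(lines)]
    bodies = [''.join(lines[a:b]) for a, b in zip(bounds, bounds[1:]) if a < b]
    return {'bodies': bodies}

def check(text, generated):
    """Phase 3 (checking): the bodies, written in order, must equal the input."""
    return {'ok': ''.join(generated['bodies']) == text}

def run(text):
    parsed = scan(text)
    generated = put(parsed)
    result = check(text, generated)
    return generated['bodies'], result['ok']

bodies, ok = run("a = 1\ndef f():\n    return a\n\nb = 2\n")
```

The dicts make the boundaries explicit, but the work inside each phase is unchanged.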

The complications of the basescanner code are a good thing, imo. They simplify the importers, as clearly shown by the C importer.

EKR

Edward K. Ream

Sep 23, 2016, 1:53:46 PM

to leo-editor

On Wednesday, September 21, 2016 at 11:11:06 AM UTC-5, tfer wrote:

> There are always requests for new importers or changes to existing importers, but few people other than Edward have wrapped their minds around how to do this for themselves. This is a proposal to make it easier for people to create their own importers, or to customize any importer to create nodes at whatever level of detail they want.

The real problem is making parsing simpler. It may be that using a regex-based approach could simplify parsing for some languages. #315 is a reminder to investigate this approach.
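A minimal sketch of what a regex-based scan might look like, assuming a language where definitions announce themselves at the start of a line. Python syntax is used as the target purely for illustration; the pattern and the find_definitions helper are hypothetical and are not taken from any existing importer.

```python
import re

# One pattern per language would be the "parameter" that specializes the scan.
DEF_PATTERN = re.compile(r'^(?P<indent>[ \t]*)(?P<kind>class|def)\s+(?P<name>\w+)')

def find_definitions(text):
    """Return (line_number, indent_width, kind, name) for each definition line."""
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = DEF_PATTERN.match(line)
        if match:
            found.append((number, len(match.group('indent')),
                          match.group('kind'), match.group('name')))
    return found

sample = "class A:\n    def f(self):\n        pass\n\ndef g():\n    pass\n"
definitions = find_definitions(sample)
```

The scan works on whole lines only, so no character-level bookkeeping is needed to preserve whitespace.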

EKR

Edward K. Ream

Sep 23, 2016, 3:12:50 PM

to leo-editor

Heh. I have just closed #315: the coffeescript importer already uses this approach to simplify the scan method.

EKR

Edward K. Ream

Oct 10, 2016, 7:59:58 AM

to leo-editor

On Wednesday, September 21, 2016 at 11:11:06 AM UTC-5, tfer wrote:

> There are always requests for new importers or changes to existing importers, but few people other than Edward have wrapped their minds around how to do this for themselves.

The new javascript importer points the way to a simpler framework for importers. Some notes:

1. As mentioned elsewhere, almost all the code resides in the JavaScriptScanner class and its helpers, the Block and ScanState classes. Imo, these classes are the simplest thing that could possibly work.

2. These three classes form a framework for writing any importer.

We could imagine a GeneralScanState class that scans input line by line. To do this, the class needs to know comment and string delimiters. These correspond to the following ivars of the BaseScanner class:

    .blockCommentDelim1/2
    .blockDelim1/2
    .lineCommentDelim/2
    .outerBlockDelim1/2

The great thing about the new code is that no character-oriented hacks are needed. Lines get added only in their entirety. As a result, the following BaseScanner ivars would not be needed in the new scheme:

    .classTags
    .extraIdChars
    .functionTags
    .outerBlockEndsDecls
    .sigHeadExtraTokens
    .sigFailTokens

These ivars are pure hacks. They modify the ultra-complex code in BaseScanner.scanHelper and in various code-generation methods in basescanner.py.

Finally, the GeneralScanState class needs to know which kinds of brackets delimit blocks.
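Such a class might look something like the sketch below. It is only a guess at the shape: the class name, constructor parameters, and scan_line method are hypothetical, and it makes simplifying assumptions (non-empty delimiters, no regex literals, backslash as the only string escape).

```python
class GeneralScanStateSketch:
    """Hypothetical per-line scan state, parameterized by delimiters."""

    def __init__(self, line_comment='//', block_comment=('/*', '*/'),
                 string_delims='"\'', block_delims=('{', '}')):
        self.line_comment = line_comment
        self.comment_open, self.comment_close = block_comment
        self.string_delims = string_delims
        self.block_open, self.block_close = block_delims
        self.depth = 0           # current block nesting depth
        self.in_comment = False  # inside a block comment
        self.in_string = ''      # the active quote character, if any

    def scan_line(self, line):
        """Update the state from one whole line and return the new depth."""
        i = 0
        while i < len(line):
            if self.in_comment:
                end = line.find(self.comment_close, i)
                if end == -1:
                    break  # the comment continues on the next line
                self.in_comment = False
                i = end + len(self.comment_close)
            elif self.in_string:
                if line[i] == '\\':
                    i += 2  # skip the escaped character
                    continue
                if line[i] == self.in_string:
                    self.in_string = ''
                i += 1
            elif line.startswith(self.line_comment, i):
                break  # the rest of the line is a comment
            elif line.startswith(self.comment_open, i):
                self.in_comment = True
                i += len(self.comment_open)
            elif line[i] in self.string_delims:
                self.in_string = line[i]
                i += 1
            else:
                if line[i] == self.block_open:
                    self.depth += 1
                elif line[i] == self.block_close:
                    self.depth -= 1
                i += 1
        return self.depth


js_lines = [
    'function f() {   // open {',
    '  var s = "}";   /* {',
    '  still comment } */',
    '  if (x) { g(); }',
    '}',
]
state = GeneralScanStateSketch()
depths = [state.scan_line(line) for line in js_lines]
```

Brackets inside strings and comments are ignored, so the depth after each line reflects only real block structure.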

The new scheme would bypass all the code generation routines in the BaseScanner class. Instead, the new code just creates child nodes for each block. Nothing could be simpler, or more natural.
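To make "child nodes for each block" concrete, here is a small sketch. The Block class here is a hypothetical stand-in, not the one in the JavaScript importer, and the default naive_delta simply counts braces on a line; a real importer would take the depth from its scan state so that strings and comments are ignored. The sketch also does not record where a child sits among its parent's lines, which a real importer would need.

```python
class Block:
    """Hypothetical block: whole lines plus nested child blocks."""

    def __init__(self, headline=''):
        self.headline = headline
        self.lines = []      # whole lines owned directly by this block
        self.children = []   # nested blocks, in order of appearance


def naive_delta(line):
    # Assumption: braces never appear in strings or comments.
    return line.count('{') - line.count('}')


def build_blocks(lines, delta=naive_delta):
    """Assign every line, in its entirety, to exactly one block."""
    root = Block('root')
    stack = [(root, 0)]  # (block, depth at which it was opened)
    depth = 0
    for line in lines:
        new_depth = depth + delta(line)
        if new_depth > depth:
            # This line opens a block: create a child of the current block.
            child = Block(line.strip())
            stack[-1][0].children.append(child)
            stack.append((child, new_depth))
        stack[-1][0].lines.append(line)
        depth = new_depth
        # Close every block whose opening depth is no longer reached.
        while len(stack) > 1 and depth < stack[-1][1]:
            stack.pop()
    return root


js = ["var a = 1;\n", "function f() {\n", "  if (a) {\n", "    g();\n",
      "  }\n", "}\n", "var b = 2;\n"]
tree = build_blocks(js)
```

No line is ever split, which is why no character-oriented code generation is needed.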

Finally, the new code uses a custom check method. The existing code parameterizes the checks made in BaseScanner.check, but for experimentation it is quite handy simply to override the base class method entirely.

Summary

Import code will never be trivial, and special cases will have to be made for some languages. For example, a completely new version of the ScanState class would be needed for Python, because blocks are delimited by indentation, not delimiters.
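A sketch of what an indentation-based state could look like, for comparison with a bracket-counting one. The names PythonScanStateSketch and block_end are hypothetical. The sketch ignores continuation lines and multi-line strings, and counts a tab as one column.

```python
class PythonScanStateSketch:
    """Hypothetical per-line state: tracks indentation instead of bracket depth."""

    def __init__(self):
        self.indent = 0

    def scan_line(self, line):
        """Update the state from one whole line; return True if it is significant."""
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            return False  # blank and comment-only lines leave the state unchanged
        self.indent = len(line) - len(line.lstrip())
        return True


def block_end(lines, start):
    """Return the index of the first line after the block starting at lines[start]."""
    state = PythonScanStateSketch()
    state.scan_line(lines[start])
    base = state.indent
    for i in range(start + 1, len(lines)):
        # A significant line indented no deeper than the header ends the block.
        if state.scan_line(lines[i]) and state.indent <= base:
            return i
    return len(lines)


py_lines = [
    "class A:\n",
    "    def f(self):\n",
    "\n",
    "        return 1\n",
    "# stray comment\n",
    "    x = 2\n",
    "def g():\n",
    "    pass\n",
]
class_end = block_end(py_lines, 0)
method_end = block_end(py_lines, 1)
```

Only the rule for where a block ends changes; lines are still handled in their entirety.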

I would certainly have used the new scheme for all importers if I had discovered it earlier. For now, I plan to convert an importer to the new scheme only if serious bugs are reported against it.

Edward
