ENB: Leo's line-oriented importers: the last collapse in complexity

35 views
Skip to first unread message

Edward K. Ream

unread,
Nov 4, 2016, 4:03:30 AM11/4/16
to leo-editor
I am writing this in the middle of the night, too excited to sleep.

This is an Engineering Notebook post. Feel free to ignore.

Here, I'll write down my initial plan for a remarkable collapse in the complexity of the code.  I'll also record the genesis of this idea. I know from experience that big Aha's warp my memory, making it difficult to remember just how the Aha came about.

Summary

Aha 1. There is no need for a rescanning the incoming lines!  BLS.gen_lines can do the job in one pass, without recursion, using only simple helpers.

Aha 2. The StateScan class can be simplified greatly.  No need for push/pop methods.  No need to remember previous states.

History

Yesterday I futzed with the code without much success.  That code is now stashed, and will likely disappear forever.

I took two baths yesterday, noodling about two seemingly small details.  The Aha was the product of considering those details.

Detail 1: It was clumsy to switch parser states.  How, exactly, does one indicate that we have just entered or left a block (class, method or function)?  Previously, the scanners used "base" counts.  Yesterday, I say that temporary "delta" counts simplified things a bit.  But scanning was still way too confusing.

Detail 2: As I rewrote parts of ScanState and PythonScanState, I became aware that there is a problem with the continues_state and begin_state predicates for the javascript language.  Indeed, the js scanner will begin a block when either the curly bracket count or the paren counts increase.  All well and good.  But what to do when only one of them decreases??? In the usual round-about way, this lead to the Aha's.

The algorithm

The rewritten gen_lines (lines, parent) method will generate the body text of the parent node and all created descendant nodes line by line.  Something like this:

<< define Block class, containing various permanent data >>

def gen_lines(self, lines, parent):
    '''Parse all lines, adding to parent.b, creating child nodes as necessary.'''
    state = self.state # The subclass of ScanState for this importer
    prev_state = state.initial_state()
    stack = [Block(parent, prev_state)]
    target = parent
    for s in lines:
        new_state = state.scan_line(s)
        if new_state > prev_state:
            << create @others or section ref in target >>
            target = << new child of the present target node >>
            stack.append(Block(target, new_state))
        elif new_state < prev_state:
            << cut back stack until stack[-1] matches new_state >>
     # Common code
     target = stack[-1].target
     prev_stat = new_state
     target.b = target.b + s # Or an optimized version of this.
      
That's all!!! The following will discuss details.

Handling tail lines

Tail lines are lines following the lines of a class, method or function.  When using section references, they can be written immediately to the target.  When using @others, they must be written to the defining node.

The code above doesn't handle tail lines properly when using @others.  A relatively minor hack will be required.

Cutting back the stack

The line: << cut back stack until stack[-1] matches new_state >> handles the complication mentioned in detail 2.

In particular, for javascript, there are two counts involved, for curly brackets and for parens.  This is why an explicit stack is required. There is no way to know what the next target node is going to be after a block ends!  It could be the previous block.  Or the stack may have to be cut back to the original parent.

Comparing states

The ScanState and all subclasses will define the standard 2.7/3.x rich comparison operators. __eq__, __lt__, etc. These will replace the starts/ends/continues_block methods. No more push/pop methods. 

__eq__ will return True if the scanner is in any multi-line construct. Otherwise, it returns true if all the counts match. And similarly for the other comparison.

What happens if one count increases and the other decreases?  I think the answer has to be that the count of curly brackets must have priority.  Believe it or not, this was the question that lead to the Ahas.

Injecting Block's into targets

The post_pass method may want data contained in the Blocks. For instance, indentation info. This can be done by injecting a reference to a block into the target vnode.  Something like this:

    target.v._importer_block = stack[-1]

Whitespace

As in the present code, the code above writes lines to targets, without changing the leading whitespace of the line.  The post_pass will eliminate as much common whitespace as possible, but no more.

There is a bug in the present code.  The undent method removes the maximal amount of leading whitespace from each node.  But this could remove too much, depending on the cumulative indentation in all parent nodes.

In the new scheme, this cumulative indentation will either exist in the injected block of a node, or can can be computed from the injected blocks of a nodes and all its parent nodes.

Indentation of real javascript code is a nightmare.  We have an unpleasant choice.  We can leave imported code unchanged, which may result in many underindented lines, or we can regularize leading whitespace, thereby changing the original file if the user rewrites it.

Summary

Nothing simpler than the new code can be imagined. A big help for maintainers, including me.

The code shown above doesn't handle tail lines properly when using @others.  A hack will be needed.

The ScanState classes no longer remembers "base" counts. They simply maintain one or more counts, and the context. This is a surprisingly important simplification.  It resolves Detail 1. A new initial_state() method will return a state with empty context and zero counts.

The new code explicitly handles edge cases discussed in Detail 2. Nothing more flexible is possible.

Enough for now.  Time for some sleep :-)

Edward

Edward K. Ream

unread,
Nov 4, 2016, 10:33:04 AM11/4/16
to leo-editor
On Fri, Nov 4, 2016 at 3:03 AM, Edward K. Ream <edre...@gmail.com> wrote:

Here, I'll write down my initial plan for a remarkable collapse in the complexity of the code.

​The start of the new code is at​
 
​e9a78cb.

The new code is disabled by the gen_v2 switch in basescanner.py.  It will remain disabled until all tests pass everywhere, which won't happen for awhile.

The changeover is tricky.  I would like to clean up some errors in the ScanState class, but the first try failed miserably.  Extreme caution is needed, which is why the commits will take baby steps.

Edward

Edward K. Ream

unread,
Nov 4, 2016, 10:54:12 AM11/4/16
to leo-editor
On Friday, November 4, 2016 at 9:33:04 AM UTC-5, Edward K. Ream wrote:
On Fri, Nov 4, 2016 at 3:03 AM, Edward K. Ream <edre...@gmail.com> wrote:
 
> Here, I'll write down my initial plan for a remarkable collapse in the complexity of the code.

In another related thread I said:

> The BLS class is a fundamentally important part of Leo...It is worth any amount of work make the new importers as beautiful and accurate as possible.

More true than ever.  The so-called V1 code generators never worked, yet they contributed to the recent Aha. It was well worth the effort.

I expect it will take several days before the V2 work is complete. It will be one of the highlights of this year.

EKR
Reply all
Reply to author
Forward
0 new messages