I am writing this in the middle of the night, too excited to sleep.
This is an Engineering Notebook post. Feel free to ignore.
Here, I'll write down my initial plan for a remarkable collapse in the complexity of the code. I'll also record the genesis of this idea. I know from experience that big Aha's warp my memory, making it difficult to remember just how the Aha came about.
Summary
Aha 1. There is no need for a rescanning the incoming lines! BLS.gen_lines can do the job in one pass, without recursion, using only simple helpers.
Aha 2. The StateScan class can be simplified greatly. No need for push/pop methods. No need to remember previous states.
History
Yesterday I futzed with the code without much success. That code is now stashed, and will likely disappear forever.
I took two baths yesterday, noodling about two seemingly small details. The Aha was the product of considering those details.
Detail 1: It was clumsy to switch parser states. How, exactly, does one indicate that we have just entered or left a block (class, method or function)? Previously, the scanners used "base" counts. Yesterday, I say that temporary "delta" counts simplified things a bit. But scanning was still way too confusing.
Detail 2: As I rewrote parts of ScanState and PythonScanState, I became aware that there is a problem with the continues_state and begin_state predicates for the javascript language. Indeed, the js scanner will begin a block when either the curly bracket count or the paren counts increase. All well and good. But what to do when only one of them decreases??? In the usual round-about way, this lead to the Aha's.
The algorithm
The rewritten gen_lines (lines, parent) method will generate the body text of the parent node and all created descendant nodes line by line. Something like this:
<< define Block class, containing various permanent data >>
def gen_lines(self, lines, parent):
'''Parse all lines, adding to parent.b, creating child nodes as necessary.'''
state = self.state # The subclass of ScanState for this importer
prev_state = state.initial_state()
stack = [Block(parent, prev_state)]
target = parent
for s in lines:
new_state = state.scan_line(s)
if new_state > prev_state:
<< create @others or section ref in target >>
target = << new child of the present target node >>
stack.append(Block(target, new_state))
elif new_state < prev_state:
<< cut back stack until stack[-1] matches new_state >>
# Common code
target = stack[-1].target
prev_stat = new_state
target.b = target.b + s # Or an optimized version of this.
That's all!!! The following will discuss details.
Handling tail lines
Tail lines are lines following the lines of a class, method or function. When using section references, they can be written immediately to the target. When using @others, they must be written to the defining node.
The code above doesn't handle tail lines properly when using @others. A relatively minor hack will be required.
Cutting back the stack
The line: << cut back stack until stack[-1] matches new_state >> handles the complication mentioned in detail 2.
In particular, for javascript, there are two counts involved, for curly brackets and for parens. This is why an explicit stack is required. There is no way to know what the next target node is going to be after a block ends! It could be the previous block. Or the stack may have to be cut back to the original parent.
Comparing states
The ScanState and all subclasses will define the standard 2.7/3.x rich comparison operators. __eq__, __lt__, etc. These will replace the starts/ends/continues_block methods. No more push/pop methods.
__eq__ will return True if the scanner is in any multi-line construct. Otherwise, it returns true if all the counts match. And similarly for the other comparison.
What happens if one count increases and the other decreases? I think the answer has to be that the count of curly brackets must have priority. Believe it or not, this was the question that lead to the Ahas.
Injecting Block's into targets
The post_pass method may want data contained in the Blocks. For instance, indentation info. This can be done by injecting a reference to a block into the target vnode. Something like this:
target.v._importer_block = stack[-1]
Whitespace
As in the present code, the code above writes lines to targets, without changing the leading whitespace of the line. The post_pass will eliminate as much common whitespace as possible, but no more.
There is a bug in the present code. The undent method removes the maximal amount of leading whitespace from each node. But this could remove too much, depending on the cumulative indentation in all parent nodes.
In the new scheme, this cumulative indentation will either exist in the
injected block of a node, or can can be computed from the injected
blocks of a nodes and all its parent nodes.
Indentation of real javascript code is a nightmare. We have an unpleasant choice. We can leave imported code unchanged, which may result in many underindented lines, or we can regularize leading whitespace, thereby changing the original file if the user rewrites it.
Summary
Nothing simpler than the new code can be imagined. A big help for maintainers, including me.
The code shown above doesn't handle tail lines properly when using @others. A hack will be needed.
The ScanState classes no longer remembers "base" counts. They simply maintain one or more counts, and the context. This is a surprisingly important simplification. It resolves Detail 1. A new initial_state() method will return a state with empty context and zero counts.
The new code explicitly handles edge cases discussed in Detail 2. Nothing more flexible is possible.
Enough for now. Time for some sleep :-)
Edward