ENB: Improving token assignment


Edward K. Ream

Jan 19, 2020, 7:35:32 AM1/19/20
to leo-e...@googlegroups.com
The next phase of the project is to complete the code that splits long lines and joins short lines. I want this code to be as simple as possible. The crucial split/join "snippets" should advertise the virtues of the TOG class.

Just as with the code that handles slices, I have only a vague idea of what the final split/join code will look like. This ENB notebook post attempts to clarify issues relating to the split/join logic. As always, feel free to ignore it.

Background

At present, the code that splits lines is entirely token based. This usually works well enough, but the token-based code relies on an open-parenthesis token already being present in the statement. If this open paren exists, the long line may safely be split anywhere between the parens. Most long lines involve function-call statements (ast.Call nodes), and such statements do indeed contain the needed open paren. Alas, other Python statements, including returns and assignments, may not already have the needed open paren. The split code must know where to insert the required pair of parens.
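The distinction is easy to see with the stdlib tokenize module. This is a minimal illustrative sketch, not TOG code; paren_split_points is a hypothetical helper that merely locates the paren tokens a token-based splitter could split between:

```python
import io
import tokenize

def paren_split_points(line):
    """Return (token-string, column) pairs for '(' and ')' tokens in a line.

    Illustration only: a token-based splitter may break a long line anywhere
    between a matching '(' and ')'. This helper just finds those tokens.
    """
    points = []
    for tok in tokenize.generate_tokens(io.StringIO(line).readline):
        if tok.type == tokenize.OP and tok.string in '()':
            points.append((tok.string, tok.start[1]))
    return points

# A call statement already contains the needed open paren...
print(paren_split_points("x = f(a, b, c)"))
# ...but a plain assignment has none, so parens must be inserted first.
print(paren_split_points("x = a + b + c"))
```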

In short, my working assumption is that access to the parse tree is essential, or at least very helpful, to the split logic. Ditto for the join logic.

Gaining access to the parse tree

o.colon could get the relevant parse tree from self.token.node, because colons are significant tokens. Job done.

How to access the parse tree for long lines? Using the newline token seems reasonable, because newline tokens are also significant. However, the one-line code snippets used by the split/join logic don't contain any newlines.

Problems assigning newline tokens

More generally, the last newline of a code snippet is assigned to the ast.Module node. At the very least, this must be changed. Or does it?  And if so, how?

We could ignore (temporarily) the problems with assigning tokens to nodes. For example, we could "trigger" the split/join logic in the o.name token handler. "name" tokens are significant, so self.token.node will be the parse tree for the name. For function calls, we would have to look up the tree to determine whether the name is a function name. Doable, but not pretty.

"return", "if", "while" etc are keywords, so the parse tree is usable as is. Assignments would require a trigger on "=" tokens, that is, op tokens whose values is "=".

So this approach is clunky. It spreads the split/join logic over too many nodes. It seems more reasonable to trigger the split/join logic on the 'newline' token, or the 'endmarker' token for the special case that the file/snippet ends without a newline. Or maybe we can just force a trailing newline for all files/snippets.
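Forcing a trailing newline is cheap, and it guarantees that every snippet ends with a NEWLINE token followed by ENDMARKER, so a single 'newline' trigger would always fire. A stdlib sketch of the idea:

```python
import io
import tokenize

def token_names(snippet):
    # Force a trailing newline so every snippet ends with NEWLINE + ENDMARKER,
    # letting the split/join logic always trigger on a 'newline' token.
    if not snippet.endswith('\n'):
        snippet += '\n'
    return [
        tokenize.tok_name[tok.type]
        for tok in tokenize.generate_tokens(io.StringIO(snippet).readline)
    ]

print(token_names("return x"))  # ends with NEWLINE, ENDMARKER
```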

Extrinsically significant tokens

At present, tokens are classified as either significant or insignificant. That is, "significance" is an intrinsic property of each token. This is foolish, and limiting.

Indeed, the ast.Call and ast.Tuple visitors already call tog.gen_token for parentheses tokens. In such contexts, parens should be considered significant, and the eventual call to tog.sync_token should synchronize on those tokens. This would ensure that the parens are assigned to the proper node! Alas, sync_token doesn't do that. At present, it just stupidly returns, assigning the parens (later) to the next "officially" significant token. As a result, parens are not assigned properly for calls and tuples.
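The desired assignment is at least geometrically well defined. A minimal stdlib sketch (not TOG code; call_paren_cols is a hypothetical helper) shows that the '(' and ')' tokens of a call fall strictly inside the ast.Call node's source span, so a sync on those tokens has an unambiguous home node:

```python
import ast
import io
import tokenize

def call_paren_cols(source):
    """Return (call_start, call_end, paren_columns) for a one-line call.

    Illustration only: the paren tokens lie inside the Call node's span
    [col_offset, end_col_offset), so they could be assigned to that node
    rather than to the next "officially" significant token.
    """
    call = ast.parse(source).body[0].value
    assert isinstance(call, ast.Call)
    paren_cols = [
        tok.start[1]
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.OP and tok.string in '()'
    ]
    return call.col_offset, call.end_col_offset, paren_cols

print(call_paren_cols("f(a, b)\n"))  # parens at columns 1 and 6, inside [0, 7)
```

(end_col_offset requires Python 3.8 or later.)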

We could go further, and have various visitors generate comma tokens, but I doubt that would ever be useful.

Summary

Whatever happens, the code should properly assign paren tokens in calls and tuples. Ditto for newline tokens that end many statement lines. Only tog.sync_token will need to change, but that will be surprisingly tricky. Details omitted.

I'll investigate using the parse tree as a guide to splitting and joining lines only after parens and newline tokens are more reasonably assigned to ast nodes.

Edward

P. S. There is another complication: statements may become "long" via Python's backslash-newline convention. The black tool itself takes the extreme view that backslash-newlines should always be eliminated. But this would be wrong, wrong, wrong in Leo, because Leo nodes cannot represent underindented triple-quoted strings. For example, all of the unit tests in leoAst.py for multi-line test code contain this pattern:

    # use r'""" if lines contain back-slashes.
    contents = """\
line 1
line 2
"""

Depending on the outline level of the node in which this code resides, line 1, line 2 etc. will initially contain unseen leading whitespace. The test-running code removes such leading whitespace. Anyway, Leo depends on the backslash newline convention.
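That whitespace removal behaves like textwrap.dedent, which is used here purely as a stand-in for whatever the test-running code actually does:

```python
import textwrap

# Illustration only: depending on the outline level, each line of the stored
# snippet acquires unseen leading whitespace, which must be stripped before
# the test code is compiled. textwrap.dedent removes the common prefix.
contents = """\
    line 1
    line 2
"""
print(textwrap.dedent(contents))  # the lines come out flush left
```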

Even outside Leo the coding pattern shown above seems perfectly reasonable for unit-tests. Why prohibit it?

EKR