Please review the new docs for #1440


Edward K. Ream

Jan 14, 2020, 8:02:30 AM
to leo-editor
In a day or three I'll be announcing #1440, the unification of parse trees and tokens, to Python's core developers. I'll do that by making a comment in this related Python issue.

In preparation for that announcement, I have revised/created the first three comments of #1440:

- The first (main) comment gives an overview of the project and its status.
- The second comment is the Theory of Operation.
- The third comment contains the Project's history.

The announcement itself will be fairly brief, and will refer readers to #1440 for more details. So I want these three comments to be "perfect".

Please read and critique all three comments.  I am interested in your comments and suggestions.

Edward

P. S. There are similar discussions in LeoDocs.leo, but I'll be referring to #1440 in the upcoming public announcements. The links above are good looking and are available to anyone, including those with no access to Leo.

EKR

Brian Theado

Jan 14, 2020, 5:59:23 PM
to leo-editor
I just followed the stackoverflow link (https://stackoverflow.com/questions/16748029/how-to-get-source-corresponding-to-a-python-ast-node#) and someone posted that they created https://github.com/gristlabs/asttokens which "Annotates Python AST trees with source text and token information". Is that different from what your code does?


Brian Theado

Jan 14, 2020, 6:10:48 PM
to leo-editor
Ok, now I see in your proposed email response in the other thread you do mention the asttokens project. And also later in #1440 you said "The asttokens project assigns tokens to ast nodes, but only for statement-level ast nodes." That's useful to know.

One thing I really like about the front page of that project (https://github.com/gristlabs/asttokens) is the front-and-center example which shows in a few lines of code how to call the code using concrete inputs and shows exactly what the outputs are.

I find that very valuable. Examples are usually the first thing I want to see. Can you have something similar in #1440?

Brian Theado

Jan 14, 2020, 7:06:16 PM
to leo-editor
In the theory of operation:

"The notion of a token order traversal of a parse tree is the foundation of this project"

In contrast, what traversal order do parse trees provide? How is token order different/better? What does it allow me to do that I can't otherwise do with parse trees?

"This help is essential, because the following two source files generate identical parse trees!"

This implies to me that with a parse tree you can't do a round trip from source code, to a parse tree and back to the original source code. Is that correct? If so is this just one externally visible benefit to your library or is it the main benefit? If it is the main benefit, then I think it should be made much more clear earlier in the document like in the Overview near here:

"These links promise to collapse the complexity of any code that changes text, including the asttokens, fstringify, and black projects."

something like:

"These links allow portions of python source code to be transformed while leaving the rest untouched. Much of the complexity of the asttokens, fstringify, and black projects comes from not being able to link between the text and structure of the Python code.  Using the code of this library can collapse the complexity of these and any projects which change Python text"

Brian Theado

Jan 14, 2020, 10:59:36 PM
to leo-editor
I like your second comment. It is well written and informative. Nice job.

You stress "token order traversal".  How does that contrast the traversal provided by the ast library? Does your library still traverse the ast nodes in the same order as the ast library does, just that with each visited node you will have access to all the tokens corresponding to that node?

Back to the first comment:

A children array from each ast node to its children. Order matters!

In contrast, how does the ast library link the ast nodes? Not at all? Linked, but out of order? Something else?

And this:
  • For each token, token.node contains the ast.node "responsible" for the token.
  • For each ast node, node.first_i and node.last_i are indices into the token list.
    These indices give the range of tokens that can be said to be "generated" by the ast node.
Does this imply the user can choose between traversing each token (and from the token reach the node) and traversing each node (and from the node reach each token)? In what case would the user want to do one over the other?

Edward K. Ream

Jan 15, 2020, 2:46:26 AM
to leo-editor
On Tue, Jan 14, 2020 at 6:10 PM Brian Theado <brian....@gmail.com> wrote:

> One thing I really like about the front page of that project (https://github.com/gristlabs/asttokens) is the front-and-center example which shows in a few lines of code how to call the code using concrete inputs and shows exactly what the outputs are.

> I find that very valuable. Examples are usually the first thing I want to see. Can you have something similar in #1440?

I'll see what I can do. It could turn into a separate project, but I think it's important to complete that project before announcing leoAst.py.

Edward

Edward K. Ream

Jan 15, 2020, 3:57:48 AM
to leo-editor
On Tue, Jan 14, 2020 at 7:06 PM Brian Theado <brian....@gmail.com> wrote:

In the theory of operation:

"The notion of a token order traversal of a parse tree is the foundation of this project"

In contrast, what traversal order do parse trees provide?

None whatever. Traversals are defined by code, not by the tree itself.

The ast module is particularly deficient in this regard. The documentation for ast.walk says:

" Recursively yield all descendant nodes in the tree starting at node (including node itself), in no specified order. This is useful if you only want to modify nodes in place and don’t care about the context."

Hmm. This could be one motivating example. The TOG class inserts parent/child links, and TOT.traverse(tree) is the token order traversal. So a much more valuable version of ast.walk would be:

tog = TokenOrderGenerator()
tot = TokenOrderTraverser()
contents, encoding, tokens, tree = tog.init_from_file(filename)
tot.traverse(tree)

How is token order different/better? What does it allow me to do that I can't otherwise do with parse trees?

Great question. Perhaps I won't need a separate post after all. Here is a long answer; boiled down, it must become part of both the announcement and the regular docs.

Recall that the python issue deals with deficiencies in ast-related tools. The opening comment of that issue says: "the built-in AST does not preserve comments or whitespace;"

This is only a small part of the problem facing anyone who wants to write a program like fstringify or black:

1. The data in the parse tree does not preserve the spelling of comments and strings. Why, I don't know, but that can't be helped. ast.parse creates the initial parse trees, and ast.parse can't change in any way because the ast module is cast in stone.

2. In contrast, the token list is what I have been calling the "ground truth" of a program. Comment and string tokens do preserve spelling. It is straightforward to recreate the program from the token list. That's what the tokens_to_string function (in leoAst.py) does.

3. There is, in principle, no short, easy way to associate tokens with ast nodes. The TOG class does this in what I firmly believe is the simplest, clearest possible code. But the TOG class is far from short and easy.
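Points 1 and 2 can be demonstrated with the standard library alone. A minimal sketch (the sample source lines are invented):

```python
import ast
import io
import tokenize

with_comment = "x = 1  # a comment\n"
without = "x = 1\n"

# Point 1: ast.parse discards comments entirely, so these two
# different sources produce identical parse trees.
assert ast.dump(ast.parse(with_comment)) == ast.dump(ast.parse(without))

# Point 2: the token list is "ground truth": the comment survives,
# and untokenize recreates the original text exactly.
tokens = list(tokenize.generate_tokens(io.StringIO(with_comment).readline))
assert tokenize.untokenize(tokens) == with_comment
```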

So the first answer to your question is: a token order traversal is what makes the TOG class possible.

But why is the TOG class itself valuable? What can devs do with it that they can't already do?

The TOG class inserts links between ast nodes and between nodes and tokens. These links are what TOG does, and nothing else.

But now you ask, what good are these links? This is what I've never properly explained.

The injected links will be useful for any tool that wants to modify python source code. fstringify and black are the two most prominent examples. Now we come to the real motivation. This is the "hole" in the documentation I have been talking about.

Any tool that wants to modify python text will benefit from having both a token-level view of the text and a parse-tree level view of the text. The asttokens package provides this dual view, but only for top-level python statements. In effect, the TOG class is a much simpler implementation of the asttokens package.

This suggests that some wrapper functions, similar/identical to those in the asttokens package, would be useful.

But I digress. Let me explain why the dual view (tokens and ast nodes) is useful. This is something I've never explained because I started the project knowing the answer.

Tokens preserve linear text order. Parse trees define the meaning of those tokens.

Mostly, tools like fstringify and black will want to work at the token level, because that is, or should be, the most natural way to modify text: just insert or delete the proper tokens.

Alas, at present, the fstringify and black tools work at the parse tree level, despite enormous difficulties in doing so, because sometimes those tools must have access to the meaning provided by the parse trees.

Example 1: (Fstringify) Potential f-strings are found by looking for an ast.BinOp node of a special form: the LHS of the BinOp must be a string, and the RHS must represent the one or more % specs in the LHS string.
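That check can be sketched with the ast module alone (a toy example, not the actual Fstringify code; the source line is invented):

```python
import ast

source = 'msg = "Hello, %s!" % name\n'
tree = ast.parse(source)

# A potential f-string: a BinOp whose operator is %, and whose
# left operand is a string literal.
candidates = [
    node for node in ast.walk(tree)
    if isinstance(node, ast.BinOp)
    and isinstance(node.op, ast.Mod)
    and isinstance(node.left, ast.Constant)
    and isinstance(node.left.value, str)
]
assert len(candidates) == 1
```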

Example 2: (Black) When splitting a long line, black must analyze the meaning of the corresponding line in significant detail. It needs to do this because in some cases black must insert parens which did not exist previously in the program. For example:

a = << a very very long line, possibly continued by the backslash newline convention >>

black will convert this to:

a = (
   line 1
   line 2...
)

where none of the lines line 1, line 2, etc. contain backslash newlines.

At present, both the fstringify and black tools are stuck in the "ast ghetto".

Much of what these tools do would be much, much easier if the token view of an ast node were available. For example, black uses a horrendously complicated auxiliary traversal just to determine the required line length of an ast node! But if token is the first token of the line, token.line is the physical line containing that token! No problem at all!
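The token.line claim is easy to verify with the standard tokenize module (a sketch; the sample line is invented):

```python
import io
import tokenize

source = "total = alpha + beta  # a long-ish line\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
first = tokens[0]  # the NAME token for 'total'

# tok.line is the entire physical line containing the token,
# so the line's length is available with no tree traversal at all.
assert first.line == source
line_length = len(first.line.rstrip("\n"))
```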

The TOG class would radically simplify both fstringify and black by allowing those tools to use token views where appropriate.

The Fstringify class in leoAst.py already demonstrates this. The Orange class will demonstrate how to collapse the complexity of black.

"This help is essential, because the following two source files generate identical parse trees!"

This implies to me that with a parse tree you can't do a round trip from source code, to a parse tree and back to the original source code. Is that correct?

In practice, you are correct.  ast.get_source_segment is new in Python 3.8, and there is also ast.fix_missing_locations. But these are hacks, and are very slow. Even with these functions, associating tokens with ast nodes is messy.
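For reference, here is ast.get_source_segment (Python 3.8+) in the simplest case; note that the original source string must be passed in on every call, which is part of why this approach is slow:

```python
import ast

source = "x = a + b\n"
tree = ast.parse(source)
assign = tree.body[0]

# The node's position fields are used to slice the original source.
segment = ast.get_source_segment(source, assign.value)
assert segment == "a + b"
```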

In contrast, round-tripping is a snap with the TOG class.

If so is this just one externally visible benefit to your library or is it the main benefit? If it is the main benefit, then I think it should be made much more clear earlier in the document like in the Overview near here:

"These links promise to collapse the complexity of any code that changes text, including the asttokens, fstringify, and black projects."

I agree. My long answer must be boiled down.

something like:

"These links allow portions of python source code to be transformed while leaving the rest untouched. Much of the complexity of the asttokens, fstringify, and black projects comes from not being able to link between the text and structure of the Python code.  Using the code of this library can collapse the complexity of these and any projects which change Python text"

Not bad! My long answer explains in more detail why what you say is true. I'll have to think about all this. There's no rush.

Many thanks for asking great questions.

Edward

Edward K. Ream

Jan 15, 2020, 4:30:48 AM
to leo-editor
On Tue, Jan 14, 2020 at 10:59 PM Brian Theado <brian....@gmail.com> wrote:

I like your second comment. It is well written and informative. Nice job.

Thanks.
You stress "token order traversal".  How does that contrast the traversal provided by the ast library?

ast.walk is feeble. Formally, it visits nodes "in no specified order", which actually means that it visits nodes in a useless order ;-)
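The "useless order" is easy to see: ast.walk is breadth-first, so a later statement can be yielded before the operands of an earlier one (a minimal sketch):

```python
import ast

tree = ast.parse("a = b + c\nprint(a)\n")
order = [type(n).__name__ for n in ast.walk(tree)]

# Breadth-first: the second statement's Expr node appears before the
# BinOp nested inside the first statement -- not source (token) order.
assert order.index("Expr") < order.index("BinOp")
```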

In contrast, TOT.traverse (which requires the links injected by the TOG class), visits nodes in token order, which is the order that will be most useful for any tool that modifies python source code.

Does your library still traverse the ast nodes in the same order as the ast library does, just that with each visited node you will have access to all the tokens corresponding to that node?

The ast module (ast.walk) defines no order at all.

Furthermore, the TOT.traverse method must run after the TOG class has created all links. So when links are available, they are available to all nodes.

You have, however, put your finger on something that probably should be documented.

The Fstringify class in leoAst.py replaces many tokens with a single, new, "fstringified" string.  In the process, it converts all the old tokens to "do-nothing" tokens. This is easy to do. I'll omit the details, except to say that all links are adjusted properly.

However, the Orange class may increase the net number of tokens. It would be possible to readjust links from nodes to tokens, but that would be messy. Instead, I'll probably replace one or more tokens with an extension token. Details are not clear yet.

Back to the first comment:

A children array from each ast node to its children. Order matters!

In contrast, how does the ast library link the ast nodes? Not at all? Linked, but out of order? Something else?

Something else. Some (but not all!) nodes have lineno, col_offset, end_lineno, and end_col_offset fields. For further fun, previous versions of Python did not compute these fields properly.
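A quick way to see which nodes carry position fields, checked here on a trivial module:

```python
import ast

tree = ast.parse("x = 1\n")
assign = tree.body[0]

# The Module node carries no position fields at all;
# statement and expression nodes have all four.
assert not hasattr(tree, "lineno")
assert (assign.lineno, assign.end_lineno) == (1, 1)
```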

And this:
  • For each token, token.node contains the ast.node "responsible" for the token.
  • For each ast node, node.first_i and node.last_i are indices into the token list.
    These indices give the range of tokens that can be said to be "generated" by the ast node.
Does this imply the user can choose between traversing each token (and from the token reach the node) and traversing each node (and from the node reach each token)? In what case would the user want to do one over the other?

Another great question. My long answer to your previous reply mostly answered this question.

Imo, the token view is usually the most natural view for tools that modify python sources. The more I work with ast trees, the more their deficiencies become apparent. I suspect the authors of fstringify and black overvalue parse trees and undervalue tokens.

All the big Ahas in creating the TOG class involved realizing that tokens are essential to guide the traversal of the parse tree. I had no idea that parse trees were so feeble. It was a big, big change in my attitude.

The Fstringify class is implicitly based on tokens, not parse trees. The changes are made to the token list, and the result is tokens_to_string(self.tokens). This is so elegant that readers may not understand what is going on!
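The token-level approach can be sketched with the standard tokenize module (a sketch only; the real Fstringify code operates on leoAst's own token list):

```python
import io
import tokenize

source = "x = 'old'\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Edit the token list, not the tree: replace the STRING token's text.
# The replacement has the same length, so recorded positions stay valid.
edited = [
    tok._replace(string="'NEW'") if tok.type == tokenize.STRING else tok
    for tok in tokens
]
assert tokenize.untokenize(edited) == "x = 'NEW'\n"
```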

The Orange class will also be based on tokens. This will be much more obvious in the Orange class because most of the "beautifying" will be done on tokens. The beautifier converts input tokens to output tokens.

My present plan is to use parse-tree views only for difficult cases:

1. Determine the exact spacing for slices. The present code doesn't have access to enough context. Using the parse tree should be the easy way to get that context.

2. Determine how to break long lines. I'm not sure what will happen. Perhaps a purely token-oriented view will suffice. We shall see.

Edward

Brian Theado

Jan 15, 2020, 8:54:45 AM
to leo-editor
On Wed, Jan 15, 2020 at 3:57 AM Edward K. Ream <edre...@gmail.com> wrote:
On Tue, Jan 14, 2020 at 7:06 PM Brian Theado <brian....@gmail.com> wrote:
[...] 
The ast module is particularly deficient in this regard. The documentation for  ast.walk is:

" Recursively yield all descendant nodes in the tree starting at node (including node itself), in no specified order. This is useful if you only want to modify nodes in place and don’t care about the context."

Hmm. This could be one motivating example. The TOG class inserts parent/child links, and TOT.traverse(tree) is the token order traversal. So a much more valuable version of ast.walk would be:

tog = TokenOrderGenerator()
tot = TokenOrderTraverser()
contents, encoding, tokens, tree = tog.init_from_file(filename)
tot.traverse(tree)

But in its current form, this example doesn't do anything user visible. It doesn't return anything and it doesn't display anything. IMO, examples should as much as possible be standalone and do something the user can see. That way the user can paste the exact code into a REPL and see something happen. Or better, the documentation showing the example can also show the results.

As it is now, the user would have to define a subclass of TOT in order to see anything. This is what I was complaining about in another thread. In order to be usable, at a minimum a 5 line subclass needs to be written. I proposed that you move the logic from that class into a function which yields each result. That function could be used both by the TOT class and by simple for loops (perfect for demo and explanations).

But your response was "You have little chance of changing my mind" :-). Maybe my suggestion isn't the right approach, but I still think there is something missing regarding ease-of-use.
 
[...] 
The injected links will be useful for any tool that wants to modify python source code. fstringify and black are two most prominent examples.  Now we come to the real motivation. This is the "hole" in the documentation I have been talking about.

Any tool that wants to modify python text will benefit from having both a token-level view of the text and a parse-tree level view of the text. The asttokens package provides this dual view, but only for top-level python statements. In effect, the TOG class is a much simpler implementation of the asttokens package.

This suggests that some wrapper functions, similar/identical to those in the asttokens package, would be useful.

If this means you will be able to write simple examples in the form of the one on the front page of the asttokens package, then it sounds useful.

[...]

Thanks for all the other explanation.

Edward K. Ream

Jan 15, 2020, 11:44:55 AM
to leo-editor
On Wed, Jan 15, 2020 at 8:54 AM Brian Theado <brian....@gmail.com> wrote:

> But in its current form, this example doesn't do anything user visible. It doesn't return anything and it doesn't display anything. IMO, examples should as much as possible be standalone and do something the user can see. That way the user can paste the exact code into a REPL and see something happen. Or better, the documentation showing the example can also show the results.

All this is moot, at least for the moment, as explained in my latest post.

Edward