A small pause for a better fstringify

Edward K. Ream

Oct 25, 2019, 6:50:44 AM
to leo-editor
In less than 30 hours I have been able to create a prototype for a simpler fstringify utility.  It's in the fstring branch, in the FstringifyTokens class in leoBeautify.py. This class is a subclass of the PythonTokenBeautifier class.  As the names imply, these classes work on tokens, not on ASTs (parse trees).

It is my strong (informed) opinion that parse trees are inappropriate for text-based manipulations such as black and fstringify. One has only to study the code for black and fstringify, as I have, to see the outrageous complexities involved in trying to bend parse trees to one's will.

Status

The new code is straightforward and fast.  The base class is a one-pass peephole optimizer.  Such things are surprisingly easy to get right.  There are hardly any "if" statements involved. The new code overrides only a single method of the base class, do_string. This code looks ahead, beyond the original token, to parse the arguments following the "%" operator.  It then consumes all the scanned input tokens and generates a single output token representing the new f-string.

Token-based "parsing" of what follows the '%' is complete.  It's a straightforward scanner for "operands" that handles nested parens and curly/square brackets via recursive descent. It's an easy page of code. It could be an assignment in a beginning programming course.
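The kind of scanner described above can be sketched in a few lines. This is a hypothetical illustration of the recursive-descent idea, not Leo's actual code; the token representation (plain strings) is simplified for clarity:

```python
# Hypothetical sketch of a recursive-descent operand scanner.
# Tokens are represented as plain strings for simplicity.
OPENERS = {"(": ")", "[": "]", "{": "}"}

def scan_operand(tokens, i):
    """Return the index just past one 'operand' starting at tokens[i].

    A balanced bracket group counts as a single operand; nested
    groups are handled by recursion.
    """
    tok = tokens[i]
    if tok in OPENERS:
        closer = OPENERS[tok]
        i += 1
        while tokens[i] != closer:
            i = scan_operand(tokens, i)  # recurse into nested groups
        return i + 1  # skip past the closing bracket
    return i + 1  # an atom is a single token
```

For example, `scan_operand(["(", "a", "+", "(", "b", ")", ")"], 0)` skips the whole parenthesized group, nested parens included, and returns 7.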

The remaining work involves the following:

1. Parsing Python's string-formatting mini-language. The present regex needs more work.

2. Converting the legacy format specifiers to PEP 498 form. A bit tricky, but should be doable without type inference :-)
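To show what step 2 is aiming at, here is a hypothetical helper (not the actual implementation) that handles only the trivially easy cases, bare `%s` and `%d`; the real mini-language adds width, precision, and conversion flags that need much more care:

```python
import re

def simple_fstringify(template, arg_names):
    """Convert the easy legacy cases ('%s' and '%d' only) to an f-string.

    A hypothetical helper for illustration only; real format specs
    (width, precision, '%r', etc.) need real mini-language parsing.
    """
    args = iter(arg_names)
    # Replace each bare %s or %d with the next argument in braces.
    body = re.sub(r"%[sd]", lambda m: "{" + next(args) + "}", template)
    return 'f"' + body + '"'

# simple_fstringify("Hello %s, you have %d points", ["name", "n"])
# -> 'f"Hello {name}, you have {n} points"'
```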

A bonus

During my review of Leo's original beautifier, I discovered that the "raw_token" field was badly misnamed. In fact, it contains the exact line containing the token!  This is exactly what is needed to do black-like line breaking/joining! It should be so easy to do, if I ever get around to it.

Summary

Parsing tokens is surprisingly easy.  Token-based approaches naturally retain the essential features of text, including the original spellings of strings, line breaks, and original whitespace. Imo, both black and fstringify could be much improved and simplified by using a token-based approach.

The new fstringify-file command will beautify as well as fstringify.  There's no easy way not to beautify the file.

This is one of those all-consuming projects.  It should reach a stopping point in a day or three.  I shall then release 6.1b1.

Edward

Edward K. Ream

Oct 25, 2019, 12:30:29 PM
to leo-editor
On Friday, October 25, 2019 at 5:50:44 AM UTC-5, Edward K. Ream wrote:

The new fstringify-file command will beautify as well as fstringify.  There's no easy way not to beautify the file.

This is the sticking point at present.  I dislike how the beautifier handles colons.  Doing something reasonable in all situations is surprisingly tricky.

The fstringify-* commands are all functional.  Be careful: they aren't undoable at present.

fstringify-file is best used on external files not containing Leo sentinels.  The fstringify-tree/node commands are for use in Leo.  Already they may be more useful (modulo colons) than the fstringify command-line tool.

Edward

Edward K. Ream

Oct 26, 2019, 12:19:02 PM
to leo-editor
On Fri, Oct 25, 2019 at 11:30 AM Edward K. Ream <edre...@gmail.com> wrote:

The new fstringify-file command will beautify as well as fstringify.  There's no easy way not to beautify the file.

I have reached a good stopping point, which doesn't necessarily mean that I'll stop :-)

This morning I spent several hours creating a "do-nothing" tokenizer.  I then remembered the untokenize function in Python's tokenize module. This is exactly what is needed!

Those preliminary hours were not wasted--they helped me understand all the issues. The untokenize code is short, but far from easy. The subtleties involve recreating the whitespace between tokens.  Continued lines (backslash-newlines) are the acid test. The add_whitespace method is the crucial code. I'm so glad I don't have to recreate it!

The untokenize function supposedly guarantees round-tripping of source code. I may study Python's unit tests to see why this statement can be made with confidence.
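The guarantee is easy to spot-check with nothing but the stdlib. When untokenize is given the full 5-tuples that generate_tokens produces, it uses the start/end positions to rebuild the whitespace, so simple sources round-trip exactly:

```python
import io
import tokenize

src = (
    "x = 1\n"
    "y = (a +\n"
    "     b)  # comment survives\n"
)

# Tokenize with full 5-tuples, then untokenize.
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
result = tokenize.untokenize(tokens)

assert result == src  # exact round-trip: spacing and comment intact
```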

Round-tripping defines a do-nothing "beautifier".  My fstringify code will be based on untokenize, but it will step in and handle string tokens.
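A sketch of that shape (my own naming, not the actual FstringifyTokens code): copy every token through unchanged, but intercept STRING tokens. Note that passing 2-tuples puts untokenize into its compatibility mode, which may alter inter-token spacing; a real implementation would work harder to preserve positions:

```python
import io
import tokenize

def rewrite_strings(src, fix):
    """Untokenize-based rewriter sketch: pass each STRING token through
    `fix` (a hypothetical hook); copy everything else unchanged.

    2-tuples trigger untokenize's compatibility mode, so the output is
    token-equivalent to the input but spacing may shift slightly.
    """
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.STRING:
            out.append((tokenize.STRING, fix(tok.string)))
        else:
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out)
```

For example, `rewrite_strings("x = 'ab' + 'cd'\n", str.upper)` yields source whose string tokens are `'AB'` and `'CD'`, with all other tokens untouched.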

Summary

Imo, this project is worth any amount of work, because it shows how to base black or fstringify on tokens. tokenize.untokenize implements a do-nothing "beautifier".

A do-nothing beautifier could easily provide the foundation for the "real" fstringify, and would also be extremely useful for black.  I am ever more convinced that using tokens is superior to parse trees for text munging.

Edward

Matt Wilkie

Oct 26, 2019, 6:02:23 PM
to leo-editor

I am ever more convinced that using tokens is superior to parse trees for text munging.

Text munging I understand, but not parse trees and tokens. Can you give a one or two sentence overview?

Edward K. Ream

Oct 26, 2019, 6:30:15 PM
to leo-editor
On Sat, Oct 26, 2019 at 5:02 PM Matt Wilkie <map...@gmail.com> wrote:

I am ever more convinced that using tokens is superior to parse trees for text munging.

Text munging I understand, but not parse trees and tokens. Can you give a one or two sentence overview?

Parse trees

See Python's ast module.  A parse tree is a data structure representing the program's "abstract" structure.  Parse trees allow easy analysis of a program's meaning.

To my knowledge, there is no simple, efficient way of recovering whitespace data from parse trees. I have given this question considerable attention. See the TokenSync class in leoAst.py. The ast.get_source_segment function (new in Python 3.8) is utterly feeble, and mind-bogglingly slow.
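A tiny stdlib example makes the point: the tree records structure, but the original spelling (extra spaces, the comment) is simply gone:

```python
import ast

# Parse a line with deliberately odd spacing and a comment.
tree = ast.parse("x  =  1 + 2   # gone")

# The dump shows Assign and BinOp nodes, but no whitespace or comment.
print(ast.dump(tree))
```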

Tokens

See Python's tokenize module. A token list is a linear list of the tokens that make up a program.

Alas, tokens do not represent inter-token whitespace directly. Happily, the tokenize module's Untokenizer class shows how to recover inter-token whitespace.
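Both halves of that are easy to see with the stdlib alone: each token keeps its original spelling (quote style included), and the start/end coordinates on each token are what the Untokenizer uses to rebuild the gaps:

```python
import io
import tokenize

src = "x = 'a'  +  \"b\"\n"
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))

# Original quote styles survive in the token strings...
strings = [t.string for t in tokens if t.type == tokenize.STRING]
assert strings == ["'a'", '"b"']

# ...and the token positions let untokenize rebuild the gaps exactly.
assert tokenize.untokenize(tokens) == src
```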

Summary

For text munging, like black and fstringify, my experience shows that it is easier to "parse" a list of tokens than to recover token-related data from parse trees.

Imo, devs typically overestimate the difficulties involved in using tokens, and underestimate the difficulties involved in using parse trees. The proof is in the source code for black (horrendous), the "real" fstringify (complex and still buggy), and my own fstringify, in the FstringifyTokens class in leoBeautify.py (fstring branch), which works pretty well after two days of work.

Edward

Matt Wilkie

Oct 28, 2019, 12:14:14 AM
to leo-editor
Thank you :)

Edward K. Ream

Oct 28, 2019, 2:36:03 AM
to leo-editor


On Sun, Oct 27, 2019 at 11:14 PM Matt Wilkie <map...@gmail.com> wrote:
Thank you :)

You're welcome.  I enjoyed answering your question.

Edward