The git logs will show that I have been working night and day for the past month on the fstrings branch.
Yesterday I thought I had completed the next phase of the work. All but one of the files processed without complaint, which is significant because very strong checks are always present.
However, the one failure involved the most complicated code in the project. After several hours of work in the wee hours this morning I went back to bed. Lying in bed I had a momentous Aha which will eliminate all the hard parts of the code! Let me explain.
Background
The only truly difficult task is determining how many tokens correspond to ast.JoinedStr nodes. These nodes are quite a mishmash: a single JoinedStr represents at least one f-string together with all the strings concatenated to it, whether those are f-strings or plain strings.
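To illustrate (my example, not from the original), the parser folds an entire run of concatenated strings into one JoinedStr node whenever at least one piece is an f-string:

```python
import ast

# One f-string concatenated with a plain string: the parser produces a
# single JoinedStr node covering the whole run, not one node per piece.
tree = ast.parse('f"{x}" "y"')
node = tree.body[0].value
print(type(node).__name__)   # JoinedStr
print(len(node.values))      # 2: a FormattedValue and a Constant
```

So the number of tokens a JoinedStr consumes cannot be read off the node itself, which is what made the task hard.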
The scheme on which I have spent so much time attempts to determine, by looking at the JoinedStr node, which tokens correspond to the JoinedStr. This involves an extremely messy process that I call reconciliation, which munges the tree data to put it into exact correspondence with the next 'string' tokens. The following difficult methods are involved: advance_str, adjust_str_token, get_string_parts, scan_fstring, scan_string and, the most difficult of all, get_joined_tokens.
All of this is about to go away!
The Aha
We can determine which 'string' tokens are concatenated just by looking at the token list!!!
Indeed, 'string' tokens are concatenated if and only if there are no significant tokens (including parens) between them!
So none of the old correspondence/reconciliation machinery is needed. We can ignore the component ast nodes of the JoinedStr nodes completely and just use the token data.
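Here is a minimal sketch of the rule, written by me for illustration and not taken from the TOG itself. It groups consecutive STRING tokens separated only by insignificant tokens (NL, the newline inside brackets, and COMMENT). The example uses plain strings; before Python 3.12 an f-string is also a single STRING token, so the same walk covers it.

```python
import io
import tokenize

# Tokens that do NOT break a concatenation run: a newline inside
# brackets (NL) and a comment.  Everything else, including parens,
# is significant and ends the run.
INSIGNIFICANT = {tokenize.NL, tokenize.COMMENT}

def concatenated_string_runs(source):
    """Return the runs of implicitly concatenated string tokens in
    `source`, using only the token list (no ast nodes needed)."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    runs, run = [], []
    for tok in tokens:
        if tok.type == tokenize.STRING:
            run.append(tok.string)
        elif tok.type in INSIGNIFICANT:
            continue  # insignificant: the run continues
        else:
            if len(run) > 1:  # a single string is not a concatenation
                runs.append(run)
            run = []
    if len(run) > 1:
        runs.append(run)
    return runs

src = '("a"  # comment\n "b")\nx = "c"\n'
print(concatenated_string_runs(src))  # [['"a"', '"b"']]
```

Note that a statement-ending NEWLINE is significant, so strings in adjacent statements are correctly kept apart.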
Figures of merit
The code is already very fast. For example:
leoGlobals.py
len(sources): 286901
setup time: 0.61 sec.
link time: 0.44 sec.
The setup time is the time to tokenize the file and compile it to a parse tree. This involves two calls to python's standard library, so it is as fast as possible.
The link time is the time to execute all the code in the TokenOrderGenerator class! It is already way faster than other tools. It will get a tad faster.
Moreover, the TOG is both substantially simpler and more flexible than other tools. The Aha means that it will be very easy to debug and maintain.
Finally, the TOG makes no significant demands on the GC. There are no large data structures involved, aside from the token list and the parse tree. The only variable-length data is a token stack. This will typically only have a few hundred entries. Python's run-time stack will have only a few entries, because generators eliminate all significant recursion.
Summary
Today's Aha is a big deal. All of the difficult parts of the code are about to disappear! The TOG will be easy to understand and maintain. It can now be adapted easily to handle other kinds of parse trees, such as pgen2/lib2to3.
The TOG class is fast, simple, general and flexible. It promises to be an important tool in the python world. I'm proud of it.
The last month's work is as close as I have ever come to working on a significant mathematical theorem. I guess I'll have to stop thinking of myself as a failed mathematician :-)
Edward