For several days I've been thinking that the crucial phase of the "unification" project might be completed in a few hours. Still true :-) After yesterday's work I know exactly what complications remain.
This Engineering Notebook post will discuss the complications in detail and offer a road map towards solving them. This post will be pre-writing for the most important part of the Theory of Operation.
tl;dr: Clearly, these complications can be solved: the
token list and the parse tree contain all the required data. Alas, no trivial solution can possibly exist. An elegant solution might still be possible.
Overview of the code
The driver (main line) is a non-recursive traversal of the parse tree that visits each visitor in token order.
All visitors are generators. Each visitors contains one or more lines of one of the following forms:
yield from self.gen(tree_node)
yield from self.gen_token(token_kind, token_value)
Several abbreviations for gen_token exist for simplicity. The gen method "delegates" the traversal to constituent tree nodes. The sync_token method is the heart of the entire project. It does two things:
1. It verifies that the entire algorithm calls sync_token in the order in which they appear in the token list.
2. It creates two-way links between self.node (the present parse tree node) and token passed to put_token.
sync_token raises AssignLinksError if the token presented to sync_token does not match the present token, namely tog.token_list [tog.px]. tog.px is the token index. This index monotonically increases. Only sync_token ever updates this index.
Important: the entire "product" of the TOG classes are the two-way links between tokens and tree nodes.
Complications, part 1: ambiguous trees
Naively, one might think that only the token index is needed. Alas, that's not true, because the parse tree can be ambiguous. Two different inputs (and token lists) can generate exactly the same parse tree.
In particular, there is, in principle, no way that the do_If visitor can determine whether it should call gen_name('if') or gen_name('elif') by looking only at the parse tree. So do_If needs help from the token list.
A second index, tog.if_index, tracks all 'if', 'elif', and 'else' tokens in the token list. The peek_if helper returns the current 'if', 'elif', or 'else' token. The advance_if helper moves to the next 'if', 'elif' or 'else' token in the token list.
Note: The do_IfExp visitor handles ternary operators. This visitor must also call advance_if, because ternary operators must "consume" one "if" token and one "else" token.
Complications, part deux: joined and formatted strings
The second set of complications became fully visible only late yesterday. Solving them will involve a new index, tog.string index, and a new helper, tog.advance_str. Alas, these do not suffice, as I discovered yesterday, because...
1. A FormattedValue node represents a single token (a formatted string) that corresponds to an arbitrarily large subtree of parse nodes.
2. A JoinedStr node is a single tree node that corresponds to arbitrarily many tokens.
3. Parse trees can contain nested FormattedValue and JoinedStr nodes. This is the acid test.
For example, here is a (failing!) unit test that illustrates the difficulties:
k.k1(
f"{'A' if self.a else ''}"
f"{'B' if self.b else ''}"
": "
)
k.k2()
The token list (not fully linked, because AssignLinksError intervenes):
Tokens...
0 encoding '' line: 0 level: 0
1 name 'k' line: 1 level: 0 3232 Name children: 0 parent: 8344 Attribute
2 op '.' line: 1 level: 0 8344 Attribute children: 1 parent: 2128 Call
3 name 'k1' line: 1 level: 0 8344 Attribute children: 1 parent: 2128 Call
4 op '(' line: 1 level: 0
5 nl '\n' line: 1 level: 0
6 ws 4 line: 2 level: 0
7 string 'f"{\'A\' if... line: 2 level: 0 3008 FormattedValue children: 1 parent: 3400 JoinedStr
8 nl '\n' line: 2 level: 0
9 ws 4 line: 3 level: 0
10 string 'f"{\'B\' if... line: 3 level: 0 3008 FormattedValue children: 1 parent: 3400 JoinedStr
11 nl '\n' line: 3 level: 0
12 ws 4 line: 4 level: 0
13 string '": "' line: 4 level: 0 9016 FormattedValue children: 1 parent: 3400 JoinedStr
14 nl '\n' line: 4 level: 0
15 op ')' line: 5 level: 0
16 newline '\n' line: 5 level: 0 9016 FormattedValue children: 1 parent: 3400 JoinedStr
17 name 'k' line: 6 level: 0 9016 FormattedValue children: 1 parent: 3400 JoinedStr
18 op '.' line: 6 level: 0 9016 FormattedValue children: 1 parent: 3400 JoinedStr
19 name 'k2' line: 6 level: 0
20 op '(' line: 6 level: 0
21 op ')' line: 6 level: 0
22 newline '\n' line: 6 level: 0
23 endmarker '' line: 7 level: 0
And here is the parse tree, again not fully linked:
parent lines node tokens
====== ===== ==== ======
0 Module
0 Module 1 Expr
1 Expr 2 Call
2 Call 1 3 Attribute op. name(k1)
3 Attribute 1 4 Name: id='k' name(k)
2 Call 5 JoinedStr: values=FormattedVal...
5 JoinedStr 2..3 6 FormattedValue string(f"{'A' if self.a ...) string(f"{'B' if self.b ...)
6 FormattedValue 7 IfExp
7 IfExp 8 Str: s='A'
7 IfExp 9 Attribute
9 Attribute 10 Name: id='self'
7 IfExp 11 Str: s=''
5 JoinedStr 4..6 12 FormattedValue string(": ") newline(5:2) name(k)
op.
12 FormattedValue 13 IfExp
13 IfExp 14 Str: s='B'
13 IfExp 15 Attribute
15 Attribute 16 Name: id='self'
13 IfExp 17 Str: s=''
0 Module 18 Expr
18 Expr 19 Call
19 Call 20 Attribute
20 Attribute 21 Name: id='k'
Summary
The heart of the puzzle is a many-to-many relationship between tokens and tree nodes:
- f-strings (FormattedValue's) can be the root of an arbitrarily large parse tree.
- concatenated strings (JoinedStr's) can correspond to arbitrarily many tokens.
We can say the following about a possible solution:
1. The solution is possible, because all necessary data are present.
2. The solution will be non-trivial. Special case hacks have no chance of working. An elegant solution may be possible. If so, comments must explain why it works.
3. Any solution must involve cooperation between sync_token and the do_Str, do_FormattedValue and do_JoinedStr visitors, including especially the advance_if and advance_string helpers.
The challenge will be to create a sound solution that is also as simple as possible. It's a big ask. It's worth any amount of work.
Edward
Writing started: 5:15 am; finished 7:07 am.