ENB: The last puzzle

55 views

Skip to first unread message

Edward K. Ream

unread,

Nov 27, 2019, 8:07:21 AM11/27/19

to leo-e...@googlegroups.com

For several days I've been thinking that the crucial phase of the "unification" project might be completed in a few hours. Still true :-) After yesterday's work I know exactly what complications remain.

This Engineering Notebook post will discuss the complications in detail and offer a road map towards solving them. This post will be pre-writing for the most important part of the Theory of Operation.

tl;dr: Clearly, these complications can be solved: the token list and the parse tree contain all the required data. Alas, no trivial solution can possibly exist. An elegant solution might still be possible.

Overview of the code

The driver (main line) is a non-recursive traversal of the parse tree that visits each visitor in token order.

All visitors are generators. Each visitors contains one or more lines of one of the following forms:

yield from self.gen(tree_node)
yield from self.gen_token(token_kind, token_value)

Several abbreviations for gen_token exist for simplicity. The gen method "delegates" the traversal to constituent tree nodes. The sync_token method is the heart of the entire project. It does two things:

1. It verifies that the entire algorithm calls sync_token in the order in which they appear in the token list.

2. It creates two-way links between self.node (the present parse tree node) and token passed to put_token.

sync_token raises AssignLinksError if the token presented to sync_token does not match the present token, namely tog.token_list [tog.px]. tog.px is the token index. This index monotonically increases. Only sync_token ever updates this index.

Important: the entire "product" of the TOG classes are the two-way links between tokens and tree nodes.

Complications, part 1: ambiguous trees

Naively, one might think that only the token index is needed. Alas, that's not true, because the parse tree can be ambiguous. Two different inputs (and token lists) can generate exactly the same parse tree.

In particular, there is, in principle, no way that the do_If visitor can determine whether it should call gen_name('if') or gen_name('elif') by looking only at the parse tree. So do_If needs help from the token list.

A second index, tog.if_index, tracks all 'if', 'elif', and 'else' tokens in the token list. The peek_if helper returns the current 'if', 'elif', or 'else' token. The advance_if helper moves to the next 'if', 'elif' or 'else' token in the token list.

Note: The do_IfExp visitor handles ternary operators. This visitor must also call advance_if, because ternary operators must "consume" one "if" token and one "else" token.

Complications, part deux: joined and formatted strings

The second set of complications became fully visible only late yesterday. Solving them will involve a new index, tog.string index, and a new helper, tog.advance_str. Alas, these do not suffice, as I discovered yesterday, because...

1. A FormattedValue node represents a single token (a formatted string) that corresponds to an arbitrarily large subtree of parse nodes.

2. A JoinedStr node is a single tree node that corresponds to arbitrarily many tokens.

3. Parse trees can contain nested FormattedValue and JoinedStr nodes. This is the acid test.

For example, here is a (failing!) unit test that illustrates the difficulties:

k.k1(
    f"{'A' if self.a else ''}"
    f"{'B' if self.b else ''}"
    ": "
)
k.k2()

The token list (not fully linked, because AssignLinksError intervenes):

Tokens...

  0    encoding ''              line: 0  level: 0
  1        name 'k'             line: 1  level: 0  3232 Name             children: 0 parent: 8344 Attribute
  2          op '.'             line: 1  level: 0  8344 Attribute        children: 1 parent: 2128 Call
  3        name 'k1'            line: 1  level: 0  8344 Attribute        children: 1 parent: 2128 Call
  4          op '('             line: 1  level: 0
  5          nl '\n'            line: 1  level: 0
  6          ws 4               line: 2  level: 0
  7      string 'f"{\'A\' if... line: 2  level: 0  3008 FormattedValue   children: 1 parent: 3400 JoinedStr
  8          nl '\n'            line: 2  level: 0
  9          ws 4               line: 3  level: 0
 10      string 'f"{\'B\' if... line: 3  level: 0  3008 FormattedValue   children: 1 parent: 3400 JoinedStr
 11          nl '\n'            line: 3  level: 0
 12          ws 4               line: 4  level: 0
 13      string '": "'          line: 4  level: 0  9016 FormattedValue   children: 1 parent: 3400 JoinedStr
 14          nl '\n'            line: 4  level: 0
 15          op ')'             line: 5  level: 0
 16     newline '\n'            line: 5  level: 0  9016 FormattedValue   children: 1 parent: 3400 JoinedStr
 17        name 'k'             line: 6  level: 0  9016 FormattedValue   children: 1 parent: 3400 JoinedStr
 18          op '.'             line: 6  level: 0  9016 FormattedValue   children: 1 parent: 3400 JoinedStr
 19        name 'k2'            line: 6  level: 0
 20          op '('             line: 6  level: 0
 21          op ')'             line: 6  level: 0
 22     newline '\n'            line: 6  level: 0
 23   endmarker ''              line: 7  level: 0

And here is the parse tree, again not fully linked:

parent           lines    node                                         tokens
======           =====    ====                                         ======
                          0   Module
0   Module                1     Expr
1   Expr                  2       Call
2   Call         1        3         Attribute                          op. name(k1)
3   Attribute    1        4           Name: id='k'                     name(k)
2   Call                  5         JoinedStr: values=FormattedVal...
5   JoinedStr    2..3     6           FormattedValue                   string(f"{'A' if self.a ...) string(f"{'B' if self.b ...)

6   FormattedValue          7             IfExp
7   IfExp                 8               Str: s='A'
7   IfExp                 9               Attribute
9   Attribute             10                Name: id='self'
7   IfExp                 11              Str: s=''
5   JoinedStr    4..6     12          FormattedValue                   string(": ") newline(5:2) name(k)
                                                                       op.
12  FormattedValue          13            IfExp
13  IfExp                 14              Str: s='B'
13  IfExp                 15              Attribute
15  Attribute             16                Name: id='self'
13  IfExp                 17              Str: s=''
0   Module                18    Expr
18  Expr                  19      Call
19  Call                  20        Attribute
20  Attribute             21          Name: id='k'

Summary

The heart of the puzzle is a many-to-many relationship between tokens and tree nodes:

- f-strings (FormattedValue's) can be the root of an arbitrarily large parse tree.

- concatenated strings (JoinedStr's) can correspond to arbitrarily many tokens.

We can say the following about a possible solution:

1. The solution is possible, because all necessary data are present.

2. The solution will be non-trivial. Special case hacks have no chance of working. An elegant solution may be possible. If so, comments must explain why it works.

3. Any solution must involve cooperation between sync_token and the do_Str, do_FormattedValue and do_JoinedStr visitors, including especially the advance_if and advance_string helpers.

The challenge will be to create a sound solution that is also as simple as possible. It's a big ask. It's worth any amount of work.

Edward

Writing started: 5:15 am; finished 7:07 am.

Edward K. Ream

unread,

Nov 27, 2019, 7:26:52 PM11/27/19

to leo-editor

On Wednesday, November 27, 2019 at 7:07:21 AM UTC-6, Edward K. Ream wrote:

QQQ

1. A FormattedValue node represents a single token (a formatted string) that corresponds to an arbitrarily large subtree of parse nodes.

2. A JoinedStr node is a single tree node that corresponds to arbitrarily many tokens.

3. Parse trees can contain nested FormattedValue and JoinedStr nodes.

QQQ

Actually, things are a bit more complicated than this :-)

Happily, I now understand a crucial distinction:

- A single Str node represents the concatenation of multiple plain strings.
- A JoinedStr node exists only when at least f-string is present in the concatenated f-strings.

About advance_str

At last the purpose of advance_str has become clear. It must scan one or more string tokens until the accumulated tokens are exactly equal to self.node.s.

This far from trivial, even when only only plain strings are involved. Indeed, self.node.s is a plain string, but 'string' tokens contain reprs! Sheesh.

A new result_str helper handles the complications. It strips leading and trailing quotes and corrects inner escaped quotes. More work might be needed in result_str to handle f-strings. Or not...

Status and summary

Today's work has greatly clarified what must be done. Success is inevitable.

advance_str is now the simplest thing that could possibly work. It correctly handles all plain (non f-string) tokens. It still fails when handling concatenated f-strings.

advance_str and its helper raise AssignLinksError when something unexpected happens. These are important sanity checks, and they simplifies debugging.

All of my "small" unit tests now pass, except the one specifically designed to fail on concatenated f-strings.

Despite this progress, significant new invention is needed. This is grueling work. I'll not speculate about schedule ;-)