ENB: About python importers

Edward K. Ream

Dec 10, 2021, 7:02:54 AM
to leo-editor
This Engineering Notebook post will discuss the difficulties that any python importer must face. To state my conclusions first:

1. Generating the correct whitespace before @others in all cases requires:

A: Some form of look-ahead, or equivalently, delayed code generation.
B: What amounts to a full parse of def and class lines.

2. I am willing to let the importer assume 4-space indentation for @others in class nodes. In effect, this is what the legacy Py_Importer class does!

Background

Vitalije's new importer has trouble importing mypy/test-data/stdlib-samples/3.2/test/test_textwrap.py. The file is imported perfectly, but many nodes are over-indented due to missing indentation in `@others` directives in the class nodes.

The relevant code in the mknode function is:

o = indent('@others\n', ind-l_ind)
...   
p.b = f'{b1}{o}{b2}'

Alas, the value ind-l_ind won't work in all cases!  Instead, I suggest using the value 4 for all classes :-)  That's exactly what the legacy importer does!

Yes, this would break the strangely-indented unit tests, but I'm willing to live with that.
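
A minimal sketch of the 4-space assumption, not Vitalije's or Leo's actual code. The helper names `strip_prefix` and `make_class_node` are hypothetical. It relies only on the way @others works in Leo: the whitespace before @others is added back to every child line on write, so the same whitespace must be removed from the child bodies on import.

```python
def strip_prefix(text, prefix):
    """Remove prefix from the start of every line that begins with it."""
    return ''.join(
        line[len(prefix):] if line.startswith(prefix) else line
        for line in text.splitlines(keepends=True)
    )


def make_class_node(header, children, others_ws='    '):
    """Return (class_body, child_bodies), assuming a fixed @others indentation.

    header:    the class line, including its newline.
    children:  the source text of each method, as it appears in the file.
    others_ws: the assumed whitespace before @others (4 spaces by default).
    """
    class_body = f'{header}{others_ws}@others\n'
    child_bodies = [strip_prefix(child, others_ws) for child in children]
    return class_body, child_bodies


body, kids = make_class_node('class A:\n', ['    def f(self):\n        pass\n'])
print(body)
print(kids[0])
```

With normally indented source the child bodies come out flush left. With 2-space source the fixed prefix no longer matches the def line, so this sketch mangles the child body. That is the kind of breakage the strangely-indented unit tests would expose.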

The heroic alternative

Generating the correct indentation for @others in all cases is much more difficult. Indeed, the indentation of the @others line must be the indentation of the first significant line following the class or def line. The first significant line is the first line that is not:

- A blank or a comment.
- In a string.

The legacy Py_Importer class detects this line fairly easily: it is the first non-blank, non-comment line for which Python_ScanState.in_context returns False:

def in_context(self):
    """True if in a special context."""
    return (
        self.context or
        self.curlies > 0 or  # Open curly brackets
        self.parens > 0 or  # Open parentheses.
        self.squares > 0 or  # Open square brackets
        self.bs_nl  # In backslash/newline.
    )
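
To show how such a state can pick out the first significant line, here is a much-simplified stand-in for Python_ScanState. It is a sketch under stated assumptions, not the legacy importer's code: the `update` method and the `first_significant_line` helper are hypothetical, and string prefixes such as r'' are ignored.

```python
class ScanState:
    """Hypothetical, much-simplified stand-in for Python_ScanState."""

    _brackets = {
        '{': ('curlies', 1), '}': ('curlies', -1),
        '(': ('parens', 1), ')': ('parens', -1),
        '[': ('squares', 1), ']': ('squares', -1),
    }

    def __init__(self):
        self.context = ''  # The open string delimiter, or '' if not in a string.
        self.curlies = self.parens = self.squares = 0
        self.bs_nl = False  # True if the last line ended with backslash/newline.

    def in_context(self):
        """True if in a special context."""
        return bool(
            self.context or self.curlies > 0 or self.parens > 0
            or self.squares > 0 or self.bs_nl
        )

    def update(self, line):
        """Scan one line, updating the state."""
        self.bs_nl = False
        i = 0
        while i < len(line):
            ch = line[i]
            if self.context:
                if ch == '\\':
                    i += 2  # Skip the escaped character.
                elif line.startswith(self.context, i):
                    i += len(self.context)  # The string ends here.
                    self.context = ''
                else:
                    i += 1
            elif ch == '#':
                break  # The rest of the line is a comment.
            elif ch in '"\'':
                triple = line[i:i + 3]
                self.context = triple if triple in ('"""', "'''") else ch
                i += len(self.context)
            elif ch == '\\' and not line[i + 1:].strip():
                self.bs_nl = True
                break
            else:
                if ch in self._brackets:
                    name, delta = self._brackets[ch]
                    setattr(self, name, getattr(self, name) + delta)
                i += 1


def first_significant_line(lines):
    """Return the first significant line after the header in lines[0], or None."""
    state = ScanState()
    state.update(lines[0])
    for line in lines[1:]:
        was_in_context = state.in_context()  # The state at the start of this line.
        state.update(line)
        stripped = line.strip()
        if not was_in_context and stripped and not stripped.startswith('#'):
            return line
    return None


lines = ['def f(a,\n', '      b):\n', '\n', '  # comment\n', '   return a\n']
print(repr(first_significant_line(lines)))
```

The leading whitespace of the returned line is the candidate whitespace for @others.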


Ironically, having gone through all this trouble, my legacy importer still assumes 4-space indentation! In theory, the importer could get the indentation right. In practice, it's dashed difficult to do so!

The split_root function (or its helpers) would also have to find the first significant line of a class! In effect, the new importer would have to do a full parse of the entire class or def line.
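
For comparison, here is a sketch that lets Python's standard tokenize module do the look-ahead instead of a hand-written scanner. It is not part of either importer, and the name `others_whitespace` is hypothetical. It assumes the text starts with a class or def line, and that the first INDENT token carries the leading whitespace of the first significant line of the body. That is the tokenizer's long-standing behavior, but treat it as an assumption.

```python
import io
import tokenize


def others_whitespace(source):
    """Return the leading whitespace of the first significant body line.

    source is assumed to start with a class or def line.
    Return '' if the body is on the header line, as in 'def f(): pass'.
    """
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        # Blank lines, comment lines and bracketed continuation lines
        # never produce an INDENT token, so the first INDENT marks the
        # first significant line of the body.
        if tok.type == tokenize.INDENT:
            return tok.string
    return ''


src = 'class A:\n  # odd comment\n\n      def f(self):\n          pass\n'
print(repr(others_whitespace(src)))
```

This trades the scanner for a full tokenization of the text, and tokenize raises an exception on source that does not tokenize, so an importer would need a fallback.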

Summary

The python importer contains analogs of all the phases of an optimizing compiler. The incoming code must be tokenized and maybe even parsed. Code generation will never be easy.

In class or def nodes, the leading whitespace of the @others directive should be the leading whitespace of the first significant line of the class or def. Finding the first significant line of a class or def requires a full parse.

Importers can avoid the parse phase only if they assume 4-space indentation! I am willing to make this concession, and I am willing to abandon (parts of) the unit tests for strangely-indented code.

Edward

tbp1...@gmail.com

Dec 10, 2021, 9:29:46 AM
to leo-editor
As I understand it, the Python tokenizer keeps two stacks of indents.  In one, each tab advances to the next multiple of 8 columns.  In the other, a tab counts for one space.  Both stacks have to agree on the indentation level at every stage.
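
A sketch of the two-measure idea as described above, not CPython's actual code. The names `indent_columns` and `consistent` are hypothetical, and the tab sizes 8 and 1 are taken from the description.

```python
def indent_columns(line, tabsize=8, alt_tabsize=1):
    """Return (col, altcol) for the leading whitespace of line.

    col moves each tab to the next multiple of tabsize.
    altcol moves each tab to the next multiple of alt_tabsize.
    """
    col = altcol = 0
    for ch in line:
        if ch == ' ':
            col += 1
            altcol += 1
        elif ch == '\t':
            col = (col // tabsize + 1) * tabsize
            altcol = (altcol // alt_tabsize + 1) * alt_tabsize
        else:
            break  # The first non-whitespace character ends the indentation.
    return col, altcol


def consistent(a, b):
    """True if both measures order the two indentations the same way."""
    def sign(x):
        return (x > 0) - (x < 0)
    return sign(a[0] - b[0]) == sign(a[1] - b[1])


tab_line = indent_columns('\tx = 1')           # One tab.
space_line = indent_columns('        x = 1')   # Eight spaces.
print(tab_line, space_line, consistent(tab_line, space_line))
```

One tab and eight spaces agree under the first measure but not under the second, so the two lines cannot be compared reliably. That is the situation the agreement check is meant to catch.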

When I have done the same job in the past (except that I didn't need to tokenize or parse everything the way an importer has to), I determined the indentation level by counting the number of tabs and spaces without regard to order.  That gives an unambiguous indent level without needing to depend on invisible details of the permutations and expansions of tabs and spaces.  It worked well.
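
A sketch of that counting approach. The name `indent_counts` is hypothetical, and returning the two counts as a pair is an assumption about how they were compared.

```python
def indent_counts(line):
    """Return (tabs, spaces) in the leading whitespace, ignoring their order."""
    ws = line[:len(line) - len(line.lstrip(' \t'))]
    return ws.count('\t'), ws.count(' ')


# Two different orderings of one tab and two spaces give the same key.
print(indent_counts('\t  x = 1'), indent_counts(' \t x = 1'))
```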

Then on output of course the tabs could be replaced with four spaces.  No problem there.  I dislike assuming tabs are always four spaces in the input.  It would be easy for someone to set their editor to emit, say, three spaces per tab  to get slightly more compact lines.  We don't know how often that would happen.  And there could still be a few legacy files around that use all tabs.  I have found them from time to time.

tbp1...@gmail.com

Dec 10, 2021, 9:36:22 AM
to leo-editor
This link may be of interest.  It is about reconstructing a python file from its parse tree.  Maybe a few changes to the code generator would do the job: 

tbp1...@gmail.com

Dec 10, 2021, 9:45:49 AM
to leo-editor
Where is this test file  mypy/test-data/stdlib-samples/3.2/test/test_textwrap.py? I don't see it in the devel or import branch, and I don't see it in the mypy package either.

Edward K. Ream

Dec 10, 2021, 9:49:55 AM
to leo-editor
On Friday, December 10, 2021 at 8:36:22 AM UTC-6 tbp1...@gmail.com wrote:
This link may be of interest.  It is about reconstructing a python file from its parse tree.  Maybe a few changes to the code generator would do the job:

Thanks for the link.

Last night I realized your comments about parsing (using leoAst.py) were more apt than I first thought.  However, using any parser, including leoAst.py, seems like overkill.

Remember that my legacy python importer (now resurrected in the ekr-importer2 branch) doesn't really have a parser problem.  My importer handles all normally indented code correctly. Just for fun I'm going to try once again to have the importer handle strangely-indented code.  Vitalije's code generation shows that there is no need to generate outlines more than two levels deep. That insight may help me simplify the code generators.

Vitalije's problems may be a bit different, but I'll let him speak for himself.

Edward

Edward K. Ream

Dec 10, 2021, 9:51:17 AM
to leo-editor
On Friday, December 10, 2021 at 8:45:49 AM UTC-6 tbp1...@gmail.com wrote:
Where is this test file  mypy/test-data/stdlib-samples/3.2/test/test_textwrap.py?

It's from the mypy sources.

Edward