A new possible approach to importing Python source files


vitalije

Dec 1, 2021, 5:26:48 PM
to leo-editor
Idea:
use Python's tokenize module to find where all function/method and class definitions start,
and then use this data to find the lines where the top-level children should start. After creating the top-level children, the process can be repeated for all nodes with more than a certain threshold number of lines, generating the second-level children.
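
A minimal sketch of the idea (illustrative only; find_top_level_defs is a hypothetical name, not part of the attached script):

import io
import tokenize

def find_top_level_defs(source):
    """Return the 1-based line numbers of top-level def/class statements."""
    rows = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # A NAME token 'def' or 'class' at column 0 starts a top-level block.
        if tok.type == tokenize.NAME and tok.string in ('def', 'class') and tok.start[1] == 0:
            rows.append(tok.start[0])
    return rows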

A little bit of background story (feel free to skip if you just want to see the example code):
A long time ago I tried to solve this problem in a more efficient way for importing JavaScript files. I remember looking at the Importer class and the way Leo did imports at the time and feeling that it was far more complicated than necessary. I can't say that I've solved this problem in general, but for a very specific case it worked pretty well.

Recent posts about improving Leo in this area, especially regarding Python, made me think again about this problem.

I strongly feel that the main problem with the current implementation is its insistence on scan_line. That may be suitable for unifying all the other source languages, but it is far from optimal for Python source files.

The way I see this problem is to search for the lines where a new node should start. Whether that node should be indented or not I would rather leave for the next phase. First of all, the outline structure of the Python files I start from scratch in Leo usually has a few lines in the top-level node, then @others, and after @others usually comes the block `if __name__ == '__main__':`. If I have a lot of imports, I usually put them all in one section node `<<imports>>`.
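
For example, the root node of such a file typically looks like this (an illustration, not output from the attached script):

'''Module docstring.'''
<<imports>>
@others
if __name__ == '__main__':
    main()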

The first line where I would introduce the first child node is the first function or class definition. Everything before it should go directly into the root node. Imports can be extracted later.

Attached to this message is a Python script which can be executed inside Leo.
It imports any module from the standard Python library and checks whether the import is perfect. At the moment it just extracts the top-level definitions into separate nodes, the direct children of the root.
pyimport_experiment.py

tbp1...@gmail.com

Dec 1, 2021, 7:38:07 PM
to leo-editor
This is more or less what I was thinking of when I brought up Python's tokenizer in the earlier thread. The tokenizer functions much like a SAX parser in the XML world: it emits a stream of events as it works through a Python file. Most of them wouldn't be of interest for Leo-izing a file, but the whitespace detection and the def and class events are just what is wanted here.
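
For instance, here is a toy illustration of that event stream (independent of Leo):

import io
import tokenize

src = "class C:\n    def f(self):\n        pass\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    # Each token is one "event": NAME, OP, INDENT, DEDENT, NEWLINE, ...
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start)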

Edward K. Ream

Dec 2, 2021, 4:31:56 AM
to leo-editor
On Wed, Dec 1, 2021 at 4:26 PM vitalije <vita...@gmail.com> wrote:

> Idea: use Python's tokenize module...

This idea might work, but the attached script "wishes away" the difficulties involved in generating code.

1. Tokenizing the incoming lines (scan_line) is not the problem.  The present code has no trouble discovering which lines start with 'class' or 'def'.

2. The theory of operation post shows that splitting lines into nodes involves tricky considerations of context.

The python importer's code generators are: end_previous_blocks, do_def, do_class, do_normal_line, start_python_block, and gen_python_ref. All of the tests in these methods are strictly necessary. Unit tests will fail if any of these tests go missing! The attached script ignores these complexities.

3. Like it or not, the python importer is part of the ecosystem created by the Importer class. The vnode_info dict must exist and must contain all its data.

Summary

I am happy with the new python importer. The code generators are as simple as possible. The importer's unit tests won't pass with simpler code!

The vnode_info dict guarantees that all necessary state data persists into the post pass.

Edward

vitalije

Dec 3, 2021, 4:42:27 AM
to leo-editor
On Thursday, December 2, 2021 at 10:31:56 AM UTC+1 Edward K. Ream wrote:

> Idea: use Python's tokenize module...

> This idea might work, but the attached script "wishes away" the difficulties involved in generating code.


Attached to this message is an improved version of the script, which doesn't "wish away the difficulties" but does a full import. Each function and class defined at module level goes into a separate node, and for each class node that has more than 20 lines, each method goes into a separate child node too. All other lines that fall between the nodes go into their own nodes with the headline '..some declarations'. If such a body contains only a single string literal, its headline is changed to `__doc__`; if it contains only comments, its headline is changed to '...some comments'.

This is pretty much how I would split any module by hand. I don't know what your tests for the Python importer look like, but I suggest that you try this script against them. The whole script is just 150 lines long, including comments and docstrings. It is short, and it is easy to modify any way you like. It has successfully imported every Python module from the standard library that I have thrown at it, each time with a perfect import.

I would expect that your code will end up much longer than 150 lines. The more lines of code, the more space for bugs to hide, and the more effort needed later to read and understand the code in order to change or fix something. The choice is yours.
pyimport_experiment.py

Edward K. Ream

Dec 3, 2021, 8:14:12 AM
to leo-editor
On Fri, Dec 3, 2021 at 3:42 AM vitalije <vita...@gmail.com> wrote:

> Attached to this message is an improved version of this script, which doesn't "wish away the difficulties", but does full import.

Excellent work.  Imo, the complexity of this script is roughly comparable to the complexity of the present code generators:

- close_defs is comparable to end_previous_blocks. end_previous_blocks is a bit shorter but contains more "if" statements.

- The way your code generates @others is particularly clever.  There are two cases:

1. In split_root:

def body(a, b):
    return ''.join(lines[a-1:b and (b-1)])
   
nodes = find_node_borders(txt)
a, b, h = nodes[0]
root.b = f'{body(a, b)}@others\n{body(nodes[-1][0], None)}'

2. In split_class:

def indent(x):
    return ' '*ind + x

nlines = [x[ind:] if len(x) > ind else x for x in lines[1:]]
txt = ''.join(nlines)
nodes = find_node_borders(txt)
a, b, h = nodes[0]

def body(a, b):
    return ''.join(nlines[a-1:b and (b-1)])

b1 = ''.join(lines[a:b]) + indent('@others\n')

Other comments
The generated nodes look good enough for a prototype. The post-pass would improve the result.

find_node_borders will be faster than the tokenizer driven by the table in py_i.get_new_dict, but not enough to matter.

Summary

Many thanks for this script. The way it handles @others convinces me that your approach is sound.

Imo, the complexity of your code generators is comparable to the complexity of mine.

I am putting the finishing touches on the new python generator. I will complete my present work in a day or so.

I'll be happy to consider a pull request containing your work. My conditions of satisfaction:

1. I must be convinced that the new code is substantially better than the present code. Yes, this condition increases the risk to you. Perhaps we can vote on which code is better, but we can only vote once we all can see the two final versions.

2. The PR must use the framework established by the Importer class. The PR need not use the existing tokenizer.

3. The PR should include a TestPython class containing all the existing tests, except for tests that test code that no longer exists.

4. The TestPython class should fully cover mk_py_importer.

Edward

Edward K. Ream

Dec 3, 2021, 8:43:25 AM
to leo-editor
On Friday, December 3, 2021 at 7:14:12 AM UTC-6 Edward K. Ream wrote:

> I'll be happy to consider a pull request containing your work. My conditions of satisfaction:

> 1. I must be convinced that the new code is substantially better than the present code.

Still true, but on second thought I see that I can probably "sign off" on your code in a day or so. 

Clearly, you think differently than I do, and that's a very good thing :-) I'd like to get a feel for exactly how the code works, by "seeing it in action", so to speak.  I'll then attempt a theory of operation for my own benefit.

Even supposing that your code is roughly as complex as mine, I think it would be a good thing to include your code rather than mine, as a counterpoint to the way things usually work!

These are preliminary thoughts, subject to my actual explorations. However, I am predisposed to accepting your code. I'll let you know my conclusions in a few days.

Edward

Edward K. Ream

Dec 3, 2021, 9:23:39 AM
to leo-editor
On Friday, December 3, 2021 at 7:43:25 AM UTC-6 Edward K. Ream wrote:

> ...on second thought I see that I can probably "sign off" on your code in a day or so. 

The more I think about this topic, the better I like it.  I was just about to beat my head against the last failing unit test: test_strange_indentation. Now I think, "hehe, why bother?" Much better to spend time studying Vitalije's new code.

So later today or tomorrow I'll merge PR #2331 into devel. This improves the foundation for all importers, and shows what I have done recently. After that, I'll start my study of the proposed new code.

Edward

vitalije

Dec 3, 2021, 10:09:17 AM
to leo-editor

> Clearly, you think differently than I do, and that's a very good thing :-) I'd like to get a feel for exactly how the code works, by "seeing it in action", so to speak.  I'll then attempt a theory of operation for my own benefit.


I think that is a good idea. I would like to explain a little more how my code works to make that easier.

The function find_node_borders uses the Python tokenizer to search for suitable node starting lines (code lines starting with a token.NAME whose string value is either def or class). For each line found, it appends a list [start_row, start_col, None, line_txt] to the list of possible results. The third element, which at this point is None, is later changed to the end line of the definition. The end of a definition can be noticed in the generated tokens in two different cases. The first case is a non-comment Python statement following the last line; in that case tokenize will emit a token.DEDENT token. However, unindented comment lines do not trigger this DEDENT token, so we also need to check for COMMENT tokens whose starting column is less than the indentation of the current definition's body. To know the current indentation we need to look for token.INDENT tokens too.

So, the only tokens we are interested in are:
  1. token.INDENT, for keeping track of lastindent.
  2. token.DEDENT and token.COMMENT whose starting column is less than lastindent. Whenever we encounter this case, we need to close all open definitions whose indentation level is greater than the starting column of the found token (DEDENT or COMMENT).
  3. token.NAME with a string value in ('def', 'class'). Whenever we encounter this case, we add another open definition both to the list of results and to the list of open definitions at this level of indentation.
After passing through all the tokens we have closed all open definitions, and our resulting list contains each definition in the file with its starting (row, col) position and the line number where it ends. Finally, we filter the resulting list to exclude all definitions at deeper levels and keep only those which start at zero indentation. As I write this, I've just realized that I can substantially simplify the script by ignoring those deeper definitions in the previous phase.

Yes, I have further simplified and sped up the code by ignoring all tokens that do not start at column zero. Instead of keeping a list of open definitions, we now have just one open definition (the last one).
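
A condensed sketch of that simplified loop (illustrative; this is not the attached script, and the triples here hold only [start_row, end_row, line]):

import io
import tokenize

def find_node_borders(txt):
    # Track at most one open top-level definition at a time.
    borders, open_def = [], None
    for tok in tokenize.generate_tokens(io.StringIO(txt).readline):
        row, col = tok.start
        if col != 0:
            continue  # only column-zero tokens matter now
        if tok.type in (tokenize.DEDENT, tokenize.COMMENT) and open_def:
            open_def[1] = row  # close the open definition at this line
            open_def = None
        elif tok.type == tokenize.NAME and tok.string in ('def', 'class'):
            open_def = [row, None, tok.line]
            borders.append(open_def)
    return borders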

Finally, we can start emitting the nodes. We start with the [1, 1, ''] node: the block of lines that comes before the first definition. While walking through the list of definitions, we check each time whether the end of the last node in the list is less than the start line of the current node. In that case we insert one '...some declarations' block, which starts where the last node ends and ends where the current node starts. At the end we append [end_of_last_node, None, ''] to our final list of nodes; these are the lines of the root node which come after @others.
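
In code, the walk just described might look like this (a sketch with hypothetical names, not the attached script):

def emit_borders(definitions):
    # definitions: [start, end, headline] triples for top-level def/class blocks.
    nodes = [[1, 1, '']]  # the block of lines before the first definition
    for start, end, headline in definitions:
        if nodes[-1][1] < start:
            # the gap between two definitions becomes an organizer node
            nodes.append([nodes[-1][1], start, '...some declarations'])
        nodes.append([start, end, headline])
    nodes.append([nodes[-1][1], None, ''])  # root lines after @others
    return nodes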

Two functions use the result of find_node_borders: split_root and split_class. The first deals with the root node: it generates a child of the root for each node in the list and also sets the body of the root node. The other, split_class, is very similar. It takes the body of the class definition without the actual class line, cuts as much whitespace as possible from the beginning of each line, and uses this dedented text to calculate the borders of the methods. It then does almost the same with the resulting list of nodes as split_root does, except that it indents @others and restores the first line (the class line).

Attached to this message is a simplified version of the script.

pyimport_experiment.py

vitalije

Dec 3, 2021, 10:17:47 AM
to leo-editor
I've attached the wrong version of the script. Here is the correct one.
pyimport_experiment.py

Edward K. Ream

Dec 5, 2021, 10:56:03 AM
to leo-editor
On Fri, Dec 3, 2021 at 9:09 AM vitalije <vita...@gmail.com> wrote:

> I would like to explain a little more how my code works.

Thanks.  I have begun my study. I'll ask questions as needed.

Substantial work would be required to fold your script into the overall infrastructure created by the base Importer class. I would be happy to let you take it from here, but I'm also willing to help.  What would you like to do?

Note that the devel branch contains all my recent work. I plan no more importer work.

Attached script

I started my study of your script by converting the script to an outline. See the attached .leo file.

I made minor changes to make my study easier. Here is a summary of the top-level changes:

- Clear the log first, so that pyflakes warnings remain visible.
- Import the Importer class from leo.plugins.linescanner.
- Pass the file's contents and name to import_one_level.
  This will make it easier to simulate unit tests.
- Select and expand the resulting outline during the redraw.

Revised functions

- ensure_root writes to the last top-level node, creating it if necessary. The headline is always 'Imported Files...'.

- rename includes the first significant line of p.b in p.h. The start of p.h is either "Docstring" or "Organizer".

- import_one_level has a new signature and uses the Importer class to check the import.
  The test of difflib.py reports "same 2", meaning that the import is perfect after ignoring blank lines.

That's all for now.

Edward
VitalijeScript3.leo

Edward K. Ream

Dec 5, 2021, 3:09:05 PM
to leo-editor
On Sunday, December 5, 2021 at 9:56:03 AM UTC-6 Edward K. Ream wrote:

>  I have begun my study. I'll ask questions as needed.

I like to ask questions of the code :-)  The split_root function defines "body" as follows:

def body(a, b, tag):
    return ''.join(lines[a-1:b and (b-1)])

The split_class function defines "body" this way:

def body(a, b, tag):
    return ''.join(nlines[a-1:b and (b-1)])

Notice: "b and (b-1)" may evaluate to None. But lines[a:None] means lines[a:]. I didn't know that None was a valid index!

For study purposes, I add traces to each:

def body(a, b, tag):  # split_root.
    lines2 = lines[a-1:b and (b-1)]
    g.trace(f" split_root: lines: {len(lines2):<4} {a!r:>4}:{b and (b-1)!r:<4} {tag}")
    return ''.join(lines2)

def body(a, b, tag):  # split_class
    lines2 = nlines[a-1:b and (b-1)]
    g.trace(f"split_class: lines: {len(lines2):<4} {a!r:>4}:{b and (b-1)!r:<4} {tag}")
    return ''.join(lines2)


Each call to these functions has a different tag (not shown).

These traces produce the following when importing difflib.py:

body  split_root: lines: 0       1:0    head: Imported Files...
body  split_root: lines: 0    2097:None tail: Imported Files...
body  split_root: lines: 38      1:38   Declarations
body  split_root: lines: 5      39:43   _calculate_ratio
body  split_root: lines: 654    44:697  class SequenceMatcher
body split_class: lines: 107     1:107  class SequenceMatcher::Declarations
body split_class: lines: 64    108:171  class SequenceMatcher::__init__
body split_class: lines: 12    172:183  class SequenceMatcher::set_seqs
body split_class: lines: 26    184:209  class SequenceMatcher::set_seq1
body split_class: lines: 44    210:253  class SequenceMatcher::set_seq2
body split_class: lines: 39    254:292  class SequenceMatcher::__chain_b
body split_class: lines: 116   293:408  class SequenceMatcher::find_longest_match
body split_class: lines: 71    409:479  class SequenceMatcher::get_matching_blocks
body split_class: lines: 55    480:534  class SequenceMatcher::get_opcodes
body split_class: lines: 50    535:584  class SequenceMatcher::get_grouped_opcodes
body split_class: lines: 25    585:609  class SequenceMatcher::ratio
body split_class: lines: 29    610:638  class SequenceMatcher::quick_ratio
body split_class: lines: 12    639:650  class SequenceMatcher::real_quick_ratio
body  split_root: lines: 49    698:746  get_close_matches
body  split_root: lines: 9     747:755  _keep_original_ws
body  split_root: lines: 327   756:1082 class Differ
body  split_root: lines: 2    1083:1084 Declarations
body  split_root: lines: 16   1085:1100 IS_LINE_JUNK
body  split_root: lines: 23   1101:1123 IS_CHARACTER_JUNK
body  split_root: lines: 11   1124:1134 _format_range_unified
body  split_root: lines: 73   1135:1207 unified_diff
body  split_root: lines: 12   1208:1219 _format_range_context
body  split_root: lines: 76   1220:1295 context_diff
body  split_root: lines: 17   1296:1312 _check_types
body  split_root: lines: 30   1313:1342 diff_bytes
body  split_root: lines: 37   1343:1379 ndiff
body  split_root: lines: 270  1380:1649 _mdiff
body  split_root: lines: 56   1650:1705 Declarations
body  split_root: lines: 351  1706:2056 class HtmlDiff(object)
body split_class: lines: 21      1:21   class HtmlDiff(object)::Declarations
body split_class: lines: 17     22:38   class HtmlDiff(object)::__init__
body split_class: lines: 27     39:65   class HtmlDiff(object)::make_file
body split_class: lines: 23     66:88   class HtmlDiff(object)::_tab_newline_replace
body split_class: lines: 55     89:143  class HtmlDiff(object)::_split_line
body split_class: lines: 28    144:171  class HtmlDiff(object)::_line_wrapper
body split_class: lines: 21    172:192  class HtmlDiff(object)::_collect_lines
body split_class: lines: 23    193:215  class HtmlDiff(object)::_format_line
body split_class: lines: 11    216:226  class HtmlDiff(object)::_make_prefix
body split_class: lines: 47    227:273  class HtmlDiff(object)::_convert_flags
body split_class: lines: 77    274:350  class HtmlDiff(object)::make_table
body  split_root: lines: 2    2057:2058 Declarations
body  split_root: lines: 32   2059:2090 restore
body  split_root: lines: 6    2091:2096 _test

This "picture" shows what is happening.  Pay particular attention to the ranges of line numbers.

The `class HtmlDiff` node contains 351 lines, which split_class then "allocates" to child nodes.  The picture shows that the child nodes "cover" the original lines.

Edward

vitalije

Dec 5, 2021, 5:22:42 PM
to leo-editor
Yes, None used as an index gives the same effect as omitting it: aList[i:None] is the same as aList[i:]. The same holds if None is the first index.
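
For example:

lines = ['a', 'b', 'c', 'd']
assert lines[1:None] == lines[1:] == ['b', 'c', 'd']
assert lines[None:2] == lines[:2] == ['a', 'b']
# The 'b and (b - 1)' trick relies on short-circuiting: it maps
# None to None (slice to the end) and an integer b to b - 1.
b = None
assert lines[1:b and (b - 1)] == ['b', 'c', 'd']
b = 3
assert lines[1:b and (b - 1)] == ['b']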

> Substantial work would be required to fold your script into the overall infrastructure created by the base Importer class. I would be happy to let you take it from here, but I'm also willing to help.  What would you like to do?

I don't see what substantial work you mean. It is just a simple function that can be added to any module and used to import a Python source file instead of whatever is used now. It would be enough to use:

class MyPyScanner:
    def __init__(self, *args, **kwargs):
        pass  # no state needed: split_root does all the work
    def run(self, s, root):
        # Put the whole file into the root node, then split it.
        root.deleteAllChildren()
        root.b = s
        split_root(root)
        return True

instead of the currently used Py_Importer, and register this class as the scanner class for the python extension.
Alternatively, in the Py_Importer class, just override the run method to use split_root as shown above. This suggestion was made against Leo version 96f282672d of 2020-09-14, which is my current version. You may have changed this system recently, but the basic idea remains the same: do not call the pipeline; instead call split_root and return True.


Edward K. Ream

Dec 5, 2021, 5:27:02 PM
to leo-editor
On Sunday, December 5, 2021 at 2:09:05 PM UTC-6 Edward K. Ream wrote:

>  I have begun my study.

Oh joy, replacing difflib (as the test file) with the "strange indentation" unit test just works! My modified version of import_one_level reports "same 1".

Note: there was a bug in my version of import_one_level: it must return root, not None.

Summary

I hereby "sign off" on the prototype python importer.  Vitalije, if you're willing, I would like a PR from you that integrates your test script into Leo's importer framework.

Edward

vitalije

Dec 5, 2021, 6:02:21 PM
to leo-editor
Ok. I'll try to make a PR.

Edward K. Ream

Dec 5, 2021, 6:06:49 PM
to leo-editor
On Sunday, December 5, 2021 at 4:22:42 PM UTC-6 vitalije wrote:
 
> I don't see what substantial work do you mean is required.

It might be best to use Importer.check and most of the other Importer machinery.

At minimum, the top-level node of leo/importers/python.py should remain the same. (There is no need to define Target = linescanner.Target).

The top-level code (including the importer_dict) is required so that @auto works.

To get the benefits of the Importer class, the new python importer should be a subclass of the Importer class. The Importer class defines this pipeline:

- Stage 0 checks blanks and tabs. See Importer.check_blanks_and_tabs.
- Stage 1 does the basic importing. You would override this.
- Stage 2 is Importer.post_pass, which the present python importer (partially?) overrides.
- Stage 3 is Importer.finish, which sets the node text from the vnode dict.
- Stage 4 is Importer.check, which does the perfect-import checks and reports any errors in a clear format.

If you want to use the pipeline, you must use the vnode_dict as follows:

- Init self.vnode_dict to the empty dict.
- Rather than setting p.b directly, the python importer should set vnode_dict[p.v]['lines'] to the list of lines in the node.

Summary

You must be aware of how Leo's core connects to the importers. The top-level code in leo.plugins.importers.python is required.

I'm willing to consider anything that works, but I would like you to provide the equivalents of:

- Importer.check_blanks_and_tabs.
- The python importer's post-pass.
- Importer.check.

The easy way to do the above is to subclass the Importer base class and use the vnode_dict interface. It will be easy: just assign to vnode_dict[p.v]['lines'] rather than to p.b, as in the sketch below.
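
A minimal sketch of that shape (split_to_nodes and the run signature are hypothetical; only the vnode_dict contract comes from the description above):

from leo.plugins import linescanner

class TokenizePyImporter(linescanner.Importer):
    def run(self, s, parent):  # stand-in for whatever method implements stage 1
        self.vnode_dict = {}
        # split_to_nodes is hypothetical: it would yield (position, lines)
        # pairs from the tokenize-based split of the incoming source s.
        for p, node_lines in split_to_nodes(s, parent):
            # Record the lines here; the later pipeline stages
            # (post-pass, finish, check) consume this dict.
            self.vnode_dict[p.v] = {'lines': node_lines}
        return True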

Edward

Edward K. Ream

Dec 5, 2021, 6:09:30 PM
to leo-editor
On Sunday, December 5, 2021 at 5:02:21 PM UTC-6 vitalije wrote:
> Ok. I'll try to make a PR.

Thanks!  Let me know if you have any questions.

You are in complete charge.  Just remember that all (still applicable) unit tests should pass.

Edward

Edward K. Ream

Dec 5, 2021, 6:15:47 PM
to leo-editor
On Sunday, December 5, 2021 at 5:09:30 PM UTC-6 Edward K. Ream wrote:

> You are in complete charge.  Just remember that all (still applicable) unit tests should pass.

One more thing. The post-pass is a kind of peephole optimizer.  Almost certainly the new importer should provide similar functionality. For example, it's tricky to place class and def decorators where they belong in the main importer phase. The present post-pass does so.  That's why I suggest that you use most of the pipeline.

Edward

vitalije

Dec 6, 2021, 3:31:01 AM
to leo-editor
I am trying to catch up with the new testing setup and your work on the importers so far. What is the proper way to run tests in Leo now?

Edward K. Ream

Dec 6, 2021, 7:58:58 AM
to leo-editor
On Mon, Dec 6, 2021 at 2:31 AM vitalije <vita...@gmail.com> wrote:
> I am trying to catch up with the new testing setup and your work on the importers so far. What is the proper way to run tests in Leo now?

Use the test-all or test-import commands. 

I have also defined (in my local leoPy.leo file) the following @command nodes:

@command test-python
g.cls()
g.run_unit_tests('leo.unittests.core.test_leoImport.TestPython')

and

@command test-one
# test = "" tests all python test cases.
test = 'test_nested_classes'  # or any other single python test case.
if test and not test.startswith('.'):
    test = f".{test}"
g.run_unit_tests(f"leo.unittests.core.test_leoImport.TestPython{test}")


The tests themselves reside in the leo/unittests folder.  leoPy(Ref).leo contains @file nodes for all tests.

Edward