ENB: about external file format 5-thin


vitalije

Jun 5, 2020, 11:10:24 AM
to leo-editor
For the past few days I've been working on the reusable functions for both parsing content of external files and writing external files. In the attached Leo document there are two new scripts. One is for generating the test data, and the other is for testing these two new functions. All tests are passing and round trip (text-> outline -> text) confirms that these functions have almost the same effect as Leo's FastAtFile reading and atFile writing methods.

Thinking about the format of external files and looking at them, I've come to the conclusion that this format contains some redundant information. This is not a big problem, but since I am currently working on this part of Leo's code base, I wish to propose some improvements to this format. Having redundant information means that different files may produce the same outline. This can cause problems when testing round trip transformations.

First of all, I should say that I wrote two simple scripts that can automatically convert current external file content to the new format and back to the original format.

  • The top-level node's gnx and headline are not necessary. Both are already present in the xml, so they don't provide any useful information in the external file. They can also cause problems when two different outlines contain the same external file: if the top-level node has a different path or a different gnx in those outlines, then they will produce different files even if they have the same content.
  • @+<< sentinels are redundant too. When we encounter a node whose headline is a section reference, we know that the section reference came just before the opening node line.
  • The @-<< sentinel and @afterref can be joined into one. The section name is not necessary because opening and closing sections must be properly nested, so we know for sure that the closing section has the same headline as the last open one. The closing @-<< sentinel can indicate whether the following line is an @afterref line or an ordinary line. For example, @-<<[ could mean the same as a closing section sentinel followed by an @afterref line, while @-<<] would mean there is no @afterref line after this closing sentinel.
  • @+others is not necessary, because when we hit the first open node without a section reference in its headline, we know for sure that an @others directive came just before this node. Also, when we encounter a new open node with a different indentation, we can be sure that an @others directive came just before it. When reading the external file, this line is used only to push the current node data onto the stack. This signal can instead be added to the opening node sentinel as a single character.
  • The format of the @+node sentinel can be changed so that the headline comes first and the gnx and level come at the end of the line, for example:
    #@ at.findFilesToRead        :ekr.20190108054317.1:6
    instead of
    #@+node:ekr.20190108054317.1: *6* at.findFilesToRead
    This would make the source code nicer to read in other editors.
  • The closing @-leo line is not necessary, and there is no need for @last directives either. Last lines are just the last lines of the top-level node.
  • The @first directive can be present in the body, but it doesn't need to be written in the external file, because we know that all lines coming before the `@+leo` sentinel are first lines.
Also, the so-called "dangerous directives" (@comment and @delims) are never used in Leo's code base. Personally, I can't think of a use case for those directives. If anyone knows of a specific use case where these directives solve a real-life problem that can't be solved without them, please share it here. I wish to understand why anyone would want to use these directives. If no such use case can be found, I would strongly suggest dropping support for them. That would allow us to further simplify both the reading and the writing code.

Fewer sentinel lines mean less parsing, less ambiguity, and less work, which leads to both simpler code and faster execution.
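To make the proposed node-sentinel layout concrete, here is a minimal sketch that rewrites one current-format `@+node` line into the layout shown in the list above. It is not taken from the attached scripts. The regex, the function names, and the fixed padding are assumptions for illustration, and only single-character-style comment delimiters such as `#` are considered.

```python
import re

# Current format, e.g. "#@+node:ekr.20190108054317.1: *6* at.findFilesToRead".
# Levels 1 and 2 are written as "*" and "**", deeper levels as "*3*", "*4*", ...
OLD_NODE = re.compile(
    r'^(?P<delim>\s*\S*?)@\+node:(?P<gnx>[^:]+): '
    r'(?P<stars>\*\d+\*|\*{1,2}) (?P<head>.*)$')

def decode_level(stars):
    """Turn '*', '**', '*3*' into 1, 2, 3."""
    return len(stars) if stars in ('*', '**') else int(stars.strip('*'))

def to_proposed(line, padding='        '):
    """Rewrite one current-format node sentinel in the proposed layout.

    Hypothetical helper: the padding width is an arbitrary choice.
    """
    m = OLD_NODE.match(line)
    if m is None:
        return line  # not a node sentinel: leave the line untouched
    level = decode_level(m.group('stars'))
    return f"{m.group('delim')}@ {m.group('head')}{padding}:{m.group('gnx')}:{level}"

old = '#@+node:ekr.20190108054317.1: *6* at.findFilesToRead'
print(to_proposed(old))
```

A converter in the opposite direction would need its own pattern, because in the proposed layout the headline comes first and may itself contain colons.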

Your thoughts, please.

Vitalije
issue-1598-experiments.leo

vitalije

Jun 5, 2020, 11:39:13 AM
to leo-editor
I forgot to mention that a round trip using the new functions is 1.9 times faster than one using c.atFileCommands. The test script compares round trips of leo/core/leoGlobals.py:

$ python p.py
setting leoID from os.getenv('USER'): 'vitalije'
f_new average: 30.429ms
f_old average: 58.055ms
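For readers who want to reproduce this kind of comparison, here is a minimal timing sketch. It is not the attached test script. `average_ms` and `speedup` are hypothetical helper names, and the callables passed in stand for whatever round-trip functions are being compared.

```python
import time

def average_ms(func, repeat=10):
    """Return the mean wall-clock time of func() in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeat):
        func()
    return (time.perf_counter() - start) * 1000 / repeat

def speedup(old_ms, new_ms):
    """How many times faster the new code is than the old code."""
    return old_ms / new_ms

# The averages reported above give the quoted factor of about 1.9.
print(f"{speedup(58.055, 30.429):.1f}")
```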

Vitalije

Thomas Passin

Jun 5, 2020, 11:58:39 AM
to leo-editor

On Friday, June 5, 2020 at 11:10:24 AM UTC-4, vitalije wrote:
> For the past few days I've been working on the reusable functions for both parsing content of external files and writing external files. In the attached Leo document there are two new scripts. One is for generating the test data, and the other is for testing these two new functions. All tests are passing and round trip (text-> outline -> text) confirms that these functions have almost the same effect as Leo's FastAtFile reading and atFile writing methods.
>
> Thinking about the format of external files and looking at them, I've come to the conclusion that this format contains some redundant information. This is not a big problem, but since I am currently working on this part of Leo's code base, I wish to propose some improvements to this format. Having redundant information means that different files may produce the same outline. This can cause problems when testing round trip transformations.
>
> First of all, I should say that I wrote two simple scripts that can automatically convert current external file content to the new format and back to the original format.
> Also, the so-called "dangerous directives" (@comment and @delims) are never used in Leo's code base. Personally, I can't think of a use case for those directives. If anyone knows of a specific use case where these directives solve a real-life problem that can't be solved without them, please share it here. I wish to understand why anyone would want to use these directives. If no such use case can be found, I would strongly suggest dropping support for them. That would allow us to further simplify both the reading and the writing code.
[snip]
> Fewer sentinel lines mean less parsing, less ambiguity, and less work, which leads to both simpler code and faster execution.
>
> Your thoughts, please.

I just used @delims the other day for a Windows command file. In cmd files I use "::" as a comment marker. I didn't find a Leo file type for cmd files, so I just went ahead and used the directive. I have used it a few other times over the years. I imagine that @comment is also needed from time to time. I can't be the only one. So I wouldn't get rid of these two.

I'm all in favor of simplifying code, but I think you may be drifting into the area of premature optimization.

Segundo Bob

Jun 5, 2020, 12:49:07 PM
to leo-e...@googlegroups.com
On 6/5/20 8:10 AM, vitalije wrote:
> top level node gnx and its headline are not necessary. Both headline and
> gnx are present in the xml. They don't provide any useful information.
> This also can cause problems when two different outlines contain the
> same external file. If the top level node have different path or
> different gnx in those outlines than they would produce different file
> even if they have the same content.

This has bothered me five or ten times when for unusual reasons I wanted
to @file one external file from two Leo-Editor files. In most cases
this problem caused me to do something else. In one or two cases I
lived with this problem.

--
Segundo Bob
Segun...@gmail.com

vitalije

Jun 5, 2020, 1:34:31 PM
to leo-editor


> I just used @delims the other day for a Windows command file. In cmd files I use "::" as a comment marker. I didn't find a Leo file type for cmd files, so I just went ahead and used the directive.

OK, this is a valid use case, though I didn't object to this kind of usage. Directives used this way can be skipped when writing the external file. Which delimiters were used to write the external file can (and should) be deduced from the @+leo sentinel line. If those delimiters don't match the delimiters defined for this file extension (or if there are no defaults, as in your case), the @delims directive can be added automatically to the top-level body. That way we could prevent the possibility of having different pairs of delimiters in a single external file. The possibility of creating such an ambiguous file is the main reason why these directives are considered dangerous. Handling them while parsing the external file content makes the code complex, and I can't think of a valid use case for that situation.

Delimiters are used so that Leo sentinels can be written in the external file as comment lines using the proper syntax for the given file. If we have two @delims directives with different values inside one external file, this file can't be syntactically correct.

I am not against letting the user choose which delimiters to use for any given file. I am just suggesting that this choice should be limited to one set of delimiters per file. If we agree on this limitation, then the @delims directive can be used but it doesn't have to be written in the external file. If it is necessary (i.e. if it clashes with the default delimiters), then the reading code would add it automatically to the top-level body. Or perhaps it can be written just as a flag in the @+leo sentinel, signaling only that this directive was (or was not) present in the top-level body. The delimiters deduced from the @+leo sentinel should be used for the entire file.
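As an illustration of deducing the delimiters from the @+leo line, here is a rough sketch. It is not Leo's actual reading code. The regex, the function names, and the idea of comparing against a `default_delims` pair are assumptions, and the sketch ignores any extra fields (such as an encoding) that a real header line may carry after `-thin`.

```python
import re

# Matches lines such as "#@+leo-ver=5-thin" or "<!--@+leo-ver=5-thin-->".
LEO_HEADER = re.compile(
    r'^(?P<start>\S*?)@\+leo-ver=(?P<ver>\d+)(?P<thin>-thin)?(?P<end>\S*)\s*$')

def delims_from_header(line):
    """Return (start_delim, end_delim) deduced from an @+leo sentinel line."""
    m = LEO_HEADER.match(line.strip())
    if m is None:
        raise ValueError('not an @+leo sentinel line')
    return m.group('start'), m.group('end')

def needs_delims_directive(header_line, default_delims):
    """True if the file's delimiters differ from the defaults for its type.

    In that case the reader could add an @delims directive to the
    top-level body instead of storing it in the external file.
    """
    return delims_from_header(header_line) != default_delims

print(delims_from_header('::@+leo-ver=5-thin'))
```

With a cmd file written using "::" and a hypothetical default of `('#', '')`, `needs_delims_directive` returns True, which is the case where the directive would be restored in the top-level body.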

I hope I made my point a bit more clear.

Vitalije

vitalije

Jun 5, 2020, 1:41:48 PM
to leo-editor

> This has bothered me five or ten times when for unusual reasons I wanted
> to @file one external file from two Leo-Editor files.  In most cases
> this problem caused me to do something else.  In one or two cases I
> lived with this problem.
>
> --
> Segundo Bob
> Segun...@gmail.com

One way to solve this issue is to add a node with the correct @path directive one level above the @file node. This allows the @file node to have the same headline in both outlines. Then it is necessary to make sure that these @file nodes have the same gnx in both outlines. To achieve this, copy the @file node from one outline and then paste it, retaining clones, into the other outline. After this, both outlines will produce the same external file.

It is not impossible to solve this problem using this trick, but it is cumbersome. It would be much easier if the top level gnx and headline were not written in the external file. Every outline could have its own gnx and file path, but they would produce the same output.

Vitalije

Thomas Passin

Jun 5, 2020, 6:08:14 PM
to leo-editor
Yes you have!  It makes perfect sense.

Edward K. Ream

Jun 6, 2020, 7:24:57 AM
to leo-editor
On Fri, Jun 5, 2020 at 10:10 AM vitalije <vita...@gmail.com> wrote:

> For the past few days I've been working on the reusable functions for both parsing content of external files and writing external files. In the attached Leo document there are two new scripts. One is for generating the test data, and the other is for testing these two new functions. All tests are passing and round trip (text-> outline -> text) confirms that these functions have almost the same effect as Leo's FastAtFile reading and atFile writing methods.

Good to know.

> Thinking about the format of external files and looking at them, I've come to the conclusion that this format contains some redundant information. This is not a big problem, but since I am currently working on this part of Leo's code base, I wish to propose some improvements to this format. Having redundant information means that different files may produce the same outline. This can cause problems when testing round trip transformations.

Some general reactions:

1. Changing Leo's file format would be a big deal. It will be inconvenient for Leo's users, Leo's devs, and future maintainers. A new file format would, at minimum, create migration problems. It would require new documentation and probably migration scripts similar to the script I recently wrote.

2. Leo's existing file format explicitly represents all of Leo's syntactic constructs. I never considered using a minimal set of sentinels. I only considered the clearest, most explicit, set of sentinels. The second principle of the zen of python is "Explicit is better than implicit." I want to retain the explicit correspondences between sentinels, nodes, @others and section references.

True, the first zen-of-python principle is "Beautiful is better than ugly."  Imo, this principle does not apply here. Eliding sentinels makes it harder for users to understand the sentinels. Again imo, there is nothing very beautiful about embedding subtle inferences in crucial read logic.

3. Error correction is not possible without redundancy. Removing various "non-essential" sentinels would make it harder to write scripts that act on external files. Such scripts would have to recreate the clever inferences that make eliding sentinels possible in the first place.

4. @clean allows users to eliminate all sentinels. Those who dislike sentinels are already using @clean. Those who don't care much about sentinels will not appreciate yet another unnecessary change to Leo.

5. Changing Leo's file format might affect the @clean logic. This logic does a diff between the external file and a recreation of that file (with sentinels) generated from the outline itself. Maybe that diff will work with a new file format, but that is not guaranteed. For sure, removing redundancy in the file format will make the @clean logic more fragile, in hard-to-predict ways.

6. Changing Leo's file format to make your new code easier to test would be letting the tail wag the dog. I am confident that you can find a robust testing strategy that does not depend on a new file format.

Now to specific comments:

> The top-level node's gnx and headline are not necessary. Both are already present in the xml, so they don't provide any useful information in the external file. They can also cause problems when two different outlines contain the same external file: if the top-level node has a different path or a different gnx in those outlines, then they will produce different files even if they have the same content.

I agree with you and Bob that this can be a problem. Imo, the way forward is to define clearly what happens when the xml and external file collide. I welcome your thoughts on this. Imo, it should be considered as a separate issue.
> • @+<< sentinels are redundant too. When we encounter a node whose headline is a section reference, we know that the section reference came just before the opening node line.

Yes, but I don't care.

> • The @-<< sentinel and @afterref can be joined into one. The section name is not necessary because opening and closing sections must be properly nested, so we know for sure that the closing section has the same headline as the last open one. The closing @-<< sentinel can indicate whether the following line is an @afterref line or an ordinary line. For example, @-<<[ could mean the same as a closing section sentinel followed by an @afterref line, while @-<<] would mean there is no @afterref line after this closing sentinel.

The documentation for @afterref is: "Marks non-whitespace text appearing after a section reference." I don't know whether these words are still true. Perhaps @afterref can truly be eliminated. If so, the way to do that is to change the write logic, not the read logic. Leo should be able to read @afterref "forever".

> • @+others is not necessary, because when we hit the first open node without a section reference in its headline, we know for sure that an @others directive came just before this node. Also, when we encounter a new open node with a different indentation, we can be sure that an @others directive came just before it. When reading the external file, this line is used only to push the current node data onto the stack. This signal can instead be added to the opening node sentinel as a single character.

Again, I don't care.

> • The format of the @+node sentinel can be changed so that the headline comes first and the gnx and level come at the end of the line, for example:
>     #@ at.findFilesToRead        :ekr.20190108054317.1:6
>     instead of
>     #@+node:ekr.20190108054317.1: *6* at.findFilesToRead
>     This would make the source code nicer to read in other editors.

I don't like this proposal, for several reasons:

1. I prefer the present format. I don't read external files often, but when I do I am usually more interested in the gnx's than the headlines.

2. The regex required to recognize the new node sentinel would be slower and less secure than the present regex. The present regex ends with something like ".*$". A new regex would begin with something like "^.*?"

We need to discuss this in more detail only if we all decide that a new file format is a good idea.
> • The closing @-leo line is not necessary, and there is no need for @last directives either. Last lines are just the last lines of the top-level node.
> • The @first directive can be present in the body, but it doesn't need to be written in the external file, because we know that all lines coming before the `@+leo` sentinel are first lines.

@first and @last have, in the past, caught mal-formed .leo files.

> Also, the so-called "dangerous directives" (@comment and @delims) are never used in Leo's code base. Personally, I can't think of a use case for those directives.

As has already been pointed out, these directives exist for specific reasons.

Summary

I see many reasons to retain the format of external files, and no compelling reason to change that format.

The problem with root @file nodes is real. Let's deal with it as a separate issue.

If @afterref truly is never useful, the graceful way to eliminate the sentinel would be by having Leo's write logic not write it.

Edward

vitalije

Jun 6, 2020, 9:30:38 AM
to leo-editor
> 6. Changing Leo's file format to make your new code easier to test would be letting the tail wag the dog. I am confident that you can find a robust testing strategy that does not depend on a new file format.

I wrote this post not because I couldn't make tests. The attached Leo document contains scripts that test the read and write functions by performing a round trip on all external files found in the Leo installation folder. Each external file is read and parsed using the function nodes_from_thin_file, which is a generator yielding tuples suitable to be piped into build_tree, which I wrote and tested earlier. The first part of the test compares the tuple values with the values found and prepared using Leo's normal read logic. Then the test script builds a VNode instance representing the whole external file, uses the function v_to_string to generate the content of the external file, and compares the resulting content with the source file.
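The shape of such a round-trip check can be sketched as follows. This is not the attached test script: the real signatures of nodes_from_thin_file, build_tree and v_to_string are not shown in this thread, so the sketch takes generic `parse` and `write` callables and uses toy stand-ins for them.

```python
def first_difference(expected, actual):
    """Return (line_number, expected_line, actual_line) for the first
    mismatch between two texts, or None if they are identical."""
    exp_lines = expected.splitlines(keepends=True)
    act_lines = actual.splitlines(keepends=True)
    for i in range(max(len(exp_lines), len(act_lines))):
        e = exp_lines[i] if i < len(exp_lines) else None
        a = act_lines[i] if i < len(act_lines) else None
        if e != a:
            return i + 1, e, a
    return None

def check_round_trip(text, parse, write):
    """Parse text, write it back, and report the first difference (or None)."""
    return first_difference(text, write(parse(text)))

# Toy stand-ins for the real reader and writer (assumptions, not Leo APIs).
def toy_parse(text):
    return text.split('\n')

def toy_write(lines):
    return '\n'.join(lines)

print(check_round_trip('#@+leo-ver=5-thin\nbody\n#@-leo\n', toy_parse, toy_write))
```

Reporting the first differing line, rather than only True or False, makes a failing file much easier to diagnose when the check runs over every external file in an installation.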

I understand your unease about making this kind of change. There is nothing urgent in my proposition. If we change the write code so that it outputs the starting sentinel @+leo-ver=6, we can use two different functions for parsing the rest of the file content. Old files having @+leo-ver=5 will be loaded using the old reading code. So there won't be any inconvenience for users, developers, or future maintainers.
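The version switch described here could look roughly like the sketch below. The two reader functions are placeholders, not Leo's real readers, and version 6 is only the format proposed in this thread.

```python
import re

VERSION = re.compile(r'@\+leo-ver=(\d+)')

def read_v5(text):
    """Placeholder for the existing reading code."""
    return ('v5', text)

def read_v6(text):
    """Placeholder for a reader of the proposed format."""
    return ('v6', text)

READERS = {5: read_v5, 6: read_v6}

def read_external_file(text):
    """Pick a reader based on the version in the @+leo sentinel line."""
    for line in text.splitlines():
        m = VERSION.search(line)
        if m:
            version = int(m.group(1))
            reader = READERS.get(version)
            if reader is None:
                raise ValueError(f'unsupported external file version: {version}')
            return reader(text)
    raise ValueError('no @+leo sentinel found')

print(read_external_file('#@+leo-ver=5-thin\nbody\n')[0])
```

The loop scans lines rather than looking only at the first one because @first lines may precede the @+leo sentinel.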

Explicit is better than implicit, I agree. Then why is the node level encoded using '*', '**', '*3*', '*4*', ...? Why is it better than just simple '1', '2', '3', ..? Isn't the second variant more explicit?

The need for the @last directive is a result of having the @-leo sentinel. Try it yourself: delete the closing Leo sentinel and all `@@last` lines before it, and Leo will read this file correctly, placing the last lines at the end of the node. The closing Leo sentinel doesn't add any useful information to the reading process. But because it exists, it creates a need for the @last directives. That means more code to execute, more regex searches to perform, and no gains in return.
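For anyone who wants to try the experiment on a copy of a file, here is a small sketch that removes the closing sentinel and the run of `@@last` lines directly before it. It assumes a single-line comment delimiter (`#` by default), and the function name is made up for this example.

```python
def drop_closing_sentinels(text, delim='#'):
    """Remove the @-leo line and any @@last lines immediately before it.

    Lines after the @-leo sentinel (the actual last lines) are kept.
    """
    lines = text.splitlines(keepends=True)
    closing = delim + '@-leo'
    last = delim + '@@last'
    # Search from the end: the closing sentinel is near the bottom of the file.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].startswith(closing):
            start = i
            while start > 0 and lines[start - 1].startswith(last):
                start -= 1
            return ''.join(lines[:start] + lines[i + 1:])
    return text  # no closing sentinel found: nothing to do

sample = '#@+leo-ver=5-thin\nbody\n#@@last\n#@-leo\n# vim: set ft=python\n'
print(drop_closing_sentinels(sample))
```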

If you edit an external file and separate the opening @+others or @+<< sentinel from the following start node sentinel (for example, by inserting a few lines between them), Leo will read this file correctly, but on the following write it will report the file as changed even if the user didn't change anything. If those two sentinels are expressed explicitly, not on their own separate lines but in the following node start sentinel as a single character (for example, "+/-" can represent the presence/absence of this directive), it won't be possible to separate them, and we would have two fewer patterns to match while reading.

Even if you prefer that users be able to understand sentinels better, having two consecutive lines containing the same <<section name>> text does not help much. But it does cause the user to see (and read) more garbage content.

Perhaps we could have a new setting @int default-external-file-format=5 by default and user can override it to 6 in myLeoSettings.leo. I am sure format-6 would be faster to read and write and some users would prefer to use it instead.

Anyway, I won't insist on changing the format, but if we are changing something it would be better to make all changes at once.

Regarding the first node start sentinel, perhaps the new read code can just skip this sentinel and use the values from the xml for gnx and headline. When writing a file, Leo can check whether this sentinel is present in the external file, and if it is, keep this sentinel line unchanged. Leo always reads the existing file to check whether there is a change or not, so this check won't be too expensive. This way a single external file can be opened using different paths in different outlines without generating unnecessary file changes.
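The "keep the existing root sentinel" idea could be sketched like this. It is an illustration only, not Leo's write code: the function name is hypothetical, and the marker assumes `#` comment delimiters.

```python
def keep_root_sentinel(old_text, new_text, marker='#@+node:'):
    """Return new_text with its first node sentinel replaced by the one
    already present in old_text (the file currently on disk)."""
    def first_node_index(lines):
        for i, line in enumerate(lines):
            if line.startswith(marker):
                return i
        return None

    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    i_old = first_node_index(old_lines)
    i_new = first_node_index(new_lines)
    if i_old is None or i_new is None:
        return new_text  # nothing to preserve
    new_lines[i_new] = old_lines[i_old]
    return ''.join(new_lines)

on_disk = '#@+leo-ver=5-thin\n#@+node:a.1: * @file x.py\nbody\n'
from_other_outline = '#@+leo-ver=5-thin\n#@+node:b.2: * @file sub/x.py\nbody\n'
print(keep_root_sentinel(on_disk, from_other_outline) == on_disk)
```

With the root sentinel carried over, two outlines that differ only in the root node's gnx and path produce identical output, so no spurious change is reported.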

Or we can just skip this sentinel when writing file. This will cause a single change to each external file, but after this no changes will ever be caused by accessing this file from different outlines.

Vitalije


Thomas Passin

Jun 6, 2020, 10:18:00 AM
to leo-editor

On Saturday, June 6, 2020 at 9:30:38 AM UTC-4, vitalije wrote:
> 6. Changing Leo's file format to make your new code easier to test would be letting the tail wag the dog. I am confident that you can find a robust testing strategy that does not depend on a new file format.
>
> I understand your unease about making this kind of change. There is nothing urgent in my proposition. If we change the write code so that it outputs the starting sentinel @+leo-ver=6, we can use two different functions for parsing the rest of the file content. Old files having @+leo-ver=5 will be loaded using the old reading code. So there won't be any inconvenience for users, developers, or future maintainers.

I'm with Edward on this one.  Having had corrupted or obsolete .leo files before, I do not want to have any possibility of having more.  In addition, if a new version of Leo starts to write a new format for say @file nodes, those still using an older version will not be able to read them.   It's already confusing enough to know what we are getting - what is the difference between @auto vs @file, for example?  Adding a new format will add to the uncertainty, and if you call it something different like @file1, that would be even more confusing for a lot of people.

Edward also mentioned redundancy.  IMO, redundancy that helps in error recovery is good.  Remember, there are going to be tens of thousands of files in the new format eventually.  Some of them will have mis-used directives, some of them will have some kind of corruption.  We need to have a good chance of recovering those files anyway.  And we would still need to keep the old code for the old format in Leo for many years.  So the result will be more complexity for Leo (both code branches will need to be maintained), not less, and more potential confusion for users and not less.

If Leo had to read large files rapidly and repeatedly, the conclusion might be different. But why should I care if Leo could read leoref.leo in 20 ms less time? It wouldn't matter at all as a practical matter. As a technical matter, of course it's cool if you develop new, clean, fast code - who doesn't like that? But for Leo as an everyday tool, there's really no benefit that I can see.

vitalije

Jun 6, 2020, 10:42:37 AM
to leo-editor

On Saturday, June 6, 2020 at 4:18:00 PM UTC+2, Thomas Passin wrote:
> Edward also mentioned redundancy.  IMO, redundancy that helps in error recovery is good.  Remember, there are going to be tens of thousands of files in the new format eventually.  Some of them will have mis-used directives, some of them will have some kind of corruption.  We need to have a good chance of recovering those files anyway.

While I would agree that redundancy usually means better error recovery, I really doubt that this applies here. The redundant parts that I've mentioned don't add any valuable information that could possibly be used for error recovery. And by the way, for redundancy to be used for error recovery you must have error recovery tools that can use it (which AFAIK Leo doesn't have). So the redundancy here just means more complexity, more garbage, and nothing valuable in return.

As I said before, I won't insist on this change, but for the sake of being precise I won't let false arguments stand either.

You wonder why the speed of reading and writing matters. Perhaps when you use Leo it doesn't matter to you whether it loads 200ms faster or not. But if a developer wants to run a thousand tests, then 20ms less per test actually means 20 seconds less in total. Waiting 20 seconds more for tests to finish might break the developer's flow of thought. Keeping that flow leads to better code. So in the end users will benefit even if they don't care about these micro-optimizations.

Vitalije

Thomas Passin

Jun 6, 2020, 11:48:49 AM
to leo-editor


On Saturday, June 6, 2020 at 10:42:37 AM UTC-4, vitalije wrote:

> You wonder why the speed of reading and writing matters. Perhaps when you use Leo it doesn't matter to you whether it loads 200ms faster or not. But if a developer wants to run a thousand tests, then 20ms less per test actually means 20 seconds less in total. Waiting 20 seconds more for tests to finish might break the developer's flow of thought. Keeping that flow leads to better code. So in the end users will benefit even if they don't care about these micro-optimizations.

Well, there's something in what you say.

Edward K. Ream

Jun 6, 2020, 1:08:58 PM
to leo-editor
But not enough. Any new confusion or bug will cost Leo's users and devs hours, days or weeks of work.

Edward

Edward K. Ream

Jun 6, 2020, 1:10:40 PM
to leo-editor
On Sat, Jun 6, 2020 at 8:30 AM vitalije <vita...@gmail.com> wrote:

> Anyway, I won't insist on changing the format, but if we are changing something it would be better to make all changes at once.

I agree.

> Regarding the first node start sentinel, perhaps the new read code can just skip this sentinel and use the values from the xml for gnx and headline. When writing a file, Leo can check whether this sentinel is present in the external file, and if it is, keep this sentinel line unchanged. Leo always reads the existing file to check whether there is a change or not, so this check won't be too expensive. This way a single external file can be opened using different paths in different outlines without generating unnecessary file changes.

I don't have an opinion about this. Do what you think best, and we'll all test it.

Edward