Rock solid & fast solution for thin node caching


Ville M. Vainio

Nov 3, 2008, 7:15:49 PM
to leo-e...@googlegroups.com
This will work, is easy to implement, and will be fast. In fact it's so
obvious that I feel a slight shame for not thinking of it before:

There is a dir ~/.leo/filecache

It has files named after the md5 sums of the contents of the files.

Every file contains a pickled version of the data structure that allows
quickly building the nodes contained in the tree for that file (not
the actual nodes!).

E.g.

[(node1-gnx, node1-head, node1-body,
  [(node1.1-gnx, node1.1-head, node1.1-body, []),
   (node1.2-gnx, node1.2-head, node1.2-body, [])]),
 (node2-gnx, node2-head, node2-body, []),
 ...]

Where 1.1 and 1.2 are children of node1. All the vars are strings here.

So, when a .leo file is opened, we md5sum all the external files (this
will work for shadow files as well). If we have the right pickle
available, we use it. If not, we proceed normally, but write out the
pickle.
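
A minimal Python sketch of that scheme, under the assumption that the cached
structure is the nested (gnx, head, body, children) list above; the helper
names and cache layout are illustrative, not Leo's actual API:

```python
import hashlib
import os
import pickle

CACHE_DIR = os.path.expanduser('~/.leo/filecache')  # illustrative location

def cache_path(data):
    """Name the cache file after the md5 of the file's contents."""
    return os.path.join(CACHE_DIR, hashlib.md5(data).hexdigest())

def load_cached_tree(data):
    """Return the pickled (gnx, head, body, children) tree, or None on a miss."""
    path = cache_path(data)
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    return None

def store_tree(data, tree):
    """Cache the parsed tree so the next open of this content is fast."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cache_path(data), 'wb') as f:
        pickle.dump(tree, f)
```

On open: hash the external file's bytes, try `load_cached_tree`; on a miss,
parse normally and call `store_tree`.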

You can always delete the cache. It only means that the next .leo file
will be slow to open the first time. Typically, though, the files
rarely change (and only few of them do at the time). And users rarely
have that many .leo files to edit.


--
Ville M. Vainio
http://tinyurl.com/vainio

Ville M. Vainio

Nov 3, 2008, 7:18:22 PM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 2:15 AM, Ville M. Vainio <viva...@gmail.com> wrote:

> This will work, is easy to implement, and will be fast. In fact it's so
> obvious that I feel a slight shame for not thinking of it before:

Actually, I may have an idea why this hasn't been done before - we
didn't have a .leo directory where stuff like this can be stored. Such
a simple thing, yet such an enabling quality.

Ville M. Vainio

Nov 3, 2008, 7:33:44 PM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 2:15 AM, Ville M. Vainio <viva...@gmail.com> wrote:

> This will work, is easy to implement, and will be fast. In fact it's so
> obvious that I feel a slight shame for not thinking of it before:
>
> There is a dir ~/.leo/filecache
>
> It has files named after the md5 sums of the contents of the files.
>
> Every file contains a pickled version of the data structure that allows
> quickly building the nodes contained in the tree for that file (not
> the actual nodes!).
>
> E.g.
>
> [(node1-gnx, node1-head, node1-body,
>   [(node1.1-gnx, node1.1-head, node1.1-body, []),
>    (node1.2-gnx, node1.2-head, node1.2-body, [])]),
>  (node2-gnx, node2-head, node2-body, []),
>  ...]

And to refine this further: instead of actual strings for head and
body, we can store the start and end byte offsets of the headlines and
bodies within the file.

This will

- Be even faster, since we have already read in the actual file to
calculate the md5sum
- More importantly, make it possible to never delete the cache
(because so little space is used)

The more I think about it, the more I like it. :-)
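
A sketch of the offsets idea (all names here hypothetical): each cached node
carries (start, end) byte spans, and the strings are sliced back out of the
file contents that were already read for the md5sum:

```python
# Hypothetical offsets-based node: (gnx, (head_start, head_end),
# (body_start, body_end), children) instead of the strings themselves.
data = b"#@+node: my headline\nmy body text\n"  # file bytes, already in memory
node = ("gnx-1", (9, 20), (21, 33), [])

def headline(data, node):
    """Recover the headline string by slicing the raw file bytes."""
    _gnx, (h0, h1), _body_span, _children = node
    return data[h0:h1].decode('utf-8')

def body(data, node):
    """Recover the body string the same way."""
    _gnx, _head_span, (b0, b1), _children = node
    return data[b0:b1].decode('utf-8')
```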

Terry Brown

Nov 3, 2008, 9:46:54 PM
to leo-e...@googlegroups.com
On Tue, 4 Nov 2008 02:15:49 +0200

"Ville M. Vainio" <viva...@gmail.com> wrote:

> [(node1-gnx, node1-head, node1-body,
>   [(node1.1-gnx, node1.1-head, node1.1-body, []),
>    (node1.2-gnx, node1.2-head, node1.2-body, [])]),
>  (node2-gnx, node2-head, node2-body, []),
>  ...]

unknownAttributes fit in here somewhere, right?

Sounds interesting.

Cheers -Terry

Ville M. Vainio

Nov 4, 2008, 1:07:54 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 4:46 AM, Terry Brown <terry_...@yahoo.com> wrote:

>> [(node1-gnx, node1-head, node1-body,
>>   [(node1.1-gnx, node1.1-head, node1.1-body, []),
>>    (node1.2-gnx, node1.2-head, node1.2-body, [])]),
>>  (node2-gnx, node2-head, node2-body, []),
>>  ...]
>
> unknownAttributes fit in here somewhere, right?

Yes, that dict can be pickled as well.

Ville M. Vainio

Nov 4, 2008, 1:58:09 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 2:33 AM, Ville M. Vainio <viva...@gmail.com> wrote:

> This will

... allow lazy nodes (to save RAM, as opposed to saving time),
since the nodes themselves can be created without parsing the file. If
we stash the byte indices for the body in the tnode, we can read it in
only when getBodyString() is executed (provided that the file timestamp
has not changed).

Of course this is an optional goodie, and doing "find" will currently
draw in all nodes.
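
A rough sketch of such a lazy body (hypothetical class, not Leo's tnode
API): the body text is sliced out of the external file only when it is
first asked for, guarded by the timestamp check mentioned above:

```python
import os

class LazyBody:
    """Lazily-loaded node body: reads its byte span from disk on demand."""

    def __init__(self, path, start, end):
        self.path = path
        self.start, self.end = start, end
        self.mtime = os.path.getmtime(path)  # remember the timestamp
        self._body = None

    def getBodyString(self):
        if self._body is None:
            # If the file changed on disk, the offsets are stale:
            # fall back to a normal full read instead.
            if os.path.getmtime(self.path) != self.mtime:
                raise IOError('file changed on disk; reread it normally')
            with open(self.path, 'rb') as f:
                f.seek(self.start)
                self._body = f.read(self.end - self.start).decode('utf-8')
        return self._body
```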

Ville M. Vainio

Nov 4, 2008, 2:06:54 AM
to leo-e...@googlegroups.com
And, if you're skeptical about whether this will improve the performance
that much (since we need to read in the files to compute the md5sums
anyway), here are some benchmarks for just using the md5sum program
(Python's md5 module is written in C as well):

Second run:

%time md5sum core/*.py

...

CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.01 s

Second run:

[~/qt-plugin/leo]|38> %time md5sum plugins/*.py

CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.02 s

First run (files not in disk cache):

[src/linux-headers-2.6.24-21]|49> %time md5sum `find .`
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 7.47 s

Second run:

CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.34 s

And that's 62 megs of code (linux kernel headers).
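
The same measurement can be reproduced from Python with hashlib (the
preferred interface in modern Pythons); this sketch simply hashes every
file under a directory:

```python
import hashlib
import os

def md5_tree(root):
    """md5sum every regular file under `root`; returns {path: hexdigest}."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as f:
                sums[path] = hashlib.md5(f.read()).hexdigest()
    return sums

# Usage (timing it, roughly what %time does above):
#   import time; t0 = time.time(); sums = md5_tree('core')
#   print(len(sums), 'files hashed in', time.time() - t0, 's')
```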

Edward K. Ream

Nov 4, 2008, 8:46:11 AM
to leo-e...@googlegroups.com
On Mon, Nov 3, 2008 at 6:15 PM, Ville M. Vainio <viva...@gmail.com> wrote:

> There is a dir ~/.leo/filecache
>
> It has files named after the md5 sums of the contents of the files.

Did you mean to say:

The filecache has *files* named after the md5 sums of the contents of
*derived* files?

> Every file contains a pickled version of the data structure that allows
> quickly building the nodes contained in the tree for that file (not
> the actual nodes!).

I understand you to mean that:

- every file in the filecache corresponds to a version of a derived file

- every file in the filecache contains nothing but the pickled data structure

- pickling just simplifies the process of recreating the data structure.

Presumably, there will be a mechanism for resolving checksum collisions.

> E.g.
>
> [(node1-gnx, node1-head, node1-body,
>   [(node1.1-gnx, node1.1-head, node1.1-body, []),
>    (node1.2-gnx, node1.2-head, node1.2-body, [])]),
>  (node2-gnx, node2-head, node2-body, []),
>  ...]
>
> Where 1.1 and 1.2 are children of node1. All the vars are strings here.
>
> So, when a .leo file is opened, we md5sum all the external files (this
> will work for shadow files as well). If we have the right pickle
> available, we use it. If not, we proceed normally, but write out the
> pickle.

So, Leo will recreate the outline from the (unpickled) data, without
reading any derived file? This could be a *big* improvement in speed
compared with the corresponding leoAtFile.py read code!

> You can always delete the cache. It only means that the next .leo file
> will be slow to open the first time. Typically, though, the files
> rarely change (and only few of them do at the time). And users rarely
> have that many .leo files to edit.

Edward

Edward K. Ream

Nov 4, 2008, 8:47:14 AM
to leo-e...@googlegroups.com
On Mon, Nov 3, 2008 at 6:18 PM, Ville M. Vainio <viva...@gmail.com> wrote:

> Actually, I may have an idea why this hasn't been done before - we
> didn't have a .leo directory where stuff like this can be stored. Such
> a simple thing, yet such an enabling quality.

It wasn't done before because nobody ever thought to do it.

Edward

Edward K. Ream

Nov 4, 2008, 8:50:37 AM
to leo-e...@googlegroups.com
On Mon, Nov 3, 2008 at 6:33 PM, Ville M. Vainio <viva...@gmail.com> wrote:

> And to refine this further, instead of actual strings for head and
> body, we can have the start and end byte offset within the file of
> headlines and bodies.

This is a minor, probably premature, optimization. It may be worth
doing, or not, but it affects the essential design not at all. The
size of the file cache should never be an issue in these days of
gigantic hard disks.

Edward

Edward K. Ream

Nov 4, 2008, 8:57:58 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 1:06 AM, Ville M. Vainio <viva...@gmail.com> wrote:

> And, if you're skeptical about whether this will improve the performance
> that much (since we need to read in the files to compute the md5sums
> anyway), here are some benchmarks for just using the md5sum program
> (Python's md5 module is written in C as well):

Perhaps I misunderstand what you are saying. It seems you are worried
about something that in fact will improve Leo's performance
significantly.

Indeed, the performance of md5 is unimportant unless it is very slow.
OTOH, the ability to reconstruct external (derived) files from the
cache is huge: it trades an insignificant increase in write time for a
huge decrease in read time.

Notice: Leo writes only *changed* external files, but when reading a
.leo file, Leo must now read *all* external files. So speeding up
reads at the expense of writes is a big win. And this is so even if
we never use node versioning.

Edward

Edward K. Ream

Nov 4, 2008, 9:00:36 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 12:58 AM, Ville M. Vainio <viva...@gmail.com> wrote:

>> This will
>
> ... allow lazy nodes (in order to save ram, as opposed to save time),
> since the nodes themselves can be created without parsing the file. If
> we stash byte indices for body in tnode, we can only read it in when
> getBodyString() is executed (provided that the file timestamp is not
> changed).

Save ram? Who cares about ram?

This is a premature optimization in the sense that it will be millennia
before I do this :-)

Edward

Ville M. Vainio

Nov 4, 2008, 9:01:04 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 3:46 PM, Edward K. Ream <edre...@gmail.com> wrote:

> Did you mean to say:
>
> The filecache has *files* named after md5 sums of the contents of
> *derived* files. ?

Yes. It was around 2:30 am when I wrote it, and I got out of bed to
write it down ;-)

> I understand you to mean that:
>
> - every file in the filecache corresponds to a version of a derived file
>
> - every file in the filecache contains nothing but the pickled data structure
>
> - pickling just simplifies the process of recreating the data structure.

Yes.

> Presumably, there will be a mechanism for resolving checksum collisions.

Not needed, since the collisions are practically impossible.
Distributed version control systems are built on this assumption.

> So, Leo will recreate the outline from the (unpickled) data, without
> reading any derived file? This could be a *big* improvement in speed
> compared with the corresponding leoAtFile.py read code!

Yes, it will make the operation almost instantaneous.

Ville M. Vainio

Nov 4, 2008, 9:04:36 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 4:00 PM, Edward K. Ream <edre...@gmail.com> wrote:

> Save ram? Who cares about ram?

People who have 1 GB text files as external files ;-). I agree
that it's probably a premature optimization.

However, I don't think storing byte offsets instead of strings is a
premature optimization, since it can save a lot of disk space (i.e.
the difference between having to clean up the cache dir a few times per
month, as opposed to never having to clean it).

Also, computers with small drives are becoming commonplace (e.g. the
Asus Eee 700 only has a 4 GB flash drive).

Edward K. Ream

Nov 4, 2008, 9:08:07 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 8:01 AM, Ville M. Vainio <viva...@gmail.com> wrote:

>> Presumably, there will be a mechanism for resolving checksum collisions.
>
> Not needed, since the collisions are practically impossible.
> Distributed version control systems are built on this assumption.

I didn't know that!

>> So, Leo will recreate the outline from the (unpickled) data, without
>> reading any derived file? This could be a *big* improvement in speed
>> compared with the corresponding leoAtFile.py read code!
>
> Yes, it will make the operation almost instantaneous.

!! That's reason enough to do it. I suppose the atFile read code must
still exist (for when the cache file does not exist), but bypassing
this code in normal operation is one of the coolest ideas ever. It's
incomparably better than, for example, rewriting leoAtFile.py in Pyrex
or C.

Edward

Ville M. Vainio

Nov 4, 2008, 9:08:36 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 3:57 PM, Edward K. Ream <edre...@gmail.com> wrote:

> Indeed, the performance of md5 is unimportant unless it is very slow.
> OTOH, the ability to reconstruct external (derived) files from the
> cache is huge: it trades an insignificant increase in write time for a
> huge decrease in read time.

Actually, write time is not affected, since writing a .leo file does not
write the cache files.

What we are trading off is the time spent on the first open of
a .leo file (since that's when the cache files are created), but
that's negligible.

Ville M. Vainio

Nov 4, 2008, 9:11:22 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 4:08 PM, Edward K. Ream <edre...@gmail.com> wrote:

> !! That's reason enough to do it. I suppose the atFile read code must
> still exist (for when the cache file does not exist) but bypassing
> this code in normal operation is one of the coolest ideas ever. It's
> incomparably better than, for example, rewriting leoAtFile.py in pyrex
> or c.

You need to rewrite the atFile read code to output the data structure
I described above. Then, you just 1) pickle out the data structure to
the cache file and 2) pass it on to a (new) method that creates nodes
from that info.
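
A sketch of that flow, where `parse_derived_file` and `create_nodes` are
hypothetical stand-ins for the rewritten atFile read code and the new
node-building method:

```python
import pickle

def parse_derived_file(path):
    """Stand-in for the rewritten atFile read code: instead of building
    nodes directly, it returns the (gnx, head, body, children) structure.
    Here it just fakes a single node."""
    return [('gnx-1', 'headline from %s' % path, 'body', [])]

def create_nodes(tree):
    """Stand-in for the new method that builds nodes from the structure."""
    return [{'gnx': gnx, 'h': head, 'b': body, 'children': create_nodes(kids)}
            for gnx, head, body, kids in tree]

def read_external_file(path, cache_file):
    tree = parse_derived_file(path)
    with open(cache_file, 'wb') as f:
        pickle.dump(tree, f)          # 1) pickle out the data structure
    return create_nodes(tree)         # 2) build the nodes from that info
```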

Ville M. Vainio

Nov 4, 2008, 9:16:29 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 3:50 PM, Edward K. Ream <edre...@gmail.com> wrote:

> This is a minor, probably premature, optimization. It may be worth
> doing, or not, but it affects the essential design not at all. The
> size of the file cache should never be an issue in these days of
> gigantic hard disks.

Actually, if we pay some attention to discarding old cache files (e.g.
by "remembering" the checksums we used to read from, and noticing when
the checksum for a file changes), we can indeed dump the whole
file. This will also allow us to skip utf-8 decoding ;-)

I still think storing byte offsets is the right way to go, but that
optimization can be done later.

Edward K. Ream

Nov 4, 2008, 9:46:36 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 8:11 AM, Ville M. Vainio <viva...@gmail.com> wrote:

> You need to rewrite the atFile read code to output the data structure
> I described above. Then, you just 1) pickle out the data structure to
> the cache file and 2) pass it on to a (new) method that creates nodes
> from that info.

Yes. Almost all the old atFile read/write code remains unchanged. A
few new methods will be added. In particular, the new scheme will use
the old uA code: uA's are stored in the <v> attributes of the .leo
file, not anywhere in any external file.

Edward

Edward K. Ream

Nov 4, 2008, 9:51:07 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 8:04 AM, Ville M. Vainio <viva...@gmail.com> wrote:
>
> On Tue, Nov 4, 2008 at 4:00 PM, Edward K. Ream <edre...@gmail.com> wrote:
>
>> Save ram? Who cares about ram?
>
> People that will have 1 gig text files as external files ;-). I agree
> that it's probably premature optimization.

I don't think it is any kind of optimization. It would complicate
Leo's code and data model to no purpose. The proper time to create
Leo's nodes is when a file is read. Thereafter, the OS can handle
memory allocation. All nodes will be needed when c.allNodes_iter is
executed, so all this "optimization" does is fragment access to the
disk. It's a bad idea in the presence of virtual memory.

Edward

Terry Brown

Nov 4, 2008, 11:07:55 AM
to leo-e...@googlegroups.com
On Tue, 4 Nov 2008 08:08:07 -0600
"Edward K. Ream" <edre...@gmail.com> wrote:

> >> Presumably, there will be a mechanism for resolving checksum
> >> collisions.
> >
> > Not needed, since the collisions are practically impossible.
> > Distributed version control systems are built on this assumption.
>
> I didn't know that!

Some systems prepend YYYYMMDDHHSS for extra robustness...

Cheers -Terry

Ville M. Vainio

Nov 4, 2008, 11:22:23 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 6:07 PM, Terry Brown <terry_...@yahoo.com> wrote:

>> > Not needed, since the collisions are practically impossible.
>> > Distributed version control systems are built on this assumption.
>>
>> I didn't know that!
>
> Some systems prepend YYYYMMDDHHSS for extra robustness...

Yeah, but Leo probably shouldn't do that (a 'touch'ed file is still
the same file, even if it has a new timestamp).

If we want to optimize even more at some point, we could generate the
hash from just abspath, timestamp and size (the stat() info) - it will
cover the most typical cases. However, let's see how full-body hashing
performs first...
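
A sketch of that cheaper key (hypothetical helper name): hash the stat()
info instead of the file body. Note the trade-off mentioned above: a
rename or a 'touch' changes the key even when the contents are identical.

```python
import hashlib
import os

def stat_key(path):
    """Cache key built from abspath, mtime and size (the stat() info),
    avoiding a full read of the file body."""
    st = os.stat(path)
    raw = '%s:%s:%s' % (os.path.abspath(path), st.st_mtime, st.st_size)
    return hashlib.md5(raw.encode('utf-8')).hexdigest()
```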

Edward K. Ream

Nov 4, 2008, 11:29:40 AM
to leo-e...@googlegroups.com

How about prepending the original file name: fn.md5? It would be a good hint.

Edward

Ville M. Vainio

Nov 4, 2008, 11:45:39 AM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 6:29 PM, Edward K. Ream <edre...@gmail.com> wrote:

>> Some systems prepend YYYYMMDDHHSS for extra robustness...
>
> How about prepending the original file name: fn.md5 It would be a good hint.

Actually, if we leave out the filename, we can also reuse the cache
even when the user renames a file (same content => same resulting
nodes). We don't need to be cryptographically secure here.

Though this is turning a bit academic, and such details can be ignored
for now...

Edward K. Ream

Nov 4, 2008, 2:01:52 PM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 10:45 AM, Ville M. Vainio <viva...@gmail.com> wrote:

> Actually, if we leave out the filename, we can also reuse the cache
> even when the user does file renaming (same content => same resulting
> nodes). We don't need to be cryptogrophically secure here.
>
> Though this is turning a bit academic, and such details can be ignored
> for now...

I agree. My thought was that the filename might be a debugging aid.

Edward

Edward K. Ream

Nov 4, 2008, 2:21:31 PM
to leo-editor


On Nov 4, 1:01 pm, "Edward K. Ream" <edream...@gmail.com> wrote:
> On Tue, Nov 4, 2008 at 10:45 AM, Ville M. Vainio <vivai...@gmail.com> wrote:
>
> > Actually, if we leave out the filename, we can also reuse the cache
> > even when the user does file renaming (same content => same resulting
> > nodes). We don't need to be cryptogrophically secure here.
>
> > Though this is turning a bit academic, and such details can be ignored
> > for now...
>
> I agree.  My thought was that the filename might be a debugging aid.

An alternative would be to put debugging info into the cache file
itself. Your choice.

Edward

Ville M. Vainio

Nov 4, 2008, 2:43:37 PM
to leo-e...@googlegroups.com
On Tue, Nov 4, 2008 at 9:21 PM, Edward K. Ream <edre...@gmail.com> wrote:

>> I agree. My thought was that the filename might be a debugging aid.
>
> An alternative would be to put debugging info into the cache file
> itself. Your choice.

Since it's a pickle, you can add all the debug info you want there. So
that seems to be the winner here.
