I saw the recent discussion on the list about slow git status, and with
unicode support finally done, I thought I might revive a project that
I was playing around with some time ago.
Basic idea is a cache that sits transparently below our lstat and readdir
implementations, but always reads entire directories (instead of
lstat'ting single files). The cache is kept up to date via
ReadDirectoryChangesW (there is no *A version, so I needed unicode support
first...).
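A minimal sketch of the "read the whole directory once" idea, in Python
rather than the C of the actual branch. The class name DirStatCache and the
dir_reads counter are made up for illustration, and invalidation is manual
here instead of being driven by ReadDirectoryChangesW:

```python
import os
import tempfile


class DirStatCache:
    """Caches lstat results per directory, filling a whole directory at once."""

    def __init__(self):
        self._dirs = {}      # directory path -> {entry name: os.stat_result}
        self.dir_reads = 0   # number of full directory scans performed

    def _load(self, dirpath):
        entries = {}
        with os.scandir(dirpath) as it:
            for entry in it:
                # follow_symlinks=False gives lstat semantics
                entries[entry.name] = entry.stat(follow_symlinks=False)
        self.dir_reads += 1
        self._dirs[dirpath] = entries
        return entries

    def lstat(self, path):
        dirpath, name = os.path.split(os.path.abspath(path))
        entries = self._dirs.get(dirpath)
        if entries is None:
            entries = self._load(dirpath)   # first request scans the directory
        try:
            return entries[name]
        except KeyError:
            raise FileNotFoundError(path) from None

    def invalidate(self, dirpath):
        """Forget a directory, e.g. after a change notification for it."""
        self._dirs.pop(os.path.abspath(dirpath), None)


def demo():
    """lstat three files in one directory; return how many scans that took."""
    with tempfile.TemporaryDirectory() as d:
        names = ["a.txt", "b.txt", "c.txt"]
        for n in names:
            open(os.path.join(d, n), "w").close()
        cache = DirStatCache()
        for n in names:
            cache.lstat(os.path.join(d, n))
        return cache.dir_reads
```

In this sketch, three lstat calls on files in the same directory cost a
single directory scan, which is where the gain for git status would come
from.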
The performance improvements of the cached version are quite impressive
for git status, even with core.preloadindex turned on. The numbers below
are for git status on the WebKit repo, normal git v1.7.9 vs. the cached
version, for different core.preloadindex and --untracked-files settings:
preload | -u  | normal | cached | gain
--------+-----+--------+--------+------
false   | all | 25.144 |  3.055 |  8.2
false   | no  | 22.822 |  1.748 | 12.8
true    | all |  9.234 |  2.179 |  4.2
true    | no  |  6.833 |  0.955 |  7.2
Drawback of this approach is that git operations that modify the working
copy are actually slower with the cache, because e.g. git checkout
typically lstats every file immediately after creating it (so the cache is
always dirty). It may be better to activate the cache just for readonly
operations.
In case you wanna try it out, I've pushed the code to:
https://github.com/kblees/git/tree/kb/fscache-v0
However, beware that this actually predates the unicode patches: it is
pretty ugly and quite sparse on comments and error handling (but seems to
work mostly, at least the test suite doesn't complain that much...). I
just pulled the code to devel, fixed compile errors and added thread
synchronization to test with core.preloadindex=true.
Unless I hear violent objections from the list, I plan to polish this code
a bit and provide some patches that are probably worthy of inclusion...(no
promises on the time frame, though :-).
Cheers,
Karsten
On Tue, 14 Feb 2012, karste...@dcon.de wrote:
> Basic idea is a cache that sits transparently below our lstat and
> readdir implementations, but always reads entire directories (instead of
> lstat'ting single files). The cache is kept up to date via
> ReadDirectoryChangesW (there is no *A version, so I needed unicode
> support first...).
>
> The performance improvements of the cached version are quite impressive
> for git status, even with core.preloadindex turned on. The numbers below
> are for git status on the WebKit repo, normal git v1.7.9 vs. the cached
> version, for different core.preloadindex and --untracked-files settings:
>
> preload | -u  | normal | cached | gain
> --------+-----+--------+--------+------
> false   | all | 25.144 |  3.055 |  8.2
> false   | no  | 22.822 |  1.748 | 12.8
> true    | all |  9.234 |  2.179 |  4.2
> true    | no  |  6.833 |  0.955 |  7.2
Those are impressive numbers!
> Drawback of this approach is that git operations that modify the working
> copy are actually slower with the cache, because e.g. git checkout
> typically lstats every file immediately after creating it (so the cache
> is always dirty). It may be better to activate the cache just for
> readonly operations.
Alternatively, we could add code hints that turn off the cache in
certain code blocks (although we have to be careful to do this
thread-locally) in a later commit. That is, if I understand correctly that
updating the cache is only worth it when the ratio of file updates to
lstat calls is small?
> In case you wanna try it out, I've pushed the code to:
>
> https://github.com/kblees/git/tree/kb/fscache-v0
>
> However, beware that this actually predates the unicode patches: it is
> pretty ugly and quite sparse on comments and error handling (but seems
> to work mostly, at least the test suite doesn't complain that much...).
> I just pulled the code to devel, fixed compile errors and added thread
> synchronization to test with core.preloadindex=true.
>
> Unless I hear violent objections from the list, I plan to polish this
> code a bit and provide some patches that are probably worthy of
> inclusion...(no promises on the time frame, though :-).
Sounds like a fun project, and worthwhile, too, what with Git being
optimized speed-wise for Linux.
Will try to find time to play with it tomorrow...
Ciao,
Dscho
Yes, I've implemented (very) simple heuristics in fsentry.modcnt, so that
a cache entry is only replaced if there have been several consecutive
reads without a change. Lstat for dirty cache entries is redirected to
mingw_lstat.
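One possible reading of that heuristic, sketched in Python. The real logic
in fsentry.modcnt is not shown in this thread, so the threshold
CLEAN_READS_REQUIRED, the class names, and the use of (mtime, size) to
detect "no change" are all assumptions made for illustration; os.lstat
stands in for mingw_lstat:

```python
import os

CLEAN_READS_REQUIRED = 3  # hypothetical threshold; the text only says "several"


class Entry:
    def __init__(self):
        self.stat = None   # cached result, or None while the entry is dirty
        self.last = None   # (mtime_ns, size) seen by the previous direct read
        self.stable = 0    # consecutive direct reads that saw no change


class HeuristicCache:
    def __init__(self):
        self._entries = {}
        self.direct_calls = 0   # how often the real lstat was used

    def mark_dirty(self, path):
        """Called when a change notification arrives for path."""
        e = self._entries.setdefault(path, Entry())
        e.stat, e.last, e.stable = None, None, 0

    def lstat(self, path):
        e = self._entries.setdefault(path, Entry())
        if e.stat is not None:
            return e.stat              # clean entry: served from the cache
        st = os.lstat(path)            # dirty entry: go to the real lstat
        self.direct_calls += 1
        sig = (st.st_mtime_ns, st.st_size)
        e.stable = e.stable + 1 if sig == e.last else 1
        e.last = sig
        if e.stable >= CLEAN_READS_REQUIRED:
            e.stat = st                # unchanged long enough: cache it again
        return st
```

With this shape, a file that is rewritten and stat'ed in a tight loop (as in
git checkout) never re-enters the cache, while a file that settles down is
cached again after a few reads.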
However, the heuristics code and the ReadDirectoryChangesW stuff are quite
complex. Additionally, the documentation of ReadDirectoryChangesW states
that it may actually return the short file name (e.g. PROGRA~1 instead of
Program Files), which is useless for finding the cache entry.
Implementing a read-only cache activated on demand might be a much more
robust solution, and give us decent improvement without all the trouble.
I'm thinking of git-status / wt_status_collect and read_index_preload
(AFAICT the latter is used by practically every command that operates on
the working copy).
> > In case you wanna try it out, I've pushed the code to:
> >
> > https://github.com/kblees/git/tree/kb/fscache-v0
> >
> > However, beware that this actually predates the unicode patches: it is
> > pretty ugly and quite sparse on comments and error handling (but seems
> > to work mostly, at least the test suite doesn't complain that much...).
> > I just pulled the code to devel, fixed compile errors and added thread
> > synchronization to test with core.preloadindex=true.
> >
> > Unless I hear violent objections from the list, I plan to polish this
> > code a bit and provide some patches that are probably worthy of
> > inclusion...(no promises on the time frame, though :-).
>
> Sounds like a fun project, and worthwhile, too, what with Git being
> optimized speed-wise for Linux.
>
Ok, then I'll put some time into this...looking forward to a lot of
enlightening discussions :-)
Bye,
Karsten
Not so long ago I looked at using the ReadDirectoryChanges functions
via Tcl in git-gui to see about having refresh itself in a more timely
fashion than it does now. I've a small extension to Tcl that can
notify the app when required and it just needs some refining to only
pay attention to files that are actually part of the git project and
not any old rubbish. However, I wonder if there is scope for some hook
calling in your cache updating. You'd want to just send a message to
something so you don't clobber the performance gains you are chasing
but some sort of "the status changed" message might fall out of this
usefully.
I don't think so. Currently I'm using ReadDirectoryChanges synchronously
in the context of the calling thread, to be absolutely certain the cache
is up to date. Besides, the cache sits below lstat / opendir, so it would be
hard to process pathspecs or .gitignore there, and you wouldn't be
interested in e.g. .o / .exe changes...
I think what you're looking for is some asynchronous notification
mechanism, something like a daemonized git process, using a
background-thread to process changes and send notification messages,
right? Let's say a new command git-monitor-status [<pathspec>] that tracks
file system changes of relevant files and prints them to stdout? It could
use ReadDirectoryChanges on Windows and inotify on Linux...
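To make the idea concrete, here is a small Python sketch of the change
detection such a monitor would need. It is only a stand-in: it compares two
polled snapshots instead of using ReadDirectoryChanges or inotify, it does
no pathspec or .gitignore filtering, and the function names are invented
for illustration:

```python
import os


def snapshot(root):
    """Map each file's path relative to root to (mtime_ns, size)."""
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")   # do not descend into repository data
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            state[os.path.relpath(full, root)] = (st.st_mtime_ns, st.st_size)
    return state


def diff(old, new):
    """Return sorted (kind, path) tuples describing what changed."""
    changes = [("added", p) for p in new.keys() - old.keys()]
    changes += [("removed", p) for p in old.keys() - new.keys()]
    changes += [("modified", p)
                for p in old.keys() & new.keys() if old[p] != new[p]]
    return sorted(changes)
```

A background thread could take a snapshot periodically and print the result
of diff(previous, current) to stdout whenever it is non-empty.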
That would surely be an interesting project as well, but not what I had in
mind with the file system cache.
On Tue, 14 Feb 2012, Albert wrote:
> How persistent is the cache?
Karsten posted a link to the Git branch. Look for yourself.
> Will it live between reboots on Widows?
Assuming you refer to Windows, I am pretty certain that it does not.
> *** Please reply-to-all at all times ***
> *** (do not pretend to know who is subscribed and who is not) ***
>
> *** Please avoid top-posting. ***
Thank you so much,
Johannes
Seems I have spent too much time in the TortoiseGit land, and had assumed
that git command line behaved the same way.
Again thanks for the awesome code. Hope it makes its way into msysGit soon.
Albert
Current version is here: https://github.com/kblees/git/tree/kb/fscache-v1
Ciao,
Karsten
Can confirm your version is much faster. Mind providing a fscache-v2 installer too?
Yes, the preload_index function has an early return for repositories with
fewer than 1,000 files, which prevents the cache from being disabled. Thus after
checkout, git stores the cached lstat data (from before the checkout) in
the index...
I've pushed a fixed version to
https://github.com/kblees/git/tree/kb/fscache-v2
Haha, I love you, this project, and the tone on the list. The question was more directed to persons already using this. And thanks for the hints!
Already spoiled,
Rupert :)
You reckon correctly, and the reasoning is all in the mailing list
archives and commit messages.
Sorry to be so succinct.