`git status` speed on ntfs

Scott Graham

Dec 16, 2011, 6:04:25 PM
to msy...@googlegroups.com
Hi,

git status is pretty slow on my big repo (WebKit) on NTFS.

For reference on my pretty-fast machine on Windows, git status times:
- cold cache: 98s
- second run: 32s
- third+ runs: 16s

Identical spec machine on linux on the same repo:
- cold cache: 18s
- second run: 1.98s
- third+ runs: 1.8s

I don't love the 98s, but don't care that much. It's the hot cache runs that are more annoying.

I ran a quick xperf trace, and it does indeed look like nearly all of the runtime is spent in those file ops: roughly 45% in QueryInfo and 50% in Create+Close+Cleanup for most of the run, with some DirEnums in the last few seconds.

I found a few references to git status speed on the ML and have tried to understand the problems of FindFirstFile and GetFileInformationByHandle vs. having inode data available.

But, I haven't seen any reference to the USN Journal (aka Change Journal). Has anyone investigated that?

thanks,
scott

Joshua Jensen

Dec 17, 2011, 11:41:29 AM
to msy...@googlegroups.com
----- Original Message -----
From: Scott Graham
Date: 12/16/2011 4:04 PM
> git status is pretty slow on my big repo (WebKit) on NTFS.
>
> For reference on my pretty-fast machine on Windows, git status times:
> - cold cache: 98s
> - second run: 32s
> - third+ runs: 16s
>
> I found a few references to git status speed on the ML and have tried
> to understand the problems of FindFirstFile and
> GetFileInformationByHandle vs. having inode data available.
I work in a big tree, and my scan speeds are nowhere near as slow.

I am also using the most recent version of msysGit. What are you using?

-Josh

Erik Faye-Lund

Dec 17, 2011, 11:47:17 AM
to Joshua Jensen, msy...@googlegroups.com

FWIW: I'm getting somewhat similar:
- cold cache: 1m55.383s
- second run: 0m20.960s
- third run: 0m20.882s

But my machine is not at all that fast...

Scott Graham

Dec 19, 2011, 3:43:25 PM
to Joshua Jensen, msy...@googlegroups.com
On Sat, Dec 17, 2011 at 8:41 AM, Joshua Jensen <jje...@workspacewhiz.com> wrote:

>> I found a few references to git status speed on the ML and have tried to understand the problems of FindFirstFile and GetFileInformationByHandle vs. having inode data available.
> I work in a big tree, and my scan speeds are nowhere near as slow.
>
> I am also using the most recent version of msysGit.  What are you using?

Thanks Josh.

I was using "git version 1.7.6.msysgit.0". I tried updating to "git version 1.7.8.msysgit.0", but times are the same. I haven't tried building head yet. Or rather I have, but the build failed and I haven't figured out the problem yet.

Could you quantify your big tree? I'm wondering if it's perhaps something specific like too many directories with few files, or too many files in directories, or something rather than overall size.

FWIW, the repo I'm referring to is git://git.webkit.org/WebKit.git, in case anyone cares to see if there's anything "wrong" with it. You can also grab http://git.chromium.org/external/WebKit_trimmed.git, which is the same repo with old history trimmed, so it might be a bit faster/smaller to grab. It's still ~1.5GiB though.

I'm also on Windows 7, and I found this in the issue tracker: http://code.google.com/p/msysgit/issues/detail?id=320. I'll investigate that next.

thanks,
scott

Scott Graham

Dec 19, 2011, 5:08:51 PM
to Joshua Jensen, msy...@googlegroups.com
On Mon, Dec 19, 2011 at 12:43 PM, Scott Graham <sco...@chromium.org> wrote:

> I'm also on Windows 7, and I found this in the issue tracker: http://code.google.com/p/msysgit/issues/detail?id=320. I'll investigate that next.


I'm _not_ seeing the errors reported there via ProcMon. The majority of the syscalls are simply Create/QueryNameInformationFile/Close, and they all succeed. Looking at the captured stack, those correspond to the call to GetFileAttributesEx.

So, I'm back to my original question I guess... unless other people are still seeing orders-of-magnitude better for the WebKit repo.

I guess no one has looked at using the USN, but does it seem reasonable for a patch? Roughly, it would look like keeping a hash of cached stat information (persisted across runs), and updating that based on the change journal on startup. This would mean only files that had been touched since last run would need to be stat'd, rather than the entire tree.

One complication might be the need to refresh the cache in the middle of a run, but I don't know enough about git internals to know if/when that would be necessary.
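Not anything git does today - just a sketch of the core Win32 primitive the idea relies on: reading every USN record newer than a previously saved USN. The hard-coded C: volume and the saved_usn bookkeeping are stand-ins for whatever a real patch would persist next to the stat cache, and error handling is omitted:

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

/* Print the name of every file changed since 'saved_usn' on C:.
   Reading the journal needs administrative rights. */
static void read_changes_since(USN saved_usn)
{
    HANDLE vol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                             FILE_SHARE_READ | FILE_SHARE_WRITE,
                             NULL, OPEN_EXISTING, 0, NULL);
    USN_JOURNAL_DATA jd;
    DWORD ret;
    DeviceIoControl(vol, FSCTL_QUERY_USN_JOURNAL, NULL, 0,
                    &jd, sizeof(jd), &ret, NULL);

    READ_USN_JOURNAL_DATA rd = { 0 };
    rd.StartUsn = saved_usn;
    rd.ReasonMask = 0xFFFFFFFF;            /* all change reasons */
    rd.UsnJournalID = jd.UsnJournalID;

    char buf[64 * 1024];
    while (DeviceIoControl(vol, FSCTL_READ_USN_JOURNAL, &rd, sizeof(rd),
                           buf, sizeof(buf), &ret, NULL)
           && ret > sizeof(USN)) {
        /* The output buffer is a USN (the next position to read from)
           followed by a packed array of USN_RECORDs. */
        USN_RECORD *rec = (USN_RECORD *)(buf + sizeof(USN));
        while ((char *)rec < buf + ret) {
            wprintf(L"changed: %.*ls\n",
                    (int)(rec->FileNameLength / sizeof(WCHAR)),
                    rec->FileName);        /* name only; the parent dir needs a separate lookup */
            rec = (USN_RECORD *)((char *)rec + rec->RecordLength);
        }
        rd.StartUsn = *(USN *)buf;
    }
    CloseHandle(vol);
}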

Also, is there a rescued copy of the wiki pages from kernel.org? Just looking for the basic instructions on set up/build/etc.

thanks,
scott

Joshua Jensen

Dec 19, 2011, 9:57:43 PM
to kusm...@gmail.com, msy...@googlegroups.com
----- Original Message -----
From: Erik Faye-Lund
Date: 12/17/2011 9:47 AM
>> ----- Original Message -----
>> From: Scott Graham
>> Date: 12/16/2011 4:04 PM
>>> git status is pretty slow on my big repo (WebKit) on NTFS.
>>>
>>> For reference on my pretty-fast machine on Windows, git status times:
>>> - cold cache: 98s
>>> - second run: 32s
>>> - third+ runs: 16s
>>>
> FWIW: I'm getting somewhat similar:
> - cold cache: 1m55.383s
> - second run: 0m20.960s
> - third run: 0m20.882s
>
> But my machine is not at all that fast...
>
I get the following on a Sony Core i7 laptop with new Seagate Momentus
XT drive (the brand new one):

- cold cache: 1m06.08s
- second run: 0m06.64s
- third run: 0m06.63s

I run Windows 7 and do not have UAC on.

-Josh

Joshua Jensen

Dec 19, 2011, 11:45:27 PM
to Scott Graham, kusm...@gmail.com, msy...@googlegroups.com
----- Original Message -----
From: Scott Graham
Date: 12/19/2011 9:34 PM
> On Mon, Dec 19, 2011 at 6:57 PM, Joshua Jensen <jje...@workspacewhiz.com> wrote:
>> I get the following on a Sony Core i7 laptop with new Seagate Momentus XT drive (the brand new one):
>>
>> - cold cache: 1m06.08s
>> - second run: 0m06.64s
>> - third run: 0m06.63s
>
> Hmm, well failing an algorithmic improvement, that seems like a decent improvement. Perhaps I should just go HDD shopping.
Let me give you some additional things to think about:

My work computer is much faster than my laptop.  The obliteration of all artifacts of our asset build takes over 4 minutes, and the obliteration of the artifacts is mostly an 'rm -rf' process.  If I run the free MyDefrag Data Disk Monthly on the drive, the obliteration of the artifacts takes right around **10 seconds**.

On my home Core i7 laptop, I cloned the WebKit code onto a defragmented partition of the Momentus XT (2 or 750 gb or whatever they're calling it).  Theoretically, the layout is very compact with no fragmentation.  Quickly eyeballing it in MyDefrag seems to confirm that.  My 'git status' may have run over the close equivalent of a 'defragmented' drive.

A Lua mailing list posting the other day talked about GetFileInformationByHandleEx() [1].  I believe you made mention of something similar.

I do not have UAC on.  Is yours off?

-Josh

[1] https://gist.github.com/1487388

Scott Graham

Dec 20, 2011, 12:07:09 AM
to Joshua Jensen, kusm...@gmail.com, msy...@googlegroups.com
On Mon, Dec 19, 2011 at 8:45 PM, Joshua Jensen <jje...@workspacewhiz.com> wrote:

> Let me give you some additional things to think about:
>
> My work computer is much faster than my laptop.  The obliteration of all artifacts of our asset build takes over 4 minutes, and the obliteration of the artifacts is mostly an 'rm -rf' process.  If I run the free MyDefrag Data Disk Monthly on the drive, the obliteration of the artifacts takes right around **10 seconds**.
>
> On my home Core i7 laptop, I cloned the WebKit code onto a defragmented partition of the Momentus XT (2 or 750 gb or whatever they're calling it).  Theoretically, the layout is very compact with no fragmentation.  Quickly eyeballing it in MyDefrag seems to confirm that.  My 'git status' may have run over the close equivalent of a 'defragmented' drive.

I do defrag nightly, but only using the builtin Windows functionality. I'll give that 3rd party util a try. (Incidentally, for removing large game build directories, I used to do "ren build build_del" and then run a cron to "garbage collect" it at night. :)
 
> A Lua mailing list posting the other day talked about GetFileInformationByHandleEx() [1].  I believe you made mention of something similar.

Yes, I saw it mentioned in the archives of this group too [2] (or at least moving to a directory-centric retrieval, rather than per-file). The impression there is that it would be an intrusive change. Anyway, it does seem like a better first thing to try, compared with my thought of adding a new cache. That gist looks like a further improvement to the idea.
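For reference, the directory-centric retrieval that gist talks about looks roughly like this in plain C - a sketch only, not anything in git today, and GetFileInformationByHandleEx with FileIdBothDirectoryInfo needs Vista or later:

#include <windows.h>
#include <stdio.h>

/* Enumerate one directory through a single handle, getting size and
   mtime for every entry at once instead of one GetFileAttributesEx
   call per file.  "." and ".." show up as entries too. */
static void scan_dir_by_handle(const wchar_t *path)
{
    HANDLE dir = CreateFileW(path, FILE_LIST_DIRECTORY,
                             FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                             NULL, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, NULL);
    if (dir == INVALID_HANDLE_VALUE)
        return;

    char buf[64 * 1024];
    FILE_INFO_BY_HANDLE_CLASS cls = FileIdBothDirectoryRestartInfo;
    while (GetFileInformationByHandleEx(dir, cls, buf, sizeof(buf))) {
        FILE_ID_BOTH_DIR_INFO *e = (FILE_ID_BOTH_DIR_INFO *)buf;
        for (;;) {
            /* e->LastWriteTime and e->EndOfFile are most of what a
               stat() emulation for 'git status' needs. */
            wprintf(L"%.*ls  %lld bytes\n",
                    (int)(e->FileNameLength / sizeof(WCHAR)), e->FileName,
                    (long long)e->EndOfFile.QuadPart);
            if (!e->NextEntryOffset)
                break;
            e = (FILE_ID_BOTH_DIR_INFO *)((char *)e + e->NextEntryOffset);
        }
        cls = FileIdBothDirectoryInfo;     /* continue rather than restart */
    }
    CloseHandle(dir);
}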

 
> I do not have UAC on.  Is yours off?

Unfortunately, I can't disable it at work (regardless of amount of pleading/whinging), though I am running everything as Admin. I'll try some tests on a machine where I can turn it on/off to see if I see a difference.

Someone else suggested that I should check/replace the RAID controller in the machine, as some are known to be crap. That seems like a plausible cause also.

thanks

Albert

Dec 21, 2011, 3:26:03 PM
to msy...@googlegroups.com
Using the USN Journal to pick up changes would be rather exciting, and make git on NTFS probably almost as fast as on linux.

The problem is that the USN code is pretty "hard-core" Windows-specific code, there is no guarantee the journal actually exists, and, knowing MS, you need to be an administrator to create it.

Does git support file system "extensions"? I found no way of getting to the USN 'transparently', so there would need to be some Windows-specific code (even though NTFS technically runs on other platforms, USN access seems to be through the Windows APIs), with a fallback to direct drive access if the USN Journal does not exist.

Albert

Dec 29, 2011, 3:57:07 AM
to msy...@googlegroups.com
Hi All,

Hope everybody is having a good Christmas/New Year break.

I had some time over the past two days and decided to do some hacking.

I've now written a little application I call "git-usnchanges". It's an app that monitors the USN Journal on NTFS for changes in a git directory.

The code is pretty hacky at the moment... but it performs quite well. To get the changes in a git repository, on WebKit:

git status -s takes ~36 seconds.
The app I've written takes ~0.4 seconds.

Sounds great at the moment. But the issue I have is that I have no knowledge of the git internals and have no idea how/if it could be useful. At the moment, the app produces a list of paths that have been deleted/modified/added/renamed; git would need to take this list and do something useful with it...

I've posted in here: https://github.com/pro-logic/git-usnchanges

It's by no means complete, and I think I can get even more performance out of it.

If somebody who knows how/if this can be integrated with git is interested, drop me a line, otherwise I'll just let that work hang there.

Albert

Dec 29, 2011, 10:56:26 PM
to msy...@googlegroups.com
I should also say: when I get some time (hopefully soon), I'll make an unmanaged C++ wrapper for my code. At the moment it's a managed C# app. I wrote it in .NET 2.0 since that's the oldest supported framework and should be on all Windows machines by now. I don't expect that to be a problem given that this is, after all, a Windows version of git, and this is NTFS-specific code that requires Windows.

Hopefully once I have some C++ wrappers, somebody who's interested can help with the integration work :)

Sebastian Schuberth

Dec 31, 2011, 5:01:37 AM
to msy...@googlegroups.com, Albert
On 30.12.2011 04:56, Albert wrote:

> Hopefully once I have some c++ wrappers, somebody who's interested can
> help with the integration work :)

Your USN for NTFS stuff looks quite interesting to me, thanks for your
efforts so far. Please note, however, that in order to get your changes
integrated into Git for Windows, they should be written in plain C, not
C++. This is because upstream Git is (mostly) written in plain C, and
we're aiming to get all of our changes contributed back upstream, with
only a few conditional compiles for Windows vs. Linux (or Mac).

--
Sebastian Schuberth

Albert

Dec 31, 2011, 5:16:26 PM
to msy...@googlegroups.com, Albert
Hi Sebastian,

Unfortunately C++ and C are not part of my skill set - only C# is - which is why my code is written in C#.

It's also the reason I made my code a console app, so that somebody who knows their C could parse its output to 'inform' git of the paths that changed in a directory. After all, the code will only work on Windows regardless of what it's programmed in, since the USN stuff needs Windows APIs.

If I ever have enough spare time I can try learning to program C, but I don't see that happening any time soon.

When I get some time later today (or in the next few days), I will consolidate all the resources and knowledge I acquired in writing the code so if somebody is interested they can implement the code in C.

Hope everybody has a Happy New Year and welcome to 2012 from Australia :)

Albert

Paul Betts

Dec 31, 2011, 7:26:45 PM
to msy...@googlegroups.com, Albert
While the perf benefits of the USN journal seem pretty awesome, my main
concern is that you don't know if the USN journal is "thorough" - that is, if
you're using USN to implement git status, there's a possibility that there
have been so many unrelated changes since the last time you looked at it, that
they have run off the end of the USN Journal (since it has a fixed size).

This could be worked around, but it'd mean that you'd need to cache the last
git status result somewhere, then use its mtime to determine whether you
should "trust" the USN journal.

--
Paul Betts <pa...@paulbetts.org>

Albert

Dec 31, 2011, 9:02:29 PM
to msy...@googlegroups.com, Albert
Hi Paul,

There are actually many levels of 'sanity checks' that need to run before you can access and 'trust' the journal.

In my current implementation I store the current USN Journal ID and current USN number in a file in the .git directory whenever a directory is considered "clean", e.g. after a commit or reset --hard (git-usnchanges clean).

When somebody does a 'status' (git-usnchanges status), the application does the following:
1. Check if the thread is running as administrator - if the thread is not running as administrator we can't even read the journal, so the application terminates. I suspect most developers run their machines without UAC and in admin mode anyway.
2. Check the USN Journal ID and see if it matches the one stored from the previous run. If the Journal IDs don't match, it means the journal was deleted/recreated and the journal is not trustworthy.
3. Check that the last-seen USN is still in the journal. If the journal's lowest available USN is greater than the last-seen USN, the journal can't be trusted, as not all changes since the last run have been recorded in it. (Checks 2 and 3 are sketched in code below.)

Once the above are checked we can actually get to USN processing.
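Very roughly, checks 2 and 3 map to the Win32 API like this (take it as a sketch rather than tested C; the saved journal ID and USN would come from the file in .git mentioned above):

#include <windows.h>
#include <winioctl.h>

/* Returns 1 if an incremental scan from saved_usn is safe,
   0 if we must fall back to a full scan. */
static int journal_trustworthy(HANDLE vol, DWORDLONG saved_journal_id,
                               USN saved_usn)
{
    USN_JOURNAL_DATA jd;
    DWORD ret;
    if (!DeviceIoControl(vol, FSCTL_QUERY_USN_JOURNAL, NULL, 0,
                         &jd, sizeof(jd), &ret, NULL))
        return 0;                    /* no journal (or no rights) */
    if (jd.UsnJournalID != saved_journal_id)
        return 0;                    /* journal was deleted/recreated */
    if (jd.FirstUsn > saved_usn)
        return 0;                    /* older records already purged */
    return 1;
}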

The USN Journal records *all* file changes on a drive (even across reboots), so there can be quite a few entries in the journal depending on how much time passes between the working copy being 'clean' and 'status' being run. On my machine it takes about ~1 second to process one day's worth of changes on the C drive (mostly spent getting information from the file system about the directories in which files are stored). My 2MB journal stores about 2 days' worth of changes.

At the moment I'm working on discarding 'useless' data from the journal history (such as temp files being created and deleted between running 'clean' and 'status'). I can also get pretty good 'rename' information from the journal.

Albert

Albert

Jan 2, 2012, 3:07:51 AM
to msy...@googlegroups.com, Albert
This post is for anybody who's interested in implementing the NTFS USN code in C, so that it can be accepted upstream into git as suggested/required by Sebastian. As I said a few posts back, it'll probably be a while before I get a chance to learn enough C to re-implement the working code I have, so this post might be a bit of a 'knowledge dump' and/or 'rambling', and of interest to few.

History

The USN Journal, sometimes called the Change Journal, was introduced in Windows 2000, and as the name suggests its role is to record changes. It is part of the NTFS file system, and it records changes to files and folders on an NTFS partition. If the journal is activated, it is guaranteed to record every file and folder change on the drive.

Requirements

In order to make use of the Journal, the thread accessing it must be running with administrative privileges, the drive must be NTFS, and the Windows version must be at least 5.0 (Windows 2000), the release that introduced the journal.

Required reading:

If you want to start doing anything with the journal, you really have to read [1] and [2]. The articles are from 1999, but as far as I can tell they are the only solid 'documentation' of how the journal works, and they contain everything you must do in order to get information out of it; there isn't much point in me repeating what's already in these well-written articles. A good utility for having a quick look at the USN Journal is documented in [3].

Thoughts

This section is mostly going to be my ramblings and thoughts. You might find something of use in here.

In order to be sure that the journal you are reading is 'valid', you need to know the ID the journal had the last time you read it; if the IDs match, you know it's the same journal. You then need to check that the last journal entry you were keeping an eye on is still in the journal: look at the lowest entry still available, and if it is lower than (or equal to) your last entry you're in luck and the journal is valid; if it is higher, entries you needed have already been purged.

One of the biggest performance issues I had getting the C# code to perform well was the calls to the native file system and moving data between managed and unmanaged memory. With some smart coding I reduced the number of conversions to a bare minimum, to the point of it being a non-issue; it wouldn't really be an issue when coding in pure C anyway.

The next biggest performance issue experienced by the code is the need for a directory database. As a USN record stores only the filename and the parent folder ID, you need to get the folder path from a different API. This involves opening the folder by its ID and getting its path from the resulting handle, which makes the process extremely IO bound. In C# I try to minimise this by using a Dictionary (hashtable) to store the outcome of the query, so that if that folder ID ever comes up again, you can get information about it without generating any IO. Obviously I need to know the path to tell whether a file is in a git repository.
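That folder-ID-to-path lookup is roughly the following (Vista+; take it as a sketch rather than tested C - a real version would cache the result per folder ID exactly as described above):

#include <windows.h>

/* Resolve the parent directory of a USN record to a full path.
   'vol' is an open handle to the volume the journal came from. */
static void parent_path(HANDLE vol, DWORDLONG parent_frn,
                        wchar_t *out, DWORD out_chars)
{
    FILE_ID_DESCRIPTOR id = { 0 };
    HANDLE dir;

    id.dwSize = sizeof(id);
    id.Type = FileIdType;
    id.FileId.QuadPart = (LONGLONG)parent_frn;

    dir = OpenFileById(vol, &id, 0,
                       FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                       NULL, FILE_FLAG_BACKUP_SEMANTICS);
    if (dir == INVALID_HANDLE_VALUE) {
        out[0] = 0;
        return;
    }
    /* Note: the result usually carries a \\?\ prefix. */
    GetFinalPathNameByHandleW(dir, out, out_chars, FILE_NAME_NORMALIZED);
    CloseHandle(dir);
}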

Thanks/Credits

Firstly I’d like to thank “StCroixSkipper”, who seems to be one of the few people on the internet who talks about and posts in forums about the USN Journal. It’s his code that forms the foundation of my code, and without it I suspect I would still be yelling at my computer in frustration.

I’d also like to thank everybody who works on msysGit (and git in general) as it’s an awesome SCM.

 References

[1] http://www.microsoft.com/msj/0999/journal/journal.aspx
[2] http://www.microsoft.com/msj/1099/journal2/journal2.aspx
[3] http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/fsutil_usn.mspx?mfr=true

Some more stuff on my applet:

So I've currently written a C# application that queries the USN Journal to provide a list of changes in a directory, it's located here: https://github.com/pro-logic/git-usnchanges

The application currently reports Additions, Deletions, Modifications and Renames. For example, the output of the status command at the moment looks like this:

git-usnchanges status
A "newfile.txt"
MR "ChangeLog" -> "ChangeLog2"
D "CMakeLists.txt"

As you can probably guess, the A is Addition, the MR is Modification and Rename, and D is deletion.

Runtimes:
For my application, using nothing more than my 'gut' feeling and some repeated runs, the runtimes are fairly stable. A 'cold' run is about 1.5s, and a warm/hot run is about 0.5s. This is a massive improvement on some LARGE repositories: for the WebKit repo linked to earlier (~160,000 files), the best 'hot' cache stats I got were ~30 seconds. On small repositories, on the other hand, this is a disadvantage; on the small repo I have for my C# app, a git status operation takes 0.07 seconds, so 0.5 seconds is far slower than that. I haven't actually worked out a 'sweet spot' for when the USN is faster than a normal 'status', but from a quick run on my local computer it seems that once a repo has more than 3,000 files the USN approach becomes faster.

My ideas on 'integration'
Before I found out the code won't be used until it's in C, I was hoping that somebody smarter than me who knows the git plumbing would do something along the following lines: when git checkout/reset --hard finishes, it calls 'git-usnchanges clean' to mark the working copy as 'clean'; this creates and records the current USN number. The next time somebody requests a git status, git calls 'git-usnchanges status', parses the output, and treats that as the list of stuff that's 'changed'.
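Purely as an illustration of the 'parse the output' half (again a sketch, not tested code) - nothing like this exists in git, and the one-letter codes are just the A/MR/D lines shown above; renames carry a second quoted path that this ignores:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[4096];
    FILE *p = _popen("git-usnchanges status", "r");
    if (!p)
        return 1;
    while (fgets(line, sizeof(line), p)) {
        char *q1 = strchr(line, '"');            /* first quoted path */
        char *q2 = q1 ? strchr(q1 + 1, '"') : NULL;
        if (!q1 || !q2)
            continue;
        *q2 = 0;
        printf("dirty path: %s (flags %.*s)\n",
               q1 + 1, (int)(q1 - line), line);
    }
    _pclose(p);
    return 0;
}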

I hope some of the above makes sense :) if I come up with anything interesting on this topic I'll post here. If people have questions I'll be happy to try to answer them.

Albert

petrkodl

Jan 3, 2012, 10:31:18 AM
to msysGit
>
> For reference on my pretty-fast machine on Windows, git status times:
> - cold cache: 98s
> - second run: 32s
> - third+ runs: 16s

A couple of notes related to NTFS that might help.

1) make sure to disable the last access time update - fsutil or the registry
can be used to achieve that (see the example commands below)
http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/fsutil_behavior.mspx?mfr=true

2) Enable the
[core]
preloadindex = true
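
Concretely (from memory - double-check the fsutil syntax on your Windows version):

fsutil behavior query disablelastaccess
fsutil behavior set disablelastaccess 1

git config core.preloadindex true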

If the status still seems slow, could you try loading the repository
into Mercurial and see whether the status there is faster? I worked on
a Mercurial patch for NTFS a while ago that avoids accessing individual
file status information by using what's already provided in the info
from FindFirst/FindNext. It cuts the status time from O(N files) to
O(N directories), but I never had a need to try something similar in
Git - it seems fast enough for the trees I work with. The speedup for
Mercurial was substantial, but it depends on the tree structure.
Installing Mercurial is a pretty cheap way to test whether that
approach would be viable.
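For anyone curious, the FindFirst/FindNext idea looks something like this in plain C - a sketch only, using the WIN32_FIND_DATA timestamps and sizes in place of per-file GetFileAttributesEx calls, which is not how git is structured today:

#include <windows.h>
#include <stdio.h>

/* One FindFirstFile/FindNextFile pass yields name, size and mtime for
   every entry in the directory, i.e. O(directories) metadata queries
   instead of O(files). */
static void scan_dir(const wchar_t *dir)
{
    wchar_t pattern[MAX_PATH];
    WIN32_FIND_DATAW fd;
    HANDLE h;

    swprintf(pattern, MAX_PATH, L"%s\\*", dir);
    h = FindFirstFileW(pattern, &fd);
    if (h == INVALID_HANDLE_VALUE)
        return;
    do {
        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
            continue;                /* subdirectories would be recursed separately */
        ULONGLONG size = ((ULONGLONG)fd.nFileSizeHigh << 32) | fd.nFileSizeLow;
        /* fd.ftLastWriteTime plus 'size' stand in for the per-file stat. */
        wprintf(L"%ls  %llu bytes\n", fd.cFileName, (unsigned long long)size);
    } while (FindNextFileW(h, &fd));
    FindClose(h);
}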

Petr

Scott Graham

Jan 4, 2012, 12:40:43 PM
to petrkodl, msysGit
Hi Petr,

I already had disablelastaccess set (I believe it's the default these days).

preloadindex=true helped quite a bit though! Hot cache went from 15-16s down to 5-6s which is much more pleasant.

I also tried hg on a working copy of the repo. I'm not sure if it's an entirely equivalent test as there's no history in the hg repo, and I'm not sure how "hg status" might differ in what exactly it does compared with "git status". 

But... hg also takes 5-6s for a hot cache on my machine.

WebKit does have relatively large numbers of files in its directories, so based on your description, switching to FindFirst/Next might still be able to improve the git case further.

Thanks!

Petr Kodl

Jan 4, 2012, 12:48:54 PM
to Scott Graham, msysGit
History should not matter for status - it is just comparing the current index against the filesystem, and Mercurial does pretty much the same. Just curious: how many CPUs does your machine have? Since in your case Git is now running the comparison in parallel while Mercurial is serial, that could provide a hint as to how much speedup could be achieved by aggregating the status calls based on the pre-cached directory information.

Petr

Joshua Jensen

Jan 4, 2012, 2:42:14 PM
to Petr Kodl, Scott Graham, msysGit
----- Original Message -----
From: Petr Kodl
Date: 1/4/2012 10:48 AM
> The history should not matter in status - it is just comparing the
> current index vs filesystem - mercurial is doing pretty much the same.
> Just curious - how many CPUs does your machine have - since in your
> case Git is now running the comparison in parallel while Mercurial is
> serial it could provide a hint how much speedup could be achieved by
> aggregating the status calls based on the pre-cached directory
> information.
Oooh, now that's interesting. Is there an option to make Git run single
threaded for this part? In general, I find that multithreaded file I/O
does not help on the same hard drive...

-Josh

Scott Graham

Jan 4, 2012, 3:15:34 PM
to Joshua Jensen, Petr Kodl, msysGit
I was wondering that too, but doesn't look like it: https://github.com/git/git/blob/master/preload-index.c
