Git unusably slow for extremely large repos?


Andrew Arnott

Feb 20, 2009, 11:31:09 AM
to msy...@googlegroups.com
I'm hoping there's something I can do to optimize this.  I'm using git and love it for small projects.  I just tried to switch to git on a very large project (348,000+ files, 21GB in total) and it took about 18 hours just to git init, git add, git commit, and git gc the initial checkin of the whole repo.  Day 2: try "git status" to see how long that takes.  It's still running and it's been over 10 minutes.  

Was my mistake adding the entire repo in just one commit? Would it help if I started over and added it in chunks?  If so, what size of chunk is edible?  Or is this repo just too large for Git to handle? (that would be a shame)

BTW, this is on a Windows box using msysgit 1.6.1.9.g97c34. 

Thanks in advance!

--
Andrew Arnott
"I [may] not agree with what you have to say, but I'll defend to the death your right to say it." - Voltaire

Johannes Schindelin

Feb 20, 2009, 11:40:50 AM
to Andrew Arnott, msy...@googlegroups.com
Hi,

On Fri, 20 Feb 2009, Andrew Arnott wrote:

> I'm hoping there's something I can do to optimize this. I'm using git and
> love it for small projects. I just tried to switch to git on a very large
> project (348,000+ files, 21GB in total) and it took about 18 hours just to git
> init, git add, git commit, and git gc the initial checkin of the whole repo.
> Day 2: try "git status" to see how long that takes. It's still running
> and it's been over 10 minutes.

Ouch. I guess it is the slow stat(), combined with the fact that it has
to read everything into memory.

Maybe it gets faster if you set "core.ignoreStat" to true? The only
downside is that "git status" does not report modified files anymore,
AFAIR.

Ciao,
Dscho

Andrew Arnott

Feb 20, 2009, 11:58:03 AM
to Johannes Schindelin, msy...@googlegroups.com
Thanks for the quick reply, Johannes.  I just tried it:

git config core.ignoreStat true

But git status is still spinning for what seems like forever (after a minute or more I just hit Ctrl+C, because that's long enough to be unusable, so I don't know whether the above command actually made it any faster).

Is there anything else you can think of, or did I not execute the right command?
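A quick, self-contained way to confirm the setting actually took effect (sketched in a throwaway repo; the temp-repo setup is just for illustration):

```shell
# Set up a disposable repo so the check is reproducible anywhere
repo=$(mktemp -d) && cd "$repo" && git init -q
git config core.ignoreStat true   # the command from above
git config --get core.ignoreStat  # prints "true" if the setting was recorded
```

If `--get` prints nothing, the value was never written to .git/config and the earlier command was run outside the repository.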

--
Andrew Arnott
"I [may] not agree with what you have to say, but I'll defend to the death your right to say it." - Voltaire


Robin Rosenberg

Feb 20, 2009, 12:55:56 PM
to msy...@googlegroups.com, andrew...@gmail.com
fredag 20 februari 2009 17:31:09 skrev Andrew Arnott <andrew...@gmail.com>:
> I'm hoping there's something I can do to optimize this. I'm using git and
> love it for small projects. I just tried to switch to git on a very large
> project (348,000+ files, 21GB in total) and it took about 18 hours just to git
> init, git add, git commit, and git gc the initial checkin of the whole repo.
> Day 2: try "git status" to see how long that takes. It's still running and
> it's been over 10 minutes.
> Was my mistake adding the entire repo in just one commit? Would it help if I
> started over and added it in chunks? If so, what size of chunk is edible?
> Or is this repo just too large for Git to handle? (that would be a shame)
>
> BTW, this is on a Windows box using msysgit 1.6.1.9.g97c34.

348,000 is not large, it's ridiculously large, and certainly something git is not optimized for. I
think you'd need to adopt a different workflow to survive, regardless of SCM. You
probably don't want to run git status very often. The ability to run commit without
invoking git status seems necessary, and it's not there.

Things like git diff would probably take a very, very long time too (it would with any SCM, I think).

Using fast-import for the initial load (somehow) would probably be a better way of doing the
initial add. It would create a pack, which is usually pretty good from the start. I'm
not sure what front-end to use for fast-import in this case; it is usually run for
conversions from other SCMs.
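To make the fast-import idea concrete, here is a minimal sketch of the stream format (the file name, committer identity, and message are made up; a real front-end would walk the 348,000 files and emit one `M ... inline` entry plus `data` block per file):

```shell
# Disposable repo; fast-import writes objects straight into a pack,
# bypassing loose-object creation entirely.
repo=$(mktemp -d) && cd "$repo" && git init -q
git fast-import --quiet <<'EOF'
commit refs/heads/master
committer Importer <importer@example.com> 1235142669 +0000
data 14
Initial import
M 100644 inline hello.txt
data 6
hello
EOF
git reset -q --hard master   # materialize the working tree from the pack
```

Each `data <n>` line is followed by exactly n bytes, so a front-end only needs file sizes and contents, not any SCM-specific knowledge.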

As for git status, I think there are lots of things that could be done to make it faster on Windows,
though you could never get it as fast as on Linux. 348,000 files would be slow on Linux too, btw.

-- robin

Johannes Sixt

Feb 20, 2009, 1:12:10 PM
to robin.r...@gmail.com, msy...@googlegroups.com, andrew...@gmail.com
Robin Rosenberg schrieb:

> fredag 20 februari 2009 17:31:09 skrev Andrew Arnott <andrew...@gmail.com>:
>> I'm hoping there's something I can do to optimize this. I'm using git and
>> love it for small projects. I just tried to switch to git on a very large
>> project (348,000+ files, 21GB in total) and it took about 18 hours just to git
>> init, git add, git commit, and git gc the initial checkin of the whole repo.
>> Day 2: try "git status" to see how long that takes. It's still running and
>> it's been over 10 minutes.
>> Was my mistake adding the entire repo in just one commit? Would it help if I
>> started over and added it in chunks? If so, what size of chunk is edible?
>> Or is this repo just too large for Git to handle? (that would be a shame)
>>
>> BTW, this is on a Windows box using msysgit 1.6.1.9.g97c34.
>
> 348000 is not large, it's ridiculously large, and certainly something git is not optimized for.

I guess the bottleneck here is that the initial 'git add' creates loose
objects, and this kills performance on Windows.

Second, a pack created from this initial commit won't be much more than
a gzipped tar ball of the files because commonly very few deltas will be
found between the initial files.
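The loose-object effect is easy to observe with git count-objects (a toy-repo sketch; `git repack -a -d` is essentially the packing step that git gc performs):

```shell
# Disposable repo to show loose vs. packed object counts
repo=$(mktemp -d) && cd "$repo" && git init -q
echo demo > file.txt
git add file.txt                       # "git add" writes loose objects
git -c user.name=Demo -c user.email=demo@example.com commit -qm "import"
git count-objects -v | grep '^count'   # nonzero: objects are still loose
git repack -a -d -q                    # consolidate everything into one pack
git count-objects -v | grep '^count'   # count: 0 -- loose objects are gone
```

For 348,000 files that first stage means 348,000+ tiny files under .git/objects, which is exactly the pattern Windows file systems handle worst.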

-- Hannes

Brian Downing

Feb 20, 2009, 1:13:18 PM
to Andrew Arnott, msy...@googlegroups.com
On Fri, Feb 20, 2009 at 08:31:09AM -0800, Andrew Arnott wrote:
> I'm hoping there's something I can do to optimize this. I'm using git and
> love it for small projects. I just tried to switch to git on a very large
> project (348,000+ files, 21GB in total) and it took about 18 hours just to git
> init, git add, git commit, and git gc the initial checkin of the whole repo.
> Day 2: try "git status" to see how long that takes. It's still running and
> it's been over 10 minutes.

Just for my curiosity (and Git comparison purposes): What are you
"switching" from? Another SCM? Which one?

-bcd

Andrew Arnott

Feb 20, 2009, 1:43:37 PM
to Brian Downing, msy...@googlegroups.com
TFS, which by the way handles this and hundreds of times this many files quite easily, apparently.  My problem with TFS, which is driving me to git, is that TFS is server-centric and I want something I can use offline.  Also, I really like git's ability to track file changes without "tf add, tf rename, tf edit, tf delete", etc. etc.  

--
Andrew Arnott
"I [may] not agree with what you have to say, but I'll defend to the death your right to say it." - Voltaire


Andrew Arnott

Feb 20, 2009, 1:46:18 PM
to Robin Rosenberg, msy...@googlegroups.com
Can anyone tell me exactly what about Windows makes git so much slower than on Linux?  I have heard many times that git is faster on Linux, but I'm really curious as to why.  My first guess is that Windows isn't inherently slower than Linux, but that git relies on Linux file system calls whose Windows equivalents are slow, and that if git had been written for Windows to begin with, there might be Windows APIs available with significantly better performance than the ones currently being called.

In short, I hope that whatever perf bottleneck git has on Windows can be solved by changing git to use the faster APIs.  And if real, unavoidable perf problems on Windows exist, what are they?  I'd love to send an email to my contacts at Windows and get them to fix these problems. :)

--
Andrew Arnott
"I [may] not agree with what you have to say, but I'll defend to the death your right to say it." - Voltaire


Andrew Arnott

Feb 20, 2009, 3:40:37 PM
to msysgit
Was it my mention of TFS, or my asking how to make Windows better, that seems to have stopped all activity on this previously very responsive thread?

--
Andrew Arnott
"I [may] not agree with what you have to say, but I'll defend to the death your right to say it." - Voltaire


Brian Downing

Feb 20, 2009, 4:03:52 PM
to Andrew Arnott, Robin Rosenberg, msy...@googlegroups.com
On Fri, Feb 20, 2009 at 10:46:18AM -0800, Andrew Arnott wrote:
> Can anyone tell me exactly what about Windows makes git so much slower than
> on Linux? I have heard many times that git is faster on Linux, but I'm
> really curious as to why... my first guess is that Windows isn't slower than
> Linux, but Linux has file system calls which have slow Windows equivalents,
> but that if git were written on Windows to begin with perhaps APIs are
> available on Windows that have significantly better performance than the
> ones currently being called.
> In short, I hope that whatever perf bottleneck git has on Windows can be
> solved by changing git to use the faster APIs. And if real unavoidable perf
> problems on Windows exists, what are they? I'd love to send an email to my
> contacts at Windows and get them to fix these problems. :)

stat() performance is certainly a large part of it, which is why there's
an option to disable it. msysgit has an optimized-for-Windows stat()
implementation, which basically expands to a single GetFileAttributesExA
call plus the necessary structure rearranging to get just the
information Git needs into a stat_buf. But that system call is still
something like an order of magnitude slower than stat() and friends
are on Linux. If there's a faster way to get file status information
than that, I'm sure the mingw Git people would love to know about it.

I haven't really looked at Git code, mingw or otherwise, for several
months though. (Stupid real job.)

For the code, see:

http://repo.or.cz/w/git/mingw.git?a=blob;f=compat/win32.h;h=c26384e595b4f23d5fef938b1136cc3a85469e56;hb=HEAD
http://repo.or.cz/w/git/mingw.git?a=blob;f=compat/mingw.c;h=3dbe6a77ffa4675b19e7183fd49e11212cb2cda0;hb=HEAD
http://repo.or.cz/w/git/mingw.git?a=blob;f=compat/mingw.h;h=a25589880130f2232aaf626cddcd739ac80dd378;hb=HEAD

-bcd

Brian Downing

Feb 20, 2009, 4:10:21 PM
to Andrew Arnott, Robin Rosenberg, msy...@googlegroups.com
On Fri, Feb 20, 2009 at 03:03:52PM -0600, Brian Downing wrote:
> stat() performance is certainly a large part of it, which is why there's
> an option to disable it. msysgit has an optimized-for-Windows stat()
> implementation, which basically expands to a single GetFileAttributesExA
> call plus the necessary structure rearranging to get just the
> information Git needs into a stat_buf. But that system call is still
> something like an order of magnitude slower than stat() and friends
> are on Linux. If there's a faster way to get file status information
> than that, I'm sure the mingw Git people would love to know about it.
>
> I haven't really looked at Git code, mingw or otherwise, for several
> months though. (Stupid real job.)
>
> For the code, see:
>
> http://repo.or.cz/w/git/mingw.git?a=blob;f=compat/win32.h;h=c26384e595b4f23d5fef938b1136cc3a85469e56;hb=HEAD
> http://repo.or.cz/w/git/mingw.git?a=blob;f=compat/mingw.c;h=3dbe6a77ffa4675b19e7183fd49e11212cb2cda0;hb=HEAD
> http://repo.or.cz/w/git/mingw.git?a=blob;f=compat/mingw.h;h=a25589880130f2232aaf626cddcd739ac80dd378;hb=HEAD

Caveat: This is what makes my repo slow, with its 30,000-or-so files.
A hot-cache "git status" takes about 3s on Windows, where the same repo
takes like .25s on Linux, which is where I got my order-of-magnitude
measure above. I have no idea what is killing your massive repository,
but I suspect the initial add creating a ton of loose objects has
something to do with it, as Hannes mentioned earlier.

If at all possible, I would try to do your massive import on a Linux
machine and see what performance you get there. You may be hitting
non-Windows-specific Git performance issues here.

-bcd

Robin Rosenberg

Feb 20, 2009, 4:19:54 PM
to msy...@googlegroups.com, andrew...@gmail.com, Brian Downing
fredag 20 februari 2009 19:43:37 skrev Andrew Arnott <andrew...@gmail.com>:
> TFS, which by the way handles this and hundreds of times this many files
> quite easily, apparently. My problem with TFS, which is driving me to git,

Given some definition of "easily"?

- How fast is TF's equivalent of "git status"?
- How fast is the initial import into TF of this many files?
- How fast is a branch switch?

-- robin

Andrew Arnott

Feb 20, 2009, 4:26:25 PM
to Robin Rosenberg, msy...@googlegroups.com, Brian Downing
Great questions, Robin.  The reason TFS does it "easily" is because it chooses not to do the hard operations, I suspect.  For instance to answer each of your questions:

- How fast is TF's equivalent of "git status"?

There is "tf status", but it just queries the server for a list of "pending changes" that I have previously run "tf add, tf edit, or tf delete" on.  So you can imagine this is O(n), where 'n' is only the number of changes I've made.  TFS doesn't have to review all source files to see what timestamps have been updated.

- How fast is the initial import into TF of this many files?

A long time, but reasonably close to network speed (since with TFS everything is over the network).  But whether it's git or TFS that's slow to import this many files, I see this as less significant since it's a one-time operation.

- How fast is a branch switch?

No such thing, really.  With TFS, you have entirely different working copies for each branch you work with.  Technically, you can change all your directory mappings to switch your branch, but it's painstakingly tedious and not a common scenario.  I love being able to branch switch in git and do it all the time.  When I say it is not a common scenario, that's only because of the TFS limitations that make it difficult -- so people don't commonly do it. :)

--
Andrew Arnott
"I [may] not agree with what you have to say, but I'll defend to the death your right to say it." - Voltaire


cerm...@gmail.com

Feb 22, 2009, 7:40:20 AM
to msysGit
> stat() performance is certainly a large part of it, which is why there's
> an option to disable it.  msysgit has an optimized-for-windows stat()
> implementation, which basically expands to a single GetFileAttributesExA
> call plus the neccessary structure rearranging to get just the
> information Git needs into a stat_buf.  But that system call is still
> something like an order of magnitude slower than stat() and friends are
> is on Linux.  If there's a faster way to get file status information
> than that I'm sure the mingw Git people would love to know about it.

There certainly is, but it requires a slight paradigm change. The stat
information on Windows is fully available from the directory iteration
calls (FindFirstFile/FindNextFile/FindClose).
Both Mercurial and Bazaar take advantage of it. I wrote the Mercurial
implementation, and for larger repos Hg on Windows is now quite a bit
faster than Git for status and diff.
(440k files / 1100 directories - Hg st ~.4s, Git status ~1.6s, Hg
without the optimization 1.8s)

I took a quick look at Git, and it would be possible to do something in
the recent release and plug it into the pthread(ed) cache preloader.
The individual stat call does not become any faster, but the number of
calls is reduced from O(n_files) to O(n_dirs), which in a reasonably
shaped tree means a significant reduction.

pk

Marius Storm-Olsen

Feb 23, 2009, 2:14:08 AM
to cerm...@gmail.com, msysGit
cerm...@gmail.com said the following on 22.02.2009 13:40:

As the original author of the stat() replacement for Git, I was
certainly aware of the FindFirstFile/... calls containing enough info.
However, as you've mentioned, I was afraid of the paradigm change
having too much impact on the original Git code, and this was before
Windows support was merged into the Git mainline. So I wanted the
impact to be as low as possible, to ensure that the Windows patch
series would be accepted.

After the stat() change and the speedup we saw, I had my doubts about
how much quicker it would be to change to FFF/FNF/FC versus the amount
of change required, so I let it slip. However, with the stats you've
shown in this mail regarding Mercurial, and its speedup from using
FFF/FNF/FC, I see it should definitely be a research subject for us
going forward.

Thanks for your input!

--
.marius
