On Fri, 20 Feb 2009, Andrew Arnott wrote:
> I'm hoping there's something I can do to optimize this. I'm using git and
> love it for small projects. I just tried to switch to git on a very large
> project (348,000+ files, 21GB in total) and it took about 18 hours just to git
> init, git add, git commit, and git gc the initial checkin of the whole repo.
> Day 2: try "git status" to see how long that takes. It's still running
> and it's been over 10 minutes.
Ouch. I guess it is the slow stat(), combined with the fact that it has
to read everything into memory.
Maybe it gets faster if you set "core.ignoreStat" to true? The only
downside is that "git status" does not report modified files anymore,
AFAIR.
Ciao,
Dscho
348000 is not large, it's ridiculously large, and certainly something git is not optimized for. I
think you'd need to adopt a different workflow to survive, regardless of SCM. You
probably don't want to run git status very often. The ability to run commit without
invoking git status seems necessary, and it's not there.
Things like git diff would probably take a very very long time too (would with any SCM I think).
Using fast-import for the initial load (somehow) would probably be a better way of doing the
initial add. It would create a pack that is usually pretty good from the start. I'm
not sure what front-end to use for fast-import in this case. It is usually run for
conversion from other SCMs.
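For what it's worth, a front-end for a plain directory tree does not have to be
big. Below is a rough sketch in Python, not something I have run against a
repository of this size. The function name write_fast_import_stream and the
committer identity are made up for illustration. It assumes every file fits in
memory, records every file as mode 100644, skips symlinks, and does not quote
unusual path names (ones starting with a double quote or containing a newline).

```python
import os
import time


def write_fast_import_stream(root, out, ref="refs/heads/master",
                             message="Initial import\n"):
    """Write a single-commit fast-import stream for the tree under root.

    `out` must be a binary stream. Returns the number of files written.
    Hypothetical helper, not part of Git.
    """
    msg = message.encode("utf-8")
    out.write(b"commit " + ref.encode("ascii") + b"\n")
    # Placeholder identity (an assumption); substitute a real one.
    out.write(b"committer Importer <importer@example.invalid> %d +0000\n"
              % int(time.time()))
    out.write(b"data %d\n" % len(msg))
    out.write(msg)

    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into an existing repository directory.
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # simplification: regular files only
            # fast-import expects forward slashes relative to the tree root.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as f:
                data = f.read()
            # Inline file data: no loose object is written for the file.
            out.write(b"M 100644 inline " + rel.encode("utf-8") + b"\n")
            out.write(b"data %d\n" % len(data))
            out.write(data)
            out.write(b"\n")
            count += 1
    return count
```

You would call it with sys.stdout.buffer as `out` and pipe the result into
"git fast-import" inside an initialized repository. As far as I know
fast-import only writes objects and refs, so the index and working tree
still have to be brought in line afterwards.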
As for git status, I think there are lots of things to do to work faster on Windows,
though you could never get it as fast as Linux. 348000 would be slow on Linux too, btw.
-- robin
I guess the bottleneck here is that the initial 'git add' creates loose
objects, and this kills performance on Windows.
Second, a pack created from this initial commit won't be much more than
a gzipped tar ball of the files because commonly very few deltas will be
found between the initial files.
-- Hannes
Just for my curiosity (and Git comparison purposes): What are you
"switching" from? Another SCM? Which one?
-bcd
stat() performance is certainly a large part of it, which is why there's
an option to disable it. msysgit has an optimized-for-windows stat()
implementation, which basically expands to a single GetFileAttributesExA
call plus the necessary structure rearranging to get just the
information Git needs into a stat_buf. But that system call is still
something like an order of magnitude slower than stat() and friends
are on Linux. If there's a faster way to get file status information
than that I'm sure the mingw Git people would love to know about it.
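To make the cost concrete: Git keeps stat data for each tracked file in its
index and compares it against a fresh lstat() to decide whether a file may
have changed, so a status check costs at least one such call per file. The
Python sketch below is only a toy model of that idea, not Git's code; the
names snapshot and changed_paths are made up, and it compares just mtime and
size where Git looks at more fields.

```python
import os


def snapshot(root):
    """Record (mtime_ns, size) for every non-directory entry under root."""
    cache = {}
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # skip repository metadata
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # one system call per file
            cache[path] = (st.st_mtime_ns, st.st_size)
    return cache


def changed_paths(cache):
    """Return cached paths whose stat data differs or that have vanished."""
    changed = []
    for path, old in cache.items():
        try:
            st = os.lstat(path)  # again one call per file, on every check
        except FileNotFoundError:
            changed.append(path)
            continue
        if (st.st_mtime_ns, st.st_size) != old:
            changed.append(path)
    return sorted(changed)
```

Wrapping changed_paths() in a timer on a large tree should give a rough lower
bound for what any stat-based status check costs on that machine.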
I haven't really looked at Git code, mingw or otherwise, for several
months though. (Stupid real job.)
For the code, see:
http://repo.or.cz/w/git/mingw.git?a=blob;f=compat/win32.h;h=c26384e595b4f23d5fef938b1136cc3a85469e56;hb=HEAD
http://repo.or.cz/w/git/mingw.git?a=blob;f=compat/mingw.c;h=3dbe6a77ffa4675b19e7183fd49e11212cb2cda0;hb=HEAD
http://repo.or.cz/w/git/mingw.git?a=blob;f=compat/mingw.h;h=a25589880130f2232aaf626cddcd739ac80dd378;hb=HEAD
-bcd
Caveat: This is what makes my repo slow, with its 30,000-or-so files.
A hot-cache "git status" takes about 3s on Windows, where the same repo
takes like .25s on Linux, which is where I got my order-of-magnitude
measure above. I have no idea what is killing your massive repository,
but I suspect the initial add creating a ton of loose objects has
something to do with it as Robin mentioned earlier.
If at all possible I would try to do your massive import on a Linux
machine, and see what performance you see there. You may be hitting
non-Windows-specific Git performance issues here.
-bcd
Given some definition of "easily"?
- How fast is TF's equivalent of "git status"?
- How fast is the initial import into TF of this many files?
- How fast is a branch switch?
-- robin
As the original author of the stat() replacement for Git, I was
certainly aware of the FindFirstFile/... calls containing enough info.
However, as you've mentioned, I was afraid of the paradigm change
having too much impact on the original Git code, and this was before
Windows support was merged into the Git mainline. So, I wanted the
impact to be as low as possible, to ensure that Windows patch series
would be accepted.
After the stat() change and the speedup we saw, I had my doubts how
much quicker it would be to change to FFF/FNF/FC versus the amount of
change, so I let it slip. However, with the stats you've shown in this
mail regarding Mercurial, and its speedup by using FFF/FNF/FC, I see
it should definitely be a research subject for us going forward.
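To illustrate the paradigm change in a few lines: instead of asking for the
status of each path separately, you enumerate a directory once and take the
size and timestamps from the directory entries. The sketch below uses
Python's os.scandir, which is documented to be built on
FindFirstFile/FindNextFile on Windows and to serve entry.stat() for
non-symlinks from the enumeration data there; on other systems it still does
a per-file call. The function name scan_tree is made up, and this is only an
illustration, not a proposal for how the Git code should look.

```python
import os


def scan_tree(root):
    """Return {path: (mtime_ns, size)} gathered by directory enumeration."""
    result = {}
    stack = [root]
    while stack:
        current = stack.pop()
        # One enumeration per directory instead of one lookup per file.
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    if entry.name != ".git":
                        stack.append(entry.path)
                else:
                    # On Windows this normally reuses the data returned by
                    # the enumeration; elsewhere it performs an lstat().
                    st = entry.stat(follow_symlinks=False)
                    result[entry.path] = (st.st_mtime_ns, st.st_size)
    return result
```

Comparing the wall-clock time of scan_tree() with a plain per-file lstat()
loop over the same tree would be a cheap first experiment before touching
any C code.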
Thanks for your input!
--
.marius