Measures to take regarding large binary files in a Git repository


fabian....@gmail.com

Aug 25, 2014, 11:13:50 AM
to msy...@googlegroups.com
Hi,

I initially posted this to the "Git for human beings" group (https://groups.google.com/d/topic/git-users/5yjQgGD-J_s/discussion), but was told that my questions were too technically deep for that list; therefore, I'm trying again here.

(Scenario at the top, concrete questions below.)

I'm in the process of migrating an SVN repository to Git, including history, using "git svn". The SVN repository currently contains a number of large (data) files:
- five binary files of 100 MB to 400 MB, with up to 17 revisions,
- eleven binary files in the range of 10 MB to 50 MB, with up to 10 revisions,
- the largest non-binary file has a size of 31 MB.

The large binary files are part of the history, so I can't easily get rid of them. (Rewriting history to move the largest files out of the repository into a dedicated "large file store" could be an option, though, if absolutely necessary.)

Git has somewhat of a bad reputation regarding large binary files, so I've done some research. I found this recent thread: https://groups.google.com/d/msg/git-users/EIGoSe1eIYc/dL8voHjF4RUJ, but I haven't been able to derive concrete measures from it yet, so I've decided to ask here myself. The clients use Git for Windows, which is compiled for x86 (a 32-bit process). My research suggests that 32-bit Git can apparently run into out-of-memory errors when diffing, compressing, or packing files that are too large to fit (twice) into the limited memory a 32-bit process can address.

As I understand it, the diffing problem shouldn't affect me because a) binary files don't need to be diffed anyway and b) even loading a 400 MB file into memory twice for diffing is not going to be a problem in a 32 bit process.

The compression problem can be tackled by setting core.bigFileThreshold to something smaller than the default 256 MB, because Git won't try to compress files larger than this. This, however, has the disadvantage that, for example, the 17 revisions of a 150 MB file would amount to about 2.5 GB of data, even though the revisions could otherwise have been compressed.

The packing problem can be tackled by setting pack.packSizeLimit to something small enough to limit the maximum pack size. (The default in Git for Windows is 2g.) pack.windowMemory and other pack options seem to play a role as well, although I don't understand exactly how. However, these pack settings do not affect the size of pack files created for pushing and pulling. Therefore, pushing and pulling might remain a problem.
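
For reference, the kind of settings I'm talking about would be applied roughly like this (the values below are placeholders only, not recommendations):

$ git config core.bigFileThreshold 64m
$ git config pack.packSizeLimit 2g
$ git config pack.windowMemory 256m
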
So, that's the result of my research so far. Now for my questions:

1. Have I got everything right in my analysis above? Am I missing anything important, any problems I should expect?
2. Would you recommend setting core.bigFileThreshold, pack.packSizeLimit or other options to non-default values proactively on all clients, or should I rather postpone this until (if ever) we actually experience problems? If I don't set these values proactively, is there a chance that the Git repository could be ruined?
-- What is a good value for core.bigFileThreshold, given my concrete binary files of 10 to 400 MB, some of which have up to 17 revisions?
-- What is a good value for pack.packSizeLimit? Git for Windows defaults it to 2g, is there any reason not to leave it at that?
3. Since pack.packSizeLimit does not affect the packs created for pulling and pushing - what problems can I expect there? How could I tackle them?
4. "git repack -afd" and "git gc" currently fail with an out of memory error on the migrated repository [1][2]. Should I worry about this?
-- I can make "git repack -afd" work by passing "--window-memory 750m" to the command (exact invocation shown below); after that, "git gc" works fine again. Again, is setting pack.windowMemory to 750m something I should do proactively?
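
The invocation that works for me is something like:

$ git repack -afd --window-memory=750m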

Thanks, best regards,
Fabian

[1] $ git repack -afd
Counting objects: 189121, done.
Delta compression using up to 8 threads.
warning: suboptimal pack - out of memory
fatal: Out of memory, malloc failed (tried to allocate 331852630 bytes)

[2] $ git gc
Counting objects: 189121, done.
Delta compression using up to 8 threads.
fatal: Out of memory, malloc failed (tried to allocate 73267908 bytes)
error: failed to run repack

Thomas Braun

Aug 25, 2014, 5:11:33 PM
to fabian....@gmail.com, msy...@googlegroups.com
Hi Fabian,

I wouldn't call it a bad reputation; git was simply designed for small files.
The situation is getting better, but especially on Windows it is still not too good.

I regularly store files of up to 1 GB in git repositories, so your sizes are
definitely possible.

> As I understand it, the diffing problem shouldn't affect me because a)
> binary files don't need to be diffed anyway and b) even loading a 400 MB
> file into memory twice for diffing is not going to be a problem in a 32 bit
> process.

Although binary files are not diffed, they are still loaded into memory to
generate the stat info.
You can see that effect if you commit a large file with and without the -q
switch; with -q the commit finishes much more quickly.
There is currently a patch in submission on the git mailing list to correct
that, see http://www.spinics.net/lists/git/msg232637.html.
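
A quick way to see the difference yourself (assuming a bash shell such as Git Bash; the file and commit message are just examples):

$ time git commit -m "add big data file"      # without -q: the large file is read to generate the stat summary
$ time git commit -q -m "add big data file"   # with -q: skips that step, so it finishes noticeably faster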

> The compression problem can be tackled by setting core.bigFileThreshold to
> something smaller than the default 256 MB because Git won't try to compress
> files larger than this. This, however, does have the disadvantage that, for
> example, the 17 revisions of a 150 MB file would amount to about 2.5 GB of
> data even if the revisions could be compressed.

Does diskspace still matter that much?

> The packing problem can be tackled by setting pack.packSizeLimit to
> something small enough to limit the maximum pack size. (The default in Git
> for Windows is 2g.) pack.windowMemory and other pack options seem to play a
> role as well, although I don't understand exactly how. However, these pack
> settings do not affect the size of pack files created for pushing and
> pulling. Therefore, pushing and pulling might remain a problem.
> So, that's the result of my research so far.
>
> Now for my questions:
> 1. Have I got everything right in my analysis above? Am I missing anything
> important, any problems I should expect?

On the remote server from which I push and pull (git x64 on Debian wheezy),
I have the following settings:

[core]
packedGitLimit = 512m
packedGitWindowSize = 512m
bigFileThreshold = 256m

[pack]
deltaCacheSize = 256m
windowMemory = 256m

These settings ensure that if I clone the repository from a 32-bit Windows
machine, I can decompress the packs.
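
If you prefer the command line over editing the config file directly, the same settings would be applied with something like this (run on the server inside the repository, or with --global):

$ git config core.packedGitLimit 512m
$ git config core.packedGitWindowSize 512m
$ git config core.bigFileThreshold 256m
$ git config pack.deltaCacheSize 256m
$ git config pack.windowMemory 256m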

Additionally, in the repository itself, I use the .gitattributes feature of
disabling delta compression for certain file suffixes. For example, the
following line in .gitattributes always turns delta compression off for
files with the *.zip suffix:
*.zip delta=false

This wastes disk space but can greatly improve commit speeds.
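
Extended to a few more suffixes, such a .gitattributes file might look like this (the suffixes are only placeholders for whatever your large data files actually use):

*.zip   delta=false
*.7z    delta=false
*.dat   delta=false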

> 2. Would you recommend setting core.bigFileThreshold, pack.packSizeLimit or
> other options to non-default values proactively on all clients, or should I
> rather postpone this until (if ever) we're experiencing problems? If I
> don't set these values proactively, is there a chance that the Git
> repository could be ruined?

I would not change the clients' settings; instead, tweak the server settings
and use the .gitattributes delta=false feature.

And no, you can't ruin the repository. As noted above, the worst case is
that you cannot clone from Windows.
> -- What is a good value for core.bigFileThreshold, given my concrete binary
> files of 10 to 400 MB, some of which have up to 17 revisions?
That's a disk space vs. memory/CPU time question. I'd go for disk space and
use core.bigFileThreshold=32MB.
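
Applied in your repository, that would be for example:

$ git config core.bigFileThreshold 32m
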
> -- What is a good value for pack.packSizeLimit? Git for Windows defaults it
> to 2g, is there any reason not to leave it at that?
> 3. Since pack.packSizeLimit does not affect the packs created for pulling
> and pushing - what problems can I expect there? How could I tackle them?
> 4. "git repack -afd" and "git gc" currently fail with an out of memory
> error on the migrated repository [1][2]. Should I worry about this?

Yes. The repository should be fsck-able on the client side as well;
otherwise you are really limited.

> -- I can make "git repack -afd" work by passing "--window-memory 750m" to
> the command. After that, git gc works fine again. Again, is setting
> pack.windowMemory to 750m something I should do proactively?
Can you try with my suggested bigfileThreshold value? In that case you
shouldn't have to adjust --window-memory.

> [1] $ git repack -afd
> Counting objects: 189121, done.
> Delta compression using up to 8 threads.
> warning: suboptimal pack - out of memory
> fatal: Out of memory, malloc failed (tried to allocate 331852630 bytes)

You do know that the memory settings are per thread?

> [2] $ git gc
> Counting objects: 189121, done.
> Delta compression using up to 8 threads.
> fatal: Out of memory, malloc failed (tried to allocate 73267908 bytes)
> error: failed to run repack
>

Hope that helps,
Thomas

Fabian Schmied

Aug 26, 2014, 3:52:18 AM
to msysGit
Hi Thomas,

thanks a lot for your detailed answer.

> [ setting core.bigFileThreshold to prevent compression would cause a waste of disk space ]
>
> Does diskspace still matter that much?

Indeed, it doesn't matter that much, but I wanted to know whether wasting it has a real benefit before doing so :)
(It does, see below.)

> On the remote server from which I push and pull (git x64 on Debian wheezy), I have the following settings:
>
> [core]
> packedGitLimit = 512m
> packedGitWindowSize = 512m
> bigFileThreshold = 256m
>
> [pack]
> deltaCacheSize = 256m
> windowMemory   = 256m
>
> These settings ensure that if I clone the repository from a 32-bit Windows machine, I can decompress the packs.

Thank you, I'll adopt that strategy. (It's an Atlassian Stash server, but I guess the settings are fine anyway. I'll have to check out what version of git they use exactly.) I wasn't sure how or whether these settings affect packs created for pushing and pulling, but apparently they do.

> Additionally, in the repository itself, I use the .gitattributes feature of disabling delta compression for certain file suffixes. For example, the following line in .gitattributes always turns delta compression off for files with the *.zip suffix:
> *.zip delta=false
>
> This wastes disk space but can greatly improve commit speeds.

I guess I could simply do this for the same extensions I mark as "binary" anyway. However, will this have any effect on files greater than core.bigFileThreshold?

> [ making git repack work by passing "--window-memory 750m". Should do proactively? ]
>
> Can you try with my suggested bigfileThreshold value? In that case you shouldn't have to adjust --window-memory.
 
I just did, and indeed it works nicely now; git's working set stays < 885 MB for "git repack -afd" that way. The growth in repository size is very acceptable (400 MB). Besides disk space vs. memory, is there any adverse effect that bigFileThreshold could have?

> You do know that the memory settings are per thread?

I didn't, but I do now.

Cheers,
Fabian

fabian....@gmail.com

Aug 26, 2014, 4:00:39 AM
to msy...@googlegroups.com, fabian....@gmail.com
On Tuesday, August 26, 2014 9:52:18 AM UTC+2, Fabian Schmied wrote:
> [snip]
>
> Thank you, I'll adopt that strategy. (It's an Atlassian Stash server, but I
> guess the settings are fine anyway. I'll have to check out what version of
> git they use exactly.) I wasn't sure how or whether these settings affect
> packs created for pushing and pulling, but apparently they do.

Addendum: The Stash server also uses Git for Windows (32 bits).

I guess that means your suggestion for the server configuration doesn't make as much sense there and should be replaced by the same configuration as on the clients. That is, core.bigFileThreshold should be set to 32m, but it shouldn't be necessary to set the other memory limits. Right or wrong?

Konstantin Khomoutov

Aug 26, 2014, 4:44:59 AM
to Thomas Braun, fabian....@gmail.com, msy...@googlegroups.com
On Mon, 25 Aug 2014 23:11:27 +0200
Thomas Braun <thomas...@virtuell-zuhause.de> wrote:

[...]
> > The compression problem can be tackled by setting
> > core.bigFileThreshold to something smaller than the default 256 MB
> > because Git won't try to compress files larger than this. This,
> > however, does have the disadvantage that, for example, the 17
> > revisions of a 150 MB file would amount to about 2.5 GB of data
> > even if the revisions could be compressed.
>
> Does diskspace still matter that much?

I wonder if this affects initial clones as well: that is, does this
mean that revisions of files which were not delta-compressed are sent
over the wire "as is" as well as stored?

Thomas Braun

Aug 26, 2014, 9:19:58 AM
to Fabian Schmied, msysGit
On Tue, 26 Aug 2014 at 09:51, Fabian Schmied wrote:

Hi Fabian,

> I guess I could simply do this for the same extensions I mark as "binary"
> anyway. However, will this have any effect on files greater than
> core.bigFileThreshold?

No, files bigger than core.bigFileThreshold are always marked as binary,
regardless of their extension.


> [ making git repack work by passing "--window-memory 750m". Should do
> proactively? ]
>
>> Can you try with my suggested bigfileThreshold value? In that case you
>> shouldn't have to adjust --window-memory.
>
> I just did, and indeed it works nicely now; git's working set stays < 885
> MB for "git repack -afd" that way. The growth in repository size is very
> acceptable (400 MB). Besides disk space vs. memory, is there any adverse
> effect that bigFileThreshold could have?

Nothing I know of.

Thomas

Thomas Braun

Aug 26, 2014, 9:37:33 AM
to Konstantin Khomoutov, fabian....@gmail.com, msy...@googlegroups.com
Sorry, I can't follow you here.
AFAIK git creates packs for loose objects before they are sent over the
wire. The properties (delta yes/no, pack sizes, etc.) are determined by the
side that does the packing.

So if I clone from my Linux box with default settings, I might get packs
which I can't uncompress/de-deltify on Windows.
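
So if the Linux side has to serve 32-bit Windows clients, one option is to repack there with explicit limits before they clone, for example (a sketch that mirrors the config values I posted earlier):

$ git repack -adf --window-memory=256m --max-pack-size=1g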

Thomas

fabian....@gmail.com

Aug 26, 2014, 11:21:47 AM
to msy...@googlegroups.com, fabian....@gmail.com
[snip]

>> On the remote server from which I push and pull (git x64 on Debian wheezy),
>> I have the following settings:
>>
>> [core]
>> packedGitLimit = 512m
>> packedGitWindowSize = 512m
>> bigFileThreshold = 256m
>>
>> [pack]
>> deltaCacheSize = 256m
>> windowMemory   = 256m
>>
>> These settings ensure that if I clone the repository from a 32-bit Windows
>> machine, I can decompress the packs.
>
> Thank you, I'll adopt that strategy. (It's an Atlassian Stash server, but I
> guess the settings are fine anyway. I'll have to check out what version of
> git they use exactly.) I wasn't sure how or whether these settings affect
> packs created for pushing and pulling, but apparently they do.

Since I've now determined that my remote server also uses Git for Windows, I assume that it's probably better to set core.bigFileThreshold on the server as well and omit the pack.windowMemory etc. you provided. Consistency between server and clients should ensure that packs created by one side are readable by the other. Did I get this right?
 
>> Additionally, in the repository itself, I use the .gitattributes feature
>> of disabling delta compression for certain file suffixes. For example, the
>> following line in .gitattributes always turns delta compression off for
>> files with the *.zip suffix:
>> *.zip delta=false
>>
>> This wastes disk space but can greatly improve commit speeds.
>
> I guess I could simply do this for the same extensions I mark as "binary"
> anyway. However, will this have any effect on files greater than
> core.bigFileThreshold?

> No, files bigger than core.bigFileThreshold are always marked as binary,
> regardless of their extension.

And also as "-delta", I assume, since this seems to be the main purpose of core.bigFileThreshold. So, a .gitattributes file with "-delta" for some extensions will improve performance only for files with that extension that are smaller than bigFileThreshold, right?

[snip]

Thanks again,
Fabian

Thomas Braun

Aug 26, 2014, 5:04:28 PM
to fabian....@gmail.com, msy...@googlegroups.com
On 26.08.2014 at 17:21, fabian....@gmail.com wrote:
[snip]

> Since I've now determined that my remote server also uses Git for Windows,
> I assume that it's probably better to set core.bigFileThreshold on the
> server as well and omit the pack.windowMemory etc. you provided.
> Consistency between server and clients should ensure that packs created by
> one side are readable by the other. Did I get this right?

Yes.

>>>> Additionally, in the repository itself, I use the .gitattributes feature
>>>> of disabling delta compression for certain file suffixes. For example, the
>>>> following line in .gitattributes always turns delta compression off for
>>>> files with the *.zip suffix:
>>>> *.zip delta=false
>>>>
>>>> This wastes disk space but can greatly improve commit speeds.
>>>
>>> I guess I could simply do this for the same extensions I mark as "binary"
>>> anyway. However, will this have any effect on files greater than
>>> core.bigFileThreshold?
>>
>> No, files bigger than core.bigFileThreshold are always marked as binary,
>> regardless of their extension.
>>
>
> And also as "-delta", I assume, since this seems to be the main purpose of
> core.bigFileThreshold. So, a .gitattributes file with "-delta" for some
> extensions will improve performance only for files with that extension that
> are smaller than bigFileThreshold, right?

I'm not 100% sure about the interaction of the two. You can give
git-check-attr a try, see
https://www.kernel.org/pub/software/scm/git/docs/git-check-attr.html.
It tells you whether a file has delta=false set or not.
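
For example (the path is hypothetical; with the *.zip delta=false line from above in effect, the output should look roughly like this):

$ git check-attr delta -- data/huge-archive.zip
data/huge-archive.zip: delta: false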

Thomas