Big files in Mercurial


cowwoc

Jun 3, 2010, 1:00:25 AM
to merc...@selenic.com

Hi,

I've run across old threads discussing the memory overhead associated with
committing large files (i.e. you need 30MB of memory to commit a 10MB file).
Could someone please clarify whether this is a one-time cost associated with
the commit/checkout operation or do you end up paying this when processing
any revision following the one that committed the large file?

That is, are you "doomed" once you commit a large file to Hg or do you just
pay the price when explicitly referencing the file / change-set in question?

In closing, I have a use-case question. Our team is discouraged from
committing any files into the repository that are generated by the compiler.
So for example, users would commit foo.java, but not foo.class. On the other
hand, we encourage our developers to commit their project files and
3rd-party dependencies (foo.jar) into the repository. Is this a problem
seeing how Mercurial is more sensitive to large files than say SVN? What do
you recommend we do instead? I'm not a fan of storing large files
externally.

Thanks,
Gili
--
View this message in context: http://mercurial.808500.n3.nabble.com/Big-files-in-Mercurial-tp866717p866717.html
Sent from the General mailing list archive at Nabble.com.
_______________________________________________
Mercurial mailing list
Merc...@selenic.com
http://selenic.com/mailman/listinfo/mercurial

Mads Kiilerich

Jun 3, 2010, 4:44:10 AM
to cowwoc, merc...@selenic.com
On 06/03/2010 07:00 AM, cowwoc wrote:
>
> Hi,
>
> I've run across old threads discussing the memory overhead associated with
> committing large files (i.e. you need 30MB of memory to commit a 10MB file).
> Could someone please clarify whether this is a one-time cost associated with
> the commit/checkout operation or do you end up paying this when processing
> any revision following the one that committed the large file?
>
> That is, are you "doomed" once you commit a large file to Hg or do you just
> pay the price when explicitly referencing the file / change-set in question?

Mercurial history is immutable and partial cloning is (currently) not
possible, so you will have to carry the weight of the big files
indefinitely.

Mercurial uses the revlog format for storing file history as a chain of
compressed deltas, and an old big file revision will thus in principle
have to be processed for all future revisions. It will however store the
full content instead of a delta whenever that is more efficient, and it
will thus probably never store deltas to big binary files.
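The snapshot-versus-delta decision can be sketched roughly like this (a simplification for illustration, not Mercurial's actual code; the 2x threshold mirrors revlog's rule of restarting the chain once the deltas cost more to reconstruct than a snapshot would):

```python
# Rough sketch (not Mercurial's actual code) of revlog's choice between
# appending a delta and storing a full snapshot: once the accumulated
# compressed deltas plus the new delta would exceed about twice the size
# of the new fulltext, a snapshot is cheaper, so the chain is restarted.
def should_store_fulltext(chain_bytes: int, delta_bytes: int,
                          fulltext_bytes: int) -> bool:
    return chain_bytes + delta_bytes > 2 * fulltext_bytes
```

For a big binary file, each revision's delta tends to be nearly as large as the file itself, so this rule falls back to full snapshots almost immediately, as described above.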

Mercurial (currently) keeps the full file content in memory when working
with a file revision, so it will mostly be a problem when committing,
but probably also for example when updating to a revision with that file
or comparing with it.

> In closing, I have a use-case question. Our team is discouraged from
> committing any files into the repository that is generated by the compiler.
> So for example, users would commit foo.java, but not foo.class. On the other
> hand, we encourage our developers to commit their project files and
> 3rd-party dependencies (foo.jar) into the repository. Is this a problem
> seeing how Mercurial is more sensitive to large files than say SVN? What do
> you recommend we do instead? I'm not a fan of storing large files
> externally.

Binary files in a source control system aren't optimal, but it might work
for you. I would prefer to keep source code and binaries in different
repositories, and perhaps also make a split between own source code and
external source code. See http://mercurial.selenic.com/wiki/subrepos .

The best way to track individual large files seems to be
http://mercurial.selenic.com/wiki/BfilesExtension .

/Mads

Martin Geisler

Jun 3, 2010, 8:06:24 AM
to cowwoc, merc...@selenic.com, Jan Sørensen
cowwoc <cow...@bbs.darktech.org> writes:

> Hi,
>
> I've run across old threads discussing the memory overhead associated
> with committing large files (i.e. you need 30MB of memory to commit a
> 10MB file). Could someone please clarify whether this is a one-time
> cost associated with the commit/checkout operation or do you end up
> paying this when processing any revision following the one that
> committed the large file?
>
> That is, are you "doomed" once you commit a large file to Hg or do you
> just pay the price when explicitly referencing the file / change-set
> in question?

You only pay when you access the file -- that is when Mercurial needs to
compute a diff involving the file or when you check out a revision where
the big file has changed relative to the current revision.

Commands like 'hg log' remain as fast as ever. If you delete the big
file again, then you will only pay for the extra bandwidth used when
cloning and the extra storage taken up on disk.

> In closing, I have a use-case question. Our team is discouraged from
> committing any files into the repository that is generated by the
> compiler. So for example, users would commit foo.java, but not
> foo.class.

That is a good policy :-)

> On the other hand, we encourage our developers to commit their project
> files and 3rd-party dependencies (foo.jar) into the repository.

I would recommend that you use a tool like Maven to track these
dependencies. That way you will version a small POM file containing the
relevant version information, and then you will fetch the needed JARs
when building your project.
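As a sketch of what that looks like in practice, the committed POM carries only a few lines of XML per dependency instead of the jar itself (the coordinates below are illustrative):

```xml
<!-- Hypothetical pom.xml fragment: pin a third-party jar by version
     and let the build tool fetch it, instead of committing foo.jar. -->
<dependency>
  <groupId>xerces</groupId>
  <artifactId>xercesImpl</artifactId>
  <version>2.7.1</version>
</dependency>
```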

I think everybody can agree that this is the "right way", and with a
tool like Maven, there is hardly any excuse not to do it this way.

--
Martin Geisler

aragost Trifork
Professional Mercurial support
http://aragost.com/mercurial/

Chris Scott

Jun 3, 2010, 8:44:36 AM
to Martin Geisler, cowwoc, merc...@selenic.com, Jan Sørensen
>> On the other hand, we encourage our developers to commit their project
>> files and 3rd-party dependencies (foo.jar) into the repository.
>
> I would recommend that you use a tool like Maven to track these
> dependencies. That way you will version a small POM file containing the
> relevant version information, and then you will fetch the needed JARs
> when building your project.

I agree that revision control isn't the best solution for derived (i.e. compiled) files and other big binary dependencies. I'd recommend Apache Ivy over Maven because you don't have to buy into the "Maven way".

There are also a bunch of other benefits to moving to dependency management, like improving your build process. We used to have a big "3rd party" repo in ClearCase, but there were problems with referring to, for example, xalan.jar@r4 as opposed to specifying that your project needs xerces 2.7.1.

~Chris

cowwoc

Jun 3, 2010, 9:10:11 AM
to Martin Geisler, merc...@selenic.com, Jan Sørensen
On 03/06/2010 8:06 AM, Martin Geisler wrote:
>
> I would recommend that you use a tool like Maven to track these
> dependencies. That way you will version a small POM file containing the
> relevant version information, and then you will fetch the needed JARs
> when building your project.
>
> I think everybody can agree that this is the "right way", and with a
> tool like Maven, there is hardly any excuse not to do it this way.
>
>

I agree that a tool like this makes sense in principle. In
practice, I have a strong dislike for Maven because of its heavy use of
hard-to-read XML and cryptic error messages. I haven't tried Ivy yet.
Hopefully it's better on that front.

Gili

cowwoc

Jun 3, 2010, 9:10:55 AM
to Mads Kiilerich, merc...@selenic.com

I don't mind the cost when dealing directly with large files so
much as "inheriting" that cost for the rest of the repository's
lifetime. The latter seems rather unacceptable in light of the fact that
you might commit a large file by mistake, remove it, but continue to pay
the price nonetheless.

> Binary files in a source control system isn't optimal, but it might
> work for you. I would prefer to keep source code and binaries in
> different repositories, and perhaps also make a split between own
> source code and external source code. See
> http://mercurial.selenic.com/wiki/subrepos .
>
> The best way to track individual large files seems to be
> http://mercurial.selenic.com/wiki/BfilesExtension .

I don't mean to offend anyone, but this sounds like a cop-out on
the part of Mercurial. The entire point of the repository is to allow us
to jump to different points of history in our application's lifetime. By
placing big files outside the repository, we lose the consistency
guarantee between the source-code and its dependencies. If a customer
reports a problem in the field, I would be far more reluctant to trust
Mercurial to get me a snapshot of their environment because of this
limitation.

That being said, has anyone considered the following approach?

- Allow users to mark files as opaque (for large binary files or other
appropriate types)
- Mercurial will not attempt to DIFF such files and hopefully this also
means that it doesn't need to load such files into memory (you could
read them in a piecemeal fashion)

Gili

Peter Arrenbrecht

Jun 3, 2010, 9:23:55 AM
to cowwoc, merc...@selenic.com
On Thu, Jun 3, 2010 at 7:00 AM, cowwoc <cow...@bbs.darktech.org> wrote:
> I've run across old threads discussing the memory overhead associated with
> committing large files (i.e. you need 30MB of memory to commit a 10MB file).
> Could someone please clarify whether this is a one-time cost associated with
> the commit/checkout operation or do you end up paying this when processing
> any revision following the one that committed the large file?
>
> That is, are you "doomed" once you commit a large file to Hg or do you just
> pay the price when explicitly referencing the file / change-set in question?
>
> In closing, I have a use-case question. Our team is discouraged from
> committing any files into the repository that is generated by the compiler.
> So for example, users would commit foo.java, but not foo.class. On the other
> hand, we encourage our developers to commit their project files and
> 3rd-party dependencies (foo.jar) into the repository. Is this a problem
> seeing how Mercurial is more sensitive to large files than say SVN? What do
> you recommend we do instead? I'm not a fan of storing large files
> externally.

I version lib/ separately from the main project. You might even want
to use subrepos for this.
-parren

cowwoc

Jun 3, 2010, 9:30:26 AM
to peter.ar...@gmail.com, merc...@selenic.com

What's the benefit of using separate repositories for binaries?

Gili

Martin Geisler

Jun 3, 2010, 9:45:14 AM
to cowwoc, Mads Kiilerich, merc...@selenic.com
cowwoc <cow...@bbs.darktech.org> writes:

It is only the bandwidth and storage costs that you pay.

> The latter seems rather unacceptable in light of the fact that you
> might commit a large file by mistake, remove it, but continue to pay
> the price nonetheless.

Yes, that is unfortunate -- in Subversion the big revision lives on in
the repository on the server, in Mercurial it lives on in all clones.

Unless you fix your mistake. First of all, you can undo the commit. That
goes something like this:

hg add big.avi
# ignore warning about how big the file is...
hg commit -m 'big commit'
# oops!
hg rollback

That undoes the last transaction in your local repository, which you
know is the commit in this case. You can then unadd the file

hg revert big.avi

and get on with your job.

The key difference is really that 'hg commit' != 'svn commit' and so you
can change your mind locally before you push to others.

If you have pushed the big file and others have based work on the bad
changeset, then you need to do more work if you want to get rid of it.
You basically need to change history. That is impossible in Mercurial,
but you can add new history. So if you have

--- A --- B --- C

where B has the big file, A is the parent and C is the child, then you
can do 'hg clone -r A repo repo-new' to get 'repo-new' with

--- A

You then do 'hg export C -o C.patch' in 'repo' and 'hg import
../repo/C.patch' in 'repo-new'. That gives you

--- A --- C'

where C' contains the same change as C, but it has another hash value
since it has a different ancestor than C.

When we say that you cannot change history in Mercurial, this is really
what we mean: you can only create new history (changeset C') and you
will then have to get rid of the old history (changeset B and C)
somehow, for example via cloning.
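A toy model shows why C' necessarily gets a new hash: a changeset id mixes the parent's id with the change itself, so the same patch applied on a different parent yields a different id (this loosely mirrors how Mercurial derives node ids; the function below is illustrative, not the real algorithm):

```python
import hashlib

# Toy model: a changeset id depends on both the parent id and the change,
# so re-applying C's patch on parent A (instead of B) produces C' != C.
def node_id(parent_id: str, patch: str) -> str:
    return hashlib.sha1((parent_id + patch).encode()).hexdigest()

a = node_id("0" * 40, "initial import")   # changeset A
b = node_id(a, "add big.avi")             # changeset B (the mistake)
c = node_id(b, "fix bug")                 # original C, child of B
c_prime = node_id(a, "fix bug")           # same patch, parent A
assert c != c_prime
```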

>> The best way to track individual large files seems to be
>> http://mercurial.selenic.com/wiki/BfilesExtension .
>
> I don't mean to offend anyone, but this sounds like a cop-out on
> the part of Mercurial. The entire point of the repository is to allow
> us to jump to different points of history in our application's
> lifetime. By placing big files outside the repository, we lose the
> consistency guarantee between the source-code and its dependencies.

I see the bfiles extension as a way of combining the strengths of both
kinds of systems: you get the full log for all your files, and for some
you avoid retrieving a lot of data that you don't need.

It would be nicer to have shallow clones, though. In a shallow clone,
you only retrieve part of the history: if you only retrieve the latest
version you get something that corresponds to Subversion.

If the depth of the shallow clone could be adjusted on a file-by-file
basis, then we would have a really cool system where you could ask for
only a few revisions of large files, and all revisions for other files.

> If a customer reports a problem in the field, I would be far more
> reluctant to Mercurial to get me a snapshot of their environment
> because of this limitation.
>
> That being said, has anyone considered the following approach?
>
> - Allow users to mark files as opaque (for large binary files or other
> appropriate types)
> - Mercurial will not attempt to DIFF such files and hopefully this
> also means that it doesn't need to load such files into memory (you
> could read them in a piecemeal fashion)

Right, we could be better at loading only part of the files into memory.
But it is my impression that people are more worried about having to
carry around all the data all the time -- using a couple of hundred MB
too much RAM is not as bad as having to download a couple of GB of extra
data because you get all the full snapshots with every clone.

--
Martin Geisler

aragost Trifork
Professional Mercurial support
http://aragost.com/mercurial/

Greg Ward

Jun 3, 2010, 9:54:58 AM
to cowwoc, mercurial
On Thu, Jun 3, 2010 at 9:10 AM, cowwoc <cow...@bbs.darktech.org> wrote:
>    I don't mean to offend anyone, but this sounds like a cop-out on the part
> of Mercurial. The entire point of the repository is to allow us to jump to
> different points of history in our application's lifetime. By placing big
> files outside the repository, we lose the consistency guarantee between the
> source-code and its dependencies. If a customer reports a problem in the
> field, I would be far more reluctant to Mercurial to get me a snapshot of
> their environment because of this limitation.

Yeah, I hate the fact that people store large binary files in source
control. It's *source* control, not a file server! But so far I have
been unable to refute your argument, so I wrote bfiles. It seems to
work pretty well so far.

>    That being said, has anyone considered the following approach?
>
> - Allow users to mark files as opaque (for large binary files or other
> appropriate types)
> - Mercurial will not attempt to DIFF such files and hopefully this also
> means that it doesn't need to load such files into memory (you could read
> them in a piecemeal fashion)

That's pretty much what falls out of using bfiles. Specifically,
bfadd'ing a file (rather than adding it normally) has the following
consequences:

* the big file itself will not be included in clones, so its history
doesn't waste time/space/bandwidth; it's only downloaded to your
working dir if you explicitly ask for it (bfupdate or enable
auto-update mode)
* core Mercurial will never see the big file; it's only processed by
bfiles, which is careful about streaming files rather than sucking
them into memory (core Mercurial generally assumes that any file it
tracks is small enough to load into memory)
* thus, Mercurial will never try to diff or merge the big file (it
will diff/merge the 41-byte standin, which is useless apart from
telling you, "yes it changed")

On the downside: it doesn't work so well on Windows yet. I pushed a
patch yesterday that improves things, but I'm trying to make the tests
portable so I don't have to test manually. That takes time, so fixing
Windows-specific bugs is a slow process.

Also, it doesn't have any support for merging yet: you just merge the
41-byte standins, which of course always conflict when two people have
changed the same big file. It's up to the user to figure out what to
do. bfiles could help by writing "local", "other", and "base" copies
right in the working dir -- I believe this is what svn and bzr
routinely do for conflicts in regular files, so there's some precedent
there.

Greg

Peter Arrenbrecht

Jun 3, 2010, 12:37:43 PM
to cowwoc, merc...@selenic.com

You can throw away past history in this repo separately.
-parren

Harvey Chapman

Jun 3, 2010, 1:07:35 PM
to mercurial

Also, the binary repo doesn't change very often. So back when we used SVN, you could check out many different working copies and have them all use one local checkout of the binary repo.

We used this a lot for embedded Linux distributions where we kept a copy of everything required to build in source control, including the entire toolchain.
