i spent some time over the past few weeks researching various open-
source version control apps for use in vfx. thought i'd throw you all
an update with my findings. as i explored different options and
thought about the big picture, i came up with some features that i
considered necessary and/or preferable.
---prerequisites---
free or very cheap ( perforce is $900/user x 100 users = $90,000 = non-
option )
cross platform
python api
fast performance with binary files
configurable to conserve disk space
- ability to easily remove unneeded files from repo (aka
'obliterate')
- limited file redundancy
---bonus---
no recursive special directories ( like .svn directories )
most of the prereqs are based around the notion that we'll be dealing
with some very large files. we want to avoid replicating them all
over our server because redundancy is a waste of disk space, network
traffic, and copy time.
so, what were my conclusions? subversion simply won't work. here's
why:
while subversion's python api seems quite top notch, subversion itself
fails pretty miserably when it comes to binary performance and disk
space usage. it stores all files in the repo using a delta algorithm,
meaning each file is stored not as a whole file, but as the difference
between itself and the previous commit. this has the advantage of
saving disk space and of always having the diff on hand. however,
calculating a delta for many large binary files -- and then later
merging deltas to reform complete files -- takes prohibitively (read:
insanely) long. take a look at this article for some performance tips
and figures:
http://www.ibm.com/developerworks/java/library/j-svnbins.html.
unfortunately, their solution is to use svn's import and export
commands, which store and retrieve binary files whole and
uncompressed. the problem is that you don't get any version control
on those files, so what's the bloody point?
the second major failing is disk space usage. the delta algorithm
saves space, but that space savings is far outweighed by several
failings. first of all, every file you check out is stored twice.
yep, EVERY file. in addition to your working copy it keeps an extra
copy in the .svn directory so that IF you edit the file you can do a
quick, offline diff. there's no way to turn off this "feature". so,
if you're checking out 500GB of data, it's gonna be more like 1TB.
all that extra disk space used up in every working copy buys you
almost no benefit, because diffs between binary files are useless
without a custom app to interpret the data. last in the disk space
category, if
a user accidentally checks in 100GB of cache data, or let's say your
repo is getting very large and you want to wipe out some old versions
of an asset that you know aren't being used, you cannot do so without
going through some extreme pain. you have to use `svnadmin dump` to
dump your entire repo to a text file, then use svndumpfilter to filter
through your data and remove what you don't want, then rebuild your
repo. this process can take many hours if your repo is very large.
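for the curious, the surgery looks roughly like this -- repo paths and
the excluded directory name are made up for illustration, and you'd
want to rehearse this on a copy of the repo first:

```
# dump the entire repo history to a portable stream
# (this step alone can take hours on a big repo)
svnadmin dump /repos/vfx > vfx.dump

# strip the unwanted path out of the stream
svndumpfilter exclude caches/bigsim \
    --drop-empty-revs --renumber-revs < vfx.dump > vfx-filtered.dump

# build a fresh repo from the filtered stream
svnadmin create /repos/vfx-filtered
svnadmin load /repos/vfx-filtered < vfx-filtered.dump
```

note that the repo is offline (or at least frozen) for the whole
dump/filter/load cycle, which is what makes this so painful in
production.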
the last part is a pet peeve, and that's the recursive .svn
directories. these are annoying to deal with because if you decide to
switch out some directories in your working copy with some others of
the same name and you expect it to simply use the new ones in their
place, it won't work. you have to copy over all the .svn folders from
the original into the new set. imagine how well this will work with
artists! you would have to write scripts for moving and modifying
these .svn directories and the artists would have to reliably use them
instead of just dragging and dropping directories or the system would
break down.
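to make the pain concrete, here's a sketch of the kind of script you'd
have to hand artists. the directory names are made up, and the .svn
dirs here are empty stand-ins ( real ones hold svn's bookkeeping
files ):

```shell
set -e
demo=/tmp/svn_swap_demo
rm -rf "$demo"

# stand-in for an existing working copy: every dir has a .svn inside
mkdir -p "$demo"/old/shot01/.svn "$demo"/old/shot01/frames/.svn

# the replacement directory an artist wants to drop in -- no .svn dirs
mkdir -p "$demo"/new/shot01/frames
echo "updated" > "$demo"/new/shot01/frames/0001.exr

# graft every .svn dir from the old tree into the new one so svn still
# recognizes the result as a working copy
(cd "$demo"/old/shot01 && find . -type d -name .svn -print0 \
    | tar -cf - --null -T -) | tar -xf - -C "$demo"/new/shot01
```

and every artist would have to run something like that for every
directory swap, every time, or the working copy breaks.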
i was pretty disappointed to finally come to this conclusion about
subversion, but the fact is that it does what it's meant to do well,
and managing large binary datasets is not what it's meant to do. so,
i moved on and began applying my criteria to pretty much every
revision control system i could find ( using this list:
http://en.wikipedia.org/wiki/Comparison_of_revision_control_software
). most are cvs/svn derivatives with no real advantage in feature
set. i ran away from anything that used delta compression on binary
files, and at first i shied away from distributed systems because of
what i read in the mercurial manual:
" Because Subversion doesn’t store revision history on the client, it
is well suited to managing projects that deal with lots of large,
opaque binary files. If you check in fifty revisions to an
incompressible 10MB file, Subversion’s client-side space usage stays
constant. The space used by any distributed SCM will grow rapidly in
proportion to the number of revisions, because the differences between
each revision are large.
"
essentially, if you have a 500GB repo, then that 500GB is copied to
every working copy. ie: mercurial is worse than subversion with
binary files ( and subversion is already pretty bad with binary
files ). i shouldn't write off mercurial entirely, though; with the
right features it might still be viable. and as i soon discovered, my
favorite option ended up being a distributed system anyway....
that system is "git". so far, i think it has the most potential of
anything i've seen. it's distributed, but very flexible and has many
different models for revision control, plus a lot of options to help
save disk space / network traffic. it can even be configured to work
like cvs/svn, if that is your desire. the project was started by
linus torvalds, and as he put it: "It's not an SCM, it's a
distribution and archival mechanism. I bet you could make a
reasonable SCM on top of it, though. Another way of looking at it is
to say that it's really a content-addressable filesystem, used to
track directory trees." ( taken from this helpful site:
http://utsl.gen.nz/talks/git-svn/intro.html )
the python api is provided by a 3rd party, which is a bit
disappointing (ironic, coming from the guy who started pymel), but it
exists and looks object-oriented enough. git doesn't use delta-
compression, the amount of history copied from a repo can be limited
or even shared via hard links, it has the ability to prune old
commits, it has an option to pack away commits that are no longer used
into even greater compression, and it doesn't use annoying recursive
directories.
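those disk-saving knobs are easy to poke at from the command line. a
quick sketch with stock git commands -- throwaway paths under /tmp and
made-up file names, just to show the shape of it:

```shell
set -e
work=/tmp/git_space_demo
rm -rf "$work"
mkdir -p "$work"

# a tiny stand-in "central" repo with two revisions of one file
git init -q "$work"/src
echo "frame one" > "$work"/src/shot01.txt
git -C "$work"/src add shot01.txt
git -C "$work"/src -c user.name=demo -c user.email=demo@example.com \
    commit -qm "v1"
echo "frame two" > "$work"/src/shot01.txt
git -C "$work"/src -c user.name=demo -c user.email=demo@example.com \
    commit -qam "v2"

# limit how much history a working copy carries: a depth-1 ("shallow")
# clone fetches only the newest commit, not the whole history
git clone -q --depth 1 "file://$work/src" "$work"/shallow

# clones on the same filesystem can share objects via hard links
# instead of copying them
git clone -q --local "$work"/src "$work"/linked

# repack loose objects and drop unreachable ones to reclaim space
git -C "$work"/src gc --quiet
```

the shallow clone ends up with a single commit of history while the
source repo keeps both, which is exactly the knob we'd want for
artists' working copies.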
i haven't begun using git in a real-world test yet, but if you're
looking for something to base a pipe on, this could be the horse to
bet on. ultimately, i would really like to start an open-source
asset management project, so take a look at git and see what you
think. i'll let you know as i find out more. i haven't done a speed
test on a large image sequence yet; that could still be a deal-
breaker, but so far it "feels" fast.
-chad