How much is too much data in an svn repository?


Sean McBride

Sep 22, 2022, 3:59:39 PM
to users
Hi all,

Our svn repo is about 110 GB for a full checkout. It's larger on the server of course, with all history, weighing in at about 142 GB.

There haven't been any performance issues, it's working great.

But now some users are interested in committing an additional 200 GB of mostly large binary files.

I worry about it becoming "too big". At what point does that happen? Terabytes? Petabytes? 100s of GB?

Thanks,

Sean

Justin MASSIOT | Zentek

Sep 23, 2022, 3:33:34 AM
to users
Hello Sean,

I don't have enough experience to answer your question, but I'm very concerned about large binary files, even though I have a more split-up structure of repositories.
I'm following this discussion ;-) Can anyone offer some input on this topic?

Justin MASSIOT  |  Zentek

Aleksa Todorović

Sep 23, 2022, 3:45:54 AM
to Justin MASSIOT | Zentek, users
Hi all,

I can confirm that Subversion can handle repositories with 100,000+ revisions, committed files ranging from a few bytes to several GB, and a total repo size of up to 20 TB. The speed issues I see are mostly related to hard drive operations, but they do not prevent efficient work. The only very noticeable slowdowns are client-side, on commits touching thousands of files (those happen from time to time): committing takes a long time because all those files need to be compared by content, and so does updating, because there is always a copy of each file in the .svn directory. Outside of that, Subversion performs really well.

Hope this helps.

Regards,
Aleksa

Graham Leggett via users

Sep 23, 2022, 7:33:31 AM
to Sean McBride, users
In my experience it becomes too big when the underlying disk gets full. As long as your underlying disks can handle it, it works fine.

I use SVN for versioned incremental backups of files in the 0.5 GB range. I’ve seen reports of others checking in multi-GB files as backups with no trouble.

The best thing to do is to physically try it. Make a copy of your repo, then try checking things into it, and see where your issues are.
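
For example, something along these lines (paths are only illustrative; a file:// checkout exercises the repository itself without involving the network):

    # byte-for-byte copy of the live repository
    svnadmin hotcopy /srv/svn/repo /srv/svn/repo-test

    # empty-depth checkout avoids pulling 110 GB just to run the test
    svn checkout --depth empty file:///srv/svn/repo-test wc-test
    cd wc-test
    cp -r /data/large-binaries .
    svn add large-binaries
    svn commit -m "load test with ~200 GB of binaries"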

Regards,
Graham


Mark Phippard

Sep 23, 2022, 7:43:19 AM
to Sean McBride, users
Assuming you have the disk space then there is no real upper limit.

That said ... do not discount the administrative burden. Are you
backing up your repository? Whether you use dump/load, svnsync or
hotcopy, the bigger the repository, the more of a burden it will be
on these tools.
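
For concreteness, the three approaches look roughly like this (the
paths are placeholders), and the cost of each one grows with total
repository size:

    # full dump, plus periodic incremental dumps of newer revisions
    svnadmin dump /srv/svn/repo > /backups/repo-full.dump
    svnadmin dump /srv/svn/repo -r 150000:HEAD --incremental > /backups/repo-incr.dump

    # svnsync: maintain a mirror repository, revision by revision
    svnadmin create /backups/repo-mirror   # needs a pre-revprop-change hook that exits 0
    svnsync init file:///backups/repo-mirror file:///srv/svn/repo
    svnsync sync file:///backups/repo-mirror

    # hotcopy: byte-for-byte copy; --incremental copies only what changed
    svnadmin hotcopy --incremental /srv/svn/repo /backups/repo-hotcopy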

If this is just about storing binary files, why not consider solutions
that were meant for that: an object storage platform such as S3 or
MinIO, or a package manager such as Maven, NuGet, etc.

A big negative of Subversion repositories is that you can never delete
anything. Do you really need to keep all these binaries forever?

Mark

Graham Leggett via users

Sep 23, 2022, 7:52:21 AM
to Mark Phippard, Sean McBride, users
On 23 Sep 2022, at 13:42, Mark Phippard <mark...@gmail.com> wrote:

> A big negative of Subversion repositories is you cannot ever delete
> anything. Do you really need to keep all these binaries forever?

In our regulated world that is an important feature.

Once the repos get too big we start new ones. In the meantime, there is no such thing as “we did fraud^H^H^H^H^H a delete to make space”.

Regards,
Graham


Daniel Sahlberg

Sep 23, 2022, 8:08:08 AM
to Sean McBride, users
Hi,

In addition to all the other responses, I'd like to advertise the "pristines on demand" feature that got some traction in the spring.

Subversion normally stores every file twice on the client side (in the "working copy"): once as the actual file and once as a "pristine", i.e. the file as it was when checked out, kept in the .svn folder. The idea with "pristines on demand" is to store the file only once, at the expense of some operations requiring more bandwidth. I'm not sure about the status, but it is not part of any current release yet. Karl Fogel and Julian Foad were involved in this; more details can be found in the list archives of the d...@subversion.apache.org list.
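
As a rough illustration (the working-copy path is invented for the example), you can measure the duplication in any existing working copy, since the pristine copies live under .svn/pristine:

    du -sh wc                  # the whole working copy, pristines included
    du -sh wc/.svn/pristine    # the second copy of every checked-out file

For a tree of mostly incompressible binaries, the pristine store is roughly as large as the files themselves, which is why this feature is interesting for repositories like Sean's.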

Kind regards,
Daniel


Nathan Hartman

Sep 23, 2022, 8:26:35 AM
to Sean McBride, users
On Thu, Sep 22, 2022 at 3:59 PM Sean McBride <se...@rogue-research.com> wrote:
>
It occurs to me that we don't have a FAQ or other easy-to-find
documentation on maximums, such as the maximum file size, etc.

The largest publicly-accessible SVN repository of which I am aware is
the Apache.org one in which Subversion's own sources (as well as those
of numerous other projects) are housed. This repository contains
approximately 1.9 million revisions. According to [1] the dump of this
repository expands to over 65 gigabytes.

But that seems to be a drop in the ocean when Aleksa writes:

On Fri, Sep 23, 2022 at 3:45 AM Aleksa Todorović <alex...@gmail.com> wrote:
> I can confirm that Subversion can handle repositories with 100,000+ revisions, size of committed files ranging from few bytes to several GBs, and total repo size of up to 20TB.

It is possible that others here are aware of even larger repositories.

My biggest concern mirrors what Mark said about administrative burden:
the size of backups and the time it takes to make them. Mark addressed
that point quite well. Whatever you do, you must have good backups!
(My $dayjob does backups 3 different ways: the filesystem on which the
repository is stored is backed up regularly; in addition, we take
periodic 'hotcopy' backups and periodic full 'dump' backups. Obviously,
as a repository grows, this takes longer and requires more storage.)
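
Whichever methods you use, it is worth verifying the result, since a
backup you cannot restore from is no backup at all. Verification time
also grows with repository size (paths below are placeholders):

    svnadmin verify /backups/repo-hotcopy                  # checks every revision
    svnadmin verify -r 1900000:HEAD /backups/repo-hotcopy  # spot-check recent revisions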

[1] http://svn-dump.apache.org

Cheers,
Nathan

Nico Kadel-Garcia

Sep 23, 2022, 8:34:18 AM
to Mark Phippard, Sean McBride, users
On Fri, Sep 23, 2022 at 7:43 AM Mark Phippard <mark...@gmail.com> wrote:
>
> On Thu, Sep 22, 2022 at 3:59 PM Sean McBride <se...@rogue-research.com> wrote:
> >
> > Hi all,
> >
> > Our svn repo is about 110 GB for a full checkout. It's larger on the server of course, with all history, weighing in at about 142 GB.
> >
> > There haven't been any performance issues, it's working great.
> >
> > But now some users are interested in committing an additional 200 GB of mostly large binary files.
> >
> > I worry about it becoming "too big". At what point does that happen? Terabytes? Petabytes? 100s of GB?
>
> Assuming you have the disk space then there is no real upper limit.

There are practical limits. File descriptors for years or decades of
irrelevant history accumulate. Bulky accidental commits, such as large
binary objects, accumulate and create burdens for backup or high
availability. And keeping around old tags that haven't been used in
years encourages re-introducing obsolete APIs, errors, or security
flaws.

Jeffrey Walton

Sep 23, 2022, 12:12:18 PM
to Sean McBride, users
On Thu, Sep 22, 2022 at 3:59 PM Sean McBride <se...@rogue-research.com> wrote:
>
I've never encountered a problem with "too big," but I have
encountered problems with binary file types causing an SVN client or
server to hang. I experienced it back in 2012 or 2013 on a very large
collection of repos. I tried to check out/clone, and the operation
would hang about 6 or 8 hours in.

Through trial and error we discovered that a developer had checked in
object files from an Xcode build, and the SVN client or server would
hang on them. I don't recall whether it was all object files or just
a particular one. As an added twist, I think we were using TortoiseSVN
on Windows, so it may have been a bad interaction with TortoiseSVN on
Windows. Once we manually deleted the object files, the
check-out/clone proceeded.

I don't know if that would happen nowadays.
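
For what it's worth, a server-side guard against that class of
accident is a pre-commit hook that rejects object files outright.
A minimal sketch (the extensions and the message are just examples):

    #!/bin/sh
    # pre-commit hook: Subversion passes the repository path and the
    # transaction name as arguments
    REPOS="$1"
    TXN="$2"
    if svnlook changed -t "$TXN" "$REPOS" | grep -qE '\.(o|obj)$'; then
      echo "Object files (*.o, *.obj) are not allowed." >&2
      exit 1   # stderr is relayed to the committing client
    fi
    exit 0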

Jeff

Doug Robinson

Sep 26, 2022, 8:36:05 AM
to Sean McBride, users
Sean:

On Thu, Sep 22, 2022 at 3:59 PM Sean McBride <se...@rogue-research.com> wrote:

WANdisco supports customers with Subversion repositories in the TiB range with millions of revisions.  As others have mentioned, the repository size matters only when it is time to back it up.  Large backups can be managed with different techniques at different costs (only some of which have been mentioned so far).

What tends to be more important on a day-to-day basis is the size of the checkout: TCP throughput is limited by latency, so at any given latency, the larger the working copy, the longer it takes to check out.  And the larger the latency, well...  The number of files/directories in a revision can be an issue for certain operations, as can the amount of change history for a single file (e.g. "svn blame" can get slow...).  The chatty nature of WebDAV means that latency compounds the time required.  Using "svnserve" only helps in some circumstances, since it is difficult to have it cache as much as Apache (and not at all for multi-user support), so it scales differently for different operations.
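
One mitigation on the checkout side is a sparse working copy, so each user transfers only the subtrees they actually need; for example (URL and paths invented):

    # start with just the top level of trunk
    svn checkout --depth immediates https://svn.example.com/repos/proj/trunk wc
    cd wc
    # then deepen only the parts you work on
    svn update --set-depth infinity src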

I've read some excellent suggestions about using artifact management systems for build artifacts - definitely.

All that said, I think it wise to keep repository size bounded to what you (your company?) can reasonably support.

Cheers.

Doug
--
DOUGLAS B ROBINSON SENIOR PRODUCT MANAGER

http://www.wandisco.com

