Scalability of DSpace


Vlastimil Krejčíř

Aug 23, 2019, 6:57:46 AM
to DSpace Community
Hi all,

back in April 2013 I asked the community about DSpace scalability, see:

http://dspace.2283337.n4.nabble.com/DSpace-scalability-tens-of-hundreds-TBs-tt4662988.html#a4663047

Now, in 2019, it is time to ask the same question :-).

How much data / how many items can DSpace handle? The DSpace system at Cambridge University (https://www.repository.cam.ac.uk/) was reported as the largest at the time. I can see it stores about 245 thousand items nowadays.

Does anyone else have a bigger one? Is there any new information on scalability since 2013?

Regards,

Vlastik Krejčíř

--
----------------------------------------------------------------------------
Vlastimil Krejčíř
Library and Information Centre, Institute of Computer Science
Masaryk University, Brno, Czech Republic
Email: krejcir (at) ics (dot) muni (dot) cz
Phone: +420 549 49 3872
OpenPGP key: https://kic-internal.ics.muni.cz/~krejvl/pgp/
Fingerprint: 7800 64B2 6E20 645B 56AF  C303 34CB 1495 C641 11B9
----------------------------------------------------------------------------

emilio lorenzo

Aug 23, 2019, 7:30:18 AM
to dspace-c...@googlegroups.com
Hi, according to OpenDOAR, the Institutional Repository of Peking University (PKU Institutional Repository) has 510 thousand records (unfortunately, it is currently unavailable).


On 23/08/2019 at 12:57, Vlastimil Krejčíř wrote:


Does anyone else have a bigger one? Is there any new information on scalability since 2013?


--
All messages to this mailing list should adhere to the DuraSpace Code of Conduct: https://duraspace.org/about/policies/code-of-conduct/

Mark H. Wood

Aug 23, 2019, 9:12:48 AM
to DSpace Community
On Fri, Aug 23, 2019 at 03:57:46AM -0700, Vlastimil Krejčíř wrote:
> back in April 2013 I asked the community about the DSpace scalability, see:
>
> http://dspace.2283337.n4.nabble.com/DSpace-scalability-tens-of-hundreds-TBs-tt4662988.html#a4663047
>
> Now, at 2019, it is time to ask the same question :-).
>
> How much data / how many items can DSpace handle?

That's a difficult question to answer, without defining "handle" more
precisely.

There is a theoretical limit to the number of distinct objects that
DSpace can manage, imposed by the way they are uniquely identified.
In v1 through v5 it was the number of positive 31-bit integers: about
two billion. In v6 it is the number of distinct UUIDs of a certain
type - for random version-4 UUIDs, up to 2^122, and in any case far
more than 2.0e9. The majority of those identifiers will be attached
to bitstreams, with a somewhat smaller number allocated to items, and
a relatively small proportion given to collections and communities.
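As a rough sanity check of those identifier-space sizes - assuming the v6 identifiers are random version-4 UUIDs, as Java's UUID.randomUUID() produces (an assumption; the exact UUID variant isn't stated here):

```python
import uuid

# DSpace 1.x-5.x: objects are identified by positive 31-bit integers.
legacy_ids = 2**31 - 1
print(f"legacy ID space: {legacy_ids:,}")   # 2,147,483,647

# DSpace 6.x: objects are keyed by UUIDs. A version-4 UUID has 128 bits,
# of which 6 are fixed (version and variant), leaving 122 random bits.
v4_ids = 2**122
print(f"v4 UUID space:   {v4_ids:.3e}")     # 5.317e+36

# Python's uuid4() emits the same version-4 UUIDs as Java's randomUUID():
assert uuid.uuid4().version == 4
```

Either way, the identifier space is not the practical bottleneck.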

DSpace doesn't set a limit on how much data it can store. That will
be determined by the nature of the underlying filesystem and the
distribution of sizes of individual bitstreams. HTTP may also set
some limits on the sizes of individual accesses.

The *practical* limits will also depend on the combination of network
capacity, bitstream sizes, HTTP timeout settings, and how long the
typical user is willing to wait.

It is a large and ramifying question. :-)

--
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu

Tim Donohue

Aug 23, 2019, 10:48:28 AM
to Vlastimil Krejčíř, DSpace Community
Hello Vlastimil,

Unfortunately, the overall size of DSpace sites is very difficult to track (it relies entirely on self-reporting).

I know there are very large sites out there... a few that come to mind are the University of Cambridge (https://www.repository.cam.ac.uk) and Georgetown University (https://repository.library.georgetown.edu/). I cannot claim to know exactly how large these sites are, though, as each may have access-restricted content (which is not even visible on the web). However, in terms of public content alone, each has 250-350 thousand items.

I also admit that I don't know whether there are larger sites out there.  But, maybe institutions on this mailing list will self-report if they have more than 400 thousand items. (I know I'd love to hear which sites have >400K items!)

I think Mark Wood gave a thorough answer regarding the number of items possible in a DSpace instance. Technically, the biggest limitation is the amount of server space & memory available (larger sites need more of each). With each release we attempt to make DSpace as performant (and memory-lean) as we can, and as memory issues are reported we resolve them as bugs in a new release. For example, for the upcoming DSpace 7 release (which is still under active development) we are running more detailed performance testing, as described here: https://wiki.duraspace.org/display/DSPACE/DSpace+7+Performance+Testing  At this time, that performance testing is geared more towards minimizing CPU load and overall memory use (which will also help with scaling).

Tim



Terry Brady

Aug 23, 2019, 12:29:14 PM
to DSpace Community
Here are some details about DigitalGeorgetown.
  • Total items: 546,000
  • Public items: 397,000
  • Citation-only items: ~470,000
As we tested and migrated to DSpace 6.x, we encountered a few performance issues. We have contributed patches to DSpace 6.x releases (and to the future DSpace 6.4 release) to help resolve them.

We preserve our assets in the APTrust (Academic Preservation Trust) service, so we do not run the DSpace checksum checker on our DSpace instance. 

Terry



--
Terry Brady
Applications Programmer Analyst
Georgetown University Library Information Technology
425-298-5498 (Seattle, WA)

Vlastimil Krejčíř

Aug 23, 2019, 5:00:19 PM
to dspace-c...@googlegroups.com
Thank you very much, Mark and Tim, for the precise answers - I will have
to be more precise with my questions next time, since I am interested in
practical limits and big DSpace installations. In any case, thanks again
for the great summary of DSpace's possible (and theoretical) capabilities.

Vlastik

Vlastimil Krejčíř

Aug 23, 2019, 5:10:35 PM
to dspace-c...@googlegroups.com
Thank you, Terry. How fast does your DSpace grow? How many items per month
or year? Do you do clustering / load balancing? What kind of hardware do
you need to run it? I would be grateful if you could share that information.

Vlastik


Terry Brady

Aug 23, 2019, 6:54:00 PM
to DSpace Community
Vlastik,

The 470,000 citation items were added to our repository several years ago. With the exception of those large citation-collection ingests, our repository's growth is relatively modest each year.

After ingesting one large collection with 50,000 bitstreams, we discovered that we needed to disable the checksum checker process. We have discussed resuming it, but that has not been a priority for us. Aside from the checksum checker, we observed that rebuilding our Discovery index took longer as our item count increased.

Our instance runs on a single server. After ingesting the large collections, we increased the RAM on the server, and we increased it again when we migrated from DSpace 5 to DSpace 6. We run Tomcat with -Xmx8g, and each of our command-line tasks with 2-3g of RAM depending on the task. It takes 3-4 hours to rebuild our Discovery index. Other than the RAM allocation, we have stuck with most of the recommended configuration settings; see https://wiki.duraspace.org/display/DSDOC6x/Performance+Tuning+DSpace.
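As a rough sketch, settings along those lines might look like this (the heap values are the ones reported above; the install path is a placeholder, and CATALINA_OPTS / JAVA_OPTS are the standard hooks that Tomcat and the dspace command-line launcher read):

```shell
# Tomcat heap for the DSpace web application
export CATALINA_OPTS="-Xmx8g"

# The dspace command-line launcher reads JAVA_OPTS; 2-3g depending on the task
export JAVA_OPTS="-Xmx3g"

# Full rebuild of the Discovery (Solr) index -- the task that takes
# 3-4 hours on a repository of this size
/path/to/dspace/bin/dspace index-discovery -b
```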

Terry


Bram Luyten

Oct 8, 2019, 8:37:21 AM
to Terry Brady, DSpace Community
Cool thread - just picking up on it now! Here are a few examples of large DSpace instances, in terms of item count:

https://repository.globethics.net/discover
Based on the Atmire Open Repository platform (DSpace 5)
Result of a migration from Fedora to DSpace.
665,000+ items

Based on the Atmire DSpace Express platform (DSpace 6), where we basically offer DSpace without customizations
717,000+ items

A Brazilian court / institute of the justice system
911,000+ items

National parliamentary library of Georgia
307,000+ items

The size of the database and of the Solr Discovery index are indeed challenging.
But in terms of scaling and performance, we generally find it a bigger challenge to keep response times down for repositories that receive a lot of traffic.

So if your repository has 100k items but is visited very frequently, you may have a bigger challenge on your hands than a repository with half a million items that sees less usage.

cheers,

Bram

Bram Luyten
250-B Lucius Gordon Drive, Suite 3A, West Henrietta, NY 14586
Gaston Geenslaan 14, 3001 Leuven, Belgium
atmire.com

