HA Clustering DSpace

Hernán Lagos

Jul 21, 2016, 10:51:22 AM
to DSpace Technical Support
Dear all,

I want to ask whether any of you have experience creating a high-availability cluster for DSpace.

Best regards

Luiz dos Santos

Jul 21, 2016, 11:52:25 AM
to DSpace Technical Support
It is a nice question, but if you want to use a cluster, isn't it better to use Fedora instead? Please note that this is a question... I'm curious to know what the DSpace specialists think about it.

Best
Luiz


Hernán Lagos

Jul 21, 2016, 12:22:51 PM
to DSpace Technical Support
Thanks

In this case, the goal is for a DSpace installation to remain online.

Regards

Hilton Gibson

Jul 21, 2016, 12:27:07 PM
to Hernán Lagos, DSpace Technical Support
Hi All,

For 100% infrastructure uptime, it is better to use tried-and-tested cloud services.
That's my 2c ;-)

Cheers

hg

Hilton Gibson
Stellenbosch University Library

Tim Donohue

Jul 21, 2016, 12:27:17 PM
to Hernán Lagos, DSpace Technical Support

Hi Hernán,

The simple answer here is that, currently, there is no "standard" high availability setup for DSpace, and DSpace has no inherent ability to do load balancing or clustering on its own.

That said, DSpace is essentially just a web application that runs on Tomcat (or similar), uses a PostgreSQL database (or similar) to store metadata/relationships, and uses Apache Solr for searching/browsing. Each of these three tools (Tomcat, PostgreSQL and Solr) *does* provide clustering options. So, it may be plausible to rely on the clustering options at those levels to create a DSpace cluster.

However, I'll admit that I'm not aware of anyone who has done that before. If someone has, I'm hoping they will speak up here and give us all a few more clues/hints. There is an older (now outdated) wiki page where such discussions started a long time ago, but they never came to any final decision/proposal:

https://wiki.duraspace.org/display/DSPACE/Clustering

All that said, I suspect there are others who would be interested in more easily enabling clustering within DSpace itself. That seems like it'd make a wonderful addition to the software platform, but it'd take one (or more) institutions who could help us to better define the gaps, what is missing/needed, and then start to figure out a way forward. DSpace has no centralized development team (developers are volunteers or allowed to work on the project by their institutions). So we are entirely reliant on the institutions using DSpace to help us make such improvements (see how to contribute [1]). If we can find a few interested users, we also could establish a formal DSpace Clustering Interest Group [2] that could begin to define the use cases, needs, etc. for the benefit of us all.

The topic of clustering is one that comes up every once in a while on this mailing list. If others are interested in helping to move this idea forward, I'd encourage you to voice your opinions/experience here. All we'd need to establish a Clustering Interest Group would be some interested individuals and one or more willing to chair / co-chair those group meetings.

Sincerely,

Tim

[1] https://wiki.duraspace.org/display/DSPACE/How+to+Contribute+to+DSpace
[2] https://wiki.duraspace.org/display/DSPACE/DSpace+Interest+Groups

-- 
Tim Donohue
Technical Lead for DSpace & DSpaceDirect
DuraSpace.org | DSpace.org | DSpaceDirect.org

Hernán Lagos

Jul 21, 2016, 1:11:19 PM
to DSpace Technical Support, herna...@gmail.com
Hi Tim

Thanks, your feedback has been very useful.
If we set up a DSpace cluster, we will announce it here.

Regards

Luiz dos Santos

Jul 21, 2016, 2:34:15 PM
to DSpace Technical Support
Hi Tim,

It seems interesting, but I have one point: relying on high availability in the hardware while running one big monolithic piece of software seems more like mitigating the problem than solving it. You could have Solr and PostgreSQL in clusters, since they have their own clustering options, but you would still end up with one DSpace in one Tomcat that can fail and take your repository down, right?

Maybe a high-availability DSpace needs something more: something with microservices, something in line with the Reactive Manifesto (http://www.reactivemanifesto.org/). In DSpace 7, the new GUI model will make it possible to run the GUI on another server, which is great, but DSpace will still rely on a single DSpace backend, right? Do you see a way to have two or more DSpace backends running simultaneously?

One last point, as a volunteer, I would like to take part in the clustering group.

Best regards
Luiz

Peter Dietz

Jul 21, 2016, 3:13:35 PM
to Luiz dos Santos, DSpace Technical Support
From my investigations into DSpace, the key element that I would like to decouple from DSpace is Solr.

Say you were going to build a new frontend to DSpace that heavily used the DSpace REST API. You could have multiple servers, each running Tomcat with the DSpace REST API deployed, and nginx sitting in front of them doing the proxying / load balancing. No problems there, especially if you have PostgreSQL as an external service (RDS) and the assetstore located outside of DSpace (S3). However, I don't see how you can run multiple instances of DSpace Solr. Solr stores data, so it wouldn't be as simple as just adding another server running the webapp; you would need to coordinate the Solr cluster using SolrCloud / ZooKeeper. Maybe it's not as complicated as I think. But I thought I read at one point that DSpace had some custom Solr code present, or that the Solr configs would have to be managed, and I'm not sure how much work it would be to build up a Solr cluster with that config.
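
As a rough illustration of the load-balancing part of this idea, here is a minimal Python sketch of round-robin selection over interchangeable REST API nodes, skipping any node marked as down. It only models the selection logic that a proxy such as nginx would handle for you; the class name and the `rest-N:8080` host names are made up for the example.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    # Hypothetical host names for three identical Tomcat + REST API nodes.
    lb = RoundRobinBalancer(["rest-1:8080", "rest-2:8080", "rest-3:8080"])
    lb.mark_down("rest-2:8080")
    print([lb.next_backend() for _ in range(4)])
```

This only works because the nodes keep no state of their own: the database and the assetstore live outside them, which is exactly the precondition described above.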

It could be possible to ensure that DSpace can use stock Solr, or to write another implementation of the storage/search/index engine that might be more cloud friendly than Solr, such as DynamoDB/CloudSearch. A normal use case for Solr is to use it only as an index: your important data lives in a persistent data store, such as the database, and you could wipe out your search index and reindex your source data to repopulate it. However, DSpace's use of Solr relies on it being a source of data for some elements (authority, statistics).
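
To make the distinction concrete, here is a toy Python model (not DSpace code; all names and data are invented for the illustration). The "search" core is derived from the database and can be rebuilt after a wipe, while the "statistics" core holds records that exist nowhere else.

```python
def build_search_index(database):
    """Derive a toy search index (item id -> lowercase title words) from the database."""
    return {item_id: title.lower().split() for item_id, title in database.items()}


def fresh_cores(database, usage_events):
    """Model two cores: 'search' is derived data, 'statistics' is stored only in Solr."""
    return {
        "search": build_search_index(database),
        "statistics": list(usage_events),
    }


def wipe_and_reindex(database):
    """Simulate losing the Solr data directory, then reindexing from the database."""
    return {
        "search": build_search_index(database),  # fully recoverable
        "statistics": [],                        # nothing to rebuild it from
    }


if __name__ == "__main__":
    database = {1: "Thesis on clustering", 2: "Article on load balancing"}
    events = [{"item": 1, "type": "view"}, {"item": 2, "type": "download"}]
    before = fresh_cores(database, events)
    after = wipe_and_reindex(database)
    print("search index recovered:", after["search"] == before["search"])
    print("statistics records before/after:",
          len(before["statistics"]), len(after["statistics"]))
```

Any core in the second category has to be treated like a database when planning a cluster: it needs replication or backups, not just a reindex script.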

________________
Peter Dietz
Longsight
www.longsight.com
pe...@longsight.com
p: 740-599-5005 x809

Luiz dos Santos

Jul 21, 2016, 4:05:30 PM
to DSpace Technical Support
Hi Peter,

    Thanks, nice explanation! Why don't you coordinate the DSpace Clustering group?

Best regards
Luiz

Andrea Bollini

Jul 21, 2016, 5:40:53 PM
to Peter Dietz, Luiz dos Santos, DSpace Technical Support
I have had extensive experience with DSpace in a clustered environment. I was responsible for a product based on DSpace 4, more precisely on DSpace-CRIS 4.4, used by more than 60 institutions. Depending on the size of the institution and the expected usage, 2 or 4 servers were dedicated to running Tomcat with the JSPUI. The DBMS was Oracle in a centralized cluster environment, and the front end was Apache HTTP 2, load balanced. The key point here was to share some parts of the filesystem (config and assetstore) between all nodes using NFS. DSpace works fine in a cluster for dissemination as long as you don't change things that are cached, like the metadata field and bitstream format registries. If you make changes to such aspects, or to other customizations that use a local cache, it is necessary to introduce some system able to propagate / notify the changes to all the other nodes.
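
The propagation idea can be sketched in a few lines of Python. This is an in-process simulation under stated assumptions, not the mechanism we used: `InvalidationBus` stands in for whatever transport would carry the notifications between real nodes, and the registry contents are invented.

```python
class InvalidationBus:
    """In-process stand-in for the transport that carries change notifications."""

    def __init__(self):
        self.nodes = []

    def register(self, node):
        self.nodes.append(node)

    def publish(self, key, origin):
        # Tell every node except the writer to drop its cached copy.
        for node in self.nodes:
            if node is not origin:
                node.evict(key)


class Node:
    """One application node with a local cache in front of a shared store."""

    def __init__(self, name, store, bus):
        self.name = name
        self.store = store      # shared between nodes, like the database
        self.bus = bus
        self.cache = {}         # local to this node
        bus.register(self)

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.store[key]
        return self.cache[key]

    def write(self, key, value):
        self.store[key] = value
        self.cache[key] = value
        self.bus.publish(key, origin=self)

    def evict(self, key):
        self.cache.pop(key, None)


if __name__ == "__main__":
    registry = {"dc.title": "Title"}   # toy metadata field registry
    bus = InvalidationBus()
    node_a = Node("a", registry, bus)
    node_b = Node("b", registry, bus)
    node_b.read("dc.title")            # node b now caches the old label
    node_a.write("dc.title", "Main title")
    print(node_b.read("dc.title"))     # re-read from the shared store
```

Without the `publish` call, node b would keep serving its cached value until restarted, which is the stale-registry problem described above.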
We tried to use SolrCloud but found some limitations: after a while we started to receive randomly corrupted responses from slave nodes. We were not able to fix the issue, and since a single server with 8 vCPUs and 8 GB of RAM was able to handle heavy load for 4-8 customers (using multiple cores), we decided in the end to stay with a standalone Solr solution.
DSpace can very easily use a standard Solr server; the additions that we have in DSpace and the configuration fit within the normal Solr configuration and extension points. The issue here is that we need to update the client side to be able to use the latest version of Solr. Right now, I'm starting to investigate the feasibility of upgrading to Solr 6.
With the new configuration system to be introduced in DSpace 6, which can automatically reload changes by monitoring the filesystem, it could also become easier to achieve better clustering support for DSpace.
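
To show what reloading by monitoring the filesystem can look like in general, here is a small Python sketch that re-parses a properties-style file whenever its modification time changes. It is a generic polling example, not the DSpace 6 implementation; the `ReloadingConfig` class is hypothetical, and the file name and property are used only as sample data.

```python
import os
import tempfile


class ReloadingConfig:
    """Re-read a key=value file whenever its modification time changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._values = {}
        self.reload_if_changed()

    def reload_if_changed(self):
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self._mtime:
            return False
        self._values = self._parse()
        self._mtime = mtime
        return True

    def _parse(self):
        values = {}
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                # Skip blanks, comments, and lines without a key=value pair.
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, value = line.split("=", 1)
                values[key.strip()] = value.strip()
        return values

    def get(self, key, default=None):
        self.reload_if_changed()
        return self._values.get(key, default)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "local.cfg")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("dspace.name = Repository A\n")
        cfg = ReloadingConfig(path)
        print(cfg.get("dspace.name"))
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("dspace.name = Repository B\n")
        # Bump the mtime explicitly so the demo does not depend on clock resolution.
        st = os.stat(path)
        os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))
        print(cfg.get("dspace.name"))
```

If the config directory is shared between nodes, each node running a check like this would pick up an edit on its own, without a restart or an explicit notification.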

Hope this helps,
Andrea

Andrea Bollini

Jul 21, 2016, 5:50:26 PM
to Peter Dietz, Luiz dos Santos, DSpace Technical Support
Another thing about SolrCloud: at the time I looked at it, Solr joins between different cores were not supported. Solr joins are used by DSpace-CRIS to provide aggregated statistics of items at the author and department level.
Andrea

Mark Wood

Jul 22, 2016, 8:53:43 AM
to DSpace Technical Support, lui...@gmail.com
On Thursday, July 21, 2016 at 3:13:35 PM UTC-4, Peter Dietz wrote:
> From my investigations into DSpace, the key element that I would like to decouple from DSpace is Solr.
>
> Say you were going to build a new frontend to DSpace that heavily used the DSpace REST API. You could have multiple servers, each running Tomcat with the DSpace REST API deployed, and nginx sitting in front of them doing the proxying / load balancing. No problems there, especially if you have PostgreSQL as an external service (RDS) and the assetstore located outside of DSpace (S3). However, I don't see how you can run multiple instances of DSpace Solr. Solr stores data, so it wouldn't be as simple as just adding another server running the webapp; you would need to coordinate the Solr cluster using SolrCloud / ZooKeeper. Maybe it's not as complicated as I think. But I thought I read at one point that DSpace had some custom Solr code present, or that the Solr configs would have to be managed, and I'm not sure how much work it would be to build up a Solr cluster with that config.

DSpace does include a tiny dab of custom code for Solr, which I think is not essential. LocalHostRestrictionFilter can be replaced with fairly simple filtering by the Servlet container. ConfigureLog4jListener only exists because for some reason we insist that Solr's logging configuration live with DSpace's instead of with the rest of Solr's configuration. There is nothing else. It should be simple to use DSpace with a stock Solr 4 instance. If and when DSpace moves to Solr 5+, all of this will have to be revamped anyway due to significant changes in the way Solr must be deployed.
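
For illustration, the kind of check such filtering performs can be sketched in Python, assuming (as the name suggests) that the filter's job is to reject requests that do not come from the local host. The function name and the allow-list are invented for the example; in a real deployment this would be configured in the Servlet container (in Tomcat, a RemoteAddrValve is one candidate) rather than written by hand.

```python
import ipaddress

# Loopback ranges only, mirroring a "local host only" policy.
DEFAULT_ALLOWED = ("127.0.0.0/8", "::1/128")


def is_allowed(remote_addr, allowed=DEFAULT_ALLOWED):
    """Return True if remote_addr falls inside one of the allowed networks."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Anything that is not a valid IP address is rejected.
        return False
    return any(addr in ipaddress.ip_network(net) for net in allowed)


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "::1", "192.0.2.10", "not-an-ip"):
        print(candidate, is_allowed(candidate))
    # In a cluster, the DSpace nodes are no longer on Solr's own host, so the
    # allow-list would have to include their addresses (documentation range here).
    cluster = DEFAULT_ALLOWED + ("192.0.2.0/24",)
    print("192.0.2.10", is_allowed("192.0.2.10", cluster))
```

The last lines hint at why this matters for clustering: once Solr runs on a separate host, a loopback-only rule has to become an explicit list of trusted nodes.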