Solr statistics shards in DSpace 7.6?

Keith Gilbertson

Nov 6, 2023, 12:35:55 PM
to DSpace Technical Support

Hello DSpace fans,

Is sharding of the statistics cores with a separate core for each individual year of statistics supported again in the most recent version of DSpace?

The installation documentation mentions that sharding is no longer supported in DSpace 7, but we noticed some recent commits to the DSpace project that reference sharding and wondered if it is available again.

In past versions of DSpace, we've sometimes struggled with slow updates or running out of memory during Solr index updates or maintenance. We have about 12 years of sharded statistics cores in DSpace 6 and are working on a strategy for importing them into DSpace 7, perhaps doing most of the exporting and importing on a separate server with Solr before our upgrade day. I am curious about whether we should import all statistics into the same core for DSpace 7, as expected, or if we should keep them separated if it turns out that sharding is supported again.
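To make the per-year layout concrete, here is a minimal Python sketch of how usage records might be bucketed into one core per calendar year. The core naming pattern `statistics-YYYY`, the `time` field, and the helper names `core_for_record` and `group_by_core` are assumptions for illustration only, not DSpace's actual implementation. In this sketch the current year stays in the main `statistics` core.

```python
from collections import defaultdict

def core_for_record(time_value: str, current_year: int) -> str:
    """Return the (hypothetical) core name for a record's ISO-8601 timestamp."""
    # Timestamps such as 2015-03-09T14:22:05Z start with a four-digit year,
    # so the year can be read from the first four characters.
    year = int(time_value[:4])
    return "statistics" if year >= current_year else f"statistics-{year}"

def group_by_core(records, current_year: int) -> dict:
    """Bucket records (dicts with an assumed 'time' key) by target core name."""
    groups = defaultdict(list)
    for record in records:
        groups[core_for_record(record["time"], current_year)].append(record)
    return dict(groups)
```

Importing everything into a single core corresponds to skipping this bucketing step entirely, which is the trade-off the question is about.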

Can someone please help to clarify which parts of sharding are now supported, and which still do not work, and provide observations or advice?

Thank you for the time,
Keith
 

DSpace Technical Support

Nov 6, 2023, 2:38:39 PM
to DSpace Technical Support
While awaiting an answer to your question, may I propose a tangential question of my own:  what do you think sharding is doing for you, and have you seen evidence to support this?  Because DSpace uses Solr's sharding support in a very eccentric manner, and I have my doubts that it actually buys us anything.  That is why I accepted the need to remove it in 7.0, in exchange for the option to place Solr on a separate host if desired (since supported Solr versions must now be installed separately anyway).  The custom sharding code in DSpace doesn't get enough information to work across hosts.  If sharding is really needed, Solr can do it much better on its own, and we could scrap *all* of the sharding support in DSpace.

Keith Gilbertson

Nov 6, 2023, 4:28:27 PM
to DSpace Technical Support
On Mon, Nov 6, 2023 at 2:38 PM DSpace Technical Support <dspac...@googlegroups.com> wrote:
> While awaiting an answer to your question, may I propose a tangential question of my own:  what do you think sharding is doing for you, and have you seen evidence to support this?

Mark, I don't know if it's doing anything for us now.

I can't remember, but honestly, I don't think that we took any measurements of "before sharding" or "after sharding" performance except possibly anecdotal "seems better now, the system is slow less often" notes. If I recall correctly (and this is fuzzy), part of our motivation in finally doing the sharding was to help with logistics in resolving another issue that we were having. When DSpace added the uuid fields to supplement or replace the id fields, some of our statistics reports (perhaps custom ones we had here) were inaccurate or broken. Terry Brady supplied a very nice tool to update old statistics records with IDs to use the new uuid instead, but we had difficulties using this tool on our very large statistics core, with an operating production system, with the particular storage hardware that we were using at the time. With our setup then, it ran slowly and slowed the system so much that the web interface was unusable. The initial sharding helped us to use Terry's tool to update the smaller and static cores for each previous year offline, completely away from the production system, and reinstall them when we were finished.

Beyond that logistical motivation to help resolve the missing uuid issue, I suspect that as a group we just read that DSpace supports sharding and that sharding helps performance problems, noticed that we sometimes have performance problems, and thought that sharding was a best practice and we should implement it.

Do we currently get any benefits from the sharding on our system? I don't know. We have at least one drawback with sharding under DSpace 6. If Tomcat is shut down and restarted too quickly (by monit), it sometimes tries to reopen one of those statistics cores before the previous instance has released the lock. The result is that DSpace temporarily can't see statistics in any of the previous-year cores until we restart Tomcat again.

The systems administrators here strongly support the decision to allow Solr to be placed on a different host, and that's what they're doing. My understanding is that separating things this way will help us to get better at identifying the particular bottlenecks (e.g. is the problem Solr, or something else?), allowing them to allocate appropriate resources to each component when needed, and ultimately letting us be more scientific about performance. Also, it should help us comply with organizational security requirements faster. If there are any emergency security patches for Solr, the admins can handle them within their normally scheduled update times.


Adán Román Ruiz

Nov 7, 2023, 3:25:36 AM
to dspac...@googlegroups.com

Hello

We have experienced some problems with date ranges in big cores (60 GB), which are solved by sharding.
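For readers unfamiliar with the kind of request involved, a date-range filter against a statistics core might be built roughly as in the Python sketch below. The `time` field name, the example core URL, and the helper `range_query_url` are assumptions for illustration. The function only builds a URL and does not contact Solr.

```python
from urllib.parse import urlencode

def range_query_url(base_url: str, start: str, end: str, field: str = "time") -> str:
    """Build a Solr select URL that filters on a half-open date range."""
    params = {
        "q": "*:*",
        # Solr range syntax: "[" includes the start value, "}" excludes the end value.
        "fq": f"{field}:[{start} TO {end}}}",
        # rows=0 asks only for the match count, not the documents themselves.
        "rows": 0,
    }
    return f"{base_url}/select?{urlencode(params)}"

url = range_query_url(
    "http://localhost:8983/solr/statistics",  # hypothetical core URL
    "2022-01-01T00:00:00Z",
    "2023-01-01T00:00:00Z",
)
```

With yearly shards, a query like this for a single year only has to touch one small core, which may be why sharding helps in the case described.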

Software such as dspace-stats-collector from "lareferencia" fails when working with these cores (https://github.com/lareferencia/dspace-stats-collector)

Maybe sharding is not the best solution, but it works.

Adán


DSpace Technical Support

Nov 15, 2023, 5:51:59 PM
to DSpace Technical Support
Hi Keith,

To answer the initial question, as far as I'm aware, Solr Sharding of Statistics should work in DSpace 7.6.x.  However, prior versions may have issues.  These two fixes in particular didn't come into DSpace until 7.6.x:

* 7.6 - Fix issue with statistics loading after sharding: https://github.com/DSpace/DSpace/issues/8478
* 7.6.1 (to be announced tomorrow) - Fix issue with sharded stats sometimes being counted twice: https://github.com/DSpace/DSpace/issues/8933

So, if you upgrade to the latest versions of 7.6.x, I believe sharding should be supported again.
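As a rough aid, the version thresholds above can be expressed as a small Python check. This is only a sketch: the function names and dictionary keys are made up for illustration, it expects plain numeric version strings such as `7.6.1`, and it assumes that later releases keep both fixes.

```python
def parse_version(version: str) -> tuple:
    """Turn a plain numeric version string like '7.6.1' into (7, 6, 1)."""
    return tuple(int(part) for part in version.split("."))

def sharding_fixes_included(version: str) -> dict:
    """Report which of the two sharding fixes mentioned above a version should contain."""
    ver = parse_version(version)
    return {
        # Issue 8478 (statistics loading after sharding) was fixed in 7.6.
        "issue_8478_loading_after_sharding": ver >= (7, 6),
        # Issue 8933 (sharded stats sometimes counted twice) was fixed in 7.6.1.
        "issue_8933_double_counting": ver >= (7, 6, 1),
    }
```

For example, `sharding_fixes_included("7.6")` reports the first fix as present and the second as missing.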

Tim
