Troubleshooting failed yearly Solr statistics sharding


Alan Orth

Feb 6, 2020, 7:50:59 AM
to DSpace Technical Support
Dear list,

Our yearly Solr statistics sharding (stats-util -s) failed this year, apparently timing out somewhere because our core is very large (43 GiB). It failed again when I tried to run it manually:

Moving: 51633080 into core statistics-2019
...
Exception: Read timed out
java.net.SocketTimeoutException: Read timed out

As a test, I used a really great tool called solr-import-export-json to export some of my 2019 statistics to JSON on the production server and then import them into a new core in my development instance:

$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
$ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o /tmp/statistics-2019-01.json -k uid

This worked brilliantly... in fact I am very impressed with this tool and recommend it to people!
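If it helps anyone trying this, here is a quick way to sanity-check the import by asking the new core for a document count (port and core name match the import command above; the URL is a sketch, so inspect it before running):

```shell
# Count documents in the new core; rows=0 returns only the count,
# not the documents themselves. Port and core name as in the import above.
SOLR=http://localhost:8080/solr/statistics-2019
QUERY="$SOLR/select?q=dateYearMonth:2019-01&rows=0&wt=json"
echo "$QUERY"        # inspect the URL first
# curl -s "$QUERY"   # numFound should equal the number of exported docs
```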

The problem is that this core does not get enumerated automatically by Solr after I restart the servlet container. I got it to load by hard-coding the core into the dspace/solr/solr.xml config, but that seems hacky. How are these core shards enumerated by DSpace's Solr application? What would cause shards to not be loaded automatically?

My environment is DSpace 5.8 with Tomcat 7.0.99 and OpenJDK 8.

Thanks,

Mark H. Wood

Feb 6, 2020, 10:03:09 AM
to DSpace Technical Support
I think that a good place to look is
'dspace-api/src/main/java/org/dspace/statistics/SolrLoggerServiceImpl#initSolrYearCores'.
Also #createCore in the same class. This is where DSpace enumerates
the cores that it will use for statistics. It seems to be looking for
directories 'solr/statistics-YYYY'. It will call CREATE in Solr's
CoreAdmin API, which would seem to register a core if it already
exists. You seem to be doing the same thing, but there must be
something slightly different about your actions. Or perhaps the way
you are testing -- it looks to me as though Solr is unaware of the
additional cores at startup and is told of them by DSpace when *it*
starts up.

But I think it is actually DSpace that is doing something hacky:
using the same InstanceDir for multiple cores. I have no idea why
that works.
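For reference, I believe the CREATE call ends up looking something like this through the CoreAdmin web API (host, port, and paths are illustrative; note the shared instanceDir):

```shell
# Sketch of the CoreAdmin CREATE call DSpace effectively issues when it
# registers a yearly shard. The dataDir must already exist on disk.
SOLR=http://localhost:8080/solr
CORE=statistics-2019
# All statistics cores share one instanceDir (and thus one schema and
# config); only the dataDir differs per year.
URL="$SOLR/admin/cores?action=CREATE&name=$CORE&instanceDir=statistics&dataDir=$CORE/data"
echo "$URL"          # inspect before running
# curl -s "$URL"     # uncomment to actually issue the CREATE
```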

Sadly, SolrJ is almost entirely undocumented, at least in this area.
I have had to puzzle out a lot of its working by reference to the web
API documentation in the Solr Ref Guide.

--
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu

Alan Orth

Feb 26, 2020, 3:22:04 PM
to DSpace Technical Support
Dear Mark,

After exporting a slice of my 2019 statistics from production, I've just done two experiments in my development environment: manually creating a `statistics-2019` core and loading the 2019 hits into it, and loading data into the main `statistics` core and initiating the `dspace stats-util -s` yearly sharding process. In both cases the core's data is online and available immediately after it is loaded. However, the manually created core does not get loaded the next time I restart Tomcat, while the DSpace-created core does.

Regarding DSpace doing something "hacky" by using multiple data-only cores that share an instanceDir, I'm also wondering how that fits into Solr's official use cases! I want to add some debug logging to SolrLoggerServiceImpl.java (DSpace 6.x) to try to understand why my manually created core doesn't get loaded. Possibly related: about half the time we start Tomcat on our production server, one of the cores fails to load anyway!

To be honest, it's making me a bit nervous about running with all these shards (we have ten, going back to 2010!), and I am debating whether I should just put everything back in the main statistics core. How does the migration to a more modern Solr with DSpace 7 look with our "hacky" sharding?
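In the meantime, one way I can at least check which cores Solr actually loaded after a restart is the CoreAdmin STATUS action (host and port are a sketch for my environment):

```shell
# List the cores Solr actually loaded, via the CoreAdmin STATUS action.
# Any yearly shard missing from the response was not registered at startup.
SOLR=http://localhost:8080/solr
URL="$SOLR/admin/cores?action=STATUS&wt=json"
echo "$URL"          # inspect before running
# curl -s "$URL"     # each loaded core appears under "status"
```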

Regards,


Mark H. Wood

Feb 28, 2020, 10:12:22 AM
to DSpace Technical Support
On Wed, Feb 26, 2020 at 10:21:48PM +0200, Alan Orth wrote:
> After having exported a slice of my 2019 statistics from production I've
> just done two experiments in my development environment: manually create a
> `statistics-2019` core and load the 2019 hits into it, and load data into
> the main `statistics` core and initiate the `dspace stats-util -s` yearly
> sharding process. In both cases the core's data is online and available
> immediately after it is loaded. In the first case the manually created core
> does not get loaded the next time I restart Tomcat, while in the second
> case the DSpace-created core does.

Clearly you and DSpace are doing something different, but I don't know
what it could be.

> Regarding DSpace doing something "hacky" in using multiple data-only cores
> that share an instanceDir, I'm also wondering how that fits into the
> official use cases of Solr! I want to add some debug logging to
> SolrLoggerServiceImpl.java (DSpace 6.x) to try to understand why my
> manually-created core doesn't get loaded. Possibly related, about half the
> time we start Tomcat on our production server one of the cores fails to
> load anyways! To be honest it's making me a bit nervous about running with
> all these shards (we have ten, back to 2010!) and I am debating whether I
> should just put everything back in the main statistics core.

I think you need more information from the Solr service itself, to
debug core startup issues. Stock Solr logs quite a bit at startup by
default, and can log a lot more. Finer-grained settings in
config/log4j-solr.properties may help, without overwhelming you with
normal-operation chit-chat. I would set log4j.rootLogger up to WARN
or ERROR, log4j.logger.org.apache.solr.client to ERROR, and
log4j.logger.org.apache.solr to INFO or even DEBUG, and see what you
get at startup. You could even send org.apache.solr.client off to a
separate file via a separate appender (and setting its 'additivity' to
false), to completely separate client and server logging. I have no
more-precise suggestions here.
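Putting those settings together, the relevant part of config/log4j-solr.properties might look roughly like this (log4j 1.x properties syntax; the appender definition and file path are illustrative):

```properties
# Quiet normal operation, but keep server-side startup detail.
log4j.rootLogger=WARN, file
log4j.logger.org.apache.solr=INFO
# Send client-side chatter to its own file, and stop it from also
# flowing up to the root logger (additivity=false).
log4j.logger.org.apache.solr.client=ERROR, client
log4j.additivity.org.apache.solr.client=false
log4j.appender.client=org.apache.log4j.FileAppender
log4j.appender.client.File=logs/solr-client.log
log4j.appender.client.layout=org.apache.log4j.PatternLayout
log4j.appender.client.layout.ConversionPattern=%d %p %c: %m%n
```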

> How does the migration process to a more modern Solr with DSpace 7
> look with our "hacky" sharding?

I haven't tried it yet, because we don't do sharding here so I have no
experience with how it *ought to* look. There are no changes to
DSpace's sharding code, so I would expect it to act much the same as
in previous versions.