DSpace 7.5 Solr Statistics migration from 5.10 with sharding by year


Tomas Hajek

Mar 24, 2023, 5:37:19 PM
to dspac...@googlegroups.com
Hello,
I am migrating a DSpace 5.10 installation to a new server running DSpace 7.5. The basic installation is running on RHEL 8.7 with Tomcat 9.0.71, Solr 8.11.2, Node.js 16.18.1, and pm2 5.2.2.
I was able to import the database and assetstore, and I set up the Solr cores (authority, oai, search, statistics) following the installation instructions.
The Solr statistics from the 5.10 installation are sharded by year, and I exported them with the following commands:

bin/dspace solr-export-statistics -i statistics-2015
bin/dspace solr-export-statistics -i statistics-2016
...
bin/dspace solr-export-statistics -i statistics-2022

I have copied the exported files to the new 7.5 server into /opt/dspace/solr-export and am trying to import them but I get the following error (example when trying to import the 2015 statistics):

/opt/dspace/bin/dspace solr-import-statistics -i statistics-2015
Exception: Error from server at http://localhost:8983/solr/statistics-2015: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/statistics-2015/admin/luke</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>

</body>
</html>

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/statistics-2015: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/statistics-2015/admin/luke</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>

</body>
</html>

at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:635)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:214)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:231)
at org.dspace.util.SolrImportExport.getMultiValuedFields(SolrImportExport.java:482)
at org.dspace.util.SolrImportExport.importIndex(SolrImportExport.java:433)
at org.dspace.util.SolrImportExport.main(SolrImportExport.java:148)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:277)
at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:133)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:98)

Presumably this happens because the sharded statistics-20## cores are not configured in Solr, but I'm not sure how to add and configure them so that I can import the statistics. I am not very familiar with Solr.
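One way to confirm that diagnosis is to ask Solr which cores it actually has. The sketch below is only illustrative: it assumes the Solr base URL from the error message, uses Solr's CoreAdmin STATUS action, and relies on the status entry for an unknown core coming back empty. The helper names `core_status_url` and `core_exists` are hypothetical, not part of DSpace or Solr.

```bash
#!/usr/bin/env bash
# Assumption: Solr base URL taken from the error message; override via the environment.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Build the CoreAdmin STATUS URL for one core (hypothetical helper).
core_status_url() {
  printf '%s/admin/cores?action=STATUS&core=%s&indexInfo=false&wt=json\n' "$SOLR_BASE" "$1"
}

# Succeeds only if the STATUS response describes the core.
# Assumption: for a core that does not exist the status entry is empty,
# so no "instanceDir" key appears in the response.
core_exists() {
  curl -fsS "$(core_status_url "$1")" | grep -q '"instanceDir"'
}

# Usage (needs a running Solr):
#   core_exists statistics      && echo "statistics core is present"
#   core_exists statistics-2015 || echo "statistics-2015 core is missing"
```

If `statistics` is present and `statistics-2015` is not, the 404 on `/solr/statistics-2015/admin/luke` is simply the import tool talking to a core that was never created.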

Can anyone enlighten me on how I might do this, correct my steps, or let me know what else to look at?

Any assistance would be greatly appreciated.
Thank you,
 -Tomas

James Holobetz

Apr 4, 2023, 11:25:19 AM
to Tomas Hajek, dspac...@googlegroups.com
Hi Tomas,

I recently had this issue and I believe I have found a solution, which I will document in the next few days. The short version is that DSpace 7 does not support Solr statistics shards: you have to combine the yearly shards into one large Solr core (statistics). The biggest problem I found was that DSpace was ingesting only the current year's statistics. The solution was to rename the CSV files that solr-export-statistics dumps. For example, the CSV files for the Solr core "statistics-2012" look something like statistics-2012_export_2013-12_5.csv. You have to remove the -2012 from every filename so that it looks like statistics_export_2013-12_5.csv. I downloaded the zipped-up cores in CSV form to my Windows machine so I could use a bulk rename tool to remove the year suffix from the files of each core. I then uploaded them to my Linux box running DSpace and ingested each one using the solr-import-statistics tool. This is a very time-consuming task.

Hope this helps and I will try to document this in the next few days.

Best regards,

James Holobetz


Tomas Hajek

Apr 13, 2023, 2:54:47 PM
to James Holobetz, dspac...@googlegroups.com
Thanks for the information.  I had similar information sent to me by another person on the list so it seems we are all approaching this about the same way.  I think I have my statistics imported at this point.  
To rename the statistics files I wound up using the following one-liner:
cd /opt/dspace/solr-export && for i in *.csv; do nf=$(printf '%s\n' "$i" | sed 's/-20[1-2][0-9]_e/_e/g'); [ "$i" = "$nf" ] || mv -v -- "$i" "$nf"; done

Another suggestion was to use rename (e.g. rename "s/-2015_export/_export/g"  *) but for whatever reason that was not working for me on my RHEL 8.7 server.
After renaming the files and running /opt/dspace/bin/dspace solr-import-statistics -i statistics it looks like I have the statistics imported.
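A somewhat more defensive variant of that rename step, offered only as a sketch, assuming bash and the export filename pattern quoted earlier in the thread (statistics-YYYY_export_...csv). It defaults to a dry run and skips any file whose target name already exists. The function names `strip_shard_year` and `rename_exports` are hypothetical. As a side note, the `rename` shipped with RHEL is typically the util-linux one, which takes a literal from/to pair rather than a Perl `s/.../.../` expression; that may be why the Perl-style call did nothing.

```bash
#!/usr/bin/env bash
# Print the filename with the "-YYYY" shard suffix removed (hypothetical helper).
# Names that do not match the assumed pattern are printed unchanged.
strip_shard_year() {
  local name=$1
  local re='^(statistics)-[0-9]{4}(_export_.*\.csv)$'
  if [[ $name =~ $re ]]; then
    printf '%s%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    printf '%s\n' "$name"
  fi
}

# rename_exports DIR [yes|no]: second argument "no" performs the renames,
# anything else (default "yes") only prints what would happen.
rename_exports() {
  local dir=$1 dry_run=${2:-yes} f base new
  for f in "$dir"/statistics-*_export_*.csv; do
    [ -e "$f" ] || continue                 # glob matched nothing
    base=${f##*/}
    new=$(strip_shard_year "$base")
    [ "$new" = "$base" ] && continue        # nothing to strip
    if [ -e "$dir/$new" ]; then
      echo "skip (target exists): $base -> $new" >&2
      continue
    fi
    if [ "$dry_run" = yes ]; then
      echo "would rename: $base -> $new"
    else
      mv -v -- "$f" "$dir/$new"
    fi
  done
}

# Usage:
#   rename_exports /opt/dspace/solr-export        # dry run
#   rename_exports /opt/dspace/solr-export no     # actually rename
#   /opt/dspace/bin/dspace solr-import-statistics -i statistics
```

Refusing to overwrite matters here because files from several shards end up sharing one name prefix after the rename.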

Thanks for the help.
-Tomas
--

                Tomas Hajek
                ha...@oakland.edu
                1-248-370-3505
                Assistant Director, Research Computing and Infrastructure Engineering
                University Technology Services
                Oakland University

Tim Donohue

Apr 13, 2023, 3:13:00 PM
to DSpace Technical Support
Thanks James & Tomas for sharing your hints/tips here!  It's obvious we didn't document this very well in the DSpace 7 Upgrade process.  Just now, I've done my best to summarize your advice & add more hints in Step 10(a) of the Upgrade process to help others along. I even linked folks back to this useful dspace-tech thread for more details.

https://wiki.lyrasis.org/display/DSDOC7x/Upgrading+DSpace

If others have hints/tips to share, please do feel free to continue this thread, or add comments to the docs & we'll get them taken into account.

Tim

Nicholas Woodward

Jun 11, 2023, 4:35:12 PM
to DSpace Technical Support
Has this approach of importing all previous years' statistics into the "statistics" Solr core worked for others who have a lot of stats? For the last few days I've been trying to import all of the exported statistics files listed below, after renaming the beginning of each CSV file to "statistics-XXXX-...". No matter how high I set the `http.socket.timeout` parameter in Solr, I get the SocketTimeoutException below when importing the last ZIP file (statistics.zip).

I'm working with the most recent code on the main branch of the DSpace repository. I've increased the Java memory given to Solr to 2GB and added the same amount to the `bin/dspace` command, but that didn't seem to help, and in some cases it made things worse. At the time I get the socket timeout error and the import-statistics process stops running, the "statistics" core usually has anywhere from 20 to 30 million docs in the index.
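The document count mentioned above can be read straight from Solr's select handler, which helps to see how far an import got before it stopped. This is a rough sketch, assuming the same Solr base URL as in the error message and a JSON response; `extract_num_found` and `core_doc_count` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Assumption: Solr base URL as in the error message; override via the environment.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Pull the first numFound value out of a Solr JSON response read from stdin.
extract_num_found() {
  grep -oE '"numFound" *: *[0-9]+' | head -n 1 | grep -oE '[0-9]+$'
}

# Ask the given core how many documents match *:* (rows=0 returns no documents).
core_doc_count() {
  curl -fsS -G "$SOLR_BASE/$1/select" \
    --data-urlencode 'q=*:*' --data 'rows=0' --data 'wt=json' | extract_num_found
}

# Usage (needs a running Solr):
#   core_doc_count statistics
```

Running it before and after each import gives a quick check that the batch was actually added to the core.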

Error message: 

Problem encountered while trying to import index statistics.
org.apache.solr.client.solrj.SolrServerException: Timeout occurred while waiting response from server at: http://localhost:8983/solr/statistics
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:692)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
at org.dspace.util.SolrImportExport.importIndex(SolrImportExport.java:465)
at org.dspace.util.SolrImportExport.main(SolrImportExport.java:148)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:277)
at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:133)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:98)

Caused by: java.net.SocketTimeoutException: Read timed out
at java.base/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:283)
at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:309)
at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:350)
at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:803)
at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:571)
... 12 more


Statistics files:

52MB Jun  7 08:54 statistics-2014.zip
130MB Jun  7 08:58 statistics-2015.zip
222MB Jun  7 09:05 statistics-2017.zip
46MB Jun  7 09:06 statistics-2018.zip
300MB Jun  7 09:15 statistics-2019.zip
273MB Jun  7 09:22 statistics-2020.zip
415MB Jun  7 09:36 statistics-2021.zip
30MB Jun  7 09:37 statistics-2022.zip
687MB Jun  7 10:02 statistics.zip


Thanks,
Nick

Mohammad S. AlMutairi

Jun 11, 2023, 5:00:36 PM
to DSpace Technical Support
You might need to double-check the Tomcat connector settings.

Edit /etc/tomcat9/server.xml and replace the Catalina Connector element on lines 69, 70 and 71 with the connector element below:
    <Connector address="127.0.0.1" port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               maxHttpHeaderSize="65536"
               minSpareThreads="25"
               enableLookups="false"
               disableUploadTimeout="true"
               connectionUploadTimeout="120000"
               URIEncoding="UTF-8"/>


Hope it helps.

Mo.

Nicholas Woodward

Jun 14, 2023, 12:02:08 PM
to DSpace Technical Support
Hi Mohammad,
Thank you for the suggestion! My Tomcat connector was missing the connectionUploadTimeout="120000" parameter. After I added that and restarted Tomcat I was able to import all of my statistics. 

Thanks,
Nick