Issue with Dspace 7 search

Snickers

Aug 30, 2022, 12:32:51 AM

to DSpace Technical Support

Hi,

I am migrating DSpace from 5.8 to a new 7.3 installation. I have followed the documentation and completed all the tasks in https://wiki.lyrasis.org/display/DSDOC7x/Migrating+DSpace+to+a+new+server and https://wiki.lyrasis.org/display/DSDOC7x/Upgrading+DSpace

For example, the same search returns 26 items in DSpace 7.3 but 79 items in DSpace 5.8. When I search for a blank space, the total counts are similar, e.g. 5172 and 5192.

Has anyone experienced this? To be clear, I have run the reindexing commands from the documentation above and they completed successfully.

I found nothing useful in the logs, since this is technically not an error.

Any idea or suggestion would be appreciated.

Regards,

Bryan

Tim Donohue

Aug 30, 2022, 1:33:24 PM

to DSpace Technical Support

Hi Bryan,

It's really hard to say what could be going on without you digging more into which items matched in 5.8 but didn't match in 7.3 (or vice versa). It could be that 5.8 was actually returning incomplete results and the results are *more accurate* in 7.3. Or, as you imply, it could also be the other way around...somehow 7.3 isn't returning results as accurate as those from 5.8.

But, it is worth pointing out that the Solr search settings under DSpace are enhanced little by little in every release. So, there were many changes/improvements in 6.x and continue to be more in 7.x. We've also upgraded Solr several times in those releases, so it's possible that Solr itself is returning slightly different results based on its new/updated behavior.

Overall, until you dig more deeply into those search result differences between 5.x vs 7.x, I wouldn't assume that there's a bug in 7.x. There's also the possibility you are just seeing improvements that resulted in more accurate results. But, that said, if you are able to pinpoint some sort of buggy behavior, then definitely let us know & we'll work to get it assigned and fixed in a future 7.x release.
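One way to do that kind of digging is to run the same query on both versions, save the handles each one returns, and diff the two sets. A minimal Python sketch, assuming you have each result list as plain handle strings (the function name and the handle values below are made up for illustration):

```python
def diff_results(handles_58, handles_73):
    """Return (only_in_58, only_in_73) as sorted lists of handles."""
    old, new = set(handles_58), set(handles_73)
    return sorted(old - new), sorted(new - old)


# Hypothetical handles, for illustration only.
results_58 = ["123456789/10", "123456789/11", "123456789/12"]
results_73 = ["123456789/11", "123456789/13"]

only_in_58, only_in_73 = diff_results(results_58, results_73)
print("Matched in 5.8 but not in 7.3:", only_in_58)
print("Matched in 7.3 but not in 5.8:", only_in_73)
```

The items in the first list are the ones worth opening by hand to see what they have in common.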

Tim

Snickers

Aug 31, 2022, 4:59:29 PM

to DSpace Technical Support

Hi Tim,

Thank you for your response. I am sure that many improvements have been made to DSpace and Solr across versions, and I appreciate the effort of the devs.

I looked a bit deeper into the search results from both 5.8 and 7.3. It seems that the search finds the keyword in the thesis text. I found an item where the keyword is mentioned once in the text and 7.3 found it. However, I also found a few items where the keyword appears one or more times in the text that 7.3 did not find but 5.8 did.

Where could I look to resolve this issue? The number of items is similar and the items appear to have been migrated successfully. I have successfully run the full reindex commands found in Step 4 of the migration doc:

# Reindex all your content in DSpace

./dspace index-discovery -b

# (Optionally) also reindex everything into OAI-PMH endpoint

./dspace oai import

Please help. Any suggestion would be appreciated.

Regards,

Bryan

Tim Donohue

Aug 31, 2022, 5:16:33 PM

to Snickers, DSpace Technical Support

Hi Bryan,

Search results issues can be difficult to track down without very specific examples or even links to a public website (feel free to use our demo7.dspace.org site to try and reproduce issues).

Usually, it's best to look for common patterns in the results you are seeing, as that may be helpful to us in tracking down what those behaviors have in common (e.g. if all the files that do not match searches properly are PDFs, that's a clue. Or, if they all are large files, that'd be a different clue. Or, if you find a specific metadata field isn't searchable, that's yet another clue.)

Since you specified that one difference is in searching the full text of a document, it's possible that changes/updates to the full text indexing in DSpace 7.3 could be impacting your results.

For instance, by default in DSpace 7.3, only the first 100,000 characters of a document are searchable. However, you can change this default in a configuration here: https://github.com/DSpace/DSpace/blob/main/dspace/config/dspace.cfg#L492-L498

(Notice in the comments that you'd have to re-extract text and re-index if you change this setting. Instructions are in those comments)
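To check whether that limit could explain a particular miss, you can look at where the keyword first appears in the document's full text. A rough Python sketch, assuming you have the full text available as a string (the helper names are hypothetical, and treating a negative limit as "no limit" is an assumption of this sketch):

```python
import re

DEFAULT_MAX_CHARS = 100_000  # the default limit mentioned above


def first_offset(text, keyword):
    """Return the offset of the first case-insensitive match, or -1."""
    match = re.search(re.escape(keyword), text, flags=re.IGNORECASE)
    return match.start() if match else -1


def within_limit(text, keyword, max_chars=DEFAULT_MAX_CHARS):
    """True if the keyword first appears entirely within the first max_chars characters.

    A negative max_chars is treated as "no limit" (assumption of this sketch).
    """
    offset = first_offset(text, keyword)
    if offset < 0:
        return False
    if max_chars < 0:
        return True
    return offset + len(keyword) <= max_chars


# Hypothetical document: the keyword only appears after 150,000 characters.
sample_text = "x" * 150_000 + " Thesis conclusion"
print(within_limit(sample_text, "thesis"))                # beyond the default limit
print(within_limit(sample_text, "thesis", max_chars=-1))  # no limit applied
```

If the keyword appears well inside the limit and the item still does not match, the limit is probably not the cause.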

That's a very quick guess though based on the limited info you've been able to provide so far. I'd recommend looking more closely at your results for patterns or common clues...that might be able to help us figure out what the cause may be (and whether it's a bug, or maybe just a configuration that needs to be tweaked).
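As a sketch of what looking for patterns could mean in practice, you could tally the files attached to the non-matching items by extension and by size. This assumes you have collected (filename, size in bytes) pairs yourself; the sample data and the "large file" threshold below are arbitrary and only for illustration:

```python
from collections import Counter
from pathlib import PurePosixPath

LARGE_FILE_BYTES = 10 * 1024 * 1024  # arbitrary threshold for this sketch


def summarize(missing_files):
    """Count files by lowercase extension and by size bucket."""
    by_ext = Counter(
        PurePosixPath(name).suffix.lower() or "(none)" for name, _ in missing_files
    )
    by_size = Counter(
        "large" if size >= LARGE_FILE_BYTES else "small" for _, size in missing_files
    )
    return by_ext, by_size


# Hypothetical files from items that 7.3 did not match.
missing = [
    ("thesis_a.pdf", 25_000_000),
    ("thesis_b.PDF", 1_200_000),
    ("report.doc", 300_000),
]

by_ext, by_size = summarize(missing)
print(by_ext.most_common())
print(by_size.most_common())
```

A tally dominated by one extension or one size bucket would be the kind of clue described above.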

Tim

Snickers

Nov 22, 2022, 2:47:14 PM

to DSpace Technical Support

Hi Tim,

Thank you for the details you provided. We have made progress on this issue.

I set "textextractor.max-chars = -1" and "textextractor.use-temp-file = true", then restarted Tomcat.

On a machine with 8GB RAM and max heap size 6G, I ran the "filter-media -f" command. It ran for a while then failed with the output below:

----------------------------------------------------------------------------------

File: SundararajA.pdf.jpg
FILTERED: bitstream 84c9128e-34a7-42e5-a83d-64be008bb082 (item: 10292/14803) and created 'SundararajA.pdf.jpg'
File: ATEM Poster - Serena OP.pdf.txt
FILTERED: bitstream 76432201-ee2b-481c-8c18-2889c935b2df (item: 10292/4602) and created 'ATEM Poster - Serena OP.pdf.txt'
File: ATEM Poster - Serena OP.pdf.jpg
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 837021948 bytes for AllocateHeap
# An error report file with more information is saved as:

-----------------------------------------------------------------------------------

I have attached the error report "hs_err_pid4031.log".
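As a rough sanity check on those numbers, the failed allocation can be compared with the RAM left outside the Java heap. This is only back-of-the-envelope arithmetic: it assumes the 6G max heap applies to the JVM that ran filter-media, and that once the heap is fully grown only the remainder is left for the OS and the JVM's own native allocations.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

total_ram = 8 * GIB          # machine RAM, from the message above
max_heap = 6 * GIB           # configured max heap, from the message above
failed_alloc = 837_021_948   # bytes, from the JVM error output above

# RAM not reserved for the Java heap (assumption: the heap can grow to its max).
headroom = total_ram - max_heap

print(f"Failed native allocation: {failed_alloc / MIB:.0f} MiB")
print(f"RAM outside the max heap: {headroom / MIB:.0f} MiB")
print(f"Share of that headroom:   {failed_alloc / headroom:.0%}")
```

The single failed request is roughly 798 MiB, a large share of the 2 GiB that would remain outside a fully grown heap.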

Disk usage is as follows:

--------------------------------------------------------

Filesystem                      Type  Size  Used  Avail  Use%  Mounted on
/dev/mapper/vg_root-lv_root     xfs   20G   6.9G  14G    35%   /
/dev/sda1                       xfs   507M  221M  287M   44%   /boot
/dev/mapper/vg_root-lv_var      xfs   8.0G  3.0G  5.1G   38%   /var
/dev/sdc                        xfs   1.0T  454G  570G   45%   /DISK2
/dev/mapper/vg_root-lv_tmp      xfs   16G   1.1G  15G    7%    /tmp
/dev/mapper/vg_root-lv_var_log  xfs   4.0G  597M  3.5G   15%   /var/log
---------------------------------------------------------

Any idea or suggestion would be much appreciated.

Regards,

Bryan

hs_err_pid4031.log

Snickers

Nov 27, 2022, 7:58:21 PM

to DSpace Technical Support

Hi Tim,

For additional info, here is an example - https://openrepository.aut.ac.nz/handle/10292/194

The item has a .doc file with 42,000 characters.

With "textextractor.max-chars = 1000000" (the default with one more 0 added), "filter-media -f" took a day to complete and "index-discovery -b" ran for about 10 minutes. When I search for "noodle", DSpace 5.8 finds this item but DSpace 7.3 does not (after the full scan and re-index). The full scan fails with an out-of-memory error when "textextractor.max-chars = -1" is set.