Full text content not being found in index for a particular item(s)

Feed My Lambs Esq.

Dec 10, 2015, 10:04:07 AM
to DSpace Technical Support
I recently discovered the need to run `dspace filter-media` in order to have the full text of items searchable, but even after doing that I am getting what seem to be inconsistent results. Some items have their full text searchable because the search returns those items. However, some are inexplicably not showing up when I search for them.

My prime example is the item 196004. I run the following command:

E:\dspace\bin>dspace filter-media -f -i 123456789/196004
Using DSpace installation in: E:\dspace
File: HammonsJ_2012-2_BODY.pdf.txt
FILTERED: bitstream 24981 (item: 123456789/196004) and created 'HammonsJ_2012-2_BODY.pdf.txt'
File: HammonsJ_2012-2_ABSTRACT.pdf.txt
FILTERED: bitstream 24982 (item: 123456789/196004) and created 'HammonsJ_2012-2_ABSTRACT.pdf.txt'

I read in the documentation that this filter-media routine will automatically update the DSpace search index by default.

I even found this in the DSpace log file: 
2015-12-10 09:50:35,424 INFO  org.dspace.core.ConfigurationManager @ Loading from classloader: file:/E:/dspace/config/dspace.cfg
2015-12-10 09:50:35,455 INFO  org.dspace.core.ConfigurationManager @ Using dspace provided log configuration (log.init.config)
2015-12-10 09:50:35,455 INFO  org.dspace.core.ConfigurationManager @ Loading: E:/dspace/config/log4j.properties
2015-12-10 09:50:39,658 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS is 'PostgreSQL'
2015-12-10 09:50:39,658 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is '9.4.1'
2015-12-10 09:50:39,736 INFO  org.dspace.storage.rdbms.DatabaseUtils @ Loading Flyway DB migrations from: filesystem:E:/dspace/etc/postgres, classpath:org.dspace.storage.rdbms.sqlmigration.postgres, classpath:org.dspace.storage.rdbms.migration
2015-12-10 09:50:39,799 INFO  org.flywaydb.core.internal.dbsupport.DbSupportFactory @ Database: jdbc:postgresql://localhost:5432/dspace (PostgreSQL 9.4)
2015-12-10 09:50:39,924 INFO  org.dspace.storage.rdbms.DatabaseUtils @ DSpace database schema is up to date
2015-12-10 09:50:40,408 INFO  org.dspace.content.MetadataField @ Loading MetadataField elements into cache.
2015-12-10 09:50:40,440 INFO  org.dspace.content.MetadataSchema @ Loading schema cache for fast finds
2015-12-10 09:50:43,862 INFO  org.dspace.content.Bitstream @ anonymous::create_bitstream:bitstream_id=72780
2015-12-10 09:50:43,877 INFO  org.dspace.content.Bundle @ anonymous::add_bitstream:bundle_id=19615,bitstream_id=72780
2015-12-10 09:50:44,174 INFO  org.dspace.content.Bitstream @ anonymous::update_bitstream:bitstream_id=72780
2015-12-10 09:50:44,330 INFO  org.dspace.content.Bundle @ anonymous::remove_bitstream:bundle_id=19615,bitstream_id=72778
2015-12-10 09:50:44,346 INFO  org.dspace.content.Item @ anonymous::update_item:item_id=13562
2015-12-10 09:50:44,346 INFO  org.dspace.content.Bitstream @ anonymous::update_bitstream:bitstream_id=72780
2015-12-10 09:50:44,362 INFO  org.dspace.content.Bitstream @ anonymous::delete_bitstream:bitstream_id=72778
2015-12-10 09:50:44,377 INFO  org.dspace.content.Item @ anonymous::update_item:item_id=13562
2015-12-10 09:50:44,580 INFO  org.dspace.content.Bitstream @ anonymous::create_bitstream:bitstream_id=72781
2015-12-10 09:50:44,596 INFO  org.dspace.content.Bundle @ anonymous::add_bitstream:bundle_id=19615,bitstream_id=72781
2015-12-10 09:50:44,612 INFO  org.dspace.content.Bitstream @ anonymous::update_bitstream:bitstream_id=72781
2015-12-10 09:50:44,627 INFO  org.dspace.content.Bundle @ anonymous::remove_bitstream:bundle_id=19615,bitstream_id=72779
2015-12-10 09:50:44,643 INFO  org.dspace.content.Item @ anonymous::update_item:item_id=13562
2015-12-10 09:50:44,643 INFO  org.dspace.content.Bitstream @ anonymous::update_bitstream:bitstream_id=72781
2015-12-10 09:50:44,643 INFO  org.dspace.content.Bitstream @ anonymous::delete_bitstream:bitstream_id=72779
2015-12-10 09:50:44,643 INFO  org.dspace.content.Item @ anonymous::update_item:item_id=13562
2015-12-10 09:50:44,643 INFO  org.dspace.event.EventManager @ 
2015-12-10 09:50:48,080 INFO  org.dspace.discovery.SolrServiceImpl @ Wrote Item: 123456789/196004 to Index
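
The last log line confirms that the item was written to the index, but it says nothing about which fields were written. As a minimal sketch, assuming the log format shown above, the handles reported as indexed can be pulled out of a log file like this (the function name and the idea of scanning the log are illustrative, not part of DSpace):

```python
import re

# Matches the Discovery log line shown above, for example:
# "... org.dspace.discovery.SolrServiceImpl @ Wrote Item: 123456789/196004 to Index"
WROTE_ITEM = re.compile(r"SolrServiceImpl @ Wrote Item: (\S+) to Index")


def indexed_handles(log_lines):
    """Return the handles the log reports as written to the index, in log order."""
    handles = []
    for line in log_lines:
        match = WROTE_ITEM.search(line)
        if match:
            handles.append(match.group(1))
    return handles
```

Passing the lines of the DSpace log file to `indexed_handles` shows whether a handle was written at all; checking the full-text content itself still requires looking at the index.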

What gives? I can find the metadata for the item -- just not a fairly unique word in the full text of the PDF.

I even ran `index-discovery -bfo` at one point to try to force a full reindex with the full text already built (I think).

E:\dspace\bin>dspace version
Using DSpace installation in: E:\dspace
DSpace version:  5.4-SNAPSHOT
  SCM revision:  ${buildNumber}
    SCM branch:  UNKNOWN_BRANCH
            OS:  Windows Server 2012 R2(amd64) version 6.3
  Applications:
     Discovery:  enabled.
           JRE:  Oracle Corporation version 1.8.0_65
   Ant version:  Apache Ant(TM) version 1.9.4 compiled on April 29 2014
 Maven version:  3.3.1
   DSpace home:  E:/dspace

Feed My Lambs Esq.

Dec 10, 2015, 4:32:57 PM
to DSpace Technical Support
While viewing the handle in administrator / edit item / item bitstreams / HammonsJ_2012-2_BODY.pdf.txt [view], I can verify that the PDF was appropriately converted to text and the unique words I see in the file aren't being found using search.

If filter-media is truly supposed to update the search index automatically, I am at a loss as to why a search for any given word via "All of dspace" doesn't return this item. Yet a few other items that also include the search term (in their full text) were found!

I just re-ran index-discovery -b and the item was included in the Log:
INFO  org.dspace.discovery.SolrServiceImpl @ Wrote Item: 123456789/196004 to Index

Can there be any reason why the full text wasn't included when this line of code ran? I see the bitstream .pdf.txt in the admin view after all.
Is there a way for the index to get built but not be the one in use? When I execute a command to drop the index the application loses all searching so I doubt that is happening...
Can certain collections have features that keep them from getting properly added to the index?
Do the authorizations or permissions for an item or collection have any say on whether an item can get found from the search functionality?
Are there flags that can exist to prevent an item from being full-text searchable?

Andrea Schweer

Dec 10, 2015, 5:46:21 PM
to Feed My Lambs Esq., DSpace Technical Support
Hi,


On 11/12/15 10:32, Feed My Lambs Esq. wrote:
> I just re-ran index-discovery -b and the item was included in the Log:
> INFO  org.dspace.discovery.SolrServiceImpl @ Wrote Item: 123456789/196004 to Index
>
> Can there be any reason why the full text wasn't included when this line of code ran? I see the bitstream .pdf.txt in the admin view after all.

Yes, I've seen this happen when the extracted text contained characters that aren't valid in XML.
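
One way to test this hypothesis is to scan the extracted text for characters that fall outside the XML 1.0 `Char` production. The sketch below assumes the `.pdf.txt` content has been saved or copied into a Python string; whether this particular item contains such characters is not known from the thread.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, code point) pairs for characters not allowed in XML 1.0."""
    return [(m.start(), ord(m.group())) for m in INVALID_XML_CHARS.finditer(text)]
```

An empty result means the text contains nothing XML 1.0 forbids, which would point away from this explanation.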


> Is there a way for the index to get built but not be the one in use? When I execute a command to drop the index the application loses all searching so I doubt that is happening...
> Can certain collections have features that keep them from getting properly added to the index?
> Do the authorizations or permissions for an item or collection have any say on whether an item can get found from the search functionality?

Yes. You'll only find those items via search that you can actually see -- when logged in to DSpace as a super-admin, you'll find all items except those marked private, see below (and I believe only "in archive" items will be included, so no unfinished submissions, no workflow tasks and possibly no withdrawn items).


> Are there flags that can exist to prevent an item from being full-text searchable?

Yes, if the item is marked "private" then it won't be searchable at all. Can you find this item using other types of search (e.g. title search)?

What happens when you query Solr directly? You'll have to do this on the server using e.g. curl or wget.

http://localhost:8080/solr/search/select?q=handle:123456789/196004&indent=true

If the item is indexed then you should get a long XML response with the item's metadata, the extracted fulltext content and some more information on the item's status / read permissions.
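
For anyone who prefers a script to curl or wget, here is a minimal Python sketch of the same check. It uses the URL from this message and assumes Solr answers in its XML format. Quoting the handle inside the query is an assumption added here because the handle contains a `/`. The function names are illustrative.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

SOLR_SELECT = "http://localhost:8080/solr/search/select"  # URL from the message above


def build_query_url(handle):
    """Build the select URL for one handle (handle quoted: an assumption, see lead-in)."""
    params = {"q": 'handle:"%s"' % handle, "indent": "true"}
    return SOLR_SELECT + "?" + urllib.parse.urlencode(params)


def field_values(response_xml, field):
    """Return every value stored under the given field name in the first <doc>."""
    doc = ET.fromstring(response_xml).find(".//doc")
    if doc is None:
        return []  # nothing indexed for this query
    values = []
    for node in doc:
        if node.get("name") != field:
            continue
        if node.tag == "arr":  # multi-valued field: one child element per value
            values.extend(child.text or "" for child in node)
        else:  # single-valued field
            values.append(node.text or "")
    return values


def fetch(handle):
    """Run the query against the local Solr instance and return the raw XML bytes."""
    with urllib.request.urlopen(build_query_url(handle)) as response:
        return response.read()
```

Run on the server, `field_values(fetch("123456789/196004"), "fulltext")` would return whatever text was actually indexed for the item, and the same call with `"read"` would list the read permission entries.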

cheers,
Andrea

-- 
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand
+64-7-837 9120

Feed My Lambs Esq.

Dec 11, 2015, 9:51:07 AM
to DSpace Technical Support, victor....@gmail.com
Thank you, Andrea!

I really appreciate learning about all these nuances and I have answers to all your questions. I wonder if this is a case of a malformed PDF/OCR/XML result, but it seems clear that something unexpected was encountered. (As a side note, I could launch a browser and load these documents on the server I'm debugging (v5) but not on our production server (v3), so permissions must have changed to enable this by default.)

Yes, this item is publicly viewable and searching for the title, author, contributor, abstract text etc. all find the item. I don't see any permissions information (untrained eye) but there is an array named "read" with string "g0" for the item in question.

One additional thing I have discovered is that the [fulltext] entry begins with [stream_size] information. The problem seems to manifest itself regardless of privacy settings for the item.

Here is the hopefully tell-tale view of the item's "Full text" lead-up and output. Note that the parts leading up to the fulltext get repeated and absorbed as part of the full-text. However, instead of using the full text that is actually available, it is a duplication of the abstract (which has its own .pdf.txt generated via media-filter). It looks like some sort of indexing bug to me.

<arr name="bi_5_dis_partial"> (just included for context -- not repeated inside the full text element)
<str>MacKay, Carolyn J. (Carolyn Joyce), 1954-</str>
</arr>
(all the following get duplicated inside the [fulltext] nodes)
<arr name="stream_size">
<str>1337</str>
</arr>
<arr name="stream_content_type">
<str>text/plain</str>
</arr>
<arr name="stream_name">
<str>HammonsJ_2012-2_ABSTRACT.pdf.txt</str>
</arr>
<arr name="stream_source_info">
<str>HammonsJ_2012-2_ABSTRACT.pdf.txt</str>
</arr>
<arr name="Content-Encoding">
<str>UTF-8</str>
</arr>
<arr name="Content-Type">
<str>text/plain; charset=UTF-8</str>
</arr>
<arr name="fulltext">
<str> stream_size 1337 stream_content_type text/plain stream_name HammonsJ_2012-2_ABSTRACT.pdf.txt stream_source_info HammonsJ_2012-2_ABSTRACT.pdf.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 ABSTRACT RESEARCH PAPER: Solidarity in an Online Community STUDENT: James William Hammons DEGREE: Master of Arts COLLEGE: Sciences and Humanities DATE: May 2012 PAGES: 64.....[extracted for space reasons].....is strongly supported. </str>
</arr>
<arr name="fulltext_hl">
<str> stream_size 1337 stream_content_type text/plain stream_name HammonsJ_2012-2_ABSTRACT.pdf.txt stream_source_info HammonsJ_2012-2_ABSTRACT.pdf.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 ABSTRACT RESEARCH ...[truncated]

Now that we can see the problem -- have you run into this before? What can we try to get the impacted items corrected?

What I believe is happening is that the two xml nodes [fulltext] and [fulltext_hl] are getting populated by the last bitstream's full text (along with stream_size, stream_content_type, etc. for whatever reason). Whichever is placed last is getting put into the index.

For items with only 1 bitstream the full text is searchable because the first *is the last* (e.g. no body + abstract pdfs -- but if the BODY was listed second I believe it would be searchable). BUT! Even the items with only 1 bitstream can still have the [stream_size], etc. dumped into the [fulltext] node. A search for "stream_size" returned 12k of our 25k items. And interestingly, 26 of the 120 items in the demo dspace. I wonder what the index looks like for http://demo.dspace.org/xmlui/handle/10673/5. Do you have access to view it?
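
The two observations in the preceding paragraphs can be checked directly against Solr rather than through the search page. The sketch below assumes Solr's XML response format and the field names from the output above. `leaked_fields` reports which stream field names appear inside a fulltext value, while `count_url` and `num_found` build and read a count-only query. All three names are illustrative, and how the `fulltext` field is tokenized is not known here, so the count may differ from what the search page reports.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Field names seen inside the fulltext value in the Solr output above.
STREAM_FIELDS = ("stream_size", "stream_content_type", "stream_name",
                 "stream_source_info", "Content-Encoding", "Content-Type")


def leaked_fields(fulltext):
    """Return the stream field names that appear as words in a fulltext value."""
    words = set(fulltext.split())
    return [name for name in STREAM_FIELDS if name in words]


def count_url(base, term):
    """URL for a query that only asks how many documents match term in fulltext."""
    params = {"q": "fulltext:" + term, "rows": "0"}
    return base + "?" + urllib.parse.urlencode(params)


def num_found(response_xml):
    """Read the numFound attribute from a Solr XML response."""
    result = ET.fromstring(response_xml).find("result")
    return int(result.get("numFound"))
```

Fetching `count_url("http://localhost:8080/solr/search/select", "stream_size")` on the server and passing the body to `num_found` gives the number of affected documents without retrieving any of them.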

"Test word document" doesn't return that handle even though the phrase is present in the document that is listed FIRST for the bitstreams for that handle.
Stated another way: The handle for "Test PDF document" (#5) contains a Word document with that exact phrase -- but handle #5 is not returned with a quoted search for it because the index is only capturing the full text of the second/last(?) bitstream.

Did I uncover something new?!

Chris Gray

Dec 11, 2015, 11:58:06 AM
to DSpace Technical Support, victor....@gmail.com
Following this thread helped solve a similar issue.

We had one item that wasn't searchable that should have been.  I could find it by directly doing a solr search, but it wouldn't turn up in the web pages.

The mention of things being set private made me try setting the item as private and then setting it back to public.  This worked.

Is a private item recorded in the database in the item table under the "discoverable" column?  The item in question was the only one in our collection that was set as in_archive but not discoverable.  I set "discoverable" to true, but still it was not searchable.  Only resetting the public status through the web UI worked.  So I suspect private/public is marked elsewhere.

Feed My Lambs Esq.

Dec 11, 2015, 2:24:11 PM
to DSpace Technical Support, victor....@gmail.com
Thank you for your feedback Chris and I'm glad this thread helped you get your item into the search index!

I believe I found something quite different, though: A fundamental problem with full-text indexing where part of the XML schema gets placed inside the fulltext attribute (I believe this has just been reported in JIRA here: https://jira.duraspace.org/browse/DS-2948 about 6 hours ago).


Also, and more troubling, only the last bitstream ends up getting made available in the index for full-text searching.

I can't say for sure that this happens to every record, but I mention at the end of my previous post that I believe there is evidence of it happening in the demo.dspace.org site as well. Namely that this handle (http://demo.dspace.org/xmlui/handle/10673/5) did not successfully index the text in the first document. Otherwise, the search for "test word document" would have found the handle. Instead only this handle (http://demo.dspace.org/xmlui/handle/10673/7) is found with that search.

If there are two or more bitstreams for an item, only the last one tends to be available in the full text index. If anyone knows a good person to contact for fixing a full-text indexing bug, feel free to forward this on to them, along with the examples.

We are leaning towards not using version 5.5 (unfortunately staying with version 3) if this particular full-text searching unpredictability might be semi-permanent. We have liked everything else we've seen about version 5, though, and hope to hear back soon about the nature of this bug and hopes for resolution (or workaround).

Andrea Schweer

Dec 13, 2015, 4:03:00 PM
to Chris Gray, DSpace Technical Support
Hi Chris,

On 12/12/15 05:58, Chris Gray wrote:
> We had one item that wasn't searchable that should have been. I could
> find it by directly doing a solr search, but it wouldn't turn up in
> the web pages.
>
> The mention of things being set private made me try setting the item
> as private and then setting it back to public. This worked.
>
> Is a private item recorded in the database in the item table under the
> "discoverable" column?

That's correct.

> The item in question was the only one in our collection that was set
> as in_archive but not discoverable. I set "discoverable" to true, but
> still it was not searchable. Only resetting the public status through
> the web UI worked. So I suspect private/public is marked elsewhere.

You need to re-index the item after changing the "discoverable" value.
Changing this via the UI automatically triggers the reindex, which is
why it then shows up. I'm not sure exactly what steps would be needed to
make the reindex work after changing the "discoverable" value directly
in the db -- it may be that just running the discovery indexer with no
arguments is enough, or it may not.

Andrea Schweer

Dec 13, 2015, 4:11:59 PM
to Feed My Lambs Esq., DSpace Technical Support
Hi,


On 12/12/15 08:24, Feed My Lambs Esq. wrote:
> I believe I found something quite different, though: A fundamental problem with full-text indexing where part of the XML schema gets placed inside the fulltext attribute (I believe this has just been reported in JIRA here: https://jira.duraspace.org/browse/DS-2948 about 6 hours ago).
>
> Also, and more troubling, only the last bitstream ends up getting made available in the index for full-text searching.

This second one is separate from the Jira issue you linked to. Would you mind opening a new Jira ticket for that second problem? Please include your observations made on the test server (the content of the test server resets over the weekend, but steps to reproduce the issue are always good to include).

It sounds very plausible, and not like what should be happening.


> If there are two or more bitstreams for an item, only the last one tends to be available in the full text index. If anyone knows a good person to contact for fixing a full-text indexing bug, feel free to forward this on to them, along with the examples.

I'm sure you're aware that DSpace is developed by a community of volunteers. Filing the issue in Jira will give it more visibility than posting it on the mailing list. Hopefully a volunteer can then be found to look into this.


> We are leaning towards not using version 5.5 (unfortunately staying with version 3) if this particular full-text searching unpredictability might be semi-permanent. We have liked everything else we've seen about version 5, though, and hope to hear back soon about the nature of this bug and hopes for resolution (or workaround).

I'd imagine that if the code really does only index the first .pdf.txt bitstream, then this behaviour would have been in place since the Discovery-based search was first released. Unless you're using the old Lucene-based search in your version 3, I'd think there is no benefit in staying on version 3 over upgrading to 5.4 when it comes to search.

As a workaround, you could look into re-ordering the .pdf.txt bitstreams so that the most important one (if there is one) comes first.

Feed My Lambs Esq.

Dec 14, 2015, 8:45:09 AM
to DSpace Technical Support, victor....@gmail.com
Andrea,

Yes, I realized the linked Jira task (likely) relates only to the first half of my discovered problems.

I would be happy to create my own Jira task but I'm not sure how to create a login; the home screen says to contact the administrators but that links to a page that says it is not set up. Can someone send me instructions or a contact that could grant access?

I believe we are using the Lucene based search in our v.3 installation -- which is probably why it searches a little differently and returns full-text indexed searches for all bitstreams in a handle. I appreciate your perspective that there may not be an older version of SOLR to fall back to.

We have considered your workaround suggestion and verified that we can manipulate which bitstream gets added to the SOLR index by specifying the `bitstream_order` of bundle2bitstream. Whichever is placed LAST (greatest integer) is the one that gets added to the index.
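
The workaround can be planned before touching the database. The sketch below is pure Python and relies only on the behaviour observed in this thread (the bitstream with the greatest `bitstream_order` is the one whose text ends up searchable). The function names and the example order values are hypothetical, and applying the new order to `bundle2bitstream` and then reindexing the item are left out.

```python
def indexed_bitstream(bitstreams):
    """Given (name, bitstream_order) pairs, return the name with the greatest order.

    Per the observation in this thread, that is the bitstream whose text is indexed.
    """
    return max(bitstreams, key=lambda pair: pair[1])[0]


def move_last(bitstreams, preferred):
    """Return new (name, order) pairs in which `preferred` has the greatest order.

    The relative order of the other bitstreams is kept.
    """
    ordered = sorted(bitstreams, key=lambda pair: pair[1])
    names = [name for name, _ in ordered if name != preferred]
    if len(names) == len(ordered):
        raise ValueError("preferred bitstream not in bundle: %r" % preferred)
    names.append(preferred)
    return [(name, position) for position, name in enumerate(names)]
```

For the item discussed here, `move_last` applied to the BODY and ABSTRACT text bitstreams with the BODY file as `preferred` yields an ordering in which the BODY text would be the one indexed.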

Once I have a Jira account (again, I need a person to contact for access) I can report the issue unless someone beats me to it which would be fine. Thank you, Andrea!

Hilton Gibson

Dec 14, 2015, 10:04:40 AM
to Feed My Lambs Esq., DSpace Technical Support

Hilton Gibson
Stellenbosch University Library


Feed My Lambs Esq.

Dec 14, 2015, 11:29:51 AM
to DSpace Technical Support, victor....@gmail.com
Thank you Hilton. That seems to have enabled me to create a user.

I have created a Jira issue for this problem, which should be accessible here: https://jira.duraspace.org/browse/DS-2952. It is only meant to address full-text indexing (in the Solr indexing service) making only the last bitstream searchable for handles with multiple bitstreams.

I plan to leave this thread alone until a solution is found. 'Watch' and 'vote' for the posted issue to track progress on this matter.