[Dspace-devel] [DSpace-JIRA] Created: (DS-208) Make the fulltext indexes configurable

Andrea Bollini (JIRA)

Aug 19, 2015, 1:44:41 PM

to dspace...@lists.sourceforge.net

Make the fulltext indexes configurable
--------------------------------------

Key: DS-208
URL: http://jira.dspace.org/jira/browse/DS-208
Project: DSpace 1.x
Issue Type: Improvement
Components: DSpace API
Affects Versions: 1.5.2
Reporter: Andrea Bollini
Assignee: Andrea Bollini
Attachments: DS-208-configure-fulltext-indexes.patch

This patch allows the user to configure the name of the index in which the text extracted from the bitstream (the fulltext) is stored.
More than one index is allowed, and if the configuration is missing, the "default" index name is used for backward compatibility.
A documentation update is included; please take a look if possible.
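
The fallback behaviour described above can be sketched in plain Java. This is only an illustration and not the code from the attached patch: the property key `search.index.fulltext`, the class name and the method name are all hypothetical, and the comma-separated format is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class FulltextIndexConfig {
    // Hypothetical property key, chosen for this illustration only.
    static final String KEY = "search.index.fulltext";
    // Index name used when nothing is configured (backward compatibility).
    static final String DEFAULT_INDEX = "default";

    // Returns the configured fulltext index names, or ["default"] if none are set.
    static List<String> resolveIndexNames(Properties config) {
        List<String> names = new ArrayList<>();
        String raw = config.getProperty(KEY);
        if (raw != null) {
            // Assumed format: a comma-separated list of index names.
            for (String part : raw.split(",")) {
                String name = part.trim();
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
        }
        if (names.isEmpty()) {
            names.add(DEFAULT_INDEX);
        }
        return names;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        System.out.println(resolveIndexNames(config));
        config.setProperty(KEY, "fulltext, default");
        System.out.println(resolveIndexNames(config));
    }
}
```

With no property set, the extracted text would go to the "default" index only, which matches the pre-patch behaviour; listing several names would add the same text to each of them.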

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://jira.dspace.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

Andrea Bollini (JIRA)

Aug 19, 2015, 1:48:56 PM

to dspace...@lists.sourceforge.net

[ http://jira.dspace.org/jira/browse/DS-208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=10413#action_10413 ]

Andrea Bollini commented on DS-208:
-----------------------------------

Hi Stuart, some users don't like storing the fulltext in the default index used for simple search. Instead they want to store the text in a specific index combined with other metadata (I don't think this is really needed: the title, authors and keywords are normally present in the fulltext...). Finally, as you say, we need to store the fulltext in a separate field so that it can be used for highlighting, snippets, etc. (We are also looking at the Carrot2 clustering engine, http://www.carrot2.org/, and having a separate index for the fulltext makes the demo webapp work out of the box on the DSpace Lucene index.)

So I suggest keeping my original patch behaviour, or at least making the inclusion in the default index configurable.

Mark Diggory (JIRA)

Aug 19, 2015, 1:49:06 PM

to dspace...@lists.sourceforge.net

[ http://jira.dspace.org/jira/browse/DS-208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=10417#action_10417 ]

Mark Diggory commented on DS-208:
---------------------------------

Stuart, you are correct to a degree...

I would be cautious about storing the full text: unless you're planning on presenting fragments of it in context, like Google, it is going to create a very large index. Indexing is also going to become very memory intensive if you are pulling the full text into strings; we will begin to risk out-of-memory errors if the indexing process is not streamed using readers (I did the rewrite to optimize this when I first started at MIT).
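
To make the memory argument concrete, here is a minimal sketch in plain Java (no Lucene involved) of processing text through a Reader with a fixed-size buffer instead of loading it into one String. The class and method names are hypothetical, and counting whitespace-separated tokens merely stands in for whatever the indexer does with the text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class StreamingTextScan {
    // Counts whitespace-separated tokens while holding only a 4 KB buffer,
    // so memory use does not grow with the size of the text being read.
    static long countTokens(Reader reader) {
        char[] buffer = new char[4096];
        long tokens = 0;
        boolean inToken = false; // carries token state across buffer refills
        try {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                for (int i = 0; i < read; i++) {
                    if (Character.isWhitespace(buffer[i])) {
                        inToken = false;
                    } else if (!inToken) {
                        inToken = true;
                        tokens++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A StringReader is used here only to keep the example self-contained;
        // in practice the Reader would wrap the extracted bitstream text.
        System.out.println(countTokens(new StringReader("full text of a bitstream")));
    }
}
```

The same loop works unchanged on a Reader over a very large file, which is the property that matters for avoiding out-of-memory errors during indexing.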

You might experiment with using the setter for the value after constructing the Field with some default string, like:

// name is the index field name; reader is a java.io.Reader over the extracted text
Field field = new Field(name, "junk", Field.Store.YES, Field.Index.TOKENIZED);
field.setValue(reader);

You might then be able to get away with a stored, tokenized field that is parsed from a reader (but I don't know how much more efficient that would be).

http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/document/Field.html#setValue(java.io.Reader)

P.S. How are you highlighting when the values presented in the search results are pulled directly from the Item metadata? (That is a loaded question; I'm hoping your answer is that we no longer use the item metadata directly and instead render the Lucene record directly, with hit highlighting present?!) ;-)

Andrea Bollini (JIRA)

Aug 19, 2015, 1:49:16 PM

to dspace...@lists.sourceforge.net

[ http://jira.dspace.org/jira/browse/DS-208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=10419#action_10419 ]

Andrea Bollini commented on DS-208:
-----------------------------------

Hi,
of course I'm not a Lucene expert either, but I think that you can highlight from any source, not only from the document field (i.e. you can retrieve the fulltext contents on the fly by storing the bitstream id in a separate field). The use of TermVector should alleviate the performance issues.
See the example in:
http://lucene.apache.org/java/2_4_1/api/contrib-highlighter/org/apache/lucene/search/highlight/package-summary.html#package_description
and the new features added on 22/12/2004.

Andrea Bollini (JIRA)

Aug 19, 2015, 2:56:25 PM

to dspace...@lists.sourceforge.net

[ http://jira.dspace.org/jira/browse/DS-208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrea Bollini reassigned DS-208:
---------------------------------

Assignee: (was: Andrea Bollini)

Jeffrey Trimble (JIRA)

Aug 19, 2015, 2:56:35 PM

to dspace...@lists.sourceforge.net

[ http://jira.dspace.org/jira/browse/DS-208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=10588#action_10588 ]

Jeffrey Trimble commented on DS-208:
------------------------------------

Andrea,

The documentation looks good. It will be incorporated into the new configuration chapter. (This chapter has not been deposited to the trunk yet.) I may have a question or two in a day or two.

--Jeff
