documentation for Chinese full text indexing


Kenney Guo

Aug 30, 2022, 9:14:03 PM
to DSpace Technical Support
Dear DSpace team,

With a default installation of DSpace 7.2, I am not able to search my Chinese documents well. After some research, I realized that I can configure the (word) analyzer in Solr. However, I did not find any official documentation on how to do that. Could anyone point me to such documentation?

Thanks very much,

Kenney

Tim Donohue

Sep 1, 2022, 12:15:18 PM
to Kenney Guo, DSpace Technical Support
Hi Kenney,

I must admit that we currently don't have documentation for how to enable Chinese full text indexing in DSpace.

However, if you are storing primarily Chinese full text documents in your DSpace, I don't think it would be too difficult to change the current Solr indexing settings to support that.

Solr has some documentation on how best to index Chinese here: https://solr.apache.org/guide/8_0/language-analysis.html#traditional-chinese 

What I think you'd want to do in DSpace is add a new fieldType called "text_mandarin" (or similar) to the 'search' schema: https://github.com/DSpace/DSpace/blob/main/dspace/solr/search/conf/schema.xml

This fieldType might look something like this:

<fieldType name="text_mandarin" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
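
The Solr page linked above is the Traditional Chinese section. If your documents are mostly Simplified Chinese, the same Solr guide describes a different tokenizer, solr.HMMChineseTokenizerFactory. A minimal sketch of that variant follows. The name "text_chinese_simplified" is just a placeholder I made up, and I'm assuming the smartcn analysis jar from Solr's analysis-extras contrib is available to the 'search' core, which I haven't verified for a default DSpace setup:

```xml
<!-- Hypothetical variant for Simplified Chinese; the fieldType name is a placeholder. -->
<!-- Assumes the smartcn analyzer jar (Solr analysis-extras contrib) is on the core's classpath. -->
<fieldType name="text_chinese_simplified" class="solr.TextField">
    <analyzer>
        <!-- Segments Simplified Chinese text into words -->
        <tokenizer class="solr.HMMChineseTokenizerFactory"/>
        <!-- Normalizes full-width/half-width character variants -->
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- Lowercases any Latin text mixed into the documents -->
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

You would only need one of the two fieldTypes, whichever matches the bulk of your documents.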

Then, if you want the "fulltext" field (which stores the full text of documents) to always be indexed/parsed as Chinese, you'd change its type to "text_mandarin" (instead of just "text") in its definition in that same schema.xml.
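
As a rough sketch, and assuming the field is declared with a plain <field> element, the change would look something like this. The attributes other than "type" are illustrative placeholders, not copied from DSpace; keep whatever your schema.xml already has and change only the type:

```xml
<!-- Before (attributes other than "type" are illustrative only): -->
<!-- <field name="fulltext" type="text" indexed="true" stored="true"/> -->

<!-- After: only the type changes, pointing at the new fieldType -->
<field name="fulltext" type="text_mandarin" indexed="true" stored="true"/>
```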


Then you'd have to reindex everything in Solr (./dspace index-discovery -b).

I think this would work, but I'll admit I've never tried it. So, it's always possible I'm overlooking a step to get this working.

Keep in mind, this would only change the behavior of full text indexing/searching, and it would change that behavior globally (so all documents in DSpace would be assumed to contain Chinese text). Unfortunately, at this time, DSpace doesn't have any smart way to detect the language of documents and index each language differently.

If this sounds like what you need & you find it works for you, please let us know. That way we can more formally document similar instructions for others who may need them.

Tim



 