documentation for Chinese full text indexing

Kenney Guo

Aug 30, 2022, 9:14:03 PM

to DSpace Technical Support

Dear DSpace team,

With a default installation of DSpace 7.2, I am not able to search my Chinese documents well. After some research, I realized that I can configure the (word) Analyzer in Solr. However, I did not find any official documentation on how to do that. Could anyone point me to that documentation?

Thanks very much,

Kenney

Tim Donohue

Sep 1, 2022, 12:15:18 PM

to Kenney Guo, DSpace Technical Support

Hi Kenney,

I must admit that we currently don't have documentation for how to enable Chinese full text indexing in DSpace.

However, if you are storing primarily Chinese full text documents in your DSpace, I don't think it would be too difficult to change the current Solr indexing settings to support that.

Solr has some documentation on how best to index Chinese here: https://solr.apache.org/guide/8_0/language-analysis.html#traditional-chinese

What I think you'd want to do in DSpace is to add a new fieldType called "text_mandarin" (or similar) to the 'search' schema:

https://github.com/DSpace/DSpace/blob/main/dspace/solr/search/conf/schema.xml

This fieldType might look something like this:

</analyzer>

</fieldType>
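Only the closing tags of the snippet survive above. As a minimal sketch of what a complete fieldType could look like, the fragment below assumes the StandardTokenizer plus CJK bigram analyzer described in the Solr guide linked earlier. It is not necessarily what was originally posted, and the name "text_mandarin" and the positionIncrementGap value are illustrative choices:

```xml
<!-- Hypothetical sketch; not the snippet from the original message. -->
<fieldType name="text_mandarin" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split text into tokens; each Han character becomes its own token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Join adjacent CJK characters into overlapping bigrams -->
    <filter class="solr.CJKBigramFilterFactory"/>
    <!-- Normalize full-width Latin and half-width Katakana forms -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Lowercase any Latin text mixed into the documents -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because no separate index-time and query-time analyzers are declared, the same chain is applied when indexing and when searching, so a query is broken into the same bigrams as the stored text.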

Then, if you want the "fulltext" field (which stores the fulltext of documents) to always do indexing/parsing of Chinese, you'd change its type to be "text_mandarin" (instead of just "text") here:

https://github.com/DSpace/DSpace/blob/main/dspace/solr/search/conf/schema.xml#L237
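A rough sketch of that change, assuming a fieldType named "text_mandarin" has been defined in the same schema. The indexed, stored and multiValued attributes shown here are assumptions for illustration; the existing definition in your own schema.xml is the authority, and only the type attribute should need to change:

```xml
<!-- Only the type attribute changes; keep the other attributes exactly as
     they appear in your copy of schema.xml (the ones shown here are assumed). -->
<field name="fulltext" type="text_mandarin" indexed="true" stored="true" multiValued="true"/>
```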

Then you'd have to reindex everything in Solr (./dspace index-discovery -b).

I think this would work, but I'll admit I've never tried it. So, it's always possible I'm overlooking a step to get this working.

Keep in mind, this would only change the behavior of full text indexing/searching... and it would change that behavior globally (so all documents in DSpace would be assumed to contain Chinese text). Unfortunately, at this time, DSpace doesn't have any smart way to detect the language of documents and index each language differently.

If this sounds like what you need & you find it works for you, please let us know. That way we can more formally document similar instructions for others who may need them.

Tim
