ICUTokenizer and CJK (and Solr 4.0)

Tom Burton-West

Mar 28, 2011, 6:52:22 PM

to blacklight-...@googlegroups.com

Hi all,

I saw mention of using the ICUTokenizer for CJK in another thread about using Solr 4.0 and thought I would mention a few considerations. First, I want to say that I really like the ICUTokenizer, as it does a good job for many languages. We are currently using it in our large-scale search application because we have texts in over 400 languages, all in one index.

However, the ICUTokenizer currently creates unigrams for Han characters (Chinese and Japanese). In general, bigrams work better than unigrams for CJK for a number of reasons. The problem with unigrams is that you can get a large number of false drops, because Solr will search for each unigram anywhere in a field. This might not be a problem for MARC 880 fields, but for us, where we have the OCR of an entire book in one field, it really does not work well. For example, searching for the title of a famous Chinese novel will get half a million hits when using unigrams. There is currently an issue open to add a bigram option to the ICUTokenizer (LUCENE-2906), and some work has been done on it.
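To make the false-drop problem concrete, here is a rough Python sketch (not Lucene code). The `unigrams`, `bigrams`, and `search` helpers, the sample documents, and the choice of title are all made up for illustration, and the matching is a plain "all query tokens appear somewhere in the field" check, which is a simplification of what Solr actually does.

```python
def unigrams(text):
    """One token per character (stand-in for unigram Han tokenization)."""
    return list(text)


def bigrams(text):
    """Overlapping two-character tokens; a single character stays as-is."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def search(query, docs, tokenize):
    """Return names of docs whose field contains every query token anywhere."""
    query_tokens = tokenize(query)
    hits = []
    for name, text in docs.items():
        field_tokens = set(tokenize(text))
        if all(tok in field_tokens for tok in query_tokens):
            hits.append(name)
    return sorted(hits)


# Hypothetical sample data: doc1 contains the title, doc2 only contains
# the same three characters scattered through unrelated words.
docs = {
    "doc1": "我喜欢读红楼梦",
    "doc2": "红色的大楼里有人做梦",
}
query = "红楼梦"

unigram_hits = search(query, docs, unigrams)
bigram_hits = search(query, docs, bigrams)

print("unigram hits:", unigram_hits)
print("bigram hits:", bigram_hits)
```

With unigrams both documents match, because each of the three characters occurs somewhere in doc2. With bigrams only doc1 matches, since doc2 never has the characters next to each other. Scale doc2 up to the OCR of a whole book and nearly every book will contain all of the characters somewhere.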

One other complication is that no matter what tokenizer you use for CJK, you need to pay attention to the autoGeneratePhraseQueries parameter, or the tokenizer and the query parser will work at cross-purposes (see LUCENE-2458). If you don’t want all your CJK queries to get searched as phrase queries, you need to set autoGeneratePhraseQueries to false on the field type in your Solr schema.
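As a rough illustration of what that setting changes, here is a small Python sketch. It is not the Solr query parser: `tokenize`, `phrase_match`, `any_term_match`, `search`, and the sample documents are hypothetical, the tokenizer is a plain per-character split, and the non-phrase case is modeled as an OR over the tokens, which is an assumption about the default operator.

```python
def tokenize(text):
    """Per-character split, standing in for a CJK-aware tokenizer."""
    return list(text)


def phrase_match(query_tokens, field_tokens):
    """True if the query tokens occur adjacent and in order in the field."""
    n = len(query_tokens)
    return any(
        field_tokens[i:i + n] == query_tokens
        for i in range(len(field_tokens) - n + 1)
    )


def any_term_match(query_tokens, field_tokens):
    """True if at least one query token occurs anywhere in the field."""
    present = set(field_tokens)
    return any(tok in present for tok in query_tokens)


def search(query, docs, auto_generate_phrase_queries):
    """Model one whitespace-free query chunk that analyzes to several tokens."""
    query_tokens = tokenize(query)
    match = phrase_match if auto_generate_phrase_queries else any_term_match
    return sorted(
        name for name, text in docs.items()
        if match(query_tokens, tokenize(text))
    )


# Hypothetical sample data.
docs = {
    "exact": "我喜欢读红楼梦",
    "partial": "这是一座大楼",
}

as_phrase = search("红楼梦", docs, auto_generate_phrase_queries=True)
as_terms = search("红楼梦", docs, auto_generate_phrase_queries=False)

print("phrase query hits:", as_phrase)
print("term query hits:", as_terms)
```

The point is only that the same tokens give very different result sets depending on how the query parser combines them, so the tokenizer choice and this setting need to be decided together.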

Someday I will find the time to write up these details in a blog post :)

Tom Burton-West

http://www.hathitrust.org/blogs/large-scale-search

Bill Dueber

Jan 5, 2012, 11:08:01 PM

to blacklight-...@googlegroups.com

For anyone who's been following this stuff, Bob has put out a patch that does bigrams for CJK. See the Jira ticket at https://issues.apache.org/jira/browse/LUCENE-2906

--
You received this message because you are subscribed to the Google Groups "Blacklight Development" group.
To post to this group, send email to blacklight-...@googlegroups.com.
To unsubscribe from this group, send email to blacklight-develo...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/blacklight-development?hl=en.

--
Bill Dueber
Library Systems Programmer
University of Michigan Library

Erik Hatcher

Jan 6, 2012, 7:42:31 AM

to blacklight-...@googlegroups.com

On Jan 5, 2012, at 23:08 , Bill Dueber wrote:

> For anyone who's been following this stuff, Bob has put out a patch that does bigrams for CJK. See the Jira ticket at https://issues.apache.org/jira/browse/LUCENE-2906

I'm not sure where Bob (I presume you mean SolrMarc Bob) comes into play in this issue, as the work was all done by Robert Muir (who doesn't go by Bob, heh).

But yeah, good stuff. It'll be in Solr 3.6 (and 4.0 of course too).

Erik

Bill Dueber

Jan 7, 2012, 7:32:10 PM

to blacklight-...@googlegroups.com

Whoa! My sincere apologies -- all mad props go to Robert Muir. I'll be testing this stuff out in the next few days and will report back here.

-Bill-

