Hi all,
I saw mention of using the ICUTokenizer for CJK in another thread about using Solr 4.0 and thought I would mention a few considerations. First, I want to say that I really like the ICUTokenizer, as it does a good job for many languages. We are currently using it in our large-scale search application because we have texts in over 400 languages all in one index. However, the ICUTokenizer currently creates unigrams for Han characters (Chinese and Japanese). In general, bigrams work better than unigrams for CJK for a number of reasons. The problem with unigrams is that you can get a large number of false drops, because Solr will search for each unigram anywhere in a field. This might not be a problem for MARC 880 fields, but for us, where we have the OCR of an entire book in one field, it really does not work well. For example, searching for the title of a famous Chinese novel will get half a million hits when using unigrams. There is currently an issue open to add a bigram option to the ICUTokenizer (LUCENE-2906), and some work has been done on it.
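To make the false-drop problem concrete, here is a minimal sketch in plain Python. It is not Solr or ICUTokenizer code: the functions only imitate unigram and overlapping-bigram tokenization, the matching rule (every query term must occur somewhere in the field) is a simplification, and the title and sample texts are my own illustrative choices rather than data from our index.

```python
def unigrams(text):
    """One token per character, roughly what unigram tokenization of Han text yields."""
    return list(text)


def bigrams(text):
    """Overlapping two-character tokens; a lone character is kept as a single token."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches_all_terms(query_terms, field_terms):
    """Simplified matching: each query term must appear anywhere in the field."""
    field = set(field_terms)
    return all(term in field for term in query_terms)


# Illustrative inputs (assumptions, not taken from the original post).
query = "红楼梦"               # a novel title used as the search string
relevant = "曹雪芹的红楼梦"     # text that actually contains the title
unrelated = "红色的楼房和梦想"  # same three characters, scattered

for name, tokenize in (("unigrams", unigrams), ("bigrams", bigrams)):
    hit_relevant = matches_all_terms(tokenize(query), tokenize(relevant))
    hit_unrelated = matches_all_terms(tokenize(query), tokenize(unrelated))
    print(name, "relevant:", hit_relevant, "unrelated:", hit_unrelated)
```

With unigrams the unrelated text matches because each character occurs somewhere in it; with bigrams it does not, while the relevant text still matches. In a field holding the OCR of a whole book, the chance that three common characters all occur somewhere is very high, which is where the huge hit counts come from.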
One other complication is that, no matter what tokenizer you use for CJK, you need to pay attention to the autoGeneratePhraseQueries parameter, or the tokenizer and the query parser will work at cross-purposes (see LUCENE-2458). If you don’t want all your CJK queries to be searched as phrase queries, you need to set autoGeneratePhraseQueries to false:
<fieldType name="FullText" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
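As a rough illustration of what the setting changes, the sketch below models the decision in plain Python. It is not the Solr query parser: `build_query` and the field name `ocr` are hypothetical, and how the individual terms are combined in the non-phrase case (AND vs. OR) depends on your default operator, which this sketch does not model.

```python
def build_query(field, tokens, auto_generate_phrase_queries):
    """Simplified model: what happens when one chunk of query text
    (no whitespace inside it) is analyzed into several tokens."""
    if len(tokens) == 1:
        return f"{field}:{tokens[0]}"
    if auto_generate_phrase_queries:
        # All tokens must occur adjacent and in order: a phrase query.
        phrase = " ".join(tokens)
        return f'{field}:"{phrase}"'
    # Otherwise each token becomes its own term clause.
    return " ".join(f"{field}:{token}" for token in tokens)


# Hypothetical bigram tokens produced from a single CJK query string.
tokens = ["红楼", "楼梦"]
print(build_query("ocr", tokens, True))
print(build_query("ocr", tokens, False))
```

Because CJK text has no spaces between words, a whole query string reaches the analyzer as one chunk, so with the setting left at true every multi-character CJK query ends up treated as a phrase.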
Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
--
You received this message because you are subscribed to the Google Groups "Blacklight Development" group.
To post to this group, send email to blacklight-...@googlegroups.com.
To unsubscribe from this group, send email to blacklight-develo...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/blacklight-development?hl=en.
> For anyone who's been following this stuff, Bob has put out a patch that does bigrams for CJK. See the Jira ticket at https://issues.apache.org/jira/browse/LUCENE-2906
I'm not sure where Bob (I presume you mean SolrMarc Bob) comes into play in this issue, as the work was all done by Robert Muir (who doesn't go by Bob, though, heh).
But yeah, good stuff. It'll be in Solr 3.6 (and 4.0 of course too).
Erik