CJK searching

Naomi Dushay

Aug 1, 2012, 1:52:26 PM

to Tom Burton-West, solrma...@googlegroups.com

Tom, Solrmarc folks:

It's looking like I am finally going to tackle our CJK searching issues. I would be interested in hearing how you did this. Here are some of my questions:

0. what version of Solr are you using?

1. do you have separate fields for your non-latin script searching? How did you set up the fields in schema.xml, and the request handler(s) in solrconfig.xml?

2. do you use automated script detection at index time to determine how to analyze the text? If yes, what program(s) and how did you do it? Is there an issue when a single script maps to multiple languages?

3. do you use automated script detection at query time to determine how to analyze the text? If yes, what program(s) and how did you do it? Is there an issue when a single script maps to multiple languages?

4. do you use unigrams, bigrams or something else to improve search results? What factored into your decision?

5. did you address any other languages, such as Arabic, Hebrew, etc?

6. what were the "gotchas"?

7. is there any low hanging fruit? (simpler changes that improve things, but don't fix everything.)

Thanks!
- Naomi

Naomi Dushay

Aug 6, 2012, 3:09:27 AM

to Bob Haschart, solrma...@googlegroups.com, Tom Burton-West

Bob,

As I keep thinking about this, I'm wondering how YOU do this with solrmarc. Do you just index all the 880s as if they were Chinese? Does your CJKFilterFactory in your text field do something magic?

How might solrmarc code look at the record and write to different index fields targeted to specific languages, based on script detection AND language codes in the MARC record? (Chinese characters can be used in more than one language, right? etc.)
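One way such routing might look, sketched in Ruby rather than SolrMarc's Java. Everything here is hypothetical: the field names (`vern_chi`, `vern_jpn`, `vern_kor`, `vern_cjk`, `vern_other`) and both method names are invented for illustration, and `marc_lang` is assumed to be the three-letter language code from MARC 008/35-37. The idea is that kana and Hangul usually point to one language by themselves, while Han characters alone need the record's language code as a tie-breaker.

```ruby
# Coarse script label for a string. Hangul and kana are checked first because
# Korean and Japanese text can also contain Han characters.
def cjk_script(text)
  return :hangul if text.match?(/\p{Hangul}/)
  return :kana   if text.match?(/[\p{Hiragana}\p{Katakana}]/)
  return :han    if text.match?(/\p{Han}/)
  :other
end

# Hypothetical mapping from MARC language code to an index field,
# used only when the script alone is ambiguous (Han characters only).
HAN_FIELDS = { "chi" => "vern_chi", "jpn" => "vern_jpn", "kor" => "vern_kor" }.freeze

def target_field(text, marc_lang)
  case cjk_script(text)
  when :hangul then "vern_kor"
  when :kana   then "vern_jpn"
  when :han    then HAN_FIELDS.fetch(marc_lang, "vern_cjk") # generic fallback
  else              "vern_other"
  end
end

puts target_field("夏晓虹著", "chi")
puts target_field("東京", "jpn")
puts target_field("東京", "und")
```

A Han-only string with an unknown or unhelpful language code falls through to the generic `vern_cjk` field, which is where a language-neutral analysis such as bigrams would apply.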

And where IS the source code to the unicodenormalizer stuff?

- Naomi

> --
> You received this message because you are subscribed to the Google Groups "solrmarc-tech" group.
> To post to this group, send email to solrma...@googlegroups.com.
> To unsubscribe from this group, send email to solrmarc-tec...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/solrmarc-tech?hl=en.
>

Robert Haschart

Aug 6, 2012, 3:37:02 PM

to solrma...@googlegroups.com, Naomi Dushay

Naomi,

The source code for the unicodenormalization stuff including the cjk filter stuff is in the jarfile that also contains the .class files.

The CJKFilterFactory doesn't do anything magic or even anything especially difficult. It is not even clear that what it does makes sense from a linguistic point of view. To determine whether something is a CJK character, it uses the following simple test:

    public static final boolean isCJKChar(int c)
    {
        return (c >= 0x3040 && c <= 0x318f) ||   // Hiragana, Katakana, Bopomofo, Hangul Compatibility Jamo
               (c >= 0x3300 && c <= 0x337f) ||   // first half of the CJK Compatibility block
               (c >= 0x3400 && c <= 0x3d2d) ||   // part of CJK Unified Ideographs Extension A
               (c >= 0x4e00 && c <= 0x9fff) ||   // CJK Unified Ideographs
               (c >= 0xf900 && c <= 0xfaff) ||   // CJK Compatibility Ideographs
               (c >= 0xac00 && c <= 0xd7af);     // Hangul Syllables
    }

and then the FilterFactory splits each CJK character out as a separate word in the index. So if you encounter a field that contains 夏晓虹著 Xia Xiaohong zhu. the following output would be produced by the CJKFilterFactory:

term position	1	2	3	4	5	6	7
term text	夏	晓	虹	著	Xia	Xiaohong	zhu.
term type	word	word	word	word	word	word	word
source start,end	0,1	1,2	2,3	3,4	5,8	9,17	18,22

So it makes no attempt to determine what language the CJK characters are in; it merely "handles" the where-are-the-word-breaks issue by treating each CJK character as a separate word.
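The per-character splitting can be sketched in a few lines of Ruby. This is an illustration only, not the SolrMarc filter itself: the code point ranges are copied from `isCJKChar`, the names `cjk_char?` and `unigram_tokens` are made up, and plain whitespace splitting stands in for whatever tokenizer runs ahead of the filter.

```ruby
# Code point ranges copied from the isCJKChar method quoted in this message.
CJK_RANGES = [
  0x3040..0x318f, 0x3300..0x337f, 0x3400..0x3d2d,
  0x4e00..0x9fff, 0xf900..0xfaff, 0xac00..0xd7af
].freeze

def cjk_char?(ch)
  CJK_RANGES.any? { |range| range.cover?(ch.ord) }
end

# Split on whitespace, then break each word wherever a CJK character is
# involved, so every CJK character becomes its own token while runs of
# other characters stay together.
def unigram_tokens(text)
  text.split.flat_map do |word|
    word.chars.slice_when { |a, b| cjk_char?(a) || cjk_char?(b) }.map(&:join)
  end
end

p unigram_tokens("夏晓虹著 Xia Xiaohong zhu.")
```

For the sample field this yields the same seven terms as the table: four single-character terms followed by `Xia`, `Xiaohong` and `zhu.`.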

I realize that this decision may make no sense linguistically, but several other decisions that have already been made make little or no linguistic sense either. For instance, discarding all diacritics and folding "look-alike" characters onto their nearest English-language match (such as the Polish L with slash ( Ł ) onto the L) will often result in entirely different words being formed and added to the index. The thought is that even if the operation is not valid in a linguistic sense, performing the same operation at index time and at query time means the correct record will be included in the search results (along with others which are perhaps not correct). This sacrifices precision in the search results for increased recall.
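A toy Ruby sketch of that trade-off, under stated assumptions: `fold`, `build_index` and `search` are invented names, the look-alike table holds only the Ł example from this message, and the three "records" are made up. It is not the UnicodeNormalizationFilter, just the same idea of applying one folding function on both sides.

```ruby
# Letters that have no canonical decomposition need an explicit mapping;
# only the example mentioned above is listed here.
LOOKALIKES = { "Ł" => "L", "ł" => "l" }.freeze

def fold(term)
  term.unicode_normalize(:nfd)   # separate base letters from combining marks
      .gsub(/\p{Mn}/, "")        # drop the combining marks (diacritics)
      .gsub(/[Łł]/, LOOKALIKES)  # fold look-alike letters
      .downcase
end

# Toy inverted index: folded term => list of record ids.
def build_index(docs)
  index = Hash.new { |hash, key| hash[key] = [] }
  docs.each do |id, text|
    text.split.each { |word| index[fold(word)] << id }
  end
  index
end

# The query goes through the same folding as the indexed text.
def search(index, query)
  index.fetch(fold(query), [])
end

docs  = { 1 => "Résumé writing", 2 => "Resume play", 3 => "Łódź guide" }
index = build_index(docs)
p search(index, "résumé")
p search(index, "Lodz")
```

Searching for "résumé" finds record 1 and also record 2, which only contains "Resume": the accented record is not missed, at the cost of an extra hit.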

However, in many cases the original source material for the index (i.e. the MARC records) will have been created with the substitution already made, so the fact that the change is linguistically invalid is in many instances irrelevant: it's already a fact in the source data and we simply have to do the best we can with it. (A quick search for "Résumé" on our catalog shows that 13 of the top 20 items do not include the word résumé with the accent marks.)

My further reasoning (or rationalization) of this method of handling CJK characters is that in the cases where such characters occur (in our data) it is usually simply a name or a short phrase such that doing a thorough linguistic analysis might well be overkill for the benefit gained.

In terms of next steps that might be useful for CJK handling here, suggestions have been made such as adding synonyms to map from traditional Chinese glyphs to simplified Chinese glyphs, or even adding synonyms for the romanization of the Chinese glyphs. However, these efforts may cause more problems than they'd solve, and up to this point no actual effort has been expended in investigating them.

-Bob Haschart

Charles Riley

Aug 6, 2012, 5:06:06 PM

to solrma...@googlegroups.com, Naomi Dushay

A little low-hanging fruit to start:
http://nlp.stanford.edu/software/segmenter.shtml

This is a direction we had hoped to go in before development of Yale's implementation was frozen in 2010. Looking back through my e-mail, I think we had gotten up to Solr 1.4 by then. I'm attaching the final report we produced, which I think will help with most of your questions, and a bibliography that's dated but might still be worth a look through. Although much of our implementation never became robust, we did a fair amount of survey work and investigation of some alternative solutions, including consideration of Nutch, Rosetta, Autonomy, and Sematext.

Charles

--

Charles L. Riley

Catalog Librarian for Africana

Sterling Memorial Library, Yale University

<zeno...@gmail.com>

203-432-7566

FinalReportArcadia[1].pdf

ScriptsBibliography.docx

Simon Spero

Aug 6, 2012, 5:53:32 PM

to solrma...@googlegroups.com

Another easy way to go is to use the Chinese language plugin in GATE; the word segmenter can use either a neural network or an SVM, and it's pretty easy to build a workflow using the IDE, then load and run it from Java code.

The annual SIGHAN bakeoff results are also a good place to look.

Note that accuracy is very much genre dependent, so you may want to switch models based on class or subject heading.

If only Naomi worked somewhere with a lot of linguists and Asian L1 speakers.....

Uwe Reh

Aug 7, 2012, 4:16:55 AM

to solrma...@googlegroups.com

Hi,

it's Solr 4.0 Alpha (Solr 3.6), but what's wrong with:

> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> <analyzer>
> <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt" />
> <tokenizer class="solr.ICUTokenizerFactory" />
> <filter class="solr.WordDelimiterFilterFactory" />
> <filter class="solr.ICUFoldingFilterFactory" />
> <filter class="solr.GermanLightStemFilterFactory" />
> <filter class="solr.CJKBigramFilterFactory" />
> </analyzer>
> </fieldType>

Except for the German stemmer and some local optimizations in
'mapping-specials.txt', this filter chain should fit most languages.
I like the CJKBigramFilterFactory, which was recommended by the
HathiTrust (as far as I know). The reversed order of the Hebrew words
produced by the ICUTokenizer also seems reasonable.

The analysis of "夏晓虹著 Xia Xiaohong C++ Müller Mueller סדן, מיכל"
looks quite fine. (See attached luke.html)

Uwe

luke.html

mapping-specials.txt

Erik Hatcher

Aug 7, 2012, 8:14:37 AM

to solrma...@googlegroups.com

And there's this in Solr4's example schema.xml too:


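<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- for any non-CJK -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>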


Naomi Dushay

Aug 7, 2012, 8:02:18 PM

to solrma...@googlegroups.com, Tom Burton-West

Thanks - these are all helpful responses.

Tom Burton-West of the HathiTrust has partially responded to me as well, and I found his blog entry on this topic VERY informative.

http://www.hathitrust.org/blogs/large-scale-search/multilingual-issues-part-1-word-segmentation

He indicated that his response to me was a draft, so I don't feel comfortable sharing it here, but here's a quote - note his recommendation about the patch for the CJKBigramFilter … I hope it gets into a Solr 3.6.2 release.

> As far as low hanging fruit, I think you should seriously look at the ICUTokenizer along with the CJKBigramFilter (with the fix in https://issues.apache.org/jira/browse/LUCENE-4286 ) as an alternative to your WhiteSpaceTokenizer/WordDelimiterFilter. The other issue is that if you want CJK tokenizing to work at all you need autoGeneratePhraseQueries="false". If you have strong feelings about the WDF and autoGeneratePhraseQueries="true" for Latin-1 scripts, I suppose you could probably set up a mirrored set of fields for your 880 linked fields and only use the ICUTokenizer and autoGeneratePhraseQueries="false" on those fields.
>
> I've found that the ICUTokenizer is pretty good for Latin-1 languages, for example it won't split up "l'art" or "can't". I suggest you pick out some of your use cases for WDF and test them out with the ICUTokenizer. You can use http://unicode.org/cldr/utility/breaks.jsp and the Solr admin analysis panel.

We share some similarities with the HathiTrust in that we can't analyze for one language at the expense of another (in our 880 fields). The strings we have to index generally are far too short for language detection, as we are currently talking about MARC records, not full text.

I am currently working on a Ruby Solr test framework gem to make it easier to automate tests against a Solr index. I will certainly share that with the community once it comes together. I intend to try just such fieldtypes as suggested below against the acceptance tests forthcoming from our CJK librarians. Add in acceptance tests for current functionality, and away we go!

- Naomi

Jonathan Rochkind

Aug 7, 2012, 8:22:52 PM

to solrma...@googlegroups.com, Tom Burton-West

What are the downsides of simply segmenting CJK into characters, like Bob's old custom tokenizer did?

Perhaps combined with the edismax pf2/pf3 params (and their ps2/ps3 slop settings), to boost results where the entered characters appear next to each other?

That seems like the best quick and dirty solution that doesn't disrupt the rest of your index for Latin chars. Does it end up with horrible results? It might end up a performance problem on a ginormous index like HT, but probably not (probably? maybe?) on metadata-only indexes like ours.

Jonathan Rochkind

Aug 7, 2012, 8:30:00 PM

to solrma...@googlegroups.com, Tom Burton-West

Never mind, I see Tom's post answers all my questions. Awesome post, Tom.

Also, big ups to the phrase "false drops", an ancient visitor from the past (what's "dropping" in the "false drop" is a needle into a punchcard, from when IR searches were done with needles and punchcards):

http://books.google.com/books?id=xJNLJXXbhusC&pg=PA49&lpg=PA49&dq=false+drop+punchcard&source=bl&ots=1hJHJhAiwq&sig=dzYfCyQwbMStrVWq_Q5ZTvdX1I4&hl=en&sa=X&ei=37IhUOCWHIuF0QH3loHgDQ&ved=0CFoQ6AEwAw#v=onepage&q=false%20drop%20punchcard&f=false

Naomi Dushay

Aug 7, 2012, 8:35:10 PM

to solrma...@googlegroups.com, Tom Burton-West

Tom's post implies that bigrams will be better than unigrams, as there are a lot of "false drops" with unigrams. I believe the patch he mentions may do something whizzy with a combination of both.
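A small Ruby sketch of why unigrams produce false drops and bigrams avoid them. The two titles and the query are made-up examples, all method names are invented, and requiring every query token to occur in the record is a simplifying assumption about how the query is evaluated. The last method only illustrates the general idea of indexing both token kinds; it is not a description of what the patch does.

```ruby
def unigrams(run)
  run.chars
end

# Overlapping two-character tokens; a single character is kept as-is.
def bigrams(run)
  chars = run.chars
  return chars if chars.length < 2
  chars.each_cons(2).map(&:join)
end

# Both kinds of token for one run, without duplicates.
def unigrams_and_bigrams(run)
  (unigrams(run) + bigrams(run)).uniq
end

# Simplified matching: every query token must occur somewhere in the record.
def matches?(doc_tokens, query_tokens)
  query_tokens.all? { |token| doc_tokens.include?(token) }
end

query = "大学"       # the two characters are adjacent in the query
hit   = "東京大学"   # contains the query as an adjacent pair
drop  = "大阪文学"   # contains both characters, but not next to each other

p matches?(unigrams(hit),  unigrams(query))
p matches?(unigrams(drop), unigrams(query))
p matches?(bigrams(hit),   bigrams(query))
p matches?(bigrams(drop),  bigrams(query))
```

With unigrams both titles match, so the second one is a false drop; with bigrams only the title that contains the two characters side by side matches.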

Also, I like the idea of using code that's part of the broader Solr community if it serves our needs. What serves our needs? I'll know more after I've got our CJK acceptance tests.

TESTING: I don't know about you guys, but I am not keen to do a bunch of manual tests over and over to determine if I messed up our desired searching behaviors for hyphens. Imagine something along these lines:

describe "General Searching" do
  subject {
    stt = SolrTestThing.new @solr_url
    stt.find(solr_req_params)
  }

  it "should have document 467 when i search for 'food'" do
    subject.find('food').should include document(467)
  end

  describe "a search for food sorted by title" do
    let(:solr_req_params) { { :q => "food", :sort => "title_sort" } }

    it "should have title a before title b" do
      subject.document(:title => 'vala').should come_before(subject.document(:title => 'valb'))
    end

    it "should have doc a in first 3 results" do
      subject.document(:id => 'a').should be_in_first(3).results
    end
  end

end

- Naomi

Naomi Dushay

Aug 7, 2012, 8:36:48 PM

to solrma...@googlegroups.com, Tom Burton-West

What drops are the CARDS -- the needle keeps the McBee cards that are NOT relevant, and you pick up the dropped cards as meeting your criteria.

(Boy, am I old.)

- Naomi

Bill Dueber

Oct 18, 2012, 10:48:19 AM

to solrma...@googlegroups.com

On Wed, Aug 1, 2012 at 1:52 PM, Naomi Dushay <ndu...@stanford.edu> wrote:

> Tom, Solrmarc folks:
>
> It's looking like I am finally going to tackle our CJK searching issues. I would be interested in hearing how you did this. Here are some of my questions:
>
> 0. what version of Solr are you using?

A 1.4 nightly (Solr Specification Version: 1.4.0.2009.12.04.13.05.30)

> 1. do you have separate fields for your non-latin script searching? How did you set up the fields in schema.xml, and the request handler(s) in solrconfig.xml?

No separate fields. Here's our basic text type (note: there's almost nothing here that I'd repeat when doing it again). You'll note we use a simplified tokenizer (only on whitespace and commas) and Bob's old CJK filter with bigrams turned on.

  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[,\p{Z}]+"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="schema.CJKFilterFactory" bigrams="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

> 2. do you use automated script detection at index time to determine how to analyze the text? If yes, what program(s) and how did you do it? Is there an issue when a single script maps to multiple languages?

Nope.

> 3. do you use automated script detection at query time to determine how to analyze the text? If yes, what program(s) and how did you do it? Is there an issue when a single script maps to multiple languages?

Nope.

> 4. do you use unigrams, bigrams or something else to improve search results? What factored into your decision?

Unigrams and bigrams, as per above.

> 5. did you address any other languages, such as Arabic, Hebrew, etc?

Nope.

> 6. what were the "gotchas"?

Well... we don't know. Obviously Tom knows a hell of a lot more about this stuff than I do.

> 7. is there any low hanging fruit? (simpler changes that improve things, but don't fix everything.)

Bigrams are the big thing that made a difference for our users.

> Thanks!
> - Naomi

--
Bill Dueber
Library Systems Programmer
University of Michigan Library
