| term position | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| term text | 夏 | 晓 | 虹 | 著 | Xia | Xiaohong | zhu. |
| term type | word | word | word | word | word | word | word |
| source start,end | 0,1 | 1,2 | 2,3 | 3,4 | 5,8 | 9,17 | 18,22 |
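
For anyone who wants to reproduce the offsets in the table above, here is a minimal sketch. It assumes the analyzed string was "夏晓虹著 Xia Xiaohong zhu." and that the rule is one token per Han character plus one token per whitespace-delimited run of everything else. The `tokenize` function and the regex are illustrative only, not any analyzer's real implementation.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block counts as "Han" here.
HAN = "\u4e00-\u9fff"

# One token per Han character, or one token per run of other non-space characters.
TOKEN_RE = re.compile(f"[{HAN}]|[^\\s{HAN}]+")


def tokenize(text):
    """Return (position, term, start, end) tuples; positions are 1-based."""
    return [
        (pos, m.group(), m.start(), m.end())
        for pos, m in enumerate(TOKEN_RE.finditer(text), start=1)
    ]


if __name__ == "__main__":
    for row in tokenize("夏晓虹著 Xia Xiaohong zhu."):
        print(row)
```

The end offsets are exclusive, which matches the "source start,end" row (for example 5,8 for "Xia").
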
Another easy way to go is to use the Chinese language plugin in GATE; the word segmenter can use either a neural network or an SVM, and it's pretty easy to build a workflow using the IDE, then load and run it from Java code.
The annual SIGHAN bakeoff results are also a good place to look.
Note that accuracy is very much genre-dependent, so you may want to switch models based on class or subject heading.
If only Naomi worked somewhere with a lot of linguists and Asian L1 speakers.....
Perhaps combined with the dismax2 ps2 ps3 etc params, to boost results where the entered characters appear next to each other?
That seems like the best quick and dirty solution that doesn't disrupt the rest of your index for Latin chars. Does it end up with horrible results? It might end up a performance problem on a ginormous index like HT, but probably not (probably? maybe?) on metadata-only indexes like ours.
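
A rough sketch of what such a request might look like, assuming "dismax2" means Solr's extended dismax (edismax) parser, whose `pf2`/`pf3` parameters boost documents where adjacent pairs/triples of query terms occur as phrases and whose `ps2`/`ps3` set the slop for those phrases. The field name `title_cjk_unigram`, the boost values, and the `build_cjk_query` helper are all made up for illustration.

```python
from urllib.parse import urlencode, parse_qs


def build_cjk_query(user_query, field="title_cjk_unigram"):
    """Build a Solr query string; the field name and boosts are hypothetical."""
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": field,            # match on the (assumed) unigram-tokenized field
        "pf2": f"{field}^5",    # boost when two adjacent query terms form a phrase
        "ps2": 0,               # slop 0: the two terms must be directly adjacent
        "pf3": f"{field}^10",   # stronger boost for three adjacent terms
        "ps3": 0,
    }
    return urlencode(params)


if __name__ == "__main__":
    # Spaces between characters stand in for however the query gets split into terms.
    print(build_cjk_query("夏 晓 虹"))
```

Whether this helps depends on the query actually being split into one term per character before the parser sees it, which is an assumption here.
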
________________________________________
From: solrma...@googlegroups.com [solrma...@googlegroups.com] on behalf of Naomi Dushay [ndu...@stanford.edu]
Sent: Tuesday, August 07, 2012 8:02 PM
To: solrma...@googlegroups.com
Cc: Tom Burton-West
Subject: Re: [solrmarc-tech] CJK searching
Also, big ups to the phrase "false drops", an ancient visitor from the past (what's "dropping" in the 'false drop' is a needle into a punchcard, from when IR searches were done with needles and punchcards).
________________________________________
From: solrma...@googlegroups.com [solrma...@googlegroups.com] on behalf of Jonathan Rochkind [roch...@jhu.edu]
Sent: Tuesday, August 07, 2012 8:22 PM
To: solrma...@googlegroups.com
Cc: Tom Burton-West
Subject: RE: [solrmarc-tech] CJK searching
Tom, Solrmarc folks:
It's looking like I am finally going to tackle our CJK searching issues. I would be interested in hearing how you did this. Here are some of my questions:
0. what version of Solr are you using?
1. do you have separate fields for your non-latin script searching? How did you set up the fields in schema.xml, and the request handler(s) in solrconfig.xml?
2. do you use automated script detection at index time to determine how to analyze the text? If yes, what program(s) and how did you do it? Is there an issue when a single script maps to multiple languages?
3. do you use automated script detection at query time to determine how to analyze the text? If yes, what program(s) and how did you do it? Is there an issue when a single script maps to multiple languages?
4. do you use unigrams, bigrams or something else to improve search results? What factored into your decision?
5. did you address any other languages, such as Arabic, Hebrew, etc?
6. what were the "gotchas"?
7. is there any low hanging fruit? (simpler changes that improve things, but don't fix everything.)
Thanks!
- Naomi