> A little low-hanging fruit to start:
>
> http://nlp.stanford.edu/software/segmenter.shtml
>
> This is a direction we had hoped to go in before development of Yale's
> implementation was frozen in 2010. Looking back through my e-mail, I think
> we had gotten up to SOLR 1.4 by then. I'm attaching the final report we
> produced, which I think will help with most of your questions, and a
> bibliography that's dated but might still be worth a look through.
> Although much of our implementation never became robust, we did a
> fair amount of survey work and investigated some alternative solutions,
> including Nutch, Rosetta, Autonomy, and Sematext.
> Charles
> On Mon, Aug 6, 2012 at 3:37 PM, Robert Haschart <rh...@virginia.edu> wrote:
>> Naomi,
>> The source code for the Unicode normalization stuff, including the CJK
>> filter stuff, is in the jar file that also contains the .class files.
>> The CJKFilterFactory doesn't do anything magic or even anything
>> especially difficult. It is not even clear that what it does makes sense
>> from a linguistic point of view. To determine whether something is a CJK
>> character, it uses the following simple test:
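>> public static final boolean isCJKChar(int c)
>> {
>>     return (c >= 0x3040 && c <= 0x318f) ||  // Hiragana, Katakana, Bopomofo, Hangul Compatibility Jamo
>>            (c >= 0x3300 && c <= 0x337f) ||  // CJK Compatibility (first half of the block)
>>            (c >= 0x3400 && c <= 0x3d2d) ||  // CJK Unified Ideographs Extension A (partial)
>>            (c >= 0x4e00 && c <= 0x9fff) ||  // CJK Unified Ideographs
>>            (c >= 0xf900 && c <= 0xfaff) ||  // CJK Compatibility Ideographs
>>            (c >= 0xac00 && c <= 0xd7af);    // Hangul Syllables
>> }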
>> The FilterFactory then splits each CJK character out as a separate
>> word in the index. So if you encounter a field that contains
>> 夏晓虹著 Xia Xiaohong zhu. the following output would be produced by
>> the CJKFilterFactory:
>> term position      1     2     3     4     5     6         7
>> term text          夏    晓    虹    著    Xia   Xiaohong  zhu.
>> term type          word  word  word  word  word  word      word
>> source start,end   0,1   1,2   2,3   3,4   5,8   9,17      18,22
>> So it makes no attempt to determine what language the CJK characters are
>> in; it merely "handles" the where-are-the-word-breaks issue by treating
>> each CJK character as a separate word.
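
A minimal Java sketch of the splitting behaviour described above, assuming whitespace separates the non-CJK terms. The names `CjkSplitSketch`, `Term` and `split` are hypothetical and chosen for illustration; this is not the actual CJKFilterFactory source, and only the `isCJKChar` ranges are taken from the message.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical class, not the solrmarc CJKFilterFactory.
class CjkSplitSketch {

    // One emitted term plus its start/end character offsets in the input.
    static final class Term {
        final String text;
        final int start;
        final int end;

        Term(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + "]";
        }
    }

    // Same ranges as the isCJKChar method quoted in the message.
    static boolean isCJKChar(int c) {
        return (c >= 0x3040 && c <= 0x318f) ||
               (c >= 0x3300 && c <= 0x337f) ||
               (c >= 0x3400 && c <= 0x3d2d) ||
               (c >= 0x4e00 && c <= 0x9fff) ||
               (c >= 0xf900 && c <= 0xfaff) ||
               (c >= 0xac00 && c <= 0xd7af);
    }

    // Emits every CJK character as its own term; other text is split on
    // whitespace (an assumption of this sketch).
    static List<Term> split(String input) {
        List<Term> terms = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int c = input.codePointAt(i);
            int len = Character.charCount(c);
            if (Character.isWhitespace(c)) {
                i += len;                       // skip separators
            } else if (isCJKChar(c)) {
                terms.add(new Term(input.substring(i, i + len), i, i + len));
                i += len;
            } else {
                int start = i;                  // consume a run of non-CJK text
                while (i < input.length()) {
                    int d = input.codePointAt(i);
                    if (Character.isWhitespace(d) || isCJKChar(d)) {
                        break;
                    }
                    i += Character.charCount(d);
                }
                terms.add(new Term(input.substring(start, i), start, i));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // The sample field from the message, written with Unicode escapes.
        String field = "\u590f\u6653\u8679\u8457 Xia Xiaohong zhu.";
        for (Term t : split(field)) {
            System.out.println(t);
        }
    }
}
```

For the sample field 夏晓虹著 Xia Xiaohong zhu. this produces seven terms, and the offsets agree with the start,end values listed above.
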
>> I realize that this decision may make no sense linguistically, but
>> several other decisions that have already been made make little or no
>> sense linguistically either. For instance, discarding all diacritics and
>> folding "look-alike" characters onto their nearest English-language match
>> (such as the Polish L with slash ( Ł ) onto the L) will often result in
>> entirely different words being formed and added to the index. The thought
>> is that even if an operation is not valid in a linguistic sense, if you
>> perform the same operation at index time and at query time, the correct
>> record will be included in the search results (along with others that
>> are perhaps not correct). This sacrifices precision in the search
>> results for increased recall.
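
The index-time/query-time symmetry can be sketched in Java as follows. This is an illustration under stated assumptions, not the solrmarc normalizer: the class `FoldingSketch` and the method `fold` are hypothetical names, and the folding rule (NFD decomposition, removal of combining marks, an explicit mapping for Ł, lowercasing) is one plausible reading of "discard diacritics and fold look-alikes".

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustrative only: hypothetical class, not the solrmarc Unicode normalizer.
class FoldingSketch {

    // Apply the same folding to indexed text and to the user's query.
    static String fold(String s) {
        // NFD splits an accented letter into base letter + combining mark.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Remove the combining marks (Unicode category Mn).
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // L with stroke has no canonical decomposition, so NFD leaves it
        // alone; a "look-alike" fold has to map it explicitly.
        stripped = stripped.replace('\u0141', 'L').replace('\u0142', 'l');
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String indexed = fold("R\u00e9sum\u00e9"); // accented form in a record
        String query = fold("resume");             // unaccented user query
        // Both sides fold to the same term, so the record is found.
        System.out.println(indexed.equals(query));
    }
}
```

The cost is the one the message names: the accented noun and the unaccented verb "resume" become the same index term, so recall goes up while precision goes down.
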
>> However, in many cases the original source material for the index (i.e.,
>> the MARC records) will have been created with the substitution already
>> made, so the fact that the change is linguistically invalid is in many
>> instances irrelevant: it's already a fact in the source data and we simply
>> have to do the best we can with it. (A quick search for "Résumé" on our
>> catalog shows that 13 of the top 20 items do not include the word résumé
>> with the accent marks.)
>> My further reasoning (or rationalization) of this method of handling CJK
>> characters is that in the cases where such characters occur (in our data)
>> it is usually simply a name or a short phrase such that doing a thorough
>> linguistic analysis might well be overkill for the benefit gained.
>> In terms of next steps that might be useful for CJK handling here,
>> suggestions have been made such as adding synonyms to map from Traditional
>> Chinese glyphs to Simplified Chinese glyphs, or even adding synonyms for
>> the romanization of the Chinese glyphs. However, these efforts may cause
>> more problems than they'd solve, and up to this point no actual effort has
>> been expended in investigating them.
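
A rough Java sketch of the traditional-to-simplified idea, assuming a per-character mapping table. The class `TradSimpSketch`, its three-entry table and the method `toSimplified` are hypothetical; a real deployment would need a complete mapping, and nothing like this exists in the code described in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical class with a tiny, incomplete mapping table.
class TradSimpSketch {

    // Traditional code point -> simplified code point.
    static final Map<Integer, Integer> TRAD_TO_SIMP = new HashMap<>();
    static {
        TRAD_TO_SIMP.put(0x66C9, 0x6653); // traditional and simplified "xiao"
        TRAD_TO_SIMP.put(0x767C, 0x53D1); // "fa" as in "to emit"
        TRAD_TO_SIMP.put(0x9AEE, 0x53D1); // "fa" as in "hair": same target
    }

    // Replace every mapped character; leave everything else unchanged.
    static String toSimplified(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int c = s.codePointAt(i);
            Integer mapped = TRAD_TO_SIMP.get(c);
            out.appendCodePoint(mapped != null ? mapped : c);
            i += Character.charCount(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A traditional-script spelling of the name used in the example above.
        System.out.println(toSimplified("\u590f\u66c9\u8679"));
    }
}
```

Two distinct traditional characters in the table fold onto one simplified character, which is one concrete way a mapping like this could cost precision, in line with the caution above.
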
>> -Bob Haschart
>> On 8/6/2012 3:09 AM, Naomi Dushay wrote:
>> Bob,
>> As I keep thinking about this, I'm wondering how YOU do this with solrmarc. Do you just index all the 880s as if they were Chinese? Does your CJKFilterFactory in your text field do something magic?
>> How might solrmarc code look at the record and write to different index fields targeted to specific languages, based on script detection AND language codes in the MARC record? (Chinese characters can be used in more than one language, right? etc.)
>> And where IS the source code to the unicodenormalizer stuff?
>> - Naomi
>> On Aug 1, 2012, at 10:52 AM, Naomi Dushay wrote:
>> Tom, Solrmarc folks:
>> It's looking like I am finally going to tackle our CJK searching issues. I would be interested in hearing how you did this. Here are some of my questions:
>> 0. what version of Solr are you using?
>> 1. do you have separate fields for your non-latin script searching? How did you set up the fields in schema.xml, and the request handler(s) in solrconfig.xml?
>> 2. do you use automated script detection at index time to determine how to analyze the text? If yes, what program(s) and how did you do it? Is there an issue when a single script maps to multiple languages?
>> 3. do you use automated script detection at query time to determine how to analyze the text? If yes, what program(s) and how did you do it? Is there an issue when a single script maps to multiple languages?
>> 4. do you use unigrams, bigrams or something else to improve search results? What factored into your decision?
>> 5. did you address any other languages, such as Arabic, Hebrew, etc?
>> 6. what were the "gotchas"?
>> 7. is there any low hanging fruit? (simpler changes that improve things, but don't fix everything.)
>> Thanks!
>> - Naomi
>> --
>> You received this message because you are subscribed to the Google Groups "solrmarc-tech" group.
>> To post to this group, send email to solrmarc-tech@googlegroups.com.
>> To unsubscribe from this group, send email to solrmarc-tech+unsubscribe@googlegroups.com.
>> For more options, visit this group at http://groups.google.com/group/solrmarc-tech?hl=en.
> --
> *Charles L. Riley*
> *Catalog Librarian for Africana*
> *Sterling Memorial Library, Yale University*
> *<zenodo...@gmail.com>*
> *203-432-7566*