Message from discussion CJK searching
Simon Spero  
From: Simon Spero <sesunc...@gmail.com>
Date: Mon, 6 Aug 2012 17:53:32 -0400
Local: Mon, Aug 6 2012 5:53 pm
Subject: Re: [solrmarc-tech] CJK searching

Another easy way to go is to use the Chinese language plugin in GATE; the
word segmenter can use either a neural network or an SVM, and it's pretty
easy to build a workflow using the IDE, then load and run it from Java
code.

The annual SIGHAN bakeoff results are also a good place to look.

Note that accuracy is very much genre-dependent, so you may want to switch
models based on class or subject heading.

If only Naomi worked somewhere with a lot of linguists and Asian L1
speakers.....
On Aug 6, 2012 5:06 PM, "Charles Riley" <zenodo...@gmail.com> wrote:

> A little low-hanging fruit to start :
> http://nlp.stanford.edu/software/segmenter.shtml

> This is a direction we had hoped to go in before development of Yale's
> implementation was frozen in 2010.  Looking back through my e-mail, I think
> we had gotten up to Solr 1.4 by then.  I'm attaching the final report we
> produced, which I think will help with most of your questions, and a
> bibliography that's dated but might still be worth a look through.
> Although much of our implementation never became robust, we did a
> fair amount of survey work and investigated some alternative solutions,
> including Nutch, Rosetta, Autonomy, and Sematext.

> Charles

> On Mon, Aug 6, 2012 at 3:37 PM, Robert Haschart <rh...@virginia.edu> wrote:

>>  Naomi,

>> The source code for the unicodenormalization stuff including the cjk
>> filter stuff is in the jarfile that also contains the .class files.

>> The CJKFilterFactory doesn't do anything magic or even anything
>> especially difficult.  It is not even clear that what it does makes sense
>> from a linguistic point of view.   To determine whether something is a CJK
>> character, it uses the following simple determination:

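>>    public static final boolean isCJKChar(int c)
>>    {
>>        return (c >= 0x3040 && c <= 0x318f) ||  // Hiragana, Katakana, Bopomofo, Hangul Compatibility Jamo
>>               (c >= 0x3300 && c <= 0x337f) ||  // CJK Compatibility (first half of the block)
>>               (c >= 0x3400 && c <= 0x3d2d) ||  // CJK Unified Ideographs Extension A (partial)
>>               (c >= 0x4e00 && c <= 0x9fff) ||  // CJK Unified Ideographs
>>               (c >= 0xf900 && c <= 0xfaff) ||  // CJK Compatibility Ideographs
>>               (c >= 0xac00 && c <= 0xd7af);    // Hangul Syllables
>>    }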

>> and then the FilterFactory will split each CJK character out to be a
>> separate word in the index.  So if you encounter a field that
>> contains   夏晓虹著 Xia Xiaohong zhu.   the following output would be
>> produced by the CJKFilterFactory:

>>   term position      1    2    3    4    5     6          7
>>   term text          夏   晓   虹   著   Xia   Xiaohong   zhu.
>>   term type          word word word word word  word       word
>>   source start,end   0,1  1,2  2,3  3,4  5,8   9,17       18,22

>> So it makes no attempt to determine what language the CJK characters are
>> in; it merely "handles" the where-are-the-word-breaks issue by treating
>> each CJK character as a separate word.
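
The splitting behaviour described above can be sketched in a few lines of Java. This is an illustration only, not the SolrMarc CJKFilterFactory source: the class name, the `tokenize` method and the `text[start,end]` output format are assumptions, and only `isCJKChar` is taken from the message. The sample title is written with Unicode escapes so the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: not the SolrMarc CJKFilterFactory source.
final class CjkUnigramSketch {

    // Same ranges as the isCJKChar method quoted in the message.
    static boolean isCJKChar(int c) {
        return (c >= 0x3040 && c <= 0x318f) ||
               (c >= 0x3300 && c <= 0x337f) ||
               (c >= 0x3400 && c <= 0x3d2d) ||
               (c >= 0x4e00 && c <= 0x9fff) ||
               (c >= 0xf900 && c <= 0xfaff) ||
               (c >= 0xac00 && c <= 0xd7af);
    }

    // Returns tokens formatted as "text[start,end]". Whitespace separates
    // words, and every CJK character becomes a token of its own.
    // All ranges above are in the BMP, so iterating by char is sufficient.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int wordStart = -1; // start offset of the pending non-CJK word, or -1
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            boolean cjk = isCJKChar(ch);
            if (cjk || Character.isWhitespace(ch)) {
                if (wordStart >= 0) { // flush the pending non-CJK word
                    tokens.add(format(input, wordStart, i));
                    wordStart = -1;
                }
                if (cjk) {
                    tokens.add(format(input, i, i + 1));
                }
            } else if (wordStart < 0) {
                wordStart = i;
            }
        }
        if (wordStart >= 0) {
            tokens.add(format(input, wordStart, input.length()));
        }
        return tokens;
    }

    private static String format(String s, int start, int end) {
        return s.substring(start, end) + "[" + start + "," + end + "]";
    }

    public static void main(String[] args) {
        // The four-character title from the example, followed by its romanization.
        String field = "\u590f\u6653\u8679\u8457 Xia Xiaohong zhu.";
        for (String token : tokenize(field)) {
            System.out.println(token);
        }
    }
}
```

Run against the example field, the sketch yields seven tokens with the same start and end offsets as the table.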

>> I realize that this decision may make no sense linguistically, but
>> several other decisions that have already been made make little or no
>> sense linguistically either.  For instance, discarding all diacritics and
>> folding "look-alike" characters onto their nearest English-language match
>> (such as the Polish L with slash ( Ł ) onto the L) will often result in
>> entirely different words being formed and added to the index.  The thought
>> is that even if you are doing an operation that is not valid in a
>> linguistic sense, if you perform the same operation at index time and at
>> query time, the correct record will be included in the search results
>> (along with others that are perhaps not correct).  This sacrifices
>> precision in the search results for increased recall.
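
A rough sketch of that index-time/query-time symmetry, assuming a simple folding rule built on `java.text.Normalizer`. The class and method names are hypothetical and this is not the SolrMarc normalizer. The L with stroke has no Unicode decomposition, so the sketch maps it by hand.

```java
import java.text.Normalizer;

// Illustrative sketch only; the folding rules here are assumptions,
// not the SolrMarc unicode normalizer.
final class FoldingSketch {

    // Decompose to NFD, drop combining marks, then map look-alikes that
    // have no Unicode decomposition (the Polish L with stroke) by hand.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.replace('\u0141', 'L').replace('\u0142', 'l');
    }

    public static void main(String[] args) {
        String indexed = fold("R\u00e9sum\u00e9"); // accented form in the record
        String query = fold("Resume");             // unaccented form typed by a user
        // Both sides go through the same fold, so they meet in the index.
        System.out.println(indexed.equals(query));
    }
}
```

Because the same `fold` runs on both sides, an accented record and an unaccented query match; the cost is that distinct words which differ only by diacritics now match as well.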

>> However, in many cases the original source material for the index (i.e.,
>> the MARC records) will have been created with the substitution already
>> made, so the fact that the change is linguistically invalid is in many
>> instances irrelevant: it's already a fact in the source data and we simply
>> have to do the best we can with it.   (A quick search for "Résumé" on our
>> catalog shows that 13 of the top 20 items do not include the word résumé
>> with the accent marks.)

>> My further reasoning (or rationalization) of this method of handling CJK
>> characters is that in the cases where such characters occur (in our data)
>> it is usually simply a name or a short phrase such that doing a thorough
>> linguistic analysis might well be overkill for the benefit gained.

>> In terms of next steps that might be useful for CJK handling here,
>> suggestions have been made such as adding synonyms to map from Traditional
>> Chinese glyphs to Simplified Chinese glyphs, or even adding synonyms for
>> the romanization of the Chinese glyphs.  However, these efforts may cause
>> more problems than they'd solve, and up to this point no actual effort has
>> been expended in investigating them.
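
For illustration, the Traditional-to-Simplified idea could look something like the sketch below. It is untested against real catalog data. The three character pairs are a hand-picked sample (曉→晓, 國→国, 書→书), written as Unicode escapes, and the class and method names are hypothetical. A per-character table like this cannot handle the cases where the correspondence between Traditional and Simplified forms is not one-to-one, which may be one of the problems alluded to.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a tiny per-character folding table.
final class TraditionalToSimplifiedSketch {

    // Hand-picked sample pairs; a real table would be far larger.
    private static final Map<Character, Character> TO_SIMPLIFIED = new HashMap<>();
    static {
        TO_SIMPLIFIED.put('\u66c9', '\u6653'); // xiao, "dawn"
        TO_SIMPLIFIED.put('\u570b', '\u56fd'); // guo, "country"
        TO_SIMPLIFIED.put('\u66f8', '\u4e66'); // shu, "book"
    }

    // Replaces each known Traditional character with its Simplified form
    // and leaves every other character untouched.
    static String toSimplified(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            out.append(TO_SIMPLIFIED.getOrDefault(ch, ch));
        }
        return out.toString();
    }
}
```

Applied at both index and query time, this would let the Traditional spelling 夏曉虹著 find the record containing 夏晓虹著 from the earlier example.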

>> -Bob Haschart

>> On 8/6/2012 3:09 AM, Naomi Dushay wrote:

>> Bob,

>> As I keep thinking about this, I'm wondering how YOU do this with solrmarc.  Do you just index all the 880s as if they were Chinese?  Does your CJKFilterFactory in your text field do something magic?

>> How might solrmarc code look at the record and write to different index fields targeted to specific languages, based on script detection AND language codes in the MARC record?  (Chinese characters can be used in more than one language, right?   etc.)
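
One possible shape for such routing, sketched with the JDK's `Character.UnicodeScript`. This is not SolrMarc code: the class name and the field names (`text_zh`, `text_ja`, `text_ko`, `text_default`) are hypothetical, and `languageCode` is assumed to hold a three-letter MARC language code such as "chi", "jpn" or "kor". Kana or Hangul settle the language from the script alone. Han characters by themselves are ambiguous, so the sketch falls back to the language code.

```java
import java.lang.Character.UnicodeScript;

// Illustrative sketch only: field names and routing rules are assumptions.
final class ScriptRoutingSketch {

    static String targetField(String text, String languageCode) {
        boolean han = false;
        boolean kana = false;
        boolean hangul = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HAN) {
                han = true;
            } else if (script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA) {
                kana = true;
            } else if (script == UnicodeScript.HANGUL) {
                hangul = true;
            }
            i += Character.charCount(cp);
        }
        if (kana) return "text_ja";   // kana only occurs in Japanese text
        if (hangul) return "text_ko"; // Hangul only occurs in Korean text
        if (han) {
            // Han alone is shared by several languages: use the record's code.
            if ("jpn".equals(languageCode)) return "text_ja";
            if ("kor".equals(languageCode)) return "text_ko";
            return "text_zh";
        }
        return "text_default";
    }
}
```

The same function could in principle be applied to a query string, although a query carries no language code, so Han-only queries would stay ambiguous.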

>> And where IS the source code to the unicodenormalizer stuff?

>> - Naomi

>> On Aug 1, 2012, at 10:52 AM, Naomi Dushay wrote:

>>  Tom, Solrmarc folks:

>> It's looking like I am finally going to tackle our CJK searching issues.   I would be interested in hearing how you did this.   Here are some of my questions:

>> 0.  what version of Solr are you using?

>> 1.   do you have separate fields for your non-latin script searching?  How did you set up the fields in schema.xml, and the request handler(s) in solrconfig.xml?

>> 2.  do you use automated script detection at index time to determine how to analyze the text?  If yes, what program(s) and how did you do it?   Is there an issue when a single script maps to multiple languages?

>> 3.  do you use automated script detection at query time to determine how to analyze the text?  If yes, what program(s) and how did you do it?  Is there an issue when a single script maps to multiple languages?

>> 4.  do you use unigrams, bigrams or something else to improve search results?  What factored into your decision?

>> 5.  did you address any other languages, such as Arabic, Hebrew, etc?

>> 6.  what were the "gotchas"?

>> 7.  is there any low hanging fruit?  (simpler changes that improve things, but don't fix everything.)
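
To make the unigram/bigram distinction in question 4 concrete, here is a small sketch of overlapping bigram generation for a run of CJK characters. The class and method names are hypothetical, and the sketch assumes the run contains only BMP characters.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: overlapping character bigrams for one CJK run.
final class CjkBigramSketch {

    // "ABCD" becomes [AB, BC, CD]; a single character is kept as a unigram.
    static List<String> bigrams(String run) {
        List<String> out = new ArrayList<>();
        if (run.length() == 1) {
            out.add(run);
            return out;
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            out.add(run.substring(i, i + 2));
        }
        return out;
    }
}
```

For the title 夏晓虹著 this gives 夏晓, 晓虹 and 虹著, where unigram indexing gives four single-character terms. Bigrams tend to cut down on matches where the characters appear far apart or in a different order, at the cost of a larger index.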

>> Thanks!
>> - Naomi

>> --
>> You received this message because you are subscribed to the Google Groups "solrmarc-tech" group.
>> To post to this group, send email to solrmarc-tech@googlegroups.com.
>> To unsubscribe from this group, send email to solrmarc-tech+unsubscribe@googlegroups.com.
>> For more options, visit this group at http://groups.google.com/group/solrmarc-tech?hl=en.


> --
> *Charles L. Riley*
> *Catalog Librarian for Africana*
> *Sterling Memorial Library, Yale University*
> *<**zenodo...@gmail.com* <zenodo...@gmail.com>*>*
> *203-432-7566*



 