Issue with English words in Japanese documents


wz...@box.com

Nov 5, 2014, 9:29:41 PM
to lucene...@googlegroups.com
Hi there:

The library worked great until we ran into this issue: it splits an English word inside a Japanese document into single characters. For example, when a Japanese document contains the word "engineer", we get 'e', 'n', 'g', 'i', 'n', 'e', 'e', 'r' after applying GosenTokenizer. This completely breaks search for our accounts that hold both Japanese and English documents. Is there anything I can tweak to prevent GosenTokenizer from splitting English words?


Jun Ohtani

Nov 5, 2014, 10:08:49 PM
to lucene...@googlegroups.com
Hi,

You can use CompositeTokenFilter.
This filter merges runs of similar tokens according to part-of-speech-based rules.
https://github.com/lucene-gosen/lucene-gosen/blob/4x/src/java/net/java/sen/filter/stream/CompositeTokenFilter.java
https://github.com/lucene-gosen/lucene-gosen/blob/4x/src/test/net/java/sen/CompositeTokenFilterTest.java

If you want to merge a run of “Alphabet” tokens into a single “Unknown word” token, set the following rule in CompositeTokenFilter:

未知語 記号-アルファベット

The first part-of-speech, 未知語 (“unknown word”), is the part-of-speech assigned to the token after merging.
The second part-of-speech, 記号-アルファベット (“symbol-alphabet”), is the target part-of-speech to be merged.
With this rule, the single characters 'e', 'n', 'g', … from your example are merged back into one token, "engineer", tagged as an unknown word.

If you use the Java library directly, GosenTokenizerFactory is a good example of how to set this up:
https://github.com/lucene-gosen/lucene-gosen/blob/4x/src/java/org/apache/solr/analysis/GosenTokenizerFactory.java#L54-54
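For illustration, here is a rough, untested sketch of the same wiring outside Solr. It assumes the 4x-branch APIs used by the factory above (CompositeTokenFilter.readRules(BufferedReader) and a GosenTokenizer(Reader, StreamFilter, String) constructor); the package name org.apache.lucene.analysis.gosen and the file name compositePOS.txt are my assumptions, so please check the linked sources against your version.

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.StringReader;

    import net.java.sen.filter.stream.CompositeTokenFilter;
    import org.apache.lucene.analysis.gosen.GosenTokenizer;

    public class CompositeExample {
      public static void main(String[] args) throws IOException {
        // Load merge rules from a UTF-8 file containing the line:
        //   未知語 記号-アルファベット
        CompositeTokenFilter composite = new CompositeTokenFilter();
        BufferedReader rules = new BufferedReader(new InputStreamReader(
            new FileInputStream("compositePOS.txt"), "UTF-8"));
        try {
          composite.readRules(rules);
        } finally {
          rules.close();
        }

        // Pass the filter to the tokenizer; a null dictionary directory
        // should fall back to the default dictionary (assumption).
        GosenTokenizer tokenizer = new GosenTokenizer(
            new StringReader("私はengineerです"), composite, null);
        // ... consume tokens via the usual Lucene TokenStream loop ...
      }
    }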

Example for Solr:
http://johtani.jugem.jp/?eid=5 (Japanese only)
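In Solr, the factory reads the rules file named by its compositePOS attribute. A sketch of the schema.xml field type follows; the field type name and rules file name are placeholders, so compare with the example configs shipped with lucene-gosen:

    <!-- compositePOS.txt sits next to schema.xml and contains the line:
         未知語 記号-アルファベット -->
    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.GosenTokenizerFactory" compositePOS="compositePOS.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>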

Regards,
Jun

------------
Jun Ohtani
joh...@gmail.com
blog : http://blog.johtani.info
twitter : http://twitter.com/johtani


wz...@box.com

Dec 10, 2014, 4:26:19 PM
to lucene...@googlegroups.com
Hi Jun:

The solution you provided works. Thank you so much.