Hi,
You can use CompositeTokenFilter.
This filter merges multiple similar tokens into one token according to a part-of-speech-based configuration.
https://github.com/lucene-gosen/lucene-gosen/blob/4x/src/java/net/java/sen/filter/stream/CompositeTokenFilter.java
https://github.com/lucene-gosen/lucene-gosen/blob/4x/src/test/net/java/sen/CompositeTokenFilterTest.java
If you want to merge multiple “Alphabet” tokens into one “Unknown word” token, set the following rule in CompositeTokenFilter:
未知語 記号-アルファベット
The first part-of-speech, 未知語 (“Unknown word”), is the part-of-speech assigned after merging.
The second part-of-speech, 記号-アルファベット (“Alphabet”), is the target part-of-speech to be merged.
If you use the Java library directly, GosenTokenizerFactory is a good sample:
https://github.com/lucene-gosen/lucene-gosen/blob/4x/src/java/org/apache/solr/analysis/GosenTokenizerFactory.java#L54-54
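For example, a minimal sketch in plain Java might look like the following. It is untested; the CompositeTokenFilter.readRules(BufferedReader) call and the rule format are taken from the linked sources, so please check them against your version:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.StringReader;

    import net.java.sen.filter.stream.CompositeTokenFilter;

    public class CompositeRuleExample {
      public static void main(String[] args) throws IOException {
        // One rule per line: "<part-of-speech after merging> <target part-of-speech>"
        String rules = "未知語 記号-アルファベット";

        CompositeTokenFilter filter = new CompositeTokenFilter();
        filter.readRules(new BufferedReader(new StringReader(rules)));

        // Pass "filter" to the tokenizer in the same way GosenTokenizerFactory does
        // around the linked line (the tokenizer constructor differs between versions).
      }
    }

With this rule applied, a word like “engineer” in a Japanese document should come out as one token tagged 未知語 instead of single characters.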
Here is an example for Solr (Japanese only):
http://johtani.jugem.jp/?eid=5
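For reference, a fieldType in schema.xml might look roughly like this. It is untested; I am assuming the compositePOS attribute and the compositePOS.txt resource name here, so please check them against the factory source and the blog post above:

    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- compositePOS.txt contains the rule: 未知語 記号-アルファベット -->
        <tokenizer class="solr.GosenTokenizerFactory" compositePOS="compositePOS.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>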
Regards,
Jun
------------
Jun Ohtani
joh...@gmail.com
blog : http://blog.johtani.info
twitter : http://twitter.com/johtani
> On 2014/11/06 11:29, wz...@box.com wrote:
>
> Hi there:
>
> The library ran awesome before we ran into this issue. It splits an English word inside a Japanese document into characters. For example, if we have the word "engineer" inside a Japanese document, we get 'e', 'n', 'g', 'i', 'n', 'e', 'e', 'r' after applying GosenTokenizer. This totally messes up search for our accounts that have both Japanese and English documents. Is there anything I can tweak to prevent GosenTokenizer from splitting English words?
>
>
>