tokenization issue with certain phrases


Wei Zhao

Feb 25, 2015, 2:57:13 PM
to lucene...@googlegroups.com
Hi there:

We just ran into a tokenization issue and want to check whether there is an easy fix or workaround. (We are using the Gosen library for Japanese.)

When we tokenize "項羽と劉邦", it works perfectly and breaks the phrase into: 
項羽

劉邦

However, when we tokenize "横山光輝ー項羽と劉邦", the behavior is unexpected:

横山
光輝




劉邦

As you can see, it incorrectly splits "項羽" into two tokens. Is there a way to force Gosen not to split "項羽"?

thanks

Wei

Kazuaki Hiraga

Feb 26, 2015, 11:04:38 AM
to lucene...@googlegroups.com
Hi Wei,

It looks like you are using the Katakana-Hiragana prolonged sound mark (ー, U+30FC) as a separator between the two phrases "横山光輝" and "項羽と劉邦". Could you try normalizing "ー" to a hyphen-minus (-) or a dash? I think that would work around the issue (I am away from my development environment at the moment, so I cannot verify it). Because ー is not a punctuation symbol in Japanese, the tokenizer may treat it as part of a word, which can break the tokenization.
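A minimal sketch of that normalization in plain Java, applied to the text before it reaches the tokenizer. The class name `ProlongedMarkNormalizer` and the rule "only replace ー when it does not follow a kana character" are assumptions for illustration, not part of Gosen or Lucene, and whether this actually fixes the Gosen output is untested here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Gosen or Lucene.
public class ProlongedMarkNormalizer {

    // Matches U+30FC (ー) only when the previous character is NOT hiragana or
    // katakana (U+3040..U+30FF). A genuine prolonged sound mark follows a kana
    // character, so words such as コーヒー are left untouched.
    private static final Pattern STRAY_MARK =
            Pattern.compile("(?<![\\u3040-\\u30FF])\\u30FC");

    // Replaces each stray ー with an ASCII hyphen-minus.
    public static String normalize(String text) {
        return STRAY_MARK.matcher(text).replaceAll("-");
    }

    public static void main(String[] args) {
        // ー follows a kanji here, so it is treated as a separator.
        System.out.println(normalize("横山光輝ー項羽と劉邦")); // 横山光輝-項羽と劉邦
        // ー follows katakana here, so it is kept.
        System.out.println(normalize("コーヒー")); // コーヒー
    }
}
```

In a Lucene analysis chain the same pattern could probably be applied by a char filter placed in front of the tokenizer (for example `PatternReplaceCharFilter`); that wiring is not shown here.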


Regards,

Kazu
