TextRank: keywords() function always removes Japanese dakuten and handakuten

川頭信之

Jan 5, 2017, 3:51:21 AM

to gensim

Dear All,

I am using the gensim.summarization package with Japanese Unicode text.

The keywords() function does not work for me because it strips the dakuten and handakuten from the original text. For example, データ becomes テータ and パーティー becomes ハーティー.

The summarize() function, however, returns normal results with the dakuten and handakuten intact.

Please help me!

Lev Konstantinovskiy

Jan 5, 2017, 8:45:11 AM

to gensim

Hello,

The accents are removed by the text-cleaning code in keywords(), although this step is not necessary. The algorithm works even if the accents remain.

You are welcome to change the gensim code to make this configurable. A pull request that makes deacc a parameter of keywords(), with deacc=True by default, would be welcome.
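
To illustrate why the kana are affected, here is a minimal sketch of the usual accent-stripping technique, using only Python's standard unicodedata module. It assumes the cleaning step decomposes text to NFD and drops combining marks (Unicode category Mn); strip_marks is a hypothetical name used for illustration, not the gensim function itself.

```python
import unicodedata


def strip_marks(text):
    """Drop combining marks after canonical decomposition (illustrative helper)."""
    # NFD splits デ (U+30C7) into テ (U+30C6) + combining dakuten (U+3099).
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks, including dakuten and handakuten, have category "Mn".
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left.
    return unicodedata.normalize("NFC", kept)


print(strip_marks("データ"))      # テータ
print(strip_marks("パーティー"))  # ハーティー
print(strip_marks("café"))        # cafe
```

The same step that turns "café" into "cafe" treats the voicing marks on kana as accents, which is why skipping it (deacc=False) matters for Japanese.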

Regards
Lev

Lev Konstantinovskiy

Jan 6, 2017, 8:06:13 AM

to gensim

Hello,

By the way, this is fixed in the current develop (unreleased) version of gensim. Thanks to Bhargav for the quick fix.

Regards
Lev

川頭信之

Jan 7, 2017, 4:31:50 AM

to gen...@googlegroups.com, lev....@gmail.com

Hello Lev,

Thank you. Your advice is very helpful. I also fixed keywords() and some other code locally, but have not pushed the changes yet.

I will wait for the release.

Best regards,

Nobu

--
You received this message because you are subscribed to a topic in the Google Groups "gensim" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gensim/-0uMMFOzMCQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gensim+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

川頭信之
Nobuyuki KAWAGASHIRA

Flying data scientist
R, Python, UNIX, databases
Data mining, natural language processing, CodeIQ problem setter, MOOC
Hang gliding, skiing, small yachts

Data analysis: http://www.slideshare.net/nobuyukikawagashira

CodeIQ: https://codeiq.jp/ace/kawagashira_nobuyuki/

Blog: http://dataflight.wordpress.com/
Twitter: http://twitter.com/nkawagashira
YouTube: http://www.youtube.com/kawagashira/

Gengdai Liu

Apr 22, 2017, 2:51:58 AM

to gensim, lev....@gmail.com

川頭-san,

I am a new gensim user and want to extract keywords from Japanese text.

Could you please tell me how to use gensim's summarization module to do this?

When I pass Unicode text to the keywords() function, it returns an empty string.

Thank you.

Best regards,

Gengdai


川頭信之

Apr 22, 2017, 4:53:58 AM

to liuge...@gmail.com, gen...@googlegroups.com

Dear Gengdai,

> Could you please tell me how to use gensim's summarization module to do it?
> When I pass a unicode text into keywords() function, an empty string is returned.

First run Japanese morphological analysis with MeCab to produce text in which the words are separated by spaces.

http://taku910.github.io/mecab/

Then call keywords() on the space-separated text as follows:

from gensim.summarization import keywords
keyword = keywords(text, ratio=0.6, deacc=False, split=True)
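
Putting the two steps together, a hedged sketch might look like the following. It assumes the mecab-python3 bindings (imported as MeCab) with a dictionary installed, and a gensim release older than 4.0, since the summarization module was removed in 4.0. wakati and extract_keywords are hypothetical helper names, not part of either library.

```python
def wakati(text):
    """Split Japanese text into space-separated words with MeCab."""
    import MeCab  # assumption: mecab-python3 and a dictionary are installed

    # "-Owakati" asks MeCab to output the words joined with spaces.
    tagger = MeCab.Tagger("-Owakati")
    return tagger.parse(text).strip()


def extract_keywords(text, ratio=0.6):
    """Return TextRank keywords for Japanese text, keeping dakuten intact."""
    from gensim.summarization import keywords  # assumption: gensim < 4.0

    # deacc=False skips accent removal; split=True returns a list of strings.
    return keywords(wakati(text), ratio=ratio, deacc=False, split=True)


# Usage (requires MeCab and gensim to be installed):
#   extract_keywords(japanese_text)
```

The imports live inside the functions so the sketch can be loaded even when one of the libraries is missing; in real code they would normally go at the top of the module.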

Best regards,

Nobuyuki
