TextRank: keywords() function always strips Japanese dakuten and handakuten


川頭信之

Jan 5, 2017, 3:51:21 AM
to gensim
Dear All,

I am using the gensim.summarization package with Japanese Unicode text.
The keywords() function does not work correctly
because it strips dakuten and handakuten from the original text,
for example turning データ into テータ and パーティー into ハーティー.
The summarize() function, however, returns normal results with dakuten and handakuten intact.
Please help me!

Lev Konstantinovskiy

Jan 5, 2017, 8:45:11 AM
to gensim
Hello,

The accents are removed by the text-cleaning code in keywords(), although this step is not required: the algorithm works even if the accents remain.
You are welcome to change the gensim code to make this configurable. A pull request that makes deacc a parameter of keywords(), with deacc=True by default, would be welcome.
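
To illustrate why accent removal damages Japanese text, here is a minimal sketch that uses only the Python standard library. It assumes the cleaning step works the way deaccenting commonly does: decompose the text (NFD), drop combining marks, and recompose (NFC). The function name strip_combining_marks is hypothetical and is not gensim's own code.

```python
import unicodedata


def strip_combining_marks(text):
    """Hypothetical illustration of a typical deaccenting step."""
    # NFD splits a voiced kana such as デ into the base kana テ
    # plus a combining (han)dakuten mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping every nonspacing mark (category "Mn") removes the
    # combining dakuten and handakuten along with Latin accents.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


print(strip_combining_marks("データ"))      # テータ
print(strip_combining_marks("パーティー"))  # ハーティー
```

Under that assumption, dakuten and handakuten are treated exactly like Latin accents, which would explain the output reported above.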

Regards
Lev

Lev Konstantinovskiy

Jan 6, 2017, 8:06:13 AM
to gensim
Hello,

By the way, this is fixed in the current develop (unreleased) version of Gensim. Thanks to Bhargav for the quick fix.

Regards
Lev

川頭信之

Jan 7, 2017, 4:31:50 AM
to gen...@googlegroups.com, lev....@gmail.com
Hello Lev,

Thank you. Your advice is very helpful. I had also fixed keywords() and others myself, but have not pushed the changes yet.
I will wait for the release.

Best regards,

Nobu

--
You received this message because you are subscribed to a topic in the Google Groups "gensim" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gensim/-0uMMFOzMCQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gensim+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
川頭 信之
Nobuyuki KAWAGASHIRA

Flying data scientist
R, Python, UNIX, databases
Data mining, natural language processing, CodeIQ problem author, MOOC
Hang gliding, skiing, small sailboats

Data analysis: http://www.slideshare.net/nobuyukikawagashira

Gengdai Liu

Apr 22, 2017, 2:51:58 AM
to gensim, lev....@gmail.com
川頭-san,
I am a new gensim user and want to extract keywords from Japanese text.
Could you please tell me how to use gensim's summarization module to do this?
When I pass Unicode text to the keywords() function, it returns an empty string.

Thank you.
Best regards,
Gengdai


川頭信之

Apr 22, 2017, 4:53:58 AM
to liuge...@gmail.com, gen...@googlegroups.com
Dear Gengdai,

> Could you please tell me how to use gensim's summarization module to do it?
> When I pass a unicode text into keywords() function, an empty string is returned.

You should first run Japanese morphological analysis with MeCab to split the text into words joined by spaces.
Then call keywords() as follows:
from gensim.summarization import keywords
keyword = keywords(text, ratio=0.6, deacc=False, split=True)
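
The two steps above might be combined roughly as follows. This is only a sketch under stated assumptions: it assumes the mecab-python3 package (with a dictionary) is installed and that the gensim version is older than 4.0, since the gensim.summarization module was removed in 4.0. The helper names tokenize_with_mecab and japanese_keywords are hypothetical.

```python
def tokenize_with_mecab(text):
    """Return `text` as space-separated words using MeCab's wakati output."""
    # Assumption: the mecab-python3 package and a dictionary are installed.
    import MeCab

    tagger = MeCab.Tagger("-Owakati")
    return tagger.parse(text).strip()


def japanese_keywords(text, ratio=0.6):
    """Hypothetical helper: tokenize with MeCab, then call keywords()."""
    # Assumption: gensim < 4.0; gensim.summarization was removed in 4.0.
    from gensim.summarization import keywords

    spaced_text = tokenize_with_mecab(text)
    # deacc=False keeps dakuten and handakuten; split=True returns a list.
    return keywords(spaced_text, ratio=ratio, deacc=False, split=True)
```

The imports sit inside the functions so the module can be loaded even when MeCab or gensim is missing. Because gensim's text cleaning is geared toward English, the quality of the keywords on Japanese input may vary.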

Best regards,

Nobuyuki