Japanese word frequency counter

Alanna Krause

Sep 8, 2011, 2:42:26 AM
to Honyaku E<>J translation list
Hi. Does anyone know of a good tool for counting word frequency that
can parse Japanese? I don't mean counting total characters or words,
but something that will tell me how many times a given word appears in
the document, and can list all words in order of frequency. Thanks!

Rene

Sep 8, 2011, 2:52:18 AM
to hon...@googlegroups.com

That would be a concordance. A short Google expedition turns up this script... maybe it works for you:

http://www.localizingjapan.com/blog/2008/08/21/generate-a-concordance-from-an-xml-file/

Regards
Rene von Rentzell

Jens Wilkinson

Sep 8, 2011, 3:27:11 AM
to hon...@googlegroups.com
On Thu, Sep 8, 2011 at 3:52 PM, Rene <Yoi...@gmail.com> wrote:
> On Thu, Sep 8, 2011 at 3:42 PM, Alanna Krause <kdyn...@gmail.com> wrote:
>>
>> Hi. Does anyone know of a good tool for counting word frequency that
>> can parse Japanese? I don't mean counting total characters or words,
>> but something that will tell me how many times a given word appears in
>> the document, and can list all words in order of frequency. Thanks!
>
> That would be a concordance. A short Google expedition turns up this script... maybe
> it works for you:
>

I think that you need to be careful because in linguistics,
concordance is used in a different sense than in computing. I think
that concordance would show how words relate to one another
grammatically.

In response to the question, I'm sure there are such scripts, because
people do computational linguistics in Japan and need them. I'm
curious how this relates to translation. Is it a specific document
that you want to analyze, because the author uses the same word over
and over?

Jens Wilkinson
Neo Patwa (patwa.pbwiki.com)

Jean-Christophe Helary

Sep 8, 2011, 3:44:19 AM
to hon...@googlegroups.com

I use yougox:

http://www15.big.or.jp/~t98907/yougox/


Jean-Christophe Helary
----------------------------------------
fun: http://mac4translators.blogspot.com
work: http://www.doublet.jp (ja/en > fr)
tweets: http://twitter.com/brandelune

Jeremy Main

Sep 8, 2011, 4:14:27 AM
to hon...@googlegroups.com
On Sep 8, 2011, at 3:42 PM, Alanna Krause wrote:

EKWords is also fairly good, and you can't beat the price: free.
http://www.djsoft.co.jp/products/ekwords.html

I use it from time to time.

Jeremy Main
(Yokohama)


Peter Clark

Sep 8, 2011, 4:53:48 AM
to hon...@googlegroups.com
I have gotten AntConc to work on Japanese in the past.
http://www.antlab.sci.waseda.ac.jp/software.html#antwordprofiler
 
It takes some patience to set up the environment and actually get the program working, if you aren't used to that sort of thing.
 
Peter Clark

a2ztranslate

Sep 8, 2011, 8:20:05 AM
to Honyaku E<>J translation list
Very, very hard to do, because, as you know, with kanji how do you
split kanji combinations into separate "words" (and, I would argue,
there is no such thing in Japanese)? It is a big ask for any software
to do this accurately! Consider the "word" 地下: it has multiple
meanings in English (underground, basement, etc.).

Context is KING

Unless you have the ultimate Japanese dictionary that can read, e.g.,
下 in all of its possible combinations (including those random proper
nouns that don't seem to make any sense at all), I don't see how this
is possible.

Joe

Sep 8, 2011, 8:30:32 AM
to Honyaku E<>J translation list
Hi,

This is probably above and beyond what you are looking for, but I
thought I'd throw it out there just in case.

I work primarily on speech recognition and natural language
processing, and there are several standard open-source tools in this
area that are typically used for segmenting and annotating Japanese
text for use in automatic speech recognition or natural language
understanding systems. Probably the best is a tool called
MeCab,
http://mecab.sourceforge.net/

which can be used to (fairly) accurately segment arbitrary Japanese
text. Once segmented, you can use standard word-counting tools for
English or other Romance languages, or use the tokenized text to build
other models.
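
To make this concrete, here is a minimal sketch of that workflow in
Python. It assumes the mecab-python3 bindings and a default lexicon
(e.g. IPADIC) are installed; "document.txt" is just a placeholder file
name, and the output granularity depends on the lexicon:

import collections
import MeCab

tagger = MeCab.Tagger("-Owakati")   # output space-separated surface forms

# "document.txt" stands in for whatever text you want to analyze
with open("document.txt", encoding="utf-8") as f:
    text = f.read()

tokens = tagger.parse(text).split()
counts = collections.Counter(tokens)

# print the 50 most frequent tokens, one per line
for word, n in counts.most_common(50):
    print("%d\t%s" % (n, word))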

Building the tool is probably non-trivial if you have no programming
experience, but if there is broader interest or a general need along
these lines, I'd be happy to build a tool useful to the community.

Again, I suspect this is well outside the scope of what is being
asked, but since it is close to my area and interests, I thought I'd
throw it out there.

-Joe

JimBreen

Sep 8, 2011, 7:27:22 PM
to Honyaku E<>J translation list
On Sep 8, 10:30 pm, Joe <josef.robert.no...@gmail.com> wrote:
> ......... there are several standard open-source tools in this
> area that are typically used for segmenting and
> annotating Japanese text for use in automatic speech recognition or
> natural language understanding systems. Probably the best is a tool
> MeCab, http://mecab.sourceforge.net/
>
> which can be used to (fairly) accurately segment arbitrary Japanese
> text. Once segmented you can use standard word counting tools for
> English or other romance languages, or use the tokenized text to build
> other models.

I quite agree that MeCab is the best of the free tools available.
I use it regularly, and in some experimental work I run it
thousands of times a day for sentence-by-sentence analysis of
large amounts of text.

However, I don't think it really is immediately suitable for the
sort of concordance generation being discussed here. MeCab,
as with similar tools such as Juman and ChaSen, is a
morphological analyzer, not a word segmenter. Much of this
comes down to the definition of "word" in Japanese, and to
the choice of morpheme lexicon you use.

If I can illustrate with a simple sentence:
民衆は残酷な暴君によって虐げられていた。

MeCab, using the common "IPADIC" lexicon, splits
this into:
民衆|は|残酷|な|暴君|によって|虐げ|られ|て|い|た。
With the more rigorous UniDic lexicon, you get the same
except によって is split に|よっ|て.

I suspect a real concordancer would want 虐げられて
counted in its entirety, and also perhaps as an instance
of 虐げる.
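
To make that concrete, here is a rough Python sketch of counting by
base (dictionary) form rather than surface form, so that 虐げ, 虐げられ,
etc. all accumulate under 虐げる. It assumes the mecab-python3 bindings
with an IPADIC-style lexicon, where the seventh comma-separated feature
field is the base form; it is only an illustration, not a finished
concordancer:

import collections
import MeCab

tagger = MeCab.Tagger()
counts = collections.Counter()

for line in tagger.parse("民衆は残酷な暴君によって虐げられていた。").splitlines():
    if line == "EOS":
        continue
    surface, feature = line.split("\t")
    fields = feature.split(",")
    # field 6 in IPADIC-style output is the base form; fall back to the surface
    base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
    counts[base] += 1

print(counts.most_common())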

> Building the tool is probably non-trivial if you have no experience
> with programming, but if there is broader interest or a related
> general need in this regard I'd be happy to build a tool useful to the
> community.

As it happens, I have the makings of such a tool
myself. It uses MeCab/UniDic, then builds the
morphemes back into "words" using a very
large dictionary. For the sentence above,
虐げられていた is treated as containing
虐げる and 居る.
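
Purely to illustrate the rejoining step (this is not the tool described
above), here is a toy Python sketch that greedily merges adjacent
morphemes whenever their concatenation appears in a word list. The word
list here is invented, and the real tool also maps inflected morphemes
back to dictionary forms such as 虐げる and 居る, which this
simplification skips:

def rejoin(morphemes, wordlist, max_span=4):
    """Greedily merge adjacent morphemes whose concatenation is a known word."""
    out, i = [], 0
    while i < len(morphemes):
        for span in range(min(max_span, len(morphemes) - i), 1, -1):
            candidate = "".join(morphemes[i:i + span])
            if candidate in wordlist:
                out.append(candidate)
                i += span
                break
        else:
            out.append(morphemes[i])   # no merge possible: keep the morpheme
            i += 1
    return out

# UniDic-style morphemes for the example sentence, plus a toy word list
morphemes = ["民衆", "は", "残酷", "な", "暴君", "に", "よっ", "て",
             "虐げ", "られ", "て", "い", "た"]
print(rejoin(morphemes, {"民衆", "暴君", "によって"}))
# ['民衆', 'は', '残酷', 'な', '暴君', 'によって', '虐げ', 'られ', 'て', 'い', 'た']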

I hope one day to get the tool
released, although I need to be careful
with the lexicon, as at present I am
using a combination of several dictionaries
including some commercial dictionaries that I
cannot release.

I have looked at some of the other tools
mentioned in this thread. As far as I can
tell they take a rather haphazard approach
to kana strings, and concentrate on the
kanji sequences.

HTH

Jim

Jean-Christophe Helary

Sep 8, 2011, 7:44:51 PM
to hon...@googlegroups.com

On Sep 9, 2011, at 8:27 AM, JimBreen wrote:

> I have looked at some of the other tools
> mentioned in this thread. As far as I can
> tell they take a rather haphazard approach
> to kana strings, and concentrate on the
> kanji sequences.

Indeed. They are "good enough" tools that can help you go through a big chunk of text to see what kind of terminology is required. If a rough count is all that matters, they'll do.

Joe

Sep 8, 2011, 10:33:05 PM
to Honyaku E<>J translation list
Hi!
Thanks for the feedback on this.

> However, I don't think it really is immediately suitable for the
> sort of concordance generation being discussed here. MeCab,
> as with similar tools such as Juman and ChaSen, is a
> morphological analyzer, not a word segmenter. Much of this
> comes down to the definition of "word" in Japanese, and to
> the choice of morpheme lexicon you use.
Yes, you are definitely right, and that deserved more discussion than I
provided.

> If I can illustrate with a simple sentence:
> 民衆は残酷な暴君によって虐げられていた。
>
> MeCab, using the common "IPADIC" lexicon, splits
> this into:
> 民衆|は|残酷|な|暴君|によって|虐げ|られ|て|い|た。
> With the more rigorous UniDic lexicon, you get the same
> except によって is split に|よっ|て.
I'd like to add, though, that depending on the dictionary you use to
build the tool and train the default models, you can also specify at
run time that it should output the part-of-speech tags, glosses, etc.
along with the segmentation. Both MeCab and the older ChaSen accept a
very feature-rich (and admittedly complicated to use...) configuration
file. I think that with a bit of work it isn't unreasonable to
reconstruct the more standard word-like forms people are looking for.

> I suspect a real concordancer would want 虐げられて
> counted in its entirety, and also perhaps as an instance
> of 虐げる.
Definitely.

> As it happens, I have the makings of such a tool
> myself. It uses MeCab/UniDic, then builds the
> morphemes back into "words" using a very
> large dictionary. For the sentence above,
> 虐げられていた is treated as containing
> 虐げる and 居る.
This sounds very cool, and much like what I was thinking.

> I hope one day to get the tool
> released, although I need to be careful
> with the lexicon, as at present I am
> using a combination of several dictionaries
> including some commercial dictionaries that I
> cannot release.
I suppose this is again moving outside the scope of this thread, but
I'd be very interested in contributing to the development of such a
tool, especially if it would speed up a public release (data issues
aside). I have some OSS experience and am currently located at a
university in Japan, so I'd be happy to discuss it further, if that
is a possibility.

Best,
Joe

Alanna Krause

Sep 9, 2011, 12:29:55 AM
to Honyaku E<>J translation list
Just want to say thank you to everyone who replied! I will be having a
closer look at all the tools mentioned.

Alexandru Pojoga

Sep 14, 2011, 4:17:51 AM
to hon...@googlegroups.com
I have tried using MeCab to automate extraction of new words for study; unfortunately, it is indeed overzealous: e.g., I would like to end up with 獅子奮迅 as one entry, but MeCab is probably going to give me two words. Essentially, what I want is the longest "word" (expression) as found in EDICT.

Mr. Breen is suggesting post-processing MeCab output to re-join the words where appropriate, but I am wondering: would it not be possible to do away with the MeCab step entirely? EDICT has close to 200,000 entries, and deinflection is not a difficult problem.

There is the problem of old or nonstandard spelling such as 絞込む instead of 絞り込む. Here we could enlist Kanjidic for proposing likely candidates, and the user could pick the appropriate variant using some kind of GUI -- with the choice impacting the rest of the text. The same GUI could be used to pick sensible interpretations from long strings of kana. (And of course Enamdict would come in for names.)

Just thinking out loud. Essentially I don't know what benefits MeCab brings to the table.


On Sep 9, 2011, at 2:27, JimBreen <jimb...@gmail.com> wrote:


JimBreen

Sep 14, 2011, 9:41:23 PM
to Honyaku E<>J translation list
On Sep 14, 6:17 pm, Alexandru Pojoga <apoj...@gmail.com> wrote:
> I have tried using MeCab to automate extraction of new words for study,
> unfortunately it's indeed overzealous; e.g. I would like to end up with 獅子奮迅
> as one entry but MeCab is probably going to give me two words.

It depends a little on the lexicon used, but yes, it will most likely
give 獅子 + 奮迅 as they are the morphemes.

> Essentially
> what I want is the longest "word" (expression) as found in EDICT.
>
> Mr. Breen is suggesting post-processing Mecab output to re-join the words
> where appropriate, but I am wondering, would it not be possible to do away
> with the Mecab step entirely? EDICT has close to 200,000 entries.
> Deinflection is not a difficult problem.

There are no easy shortcuts for doing a proper
parse/segmentation of Japanese. MeCab, like its
predecessors, uses some very sophisticated AI
techniques, and the lexicons have been trained on
swags of hand-segmented and tagged text.

In my WWWJDIC server's "text glossing" function
I do a rather messy "longest-match-first" parse using
a monster dictionary of every glossary I can get hold of.
It works most of the time, for example for something like
"救助隊の獅子奮迅の働きにより, がれきの中から生存者が救出された"
it does a fair job (see:
http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?9MGG%B5%DF%BD%F5%C2%E2%A4%CE%BB%E2%BB%D2%CA%B3%BF%D7%A4%CE%C6%AF%A4%AD%A4%CB%A4%E8%A4%EA,%20%A4%AC%A4%EC%A4%AD%A4%CE%C3%E6%A4%AB%A4%E9%C0%B8%C2%B8%BC%D4%A4%AC%B5%DF%BD%D0%A4%B5%A4%EC%A4%BF)
where it handles 獅子奮迅 and がれき OK (not
sure about 救出された) but even then it stumbles quite often.
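
For anyone curious what a "longest-match-first" pass looks like in
miniature, here is a toy Python sketch; the word list is invented, and
the real WWWJDIC glosser is of course far more involved (deinflection,
name handling, and so on):

def longest_match_segment(text, lexicon, max_len=10):
    """At each position, take the longest dictionary entry that matches."""
    out, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])   # no dictionary hit: emit the character as-is
            i += 1
    return out

lexicon = {"救助隊", "の", "獅子奮迅", "働き", "により"}
print(longest_match_segment("救助隊の獅子奮迅の働きにより", lexicon))
# ['救助隊', 'の', '獅子奮迅', 'の', '働き', 'により']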

I intend to rewrite the function using MeCab followed by an
aggregation of morphemes, as in testing, this approach is proving
to be much more complete and accurate, but it's going to take a
while.

> There is the problem of old or nonstandard spelling such as 絞込む instead of
> 絞り込む.

MeCab (with the UniDic lexicon) handles 絞込む quite well, indicating
that it is really 絞り + 込む. I try to have all these variants
included in the dictionaries.

> Here we could enlist Kanjidic for proposing likely candidates, and the
> user could pick the appropriate variant using some kind of GUI -- with the
> choice impacting the rest of the text. The same GUI could be used to pick
> sensible interpretations from long strings of kana. (And of course Enamdict
> would come in for names.)
>
> Just thinking out loud. Essentially I don't know what benefits MeCab brings
> to the table.

Well, it's a low-level heavy-iron tool. Wonderful in its field, but
mainly useful for NLP research. For many it is the part-of-speech
tagging that is the most useful feature. I appreciate that it's
not immediately useful to people working at the word/compound
level.

Cheers

Jim

Alexandru Pojoga

Sep 21, 2011, 9:48:34 PM
to hon...@googlegroups.com
Thank you for the explanation. I should have realized MeCab had more than strictly one use.
On one hand, there are people (like me) who wouldn't mind taking the time to massage the output to correct any misinterpretations, and on the other, researchers who need bulk decomposition and can tolerate occasional misfires.

By the way, the glossing feature on WWWJDIC is quite impressive, especially as it handles がれき in kana. I will be using it in the future.

My apologies for going off-course,

Alexandru Pojoga

