frequency of use for jmdict words

Stuart McGraw

Dec 7, 2002, 10:34:07 PM

I have fed a subset of the words in Edict/JMdict (those marked
as "ichi1", "jdd1", "gai1" and no kana-only words) through a
script that feeds them to a web search engine and extracts the
number of hits, as a rough metric of the usage frequency of the
words.

I would be interested in any comments about the validity or non-
validity of these numbers (other than the obvious fact that web
pages aren't representative of other forms of text). Are there
other pitfalls in interpreting these as a *rough* indication of
frequency of use?

If these data are of interest to anyone else, they are available
at http://www.frii.com/~smcgraw/freq.utf. The format is one
line per word, with each line holding three tab-separated fields:
hit_count, jmdict-seqno, word-text. Encoding is UTF-8 and
size is ~350kB.
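
For anyone who wants to read the file programmatically, the three-field
layout described above can be parsed along these lines. This is only a
minimal sketch: the names FreqEntry and parse_freq_lines are made up for
illustration, and the two sample rows (including their sequence numbers
and hit counts) are invented, not taken from the real file.

```python
from typing import Iterable, List, NamedTuple


class FreqEntry(NamedTuple):
    hit_count: int   # number of search-engine hits
    seqno: str       # JMdict sequence number, kept as text
    word: str        # the word as written


def parse_freq_lines(lines: Iterable[str]) -> List[FreqEntry]:
    """Parse lines of the form hit_count<TAB>jmdict-seqno<TAB>word-text."""
    entries = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        hit_count, seqno, word = line.split("\t", 2)
        entries.append(FreqEntry(int(hit_count), seqno, word))
    return entries


# Invented sample rows, for illustration only.
sample = ["12345\t1000010\t例文\n", "42\t1000020\t稀語\n"]
entries = parse_freq_lines(sample)

# With a local copy of the real file, one would presumably do:
#   with open("freq.utf", encoding="utf-8") as f:
#       entries = parse_freq_lines(f)
```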

Jim Breen

Dec 8, 2002, 6:44:40 PM

Stuart McGraw <smcgraw_n...@frii.com> dixit:

>>I have fed a subset of the words in Edict/JMdict (those marked
>>as "ichi1", "jdd1", "gai1" and no kana-only words) through a
>>script that feeds them to a web search engine and extracts the
>>number of hits, as a rough metric of the usage frequency of the
>>words.

Must have made the search engine run hot

>>I would be interested in any comments about the validity or non-
>>validity of these numbers (other than the obvious fact that web
>>pages aren't representative of other forms of text). Are there
>>other pitfalls in interpreting these as a *rough* indication of
>>frequency of use?

Well, you have probably established the frequency of use of those words
in WWW pages. I notice that many single kanji, e.g. 上, 分, etc. are
repeated in your list. Is this because they have multiple JMdict entries?

Two general comments:

(a) what would be of far greater interest to me would be a ranked listing
of the frequency of use of words *not* marked "ichi1", etc.

(b) I am interested in the words with low (< 500 hits) frequencies. As I
suspected, the "jdd1" set contributes a large proportion of these. Still
there are only about 400 of these.
--
Jim Breen (j.b...@csse.monash.edu.au http://www.csse.monash.edu.au/~jwb/)
Computer Science & Software Engineering, Tel: +61 3 9905 3298
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学

Stuart McGraw

Dec 9, 2002, 1:34:52 AM

"Jim Breen" <jwbR...@csse.monash.edu.au> wrote in message news:at0lh8$vgn$1...@towncrier.cc.monash.edu.au...

> Stuart McGraw <smcgraw_n...@frii.com> dixit:
> >>I have fed a subset of the words in Edict/JMdict (those marked
> >>as "ichi1", "jdd1", "gai1" and no kana-only words) through a
> >>script that feeds them to a web search engine and extracts the
> >>number of hits, as a rough metric of the usage frequency of the
> >>words.
>
> Must have made the search engine run hot

I thought of it as helping them with their load testing :-) It took about
13 hours to process the 14700 entries. I am on a fairly slow internet
connection, which probably reduced the load on the server a little, since
I made no attempt to minimize the size of the page returned.
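
As a rough check on what that load amounts to, the two figures above
(13 hours, 14700 entries) work out as follows. This assumes the queries
were spread evenly over the run, which is only an approximation.

```python
hours = 13        # total run time quoted above
entries = 14700   # number of entries processed

seconds_per_query = hours * 3600 / entries
queries_per_minute = 60 / seconds_per_query

# prints "3.2 s per query, about 19 queries per minute"
print(f"{seconds_per_query:.1f} s per query, "
      f"about {queries_per_minute:.0f} queries per minute")
```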

> >>I would be interested in any comments about the validity or non-
> >>validity of these numbers (other than the obvious fact that web
> >>pages aren't representative of other forms of text). Are there
> >>other pitfalls in interpreting these as a *rough* indication of
> >>frequency of use?
>
> Well, you have probably established the frequency of use of those words
> in WWW pages. I notice that many single kanji, e.g. 上, 分, etc. are
> repeated in your list. Is this because they have multiple JMdict entries?

Yes. The JMdict sequence numbers should be different for the repeated
kanji.
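
One way to list the repeated written forms together with their sequence
numbers is sketched below. It assumes rows of (hit_count, seqno, word) as
in the file format; the function name and the placeholder sequence numbers
"A1", "A2", etc. are invented for illustration. Since the query is just
the word text, entries sharing a written form would presumably also share
a hit count.

```python
from collections import defaultdict


def repeated_words(rows):
    """Map each word text that appears more than once to its sequence numbers."""
    groups = defaultdict(list)
    for hit_count, seqno, word in rows:
        groups[word].append(seqno)
    return {word: seqnos for word, seqnos in groups.items() if len(seqnos) > 1}


# Invented rows; the sequence numbers are placeholders, not real JMdict ones.
rows = [(100, "A1", "上"), (100, "A2", "上"), (50, "B1", "分"), (7, "C1", "例")]
repeated = repeated_words(rows)
```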

> Two general comments:
>
> (a) what would be of far greater interest to me would be a ranked listing
> of the frequency of use of words *not* marked "ichi1", etc.
>
> (b) I am interested in the words with low (< 500 hits) frequencies. As I
> suspected, the "jdd1" set contributes a large proportion of these. Still
> there are only about 400 of these.

Out of curiosity, why are the low-frequency ones of interest? Also, I would
think that low-frequency results carry a lot more uncertainty, which would
make the differences between them less reliable.

I do intend to continue processing the rest of the words, probably in batches
of 10K or so every few days in an attempt not to wear out my welcome on
the search engine. I wanted to check here first to make sure I was not
totally wasting my time.
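
The batching itself could be as simple as the sketch below. The batch size
of 10,000 comes from the plan above; the helper name and the dummy word
list are invented, and the actual querying and the pause of a few days
between batches are left out.

```python
from typing import Iterator, List, Sequence


def batches(words: Sequence[str], size: int = 10_000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` words."""
    for start in range(0, len(words), size):
        yield list(words[start:start + size])


# Dummy data standing in for the words still to be processed.
remaining = [f"word{i}" for i in range(25_000)]
sizes = [len(batch) for batch in batches(remaining)]
```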

Jim Breen

Dec 12, 2002, 6:28:40 PM

Stuart McGraw <smcgraw_n...@frii.com> dixit:

>>"Jim Breen" <jwbR...@csse.monash.edu.au> wrote in message news:at0lh8$vgn$1...@towncrier.cc.monash.edu.au...

>>> (b) I am interested in the words with low (< 500 hits) frequencies. As I
>>> suspected, the "jdd1" set contributes a large proportion of these. Still
>>> there are only about 400 of these.

>>Out of curiosity, why are the low-frequency ones of interest? Also, I would
>>think that low-frequency results carry a lot more uncertainty, which would
>>make the differences between them less reliable.

I'm interested in the criteria that lexicographers use to include/exclude
words. Since the "jdd1" tags come from a small dictionary, it's of interest
that about 5% of words in it are fairly rare.
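
Picking out the rare entries from the hit counts is a simple threshold
filter, sketched here with the "< 500 hits" cut-off mentioned earlier in
the thread. The rows are invented (hit_count, word) pairs, and the file
itself does not carry the "jdd1" tags, so attributing the rare words to a
tag set would need a join against JMdict that is not shown.

```python
def low_frequency(rows, threshold=500):
    """Return the rows whose hit count is strictly below the threshold."""
    return [row for row in rows if row[0] < threshold]


# Invented (hit_count, word) rows for illustration.
rows = [(120000, "上"), (499, "例A"), (500, "例B"), (3, "例C")]
rare = low_frequency(rows)
share = len(rare) / len(rows)  # fraction of entries under the threshold
```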

>>I do intend to continue processing the rest of the words, probably in batches
>>of 10K or so every few days in an attempt not to wear out my welcome on
>>the search engine. I wanted to check here first to make sure I was not
>>totally wasting my time.

You're not. As I said, I am very interested in cases that are not (yet) tagged,
but are common.

hello

Dec 12, 2002, 10:32:18 PM

n

Paul Blay

Dec 13, 2002, 1:45:49 AM

"hello" wrote ...
> n

N, r?