Unfortunately, Altavista does not seem to return consistent hit-counts,
so all the Altavista numbers should probably be ignored
(see below).
The following files are available at http://www.frii.com/~smcgraw/
freq_av.txt -- Altavista results, no domain restriction.
Same data that was presented last Dec
in freq_all.utf but with the new file format.
freq_av_jp.txt -- Altavista .jp results.
freq_google_jp.txt -- Google .jp results.
rand250.xls -- Compares hit counts and order for each
of the above sets, plus two reruns, based
on a random sample of 250 words. MS
Excel 2000.
README.txt -- Copy of this sci.lang.japan posting.
The two files I put up last December are still available:
freq_all.utf -- Altavista results released last Dec.
Numbers should be the same as in freq_av.txt
but the format is slightly different.
freq.utf -- Subset of words from freq_all.utf: those
marked as "ichi1", "jdd1", or "gai1" in
JMdict.
README_old.txt -- Description of the above two files.
Each of the freq*.txt files is about 1.8MB. The words in each
were taken from Jim Breen's JMdict (V2001-03 14 October 2001).
The format is slightly different from that of the files I previously
released. There are a few comment lines at the top of the file (# in
the first column) that describe the contents/format. The rest of the
file is three tab-separated fields per line: hit-count, order, word
(UTF-8 encoded). Words are sorted in order of decreasing hit-count.
Position in the list is indicated by an order number (2nd field). Words
with the same hit count share an order number, equal to the position of
the last word in the group.
There are no longer any duplicated words in the files.
Files have MS-style line endings ("\r\n") and the first line is
preceded by the UTF-8 BOM ("\xef\xbb\xbf") that MS puts in UTF-8
files.
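For concreteness, here is a minimal Python sketch of a reader for this
format (the path is whichever freq*.txt file you grabbed):

    # Read a freq*.txt file: "utf-8-sig" strips the BOM, "#" lines are
    # comments, and each data line is hit-count TAB order TAB word.
    def read_freq(path):
        entries = []
        with open(path, encoding="utf-8-sig") as f:
            for line in f:
                if line.startswith("#"):
                    continue
                hits, order, word = line.rstrip("\r\n").split("\t")
                entries.append((int(hits), int(order), word))
        return entries

    # Because tied hit-counts share the order number of the last word
    # in the group, order numbers can repeat and then skip ahead.
    words = read_freq("freq_av.txt")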
The freq_av.txt file contains the same data that I released last
Dec (same as freq_all.utf but with the new format), and the hit-counts
are likely to include a large number of Chinese-language pages, at
least for words that contain no kana characters.
While gathering these numbers it became apparent that Altavista
was not returning consistent results. They are usually consistent,
but sometimes wildly wrong numbers are returned. Below are the
results of one test run I made repeating the same characters.
(This is one of the worst results -- usually there would be fewer
or no differences. And I did not try with different types of words,
so I do not know whether these bogus numbers occur with only some
kinds of searches, e.g. single-kanji words.) I see the same bogus
hit counts sometimes when doing searches "by hand", so I don't think
the problem is in my script. I emailed Altavista tech support
last Friday but have not gotten a response yet.
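(The test itself amounts to the loop below; hit_count() is a
hypothetical stand-in for however a script scrapes the count out of
the engine's results page -- the point is only the repeat-and-compare
logic.)

    import time

    # Query the same word repeatedly and collect the distinct counts
    # that come back; a well-behaved engine would return exactly one.
    def check_consistency(word, hit_count, tries=8, pause=10):
        counts = []
        for _ in range(tries):
            counts.append(hit_count(word))  # hypothetical scraper
            time.sleep(pause)               # ~10 sec between searches
        return sorted(set(counts))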
The rand250.xls file shows a correlation between two identical
Altavista_jp runs, made two days apart, that is much lower (.84)
than I'd expect, due to large differences in only two words. When
these two words were removed the correlation was perfect. I suspect
that these two differences are manifestations of the same problem.
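(As an illustration of how little it takes, here is a toy Python
example with invented numbers: six words with identical counts in both
runs, plus two words with wildly different counts. It is not the
rand250 data; statistics.correlation needs Python 3.10+.)

    from statistics import correlation  # Pearson's r, Python 3.10+

    # Invented counts: the first six words agree exactly; the last two
    # are "bogus", each wrong in a different direction.
    run1 = [1000000, 800000, 400000, 200000, 100000, 50000, 20000000, 5000]
    run2 = [1000000, 800000, 400000, 200000, 100000, 50000, 6000, 12000000]

    print(correlation(run1, run2))            # well below 1
    print(correlation(run1[:-2], run2[:-2]))  # exactly 1.0 without them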
~~~~~~~~~~~
Altavista test results. There was about 10 sec between
each search. If there are encoding problems: lines 1-8, 17,
19, 21, 23, 25, 27 are the single kanji "nichi" (日) and lines 9-16,
18, 20, 22, 24, 26 are the single kanji "hon" (本).
# Fri Feb 14 10:31:06 2003 av_jp
1 21216895 日
2 21216895 日
3 21216895 日
4 21216895 日
5 21216895 日
6 21216895 日
7 21216895 日
8 21216895 日
9 5526 本
10 5526 本
11 7648258 本
12 8344000 本
13 12863606 本
14 12863606 本
15 12863606 本
16 5526 本
17 6613 日
18 352525 本
19 6613 日
20 352525 本
21 6613 日
22 700249 本
23 6613 日
24 700249 本
25 21216895 日
26 12863606 本
27 21216895 日
28 12863606 本
>
> The following files are available at http://www.frii.com/~smcgraw/
>
> freq_av.txt -- Altavista results, no domain restriction.
> Same data that was presented last Dec
> in freq_all.utf but with the new file format.
> freq_av_jp.txt -- Altavista .jp results.
> freq_google_jp.txt -- Google .jp results.
Taking a quick look at the freq_google_jp.txt file, it is apparent
that Google completely ignores the 々 repeater mark: e.g. 人々
had the same number of hits as 人, and so on for a great number of words
of the form X々. It would be super cool if you could remove these
words from the file.
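Something along these lines would do it -- a minimal Python sketch,
with the input and output file names assumed:

    # Drop every data line whose word field contains the 々 repeater
    # mark; comment lines pass through untouched.  newline="" keeps
    # the \r\n line endings as-is.
    with open("freq_google_jp.txt", encoding="utf-8-sig", newline="") as src, \
         open("freq_google_jp_norep.txt", "w", encoding="utf-8", newline="") as dst:
        for line in src:
            if line.startswith("#") or "々" not in line.split("\t")[-1]:
                dst.write(line)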
> While gathering these numbers it became apparent that Altavista
> was not returning consistent results. They are usually consistent,
> but sometimes wildly wrong numbers are returned. Below are the
> results of one test run I made repeating the same characters.
> (This is one of the worst results -- usually there would be fewer
> or no differences. And I did not try with different types of words,
> so I do not know whether these bogus numbers occur with only some
> kinds of searches, e.g. single-kanji words.) I see the same bogus
> hit counts sometimes when doing searches "by hand", so I don't think
> the problem is in my script. I emailed Altavista tech support
> last Friday but have not gotten a response yet.
It's stuff like this that makes me glad I use Google.
--
Curt Fischer
Jim Bree_n_ mentioned some time ago* that '々' looks like whitespace
to Google. Exactly _why_ that would be the case is beyond me ...
* Err, may have been in an email.
>It's stuff like this that makes me glad I use Google.
Google's Japanese parsing (if indeed it does a segmentation before
indexing) misses quite often. Apart from the 々 problem, it matches
across punctuation, matches the last kanji of one table entry and the
first of the next as though they were a jukugo, etc., etc.
I have even found cases where despite having set "lr=lang_ja", I get
matched on pages of Chinese.
It seems at its worst when reporting on PDF files. I don't know if it's
using vanilla Acrobat, but the matches reported on ex-PDF documents are
often very odd.
--
Jim Breen (j.b...@csse.monash.edu.au http://www.csse.monash.edu.au/~jwb/)
Computer Science & Software Engineering, Tel: +61 3 9905 3298
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学
Very interesting work, but why do you exclude the kana forms?
You could extract the subset that has the (uk) marker and do a search
on it...
Also, what do you do for verbs? Only search for the dictionary form?
Time, uncertainty about the value of the results, and difficulty getting
search engines to process tens of thousands of words without complaining.
But having gotten some useful results for the kanji words, I will probably
try to get values for the kana words in the near future.
> You could extract the subset that has the (uk) marker and do a search
> on it...
A good suggestion, thanks.
> Also, what do you do for verbs? Only search for the dictionary form?
Yes.
So one should NOT directly compare verb frequencies with the frequencies
of other words, or else apply an appropriate correction factor.
Don't get me wrong, your work is very useful, and I'll try to find some
use for it as soon as I can.
http://www.frii.com/~smcgraw/japan/freq_google_jp_norep.utf
> Stuart McGraw wrote:
> >>Also, what do you do for verbs? Only search for the dictionary form?
> >
> > Yes.
>
> So one should NOT directly compare verb frequencies with the frequencies
> of other words, or else apply an appropriate correction factor.
I am inclined to deal with the verbs separately. Thus ask: "What are
the most popular verbs?" Then generate from Stuart's list a verb list
in Google order. (The verbs are coded as such in JMdict and EDICT.)
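A minimal Python sketch of that filter, assuming you have already
extracted the dictionary forms of the verbs from JMdict into a
one-word-per-line file (verbs.txt here is a hypothetical name; the
frequency file itself carries no part-of-speech information):

    # Keep only the verbs, preserving the hit-count order of the file.
    verbs = {line.strip() for line in open("verbs.txt", encoding="utf-8")}

    with open("freq_google_jp.txt", encoding="utf-8-sig") as f:
        for line in f:
            if line.startswith("#"):
                continue
            hits, order, word = line.rstrip("\r\n").split("\t")
            if word in verbs:
                print(hits, order, word)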
An interesting question is what such a correction factor is (and if it exists
at all).
> Don't get me wrong, your work is very useful, and I'll try to find some
> use for it as soon as I can.
JMdict has around 10,000 verbs, about 3,700 of which are tagged "vs".
It might be practical to get hit counts for the most common conjugated
forms. Which forms would those be? I suppose one would have to do
something similar for adjectives.
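For ichidan verbs at least, generating candidate forms is mechanical;
a toy Python sketch (godan verbs would need per-row stem changes, so
they are deliberately left out here):

    # Common conjugated forms for ichidan ("ru-dropping") verbs only:
    # plain, polite, past, -te, and negative.
    ENDINGS = ["る", "ます", "た", "て", "ない"]

    def ichidan_forms(dict_form):
        assert dict_form.endswith("る")
        stem = dict_form[:-1]
        return [stem + e for e in ENDINGS]

    print(ichidan_forms("食べる"))
    # ['食べる', '食べます', '食べた', '食べて', '食べない']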
Alternatively, maybe there is a search engine somewhere that indexes
the stemmed forms of verbs (e.g. "swim", "swam", and "swimming"
are all considered the same word). Google doesn't. There was some
discussion here a month or so ago about that.
I'm not sure how far one can go, trying to turn search engine hit-counts
into real frequency-of-use numbers, particularly without knowing how
search engines parse and index words.
I agree.
>An interesting question is what such a correction factor is (and if it exists
>at all).
I don't think it does, as it will change from verb to verb. It would
be interesting to try the whole range for a number of verbs.
>JMdict has around 10,000 verbs, about 3,700 of which are tagged "vs".
>It might be practical to get hit counts for the most common conjugated
>forms. Which forms would those be?
This could be determined empirically.
>I suppose one would have to do
>something similar for adjectives.
Much simpler than verbs, I think.
>Alternatively, maybe there is a search engine somewhere that indexes
>the stemmed forms of verbs (e.g. "swim", "swam", and "swimming"
>are all considered the same word). Google doesn't. There was some
>discussion here a month or so ago about that.
In theory a search engine, having found a page with Japanese, could
invoke a Japanese morphological analyzer, then index on the results of that.
That would bring all the verbs and adjectives back to root forms.
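As a sketch of that idea (using the MeCab Python bindings here purely
for concreteness -- the thread mentions ChaSen, which would serve
equally well; with an IPAdic-style dictionary the base form is the
7th comma-separated feature):

    import MeCab

    tagger = MeCab.Tagger()

    # Reduce every token in a text to its dictionary (base) form --
    # the normalization a search engine could apply before indexing.
    def index_terms(text):
        terms = set()
        for line in tagger.parse(text).splitlines():
            if "\t" not in line:        # skips the trailing "EOS" line
                continue
            surface, features = line.split("\t", 1)
            feats = features.split(",")
            base = feats[6] if len(feats) > 6 and feats[6] != "*" else surface
            terms.add(base)
        return terms

    # 泳いだ and 泳ぎます both index under 泳ぐ, so a search for the
    # dictionary form would match every conjugated occurrence.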
>I'm not sure how far one can go, trying to turn search engine hit-counts
>into real frequency-of-use numbers, particularly without knowing how
>search engines parse and index words.
Yes, you are in their hands unless you want to go into a massive data
collection exercise.
An interesting experiment to do.
There very probably are verbs that are used much more often in some
forms than in others.
But in fact, this shows a problem that also affects the comparison
between verbs.
If the relative frequency of the dictionary form were the same for all
verbs, it wouldn't be very difficult to find what it is and the
appropriate correction factor.
If it's not, then there's a problem even when comparing the frequencies
of two verbs if you rely only on the dictionary form.
>>JMdict has around 10,000 verbs, about 3,700 of which are tagged "vs".
>>It might be practical to get hit counts for the most common conjugated
>>forms. Which forms would those be?
>
> This could be determined empirically.
For "vs" words, a statistic that cumulates the noun and verb usage
clearly seems good enough. So only the 7300 other are problematic.
>>Alternatively, maybe there is a search engine somewhere that indexes
>>the stemmed forms of verbs (e.g. "swim", "swam", and "swimming"
>>are all considered the same word). Google doesn't. There was some
>>discussion here a month or so ago about that.
>
> In theory a search engine, having found a page with Japanese, could
> invoke a Japanese morphological analyzer, then index on the results of that.
> That would bring all the verbs and adjectives back to root forms.
I had thought about getting a large number of "interesting" web page
addresses, collecting the data, and feeding it to ChaSen.
I had a friend do a few tests and it seems very doable, but you just need
to accumulate the appropriate amount of data.
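Roughly, that pipeline could look like the following Python sketch.
It is a sketch only: extract_text() is a hypothetical HTML-stripping
helper, and it assumes ChaSen is on PATH, speaks EUC-JP, and puts the
base form in the third tab-separated field of its default output
(check your chasenrc).

    import subprocess
    from collections import Counter
    from urllib.request import urlopen

    def base_form_counts(urls, extract_text):
        counts = Counter()
        for url in urls:
            text = extract_text(urlopen(url).read())  # plain Japanese text
            out = subprocess.run(["chasen"], input=text.encode("euc-jp"),
                                 capture_output=True).stdout.decode("euc-jp")
            for line in out.splitlines():
                fields = line.split("\t")
                if len(fields) >= 3:          # skips the "EOS" lines
                    counts[fields[2]] += 1    # base form (assumed field)
        return counts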
By doing that, you could also separate the data sources by category, to
see how the statistics vary.
I never really started on it. Even after collecting a lot of data, the
amount of text will be tiny compared to what Google has.
This means it will be possible to make a better analysis only for really
common words, not for the very rare ones.
I'd say that for any word that appears three or fewer times in the
sample, the count is too small to give a statistically significant
result. (With a Poisson model, a count of 3 has a standard deviation
of about 1.7, i.e. nearly 60% relative error.)
Trying to think more globally, I'd say that the idea of getting the
exact, very precise frequency of each word cannot lead very far.
It all depends on the context, type of writing, writer's style, etc.
So there are no universally exact, precise frequencies.
Therefore the fact that there are some problems with the data from
Google is not that annoying, because the results should only be
interpreted as really significant within an order of magnitude.
A factor of 2 between the "Google freq" of two words should not be
interpreted as having a major meaning.