
Chinese Character Usage Frequency


yon...@yahoo.com

Nov 23, 2005, 10:07:40 PM
[Excuse me if this is not the right group]

I wrote a small computer program to generate the Chinese character
usage frequency table. It's at

http://rootshell.be/~yong321/misc/ChineseCharFrequency.html

This is not my profession but I prefer to have peer reviews. Any
comments are very welcome.

Yong Huang

Dylan Sung

Nov 24, 2005, 4:17:23 AM

<yon...@yahoo.com> wrote in message
news:1132801660....@g14g2000cwa.googlegroups.com...

Interesting. However, the approach does not give you a true idea of the size
of the character pool from which you've taken those results. That is,
how many unique Chinese characters are in everyday use? (Don't turn to
the number of characters in Unicode, since many rare characters from the KangXi
Dictionary are now available in Unicode 4.x, and don't use Big5 or GB as
estimates.) Moreover, how can you be sure that the characters are used
solely in Chinese webpages, when Japanese and some Korean authors also use
similar and "han-unified" characters?

One way of estimating the frequency of use of Chinese characters is by
counting them in a corpus of works, but it may be affected by the types of
subject written about or the types of discussion made in the texts. For
example, the chemical elements in the modern periodic table can be found in
Chinese characters now, but some of the characters are borrowed from older
characters, or new ones created. So, if you wanted general usage by the
people, you'd amass a corpus mainly based on the types of text most people
read, i.e. newspapers, magazines, short stories, novels and so on.

I'm not saying your approach is wrong. Reflecting the number of times a
Chinese character appears on the internet may be one way of achieving your
aim, but is it a fair reflection of what everyday readers read, and from the
locale in which the users write?

(And the simplified/traditional thing too....)

Dyl.


mats....@gmail.com

Nov 24, 2005, 5:02:20 AM
I did something similar some years ago, but instead of using Google, I
let my script crawl Chinese pages. From one start page (I believe it
was http://www.shuku.net/) I would recursively follow any link (that
hadn't already been followed) containing at least one Chinese character
in the description, with a depth of something like five. The script
went through about three million character instances, covering about
13000 unique characters.
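
A crawl like the one described can be sketched roughly as follows. This is only an illustration, not the original script: the `pages` dictionary stands in for real HTTP fetches, the function names are made up, and "Chinese character" is approximated by three Unicode ranges (the main CJK block plus Extensions A and B).

```python
from collections import Counter


def is_han(ch):
    """Rough test for a Han character (assumption: three Unicode blocks)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF        # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF     # Extension A
        or 0x20000 <= cp <= 0x2A6DF   # Extension B
    )


def count_han(text):
    """Count every Han character instance in a piece of text."""
    return Counter(ch for ch in text if is_han(ch))


def crawl(start, pages, max_depth=5):
    """Breadth-first crawl over `pages`, a hypothetical in-memory web:
    url -> (page_text, [(link_text, target_url), ...]).
    A link is followed only if it has not been followed before and its
    link text contains at least one Han character."""
    seen = {start}
    totals = Counter()
    frontier = [start]
    for _ in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            text, links = pages.get(url, ("", []))
            totals.update(count_han(text))
            for link_text, target in links:
                if target not in seen and any(is_han(c) for c in link_text):
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return totals


# Tiny made-up web: the "next" link has no Han text, so page "c" is skipped.
demo_pages = {
    "a": ("中文中", [("下一页", "b"), ("next", "c")]),
    "b": ("文", []),
    "c": ("字", []),
}
demo_totals = crawl("a", demo_pages)
```

Summing `demo_totals.values()` gives the number of character instances, and `len(demo_totals)` the number of unique characters, which are the two figures quoted above.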

This is my result: http://barbabok.barbanet.com/~maatt/count.html

Mats

Dylan Sung

Nov 24, 2005, 3:00:39 PM

"nilsbeng...@yahoo.se" <mats....@gmail.com> wrote in message
news:1132826540.6...@g14g2000cwa.googlegroups.com...

>I did something similar some years ago, but instead of using Google, I
> let my script crawl Chinese pages. From one start page (I believe it
> was http://www.shuku.net/) I would recursively follow any link (that
> hadn't already been followed) containing at least one Chinese character
> in the description, with a depth of something like five. The script
> went through about three million character instances, covering about
> 13000 unique characters.

There are just over 13000 characters in Big5.

I did a frequency count of characters for a sample text some while ago.

http://www.dylanwhs.ukgateway.net/scilang/pinyin-stats.htm
http://www.dylanwhs.ukgateway.net/scilang/statistics03.htm

I think Jun Da has created definitive lists of frequency counts. See the
Classical Chinese list, as well as bigram frequencies, here:

http://lingua.mtsu.edu/chinese-computing/statistics/

Dyl.


yon...@yahoo.com

Nov 25, 2005, 1:12:40 AM
Thanks, Dylan and Mats. I'm going to update my web page with the
information you two provided, including links to Da Jun's page. Dylan
has a very valid point in saying that a simple Google search makes no
distinction between Chinese web pages and non-Chinese pages that
contain Chinese characters. Assuming Google has done a good job, I can
modify my program to let Google search for pages in a specific language
(Google has a "lr=lang_zh-CN" URL parameter for Simplified Chinese
page search; you can tell by searching with Advanced Search).
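
As a rough sketch of that idea: the `lr=lang_zh-CN` value comes from the paragraph above, while the base URL, the `q` parameter, the helper names, and the page counts are assumptions for illustration only. Fetching the results page and reading the hit count off it is deliberately left out.

```python
from urllib.parse import urlencode


def search_url(character, language="lang_zh-CN"):
    """Build a search URL restricted to one language (assumed URL layout)."""
    params = {"q": character, "lr": language}
    return "https://www.google.com/search?" + urlencode(params)


def rank_by_page_count(page_counts):
    """Order characters by descending page count; ties fall back to code point order."""
    return sorted(page_counts, key=lambda ch: (-page_counts[ch], ch))


# Placeholder numbers, not real search results.
made_up_counts = {"一": 5, "的": 9, "是": 5}
ranking = rank_by_page_count(made_up_counts)
url = search_url("的")
```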

My search obviously does not count the number of occurrences in each
found page. But it's reasonable to assume the frequency order is not
significantly modified if the in-page-count is considered.
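
The difference between the two kinds of count can be made concrete with a toy example. The three "pages" below are invented, and the function names are only illustrative; whether the order shifts noticeably on real data is an empirical question that this sketch does not answer.

```python
from collections import Counter


def page_counts(pages):
    """Number of pages on which each character appears at least once."""
    counts = Counter()
    for text in pages:
        counts.update(set(text))
    return counts


def occurrence_counts(pages):
    """Total number of times each character appears across all pages."""
    counts = Counter()
    for text in pages:
        counts.update(text)
    return counts


toy_pages = ["的的的一", "的是", "一是是是是"]
by_page = page_counts(toy_pages)
by_occurrence = occurrence_counts(toy_pages)
```

Here all three characters tie on page count, while the in-page counts separate them, so the two orderings need not coincide in general.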

My work is not about character frequency in any specific subject (news,
colloquial language, science, etc.). If I have to say it's a limited
corpus of some kind, it's the whole Internet. Although there may be no
scientific research to prove it, I assume the character frequency based on
a global search engine's page count is close to that experienced in
everyday life by most people speaking that language. This is not true
for a few characters, such as the Chinese characters for Mmm, Ehhh,
nei4ge4, zhe4ge4, because they're rarely written down.

One interesting thing about Google's Chinese search is that it returns
results in both simplified and traditional Chinese pages given one
Chinese character. That's exactly what I prefer.

Yong Huang

Dylan Sung

Nov 25, 2005, 3:30:53 AM

<yon...@yahoo.com> wrote in message
news:1132899160.1...@o13g2000cwo.googlegroups.com...

Last night, I was watching ShuJian EnShou Lu and it had Chinese subtitles. I
was following the subtitles for Emperor QianLong, whose reign name is
composed of two characters, qian and long. However, rather than following the
prescribed use of the traditional character, the subtitles used the simplified
character associated with it. Qian became simplified as the character "gan",
a three-stroke character. There were other errors in the subtitling, which
suggested to me that the people who transcribed it had done so in
traditional characters and then substituted each traditional character with a
simplified one, irrespective of the context.

There are a number of simplified characters which have two (and some have
several) traditional character equivalents. Though the numbers are small,
they will boost up their ratings in some character counts, if the method in
which these webpages are scanned and codeconverted takes no account of
actual context. So your preference for Google's Chinese search may give
skewed results.
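
A small sketch of how such folding inflates a count. The mapping table holds just two well-known one-to-many cases and the input counts are invented; a real converter would use a full table.

```python
from collections import Counter

# Simplified character -> traditional characters that collapse onto it.
# Only two illustrative entries; 乾 maps to 干 only in its gan1 reading.
MERGED = {
    "干": ["乾", "幹", "干"],
    "发": ["發", "髮"],
}


def fold_to_simplified(traditional_counts):
    """Fold counts of traditional characters onto their simplified form,
    ignoring context, the way a naive code conversion would."""
    to_simplified = {t: s for s, ts in MERGED.items() for t in ts}
    folded = Counter()
    for ch, n in traditional_counts.items():
        folded[to_simplified.get(ch, ch)] += n
    return folded


# Invented counts for illustration.
folded = fold_to_simplified({"乾": 3, "幹": 4, "干": 2, "隆": 1})
```

The single simplified character ends up with the combined count of all its traditional sources, which is the boost described above.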

I've found that sites sometimes reduplicate things which were previously in
book form, and in the public domain, for example, the Confucian classics,
Shijing, Shujing, and so on. Many of these "old" texts contain characters
which aren't used much nowadays. Moreover, there are characters today which
are new coinings, like "li" for an elevator/lift, and the chemical
elements. If you had enough sites reduplicating the same stuff over and over
again, then these rare characters get artificially boosted into prominence.
I think that whilst the top several thousand may indeed be representative of
general usage, we can't say much about those several tens of thousands of
unique and individual Chinese characters which are left over and rare.

Dyl.


Lee Sau Dan

Nov 25, 2005, 4:00:47 AM
>>>>> "yong321" == yong321 <yon...@yahoo.com> writes:

yong321> Although there may be no scientific research to prove, I
yong321> assume the character frequency based on a global search
yong321> engine's page count is close to that experienced in
yong321> everyday life by most people speaking that language.

I think you can at *most* claim that it reflects the experience in
everyday *WWW* life. The Internet is frequented by a particular group
of people, and you should not use this sample to generalize on all
people using that language. How many illiterates have you excluded?
How many computer-illiterates have you excluded? And how many
publications that are not put on-line have you excluded? Even for a
literate and computer-literate person, WWW is just a part of his daily
life. Does he not read paper-medium books daily? Does he not write
documents daily, which are not published on any web pages?


yong321> This is not true for a few characters, such as the
yong321> Chinese characters for Mmm, Ehhh, nei4ge4, zhe4g4,
yong321> because they're rarely written down.

Good. You're realizing that what you've done cannot be generalized to
the spoken language in general. Your findings are valid for the
written language only, and limited to the WWW sphere only.

--
Lee Sau Dan 李守敦 ~{@nJX6X~}

E-mail: dan...@informatik.uni-freiburg.de
Home page: http://www.informatik.uni-freiburg.de/~danlee

Lee Sau Dan

Nov 25, 2005, 4:18:20 AM
>>>>> "Dylan" == Dylan Sung <dylanwhs....@pacific.net.hk> writes:

Dylan> Last night, I was watching ShuJian EnShou Lu and it had
Dylan> Chinese subtitles. I was following the subtitles, and for
Dylan> Emperor QianLong, whose reign name is composed of two
Dylan> characters qian and long. However, rather than following
Dylan> the prescribed use of the traditional character, it used
Dylan> the simplified character associated with it. Qian became
Dylan> simplified as the character "gan", a three stroke
Dylan> character.

That's absolutely an error, probably due to the cheap _automatic_
traditional->simplified conversion done by computer programs.

By definition (of Simplified characters), that Qian2 character should
never be replaced by the gan1 character in the context of a name when
it is pronounced qian2. (I think the same rule applies to the qian2
in the 64 gua4's in the Book of Changes.) Programs that blindly
replace every Qian2 character with gan1 are bound to make this error.
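
One way a converter could avoid the error is to check a list of exception words before falling back to character-by-character mapping. The sketch below assumes a tiny exception list and a one-entry character map; both are illustrative, and the protected words are copied through unchanged, which only works here because their other characters need no conversion.

```python
CHAR_MAP = {"乾": "干"}           # context-free default: gan1, "dry"
KEEP_AS_IS = ["乾隆", "乾坤"]      # words in which 乾 is read qian2


def blind_convert(text):
    """Replace every character via the map, ignoring context."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def careful_convert(text):
    """Copy exception words through untouched; map everything else."""
    out = []
    i = 0
    while i < len(text):
        word = next((w for w in KEEP_AS_IS if text.startswith(w, i)), None)
        if word is not None:
            out.append(word)
            i += len(word)
        else:
            out.append(CHAR_MAP.get(text[i], text[i]))
            i += 1
    return "".join(out)
```

With these definitions `blind_convert("乾隆")` produces the faulty 干隆, while `careful_convert` leaves the reign name alone and still turns 乾燥 into 干燥.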


Dylan> There were other errors in the subtitling, which suggested
Dylan> to me that the people who transcribed it, had done so in
Dylan> traditional characters and then substituted the traditional
Dylan> character with a simplified one, irrespective of the
Dylan> context.

Maybe, those are not people, but programs.

I've seen something worse: I bought a book in Peking in 1997. It
talks about the education system in HK. It mentions a subject that
is taken by most HKers in the HKCEE (equiv. to GCE O Level). The name
is, in that book, a strange thing: 中普通話文 zhong1pu3tong1hua4wen2.
I was puzzled. Reading further on, there is another strange subject
name: 英普通話文 ying1pu3tong1hua4wen2. What are these?

The puzzles became clear when I read further and encountered the name
of an AS Level (something between GCE O level and A level, introduced
some time around 1995) subject: 中普通話言文化
zhong1pu3tong1hua4yan2wen2hua4. It's an unintelligible name. But I
knew an AS Level subject called: 中國語言文化
zhong1guo2yu3yan2wen2hua4. Obviously, the publisher has done a global
*blind* substitution "國語" guo2yu3 -> "普通話" pu3tong1hua4 without
paying attention to context. What a big joke! And I guess this
global blind substitution is done by some unintelligent being: a
computer program or a stupid guy. But the editor who ordered such a
substitution should be fired. He's so stupid!
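
The mistake is easy to reproduce, and one possible guard is to mask known terms before substituting. In this sketch the protected list is an assumption (in particular, 英國語文 is inferred as the original of the garbled English subject name), and the private-use placeholders are just one way of masking.

```python
def blind_replace(text):
    """Global substitution with no regard for word boundaries."""
    return text.replace("國語", "普通話")


# Terms that merely contain 國語 across a word boundary (assumed list).
PROTECTED = ["中國語言文化", "中國語文", "英國語文"]


def guarded_replace(text):
    """Mask protected terms with private-use code points, substitute,
    then restore the masked terms."""
    for i, term in enumerate(PROTECTED):
        text = text.replace(term, chr(0xE000 + i))
    text = text.replace("國語", "普通話")
    for i, term in enumerate(PROTECTED):
        text = text.replace(chr(0xE000 + i), term)
    return text
```

`blind_replace("中國語言文化")` yields exactly the unintelligible 中普通話言文化 from the book, whereas `guarded_replace` leaves the subject name intact and still converts a genuine 國語.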


Dylan> There are a number of simplified characters which have two
Dylan> (and some have several) traditional character
Dylan> equivalents. Though the numbers are small, they will boost
Dylan> up their ratings in some character counts, if the method in
Dylan> which these webpages are scanned and codeconverted takes no
Dylan> account of actual context. So your preference for Google's
Dylan> Chinese search may give skewed results.

I was going to raise this point. Thanks, Dylan. :P


Dylan> Moreover, there are characters today which are new
Dylan> coinings, like "li" for an elevator/lift, and the chemical
Dylan> elements.

But the "lift" character is not in standard Big5. I have to check to
see if it's in the HKSCS.
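
A quick way to run that kind of check is to try encoding the character with the relevant codec. The sketch below relies on Python's built-in `big5` and `big5hkscs` codecs and assumes that U+282E2 is the "lift" character in question; it shows the check, not its outcome for HKSCS.

```python
def can_encode(character, codec):
    """Return True if `character` can be represented in `codec`."""
    try:
        character.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


LIFT = "\U000282E2"  # assumption: the code point of the "lift" character

in_big5 = can_encode(LIFT, "big5")
in_hkscs = can_encode(LIFT, "big5hkscs")
```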

Dylan> If you had enough sites reduplicating the same stuff over
Dylan> and over again, then these rare characters get artificially
Dylan> boosted into prominence. I think that whilst the top
Dylan> several thousand may indeed be representative of general
Dylan> usage, we can't say much about those several tens of
Dylan> thousands of unique and individual Chinese characters which
Dylan> are left over and rare.

We can say that those figures collected reflect the *occurrence*
frequency in web pages reachable within 5 hops (is that the max. depth
used?) from a page that is indexed by Google. Reduplicated pages are
counted multiple times. The figures do not represent actual usage of
Chinese, which includes non-WWW media and even webpages missed by
Google.
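
If one wanted to stop reduplicated pages from being counted several times, a simple first step is to skip pages whose text has already been seen. This sketch only catches byte-identical copies (via a SHA-256 digest); near-duplicates with different markup or punctuation would still slip through, and the sample pages are invented.

```python
import hashlib
from collections import Counter


def count_unique_pages(pages):
    """Count characters, skipping any page whose exact text was seen before."""
    seen = set()
    totals = Counter()
    for text in pages:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        totals.update(ch for ch in text if not ch.isspace())
    return totals


# The first two "pages" are identical copies of the same text.
deduped = count_unique_pages(["子曰", "子曰", "學而"])
```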

yon...@yahoo.com

Nov 25, 2005, 12:31:14 PM
Lee Sau Dan wrote:
> >>>>> "yong321" == yong321 <yon...@yahoo.com> writes:
>
> yong321> Although there may be no scientific research to prove, I
> yong321> assume the character frequency based on a global search
> yong321> engine's page count is close to that experienced in
> yong321> everyday life by most people speaking that language.
>
> I think you can at *most* claim that it reflects the experience in
> everyday *WWW* life. The Internet is frequented by a particular group
> of people, and you should not use this sample to generalize on all
> people using that language. How many illiterates have you excluded?
> How many computer-illiterates have you excluded? And how many
> publications that are not put on-line have you excluded? Even for a
> literate and computer-literate person, WWW is just a part of his daily
> life. Does he not read paper-medium books daily? Does he not write
> documents daily, which are not published on any web pages?

In terms of the source of samples, you're absolutely right. But my
point is that the character frequency will not differ significantly
even if we hypothetically included all sources you mentioned. This
extrapolation, although not scientifically proved, is based on the
plausible assumption that today's Internet content largely reflects
ordinary people's everyday life. That is, the characters and words used
in non-published media, and to a lesser extent those used by illiterates, are
about the same on the Internet or otherwise. However, as Dylan pointed
out, duplicate count of the same documents by any search engine calls
for a more sophisticated search on the Internet. But that's a separate
issue.

I propose a study that makes a statistical analysis, perhaps rank
correlation, between the character frequency table based on Google (or
Yahoo China) search (excluding duplicates), and one based on a
judicious selection of all representative materials. I expect the
difference to exist but not in a significant way.
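
For the rank-correlation idea, a minimal sketch using Spearman's formula over the characters that both tables share. It assumes no tied counts (ties are broken by code point here, which is a simplification), and the two small tables are invented.

```python
def ranks(freq, chars):
    """Rank `chars` by descending frequency (1 = most frequent)."""
    ordered = sorted(chars, key=lambda c: (-freq[c], c))
    return {c: i + 1 for i, c in enumerate(ordered)}


def spearman(freq_a, freq_b):
    """Spearman rank correlation over characters present in both tables.
    Uses 1 - 6*sum(d^2) / (n*(n^2 - 1)), which assumes no ties."""
    shared = sorted(set(freq_a) & set(freq_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared characters")
    ra, rb = ranks(freq_a, shared), ranks(freq_b, shared)
    d2 = sum((ra[c] - rb[c]) ** 2 for c in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Invented tables: same order, then fully reversed order.
web = {"的": 100, "一": 50, "是": 10}
same_order = {"的": 9, "一": 5, "是": 1}
reversed_order = {"的": 1, "一": 5, "是": 9}
```

A value near 1 would support the claim that the two tables agree in order, and a value near 0 or below would not.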

>
> yong321> This is not true for a few characters, such as the
> yong321> Chinese characters for Mmm, Ehhh, nei4ge4, zhe4g4,
> yong321> because they're rarely written down.
>
> Good. You're realizing that what you've done cannot be generalized to
> the spoken language in general. Your findings are valid for the
> written language only, and limited to the WWW sphere only.

All character or word frequency reports probably have the limitation of
underestimating those few words. But whether the WWW sphere is a good
representative sample has yet to be proved or disproved.

>
>
>
> --
> Lee Sau Dan 李守敦 ~{@nJX6X~}
>
> E-mail: dan...@informatik.uni-freiburg.de
> Home page: http://www.informatik.uni-freiburg.de/~danlee

Thank you for your very valuable comments.

Yong Huang

Dylan Sung

Nov 25, 2005, 3:18:53 PM

"Lee Sau Dan" <dan...@informatik.uni-freiburg.de> wrote in message
news:87lkzdw...@informatik.uni-freiburg.de...

> Dylan> If you had enough sites reduplicating the same stuff over
> Dylan> and over again, then these rare characters get artificially
> Dylan> boosted into prominence. I think that whilst the top
> Dylan> several thousand may indeed be representative of general
> Dylan> usage, we can't say much about those several tens of
> Dylan> thousands of unique and individual Chinese characters which
> Dylan> are left over and rare.
>
> We can say that those figures collected reflect the *occurrence*
> frequency in web pages reachable within 5 hops (is that the max. depth
> used?) from a page that is indexed by Google. Reduplicated pages are
> counted multiple times. The figures do not represent actual usage of
> Chinese, which includes non-WWW media and even webpages missed by
> Google.

No, I was thinking about how many people have put up LunYu, TangShu and all
those histories and stuff. I know that some of these works contain ancient
and rare characters, and new coinings, for instance those created during the
reign of Empress Wu, http://www.dylanwhs.ukgateway.net/zi/zetian.htm

Dyl.

Lee Sau Dan

Nov 25, 2005, 7:51:07 PM
>>>>> "yong321" == yong321 <yon...@yahoo.com> writes:

yong321> In terms of the source of samples, you're absolutely
yong321> right. But my point is that the character frequency will
yong321> not differ significantly even if we hypothetically
yong321> included all sources you mentioned. This extrapolation,
yong321> although not scientifically proved, is based on the
yong321> plausible assumption that today's Internet content
yong321> largely reflects ordinary people's everyday life.

I don't agree with this assumption.

Dylan Sung

Nov 26, 2005, 3:56:59 AM

"Lee Sau Dan" <dan...@informatik.uni-freiburg.de> wrote in message
news:87u0e0u...@informatik.uni-freiburg.de...

>>>>> "yong321" == yong321 <yon...@yahoo.com> writes:
>
> yong321> In terms of the source of samples, you're absolutely
> yong321> right. But my point is that the character frequency will
> yong321> not differ significantly even if we hypothetically
> yong321> included all sources you mentioned. This extrapolation,
> yong321> although not scientifically proved, is based on the
> yong321> plausible assumption that today's Internet content
> yong321> largely reflects ordinary people's everyday life.
>
>I don't agree with this assumption.

I think sections of the internet do reflect what people would write in their
everyday lives, but the ones I read are often short comments after other
people have written their comments. In a way, these blog-like or
newsgroup/chat/board-type messages only encapsulate a small percentage of
someone's entire vocabulary. This is far from a novel, where a greater range
of subject matter is entertained. What the proportion of these is to the
rest of the net can only be estimated by guesswork, and any guess is as
plausible or implausible as guesswork :D

Dyl.


yky

Nov 26, 2005, 9:28:51 AM
Dylan Sung wrote:
>
> No, I was thinking about how many people have put up LunYu, TangShu and all
> those histories and stuff. I know that in some of these works, that ancient
> and rare characters, and new coinings were created, for instance the during
> reign of Empress Wu, http://www.dylanwhs.ukgateway.net/zi/zetian.htm
>
> Dyl.
>

I'm thinking of digitizing Shuo Wen Jie Zi. Any idea how to
digitize those rare characters? I'm not talking about Xiao Zhuan,
which, I believe, is best handled by showing them as gif files.
I'm talking about those rare characters which are not in unicode.

Dylan Sung

Nov 26, 2005, 4:09:52 PM

"yky" <y...@yky.com> wrote in message
news:bbdfa$4388714d$471c54ef$12...@ALLTEL.NET...

I think the Kangxi dictionary has collected and listed all the characters in
Shuowen Jiezi. I think the compilers of KX would not have left any of the
rare characters out, and so they ought to be there. Anyway, if a character
is found in KX, it will appear in Unicode 4.x. Since there is a font with
most of the characters in it, digitisation is possible. The font is called
sursong.ttf, and once installed, the characters appear under the font face
"Simsun (Founder Extended)".

Sooner or later, a public listing will probably be available through the
unihan.txt database, for instance there is now a listing for the Songben
Guangyun (I've found a few mistakes, but what the hey!).

Dyl.


Lee Sau Dan

Nov 26, 2005, 9:36:00 PM
>>>>> "yky" == yky <y...@yky.com> writes:


yky> I'm thinking of digitizing Shuo Wen Jie Zi. Any idea how to
yky> digitize those rare characters?

Use an image. Preferably in PNG format.


yky> I'm not talking about Xiao Zhuan, which, I believe, is best
yky> handled by showing them as gif files.

If that's not a character unencoded in Unicode, then it's just a font
issue. You should use the Unicode for them, and specify a Xiao Zhuan
font in the CSS.

Besides, GIF is obsolete. Use PNG instead. For scanned images or
digital photos, JPEG also works well.

yky> I'm talking about those rare characters which are not in
yky> unicode.

Then, use an image. Maybe, you want to invent a way to give each
character a unique ID. That could make the coordination within the
project easier. e.g. number the radicals serially. Then, use a 2
part number X.Y for the characters, where X is the radical number and
Y is the serial number of the character within that radical.
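
The X.Y scheme could be kept in a small registry like the one sketched here. The class name, the use of an image file name as the key, and the example radical numbers are all assumptions for illustration.

```python
from collections import defaultdict


class CharacterRegistry:
    """Hands out IDs of the form X.Y, where X is the radical number and
    Y is a serial number within that radical."""

    def __init__(self):
        self._next_serial = defaultdict(lambda: 1)
        self._ids = {}

    def register(self, image_name, radical_number):
        """Return the ID for `image_name`, assigning a new one if needed."""
        if image_name in self._ids:
            return self._ids[image_name]
        serial = self._next_serial[radical_number]
        self._next_serial[radical_number] = serial + 1
        char_id = f"{radical_number}.{serial}"
        self._ids[image_name] = char_id
        return char_id


registry = CharacterRegistry()
first = registry.register("scan-0001.png", 1)
second = registry.register("scan-0002.png", 1)
other = registry.register("scan-0003.png", 9)
```

Registering the same image twice returns the ID it already has, so collaborators can call `register` freely without creating duplicates.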

Lee Sau Dan

Nov 26, 2005, 9:38:27 PM
>>>>> "Dylan" == Dylan Sung <dylanwhs....@pacific.net.hk> writes:

Dylan> I think the Kangxi dictionary has collected and listed all
Dylan> the characters in Shuowen Jiezi. I think the compilers of
Dylan> KX would not have left any of the rare characters out, and
Dylan> so they ought to be there. Anyway, if the character is
Dylan> found in KX, they will appear in Unicode 4.x. Since there
Dylan> is a font with most of the characters in, digitisation is
Dylan> possible. The font is called sursong.ttf, and once
Dylan> installed, the characters appear under the font face
Dylan> "Simsun (Founder Extended)".

What about the copyright of this font? Is it free (as in free
speech)? Is it (re-)usable under an unrestrictive and royalty-free
license?


Dylan> Sooner or later, a public listing will probably available
Dylan> through the unihan.txt database, for instance there is now
Dylan> a listing for the Songben Guangyun (I've found a few
Dylan> mistakes, but what the hey!).

Oh! I thought it's already there in the Unihan DB. Not yet?

Dylan Sung

Nov 27, 2005, 4:46:05 AM

"Lee Sau Dan" <dan...@informatik.uni-freiburg.de> wrote in message
news:87ek52u...@informatik.uni-freiburg.de...

>>>>> "Dylan" == Dylan Sung <dylanwhs....@pacific.net.hk> writes:
>
> Dylan> I think the Kangxi dictionary has collected and listed all
> Dylan> the characters in Shuowen Jiezi. I think the compilers of
> Dylan> KX would not have left any of the rare characters out, and
> Dylan> so they ought to be there. Anyway, if the character is
> Dylan> found in KX, they will appear in Unicode 4.x. Since there
> Dylan> is a font with most of the characters in, digitisation is
> Dylan> possible. The font is called sursong.ttf, and once
> Dylan> installed, the characters appear under the font face
> Dylan> "Simsun (Founder Extended)".
>
>What about the copyright of this font? Is it free (as in free
>speech)? Is it (re-)usable under a unrestrictive and royalty-free
>license?
>

I don't know, to be honest; however, I downloaded the sursong font when I was
in HK earlier this year. It's 36 megabytes or thereabouts, and took me
three and a half hours on dial-up.

The original poster wanted to know about digitisation. Since the rare
characters in Kangxi are now found in Unicode Extension B, you need an
input method and a text editor capable of manipulating Unicode characters in that
range. I don't think there is an IME ready or available on the net to input
characters from this region of the Unicode range. I've been using copy and
paste from a series of webpages created for the task of displaying the
Ext.B characters.
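
Part of why tools struggle with this range is that Extension B lies outside the Basic Multilingual Plane, so each character takes two UTF-16 code units. A minimal sketch of checking for such characters, with illustrative function names and the block boundaries U+20000 to U+2A6DF:

```python
EXT_B_FIRST = 0x20000
EXT_B_LAST = 0x2A6DF


def in_extension_b(ch):
    """True if the character lies in the CJK Extension B block."""
    return EXT_B_FIRST <= ord(ch) <= EXT_B_LAST


def utf16_units(ch):
    """Number of 16-bit code units needed to store the character in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def extension_b_characters(text):
    """List the Extension B characters found in a piece of text."""
    return [ch for ch in text if in_extension_b(ch)]


sample = "中\U00020000文"
found = extension_b_characters(sample)
```

Software that assumes one 16-bit unit per character will split or drop the characters that `extension_b_characters` reports.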

http://www.dylanwhs.ukgateway.net/u-extb/index.html

>
> Dylan> Sooner or later, a public listing will probably available
> Dylan> through the unihan.txt database, for instance there is now
> Dylan> a listing for the Songben Guangyun (I've found a few
> Dylan> mistakes, but what the hey!).
>
>Oh! I thought it's already there in the Unihan DB. Not yet?
>

I haven't checked in a while. It would probably depend on the generosity of
any academics in the field to give up their hard work.

Dyl.

