CJK to Romanization

Édouard Lopez

Jul 26, 2010, 2:25:18 PM
to cjklib-devel, hugo.lp...@gmail.com
Hi there,
I'm part of the CFDICT project and I would like to create a typeface
that displays hanzi together with pinyin, something like ⿰ with an 80-20
or 90-10 split.
To do so, I'm using the code points of the CJK characters to get a complete
list and CJKlib to get the romanization. I looked at the documentation
but either didn't find anything that matches my need or didn't understand
how to do it.
Could someone give me a sample that gets the romanization when I only
have the character? For instance, `風` would give me `fèng`.

Regards,
Ed

Christoph Burgmer

Jul 26, 2010, 4:15:59 PM
to cjklib...@googlegroups.com, Édouard Lopez
Hi Édouard,

> I'm part of the CFDICT project and I would like to create a typeface
> that displays hanzi together with pinyin, something like ⿰ with an 80-20
> or 90-10 split.

Not sure if I understand.

> To do so, I'm using the code points of the CJK characters to get a complete
> list and CJKlib to get the romanization. I looked at the documentation
> but either didn't find anything that matches my need or didn't understand
> how to do it.

The functionality for looking up the romanization of characters is currently
implemented in CharacterLookup. The documentation is at
<http://cjklib.org/0.3/library/cjklib.characterlookup.html#readings>.

> Could someone give me a sample that gets the romanization when I only
> have the character? For instance, `風` would give me `fèng`.

Here's what you would do on the Python command line:

$ python
Python 2.6.5+ (release26-maint, Jul 6 2010, 14:48:45)
[GCC 4.4.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from cjklib.characterlookup import CharacterLookup
>>> cjk = CharacterLookup('T')
>>> cjk.getReadingForCharacter(u'風', 'Pinyin')
[u'fēng', u'fěng', u'fèng']

Hope that helps
-Christoph
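
To get from the single lookup above to the complete list Ed describes, one could walk a range of code points and keep whatever readings come back. This is only a minimal sketch, not cjklib's own code: `readings_for_range`, `lookup`, and `stub_lookup` are hypothetical names, and the stub stands in for `cjk.getReadingForCharacter(char, 'Pinyin')` so that the example runs without cjklib installed. It is written for Python 3 (on Python 2, `unichr` would replace `chr`). U+4E00 to U+9FFF is the basic CJK Unified Ideographs block.

```python
# Hypothetical helper: collect readings for every character in a code point range.
def readings_for_range(lookup, start=0x4E00, end=0x9FFF):
    """Map each character in [start, end] to its list of readings.

    `lookup` is any callable that takes one character and returns a list
    of readings; characters without readings are left out of the result.
    """
    result = {}
    for codepoint in range(start, end + 1):
        char = chr(codepoint)
        readings = lookup(char)
        if readings:
            result[char] = list(readings)
    return result


# Stub standing in for cjklib. Its single entry is the output quoted above.
STUB = {'風': ['fēng', 'fěng', 'fèng']}


def stub_lookup(char):
    return STUB.get(char, [])


# 風 is U+98A8, so this one-character range contains exactly that entry.
table = readings_for_range(stub_lookup, start=0x98A8, end=0x98A8)
print(table)
```

With cjklib available, the stub could presumably be swapped for `lambda c: cjk.getReadingForCharacter(c, 'Pinyin')`, using the `cjk` object from the session above.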

Édouard Lopez

Jul 27, 2010, 2:53:57 AM
to cjklib-devel
[public copy]
Hi,
I currently use this font (http://drop.io/cn_pinyin_font, I don't know the
licence) but I'm not happy with it, which is why I'm thinking about this
project. I aim to do something similar to it, or like the (X)HTML
<ruby> annotation (http://en.wikipedia.org/wiki/Ruby_(annotation_markup)),
but with a vertical layout, as it seems to allow a bigger 漢字.
Although I know it's not a real solution, I still believe it
would be really helpful for beginners, as it would make the border
between Latin and non-Latin scripts thinner.

As I expected, there is more than one romanization/pronunciation...
Does the order reflect their frequency, or anything at all?

Thanks for your short answer, it was really helpful. However, please
note that this is the very beginning of a side project, so don't
have high expectations of it :)

Cheers,
Ed

Édouard Lopez

Jul 27, 2010, 3:09:09 AM
to cjklib-devel
Hi Chris,
I talked with my brother about this and we are wondering what the
Pinyin completion status in CJKlib is. Did you mirror Unicode or get
your data from somewhere else?
Looking around a bit, I found the answer: Mandarin character readings in
Pinyin (from kHanyuPinlu, kXHC1983, kHanyuPinyin).
It seems a lot of characters have neither a Pinyin/Mandarin pronunciation
nor any other (Japanese, Korean, Cantonese, etc.). Any idea why?

Edouard Lopez

Jul 26, 2010, 5:28:09 PM
to Christoph Burgmer, cjklib...@googlegroups.com
Hi,
I currently use this font but I'm not happy with it, which is why I'm thinking about this project. I aim to do something similar to it, or like the (X)HTML <ruby> annotation, but with a vertical layout, as it seems to allow a bigger 漢字. Something like this:
f
è
n
g
Although I know it's not a real solution, I still believe it would be really helpful for beginners, as it would make the border between Latin and non-Latin scripts thinner.

As I expected, there is more than one romanization/pronunciation... Does the order reflect their frequency, or anything at all?

Thanks for your short answer, it was really helpful. However, please note that this is the very beginning of a side project, so don't have high expectations of it :)

Cheers,
Ed


Internship in 台中, 台灣 (Taichung, Taiwan)
Master of Engineering in System-Person Communication
UPMF, Grenoble, France

Christoph Burgmer

Jul 27, 2010, 2:59:56 PM
to cjklib...@googlegroups.com
Hi Ed,

Sorry for the late reply.

It seems your mails got caught up in moderation. This is a measure to keep
spam off the list, but it makes it a bit harder for unregistered people to
post :(

On Tuesday, 27 July 2010, Édouard Lopez wrote:
> [public copy]
> Hi,
> I currently use this font (http://drop.io/cn_pinyin_font, I don't know the
> licence) but I'm not happy with it, which is why I'm thinking about this
> project. I aim to do something similar to it, or like the (X)HTML
> <ruby> annotation (http://en.wikipedia.org/wiki/Ruby_(annotation_markup)),
> but with a vertical layout, as it seems to allow a bigger 漢字.
> Although I know it's not a real solution, I still believe it
> would be really helpful for beginners, as it would make the border
> between Latin and non-Latin scripts thinner.

The font is a funny idea. I didn't know something like that existed. My first
thought was "how do they handle characters with multiple readings?", but on
second thought this font will serve beginners well, since they won't meet a
多音字 (a character with multiple readings) too early in their learning process.

> As I expected, there is more than one romanization/pronunciation...
> Does the order reflect their frequency, or anything at all?

Sadly not. As you already noted in your later email, cjklib ships readings
from mixed sources. As sorting by frequency is only partially available, the
order of the result doesn't reflect it. It should be easy, though, to extend
cjklib to meet your need, but you would then have to work with a subset of the
data.

Either you send in a patch, or you bug me enough and wait for action from my
side :)
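
One way such an extension could look is sketched below. It is not part of cjklib; the function names are hypothetical, and it assumes that the Unihan kHanyuPinlu field holds space-separated `reading(count)` items. Readings that carry a count are ranked first, and readings known only from the other sources are appended in their original order. The sample value is invented for illustration; its counts are not real data.

```python
import re

# Assumed field format: space-separated "reading(count)" items.
_ITEM = re.compile(r'(\S+?)\((\d+)\)')


def sort_by_frequency(field_value):
    """Return the readings of a kHanyuPinlu-style value, most frequent first."""
    pairs = [(reading, int(count))
             for reading, count in _ITEM.findall(field_value)]
    # list.sort is stable, so readings with equal counts keep source order.
    pairs.sort(key=lambda pair: -pair[1])
    return [reading for reading, _ in pairs]


def merge_readings(ranked, others):
    """Put ranked readings first, then unseen readings from other sources."""
    merged = list(ranked)
    for reading in others:
        if reading not in merged:
            merged.append(reading)
    return merged


# Invented sample value: the counts are made up, not taken from Unihan.
sample = 'fěng(3) fēng(120)'
ranked = sort_by_frequency(sample)
print(merge_readings(ranked, ['fēng', 'fěng', 'fèng']))
```

This matches the caveat above: only the readings that appear in the frequency source get a meaningful order, so the result is ranked for a subset of the data.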

> Thanks for your short answer, it was really helpful. However, please
> note that this is the very beginning of a side project, so don't
> have high expectations of it :)

One last thought on ruby annotations: I'm not sure how well they are currently
supported, but I recently created a website that renders ruby-style
annotations in pure CSS. You might want to search for code and documentation
on this.

-Christoph
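
For comparison with the font idea, here is a minimal sketch of generating the (X)HTML `<ruby>` markup Ed mentioned. It uses the HTML ruby elements rather than the pure-CSS approach described above. The function names and `stub_lookup` are hypothetical; the stub stands in for a cjklib lookup and only contains the reading list quoted earlier in the thread.

```python
from html import escape


def ruby(char, reading):
    """Wrap one character and its reading in HTML ruby markup.

    The <rp> elements supply parentheses for browsers without ruby support.
    """
    return '<ruby>{0}<rp>(</rp><rt>{1}</rt><rp>)</rp></ruby>'.format(
        escape(char), escape(reading))


def annotate(text, lookup):
    """Annotate every character that has a reading; pass the rest through."""
    parts = []
    for char in text:
        readings = lookup(char)
        if readings:
            # Takes the first reading, which is not necessarily the most common.
            parts.append(ruby(char, readings[0]))
        else:
            parts.append(escape(char))
    return ''.join(parts)


def stub_lookup(char):
    # Stand-in for a real lookup such as cjklib's CharacterLookup.
    return {'風': ['fēng', 'fěng', 'fèng']}.get(char, [])


print(annotate('風!', stub_lookup))
```

Because the reading order is not frequency-based, taking the first reading is only a placeholder choice, and characters with multiple readings would need a better rule.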

Christoph Burgmer

Jul 27, 2010, 3:12:18 PM
to cjklib...@googlegroups.com, Édouard Lopez
On Tuesday, 27 July 2010, Édouard Lopez wrote:
> I talked with my brother about this and we are wondering what the
> Pinyin completion status in CJKlib is. Did you mirror Unicode or get
> your data from somewhere else?
> Looking around a bit, I found the answer: Mandarin character readings in
> Pinyin (from kHanyuPinlu, kXHC1983, kHanyuPinyin).

Most reading data currently comes from the Unihan database, yes. Unihan in
turn derives its data from different sources:
* Xiàndài Hànyǔ Pínlǜ Cídiǎn (kHanyuPinlu)
* Xiàndài Hànyǔ Cídiǎn (kXHC1983)
* Hànyǔ Dà Zìdiǎn (kHanyuPinyin)

> It seems a lot of characters have neither a Pinyin/Mandarin pronunciation
> nor any other (Japanese, Korean, Cantonese, etc.). Any idea why?

I wouldn't expect all characters to have Pinyin, as most are rarely used and
some are not even found in China, but only in, e.g., Japan.

In fact, as Unihan and thus cjklib already ship with sources from three major
Chinese dictionaries, I believe all important characters are covered. You can
add another source, "kMandarin", but its provenance is unknown and it seems to
contain wrong entries.

Japanese readings are currently not supported, as Unicode only ships what is,
to my eyes, a mutilated version. Jim Breen actually spoke up recently,
offering the data he has been collecting. While cjklib has initial support for
Jim Breen's Kanjidic, its reading information is currently still not installed.

Cantonese and Korean might have fewer sources, but I didn't investigate them
for long.

cjklib now supports Shanghainese through Kellen Parker's project. The data set
is still small and help is appreciated :)

Feel free to ask, as not everything from my research on these topics went into
the documentation.

-Christoph
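
To check how many characters in a given set have a reading at all, a small coverage count is enough. This is a sketch with hypothetical names (`coverage`, `stub_lookup`). The stub knows only 風, so the result below says nothing about cjklib's or Unihan's real coverage.

```python
def coverage(chars, lookup):
    """Return (number of characters with a reading, list of those without)."""
    chars = list(chars)
    missing = [char for char in chars if not lookup(char)]
    return len(chars) - len(missing), missing


def stub_lookup(char):
    # Stand-in for a real lookup; deliberately incomplete.
    return {'風': ['fēng', 'fěng', 'fèng']}.get(char, [])


covered, missing = coverage('風雨', stub_lookup)
print(covered, missing)
```

Running the same count per reading type (Pinyin, Cantonese, Korean, and so on) against a real lookup would show where the gaps Ed noticed actually lie.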
