Re: Comment on Decomposition in cjklib

cjk...@googlecode.com

Feb 14, 2010, 2:42:01 PM

to cjklib...@googlegroups.com

Is there a link between the cjklib decomposition data and the Wikimedia
project of the same nature? Do you collaborate in any way?
http://commons.wikimedia.org/wiki/Commons_talk:Chinese_characters_decomposition

For more information:
http://code.google.com/p/cjklib/wiki/Decomposition

cjk...@googlecode.com

Feb 14, 2010, 4:47:18 PM

to cjklib...@googlegroups.com

Comment by cburg...@ira.uka.de:

No, not at all. I know about the page, but when I looked at it before I
couldn't understand the format. I don't know if a collaboration would be
possible, but I'd be happy if duplicated work could be avoided. Allan,
feel free to investigate whether collaboration is possible. Most of the
list here was released by Gavin Grover under the LGPL, and it is also
available, as originally released, under an Apache license.

cjk...@googlecode.com

Mar 9, 2010, 1:12:56 AM

to cjklib...@googlegroups.com

Comment by goo...@jhoward.fastmail.fm:

In December I fixed up the formatting of the Wikimedia version to make it
all consistent and readable. I've been checking through a few characters
and haven't found any obvious problems yet. It's under the GFDL, which
unfortunately is not compatible with the LGPL (AFAICT). Despite that
incompatibility, it could be used as a data source to check against,
without actually incorporating it directly.

cjk...@googlecode.com

Jun 18, 2010, 2:10:10 AM

to cjklib...@googlegroups.com

Comment by gavingrover:

As requested by Christoph, I've just added the CC-by-SA licence to
[http://code.google.com/p/vy-language/downloads/list this data file] in
addition to the other two licences. I created the data about 3 years ago,
though I haven't looked at it since. As I've written on the main project
page...

The decomposition file is a graphical analysis of the almost 21,000 CJK
characters in the Unicode CJK common ideograph block, plus the 12 unique
characters from the CJK compatibility block. There are no entries from the
A, B, or C extension blocks. For each character, I recorded one or two
constituent components, and a decomposition type. I used pictorial
decompositions, not semantic ones, because that was my purpose for the data
at that time. I added a few thousand extra character entries to cater for
decomposition components not themselves among the collected characters.
(Although many are in the CJK extension A, B, and C blocks, I kept those
out of scope.) To represent these extra characters in the data, sometimes
I've used a multi-character marker sequence, sometimes a code from the
Unicode BMP user-defined block.
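
A minimal sketch of the record shape described above, in Python. The actual
file layout is not given in this post, so the class, the field names, the
"left-right" type label, and the example entry are all assumptions for
illustration. The sketch also assumes that the "user-defined block" means
the BMP Private Use Area (U+E000 to U+F8FF).

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption: the "user-defined block" is the BMP Private Use Area.
PUA_START, PUA_END = 0xE000, 0xF8FF


@dataclass(frozen=True)
class Decomposition:
    """One entry: a character, a decomposition type, and its components.

    All field names are hypothetical; the real data file may differ.
    """
    character: str               # character (or stand-in) being decomposed
    kind: str                    # decomposition type; labels are hypothetical
    components: Tuple[str, ...]  # one or two constituent components

    def __post_init__(self):
        # The post says each entry records one or two components.
        if not 1 <= len(self.components) <= 2:
            raise ValueError("expected one or two components")


def is_placeholder(component: str) -> bool:
    """True if the component stands in for a character outside the set:
    a multi-character marker sequence or a private-use code point."""
    if len(component) != 1:
        return True
    return PUA_START <= ord(component) <= PUA_END


# Illustrative entry only: 好 split pictorially into 女 and 子.
entry = Decomposition("好", "left-right", ("女", "子"))
```

With this shape, a consumer could call `is_placeholder` on each component to
find the extra entries that need the bundled font or special handling.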

The Apache-licensed download actually contains two files: the second file
is a TrueType font file with glyphs to make viewing the data file easier.
However, I can't release this font file under a free or open licence
because I'm not sure where I got some of it from.

If I had time to resume working on this data, I would:
* replace the user-defined codes with unique multi-char codes for the
intermediate decompositions, and
* allow multiple decompositions for each character, allowing both
graphical and semantic decompositions, also catering for disagreements on
the correct decomposition.
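
The second point could be sketched roughly as follows, in Python. This is
only one possible layout, not a planned format: the `DecompositionTable`
class, the `basis` field ("graphical" or "semantic"), the type labels, and
the example entries are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Sequence, Tuple


class Entry(NamedTuple):
    basis: str                   # "graphical" or "semantic" (hypothetical labels)
    kind: str                    # decomposition type (hypothetical labels)
    components: Tuple[str, ...]  # constituent components


class DecompositionTable:
    """Maps each character to any number of alternative decompositions."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[Entry]] = defaultdict(list)

    def add(self, character: str, basis: str, kind: str,
            components: Sequence[str]) -> None:
        entry = Entry(basis, kind, tuple(components))
        # Keep competing analyses side by side, but skip exact duplicates.
        if entry not in self._entries[character]:
            self._entries[character].append(entry)

    def lookup(self, character: str,
               basis: Optional[str] = None) -> List[Entry]:
        # .get avoids creating empty lists for unknown characters.
        entries = self._entries.get(character, [])
        return [e for e in entries if basis is None or e.basis == basis]


# Illustrative entries only.
table = DecompositionTable()
table.add("好", "graphical", "left-right", ["女", "子"])
table.add("好", "semantic", "compound", ["女", "子"])
table.add("明", "graphical", "left-right", ["日", "月"])
```

Storing a list per character means disagreements about the correct
decomposition can simply be recorded as additional entries rather than
resolved in the data.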
