Some words about me so you know where I am coming from. I'm the author behind
http://code.google.com/p/cjklib/ a library providing language routines related
to Han characters, including functionality for character pronunciations,
radicals, glyph components, stroke decomposition and variant information.
KanjiVG provides stroke order information using state-of-the-art technology:
XML, SVG and Unicode-defined strokes. Still, there are points I'd like to raise:
1) Scope of language.
KanjiVG seems to be limited to Japanese style characters ("Kanji"), at least
that's what the name and the unidimensional data structure suggest. But this
project could easily be extended, if not by manpower then by a general
system, to also support characters of other locales (i.e. GTKV). This should
provide substantial synergy also benefiting the "Japanese side". While the
Japanese folks have always been the ones pushing in language-based
technologies, I would welcome co-operation with the other languages.
2) License
I don't know about the data sources used to build KanjiVG, so there might be
requirements to satisfy, but Creative Commons by-nc-sa is pretty restrictive
and makes it incompatible with other important open source licenses. From my
viewpoint this hinders me using the included stroke information for my LGPL
sourced data used in cjklib. And vice versa this will keep any foreign data
from helping your project.
3) Glyph selection
You don't state on which basis your glyphs are selected. There are standards
out there that can serve as good glyph sources, which give an objective visual
representation for the characters you cover. The benefits: following an
official list saves you from ending up with an unusual glyph unknown to most,
and it eases the integration of related sources using the same glyph source.
Furthermore, users arguing over the correctness of a glyph have valid
references, which helps settle those questions before they even arrive on
your mailing list. While cjklib is still far from this goal, this point is on
the agenda.
4) Proof of stroke order
Similar to 3) correctness can be proven by stating official or credible sources.
Arguments about which stroke order is correct will be meaningless once
compared to an authoritative source. You might also consider providing a
character with several possible stroke orders, if a unique one is disputed. I
know for the Chinese interpretation, stroke order can heavily depend on the
scholar in question.
I hope you view my criticism as positive feedback and I would be happy to
discuss any of these points with you. Me coming from a Chinese perspective
might miss some facts basic to Japanese-related folks. If there are facts that
seem obvious to you, please still share them with me.
-Christoph
> Some words about me so you know where I am coming from. I'm the author behind
> http://code.google.com/p/cjklib/ a library providing language routines related
> to Han characters, including functionality for character pronunciations,
> radicals, glyph components, stroke decomposition and variant information.
Very interesting - I can see some common points with KanjiVG. Where
does the character component data come from?
> KanjiVG seems to be limited to Japanese style characters ("Kanji"), at least
> that's what the name and the unidimensional data structure suggest. But this
> project could easily be extended, if not by manpower then by a general
> system, to also support characters of other locales (i.e. GTKV). This should
> provide substantial synergy also benefiting the "Japanese side". While the
> Japanese folks have always been the ones pushing in language-based
> technologies, I would welcome co-operation with the other languages.
At this point I should mention that I am not fluent at all regarding
Japanese and Chinese writing systems - I just happen to contribute to
KanjiVG because I use it in one of my projects.
The structure of KanjiVG could probably be used as a basis for
supporting other non-Japanese characters, but one has to check how
this could conflict with the existing data (i.e. stroke order
differences, etc.) Moreover, there are only a handful of people
working on the project on a partial basis. I personally feel that the
current Japanese set should be strengthened and fixed before any
character extension is thought about.
> I don't know about the data sources used to build KanjiVG, so there might be
> requirements to satisfy, but Creative Commons by-nc-sa is pretty restrictive
> and makes it incompatible with other important open source licenses. From my
> viewpoint this hinders me using the included stroke information for my LGPL
> sourced data used in cjklib. And vice versa this will keep any foreign data
> from helping your project.
This is a valid point, and it has already been raised in the few weeks
KanjiVG has been available. :p Yes, I overlooked that, but now I also
agree that the licence does not suit the purpose of the project
(providing kanji information to open source/academic projects). This
also prevents KanjiVG from being included in popular Linux
distributions like Debian. Ulrich is the copyright holder, maybe he
would agree to reconsider the licence to a CC-by-sa? IMO the
"share-alike" clause is the one that is important here, since it
ensures all usages of KanjiVG will be made open, whether they are
commercial or not.
> You don't state on which basis your glyphs are selected. There are standards
> out there that can serve as good glyph sources, which give an objective visual
> representation for the characters you cover. The benefits: following an
> official list saves you from ending up with an unusual glyph unknown to most,
> and it eases the integration of related sources using the same glyph source.
> Furthermore, users arguing over the correctness of a glyph have valid
> references, which helps settle those questions before they even arrive on
> your mailing list. While cjklib is still far from this goal, this point is on
> the agenda.
Guess this is a point for Ulrich. I do not know on which basis the
glyphs were selected, but I'm pretty sure the common kanji are
covered.
> Similar to 3) correctness can be proven by stating official or credible sources.
> Arguments about which stroke order is correct will be meaningless once
> compared to an authoritative source. You might also consider providing a
> character with several possible stroke orders, if a unique one is disputed. I
> know for the Chinese interpretation, stroke order can heavily depend on the
> scholar in question.
There are variations for many characters, but they are not released
yet. Regarding the authoritative source, which ones could be used
remains an open question to me. Ulrich is the kanji specialist, but
unfortunately he cannot check all the fixes I'm making, so errors may
leak in. I hope more kanji experts will join the project.
> I hope you view my criticism as positive feedback and I would be happy to
> discuss any of these points with you. Me coming from a Chinese perspective
> might miss some facts basic to Japanese-related folks. If there are facts that
> seem obvious to you, please still share them with me.
Your points are definitely valid in my opinion, thanks for sharing
them. Me coming from no perspective at all may be missing the whole
point, though. I hope Ulrich will join this discussion so that we can
sort out some of the issues you raised, especially the licence one.
Alex.
Thanks for your answer.
On Thursday, 25 June 2009, Alexandre Courbot wrote:
> > Some words about me so you know where I am coming from. I'm the author
> > behind http://code.google.com/p/cjklib/ a library providing language
> > routines related to Han characters, including functionality for character
> > pronunciations, radicals, glyph components, stroke decomposition and
> > variant information.
>
> Very interesting - I can see some common points with KanjiVG. Where
> does the character component data come from?
This is currently a random list done by me. It's more a proof of concept than
a sufficient list. The project actually still needs a good guideline on how
decomposition is done (verifiability et al.).
> > KanjiVG seems to be limited to Japanese style characters ("Kanji"), at
> > least that's what the name and the unidimensional data structure
> > suggest. But this project could easily be extended, if not by
> > manpower then by a general system, to also support characters of other
> > locales (i.e. GTKV). This should provide substantial synergy also
> > benefiting the "Japanese side". While the Japanese folks have always been
> > the ones pushing in language-based technologies, I would welcome
> > co-operation with the other languages.
>
> At this point I should mention that I am not fluent at all regarding
> Japanese and Chinese writing systems - I just happen to contribute to
> KanjiVG because I use it in one of my projects.
>
> The structure of KanjiVG could probably be used as a basis for
> supporting other non-Japanese characters, but one has to check how
> this could conflict with the existing data (i.e. stroke order
> differences, etc.) Moreover, there are only a handful of people
> working on the project on a partial basis. I personally feel that the
> current Japanese set should be strengthened and fixed before any
> character extension is thought about.
I agree that it's pointless "forcing" a few people to do more work. That's
why I envision opening the system and encouraging others to participate. I
believe the overall vision is important: if you signal that you are open to
non-Japanese Kanji, you should find interested people.
> > I don't know about the data sources used to build KanjiVG, so there might
> > be requirements to satisfy, but Creative Commons by-nc-sa is pretty
> > restrictive and makes it incompatible with other important open source
> > licenses. From my viewpoint this hinders me using the included stroke
> > information for my LGPL sourced data used in cjklib. And vice versa this
> > will keep any foreign data from helping your project.
>
> This is a valid point, and it has already been raised in the few weeks
> KanjiVG has been available. :p Yes, I overlooked that, but now I also
> agree that the licence does not suit the purpose of the project
> (providing kanji information to open source/academic projects). This
> also prevents KanjiVG from being included in popular Linux
> distributions like Debian. Ulrich is the copyright holder, maybe he
> would agree to reconsider the licence to a CC-by-sa? IMO the
> "share-alike" clause is the one that is important here, since it
> ensures all usages of KanjiVG will be made open, whether they are
> commercial or not.
Yes, not shipping with Debian is a strong point. If you can satisfy the
requirements of the Debian project, then you can be sure that others will also
be able to.
From my perspective, neither CC license will help my work directly. I would
like to extract the stroke order information to enhance support in cjklib. I
have only a little data on this, e.g. ㇓ ㇏ ㇔ ㇕ ㇐ ㇐ ㇙ ㇒ ㇏ (SP-N-D-HZ-H-H-ST-P-N) for
⻝. Maybe Ulrich would be willing to release parts of the data under a different
license.
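As an aside, a sequence like the one above is easy to sanity-check, because
the abbreviations are the same ones Unicode uses in the names of the
characters in its CJK Strokes block. The following is only a rough sketch
using Python's standard unicodedata module; the helper name
stroke_abbreviations is made up for this mail and is not part of cjklib.

```python
import unicodedata

# The nine stroke characters quoted above, written as escapes so the
# example does not depend on font or mail encoding support.
STROKES = "\u31d3\u31cf\u31d4\u31d5\u31d0\u31d0\u31d9\u31d2\u31cf"

# Unicode names these characters "CJK STROKE <abbreviation>".
PREFIX = "CJK STROKE "


def stroke_abbreviations(strokes):
    """Return the Unicode abbreviation for each CJK stroke character.

    Hypothetical helper for illustration; raises ValueError for any
    character that is not named as a CJK stroke.
    """
    abbreviations = []
    for char in strokes:
        name = unicodedata.name(char)
        if not name.startswith(PREFIX):
            raise ValueError("not a CJK stroke character: %r" % char)
        abbreviations.append(name[len(PREFIX):])
    return abbreviations


# Join with hyphens to compare against a hand-written abbreviation list.
print("-".join(stroke_abbreviations(STROKES)))
```

Comparing the printed string with the hand-written abbreviation list catches
typos in either form of the data.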
Well, seeing that KanjiVG does come with component and stroke order data, it
does actually have an overlap with cjklib. While the latter could parse
KanjiVG's XML tree to incorporate these parts, such a tree could in return be
generated from cjklib's data (though currently very sparse), lacking the
path data of course. Stating the obvious: if you would already like to
manipulate data provided by KanjiVG, load it into cjklib.
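
To make the idea of parsing the XML tree a bit more concrete, here is a
minimal sketch in Python. Please read it as an assumption-laden illustration:
the element and attribute names (kanji, group, element, stroke, type, path)
and the sample entry are invented for this mail and not checked against
KanjiVG's actual schema or data, and the function is not cjklib code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for illustration only; the tag and
# attribute names are assumptions, not KanjiVG's real schema, and the
# decomposition shown has not been checked against any authority.
SAMPLE = """
<kanji id="example">
  <group element="\u4eca">
    <group element="\u4eba">
      <stroke type="\u31d2" path="M 10,10 L 5,20"/>
      <stroke type="\u31cf" path="M 10,10 L 15,20"/>
    </group>
    <stroke type="\u31d0" path="M 8,22 L 12,22"/>
    <stroke type="\u31c7" path="M 6,26 L 14,26 L 10,32"/>
  </group>
</kanji>
"""


def components_and_strokes(xml_text):
    """Return (components, stroke types) in document order.

    The path data is deliberately dropped: only the component and
    stroke order information is kept.
    """
    root = ET.fromstring(xml_text)
    # iter() walks the tree depth-first, i.e. in document order.
    components = [g.get("element") for g in root.iter("group")
                  if g.get("element")]
    strokes = [s.get("type") for s in root.iter("stroke")]
    return components, strokes


components, strokes = components_and_strokes(SAMPLE)
print(components, strokes)
```

Going the other way, from component and stroke lists back to such a tree,
would produce the same structure without the path attributes.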
Oh, you surely have heard of the Commons stroke order project [1]. They offer
ready-to-use .png and .gif images. A nice service could be to create similar
images for easy download. Additionally, will the work here be merged back into
the Kanji Stroke Order font from [2]? Were corrections done there already
merged back into Ulrich's data?
-Christoph
[1] http://commons.wikimedia.org/wiki/Commons:Stroke_Order_Project
[2] http://sites.google.com/site/nihilistorguk/
I will discuss the license with Alex in August. Programmers who need a
commercial license can talk with me now, too. My work on the kanji
data is probably the equivalent of two years of full-time work,
for which I was only partly funded. At the moment I don't want
others to make a profit from the data while I don't. This might change,
for example, if I get decent funding for a continuation of the
project, or if something became a real cooperation that would
help KanjiVG too.
Glyph selection follows, as I said, Japanese schoolbook fonts and an
adaptation to Kaisho style.
Standards and "official or credible sources" are a big problem. On
character files by the Unicode consortium you have a line that says
something like: The actual characters might look different too. If
you take Morohashi, Mojikyo or Wenlin as your standard you might run
into copyright issues, as they are commercial projects.
We will probably end up doing so anyway. I suppose that if Christoph were
aware of an "official or credible source" that he would suggest, he
would have told us.
Tim Eyre could make a font from my old data. He shouldn't have
problems making one from the new data too, as soon as it contains the
information about the places for numbering.
Cheers,
Ulrich
I *think* (understand: I am not a lawyer) it is, since your software
would just be a player for the data, and no derivative of KanjiVG
would make it into your code. It would be the same thing as using The
Gimp for opening a CC-licenced image.
So as long as the distribution means are respectful to both licences,
I guess you are safe since no "mix" actually occurs.
Still, we'll try to solve the licence issues with Ulrich when we meet
next month.
Alex.