<kanji midashi="沛" id="6c9b">
<strokegr element="沛">
<strokegr element="⺡" variant="true" original="水" position="left"
radical="general">
<stroke type="㇔" path="M19.13,17.25c4.1,1.65,10.59,6.78,11.62,9.34"/>
<stroke type="㇔" path="M13.5,41.25c4.24,1.63,10.94,6.21,12,8.75"/>
<stroke type="㇀"
path="M13.25,85.73c1.71,1.27,3.78,1.32,4.86-0.25c3.14-4.57,6.29-10.16,9.14-15.99"/>
</strokegr>
<strokegr element="市" position="right">
<strokegr element="亠" partial="true" position="top">
<stroke type="㇐"
path="M63.98,11.25c0.82,0.59,2.15,2.85,2.15,4.02c0,4.3-0.26,6.69-0.11,10.33"/>
</strokegr>
<strokegr element="巾" position="bottom">
<stroke type="㇑"
path="M39.75,27.98c1.03,0.1,3.38,0.67,4.38,0.53c8.09-1.11,36.81-4.2,47.79-4.48,c1.7-0.05,2.56,0.04,3.83,0.68"/>
<stroke type="㇆a"
path="M43.67,41.81c0.84,0.46,2.14,2.98,2.3,3.92c0.17,0.93,1.39,28.65,1.22,34.47"/>
<stroke type="㇑"
path="M45.99,44.54c3.68-0.3,37.96-4.5,40.69-4.79c2.32-0.25,3.73,1.81,3.81,3.79,C90.9,54.12,90.5,63,88.54,73.27c-1.68,8.81-5.07,3.48-7.4,0.05"/>
</strokegr>
</strokegr>
</strokegr>
</kanji>
Clearly a stroke is missing, the final one. Also, the fifth stroke
here seems to be part of the "nabebuta" rather than the "zoukin no
kin" part.
Guess what - that was me. ;) I'm fixing the 50 or so kanji for which
the stroke count differs between XML and SVG. This was one of them.
After that I'll try to include the stroke number positions as you
suggested.
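A stroke count mismatch between the two formats can be caught mechanically. The following is a minimal sketch, not the script actually used for KanjiVG. It assumes each kanji is available as an XML fragment shaped like the entry above and as an SVG string with one `<path>` per stroke; the function names and the tiny sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

def stroke_count(kanji_xml):
    """Count <stroke> elements at any depth below the root <kanji> element."""
    root = ET.fromstring(kanji_xml)
    return sum(1 for _ in root.iter("stroke"))

def svg_path_count(svg_xml):
    """Count <path> elements, ignoring the XML namespace prefix if present."""
    root = ET.fromstring(svg_xml)
    return sum(1 for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "path")

def find_mismatches(xml_by_kanji, svg_by_kanji):
    """Return {kanji: (xml_count, svg_count)} for entries whose counts differ."""
    result = {}
    for kanji, kanji_xml in xml_by_kanji.items():
        if kanji not in svg_by_kanji:
            continue  # nothing to compare against
        counts = (stroke_count(kanji_xml), svg_path_count(svg_by_kanji[kanji]))
        if counts[0] != counts[1]:
            result[kanji] = counts
    return result

# Hypothetical sample data: two strokes in the XML, three paths in the SVG.
SAMPLE_XML = """<kanji id="demo">
<strokegr>
<stroke type="a" path="M0,0c1,1,2,2,3,3"/>
<strokegr><stroke type="b" path="M5,5c1,1,2,2,3,3"/></strokegr>
</strokegr>
</kanji>"""

SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg">
<g><path d="M0,0c1,1,2,2,3,3"/><path d="M5,5c1,1,2,2,3,3"/>
<path d="M9,9c1,1,2,2,3,3"/></g>
</svg>"""

if __name__ == "__main__":
    print(find_mismatches({"demo": SAMPLE_XML}, {"demo": SAMPLE_SVG}))
```

A check like this only reports that the counts disagree. Deciding which side is right still needs a reference work.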
> Oddly enough there is an almost identical bug in the Tomoe data. In
> the Tomoe data the vertical stroke is written as one line rather than
> two.
Looks like Tomoe is right on this one, if I refer to other resources:
http://kakijun.main.jp/page/hai07200.html
Actually, I got fooled myself and fixed it to be 8 strokes instead of
7. Without your mail, that mistake would have been rolled in...
Alex.
I thought it might be.
>I'm fixing the 50 or so kanjis for which
> stroke count differ between XML and SVG. This was one of them. After
> that I'll try to include the stroke numbers positions as you
> suggested.
Thanks for that. Do you have some kind of meta-file?
>> Oddly enough there is an almost identical bug in the Tomoe data. In
>> the Tomoe data the vertical stroke is written as one line rather than
>> two.
>
> Looks like Tomoe is right on this one, if I refer to other resources:
>
> http://kakijun.main.jp/page/hai07200.html
It might be an error in Kanjidic (Jim Breen's kanji dictionary).
Kanjidic lists eight strokes (S8 in the kanjidic information). But the
Unihan database gives it eight strokes.
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=6c9b
I can't find the kanji in my stroke order dictionary.
I expect Jim Breen knows about this discrepancy already but perhaps
I'll raise it with him.
> Actually, I got fooled myself and fixed it to be 8 strokes instead of
> 7. Without your mail, that mistake would have been rolled in...
It's rather difficult for me to decide who is right about this kanji.
The Unihan database says 7 strokes, mostly:
kRSUnicode 85.4 (radical 85 + 4 strokes)
kRSKangXi 85.4
kTotalStrokes 7
But it also lists:
kRSAdobe_Japan1_6 C+5390+85.3.5
The New Nelson also prints it as having 8 strokes, yet CHISE and
GlyphWiki go for 7.
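The two Unihan readings can be compared by splitting the fields into their parts. This is a small sketch with hypothetical function names. It assumes that `kRSUnicode` has the form radical.residual and that `kRSAdobe_Japan1_6` has the form C-or-V + CID + radical.radical-strokes.residual. The stroke count of the radical form (3 for the left-hand water radical, as in the XML entry above) has to be supplied by the caller for `kRSUnicode`, because the field itself does not carry it.

```python
import re

def parse_rs_unicode(value):
    """'85.4' -> (radical number, residual strokes)."""
    m = re.fullmatch(r"(\d+)'*\.(-?\d+)", value)
    if m is None:
        raise ValueError(f"unexpected kRSUnicode value: {value!r}")
    return int(m.group(1)), int(m.group(2))

def parse_rs_adobe(value):
    """'C+5390+85.3.5' -> (kind, CID, radical, radical strokes, residual strokes)."""
    m = re.fullmatch(r"([CV])\+(\d+)\+(\d+)\.(\d+)\.(\d+)", value)
    if m is None:
        raise ValueError(f"unexpected kRSAdobe_Japan1_6 value: {value!r}")
    kind = m.group(1)
    cid, radical, radical_strokes, residual = (int(g) for g in m.groups()[1:])
    return kind, cid, radical, radical_strokes, residual

def total_from_unicode(value, radical_strokes):
    """Total strokes implied by kRSUnicode, given the radical's stroke count."""
    _, residual = parse_rs_unicode(value)
    return radical_strokes + residual

def total_from_adobe(value):
    """Total strokes implied by kRSAdobe_Japan1_6."""
    _, _, _, radical_strokes, residual = parse_rs_adobe(value)
    return radical_strokes + residual

if __name__ == "__main__":
    print(total_from_unicode("85.4", radical_strokes=3))  # 3 + 4
    print(total_from_adobe("C+5390+85.3.5"))              # 3 + 5
```

Read this way, the disagreement is entirely in the residual count of the right-hand component: 4 strokes according to `kRSUnicode`, 5 according to the Adobe field.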
Interestingly enough, at GlyphWiki (cooperative kanji font project)
the kanji was eventually edited from 8 to 7 strokes by one of the
active (Japanese) contributors:
http://glyphwiki.org/wiki/u6c9b
Anyone care to grab a Morohashi to check?
~ Jeroen
A better solution to this kind of thing would be to have both alternative forms.
> It might be an error in Kanjidic (Jim Breen's kanji dictionary).
> Kanjidic lists eight strokes (S8 in the kanjidic information). But the
> Unihan database gives it eight strokes.
>
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=6c9b
Sorry, this was a typo, should have said "gives it seven strokes" as
you can see from the link and the misplaced "But" here. Apologies!
Actually KanjiVG includes variants for many characters (which are not
released because they are probably full of errors, and I don't have
the necessary knowledge to validate them), but this one has none.
I guess Unihan could be trusted on this one (and anyway the XML data
goes that way too), but which online resource could be used as a
reference for checking KanjiVG's data?
Alex.
Yes, I know that I do make mistakes. The data didn't fall from the
sky, but was developed by actual persons, and people make mistakes.
The quality of the data should be about the same as the released data,
which covers JIS Level I and II and more than 6000 characters.
Variations are more than 5000 additional characters. Variations
concern character shapes and stroke order. The data is not "full of
errors," but needs checking.
It is a huge amount of character data, and nobody should be
astonished if it contains mistakes. It is a pain to check and to
correct it, but it is much more painful to build the data.
One major problem was naming the variation files and how to include
this data in the non-variant data. If we don't manage to write
documentation for the other data, how can we release data with, for
example, cryptic file names? We were also not sure whether users
really need this data as soon as possible.
As soon as we have documentation for the released data, we can write
some for the variation data and release it, too.
I developed the path data with help of an Austrian student for some
hundred characters with a Japanese schoolbook font as a model.
Schoolbook fonts and the writing style that is taught at Japanese
schools use a mixture of Mincho style and traditional handwritten
Kaisho style for the character shape. As reference for the stroke
order of the kyoiku kanji we used of course the official book of the
Ministry for Education: Monbushō (1958). Hitsujun shidō no
tebiki. Hakubundō, Tokyo.
This stroke order may differ from what calligraphers consider
correct. Calligraphers would consider Kaisho as the correct basic
style for handwriting. Kaisho also should be the way in which Chinese
characters are taught in Korea and Taiwan, and the writing style in
the People's Republic should be pretty close.
The variation data covers Kaisho, the shapes of characters that are
approved for use in names by the Ministry of Justice (where that shape
differs from the normal shape), and stroke order variants. As my main
reference, I used EMORI, Kenji (2003): Kaigyōsō
– Hitsujun jitai jiten. Sanseidō, Tokyo.
> I guess unihan could be trusted on this one
I wouldn't consider Unihan a really reliable source. One should go to
a kanwa jiten like Kanjigen.
I will be in Japan in August. I am planning to discuss the future of
the project with Alex then.
Further I am writing a research plan to get funding to continue the
work on further kanji data.
Ulrich
> As soon as we have a documentation for the released data, we can write
> one for the variation data and release it, too.
What kind of documentation are you thinking about?
Also, I wonder what your change management plans are? One of the
problems with Jim Breen's dictionary projects is that there is not
very much management of the changes which are entered in via people
from the internet. I think this point is very important if you want to
get "internet" collaboration. The actions needed are not just a way
for people to add entries but also a way to view the changes and a way
to roll the changes back if they turn out to be wrong.
> I developed the path data with help of an Austrian student for some
> hundred characters with a Japanese schoolbook font as a model.
> Schoolbook fonts and the writing style that is taught at Japanese
> schools use a mixture of Mincho style and traditional handwritten
> Kaisho style for the character shape. As reference for the stroke
> order of the kyoiku kanji we used of course the official book of the
> Ministry for Education: Monbushō (1958). Hitsujun shidō no
> tebiki. Hakubundō, Tokyo.
Thanks for the reference, I didn't know about this book.
> This stroke order may differ from what calligraphers consider
> correct. Calligraphers would consider Kaisho as the correct basic
> style for handwriting.
Calligraphers also differ in things like whether to have a "hane" or
not on the basis of their aesthetic judgements. I think it's very
difficult to satisfy all of them.
> The variation data covers Kaisho, the character shapes of characters
> that are approved for the usage in names by the Ministry of Justice,
> if this shape differs from the normal shape and stroke order
> variants. As reference, I used mainly EMORI, Kenji (2003): Kaigyōsō
> - Hitsujun jitai jiten. Sanseidō, Tokyo.
Thanks again for the reference.
>> I guess unihan could be trusted on this one
>
> I wouldn't consider Unihan a really reliable source. One should go to
> a kanwa jiten like Kanjigen.
>
>
> I will be in Japan in August. I am planning to discuss the future of
> the project with Alex then.
>
> Further I am writing a research plan to get funding to continue the
> work on further kanji data.
Good luck with your plan and thanks for your work so far.
I think it's a really valuable resource, even under non-commercial
licence conditions.
That was careless of me. I may have stated that too strongly - what I
mean is that the SVG and XML data do not match closely enough to be
released in the same file (with respect to stroke count, for instance).
My apologies if you felt I was questioning the quality of the data - as
its first and main user (and a happy one at that), I am not.
> I developed the path data with help of an Austrian student for some
> hundred characters with a Japanese schoolbook font as a model.
> Schoolbook fonts and the writing style that is taught at Japanese
> schools use a mixture of Mincho style and traditional handwritten
> Kaisho style for the character shape. As reference for the stroke
> order of the kyoiku kanji we used of course the official book of the
> Ministry for Education: Monbushō (1958). Hitsujun shidō no
> tebiki. Hakubundō, Tokyo.
>
> This stroke order may differ from what calligraphers consider
> correct. Calligraphers would consider Kaisho as the correct basic
> style for handwriting. Kaisho also should be the way in which Chinese
> characters are taught in Korea and Taiwan, and the writing style in
> the People's Republic should be pretty close.
>
> The variation data covers Kaisho, the character shapes of characters
> that are approved for the usage in names by the Ministry of Justice,
> if this shape differs from the normal shape and stroke order
> variants. As reference, I used mainly EMORI, Kenji (2003): Kaigyōsō
> – Hitsujun jitai jiten. Sanseidō, Tokyo.
Great - this is exactly the information I wanted. I'll try to get my
hands on those. Do you mind if I start writing a wiki page with this
information? I guess many users could think some stroke orders are
incorrect without this information.
> Further I am writing a research plan to get funding to continue the
> work on further kanji data.
I really hope you can get it. With the proper means, this project can
really bring a great contribution.
Alex.
There are several research papers on the project in English and in
German, where we describe what we have done, how we have done it, with
which resources etc. If this information had been collected in
one place, a lot of Christoph's "Criticism", for example, wouldn't
have been necessary.
> Also, I wonder what your change management plans are?
Alex has set up a Subversion repository for managing changes to
the path data. I must admit that I have some doubts about "internet"
collaboration here -- but of course, you can prove me wrong. As I
said, I could pay a student to work on a number of characters. He was
very willing and motivated. He had the correct software. He had a good
knowledge of kanji and Japanese. But nevertheless, I wasn't always
satisfied with the quality of his work and it took us some time until
it worked out fine.
> Calligraphers also differ in things like whether to have a "hane"
Actually, calligraphers shouldn't differ too much in the use of "hane"
in Kaisho (exception would be 本, where some write hane some don't;
the 木 and its radical is normally written with hane in Kaisho).
Schoolbook fonts use hane much less. That is one of the influences of
Mincho.
> Good luck with your plan and thanks for your work so far.
>
> I think it's a really valuable resource, even under non-commercial
> licence conditions.
Thank you! I am very glad to hear that!
Ulrich
I was wondering if you meant documentation of the format of the file?
It would be nice to write parsers to a specification rather than just
guessing or using trial and error.
> If this information would have been collected at
> one place, for example a lot of Christoph's "Criticism" wouldn't have
> been necessary.
I think the phrase to describe this criticism is "looking a gift horse
in the mouth". Until now the only free information I could find was
the poor-quality data in Tomoe, so it's a big plus to have this data
set.
>> Also, I wonder what your change management plans are?
>
> Alex has set a subversion system for file management of changes to
> the path data. I must admit that I have some doubts about "internet"
> collaboration here -- but of course, you can prove me wrong.
The next step for me is to send the data into the recognition engines
I have here (Zinnia and Kanjipad) based on the Tomoe data and see what
I can come up with. I don't guarantee useful results. For the Tomoe
data I have a file of errors which I think has about 200 or 300 errors
listed. I posted something on the Tomoe mailing list a long time ago
but nobody responded so I am just accumulating these for my own
purposes.
Programming-wise, that involves making something to turn the curves
into a series of dots since the recognition engines don't work with
curves. I think the algorithm is straightforward but it involves a few
hours of coding. Then I will just push the button and let Tomoe and
KanjiVG data battle each other for supremacy. I don't anticipate
much useful data for KanjiVG coming out of this, though, since I know
that the Tomoe data was entered in quite a rough and ready way. I
expect mostly to just find more errors in the Tomoe data.
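Before any curve can be turned into dots, the path string has to be split into absolute curve segments. The following is a rough sketch under stated assumptions: it handles only the `M`, `c` and `C` commands that appear in the XML entry at the top of this thread, it ignores exponent notation, and the function name `parse_cubic_path` is made up for illustration.

```python
import re

# A token is either a command letter or a number. Numbers in path data may
# run together ("4.86-0.25"), so a leading minus sign starts a new number.
TOKEN = re.compile(r"[A-Za-z]|-?(?:\d+\.?\d*|\.\d+)")

def parse_cubic_path(d):
    """Return a list of (p0, p1, p2, p3) tuples in absolute coordinates."""
    tokens = TOKEN.findall(d)
    segments = []
    current = (0.0, 0.0)   # current pen position
    command = None         # last command letter seen
    i = 0
    while i < len(tokens):
        if tokens[i].isalpha():
            command = tokens[i]
            i += 1
        elif command == "M":
            # Absolute move: just update the pen position.
            current = (float(tokens[i]), float(tokens[i + 1]))
            i += 2
        elif command in ("c", "C"):
            nums = [float(t) for t in tokens[i:i + 6]]
            if len(nums) != 6:
                raise ValueError("incomplete curve segment")
            points = list(zip(nums[0::2], nums[1::2]))
            if command == "c":
                # Relative form: all three points are offsets from the
                # start point of this segment.
                points = [(current[0] + x, current[1] + y) for x, y in points]
            segments.append((current, points[0], points[1], points[2]))
            current = points[2]
            i += 6
        else:
            raise ValueError(f"unsupported command: {command!r}")
    return segments

if __name__ == "__main__":
    d = "M13.25,85.73c1.71,1.27,3.78,1.32,4.86-0.25c3.14-4.57,6.29-10.16,9.14-15.99"
    for segment in parse_cubic_path(d):
        print(segment)
```

Because the command letter is remembered between segments, a second group of six numbers after a `c` is read as another relative curve, which matches how the SVG path syntax treats repeated coordinates.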
> As I
> said, I could pay a student to work on a number of characters. He was
> very willing and motivated. He had the correct software. He had a good
> knowledge of kanji and Japanese. But nevertheless, I wasn't always
> satisfied with the quality of his work and it took us some time until
> it worked out fine.
>> Calligraphers also differ in things like whether to have a "hane"
>
> Actually, calligraphers shouldn't differ too much in the use of "hane"
> in Kaisho
Maybe they shouldn't, but they do! Until two years ago I went to
calligraphy class every week, doing "kaisho", and the teacher used to
often change the "hane" from the textbook ones. (Or maybe my teacher
was strange, or something?)
> (exception would be 本, where some write hane some don't;
> the 木 and its radical is normally written with hane in Kaisho).
> Schoolbook fonts use hane much less. That is one of the influences of
> Mincho.
I think the schoolbook typefaces are standardised about the use of
hane since children are taught them at school.
[...]
> > If this information would have been collected at
> > one place, for example a lot of Christoph's "Criticism" wouldn't have
> > been necessary.
>
> I think the phrase to describe this criticism is "looking a gift horse
> in the mouth". Until now the only free information I could find was
> the poor-quality data in Tomoe, so it's a big plus to have this data
> set.
Sorry, I did not mean to be understood that way. Ulrich, you as the copyright
holder of this data are free to do whatever you like with it. My point though
holds: a less restrictive license would benefit a lot of other projects (and I
mean non-commercial ones). Thanks for your effort so far. I'll contribute
should I ever come across errors personally. I'm glad the "Japanese side" has
this strong support.
Christoph
My idea for managing contributions is that they must be validated by a
knowledgeable person before making it into the subversion repository.
It is very easy for people like me to actually add mistakes while
thinking we are fixing some. If that validation step is in place, I
think KanjiVG has plenty to gain from being open.
Of course, a clear description of the format and of the rules used for
stroke order is also important, otherwise people may contribute wrongly
because of a lack of guidelines. Not being a kanji expert myself I can
hardly help a lot.
> Sorry, I did not mean to be understood that way. Ulrich, you as the copyright
> holder of this data are free to do what ever you like with it. My point though
> holds, a less restrictive license would benefit a lot of other projects (and I
> mean non-commercial ones). Thanks for your effort so far. I'll contribute
> should I ever come across errors personally. I'm glad the "Japanese side" has
> this strong support.
The licence is definitely something I want to discuss again with
Ulrich in August. I realized that the non-commercial clause will
dramatically limit the applicability of KanjiVG because it makes it
incompatible with licences like the GPL. So I'm not even sure if
Tagaini is not actually violating KanjiVG's licence, even though it
has no commercial ambitions. :(
I had actually confused the non-commercial and share-alike
clauses. Non-commercial says, you cannot make any money from that data
in any way. Share-alike says, you must redistribute products using
this data under the same clause as the data (à la GPL). The
share-alike clause is the one that is important, because it ensures
that people using the data contribute back to the community, whether
they are trying to make money with it or not. And I say "trying",
because in practice anybody can distribute the modifications freely
too, so it is in fact very difficult, if not impossible, to make
monetary profit with such a clause.
On the other hand, the non-commercial clause prevents projects using
KanjiVG from being included in major Linux distributions, because the
companies behind them may be selling CDs containing the data, which
would be in violation of the licence.
This is my point for now - in my opinion, the Share-alike clause is
sufficient to prevent "unfair" use of the data, since any modification
must be made available freely. On the other hand, the Non-commercial
clause just gets in the way for projects licenced under the GPL. Such
a clause may have a purpose for artistic creations like music, where a
performance may actually generate money, but for software and data I
don't think it makes much sense.
Alex.
> I think the phrase to describe this criticism is "looking a gift horse
> in the mouth". Until now the only free information I could find was
> the poor-quality data in Tomoe, so it's a big plus to have this data
> set.
I don't think that the data in Tomoe and KanjiVG can be compared.
They don't serve the same purpose. Kanji in Tomoe are originally
conceived as templates for the "simple recognizer" in Tomoe. This
explains why they are made exclusively of straight lines. Kanji in
KanjiVG seem to be originally designed more as a reference for
learning Japanese. Of course, nothing prevents you from sampling points
from the curves (converting a curve to a series of points) and using them
for handwriting recognition, which is not a bad idea given the high
quality of the data.
>
> The next step for me is to send the data into the recognition engines
> I have here (Zinnia and Kanjipad) based on the Tomoe data and see what
> I can come up with. I don't guarantee useful results. For the Tomoe
> data I have a file of errors which I think has about 200 or 300 errors
> listed. I posted something on the Tomoe mailing list a long time ago
> but nobody responded so I am just accumulating these for my own
> purposes.
I think that they are just busy with their work or other personal
projects... Benoit Cerrina (the author of ShinKanji) also found many
errors and fixed them in the SVN repo by himself. I'm pretty sure that
if you ask for it, they will grant you access to the repo.
>
> Programming-wise, that involves making something to turn the curves
> into a series of dots since the recognition engines don't work with
> curves. I think the algorithm is straightforward but it involves a few
> hours of coding. Then just I will push the button and let Tomoe and
> KanjiVG data battle each other for supremacy.
I don't really understand what you mean by supremacy here. As far as I
understand, either you use the Tomoe data to train Zinnia and the
KanjiVG data to test Zinnia, or the other way around. My intuition is
that the former is better because KanjiVG looks more like real
handwriting and thus is a good candidate for testing.
I was planning to write a small script to measure the accuracy (as
well as precision / recall) of Zinnia against KanjiVG. Have you done
it already? I'm interested in your routines for sampling points from
the bezier curves if you have some code to share.
Also, you may not know about this project because it is fairly new but
there's a new handwriting recognition project for Japanese/Chinese
called Tegaki - http://tegaki.sourceforge.net/. It's available in
Debian sid already.
Like Tomoe, it provides applications for the end-user (scim-tegaki,
tegaki-recognize and soon tegaki-train to train models with your own
handwriting). Unlike Tomoe, it provides truly reusable widgets in
tegaki-gtk (widgets in Tomoe are too tightly coupled to the
recognizer). There's also "tegaki-lab", which is a playground to
experiment with new recognizer ideas. Another difference is that most
of the code is in Python. We want to use C or C++ only for
recognizers. In Tomoe, everything is written in C and they also
provide bindings for several languages... The project resources are
already very limited so I think it's better to focus on providing good
functionality to users rather than pursuing the hopeless dream of
programming language universality (developers care, users don't)...
Another project we've started recently (Ian Johnson and I) is the
tegaki database. The goal is different from KanjiVG. Our goal is to
collect handwriting samples from people in the hope that they will be
useful to advance the state of open-source recognizers (either as
training data or test data). Of course, if nicely written samples are
available, we can always use them as reference characters but I think
that KanjiVG will be better for that purpose. Our goal is to have a
client-server infrastructure, with a web interface as well as a web
service API. We hope that application (dictionary, learning assistant,
game) developers will use our API in order to send handwriting samples
to our database. We don't have anything to show yet but just to let
you know that this is also an area in which we are working.
Mathieu Blondel
>> Then just I will push the button and let Tomoe and
>> KanjiVG data battle each other for supremacy.
>
> I don't really understand what you mean by supremacy here.
I mean "which one is correct".
> I was planning to write a small script to measure the accuracy (as
> well as precision / recall) of Zinnia against KanjiVG. Have you done
> it already? I'm interested in your routines for sampling points from
> the bezier curves if you have some code to share.
I don't have any code to share at the moment, but the PNG viewer and
the parser which I posted are both steps along this path. The next
step is to write something to turn the Bezier curves like
c0.58,2.12-0.03,3.64-1.16,6.56
into a series of points.
The algorithm is very simple, but it's easy to make a mistake, so the
hard part of this is not coding the algorithm but coding the checking
code to make sure that the points actually do represent the curve
correctly.
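The sampling step and its checking code might look roughly like this. It is only a sketch: the start point (0, 0) is an assumption, since the quoted fragment `c0.58,2.12-0.03,3.64-1.16,6.56` is relative and its start point is not given here, and all function names are hypothetical. The check compares two independent ways of evaluating the same cubic Bezier curve (the Bernstein form and de Casteljau's construction), so a slip in one is likely to show up as a disagreement.

```python
def bezier_point(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at t using the Bernstein polynomials."""
    u = 1.0 - t
    a, b, c, d = u * u * u, 3 * u * u * t, 3 * u * t * t, t * t * t
    return (a * p0[0] + b * p1[0] + c * p2[0] + d * p3[0],
            a * p0[1] + b * p1[1] + c * p2[1] + d * p3[1])

def de_casteljau(p0, p1, p2, p3, t):
    """Evaluate the same curve by repeated linear interpolation."""
    def lerp(a, b):
        return (a[0] + (b[0] - a[0]) * t, a[1] + (b[1] - a[1]) * t)
    q0, q1, q2 = lerp(p0, p1), lerp(p1, p2), lerp(p2, p3)
    r0, r1 = lerp(q0, q1), lerp(q1, q2)
    return lerp(r0, r1)

def sample_curve(p0, p1, p2, p3, n=8):
    """Return n + 1 points along the curve, both endpoints included."""
    return [bezier_point(p0, p1, p2, p3, i / n) for i in range(n + 1)]

def check_samples(p0, p1, p2, p3, points, tol=1e-9):
    """True if every sample agrees with de Casteljau's construction."""
    n = len(points) - 1
    for i, (x, y) in enumerate(points):
        ex, ey = de_casteljau(p0, p1, p2, p3, i / n)
        if abs(x - ex) > tol or abs(y - ey) > tol:
            return False
    return True

if __name__ == "__main__":
    start = (0.0, 0.0)  # assumed start point, not taken from the data
    # Relative offsets from "c0.58,2.12-0.03,3.64-1.16,6.56"
    offsets = [(0.58, 2.12), (-0.03, 3.64), (-1.16, 6.56)]
    p1, p2, p3 = [(start[0] + dx, start[1] + dy) for dx, dy in offsets]
    points = sample_curve(start, p1, p2, p3)
    print(points)
    print(check_samples(start, p1, p2, p3, points))
```

Sampling at equal steps of t does not give equally spaced points along the stroke. Whether that matters for a given recognition engine is something to test rather than assume.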
Yes, it can use more than one sample per character for training. If
you are going to use Zinnia to recognize real handwriting, this is a
good idea. However, the idea I suggested was to measure how well
Zinnia performs. This can be useful to the author of Zinnia in order
to tune his algorithm. If you train Zinnia with the Tomoe data, you
can use the KanjiVG data as a test set. For each kanji in KanjiVG, you
send the kanji to Zinnia and you check that the expected kanji is
recognized. In that case, it's better not to train it with both
because it becomes a closed evaluation (it's easier for Zinnia to
recognize exactly the same data as it has been trained with than data
it has never seen).
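The measurement described above could be sketched as follows. This does not use Zinnia's real API: `recognize` is a hypothetical callable that takes the strokes of one character and returns candidate characters ordered from best to worst, and the test set is assumed to be a list of (expected character, strokes) pairs built from KanjiVG.

```python
def evaluate(test_set, recognize, top_n=1):
    """Return (accuracy, misses) for a recognizer on a held-out test set.

    test_set  -- iterable of (expected_character, strokes) pairs
    recognize -- hypothetical callable: strokes -> ranked list of candidates
    top_n     -- count a hit if the expected character is in the first top_n
    """
    hits = 0
    total = 0
    misses = []
    for expected, strokes in test_set:
        candidates = list(recognize(strokes))[:top_n]
        total += 1
        if expected in candidates:
            hits += 1
        else:
            misses.append((expected, candidates))
    accuracy = hits / total if total else 0.0
    return accuracy, misses

if __name__ == "__main__":
    # Stand-in recognizer for demonstration only: it keys on stroke count.
    def fake_recognize(strokes):
        return {1: ["A", "B"], 2: ["B", "C"]}.get(len(strokes), [])

    test_set = [("A", [[(0, 0), (1, 1)]]),
                ("C", [[(0, 0), (1, 1)], [(2, 2), (3, 3)]])]
    print(evaluate(test_set, fake_recognize, top_n=1))
    print(evaluate(test_set, fake_recognize, top_n=2))
```

The list of misses is as useful as the accuracy figure, since each miss is either a weakness of the recognizer or a candidate error in one of the two data sets.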
Mathieu Blondel