[kanjivg] what about the wiki commons stroke order project?

lrdict

May 23, 2010, 9:20:46 PM
to KanjiVG
Hello,

I just discovered this project, and I really like the approach of encoding
stroke order information in a standard text-based format.

There is a Wikimedia Commons project that, apart from the data format,
seems to have the same objective but a bigger scope (it is not
restricted to Japanese kanji...).

Please have a look at their project homepage:

http://commons.wikimedia.org/wiki/Commons:Stroke_Order_Project

My questions are: Are you aware of their project? Why is this project
separate from the wiki one? Wouldn't it be better to work on a common
XML format for all the Han characters together with them? Why not
propose adding the KanjiVG-style XML description to their list of
formats (they only seem to have sequence pictures, colored pictures,
animated GIFs and the like)?

I think their project could benefit from a simple, small, and
easy-to-handle text format for the stroke-order description, and this
project could benefit from their existing stroke orders (it should be
easy to watch their animated GIFs and translate them into an XML
file...). Then the work would not be done twice. I hope the separation
is not due to license incompatibilities.

Best regards
Lutz



--
You received this message because you are subscribed to the "KanjiVG" group.
For options and unsubscribing, visit this group at
http://groups.google.com/group/kanjivg

Alexandre Courbot

May 23, 2010, 9:45:07 PM
to kan...@googlegroups.com
Hi,

> My questions are: Are you aware of their project?

Yes, and actually there is also the GlyphWiki project that does
something similar (better in my opinion): http://glyphwiki.org/

> Why is this project
> separate from the wiki one? Wouldn't it be better to work on a common
> XML format for all the Han characters together with them? Why not
> propose to add the KanjiVG style XML description to their list of
> formats (they only seem to have sequence pictures, colored pictures,
> animated gifs and the like).

There are several reasons for that, the first being that both projects
started separately. The goal is also different - KanjiVG aims at
covering Japanese characters first, which may sometimes differ from
Chinese ones.

While the Commons project provides ready-to-use material for display
on websites (mainly animated GIFs), KanjiVG provides the raw material
(stroke order, shape, and structural information) from which such
material can be generated. The SVG strokes are just raw paths, but with
a proper stroke renderer, a program could generate exactly the same
result as the Commons project. Plenty of other things are also
possible, such as using the data for handwriting recognition
(http://kanji.sljfaq.org/kanji16/draw-canvas.html), building a
stroke-order animation widget for an application
(http://www.tagaini.net), extracting component information like the
kanjirad files, finding similar kanji, and so on.
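
To illustrate the point about generating Commons-style output from raw stroke data, here is a minimal Python sketch. It assumes a simplified, hypothetical XML layout: the `kanji`, `strokegr`, and `stroke` element names, the `midashi` and `path` attributes, and the canvas size are assumptions made for illustration and may not match the actual KanjiVG files, and the two paths are made-up placeholders. It builds one SVG frame per stroke, each showing the strokes drawn so far with the newest one in red, roughly like a stroke sequence diagram.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stroke data (not taken from the real KanjiVG files).
SAMPLE = """
<kanji midashi="二">
  <strokegr element="二">
    <stroke path="M20,35 H90"/>
    <stroke path="M10,80 H100"/>
  </strokegr>
</kanji>
"""


def stroke_paths(xml_text):
    """Return the SVG path strings of all strokes, in document (drawing) order."""
    root = ET.fromstring(xml_text)
    return [stroke.get("path") for stroke in root.iter("stroke")]


def sequence_frames(paths, size=109):
    """Build one SVG document per stroke; the most recent stroke is drawn in red."""
    frames = []
    for n in range(1, len(paths) + 1):
        body = "".join(
            '<path d="{}" fill="none" stroke="{}" stroke-width="3"/>'.format(
                d, "red" if i == n - 1 else "black"
            )
            for i, d in enumerate(paths[:n])
        )
        frames.append(
            '<svg xmlns="http://www.w3.org/2000/svg" '
            'viewBox="0 0 {0} {0}">{1}</svg>'.format(size, body)
        )
    return frames


if __name__ == "__main__":
    for frame in sequence_frames(stroke_paths(SAMPLE)):
        print(frame)
```

Each frame could be written to its own file or handed to an SVG renderer to obtain bitmaps; an animation would simply show the frames one after another.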

> I think their project could benefit from a simple, small, and easy to
> handle text format for the stroke-order description, as well as this
> project could benefit from the existing stroke orders (it should be
> easy to watch their animated gifs and translate that into a xml
> file...). Then, also, the work would not be done twice...  I hope this
> is not because of license incompatibilities.

KanjiVG is not a simple and easy text format; instead, it tries to be
as precise as possible in its description of each kanji. I don't think
translating animated GIFs to SVG would even be possible, and even if
it were, the result would still lack the component description.

KanjiVG can be seen as metadata from which plenty of outputs can be
produced, including what the Commons project offers. Indeed, it would
be great if this format were adopted more widely, but for the Commons
project that would mean restarting from scratch.

KanjiVG's licence is just slightly less permissive than that of the
Commons project (the same CC 3.0, but with a share-alike clause).

Hope this answers your questions,
Alex.

hugo lopes

May 24, 2010, 5:07:54 AM
to kan...@googlegroups.com
Hi,
Since you are discussing the Commons Stroke Order Project (Commons SOP),
I think I can join the conversation: for 5 years I have been the leader/manager/architect of this project (though not its most productive member).

= = = = =
What made the Commons Stroke Order Project:
First, KanjiVG and the Commons SOP are not comparable, which is why I never talked about it before. KanjiVG is simply and definitely BETTER.
On Commons, our REAL wish/dream was to make a cascade SVG project: starting from the SVGs, a dedicated script would create series of specific types of stroke order diagrams. But none of us (Wikipedia users) were programmers. So we just collected bitmap images, then created new conventions, then created some sets of diagrams:
  • the -bw.png (printable Black and White) set of 1300 stroke order diagrams
  • the -red.png (printable RED to black) set of a few dozen character images
  • the -order.gif (animated stroke ORDER) set of some 400 nice GIF animations.

= = = = =
Limitations of the Commons SOP:
The conclusion I draw, as the main leader of this Commons project, is that the Commons workflow (creating and uploading bitmap image files to a wiki one by one) is really time-consuming and has no prospect of evolving. Commons does NOT include tools to manage large sets of images, and bitmap images have big limitations in themselves.

Indeed, our team on Commons, mainly just 4 contributors working since 2005, plans to finish the set of GIF animations for the Kangxi radicals and then leave the project as it is. New projects such as KanjiVG are simply the technology we need for now and tomorrow. ; )

The SVG format of KanjiVG opens new opportunities that we, the contributors to the Commons Stroke Order Project, have waited for SO.... LOOONG.

= = = = =
Limitations of KanjiVG:
However, a big limitation of KanjiVG is indeed that it focuses ONLY on the Japanese stroke order, despite the fact that the Taiwan, PRC, and HK stroke orders differ for some characters.
For now, the need is to focus on Japanese stroke order, for sure.
But for a free project, this is in my opinion a big long-term issue for KanjiVG:
  • the need to expand to cover all stroke order policies and practices (multi-country),
  • the need to set up conventional names for files (S/T/J/H ?),
  • and the need to state/provide trustworthy sources.

= = = = = 
Official Stroke Order Sources:
Since 2006, in my free time, I have done a small cross-country survey of this stroke order issue. These are the sources for a multi-country approach:
  • tw: 常用國字標準字體筆順手冊 (Stroke order 14 rules), Taiwan Ministry of Education. Book available online (authoritative). ISBN 957-00-7082-X
  • cn: 現代漢語通用字筆順規範, 453 pages, 1997, publisher: 语文出版社, ISBN 7801262018 (authoritative)
  • jp: 筆順指導の手びき (Hitsujun shidō no tebiki), 1958 (authoritative from 1958 to 1977)
    Note: nowadays, the Japanese Ministry of Education lets publishers freely set a character's stroke order, which should « follow commonsensical orders which are widely accepted in the society ».
  • hk: 香港標準字形及筆順 (Stroke Order in Hong Kong), Hong Kong Department of Education's List of Commonly Used Characters (authoritative)

= = = = =
Last!
As a representative of the Wikipedians, all my encouragement to KanjiVG : )

-- 
羅禹國 - Hugo LOPEZ,
Tw. tel: 09-8343-9890
Institute of Innovation, Technology and Management (Master 1)
NCHU, Taizhong, Taiwan.

lrdict

May 24, 2010, 8:58:38 AM
to KanjiVG
Hi there!

Thank you for your replies, and much respect for the work done by
everyone! It is impressive to find people of such caliber here!

It seems that the following is true:

KanjiVG is one level of abstraction higher than GlyphWiki (which
provides, among other things, direct SVG paths), because it arranges
the elementary paths in an ordered tree structure of stroke groups. The
Commons project, even though it was intended to be a collection of
cascade SVG data, is in turn less abstract than GlyphWiki, because it
provides various bitmap images, which can be seen as renderings of SVG
paths.

Ordered, because the stroke groups on the same tree level need to be
drawn in the order provided. Tree structure, because each stroke group
can be defined by more basic stroke groups. The leaves of this tree are
elementary strokes.

However, in KanjiVG the elementary strokes are not defined globally.
Instead, each character includes its own elementary strokes in its
definition, giving their SVG paths as defined in the SVG specification.
So each elementary stroke is redefined in every character.
The benefit is probably to ensure that the strokes are positioned
correctly relative to each other. Export to SVG is also easy, as it
seems to be sufficient to strip all stroke group information, leaving
just the elementary SVG stroke paths.
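
The ordered tree and the "strip the groups" export described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Stroke` and `Group` classes, the canvas size, and the coordinates are invented for the example and are not the KanjiVG data model. A depth-first, left-to-right walk of the tree yields the strokes in drawing order, and dropping the group nodes leaves plain SVG paths.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Stroke:
    path: str  # SVG path data for one elementary stroke (a leaf of the tree)


@dataclass
class Group:
    element: str  # the component this stroke group represents
    children: List[Union["Group", Stroke]] = field(default_factory=list)


def flatten(node):
    """Depth-first, left-to-right walk: returns stroke paths in drawing order."""
    if isinstance(node, Stroke):
        return [node.path]
    paths = []
    for child in node.children:
        paths.extend(flatten(child))
    return paths


def to_plain_svg(node, size=109):
    """Drop all group information and keep only the ordered stroke paths."""
    body = "\n".join(
        '  <path d="{}" fill="none" stroke="black"/>'.format(d)
        for d in flatten(node)
    )
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {0} {0}">\n'
        "{1}\n</svg>".format(size, body)
    )


# Made-up example: 仁 as two groups of two strokes each (placeholder coordinates).
REN = Group("仁", [
    Group("亻", [Stroke("M35,15 L15,50"), Stroke("M25,40 V95")]),
    Group("二", [Stroke("M50,40 H90"), Stroke("M45,85 H100")]),
])

if __name__ == "__main__":
    print(to_plain_svg(REN))
```

Because the order is carried by the position of each child in its parent, no explicit stroke numbering is needed in this sketch.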

This is a really good design, I think, and it would be great to
extend this specification for multi-country use; I am thinking
specifically of Han characters.

There seems to be no technical obstacle to defining an S
(Simplified Chinese), T (Traditional Chinese), PRC, ROC, Korean, etc.
format; it just has to be done so that the collection of stroke
orders for these other countries can start. Going from <kanji> </kanji>
to <hanzi_s>, <hanzi_t>, etc. is not a big step.

Besides the technical specification, the license unfortunately has to
be addressed. I wonder whether, for the general scope of multi-country
stroke orders, the Commons SOP with its license is not the better
place to collect all this information (no offense intended).

For this multi-country stroke order collection, I think it would be
fair to apply the same license as the Commons SOP, which apparently
doesn't include the share-alike clause. If I understand correctly, it
would be permissible (and the practical approach) to transform the
stroke order information encoded in the bitmaps of the Commons SOP into
this new format (while stating the source, license, and authors) even
if the share-alike clause were added. But it would seem better and
fairer not to add this restriction to the license, simply because of
the vast amount of data this could potentially bring in; the objective
would be to have at least all Commons SOP stroke orders available in
this new specification too.

Indeed, a program would have difficulty transforming the graphics into
XML data, but humans should be able to do it, recognizing the
different components of the character given the order of all paths and
the image of the character (and perhaps the paths of the GlyphWiki
project?). There are roughly 1300 characters (duplicates included, I
assume) to be processed…

Also, I think almost anybody can imagine an XML specification that
orders strokes component-wise in a tree structure… so the KanjiVG
license seems to effectively cover only the so-far encoded characters.
There might be people redoing the work of encoding in text format the
stroke order for the Commons SOP or other projects, just to escape the
restrictions of the share-alike clause of KanjiVG.

Therefore, it would be better for the multi-country specification to be
no more restrictive than the Commons SOP. This means that KanjiVG
could not be included, which shows again how problematic license
issues are.

I agree that the way KanjiVG specifies the information is the right
technology for today and tomorrow.

Depending on the license choice here, the Commons SOP could actually
be the natural place to add the specification of a multi-country
text-based, ordered, tree-structured stroke order description (because
it would essentially be only a format transformation of their data). I
don't see it as drastically as them having to start from scratch; it
would rather be the addition of a format. Moreover, I assume the
Commons SOP has much more visibility, which should result in faster
encoding of stroke orders. All these tedious choices would be
irrelevant if the two projects had identical licenses, right?

What is Dr. Apel's opinion?


> However, a big limitation of KanjiVG is, indeed, that it focus ONLY on the
> Japanese stroke order. This despite the fact that Taiwan, PRC, and HK stroke
> order is different for a part of characters.
> The need is now to focus on Japanese stroke order, for sure.
> But as a free project, this in my opinion a big *longterm* issue for
> Kanjivg:
> - the need to expand to cover all the stroke order policies and practices (multi-countries),

Indeed!

> - the need to set up conventional names for files (S/T/J/H ?),

Yes … (btw what is H?).

> - and the need to state/provide trustable sources

Are you thinking of a status attribute in the XML, such as "unverified"
or "verified with the source XYZ"?

As seen above, I would add the license/project-home question for
multi/new-country data to the list of open issues.
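
As a rough idea of what the status attribute suggested above might look like, here is a small Python sketch. The `status` and `source` attribute names, the `characters` wrapper element, and the sample entries are all hypothetical; nothing like this is known to exist in the KanjiVG format. The function simply lists the characters whose stroke order has not yet been checked against a source.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: "status" and "source" are invented attribute names.
DATA = """
<characters>
  <kanji midashi="三" status="verified" source="XYZ"/>
  <kanji midashi="王" status="unverified"/>
</characters>
"""


def unverified(xml_text):
    """Return the characters whose entry is not marked as verified."""
    root = ET.fromstring(xml_text)
    return [
        entry.get("midashi")
        for entry in root.iter("kanji")
        if entry.get("status") != "verified"
    ]


if __name__ == "__main__":
    print(unverified(DATA))
```

Treating a missing attribute as "unverified" is a deliberate choice in this sketch, so that entries added without any annotation still show up for review.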


Best regards
Lutz

hugo lopes

May 24, 2010, 10:37:22 AM
to kan...@googlegroups.com
Hi,
It's late, so I will just answer quickly.

= = = =
Multi-country in Commons SOP:
* J - stands for Japanese 'official' stroke order
* T - stands for Taiwanese (ROC) official stroke order
* (nothing) or M - stands for Modern China (PRC)
* H - may stand for Hong Kong
* K - may stand for Korea
* S - may stand for Singapore
J (Japan), T (ROC), M (PRC), and H (Hong Kong) have official or semi-official standards. I listed those sources in my previous message.
K (Korea) and S (Singapore) have no official stroke order standard.

Note: stroke order is actually not that important; it is just a recommendation to ease learning, writing, and reading. Accordingly, K and S may never publish any specific standard.

= = = =
Licence:
Personally, after 5 years working on the Commons SOP (CC-by), I now prefer Share-Alike, as used by KanjiVG (CC-by-sa). We share, they share. That's mutual help; that's fairer.

= = = =
Getting data:
As for getting the stroke order data from Commons SOP, I don't see how: this project worked in bitmap, so there is no automatically extractable data.
There are only 1300 entries on Commons SOP, while KanjiVG has 6000+ (if I'm right). Since most stroke orders are identical across all countries, the variants to input to get multi-country coverage are likely just 5~10%. Not a big deal.
See:
                   ROC                PRC                  Japan
Sequences (b&w)    *-tbw.png (3)      *-bw.png (1,006)     *-jbw.png (50)
Shades of red      (0)                *-red.png (192)      *-jred.png (8)
Animations         *-torder.gif (8)   *-order.gif (300)    *-jorder.gif (8)


= = = =
Long term:
So for me, progress toward a multi-country system is a long-term issue. Go slowly. What is needed is to:

* Continue on Japanese only for now ;
* Think calmly about the possibility of a multi-country system :
* * naming (<hanzi T>;<hanzi M>;...) ;
* * sources ;
* * which modifications to make to the KanjiVG wiki ;
* * new problems ? (risk of mixing data)
* Go / Wait / Kill decision.

Then:
* Implement the multi-country approach.
* Look on Commons SOP for the characters with diverging stroke order.
* Create variants : we have <kanji>王</kanji>, the Chinese order is different, so we create <hanzi T>王</hanzi T>.
* Leave identical ones empty : when we have <kanji>三</kanji>, don't create <hanzi T>三</hanzi T> or <hanzi S>三</hanzi S>, since they have the same stroke order.
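
The "create a variant only where the order differs" idea in the list above amounts to a lookup with a fallback. Here is a minimal Python sketch of that logic. The region letters follow the J/T convention mentioned earlier, but the table layout, the function name, and the stroke sequences are placeholders invented for the example; they are not verified stroke orders.

```python
# The Japanese entry acts as the base; other regions are stored only if they differ.
BASE_REGION = "J"

# Placeholder stroke sequences ("s1", "s2", ...), not real stroke-order data.
STROKE_ORDERS = {
    ("王", "J"): ["s1", "s2", "s3", "s4"],
    ("王", "T"): ["s1", "s3", "s2", "s4"],  # variant stored because it differs
    ("三", "J"): ["s1", "s2", "s3"],        # no "T" entry: same order as the base
}


def stroke_order(char, region):
    """Return the regional stroke order, falling back to the base entry."""
    variant = STROKE_ORDERS.get((char, region))
    if variant is not None:
        return variant
    return STROKE_ORDERS[(char, BASE_REGION)]


if __name__ == "__main__":
    print(stroke_order("王", "T"))  # uses the stored variant
    print(stroke_order("三", "T"))  # falls back to the Japanese entry
```

One consequence of this design is that a missing variant is ambiguous: it can mean "same as the base" or "not entered yet", which is related to the risk of mixing data noted in the list.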

= = = =
By the way:
Is there a KanjiVG wiki page where some key points on this long-term "multi-country" possibility could be put? > Sources, approach, etc.

Regards,



Ben Bullock

May 24, 2010, 11:09:46 AM
to kan...@googlegroups.com
I don't really care to comment too much on the licence issues and so on, but if people are interested in creating a Chinese version of KanjiVG, a good starting point would be the data from the Tomoe project. It contains a huge file of Chinese-style kanji which was contributed by some Chinese software people, I think. The Tomoe data is just straight lines, but it can be used as a starting point for creating the vector graphics.

Of course, the problem with the Japanese Tomoe data was that it was entered very carelessly, unlike KanjiVG which is very carefully done. I have not examined the Chinese Tomoe data, but even if it is quite full of errors it would be a better starting point than "nothing at all".

Alexandre Courbot

May 25, 2010, 9:21:22 AM
to kan...@googlegroups.com
Hi guys, nice discussion that is taking place here.

> It seems that the following is true:
>
> KanjiVG is a level of abstraction higher than GlyphWiki (which
> provides among other direct SVG paths), because it sorts the
> elementary paths in an ordered tree-structure of stroke groups. In
> this way, the commons project, even though intended to be a collection
> of cascade SVG data, seems to be even less abstract than GlyphWiki,
> because it provides various image bitmaps, which can be seen as an
> interpretation of SVG paths. And the KanjiVG adds the ordered tree-
> structure of groups as a level of abstraction.

That is correct. GlyphWiki is definitely a step in the right
direction; it even (somehow) has structured information. However, it
seems to care only about rendering kanji correctly, and does not care
about stroke order or count. For instance, in GlyphWiki 口 has 4
strokes, although it is written with 3.

> Ordered, because each stroke group on the same tree level needs to be
> drawn in the order provided. Tree-structure, because each stroke group
> can be defined by more basic stroke groups. The nodes of this tree are
> elementary strokes.

That's a good description of the format, indeed.

> However, in KanjiVG the elementary strokes are not defined globally.
> Actually, each character includes the "most" elementary strokes in its
> definition, by defining their SVG path as in the SVG specification.
> So, there is redefinition of each elementary stroke in each character.
> The benefit is probably to assure the strokes are positioned correctly
> relative to each other. Also the export to SVG is easy, as it seems to
> be sufficient to strip all stroke group information, just leaving the
> elementary SVG stroke paths.

It is probably possible to factor out a good part of the data by
allowing cross-references to component characters, at least for the
structural description. The SVG paths would be harder to represent
correctly this way, as variations exist when some characters are
embedded within others. Matrix transformations could probably help
here, but I am not sure how this could be done. The current design is
satisfactory, but sometimes when we fix a mistake in a kanji we have to
go through all the compounds in order to apply the fix to them too.
Factoring out shared data would definitely make maintenance easier.
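
One way the cross-reference plus matrix transformation idea could look is sketched below in Python. This is purely an assumption-laden illustration, not how KanjiVG works: the `COMPONENTS` table, the made-up compound, and the polyline coordinates are invented, and strokes are reduced to point lists for simplicity. The six-number matrix follows the convention of SVG's `matrix(a b c d e f)`, where x' = a·x + c·y + e and y' = b·x + d·y + f.

```python
def apply_affine(matrix, points):
    """Apply an affine transform, given as SVG-style (a, b, c, d, e, f), to points."""
    a, b, c, d, e, f = matrix
    return [(a * x + c * y + e, b * x + d * y + f) for x, y in points]


# A component defined once, as polylines on a full-size canvas (placeholder data).
COMPONENTS = {
    "二": [[(20, 35), (90, 35)], [(10, 80), (100, 80)]],
}

# A made-up compound that references the component twice: squeezed into the
# left half, then squeezed and shifted into the right half.
COMPOUND = [
    ("二", (0.5, 0, 0, 1, 0, 0)),
    ("二", (0.5, 0, 0, 1, 50, 0)),
]


def expand(compound):
    """Resolve component references into concrete strokes, in drawing order."""
    strokes = []
    for name, matrix in compound:
        for stroke in COMPONENTS[name]:
            strokes.append(apply_affine(matrix, stroke))
    return strokes


if __name__ == "__main__":
    for stroke in expand(COMPOUND):
        print(stroke)
```

With such references, a correction made to a component would propagate to every compound that uses it. The caveat raised above still applies: a plain affine transform cannot express the cases where a component's shape actually changes when it is embedded in another character, so some per-compound overrides would still be needed.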

> This is a really good design I think, and it would be really great to
> extend this specification for multi-country use, specifically i think
> of Han characters.

Technically, nothing speaks against this, and I think Ulrich wants the
project to head in this direction in the future. All we need is
people who will do it. ;)

> Besides the technical specification, unfortunately the license as to
> be addressed. I wonder if for the general scope of multi-country
> stroke orders the Commons SOP with their license is not the better
> place to collect all this information (no offense please).

This is a recurrent question. I'll talk more about it later.

> For this multi-country stroke orders collection to be made, I think it
> would be fair to apply the same Commons SOP licence, which apparently
> doesn't include the share-alike feature. If i understand correctly it
> would be permissible (and the practical approach) to transform the
> stroke order information encoded in the bitmaps of the Commons SOP to
> this new format (while stating the source, licence, authors) even if
> the share-alike feature is added. But it would seem better and fair to
> not add this restriction to the license, just because of the vast data
> this potentially would bring in -- the objective would be to have at
> least all Commons SOP stroke orders also in this new specification.

Indeed in those conditions it would make sense, but I don't really see
how the stroke order information could accurately be extracted from
the Commons' bitmaps.

> Indeed, a program would have difficulties transforming the graphics to
> XML data, but humans should be able to do it, recognizing the
> different components of the character given the order of all paths and
> the image of the character -- and the paths of the GlyphWiki project?
> --. There are roughly 1300 (assuming duplicates) characters to be
> processed…

Same problem with GlyphWiki: although I like their representation,
the fact that their goals are different makes it difficult to reuse
their data for anything other than validation.

> Depending on the license choice here, the Commons SOP could actually
> be the natural place to add the specification of a multi-country text-
> based ordered tree-structure stroke order description (because it
> would be essentially only a format transformation of their data). I
> don't see it so brutally as that they would have to start from
> scratch. It would be more the addition of a format. Moreover, I assume
> much more visibility of the Commons SOP which should result in a
> faster encoding of stroke orders. All these tedious choices should be
> irrelevant if the two projects had identical licenses, right?

If we compare the amount of data, KanjiVG seems to have much more than
the Commons, and most of it probably overlaps anyway.

More to come in my reply to Hugo's mail.

Alex.

Christoph Burgmer

May 25, 2010, 9:52:40 AM
to kan...@googlegroups.com
Hi everybody

I am missing a few names in this discussion:

We have:
- GlyphWiki
- KanjiVG
- Commons Stroke Order project
- Tomoe
- Tegaki
- cjklib (+ CharacterDB)

Every project has some overlap with at least one of the others. I won't
give a short introduction to each of them now; I probably wouldn't even
capture the full mission of each and would miss some important points.

The fact is that every project has its particular goal and sharing doesn't
really occur. Sharing might be worthwhile in the future, but I guess most of
the projects have intermediate goals that need to be addressed first. Anyhow.

As I'm active in the latter two projects, here's another topic that I believe
plays into this discussion. Tegaki is the up-to-date "version" of Tomoe,
which comes with sources for Japanese and Simplified Chinese. Tegaki adds
partial support for Traditional Chinese through an automatic process built on
top of cjklib. The Chinese data, though, has some issues wrt stroke count and
stroke order. KanjiVG seems to have pretty solid data here. I tried to bridge
between Tegaki, which has path information, and cjklib, which has abstract
stroke and component data. The goal was to move towards data similar to
KanjiVG's, so if you are looking in this direction you might want to keep an
eye on it.

So I believe my point here is: before you start another project, you might want
to have a look at what the others provide and what future work can build
upon.

Just a few cents from my side
-Christoph

Alexandre Courbot

May 25, 2010, 10:12:52 AM
to kan...@googlegroups.com
> Personally, after 5 years working on the Commons SOP project (CC-by), I now prefer the Share-Alike like Kanjivg (CC-by-sa). We share, they share. That's cross help, that's more fair.

That is my opinion too. Being open-source does not prevent you from making money from your software. IMHO KanjiVG is a one-of-a-kind resource, and a work that literally took hundreds of hours to compile; all that is asked of the people who use it is a fair return to the community.

Of course, in the real world there will always be people who happily ignore the licence and don't respect it; still, this is no reason to back off.

> For getting the stroke order data from Commons SOP, I don't get it : this project worked in bitmap, there is not automatically extractable data.
> There is only 1300 entries on Commons SOP, while Kanjivg have +6000 (if I'm right). Since most stroke order are identical between all countries, variants to input to get a multi-country coverage is likely just 5~10%. Not a big deal.

KanjiVG also includes variants for many kanji (they are not in the release file, but can be checked out from git). I'm no expert, so I'm speaking without really knowing, but maybe they cover international variations?

> Long term:
> So for me, progress toward multi-countries system is a long term issue. Go slowly. It is need to:
>
> * Continue on Japanese only for now ;
> * Think calmly about the possibility of multi-countries system :
> * * naming (<hanzi T>;<hanzi M>;...) ;
> * * Sources ;
> * * which modifications of the kanjivg's wiki.
> * * new problematics ? (risk of mixing datas)
> * Go / Wait / Kill decision.
>
> Then:
> * Implement the multi-country approach.
> * Look on Commons SOP for the characters with diverging stroke order.
> * Create variants : <kanji>王</kanji>, Chinese is different > so we create <hanzi T>王</hanzi Ti>.
> * Keep identical empty : when we have <kanji>三</kanji>, don't create <hanzi T>三</hanzi T>, <hanzi S>三</hanzi Si> since they have the same stroke order.

Indeed, there is no need to hurry. The base is already solid and covers many Chinese characters that are shared with Japanese.

What I want to stress is that if an effort (be it from the Commons team or anyone else) is started to cover international variations, I'm of course all for it. KanjiVG focuses on Japanese characters because (1) that's how it started and (2) I happen to be the de facto maintainer for now and therefore steer the project towards my needs (providing kanji information for Tagaini Jisho). But if someone starts contributing international variations, I have no reason to reject them, and I am even willing to help the effort.

Also, although I'm fulfilling the task for now, I don't feel I'm the right person to maintain the project: I'm a programmer, but I don't have the knowledge of kanji necessary to correctly fix mistakes and decide what should be done. Ulrich is the right person for that, but he seems to be busy and therefore distant lately. In the future I'd like to stick to what I can do (design the file format, write the libraries and the code the website needs to run) and see someone more competent take the lead. The project also deserves a better website.

To sum up, KanjiVG is an open-source project, and those who contribute to it have a say in its direction. It's open to all meaningful contributions, and if people want to get more involved in it, I'll be happy to spend my time writing documentation and editing tools that will be helpful in the long term instead of just keeping the thing online. ;)

By the way:
Is there a KanjiVG wiki page where we could put some key points on this long-term "multi-country" possibility (sources, approach, etc.)?

Not yet, but if you want to write about it I can give you the edit password.

Alex.

Alexandre Courbot

May 25, 2010, 10:19:29 AM
to kan...@googlegroups.com
> As I'm active in the latter two projects here's another topic that I believe
> plays into this discussion here. Tegaki is the up to date "version" of Tomoe,
> which comes with sources for Japanese and Simplified Chinese. Tegaki adds
> partial support for Traditional Chinese due to an automatic process built on
> top of cjklib. Chinese data though has some issues wrt stroke count and stroke
> order. KanjiVG seems to have pretty solid data here. I tried to bridge between
> Tegaki which has path information and cjklib which has abstract stroke and
> component data. The goal was to go towards similar data in KanjiVG so if you
> are looking into this direction you might want to keep an eye on it.

Looking forward to seeing KanjiVG used in Tegaki - actually I thought
this was already the case.

I suppose you are already aware of it but Ben wrote a recognizer on
top of KanjiVG's data:
http://kanji.sljfaq.org/kanji16/draw-canvas.html

I guess Tegaki's internals are different, but this is AFAIK the first
time KanjiVG's data is used for that purpose and it seems to work
rather well.

> So I believe my point here is: before you start another project you might want
> to have a look on what the others provide and where future work can build
> upon.

I feel like I'm missing your point - are you targeting a particular
project? I don't remember anyone talking about starting a new project
on this thread.

Alex.

Christoph Burgmer

May 25, 2010, 10:29:21 AM
to kan...@googlegroups.com
Am Dienstag, 25. Mai 2010 schrieb Alexandre Courbot:
> > So I believe my point here is: before you start another project you might
> > want to have a look on what the others provide and where future work can
> > build upon.
>
> I feel like I'm missing your point - are you targeting a particular
> project? I don't remember anyone talking about starting a new project
> on this thread.

From Lutz' initial post I assumed he was suggesting something new. It seems
he wasn't talking about a new project, though.

KanjiVG can be used in Tegaki, just not that easily yet. As far as I know, the
converter has not been officially included yet.

A side note: I like the way KanjiVG offers the data. It connects dimensions
that are handled independently in other projects. You should consider offering
exchange over the CDL format. From what I see you have a compatible design.

-Christoph

Dr. Ulrich Apel

May 25, 2010, 2:19:56 PM
to kan...@googlegroups.com
Hi everybody,

thanks a lot for the very interesting discussion. I am very sorry that I am so slow in answering.

I have looked at the Commons Stroke Order Project several times and thought that the animations look very nice and very natural, but also that it must take an incredible amount of time and work, because every stroke has several pictures. Now, to receive praise from Hugo Lopes for KanjiVG is a big honour for me.

Coordinating such a project is pretty difficult for me because, as Alex said, I am very busy. I guess I can give some input, and I hope there will be times when I can work more on KanjiVG again. Anyway, I am pretty confident about the progress of the project. In class at Tübingen University, the students and I are now working on character data again. We are building up a new generation of lexicographers, and we are making plans to get funding too. I guess we are on the right track.

There were concerns that KanjiVG would exclude other characters like Hanzi etc. Actually, the current data contains kana and Romaji/Latin characters, and they should have the tag <kanji>, too. It mainly just means "head character," and most characters are in fact kanji. Exclusion was not the aim.

There is a project called CDL "character description language". It belongs to the Wenlin dictionary and only deals with Chinese characters. A name like "CharacterVG" might have caused confusion.

Then there were other reasons to concentrate on Japanese first: Japanese has canonical character glyphs in the schoolbook / kyokasho fonts. There is a canonical stroke order in the Mombusho book from 1952, which Hugo also mentioned. I got a little funding from a Japanese-German organisation. I am a Japanologist. Julien Quint and I presented KanjiVG at SVG Open in Tokyo. We tried to cover kanji first. Etc.

The child just needed a reasonably handy name, and so it became KanjiVG. If this causes international problems, one should think about changing the name or also allowing project naming variants like HanziVG and so on. I think new tags aren't really necessary at this moment.

Alex mentioned the data on stroke order and glyph variants. This should pretty much cover the traditional Chinese Kaisho/block character writing style. So, in fact, the project should already be a multi-country project. Most of these Kaisho variants were generated semi-automatically, an approach one should also apply to most characters of simplified Chinese. Getting working data shouldn't be difficult, but it would need an aesthetic revision later.

A big problem is naming the stroke order and the glyph variants. This seems to be one of the main reasons why the variant data is not in the official release. For calligraphy in Japan, several stroke orders might be OK. Probably the same is true for the other countries. So, one needs an approach that starts from the glyph and its stroke order and not from the country. One might write an extra file stating which stroke order and glyph are considered correct in which country, which school of calligraphy and so on.
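
As a rough sketch of that "extra file" idea, assuming nothing about the eventual naming convention: the variants get neutral, glyph-centric IDs, and a separate table records where each one is considered correct. All IDs, context names and assignments below are invented placeholders, written in Python only for illustration.

```python
# Variants are identified by neutral, glyph-centric IDs (hypothetical names).
VARIANTS = {
    ("王", "v1"): "stroke-data-placeholder-1",
    ("王", "v2"): "stroke-data-placeholder-2",
}

# The separate "extra file": which variant is accepted in which context
# (a country, a school of calligraphy, ...). Assignments are made up.
ACCEPTED = [
    ("王", "v1", "context-A"),
    ("王", "v2", "context-B"),
    ("王", "v1", "context-C"),
]


def variants_for(char, context):
    """Return the stroke data of every variant accepted in the given context."""
    return [VARIANTS[(c, v)] for c, v, ctx in ACCEPTED if c == char and ctx == context]


def contexts_for(char, variant):
    """Return the sorted contexts in which the given variant is accepted."""
    return sorted(ctx for c, v, ctx in ACCEPTED if c == char and v == variant)
```

Because the acceptance table is kept apart from the glyph data, one variant can be listed for several contexts without being duplicated, and several variants can be accepted in the same context.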

I don't have a solution yet for the naming problem. I will discuss it with Roger, who is also a member of this mailing list, and other colleagues at Tübingen. I am planning to be in Tokyo in August. Perhaps I can meet with Alex, and we can finalize a naming convention.

I was discussing with several people to use the KanjiVG approach also on Hangul. There seem to be no reasons not to do so, but to do it right would take quite some time.

It seems that I am getting involved in a project that has to deal with hentai-gana and even Egyptian hieroglyphs. Perhaps this will lead to a more general character description language based on SVG.

I am very much interested in the exchange with other projects. Cross-checking the sanity of the data should be very much possible. Unfortunately, I don't have time to really get into the details of other projects. If you think you could use some advice from me on something, please feel free to contact me, but don't expect me to be too active or to follow other projects closely.

Ulrich

Roger Braun

May 25, 2010, 2:32:39 PM
to kan...@googlegroups.com
On Tue, May 25, 2010 at 4:19 PM, Alexandre Courbot <gnu...@gmail.com> wrote:
> Looking forward to seeing KanjiVG used in Tegaki - actually I thought
> this was already the case.
>
> I suppose you are already aware of it but Ben wrote a recognizer on
> top of KanjiVG's data:
> http://kanji.sljfaq.org/kanji16/draw-canvas.html
>
> I guess Tegaki's internals are different, but this is AFAIK the first
> time KanjiVG's data is used for that purpose and it seems to work
> rather well.

I converted the KanjiVG data for use with Tegaki some time ago. These
were the first test results:

"Using KVG data
roger@amida:~/programming/hwr/tegaki-lab$ ./mmanager svm eval
Running 'eval'...
match1: 79.0%
match5: 97.0%
match10: 97.0%
Done.

Using Tomoe data
roger@amida:~/programming/hwr/tegaki-lab$ ./mmanager svm eval
Running 'eval'...
match1: 76.0%
match5: 92.0%
match10: 93.0%
Done.

Using both
roger@amida:~/programming/hwr/tegaki-lab$ ./mmanager svm eval
Running 'eval'...
match1: 88.0%
match5: 96.0%
match10: 99.0%
Done."

KanjiVG is definitely a very useful set of data for hwr purposes.
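
For readers unfamiliar with the figures above, here is a small Python sketch of how such matchN rates can be computed. It assumes that matchN means "the correct character is among the first N candidates returned by the recognizer"; that reading, the `match_rate` helper and the toy data are illustrative assumptions, not taken from tegaki-lab.

```python
def match_rate(results, n):
    """Percentage of samples whose true label is among the top-n candidates.

    `results` is a list of (true_label, ranked_candidates) pairs.
    """
    hits = sum(1 for truth, candidates in results if truth in candidates[:n])
    return 100.0 * hits / len(results)


# Toy data, made up for illustration only.
results = [
    ("日", ["日", "目", "白"]),
    ("目", ["日", "目", "白"]),
    ("白", ["日", "目", "自"]),
    ("月", ["月", "用", "目"]),
]

if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"match{n}: {match_rate(results, n):.1f}%")
```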

--
Roger Braun
http://rbraun.net
roger...@student.uni-tuebingen.de

Mathieu Blondel

May 25, 2010, 8:57:40 PM
to kan...@googlegroups.com
On Wed, May 26, 2010 at 3:32 AM, Roger Braun
<roger...@student.uni-tuebingen.de> wrote:
> On Tue, May 25, 2010 at 4:19 PM, Alexandre Courbot <gnu...@gmail.com> wrote:
>> Looking forward to seeing KanjiVG used in Tegaki - actually I thought
>> this was already the case.
>>
>> I suppose you are already aware of it but Ben wrote a recognizer on
>> top of KanjiVG's data:
>> http://kanji.sljfaq.org/kanji16/draw-canvas.html
>>
>> I guess Tegaki's internals are different, but this is AFAIK the first
>> time KanjiVG's data is used for that purpose and it seems to work
>> rather well.

OK, together with the hiragana and katakana models, let's use KanjiVG
as the main Japanese model in the next Tegaki release! I think I'll just
include your Ruby script in the model folder.

During the development of a recognition engine, KanjiVG is in my
opinion a more appropriate test set than Tomoe. So Tomoe can be used
as the training set and KanjiVG as the test set during the development
phase, and KanjiVG (+Tomoe) can be used as the training set for the
released model, once the engine has been developed.
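
The split described above could be written down roughly as follows. This is only a sketch in Python: `plan_datasets` and the phase names are hypothetical, and the sample lists stand in for real Tomoe and KanjiVG data.

```python
def plan_datasets(phase, tomoe, kanjivg):
    """Return (training_set, test_set) for the given phase.

    development: train on Tomoe, evaluate on KanjiVG.
    release:     train on KanjiVG plus Tomoe, with no held-out test set.
    """
    if phase == "development":
        return list(tomoe), list(kanjivg)
    if phase == "release":
        return list(kanjivg) + list(tomoe), []
    raise ValueError(f"unknown phase: {phase!r}")


if __name__ == "__main__":
    tomoe = ["tomoe-sample-1", "tomoe-sample-2"]  # placeholder samples
    kanjivg = ["kvg-sample-1"]                    # placeholder samples
    print(plan_datasets("development", tomoe, kanjivg))
    print(plan_datasets("release", tomoe, kanjivg))
```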

Mathieu

Alexandre Courbot

May 28, 2010, 3:46:23 AM
to kan...@googlegroups.com
> The child just needed a reasonably handy name, and so it became KanjiVG. If this causes international problems, one should think about changing the name or also allowing project naming variants like HanziVG and so on. I think new tags aren't really necessary at this moment.

I like the name variation idea. Usually people who need character descriptions want them for a given language only, so it makes sense to separate the releases.
 
> I don't have a solution yet for the naming problem. I will discuss it with Roger, who is also a member of this mailing list, and other colleagues at Tübingen. I am planning to be in Tokyo in August. Perhaps I can meet with Alex, and we can finalize a naming convention.

That would be great - please keep me informed about your visit!
 
> It seems that I am getting involved in a project that has to deal with hentai-gana and even Egyptian hieroglyphs. Perhaps this will lead to a more general character description language based on SVG.

Wow - I know nothing about hieroglyphs, but is something like a stroke order involved? Could they be represented by SVG paths?

Alex.
