KanjiVG XML Working Format Proposal

Lee Hericks

Jan 22, 2015, 9:02:37 PM
to kan...@googlegroups.com, ulric...@uni-tuebingen.de, gnu...@gmail.com
Introductions
- Why change format?
  Because the current format is limited by its technology. Multiple SVG group tags are created to cram in multiple radicals and other information.
  Therefore, a more extensible, easy to parse format is desired.

- Ok, describe this new proposed format really simply!
  a) It's in XML, and we can use a DTD to help verify it's correct.
  b) Each character has a structure and metadata about that structure. 
     The structure consists of stroke data and an ideographic description of the kanji.
     
  WAIT, Ideowhat?
  
    b2) An ideographic description is a fancy word for "describe the kanji parts and where they are positioned"
        The proposed format breaks up the duty of telling position and telling the group of strokes into two elements,
        <ideographic-descriptor> and <stroke-group>. This simplifies parsing because it makes it very clear how to process each tag.
        
        BUUUUUUT, it's still verbose, what about ⿱亠⿱口⿱冖丁 for 亭?
        
        b3) We still need to add information about which strokes:
            ⿱亠{s1,s2}⿱口{s3,s4,s5}⿱冖{s6,s7}丁{s8,s9}
            Which cross ref:
            ⿱亠{s1,s2|kvg:04ea0}⿱口{s3,s4,s5|kvg:053e3}⿱冖{s6,s7|kvg:05196}丁{s8,s9|kvg:04e01}
            and possibly other information.
            Oh and some parts like the far right of 術 don't seem to have a literal character?! 
            XML is wordier but easier to parse. 
            
  Check out kanjivg.dtd and the samples!
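
For comparison, the compact notation from b3 is not hard to parse either. Below is a minimal recursive-descent sketch in Python. It assumes exactly the brace syntax shown above ({strokes|cross-ref}) and the usual operand counts for the Ideographic Description Characters. `Node` and `parse_ids` are hypothetical names used only for illustration; they are not part of KanjiVG.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Ideographic Description Characters, grouped by how many operands they take.
BINARY_IDC = set("⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻")
TERNARY_IDC = set("⿲⿳")


@dataclass
class Node:
    literal: str                                  # an IDC or a component character
    strokes: List[str] = field(default_factory=list)
    ref: Optional[str] = None                     # cross reference, e.g. "kvg:04ea0"
    children: List["Node"] = field(default_factory=list)


def _parse(text: str, pos: int) -> Tuple[Node, int]:
    if pos >= len(text):
        raise ValueError("unexpected end of description")
    node = Node(literal=text[pos])
    pos += 1
    if node.literal in BINARY_IDC or node.literal in TERNARY_IDC:
        # Operators are followed by two or three sub-descriptions.
        for _ in range(3 if node.literal in TERNARY_IDC else 2):
            child, pos = _parse(text, pos)
            node.children.append(child)
    elif pos < len(text) and text[pos] == "{":
        # Optional annotation on a component: {s1,s2|kvg:04ea0}
        end = text.index("}", pos)
        strokes, _, ref = text[pos + 1:end].partition("|")
        node.strokes = [s.strip() for s in strokes.split(",") if s.strip()]
        node.ref = ref.strip() or None
        pos = end + 1
    return node, pos


def parse_ids(text: str) -> Node:
    node, pos = _parse(text, 0)
    if pos != len(text):
        raise ValueError(f"trailing input at offset {pos}")
    return node


tree = parse_ids("⿱亠{s1,s2|kvg:04ea0}⿱口{s3,s4,s5|kvg:053e3}"
                 "⿱冖{s6,s7|kvg:05196}丁{s8,s9|kvg:04e01}")
```

The sketch has no answer for components without a literal character (the 術 case) or for "other information", which is part of the argument for XML.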

Goals
1) Clearly organize/mark/explain all data
2) Build proper tools/site for the project. Imagine: [image not preserved]

Parsing this proposed XML format
a) Parse the character structure into a collection of strokes and groups mapping to those strokes.
b) Parse meta-data which references strokes and groups. Resolve group references to the appropriate strokes.
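
Steps a) and b) might look roughly like this in Python. This is only a sketch: the element names come from the proposal, but the sample document, its nesting and the `char` attribute are assumptions made up for illustration, and the attached sample files may differ. `parse_character` and `resolve` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample for 明. Element names follow the proposal; the exact
# attributes and nesting are assumptions, not the contents of 0660e.xml.
SAMPLE = """
<character id="0660e" literal="明">
  <structure>
    <stroke-data>
      <stroke id="s1"/><stroke id="s2"/><stroke id="s3"/><stroke id="s4"/>
      <stroke id="s5"/><stroke id="s6"/><stroke id="s7"/><stroke id="s8"/>
    </stroke-data>
    <ideographic-description>
      <layout-descriptor char="⿰">
        <stroke-group id="g1" literal="日">
          <stroke-ref id="s1"/><stroke-ref id="s2"/>
          <stroke-ref id="s3"/><stroke-ref id="s4"/>
        </stroke-group>
        <stroke-group id="g2" literal="月">
          <stroke-ref id="s5"/><stroke-ref id="s6"/>
          <stroke-ref id="s7"/><stroke-ref id="s8"/>
        </stroke-group>
      </layout-descriptor>
    </ideographic-description>
  </structure>
</character>
"""


def parse_character(xml_text):
    """Step a): collect stroke ids in document order and map groups to strokes."""
    root = ET.fromstring(xml_text)
    strokes = [s.get("id") for s in root.iter("stroke")]
    groups = {
        g.get("id"): [r.get("id") for r in g.iter("stroke-ref")]
        for g in root.iter("stroke-group")
    }
    return strokes, groups


def resolve(ref, strokes, groups):
    """Step b): turn a stroke or group reference into a list of stroke ids."""
    if ref in groups:
        return list(groups[ref])
    if ref in strokes:
        return [ref]
    raise KeyError(f"unknown reference: {ref}")


strokes, groups = parse_character(SAMPLE)
```

Metadata entries would then call `resolve` on whatever ids they carry, so a reference to g2 and a reference to s5 are handled the same way.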

General comments / decisions made for this proposal thus far
- Avoiding kvg xml namespace for readability and the unlikeliness of conflicts. 
  Character ids are prefixed with kvg: so if you want to refer to stroke 5 of 明, 
  you do kvg:0660e-s5 just as you would have previously seen in the SVG version. 
  You can think of that as a path to s5 in the character with id kvg:0660e.

- From a Japanese language school perspective, KanjiVG characters are sometimes broken down to meaninglessly small chunks that have no actual linguistic meaning. 
  ex. 行{彳{㇒,亻}, …} -- 亻 is not a component of 彳.
  Please tell me if there is a machine/programming reason for this. We either:
  a) clean this up to an appropriate linguistic level
  b) clearly mark the groups representing linguistic components of the kanji
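
Regarding the first comment above, reading kvg:0660e-s5 as a path into a character can be sketched in a few lines of Python. The pattern assumes five lowercase hex digits and a local id made of "s" or "g" plus a number, which matches the examples in this post but is otherwise an assumption. `StrokeRef` and `parse_ref` are hypothetical names.

```python
import re
from typing import NamedTuple


class StrokeRef(NamedTuple):
    codepoint: str   # zero-padded hex, e.g. "0660e"
    character: str   # the character itself, e.g. "明"
    local_id: str    # id inside that character, e.g. "s5"


# Assumed shape: "kvg:" + 5 hex digits + "-" + s<number> or g<number>.
_REF = re.compile(r"^kvg:([0-9a-f]{5})-([sg][0-9]+)$")


def parse_ref(ref: str) -> StrokeRef:
    match = _REF.match(ref)
    if match is None:
        raise ValueError(f"not a KanjiVG reference: {ref!r}")
    code, local_id = match.groups()
    return StrokeRef(code, chr(int(code, 16)), local_id)
```

For example, `parse_ref("kvg:0660e-s5")` yields the code point 0660e, the character 明 and the local id s5.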


Clarifications
- Why are both the Ideographic Description Character and its Unicode value included in layout-descriptor tags?
  They are both included for choice in parsing, but unicode may be dropped upon consensus.

- Where did "kvg:element" go?
  Any place where we are choosing the most visually accurate character is called a literal.
  Therefore, for 明,  <g id="kvg:0660e-gX" kvg:element="日"> is <stroke-group id="g1" literal="日">.
  We do this because we are separating out and clearly defining an ideographic description of the character.
  
- But how will I quickly see this character?
  That's a good question. The extensibility of this draft format has a clear advantage moving forward, but will require an editing tool.

Missing
- The 高 partially existing in 亭 is an entry about the origin of the kanji(字源「じげん」). It should have some appropriately named tag in the metadata section.
- kvg:tradForm attribute is only used in 09ed9.svg, 066f2.svg, 08208.svg
- kvg:radicalForm attribute is only used in 05101.svg, 0658e.svg, 07962.svg, 07f86.svg, 09f21.svg

References
- Unicode 7.0 Ideographic Description Characters (18.2)
- Unicode 7.0 CJK Strokes (Not yet checked against KanjiVG stroke types)

Attachments
- 04ead.xml
- 066f2.xml
- 080fd.xml
- 0660e.xml
- 08853.xml
- kanjivg.dtd

msk...@ansuz.sooke.bc.ca

Jan 23, 2015, 3:12:54 AM
to kan...@googlegroups.com, ulric...@uni-tuebingen.de, gnu...@gmail.com
On Thu, 22 Jan 2015, Lee Hericks wrote:
>         b3) We still need to add information about which strokes:
>             ⿱亠{s1,s2}⿱口{s3,s4,s5}⿱冖{s6,s7}丁{s8,s9}
>             Which cross ref:
>            ⿱亠{s1,s2|kvg:04ea0}⿱口{s3,s4,s5|kvg:053e3}⿱冖{s6,s7|kvg:05196}丁{s8,s9|
> kvg:04e01}
>             and possibly other information.

You might like to look at IDSgrep's extended ideographic description
sequences. CHISE-IDS, its derivative CJKVI, and Pomax's "indigo" database
(not well-known; it's at
http://pomax.nihongoresources.com/index.php?entry=1225052300 )
also extend IDS. These all extend IDS in slightly different ways, for
different purposes, but it's not hard to convert between them.

Keep in mind that sometimes a stroke is part of more than one component,
or part of a stroke is, as in the top of 冏 inside 商.

>             Oh and some parts like the far right of 術 don't seem to have a
> literal character?! 

That's a general issue with all such databases - some components do not
have Unicode code points, and some Unicode code points correspond to more
than one picture depending on language. IDS-derived databases usually
allow some nodes to just not have a code point designated. Some try to
assign names or IDs to every node regardless; that's typical of XML.

> - Avoiding kvg xml namespace for readability and the unlikeliness of
> conflicts. 

This seems like a surprising decision, and it will annoy people who are
serious about XML. It may also make it harder to use existing XML editing
software.

> - Why are both the Ideographic Description Character and its Unicode value
> included in layout-descriptor tags?
>   They are both included for choice in parsing, but unicode may be dropped
> upon consensus.

If you include both, it'd be desirable to have tests that verify they
match.
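
Such a test could be as small as the following Python sketch. It assumes the character and its hexadecimal code point have already been read from the file (which attributes hold them is not specified here) and uses the Unicode 7.0 range U+2FF0 to U+2FFB for the Ideographic Description Characters. `check_layout` is a hypothetical name.

```python
# Ideographic Description Characters in Unicode 7.0: U+2FF0 .. U+2FFB.
IDC_FIRST, IDC_LAST = 0x2FF0, 0x2FFB


def check_layout(char, codepoint_hex):
    """Return a list of problems; an empty list means the pair is consistent."""
    problems = []
    if len(char) != 1 or not (IDC_FIRST <= ord(char) <= IDC_LAST):
        problems.append(f"{char!r} is not an Ideographic Description Character")
    elif ord(char) != int(codepoint_hex, 16):
        problems.append(
            f"{char!r} is U+{ord(char):04X}, but the file says U+{codepoint_hex.upper()}"
        )
    return problems
```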

> - Where did "kvg:element" go?
>   Any place where we are choosing the most visually accurate character is
> called a literal.

"Visually accurate" is tricky because Unicode very firmly defines
characters to be *abstract linguistic things* and not *pictures* - and
this issue is especially significant in Han script although it occurs even
in English. For instance, the letter "a" has two forms that differ more
than many pairs of kanji that people consider different. If you want to
use Unicode numbers to describe pictures, you have to say which language
and which font you mean, because the same code point will look quite
different depending on that.

Other points worth thinking about: when you have three things side by
side, do you call it ⿰X⿰YZ, ⿰⿰XYZ, or ⿲XYZ ? Similarly, there are
many relationships among components within characters that are not clearly
described by IDS relationships; ⿻ often becomes a catch-all for anything
that doesn't fit into the other categories.
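
One way to sidestep the first question is to normalize before comparing. The Python sketch below parses plain prefix-notation IDS and flattens nested ⿰ (or ⿱) and the ternary ⿲ (or ⿳) into one n-ary row (or column), so all three spellings compare equal. Whether the nesting should be treated as meaningless is a project decision; the sketch simply assumes it is. All function names are made up for illustration.

```python
BINARY_IDC = set("⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻")
TERNARY_IDC = set("⿲⿳")
# A ternary operator describes the same axis as its binary counterpart.
SAME_AXIS = {"⿲": "⿰", "⿳": "⿱"}


def parse(text, pos=0):
    """Parse prefix-notation IDS into nested tuples: (operator, child, ...)."""
    ch = text[pos]
    pos += 1
    if ch not in BINARY_IDC and ch not in TERNARY_IDC:
        return ch, pos                      # a plain component
    children = []
    for _ in range(3 if ch in TERNARY_IDC else 2):
        child, pos = parse(text, pos)
        children.append(child)
    return (ch, *children), pos


def flatten(tree):
    """Merge nested left-to-right (or top-to-bottom) operators into one level."""
    if isinstance(tree, str):
        return tree
    op = SAME_AXIS.get(tree[0], tree[0])
    parts = []
    for child in map(flatten, tree[1:]):
        if op in ("⿰", "⿱") and isinstance(child, tuple) and child[0] == op:
            parts.extend(child[1:])         # same axis: splice the children in
        else:
            parts.append(child)
    return (op, *parts)


def canonical(ids):
    tree, pos = parse(ids)
    if pos != len(ids):
        raise ValueError("trailing input")
    return flatten(tree)
```

With this, `canonical("⿰X⿰YZ")`, `canonical("⿰⿰XYZ")` and `canonical("⿲XYZ")` all give the same tuple, while mixed layouts such as ⿱X⿰YZ keep their nesting.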

--
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/

ospalh

Jan 23, 2015, 4:04:21 AM
to kan...@googlegroups.com, ulric...@uni-tuebingen.de, gnu...@gmail.com
Very interesting.

Some quick thoughts:

On Friday, 23 January 2015 at 03:02:37 UTC+1, Lee Hericks wrote:
(…)
Goals
1) Clearly organize/mark/explain all data
2) Build proper tools/site for the project. Imagine:
Can we please get PNGs rather than GIFs?

(…)

General comments / decisions made for this proposal thus far
- Avoiding kvg xml namespace for readability and the unlikeliness of conflicts. 
  Character ids are prefixed with kvg: so if you want to refer to stroke 5 of 明, 
  you do kvg:0660e-s5 just as you would have previously seen in the SVG version. 
  You can think of that as a path to s5 in the character with id kvg:0660e.

Actually, colons are not really allowed in XML IDs. It’s not something you usually care about, but if we rework the format anyway, maybe we could follow that rule.
(The fine print: an ID must be a non-colonized name, that is, a name without a colon.)
I think the trivial solution of replacing the colon with a hyphen-minus should work: “kvg-0660e-s5”.
(…)
- From a Japanese language school perspective, KanjiVG characters are sometimes broken down to meaninglessly small chunks that have no actual linguistic meaning. 
  ex. 行{彳{㇒,亻}, …} -- 亻 is not a component of 彳.
  Please tell me if there is a machine/programming reason for this. We either:
  a) clean this up to an appropriate linguistic level
  b) clearly mark the groups representing linguistic components of the kanji


I’m not sure about this, but the overly fine grouping may be useful for handwriting recognition. So before we throw away information, I’m somewhat in favor of b).

Lee Hericks

Jan 23, 2015, 4:11:35 AM
to kan...@googlegroups.com
Thank you Matthew and ospalh. We need to hash out all these things over time. I'm off for the weekend but will look at this more on Monday. Please feel free to post more thoughts as they occur to you. 

Lee


Lee Hericks

Jan 25, 2015, 7:27:22 PM
to kan...@googlegroups.com, ulric...@uni-tuebingen.de, gnu...@gmail.com
Ospalh,

All of this is a working draft, thanks for your feedback.

I'm not an XML expert, thank you for clarifying the ID type. What does everyone think? Is the "kvg" even necessary?

id="kvg-01234" or id="01234"

To be honest, I think we don't need to prefix it. Simply leave it as the zero-padded unicode number and if someone is looking for a unicode point they type it in and we pad a zero if needed.
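
In Python the padding rule could be sketched like this. `char_id` and `normalize_id` are hypothetical helper names, and the list of prefixes to strip is just a guess at what people might type.

```python
def char_id(ch):
    """Zero-padded, lowercase hex id for a single character, e.g. '0660e'."""
    return format(ord(ch), "05x")


def normalize_id(user_input):
    """Accept inputs like '660E', 'U+660E' or 'kvg:0660e' and return '0660e'."""
    text = user_input.strip().lower()
    for prefix in ("kvg:", "kvg-", "u+", "0x"):
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    # int() raises ValueError if what is left is not hexadecimal.
    return format(int(text, 16), "05x")
```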


As for the handwriting idea, the strokes contain the stroke type and position. In my opinion, all the "fake groups" do is add noise and assign literals which are linguistically incorrect.

Lee Hericks

Jan 25, 2015, 7:31:31 PM
to kan...@googlegroups.com, ulric...@uni-tuebingen.de, gnu...@gmail.com
Matthew,

Let me just check that we aren't misunderstanding each other here. I have no problem with declaring the DTD and using the one-time namespace declaration on the kanjivg tag, where you don't prefix the tags. But there is no requirement to have <kvg:kanjivg> <kvg:character><kvg:structure><kvg:stroke-data><kvg:stroke> etc.

IMHO it honestly looks like ass. ^^;

msk...@ansuz.sooke.bc.ca

Jan 26, 2015, 2:10:48 AM
to kan...@googlegroups.com, ulric...@uni-tuebingen.de, gnu...@gmail.com
On Sun, 25 Jan 2015, Lee Hericks wrote:
> Let me just check that we aren't misunderstanding each other here. I have no
> problem with declaring the DTD and using the one-time name space declaration
> on the kanjivg tag where you don't prefix the tags, but there is no
> requirement to have <kvg:kanjivg>
> <kvg:character><kvg:structure><kvg:stroke-data><kvg:stroke> etc.

I don't know all the ins and outs of how XML namespace declarations
work. If it's possible to do things in the XML-correct way such that the
files will pass a so-called "validating parser" and work with generic XML
tools, but the tags don't have visible prefixes on them, that should be
fine. I just get scared when I see people talking of skipping over things
that XML requires, because XML's only real advantage is that it can be
processed with generic XML tools which require full compliance with the
rules. Forfeit the ability to do that, and you end up dealing with all
the significant disadvantages of XML for no payoff.

> IMHO it honestly looks like ass. ^^;

XML looks like ass, period. If that's the criterion, we should be using
an ad hoc text format.

ospalh

Jan 26, 2015, 4:34:49 AM
to kan...@googlegroups.com, ulric...@uni-tuebingen.de, gnu...@gmail.com


On Monday, 26 January 2015 at 01:27:22 UTC+1, Lee Hericks wrote:
As for the handwriting idea, the strokes contain the stroke type and position.
Yeah, I guess we can cut out the fake groups.

Alexandre Courbot

Jan 26, 2015, 9:44:57 AM
to KanjiVG, Ulrich Apel
A few random comments from my end:

1) This is already an obvious improvement over our current format.

2) Some section names could be shorter, e.g. stroke-data could be
renamed to simply strokes. layout-descriptor -> layout, stroke-group
-> group (can we have other groups under the layout anyway?), etc.

3) I would really, really like to be able to get rid of all these
linear ids. Reordering the strokes means renumbering a good deal of the
stroke indexes, and of their references, an operation that is quite
error-prone if done manually. We need to be able to reference strokes
and groups though, so I'm afraid I am just raising a concern without
proposing an improvement here. :/

4) Like Matthew said, XML looks like ass. Yet the examples sent by Lee
are much more legible already. But at the end of the day it will come
down to how we will edit these files. If we have a proper tool that
allows us to edit all the
details in a convenient way, then the underlying storage format is
much less of a concern (and we can automate things like id generation
and renaming). This is the preferred scenario, but we are very far
from having such a tool.
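
As a rough idea of what automated id renaming could look like, here is a Python sketch that renumbers strokes by document order and updates the references. It assumes <stroke> and <stroke-ref> elements that both carry an `id` attribute, as in the posted samples; `renumber` is a hypothetical helper, not an existing KanjiVG tool.

```python
import xml.etree.ElementTree as ET


def renumber(character):
    """Rename strokes to s1..sN in document order and fix up stroke-refs.

    Returns the old-id -> new-id mapping so callers can update other files.
    """
    mapping = {}
    for n, stroke in enumerate(character.iter("stroke"), start=1):
        new_id = f"s{n}"
        mapping[stroke.get("id")] = new_id
        stroke.set("id", new_id)
    for ref in character.iter("stroke-ref"):
        ref.set("id", mapping[ref.get("id")])
    return mapping
```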

If on the other hand we want to be able to edit the files using a
simple text editor, then legibility and ease of editing become
top priorities, and a purpose-built format/language would be the best fit.
The basic layout can be done using XML though.

It is difficult to come up with a silver bullet though - things don't
need to be 100% perfect. I will welcome any incremental improvement,
although I'm sure we are aiming at more here.

Lee Hericks

Jan 28, 2015, 3:42:04 AM
to kan...@googlegroups.com, ulric...@uni-tuebingen.de


On Monday, January 26, 2015 at 11:44:57 PM UTC+9, Alexandre Courbot wrote:
A few random comments from my end:

1) This is already an obvious improvement over our current format.

2) Some section names could be shorter, e.g. stroke-data could be
renamed to simply strokes. layout-descriptor -> layout, stroke-group
-> group (can we have other groups under the layout anyway?), etc.

My first draft used "strokes", "layout", and "group". I changed these things for functional clarity.
Example:
<strokes>
  <stroke ...
  <stroke ...
  <stroke ...
</strokes>

This is functional but not particularly clear. A simple "s" differentiates the two tags. The same can be done with "groups" and "group".

Because of this I named the tags by imagining how I would describe the data. I believe this is better in the long run, especially to keep the tags unique in case additional data is added.

<kanjivg> is a collection of <character>s. 
These <character>s have a <structure> and <metadata>. 
The <structure> is composed of <stroke-data> and an <ideographic-description>. 
The data is one or more <stroke>s.
The <ideographic-description> defines a number of <stroke-group>s (containing <stroke-ref>s) organized with <layout-descriptor>s.
etc. etc.

Yes, we could cut the second half off all of them and survive, but then the tags are a bit too generic. In other words, this is self-documenting to some degree.
 

3) I would really, really like to be able to get rid of all these
linear ids. Reordering the strokes means renumbering a good deal of the
stroke indexes, and of their references, an operation that is quite
error-prone if done manually. We need to be able to reference strokes
and groups though, so I'm afraid I am just raising a concern without
proposing an improvement here. :/

XML is a flattened data file, where we need to establish these relationships through ID keys. It's unavoidable. So much data was crammed into those SVG groups. This is much easier to parse. 

I have already reduced the IDs to plain numbering, down from their monstrous glory in the SVG format.

One more serious issue to discuss is the IDs, their ordering, and how they are updated. If a person parses KanjiVG and relies on identifying the strokes, you may reorder them, but you should not change their IDs, correct? Doing that means losing the ability to sync changes to the correct stroke. As I envision it, it is not the "s1" that says this is the first stroke. The <stroke> embeds its numbering position, but the actual number you place at that position should be the order in which the stroke tags are parsed. If someone has a better idea, or thinks we should warn people never to expect these IDs to be properly maintained across updates, please say so.
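
To make that concrete, here is a Python sketch of the idea: stroke numbers are derived from document order, and reordering moves elements without touching their ids. It assumes a <stroke-data> element whose only children are <stroke> elements; the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def stroke_numbers(stroke_data):
    """Map each stroke id to its stroke number, taken from document order."""
    return {
        stroke.get("id"): number
        for number, stroke in enumerate(stroke_data.findall("stroke"), start=1)
    }


def move_stroke(stroke_data, stroke_id, new_index):
    """Move a stroke to a new 0-based position; its id stays the same."""
    for stroke in stroke_data.findall("stroke"):
        if stroke.get("id") == stroke_id:
            stroke_data.remove(stroke)
            stroke_data.insert(new_index, stroke)
            return
    raise KeyError(stroke_id)
```

After `move_stroke(data, "s3", 0)`, the stroke with id s3 is drawn first, and anything that referenced s3 still points at the same stroke.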

4) Like Matthew said, XML looks like ass. Yet the examples sent by Lee
are much more legible already. But at the end of the day it will come
down to how we will edit these files. If we have a proper tool that
allows us to edit all the
details in a convenient way, then the underlying storage format is
much less of a concern (and we can automate things like id generation
and renaming). This is the preferred scenario, but we are very far
from having such a tool.

If on the other hand we want to be able to edit the files using a
simple text editor, then legibility and ease of editing become
top priorities, and a purpose-built format/language would be the best fit.
The basic layout can be done using XML though.

Yes, that tool needs to be realized. But from my research into the current KanjiVG SVG format, I would say this XML is a treat to edit in comparison to the nested <g> tags, which really confused me.

Someone should suggest how we might edit the strokes without a tool, given that people have been editing SVG files until now, because I am not an editor. And if someone knows an open-source, precise SVG path editor that we can use to work out this editing tool, please say so.

ospalh

Feb 24, 2015, 8:22:59 AM
to kan...@googlegroups.com, ulric...@uni-tuebingen.de, gnu...@gmail.com
At the moment I think that most variants could be added to these XML files. Maybe by adding a variant-stroke-data section for all those Kaisho files with shorter strokes and hooks on some verticals.

Then add variant sections that just list the strokes to use for each version.

When there are variants where this mechanism doesn't work because they are too different, maybe we could just add a reference to another file.

At the moment for lots of files the Kaisho variant has different stroke data for *all* strokes. With my scheme most of these tiny differences would be lost.

Anyway, I thought of something like this, without adding it to the DTD:


066f2.xml
0641c.xml
04ead.xml

ospalh

Mar 5, 2015, 3:59:08 AM
to kan...@googlegroups.com, ulric...@uni-tuebingen.de, gnu...@gmail.com
I just noticed something that would make the files a bit more verbose again. You used stuff like <stroke-ref id="s1"/> to reference a <stroke id="s1" …>. So you used the same id, which is an xml attribute for which there are some rules, more than once. Maybe it should be something like <stroke-ref stroke-id="s1"/>.
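
If `id` is declared as type ID in the DTD, a validating parser will reject repeated values, since ID values must be unique within a document. Without a validating parser, a quick Python check for reused ids could look like this sketch (`duplicate_ids` is a hypothetical helper):

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(xml_text):
    """Return the sorted list of id values that occur on more than one element."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get("id") for el in root.iter() if el.get("id") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)
```

A file using <stroke-ref id="s1"/> next to <stroke id="s1"/> would be reported, while the stroke-id variant would pass.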

ospalh

Apr 9, 2015, 4:24:15 AM
to kan...@googlegroups.com, ulric...@uni-tuebingen.de, gnu...@gmail.com
Maybe what has been called “stroke weight” could be added as well, once somebody finds (or creates) a database. Probably as an extra attribute on the <stroke …> element: type="㇑" weight="xy". Maybe the mysterious letters currently tagged on to the stroke types could be separated out as well, once somebody finds out what they really are about.

Lee Hericks

Apr 9, 2015, 4:26:52 AM
to kan...@googlegroups.com, ulric...@uni-tuebingen.de, gnu...@gmail.com
Ulrich, we need to have a Skype powwow (meeting) about stroke types and control points, to start getting some documentation in order.

Ulrich Apel

May 13, 2015, 11:11:32 AM
to pierce.sp...@gmail.com, kan...@googlegroups.com, gnu...@gmail.com
I am very sorry that I am coming back to the discussion on KanjiVG so late. I was very busy and still am. I may have missed parts of the discussion.

The idea of using XML was to be able to work with only one file per character and to have all information in XML, since SVG is an XML application and namespaces are very handy for information other than paths.

At an early stage, we had two files or descriptions for every character: one in a sort of XML and one in pure SVG for paths. In my opinion the main thing is to have structured data and to be able to work with it. Generating JSON from XML shouldn't be that difficult. But I am no programmer.
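
For what it is worth, a generic XML-to-JSON conversion is indeed short. The Python sketch below maps every element to a dictionary with its tag, attributes and children. The JSON layout is an arbitrary choice made for illustration, not a proposed KanjiVG JSON format, and text content is ignored on the assumption that all data lives in attributes.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert an element into {'tag', 'attributes', 'children'} recursively."""
    node = {"tag": element.tag}
    if element.attrib:
        node["attributes"] = dict(element.attrib)
    children = [element_to_dict(child) for child in element]
    if children:
        node["children"] = children
    return node


def xml_to_json(xml_text):
    root = ET.fromstring(xml_text)
    # ensure_ascii=False keeps kanji readable instead of \uXXXX escapes.
    return json.dumps(element_to_dict(root), ensure_ascii=False, indent=2)
```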

JSON seems very reasonable for descriptions, but I am not sure about paths in SVG. I don't know whether it is really progress to go back to two files or to mix JSON and SVG/XML.

I would prefer an approach that allows a rather deep level of description and would not like to lose information. We already discussed that, for example, the radical 彳 "gyô ninben" does contain 亻 "ninben", even in its Japanese name. I wouldn't like to lose this kind of information. If there are other ways to handle the data that are easier and faster, I wouldn't have a problem with that.


Then, as for the term: it is really a "graphematic description"; one could even say "graphetic description". The word "ideograph" deals with "pictures or written things" (graphs) that bear an idea (ideo). KanjiVG deals mainly with the graphic side; it does not deal much with ideas.

The distinction between "graphematic" and "graphetic" is modeled on the distinction between "phoneme" and "phone", that is, whether a difference distinguishes meaning or whether something merely sounds or looks different.

Best wishes

Ulrich


On 13 May 2015 at 16:08, pierce.sp...@gmail.com wrote:

>
>
> On Friday, January 23, 2015 at 3:02:37 AM UTC+1, Lee Hericks wrote:
> Introductions
> - Why change format?
> Because the current format is limited by its technology. Multiple SVG group tags are created to cram in multiple radicals and other information.
> Therefore, a more extensible, easy to parse format is desired.
>
> - Ok, describe this new proposed format really simply!
> a) It's in XML, and we can use a DTD to help verify it's correct.
>
> Sorry to drop the bomb so late, but am I the only one who does not really want to stick with XML? JSON would save us a lot of parsing time. Moreover, it would be easier to use in the browser, both for drawing and editing (yes, I have this idea in mind).

pierce.sp...@gmail.com

May 13, 2015, 11:13:21 AM
to kan...@googlegroups.com, gnu...@gmail.com, ulric...@uni-tuebingen.de


On Friday, January 23, 2015 at 3:02:37 AM UTC+1, Lee Hericks wrote:
Introductions
- Why change format?
  Because the current format is limited by its technology. Multiple SVG group tags are created to cram in multiple radicals and other information.
  Therefore, a more extensible, easy to parse format is desired.

- Ok, describe this new proposed format really simply!
  a) It's in XML, and we can use a DTD to help verify it's correct.

Sorry to drop the bomb so late, but am I the only one who does not really want to stick with XML? JSON would save us a lot of parsing time. Moreover, it would be easier to use in the browser, both for drawing and editing (yes, I have this idea in mind).

Laurent L

May 29, 2015, 3:32:27 AM
to kan...@googlegroups.com, gnu...@gmail.com, ulric...@uni-tuebingen.de
Lee,

It would be great if you could put your draft under code review.

This would allow more people to follow, and probably ease the workflow.