Introduction
- Why change format?
Because the current format is limited by its technology. Multiple SVG group tags are created to cram in multiple radicals and other information.
Therefore, a more extensible, easy-to-parse format is desired.
- Ok, describe this new proposed format really simply!
a) It's in XML, and we can use a DTD to help verify it's correct.
b) Each character has a structure and metadata about that structure.
The structure consists of stroke data and an ideographic description of the kanji.
WAIT, Ideowhat?
b2) An ideographic description is a fancy word for "describe the kanji parts and where they are positioned"
The proposed format breaks up the duty of telling position and telling the group of strokes into two elements,
<ideographic-descriptor> and <stroke-group>. This simplifies parsing because it makes it very clear how to process each tag.
BUUUUUUT, it's still verbose, what about ⿱亠⿱口⿱冖丁 for 亭?
b3) We still need to add information about which strokes:
⿱亠{s1,s2}⿱口{s3,s4,s5}⿱冖{s6,s7}丁{s8,s9}
And which characters those strokes cross-reference:
⿱亠{s1,s2|kvg:04ea0}⿱口{s3,s4,s5|kvg:053e3}⿱冖{s6,s7|kvg:05196}丁{s8,s9|kvg:04e01}
and possibly other information.
Oh, and some parts, like the far right of 術, don't seem to have a literal character?!
XML is wordier but easier to parse.
Check out kanjivg.dtd and the samples!
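
To give a feel for what parsing the compact notation above would involve, here is a minimal Python sketch of a recursive-descent parser for strings like the annotated 亭 example. It assumes every leaf is a single literal character followed by {strokes} or {strokes|ref}, and that only the Ideographic Description Characters U+2FF0 to U+2FFB appear. The names parse_ids and strokes_of are made up for this sketch and are not part of KanjiVG.

```python
# Hypothetical sketch: parse the annotated IDS notation into a nested dict tree.
# Assumption: the layouts U+2FF2 and U+2FF3 take three operands, all others two.
TERNARY = {"\u2ff2", "\u2ff3"}


def is_idc(ch):
    """True if ch is an Ideographic Description Character (U+2FF0..U+2FFB)."""
    return "\u2ff0" <= ch <= "\u2ffb"


def _parse_node(text, pos):
    """Parse one node starting at pos; return (node, position after node)."""
    if pos >= len(text):
        raise ValueError("unexpected end of input")
    ch = text[pos]
    if is_idc(ch):
        children = []
        pos += 1
        for _ in range(3 if ch in TERNARY else 2):
            child, pos = _parse_node(text, pos)
            children.append(child)
        return {"layout": ch, "children": children}, pos
    # Leaf: one literal character followed by {s1,s2} or {s1,s2|kvg:xxxxx}
    if pos + 1 >= len(text) or text[pos + 1] != "{":
        raise ValueError(f"expected '{{' after {ch!r} at offset {pos}")
    end = text.index("}", pos + 2)
    strokes, _, ref = text[pos + 2:end].partition("|")
    leaf = {"literal": ch, "strokes": strokes.split(","), "ref": ref or None}
    return leaf, end + 1


def parse_ids(text):
    """Parse a whole annotated IDS string; reject trailing text."""
    node, pos = _parse_node(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected trailing text at offset {pos}")
    return node


def strokes_of(node):
    """Collect the stroke ids under a node, in order."""
    if "strokes" in node:
        return list(node["strokes"])
    return [s for child in node["children"] for s in strokes_of(child)]


ANNOTATED = (
    "⿱亠{s1,s2|kvg:04ea0}⿱口{s3,s4,s5|kvg:053e3}"
    "⿱冖{s6,s7|kvg:05196}丁{s8,s9|kvg:04e01}"
)
tree = parse_ids(ANNOTATED)
```

Even this small parser has to special-case the three-operand layouts, and it has no way to represent a part that lacks a literal character, which is the kind of problem the XML format is meant to avoid.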
Goals
1) Clearly organize/mark/explain all data
2) Build proper tools/site for the project. Imagine:
Parsing this proposed XML format
a) Parse the character structure into a collection of strokes and groups mapping to those strokes.
b) Parse metadata which references strokes and groups. Resolve group references to the appropriate strokes.
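
The two steps above could be sketched in Python roughly as follows, using the standard xml.etree.ElementTree module. Only the stroke-group element and its id and literal attributes come from this proposal; the character, structure, stroke, metadata, radical and note elements and the ref attribute in the sample are placeholders invented for illustration, so check kanjivg.dtd for the real names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Element names other than <stroke-group>
# are assumptions made for this sketch, not taken from kanjivg.dtd.
SAMPLE = """
<character id="kvg:0660e">
  <structure>
    <stroke-group id="g1" literal="日">
      <stroke id="s1"/><stroke id="s2"/><stroke id="s3"/><stroke id="s4"/>
    </stroke-group>
    <stroke-group id="g2" literal="月">
      <stroke id="s5"/><stroke id="s6"/><stroke id="s7"/><stroke id="s8"/>
    </stroke-group>
  </structure>
  <metadata>
    <radical ref="g1"/>
    <note ref="s5"/>
  </metadata>
</character>
"""


def parse_structure(character):
    """Step a: collect strokes and map each group id to its stroke ids."""
    strokes = []   # stroke ids in document order
    groups = {}    # group id -> stroke ids, including those of nested groups

    def walk(elem, open_groups):
        for child in elem:
            if child.tag == "stroke-group":
                groups[child.get("id")] = []
                walk(child, open_groups + [child.get("id")])
            elif child.tag == "stroke":
                strokes.append(child.get("id"))
                for gid in open_groups:
                    groups[gid].append(child.get("id"))
            else:
                walk(child, open_groups)

    walk(character.find("structure"), [])
    return strokes, groups


def resolve(ref, strokes, groups):
    """Turn a stroke or group reference into a list of stroke ids."""
    if ref in groups:
        return list(groups[ref])
    if ref in strokes:
        return [ref]
    raise KeyError(f"unknown reference: {ref}")


def parse_metadata(character, strokes, groups):
    """Step b: resolve every metadata entry that carries a ref attribute."""
    return [
        (entry.tag, resolve(entry.get("ref"), strokes, groups))
        for entry in character.find("metadata")
        if entry.get("ref") is not None
    ]


character = ET.fromstring(SAMPLE.strip())
strokes, groups = parse_structure(character)
metadata = parse_metadata(character, strokes, groups)
```

Keeping the group-to-stroke map separate from the metadata pass means a metadata entry can point at either a stroke or a group and still be resolved the same way.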
General comments / decisions made for this proposal thus far
- The kvg XML namespace is avoided for readability and because conflicts are unlikely.
Character ids are prefixed with kvg: so if you want to refer to stroke 5 of 明,
you do kvg:0660e-s5 just as you would have previously seen in the SVG version.
You can think of that as a path to s5 in the character with id kvg:0660e.
- From a Japanese language school perspective, KanjiVG characters are sometimes broken down into chunks so small that they have no linguistic meaning.
ex. 行{彳{㇒,亻}, …} -- 亻 is not a component of 彳.
Please tell me if there is a machine/programming reason for this. We either:
a) clean this up to an appropriate linguistic level
b) clearly mark the groups representing linguistic components of the kanji
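
The "path" reading of ids such as kvg:0660e-s5 described above can be illustrated with a small Python sketch. The function name split_kvg_id and the shape of its return value are made up for this example; the only assumption about the format is that the part after kvg: is the character's code point in hexadecimal, optionally followed by a hyphen and a stroke or group id.

```python
def split_kvg_id(ref):
    """Split an id like 'kvg:0660e-s5' into character id, literal, and part.

    Hypothetical helper: assumes 'kvg:' + hex code point [+ '-' + part id].
    """
    prefix, sep, rest = ref.partition(":")
    if prefix != "kvg" or not sep:
        raise ValueError(f"not a kvg id: {ref!r}")
    hex_code, _, part = rest.partition("-")
    return {
        "character_id": f"kvg:{hex_code}",
        "literal": chr(int(hex_code, 16)),  # code point -> character
        "part": part or None,               # e.g. 's5', or None for the whole character
    }


stroke_ref = split_kvg_id("kvg:0660e-s5")
whole_char = split_kvg_id("kvg:0660e")
```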
Clarifications
- Why are both the Ideographic Description Character and its Unicode value included in layout-descriptor tags?
They are both included to give a choice when parsing, but the Unicode value may be dropped upon consensus.
- Where did "kvg:element" go?
It became the literal attribute. Any place where we choose the most visually accurate character for a group, that character is called a literal.
Therefore, for 明, <g id="kvg:0660e-gX" kvg:element="日"> is <stroke-group id="g1" literal="日">.
We do this because we are separating out and clearly defining an ideographic description of the character.
- But how will I quickly see this character?
That's a good question. The extensibility of this draft format is a clear advantage moving forward, but it will require an editing tool.
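
As an illustration of the kvg:element to literal mapping described above, here is a hedged Python sketch that converts one SVG group into a stroke-group element. It assumes the kvg prefix is bound to the namespace URI http://kanjivg.tagaini.net (declared explicitly in the snippet so that it parses on its own), and that the new local id is whatever follows the last hyphen of the old id. The function name to_stroke_group is invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI bound to the kvg prefix in the SVG files.
KVG_NS = "http://kanjivg.tagaini.net"

# A stand-alone version of the old-style group, with a concrete id (g1)
# in place of the gX placeholder used in the text.
SVG_SNIPPET = (
    '<g xmlns:kvg="http://kanjivg.tagaini.net" '
    'id="kvg:0660e-g1" kvg:element="日"/>'
)


def to_stroke_group(svg_group):
    """Build a <stroke-group> from an old-style SVG <g> element."""
    # 'kvg:0660e-g1' -> 'g1': keep only the part after the last hyphen.
    local_id = svg_group.get("id").rsplit("-", 1)[-1]
    out = ET.Element("stroke-group", {"id": local_id})
    # ElementTree exposes namespaced attributes as '{uri}name'.
    literal = svg_group.get(f"{{{KVG_NS}}}element")
    if literal is not None:
        out.set("literal", literal)
    return out


group = to_stroke_group(ET.fromstring(SVG_SNIPPET))
serialized = ET.tostring(group, encoding="unicode")
```

Groups without a kvg:element attribute simply get no literal attribute here; how such groups should be represented is left to the proposal itself.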
Missing
- The 高 that partially exists in 亭 is information about the origin of the kanji (字源「じげん」). It should have some appropriately named tag in the metadata section.
- kvg:tradForm attribute is only used in 09ed9.svg, 066f2.svg, 08208.svg
- kvg:radicalForm attribute is only used in 05101.svg, 0658e.svg, 07962.svg, 07f86.svg, 09f21.svg
References
- Unicode 7.0 Ideographic Description Characters (18.2)
- Unicode 7.0 CJK Strokes (Not yet checked against KanjiVG stroke types)