Consistency rules


msk...@ansuz.sooke.bc.ca

Mar 12, 2012, 10:21:34 PM
to kan...@googlegroups.com
One item of consensus in last week's Skype meeting was that we need
clearer documented rules for just what the different fields in the KanjiVG
database actually mean. The Wiki page at
http://kanjivg.tagaini.net/format.html
provides some description, but it leaves many issues ambiguous, and it is
apparent that many entries in the current database do not actually obey
the format described there.

Here are some thoughts of mine on rules that would be good to have. These
are intended as a starting point for discussion rather than something to
be adopted as-is; and it should be understood that they reflect my own
priorities and the use I'd like to make of the database. Nonetheless, I
do have reasons for all of them and will try to make those reasons clear.
Especially where I think changes from current practice would be advisable,
I'm not just proposing change for change's sake.

I think a primary general principle we should follow is that the
information in the database should be *unambiguous*. It should never be
the case that there is more than one interpretation consistent with a
given database entry and the reader must supply "common sense" to figure
out which one is correct. Computers don't have common sense! And in the
general case, we can't even depend on humans reading that database to have
the same common sense we have. Ideally, it should also be unambiguous in
the other direction: only one correct way to describe any given set of
facts in an entry. That is harder to achieve and less important than
having only one set of facts consistent with any given entry.

A secondary principle is that wherever possible, rules should be
*testable*. It should be possible for a computer program to examine a
database entry and identify any rules that the entry violates, even if it
is not possible to automatically fix the violations. This is important
if, as has been proposed, we're going to enforce the rules in a graphical
editor.

I am going to describe my proposals in terms of the XML database, which is
what I'm most familiar with and what I think many users of the data will
be looking at. I'm aware that the data exists in some other forms, too,
and some of these points have to do with the translation between the other
forms and the XML rather than with the underlying data. Nonetheless, as
long as we're calling ourselves KanjiVG (therefore SVG, therefore XML) it
seems that the XML format of the data is pretty important and it would be
nice if it were correct and consistent.

XML FORMAT

RULE: our files are valid UTF-8 encoded XML.

As far as I know, that is already true.

RULE: we have a DTD and the files follow it.

I believe that is already true too, at least in the case of the per-kanji
SVG files; they contain XML headers with DTD information for the
KanjiVG-specific fields, and references to the W3C's DTD for SVG. I'd
comment, though, that I don't honestly think a DTD is so important. The
things we'd really like to enforce are at a higher level and won't be
captured by DTD validation. Some XML standards exist for the kind of
higher-level consistency described below but I'm not even sure we should
use those - I think these constraints may be better described and handled
in other ways than attempting to use the XML people's Byzantine web of
"all things to all people" standards. Sure, let's have a DTD, it costs
little and has non-zero value; but I don't think it'll be a big part of
addressing the real issues.

RULE: every group, path, and (in the "combined" file) <kanji> tag has an
XML ID, and the IDs are unique across the database.

I think that is already true, and it makes simple searches with
(non-IDS) grep, as well as many other software tasks, much easier.
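
As an illustration of how testable this rule is, here is a minimal Python sketch that counts tags lacking an ID and lists duplicated IDs. The sample data and the function name check_ids are hypothetical; it assumes the ID is stored in a plain "id" attribute.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CHECKED_TAGS = {"g", "path", "kanji"}

def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]

def check_ids(root):
    """Return (number of checked tags lacking an id, sorted list of duplicated ids)."""
    missing = 0
    counts = Counter()
    for elem in root.iter():
        if local(elem.tag) not in CHECKED_TAGS:
            continue
        if elem.get("id") is None:
            missing += 1
        else:
            counts[elem.get("id")] += 1
    return missing, sorted(i for i, n in counts.items() if n > 1)

# Hypothetical two-entry database: one path has no id, and "g1" is used twice.
SAMPLE = """<kanjivg>
  <kanji id="k1"><g id="g1"><path id="s1"/></g></kanji>
  <kanji id="k2"><g id="g1"><path/></g></kanji>
</kanjivg>"""

print(check_ids(ET.fromstring(SAMPLE)))
```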

SPATIAL STRUCTURING

This is the point on which I think my suggestion is most likely to be
controversial, so let's get it out of the way quickly: I think that the
hierarchical structure of the XML files (what tag goes inside what) should
reflect the spatial structure of the character (what element is part of
what) even if that means recording strokes in the file in an order other
than the "stroke order" used in writing the character.

A typical example of the consequences shows up in the entry for U+5712 園.
The top-level group in the current format contains *three*
sub-groups. First, the first two strokes of the enclosing box. Second,
the middle. Third, the last stroke of the enclosing box. There is some
markup to indicate that the first and last groups are somehow connected to
each other. The spatial structure of the character is quite different:
at the top level, this character has *two* parts, and they have an
enclosure relationship to each other. One of those parts is the enclosing
box; the other part is the middle. Any software that wants the spatial
structure and is looking at the current representation must make a
translation between the two formats. It is not obvious how to make this
translation correctly; the markup that says the two parts of the box are
connected is designed to help make the translation, but only helps up to
a point.

I suggest that instead of the current practice, we should store the
spatial structure directly in the file, and represent the writing sequence
of strokes by a new attribute, tentatively called "kvg:strokenum".

There are a number of reasons I suggest this.

First of all, applications (like mine!) that need the spatial structure,
currently just can't have it, too bad. The "kvg:part" (note, not the same
as "kvg:partial") attribute attempts to describe spatial structure,
but its use in the current data is inconsistent (used different ways in
different entries) so no software can actually get correct data from the
database in general; there is no documentation on what is the correct way
to use it, so there is no real possibility of fixing the bad entries; and
it's not even clear what would be a sensible way to use it (though I make
a proposal below) so we can't document a good meaning for this field and
then try to make the data follow it.

Second: We have a file format, XML, which naturally represents
hierarchical structures. At present we are not really using that aspect
of XML; the hierarchical structure described by the XML is not the spatial
structure but also isn't anything else that someone would want to use.
People who need stroke-sequence will ignore the grouping and just look at
the strokes themselves, and people who need the spatial structure will do
a messy cutting and pasting job. Nobody has a reason to directly use the
hierarchical structure we're actually storing; why store it?

Third: many of the consistency checks we would like to enforce are
defined in terms of spatial structure. If the data is not stored in
spatial structure form, it becomes much harder to apply those checks
because the checks have to be applied to the output of the
stroke-sequence/spatial transformation.

Fourth: if we want to store more than one stroke order in the same
entry (for instance, traditional versus modern, or Japanese versus
Chinese), that is impossible in the current format. And if we want to
change the stroke order (to correct an error or for any other reason) the
current format implies that the containment structure may be forced to
change too - making it easy for errors and rule violations to creep into
the containment structure, and making it harder to compare two characters
that have similar spatial structure but different stroke order.

With those points in mind, I propose:

RULE: each entry is structured into groups (one at the top, then possibly
others inside it) describing the spatial organization of the kanji
character. In particular, each visual component of the kanji corresponds
to just one group somewhere in the entry. Components are not split to
appear multiple times.

Note that, like the current format document, I'm talking here about one
top-level group that contains strokes; if the SVG file contains another
group for stroke numbers, as the format document proposes under the name
"StrokeNumbers", so be it, but I don't think the current database actually
does that and it's beyond the scope of what I'm interested in right now.

Splitting of groups, as mediated by the current "kvg:part" attribute, is
so difficult to undo reliably that I don't think I can meaningfully count
violations of other rules in entries that do that. As a result, my counts
of rule violations below are generally limited to the 5457 entries that do
NOT split groups with kvg:part. Some numbers may add up to more than 5457
because they are counts of tree nodes, not of entire entries, and each
entry contains many nodes.

RULE: every group contains either A. two or more subgroups, or B. one or
more paths (strokes); not both.

That last rule is violated in multiple ways in the current database. We
have:
* groups and paths mixed in a group (children of the same parent; true
of 3840 parents in the current database)
* a group containing just one group (265 times)
* a group containing nothing (4 times)
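
A check for this rule could look roughly like the following Python sketch. The function name and the sample entry are hypothetical, and the three labels correspond to the three kinds of violation listed above.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]

def group_violations(root):
    """Return (group id, problem) pairs for groups that break the rule."""
    problems = []
    for g in root.iter():
        if local(g.tag) != "g":
            continue
        groups = sum(1 for c in g if local(c.tag) == "g")
        paths = sum(1 for c in g if local(c.tag) == "path")
        if groups and paths:
            problems.append((g.get("id"), "mixed groups and paths"))
        elif groups == 1:
            problems.append((g.get("id"), "only one subgroup"))
        elif groups == 0 and paths == 0:
            problems.append((g.get("id"), "empty"))
    return problems

# Hypothetical entry: "a" mixes groups and a path, "b" has one subgroup,
# "c" is empty, and "d" (a single path in a group) is acceptable.
SAMPLE = """<g id="a">
  <g id="b"><g id="c"/></g>
  <path id="s1"/>
  <g id="d"><path id="s2"/></g>
</g>"""

for problem in group_violations(ET.fromstring(SAMPLE)):
    print(problem)
```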

Note that, especially as a consequence of other rules below, it may
sometimes be necessary to create a group that contains just a single path.
That is perfectly acceptable. The alternative of trying to treat single
paths as implicit groups when they appear where a group would otherwise be
required, implies allowing all group attributes to also be path
attributes, which makes DTD validation much harder and requires
special-case handling in all readers that try to use the file. Much
better to create the extra level of grouping.

As well as identifying the groups that make up a component, we also need
to know their spatial relationship to each other. For instance, is this
component split into left and right; top and bottom; or enclosing object
and the thing inside it?

ATTRIBUTE: groups have a kvg:position attribute, which is a string from a
relatively short list of possible values. The kvg:position attribute
specifies the spatial relationship between this group and its siblings.

RULE: every group has a kvg:position attribute, except the root group
which does not have one.

Requiring a kvg:position attribute everywhere is a tough requirement. At
present, out of 41902 non-root groups in the database (including kanji
with kvg:part, because this was easier to count), 16674 are lacking
kvg:position. I think a big part of the reason for that is that in very
many cases, it's hard to think of what the correct value should actually
be. The existing vocabulary of spatial relationships doesn't cover all
possibilities and it's not clear that it ever could. There are a lot of
near misses, like the relationship between 中 and 三 in the lower right of
僅 (example from Karl). Is that "top and bottom," "overlay," or what?
As a result of the sheer number of nodes where this information is
missing, and the difficulty of filling it in, it may not be possible to
actually enforce this rule immediately. Nonetheless I'm going to document
what I would really like to have even if it'll be difficult to achieve, as
well as a "transitional" rule that'll be easier to achieve in the short
term.

RULE: in a group that has groups as its children, the number of children
and their kvg:position attributes matches one of the templates below. In
my current proposed templates, the number of children is two in all
cases; that might not be true of some future templates.

TRANSITIONAL RULE: if for some group it is not possible or convenient to
obey the previous two rules (valid data in kvg:position everywhere) then
the group's children may omit the kvg:position attribute entirely. In
this case it will be understood that the children are separate elements in
some meaningful way, but their relationship to each other is not
specified.

What I'm calling "templates" are the spatial relationships we allow among
the children of a group. I don't know what the entire list of these
should be, but it seems clear that we should have at least the ones
currently in use, normalized to be more useful where necessary. Starred
values are new proposals, not currently present in the database. Other
values currently present, such as "kamae1" and "kamae2", I recommend
removing. Other kinds of relationships are possible. Karl pointed out a
list, source unknown, in Wikipedia at
http://en.wikipedia.org/wiki/Radical_%28Chinese_character%29#Position_of_radical_within_character_in_Japanese

There are also several other relationships implied by Unicode's
Ideographic Description Characters (familiar to me from my IDSgrep work);
those are described, but not in much detail, in section 12.2 of the
Unicode standard:
http://www.unicode.org/versions/Unicode6.0.0/ch12.pdf

I believe Unicode's vocabulary of spatial relationships is derived from
the Mainland Chinese GBK standard. The most complete documentation of it
is probably only available in the Chinese language.

Normalized current template list:

"top" and "bottom"
"left" and "right"
"kamae" and *"kamae-inside"
"tare" and *"tare-inside"
"nyo" and *"nyo-inside"

It appears to me that "kamae" in the current database is being used as a
catch-all for several distinct kinds of enclosure (enclosing on all four
sides, on three sides, and maybe even some cases of enclosing on two
sides) while "tare" and "nyo" each refer to specific common ways of
enclosing on two sides. Unicode splits it out into a different
relationship for each combination of which sides are enclosed, and we
might want to consider that as well.

I think I know what "left and right" and "top and bottom" mean (though
even that is not a gimme - Kanpeki describes the direction of the
dividing line rather than the parts, resulting in opposite terminology to
Unicode's) but I'm not clear on the traditional scopes of the Japanese
terms "kamae," "tare," and "nyo." Unicode also defines three-way
relationships for "left, middle, and right" and "top, middle, and bottom",
which we might or might not want to consider allowing.

Note that I have not specified, and I don't think it's important to
specify, any order that the sub-groups inside a group must appear in the
file (such as "top" before "bottom" or "left" before "right"). When both
"kvg:position" and "kvg:strokenum" (described below) are specified
correctly, the sequential order of tags in the XML file should not matter.

Note that I'm proposing separate kvg:position values for "kamae-inside",
"tare-inside", "nyo-inside", and so on, rather than having a single
"inside" value used with all of them or (apparently the current practice)
leaving it implicit that the inside is everything else in the group. The
reason for this is to improve testability. Under my proposal, we can look
at ANY kvg:position value and instantly know exactly what sibling values
it should have. If "inside" could match with more than one different
thing, that would no longer be true.
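
To show how the templates and the transitional rule could be tested mechanically, here is a minimal Python sketch. It assumes the kvg: prefix is bound to the namespace URI http://kanjivg.tagaini.net and uses only the five proposed two-child templates; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

KVG_NS = "http://kanjivg.tagaini.net"  # assumed namespace URI for kvg: attributes
POSITION = "{%s}position" % KVG_NS

# Each template is stored as a sorted tuple so child order in the file is irrelevant.
TEMPLATES = {
    tuple(sorted(pair)) for pair in [
        ("top", "bottom"), ("left", "right"), ("kamae", "kamae-inside"),
        ("tare", "tare-inside"), ("nyo", "nyo-inside"),
    ]
}

def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]

def check_positions(group):
    """Classify one group as 'ok', 'unspecified' (transitional rule), or 'violation'."""
    children = [c for c in group if local(c.tag) == "g"]
    if not children:
        return "ok"  # a group of paths has no sibling relationships to check
    positions = [c.get(POSITION) for c in children]
    if all(p is None for p in positions):
        return "unspecified"
    if None in positions:
        return "violation"
    return "ok" if tuple(sorted(positions)) in TEMPLATES else "violation"

def parse(body):
    """Wrap child markup in a parent group that declares the kvg prefix."""
    return ET.fromstring('<g xmlns:kvg="%s">%s</g>' % (KVG_NS, body))

print(check_positions(parse('<g kvg:position="kamae"/><g kvg:position="kamae-inside"/>')))
print(check_positions(parse('<g kvg:position="kamae"/><g kvg:position="tare-inside"/>')))
```

Because every value belongs to exactly one template, the comparison is a simple set lookup; a shared "inside" value would force the checker to guess which enclosure was meant.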

If I assume that when "kamae," "tare," or "nyo" occurs side-by-side with
exactly one other group, that group is implicitly the inside (an easy fix
to make), then after that there remain 267 parents in the current database
whose children include at least one kvg:position value but that can't be
reconciled with one of these templates.

STROKE ORDER

I advocate structuring the XML file around the spatial structure of the
kanji, for the reasons described above, but doing that necessitates
representing stroke order in some other way. I recommend the following.

ATTRIBUTE: paths (strokes) have a "kvg:strokenum" attribute. This is a
positive integer specifying the order in which the stroke is written.

RULE: Every path has a "kvg:strokenum" attribute. The values in a kanji
entry are consecutive integers from 1 up to the number of paths; none are
duplicated or missing.
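
This rule is also easy to test. A minimal Python sketch, assuming the kvg: prefix is bound to http://kanjivg.tagaini.net and using a hypothetical helper to build sample entries:

```python
import xml.etree.ElementTree as ET

KVG_NS = "http://kanjivg.tagaini.net"  # assumed namespace URI for kvg: attributes
STROKENUM = "{%s}strokenum" % KVG_NS

def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]

def strokenums_valid(entry):
    """True if the paths carry exactly the integers 1..n, in any file order."""
    nums = []
    for elem in entry.iter():
        if local(elem.tag) != "path":
            continue
        try:
            nums.append(int(elem.get(STROKENUM)))
        except (TypeError, ValueError):  # attribute missing or not an integer
            return False
    return sorted(nums) == list(range(1, len(nums) + 1))

def entry(*nums):
    """Build a hypothetical one-group entry whose paths have the given numbers."""
    paths = "".join('<path kvg:strokenum="%s"/>' % n for n in nums)
    return ET.fromstring('<g xmlns:kvg="%s">%s</g>' % (KVG_NS, paths))

print(strokenums_valid(entry(3, 1, 2)))     # out of file order, but complete
print(strokenums_valid(entry(10, 20, 30)))  # not consecutive from 1
```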

RULE: There is an identified standard we follow to determine the correct
stroke order.

I don't have the necessary knowledge to have a good idea of which standard
that should be; my vague idea is that it should be whatever is currently
taught in Japanese schools, but I'm aware that that may not be fully
defined. Nonetheless I think we should have *some* standard even if,
ultimately, it ends up being "Whatever Ulrich says."

I suggest both that stroke numbers should start from 1, and that they
should be consecutive integers - instead of, for instance, going
"10-20-30-40" to make future insertions easier - for compatibility with
human beings. People outside the project know what "stroke 3 of
such-and-such character" means without needing to know anything about
KanjiVG's internal representation; so even though it is not necessary that
our database's internal format match human language, it would sure improve
the accessibility of our data.

Note that we might well want to define one or more optional attributes for
alternate stroke orders, such as Chinese or calligraphic, but I'm not sure
it's appropriate to codify those immediately.

Note that the introduction of this attribute, and the fact that its values
might NOT appear in sequential order in the XML file, will require changes
to any software that generates animations from the XML data. That is the
significant disadvantage of the switch to structural ordering. I don't
think it's a big problem for software that is designed specifically to
work with KanjiVG data; but if people are generating animations *using
generic SVG software not designed for KanjiVG in particular* then they may
have objections. It would be easy to write a converter that decodes the
kvg:strokenum data into a flat SVG file with the strokes in sequential
order and no spatial structure; that might be a migration path for any
software affected by this issue. I'm not sure that any such software
exists.
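
Such a converter might be sketched in Python as follows. This is only an illustration under assumptions: the namespace URI is assumed, the sample entry is hypothetical, and dropping kvg:strokenum from the output is a choice made here rather than part of the proposal.

```python
import xml.etree.ElementTree as ET

KVG_NS = "http://kanjivg.tagaini.net"  # assumed namespace URI for kvg: attributes
STROKENUM = "{%s}strokenum" % KVG_NS

def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]

def flatten(entry):
    """Return a new single group holding copies of all paths in writing order."""
    paths = [e for e in entry.iter() if local(e.tag) == "path"]
    paths.sort(key=lambda p: int(p.get(STROKENUM)))
    flat = ET.Element(entry.tag)
    for p in paths:
        # Copy every attribute except the stroke number, which the order now encodes.
        attrs = {k: v for k, v in p.attrib.items() if k != STROKENUM}
        ET.SubElement(flat, p.tag, attrs)
    return flat

# Hypothetical enclosure: the box is written as strokes 1, 2 and 5,
# the inside as strokes 3 and 4.
SAMPLE = """<g xmlns:kvg="http://kanjivg.tagaini.net" id="root">
  <g id="box">
    <path id="a" kvg:strokenum="1"/>
    <path id="b" kvg:strokenum="2"/>
    <path id="c" kvg:strokenum="5"/>
  </g>
  <g id="inside">
    <path id="d" kvg:strokenum="3"/>
    <path id="e" kvg:strokenum="4"/>
  </g>
</g>"""

print([p.get("id") for p in flatten(ET.fromstring(SAMPLE))])
```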

ELEMENT IDENTIFICATION

The root-level group for a kanji obviously represents the entire kanji.
However, when a kanji is divided into parts, it is not always clear what,
if any, kanji is represented by each part. Sometimes the "parts" of a
kanji are not kanji in themselves, but nonetheless the same part may be
used as part of more than one kanji and we would like to have a name for
it. There are also a lot of issues surrounding the extent to which two
things that look similar, but possibly not identical, may or may not be
considered to be the same thing. Furthermore, there is often an
etymological relationship we would like to capture, such as the sense in
which the left part of 海 can be said to be 水 even though they do not
look much like each other.

Even when we think we agree on what name to give to a group, it is not
always clear how to represent that in the file in terms of a Unicode code
point. Unicode contains separate ranges of code points for entire kanji,
radicals, and strokes, and what is basically the same thing may appear
in all those ranges. There are multiple reasons for that to occur,
including among other things the Unicode Consortium's desire to make all
distinctions that any other major standards make, in order to enable
a lossless round-trip conversion. If we describe the same object using
more than one code point, searching correctly becomes much harder.
Ambiguities of this nature can often be worked around in an automated way
using a table of equivalent code points, but it would be much better not
to have them in the first place.

ATTRIBUTE: groups have "kvg:element" attributes that primarily represent
their visual appearance, secondarily their meaning.

Note that the current format document says of the kvg:element attribute
"It should be the unicode character that resembles the group as much as
possible." That raises multiple problems, not least that Unicode does NOT
specify the visual appearance of characters; Unicode is very explicitly a
code of abstract concepts called "characters," not a code of pictures.
One simple example of the difference is that the "double-decker" lowercase
a with the curl on top, typical of print fonts in English, and the
"single-decker" lowercase a typical of handwriting, are both U+0061.
Without also fixing a reference font, we can't really say what a given
code point looks like or that this code point looks more like our
character than that code point. I propose instead the following.

RULE: the "kvg:element" attribute should be equal to the Unicode code
point that the element would unify with under Unicode's Han Unification
rules, if such a code point can be chosen unambiguously.

My main reason for suggesting that is that even if they are not perfect,
Unicode's rules have been extensively discussed and debugged, and many
people think they know them. If we try to make our own, we are likely to
fall into traps that they avoided. But even the Unicode rules are not
fully specific, because of the "same thing in multiple ranges" issue.
What we really need is a priority order. I propose the following.

RULE: For the root group of a kanji that has a code point of its own,
always prefer kvg:element equal to the code point of the kanji we are
describing, even if that conflicts with the rules that would apply to
non-root groups.

RULE: For non-root groups, or root groups not covered by the previous
rule, first prefer kvg:element equal to a code point from the CJK Unified
Ideographs range. Failing that, prefer code points in CJK Extension A,
CJK Extension B, and so on in that order through any current or future CJK
Extensions. Failing that, prefer code points in the CJK Radicals/Kangxi
Radicals range, then the CJK Radicals Supplement range, then the CJK
Strokes range.
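
The priority order can be written down as a table of Unicode block ranges. The Python sketch below is illustrative only: the function names are hypothetical, the extension list stops at Extension D, and any later extensions would need to be inserted after it.

```python
# Unicode block ranges in the proposed order of preference.
PRIORITY = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Extension A", 0x3400, 0x4DBF),
    ("CJK Extension B", 0x20000, 0x2A6DF),
    ("CJK Extension C", 0x2A700, 0x2B73F),
    ("CJK Extension D", 0x2B740, 0x2B81F),
    ("Kangxi Radicals", 0x2F00, 0x2FDF),
    ("CJK Radicals Supplement", 0x2E80, 0x2EFF),
    ("CJK Strokes", 0x31C0, 0x31EF),
]

def rank(ch):
    """Return the priority index of a character (lower is preferred), or None."""
    cp = ord(ch)
    for i, (_, lo, hi) in enumerate(PRIORITY):
        if lo <= cp <= hi:
            return i
    return None

def preferred(candidates):
    """Pick the candidate from the most preferred range, or None if none qualify."""
    ranked = [c for c in candidates if rank(c) is not None]
    return min(ranked, key=rank) if ranked else None

# Three encodings of "water": Kangxi radical U+2F54, the unified ideograph
# U+6C34, and the radical supplement form U+2EA1.
choice = preferred(["\u2f54", "\u6c34", "\u2ea1"])
print("U+%04X" % ord(choice))
```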

Note that strokes as such (XML <path> tag) are different from groups, and
discussed below. I think the above rules on which code points to prefer
are basically consistent with current practice; I'm not aware of any
significant violations of them in the current database.

Note that even this priority order of code points is not enough to name
everything, and there is still room for discussion on how to name things
that still remain nameless at this point. We could simply leave such
things without a kvg:element attribute; that would certainly be simplest,
and it presents no problem for my own applications, but it leaves us with
nothing but XML IDs for identifying the groups should we want to. We
could use Kanpeki-style subtraction notation, but it is not fully defined
in some cases (in particular, when there is more than one nameless
component in the same kanji). That, or some other options (including the
use of XML IDs), would require allowing the kvg:element string to be more
than one character long; I don't see any problem with allowing
multi-character strings, but someone might. Making up our own list of
private-use code points is another possibility and would avoid
multi-character strings, but it raises significant administrative and
interoperability issues.

The current documentation and database both attempt to capture an
additional piece of information: whether the group is a close match to its
kvg:element value or not. That is reduced to a simple yes/no. I think it
might be nice to have more detail than true/false, but attempting to
capture more detail opens up a proverbial can of worms, and I'm not sure
what would be better, so I'm just going to propose we stick with the
current practice on this point. It doesn't seem to be dramatically
broken.

ATTRIBUTE: groups have a "kvg:variant" attribute which indicates that the
group is (quoting the current format document) "actually slightly
different from the element attribute."

RULE: the kvg:variant attribute may or may not be present. It is only
permitted if the kvg:element attribute is present. If kvg:variant is
present, its only legal value is "true".
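
A checker for this rule could be sketched in Python as below. The namespace URI is an assumption, and the sample groups are hypothetical.

```python
import xml.etree.ElementTree as ET

KVG_NS = "http://kanjivg.tagaini.net"  # assumed namespace URI for kvg: attributes
ELEMENT = "{%s}element" % KVG_NS
VARIANT = "{%s}variant" % KVG_NS

def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]

def variant_errors(root):
    """Return (group id, problem) pairs for misuse of kvg:variant."""
    errors = []
    for g in root.iter():
        if local(g.tag) != "g" or g.get(VARIANT) is None:
            continue  # absence of kvg:variant is always allowed
        if g.get(VARIANT) != "true":
            errors.append((g.get("id"), "illegal value"))
        if g.get(ELEMENT) is None:
            errors.append((g.get("id"), "kvg:variant without kvg:element"))
    return errors

# Hypothetical groups: "a" is fine, "b" lacks kvg:element, "c" uses "false".
SAMPLE = """<g xmlns:kvg="http://kanjivg.tagaini.net" id="r">
  <g id="a" kvg:element="X" kvg:variant="true"/>
  <g id="b" kvg:variant="true"/>
  <g id="c" kvg:element="X" kvg:variant="false"/>
</g>"""

for error in variant_errors(ET.fromstring(SAMPLE)):
    print(error)
```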

Note that kvg:variant does NOT mean the group is a variant of
kvg:original, described below; it means it is a variant of kvg:element.
Either kvg:variant or kvg:original may appear without the other, though
each requires kvg:element. I didn't understand that at first and thought
it was an error in the database when kvg:variant or kvg:original appeared
without the other, but I'm now convinced that such things can at least in
principle be correct. I don't know whether these attributes are correct
in all cases in the current database.

COMPONENT UN-SPLITTING

Although storing the components in spatial order eliminates the need for
the kvg:part and kvg:number attributes, both of which I recommend
removing, we have 1203 entries in the current database that depend on
kvg:part. I don't believe it is possible to come up with a single set of
rules applicable to all these; there are some very weird things in the
database including single strokes described as part of several components
(which is itself reasonable in principle, but isn't described in the same
way in all entries) and duplicate values of kvg:part (presumably an
error). Nonetheless, here is my best guess at a reasonable way to
untangle stroke-sequential entries into spatial form. It seems to work on
a lot of current entries.

* First, find all the groups that correspond to each component. Two
groups correspond to the same component if their kvg:element values match
and, where kvg:number is present on any of them, their kvg:number values
match as well.

* For any component that is described by more than one group, it should be
the case that all the groups have kvg:part values and those are
consecutive integers starting with 1. If not, the entry is buggy enough
not to pursue further.

* Merge all the groups corresponding to the component. That means the
contents of the new group consist of the contents of the kvg:part="1"
group concatenated with the contents of the kvg:part="2" group, the
kvg:part="3" group, and so on, in order. The new group has all the
attributes present on any of the original groups; except for kvg:part,
which is removed, they all must agree when a given attribute is present.
(Note that I've observed otherwise-reasonable entries in which a given
attribute is only present on one of the groups making up a component; in
that case there doesn't seem to be a conflict.)

* The new group replaces the kvg:part="1" group. All the other groups are
deleted.

* If deleting a part creates an empty group (because the deleted group was
its parent's only child) then delete the empty parent, and continue that
process through as many levels as necessary.

It is possible that the above rules may not fully define their result
(because I haven't specified the order in which to examine the elements)
but it should be unambiguous in most cases and if not, we probably have
enough other problems with the entry anyway that we can't expect a good
result.
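
The steps above can be sketched in Python roughly as follows. This is an illustration under several assumptions, not a definitive implementation: the namespace URI is assumed; components are keyed on the pair (kvg:element, kvg:number); same-keyed groups are left alone when none of them carries kvg:part; and the merged group keeps the XML ID of the kvg:part="1" group, since IDs necessarily differ between the parts.

```python
import xml.etree.ElementTree as ET

KVG = "{http://kanjivg.tagaini.net}"  # assumed namespace URI
ELEMENT, NUMBER, PART = KVG + "element", KVG + "number", KVG + "part"

def remove_and_prune(root, victim):
    """Remove victim, then any ancestors that the removal leaves empty."""
    while victim is not root:
        parents = {child: parent for parent in root.iter() for child in parent}
        parent = parents[victim]
        parent.remove(victim)
        if len(parent) > 0:
            break
        victim = parent

def unsplit(root):
    """Merge groups split with kvg:part, in place; ValueError if inconsistent."""
    components = {}
    for g in root.iter():
        if g.get(ELEMENT) is not None:
            components.setdefault((g.get(ELEMENT), g.get(NUMBER)), []).append(g)
    for groups in components.values():
        # Assumption: same-keyed groups with no kvg:part at all are left alone.
        if len(groups) < 2 or all(g.get(PART) is None for g in groups):
            continue
        try:
            groups.sort(key=lambda g: int(g.get(PART)))
        except (TypeError, ValueError):
            raise ValueError("missing or non-integer kvg:part")
        if [int(g.get(PART)) for g in groups] != list(range(1, len(groups) + 1)):
            raise ValueError("kvg:part values are not 1..n")
        first = groups[0]
        for other in groups[1:]:
            for name, value in other.attrib.items():
                if name in (PART, "id"):
                    continue  # ids necessarily differ; keep the part-1 id
                if first.attrib.setdefault(name, value) != value:
                    raise ValueError("conflicting attribute " + name)
            children = list(other)
            for child in children:
                other.remove(child)
            first.extend(children)
            remove_and_prune(root, other)
        first.attrib.pop(PART, None)
    return root

# Hypothetical entry shaped like 園: the box is split around the middle,
# and its second part sits inside an otherwise empty wrapper group.
SAMPLE = """<g xmlns:kvg="http://kanjivg.tagaini.net" id="root">
  <g id="g1" kvg:element="囗" kvg:part="1"><path id="s1"/><path id="s2"/></g>
  <g id="g2" kvg:element="袁"><path id="s3"/></g>
  <g id="wrap"><g id="g3" kvg:element="囗" kvg:part="2"><path id="s4"/></g></g>
</g>"""

merged = unsplit(ET.fromstring(SAMPLE))
print([g.get("id") for g in merged])
print([p.get("id") for p in merged[0]])
```

The parent map is rebuilt on every removal because ElementTree keeps no parent pointers; that is slow but keeps the sketch short.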

ETYMOLOGICAL INFORMATION

In at least some cases we want to capture the etymological origin of a
group.

ATTRIBUTE: groups have a "kvg:original" attribute which represents the
semantic value of the group, or etymological original form when the group
is a simplification of something earlier.

RULE: (assuming the absence of a kvg:element attribute is allowed in the
first place) kvg:original is permitted only if kvg:element is present.

RULE: the priority order for choosing values of kvg:original is the same
as for choosing values of kvg:element.

RULE: kvg:original may optionally be omitted if its value would be
identical to the value of kvg:element; conversely, if kvg:original is not
present, readers must assume that its value would be identical to the
value of kvg:element.
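
For a reader of the data, the defaulting rule amounts to a small lookup. A minimal Python sketch, assuming the namespace URI shown and using a hypothetical function name; the sample values for the left part of 海 are illustrative:

```python
import xml.etree.ElementTree as ET

KVG_NS = "http://kanjivg.tagaini.net"  # assumed namespace URI for kvg: attributes
ELEMENT = "{%s}element" % KVG_NS
ORIGINAL = "{%s}original" % KVG_NS

def effective_original(group):
    """Return the group's kvg:original, defaulting to kvg:element when omitted."""
    element = group.get(ELEMENT)
    original = group.get(ORIGINAL)
    if element is None:
        if original is not None:
            raise ValueError("kvg:original present without kvg:element")
        return None
    return element if original is None else original

def parse(attrs):
    """Build a hypothetical group with the given attribute markup."""
    return ET.fromstring('<g xmlns:kvg="%s" %s/>' % (KVG_NS, attrs))

print(effective_original(parse('kvg:element="氵" kvg:original="水"')))
print(effective_original(parse('kvg:element="毎"')))
```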

Making kvg:original optional codifies existing practice; the fact that the
omission of kvg:original is optional rather than required may be a slight
ambiguity, but should be harmless as long as we're clear on the
circumstances where it applies.

I suggest having the same priority order for kvg:original as for
kvg:element in order to maximize the opportunities for such omission.
Another alternative I thought about was to prefer code points from Kangxi
Radicals for kvg:original - since those form a nice set, have centuries of
tradition behind them, and would line up with the kvg:radical attribute
mentioned below - but having different rules for two attributes whose
values need to be compared to each other seems like it would be a big mess.

Note that the same kvg:element value could go with different kvg:original
values in different kanji, as a result of distinctions being lost to
simplification during government-imposed language reform. Karl has shown
us some examples of kanji pairs in which this has happened, with
components now considered the same that have different historical
antecedents. The consequence is that kvg:original values cannot be
deduced solely by looking at kvg:element values; it's worth having the
separate field.

A point worth thinking about, but I don't have the expertise to decide it
myself: is there, and should there be, a specific cut-off date for the
point in history at which we will consider characters to be "original"?
If a given component was reformed three times in three different
centuries, which "original" version goes in kvg:original?

Four more attributes relating to what I'm calling "etymological
information" are listed in the current format document, but three are only
described as "TBD" and one has a list of allowed values and not a lot
else. I can't really comment on how these attributes should be used, but
for completeness, their names are kvg:radical, kvg:phon, kvg:tradForm, and
kvg:radicalForm.

PARTIAL COMPONENTS

Sometimes part of a component is missing. A related situation (which may
or may not be the same thing) is that a stroke may be part of more than
one component. For instance, in U+5546 商, the top seems to be U+7ACB 立
and the bottom seems to be U+518F 冏. But the bottom stroke of 立 isn't a
separate stroke; it is shared with part of the second stroke of 冏.
The database currently represents this using the kvg:partial attribute,
and as far as I know, it's used correctly in the current data.

ATTRIBUTE: groups have a kvg:partial attribute, true if not all the
strokes of the kvg:element value are present. (Note this is independent
of the meaning expressed by kvg:variant.)

RULE: the kvg:partial attribute may or may not be present. It is only
permitted if the kvg:element attribute is present. If kvg:partial is
present, its only legal value is "true".

There is some ambiguity possible when a given stroke is shared among more
than one component. Which component should it be listed as part of, and
which other components should be tagged as "partial"? In the case of 商
it seems clear that the shared stroke should be part of 冏, not 立, but it
is easy to imagine other cases where it would be less clear. Here are
some rules trying to disambiguate that.

RULE: If the actual stroke matches the stroke type (stroke types are
discussed below) in at least one, but not all, of the components that
share it, then prefer placing it in a component where it matches the
stroke type.

That rule disambiguates the 商 case, because the actual stroke is left to
right and then down (U+31C6 ㇆, which Unicode calls "HZG"); 冏 includes
that stroke exactly, whereas the stroke it replaces in 立 is a simple
horizontal line.

RULE: If a tie still exists after the above, prefer placing the shared
stroke in a component with a kvg:element value over one that does not have
one.

RULE: After the above, prefer placing the shared stroke in the component
with fewer strokes (counting any shared strokes in all the components that
contain them).

RULE: After the above, prefer placing the shared stroke in the component
whose first stroke is written earlier.

The purpose of those three rules is to place the shared stroke in the
component more likely to be used as a search key.
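
Taken together, the four preference rules behave like a lexicographic sort key. The Python sketch below illustrates that reading; the Candidate record and its field names are hypothetical, and the stroke counts and stroke positions given for 商 are illustrative values rather than data from the database.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One component that could own a shared stroke (hypothetical record)."""
    name: str
    matches_type: bool   # does the actual stroke match this component's stroke type?
    has_element: bool    # does the component have a kvg:element value?
    stroke_count: int    # strokes in the component, shared strokes included
    first_stroke: int    # writing position of the component's first stroke

def owner(candidates):
    """Pick the component that should contain the shared stroke's <path>."""
    return min(
        candidates,
        key=lambda c: (
            not c.matches_type,  # rule 1: prefer a matching stroke type
            not c.has_element,   # rule 2: prefer a component with kvg:element
            c.stroke_count,      # rule 3: prefer fewer strokes
            c.first_stroke,      # rule 4: prefer the one written earlier
        ),
    )

# The 商 case: the shared stroke matches its type in 冏 but not in 立.
top = Candidate("立", matches_type=False, has_element=True, stroke_count=5, first_stroke=1)
bottom = Candidate("冏", matches_type=True, has_element=True, stroke_count=7, first_stroke=5)
print(owner([top, bottom]).name)
```

When the stroke type matches in all candidates or in none, the first key is the same for all of them and the later rules decide, which agrees with the wording of the first rule.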

It would be nice to include information identifying all the components
that include each stroke. The current database does not attempt to store
such information for all kanji; in particular, it doesn't attempt it for
商. For others, it does make the attempt; in particular, we discussed the
example of U+4E17 丗, in which some strokes are wrapped in multiple layers
of one-child groups, completely breaking the untangling algorithm I
described above. My suggestion is that rather than trying to do it with
custom attributes added to the groups and paths of SVG files, we instead
add custom XML tags. I believe that the XML standards allow us to do
this, and standards-compliant SVG-reading software will just ignore the
tags it doesn't understand; if not, and including them is a problem, it
would also be possible though less elegant to encode this information in a
new custom attribute.

CUSTOM TAG: when a stroke needs to appear in more than one component, it
appears as a <path> tag in one group according to the rules above; any
other components that include the stroke contain instead a
<kvg:sharedStroke> tag. That tag has a kvg:path attribute whose
value is the XML ID of the associated <path> tag.

RULE: one or more <kvg:sharedStroke> tags appear inside a group if and
only if the group has kvg:partial="true". They count as paths for the
purposes of the "either paths or groups, not both" rule. Every
<kvg:sharedStroke> tag has a kvg:path attribute pointing at a <path> tag
in the same kanji.
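
A minimal Python sketch of how this rule could be checked follows. The
<kvg:sharedStroke> tag is only a proposal, so the namespace URIs, the
simplified sample markup, the short path IDs, and the function name are
all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the sample below declares both explicitly.
SVG = "http://www.w3.org/2000/svg"
KVG = "http://kanjivg.tagaini.net"
G, PATH = f"{{{SVG}}}g", f"{{{SVG}}}path"
SHARED = f"{{{KVG}}}sharedStroke"

def shared_stroke_violations(root):
    """Return messages for every breach of the proposed sharedStroke rule."""
    path_ids = {p.get("id") for p in root.iter(PATH)} - {None}
    problems = []
    for g in root.iter(G):
        gid = g.get("id", "?")
        kids = [child.tag for child in g]
        has_shared = SHARED in kids
        is_partial = g.get(f"{{{KVG}}}partial") == "true"
        if has_shared != is_partial:
            problems.append(f"{gid}: sharedStroke and kvg:partial disagree")
        # Shared strokes count as paths for "either paths or groups, not both".
        if G in kids and (PATH in kids or has_shared):
            problems.append(f"{gid}: mixes strokes and groups")
    for s in root.iter(SHARED):
        ref = s.get(f"{{{KVG}}}path")
        if ref not in path_ids:
            problems.append(f"sharedStroke points at unknown path {ref!r}")
    return problems

# Hypothetical, heavily simplified markup; not the real entry for 商.
GOOD = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:kvg="http://kanjivg.tagaini.net">
  <g id="top" kvg:element="商">
    <g id="g1" kvg:element="立" kvg:partial="true">
      <path id="s1" d="M0,0"/>
      <kvg:sharedStroke kvg:path="s2"/>
    </g>
    <g id="g2" kvg:element="冏">
      <path id="s2" d="M0,0"/>
    </g>
  </g>
</svg>"""
BAD = GOOD.replace('kvg:path="s2"', 'kvg:path="s9"')

print(shared_stroke_violations(ET.fromstring(GOOD)))
print(shared_stroke_violations(ET.fromstring(BAD)))
```

The first sample passes with no messages; the second is reported because
its reference does not resolve to any <path> in the same document.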

STROKE TYPES

Most of these rules relate to groups, but we've also talked about strokes
(represented by the <path> tag) and there are some issues worth mentioning
there, too.

ATTRIBUTE: paths have kvg:type attributes describing (qualitatively) the
shapes of the strokes.

It is not clear to me just what standard has been followed, or should be
followed, for the values of kvg:type. Most strokes seem to have
single-character values from the Unicode "CJK Strokes" range, like "㇐" or
"㇟". But many have alphabetic suffixes attached to them, like "㇕b" or
"㇇a"; and there are also many instances of two stroke types (with or
without alphabetic suffixes) joined with a slash, like "㇔/㇏" or
"㇖b/㇆". In our meeting, I think Ulrich said the alphabetic suffixes
describe how many other strokes the stroke touches, but I didn't quite
understand the explanation, and we didn't get as far as discussing the
slash combinations. So I've got this general recommendation, but don't
know what the detailed answer should be.

RULE: every path has a kvg:type attribute. We have a list of all possible
values for it and what they mean. Ideally, this follows some existing
standard.
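
Until such a list exists, a checker can at least test the surface form of
the values. The Python sketch below assumes a grammar guessed purely from
the examples quoted above: one or more alternatives joined by slashes,
each being a single character from the Unicode "CJK Strokes" block
(U+31C0 to U+31EF) with an optional lowercase letter suffix. That grammar
and the function name are assumptions, not a documented standard.

```python
import re

# One CJK Strokes character plus an optional single lowercase suffix (assumed form).
_ALTERNATIVE = re.compile(r"([\u31C0-\u31EF])([a-z]?)")

def parse_stroke_type(value):
    """Split a kvg:type value into (stroke, suffix) pairs, or raise ValueError."""
    parsed = []
    for alt in value.split("/"):
        m = _ALTERNATIVE.fullmatch(alt)
        if m is None:
            raise ValueError(f"unrecognised kvg:type alternative: {alt!r}")
        parsed.append((m.group(1), m.group(2)))
    return parsed

for example in ["㇐", "㇕b", "㇔/㇏", "㇖b/㇆"]:
    print(example, parse_stroke_type(example))
```

A real checker would go further and look each parsed pair up in the agreed
list of values, once there is one.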

Another issue we discussed was that there should be some kind of
consistency among all the strokes with a given type. For instance, they
should have the same number of control points, and the same pattern of
which points are corners, which are curve controls, and so on. That seems
to be an important first step toward the hope of making fonts from this
data. I should disclose that I'm working on strokes-to-fonts in my
Tsukurimashou project (http://sourceforge.jp/projects/tsukurimashou/).
The way I'm doing it is quite different, though; in particular,
Tsukurimashou doesn't have stroke types. Anyway, it appears that the SVG
standard already in use for KanjiVG provides a nice way of expressing a
constraint that I think expresses more or less what's wanted.

RULE: for every kvg:type value, we have a sample path (given as a value
for the SVG "d" attribute) describing a typical example of that stroke
type. For every <path> tag, the "d" attribute matches the sample path for
that kvg:type value, except for changing the values of the numbers.
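
A minimal Python sketch of such a comparison: reduce each "d" value to its
sequence of command letters and the count of numbers following each one,
then compare that skeleton with the sample's. Counting numbers instead of
comparing letters and commas literally is a choice made for this sketch,
because SVG also allows numbers to be separated by a minus sign or
whitespace. The function names are hypothetical.

```python
import re

# Either a single command letter or one SVG-style number.
_TOKEN = re.compile(r"([A-Za-z])|([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)")

def path_skeleton(d):
    """Return [(command, number_count), ...] for an SVG path 'd' string."""
    skeleton = []
    for m in _TOKEN.finditer(d):
        if m.group(1):
            skeleton.append([m.group(1), 0])   # start a new command
        elif skeleton:
            skeleton[-1][1] += 1               # one more parameter for it
    return [tuple(item) for item in skeleton]

def matches_sample(d, sample_d):
    """True if d differs from the sample only in the values of its numbers."""
    return path_skeleton(d) == path_skeleton(sample_d)

sample = "M47.5,50.5c4.24,1.76,10.94,7.25,12,10"
print(path_skeleton(sample))
print(matches_sample("M10,20c1-2,3-4,5,6", sample))
print(matches_sample("M10,20L30,40", sample))
```

The comparison is case-sensitive, so an absolute "C" does not match a
relative "c"; whether those should count as the same pattern is an open
question.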

The thing is that the SVG "d" attribute is a string of commands, usually
abbreviated to single alphabetic letters, and numbers that are parameters
to those commands. A value might be something like
"M47.5,50.5c4.24,1.76,10.94,7.25,12,10" .
That describes a starting point ("M47.5,50.5") followed by a cubic
Bézier segment ("c4.24,1.76,10.94,7.25,12,10"). If we change the numbers,
but not the letters and commas, we can put the curve elsewhere and change
its shape somewhat, but we will remain constrained to keep it to one
starting point and one cubic Bézier segment. That means if we have a nice
way of drawing that type of stroke (for instance, putting in the uroko in
a Mincho typeface) we should be able to come up with the right drawing for
all strokes of that type by simple interpolation. (I have my doubts about
that, but it wasn't my idea, and I certainly agree that enforcing the
"same control point pattern" rule will help a lot.)

CONCLUSION

I think that covers all items on my list.

--
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/
