SKIP Lookup Method

Lee Hericks

Nov 10, 2014, 8:38:20 PM

to kan...@googlegroups.com

Those familiar with the NJECD (The Kodansha Kanji Dictionary as of last year) and the Kodansha Kanji Learner's Dictionary may know of the SKIP lookup method.

http://www.kanji.org/kanji/dictionaries/skip_permission.htm

As the link below shows, besides the leading pattern number, a part of each kanji is shaded (highlighted in red).

http://www.kanji.org/kanji/dictionaries/features/skip.htm

KanjiVG marks radicals and elements as SVG groups. The SKIP highlighted parts are sometimes, but not always, the same as the radicals. Additionally, in Pattern 4 the highlighted part is sometimes not even a complete stroke.

So the question is where to put this data? Some options:

1. Create a new "layer" similar to the stroke numbers. This would duplicate strokes from the main data and, for some Pattern 4 kanji, add min and max numbers between 0.0 and 1.0 denoting the portion of the stroke to color (or edit the stroke path data).
2. Add attributes to the strokes, like skip-shaded="yes" skip-shaded-min="0.6" skip-shaded-max="1.0".
3. Create an external XML file with entries by kvg id, each with a list of stroke ids and possibly min/max attributes. In this case editors must be careful to sync stroke order changes in the main kanji files, which is probably not desirable.
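
To make the min/max idea concrete, here is a minimal Python sketch of cutting out the portion of a stroke between two fractions of its length. It assumes the stroke has already been flattened to a polyline of (x, y) points, which is a simplification: real KanjiVG strokes are SVG curves. The function name sub_polyline is made up for illustration.

```python
import math

def sub_polyline(points, t_min, t_max):
    """Return the part of a polyline between arc-length fractions t_min and t_max."""
    if not (0.0 <= t_min <= t_max <= 1.0):
        raise ValueError("expected 0.0 <= t_min <= t_max <= 1.0")
    seg_lengths = [math.dist(a, b) for a, b in zip(points, points[1:])]
    total = sum(seg_lengths)
    start, end = t_min * total, t_max * total

    def point_at(distance):
        # Walk the segments until the one containing `distance` is found.
        walked = 0.0
        for (a, b), length in zip(zip(points, points[1:]), seg_lengths):
            if length > 0 and walked + length >= distance:
                f = (distance - walked) / length
                return (a[0] + f * (b[0] - a[0]), a[1] + f * (b[1] - a[1]))
            walked += length
        return points[-1]

    result = [point_at(start)]
    walked = 0.0
    for vertex, length in zip(points[1:], seg_lengths):
        walked += length
        # Keep original vertices that fall strictly inside the shaded range.
        if start < walked < end:
            result.append(vertex)
    result.append(point_at(end))
    return result

# An L-shaped stroke of total length 20, shaded from 25% to 75% of its length.
print(sub_polyline([(0, 0), (10, 0), (10, 10)], 0.25, 0.75))
```

A renderer could draw the returned points in the highlight color on top of the full stroke, so the main stroke data would not need to be edited.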

The last consideration is the SKIP licensing, which is almost the same as KanjiVG, except that permission is required for commercial apps.

If Ulrich or the collective is not interested in this data, that's fine by me, but it will be made anyway. Just wishing to contribute to the project, and also if people are editing the kanji I don't wish our data to get out of sync with updates.

Lee

Lee Hericks

Nov 10, 2014, 11:29:30 PM

to kan...@googlegroups.com

Here is an example of 仙 (04ed9.svg) for option 1:

</g>

</g>

</g>

</g>

</svg>

Lee Hericks

Nov 10, 2014, 11:34:56 PM

to kan...@googlegroups.com

Option 2:

</g>

</g>

Lee Hericks

Nov 10, 2014, 11:38:15 PM

to kan...@googlegroups.com

Option 3 (can easily get out of sync):

<shaded-strokes>

</shaded-strokes>

</character>

msk...@ansuz.sooke.bc.ca

Nov 11, 2014, 3:12:01 AM

to kan...@googlegroups.com

On Mon, 10 Nov 2014, Lee Hericks wrote:
> So the question is where to put this data? Some options:
>
> 1. Create a new "layer" similar to the stroke numbers, have to duplicate
> strokes from the main data, and sometimes add a max and min number
> between 0.0 and 1.0 denoting a portion of the stroke to color for some
> of pattern 4 (or edit the stroke path data)
> 2. Add attributes to the strokes like skip-shaded="yes"
> skip-shaded-min="0.6" skip-shaded-max="1.0"
> 3. Create an external xml file with entries by kvg id with a list of stroke
> ids and possibly min/max attributes. In this case editors must be
> careful to sync stroke order changes in the main kanji files, which is
> probably not desirable.

I suggest that attributes, in a separate XML namespace, are the best plan.
Attempting to represent the information in terms of the tag hierarchy
(i.e. having a "shaded-strokes" tag that encloses the shaded strokes)
is likely to cause problems because of the existing issues with tags not
nesting in a useful way. Synchronizing separate files, although also a
problem, isn't the biggest problem with that approach.

--
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/
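
As an illustration of the namespaced-attribute approach, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The namespace URI, the skip: prefix, the stroke ids, and the path data are all made up for the example; only the attribute names follow the skip-shaded / min / max idea from option 2.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for SKIP metadata; not part of KanjiVG.
SKIP_NS = "http://example.org/skip"

# Simplified, made-up stroke data: real KanjiVG files carry more attributes.
SVG = """<svg xmlns="http://www.w3.org/2000/svg" xmlns:skip="http://example.org/skip">
  <g id="demo">
    <path id="demo-s1" d="M10,10 L10,90" skip:shaded="yes"/>
    <path id="demo-s2" d="M30,20 L90,20" skip:shaded="yes"
          skip:shaded-min="0.6" skip:shaded-max="1.0"/>
    <path id="demo-s3" d="M30,50 L90,50"/>
  </g>
</svg>"""

def shaded_strokes(svg_text):
    """Return (stroke id, min, max) for every stroke flagged as shaded."""
    root = ET.fromstring(svg_text)
    result = []
    for path in root.iter("{http://www.w3.org/2000/svg}path"):
        if path.get(f"{{{SKIP_NS}}}shaded") != "yes":
            continue
        # A missing min/max means the whole stroke is shaded.
        lo = float(path.get(f"{{{SKIP_NS}}}shaded-min", "0.0"))
        hi = float(path.get(f"{{{SKIP_NS}}}shaded-max", "1.0"))
        result.append((path.get("id"), lo, hi))
    return result

print(shaded_strokes(SVG))
```

Because the attributes live in their own namespace, tools that only understand plain SVG can ignore them, and the group nesting of the file is left untouched.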

Alexandre Courbot

Nov 16, 2014, 8:07:09 AM

to KanjiVG

Hi, sorry for the time it took me to come back to this.

I'm afraid there is another serious obstacle to the inclusion of SKIP
data to KanjiVG, which is the licensing of the SKIP data itself. I
have been required to remove all SKIP data (as well as the SKIP lookup
functionality) from Tagaini because the SKIP licensing
(CC-noncommercial) is not compatible with the requirements of most
Linux distributions (the non-commercial clause is a no-go). Including
this data into KanjiVG would mean that we need to turn its licence
into non-commercial as well, which in turn would affect the software
that uses it, preventing it from being distributed as free software.

So until the licensing issue can be solved (which means, until someone
can convince the SKIP owner to remove the non-commercial clause from
its license), this data cannot be included into KanjiVG.


Lee Hericks

Nov 17, 2014, 2:02:36 AM

to kan...@googlegroups.com

Alex, I'm afraid I don't follow you. Your app is free, it's non-commercial. As such you have no issues with including SKIP data. Commercial applications require written permission from Jack Halpern.

I will discuss this with him but I don't think he will change the license type.

Lee

Alexandre Courbot

Nov 17, 2014, 9:56:23 PM

to KanjiVG

Hi Lee,

The app is free; however, the medium which distributes it does not need
to be. You can perfectly well buy a Debian or Ubuntu DVD, and this falls
under "commercial applications" and is not permitted by SKIP's
license. Therefore any app that includes this data could not be
included into e.g. Debian.

KanjiVG is a database of free information about kanji, and this
freedom includes the unconditional right to make commercial
applications with it. Any data that is not compatible with these
rights cannot be included into KanjiVG.

Lee Hericks

Dec 26, 2014, 12:51:41 AM

to kan...@googlegroups.com

Ta-da!

http://www.kanji.org/kanji/dictionaries/skip_permission.htm

I had a very nice meeting with Jack Halpern and SKIP is now available to all, including commercial applications. I hope we can discuss some KanjiVG structuring and appendix information formatting next year.

Lee

Alexandre Courbot

Dec 29, 2014, 12:50:08 PM

to KanjiVG

Absolutely awesome! This will be useful not only to KanjiVG, but also
to many other projects out there. Thanks a lot for taking the time to
do this! And thanks to Jack Halpern for reconsidering his licencing
terms.

Lee Hericks

Dec 30, 2014, 3:04:26 AM

to kan...@googlegroups.com

We need to discuss a few things on this. One is how to add the SKIP shaded-strokes metadata (which in pattern 4 is sometimes only pieces of the actual stroke), and in general how to clean up the organization and for me to document the project. Another is that the way kanji are divided doesn't always yield parts which are usual radicals, so there is much deeper data for SKIP which I will be able to get from Jack.

Anyway, let's start discussing it more after the holidays.

Alexandre Courbot

Jan 3, 2015, 12:38:24 PM

to KanjiVG

On Tue, Dec 30, 2014 at 9:04 AM, Lee Hericks <lee.h...@me.com> wrote:
> We need to discuss a few things on this. One is how to add the SKIP shaded-strokes metadata (which in pattern 4 is sometimes only pieces of the actual stroke), and in general how to clean up the organization and for me to document the project. Another is that the way kanji are divided doesn't always yield parts which are usual radicals, so there is much deeper data for SKIP which I will be able to get from Jack.

All the issues you mention (adding additional data; documenting the
project; cleaning up the organization) hint that we need a cleaner, more
concise, more extensible, and less redundant format than what we are
currently using. It is fine to generate SVGs for direct consumption by
users, but working directly on them is just painful and inefficient.

I know we all have a (very) limited time to dedicate to this project,
however thinking of a new format is something that we can do by email,
slowly, and without disturbing the current workflow. Are we willing to
give it a try?

Lee Hericks

Jan 13, 2015, 8:31:15 PM

to kan...@googlegroups.com

Hi Alexandre! Sorry for the late reply. The start of the new year has been hectic here at work.

Hmm, the very name of KanjiVG is based on SVG. SVG also has metadata and other features that may be very useful to better organize the data. I don't think I would go as far as to suggest implementing the SVG font specification though. SVG is also pretty visual for previewing and editing. I'm not sure what a better solution might be. If I were to suggest anything it would still be a custom XML format. Or it may be that we need to alter the SVG structure and also create an XML index file that maps variants, etc.

Even if we go with a database or a different way to organize the data, I'm still highly in favor of SVG paths. It's a documented standard that anyone can look up and learn to parse, or find an existing parser for.

ospalh

Jan 14, 2015, 1:47:07 PM

to kan...@googlegroups.com

On Wednesday, 14 January 2015 at 02:31:15 UTC+1, Lee Hericks wrote:

> Even if we go with a database or a different way to organize the data, I'm still highly in favor of SVG paths. It's a documented standard that anyone can look up and learn to parse, or find an existing parser for.

I'm for keeping the SVG paths, too. At least there we should not invent a circular transportation facilitation device.

Also, I think it would be nice if, in the end, we could still accept changes made to SVG files. Actually, a way to accept stroke data from an SVG file that is not in the strict KanjiVG format would be nice. I’m using this quick hack to transfer graphical changes back to the formatted SVG files, because what comes out of Inkscape doesn’t look at all like what we want.

Alexandre Courbot

Jan 14, 2015, 8:47:41 PM

to KanjiVG

On Wed, Jan 14, 2015 at 10:31 AM, Lee Hericks <lee.h...@me.com> wrote:
> Hi Alexandre! Sorry for the late reply. The start of the new year has been
> hectic here at work.
>
> Hmm, the very name of KanjiVG is based on SVG. SVG also has metadata and
> other features that may be very useful to better organize the data. I don't
> think I would go as far as to suggest implementing the SVG font
> specification though. SVG is also pretty visual for previewing and editing.
> I'm not sure what a better solution might be. If I were to suggest anything
> it would still be a custom XML format. Or it may be that we need to alter
> the SVG structure and also create an XML index file that maps variants, etc.

Let me try to compile a list of my grievances with the current format:
- A lot of data that should be computed is currently manually
maintained, which is quite error-prone. For instance,
https://github.com/KanjiVG/kanjivg/commit/74ae6fc602b5d6cae0ad5a5325bfa12be2fd59a7
fixes some path and groups names that were not properly updated when
copying a kanji file. We should never have to worry about such
details.
- Variants are, essentially, a copy of the original character with
small variations. It would be nice to only capture the "diff" to avoid
this duplication that makes maintaining the project difficult.
- A lot (and I mean a lot!) of information is redundant. Components
appearing in many characters very often share the same structure.
Again, at the moment this information is duplicated X times, which
makes fixes much more painful than they should be.
- The character structuring is hard to understand because we are
constrained by the strokes order, so we sometimes need to split
components into different parts if their strokes are not all
sequential. A much easier format would be to directly specify which
strokes belong to which component.
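
To give the discussion something concrete, here is a rough Python sketch of what such a working representation could look like. It is only an illustration of two points above (ids computed instead of hand-maintained, and components listing their strokes directly), not a proposed format. The class names, the placeholder path data, and the id scheme (modeled on the 04ed9.svg file name) are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Component:
    element: str    # the component's own character, e.g. "亻"
    strokes: tuple  # 1-based stroke numbers; they need not be consecutive

@dataclass
class Kanji:
    char: str
    paths: list       # SVG path data, one entry per stroke, in stroke order
    components: list  # Component objects

    def stroke_id(self, number):
        # Derived from the code point, so it cannot go stale when a file is copied.
        return f"kvg:{ord(self.char):05x}-s{number}"

    def unassigned_strokes(self):
        # Strokes that no component claims; useful as a consistency check.
        used = {n for c in self.components for n in c.strokes}
        return [n for n in range(1, len(self.paths) + 1) if n not in used]

# 仙 has five strokes: two for 亻 and three for 山. Path data is a placeholder.
sen = Kanji(
    char="仙",
    paths=["M0,0"] * 5,
    components=[Component("亻", (1, 2)), Component("山", (3, 4, 5))],
)
print(sen.stroke_id(3), sen.unassigned_strokes())
```

Since each component simply lists its stroke numbers, a component whose strokes are interrupted by another one would not have to be split into several parts.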

Providing SVG as release data is fine with me, but the working format
could be dramatically improved; this is my proposal. Of course it
could also be based on a better-constructed XML, but XML carries a lot of
noise. A specific format might be better suited and easier to work
with.

>
> Even if we go with a database or a different way to organize the data, I'm
> still highly in favor of SVG paths. It's a documented standard that anyone
> can look up and learn to parse, or find an existing parser for.

Yes, dropping SVG paths was never in my mind.

Lee Hericks

Jan 14, 2015, 9:06:52 PM

to kan...@googlegroups.com

Let's just first talk about the concept of a "working format". If the strokes are SVG paths and the working format is not SVG, how do you propose to do editing of strokes?

Lee Hericks

Jan 14, 2015, 11:22:26 PM

to kan...@googlegroups.com

Alex, I'd still like to get together and talk in person btw. I think we could have a lot faster back and forth conversation on this and have something a little more solid to suggest here. ospalh, are you in Germany? We could do a Skype chat or something as well.

msk...@ansuz.sooke.bc.ca

Jan 15, 2015, 2:23:21 AM

to KanjiVG

I wrote extensive notes on my thoughts about the KanjiVG format a few
years ago and you can find them in the archives of this mailing list. I
can't really be an active participant in KanjiVG development right now
just on the speculation that it would be helpful to current users. If my
academic career is going to happen, it's pretty much an imperative for me
to do things right now that result in publications I can use to get a
permanent faculty job, and the closest thing to KanjiVG development that
seems to fit in that category is my work on IDSgrep - which can also use
almost any other kanji database, and currently uses several including
KanjiVG.

What it would take for KanjiVG to be more useful to IDSgrep would be real
priority on correctly representing the hierarchical structure of
characters. A lot of details of what I mean and how to do it are in my
notes from before. Representing stroke order with an XML attribute,
instead of by the sequential order of tags in the file, would be quite
beneficial, and I have written some code to convert to such a structure,
which is in my GitHub repository forked from mainline KanjiVG.

I also think that having tests for the correctness of whatever file format
is used - for instance, when a component is supposed to have a "left" and
a "right" side, making sure that it really does have exactly one of each
and not some other structure - is the only way the project can really
expect to improve data quality from its current level. It should be
possible to type "make check" and get a report on whether the data is
internally consistent; then we can know what needs to be fixed and make
sure it stays fixed. Any "correctness" constraint that is not regularly
checked will inevitably stop being true as changes are made.
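
As a sketch of the kind of test meant here, the following Python function flags any group whose direct children do not pair exactly one "left" with exactly one "right". The plain position attribute and the sample markup are simplified stand-ins invented for the example, not the actual KanjiVG schema.

```python
import xml.etree.ElementTree as ET

def check_left_right(root):
    """Return ids of groups whose direct children do not pair one 'left' with one 'right'."""
    problems = []
    for group in root.iter("g"):
        positions = [child.get("position") for child in group.findall("g")]
        left, right = positions.count("left"), positions.count("right")
        # Only groups that use left/right at all are checked.
        if (left or right) and (left, right) != (1, 1):
            problems.append(group.get("id"))
    return problems

# Made-up sample data: one consistent group, one with two "left" parts.
GOOD = '<g id="a"><g id="a-g1" position="left"/><g id="a-g2" position="right"/></g>'
BAD = '<g id="b"><g id="b-g1" position="left"/><g id="b-g2" position="left"/></g>'

print(check_left_right(ET.fromstring(GOOD)))
print(check_left_right(ET.fromstring(BAD)))
```

A "make check" target could run a collection of such functions over every file and fail if any of them returns a non-empty list.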

To the extent that there are format conversions involved in however the
system works (for instance, if there is a big master file that gets split
into smaller ones, or vice versa) it should be possible to do these
conversions in a systematic way, not by manually running a selection of
Python scripts from a directory of unlabelled and undocumented Python
scripts that also includes many that don't work. Putting the code into
Git is a nice first step, but it will need to be structured as a real
package - with tests and a build system - before it can really be improved.
Having it structured like other free software packages would also increase
its visibility, because we could get the package into Linux distributions
and package repositories and solicit community involvement in the same way
that other software packages operate. We could do things like nightly
builds, with immediate or near-immediate feedback when a change breaks
something.

I restructured KanjiVG as a real software package with tests and a build
system a few years ago, and to be blunt, you refused to look at the
results or take that goal seriously. I'm not available to work on it more
now, but it remains in my GitHub repository in the state I left it, for
anyone who might want to try again.

Lee Hericks

Jan 15, 2015, 2:28:06 AM

to kan...@googlegroups.com

Matthew, thanks for writing. I'm sure the project maintainers have changed over time or not had sufficient time to look into it. I'm new. I will try to take a look at your previous work and comments.

Lee

Alexandre Courbot

Jan 18, 2015, 11:16:44 PM

to KanjiVG

All good points from Matthew.

The main issue of this project is the lack of time from its
participants. For most/all of us this project is not professionally
bound and thus the amount of time we can contribute to improve it is
limited. Having more academics with a strong incentive to work on it
for their own research is what could change that sad state of things.

Also, improving the work format is a way to ensure this time is better
spent on actual fixes/improvements instead of tasks that could be
automated or suppressed. This is why I am pushing for this change
first thing.

Ben Bullock

May 6, 2022, 10:26:59 PM

to KanjiVG

On Tuesday, 11 November 2014 at 10:38:20 UTC+9 lee.h...@me.com wrote:

> If Ulrich or the collective is not interested in this data, that's fine by me, but it will be made anyway.

Sorry to drag up an old thread from eight years ago, but presumably the data is finished by now? Has it been made public?
