Bugs in stroke order / stroke number placement

Jan Eichhorn

Jun 13, 2012, 2:13:22 PM

to KanjiVG

I have written a program to compare the stroke number text positions
with the corresponding stroke path start coordinates. Normally, the
stroke number should be close to the stroke that it belongs to.

The program has revealed a lot of files that most likely contain
errors, either in the stroke order or in the stroke number placement.
It counts 674 files with possible stroke swaps and 274 files with
missing or incomplete stroke numbers (out of a total of 11251 files).

If anyone is interested, I could supply the program output.

Greetings, Jan

The Top 5:
=======
09f4f-Kaisho
09f67-HzFst
07fb6-Kaisho/07fb6
09d87
08e72-Kaisho

Ben Bullock

Jun 13, 2012, 10:15:58 PM

to kan...@googlegroups.com

On 14 June 2012 03:13, Jan Eichhorn <jan.philip...@googlemail.com> wrote:
> I have written a program to compare the stroke number text positions
> with the corresponding stroke path start coordinates. Normally, the
> stroke number should be close to the stroke that it belongs to.
>
> The program has revealed a lot of files that most likely contain
> errors, either in the stroke order or in the stroke number placement.
> It counts 674 files with possible stroke swaps and 274 files with
> missing/ incomplete stroke numbers (in a total of 11251 files).
>
> If anyone is interested, I could supply the program output.

A github gist would be a good place to put those.

msk...@ansuz.sooke.bc.ca

Jun 13, 2012, 10:28:54 PM

to KanjiVG

On Wed, 13 Jun 2012, Jan Eichhorn wrote:
> I have written a program to compare the stroke number text positions
> with the corresponding stroke path start coordinates. Normally, the
> stroke number should be close to the stroke that it belongs to.

If you want to share your program, I could add it to the test suite I'm
building in https://github.com/mskala/kanjivg .
--
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/

Jan Eichhorn

Jun 14, 2012, 9:02:16 AM

to KanjiVG

I followed the suggestion and placed the output in
https://gist.github.com/2930025

What the program does:
It calculates the distance from the center of each stroke number's
text to the start point of the corresponding stroke. If the distance
is greater than 15, the text is considered "far from stroke". A stroke
is reported if another text lies within a distance of 15 of its start
point and that other text is itself "far" from its own stroke. A stroke
found this way is most likely involved in mixed-up stroke numbering.
My guess is that this is a relatively conservative way of reporting
that underestimates the number of actual bugs in the files. For kanji
with many crowded strokes there is probably no automatic way of
detecting bugs.
The program (in C#) is too hacky to show to the public and I don't
feel like cleaning it up. Sorry, I will keep it private.
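
For readers who want to reproduce the check, here is a minimal sketch of the heuristic described above, written in Python rather than the unpublished C#. It is not the original program. The function name, the input format (lists of already-extracted coordinates), and the reading that a stroke is only examined when its own number is "far" are assumptions; only the threshold of 15 comes from the post.

```python
import math

FAR = 15.0  # threshold from the post; assumed to be in SVG user units


def find_suspect_strokes(labels, starts, far=FAR):
    """Return 0-based indices of strokes likely involved in a number mix-up.

    labels[i] is the (x, y) center of the text for stroke i (hypothetical input),
    starts[i] is the (x, y) start point of the path of stroke i.
    """
    n = min(len(labels), len(starts))

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    # A text is "far from stroke" when it is more than `far` from its own start.
    is_far = [dist(labels[i], starts[i]) > far for i in range(n)]

    suspects = []
    for i in range(n):
        if not is_far[i]:
            continue  # assumption: only strokes whose own number is far are examined
        for j in range(n):
            # Another text sits near this stroke's start while being far from
            # its own stroke: the two numbers are probably mixed up.
            if j != i and is_far[j] and dist(labels[j], starts[i]) < far:
                suspects.append(i)
                break
    return suspects


# Numbers of strokes 0 and 1 are swapped; stroke 2 is labelled correctly.
starts = [(10, 10), (80, 80), (50, 20)]
labels = [(82, 78), (12, 8), (48, 22)]
print(find_suspect_strokes(labels, starts))  # prints [0, 1]
```

A number that is merely misplaced, with no other number near the stroke start, is not reported by this sketch, which matches the conservative behaviour described in the post.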

Greetings, Jan

Alexandre Courbot

Jun 16, 2012, 5:00:36 AM

to kan...@googlegroups.com

That's helpful, thanks.

If you want to bring it to the next level of helpfulness, I'd also be
delighted to merge the fixes that one could create using this data.
Right now nobody seems to have time to invest in doing it, and it
would be sad if these faulty numbers do not get some love.

Alex.


Jan Eichhorn

Jun 16, 2012, 11:35:22 AM

to kan...@googlegroups.com

Sorry, I won't do any fixes. I'm not competent enough anyway. You would need someone who can check the stroke order at a glance. I don't think there is a meaningful way of doing automatic fixes: there are too many ambiguities. The result would still be a lot of garbage. This will be a manual task.

Jan

Jan Eichhorn

Jul 17, 2012, 7:47:09 PM

to kan...@googlegroups.com

1) I updated my number placement checker. The program is now a bit smarter: it computes the distance from the stroke number to the stroke as a whole, not only to the stroke's start point. It also tries to detect wrong stroke directions (or a number placed at the wrong end of a stroke). I also used more pedantic settings, so 1644 files are now reported as buggy.

The output is here (sorted by file name and by file error count):

https://gist.github.com/3132779

2) I wrote a program to compute the stroke number placement automatically. The output may be useful for checking the manual stroke numbers. A zip file with all the individual SVG files is here (click the download link):

https://docs.google.com/open?id=0B-TA0GJ6dksVVnRiZExtTkdpSU0

It has both manual (grey) and automatic (red) stroke numbers overlaid. The automatic stroke numbers are centered on the stroke start points where possible, so it should be easy to identify the stroke each one belongs to. The auto numbers are guaranteed to match the stroke order as given by the stroke id, but of course the stroke ids are not always correct. The format of the auto-numbered SVG files differs a bit from the original ones: no inline DTD, two groups with stroke numbers, and the auto numbers use two font sizes.
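
The refined check in item 1 (distance to the whole stroke, plus a direction test) could look roughly like the following Python sketch. It is an illustration only, not the actual checker: it assumes each stroke has already been flattened into a polyline of sampled points (KanjiVG paths are curves, and the sampling step is left out), and the function names are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the line segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: both ends coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the parameter to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_stroke(p, polyline):
    """Distance from p to a stroke given as a list of sampled points."""
    if len(polyline) == 1:
        return math.hypot(p[0] - polyline[0][0], p[1] - polyline[0][1])
    return min(point_segment_distance(p, a, b)
               for a, b in zip(polyline, polyline[1:]))


def looks_reversed(label, polyline):
    """True when the number sits closer to the stroke's end than to its start,
    hinting at a wrong stroke direction or a number placed at the wrong end."""
    start, end = polyline[0], polyline[-1]
    d_start = math.hypot(label[0] - start[0], label[1] - start[1])
    d_end = math.hypot(label[0] - end[0], label[1] - end[1])
    return d_end < d_start


stroke = [(0, 0), (10, 0), (10, 10)]
print(distance_to_stroke((5, 3), stroke))   # prints 3.0
print(looks_reversed((11, 11), stroke))     # prints True
```

Whether a flagged stroke has the wrong direction or merely a misplaced number cannot be decided from these distances alone, which is consistent with the point made earlier in the thread that the fixes have to be done by hand.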

Jan
