Switch to new format official - update your parsers!


Alexandre Courbot

Jul 18, 2011, 4:28:37 AM
to KanjiVG
At long last, I finally pushed the switch to the new format! Hopefully
this also means that releases will be done again after every fix on
the data.

The KanjiVG data is now available in two different packages:
- one that includes the raw SVG files, one per character. Every file
is valid SVG and contains all the information from the previous XML +
SVG pair, which makes it easier to visualize, edit, and makes it
impossible to have mismatching kanji anymore.
- one that aggregates all the data into a single XML file, like
before. People using this format will only have a few changes to make to
their parsers. The tag and attribute names have changed to match
those of the SVG data. Apart from this change, structure and parsing
should be identical.
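
For readers updating a parser, here is a minimal sketch of reading the stroke paths from one per-character SVG file. It only assumes that the file is valid SVG with one <path> per stroke in document order; the sample document below is made up for illustration and is much simpler than a real KanjiVG file, which carries extra attributes described on the format page.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical sample; real KanjiVG files carry more attributes and nesting.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 109 109">
  <g id="strokes">
    <path d="M10,20 L90,20"/>
    <g>
      <path d="M50,10 L50,90"/>
    </g>
  </g>
</svg>"""


def stroke_paths(svg_text):
    """Return the d attribute of every <path>, in document order."""
    root = ET.fromstring(svg_text)
    # ElementTree addresses namespaced tags as "{namespace}localname".
    return [p.get("d") for p in root.iter(f"{{{SVG_NS}}}path")]


print(stroke_paths(SAMPLE_SVG))
```

Walking the tree with iter() rather than looking only at direct children matters here, since strokes can sit inside nested groups.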

Also, the Git repository has been moved to GitHub to make things
easier. I am just waiting for Ben to remove his Gitorious clone so
that I can delete mine (a Gitorious limitation; that reason alone is
enough to move to GitHub). With GitHub, I hope more people will fix the
data directly and propose changes via pull requests, like Ben did,
instead of using the "Incorrects kanji" page.

The Git repo still contains the old XML/SVG directories, but these
should go soon. The data is now available in a kanji/ directory (11252
characters, including variants) for kanji that successfully merged,
and in kanji_mismatch for kanji that did not merge (263 of them). The
goal is of course to have the kanji_mismatch directory disappear.

I have updated the website accordingly, and made the format
description page official. It is still raw but should be enough to
understand the format.

Hope this will make things smoother!
Alex.

Ben Bullock

Jul 20, 2011, 9:19:14 PM
to kan...@googlegroups.com
On 18 July 2011 17:28, Alexandre Courbot <gnu...@gmail.com> wrote:
> At long last, I finally pushed the switch to the new format! Hopefully
> this also means that releases will be done again after every fix on
> the data.
>
> The KanjiVG data is now available in two different packages:
> - one that includes the raw SVG files, one per character. Every file
> is valid SVG and contains all the information from the previous XML +
> SVG pair, which makes it easier to visualize, edit, and makes it
> impossible to have mismatching kanji anymore.

I think this switch to a new format is not a very good idea. Previous
to this it was possible to edit the XML and SVG files separately in
case of errors or mismatches, but now that they are combined it is very hard
to edit them. I would suggest reverting this change and putting
back the old SVG and XML directories. Just as an example I have
attached a PNG which is another error. Prior to the format change it
was possible to fix this kind of problem very easily simply by
shuffling the lines of the SVG file. Now, in order to fix this I have
to do some very dirty work involving stripping the d attributes from
path elements, which seems very error-prone to me.
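
To make the problem concrete, here is a rough sketch of what swapping two strokes looks like in a combined file: the d attributes of two <path> elements are exchanged while every other attribute stays where it is. The function name and the 1-based numbering are assumptions for illustration, not part of any KanjiVG tool. Note that ElementTree re-serializes the whole document, so comments are dropped and formatting may change.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def swap_strokes(svg_text, first, second):
    """Swap the path data of two strokes, numbered from 1 in document order."""
    # Keep the SVG namespace as the default one when writing the file back.
    ET.register_namespace("", SVG_NS)
    root = ET.fromstring(svg_text)
    paths = list(root.iter(f"{{{SVG_NS}}}path"))
    a, b = paths[first - 1], paths[second - 1]
    data_a, data_b = a.get("d"), b.get("d")
    # Only the drawing data moves; ids and other attributes stay in place.
    a.set("d", data_b)
    b.set("d", data_a)
    return ET.tostring(root, encoding="unicode")
```

For the attached example this would be called as swap_strokes(text, 10, 11).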

Can we please go back to the old format?


> The Git repo still contains the old XML/SVG directories, but these
> should go soon.

I think it would have been better to discuss this change before making it.

> Hope this will make things smoother!

Unfortunately it hasn't done that; I think it's better to stick to the
original XML/SVG method.

6518.png

Ben Bullock

Jul 20, 2011, 10:48:30 PM
to kan...@googlegroups.com
I have gone ahead with reverting back to the old system so that it
would be possible to continue making corrections to the svgs.

The following branch now contains the latest edits:

https://github.com/benkasminbullock/kanjivg/tree/use-xml-svg-directories


There are about eight edits. The attached diagram shows one example
where strokes 10 and 11 are the wrong way round.

The script check-all-strokes.pl

https://github.com/benkasminbullock/kanjivg/blob/use-xml-svg-directories/check-all-strokes.pl

is now set up to ignore kanji containing certain elements which
clearly contain errors in the type field of one or more strokes.

At the moment that is the following list:

冬 羽 尽 辛 手 羊 冫 半

Any kanji which contains any of these as an element is excluded from the check.
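
The exclusion rule can be sketched as follows. This is a Python illustration of the idea only, not the logic of check-all-strokes.pl itself; the function name and the way the element decompositions are stored are assumptions.

```python
# Elements whose stroke "type" data is known to be unreliable (list from above).
EXCLUDED_ELEMENTS = set("冬羽尽辛手羊冫半")


def should_check(kanji_elements):
    """Return True if none of the kanji's elements is on the exclusion list."""
    return EXCLUDED_ELEMENTS.isdisjoint(kanji_elements)


# Hypothetical element decompositions, for illustration only.
kanji = {"習": ["羽", "白"], "林": ["木", "木"]}
to_check = [k for k, elements in kanji.items() if should_check(elements)]
print(to_check)
```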

There seem to be a few other things which cause common errors but
unfortunately some of them do not have an "element" field or the
element field is confusing.

I have continued to commit the file check-1.0 even though it is
machine-generated, for the sake of people who might find it hard to
run the Perl script. The Perl script relies on modules which need to
be installed from CPAN so I think it is worth including for the sake
of users who don't want to have to install Perl modules or run a
script.

Anyway I apologize for reverting back to the old system but these
corrections would have taken hours with the new system or required
extensive programming, but could be completed in a few minutes using
the old version of the files and simple text edits.

5dcd.png

Alexandre Courbot

Jul 21, 2011, 5:54:04 PM
to kan...@googlegroups.com
On Thu, Jul 21, 2011 at 10:19 AM, Ben Bullock
<benkasmi...@gmail.com> wrote:
> I think this switch to a new format is not a very good idea. Previous
> to this it was possible to edit the XML and SVG files separately in
> case of errors or mismatches, but now that they are combined it is very hard
> to edit them. I would suggest reverting this change and putting
> back the old SVG and XML directories. Just as an example I have
> attached a PNG which is another error. Prior to the format change it
> was possible to fix this kind of problem very easily simply by
> shuffling the lines of the SVG file. Now, in order to fix this I have
> to do some very dirty work involving stripping the d attributes from
> path elements, which seems very error-prone to me.
>
> Can we please go back to the old format?

That's quite a valid concern, and actually I also realized it would
not be as easy as before to add missing SVG paths or to switch two
strokes, which are two very common fix patterns. On the other hand,
the previous format has one big drawback, which is that it makes it
easy to have different stroke counts between the XML and SVG files -
by design, this cannot happen with the new format. This has been the
biggest inconvenience so far and we still have 200 of these to fix. In
addition, the tagging was not consistent between the files and some
information (structure of the kanji) was uselessly duplicated. I would
like to keep these advantages.

Still, we need a way to allow these changes to take place. I can think
of one right now: we could have a small script that extracts the SVG
paths into a separate SVG file that one can edit and merge back into
the original file. The structure of this file would be kept simple to
allow easy editing - and we could also have comments numbering the
strokes to easily find them (the old format was more messy in that
respect). Such a script would be simple to write and maintain.
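
A minimal sketch of that split/merge idea, under some assumptions: every stroke is a <path> element with a double-quoted d attribute, and the function names and the layout of the temporary file are invented here for illustration. It uses a regular expression instead of an XML parser so that the rest of the original file is left byte-for-byte untouched.

```python
import re

# Captures everything up to and including d=" in group 1, the path data in group 2.
PATH_D = re.compile(r'(<path\b[^>]*?\bd=")([^"]*)')


def split_strokes(svg_text):
    """Build a strokes-only SVG with a numbering comment before each path."""
    out = ['<svg xmlns="http://www.w3.org/2000/svg">']
    for n, m in enumerate(PATH_D.finditer(svg_text), start=1):
        out.append(f"<!-- stroke {n} -->")
        out.append(f'<path d="{m.group(2)}"/>')
    out.append("</svg>")
    return "\n".join(out) + "\n"


def merge_strokes(svg_text, strokes_svg):
    """Write the (possibly reordered) path data back into the original file."""
    data = [m.group(2) for m in PATH_D.finditer(strokes_svg)]
    if len(data) != len(PATH_D.findall(svg_text)):
        raise ValueError("stroke count differs from the number of <path> elements")
    replacements = iter(data)
    # Replace path data in document order; all other attributes are preserved.
    return PATH_D.sub(lambda m: m.group(1) + next(replacements), svg_text)
```

Fixing "strokes 10 and 11 are the wrong way round" then means swapping two lines of the temporary file and merging. Adding a missing stroke would need more than this sketch, since the merge refuses a changed stroke count.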

We could also permanently separate the strokes in the git repo (which
would be functionally equivalent to switching back to the previous
format), but knowing the pain it was to maintain, I would really
like to avoid that.

> I think it would have been better to discuss this change before making it.

http://groups.google.com/group/kanjivg/browse_thread/thread/cd6f500658c21797?hl=en_US

Since I had no reaction there I decided to go ahead - KanjiVG is not
very active, so things sometimes just have to move on.

> I have gone ahead with reverting back to the old system so that it
> would be possible to continue making corrections to the svgs.
>
> The following branch now contains the latest edits:
>
> https://github.com/benkasminbullock/kanjivg/tree/use-xml-svg-directories

Merged, thanks - both old and new format directories reflect these changes.

> I have continued to commit the file check-1.0 even though it is
> machine-generated, for the sake of people who might find it hard to
> run the Perl script. The perl script relies on modules which need to
> be installed from CPAN so I think it is worth including for the sake
> of users who don't want to have to install Perl modules or run a
> script.

Well, it's not supposed to be here forever so let's keep it for now,
but as a general rule we should not have generated content on the git
repo.

Btw, I realized I pushed the wrong branch (one that was later
abandoned) to GitHub, so I had to force-push the actual one. I
recommend that you rebase any other changes you may have against it, then
"git push --force". Sorry for the inconvenience.

Alex.

Ben Bullock

Jul 21, 2011, 7:46:58 PM
to kan...@googlegroups.com
On 22 July 2011 06:54, Alexandre Courbot <gnu...@gmail.com> wrote:
> On Thu, Jul 21, 2011 at 10:19 AM, Ben Bullock
> <benkasmi...@gmail.com> wrote:
>> I think this switch to a new format is not a very good idea. Previous
>> to this it was possible to edit the XML and SVG files separately in
>> case of errors or mismatches, but now that they are combined it is very hard
>> to edit them. I would suggest reverting this change and putting
>> back the old SVG and XML directories. Just as an example I have
>> attached a PNG which is another error. Prior to the format change it
>> was possible to fix this kind of problem very easily simply by
>> shuffling the lines of the SVG file. Now, in order to fix this I have
>> to do some very dirty work involving stripping the d attributes from
>> path elements, which seems very error-prone to me.
>>
>> Can we please go back to the old format?
>
> That's quite a valid concern, and actually I also realized it would
> not be as easy as before to add missing SVG paths or to switch two
> strokes, which are two very common fix patterns. On the other hand,
> the previous format has one big drawback, which is that it makes it
> easy to have different stroke counts between the XML and SVG files -
> by design, this cannot happen with the new format.

Surely there is no difference, in one case there are discrepancies and
in the other case the discrepancies are moved into a directory?

> This has been the
> biggest inconvenience so far and we still have 200 of these to fix. In
> addition, the tagging was not consistent between the files and some
> information (structure of the kanji) was uselessly duplicated. I would
> like to keep these advantages.

> Still, we need a way to allow these changes to take place. I can think
> of one right now: we could have a small script that extracts the SVG
> paths into a separate SVG file that one can edit and merge back into
> the original file. The structure of this file would be kept simple to
> allow easy editing - and we could also have comments numbering the
> strokes to easily find them (the old format was more messy in that
> respect). Such a script would be simple to write and maintain.

Until such a script exists then ...

> We could also permanently separate the strokes in the git repo (which
> would be functionally equivalent to switching back to the previous
> format), but knowing the pain it was to maintain, I would really
> like to avoid that.
>
>> I think it would have been better to discuss this change before making it.
>
> http://groups.google.com/group/kanjivg/browse_thread/thread/cd6f500658c21797?hl=en_US
>
> Since I had no reaction there I decided to go ahead

Unfortunately at that time I was not aware of the way that the files
were being stored and I had never attempted to parse or edit the
"source" files so all of that discussion went over my head.

> - KanjiVG is not
> very active, so things sometimes just have to move on.

I think it's in very active use, for example jisho.org is using
kanjivg for kanji stroke order diagrams.

>> I have gone ahead with reverting back to the old system so that it
>> would be possible to continue making corrections to the svgs.
>>
>> The following branch now contains the latest edits:
>>
>> https://github.com/benkasminbullock/kanjivg/tree/use-xml-svg-directories
>
> Merged, thanks - both old and new format directories reflect these changes.

OK.

> as a general rule we should not have generated content on the git
> repo.

Do you have another place to put it? I think it's better to put that
somewhere on the web where people can access it easily. I can host it
somewhere but it would be better to put it on the kanjivg web site
rather than locally.

Alexandre Courbot

Jul 21, 2011, 8:00:57 PM
to kan...@googlegroups.com
On Fri, Jul 22, 2011 at 8:46 AM, Ben Bullock <benkasmi...@gmail.com> wrote:
> Surely there is no difference, in one case there are discrepancies and
> in the other case the discrepancies are moved into a directory?

The difference is that it is much harder to introduce *new*
discrepancies when everything is in the same file.

>> Still, we need a way to allow these changes to take place. I can think
>> of one right now: we could have a small script that extracts the SVG
>> paths into a separate SVG file that one can edit and merge back into
>> the original file. The structure of this file would be kept simple to
>> allow easy editing - and we could also have comments numbering the
>> strokes to easily find them (the old format was more messy in that
>> respect). Such a script would be simple to write and maintain.
>
> Until such a script exists then ...

The point of my response was also to ask you (although it was not
explicit) whether you thought it would be an acceptable solution. Then
writing the script is a matter of minutes. IMO it would even be more
convenient since the strokes temporary file would only contain strokes
and would have them numbered. No more hassle when trying to fix
"stroke #15 and #16 are in the wrong order".

Alternative #2: keep the strokes in separate files, merge them
during release, and have a git commit hook that rejects the commit if
stroke numbers do not match. The inconvenience there would be that you
cannot open a single SVG file and have strokes, numbers and structure
at the same time. I really like the 1 file design for this reason,
too.
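
As a sketch of what such a hook could check: Git runs .git/hooks/pre-commit before each commit and aborts the commit if it exits with a non-zero status. The code below only shows the comparison. The XML tag name "stroke" and the way file pairs are passed in are assumptions, since the hook itself does not exist yet.

```python
import re


def count_tags(text, tag):
    """Count opening tags named `tag` in an XML/SVG string."""
    return len(re.findall(rf"<{re.escape(tag)}\b", text))


def strokes_match(xml_text, svg_text, xml_stroke_tag="stroke"):
    """True if the XML lists as many strokes as the SVG has paths."""
    return count_tags(xml_text, xml_stroke_tag) == count_tags(svg_text, "path")


def mismatches(pairs, xml_stroke_tag="stroke"):
    """pairs maps a kanji id to (xml_text, svg_text); return the ids that differ."""
    return sorted(
        name for name, (x, s) in pairs.items()
        if not strokes_match(x, s, xml_stroke_tag)
    )


def hook_exit_code(pairs):
    """Report mismatches and return the status a pre-commit hook would exit with."""
    bad = mismatches(pairs)
    for name in bad:
        print(f"stroke count mismatch: {name}")
    return 1 if bad else 0
```

A real hook would load the staged XML/SVG pairs and pass the result of hook_exit_code() to sys.exit().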

>> - KanjiVG is not
>> very active, so things sometimes just have to move on.
>
> I think it's in very active use, for example jisho.org is using
> kanjivg for kanji stroke order diagrams.

I agree, it's in very active use, but unfortunately not under very active
development. The website is pitiful (although Axel might fix that
sometime), and we lack active sinograph experts to make decisions.
Simplifying access to the data sounds like a necessary requirement to
improve this.

> Do you have another place to put it? I think it's better to put that
> somewhere on the web where people can access it easily. I can host it
> somewhere but it would be better to put it on the kanjivg web site
> rather than locally.

Generated content changes often, so as much as possible I'd like to
avoid it on the git, but it's not exactly like the git repo is in a
clean state anyway, so let's keep it there for now. Having it hosted
somewhere else would introduce more hassle. Once all the potential
errors reported in this file are fixed, it will go away anyway. Let's
just try not to make this a habit. :P

Alex.

Ben Bullock

Jul 22, 2011, 8:55:42 AM
to kan...@googlegroups.com
On 22 July 2011 09:00, Alexandre Courbot <gnu...@gmail.com> wrote:

> The difference is that it is much harder to introduce *new*
> discrepancies when everything is in the same file.

I understand that but that does not seem like an overwhelming reason
to change, since it is already checking for discrepancies.

>>> Still, we need a way to allow these changes to take place. I can think
>>> of one right now: we could have a small script that extracts the SVG
>>> paths into a separate SVG file that one can edit and merge back into
>>> the original file. The structure of this file would be kept simple to
>>> allow easy editing - and we could also have comments numbering the
>>> strokes to easily find them (the old format was more messy in that
>>> respect). Such a script would be simple to write and maintain.
>>
>> Until such a script exists then ...
>
> The point of my response was also to ask you (although it was not
> explicit) whether you thought it would be an acceptable solution.

I think it's an acceptable solution.

> Then
> writing the script is a matter of minutes.

If so then perhaps writing the script would have been quicker than
writing replies to each other. As far as I can see the difficult part
is not the script to extract the data but the script to put the
altered stroke data back in again.

> IMO it would even be more
> convenient since the strokes temporary file would only contain strokes
> and would have them numbered. No more hassle when trying to fix
> "stroke #15 and #16 are in the wrong order".

OK, well that may be a better solution.

> Alternative #2: keep the strokes in separate files, merge them
> during release, and have a git commit hook that rejects the commit if
> stroke numbers do not match. The inconvenience there would be that you
> cannot open a single SVG file and have strokes, numbers and structure
> at the same time. I really like the 1 file design for this reason,
> too.

But the XML file with all these things in one file is really hard to
edit, I don't understand why you want to make the master copy of the
file into this merged file.

>>> - KanjiVG is not
>>> very active, so things sometimes just have to move on.
>>
>> I think it's in very active use, for example jisho.org is using
>> kanjivg for kanji stroke order diagrams.
>
> I agree, it's in very active use, but unfortunately not under very active
> development. The website is pitiful (although Axel might fix that
> sometime), and we lack active sinograph experts to make decisions.
> Simplifying access to the data sounds like a necessary requirement to
> improve this.

I don't think editing an XML file with many attributes is simplifying
access to the data, to me it seems like complicating access to the
data. I think that most people faced with a task like editing an XML
file will not find it very simple.

>> Do you have another place to put it? I think it's better to put that
>> somewhere on the web where people can access it easily. I can host it
>> somewhere but it would be better to put it on the kanjivg web site
>> rather than locally.
>
> Generated content changes often, so as much as possible I'd like to
> avoid it on the git, but it's not exactly like the git repo is in a
> clean state anyway, so let's keep it there for now. Having it hosted
> somewhere else would introduce more hassle. Once all the potential
> errors reported in this file are fixed, it will go away anyway. Let's
> just try not to make this a habit. :P

I don't think it's very difficult to put this on the web:

http://www.lemoda.net/kanjivg/angle-check/index.html

I am continuing adding to the branch as mentioned until the fabled
scripts are ready.

Alexandre Courbot

Jul 24, 2011, 10:51:40 AM
to kan...@googlegroups.com

Quick update: I have updated my github repo with a script (kvg) that is supposed to group all my previous scripts and also features the paths split/merge that we discussed earlier. Just use ./kvg split <file> or ./kvg merge <file> to do that. This should do the trick wrt stroke order mistakes.

Alex.


Ben Bullock

Jul 26, 2011, 3:20:36 AM
to KanjiVG
On Jul 22, 6:54 am, Alexandre Courbot <gnu...@gmail.com> wrote:

> Btw, I realized I pushed the wrong branch (one that was later
> abandoned) to GitHub, so I had to force-push the actual one. I
> recommend that you rebase any other changes you may have against it, then
> "git push --force". Sorry for the inconvenience.

I'm not sure what I'm supposed to do with the new repository. It is
giving me a gigantic merge even when I go back to the "master" branch
rather than the "use-xml-svg-directories" branch.

Arne Brasseur

Jul 27, 2011, 10:55:46 AM
to kan...@googlegroups.com
Not sure what the problem is, but if you're using pull it tries to
automatically merge the remote changes into your local master branch.
Try using fetch and reset

git fetch origin
git reset --hard origin/master

WARNING : This will discard any local changes, including local commits.

This will set your master branch to be identical to the master branch on
remote 'origin', which is the remote you originally cloned from.

HTH,
Arne

Alexandre Courbot

Jul 27, 2011, 8:05:57 PM
to kan...@googlegroups.com
> Not sure what the problem is, but if you're using pull it tries to
> automatically merge the remote changes into your local master branch. Try
> using fetch and reset
>
> git fetch origin
> git reset --hard origin/master
>
> WARNING : This will discard any local changes, including local commits.
>
> This will set your master branch to be identical to the master branch on
> remote 'origin', which is the remote you originally cloned from.

Yes, since I force-pushed, the branches cannot be merged. The simplest
thing to do is to rebase your branches on my master branch, e.g. 'git
rebase alias-for-my-master-branch'. Then you will also have to
force-push your changes to GitHub once per branch ("git push
--force"). Then you will be back in sync. Such an operation should not be
necessary anymore in the future.

Alex.
