The KanjiVG data is now available in two different packages:
- one that includes the raw SVG files, one per character. Every file
is valid SVG and contains all the information from the previous XML +
SVG pair, which makes the data easier to visualize and edit, and makes
it impossible to have mismatching kanji anymore.
- one that aggregates all the data into a single XML file, like
before. People using this format will only have a few changes to make
to their parsers. The tag and attribute names have been changed to match
those of the SVG data. Apart from this change, structure and parsing
should be identical.
Also, the Git repository has been moved to GitHub to make things
easier. I am just waiting for Ben to remove his Gitorious clone in
order to delete mine (a Gitorious limitation; that reason alone is
enough to go to GitHub). Using GitHub, I hope more people will fix the
data directly and propose changes via merge requests, like Ben did,
instead of using the "Incorrects kanji" page.
The Git repo still contains the old XML/SVG directories, but these
should go soon. The data is now available in a kanji/ directory (11252
characters, including variants) for kanji that successfully merged,
and in kanji_mismatch for kanji that did not merge (263 of them). The
goal is of course to have this directory disappear.
I have updated the website accordingly, and made the format
description page official. It is still raw but should be enough to
understand the format.
Hope this will make things smoother!
Alex.
I think this switch to a new format is not a very good idea.
Previously it was possible to edit the XML and SVG files separately in
case of errors or mismatches, but now that they are combined it is very
hard to edit them. I would suggest reverting this change and putting
back the old SVG and XML directories. Just as an example I have
attached a PNG which shows another error. Prior to the format change it
was possible to fix this kind of problem very easily simply by
shuffling the lines of the SVG file. Now, in order to fix this I have
to do some very dirty work involving stripping the d attributes from
path elements, which seems very error-prone to me.
Can we please go back to the old format?
> The Git repo still contains the old XML/SVG directories, but these
> should go soon.
I think it would have been better to discuss this change before making it.
> Hope this will make things smoother!
Unfortunately it hasn't done that; I think it's better to stick to the
original XML/SVG method.
The following branch now contains the latest edits:
https://github.com/benkasminbullock/kanjivg/tree/use-xml-svg-directories
There are about eight edits. The attached diagram shows one example
where strokes 10 and 11 are the wrong way round.
The script check-all-strokes.pl
https://github.com/benkasminbullock/kanjivg/blob/use-xml-svg-directories/check-all-strokes.pl
is now set up to ignore kanji containing certain elements which
clearly contain errors in the type field of one or more strokes.
At the moment that is the following list:
冬 羽 尽 辛 手 羊 冫 半
Any kanji which contains this as an element is excluded from the check.
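To illustrate the exclusion rule, here is a minimal sketch in Python. It is not the logic of check-all-strokes.pl itself (that script is Perl); the function name and the input shape (a list of element strings per kanji) are assumptions made for the example.

```python
# Hypothetical sketch of the exclusion rule described above; it does not
# reproduce check-all-strokes.pl.

# The elements listed above whose stroke "type" fields are known to be wrong.
EXCLUDED_ELEMENTS = set("冬羽尽辛手羊冫半")


def should_check(elements):
    """Return True if the kanji should be checked.

    `elements` is an iterable of the element strings found in one kanji
    (an assumed input shape). The kanji is skipped as soon as one of its
    elements is in the exclusion list.
    """
    return EXCLUDED_ELEMENTS.isdisjoint(elements)
```

A kanji whose elements include 羊, for example, would be skipped, while one built only from elements outside the list would still be checked.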
There seem to be a few other things which cause common errors but
unfortunately some of them do not have an "element" field or the
element field is confusing.
I have continued to commit the file check-1.0 even though it is
machine-generated, for the sake of people who might find it hard to
run the Perl script. The Perl script relies on modules which need to
be installed from CPAN so I think it is worth including for the sake
of users who don't want to have to install Perl modules or run a
script.
Anyway I apologize for reverting to the old system, but these
corrections would have taken hours with the new system or required
extensive programming, whereas they could be completed in a few
minutes using the old version of the files and simple text edits.
That's quite a valid concern, and actually I also realized it would
not be as easy as before to add missing SVG paths or to switch two
strokes, which are two very common fix patterns. On the other hand,
the previous format has one big drawback, which is that it makes it
easy to have different stroke counts between the XML and SVG files -
by design, this cannot happen with the new format. This has been the
biggest inconvenience so far and we still have 200 of these to fix. In
addition, the tagging was not consistent between the files and some
information (structure of the kanji) was uselessly duplicated. I would
like to keep these advantages.
Still, we need a way to allow these changes to take place. I can think
of one right now: we could have a small script that extracts the SVG
paths into a separate SVG file that one can edit and merge back into
the original file. The structure of this file would be kept simple to
allow easy editing - and we could also have comments numbering the
strokes to easily find them (the old format was more messy in that
respect). Such a script would be simple to write and maintain.
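As a rough idea of what such a split/merge script could look like, here is a minimal Python sketch. It assumes that every stroke is a path element carrying its outline in a "d" attribute, that strokes appear in document order, and that only reordering (not adding or removing strokes) is needed. The function names and the "# stroke N" comment syntax are made up for the example.

```python
import xml.etree.ElementTree as ET


def _paths(root):
    # Collect <path> elements in document order, whether or not the tag
    # carries an XML namespace.
    return [el for el in root.iter()
            if el.tag == "path" or el.tag.endswith("}path")]


def split_strokes(svg_text):
    """Return editable text: a numbered comment, then the path data, per stroke."""
    root = ET.fromstring(svg_text)
    lines = []
    for number, el in enumerate(_paths(root), start=1):
        lines.append(f"# stroke {number}")
        lines.append(el.get("d", ""))
    return "\n".join(lines) + "\n"


def merge_strokes(svg_text, strokes_text):
    """Write the (possibly reordered) path data back into the SVG text."""
    root = ET.fromstring(svg_text)
    stripped = [ln.strip() for ln in strokes_text.splitlines()]
    # Comment lines and blank lines are ignored; the rest is path data.
    data = [ln for ln in stripped if ln and not ln.startswith("#")]
    paths = _paths(root)
    if len(data) != len(paths):
        raise ValueError(
            f"expected {len(paths)} strokes, got {len(data)}")
    # Only the "d" attribute changes; ids and other attributes stay put.
    for el, d in zip(paths, data):
        el.set("d", d)
    return ET.tostring(root, encoding="unicode")
```

Swapping two strokes then amounts to swapping two data lines in the split file and merging it back. Note that ElementTree does not preserve the original formatting or namespace prefixes when serializing, so a real script would need more care on output.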
We could also permanently separate the strokes in the Git repo (which
would be functionally equivalent to switching back to the previous
format), but knowing the pain it was to maintain, I would really
like to avoid that.
> I think it would have been better to discuss this change before making it.
http://groups.google.com/group/kanjivg/browse_thread/thread/cd6f500658c21797?hl=en_US
Since I had no reaction there I decided to go ahead - KanjiVG is not
very active, so things sometimes just have to move on.
> I have gone ahead with reverting back to the old system so that it
> would be possible to continue making corrections to the svgs.
>
> The following branch now contains the latest edits:
>
> https://github.com/benkasminbullock/kanjivg/tree/use-xml-svg-directories
Merged, thanks - both old and new format directories reflect these changes.
> I have continued to commit the file check-1.0 even though it is
> machine-generated, for the sake of people who might find it hard to
> run the Perl script. The perl script relies on modules which need to
> be installed from CPAN so I think it is worth including for the sake
> of users who don't want to have to install Perl modules or run a
> script.
Well, it's not supposed to be here forever so let's keep it for now,
but as a general rule we should not have generated content on the git
repo.
Btw, I realized I pushed the wrong branch (one that was later
abandoned) to GitHub, so I had to force-push the actual one. I
recommend that you rebase any other changes you may have against it,
then "git push --force". Sorry for the inconvenience.
Alex.
Surely there is no difference: in one case there are discrepancies, and
in the other case the discrepancies are moved into a directory?
> This has been the
> biggest inconvenience so far and we still have 200 of these to fix. In
> addition, the tagging was not consistent between the files and some
> information (structure of the kanji) was uselessly duplicated. I would
> like to keep these advantages.
> Still, we need a way to allow these changes to take place. I can think
> of one right now: we could have a small script that extracts the SVG
> paths into a separate SVG file that one can edit and merge back into
> the original file. The structure of this file would be kept simple to
> allow easy editing - and we could also have comments numbering the
> strokes to easily find them (the old format was more messy in that
> respect). Such a script would be simple to write and maintain.
Until such a script exists then ...
> We could also permanently separate the strokes in the git repo (which
> would be functionally equivalent to switching back to the previous
> format), but I knowing the pain it was to maintain, I would really
> like to avoid that.
>
>> I think it would have been better to discuss this change before making it.
>
> http://groups.google.com/group/kanjivg/browse_thread/thread/cd6f500658c21797?hl=en_US
>
> Since I had no reaction there I decided to go ahead
Unfortunately at that time I was not aware of the way that the files
were being stored and I had never attempted to parse or edit the
"source" files so all of that discussion went over my head.
> - KanjiVG is not
> very active, so things sometimes just have to move on.
I think it's in very active use, for example jisho.org is using
kanjivg for kanji stroke order diagrams.
>> I have gone ahead with reverting back to the old system so that it
>> would be possible to continue making corrections to the svgs.
>>
>> The following branch now contains the latest edits:
>>
>> https://github.com/benkasminbullock/kanjivg/tree/use-xml-svg-directories
>
> Merged, thanks - both old and new format directories reflect these changes.
OK.
> as a general rule we should not have generated content on the git
> repo.
Do you have another place to put it? I think it's better to put that
somewhere on the web where people can access it easily. I can host it
somewhere but it would be better to put it on the kanjivg web site
rather than locally.
The difference is that it is much harder to introduce *new*
discrepancies when everything is in the same file.
>> Still, we need a way to allow these changes to take place. I can think
>> of one right now: we could have a small script that extracts the SVG
>> paths into a separate SVG file that one can edit and merge back into
>> the original file. The structure of this file would be kept simple to
>> allow easy editing - and we could also have comments numbering the
>> strokes to easily find them (the old format was more messy in that
>> respect). Such a script would be simple to write and maintain.
>
> Until such a script exists then ...
The point of my response was also to ask you (although it was not
explicit) whether you thought it would be an acceptable solution. Then
writing the script is a matter of minutes. IMO it would even be more
convenient since the strokes temporary file would only contain strokes
and would have them numbered. No more hassle when trying to fix
"stroke #15 and #16 are in the wrong order".
Alternative #2: keep the strokes in separate files, merge them
during release, and have a Git commit hook that rejects the commit if
stroke counts do not match. The inconvenience there would be that you
cannot open a single SVG file and have strokes, numbers and structure
at the same time. I really like the 1 file design for this reason,
too.
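For illustration, the check behind such a commit hook could look roughly like the Python sketch below. The tag names (stroke in the XML file, path in the SVG file), the function names, and the way file pairs are passed in are all assumptions; the hook would also need to work out which file pairs a commit touches, which is not shown.

```python
import sys
import xml.etree.ElementTree as ET


def count_tags(xml_data, tag):
    """Count elements named `tag`, with or without an XML namespace."""
    root = ET.fromstring(xml_data)
    return sum(1 for el in root.iter()
               if el.tag == tag or el.tag.endswith("}" + tag))


def stroke_counts_match(xml_data, svg_data):
    # Assumption: one <stroke> per stroke in the XML description and
    # one <path> per stroke in the SVG drawing.
    return count_tags(xml_data, "stroke") == count_tags(svg_data, "path")


def main(argv):
    """Take alternating XML and SVG file paths; return 1 on any mismatch."""
    status = 0
    for xml_path, svg_path in zip(argv[0::2], argv[1::2]):
        with open(xml_path, "rb") as f:
            xml_data = f.read()
        with open(svg_path, "rb") as f:
            svg_data = f.read()
        if not stroke_counts_match(xml_data, svg_data):
            print(f"stroke count mismatch: {xml_path} vs {svg_path}",
                  file=sys.stderr)
            status = 1
    return status
```

A pre-commit hook would call something like sys.exit(main(sys.argv[1:])) with the affected pairs; Git aborts the commit when the hook exits with a non-zero status.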
>> - KanjiVG is not
>> very active, so things sometimes just have to move on.
>
> I think it's in very active use, for example jisho.org is using
> kanjivg for kanji stroke order diagrams.
I agree, it's in very active use, but unfortunately not under very
active development. The website is pitiful (although Axel might fix that
sometime), and we lack active sinograph experts to make decisions.
Simplifying access to the data sounds like a necessary requirement to
improve this.
> Do you have another place to put it? I think it's better to put that
> somewhere on the web where people can access it easily. I can host it
> somewhere but it would be better to put it on the kanjivg web site
> rather than locally.
Generated content changes often, so as much as possible I'd like to
avoid it in the Git repo, but it's not exactly like the repo is in a
clean state anyway, so let's keep it there for now. Having it hosted
somewhere else would introduce more hassle. Once all the potential
errors reported in this file are fixed, it will go away anyway. Let's
just try not to make this a habit. :P
Alex.
> The difference is that it is much harder to introduce *new*
> discrepancies when everything is in the same file.
I understand that, but it does not seem like an overwhelming reason
to change, since discrepancies are already being checked for.
>>> Still, we need a way to allow these changes to take place. I can think
>>> of one right now: we could have a small script that extracts the SVG
>>> paths into a separate SVG file that one can edit and merge back into
>>> the original file. The structure of this file would be kept simple to
>>> allow easy editing - and we could also have comments numbering the
>>> strokes to easily find them (the old format was more messy in that
>>> respect). Such a script would be simple to write and maintain.
>>
>> Until such a script exists then ...
>
> The point of my response was also to ask you (although it was not
> explicit) whether you thought it would be an acceptable solution.
I think it's an acceptable solution.
> Then
> writing the script is a matter of minutes.
If so then perhaps writing the script would have been quicker than
writing replies to each other. As far as I can see the difficult part
is not the script to extract the data but the script to put the
altered stroke data back in again.
> IMO it would even be more
> convenient since the strokes temporary file would only contain strokes
> and would have them numbered. No more hassle when trying to fix
> "stroke #15 and #16 are in the wrong order".
OK, well that may be a better solution.
> Alternative #2: keep the strokes into separate files, merge then
> during release, and have a git commit hook that rejects the commit if
> stroke numbers do not match. The inconvenience there would be that you
> cannot open a single SVG file and have strokes, numbers and structure
> at the same time. I really like the 1 file design for this reason,
> too.
But the XML file with all these things in one file is really hard to
edit; I don't understand why you want to make this merged file the
master copy.
>>> - KanjiVG is not
>>> very active, so things sometimes just have to move on.
>>
>> I think it's in very active use, for example jisho.org is using
>> kanjivg for kanji stroke order diagrams.
>
> I agree, it's in very active use, but unfortunately not in so active
> development. The website is pityful (although Axel might fix that
> sometime), and we lack active sinograph experts to make decisions.
> Simplifying access to the data sounds like a necessary requirement to
> improve this.
I don't think editing an XML file with many attributes is simplifying
access to the data; to me it seems like complicating it. I think that
most people faced with a task like editing an XML file will not find
it very simple.
>> Do you have another place to put it? I think it's better to put that
>> somewhere on the web where people can access it easily. I can host it
>> somewhere but it would be better to put it on the kanjivg web site
>> rather than locally.
>
> Generated content changes often, so as much as possible I'd like to
> avoid it on the git, but it's not exactly like the git repo is in a
> clean state anyway, so let's keep it there for now. Having it hosted
> somewhere else would introduce more hassle. Once all the potential
> errors reported in this file are fixed, it will go away anyway. Let's
> just try not to make this a habit. :P
I don't think it's very difficult to put this on the web:
http://www.lemoda.net/kanjivg/angle-check/index.html
I am continuing to add to the branch mentioned above until the fabled
scripts are ready.
Quick update: I have updated my GitHub repo with a script (kvg) that
is meant to group all my previous scripts and also features the path
split/merge that we discussed earlier. Just use ./kvg split <file> or
./kvg merge <file> to do that. This should do the trick for stroke
order mistakes.
Alex.
git fetch origin
git reset --hard origin/master
WARNING: This will discard any local changes, including local commits.
It will make your master branch identical to the master branch on
remote 'origin', which is the remote you originally cloned from.
HTH,
Arne
Yes, since I force-pushed, the branches cannot be merged. The simplest
thing to do is to rebase your branches on my master branch, e.g. 'git
rebase alias-for-my-master-branch'. Then you will also have to
force-push your changes to GitHub once per branch ("git push
--force"). Then you will be back in sync. Such an operation should not
be necessary in the future.
Alex.