Git repository

msk...@ansuz.sooke.bc.ca

Mar 31, 2012, 9:17:56 PM

to kan...@googlegroups.com

I've forked the Git repository with an eye to implementing some of the
ideas described in my earlier posting to the list. I thought that the
first step might be to "autoconfiscate" the package - i.e. set up GNU
Autotools to generate a configure script and in turn a Makefile. Then it
should be possible to (on an appropriately-configured system) just type
"make" to generate all the files that are automatically generated from
user-edited inputs, and "make check" to run whatever consistency checks
are implemented and generate a report of any errors. The interface of
Autotools is familiar to many users and might make the package more
accessible.

So my current draft is on my Github account at
https://github.com/mskala/kanjivg . The interesting part is in a branch
called "autoconf." In working on this, a few issues have come up that
might be worth thinking about:

On copyright and licensing: A lot of places, such as the Web pages,
describe KanjiVG (presumably the entire project) as being copyright Ulrich
and covered by the Creative Commons Attribution-ShareAlike 3.0 license.
However, many of the Python code files I'm looking at have notices in them
saying that they're copyright Alex and covered by the GNU GPL 3 license.
This may be a problem, because as I understand it, those two licenses are
incompatible! If you create a work partially derived from CC-BY-SA
materials and partially derived from GPL3 materials, the result won't be
distributable at all, absent permission from the copyright holders on one
or both sides to expand or make exceptions to their licenses. Now, that
doesn't stop us from putting some files under one and some under the
other, and I don't want to pressure anybody to license things in a way
they don't want to, but it does mean the CC-BY-SA and the GPL3 materials
have to be kept separate; no file covered by one license can ever be
combined with a file covered by the other, not if it'll be distributed.

Since these two licenses have very similar goals anyway, I think it would
be really nice if we could agree on just one of them to use, and if
whoever holds copyrights currently licensed under the other would agree to
license them under the single standard license. (This could be instead
of, or in addition to, the current licensing.) Strictly for myself, I
prefer the GPL3. I'm using GPL3 for independent projects of my own like
Tsukurimashou/IDSgrep. But I feel more strongly that agreeing on just one
license for KanjiVG, whatever it is, would be better than trying to mix
two; I'd use CC-BY-SA for my contributions to this project if there were
consensus on doing that. For the moment I haven't put any copyright
notices on the materials I've written and checked into my Git repository,
but that won't be tenable if I'm writing significant amounts of new code.
If there isn't a clear decision from the project, my inclination is to
dual-license my contributions under the GPL3 and CC-BY-SA licenses, so as
to allow them to be compatible with both.

On workflow: it's not clear to me which files in this repository are
inputs and which are outputs. My best guess is that the "source" files,
which editors are expected to edit, are the ones in XML/*.xml and
SVG/*.svg. There seems to be a Python script called mergexml.py that
processes those to create outputs in kanji/*.svg and kanji_mismatch/*.svg.
However, there also seems to be some code for performing a transformation
in the other direction. So which set of files is the source?

It seems to me that only source files should be tracked in source-control
repositories, and changes should be made to source files, not to
automatically-generated products of the source. It's a mess if we have
two sets of files, either of which can be generated from the other, and no
clear guidance on which one is the single right place to make changes;
practical experience in software development suggests that in such cases,
the two sets of files will seldom if ever be in sync with each other.
Accordingly, I have deleted the kanji/* and kanji_mismatch/* files from my
repository (on the "autoconf" branch); the idea is that the Makefile
generated by Autotools will automatically create those files by running
the Python scripts.

On Python versions: I don't know a lot about Python, but these scripts do
not immediately work on my system. I tried running them with the default
Python interpreter on my Arch Linux installation, which is Python version
3.2.2, and they all failed with errors relating to the syntax of the
"print" command. Trial and error led me to execute them with an older
Python interpreter, version 2.7.2, and that seems to work. I don't know
how hard it might or might not be to make them work with the current
version, and I don't know how standard it is in the Python community to
require a 2.x interpreter instead of a 3.x interpreter - is 3.x some kind
of unstable development version? - but my usual thought is that it's best
to aim for code to work with current versions of things wherever possible.
--
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/

Alexandre Courbot

Apr 1, 2012, 10:09:05 AM

to kan...@googlegroups.com

Hi Matthew,

(first, apologies: I have read your previous mail but did not find the
time to correctly answer it yet as I was busy with Tagaini and
JMdict-i18n. I will answer it ASAP).

> I've forked the Git repository with an eye to implementing some of the
> ideas described in my earlier posting to the list. I thought that the
> first step might be to "autoconfiscate" the package - i.e. set up GNU
> Autotools to generate a configure script and in turn a Makefile. Then it
> should be possible to (on an appropriately-configured system) just type
> "make" to generate all the files that are automatically generated from
> user-edited inputs, and "make check" to run whatever consistency checks
> are implemented and generate a report of any errors. The interface of
> Autotools is familiar to many users and might make the package more
> accessible.

I am not sure about the relevance of autotools here. The kanjivg
script should (I do not pretend it is right now) be usable as-is,
without any configuration phase. There is nothing to compile, hence
the use of autotools seems out of focus to me (and I would also say,
out of time :))

> On copyright and licensing: A lot of places, such as the Web pages,
> describe KanjiVG (presumably the entire project) as being copyright Ulrich
> and covered by the Creative Commons Attribution-ShareAlike 3.0 license.
> However, many of the Python code files I'm looking at have notices in them
> saying that they're copyright Alex and covered by the GNU GPL 3 license.
> This may be a problem, because as I understand it, those two licenses are
> incompatible! If you create a work partially derived from CC-BY-SA
> materials and partially derived from GPL3 materials, the result won't be
> distributable at all, absent permission from the copyright holders on one
> or both sides to expand or make exceptions to their licenses. Now, that
> doesn't stop us from putting some files under one and some under the
> other, and I don't want to pressure anybody to license things in a way
> they don't want to, but it does mean the CC-BY-SA and the GPL3 materials
> have to be kept separate; no file covered by one license can ever be
> combined with a file covered by the other, not if it'll be distributed.

This is not an issue for the same reason you can compile proprietary
software with the GPL'd gcc compiler: the KanjiVG data is just
input/output of the scripts, they are thus not affected by their
licence and can live in a "licence space" of their own.

> On workflow: it's not clear to me which files in this repository are
> inputs and which are outputs. My best guess is that the "source" files,
> which editors are expected to edit, are the ones in XML/*.xml and
> SVG/*.svg. There seems to be a Python script called mergexml.py that
> processes those to create outputs in kanji/*.svg and kanji_mismatch/*.svg.
> However, there also seems to be some code for performing a transformation
> in the other direction. So which set of files is the source?

There is a confusion here and it is entirely my fault. Initially, the
kanji data was split into the SVG and XML directory, with every kanji
having one file in both directories. This split was confusing and I
decided to merge all the data into a single, extended SVG file. This
was done by the mergexml.py script and the result is in the 'kanji'
directory ('kanji_mismatch' contains those kanji whose stroke count
was not the same for SVG and XML files - one of the reasons I wanted
to join the information was precisely to prevent that from occurring in
the future). So basically this means that the SVG and XML directories
are obsolete - I kept them out of fear that my script would have
messed things up, but apparently this is not the case, and git history
would allow us to retrieve them anyway. Therefore I have just
removed them from the repo.

So the bottom line of that is that the input data is in the "kanji"
directory, and that is all there is to know. This data should respect
the format that we are trying to define. And the only script that
should be relevant in the end is kvg.py (with its libraries kanjivg.py
and xmlhandler.py).

> It seems to me that only source files should be tracked in source-control
> repositories, and changes should be made to source files, not to
> automatically-generated products of the source. It's a mess if we have
> two sets of files, either of which can be generated from the other, and no
> clear guidance on which one is the single right place to make changes;
> practical experience in software development suggests that in such cases,
> the two sets of files will seldom if ever be in sync with each other.
> Accordingly, I have deleted the kanji/* and kanji_mismatch/* files from my
> repository (on the "autoconf" branch); the idea is that the Makefile
> generated by Autotools will automatically create those files by running
> the Python scripts.

So, well, that was actually the opposite of what was supposed to be done.
:) Sorry about this.

> On Python versions: I don't know a lot about Python, but these scripts do
> not immediately work on my system. I tried running them with the default
> Python interpreter on my Arch Linux installation, which is Python version
> 3.2.2, and they all failed with errors relating to the syntax of the
> "print" command. Trial and error led me to execute them with an older
> Python interpreter, version 2.7.2, and that seems to work. I don't know
> how hard it might or might not be to make them work with the current
> version, and I don't know how standard it is in the Python community to
> require a 2.x interpreter instead of a 3.x interpreter - is 3.x some kind
> of unstable development version? - but my usual thought is that it's best
> to aim for code to work with current versions of things wherever possible.

The scripts are designed for Python 2 (I also prefer Python 3, but
these scripts have some history). I have updated their headers to
specifically use this version.
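
For anyone hitting the same "print" errors: in Python 2, print is a
statement, while in Python 3 it is a function, so a line like
print "x" is a syntax error under 3.x. The following is only a minimal
sketch of a version guard that parses under both interpreters; the
function name python2_error is made up for illustration and is not part
of the KanjiVG scripts.

```python
import sys


def python2_error(version_info=sys.version_info):
    """Return an error message if not running under Python 2, else None."""
    if version_info[0] != 2:
        return "these scripts need Python 2.x, found %d.%d" % (
            version_info[0], version_info[1])
    return None


# With a single parenthesised argument, print is accepted by both 2.x
# (as a statement) and 3.x (as a function call), so this line parses everywhere.
print(python2_error() or "interpreter OK")
```

A script that needs print with several arguments on both versions could
instead start with "from __future__ import print_function", which is
available from Python 2.6 onwards.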

Hope this clarifies things a little.
Alex.

msk...@ansuz.sooke.bc.ca

Apr 1, 2012, 11:35:58 AM

to kan...@googlegroups.com

On Sun, 1 Apr 2012, Alexandre Courbot wrote:
> I am not sure about the relevance of autotools here. The kanjivg
> script should (I do not pretend it is right now) be usable as-is,
> without any configuration phase. There is nothing to compile, hence
> the use of autotools seems out of focus to me (and I would also say,
> out of time :))

I see the entire database as a thing that gets compiled - it has source
files, and those are automatically processed to produce files that are
output and not for direct editing. It will soon also have a test suite.
Being able to type "make" to do the compiling and "make check" to run the
test suite is much more convenient for a user than having to know which
scripts must be run in which sequence to build the data properly. And as
my adventures trying to run Python indicate, that compilation-like process
*does* require configuration: the command that runs the appropriate
Python interpreter on my system evidently isn't the same command used for
that purpose on your system.

The advantage to using Autotools in particular is that it's familiar in
the free software world (i.e. me :-) ); someone can download an Autotools
package and start using it immediately. It also makes packaging easier -
once it's properly configured we can type "make dist" and get an archive
file that someone can use.

[licensing]

> This is not an issue for the same reason you can compile proprietary
> software with the GPL'd gcc compiler: the KanjiVG data is just
> input/output of the scripts, they are thus not affected by their
> licence and can live in a "licence space" of their own.

Well, it's still going to be a problem if we want to build a package that
contains both code and data - even if distributing such a package is
legally permitted, "Half of the files are on these terms and half are on
those terms" is not a good answer to give to someone who asks "What is the
license of your package?" It may be an issue for things like inclusion of
KanjiVG in distributions; it's certainly a reason that IDSgrep doesn't
ship with KanjiVG-derived data, instead encouraging users to download
copies of their own.

I also think it's very easy for the distinction between "code" and "data"
to become blurred - for instance, if an editor program needs to have,
inside its source code, a template of what an entry should look like, and
that template ends up being derived from a database entry. Meanwhile the
same editor program also includes input consistency-checking stuff based
on previous work that was "code". Now the editor is going to have to
respect both the "code" and the "data" licenses. It's easy to say "Oh,
well, in such a case we'd store the template in a separate file, thus
creating a license boundary," but if it's clear that that was done solely
to create a technicality under which someone could claim to have followed
the letter of the licensing terms, and the "separate" data file is of no
use except as integrated with the code, I doubt such a claimed separation
would actually protect anyone. The distinction between code and data
becomes especially vague when we're talking about Web pages, which may
have chunks of client-side script, server-side code, and static data and
documentation all mixed together in a single file - and generating Web
pages *is* a significant target of current development.

The immediate question, though, is simply what license and copyright I
should put on the test suite code I'm currently writing. What do you
recommend?

> So the bottom line of that is that the input data is in the "kanji"
> directory, and that is all there is to know. This data should respect
> the format that we are trying to define. And the only script that
> should be relevant in the end is kvg.py (with its libraries kanjivg.py
> and xmlhandler.py).

That's good because it'll mean the tests can be applied more directly,
instead of to translated output files; also, "cayennes" has been
suggesting edits to the kanji/*.svg files, so if those weren't the
sources, there'd be a need to translate them in the other direction back
to the actual source files. It'll also reduce the amount of
compilation-like work to be done. A little is still needed for stuff like
generating the all-in-one file. I'll make appropriate changes in my fork.

Alexandre Courbot

Apr 2, 2012, 10:35:36 AM

to kan...@googlegroups.com

On Mon, Apr 2, 2012 at 12:35 AM, <msk...@ansuz.sooke.bc.ca> wrote:
> I see the entire database as a thing that gets compiled - it has source
> files, and those are automatically processed to produce files that are
> output and not for direct editing. It will soon also have a test suite.
> Being able to type "make" to do the compiling and "make check" to run the
> test suite is much more convenient for a user than having to know which
> scripts must be run in which sequence to build the data properly.

As of now there is no compilation taking place - just some simple
packaging. The files in kanji/ are usable as-is. I'm afraid you might
have been confused by the presence of the merger script which I have
now deleted. What kind of compilation do you think can still take
place with the remaining files?

As far as convenience is concerned, if the kvg.py script is properly
written with a limited set of commands, and comes with its own online
help, I really think we can (and should) not bother about using make.
It would require another dependency, and do little more than being an
empty wrapper to call kvg.py with the right command.

> And as
> my adventures trying to run Python indicate, that compilation-like process
> *does* require configuration: the command that runs the appropriate
> Python interpreter on my system evidently isn't the same command used for
> that purpose on your system.

It's just a matter of forcing the python version in the first line of
the script file, e.g.

#!/usr/bin/python2

On most Linux systems, "python" is just a symbolic link to either python2 or
python3 depending on which version is default. It just happens that
Arch is the only one to use 3 as default. Now that it is explicit, the
python version problem is gone.

> The advantage to using Autotools in particular is that it's familiar in
> the free software world (i.e. me :-) ); someone can download an Autotools
> package and start using it immediately. It also makes packaging easier -
> once it's properly configured we can type "make dist" and get an archive
> file that someone can use.

I remember using Autotools 10 years ago, it was already called
"Autohell" and I doubt things have improved since then. :p As for
familiarity with free software, I understand your point but we should
keep in mind that KanjiVG is *not* only targeted at the free software
enthusiast. For instance, using Autotools will make it harder to
work on Windows, whereas the Python scripts are supposed to work as-is
there too. And ideally we want to avoid having people messing with the
source as much as possible.

The only scripts I can foresee for now would be:
1) Making releases ("kvg.py release")
2) Checking the validity of the files ("kvg.py check", TBD)
3) Fixing files that were edited using e.g. inkscape
("harmonize-svg.py", to be moved into "kvg.py fix").
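
The three commands listed above could be exposed through a single entry
point roughly as follows. This is a sketch only, assuming argparse
subcommands; the help strings and the "files" argument are hypothetical
and do not describe the actual kvg.py interface.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per task named in the list above."""
    parser = argparse.ArgumentParser(prog="kvg.py")
    commands = parser.add_subparsers(dest="command")
    commands.add_parser("release", help="build a release package")
    commands.add_parser("check", help="check the validity of the kanji files")
    fix = commands.add_parser("fix", help="clean up files edited in e.g. Inkscape")
    # Hypothetical argument: which SVG files to clean up.
    fix.add_argument("files", nargs="+", help="SVG files to fix")
    return parser
```

Calling build_parser().parse_args(["fix", "a.svg"]) yields a namespace
whose "command" attribute is "fix", which the script could then dispatch
on; running the script with -h would print the built-in help.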

Do you see anything else that would require a configuration tool? As
it is now, a configure session on KanjiVG will look like this:

$ ./configure
Checking for python2... found!

If you don't have python2, you may as well discover it when you try to
run kvg.py. Plus Autotools gives headaches and makes your beard grow.
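
For comparison, the interpreter check that such a configure run would
perform can be sketched in a few lines of Python. This assumes
Python 3.3 or later for shutil.which; the function name and the
candidate list are illustrative and not taken from either repository.

```python
import shutil


def find_interpreter(candidates=("python2", "python2.7", "python")):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None
```

Note that a bare "python" may turn out to be Python 3, so a caller
would still need to confirm the version of whatever this returns.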

> Well, it's still going to be a problem if we want to build a package that
> contains both code and data - even if distributing such a package is
> legally permitted, "Half of the files are on these terms and half are on
> those terms" is not a good answer to give to someone who asks "What is the
> license of your package?" It may be an issue for things like inclusion of
> KanjiVG in distributions; it's certainly a reason that IDSgrep doesn't
> ship with KanjiVG-derived data, instead encouraging users to download
> copies of their own.

It is a non-problem, really. GPL is a licence targeted at source code;
Creative Commons is for data. We ship both code and data and both
use the most appropriate licence. The question of which licence
KanjiVG is distributed under could be answered as "SVG files are
CC-BY-SA and scripts are GPLv3". There is nothing confusing about that.

> I also think it's very easy for the distinction between "code" and "data"
> to become blurred - for instance, if an editor program needs to have,
> inside its source code, a template of what an entry should look like, and
> that template ends up being derived from a database entry. Meanwhile the
> same editor program also includes input consistency-checking stuff based
> on previous work that was "code". Now the editor is going to have to
> respect both the "code" and the "data" licenses. It's easy to say "Oh,
> well, in such a case we'd store the template in a separate file, thus
> creating a license boundary," but if it's clear that that was done solely
> to create a technicality under which someone could claim to have followed
> the letter of the licensing terms, and the "separate" data file is of no
> use except as integrated with the code, I doubt such a claimed separation
> would actually protect anyone. The distinction between code and data
> becomes especially vague when we're talking about Web pages, which may
> have chunks of client-side script, server-side code, and static data and
> documentation all mixed together in a single file - and generating Web
> pages *is* a significant target of current development.

I think you are overthinking this. You can just dismiss any
content-generation system with this reasoning. Drupal's templates are
GPL; what is the licence of the HTML pages I serve? For KanjiVG, code
is GPL, data is CC. The GPL scripts are for our own use. Users will
only use the data and will only have to abide by the CC licence.
Release packages will not include any script and will therefore be
100% CC. And if some editor includes our CC templates in his GPL code,
as long as he releases the resulting data as CC, we won't get mad at
him. :)

Unfortunately as it will probably turn out, you will see that most
proprietary editors do not care about any licence and just happily rip
your data off and put their name and restrictions on it.

> The immediate question, though, is simply what license and copyright I
> should put on the test suite code I'm currently writing. What do you
> recommend?

For harmonization purposes I'd suggest GPLv3, but since it is your
code feel free to use whatever free-software licence you like.

> That's good because it'll mean the tests can be applied more directly,
> instead of to translated output files; also, "cayennes" has been
> suggesting edits to the kanji/*.svg files, so if those weren't the
> sources, there'd be a need to translate them in the other direction back
> to the actual source files. It'll also reduce the amount of
> compilation-like work to be done. A little is still needed for stuff like
> generating the all-in-one file. I'll make appropriate changes in my fork.

Yes, on the other hand my laziness in removing these obsolete files
clearly misled you. I apologize for that.

I also noticed Cayennes did some fixes to a couple of svg files, but
he did not file a merge request. Cayennes, is your branch good to
be merged?

Alex.

msk...@ansuz.sooke.bc.ca

Apr 2, 2012, 11:21:32 AM

to kan...@googlegroups.com

On Mon, 2 Apr 2012, Alexandre Courbot wrote:
> As of now there is no compilation taking place - just some simple
> packaging. The files in kanji/ are usable as-is. I'm afraid you might

Even simple packaging benefits from having a consistent interface, and the
work I'm currently doing on automated testing has the nature of
translating human-written files through nontrivial processing to create
outputs, which is what I mean by "compilation." This will become more
true in the future. Someone who prefers to figure out which scripts they
want to run and how to run them without invoking configure and make,
remains free to do so - that's no harder than it ever was.

But I'm not going to try to convince you that you want to use Autotools if
you don't in fact want to. Using Autotools makes the work I myself want
to do, easier to do, and so I've autoconfiscated the package in my own
fork of the Github repository. That involved adding some scripts, but not
changing the existing ones; they remain as usable without Autotools as
they ever were. Don't merge that material if you don't want it, but I
certainly hope you'll look at it even if you don't merge it. The code
probably speaks more clearly than I can here, as to why it's something I
think is worth doing.

> For harmonization purposes I'd suggest GPLv3, but since it is your
> code feel free to use whatever free-software licence you like.

At this point I'm thinking either public domain or GPL3/CC dual license.
I want my work to be available for combination with things on both sides
of the code/data boundary.

Alexandre Courbot

Apr 3, 2012, 7:42:30 AM

to kan...@googlegroups.com

Hi Matthew,

On Tue, Apr 3, 2012 at 12:21 AM, <msk...@ansuz.sooke.bc.ca> wrote:
> Even simple packaging benefits from having a consistent interface, and the
> work I'm currently doing on automated testing has the nature of
> translating human-written files through nontrivial processing to create
> outputs, which is what I mean by "compilation." This will become more
> true in the future. Someone who prefers to figure out which scripts they
> want to run and how to run them without invoking configure and make,
> remains free to do so - that's no harder than it ever was.

But isn't setting up and maintaining autotools a high price for that
relatively small convenience? If we can keep our kvg.py usage clean
and simple, I don't think that should ever be needed. Also autotools
was kind of used for C/C++ programs (I say "was" because the current
trend for newer programs is CMake), but I never saw it used with
Python.

> But I'm not going to try to convince you that you want to use Autotools if
> you don't in fact want to. Using Autotools makes the work I myself want
> to do, easier to do, and so I've autoconfiscated the package in my own
> fork of the Github repository. That involved adding some scripts, but not
> changing the existing ones; they remain as usable without Autotools as
> they ever were. Don't merge that material if you don't want it, but I
> certainly hope you'll look at it even if you don't merge it. The code
> probably speaks more clearly than I can here, as to why it's something I
> think is worth doing.

I had a look at it and for now it just looks like an expensive wrapper
around our Python scripts. I don't know how things will turn out in the
future, but for now I don't see any need for this in our master
branch. Of course, the magic of git is that if I turn out to be wrong,
we can always merge this later, but as much as possible I would like
to keep KanjiVG clean enough so that it does not need such a
configuration phase.

Alex.

msk...@ansuz.sooke.bc.ca

Apr 3, 2012, 12:44:50 PM

to kan...@googlegroups.com

On Tue, 3 Apr 2012, Alexandre Courbot wrote:
> But isn't setting up and maintaining autotools a high price for that
> relatively small convenience? If we can keep our kvg.py usage clean

I think losing potential developers because they can't run the code is
more expensive. But that's not really important - it was worth it to me
to do the work even if I am the only one who uses it, and now that I have
done it, nobody who wants to use it will need to do it again for as long
as you're right that this package will never need configuration.

On more interesting topics, I have now merged into my master a script to
add stroke numbers; the results of running that script on the SVG files;
test suite code to check for parseable XML, valid stroke numbers, and the
"no mixing groups with strokes at the same level" criterion; and
documentation thereof.
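
To give a flavour of two of the checks named above (parseable XML, and
no group holding both groups and strokes at the same level), here is a
rough sketch. It is not the test suite code from the fork; it assumes
groups are SVG "g" elements and strokes are "path" elements, and the
function names are invented for the example.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Strip any '{namespace}' prefix from an element tag."""
    return element.tag.rsplit("}", 1)[-1]


def check_svg(text):
    """Return a list of problems found in one SVG document given as a string."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as error:
        return ["not parseable XML: %s" % error]
    problems = []
    for element in root.iter():
        if local_name(element) != "g":
            continue
        # Collect the kinds of direct children of this group.
        kinds = {local_name(child) for child in element}
        if "g" in kinds and "path" in kinds:
            problems.append(
                "group %s mixes groups and strokes" % element.get("id", "?"))
    return problems
```

An empty list means the document passed both checks; the stroke-number
check is left out because it depends on how the numbers are stored.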

The test suite can be run with or without make according to the directions
in the README.too file; I can also provide a copy of the test results to
anyone who can't run the tests themselves, but the log output is a bit
lengthy.

My plan is to look next at other kinds of validity checks. Can the
strokes.txt file be trusted to any extent?

Cayenne

Apr 3, 2012, 4:10:38 PM

to kan...@googlegroups.com

On Monday, April 2, 2012 10:35:36 AM UTC-4, Alexandre Courbot wrote:

> I also noticed Cayennes did some fixes to a couple of svg files, but
> [edit: she] did not file a merge request. Cayennes, is your branch good to
> be merged?

Sorry, I'm still finding my way around. Yes, I believe my branch is good enough to be merged. All I changed was swapping placement of stroke numbers that were out of order and I used a text editor rather than something that might add cruft to the SVG. I did a pull request on GitHub, but it's a little disorganized (for example the first ones I changed are all in their own commits and the later ones are bunched together) because it was the first thing I did with git. Are there contribution guidelines anywhere? I can redo the pull request if there's a better way to do it.

- Cayenne

Alexandre Courbot

Apr 4, 2012, 7:44:41 PM

to kan...@googlegroups.com

>> I also noticed Cayennes did some fixes to a couple of svg files, but
>> [edit: she]

Oops, apologies. m(__)m

>> did not file a merge request. Cayennes, is your branch good to
>> be merged?
>
> Sorry, I'm still finding my way around. Yes, I believe my branch is good enough to be merged. All I changed was swapping placement of stroke numbers that were out of order and I used a text editor rather than something that might add cruft to the SVG. I did a pull request on GitHub, but it's a little disorganized (for example the first ones I changed are all in their own commits and the later ones are bunched together) because it was the first thing I did with git. Are there contribution guidelines anywhere? I can redo the pull request if there's a better way to do it.

Oops, apologies (again). I just realized I missed the pull request you
made 15 days ago. You did everything fine. For such obvious fixes that
we are sure we are not going to revert, either having one commit per
file or grouping several files in one commit is OK. As long as logical
changes are not mixed together, it is not a problem.

So I have (finally) merged your request and will be more careful in
the future. Thanks for sharing your fixes upstream, that really helps!