Adding grc to langdata

Nick White

29 Oct 2015, 13:45:12

to: tesser...@googlegroups.com

Hi all,

I recently opened a pull request to add grc to the langdata
repository: https://github.com/tesseract-ocr/langdata/pull/19

Nobody has done that before, and presumably there are some internal
tools that wrap around tesstrain.sh to generate the stuff that ends
up in tessdata, which I haven't seen. Hopefully everything here
should work nicely with that, but let me know if not and I would be
very happy to fix stuff up as needed.

The files in the pull request are all generated (using 'make
langdata') from a git repository:
http://ancientgreekocr.org/grctraining.git

Hopefully it is more-or-less complete. Any suggestions or whatever
would be most welcome.

The training that results from this process is still ~1.5% worse
than the training I made last year, for reasons I haven't yet
figured out, but I'm sure I'll get there in the end.

I was thinking I should fully document training using the
tesstrain.sh process sometime, with a view to encouraging more
people to work with it and hopefully get more pull requests of this
kind in the future.

All the best,

Nick

Nick White

29 Oct 2015, 14:25:17

to: tesser...@googlegroups.com

FYI there isn't a web viewer for the git repository, so just clone
it like this:

git clone http://ancientgreekocr.org/grctraining.git

Nick

> --
> You received this message because you are subscribed to the Google Groups "tesseract-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-de...@googlegroups.com.
> To post to this group, send email to tesser...@googlegroups.com.
> Visit this group at http://groups.google.com/group/tesseract-dev.
> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-dev/20151029174446.GA4904%40manta.lan.
> For more options, visit https://groups.google.com/d/optout.

Tom Morris

29 Oct 2015, 14:49:59

to: tesser...@googlegroups.com

That's great stuff, Nick!

I'd need to dig into how they're being used in the processing, but I think care needs to be taken with the licensing of the dependencies:

https://github.com/PerseusDL/canonical-greekLit

https://github.com/brobertson/rigaudon

The first is CC Attribution-ShareAlike, but also includes weasel words about it being only for "personal use". However, if the corpus is just being used as input for word counts, etc., it's probably fine. The second is GPL v2, so is potentially more problematic, particularly if its dictionaries/word lists are being used directly. It's also a derivative of another GPL'd work, so you'd need to go even further upstream to find a clean copy.

The provenance of dictionaries is usually very messy and/or unknown, so I'm not sure where you'd find a good replacement for Rigaudon. The Rigaudon paper https://docs.google.com/document/d/1iYfqflLybd3f9bBfTBk8aY_FTq05kCnq4tiKwYpqtrM/edit mentions a Dr. Boschetti dictionary. There are also some here (of unclear license): http://www.himeros.eu/

I think it's great that you've got the process pegged to specific commits in the dependencies so the results are reproducible, but it'd be nice to get those commit IDs and repo URLs emitted somewhere (grc.config?) in the final data.

Tom

Nick White

29 Oct 2015, 15:39:58

to: tesser...@googlegroups.com

Hi Tom, thanks for taking a look and being so positive :)

I'm friendly with the authors of both the dependencies, so I am
confident I can work out good terms to make sure everything is
perfectly above-board copyright-wise. But you're right, for that
reason I was lax about thinking about it much beforehand.

As I read it the canonical-greekLit license statement is "everything
is CC-BY-SA unless stated otherwise in individual files", and
grepping around I can't find any exceptions. The "personal use" part
of the readme doesn't use the phrase "personal use only", so I think
it's best taken as a non-normative statement of the bare minimum
they intend to offer, rather than some exception to the license. Is
that how you read it too? If you think it's still ambiguous I can
contact the people in charge and ask if it can be made clearer - I
am confident their intention is that it is freely usable for exactly
this kind of application.
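
For illustration, the "grepping around" check described above could be sketched in Python roughly as follows. This is not the command actually used in this thread; the function name find_license_mentions and the keyword pattern LICENSE_HINTS are assumptions made up for the example, and the keywords would need adjusting to the wording you are looking for.

```python
import os
import re

# Hypothetical keyword pattern (an assumption, not taken from the thread).
LICENSE_HINTS = re.compile(r"licen[cs]e|copyright|all rights reserved", re.IGNORECASE)


def find_license_mentions(root):
    """Return sorted (relative_path, line_number, line) tuples that mention licensing."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git metadata so that only the checked-out files are scanned.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on files that are not valid UTF-8.
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if LICENSE_HINTS.search(line):
                        hits.append((os.path.relpath(path, root), number, line.strip()))
    return sorted(hits)
```

Running find_license_mentions on a local clone of the corpus lists every line that would need a human look; an empty list only means none of the assumed keywords appeared, not that there are no exceptions.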

As for rigaudon, you're right that the license (GPLv2) is more
problematic to use in the Apache 2.0-licensed training, and moreover
the provenance of the dictionaries I'm getting from there is not at
all clear. I'll write to Bruce (who maintains rigaudon) to ask about
it. And yes, it is true that I am just incorporating the wordlist
from there quite directly, not extracting it like in the case of the
canonical-greekLit repository.
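
To make the distinction concrete, "extracting" a wordlist from a corpus can be as simple as counting word frequencies over the texts. The sketch below is a minimal illustration and not the code in the grctraining repository; the function name build_wordlist and the min_count parameter are assumptions made up for the example.

```python
import re
import unicodedata
from collections import Counter


def build_wordlist(texts, min_count=1):
    """Return the words found in texts, most frequent first (ties in code point order)."""
    counts = Counter()
    for text in texts:
        # NFC composes accented Greek letters into single code points where a
        # precomposed form exists, so words are not split at combining accents.
        normalized = unicodedata.normalize("NFC", text)
        # [^\W\d_]+ matches runs of letters only (no digits or underscores).
        counts.update(re.findall(r"[^\W\d_]+", normalized))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, count in ranked if count >= min_count]
```

With this approach only the derived frequency list ends up in the training data, while the corpus itself stays an input and is not redistributed.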

Word lists strike me as a good example of something that copyright
shouldn't (and depending on jurisdiction generally probably doesn't)
cover, as they are very much an aggregation of 'mere facts'. But
sadly I suppose it is sensible to work on the assumption that they
are copyrightable just in case, so we are safe.

> I think it's great that you've got the process pegged to specific commits in
> the dependencies so the results are reproducible, but it'd be nice to get
> those commit IDs and repo URLs emitted somewhere (grc.config?) in the final
> data.

I think it's great too :) The commit ID of the grctraining
repository is included in the grc.config already, and from that it's
easy to find the commits of the dependencies that were used (as
they're pegged in the makefile, as you saw). I suppose including the
grctraining repo URL in the grc.config comment makes sense, good
idea. Do you think including the commits / URLs used for the
dependencies is useful there? I am inclined to think it is not, but
feel free to persuade me otherwise ;)
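
As a rough sketch of what emitting this information could look like, the snippet below formats repository URLs and commit IDs as comment lines for a config file. It is not how grc.config is actually generated: the function name provenance_comment, the comment layout, and the assumption that lines starting with '#' are treated as comments are all made up for the example, and the commit IDs shown in the usage note are placeholders.

```python
def provenance_comment(sources):
    """Format {name: (repo_url, commit_id)} as '#'-prefixed comment lines.

    Assumes the target config format ignores lines that start with '#'.
    """
    lines = ["# Training data provenance:"]
    for name, (url, commit) in sources.items():
        lines.append(f"# {name}: {url} @ {commit}")
    return "\n".join(lines) + "\n"
```

For example, provenance_comment({"grctraining": ("http://ancientgreekocr.org/grctraining.git", "<commit-id>"), "canonical-greekLit": ("https://github.com/PerseusDL/canonical-greekLit", "<commit-id>")}) yields a three-line comment block that could be prepended to the config, so readers can trace the lineage without cloning the training repository.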

Thanks again for taking a look and offering such useful thoughts.

Nick


Nick White

29 Oct 2015, 18:14:34

to: tesser...@googlegroups.com

Well, good instincts Tom :)

Turns out the author of Rigaudon had recently been told the wordlist
he was using wasn't redistributable and he had been planning to take
it down.

I updated my grctraining repository and the langdata pull request to
remove the dependency; no harm done as it was just an extra source
of words.

As I said, I think the canonical-greekLit dependency is fine, but if
you reckon the wording is ambiguous I'll contact the right people to
see if I can get it fixed.

Nick


Helmut Wollmersdorfer

30 Oct 2015, 14:57:45

to: tesseract-dev

Hi Nick,

On Thursday, 29 October 2015 20:39:58 UTC+1, Nick White wrote:

> Word lists strike me as a good example of something that copyright
> shouldn't (and depending on jurisdiction generally probably doesn't)
> cover, as they are very much an aggregation of 'mere facts'. But
> sadly I suppose it is sensible to work on the assumption that they
> are copyrightable just in case, so we are safe.

Words are public domain. But collections of data or whatever can be protected by "Database Protection" in some countries (all EU countries AFAIR).

The protection lasts AFAIR 15 years from publication date. I.e. it's free to extract a wordlist from The Oxford Dictionary or Duden published in the year 1999.

See https://en.wikipedia.org/wiki/Database_Directive

PS: Thanks for your work and sharing it. Helps me a lot OCR-ing 18th century scientific books.

Helmut Wollmersdorfer

Tom Morris

30 Oct 2015, 15:44:14

to: tesser...@googlegroups.com

On Thu, Oct 29, 2015 at 6:13 PM, Nick White <nick....@durham.ac.uk> wrote:

> As I said, I think the canonical-greekLit dependency is fine, but if
> you reckon the wording is ambiguous I'll contact the right people to
> see if I can get it fixed.

I think it's fine since the corpus is just used as input data and isn't being redistributed in any form, so I shouldn't have bothered to comment on the license. I just don't like ambiguity in license statements. :-)

Tom

Tom Morris

31 Oct 2015, 17:02:14

to: tesser...@googlegroups.com

On Thu, Oct 29, 2015 at 3:39 PM, Nick White <nick....@durham.ac.uk> wrote:

> > I think it's great that you've got the process pegged to specific commits in
> > the dependencies so the results are reproducible, but it'd be nice to get
> > those commit IDs and repo URLs emitted somewhere (grc.config?) in the final
> > data.
>
> I think it's great too :) The commit ID of the grctraining
> repository is included in the grc.config already, and from that it's
> easy to find the commits of the dependencies that were used (as
> they're pegged in the makefile, as you saw). I suppose including the
> grctraining repo URL in the grc.config comment makes sense, good
> idea. Do you think including the commits / URLs used for the
> dependencies is useful there? I am inclined to think it is not, but
> feel free to persuade me otherwise ;)

I'd include the repo and commit ID for the Perseus corpus as well to make it easy for people to trace the lineage. Otherwise they need to clone your repo and know to dig through the Makefile to find the info. If the corpus is relatively static and unlikely to undergo big changes, perhaps it doesn't make a difference.

Tom

Nick White

11 Dec 2015, 9:53:41

to: tesser...@googlegroups.com

Hi Helmut,

On Fri, Oct 30, 2015 at 08:58:50AM -0700, Helmut Wollmersdorfer wrote:
> On Thursday, 29 October 2015 20:39:58 UTC+1, Nick White wrote:
>
>
> > Word lists strike me as a good example of something that copyright
> > shouldn't (and depending on jurisdiction generally probably doesn't)
> > cover, as they are very much an aggregation of 'mere facts'. But
> > sadly I suppose it is sensible to work on the assumption that they
> > are copyrightable just in case, so we are safe.
>
>
> Words are public domain. But collections of data or whatever can be protected
> by "Database Protection" in some countries (all EU countries AFAIR).
>
> The protection lasts AFAIR 15 years from publication date. I.e. it's free to
> extract a wordlist from The Oxford Dictionary or Duden published in the year
> 1999.
>
> See https://en.wikipedia.org/wiki/Database_Directive

Thanks a lot for the clarification, that's good to know.

> PS: Thanks for your work and sharing it. Helps me a lot OCR-ing 18th century
> scientific books.

Great, I'm happy you're finding it useful. Let me know if you find
any consistent errors or other areas you reckon could be improved.

Nick

Nick White

11 Dec 2015, 10:00:10

to: tesser...@googlegroups.com

On Sat, Oct 31, 2015 at 05:02:12PM -0400, Tom Morris wrote:
> I'd include the repo and commit ID for the Perseus corpus as well to make it
> easy for the people to trace the lineage. Otherwise they need to clone your
> repo and know to dig through the Makefile to find the info. If the corpus is
> relatively static and unlikely to undergo big changes, perhaps it doesn't make a
> difference.

I think you're right, actually, you have convinced me that it's a
useful thing to include, as the corpus is likely to change in the
future, and it could be useful to easily pinpoint changes caused by
a different wordlist source.

I included it in my latest commit to the pull request.

The pull request hasn't been merged or commented on yet. Ray, do let
me know here or wherever if there's anything else you'd like to see
me do in order to merge it.

Nick
