Do not forget to clarify the license.
Ok, there are two issues here:
- Why can't I post to tesseract-ocr -> issue 424
- How to submit new traineddata -> attachment as in issue 300, provided that I say "I accept Apache 2.0".
If you cannot post to the forum, then contact the owner of the forum ;-). There is a link at http://groups.google.com/group/tesseract-ocr/about (http://groups.google.com/group/tesseract-ocr/post?sendowner=1&_done=%2Fgroup%2Ftesseract-ocr%2Fabout%3F&)
> - How to submit new traineddata -> attachment as in issue 300, provided that I say "I accept Apache 2.0".
Create a new issue, similar to issue 300. Attach your data to it and write "it is released under license Apache 2.0".
I don't see any problem with you having whatever discussion it is you
want to have about the data here. It's certainly more appropriate a
topic than how p-ed off you are about the tesseract-ocr mailing list.
More than that, because Hebrew is an RTL language (for which support
is only beginning to appear in 3.1), any issues raised are more than
likely going to be development issues, so this is the right place to
discuss them.
For the record, I agree with Zdenko. It would have been better if you
had opened an enhancement issue. At the very least, it gives an
appropriate place for you to attach your work for would-be testers to
try out.
Patrick, I think you're on the list anyway, but I'm cc'ing you to be
sure. ISTR that you speak Hebrew (or maybe I just assumed from your
accent :) -- you might be the best placed person to give feedback on a
Hebrew language pack.
--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
Cool. Do you think you could introduce Roi and Enrico? I think that if
they can pool their resources, they might come up with something
better than they would have been able to achieve individually.
Also, as it's quite likely that Google are or will be working on
Hebrew, perhaps Ray might chime in? Aside from Hebrew, I think we're
all growing a little frustrated with the status of tesseract-ocr (in
my case, more because I'm starting to get off-list questions again,
and it'll be at least a week before I even have the facilities to do
anything).
I succeeded in creating a heb.traineddata and, separately, a
heb-ras.traineddata for rashi script. Basic graphic work, no dictionary
data yet.
Questions:
1) I happened to have generated my training images with a commercial
word processor, and I don't know the status of the fonts used. Does
this pose license issues?
2) recognition results are encouraging, but clearly dictionary data is
missing and ambiguous characters cannot be corrected. Any recommended
place to get a decent and suitable wordlist from (perhaps hspell)? I
presume that all words will have to be reversed as tesseract is
recognizing text "the wrong way".
License issues here too?
3) anyone willing to help me with packaging the dictionaries into the
final .traineddata?
4) problem with rashi: There is a huge class of texts (e.g.,
commentaries) merging square and rashi typefaces within the same line
of text. Ideally, one would recognize such texts using a single
language file. The situation with the two scripts is comparable to
latin upper and lowercase - there are letters topologically similar in
both cases, like s and S, letters topologically different, like a and
A, letters in one script easily confused with another in the other
(e.g. - kaf in square hebrew and nun in rashi). In latin, upper and
lower case have different unicodes, and it is easy to generate
prototypes including both. With hebrew, not. I've tried generating a
common .traineddata for both fonts, but recognition results were poor
due to these ambiguities. I've also tried defining all rashi
characters as italicized, to no avail. Any hint?
Enrico
IIRC, there's a whole bunch of issues in going from LTR to RTL.
Hyphenation is going to be fiddly and brittle if you're trying to
treat the language as LTR, for example.
> I'll send an intro email to Enrico and Roi next, along with the heb.traineddata (in case Enrico has better luck than me somehow).
>
Ok, cool.
> Patrick
Also, I don't think the strings actually need to be reversed. I did a
small experiment with Hebrew a while ago with Apertium, which required
no changes for words and their order: the bytes that make up the
strings are stored the same way as LTR text, they're just rendered
RTL, but I guess you'll have known that, or at least have seen
enough badly rendered Hebrew to have a gut feeling about it :)
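A minimal sketch of the point about storage order, assuming Python 3 and only its standard library. The helper name `describe` and the sample word are illustrative and not taken from Apertium or Tesseract. It lists the code points of a Hebrew word in the order they are stored, which is reading order; the right-to-left layout only happens at display time.

```python
import unicodedata

def describe(text):
    """Return (index, Unicode name, bidi class) for each code point, in storage order."""
    return [(i, unicodedata.name(ch), unicodedata.bidirectional(ch))
            for i, ch in enumerate(text)]

# "shalom", written with escapes so the source itself is not affected by bidi rendering:
# shin, lamed, vav, final mem (hypothetical sample word)
word = "\u05e9\u05dc\u05d5\u05dd"

for entry in describe(word):
    print(entry)

# The UTF-8 bytes follow the same logical order: the first two bytes encode shin.
print(word.encode("utf-8").hex())
```

Index 0 is shin, the first letter a reader pronounces, even though a bidi-aware terminal draws it at the right-hand end of the word.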
IANAL, TINLA.
It's one of the murkier areas of copyright, but it shouldn't be a real
concern. There's no way to rebuild a font from the features extracted
by Tesseract, so even in the off chance that you wanted to create a
clone of the font, you couldn't. Font copyrights are a bit of a
strange beast -- a font creator has no copyright interest in a work
that was set using their font -- but I guess it should be enough to
say that the existing training data packs for tesseract have been done
that way, and have been distributed for years without issue.
> 2) recognition results are encouraging, but clearly dictionary data is
> missing and ambiguous characters cannot be corrected. Any recommended
> place to get a decent and suitable wordlist from (perhaps hspell)? I
> presume that all words will have to be reversed as tesseract is
> recognizing text "the wrong way".
> License issues here too?
>
Again, no. If you're extracting a wordlist from a body of text,
there's no copyright issue -- at least, not under US law, where 'mere
facts' do not have copyright (it's kind of the same in most
jurisdictions, but the EU has database 'copyrights' which make this a
little unclear).
Think of it this way: you can't copyright a word, you can only
copyright a unique set of words.
(You can /trademark/ a word, but you can't copyright it; trademark
generally doesn't apply here either, because you're not using the word
in a way which affects its ability to function as a trademark: if
Microsoft started an ad campaign tomorrow saying 'Google with Bing!',
Google not only /could/ sue them for breach of trademark, they would
/have/ to, to maintain the trademark).
Anyway... you'll probably do best by grabbing a dump of the Hebrew
Wikipedia and extracting the words from that (if you can wait until next
week, I can do that for you), because 1) they're based in the US, and
follow US law, and 2) they're generally friendly to open source
projects, even if they're not using the same licence terms (some
Wikipedians with a loose grasp on copyright might try to tell you that
you can't do it, but the lawyers will more-than-likely give their
blessing - they have in the past (I can probably dig up a reference if
you need to see one)).
> 3) anyone willing to help me with packaging the dictionaries into the
> final .traineddata?
>
Like I said, I can probably do that next week (waiting for my new
laptop to be delivered).
> 4) problem with rashi: There is a huge class of texts (e.g.,
> commentaries) merging square and rashi typefaces within the same line
> of text. Ideally, one would recognize such texts using a single
> language file. The situation with the two scripts is comparable to
> latin upper and lowercase - there are letters topologically similar in
> both cases, like s and S, letters topologically different, like a and
> A, letters in one script easily confused with another in the other
> (e.g. - kaf in square hebrew and nun in rashi). In latin, upper and
> lower case have different unicodes, and it is easy to generate
> prototypes including both. With hebrew, not. I've tried generating a
> common .traineddata for both fonts, but recognition results were poor
> due to these ambiguities. I've also tried defining all rashi
> characters as italicized, to no avail. Any hint?
>
I had to look this up (http://en.wikipedia.org/wiki/Rashi_script).
Setting them as italic looks like the right thing to do. Did you
follow the normal training procedure (i.e., put the square letters in
one file, and the rashi in another?)
> Enrico
Ok, just using their wordlist /would/ be a licensing problem, because
it's GPL. This is kind of difficult to explain... wordlists can have
copyright under certain conditions (usually, based on there having
been criteria used for the selection of the words). I'd really
encourage you to go the wikipedia route instead, unless you can get
permission from the authors of hspell.
>>
>> > 3) anyone willing to help me with packaging the dictionaries into the
>> > final .traineddata?
>> >
>>
>> Like I said, I can probably do that next week (waiting for my new
>> laptop to be delivered).
>
> You'd be most welcome.
>
>> I had to look this up (http://en.wikipedia.org/wiki/Rashi_script).
>>
>> Setting them as italic looks like the right thing to do. Did you
>> follow the normal training procedure (i.e., put the square letters in
>> one file, and the rashi in another?)
>
> Done that. See the tarball attached to issue 432. In the version posted
> there I generated a separate heb.traineddata and a heb-ras.traineddata, but
> previously I merged the two. See the scripts train.com and train2.com there.
Ok, I'll have to make a note and come back to it next week. Feel free
to remind me :)
Yes, and that's actually not optimal. 'Correctness' is great for a
spelling checker, not as ideal for an OCR dictionary, because you want
to be able to recognise words that are commonly used, regardless of
correctness.
> In the case of OCR, a spellchecker's "degree of correctness" is
> IMHO a disadvantage, while a "random web dump" should be an advantage ;-)
> This is my opinion as the spellchecker maintainer for my (Slovak)
> language. I do not know how it is in hspell, but I keep all
> abbreviations, punctuation, numbers and old words out of the spellchecker.
> But these "words" are quite common in the texts I tried to OCR. And if I
> do OCR, I expect to get output that is identical to the
> original document - including mistakes.
>
> I do not know all the specifics of dawg dictionaries; for punc-dawg and
> number-dawg we can only guess from
> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Dictionary_Data_(Optional)
> But first of all you need to list words based on their frequency. You
> cannot make a frequency list from a spellchecker, but you can make one
> from Wikipedia.
I see your point. Ideally, then, one should look for a traceable wordlist
with frequencies, or at least build the wordlist from a well-defined
repository. That would probably vary according to the application
targeted - business cards are one, old texts another - but maybe
something like that is out there on the net. At least, there are
collections of transcribed texts - for some kinds of Hebrew texts, for
instance, http://opensiddur.org/, http://www.seforimonline.org/,
http://www.hebrewbooks.org/. My question then becomes: how to
automatically build the dictionaries needed by tesseract, given a bunch of
text files, also taking RTL vs. LTR into account?
Enrico
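One possible way to approach the question above, as a rough sketch rather than an established recipe: count the Hebrew words in a directory of UTF-8 text files and write them out most-frequent first, one word per line. The function names, the `min_count` threshold and the `*.txt` layout are all assumptions made for illustration. The regular expression only matches unpointed letters (alef through tav, including final forms), so vocalized text would be split at the vowel points. RTL handling is deliberately left out; the words are written in their stored (logical) order.

```python
import re
from collections import Counter
from pathlib import Path

# Runs of Hebrew letters U+05D0..U+05EA (alef..tav, final forms included).
HEBREW_WORD = re.compile(r"[\u05d0-\u05ea]+")

def count_words(texts):
    """Count Hebrew words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(HEBREW_WORD.findall(text))
    return counts

def frequency_wordlist(counts, min_count=1):
    """Return words sorted by descending count, ties broken by code point order."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, count in ranked if count >= min_count]

def build_wordlist(input_dir, output_file, min_count=2):
    """Read every *.txt file in input_dir and write one word per line to output_file."""
    paths = sorted(Path(input_dir).glob("*.txt"))
    texts = (path.read_text(encoding="utf-8") for path in paths)
    words = frequency_wordlist(count_words(texts), min_count)
    Path(output_file).write_text("\n".join(words) + "\n", encoding="utf-8")
    return words
```

Taking the top slice of the resulting list would give a candidate frequent-words file, and the full list a candidate general wordlist; converting either into a dawg is still left to the tools described on the TrainingTesseract3 wiki page.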
Did you try out the heb.traineddata that is checked in to svn for version 3.01?
We haven't done any work on Hebrew yet - we just turned the handle on our automated training system to generate the heb.traineddata, but the dictionary is in there. It won't help though because of the RTL issue.
I have a plan for dealing with RTL that should dramatically improve its accuracy for Hebrew - the dictionary will actually be useful, and it is not a big fix. The words in the dictionary need to be reversed, and the words on each line need to be reversed (both word order and character order), and that is all that is needed for minimal RTL support.
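The two reversals described above can be illustrated with a short sketch. This is not Tesseract's code: the function names are hypothetical, and the naive code point reversal assumes unpointed text with no embedded left-to-right runs (digits or Latin words would need to keep their own order, and combining vowel points would end up attached to the wrong letter).

```python
def reverse_word(word):
    """Reverse the character order of a single word."""
    return word[::-1]

def reverse_wordlist(words):
    """Reverse every dictionary entry, keeping the list order unchanged."""
    return [reverse_word(word) for word in words]

def reverse_line(line):
    """Reverse both the word order and the character order within each word."""
    return " ".join(reverse_word(word) for word in reversed(line.split()))

# Latin placeholders keep the example readable in any terminal.
print(reverse_line("abc de"))           # prints "ed cba"
print(reverse_wordlist(["abc", "de"]))  # prints ['cba', 'ed']
```

For a line separated by single spaces, reversing word order and character order together is the same as reversing the whole string, so applying `reverse_line` twice gives back the original line.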