I would contribute heb.traineddata


Enrico Segre

Jan 10, 2011, 3:24:58 AM
to tesser...@googlegroups.com
And my posts to tesseract-ocr never make it out to the group. Why? The group manager is on holiday?

zdenko podobny

Jan 10, 2011, 3:42:57 AM
to tesser...@googlegroups.com
Hi,

please post it as an issue (http://code.google.com/p/tesseract-ocr/issues; see http://code.google.com/p/tesseract-ocr/issues/detail?id=300). Do not forget to clarify the license.

Zd.

Enrico Segre

Jan 10, 2011, 4:13:15 AM
to tesser...@googlegroups.com


On Monday, January 10, 2011 10:42:57 AM UTC+2, Zdenko Podobný wrote:
Do not forget clarify license.

Done, http://code.google.com/p/tesseract-ocr/issues/detail?id=424. What does "clarify license" mean?

Enrico

Enrico Segre

Jan 10, 2011, 4:32:07 AM
to tesser...@googlegroups.com
Ok, there are two issues here:
  1. Why can't I post to tesseract-ocr -> issue 424
  2. How to submit new traineddata -> attachment as in issue 300, provided that I say "I accept Apache 2.0".
Correct?

However, I would consider improving my heb.traineddata before submitting -> discuss the matter on tesseract-ocr -> issue 424.
Enrico

zdenko podobny

Jan 10, 2011, 5:52:30 AM
to tesser...@googlegroups.com
On Mon, Jan 10, 2011 at 10:32 AM, Enrico Segre <enrico...@weizmann.ac.il> wrote:
Ok, there are two issues here:
  1. Why can't I post to tesseract-ocr -> issue 424
  2. How to submit new traineddata -> attachment as in issue 300, provided that I say "I accept Apache 2.0".
Create a new issue, similar to 300. Attach your data to it and write "it is released under license Apache 2.0".

Enrico Segre

Jan 10, 2011, 6:22:26 AM
to tesser...@googlegroups.com

Already did, no reply. Funny that tesseract-dev is not moderated (or at least not blocked to me) and -ocr is.
 
  2. How to submit new traineddata -> attachment as in issue 300, provided that I say "I accept Apache 2.0".
Create a new issue, similar to 300. Attach your data to it and write "it is released under license Apache 2.0".


Could do, but prefer to discuss issues in public first, rather than multiple-posting.
Enrico

Enrico Segre

Jan 17, 2011, 8:47:42 AM
to tesseract-dev
So does anyone here know who is the listowner of tesseract-ocr and why
is he not available? You know, it is a little p-ing off to be willing
to contribute some work and not being allowed to talk. Enrico

Jimmy O'Regan

Jan 17, 2011, 9:06:09 AM
to tesser...@googlegroups.com, Patrick Questembert

I don't see any problem with you having whatever discussion it is you
want to have about the data here. It's certainly more appropriate a
topic than how p-ed off you are about the tesseract-ocr mailing list.
More than that, because Hebrew is an RTL language (for which support
is only beginning to appear in 3.1), any issues raised are more than
likely going to be development issues, so this is the right place to
discuss them.

For the record, I agree with Zdenko. It would have been better if you
had opened an enhancement issue. At the very least, it gives an
appropriate place for you to attach your work for would-be testers to
try out.

Patrick, I think you're on the list anyway, but I'm cc'ing you to be
sure. ISTR that you speak Hebrew (or maybe I just assumed from your
accent :) -- you might be the best placed person to give feedback on a
Hebrew language pack.

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

Jimmy O'Regan

Jan 17, 2011, 9:24:49 AM
to Patrick Questembert, Ray Smith, tesser...@googlegroups.com
On 17 January 2011 14:12, Patrick Questembert <pat...@scanbizcards.com> wrote:
> Hi Jimmy,
>
> Roi Dayan in Israel built a Hebrew training set - I am going to test it myself soon, and I suggest that I upload it as soon as I find that it works decently; I can make it available sooner to anyone who contacts me.
>
> Speaking of posting to the forum: as of a few weeks ago I keep getting an error from Google groups telling me it failed to post my reply - this is consistent across browsers AND when I reply by email. Any ideas? I am testing this again with the present reply.
>

Cool. Do you think you could introduce Roi and Enrico? I think that if
they can pool their resources, they might come up with something
better than they would have been able to achieve individually.

Also, as it's quite likely that Google are or will be working on
Hebrew, perhaps Ray might chime in? Aside from Hebrew, I think we're
all growing a little frustrated with the status of tesseract-ocr (in
my case, more because I'm starting to get off-list questions again,
and it'll be at least a week before I even have the facilities to do
anything).

Enrico Segre

Jan 17, 2011, 9:55:16 AM
to tesser...@googlegroups.com, Patrick Questembert, Ray Smith
Well, I still think that it would have been proper to post in tesseract-ocr to sort out potential newbie questions first before annoying the respectable devs, but if that is the position, here I am.

I attached a tarball to http://code.google.com/p/tesseract-ocr/issues/detail?id=432.

Below is the message which I was trying to post on tesseract-ocr (list-owner, if you read this, you can forget moderating all my posts there of the past two weeks, then).

==========================================================================================

I succeeded in creating a heb.traineddata and, separately, a
heb-ras.traineddata for Rashi script. Basic graphic work, no dictionary
data yet.

Questions:
1) I happened to have generated my training images with a commercial
word processor, and I don't know the status of the fonts used. Does
this pose license issues?

2) recognition results are encouraging, but clearly dictionary data is
missing and ambiguous characters cannot be corrected. Any recommended
place to get a decent and suitable wordlist from (perhaps hspell)? I
presume that all words will have to be reversed as tesseract is
recognizing text "the wrong way".
License issues here too?

3) anyone willing to help me with packaging the dictionaries into the
final .traineddata?

4) problem with rashi: There is a huge class of texts (e.g.,
commentaries) merging square and rashi typefaces within the same line
of text. Ideally, one would recognize such texts using a single
language file. The situation with the two scripts is comparable to
latin upper and lowercase - there are letters topologically similar in
both cases, like s and S, letters topologically different, like a and
A, letters in one script easily confused with another in the other
(e.g. - kaf in square hebrew and nun in rashi). In latin, upper and
lower case have different unicodes, and it is easy to generate
prototypes including both. With Hebrew, not. I've tried generating a
common .traineddata for both fonts, but recognition results were poor
due to these ambiguities. I've also tried defining all Rashi
characters as italicized, to no avail. Any hint?

Enrico

Jimmy O'Regan

Jan 17, 2011, 10:01:05 AM
to Patrick Questembert, Ray Smith, tesser...@googlegroups.com
On 17 January 2011 14:53, Patrick Questembert
<patrick.q...@gmail.com> wrote:
> Sure, Roy's email is roi....@gmail.com
>
> FYI I just tested the Hebrew language pack he created with Tesseract 3.0 and get this error on the two images I tried:
> 2011-01-17 09:44:18.805 ScanBizCards[4416:8b0b] Calling TessBasePI::InitWithLanguage [heb] - time is 1295275458
> ScanBizCards(4416,0x2ff66000) malloc: *** mmap(size=515375104) failed (error code=12)
> *** error: can't allocate region
>
> About the Tess support for RTL: my (totally unfounded) guess is that even without special RTL support Tess 3.0 should do OK on Hebrew considering that Hebrew fonts are well separated, letters don't usually hang over others (like a 'f' might hover over a 'i' following it). The application would just need to reverse the strings of course (for which public domain code exists). When I tested Roy's previous training set (which didn't crash) I was getting OK results.
>

IIRC, there's a whole bunch of issues in going from LTR to RTL.
Hyphenation is going to be fiddly and brittle if you're trying to
treat the language as LTR, for example.

> I'll send an intro email to Enrico and Roi next, along with the heb.traineddata (in case Enrico has better luck than me somehow).
>

Ok, cool.

> Patrick

Jimmy O'Regan

Jan 17, 2011, 10:09:50 AM
to Patrick Questembert, Ray Smith, tesser...@googlegroups.com
On 17 January 2011 15:01, Jimmy O'Regan <jor...@gmail.com> wrote:
> On 17 January 2011 14:53, Patrick Questembert
> <patrick.q...@gmail.com> wrote:
>> Sure, Roy's email is roi....@gmail.com
>>
>> FYI I just tested the Hebrew language pack he created with Tesseract 3.0 and get this error on the two images I tried:
>> 2011-01-17 09:44:18.805 ScanBizCards[4416:8b0b] Calling TessBasePI::InitWithLanguage [heb] - time is 1295275458
>> ScanBizCards(4416,0x2ff66000) malloc: *** mmap(size=515375104) failed (error code=12)
>> *** error: can't allocate region
>>
>> About the Tess support for RTL: my (totally unfounded) guess is that even without special RTL support Tess 3.0 should do OK on Hebrew considering that Hebrew fonts are well separated, letters don't usually hang over others (like a 'f' might hover over a 'i' following it). The application would just need to reverse the strings of course (for which public domain code exists). When I tested Roy's previous training set (which didn't crash) I was getting OK results.
>>
>
> IIRC, there's a whole bunch of issues in going from LTR to RTL.
> Hyphenation is going to be fiddly and brittle if you're trying to
> treat the language as LTR, for example.
>

Also, I don't think the strings actually need to be reversed. I did a
small experiment with Hebrew a while ago with Apertium, which required
no changes for words and their order: the bytes that make up the
strings are stored the same way as LTR text, they're just rendered
RTL, but I guess you'll have known that, or at least have seen
enough badly rendered Hebrew to have a gut feeling about it :)
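
A small Python sketch of the point about storage order. The sample word is an illustrative choice, not taken from this thread. It shows that the code points of a Hebrew string are stored in reading order, first letter first, and that only the display layer draws them right to left.

```python
import unicodedata

# "shalom": shin, lamed, vav, final mem, stored in reading order (illustrative sample).
word = "\u05e9\u05dc\u05d5\u05dd"

# The first element of the string is the first letter a reader pronounces.
names = [unicodedata.name(ch) for ch in word]
print(names[0])
print(names[-1])

# Every Hebrew letter carries the bidi class "R"; a renderer uses that class
# to draw the run right to left without changing the stored order.
bidi_classes = [unicodedata.bidirectional(ch) for ch in word]
print(bidi_classes)

# UTF-8 encoding keeps the same logical order: the bytes of shin come first.
encoded = word.encode("utf-8")
print(encoded[:2] == "\u05e9".encode("utf-8"))
```

This only describes how Unicode text is stored. Whether an OCR engine emits its characters in that order is a separate question, which is what the rest of the thread is about.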

Enrico Segre

Jan 17, 2011, 10:12:48 AM
to tesser...@googlegroups.com, Patrick Questembert, Ray Smith


On Monday, January 17, 2011 5:01:05 PM UTC+2, jimregan wrote:
 

IIRC, there's a whole bunch of issues in going from LTR to RTL.
Hyphenation is going to be fiddly and brittle if you're trying to
treat the language as LTR, for example.


I'm not an expert here, but working on a line by line basis you might neglect hyphenation (would it be resolved for dictionary lookup anyway?). Besides, hebrew words are generally a few chars long and rarely hyphenated.
Rather - bidi marks within the text have to be handled by the rev process. (Arabic/indian) Numbers for instance are embedded as LTR in RTL hebrew text.
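
To illustrate the remark about numbers, here is a hedged Python sketch. The function name `visual_to_logical` and the sample line are hypothetical, and the approach is a deliberate simplification, not the full Unicode Bidirectional Algorithm. It reverses a line that was read left to right in visual order, then restores each run of digits, since digits keep their LTR order inside RTL text.

```python
import re
import unicodedata

def visual_to_logical(visual_line: str) -> str:
    """Naively convert a visually ordered RTL line to logical order.

    Simplifying assumption: the line holds only Hebrew letters, spaces and
    digit runs. Real text needs the Unicode Bidirectional Algorithm.
    """
    reversed_line = visual_line[::-1]
    # Digit runs were flipped by the reversal above; flip them back.
    return re.sub(r"\d+", lambda match: match.group(0)[::-1], reversed_line)

# Bidi classes show why digits need special care.
letter_class = unicodedata.bidirectional("\u05d0")  # HEBREW LETTER ALEF
digit_class = unicodedata.bidirectional("7")
print(letter_class, digit_class)

# Logical text: alef, bet, space, "2011". Scanned left to right on the page,
# the same line appears as "2011", space, bet, alef.
visual = "2011 \u05d1\u05d0"
logical = visual_to_logical(visual)
print(logical == "\u05d0\u05d1 2011")
```

Explicit bidi control characters, Latin words and punctuation are not handled here; a reversal step in a real pipeline would have to deal with them too.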

Jimmy O'Regan

Jan 17, 2011, 10:43:28 AM
to tesser...@googlegroups.com, Patrick Questembert, Ray Smith
On 17 January 2011 14:55, Enrico Segre <enrico...@weizmann.ac.il> wrote:
> Well, I still think that it would have been proper to post in tesseract-ocr
> to sort out potential newby questions first before annoying the respectable
> devs, but if that is the position, here I am.
>
> I attached a tarball to
> http://code.google.com/p/tesseract-ocr/issues/detail?id=432.
>
> Below is the message which I was trying to post on tesseract-ocr
> (list-owner, if you read this, you can forget moderating all my posts there
> of the past two weeks, then).
>
> ==========================================================================================
>
> I succeeded in creating a heb.traineddata and, separately, a
> heb-ras.traineddata for Rashi script. Basic graphic work, no dictionary
> data yet.
>
> Questions:
>
> 1) I happened to have generated my training images with a commercial
> word processor, and I don't know the status of the fonts used. Does
> this pose license issues?
>

IANAL, TINLA.

It's one of the murkier areas of copyright, but it shouldn't be a real
concern. There's no way to rebuild a font from the features extracted
by Tesseract, so even in the off chance that you wanted to create a
clone of the font, you couldn't. Font copyrights are a bit of a
strange beast -- a font creator has no copyright interest in a work
that was set using their font -- but I guess it should be enough to
say that the existing training data packs for tesseract have been done
that way, and have been distributed for years without issue.

> 2) recognition results are encouraging, but clearly dictionary data is
> missing and ambiguous characters cannot be corrected. Any recommended
> place to get a decent and suitable wordlist from (perhaps hspell)? I
> presume that all words will have to be reversed as tesseract is
> recognizing text "the wrong way".
> License issues here too?
>

Again, no. If you're extracting a wordlist from a body of text,
there's no copyright issue -- at least, not under US law, where 'mere
facts' do not have copyright (it's kind of the same in most
jurisdictions, but the EU has database 'copyrights' which make this a
little unclear).

Think of it this way: you can't copyright a word, you can only
copyright a unique set of words.

(You can /trademark/ a word, but you can't copyright it; trademark
generally doesn't apply here either, because you're not using the word
in a way which affects its ability to function as a trademark: if
Microsoft started an ad campaign tomorrow saying 'Google with Bing!',
Google not only /could/ sue them for breach of trademark, they would
/have/ to, to maintain the trademark).

Anyway... you'll probably do best by grabbing a dump of the Hebrew
Wikipedia, and extract the words from that (if you can wait until next
week, I can do that for you), because 1) they're based in the US, and
follow US law, and 2) they're generally friendly to open source
projects, even if they're not using the same licence terms (some
Wikipedians with a loose grasp on copyright might try to tell you that
you can't do it, but the lawyers will more-than-likely give their
blessing - they have in the past (I can probably dig up a reference if
you need to see one)).

> 3) anyone willing to help me with packaging the dictionaries into the
> final .traineddata?
>

Like I said, I can probably do that next week (waiting for my new
laptop to be delivered).

> 4) problem with rashi: There is a huge class of texts (e.g.,
> commentaries) merging square and rashi typefaces within the same line
> of text. Ideally, one would recognize such texts using a single
> language file. The situation with the two scripts is comparable to
> latin upper and lowercase - there are letters topologically similar in
> both cases, like s and S, letters topologically different, like a and
> A, letters in one script easily confused with another in the other
> (e.g. - kaf in square hebrew and nun in rashi). In latin, upper and
> lower case have different unicodes, and it is easy to generate
> prototypes including both. With Hebrew, not. I've tried generating a
> common .traineddata for both fonts, but recognition results were poor
> due to these ambiguities. I've also tried defining all Rashi
> characters as italicized, to no avail. Any hint?
>

I had to look this up (http://en.wikipedia.org/wiki/Rashi_script).
Setting them as italic looks like the right thing to do. Did you
follow the normal training procedure (i.e., put the square letters in
one file, and the rashi in another?)

> Enrico

Enrico Segre

Jan 17, 2011, 11:02:09 AM
to tesser...@googlegroups.com, Patrick Questembert, Ray Smith


On Monday, January 17, 2011 5:43:28 PM UTC+2, jimregan wrote:
Anyway... you'll probably do best by grabbing a dump of the Hebrew
 
Wikipedia, and extract the words from that (if you can wait until next

week, I can do that for you), because 1) they're based in the US, and
follow US law, and 2) they're generally friendly to open source
projects, even if they're not using the same licence terms (some
Wikipedians with a loose grasp on copyright might try to tell you that
you can't do it, but the lawyers will more-than-likely give their
blessing - they have in the past (I can probably dig up a reference if
you need to see one)).

As said, my first bet would be http://hspell.ivrix.org.il/ - it is GPL, and it is not just a random dump of web text but a dictionary built on strict orthographic rules. I just figured that repacking it may not be trivial: as I understand from the docs, one of its pluses is the optimized, "linguistic" way in which the words are packed.
 

> 3) anyone willing to help me with packaging the dictionaries into the
> final .traineddata?
>

Like I said, I can probably do that next week (waiting for my new
laptop to be delivered).


You'd be most welcome.

I had to look this up (http://en.wikipedia.org/wiki/Rashi_script).

Setting them as italic looks like the right thing to do. Did you
follow the normal training procedure (i.e., put the square letters in
one file, and the rashi in another?)


Done that. See the tarball attached to issue 432. In the version posted there I generated a separate heb.traineddata and a heb-ras.traineddata, but previously I merged the two. See the scripts train.com and train2.com there.

Enrico

Jimmy O'Regan

Jan 17, 2011, 11:23:04 AM
to tesser...@googlegroups.com
On 17 January 2011 16:02, Enrico Segre <enrico...@weizmann.ac.il> wrote:
> On Monday, January 17, 2011 5:43:28 PM UTC+2, jimregan wrote:
>> Anyway... you'll probably do best by grabbing a dump of the Hebrew
>> Wikipedia, and extract the words from that (if you can wait until next
>>
>> week, I can do that for you), because 1) they're based in the US, and
>> follow US law, and 2) they're generally friendly to open source
>> projects, even if they're not using the same licence terms (some
>> Wikipedians with a loose grasp on copyright might try to tell you that
>> you can't do it, but the lawyers will more-than-likely give their
>> blessing - they have in the past (I can probably dig up a reference if
>> you need to see one)).
>
> As said, my first bet would be http://hspell.ivrix.org.il/ - it is GPL, it
> is not just a random dump of web text but a dictionary built on strict
> orthographic rules. I just figure out that transpacking may not be trivial,
> as one of their plusses as I understand it from the doc is the optimized,
> "linguistic" way of packing the words they use.
>

Ok, just using their wordlist /would/ be a licensing problem, because
it's GPL. This is kind of difficult to explain... wordlists can have
copyright under certain conditions (usually, based on there having
been criteria used for the selection of the words). I'd really
encourage you to go the wikipedia route instead, unless you can get
permission from the authors of hspell.

>>
>> > 3) anyone willing to help me with packaging the dictionaries into the
>> > final .traineddata?
>> >
>>
>> Like I said, I can probably do that next week (waiting for my new
>> laptop to be delivered).
>
> You'd be most welcome.
>
>> I had to look this up (http://en.wikipedia.org/wiki/Rashi_script).
>>
>> Setting them as italic looks like the right thing to do. Did you
>> follow the normal training procedure (i.e., put the square letters in
>> one file, and the rashi in another?)
>
> Done that. See the tarball attached to issue 432. In the version posted
> there I generated a separate heb.traineddata and a heb-ras.traineddata, but
> previously I merged the two. See the scripts train.com and train2.com there.

Ok, I'll have to make a note and come back to it next week. Feel free
to remind me :)

Enrico Segre

Jan 18, 2011, 3:13:19 AM
to tesseract-dev
On Jan 17, 6:23 pm, "Jimmy O'Regan" <jore...@gmail.com> wrote:
> On 17 January 2011 16:02, Enrico Segre <enrico.se...@weizmann.ac.il> wrote:
> > As said, my first bet would be http://hspell.ivrix.org.il/ - it is GPL, it
> > is not just a random dump of web text but a dictionary built on strict
> > orthographic rules
>...
>
> Ok, just using their wordlist /would/ be a licensing problem, because
> it's GPL. This is kind of difficult to explain...

I see. http://en.wikipedia.org/wiki/Apache_license.
Honestly speaking, that would be a reason for me to release further
work under the GPL, if I'm ever able to do something worthwhile. I'm not
really sure I would like my modest contribution to reappear, say, as
part of a paid iPhone app for reading business cards.
Besides, a wordlist coming from something like hspell has a traceable
source and an assumed degree of correctness that a random web dump has
not.
Enrico

zdenko podobny

Jan 18, 2011, 7:56:22 AM
to tesser...@googlegroups.com
In the case of OCR, a spellchecker's "degree of correctness" is IMHO a disadvantage, while a "random web dump" should be an advantage ;-) This is my opinion as the spellchecker maintainer for my (Slovak) language. I do not know how it is in hspell, but I keep all abbreviations, punctuation, numbers and old words out of the spellchecker. But these "words" are quite common in the texts I tried to OCR. And when I do OCR, I expect to get output identical to the original document, including mistakes.

I do not know all the specifics of dawg dictionaries (for punc-dawg and number-dawg we can only guess; see http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Dictionary_Data_(Optional)). But first of all you need to list words by their frequency. You cannot make a frequency list from a spellchecker, but you can make one from Wikipedia.

Zd.

Jimmy O'Regan

Jan 18, 2011, 8:16:14 AM
to tesser...@googlegroups.com

Yes, and that's actually not optimal. 'Correctness' is great for a
spelling checker, not as ideal for an OCR dictionary, because you want
to be able to recognise words that are commonly used, regardless of
correctness.

Enrico Segre

Jan 18, 2011, 9:11:29 AM
to tesser...@googlegroups.com
zdenko podobny wrote:

> In case of OCR: spellchecker "degree of correctness" is
> IMHO disadvantage while "random web dump" should be advantage ;-) This
> is my opinion as spellchecker maintainer for my (Slovak) language. I
> do not know how it is in hspell, but I keep away
> all abbreviation, punctuation and numbers old words from spellchecker.
> But these "words" are quite common in texts I tried to OCR. And if I
> do OCR I expect that I get the output that will be identical with
> original document - including mistakes.
>
> I do not know all specialties of dawg dictionaries (for punc-dawg and
> number-dawg we can just
> guess http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Dictionary_Data_(Optional))

> But first of all you need to list words based on their frequency. You
> can not make frequency list from spellchecker, but you can make it on
> wikipedia.

I see your point. Ideally then, one should look for a traceable wordlist
with frequency. Or at least build the wordlist from a well defined
repository. That would probably vary according to the application
targeted - business cards are one, old texts another - but maybe
something of the kind is out there on the net. At least, there are
collections of transcribed texts; for some kinds of Hebrew texts, for
instance, http://opensiddur.org/, http://www.seforimonline.org/,
http://www.hebrewbooks.org/. My question then becomes: how to
automatically build the dictionaries needed by tesseract, given a bunch of
text files, taking RTL-LTR into account as well?
Enrico
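
As a rough sketch of the first half of that question, the following Python builds a frequency-ranked word list from plain-text files. The function names, the `*.txt` corpus layout and the sample strings are assumptions made for illustration, and the text is assumed to be unpointed (no niqqud). Turning such a list into the dawg files Tesseract expects would be a separate step with the tools described on the TrainingTesseract3 wiki page, and that step is not shown here.

```python
import re
from collections import Counter
from pathlib import Path

# Runs of Hebrew letters alef..tav (U+05D0-U+05EA); digits, Latin text and
# punctuation act as separators. Assumes unpointed text.
HEBREW_WORD = re.compile(r"[\u05d0-\u05ea]+")

def count_words(texts):
    """Count every Hebrew word occurring in an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(HEBREW_WORD.findall(text))
    return counts

def frequency_wordlist(texts, min_count=1):
    """Return words by descending frequency, ties broken by code point order."""
    ranked = sorted(count_words(texts).items(), key=lambda item: (-item[1], item[0]))
    return [word for word, count in ranked if count >= min_count]

def read_corpus(directory):
    """Read all *.txt files in a directory as UTF-8 (hypothetical corpus layout)."""
    return [path.read_text(encoding="utf-8") for path in sorted(Path(directory).glob("*.txt"))]

# Tiny in-memory corpus so the sketch runs without any files.
SHALOM = "\u05e9\u05dc\u05d5\u05dd"
OLAM = "\u05e2\u05d5\u05dc\u05dd"
sample = [f"{SHALOM} {OLAM} {SHALOM}", f"{OLAM}, {SHALOM} 2011"]

wordlist = frequency_wordlist(sample)
frequent_only = frequency_wordlist(sample, min_count=3)
print(len(wordlist), len(frequent_only))
```

With a real corpus, `frequency_wordlist(read_corpus("some_directory"))` would replace the in-memory sample. Whether the entries must also be reversed depends on how RTL ends up being handled.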

Ray Smith

Jan 19, 2011, 9:23:09 PM
to tesser...@googlegroups.com
Did you try out the heb.traineddata that is checked in to svn for version 3.01?
We haven't done any work on Hebrew yet - we just turned the handle on our automated training system to generate the heb.traineddata, but the dictionary is in there. It won't help though because of the RTL issue.

I have a plan for dealing with RTL that should dramatically improve its accuracy for Hebrew - the dictionary will actually be useful, and it is not a big fix. The words in the dictionary need to be reversed, and the words on each line need to be reversed, (both word order and character order) and that is all that is needed for minimal RTL support.
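
The reversal described above can be sketched in a few lines of Python. This only illustrates the operation as stated in this message and is not Tesseract's implementation; the function names and sample words are hypothetical.

```python
def reverse_word(word: str) -> str:
    """Reverse the character order of a single word."""
    return word[::-1]

def reverse_dictionary(words):
    """Reverse every entry of a word list, keeping the list order."""
    return [reverse_word(word) for word in words]

def reverse_line(line: str) -> str:
    """Reverse both the word order and the character order of a line."""
    return " ".join(reverse_word(word) for word in reversed(line.split(" ")))

ALEF_BET = "\u05d0\u05d1"     # alef, bet
GIMEL_DALET = "\u05d2\u05d3"  # gimel, dalet

dictionary = reverse_dictionary([ALEF_BET, GIMEL_DALET])
line = f"{ALEF_BET} {GIMEL_DALET}"
flipped = reverse_line(line)

# Reversing word order and character order together equals reversing the
# whole string, and applying the operation twice restores the input.
print(flipped == line[::-1])
print(reverse_line(flipped) == line)
```

Digits and other LTR runs embedded in a Hebrew line would need extra care beyond this minimal version.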

enrico...@weizmann.ac.il

Jan 20, 2011, 4:10:40 AM
to tesser...@googlegroups.com
On Thursday, January 20, 2011 4:23:09 AM UTC+2, Ray wrote:
Did you try out the heb.traineddata that is checked in to svn for version 3.01?

The one provided with rev 511 is incomplete, to say the least (I have been trying to post that on tesseract-ocr for three weeks).
From what can be understood from the initial part of the file, of the Hebrew alphabet it contains only 11 letters (one repeated), one of the five final forms, two ligatures that have a single Unicode code point (out of many other possible ones), the digits 0-8 except 6, and a few punctuation marks. Unusable.
 
We haven't done any work on Hebrew yet - we just turned the handle on our automated training system to generate the heb.traineddata, but the dictionary is in there. It won't help though because of the RTL issue.


Do I understand correctly that you have a hebrew dictionary already available for inclusion?
 
I have a plan for dealing with RTL that should dramatically improve its accuracy for Hebrew - the dictionary will actually be useful, and it is not a big fix. The words in the dictionary need to be reversed, and the words on each line need to be reversed, (both word order and character order) and that is all that is needed for minimal RTL support.

Ok, looking forward to being updated on the matter.
Enrico