Re: Yiddish data for Tesseract


Enrico Segre

Apr 30, 2011, 3:09:57 PM
to tesser...@googlegroups.com
Will,
I have looked into OCRFeeder - you can add the language option under Tools/OCR Engines/Tesseract (Edit)/Engine arguments: e.g.:
$IMAGE $FILE -l heb; cat $FILE.txt
Otherwise, tesseract will just use its default language, i.e. English. Recognition of German text in that case may be somewhat inferior to what the real German option gives, because of different dictionaries and perhaps (?) missing umlauted characters, but it will give at least some result thanks to the common Latin alphabet.
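
For anyone running tesseract outside OCRFeeder, the engine arguments above correspond roughly to the call sketched below. This is a minimal sketch, assuming the `tesseract imagename outputbase [-l lang]` command-line form and that the result lands in `outputbase.txt`; the function names are hypothetical helpers, not part of tesseract or OCRFeeder.

```python
import subprocess
from typing import List, Optional


def build_tesseract_command(image: str, output_base: str,
                            lang: Optional[str] = None) -> List[str]:
    """Build the argument list for one tesseract run.

    Hypothetical helper: mirrors "$IMAGE $FILE -l heb" from the
    OCRFeeder engine arguments. Without lang, tesseract falls back
    to its default language.
    """
    command = ["tesseract", image, output_base]
    if lang:
        command += ["-l", lang]
    return command


def run_ocr(image: str, output_base: str, lang: Optional[str] = None) -> str:
    """Run tesseract and return the recognized text.

    Assumes tesseract is on PATH and writes output_base + ".txt".
    """
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Calling `run_ocr("page.png", "page", lang="heb")` would then play the role of the `cat $FILE.txt` step; the file names here are placeholders.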

As a matter of fact, the Hebrew file installed by the latest (581) svn revision of tesseract is still the crippled one; you may want to give my heb.traineddata a try instead, or consider generating an even better one.
The output of tesseract, though, still suffers from the problem already reported: tesseract scans LTR, while Hebrew is RTL. A temporary workaround is to use the standard *nix rev, e.g.:
 
$IMAGE $FILE -l heb; rev $FILE.txt
How to generate a proper (reversed) dictionary for Hebrew or Yiddish is still an open question from the previous thread.
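
One thing to watch with a plain character reversal is that combining marks (vowel points, for instance) end up in front of the letter they belong to. The sketch below is a rough Python stand-in for rev that keeps each combining mark attached to its base character. It is only an illustration under that assumption; whether tesseract's Hebrew output actually needs this has not been tested here, and the function names are made up for the example.

```python
import unicodedata


def reverse_line(line: str) -> str:
    """Reverse one line, keeping combining marks after their base character."""
    clusters = []
    for ch in line:
        # A non-zero combining class means the character modifies the previous one.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


def reverse_text(text: str) -> str:
    """Apply reverse_line to every line, like rev does for a file."""
    return "\n".join(reverse_line(line) for line in text.split("\n"))
```

Like rev, this flips everything on the line, so digits and any embedded Latin words come out backwards as well.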

Having said that, I'm not a developer of tesseract, so I can help little more. I am just one who, at the n-th attempt, succeeded in the not-so-trivial task of following the instructions of the Training Tesseract 3 wiki.

The reason I became interested in tesseract, though, may have some overlap with yours, in my very limited scope. I started toying with the idea of producing judaica-vernacular texts for Project Gutenberg. Some of those I considered have significant Hebrew parts, and hence I started looking around for software candidates.
Based on my very limited experience with PG so far, I would only throw in a couple of suggestions:

-do not underestimate the amount of human work needed to go from a raw OCR scan to a quality e-text. At PG, human proofreading is done in several passes and may take man-months for a single text. Otherwise, expect a recognition error rate on the order of 1 wrong character in every 10. Check the text version of any book

-tesseract out of the box is poorly suited to multi-language, multi-alphabet text. Merging two traineddata sets doesn't seem easily possible. Incremental training of tesseract looks quite awkward compared to commercial software.
By incremental training I mean improving the traineddata for a specific document using the results of a first-pass OCR, perhaps with a nice GUI comparing images and OCR results. I would really welcome development work in this area.

-at PG, it seems that tesseract is considered inferior to, say, ABBYY FineReader. I feel awkward saying that in this forum, but I already have a couple of projects on PG for which, despite striving to use tesseract for OCR, I had to accept the helping hand of a fellow proofreader who ran my images through FR instead. I'm still looking for best practices to get decent recognition from tesseract (like image preprocessing, etc.).
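
To put a number on the error rate mentioned in the first point, one common measure is the character error rate: the edit distance between the OCR output and a proofread reference, divided by the length of the reference. The sketch below is a plain Levenshtein implementation written for illustration; it is not something PG or tesseract provides, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance normalized by the length of the proofread reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)
```

One wrong character in ten gives a rate of 0.1 by this measure, which is the order of magnitude quoted above for unproofed output.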

Enrico

Ray Smith

May 1, 2011, 5:54:51 PM
to tesser...@googlegroups.com
Hi Will,

I feel your pain. The tesseract training process is hideously complex. What makes matters worse is:
(a) We are now keeping the svn repository up to date with our own code that we use in Google.
(b) We use a more automated training process based on rendering text from fonts. We haven't managed to open it up yet, but even if we had, it wouldn't be suitable for every application.
(c) The training process is in flux as we work on improving the accuracy and making it work for more languages, including Indic languages and RTL languages.

This all adds up to the documentation being inaccurate, and likely to remain that way. Fixing it is a bit of a waste of time, as it will be wrong again within a month. Having said that, what we really need to do is cut a 3.01 tarball, so we can write documentation for that, and then let the svn version be different and undocumented until we have the training process stabilized.

Now to Yiddish. I am aware that the current Hebrew is no good, and we have a plan in place to fix it in our training system, and in the recognition process. Those fixes *should* automatically make Yiddish work too, unless:

Is there anything fundamentally different about Yiddish compared to Hebrew in the way it is written or encoded, such as the rules for writing it RTL, that I should be aware of? (I am aware that the dictionary and language are totally different.) For instance, I understand spoken Yiddish to be a dialect of German. Does that mean it has the same infinite word compounding that German has?

For your particular application, how important is BiDi handling? Or would pure RTL be good enough to get started?

Ray.

Will Helton

May 2, 2011, 5:20:06 AM
to tesser...@googlegroups.com
Hi Ray,

Thanks for your reply.

I'm glad it wasn't just me. I was at the point of pulling my hair out with following the training/data set creation steps and still I just couldn't get it all to work.

If a truly viable Hebrew set were working, it would cover about 98% (perhaps more) of what would be needed for Yiddish. That is, the alphabet is almost identical except for one letter cluster (ײַ) which does not occur in Hebrew.
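
One practical wrinkle with that letter cluster is that Unicode can represent it either as two code points (U+05F2 followed by U+05B7) or as the single presentation form U+FB1F. The sketch below uses Python's standard unicodedata module to bring both spellings to the two-code-point form, which might be worth doing before any training text is assembled. That this matters for tesseract training is an assumption on my part, and the function names are made up for the example.

```python
import unicodedata
from typing import List


def normalize_yiddish(text: str) -> str:
    """Apply NFC so presentation forms such as U+FB1F are decomposed.

    U+FB1F is excluded from recomposition, so NFC leaves it as
    U+05F2 followed by U+05B7.
    """
    return unicodedata.normalize("NFC", text)


def describe(text: str) -> List[str]:
    """List the Unicode name of every code point, for inspection."""
    return [unicodedata.name(ch, "UNKNOWN") for ch in text]
```

Running `describe(normalize_yiddish("\ufb1f"))` lists the double yod ligature and the patah point as separate entries.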

Mercifully, Yiddish doesn't do the standard endless word compounding that German does. Words are usually taken either from Hebrew, Slavic elements or German (depending on how earthy, technical or emotionally-tinged they are). The odd compound will exist here and there, but that will be hyphenated in most cases.

Currently there is no BiDi need that I am aware of. The idea is to get workable OCR's of literary texts, so standard RTL in the first instance would more than fit the bill. BiDi would only come into play (if ever) in trying to translate/re-publish dictionaries or similar.

Thank you again for your reply and if you need anything whatsoever, please let me know. I have an Ubuntu environment running under Virtualbox, so I can test anything as needed, try things out, gather data - you name it.

And if you have any questions about Yiddish, please do let me know.

Regards,

Will

Enrico Segre

May 2, 2011, 5:35:09 AM
to tesser...@googlegroups.com

> Currently there is no BiDi need that I am aware of. The idea is to get
> workable OCR's of literary texts, so standard RTL in the first
> instance would more than fit the bill. BiDi would only come into play
> (if ever) in trying to translate/re-publish dictionaries or similar.
Just a caveat (IIUC): numbers (Arabic/Indic figures) in Hebrew/Yiddish
text are already an instance of BiDi.
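
To illustrate the caveat: if a whole line is reversed to get RTL order, any number in it comes out backwards and has to be flipped again. The following is a minimal sketch, assuming numbers are plain digit runs optionally joined by "." or ","; it ignores combining marks and is nowhere near a full implementation of the Unicode bidirectional algorithm. The names are hypothetical.

```python
import re

# A run of digits, optionally with "." or "," separators inside (e.g. 3.14).
DIGIT_RUN = re.compile(r"\d+(?:[.,]\d+)*")


def reverse_keep_numbers(line: str) -> str:
    """Reverse a line, then restore left-to-right order inside each number."""
    reversed_line = line[::-1]
    return DIGIT_RUN.sub(lambda match: match.group(0)[::-1], reversed_line)
```

With this, a year such as 1948 still reads as 1948 after the surrounding letters have been reversed.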

Enrico

Will Helton

May 2, 2011, 5:36:00 AM
to tesser...@googlegroups.com
Ah, yes. You're right, Enrico.

I was thinking purely in terms of text.

Thanks for pointing that out.

Will

Ray Smith

May 4, 2011, 1:26:26 AM
to tesser...@googlegroups.com
OK, well it sounds promising then. I will certainly keep you in mind for feedback when we have something to test.
Ray.

Will Helton

May 4, 2011, 3:39:22 PM
to tesser...@googlegroups.com
Thanks, Ray.

I look forward to it.

Will

Philip Trauring

May 16, 2013, 12:36:29 PM
to tesser...@googlegroups.com, will....@gmail.com
Curious if there has been any progress on supporting Yiddish in Tesseract?

I see that a recent version 3.02 of Tesseract supports Hebrew with BiDi.

I also have seen that the Yiddish Book Center has been working with developer Assaf Urieli to come up with a new Java-based OCR program (https://github.com/urieli/jochre) that can recognize Yiddish (with claims of 97% accuracy).

While there may not be a way to combine code, Tesseract now supports Hebrew with BiDi, and the Yiddish Book Center is presumably providing Urieli with a corpus of Yiddish texts for training his software. It would be nice if the same training texts could be provided to Tesseract as well. It may not be able to match Jochre's accuracy, since Jochre has been built from the ground up to recognize Yiddish, but it would presumably help a lot and make Yiddish that much more accurate in Tesseract (and all the programs that rely on it). Better for everyone who is interested in Yiddish literature, I would think.

Philip