Tesseract-OCR


Aharon Varady

Aug 26, 2011, 10:11:09 AM
to Open Siddur Project
Transcription being such an important part of the Open Siddur Project -- and yet also outside the scope of our project development -- I've been keen to use whatever automated OCR can help with it. Waiting for an OCR that supports Hebrew with its diacritics has been like waiting for Godot -- Kobi Zamir, the developer of HOCR (http://berlios.hocr.de), has been frank with me that his software, though promising, has a pretty severe memory leak and perennially needs six months of full-time development. All is not lost, though. Other OCR software is serviceable for transcribing Hebrew. I found ABBYY FineReader an imperfect but invaluable closed-source solution when working earlier this year on transcribing the Rashi script in Pri Etz Hadar. Obviously, an open source solution would be best, so I've kept my eye on Tesseract, the open source project used and advanced by Google Books. I recently revisited Tesseract while looking for an OCR to scan Fanny Neuda's Stunden der Andacht. (ABBYY's FineReader XIX, the first OCR to read Fraktur (Gothic German) scripts, is closed-source, expensive, and trial-limited, and it is being discontinued this year.) I was pleased to find a great deal of work on Tesseract, including folk in Israel who have been busy training tesseract to recognize Hebrew, and sharing their training files.

Tesseract requires (for me at least) a GUI. I've played with most if not all of the Tesseract GUIs for Windows, and for the most part they were fairly unusable. But I did find one that works well, so I want to let you all know: VietOCR, a GUI written for a project looking to transcribe Vietnamese texts. One of the best features of VietOCR is a menu option to download and select from a growing number of language files, which are sometimes hard to find via Google and the official Tesseract website.
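
For anyone comfortable skipping the GUI, Tesseract can also be driven from its command line, whose basic form is `tesseract <image> <outputbase> -l <lang>`. The Python sketch below only wraps that call. It assumes a `tesseract` executable is on the PATH; the file names and the `heb` language code are placeholders, and the code has to match whichever language file is actually installed.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="heb"):
    """Return the argument list for: tesseract <image> <outputbase> -l <lang>."""
    # Tesseract appends ".txt" to output_base by itself.
    # "heb" is a placeholder; use the code of the installed language file.
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="heb"):
    """Run Tesseract on one image and return the path of the text file it writes."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"
```

Calling `run_ocr("page-001.png", "page-001")` would then leave the recognized text in `page-001.txt`, provided Tesseract and the language file are installed.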

I've used Tesseract via VietOCR and found its read of Fraktur to be better than 95%. I'm hoping that others will test this solution with Hebrew and perhaps even figure out how to train Tesseract to read Hebrew with niqqud. Documentation is available, and a discussion list with folks all over the world working on their own languages is buzzing.
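
The post does not say how the 95% figure was estimated. For anyone who wants to put a number on their own tests, one common measure is character accuracy: one minus the edit distance between the OCR output and a hand-corrected reference, divided by the length of the reference. The sketch below is an assumption about how that could be computed, not a description of how the figure above was obtained; the function names are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, reference_text):
    """Return 1 - (edit distance / reference length), never below 0."""
    if not reference_text:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, reference_text)
    return max(0.0, 1.0 - errors / len(reference_text))
```

A page would be checked with `character_accuracy(ocr_output, corrected_text)`, where both arguments are plain strings.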

Aharon


To get VietOCR working with PDFs, I needed to install another project's application, GPL Ghostscript, and copy gsdll32.dll to the VietOCR directory.

http://sourceforge.net/projects/vietocr/
http://sourceforge.net/projects/ghostscript/files/GPL%20Ghostscript/
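
Ghostscript can also be called directly to turn a PDF into page images before handing them to an OCR engine. The following is a minimal sketch of that step, not what VietOCR does internally. The executable name is an assumption (`gs` on Unix-like systems; Windows builds typically ship a console executable such as `gswin32c`), and the output pattern and 300 dpi resolution are illustrative defaults.

```python
import subprocess


def build_ghostscript_command(pdf_path, output_pattern="page-%03d.png",
                              dpi=300, gs_executable="gs"):
    """Return the argument list that renders each PDF page to a PNG file."""
    return [
        gs_executable,            # assumption: adjust to the installed executable
        "-dNOPAUSE", "-dBATCH",   # do not prompt between pages, exit when done
        "-sDEVICE=png16m",        # 24-bit colour PNG output device
        "-r{}".format(dpi),       # rendering resolution in dpi
        "-sOutputFile={}".format(output_pattern),  # %03d becomes the page number
        pdf_path,
    ]


def pdf_to_images(pdf_path, **options):
    """Run Ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ghostscript_command(pdf_path, **options), check=True)
```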

Efraim Feinstein

Aug 26, 2011, 10:55:26 AM
to opensid...@googlegroups.com
Hi,

In other OCR news, Google Books has been doing an incredible job of
OCR-ing (unvoweled) Hebrew... but AFAICT they don't tell us how they do
it. There is an OCR option in the upload interface using the same
technology as Google Books, but alas, it does not include Hebrew as one
of the available languages.
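
Because that OCR text is unvoweled, comparing it against a voweled edition means removing the niqqud from the voweled side first. Here is a small Python sketch of that step; the function name is made up for illustration. It drops every nonspacing mark in the Hebrew Unicode block, which removes cantillation marks along with the vowel points, and it does not handle the precomposed letters in the Alphabetic Presentation Forms block.

```python
import unicodedata

# Hebrew block, U+0590..U+05FF: letters, points and cantillation marks.
HEBREW_BLOCK = range(0x0590, 0x0600)


def strip_niqqud(text):
    """Remove Hebrew points and cantillation marks, keeping letters and punctuation."""
    return "".join(
        ch for ch in text
        # Category "Mn" (nonspacing mark) covers niqqud and te'amim;
        # maqaf and sof pasuq are punctuation and are therefore kept.
        if not (ord(ch) in HEBREW_BLOCK and unicodedata.category(ch) == "Mn")
    )
```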

On the bright side, I do not see any problem *correcting* and posting
the text from a Google-OCR-ed public domain book. The OCR itself is a
mechanical process, and once the corrected text is the same as the
public domain text, it's public domain anyway.

(Standard disclaimer: I am not a lawyer, this is not legal advice)

--
---
Efraim Feinstein
Lead Developer
Open Siddur Project
http://opensiddur.net
http://wiki.jewishliturgy.org

Aharon Varady

Aug 26, 2011, 11:15:22 AM
to Open Siddur Project
On Fri, Aug 26, 2011 at 10:11 AM, Aharon Varady <aharon...@gmail.com> wrote:

> I've used Tesseract via VietOCR and found its read of Fraktur to be better than 95%. I'm hoping that others will test this solution with Hebrew and perhaps even figuring out how to train Tesseract to read Hebrew with niqqud. Documentation is available and a discussion list with folks all over the world working on their own languages is buzzing.


I should also have included some more links to Tesseract resources:

The Tesseract OCR Project: https://code.google.com/p/tesseract-ocr/

Training Tesseract: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

The discussion list: http://groups.google.com/group/tesseract-ocr?hl=en

Ze'ev Clementson

Aug 26, 2011, 11:53:32 AM
to opensid...@googlegroups.com
Hi Aharon,

On Fri, Aug 26, 2011 at 8:15 AM, Aharon Varady <aharon...@gmail.com> wrote:
> On Fri, Aug 26, 2011 at 10:11 AM, Aharon Varady <aharon...@gmail.com>
> wrote:
>>
>> I've used Tesseract via VietOCR and found its read of Fraktur to be better
>> than 95%. I'm hoping that others will test this solution with Hebrew and
>> perhaps even figuring out how to train Tesseract to read Hebrew with niqqud.
>> Documentation is available and a discussion list with folks all over the
>> world working on their own languages is buzzing.

Tesseract sounds promising. I did a quick scan through the links and
have a couple of questions:

1. The training page says "Tesseract currently can only handle
left-to-right languages. While you can get something out with a
right-to-left language, the output file will be ordered as if the text
were left-to-right." - did you post process the resulting text to
re-order Hebrew or has someone come up with another solution?

2. You mentioned (in your 1st email) "folk in Israel who have been
busy training tesseract to recognize Hebrew, and sharing their
training files" however you didn't provide any links to the Hebrew
training files. How does one locate them?

- Ze'ev

Ze'ev Clementson

Aug 26, 2011, 12:10:33 PM
to opensid...@googlegroups.com

Reading through the mailing list, the only one I came across was Roi
Dayan's one: http://groups.google.com/group/tesseract-ocr/browse_thread/thread/64cac42ce5bbcb81/33da71a43506b4ee?hl=en&lnk=gst&q=hebrew#33da71a43506b4ee

Is that the one you used or have you found others?

- Ze'ev

Aharon Varady

Aug 26, 2011, 12:20:33 PM
to opensid...@googlegroups.com
On Fri, Aug 26, 2011 at 12:10 PM, Ze'ev Clementson <bere...@gmail.com> wrote:

> Tesseract sounds promising. I did a quick scan through the links and
> have a couple of questions:
>
> 1. The training page says "Tesseract currently can only handle
> left-to-right languages. While you can get something out with a
> right-to-left language, the output file will be ordered as if the text
> were left-to-right." - did you post process the resulting text to
> re-order Hebrew or has someone come up with another solution?


I've only processed Gothic German in Fraktur with Tesseract so far. I've been told that Google Books uses Tesseract, so they must have some reordering process.
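
What such a reordering step involves is not documented anywhere in this thread, so the following is only a naive sketch of the idea, with made-up function names. It assumes each output line holds the characters in left-to-right visual order, reverses the line, and then flips runs of ASCII letters and digits back so that numbers still read correctly. Multi-word Latin phrases would come out with their words in the wrong order, and combining marks such as niqqud would end up on the wrong side of their letters; a robust solution would need the logic of the Unicode bidirectional algorithm.

```python
import re

# Runs that read left to right even inside Hebrew text: ASCII letters and
# digits, optionally joined by punctuation common in numbers and dates.
LTR_RUN = re.compile(r"[0-9A-Za-z]+(?:[.,:/-][0-9A-Za-z]+)*")


def visual_to_logical(line):
    """Reverse one visually ordered line, then restore the embedded LTR runs."""
    reversed_line = line[::-1]
    return LTR_RUN.sub(lambda match: match.group(0)[::-1], reversed_line)


def reorder_text(text):
    """Apply visual_to_logical to every line of a multi-line string."""
    return "\n".join(visual_to_logical(line) for line in text.split("\n"))
```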

 
> 2. You mentioned (in your 1st email) "folk in Israel who have been
> busy training tesseract to recognize Hebrew, and sharing their
> training files" however you didn't provide any links to the Hebrew
> training files. How does one locate them?


I installed the Hebrew language file via VietOCR.



> Reading through the mailing list, the only one I came across was Roi
> Dayan's one: http://groups.google.com/group/tesseract-ocr/browse_thread/thread/64cac42ce5bbcb81/33da71a43506b4ee?hl=en&lnk=gst&q=hebrew#33da71a43506b4ee
>
> Is that the one you used or have you found others?


I had done a Google search looking for folk who had been experimenting with training Tesseract on Hebrew and came across the same link to Roi Dayan that you did.

Aharon

Aharon Varady

Aug 27, 2011, 9:49:27 AM
to Open Siddur Project, enrico...@weizmann.ac.il
On Fri, Aug 26, 2011 at 12:10 PM, Ze'ev Clementson <bere...@gmail.com> wrote:

> 2. You mentioned (in your 1st email) "folk in Israel who have been
> busy training tesseract to recognize Hebrew, and sharing their
> training files" however you didn't provide any links to the Hebrew
> training files. How does one locate them?


The two folks in Israel are Enrico Segre and Roi Dayan.

Roi's training data is here: http://roidayan.com/wordpress/?p=26

Enrico is working on OCR for Rashi script. He has a tarball of training files, images, etc., here:

https://code.google.com/p/tesseract-ocr/issues/detail?id=432
https://groups.google.com/forum/#!topic/tesseract-dev/3Ii0pCqTlaY/discussion

Aharon