In other OCR news, Google Books has been doing an incredible job of
OCR-ing (unvoweled) Hebrew... but AFAICT they don't tell us how they do
it. There is an OCR option in the upload interface using the same
technology as Google Books, but alas, it does not include Hebrew as one
of the available languages.
On the bright side, I do not see any problem with *correcting* and
posting the text from a Google-OCR-ed public domain book. The OCR
itself is a mechanical process, and once the corrected text matches the
public domain original, it's public domain anyway.
(Standard disclaimer: I am not a lawyer, this is not legal advice)
--
Efraim Feinstein
Lead Developer
Open Siddur Project
http://opensiddur.net
http://wiki.jewishliturgy.org
I've used Tesseract via VietOCR and found its read of Fraktur to be better than 95% accurate. I'm hoping that others will test this solution with Hebrew and perhaps even figure out how to train Tesseract to read Hebrew with niqqud. Documentation is available, and there is a busy discussion list with folks all over the world working on their own languages.
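For anyone who wants to try this without the VietOCR front end, here is a rough sketch of calling the Tesseract command-line tool directly from Python. It assumes the `tesseract` binary is on your PATH and that a Hebrew training file is installed under the language code `heb`; the helper names `build_tesseract_command` and `run_ocr` are made up for this example.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="heb"):
    """Build the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="heb"):
    """Run Tesseract and return the recognized text.

    Assumes the tesseract binary and the requested language data are
    installed. Tesseract writes its result to <output_base>.txt.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Something like `run_ocr("page.tif", "page")` would then leave the text in `page.txt` and also return it as a string.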
On Fri, Aug 26, 2011 at 8:15 AM, Aharon Varady <aharon...@gmail.com> wrote:
> On Fri, Aug 26, 2011 at 10:11 AM, Aharon Varady <aharon...@gmail.com>
> wrote:
>>
>> I've used Tesseract via VietOCR and found its read of Fraktur to be better
>> than 95%. I'm hoping that others will test this solution with Hebrew and
>> perhaps even figuring out how to train Tesseract to read Hebrew with niqqud.
>> Documentation is available and a discussion list with folks all over the
>> world working on their own languages is buzzing.
Tesseract sounds promising. I did a quick scan through the links and
have a couple of questions:
1. The training page says "Tesseract currently can only handle
left-to-right languages. While you can get something out with a
right-to-left language, the output file will be ordered as if the text
were left-to-right." Did you post-process the resulting text to
re-order the Hebrew, or has someone come up with another solution?
2. You mentioned (in your 1st email) "folk in Israel who have been
busy training tesseract to recognize Hebrew, and sharing their
training files", but you didn't provide any links to the Hebrew
training files. How does one locate them?
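To make question 1 concrete, the kind of post-processing I have in mind could be sketched roughly as below. This assumes the output really is in visual (left-to-right) order, so each Hebrew line comes out reversed; the function names are just for illustration. It reverses every line and then flips runs of digits and Latin letters back, since those read left-to-right anyway. It is not a full bidi implementation: niqqud (combining marks) and mirrored brackets are not handled.

```python
import re

# Runs that read left-to-right even inside Hebrew text: ASCII letters and
# digits, optionally joined by common number punctuation (e.g. 12.5, 3/4).
_LTR_RUN = re.compile(r"[0-9A-Za-z]+(?:[.,:/-][0-9A-Za-z]+)*")


def visual_to_logical(line):
    """Convert one line from visual (left-to-right) order to logical order.

    Assumption: the whole line is right-to-left text with embedded
    left-to-right runs of digits or Latin letters.
    """
    reversed_line = line[::-1]
    # Reversing the line also reversed the LTR runs; flip them back.
    return _LTR_RUN.sub(lambda m: m.group(0)[::-1], reversed_line)


def reorder_text(text):
    """Apply visual_to_logical to every line, keeping line breaks intact."""
    return "\n".join(visual_to_logical(line) for line in text.split("\n"))
```

Running `reorder_text` over the OCR output would give text in reading order, provided the assumption about visual ordering holds for the whole page.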
- Ze'ev
Reading through the mailing list, the only Hebrew training file I came
across was Roi Dayan's: http://groups.google.com/group/tesseract-ocr/browse_thread/thread/64cac42ce5bbcb81/33da71a43506b4ee?hl=en&lnk=gst&q=hebrew#33da71a43506b4ee
Is that the one you used or have you found others?
- Ze'ev