R: Re: Yiddish data for Tesseract

Enrico Segre

Apr 25, 2011, 6:16:23 PM

to tesser...@googlegroups.com

Have you seen http://code.google.com/p/pytesseracttrainer/, http://sourceforge.net/tracker/?func=detail&atid=1231319&aid=3151499&group_id=291112 and the earlier discussion about Hebrew training in this group?
I guess that, unlike plain unvocalized Hebrew, Yiddish may badly need some peculiar ligatures and some vocalized letters.
Also have a look at hocr; in my experience it performs quite well with vocalized Hebrew.
http://he.wikibooks.org/wiki/Hocr_-_%D7%94%D7%A4%D7%99%D7%9B%D7%AA_%D7%AA%D7%9E%D7%95%D7%A0%D7%94_%D7%A2%D7%9D_%D7%90%D7%95%D7%AA%D7%99%D7%95%D7%AA_%D7%A2%D7%91%D7%A8%D7%99%D7%95%D7%AA_%D7%9C%D7%A7%D7%95%D7%91%D7%A5_%D7%98%D7%A7%D7%A1%D7%98/%D7%A1%D7%A8%D7%99%D7%A7%D7%94
http://hocr.berlios.de/ (link not working for me right now)
Enrico

Will Helton

Apr 26, 2011, 3:58:35 AM

to tesser...@googlegroups.com

Hi Enrico,

Yes, I did see the thread you submitted about Hebrew data. It was the reason I joined the forum. :)

I tried downloading and adding that data to /usr/share/tesseract/tessdata/, but OCRFeeder still didn't recognise the text in any of the files I opened.

Would you be willing to discuss how you set up your data? I really do feel I have just missed or misunderstood something, and so have not been able to roll a usable set of Yiddish data files.

Thanks,

Will

Enrico Segre

Apr 26, 2011, 10:29:43 AM

to tesser...@googlegroups.com

Hi Will,
I'm away from my main development system at the moment; I may give you a more precise answer in a few days.
Anyway, I don't know OCRFeeder. It might be interesting to give it a try. I usually operate tesseract from the command line, batch files, or gimagereader as a GUI when I need to evaluate image and cropping settings.
Are you able to recognize text pages in any other language? I mean, are you complaining about a specific problem with the Hebrew set, or about general usage of tesseract?
In case it matters (a quick check led me to URLs describing OCRFeeder on Ubuntu): I've tested my Hebrew set on a local build of tesseract 3.01 from sources, not on the Ubuntu tesseract package, which was obsolete.
Enrico

Will Helton

Apr 26, 2011, 11:00:26 AM

to tesser...@googlegroups.com

Hi Enrico,

I installed Tesseract 3.0/3.01 (not sure as I don't know how to check the version) before I installed OCRFeeder, so it's not using the Ubuntu tesseract default package.
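
For anyone following along who is also unsure which version is installed, here is a minimal Python sketch. It assumes the installed build accepts the `-v` flag (older builds may not) and that the first line of output looks roughly like `tesseract 3.01`. The helper names `parse_version` and `installed_version` are made up for this illustration and are not part of Tesseract.

```python
import re
import subprocess


def parse_version(text):
    """Pull a version number such as '3.01' out of text like 'tesseract 3.01'.

    The expected text format is an assumption; returns None if nothing matches.
    """
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", text)
    return match.group(1) if match else None


def installed_version():
    """Run 'tesseract -v' and return the parsed version, or None.

    Assumes the binary is on PATH and supports -v. Some builds print the
    version on stderr, so both streams are captured together.
    """
    try:
        result = subprocess.run(
            ["tesseract", "-v"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except FileNotFoundError:
        # No 'tesseract' executable was found on PATH.
        return None
    return parse_version(result.stdout)
```

Calling `installed_version()` returns the version string if it can be determined and `None` otherwise. A `None` result only means this particular check failed, not that Tesseract is missing.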

After downloading and transferring the file to the tessdata directory, I restarted OCRFeeder and loaded a Yiddish PDF. It didn't recognise any of the letters - just like when I loaded it without the Hebrew files in place.

I'm not a high end technical user. I'm trying to find some software that will allow for conversion of the PDFs in the Spielberg Collection into workable text so these can then be translated into English.

I think tesseract can fulfil this need, I just need a bit of help on where I'm falling short in getting the right parameters in place.

Many thanks,

Will

Enrico Segre

Apr 26, 2011, 3:09:52 PM

to tesser...@googlegroups.com

As I said, I don't know OCRFeeder (yet). In gimagereader, one has to choose the language for text recognition from a menu, and I remember I had to edit a configuration file to include the new languages I added. Unless you tell tesseract (on the command line, with -l lang) that you want to use your language data file, I wouldn't be surprised if it doesn't recognize any characters.
If you start from an image in the Latin alphabet, are you able to recognize anything at all?
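
To make the language selection described above concrete, here is a minimal Python sketch. It assumes Tesseract 3.x conventions: language data lives in a file named `<lang>.traineddata` inside the tessdata directory, and the command line has the form `tesseract <image> <outputbase> -l <lang>`. The language code `heb`, the file names and the helper names are illustrative assumptions, not taken from the data set discussed in this thread.

```python
from pathlib import Path


def traineddata_path(tessdata_dir, lang):
    """Return where a Tesseract 3.x language file is expected to be.

    Assumption: the data file is named '<lang>.traineddata'.
    """
    return Path(tessdata_dir) / "{}.traineddata".format(lang)


def build_command(image, output_base, lang):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>.

    Without '-l', tesseract falls back to its default language, so a newly
    added data file would simply not be used.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang]


# Example usage (hypothetical paths; uncomment to try on a real system):
# import subprocess
# data = traineddata_path("/usr/share/tesseract/tessdata", "heb")
# if data.is_file():
#     subprocess.run(build_command("page.tif", "page", "heb"))
# else:
#     print("Language file not found:", data)
```

If the data file is not where tesseract looks for it, or the GUI front end never passes the language on, the output would be expected to look the same as when no Hebrew data is installed at all.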
Enrico

Will Helton

Apr 30, 2011, 5:34:32 AM

to tesser...@googlegroups.com

Hi Enrico,

Thanks for the further feedback.

All the German documents I have tried have worked fine with Tesseract/OCRFeeder.

The problem I'm having is that I am unable to complete the steps outlined on the Tesseract3 training page because of errors I don't understand.

This is why it would be helpful if I could speak to a developer directly.

Thanks to a very generous donation by the Spielberg Foundation, the National Yiddish Book Center has been able to digitise and make available the 11,000+ volumes currently online at archive.org.

The Center has now pledged to begin a project to translate titles from this collection, but needs the ability to OCR these texts.

Although I cannot, of course, make any promises (I am only a volunteer), it is reasonable to assume that with the amount of time and effort that has gone into this project so far, the software with the best chance for fulfilling this OCR need would also be in a good position for possible funding considerations. I think that software might be Tesseract.

Again, I cannot promise anything (and I am not making any promises here), but if someone were able to help address the specific problems I'm encountering, I could then pass this back to the Center and give a much fuller endorsement of this product (if, in fact, it can deliver the desired end result).

I would, therefore, be very grateful if someone from the development team or someone who has done a good deal of new language bootstrapping would contact me to help me understand where I've gone wrong.

Please understand I am not asking anyone to do my work for me. I am simply asking for advice. Although I am fairly confident working from the command line, I am not a programmer and certainly do not understand the specifics of this particular software package.

My thanks in advance.

Will
