Queries that confuse me in tesseract-ocr


naga raja

Mar 2, 2010, 3:12:02 AM
to indi...@googlegroups.com, tesser...@googlegroups.com
Hi guys,
This is Nagaraja. I am training Tesseract-OCR for the Tamil language.

Several questions have come up while training Tesseract for Tamil. I have posted them below; please reply.

1) When I install tesseract-ocr from source, it installs successfully, but when I test it, it shows an error like
unable to load "/usr/local/share/tessdata/eng.unicharset"
    > I tried downloading the English tessdata and placing it in the tessdata folder, but still no luck.
    > I am using Tesseract 2.04 on Ubuntu 9.04.

2) Then I created the 8 Tamil training data files. 5 files were created with the Training-tesseract GUI by debayan, and 3 files I created myself.
   > The error I got was: X characters in inttemp, whereas Y characters in tam.unicharset
   > Although I searched, I couldn't find proper documentation, apart from a single issue.

3) How does tesseract-ocr recognize text? In some images the characters may be of different sizes and in different fonts. So while training, do I need to train for all fonts in all sizes?

4) Can we change the output format of Tesseract to something other than .txt?

5) Although I googled the queries above, I can't find complete answers or documentation. Once my doubts are cleared, I will write complete documentation for training tesseract-ocr, which should help other people (beginners).


 
Thanks and Regards,
T.Nagaraja