Regarding Tesseract OCR engine for recognizing Tamil Fonts

sibi kanagaraj

Jul 14, 2014, 4:07:59 AM
to tesser...@googlegroups.com
Hi all,

This is Sibi from Chennai, India. I want to improve the Tesseract OCR engine's recognition of Tamil fonts. I started with Ray Smith's paper "An Overview of the Tesseract OCR Engine" and emailed him for more information.

He directed me to

https://drive.google.com/file/d/0B7l10Bj_LprhbUlIUFlCdGtDYkE/edit?usp=sharing

and also told me that font recognition is already present for the Tamil language.
However, I feel that the Tamil training is not sufficient and could be streamlined, so I went to see whether there are sufficient training documents for Tamil. This search led me to this page, and subsequently I found "Things I would NOT recommend working on" here.

I am a little stuck here. I want to do this project as part of my Master's degree. Isn't Tamil training an independent module that could be worked on?

-Sibi

Paul

Jul 14, 2014, 2:36:47 PM
to tesser...@googlegroups.com
I'm not sure what the case is for Tamil, but in general the images used for training are not available, so you would basically have to start the training from scratch.

Paul

Nick White

Jul 15, 2014, 4:08:17 PM
to tesser...@googlegroups.com
On Mon, Jul 14, 2014 at 11:36:46AM -0700, Paul wrote:
> On Monday, 14 July 2014 10:07:59 UTC+2, sibi kanagaraj wrote:
> > But, I feel that Tamil Training is not sufficient and it could be
> > streamlined. Hence I went to see if there are sufficient training
> > documents for Tamil. This search landed me to this page. And
> > subsequently I found "Things I would NOT recommend working on" here.
> >
> > I am little bit stuck here. I wanted to do this project as part of my
> > Masters Degree. Isnt it that Tamil Training is independent module that
> > could be worked upon?
>
> I'm not sure what's the case for Tamil, but in general the imagery for doing
> training is not available. So basically you would have to start all over.

Yes, that is the case, I'm afraid. There is a project that was
hoping to create improved trainings for South Asian languages, but
it hasn't been updated for quite a few years. See
http://code.google.com/p/parichit/

Can you give us some clue as to what you think could be improved
about the current Tamil recognition? Changes of configuration
variables, or ambiguity rules (the unicharambigs file), don't need
access to the training images.
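
For reference, here is a rough sketch of what an ambiguity rule looks like on disk. It assumes the tab-separated "v1" layout described on the TrainingTesseract3 wiki page (count, source unichars, count, target unichars, type flag). The helper names and the sample rules are made up for illustration; real rules for Tamil would have to use unichars that actually occur in the language's unicharset.

```python
def ambig_line(source, target, mandatory=False):
    """Format one rule for a v1 unicharambigs file.

    source, target: lists of unichars (strings). Inside a field they are
    separated by spaces; the fields themselves are separated by tabs.
    mandatory: True writes type 1 (always substitute), False writes
    type 0 (only a hint that the engine may consider).
    """
    if not source or not target:
        raise ValueError("both sides of a rule need at least one unichar")
    fields = [
        str(len(source)), " ".join(source),
        str(len(target)), " ".join(target),
        "1" if mandatory else "0",
    ]
    return "\t".join(fields)


def ambigs_file(rules):
    """Build the whole file: a version line, then one rule per line."""
    lines = ["v1"] + [ambig_line(*rule) for rule in rules]
    return "\n".join(lines) + "\n"


# Illustrative rules only (Latin examples in the style of the wiki page).
SAMPLE_RULES = [
    (["'", "'"], ['"'], True),   # two apostrophes -> one double quote
    (["m"], ["r", "n"], False),  # "m" may really be "r n"
]
```

Writing `ambigs_file(SAMPLE_RULES)` to a file named `<lang>.unicharambigs` would give the input that the training tools expect, under the assumptions above.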

Oh, by the way, the "Things I would NOT recommend working on" is a
very old page (from 2010); I wouldn't take it too seriously...

Nick

sibi kanagaraj

Jul 20, 2014, 12:37:25 PM
to tesser...@googlegroups.com
Hi,

Sorry for my delayed reply.

Thank you, Paul and Nick, for your inputs.

@Paul,

//imagery for doing training is not available. So basically you would have to start all over.//

What do you mean by starting all over? I described the effort I have made so far in my first mail. Do you mean that the training process has to be started from the beginning?

@Nick White,

//Can you give us some clue as to what you think could be improved about the current Tamil recognition? Changes of configuration variables, or ambiguity rules (the unicharambigs file), don't need access to the training images.//

So far I have only gone through the documents and have not yet got my hands on the code or the actual working of the engine. I am in the initial stages of my analysis. I have a good amount of time (around 9 months) to work on the project, and I would love to contribute to a project under the Apache License, and to do so in my mother tongue.

"The new page layout analysis for Tesseract was designed from the beginning to be language-independent, but the rest of the engine was developed for English, without a great deal of thought as to how it might work for other languages." [1]

The training document for Tesseract also notes:

"... Tesseract was originally designed to recognize English text only. Efforts have been made to modify the engine and its training system to make them able to deal with other languages and UTF-8 characters. Tesseract 3.0 can handle any Unicode characters (coded with UTF-8), but there are limits as to the range of languages that it will be successful with ..."

"... Tesseract needs to know about different shapes of the same character by having different fonts separated explicitly. ..."

"... Any language that has different punctuation and numbers is going to be disadvantaged by some of the hard-coded algorithms that assume ASCII punctuation and digits ..." [2]

[1] Ray Smith, Daria Antonova, Dar-Shyang Lee. Adapting the Tesseract open source OCR engine for multilingual OCR. ACM, 2009.
[2] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Tamil has almost all of the issues mentioned above.

I am wondering where to start learning the code, where to test it, and so on.

-Sibi

Shree Devi Kumar

Jul 21, 2014, 2:57:18 AM
to tesser...@googlegroups.com
Sibi,

I would suggest that you try Tesseract through a GUI frontend such as VietOCR, with the Tamil training data provided by Google (3.02 is the latest version, I think), to get an idea of how well it recognizes Tamil.
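
If you prefer the command line to a GUI, something like the sketch below runs the same kind of test. It assumes the `tesseract` executable is on your PATH and that tam.traineddata is installed in its tessdata directory; the function names and file names are placeholders.

```python
import subprocess


def build_command(image_path, output_base, lang="tam"):
    """Return the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="tam"):
    """Run Tesseract and return the recognized text.

    Tesseract writes its result to <output_base>.txt as UTF-8.
    """
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


# Example with placeholder names:
#   text = run_ocr("page.png", "page_out")
```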

You can create your own training data using jTessBoxEditor.

More training tools and traineddata for other languages may be forthcoming during the next few months, but no one knows when...

Shree


sibi kanagaraj

Jul 21, 2014, 5:15:59 AM
to tesser...@googlegroups.com


Hi Shree,

Thank you for the input.

I have started testing with a .png image for Tamil, taken from a Tamil textbook.

Although an entire page was given as input, I am pasting only the most accurate result I got. Even so, I am sure the deviation is quite large.

My problem is that I don't know where to start reading and working on the code. I look at the FAQ and suddenly jump to the modules, then from there run into the 2.0 versus 3.0 confusion, and it keeps growing.

For the given input shown above,

the output is pasted here:

http://pastebin.com/PMRz204y
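
To put a number on how large the deviation is, one common measure is the character error rate: the edit distance between the OCR output and a hand-typed reference, divided by the length of the reference. A minimal sketch, assuming both texts are available as Python strings and counting Unicode code points rather than Tamil syllables (the function names are illustrative):

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1      # ref_char missing from the output
            insertion = current[j - 1] + 1  # extra hyp_char in the output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because a Tamil syllable often spans several code points, a single misread syllable can count as more than one error with this measure.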

-Sibi
single.png

Shree Devi Kumar

Aug 7, 2014, 8:14:25 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
Hello Sibi,


It has training files which can be used as a starting point for Tamil script training for Tesseract 3.02/3.03.
I am only familiar with the basics of the Tamil script, so these will require changes and updates.
tam.zip is a zip file with the traineddata, tif and box pairs, and other required files.
dir.txt lists all the files available in the zip. These were produced using Quan Nguyen's jTessBoxEditor and VietOCR.
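
For anyone new to the tif/box pairs: a box file has one line per symbol, giving the symbol followed by its left, bottom, right and top pixel coordinates (origin at the bottom-left of the image) and a page number. A small sketch of reading such lines, assuming that Tesseract 3.x layout; the names `Box`, `parse_box_line` and `read_box_file` are illustrative, not part of any tool.

```python
from collections import namedtuple

Box = namedtuple("Box", "symbol left bottom right top page")


def parse_box_line(line):
    """Parse one '<symbol> <left> <bottom> <right> <top> <page>' line."""
    # Split from the right so the five numbers are always the last fields.
    parts = line.rstrip("\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError("not a box line: %r" % line)
    symbol = parts[0]
    left, bottom, right, top, page = (int(value) for value in parts[1:])
    return Box(symbol, left, bottom, right, top, page)


def read_box_file(path):
    """Read a UTF-8 box file into a list of Box tuples, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [parse_box_line(line) for line in handle if line.strip()]
```

A made-up line such as `பொ 10 20 42 58 0` would parse into a `Box` whose `symbol` is the whole syllable.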

jTessBoxEditor v1.0

  • Integrate support for full automation of Tesseract training
  • Bundle Tesseract Windows training executables (r866), English data, and config files

VietOCR v4.0 Beta

  • Upgrade to Tesseract 3.03 RC (r1051)
THANK YOU, QUAN, for the software and your prompt response to my queries.

Unicharambigs will need to be modified, or postprocessing will be required, for the vowel signs that have parts both before and after the consonant, i.e.

  • பொ 0BCA TAMIL VOWEL SIGN O (combined with pa (ப))
  • போ 0BCB TAMIL VOWEL SIGN OO (combined with pa (ப))
  • பௌ 0BCC TAMIL VOWEL SIGN AU (combined with pa (ப))

Changes will also be required to distinguish between ள 0BB3 TAMIL LETTER LLA and the last part of பௌ 0BCC TAMIL VOWEL SIGN AU (combined with pa (ப)).

The files include tam.traineddata, which can be used with VietOCR to test OCR of Tamil texts.
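
One possible shape for such postprocessing, as a sketch only. It assumes the OCR output sometimes comes out in visual order, with the left-hand vowel part (ெ, ே or ை) before its consonant. It moves that part behind the consonant and then lets Unicode NFC normalization merge ெ + ா into ொ, ே + ா into ோ, and ெ + ௗ into ௌ. Whether the engine really emits visual order depends on the training data, so treat this as an assumption. A sign that directly follows another consonant is left alone because it may already be in logical order, and the ள versus ௗ confusion is not handled because it needs context. The function name is illustrative.

```python
import re
import unicodedata

# Vowel signs drawn to the left of the consonant:
# U+0BC6 (E), U+0BC7 (EE), U+0BC8 (AI).
LEFT_SIGNS = "\u0bc6\u0bc7\u0bc8"
# Tamil consonants lie in the range U+0B95..U+0BB9.
CONSONANTS = "\u0b95-\u0bb9"

# A left-hand sign that does not follow a consonant, then a consonant.
VISUAL_ORDER = re.compile(
    "(?<![%s])([%s])([%s])" % (CONSONANTS, LEFT_SIGNS, CONSONANTS)
)


def to_logical_order(text):
    """Move a stray left-hand vowel sign behind its consonant, then apply NFC."""
    reordered = VISUAL_ORDER.sub(r"\2\1", text)
    return unicodedata.normalize("NFC", reordered)
```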



Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


sibi kanagaraj

Aug 27, 2014, 11:56:58 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
Hello Shree ,

Thank you for the input .

I have some questions about it.

1. Is it possible to use jTessBoxEditor on a GNU/Linux platform (Ubuntu)?
2. How is it different from, or similar to, the training data that is provided along with Tesseract-OCR?

-Sibi

Shree Devi Kumar

Aug 27, 2014, 1:09:23 PM
to tesser...@googlegroups.com
Hi Sibi,

for details about jTessBoxEditor. It requires Java Runtime Environment 6.0 or later.

I have used it only on Windows, but I guess it will run under Ubuntu if you have the Java environment. Please check with Quan about it.

For a sample of Tamil training source files, please download

Note that it is a large file (37 MB), as it contains the sample tif/box pairs. You can use the files as a starting point for Tamil training.

I have not used the Tamil training data provided with Tesseract and cannot comment on it. It is possibly better than the sample files I provided, because I only wanted to give you a framework for training with jTessBoxEditor that you can then improve.

BTW, I noticed that new language-related files have been added to the repository, and you can get the Tamil training text used by Google at


All training-related files for Tamil are at


Hope this helps you.

Shree
