Hi,
Sorry for my delayed reply.
Thank you, Paul and Nick, for your inputs.
@Paul,
//imagery for doing training is not available. So basically you would have to start all over.//
Start all over in what sense? I described the effort I have put in so far in my earlier mail. Do you mean that the training process has to be started from the beginning?
@Nick White,
//Can you give us some clue as to what you think could be improved
about the current Tamil recognition? Changes of configuration
variables, or ambiguity rules (the unicharambigs file), don't need
access to the training images.
//
So far I have only gone through the documents and have not yet got my
hands on the code or the actual workings of the engine. I am in the
initial stages of my analysis. I have a fair amount of time (around 9
months) to work on the project, and I would love to contribute to a
project under the Apache License and also in my mother tongue.
“The new page layout analysis for Tesseract was designed from the beginning to be language-independent, but the rest of the engine was developed for English, without a great deal of thought as to how it might work for other languages.” [1]
The training document for Tesseract also notes that “.. the Tesseract was originally designed to recognize English text only. Efforts have been made to modify the engine and its training system to make them able to deal with other languages and UTF-8 characters. Tesseract 3.0 can handle any Unicode characters (coded with UTF-8), but there are limits as to the range of languages that it will be successful with..” and “..Tesseract needs to know about different shapes of the same character by having different fonts separated explicitly. ..” and “..Any language that has different punctuation and numbers is going to be disadvantaged by some of the hard-coded algorithms that assume ASCII punctuation and digits...” [2]
[1] Ray Smith, Daria Antonova, Dar-Shyang Lee. Adapting the Tesseract open source OCR engine for multilingual OCR. Published by ACM, 2009.
[2] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
Tamil has almost all of the above-mentioned issues.
I am wondering where to start learning the code, where to test it, and so on.
-Sibi