I am going to resume my work on Indic OCR. I have been spending some
time going over the basics of image processing, and I have also
surveyed the existing solutions.
The two key projects we need to be concerned with are OCRopus and
Tesseract. Tesseract is a good isolated character recogniser
(http://code.google.com/p/tesseract-ocr/), whereas OCRopus has a
rich collection (http://ocrocourse.iupr.com/) of image processing
and document processing routines. OCRopus can also be made to use
Tesseract as a pluggable backend.
Tesseract 3.0 has been adapted well to support Chinese, which has over
3000 characters in its character set. That suggests it can work well
for Indic scripts too, if we can feed it the right kind of
pre-processed image.
Around 18 months ago I did some experiments (
http://hacking-tesseract.blogspot.com/2009/12/preliminary-results-for-tilt-method.html
). I think this approach will work. I am going to implement this split
pre-processing step in OCRopus using its C++ routines.
So the plan right now is to do all image pre-processing in OCRopus and
pass the modified image to Tesseract. Until now we have been doing
"maatraa clipping"
(http://sites.google.com/site/ocropus/old-documentation/morphological-operations)
inside Tesseract.
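
To make the pre-processing idea concrete, here is a minimal sketch of headline clipping on a binarised word image. It assumes that "maatraa clipping" means blanking the horizontal headline that joins the characters of a word, and it locates that line with a plain horizontal projection profile. The function name `clip_headline` and the `band_fraction` parameter are hypothetical; this is not the OCRopus or Tesseract API, only an illustration of the technique.

```python
def clip_headline(rows, band_fraction=0.8):
    """Blank the headline band of a binary word image.

    rows: list of equal-length lists, 1 = ink, 0 = background.
    A row counts as headline when its ink count is at least
    band_fraction of the densest row (an assumed heuristic).
    Returns a new image; the input is not modified.
    """
    if not rows:
        return []
    # Horizontal projection profile: ink pixels per row.
    profile = [sum(row) for row in rows]
    peak = max(profile)
    if peak == 0:
        # Blank image: nothing to clip.
        return [list(row) for row in rows]
    threshold = band_fraction * peak
    return [
        [0] * len(row) if count >= threshold else list(row)
        for row, count in zip(rows, profile)
    ]


# Toy word image: two glyphs joined only by the headline in row 1.
word = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
]
clipped = clip_headline(word)
```

After clipping, the middle column of the toy image contains no ink, so the two glyphs can be separated before the image is handed to the recogniser. A real page would need skew correction and per-word segmentation first, and a fixed threshold like this can also remove dense rows that are not part of the headline.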
I will keep documenting my work on
http://hacking-tesseract.blogspot.com/ and will keep this list
updated.
I would like all of you to share your vision and opinions on how we
should proceed to create the first free Indic OCR. I honestly have
little exposure to low-level OCR technology, but I am learning as I
go. I know there are many experienced people on this list who have
worked on OCR, and I would like to know how they think we should
proceed.
--
Debayan Banerjee
--
Emaad Ahmed Manzoor,
Third Year Undergraduate,
BITS - Pilani, KK Birla Goa Campus.
halfclosed.wordpress.com
Dear Banerjee,
Recently I was, in fact, thinking of approaching you to request help and to ask you to take up research on Indic OCR again, if possible. Incidentally, by the Grace of the Supreme Lord, you have now voluntarily decided to pursue the Indic OCR project in the interest of the Indian community.
Yes, it is a good idea to start with OCRopus for post-processing and pass the modified image to the Tesseract engine. I suggest developing common post-processing for Indic scripts.
From my experience, the major problem of "Apply boxes" failures does not exist in Tesseract 3.01 alpha. The unicharset file displays the Indic characters along with their Unicode numbers, which is a further advantage. I have tested it with Kannada; my experience is as follows:
I converted all the Kannada text to Latin script with the help of BarahaIME, generated a TIFF file, generated the box file in Latin script, and finally generated the traineddata. When run through Tesseract, the output was correct, and even after the output was converted back from Latin script to Kannada it was 100% correct. I shall forward you a sample for your research, to find out why the Latin output reaches 100% whereas normal output in Kannada does not. In this connection, I am willing to assist you with all types of beta testing and to give you feedback.
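
The round trip described above can be sketched as a reversible substitution between Kannada code points and Latin symbols. The table below is a hypothetical four-entry mapping chosen only for illustration; it is not Baraha's transliteration scheme, and a real table would have to cover the whole script.

```python
# Hypothetical one-to-one table: Kannada code point -> one ASCII symbol.
TO_LATIN = {
    "\u0c95": "k",  # KANNADA LETTER KA
    "\u0ca8": "n",  # KANNADA LETTER NA
    "\u0ca1": "D",  # KANNADA LETTER DDA
    "\u0ccd": "~",  # KANNADA SIGN VIRAMA
}
# Inverse table, valid because the mapping above is one-to-one.
TO_KANNADA = {latin: kannada for kannada, latin in TO_LATIN.items()}


def encode(text):
    """Replace mapped Kannada code points with their Latin symbols."""
    out = []
    for ch in text:
        if ch in TO_KANNADA:
            # A literal Latin symbol from the table would make decoding ambiguous.
            raise ValueError("input already contains table symbol %r" % ch)
        out.append(TO_LATIN.get(ch, ch))
    return "".join(out)


def decode(text):
    """Map Latin symbols (for example, recogniser output) back to Kannada."""
    return "".join(TO_KANNADA.get(ch, ch) for ch in text)


# The word "Kannada" written in Kannada script: KA NA VIRAMA NA DDA.
word = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"
latin = encode(word)
restored = decode(latin)
```

Because each code point maps to exactly one symbol and back, a correct Latin result always decodes to the original Kannada text. Any difference in accuracy between the two training setups would therefore come from the recognition side, not from the conversion.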
I suggest starting with Sanskrit (which is the mother of the Indic scripts), which is similar to Bengali as well as to the Hindi script. This will help the Indic community of the Indic-OCR forum to contribute.
Wishing you all success in your good mission, by the Grace of the Supreme Lord.
With warmest Regards,
-sriranga(78yrs)
On Sat, Mar 12, 2011 at 10:33 PM, Debayan Banerjee <deba...@gmail.com> wrote:
Dear All,
Have you seen the work that someone from Bangladesh has done on Bengali OCR?
http://crblpocr.blogspot.com/
http://code.google.com/p/banglaocr/
Please be a little patient. I have a reasonable amount of workload
from my day job as well.
Currently I am looking at the whole thing from a high level
architecture point of view and am learning some machine learning
concepts as well. I will touch on lower level training issues later.
--
Debayan Banerjee