Resuming work on OCR


Debayan Banerjee

Mar 12, 2011, 12:03:47 PM
to indi...@googlegroups.com
Hi,

I am going to resume my work on Indic OCR. I have been spending some
time going over the basics of image processing. I have also surveyed
the existing solutions.

The two key projects we need to be concerned with are OCRopus and
Tesseract. Tesseract is a good isolated-character recogniser
(http://code.google.com/p/tesseract-ocr/), whereas OCRopus has a
rich collection ( http://ocrocourse.iupr.com/ ) of image processing
and document processing routines. OCRopus can also be made to use
Tesseract as a pluggable backend.

Tesseract 3.0 has been adapted well to support Chinese, which has over
3000 characters in its character set. That suggests it can work well
for Indic scripts too, provided we feed it the right kind of
pre-processed image.

About 18 months ago I did some experiments (
http://hacking-tesseract.blogspot.com/2009/12/preliminary-results-for-tilt-method.html
). I think this approach will work. I am going to implement this split
pre-processing step in OCRopus using its C++ routines.

So the plan right now is to do all image pre-processing in OCRopus and
pass the modified image to Tesseract. Until now we have been doing
"maatraa clipping"
(http://sites.google.com/site/ocropus/old-documentation/morphological-operations)
inside Tesseract.
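
For readers unfamiliar with the term, maatraa clipping is generally
understood to mean removing the headline stroke that joins the
characters of a word, so that they can be recognised in isolation. A
minimal sketch of the idea in Python with NumPy follows; it is not the
code used inside Tesseract or OCRopus. It assumes a binarised word
image in which 1 is ink and assumes the headline is the row with the
most ink. The names clip_maatraa and band are hypothetical.

```python
import numpy as np


def clip_maatraa(word, band=1):
    """Blank the assumed headline rows of a binarised word image.

    word: 2-D array, 1 = ink, 0 = background (assumption).
    band: number of rows cleared above and below the detected row
          (hypothetical tuning parameter).
    Returns the clipped copy and the index of the detected headline row.
    """
    word = np.asarray(word)
    profile = word.sum(axis=1)        # ink pixels in each row
    peak = int(np.argmax(profile))    # densest row, assumed to be the headline
    top = max(peak - band, 0)
    bottom = min(peak + band + 1, word.shape[0])
    clipped = word.copy()             # leave the input image untouched
    clipped[top:bottom, :] = 0        # erase the headline band
    return clipped, peak
```

On real scans the band would probably need to follow the stroke
thickness, and a tilted headline would need deskewing before the row
profile is meaningful.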

I will keep documenting my work on
http://hacking-tesseract.blogspot.com/ and will keep this list
updated.

I would like all of you to share your vision and opinion of how we
should proceed to create the first free Indic OCR. I honestly have
little exposure to low-level OCR technology, but I am learning as I
go. I know there are many experienced people on this list who have
worked on OCR, and I would like to know how they think we should
proceed.

--
Debayan Banerjee

Nirmal Verma

Mar 12, 2011, 2:04:52 PM
to indi...@googlegroups.com
Hello Debayan,

Congratulations on starting this work!

Kiran Deshpande is also involved in OCR development work and has done some groundwork in this direction. He can be reached at ki...@vaugle.co.in.

Nirmal Verma

Bikash Bag

Mar 13, 2011, 1:50:09 AM
to indi...@googlegroups.com
Hi, I am glad to know that you are working on Indic OCR. I am also interested in Indic OCR (Oriya OCR), but I am implementing it in Java, so please share the image processing methods and algorithms you are using so that I can implement them in Java.
--
Regards,
Bikash Ranjan Bag


Emaad Ahmed Manzoor

Mar 13, 2011, 1:59:21 AM
to indi...@googlegroups.com
@Debayan: That's great! I'm currently trying to replicate what
you've done using the latest version of Tesseract, just to understand
the entire process involved. I'll post my findings in a thread
here.


--
Emaad Ahmed Manzoor,
Third Year Undergraduate,
BITS - Pilani, KK Birla Goa Campus.
halfclosed.wordpress.com

M.N.S.Rao

Mar 13, 2011, 3:07:39 AM
to indi...@googlegroups.com
Hi,
I am interested in the proceedings. I work on the Windows platform.
MNS Rao

Sriranga(78yrsold)

Mar 14, 2011, 5:40:25 AM
to Debayan Banerjee, indic-ocr
Debayan Banerjee,
I am forwarding the attached files for your information and research. The output file for eng-ben-cheluvi.tif, ben-OCR.txt.rtf, was in Latin (English) script and was converted to Bengali; it appears to be in order and is 100% accurate when compared with the original ben-cheluvi.txt. I should add that I do not know Bengali.
I had a similar experience with the Kannada script.
I am now experimenting with ben-cheluvi.tif and will send you feedback shortly.
With Warmest Regards,
-sriranga(78yrs)

On Sun, Mar 13, 2011 at 8:51 AM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
Dear Banerjee,
Recently I was in fact thinking of approaching you to ask for help and to request that you take up research on Indic OCR again, if possible. Incidentally, by the grace of the Supreme Lord, you have now voluntarily decided to pursue the Indic OCR project in the interest of the Indian community.

Yes, it is a good idea to start with OCRopus for pre-processing and pass the modified image to the Tesseract engine. I suggest developing common post-processing for Indic scripts.
In my experience, the major problem of "Apply boxes" failures does not occur in tesseract-3.01 alpha. The unicharset file shows the Indic characters along with their Unicode code points, which is an advantage. I have tested this for Kannada; my experience is as follows:
I converted all the Kannada text to Latin (English) script with the help of BarahaIME, generated a tif file and a box file in Latin script, and finally generated the traineddata. When run through Tesseract, the output was correct, and even after the output was converted back from Latin script to Kannada it was 100% accurate. I shall forward a sample to you so that you can investigate why the Latin-script output is 100% accurate whereas the normal output in Kannada is not. In this connection, I am willing to assist you with all types of beta testing and to send you feedback.

I suggest starting with Sanskrit (which is the mother of Indic scripts), whose script is similar to Bengali as well as Hindi. This will help the Indic community of the Indic-ocr forum to contribute.

Wishing you All The Success in your Good Mission by the Grace of Supreme Lord.
With warmest Regards,
-sriranga(78yrs)



 

On Sat, Mar 12, 2011 at 10:33 PM, Debayan Banerjee <deba...@gmail.com> wrote:
eng-ben-cheluvi.txt
ben.traineddata
ben-OcrText.rtf
ben.unicharset
eng-ben-cheluvi..box
ben-cheluvi.txt
eng-ben-cheluvi..tif

jayanta nath

Mar 14, 2011, 5:56:13 AM
to indi...@googlegroups.com
Dear All,
Have you seen that a Bangladeshi developer has done some work on Bengali OCR?
http://crblpocr.blogspot.com/
http://code.google.com/p/banglaocr/


2011/3/14 Sriranga(78yrsold) <withbl...@gmail.com>



--
With Warm Regards,
Jayanta Nath
Calcutta,West Bengal
+91 9836294438
Facebook :http://www.facebook.com/jayantanth
Wikipedia :http://en.wikipedia.org/wiki/User:Jayantanth
Let us build a piracy-free India; let us all use free software O:-), and encourage others to use it.
______________________________


Debayan Banerjee

Mar 14, 2011, 6:05:09 AM
to indi...@googlegroups.com
On 14 March 2011 15:26, jayanta nath <jayan...@gmail.com> wrote:
> Dear All,
> Have you seen that a Bangladeshi developer has done some work on Bengali OCR?
> http://crblpocr.blogspot.com/
> http://code.google.com/p/banglaocr/

Yes, it's M S Hasnat. I am in touch with him. We need to build something more robust that also supports more languages.



--
Debayan Banerjee

sriranga(78yrsold) location: Bangalore

Mar 15, 2011, 8:03:06 AM
to indic-ocr
Dear Debayan Banerjee,
I would like to know when you will restart the project and what you
plan to implement.
I am ready to assist your good project by way of beta testing and
feedback at any time.
Please upload the Bengali fonts (WinXP and Linux) used in your
project.
With regards,
-sriranga(78yrs)

On Mar 14, 3:05 pm, Debayan Banerjee <debaya...@gmail.com> wrote:

Debayan Banerjee

Mar 15, 2011, 8:16:24 AM
to indi...@googlegroups.com
On 15 March 2011 17:33, sriranga(78yrsold) location: Bangalore <withbl...@gmail.com> wrote:
> Dear Debayan Banerjee,
> I would like to know when you will restart the project and what you
> plan to implement.
> I am ready to assist your good project by way of beta testing and
> feedback at any time.
> Please upload the Bengali fonts (WinXP and Linux) used in your
> project.

Please be a little patient. I also have a fair amount of work from my
day job.
Currently I am looking at the whole thing from a high-level
architectural point of view and am learning some machine learning
concepts as well. I will touch on lower-level training issues later.

--
Debayan Banerjee

Sriranga(78yrsold)

Mar 15, 2011, 10:38:36 AM
to indi...@googlegroups.com
Dear Banerjee,
Thanks for the updated information. I shall wait for good news.
With Regards,
-sriranga(78yrs)