Training the Tesseract-OCR for Kannada Language

sri1683

May 18, 2012, 10:44:24 AM
to tesseract-ocr
Hi,

This is my first time posting a question. I don't know exactly how
to do it, but I will try my luck. Please correct me if I am not
clear in what I say.
I have seen Tesseract-OCR working for the English language
and was fascinated by it. Now I want to train it to read
Kannada text, but I do not know how. Has anyone tried this
before? If so, please help me, as I don't know how to go about training
the OCR.

Thanks in advance.

Dakshika Jayathilaka

May 20, 2012, 8:46:00 PM
to tesser...@googlegroups.com

sri1683

May 22, 2012, 5:27:08 AM
to tesseract-ocr
Thanks a lot.
That was very helpful.
I could create the traineddata file.

I am training Tesseract 3.00, not 2.00 as described in the link
you gave me.

However, I am getting a blank text file after training.
One more interesting point is that the traineddata file for
Kannada is smaller than the English one.
Since English has far fewer characters than Kannada,
I assumed the Kannada traineddata file would be
larger.
Please help me understand this.

I also had a problem with the font properties, as I am unable to find
the actual font details for this language.

Taha Alasli

May 22, 2012, 6:35:16 AM
to tesser...@googlegroups.com
I think the size of the traineddata file depends on the TIFF/box files you used.
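
One rough way to see what a set of TIFF/box pairs actually covers is to count the distinct symbols in the box files. The sketch below assumes the Tesseract 3.x box format, one glyph per line as `symbol left bottom right top page`; the function name and the sample data are made up for illustration and are not part of Tesseract.

```python
from collections import Counter

def count_box_symbols(box_text):
    """Count how often each symbol appears in the text of one box file.

    Assumes the Tesseract 3.x box format: the symbol, then five integers
    (left, bottom, right, top, page), separated by single spaces.
    """
    counts = Counter()
    for line in box_text.splitlines():
        # Split from the right: five numeric fields, the rest is the symbol.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        counts[parts[0]] += 1
    return counts

# Hypothetical sample: three boxes, two distinct Kannada symbols.
sample = "ಕ 10 20 30 40 0\nನ 35 20 55 40 0\nಕ 60 20 80 40 0\n"
counts = count_box_symbols(sample)
print(len(counts), "distinct symbols,", sum(counts.values()), "boxes")
```

If the number of distinct symbols is far below what the script needs, the training pages probably do not cover it, which would be consistent with a small traineddata file.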

--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

sri1683

May 22, 2012, 6:53:46 AM
to tesseract-ocr
Hi Taha,

Thanks for the suggestion.
I used 6 TIFF images for training.
That is what led me to think that the traineddata file should be
bigger.


mns_rao

May 27, 2012, 12:21:10 PM
to tesseract-ocr
Hi,
Some of us working on a Kannada language file have produced a
traineddata file that is fairly good, though of course it needs
post-processing. I can send it by mail to anyone interested, on request.
Thanks,
MNS Rao

sridhar n

May 28, 2012, 2:50:28 AM
to tesser...@googlegroups.com
Hello Mr. Rao,

Could you please send the traineddata file to me, as I am stuck?

I am also stuck on another problem: Tesseract reads the text, but the output the engine writes to the text file is some special characters. I am unable to get the Tesseract engine to write output that shows up as Kannada in Notepad.

As far as I can see, the output is being written on the basis of shape, but it needs to be written in Kannada.
That is the point where I am stuck.

Please help..
--
regards,
Sri
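
A quick way to tell a display problem from a recognition problem is to decode the output file as UTF-8 and see how many characters fall in the Unicode Kannada block (U+0C80 to U+0CFF). The sketch below assumes the output file is UTF-8, which is what Tesseract writes; the function names are hypothetical, and the copy saved with a byte-order mark is only a guess at what an older Notepad needs to detect the encoding.

```python
def kannada_ratio(text):
    """Return the fraction of non-whitespace characters in the Kannada block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    kannada = sum(1 for c in chars if "\u0c80" <= c <= "\u0cff")
    return kannada / len(chars)

def resave_with_bom(src_path, dst_path):
    """Read a UTF-8 text file and write a copy that starts with a BOM."""
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    # "utf-8-sig" prepends the byte-order mark on write.
    with open(dst_path, "w", encoding="utf-8-sig") as dst:
        dst.write(text)
    return text

print(kannada_ratio("ಕನ್ನಡ"))  # every code point is in the Kannada block
print(kannada_ratio("abc"))    # no Kannada characters at all
```

A ratio near 1 suggests the file already holds Kannada text and the viewer or font is at fault. A ratio near 0 points back at the training data, for example the characters entered in the box files.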

Anand S

Jan 13, 2014, 9:23:50 AM
to tesser...@googlegroups.com
Yes, I have tried it for Kannada. It is possible to train Tesseract for Kannada. It works well, but it is a little more complicated than for English.

Sriranga(80yrs)

Jan 13, 2014, 12:35:08 PM
to tesser...@googlegroups.com, M.N.S.Rao
Training Kannada is no longer complicated compared to English, thanks to jboxeditor v1.0 and its developer Quan. Starting from a text file that you feed into this wonderful tool, you can generate kan.traineddata (in other words, the tool generates kan.traineddata automatically, following the wiki instructions).
You can download jboxeditor 1.0 and also VietOCR from this website:
https://www.google.com/search?q=vietocr&ie=UTF-8&sa=Search&channel=fe&client=browser-ubuntu&hl=en



Manjunath M

May 26, 2015, 10:42:55 AM
to tesser...@googlegroups.com, "could you please send the source code...@gmail.com, sridha...@gmail.com
Please send the source code.

Cisa Anand

Feb 6, 2018, 11:14:53 AM
to tesseract-ocr
Hi guys,
I am working on a project involving Kannada text extraction. I used the kan.traineddata available on the Tesseract website, but there are quite a lot of inaccuracies in the output. Your Kannada traineddata file might be very helpful to me; I found a reference to it in this tesseract-ocr Google group. Could you please share it with me?

ShreeDevi Kumar

Feb 6, 2018, 11:26:13 AM
to tesser...@googlegroups.com
Have you tried Tesseract with the traineddata from tessdata_fast and tessdata_best?
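
Trying both model sets amounts to pointing Tesseract at a different tessdata directory for each run. This is a minimal sketch, assuming Tesseract 4 is installed and `kan.traineddata` from each repository has been downloaded into the hypothetical local directories shown. `--tessdata-dir` and `-l` are standard command-line options; the helper names, image name, and paths are illustrative only.

```python
import subprocess

def tesseract_command(image, output_base, tessdata_dir, lang="kan"):
    """Build the argument list for one Tesseract run against a given tessdata directory."""
    return ["tesseract", image, output_base, "--tessdata-dir", tessdata_dir, "-l", lang]

def run(cmd):
    """Run one command; requires Tesseract and the model files to be present."""
    return subprocess.run(cmd, check=True)

# Hypothetical local copies of the two model repositories.
model_dirs = {"fast": "./tessdata_fast", "best": "./tessdata_best"}
commands = {
    name: tesseract_command("page.png", "out_" + name, path)
    for name, path in model_dirs.items()
}

for name, cmd in commands.items():
    print(name, " ".join(cmd))
    # Call run(cmd) here once the paths point at real files.
```

Each run writes its own text file (`out_fast.txt`, `out_best.txt`), so the two results can be compared side by side.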


Anand Bp

Aug 11, 2020, 5:36:17 AM
to tesseract-ocr
Anand/sri/Rao

I have started working on a Kannada project. Could any one of you send me the Kannada data file? It would be a great help. ana...@gmail.com, 91 9845058824. Thanks in advance. Anand