Training Tesseract for handwritten letters

Thilanka Kaushalya

May 8, 2010, 1:53:27 PM
to tesser...@googlegroups.com
Hi,

I'm doing handwritten character recognition using Tesseract. I tried to train the Tesseract executable on my data set on Windows,
following the guide on the wiki: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract . But I could not get it to work.
These are the steps I followed:
                Downloaded Tesseract 2.04.
                Created a folder named tessdata in that folder.
                Then created the following files in the tessdata folder:
                            
  • tessdata/eng.freq-dawg
  • tessdata/eng.word-dawg
  • tessdata/eng.user-words
  • tessdata/eng.inttemp
  • tessdata/eng.normproto
  • tessdata/eng.pffmtable
  • tessdata/eng.unicharset
  • tessdata/eng.DangAmbigs
                Then I placed a TIFF image containing the English letter "a" in the root folder.
                Then I entered the following command:

tesseract a.tif fontfile batch.nochop makebox

But this gives the following error:

read_variables_file:Can't open ./tessdata/configs/makebox
Unable to load unicharset file ./tessdata/eng.unicharset

Can someone please help me fix this issue? Thanks in advance.

Regards,
Thilanka.
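
The error above names two files: the makebox config and eng.unicharset. A minimal Python sketch, assuming the cause is that some of these files are missing or are empty placeholders (the thread does not confirm this), could list which of the files mentioned in this post need attention. The function name check_tessdata is hypothetical, and the file list is taken only from the post and the error message.

```python
from pathlib import Path

# File suffixes taken from the list in the post above; "configs/makebox" is
# the config file named in the error message. No other layout is assumed.
REQUIRED_SUFFIXES = [
    "freq-dawg", "word-dawg", "user-words", "inttemp",
    "normproto", "pffmtable", "unicharset", "DangAmbigs",
]


def check_tessdata(tessdata_dir, lang="eng"):
    """Return (path, problem) pairs for files that are missing or empty."""
    root = Path(tessdata_dir)
    wanted = [root / f"{lang}.{suffix}" for suffix in REQUIRED_SUFFIXES]
    wanted.append(root / "configs" / "makebox")
    problems = []
    for path in wanted:
        if not path.is_file():
            problems.append((str(path), "missing"))
        elif path.stat().st_size == 0:
            problems.append((str(path), "empty"))
    return problems


if __name__ == "__main__":
    # Run from the folder that contains tessdata/.
    for path, problem in check_tessdata("./tessdata"):
        print(f"{problem}: {path}")
```

An empty result only means the files exist and are non-empty; it does not show that they are valid Tesseract data.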

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com.
To unsubscribe from this group, send email to tesseract-oc...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.

Zdenko Podobný

May 17, 2010, 3:11:23 AM
to tesser...@googlegroups.com
Hello,

Can you provide more information (OS? How did you install Tesseract?)

Zd.


Thilanka

May 21, 2010, 1:17:15 PM
to tesseract-ocr
Hi Zdenko,

Thanks for the reply. Initially I used Windows for testing, and it
didn't work. That's why I posted this question. After that I built it
on Ubuntu, and that worked. Thank you very much.


Thilanka

Jun 12, 2010, 1:46:37 PM
to tesseract-ocr
Hi,

I want to train Tesseract for handwritten character recognition.
I have searched Google many times for an appropriate handwritten
data set for English, but I could not find one.

If someone here has done such training with Tesseract before,
could you please share your data set with me
or point me to some help? Thanks in advance.

Thilanka.

Sriranga(77yrsold)

Jun 19, 2010, 7:44:53 AM
to tesser...@googlegroups.com
Hi Thilanka,
I think you want to train for a handwritten language; which language? Since you are interested in handwriting training, please search for the "handwriting" discussions in the forum and read them completely. You will get the idea. My past experiments (with 2.03/2.02, two or three years back) showed that it is possible to train tesseract-ocr on handwriting.
I tested on Windows XP only. You can experiment with the latest version of tesseract-ocr.
Wishing you success in your efforts,
-sriranga(77yrs old now)

Thilanka

Jun 20, 2010, 3:33:36 AM
to tesseract-ocr
Hi Sriranga,

Thanks for the reply. I'm trying to train Tesseract for English
handwritten characters and numbers.
OK, I'll refer to the past discussions. Thank you.

Regards,
Thilanka.


Jimmy O'Regan

Jun 20, 2010, 8:27:30 AM
to tesser...@googlegroups.com
On 20 June 2010 11:22, Sriranga(77yrsold) <withbl...@gmail.com> wrote:
> Thilanka,
> which lang you know well. Since new version 3.0( r=400) has Chinese
> Lang.trained data. You are
> aware that chinese script has number of strokes and being such a case, I
> firmly believed that English handwriting can be trained easily and
> successfully. please forward sample handwritten to my email address =
> withbl...@gmail.com.  I shall try myself and feedback to you..

No.

The issues of Chinese and handwriting are completely different. With
Chinese, the issue is that of a large character set; with handwriting
- that is, of handwritten printed characters, not cursive - it's the
wide range of variation. Write the same sentence 10 times, then look
at the page - no two characters will be exactly alike (think of this
as training on multiple examples from the same font - you have to
learn the variations). On top of that, handwriting is 'unique'; each
person's handwriting should be thought of as a different font
- and there's no way to train for that.

You may have some luck, but don't be surprised if the results are
dramatically less accurate than for printed text.

Cursive writing has its own set of issues - in particular, character
segmentation of joined letters. Tesseract has no support for this type
of segmentation - it even has problems in training from regular
printed pages, when there is not enough space between the characters.
(Sriranga, you have encountered this limitation a number of times, if
the issue tracker is anything to go by).

In summary:

For a single person, with printed characters: you might be lucky.
For multiple people, with printed characters: don't have high expectations.
For cursive: expect close to nothing.

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

Aruna Devi

Feb 10, 2012, 3:00:57 AM
to tesser...@googlegroups.com
Hi,
I'm working on OCR now. Using Tesseract, my application is able to recognise printed characters but not handwritten ones. I searched for how to recognise handwritten characters but was not able to understand it. I'm working on Windows 7, so please help me with how to proceed.

Thank you.

Sriranga(78yrsold)

Feb 10, 2012, 3:34:07 AM
to tesser...@googlegroups.com
Please state which version of tesseract-ocr you are using to test printed or handwritten text, and please upload a sample of the handwriting.


Merve Temizer

Feb 10, 2012, 3:53:00 AM
to tesser...@googlegroups.com
You must train Tesseract with handwritten samples.

Aruna Devi

Feb 10, 2012, 4:18:28 AM
to tesser...@googlegroups.com
Sorry sir, I do not have handwritten samples of each letter. My task is to recognise a form (image) that has both printed and handwritten characters and convert it to an editable form. But the editable output contains only the printed characters; the handwritten text is not recognised (sometimes it is blank, sometimes junk characters). I'm using tess4j as the Java wrapper, calling doOCR() on an instance of Tesseract.

Aruna Devi

Feb 10, 2012, 12:29:48 PM
to tesseract-ocr
Yes, sure, sir. I scanned a document in both PNG and TIF
format. I want to know a simple way to train
Tesseract. But how do I upload an image here?
Thanks for your valuable replies.


Aruna Devi

Feb 17, 2012, 12:06:12 AM
to tesser...@googlegroups.com
Sir, I have separate trained data files for small letters (which you gave me) and capital letters, as well as the available trained data for printed text.

tesseract picture.png picture -l eng+han+ABC

Error opening data file C:\Program Files\Tesseract-OCR\tessdata/eng+han+ABC.traineddata

I got this error while executing the above command. Can anyone please suggest how to solve it?

zdenko podobny

Feb 17, 2012, 4:26:47 AM
to tesser...@googlegroups.com
Are you using the (not yet released) Tesseract 3.02? (You can find out with 'tesseract -v'.)
This feature (specifying multiple languages for OCR) is not available in prior versions.

Zdenko
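
As a rough illustration of the version dependence described above, here is a minimal Python sketch that builds the tesseract command line differently depending on the version. The helper names parse_version and build_commands are hypothetical, the version string is assumed to be typed in by hand after reading the output of 'tesseract -v', and running one command per language on older versions is an assumed workaround, not something suggested in this thread.

```python
def parse_version(text):
    """Turn a numeric version string such as '3.01' into a tuple, e.g. (3, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def build_commands(image, output_base, langs, version):
    """Build tesseract command lines for a language spec such as 'eng+han+ABC'.

    From 3.02 on, the spec is passed through as a single -l argument.
    For older versions, one command per language is produced, each with
    its own output base (an assumed workaround).
    """
    if parse_version(version) >= (3, 2):
        return [["tesseract", image, output_base, "-l", langs]]
    return [
        ["tesseract", image, f"{output_base}_{lang}", "-l", lang]
        for lang in langs.split("+")
    ]


if __name__ == "__main__":
    for cmd in build_commands("picture.png", "picture", "eng+han+ABC", "3.01"):
        print(" ".join(cmd))
```

Each returned list can be passed to subprocess.run(cmd, check=True). The per-language outputs would still have to be compared or merged by hand, which is not the same as the combined recognition that 3.02 performs.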

Aruna Devi

Feb 17, 2012, 5:51:23 AM
to tesser...@googlegroups.com
Sir, I'm using Tesseract 3.01, which means the feature is not available in this version.

Aruna Devi

Feb 20, 2012, 11:30:24 AM
to tesseract-ocr
How do I download Tesseract 3.02 (for Linux) and use it? Please let me
know; I'm very keen to convert the form image to editable text. And
can 3.01 trained data be used in 3.02?
Thanks in advance.

79yrsold

Feb 21, 2012, 12:13:36 PM
to tesseract-ocr
Please visit the wiki section - http://code.google.com/p/tesseract-ocr/w/list
- as well as the FAQ, where detailed instructions regarding download and
usage are available.

Nouman Ghumman

Apr 18, 2015, 9:13:35 AM
to tesser...@googlegroups.com
Hello sir,
Kindly mail me all the handwritten training samples of small letters and capital letters: nouman...@gmail.com
Thank you.

Rajvel G

Mar 23, 2016, 12:31:38 PM
to tesseract-ocr, lgtkau...@gmail.com
Hi Thilanka,

I am doing my final-year project using Tesseract to analyse handwritten source code, and I need some Tesseract data files.

Please share them. Thank you.

Joshua Nwogu

Jun 10, 2017, 3:15:56 PM
to tesseract-ocr, lgtkau...@gmail.com
Hello Rajvel,

Were you able to find the data file? If you did, could you please share what you found? Also, were you successful? Thank you.

chandra churh chatterjee

Jun 20, 2018, 9:15:58 AM
to tesseract-ocr
What is the format of your dataset, and what does it contain? Could you please tell me the details? You mentioned above that you were training on Tesseract 2.04; I am trying to do similar handwriting recognition work using Tesseract 4.0, and I would also like to know the size of your dataset.