traineddata file size varies according to box file images?

Frederico Ferro Schuh

Feb 25, 2014, 11:51:39 AM

to tesser...@googlegroups.com

Hello all,

I'm training Tesseract to recognize handwritten digits. I have provided about 6000 samples of each digit in 10 image/box file pairs, one for each digit. Each image is a 2152x2152 TIF file. However, the traineddata file I get after completing the training procedure is only 137 KB.

I went through the process again with smaller sample files (1000 samples of each digit) and ended up with the same traineddata size of 137 KB.

Is this size reasonable or am I doing something wrong?

I assume something is wrong because my results are pretty bad so far.

I've attached the sample image I am using for the digit 0.

Thanks in advance,

Fred

eng.hwdigitbig.exp0.tif

Bernard Polarski

Feb 25, 2014, 1:00:20 PM

to tesser...@googlegroups.com

How do you produce your traineddata?

universal reseller

Feb 25, 2014, 6:31:03 PM

to tesser...@googlegroups.com

I have this problem too.
I used jTessBoxEditor to train Tesseract.
My training data had 34000 words, and I built it from a 50-page TIFF file,

but the output traineddata file was 1.5 MB and it did not detect any words!

Does jTessBoxEditor have a problem?

Frederico Ferro Schuh

Feb 25, 2014, 11:38:27 PM

to tesser...@googlegroups.com

I created my traineddata by following these two guides:

http://blog.cedric.ws/how-to-train-tesseract-301

https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

I will now describe in detail every single step I used below.

I have called my test font hwdigitbig.

Here are the steps:

- Create 1 box file for each of my TIF files (each TIF holds samples for 1 digit):

tesseract eng.hwdigitbig.exp0.tif eng.hwdigitbig.exp0 batch.nochop makebox

tesseract eng.hwdigitbig.exp1.tif eng.hwdigitbig.exp1 batch.nochop makebox

...

tesseract eng.hwdigitbig.exp9.tif eng.hwdigitbig.exp9 batch.nochop makebox

- Open box files in jTessBoxEditor and fix incorrect values

- Also in jTessBoxEditor, split/merge invalid bounding boxes (I get many bad bounding boxes in those samples, some spanning 3 characters vertically; I guess I need to clean the images a bit)

- Retrain tesseract with fixed box files for each digit

tesseract eng.hwdigitbig.exp0.tif eng.hwdigitbig.exp0.box nobatch box.train

...

tesseract eng.hwdigitbig.exp9.tif eng.hwdigitbig.exp9.box nobatch box.train

- Generate unicharset for all boxes together

unicharset_extractor eng.hwdigitbig.exp0.box eng.hwdigitbig.exp1.box eng.hwdigitbig.exp2.box eng.hwdigitbig.exp3.box eng.hwdigitbig.exp4.box eng.hwdigitbig.exp5.box eng.hwdigitbig.exp6.box eng.hwdigitbig.exp7.box eng.hwdigitbig.exp8.box eng.hwdigitbig.exp9.box

- Font properties file (the simplest font possible, no effects applied to it)

echo "hwdigitbig 0 0 0 0 0" > font_properties

- Clustering step (2 commands, all trained box files together on each command)

mftraining -F font_properties -U unicharset -O eng.unicharset eng.hwdigitbig.exp0.box.tr eng.hwdigitbig.exp1.box.tr eng.hwdigitbig.exp2.box.tr eng.hwdigitbig.exp3.box.tr eng.hwdigitbig.exp4.box.tr eng.hwdigitbig.exp5.box.tr eng.hwdigitbig.exp6.box.tr eng.hwdigitbig.exp7.box.tr eng.hwdigitbig.exp8.box.tr eng.hwdigitbig.exp9.box.tr

cntraining eng.hwdigitbig.exp0.box.tr eng.hwdigitbig.exp1.box.tr eng.hwdigitbig.exp2.box.tr eng.hwdigitbig.exp3.box.tr eng.hwdigitbig.exp4.box.tr eng.hwdigitbig.exp5.box.tr eng.hwdigitbig.exp6.box.tr eng.hwdigitbig.exp7.box.tr eng.hwdigitbig.exp8.box.tr eng.hwdigitbig.exp9.box.tr

- Renaming generated files. The resulting files are:

eng.shapetable

eng.normproto

eng.inttemp

eng.pffmtable

- Generating traineddata

combine_tessdata eng

- The last step will generate this file (137 KB)

eng.traineddata

- I then rename this file to my new test language name, which I'll call the same as my font

hwdigitbig.traineddata

So that concludes the steps I used.
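
For reference, the per-file commands in these steps could be generated rather than typed out. Below is a minimal Python sketch, assuming the file naming used in this post (eng.hwdigitbig.expN.tif). The helper name `training_commands` is made up for illustration, and it only builds argument lists without running anything. It uses cntraining, which is the name the training wiki gives for the second clustering tool.

```python
def training_commands(lang="eng", font="hwdigitbig", pages=10):
    """Build the training command lines from this post as argument lists.

    Hypothetical helper: nothing is executed here. Each list could be
    passed to subprocess.run(cmd, check=True) to actually run it.
    """
    bases = [f"{lang}.{font}.exp{i}" for i in range(pages)]
    boxes = [f"{b}.box" for b in bases]
    trs = [f"{b}.box.tr" for b in bases]

    cmds = []
    # One box.train run per image/box pair (box files already corrected).
    for b in bases:
        cmds.append(["tesseract", f"{b}.tif", f"{b}.box", "nobatch", "box.train"])
    # Character set extracted from all box files together.
    cmds.append(["unicharset_extractor", *boxes])
    # Clustering over all .tr files together.
    cmds.append(["mftraining", "-F", "font_properties", "-U", "unicharset",
                 "-O", f"{lang}.unicharset", *trs])
    cmds.append(["cntraining", *trs])
    return cmds
```

The renaming and combine_tessdata steps are left out on purpose, since they depend on which output files were actually produced.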

The traineddata generated with the steps above is 137 KB, whether I use my big samples of 6000 characters per digit or the reduced files of 1000 samples per digit.

The OCR results are not satisfactory at all; in fact, even the default eng language gives better results for handwriting recognition.

Any ideas/suggestions?

Thank you very much!

Bernard Polarski

Feb 26, 2014, 7:19:28 AM

to tesser...@googlegroups.com

If you do not include a word-dawg or freq-dawg, then the only big file is inttemp.

For 34000 characters I am surprised to see it at a size of around 100k.

However, your 6000 samples cover only 10 digits, so it is quite possible.

As for the poor performance, I think the size is very detrimental: characters are usually 20 to 40 pixels high and 20 to 50 wide (the upper end only for 'm' or 'w').

Too much precision is not good.
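
One way to check this is to measure the glyph heights recorded in the box files. Here is a minimal Python sketch, assuming the Tesseract 3 box format of one `symbol left bottom right top page` entry per line with the origin at the bottom-left of the image. The names `box_heights` and `median_height` are illustrative and not part of Tesseract.

```python
from statistics import median


def box_heights(box_text):
    """Return the pixel height (top - bottom) of every box in a box file's text."""
    heights = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        bottom, top = int(parts[2]), int(parts[4])
        heights.append(top - bottom)
    return heights


def median_height(box_text):
    """Median glyph height in pixels, or None if the text contains no boxes."""
    heights = box_heights(box_text)
    return median(heights) if heights else None
```

A median far outside the 20 to 40 pixel range would suggest rescaling the images before training.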

All the other files are usually rather small (pffmtable, normproto, font_properties, shapetable, unicharset, unicharambigs) and combined are less than 100k.

In this respect your traineddata seems normal.

Besides that, you could use wildcards:

shapeclustering *.tr

mftraining *.tr

cntraining *.tr

Frederico Ferro Schuh

Feb 28, 2014, 6:58:42 AM

to tesser...@googlegroups.com

Thanks for the reply Bernard.

It's good to know that my traineddata size is normal. I will now focus on improving my samples; hopefully that will improve the performance. Seems like a case of overtraining.

The *.tr tip is a gem, really appreciate it :)

Thanks again!

Fred

zdenko podobny

Feb 28, 2014, 7:27:18 AM

to tesser...@googlegroups.com

Wildcard expansion (*.tr) is a shell/OS matter (see e.g. Windows [1]), so support for this feature depends on the shell and not on Tesseract.

[1] http://superuser.com/questions/460598/is-there-any-way-to-get-the-windows-cmd-shell-to-expand-wildcard-paths
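
Where the shell does not expand wildcards, the expansion can be done before calling the tool. Here is a minimal Python sketch using the standard glob module. The name `expand_command` is an illustrative helper, not a Tesseract feature, and it only builds the argument list.

```python
import glob
import os


def expand_command(tool, pattern, directory="."):
    """Return [tool, file1, file2, ...] with the wildcard expanded by Python.

    Illustrative helper: the result could be run with
    subprocess.run(cmd, check=True).
    """
    matches = sorted(glob.glob(os.path.join(directory, pattern)))
    if not matches:
        raise FileNotFoundError(f"no files match {pattern!r} in {directory!r}")
    return [tool, *matches]
```

For example, `expand_command("cntraining", "*.tr")` gives the same argument list on any platform, regardless of how the shell treats `*.tr`.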

Zdenko

Quan Nguyen

Feb 28, 2014, 9:58:05 AM

to tesser...@googlegroups.com

I'm not sure having only samples of one character in a file is a good idea. I normally train with all the characters in the same image(s).

Check http://code.google.com/p/tesseract-ocr/downloads/detail?name=boxtiff-2.01.eng.tar.gz for an example.

Frederico Ferro Schuh

Feb 28, 2014, 11:20:11 PM

to tesser...@googlegroups.com

Do you think training one character per file is affecting my results?

I was doing it because I have thousands of samples, and makebox always makes too many wrong guesses. If I have all the digits on the same image, fixing the resulting 10k chars box file manually would take forever. On the other hand, fixing a single digit box file only takes a simple regexp replace operation on the resulting box file (one replace for digit 1, another replace for digit 2, and so on).
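
The replace operation amounts to something like the following Python sketch, assuming each box line has the form `symbol left bottom right top page`. The name `relabel_boxes` is illustrative only.

```python
import re

# Matches the symbol at the start of a
# "<symbol> <left> <bottom> <right> <top> <page>" line.
_SYMBOL = re.compile(r"^\S+(?=(?: \d+){5}[ \t\r]*$)", re.MULTILINE)


def relabel_boxes(box_text, digit):
    """Replace the symbol of every box line with the given digit.

    Coordinates and page numbers are left untouched, and lines that do
    not look like box entries are passed through unchanged.
    """
    return _SYMBOL.sub(lambda m: digit, box_text)
```

Applied to the box file for the digit 0, `relabel_boxes(text, "0")` overwrites whatever makebox guessed for each box.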

Also, the goal of my application is online OCR: recognizing single lines of handwritten digits as the user draws them. Would this affect the format of my sample image(s) as well?

Thanks,

Fred

Quan Nguyen

Mar 1, 2014, 8:02:41 AM

to tesser...@googlegroups.com

I would go by what is suggested by the training wiki:

"Don't make the mistake of grouping all the non-letters together. Make the text more realistic."

I think you can improve the result a little bit by merging your images into a multi-page TIFF and concatenating your box files (make sure the page numbers are correct). However, that still does not meet the suggestion stated above.
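
For the box files, concatenating means rewriting the last field of each line, the zero-based page number, to match the position of that image in the multi-page TIFF. Here is a minimal Python sketch under that assumption. The name `concat_boxes` is illustrative, and merging the TIFF itself is left to an image tool.

```python
def concat_boxes(box_texts):
    """Concatenate single-page box files into one multi-page box file.

    box_texts: box file contents, in the same order as the TIFF pages.
    The page field of each line is set to the zero-based page index;
    lines without a page field get one appended.
    """
    out = []
    for page, text in enumerate(box_texts):
        for line in text.splitlines():
            parts = line.split()
            if len(parts) < 5:
                continue  # skip blank or malformed lines
            out.append(" ".join(parts[:5] + [str(page)]))
    return "\n".join(out) + "\n" if out else ""
```

The order of `box_texts` has to match the page order of the merged TIFF, otherwise the boxes end up on the wrong pages.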

Frederico Ferro Schuh

Mar 2, 2014, 4:30:25 AM

to tesser...@googlegroups.com

I remember that part of the training wiki... and I wondered how it would apply to such a small subset of characters.

I only have 10 different digits... what kind of text am I supposed to write in the sample files, considering that my only valid inputs are sequences of digits? And the samples contain all those different handwritings from different people as well... should I separate different handwriting styles into different sample files instead of merging them all together, i.e. treat them like different fonts for the same language? (That would be extremely limiting, though, considering the current limit of 64 fonts per language.)
