Losing accuracy when training Tesseract on fonts it is already trained on


Thomas Bruno

Aug 18, 2014, 5:17:19 PM
to tesser...@googlegroups.com
Hello Everyone,

Where can I find the box/tif combo for the eng.traineddata that Tesseract 3.02 provides for download?

Since I cannot append new font information to compiled training data, I have to create completely new data, even for the fonts that Tesseract's prebuilt training already covers. When I do this, accuracy drops significantly compared with the provided training data (to around 50%, down from above 90%). If I had Tesseract's source box/tif files, then adding my fonts should at worst give nearly the same accuracy as the project-provided files for documents that contain the default fonts. Providing the box/tif files originally used would seem to be the easiest way to keep users' accuracy up while they add new fonts.

I have found box/tif files for Tesseract 2.0 but not 3.0. When I use the 2.0 box/tif files for the provided fonts (Arial, Courier New, etc.), I lose significant accuracy.

Nick White

Aug 20, 2014, 11:36:13 AM
to tesser...@googlegroups.com
Hi Thomas,

On Mon, Aug 18, 2014 at 02:17:19PM -0700, Thomas Bruno wrote:
> Where can I find the box/tif combo for the eng.traineddata that Tesseract 3.02
> provides for download?

The tif/box files used to create the eng.traineddata for 3.02 are
not available, and are very unlikely to be made so, because they
were automatically generated using a program that was specific to
Google's infrastructure.

The good news is that the training image generation program has
recently been added to the code repository[0] and works with regular
Linux distributions, as well as most[1] of the information needed to
recreate the training tif/box files[2]. If you can get that working,
you can just add your own training tif/box files alongside it.

I plan to update the TrainingTesseract3 wiki page soon to make this
clearer, but haven't done so yet.

An alternative option would just be to use your new training
alongside the official eng.traineddata, and call it something else,
so you call tesseract like this:
tesseract -l eng+mycustomeng image.png outbase
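
For this to work, both eng.traineddata and the custom traineddata file need to be in the tessdata directory that tesseract reads. Below is a rough Python sketch of that check, with the command built in the same form as above. The function names and the tessdata path are illustrative assumptions, not anything tesseract itself provides.

```python
import os


def missing_traineddata(tessdata_dir, languages):
    """Return the languages that have no <lang>.traineddata in tessdata_dir."""
    return [
        lang for lang in languages
        if not os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))
    ]


def tesseract_command(image, outbase, languages):
    """Build the argument list for: tesseract -l lang1+lang2 image outbase."""
    return ["tesseract", "-l", "+".join(languages), image, outbase]
```

If missing_traineddata(...) returns an empty list for ["eng", "mycustomeng"], the list from tesseract_command("image.png", "outbase", ["eng", "mycustomeng"]) can be passed to subprocess.run.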

Nick

0. See the training/text2image tool in the main code repository
1. https://groups.google.com/forum/#!topic/tesseract-dev/VhUk9IxFt8Y
2. See the langdata repository

Thomas Bruno

Aug 22, 2014, 3:42:21 PM
to tesser...@googlegroups.com


> The good news is that the training image generation program has
> recently been added to the code repository[0] and works with regular
> Linux distributions, as well as most[1] of the information needed to
> recreate the training tif/box files[2]. If you can get that working,
> you can just add your own training tif/box files alongside it.
>
> I plan to update the TrainingTesseract3 wiki page soon to make this
> clearer, but haven't done so yet.


Is this common when training from text2image output?

APPLY_BOXES: boxfile line 5364/748 ((1488,893),(1532,6)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 5365/1285 ((1494,1418),(1532,6)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 5366/1552 ((1495,1626),(1529,6)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 5367/1708 ((1494,1784),(1531,6)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 5368/1970 ((1484,2101),(1532,6)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 5369/2493 ((1494,2625),(1532,6)): FAILURE! Couldn't find a matching blob



It seems all of my files are filled with these failures when training.

Nick White

Aug 25, 2014, 9:00:30 AM
to tesser...@googlegroups.com
On Fri, Aug 22, 2014 at 12:42:21PM -0700, Thomas Bruno wrote:
> Is this common when training from text2image output?
>
>
> APPLY_BOXES: boxfile line 5364/748 ((1488,893),(1532,6)): FAILURE! Couldn't
> find a matching blob
>
> FAIL!

Yes, there will be some of these. Check that the proportion of
failing to non-failing blobs is acceptable, and if it is not, look at
the char_spacing argument for text2image.
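
One way to check that proportion is to count the failure lines in the training output and divide by the number of boxes. The following is a minimal Python sketch, not part of Tesseract. It assumes the log lines look like the ones quoted above, that the first number after "boxfile line" is the line number in the box file, and that each non-empty box file line is one box. The function names are made up for illustration.

```python
import re

# Assumed log format, taken from the output quoted in this thread:
# APPLY_BOXES: boxfile line 5364/748 ((1488,893),(1532,6)): FAILURE! Couldn't find a matching blob
FAILURE_RE = re.compile(r"APPLY_BOXES: boxfile line (\d+)/\S+ .*FAILURE!")


def failed_box_lines(log_text):
    """Return the set of box file line numbers reported as failures."""
    failed = set()
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            failed.add(int(match.group(1)))
    return failed


def count_boxes(box_text):
    """Count non-empty lines in a box file (assumed: one box per line)."""
    return sum(1 for line in box_text.splitlines() if line.strip())


def failure_ratio(log_text, box_text):
    """Fraction of boxes for which no matching blob was found."""
    total = count_boxes(box_text)
    if total == 0:
        raise ValueError("box file contains no boxes")
    return len(failed_box_lines(log_text)) / total
```

With the training output saved to a file, failure_ratio(open(log_path).read(), open(box_path).read()) gives a single number to compare between runs; log_path and box_path are placeholders for your own files.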

Thomas Bruno

Aug 25, 2014, 10:26:35 PM
to tesser...@googlegroups.com
Of roughly 2000 characters, about a quarter fail this way. I've tried char_spacing values from 1 all the way up to 10, and it doesn't seem to matter. When we print the same text with 1.25 spacing and make the TIFF manually, we get only 5-20 failures. Is there something I could be missing here? Is training data required for this step?

training/text2image --text=trainingText.txt --outputbase=eng.courier.exp0 --font='Courier New' --fonts_dir=/Library/Fonts/ --ptsize=14 --char_spacing=2.5 --degrade_image=0
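
To compare spacings systematically, the same command can be generated for a range of char_spacing values, each with its own output base, so the resulting tif/box pairs can be trained and compared separately. This is a minimal Python sketch that uses only the flags from the command above. The helper names and the exp-numbering scheme are illustrative assumptions.

```python
def text2image_command(text, outputbase, font, fonts_dir, ptsize, char_spacing):
    """Build the text2image argument list with the flags used in this thread."""
    return [
        "training/text2image",
        "--text=" + text,
        "--outputbase=" + outputbase,
        "--font=" + font,  # no shell quoting needed in an argument list
        "--fonts_dir=" + fonts_dir,
        "--ptsize=" + str(ptsize),
        "--char_spacing=" + str(char_spacing),
        "--degrade_image=0",
    ]


def spacing_sweep(spacings):
    """Return one command per spacing, writing to eng.courier.exp0, exp1, ..."""
    commands = []
    for index, spacing in enumerate(spacings):
        outputbase = "eng.courier.exp%d" % index
        commands.append(
            text2image_command(
                "trainingText.txt", outputbase, "Courier New",
                "/Library/Fonts/", 14, spacing,
            )
        )
    return commands
```

Each list returned by spacing_sweep([1.0, 2.5, 5.0, 10.0]) can be passed to subprocess.run, after which the number of APPLY_BOXES failures per output base can be compared.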


 Tom