fine tuning from traineddata_best


hmaster

Apr 3, 2020, 7:10:17 AM4/3/20
to tesseract-ocr
Hello,

I am trying to improve accuracy for my use case by fine-tuning. Currently I'm getting 80-90% accuracy on my scanned images, and around 60% for images taken with a phone.
I'm running on a Jetson Nano, using:
```
tesseract 4.1.1-rc2-21-gf4ef
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
```

I'm training on a single image for now, just to understand the mechanism and learn how it works.
As an example I'm using a scanned receipt at 600dpi; identify (from ImageMagick) reports it as 1696x3930.

I'm a bit confused by this, because the training script keeps running and the error rate keeps dropping, so I can't tell when to stop.
I've read the tutorials, the examples, and the scripts, and it's all quite overwhelming for now; I've been at it for about 2-3 weeks.

There are a couple of things that are still unclear to me, and I have some questions:

1. Do I need to create single-line images from each image I have? (~3000)
2. Would it help if I create ground-truth text files, and should I create them for the entire image or only for a single line? (That is, must I have tiff, box and ground-truth files for each image?)
3. Some of the words in my images are not found in eng.training_files.txt; would it speed things up/help if I add them?
4. Is there a way to do fine-tuning with my own images and my own eng.training_files.txt data, without running tesstrain.sh?

I could not find details about how to train/fine-tune with my own tif/box files. Meaning, I have created a folder with my data and passed it to tesstrain.sh via my_box_tiff_dir, but from what I can tell it's not using those files, as it creates synthetic data instead.
As said above, it's unclear to me whether I also need to generate the ground-truth data, whether I still need to fiddle with/fix the box files, etc.
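
For reference, a typical tesstrain.sh invocation looks roughly like the sketch below. The paths are placeholders and the exact flag set varies between versions, so check tesstrain.sh --help on your checkout:

```shell
# Hypothetical sketch: generate synthetic LSTM line data for eng.
# ~/tessdata_best and ~/train_output are placeholder paths.
src/training/tesstrain.sh \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --tessdata_dir ~/tessdata_best \
  --fonts_dir /usr/share/fonts \
  --output_dir ~/train_output
```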

Sorry if I've asked too many questions; I've invested a lot of time in this and I'm not sure where exactly I'm going wrong.

I've followed the steps in a few of the questions posted in this group, and I am getting decent results; however, they are not as good as using traineddata_best on its own.

Steps I've done were:

Method 1
1. create box files via lstmbox and fix any mistakes - tesseract img.tif img --dpi 600 lstmbox
2. extract the lstm model from eng.traineddata_best
3. run lstmtraining for fine-tuning - lstmtraining --continue_from ...
4. generate eng.traineddata - lstmtraining --stop_training ...
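
For context, these four steps can be sketched as a command sequence. Filenames, the checkpoint name, and the iteration count are placeholders:

```shell
# 1. Generate an LSTM box file, then fix any mistakes by hand.
tesseract img.tif img --dpi 600 lstmbox

# 2. Extract the LSTM model from the best traineddata.
combine_tessdata -e eng.traineddata eng.lstm

# 3. Fine-tune; list.txt is a text file listing the .lstmf files, one per line.
lstmtraining \
  --continue_from eng.lstm \
  --traineddata eng.traineddata \
  --model_output finetuned \
  --train_listfile list.txt \
  --max_iterations 400

# 4. Pack the final checkpoint back into a traineddata file.
lstmtraining --stop_training \
  --continue_from finetuned_checkpoint \
  --traineddata eng.traineddata \
  --model_output eng_finetuned.traineddata
```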

Method 2
1. create box files via lstmbox and fix any mistakes - tesseract img.tif img --dpi 600 lstmbox
2. create lstmf files - tesseract img.tif img --dpi 600 lstm.train
3. extract unicharset - unicharset_extractor *.box
4. shapeclustering -F font_properties -U unicharset *.tr
5. mftraining -F font_properties -U unicharset -O eng.unicharset *.tr
6. cntraining *.tr
7. rename inttemp, normproto, pffmtable, shapetable
8. combine_tessdata eng.
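
Note that as written, step 2 (lstm.train) would produce .lstmf files, not the .tr files that steps 4-6 expect. The legacy flow that actually yields .tr files looks roughly like this sketch, shown for comparison only:

```shell
# Legacy (non-LSTM) training flow: box.train emits .tr feature files.
tesseract img.tif img --dpi 600 box.train
unicharset_extractor img.box
shapeclustering -F font_properties -U unicharset img.tr
mftraining -F font_properties -U unicharset -O eng.unicharset img.tr
cntraining img.tr
# Rename the outputs with the language prefix, then combine:
for f in inttemp normproto pffmtable shapetable; do mv "$f" "eng.$f"; done
combine_tessdata eng.
```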

Thank you for your support and help with my endeavor.

Shree Devi Kumar

Apr 3, 2020, 7:24:00 AM4/3/20
to tesseract-ocr
As per the info given by Ray Smith, lead developer of tesseract: if you just need to fine-tune for a new font face, use the approach from the "Fine Tune for Impact" tutorial.

His example renders the training text from the langdata repo (approx. 80 lines) with the new font, generates lstmf files, and then runs lstmtraining on them for about 400 iterations.
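
Roughly sketched, that recipe looks like this. The font name and file names are placeholders:

```shell
# Render ~80 lines of langdata training text with the target font.
text2image --text eng.training_text \
  --outputbase eng.MyFont.exp0 \
  --font 'My Font' \
  --fonts_dir /usr/share/fonts

# Create the lstmf file from the rendered tif/box pair.
tesseract eng.MyFont.exp0.tif eng.MyFont.exp0 --psm 6 lstm.train

# Fine-tune from the extracted eng.lstm for ~400 iterations.
echo eng.MyFont.exp0.lstmf > train_list.txt
lstmtraining --continue_from eng.lstm \
  --traineddata eng.traineddata \
  --model_output impact_tune \
  --train_listfile train_list.txt \
  --max_iterations 400
```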

Using too few lines or too many iterations will lead to suboptimal results.

You can whitelist only digits to further improve your results.
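
For example (note: character whitelisting with the LSTM engine only works in newer builds, so verify it on your version):

```shell
# Restrict recognition to digits via a whitelist.
tesseract receipt.tif out -c tessedit_char_whitelist=0123456789
```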

The above info is for LSTM training, the neural-network-based engine. That is the only engine that supports fine-tuning.

Your second approach is for the legacy engine, which has no option for fine-tuning.

You can see shreeshrii/tess4training repo for my replication of the tesstutorials by Ray.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a2a43c7e-c658-4d22-af1c-32dbd1d5b2f4%40googlegroups.com.

Shree Devi Kumar

Apr 3, 2020, 7:27:21 AM4/3/20
to tesseract-ocr
There are alternate approaches to training.

tesstrain.sh in tesseract repo works on training text and fonts, creating synthetic training data as multi-page tifs.

The tesstrain repo uses a Makefile for training from images with their corresponding ground truth.

For fine-tuning for a font, both can work. Both will also be quite fast to try, as you only need to run 400 iterations.
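
With the tesstrain repo, a fine-tuning run looks roughly like the sketch below. MODEL_NAME and the paths are placeholders; check the repo's README for the exact variable names in your checkout:

```shell
# Ground truth lives in data/receipt-ground-truth/ as image + .gt.txt pairs.
make training \
  MODEL_NAME=receipt \
  START_MODEL=eng \
  TESSDATA=~/tessdata_best \
  MAX_ITERATIONS=400
```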

hmaster

Apr 3, 2020, 11:48:52 AM4/3/20
to tesseract-ocr

  1. So essentially, I need to create a box file and a ground-truth file for each image I have, and run it with the tesstrain repo. Which doesn't work....
  2. That's what I understood from the README as well.
  3. Unfortunately, I've tried that already and have not gotten very far with it either.
  4. The documentation and examples are thin on explanation, which makes this very demanding, as can be seen from the sheer number of questions on how to train and how to use the tools.

  5. I've spent around 200 hours on this tool so far, and I am no closer to what I need than when I started.
  6. Some repos use lstmbox/lstm.train, some use makebox/box.train, and all of them fail at one point or another when following the examples.
  7. Many of the tutorials and explanations are diluted by the sheer number of versions and differences in how tesseract works.

Lorenzo Bolzani

Apr 3, 2020, 1:49:55 PM4/3/20
to tesser...@googlegroups.com
Hi,
tesstrain (https://github.com/tesseract-ocr/tesstrain) works very well. It is not the same thing as tesstrain.sh; it was called ocr-d before.

tesstrain works only with single lines. You need only the line images and the corresponding gt.txt files; it will create the tiff, box, lstmf, unicharset and other files for you. It will even download the data you need.

About your questions:

1. Yes. If you have images with multiple lines, I think there are tools around to split them automatically; search this forum.
2. Single lines.
3. The training does not use the vocabulary at all.
4. I recommend tesstrain (without .sh, the one with the Makefile).
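
The expected ground-truth layout is one line image plus one .gt.txt transcription sharing the same basename, for example (directory and file names are placeholders):

```shell
# Each training line is an image/transcription pair sharing a basename.
mkdir -p data/receipt-ground-truth
printf 'TOTAL 12.34\n' > data/receipt-ground-truth/line_0001.gt.txt
# The matching image would be data/receipt-ground-truth/line_0001.tif (or .png).
ls data/receipt-ground-truth
```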


The text must be cropped tight: a couple of pixels per side, or none; see what works best. Image height should be 35 to 48 pixels (try a few values and see what works best for your data). There is no need to fully threshold the images, but you want strong contrast. See the attached file.
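
To get line images into that size range, something like ImageMagick can be used. This is only a sketch; the border width and target height are values to experiment with:

```shell
# Trim surrounding whitespace, add a 2px white border, scale to 48px height.
convert line_raw.png -trim -bordercolor white -border 2 -resize x48 line_0001.png
```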



Bye

Lorenzo

elett-fronte_ocr_9.png