Fine-tuning via tesstrain repo gives me poorer results than built-in eng model


Gradalajage

Sep 19, 2020, 5:31:30 AM
to tesseract-ocr
I have 395 PNG files depicting numbers with commas. The images are 130x54 pixels, with black text on a white background. Here is an example of an image showing the number 638,997:
638,997.png
I would like to use Tesseract to perform reliable OCR on these images and others like them. Out of the box, Tesseract correctly extracts text for 344 of these images and fails in some manner on 51 of them. I am using the following command line for each image:

> tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' {filename}.png out

I run that command on each image, substituting {filename} as needed. Each invocation of that command produces the following output:

Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.

344/395 is an 87% success rate, but I want to try for better. So, I am attempting to "fine-tune" Tesseract by running through the instructions for tesstrain at https://github.com/tesseract-ocr/tesstrain. Each of my PNG files has a file name that indicates its ground truth, and I have a little script that generates ground-truth TXT files from the PNG file names. I have chosen "swtor" as the model name. I can then run this command from the tesstrain root directory:

$ make training MODEL_NAME=swtor START_MODEL=eng PSM=7

This command runs, prints lots of info, and eventually produces the following output, just before it ends:

Finished! Error rate = 2.739
lstmtraining \
--stop_training \
--continue_from data/swtor/checkpoints/swtor_checkpoint \
--traineddata data/swtor/swtor.traineddata \
--model_output data/swtor.traineddata
Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...

I can then take the resulting swtor.traineddata file, copy it to my tessdata directory, and then re-run my experiment from earlier, with a command line that looks like this:

> tesseract -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' {filename}.png out

With the new swtor model, Tesseract correctly extracts text for 64 of these images, and fails in some manner on 331 of them.
64/395 is a 16% success rate, down from 87% for the eng model.
So the swtor model I trained does far worse, which I find surprising. I suspect I am doing something wrong, but I do not know what steps to take next to troubleshoot this, so I'm hoping to post here and get help from someone knowledgeable about the training process.

I can post the contents of the "data" directory in my tesstrain repo root directory if that is helpful for anyone (I'd have to remove the checkpoints).

Shree Devi Kumar

Sep 19, 2020, 9:08:31 AM
to tesseract-ocr
Please share your training data so that we can test. Thanks.



Gradalajage

Sep 19, 2020, 12:49:51 PM
to tesseract-ocr
Absolutely! The following Google Drive link is for the "training_data.7z" archive for the training data itself:

Also, here is a link to "data.7z" which contains my "./tesstrain/data" directory contents, which includes that same training data in the "swtor-ground-truth" directory, in case it is helpful at all:

One more thing: these training images are the result of some image processing I have applied with the hopes that it helps my OCR chances. I could provide samples of the source images in case that's helpful too (perhaps my image processing techniques are hurting more than helping, or perhaps better techniques would become obvious to someone experienced if they saw my source images).

Thank you so much!

P.S

Shree Devi Kumar

Sep 19, 2020, 4:46:47 PM
to tesseract-ocr
> Each of my PNG files have file names that indicate ground truth, and I have a little script that generates ground-truth TXT files from the PNG file names.

Please review your script. I notice a number of file names ending with -2. The corresponding gt.txt files also contain the -2, while the image shows only the number.

Example files attached.

99,999-2.gt.txt
99,999-2.png

Shree Devi Kumar

Sep 19, 2020, 5:01:50 PM
to tesseract-ocr
You will get better results when you fix your training data (I deleted all file names ending in -2 and -3).

Mean rms=0.145%, delta=0.046%, train=0.214%(1.01%), skip ratio=0%
Iteration 396: GROUND  TRUTH : 5,500,000
File data/swtor-ground-truth/5,500,000.lstmf line 0 (Perfect):
Mean rms=0.145%, delta=0.046%, train=0.214%(1.008%), skip ratio=0%
Iteration 397: GROUND  TRUTH : 2,000,000
File data/swtor-ground-truth/2,000,000.lstmf line 0 (Perfect):
Mean rms=0.145%, delta=0.045%, train=0.213%(1.005%), skip ratio=0%
Iteration 398: GROUND  TRUTH : 6,435
File data/swtor-ground-truth/6,435.lstmf line 0 (Perfect):
Mean rms=0.145%, delta=0.045%, train=0.213%(1.003%), skip ratio=0%
Iteration 399: GROUND  TRUTH : 3,750,000
File data/swtor-ground-truth/3,750,000.lstmf line 0 (Perfect):
Mean rms=0.144%, delta=0.045%, train=0.212%(1%), skip ratio=0%
2 Percent improvement time=4, best error was 100 @ 0
At iteration 4/400/400, Mean rms=0.144%, delta=0.045%, char train=0.212%, word train=1%, skip ratio=0%,  New best char error = 0.212 wrote best model:data/swtor/checkpoints/swtor_0.212_4_400.checkpoint wrote checkpoint.

Iteration 400: GROUND  TRUTH : 5,222,100
File data/swtor-ground-truth/5,222,100.lstmf line 0 (Perfect):
Mean rms=0.144%, delta=0.045%, train=0.212%(0.998%), skip ratio=0%
Iteration 401: GROUND  TRUTH : 696,969
File data/swtor-ground-truth/696,969.lstmf line 0 (Perfect):
Mean rms=0.144%, delta=0.045%, train=0.211%(0.995%), skip ratio=0%
Iteration 402: GROUND  TRUTH : 71,000,000
File data/swtor-ground-truth/71,000,000.lstmf line 0 (Perfect):
Mean rms=0.144%, delta=0.045%, train=0.211%(0.993%), skip ratio=0%
Iteration 403: GROUND  TRUTH : 64,500
File data/swtor-ground-truth/64,500.lstmf line 0 (Perfect):
Mean rms=0.144%, delta=0.045%, train=0.21%(0.99%), skip ratio=0%
Iteration 404: GROUND  TRUTH : 39,500,000
File data/swtor-ground-truth/39,500,000.lstmf line 0 (Perfect):
Mean rms=0.144%, delta=0.045%, train=0.21%(0.988%), skip ratio=0%
Iteration 405: GROUND  TRUTH : 4,500,000
File data/swtor-ground-truth/4,500,000.lstmf line 0 (Perfect):
Mean rms=0.143%, delta=0.045%, train=0.209%(0.985%), skip ratio=0%
Iteration 406: GROUND  TRUTH : 1,450,000


Grad

Sep 19, 2020, 5:12:19 PM
to tesseract-ocr
If it turns out to be that simple, I will feel really relieved and really stupid at the same time. I cannot believe I didn't catch this before posting. Thank you for taking a look; I'll fix my ground-truth file creator script and try again.
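
For what it's worth, here is roughly what the fixed generator will look like -- a minimal sketch, assuming the images sit in the tesstrain ground-truth directory (the path and the "-2"/"-3" suffix handling are specific to my setup):

    import os
    import re

    ground_truth_dir = "data/swtor-ground-truth"  # example path

    for file_name in os.listdir(ground_truth_dir):
        if not file_name.endswith(".png"):
            continue
        base = file_name[:-len(".png")]
        # Strip a trailing "-2", "-3", ... that only exists to keep file names
        # unique; the actual ground truth is just the number itself.
        ground_truth = re.sub(r"-\d+$", "", base)
        with open(os.path.join(ground_truth_dir, base + ".gt.txt"), "w", encoding="utf-8") as f:
            f.write(ground_truth + "\n")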

Grad

Sep 19, 2020, 5:14:01 PM
to tesseract-ocr
What matters is the contents of each ground truth file, not the filename, correct? (so long as the ground truth filename matches the PNG image filename, not counting the extension)

Grad

Sep 19, 2020, 6:54:48 PM
to tesseract-ocr
I have fixed my ground-truth file creator script to eliminate the badly-formed numbers and have re-run my experiment. Unfortunately, I am still seeing really poor results (12 pass, 383 fail), even though the training error rates appear to be much smaller this time around:

At iteration 509/10000/10000, Mean rms=0.184%, delta=0.055%, char train=0.344%, word train=2.5%, skip ratio=0%,  New worst char error = 0.344 wrote checkpoint.

Finished! Error rate = 0.308

lstmtraining \
--stop_training \
--continue_from data/swtor/checkpoints/swtor_checkpoint \
--traineddata data/swtor/swtor.traineddata \
--model_output data/swtor.traineddata
Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...

Full log of "make training" is attached.

When I run Tesseract using the "eng" and "swtor" models on the training images, I see the following types of results:

"eng" model results for 638,997.png:

> tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out

Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
> cat .\out.txt
638,997

"swtor" model results for 638,997.png:

> tesseract --tessdata-dir -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
Failed to load any lstm-specific dictionaries for lang swtor!!

Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
> cat .\out.txt
3,9,997

In general, the digits are recognized less accurately, and there is a proliferation of spurious commas.

Do any other ideas come to mind? I appreciate your help Shree!

make_training_log1.txt

Shree Devi Kumar

Sep 20, 2020, 4:09:02 PM
to tesseract-ocr
Resize your images so that text is 36 pixels high. That's what is used for eng models.

Since you are fine-tuning, limit the number of iterations to 400 or so (not 10000, which is the default) -- see the example command below.

Use a debug_level of -1 during training so that you can see the details per iteration.
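
For example, something like this (MAX_ITERATIONS is the tesstrain Makefile variable for the iteration limit; the exact value is up to you):

$ make training MODEL_NAME=swtor START_MODEL=eng PSM=7 MAX_ITERATIONS=400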




Grad

Sep 27, 2020, 3:21:17 PM
to tesseract-ocr
@shree thank you for the advice, it was helpful. I managed to get everything working satisfactorily: even after adding more images to my set, I now get perfect results (446 pass, 0 fail)! Furthermore, these results come from the built-in "eng" model; I ended up not needing to re-train or fine-tune Tesseract at all. The trick was finding the right sequence of image processing steps to prepare my source images for input to Tesseract OCR.

I have battled with this problem since your response and have come close to giving up more than once, thinking that perhaps Tesseract simply isn't up to the task. But the limited character set and the uniformity of the character appearances kept me going -- there just had to be a way to make this work. I'd love to document all the things I tried, and what results they gave, but there is just too much. A quick summary will have to suffice.

What got me close but ultimately didn't work
  • Resized my images so the text was 36px in height. I did this in Python using OpenCV and (wrongly I think) chose the cv2.INTER_AREA interpolation method.
  • Tried different values for MAX_ITERATIONS in tesstrain's Makefile, and got varied results but nothing perfect.
  • Downloaded https://github.com/Shreeshrii/tessdata_shreetest/blob/master/digits_comma.traineddata and used it for the START_MODEL of tesstrain's Makefile (also had to set TESSDATA for the Makefile)
  • Between these things, the best result I ever got was something like this (input on left, OCR output on right):
    21,485,000 -> 21,483,000
    21,875,000 -> 21,873,000
    24,995 -> 24,999
    5,450,000 -> 9,450,000
    591,958 -> 9591,958
    851 -> 8571
    851 -> 8571
    Pass: 428
    Fail: 7
  • So, as you can see: close, but still some pretty unforgivable errors (unforgivable to me due to the nature of my application -- these numbers need to be perfect)
What ultimately did work
  • In an act of desperation, and following a bit of a hunch, I abandoned trying to train/re-train/fine-tune, and just focused on getting perfect OCR on one of the images where it failed using "eng" model
    • I chose this file 1,000,000.png, which produced an empty string when run through Tesseract
  • I used GIMP on Windows and opened 1,000,000.png and began adjusting/tweaking/filtering the image in various ways, each time re-trying the OCR to see if the result changed. Using GIMP was crucial because it allowed me to iterate through trying different image processing techniques using a GUI, which was much quicker than doing the same thing in Python using OpenCV.
  • Once I found what worked, I implemented it in Python. The magic steps ended up being (a consolidated sketch of the whole pipeline follows this list):
    1. Read the source image as color:
      image_to_ocr = cv2.imread(raw_image_file_name, cv2.IMREAD_COLOR)
    2. Use only the green channel of the source image. The numbers in my source images are mostly green tinted and I thought maybe this would help. This results in a grayscale image with a dark background and white text:
      b, image_to_ocr, r = cv2.split(image_to_ocr)
    3. Enlarge the image by 2x. This resulted in text ~20px in height, which turned out to be sufficient, and the resizing (enlarging, in my case) itself was an absolute must-have. I also found using cv2.INTER_CUBIC instead of cv2.INTER_AREA to be crucial here. I'm really thankful I posted here, and thankful to @shree for that little nugget of insight.
      image_to_ocr = cv2.resize(image_to_ocr, (image_to_ocr.shape[1] * 2, image_to_ocr.shape[0] * 2), interpolation = cv2.INTER_CUBIC)
    4. Invert the image so that the background is white and the text is black. I am not sure if this step was necessary.
      image_to_ocr = cv2.bitwise_not(image_to_ocr)
  • With these steps, 1,000,000.png OCR'd perfectly
  • I then re-ran my script to check accuracy on all 400+ source images, and got the perfect result. I was so nervous while the script was running; it prints errors as it goes, and so many times before I had run it with eager anticipation that I'd finally gotten everything right, only to have an error appear. This time... it ran... seconds go by... more seconds go by... no errors... I can't look, OMG... check back in 30 seconds: 446 pass, 0 fail. I literally stood up and whooped and hollered with arms raised.
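
For anyone who lands here later, here is the whole preprocessing pipeline in one place -- a minimal sketch of what I described above; the file names are just examples, and the processed image is then fed to the same tesseract command line I posted earlier:

    import cv2

    raw_image_file_name = "1,000,000.png"            # example source image
    processed_file_name = "1,000,000.processed.png"  # example output fed to Tesseract

    # 1. Read the source image as color (OpenCV loads it as BGR).
    image_to_ocr = cv2.imread(raw_image_file_name, cv2.IMREAD_COLOR)

    # 2. Keep only the green channel; cv2.split returns (b, g, r) for a BGR image.
    b, image_to_ocr, r = cv2.split(image_to_ocr)

    # 3. Enlarge 2x with cubic interpolation (INTER_AREA gave me worse results).
    image_to_ocr = cv2.resize(
        image_to_ocr,
        (image_to_ocr.shape[1] * 2, image_to_ocr.shape[0] * 2),
        interpolation=cv2.INTER_CUBIC,
    )

    # 4. Invert so the background is white and the text is black.
    image_to_ocr = cv2.bitwise_not(image_to_ocr)

    cv2.imwrite(processed_file_name, image_to_ocr)

and then:

> tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 1,000,000.processed.png out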

Shree Devi Kumar

Sep 27, 2020, 5:23:21 PM
to tesseract-ocr
Thank you for sharing the results of your fine-tuning trials and for reporting that you got better results with the official traineddata after pre-processing the images.

Hope your notes will help other users with similar questions.

Fazle Rabbi

Oct 10, 2020, 4:01:15 AM
to tesseract-ocr
Hi. I have a similar goal in mind: fine-tuning the 'ben' traineddata on the pictures I am working with. The pictures are IDs, so people's names have to be recognized correctly. I tried the (line image, ground truth) way of fine-tuning the traineddata with a very small number of images. The result was not good -- I was somewhat surprised, as I expected at least the performance of the default model. My question is: if I have a substantial number of images and process them to produce line images and ground truth, will that help me improve the detection?

Shree Devi Kumar

Oct 10, 2020, 7:35:26 AM
to tesseract-ocr
What command did you use?

Difficult to help without seeing what training data you used.

Fazle Rabbi

Oct 11, 2020, 3:16:13 AM
to tesseract-ocr
I did the process manually for 5-6 images; I have attached some samples of the line images and ground truth.
Then I ran:
>> make training MODEL_NAME=<model name> START_MODEL=ben TESSDATA=<path to tessdata_best>
The resulting <model name>.traineddata file seems to have no connection with the original 'ben' file; the OCR produces unreadable text.



Files.zip

Shree Devi Kumar

Oct 11, 2020, 2:50:08 PM
to tesseract-ocr
Tesseract will make a checkpoint, if needed, every 100 iterations, so I suggest a minimum of 50-100 line images to test fine-tuning. Also, one of your image samples has a lot of noise on the right side; crop away all the extra parts. Also, for `ben` you should choose the Indic language option in tesstrain.
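
If I remember the tesstrain Makefile correctly, that option is set with a variable along these lines (please double-check the exact variable name in your copy of tesstrain):

>> make training MODEL_NAME=<model name> START_MODEL=ben TESSDATA=<path to tessdata_best> LANG_TYPE=Indic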
