Lessons, best practices, recommendations, strategies, hacks

Des Bw

Oct 21, 2023, 8:28:03 AM

to tesseract-ocr

There is no exhaustive user manual for training tesseract. We all start in the dark and pick up bits of information in different places to learn the ins and outs of tesseract.

It would be great if we could collectively write a better manual. Until then, we can collect here the observations, best practices, hacks and lessons we have accumulated in our adventures with tesseract.

I will start with some of my observations. I collected them by reading between the lines and from my own failed experiments:

1. Training from scratch is very difficult because tesseract requires an extensive data set. It looks like it requires over 300,000 text lines (a text file of around 26 MB).

https://github.com/tesseract-ocr/tesseract/issues/3909

Multiply that by the number of fonts you want to train, and the amount of data grows accordingly. That requires very powerful computers running for weeks or months.

So, for regular users, fine tuning or replacing the top layer(s) of an existing network are the most plausible options.

2. Best practice: do not make your text lines too long. The recommended number of words in a line is 10-12. Again, this is from the link above.
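
A minimal sketch of how a training text could be re-wrapped to follow this advice, assuming plain text with whitespace-separated words. The function name and the default limit of 12 words are illustrative choices, not part of tesseract or tesstrain:

```python
def rewrap(text, max_words=12):
    """Split text into lines of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


if __name__ == "__main__":
    sample = "word " * 30
    for line in rewrap(sample):
        # Show the word count next to each produced line.
        print(len(line.split()), line)
```

This simple version ignores sentence boundaries; it only caps the number of words per line.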

( ...to be continued)

Keith Smith

Oct 21, 2023, 11:18:06 AM

to tesser...@googlegroups.com

Thank you Des for your help in this community. It is greatly appreciated!

As one who is struggling, may I make a suggestion?

I have started a google doc here with a suggested format for a tutorial which would be very helpful to me and I think to others. It is editable by anyone with the link.

I'm glad to put in any work myself, but my guess is that there are things in the doc that could be filled without much effort by you or others.

If this is true, once the doc is filled out, the contents of the google doc could be submitted as a PR to the tesstrain repo.

Again, just a suggestion that I hope would be helpful to all.

Thanks,

Keith

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/bf0cd568-9b5b-4e42-be6e-6225ed6a3892n%40googlegroups.com.

Des Bw

Oct 21, 2023, 1:58:01 PM

to tesseract-ocr

That is a good start, dear Keith. Very good idea. We can contribute texts and ideas, and develop it into a booklet or "getting started guide" that adds explanatory comments, practical examples and elaborations to the official guide (which is very dense and incomplete).

- The tips and best practices can then be distributed across the tutorial/guide, as you have already started doing.

Des Bw

Oct 21, 2023, 3:45:31 PM

to tesseract-ocr

I have been experimenting with the text2image tool. Here are some of my observations so far:

'--strip_unrenderable_words=false': This parameter seems to be meant for removing words that contain characters a given font cannot render. But I am getting better results with the value false. Setting it to true removes more characters. Keeping it false prints a warning that 1 character has been dropped, but the overall number of characters removed is smaller (closer to the ground truth).

'--distort_image=true': For those of us who would like to use tesseract for OCRing scanned documents, distortion is unavoidable. Turning this feature ON trains the model to get used to distortion. It is turned OFF by default.

'--invert=false': Inverting the image (light text on a black background) is uncommon. So, among the distortion parameters, inversion is the least relevant for scanned documents. Keep this one set to false.

Another big mistake I made when I was training was setting the following:

'--char_spacing=1.0',

This puts extra space between the characters. That creates an artificially easy environment, so you get great results during training. But the final model will be less fit to recognize dense text.
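
To make the settings above concrete, here is a minimal sketch that only assembles a text2image command line with them. The helper name, the file paths and the font name are placeholders I made up for illustration; '--text', '--outputbase', '--font' and '--fonts_dir' are the usual text2image input/output flags, and the rest are the values discussed in this post:

```python
import shlex


def text2image_args(text_file, output_base, font, fonts_dir):
    """Build a text2image argument list with the settings discussed above.

    All four arguments are caller-supplied placeholders.
    """
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        "--strip_unrenderable_words=false",  # dropped fewer characters in my tests
        "--distort_image=true",              # simulate scan degradation
        "--invert=false",                    # no inverted (light-on-dark) lines
        # --char_spacing is deliberately left at its default.
    ]


if __name__ == "__main__":
    args = text2image_args("train.txt", "out/train", "Some Font", "/usr/share/fonts")
    print(shlex.join(args))
    # To actually render, pass the list to subprocess.run(args, check=True).
```

Printing the command first makes it easy to check what will be run before rendering a large data set.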

Des Bw

Oct 21, 2023, 3:47:54 PM

to tesseract-ocr

Another useful parameter to turn ON would have been perspective. But, that one is not working for me.

René JM Clais

Oct 22, 2023, 12:41:15 PM

to tesser...@googlegroups.com

Hi Keith,

foo.traineddata does not exist. Do you mean the traineddata I want to train, e.g. hye.traineddata?

In my case I need to add a new character to hye.traineddata.

It seems that I can do this using option 2!

But how? Which command should I use to run it, and what does this process mean?

Thank you for your help

Regards

René

Des Bw

Oct 22, 2023, 2:07:34 PM

to tesseract-ocr

I have updated the guide to explain how to train by cutting off the top layer. You can check it out. I hope it is helpful.

Keith Smith

Oct 23, 2023, 8:01:37 AM

to tesser...@googlegroups.com

Rene, the name “foo” is simply an example (or fictitious) font or language name. When training a new language or font, you should replace “foo” with the name of your language or font. The convention is to choose 3 letters, but that is not required. In fact, I have been training a font named “micr_e13b” and it technically works for me (though the accuracy isn’t good enough yet). Note the underscore character between sections of the name.

René JM Clais

Oct 24, 2023, 8:45:23 AM

to tesser...@googlegroups.com

I have made a first attempt at fine tuning. The script runs for a second and ends without any error message. Where can I find a log file?

Des Bw

Oct 24, 2023, 8:55:54 AM

to tesseract-ocr

You can add 'training >> data/lang.log &' to the end of your training (shell) script to get a log saved inside your data folder. You can also use 'DEBUG_INTERVAL=-1 training >> data/lang.log &'. This one prints more detailed information on the console and saves a short log inside the data folder. If you want everything displayed in the console saved to a log file, you can check out the methods listed here:

https://unix.stackexchange.com/questions/200637/save-all-the-terminal-output-to-a-file
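
One thing to keep in mind: '>>' on its own redirects only standard output, so anything the tools write to standard error still goes to the console and not to the file. As a rough sketch of one way to capture both streams while still seeing them live, a small Python wrapper could look like this. The function name and log path are my own placeholders, not part of tesstrain:

```python
import subprocess
import sys


def run_logged(cmd, log_path):
    """Run cmd, echo its output to the console and append it to log_path.

    stderr is merged into stdout so warnings and errors reach the log too.
    Returns the exit code of the command.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # live console output
            log.write(line)         # same line goes to the log file
        return proc.wait()
```

Assuming you train through the tesstrain Makefile, a call could look like run_logged(["make", "training", "MODEL_NAME=foo"], "data/foo.log"). A non-zero return value then also tells you that the run ended with an error, even if nothing obvious was printed.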

Des Bw

Oct 29, 2023, 8:18:32 AM

to tesseract-ocr

BCER is a lie:

(B)CER is an unrealistic measure of accuracy. It is a lie. I have said it a couple of times already. The BCER we get during training is nowhere close to the real accuracy of our model. I have had many occasions where my training reached a 0 error rate and stopped. But when I tested the output using independent evaluation tools, the best I could get was 95-97% accuracy on the synthetic data and 90-92% accuracy on actual scanned documents.

- So, we need to find a way to turn off the target_error_rate parameter, which stops the training when the model thinks it has achieved 0% error. Maybe we can assign a negative value to it. I am going to try that and see whether it turns it off.
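
For anyone who wants to run this kind of independent check themselves, here is a minimal sketch of a character error rate computed from the edit distance between a ground-truth line and the OCR output. This is a generic textbook CER, not the tools I used and not the BCER that the trainer reports; the function names are just illustrative:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr_output = "the qu1ck brown fx"
    print(f"CER: {cer(truth, ocr_output):.3f}")
```

Run it over lines the model never saw during training (ideally real scans), and average over many lines rather than trusting a single one.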

Des Bw

Oct 30, 2023, 10:46:10 AM

to tesseract-ocr

Another lesson I learned today: starting from a small number of iterations and slowly increasing it is a bad idea. We should all train by epochs.

Every interruption causes tesseract to restart from the beginning of the training data. Basically, the data that appears in the later parts might never be used for training.

Look at what Stefan said here: https://github.com/tesseract-ocr/tesseract/issues/3954
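
A back-of-the-envelope sketch of what training by epochs means in iterations. It assumes that one iteration consumes exactly one training line, so one epoch equals one pass over all lines; please treat that as my assumption and check it against the issue linked above. The function name is only illustrative:

```python
def iterations_for_epochs(num_training_lines, epochs):
    """Iterations needed to show every training line `epochs` times.

    Assumes one iteration consumes exactly one training line.
    """
    if num_training_lines <= 0 or epochs <= 0:
        raise ValueError("both arguments must be positive")
    return num_training_lines * epochs


if __name__ == "__main__":
    # 300,000 lines is the data-set size mentioned earlier in this thread.
    print(iterations_for_epochs(300_000, 10))
```

If the iteration limit is set lower than this number, the run stops before the later lines have been seen the intended number of times. As far as I can tell, the tesstrain README lists an EPOCHS variable that derives the limit from the number of training lines, which avoids doing this calculation by hand.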

Des Bw

Oct 31, 2023, 12:22:55 PM

to tesseract-ocr

Today's lesson: it is possible to disable TARGET_ERROR_RATE.

If you find your training stopping prematurely because it is hitting the target error rate, you can disable it and train by epochs (iterations) only.