
Would training Tesseract with different binarization filters affect eng.traineddata?


Mitya

Mar 27, 2025, 4:28:12 PM
to tesseract-ocr

I am working with Tesseract OCR and want to experiment with different binarization methods, such as Otsu's thresholding and other custom filters, to improve text recognition accuracy.

However, I am concerned that training with these different preprocessing techniques might modify or overwrite eng.traineddata, which I want to keep intact.

My questions are:

1. Does training a new model affect the existing eng.traineddata file?
2. How can I safely train Tesseract with new filters without modifying the default English model?
3. Is there a recommended approach to train Tesseract on preprocessed images while keeping eng.traineddata unchanged?

What I've tried:

I updated my current eng_new.traineddata with three samples, each with a different filter applied: Otsu, Otsu_Tresh_Binary, and Otsu_Tresh_Binary_Inv. After the first 1000 iterations the target traineddata differed from the initial one, but it gave slightly worse results.

lstmtraining \
  --continue_from /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
  --traineddata /home/j/trainingCurrentEng/data/eng.traineddata \
  --train_listfile /home/j/trainingCurrentEng/data/list.train \
  --eval_listfile /home/j/trainingCurrentEng/data/list.eval \
  --model_output /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
  --learning_rate 0.0001 \
  --debug_interval 10 \
  --max_iterations 600

tesseract otsu_tresh_binary_inv.tiff output_text -l eng --tessdata-dir /home/j/trainingCurrentEng/data --psm 7

cat output_text.txt

Abcd123

tesseract otsu_tresh_binary_inv.tiff output_text_1 -l eng_trained --tessdata-dir /home/j/trainingCurrentEng/data --psm 7

cat output_text_1.txt

Abc

I would appreciate any guidance or best practices for training custom models without interfering with existing ones.


Mitya

Mar 28, 2025, 3:13:47 AM
to tesseract-ocr
To summarize: I'm asking whether it is a good idea to train an LSTM model on preprocessed images with filters such as Otsu, binary thresholding, and others applied. I also could not find a guideline on what a training sample and its corresponding image should look like. Should it be white text on a black background, or the reverse? Any pointers are much appreciated.


On Friday, March 28, 2025 at 03:28:12 UTC+7, Mitya wrote:

Lorenzo Bolzani

Mar 28, 2025, 11:08:08 AM
to tesser...@googlegroups.com
Hi Mitya,
Tesseract is trained on black text on a white background, so I think it is not a good idea to use inverted samples (it is usually quite simple to invert the source image if it is a negative).
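A minimal sketch of that inversion step, in plain Python on an 8-bit grayscale image stored as a list of rows. The function names and the "majority of pixels is background" heuristic are assumptions made for illustration and are not part of Tesseract.

```python
def invert(gray):
    """Invert an 8-bit grayscale image given as a list of pixel rows."""
    return [[255 - p for p in row] for row in gray]


def ensure_dark_on_light(gray, threshold=128):
    """Return a copy of the image with dark text on a light background.

    Heuristic (an assumption): the background covers more pixels than
    the text, so an image whose pixels are mostly dark is treated as a
    negative and inverted. The cutoff of 128 is an arbitrary choice.
    """
    pixels = [p for row in gray for p in row]
    dark = sum(1 for p in pixels if p < threshold)
    if dark > len(pixels) / 2:
        return invert(gray)
    return [row[:] for row in gray]
```

The heuristic can fail on images with very dense text or large dark regions, so a visual check of a few outputs is still worthwhile.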

All the tesseract models, the .traineddata files, are independent from each other so when you train a new model the base model is not affected.
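For anyone who wants to confirm this on their own setup, one option is to hash the base model before and after a training run. This is only a sketch: the helper name `sha256_of` is hypothetical, and the path in the comment is taken from the commands earlier in the thread.

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Intended use (path as it appears in the thread):
#   before = sha256_of("/home/j/trainingCurrentEng/data/eng.traineddata")
#   ... run lstmtraining ...
#   after = sha256_of("/home/j/trainingCurrentEng/data/eng.traineddata")
#   print("unchanged" if before == after else "modified")
```

Identical digests mean the file is byte-for-byte unchanged. Keeping a backup copy of eng.traineddata under another name is a cheap extra safeguard.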

Otsu may be a good pre-processing step; just check visually that it is working as expected. A simple fixed threshold might be better. It really depends on the input.
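To make the difference between the two options concrete, here is a dependency-free sketch of Otsu's method: it picks the threshold that maximizes the between-class variance of the gray-level histogram, whereas a simple threshold uses a fixed value. The function names are illustrative, and in practice an image library would normally do this work.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold t for 8-bit gray values.

    Pixels with value <= t form one class and pixels > t the other;
    t is chosen to maximize the between-class variance.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, t):
    """Map pixels > t to white (255) and all others to black (0)."""
    return [[255 if p > t else 0 for p in row] for row in gray]
```

Calling `binarize(gray, 128)` gives the simple fixed-threshold variant, so both can be compared on the same inputs. Whichever variant is chosen, the same function should be applied to the training images and to the images recognized later.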

The important thing is to use training samples that are as similar as possible to the real text that you will process, and to apply exactly the same preprocessing. This holds both for the images and for the text content, i.e. do not train only on upper case or random text if your real text is lowercase in a specific language.

If I understand correctly, you are using only three samples just for testing the workflow. In this case I would use exactly the same samples for training and evaluation. If you use 3 samples for training and three different ones for eval the model will focus too much on the three training samples (overfitting badly) and the eval result will get worse than the original model.

For real training use as many samples as possible (1000? 10000?) and randomly sample from these a subset to use for eval.
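A possible way to do the random eval split, sketched in Python. The 10% eval fraction, the fixed seed, and the function name are assumptions; the sample names are placeholders for whatever entries go into list.train and list.eval.

```python
import random


def split_train_eval(samples, eval_fraction=0.1, seed=0):
    """Shuffle the samples and split them into (train, eval) lists.

    The two lists are disjoint. A fixed seed makes the split
    reproducible between runs.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction)) if shuffled else 0
    return shuffled[n_eval:], shuffled[:n_eval]


def write_list(path, entries):
    """Write one entry per line, the layout used for list files here."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(entry + "\n")
```

With 1000 samples and the default fraction this yields 900 training and 100 eval entries, which can then be written out with `write_list("list.train", train)` and `write_list("list.eval", evals)`.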


Bye

Lorenzo



محمود محمد

Mar 29, 2025, 1:15:34 AM
to tesser...@googlegroups.com

Can we hold an online meeting with a general invitation to those interested to discuss how to do this?


Fish Money

Mar 29, 2025, 4:56:59 AM
to tesser...@googlegroups.com

Mitya

Apr 5, 2025, 10:07:26 AM
to tesseract-ocr
Hi Lorenzo, thanks for replying!
I decided to train on a single source image (without any filters), but I am still hitting a major issue. I assume it lies either in the set of commands used to train the model or, more likely, in the step where we update eng.traineddata or interfere with the checkpoints.
Could you please take a look?

To all: please take a look as well and kindly reply in the follow-up topic:

https://groups.google.com/u/1/g/tesseract-ocr/c/X0dWjze9twc

Best Regards,
Mitya
On Friday, March 28, 2025 at 22:08:08 UTC+7, Lorenzo Blz wrote: