Issue with Colon Recognition After Fine-Tuning Tesseract 5.5.3 on Russian Dataset


Sandeep G

Nov 3, 2025, 9:10:04 AM

to tesseract-ocr

I’m currently fine-tuning the Tesseract OCR model (version 5.5.3) and have run into an issue with symbol and digit recognition.

With the original Tesseract weight file, the model missed the colon ( : ) symbol. To address this, I fine-tuned the model on 500 ROIs. After fine-tuning, the model recognized the colon, but some digits started to be misrecognized; for example, ‘5’ was sometimes read as ‘6’.
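A minimal sketch of how such digit confusions could be counted, assuming the ground-truth text and the OCR output are available as pairs of strings. The function name and the sample strings in the usage note are illustrative assumptions, not taken from the actual dataset.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitution_counts(pairs):
    """Count character substitutions in (ground_truth, ocr_output) pairs."""
    counts = Counter()
    for truth, predicted in pairs:
        matcher = SequenceMatcher(None, truth, predicted, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only equal-length 'replace' spans map one-to-one onto characters;
            # insertions and deletions (e.g. a dropped colon) are skipped here.
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                for t, p in zip(truth[i1:i2], predicted[j1:j2]):
                    counts[(t, p)] += 1
    return counts
```

Calling `substitution_counts([("12:45", "12:46")])` would report one `('5', '6')` substitution. Running this on the same evaluation ROIs with the original and the fine-tuned model shows which digit pairs degraded after fine-tuning.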

When I used a combination of the original Russian model and the fine-tuned Russian model, the digits were recognized correctly, but the colon symbol was again missing.
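For reference, a sketch of how two models can be loaded together through the Tesseract command line, where `-l` accepts several language names joined with `+`. The helper names and the image path are hypothetical. The name `rusfinetune` assumes the fine-tuned traineddata file is installed under the MODEL_NAME used for training.

```python
import shutil
import subprocess


def ocr_command(image_path, languages, tessdata_dir=None):
    """Build a tesseract CLI call that loads several traineddata files at once."""
    cmd = ["tesseract", image_path, "stdout", "-l", "+".join(languages)]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(cmd):
    """Run the command if the tesseract binary is on PATH; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

The order of the names may matter, since the first language listed is treated as the primary one. Comparing `["rus", "rusfinetune"]` against `["rusfinetune", "rus"]` on the same ROI is therefore a cheap experiment.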

Approaches tried (none yielded the desired results):

Converted the images to binary
Performed noise removal
Applied CLAHE
Tried all page segmentation modes (PSM)
Enabled early stopping to avoid overfitting
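The page segmentation sweep mentioned above can be sketched as follows. This assumes the standard `--psm` values 0 to 13. Modes 0 (orientation and script detection only) and 2 (listed as not implemented) are skipped by default because they produce no recognized text. The function name and image path are illustrative.

```python
def psm_sweep_commands(image_path, language, skip=(0, 2)):
    """Return one tesseract command per page segmentation mode."""
    commands = []
    for psm in range(14):  # Tesseract defines PSM values 0..13
        if psm in skip:
            continue
        commands.append(
            ["tesseract", image_path, "stdout", "-l", language, "--psm", str(psm)]
        )
    return commands
```

For small cropped ROIs, the single-line and single-word modes (7 and 8) are usually the ones worth comparing first. Each command can be passed to `subprocess.run` to collect the output per mode.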

Training Command Used:

make training MODEL_NAME=rusfinetune START_MODEL=rus MAX_ITERATIONS=4000 STOP_TRAINING_CONVERGED=true TESSDATA=/usr/local/share/tessdata
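One thing that may be worth checking alongside this command is how often each digit and the colon occur in the fine-tuning ground truth. The sketch below assumes the tesstrain layout, with transcriptions stored as `*.gt.txt` files in a directory such as `data/rusfinetune-ground-truth`. That path is inferred from MODEL_NAME and is an assumption, as is the function name.

```python
from collections import Counter
from pathlib import Path


def ground_truth_char_counts(ground_truth_dir):
    """Count every character across the *.gt.txt transcription files."""
    counts = Counter()
    for gt_file in sorted(Path(ground_truth_dir).glob("*.gt.txt")):
        # Each file holds the transcription of one line image.
        counts.update(gt_file.read_text(encoding="utf-8").strip())
    return counts
```

If the 500 ROIs contain many colons but few examples of some digits, those digits get little reinforcement during fine-tuning, which could be one possible explanation for confusions such as 5 and 6.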

May I know what could be the root cause of this issue or any suggestions to resolve it?

For your reference, I’ve attached the sample images.

sample_Images

Thank you for your time and support.
