Issue with Colon Recognition After Fine-Tuning Tesseract 5.5.3 on Russian Dataset


Sandeep G

Nov 3, 2025, 9:10:04 AM
to tesseract-ocr

I’m fine-tuning the Tesseract OCR model (version 5.5.3) for Russian and have run into a problem with symbol and digit recognition.

With the original Tesseract model file, the colon ( : ) was not recognized. To address this, I fine-tuned the model on 500 ROIs (regions of interest). After fine-tuning, the model recognized the colon, but some digits started to be misread; for example, ‘5’ was sometimes recognized as ‘6’.
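
To make the digit regression concrete, the misreads can be tallied per character pair. The following is a minimal sketch, not part of my actual pipeline: the function name `char_confusions` is hypothetical, and it assumes each ROI has a ground-truth string and an OCR output string to compare.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_confusions(truth: str, ocr: str) -> Counter:
    """Count (expected, recognized) pairs where OCR output differs from truth.

    Equal-length substitutions are paired character by character.
    Unequal-length substitutions are recorded as whole substrings.
    Dropped characters map to "", extra characters map from "".
    """
    counts = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for expected, got in zip(truth[i1:i2], ocr[j1:j2]):
                counts[(expected, got)] += 1
        elif tag == "replace":
            counts[(truth[i1:i2], ocr[j1:j2])] += 1
        elif tag == "delete":
            for expected in truth[i1:i2]:
                counts[(expected, "")] += 1
        elif tag == "insert":
            for got in ocr[j1:j2]:
                counts[("", got)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative strings only, not taken from the attached samples.
    print(char_confusions("12:45", "12:46"))
    print(char_confusions("12:45", "1245"))
```

Summing these counters over all ROIs, once for the original model and once for the fine-tuned one, would show whether the ‘5’ to ‘6’ substitution is systematic or limited to a few images.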

When I used the original Russian model together with the fine-tuned Russian model, the digits were recognized correctly, but the colon was missing again.
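
For reference, this is a minimal sketch of how two models can be combined on the Tesseract command line, using `-l` with the model names joined by `+`. The helper names `build_tesseract_cmd` and `run_ocr` are hypothetical, the image path is a placeholder, and it assumes the fine-tuned model is installed as `rusfinetune.traineddata`, matching `MODEL_NAME` in the training command.

```python
import subprocess


def build_tesseract_cmd(image_path, langs, psm=None, tessdata_dir=None):
    """Build a tesseract CLI call; multiple models are joined with '+'."""
    cmd = ["tesseract", image_path, "stdout", "-l", "+".join(langs)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image_path, langs, **kwargs):
    """Run tesseract and return the recognized text (requires tesseract on PATH)."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, langs, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # "roi.png" is a placeholder path, not one of the attached samples.
    print(build_tesseract_cmd("roi.png", ["rus", "rusfinetune"], psm=7))
    print(build_tesseract_cmd("roi.png", ["rusfinetune", "rus"], psm=7))
```

According to the Tesseract manual, the first model listed is treated as the primary language, so `rus+rusfinetune` and `rusfinetune+rus` may give different results. Comparing both orders could help narrow down why the colon disappears in the combined setup.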

Approaches tried (none gave the desired result):

  • Converted the images to binary

  • Performed noise removal

  • Applied CLAHE

  • Tried all PSM modes

  • Enabled early stopping to avoid overfitting

Training Command Used:
make training MODEL_NAME=rusfinetune START_MODEL=rus MAX_ITERATIONS=4000 STOP_TRAINING_CONVERGED=true TESSDATA=/usr/local/share/tessdata
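
One thing that may be worth checking is how often each digit and the colon appear in the 500 ground-truth transcriptions. The following is a minimal sketch, assuming the tesstrain convention of `*.gt.txt` files in `data/rusfinetune-ground-truth` (the default location for this `MODEL_NAME`; adjust if your layout differs). The function name `char_coverage` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def char_coverage(gt_dir, chars="0123456789:"):
    """Count occurrences of selected characters across all *.gt.txt files."""
    counts = Counter()
    for gt_file in sorted(Path(gt_dir).glob("*.gt.txt")):
        text = gt_file.read_text(encoding="utf-8")
        counts.update(ch for ch in text if ch in chars)
    # Report every requested character, including those with zero samples.
    return {ch: counts[ch] for ch in chars}


if __name__ == "__main__":
    # Assumed default tesstrain ground-truth directory for MODEL_NAME=rusfinetune.
    for ch, n in char_coverage("data/rusfinetune-ground-truth").items():
        print(repr(ch), n)
```

If the colon dominates the counts while some digits are rare, that imbalance would be one possible explanation for the digit confusion after fine-tuning, although this is a hypothesis rather than a confirmed cause.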

What could be the root cause of this issue, and do you have any suggestions for resolving it?

For your reference, I’ve attached the sample images.

sample_Images

Thank you for your time and support.
