I’m fine-tuning a Tesseract OCR model (version 5.5.3) and have run into a problem with symbol and digit recognition.
With the original Tesseract model file (traineddata), the colon ( : ) was not recognized. To address this, I fine-tuned the model on 500 ROIs (cropped regions of interest). After fine-tuning, the model recognized the colon, but some digits started to be misread; for example, ‘5’ was sometimes recognized as ‘6’.
When I used the original Russian model and the fine-tuned Russian model in combination, the digits were recognized correctly, but the colon was missing again.
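To put numbers on both symptoms (digit substitutions such as 5 → 6, and the dropped colon), a per-character tally over ground-truth/OCR pairs could be used to compare the original, fine-tuned, and combined models. This is only a minimal sketch using the Python standard library; the function name `char_confusions` and the sample strings are hypothetical and not taken from my data.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_confusions(pairs):
    """Tally character substitutions and deletions between truth and OCR text.

    pairs: iterable of (ground_truth, ocr_output) strings.
    Returns (substitutions, missed), both Counter objects.
    """
    substitutions = Counter()  # (true_char, predicted_char) -> count
    missed = Counter()         # true_char absent from the OCR output -> count
    for truth, pred in pairs:
        matcher = SequenceMatcher(None, truth, pred, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                # Equal-length replacement: pair the characters one to one.
                for t, p in zip(truth[i1:i2], pred[j1:j2]):
                    substitutions[(t, p)] += 1
            elif tag == "delete":
                # Characters present in the truth but dropped by the OCR.
                for t in truth[i1:i2]:
                    missed[t] += 1
    return substitutions, missed


# Hypothetical sample strings, for illustration only.
samples = [("12:45", "12:46"), ("09:15", "0915")]
substitutions, missed = char_confusions(samples)
print(substitutions.most_common(5))
print(missed.most_common(5))
```

Running the same tally on the output of each model would show whether the 5/6 confusion and the missing colon change independently of each other.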
Approaches tried (none gave the desired result):
Converted the images to binary
Performed noise removal
Applied CLAHE
Tried all page segmentation modes (--psm)
Enabled early stopping to avoid overfitting
What could be the root cause of this behavior, and do you have any suggestions for resolving it?
For your reference, I’ve attached the sample images.
Thank you for your time and support.