Tesseract OCR misreads characters and outputs extra characters

Jiansen Chan

May 5, 2025, 2:00:13 AM

to tesseract-ocr

I custom-trained a model; the configuration is shown below:

custom_config = f'--oem 3 --psm 6 -l jpn22'

However, when I use a debugger to check what is actually being scanned, this is what is shown. The "S" cannot be read because it is assumed to contain two different characters (hence the two bounding boxes in the picture with the "S"), and the "3L" picture is read as "3LL".
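For anyone who wants to reproduce the box output outside a debugger, here is a minimal sketch. It assumes the Python wrapper pytesseract and Pillow are in use (the `custom_config` string suggests this, but it is an assumption). `parse_boxes`, `dump_boxes` and `CharBox` are hypothetical helper names. The parser expects Tesseract's box format, one `symbol left bottom right top page` entry per line, with the origin at the bottom-left of the image.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_boxes(box_text: str) -> List[CharBox]:
    """Parse box-format text: 'symbol left bottom right top page' per line."""
    boxes = []
    for line in box_text.splitlines():
        # Split from the right so the symbol field is kept intact.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, left, bottom, right, top, _page = parts
        boxes.append(CharBox(char, int(left), int(bottom), int(right), int(top)))
    return boxes


def dump_boxes(image_path: str,
               config: str = "--oem 3 --psm 6 -l jpn22") -> List[CharBox]:
    """Run Tesseract on one image and return its per-character boxes."""
    # Imported here so parse_boxes can be used without the OCR stack installed.
    import pytesseract
    from PIL import Image

    text = pytesseract.image_to_boxes(Image.open(image_path), config=config)
    return parse_boxes(text)
```

Calling `dump_boxes("Issue.png")` and printing the result should make it easy to see whether the "S" really comes back as two boxes, and whether the extra "L" has its own box or overlaps the first one.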

The language model I'm using is for Japanese Kanji, but it should be able to read these letters because the unicharset for the jpn model includes Roman capital letters. I've tried reducing the amount of training data containing repeated samples for this, so I don't think it is a matter of overfitting.
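One way to double-check that the capital letters are present in the trained model's unicharset is sketched below. It assumes the unicharset has already been extracted from the traineddata file (for example with `combine_tessdata -u`) and that it follows the usual layout: an entry count on the first line, then one entry per line whose first field is the symbol. `unicharset_symbols` and `missing_capitals` are hypothetical helper names.

```python
import string
from typing import Iterable, List, Set


def unicharset_symbols(lines: Iterable[str]) -> Set[str]:
    """Collect the symbol (first field) of every entry in a unicharset file."""
    it = iter(lines)
    next(it, None)  # the first line holds the number of entries, not a symbol
    symbols = set()
    for line in it:
        fields = line.split()
        if fields:
            symbols.add(fields[0])
    return symbols


def missing_capitals(lines: Iterable[str]) -> List[str]:
    """Return the Roman capital letters that have no entry in the unicharset."""
    symbols = unicharset_symbols(lines)
    return [c for c in string.ascii_uppercase if c not in symbols]
```

Usage, with a placeholder file name: `with open("jpn22.lstm-unicharset", encoding="utf-8") as f: print(missing_capitals(f))`. An empty list means all 26 capitals have entries, which would point away from the unicharset and toward segmentation or training as the cause.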

Can I get some advice on this?

Issue2.png

Issue.png
