Train tesseract for recognition of a dotted font


Junmock Lee

Jun 16, 2016, 6:51:34 AM
to tesseract-ocr
Dear all,
I'm trying to train Tesseract to recognize a dotted (dot-matrix) font such as the one in this image.


Here is my tif/box file pair, generated with jTessBoxEditor.
eng_dotmatrix.dot-matrix.exp0.tif
eng_dotmatrix.dot-matrix.exp0.box
(I want to train Tesseract on this font as a new language, covering only uppercase letters and digits.)

Then I ran:
tesseract eng_dotmatrix.dot-matrix.exp0.tif eng_dotmatrix.dot-matrix.exp0 box.train
The only output was:
Tesseract Open Source OCR Engine v3.02 with Leptonica
and Tesseract did not generate a .tr file.

Is it impossible to train Tesseract on fonts in which each character is made up of many small blobs?
I think I could get solid blobs by eroding the image, but I would rather not manipulate the image.
Do you have any suggestions?

O/S: Windows 7
Tesseract Ver: 3.02.02

Regards,
Lee.

Bojidar Stanchev

Jun 16, 2016, 7:08:38 AM
to tesseract-ocr
You can use an algorithm to connect the dots in such fonts (this one seems quite easy) and then feed the result to Tesseract. It seems very unlikely that Tesseract will recognize a dotted font like this without preprocessing. Look into morphology algorithms, for example in OpenCV; you might find something useful.
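
To illustrate the "connect the dots" idea, here is a minimal sketch of morphological closing (a dilation followed by an erosion) on a tiny binary image, written in plain Python so it runs without extra packages. The function names (`dilate`, `erode`, `close_gaps`), the square structuring element, and the toy image are assumptions made for this example, not something taken from the thread or from Tesseract. With OpenCV, the same operation is available as `cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)`, applied to an image in which the text pixels are the bright ones.

```python
def _window(center, radius, limit):
    """Indices of the 1-D neighborhood around center, clipped to the image."""
    return range(max(0, center - radius), min(limit, center + radius + 1))


def dilate(img, radius=1):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                # Every set pixel paints its whole neighborhood.
                for yy in _window(y, radius, h):
                    for xx in _window(x, radius, w):
                        out[yy][xx] = 1
    return out


def erode(img, radius=1):
    """Binary erosion; pixels outside the image are ignored (treated as set)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # A pixel survives only if its whole in-bounds neighborhood is set.
            if all(img[yy][xx]
                   for yy in _window(y, radius, h)
                   for xx in _window(x, radius, w)):
                out[y][x] = 1
    return out


def close_gaps(img, radius=1):
    """Morphological closing: dilate to bridge gaps, then erode back."""
    return erode(dilate(img, radius), radius)


# Toy example (hypothetical data): a vertical stroke drawn as three
# separate dots, with text pixels stored as 1 and background as 0.
dotted = [[0] * 7 for _ in range(9)]
for row in (2, 4, 6):
    dotted[row][3] = 1

closed = close_gaps(dotted, radius=1)

for row in closed:
    print("".join("#" if px else "." for px in row))
```

The radius has to be at least half the gap between neighboring dots, but small enough that separate characters do not merge, so it would need tuning against the real scan resolution.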

Junmock Lee

Jun 16, 2016, 8:20:54 PM
to tesseract-ocr
Thank you for your help. I'll try it.