Train tesseract for recognition of a dotted font

Junmock Lee

Jun 16, 2016, 6:51:34 AM

to tesseract-ocr

Dear all,
I'm trying to train tesseract to recognize a dotted (dot-matrix) font such as the one in this image.

Here is my tif/box file pair, generated with jTessBoxEditor.
eng_dotmatrix.dot-matrix.exp0.tif
eng_dotmatrix.dot-matrix.exp0.box
(I want to train tesseract on this font as a new language covering only uppercase letters and digits.)

Then I ran:

tesseract eng_dotmatrix.dot-matrix.exp0.tif eng_dotmatrix.dot-matrix.exp0 box.train

The only output was:

Tesseract Open Source OCR Engine v3.02 with Leptonica

and tesseract did not generate a .tr file.

Is it impossible to train tesseract on fonts where each character is made up of many small blobs?
I think I could get solid blobs by eroding the image, but I would prefer not to manipulate the image.
Do you have any suggestions?

O/S: Windows 7
Tesseract Ver: 3.02.02

Regards,
Lee.

Bojidar Stanchev

Jun 16, 2016, 7:08:38 AM

to tesseract-ocr

You can use an algorithm to connect the dots in fonts like this (this one looks quite easy) and then feed the result to tesseract. Tesseract is very unlikely to recognize a dotted font like this without preprocessing. Look into morphology algorithms, for example in OpenCV; you might find something useful.
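To make the "connect the dots" idea concrete, here is a minimal pure-Python sketch of morphological closing (dilate, then erode) on a tiny binary image. It is only an illustration under assumptions: the helper names (dilate, erode, close_gaps, count_blobs), the square window, and the toy "dotted stroke" are all made up for this example, and ink is assumed to be 1 on a 0 background. On a real scan you would more likely use a library routine such as OpenCV's cv2.morphologyEx with cv2.MORPH_CLOSE, and pick the radius from the measured gap between dots.

```python
# Hypothetical helpers for illustration; 1 = ink (foreground), 0 = background.

def dilate(img, r=1):
    """Set a pixel to 1 if any pixel in its (2r+1)x(2r+1) window is 1."""
    h, w = len(img), len(img[0])
    return [[int(any(img[yy][xx]
                     for yy in range(max(0, y - r), min(h, y + r + 1))
                     for xx in range(max(0, x - r), min(w, x + r + 1))))
             for x in range(w)] for y in range(h)]

def erode(img, r=1):
    """Keep a pixel at 1 only if every in-bounds pixel in its window is 1."""
    h, w = len(img), len(img[0])
    return [[int(all(img[yy][xx]
                     for yy in range(max(0, y - r), min(h, y + r + 1))
                     for xx in range(max(0, x - r), min(w, x + r + 1))))
             for x in range(w)] for y in range(h)]

def close_gaps(img, r=1):
    """Morphological closing: dilate, then erode (assumes r >= 1).

    The image is padded with r background pixels first so that the
    image border does not influence the result, then cropped back.
    """
    w = len(img[0])
    blank = [[0] * (w + 2 * r) for _ in range(r)]
    padded = (blank
              + [[0] * r + list(row) + [0] * r for row in img]
              + [list(row) for row in blank])
    closed = erode(dilate(padded, r), r)
    return [row[r:-r] for row in closed[r:-r]]

def count_blobs(img):
    """Count 8-connected foreground components with an iterative flood fill."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    blobs = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                blobs += 1
                seen[y][x] = True
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    for ny in range(max(0, cy - 1), min(h, cy + 2)):
                        for nx in range(max(0, cx - 1), min(w, cx + 2)):
                            if img[ny][nx] and not seen[ny][nx]:
                                seen[ny][nx] = True
                                stack.append((ny, nx))
    return blobs

# Toy "dotted" vertical stroke: ink at every other row of column 2.
dotted = [[1 if (x == 2 and y % 2 == 1) else 0 for x in range(5)]
          for y in range(9)]
closed = close_gaps(dotted, r=1)

print(count_blobs(dotted), count_blobs(closed))  # prints: 4 1
```

The four isolated dots merge into one solid stroke while the stroke width stays at one pixel, which is the point of closing rather than plain dilation. Note the polarity: on a typical scan with dark text on a white background the ink is the low value, so the image has to be inverted first (or the operation swapped), otherwise the same call would widen the gaps instead.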

Junmock Lee

Jun 16, 2016, 8:20:54 PM

to tesseract-ocr

Thank you for your help. I'll try it.
