Training CMC7 Font


Roger

Mar 2, 2016, 2:23:44 AM
to tesseract-ocr

I am training tesseract to recognize CMC7 font, following this and this tutorial.


I have made a .tif file with 2621 characters and created the .box file, checking every character to make sure the X and Y positions of its bounding rectangle are correct.
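
A box file can also be sanity-checked by script rather than by eye. The following is a minimal sketch, assuming the usual Tesseract 3.x box-file layout (one line per glyph: symbol, left, bottom, right, top, page, with the origin at the bottom-left corner of the image). The function names `parse_box_lines` and `suspicious_boxes` are hypothetical helpers, not part of Tesseract.

```python
from typing import Iterable, List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: Iterable[str]) -> List[Box]:
    """Parse box-file lines of the assumed form 'symbol left bottom right top page'."""
    boxes = []
    for raw in lines:
        parts = raw.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        page = int(parts[5]) if len(parts) > 5 else 0
        boxes.append(Box(parts[0], left, bottom, right, top, page))
    return boxes


def suspicious_boxes(boxes: List[Box], image_width: int, image_height: int) -> List[Box]:
    """Return boxes that are empty, inverted, or extend outside the image."""
    bad = []
    for b in boxes:
        inverted = b.right <= b.left or b.top <= b.bottom
        outside = (b.left < 0 or b.bottom < 0
                   or b.right > image_width or b.top > image_height)
        if inverted or outside:
            bad.append(b)
    return bad
```

To use it, read the lines of the .box file with `open(...)` and pass the pixel dimensions of the .tif. Any box it returns is worth re-checking in the box editor.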


After that, I have run the command to train tesseract:


tesseract por.cmc7.exp0.tif por.cmc7.box nobatch box.train .stderr


I've made a shell script that calls this command in a loop, so the training will be repeated a bunch of times. However, after a long run of messages like:

APPLY_BOXES: Unlabelled word at :Bounding box=(762,2763)->(783,2776)

APPLY_BOXES: Unlabelled word at :Bounding box=(774,2269)->(783,2277)

APPLY_BOXES: Unlabelled word at :Bounding box=(787,2269)->(789,2277) ...

 

The result is always:

Found 420 good blobs.

2129 remaining unlabelled words deleted.

Generated training data for 420 words

It has been running for several hours, and it still generates training data for only 420 words. When I then run tesseract on a check image to test whether it recognizes the characters, it doesn't work: it fails to recognize the characters and returns random letters and symbols.


How can I make it recognize all the characters in the .tif image?


Thank you.

I have attached the .box and .tif in the zip file.

cmc7.zip

Tom Morris

Mar 2, 2016, 11:48:25 AM
to tesseract-ocr
On Wednesday, March 2, 2016 at 2:23:44 AM UTC-5, Roger wrote:

I am training tesseract to recognize CMC7 font, following this and this tutorial.


I see two immediate issues:

- Tesseract assumes non-noisy character images are connected shapes (except for diacritics, etc.), while the CMC7 characters are made up of disconnected vertical bars.
- According to this Wikipedia page, https://fr.wikipedia.org/wiki/CMC7, the significant part of the CMC7 encoding is the interbar spacing, *not* the overall shape.
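
To illustrate the second point, here is a minimal sketch of reading the interbar spacing directly instead of the glyph shape. It assumes a single, already-segmented character given as a column profile (the number of dark pixels in each pixel column), and it classifies each gap as narrow or wide relative to the midpoint between the smallest and largest gap. The function names are hypothetical, and the mapping from gap patterns to characters is left out on purpose; it would have to come from the CMC7 specification.

```python
from typing import List, Sequence, Tuple


def bar_runs(column_ink: Sequence[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) column indices for each run of inked columns."""
    runs = []
    start = None
    for x, ink in enumerate(column_ink):
        if ink >= min_ink and start is None:
            start = x  # a bar begins
        elif ink < min_ink and start is not None:
            runs.append((start, x - 1))  # the bar ended on the previous column
            start = None
    if start is not None:
        runs.append((start, len(column_ink) - 1))
    return runs


def gap_pattern(column_ink: Sequence[int]) -> str:
    """Encode the gaps between consecutive bars as '0' (narrow) or '1' (wide)."""
    runs = bar_runs(column_ink)
    gaps = [nxt[0] - cur[1] - 1 for cur, nxt in zip(runs, runs[1:])]
    if not gaps:
        return ""
    # Assumption: only two gap widths occur, so the midpoint separates them.
    threshold = (min(gaps) + max(gaps)) / 2
    return "".join("1" if g > threshold else "0" for g in gaps)
```

A character with seven bars yields a six-symbol pattern. A real check scan would also need deskewing, line and character segmentation, and some tolerance for print noise, none of which is handled here.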

Are you sure you're using the right tool for the job?

Tom

Roger

Mar 3, 2016, 8:05:44 AM
to tesseract-ocr
Yes. I've seen some people accomplish that, but they didn't provide the .traineddata file.

I have been able to make tesseract recognize some fonts by reducing the image size and increasing its contrast, so the characters are more condensed.

Do you have any other ideas for how I can make tesseract recognize it better?

Roger

Mar 3, 2016, 1:23:34 PM
to tesseract-ocr
Does running tesseract training exhaustively on the .box and .tif files help increase the recognition accuracy?

Meh Hem

Mar 4, 2016, 1:04:03 AM
to tesseract-ocr
If I were going to attempt this, I would try to solve it via pre-processing. It shouldn't be too difficult to pre-process the image to remove the white spaces inside the chars and create consistent shapes that tesseract could read easily.

It could possibly need some up-scaling to offset the reduced size too.

I don't think this is the answer you are after, but getting tesseract to consider broken shapes as blobs will be tedious.
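
One way to sketch that pre-processing step is horizontal run-length smearing: on every pixel row, fill any short run of background that sits between two ink pixels. This is only an illustration under stated assumptions: the image is already binarized into rows of 0 (background) and 1 (ink), and `max_gap` (a hypothetical parameter) is larger than the widest gap inside a character but smaller than the space between characters.

```python
from typing import List


def smear_rows(image: List[List[int]], max_gap: int) -> List[List[int]]:
    """Fill horizontal background runs of at most max_gap pixels between ink pixels."""
    result = [row[:] for row in image]  # work on a copy, leave the input untouched
    for row in result:
        last_ink = None
        for x, pixel in enumerate(row):
            if not pixel:
                continue
            # Only positions left of x are modified, so iteration stays valid.
            if last_ink is not None and 0 < x - last_ink - 1 <= max_gap:
                for fill in range(last_ink + 1, x):
                    row[fill] = 1
            last_ink = x
    return result
```

Note the trade-off: once the gaps are filled, the interbar spacing is gone, so tesseract can only tell characters apart by their outlines.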

Roger

Mar 9, 2016, 11:20:28 AM
to tesseract-ocr
Yes, that's what I'm doing. After I reduced the image size and increased the image contrast and brightness, tesseract was able to recognize about 5 characters. But still, it is hard to recognize the whole string.

Does anyone have another approach I could try?

Thank you.

Luis Teodoro Junior

May 6, 2016, 12:41:06 AM
to tesseract-ocr
Roger,

I am training tesseract to recognize the CMC7 font.

I have the same problem. Did you manage to complete the training?

Would you help me?

Thank you.

Nicolas Naso

Mar 27, 2017, 3:29:00 PM
to tesseract-ocr
Hi! Did you finally train tesseract with CMC7?
I'd appreciate it if you could share the final training file :D
Thanks
Nicolas