Re: cmc7.traineddata

Mamadou

Apr 3, 2020, 11:43:12 AM

to tesseract-ocr

The easiest way to train Tesseract on the MICR CMC-7 font is to use OCR-D's ocrd-train (https://github.com/OCR-D/ocrd-train). That is what we used in our R&D project (https://github.com/DoubangoTelecom/tesseractMICR). We open-sourced the MICR E-13B traineddata but not the CMC-7 one. We're not using these models in our products, but the results are more accurate than any commercial product you can find online (LEADTOOLS, Accusoft, Recogniform and ABBYY). You'll also need heavy pre-processing to fill the interspaces, i.e. the gaps between the vertical bars that make up each CMC-7 character. If you're familiar with TensorFlow, I'd recommend using it instead of Tesseract.
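
For illustration only (the post does not describe the actual tesseractMICR pre-processing, so this is not that pipeline): one common way to fill gaps between vertical bars is a horizontal morphological closing, that is, a dilation followed by an erosion along each row. The sketch below assumes an already binarized image stored as nested lists with 1 for ink and 0 for background. The names `close_horizontal` and `width` are hypothetical, and the right `width` depends on scan resolution: it should exceed the widest gap inside a character but stay below the spacing between characters.

```python
from typing import List

Image = List[List[int]]  # rows of pixels: 1 = ink, 0 = background


def _dilate_row(row: List[int], half: int) -> List[int]:
    """Set a pixel if any pixel within `half` columns of it is ink."""
    return [int(any(row[max(0, i - half): i + half + 1])) for i in range(len(row))]


def _erode_row(row: List[int], half: int) -> List[int]:
    """Keep a pixel only if every pixel within `half` columns of it is ink.

    Columns beyond the image border are ignored rather than treated as background.
    """
    return [int(all(row[max(0, i - half): i + half + 1])) for i in range(len(row))]


def close_horizontal(image: Image, width: int) -> Image:
    """Horizontal closing with a 1 x `width` window (hypothetical helper).

    Gaps narrower than `width` pixels between ink pixels on the same row are
    filled; gaps of `width` pixels or more are left open.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    return [_erode_row(_dilate_row(row, half), half) for row in image]


if __name__ == "__main__":
    # Toy data, not a real CMC-7 scan: two groups of bars separated by a wide gap.
    bars = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
    for row in close_horizontal([bars, bars, bars], width=5):
        print("".join("#" if pixel else "." for pixel in row))
```

On real scans the same operation is usually done with an image library (for example OpenCV's `cv2.morphologyEx` with `cv2.MORPH_CLOSE` and a wide, one-pixel-high rectangular kernel), applied after binarization and deskewing.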

On Thursday, April 2, 2020 at 8:22:44 PM UTC+2, Ghada Aruri wrote:

Hi team,

I want to train Tesseract for CMC-7 using jTessBoxEditor to produce cmc7.traineddata. What are the steps to do that?
If anybody has already done it, would you be willing to share the file?

Best Regards.

Essam Zaky

Apr 4, 2020, 5:59:34 AM

to tesseract-ocr

Hi @mamadou

How did you collect the 17,000 images? Are they real images?

Also, which type of TensorFlow model did you use: an LSTM line model or a single-character model?

Best Regards

Essam

Essam Zaky

Apr 4, 2020, 9:22:15 AM

to tesseract-ocr

Thanks @mamadou

On Saturday, April 4, 2020 at 1:24:52 PM UTC+2, Mamadou wrote:

Essam,
Yes, they are all real images. We use web scraping to collect them from Google, Bing, Pinterest, Instagram...
We're using an LSTM with an attention layer so that OCR still works when the MICR lines are mixed with signatures, stamps, annotations...
There is an online webapp to check the accuracy at https://www.doubango.org/webapps/micr/