Re: cmc7.traineddata


Mamadou

Apr 3, 2020, 11:43:12 AM
to tesseract-ocr
The easiest way to train the MICR CMC-7 font for Tesseract is to use ocrd-train from OCR-D (https://github.com/OCR-D/ocrd-train). That is what we used in our R&D project (https://github.com/DoubangoTelecom/tesseractMICR). We open-sourced the MICR E-13B traineddata but not the CMC-7 one. We're not using these models in our products, but the results are more accurate than any commercial product you can find online (LEADTOOLS, Accusoft, Recogniform and ABBYY). You'll also need heavy pre-processing to fill the interspaces. If you're familiar with TensorFlow, I'd recommend using it instead of Tesseract.
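
To illustrate what "filling the interspaces" could look like: each CMC-7 character is printed as seven vertical bars separated by narrow and wide gaps, so one plausible pre-processing step is to close those gaps horizontally and leave solid glyph shapes for the recognizer. The sketch below is only an assumption about that step, not the pipeline used in tesseractMICR. The function name `fill_interspaces` and the `max_gap` parameter are made up for the example, and the image is assumed to be already binarized (1 = ink, 0 = background).

```python
def fill_interspaces(image, max_gap):
    """Fill short horizontal background runs enclosed by ink on both sides.

    image:   list of rows, each a list of 0 (background) or 1 (ink).
    max_gap: widest gap, in pixels, that is treated as an interspace.
    Returns a new image; the input is left unchanged.
    """
    filled = [row[:] for row in image]
    for row in filled:
        last_ink = None  # column of the most recent ink pixel in this row
        for x, pixel in enumerate(row):
            if pixel != 1:
                continue
            if last_ink is not None:
                gap = x - last_ink - 1
                if 0 < gap <= max_gap:
                    # Paint the background pixels between the two ink pixels.
                    for k in range(last_ink + 1, x):
                        row[k] = 1
            last_ink = x
    return filled


# Hypothetical single-row example: a 2-pixel gap is filled, a 4-pixel gap is kept.
row = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(fill_interspaces([row], max_gap=2)[0])
```

`max_gap` would have to be at least as wide as the widest gap inside a character but narrower than the spacing between characters, otherwise neighbouring characters merge. With OpenCV, a morphological closing with a wide, flat kernel is a common way to get a similar effect.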

On Thursday, April 2, 2020 at 8:22:44 PM UTC+2, Ghada Aruri wrote:
Hi team, 

For CMC-7, I want to train Tesseract using jTessBoxEditor to get cmc7.traineddata. What are the steps to produce cmc7.traineddata?
And if anybody has already done it, would you be willing to share it?

Best Regards.

Essam Zaky

Apr 4, 2020, 5:59:34 AM
to tesseract-ocr
Hi @mamadou

How did you collect the 17,000 images? Are they real images?
Also, which type of TensorFlow model did you use: an LSTM line model or a single-character model?

Best Regards
Essam

Essam Zaky

Apr 4, 2020, 9:22:15 AM
to tesseract-ocr
Thanks @mamadou


On Saturday, April 4, 2020 at 1:24:52 PM UTC+2, Mamadou wrote:
Essam,
Yes. They are all real images. We're using web scraping to collect the images from Google, Bing, Pinterest, Instagram...
We're using LSTM with an Attention layer to make sure OCR will work even if the MICR lines are mixed with the signature, stamps, annotations...
There is an online webapp to check the accuracy at https://www.doubango.org/webapps/micr/