MICR recognition with tesseract-ocr

ttutuncu

Mar 28, 2008, 4:44:59 AM
to tesseract-ocr
Hi,

I am working on a project at our company. My aim is to recognize the
MICR E-13B font, which is generally used for printing numbers on bank
cheques.

I have managed to read the bank cheque number successfully. The success
rate is around 95% if I cut only the region containing the number from
the scanned image.

The problem is that when I process the whole image (the whole bank
cheque), tesseract tries to recognize all the regions and does not
ignore characters that do not occur in the allowed unicharset.

My unicharset is: A,B,C,D,1,2,3,4,5,6,7,8,9,0

What I want tesseract to do is to ignore every other character that is
not in the unicharset.

When tesseract tries to recognize every character in the image, the
success rate of recognizing the cheque number decreases to around
70-80%.

Is there a way to make tesseract ignore the other characters that are
not in the unicharset?

The other problem is that in the output I sometimes get a "o"
character instead of a "0" (zero) character even though it is not in
my unicharset.

Does the DangAmbigs file really work? Because I think it is not
working for me. Is there a configuration for this?

When I finish my project I will release it to this group.

Thank you for your help...

Frank Bennett

Mar 28, 2008, 4:38:31 PM
to tesseract-ocr
Tesseract is aggressive about recognition, so if there are other
characters in the image, they will be recognized as something. You
have probably already considered this, but if the code appears in the
same area of every cheque, you should be able to mask off everything
else before feeding the image to tess. Another thing that you have
probably already looked at is training tess for that specific font
set, which should also help; a minimal training page containing an
example of each of the rest of the characters in the alphabet in
another font should do the trick. For monospaced text like that,
using "chop_enable 0" in a config file might also raise the
recognition rate a bit for you. Worth a try, anyway.

Frank Bennett
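
The masking idea above can be sketched without any imaging library by treating the scan as a list of rows of grayscale pixel values. This is only an illustration under that assumption: `mask_outside_band` is a hypothetical helper, and the row range of the MICR line has to be measured on your own cheques. In a real project you would more likely crop or paint the image with an imaging library before passing it to tess.

```python
def mask_outside_band(pixels, top, bottom, fill=255):
    """Return a copy of a grayscale image (list of rows) in which every
    row outside [top, bottom) is painted with `fill` (255 = white)."""
    if not 0 <= top <= bottom <= len(pixels):
        raise ValueError("band must lie inside the image")
    masked = []
    for y, row in enumerate(pixels):
        if top <= y < bottom:
            masked.append(list(row))           # keep the MICR band as-is
        else:
            masked.append([fill] * len(row))   # blank out everything else
    return masked


# Tiny 4-row "image": only the last two rows hold the cheque number.
image = [[0, 0], [0, 0], [10, 20], [30, 40]]
print(mask_outside_band(image, top=2, bottom=4))
```

Painting the rest of the page white rather than cropping keeps the image size unchanged, which may matter if later steps rely on pixel coordinates.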

Frank Bennett

Mar 28, 2008, 4:42:59 PM
to tesseract-ocr
In addition to "chop_enable 0" for monospace, since you're mostly
interested in numbers, using "number_depth 4" in the config file might
also help. Config files are just added to the command line. If your
trained language is set up as "mic", the line would look something
like this:

tesseract chequeimage.tif chequeimage -l mic my.config

Hope this helps,
FB
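
As a rough sketch of the setup described above, the config file is plain text with one "name value" pair per line, and the command line lists the image, the output base name, the language, and then the config file. The helper names below are hypothetical, and the two settings are simply the ones suggested in this thread; whether they help, and whether your tesseract version accepts them, is not something this sketch can confirm.

```python
def render_config(settings):
    """Render settings as config-file text: one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in settings.items())


def build_command(image, output_base, lang, config_name):
    """Build the argument list in the same order as the command line above."""
    return ["tesseract", image, output_base, "-l", lang, config_name]


config_text = render_config({"chop_enable": 0, "number_depth": 4})
print(config_text, end="")
print(" ".join(build_command("chequeimage.tif", "chequeimage", "mic", "my.config")))
```

Saving `config_text` as my.config and running the printed command would reproduce the invocation shown above. The argument list can also be handed to `subprocess.run` if tesseract is on the PATH.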


Scan...@gmail.com

Mar 30, 2008, 1:29:58 PM
to tesseract-ocr
You could also filter out the non-MICR characters, because they would
have a very low recognition confidence.
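
A minimal sketch of that kind of filter, assuming you already have each recognized character paired with a confidence score between 0 and 100. How to get those scores out of tesseract is not covered here, `keep_confident_micr` is a hypothetical helper, and the threshold of 60 is an arbitrary placeholder to tune on real cheques.

```python
MICR_CHARS = set("ABCD0123456789")  # the unicharset from the first post


def keep_confident_micr(symbols, min_confidence=60.0):
    """Keep characters that are in the MICR set and were recognized with
    at least `min_confidence`; return them joined into one string."""
    return "".join(
        char
        for char, confidence in symbols
        if char in MICR_CHARS and confidence >= min_confidence
    )


# Made-up (character, confidence) pairs for illustration only.
recognized = [("A", 91.0), ("1", 88.5), ("o", 35.0),
              ("2", 90.2), ("7", 12.0), ("A", 93.1)]
print(keep_confident_micr(recognized))  # prints "A12A"
```

Note that this drops low-confidence MICR characters as well as stray text, so a cheque number that comes back shorter than expected should be treated as a failed read rather than a valid result.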

ttutuncu

Mar 30, 2008, 5:26:22 PM
to tesseract-ocr
Thank you Frank for your detailed reply.

I have already trained tess on the MICR font; I have no problem with
that.
So you are saying that there is no way other than masking the part
where the cheque number is?

What do you mean by: "a minimal training page containing an
example of each of the rest of the characters in the alphabet in
another font should do the trick" ?
The training file I made contains only the characters in the MICR font.

What does chop_enable do?

Why do I sometimes get an "o" character instead of a "0"(zero)
character in my results even though it is not in my charset?

ttutuncu

Mar 30, 2008, 5:29:07 PM
to tesseract-ocr
Is there another way to filter out the characters which are not in the
MICR font set?
I know I can cut out the part with the cheque number, but I want a way
other than this..
I want tess to recognize only the characters in my unicharset file. Is
this possible?

Frank Bennett

Mar 30, 2008, 8:10:14 PM
to tesseract-ocr
On Mar 31, 6:26 am, ttutuncu <tariktutu...@gmail.com> wrote:
> Thank you Frank for your detailed reply.
>
> I have already trained tess the MICR font, I have no problem with
> that.
> So you say that there is no way other than masking the part where the
> cheque number is.

As far as my experience goes (which isn't all that far), if it's a
controlled environment and it's possible to mask the image, that seems
like the simplest thing to do. Tesseract doesn't seem
to know about known unknowns ("Things that we know, that we don't
know", in the words of the well-known culprit).

> What do you mean by: "a minimal training page containing an
> example of each of the rest of the characters in the alphabet in
> another font should do the trick" ?
> The training file I did only contains the characters in the MICR font.

I'm working on a little research project to extract the income
portions from some Japanese financial statements. The first thing I
tried was training Tess for a numbers-only language, and the results
were disappointing -- the recognition rate wasn't great, and since
everything came back as a number, the text was too ambiguous to do
anything with.

When Tess was trained to recognize about 100 Japanese characters plus
digits, things improved considerably. On the small set of samples
I've run so far, I'm getting about 60% confirmed totals (every number
in the statement recognized to the digit), from source docs harvested
from various sources in the wild.

This is just empirical observation, I don't know anything about how
Tess works inside. But providing alternative "noise" characters in
the training set seemed to help improve the recognition of digits in
our case, and it gave us a bit more variation to work with in the
output text, which helped when extracting the data we were interested
in.

> What does chop_enable do?

This was just another empirical observation. :/ I actually don't
know what it does, but with chop_enable 1 (the default, I think), Tess
tries to split some Japanese characters, and breaks them in the
process. With chop_enable 0, that doesn't happen, and on our text,
there does not seem to be any drop in recognition elsewhere. I assume
that this has something to do with the fact that Japanese fonts (and
digits) are monospaced. But it is very possible that I don't know
what I'm talking about.

> Why do I sometimes get an "o" character instead of a "0"(zero)
> character in my results even though it is not in my charset?

That is an odd one. We found that recognition rates suffered when
each training page did not cover exactly the same set of characters.
But if there is no "o" lurking in one of your box files ... that would
be odd. We certainly don't get any roman characters in our returns,
although we do get a mass of (mostly wrong) Japanese characters
corresponding to blobs that Tess can't recognize correctly.

I'm running Tess on Linux, I don't know if that would make a
difference.

Frank

ttutuncu

Mar 31, 2008, 3:28:50 AM
to tesseract-ocr
Hi Frank,

I see you are into tess very deeply. I hope your project ends
successfully.

I used "chop_enable 0" in a config file, but it didn't change the
output in any way.
It is very odd that it had no effect on the output file. I
filled out the DangAmbigs file too, but this had no effect either.
Are there settings needed to make these files work? Whatever I do,
the output is still the same.
I am using tesseract v2.01 on a Windows system.

The contents of the unicharset file are as follows:

15
NULL f
A f
0 f
B f
C f
D 7
1 f
2 f
3 8
4 8
5 8
6 8
7 8
8 8
9 8

I don't understand why I get an "o" character in my output! I've even
checked the box file. There is no sign of an "o".

It is true that the MICR font contains monospaced characters, but
chop_enable had no effect.
If you look at the MICR font, there are 4 special characters besides
the digits. The only problem I get is with the D character. The D
character sometimes comes out as A0, A1, 81, or 0o1. Whatever I try, the
result is always the same. I have problems with the D character only.
And of course the "o" character, where I don't understand where it
comes from.
Do you have any thoughts on why this could happen?

Can I use chop_enable while I am creating the box file?

Thank you very much.

Tarik

Frank Bennett

Mar 31, 2008, 4:00:16 AM
to tesseract-ocr
On Mar 31, 4:28 pm, ttutuncu <tariktutu...@gmail.com> wrote:
> Hi Frank,
>
> I see you are into tess very deeply. I hope your project ends
> successfully.

So far, so good. But as I said, I don't really know what I'm doing
with this, all I'm able to do is swap experiences.

> I used "chop_enable 0" in a config file but it didn't change the
> output in any way.

If it doesn't make any difference, you can leave it out then.

> It is very odd that it had no effect on the output file. I
> filled out the DangAmbigs file too, but this had no effect either.
> Are there settings needed to make these files work? Whatever I do,
> the output is still the same.

DangAmbigs never had any effect for us, but since our environment is
so weird anyway, we weren't really surprised.

> I am using tesseract v2.01 on a windows system.
>
> The contents of the unicharset file is as follows:
>
> 15
> NULL f
> A f
> 0 f
> B f
> C f
> D 7
> 1 f
> 2 f
> 3 8
> 4 8
> 5 8
> 6 8
> 7 8
> 8 8
> 9 8

What is the purpose of specifying NULL, A, zero, B, C, 1, and 2 as
"f"?

From http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract/:

"The character in UTF-8 is followed by a hexadecimal number
representing a binary mask that encodes the properties. Each bit
corresponds to a property. If the bit is set to 1, it means that the
property is true. The bit ordering is (from least significant bit to
most significant bit): isalpha, islower, isupper, isdigit."

By setting the value to "f", you declare those characters to be
alphabetic AND lowercase AND uppercase AND digits ... and the "D 7"
declares D to be a letter that is both uppercase and lowercase at the
same time. If you were Tess, you would probably be confused too. :)
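
To make the bit mask concrete, here is a small sketch that decodes a hexadecimal property value using the bit ordering quoted from the wiki (isalpha, islower, isupper, isdigit, least significant bit first). `decode_mask` is a hypothetical helper written for this illustration, not part of tesseract.

```python
PROPERTIES = ("isalpha", "islower", "isupper", "isdigit")  # bit 0 .. bit 3


def decode_mask(hex_mask):
    """Return the names of the properties whose bits are set in the mask."""
    value = int(hex_mask, 16)
    return [name for bit, name in enumerate(PROPERTIES) if value & (1 << bit)]


for mask in ("f", "7", "5", "8"):
    print(mask, decode_mask(mask))
# f ['isalpha', 'islower', 'isupper', 'isdigit']
# 7 ['isalpha', 'islower', 'isupper']
# 5 ['isalpha', 'isupper']
# 8 ['isdigit']
```

Read this way, an uppercase letter would normally carry 5 and a digit 8.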

> I don't understand why I get an "o" character in my output! I've even
> checked the box file. There is no sign of an "o".

I wrote earlier that Tess seems to be happier when it has more
characters to choose from. That observation is about the extent of
what I can contribute, sorry.

> It is true that the MICR font contains monospaced characters, but
> chop_enable had no effect.
> If you look at the MICR font, there are 4 special characters besides
> the digits. The only problem I get is with the D character. The D
> character sometimes comes out as A0, A1, 81, or 0o1. Whatever I try, the
> result is always the same. I have problems with the D character only.
> And of course the "o" character, where I don't understand where it
> comes from.
> Do you have any thoughts on why this could happen?
>
> Can I use chop_enable while I am creating the box file?

I just followed the cookbook instructions at the URL above.

ttutuncu

Mar 31, 2008, 4:41:32 AM
to tesseract-ocr
Hi Frank,

The unicharset was created automatically, so I hadn't changed anything
in it before. But now I have changed it the way you said.
Is NULL supposed to be here?

15
NULL 0
A 5
0 8
B 5
C 5
D 5
1 8
2 8
3 8
4 8
5 8
6 8
7 8
8 8
9 8

What should I write for the NULL? Is "0" OK?
Now the "o" character doesn't appear :) But I still get problems with
the D character. Now it is shown as 001, 021 and 221 in the output. Do
you think I should create another training file with more samples of D
and combine it with the current training file? Will this help?
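
For reference, the corrected listing above can be regenerated from the character list with a short script. This is a sketch that assumes the layout seen in this thread (a count line, then one "character mask" line per entry) and that the NULL entry takes mask 0, which the thread does not confirm. Both function names are hypothetical. It also follows the listing in treating the MICR symbols A to D as uppercase letters.

```python
def property_mask(char):
    """Hex mask with bits isalpha=1, islower=2, isupper=4, isdigit=8."""
    mask = 0
    if char.isalpha():
        mask |= 1
    if char.islower():
        mask |= 2
    if char.isupper():
        mask |= 4
    if char.isdigit():
        mask |= 8
    return format(mask, "x")


def build_unicharset(chars):
    """Count line, an assumed 'NULL 0' entry, then one line per character."""
    lines = [str(len(chars) + 1), "NULL 0"]
    lines += [f"{char} {property_mask(char)}" for char in chars]
    return "\n".join(lines) + "\n"


# Same character order as the listing above.
print(build_unicharset("A0BCD123456789"), end="")
```

Generating the file this way avoids hand-editing mistakes, but the D problem described above is about recognition rather than the property flags, so it is unlikely to be fixed by this alone.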

Anurag Kalra

May 30, 2014, 1:22:19 PM
to tesser...@googlegroups.com, tarikt...@gmail.com
Hi,

This thread is pretty old, but I am trying my luck here. Were you able to successfully read the MICR data from bank checks? I need to extract the account and routing numbers from images of checks and am looking for a way to do that. I tested a couple of commercial libraries and they work well, but an open-source option would be awesome.
I tried using tesseract, but it is not able to do accurate OCR on MICR data. Were you able to train it to read the numbers correctly?

~ Anurag

sjsin...@gmail.com

Mar 22, 2019, 8:40:51 AM
to tesseract-ocr
I am a student working on this, but I don't know much about tesseract. Could someone guide me on how I can build my own OCR for cheques, please?
Any help is appreciated.

Shree Devi Kumar

Mar 22, 2019, 1:29:59 PM
to tesser...@googlegroups.com


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

Mar 22, 2019, 1:32:30 PM
to tesser...@googlegroups.com

sjsin...@gmail.com

Mar 22, 2019, 5:54:36 PM
to tesseract-ocr
Thanks for the help. Could you also guide me on how I can integrate it into my Python project? We can also discuss this further on gitter or devfolio @sjsingh101.

Sushant Patinge

Oct 6, 2023, 3:43:57 AM
to tesseract-ocr
I am using Pytesseract to detect text from cancelled cheques. I want to fetch the bank name, IFSC code, and MICR code from each cancelled cheque. For some cheques all the details are fetched, but for others the MICR code, IFSC code, or account number is missing from the response. Can anyone tell me which package to install to read the MICR font from cheques?