On Mar 31, 6:26 am, ttutuncu <tariktutu...@gmail.com> wrote:
> Thank you Frank for your detailed reply.
>
> I have already trained tess the MICR font, I have no problem with
> that.
> So you say that there is no way other than masking the part where the
> cheque number is.
As far as my experience goes (which isn't all that far), if it's a
controlled environment and it's possible to mask the image, that seems
like the simplest thing to do. Tesseract doesn't seem to have a notion
of known unknowns ("things that we know we don't know", in the words of
the well-known culprit): it will map every blob to something in its
trained charset.
> What do you mean by: "a minimal training page containing an
> example of each of the rest of the characters in the alphabet in
> another font should do the trick" ?
> The training file I did only contains the characters in the MICR font.
I'm working on a little research project to extract the income
portions from some Japanese financial statements. The first thing I
tried was training Tess for a numbers-only language, and the results
were disappointing -- the recognition rate wasn't great, and since
everything came back as a number, the text was too ambiguous to do
anything with.
When Tess was trained to recognize about 100 Japanese characters plus
digits, things improved considerably. On the small set of samples
I've run so far, I'm getting about 60% confirmed totals (every number
in the statement recognized to the digit), on documents harvested from
a variety of sources in the wild.
This is just an empirical observation; I don't know anything about how
Tess works internally. But providing alternative "noise" characters in
the training set seemed to help improve the recognition of digits in
our case, and it gave us a bit more variation to work with in the
output text, which helped when extracting the data we were interested
in.
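To give a concrete (if simplified) picture of what I mean by "more
variation to work with": once labels come back alongside the digits, we
can anchor on a label instead of guessing which bare number is which.
The label, file name, and pattern below are invented for illustration;
this is not our actual extraction code.

    # Sketch only: pull a figure out of OCR output by anchoring on a
    # nearby label, rather than treating every digit run as a candidate.
    # The label, file name, and regex are invented for illustration.
    import re

    ocr_text = open("statement.txt", encoding="utf-8").read()
    for line in ocr_text.splitlines():
        if "収入" in line:  # a hypothetical "income" label in the output
            runs = re.findall(r"\d[\d,]*", line)
            if runs:
                income = int(runs[0].replace(",", ""))
                print("income:", income)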
> What does enable_chop do?
This was just another empirical observation. :/ I actually don't
know what it does, but with enable_chop 1 (the default, I think), Tess
tries to split some Japanese characters, and breaks them in the
process. With enable_chop 0, that doesn't happen, and on our text,
there does not seem to be any drop in recognition elsewhere. I assume
that this has something to do with the fact that Japanese fonts (and
digits) are monospaced. But it is very possible that I don't know
what I'm talking about.
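If you want to try it yourself, we just set the variable in a small
config file and name that file on the tesseract command line. The
sketch below is how I drive it from Python; the image, output base, and
language code are invented, and it assumes a config file called
"nochop" containing the single line "enable_chop 0", placed where
tesseract looks for configs (tessdata/configs/ in our setup).

    # Sketch only: run tesseract with a config that sets enable_chop to 0.
    # Assumes a config named "nochop" (one line: "enable_chop 0") under
    # tessdata/configs/; file names and language code are invented.
    import subprocess

    subprocess.run(
        ["tesseract", "page.tif", "out", "-l", "jpn", "nochop"],
        check=True,
    )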
> Why do I sometimes get an "o" character instead of a "0"(zero)
> character in my results even though it is not in my charset?
That is an odd one. We found that recognition rates suffered when
each training page did not cover exactly the same set of characters.
But if there is no "o" lurking in one of your box files ... that would
be odd. We certainly don't get any Roman characters in our output,
although we do get a mass of (mostly wrong) Japanese characters
corresponding to blobs that Tess can't recognize correctly.
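One quick thing you could do is sanity-check the box files against the
charset you intended to train, along these lines. It assumes the usual
box format of one entry per line starting with the character, and the
allowed set below is just a placeholder for whatever your MICR training
actually uses.

    # Sketch only: flag any character in the box files that is outside
    # the charset you meant to train. Assumes each box line starts with
    # the character followed by its coordinates; the allowed set is a
    # placeholder, not your real MICR charset.
    import glob

    allowed = set("0123456789ABCD")  # stand-in for your MICR symbols
    for path in glob.glob("*.box"):
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, 1):
                parts = line.split()
                if not parts:
                    continue
                if parts[0] not in allowed:
                    print(f"{path}:{lineno}: unexpected character {parts[0]!r}")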
I'm running Tess on Linux; I don't know whether that makes a
difference.
Frank