Tesseract is the Debian testing package, tesseract-ocr_2.03-1_i386.deb. The
English data package is tesseract-ocr-eng_2.00-1_all.deb.
Sorry it took 3 posts to ask this question. ;)
Remi, thank you for the suggestion!
The image was already 150 DPI. I have now tried the grayscale version
(attached), but tesseract doesn't find any characters in it (it outputs just a
single space char).
However, I converted it to BW again and it correctly identified all of the
chars! (see bw_test.tif) If you compare bw_test.tif with b.tif (from the previous
post) there is not much difference, at least to the human eye... Interesting. :)
But I know, tesseract should work on the grayscale image. Does anybody know
why it doesn't?
Another thing, tesseract seems to be able to read much more if I run it over
the whole A4 document - the read-out is far from perfect ('5' instead of '6'
for instance), but at least it reads something...
Any idea what could be done to make it better?
Thank you!
Let me answer my own question: scale the image by 200% (enlarge it). It
looks like the characters have some "ideal" height that has a great impact on
OCR accuracy.
Could anybody please comment on that?
What would the ideal font size be for the default data set?
Thanks! :)
Write these lines to /usr/share/tesseract-ocr/tessdata/configs/nodict :
ok_word 0
good_word 0
non_word 0
Then run tess like this:
tesseract b.tif output /usr/share/tesseract-ocr/tessdata/configs/nodict
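For anyone who wants to try this without touching the system directory first, here is a minimal sketch that writes the same three lines with a here-document. The scratch directory name is my own assumption, not something tesseract requires; copying the file into /usr/share/tesseract-ocr/tessdata/configs/ afterwards would need root.

```shell
# Scratch location for the config file (an assumption for illustration;
# the post above puts it straight into the tessdata/configs directory).
config_dir="${TMPDIR:-/tmp}/tess-configs"
mkdir -p "$config_dir"

# The three settings from the post, one "name value" pair per line.
cat > "$config_dir/nodict" <<'EOF'
ok_word 0
good_word 0
non_word 0
EOF

# Show what was written.
cat "$config_dir/nodict"
```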
This is fun... :)
Any other ideas? I still can't get anything from this grayscale tif...
Thanks!
Thanks, that helps - I'll just try different sizes and decide on the best
one. :)
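Trying several sizes could be scripted roughly as follows. This is only a sketch: it assumes ImageMagick's convert is installed (it is not mentioned anywhere in this thread), and the function name try_scales and the output file names are made up for illustration.

```shell
# try_scales IMAGE PCT...
# For each percentage, write a resized copy of IMAGE and run tesseract on it.
# Assumes ImageMagick's `convert` and `tesseract` are both on PATH.
try_scales() {
    image=$1
    shift
    base=${image%.tif}
    for pct in "$@"; do
        scaled="${base}_${pct}.tif"
        # -resize with a percentage scales width and height by that factor.
        convert "$image" -resize "${pct}%" "$scaled"
        # tesseract appends .txt to the output base name itself.
        tesseract "$scaled" "${base}_${pct}"
    done
}
```

Calling it as `try_scales b.tif 100 150 200 300` should leave b_100.txt, b_150.txt and so on next to the image, which can then be compared by eye to pick the size that reads best.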
> another thing you can try is probably restricting the output character
> set if you know that there are only alphabets and numbers? There's a
> guide in the FAQ on this matter... "how to recognise digits only?"
> smthg like that...
I have tried:
http://code.google.com/p/tesseract-ocr/wiki/FAQ
But if I use this:
tessedit_char_whitelist 0123456789
I get this error:
error: Could not find variable 'tessedit_char_whitelist'
My guess is that the documentation is outdated?
If nothing else I could train tess, though I would rather not... it seems
labour-intensive. :)
Thanks again, I appreciate it!
> > > another thing you can try is probably restricting the output character
> > > set if you know that there are only alphabets and numbers? There's a
> > > guide in the FAQ on this matter... "how to recognise digits only?"
> > > smthg like that...
> > ...
> > I get this error:
> > error: Could not find variable 'tessedit_char_whitelist'
> ...
> I ran into the same problem
>
> u need to use tesseract 2.03.
I use tesseract 2.03-1 (debian)... Does it work for you in 2.03? Which OS do
you use?
Thanks!
The problem was in the way I was calling the tess executable:
*****
$ tesseract test.tif out /usr/share/tesseract-ocr/tessdata/configs/digits
error: Could not find variable 'tessedit_char_whitelist'
$ tesseract test.tif out digits
Could not open file, digits
$ tesseract test.tif out nobatch digits
Tesseract Open Source OCR Engine
*****
It might be obvious, but it wasn't to me - it looks like you need to
use the "nobatch" parameter before the config name. It would be nice to have
command-line help...
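Since the suggestion quoted earlier was about letters and numbers rather than digits only, the same mechanism might be extended with a custom whitelist file. A rough sketch, with several assumptions: the file name alnum and the scratch directory are invented here, and I am assuming a file placed in tessdata/configs is picked up by name the same way digits is in the transcript above.

```shell
# Scratch location (an assumption; tesseract itself looks in tessdata/configs).
config_dir="${TMPDIR:-/tmp}/tess-configs"
mkdir -p "$config_dir"

digits=0123456789
upper=ABCDEFGHIJKLMNOPQRSTUVWXYZ
lower=abcdefghijklmnopqrstuvwxyz

# Same variable the digits config uses, with a longer character list.
printf 'tessedit_char_whitelist %s\n' "${digits}${upper}${lower}" > "$config_dir/alnum"

# Show what was written.
cat "$config_dir/alnum"
```

After copying alnum into /usr/share/tesseract-ocr/tessdata/configs/ (needs root), the call would presumably be `tesseract test.tif out nobatch alnum`, but I have not verified that.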
Thank you again, it helps a lot!
Best,
Andrew