Yes, Tesseract blacklists and whitelists are useful almost
exclusively in situations where the blacklisted characters really do
not appear anywhere in the image (otherwise Tesseract will return the
next best guess, no matter how poor) or, vice versa, where the image
contains only the whitelisted characters.
The solution for achieving what you want would be to set a variable
telling Tesseract to ignore any match it finds below a specified
confidence level. I wouldn't be surprised if there is such a variable,
but I have no idea what it is.
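Short of such a variable, a similar effect can be had by filtering
after recognition. The sketch below is not a Tesseract setting: it
assumes you have already obtained (word, confidence) pairs from
whatever wrapper you use, and the function name, the sample data and
the threshold of 60 are all made up for illustration.

```python
def drop_low_confidence(words, min_conf=60.0):
    """Return the text of each (text, confidence) pair that is non-blank
    and whose confidence is at or above min_conf."""
    return [
        text
        for text, conf in words
        if text.strip() and float(conf) >= min_conf
    ]


# Hypothetical recognition result: (text, confidence on a 0-100 scale).
sample = [("123", 91.0), ("x", 12.0), ("45", 60.0), (" ", 95.0)]
kept = drop_low_confidence(sample)
```

Here kept holds only the words at or above the threshold, so a poor
"next best guess" is dropped instead of being reported.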
We take a different approach to detecting numbers with tolerance for
errors: we define in our regular expressions a long list of letters we
accept as digits and then convert them, but we do that only when it
helps us complete a pattern. For example:
- in (88B)1G2-2345 we accept and map to (888)162-2345
- but in "BB (123)456-7861" we leave the B's alone
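A minimal sketch of that idea in Python follows. It is not our actual
code: the look-alike table, the phone-number pattern and the function
name are assumptions chosen only to reproduce the two examples above.

```python
import re

# Hypothetical table of letters that OCR tends to confuse with digits.
LOOKALIKES = {
    "B": "8", "G": "6", "O": "0", "o": "0",
    "I": "1", "l": "1", "S": "5", "Z": "2",
}

# Character class matching a real digit or one of the look-alike letters.
DIGITISH = "[0-9" + "".join(LOOKALIKES) + "]"

# Assumed pattern: (ddd)ddd-dddd, where each d may be a look-alike letter.
PHONE = re.compile(
    r"\(" + DIGITISH + r"{3}\)" + DIGITISH + r"{3}-" + DIGITISH + r"{4}"
)


def fix_phone_numbers(text):
    """Map look-alike letters to digits, but only inside pattern matches."""
    def to_digits(match):
        return "".join(LOOKALIKES.get(ch, ch) for ch in match.group(0))
    return PHONE.sub(to_digits, text)


fixed = fix_phone_numbers("(88B)1G2-2345")      # letters inside the pattern
kept = fix_phone_numbers("BB (123)456-7861")    # letters outside the pattern
```

Because the substitution runs only on text the pattern has matched,
letters outside a phone number are never touched.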
Patrick
On Jul 1, 8:35 am, 8flm6 <8f...@gmx.de> wrote:
> Hello,
> I'm trying to apply White- and Blacklists to my OCR-result. If I call:
> SetVariable("tessedit_char_whitelist", "0123456789")
>
> Then all characters in the result are converted to numbers between 0
> and 9. Is that the correct behaviour of this option? As I understand
> a whitelist, only those characters which are defined in the list
> should be returned; all others should be blocked.
> The same with the blacklist. I call:
> SetVariable("tessedit_char_blacklist", "0123456789")
>
> This option converts all occurrences of numbers to random characters.
>
> This is the image I used:
https://docs.google.com/leaf?id=0B2ifXewLRYsdMzY3MzIwMTUtZTkxNS00ZDM1...