Trouble with White- and Blacklists

8flm6

Jul 1, 2011, 8:35:45 AM

to tesseract-ocr

Hello,
I'm trying to apply whitelists and blacklists to my OCR result. If I call:
SetVariable("tessedit_char_whitelist", "0123456789")

Then all characters in the result are converted to digits between 0
and 9. Is that the correct behaviour of this option? As I understand a
whitelist, only the characters defined in the list should be returned,
and all others should be blocked.
The same happens with the blacklist. I call:
SetVariable("tessedit_char_blacklist", "0123456789")

This option converts all occurrences of digits to random characters.

This is the image I used:
https://docs.google.com/leaf?id=0B2ifXewLRYsdMzY3MzIwMTUtZTkxNS00ZDM1LTllYjgtN2NhMjU0MzRkNWQ4&hl=de

Example results:
normal output:
Tesseract 3.00
123456789

whitelist output:
1185587301 3100
123456789

blacklist output:
Tesseract B.OO
QBASGYSQ

Any help would be appreciated!

thanks

patrickq

Jul 1, 2011, 9:15:35 AM

to tesseract-ocr

Yes, Tesseract blacklists and whitelists are useful almost
exclusively in situations where the blacklisted characters really
don't appear anywhere in the image (otherwise Tesseract will return
the next-best guess, no matter how poor) or, conversely, where the
image contains only the whitelisted characters.

The solution for achieving what you want is to set a variable telling
Tesseract to ignore any match it finds below a specified confidence
level. I wouldn't be surprised if there is such a variable but I have
no idea what it is.
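
Lacking such a variable, the same effect could be approximated after recognition. The following is only a minimal sketch: it assumes you can obtain a confidence value (0 to 100) for each recognized character from your Tesseract wrapper, and the function name, threshold, and sample data are hypothetical.

```python
from typing import Iterable, Tuple


def filter_by_confidence(symbols: Iterable[Tuple[str, float]],
                         min_conf: float = 60.0,
                         placeholder: str = "") -> str:
    """Keep characters whose confidence is at least min_conf.

    Characters below the threshold are replaced by `placeholder`
    (dropped entirely by default).
    """
    return "".join(ch if conf >= min_conf else placeholder
                   for ch, conf in symbols)


# Hypothetical recognition result: (character, confidence in percent).
symbols = [("1", 95.0), ("1", 31.0), ("8", 28.5), ("3", 91.2)]
print(filter_by_confidence(symbols))  # prints "13"
```

The threshold of 60 is arbitrary; a suitable value would have to be found by experiment on your own images.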

We take a different approach to detecting numbers with tolerance for
errors: in our regular expressions we define a long list of letters
that we accept as digits and then convert them, but we do that only
when it helps us complete a pattern. For example:
- in (88B)1G2-2345 we accept and map to (888)162-2345
- but in "BB (123)456-7861" we leave the B's alone
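
A minimal sketch of this idea, reproducing the two examples above. This is not Patrick's actual code: the look-alike table, the phone-number pattern, and the function name are all assumptions made for illustration.

```python
import re

# Assumed look-alike table; tune it for your own fonts and data.
LOOKALIKES = str.maketrans(
    {"B": "8", "G": "6", "O": "0", "I": "1", "l": "1", "S": "5", "Z": "2"}
)

# A "digit" is a real digit or a letter we are willing to read as one.
D = "[0-9BGOIlSZ]"

# Hypothetical pattern: (ddd)ddd-dddd
PHONE = re.compile(rf"\({D}{{3}}\){D}{{3}}-{D}{{4}}")


def fix_phone_numbers(text: str) -> str:
    """Map look-alike letters to digits, but only inside pattern matches."""
    return PHONE.sub(lambda m: m.group(0).translate(LOOKALIKES), text)


print(fix_phone_numbers("(88B)1G2-2345"))     # prints "(888)162-2345"
print(fix_phone_numbers("BB (123)456-7861"))  # prints "BB (123)456-7861"
```

Because the substitution only touches text that completes the pattern, the leading "BB" in the second example is outside the match and stays unchanged.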

Patrick

8flm6

Jul 1, 2011, 3:59:26 PM

to tesseract-ocr

Yes, filtering by regular expressions sounds good, though I had
hoped Tesseract could do this on its own. I might also try a
traineddata set limited to numbers, in addition to the whitelists
and blacklists. Maybe that works; I will post my results when finished.

thanks for your reply!

8flm6

John Brohan

Jul 2, 2011, 7:06:29 AM

to tesser...@googlegroups.com

Hi

I think the result is perfectly correct.

To get just the numbers, surely you must use the whitelist instead of the blacklist, and then go through your output and replace all non-numerics with a space!

I expect you will need some punctuation too (+ - , . : etc.). If these occur in the text part, they need to be thrown away as well. For example, would it be OK to keep a punctuation character only if it is followed by a digit?
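
One possible reading of this filtering step, as a minimal sketch applied to plain OCR output text. The punctuation set and the rule "keep punctuation only when a digit follows" are taken from the suggestion above; the function name is hypothetical.

```python
import re

# Matches any character (except a newline) that is neither a digit
# nor one of + - , . : immediately followed by a digit.
NON_NUMERIC = re.compile(r"(?![0-9])(?![+\-,.:](?=[0-9])).")


def keep_numeric(text: str) -> str:
    """Replace every non-numeric character with a space, line by line."""
    return NON_NUMERIC.sub(" ", text)


cleaned = keep_numeric("Tesseract 3.00\n123456789")
print(cleaned.split())  # prints ['3.00', '123456789']
```

Replacing with a space rather than deleting keeps the column positions of the surviving digits, which can help when the numbers have to be matched back to their place in the image.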

Good Luck

John

Haah H

May 21, 2017, 2:49:11 PM

to tesseract-ocr

Actually, it doesn't support regular expressions; the whitelist is supported only as an enumeration of characters.
