Characters whitelist.


MARTIN Pierre

Jul 22, 2011, 12:54:32 PM
to tesser...@googlegroups.com
Hello,

I was previously using an older version of tesseract and have switched to svn HEAD. I now have an issue I didn't have with the previous version. Before each recognition I set the whitelist parameter to only the numerical digits plus "<" and ">". Also, I'm using trained data I created from scratch, which contains the whole alphabet for this font...

The command I use is:
[My stuff...]
_tessApi->SetVariable("tessedit_char_whitelist", "><0123456789");
[Start recognition...]
Sample of a result I get:
3000657806S<00S60':0<3000657B0<

As you can see, the whitelist is completely ignored: the result contains S, ', : and B. With the previous version the whitelist helped tesseract a lot, because some characters could not be "mistaken" for others.
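
For anyone trying to reproduce this, here is a minimal sketch in plain C++ (no Tesseract calls) that lists which characters of a result fall outside the whitelist. The helper name `outsideWhitelist` is made up for illustration and is not part of the Tesseract API. It compares byte by byte, which is only adequate for an ASCII whitelist like the one above.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not a Tesseract function): returns the distinct
// characters of `text` that are not in `whitelist`, in order of first
// appearance. Whitespace is skipped, since OCR output usually carries
// spaces and newlines that a whitelist is not expected to list.
std::string outsideWhitelist(const std::string& text, const std::string& whitelist) {
    std::string offenders;
    for (char c : text) {
        if (std::isspace(static_cast<unsigned char>(c))) {
            continue;
        }
        if (whitelist.find(c) == std::string::npos &&
            offenders.find(c) == std::string::npos) {
            offenders.push_back(c);
        }
    }
    return offenders;
}
```

Called with the sample result and the whitelist "><0123456789", it should return the four characters S, ', : and B.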

Do you have any idea what I'm doing wrong here?
Thanks a lot for your kind help!
Pierre.

MARTIN Pierre

Jul 27, 2011, 1:37:10 AM
to tesser...@googlegroups.com
Hello,

I'm bumping my own question… I'm not sure whether I was correctly subscribed to the list, so I just renewed my subscription. Could anyone on the list confirm receipt, even if there is no answer to it?

Thanks,
Pierre.

Dmitri Silaev

Jul 27, 2011, 3:59:52 AM
to tesser...@googlegroups.com, hick...@gmail.com
If you want someone here to dig into your issue, you should give as
much info as possible about it.
From what you've given no one can reproduce it, and reproduction is a
common method to solve issues.
Show us your images, full code snippet, config files, etc. Then maybe
you'll get the answer.

Warm regards,
Dmitri Silaev
www.CustomOCR.com


MARTIN Pierre

Jul 27, 2011, 4:41:56 AM
to tesser...@googlegroups.com
> Show us your images, full code snippet, config files, etc. Then maybe you'll get the answer.
Below, I have tried to gather the logic executed in a typical workflow of my application (the C++ code is generated dynamically by a tool of mine, so there may be some typos from when I reorganized it):

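_tessApi = new tesseract::TessBaseAPI();
// Init() returns 0 on success; "./" and the custom "cst" language are from my setup.
if (_tessApi->Init("./", "cst") != 0) { /* handle the initialization failure */ }
_tessApi->SetPageSegMode(tesseract::PSM_SINGLE_LINE);
// Restrict recognition to the digits plus '<' and '>' (set after Init()).
_tessApi->SetVariable("tessedit_char_whitelist", "><0123456789");
// 1 byte per pixel; emulatedBPL() gives the number of bytes per line.
_tessApi->SetImage(pipelineBottom.data(), pipelineBottom.width(), pipelineBottom.height(), 1, pipelineBottom.emulatedBPL());
char *text = _tessApi->GetUTF8Text();
// However at this point, text contains some characters which are NOT in the whitelist!
delete [] text;  // the caller owns the buffer returned by GetUTF8Text()
_tessApi->Clear();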

I hope this helps in understanding my problem.
Thanks,
Pierre.

MARTIN Pierre

Jul 27, 2011, 4:35:57 AM
to tesser...@googlegroups.com
Hello Dmitri and thanks for your help,

> If you want someone here to dig into your issue, you should give as much info as possible about it.
Well, I thought I did.

> From what you've given no one can reproduce it, and reproduction is a common method to solve issues.
I really don't need anyone to reproduce it. I'm only asking whether anyone has had the same issue with relatively recent source code. I'm using it from C++ through the wrapper API.
Basically: is the behavior of "tessedit_char_whitelist" the same as before, or did it change in some way? In previous versions it was not only a hint to the classifier; it also completely prevented it from following learnt paths other than the ones in the whitelist. Now it seems to be different, because (again, whatever the image is) the output contains characters that are not in this whitelist.

The basic idea of a whitelist is a safer blacklist: a blacklist excludes a few possibilities, while a whitelist allows only a given set. I would like to know whether this behavior is still the same, that's it.
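
To make the whitelist/blacklist relation concrete, here is a small sketch in plain C++ that derives the blacklist excluding exactly what a whitelist would exclude, given the full set of characters the trained data can output. The helper name `blacklistFor` and the example alphabet are assumptions for illustration only. The result could in principle be passed to `tessedit_char_blacklist`, but whether that variable behaves any differently from the whitelist in svn HEAD has not been tested here.

```cpp
#include <string>

// Hypothetical helper (not a Tesseract function): given every character the
// trained data can produce (`alphabet`) and the characters to allow
// (`whitelist`), returns the characters a blacklist would have to contain in
// order to exclude the same set. Byte-wise comparison, so ASCII only.
std::string blacklistFor(const std::string& alphabet, const std::string& whitelist) {
    std::string blacklist;
    for (char c : alphabet) {
        if (whitelist.find(c) == std::string::npos &&
            blacklist.find(c) == std::string::npos) {
            blacklist.push_back(c);
        }
    }
    return blacklist;
}
```

The blacklist grows with the size of the trained alphabet, which is the reason a whitelist is the more convenient of the two when only a handful of characters are allowed.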

> Show us your images, full code snippet, config files, etc. Then maybe you'll get the answer.
Whatever the image is, the output contradicts a basic rule (the whitelist) which used to work when I first started using tesseract years ago with older versions.
As a code snippet, the only useful piece of code I can think of copy-pasting is this line:
_tessApi->SetVariable("tessedit_char_whitelist", "><0123456789");
But the result I get contains other characters, not allowed by the whitelist:
3000657806S<00S60':0<3000657B0<
Again: I'm using a fresh svn HEAD version of tesseract via the C++ wrapping API.

Would it be possible for anyone here to give me a snippet of a working whitelist, as it conceptually worked in the previous versions?

Thanks a lot,
Pierre.

Subhodeep Maji

Feb 27, 2017, 1:36:10 PM
to tesseract-ocr, hick...@gmail.com
Did you get any answer? I am facing the same issue.