Is there a way to train Tesseract to NOT output/recognize a character?

TobiasS

Jun 4, 2012, 11:05:30 AM
to tesseract-ocr
Hi,

I'm having some difficulty training Tesseract on a custom font. In
particular, the text I'm scanning contains control characters that I
do not want in the output. I excluded those characters from my box
files, but as a result they often get recognized as another,
similar-looking character.

Is it possible to train Tesseract to not output/recognize a character?

Options I'm considering:
- Map the control characters to nothing.
- Map the control characters to unused Unicode characters and
blacklist those.
- Pre-process the image to find and remove the symbols.

Any tips/input on the viability of any of these options or a better
approach would be appreciated!

Sincerely,
Tobias S

Debayan Banerjee

Jun 4, 2012, 12:08:22 PM
to tesser...@googlegroups.com


On 4 June 2012 20:35, TobiasS <tseb...@gmail.com> wrote:

> Is it possible to train Tesseract to not output/recognize a character?

Try Tesseract's blacklist feature (the tessedit_char_blacklist variable).

--
Debayan Banerjee
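
For readers who want to try the blacklist suggestion, here is a minimal sketch in Python. It assumes the control symbols were trained as two stand-in characters ("§" and "¤" are hypothetical choices, not anything from this thread) and that the installed Tesseract accepts a config file as a trailing command-line argument. The helper names, file names, and language name are made up for illustration. The config sets tessedit_char_blacklist, the variable behind the blacklist feature.

```python
def blacklist_config_text(placeholders):
    """Return the contents of a Tesseract config file that blacklists
    the given stand-in characters."""
    # A Tesseract config file holds one "variable value" pair per line.
    return "tessedit_char_blacklist " + "".join(placeholders) + "\n"


def build_command(image, output_base, lang, config_path):
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG CONFIGFILE."""
    return ["tesseract", image, output_base, "-l", lang, config_path]


# Hypothetical stand-ins for the two control symbols.
PLACEHOLDERS = ["§", "¤"]

if __name__ == "__main__":
    # Show the config file contents and the command that would use it.
    print(blacklist_config_text(PLACEHOLDERS), end="")
    print(" ".join(build_command("scan.png", "out", "myfont", "no_controls")))
```

A blacklisted character may be replaced by the engine's next-best guess rather than dropped, so it is worth checking the output on a few sample scans before relying on this.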

TobiasS

Jun 4, 2012, 12:51:55 PM
to tesseract-ocr
Yes, but the issue with the blacklist is that the control characters
are not part of the Unicode character set (or of any character set;
they are symbols). If possible, I would like a cleaner solution than
recognizing them, mapping them to an arbitrary character, and then
blacklisting it.

Sven Pedersen

Jun 6, 2012, 12:48:47 AM
to tesser...@googlegroups.com
Hi Tobias,
In the form-processing industry, control characters are typically
recognized and then discarded -- that allows better debugging and
calibration than just ignoring them entirely.
--Sven

La Monte H. P. Yarroll

Jun 6, 2012, 1:04:51 PM
to tesser...@googlegroups.com
Am I the only one wondering what a printable control character might look like? To me, a "control character" is something like carriage return or form feed, which doesn't have a printable representation.

Sven Pedersen

Jun 6, 2012, 1:12:41 PM
to tesser...@googlegroups.com
In this case we mean the special delimiter symbols you find at the
bottom of a check or form. They allow systems to tell that the
document is aligned correctly in the feed, or to calibrate
distances -- you find them in MICR fonts
(http://en.wikipedia.org/wiki/Magnetic_ink_character_recognition) such
as E-13B, or in OCR fonts such as OCR-B.
--Sven

Robert Komar

Jun 6, 2012, 1:31:48 PM
to tesser...@googlegroups.com
On Wed, 6 Jun 2012, La Monte H. P. Yarroll wrote:

> Am I the only one wondering what a printable control
> character might look like? To me "control character" is a
> thing like carriage return or form feed which doesn't have
> a printable representation.

Those actually are "printable" because they do affect
the output. Most control codes control other aspects of
a terminal besides what gets printed, and those are the
ones I would call non-printable (e.g. ^G - bell,
^D - EOT, Esc, ...).

Cheers,
Rob Komar

Tobias Sebring

Jun 7, 2012, 4:51:25 AM
to tesser...@googlegroups.com
Thanks for your input on this issue. I will go down the recognize-as-an-arbitrary-character route and handle those characters in my own code after OCR.

/Tobias
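
The approach settled on above (train the control symbols as stand-in characters, then handle them after OCR) could look roughly like the following Python sketch. The stand-in characters and the function name are hypothetical, since the thread does not say which characters were used. It also keeps a record of where each symbol was found, in line with the earlier point that recognizing and then discarding helps with debugging and calibration.

```python
def split_control_symbols(text, placeholders):
    """Separate stand-in control symbols from recognized text.

    Returns the cleaned text plus a list of (index, symbol) pairs giving
    the position of each stand-in in the original OCR output.
    """
    cleaned = []
    found = []
    for index, ch in enumerate(text):
        if ch in placeholders:
            # Keep a record for debugging instead of silently dropping it.
            found.append((index, ch))
        else:
            cleaned.append(ch)
    return "".join(cleaned), found


# Hypothetical stand-ins chosen when creating the box files.
PLACEHOLDERS = {"§", "¤"}

if __name__ == "__main__":
    cleaned, found = split_control_symbols("§123¤456§", PLACEHOLDERS)
    print(cleaned)  # 123456
    print(found)    # [(0, '§'), (4, '¤'), (8, '§')]
```

The stand-ins should be characters that cannot occur in the real text, otherwise legitimate output would be stripped as well.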
