limit output to ASCII charset


haratron

May 25, 2010, 4:34:39 PM
to tesser...@googlegroups.com
http://www.linux.com/archive/feed/57222
"Also, it can generate output only in the US-ASCII character set, so
glyphs with accent marks or other unsupported attributes will probably
be reproduced incorrectly."

Which option limits output to the ASCII charset only?
Some letters, such as "a", are output as glyph symbols.

Jimmy O'Regan

May 25, 2010, 8:19:46 PM
to tesser...@googlegroups.com

That refers to an ancient version of Tesseract; since then, Tesseract
has added support for languages other than English, using Unicode by
default. I don't think there's any option to output to ASCII.

You might want to try something like unaccent (http://www.nongnu.org/unac/)

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

nguyenq

May 25, 2010, 11:09:21 PM
to tesseract-ocr
You can perform some text manipulations in post-processing steps to
strip out diacritical marks to leave only the base ASCII characters
behind.
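
As a rough illustration of that kind of post-processing step, here is a minimal Python sketch. It is not part of Tesseract; the function name strip_to_ascii is made up for this example, and it assumes the OCR output has already been read into a Unicode string. It decomposes accented letters and then drops everything that is not ASCII.

```python
import unicodedata


def strip_to_ascii(text):
    """Hypothetical helper: reduce OCR output to plain ASCII."""
    # NFKD decomposition splits an accented letter into its base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the combining marks, along with
    # any other character that has no ASCII decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(strip_to_ascii("caf\u00e9 na\u00efve"))
```

Note that characters with no ASCII decomposition are silently deleted rather than replaced, which may or may not be what you want.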

Sriranga(77yrsold)

May 26, 2010, 1:07:16 AM
to tesser...@googlegroups.com
Post-processing is an excellent idea.
-sriranga(77yrsold)



haratron

May 26, 2010, 3:23:32 AM
to tesser...@googlegroups.com
Post-processing is certainly not the same thing. If you restrict the
Tesseract engine itself to the ASCII charset, chances are that you
raise accuracy by forcing it to consider a more sensible alternative
than the glyphs.
Anyway, I found the answer to this one.
For anyone interested:
http://code.google.com/p/tesseract-ocr/wiki/FAQ
Search for the "only digits" section. Instead of the digits, you just
define your allowed characters (a-z in my case).
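
For illustration, a minimal Python sketch of generating such a config file. It assumes the FAQ approach relies on the tessedit_char_whitelist variable, written as one "name value" line in a config file; the helper names and the config name "letters" are hypothetical, and where the file has to live and how it is passed to tesseract depend on the installed version.

```python
import string


def whitelist_config(allowed=string.ascii_lowercase):
    """Hypothetical helper: build the one-line config file contents."""
    # Tesseract config files hold one "variable value" pair per line.
    return "tessedit_char_whitelist " + allowed + "\n"


def write_config(path, allowed=string.ascii_lowercase):
    """Hypothetical helper: write the whitelist config to the given path."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(whitelist_config(allowed))


if __name__ == "__main__":
    # "letters" is an assumed file name; the config is then named on the
    # tesseract command line after the output base.
    write_config("letters")
    print(whitelist_config(), end="")
```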

Sriranga(77yrsold)

May 26, 2010, 4:12:03 AM
to tesser...@googlegroups.com
Good luck, and please post the results of your experiments.
-sriranga(77yrsold)