noise output

zdravco

Mar 4, 2011, 9:25:21 AM

to tesseract-ocr

Hello,

I am using tesseract in my project after some image pre-processing.
There are some false negatives I was hoping tesseract would eliminate
by producing no output. However, sometimes there is a strange output
that I get from almost blank images.
Here is the sample image:
https://picasaweb.google.com/zdravco/TesseractTest#5580227257541654274

When I run it with tesseract rev. 552 using the English language I get:
" \\\\ R \."

Does anyone know whether there are options in tesseract that could
eliminate this noise, or whether I could improve my input image with
some further pre-processing? I have also tried recompiling tesseract
with "textord_heavy_nr" set to TRUE, but then the output is:
"an \\“ R \".

Thanks,
Zdravko

Dmitry Silaev

Mar 5, 2011, 12:50:12 AM

to tesser...@googlegroups.com

Zdravko,

You should do text detection before passing images to Tesseract.
Text detection is the process of determining which image regions
contain text. Even if an image contains no text, Tesseract will still
treat it as an image of text.

Before recognition Tess applies a so-called binarization algorithm,
which converts an RGB image to a monochrome one (black for text and
white for background). For your sample image the Otsu binarization
used in Tesseract (http://en.wikipedia.org/wiki/Otsu%27s_method) would
certainly give a number of skewed vertical lines resembling
backslashes, which the recognition stage then classifies as such.
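
To illustrate why a nearly blank image still produces "ink", here is a
minimal plain-Python sketch of Otsu's method. It is not Tesseract's
actual code; the function names and the flat list of 0..255 gray values
are assumptions made for the example. Otsu always picks some threshold
that splits the pixels into two classes, so even faint background
variation ends up partly black.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels <= t form the dark class, pixels > t the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break               # light class empty, nothing left to split
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (m0 - m1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, t):
    """True marks 'ink' (dark) pixels, False marks background."""
    return [p <= t for p in pixels]


# Hypothetical near-blank image: gray values only vary between 120 and 130.
near_blank = [120, 122, 124, 126, 128, 130] * 10
t = otsu_threshold(near_blank)
ink = binarize(near_blank, t)
```

With this input the threshold lands inside the narrow 120..130 band, so
part of an essentially empty image is still marked as ink and handed to
the recognizer.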

"textord_heavy_nr" and some other variables control size-based noise
removal, but they work satisfactorily only when there is a significant
body of good text surrounded by some amount of noise. In your image
everything is noise, so it won't work.

Therefore you need to extend your pre-processing so that you feed Tess
only images that actually contain text. The decision can be based on
contrast estimation, distinctive color distribution, etc.
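
As one rough way to act on the contrast idea, the sketch below measures
the spread between the 5th and 95th percentile of the gray values and
skips OCR when that spread is small. The function names, the percentile
choice and the cutoff of 50 are all assumptions for illustration and
would need tuning on real images.

```python
def contrast_spread(pixels, low=0.05, high=0.95):
    """Difference between a high and a low percentile of the gray values."""
    if not pixels:
        return 0
    ordered = sorted(pixels)
    n = len(ordered)
    lo_value = ordered[int(low * (n - 1))]
    hi_value = ordered[int(high * (n - 1))]
    return hi_value - lo_value


def probably_has_text(pixels, min_spread=50):
    """Heuristic gate: only images with enough contrast go on to OCR.

    min_spread is an arbitrary cutoff chosen for this example.
    """
    return contrast_spread(pixels) >= min_spread


# Hypothetical inputs: a near-blank image and a page with dark text on white.
near_blank = [120, 122, 124, 126, 128, 130] * 10
text_page = [240] * 90 + [20] * 10

skip_ocr = not probably_has_text(near_blank)
run_ocr = probably_has_text(text_page)
```

A check like this only rejects low-contrast inputs. An image whose text
covers less than about 5% of the pixels would also be rejected with
these percentiles, and a high-contrast image without text would still
pass, so it is a first filter rather than real text detection.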

HTH

Warm regards,
Dmitry Silaev

Saurabh Gandhi

Mar 5, 2011, 12:56:27 AM

to tesser...@googlegroups.com, Dmitry Silaev

Hey,

Any algorithm / whitepaper suggestions for text extraction, especially when the text is not overlay text but part of the image itself? Most algorithms I have seen on the internet are compute-intensive.

--
Regards,
Saurabh Gandhi

Dmitry Silaev

Mar 5, 2011, 1:22:10 AM

to tesser...@googlegroups.com

There are tons of them. I believe no ready-made recipe can be used
universally; this is very task-specific, especially for photographic
images. I also believe that to do good text detection your algorithm
should to some extent mimic human behavior, so it should probably be
multi-stage, gradually refining results at every stage. Don't count on
getting a working code snippet from the internet; most likely you'll
have to write the code yourself.

Here are some articles I picked out when I was studying document image
processing on my own. There may be newer ones by now, but these can
give you a foundation. Apologies, I have no time to provide direct
references and author names; I only listed my file system directory on
this topic. You can Google the exact article titles to find links.

1990 Scale-Space and Edge Detection Using Anisotropic Diffusion.pdf
1998 Edge detection and ridge detection with automatic scale selection.pdf
2001 Edge-Based Method for Text Detection from Complex Document Images.pdf
2001 TEXT EXTRACTION FROM GREY SCALE PAGE IMAGES BY SIMPLE EDGE DETECTORS.pdf
2002 Gaussian-Based Edge-Detection Methods - A Survey.pdf
2003 Fast Computation of Scale Normalised Gaussian Receptive Fields.pdf
2003 Real-time scale selection in hybrid multi-scale representations.pdf
2003 Recognition of text in 3-D scenes.pdf
2004 A method for ridge extraction.pdf
2004 A Review of Vessel Extraction Techniques and Algorithms.pdf
2004 Distinctive Image Features from Scale-Invariant Keypoints.pdf
2004 Scene Text Extraction in Natural Scene Images using Hierarchical Feature Combining and Verification.PDF
2004 Text Detection from Natural Scene Images - Towards a System for Visually Impaired Persons.PDF
2005 A novel approach for text detection in images using structural features.pdf
2005 Color Text Extraction from Camera-based Images - the Impact of the Choice of the Clustering Distance.PDF
2005 Improved Text-Detection Methods for a Camera-based Text Reading System for Blind Persons.PDF
2005 Text Extraction from Gray Scale Historical Document Images Using Adaptive Local Connectivity Map.pdf
2006 Multiscale Edge-Based Text Extraction from Complex Images.PDF
2006 Spatial and Color Spaces Combination for Natural Scene Text Extraction.PDF
2008 A double-threshold image binarization method based on edge detector.PDF

HTH

Warm regards,
Dmitry Silaev

Saurabh Gandhi

Mar 5, 2011, 1:24:11 AM

to tesser...@googlegroups.com, Dmitry Silaev

Thanks for the prompt response. I will work through these and get back with more specific questions.

--
Regards,
Saurabh Gandhi
