Two visually identical images - Tesseract finds text from one but not the other


Bhaarat Sharma

Oct 4, 2015, 11:11:54 PM
to tesseract-ocr
I have two versions of the same image. One version was resized and rotated using OpenCV; the other was resized and rotated using ImageMagick.

Tesseract finds text in the ImageMagick version of the image but not in the OpenCV version. I don't know why this is.

Here are the details of the two images, as reported by ImageMagick's identify command:

    $ identify original_zoomed_rotated_im.png
    original_zoomed_rotated_im.png PNG 1334x776 1334x776-34-76 8-bit sRGB 256c 131KB 0.000u 0:00.000
    $ identify opencv_zoomed_rotated.png
    opencv_zoomed_rotated.png PNG 1334x776 1334x776-34-76 8-bit sRGB 256c 47.4KB 0.000u 0:00.000

I've attached the two images.

Here is the output from Tesseract for each image:

    $ tesseract opencv_zoomed_rotated.png ozr && more ozr.txt
    Tesseract Open Source OCR Engine v3.02.02 with Leptonica
    $ tesseract original_zoomed_rotated_im.png ozr_im && more ozr_im.txt
    Tesseract Open Source OCR Engine v3.02.02 with Leptonica
    MORTE
    5;‘
    l
    PARA
    ROUSSEFF



opencv_zoomed_rotated.png
original_zoomed_rotated.png

Tom Morris

Oct 5, 2015, 12:24:04 PM
to tesseract-ocr
If you think those images are visually identical you should visit your optician. :-)

The ImageMagick version is much blurrier, so I'd guess that the high-frequency noise from the pixelated OpenCV image is making Tesseract unhappy. If you want to continue using OpenCV, try applying a Gaussian blur after whatever other operations you're using and see if that produces an image that matches the ImageMagick one.

Tom