Tesseract OCR Failing to Read Cleaned Numbers. Suggestions Please?


tristan gordon

Apr 30, 2020, 5:27:22 AM
to tesseract-ocr
Hello all,

Could you help?

Attached are two images containing two numbers, 81 and 82, which I am attempting to get Tesseract OCR to read.

Each time, Tesseract OCR reports an empty page and produces an empty text.txt document.

The output is as follows:

# tesseract 82.png out
Tesseract Open Source OCR Engine v4.1.1-rc2-20-g01fb with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 1622
Empty page!!
Estimating resolution as 1622
Empty page!!

How can I get the numbers to output? Are any changes required to the images or to Tesseract?

These images were produced on CentOS 7 with Apache, PHP and Imagick.
Each image is retrieved from an external server, then processed with Imagick: crop, grayscale, trim to the focus area, resize, smooth edges, remove the background, convert to black and white, flatten, and set a resolution and image format.
The images were then saved (for development purposes) and tested using the command above.

Once these errors are sorted and it's running, tesseract-ocr-php will complete the process on the fly (as there are around 6,000 images to read).

Let me know.

Thank you (in advance).


82.png
81.png

Shree Devi Kumar

Apr 30, 2020, 5:36:56 AM
to tesseract-ocr
Looks like the image resolution is not set correctly. You can specify dpi while processing.

ubuntu@tesseract-ocr:~/TEST$ tesseract 82.png -  --dpi 300
82
ubuntu@tesseract-ocr:~/TEST$ tesseract 81.png -  --dpi 300
81
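
Since the original post mentions roughly 6000 images, the same --dpi flag could be applied in a loop. The following is only a sketch: the function name ocr_dir, its arguments and the directory layout are hypothetical and not part of the thread, and 300 is simply the value used in the commands above.

```shell
# Sketch: OCR every PNG in a directory, passing the DPI explicitly.
# ocr_dir and its parameters are hypothetical names used for illustration.
ocr_dir() {
    dir=$1
    dpi=${2:-300}                      # default to the 300 dpi used above
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue      # skip when the glob matches nothing
        printf '%s: ' "$img"
        # "-" writes the recognized text to stdout instead of a file;
        # stderr (banner and resolution warnings) is discarded.
        tesseract "$img" - --dpi "$dpi" 2>/dev/null
    done
}
```

It would be called as, for example, ocr_dir ./images 300. Passing --dpi on the command line avoids relying on the resolution stored in the image file.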


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2314316b-1b5c-4a44-b9bb-8e65a901a688%40googlegroups.com.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

tristan gordon

Apr 30, 2020, 10:29:07 AM
to tesseract-ocr
Thank you.
Now to look at imagick to set the resolution!


tristan gordon

Apr 30, 2020, 10:54:25 AM
to tesseract-ocr
Now that I know the resolution headers were the issue, the following should help anyone using Tesseract OCR with PHP who is looking for a solution in future:
  1. Create your Imagick instance, i.e. $image = new Imagick('image.jpg');
  2. Then set the resolution using two lines, first: $image->setImageUnits(Imagick::RESOLUTION_PIXELSPERINCH); then $image->setImageResolution(300, 300);
  3. The resolution is then set, ready for Tesseract to read.
I hope that helps.
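
For images that have already been saved without usable resolution metadata, the same two settings can also be applied from the command line. This is a sketch using the ImageMagick CLI rather than the PHP Imagick calls described above; the helper name set_dpi is hypothetical, and it assumes ImageMagick's convert command is installed (newer ImageMagick releases name the command magick).

```shell
# Sketch: stamp resolution metadata on an existing image with ImageMagick.
# set_dpi is a hypothetical helper; it assumes `convert` is on PATH.
set_dpi() {
    src=$1
    dst=$2
    dpi=${3:-300}                      # default to 300, as in the thread
    # Placed after the input file, -units and -density change only the
    # stored resolution metadata; the pixel data is not resampled.
    convert "$src" -units PixelsPerInch -density "$dpi" "$dst"
}
```

For example, set_dpi 82.png 82-300dpi.png writes a copy tagged as 300 pixels per inch, which Tesseract can then read without the --dpi flag being required.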