Bad quality in the OCR result

Fabian Angeloni

Oct 19, 2017, 3:08:37 PM

to tesseract-ocr

Hi!

I'm trying to run OCR on the attached image, but the result is just nonsense characters. I tried some preprocessing, such as converting to grayscale, denoising, and increasing the DPI, but the results are the same.

The issue occurs when the image is taken directly from a cellphone using a web app.

Any help?

The command I'm using is:

tesseract -l spa input output

Regards

06D0C811-10EF-453F-AA79-88188CA2E84F.jpeg

zbgns

Oct 20, 2017, 6:40:32 AM

to tesseract-ocr

The image needs some improvement before you apply OCR to it. Please look at the attached file. The output is no longer complete garbage, although the results are still not impressive, because the input image is too distorted (the shadow in particular seems to be challenging).

sample.pdf

Fabian Angeloni

Oct 20, 2017, 10:03:49 AM

to tesseract-ocr

Hi! Thanks for the response!

That's much better! Could you please tell me which tool you used to convert the image to a PDF in which the text is stored as real text and not just as an image?

zbgns

Oct 20, 2017, 10:54:37 AM

to tesseract-ocr

Tesseract can create a two-layer PDF file (image and text combined). The command looks like this:

tesseract -l Latin input.png output pdf
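
If you want to call this from a script (for example from the backend of a web app), a minimal Python sketch of the same invocation could look like the one below. The function names, the file names and the "spa" default are only illustrative assumptions. It also assumes that the tesseract binary is on the PATH and that the requested language data is installed.

```python
import subprocess


def build_pdf_command(image_path, output_base, lang="spa"):
    """Build the argument list for a tesseract run that writes a PDF."""
    # The trailing "pdf" is the name of a tesseract config file that selects
    # PDF output; tesseract appends the ".pdf" extension to output_base itself.
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def ocr_to_pdf(image_path, output_base, lang="spa"):
    """Run tesseract and return the path of the PDF it is expected to write."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_pdf_command(image_path, output_base, lang), check=True)
    return output_base + ".pdf"
```

Calling ocr_to_pdf("input.png", "output", lang="Latin") would then run the equivalent of the command above and is expected to leave the result in output.pdf.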

The image was preprocessed using ScanTailor (correction of geometric distortions, binarization) and ImageMagick (setting DPI and dimensions in order to obtain A4 size).
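
The ImageMagick part can be sketched roughly as follows. This is not the exact command used for sample.pdf: the 300 DPI value, the function names and the file names are assumptions made for illustration, and the ScanTailor steps (done in its GUI) are not reproduced here.

```python
import subprocess

A4_MM = (210, 297)  # A4 paper size in millimetres
MM_PER_INCH = 25.4


def a4_pixels(dpi):
    """Return the (width, height) in pixels of an A4 page at the given DPI."""
    return tuple(round(side / MM_PER_INCH * dpi) for side in A4_MM)


def build_convert_command(src, dst, dpi=300, executable="convert"):
    """Build an ImageMagick command that scales to A4 and sets the DPI."""
    # ImageMagick 7 installs the tool as "magick"; "convert" is the older name.
    width, height = a4_pixels(dpi)
    return [
        executable, src,
        # -resize keeps the aspect ratio and fits the image inside this box.
        "-resize", f"{width}x{height}",
        # Store the resolution in the output file as pixels per inch.
        "-units", "PixelsPerInch",
        "-density", str(dpi),
        dst,
    ]


def preprocess(src, dst, dpi=300):
    """Run the command; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(build_convert_command(src, dst, dpi), check=True)
```

The resulting file can then be passed to tesseract as the input image.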
