How to improve quality?

Hari P

Jun 28, 2018, 3:33:37 PM

to tesseract-ocr

I am using Tesseract v4.0 beta 1 to OCR a remittance file. One section contains the text CHECK NO, but Tesseract does not recognize it at all.

I tried supplying user dictionary words and setting the penalties for non-dictionary words to 1, but the output did not change.

tesseract capture.png captureoutput1 --user-words "C:\Program Files (x86)\Tesseract-OCR\tessdata\eng.user-words" -c load_system_dawg=0 -c load_freq_dawg=0 -c language_model_penalty_non_dict_word=1 -c language_model_penalty_non_freq_dict_word=1

These are the words I have in eng.user-words.

CHECK NO.
CHECK
NO
check
no

Any idea how to fix this?

Thanks,

Hari

Capture.PNG

captureoutput1.txt

Dattatraya Tembare

Jun 29, 2018, 12:39:06 PM

to tesser...@googlegroups.com

Hello Hari,

I faced the same problem.

When an image contains two different fonts, Tesseract doesn't recognize the text properly. It recognizes the first block of text and ignores the next one if its font size is larger than the first.

I resolved it by cropping the image into two pieces. I use ImageMagick (Java API) to clean and crop the images.

Also, your command is more complicated than it needs to be. This is all I ran (I have the tesseract path set up):

C:\EA>tesseract Capture.PNG Capture -l eng

Tesseract Open Source OCR Engine v4.0.0-alpha.20180109 with Leptonica

C:\EA>tesseract Capture1.PNG Capture1 -l eng

Tesseract Open Source OCR Engine v4.0.0-alpha.20180109 with Leptonica

Tesseract returns the proper text when the text is centered in the image. I achieved that by cropping, trimming, and then adding a border.
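For readers who don't use ImageMagick, here is a minimal sketch of the "crop into two pieces" step using only the Java standard library. The class name ImageSplitter, the method splitAtRow, and the choice of a horizontal split at a fixed row are assumptions for illustration, not what I actually ran; you would pick the row that separates the two fonts in your own image.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: splits an image into a top and a bottom piece
// so that each piece can be passed to Tesseract separately.
final class ImageSplitter {
    private ImageSplitter() {}

    // Returns {top, bottom}. Rows [0, splitY) go to the top piece,
    // rows [splitY, height) go to the bottom piece.
    static BufferedImage[] splitAtRow(BufferedImage src, int splitY) {
        if (splitY <= 0 || splitY >= src.getHeight()) {
            throw new IllegalArgumentException("splitY must lie inside the image");
        }
        int w = src.getWidth();
        BufferedImage top = copyRegion(src, 0, w, splitY);
        BufferedImage bottom = copyRegion(src, splitY, w, src.getHeight() - splitY);
        return new BufferedImage[] {top, bottom};
    }

    // Copies a full-width band of pixels into a new, independent RGB image.
    private static BufferedImage copyRegion(BufferedImage src, int y, int w, int h) {
        int[] pixels = src.getRGB(0, y, w, h, null, 0, w);
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        out.setRGB(0, 0, w, h, pixels, 0, w);
        return out;
    }
}
```

Each piece can then be written out with ImageIO.write and given to tesseract as its own file.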

Datta

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/01ef5e64-3332-4b0f-a0aa-8ab9488083f1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Best Regards,
Dattatraya Tembare

+1 914 721 6311

Capture.txt

Capture1.txt

Capture1.png

Capture.PNG

Dattatraya Tembare

Jun 29, 2018, 12:40:55 PM

to tesseract-ocr

The "C" is missing from the output because the image does not leave Tesseract enough margin around the text.

The text needs a proper margin on every side.
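A rough sketch of adding that margin in plain Java is below. It assumes dark text on a white background, and the class name MarginPadder and method addWhiteBorder are made up for this example. How many pixels of margin are enough is something to find by experiment; I am not quoting a recommended value.

```java
import java.awt.image.BufferedImage;
import java.util.Arrays;

// Hypothetical helper: surrounds an image with a white border so the
// text no longer touches the image edge.
final class MarginPadder {
    private MarginPadder() {}

    // Returns a new image that is 2 * margin pixels wider and taller than src,
    // with src copied into the middle and white pixels everywhere else.
    static BufferedImage addWhiteBorder(BufferedImage src, int margin) {
        if (margin < 0) {
            throw new IllegalArgumentException("margin must not be negative");
        }
        int w = src.getWidth();
        int h = src.getHeight();
        int outW = w + 2 * margin;
        int outH = h + 2 * margin;
        BufferedImage out = new BufferedImage(outW, outH, BufferedImage.TYPE_INT_RGB);

        // Fill the whole canvas with white first.
        int[] white = new int[outW * outH];
        Arrays.fill(white, 0xFFFFFF);
        out.setRGB(0, 0, outW, outH, white, 0, outW);

        // Copy the original pixels into the center.
        int[] pixels = src.getRGB(0, 0, w, h, null, 0, w);
        out.setRGB(margin, margin, w, h, pixels, 0, w);
        return out;
    }
}
```

If your scan has a dark background, change the fill color to match it instead of white.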


Dattatraya Tembare

Jun 29, 2018, 1:27:59 PM

to tesser...@googlegroups.com

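You can also restrict OCR to a region of the image instead of cropping the file. The imports below assume the Tess4J wrapper; getTesseractInstance, ImageGeometry and log are my own helpers and are not shown here.

import java.awt.Rectangle;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public String ocrText(File file, String lang, ImageGeometry geometry) {
    String resultText = null;
    // Assumes the helper's second argument is the language code (originally hard-coded to "eng").
    Tesseract instance = getTesseractInstance("TesseractEnvPath", lang);
    // Define a region of interest equal to or smaller than the image:
    // x offset, y offset, width and height.
    Rectangle rect = new Rectangle(geometry.getXscale(), geometry.getYscale(),
            geometry.getWidth(), geometry.getHeight());
    try {
        // OCR only the pixels inside rect.
        resultText = instance.doOCR(ImageIO.read(file), rect);
        log.debug("resultText: {}", resultText);
    } catch (TesseractException | IOException e) {
        e.printStackTrace();
    }
    return resultText;
}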

