How to improve quality?


Hari P

Jun 28, 2018, 3:33:37 PM
to tesseract-ocr
I am using Tesseract v4.0 beta 1 to OCR a remittance file. One section contains the text CHECK NO, but Tesseract doesn't seem to recognize it at all.

I have tried supplying a user-words dictionary and setting the penalties for non-dictionary words to 1, but nothing changed.

tesseract capture.png captureoutput1 --user-words "C:\Program Files (x86)\Tesseract-OCR\tessdata\eng.user-words" -c load_system_dawg=0 -c load_freq_dawg=0 -c language_model_penalty_non_dict_word=1 -c language_model_penalty_non_freq_dict_word=1

These are the words I have in eng.user-words.

CHECK NO.
CHECK
NO
check
no

Any idea how to fix this?

Thanks,
Hari
Capture.PNG
captureoutput1.txt

Dattatraya Tembare

Jun 29, 2018, 12:39:06 PM
to tesser...@googlegroups.com
Hello Hari,
I faced the same problem. 

When an image contains two different fonts, Tesseract doesn't recognize the text properly. It recognizes the first piece of text and ignores the next one if its font size is bigger than the first.
I resolved it by cropping the image into two pieces. I'm using ImageMagick (Java API) to clean and crop the images.
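
If ImageMagick is not available, the same split can be sketched with plain Java 2D. This is only an illustration, not what I run: the class name ImageSplitter and the splitY parameter are made up for the example, and it assumes you already know the row where the first piece of text ends.

```java
import java.awt.image.BufferedImage;

final class ImageSplitter {
    private ImageSplitter() {}

    /**
     * Splits an image into a top part (rows 0 to splitY - 1) and a bottom
     * part (rows splitY to the last row). The returned sub-images share
     * pixel data with the source image.
     */
    static BufferedImage[] splitHorizontally(BufferedImage src, int splitY) {
        if (splitY <= 0 || splitY >= src.getHeight()) {
            throw new IllegalArgumentException(
                    "splitY must be between 1 and " + (src.getHeight() - 1));
        }
        BufferedImage top = src.getSubimage(0, 0, src.getWidth(), splitY);
        BufferedImage bottom = src.getSubimage(
                0, splitY, src.getWidth(), src.getHeight() - splitY);
        return new BufferedImage[] {top, bottom};
    }
}
```

Each part can then be written out with ImageIO.write and passed to tesseract separately.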

Also, I think your command is more complicated than it needs to be. This is all I ran (I have the tesseract path set up):

C:\EA>tesseract Capture.PNG Capture -l eng
Tesseract Open Source OCR Engine v4.0.0-alpha.20180109 with Leptonica

C:\EA>tesseract Capture1.PNG Capture1 -l eng
Tesseract Open Source OCR Engine v4.0.0-alpha.20180109 with Leptonica

Tesseract returns the proper text when the text is centered in the image. I achieved that by cropping, trimming, and then adding a border.

Datta

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/01ef5e64-3332-4b0f-a0aa-8ab9488083f1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Best Regards,
Dattatraya Tembare
Capture.txt
Capture1.txt
Capture1.png
Capture.PNG

Dattatraya Tembare

Jun 29, 2018, 12:40:55 PM
to tesseract-ocr
The "C" is missing from the output because Tesseract doesn't have enough margin around the text to read it.
The image needs a proper margin.
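
A rough sketch of adding such a margin in plain Java 2D, in case it helps. The class name ImagePadder, the margin size, and the fill color are all assumptions for illustration; how many pixels are enough depends on the image.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

final class ImagePadder {
    private ImagePadder() {}

    /**
     * Returns a new RGB image that is the source image surrounded by a
     * border of `margin` pixels on every side, filled with `fill`.
     */
    static BufferedImage addBorder(BufferedImage src, int margin, Color fill) {
        if (margin < 0) {
            throw new IllegalArgumentException("margin must not be negative");
        }
        BufferedImage out = new BufferedImage(
                src.getWidth() + 2 * margin,
                src.getHeight() + 2 * margin,
                BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            // Paint the whole canvas first, then draw the original on top.
            g.setColor(fill);
            g.fillRect(0, 0, out.getWidth(), out.getHeight());
            g.drawImage(src, margin, margin, null);
        } finally {
            g.dispose();
        }
        return out;
    }
}
```

For dark text on a light background, something like ImagePadder.addBorder(image, 20, Color.WHITE) before running OCR would be a reasonable first try.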


Dattatraya Tembare

Jun 29, 2018, 1:27:59 PM
to tesser...@googlegroups.com
You can also run OCR on just a region of the image:

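import java.awt.Rectangle;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
// Assumes the Tess4J library, which provides these two classes.
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

// getTesseractInstance, ImageGeometry and log are not shown here.
public String ocrText(File file, String lang, ImageGeometry geometry) {
    String resultText = null;
    // Pass the requested language instead of a hard-coded "eng".
    Tesseract instance = getTesseractInstance("TesseractEnvPath", lang);
    // Define a region of interest equal to or smaller than the image:
    // x offset, y offset, width and height.
    Rectangle rect = new Rectangle(geometry.getXscale(), geometry.getYscale(),
            geometry.getWidth(), geometry.getHeight());

    try {
        resultText = instance.doOCR(ImageIO.read(file), rect);
        log.debug("resultText: {}", resultText);
    } catch (TesseractException | IOException e) {
        log.error("OCR failed", e);
    }

    return resultText;
}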
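
Since the region has to be equal to or smaller than the image, it may be worth clipping the rectangle before using it. A small sketch of that idea follows; the class and method names are made up for the example, and I have not checked how the OCR call itself behaves with a region that runs past the image edge.

```java
import java.awt.Rectangle;

final class RegionOfInterest {
    private RegionOfInterest() {}

    /**
     * Clips the requested region to the image bounds and rejects a region
     * that does not overlap the image at all.
     */
    static Rectangle clampToImage(Rectangle roi, int imageWidth, int imageHeight) {
        Rectangle clipped = roi.intersection(
                new Rectangle(0, 0, imageWidth, imageHeight));
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException(
                    "Region lies outside the image: " + roi);
        }
        return clipped;
    }
}
```

The clipped rectangle can then be used in place of the original one, with the width and height taken from the BufferedImage returned by ImageIO.read.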



--
Best Regards,
Dattatraya Tembare
