Tips on how to improve results.


Martín Ochoa

May 2, 2015, 3:46:03 AM
to tesser...@googlegroups.com
Hi,
I'm developing an app that has to read text from an image (what it does with the text afterwards is unrelated to my question). I have the image attached below and want to read the text in it, but unfortunately it isn't being read correctly. I tried some image preprocessing, but I didn't really understand it since I'm new to this, and I don't know whether I even need it.
This is my output:

Coürdow
Abathur

I've changed the language to spa so that it would read the "ë". I think the problem is that the word "Caërdagor" doesn't exist in any language, since it's an invented name; then again, "Abathur" doesn't exist either, yet it is read correctly.
All the images the app will read look like this one, but obviously with different text. Any tips on how to improve this? Remember I'm a noob at this. Also, do you think it would be a good idea to "train" the language by adding these invented names as the app reads them?


Thanks in advance.
hots4.jpg

Allistair C

May 2, 2015, 6:03:56 AM
to tesser...@googlegroups.com
Try resampling your image up to 5x larger and try again.


Dmitri Silaev

May 6, 2015, 10:15:58 AM
to tesser...@googlegroups.com
Hi Martin,

Some things indeed can be done to improve results for the upper word.

- Source image
(inet009.jpg)

- Upscale by 5x. This is required because the characters in your upper word are too small.
(inet009_rs.jpg)

- Crop out your upper word - you need to help Tess with layout analysis
(inet009_rs_cr.jpg)

- Threshold - you need to help Tess with binarization
>convert inet009_rs_cr.jpg -threshold 45% inet009_rs_cr_ts.jpg
(inet009_rs_cr_ts.jpg)

- Call Tess. I don't know whether the Spanish traineddata contains the two-dotted "e" (ë), but the French one surely does. I used Tess compiled from sources as of 20150203. The OCR result was perfect.
>tesseract inet009_rs_cr_ts.jpg inet009_rs_cr_ts.jpg -l fra
(inet009_rs_cr_ts.jpg.txt)

The lower word is recognized correctly once it is simply cropped out.
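
For anyone who wants to script the steps above, here is a minimal Python sketch that only assembles the command lines and optionally runs them. It is an illustration under assumptions, not the exact procedure used for the attached files: the post gives only the -threshold and tesseract commands, so the -resize and -crop steps (and the use of ImageMagick for them) are my guess, the crop geometry is a hypothetical placeholder you would have to measure on your own image, and the function names build_pipeline and run_pipeline are made up for this example. On ImageMagick 7 the convert command may need to be replaced with magick.

```python
import subprocess
from pathlib import Path


def build_pipeline(src, crop_geometry, scale="500%", threshold="45%", lang="fra"):
    """Return the command lines for upscale -> crop -> threshold -> OCR.

    crop_geometry is an ImageMagick WxH+X+Y string; its value depends on
    where the word sits in the upscaled image (hypothetical input).
    """
    p = Path(src)
    # File names follow the suffix pattern used in the post (_rs, _cr, _ts).
    resized = p.with_name(p.stem + "_rs" + p.suffix)
    cropped = p.with_name(p.stem + "_rs_cr" + p.suffix)
    binary = p.with_name(p.stem + "_rs_cr_ts" + p.suffix)
    return [
        # Assumed step: upscale by 5x with ImageMagick.
        ["convert", str(p), "-resize", scale, str(resized)],
        # Assumed step: crop to the word; +repage resets the canvas offset.
        ["convert", str(resized), "-crop", crop_geometry, "+repage", str(cropped)],
        # From the post: binarize at 45%.
        ["convert", str(cropped), "-threshold", threshold, str(binary)],
        # From the post: OCR with the French traineddata; writes <binary>.txt.
        ["tesseract", str(binary), str(binary), "-l", lang],
    ]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling build_pipeline("inet009.jpg", "900x200+0+0") returns four commands, where "900x200+0+0" is only a placeholder geometry; pass the result to run_pipeline once ImageMagick and Tesseract are installed. Keeping the command construction separate from execution makes it easy to print and check the commands before running them.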

Best regards,
Dmitri Silaev
www.CustomOCR.com




