issue with simple reading of numbers 9 and 8

James Abney

Apr 21, 2017, 2:08:53 PM
to tesseract-ocr
I'm having trouble with Tesseract misreading the digits 9 and 8, especially when they are next to each other. This is really the only issue I have. Even when I OCR a TIFF file, it reads 123456789 as 123456788. I will link an example. Any help is appreciated. In the following image, my software (which uses Tesseract) reads 899B8993B as 88888-838.
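
One way to see exactly which characters are being confused is to compare a known string with the OCR output position by position. The sketch below is illustrative only: `count_confusions` is a hypothetical helper, not part of Tesseract, and it assumes both strings have the same length, as they do in the two examples quoted in this post.

```python
from collections import Counter


def count_confusions(expected, recognized):
    """Count (expected_char, recognized_char) pairs that differ, position by position."""
    if len(expected) != len(recognized):
        # A positional comparison is only meaningful for equal-length strings.
        raise ValueError("strings must have the same length")
    return Counter((e, r) for e, r in zip(expected, recognized) if e != r)


# The two examples from the post: ground truth first, OCR output second.
for truth, ocr in [("123456789", "123456788"), ("899B8993B", "88888-838")]:
    print(truth, ocr, dict(count_confusions(truth, ocr)))
```

A tally like this makes it easier to decide which glyphs need more training samples.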


James Abney

Apr 21, 2017, 3:20:48 PM
to tesseract-ocr
The font is Tungsten Semibold.

ShreeDevi Kumar

Apr 21, 2017, 11:05:49 PM
to tesser...@googlegroups.com
Which version of Tesseract? Which OS?

If all your text is in tungsten-semibold, have you tried training with just that font?

- excuse the brevity, sent from mobile


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/4a0c2a52-3eb5-4884-9371-111a6fbea73b%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

James Abney

Apr 22, 2017, 12:59:43 PM
to tesseract-ocr
Oh yes, I forgot to include that information. I did train using only that font, at the same font size. I am on Windows 7 and used 3.05 to train, although the .NET wrapper I use is 3.04. I don't see why it has difficulty with the 9 and 8; it seems very odd.



ShreeDevi Kumar

Apr 22, 2017, 11:55:56 PM
to tesser...@googlegroups.com
Try training using more samples of 8, 9, B, etc.

What results do you get with the provided eng.traineddata? Are they better or worse?

Have you tried changing the DPI of the image to 300?


- excuse the brevity, sent from mobile
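
For the DPI suggestion, the resampling factor is simply the target DPI divided by the image's current DPI. A minimal sketch of that arithmetic follows; `size_at_target_dpi` is a hypothetical helper, and the 96 DPI source resolution in the example is an assumption for illustration, not a value from this thread. The resampling itself would be done in an image editor or imaging library.

```python
def size_at_target_dpi(width_px, height_px, source_dpi, target_dpi=300):
    """Return the pixel dimensions an image needs to keep its physical size at target_dpi."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    factor = target_dpi / source_dpi
    return round(width_px * factor), round(height_px * factor)


# Assumed example: a 640x480 image scanned at 96 DPI (hypothetical values).
print(size_at_target_dpi(640, 480, source_dpi=96))
```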

ShreeDevi Kumar

Apr 24, 2017, 12:43:36 AM
to tesser...@googlegroups.com
James,

Were you able to get this to work for you with 3.04/3.05?

I get accurate results using Tesseract 4.0 alpha, though it takes longer with --oem 1 than --oem 0.


./troublewith98-300.jpg
Tesseract Open Source OCR Engine v4.00.00alpha-385-gab41465 with Leptonica

real    0m1.203s
user    0m0.578s
sys     0m0.203s
Tesseract Open Source OCR Engine v4.00.00alpha-385-gab41465 with Leptonica

real    0m4.485s
user    0m5.125s
sys     0m0.234s

See attached ..

which uses Tesseract.NET (Tesseract 4.00alpha 362b68e)


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
troublewith98.png-1-eng.txt
troublewith98.png-0-eng.txt
troublewith98-300.jpg-1-eng.txt
troublewith98-300.jpg-0-eng.txt
troublewith98-300.jpg
troublewith98.png
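
A timing comparison like the one in this post can be scripted. The sketch below assumes a Tesseract 4.x `tesseract` binary on the PATH and that the installed eng.traineddata includes the legacy model needed for `--oem 0`. The helper names are hypothetical, and the output base names only imitate the attachment names listed above; this is not necessarily how those files were produced.

```python
import subprocess
import time


def build_tesseract_command(image_path, output_base, oem, lang="eng"):
    """Build the argument list for one Tesseract 4.x run with a given engine mode."""
    return ["tesseract", image_path, output_base, "--oem", str(oem), "-l", lang]


def time_engine_modes(image_path, modes=(0, 1), lang="eng"):
    """Run Tesseract once per engine mode and return wall-clock seconds per mode."""
    timings = {}
    for oem in modes:
        # Tesseract appends ".txt" to the output base, e.g. image.jpg-0-eng.txt
        output_base = f"{image_path}-{oem}-{lang}"
        cmd = build_tesseract_command(image_path, output_base, oem, lang)
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        timings[oem] = time.perf_counter() - start
    return timings
```

Calling `time_engine_modes("troublewith98-300.jpg")` would return the elapsed time for `--oem 0` and `--oem 1`, provided the image and a suitable Tesseract install are present.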

James Abney

Apr 24, 2017, 8:52:41 AM
to tesser...@googlegroups.com
Thanks for letting me know. No, I haven't had a chance yet. I will try 4.0, although I have never worked with Tesseract manually. I've been using programs for 3.x that trained and generated box files automatically.


James Abney

Apr 27, 2017, 5:16:20 PM
to tesseract-ocr
It has a hard time with the letter Q for some reason now. Thanks for your help. It seems that just using the included traineddata is more accurate than the one trained on just the font; I don't know why that is.


