Accuracy with non-standard words consisting of random combinations/mix of digits + letters/characters


Ast

Oct 21, 2019, 2:22:10 PM
to tesseract-ocr
I've spent a good amount of time looking into how to resolve this issue. I came across this unanswered post from 2017, tried it, and it is still reproducible today. There are 2 images: one with the letter S, one with 2S. As a single character, the letter S is detected successfully, but 2S is detected as 25.

From what I've been able to learn, this issue stems from the combination of alphanumeric characters (common in receipts or codes) and how tesseract tries to use dictionary words.

Environment:

tesseract 4.1.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE

Debian 10 64bit

I've tried changing some configurations such as load_system_dawg=0 and load_freq_dawg=0 but without luck.
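For reference, a minimal sketch of how those two variables could be passed on the command line with `-c`. The helper name `build_tesseract_cmd` and the image file name `2S.png` are assumptions made for illustration; they are not from the original attachments. The sketch only assembles the command, it does not run it.

```python
import shlex


def build_tesseract_cmd(image_path, output_base="out", psm=None, config_vars=None):
    """Assemble a tesseract CLI invocation as an argument list (hypothetical helper)."""
    cmd = ["tesseract", image_path, output_base]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    # Each config variable is passed as its own "-c name=value" pair.
    for name, value in (config_vars or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# "2S.png" is a placeholder file name, not the actual attachment name.
cmd = build_tesseract_cmd(
    "2S.png",
    config_vars={"load_system_dawg": 0, "load_freq_dawg": 0},
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where tesseract is installed.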

I am fairly new to OCR so any input and feedback is greatly appreciated. Thank you.

Zdenko Podobny

Oct 22, 2019, 8:32:37 AM
to tesser...@googlegroups.com
I am afraid that such a small fraction of text (where there are just letters that are commonly misinterpreted, like S or 5 or ?) cannot be recognized with 100% accuracy. Try to use it in some context (line).

Zdenko


On Mon, Oct 21, 2019 at 20:22, Ast <asteptoe...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/9e8203e6-fbd5-47dc-8b2b-0327fe1e2e0a%40googlegroups.com.

Ast

Oct 22, 2019, 1:15:42 PM
to tesseract-ocr
Fair enough, though I was just using this as an example. In practice, it will be an 8- or 9-character alphanumeric string, like a code. Would the extra 6-7 characters be enough context?



Ast

Oct 22, 2019, 9:11:34 PM
to tesseract-ocr
I've also noticed inconsistencies depending on where I crop.

I created a simple image with a 10-point DejaVu Sans Mono font (code_10_dejavu_sans_mono.png), which contains 6X279SWKF

I pre-process it 2 ways:
  • Scale it up by 4 (scaled_up_only.png) using
        cv2.resize(img,
                   None,
                   fx=4,
                   fy=4,
                   interpolation=cv2.INTER_CUBIC)
  • Crop it first and then scale it up by 4 as above (cropped_then_scaled_up_only.png)
        x = 10
        y = 10
        h = 20
        w = 110

        img = img[y:y + h, x:x + w]

I get different results.

tesseract --psm 13 -c tessedit_char_whitelist=-ABCDEFGHIJKLMNOPQRSTUVWXY1234567890 scaled_up_only.png out

  • cropped_then_scaled_up_only gives the correct value 6X279SWKF
  • scaled_up_only gives the incorrect value 6X2795WKF
Any insight on this and possible solutions to overcome it? I am playing with different ways to preprocess, but there seems to be this kind of behavior where the only difference between 2 images is that one has an extra top row of white pixels.
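The two preprocessing paths described above could be put together roughly as follows. This is only a sketch: the function names are made up for illustration, the image is read with the default `cv2.imread` flags (the post does not say how it was loaded), and the crop box uses the x, y, w, h values from the snippet above. OpenCV is imported lazily so that the small size helper can be used without it.

```python
def scaled_size(height, width, factor=4):
    """(height, width) after scaling both axes by an integer factor."""
    return height * factor, width * factor


def crop(img, x=10, y=10, w=110, h=20):
    """Crop a NumPy image with slicing; defaults are the values from the post."""
    return img[y:y + h, x:x + w]


def scale_up(img, factor=4):
    """Scale an image by `factor` on both axes using bicubic interpolation."""
    import cv2  # lazy import: only needed when images are actually processed
    return cv2.resize(img, None, fx=factor, fy=factor,
                      interpolation=cv2.INTER_CUBIC)


def make_variants(src="code_10_dejavu_sans_mono.png"):
    """Write both preprocessed variants next to the source image."""
    import cv2
    img = cv2.imread(src)  # assumption: default flags, image loaded as BGR
    if img is None:
        raise FileNotFoundError(src)
    cv2.imwrite("scaled_up_only.png", scale_up(img))
    cv2.imwrite("cropped_then_scaled_up_only.png", scale_up(crop(img)))


# The 110x20 crop becomes 440x80 pixels after scaling by 4.
print(scaled_size(20, 110))
```

Calling `make_variants()` in the directory that holds the source image regenerates the two files, so both can be fed to the same tesseract command for comparison.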

code_10_dejavu_sans_mono.png
cropped_then_scaled_up_only.png
scaled_up_only.png

Zdenko Podobny

Oct 24, 2019, 2:45:53 AM
to tesser...@googlegroups.com
When I run:
tesseract code_10_dejavu_sans_mono.png -
I get the result 6X279SWKF, i.e. no preprocessing is needed.
Also, someone in the past posted an analysis to the forum which showed (AFAIR) that increasing the size of letters over 30pt causes problems for tesseract 4.

Zdenko
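To connect that size remark to the example in this thread, here is a rough worked calculation. It assumes, as a simplification, that scaling an image by a factor multiplies the effective point size by the same factor, and it takes the 30pt figure as recalled above, not as a documented limit. The helper name `max_integer_scale` is hypothetical.

```python
def max_integer_scale(font_pt, limit_pt=30):
    """Largest whole-number scale that keeps font_pt * scale <= limit_pt.

    Never returns less than 1, since 1 means "leave the image as it is".
    """
    return max(1, int(limit_pt // font_pt))


font_pt = 10  # the test image uses a 10-point font
for scale in (1, 2, 3, 4):
    effective_pt = font_pt * scale
    status = "over limit" if effective_pt > 30 else "ok"
    print(f"scale x{scale}: about {effective_pt}pt ({status})")

print(max_integer_scale(font_pt))
```

Under these assumptions, scaling the 10-point image by 4 gives roughly 40pt text, which is above the recalled threshold, while a factor of up to 3 stays within it.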



Ast

Oct 28, 2019, 3:46:57 PM
to tesseract-ocr
Thanks for the insight!