tesseract ignores single/short characters -> any ideas?


test0r man

Sep 8, 2019, 6:23:28 AM
to tesseract-ocr
Hi,
I use this command:

tesseract input/image.jpg output/output --dpi 72 --oem 1 -l deu+eng

to scan images like "1_input.jpg" and "2_input.jpg". The OCR result is good, but it seems that tesseract ignores short/single characters.
In the first image it ignores the three "0"s.
In the second image it only detects the "10.".

The tessinput files are attached too.
If I use the "--psm 6" option, the other words are no longer recognized correctly.
If I scale the images to 300 dpi, the result is the same.

Does anyone have an idea? Thanks for the help!






1_input.jpg
2_input.jpg
1_tessinput.tif
2_tessinput.tif

test0r man

Oct 5, 2019, 4:04:01 AM
to tesseract-ocr
--Push--

does anyone have an idea?

thanks for help!

Ravi Annaswamy

Oct 5, 2019, 6:08:35 AM
to tesser...@googlegroups.com
I didn't try these images, but my first guess: can you try again without the --dpi 72 option?

Sent from my iPhone
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6bb8a731-afa3-4dbf-a805-90b9120b791b%40googlegroups.com.

Adrian Owen

Oct 5, 2019, 8:38:26 AM
to tesser...@googlegroups.com

Zdenko Podobny

Oct 5, 2019, 8:52:15 AM
to tesser...@googlegroups.com

tesseract 2_input_cropped.png - --psm 6 --oem 0
6.
7.
8.
9.
10.




Zdenko


On Sat, Oct 5, 2019 at 10:04, test0r man <test0r...@gmail.com> wrote:
2_input_cropped.png

test0r man

Oct 5, 2019, 12:18:58 PM
to tesseract-ocr
I've tried it without the --dpi 72 option. The result on the first image is a bit worse; on the second image there is no change.



test0r man

Oct 5, 2019, 12:22:41 PM
to tesseract-ocr
thanks for the link. i will read and try it


test0r man

Oct 5, 2019, 12:27:05 PM
to tesseract-ocr
Thanks for your test. I added a border with ImageMagick to get a better result on the first image. With --psm 6, tesseract detects all numbers correctly, but only on the second image. Have you tried the first image too?

Zdenko Podobny

Oct 5, 2019, 2:24:08 PM
to tesser...@googlegroups.com
First image has several problems:
  1. not straight baseline
  2. different font size
  3. table like structure
  4. amount/digits fields

1-3 could be solved with custom layout analysis, e.g. splitting the image into individual parts and sending them to tesseract via the API or a uzn file.
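
For readers who have not used uzn files before, here is a minimal sketch of generating one from Python. It assumes the plain-text zone format of one zone per line, written as `left top width height label` in pixels measured from the top-left corner, and saved next to the image under the same base name (1_input_r.png and 1_input_r.uzn). The function names, coordinates, and labels are made up for illustration; they are not the contents of the attached 1_input_r.uzn.

```python
from pathlib import Path


def uzn_lines(zones):
    """Format (left, top, width, height, label) tuples as uzn text lines."""
    lines = []
    for left, top, width, height, label in zones:
        if width <= 0 or height <= 0:
            raise ValueError(f"zone {label!r} has a non-positive size")
        if " " in label:
            raise ValueError("labels are single tokens, so no spaces")
        lines.append(f"{left} {top} {width} {height} {label}")
    return lines


def write_uzn(image_path, zones):
    """Write the zones next to the image, e.g. page.png -> page.uzn."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    uzn_path.write_text("\n".join(uzn_lines(zones)) + "\n", encoding="ascii")
    return uzn_path


# Hypothetical zones for a three-column table: rank, name, amount.
example_zones = [
    (10, 20, 60, 400, "Rank"),
    (90, 20, 300, 400, "Name"),
    (420, 20, 80, 400, "Amount"),
]
```

As far as I know, tesseract only reads the zone file in page segmentation modes that skip automatic column detection (--psm 4 and higher), so the mode matters when trying this.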

There was an analysis (you can find it in the forum) suggesting not to use letters taller than 30 pixels, so I also resized the input image.
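
The resize step is simple arithmetic. A sketch, assuming you have measured the height of a typical capital letter in pixels and want to bring it down to about 30 px; the function names and sample numbers are illustrative only.

```python
def scale_factor(letter_height_px, target_px=30.0):
    """Factor to multiply image width and height by so letters become target_px tall."""
    if letter_height_px <= 0:
        raise ValueError("letter height must be positive")
    return target_px / letter_height_px


def resized_dimensions(width, height, letter_height_px, target_px=30.0):
    """New (width, height) in whole pixels, aspect ratio preserved."""
    factor = scale_factor(letter_height_px, target_px)
    return round(width * factor), round(height * factor)


# Example: letters measured at 48 px in a 1280x720 image.
print(resized_dimensions(1280, 720, 48))  # (800, 450)
```

Letters already shorter than the target give a factor above 1. The advice here is only about not exceeding roughly 30 px, so in that case it is probably better to leave the image alone.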

The LSTM engine is not (always) good at OCR of amount fields, so I suggest using the legacy engine for this image (you will need end.traineddata from the tessdata repository).
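
For reference, these are the --oem values that come up in this thread, with a small sketch of assembling the command line from Python. The engine descriptions follow what `tesseract --help-oem` lists; the helper name `tesseract_argv` and its defaults are made up for this example.

```python
# Engine modes as listed by `tesseract --help-oem`.
OEM_MODES = {
    0: "legacy engine only",
    1: "neural nets LSTM engine only",
    2: "legacy + LSTM engines",
    3: "default, based on what is available",
}


def tesseract_argv(image, output_base="-", psm=4, oem=2, lang="eng"):
    """Build the argument list; an output base of "-" sends text to stdout."""
    if oem not in OEM_MODES:
        raise ValueError(f"unknown --oem value: {oem}")
    return [
        "tesseract", image, output_base,
        "--psm", str(psm),
        "--oem", str(oem),
        "-l", lang,
    ]
```

The list can be passed to `subprocess.run(..., capture_output=True, text=True)`. Modes 0 and 2 need a traineddata file that still contains the legacy components, which is why the tessdata repository is the one to download from.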

Here is the result:
tesseract 1_input_r.png - --psm 4 --oem 2
UZN file 1_input_r.uzn loaded.
15.

16.

17.

18.

19.

Sophie
Mitglied

DerNick03
Mitglied

Joko
Mitglied

Jens
Mitglied

Christian
Mitglied

76

51

0

0



Zdenko


On Sat, Oct 5, 2019 at 18:27, test0r man <test0r...@gmail.com> wrote:
1_input_r.uzn
1_input_r.png

test0r man

Oct 5, 2019, 3:31:09 PM
to tesseract-ocr
Hi Zdenko,

Very good job! I tried a lot of image manipulation, but that was the wrong approach for problems 1-3. The idea with the uzn file is great, and I think it is the perfect solution. Thanks :-)

I can confirm that scaling these images didn't help (letters taller than 30 pixels is the right explanation).

What do you mean by the "end" traineddata? I have the "eng" traineddata and can't find "end.traineddata", not even on Google.

I've tested it with your files and the result is perfect. Thank you, thank you, thank you!

Zdenko Podobny

Oct 5, 2019, 3:48:50 PM
to tesser...@googlegroups.com
"end" is a typo ;-) It should read "eng" :-)

On Sat, Oct 5, 2019 at 21:31, test0r man <test0r...@gmail.com> wrote: