How to see which fonts are used in .traineddata files

H Brenner

Oct 3, 2020, 1:06:30 AM
to tesseract-ocr
Hi,

I have tesseract 3.02 on a Windows 10 PC.

I am trying to recognise text on a form captured with a camera. The form contains mostly numbers in tabular form, plus a small amount of Hebrew text and one English "graphical" word. I processed the photo to remove a pink background pattern and to enhance the text in the image (the original, minus the pink pattern, produced the same results).

3198Rfat.png

The Hebrew text on the bottom 2 lines is cut off on the right, but this does not matter to me.

Only the numbers are of interest to me in the output.

I am running tesseract in Python using the pytesseract wrapper, and I am running the following command:
  • from PIL import Image
  • import pytesseract
  • Imaj = Image.open(ImgPath)  # ImgPath is the full path to the .png file
  • print('\n\n', 'v'*20, '\n', pytesseract.image_to_string(Imaj), '\n', '^'*20, '\n\n')  # uses the default language, eng
I believe this corresponds to the command-line:
  • tesseract  ImgPath  out    (I used the actual path)
The output that I get is the following:
  •  7547512723 2

  • 1334718913
  • 0000000000
  • 3927010465.
  • 4483273819..
  • 0.|..1|.|.1ln/_1|.7_n/.01
  • 0556107919..
  • 1|11n/Tln/_nJ110._O...|__
  • 6978344327..
  • n/..|9._..l9._Q.:1Jn.o3n/___
  • _/0._1|.|9._n0EunD3./:
  • n/L232333333““

  •  A —:1 qnnwn N

  • 156138

  • ::§1§§?13:?76fi-fi333ii‘ifi1
  • 10:52:25 29.11.19 :1 ma‘
Most of it is meaningless gibberish to me. Only the highlighted text is recognised correctly.

When I ran it with the Hebrew language selected, it produced similar results, but with some of the Hebrew characters and only the "156138" recognised correctly.

Running tesseract manually (English) in a 'CMD' window produced the attached file 'out.txt'.

I suspect that the font used in the form is the problem: the form was not printed from a normal Windows, Mac or Linux computer.

Which fonts were used to create heb.traineddata? Is there a way for me to display them?

Do I have to train tesseract with the font in the form?

Any help will be appreciated!

Thanks!

out.txt

Zdenko Podobny

Oct 3, 2020, 5:21:10 AM
to tesser...@googlegroups.com
1. Try the latest version.
2. Try experimenting with the psm (page segmentation mode) option, e.g. tesseract 20201002.png - --psm 11 --dpi 300 produces:
8 27 26 10 04 03 01

N29 19 16 14 09 03

131 27 25 18 12 03

N21 18 16 13 07 04

N32 232112 10 07

N 36 34 30 27 21 01

X35 3417 13 10 08

N36 33 29 28 14 09

R 33 32 31 21 06 01

- oe ————

—— — ——— —— a = —

R 37 27 19 09 05 03

-———

Fra anny

156136

-——

3198(19): ‘on iam mn

10:52:25 28.11.19 1 09

... Custom image segmentation would help too: split the table into cells and then OCR each "cell" individually.
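A minimal sketch of what such cell-by-cell segmentation could look like, assuming the page has already been binarised and that the cells are separated by blank rows and columns (a projection-profile split). This is only an illustration, not necessarily the approach meant above. The names find_runs, split_cells and ocr_cells, the threshold and the gap sizes are all hypothetical and would need tuning for a real photo.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def find_runs(profile: Sequence[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) ranges where profile > 0.

    Runs separated by fewer than min_gap empty positions are merged.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def split_cells(ink: Sequence[Sequence[int]],
                row_gap: int = 1, col_gap: int = 1) -> List[Box]:
    """Split a binary image (1 = ink, 0 = background) into cell boxes.

    Text rows are found from the horizontal projection, then each row is
    split into cells using its own vertical projection.
    """
    boxes: List[Box] = []
    row_profile = [sum(row) for row in ink]
    for top, bottom in find_runs(row_profile, row_gap):
        width = len(ink[top])
        col_profile = [sum(ink[y][x] for y in range(top, bottom))
                       for x in range(width)]
        for left, right in find_runs(col_profile, col_gap):
            boxes.append((left, top, right, bottom))
    return boxes


def ocr_cells(path: str, threshold: int = 128,
              row_gap: int = 3, col_gap: int = 15) -> List[str]:
    """Binarise the image at `path`, split it into cells, OCR each cell.

    Assumes Pillow, pytesseract and a Tesseract 4.1+/5 install are available.
    """
    from PIL import Image
    import pytesseract

    gray = Image.open(path).convert("L")
    width, height = gray.size
    pixels = gray.load()
    ink = [[1 if pixels[x, y] < threshold else 0 for x in range(width)]
           for y in range(height)]
    # --psm 7 treats each crop as a single text line; the whitelist
    # restricts output to digits, since only the numbers matter here.
    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    return [pytesseract.image_to_string(gray.crop(box), config=config).strip()
            for box in split_cells(ink, row_gap, col_gap)]
```

Setting col_gap larger than the spacing between the digits of one number, but smaller than the spacing between table columns, keeps each number in a single cell. A call such as ocr_cells("3198Rfat.png") would then return one string per detected cell. Tight crops may recognise better with a few pixels of margin added around each box.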

Zdenko


On Sat, Oct 3, 2020 at 7:06, H Brenner <hylton...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a6602b5e-307e-406d-8650-510e8c2225e6n%40googlegroups.com.

H Brenner

Oct 5, 2020, 9:17:16 PM
to tesser...@googlegroups.com
Hello Zdenko,

1) Can I assume you used the latest version of tesseract to produce the output you posted?
    To install the latest version, do I need to first uninstall the older version that I have on my PC?
2) How do I create a custom image segmentation?

Thanks,
Hylton


H Brenner

Oct 22, 2020, 11:05:56 PM
to tesseract-ocr
Hi Zdenko,

Per your suggestion I have installed the latest version of tesseract (version 5), and I experimented with the psm option.

I get the best result using --psm 11, as you did; other psm values give poor results. Even so, the --psm 11 output is still not good.
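A minimal sketch of how several psm values could be compared from Python rather than one by one on the command line, assuming pytesseract and Tesseract 4 or later. The helper names build_config, default_ocr and compare_psm_modes and the list of modes tried are hypothetical. The --psm and --dpi spellings are those of Tesseract 4+; version 3.x used -psm.

```python
from typing import Callable, Dict, Iterable, Optional


def build_config(psm: int, dpi: Optional[int] = None) -> str:
    """Build a Tesseract option string such as '--psm 11 --dpi 300'."""
    parts = [f"--psm {psm}"]
    if dpi is not None:
        parts.append(f"--dpi {dpi}")
    return " ".join(parts)


def default_ocr(image, config: str) -> str:
    """Run Tesseract through pytesseract (imported lazily)."""
    import pytesseract  # needs pytesseract plus a Tesseract install
    return pytesseract.image_to_string(image, config=config)


def compare_psm_modes(image,
                      modes: Iterable[int] = (4, 6, 11, 12),
                      dpi: Optional[int] = 300,
                      ocr: Callable[[object, str], str] = default_ocr
                      ) -> Dict[int, str]:
    """Return {psm: recognised text} for each requested mode."""
    results: Dict[int, str] = {}
    for psm in modes:
        results[psm] = ocr(image, build_config(psm, dpi))
    return results


# Example use (requires Pillow, pytesseract and the image file):
#   from PIL import Image
#   img = Image.open("3198Rfat.png")
#   for psm, text in compare_psm_modes(img).items():
#       print(f"===== psm {psm} =====")
#       print(text)
```

Passing the OCR function in as a parameter is a design choice made only so that the loop can be tried without Tesseract installed; in normal use the default is enough.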

How do I create custom image segmentation?

Thank you in advance for your help.

Hylton

Zdenko Podobny

Oct 23, 2020, 6:08:24 AM
to tesser...@googlegroups.com