How to see which fonts are used in .traineddata files

H Brenner

Oct 3, 2020, 1:06:30 AM

to tesseract-ocr

Hi,

I have tesseract 3.02 on a Windows 10 PC.

I am trying to recognise text on a form that was photographed with a camera. The form consists mostly of numbers in tabular layout, plus a small amount of Hebrew text and one English "graphical" word. I processed the photo to remove a pink background pattern and to enhance the text. (The original image, with only the pink pattern removed, produced the same results.)
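As an illustration only (this is not the processing that was actually applied to the photo), one simple way to drop a pink background is to keep only pixels whose brightest colour channel is dark. Pink has a high red value, so it is discarded, while dark print survives. The function name binarize and the threshold of 128 are assumptions chosen for this sketch.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0..255


def binarize(pixels: List[List[Pixel]], threshold: int = 128) -> List[List[int]]:
    """Return a grid of 1 (ink) and 0 (background).

    A pixel counts as ink only if its brightest channel is below the
    threshold, so pink or white background pixels are dropped.
    The threshold value is an assumption and needs tuning per image.
    """
    return [[1 if max(px) < threshold else 0 for px in row] for row in pixels]
```

The pixel rows could be read from a Pillow image converted with convert("RGB"), for example via getpixel((x, y)); a real photo would probably also need noise removal.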

The Hebrew text on the bottom 2 lines is cut off on the right, but this does not matter to me.

Only the numbers are of interest to me in the output.

I am running tesseract in Python using the pytesseract wrapper, and I am running the following command:

from PIL import Image
import pytesseract
Imaj = Image.open(ImgPath)  # ImgPath is the full path to the .png file.
print('\n\n', 'v'*20, '\n', pytesseract.image_to_string(Imaj), '\n', '^'*20, '\n\n')  # use eng default

I believe this corresponds to the command line:

tesseract ImgPath out (I used the actual path)

The output that I get is the following:

7547512723 2
1334718913
0000000000
3927010465.
4483273819..
0.|..1|.|.1ln/_1|.7_n/.01
0556107919..
1|11n/Tln/_nJ110._O...|__
6978344327..
n/..|9._..l9._Q.:1Jn.o3n/___
_/0._1|.|9._n0EunD3./:
n/L232333333““
A —:1 qnnwn N
156138
::§1§§?13:?76ﬁ-ﬁ333ii‘iﬁ1
10:52:25 29.11.19 :1 ma‘

Most of it is meaningless gibberish to me. Only the highlighted text is recognised correctly.

When I ran it with the Hebrew language selected, it produced similar results, but with some of the Hebrew characters and only the "156138" recognised correctly.

Running tesseract manually (English) in a 'CMD' window produced the attached file 'out.txt'.

I suspect that the font used in the form is the problem: the form was not printed on a normal Windows, Mac or Linux computer.

Which fonts were used to create heb.traineddata? Is there a way for me to display them?

Do I have to train tesseract with the font in the form?

Any help will be appreciated!

Thanks!

out.txt

Zdenko Podobny

Oct 3, 2020, 5:21:10 AM

to tesser...@googlegroups.com

1. Try the latest version.

2. Try experimenting with the page segmentation mode (psm), e.g. tesseract 20201002.png - --psm 11 --dpi 300 produces:

8 27 26 10 04 03 01

N29 19 16 14 09 03

131 27 25 18 12 03

N21 18 16 13 07 04

N32 232112 10 07

N 36 34 30 27 21 01

X35 3417 13 10 08

N36 33 29 28 14 09

R 33 32 31 21 06 01

- oe ————

—— — ——— —— a = —

R 37 27 19 09 05 03

-———

Fra anny

156136

-——

3198(19): ‘on iam mn

10:52:25 28.11.19 1 09

... Custom image segmentation would help too (followed by OCR of each "cell" individually).
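For anyone calling Tesseract through pytesseract, as in the original post, the options from point 2 can be passed through the config argument of image_to_string. The sketch below is a minimal example under assumptions: the helper names build_config and ocr_image are made up for illustration, and the optional digit whitelist (tessedit_char_whitelist) may or may not be honoured depending on the Tesseract version and engine in use.

```python
def build_config(psm: int = 11, dpi: int = 300, whitelist: str = "") -> str:
    """Build a Tesseract option string such as '--psm 11 --dpi 300'."""
    parts = [f"--psm {psm}", f"--dpi {dpi}"]
    if whitelist:
        # Restrict recognition to the given characters (support varies by version).
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def ocr_image(image_path: str, lang: str = "eng", **options) -> str:
    """Run OCR on one image file; options are forwarded to build_config."""
    # Imported here so build_config can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(
            img, lang=lang, config=build_config(**options)
        )
```

A call such as ocr_image("20201002.png", whitelist="0123456789") would then request sparse-text mode at 300 dpi with digits only.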

Zdenko

On Sat, Oct 3, 2020 at 7:06, H Brenner <hylton...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a6602b5e-307e-406d-8650-510e8c2225e6n%40googlegroups.com.

H Brenner

Oct 5, 2020, 9:17:16 PM

to tesser...@googlegroups.com

Hello Zdenko,

1) Can I assume you used the latest version of tesseract to produce that output?

To install the latest version, do I need to first uninstall the older version that I have on my PC?

2) How do I create a custom image segmentation?

Thanks,

Hylton


H Brenner

Oct 22, 2020, 11:05:56 PM

to tesseract-ocr

Hi Zdenko,

Per your suggestion I have installed the latest version of tesseract (Ver 5), and I experimented with the psm setting.

I get the best result using --psm 11, as you did; other psm values give poor results. Even so, the --psm 11 output is still not good.

How do I create custom image segmentation?

Thank you in advance for your help.

Hylton

Zdenko Podobny

Oct 23, 2020, 6:08:24 AM

to tesser...@googlegroups.com

e.g.

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.444.226&rep=rep1&type=pdf

https://arthurflor23.medium.com/text-segmentation-b32503ef2613
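As a rough illustration of what cell segmentation can look like (a simple projection-profile approach, not necessarily the method described in the links above), the sketch below splits a binarised image into boxes by finding blank rows and then blank columns. The function names, the 1 = ink convention and the gap parameters are all assumptions made for this example.

```python
from typing import List, Tuple


def find_runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) spans of non-zero entries; end is exclusive.

    Spans separated by fewer than min_gap zero entries are merged, so
    the digits of one number are not split into separate cells.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i
        elif count == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def segment_cells(
    binary: List[List[int]], min_row_gap: int = 1, min_col_gap: int = 1
) -> List[Tuple[int, int, int, int]]:
    """Split a binary image (1 = ink, 0 = background) into cell boxes.

    Boxes are (left, top, right, bottom), the order PIL's Image.crop expects.
    """
    boxes = []
    row_profile = [sum(row) for row in binary]  # ink per row
    for top, bottom in find_runs(row_profile, min_row_gap):
        band = binary[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]  # ink per column in this band
        for left, right in find_runs(col_profile, min_col_gap):
            boxes.append((left, top, right, bottom))
    return boxes
```

Each box could then be cropped from the image and passed to Tesseract on its own, for example with --psm 7 (treat the image as a single text line). On a real photo the gap values would need tuning, and skew or ruled table lines would have to be handled first.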

Zdenko

On Fri, Oct 23, 2020 at 5:05, H Brenner <hylton...@gmail.com> wrote:

