Hi,
I have tesseract 3.02 on a Windows 10 PC.
I am trying to recognise text on a form photographed with a camera. The form contains mostly numbers in tabular form, plus a small amount of Hebrew text and one English "graphical" word. I processed the photo to remove a pink background pattern and to enhance the text; the original image, minus the pink pattern, produced the same results.
The Hebrew text on the bottom 2 lines is cut off on the right, but this does not matter to me.
Only the numbers are of interest to me in the output.
I am running tesseract from Python through the pytesseract wrapper, using the following code:
- from PIL import Image
- import pytesseract
- Imaj = Image.open(ImgPath)  # ImgPath is the full path to the .png file.
- print('\n\n', 'v'*20, '\n', pytesseract.image_to_string(Imaj), '\n', '^'*20, '\n\n')  # uses the default language (eng)
I believe this corresponds to the command-line:
- tesseract ImgPath out (I used the actual path)
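To make the comparison concrete, here is a minimal sketch of the command line I believe the wrapper call amounts to, built as an argument list. The helper names `build_tesseract_cmd` and `run_tesseract` and the file name `form.png` are placeholders of mine, not part of tesseract or pytesseract; `-l` is the tesseract language option, and I am assuming `eng` is what gets used when it is omitted.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_tesseract(image_path, output_base, lang="eng"):
    # Needs the tesseract executable on PATH; tesseract writes <output_base>.txt.
    subprocess.run(build_tesseract_cmd(image_path, output_base, lang), check=True)


# "form.png" is a placeholder for the real image path.
print(build_tesseract_cmd("form.png", "out"))              # English (default)
print(build_tesseract_cmd("form.png", "out", lang="heb"))  # Hebrew
```

Calling `run_tesseract("form.png", "out", lang="heb")` would be the Hebrew run, assuming heb.traineddata is installed.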
The output that I get is the following:
- 7547512723 2
- 1334718913
- 0000000000
- 3927010465.
- 4483273819..
- 0.|..1|.|.1ln/_1|.7_n/.01
- 0556107919..
- 1|11n/Tln/_nJ110._O...|__
- 6978344327..
- n/..|9._..l9._Q.:1Jn.o3n/___
- _/0._1|.|9._n0EunD3./:
- n/L232333333““
- A —:1 qnnwn N
- 156138
- ::§1§§?13:?76fi-fi333ii‘ifi1
- 10:52:25 29.11.19 :1 ma‘
Most of it is meaningless gibberish to me. Only the highlighted text is recognised correctly.
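Since only the numbers matter to me, one thing I could do is filter the raw output down to runs of digits. The sketch below is only that: the function name `extract_numbers` and the minimum run length of 6 are my own assumptions, and it does nothing to fix characters that tesseract misread, it only discards the non-numeric noise.

```python
import re


def extract_numbers(ocr_text, min_digits=6):
    """Return every run of at least `min_digits` consecutive digits in the text."""
    # min_digits is an arbitrary cut-off (assumption) to skip short fragments
    # such as the pieces of a date or time.
    return re.findall(r"\d{%d,}" % min_digits, ocr_text)


# A few lines copied from the output above.
sample = "7547512723 2\n1334718913\n3927010465.\n156138\n10:52:25 29.11.19"
print(extract_numbers(sample))
```

In practice the input would be the string returned by `pytesseract.image_to_string(Imaj)`.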
When I ran it with Hebrew as the selected language, the results were similar: some of the Hebrew characters were recognised, and of the numbers only the "156138" was recognised correctly.
Running tesseract manually (English) in a 'CMD' window produced the attached file 'out.txt'.
I suspect that the font used in the form is the problem: the form was not printed from a normal Windows, Mac or Linux computer.
Which fonts were used to create heb.traineddata? Is there a way for me to display them?
Do I have to train tesseract with the font in the form?
Any help will be appreciated!
Thanks!