Tesseract detects upper-case letters as lower case and inserts non-existent spaces

Ashish Goel

May 27, 2016, 2:47:25 AM

to tesseract-ocr

I am using the Linux command-line version of Tesseract 3.04 for an OCR project. I have good-quality 800x600 images (screenshots of a GUI).
Each image contains several blocks of text that I want to read. To isolate each text block, I crop the image with ImageMagick.
The results are good, except that in some cases I get a non-existent space or an upper-case/lower-case mismatch.

I will explain in detail below:

When I OCR image 2_cropped.png (It is a Swedish image) using tesseract:
    tesseract 2_cropped.png stdout --tessdata-dir /usr/share/tesseract-ocr/tessdata/ -l swe
I get,
Får Ej anslutas
direkt till
toraxslangar.

Note that EJ is read as Ej.

Whereas, when I OCR image 3_cropped.png:
I get,
Får EJ anslutas
direkt till
trakealslangar.

This is all good.

The properties of these cropped images are:
   2_cropped.png PNG 442x214 2143x1605+509+455 8-bit DirectClass 41.4KB 0.000u 0:00.000
   3_cropped.png[1] PNG 435x217 2175x1628+1563+448 8-bit DirectClass 43.4KB 0.000u 0:00.000

Whereas the properties of the original parent image are (not attached here):
   PNG 800x600 800x600+0+0 8-bit DirectClass 159KB 0.000u 0:00.000

I tried some image processing such as resizing (it seemed to help, but introduced other problems):
    convert 2_cropped.png -resize 400% 3_cropped.png

It helped Tesseract recognize 'Ej' as 'EJ' again, but created new problems for other images, as detailed below (after I resized all images to 400%):

- Tesseract cannot read 0_cropped.png at all. (It is a Turkish image.)
- Tesseract inserts a space while reading nonexistantspaces.png.
      tesseract nonexistantspaces.png stdout --tessdata-dir /usr/share/tesseract-ocr/tessdata/ -l tur
It gives me (Turkish image),

Trakea tüplerine

doğrudan
BAĞ LAMAYIN.

Note the space between BAĞ and LAMAYIN.

- tesseract reads another similar image everythingisgood.png correctly. It gives me,
Kapalı yara
drenlerine
BAĞLAMAYIN.

Can anyone help me understand how to avoid mismatched letter case and non-existent spaces during OCR?
It would be really hard to apply trial-and-error image processing to so many images. I want to understand why a specific word is or is not read correctly: something that tells me when I should and should not resize the image, and what to do if resizing introduces non-existent spaces.

Regards,
Ashish

0_cropped.png

2_cropped.png

3_cropped.png

everythingisgood.png

nonexistantspaces.png

Bojidar Stanchev

Jun 8, 2016, 4:56:42 AM

to tesseract-ocr

Use a different font. This font confuses even me: if you hadn't pointed it out, I would have thought this was a lower-case j. Try a common font, and check which fonts Tesseract was originally trained on. If you insist on using this font, retrain Tesseract to recognize it.

In my opinion, occasional false spaces and mixed-up letter case are very minor mistakes, and you should not worry about them so much.

I think a simple program can post-process the output to make it perfect: you can check the words against a database, try removing spaces to see whether the concatenated pieces form a word, etc.
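
A minimal sketch of that post-processing idea in Python, assuming you can supply a list of the words you expect to see. The function name `merge_split_words` and the small `vocabulary` set are hypothetical, made up for illustration. The comparison is exact-case on purpose, because case-insensitive matching in Turkish needs extra care with dotted and dotless I.

```python
PUNCTUATION = ".,;:!?"


def merge_split_words(line, vocabulary):
    """Join two adjacent tokens when their concatenation is a known word.

    `vocabulary` is a hypothetical set of expected words (exact case).
    """
    tokens = line.split()
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            joined = tokens[i] + tokens[i + 1]
            # Ignore leading/trailing punctuation when looking words up.
            first_known = tokens[i].strip(PUNCTUATION) in vocabulary
            # Merge only if the joined form is known and the first
            # piece is not already a known word by itself.
            if joined.strip(PUNCTUATION) in vocabulary and not first_known:
                merged.append(joined)
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return " ".join(merged)


# Hypothetical vocabulary built from the words expected in the image.
vocabulary = {"Trakea", "tüplerine", "doğrudan", "BAĞLAMAYIN"}

for ocr_line in ["Trakea tüplerine", "doğrudan", "BAĞ LAMAYIN."]:
    print(merge_split_words(ocr_line, vocabulary))
```

The same lookup could be extended to the letter-case problem, for example by replacing a token with the vocabulary entry that differs from it only in case, but that depends on how complete your word list is.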
