Reading Inconsistently Spaced Text on a busy image

Schuyler Reinken

Oct 15, 2021, 10:30:10 AM

to tesseract-ocr

Hello! I am having trouble using Tesseract to read inconsistently spaced text.

It tends to miss entire lines of text in the government warning in the attached image. I don't need to read the blue angled text, only the text on the white sidebar. Is there a way to improve its reading of this sort of image?

Schuyler Reinken

Oct 21, 2021, 5:32:17 PM

to tesseract-ocr

I am using Tesseract 4.1.1, and the results on this image are as follows:

-----------------------------------------------------

roan
nian
Er
Preferred i)
PRODUCED & wa
SPRINGGATES
FARMS AND VINEYARD
Le
1
Tome Son a Woon
Hui Sov vet Aoinii
BEVERAGES UF
a i od oR De pa 1
primi ett
‘OPERATE MACHNERY, AND MAY CAUSE
375 mL 7% ALC BY VOL REATH PROBES. COMANSSUFTES
Jon 2 To 5 GIP \Y » ) SIR VW, T=" Wa COO pn a TEES gemma

-------------------------------------------------------------------------------------------------------------

Schuyler Reinken

Oct 21, 2021, 5:34:09 PM

to tesseract-ocr

I'm using the English tessdata_best model on Linux.

Zdenko Podobny

Oct 22, 2021, 12:56:51 AM

to tesser...@googlegroups.com

Generally: read and follow https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md

Basically: preprocess the image. Remove non-text elements, or OCR only the text areas (search the internet for "text detection").

Zdenko

On Thu, Oct 21, 2021 at 23:34, Schuyler Reinken <xarl...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/123a18f9-c281-4063-b197-45a9a35e6090n%40googlegroups.com.

Schuyler Reinken

Oct 22, 2021, 9:44:02 AM

to tesseract-ocr

We already use Python with OpenCV (cv2) to convert the image to grayscale and binarize it. I also tried erosion, but it showed no marked improvement. For this particular image it would be easy to remove the left side, but it is only a sample: in the actual application we are building, the text can occur in any part of the image. When you say "OCR only text areas", do you mean running Tesseract once in a different page segmentation mode just to get a bounding box, then running it again on that region to get the text accurately?
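To make the question concrete, here is a rough sketch of the two-pass idea I have in mind. It assumes the pytesseract wrapper and Pillow are installed; the function names, the confidence cutoff of 30, the padding, and the choice of `--psm 11` for the first pass and `--psm 6` for the second are all my own guesses, not something I have confirmed works on this image.

```python
def union_box(data, min_conf=0.0, pad=10, limit=None):
    """Return (left, top, right, bottom) around all confident words, or None.

    `data` is a dict shaped like pytesseract's image_to_data DICT output.
    `limit` is an optional (width, height) used to clamp the box.
    """
    boxes = []
    for text, conf, x, y, w, h in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        # Non-word rows have empty text and a confidence of -1.
        if text.strip() and float(conf) > min_conf:
            boxes.append((x, y, x + w, y + h))
    if not boxes:
        return None
    left = max(min(b[0] for b in boxes) - pad, 0)
    top = max(min(b[1] for b in boxes) - pad, 0)
    right = max(b[2] for b in boxes) + pad
    bottom = max(b[3] for b in boxes) + pad
    if limit is not None:
        right = min(right, limit[0])
        bottom = min(bottom, limit[1])
    return left, top, right, bottom


def two_pass_ocr(path):
    """Hypothetical helper: locate words first, then OCR only that region."""
    # Imported here so union_box can be tried without Tesseract installed.
    import pytesseract
    from PIL import Image

    image = Image.open(path)
    # Pass 1: sparse-text mode, used only to find where words are.
    data = pytesseract.image_to_data(
        image, config="--psm 11", output_type=pytesseract.Output.DICT
    )
    box = union_box(data, min_conf=30, limit=image.size)
    if box is None:
        return ""
    # Pass 2: read the cropped region as a single uniform block of text.
    return pytesseract.image_to_string(image.crop(box), config="--psm 6")
```

One obvious weakness: if the first pass also finds confident words in the blue angled text, the union box will still include the busy part of the image.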

Zdenko Podobny

Oct 22, 2021, 2:14:34 PM

to tesser...@googlegroups.com

As I wrote: try searching for "text detection" (or document analysis). You will see it is quite difficult, and there is almost no free/open-source solution.

Something is implemented in Tesseract, but (from my experience) it fails for complex pages like the one you provided. That's why the documentation suggests removing "noise" (non-text elements). You can try it by cropping your image to just the right (white) part, and you will get significantly better results with default settings:

scanfor
information
and pairing
Suggestions

PRODUCED & BOTTLED BY
SPRINGGATE®
FARMS AND VINEYARD
HARRISBURG, PA 17112
Www springgatevineyard.com

0812433! l
GOVERNMENT WARNING: 1) ACCORDING
70 THE SURGEON GENERAL, WOMEN
SHOULD NOT RINK ALCOHOLIC
BEVERAGES DURNG PREGNANCY BECAUSE
OF THE RISK OF BIRTH DEFECTS. (2)
CONSUMPTION OF ALCOHOUC BEVERAGES
INPARS YOUR ABLITY TODRNE ACAROR
OPERATE MACHINERY, AND MAY CAUSE
HEALTH PROBLEMS. CONTAINS SULFTES
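For images like this sample, where the text sits on a mostly white vertical strip, the cropping step might be automated along the following lines. This is only a sketch under that layout assumption: the function name, the brightness threshold of 200, and the 60% fraction are illustrative values, not anything Tesseract provides, and it will not help when text can appear on arbitrary backgrounds.

```python
def brightest_column_span(gray, threshold=200, min_fraction=0.6):
    """Return (start, end) of the widest run of mostly-bright columns.

    `gray` is a 2D grid of grayscale values (rows of 0..255); `end` is exclusive.
    A column counts as bright when at least `min_fraction` of its pixels
    are >= `threshold`, so dark text on a white strip still qualifies.
    """
    if len(gray) == 0:
        return (0, 0)
    height = len(gray)
    width = len(gray[0])
    best = (0, 0)
    run_start = None
    # Iterate one step past the last column so an open run gets closed.
    for x in range(width + 1):
        bright = False
        if x < width:
            n_bright = sum(1 for y in range(height) if gray[y][x] >= threshold)
            bright = n_bright >= min_fraction * height
        if bright and run_start is None:
            run_start = x
        elif not bright and run_start is not None:
            if x - run_start > best[1] - best[0]:
                best = (run_start, x)
            run_start = None
    return best
```

With OpenCV, `gray` could come from `cv2.imread(path, cv2.IMREAD_GRAYSCALE)` and the crop would then be `gray[:, start:end]`. The pure-Python loop is slow on large images, so a vectorized version would be preferable in practice.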

There are still some problems (e.g. "I"), but they are IMO related to the quality of the image, so you cannot solve them with preprocessing (maybe post-processing with a spellchecker would be a solution if you cannot get better input).
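A minimal sketch of that post-processing idea, using Python's standard `difflib` as a stand-in for a real spellchecker. It assumes the label carries the standard US government warning wording, so the vocabulary can be built from that known text; the names `correct_word` and `correct_line`, the 0.75 cutoff, and the four-letter minimum are illustrative choices.

```python
import difflib
import re

# Assumed reference text: the standard US alcohol warning plus "CONTAINS SULFITES".
WARNING_TEXT = (
    "GOVERNMENT WARNING: (1) ACCORDING TO THE SURGEON GENERAL, WOMEN SHOULD "
    "NOT DRINK ALCOHOLIC BEVERAGES DURING PREGNANCY BECAUSE OF THE RISK OF "
    "BIRTH DEFECTS. (2) CONSUMPTION OF ALCOHOLIC BEVERAGES IMPAIRS YOUR "
    "ABILITY TO DRIVE A CAR OR OPERATE MACHINERY, AND MAY CAUSE HEALTH "
    "PROBLEMS. CONTAINS SULFITES"
)
VOCAB = sorted(set(re.findall(r"[A-Z]+", WARNING_TEXT)))


def correct_word(word, vocab=VOCAB, cutoff=0.75, min_len=4):
    """Replace a word with its closest vocabulary entry, if one is close enough."""
    upper = word.upper()
    # Leave known words and short words alone; short words match too easily.
    if upper in vocab or len(upper) < min_len:
        return word
    matches = difflib.get_close_matches(upper, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_line(line):
    """Correct each alphabetic token in a line, keeping punctuation and digits."""
    return re.sub(r"[A-Za-z]+", lambda m: correct_word(m.group(0)), line)


if __name__ == "__main__":
    print(correct_line("BEVERAGES DURNG PREGNANCY BECAUSE"))
```

This only works because the expected text is known in advance. Words the OCR merged or split (such as "TODRNE" or "ACAROR" above) are too far from any vocabulary entry and are left unchanged.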

Zdenko

On Fri, Oct 22, 2021 at 15:44, Schuyler Reinken <xarl...@gmail.com> wrote: