Re: Tesseract OCR not performing well even after data cleaning and transformations on black background data

farhad khalafi

Jan 30, 2019, 4:19:49 AM

to tesseract-ocr

@Smriti: In the latest version (1.3.0) of our free Tesseract Studio, we have an experimental routine to detect and fix inverted text blocks (e.g. table headers with light text on a dark background). Reliably detecting the image background is not an easy task. Our approach uses histograms in both the horizontal and vertical directions to detect large rectangles that could be header blocks. I would be curious to find out whether the code works for you. You will need to set the "Fix inverted text" option under the Image tab. I ran an experiment with a sample PDF file and captured the intermediate images in the attached document. Your case may not behave the same way, but there is no harm in trying.
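For readers who want to experiment with the histogram idea outside the tool, here is a minimal sketch in Python with NumPy. It is not the Tesseract Studio code, only one plausible way to use horizontal and vertical projection profiles to find large dark rectangles and invert them. The function names (`runs_above`, `find_dark_blocks`, `fix_inverted_blocks`) and all threshold values are assumptions made for illustration.

```python
import numpy as np

def runs_above(profile, threshold):
    """Return (start, end) pairs, end exclusive, where profile > threshold."""
    mask = np.concatenate(([False], profile > threshold, [False]))
    edges = np.flatnonzero(mask[1:] != mask[:-1])
    return [(int(s), int(e)) for s, e in zip(edges[0::2], edges[1::2])]

def find_dark_blocks(gray, dark_level=128, min_fraction=0.6,
                     min_height=8, min_width=40):
    """Find large dark rectangles in a grayscale uint8 image.

    All parameter defaults are illustrative guesses, not tuned values.
    Returns a list of (top, bottom, left, right) with exclusive ends.
    """
    dark = gray < dark_level
    blocks = []
    # Horizontal histogram: fraction of dark pixels in each row.
    row_profile = dark.mean(axis=1)
    for top, bottom in runs_above(row_profile, min_fraction):
        if bottom - top < min_height:
            continue
        # Vertical histogram, restricted to the candidate row band.
        col_profile = dark[top:bottom].mean(axis=0)
        for left, right in runs_above(col_profile, min_fraction):
            if right - left >= min_width:
                blocks.append((top, bottom, left, right))
    return blocks

def fix_inverted_blocks(gray, **kwargs):
    """Invert every detected dark block so its text becomes dark on light."""
    fixed = gray.copy()
    for top, bottom, left, right in find_dark_blocks(gray, **kwargs):
        fixed[top:bottom, left:right] = 255 - fixed[top:bottom, left:right]
    return fixed
```

This sketch only finds a band when dark pixels cover more than `min_fraction` of the full row width, and very dense light glyphs can split a block into pieces. A more robust version would probably smooth the profiles or search per column group first.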

On Tuesday, January 29, 2019 at 11:35:47 PM UTC-7, sett...@gmail.com wrote:

I have written some Python code to extract data from an image using Tesseract, i.e. Pytesseract OCR. But even after various transformations using OpenCV, I am not getting satisfactory results. The data on a dark background is not being extracted properly, even after the background has been lightened. I have attached a sample image. The part marked in black is being extracted properly, but the parts marked in blue, yellow and red aren't being extracted well. I have put them in squares just so that they can be noticed. In the original image, all I have is English words and a few numbers (including decimals). Any help would be much appreciated.

Regards
Smriti

IntermediateImages.pdf

sett...@gmail.com

Jan 30, 2019, 5:32:26 AM

to tesseract-ocr

@farhad khalafi, thank you for the reply. I tried it, but I am getting almost the same result as with my own code.

sett...@gmail.com

Jan 30, 2019, 5:36:23 AM

to tesseract-ocr

Capture11.PNG

farhad khalafi

Jan 30, 2019, 10:09:56 AM

to tesseract-ocr

A few questions:

Is the image you posted the original, or the result after your processing?

What is the image resolution?

What does the extracted text look like?

Any possibility of sharing the original image without redactions?

sett...@gmail.com

Jan 30, 2019, 11:37:54 PM

to tesseract-ocr

I have processed the image: converted it to grayscale, resized it (300 dpi), and denoised it using fastNlMeansDenoising, all with OpenCV 4.0.0.

For example, where the text on the image reads "26 Electrical 8.34 7.47 171,637", my OCR reads it as "16,, 5mm -, _. - m. 16w: 111.9311".
