Inconsistencies in detection and extraction of text using tesseract


Saanvi Bhagat

May 30, 2024, 10:34:36 AM
to tesseract-ocr

I have attached the image from which I am trying to extract text using Tesseract OCR (input.jpeg), along with the resulting extracted text. As can be observed from the images, the extracted text is not very accurate: negative signs have been omitted, and some unwanted characters appear in the output. (I have marked some of the incorrect results with blue boxes.)

I have tried to improve the results with preprocessing and by changing the model's parameters. I have tried:

1. Binarizing the images (a rough sketch of this step is shown below)

2. HDR processing of the images

Even then, such inconsistencies remain.
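
For reference, this is roughly how I binarize before running OCR (a minimal sketch assuming OpenCV; the file name and the blur kernel are just placeholders):

import cv2
import pytesseract

# Minimal binarization sketch; "input.jpeg" and the 3x3 blur are placeholders.
img = cv2.imread("input.jpeg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Light denoising before thresholding; Otsu then picks the threshold itself.
blur = cv2.GaussianBlur(gray, (3, 3), 0)
_, binary = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

text = pytesseract.image_to_string(binary, lang="eng")
print(text)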

How can I improve the detection and extraction of text with Tesseract? I have also tried PaddleOCR for the same task; even then, symbols such as the euro sign and some negative signs are not detected.

output (2).jpeg
input.jpeg

Jun Repasa

May 31, 2024, 3:37:01 AM
to tesseract-ocr
It's hard to give an opinion without seeing how you set up Tesseract, what PSM you specified, etc.

Saanvi Bhagat

May 31, 2024, 8:19:32 AM
to tesseract-ocr

To improve the results, I have applied Canny edge detection and a Hough lines transform to the images, and then fed the binarized image to Tesseract:

text = pytesseract.image_to_string(cropped_frame, lang='eng', config='--psm 6 --oem 3')
The results have improved a bit, but they are still far from perfect. Negative signs are being omitted, and some of them are misread as ~. Similarly, some decimal points are being omitted: 22.5 was extracted as 225.
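
For context, the Canny + Hough step looks roughly like this (a sketch only: the thresholds are illustrative, cropped_frame is assumed to be a BGR image as in the snippet above, and here the detected lines are simply painted out so the table rulings are not read as characters):

import cv2
import numpy as np
import pytesseract

# Sketch: detect long straight lines and erase them before binarizing.
gray = cv2.cvtColor(cropped_frame, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                        minLineLength=100, maxLineGap=5)

cleaned = gray.copy()
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        # Paint over each detected ruling with the (assumed white) background.
        cv2.line(cleaned, (x1, y1), (x2, y2), 255, 3)

_, binary = cv2.threshold(cleaned, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
text = pytesseract.image_to_string(binary, lang='eng', config='--psm 6 --oem 3')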

Jun Repasa

Jun 1, 2024, 1:51:17 AM
to tesseract-ocr
Try resizing the image to increase its size, using interpolation with INTER_AREA or INTER_CUBIC; the bigger the image, the better Tesseract performs. PSM 6 is the right setting.
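
Something along these lines, as a minimal sketch (the 2x factor is only an example, and cropped_frame is the variable from the earlier snippet):

import cv2
import pytesseract

# Upscale before OCR; INTER_CUBIC for enlarging, INTER_AREA when shrinking.
big = cv2.resize(cropped_frame, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
text = pytesseract.image_to_string(big, lang='eng', config='--psm 6 --oem 3')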

Saanvi Bhagat

Jun 3, 2024, 9:51:26 AM
to tesseract-ocr
Thank you so much for your help!! Using interpolation improved my results to a great extent. I would like one more suggestion from you. I have extracted the text from the table in the image, and now I am trying to save it in a CSV. For that, I am using the coordinates of the detected text and reconstructing the table structure (roughly as sketched below).
I am providing the input image and a screenshot of the resulting output in the CSV file. As can be seen in the output_in_csv image, the facts and figures are being saved correctly; however, the first column is a mess: a new column is being generated for each word. That is probably because Tesseract detects the text word by word, so a new column gets created for each word. Could you please suggest a way to improve my results (mainly the first column)?
The main issues are repetition in the column values and a new column being created for each word rather than just one column.
input_image.jpg
output_in_csv.jpeg
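
The coordinate-based reconstruction I use looks roughly like this (a simplified sketch; the real column handling is more involved, and binary is the preprocessed image from earlier):

import csv
import pytesseract
from pytesseract import Output

data = pytesseract.image_to_data(binary, lang='eng',
                                 config='--psm 6 --oem 3',
                                 output_type=Output.DICT)

rows = {}
for i, word in enumerate(data['text']):
    if not word.strip():
        continue
    # Words that share block/paragraph/line numbers sit on the same text line.
    key = (data['block_num'][i], data['par_num'][i], data['line_num'][i])
    rows.setdefault(key, []).append((data['left'][i], word))

with open('table.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for key in sorted(rows):
        # Sort words left-to-right; writing each word as its own cell is
        # exactly what produces the extra columns described above.
        writer.writerow([w for _, w in sorted(rows[key])])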

Ger Hobbelt

Jun 3, 2024, 4:55:41 PM
to tesser...@googlegroups.com
Re image size, etc.: see:
https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ -- whose report / chart suggests it's beneficial to rescale any input image to produce a text size of about 30 px vertical.
https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html -- which also links to the report above, plus it has some table-related info.
Both explain *why* resizing, etc. are often beneficial to OCR confidence numbers & quality.
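
(For completeness, a rough sketch of one way to aim for that ~30 px target: estimate the current text height from tesseract's own word boxes and rescale accordingly. The median-height estimate and the img variable are illustrative only.)

import statistics
import cv2
import pytesseract
from pytesseract import Output

# Sketch: measure word-box heights, then rescale so they land near ~30 px.
d = pytesseract.image_to_data(img, output_type=Output.DICT)
heights = [h for h, t in zip(d['height'], d['text']) if t.strip()]
if heights:
    scale = 30.0 / statistics.median(heights)
    img = cv2.resize(img, None, fx=scale, fy=scale,
                     interpolation=cv2.INTER_CUBIC if scale > 1 else cv2.INTER_AREA)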

Re your last question about the first column in your reconstructed table: https://groups.google.com/g/tesseract-ocr/c/B2-EVXPLovQ/m/lP0zQVApAAAJ -- reconstructing the first column would be part of the post-processing phase, as Tesseract is book/paper/word focused, so it will only reconstruct words from character sequences.
AFAIK the latest release doesn't have an advanced table reconstruction module like you need. See also the end of the https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html document for more info / links.



Quick meta-question though, as I am quite surprised that people feed financial data into any kind of (fundamentally statistical and thus noise-injecting) OCR process (not just tesseract but any and all of them out there): wouldn't it be more business-smart to scrape financial performance reports like these, or even better, get the direct data export from the SAP software at that company, so that you forgo the entire machine-render-text-to-image + image-to-text OCR process, which is risky and costly, altogether?
That financial performance stuff is usually reported in PDF/A format for obvious reasons (chamber of commerce, stock exchange, investors, those kinds of folks who all like their data as *virginal* as can be), and when you grab that output you're one straight text extract away from success, instead of wrangling a risky OCR process chain, which, by definition, cannot deliver a 100% accurate reconstruction all the time.
As this clearly is corporate financial data you're processing (and we can thus safely assume this reported data will be fed into follow-up processes where the actual numbers are of some import), I would expect nobody involved to appreciate the implicit risk factors introduced by injecting an inherently noisy statistical filter into the number-crunching process, which opens one up to the forever clear and present risk of random number-value inaccuracies due to the nature of any neural net's output.

You're certainly not the only one attempting to apply OCR to financial data around here (the mailing list is brimming with it), but when I see annual / quarterly corporate performance reports being processed like that, I start to worry a wee bit more than usual. Not for tesseract (it does its job just fine), but for the one who came up with the idea to plonk such data into an image file and feed it to any kind of OCR machinery. Sounds like an already previously failed due diligence exercise to me, where the question should have been asked: can we get this data in any type of text format straight from the source, as it is company- and machine-produced already? txt, csv, pdf, excel, anything? At what cost?
Or, if you can't get the text data (why not?! if you get the page images, it's published material, correct?), do you intend to use tesseract / your OCR process as an *assistive process*, where the OCR output is reviewed / vetted by a human before being deemed of sufficient quality for further use?






Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------



Jun Repasa

Jun 3, 2024, 7:08:45 PM
to tesseract-ocr
Hi Ger Hobbelt,

Your point is absolutely valid. In a perfect world these financial documents would be processed as PDFs, not put through OCR, which is costly and really unnecessary in compute time. That is exactly my question to most users and clients.
But in the real world we have no control over what goes into our system. Users will feed or upload high-quality PDFs and sometimes low-quality image-based PDF files.

I think if you provide solutions for internal users within your organization, that is okay. But for SaaS-based solutions it is quite challenging to apply all the preprocessing and post-processing.

salvador

Sundara Ganesh

Jun 18, 2024, 12:50:58 AM
to tesseract-ocr
Hello Ger Hobbelt,

Your meta-question is very reasonable. However, reality is very different, IMO.

For example, many banks and brokerage firms don't retain personal financial account statements/documents for more than 5 years or so. However, you may have printed copies of the same, received by mail at that time. We should be able to OCR them as accurately as possible.
The same is true for OCR'ing scanned receipts for personal accounting.

I would be very interested in OCR'ing my 10-year-old financial documents and statements.
Tesseract is great and far better than the other ones I've tried, but it is certainly not anywhere near perfect - it expects human intervention and special handling of inputs based on human verification of every output.

Sundar

Sundara Ganesh

Jun 18, 2024, 12:51:04 AM
to tesseract-ocr
You said: "Now I am trying to save it in a CSV. For that, I am using the coordinates of the detected text and reconstructing the table structure."

So, I assume you identified the columns based on the coordinates. If so, you know that the words before the text of the second column belong to the first column, and you should club them together and surround them with quotes followed by a single comma (not a comma after every word); a rough sketch of this is shown below. BTW, when you open the resulting CSV file in a spreadsheet, you may have to resize the first column to see the long text of words in it.

I hope I didn't misread your question to give you this obvious answer.
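
Something like this sketch of the clubbing idea (COL2_X and rows_of_words are hypothetical names: the x-coordinate where your second column starts, and the per-line lists of (left, word) pairs you already have from the coordinates):

import csv

COL2_X = 400  # hypothetical boundary; read it off your coordinate output

with open('table.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for line_words in rows_of_words:  # hypothetical: [(left, word), ...] per text line
        line_words.sort()
        label = ' '.join(w for x, w in line_words if x < COL2_X)
        numbers = [w for x, w in line_words if x >= COL2_X]
        # csv.writer adds the surrounding quotes automatically when the label needs them.
        writer.writerow([label] + numbers)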