Extracting data from a table with known labels.

Riccardo

Dec 21, 2024, 12:56:42 AM

to tesseract-ocr

Hello,
I am trying to use Tesseract to build a small Windows application that lets the user:

1. Take a screenshot of the monitor and crop it to a smaller region containing a table. The table always has the same format and the same labels; only the numerical data differ each time.
2. Pass the cropped screenshot to Tesseract to extract the data. My strategy is to remove the table's vertical and horizontal lines, extract the entire text, and collect the numerical values that correspond to the labels I want to capture.
3. Generate a text output based on the extracted data.
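
For illustration, the "collect the numerical values corresponding to the labels" step could look roughly like the sketch below. This is not the attached ocr_processing.py; the label names, the `collect_values` helper, and the assumption that each value sits on the same line as its label are all hypothetical.

```python
import re

# Hypothetical labels; the real table's labels are not shown in this thread.
LABELS = ["Voltage", "Current", "Power"]

# A signed integer or decimal, accepting either "." or "," as the decimal mark.
NUMBER = re.compile(r"[-+]?\d+(?:[.,]\d+)?")


def collect_values(ocr_text, labels=LABELS):
    """Map each known label to the first number that follows it on the same line."""
    values = {}
    for line in ocr_text.splitlines():
        for label in labels:
            if label in values:
                continue  # keep the first occurrence of each label
            found = re.search(re.escape(label), line, re.IGNORECASE)
            if found is None:
                continue
            number = NUMBER.search(line, found.end())
            if number:
                values[label] = float(number.group().replace(",", "."))
    return values
```

A label that OCR misreads is simply absent from the returned dictionary, which makes missing fields easy to detect before the text output is generated.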

The app works, but there are still many errors in the data extraction. Sometimes a value is not extracted at all because its label is not recognized correctly. Other times the label is recognized and the value is extracted, but the number is wrong. I also noticed that the error rate is higher on my work PC, probably because its screen resolution is lower than that of my home PC.

I am wondering if there is a more reliable way to accomplish my goal.

I have attached some images of the app to give you an idea, an example of the table, and the Python script I am using for OCR.

Thank you very much for your help!!!

tesseract v5.4.0.20240606

Python 3.13.1

2.png

3.png

ocr_processing.py

1.png

test_image.png

Zdenko Podobny

Dec 21, 2024, 1:38:25 PM

to tesser...@googlegroups.com

Hi,

have a look at this example:
article: https://iamrajatroy.medium.com/document-intelligence-series-part-2-transformer-for-table-detection-extraction-80a52486fa3

notebook: https://nbviewer.org/github/iamrajatroy/Data-Science-Lab/blob/main/notebook/DETR_Document_Intelligence.ipynb

Zdenko

On Sat, Dec 21, 2024 at 6:56, Riccardo <riccardo....@gmail.com> wrote:


Zdenko Podobny

Dec 21, 2024, 1:50:40 PM

to tesser...@googlegroups.com

Another example:

https://www.kaggle.com/code/sreesankar711/table-transformer-demo

Zdenko

On Sat, Dec 21, 2024 at 19:37, Zdenko Podobny <zde...@gmail.com> wrote:

Nikola Smolenski

Dec 21, 2024, 1:51:02 PM

to tesser...@googlegroups.com

Consider extracting the fields first, then submitting each one to Tesseract separately. There is no guarantee that Tesseract will return the fields in order, and the table lines confuse it.
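
A minimal sketch of that idea, assuming the table layout is fixed so each value cell can be cropped at known pixel coordinates. The `FIELD_BOXES` coordinates and labels are made up for illustration, `read_fields` and `default_ocr` are hypothetical helpers, and the image is expected to be a Pillow `Image` (anything with a compatible `crop` method works). The default OCR call uses pytesseract with page segmentation mode 7 (single text line) and a digit whitelist, which is one reasonable choice for numeric cells rather than the only one.

```python
# Hypothetical cell positions as (left, upper, right, lower) pixel boxes.
FIELD_BOXES = {
    "Voltage": (120, 10, 220, 40),
    "Current": (120, 45, 220, 75),
}


def default_ocr(field_image):
    """OCR one cropped cell as a single line of numeric text."""
    import pytesseract  # imported lazily so the helper below can be tested without it

    config = "--psm 7 -c tessedit_char_whitelist=0123456789.,-"
    return pytesseract.image_to_string(field_image, config=config)


def read_fields(image, boxes=FIELD_BOXES, ocr=default_ocr):
    """Crop each known cell and OCR it on its own, keyed by label."""
    results = {}
    for label, box in boxes.items():
        field = image.crop(box)  # Pillow-style crop with a 4-tuple box
        results[label] = ocr(field).strip()
    return results
```

Usage would be along the lines of `read_fields(Image.open("test_image.png"))` with `from PIL import Image`. Because every value is tied to its label by position, label recognition and field ordering no longer matter, and cropping inside the cell borders keeps the table lines out of the OCR input.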