Properly Insert OCR Into Separate Columns


Daniel Lu

Mar 22, 2021, 1:23:18 AM
to tesseract-ocr
Hi,

I am trying to read hundreds of pages of information like the picture below into a CSV file. For us humans, it is very clear where the information should go in each of the four columns. But I am trying to use tesseract to do this!

This is my code right now:

```{python}
import cv2
import pytesseract
import xlsxwriter
import re

img = cv2.imread("*image file path")
pytesseract.pytesseract.tesseract_cmd = r"*tesseract location"

# Initialize the workbook
workbook = xlsxwriter.Workbook('result.xlsx')
worksheet = workbook.add_worksheet()

# Convert to the gray-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Threshold
thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# OCR
txt = pytesseract.image_to_string(thr, config="--psm 11")

# Add ocr to the corresponding part
txt = txt.split("\n")


# Running cell index: row = cell // 4, column = cell % 4
cell = 0

for txt1 in txt:
    # Skip over OCR strings that are just spaces or ''
    if txt1.isspace() or txt1 == '':
        continue

    # Hard code detection ...let's just place it into the last column for now
    # Theoretically, the state ("Alaska" in this case) will be in column 0 in the same row
    if re.match(r"\d*\sOpen\sRestaurants", txt1):
        # Jump forward to the last column of the current row
        cell += 3 - cell % 4

    worksheet.write(cell // 4, cell % 4, txt1)
    cell += 1

workbook.close()

```
However, there are still a lot of misalignments, especially when an address or name takes more than one line. Additionally, why is the text on the first line read in a different order from the rest of the rows?

I was thinking that perhaps I could enforce that every fourth string in txt is in alphabetical order and use that to detect misalignment? But if even the first row is incorrect, I'm not sure how much I want to hard code corrections. Additionally, sometimes the multi-line entries arise from the address column, while other times they arise from the name column (e.g. 258 Interstate Commercial Park Loop on the left-hand side of the page).
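
Roughly what I have in mind for the alphabetical check, as an untested sketch rather than a real solution. The names `first_order_break`, `width`, and `key_col` are made up for illustration, and it assumes the chosen column really is sorted alphabetically on the page:

```python
def first_order_break(strings, width=4, key_col=0):
    """Return the first row index whose key-column entry sorts before the
    previous row's entry, or None if the key column is in order.

    `strings` is the flat list of non-empty OCR strings, assumed to be
    laid out row by row with `width` strings per row.
    """
    # Every `width`-th string, starting at `key_col`, is the key column
    keys = [s.casefold() for s in strings[key_col::width]]
    for row in range(1, len(keys)):
        if keys[row] < keys[row - 1]:
            return row
    return None
```

A returned row index would only say where the columns first went out of step, for example because an extra wrapped line shifted everything after it. It would not say how to repair the row.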

Below are some screenshots of mixups on the left and right.

Any help would be greatly appreciated! Thank you!





original_image.jpg
right_mixup.png
left_mixup (2).png

Shree Devi Kumar

Mar 22, 2021, 2:26:56 AM
to tesseract-ocr
Please see the table detector newly added to the master branch.


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/fbdeeed7-87b6-4e8c-9cf9-d91e0d84f04an%40googlegroups.com.

Daniel Lu

Mar 22, 2021, 10:40:19 AM
to tesser...@googlegroups.com
Is there something like tableExtractionDemo.cpp but for Python? I am unable to understand or replicate the C++ demo for the problem I am working on.
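
In the meantime, one Python-only direction I am considering is to replace `image_to_string` with `pytesseract.image_to_data(thr, output_type=pytesseract.Output.DICT)`, which returns each word together with its bounding box, and to bin the words into columns by x position. This is a rough sketch, not a tested solution. `words_to_rows`, `merge_wrapped`, `col_starts`, `row_tol`, and `anchor_col` are hypothetical names of my own, the column boundaries are guessed pixel values, and it assumes that one column (column 0 here) is never empty on the first line of an entry:

```python
from bisect import bisect_right


def words_to_rows(data, col_starts, row_tol=10, min_conf=0):
    """Group OCR words into table rows and columns by position.

    `data` is a dict of parallel lists with the keys "text", "left",
    "top" and "conf" (the shape image_to_data returns with Output.DICT).
    `col_starts` holds the assumed left edge of each column in pixels.
    """
    words = []
    for text, left, top, conf in zip(
        data["text"], data["left"], data["top"], data["conf"]
    ):
        # Skip empty entries and low-confidence words
        if not text.strip() or float(conf) < min_conf:
            continue
        words.append((top, left, text))
    words.sort()

    rows = []  # each item: [top of first word in row, list of column cells]
    for top, left, text in words:
        # Start a new row when the word sits clearly below the current row
        if not rows or top - rows[-1][0] > row_tol:
            rows.append([top, [[] for _ in col_starts]])
        col = max(bisect_right(col_starts, left) - 1, 0)
        rows[-1][1][col].append((left, text))

    # Within each cell, order the words left to right and join them
    return [
        [" ".join(t for _, t in sorted(cell)) for cell in cells]
        for _, cells in rows
    ]


def merge_wrapped(rows, anchor_col=0):
    """Fold rows with an empty anchor column into the row above them."""
    merged = []
    for row in rows:
        if merged and not row[anchor_col]:
            prev = merged[-1]
            for i, cell in enumerate(row):
                if cell:
                    prev[i] = f"{prev[i]} {cell}".strip()
        else:
            merged.append(list(row))
    return merged
```

The idea is that a wrapped address or name would show up as a row with an empty anchor column, so it can be appended to the entry above instead of shifting every later cell.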

Thank you in advance!
