Properly Insert OCR Into Separate Columns

Daniel Lu

Mar 22, 2021, 1:23:18 AM

to tesseract-ocr

Hi,

I am trying to read hundreds of pages of information like the picture below into a CSV file. For us humans, it is very clear where the information should go in each of the four columns. But I am trying to use tesseract to do this!

This is my code right now:

```{python}
import re

import cv2
import pytesseract
import xlsxwriter

img = cv2.imread("*image file path")
pytesseract.pytesseract.tesseract_cmd = r"*tesseract location"

# Initialize the workbook
workbook = xlsxwriter.Workbook('result.xlsx')
worksheet = workbook.add_worksheet()

# Convert to gray-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Threshold (Otsu picks the cutoff automatically)
thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# OCR
txt = pytesseract.image_to_string(thr, config="--psm 11")

# Add OCR output to the corresponding cell, filling four columns per row
txt = txt.split("\n")
row = 0
col = 0
for txt1 in txt:
    # Skip over OCR strings that are just spaces or ''
    if txt1.isspace() or txt1 == '':
        continue
    # Hard code detection ...let's just place it into the last column for now
    # Theoretically, the state ("Alaska" in this case) will be in column 0 in the same row
    if re.match(r"\d*\sOpen\sRestaurants", txt1):
        col = 3
    worksheet.write(row, col, txt1)
    # Advance to the next cell, wrapping to a new row after the fourth column
    col += 1
    if col == 4:
        col = 0
        row += 1

workbook.close()
```

However, there are still a lot of misalignments, especially when an address or name takes more than one line. Also, why is the text on the first line read in a different order from the rest of the rows?

I was thinking that perhaps I could require that every fourth string in txt be in alphabetical order and use that to detect misalignment. But if even the first row is incorrect, I'm not sure how much I want to hard-code corrections. Additionally, the multi-line entries sometimes come from the address column and other times from the name column (e.g. 258 Interstate Commercial Park Loop on the left-hand side of the page).
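A minimal sketch of that alphabetical check, assuming the OCR output has already been reduced to a flat list of non-empty strings and that the first of every four strings is the sorted key. The function name `first_misaligned_row` and the sample strings are made up for illustration; only the address comes from the page.

```python
def first_misaligned_row(cells, n_cols=4, key_col=0):
    """Return the index of the first row whose key breaks alphabetical order, or None."""
    # Take every n_cols-th string, starting at the key column.
    keys = [c.casefold() for c in cells[key_col::n_cols]]
    for i in range(1, len(keys)):
        if keys[i] < keys[i - 1]:
            return i
    return None


# Hypothetical OCR output: the wrapped address adds a fifth string to the second row.
cells = [
    "Alpha Diner", "1 Main St", "Anchorage", "3 Open Restaurants",
    "Bravo Cafe", "258 Interstate Commercial", "Park Loop", "Juneau",
    "12 Open Restaurants", "Charlie Grill", "9 Pine Rd", "Sitka",
]
print(first_misaligned_row(cells))  # 2
```

This only reports where the order first breaks, which is the row after the one with the extra line. It cannot tell which column wrapped, and it misses a shifted string that happens to sort after the previous key.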

Below are some screenshots of mixups on the left and right.

Any help would be greatly appreciated! Thank you!

original_image.jpg

right_mixup.png

left_mixup (2).png

Shree Devi Kumar

Mar 22, 2021, 2:26:56 AM

to tesseract-ocr

Please see the table detector that was recently added to the master branch:

https://github.com/tesseract-ocr/tesseract/pull/3330

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/fbdeeed7-87b6-4e8c-9cf9-d91e0d84f04an%40googlegroups.com.

Daniel Lu

Mar 22, 2021, 10:40:19 AM

to tesser...@googlegroups.com

Is there something like tableExtractionDemo.cpp but for Python? I am unable to understand or replicate the C++ demo for the problem I am working on.

Thank you in advance!
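For reference, one Python-side direction that does not depend on the C++ demo is to work from word positions instead of plain text. `pytesseract.image_to_data(thr, output_type=pytesseract.Output.DICT)` returns parallel lists under keys such as `left`, `top` and `text`, and each word can then be assigned to a column by its x-coordinate. The sketch below is not a port of tableExtractionDemo.cpp. The helper `words_to_rows`, the column boundaries `col_starts`, the tolerance `row_tol` and the sample boxes are all assumptions that would have to be measured on the real pages.

```python
import bisect


def words_to_rows(words, col_starts, row_tol=10):
    """Group word boxes into text lines, then into columns by x-position.

    words: dicts with 'left', 'top', 'text' (pixel coordinates).
    col_starts: sorted x-coordinates where each column begins (assumed known).
    row_tol: maximum vertical offset, in pixels, from the first word of a line.
    """
    rows = []
    line_top = None
    for w in sorted(words, key=lambda w: (w["top"], w["left"])):
        if not w["text"].strip():
            continue
        # Start a new line when the word sits clearly below the current one.
        if line_top is None or w["top"] - line_top > row_tol:
            rows.append([[] for _ in col_starts])
            line_top = w["top"]
        # The column is the last one whose start is at or left of the word.
        col = max(bisect.bisect_right(col_starts, w["left"]) - 1, 0)
        rows[-1][col].append((w["left"], w["text"]))
    # Within each cell, restore left-to-right word order.
    return [[" ".join(t for _, t in sorted(cell)) for cell in row] for row in rows]


# Hypothetical word boxes for one entry whose address wraps onto a second line.
words = [
    {"left": 10, "top": 100, "text": "Alpha"},
    {"left": 60, "top": 101, "text": "Diner"},
    {"left": 200, "top": 100, "text": "258"},
    {"left": 240, "top": 102, "text": "Interstate"},
    {"left": 400, "top": 100, "text": "Anchorage"},
    {"left": 200, "top": 130, "text": "Park"},
    {"left": 240, "top": 130, "text": "Loop"},
]
col_starts = [0, 190, 390]
for row in words_to_rows(words, col_starts):
    print(row)
```

With this layout a wrapped entry shows up as a line that has text in only one column, so it could be merged into the row above instead of shifting every later cell.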

