Poor results in extracting table data

Timo Grossenbacher

Jan 22, 2016, 5:14:48 AM
to tesseract-ocr
Hey,

Given the input file 2000.pdf, and the following code, ...

# first, conversion to TIFF with ghostscript
ghostscript -o 2000_gs.tif -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw 2000.pdf
# then, rotation with imagemagick
convert 2000_gs.tif -rotate 89.4 -background white -alpha Off 2000_rotated.tif
# then, OCR with tesseract, using suggested parameters
tesseract 2000_rotated.tif 2000_readable_gs_custom -c load_system_dawg=0 -c load_freq_dawg=0 -c textord_tablefind_recognize_tables=1 -c textord_tabfind_find_tables=1 pdf

...the quality of the OCR is really poor - hardly 30% of the text is searchable in 2000_readable_gs_custom.pdf.

I have uploaded all the files to https://www.sendspace.com/filegroup/dGA6ojm%2BQ4tZ6gdkyuSM0xSIUD8P2vbB

When I OCR the same file with Adobe Acrobat Professional, I get almost 100% accuracy. Of course I'd rather do this with FOSS than with a commercial product, so do you have any hints on how I could mitigate those problems?

Thanks a lot,
Timo

Tom Morris

Jan 22, 2016, 5:43:53 PM
to tesseract-ocr
720 dpi seems high.  Is that the native scan resolution?  I'd use the native resolution unless it's less than 200 dpi or more than 400 dpi.  Similarly, why are you rendering to tiffgray when the input looks like it's bitonal?  tesseract is just going to have to threshold back to bitonal again, resulting in two conversions where none are needed.

Don't have time to play with it myself, but perhaps you could outline the matrix of different conversions you've tried so far, to help folks see what's already been tried and eliminated as not helpful.

Tom
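
A minimal sketch of the resolution rule above: keep the native DPI when it falls between 200 and 400, and render straight to bitonal so tesseract does not have to threshold again. The helper names `choose_dpi` and `render_cmd`, the 300 dpi fallback, and the `_g4.tif` output name are assumptions for illustration, not something from the thread. `tiffg4` is Ghostscript's 1-bit CCITT Group 4 TIFF device. The script only prints the command, so it can be inspected before running.

```shell
#!/bin/sh
# choose_dpi NATIVE_DPI
# Keep the native resolution when it lies in 200..400 dpi; otherwise fall
# back to 300 dpi (the fallback value is an assumption, not from the thread).
choose_dpi() {
    native=$1
    if [ "$native" -ge 200 ] && [ "$native" -le 400 ]; then
        echo "$native"
    else
        echo 300
    fi
}

# render_cmd PDF NATIVE_DPI
# Print (do not run) a Ghostscript command that renders the PDF directly to
# a bitonal TIFF. The binary is called "gs" here; on some systems it is
# installed as "ghostscript", as in the original post.
render_cmd() {
    pdf=$1
    dpi=$(choose_dpi "$2")
    echo "gs -o ${pdf%.pdf}_g4.tif -sDEVICE=tiffg4 -r${dpi} ${pdf}"
}

# 720 is the resolution used in the original command, not a confirmed
# native scan resolution.
render_cmd 2000.pdf 720
```

If poppler is installed, `pdfimages -list 2000.pdf` should report the embedded image's resolution, which can be passed as the second argument in place of 720.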

Art Rhyno.

Jan 22, 2016, 8:11:26 PM
to tesser...@googlegroups.com

Hi Timo,

I tried the line removal example [1] included with leptonica; I have had luck before using it with tesseract on images with horizontal lines. I didn't manipulate the PDF beyond converting it to a grayscale image and rotating it; my ghostscript won't handle your parameters for some reason. This is the unadorned image without the horizontal lines [2] and these are the results [3]. Not 100%, but I think more than 30%, and maybe an approach to consider.

art

---

1. http://www.leptonica.com/line-removal.html

2. https://drive.google.com/file/d/0B-PK1n92dlzwalM1bTRtb0FiMVU/view?usp=sharing

3. https://drive.google.com/file/d/0B-PK1n92dlzweDl5aWFPd0pDQnc/view?usp=sharing
