Unable to extract table contents from scanned and normal PDFs using Tesseract OCR

Sayali begampure

May 31, 2019, 6:25:08 AM

to tesseract-ocr

We are trying to extract text from normal PDFs and scanned (image) PDFs using tesseract-ocr.

For PDFs that contain tables, the table contents are not extracted properly. We have observed the following issues:

The contents of some cells (rows/columns) are missing, and sometimes the table heading is missing.
When a table contains numbers, not all of the numbers are extracted.
Some letters are misrecognized, e.g. "i" is read as "l".
The column order gets mixed up because the text is parsed horizontally.
Some extra characters are extracted along with the expected ones.

We have tried image_to_string, image_to_data, and an OpenCV-based approach.

Sample code used is:

from PIL import Image
import pytesseract

# Run OCR on the table image and print the recognized text
text = pytesseract.image_to_string(Image.open('table_number.jpg'))
print(text)

We expect the rows and columns to be extracted correctly, which is not happening at present. Please suggest a function or method to improve table content extraction with Tesseract.

Manasi sarode

May 31, 2019, 6:27:52 AM

to tesser...@googlegroups.com

That's fair enough.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Amulya Kali

May 31, 2019, 6:30:36 AM

to tesser...@googlegroups.com

Did you try changing psm?

Sayali begampure

May 31, 2019, 6:44:34 AM

to tesseract-ocr

I have used psm for two-column documents, and it gives good results there.

Can you send a link or some pointers on how to use it for table content extraction from scanned PDFs?

Thanks

Amulya Kali

May 31, 2019, 10:11:50 AM

to tesser...@googlegroups.com

Hi, I'm not sure which psm mode you used. You can try psm 6 for tables.

Something like this:

pytesseract.image_to_string(image, lang='eng', config='--psm 6')
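
For tables, the per-word boxes from image_to_data may be more useful than the plain string, because they let you rebuild the rows yourself. Below is a minimal sketch, assuming pytesseract's dictionary output (keys text, left, top, block_num, par_num, line_num). The helper names group_words_into_rows and ocr_table_rows are hypothetical and not part of pytesseract.

```python
from collections import defaultdict


def group_words_into_rows(data):
    """Group OCR words into rows, each sorted left to right.

    `data` is a dict of parallel lists, the shape returned by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    rows = defaultdict(list)
    for i, text in enumerate(data["text"]):
        word = text.strip()
        if not word:
            continue  # skip the empty entries emitted for blocks and lines
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        rows[key].append((data["left"][i], data["top"][i], word))
    # Order rows top to bottom, then words within each row left to right.
    ordered = sorted(rows.values(), key=lambda ws: min(top for _, top, _ in ws))
    return [[w for _, _, w in sorted(ws)] for ws in ordered]


def ocr_table_rows(image_path, psm=6):
    """Run Tesseract on an image and return its words grouped into rows."""
    # Imported here so the grouping helper works without these packages.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path),
        lang="eng",
        config=f"--psm {psm}",
        output_type=pytesseract.Output.DICT,
    )
    return group_words_into_rows(data)
```

Calling ocr_table_rows('table_number.jpg') would return one list of words per detected text line. Splitting each row into columns would still need a further step (for example, comparing the left coordinates across rows), which this sketch does not attempt.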

Sayali begampure

May 31, 2019, 11:30:37 AM

to tesseract-ocr

Thanks, I will try this.

Sayali begampure

Jun 5, 2019, 11:10:57 PM

to tesseract-ocr

Hello, I tried both psm 6 and psm 3, but there is still a problem detecting the table contents. Numbers are missing, and sometimes only the heading is extracted.

Is there any other change I can make in Tesseract, or anything I can do to improve the image quality?

TIA
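
On the image quality side, one thing that sometimes helps with scanned tables is cleaning the image before OCR: convert it to grayscale, upscale small scans, and binarize. Below is a rough sketch, assuming Pillow is installed. The helper names otsu_threshold and preprocess_for_ocr are hypothetical, and the scale factor of 2 is an assumption to tune per document, not a recommended value.

```python
def otsu_threshold(histogram):
    """Pick a gray level separating dark and light pixels (Otsu's method).

    `histogram` is a list of 256 pixel counts, e.g. from Image.histogram()
    on a grayscale ("L") image. Pixels <= the result count as dark.
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    best_threshold, best_variance = 0, -1.0
    weight_dark = 0
    sum_dark = 0
    for level, count in enumerate(histogram):
        weight_dark += count
        sum_dark += level * count
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already on the dark side
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance; the largest value gives the best split.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def preprocess_for_ocr(image_path, scale=2):
    """Grayscale, upscale, and binarize a scan before passing it to Tesseract."""
    from PIL import Image  # imported lazily; requires Pillow

    gray = Image.open(image_path).convert("L")
    gray = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    threshold = otsu_threshold(gray.histogram())
    return gray.point(lambda p: 255 if p > threshold else 0)
```

The returned image can be passed to pytesseract.image_to_string(..., config='--psm 6') in place of the original. Whether this recovers the missing numbers depends on the scan, so it is worth comparing the output with and without preprocessing.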

Krishna Prasad

Jun 6, 2019, 10:08:17 AM

to tesser...@googlegroups.com

Hi Sayali,

I'm dealing with a similar problem. Detecting table contents accurately has never been easy with Tesseract. I would suggest building your own pipeline for detecting tables and complex layouts. There are many public datasets available. I'm trying to use DeepLab v3+ by Google.

Model : https://github.com/tensorflow/models/tree/master/research/deeplab

Dataset : https://www.primaresearch.org/datasets/Layout_Analysis

Deeplab is properly documented and really good at its job. If you are familiar with ML, this would be a piece of cake for you.
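
To connect a segmentation model to Tesseract, one possible glue step is turning the per-pixel "table" mask into bounding boxes, so that each region can be cropped and OCR'd separately. Below is a minimal sketch in plain Python, assuming the mask has already been converted to a 2D list of 0/1 values. The function name mask_to_boxes is hypothetical and not part of DeepLab or Tesseract.

```python
from collections import deque


def mask_to_boxes(mask):
    """Return bounding boxes (left, top, right, bottom) of connected regions.

    `mask` is a 2D list of 0/1 values, where 1 marks pixels that a layout
    model labelled as "table". Right and bottom are exclusive, matching the
    box convention of PIL's Image.crop.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(x, y)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right + 1, bottom + 1))
    return boxes
```

Each box could then be passed to Image.crop on the page image and the crop sent to pytesseract.image_to_string with '--psm 6'. For full-resolution masks a library routine for connected components would likely be faster than this pure-Python loop.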

Hope this helps. 😃

Regards,

Krishna Prasad A S

Saylee Begampure

Jun 6, 2019, 11:45:55 PM

to tesseract-ocr

Thanks a lot! Yes, I am familiar with the ML part. I will implement it and see what results I get.

Akhil Dixit

Jul 3, 2019, 12:28:36 PM

to tesseract-ocr

I am also facing the same issue with scanned PDFs, especially ones with multiple columns and text mixed with numbers. Please share any input here if anyone has tried Tesseract or other APIs for this.