Unable to extract table contents from scanned and normal PDFs using tesseract-ocr?

Sayali begampure

May 31, 2019, 6:25:08 AM
to tesseract-ocr

We are trying to extract text content from normal PDFs and scanned PDFs (images) using tesseract-ocr.

We have observed the following issues with PDFs that contain tables; the table contents are not extracted properly.

  1. The contents of a few cells (rows/columns) are missing. Sometimes the table heading is missing.
  2. If the table contains numbers, not all of them are extracted.
  3. Some letters are extracted incorrectly, e.g. i is misread as l.
  4. The column order gets mixed up because the text is parsed horizontally.
  5. Some extra characters are extracted along with the expected ones.

We tried image_to_string, image_to_data, and an OpenCV-based approach.

The sample code used is:

from PIL import Image
import pytesseract

# Run OCR on the table image and print the recognized text
text = pytesseract.image_to_string(Image.open('table_number.jpg'))
print(text)


It should extract the rows and columns properly, which it does not do at present. Kindly suggest a function or method to improve table content extraction with Tesseract.
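
For the column-order issue (point 4 above), one possible direction is to rebuild the rows from the word boxes that image_to_data returns, rather than relying on the flat string from image_to_string. The following is only a minimal sketch: the function names, the row_tolerance value of 10 pixels, and the min_conf cutoff are illustrative assumptions, not something from this thread or a recommended Tesseract setting, and it does not detect column boundaries or merged cells.

```python
def group_words_into_rows(data, row_tolerance=10, min_conf=0):
    """Group OCR word boxes into rows ordered top-to-bottom, left-to-right.

    `data` is a dict shaped like pytesseract.image_to_data(...,
    output_type=Output.DICT): parallel lists under the keys
    "text", "conf", "left", "top" and "height".
    `row_tolerance` (pixels) is an assumed value; tune it per document.
    """
    words = []
    for text, conf, left, top, height in zip(
        data["text"], data["conf"], data["left"], data["top"], data["height"]
    ):
        # Skip structural entries (empty text) and low-confidence words.
        if not text.strip() or float(conf) < min_conf:
            continue
        words.append((top + height / 2, left, text))

    # Sort by vertical centre so words on the same visual line are adjacent.
    words.sort()

    rows = []
    for centre, left, text in words:
        # A word joins the current row if its centre is close to the
        # centre of the first word placed in that row.
        if rows and abs(centre - rows[-1]["centre"]) <= row_tolerance:
            rows[-1]["cells"].append((left, text))
        else:
            rows.append({"centre": centre, "cells": [(left, text)]})

    # Within each row, order the words by their left edge.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def ocr_table_rows(image_path, psm=6):
    """Hypothetical wrapper: run Tesseract and return the grouped rows."""
    # Imported here so the grouping logic above works without these packages.
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(image_path),
        config=f"--psm {psm}",
        output_type=pytesseract.Output.DICT,
    )
    return group_words_into_rows(data)
```

Calling ocr_table_rows('table_number.jpg') would return a list of rows, each a list of word strings. Words that belong to the same cell still come back as separate items, so a further step that clusters the left coordinates into columns would be needed for real tables.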

Manasi sarode

May 31, 2019, 6:27:52 AM
to tesser...@googlegroups.com
That's fair enough.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Amulya Kali

May 31, 2019, 6:30:36 AM
to tesser...@googlegroups.com
Did you try changing psm?

Sayali begampure

May 31, 2019, 6:44:34 AM
to tesseract-ocr
I used a psm setting for two-column documents, and it shows the results perfectly.
Could you send a link or some pointers on how to use it for table content extraction from scanned PDFs?

Thanks


On Friday, 31 May 2019 16:00:36 UTC+5:30, Amulya Kali wrote:
Did you try changing psm?

Amulya Kali

May 31, 2019, 10:11:50 AM
to tesser...@googlegroups.com
Hi, I'm not sure which psm mode you used. You can try psm 6 for tables.

Something like this:
pytesseract.image_to_string(image, lang='eng', config='--psm 6')
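
Since it is hard to know in advance which page segmentation mode suits a given table, it may help to run the same image through a few modes and compare the output side by side. This is a rough sketch only: psm_config and ocr_with_modes are hypothetical helper names, the default list of modes (3, 4, 6, 11) is just an assumed starting set, and the ocr parameter exists only so the helper can be exercised without Tesseract installed.

```python
def psm_config(psm):
    """Build the config string for a page segmentation mode (0 to 13)."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


def ocr_with_modes(image, modes=(3, 4, 6, 11), ocr=None):
    """Run OCR once per mode and return {psm: recognized_text}.

    3 = automatic page segmentation (the default), 4 = single column,
    6 = single uniform block of text, 11 = sparse text.
    """
    if ocr is None:
        # Imported lazily so the helper can be tested with a stand-in function.
        import pytesseract
        ocr = pytesseract.image_to_string
    return {psm: ocr(image, lang="eng", config=psm_config(psm)) for psm in modes}
```

For example, results = ocr_with_modes(Image.open('table_number.jpg')) followed by printing each entry makes it easier to see which mode keeps the numbers and headings for a particular layout.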

Sayali begampure

May 31, 2019, 11:30:37 AM
to tesseract-ocr
Thanks. I will try this.

Sayali begampure

Jun 5, 2019, 11:10:57 PM
to tesseract-ocr
Hello, I tried both psm 6 and psm 3, but there are still problems detecting the table contents. The numbers are missing, and sometimes only the heading is extracted.
Is there any other change I can make in Tesseract, or anything I can do to improve the image quality?

TIA
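
On the image quality side, a common first step is to clean up the scan before passing it to Tesseract: convert to grayscale, upscale low-resolution scans toward roughly 300 DPI, stretch the contrast, and binarize. The sketch below uses Pillow and is only an illustration under stated assumptions: scale_factor and preprocess_for_ocr are hypothetical helper names, the fixed threshold of 160 and the fallback of 150 DPI are arbitrary guesses that would need tuning for real scans, and nothing here guarantees better table extraction.

```python
def scale_factor(current_dpi, target_dpi=300):
    """Return how much to enlarge an image to reach target_dpi (never shrink)."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def preprocess_for_ocr(image_path, assumed_dpi=150, threshold=160):
    """Grayscale, upscale, contrast-stretch and binarize a scanned page.

    `assumed_dpi` is used only when the file carries no DPI metadata;
    both it and `threshold` are illustrative values, not recommendations.
    """
    # Imported lazily so scale_factor can be used without Pillow installed.
    from PIL import Image, ImageOps

    img = Image.open(image_path)
    # Some formats store the resolution as a (x_dpi, y_dpi) tuple in img.info.
    dpi = float(img.info.get("dpi", (assumed_dpi, assumed_dpi))[0]) or assumed_dpi

    img = ImageOps.grayscale(img)
    factor = scale_factor(dpi)
    if factor > 1.0:
        new_size = (round(img.width * factor), round(img.height * factor))
        img = img.resize(new_size, Image.LANCZOS)

    img = ImageOps.autocontrast(img)
    # Simple global threshold: pixels brighter than `threshold` become white.
    return img.point(lambda p: 255 if p > threshold else 0)
```

The returned image can then be passed to pytesseract.image_to_string or image_to_data in place of the original. For unevenly lit scans, an adaptive threshold (for example via OpenCV) may work better than the single global cutoff used here.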

Krishna Prasad

Jun 6, 2019, 10:08:17 AM
to tesser...@googlegroups.com
Hi Sayali,
I'm dealing with a similar problem. Detecting table contents accurately has never been easy with Tesseract. I would suggest building your own pipeline for detecting tables and complex layouts. There are many public datasets available. I'm trying to use DeepLab v3+ by Google. DeepLab is well documented and really good at its job. If you are familiar with ML, this should be a piece of cake for you.

Hope this helps. 😃

Regards,
Krishna Prasad A S

Saylee Begampure

Jun 6, 2019, 11:45:55 PM
to tesseract-ocr
Thanks a lot! Yes, I am familiar with the ML part. I will implement it and try to get results.

Akhil Dixit

Jul 3, 2019, 12:28:36 PM
to tesseract-ocr
I am also facing the same issue with scanned PDFs, especially those with multiple columns and text mixed with numbers. Please share some input here if anyone has tried this with Tesseract or other APIs.