Library to read tables in scanned PDFs


mohit ranjan

Jan 20, 2017, 4:53:06 AM
to data...@googlegroups.com
Sorry if this is off-topic, but I have seen threads here about liberating data from PDFs.
Most likely there will be a lot of scanned PDFs among them.

Do we have any in-house experts on this, and which library/tool (preferably free) can extract tables from scanned PDFs/JPGs?

CVision (http://www.cvisiontech.com/library/ocr/file-ocr/ocr-table-recognition.html) does a decent job, but it's paid.



- Mohit

Johnson Chetty

Jan 20, 2017, 6:07:54 AM
to data...@googlegroups.com

Hello, 

I have had some reasonable success with 'pdfquery' if you like Python. It works with regional text as well. 
Also, for tabular data, do try pdf-table-extract if quick and dirty works for you. 

Java folks should try pdfbox. 





--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
---
You received this message because you are subscribed to the Google Groups "datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datameet+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


mohit ranjan

Jan 20, 2017, 7:21:21 AM
to data...@googlegroups.com
Thanks for the response, Johnson.

Is this the pdf-table-extract (https://github.com/ashima/pdf-table-extract) you are referring to?
It says it reads table metadata from the PDF.

My query was about scanned PDF/JPG images.

- Mohit

Srinivasan Ramani

Jan 20, 2017, 7:44:36 AM
to data...@googlegroups.com
Tabula - http://tabula.technology/ works great with table extraction from PDFs. 
Best Regards,
Srinivasan V. Ramani ,
Associate Editor,
The Hindu,
Chennai.
Ph: 07299033554

mohit ranjan

Jan 20, 2017, 7:48:59 AM
to data...@googlegroups.com
I tried Tabula, but again it is for PDFs that have all the metadata (embedded text) within them.
I need it for paper-scanned PDFs/JPGs, and it fails with this message:

"Sorry, your PDF file is image-based; it does not have any embedded text. It might have been scanned from paper... Tabula isn't able to extract any data from image-based PDFs. Click the Help button for more information."

- Mohit

Amanbir Singh

Jan 23, 2017, 3:20:05 AM
to datameet
Hi Mohit,

You'll have to run OCR on the PDF before any other method can be applied. This obviously makes it more complicated, but still manageable.

You could use Tesseract, a popular OCR package (https://github.com/tesseract-ocr/tesseract), and then try Tabula or the other packages mentioned. I've also had success using Xpdf (http://www.foolabs.com/xpdf/) to convert PDFs to text and then parsing the text.
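
A minimal sketch of that OCR-then-parse route, assuming the scan is available as an image file (Tesseract reads images, not PDFs) and that the tesseract and pdftotext binaries are on the PATH. The function names and the "two or more spaces separate columns" rule are illustrative assumptions, not something from a particular library; OCR spacing is often irregular, so real tables will need a more careful splitter.

```python
import re
import subprocess


def split_columns(line):
    """Split one line of layout-preserved text on runs of two or more spaces.

    Heuristic only: it assumes single spaces occur inside a cell and
    wider gaps occur between cells.
    """
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def ocr_to_rows(image, output_base, lang="eng"):
    """OCR a scanned image, then return its text as a list of rows of cells."""
    # Tesseract: the trailing "pdf" config writes <output_base>.pdf,
    # a PDF that carries the recognised text as an embedded layer.
    subprocess.run(
        ["tesseract", image, output_base, "-l", lang, "pdf"],
        check=True,
    )
    # pdftotext (from Xpdf): -layout keeps the physical layout of the page,
    # and "-" as the output name sends the text to stdout.
    text = subprocess.run(
        ["pdftotext", "-layout", output_base + ".pdf", "-"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [split_columns(line) for line in text.splitlines() if line.strip()]
```

The searchable PDF written in the first step is also the kind of file that Tabula accepts, so the same intermediate output can be tried there.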

Aman


- Mohit


- Mohit

To unsubscribe from this group and stop receiving emails from it, send an email to datameet+u...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
---
You received this message because you are subscribed to the Google Groups "datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datameet+u...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
---
You received this message because you are subscribed to the Google Groups "datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datameet+u...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Best Regards,
Srinivasan V. Ramani ,
Associate Editor,
The Hindu,
Chennai.
Ph: 07299033554

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
---
You received this message because you are subscribed to the Google Groups "datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datameet+u...@googlegroups.com.

Raphael Susewind

Jan 23, 2017, 9:02:12 AM
to data...@googlegroups.com
Hi Mohit,

just to add - a hacked-but-working workflow to extract the table
structure and OCR bits and pieces as needed can be found in my GitHub,
for instance here (at the bottom of the perl file):

https://github.com/raphael-susewind/india-religion-politics/blob/master/rajrolls2014/run-in-arc/pdf2list.pl

It boils down to

pdf-table-extract -i $file -p $page -r 300 -l 0.7 -t cells_xml

for each page, parsing the results to extract cell coordinates, then

gs -q -r300 -dFirstPage=$page -dLastPage=$page -sDEVICE=tiffgray -sCompression=lzw -o $temp.tif -g${width}x${height} -c "<</Install {-$bufferx -$buffery translate}>> setpagedevice" -f $file

to get a TIFF of this cell, to be fed into

tesseract -psm 4 -l hin $temp.tif stdout

(in the case of devanagari)
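
For anyone who prefers Python to Perl, the three calls above can be sketched as argument lists for subprocess. This is only a restatement of the commands quoted in this message: the function names are hypothetical, and how width, height, bufferx and buffery are derived from the cells_xml output is left to the linked Perl script. Note that the -psm spelling is the Tesseract 3.x one; Tesseract 4 and later expect --psm.

```python
def table_cells_command(pdf, page):
    """pdf-table-extract call that reports the cell layout of one page as XML."""
    return ["pdf-table-extract", "-i", pdf, "-p", str(page),
            "-r", "300", "-l", "0.7", "-t", "cells_xml"]


def crop_cell_command(pdf, page, tif, width, height, bufferx, buffery):
    """Ghostscript call that renders a single cell of the page to a grey TIFF.

    width/height set the output size via -g; bufferx/buffery shift the page
    so that the wanted cell lands inside that window. All four values are
    assumed to come from the parsed cell coordinates.
    """
    install = "<</Install {-%d -%d translate}>> setpagedevice" % (bufferx, buffery)
    return ["gs", "-q", "-r300",
            "-dFirstPage=%d" % page, "-dLastPage=%d" % page,
            "-sDEVICE=tiffgray", "-sCompression=lzw",
            "-o", tif, "-g%dx%d" % (width, height),
            "-c", install, "-f", pdf]


def ocr_cell_command(tif, lang="hin"):
    """Tesseract 3.x call that prints the recognised text of one cell to stdout."""
    return ["tesseract", "-psm", "4", "-l", lang, tif, "stdout"]
```

Each list can be passed to subprocess.run(..., check=True, capture_output=True, text=True); building the lists separately makes it easy to print and inspect a command before running it on a few hundred pages.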

Best of luck,
Raphael

--
Dr Raphael Susewind | Postdoc, Max Planck Institute for the Study of
                    | Religious and Ethnic Diversity (MPI-MMG)
                    | Hermann-Föge-Weg 11, 37073 Göttingen, Germany
                    | https://www.raphael-susewind.de

Please consider PGP for encryption: https://keybase.io/raphaelsusewind


mohit ranjan

Jan 24, 2017, 4:15:52 AM
to data...@googlegroups.com
Thanks Aman, Raphael

Let me try these steps.

- Mohit



nikh...@gmail.com

Dec 26, 2021, 11:01:45 AM
to datameet
Hi All,

Replying to an old thread that matches the subject.

I recently succeeded in making a single Python program do the whole job: taking a PDF, decrypting it (apparently many PDFs are encrypted even when the password is blank), OCR'ing it (in Marathi/Hindi script, plus some English parts), extracting the tabular data, transforming that data into a proper table with one data item per row, and finally saving it to either an Excel file or a database.

It uses these Python libraries, plus some dependencies that I could easily install on my Ubuntu system:
pikepdf (decryption) | ocrmypdf (OCR) | tabula-py (table extraction)

(On Windows: maybe doable, not sure about those dependencies. Feels like they also really want people to move to Linux these days :P)

The last part in that chain is crucial, and it has evolved by now to give several good options for targeting specific areas of a page. The core program that the Python library wraps around is tabula-java.

For the OCR part, it got things right for about 80% of my target PDF, which was in Marathi. It runs on the Tesseract project, which is continuously evolving, so we can expect it to get better over time. (If someone knows where to contribute more samples for the training model, please let me know.) It also has a handy deskew option that you can use for photo-scanned PDFs where everything got tilted.
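
A rough sketch of how the pikepdf, ocrmypdf and tabula-py chain described above might be wired together. This is not the program mentioned in this message: the function names, the intermediate file names, the language list and the lattice=True setting (which suits tables with ruled lines) are all assumptions made for illustration, and the Tesseract language packs for the chosen languages have to be installed separately.

```python
from pathlib import Path


def intermediate_paths(src):
    """Return (decrypted, searchable) file names next to the input file.

    The naming scheme is hypothetical; any two scratch paths would do.
    """
    p = Path(src)
    return (str(p.with_name(p.stem + ".decrypted.pdf")),
            str(p.with_name(p.stem + ".ocr.pdf")))


def extract_tables(src, languages=("mar", "hin", "eng")):
    """Decrypt, OCR and table-extract one scanned PDF.

    Returns whatever tabula-py finds, as a list of pandas DataFrames.
    """
    # Third-party imports live here so the module loads without them.
    import pikepdf
    import ocrmypdf
    import tabula

    decrypted, searchable = intermediate_paths(src)

    # 1. Open with a blank password; saving writes an unencrypted copy.
    with pikepdf.open(src, password="") as pdf:
        pdf.save(decrypted)

    # 2. Add a text layer with Tesseract, straightening tilted pages first.
    ocrmypdf.ocr(decrypted, searchable, language=list(languages), deskew=True)

    # 3. Pull tables out of the now-searchable PDF.
    return tabula.read_pdf(searchable, pages="all", lattice=True)
```

Reshaping the resulting DataFrames to one data item per row and saving them to Excel or a database depends on the layout of the particular document, so that step is left out here.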

I'm not able to share a demo program at this point, but I want to let people know that, hey, it can be done! Just follow the trail. (And get in touch if you want to implement something.)

Regards
Nikhil VJ
