Fwd: Extracting data from PDFs

Nikhil VJ

Jan 15, 2022, 1:58:35 AM

to pytho...@googlegroups.com

Hi All,

Breaking tabular data out of PDFs so it can be used in Excel or databases has long been a requirement and a pain point for many folks doing R&D or investigations into officially published data. With data published in Indian languages, even more so.

I was recently able to make a single Python program do the whole job: taking a PDF, decrypting it if needed, OCR'ing it (Marathi/Hindi script plus some English parts), extracting tabular data from it, transforming the data into a proper table with one data item per row, and finally saving it to either an Excel file or a database.
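To make the "one data item per row" step concrete, here is a minimal sketch, not the actual program. It assumes the extracted table has already been turned into a list of dicts (for example with pandas' DataFrame.to_dict("records")). The function names, the table name "items" and the sample district figures are all made up for illustration.

```python
import sqlite3

def unpivot(rows, id_column):
    """Turn wide rows (one column per measure) into one data item per row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            # Skip the identifier column itself and empty cells.
            if column == id_column or value in (None, ""):
                continue
            long_rows.append((row[id_column], column, value))
    return long_rows

def save_to_sqlite(long_rows, db_path=":memory:"):
    """Write (entity, field, value) tuples to a hypothetical 'items' table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (entity TEXT, field TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)", long_rows)
    conn.commit()
    return conn

# Invented sample data standing in for an extracted table.
wide = [
    {"district": "Pune", "2019": "120", "2020": "135"},
    {"district": "Nashik", "2019": "80", "2020": ""},
]
long_rows = unpivot(wide, "district")
conn = save_to_sqlite(long_rows)
```

With pandas in the picture, DataFrame.melt and DataFrame.to_sql / to_excel would do the same job in fewer lines; the plain-Python version just shows the idea.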

It uses these Python libraries, plus some dependencies that I could easily install on my Ubuntu laptop:

pikepdf (decryption) | ocrmypdf (OCR) | tabula-py (table extraction)

The last part in that chain is crucial, and it has by now evolved to offer several good options for targeting specific areas of a page, setting exact column positions, etc. The core program that the Python library wraps is tabula-java.
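As a rough illustration of the area and column targeting, tabula-py's read_pdf accepts area and columns arguments given in PDF points. The sketch below is only an assumption of how one might call it: the helper names, the file name and all the coordinates are invented, and tabula is imported lazily because it needs Java installed.

```python
def table_options(area, columns, pages=1):
    """Build keyword arguments for tabula.read_pdf (helper name is made up)."""
    return {
        "pages": pages,
        "area": list(area),        # [top, left, bottom, right] in PDF points
        "columns": list(columns),  # x coordinates of the column boundaries
        "guess": False,            # do not auto-detect the table region
        "stream": True,            # split on whitespace, not ruling lines
    }

def extract_tables(pdf_path, options):
    """Run tabula-py on one PDF; requires tabula-py and a Java runtime."""
    import tabula  # imported here so the rest works without it installed
    return tabula.read_pdf(pdf_path, **options)

# Invented coordinates; measure the real ones from your own PDF.
options = table_options(area=(100, 30, 750, 570), columns=(120, 260, 400))
# tables = extract_tables("ocr_output.pdf", options)  # hypothetical file name
```

Setting guess to False matters here: otherwise tabula may try to find the table on its own instead of sticking to the area you gave it.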

For the OCR part, it got about 80 to 90% of my target PDF right, which was in Marathi. ocrmypdf runs on the Tesseract project, which is continuously evolving, so we can expect it to get better over time. It also has a handy deskew option for photo-scanned PDFs where everything got tilted.
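For anyone wanting to try the OCR step, here is a small sketch that calls the ocrmypdf command line with its -l (language) and --deskew options. It assumes ocrmypdf is installed along with the Tesseract language packs for mar, hin and eng. The function names and file names are made up.

```python
import subprocess

def build_ocr_command(src, dst, languages=("mar", "hin", "eng"), deskew=True):
    """Assemble the ocrmypdf command line for a multi-language scan."""
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if deskew:
        cmd.append("--deskew")  # straighten tilted, photo-scanned pages
    cmd += [src, dst]
    return cmd

def run_ocr(src, dst, **kwargs):
    """Run ocrmypdf; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_ocr_command(src, dst, **kwargs), check=True)

# Hypothetical file names, shown only to illustrate the resulting command.
cmd = build_ocr_command("decrypted.pdf", "ocr_output.pdf")
# run_ocr("decrypted.pdf", "ocr_output.pdf")
```

ocrmypdf also has a Python API (ocrmypdf.ocr), but going through the command line keeps the sketch independent of the installed version.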

I'm not able to share a demo program at this point, but I want to let people know: hey, it can be done! Just follow the trail. And get in touch if you want to implement something.

PS: I started on this with the Excalibur and Tabula user-interface programs, ran into several hurdles, dug deeper, and found the libraries I needed to work directly from my code. So I have already tried those. Excalibur was promising, but it has bugs now and is no longer maintained.