PLFS Data in easy-to-read format


Karthik Shashidhar

Jun 5, 2019, 8:47:54 AM
to datameet
As you would expect from the Indian government, while the Periodic Labour Force Survey (PLFS) for 2017-18 was finally released, it has been released as a single PDF. Has anyone succeeded in converting all the tables into an easy-to-read format such as CSV or Excel?



And if nobody has managed to get the data into Excel yet, what's the best way to read the data?

Thanks
Karthik

Nikhil VJ

Jun 5, 2019, 11:22:38 AM
to datameet
Hi Karthik,

Answering your second question: what's the best way to read the data? 

I recommend using Tabula - "for liberating data tables locked inside PDF files."

With this you can go to a specific page, select the area of the table with the mouse to exclude anything unnecessary, and extract the data to a CSV.

Tabula offers two extraction methods (Stream and Lattice), so be sure to try the other if one doesn't work out for you.
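If you'd rather script this than click through the GUI, here is a rough sketch using the tabula-py wrapper (my assumption; it is a separate package from the Tabula app and needs Java installed). The function names, the page number and the area coordinates are placeholders for illustration, not values taken from the report.

```python
def extraction_kwargs(page, area=None):
    """Build keyword arguments for both of Tabula's extraction modes."""
    base = {"pages": page, "multiple_tables": True}
    if area is not None:
        # area is [top, left, bottom, right] in PDF points (placeholder values)
        base["area"] = area
    return {
        "lattice": {**base, "lattice": True},  # tables with ruling lines
        "stream": {**base, "stream": True},    # tables aligned by whitespace
    }


def extract_both(pdf_path, page, area=None):
    """Run both modes on one page so the results can be compared by eye."""
    import tabula  # third-party: pip install tabula-py

    results = {}
    for mode, kwargs in extraction_kwargs(page, area).items():
        # read_pdf returns a list of pandas DataFrames when multiple_tables=True
        results[mode] = tabula.read_pdf(pdf_path, **kwargs)
    return results
```

Calling `extract_both("report.pdf", 131)` (with a hypothetical file name) would give you one list of DataFrames per mode, and you keep whichever looks right.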

Some manual work may still be needed after extraction if there are extra spaces, line breaks in the header, etc.
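Part of that cleanup can be scripted. Below is a minimal sketch using only the Python standard library; it collapses stray whitespace and line breaks inside cells and drops rows that are completely empty. The sample rows are invented for the demo and are not PLFS figures.

```python
import csv
import io
import re


def clean_cell(value):
    """Collapse line breaks and repeated whitespace inside one cell."""
    return re.sub(r"\s+", " ", value).strip()


def clean_csv(text):
    """Return cleaned CSV text, dropping rows that are entirely empty."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        cleaned = [clean_cell(cell) for cell in row]
        if any(cleaned):  # skip rows where every cell is blank
            writer.writerow(cleaned)
    return out.getvalue()


# Invented sample: a header broken across two lines, padded cells, a blank row
raw = '"State /\nUT"," Rural  male ",Urban\n,,\nRegion A, 1.0 ,2.0\n'
print(clean_csv(raw))
```

Merged header cells and footnote markers will probably still need a manual pass.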

Note: If the table you need is a scanned image in this PDF, then Tabula is not applicable. It only works with vector (text-based) data.


Regards,
Nikhil VJ, Pune, India

Dhanesh B. Sabane

Jun 7, 2019, 4:46:02 PM
to data...@googlegroups.com
Hello Karthik and Nikhil,

On 05/06/19 8:52 pm, Nikhil VJ wrote:
> Hi Karthik,
>
> Answering your second question: what's the best way to read the data? 
>
> I recommend using Tabula - "for liberating data tables locked inside PDF
> files."
> https://tabula.technology/
>
> With this you can go to a specific page, select the area of the table
> with mouse to exclude unnecessary things, and extract the data to a CSV. 
>
> There are two ways it uses to extract, so be sure to try the other if
> one doesn't work out for you. 
>
> And some manual work may be needed after extraction in case there's
> extra spaces, line-breaks in the header, etc.
>
> Note: If the table you need is a scanned image in this PDF, then tabula
> is not applicable. Only works with vector data.
>

While Tabula is a great tool to extract tabular data from PDFs (we rely
on it for quite a few tasks in my company), sometimes it fails to
correctly extract tabular data in a way that can be easily used by the
user. Another tool that we use, in such cases, is "Camelot: PDF Table
Extraction for Humans"

https://camelot-py.readthedocs.io/

I created a sample CSV document for the table on page 131 of the report
that you linked. Please find it attached.
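For anyone who wants to reproduce something similar, here is a rough sketch with Camelot's Python API. It is not necessarily the exact procedure behind the attached file: the PDF path, the default flavor and the helper names are assumptions for illustration.

```python
def csv_name(stem, page):
    """Name the output file after the report and the page number."""
    return f"{stem}-{page}.csv"


def page_to_csv(pdf_path, page, stem="annual-report", flavor="lattice"):
    """Extract the first table found on one page and write it to CSV."""
    import camelot  # third-party: pip install camelot-py

    # Camelot takes page numbers as a string, e.g. "131" or "1,3-5"
    tables = camelot.read_pdf(pdf_path, pages=str(page), flavor=flavor)
    if len(tables) == 0:
        raise ValueError(f"no table detected on page {page}")
    table = tables[0]
    # parsing_report gives accuracy and whitespace scores for a quick sanity check
    print(table.parsing_report)
    out_path = csv_name(stem, page)
    table.to_csv(out_path)
    return out_path
```

If the "lattice" flavor finds nothing, passing `flavor="stream"` is worth a try, much like switching methods in Tabula.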

Similar to Tabula, Camelot also has a web interface that you can use to
select a particular area of the table and get a CSV.

https://www.tryexcalibur.com/

I hope this helps! Feel free to reach out to me if you have any queries
or would like some assistance along the way.

Cheers! :)

--
Dhanesh B. Sabane
https://dhanesh95.gitlab.io
PGP ID: 0xB69A98C9C1642329
Fingerprint: 9655 11F2 0D18 E76A 2396 D64D B69A 98C9 C164 2329
annual-report-131.csv

Nikhil VJ

Jun 8, 2019, 1:03:21 AM
to datameet
Hi Dhanesh,

Wow! Thanks for sharing about Excalibur. The column separator feature meets a long-standing need!

Note for users: Set the "Flavor" dropdown on the right side to "Stream" to be able to use the column separator.
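The same idea seems to be available from Python through Camelot's `columns` argument, which as far as I can tell is only honoured by the stream flavor. A small sketch, with made-up x-coordinates and hypothetical helper names:

```python
def column_spec(x_positions):
    """Format separator x-coordinates (PDF points) the way Camelot expects:
    a list holding one comma-separated string per table area."""
    return [",".join(str(x) for x in sorted(x_positions))]


def extract_with_separators(pdf_path, page, x_positions):
    """Stream-flavor extraction with explicit column separators."""
    import camelot  # third-party: pip install camelot-py

    return camelot.read_pdf(
        pdf_path,
        pages=str(page),
        flavor="stream",  # column separators apply to stream, not lattice
        columns=column_spec(x_positions),
    )
```

The x-coordinates can be read off by eye in Excalibur, or from a plot of the page.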

-Nikhil