Hello Karthik and Nikhil,
On 05/06/19 8:52 pm, Nikhil VJ wrote:
> Hi Karthik,
>
> Answering your second question: what's the best way to read the data?
>
> I recommend using Tabula - "for liberating data tables locked inside PDF
> files."
>
> https://tabula.technology/
>
> With this you can go to a specific page, select the area of the table
> with the mouse to exclude unnecessary things, and extract the data to
> a CSV.
>
> There are two ways it uses to extract, so be sure to try the other if
> one doesn't work out for you.
>
> And some manual work may be needed after extraction in case there are
> extra spaces, line-breaks in the header, etc.
>
> Note: If the table you need is a scanned image in this PDF, then Tabula
> is not applicable. It only works with text-based (vector) PDFs.
>
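The clean-up mentioned above can usually be scripted. Here is a minimal
sketch using only Python's standard library. The function names and
file paths are hypothetical, and it assumes the extracted CSV is UTF-8.
It collapses line breaks and repeated spaces inside each cell into a
single space and drops rows that are completely empty.

```python
import csv
import re


def clean_cell(text):
    """Collapse line breaks and runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def clean_csv(src_path, dst_path):
    """Copy a CSV, normalising whitespace in every cell."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            cleaned = [clean_cell(cell) for cell in row]
            if any(cleaned):  # skip rows that are entirely empty
                writer.writerow(cleaned)
```

Note that a header word split across two lines will end up with a
space in the middle, so headers may still need a quick look by hand.
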
While Tabula is a great tool for extracting tabular data from PDFs (we
rely on it for quite a few tasks in my company), it sometimes fails to
extract a table in a form that is easy to use. Another tool that we use
in such cases is "Camelot: PDF Table Extraction for Humans"
https://camelot-py.readthedocs.io/
I created a sample CSV document for the table on page 131 of the report
that you linked. Please find it attached.
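
In case it is useful, here is a minimal sketch of what such an
extraction can look like with Camelot's Python API. The function name
and the file names in the usage line are placeholders, and the choice
of the "lattice" flavor is an assumption: it suits tables drawn with
ruling lines, while "stream" is meant for tables separated only by
whitespace.

```python
def extract_page_to_csv(pdf_path, page, csv_path, flavor="lattice"):
    """Write the first table Camelot finds on `page` to `csv_path`."""
    import camelot  # pip install camelot-py; imported lazily

    # `pages` is a string of 1-based page numbers, e.g. "131" or "1,3-5".
    tables = camelot.read_pdf(pdf_path, pages=str(page), flavor=flavor)
    if len(tables) == 0:
        raise ValueError(f"no tables found on page {page}")

    tables[0].to_csv(csv_path)
    # The parsing report holds Camelot's own quality metrics for the table.
    return tables[0].parsing_report


# Hypothetical usage:
# extract_page_to_csv("report.pdf", 131, "page-131.csv")
```

If the output looks wrong with one flavor, trying the other is usually
the first thing to do.
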
Similar to Tabula, Camelot also has a web interface, Excalibur, that
you can use to select a particular area of the table and get a CSV.
https://www.tryexcalibur.com/
I hope this helps! Feel free to reach out to me if you have any queries
or would like some assistance along the way.
Cheers! :)
--
Dhanesh B. Sabane
https://dhanesh95.gitlab.io
PGP ID: 0xB69A98C9C1642329
Fingerprint: 9655 11F2 0D18 E76A 2396 D64D B69A 98C9 C164 2329