Hi Mohit,
Just to add: a hacked-but-working workflow that extracts the table
structure and OCRs the individual cells as needed is in my GitHub
repository, for instance here (at the bottom of the Perl file):
https://github.com/raphael-susewind/india-religion-politics/blob/master/rajrolls2014/run-in-arc/pdf2list.pl
It boils down to
pdf-table-extract -i $file -p $page -r 300 -l 0.7 -t cells_xml
for each page, parsing the results to extract cell coordinates, then
gs -q -r300 -dFirstPage=$page -dLastPage=$page -sDEVICE=tiffgray \
   -sCompression=lzw -o $temp.tif -g${width}x${height} \
   -c "<</Install {-$bufferx -$buffery translate}>> setpagedevice" -f $file
to get a TIFF of this cell, to be fed into
tesseract -psm 4 -l hin $temp.tif stdout
(in the case of Devanagari)
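
For anyone who prefers Python to Perl, the three steps above can be sketched roughly as follows. This is not the code from the Perl file linked above, only a minimal illustration under assumptions: the function names (extract_cmd, crop_cmd, ocr_cmd, ocr_cell) are made up for this example, the cell size and offsets are assumed to have been parsed from the cells_xml output already, and the offsets are assumed to be in whatever units $bufferx and $buffery use in the original script. Newer Tesseract releases (4.x and later) spell the page segmentation option --psm, so the flag is a parameter here.

```python
import subprocess
from typing import List


def extract_cmd(pdf: str, page: int) -> List[str]:
    """Argument list for the pdf-table-extract call quoted above."""
    return ["pdf-table-extract", "-i", pdf, "-p", str(page),
            "-r", "300", "-l", "0.7", "-t", "cells_xml"]


def crop_cmd(pdf: str, page: int, out_tif: str, width: int, height: int,
             shift_x: float, shift_y: float) -> List[str]:
    """Argument list for the Ghostscript call that renders a single cell.

    width and height set the output size (-g, in device pixels). shift_x and
    shift_y stand in for $bufferx and $buffery (assumed units, see lead-in)
    and move the cell's corner to the page origin.
    """
    install = (f"<</Install {{{-shift_x:g} {-shift_y:g} translate}}>> "
               "setpagedevice")
    return ["gs", "-q", "-r300", f"-dFirstPage={page}", f"-dLastPage={page}",
            "-sDEVICE=tiffgray", "-sCompression=lzw", "-o", out_tif,
            f"-g{width}x{height}", "-c", install, "-f", pdf]


def ocr_cmd(tif: str, lang: str = "hin", psm_flag: str = "-psm") -> List[str]:
    """Argument list for the Tesseract call; pass psm_flag="--psm" on 4.x+."""
    return ["tesseract", psm_flag, "4", "-l", lang, tif, "stdout"]


def ocr_cell(pdf: str, page: int, out_tif: str, width: int, height: int,
             shift_x: float, shift_y: float, lang: str = "hin") -> str:
    """Render one cell to a TIFF, then return the text Tesseract finds in it."""
    subprocess.run(crop_cmd(pdf, page, out_tif, width, height,
                            shift_x, shift_y), check=True)
    result = subprocess.run(ocr_cmd(out_tif, lang), check=True,
                            capture_output=True, encoding="utf-8")
    return result.stdout
```

Passing argument lists to subprocess.run instead of one shell string avoids quoting problems with the PostScript snippet and with file names that contain spaces.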
Best of luck,
Raphael
On 01/23/2017 09:20 AM, Amanbir Singh wrote:
> Hi Mohit,
>
> You'll have to use OCR on the pdf before any other method can be
> applied. This obviously makes it more complicated, but still manageable.
>
> You could use Tesseract, a popular OCR package
> (https://github.com/tesseract-ocr/tesseract), and then try using Tabula
> or the other packages mentioned. I've also had success using Xpdf
> (http://www.foolabs.com/xpdf/) to convert PDFs to text and then parsing
> the text.
>
> Aman
>
>
> On Friday, 20 January 2017 18:18:59 UTC+5:30, mohit ranjan wrote:
>
> Tried Tabula, but again it's for PDFs that have all the metadata
> embedded in them.
> I need it for paper-scanned PDF/JPG files, and it fails with this message:
>
> "Sorry, your PDF file is image-based; it does not have any embedded
> text. It might have been scanned from paper... Tabula isn't able to
> extract any data from image-based PDFs. Click the Help button for
> more information."
>
> - Mohit
>
> On Fri, Jan 20, 2017 at 6:14 PM, Srinivasan Ramani
> <sriniv...@gmail.com> wrote:
>
> Tabula (http://tabula.technology/) works great for table
> extraction from PDFs.
>
> On Fri, Jan 20, 2017 at 5:51 PM, mohit ranjan
> <shoony...@gmail.com> wrote:
>
> Thanks for the response, Johnson.
>
> Is this the pdf-table-extract
> (https://github.com/ashima/pdf-table-extract) you are
> referring to?
> It says it reads table metadata from the PDF.
>
> My query was about scanned PDF/JPG images.
>
> - Mohit
>
> On Fri, Jan 20, 2017 at 4:37 PM, Johnson Chetty
> <johnso...@gmail.com> wrote:
>
>
> Hello,
>
> I have had some reasonable success with 'pdfquery'
> if you like Python. It works with regional text as
> well.
> Also, for tabular data, do try pdf-table-extract if
> quick and dirty works for you.
>
> Java folks should try pdfbox.
>
>
>
>
>
> On 20 January 2017 at 15:23, mohit ranjan
> <shoony...@gmail.com> wrote:
>
> Sorry if this is off-topic, but I have seen
> threads here about liberating data from PDFs.
> Most likely there will be a lot of scanned PDFs
> among them.
>
> Do we have any in-house expert on this, and which
> library/tool (preferably not paid) would you recommend
> to extract tables from scanned PDFs/JPGs?
>
> CVision
> (http://www.cvisiontech.com/library/ocr/file-ocr/ocr-table-recognition.html)
> does a decent job, but it's paid.
>
>
>
> - Mohit
>
> --
> Datameet is a community of Data Science
> enthusiasts in India. Know more about us by
> visiting http://datameet.org
> ---
> You received this message because you are
> subscribed to the Google Groups "datameet" group.
> To unsubscribe from this group and stop
> receiving emails from it, send an email to
> datameet+u...@googlegroups.com.
> <https://groups.google.com/d/optout>.
>
> --
> Best Regards,
> Srinivasan V. Ramani ,
> Associate Editor,
> The Hindu,
> Chennai.
> Ph: 07299033554
>
Dr Raphael Susewind | Postdoc, Max Planck Institute for the Study of
| Religious and Ethnic Diversity (MPI-MMG)
| Hermann-Föge-Weg 11, 37073 Göttingen, Germany
| https://www.raphael-susewind.de
Please consider PGP for encryption:
https://keybase.io/raphaelsusewind