Extracting table data out of PDFs

Sankar

Jun 30, 2016, 4:35:31 AM
to golang-nuts
Hi

Are there any stable, production-quality Go libraries that people are aware of which can read and extract tabular data from PDF documents?

Thanks

Sankar

Peter Waller

Jun 30, 2016, 5:06:42 AM
to Sankar, golang-nuts
Hi Sankar,

It may not be exactly what you're looking for but I can't resist the opportunity to plug our product! PDFTables.com has a remote API, you can see an example of how to use it here:


You can get an API key and find out more from here:


I created the repository in response to your request but we have been using this code internally for a long time now without issue, and the site has been alive and stable since 2014.

We also offer the software as an on-premises appliance which you access using the same API.

Feel free to reach out to he...@pdftables.com or me if you have any further questions.

Regards,

- Peter


--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sankar P

Jun 30, 2016, 5:11:11 AM
to Peter Waller, golang-nuts
Yes, I did come across your service when I was searching. My documents
contain PII, so I did not try the service. Having an on-premises
solution is encouraging. I will play with it. Thanks.
--
Sankar P
http://psankar.blogspot.com

Shawn Milochik

Jun 30, 2016, 11:22:44 AM
to golang-nuts
I don't know of a Go solution, but if you are on Linux you could try pdftotext and parse the text, with the obvious caveat that "it depends on how the PDF was encoded." In the worst case, you may be able to use tesseract OCR to generate text and then do the same thing.



Ian Lance Taylor

Jun 30, 2016, 11:36:50 AM
to Sankar, golang-nuts
On Thu, Jun 30, 2016 at 1:35 AM, Sankar <sankar.c...@gmail.com> wrote:
>
> Are there any stable/production-quality golang libraries that people are
> aware of which could read and extract tabular data out of PDF documents ?

I don't know if it does what you want, but have you looked at
https://godoc.org/rsc.io/pdf ?

Ian

Konstantin Khomoutov

Jun 30, 2016, 12:05:21 PM
to Sh...@milochik.com, Shawn Milochik, golang-nuts
On Thu, 30 Jun 2016 11:22:00 -0400
Shawn Milochik <shawn...@gmail.com> wrote:

> I don't know of a Go solution, but if you are on Linux you could try
> pdftotext and parse the text. With the obvious caveat of "it depends
> on how the PDF was encoded."

I'm using this approach in one of my applications.
The only problem with pdftotext is that its command-line interface is
limited: you can't tell it to read PDF data from stdin and write the
results to stdout at the same time, so you have to resort to using
temporary files.

The approach is to first manually play with `pdftotext` on the sample
data set -- trying out its "-raw" and "-layout" options to see which
produces the most sensible results and then go with it.
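
A minimal sketch of that approach in Go, assuming `pdftotext` is on the
PATH and that "-layout" turned out to be the better option for your data.
The helper names (pdfToText, splitColumns) are made up for illustration,
and splitting cells on runs of two or more spaces is only a heuristic
that will not suit every PDF:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
	"path/filepath"
	"regexp"
	"strings"
)

// columnGap matches runs of two or more spaces, which -layout output
// often leaves between columns. This is a heuristic, not a guarantee.
var columnGap = regexp.MustCompile(` {2,}`)

// splitColumns breaks one line of -layout output into cells.
// Blank lines yield nil.
func splitColumns(line string) []string {
	line = strings.TrimSpace(line)
	if line == "" {
		return nil
	}
	return columnGap.Split(line, -1)
}

// pdfToText copies the PDF data to a temporary file, runs
// `pdftotext -layout` on it, and returns the text written to a second
// temporary file.
func pdfToText(pdf io.Reader) (string, error) {
	dir, err := os.MkdirTemp("", "pdftotext")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	in := filepath.Join(dir, "in.pdf")
	out := filepath.Join(dir, "out.txt")

	f, err := os.Create(in)
	if err != nil {
		return "", err
	}
	if _, err := io.Copy(f, pdf); err != nil {
		f.Close()
		return "", err
	}
	if err := f.Close(); err != nil {
		return "", err
	}

	if err := exec.Command("pdftotext", "-layout", in, out).Run(); err != nil {
		return "", err
	}
	text, err := os.ReadFile(out)
	return string(text), err
}

func main() {
	if len(os.Args) < 2 {
		// No PDF given: just show the column splitter on a sample line.
		fmt.Println(splitColumns("Item        Qty     Price"))
		return
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	text, err := pdfToText(f)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, line := range strings.Split(text, "\n") {
		if cells := splitColumns(line); cells != nil {
			fmt.Println(strings.Join(cells, " | "))
		}
	}
}
```

Cells that contain single spaces stay intact, but a column that is only
one space away from its neighbour will be merged with it, so check the
output against a few sample pages first.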

I should note that I needed to extract a few strings, not tabular data,
so the OP might have better results with `pdftohtml` from the same
package, which can produce XML output that can be parsed with the
encoding/xml package.
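
A rough sketch of the encoding/xml side, assuming the `pdftohtml -xml`
output consists of <page> elements holding <text> elements with integer
"top" and "left" attributes. The type names, the rows helper and the
sample document are made up for illustration. Treating text fragments
with the same "top" value as one table row is a simplification that
real documents may not satisfy:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"sort"
	"strings"
)

// Text is one positioned text fragment. Inline markup such as <b> inside
// the element is not handled by this sketch.
type Text struct {
	Top   int    `xml:"top,attr"`
	Left  int    `xml:"left,attr"`
	Value string `xml:",chardata"`
}

// Page holds the text fragments of a single page.
type Page struct {
	Number int    `xml:"number,attr"`
	Texts  []Text `xml:"text"`
}

// Doc is the root element of the XML output.
type Doc struct {
	Pages []Page `xml:"page"`
}

// sample is a hand-written stand-in for real pdftohtml output.
const sample = `<pdf2xml>
<page number="1">
<text top="100" left="200" width="30" height="12" font="0">Qty</text>
<text top="100" left="50" width="40" height="12" font="0">Item</text>
<text top="120" left="50" width="60" height="12" font="0">Widget</text>
<text top="120" left="200" width="10" height="12" font="0">3</text>
</page>
</pdf2xml>`

// parse decodes the XML document into a Doc.
func parse(data []byte) (Doc, error) {
	var d Doc
	err := xml.Unmarshal(data, &d)
	return d, err
}

// rows groups the fragments of a page by their "top" coordinate and
// orders each group from left to right.
func rows(p Page) [][]string {
	byTop := map[int][]Text{}
	for _, t := range p.Texts {
		byTop[t.Top] = append(byTop[t.Top], t)
	}
	tops := make([]int, 0, len(byTop))
	for top := range byTop {
		tops = append(tops, top)
	}
	sort.Ints(tops)

	var out [][]string
	for _, top := range tops {
		cells := byTop[top]
		sort.Slice(cells, func(i, j int) bool { return cells[i].Left < cells[j].Left })
		row := make([]string, len(cells))
		for i, c := range cells {
			row[i] = strings.TrimSpace(c.Value)
		}
		out = append(out, row)
	}
	return out
}

func main() {
	doc, err := parse([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, p := range doc.Pages {
		for _, r := range rows(p) {
			fmt.Println(strings.Join(r, " | "))
		}
	}
}
```

For real input, read the XML file written by `pdftohtml -xml` and pass
its contents to parse instead of the sample. Rows whose cells sit a
pixel or two apart vertically would need a tolerance rather than an
exact match on "top".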

Sankar P

Jul 11, 2016, 6:05:11 AM
to Ian Lance Taylor, golang-nuts
> I don't know if it does what you want, but have you looked at
> https://godoc.org/rsc.io/pdf ?

It seems to be unmaintained. I tried loading a complex PDF with plenty
of tables and it hung indefinitely in the Content() call on the first
page. I lost interest after that. Thanks.

Sankar P

Jul 11, 2016, 6:11:23 AM
to Konstantin Khomoutov, Sh...@milochik.com, Shawn Milochik, golang-nuts
Using pdftohtml and then running regexes or a parser on its output seems
to be the easiest solution as of now. I came across tabula-java, which
also seems interesting. Thank you everyone for the recommendations. I
still haven't got multiple tables on a single page, or tables that
overflow across pages, working correctly. But no native Go FOSS library
seems to exist as of now.


Sankar P

Jul 11, 2016, 6:12:37 AM
to Ian Lance Taylor, golang-nuts
2016-07-11 15:34 GMT+05:30 Sankar P <sankar.c...@gmail.com>:
>> I don't know if it does what you want, but have you looked at
>> https://godoc.org/rsc.io/pdf ?
>
> It seems to be unmaintained. I tried loading a complex PDF with plenty
> of tables and it hung infinitely on Content() call in the first page.
> I lost interest after that. Thanks.

A comment by Russ Cox at
https://github.com/rsc/pdf/issues/3#issuecomment-168734094 made me
consider it unmaintained. I forgot to paste the link in the previous
mail. Sorry.