Manipulating PDFs: better to use Go + Ghostscript or Scala/Groovy + Apache PDFBox?

Ali Nabavi

Oct 6, 2015, 3:30:01 PM
to golang-nuts
Hey, folks.

I have to manipulate a bunch of large PDFs.  I created a program for this in Go that uses Ghostscript to split the docs into smaller docs based on page numbers.  I think I've gotten it running as fast as I know how.  It takes about 0.7 seconds for each new PDF, and in the actual production run we're anticipating generating over two hundred thousand PDFs.  It's okay if this takes a long time, but it would be nice to minimize the time as much as possible.

This is on Linux VMs.  We can throw more CPU and RAM at it eventually, but I'm just working on prototypes now.

I started off trying to use goroutines for this, for the first time in my short career as an aspiring Go developer, but it didn't seem to make much of a difference.  Because of the dependence on gs, I ended up writing it so I could kick off multiple instances of the program to operate nicely on the entire set of data, e.g., grabbing a big PDF and operating on it exclusively.

I thought one way it might work faster would be to use a native library instead of handing things off to gs.  We already have Java on these boxes and I thought I'd try another prototype in either Groovy or Scala and use Apache's PDFBox to manipulate the PDFs.  (I'm new to both Scala and Groovy, and practically new to Java, so thought Scala or Groovy would be a little quicker going.)

What do you think?

-Ali

James Aguilar

Oct 7, 2015, 1:11:56 AM
to golang-nuts
You're just going to be shooting in the dark unless you actually know what the problem is/where the time is being spent. E.g. if it's disk IO, switching to Scala is very unlikely to help, and goroutines won't either. What steps have you taken to identify the source of the problem? 

Rusco

Oct 7, 2015, 8:42:19 AM
to golang-nuts
This works seamlessly for creating simple PDF files; it might help:

https://github.com/jung-kurt/gofpdf

Philip Feairheller

Oct 8, 2015, 11:12:49 AM
to golang-nuts
We had a similar project come up recently.  Our environment is 100% Go REST/JSON servers, so my first attempt was with the jung-kurt/gofpdf library already mentioned here.  Our requirements quickly outpaced this library's capabilities (heavy multimedia embedding, document splitting and merging, etc.), so we turned to PDFBox.

Since we deploy everything with Ansible, spinning up a new VM type with Java was relatively easy, but what really accelerated the effort was using dropwizard.io as the framework for building the PDF server.  With its simple endpoint definitions and single (fat) jar deployment model, we were able to get a server with 80% of our features up and running in one sprint (2 weeks).

We are treating it as just another data service behind our API servers (thus behind the firewall), so all authentication, etc was left to the Go servers which connect and basically proxy the PDF back to the clients.

PDFBox itself is a very nice library with more features than we'll probably need.

waTR

May 16, 2016, 2:11:50 PM
to golang-nuts
Is doing the conversion of PDF to image still best done outside of golang, or has there been some progress?