Hey, folks.
I have to manipulate a bunch of large PDFs. I wrote a Go program that uses Ghostscript to split the documents into smaller ones based on page numbers. I think I've gotten it running as fast as I know how: it takes about 0.7 seconds per new PDF, and in the actual production run we anticipate generating over two hundred thousand PDFs, which works out to roughly 39 hours if run serially. It's okay if this takes a long time, but it would be nice to minimize the time as much as possible.
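To make the setup concrete, here is a minimal sketch of what one per-range Ghostscript call could look like from Go. It assumes the split is done with the pdfwrite device and the -dFirstPage/-dLastPage options; the function names splitArgs and splitPages, the file names, and the page range in main are hypothetical and not taken from the real program.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// splitArgs builds Ghostscript arguments that copy pages first..last
// (1-based, inclusive) of input into a new PDF written to output.
// Assumption: the split uses the pdfwrite device.
func splitArgs(input, output string, first, last int) []string {
	return []string{
		"-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
		"-sDEVICE=pdfwrite",
		fmt.Sprintf("-dFirstPage=%d", first),
		fmt.Sprintf("-dLastPage=%d", last),
		"-sOutputFile=" + output,
		input,
	}
}

// splitPages starts one gs process for a single page range and
// includes the gs output in the error if the process fails.
func splitPages(input, output string, first, last int) error {
	cmd := exec.Command("gs", splitArgs(input, output, first, last)...)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("gs failed for %s: %w: %s", output, err, out)
	}
	return nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: split <input.pdf> <output.pdf>")
		os.Exit(2)
	}
	// Hypothetical range: extract pages 1-10 of the input.
	if err := splitPages(os.Args[1], os.Args[2], 1, 10); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Each call in this sketch starts a fresh gs process and has it open the source PDF again, so part of the per-PDF time may be startup and parsing rather than the actual page copy.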
This is on Linux VMs. We can throw more CPU and RAM at it eventually, but I'm just working on prototypes now.
I started off trying goroutines for this (my first time using them in my short career as an aspiring Go developer), but they didn't seem to make much of a difference. Because of the dependence on gs, I ended up writing the program so that I can kick off multiple instances that cooperate on the entire data set, e.g., each instance grabs a big PDF and works on it exclusively.
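For comparison, one shape the goroutine version could take is a bounded worker pool, where each worker would launch its own gs process. This is only a sketch under assumptions: the job type, its fields, and runPool are hypothetical names, and the work function in main is a stand-in that prints instead of calling gs.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// job describes one output PDF to produce. The fields are hypothetical.
type job struct {
	Input, Output string
	First, Last   int
}

// runPool calls work for every job using at most `workers` goroutines
// and returns how many jobs reported an error.
func runPool(jobs []job, workers int, work func(job) error) int {
	if workers < 1 {
		workers = 1
	}
	queue := make(chan job)
	var wg sync.WaitGroup
	var mu sync.Mutex
	failed := 0

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker pulls jobs until the queue is closed.
			for j := range queue {
				if err := work(j); err != nil {
					mu.Lock()
					failed++
					mu.Unlock()
				}
			}
		}()
	}

	for _, j := range jobs {
		queue <- j
	}
	close(queue)
	wg.Wait()
	return failed
}

func main() {
	jobs := []job{
		{Input: "big.pdf", Output: "part-001.pdf", First: 1, Last: 10},
		{Input: "big.pdf", Output: "part-002.pdf", First: 11, Last: 20},
	}
	// Stand-in for the real gs call: it only prints what it would do.
	failed := runPool(jobs, runtime.NumCPU(), func(j job) error {
		fmt.Printf("would split %s pages %d-%d into %s\n",
			j.Input, j.First, j.Last, j.Output)
		return nil
	})
	fmt.Println("failed jobs:", failed)
}
```

The worker count is set to the number of CPUs here on the assumption that the gs child processes are CPU-bound; if disk I/O is the bottleneck, a different number may work better.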
I thought one way to make it faster would be to use a native library instead of handing things off to gs. We already have Java on these boxes, so I thought I'd try another prototype in either Groovy or Scala and use Apache PDFBox to manipulate the PDFs. (I'm new to both Scala and Groovy, and practically new to Java, so I thought Scala or Groovy would be a little quicker going.)
What do you think?
-Ali