Disassembling PDF based on list of Page Ranges (start page - End page)


Abhishek Kolluru

Dec 12, 2022, 2:52:43 PM
to live...@googlegroups.com
Hello All,

I am trying to read text from a PDF and, based on the content I read (for example, page numbers), split a large PDF into separate PDFs. There could be a lot of such page ranges. I store them in a list and then want to feed them to the Assembler service (along with the source document) and get the individual PDFs as a result. I have written my DDX to be iterative, based on the number of page ranges in the list. Is there a better way of implementing this? With this approach I am only able to process a source document with 500 pages, and my requirement is to do this for a PDF with around 25K pages. Seeking your expert thoughts on this.

Thanks,
Abhishek

Duane Nickull

Dec 12, 2022, 4:11:13 PM
to live...@googlegroups.com
A PDF with 25,000 pages?  My first recommendation is to not have such a large PDF.  What size is it?  Parsing this must take a huge amount of memory to hold an in-memory model of it.

Duane

--
You received this message because you are subscribed to the Google Groups "Adobe LiveCycle Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to livecycle+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/livecycle/CAOhfwPfg%3Dk0gVm6GOS5di2xtO0chLZKY_X0LSWFEyksDshyKrw%40mail.gmail.com.


Abhishek Kolluru

Dec 12, 2022, 7:24:56 PM
to live...@googlegroups.com
Yes Duane, that’s exactly what’s happening. The PDF is around 60 MB, but I believe the assembling and parsing have become expensive operations. I am planning to break it down into smaller chunks, but breaking it into separate PDFs based on page ranges is the part I am still trying to wrap my head around.

Thanks, 
Abhishek

Duane Nickull

Dec 12, 2022, 10:15:50 PM
to live...@googlegroups.com
Abhishek et al:

So this is actually a little-understood area of PDF processing. The core libraries that parse any document for transformation must literally make three separate representations in memory in order to complete their task. First, they must parse the input document and create an in-memory representation of that document. This seldom takes less memory than the entire document itself (it *can* if one were to write a super-efficient parser that "parked" things like TIFF images in a separate memory location and simply referenced them intact, but this is seldom done).

Secondly, they must build a blank "output" representation where the output will be placed. The third is the "instructions". Obviously it is far more efficient to use a set of Java directives as the instructions document than it was in the old XML days, when one had to use XSLT to do a third XML parse and build representations of each XSLT instruction.

The problem is not simply a 3:1 problem. From my experience profiling these types of apps, the in-memory representation of the input is almost never smaller than the input itself: the ratio is 1:1.XX (i.e., larger). The output representation usually starts as almost nothing but can grow quickly as things are added, and the instructions vary in size.

If the issue behind your question turns out to be excessive memory use, there are some options to explore with respect to setting the Java environment variables for things like memory management.

It's been a while since I worked for Adobe, but I think there are instructions available for your version of LC ES in the help docs for heap memory size, saturation of HMS, heap fragmentation and more. Some of these are affected by the environment in which the JVM runs and depend on how the OS allocates memory. It even depends on the app server you are running (WebSphere etc.). There are so many things to consider.
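
As a starting point for the heap tuning mentioned above, the following sketch prints the limits the JVM is actually running with. It uses only the standard java.lang.Runtime API; the class name HeapReport is made up for illustration. The maximum is what the standard -Xmx option controls, and how that option is passed to a LiveCycle server depends on the application server, so check the documentation for your setup.

```java
// Hypothetical helper (not part of LiveCycle): reports the heap limits of
// the JVM it runs in, using only java.lang.Runtime.
class HeapReport {

    // Converts a byte count to whole mebibytes.
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Returns a one-line summary of maximum, committed and used heap.
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();          // upper bound the heap may grow to
        long committed = rt.totalMemory();  // heap currently reserved by the JVM
        long used = committed - rt.freeMemory();
        return String.format("max=%d MiB, committed=%d MiB, used=%d MiB",
                toMiB(max), toMiB(committed), toMiB(used));
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Logging this before and after an Assembler call is a cheap way to see how close a run gets to the limit before reaching for a full profiler.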

Back to your first question though -> "I have written my DDX to be iterative based on the number of page Ranges i have within the List, Reaching out to know if there is any better way of implementing this ? "
My recollection is that I had to write a custom parser to parse the input of a very large PDF file, then drop all parts from memory that would not be used. Similar to XML SAX parsers, each event has a handler. If the handler has found a portion of a PDF doc that will *never* be used in the output, it can simply do nothing, which keeps memory free. By default, a complete in-memory representation of the PDF would normally be built, and if there is no use for a portion of it, this is wasted space.

I am not sure if this helps or is useful in your situation. In order to be of more use, I would have to be more knowledgeable in the specifics.

Good luck and keep us posted.

Duane Nickull

fred.pantalone

Dec 13, 2022, 9:26:31 AM
to Adobe LiveCycle Developers
Hi Abhishek,

I agree with Duane, a PDF that large (25K pages) is pretty much useless if anyone ever wants to open it. So, if there's any way to change this upstream then you should explore that first.

If you're stuck with this then I would move away from Assembler right away and work with a Java PDF library. There are many out there so it will take some research, PoCs, etc., but I'm sure you'll find a solution by going this route. I would start with Apache PDFBox because it's open source and free. I've used Big Faceless in the past and they had (have?) a solid product with great support.

Fred

Duane Nickull

Dec 13, 2022, 2:48:15 PM
to live...@googlegroups.com
Well... I didn't want to mention PDFBox on this list, but then again I no longer work for Adobe. I use PDFBox all the time and it has a much smaller footprint if you profile it (open source rocks!).

PDFBox is your friend -> https://stackoverflow.com/questions/40221977/pdfbox-split-pdf-in-multi-files-with-different-page-ranges-and-filenames

Duane


Abhishek Kolluru

Dec 13, 2022, 3:32:07 PM
to live...@googlegroups.com
Thanks Fred/Duane. The huge PDF is just a raw file for me to work on; it’s a stitched-up document with multiple records.

I was able to get through most of this. The only challenge I have is that the AssemblerResult object holds the individual PDF documents, and I want to extract them and write them to a directory (my use case 1). I could not think of a way to do this. I noticed an Avoka component that writes multiple files to a directory, so I passed it this "list<document>", but it seems to overwrite the files and leaves just one single file. The problem is that the Avoka component has no option to add a suffix to make the file names unique, so all the files take the source file name instead of the result name.

My requirement is to get the docs out of this variable and write all of them to a directory. My DDX to split the files is something like the one below:

 DDX string=
 <?xml version='1.0' encoding='UTF-8'?>
 <DDX xmlns='http://ns.adobe.com/DDX/1.0/'>
   <PDF result='P_1316_042020.pdf'>
     <PDF source='Doc1' pages='6-7' />
   </PDF>
   <PDF result='P_59884_042020.pdf'>
     <PDF source='Doc1' pages='8' />
   </PDF>
   <PDF result='P_2619_042020.pdf'>
     <PDF source='Doc1' pages='2' />
   </PDF>
   <PDF result='P_56007_042020.pdf'>
     <PDF source='Doc1' pages='5' />
   </PDF>
 </DDX>
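
A DDX of this shape could be generated from the list of page ranges along the following lines. This is only a sketch in plain Java: the class DdxBuilder and its methods are hypothetical names, and the ranges are assumed to be available as a map from result file name to a DDX page range string such as '6-7'. It does nothing about the memory cost of processing the source document itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds a DDX string that splits one source document
// into several result PDFs, one per (result name -> page range) entry.
class DdxBuilder {

    // sourceKey is the key of the input document in the Assembler input map
    // ("Doc1" in the DDX above); ranges maps result file name -> page range.
    static String buildSplitDdx(String sourceKey, Map<String, String> ranges) {
        StringBuilder ddx = new StringBuilder();
        ddx.append("<?xml version='1.0' encoding='UTF-8'?>\n");
        ddx.append("<DDX xmlns='http://ns.adobe.com/DDX/1.0/'>\n");
        for (Map.Entry<String, String> e : ranges.entrySet()) {
            // One result element per range, each closed exactly once.
            ddx.append("  <PDF result='").append(escape(e.getKey())).append("'>\n");
            ddx.append("    <PDF source='").append(escape(sourceKey))
               .append("' pages='").append(escape(e.getValue())).append("'/>\n");
            ddx.append("  </PDF>\n");
        }
        ddx.append("</DDX>\n");
        return ddx.toString();
    }

    // Minimal escaping for values placed inside single-quoted XML attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("'", "&apos;");
    }

    public static void main(String[] args) {
        Map<String, String> ranges = new LinkedHashMap<>();
        ranges.put("P_1316_042020.pdf", "6-7");
        ranges.put("P_59884_042020.pdf", "8");
        System.out.println(buildSplitDdx("Doc1", ranges));
    }
}
```

Building the string in one place makes it easy to keep every result element balanced, and a LinkedHashMap keeps the results in the order the ranges were found.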

I can even attach the sample .lca if it helps. Any thoughts/ideas/suggestions would be of great help.

Thanks,
Abhishek



fred.pantalone

Dec 13, 2022, 5:04:41 PM
to Adobe LiveCycle Developers
Nobody has to drink the Adobe Kool-Aid in this group!

fred.pantalone

Dec 13, 2022, 5:09:48 PM
to Adobe LiveCycle Developers
I think you wrote that you have a list of documents, is that correct? If so, just loop through the list and write each doc to disk. I've probably missed something...
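
A minimal sketch of such a loop in plain Java, under some assumptions: the results are taken to be name-to-content pairs, with byte[] standing in for whatever document type the workflow actually holds, and ResultWriter, writeAll and uniqueTarget are hypothetical names. Each file is written under its own result name, and a numeric suffix is added only when that name is already taken, which avoids the overwriting described earlier in the thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: writes each result document to a directory without
// overwriting files that share a name.
class ResultWriter {

    // Writes every entry to outputDir, using the map key as the file name,
    // and returns the paths written. byte[] is a stand-in for the real
    // document type (an assumption, not the LiveCycle API).
    static List<Path> writeAll(Map<String, byte[]> docs, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        List<Path> written = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : docs.entrySet()) {
            Path target = uniqueTarget(outputDir, e.getKey());
            Files.write(target, e.getValue());
            written.add(target);
        }
        return written;
    }

    // If name already exists in dir, appends _1, _2, ... before the extension.
    static Path uniqueTarget(Path dir, String name) {
        Path candidate = dir.resolve(name);
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        String ext = dot > 0 ? name.substring(dot) : "";
        for (int i = 1; Files.exists(candidate); i++) {
            candidate = dir.resolve(base + "_" + i + ext);
        }
        return candidate;
    }
}
```

In a real workflow the byte[] values would be replaced by the content of each result document, and the map keys by the result names given in the DDX.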