Best practices for PDF editing

rsippl

Mar 9, 2018, 11:08:38 AM
to pdfium
Hi,

PDFium has some PDF editing features, exposed mainly through fpdf_edit.h and fpdf_annot.h. I'm considering using PDFium as the backend API for a PDF editor.

For reading the input file, you need to implement FPDF_FILEACCESS by assigning a function to its member m_GetBlock. That function is called, for example, every time you call FPDF_LoadPage.
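A minimal sketch of such a read callback, for illustration only. To stay self-contained it declares a stand-in struct, FileAccessSketch, that mirrors the fields of FPDF_FILEACCESS; real code would include fpdfview.h and use FPDF_FILEACCESS itself. The names FileAccessSketch, GetBlockFromFile and InitFileAccess are hypothetical, and the backing store is assumed to be a plain std::FILE*.

```cpp
#include <cassert>
#include <cstdio>

// Stand-in mirroring the fields of PDFium's FPDF_FILEACCESS (fpdfview.h).
// Assumption: real code includes fpdfview.h and uses FPDF_FILEACCESS directly.
struct FileAccessSketch {
  unsigned long m_FileLen;
  int (*m_GetBlock)(void* param, unsigned long position,
                    unsigned char* pBuf, unsigned long size);
  void* m_Param;
};

// Read callback: copies `size` bytes starting at `position` into pBuf.
// Returns non-zero on success and 0 on failure (including short reads).
static int GetBlockFromFile(void* param, unsigned long position,
                            unsigned char* pBuf, unsigned long size) {
  std::FILE* file = static_cast<std::FILE*>(param);
  if (!file || !pBuf) return 0;
  if (std::fseek(file, static_cast<long>(position), SEEK_SET) != 0) return 0;
  return std::fread(pBuf, 1, size, file) == size ? 1 : 0;
}

// Hypothetical helper: fills the struct for an already opened file.
static bool InitFileAccess(std::FILE* file, FileAccessSketch* access) {
  assert(access != nullptr);
  if (!file) return false;
  if (std::fseek(file, 0, SEEK_END) != 0) return false;
  long length = std::ftell(file);
  if (length < 0) return false;
  access->m_FileLen = static_cast<unsigned long>(length);
  access->m_GetBlock = &GetBlockFromFile;
  access->m_Param = file;  // handed back to the callback as `param`
  return true;
}
```

With the real header, the filled struct would be passed to FPDF_LoadCustomDocument, and the file has to stay open for as long as the document handle is in use, since the callback can fire long after loading.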

Once you've made changes to your FPDF_DOCUMENT and its FPDF_PAGEs, you save it using one of the functions declared in fpdf_save.h, e.g. FPDF_SaveAsCopy, which takes an FPDF_FILEWRITE. Similar to m_GetBlock in FPDF_FILEACCESS on the reading side, FPDF_FILEWRITE needs a function WriteBlock (naming and parameter types seem a bit inconsistent, no "m_" prefix here).

So m_GetBlock and WriteBlock will operate block-wise on the input and output file, respectively: m_GetBlock reads from the input file into an unsigned char* pBuf, and WriteBlock writes a const void* pData to the output file (consistency could be improved a bit here).
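On the writing side, the callback only receives a pointer to the struct itself, so any state (such as the output file) has to travel with it. A common way to do that is to embed the struct as the first member of a larger one. The sketch below uses a stand-in, FileWriteSketch, that mirrors the fields of FPDF_FILEWRITE; FileWriter, WriteBlockToFile and MakeFileWriter are hypothetical names, and real code would include fpdf_save.h instead.

```cpp
#include <cassert>
#include <cstdio>

// Stand-in mirroring the fields of PDFium's FPDF_FILEWRITE (fpdf_save.h).
// Assumption: real code includes fpdf_save.h and uses FPDF_FILEWRITE directly.
struct FileWriteSketch {
  int version;  // interface version; the PDFium header asks for 1
  int (*WriteBlock)(FileWriteSketch* pThis, const void* pData,
                    unsigned long size);
};

// Wrapper carrying the output state. `base` must stay the first member so
// that a pointer to it can be converted back to the enclosing FileWriter.
struct FileWriter {
  FileWriteSketch base;
  std::FILE* file;
  bool failed;
};

// Write callback: appends one block to the output file.
// Returns non-zero on success and 0 on failure.
static int WriteBlockToFile(FileWriteSketch* pThis, const void* pData,
                            unsigned long size) {
  FileWriter* writer = reinterpret_cast<FileWriter*>(pThis);
  assert(writer->file != nullptr);
  if (std::fwrite(pData, 1, size, writer->file) != size) {
    writer->failed = true;  // remembered so the caller can discard the output
    return 0;
  }
  return 1;
}

// Hypothetical helper: builds a writer for an already opened file.
static FileWriter MakeFileWriter(std::FILE* file) {
  FileWriter writer;
  writer.base.version = 1;
  writer.base.WriteBlock = &WriteBlockToFile;
  writer.file = file;
  writer.failed = false;
  return writer;
}
```

With the real header, the address of the embedded struct is what would be handed to FPDF_SaveAsCopy, which then calls WriteBlock once per block until the document has been written out.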

Please correct me if I'm wrong, but the input file and the output file need to be different files. In an editor you expect to repeatedly make changes, undo/redo, save, make changes and save again, ideally without keeping all FPDF_PAGEs in memory, so both m_GetBlock and WriteBlock will be called as long as the user is using the editor.

I wonder what's best practice here. Here's one idea.

When opening a document, make a temporary file copy and have m_GetBlock operate on it as long as the document is open. Every time the user hits Save, FPDF_SaveAsCopy overwrites the original file via WriteBlock (here's the dangerous part :)). When the document is closed, the temporary file is deleted.

Please let me know if this makes sense or if I'm missing something.

Thanks,
Ralf

dsin...@chromium.org

Mar 9, 2018, 11:37:04 AM
to pdfium
If the user is editing the document, you don't need to call SaveAsCopy until they're ready to save. You can make all the edits in memory using the PDFium API calls; when the user wants to save, you call SaveAsCopy and write out to disk (or save into memory, or whatever you're doing). To avoid keeping all pages in memory, you could track which pages the user has edited and keep those in memory until the user saves the document. You can discard unedited pages as the user moves through the document.
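One way to picture the bookkeeping described above, as a rough sketch rather than anything PDFium provides: a small tracker that closes unedited pages when the user leaves them and holds edited ones until the next save. EditedPageTracker and PageHandle are hypothetical; the loader and closer callbacks are assumed to wrap FPDF_LoadPage and FPDF_ClosePage in real code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <unordered_set>
#include <utility>

using PageHandle = void*;  // stand-in for FPDF_PAGE (assumption)

// Keeps edited pages loaded until the next save and drops the rest.
class EditedPageTracker {
 public:
  using Loader = std::function<PageHandle(int)>;
  using Closer = std::function<void(PageHandle)>;

  EditedPageTracker(Loader load, Closer close)
      : load_(std::move(load)), close_(std::move(close)) {}

  // Returns the loaded page, loading it on first use.
  PageHandle Acquire(int index) {
    auto it = loaded_.find(index);
    if (it != loaded_.end()) return it->second;
    PageHandle page = load_(index);
    if (page) loaded_.emplace(index, page);
    return page;
  }

  // Records that the page was modified; it must currently be loaded.
  void MarkEdited(int index) {
    assert(loaded_.count(index) == 1);
    edited_.insert(index);
  }

  // The user moved away from the page: close it unless it has edits.
  void Leave(int index) {
    if (edited_.count(index) == 0) Release(index);
  }

  // Call after a successful save. For simplicity this closes every edited
  // page, including one that may still be on screen.
  void OnSaved() {
    for (int index : edited_) Release(index);
    edited_.clear();
  }

  std::size_t LoadedCount() const { return loaded_.size(); }

 private:
  void Release(int index) {
    auto it = loaded_.find(index);
    if (it == loaded_.end()) return;
    close_(it->second);
    loaded_.erase(it);
  }

  Loader load_;
  Closer close_;
  std::unordered_map<int, PageHandle> loaded_;
  std::unordered_set<int> edited_;
};
```

The number of entries in the edited set is also a natural place to hang the auto-save threshold: once it grows past some limit, save and call OnSaved().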

For your writing issue, it's usually safer to write to a temporary file as you stream the data out to disk. Once that write is done, move the temporary file on top of the original file. This leaves less chance of the save being interrupted and corrupting the document (due to power loss, application termination, etc.). If you're saving to a temp file, you could also periodically auto-save to clear out the edited pages if you see you're getting above some threshold.
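The write-then-move pattern can be sketched with std::filesystem (C++17). This is an illustration under assumptions, not a hardened implementation: SaveViaTempFile and the ".tmp" suffix are hypothetical, the write_all callback stands in for whatever streams the document out (for example a WriteBlock callback fed by FPDF_SaveAsCopy), and nothing here forces the data to disk before the rename.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <functional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Streams the new contents into a temporary file next to `target`, then
// renames it over `target`. The original is only replaced after the whole
// write succeeded. Returns false (leaving the original untouched) on failure.
static bool SaveViaTempFile(
    const fs::path& target,
    const std::function<bool(std::ostream&)>& write_all) {
  assert(!target.empty());
  fs::path temp = target;
  temp += ".tmp";  // same directory, so the rename stays on one filesystem

  bool ok = false;
  {
    std::ofstream out(temp, std::ios::binary | std::ios::trunc);
    ok = out.is_open() && write_all(out);
    out.close();  // flushes buffered data; a failure sets the stream's failbit
    ok = ok && !out.fail();
  }

  std::error_code ec;
  if (ok) {
    fs::rename(temp, target, ec);  // replaces an existing target
    ok = !ec;
  }
  if (!ok) fs::remove(temp, ec);  // best-effort cleanup of the partial file
  return ok;
}
```

On POSIX systems the rename replaces the target atomically, so a reader sees either the old or the new file, never a half-written one. Surviving power loss would additionally need the temporary file synced to disk before the rename, which standard C++ does not offer portably.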

dan

rsippl

Mar 10, 2018, 1:18:38 PM
to pdfium
Streaming the data out to a temp file makes a lot of sense.

As for keeping only the edited pages in memory, I don't suppose that can be done using the public API? Let's assume the editor keeps a handle to a document (FPDF_DOCUMENT) as long as it's open. When a page is displayed or edited, an FPDF_PAGE is retrieved via FPDF_LoadPage. After making changes, you call FPDFPage_GenerateContent, and then you can discard the FPDF_PAGE instance. Am I right in assuming that FPDF_LoadPage loads a page on first access and caches it until you close the document?

dsin...@chromium.org

Mar 11, 2018, 11:33:00 AM
to pdfium
If you unload the page, the document won't keep a copy. You need to keep the page loaded for as long as you need it. Resources may still be loaded in the document, since they can be shared, but the page itself is unloaded.

dan

rsippl

Mar 15, 2018, 6:25:42 PM
to pdfium
By "unload", do you mean FPDF_ClosePage?

If I make a change to a page, e.g. add an annotation, then close the page (FPDFPage_GenerateContent, FPDF_ClosePage), and finally reload that same page (FPDF_LoadPage), the changes I made are still there. So the document must be keeping a copy of the page?

Dan Sinclair

Mar 20, 2018, 2:19:30 PM
to Ralf S., pdfium
Ah, yeah, doing GenerateContent would save the data back into the document, I believe.

dan


