Replace text in PDF


Sergey Kozlov

Feb 5, 2014, 1:19:12 AM
to pdfhummus-in...@googlegroups.com
Hi.
I need to replace placeholder text in an existing PDF with other text (in effect, creating a new file from a template PDF).
I'm using PDFWriterTestPlayground/ModifyingExistingFileContent.cpp as a base.
My steps:
1) open the file for modification with ModifyPDF()
2) copy everything except "Contents" (as in ModifyingExistingFileContent.cpp)
3) read the page's "Contents" stream into a string (as in the "Parsing PDF" how-to) and replace the text in a few places
4) at this point I tried two approaches:
    a) do the same operations as the example does here [modifiedPageObject->WriteKey("MediaBox"); .....]

       page_object->WriteKey("Contents");
       PDFStream *s = objects.StartPDFStream();
       IByteWriter *out = s->GetWriteStream();
       out->Write(...);
       s->FinalizeStreamWrite();
       objects.EndDictionary(page_object);
       objects.EndIndirectObject();

       I thought this would create a "Contents" entry with the stream data, the way the example does for "MediaBox".

    b) use PDFModifiedPage

        PDFModifiedPage page(pdf, 0);
        AbstractContentContext *ctx = page.StartContentContext();
        ctx->WriteFreeCode(...);
        page.EndContentContext();
        page.WritePage();

    Either way, the saved file still has the original page contents.

So my question is: where is my mistake, and how do I modify the text?
Thanks!
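
Step 3 above (replacing text in the decoded "Contents" string) can be sketched in plain C++ as follows. This is only a sketch under a strong assumption: the placeholder appears as literal bytes inside a single string operand such as `(Dear {{NAME}}) Tj`. Many PDFs split text across several operators or use font-specific encodings, in which case a byte-level search will not find it. The helper names ReplacePlaceholder and EscapePdfLiteral are hypothetical and are not part of PDFHummus.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Escapes the characters that are special inside a PDF literal string
// "( ... )": backslash and both parentheses.
std::string EscapePdfLiteral(const std::string& text)
{
    std::string out;
    for (char c : text) {
        if (c == '(' || c == ')' || c == '\\')
            out += '\\';
        out += c;
    }
    return out;
}

// Replaces every occurrence of `placeholder` in a decoded content stream
// with `replacement` and returns how many replacements were made.
// Assumption: the placeholder is stored as literal bytes in the stream.
std::size_t ReplacePlaceholder(std::string& content,
                               const std::string& placeholder,
                               const std::string& replacement)
{
    assert(!placeholder.empty()); // an empty needle would never advance
    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = content.find(placeholder, pos)) != std::string::npos) {
        content.replace(pos, placeholder.length(), replacement);
        pos += replacement.length(); // continue after the inserted text
        ++count;
    }
    return count;
}
```

The returned count is worth checking: zero means the placeholder was not found in the form assumed here, so the stream should be left untouched.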

Sergey Kozlov

Feb 5, 2014, 9:06:54 AM
to pdfhummus-in...@googlegroups.com
OK, I understand now that 4.a is totally incorrect.
After analyzing the sources and some PDFs, I figured out that the page content is a Flate-encoded stream.
And I can get to it:

    PDFDocumentCopyingContext* copying_ctx = pdf->CreatePDFCopyingContextForModifiedFile();
    PDFParser *parser = copying_ctx->GetSourceDocumentParser();
    PDFObjectCastPtr<PDFDictionary> page(parser->ParsePage(0));

    PDFObjectCastPtr<PDFIndirectObjectReference> contents(page->QueryDirectObject("Contents"));
    PDFObjectCastPtr<PDFStreamInput> stream(parser->ParseNewObject(contents->mObjectID));

This stream is what I need to change, but PDFStreamInput only holds a position in the global PDF stream, which is read-only.
So it's still not clear how to change the contents of the page.
Any advice?

Gal Kahana

Feb 6, 2014, 1:58:56 AM
to pdfhummus-in...@googlegroups.com
Hi Sergey,
Assuming that you have the new content for the page in some string "string myNewContent" what you should do is basically this:

1. Create a new indirect object that will be the content stream, and write to it the string.
ObjectIDType contentObjectID = objectContext.StartNewIndirectObject();
PDFStream* newStream = objectContext.StartPDFStream();
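// IByteWriter::Write takes a pointer to IOBasicTypes::Byte, so the char buffer needs a cast
newStream->GetWriteStream()->Write((const IOBasicTypes::Byte*)myNewContent.c_str(), myNewContent.length());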
objectContext.EndPDFStream(newStream);
objectContext.EndIndirectObject();

2. Now we have the new content stream, and its ID is in contentObjectID. All that's left to do is to recreate the page object and change its Contents entry to point to this object.

2.1 Start a modified object for the page and copy everything but Contents (I'm skipping the code that you already seem to have figured out):
objectContext.StartModifiedIndirectObject(pageObjectID);
DictionaryContext* modifiedPageObject = objectContext.StartDictionary();

// pageObjectIt iterates over the entries of the original page dictionary
while(pageObjectIt.MoveNext())
{
    // skip the old Contents entry; it is rewritten in 2.2
    if(pageObjectIt.GetKey()->GetValue() != "Contents")
    {
        modifiedPageObject->WriteKey(pageObjectIt.GetKey()->GetValue());
        copyingContext->CopyDirectObjectAsIs(pageObjectIt.GetValue());
    }
}

2.2 Write the new content stream reference to replace the old one:
// point Contents at the stream created in step 1
modifiedPageObject->WriteKey("Contents");
objectContext.WriteIndirectObjectReference(contentObjectID);

2.3 Finalize the modified page object:
objectContext.EndDictionary(modifiedPageObject);
objectContext.EndIndirectObject();

Done :)

Hope this helps,
Gal.

Sergey Kozlov

Feb 6, 2014, 3:47:17 AM
to pdfhummus-in...@googlegroups.com
Awesome! Thanks Gal!
My last idea was to somehow replace the stream in the existing contents object, but
it seems creating a new object is the only way.
It's clearer now how to change things, but I have a few small questions,
if you don't mind.
I checked the resulting PDF, and it seems the lib just appends the new data to the end.
The resulting PDF contains multiple xref tables and trailer sections.
I don't know the PDF structure well: can this cause problems in
PDF viewers, or is it OK?
And back to my task: the old contents object is not used anymore. Is it
possible to remove it, or to not write it to the result?

Thanks again.

Gal Kahana

Feb 6, 2014, 6:31:06 AM
to pdfhummus-in...@googlegroups.com
Hi,
adding extra trailer/info and xref is what's done when a file is being modified, as opposed to creating a new one.
It should be 100% perfectly OK.
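
To see the appended sections for yourself, a rough check is to count the `%%EOF` markers in the file: a file written once normally ends with one, and each modification pass appends another xref/trailer section ending in its own marker. The sketch below is only a heuristic with hypothetical helper names (CountMarker, CountRevisions), not anything provided by PDFHummus. It assumes the whole file has already been read into a std::string, and the count can be skewed by binary stream data that happens to contain the marker, and possibly by linearized files.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts non-overlapping occurrences of `marker` in `data`.
std::size_t CountMarker(const std::string& data, const std::string& marker)
{
    assert(!marker.empty()); // an empty marker has no meaningful count
    std::size_t count = 0;
    for (std::size_t pos = data.find(marker);
         pos != std::string::npos;
         pos = data.find(marker, pos + marker.length())) {
        ++count;
    }
    return count;
}

// Heuristic: every saved revision of a PDF ends with "%%EOF", so the
// number of markers approximates the number of revisions in the file.
std::size_t CountRevisions(const std::string& pdfBytes)
{
    return CountMarker(pdfBytes, "%%EOF");
}
```

For a template modified once as in this thread, the expected result would be 2: the original revision plus the appended one.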

As for your second question: in a modification scenario you can mark the old contents object for deletion through
EStatusCode DeleteObject(ObjectIDType inObjectID);

of IndirectObjectsReferenceRegistry, which you can get via the objects context (for more info, check the wiki entry on modification: https://github.com/galkahana/PDF-Writer/wiki/Modification).

If you want it completely out of the file, then modification is not the way to go. Rather, the best way is to create a new file and copy everything you need from the source document. Unless it's critical for you to remove that unused object completely, I suggest refraining from this method. It's more complex.

Gal.

Sergey Kozlov

Feb 6, 2014, 8:00:49 AM
to pdfhummus-in...@googlegroups.com
Hi.

> adding extra trailer/info and xref is what's done when a file is being
> modified, as opposed to creating a new one.
> It should be 100% perfectly OK.
>
Thanks for the explanation, Gal. I just wanted to be sure that the resulting PDF
will be correct.

> if you want it completely out then modification is not the way to go.
> rather, the best way is to create a new file and copy the all that you
> need from the source document. unless it's critical for you to remove
> that non-used object completely, i suggest refraining from this
> method. it's more complex.
Can you explain briefly what I need to do to copy to the new file? As I
understand it, I can't reuse the indirect objects in this situation and
have to recreate them.

Your lib is pretty cool, thanks for sharing!

Gal Kahana

Feb 6, 2014, 8:40:40 AM
to pdfhummus-in...@googlegroups.com
To copy a file to a new file, you create the new file with a new PDFWriter, then create a PDF copying context for the source file.
Then you can start copying elements to the new file.
The PDF copying context has a useful parser object, similar to the modification scenario.

if you want to try to do the same thing with a new file, here is what i suggest:

1. Start a new pdf with a new pdfwriter. let's call it "target"
2. Start a new copying context for the "source" file. This page explains the copying context: https://github.com/galkahana/PDF-Writer/wiki/PDF-Embedding#wiki-using-copying-context

note that a new instance of copying context has a parser available via GetSourceDocumentParser().

3. use the "source" file parser to construct the new content stream string, as you did before in the modification scenario

4. create a new stream object with the string as its contents in the target file, and remember the object ID for this stream, as you did in the modification scenario

5. use the copying context's ReplaceSourceObjects(), passing a map with a single entry that pairs the source document's Contents object ID with the new object ID. This causes any later copying to skip the old contents and use the object already in the target file, the one we created in step 4.

6. import the page from the source document to the new one. use AppendPDFPageFromPDF(inIndex) of the copying context, where inIndex is the page index in the source document

7. repeat the process with any other pages that you want to replace the content for, or just import any other page that you want using AppendPDFPageFromPDF(inIndex).

This should do the trick.
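
The ordering in steps 4 to 6 matters, and a toy model may make the reason clearer. The sketch below is not PDFHummus code: ToyCopyingContext, CopyObject and CopiedCount are hypothetical stand-ins, and the map direction (source ID as key, target ID as value) is an assumption about how ReplaceSourceObjects() is meant to be used. It only models the bookkeeping: a copying context remembers which source object maps to which target object, and an entry registered up front makes a later copy reuse the existing target object instead of copying the source one.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using ObjectId = unsigned long; // stand-in for the library's ObjectIDType

// Toy model of the source -> target bookkeeping in a copying context.
class ToyCopyingContext {
public:
    explicit ToyCopyingContext(ObjectId firstFreeTargetId)
        : mNextTargetId(firstFreeTargetId) {}

    // Pre-registers source -> target pairs (step 5). Objects listed here
    // are never copied; references to them resolve to the given target.
    void ReplaceSourceObjects(const std::map<ObjectId, ObjectId>& pairs) {
        for (const auto& p : pairs) {
            assert(p.second != 0); // object 0 is never an in-use object
            mSourceToTarget[p.first] = p.second;
        }
    }

    // Returns the target ID for a source object, "copying" it on first use
    // (this is what importing a page in step 6 triggers for each object).
    ObjectId CopyObject(ObjectId sourceId) {
        auto it = mSourceToTarget.find(sourceId);
        if (it != mSourceToTarget.end())
            return it->second; // already copied or replaced
        ObjectId targetId = mNextTargetId++;
        mSourceToTarget[sourceId] = targetId;
        mCopied.push_back(sourceId);
        return targetId;
    }

    // Number of source objects that were actually copied.
    std::size_t CopiedCount() const { return mCopied.size(); }

private:
    ObjectId mNextTargetId;
    std::map<ObjectId, ObjectId> mSourceToTarget;
    std::vector<ObjectId> mCopied;
};
```

In this model, registering the pair only after the page has been imported would have no effect, because the old contents object would already have been copied. That is why the new stream is created and registered before the page is appended.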

Gal.

Sergey Kozlov

Feb 6, 2014, 9:55:49 AM
to pdfhummus-in...@googlegroups.com
Very helpful, thanks again.
I'll try this way too.

WBR, Sergey.