How do I replace some text in a PDF?

Support

May 7, 2009, 5:33:23 PM
to PDFTron PDFNet SDK
Q: We have a PDF that is created by a PDF virtual printer. The data
are coming from the Crystal Reports application. We need to replace
e.g. TITLE text by something else.

I have used your ElementReader and have iterated through the elements,
but I found no “TITLE” text element.

When I inspected the PDF, I found that the text is created in a very
funny way:

BT
/F1 1 Tf
9.96 0 0 9.96 337.32 803.16 Tm
-0.001 Tc
[(-)-5(1.0)1(0)]TJ
-30.25302 0 Td
0 Tc
[(TIT)-17(L)1(E)]TJ
ET

No idea what the -17 and 1 things are – but I get from the reader a
separate text element per part: {TIT, L, E}.

How can I find out, as a developer, that these texts actually
form just one text string?

How can I find the physical text width and character positions? For
each text element I get always the same TransformationMatrix and CTM…

Text:TIT
Matrix:9.96;0;0;9.96;337.32;803.16
CTM:1;0;0;1;0;0

Text:L
Matrix:9.96;0;0;9.96;337.32;803.16
CTM:1;0;0;1;0;0

Text:E
Matrix:9.96;0;0;9.96;337.32;803.16
CTM:1;0;0;1;0;0

I have also played with your TextExtractor. That class is able to glue
the text parts together. However, in that context I have no access to
the text elements and therefore I cannot replace the text with new
text.

If this file is not parse-able with your toolkit, maybe you could
recommend me a different way (a different virtual printer) to generate
PDFs out of any 3rd party application.

-----
A: Text in a PDF may be broken into many small elements (text runs)
because the PDF creator applies kerning (small spacing adjustments)
between adjacent characters. The -17 and 1 in your TJ array are such
adjustments, expressed in thousandths of a text space unit. Text runs
may also differ in font, font size, or other properties. Unfortunately,
the PDF format usually does not preserve the semantic structure of text
the way HTML or Word does. For text extraction it is better to use the
TextExtractor class than ElementReader. Unlike ElementReader,
TextExtractor can recognize words, lines, and paragraphs within PDF
pages and can provide precise positioning information for each word.
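
To make the structure of the TJ operand concrete, here is a minimal Python sketch that splits an array such as [(TIT)-17(L)1(E)] into its string pieces and its numeric adjustments, then glues the pieces back together. It does not use PDFNet at all. The names parse_tj and shift_in_user_space are made up for this illustration, and the parser assumes simple literal strings (no escaped or nested parentheses, no hex strings). Per the PDF specification, each number is in thousandths of a text space unit and is subtracted from the current position, so a negative value moves the next glyph to the right.

```python
import re

# Matches either a literal string "(...)" or a number inside a TJ array.
# Assumption: no escaped or nested parentheses and no hex strings.
_TJ_ITEM = re.compile(r"\(([^()]*)\)|(-?\d*\.?\d+)")


def parse_tj(array_src):
    """Split the source of a TJ array into text pieces and adjustments."""
    pieces, adjustments = [], []
    for text, number in _TJ_ITEM.findall(array_src):
        if number:
            adjustments.append(float(number))
        else:
            pieces.append(text)
    return pieces, adjustments


def shift_in_user_space(adjustment, scale):
    """Horizontal shift caused by one TJ number.

    `scale` is the effective font scale (font size times the horizontal
    scale of the text matrix). The adjustment is subtracted, so a
    negative number yields a positive (rightward) shift.
    """
    return -adjustment / 1000.0 * scale


pieces, adjustments = parse_tj("[(TIT)-17(L)1(E)]")
print(pieces)            # ['TIT', 'L', 'E']
print(adjustments)       # [-17.0, 1.0]
print("".join(pieces))   # TITLE

# In the stream above the font size is 1 and Tm scales by 9.96.
print(shift_in_user_space(adjustments[0], 9.96))
```

This is why ElementReader reports three runs with the same text matrix: the matrix is set once by Tm, and the pieces differ only in the small offsets accumulated inside the TJ array.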

The simplest way to implement "find/replace" on text within an
existing PDF document is as follows:

1) Search for all occurrences of the string on the page. There are
several ways to implement this, but probably the simplest one is to
use pdftron.PDF.TextExtractor, as shown in the TextExtract sample project
(http://www.pdftron.com/net/samplecode.html#TextExtract). Given a word,
you can obtain its positioning information using word.GetBBox().

2) Edit the existing page (e.g. as illustrated in the ElementEdit sample –
www.pdftron.com/net/samplecode.html#ElementEdit). Use the bounding boxes
of the word(s) identified in step 1 to decide whether a given text run
should be deleted (i.e. skipped).

3) Add new content in place of the old text/content. This can be
implemented either after step 2 (e.g. as in
www.pdftron.com/pdfnet/faq.html#how_watermark), or during page copy.
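
The decision in step 2 comes down to a rectangle overlap test. Below is a minimal Python sketch of that logic only. It is not PDFNet code: Rect, intersects, and runs_to_keep are hypothetical names, and all coordinates are invented for illustration. In a real program the target boxes would come from word.GetBBox() in step 1, and the run boxes from whatever bounding box the SDK reports for each text element.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned box in page coordinates (hypothetical stand-in)."""
    x1: float
    y1: float
    x2: float
    y2: float


def intersects(a, b):
    """True if the two rectangles overlap with non-zero area."""
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


def runs_to_keep(runs, targets):
    """Drop every (text, box) run whose box overlaps a target word box."""
    return [
        (text, box)
        for text, box in runs
        if not any(intersects(box, word_box) for word_box in targets)
    ]


# Box of the word "TITLE" as found in step 1 (made-up numbers).
targets = [Rect(36.0, 800.0, 70.0, 812.0)]

# Text runs as they might be seen while editing the page (made-up numbers).
runs = [
    ("-1.00", Rect(337.0, 800.0, 360.0, 812.0)),
    ("TIT", Rect(36.0, 800.0, 54.0, 812.0)),
    ("L", Rect(54.2, 800.0, 60.0, 812.0)),
    ("E", Rect(60.0, 800.0, 66.0, 812.0)),
]

kept = runs_to_keep(runs, targets)
print([text for text, _ in kept])  # ['-1.00']
```

Matching by geometry rather than by string content is what lets the three runs TIT, L, and E be removed together, even though none of them contains the full word.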