How can I implement search and replace functionality in PDF?


Support

Dec 4, 2007, 8:04:36 PM
to PDFTron PDFNet SDK
Q: I need to allow simple search and replace functionality in PDFs
with our software. So I'd like to search for a string of characters
in a PDF, possibly with the use of a regular expression, and upon
finding the text we're looking for, I want to replace that text with
something else. So is there a way to use PDFTron to search for text,
get some sort of location identifier so that we can identify where the
text is located, and then replace it? I know that PDF files are
encoded in different ways, and most of the time, the data in PDF files
is not contiguous. The data can be spread out in all different
locations of the PDF file, so would PDFTron be able to search the text
based on what the user would see if the file were rendered on the
screen?

-----

A: You can use PDFNet SDK to implement text search and replace in PDF.
To implement the first part of the problem (i.e. text search) you can
use the pdftron.PDF.TextExtractor class (as illustrated in the
TextExtract sample project -
www.pdftron.com/net/samplecode.html#ElementEdit).

For example, the following code can be used to search text word by
word:

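// 'page' is the pdftron.PDF.Page to search. 'pattern' is assumed to be a
// string holding the regular expression to look for.
TextExtractor txt = new TextExtractor();
txt.Begin(page);

TextExtractor.Word word;
// Walk every line on the page, then every word in the line.
for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid();
     line = line.GetNextLine()) {
  for (word = line.GetFirstWord(); word.IsValid();
       word = word.GetNextWord()) {
    string s = word.GetString();
    if (System.Text.RegularExpressions.Regex.IsMatch(s, pattern)) {
      // Location of the matched word on the page.
      Rect bbox = word.GetBBox();
      ...
    }
  }
}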

To match the search string you can use the standard regular expression
API in .NET or Java, or a third-party library if you are developing in
C/C++.
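
As a rough illustration of the matching step, the sketch below applies
the standard java.util.regex API to plain strings. It assumes the word
strings have already been collected from the page with word.GetString();
the class WordMatcher and the method matchingIndexes are made up for
this example and are not part of PDFNet. The .NET Regex class can be
used in much the same way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFNet: decides which extracted words
// match a search expression.
final class WordMatcher {
    // Returns the indexes of the words that contain a match for 'regex'.
    // Anchor the expression with ^ and $ to require a whole-word match.
    static List<Integer> matchingIndexes(List<String> words, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            if (pattern.matcher(words.get(i)).find()) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

Keeping the indexes rather than the strings makes it easy to look up
the bounding box of each matched word afterwards. Note that matching
word by word will not find a phrase that spans several words; for that
you would have to match against the joined text of a line instead.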

Implementing the second part of the problem (i.e. replacing the matched
text) is a bit trickier. Essentially you would follow the pattern
illustrated in the ElementEdit sample project
(http://www.pdftron.com/net/samplecode.html#ElementEdit). In your case
you would copy the page's elements to a new page, leaving them
unmodified unless they need to be replaced: if the current text element
lies within the bounding box (i.e. bbox) of a word matched in step 1,
you write a new text element containing the replacement string in its
place. Please note that using element.SetTextData() is not the
recommended way to replace text content. It is usually a better idea to
replace the text element completely instead of trying to edit the
original text data. For more information on this topic, please search
for "modify text" in the PDFNet Knowledge Base
(http://groups.google.com/group/pdfnet-sdk/).
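
The bounding-box test used in the second step can be sketched without
the SDK. The Box class below is a hypothetical stand-in for
pdftron.PDF.Rect, and the tolerance parameter is an assumption added
for this example; in real code the boxes would come from
word.GetBBox() and from the text element being copied.

```java
// Hypothetical stand-in for pdftron.PDF.Rect: an axis-aligned box in page
// coordinates, with (x1, y1) the lower-left and (x2, y2) the upper-right
// corner.
final class Box {
    final double x1, y1, x2, y2;

    Box(double x1, double y1, double x2, double y2) {
        this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
    }

    // True if 'inner' lies inside this box, allowing 'tol' units of slack
    // on every side.
    boolean contains(Box inner, double tol) {
        return inner.x1 >= x1 - tol && inner.y1 >= y1 - tol
            && inner.x2 <= x2 + tol && inner.y2 <= y2 + tol;
    }

    // True if the two boxes overlap with positive area.
    boolean intersects(Box other) {
        return x1 < other.x2 && other.x1 < x2
            && y1 < other.y2 && other.y1 < y2;
    }
}
```

A single text element may hold more than one word, in which case its
box will be larger than the box of the matched word. The intersects
check is included for that situation: it flags elements that merely
overlap the match, which you would then need to split or rebuild
rather than replace outright.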