A few questions on PDF text extraction from a given rectangle.

Support

Nov 20, 2008, 9:15:36 PM

to PDFTron PDFNet SDK

Q: I need to extract text from a PDF at specific locations. I specify
the locations through rectangles whose corner coordinates I pass to
PDFView.SelectByRect() or SelectByStruct().

A general question:
(A)
- When I need to extract text from a PDF without displaying it, I hate
instantiating a PDFView control just for that. Is there a way to
extract the text at a lower level? Ideally, those functions would
reside in pdftron::PDF::Page. My concerns are performance (I have to
loop through a couple of million pages and have to do it quickly) and
memory usage.

Specific questions about PDFView.SelectByRect():
(B)
What are the four parameters x1, y1, x2, y2 exactly? The apiref.html
says briefly "PDF coordinates". But my experiments show that the
origin for those values has to be the TOP LEFT corner of the PDF -
which is different from the usual bottom left corner. Please mention
in your reply what x2, y2 is supposed to be, too.

(C)
I need to select text EXACTLY by a rectangle. No characters outside of
the rectangle may be returned. Unfortunately, SelectByRect() always
returns whole words. How can I set the granularity to character level,
so that only characters intersecting with my coordinates are returned?
SelectByStruct() seemed promising, but has the (for me unwanted)
side effect of selecting whole horizontal lines.

------
A: For extracting text from a rectangle, the right approach is to
use the TextExtractor class (as shown in the TextExtract sample project -
http://www.pdftron.com/net/samplecode.html#TextExtract).

You can either pass an optional clipping rectangle as the second
parameter to the text_extractor.Begin(page, box) method, or iterate
through all words on the page and test each word's bounding box
(word.GetBBox()) for intersection with the selection rectangle. Either
approach will be very fast and more memory efficient than using PDFView.
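
For the second route (iterating over words yourself), the intersection test could look roughly like the sketch below. Box and WordBox are hypothetical stand-ins for whatever the SDK returns from word.GetBBox() and for the word text; they are not PDFNet types, and the sketch assumes normalized rectangles (x1 <= x2, y1 <= y2).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a PDF rectangle; not the SDK's own type.
// Coordinates are assumed normalized: x1 <= x2 and y1 <= y2.
struct Box {
    double x1, y1, x2, y2;
};

// Hypothetical stand-in for an extracted word plus its bounding box.
struct WordBox {
    std::string text;
    Box bbox;
};

// True when the two boxes share a region of positive area.
// Boxes that only touch along an edge do not count as intersecting.
bool Intersects(const Box& a, const Box& b) {
    assert(a.x1 <= a.x2 && a.y1 <= a.y2);
    assert(b.x1 <= b.x2 && b.y1 <= b.y2);
    return a.x1 < b.x2 && b.x1 < a.x2 && a.y1 < b.y2 && b.y1 < a.y2;
}

// Keep only the words whose bounding box overlaps the selection rectangle.
std::vector<std::string> WordsInRect(const std::vector<WordBox>& words,
                                     const Box& selection) {
    std::vector<std::string> hits;
    for (const WordBox& w : words) {
        if (Intersects(w.bbox, selection)) {
            hits.push_back(w.text);
        }
    }
    return hits;
}
```

In a real loop you would fill WordBox from each word the extractor yields instead of building the vector up front; the test itself stays the same.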

> What are the four parameters x1, y1, x2, y2 in pdfview.SelectByRect() exactly?

These are the coordinates of the selection rectangle in screen
coordinates (not PDF coordinates - thanks for pointing this out). The
origin of the screen coordinate system is the top left corner, and the
units are pixels.
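
If you already have a rectangle in screen pixels and want to reuse it in PDF space, the conversion is essentially a scale plus a flip of the y axis. The sketch below shows the arithmetic under simplifying assumptions: the page is not rotated, its lower left corner is the PDF origin, and you know the zoom and where the page's top left corner sits on screen. PageView and ScreenToPdf are hypothetical names for illustration, not part of the SDK.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical rectangle type used for both screen and PDF space.
struct Box {
    double x1, y1, x2, y2;
};

// Hypothetical description of how an unrotated page is placed on screen.
struct PageView {
    double page_height_pt;  // page height in PDF units (points)
    double scale;           // screen pixels per PDF unit (zoom factor)
    double origin_x_px;     // screen x of the page's top left corner
    double origin_y_px;     // screen y of the page's top left corner
};

// Convert a screen rectangle (origin top left, y grows downward, pixels)
// to a PDF rectangle (origin bottom left, y grows upward, points).
Box ScreenToPdf(const Box& screen, const PageView& view) {
    assert(view.scale > 0.0);
    const double xa = (screen.x1 - view.origin_x_px) / view.scale;
    const double xb = (screen.x2 - view.origin_x_px) / view.scale;
    // Flip the y axis by measuring from the top of the page.
    const double ya =
        view.page_height_pt - (screen.y1 - view.origin_y_px) / view.scale;
    const double yb =
        view.page_height_pt - (screen.y2 - view.origin_y_px) / view.scale;
    // Normalize so that (x1, y1) is the lower left corner in PDF space.
    return Box{std::min(xa, xb), std::min(ya, yb),
               std::max(xa, xb), std::max(ya, yb)};
}
```

For example, on a 792 point tall page shown at 2 pixels per point with no offset, the screen rectangle (100, 50)-(300, 150) maps to the PDF rectangle (50, 717)-(150, 767).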

> I need to select text EXACTLY by a rectangle. No characters
> outside of the rectangle may be returned.

To achieve this, use the TextExtractor class to extract text from the
PDF: pass the selection rectangle as the second parameter and
TextExtractor.ProcessingFlags.e_remove_hidden_text as the third
parameter in the call to text_extractor.Begin(page, select,
TextExtractor::e_remove_hidden_text).
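
To make "exactly by a rectangle" concrete at the character level, here is a small sketch of two possible rules for glyphs that straddle the rectangle's edge. It is only an illustration of the geometry, not how TextExtractor clips internally; Glyph, HitPolicy and SelectChars are hypothetical names, and it assumes you can obtain a box per character.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical rectangle type, normalized so x1 <= x2 and y1 <= y2.
struct Box {
    double x1, y1, x2, y2;
};

// Hypothetical per-character record: the character and its box.
struct Glyph {
    char ch;
    Box box;
};

// Two possible rules for glyphs that straddle the selection edge.
enum class HitPolicy {
    kAnyOverlap,    // keep a glyph if any part of it is inside
    kCenterInside   // keep a glyph only if its center is inside
};

bool Hit(const Box& g, const Box& sel, HitPolicy policy) {
    assert(sel.x1 <= sel.x2 && sel.y1 <= sel.y2);
    if (policy == HitPolicy::kCenterInside) {
        const double cx = 0.5 * (g.x1 + g.x2);
        const double cy = 0.5 * (g.y1 + g.y2);
        return cx >= sel.x1 && cx <= sel.x2 && cy >= sel.y1 && cy <= sel.y2;
    }
    return g.x1 < sel.x2 && sel.x1 < g.x2 && g.y1 < sel.y2 && sel.y1 < g.y2;
}

// Collect, in order, the characters selected under the chosen policy.
std::string SelectChars(const std::vector<Glyph>& glyphs, const Box& sel,
                        HitPolicy policy) {
    std::string out;
    for (const Glyph& g : glyphs) {
        if (Hit(g.box, sel, policy)) {
            out.push_back(g.ch);
        }
    }
    return out;
}
```

With five 10 unit wide glyphs spelling "Hello" from x = 0 to 50 and a selection spanning x = 8 to 32, the overlap rule yields "Hell" while the center rule yields "el". Which rule you want depends on how tightly the rectangle was drawn.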

Support

Nov 24, 2008, 4:58:24 PM

Q: The TextExtractor class with the e_remove_hidden_text processing
option is a real winner! Thanks for pointing it out.
Just out of curiosity: What is the e_no_invisible_text option for? I
couldn't find its description in the APIREF.

-----
A: In PDF, a text element can be tagged as invisible (e.g. using
element.GetGState().SetTextRenderingMode(e_invisible_text)). This
feature is typically used to add OCR-ed text to scanned documents
(i.e. for searchable PDF images). The text can be used for text
search, copy & paste operations, etc., but is not directly visible.
The 'TextExtractor.ProcessingFlags.e_no_invisible_text' option can be
used to skip text marked as invisible. By default, TextExtractor will
extract all text.
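
As a rough picture of what skipping invisible text means: in the PDF format, text rendering mode 3 (neither fill nor stroke) is the invisible mode. The sketch below filters on that value. TextRun and JoinRuns are hypothetical names for illustration and say nothing about how the SDK implements the flag.

```cpp
#include <cassert>
#include <string>
#include <vector>

// PDF text rendering mode 3 means "neither fill nor stroke", i.e. invisible.
constexpr int kInvisibleRenderMode = 3;

// Hypothetical record for a run of text and its rendering mode (0..7).
struct TextRun {
    std::string text;
    int render_mode;
};

// Join runs with single spaces. By default everything is kept; with
// skip_invisible set, runs drawn in the invisible mode are left out.
std::string JoinRuns(const std::vector<TextRun>& runs,
                     bool skip_invisible = false) {
    std::string out;
    for (const TextRun& run : runs) {
        assert(run.render_mode >= 0 && run.render_mode <= 7);
        if (skip_invisible && run.render_mode == kInvisibleRenderMode) {
            continue;
        }
        if (!out.empty()) {
            out += ' ';
        }
        out += run.text;
    }
    return out;
}
```

On a scanned page with an OCR layer, leaving the option off returns the OCR text along with everything else, which is usually what you want for search; turning it on returns only what a reader can actually see.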

Dennis Van Acker

Jul 7, 2016, 3:00:57 PM

Using a TextExtractor with a rectangle built from the quads of a selection causes problems, though: it sometimes adds a character at the end or drops one at the beginning.

On Friday, November 21, 2008 at 03:15:36 UTC+1, Support wrote:
