Format of -textrun output in PDF2Text?

Support
Aug 23, 2011, 5:50:09 PM
to pdf2...@googlegroups.com

Q: I’m intrigued by the -f textruns format. Can you tell me what the values in the output represent?
90, 322.276, 264.732, 12, "TOVLTO+TimesNewRoman", , 54, "and across the doses the median value was consistently"

I’m guessing that 90 is the right offset, 322.276 is the bottom offset, 264.732 is the length, 12 is the height, and “TOV...” is the font.
I’m not sure what the empty field (, ,) or 54 represent.

-------------------

A: This option is relatively low-level. It returns text runs, the basic building blocks of text in a PDF. The feature is mainly useful for debugging, but there are other creative ways it could be used.

The first 4 numbers represent a bounding-box (a rectangle [x, y, w, h]) for the run.

"TOVLTO+TimesNewRoman" is the name of the PDF font used to render the run.

54 is the font size (as it appears on the page).
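
For anyone who wants to post-process this output, here is a minimal Python sketch that splits one textrun line into named fields. It assumes every line has the same eight comma-separated fields as the sample in the question, and it labels them according to the description above. The names TextRun and parse_textrun are made up for this example and are not part of PDF2Text or PDFNet. The sixth field is not explained in this thread, so it is kept as a raw string.

```python
import csv
from typing import NamedTuple


class TextRun(NamedTuple):
    """One line of textruns output (hypothetical container, not a PDFNet type)."""
    x: float
    y: float
    width: float
    height: float
    font_name: str
    unknown: str      # sixth field: empty in the sample, meaning not documented here
    font_size: float  # labeled as in the answer above
    text: str


def parse_textrun(line: str) -> TextRun:
    # skipinitialspace drops the blank after each comma, so the quoted
    # font name and text are read as proper quoted CSV fields.
    fields = next(csv.reader([line], skipinitialspace=True))
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    x, y, width, height = (float(v) for v in fields[:4])
    return TextRun(x, y, width, height,
                   fields[4], fields[5], float(fields[6]), fields[7])


# The sample line from the question.
sample = ('90, 322.276, 264.732, 12, "TOVLTO+TimesNewRoman", , 54, '
          '"and across the doses the median value was consistently"')
run = parse_textrun(sample)
print(run.font_name, run.x, run.y, run.width, run.height)
```

The csv module is used instead of a plain split(",") so that commas inside the quoted text of a run do not break the parsing.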

Since you are looking to extract content from PDF, you may also want to take a look at the PDFNet SDK (http://www.pdftron.com/pdfnet/).
PDFNet includes all the functionality available in PDF2Text, plus much more (PDF2Text itself is a small utility based on PDFNet). As a starting point, you may want to take a look at the TextExtract sample: 
