Given use of TextExtractor is there any way to determine BASELINE of TextExtractor.Word?

55 views
Skip to first unread message

Lee Gillie

unread,
Feb 3, 2015, 3:08:44 PM2/3/15
to pdfne...@googlegroups.com
PDF spec states quadrilaterals map the glyph, including ascenders and descenders.  I realize quads represent:


I take it TextExtractor.Word.BBox is the most minimal level square including all 4 points above.

I am looking for the baseline. My text is typically not slanted as shown above.

The only thing I can see to do is to open an output page, render a full character set, Subtract Ascent from the top of the BBox of the sample, or add Descent (this is negative) to the bottom of the BBox of the sample.  Perhaps there is an easier way in the context of line and word extraction using TextExtractor?

--

Anatoly Kudrevatukh

unread,
Feb 5, 2015, 2:04:14 PM2/5/15
to pdfne...@googlegroups.com
Unfortunately baseline cannot be accessed through the TextExtractor interface.

However you can do it through ElementReader interface. You might want to take a look at the DumpText function in the following example: www.pdftron.com/pdfnet/samplecode/TextExtractTest.cpp.html

For every e_text element this is how you access the baseline:

   
CharIterator itr = element.GetCharIterator();
   
double baseline = itr.Current().y;

http://www.pdftron.com/pdfnet/PDFNetC/d1/dc3/namespacepdftron_1_1_p_d_f.html#ae8de40dea55c1c19b7360f8d25251048

Let me know if that helps.
Reply all
Reply to author
Forward
0 new messages