TextExtractor is returning each letter as its own Word.

Ryan

Jun 7, 2017, 2:54:29 PM

to PDFTron PDFNet SDK

Question

When I run the following code, each letter of the word comes back as a separate Word. When I copy the same word from Acrobat, it is a single word.

using (PDFDoc doc = new PDFDoc(@"C:\temp\Source.pdf"))
{
    Page page = doc.GetPage(1);
    using (TextExtractor txt = new TextExtractor())
    {
        // Run text extraction on the first page.
        txt.Begin(page);
        // Ask for XML output with each word wrapped in its own <Word> tag.
        String text = txt.GetAsXML(TextExtractor.XMLOutputFlags.e_words_as_elements);
    }
}

Output from the program above:

<Word>h</Word>
<Word>e</Word>
<Word>l</Word>
<Word>l</Word>
<Word>o</Word>
<Word>!</Word>

In the page content stream, the word looks like this:

BT
1 G
1 g
0.66667 0 0 1 60.024 337.01 Tm
/F0 0.96 Tf
[(h)-61(e)-228(l)-215(l)-96(o)-249(!)] TJ
ET

Answer:

This is the expected output when using the TextExtractor.XMLOutputFlags.e_words_as_elements flag.

Given this operator from the page content:

[(h)-61(e)-228(l)-215(l)-96(o)-249(!)] TJ

Each character is its own "element" in this case, because the TJ array holds six separate one-character strings. e_words_as_elements refers to our lower-level Element class, which is returned by ElementReader. TextExtractor uses ElementReader to parse the raw PDF content stream and then assembles the result into a higher-level, human reading order.
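To make the "six strings" point concrete, here is a small standalone sketch that pulls the string operands out of the TJ array quoted above. It does not use PDFNet and is not how the SDK parses content streams. The class name TjArraySketch and the method StringOperands are made up for this illustration, and it only handles simple literal strings in parentheses.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of PDFNet.
public static class TjArraySketch
{
    // Returns the literal string operands of a TJ array, skipping the
    // numeric kerning adjustments between them. Hex strings "<...>" and
    // nested parentheses are out of scope for this sketch.
    public static List<string> StringOperands(string tjArray)
    {
        var result = new List<string>();
        int i = 0;
        while (i < tjArray.Length)
        {
            if (tjArray[i] != '(') { i++; continue; } // numbers, brackets, spaces
            i++; // step past '('
            var sb = new StringBuilder();
            while (i < tjArray.Length && tjArray[i] != ')')
            {
                // Simplification: keep the character after a backslash as-is,
                // so an escaped ")" does not end the string.
                if (tjArray[i] == '\\' && i + 1 < tjArray.Length) i++;
                sb.Append(tjArray[i]);
                i++;
            }
            i++; // step past ')'
            result.Add(sb.ToString());
        }
        return result;
    }

    public static void Main()
    {
        var parts = StringOperands("[(h)-61(e)-228(l)-215(l)-96(o)-249(!)]");
        Console.WriteLine(parts.Count);             // prints 6
        Console.WriteLine(string.Join("|", parts)); // prints h|e|l|l|o|!
    }
}
```

The numbers between the strings (-61, -228, and so on) are glyph position adjustments expressed in thousandths of a text space unit. They change the spacing between glyphs, but each parenthesized string is still a separate operand in the array.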

If you run this document through the ElementReader sample test, you will see each character reported as its own text element.
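For anyone who wants to check this without the full sample, here is a minimal sketch that prints both views of the same page: the low-level text elements from ElementReader, and the words TextExtractor builds on top of them. It assumes the same C:\temp\Source.pdf path as the question and a PDFNet version where PDFNet.Initialize() takes no arguments (newer releases may expect a license key). The class name ElementsVersusWords is made up. What it prints depends on the document, so no output is shown here.

```csharp
using System;
using pdftron;
using pdftron.PDF;

// Hypothetical class name; the PDFNet calls follow the SDK's
// ElementReader and TextExtract samples.
class ElementsVersusWords
{
    static void Main()
    {
        // Assumption: parameterless Initialize, as in SDK versions of that time.
        PDFNet.Initialize();

        using (PDFDoc doc = new PDFDoc(@"C:\temp\Source.pdf"))
        {
            doc.InitSecurityHandler();
            Page page = doc.GetPage(1);

            // Low-level view: one line per text Element in the content stream.
            // This sketch does not descend into form XObjects.
            using (ElementReader reader = new ElementReader())
            {
                reader.Begin(page);
                Element element;
                while ((element = reader.Next()) != null)
                {
                    if (element.GetType() == Element.Type.e_text)
                    {
                        Console.WriteLine("Element: " + element.GetTextString());
                    }
                }
                reader.End();
            }

            // High-level view: the words TextExtractor assembles from those elements.
            using (TextExtractor txt = new TextExtractor())
            {
                txt.Begin(page);
                for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid(); line = line.GetNextLine())
                {
                    for (TextExtractor.Word word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
                    {
                        Console.WriteLine("Word: " + word.GetString());
                    }
                }
            }
        }
    }
}
```

Comparing the "Element:" lines against the "Word:" lines shows where the grouping happens, which is the distinction the answer is drawing.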
