TextExtractor is returning each letter as its own Word.

Ryan

Jun 7, 2017, 2:54:29 PM
to PDFTron PDFNet SDK
Question:

When I run the following code, TextExtractor returns each letter of the word as a separate Word. When I copy the same word from Acrobat, it comes out as a single word.

using (PDFDoc doc = new PDFDoc(@"C:\temp\Source.pdf"))
{
    Page page = doc.GetPage(1);
    using (TextExtractor txt = new TextExtractor())
    {
        txt.Begin(page);
        String text = txt.GetAsXML(TextExtractor.XMLOutputFlags.e_words_as_elements);
    }
}
 
Output from the program above:

<Word>h</Word>
<Word>e</Word>
<Word>l</Word>
<Word>l</Word>
<Word>o</Word>
<Word>!</Word>

In the page content stream, the text looks like this:
BT
1 G
1 g
0.66667 0 0 1 60.024 337.01 Tm
/F0 0.96 Tf
[(h)-61(e)-228(l)-215(l)-96(o)-249(!)] TJ
ET

Answer:

This is the correct output when using the TextExtractor.XMLOutputFlags.e_words_as_elements flag.

Given
[(h)-61(e)-228(l)-215(l)-96(o)-249(!)] TJ

Each character in this TJ array is its own "element". The e_words_as_elements flag refers to our lower-level Element class, which is returned by ElementReader. TextExtractor uses ElementReader to parse the raw PDF content stream and then arranges the result into a higher-level, human reading order.

If you run this document through the ElementReader sample test, you will see what I mean.
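
For reference, here is a minimal sketch of that kind of check. It assumes the standard PDFNet C# API (ElementReader.Begin/Next/End, Element.GetType, Element.GetTextString) and reuses the file path from the question. The class name ListTextElements is only illustrative, and this is not the shipped sample itself.

```csharp
using System;
using pdftron;
using pdftron.PDF;

class ListTextElements
{
    static void Main()
    {
        // Assumption: some SDK versions expect a license key argument here.
        PDFNet.Initialize();

        using (PDFDoc doc = new PDFDoc(@"C:\temp\Source.pdf"))
        using (ElementReader reader = new ElementReader())
        {
            Page page = doc.GetPage(1);
            reader.Begin(page);

            // Walk the page's content stream one Element at a time.
            Element element;
            while ((element = reader.Next()) != null)
            {
                // Print only text elements; paths, images, etc. are skipped.
                if (element.GetType() == Element.Type.e_text)
                {
                    Console.WriteLine("text element: " + element.GetTextString());
                }
            }

            reader.End();
        }
    }
}
```

Comparing the printed text elements with the <Word> entries in the XML output should show whether the two line up for this document.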

