Question:
When I extract text with the following code, a single word is split into separate one-character words. When I copy the same word from Acrobat, it comes out as a single word.
using (PDFDoc doc = new PDFDoc(@"C:\temp\Source.pdf"))
{
    Page page = doc.GetPage(1);
    using (TextExtractor txt = new TextExtractor())
    {
        txt.Begin(page);
        String text = txt.GetAsXML(TextExtractor.XMLOutputFlags.e_words_as_elements);
    }
}
Output from the program above:
<Word>h</Word>
<Word>e</Word>
<Word>l</Word>
<Word>l</Word>
<Word>o</Word>
<Word>!</Word>
In the page content stream, the word looks like this:
BT
1 G
1 g
0.66667 0 0 1 60.024 337.01 Tm
/F0 0.96 Tf
[(h)-61(e)-228(l)-215(l)-96(o)-249(!)] TJ
ET
Answer:
This is the correct output when using the TextExtractor.XMLOutputFlags.e_words_as_elements flag.
Given the following text-showing operator:
[(h)-61(e)-228(l)-215(l)-96(o)-249(!)] TJ
Each character is its own "element" in this case: the TJ array holds six one-character strings, separated by numeric position adjustments. e_words_as_elements refers to our lower-level Element class, which is returned by ElementReader. TextExtractor uses ElementReader to parse the raw PDF content stream and then builds a higher-level, human reading order from those elements.
If you run this document through the ElementReader sample test, you will see how the text is broken into elements.
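If you would rather not run the full sample, a minimal sketch along the following lines should list the low-level text elements on the page. It assumes the same C:\temp\Source.pdf path as the question, and that PDFNet.Initialize() can be called without arguments (newer SDK versions may require a license key). The class name ListTextElements and the bracketed output format are illustrative only. It does not descend into form XObjects, which the full sample handles.

```csharp
using System;
using pdftron;
using pdftron.PDF;

class ListTextElements
{
    static void Main()
    {
        // Assumption: no-argument initialization; newer SDK versions may need a license key here.
        PDFNet.Initialize();

        using (PDFDoc doc = new PDFDoc(@"C:\temp\Source.pdf"))
        using (ElementReader reader = new ElementReader())
        {
            doc.InitSecurityHandler();
            Page page = doc.GetPage(1);

            // Walk the raw content stream of the page, one Element at a time.
            reader.Begin(page);
            Element element;
            while ((element = reader.Next()) != null)
            {
                // Only text elements are of interest; paths, images, etc. are skipped.
                if (element.GetType() == Element.Type.e_text)
                {
                    // Brackets make it easy to see where one element ends and the next begins.
                    Console.WriteLine("[" + element.GetTextString() + "]");
                }
            }
            reader.End();
        }
    }
}
```

Comparing the bracketed strings with the <Word> entries in the XML output should show whether the two line up one-to-one for this page.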