I'm running into a strange issue with the getAsXML method. If I call the method in the exact same manner twice on the same PDF, I get slightly different results. Has anyone seen this before or can anyone guess why I might be seeing this behavior?
I'm using the Java PDFNet implementation. Here's the relevant code:
import java.io.*;
import pdftron.Common.PDFNetException;
import pdftron.PDF.*;

public class VSMExtractText {
    public static void main(String[] args) {
        PDFNet.initialize();
        String input_path = args[0];
        String output_pre = args[1];
        String output_post = ".xml";
        try {
            PDFDoc doc = new PDFDoc(input_path);
            doc.initSecurityHandler();
            int page_num = doc.getPageCount();
            for (int i=1; i<=page_num; ++i) {
                try {
                    File file = new File(output_pre + String.format("%04d", i) + output_post);
                    Writer output = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "UTF8"));
                    if (!file.exists()) {
                        file.createNewFile();
                    }
                    Page page = doc.getPage(i);
                    TextExtractor txt = new TextExtractor();
                    txt.begin(page, null, 0);
                    String text = txt.getAsXML(7);
                    String utf8text = text.replace("utf-16","utf-8");
                    output.write(utf8text);
                    output.flush();
                    output.close();
                    txt.destroy();
                }
                catch (IOException e) {
                    System.out.println(e);
                }
            }
            doc.close();
        }
        catch (PDFNetException e) {
            System.out.println(e);
        }
        PDFNet.terminate();
    }
}
I'm running this via the command line on a Mac with (for example) this call:
java -Djava.library.path=bin/libs -classpath .:bin/libs/PDFNet.jar:bin VSMExtractText pdfnet-test.pdf pdfnettest1/pdfnet-test_
And here is an example diff between two output files, running the same method on the same PDF twice in a row. Note that in one case it picked up the text as bold, and in the other it did not:
diff pdfnettest1/pdfnet-test_0010.xml pdfnettest2/pdfnet-test_0010.xml
63c63
< <Word box="268.84, 663.025, 8.10467, 9.43">(a</Word>
---
> <Word box="268.84, 663.025, 8.10467, 9.43" style="font-family:HelveticaNeueLTStd-Bd; font-size:10.25; color: #231F20;">(a</Word>
65c65
< <Word box="291.621, 663.025, 8.50853, 9.43">b)</Word>
---
> <Word box="291.621, 663.025, 8.50853, 9.43" style="font-family:HelveticaNeueLTStd-Bd; font-size:10.25; color: #231F20;">b)</Word>
Thanks for any suggestions!
Nick
For reference, the full line from the run that picked up the bold style:
<Line box="161.499, 662.79, 202.13, 10.25" style="font-family:HelveticaNeueLTStd-Lt; font-size:10.25; color: #231F20;">
  <Word box="161.499, 664.829, 23.1363, 7.47225">circle</Word>
  <Word box="187.485, 664.829, 13.8682, 7.47225">the</Word>
  <Word box="204.202, 664.829, 19.3991, 7.47225">best</Word>
  <Word box="226.451, 664.829, 39.5394, 7.47225">definition</Word>
  <Word box="268.84, 663.025, 8.10467, 9.43" style="font-family:HelveticaNeueLTStd-Bd; font-size:10.25; color: #231F20;">(a</Word>
  <Word box="279.793, 664.829, 8.94518, 5.5965">or</Word>
  <Word box="291.621, 663.025, 8.50853, 9.43" style="font-family:HelveticaNeueLTStd-Bd; font-size:10.25; color: #231F20;">b)</Word>
  <Word box="302.98, 664.829, 8.2615, 7.47225">of</Word>
  <Word box="314.091, 664.829, 21.6685, 7.47225">each</Word>
  <Word box="338.609, 664.829, 25.0202, 7.47225">word.</Word>
</Line>
The same line from the run that did not pick up the bold style:
<Line box="161.499, 662.79, 202.13, 10.25" style="font-family:HelveticaNeueLTStd-Lt; font-size:10.25; color: #231F20;">
  <Word box="161.499, 664.829, 23.1363, 7.47225">circle</Word>
  <Word box="187.485, 664.829, 13.8682, 7.47225">the</Word>
  <Word box="204.202, 664.829, 19.3991, 7.47225">best</Word>
  <Word box="226.451, 664.829, 39.5394, 7.47225">definition</Word>
  <Word box="268.84, 663.025, 8.10467, 9.43">(a</Word>
  <Word box="279.793, 664.829, 8.94518, 5.5965">or</Word>
  <Word box="291.621, 663.025, 8.50853, 9.43">b)</Word>
  <Word box="302.98, 664.829, 8.2615, 7.47225">of</Word>
  <Word box="314.091, 664.829, 21.6685, 7.47225">each</Word>
  <Word box="338.609, 664.829, 25.0202, 7.47225">word.</Word>
</Line>
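
To see how often the style attribute flips between two runs, rather than eyeballing a diff page by page, something like the following might help. It is only a sketch using the standard Java XML parser, not anything from PDFNet. The class name WordStyleDiff and its methods are made up for illustration, and it assumes each Word element on a page has a unique box attribute, which holds for the lines shown above but may not hold in general.

```java
import java.io.File;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: compares the style attribute of <Word> elements
// between two XML outputs of the same page.
class WordStyleDiff {

    // Maps each <Word> box attribute to its style attribute.
    // Element.getAttribute returns "" when the attribute is absent.
    static Map<String, String> stylesByBox(Document doc) {
        Map<String, String> styles = new LinkedHashMap<>();
        NodeList words = doc.getElementsByTagName("Word");
        for (int i = 0; i < words.getLength(); i++) {
            Element word = (Element) words.item(i);
            styles.put(word.getAttribute("box"), word.getAttribute("style"));
        }
        return styles;
    }

    // Returns the boxes present in both documents whose style differs.
    // Assumption: box values are unique within a page.
    static List<String> differingBoxes(Document first, Document second) {
        Map<String, String> a = stylesByBox(first);
        Map<String, String> b = stylesByBox(second);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : a.entrySet()) {
            String other = b.get(entry.getKey());
            if (other != null && !other.equals(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Parses an XML string; convenient for quick checks on small snippets.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Usage: java WordStyleDiff first.xml second.xml
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document first = builder.parse(new File(args[0]));
        Document second = builder.parse(new File(args[1]));
        for (String box : differingBoxes(first, second)) {
            System.out.println("style differs for Word at box " + box);
        }
    }
}
```

Run against pdfnettest1/pdfnet-test_0010.xml and pdfnettest2/pdfnet-test_0010.xml, this should list the boxes of the words whose style appears in only one of the two files. Looping it over every page would show whether the difference is limited to a particular font.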