Output discrepancies using TextExtractor.getAsXML method

Nick Brown

Jul 12, 2013, 12:02:04 PM
to pdfne...@googlegroups.com
Hi All,

I'm running into a strange issue with the getAsXML method. If I call the method in the exact same manner twice on the same PDF, I get slightly different results. Has anyone seen this before or can anyone guess why I might be seeing this behavior?

I'm using the Java PDFNet implementation. Here's the relevant code:

import java.io.*;
import java.nio.charset.StandardCharsets;
import pdftron.Common.PDFNetException;
import pdftron.PDF.*;

public class VSMExtractText {
    public static void main(String[] args) {
        PDFNet.initialize();
        String input_path = args[0];
        String output_pre = args[1];
        String output_post = ".xml";
        try {
            PDFDoc doc = new PDFDoc(input_path);
            doc.initSecurityHandler();
            int page_num = doc.getPageCount();
            for (int i = 1; i <= page_num; ++i) {
                // One XML file per page, named <output_pre>NNNN.xml
                File file = new File(output_pre + String.format("%04d", i) + output_post);
                // FileOutputStream creates the file if needed; try-with-resources closes the writer
                try (Writer output = new BufferedWriter(new OutputStreamWriter(
                        new FileOutputStream(file), StandardCharsets.UTF_8))) {
                    Page page = doc.getPage(i);
                    TextExtractor txt = new TextExtractor();
                    txt.begin(page, null, 0);
                    String text = txt.getAsXML(7);
                    // Rewrite the encoding declaration to match the UTF-8 output file
                    String utf8text = text.replace("utf-16", "utf-8");
                    output.write(utf8text);
                    txt.destroy();
                } catch (IOException e) {
                    System.out.println(e);
                }
            }
            doc.close();
        } catch (PDFNetException e) {
            System.out.println(e);
        }
        PDFNet.terminate();
    }
}

I'm running this via the command line on a Mac with (for example) this call:
java -Djava.library.path=bin/libs -classpath .:bin/libs/PDFNet.jar:bin VSMExtractText pdfnet-test.pdf pdfnettest1/pdfnet-test_

And here is an example diff between two output files, running the same method on the same PDF twice in a row. Note that in one case it picked up the text as bold, and in the other it did not:

diff pdfnettest1/pdfnet-test_0010.xml pdfnettest2/pdfnet-test_0010.xml
63c63
< <Word box="268.84, 663.025, 8.10467, 9.43">(a</Word>
---
> <Word box="268.84, 663.025, 8.10467, 9.43" style="font-family:HelveticaNeueLTStd-Bd; font-size:10.25; color: #231F20;">(a</Word>
65c65
< <Word box="291.621, 663.025, 8.50853, 9.43">b)</Word>
---
> <Word box="291.621, 663.025, 8.50853, 9.43" style="font-family:HelveticaNeueLTStd-Bd; font-size:10.25; color: #231F20;">b)</Word>

Thanks for any suggestions!
Nick

Anatoly Kudrevatukh

Jul 15, 2013, 7:22:48 PM
to pdfne...@googlegroups.com
Hi Nick,
That can happen if a line has an equal number of elements with different styles. In that case the dominant style of the line can alternate between runs.
For example, in Run 1 the line's style is the same as the first word's style (ArialMT), so that word doesn't get a style attribute. In Run 2 the line's style is the same as the second word's style (Arial-BoldMT), so the second word is the one without a style attribute.
Run 1:
<Line box="72, 707.264, 81.2619, 10.2887" style="font-family:ArialMT; font-size:11.25; color: #000000;">
<Word box="72, 707.264, 53.1502, 10.2887">Thompson</Word>
<Word box="128.277, 707.264, 24.9845, 10.2887" style="font-family:Arial-BoldMT; font-size:11.25; sans-serif; color: #000000;">third</Word>
</Line>
Run 2:
<Line box="72, 707.264, 81.2619, 10.2887" style="font-family:Arial-BoldMT; font-size:11.25; sans-serif; color: #000000;">
<Word box="72, 707.264, 53.1502, 10.2887" style="font-family:ArialMT; font-size:11.25; color: #000000;">Thompson</Word>
<Word box="128.277, 707.264, 24.9845, 10.2887">third</Word>
</Line>

So technically both outputs are valid.
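
To illustrate the tie described above, here is a minimal, self-contained sketch of one way to pick a dominant style deterministically. It is not PDFNet's implementation: the class name DominantStyle, the method pick, and the "first style seen wins a tie" rule are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFNet: picks the most frequent style name.
final class DominantStyle {

    /** Returns the most frequent style; on a tie, the style that appears first wins. */
    static String pick(List<String> styles) {
        // LinkedHashMap iterates in insertion order, so the result is the same on every run.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String style : styles) {
            counts.merge(style, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Strict '>' keeps the earlier style when two counts are equal.
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The two-word line from the example: one word per style, so the counts tie.
        System.out.println(pick(List.of("ArialMT", "Arial-BoldMT"))); // prints "ArialMT"
    }
}
```

With a rule like this, the two-word line would always take the first word's style as the line style, so repeated runs would produce identical XML.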
Which version of our product are you using (the name of the .zip you downloaded)? I have changed the behavior to be deterministic and can provide you with a patched v6 .dll.

Anatoly.

Nick Brown

Jul 16, 2013, 2:16:33 PM
to pdfne...@googlegroups.com
Hi Anatoly,

Thanks a lot for the info.  I believe I've got v.6.0.0 - I downloaded the trial for Mac OS X on July 2, filename PDFNetCMac.zip.

The issue I'm having is that I do indeed lose some information - it isn't just that the dominant style switches. Here are the full line elements for the two runs of the test I mentioned before:

Test 1:

<Line box="161.499, 662.79, 202.13, 10.25" style="font-family:HelveticaNeueLTStd-Lt; font-size:10.25; color: #231F20;">
<Word box="161.499, 664.829, 23.1363, 7.47225">circle</Word>
<Word box="187.485, 664.829, 13.8682, 7.47225">the</Word>
<Word box="204.202, 664.829, 19.3991, 7.47225">best</Word>
<Word box="226.451, 664.829, 39.5394, 7.47225">definition</Word>
<Word box="268.84, 663.025, 8.10467, 9.43" style="font-family:HelveticaNeueLTStd-Bd; font-size:10.25; color: #231F20;">(a</Word>
<Word box="279.793, 664.829, 8.94518, 5.5965">or</Word>
<Word box="291.621, 663.025, 8.50853, 9.43" style="font-family:HelveticaNeueLTStd-Bd; font-size:10.25; color: #231F20;">b)</Word>
<Word box="302.98, 664.829, 8.2615, 7.47225">of</Word>
<Word box="314.091, 664.829, 21.6685, 7.47225">each</Word>
<Word box="338.609, 664.829, 25.0202, 7.47225">word.</Word>
</Line>



Test 2:

<Line box="161.499, 662.79, 202.13, 10.25" style="font-family:HelveticaNeueLTStd-Lt; font-size:10.25; color: #231F20;">
<Word box="161.499, 664.829, 23.1363, 7.47225">circle</Word>
<Word box="187.485, 664.829, 13.8682, 7.47225">the</Word>
<Word box="204.202, 664.829, 19.3991, 7.47225">best</Word>
<Word box="226.451, 664.829, 39.5394, 7.47225">definition</Word>
<Word box="268.84, 663.025, 8.10467, 9.43">(a</Word>
<Word box="279.793, 664.829, 8.94518, 5.5965">or</Word>
<Word box="291.621, 663.025, 8.50853, 9.43">b)</Word>
<Word box="302.98, 664.829, 8.2615, 7.47225">of</Word>
<Word box="314.091, 664.829, 21.6685, 7.47225">each</Word>
<Word box="338.609, 664.829, 25.0202, 7.47225">word.</Word>
</Line>


To be clear, in the majority of cases, I am seeing the behavior you mentioned--the dominant style switches back and forth, but the children word elements have the correct styles applied individually.  But in some cases, as above, style information is just lost.  Hopefully your patch fixes the issue!

Thanks,
Nick

Anatoly Kudrevatukh

Jul 22, 2013, 6:44:26 PM
to pdfne...@googlegroups.com
Hi Nick,

In your test case you came across a word in which two different styles (bold and not bold, in your case) each cover an equal number of characters. When determining the dominant style of such a word, TextExtractor would "randomly" prefer one style over the other. I have changed that behavior to be deterministic, so you will get consistent output, and this change will be part of future releases.

Please note that if you need more accurate styling information, you can call the GetCharStyle() method of the Word class.

Anatoly.

Nick Brown

Jul 28, 2013, 1:38:12 PM
to pdfne...@googlegroups.com
Thanks Anatoly - I just tested out the patched .dll you sent over and it is working as expected, with no more output discrepancies.

That said, I am going to need to get accurate character data for those words that have an equal number of characters with different styles.

Is there any way for me to extend the getAsXML method to add character style information for those special cases?  From looking through the source, it seems like I have to use it as-is.  I know that I could use that GetCharStyle() method if I was already iterating over the words, but if possible I'm hoping to still let getAsXML handle that looping (as well as printing nice XML output, properly chunking words into lines/flows, etc.).

-Nick

Support

Jul 29, 2013, 1:08:32 PM
to pdfne...@googlegroups.com
 
 
Hi Nick,

You may need to implement your own XML export function, since the built-in getAsXML() is not very customizable.
 
Code example #4 in TextExtract sample project (http://www.pdftron.com/pdfnet/samplecode.html#TextExtract) is very similar to GetAsXML() and will help you get started.
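
As a rough illustration of what a custom export could add for mixed-style words, here is a minimal, self-contained sketch that splits one word into runs of consecutive characters sharing a style. It does not call PDFNet: the per-character styles are passed in as plain strings (in a real exporter they would come from something like GetCharStyle()), and the class CharStyleRuns, the method wordToXml, and the <Run> element are hypothetical names that getAsXML does not produce.

```java
import java.util.List;

// Hypothetical helper, not part of PDFNet: emits per-character style runs for one word.
final class CharStyleRuns {

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /**
     * Builds a <Word> element containing one <Run> per stretch of characters
     * with the same style. charStyles.get(i) is the style of text.charAt(i).
     */
    static String wordToXml(String text, List<String> charStyles) {
        if (text.length() != charStyles.size()) {
            throw new IllegalArgumentException("need exactly one style per character");
        }
        StringBuilder xml = new StringBuilder("<Word>");
        int start = 0; // index where the current run begins
        for (int i = 1; i <= text.length(); i++) {
            boolean endOfRun = i == text.length()
                    || !charStyles.get(i).equals(charStyles.get(start));
            if (endOfRun) {
                xml.append("<Run style=\"").append(escape(charStyles.get(start))).append("\">")
                   .append(escape(text.substring(start, i)))
                   .append("</Run>");
                start = i;
            }
        }
        return xml.append("</Word>").toString();
    }

    public static void main(String[] args) {
        // Which character of "(a" is bold is a guess made for this example.
        System.out.println(wordToXml("(a",
                List.of("HelveticaNeueLTStd-Lt", "HelveticaNeueLTStd-Bd")));
    }
}
```

Emitting runs only for words whose characters differ in style, and a plain style attribute otherwise, would keep the output close to the getAsXML layout while preserving the information that a single dominant style loses.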

Nick Brown

Jul 29, 2013, 11:58:33 PM
to pdfne...@googlegroups.com
Thanks - I'll take a look at using that code sample as a template, it doesn't seem too far off.