Q: We need to use this tool to convert PDF to XML (text) format in
order to get the coordinates of the text, word by word. We need an
XML result of the following type:
<page height="792" width="612">
<text>
<text height="16" width="181" x="222" y="402">Hello</text>
<text height="13" width="380.7" x="115.7" y="104.7">Hello</text>
</text>
</page>
Can we use the PDFtoText command-line tool (PDFTron) for this purpose?
------
A:
You can use PDF to Text (PDF2Text -
http://www.pdftron.com/downloads.html#PTCMD)
to convert PDF to XML.
You may want to try the following options when running the converter:
pdf2text -o test_out -f xml --output_bbox my.pdf
To output only the first page use:
pdf2text -o test_out -a 1 -f xml --output_bbox my.pdf
To output positioning information for each word, use the
"--xml_words_as_elements" option. For example:
pdf2text -o test_out -f xml --output_bbox my.pdf --xml_words_as_elements
To include text style information (such as font, font size, color,
etc.), add the "--xml_output_styles" option. For example:
pdf2text -o test_out -f xml --output_bbox my.pdf --xml_words_as_elements --xml_output_styles
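
Once you have the XML, reading the word coordinates back is a small parsing job. The sketch below is only an illustration in Python: it assumes the XML has the shape shown in the question (nested <text> elements with x, y, width and height attributes). The element and attribute names that pdf2text actually emits may differ, so check a real output file and adjust the names. SAMPLE and word_boxes are hypothetical names used only for this example.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the question. This layout is an assumption;
# the XML actually written by pdf2text may use other element/attribute names.
SAMPLE = """<page height="792" width="612">
<text>
<text height="16" width="181" x="222" y="402">Hello</text>
<text height="13" width="380.7" x="115.7" y="104.7">Hello</text>
</text>
</page>"""


def word_boxes(xml_string):
    """Return (word, x, y, width, height) for each <text> element that has coordinates."""
    root = ET.fromstring(xml_string)
    boxes = []
    for elem in root.iter("text"):
        # The outer <text> element is only a container and has no coordinates.
        if "x" not in elem.attrib:
            continue
        boxes.append((
            (elem.text or "").strip(),
            float(elem.get("x")),
            float(elem.get("y")),
            float(elem.get("width")),
            float(elem.get("height")),
        ))
    return boxes


for box in word_boxes(SAMPLE):
    print(box)
```

For a file on disk, ET.parse(path).getroot() can replace ET.fromstring; the rest stays the same.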