Converting PDF to XML with positioning information


Support

May 14, 2008, 3:12:56 PM
to PDF2Text
Q: We need to use this tool to convert PDF to XML (text) format so that
we can get the coordinates of the text, word by word. We need the tool
to produce XML of the following type.

XML Type :

<page height="792" width="612">
<text>
<text height="16" width="181" x="222" y="402">Hello</text>
<text height="13" width="380.7" x="115.7" y="104.7">Hello</text>
</text>
</page>

Can we use the PDFtoText command-line tool (PDFTron) for this purpose?

------
A:

You can use PDF to Text (PDF2Text - http://www.pdftron.com/downloads.html#PTCMD)
to convert PDF to XML.

You may want to try the following options when running the converter:

pdf2text -o test_out -f xml --output_bbox my.pdf

To output only the first page use:

pdf2text -o test_out -a 1 -f xml --output_bbox my.pdf

To output positioning information for each word, use the
"--xml_words_as_elements" option. For example:

pdf2text -o test_out -f xml --output_bbox my.pdf --xml_words_as_elements
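
Once you have word-level XML, reading the coordinates back is straightforward. The following is only a rough Python sketch that parses the sample format from the question; the element and attribute names in the actual pdf2text output may differ, so treat the `x`, `y`, `width` and `height` attributes and the `word_boxes` helper as assumptions to adjust against your real output.

```python
import xml.etree.ElementTree as ET

# Sample taken from the question above; real converter output may use
# different element or attribute names (assumption).
SAMPLE = """<page height="792" width="612">
<text>
<text height="16" width="181" x="222" y="402">Hello</text>
<text height="13" width="380.7" x="115.7" y="104.7">Hello</text>
</text>
</page>"""


def word_boxes(xml_string):
    """Return (text, x, y, width, height) for each element carrying x and y."""
    root = ET.fromstring(xml_string)
    boxes = []
    for elem in root.iter():
        # Only elements with position attributes are treated as words.
        if "x" in elem.attrib and "y" in elem.attrib:
            boxes.append((
                (elem.text or "").strip(),
                float(elem.get("x")),
                float(elem.get("y")),
                float(elem.get("width", "nan")),
                float(elem.get("height", "nan")),
            ))
    return boxes


if __name__ == "__main__":
    for box in word_boxes(SAMPLE):
        print(box)
```

Container elements without `x` and `y` (the page and the outer text element in the sample) are skipped, so only positioned words are returned.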

To include text style information (such as font, font size, color,
etc.), add the "--xml_output_styles" option. For example:

pdf2text -o test_out -f xml --output_bbox my.pdf --xml_words_as_elements --xml_output_styles
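
If you want to drive the converter from a script, the command lines above can be assembled programmatically. This is a minimal Python sketch, assuming the `pdf2text` executable is on your PATH; the helper names `build_pdf2text_command` and `run_pdf2text` are made up for illustration, and only the options shown in this thread are used.

```python
import subprocess


def build_pdf2text_command(pdf_path, out_name, first_page_only=False,
                           words=False, styles=False):
    """Assemble a pdf2text argument list using only the options shown above."""
    cmd = ["pdf2text", "-o", out_name]
    if first_page_only:
        cmd += ["-a", "1"]  # first page only, as in the example above
    cmd += ["-f", "xml", "--output_bbox", pdf_path]
    if words:
        cmd.append("--xml_words_as_elements")  # one element per word
    if styles:
        cmd.append("--xml_output_styles")  # font, font size, color, etc.
    return cmd


def run_pdf2text(pdf_path, out_name, **options):
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_pdf2text_command(pdf_path, out_name, **options),
                   check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this works without the tool.
    print(" ".join(build_pdf2text_command("my.pdf", "test_out",
                                          words=True, styles=True)))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.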
