Extract Text Coordinates from PDF

sebc...@gmail.com

Oct 11, 2005, 11:26:55 AM
Hi,
I was wondering if anyone could recommend a program which can extract
the starting (top left) coordinates (x,y) of each word in a PDF file
(and the end coordinates, if possible). Ideally output would be in a format that
could be easily inserted into a database.

larry...@nospamjbmsystems.com

Oct 11, 2005, 11:48:30 AM
Hi,

We built something like that here for an internal parsing requirement but never
made it a commercial product. Bringing it up to a marketable product would take
additional funding. For a one-time job, it would not be worth the
cost. As an OEM or volume product, of course, the picture changes. BTW, our
output was designed to place the extracted information on an OctoTools
template, which is somewhat XML-like. From there we could output CSV or a
custom format if required. Call me if you are looking for a more commercial
solution.

Larry T. (978) 535-7676 US-Boston, MA

JB

Oct 11, 2005, 6:07:31 PM

pdw.exe, part of PDF Command Line Tools
http://www.pdf-tools.com/asp/products.asp?name=CLE

sample output using the -w option:
231.9 663.0 12.0 50.4 0 Cour: permits
295.7 663.0 12.0 21.6 0 Cour: the
330.6 663.0 12.0 28.8 0 Cour: text
372.8 663.0 12.0 72.0 0 Cour: extraction
458.2 663.0 12.0 28.8 0 Cour: from
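
Since the original question asked for something that drops easily into a database, here is a rough Python sketch that loads lines like the sample above into SQLite. The column meanings are a guess from the sample only (x, y, font size, word width, an unidentified flag, font name, then the word), not taken from the pdw documentation; the table name and function names are made up for illustration.

```python
import sqlite3

# The five sample lines quoted above, verbatim.
SAMPLE = """\
231.9 663.0 12.0 50.4 0 Cour: permits
295.7 663.0 12.0 21.6 0 Cour: the
330.6 663.0 12.0 28.8 0 Cour: text
372.8 663.0 12.0 72.0 0 Cour: extraction
458.2 663.0 12.0 28.8 0 Cour: from
"""


def parse_line(line):
    """Split one output line into typed fields.

    ASSUMED layout (inferred from the sample, not from any manual):
        x  y  font_size  width  flag  font: word
    """
    head, word = line.split(": ", 1)
    # Limit the split so a font name containing spaces stays in one piece.
    x, y, size, width, flag, font = head.split(None, 5)
    return (float(x), float(y), float(size), float(width),
            int(flag), font, word.strip())


def load_words(text, conn):
    """Parse every non-empty line and insert it into a 'words' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "x REAL, y REAL, font_size REAL, width REAL, "
        "flag INTEGER, font TEXT, word TEXT)"
    )
    rows = [parse_line(line) for line in text.splitlines() if line.strip()]
    conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return rows


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    for row in load_words(SAMPLE, conn):
        print(row)
```

If the fourth column really is the word width, then x + width should give the right-hand edge of each word, which would cover the "end" coordinate asked about. Worth checking against a known page before relying on it.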

fhtino

Oct 12, 2005, 3:54:48 AM

Eric

Oct 12, 2005, 5:33:26 AM
fhtino wrote:

> PDFLib TET : http://www.pdflib.com/products/tet/index.html

Or write your own PS header library to hook the show command.

Eric

Ralf Koenig

Oct 25, 2005, 6:16:08 PM

pdftohtml has an "-xml" mode, which does stuff like that.

http://pdftohtml.sourceforge.net/
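
For anyone who wants to post-process that XML, a minimal Python sketch follows. It assumes the output consists of `page` elements containing `text` elements with `top`, `left`, `width` and `height` attributes; the sample document below is hand-written for illustration, not real pdftohtml output, so check the attribute names against what your version actually emits. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the ASSUMED shape of the XML output.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="109" left="149" width="620" height="20" font="0">permits the text extraction from</text>
<text top="140" left="149" width="310" height="20" font="0">a <b>PDF</b> file</text>
</page>
</pdf2xml>"""


def text_boxes(xml_text):
    """Return one dict per <text> element with its page and bounding box."""
    root = ET.fromstring(xml_text)
    boxes = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for node in page.iter("text"):
            boxes.append({
                "page": page_no,
                "left": float(node.get("left")),
                "top": float(node.get("top")),
                "width": float(node.get("width")),
                "height": float(node.get("height")),
                # itertext() also picks up text inside nested tags like <b>.
                "text": "".join(node.itertext()),
            })
    return boxes


if __name__ == "__main__":
    for box in text_boxes(SAMPLE_XML):
        print(box)
```

Note that each box here covers a whole run of text, not a single word, so per-word positions would still have to be estimated from the run's box.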

Ralf

Don Lancaster

Oct 28, 2005, 2:21:07 PM


http://www.tinaja.com/glib/extract1.pdf

and similar tools at http://www.tinaja.com/gurgrm01.asp


--
Many thanks,

Don Lancaster voice phone: (928)428-4073
Synergetics 3860 West First Street Box 809 Thatcher, AZ 85552
rss: http://www.tinaja.com/whtnu.xml email: d...@tinaja.com

Please visit my GURU's LAIR web site at http://www.tinaja.com
