PDF to text converter that preserves table lines as "|" like veryPDF

kpdf3

Sep 16, 2006, 6:52:49 AM
I have tried several PDF to plain text converters, including the
'save as' feature of Adobe's own PDF Writer Professional. However, it
does not convert the file the way I need.

I need to convert a document that is only available as a PDF into plain
text, which I then parse to load the information into a database.

The problem is that PDF Writer converts tables to HTML and Word or RTF
as graphics, and its two types of conversion to a text file don't keep
the information from a table row on the same line.

The closest to what I want is VeryPDF's PDF2text utility. When it finds
a table, it converts the cell lines to "|", keeping the information in
the cells intact and on the appropriate line. It displaces the "|"
somewhat, but that doesn't interfere with parsing.

Is there a ---freeware-- program (VeryPDF is not freeware) that
converts to text in this way? I have tried around 15, and none converts
the lines with tables correctly.

Thanks

fhtino

Sep 18, 2006, 2:25:58 AM
http://www.foolabs.com/xpdf/

Xpdf - PdfToText
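
For anyone scripting this, here is a minimal Python sketch of calling
the tool. It assumes the pdftotext binary from Xpdf is installed and on
the PATH. The -layout option is not mentioned in this thread; it is
included here on the assumption that keeping the physical layout is what
keeps the cells of a table row on one line. The function names and file
names are only illustrative.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for Xpdf's pdftotext command-line tool."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to maintain the original physical layout
        # instead of emitting the text in reading order.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
```

Calling convert("report.pdf", "report.txt") would then write the text
file next to the PDF, which is equivalent to running
"pdftotext -layout report.pdf report.txt" by hand.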


Fabrizio


kpdf3 wrote:
> I have been trying several PDF to plain text converters including
> Adobe's own PDF writer professional 'save as' feature. However it does
> not convert the file as I wish.
>

> .............

Hans-Werner Hilse

Sep 18, 2006, 8:29:02 AM
Hi,

On 16 Sep 2006 03:52:49 -0700 "kpdf3" <miledep...@yahoo.com> wrote:

> The closest to what I want is done by VeryPDF's PDF2text utility. When
> it finds a table it converts cell lines to "|" keeping the info in the
> cells intact and in the appropriate line even though it displaces the
> "|" somehow it doesn't interfere parsing the PDF.
>
> Is there a ---freeware-- program (veryPDF is not freeware) that
> converts to text in this way. I have tried around 15 and none converts
> the lines with tables correctly.

Most probably not. That's mostly because a PDF does not (necessarily)
contain logical structure information, just information about the
physical layout. Software can only get around this with heuristic
approaches. Common PDF-to-text software uses only minimal heuristics:
it just checks the distances between characters and strings and tries
to reproduce them in the output. Since tables in PDFs are usually not
composed of text characters, there is no "|" sign in the PDF itself,
and the converter doesn't even look at the table lines. And it wouldn't
be an easy task to analyze all the lines that are drawn in the PDF: not
every table even has them.
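
To make the "distance" heuristic concrete, here is a rough Python
sketch. It is not how any particular converter is implemented. It
assumes the text fragments of one line have already been extracted as
(x, text) pairs with x in PDF points, and that an average glyph is
about 6 points wide. Both the input format and the char_width value are
hypothetical.

```python
def render_line(fragments, char_width=6.0):
    """Place text fragments on one output line according to their x position.

    fragments: list of (x, text) tuples, x in PDF points (hypothetical input).
    char_width: assumed average glyph width in points.
    """
    line = ""
    for x, text in sorted(fragments):
        # Map the horizontal position to a character column.
        column = int(round(x / char_width))
        if line:
            # Pad up to the target column, but always leave at least one
            # space so neighbouring fragments never run together.
            pad = max(column - len(line), 1)
        else:
            pad = column
        line += " " * pad + text
    return line
```

Note that nothing here knows about tables: the gaps between cells come
out as runs of spaces, and any ruling lines drawn in the PDF are simply
never seen.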

Thus, OCR programs are probably the best solution to your
problem. I don't think that you can rely on Free Software here.

-hwh

kpdf3

Sep 23, 2006, 5:55:27 AM
The question is not whether any converter does it, but whether any
freeware converter does it.

As I said, VeryPDF does what I need, and xpdf, which is freeware (I
just tested it after reading the message), does it too, unlike most
converters to plain text, including the Adobe PDF Writer export option.

The xpdf output will need an additional programming routine for
parsing and pruning the file, but it seems to be a good answer.
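
A possible shape for that routine, as a Python sketch only. It assumes
the converted text keeps each table row on one line, with the columns
separated by at least two spaces and single spaces used only inside a
cell. That will not hold for every PDF. The function names and the
min_cells threshold are made up for illustration.

```python
import re


def row_to_cells(line):
    """Split one layout-preserved text line into table cells."""
    # Two or more spaces are treated as a column gap (an assumption:
    # single spaces are ordinary word separators inside a cell).
    return [cell for cell in re.split(r" {2,}", line.strip()) if cell]


def to_pipe_rows(text, min_cells=2):
    """Rewrite table-like lines as 'a|b|c' and prune everything else."""
    rows = []
    for line in text.splitlines():
        cells = row_to_cells(line)
        # Lines with fewer cells (titles, blank lines) are dropped.
        if len(cells) >= min_cells:
            rows.append("|".join(cells))
    return rows
```

The result resembles the "|"-separated lines from VeryPDF, so the same
database parser could probably be reused with small changes.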


Thanks
