I need to convert a PDF (the data is only available in that format) to
plain text that I can parse and load into a database.
The problem is that PDF Writer exports tables to HTML, Word, or RTF
as graphics, and its two text export options don't keep the contents
of a table row on the same line.
The closest to what I want is VeryPDF's PDF2text utility. When it
finds a table it converts the cell borders to "|", keeping the cell
contents intact and on the correct line. It sometimes misplaces the
"|", but that doesn't interfere with parsing the output.
Is there a freeware program (VeryPDF is not freeware) that converts to
text in this way? I have tried around 15, and none converts the lines
with tables correctly.
Thanks
Xpdf - PdfToText
Fabrizio
kpdf3 wrote:
> I have been trying several PDF to plain text converters including
> Adobe's own PDF writer professional 'save as' feature. However its does
> not convert the file as I wish.
>
> .............
On 16 Sep 2006 03:52:49 -0700 "kpdf3" <miledep...@yahoo.com> wrote:
> The closest to what I want is done by VeryPDF's PDF2text utility. When
> it finds a table it converts cell lines to "|" keeping the info in the
> cells intact and in the appropriate line even though it displaces the
> "|" somehow it doesn't interfere parsing the PDF.
>
> Is there a ---freeware-- program (veryPDF is not freeware) that
> converts to text in this way. I have tried around 15 and none converts
> the lines with tables correctly.
Most probably not. That's mostly because a PDF does not (necessarily)
contain logical structure information, just information about the
physical layout. Software can only get around this with heuristics.
Common PDF-to-text software uses only minimal heuristics: it just
checks the distances between chars and strings and tries to reproduce
them in the output. Since tables in PDFs are usually drawn with
graphics rather than text chars, there is no "|" sign in the PDF
itself, and the converter doesn't even look at the table lines. Nor
would it be easy to analyze all the lines defined in the PDF: not
every table even has them.
Thus, OCR programs are probably the best solution to your
problem. I don't think that you can rely on Free Software here.
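To make the distance heuristic described above concrete, here is a rough sketch in Python. It is not how any particular converter is implemented. The function name, the (x, y, text) fragment format, and the char_width and line_tol values are all assumptions made up for illustration.

```python
def layout_text(fragments, char_width=6.0, line_tol=2.0):
    """Rebuild text lines from positioned fragments.

    fragments: list of (x, y, text) tuples in PDF units, with y growing
    upward as in PDF page space. char_width and line_tol are guessed
    values; a real converter would derive them from the font sizes.
    """
    rows = []  # each entry is (y of the row, [(x, text), ...])
    # Visit fragments top to bottom, then left to right.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= line_tol:
            rows[-1][1].append((x, text))   # close enough: same line
        else:
            rows.append((y, [(x, text)]))   # start a new line

    lines = []
    for _, items in rows:
        items.sort()
        line = ""
        for x, text in items:
            # Map the horizontal position to a character column.
            col = int(round(x / char_width))
            if line and col <= len(line):
                col = len(line) + 1         # never glue two fragments together
            line = line.ljust(col) + text
        lines.append(line)
    return lines


if __name__ == "__main__":
    page = [(0, 100, "Name"), (60, 100, "Qty"),
            (0, 88, "Bolt"), (60, 88.5, "12")]
    print("\n".join(layout_text(page)))
```

Note that nothing here knows about tables: columns line up only because the fragments happen to share x positions, which is why ruled lines never show up as "|" in the output.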
-hwh
As I said, VeryPDF does what I need, and xpdf, which is freeware (I
just tested it after reading the message), does it too, unlike most
converters to plain text, including the Adobe PDF Writer export option.
The xpdf output will need some extra programming when parsing and
pruning the file, but it seems to be a good answer.
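A minimal sketch of what that parsing and pruning step could look like in Python. It assumes the text was produced with pdftotext's -layout option, so that table cells end up separated by runs of two or more spaces, and it also accepts "|" separators like the ones VeryPDF writes. The helper names (pdf_to_text, split_row, table_rows) and the min_cells threshold are made up for this example.

```python
import re
import subprocess


def pdf_to_text(pdf_path, txt_path):
    # Assumes Xpdf's pdftotext is installed and on PATH.
    # -layout asks it to keep the physical layout of the page.
    subprocess.run(["pdftotext", "-layout", pdf_path, txt_path], check=True)


# A cell boundary is either a "|" (with optional spaces around it)
# or a run of two or more whitespace characters.
_SEP = re.compile(r"\s*\|\s*|\s{2,}")


def split_row(line):
    """Split one text line into its non-empty cells."""
    return [cell for cell in _SEP.split(line.strip()) if cell]


def table_rows(lines, min_cells=2):
    """Yield only the lines that look like table rows (pruning the rest)."""
    for line in lines:
        cells = split_row(line)
        if len(cells) >= min_cells:
            yield cells


if __name__ == "__main__":
    sample = ["Page 1", "Name      Qty   Price", "", "M6 bolt   12    3.50"]
    for row in table_rows(sample):
        print(row)
```

Splitting on two or more spaces keeps cells that contain single spaces intact, but it will merge two cells that the converter happened to separate by only one space, so the threshold may need tuning per document.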
Thanks