Reading (some) PDF files with FBReader

Marius Gedminas

Aug 28, 2009, 9:40:35 AM
to fbre...@googlegroups.com
I've been looking for a sane PDF -> HTML converter that handled
paragraphs sanely, without much success. Usually when you run some PDF
to text converter, you get a bunch of text lines with no indication
where a paragraph begins or ends. I thought that it must be possible to
extract that information from the position of the text: either there'll
be an extra first-line indent, or some vertical space between
paragraphs.

There's a tool out there, called pdftohtml, that can convert a PDF into
a custom XML format. It does the heavy lifting of extracting the text
(including formatting such as bold or italic) and annotating it with
position information. I wrote a Python script that takes this XML and
produces an HTML file with the paragraphs correctly joined together. I
only tested it on one PDF file, but the result was an HTML file that was
nicely readable in FBReader.
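
A minimal sketch of the paragraph-joining idea, not the actual script. It assumes the XML output of pdftohtml (the -xml option) contains <page> elements holding <text> elements with top, left and height attributes. The function name and both thresholds (indent_tol, gap_factor) are made up for illustration, and paragraphs that continue across a page break are not rejoined.

```python
import xml.etree.ElementTree as ET


def join_paragraphs(xml_text, indent_tol=2, gap_factor=1.5):
    """Group positioned text lines into paragraphs.

    A new paragraph starts when a line is indented past the page's
    leftmost text position, or when the vertical distance to the
    previous line exceeds gap_factor times the line height.
    Both thresholds are illustrative guesses.
    """
    root = ET.fromstring(xml_text)
    paragraphs = []
    for page in root.iter("page"):
        lines = []
        for t in page.iter("text"):
            # itertext() flattens nested <b>/<i> markup into plain text.
            content = "".join(t.itertext()).strip()
            if content:
                lines.append((float(t.get("top")), float(t.get("left")),
                              float(t.get("height")), content))
        if not lines:
            continue
        # Treat the leftmost line start on the page as the text margin.
        margin = min(left for _, left, _, _ in lines)
        current, prev_top = [], None
        for top, left, height, content in lines:
            indented = left > margin + indent_tol
            gap = prev_top is not None and top - prev_top > gap_factor * height
            if current and (indented or gap):
                paragraphs.append(" ".join(current))
                current = []
            current.append(content)
            prev_top = top
        if current:
            paragraphs.append(" ".join(current))
    return paragraphs
```

Each returned string could then be wrapped in <p>...</p> to build the HTML file. Multi-column layouts, headers and footers would need extra handling that this sketch leaves out.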

Is anyone interested?

Are there any good PDF -> HTML conversion tools that I've missed?

Marius Gedminas
--
main(k){float i,j,r,x,y=-16;while(puts(""),y++<15)for(x
=0;x++<84;putchar(" .:-;!/>)|&IH%*#"[k&15]))for(i=k=r=0;
j=r*r-i*i-2+x/25,i=2*r*i+y/10,j*j+i*i<11&&k++<111;r=j);}
/* Mandelbrot in ASCII. */


Erik Hovland

Aug 28, 2009, 11:44:27 AM
to fbre...@googlegroups.com
> There's a tool out there, called pdftohtml, that can convert a PDF into
> a custom XML format.  It does the heavy lifting of extracting the text
> (including formatting such as bold or italic) and annotating it with
> position information.  I wrote a Python script that takes this XML and
> produces an HTML file with the paragraphs correctly joined together.  I
> only tested it on one PDF file, but the result was an HTML file that was
> nicely readable in FBReader.
>
> Is anyone interested?

Yes! Since you're at it, have you considered having it convert from XML
to epub or mobi?


> Are there any good PDF -> HTML conversion tools that I've missed?

There is the reflow patch to pdftohtml. Attached. It is not yet in good
enough shape to be accepted by the poppler folks, but it is a start.

E

--
Erik Hovland
er...@hovland.org
http://hovland.org/

fix-warnings-inpdftohtml
reflow-in-pdftohtml

Mikolaj Machowski

Aug 28, 2009, 12:11:42 PM
to fbre...@googlegroups.com
On Friday 28 August 2009 15:40:35 Marius Gedminas wrote:
> I've been looking for a sane PDF -> HTML converter that handled
> paragraphs sanely, without much success. Usually when you run some PDF
> to text converter, you get a bunch of text lines with no indication
> where a paragraph begins or ends. I thought that it must be possible to
> extract that information from the position of the text: either there'll
> be an extra first-line indent, or some vertical space between
> paragraphs.

I am using plain pdftohtml piped through two sed filters:

pdftohtml "$i" -noframes -enc UTF-8 -nodrm -stdout \
| sed 's/<br>\(..\)/ \1/g' \
| sed 's/<hr>/<br>/g'

This fixes most problems, and the files are readable in FBReader, but all
font formatting is lost, and sometimes files need "personal" adjustments
(usually a bunch of Vim %s/// commands). Note, however, that some files are
unrecoverable this way.
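
For anyone who would rather post-process in Python, the two sed filters above can be mirrored roughly like this. It is only a sketch: the function name clean_line is made up, and it assumes the input is handled one line at a time, as sed does.

```python
import re


def clean_line(line):
    """Apply the same two substitutions as the sed pipeline to one line."""
    # sed 's/<br>\(..\)/ \1/g': a <br> followed by at least two more
    # characters on the same line is replaced by a space, so the text
    # keeps flowing. A <br> at the very end of a line is left alone,
    # because "." does not match the newline.
    line = re.sub(r"<br>(..)", r" \1", line)
    # sed 's/<hr>/<br>/g': horizontal rules become plain line breaks.
    return line.replace("<hr>", "<br>")
```

To use it as a filter, loop over sys.stdin and write clean_line(line) to sys.stdout for each line.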

> There's a tool out there, called pdftohtml, that can convert a PDF into
> a custom XML format. It does the heavy lifting of extracting the text
> (including formatting such as bold or italic) and annotating it with
> position information. I wrote a Python script that takes this XML and
> produces an HTML file with the paragraphs correctly joined together. I
> only tested it on one PDF file, but the result was an HTML file that was
> nicely readable in FBReader.

> Is anyone interested?

I am, I am, I am :)

> Are there any good PDF -> HTML conversion tools that I've missed?
>

I am afraid not.

m.

AlanW

Aug 28, 2009, 12:24:06 PM
to FBReader
I don't think there is a good open source solution. I have seen
reports that the best OCR programs can do a better job of PDF to HTML
because they use different rules for marking text as a single block,
but these are all proprietary.

If you want to take this further, I suggest you do so in Calibre. Its
PDF to something capability isn't all that good right now, but it is
the tool most likely to be heavily used (and improved further) if its
PDF capability ever gets close to good enough.

Marius Gedminas

Aug 28, 2009, 3:56:44 PM
to fbre...@googlegroups.com
On Fri, Aug 28, 2009 at 08:44:27AM -0700, Erik Hovland wrote:
> > There's a tool out there, called pdftohtml, that can convert a PDF into
> > a custom XML format.  It does the heavy lifting of extracting the text
> > (including formatting such as bold or italic) and annotating it with
> > position information.  I wrote a Python script that takes this XML and
> > produces an HTML file with the paragraphs correctly joined together.  I
> > only tested it on one PDF file, but the result was an HTML file that was
> > nicely readable in FBReader.
> >
> > Is anyone interested?
>
> Yes!

The code is here: https://code.launchpad.net/~mgedmin/+junk/pdf2html

> Since you're at it, have you considered having it convert from XML
> to epub or mobi?

No, not really. I like HTML.

Though it would be nice if FBReader could extract metadata from HTML
files. <meta name="Author" content="..." /> or something...
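
To illustrate what such metadata extraction would involve (this is not FBReader code, which is C++), here is a small Python sketch that collects <meta name="..." content="..."> pairs from an HTML file. The class and function names are made up for the example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here,
        # because the default handle_startendtag calls this method.
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None:
            # Lowercase the key so "Author" and "author" are the same.
            self.meta[name.lower()] = content


def read_meta(html_text):
    """Return a dict of metadata found in the given HTML text."""
    parser = MetaCollector()
    parser.feed(html_text)
    return parser.meta
```

With this, read_meta(text).get("author") would give the author for a file that carries the tag shown above, and None otherwise.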

> > Are there any good PDF -> HTML conversion tools that I've missed?
>
> There is the reflow patch to pdftohtml. Attached. It is not yet in good
> enough shape to be accepted by the poppler folks, but it is a start.

Cool!

Does poppler have a bug tracker, with this patch attached to a bug?

Poppler and FBReader both use the GPL, and both appear to be written in
C++, so it should eventually be possible to add PDF support directly to
FBReader.

Marius Gedminas
--
To be intoxicated is to feel sophisticated but not be able to say it.


Jennifer Velez

Aug 28, 2009, 12:23:49 PM
to fbre...@googlegroups.com

How do I stop these from coming to me!!

On Aug 28, 2009 12:12 PM, "Erik Hovland" <er...@hovland.org> wrote:

> There's a tool out there, called pdftohtml, that can convert a PDF into > a custom XML format.  It...

Erik Hovland

Aug 30, 2009, 12:10:23 AM
to fbre...@googlegroups.com
> Does poppler have a bug tracker, with this patch attached to a bug?

As a matter of fact, I can answer yes to both:
https://bugs.freedesktop.org/show_bug.cgi?id=20652
