There's a tool out there, called pdftohtml, that can convert a PDF into
a custom XML format. It does the heavy lifting of extracting the text
(including formatting such as bold or italic) and annotating it with
position information. I wrote a Python script that takes this XML and
produces an HTML file with the paragraphs correctly joined together. I
only tested it on one PDF file, but the result was an HTML file that was
nicely readable in FBReader.
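
For illustration, here is a minimal sketch of the paragraph-joining idea. It is not the actual script. It assumes the XML has `<page>` elements containing `<text>` elements with integer-like `top` and `height` attributes, and it uses a made-up heuristic (the hypothetical `gap_factor` parameter): a vertical gap larger than 1.5 line heights starts a new paragraph. It also drops bold/italic markup and never joins paragraphs across a page break, so treat it as a starting point only.

```python
import xml.etree.ElementTree as ET
from html import escape


def join_paragraphs(xml_text, gap_factor=1.5):
    """Group <text> lines into paragraphs using their vertical positions.

    Assumed heuristic: a new paragraph starts when the distance between the
    tops of two consecutive lines exceeds gap_factor * previous line height.
    """
    root = ET.fromstring(xml_text)
    paragraphs = []
    for page in root.iter("page"):
        current = []
        prev_top = prev_height = None
        for node in page.iter("text"):
            # itertext() flattens <b>/<i> children, so formatting is lost here.
            line = "".join(node.itertext()).strip()
            if not line:
                continue
            top = float(node.get("top"))
            height = float(node.get("height"))
            if prev_top is not None and top - prev_top > gap_factor * prev_height:
                paragraphs.append(" ".join(current))
                current = []
            current.append(line)
            prev_top, prev_height = top, height
        if current:
            paragraphs.append(" ".join(current))
    return paragraphs


def to_html(paragraphs):
    """Wrap each paragraph in <p>...</p>, escaping HTML special characters."""
    return "\n".join("<p>%s</p>" % escape(p) for p in paragraphs)
```

Something like `to_html(join_paragraphs(open("book.xml").read()))` would then give the HTML body, where `book.xml` is a placeholder name for the XML file produced by pdftohtml.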
Is anyone interested?
Are there any good PDF -> HTML conversion tools that I've missed?
Marius Gedminas
--
main(k){float i,j,r,x,y=-16;while(puts(""),y++<15)for(x
=0;x++<84;putchar(" .:-;!/>)|&IH%*#"[k&15]))for(i=k=r=0;
j=r*r-i*i-2+x/25,i=2*r*i+y/10,j*j+i*i<11&&k++<111;r=j);}
/* Mandelbrot in ASCII. */
Yes! Since you're at it, have you considered having it convert from XML
to epub or mobi?
> Are there any good PDF -> HTML conversion tools that I've missed?
There is the reflow patch to pdftohtml. Attached. It is not yet in good
enough shape to be accepted by the poppler folk, but it is a start.
E
--
Erik Hovland
er...@hovland.org
http://hovland.org/
I am just using plain pdftohtml:
pdftohtml "$i" -noframes -enc UTF-8 -nodrm -stdout \
| sed 's/<br>\(..\)/ \1/g' \
| sed 's/<hr>/<br>/g'
This fixes most problems and the files are readable in FBReader, but all font
formatting is lost and sometimes files need "personal" adjustments (usually a
bunch of Vim %s/// commands). Note, however, that some files are
unrecoverable this way.
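
For anyone who would rather do this clean-up step in Python, the two sed substitutions above can be sketched roughly as follows. The function name `clean_pdftohtml_output` is made up for illustration. The first substitution turns a `<br>` that is followed by at least two more characters on the same line into a space, and the second turns every `<hr>` into a `<br>`.

```python
import re


def clean_pdftohtml_output(html_text):
    """Mimic: sed 's/<br>\\(..\\)/ \\1/g' | sed 's/<hr>/<br>/g'."""
    # '.' does not match a newline in Python by default, which matches
    # sed's line-by-line behaviour: a <br> at the end of a line is kept.
    joined = re.sub(r"<br>(..)", r" \1", html_text)
    # Page separators (<hr>) become plain line breaks.
    return joined.replace("<hr>", "<br>")
```

The order matters: the `<hr>` replacement runs second, so the `<br>` tags it creates are not collapsed into spaces.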
> There's a tool out there, called pdftohtml, that can convert a PDF into
> a custom XML format. It does the heavy lifting of extracting the text
> (including formatting such as bold or italic) and annotating it with
> position information. I wrote a Python script that takes this XML and
> produces an HTML file with the paragraphs correctly joined together. I
> only tested it on one PDF file, but the result was an HTML file that was
> nicely readable in FBReader.
> Is anyone interested?
I am, I am, I am :)
> Are there any good PDF -> HTML conversion tools that I've missed?
>
I am afraid not.
m.
The code is here: https://code.launchpad.net/~mgedmin/+junk/pdf2html
> Since you're at it, have you considered having it convert from XML
> to epub or mobi?
No, not really. I like HTML.
Though it would be nice if FBReader could extract metadata from HTML
files. <meta name="Author" content="..." /> or something...
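
To make the idea concrete, here is a small sketch of reading such metadata from an HTML file using only the Python standard library. This only illustrates the idea and says nothing about how FBReader (which is C++) would do it. The class and function names are invented for the example.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags also end up here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Keep the first value seen for each (lower-cased) name.
            self.meta.setdefault(name.lower(), content)


def extract_meta(html_text):
    parser = MetaExtractor()
    parser.feed(html_text)
    return parser.meta
```

With this, `extract_meta(page).get("author")` would return the author string if the page carries one, and `None` otherwise.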
> > Are there any good PDF -> HTML conversion tools that I've missed?
>
> There is the reflow patch to pdftohtml. Attached. It is not yet in good
> enough shape to be accepted by the poppler folk, but it is a start.
Cool!
Does poppler have a bug tracker, with this patch attached to a bug?
Poppler and FBReader are both GPL-licensed, and both appear to be written in C++,
so it would be possible to add PDF support directly to FBReader,
eventually.
Marius Gedminas
--
To be intoxicated is to feel sophisticated but not be able to say it.
How do I stop these from coming to me!!
On Aug 28, 2009 12:12 PM, "Erik Hovland" <er...@hovland.org> wrote:
> There's a tool out there, called pdftohtml, that can convert a PDF into
> a custom XML format. It...
As a matter of fact, I can answer yes to both:
https://bugs.freedesktop.org/show_bug.cgi?id=20652