Bug#775049: poppler-utils: "pdftohtml -s <file>.pdf" produces multiple files.

pe...@easthope.ca

Jan 10, 2015, 1:30:02 PM
Package: poppler-utils
Version: 0.26.5-2
Severity: important
Tags: newcomer patch

Dear Maintainer,

*** Reporter, please consider answering these questions, where appropriate ***

* What led up to the situation?
pdftohtml was applied to a pdf file containing pixmap images.

* What exactly did you do (or not do) that was effective (or ineffective)?
The command was "pdftohtml -s <file>.pdf".

* What was the outcome of this action?
All the text was in one <file>.html but each picture was an additional file.

* What outcome did you expect instead?
With the -s option, text and pictures should all be in one <file>.html.
A JPEG or PNG picture can be included in an html document with Base64
encoding. The syntax is very simple. Examples here.
http://easthope.ca/Category2.html
Scroll down to the heading "Inline, Base64 encoded PNG bitmaps".
The first instance of an embedded bitmap is
<img src="data:image/png;base64,iVBORw0K ..."
alt="Diagram for 0x0 with test object 0,<br>represented in PNG.<br>">
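
As a minimal sketch of how such an img element could be built: the bytes
of the picture are Base64 encoded and placed in a data: URL. Python is
used only for illustration, and the function name img_tag_from_bytes is
hypothetical; it is not part of poppler.

```python
import base64
from html import escape


def img_tag_from_bytes(data: bytes, mime: str = "image/png", alt: str = "") -> str:
    """Return an <img> element whose src is a Base64 data: URL.

    Hypothetical helper for illustration only; not taken from pdftohtml.
    """
    encoded = base64.b64encode(data).decode("ascii")
    return f'<img src="data:{mime};base64,{encoded}" alt="{escape(alt, quote=True)}">'


# The 8-byte PNG file signature stands in for a real picture here.
png_signature = b"\x89PNG\r\n\x1a\n"
print(img_tag_from_bytes(png_signature, alt="demo"))
# prints: <img src="data:image/png;base64,iVBORw0KGgo=" alt="demo">
```

The PNG signature encodes to iVBORw0KGgo=, which is why Base64 encoded
PNG data starts with iVBORw0K, as in the example above.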

-- System Information:
Debian Release: 8.0
APT prefers stable-updates
APT policy: (500, 'stable-updates'), (500, 'testing'), (500, 'stable')
Architecture: i386 (i686)

Kernel: Linux 3.2.0-0.bpo.4-686-pae (SMP w/1 CPU core)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages poppler-utils depends on:
ii libc6 2.19-13
ii libcairo2 1.14.0-2.1
ii libfreetype6 2.5.2-2
ii libgcc1 1:4.9.1-19
ii liblcms2-2 2.6-3+b3
ii libpoppler46 0.26.5-2
ii libstdc++6 4.9.1-19
ii zlib1g 1:1.2.8.dfsg-2+b1

poppler-utils recommends no packages.

poppler-utils suggests no packages.

-- no debconf information

--
123456789 123456789 123456789 123456789 123456789 123456789 123456789 12
Tel +1 360 639 0202 http://carnot.yi.org/ Bcc: peter at easthope. ca


Pino Toscano

Jul 25, 2015, 3:20:02 PM
severity 775049 wishlist
tag 775049 - newcomer patch
thanks

Hi Peter,

On Saturday 10 January 2015 10:01:45, you wrote:
> * What led up to the situation?
> pdftohtml was applied to a pdf file containing pixmap images.
>
> * What exactly did you do (or not do) that was effective (or
> ineffective)? The command was "pdftohtml -s <file>.pdf".
>
> * What was the outcome of this action?
> All the text was in one <file>.html but each picture was an
> additional file.
>
> * What outcome did you expect instead?
> With the -s option, text and pictures should all be in one
> <file>.html. A JPEG or PNG picture can be included in an html
> document with Base64 encoding. The syntax is very simple. Examples
> here.
> http://easthope.ca/Category2.html
> Scroll down to the heading "Inline, Base64 encoded PNG bitmaps".
> The first instance of an embedded bitmap is
> <img src="data:image/png;base64,iVBORw0K ..."
> alt="Diagram for 0x0 with test object 0,<br>represented in
> PNG.<br>">

The current behaviour is generally acceptable: binary resources such as
images are kept outside the main HTML content, so they can, for example,
be stored more efficiently in version control systems (the HTML changes,
the images stay the same) and be served and cached better over HTTP.
Thus, I would consider inlining the image data a new feature, which
should be explicitly requested.

Can you please request this new feature on the bug tracking system of
Poppler, at https://bugs.freedesktop.org/, "poppler" product and "utils"
component? Once done, you can report here its bug number.

Thanks for your report,
--
Pino Toscano

pe...@easthope.ca

May 5, 2022, 4:40:03 PM
Resolved upstream with the addition of the option -dataurls. The name of
the option is rather obscure, but it means that an entity such as an
image is embedded as a Base64 encoded data: URL in the src attribute of
an img tag.

In Debian 11 the command
pdftohtml -dataurls -c -s MyPhDthesis.pdf
produces MyPhDthesis.html as a complete document, including images, in
one file.
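
One way to check that a generated file really is self-contained is to
look for img elements whose src is not a data: URL. The following is
only a sketch with hypothetical names (ImgSrcCollector,
external_image_sources); it inspects img tags and nothing else, so
images referenced in other ways, from CSS for example, would go
unnoticed.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect src values of <img> tags that are not data: URLs."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        if not src.lower().startswith("data:"):
            self.external.append(src)


def external_image_sources(html_text):
    """Return the list of img sources that point outside the document."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.external
```

An empty list returned for the text of MyPhDthesis.html would indicate
that every img element carries its data inline.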

Regards, ... P.

--
mobile: +1 778 951 5147
VoIP: +1 604 670 0140
48.7693 N 123.3053 W

pe...@easthope.ca

Nov 16, 2022, 11:40:04 AM
-dataurls is definitely a valuable addition. Thanks.

One further detail is worth attention. The man page notes
"-c generate complex output". For at least one file here,
"pdftohtml -dataurls -c -s <file>.pdf" and "pdftohtml -dataurls -s <file>.pdf"
produce the same output, so when -dataurls is used, -c may have no
effect.

The manual page should therefore explain -c with more than "generate
complex output".
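
For anyone wanting to repeat the comparison, a sketch follows. It
assumes the two commands have already been run and their outputs saved
under two different names; the function name diff_outputs and the file
names on the command line are hypothetical.

```python
import difflib
import sys
from pathlib import Path


def diff_outputs(path_a, path_b):
    """Return unified-diff lines for two text files; an empty list if identical."""
    a = Path(path_a).read_text(encoding="utf-8", errors="replace").splitlines()
    b = Path(path_b).read_text(encoding="utf-8", errors="replace").splitlines()
    return list(difflib.unified_diff(a, b, fromfile=path_a, tofile=path_b, lineterm=""))


# Usage (hypothetical file names): python3 compare.py with-c.html without-c.html
if __name__ == "__main__" and len(sys.argv) == 3:
    lines = diff_outputs(sys.argv[1], sys.argv[2])
    print("\n".join(lines) if lines else "outputs are identical")
```

An empty result means only that -c made no difference for that
particular document; other PDF files may behave differently.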

Thanks again, ... P.
https://en.wikibooks.org/wiki/User:PeterEasthope