Bug#676238: Unable to convert PDF to xml using pdftohtml (empty pages)

Petter Reinholdtsen

Jun 5, 2012, 11:10:01 AM

Package: poppler-utils
Version: 0.12.4-1.2
Severity: normal

When I convert
<URL: http://nrk.no/contentfile/file/1.8116520!offentligjournal02052012.pdf >
to XML using

pdftohtml -xml -noframes 1.8116520\!offentligjournal02052012.pdf

I get the following content-less XML file. I find this rather strange,
as the PDF is searchable using xpdf, okular and evince. Any idea where
the text went? Anything I can do to get access to the text as XML?

This is the output I get:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="792" width="612">
<fontspec id="0" size="18" family="Helvetica" color="#000000"/>
<fontspec id="1" size="5" family="Helvetica" color="#000000"/>
<fontspec id="2" size="5" family="Helvetica" color="#000000"/>
<fontspec id="3" size="7" family="Helvetica" color="#000000"/>
</page>
<page number="2" position="absolute" top="0" left="0" height="792" width="612">
<fontspec id="4" size="6" family="Helvetica" color="#000000"/>
</page>
<page number="3" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="4" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="5" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="6" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="7" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="8" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="9" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="10" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="11" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="12" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="13" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="14" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="15" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="16" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="17" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="18" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="19" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="20" position="absolute" top="0" left="0" height="792" width="612">
</page>
</pdf2xml>
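
The symptom above (pages that carry only fontspec entries, or nothing at all) can be checked for mechanically. The following is a minimal sketch, assuming the pdf2xml output places extracted strings in <text> elements inside each <page>; the function name empty_pages is hypothetical and not part of poppler.

```python
import xml.etree.ElementTree as ET


def empty_pages(xml_bytes):
    """Return the page numbers whose <page> element holds no <text> element.

    Assumes pdf2xml-style input: a <pdf2xml> root with <page number="N">
    children, where extracted strings appear as <text> elements.
    """
    root = ET.fromstring(xml_bytes)
    return [
        int(page.get("number"))
        for page in root.iter("page")
        if page.find(".//text") is None
    ]
```

Run against the output shown in this report, such a check would be expected to flag every page, since none of them contains a <text> element.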

-- System Information:
Debian Release: 6.0.5
APT prefers stable-updates
APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: i386 (i686)

Kernel: Linux 2.6.32-5-686 (SMP w/1 CPU core)
Locale: LANG=nb_NO.UTF-8, LC_CTYPE=nb_NO.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages poppler-utils depends on:
ii libc6 2.11.3-3 Embedded GNU C Library: Shared lib
ii libfontconfig1 2.8.0-2.1 generic font configuration library
ii libgcc1 1:4.4.5-8 GCC support library
ii libpoppler5 0.12.4-1.2 PDF rendering library
ii libstdc++6 4.4.5-8 The GNU Standard C++ Library v3
ii libxml2 2.7.8.dfsg-2+squeeze4 GNOME XML library

Versions of packages poppler-utils recommends:
ii ghostscript 8.71~dfsg2-9 The GPL Ghostscript PostScript/PDF

poppler-utils suggests no packages.

-- no debconf information




Petter Reinholdtsen

Jun 5, 2012, 12:40:02 PM

I've also reported this upstream,
<URL: https://bugs.freedesktop.org/show_bug.cgi?id=50739 >.
--
Happy hacking
Petter Reinholdtsen

Pino Toscano

Jun 21, 2012, 6:50:03 AM

forwarded 676238 https://bugs.freedesktop.org/show_bug.cgi?id=50739
found 676238 poppler/0.18.4-2
tag 676238 + confirmed
thanks

Hi Petter,

On Tuesday 5 June 2012, Petter Reinholdtsen wrote:
> Package: poppler-utils
> Version: 0.12.4-1.2

Hm it is an old poppler (the one in stable), though...

> When I convert
> <URL: http://nrk.no/contentfile/file/1.8116520!offentligjournal02052012.pdf >
> to XML using
>
> pdftohtml -xml -noframes 1.8116520\!offentligjournal02052012.pdf
>
> I get the following content-less XML file. I find this rather
> strange, as the PDF is searchable using xpdf, okular and evince.

... this problem can be reproduced also with poppler 0.18.4, currently
in wheezy.

> Any idea where the text went? Anything I can do to get access to
> the text as XML?

Note that also adding -hidden to the arguments makes the text show up in
the XML output.
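
For a scraper, the workaround could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names pdftohtml_command and extract_text are hypothetical, the -stdout option is an addition not used in this thread and is assumed to make pdftohtml write the XML to standard output, and the strings are assumed to arrive as <text> elements.

```python
import subprocess
import xml.etree.ElementTree as ET


def pdftohtml_command(pdf_path, include_hidden=True):
    """Build the pdftohtml argument list used in this report.

    -stdout is an assumption added here so the XML can be captured
    directly; -hidden is the workaround suggested above.
    """
    cmd = ["pdftohtml", "-xml", "-noframes", "-stdout"]
    if include_hidden:
        cmd.append("-hidden")
    cmd.append(pdf_path)
    return cmd


def extract_text(pdf_path, include_hidden=True):
    """Run pdftohtml and return the content of every <text> element."""
    result = subprocess.run(
        pdftohtml_command(pdf_path, include_hidden),
        check=True,
        stdout=subprocess.PIPE,
    )
    root = ET.fromstring(result.stdout)
    # itertext() also collects text nested in inline markup such as <b>.
    return ["".join(node.itertext()) for node in root.iter("text")]
```

Passing the arguments as a list avoids shell quoting, so a file name containing "!" needs no backslash escape.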

> I've also reported this upstream,
> <URL: https://bugs.freedesktop.org/show_bug.cgi?id=50739 >.

Added forwarding.

Thanks for your report,
--
Pino Toscano

Petter Reinholdtsen

Jun 29, 2012, 5:50:01 PM

[Pino Toscano]
>> Any idea where the text went? Anything I can do to get access to
>> the text as XML?
>
> Note adding also -hidden to the arguments makes the text show up in the
> XML output.

Thank you for the hint. It provides me with a workaround that allows my
PDF scraper to work. I have no idea what hidden text in PDFs is, but
apparently some PDFs contain only hidden text. :)

Now <URL: http://www.scraperwiki.com/ > has support for handling PDFs
with hidden text, and I can continue my project scraping public
information. :)
--
Happy hacking
Petter Reinholdtsen


