Strange output through pdftotext

LoZ

Nov 17, 2009, 4:20:09 PM
Good evening, everyone,

Could someone kindly explain why this PDF file
(http://dl.free.fr/ambGjrH7D) produces very strange output (with no
error code) when run through pdftotext (from poppler-utils on Debian lenny)?

<strange_output_excerpt>
!"

!" # "

!#$

%

!" & "

' ( )
#
</strange_output_excerpt>

Thanks in advance

--
Vincent

Derek B. Noonburg

Nov 17, 2009, 5:35:21 PM
On 2009-11-17, LoZ <p...@chau.de> wrote:
> Good evening, everyone,
>
> Could someone kindly explain why this PDF file
> (http://dl.free.fr/ambGjrH7D) produces very strange output (with no
> error code) when run through pdftotext (from poppler-utils on Debian lenny)?
>
> <strange_output_excerpt>
> !"
>
> !" # "
> ...

I can't get that download to work, but typically that sort of output
means that there's no valid encoding information for the font(s) in
the PDF file.

For more details, see:

http://www.glyphandcog.com/textext.html

- Derek

LoZ

Nov 18, 2009, 2:31:11 AM
Derek B. Noonburg wrote:
> On 2009-11-17, LoZ <p...@chau.de> wrote:
>> Good evening, everyone,
>>
>> Could someone kindly explain why this PDF file
>> (http://dl.free.fr/ambGjrH7D) produces very strange output (with no
>> error code) when run through pdftotext (from poppler-utils on Debian lenny)?
>>
>> <strange_output_excerpt>
>> !"
>>
>> !" # "
>> ...
>
> I can't get that download to work,
Try this one: wget http://www.cijoint.fr/cj200911/cijXki3VKL.pdf

> but typically that sort of output
> means that there's no valid encoding information for the font(s) in
> the PDF file.

Thanks for your response and for the information at
http://www.glyphandcog.com/textext.html

If the problem is "unable to find encoding information for the font", is
there a way for pdftotext to return a message or an exit code in that situation?

--
Vincent

Derek B. Noonburg

Nov 18, 2009, 5:05:30 PM

Not really. I downloaded that PDF file, and in this case, the PDF
font objects contain no Encoding key at all. With TrueType fonts,
that typically indicates use of WinAnsiEncoding (or MacRomanEncoding -
but those two are close enough, at least for the 7-bit ASCII part).
But in this case, the fonts are subsets and are not using an
ASCII-based encoding. There's really no way for pdftotext to tell. A
very similar PDF file could generate valid extracted text.

- Derek
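
[Editorial illustration] One plausible way such output can arise is a
producer that numbers the glyphs of a subset font sequentially, in order of
first use, starting at code 0x21. This is an assumption for illustration only
and has not been confirmed for this particular file. The sketch below mimics
that scheme and then reads the codes back as if they were ASCII, which is
roughly what an extractor is left with when there is no Encoding key. The
function names and the 0x21 starting code are hypothetical.

```python
def subset_codes(text, first_code=0x21):
    """Assign a code to each distinct character in order of first appearance.

    Assumption: the subset font numbers its glyphs sequentially from
    first_code. Real producers may use a different scheme.
    """
    mapping = {}
    for ch in text:
        if ch != " " and ch not in mapping:
            mapping[ch] = first_code + len(mapping)
    return mapping


def extract_without_encoding(text, mapping):
    """Mimic an extractor that interprets the raw codes as ASCII."""
    return "".join(" " if ch == " " else chr(mapping[ch]) for ch in text)


original = "see the tea"
codes = subset_codes(original)
print(extract_without_encoding(original, codes))  # prints: !"" #$" #"%
```

The result resembles the excerpt at the top of the thread: mostly punctuation
from the low end of the ASCII range. Nothing in the character codes themselves
says which interpretation is right, which is why pdftotext cannot tell.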

LoZ

Nov 30, 2009, 2:37:39 PM
Derek B. Noonburg wrote:

> Not really. I downloaded that PDF file, and in this case, the PDF
> font objects contain no Encoding key at all. With TrueType fonts,
> that typically indicates use of WinAnsiEncoding (or MacRomanEncoding -
> but those two are close enough, at least for the 7-bit ASCII part).
> But in this case, the fonts are subsets and are not using an
> ASCII-based encoding. There's really no way for pdftotext to tell. A
> very similar PDF file could generate valid extracted text.

OK, I understand.

I use pdftotext as the first step in indexing nearly 2000 PDF files, so I
would really appreciate a way to know when the PDF-to-text conversion
fails with that kind of "strange output".

pdfinfo can output the "Creator" of the PDF file. Are any of these
programs (or particular versions of them) known to produce PDFs with subset
fonts, or PDFs without an Encoding key?

Another question: is there a way to have pdftotext return an error code
meaning "there is no Encoding key at all"?

Thanks in advance.

--
Vincent
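
[Editorial illustration] Since pdftotext itself gives no signal here, one
workaround for the batch-indexing case is to inspect the extracted text. The
sketch below is a heuristic, not a feature of pdftotext: it flags a file as
"suspect" when fewer than half of its visible characters are letters. The 0.5
threshold and all function names are assumptions chosen for illustration. It
will miss garbage that happens to land on letters, and it will misflag
legitimate documents that are mostly numbers or symbols.

```python
import subprocess


def letter_ratio(text):
    """Fraction of non-whitespace characters that are letters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalpha() for ch in visible) / len(visible)


def looks_suspect(text, min_ratio=0.5):
    """Heuristic: empty or mostly non-letter text is treated as suspect."""
    return letter_ratio(text) < min_ratio


def extract_text(pdf_path):
    """Run pdftotext; '-' as the output file sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,  # raise if pdftotext exits with a non-zero status
    )
    return result.stdout.decode("utf-8", errors="replace")


def check_pdf(pdf_path, extract=extract_text):
    """Return 'ok', 'suspect', or 'error' for a single PDF file."""
    try:
        text = extract(pdf_path)
    except (OSError, subprocess.CalledProcessError):
        return "error"
    return "suspect" if looks_suspect(text) else "ok"
```

Calling check_pdf on each file and setting the "suspect" ones aside for
manual review keeps unreadable text out of the index. The threshold would
need tuning against a sample of the real files.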

Peter Flynn

Dec 1, 2009, 6:17:26 PM

Doesn't pdfinfo provide this?

///Peter

LoZ

Dec 7, 2009, 9:35:41 AM
Peter Flynn wrote:

> LoZ wrote:
>> Another question: is there a way to have pdftotext return an error code
>> meaning "there is no Encoding key at all"?
>
> Doesn't pdfinfo provide this?
As far as I know, no.