Strange output through pdftotext

LoZ

Nov 17, 2009, 4:20:09 PM

Good evening, everyone,

Could someone be kind enough to explain why this PDF file
(http://dl.free.fr/ambGjrH7D) produces very strange output (with no
error code) through pdftotext (from poppler-utils on Debian lenny)?

<strange_output_excerpt>
!"

!" # "

!#$

%

!" & "

' ( )
#
</strange_output_excerpt>

Thanks in advance

--
Vincent

Derek B. Noonburg

Nov 17, 2009, 5:35:21 PM

On 2009-11-17, LoZ <p...@chau.de> wrote:
> Good evening, everyone,
>
> Could someone be kind enough to explain why this PDF file
> (http://dl.free.fr/ambGjrH7D) produces very strange output (with no
> error code) through pdftotext (from poppler-utils on Debian lenny)?
>
> <strange_output_excerpt>
> !"
>
> !" # "

> ...

I can't get that download to work, but typically that sort of output
means that there's no valid encoding information for the font(s) in
the PDF file.

For more details, see:

http://www.glyphandcog.com/textext.html

- Derek

LoZ

Nov 18, 2009, 2:31:11 AM

Derek B. Noonburg wrote:
> On 2009-11-17, LoZ <p...@chau.de> wrote:
>> Good evening, everyone,
>>
>> Could someone be kind enough to explain why this PDF file
>> (http://dl.free.fr/ambGjrH7D) produces very strange output (with no
>> error code) through pdftotext (from poppler-utils on Debian lenny)?
>>
>> <strange_output_excerpt>
>> !"
>>
>> !" # "
>> ...
>
> I can't get that download to work,

Try this one: wget http://www.cijoint.fr/cj200911/cijXki3VKL.pdf

> but typically that sort of output
> means that there's no valid encoding information for the font(s) in
> the PDF file.

Thanks for your response and for the information at
http://www.glyphandcog.com/textext.html

If the problem is "unable to find encoding information for the font", is
there a way for pdftotext to return a message or an exit code in that
situation?

--
Vincent

Derek B. Noonburg

Nov 18, 2009, 5:05:30 PM

Not really. I downloaded that PDF file, and in this case, the PDF
font objects contain no Encoding key at all. With TrueType fonts,
that typically indicates use of WinAnsiEncoding (or MacRomanEncoding -
but those two are close enough, at least for the 7-bit ASCII part).
But in this case, the fonts are subsets and are not using an
ASCII-based encoding. There's really no way for pdftotext to tell. A
very similar PDF file could generate valid extracted text.

- Derek

LoZ

Nov 30, 2009, 2:37:39 PM

Derek B. Noonburg wrote:

> Not really. I downloaded that PDF file, and in this case, the PDF
> font objects contain no Encoding key at all. With TrueType fonts,
> that typically indicates use of WinAnsiEncoding (or MacRomanEncoding -
> but those two are close enough, at least for the 7-bit ASCII part).
> But in this case, the fonts are subsets and are not using an
> ASCII-based encoding. There's really no way for pdftotext to tell. A
> very similar PDF file could generate valid extracted text.

Ok I understand.

I use pdftotext as the first step in indexing nearly 2000 PDF files, so I
would really appreciate a way to know when the PDF-to-text conversion
fails with that "strange output".
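
One heuristic for a batch like this, sketched in Python: treat the extracted text as suspect when too few of its visible characters are letters or digits, as in the excerpt at the top of the thread. The function names and the 0.5 threshold are assumptions chosen for illustration, not something pdftotext provides, and the check cannot catch a broken font whose character codes happen to land on letters.

```python
def letter_ratio(text):
    """Return the fraction of non-whitespace characters that are letters or digits."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        # No visible text at all: report 0.0 so the file is flagged for review.
        return 0.0
    return sum(c.isalnum() for c in visible) / len(visible)


def looks_garbled(text, threshold=0.5):
    """Flag text whose share of letters and digits falls below the threshold.

    The threshold is an arbitrary starting point (an assumption); tune it
    on a sample of files known to extract correctly.
    """
    return letter_ratio(text) < threshold
```

In an indexing script, the text could come from running `pdftotext file.pdf -`, which writes to standard output; files flagged by `looks_garbled` can then be set aside for manual inspection.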

pdfinfo can output the "Creator" of the PDF file. Are some programs (or
versions of them) known to produce PDFs with subset fonts, or PDFs
without an Encoding key?

Another question: is there a way to have pdftotext return an error code
meaning "there is no Encoding key at all"?

Thanks in advance.

--
Vincent

Peter Flynn

Dec 1, 2009, 6:17:26 PM

Doesn't pdfinfo provide this?

///Peter

LoZ

Dec 7, 2009, 9:35:41 AM

Peter Flynn wrote:

> LoZ wrote:
>> Another question: is there a way to have pdftotext return an error code
>> meaning "there is no Encoding key at all"?
>
> Doesn't pdfinfo provide this?

As far as I know, no.
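
pdfinfo does not report per-font details, but the pdffonts tool that ships alongside pdftotext prints one row per font. The Python sketch below parses that table and picks out fonts that are subsets with no ToUnicode map. It assumes the output is a header line followed by a ruler of dashes, with `sub` and `uni` columns; the layout may differ between versions, and a very long font name could push columns out of alignment. The helper names are hypothetical. A subset font without a ToUnicode map is only a warning sign, not proof that extraction will fail.

```python
import subprocess


def run_pdffonts(path):
    """Run pdffonts on one file and return its standard output as text."""
    result = subprocess.run(["pdffonts", path], capture_output=True, text=True)
    return result.stdout


def parse_pdffonts(output):
    """Parse the pdffonts table into a list of dicts keyed by column name."""
    lines = output.splitlines()
    # Locate the ruler line (only dashes and spaces); the header sits above it.
    for i, line in enumerate(lines):
        if i > 0 and line.strip() and set(line) <= {"-", " "}:
            break
    else:
        return []
    header, ruler = lines[i - 1], line
    # Each run of dashes gives the start and end offset of one column.
    spans, start = [], None
    for pos, ch in enumerate(ruler + " "):
        if ch == "-" and start is None:
            start = pos
        elif ch != "-" and start is not None:
            spans.append((start, pos))
            start = None
    names = [header[a:b].strip() for a, b in spans]
    return [
        {name: row[a:b].strip() for name, (a, b) in zip(names, spans)}
        for row in lines[i + 1:]
        if row.strip()
    ]


def risky_fonts(rows):
    """Return rows for subset fonts that carry no ToUnicode map."""
    return [r for r in rows if r.get("sub") == "yes" and r.get("uni") == "no"]
```

Calling `risky_fonts(parse_pdffonts(run_pdffonts("file.pdf")))` on each file gives a list that is empty when no font matches; a non-empty list marks a file worth checking by hand before it is indexed.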