Could someone be kind enough to explain to me why this PDF file
(http://dl.free.fr/ambGjrH7D) produces very strange output (with no
error code) when run through pdftotext (from poppler-utils on Debian lenny)?
<strange_output_excerpt>
!"
!" # "
!#$
%
!" & "
' ( )
#
</strange_output_excerpt>
Thanks in advance
--
Vincent
I can't get that download to work, but typically that sort of output
means that there's no valid encoding information for the font(s) in
the PDF file.
For more details, see:
http://www.glyphandcog.com/textext.html
- Derek
> but typically that sort of output
> means that there's no valid encoding information for the font(s) in
> the PDF file.
Thanks for your response and for the information at
http://www.glyphandcog.com/textext.html
If the problem is "unable to find encoding information for the font", is
there a way for pdftotext to return a message or an exit code in that situation?
--
Vincent
Not really. I downloaded that PDF file, and in this case, the PDF
font objects contain no Encoding key at all. With TrueType fonts,
that typically indicates use of WinAnsiEncoding (or MacRomanEncoding -
but those two are close enough, at least for the 7-bit ASCII part).
But in this case, the fonts are subsets and are not using an
ASCII-based encoding. There's really no way for pdftotext to tell. A
very similar PDF file could generate valid extracted text.
- Derek
OK, I understand.
I use pdftotext as the first step in indexing nearly 2000 PDF files, so I
would really appreciate a way to know when the PDF-to-text
conversion fails with that "strange output".
pdfinfo can output the "creator" of the PDF file. Are some of these
programs (or versions of them) known to produce PDFs with subset
fonts, or PDFs without an Encoding key?
Another question: is there a way to have pdftotext return an error code
to say "there is no Encoding key at all"?
Thanks in advance.
--
Vincent
Doesn't pdfinfo provide this?
///Peter
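Since pdftotext cannot signal this case itself, one possible workaround
for the batch-indexing job is to inspect the extracted text afterwards.
The following is only a minimal sketch, not a feature of pdftotext: the
function names and the 0.5 threshold are assumptions made for
illustration, and the heuristic (the share of letters and digits among
visible characters) will also flag empty output and may misfire on
legitimately symbol-heavy documents.

```python
import subprocess


def letter_ratio(text):
    """Return the fraction of non-whitespace characters that are letters or digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalnum() for ch in visible) / len(visible)


def looks_garbled(text, threshold=0.5):
    """Heuristic check: ordinary prose is mostly letters.

    The threshold is an arbitrary starting point (an assumption), to be
    tuned against files that are known to extract correctly.
    """
    return letter_ratio(text) < threshold


def extract_text(pdf_path):
    """Run pdftotext on one file and return the text it writes to stdout.

    The trailing "-" asks pdftotext to write to stdout instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def flag_suspect_files(pdf_paths):
    """Return the paths whose extracted text looks like punctuation soup."""
    return [path for path in pdf_paths if looks_garbled(extract_text(path))]
```

Calling flag_suspect_files() on the list of PDF paths gives the files to
review by hand; the excerpt at the top of this thread has a letter ratio
of zero, so it would be flagged.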