How to check pdf files for errors?

Stefan Röhle

Nov 15, 2004, 2:47:42 AM

Hi,

is it possible to scan pdf files for errors (syntax errors, pdf version,
etc.)?
We have the problem that students quite often report problems when
printing PDFs from various sources at our university.
Now I want to investigate why these errors occur. Probably there are
some peculiarities in the pdf files that our printers (ps mode) are not
able to deal with.
Any suggestions?

Stefan

*******************************
Stefan Röhle
Zentrum für Datenverarbeitung
Johannes Gutenberg-Universität
D-55099 Mainz
Germany

Tel. +49-(0)6131/39-26303
Fax. +49-(0)6131/39-26407
Email: roe...@uni-mainz.de
*******************************

Ralf Koenig

Nov 15, 2004, 5:00:22 AM

Stefan Röhle wrote:

> is it possible to scan pdf files for errors (syntax errors, pdf version,
> etc.)?

Sure, it is. There are PDF validation and pre-flighting tools, such as:

Free (with limited capabilities, but the Multivalent tools in general
are some of the best free PDF tools):

* tool.pdf.Validate (Multivalent, free)

Two prominent commercial products (in the up to $1000 range) used in the
graphics and printing industry:
* Enfocus Pitstop
* callas pdfInspektor2 (needs Adobe Acrobat)

I just heard of an online service (operated by callas) for this purpose.
Maybe you can give it a try (it accepts documents up to 3 MB in size):
http://www.pdfcity.com/index.htm

For additional tools search for "PDF pre-flighting" in your favourite
search engine.

In many cases the file itself is valid, but its complexity causes
problems. In most cases this complexity is introduced by bad tools or
software for creating and converting PDF files.

Just as a side note: There is a spec for a much more stringent form of
PDF used in commercial printing: PDF/X. You should know about it, but a
university is not the right place to require it.

> We have the problem that students quite often report problems with their
> printing of pdfs from various sources from our university.
> Now I want to investigate why these errors occur. Probably there are
> some peculiarities in the pdf files that our printers (ps mode) are not
> able to deal with.
> Any suggestions?

We have similar problems at our university, where I investigated some of
these issues. I even have a special directory called "pdf-clinic" where
I collect cases of such broken or problematic PDF files for
investigation or testing of software.

The top problem is users not understanding the workflow and the
conversions occurring in the PDF -> printed page process.

Another problem is with tools (or old versions of software) that try to
keep up with the pace at which Adobe introduces new features into the
PDF format (the newest spec is about 1,200 pages; AR7 is in the
pipeline). PDF is quite a complex format, and only a few free programs
can handle it.

A major problem is that Acrobat Reader 6 is only available for Windows
and Mac (where fewer basic problems occur), but not for Linux, where
5.0.9 is the latest version and Adobe makes no statement on updates
while planning AR7 (for Windows).

I listed some of the problems at the following resources:

[German]
http://www-user.tu-chemnitz.de/~ralk/pdf_problems/

Esp. take a look at my presentation about PDF (German as well).
http://archiv.tu-chemnitz.de/pub/2004/0144/index.html

I will gradually transfer this stuff to my Wiki pages and translate to
English. Reason: I am a fan of a world language, that a wide audience
can understand.

[English]
https://rnvs.informatik.tu-chemnitz.de/twiki/bin/view/Main/PdfPrintingProblems

Here is an unformatted copy for the news archive:

For each problem: probable cause, analysis, solution.

Printing takes forever
=======================
Probable cause: Acrobat transparency bug. Images with transparent parts
(typically GIFs) are split into thousands of small pieces, which is a
heavy workload for the rasterizer inside the printer.
Analysis: tool.pdf.Info -images from the Multivalent tools
Solution: re-generate the PDF, or delete the thousands of images in
Acrobat

Probable cause: a lot of JPEG images (DCTDecode) reside in the PDF (and
so in the PS). This is a heavy workload for the rasterizer inside the
printer, which has to decompress them.
Analysis: tool.pdf.Info -images from the Multivalent tools
Solution: reduce the resolution of the images, or split the document
into small print jobs of a few pages each

Probable cause: size explosion from PDF to PS due to decompression of
images. For a few compression filters, viewers decompress the data while
converting PDF to PS, which leads to a much bigger PS file.
Analysis: tool.pdf.Info -images from the Multivalent tools
Solution: reduce the resolution of the images, or split the document
into small print jobs of a few pages each
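
As a rough cross-check alongside tool.pdf.Info -images, the number of
image XObjects can be estimated by scanning the raw bytes of the file.
This is only a sketch under assumptions: the function names and the
threshold of 1000 are made up for illustration, and inline images inside
content streams are not counted, so the result is a lower bound.

```python
import re

# "/Subtype /Image" (whitespace optional) appears in the dictionary of
# every image XObject. Stream dictionaries are stored uncompressed, so a
# plain byte scan can see them.
IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")


def count_image_xobjects(pdf_bytes: bytes) -> int:
    """Count image XObject dictionaries visible in the raw PDF bytes."""
    return len(IMAGE_XOBJECT.findall(pdf_bytes))


def looks_fragmented(pdf_bytes: bytes, threshold: int = 1000) -> bool:
    """Flag files with suspiciously many images (threshold is arbitrary)."""
    return count_image_xobjects(pdf_bytes) >= threshold
```

To use it on a file, read it in binary mode, e.g.
count_image_xobjects(open(path, "rb").read()). A document with tens of
thousands of hits is a candidate for the transparency problem above.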

PDF cannot be opened
=======================
Probable cause: use of new compression techniques in PDF 1.5 (e.g.
object streams, compressed Xref table)
Analysis: with knowledge of the PDF spec you can search directly for
keywords in the PDF file
Note: there is only Acrobat Reader 5 for Linux
Solution: re-generate the PDF, set compatibility to PDF 1.3 (Acrobat 4)
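
The keyword search mentioned above could look roughly like this. The
sketch assumes that object streams are marked with /Type /ObjStm and
compressed cross-reference tables with /Type /XRef, as defined in the
PDF 1.5 spec; the function names are made up for illustration. Note that
the version in the header can be overridden by a /Version entry in the
document catalog, which this sketch does not look at.

```python
import re

HEADER = re.compile(rb"%PDF-(\d)\.(\d)")
OBJECT_STREAM = re.compile(rb"/Type\s*/ObjStm\b")
XREF_STREAM = re.compile(rb"/Type\s*/XRef\b")


def pdf_header_version(pdf_bytes: bytes):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    match = HEADER.search(pdf_bytes[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pdf15_compression_features(pdf_bytes: bytes) -> dict:
    """Report which PDF 1.5 compression features appear in the raw bytes."""
    return {
        "object_streams": OBJECT_STREAM.search(pdf_bytes) is not None,
        "xref_stream": XREF_STREAM.search(pdf_bytes) is not None,
    }
```

A file that reports either feature is likely to fail in viewers that
only understand PDF 1.4 or older.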

PDF page orientation and print page orientation are different
=======================
Probable cause: confusion on the part of the user or the viewer
Analysis: check a few simple cases, learn from them
Solution: sometimes using another PDF->PS converter fixes this

Fonts are not correct or missing
=======================
Probable cause: fonts are not embedded inside the PDF file
Analysis: pdffonts (tool of xpdf) or tool.pdf.Info -fonts (Multivalent)
Solution: re-generate the PDF
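
The pdffonts check can be scripted for a whole directory of files. The
following is a sketch, not a tested tool: it assumes that every font
line of the pdffonts output ends with the columns emb, sub, uni and a
two-part object ID, and that font names contain no spaces. The function
names are made up for illustration.

```python
import subprocess


def parse_pdffonts(output: str) -> list:
    """Parse pdffonts text output into dicts with name/embedded/subset."""
    fonts = []
    for line in output.splitlines():
        # Split from the right: emb, sub, uni, object number, generation.
        parts = line.rsplit(None, 5)
        if len(parts) != 6 or parts[1] not in ("yes", "no"):
            continue  # header line, dashed rule, or unexpected text
        rest, emb, sub, _uni, _obj, _gen = parts
        fonts.append({
            "name": rest.split()[0],
            "embedded": emb == "yes",
            "subset": sub == "yes",
        })
    return fonts


def non_embedded_fonts(output: str) -> list:
    """Names of fonts that pdffonts reports as not embedded."""
    return [f["name"] for f in parse_pdffonts(output) if not f["embedded"]]


def run_pdffonts(pdf_path: str) -> str:
    """Run pdffonts (must be on PATH) and return its text output."""
    result = subprocess.run(["pdffonts", pdf_path],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling non_embedded_fonts(run_pdffonts(path)) then lists the fonts a
printer or viewer would have to substitute.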

Ghostscript errors with unknown fonts (invalidfont)
=======================
Probable cause: names of the PDF Base 14 fonts differ widely (for
example, SymbolMT is not recognized as Symbol)
Solution: update to a newer Ghostscript, e.g. to version 8.14

Viewer does not allow certain operations, such as printing
=======================
Probable cause: settings of the PDF document permissions
Analysis: see Document properties in AR
Solution: ask the creator of the PDF to provide a PDF without the
restrictions
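
For files protected with the standard security handler, the permissions
are stored as an integer in the /P entry of the encryption dictionary.
The sketch below decodes the four basic flags as defined in the PDF
spec (bit 3 print, bit 4 modify, bit 5 copy, bit 6 annotate, counting
from 1). The names are made up for illustration, and finding the /P
value in the file is left out.

```python
# Bit masks for the basic permission flags of the standard security
# handler. The spec numbers bits from 1, so "bit 3" is 1 << 2.
PERMISSION_BITS = {
    "print": 1 << 2,
    "modify": 1 << 3,
    "copy": 1 << 4,
    "annotate": 1 << 5,
}


def decode_permissions(p_value: int) -> dict:
    """Decode the /P integer (usually negative) into named booleans."""
    # Python treats negative ints as two's complement for bitwise AND,
    # so the signed value from the file can be used directly.
    return {name: bool(p_value & mask)
            for name, mask in PERMISSION_BITS.items()}
```

For example, decode_permissions(-44) reports that printing and copying
are allowed while modifying and annotating are not.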

Ralf

> Stefan Röhle
> Zentrum für Datenverarbeitung
> Johannes Gutenberg-Universität

PS. No wonder, that Gutenberg university has problems with printing
_digital_ files. ;-)

--
Ralf Koenig
Wissenschaftlicher Mitarbeiter an der
Professur Rechnernetze und verteilte Systeme
TU Chemnitz, Zi. 1/B320, Tel. 0371-531-1532

Andreas Lobinger

Nov 15, 2004, 5:02:43 AM

Stefan Röhle wrote:
> is it possible to scan pdf files for errors (syntax errors, pdf version,
> etc.)?

> Any suggestions?

Recent Ghostscript versions are quite good at reporting PDF errors.
The main problem with .pdf -> .ps is that you can have a
syntactically and structurally error-free .pdf that still isn't
suitable for conversion to PS. If you want to be sure, run gs and
look at the output.
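
Running gs over many files can be scripted. This is a minimal sketch,
assuming gs is on the PATH; it renders to the nullpage device so that no
output file is written, and the function names are made up for
illustration. The exit code alone may not tell the whole story, so the
text that gs prints is returned as well.

```python
import shutil
import subprocess


def gs_command(pdf_path: str, gs: str = "gs") -> list:
    """Build a Ghostscript command that interprets the PDF, discarding output."""
    return [gs, "-dBATCH", "-dNOPAUSE", "-sDEVICE=nullpage", pdf_path]


def check_with_ghostscript(pdf_path: str):
    """Run gs on the file; return (exit code, combined stdout and stderr)."""
    exe = shutil.which("gs")
    if exe is None:
        raise FileNotFoundError("Ghostscript (gs) not found on PATH")
    result = subprocess.run(gs_command(pdf_path, exe),
                            capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr
```

A clean run here only shows that Ghostscript could interpret the file;
it says nothing about how a particular printer will handle it.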

Wishing a happy day
LOBI

Ralf Koenig

Nov 15, 2004, 6:15:44 AM

Andreas Lobinger wrote:

> Stefan Röhle wrote:
>
>> is it possible to scan pdf files for errors (syntax errors, pdf
>> version, etc.)? Any suggestions?
>
>
> Recent ghostscript(s) are quite good in reporting pdf errors. The
> problem mainly with .pdf -> .ps is, that you can have a syntactically
> and structurally errorfree .pdf that still isn't suitable to
> transform to PS.

GS is the lo-tech solution we use here at the university as well. But we
are not too satisfied. It is quite good at detecting PDF documents that
are "really broken", but it is of little help for overly complex
documents or for non-errors such as missing fonts/wrong page size/wrong
orientation. Is that what you mean, or do you refer to PostScript's
nature as a programming language?

> If you want to be sure run gs and look at the output.

Well, when doing so, you can only be sure that GS processes the file
in the expected manner. The main problem is that the PS rasterizer in
GS is not the same as the RIP in the printer, esp. when it comes to
handling strange/uncommon features or very complex jobs. Jobs rejected
by GS may (in a few cases) be accepted by the printer and processed just
fine.

Another issue is the often big difference in resources between a PC
(where GS is run) and a PostScript printer. The workstation running GS
typically has much more compute power (MHz-wise) and memory than the RIP
in an average PostScript printer. Many printers that are about 4-5 years
old have 66MHz or 100MHz processors built in. You cannot directly
compare MHz numbers here (RISC processors used in PS printers are
usually quite good at their special purpose in their dedicated
environment), but you should get the point.

Example (overly complex document caused by an Acrobat bug, it contains
more than 20 000 small images):
Download size: 8MB
http://www-user.tu-chemnitz.de/~ralk/pdf_problems/broschuere.pdf

It is displayed quite well on a PC, but takes ages and ages (I cancelled
the job after half an hour on a LJ8500) to print on all but the latest
PS printers.

Ralf

Aandi Inston

Nov 15, 2004, 6:25:11 AM

Ralf Koenig <ralf....@informatik.tu-chemnitz.de> wrote:
>
>GS is the lo-tech solution we use here at university as well. But we are
>not too satisfied. It is quite good at detecting PDF documents that are
>"really broken", but it is of little help for overly complex documents
>or for non-errors such as missing fonts/wrong page size/wrong
>orientation. Is that what you mean or do you refer to PostScript's
>nature as a programming language?

Indeed, they are non-errors so why should any tool detect them (if
looking for errors)?

It sounds as if what you really need is pre-flighting. However, this
tends to be professional software targeted at print shops; I doubt the
free software community has identified any such need exists.
----------------------------------------
Aandi Inston qu...@dial.pipex.com http://www.quite.com
Please support usenet! Post replies and follow-ups, don't e-mail them.

john farrow

Nov 15, 2004, 6:48:12 PM

If you have Acrobat 6.0 Professional there is the Document -> Pre-flight
menu which gives you lots of options to validate the PDF.

regards

John Farrow

Visual Programming Ltd
mail: PO Box 22-222, Khandallah, Wellington, New Zealand
site: Level 2, 2 Ganges Road, Khandallah, Wellington, New Zealand
phone: +64 4 479 1738
fax: +64 4 479 1294
web: http://www.xmlpdf.com
"Stefan Röhle" <roe...@PLEASE-REMOVE-THISuni-mainz.de> wrote in message
news:cn9muu$po2$1...@news1.zdv.uni-mainz.de...

Stefan Lagotzki

Nov 16, 2004, 3:03:05 AM

Ralf Koenig <ralf....@informatik.tu-chemnitz.de>:
[...]

> Example (overly complex document caused by an Acrobat bug, it contains
> more than 20 000 small images):
> Download size: 8MB
> http://www-user.tu-chemnitz.de/~ralk/pdf_problems/broschuere.pdf

Maybe there is a problem with the very large bitmap in the
upper left corner. It could be traced (e.g. with potrace) into
EPS and converted into PDF.

Stefan


Ralf Koenig

Nov 16, 2004, 4:14:17 AM

Stefan Lagotzki wrote:

There definitely is a problem with this graphic: it is composed of
hundreds of small bitmap pieces of roughly 3x1 pixels each. That's what
I wanted to show!

Actually, this image is not very large (around 600x800 pixels). The
problem is that this was a GIF file with transparency. Due to what I
consider a bug in Acrobat, the generated PDF got unnecessarily complex.
And as the graphic is on every page, the same hundreds of images are
repeated on every page.

Adobe confirmed this issue here:
http://www.adobe.com/support/techdocs/328145.html