pdfrw problem reading strange PDF

Attila Tajti

Mar 6, 2010, 3:26:26 AM
to rst2pdf...@googlegroups.com
I have a strange PDF file that is not displayed properly in Mac OS X Preview but works in Adobe Reader and Foxit Reader on Windows.

I thought it would be interesting to find out what is wrong with it, so I sent it to PDFHarmony, an online PDF verifier. They emailed the PDF back to me with the format changed from PDF 1.3 to PDF 1.7. PDFHarmony is supposed to report any problems with the original PDF, but none were reported; according to it, the old PDF was perfect. In Adobe Reader and Foxit Reader, the original and the "harmonized" PDF look the same as far as I can tell.

Also, the "harmonized" PDF displays fine in Mac OS X Preview.

I started looking for an alternative to PDFHarmony so that I can "fix" those PDFs in a script. I want a solution that can be used on Mac OS X. So far I have tried the PDF import of OpenOffice.org Draw and of Inkscape. Both are basically adequate, but the imported files look a bit different, e.g. some text no longer uses boldface.

Finally I tried pdfrw, but the problem seems to be the same as with Mac OS X Preview: it cannot open the original 1.3 PDF, but reads the harmonized 1.7 version fine.

First of all, it fails because it cannot find the '%%EOF' marker: in my original PDF file there is neither an EOL nor whitespace at the end. (I do not know whether this is a problem, but it was easy to fix in pdftoken.py by adding the regex '$' as a token delimiter. That should be fine because the pattern is not multi-line in our case, so it will match only at EOF.)
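
As a rough illustration of the '$' delimiter idea (a minimal sketch, not the actual pdftoken.py code; the two patterns and the helper name are made up for this example), a comment token that must be terminated by an EOL is missed when the file ends right after '%%EOF', while the same pattern with '$' added as an alternative terminator finds it:

```python
import re

# Hypothetical patterns, NOT copied from pdftoken.py.
# STRICT: a comment token must be terminated by an end-of-line sequence.
# LENIENT: same, but end of input ('$') is accepted as a terminator too.
STRICT = re.compile(rb'%[^\r\n]*(?:\r\n|\r|\n)')
LENIENT = re.compile(rb'%[^\r\n]*(?:\r\n|\r|\n|$)')


def find_eof_marker(pattern, data):
    """Return True if the pattern yields a '%%EOF' comment token in data."""
    return any(m.group().strip() == b'%%EOF' for m in pattern.finditer(data))


# Tail of a file with no EOL or whitespace after the marker.
tail = b'startxref\n1234\n%%EOF'

print(find_eof_marker(STRICT, tail))   # strict pattern misses the marker
print(find_eof_marker(LENIENT, tail))  # lenient pattern finds it
```

Note that without re.MULTILINE, '$' can also match just before a single trailing newline at the very end of the input, which is harmless here.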

The next problem seems to be more fundamental, though: the first stream in my original PDF contains a '}' character, and parsing stops there instead of looking for the endstream keyword. I do not yet know what causes this; perhaps an encoding problem?

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ata/Scrap/pdfrw/pdfrw/pdfreader.py", line 184, in __init__
    self.update(self.readdict(source))
  File "/Users/ata/Scrap/pdfrw/pdfrw/pdfreader.py", line 117, in readdict
    value = self.readindirect(value, tok)
  File "/Users/ata/Scrap/pdfrw/pdfrw/pdfreader.py", line 53, in readindirect
    obj = self.special.get(obj, ordinary)(source, setobj, obj)
  File "/Users/ata/Scrap/pdfrw/pdfrw/pdfreader.py", line 117, in readdict
    value = self.readindirect(value, tok)
  File "/Users/ata/Scrap/pdfrw/pdfrw/pdfreader.py", line 53, in readindirect
    obj = self.special.get(obj, ordinary)(source, setobj, obj)
  File "/Users/ata/Scrap/pdfrw/pdfrw/pdfreader.py", line 111, in readdict
    value = special[value](source)
  File "/Users/ata/Scrap/pdfrw/pdfrw/pdfreader.py", line 96, in readarray
    value = self.readindirect(result.pop(), generation)
  File "/Users/ata/Scrap/pdfrw/pdfrw/pdfreader.py", line 53, in readindirect
    obj = self.special.get(obj, ordinary)(source, setobj, obj)
  File "/Users/ata/Scrap/pdfrw/pdfrw/pdfreader.py", line 117, in readdict
    value = self.readindirect(value, tok)
  File "/Users/ata/Scrap/pdfrw/pdfrw/pdfreader.py", line 54, in readindirect
    self.readstream(obj, source)
  File "/Users/ata/Scrap/pdfrw/pdfrw/pdfreader.py", line 81, in readstream
    assert endit == 'endstream endobj'.split(), endit
AssertionError: ['2\xe4\xb0\xee\x82\xa9\xd3\xaa', '}']


My question is: does it make sense to try to fix this reader, or is PDF 1.3 just too old to make it work in a reasonable amount of time? I have workarounds to view the PDFs anyway, but this problem is bugging me because I get a dozen or so PDFs from the same source in this format and I cannot look at them using my preferred solution. Perhaps trying to fix this in pdfrw would introduce bugs without solving any real problems.

-- Attila

Patrick Maupin

Mar 6, 2010, 10:09:20 AM
to rst2pdf...@googlegroups.com

Attila:

1) Most of the error-reporting in pdfrw is basically a development aid
for me -- when the thing throws an exception, it's because I didn't
understand the format, so I want to go back and update my
understanding of the format. So if we can convince ourselves that
removing a check (or especially, adding an alternate check like your
EOF check) is not a problem, I'm all for supporting more PDFs. That's
the goal; even if it is the PDF that is broken, I would like to be
able to read it unless that somehow conflicts with the ability to read
a correct one.

2) There may be some way with old PDFs to have two adjacent objects
where you don't need the endobj. That would actually be pretty easy
to support, if so.

3) I have a "Jython" development branch. Basically, one pdfrw user
wants to run pdfrw in a server environment where the only thing he has
available is Jython 2.2.1, which is old, ugly, and slow. I have been
working with his PDFs, so there are a few actual pdfrw bug fixes in
that branch that are not in the main branch. You might try that as a
starting point, as well.

4) Unfortunately, I don't have any very good testcases. Roberto had
lots of testcases for rst2pdf, so I automated the running of those
tests (so it's not a matter of lacking the technology, just a matter
of time, and not wanting to check 500 large ugly PDFs into
subversion).

5) I'm a bit swamped for time right now, but would certainly devote a
bit of time to helping debug this. Unfortunately, it may be a while
before I have the time to do all the other manual tests that convince
me to update the main pdfrw trunk with the fix (see my comments on
jython above).

6) If you want to create a new branch to test fixes in and make them
available, I'm certainly happy to give you commit access to the
repository. (In fact, look! I've already done it :-)

Since pdfrw is a new project, the bar is pretty low, and you've
already submitted one useful patch, so you're now one of the core
developers :-) All I ask is that you snapshot the trunk (or even
better, the Jython development branch, which has a few more bugfixes)
into your own branch and work there, and then we can discuss changes
before you trunk them, and that is really only because I don't yet
have an automated regression set up. (I think a regression with
multiple SMALL PDFs would be great, but I need to figure out how to
easily create small testcases for failures. Also, right now, I don't
commit to the trunk unless the rst2pdf regression runs OK.)

Thanks,
Pat

Attila Tajti

Mar 11, 2010, 8:49:24 AM
to rst2pdf...@googlegroups.com

On 6 Mar 2010, at 16:09, Patrick Maupin wrote:

> If you want to create a new branch to test fixes in and make them
> available, I'm certainly happy to give you commit access to the
> repository. (In fact, look! I've already done it :-)

Thank you, but it turned out that my PDFs are broken. (Which probably means the PDFHarmony web service is also somewhat broken, because it reported that my PDFs are fine.)

I found a tool that helped me understand the problem at: http://blog.didierstevens.com/programs/pdf-tools/

In my case the PDFs have a stream with an invalid /Length attribute. After I found out how to fix this attribute, Mac OS X Preview displayed the PDF correctly. I was lucky because the specified size was always bigger than the actual size, so I could pad the calculated value with spaces, keeping all references intact.
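
The padding trick can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the script I actually used: the function name is made up, /Length is assumed to be a direct integer (not an indirect reference such as "5 0 R"), and the stream data is assumed to run right up to the first 'endstream' keyword with no EOL in between, as in my files.

```python
import re


def pad_length_in_place(obj: bytes) -> bytes:
    """Rewrite a too-large direct /Length so the object keeps its byte size.

    Assumptions (hypothetical helper for illustration only): /Length is a
    plain integer, and the data ends exactly at the first 'endstream'.
    """
    length = re.search(rb'/Length\s+(\d+)', obj)
    begin = re.search(rb'stream\r?\n', obj[length.end():])
    data_start = length.end() + begin.end()
    data_end = obj.index(b'endstream', data_start)

    declared = length.group(1)
    actual = str(data_end - data_start).encode('ascii')
    if len(actual) > len(declared):
        # Padding only works when the corrected number is not wider.
        raise ValueError('corrected /Length needs more digits than the original')

    # Pad with spaces to the old width so no byte offset in the file moves.
    fixed = actual.ljust(len(declared))
    return obj[:length.start(1)] + fixed + obj[length.end(1):]


broken = b'4 0 obj\n<< /Length 1200 >>\nstream\nhello}worldendstream\nendobj\n'
patched = pad_length_in_place(broken)
print(patched)
```

Because the replacement has the same width as the old number, the offsets in the cross-reference table stay valid and nothing else in the file has to be touched.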

Once I fixed this Length attribute, pdfrw could read my files with my %%EOF patch.

I also found out that pdfrw likely fails to read streams in my PDF because of this. pdf-parser.py from the link above could not read my PDFs either, because they do not have whitespace before the 'endstream' marker. (A newline there is only recommended, not required.)

So I am not sure it makes sense to fix this in pdfrw, though it would be possible to cope with an incorrect /Length by replacing the assert with a search for 'endstream'.
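
For what it is worth, such a fallback could look roughly like this. It is a sketch only, not a patch against pdfreader.py: the function and its parameters are hypothetical, and it works on a plain bytes buffer rather than on pdfrw's token source.

```python
def read_stream_data(buf: bytes, start: int, declared: int) -> bytes:
    """Return stream data starting at `start`, tolerating a wrong /Length.

    Hypothetical helper: `start` is the offset of the first data byte and
    `declared` is the value of the stream's /Length entry.
    """
    end = start + declared
    # Trust /Length only if 'endstream' really follows the declared data
    # (optionally preceded by whitespace or an EOL).
    if buf[end:].lstrip(b'\r\n \t').startswith(b'endstream'):
        return buf[start:end]

    # Otherwise fall back to searching for the keyword. This can cut the
    # data short if the bytes 'endstream' happen to occur inside the
    # stream, and it keeps any EOL that precedes the keyword.
    found = buf.find(b'endstream', start)
    if found < 0:
        raise ValueError('no endstream keyword after stream data')
    return buf[start:found]


buf = b'stream\nhello}worldendstream\nendobj\n'
print(read_stream_data(buf, 7, 11))  # correct /Length
print(read_stream_data(buf, 7, 16))  # /Length too large, fallback search
```

The declared length is still preferred when it checks out, so correct PDFs would be read exactly as before.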

-- Attila
