Can't parse this PDF


Su Yuen Chin

Jan 15, 2012, 1:16:19 PM
to pdf-r...@googlegroups.com
Hi PDF-reader team,

I read on GitHub that if I get a MalformedPDFError, I should send my file to the maintainers of the gem via Google Groups.

I tried passing in the attached file, but I keep getting this error: PDF::Reader::MalformedPDFError: xref table not found at offset 46293 (R != xref)

This happens on the master branch and also on the latest stable branch :(. Please advise on how I can parse the PDF.

Thanks lots!

Regards,
Su Yuen

doc_sample.pdf

James Healy

Jan 16, 2012, 6:41:02 AM
to pdf-r...@googlegroups.com
Hi Su Yuen,

On 16 January 2012 05:16, Su Yuen Chin <suy...@gmail.com> wrote:
> I read on GitHub that if I get a MalformedPDFError, I should send my file to the maintainers of the gem via Google Groups.
>
> I tried passing in the attached file, but I keep getting this error: PDF::Reader::MalformedPDFError: xref table not found at offset 46293 (R != xref)
>
> This happens on the master branch and also on the latest stable branch :(. Please advise on how I can parse the PDF.

If you open the sample file in a text editor you'll see there's some
HTML-ish garbage at the top of the file that is confusing pdf-reader.

Other pdf reading apps probably have some smarts to handle this sort
of scenario, but pdf-reader is a little dumb.

I suggest re-saving the file and trying again.

cheers

James

Su Yuen, Chin

Jan 16, 2012, 1:04:09 PM
to PDF::Reader
Hey James!

Thanks for the reply! Not sure if my earlier reply got through so
sending again.

I opened the file in a text editor and saw <html> and <head></head>.
Is that the HTML garbage? When I removed it and saved the file, the
PDF showed up blank in the viewer.

Sorry, not familiar with PDF standards/structures so a bit lost on
what is HTML garbage. Hope you can point me in the right direction.

Thanks!

Regards,
Su Yuen

James Healy

Jan 16, 2012, 8:43:03 PM
to pdf-r...@googlegroups.com
Hi Su Yuen,

On 17 January 2012 05:04, Su Yuen, Chin <suy...@gmail.com> wrote:
> I opened the file in a text editor and saw <html> and <head></head>.
> Is that the HTML garbage? When I removed it and saved the file, the
> PDF showed up blank in the viewer.
>
> Sorry, not familiar with PDF standards/structures so a bit lost on
> what is HTML garbage. Hope you can point me in the right direction.

The byte offsets in a PDF file must be exact - just removing the HTML
may not be enough to fix the file. Can you re-save the file from its
original source?

A PDF file should always start with "%PDF".

James
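
To illustrate the "%PDF" point above, here is a minimal Ruby sketch that drops everything before the header marker. The helper name strip_leading_junk and the output file name are hypothetical; nothing here is part of pdf-reader. It can only help if the offsets stored in the file were computed before the junk was prepended, which is an assumption about this particular file.

```ruby
# Hypothetical helper (not part of pdf-reader): returns the bytes of +data+
# starting at the first "%PDF-" marker, or nil if no marker is present.
def strip_leading_junk(data)
  bytes = data.b                  # work on raw bytes, not characters
  start = bytes.index("%PDF-".b)  # byte position of the header marker
  return nil if start.nil?
  bytes[start..-1]
end

# Usage, with a hypothetical output name. Binary reads and writes matter:
# saving from a text editor may alter bytes and shift the offsets again.
#   raw   = File.binread("doc_sample.pdf")
#   fixed = strip_leading_junk(raw)
#   File.binwrite("doc_sample_fixed.pdf", fixed) if fixed
```

If the result still fails to parse, the stored offsets probably include the junk bytes, and they would need adjusting instead.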

Su Yuen, Chin

Jan 16, 2012, 10:52:27 PM
to PDF::Reader
Unfortunately I can't because this is a PDF generated by an external
service. The website I'm building works like this: User provides their
account info and password with that external service, and I have a
script that goes in and downloads the PDFs into our local server. We
then do the PDF scraping to extract certain data from the PDF and
store it in the database. :(

Hmm this means that I may have to deduct the number of bytes that
<html><head></head> is taking up from all the offset values in the
PDF?

Regards,
Su Yuen

James Healy

Jan 17, 2012, 3:58:07 AM
to pdf-r...@googlegroups.com
Hi Su Yuen,

On 17 January 2012 14:52, Su Yuen, Chin <suy...@gmail.com> wrote:
> Hmm this means that I may have to deduct the number of bytes that
> <html><head></head> is taking up from all the offset values in the
> PDF?

You could try patching the XRef#load_xref_table method to detect when
the xref table is offset by a few bytes and then add the same number
of bytes to all object offsets.

It might solve a particular case of corruption like what you're seeing
and I'd consider a patch provided it doesn't break anything in the
test suite.

James
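
The detection step James describes could be sketched roughly as follows. This works on the raw bytes only and does not touch pdf-reader internals; the helper name xref_shift is hypothetical, and the code assumes a classic "xref" table (not a cross-reference stream) and only looks at the last "startxref" entry in the file.

```ruby
# Hypothetical helper (not part of pdf-reader): returns how many bytes the
# xref table is displaced from the offset declared after "startxref".
# Returns nil if either keyword cannot be found.
def xref_shift(data)
  bytes = data.b
  start_kw = bytes.rindex("startxref".b)
  return nil if start_kw.nil?

  # The number following "startxref" is where the file claims the table is.
  declared = bytes[start_kw..-1][/startxref\s+(\d+)/n, 1]
  return nil if declared.nil?

  # Last standalone "xref" keyword before "startxref"; the lookbehind skips
  # the "xref" inside any earlier "startxref".
  actual = bytes.rindex(/(?<!start)xref/n, start_kw)
  return nil if actual.nil?

  actual - declared.to_i
end
```

A result of 0 means the declared offset is accurate. A positive result is the number of bytes that would have to be added to every object offset, under the assumption that all of them are displaced by the same amount.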
