Error: PDF does not contain EOF marker


Shri

Apr 12, 2011, 4:56:06 PM
to PDF::Reader
Hi,

I am getting a "PDF does not contain EOF marker" error.

I am using pdf-reader 0.8.6 and Ruby 1.8.6; I cannot upgrade to 1.8.7
due to some project issues.

My aim is to convert the PDF file contents into a text file and
validate a particular piece of text inside it.

Converting the PDF only up to a given line number, or until a particular
text is found, would also do, but it seems that
PDF::Reader.file(pdf_file_path, receiver) checks for the EOF marker
first and only then starts converting.

Could you please help me with this? It is very urgent for me.
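
One way to narrow this error down is to check the file itself before handing it to PDF::Reader. The sketch below is not part of pdf-reader; the method name and the 1024-byte search window are assumptions (many PDF readers look for %%EOF near the end of the file, and pdf-reader's own check may differ). If it returns false, the file may be truncated or may have extra data appended after the marker.

```ruby
# Report whether a file has the %%EOF marker near its end.
# Assumption: searching the last 1024 bytes, a common reader convention.
EOF_SEARCH_WINDOW = 1024

def pdf_has_eof_marker?(path)
  File.open(path, "rb") do |f|
    size = f.stat.size
    # Jump to the start of the search window (or the start of the file).
    f.seek([size - EOF_SEARCH_WINDOW, 0].max)
    tail = f.read.to_s
    tail.include?("%%EOF")
  end
end

# Usage (the path is a placeholder):
#   puts pdf_has_eof_marker?("test.pdf")
```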

James Healy

Apr 26, 2011, 4:37:08 AM
to pdf-r...@googlegroups.com
Hi,

Unfortunately I'm not supporting pdf-reader on 1.8.6 any more. The
latest gems will refuse to install on less than 1.8.7.

It's not my preferred approach, but unfortunately my time to support
PDF::Reader is limited. I test against 1.9.2 and 1.8.7, but the
significant changes introduced after 1.8.6 make supporting earlier
versions time consuming.

cheers

James


Shri

Apr 27, 2011, 4:49:14 PM
to PDF::Reader
Thanks James,
I have upgraded to Ruby 1.8.7 and I am now able to convert the PDF to text,
but the header section of my PDF (some text) is not getting converted:
the text file does not include the complete PDF text; it misses the
date located at the top left-hand side.

I want to validate this date inside the PDF against the expected date
(which I have).

Is there any other approach to validate a single piece of text in a PDF
file, other than converting it to text?

Appreciate your help!!

-Shri


Jon Kern

Apr 27, 2011, 10:56:20 PM
to pdf-r...@googlegroups.com
I use James', et al's, PDF library to get more control over what I parse
out of the PDF itself.

For example: I use it to ignore headers and footers by Y-location
(thanks to some vague tips from Jack R and perusing the damn PDF spec).
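
A rough sketch of the Y-location idea, for illustration only. It assumes you have already collected each run of text together with its position; the TextRun struct, the method name, and the band sizes are hypothetical and not part of PDF::Reader. In default PDF user space the origin is the bottom-left corner of the page, so header text has the largest Y values.

```ruby
# Hypothetical record for one run of text and where it was drawn.
TextRun = Struct.new(:text, :x, :y)

# Keep only runs that lie between the footer band and the header band.
# page_height, header_band and footer_band are in PDF points.
def body_runs(runs, page_height, header_band, footer_band)
  top = page_height - header_band
  runs.select { |run| run.y > footer_band && run.y < top }
end

runs = [
  TextRun.new("04/12/2011", 36, 770),      # near the top: header
  TextRun.new("Body paragraph", 36, 400),
  TextRun.new("Page 1 of 3", 280, 20)      # near the bottom: footer
]

# 792 points is the height of a US Letter page.
body = body_runs(runs, 792, 50, 50)
puts body.map { |run| run.text }.inspect   # ["Body paragraph"]
```

Flipping the comparison (keeping only runs above the top threshold) would pick out just the header, which is closer to what Shri needs for the date check.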

*BUT* for a straight dump of all text in the PDF document, you can try
another PDF reader: https://github.com/finalist/pdf_reader

p = PDFReader.new('test.pdf')
puts p.raw_text

It really is that simple. See if you can find your header... either by
brute force, or regex. If you can't find the header, see if you can
"select it" in the PDF itself and paste it into a text document... What
if it really isn't what you imagine?
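
The regex approach could look something like the sketch below. It only works on text that has already been extracted (for example the string returned by raw_text); the method name is made up for this example. It ignores differences in whitespace, since extracted PDF text often has odd spacing and line breaks.

```ruby
# Return true if +text+ contains +expected+, ignoring whitespace differences.
def text_contains?(text, expected)
  parts = expected.strip.split(/\s+/).map { |part| Regexp.escape(part) }
  pattern = Regexp.new(parts.join('\s*'))
  !!(text =~ pattern)
end

raw = "Report\n12  April\n2011 Page 1"
puts text_contains?(raw, "12 April 2011")   # true
puts text_contains?(raw, "13 April 2011")   # false
```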

FWIW: I have seen the wonkiest things when it comes to selecting text
near the top/bottom of the page.

When trying to intelligently parse PDFs, I have come to realize they are
evil beasts that make you think someone has a voodoo doll and is poking you.

jon
blog: http://technicaldebt.com
twitter: http://twitter.com/JonKernPA



Shripad

Apr 28, 2011, 1:53:25 AM
to pdf-r...@googlegroups.com, pdf-r...@googlegroups.com
Thanks Jon.
But while trying to install pdf_reader I get an error saying that no such
gem was found in any repository.

I am using the command
gem install finalist-pdf_reader

Am I doing anything wrong here?

Thanks
Shri

Jon Kern

Apr 28, 2011, 8:27:36 AM
to pdf-r...@googlegroups.com
Hmmm, weird

I even tried doing it with the source URL

ruby-1.8.7-p174[develop*]$ gem install finalist-pdf_reader --source https://github.com/finalist/pdf_reader.git
ERROR:  Could not find a valid gem 'finalist-pdf_reader' (>= 0) in any repository
ERROR:  While executing gem ... (Gem::RemoteFetcher::FetchError)
    bad response Internal Server Error 500 (https://github.com/finalist/pdf_reader.git/latest_specs.4.8.gz)

You could try two things:

1) Install & Build gem from Git (see here)
2) Just grab the source code from Git and drop it in your vendor or lib folder and run it like you own it/wrote it :-)

I'll see if I can leave the finalist guy a note that something is wonky with installing his gem.

Shri

Apr 28, 2011, 8:16:31 PM
to PDF::Reader
Thanks Jon

I have installed the gem using the command below; the same command is now
given on GitHub as well.

gem install pdf_reader

But after installing the gem, when I try to use the raw_text method the
output is blank, i.e. it is probably unable to read text from the PDF.

I have used the code below:

require 'pdf_reader'

p = PDFReader.new('test.pdf')
puts p.raw_text

I am sure that the path is correct; in fact the .rb file and the PDF are
in the same folder, so what is going wrong here?





Jon Kern

Apr 28, 2011, 9:58:34 PM
to pdf-r...@googlegroups.com
try this code: https://github.com/JonKernPA/mongo_examples/tree/master/pdf_reader

require 'rubygems'
require 'pdf_reader'

class SimpleReader
  # Reads a PDF that sits in the same directory as this script
  # and prints all of its text.
  def self.read_pdf(chart_name)
    p = PDFReader.new(File.join(File.dirname(__FILE__), chart_name))
    puts p.raw_text
  end
end

puts "Test"
SimpleReader.read_pdf("test.pdf")

As dumb as it seems, getting the "local" file to load requires more than meets the eye. Maybe there is a better way, but the way I show works :-)
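
On the "better way" question, one option is File.expand_path, which resolves a name relative to a given directory and returns an absolute path, so the result does not depend on the directory the script is started from. A small sketch; the helper name is made up for this example.

```ruby
# Build an absolute path to a file that sits next to a script.
# File.expand_path resolves +name+ relative to its second argument.
def path_beside_script(name, script = __FILE__)
  File.expand_path(name, File.dirname(script))
end

puts path_beside_script("test.pdf")
```

The path it returns can then be passed to PDFReader.new in place of the string concatenation.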


Shri

Apr 29, 2011, 5:08:54 PM
to PDF::Reader
Even with simple_reader.rb I am still getting blank output; no
text is extracted from the PDF.

Please suggest! Just to remind you, my aim is to validate a date (in text
format) inside the PDF against the expected one.

Thanks


Jon Kern

Apr 29, 2011, 5:20:58 PM
to pdf-r...@googlegroups.com
send along your PDF, maybe it is corrupt?


Jack Rusher

Apr 29, 2011, 8:43:22 PM
to pdf-r...@googlegroups.com
On 29 Apr, 2011, at 17:20, Jon Kern wrote:
> send along your PDF, maybe it is corrupt?

Some PDFs whose embedded fonts have no Unicode mapping contain no legible text runs. The easiest way to see if this is the case is to open the PDF in a viewer, select all (CTRL-A or command-A), then paste the selected text into a text editor (Notepad, TextEdit, whatever). If what appears in the editor is not what you see in the PDF viewer, you have one of these PDFs.
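
A scripted stand-in for the copy-and-paste test, offered only as a rough heuristic. It looks at whatever text a library returned and reports whether it is non-blank and mostly printable ASCII. The method name and the 0.8 threshold are arbitrary assumptions, and documents that are not mostly ASCII would need a different test.

```ruby
# Guess whether extracted text is usable: not blank, and mostly
# printable ASCII once whitespace is set aside.
def looks_legible?(text, threshold = 0.8)
  bytes = text.to_s.bytes.to_a
  # Ignore tab, newline, carriage return and space.
  visible = bytes.reject { |b| [9, 10, 13, 32].include?(b) }
  return false if visible.empty?
  printable = visible.count { |b| b > 0x20 && b < 0x7F }
  printable.to_f / visible.length >= threshold
end

puts looks_legible?("Invoice date: 04/12/2011")   # true
puts looks_legible?("  \n ")                      # false
puts looks_legible?("\x01\x02\x03\x04 a")         # false
```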


J.

Shri

May 2, 2011, 3:48:31 PM
to PDF::Reader
Thanks Jack!!
When I do CTRL-A and paste into Notepad, it works fine; all the
contents of the PDF are copied into the text file.

Shri

May 2, 2011, 3:50:09 PM
to PDF::Reader
simple_reader.rb works with the test PDF that you provided on
GitHub, but with my PDF the output is still blank.


Jon Kern

May 2, 2011, 3:59:19 PM
to pdf-r...@googlegroups.com
another thing to try...

if your PDF has multiple pages, try making a copy of it and deleting pages
via Adobe Reader or something similar, until you get a PDF that can be read.
Maybe there is some page with bogus stuff that is causing PDF parsing to fail.

not sure what else to tell you to try.

all i can say is that some PDFs seem to be better than others at being
parsable.
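
If the library in use can extract text one page at a time, the same hunt for a bad page can be scripted instead of deleting pages by hand. The sketch below does not call any real PDF library: the caller supplies a block that returns the text for a page number, and the helper (a hypothetical name) collects the pages where the block raises or returns nothing.

```ruby
# Call the block once per page number and collect the pages that fail,
# i.e. where extraction raises an error or yields blank text.
def problem_pages(page_count)
  (1..page_count).select do |number|
    begin
      text = yield(number)
      text.to_s.strip.empty?
    rescue StandardError
      true
    end
  end
end

# Stand-in data: page 2 is blank and page 3 fails to parse.
fake_pages = { 1 => "04/12/2011 Report", 2 => "", 3 => :broken }

bad = problem_pages(3) do |number|
  page = fake_pages[number]
  raise "parse error on page #{number}" if page == :broken
  page
end

puts bad.inspect   # [2, 3]
```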



James Healy

May 11, 2011, 8:28:18 AM
to pdf-r...@googlegroups.com
On 3 May 2011 05:50, Shri <shripad.d...@gmail.com> wrote:
> the simple_reader.rb is working with the test pdf that you povided on
> github but with my pdf its blank only.

Sorry about chiming in with this so late folks. The code at
https://github.com/finalist/pdf_reader will only work for very simple
PDFs, and even then it will often produce incorrect results for
non-ASCII characters.

It's a nice simple solution if it works for you, but it's not really a
general purpose PDF parser.

Shri, I suspect your issue with missing header text is related to Form
XObjects. They're a regular source of issues in PDF::Reader, and I
still haven't found a solution that works in all cases. Sorry about
that, hopefully I'll get to them soon.

James
