Unfortunately I'm not supporting pdf-reader on 1.8.6 any more. The
latest gems will refuse to install on less than 1.8.7.
It's not my preferred approach, but unfortunately my time to support
PDF::Reader is limited. I test against 1.9.2 and 1.8.7, but the
significant changes introduced after 1.8.6 make supporting earlier
versions time consuming.
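
A rough sketch of how a version floor like this behaves, using RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The `>= 1.8.7` string simply mirrors the cutoff described above. The exact constraint written in the pdf-reader gemspec is an assumption here, and `installable_on?` is a made-up helper name.

```ruby
require 'rubygems'

# Assumed constraint, mirroring the 1.8.7 floor mentioned above.
MINIMUM_RUBY = Gem::Requirement.new('>= 1.8.7')

# Hypothetical helper: true if the given Ruby version string meets the floor.
def installable_on?(ruby_version)
  MINIMUM_RUBY.satisfied_by?(Gem::Version.new(ruby_version))
end

puts installable_on?('1.8.6')
puts installable_on?('1.8.7')
puts installable_on?('1.9.2')
```
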
cheers
James
For example: I use it to ignore headers and footers by Y-location
(thanks to some vague tips from Jack R and perusing the damn PDF spec).
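
A minimal sketch of the Y-location idea, assuming you already have each piece of text together with its position. `TextRun` and `body_runs` are hypothetical names for illustration, not part of any PDF library, and the cutoff values are made up.

```ruby
# Hypothetical stand-in for a positioned piece of text; a real parser
# would supply its own object with similar fields.
TextRun = Struct.new(:x, :y, :text)

# Keep only runs whose Y coordinate lies strictly between the footer and
# header cutoffs. By default PDF Y coordinates grow upward from the
# bottom of the page, so headers have the largest Y values.
def body_runs(runs, footer_max_y, header_min_y)
  runs.select { |run| run.y > footer_max_y && run.y < header_min_y }
end

runs = [
  TextRun.new(72, 760, 'ACME Corp - Confidential'), # header
  TextRun.new(72, 600, 'First paragraph'),
  TextRun.new(72, 580, 'Second paragraph'),
  TextRun.new(72, 30, 'Page 1 of 3')                # footer
]

# Illustrative cutoffs for a 792-point-tall page.
puts body_runs(runs, 50, 740).map(&:text)
```

The cutoffs have to be tuned per document, since header and footer positions vary.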
*BUT* for a straight dump of all text in the PDF document, you can try
another pdf reader: https://github.com/finalist/pdf_reader
p = PDFReader.new('test.pdf')
puts p.raw_text
It really is that simple. See if you can find your header... either by
brute force, or regex. If you can't find the header, see if you can
"select it" in the PDF itself and paste it into a text document... What
if it really isn't what you imagine?
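
A small sketch of the regex approach, working on whatever string the text dump gave you. `find_lines` is a hypothetical helper, and the sample text and patterns are invented for illustration.

```ruby
# Return every line of the dumped text that matches the pattern,
# paired with its 1-based line number.
def find_lines(raw_text, pattern)
  matches = []
  raw_text.each_line.with_index(1) do |line, number|
    matches << [number, line.strip] if line =~ pattern
  end
  matches
end

sample = "Quarterly Report\nPrepared 04/27/2011\nTotals follow\n"

p find_lines(sample, /report/i)
p find_lines(sample, %r{\d{2}/\d{2}/\d{4}})
```

If neither pattern turns up anything in the real dump, that suggests the header text never made it into the extracted string in the first place.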
FWIW: I have seen the wonkiest things when it comes to selecting text
near the top/bottom of the page.
When trying to intelligently parse PDFs, I have come to realize they are
evil beasts that make you think someone has a voodoo doll and is poking you.
jon
blog: http://technicaldebt.com
twitter: http://twitter.com/JonKernPA
Shri said the following on 4/27/11 4:49 PM:
I am using command
gem install finalist-pdf_reader
Am I doing anything wrong over here...
Thanks
Shri
require 'rubygems'
require 'pdf_reader'

class SimpleReader
  def self.read_pdf(chart_name)
    p = PDFReader.new(File.dirname(__FILE__) + "/" + chart_name)
    puts p.raw_text
  end
end

puts "Test"
SimpleReader.read_pdf("test.pdf")
jon
blog: http://technicaldebt.com
twitter: http://twitter.com/JonKernPA
Shri said the following on 4/29/11 5:08 PM:
> even after simple_reader.rb also i am again getting blank only, no
> text is extracted from PDF.
>
> please suggest! just to remind my aim is to validate date (text
> format) inside PDF against expected.
>
> Thanks
>
> On Apr 28, 6:58 pm, Jon Kern<jonker...@gmail.com> wrote:
>> try this code: https://github.com/JonKernPA/mongo_examples/tree/master/pdf_reader
>>
>> require 'rubygems'
>> require 'pdf_reader'
>>
>> class SimpleReader
>>   def self.read_pdf(chart_name)
>>     p = PDFReader.new(File.dirname(__FILE__) + "/" + chart_name)
>>     puts p.raw_text
>>   end
>> end
>>
>> puts "Test"
>> SimpleReader.read_pdf("test.pdf")
>>
>>
>>
>> As dumb as it seems, getting the "local" file to load requires more than meets the eye. Maybe there is a better way, but the way I show works :-)
>>
>> jon
>> blog: http://technicaldebt.com
>> twitter: http://twitter.com/JonKernPA
>>
>> Shri said the following on 4/28/11 8:16 PM:
>>> Thanks Jon
>>>
>>> I have installed the gem using below command and the same is updated in GitHub also.
>>>
>>> gem install pdf_reader
>>>
>>> but after installing the gem when i try to use the raw_text method it is output as blank, i.e probably its unable to read text from PDF. i have used below code,
>>>
>>> require 'pdf_reader'
>>> p = PDFReader.new('test.pdf')
>>> puts p.raw_text
>>>
>>> here i am sure that the path is correct infact the rb file and pdf are in same folder, so whats going wrong here??
>>>
>>> On Apr 28, 5:27 am, Jon Kern <jonker...@gmail.com> wrote:
>>>> Hmmm, weird
>>>>
>>>> I even tried doing it with the source URL
>>>>
>>>> ruby-1.8.7-p174[develop*]$ gem install finalist-pdf_reader --source https://github.com/finalist/pdf_reader.git
>>>> ERROR: Could not find a valid gem 'finalist-pdf_reader' (>= 0) in any repository
>>>> ERROR: While executing gem ... (Gem::RemoteFetcher::FetchError)
>>>>     bad response Internal Server Error 500 (https://github.com/finalist/pdf_reader.git/latest_specs.4.8.gz)
>>>>
>>>> You could try two things:
>>>> 1) Install & Build gem from Git (see here)
>>>> 2) Just grab the source code from Git and drop it in your vendor or lib folder and run it like you own it/wrote it :-)
>>>>
>>>> I'll see if I can leave the finalist guy a note that something is wonky with installing his gem.
>>>>
>>>> jon
>>>> blog: http://technicaldebt.com
>>>> twitter: http://twitter.com/JonKernPA
>>>>
>>>> Shripad said the following on 4/28/11 1:53 AM:
>>>>> Thanks Jon. But while trying to install PDF_reader I am getting error as no such gem found in any of the repository.
>>>>> I am using command
>>>>> gem install finalist-pdf_reader
>>>>> Am I doing anything wrong over here...
>>>>> Thanks
>>>>> Shri
Some PDFs with non unicode-mapped embedded fonts contain no legible text runs. The easiest way to see if this is the case is to open the PDF in a viewer, select all (CTRL-A or command-A), then paste the selected text into a text editor (NotePad, TextEdit, whatever). If what appears in the editor is not what you see in the PDF viewer, you have one of these PDFs.
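
As a very rough programmatic companion to that manual check, here is a heuristic sketch that measures how much of an extracted string looks like ordinary text. `legible_ratio` is a hypothetical helper, it assumes the string is validly encoded, and it will miss PDFs whose unmapped glyphs happen to come out as plausible-looking letters, so it does not replace the copy-and-paste comparison.

```ruby
# Fraction of characters that are letters, digits, whitespace, or
# punctuation. A low value hints that the extracted text is noise
# rather than readable content.
def legible_ratio(text)
  return 0.0 if text.empty?

  legible = text.scan(/[[:alnum:][:space:][:punct:]]/).length
  legible.to_f / text.length
end

puts legible_ratio("Invoice dated 04/27/2011")
puts legible_ratio("\u0001\u0002\u0003\u0004")
```
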
J.
if your PDF is multiple pages, try making a copy of it, deleting pages
via adobe reader or smth, until you get a pdf that can be read. Maybe
there is some page with bogus stuff that is causing pdf parsing to fail.
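
A sketch of a scripted variation on that idea: instead of deleting pages by hand, try each page on its own and note which ones fail or come back empty. `problem_pages` is a hypothetical helper, and the block stands in for whatever per-page extraction call you actually have; the fake data below is only there to make the example run.

```ruby
# Hypothetical helper: yields each page number to the caller's block and
# returns the numbers of pages whose extraction raised an error or
# produced no text.
def problem_pages(page_count)
  (1..page_count).reject do |page_number|
    begin
      text = yield(page_number)
      !text.to_s.strip.empty?
    rescue StandardError
      false
    end
  end
end

# Stand-in data: page 2 is blank, page 3 blows up during parsing.
fake_pages = { 1 => 'Intro', 2 => '', 3 => 'Summary' }

p problem_pages(3) { |n| n == 3 ? raise('parse error') : fake_pages[n] }
```
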
not sure what else to tell you to try.
all i can say is that some PDFs seem to be better than others at being
parsable.
jon
blog: http://technicaldebt.com
twitter: http://twitter.com/JonKernPA
Shri said the following on 5/2/11 3:50 PM:
Sorry about chiming in with this so late folks. The code at
https://github.com/finalist/pdf_reader will only work for very simple
PDFs, and even then it will often produce incorrect results for non
ASCII characters.
It's a nice simple solution if it works for you, but it's not really a
general purpose PDF parser.
Shri, I suspect your issue with missing header text is related to Form
XObjects. They're a regular source of issues in PDF::Reader, and I
still haven't found a solution that works in all cases. Sorry about
that, hopefully I'll get to them soon.
James