Can 'PDF' files be read in Ruby?


Vijay

Aug 1, 2008, 4:57:02 PM
to Watir General
Hello people,

In our project, which we are trying to automate with Watir, we need to
read and check the contents of a 'PDF report' that comes embedded in an
'IE' window like the following:

<HTML><HEAD></HEAD>

<BODY leftMargin=0 topMargin=0 scroll=no>
<EMBED src=http://192.1.2.24:10041/servlets/elite/shared/attachment/N/RECEIPT_08012008_014938.pdf
width="100%" height="100%" type=application/pdf fullscreen="yes">
</BODY></HTML>


Can the contents of this file be read in Ruby after saving it to the
hard drive? When we use 'File.read', Ruby outputs what looks like junk,
such as the following:

%PDF-1.4
1 0 obj <</Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj <</Type /Pages /Count 1 /Kids [3 0 R] /MediaBox [0 0 792 612]>>
endobj
3 0 obj <</Type /Page /Parent 2 0 R /Resources 4 0 R /Contents 6 0 R>>
endobj
4 0 obj <</ProcSet 5 0 R /Font 100 0 R>>......

Thanks for your time,
Vijay.
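
For context, the output shown above is the raw PDF file structure rather than junk: the file starts with a "%PDF-1.4" header followed by numbered objects, and the page text itself is typically stored in compressed streams, which is why File.read alone does not yield readable text. A minimal sketch that only confirms a saved file is a PDF and reports its version (the helper name pdf_version is made up for illustration, and a Ruby version with String#b is assumed):

```ruby
# Returns the version from a PDF header such as "%PDF-1.4", or nil if the
# data does not start with one. `pdf_version` is a hypothetical helper name.
def pdf_version(data)
  # Work on a binary copy of the first bytes so encoding never matters.
  header = data.byteslice(0, 16).to_s.b
  match = header.match(/\A%PDF-(\d+\.\d+)/)
  match ? match[1] : nil
end

# Possible usage (the path is a placeholder):
#   pdf_version(File.binread('c:\\file.pdf'))
```

This does not extract any text; it only shows what File.read was returning.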

juuser

Aug 2, 2008, 5:09:05 AM
to Watir General
Hi.

We are using http://pdf-toolkit.rubyforge.org/ to extract text from
the PDF and then compare it to the expected text.

One way would be to use that pdftotext method; the other way would be
to use system("pdftotext.exe -layout input.pdf output.txt")

You might need to download some other tools in addition to PdfToolkit
to get it working (xpdf and pdftotext for win32, or something similar).
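
A minimal sketch of the "compare to expected text" step, assuming the text has already been extracted into a Ruby string by one of the two routes above. The helper names normalize_text and missing_phrases are made up for illustration; whitespace is collapsed first because extracted text often differs from the expected text only in spacing and line breaks.

```ruby
# Collapses every run of whitespace (spaces, tabs, newlines) to one space.
def normalize_text(text)
  text.gsub(/\s+/, ' ').strip
end

# Returns the expected phrases that do NOT occur in the extracted text.
# An empty result means every expected phrase was found.
def missing_phrases(extracted_text, expected_phrases)
  haystack = normalize_text(extracted_text)
  expected_phrases.reject { |phrase| haystack.include?(normalize_text(phrase)) }
end

# Possible usage (the phrases are placeholders):
#   missing = missing_phrases(text, ['Receipt No: 1234', 'Total: 10.00'])
#   raise "Missing in PDF: #{missing.inspect}" unless missing.empty?
```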

Sameh

Aug 4, 2008, 2:26:19 AM
to Watir General
Hey Vijay,
The process is a little complicated and not so straightforward.

First you will need to download pdftk. Download these files and
extract them into the C:\windows\system32 folder.
http://www.accesspdf.com/article.php/20041130153545577

Secondly you will need to download and install xpdf:
http://pdf-toolkit.rubyforge.org/
Extract those files into the C:\windows\system32 folder as well.

Then you will need the PDF::Toolkit gem. This can be found here:
http://rubyforge.org/projects/pdf-toolkit/

Basically this will convert the pdf to a textfile and you can do what
you like with it. In the following example I have just read a file on
my c:\ and displayed it using the 'puts' command.


require 'rubygems'
require 'pdf/toolkit'

my_pdf = PDF::Toolkit.open("c:\\file.pdf")
text = my_pdf.to_text.read
puts text

I hope that helps a little.
Cheers
Sameh.

Vijay

Aug 8, 2008, 8:04:23 PM
to Watir General
Thanks Sameh and Juuser for your valuable and crystal-clear replies.
The code you provided worked like a gem. Now we are able to
read 'pdf' files using Ruby.

However, there is a small obstacle. The application that we are
trying to automate with Watir has a lot of 'Modal Dialogs', so we
are using 'Ruby 1.8.2', which, according to the instructions given at
"http://wtr.rubyforge.org", is the only version that supports these
dialogs. The 'pdf-toolkit' code works perfectly with Ruby 1.8.5 (the
latest but one version of Ruby), but it does not work with
'Ruby 1.8.2'. The code throws an error, something like "undefined
method gem", in one of its internal files.

So I was wondering whether it is possible to change the value of the
'Path' environment variable through a 'DOS' command so that it points
to the 'Ruby 1.8.5' installation on the same computer, and then run
this 'pdf_read' program. Is this (having two versions of Ruby
installed on the computer and switching between them whenever
needed) possible? Or is there any other way to get round this?

Thanks again,
Vijay.


juuser

Aug 11, 2008, 12:17:09 PM
to Watir General
Try something like this, so that the only thing you need is
pdftotext.exe.

First, try it on your command line to see if that works correctly:
pdftotext.exe -layout input.pdf output.txt

or omit the -layout switch...

Now, in Ruby just call the exe with:

raise "failed!" unless system("pdftotext.exe -layout input.pdf output.txt")

Now you can just read the output from the file, for example:
data = File.readlines("output.txt")

This should solve your problem with the different Ruby versions.

Hope this helps.
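
The steps above can be wrapped in a small helper. This is only a sketch under a few assumptions: pdftotext is on the PATH, the helper names pdftotext_command and extract_pdf_text are made up, and keyword arguments require a much newer Ruby than the 1.8.x discussed in this thread. Passing the arguments to system as separate strings avoids quoting problems with paths that contain spaces.

```ruby
# Builds the argument list for pdftotext. `exe` defaults to 'pdftotext';
# on Windows 'pdftotext.exe' can be passed instead.
def pdftotext_command(input_pdf, output_txt, layout: true, exe: 'pdftotext')
  args = [exe]
  args << '-layout' if layout # keep the physical layout of the page
  args + [input_pdf, output_txt]
end

# Runs pdftotext and returns the extracted text as a string.
# Raises if the tool cannot be started or exits with an error.
def extract_pdf_text(input_pdf, output_txt, **options)
  command = pdftotext_command(input_pdf, output_txt, **options)
  ok = system(*command) # nil if the program was not found, false on failure
  raise "pdftotext failed (#{command.join(' ')})" unless ok
  File.read(output_txt)
end

# Possible usage (file names are placeholders):
#   text = extract_pdf_text('input.pdf', 'output.txt')
```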

Vijay

Aug 27, 2008, 7:43:52 AM
to Watir General
Thank you so much, juuser. Your reply has solved this problem.
pdftotext.exe works perfectly from a command prompt, so we have
used the 'system' command in Ruby 1.8.2.

Thanks once again for your great help.

Vijay.

Wesley Chen

Mar 25, 2009, 11:02:36 AM
to watir-...@googlegroups.com
Hi, Juuser,
Thank you very much for this post, but when I run the command:
system("pdftotext.exe -layout c:\\hello.pdf c:\\test.txt")
I get this warning message:
Error (0): PDF file is damaged - attempting to reconstruct xref table...
The resulting test.txt also has an unexpected character at the end.
Please see the attached file.

Is there any way to avoid this?

Thanks.
Wesley Chen.
unexpected.png
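
The attachment is not preserved here, so the following is only a guess at the cause. pdftotext writes a form feed character ("\f") at page breaks, and that can show up as an odd symbol at the end of the output file. If that is the character in question, it can be removed after reading the file; pdftotext also has a -nopgbrk switch that is meant to suppress page breaks. Neither option addresses the xref warning, which concerns the PDF file itself. The helper name strip_page_breaks is made up for illustration.

```ruby
# Removes the form feed characters that pdftotext inserts at page breaks.
def strip_page_breaks(text)
  text.delete("\f")
end

# Possible usage (the path is taken from the command above):
#   text = strip_page_breaks(File.read('c:\\test.txt'))
```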

Gauri Kuwar

Sep 30, 2015, 9:23:29 PM
to Watir General, cjq...@gmail.com
Hello All,
The code for reading through a PDF was really helpful.
However, we need to compare two PDFs, both text and appearance, and save the differences, if any.
Can anybody suggest how to automate this scenario using Ruby?

Thanks 
Gauri
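
For the text half of this question, one possible approach is to extract the text of each PDF first (for example with pdftotext, as earlier in the thread) and then compare the two texts line by line. The sketch below assumes both texts are already available as strings; text_differences is a made-up helper name, and the comparison is a simple position-by-position check rather than a full diff algorithm, so an inserted line will make every following line show up as changed. Comparing appearance would need a different technique and is not covered here.

```ruby
# Compares two texts line by line and returns one entry per line number
# where they differ. A missing line on either side is reported as nil.
def text_differences(text_a, text_b)
  lines_a = text_a.lines.map(&:chomp)
  lines_b = text_b.lines.map(&:chomp)
  count = [lines_a.length, lines_b.length].max

  (0...count).each_with_object([]) do |i, diffs|
    next if lines_a[i] == lines_b[i]
    diffs << { line: i + 1, first: lines_a[i], second: lines_b[i] }
  end
end

# Possible usage, saving the differences to a file (names are placeholders):
#   diffs = text_differences(File.read('first.txt'), File.read('second.txt'))
#   File.write('differences.txt', diffs.map(&:inspect).join("\n"))
```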

Super Kevy

Oct 1, 2015, 9:34:47 AM
to Watir General, cjq...@gmail.com
Why not just use Ruby's FileUtils file compare on the *.pdf files?

BTW: other readers that are 100% gems are 

Gauri Kuwar

Oct 4, 2015, 11:58:44 AM
to Watir General, cjq...@gmail.com
Thanks Kevy!!! Actually I did install the pdf-reader and inspector gems. These gems are really great when we work on a single PDF document. However, for a comparison between two PDF files, I am not aware how these gems can be used. Can you please help me understand how FileUtils can be used to compare two PDF files?
Thanks for your reply...

Regards
Gauri

Super Kevy

Oct 5, 2015, 9:32:06 AM
to Watir General, cjq...@gmail.com
Simple syntax

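require 'fileutils'

# Placeholder paths; replace them with the two files to compare.
file_name1 = 'first.pdf'
file_name2 = 'second.pdf'

# compare_file returns true only if both files have identical contents.
if FileUtils.compare_file(file_name1, file_name2)
  puts 'Pass'
else
  puts 'Fail'
end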
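
One thing to keep in mind with this approach: FileUtils.compare_file compares file contents byte for byte, so two PDFs that look the same on screen will still be reported as different if anything inside the files differs, such as an embedded date. The sketch below illustrates this with two small temporary files; same_bytes? is a made-up helper name and the file contents are invented sample data, not real PDFs.

```ruby
require 'fileutils'
require 'tempfile'

# Writes each string to its own temporary file and reports whether
# FileUtils.compare_file considers the two files identical.
def same_bytes?(content_a, content_b)
  Tempfile.create('first') do |file_a|
    Tempfile.create('second') do |file_b|
      file_a.binmode
      file_b.binmode
      file_a.write(content_a)
      file_b.write(content_b)
      file_a.flush
      file_b.flush
      FileUtils.compare_file(file_a.path, file_b.path)
    end
  end
end

# Two files that differ only in one embedded date are not "the same":
#   same_bytes?("%PDF-1.4\n% made 2015-10-01\n", "%PDF-1.4\n% made 2015-10-04\n")
```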