Re: [selenium-users] need script to Read data from a PDF file using Web Driver


Mark Collin

Apr 24, 2013, 9:22:33 AM
to seleniu...@googlegroups.com
You can't.

WebDriver interacts with HTML pages through a browser. It cannot interact with a PDF.

If you want to test PDFs, you would have to download the PDF and use a PDF library to open it and query it.

On 24/04/2013 11:02, Vishi wrote:

I need to read data from a PDF file using WebDriver.

Let us suppose the PDF contains "User Name", "Address", "Date of Birth", etc.

Now I want to fetch that information using WebDriver.

It would be really helpful if anyone knows a solution.

-Vishi

--
You received this message because you are subscribed to the Google Groups "Selenium Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to selenium-user...@googlegroups.com.
To post to this group, send email to seleniu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msg/selenium-users/-/y7ZwAosJ68oJ.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

ash

Apr 24, 2013, 9:34:43 AM
to seleniu...@googlegroups.com
Although I haven't tested a PDF file, I have tested an XML file with a workaround like the one below, and you could try the same.

1. Configure your browser (Firefox, for example) to open PDF files in the browser window instead of in an external application or as a download.
2. Once the PDF opens in the browser window, you could probably get the entire HTML source and assign it to a string.
3. Then you could parse that string to test your data.
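
Step 3 could be sketched roughly as below, once the text is in a string (whether it came from the page source or from a PDF library). The "Label: value" line layout and the class name PdfFieldParser are assumptions made for illustration; the real document may lay out its fields quite differently, so treat this as a starting point only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfFieldParser {

    // Assumed (hypothetical) layout: one "Label: value" pair per line of text.
    private static final Pattern FIELD = Pattern.compile(
            "^\\s*(User Name|Address|Date of Birth)\\s*:\\s*(.+?)\\s*$",
            Pattern.MULTILINE);

    /** Returns the labelled fields found in the text, in order of appearance. */
    static Map<String, String> parseFields(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from any real PDF.
        String text = "User Name: Vishi\n"
                + "Address: 12 Example Street\n"
                + "Date of Birth: 01/01/1990\n";
        System.out.println(parseFields(text));
    }
}
```

A test could then assert on parseFields(text).get("User Name") rather than searching the raw string.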


Hope that helps!




Vishi

Apr 24, 2013, 9:41:15 AM
to seleniu...@googlegroups.com
Thanks, ash.

Good idea, I will try it and let you know soon.

ash

Apr 24, 2013, 9:47:55 AM
to seleniu...@googlegroups.com
No problem. By the way, I just tried getting the PDF's HTML source into a string and was able to do it successfully. That means you could test your data across the entire PDF file, since the HTML source is saved in a string variable.

Like someone said, nothing is IMPOSSIBLE with Selenium as long as you have the page source ;-).

Hope that helps!


ARK Satyanarayana Raju

Apr 25, 2013, 5:18:03 AM
to seleniu...@googlegroups.com
Hi,

I found this on the internet; try this.
WebDriver (Selenium 2): Extract text from a PDF file using Java

Verifying PDF content is also part of testing, but WebDriver (Selenium 2) has no
direct methods to achieve this.
If you would like to extract PDF content, you can use the Apache PDFBox API.
Download the JAR files and add them to your Eclipse classpath. Then you are ready to extract
text from a PDF file... :)
Here is a sample script which will extract text from the PDF file below.
http://www.votigo.com/pdf/corp/CASE_STUDY_EarthBox.pdf

import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.URL;
import java.util.concurrent.TimeUnit;

import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.Reporter;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

public class ReadPdfFile {

    WebDriver driver;

    @BeforeTest
    public void setUpDriver() {
        driver = new FirefoxDriver();
        Reporter.log("I am done");
    }

    @Test
    public void start() throws IOException {
        driver.get("http://votigo.com/overview_collateral.pdf");
        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
        URL url = new URL(driver.getCurrentUrl());

        // The PDF bytes are fetched directly from the URL, not through the browser.
        BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());
        try {
            // parse() reads the stream and populates the COSDocument object,
            // which is the in-memory representation of the PDF document.
            PDFParser parser = new PDFParser(fileToParse);
            parser.parse();

            // getPDDocument() returns the PD document that was parsed. When you are
            // done with this document you must call close() on it to release resources.
            PDDocument document = parser.getPDDocument();
            try {
                // PDFTextStripper takes a PDF document and strips out all of the
                // text, ignoring the formatting and such.
                String output = new PDFTextStripper().getText(document);
                System.out.println(output);
            } finally {
                document.close();
            }
        } finally {
            fileToParse.close();
        }
        driver.manage().timeouts().implicitlyWait(100, TimeUnit.SECONDS);
    }
}



Here is the output of the above program:
EarthBox a Day Giveaway
Objectives
EarthBox wanted to engage their Facebook
audience with an Earth Day promotion that would
also increase their Facebook likes. They needed a
simple solution that would allow them to create a
sweepstakes application themselves.
Solution
EarthBox utilized the Votigo
platform to create a like-
gated sweepstakes. Utilizing a
theme and uploading a custom graphic they
were able to create a branded promotion.


Details
• 1 prize awarded each day for the entire Month of April
• A grand prize given away on Earth Day
• Daily winner announcements on Facebook
• Promoted through email newsletter blast

Results (4 weeks)
• 6,550 entries

Facebook
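
Note that the extracted text wraps wherever the PDF layout breaks a line, so a sentence can span several output lines. A test that looks for a phrase may want to collapse whitespace first. Here is a minimal sketch of that idea using only the standard library; the class and method names (ExtractedTextCheck, containsPhrase) are made up for this example.

```java
class ExtractedTextCheck {

    /** Collapses every run of whitespace (including line breaks) into one space. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** True if the phrase occurs in the text once line wrapping is ignored. */
    static boolean containsPhrase(String extracted, String phrase) {
        return normalize(extracted).contains(normalize(phrase));
    }

    public static void main(String[] args) {
        // Three lines copied from the sample output, with their line breaks.
        String extracted = "EarthBox wanted to engage their Facebook\n"
                + "audience with an Earth Day promotion that would\n"
                + "also increase their Facebook likes.";
        System.out.println(containsPhrase(extracted,
                "their Facebook audience with an Earth Day promotion"));
    }
}
```

This does not rejoin words hyphenated across lines (such as "like-" / "gated" in the output), so those would still need separate handling.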

Mark Collin

Apr 26, 2013, 1:48:08 AM
to seleniu...@googlegroups.com

I would be very surprised if you did, because PDF is not an HTML-based format:

 

http://en.wikipedia.org/wiki/Portable_Document_Format#File_structure

 

XML is easy to pull down through Selenium because the browser renders it in the same way as HTML (unless of course you are using IE, which adds markup). If your browser has a built-in PDF reader that renders the PDF as HTML, then it may be possible to pull the source out as a string, but then you are not looking at the original PDF source, only at whatever the browser converted it into.
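
One rough way to tell whether a string or download really is the original PDF rather than a browser's HTML rendering is to look for the "%PDF-" header that a PDF file begins with. The sketch below assumes the header sits at the very first byte; the class name PdfSignature is invented for this example.

```java
import java.nio.charset.StandardCharsets;

class PdfSignature {

    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** True if the data starts with the "%PDF-" header that opens a PDF file. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfStart = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlStart = "<html><body>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdfStart));
        System.out.println(looksLikePdf(htmlStart));
    }
}
```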

 

If you really want to test it properly, download it and either MD5 hash it and compare that to an MD5 hash of a known good copy, or load it using an external library like PDFBox. I’m a big believer in using the right tool for the right job, and Selenium is most definitely not the right tool for working with PDF files.
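
For the MD5 route, something along these lines might do, using only java.security.MessageDigest from the standard library. The class name Md5Compare and the file paths passed on the command line are placeholders, and the sketch assumes both files fit comfortably in memory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5Compare {

    /** Returns the MD5 digest of the data as a lowercase hex string. */
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** True if both byte arrays hash to the same MD5 value. */
    static boolean sameHash(byte[] downloaded, byte[] knownGood) {
        return md5Hex(downloaded).equals(md5Hex(knownGood));
    }

    // Usage: java Md5Compare <downloaded.pdf> <known-good.pdf>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: Md5Compare <downloaded.pdf> <known-good.pdf>");
            return;
        }
        byte[] downloaded = Files.readAllBytes(Paths.get(args[0]));
        byte[] knownGood = Files.readAllBytes(Paths.get(args[1]));
        System.out.println(sameHash(downloaded, knownGood) ? "match" : "different");
    }
}
```

A hash only matches when the files are byte-for-byte identical, so a PDF that is regenerated on each request (with a new creation date, say) would likely fail this check even if its visible content is unchanged. In that case comparing extracted text is probably the better fit.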

Krishnan Mahadevan

Apr 26, 2013, 2:51:02 AM
to Selenium Users
Perhaps something from here can be considered instead of WebDriver: http://java-source.net/open-source/pdf-libraries

Thanks & Regards
Krishnan Mahadevan

"All the desirable things in life are either illegal, expensive, fattening or in love with someone else!"
My Scribblings @ http://wakened-cognition.blogspot.com/

ARK Satyanarayana Raju

Apr 26, 2013, 3:34:09 AM
to seleniu...@googlegroups.com
Hi Vishi,

I already sent you code showing how to get the PDF data.

Download the PDF jar from here: PDF Jar

Add this jar to your project, then take the following code. I tried it and it's working fine.

import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.URL;
import java.util.concurrent.TimeUnit;

import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.Reporter;
import org.testng.annotations.AfterTest;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

public class PDFDataRead {

    WebDriver driver;

    public static String latestwindowid;

    @BeforeTest
    public void open() {
        driver = new FirefoxDriver();
        driver.manage().window().maximize();
        Reporter.log("I am done");
    }

    @AfterTest
    public void close() {
        driver.quit();
    }

    @Test
    public void start() throws IOException {
        // Load the PDF first so that getCurrentUrl() returns its address
        // (same sample URL as in the earlier listing in this thread).
        driver.get("http://votigo.com/overview_collateral.pdf");
        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
        URL url = new URL(driver.getCurrentUrl());

        BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());
        try {
            PDFParser parser = new PDFParser(fileToParse);
            parser.parse();

            // Close the parsed document when done to release its resources.
            PDDocument document = parser.getPDDocument();
            try {
                String output = new PDFTextStripper().getText(document);
                System.out.println(output);
            } finally {
                document.close();
            }
        } finally {
            fileToParse.close();
        }
        driver.manage().timeouts().implicitlyWait(100, TimeUnit.SECONDS);
    }
}


Thanks,
Raju

ARK Satyanarayana Raju

Apr 26, 2013, 3:37:07 AM
to seleniu...@googlegroups.com
Hi,

I made a typo. Remove this line from the code:
"public static String latestwindowid;"

Thanks
Raju

Madan Singh

Oct 18, 2013, 1:03:02 AM
to seleniu...@googlegroups.com
Hi Ash, could you help me? I am using C# with WebDriver. I have a link from which I download a PDF file, and I want to match some text after reading the PDF file.

Thanks in advance,

Madan
M P Singh
9971360313

Kannan Venkatesan

Oct 24, 2013, 9:56:18 AM
to seleniu...@googlegroups.com
Hi,

I need to read the data from a PDF file that is shown as part of the webpage inside a frame.

How do I go about this?

Thanks,
Kannan 

sirus tula

Oct 24, 2013, 11:11:56 AM
to seleniu...@googlegroups.com
Follow the directions as stated above.

1. Configure your browser (Firefox, for example) to open PDF files in the browser window instead of in an external application or as a download.
2. Once the PDF opens in the browser window, you could probably get the entire HTML source and assign it to a string.
3. Then you could parse that string to test your data.


Hope that helps!




--
 
- "If you haven't suffered, you haven't lived your life."
 
Thanks,
 
Sirus

serv...@seleniummaster.com

Jan 22, 2014, 5:35:20 PM
to seleniu...@googlegroups.com
I wrote a Java Maven project to compare two PDF files hosted on a web server. Check this site out: http://seleniummaster.com/sitecontent/index.php/selenium-test-automation-with-java/165-comparing-pdf-documentations-with-selenium
If the link does not work, go to http://www.seleniummaster.com -> Selenium Test Automation With Java ->
and open the article named "Comparing Pdf Documentations With Selenium".
Thanks.

Oscar Rieken

Jan 23, 2014, 7:44:55 AM
to seleniu...@googlegroups.com
You wouldn't use WebDriver to read the PDF. I suggest looking for a tool to help you do that in whatever programming language you are using. Then take that data and compare it to whatever your source data is.


Usha

Apr 23, 2014, 2:30:17 AM
to seleniu...@googlegroups.com
This doesn't work if the PDF is opened in the Chrome browser. Is there any other way to read the page source of a PDF opened in Chrome?

On Wednesday, April 24, 2013 7:04:43 PM UTC+5:30, Ash wrote:

Krishnan Mahadevan

Apr 23, 2014, 3:16:54 AM
to Selenium Users
Please try extracting the URL of the PDF file from within the webpage, then download it outside of WebDriver and use Java libraries such as PDFBox to work with the PDF.
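
As a rough sketch of the first two steps, assuming you already have the page HTML as a string (for example from the driver's page source) and that the PDF is referenced by an href or src attribute ending in .pdf: the class name PdfLinkFetcher, the regular expression, and the example.com addresses are all illustrative, not part of any library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfLinkFetcher {

    // Matches href="...pdf" or src="...pdf", optionally followed by a query string.
    private static final Pattern PDF_ATTR = Pattern.compile(
            "(?:href|src)\\s*=\\s*[\"']([^\"']+?\\.pdf(?:\\?[^\"']*)?)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Finds the first PDF reference in the HTML and resolves it against the page URI. */
    static Optional<URI> findPdfLink(String html, URI pageUri) {
        Matcher m = PDF_ATTR.matcher(html);
        return m.find() ? Optional.of(pageUri.resolve(m.group(1))) : Optional.empty();
    }

    /** Downloads the PDF outside of WebDriver and saves it to the target path. */
    static void download(URI pdfUri, Path target) throws IOException {
        try (InputStream in = pdfUri.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page content and address, for illustration only.
        String html = "<a href=\"/docs/report.pdf\">Report</a>";
        URI page = URI.create("http://example.com/account/index.html");
        System.out.println(findPdfLink(html, page));
    }
}
```

The same lookup would also pick up a frame or embed whose src points at a PDF. A plain download like this does not carry the browser's session cookies, so a PDF behind a login would probably need those passed along as well.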


Thanks & Regards
Krishnan Mahadevan

"All the desirable things in life are either illegal, expensive, fattening or in love with someone else!"
My Scribblings @ http://wakened-cognition.blogspot.com/
My Technical Scribblings @ http://rationaleemotions.wordpress.com/

