How to retrieve text content from PDF file by itext?

Rui Chang

Apr 28, 2005, 5:31:39 AM

Hi guys,

I am currently working on a project that requires retrieving text
streams from a PDF file. I have read through some threads about the
iText library, but I am quite new to converting PDF text to
objects. First, can anyone tell me whether it is possible to achieve my
goal with iText? (Namely with the PdfReader class; are there other
classes I need for this?) Second, could anyone give me a detailed code
example showing how to write the simplest possible PDF-to-text parser
with the PdfReader class in iText?

Thanks a lot!!

Regards

Rui

Rui Chang

Apr 28, 2005, 5:44:21 AM

Hi,
Here is my question in a bit more detail:

// create a PdfReader
> PdfReader PDFreader=new PdfReader("somefile.pdf");

// retrieve page 2, for example
> text=PDFreader.FlateDecode(PDFreader.getPageContent(2),true);

Is that all that parsing takes? (Obviously I am missing something here,
but what is it?)

Thanks

Rui

bruno

Apr 28, 2005, 6:47:11 AM

With iText you can extract dictionaries, streams, and so on from a PDF file.
These are PDF objects as described in the PDF Reference Manual.
If you decode a stream, you get PDF syntax.
This doesn't mean you get the text that is shown in Acrobat Reader.
iText doesn't parse the Graphics State or Text State operators.

I could explain more about the internals of iText,
but I will keep it short:
If you want to use iText to manipulate existing PDFs,
read http://itext.sourceforge.net/tutorial/general/copystamp/
If you need to extract text from a PDF,
you will need another library.

br,
Bruno
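
To illustrate the point that decoding a stream yields PDF syntax rather than readable text, here is a minimal sketch in plain Java. It does not use iText. The page content is a made-up example, and the class and method names (ContentStreamSketch, deflate, inflate, literalStrings) are hypothetical. The sketch compresses a small content stream with zlib (the algorithm behind the FlateDecode filter), decompresses it again, and then naively collects the literal strings between parentheses.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class ContentStreamSketch {

    // Compress bytes with zlib/deflate, as a Flate-encoded PDF stream would be stored.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress zlib data; this is the only thing "Flate decoding" does.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Naively collect literal strings "( ... )" from decoded content.
    // Simplifications: an escaped character is kept as-is (so \n becomes 'n'),
    // hex strings <...> are ignored, and fonts, encodings and positioning
    // are not interpreted at all.
    static List<String> literalStrings(String content) {
        List<String> found = new ArrayList<>();
        StringBuilder current = null;
        int depth = 0;
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (current == null) {
                if (c == '(') {
                    current = new StringBuilder();
                    depth = 1;
                }
                continue;
            }
            if (c == '\\' && i + 1 < content.length()) {
                current.append(content.charAt(++i));
            } else if (c == '(') {
                depth++;
                current.append(c);
            } else if (c == ')') {
                if (--depth == 0) {
                    found.add(current.toString());
                    current = null;
                } else {
                    current.append(c);
                }
            } else {
                current.append(c);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical page content: text operators, not plain text.
        String page = "BT\n/F1 12 Tf\n72 720 Td\n(Hello) Tj\n[(Wor) -20 (ld)] TJ\nET\n";
        byte[] stored = deflate(page.getBytes(StandardCharsets.ISO_8859_1));
        String decoded = new String(inflate(stored), StandardCharsets.ISO_8859_1);
        System.out.println(decoded);                 // PDF operators
        System.out.println(literalStrings(decoded)); // string fragments only
    }
}
```

Even in this friendly case the word "World" comes back as the two fragments "Wor" and "ld", because the TJ operator places pieces of text individually. Real documents also map string bytes through font encodings, so the bytes are often not readable characters at all, which is why a library that interprets the text operators is needed.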

b...@csh.rit.edu

Apr 28, 2005, 9:16:47 AM

http://www.pdfbox.org is an open source Java PDF Library that does text
extraction.

See the command line tool org.pdfbox.ExtractText and utility class
org.pdfbox.util.PDFTextStripper to see how to extract text from a PDF
document.

Ben

Rui Chang

Apr 28, 2005, 11:06:40 AM

Thanks for your suggestion, Ben. PDFBox is a great library. I have
already tested it, and it works very well! If I have any follow-up
questions about using PDFBox, I will post them here.

Regards to everyone who replied,
Rui

deepandroid

Jun 8, 2012, 4:12:28 AM

I have tried to extract the text of a PDF document using only the PDFBox library on Android:

    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public static void read(String[] args) throws IOException {
        PDDocument doc = null;
        try {
            doc = PDDocument.load("C:\\Android.pdf");
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(doc);
        } finally {
            // always release the document, even if extraction fails
            if (doc != null) {
                doc.close();
            }
        }
    }

But I get an error in logcat: Could not find method org.apache.pdfbox.pdmodel.PDDocument.load, referenced from method com.packagename.classname.method...
Moreover, I have set the classpath and path in the system variables and added the JAR file too.
Why am I getting these errors?
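
A "Could not find method" message at run time usually means the class or method was not available to the running app, whatever the build machine's classpath says. One possible cause, stated here as an assumption rather than a diagnosis, is that the PDFBox JAR was not packaged into the APK; a CLASSPATH system variable on the development PC has no effect on the device. Note also that a Windows path such as C:\Android.pdf will not exist on an Android device. The sketch below uses only standard reflection to check what the runtime can actually see. The class and method names (RuntimeClasspathCheck, isClassAvailable, hasMethodNamed) are hypothetical.

```java
import java.lang.reflect.Method;

final class RuntimeClasspathCheck {

    // True if the named class can be loaded by this app's class loader.
    // The class is looked up without running its static initializers.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, RuntimeClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // True if the class is loadable and exposes a public method with this name.
    static boolean hasMethodNamed(String className, String methodName) {
        try {
            Class<?> type =
                Class.forName(className, false, RuntimeClasspathCheck.class.getClassLoader());
            for (Method m : type.getMethods()) {
                if (m.getName().equals(methodName)) {
                    return true;
                }
            }
            return false;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String pdDocument = "org.apache.pdfbox.pdmodel.PDDocument";
        System.out.println("class present: " + isClassAvailable(pdDocument));
        System.out.println("load method present: " + hasMethodNamed(pdDocument, "load"));
    }
}
```

If both checks report false when run inside the app, the library is not reaching the device, and the fix lies in how the project packages its JARs rather than in the extraction code itself.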

--http://compgroups.net/comp.text.pdf/how-to-retrieve-text-content-from-pdf-file-by-i/326846
