I am currently working on a project that requires retrieving text
streams from PDF files. I have read through some threads regarding the
iText library. I am quite new to the topic of converting PDF text to
objects. So, first off, can anyone tell me whether it is possible to achieve
my goal with iText? (Namely with the class PdfReader; are there other
classes I need for this?) Secondly, could anyone give me a detailed code
example illustrating how to make the simplest PDF->text parser
with the PdfReader class in iText?
Thanks a lot!!
Regards
Rui
Hi,
Here is my question in a bit more detail:
// create a PdfReader
> PdfReader PDFreader=new PdfReader("somefile.pdf");
// retrieve page 2 for example
> text=PDFreader.FlateDecode(PDFreader.getPageContent(2),true);
Is that all there is to parsing? (Obviously I am missing something here,
but what is it?)
Thanks
Rui
With iText you can extract dictionaries, streams, etc. from a PDF file.
These are PDF objects as described in the PDF Reference Manual.
If you decode a stream, you get PDF syntax.
This doesn't mean you get the text that is shown in Acrobat Reader.
iText doesn't parse the Graphics State or Text State operators.
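
To make the point about "PDF syntax" concrete, here is a minimal sketch
that uses only java.util.zip, not iText. It assumes a page content stream
stored with the Flate filter; the sample content and the class and method
names (ContentStreamDemo, deflate, inflate, sampleContent) are made up for
illustration. Decoding the stream gives back operators such as Tf, Td and
Tj, not the plain text of the page.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class ContentStreamDemo {

    // Hypothetical content stream: select a font, move, show two strings.
    public static String sampleContent() {
        return "BT\n/F1 12 Tf\n72 720 Td\n(Hello) Tj\n0 -14 Td\n(World) Tj\nET\n";
    }

    // Compress bytes with zlib/deflate, as a Flate-encoded stream would be stored.
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a Flate-encoded stream back to its raw bytes.
    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and nothing left to read: the data is incomplete.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] stored = deflate(sampleContent().getBytes(StandardCharsets.ISO_8859_1));
        String decoded = new String(inflate(stored), StandardCharsets.ISO_8859_1);
        // Prints the operators, e.g. "(Hello) Tj", rather than "Hello World".
        System.out.print(decoded);
    }
}
```

Even in this tiny case, turning the output into readable text means
interpreting the operators: Td moves the text position, and the strings
shown by Tj are encoded according to the current font. In real files the
strings are often not readable ASCII at all, which is why a dedicated
text extraction library is needed.
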
I could explain more about the internals of iText,
but I will keep it short:
If you want to use iText to manipulate existing PDFs,
read http://itext.sourceforge.net/tutorial/general/copystamp/
If you need to extract text from a PDF,
you will need another library.
br,
Bruno
See the command line tool org.pdfbox.ExtractText and utility class
org.pdfbox.util.PDFTextStripper to see how to extract text from a PDF
document.
Ben
Thanks to everyone who replied.
Rui