Does anyone have sample Java code for extracting PDF metadata? The key
fields that I am looking for are the "Creator", "Producer", "Author" and
"Subject" metadata fields in the PDF document.
Or would someone be able to guide me in the correct direction on how to
extract these metadata fields from a PDF document? Is there a Java
class that I can use? I was able to find a Java class for extracting
image properties from a PDF file, the Java class "ImageInfo"; however, the
fields provided by this class are not the fields that I am looking
for (the four fields above).
Thanks in advance for your help.
Prakash
fabrizio
PDDocument pdf = PDDocument.load( "my.pdf" );
PDDocumentInformation info = pdf.getDocumentInformation();
System.out.println( "creator=" + info.getCreator() );
...
Ben
Thanks for your help.
I am still confused though. Would you know which PDFBox files I
would need to import? How would I use the PDFBox library, and which files
would I need to import?
"package ....?
import .....?
...."
I am still not sure which files to include to run my Java code and
which files/libraries from PDFBox to include. How do I get to
utilize this PDFBox library? Do I need to import all the files, or is
there some way to include just one file or a few files,
like in C/C++, which then make calls to the other files? Do we really need
all these files?
I am still new to Java and am unsure about this. I normally would just
use import, but I do not know which file to use or how to use this
PDFBox library; there are just too many files. How can we utilize the
library without importing all the files and tracking all the file
dependencies?
Please help.
Thanks,
Prakash
The jar file contains the classes that need to be referenced from your
class.
For the example I posted you can import either an entire package or
individual classes.
To import an entire package, use this line at the top of
your source file:
import org.pdfbox.pdmodel.*;
or, to import just the classes you need, list each one individually, like
this:
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDDocumentInformation;
Either way your compiled class will be exactly the same. Java does
dynamic linking, so an import statement only tells the compiler where
to look to resolve the class names you use. In C++, if
you add an include your exe gets bigger (it has been a while since I did C++, so
please forgive me if I am wrong) because it statically links in the
stuff you include; that does not happen in Java.
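To illustrate the point about imports, here is a small sketch that uses only the standard library, so it runs without PDFBox. The class name ImportDemo and its method names are made up for this illustration. One method uses the imported short names, the other uses fully qualified names with no import at all, and both produce the same kind of object at runtime.

```java
import java.util.ArrayList;
import java.util.List;

class ImportDemo {
    // Uses the short class names made available by the import statements above.
    static List<String> withImport() {
        List<String> fields = new ArrayList<String>();
        fields.add("Creator");
        return fields;
    }

    // Uses fully qualified names; these lines would compile without any import.
    static java.util.List<String> withoutImport() {
        java.util.List<String> fields = new java.util.ArrayList<String>();
        fields.add("Creator");
        return fields;
    }

    public static void main(String[] args) {
        // Both methods create an instance of the same runtime class.
        System.out.println(withImport().getClass() == withoutImport().getClass());
    }
}
```

The same applies to org.pdfbox.pdmodel.PDDocument: an import only saves you from typing the full name each time.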
So the complete source would look something like this:
package mypackage;

import java.io.*;
import org.pdfbox.pdmodel.*;

public class MyClass
{
    public static void main( String[] args ) throws IOException
    {
        PDDocument doc = null;
        try
        {
            // Load the PDF and read its document information
            doc = PDDocument.load( "my.pdf" );
            PDDocumentInformation info = doc.getDocumentInformation();
            System.out.println( "creator=" + info.getCreator() );
        }
        finally
        {
            // Always close the document, even if loading or reading failed
            if( doc != null )
            {
                doc.close();
            }
        }
    }
}
Thank you very much for the info. I think I got it now. This was very
helpful.
PC
One more question. I have been reading the documentation and it
mentions that to use the PDFBox library we need to install "Ant".
Reading about "Ant", I understand that it is like gmake for building files. So do
I need to install this first before I can use the PDFBox library? Once
I get "Ant", how do I include all the files so that it can build the
PDFBox files?
Also, do I need to change the classpath after installing "Ant" in my
directory for me to use PDFBox? How do I build the PDFBox library to
use it? You mentioned something about a PDFBox-x.x.x.jar file. I
downloaded "PDFBox-0.7.3-dev-20060219" and looked at the build
properties, and all it mentions is
"#forrest.home=c:\\javalib\\apache-forrest-0.6\\src\\core
#ikvm.dir=C:\\javalib\\ikvm-12-07-2004\\ikvm"
What does this mean?
Sorry, I am just confused about the importing process of PDFBox. The
Java documentation of the classes is excellent and easily
understandable. However, the site missed one crucial piece of
information: how to actually get to use the library. Maybe it is just
me, and other users who use Java more probably understand this library
import process. The way I have done it before is to import each
one. The only way I understand is importing every file, which is
extremely tedious. How do you just do "import org.pdfbox.pdmodel.*;"? Do
I just place all the class files in the same root directory? (Basically
my question is how to get the PDFBox library to be usable as "import
org.pdfbox.pdmodel.*;" the same way as "import java.io.*;". I need to know
how to build the custom library so it is like the standard library
"import java.io.*;".)
Could you give me an example of the library building process? I am
guessing it is like
1) Download PDFBox
2) Download "Ant"
3) Set "Ant" Class Paths
4) Once that is done, use "Ant" to build PDFBox. (In what directory do I
do the PDFBox build? The "Ant" classpath has been set and is accessible from
anywhere, but how does that enable the PDFBox library to be accessed from
any directory?)
5) Now that it is built you can use it in the code as "import
org.pdfbox.pdmodel.*;"
Could you explain the process if you have some time? I am confused
about the process of using the PDFBox library rather than the class
code/structure of the PDFBox classes themselves; that is straightforward and
there is excellent documentation.
Thanks,
PC
In Java there is typically no benefit to building something yourself
versus using binaries. The only reasons people should be building
PDFBox themselves are 1) they are curious and want to understand PDFBox
better, or 2) they are planning on making modifications (which of course should
be contributed back to PDFBox :) ).
The build process will create the PDFBox-0.7.2.jar in the lib
directory. You already have this file, which is why you don't need to
build PDFBox; this is the file you need to be able to use PDFBox.
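For "import org.pdfbox.pdmodel.*;" to work, the jar has to be on the classpath when you compile and run, for example by passing it with the -cp option of javac and java (the exact path depends on where you unpacked PDFBox). As a rough sketch, the small program below checks whether a class can be found on the current classpath. The class name ClasspathCheck and the method isAvailable are made up for this illustration; only standard library calls are used.

```java
class ClasspathCheck {
    // Returns true if the named class can be found on the current classpath.
    // The class is looked up without being initialized.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the import statements used earlier in this thread.
        String name = "org.pdfbox.pdmodel.PDDocument";
        System.out.println(name + " available: " + isAvailable(name));
    }
}
```

If this reports false, the jar is most likely not on the classpath that was used to start the program.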
The build.properties file has *optional* properties you can set when
building PDFBox.
PDFBox uses the following two projects for a complete build:
IKVM: to build .NET DLLs from jar files
Apache Forrest: to generate the website documentation
You can leave these blank; the build will complete, skip
those parts, and just build the jar file.
Ben
Thanks for your help. It is appreciated.
PC