I am looking at certain text features (large centered text, the copyright symbol, margin size, etc.) that will hopefully allow me to classify the page type (copyright, table of contents, etc.). I have access to a truth set containing pages and their correct types. The loss function will be designed around the occurrence of these features on each page type.
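To make the idea concrete, here is a minimal sketch of turning those page features into a fixed-length feature vector paired with a label from the truth set. All field names (`has_copyright_symbol`, `max_font_size`, etc.) are my own placeholders, not part of any real PDF library:

```python
def extract_features(page):
    """Map a parsed page (a dict of extracted attributes) to a feature vector.

    These attributes are hypothetical; in practice they would come from
    whatever the PDF processor reports for each page.
    """
    return [
        1.0 if page.get("has_copyright_symbol") else 0.0,  # (c) symbol present?
        float(page.get("max_font_size", 0.0)),             # large centered text
        float(page.get("left_margin_pts", 0.0)),           # margin width in points
        float(page.get("centered_line_ratio", 0.0)),       # fraction of centered lines
    ]

# A truth-set entry pairs a page's extracted attributes with its known type.
truth_set = [
    ({"has_copyright_symbol": True, "max_font_size": 14.0,
      "left_margin_pts": 90.0, "centered_line_ratio": 0.8}, "copyright"),
    ({"has_copyright_symbol": False, "max_font_size": 24.0,
      "left_margin_pts": 72.0, "centered_line_ratio": 0.3}, "toc"),
]

X = [extract_features(page) for page, _ in truth_set]  # feature matrix
y = [label for _, label in truth_set]                  # labels
```

The point is just that each page collapses to a small numeric vector plus a label, which is the shape of input any supervised learner expects.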
So what my novice mind is thinking: pass a PDF into a PDF processor like the Apache library I linked above, identify certain features and metadata, and pass these into Caffe for training. Maybe that model is incorrect, or I am misunderstanding the process. Does the input to Caffe have to be an image? Are there machine learning frameworks more suitable for processing PDFs? (Sorry, I know this is an evil question to ask on the caffe-users group.)
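For what it's worth, Caffe's input does not have to be an image: its HDF5 data layer can consume arbitrary feature vectors. That said, with only a handful of hand-crafted features per page, a much simpler classifier may be enough as a baseline. A minimal nearest-centroid sketch in pure Python (toy data, hypothetical features):

```python
import math
from collections import defaultdict

def fit_centroids(X, y):
    """Average the feature vectors of each page type (nearest-centroid training)."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(X, y):
        if label not in sums:
            sums[label] = list(vec)
        else:
            sums[label] = [a + b for a, b in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}

def predict(centroids, vec):
    """Classify a page as the type whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], vec))

# Toy feature vectors: [copyright symbol present?, max font size on page]
X = [[1.0, 12.0], [1.0, 14.0], [0.0, 30.0], [0.0, 28.0]]
y = ["copyright", "copyright", "title", "title"]

centroids = fit_centroids(X, y)
print(predict(centroids, [1.0, 13.0]))  # prints "copyright"
```

This is only a sketch of the overall pipeline shape (extract features, train a classifier on the truth set); whether a deep framework like Caffe is worth it depends on how separable the page types turn out to be with these features.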