I am looking at certain text features (large centered text, the copyright symbol, margin size, etc.) that will hopefully allow me to classify the page type (copyright, table of contents, etc.). I have access to a truth set containing pages and their correct types. The loss function will be designed around the occurrence of these features on each page type.
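To make the idea concrete, here is a minimal sketch of turning those page features into a fixed-length feature vector paired with a label from the truth set. All field names (`has_copyright_symbol`, `max_font_size`, etc.) are my own placeholders, not part of any real PDF library:

```python
def extract_features(page):
    """Map a parsed page (a dict of extracted attributes) to a feature vector.

    These attributes are hypothetical; in practice they would come from
    whatever the PDF processor reports for each page.
    """
    return [
        1.0 if page.get("has_copyright_symbol") else 0.0,  # (c) symbol present?
        float(page.get("max_font_size", 0.0)),             # large centered text
        float(page.get("left_margin_pts", 0.0)),           # margin width in points
        float(page.get("centered_line_ratio", 0.0)),       # fraction of centered lines
    ]

# A truth-set entry pairs a page's extracted attributes with its known type.
truth_set = [
    ({"has_copyright_symbol": True, "max_font_size": 14.0,
      "left_margin_pts": 90.0, "centered_line_ratio": 0.8}, "copyright"),
    ({"has_copyright_symbol": False, "max_font_size": 24.0,
      "left_margin_pts": 72.0, "centered_line_ratio": 0.3}, "toc"),
]

X = [extract_features(page) for page, _ in truth_set]  # feature matrix
y = [label for _, label in truth_set]                  # labels
```

The point is just that each page collapses to a small numeric vector plus a label, which is the shape of input any supervised learner expects.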
So what my novice mind is thinking: pass a PDF into a PDF processor like the Apache library I linked above, identify certain features and metadata, and pass these into Caffe for training. Maybe that model is incorrect, or I am misunderstanding the process. Does the input to Caffe have to be an image? Are there machine learning frameworks more suitable for processing PDFs? (Sorry, I know this is an evil question to ask on the caffe-users group.)
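For what it's worth, Caffe's input does not have to be an image: its HDF5 data layer can consume arbitrary feature vectors. That said, with only a handful of hand-crafted features per page, a much simpler classifier may be enough as a baseline. A minimal nearest-centroid sketch in pure Python (toy data, hypothetical features):

```python
import math
from collections import defaultdict

def fit_centroids(X, y):
    """Average the feature vectors of each page type (nearest-centroid training)."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(X, y):
        if label not in sums:
            sums[label] = list(vec)
        else:
            sums[label] = [a + b for a, b in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}

def predict(centroids, vec):
    """Classify a page as the type whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], vec))

# Toy feature vectors: [copyright symbol present?, max font size on page]
X = [[1.0, 12.0], [1.0, 14.0], [0.0, 30.0], [0.0, 28.0]]
y = ["copyright", "copyright", "title", "title"]

centroids = fit_centroids(X, y)
print(predict(centroids, [1.0, 13.0]))  # prints "copyright"
```

This is only a sketch of the overall pipeline shape (extract features, train a classifier on the truth set); whether a deep framework like Caffe is worth it depends on how separable the page types turn out to be with these features.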