Importing PDFs


Gigi Frias

Jul 13, 2015, 11:59:50 AM
to press...@googlegroups.com
I saw that this was a topic of discussion back in 2013 when Hugh mentioned that it would be harder to import PDFs because they are often unstructured and it was therefore not a priority, but I'm wondering if there has been any progress on this front since then. 

I know that most PDF to HTML converters are hit or miss and they tend to bloat the file with a lot of unnecessary styles, etc., but I am curious to know if there has been any demand for this and where the option stands (if it still stands at all!).

Thanks,
Gigi

bdolor

Jul 13, 2015, 4:17:29 PM
to press...@googlegroups.com
We put a fair amount of resources into investigating this, and the conclusion we drew from that research isn't promising. If the goal is a high-quality conversion, then whatever automation you throw at the process, manual intervention is unavoidable. PDF conversion can only make guesses: even sophisticated software can't guarantee high-quality output, because there is no certainty in mapping PDF content to structured markup. You need a human to verify or modify the guesses the converter makes. If it can't figure out what the content is supposed to be, it just defaults to putting it in a <p> tag or something similar. For us, that scenario would have meant too much manual work, which would have cancelled out any benefit we stood to gain from automation.
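
To make the "guess, then fall back to <p>" problem concrete, here is a minimal Python sketch. It is not what any real converter or Pressbooks does. The `Block` shape, the `guess_tag` and `convert` names, and the font-size thresholds are all hypothetical, chosen only to illustrate why a person still has to review the output.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Block:
    """A run of text as a PDF extractor might report it (hypothetical shape)."""
    text: str
    font_size: float
    bold: bool = False


def guess_tag(block: Block, body_size: float) -> tuple[str, bool]:
    """Return (tag, confident). Thresholds are arbitrary illustration values."""
    ratio = block.font_size / body_size
    if ratio >= 1.6:
        return "h1", True
    if ratio >= 1.2 and block.bold:
        return "h2", True
    if ratio >= 1.2 or block.bold:
        # Could be a heading, a pull quote, or plain emphasis.
        # There is no way to know, so fall back to <p> and flag it.
        return "p", False
    return "p", True


def convert(blocks: list[Block], body_size: float = 10.0) -> tuple[str, list[int]]:
    """Build HTML and collect the indexes of blocks a human should check."""
    lines, needs_review = [], []
    for i, block in enumerate(blocks):
        tag, confident = guess_tag(block, body_size)
        if not confident:
            needs_review.append(i)
        lines.append(f"<{tag}>{escape(block.text)}</{tag}>")
    return "\n".join(lines), needs_review
```

Even in this toy version, anything that doesn't match a rule cleanly ends up as a paragraph on the review list, and a real PDF has far more ambiguous cases (tables, captions, footnotes, multi-column text) than font size alone can separate.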

Gigi Frias

Jul 14, 2015, 1:22:41 PM
to press...@googlegroups.com
This makes complete sense. I got the same results while exploring options: there are a lot of apps and services that try to do it, but none produce the desired results, much less any consistent structure. I was just checking in case you all had run across a solution or had something under development that might yield better results.

Having your rationale will also help us to explain it to our stakeholders with some third-party supporting evidence to back us up!

Thanks for the response,
Gigi