Abstract:
PDF is a hugely popular format, and for good reason: with a PDF, you can be virtually assured that a document will display and print exactly the same way on different computers. However, PDF documents have a drawback: they usually lack information specifying which content constitutes paragraphs, tables, figures, headers, footers, and so on. This lack of ‘logical structure’ information makes it difficult to edit files, to view documents on small screens, or to extract meaningful data from a PDF. In a sense, the content becomes ‘trapped’. In this article we discuss the logical structure problem and introduce PDFGenie, a tool for extracting text and tables. We also establish a ground truth for evaluating progress in this area, both by PDFGenie and by other tools.