Science Beam: a computer vision approach to the extraction of PDF data


Emily Packer

Aug 4, 2017, 11:58:07 AM
to SPARC OA Forum
[With apologies for cross-posting]

Hi all,

eLife has today outlined a new project to convert PDF to XML with high accuracy by complementing existing tools with computer vision technology.


A vast trove of scientific research is locked inside the PDF format, and extracting key information from these files is not trivial. It would therefore be useful to extract this data and store it in a more accessible and reusable format such as XML (whether the publishing-industry standard JATS or another variety).


Science Beam uses computer vision algorithms to help ‘see’ the structure of a research paper in PDF as a human would. This can then be used to assign the correct metadata to the document’s content.


You can read more about the project in our latest eLife Labs post: https://elifesciences.org/labs/5b56aff6/science-beam-a-computer-vision-approach-to-the-extraction-of-pdf-data


For Science Beam to extract good metadata despite the myriad variations in font, layout and content of PDFs from different sources, we need to train our system with a wide variety of PDFs and their corresponding XML. To this end, we will be collaborating with other publishers to collate a broad corpus of valid PDF/XML pairs to help train and test our neural networks. Our hope is that the wide variety of papers and formats in this corpus will help our system learn to deduce the structure of a research paper well enough to be useful in real-world applications.


For more information, or to speak to us further about Science Beam, please don’t hesitate to contact me.


Best wishes,


Emily


-- 




Emily Packer
Press Officer


+44 1223 855373 (office)


http://elifesciences.org


eLife Sciences Publications, Ltd is a limited liability non-profit non-stock corporation incorporated in the State of Delaware, USA, with company number 5030732, and is registered in the UK with company number FC030576 and branch number BR015634 at the address First Floor, 24 Hills Road, Cambridge CB2 1JP.


