I just noticed this discussion and wanted to suggest a package on GitHub (https://github.com/cstubben/pmcXML) that might be helpful. I still need to clean up the code, but the main objectives are to read PMC Open Access articles into XMLInternalDocuments and then parse 1) metadata, 2) text, 3) tables and 4) supplements (technically not in the XML, but available on the FTP site or via links in the XML). Specifically, I parse the text into a list of subsections (each a character vector of paragraphs) and use the full path to the subsection title for the list names. This is easy to convert into a tm Corpus, write to a Solr XML file for importing, or pass to a function like sentDetect from openNLP to split into sentences for searching. I read tables into a list of data.frames, and I have been working on code to improve readHTMLTable by repeating subheaders, filling row and column spans, correctly parsing multi-row headers, and so on.
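The subsection parsing described above can be sketched with the XML and tm packages directly. This is a rough, untested outline, not the pmcXML code itself: the file name is hypothetical, and the XPath expressions assume the JATS/NLM tag set used by PMC (`<sec>`, `<title>`, `<p>`), so they may need adjusting for other DTDs.

```r
library(XML)
library(tm)

## hypothetical local PMC Open Access file
doc <- xmlParse("PMC1234567.nxml")

## one character vector of paragraphs per <sec> ...
secs <- getNodeSet(doc, "//body//sec")
text <- lapply(secs, function(s) xpathSApply(s, "./p", xmlValue))

## ... named by the full path to the subsection title
names(text) <- sapply(secs, function(s)
    paste(xpathSApply(s, "ancestor-or-self::sec/title", xmlValue),
          collapse = "; "))

## flatten into a tm Corpus for downstream text mining
corpus <- Corpus(VectorSource(unlist(text)))
```

From here, each paragraph could be split into sentences (e.g. with sentDetect from the older openNLP interface mentioned above) before indexing.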
Our main use case is that we may know of ~500 papers on some non-model microbial organism and would like to build a local Solr collection for enhanced searching, so we need to index all of the text, tables and supplements. The main problem is getting the papers that are not in PMC into Solr, so developing new packages to work with our local copies of XML documents from Elsevier, or even HTML from other publishers or PDFs, would be great. Also, for PDFs, I have been working on code to convert them to Markdown, which I can then read into R and convert to Solr import files.
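For the Solr import step, a minimal sketch of writing a Solr `<add>` XML file from parsed paragraphs might look like the following. The field names (`id`, `section`, `text`) are assumptions that would have to match the fields defined in a local Solr schema:

```r
library(XML)

## Write one Solr <doc> per paragraph to an <add> import file.
## paras, sections and ids are parallel character vectors; the
## field names below are placeholders for a local Solr schema.
writeSolrAdd <- function(paras, sections, ids, file) {
  add <- newXMLNode("add")
  for (i in seq_along(paras)) {
    doc <- newXMLNode("doc", parent = add)
    newXMLNode("field", ids[i],      attrs = c(name = "id"),      parent = doc)
    newXMLNode("field", sections[i], attrs = c(name = "section"), parent = doc)
    newXMLNode("field", paras[i],    attrs = c(name = "text"),    parent = doc)
  }
  saveXML(add, file = file)
}
```

The resulting file can then be posted to the collection's update handler (e.g. with Solr's post tool or an HTTP client) for indexing.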
Please let me know how I can help, thanks.
Chris