pmcXML and fulltext

Chris Stubben

Sep 3, 2014, 2:16:53 PM

to ropensci...@googlegroups.com

I just noticed this discussion and wanted to suggest a package on GitHub (https://github.com/cstubben/pmcXML) that might be helpful. I really need to clean up the code, but the main objectives are to read PMC Open Access articles into XMLInternalDocuments and then parse 1) metadata, 2) text, 3) tables and 4) supplements (technically not in the XML, but included on the FTP site or via links in the XML). Specifically, I parse the text into a list of subsections (each a vector of paragraphs) and use the full path to the subsection title for the list names. This is easy to convert into a tm Corpus, write to a Solr XML file for importing, or pass to a function like sentDetect from openNLP to split into sentences for searching. I read tables into a list of data.frames, and I have been working on code to improve on readHTMLTable by repeating subheaders, filling rowspans and colspans, correctly parsing multi-row headers and so on.
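
The subsection list described above can be sketched roughly as follows. This is a Python illustration only, not the pmcXML code (which is R): the sample XML is a hypothetical, heavily trimmed stand-in for a PMC article body, `section_paragraphs` is an invented helper name, and the "; " separator used to join the title path is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a PMC Open Access article body.
SAMPLE = """<article><body>
<sec><title>Results</title>
  <p>Overview paragraph.</p>
  <sec><title>Growth curves</title>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </sec>
</sec>
<sec><title>Discussion</title>
  <p>Closing paragraph.</p>
</sec>
</body></article>"""


def section_paragraphs(node, path=(), out=None):
    """Map 'Title; Subtitle' paths to the paragraphs directly under each <sec>."""
    if out is None:
        out = {}
    for sec in node.findall("sec"):
        title_el = sec.find("title")
        title = "Untitled" if title_el is None else "".join(title_el.itertext()).strip()
        here = path + (title,)
        # Flatten inline markup and collapse whitespace inside each paragraph.
        paras = [" ".join("".join(p.itertext()).split()) for p in sec.findall("p")]
        if paras:
            out["; ".join(here)] = paras
        # Recurse so nested subsections carry the full title path.
        section_paragraphs(sec, here, out)
    return out


sections = section_paragraphs(ET.fromstring(SAMPLE).find("body"))
for name, paras in sections.items():
    print(name, len(paras))
```

Keeping the full title path as the key means a subsection such as "Growth curves" stays tied to its parent section when the list is later flattened for indexing.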

Our main use case is that we may know of ~500 papers on some non-model microbial organism and we'd like to build a local Solr collection for enhanced searching, so we need to index all the text, tables and supplements. The main problem is getting the papers that are not in PMC into Solr, so developing new packages to work with our local copies of XML docs from Elsevier, HTML from other publishers, or PDFs would be great. Also, for papers available only as PDFs, I have been working on code to convert these to Markdown, which I can then read into R and convert to Solr import files.
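
For readers unfamiliar with Solr import files, here is a minimal sketch of writing parsed text as a Solr XML update message (`<add>`, `<doc>`, `<field>`). It is a Python illustration, not the code used for the collection above. The field names (`id`, `section`, `text`), the placeholder ID and the helper name `solr_add_xml` are all assumptions, and the field names would have to match whatever the target Solr schema defines.

```python
import xml.etree.ElementTree as ET


def solr_add_xml(docs):
    """Build a Solr <add> update message from a list of {field: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list value becomes repeated <field> elements (a multi-valued field).
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    # Serialization escapes characters such as "<" and "&" in the text.
    return ET.tostring(add, encoding="unicode")


# Hypothetical record: one Solr document per subsection, with a placeholder ID.
docs = [
    {
        "id": "PMC0000000-1",
        "section": "Results; Growth curves",
        "text": ["First paragraph.", "Growth at 37 < 42 degrees."],
    },
]
xml_out = solr_add_xml(docs)
print(xml_out)
```

Building the message with an XML library rather than pasting strings together matters here because paragraph text from papers routinely contains characters that need escaping.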

Please let me know how I can help, thanks.

Chris

Scott Chamberlain

Sep 5, 2014, 12:41:49 PM

to ropensci...@googlegroups.com

Hi Chris,

Thanks for reaching out. I assume by "this discussion", you meant the fulltext package discussion?

I'll have a look at your repo. The Solr part sounds interesting. I maintain an R client for Solr (https://github.com/ropensci/solr), which could be useful for interacting with Solr. I don't think I have any methods for writing to a Solr instance right now, but they would be easy to add.

It would also be good to include functionality in the fulltext package for working with PDFs, so that is something to include for sure.

The best way to jump in with fulltext is to look over the issues (https://github.com/ropensci/fulltext/issues). We haven't written much code yet, so much is still up in the air.

David Winter is maintaining the rentrez package (https://github.com/ropensci/rentrez). I'll ping him to see if he has anything to add to this discussion.

Cheers, Scott

David Winter

Sep 5, 2014, 1:12:17 PM

to ropensci...@googlegroups.com

Hi guys,

I don't have much to add at this point -- if pmcXML already handles parsing out information from PMC records then it makes sense to use that functionality.

I don't know if rOpenSci has run into a package with dependencies on Bioconductor packages yet. I also don't know if there is a nice way to handle both CRAN and Bioconductor dependencies if the fulltext package is on CRAN. I'm sure it's solvable, I just don't know how :)

David

Chris Stubben

Sep 5, 2014, 5:30:01 PM

to ropensci...@googlegroups.com

Scott, David, and others,

Thanks for pointing out the issues page. I'm definitely interested in tracking down other sources and considering the best data structures to use, although right now I'd prefer getting full text into an XMLInternalDocument or HTMLInternalDocument and then adding code to simplify parsing into lists. HTML is always messy, and I have made some attempts at PMC's HTML (not open access), Elsevier, SGM, ASM, Wiley and a few others. All of these have various restrictions, but that's constantly changing. I think Elsevier will now allow text mining and even returning snippets of text of up to 200 characters along with the DOI, which seems workable for Solr queries.

This GitHub wiki page probably describes best what I've been trying to do:

https://github.com/cstubben/pmcXML/wiki/Parse-xml

Finally, the Bioconductor dependency could be removed. Basically, it was added to expand locus tag ranges mentioned in the full text, which requires GFF files and other genome-related data.
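
To illustrate why range expansion needs the genome annotation, here is a rough Python sketch (not the pmcXML implementation). The locus tags, the tag pattern in the regular expression and the helper name `expand_ranges` are all hypothetical. The ordered tag list stands in for what would be read from a GFF file, and the range is expanded by position in that list rather than by counting numbers, so irregular tags such as `ABC_0003a` are not missed.

```python
import re

# Hypothetical ordered locus tags, as they might be read from a genome's GFF file.
GFF_TAGS = ["ABC_0001", "ABC_0002", "ABC_0003", "ABC_0003a", "ABC_0004", "ABC_0005"]

# Matches "TAG-TAG", "TAG–TAG" or "TAG to TAG"; the tag pattern is an assumption.
TAG = r"[A-Za-z]+_?\d+[a-z]?"
RANGE = re.compile(r"\b(" + TAG + r")\s*(?:[-\u2013]|to)\s*(" + TAG + r")\b")


def expand_ranges(text, ordered_tags=GFF_TAGS):
    """Return, for every tag range found in text, all tags between its endpoints."""
    index = {tag: i for i, tag in enumerate(ordered_tags)}
    expanded = []
    for start, end in RANGE.findall(text):
        # Skip ranges whose endpoints are unknown or given in reverse order.
        if start in index and end in index and index[start] <= index[end]:
            expanded.append(ordered_tags[index[start]: index[end] + 1])
    return expanded


print(expand_ranges("The operon ABC_0002-ABC_0004 was upregulated."))
```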

Chris

--
You received this message because you are subscribed to the Google Groups "ropensci-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ropensci-discu...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Scott Chamberlain

Sep 5, 2014, 5:39:01 PM

to ropensci...@googlegroups.com

Chris,

I agree that there should be an option to get as raw a response as possible instead of only giving back, e.g., a list. We could probably just let users toggle what format they get data in.

In terms of data sources, there's an issue for that (https://github.com/ropensci/fulltext/issues/4#issuecomment-52376743), and that comment specifically holds the master list so far. We can add things to that list as needed.

Scott
