At least for PMC Open Access, if you want to search for 2000 different pesticides, you should probably download all 1 million XML files in the four article* files at ftp://ftp.ncbi.nlm.nih.gov/pub/pmc and search those directly. You can load them in R using xmlParse(file). I recently looped over all 1 million files to count tables (1.46 million) and supplements (300K) by journal, and that only took a few hours, so you could modify that for pesticide searches and save the matching text. Depending on what you need, the XPath queries can be difficult, and I constantly find new problems, such as XML with tables nested in paragraph tags.
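A minimal sketch of that kind of bulk search, assuming the article* archives have already been unpacked locally and that the .nxml files carry no default namespace (so a plain "//p" XPath works). find_terms is a hypothetical helper written for this example, not part of pmcXML, and the tiny XML file is made up so the block runs on its own.

```r
library(XML)

# Hypothetical helper: return every paragraph in one XML file that mentions
# any of the search terms (case-insensitive, literal match).
find_terms <- function(file, terms) {
  doc <- xmlParse(file)
  on.exit(free(doc))  # release the C-level document when done
  # xmlValue on a <p> node also includes text of anything nested inside it
  paras <- as.character(unlist(xpathSApply(doc, "//p", xmlValue)))
  hits <- lapply(terms, function(term) {
    i <- grepl(tolower(term), tolower(paras), fixed = TRUE)
    data.frame(file = rep(basename(file), sum(i)),
               term = rep(term, sum(i)),
               text = paras[i],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

# Toy input so the example is self-contained
tmp <- tempfile(fileext = ".nxml")
writeLines("<article><body><p>Atrazine was applied.</p><p>No herbicide here.</p></body></article>", tmp)
hits <- find_terms(tmp, c("atrazine", "glyphosate"))
print(hits[, c("term", "text")])

# For the real files, something like this (directory name is an assumption):
# files <- list.files("pmc_xml", pattern = "\\.nxml$", recursive = TRUE, full.names = TRUE)
# all_hits <- do.call(rbind, lapply(files, find_terms, terms = pesticides))
```

With 2000 terms it may be faster to combine them into one regular expression first and only split out individual terms for the paragraphs that match.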
Also, if you want to return full text from specific PMC search results, you could use the pmcXML package and related functions that help with XML parsing (see https://github.com/cstubben/pmcXML; like fulltext, this is in development and will change, hopefully based on suggestions from users). I apologize for the long email below, but here are the basic steps.
1. Run a query in PMC and get ids
atrazine AND open access[FILTER] #911 records
atrazine[Body - All Words] AND open access[FILTER] #610
# or 56 with atrazine in title
atz <- ncbiPMC("atrazine[TITLE] AND open access[FILTER]")
names(atz)
[1] "pmc" "authors" "year" "title" "journal" "volume" "pages" "pubdate" "epubdate" "pmid" "doi"
subset(atz, journal=="BMC Genomics")
pmc authors year
46 PMC2242805 Ramel F, Sulmon C, Cabello-Hurtado F, et al 2007
title
46 Genome-wide interacting effects of sucrose and herbicide-mediated stress in Arabidopsis thaliana: novel insights into atrazine toxicity and sucrose-induced tolerance
journal volume pages pubdate epubdate pmid doi
46 BMC Genomics 8 450 2007/12/05 18053238 10.1186/1471-2164-8-450
Note: ncbiPMC uses old functions in the BioC genomes package for the E-utility scripts, which I need to replace with rentrez so I can drop that package dependency. I will keep the parser that creates the summary table above in the package.
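For reference, a rough sketch of what the id lookup could look like with rentrez. This is only an illustration of the idea, not the planned replacement: pmc_search_ids is a hypothetical name, and the "PMC" prefix is added on the assumption that ESearch returns bare numeric ids for the pmc database. It needs network access, so the call is left commented out.

```r
# Hypothetical wrapper around rentrez::entrez_search for the pmc database
pmc_search_ids <- function(term, retmax = 100) {
  res <- rentrez::entrez_search(db = "pmc", term = term, retmax = retmax)
  # Assumption: ids come back as bare numbers, so prefix them to get
  # identifiers of the form PMC2242805
  paste0("PMC", res$ids)
}

# Requires the rentrez package and a network connection:
# ids <- pmc_search_ids("atrazine[TITLE] AND open access[FILTER]")
```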
2. Download XML
There are two options for automated downloads. pmcOAI uses the PMC OAI service, removes the namespace for easier XPath queries, and adds carets (^) within superscript tags and hyperlinked table footnotes so they display sensibly as plain text.
doc <- pmcOAI("PMC2242805")
## since this returns an XMLInternalDocument, all the related functions in the XML package should work.
summary(doc)
getNodeSet(doc, "//abstract")
The other option is to use the PMC FTP site, but this currently requires loading a local copy of the million-row file_list.txt file to get the directory name (any suggestions on how to avoid this would be welcome).
pmcfiles <- read.delim("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.txt", skip=1, header=FALSE, stringsAsFactors=FALSE)
nrow(pmcfiles)
[1] 941083
names(pmcfiles)<-c("dir", "citation", "pmcid")
subset(pmcfiles, pmcid == "PMC2242805")
dir citation pmcid
122831 a0/ff/BMC_Genomics_2007_Dec_5_8_450.tar.gz BMC Genomics. 2007 Dec 5; 8:450 PMC2242805
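The lookup from id to download URL can be sketched as below. pmc_ftp_url is a hypothetical helper (not the pmcFTP code), and the one-row data frame stands in for the full file list so the block runs without a download. It does not remove the need for the file list; the commented lines only show one way to cache it locally so the FTP read happens once.

```r
# Hypothetical helper: build the FTP URL for one PMC id from the file list
pmc_ftp_url <- function(pmcid, filelist,
                        base = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/") {
  hit <- filelist$dir[filelist$pmcid == pmcid]
  if (length(hit) == 0) stop("No entry for ", pmcid)
  paste0(base, hit[1])
}

# Toy stand-in for the million-row table, using the row shown above
toy <- data.frame(
  dir      = "a0/ff/BMC_Genomics_2007_Dec_5_8_450.tar.gz",
  citation = "BMC Genomics. 2007 Dec 5; 8:450",
  pmcid    = "PMC2242805",
  stringsAsFactors = FALSE
)
url <- pmc_ftp_url("PMC2242805", toy)
print(url)

# Caching the real table (file name is an assumption):
# saveRDS(pmcfiles, "pmc_file_list.rds")
# pmcfiles <- readRDS("pmc_file_list.rds")
```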
pmcFTP( "PMC2242805")
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/a0/ff/BMC_Genomics_2007_Dec_5_8_450.tar.gz'
...
Saved to ./PMC2242805
# the ftp site has xml and PDF copies of the paper that you mentioned in your email, plus figures incl thumbnails and supplements.
list.files("PMC2242805")
[1] "1471-2164-8-450-1.gif" "1471-2164-8-450-1.jpg" "1471-2164-8-450-2.gif" "1471-2164-8-450-2.jpg" "1471-2164-8-450-3.gif" "1471-2164-8-450-3.jpg"
[7] "1471-2164-8-450-4.gif" "1471-2164-8-450-4.jpg" "1471-2164-8-450-5.gif" "1471-2164-8-450-5.jpg" "1471-2164-8-450.nxml" "1471-2164-8-450.pdf"
[13] "1471-2164-8-450-S10.pdf" "1471-2164-8-450-S11.pdf" "1471-2164-8-450-S12.pdf" "1471-2164-8-450-S1.pdf" "1471-2164-8-450-S2.pdf" "1471-2164-8-450-S3.pdf"
[19] "1471-2164-8-450-S4.pdf" "1471-2164-8-450-S5.pdf" "1471-2164-8-450-S6.xls" "1471-2164-8-450-S7.pdf" "1471-2164-8-450-S8.pdf" "1471-2164-8-450-S9.pdf"
[25] "license.txt"
doc2 <- xmlParse("PMC2242805/1471-2164-8-450.nxml")
3. Parse XML (see the wiki page at https://github.com/cstubben/pmcXML/wiki/Parse-xml for more details)
The package currently includes functions to parse the metadata, full text, tables, references and optionally load supplements. All of this was written for indexing in Apache Solr, so I have not thought too much about text-mining applications.
# pmcMeta creates a list of metadata fields and also gets mesh terms from PubMed
meta <- pmcMeta(doc)
names(meta)
[1] "id" "title" "author_display" "year" "journal" "volume" "pages" "journal_display"
[9] "citation" "doc_type" "doc_source" "epubdate" "pubdate" "first_author" "publisher" "pmid"
[17] "pmcid" "doi" "URL" "author" "affiliation" "keywords" "mesh" "license"
# pmcText splits the document into a list of subsections (with full path to subsection title) and each subsection is either a vector of paragraphs or sentences
txt <- pmcText(doc, sentence=FALSE)  # one element per paragraph
txt <- pmcText(doc)                  # one element per sentence (used below)
sapply(txt, length)
Main title
1
Abstract
8
Background
25
Results; Physiological effects of atrazine and sucrose treatments
12
Results; Effects of atrazine and sucrose on global gene expression
17
Results; Identification of protection-related functional categories
27
Results; Characterization of atrazine xenobiotic and oxidative effects: evidence for deleterious effects on gene regulation
34
...
I added another function to simplify searches using grep, and this finds 154 atrazine mentions. The important question to consider is whether the structure of the document really matters: do you care if atrazine is in the abstract, a section title, a caption, a specific section below and so on, or do you just want a giant text blob to pass to tm or another package?
x <- searchPMC(txt, "atrazine")
head(x)
data.frame(table(x$section))
Var1 Freq
1 Abstract 5
2 Background 8
3 Conclusion 1
4 Discussion 28
5 Figure caption 8
6 Main title 1
7 Methods; Microarray data validation and qRT-PCR experiment 1
8 Methods; Plant material and growth conditions 2
9 Methods; RNA isolation and microarray analysis 1
10 Results; Characterization of atrazine xenobiotic and oxidative effects: evidence for deleterious effects on gene regulation 20
11 Results; Differential expression of specific transcription factors during sucrose-induced atrazine tolerance 8
12 Results; Effects of atrazine and sucrose on global gene expression 7
13 Results; Identification of protection-related functional categories 17
14 Results; Physiological effects of atrazine and sucrose treatments 4
15 Results; Specific effects of combined sucrose plus atrazine treatment on tolerance-related gene regulation 12
16 Results; Time-course of induction of transcription factors during sucrose-dependent atrazine protection 6
17 Section title 6
18 Supplement caption 14
19 Table caption 5
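The idea behind a grep-based search over that list can be approximated in base R as follows. This is a sketch, not the searchPMC source: grep_sections is a hypothetical name, it counts matching sentences rather than individual occurrences, and the toy list only imitates the shape of the pmcText output.

```r
# Hypothetical helper: one row per matching sentence, tagged with its section
grep_sections <- function(txt, pattern) {
  hits <- Map(function(section, s) {
    m <- s[grepl(pattern, s, ignore.case = TRUE)]
    data.frame(section = rep(section, length(m)),
               mention = m,
               stringsAsFactors = FALSE)
  }, names(txt), txt)
  do.call(rbind, unname(hits))
}

# Toy list with the same shape: section name -> vector of sentences
toy_txt <- list(
  "Main title" = "Atrazine toxicity in Arabidopsis",
  "Abstract"   = c("Atrazine is a herbicide.", "Sucrose confers tolerance."),
  "Background" = "Plants were grown on agar."
)
found <- grep_sections(toy_txt, "atrazine")
print(table(found$section))
```

Keeping the section name next to each hit is what makes it possible to decide later whether a mention in the abstract should count differently from one in a caption.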
At least for the tm package, you can convert this list into a Corpus (but again, I have little experience with traditional text mining).
library(tm)
Corpus(VectorSource(txt))
# pmcTable creates a list of data.frames. This function uses rowspan and colspan attributes within the th and td tags to correctly format and repeat cell values as needed. For example, table 1 has a three-row header in columns 3-5, which is collapsed into a single name.
x <- pmcTable(doc)
Parsing Table 1 Induction by atrazine of genes involved in xenobiotic and oxidative stress response
Parsing Table 2 Repression by atrazine of genes involved in xenobiotic and oxidative stress response
Parsing Table 3 Selected atrazine-regulated genes that may be involved in atrazine injury
Parsing Table 4 Genes potentially involved in sucrose-induced atrazine tolerance
Parsing Table 5 Transcription factors potentially involved in sucrose-induced atrazine tolerance
lapply(x, head)
$`Table 1`
Accession number Gene description log2(ratio): Treatment comparison: MA/M
1 At1g06570 4-hydroxyphenylpyruvate dioxygenase (PDS1) 3.19
2 At1g33110 MATE efflux family protein 2.17
3 At1g53580 Hydroxyacylglutathione hydrolase, putative/glyoxalase II putative 1.75
4 At1g70610 ABC transporter (TAP1) 2.20
5 At1g80160 Glyoxalase I family protein 2.03
log2(ratio): Treatment comparison: S/M log2(ratio): Treatment comparison: SA/M
1 -0.76 2.12
2 nde 1.88
3 nde 1.06
4 nde 1.38
5 -1.81 nde
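To illustrate the header collapsing only, here is a small base R sketch. It assumes the rowspan and colspan cells have already been expanded into a full character matrix (that step is not shown), and the way the three header rows are split below is a guess based on the column names in the output above. collapse_header is a hypothetical name, not the pmcTable internals.

```r
# Hypothetical helper: paste the distinct, non-empty header cells of each column
collapse_header <- function(header) {
  apply(header, 2, function(col) {
    paste(unique(col[nzchar(col)]), collapse = ": ")
  })
}

# Assumed layout of the three header rows after span expansion
header <- rbind(
  c("Accession number", "Gene description", "log2(ratio)", "log2(ratio)", "log2(ratio)"),
  c("Accession number", "Gene description", "Treatment comparison", "Treatment comparison", "Treatment comparison"),
  c("Accession number", "Gene description", "MA/M", "S/M", "SA/M")
)
col_names <- collapse_header(header)
print(col_names)
```

Using unique() keeps a cell that spans all three rows from being repeated in the final name.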
# The function also adds a bunch of additional attributes
attributes(x[[1]])
$id
[1] "PMC2242805"
$file
[1] "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2242805/table/T1"
$label
[1] "Table 1"
$caption
[1] "Induction by atrazine of genes involved in xenobiotic and oxidative stress response"
$footnotes
[1] "nde: not differentially expressed, genes with a Bonferroni P-values higher than 5% were considered as being not differentially expressed as described in Lurin et al. [75]."
I index tables in three different formats in Solr (original text blob, delimited text, and also as a collapsed row, since each one changes relevancy scoring). Also, the collapse2 function here will detect and repeat subheaders and therefore increase their term frequency (the row id is optional, mainly added for Solr highlighting).
collapse2(x[[1]])
[1] "Row 1 of 5; Accession number=At1g06570; Gene description=4-hydroxyphenylpyruvate dioxygenase (PDS1); log2(ratio): Treatment comparison: MA/M=3.19; log2(ratio): Treatment comparison: S/M=-0.76; log2(ratio): Treatment comparison: SA/M=2.12."
[2] "Row 2 of 5; Accession number=At1g33110; Gene description=MATE efflux family protein; log2(ratio): Treatment comparison: MA/M=2.17; log2(ratio): Treatment comparison: S/M=nde; log2(ratio): Treatment comparison: SA/M=1.88."
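The collapsed-row format can be reproduced roughly as below. This sketch covers only the "name=value" pasting with an optional row id; it does not attempt the subheader detection, and collapse_rows is a hypothetical name rather than the collapse2 source. The toy table uses a shortened third column name.

```r
# Hypothetical helper: turn each row into one "name=value; ..." string
collapse_rows <- function(df, row_id = TRUE) {
  n <- nrow(df)
  vapply(seq_len(n), function(i) {
    values <- vapply(df[i, , drop = FALSE], as.character, character(1))
    cells  <- paste(paste0(names(df), "=", values), collapse = "; ")
    if (row_id) paste0("Row ", i, " of ", n, "; ", cells, ".") else paste0(cells, ".")
  }, character(1))
}

# Toy table; check.names = FALSE keeps the spaces in the column names
tbl <- data.frame(
  "Accession number" = c("At1g06570", "At1g33110"),
  "Gene description" = c("4-hydroxyphenylpyruvate dioxygenase (PDS1)",
                         "MATE efflux family protein"),
  "MA/M"             = c("3.19", "2.17"),
  check.names = FALSE, stringsAsFactors = FALSE
)
rows <- collapse_rows(tbl)
print(rows)
```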
## References
y<- pmcRef(doc)
# Supplements. This lists the links to supplements mentioned in the full text.
z <- pmcSupp(doc)
names(z)
[1] "label" "caption" "file" "type"
z[12,]
If you have unix tools for some system commands, you can get Excel, Word, HTML, PDF, text and compressed files. I need to fix this to optionally read from a local file downloaded with pmcFTP rather than the link.
s12 <- pmcSupp(doc, 12)
Downloading Additional file 12
[1] "Returned 35 rows"
s12
[1] "Genes selected for qRT-PCR analysis and primer sequences"
[2] "Accession"
[3] "number Gene description Forward sequence Reverse sequence"
[4] "At1g06570 4-hydroxyphenylpyruvate"
[5] "dioxygenase (PDS1)"
[6] "TCGCTCGTCGCTTCTCCTG TGTGGTTGTCGGTTTAATCTCTCC"
[7] "At1g42990 bZIP transcription factor"
[8] "family protein"
[9] "TCTGCTGTGCTCTTGTTGGAATC GAACCCTTACATCTCCGACTAACG"