Dear EuropePMC developers,
I am José Mª Fernández, from the Barcelona Supercomputing Center and ELIXIR Spain. A metadata enricher I wrote several years ago for OpenEBench has been using your REST services for years to gather metrics on the manuscripts associated with bio.tools entries. As the number of tools to annotate has grown considerably, the enricher is now basically hammering your REST services, which is slow and prone to communication errors.
I have been looking at the PMCLiteMetadata.tgz archive as a bulk alternative, and I have a few issues and questions:
- The first and foremost one is that the XML content within the archive is not well-formed (it does not follow the XML syntax rules). You can check this with, for instance, the xmllint tool from libxml2:
xmllint --stream --noout PMC.0.xml
PMC.0.xml:614086: parser error : PCDATA invalid Char value 25
09168</pmcid><DOI>10.1258/jrsm.100.3.153</DOI><title>The diagnosis of art: Lowry
^
PMC.0.xml : failed to parse
This happens with every file. Even after working around it by using "tr" to replace the invalid characters with spaces, another widespread issue arises, apparently unescaped '&' characters:
tr -c '[:print:][\t\n]' ' ' < PMC.0.xml | xmllint --stream --noout -
-:1132032: parser error : EntityRef: expecting ';'
9/bfm.2011.0082</DOI><title>State breastfeeding worksite statutes.&breastfeeding
^
- : failed to parse
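In case it helps to reproduce or work around both problems, here is a minimal sketch of a filter that covers the two errors shown above. It is only an assumption about what a reasonable cleanup looks like, not a proposal for how the archive should be generated: the function name sanitize_xml is made up for this example, it assumes a sed that accepts -E, and it treats only the five predefined XML entities and numeric character references as valid.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this example): reads XML on stdin
# and writes a cleaned copy on stdout.
sanitize_xml() {
    # Step 1: delete the C0 control characters that XML 1.0 forbids, i.e.
    # everything below 0x20 except tab, line feed and carriage return.
    # Bytes >= 0x80 are left alone, so UTF-8 text is preserved.
    LC_ALL=C tr -d '\000-\010\013\014\016-\037' |
    # Step 2: escape every '&', then restore the ones that began a
    # predefined entity or a numeric character reference.
    LC_ALL=C sed -E \
        -e 's/&/\&amp;/g' \
        -e 's/&amp;(amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);/\&\1;/g'
}

# Possible usage with the same xmllint check as above:
#   sanitize_xml < PMC.0.xml | xmllint --stream --noout -
```

Unlike the "tr" call above, this one deletes the offending bytes instead of replacing them with spaces, and it does not touch non-ASCII characters. It does not catch numeric references that themselves point to a forbidden character, so it is a workaround at best.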
- The entries provided in PMCLiteMetadata.tgz are a subset of all the entries available through your REST services. A quick-and-dirty command-line sum suggests the archive holds about a quarter of the entries reachable through the EuropePMC REST APIs:
grep -aFhc PMC_ARTICLE *.xml|awk '{s+=$1} END {printf "%.0f", s}'
10352624
So, is there some place where all the entries can be fetched at once, without hammering your services?
- This last one is not related to PMCLiteMetadata.tgz, but is a general question. Is there some way to bulk download the lists of references (or citations) associated with all the entries that have them? Currently, the only programmatic way is (again) hammering the REST services, fetching the list of references (or citations) entry by entry.
Thanks in advance, and Merry Christmas!!