Issues in contents of bulk PMCLiteMetadata.tgz


José Mª Fernández

Dec 20, 2025, 5:13:36 AM
to Europe PMC Developer Forum
Dear EuropePMC developers,
        I am José Mª Fernández, from Barcelona Supercomputing Center and ELIXIR Spain. A metadata enricher I wrote several years ago for OpenEBench has been using your REST services for years to gather metrics on the manuscripts associated with bio.tools entries. As the number of tools to annotate has grown considerably, the enricher is now basically hammering your REST services, which is slow and prone to communication errors.

So, several weeks ago I switched to a different approach, one that involves bulk downloading the metadata of all the EuropePMC entries. I have realised that the bulk download contents provided through https://europepmc.org/ftp/pmclitemetadata/PMCLiteMetadata.tgz have a couple of issues:
  1. The first and foremost one is that the XML contents within the archive are not well-formed (they do not follow the XML syntax rules). You can check this with, for instance, the xmllint tool from libxml2:

    xmllint --stream --noout PMC.0.xml  
    PMC.0.xml:614086: parser error : PCDATA invalid Char value 25
    09168</pmcid><DOI>10.1258/jrsm.100.3.153</DOI><title>The diagnosis of art: Lowry
                                                                                  ^
    PMC.0.xml : failed to parse

    This happens in every file. Even after working around the issue by using "tr" to filter out the invalid chars, another widespread issue arises:

    tr -c '[:print:][\t\n]' ' ' < PMC.0.xml | xmllint --stream --noout -
    -:1132032: parser error : EntityRef: expecting ';'
    9/bfm.2011.0082</DOI><title>State breastfeeding worksite statutes.&breastfeeding
                                                                                  ^
    - : failed to parse
     

  2. The entries provided in PMCLiteMetadata.tgz are a subset of all the entries available through your REST services. A quick-n-dirty command-line sum shows that the archive holds about a quarter of all the entries reachable through the EuropePMC REST APIs:

    grep -aFhc PMC_ARTICLE *.xml|awk '{s+=$1} END {printf "%.0f", s}'

    10352624

    So, is there some place where all the entries can be fetched at once, without hammering your services?

  3. This last one is not related to PMCLiteMetadata.tgz, but is a general question. Is there some way to bulk download the list of references (or citations) associated with all the entries that have references? Currently, the only programmatic way is (again) hammering the REST services, fetching the list of references (or citations) entry by entry.
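
As a workaround for point 1, the two clean-up steps (replacing characters that XML 1.0 does not allow, and escaping bare ampersands) could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: the names sanitize_xml_text and is_well_formed are hypothetical, the input is assumed to be already decoded text, only the five predefined XML entities and numeric character references are treated as valid, and the sample fragment is made up to mimic the two errors shown above. It does not handle numeric references that point to invalid characters.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production (e.g. the control char 0x19).
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# An "&" that does not start a predefined entity or a numeric reference.
# Assumption: any other named entity is undeclared, so it is escaped too.
_BARE_AMPERSAND = re.compile(
    r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def sanitize_xml_text(text: str) -> str:
    """Replace invalid XML chars with a space and escape bare ampersands."""
    text = _INVALID_XML_CHARS.sub(" ", text)
    return _BARE_AMPERSAND.sub("&amp;", text)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Made-up fragment reproducing both kinds of error reported by xmllint.
    broken = "<title>Lowry\x19s art &breastfeeding &amp; more</title>"
    fixed = sanitize_xml_text(broken)
    print(is_well_formed(broken), is_well_formed(fixed))
    print(fixed)
```

Neither regular expression needs to match across line breaks, so for files as large as the ones in the archive the function could be applied line by line instead of loading a whole file into memory.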
Thanks in advance, and Merry Christmas!!

Mohamed Selim

Dec 22, 2025, 4:50:29 AM
to Europe PMC Developer Forum, José Mª Fernández
Hi Jose,

Thank you for reaching out. I will add the bug you reported to our to-do list and keep you posted once it is fixed.
If you can give me more details about your use case and queries, we might be able to suggest something different.
Regarding references, unfortunately we don't have any bulk download set for this at the moment.
Please let me know if I can help with anything else.
Kind Regards,
Mohamed
