Issues in contents of bulk PMCLiteMetadata.tgz


José Mª Fernández

Dec 20, 2025, 5:13:36 AM
to Europe PMC Developer Forum
Dear EuropePMC developers,
        I am José Mª Fernández, from the Barcelona Supercomputing Center and ELIXIR Spain. A metadata enricher I wrote several years ago for OpenEBench has been using your REST services for years to gather metrics on the manuscripts associated with bio.tools entries. As the number of tools to annotate has grown so much, the enricher is basically hammering your REST services, which is slow and prone to communication errors.

So, several weeks ago I switched to a different approach, one which involves bulk-downloading the metadata of all the EuropePMC entries. I have realised that the bulk download contents provided through https://europepmc.org/ftp/pmclitemetadata/PMCLiteMetadata.tgz have a couple of problematic issues:
  1. The first and foremost one is that the XML contents within the archive are not well-formed (they break the XML syntax rules). You can check it using, for instance, the xmllint tool from libxml2:

    xmllint --stream --noout PMC.0.xml  
    PMC.0.xml:614086: parser error : PCDATA invalid Char value 25
    09168</pmcid><DOI>10.1258/jrsm.100.3.153</DOI><title>The diagnosis of art: Lowry
                                                                                  ^
    PMC.0.xml : failed to parse

    This happens in every file. Even after working around the issue by using "tr" to filter out the invalid characters, another widespread issue arises:

    tr -c '[:print:][\t\n]' ' ' < PMC.0.xml | xmllint --stream --noout -
    -:1132032: parser error : EntityRef: expecting ';'
    9/bfm.2011.0082</DOI><title>State breastfeeding worksite statutes.&breastfeeding
                                                                                  ^
    - : failed to parse
     

  2. The entries provided in PMCLiteMetadata.tgz are a subset of all the entries available through your REST services. A quick-and-dirty command-line sum shows that they amount to about a quarter of all the entries reachable through the EuropePMC REST APIs:

    grep -aFhc PMC_ARTICLE *.xml|awk '{s+=$1} END {printf "%.0f", s}'

    10352624

    So, is there some place where all the entries can be fetched at once, without hammering your services?

  3. This last one is not related to PMCLiteMetadata.tgz, but is a general question. Is there some way to bulk download the list of references (or citations) associated with all the entries that have them? Currently, the only programmatic way is (again) hammering the REST services, fetching the list of references (or citations) entry by entry.
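
The two parser errors shown in item 1 (control characters that XML 1.0 forbids, and ampersands that do not start an entity reference) could be worked around before parsing with a small filter. What follows is only a minimal sketch, not part of any Europe PMC tooling: the names `sanitize_xml_line` and `sanitize_lines` are hypothetical, and it assumes the files declare no custom entities, so that only the five predefined entities and numeric character references need to be left alone.

```python
import re

# Control characters that XML 1.0 forbids (everything below U+0020 except
# tab, LF and CR), plus the non-characters U+FFFE and U+FFFF.
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")

# An ampersand that does NOT start a predefined entity or a numeric
# character reference. Assumption: the documents declare no other entities.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def sanitize_xml_line(line: str) -> str:
    """Replace forbidden control characters with spaces and escape bare '&'."""
    line = INVALID_XML_CHARS.sub(" ", line)
    return BARE_AMPERSAND.sub("&amp;", line)


def sanitize_lines(lines):
    """Lazily sanitize an iterable of text lines (e.g. an open file)."""
    for line in lines:
        yield sanitize_xml_line(line)
```

Because it works on decoded text rather than on bytes, this filter leaves non-ASCII characters such as accented letters untouched. It does not repair numeric references that point at forbidden characters, so a parser may still reject some entries.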
Thanks in advance, and Merry Christmas!!

Mohamed Selim

Dec 22, 2025, 4:50:29 AM
to Europe PMC Developer Forum, José Mª Fernández
Hi Jose,

Thank you for reaching out. I will add the bug you reported to our to-do list and let you know when it is fixed.
If you can give me more details about your use case and queries, we might be able to suggest something different.
Regarding references, unfortunately we don't have any bulk download set for this at the moment.
Please let me know if I can help with anything else.
Kind Regards,
Mohamed

José Mª Fernández

Jan 22, 2026, 11:35:29 AM
to Europe PMC Developer Forum, mse...@ebi.ac.uk, José Mª Fernández
Hi, Mohamed!
        First of all, sorry for the delay; I have been a bit offline due to family matters.

        Our use case is that OpenEBench (https://openebench.bsc.es) has a technical monitoring part for tools, which is related to the ELIXIR Life Sciences Research Software Ecosystem. The technical monitoring ingests entries weekly from other members of the Research Software Ecosystem (like bio.tools) in order to discover new entries and changes. The ingested entries about tools can bring additional metadata, like publications, in the form of the identifier of the publication. That identifier can be a PubMed ID, a PMCID and/or a DOI.

        So, one enrichment of that information consists of gathering the list of citations for each of the publications associated with life science tools, which is then condensed and represented as the number of citations per year for that publication. The sources we are currently querying are EuropePMC, PubMed and Wikidata. To avoid counting a citation twice or thrice for a publication, the code gathers additional metadata (i.e. DOI, PubMed ID and PMCID) for each citation and source, and then consolidates the list of citations.

        In the case of EuropePMC (it is very similar for the other sources), the process first queries whether entries exist for a batch of possible publications. For the publications that exist, the list of citations is then gathered, which is tedious, as one or more queries have to be submitted to obtain the complete list of citations for each publication. Lastly, a resolution query is sent for batches of citations, in order to obtain their other associated identifiers (DOI, PMCID, ...).
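
The consolidation step described above (merging citations that different sources report under different identifiers, then counting per year) could be sketched as follows. This is an illustration under stated assumptions, not the actual OpenEBench enricher code: the record layout (dicts with optional `doi`, `pmid`, `pmcid` and `year` keys) and the function names are hypothetical, and two records are treated as the same citation whenever they share at least one identifier.

```python
from collections import Counter

ID_KEYS = ("doi", "pmid", "pmcid")  # hypothetical field names


def consolidate_citations(records):
    """Merge records sharing any identifier, using a small union-find."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (key, normalised value) -> index of first record
    for idx, rec in enumerate(records):
        for key in ID_KEYS:
            value = rec.get(key)
            if not value:
                continue
            ident = (key, str(value).lower())  # DOIs are case-insensitive
            if ident in first_seen:
                parent[find(idx)] = find(first_seen[ident])
            else:
                first_seen[ident] = idx

    merged = {}
    for idx, rec in enumerate(records):
        target = merged.setdefault(find(idx), {})
        for key, value in rec.items():
            if value and key not in target:  # first non-empty value wins
                target[key] = value
    return list(merged.values())


def citations_per_year(citations):
    """Count consolidated citations per publication year, sorted by year."""
    counts = Counter(c["year"] for c in citations if c.get("year"))
    return dict(sorted(counts.items()))
```

The union-find is there because identifiers chain: one source may know only the PubMed ID, another only the DOI, and a third record carrying both is what links them.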

        Hope this explanation can help!

        Best,
                José Mª

Mohamed Selim

Jan 27, 2026, 5:42:23 AM
to Europe PMC Developer Forum, José Mª Fernández, Mohamed Selim
Hi Jose,
Thank you for clarifying. If you are pulling data regularly, you can use the field first_idate, which is short for first indexing date. This field marks when the article was first indexed in our system.
example:
The results should always be the same. The only exception to this is deletions, but I believe that is acceptable.
This will give you the list of articles indexed in Europe PMC within a time interval, along with their IDs. You can get up to 1,000 results in one go and iterate through the API using cursorMark.
This includes all sorts of articles, preprints, etc. Let me know if you are interested in this query or need to narrow it down.
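
The example query did not survive in this thread, so the following is only a sketch of how the cursorMark iteration could look, not the query Mohamed posted. It assumes the public REST search endpoint at `https://www.ebi.ac.uk/europepmc/webservices/rest/search`, a JSON response carrying `resultList.result` and `nextCursorMark`, and a date-range syntax such as `FIRST_IDATE:[2026-01-01 TO 2026-01-07]`. All of these should be checked against the current API documentation. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; verify against the Europe PMC REST documentation.
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.load(response)


def iter_search_results(query, fetch=fetch_json, page_size=1000):
    """Yield every search hit, following cursorMark from page to page."""
    cursor = "*"  # conventional starting value for cursor pagination
    while True:
        params = urllib.parse.urlencode({
            "query": query,
            "format": "json",
            "resultType": "lite",
            "pageSize": page_size,
            "cursorMark": cursor,
        })
        payload = fetch(f"{SEARCH_URL}?{params}")
        results = payload.get("resultList", {}).get("result", [])
        yield from results
        next_cursor = payload.get("nextCursorMark")
        # Stop on an empty page or when the cursor no longer advances.
        if not results or not next_cursor or next_cursor == cursor:
            break
        cursor = next_cursor
```

A weekly run could then call something like `iter_search_results("FIRST_IDATE:[2026-01-01 TO 2026-01-07]")` and keep only the `id`, `source` and `citedByCount` fields of each hit. The `fetch` parameter exists so that retries, throttling or a test double can be plugged in.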

For each article you can check the value of citedByCount: if it is greater than 0, the article is cited by other articles in the system.
example article:
<id>39761155</id>
<source>MED</source>
It has 4 articles citing it.
To get these articles you can use the query:

There might be minor discrepancies in the data and a lag in updates, as we update our citation network on a quarterly basis and, as you can expect, the number of citations of an article changes over time.
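
The citations query itself was also lost from this thread. As an assumption rather than a reconstruction of it, the sketch below pages through a per-article citations endpoint of the form `/{source}/{id}/citations`, with `page`, `pageSize` and `format` parameters and a JSON response containing `hitCount` and `citationList.citation`. The endpoint shape, parameter names and function names should all be treated as unverified.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL; verify against the Europe PMC REST documentation.
REST_BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.load(response)


def iter_citations(source, ext_id, fetch=fetch_json, page_size=1000):
    """Yield the articles citing (source, ext_id), one page at a time."""
    page = 1
    seen = 0
    while True:
        params = urllib.parse.urlencode(
            {"page": page, "pageSize": page_size, "format": "json"}
        )
        payload = fetch(f"{REST_BASE}/{source}/{ext_id}/citations?{params}")
        batch = payload.get("citationList", {}).get("citation", [])
        yield from batch
        seen += len(batch)
        # Stop on an empty page or once the reported total has been reached.
        if not batch or seen >= payload.get("hitCount", 0):
            break
        page += 1
```

For the example article above, `list(iter_citations("MED", "39761155"))` would be expected to return the citing articles, subject to the quarterly update lag mentioned in the message.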
Please let me know if this matches your use case, or whether we should look into something else.
Looking forward to your reply.
Kind Regards,
Mohamed





