Best way to maintain a copy of OA articles


Cong Chen

Nov 21, 2025, 4:03:40 AM
to Europe PMC Developer Forum
Hi,

I'm doing some work that requires keeping an up-to-date copy of the full OA article collection, which I'm currently retrieving from https://europepmc.org/ftp/oa/. I note that the files are updated weekly, but I would prefer to redownload them only when they have actually changed rather than on a fixed schedule.

Is it safe to assume that once I have downloaded a specific file, such as PMC10630001_PMC10639174.xml.gz, I will not need to download it again? If files are updated, is there a way to know which ones have changed?

Thanks, 

Cong

Madhumiethaa Jayaprabha Palanisamy

Nov 21, 2025, 7:24:15 AM
to Europe PMC Developer Forum, vola...@gmail.com

Hi Cong,

Thanks for your question.

The OA files on the Europe PMC FTP site are generated as a weekly full refresh rather than as incremental updates. This means that even if a file keeps the same name, its contents may still change. Europe PMC does not provide per-file update information, but you can identify which records have changed by querying the API with the UPDATE_DATE field, for example:

https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=OPEN_ACCESS:Y%20AND%20UPDATE_DATE:[2025-11-16%20TO%202025-11-23]

Please note that UPDATE_DATE reflects any type of update, whether to the file or to the metadata.
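
For anyone scripting this, the weekly query above can be generated from a date range. The following is only a minimal sketch: the function name `build_update_query_url` is hypothetical, and it does nothing more than reproduce the URL shown above for arbitrary dates.

```python
from datetime import date
from urllib.parse import urlencode, quote

# Base URL taken from the example query in this thread.
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_update_query_url(start: date, end: date) -> str:
    """Build the search URL for OA records updated between start and end.

    Hypothetical helper: it only assembles the query string shown in the
    thread and does not contact the API.
    """
    query = (
        f"OPEN_ACCESS:Y AND UPDATE_DATE:[{start.isoformat()} TO {end.isoformat()}]"
    )
    # Keep ':' and '[]' unescaped and encode spaces as %20, so the result
    # matches the form of the example URL.
    return f"{SEARCH_URL}?{urlencode({'query': query}, safe=':[]', quote_via=quote)}"


if __name__ == "__main__":
    print(build_update_query_url(date(2025, 11, 16), date(2025, 11, 23)))
```

Since UPDATE_DATE covers metadata changes as well as file changes, the set of records returned is an upper bound on the records whose full text actually changed.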

The filenames (e.g. PMC13900_PMC17829.xml.gz) represent processing chunks and are not continuous ranges, though all PMCIDs within a file will fall somewhere within the numeric range shown. Given that, one possible approach is to take the updated PMCIDs returned by the API, look them up in your own local mapping of which PMCIDs belong to which FTP file, and download only the affected files. The specific workflow is up to you, as we don't provide any file-level update indicators.
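
As a rough illustration of the local-mapping idea, the sketch below parses the numeric range out of a chunk filename and works out which files to download again for a set of updated PMCIDs. All function names are hypothetical, the range is assumed to be inclusive at both ends, and the mapping is assumed to be something you build yourself when you first unpack each file.

```python
import re

# Filename pattern as seen in this thread, e.g. PMC13900_PMC17829.xml.gz
CHUNK_RE = re.compile(r"^PMC(\d+)_PMC(\d+)\.xml\.gz$")


def chunk_range(filename):
    """Return the (low, high) numeric PMCID range encoded in a chunk filename."""
    match = CHUNK_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected chunk filename: {filename!r}")
    return int(match.group(1)), int(match.group(2))


def candidate_files(pmcid, filenames):
    """Files whose range could contain this PMCID (range assumed inclusive)."""
    text = pmcid.upper()
    number = int(text[3:] if text.startswith("PMC") else text)
    hits = []
    for name in filenames:
        low, high = chunk_range(name)
        if low <= number <= high:
            hits.append(name)
    return hits


def files_to_refresh(updated_pmcids, pmcid_to_file, filenames):
    """Split updated PMCIDs into known files to redownload and unmapped IDs.

    pmcid_to_file is your own local mapping built from earlier downloads.
    PMCIDs missing from it (for example, newly open access articles) are
    returned together with the files whose range could hold them.
    """
    known, unmapped = set(), {}
    for pmcid in updated_pmcids:
        if pmcid in pmcid_to_file:
            known.add(pmcid_to_file[pmcid])
        else:
            unmapped[pmcid] = candidate_files(pmcid, filenames)
    return known, unmapped
```

Because the ranges are not continuous, a PMCID falling inside a file's range does not guarantee that the file contains it; the range only narrows down where to look.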

Best regards,
Madhu

Cong Chen

Nov 21, 2025, 7:46:24 AM
to Europe PMC Developer Forum, mad...@ebi.ac.uk
Hello Madhu,

This is very helpful. It looks like we're getting fewer than 20 updates a week, so we could, for instance, use a separate job to monitor these updates and reprocess the affected files.

I assume it is also possible for an article to transition from non-open-access to open access. Very rarely this might change the filenames of the processing chunks, but generally the article would fall in the middle of an existing chunk. I assume the query would capture this as well, since it is a metadata change and the paper is now open access?

Thank you,

Cong

Madhumiethaa Jayaprabha Palanisamy

Nov 21, 2025, 9:35:16 AM
to Europe PMC Developer Forum, vola...@gmail.com, Madhumiethaa Jayaprabha Palanisamy

Hi,
Yes, that’s correct, the query will capture that. Please also note that the weekly update volume will be much higher than 20: the sample query above, for instance, returned 27,239 updated OA records (hitCount) for a single week.
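
With tens of thousands of hits per week, a monitoring job has to page through the results rather than read a single response. The sketch below shows one way that loop might look. The request parameters (`format`, `resultType`, `pageSize`, `cursorMark`) and the response fields (`nextCursorMark`, `resultList.result`, `pmcid`) are assumptions not stated in this thread and should be checked against the Europe PMC REST documentation; `iter_updated_pmcids` and `fetch_json` are hypothetical names.

```python
import json
from urllib.parse import urlencode, quote
from urllib.request import urlopen

SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def fetch_json(url):
    """Fetch one page of results and decode it as JSON."""
    with urlopen(url, timeout=60) as response:
        return json.load(response)


def iter_updated_pmcids(query, fetch=fetch_json, page_size=1000):
    """Yield PMCIDs for every record matching the query, page by page.

    Assumed (not confirmed in this thread): cursor-based paging via
    cursorMark/nextCursorMark and an ID-only result type.
    """
    cursor = "*"  # assumed starting cursor value
    while True:
        params = {
            "query": query,
            "format": "json",
            "resultType": "idlist",
            "pageSize": page_size,
            "cursorMark": cursor,
        }
        page = fetch(f"{SEARCH_URL}?{urlencode(params, quote_via=quote)}")
        results = page.get("resultList", {}).get("result", [])
        for record in results:
            pmcid = record.get("pmcid")
            if pmcid:  # skip records that carry no PMCID
                yield pmcid
        next_cursor = page.get("nextCursorMark")
        # Stop when there is no further page or the cursor stops advancing.
        if not results or not next_cursor or next_cursor == cursor:
            break
        cursor = next_cursor
```

Passing `fetch` as a parameter keeps the paging logic testable without network access, and leaves room to add rate limiting or retries in one place.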

Best regards,
Madhu
