Should the OAI-PMH service be used?

Michał Politowski

Apr 2, 2025, 8:29:12 AM
to Europe PMC Developer Forum
Hello,
I would be interested in using the OAI-PMH service as a way to incrementally collect full texts from Europe PMC. Incremental harvesting is in theory exactly what OAI-PMH is for.

But at the moment it seems that the Europe PMC OAI-PMH service returns "500 Internal Server Error" more often than not. Some of the errors are temporary and disappear with retries but this is of course not ideal resource-consumption-wise - at both ends.
And additionally some of these errors seem permanent, like the one described in 2020 in https://groups.google.com/a/ebi.ac.uk/g/epmc-webservices/c/eEjAN59Utb0/m/k2foUactBgAJ
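
For the temporary errors, a bounded retry with exponential backoff at least limits the load on both ends. Below is a minimal sketch, not a tested harvester: `fetch_with_retries`, its parameters and the delay values are hypothetical names and numbers chosen for illustration, and the attempt cap is there so that a permanently failing request is eventually reported instead of retried forever.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, opener=urllib.request.urlopen, max_attempts=5,
                       base_delay=1.0, sleep=time.sleep):
    """Fetch `url`, retrying on HTTP 5xx with exponential backoff.

    `opener` and `sleep` are injectable so the logic can be exercised
    without touching the network. All names here are illustrative.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Client errors (4xx) will not fix themselves, and after the
            # last attempt the error is treated as permanent.
            if err.code < 500 or attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

A request that still fails after `max_attempts` raises the last `HTTPError`, so the caller can log the URL and move on.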

So, should harvesters rather concentrate on using the "REST" API instead? It would be a less efficient approach in theory, with no response batching, but in practice it's looking to be the only option.
If so, which date field should be used in API queries to best approximate what OAI-PMH would give in terms of incremental harvesting?

Regards,
Michał Politowski

Islam Hassan

Apr 28, 2025, 6:03:39 AM
to Europe PMC Developer Forum, Michał Politowski
Dear Michał,

Apologies for the delay in my reply.

Our OAI service is a copy provided by PMC USA. We are aware of the issues with OAI service stability. The work to address these issues has been added to our roadmap, but has not yet been scheduled and will likely take some time to resolve. At the moment I do not have an estimated timeline for this fix.
For the time being we would recommend using the REST API to retrieve metadata from Europe PMC. I appreciate it is not an ideal solution in this case, and I apologise for the inconvenience. If there is anything I can help with in the meantime, please let me know.

Best wishes,
Islam, on behalf of the Europe PMC team

Michał Politowski

Apr 28, 2025, 8:47:12 AM
to Europe PMC Developer Forum, isha...@ebi.ac.uk
Dear Islam,

Thank you for your answer. As a follow-up question about approximating OAI-PMH using the REST API:

I would like to regularly find new results that match a "HAS_FT:y AND OPEN_ACCESS:y" query.

An obvious idea is to query weekly for e.g. "HAS_FT:y AND OPEN_ACCESS:y AND some_date:[2025-04-01 TO 2025-04-07]"
But is there a "some_date" field that ensures that I miss nothing, or at least very little, that way?
I guess that FIRST_IDATE will not work if one of HAS_FT or OPEN_ACCESS changes to 'yes' during a subsequent update.
So UPDATE_DATE looks promising, although it will probably return quite a few records that already matched HAS_FT:y AND OPEN_ACCESS:y some weeks before.
But will it miss something? And is there some other date field that will mostly/only be updated when a fulltext appears?
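
To make the weekly-window idea concrete, here is a small sketch that splits a date range into consecutive seven-day windows and builds the query string for each. The helper names `weekly_windows` and `build_query` are hypothetical, and the date field is left as a parameter because which field to use is exactly the open question.

```python
from datetime import date, timedelta


def weekly_windows(start, end):
    """Yield (first_day, last_day) pairs covering start..end inclusive.

    Windows are at most 7 days long and do not overlap.
    """
    current = start
    while current <= end:
        last = min(current + timedelta(days=6), end)
        yield current, last
        current = last + timedelta(days=1)


def build_query(date_field, first_day, last_day):
    """Build the search query for one window; `date_field` is e.g. UPDATE_DATE."""
    return (
        "(HAS_FT:y AND OPEN_ACCESS:y) AND "
        f"({date_field}:[{first_day.isoformat()} TO {last_day.isoformat()}])"
    )


if __name__ == "__main__":
    for first, last in weekly_windows(date(2025, 4, 1), date(2025, 4, 16)):
        print(build_query("UPDATE_DATE", first, last))
```

The windows do not overlap, so whether a record is seen once or several times depends only on how the chosen date field behaves.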

Regards,
Michał Politowski

Islam Hassan

Apr 28, 2025, 9:18:34 AM
to Europe PMC Developer Forum, Michał Politowski, Islam Hassan

Dear Michał,

If you're looking to incrementally collect new open-access full-text articles, I would suggest the CREATION_DATE field.
Example:
https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=(HAS_FT:y%20AND%20OPEN_ACCESS:y)%20AND%20(CREATION_DATE:[2025-04-27%20TO%202025-04-28])&resultType=core&format=json&pageSize=1000

The caveat here is that there might be changes to the full-text article (or its metadata) afterwards that won't be captured by only searching by the creation date. However, if you only care about new articles, then it should be enough.

If you care about updates to articles you have already harvested, the UPDATE_DATE field can be useful.
Example:
https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=(HAS_FT:y%20AND%20OPEN_ACCESS:y)%20AND%20(UPDATE_DATE:[2025-04-27%20TO%202025-04-28])&resultType=core&format=json&pageSize=1000
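
A date window can match more than one page of results, so a harvester has to page through them. Here is a rough sketch of how that might look using only the Python standard library. The `query`, `resultType`, `format` and `pageSize` parameters are taken from the example URLs above; the `cursorMark` request parameter and the `nextCursorMark` and `resultList.result` response fields are assumptions about the API that should be checked against the REST documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def fetch_page(params):
    """Perform one search request and return the decoded JSON response."""
    with urlopen(f"{SEARCH_URL}?{urlencode(params)}") as response:
        return json.load(response)


def harvest(query, fetch=fetch_page, page_size=1000):
    """Yield every result for `query`, following cursors page by page.

    Assumes (not confirmed in this thread) that the response carries
    `nextCursorMark` and `resultList.result`.
    """
    cursor = "*"  # assumed initial cursor value
    while True:
        page = fetch({
            "query": query,
            "resultType": "core",
            "format": "json",
            "pageSize": page_size,
            "cursorMark": cursor,
        })
        results = page.get("resultList", {}).get("result", [])
        yield from results
        next_cursor = page.get("nextCursorMark")
        # Stop on an empty page or when the cursor no longer advances.
        if not results or not next_cursor or next_cursor == cursor:
            break
        cursor = next_cursor
```

For example, `harvest("(HAS_FT:y AND OPEN_ACCESS:y) AND (UPDATE_DATE:[2025-04-27 TO 2025-04-28])")` would iterate over all matching records for that window, one request per page.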

I hope this is helpful and please let me know if you have any further questions.

Best wishes,
Islam Hassan, on behalf of the Europe PMC team

Michał Politowski

May 6, 2025, 7:12:44 AM
to Europe PMC Developer Forum, isha...@ebi.ac.uk
Dear Islam,

Thank you for the reply.
Could you also explain the difference between CREATION_DATE and FIRST_IDATE? Both fields are mentioned in various documents with similar descriptions.
Although it seems that UPDATE_DATE is our best bet anyway if we don't want to miss changes for already harvested articles.

Regards,
Michał Politowski

Islam Hassan

May 6, 2025, 9:10:50 AM
to Europe PMC Developer Forum, Michał Politowski, Islam Hassan
Dear Michał,

The CREATION_DATE field records when the article first enters our database, while the FIRST_IDATE field records when it is first indexed by our search engine and appears on our website.
The two dates are typically identical or very close to each other, but can sometimes differ slightly for technical reasons.

As for keeping up with changes to already harvested articles, relying on the UPDATE_DATE field is the way to go, as you mentioned.
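
Since an UPDATE_DATE query will also return records that were harvested before, the harvester needs to merge results rather than append them. A minimal sketch of that step follows; it assumes each result record carries `source` and `id` fields that together identify an article (an assumption about the response, not something confirmed in this thread), and `merge_updates` is a hypothetical helper name.

```python
def merge_updates(store, records):
    """Merge harvested records into `store`, keyed by (source, id).

    Returns a (new, updated) pair of counts. Records already present are
    overwritten with the latest version.
    """
    new = 0
    updated = 0
    for record in records:
        key = (record["source"], record["id"])  # assumed identifying fields
        if key in store:
            updated += 1
        else:
            new += 1
        store[key] = record
    return new, updated
```

Running this on each weekly UPDATE_DATE batch keeps one copy per article, and the returned counts show how much of a batch was already known.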

Please let me know if you have any further questions.


Best wishes,
Islam Hassan, on behalf of the Europe PMC team
