I am building a data pipeline for my university to extract full-text content for a large set of Open Access records matching the query "breast cancer". When I run this query on the Europe PMC website, I get:
Full text: In Europe PMC (656,815)
Full text: Unpaywall link (63,878)
Total results: 921,629
Can someone clarify for how many of these I can get the full text through the API, and for how many I can get the metadata?
My current logic uses the /search endpoint with resultType=core and cursorMark for deep paging. I have two distinct workflows:
For records with a PMCID, I successfully used the /{PMCID}/fullTextXML endpoint to retrieve JATS XML for around 400k records.
For Open Access records without a PMCID, I am currently parsing the fullTextUrlList to find links where availabilityCode is "F" and documentStyle is "html". Sometimes there is a PDF link, which helps a lot, but the HTML does not contain the full text.
My question: how can I retrieve the full text for these records without a PMCID (articles, books, etc.)?
Any advice on improving the robustness of this "no-PMCID" workflow would be greatly appreciated.
Best regards
Thanks for reaching out.
We provide bulk downloads of the Open Access full-text content available in Europe PMC via FTP, including both XML and PDFs. The XML set is updated weekly, while the PDF set is updated monthly. Each file is mapped to its corresponding PMCID. You can explore the bulk download options here: https://europepmc.org/downloads/openaccess
Although the FTP doesn't support query-based downloads directly, you can combine it with the Search API to achieve the same result. For example, refine your query using the open access and full-text filters, "breast cancer" AND OPEN_ACCESS:Y AND HAS_FT:Y, and retrieve all matching PMCIDs (see https://europepmc.org/searchsyntax#fulltextavailability for more syntax). If you only need IDs, set resultType=idList in the API call. Once you have the PMCIDs, download the OA bulk set from FTP and filter it locally to keep only the files corresponding to your PMCIDs. This approach avoids a large number of API calls.
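As a rough illustration of the ID-collection step, here is a minimal Python sketch that pages through the search endpoint with cursorMark and gathers the PMCIDs into a set. The helper names (fetch_page, collect_pmcids) are made up for this example, and the JSON field names it reads (resultList, result, pmcid, nextCursorMark) are assumptions about the response layout that you should check against a real response before relying on them.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def fetch_page(query, cursor_mark, page_size=1000):
    """Fetch one page of search results as a dict (performs a network call)."""
    params = urllib.parse.urlencode({
        "query": query,
        "resultType": "idList",   # IDs only, as suggested above
        "cursorMark": cursor_mark,
        "pageSize": page_size,
        "format": "json",
    })
    with urllib.request.urlopen(f"{SEARCH_URL}?{params}", timeout=60) as resp:
        return json.load(resp)


def collect_pmcids(query, fetch=fetch_page):
    """Follow cursorMark from page to page and return the set of PMCIDs seen.

    `fetch` is injectable so the paging logic can be tested offline.
    """
    pmcids = set()
    cursor = "*"  # "*" requests the first page
    while True:
        page = fetch(query, cursor)
        # Assumed response layout: {"resultList": {"result": [...]}, "nextCursorMark": "..."}
        results = page.get("resultList", {}).get("result", [])
        for record in results:
            if record.get("pmcid"):
                pmcids.add(record["pmcid"])
        next_cursor = page.get("nextCursorMark")
        # Stop on an empty page, a missing cursor, or a cursor that no longer advances.
        if not results or not next_cursor or next_cursor == cursor:
            break
        cursor = next_cursor
    return pmcids


if __name__ == "__main__":
    wanted = collect_pmcids('"breast cancer" AND OPEN_ACCESS:Y AND HAS_FT:Y')
    print(len(wanted), "PMCIDs collected")
```

The resulting set can then be used as a lookup while you walk the downloaded bulk files, keeping only the entries whose PMCID is in it.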
And yes, only the subset of OA articles with a PMCID will have full text directly available in Europe PMC. For all other OA records, your approach using fullTextUrlList looks good. It provides links to the PDF, HTML, or DOI, whichever are available.
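For the records without a PMCID, one way to make the link selection a little more robust is to rank the entries in fullTextUrlList rather than taking only the HTML one. The sketch below is only an illustration: pick_full_text_url is a hypothetical helper, the nesting fullTextUrlList -> fullTextUrl is an assumption about the core result layout, and accepting "OA" alongside "F" as an availabilityCode is likewise an assumption. It prefers PDF over HTML over DOI, since you mention that the PDF links tend to be the most useful.

```python
def pick_full_text_url(record, styles=("pdf", "html", "doi"), codes=("OA", "F")):
    """Return the best freely available full-text URL for one search record.

    `styles` lists documentStyle values in order of preference;
    `codes` lists the availabilityCode values treated as freely accessible
    ("OA" is an assumption; "F" is the code mentioned in the question).
    Returns None when no suitable link exists.
    """
    # Assumed layout: record["fullTextUrlList"]["fullTextUrl"] is a list of link dicts.
    links = (record.get("fullTextUrlList") or {}).get("fullTextUrl") or []
    usable = [
        link for link in links
        if link.get("availabilityCode") in codes and link.get("url")
    ]
    for style in styles:
        for link in usable:
            if (link.get("documentStyle") or "").lower() == style:
                return link["url"]
    return None


if __name__ == "__main__":
    example = {
        "fullTextUrlList": {
            "fullTextUrl": [
                {"availabilityCode": "F", "documentStyle": "html", "url": "https://example.org/a.html"},
                {"availabilityCode": "F", "documentStyle": "pdf", "url": "https://example.org/a.pdf"},
            ]
        }
    }
    print(pick_full_text_url(example))
```

Records for which this returns None can be logged separately, so you can see how many items have no usable link at all.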