PDF downloads - 10M new files added, bad files removed

Casey Meyer

May 26, 2026, 9:59:16 AM

to OpenAlex users

Hi All,

We've made some big improvements to the PDF and parsed PDF (grobid XML) download service. Last week we fixed sync issues, and the fix added 5.3M PDFs and 5.2M parsed PDFs to the database. Many of those were created in the last six months.

We deleted bad files that returned HTML or garbage text, including 3M bad PDFs and around 8M bad parsed PDFs. In the next two months we will run a job to replace the bad XML files with good ones.

When we tested samples of data after this fix, 99% of files accessed through the OpenAlex API were valid. Thanks to the OpenAlex users who reported these issues. To read more about file downloads, check out the documentation here: https://developers.openalex.org/download/full-text-pdfs
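
Since a small share of files may still be invalid, a quick client-side check after download can catch the HTML or garbage-text cases described above. The sketch below is only an illustration, not part of the OpenAlex service: the function names are hypothetical, it works on bytes you have already downloaded, and it assumes a valid PDF has the "%PDF-" header near the start and that a grobid XML file has a TEI root element.

```python
import xml.etree.ElementTree as ET


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the '%PDF-' header appears in the first 1024 bytes.

    Hypothetical helper: an HTML error page or garbage text saved with a
    .pdf name will normally fail this check.
    """
    return b"%PDF-" in data[:1024]


def looks_like_tei_xml(data: bytes) -> bool:
    """Return True if the bytes parse as XML with a TEI root element.

    Hypothetical helper: assumes the parsed PDF is grobid TEI output.
    """
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Not well-formed XML at all (garbage text, truncated file, most HTML).
        return False
    # Drop a namespace prefix such as '{http://www.tei-c.org/ns/1.0}'.
    return root.tag.rsplit("}", 1)[-1] == "TEI"
```

Files that fail either check could be skipped and retried later, for example after the replacement job for bad XML files has run.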

Thanks,
Casey
