PDF downloads - 10M new files added, bad files removed

Casey Meyer

May 26, 2026, 9:59:16 AM
to OpenAlex users
Hi All,

We've made some big improvements to the PDF and parsed PDF (GROBID XML) download service. Last week we fixed sync issues, which added 5.3M PDFs and 5.2M new parsed PDFs to the database. Many of those files were created in the last six months.

We deleted bad files that returned HTML or garbage text, including 3M bad PDFs and around 8M bad parsed PDFs. In the next two months we will run a job to replace the bad XML files with good ones.

When we tested samples of data after this fix, 99% of the files accessed through the OpenAlex API were valid. Thanks to the OpenAlex users who reported these issues. To read more about file downloads, check out the documentation here: https://developers.openalex.org/download/full-text-pdfs
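
If you want to spot-check files you have already downloaded, here is a minimal sketch of a local sanity check. It is not the validation we ran on our side. The function names are hypothetical, and the heuristics are assumptions: a PDF should start with the `%PDF-` header, an HTML error page usually starts with `<!doctype html` or `<html`, and a parsed file should at least be well-formed XML.

```python
import xml.etree.ElementTree as ET


def looks_like_html(data: bytes) -> bool:
    """Heuristic: payload begins like an HTML page (e.g. an error page)."""
    head = data[:1024].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def looks_like_pdf(data: bytes) -> bool:
    """Heuristic: payload begins with the PDF header marker."""
    # Tolerating leading whitespace is an assumption, not a requirement.
    return data.lstrip()[:5] == b"%PDF-"


def looks_like_parsed_xml(data: bytes) -> bool:
    """Heuristic: payload is well-formed XML and does not look like HTML."""
    if looks_like_html(data):
        return False
    try:
        ET.fromstring(data)
    except ET.ParseError:
        # Empty payloads and garbage text end up here.
        return False
    return True
```

For example, read a file with `open(path, "rb").read()` and pass the bytes to `looks_like_pdf` or `looks_like_parsed_xml`. These checks only catch the obvious failures described above (HTML or garbage text); they do not confirm that a PDF renders or that the XML matches the expected schema.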

Thanks,
Casey