Hi All,
We've made some big improvements to the PDF and parsed PDF (GROBID XML) download service. Last week we fixed sync issues, which added 5.3M PDFs and 5.2M new parsed PDFs to the database. Many of those files were created in the last six months.
We also deleted bad files that returned HTML or garbage text, including 3M bad PDFs and around 8M bad parsed PDFs. Over the next two months we will run a job to replace the bad XML files with good ones.
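If you want to sanity-check files you have already downloaded, here is a minimal Python sketch of the kind of check that separates a real PDF or parsed XML file from an HTML page or garbage text. It is not the check we run internally. The function names are hypothetical, and it assumes that a valid PDF starts with the "%PDF-" header and that GROBID output is TEI XML with a TEI root element.

```python
import xml.etree.ElementTree as ET


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with a PDF header (assumption: '%PDF-')."""
    # Tolerate leading whitespace before the header.
    return data[:1024].lstrip().startswith(b"%PDF-")


def looks_like_html(data: bytes) -> bool:
    """Return True if the bytes look like an HTML page rather than a file."""
    head = data[:1024].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def looks_like_tei_xml(data: bytes) -> bool:
    """Return True if the bytes parse as XML with a TEI root element."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Garbage text or truncated XML ends up here.
        return False
    # Namespaced tags look like '{http://www.tei-c.org/ns/1.0}TEI'.
    return root.tag.endswith("TEI")
```

These functions only inspect bytes you already have on disk or in memory, so they work the same however the file was fetched.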
In samples of data tested after this fix, 99% of files accessed through the OpenAlex API were valid. Thanks to the OpenAlex users who reported these issues. To read more about file downloads, check out the documentation here:
https://developers.openalex.org/download/full-text-pdfs
Thanks,
Casey