Dear community,
OpenAlex’s coverage of preprint articles is generally very high. Most servers reach over 98% coverage as long as DOIs are available.
However, I recently noticed that arXiv is a clear exception. In my calculations, approximately 24%, 37%, and 74% of articles in computer science, quantitative biology, and astrophysics, respectively, are missing. A summary of these statistics is available here:
I wonder whether this gap stems from the fact that arXiv only began assigning its own DOIs in 2022. Although OpenAlex contains more than 1.2 million pre-2022 arXiv records, the missing items appear to be concentrated between 2007 and 2021.
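For anyone who wants to run a similar check, here is a minimal sketch of one way to compute the missing share per year. It assumes you already have a list of arXiv identifiers (for example from arXiv's own metadata) and a set of DOIs found in OpenAlex; the function names are hypothetical and nothing here calls the OpenAlex API. It also assumes new-style identifiers (YYMM.NNNNN, used since 2007) and matches only on the arXiv-issued DOI (prefix 10.48550/arXiv.), so records that OpenAlex indexes without that DOI would be counted as missing.

```python
import re
from collections import defaultdict

ARXIV_DOI_PREFIX = "10.48550/arXiv."  # prefix of arXiv-issued DOIs


def arxiv_doi(arxiv_id: str) -> str:
    """Build the arXiv-issued DOI for an identifier, dropping any version suffix."""
    base = re.sub(r"v\d+$", "", arxiv_id.strip())
    return ARXIV_DOI_PREFIX + base


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common resolver prefixes so DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def submission_year(arxiv_id: str):
    """Return the year encoded in a new-style arXiv ID (YYMM.NNNNN), else None."""
    m = re.match(r"^(\d{2})(\d{2})\.\d{4,5}(v\d+)?$", arxiv_id.strip())
    return 2000 + int(m.group(1)) if m else None


def missing_share_by_year(arxiv_ids, found_dois):
    """Fraction of arXiv IDs per year whose arXiv DOI is absent from found_dois."""
    found = {normalize_doi(d) for d in found_dois}
    total = defaultdict(int)
    missing = defaultdict(int)
    for aid in arxiv_ids:
        year = submission_year(aid)
        if year is None:  # old-style IDs (e.g. astro-ph/0601001) are skipped
            continue
        total[year] += 1
        if normalize_doi(arxiv_doi(aid)) not in found:
            missing[year] += 1
    return {year: missing[year] / total[year] for year in sorted(total)}
```

Calling `missing_share_by_year(ids, dois_in_openalex)` returns a dictionary mapping each year to the fraction of identifiers with no matching DOI, which makes it easy to see whether the gap really clusters in 2007 to 2021.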
As scientometric research increasingly focuses on preprints, comprehensive coverage is becoming more important, especially because OpenAlex is one of the few bibliographic databases that index citations to preprints and to their subsequent journal publications separately. This functionality is crucial for investigating many aspects of preprint scholarship.
I am happy to share the list of missing arXiv entries if it would be useful. Please let me know if there is anything I can contribute.
Thank you for your work maintaining and improving OpenAlex.
Best regards,
Chiaki