Data update (repository source attribution fix): 44M works corrected, 9M OA reclassifications

Jason Priem

Feb 11, 2026, 11:06:45 PM
to OpenAlex users

Fixing repository source attribution in OpenAlex


OpenAlex harvests metadata from about 5,000 repository endpoints, each belonging to a specific repository. Until now, we weren't doing a good job of labeling which repository each record came from. Many records had no source at all -- we had the metadata, but didn't know where it came from. Others were attributed to the wrong source, because our matching logic couldn't distinguish between institutions that share the same hosting platform (Pure, DiVA, DSpace, etc.).


This bug has existed for a long time, but it used to matter a lot less. In the old system, repository data was secondary -- we pulled most metadata from Crossref and other publisher sources, and repository records were layered on top. With the new Walden pipeline, repositories are first-class sources: we mint works directly from repository metadata. That shift made this attribution bug much more consequential, and also much easier to fix.


We've now built an authoritative mapping from each harvesting endpoint to the correct source, and corrected the data. Here's what changed:


Source attributions:

[Image: CleanShot 2026-02-11 at 22.02.29@2x.png]

The 20.6 million works that gained a source are the headline: these are works we were harvesting from repositories but couldn't identify the repository they came from. Now we can.


Impact on OA status

For about 20.6 million of these works, the fix also revealed the host type. Previously, without a source, we didn't know these works were hosted by a repository -- so `primary_location.source.type` was null. Now that we've identified the source, we know it's a repository. And since a work hosted by a repository is green OA, this triggered a large reclassification:


[Image: CleanShot 2026-02-11 at 22.03.40@2x.png]

The 8.9 million gold-to-green shift is the biggest user-visible impact. These works were in repositories, but because we didn't know the source was a repository, they were incorrectly classified as gold. Now they're correctly green.


The 494K works that moved from green to closed lost their OA status because the source we'd previously attributed them to was wrong -- and with the correct attribution, we no longer have evidence they're openly accessible. We believe most of these works *are* in a repository; we just can't confirm which one yet. Restoring OA status for these works is an active priority.
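For anyone who wants to check the effect on their own snapshot, here is a minimal sketch of the two pieces involved: reading `primary_location.source.type` safely when the source is null, and tallying status changes between two snapshots. The helper names, work IDs, and sample records are made up for illustration; this is not how OpenAlex computes OA status internally.

```python
from collections import Counter
from typing import Optional


def host_type(work: dict) -> Optional[str]:
    """Return primary_location.source.type, or None if the source is missing."""
    location = work.get("primary_location") or {}
    source = location.get("source") or {}
    return source.get("type")


def tally_transitions(before: dict, after: dict) -> Counter:
    """Count (old, new) OA-status pairs for works whose status changed.

    Both arguments map a work ID to an OA status string.
    """
    changes = Counter()
    for work_id, old in before.items():
        new = after.get(work_id)
        if new is not None and new != old:
            changes[(old, new)] += 1
    return changes


# Made-up records: the same work before and after the attribution fix.
unattributed = {"id": "W1", "primary_location": {"source": None}}
attributed = {"id": "W1", "primary_location": {"source": {"type": "repository"}}}

# Made-up status snapshots keyed by work ID.
before = {"W1": "gold", "W2": "green", "W3": "gold"}
after = {"W1": "green", "W2": "closed", "W3": "gold"}
transitions = tally_transitions(before, after)
```

Run against real before/after exports, the same tally would surface the gold-to-green and green-to-closed groups described above.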


What's next
  • Adding ~500 new repository sources: Our biggest addition of new repositories in years, which should improve open access coverage for existing works and add a large number of newly minted works.
  • Restoring lost OA status: Most of the ~345K OA losses should be recoverable with additional endpoint verification.
  • Ongoing: The new mapping system ensures that as new endpoints are added, they'll be correctly attributed from the start.
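To make the endpoint-to-source idea concrete, here is a rough sketch of why keying on the harvesting endpoint avoids the shared-platform problem. Everything in it is hypothetical: the endpoint URLs, the source IDs, and both lookup tables are invented for illustration, and the platform-keyed lookup is only a stand-in for the kind of ambiguity described earlier, not the actual old matching code.

```python
from typing import Optional

# Hypothetical harvesting endpoints and source IDs (not real OpenAlex data).
ENDPOINT_TO_SOURCE = {
    "https://pure.uni-a.example/oai": "S-uni-a",
    "https://pure.uni-b.example/oai": "S-uni-b",
}

# A platform-keyed table can hold only one source per platform, so every
# institution hosted on that platform collapses onto the same source.
PLATFORM_TO_SOURCE = {"pure": "S-uni-a"}


def source_by_platform(platform: str) -> Optional[str]:
    """Ambiguous lookup: all repositories on a platform get one source."""
    return PLATFORM_TO_SOURCE.get(platform.lower())


def source_by_endpoint(endpoint_url: str) -> Optional[str]:
    """Unambiguous lookup: each endpoint maps to exactly one source.

    Returns None for an unknown endpoint rather than guessing.
    """
    return ENDPOINT_TO_SOURCE.get(endpoint_url.rstrip("/"))


# Two records harvested from different institutions on the same platform.
record_a = {"endpoint": "https://pure.uni-a.example/oai", "platform": "Pure"}
record_b = {"endpoint": "https://pure.uni-b.example/oai/", "platform": "Pure"}

by_platform = [source_by_platform(r["platform"]) for r in (record_a, record_b)]
by_endpoint = [source_by_endpoint(r["endpoint"]) for r in (record_a, record_b)]
```

In this sketch an unrecognized endpoint returns None instead of falling back to a best guess, which mirrors the choice to leave a work without a source until the endpoint is verified.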

Thanks so much for your support, and please don't hesitate to get in touch with any comments, questions, ideas, or feedback!
Best,
j

PS lots of really exciting new API goodies coming next week shhh....
