Fixing repository source attribution in OpenAlex
OpenAlex harvests metadata from about 5,000 repository endpoints, each belonging to a specific repository. Until now, we weren't doing a good job of labeling which repository each record came from. Many records had no source at all -- we had the metadata, but didn't know where it came from. Others were attributed to the wrong source, because our matching logic couldn't distinguish between institutions that share the same hosting platform (Pure, DiVA, DSpace, etc.).
This bug has existed for a long time, but it used to matter a lot less. In the old system, repository data was secondary -- we pulled most metadata from Crossref and other publisher sources, and repository records were layered on top. With the new Walden pipeline, repositories are first-class sources: we mint works directly from repository metadata. That shift made this attribution bug much more consequential, and also much easier to fix.
We've now built an authoritative mapping from each harvesting endpoint to the correct source, and corrected the data. Here's what changed:
Source attributions:

The 20.6 million works that gained a source are the headline: we were already harvesting these works from repositories, but couldn't tell which repository each one came from. Now we can.
Impact on OA status
For about 20.6 million of these works, the fix also revealed the host type. Previously, without a source, we didn't know these works were hosted by a repository -- so `primary_location.source.type` was null. Now that we've identified the source, we know it's a repository. And since a work hosted by a repository is green OA, this triggered a large reclassification:

The 8.9 million gold-to-green shift is the biggest user-visible impact. These works were in repositories, but because we didn't know the source was a repository, they were incorrectly classified as gold. Now they're correctly green.
The 494K works that moved from green to closed lost their OA status because the source we'd previously attributed them to was wrong -- and with the correct attribution, we no longer have evidence they're openly accessible. We believe most of these works *are* in a repository; we just can't confirm which one yet. Restoring OA status for these works is an active priority.
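To make the reclassification concrete, here is a minimal sketch of how a missing host type can flip a work's OA status. It is a deliberately simplified illustration, not OpenAlex's actual classification code: the helper names `host_type` and `oa_status_for` are hypothetical, the sample records are hand-written, and every non-repository OA case is collapsed into "gold" (the real scheme has more categories).

```python
from typing import Any, Optional


def host_type(work: dict[str, Any]) -> Optional[str]:
    """Return primary_location.source.type, or None if any level is missing."""
    location = work.get("primary_location") or {}
    source = location.get("source") or {}
    return source.get("type")


def oa_status_for(work: dict[str, Any]) -> str:
    """Toy OA classifier (assumption: simplified to three outcomes)."""
    location = work.get("primary_location") or {}
    if not location.get("is_oa"):
        # No evidence of an openly accessible copy.
        return "closed"
    if host_type(work) == "repository":
        # An OA copy hosted by a repository counts as green.
        return "green"
    # Simplification: with no known repository host, the OA copy is
    # treated as publisher-hosted, which this sketch labels gold.
    return "gold"


# Hypothetical records shaped loosely like OpenAlex work objects.
before_fix = {"primary_location": {"is_oa": True, "source": None}}
after_fix = {"primary_location": {"is_oa": True, "source": {"type": "repository"}}}
wrong_source_removed = {"primary_location": {"is_oa": False, "source": None}}

status_before = oa_status_for(before_fix)
status_after = oa_status_for(after_fix)
status_unverified = oa_status_for(wrong_source_removed)
```

In this sketch the same work moves from gold to green once its source is known, and a work whose earlier (incorrect) attribution was removed falls back to closed until a correct source can be confirmed.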
Restoring lost OA status: Most of the ~345K OA losses should be recoverable with additional endpoint verification.
Ongoing: The new mapping system ensures that as new endpoints are added, they'll be correctly attributed from the start.
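The difference between the old platform-based matching and an endpoint-keyed mapping can be sketched roughly as follows. This is only an illustration under stated assumptions: the endpoint IDs, source IDs, institution names, and function names are all hypothetical, and the real mapping is certainly more involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    source_id: str      # hypothetical identifier
    display_name: str   # hypothetical repository name


# Two made-up institutions that share the same hosting platform.
SOURCE_A = Source("S-A", "University A Research Portal")
SOURCE_B = Source("S-B", "University B Research Portal")

# Platform-based view: one platform, several candidate sources.
SOURCES_BY_PLATFORM = {"Pure": [SOURCE_A, SOURCE_B]}

# Endpoint-based view: each harvesting endpoint maps to exactly one source.
ENDPOINT_TO_SOURCE = {
    "endpoint-001": SOURCE_A,
    "endpoint-002": SOURCE_B,
}


def attribute_by_platform(platform: str) -> Optional[Source]:
    """Only succeeds when the platform has a single candidate source."""
    candidates = SOURCES_BY_PLATFORM.get(platform, [])
    return candidates[0] if len(candidates) == 1 else None


def attribute_by_endpoint(endpoint_id: str) -> Optional[Source]:
    """Look up the source directly from the endpoint that produced the record."""
    return ENDPOINT_TO_SOURCE.get(endpoint_id)


ambiguous = attribute_by_platform("Pure")
resolved = attribute_by_endpoint("endpoint-002")
unknown = attribute_by_endpoint("endpoint-999")
```

Keying on the endpoint removes the ambiguity between institutions on a shared platform, and an endpoint that is not yet in the mapping returns no source rather than a wrong one.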