Unpaywall rewrite complete!

Jason Priem

May 20, 2025, 1:19:01 PM
to Unpaywall discussion (read-only)

As of today, the Unpaywall API and dataset come from a new, completely rewritten codebase. For most users, there’s no action you need to take.

What do you need to do?

If you’re just using the API: nothing.

If you’re using the data feed: download and reload this new snapshot, overwriting your old data with the new dataset. We’re sorry for the inconvenience, but there are too many small changes to fit in a normal changefile.

What schema changes are there?

Just two tiny changes:

  • oa_locations.evidence field is deprecated. It was always sketchy and the docs urged folks not to use it, so we’re removing it.
  • oa_locations.updated field is deprecated.

Both keys are still there (for now), but the value is always set to the string “deprecated”.
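
For data feed users who want to ignore the placeholder values, here is a minimal sketch of one way to drop them when loading records. It assumes each record is a plain dict with an oa_locations list; the helper name strip_deprecated and the sample record are hypothetical, not part of any Unpaywall tooling.

```python
# Assumption: a record is a dict whose "oa_locations" is a list of dicts.
DEPRECATED = "deprecated"
DEPRECATED_KEYS = ("evidence", "updated")


def strip_deprecated(record):
    """Return a copy of record whose oa_locations omit deprecated placeholders."""
    cleaned = dict(record)
    cleaned["oa_locations"] = [
        {
            key: value
            for key, value in location.items()
            # Drop a key only if it is one of the two deprecated fields
            # AND still carries the literal placeholder string.
            if not (key in DEPRECATED_KEYS and value == DEPRECATED)
        }
        for location in record.get("oa_locations") or []
    ]
    return cleaned


# Hypothetical sample record, for illustration only.
sample = {
    "doi": "10.1234/example",
    "oa_locations": [
        {
            "url": "https://example.org/paper.pdf",
            "evidence": "deprecated",
            "updated": "deprecated",
        }
    ],
}

print(strip_deprecated(sample)["oa_locations"])
# [{'url': 'https://example.org/paper.pdf'}]
```

Checking the value as well as the key means older records that still carry real values are left untouched.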

Why is this happening?

We wrote Unpaywall nearly a decade ago, and over that time we accrued a lot of technical debt. The interest on that debt got too high:

  • It got really hard for us to address bug reports. Everything we fixed broke two other things, and every code structure was a load-bearing wall. Support has suffered.
  • Unpaywall and OpenAlex didn’t always agree about open access status, which is confusing.
  • Adding new features and data sources became impossible.

The point of the rewrite is to fix these issues.

What data changes are there?

Overall, the data isn’t changing much. We put a lot of effort into making sure that, in aggregate, metrics from the dataset (like the percentage of works that are gold open access, or the number of works with licenses, and so forth) change by less than 5%, and where they do change, it’s mostly in the direction of higher accuracy. However, 5% of 150M is still a big number, so you will see a lot of individual changes.
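
If you keep both the old and new snapshots, you can check how an aggregate metric moved in your own copy. The sketch below is one way to do that, under two assumptions: that each record carries an oa_status field, and that "change" means relative change in each status's share of works. The function names and the toy data are hypothetical.

```python
from collections import Counter


def status_shares(records):
    """Return the fraction of records in each oa_status category."""
    counts = Counter(record.get("oa_status") for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: n / total for status, n in counts.items()}


def relative_change(old_shares, new_shares):
    """Relative change per status: (new - old) / old, skipping zero baselines."""
    return {
        status: (new_shares.get(status, 0.0) - share) / share
        for status, share in old_shares.items()
        if share
    }


# Toy data for illustration only: gold share moves from 20% to 21%.
old = [{"oa_status": "gold"}] * 20 + [{"oa_status": "closed"}] * 80
new = [{"oa_status": "gold"}] * 21 + [{"oa_status": "closed"}] * 79

changes = relative_change(status_shares(old), status_shares(new))
for status, change in sorted(changes.items()):
    print(f"{status}: {change:+.2%}")
```

In the toy data a one-point shift in gold share is a 5% relative change, which is why the choice between relative change and percentage points matters when comparing against a threshold.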

The reality is that Unpaywall is always building on metadata with, shall we say, “diverse levels of accuracy.” We’re the most accurate OA index out there because we work hard to be, but as Daniel Day-Lewis might say, There Will Be Bugs. The great thing about the new system is that it’ll let us address these bugs more quickly. Which brings us to:

What’s the future of Unpaywall?

  • Bugs will get fixed faster. This was the number one goal of the project.
  • To facilitate this, we’re launching a web-based curation portal that lets you manually correct bad data; your corrections will be applied within days. We’ll announce that later this week.
  • Data will change more quickly. Our approach has always been continuous improvement of the data, but lately this hasn’t happened. With the cleaner codebase will come faster and more consistent updates to data quality and new sources.
  • Finally, later this year we’ll finish an ongoing rewrite of OpenAlex so that it always agrees with Unpaywall about open access status.

Bonus

The new system cuts the average API response time by 90%, from 500ms to 50ms. ⚡👍

As always, we’d love to hear your feedback, and thanks for your support!

Best,
Jason
