Three recent changefiles are super large; here's why.

Jason Priem

Oct 27, 2025, 7:35:46 PM
to Unpaywall announcements

In the last few days we've pushed three really big changefiles for Data Feed subscribers:

  1. Thursday: 19M works, diverse changes (see below)
  2. Friday: 48M works, BUG which was quickly reverted
  3. Today (Monday): 14M works, diverse changes (see below)

Thursday, 19M works changed

When assigning repository_institution, we now use the institution name linked to the source rather than the raw source name. We made this change primarily to improve OpenAlex locations.source.host_organization, but it also altered the Unpaywall repository_institution field, which uses the same data. It has little to no effect on is_oa or oa_status (gold, bronze, etc).

Friday, 48M works changed (briefly)

This was a bug! The file was available through the API at https://api.unpaywall.org/feed/changefiles for three hours, from 8 to 11 PM EST, and was named “changed_dois_with_versions_2025-10-24T235107.jsonl.gz”. When we realized it was a mistake, we deleted it from the API.

Although the file was only live for three hours, we know some of you have automated systems that can quickly pick up on changes like this. So please revert any changes if you ingested the file. We’re sorry for this error and are making improvements to ensure it doesn’t happen again.
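
For anyone unsure whether an automated pipeline picked up the reverted file, a check along these lines may help. It is only a sketch: it assumes your pipeline keeps a plain-text log with one ingested changefile name (or path, or URL) per line. The `ingested_changefiles.txt` path and the function name are hypothetical; only the filename itself comes from this announcement.

```python
from pathlib import Path

# Filename of the reverted changefile, as given in this announcement.
BAD_CHANGEFILE = "changed_dois_with_versions_2025-10-24T235107.jsonl.gz"


def ingested_bad_changefile(log_lines):
    """Return True if any log line names the reverted changefile.

    Each line may be a bare filename, a local path, or a URL that
    ends with the filename.
    """
    for line in log_lines:
        if line.strip().endswith(BAD_CHANGEFILE):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical log location; point this at wherever your
    # pipeline records the changefiles it has ingested.
    log_path = Path("ingested_changefiles.txt")
    if not log_path.exists():
        print(f"No ingest log found at {log_path}")
    elif ingested_bad_changefile(log_path.read_text().splitlines()):
        print("Reverted changefile was ingested; roll back those changes.")
    else:
        print("Reverted changefile not found in the ingest log.")
```

If the check comes back positive, the records touched by that file are the ones to revert.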

Today (Monday), 14M works changed

Today we’re releasing one more large update, which includes two types of changes:

  1. 11M records gained a journal_name. This mostly affects books and book chapters, which were missing journal_name before.
  2. 3M adjustments to OA locations, with some impact on OA status for around 800k records. This is a result of improving source matching for eBook sources: the status of the source (OA vs toll-access) affects the OA status of the article. We also saw changes from adding PubMed matches that we’d been missing before, a bug we’d been hoping to fix for a long time and have finally squashed.
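
If you want to see which of these changes touched your own copy of the data, something like the following may be a starting point. It is a rough sketch, not an official tool: it assumes changefile lines are JSON records with the usual Unpaywall fields (`journal_name`, `is_oa`, `oa_status`, and `oa_locations` entries carrying `repository_institution`), and that you can look up the previous version of each record yourself. The function names are made up for illustration.

```python
import gzip
import json

# Top-level fields mentioned in this announcement.
TOP_LEVEL_FIELDS = ("journal_name", "is_oa", "oa_status")


def iter_changefile(path):
    """Yield one record (dict) per non-empty line of a .jsonl.gz changefile."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def repository_institutions(record):
    """Collect the repository_institution values from a record's oa_locations."""
    locations = record.get("oa_locations") or []
    return sorted({loc.get("repository_institution") or "" for loc in locations})


def changed_fields(old, new):
    """List which of the fields discussed here differ between two record versions."""
    changed = [f for f in TOP_LEVEL_FIELDS if old.get(f) != new.get(f)]
    if repository_institutions(old) != repository_institutions(new):
        changed.append("repository_institution")
    return changed
```

Running `changed_fields(previous_record, record)` for each record from `iter_changefile(...)` and tallying the results should give a rough picture of how many of your records changed in journal_name, OA status, or repository_institution.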

Why so many changes?

We're putting the finishing touches on the OpenAlex codebase rewrite, which fully launches a week from Tuesday (Nov 4). Because Unpaywall and OpenAlex now share the same codebase, these changes propagate to the Unpaywall dataset as well.

You can expect the number of updates to go back down to historical levels very soon, as we finish the OpenAlex launch.

The good news is that (with the exception of the 3-hour, 48M-record bug) these changes are improving the quality and accuracy of the Unpaywall dataset, including closing many bugs that have been lurking for a long time.

Best,
Jason

