Major bug in OpenAlex snapshot

Richard Orr

Jun 22, 2022, 1:00:41 PM
to openale...@googlegroups.com
Dear OpenAlex users,

During the most recent update of the JSON-formatted snapshot on 2022-06-09, we introduced a bug that resulted in some Entity rows being unusable. This applies only to the "standard" format snapshot in openalex, not the MAG-format data in openalex-mag-format.

The problem:

Specifically, some rows were not parseable JSON because each backslash was replaced with a double backslash, so that a string like:

"this string contains \"escaped double quotes\""
became
"this string contains \\"escaped double quotes\\""

The second form is not parseable as JSON because a parser now reads each \\ as an escaped backslash. The double quote that follows it (previously escaped) is therefore interpreted as ending the string, and everything after it is invalid input.
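
As an illustration only (not our export code), the failure can be reproduced with Python's standard json module. The variable names are made up for this sketch, and the replace() call simply mimics the effect of the bug described above:

```python
import json

# The original, valid row content (raw string so the backslashes are literal).
good = r'"this string contains \"escaped double quotes\""'

# Mimic the bug: every backslash becomes a double backslash.
bad = good.replace("\\", "\\\\")

# The valid form parses to a string containing plain double quotes.
print(json.loads(good))

# The doubled form ends the string early, so the parser rejects the rest.
try:
    json.loads(bad)
except json.JSONDecodeError as err:
    print("not parseable:", err)
```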

Our fix:

Yesterday, we corrected these rows. The Entity files now contain only parseable JSON.

The impact:

How this affects you depends greatly on your import process.
  • If you parse each row as JSON early in the process, some rows would have generated errors. If these errors were silently ignored, you missed the updates or new Entities in the malformed rows. If the process quit and reported the error, you would have seen something like JSONDecodeError (thank you to those of you who saw this and reported it to us).
  • If you stored the Entity rows as text to be parsed later, you probably overwrote good rows with bad ones, which are unusable.
  • If you didn't download the snapshot or any portion of it between 2022-06-09 and 2022-06-21, there was no impact to you.
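
If you are unsure which of these cases applies to you, a rough check along the following lines may help. It is only a sketch: it assumes your downloaded files are gzip-compressed with one JSON object per line, and find_bad_rows is a hypothetical helper name, not part of any OpenAlex tooling. It detects rows that fail to parse and does not attempt to repair them.

```python
import gzip
import json


def find_bad_rows(path):
    """Return (line_number, error_message) for each row that is not valid JSON.

    Assumes a gzip-compressed file with one JSON object per line.
    """
    bad = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                bad.append((line_number, str(err)))
    return bad
```

Calling find_bad_rows on a downloaded file returns an empty list when every row parses. An empty result does not prove the file is unaffected: a row whose only backslashes were in sequences such as \n still parses after doubling, just with different content.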

What to do about it:

If you downloaded any snapshot partitions between 2022-06-09 and 2022-06-21, the simplest and most thorough fix is to download and process the entire snapshot again. We do not recommend shortcuts like only replacing the partitions you downloaded in that period, or repairing already-downloaded files locally.

We're sorry! We won't do this again.

We understand that bugs like this are, at best, annoying and inconvenient. This one was introduced by a migration of our entire data storage and export layer, which we won't be repeating. New QA steps are in place to prevent corrupted data from appearing in the snapshot again. We apologize for letting it happen this time.

Best,
Richard

--
Richard Orr
Lead Developer - Unpaywall, OpenAlex
OurResearch
We build tools to make scholarly research more open, connected, and reusable—for everyone.