August snapshot released


Casey Meyer

Aug 19, 2023, 9:39:47 AM
to OpenAlex users
Hi All,

The August snapshot is available for download! This snapshot is extra special because it contains our improved author disambiguation feature, along with several other improvements. The full release notes are below. You can see the snapshot here: https://openalex.s3.amazonaws.com/browse.html (docs)

Release notes
- released new authors disambiguation feature
- fixed missing source assignment for 5.7M works
- improved affiliation matching resulting in additional ~1.1M works matched to institutions
- works with more than 100 authors no longer have authors truncated
- modified Work.type, added Work.type_crossref
- added APC data for 3,508 journals
- added authorships.countries attribute
- resolved minor snapshot bugs affecting abstract_inverted_index and manifest, removed "@" fields

Thanks,
Casey

--
Casey Meyer, CTO
OurResearch
We build tools to make scholarly research more open, connected, and reusable—for everyone.

Alexis-Michel Mugabushaka

Aug 21, 2023, 7:42:53 PM
to Casey Meyer, OpenAlex users

Thank you. 

Can’t wait to check the new author ids after the holidays. 



Jason Augustyn

Aug 28, 2023, 12:43:12 PM
to OpenAlex users
Thanks, Casey, my team is excited to work with the improved author disambiguation!

One issue we're running into: It appears that the data schema for documents in the earlier "updated_date=*" files differs from that in later files. For example, the schema in "updated_date=2021-11-03/part_000.gz" is different from the schema in "updated_date=2023-08-18/part_000.gz". The 2023 file follows the current documentation, but the older file seems to use an outdated schema.

I haven't done a full comparison to see how common these schema differences are, but any differences are going to make automated processing of the snapshot files very challenging.
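
For anyone who wants to check this on their own copy, here is a minimal sketch of such a comparison. It assumes the part files are gzipped JSON Lines (one record per line) that have already been downloaded locally. The function names `top_level_keys` and `schema_diff` and the `max_records` sampling limit are made up for illustration and are not part of any OpenAlex tooling.

```python
import gzip
import json


def top_level_keys(path, max_records=1000):
    """Return the union of top-level keys in the first max_records records."""
    keys = set()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for i, line in enumerate(handle):
            if i >= max_records:
                break
            line = line.strip()
            if line:  # skip blank lines
                keys.update(json.loads(line).keys())
    return keys


def schema_diff(old_path, new_path, max_records=1000):
    """Report which top-level keys appear in only one of the two files."""
    old_keys = top_level_keys(old_path, max_records)
    new_keys = top_level_keys(new_path, max_records)
    return {
        "only_in_old": sorted(old_keys - new_keys),
        "only_in_new": sorted(new_keys - old_keys),
    }
```

Calling `schema_diff("updated_date=2021-11-03/part_000.gz", "updated_date=2023-08-18/part_000.gz")` on local copies would list the fields that differ between the two files. This only looks at top-level keys in a sample of records, so differences in nested objects would need a deeper walk.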

Can you offer any insight?

Thanks,

Jason

Casey Meyer

Aug 29, 2023, 10:36:32 AM
to OpenAlex users
Hi Jason,

Thanks for pointing that out! We're going to fix that for the next snapshot. I believe the root cause is that we have some old records in the works portion of the snapshot that need to be deleted or merged. We will get to the bottom of it, though.

Thanks,
Casey

Jason Augustyn

Aug 29, 2023, 1:30:53 PM
to OpenAlex users
Awesome, thanks, Casey!