Announcing the New OpenAlex Author Data


Jason Portenoy

Aug 11, 2023, 12:08:39 PM
to OpenAlex users

Dear OpenAlex community,


We’d like you to join us in celebrating our new and improved OpenAlex Authors!


The new authors are available now in the API, and will be in the next data snapshot in late August. They are a massive improvement over what came before, in the following ways:

  • A much better author disambiguation model, which helps greatly with (among other things) the “author splitting” problem, where individual authors were split into multiple author profiles.

  • A fully live system for assigning authors and affiliations to new works, using features including names, institutional affiliations, coauthors, citations, and concepts.

  • Tight integration with ORCID, when available.


Try exploring the new author data! The OpenAlex web app, while still in alpha, is a great tool for this: just start typing names into the search bar. Many author names that used to be split between 2, 3, or sometimes dozens of different author profiles are now much more accurately represented as single authors. 👏


The new system is a change only to the authors and the way they are assigned to works in the work.authorships attribute. It is not a change to the data schema. For most use cases, the only big change you’ll notice should be an improved experience with much more sensible and accurate author data.
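For anyone who wants to see where the new authors show up in a work record, here is a minimal sketch in Python. It assumes the usual shape of a work as described in the OpenAlex documentation (an `authorships` list whose entries carry an `author` object with `id`, `display_name`, and `orcid`, plus an `institutions` list). The `sample_work` record and the `summarize_authorships` helper are made up for illustration, so please check field names against the docs before relying on them.

```python
from typing import Any

# Illustrative record only: names, institution, and ORCID are placeholders,
# not real OpenAlex data.
sample_work: dict[str, Any] = {
    "id": "https://openalex.org/W0000000000",
    "authorships": [
        {
            "author": {
                "id": "https://openalex.org/A5069065088",
                "display_name": "Example Author One",
                "orcid": "https://orcid.org/0000-0000-0000-0000",
            },
            "institutions": [{"display_name": "Example University"}],
        },
        {
            "author": {
                "id": "https://openalex.org/A5000000001",
                "display_name": "Example Author Two",
                "orcid": None,
            },
            "institutions": [],
        },
    ],
}


def summarize_authorships(work: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten work["authorships"] into one small dict per author."""
    rows = []
    for authorship in work.get("authorships", []):
        author = authorship.get("author") or {}
        rows.append(
            {
                "author_id": author.get("id"),
                "display_name": author.get("display_name"),
                "has_orcid": author.get("orcid") is not None,
                "institutions": [
                    inst.get("display_name")
                    for inst in authorship.get("institutions", [])
                ],
            }
        )
    return rows


rows = summarize_authorships(sample_work)
for row in rows:
    print(row)
```

Because the schema is unchanged, code like this that already reads `authorships` should keep working; only the author IDs and their assignment to works differ.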


Our methods, code, and models are all, of course, fully open. You can find technical documentation on the author disambiguation model on GitHub here. You will also find code and links to training data there.


As we have mentioned previously, due to the size of this update, we made the decision to discard all of the old author IDs, rather than have any direct mapping of old authors to new authors. This was a difficult decision to make, as the principle of data persistence is, of course, very important to us and many of you. In this case, the benefits of starting fresh with completely new authors make it worth it. The old author data, including the mapping of OpenAlex works to old author IDs, is available here. All new author IDs have a numeric component >5000000000 (e.g., https://openalex.org/A5069065088). Any ID below this is an old author ID, and should not be connected with any works.
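If you have author IDs stored from before the switch, the rule above is easy to check in code. The following is a minimal sketch, not an official utility: the helper names `author_id_number` and `is_new_author_id` are hypothetical, and the parser assumes IDs come either as a bare key (`A5069065088`) or as a full `https://openalex.org/...` URL.

```python
import re

# Threshold stated in the announcement: new author IDs have a numeric
# component greater than 5000000000.
NEW_AUTHOR_ID_THRESHOLD = 5_000_000_000

# Accepts "A5069065088" or "https://openalex.org/A5069065088" (assumed formats).
_AUTHOR_ID_RE = re.compile(r"(?:https?://openalex\.org/)?A(\d+)", re.IGNORECASE)


def author_id_number(author_id: str) -> int:
    """Return the numeric component of an OpenAlex author ID."""
    match = _AUTHOR_ID_RE.fullmatch(author_id.strip())
    if match is None:
        raise ValueError(f"not an OpenAlex author ID: {author_id!r}")
    return int(match.group(1))


def is_new_author_id(author_id: str) -> bool:
    """True if the ID belongs to the new author set, per the threshold above."""
    return author_id_number(author_id) > NEW_AUTHOR_ID_THRESHOLD


print(is_new_author_id("https://openalex.org/A5069065088"))
print(is_new_author_id("A1234567890"))  # made-up ID below the threshold
```

A check like this can be used to flag stale IDs in a local database so they can be re-resolved against the new authors.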


Below, you can see some graphs showing how OpenAlex’s data changed when we switched over to the new authors. Our total count of authors dropped from 127 million to its current value of 92 million, which is still probably a bit high but much closer to the actual real-world number:


Authors with 0 works were a bug that was difficult to control with the old author system. This count dropped to nearly zero when we made the switch, and is staying that way ☺️:

Authors with only one work are not simply a bug; it is very common for an individual to have contributed to only a single work. However, with the old system, the number was much higher than it should have been, due to inaccurate disambiguation and the author splitting problem. As we would hope, this number dropped significantly with the new system, from 85 million down to 53 million.

Finally, our new system is much better integrated with ORCID, an established persistent identifier for researchers. The percentage of works that have at least one author tied to an ORCID saw a big improvement with the new authors, shooting up from 15% to 41%.


We are super excited to be able to celebrate this milestone with you! Better author disambiguation has been one of our most-requested features. It is also one of the hardest problems in data science and machine learning. We’re very pleased to share our results, and we thank you for your patience leading up to and during the rollout. We are so grateful for your ongoing support of OpenAlex! And, of course, we’re not stopping here. Stay tuned for more announcements, including some big improvements in our institutions data!


Cheers,

OpenAlex Team
