ORCID assignment

Ed Summers

Sep 30, 2024, 11:05:51 AM

to openalex-...@googlegroups.com

While working with Works harvested from OpenAlex using ORCIDs, I found one example where an ORCID was assigned to the wrong researcher:

https://api.openalex.org/works/W4393316292

This work metadata lists the author Abby C. King with ORCID https://orcid.org/0000-0002-7949-8811 who is affiliated with the University of Haifa. If you look at the publication it is clear that the affiliation and name are correct.

However, the ORCID belongs to Abby King at Stanford University, who has never worked at the University of Haifa. They are in fact two different people:

- https://mcia.haifa.ac.il/dr-abby-king/
- https://profiles.stanford.edu/abby-king/

I noticed that the Crossref metadata for this publication doesn’t list an ORCID for Abby King, so I’m assuming OpenAlex added it, or it came from somewhere else?

https://api.crossref.org/works/10.1080/08946566.2024.2332141
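The comparison described above (checking whether an ORCID on the OpenAlex record is also present in the Crossref record) can be sketched as follows. This is a minimal illustration, not an OpenAlex or Crossref tool: the helper names `fetch_json`, `bare_orcid`, and `orcids_not_in_crossref` are made up for this example. It assumes the OpenAlex work JSON exposes `authorships[].author.orcid` and that the Crossref `message` object lists authors under `author` with an optional `ORCID` key.

```python
import json
import urllib.request


def fetch_json(url):
    """Fetch a URL and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def bare_orcid(value):
    """Reduce 'https://orcid.org/0000-...' or '0000-...' to the bare iD, or None."""
    if not value:
        return None
    return value.rstrip("/").rsplit("/", 1)[-1].upper()


def orcids_not_in_crossref(openalex_work, crossref_message):
    """Return (display_name, orcid) pairs that OpenAlex lists but Crossref does not."""
    crossref_orcids = {
        bare_orcid(author.get("ORCID"))
        for author in crossref_message.get("author", [])
    }
    crossref_orcids.discard(None)

    extra = []
    for authorship in openalex_work.get("authorships", []):
        author = authorship.get("author") or {}
        orcid = bare_orcid(author.get("orcid"))
        if orcid and orcid not in crossref_orcids:
            extra.append((author.get("display_name"), orcid))
    return extra
```

With network access, it could be called as `orcids_not_in_crossref(fetch_json("https://api.openalex.org/works/W4393316292"), fetch_json("https://api.crossref.org/works/10.1080/08946566.2024.2332141")["message"])`. Any pair it returns is an ORCID that came from somewhere other than the publisher's Crossref deposit.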

Can anyone help me understand how this might have happened, and how authors are being matched up with ORCIDs when they are not provided?

Thanks!
Ed Summers

Samuel Mok

Oct 2, 2024, 5:44:25 AM

to Ed Summers, openalex-...@googlegroups.com

Hi Ed,

Well, matching data by people's names is a very difficult task, but it is required to match datasets like this. I'm not 100% certain how the OA team matches up all the data, but I do know how they try to match names: they use this rather gnarly SQL/Python function:

https://github.com/ourresearch/openalex-guts/blob/439bade6e661a8702fc26ef9e1caaca419a4b200/sql/db_udfs/f_matching_author_string.sql

The first giant block normalizes the name, and the last few lines are the rather simple way of matching these normalized strings. This basically means that if two people have the same first + last name after going through the normalization process, OpenAlex will merge them into a single Author.
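To make the failure mode concrete, here is a heavily simplified sketch of matching on normalized first + last name. It is not the linked OpenAlex function, which handles far more cases; `normalize_name` and `same_author_by_name` are hypothetical names, and the rules (strip accents, lowercase, drop punctuation, keep only the first and last token) are assumptions chosen for illustration.

```python
import re
import unicodedata


def normalize_name(name):
    """Collapse a display name to a 'first last' key (illustrative rules only)."""
    # Split accented characters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and turn everything that is not a-z or whitespace into a space.
    cleaned = re.sub(r"[^a-z\s]", " ", unaccented.lower())
    tokens = cleaned.split()
    if not tokens:
        return ""
    if len(tokens) == 1:
        return tokens[0]
    # Middle names and initials are discarded: only first and last token survive.
    return f"{tokens[0]} {tokens[-1]}"


def same_author_by_name(name_a, name_b):
    """Treat two names as the same person if their normalized keys are equal."""
    key_a, key_b = normalize_name(name_a), normalize_name(name_b)
    return bool(key_a) and key_a == key_b
```

Under these rules, `normalize_name("Abby C. King")` and `normalize_name("Abby King")` both yield `"abby king"`, so `same_author_by_name` reports a match even though nothing about affiliation or field was checked.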

This is something that can definitely be improved, but it's pretty difficult to do! In order to do proper automatic matching you will need to code a long, long, long list of specific exceptions to the rules.

For my own application, I perform some additional cleanup on all data I get from OpenAlex (e.g. all publications with an affiliation to my university) to filter out false positives when I retrieve works for my institution. This is way easier to do than trying to fix the entire OpenAlex dataset: not only is this a much smaller subset of items, I can also use a master list of affiliated authors from my own university, as well as a list of all known publications we keep track of in our repository. I use those to grab all the items from the OpenAlex dataset that match known items. The remaining results need some more detailed attention, which can be done in various ways. I'm currently quite pleased with the results of using embeddings on normalized string representations of publications and/or authors to match them.
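The first pass of that cleanup could look roughly like the sketch below. It is an illustration under assumptions, not my actual pipeline: `triage_works`, `known_dois`, and `staff_orcids` are hypothetical names, the works are assumed to be OpenAlex work dicts with `doi` and `authorships[].author.orcid`, and the embedding step for the leftovers is not shown.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common URL/scheme prefixes; None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def bare_orcid(value):
    """Reduce an ORCID URL to the bare iD."""
    return value.rstrip("/").rsplit("/", 1)[-1].upper()


def triage_works(works, known_dois, staff_orcids):
    """Split works into confirmed, likely, and needs_review buckets."""
    known = {normalize_doi(d) for d in known_dois}
    known.discard(None)
    staff = {bare_orcid(o) for o in staff_orcids if o}

    buckets = {"confirmed": [], "likely": [], "needs_review": []}
    for work in works:
        work_orcids = set()
        for authorship in work.get("authorships", []):
            orcid = (authorship.get("author") or {}).get("orcid")
            if orcid:
                work_orcids.add(bare_orcid(orcid))

        if normalize_doi(work.get("doi")) in known:
            # Already in the institutional repository: safe to accept.
            buckets["confirmed"].append(work)
        elif work_orcids & staff:
            # An ORCID from the master list appears, but OpenAlex may have
            # misassigned it, so this is a hint rather than proof.
            buckets["likely"].append(work)
        else:
            buckets["needs_review"].append(work)
    return buckets
```

Only the DOI match is treated as confirmation here; an ORCID hit is kept in a separate bucket because, as this thread shows, the ORCID on an OpenAlex authorship can belong to a different person.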

Cheers,

Samuel

