bad data quality for affiliations


Kevin McCurley

Nov 18, 2025, 12:35:53 PM
to OpenAlex Community
I have seen numerous comments here about how people find flaws in the OpenAlex data, and the same could be said about data in Crossref. Unfortunately, I suspect the situation may be much worse in OpenAlex than it is in Crossref. Crossref receives data straight from publishers, and publishers sometimes submit false information, but not very often. More often they simply omit information. As far as I can tell, OpenAlex attempts to fill in information that is missing from a record. Today I discovered that my affiliation in OpenAlex is listed as Weizmann Institute of Science in Israel. The fact of the matter is that I've never been to Israel and never had an affiliation there. Moreover, I don't think any of my 50 coauthors have ever had an affiliation there, though I could be wrong on that. Almost all of my publications have carried an affiliation in the standard way. This raises the question of how bad the affiliation data in OpenAlex is. Has anyone attempted to measure the quality of this data?

Simon van Bellen

Nov 18, 2025, 1:31:11 PM
to openalex-...@googlegroups.com
I too have encountered issues with OpenAlex's indexing of some of the content disseminated by Érudit, specifically concerning author affiliations. For context, Érudit is a Canadian platform for mostly diamond OA journals in the social sciences and humanities, hosting about 350 scholarly journals and 170,000 articles.

The issue applies to the new version of OpenAlex, but it may have been present in previous versions as well. The errors detected are not related to a bad usage of Crossref fields; it is OpenAlex's parser that is poorly suited to our data. Specifically, OpenAlex appears to look for footnotes on the article's front page to retrieve author affiliations. That may be valid in many cases, but not at Érudit, and because we have a standard format for the cover page, the error occurs very frequently. I would not be surprised if many other publishers or platforms are affected similarly.

I calculated the proportion of article-author combinations on Érudit that have incorrect affiliation identification in OpenAlex (which I was able to validate since we have retroactively recovered this information for the years 2015-2024). As a percentage, for all article-author combinations in OpenAlex:
  • 61% had no errors in the identified affiliation(s);
  • 13% had at least one error in the affiliations but were partially correct (in other words, OpenAlex returned several affiliations, at least one of which was correct);
  • 26% had no correct affiliations.
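
A three-way breakdown like the one above could be computed along the following lines. This is only a minimal sketch, not the method actually used at Érudit: the function names and category labels are hypothetical, and it assumes each affiliation has already been normalized to a comparable identifier (for example a ROR ID) on both sides. It also assumes that "no errors" means every affiliation OpenAlex returned appears in the validated set; affiliations that OpenAlex missed are not counted as errors here.

```python
from collections import Counter

LABELS = ("no_errors", "partially_correct", "none_correct")


def classify_pair(openalex_affs, validated_affs):
    """Classify one article-author pair (hypothetical labels).

    Both arguments are iterables of normalized institution identifiers.
    """
    openalex = set(openalex_affs)
    validated = set(validated_affs)
    if not (openalex & validated):
        # Nothing OpenAlex returned matches the validated affiliations.
        return "none_correct"
    if openalex <= validated:
        # Everything OpenAlex returned is in the validated set.
        return "no_errors"
    # At least one match and at least one wrong affiliation.
    return "partially_correct"


def summarize(pairs):
    """Return the percentage of pairs falling in each category."""
    counts = Counter(classify_pair(o, v) for o, v in pairs)
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: 100 * counts[label] / total for label in LABELS}


if __name__ == "__main__":
    # Made-up identifiers, purely for illustration.
    sample = [
        (["inst-A"], ["inst-A"]),
        (["inst-A", "inst-B"], ["inst-A"]),
        (["inst-C"], ["inst-A"]),
    ]
    print(summarize(sample))
```

Whether a missing affiliation should count as an error is a design choice; a stricter variant would require the two sets to be equal for "no_errors".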

We are currently exploring ways to get the corrected metadata into OpenAlex. Ideally, we would send standardized affiliation metadata to Crossref.

Simon van Bellen




Ginny Hendricks

Nov 18, 2025, 2:36:45 PM
to Simon van Bellen, openalex-...@googlegroups.com
I can't speak for OpenAlex, of course, but Érudit does send Crossref raw affiliation strings, though not for everything, and not yet any ROR IDs, as their metadata dashboard shows. Wider adoption of ROR by publishers would really help everyone with affiliation disambiguation.

Ginny


Kevin McCurley

Nov 18, 2025, 6:46:04 PM
to OpenAlex Community
As a publisher we accept ROR IDs, and I think they are very useful. But I don't think they are necessary to distinguish "Google" from "Weizmann Institute of Science".

Euan Adie

Nov 19, 2025, 5:04:07 AM
to Kevin McCurley, OpenAlex Community
Yep: we use the affiliation data pretty heavily in production, and before making that decision compared some test sets to manual curation / other databases. OpenAlex was broadly in line with Dimensions and other commercial databases, which is to say around 80% of articles had correct affiliations with the vast majority of the remainder just missing data (there were definitely errors, too, usually stemming from affiliations extracted from the wrong place or a bad mapping from affiliation string to institution ID / normalized name... the other DBs also have these issues).

We didn't dig into subject / year differences but I assume that the data is worse for older articles than new ones, just because publishers are a little bit better at affiliation metadata nowadays. Would also assume if a university has worked with a provider then their slice of data is going to be better inside that provider too.

We pull data from CRIS systems from some customers. FWIW in all but one case the numbers (papers in OpenAlex with that affiliation vs papers in the institution's CRIS) are within about 5-10% of each other. The one case involves an organization that is an international collaboration where authors sometimes put the org as their primary affiliation on papers, and sometimes their 'home' institution instead.
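
A count comparison of this kind can be sketched as below. This is an illustrative assumption rather than the actual production check: the function names, the choice of the CRIS count as the baseline, and the 10% threshold are all hypothetical, and the counts themselves would have to come from your own OpenAlex and CRIS exports.

```python
def relative_difference(openalex_count, cris_count):
    """Relative gap between the two paper counts, using CRIS as the baseline."""
    if cris_count <= 0:
        raise ValueError("cris_count must be positive")
    return abs(openalex_count - cris_count) / cris_count


def flag_outliers(counts, threshold=0.10):
    """Return institutions whose counts differ by more than the threshold.

    counts maps an institution name to (openalex_count, cris_count).
    """
    return [
        name
        for name, (openalex_count, cris_count) in counts.items()
        if relative_difference(openalex_count, cris_count) > threshold
    ]


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    example = {
        "Institution A": (950, 1000),
        "Institution B": (600, 1000),
    }
    print(flag_outliers(example))
```

Matching totals do not prove that the same papers are attributed on both sides, so a check like this is only a coarse first signal.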

I don't know if anybody connected to the Leiden rankings has other stats around this, as presumably they're leaning on the OpenAlex affiliation data too.

My view was that the affiliation data was good (relative to other options) even if it could always do with more work, but YMMV depending on your use case.

