Data Quality


Adam Goldberg

Nov 18, 2025, 7:21:46 PM
to OpenAlex Community
Generally speaking, lots of complaints about data quality.  No responses.

Why is anyone using OpenAlex for anything if the data isn't trustworthy?

Anton Angelo

Nov 18, 2025, 7:49:04 PM
to OpenAlex Community
Aha.

I've been waiting for the backlash against OpenAlex to start.  I wonder when the Scholarly Kitchen article will come.

Anyone who has worked with bibliometric data knows that it is incredibly problematic.  Take author disambiguation, for example.  I have a colleague who does that with the Scopus dataset on a near-daily basis for authors in our institution, and we're pretty tiny.  And we pay how much for that, and we are expected to tidy their data for them as well?

OpenAlex challenges some really big players, and I expect a co-ordinated response against it.  I say that after working with Free/Libre Open Source for the last 30 years, and having seen the same (tobacco-company-like) playbook over and over again.

On improving data quality: unless the data is open, it can't improve.  Currently we have fraud and malfeasance on a massive scale because we can't see where the data is coming from.
Like I tell people when they poo-poo Wikipedia: don't moan, edit!  Standing kvetching on the sidelines just makes you part of the problem.

Looks like we've entered the second part of Gandhi's map of resistance: first they ignore you, then they laugh at you, then they fight you, then you win.

(That was attributed to Gandhi on my old O'Reilly t-shirt, but it probably wasn't him).

aa




James Iremonger

Nov 19, 2025, 4:02:52 AM
to OpenAlex Community
I've worked with bibliometric data from PubMed for about 8 years now. I can assure you it's a mess across the board. At first glance the data looks great, everything structured and organised, but then your foreign keys fail and you start turning over a few rocks. You start finding papers that list the same author name more than once, papers published on the 31st of February, and affiliations where the first four just give the department name and the fifth contains the full address. Up until about 5 years ago, most authors had only their first initial and last name. Then ORCID comes to save the day, but no: people are lazy. They forget their login details, or they move institution and, instead of contacting ORCID and moving their e-mail to the new institution, they just set up a new account. You won't believe how many people enter their ID as 0000-0000-0000-0000 just so they can submit their paper when the journal requires that field not to be empty. I really could write books on the stuff that's thrown our old database out of whack over the years.
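
For anyone who wants to screen for the problems listed above, here is a minimal sketch in Python. It is not how PubMed or OpenAlex validate anything; the record layout (`authors`, `name`, `orcid`, `pub_date`) and the function names are assumptions made up for illustration. The one piece taken from a published rule is the ORCID check digit, which uses ISO 7064 MOD 11-2, so the all-zero placeholder fails it.

```python
import re
from collections import Counter
from datetime import date

# Four groups of four characters; the last one may be the check character X.
ORCID_RE = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")

def orcid_is_valid(orcid):
    """Check the format and the ISO 7064 MOD 11-2 check digit."""
    if not ORCID_RE.match(orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[-1] == expected

def date_is_valid(year, month, day):
    """Reject impossible calendar dates such as 31 February."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False

def repeated_authors(names):
    """Return normalised names that occur more than once on one paper."""
    counts = Counter(n.strip().lower() for n in names)
    return sorted(n for n, c in counts.items() if c > 1)

def audit(record):
    """Return a list of human-readable problems found in one record.

    `record` is a hypothetical dict, not a real PubMed or OpenAlex schema.
    """
    problems = []
    authors = record.get("authors", [])
    for name in repeated_authors([a.get("name", "") for a in authors]):
        problems.append(f"author listed more than once: {name}")
    if not date_is_valid(*record["pub_date"]):
        problems.append(f"impossible publication date: {record['pub_date']}")
    for a in authors:
        orcid = a.get("orcid")
        if orcid and not orcid_is_valid(orcid):
            problems.append(f"invalid ORCID iD: {orcid}")
    return problems

if __name__ == "__main__":
    sample = {
        "pub_date": (2019, 2, 31),
        "authors": [
            {"name": "J Smith", "orcid": "0000-0000-0000-0000"},
            {"name": "A Lee", "orcid": "0000-0002-1825-0097"},
            {"name": "j smith"},
        ],
    }
    for problem in audit(sample):
        print(problem)
```

Exact-match comparison on lower-cased names only catches the crudest repeats; real author disambiguation needs far more than this, and a repeated name can be two genuinely different people.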

I can see OpenAlex are doing their best to serve many different needs, and it's going to take time to chip away at the edge cases. Knowing is half the battle: with such vast amounts of data, edge cases will only ever be found by happenstance. Currently, we take what they've given us and add a layer of tuning over it that suits our clients, as we have different requirements when it comes to sensitivity and specificity. Thankfully, to date I have found OpenAlex to be a better starting point than the XML that PubMed supplies.
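
To make the sensitivity/specificity trade-off concrete, here is a rough sketch, again in Python. The match scores, the hand-checked labels and the threshold values are all invented for illustration; nothing here reflects the actual tuning layer described above. Sensitivity is the share of true matches that are kept, and specificity is the share of non-matches that are correctly rejected.

```python
def sensitivity_specificity(predicted, actual):
    """Compute (sensitivity, specificity) from two equal-length boolean lists."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

def evaluate(scores, actual, threshold):
    """Accept every candidate whose score is at or above the threshold."""
    predicted = [s >= threshold for s in scores]
    return sensitivity_specificity(predicted, actual)

if __name__ == "__main__":
    # Hypothetical author-match scores and hand-checked ground truth.
    scores = [0.95, 0.80, 0.60, 0.40, 0.20]
    actual = [True, True, False, True, False]
    for threshold in (0.3, 0.5, 0.9):
        sens, spec = evaluate(scores, actual, threshold)
        print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Raising the threshold trades missed matches for fewer false ones, which is why one setting rarely suits every client.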
