Hi,
I wanted to share a specific type of affiliation mismatch that the OpenAlex
team might want to look at, as well as an example of how an LLM (in this case
the Claude desktop app) can interact with our local instance of OpenAlex when
exposed through MCP. Many of you have probably looked into similar
things, but it might serve as an interesting example.
As for the specific mismatch, it seems that “Oldham Council” is for some reason
being assigned to a large number of works based on raw affiliation
strings such as "Independent - affiliation not provided to SSRN" (in
these cases all publications come from “SSRN Electronic Journal”). The
issue came up and was automatically flagged by Claude as a probable anomaly
during a test run to try out different queries and mine some of the results.
When I look up the specific ROR ID that was flagged, it seems we have ~400,000
work-author connections for it in our database, so the issue is relatively large.
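A minimal sketch of the kind of check behind that number, assuming a flat table of work-author-affiliation rows. The field names (`work_id`, `ror`, `raw_affiliation`), the helper names and the sample rows are hypothetical and do not reflect the actual Swemetrics schema; only the ROR ID and the two raw strings are taken from this thread. It groups the raw affiliation strings matched to one ROR ID and reports how much of the total the most common string accounts for:

```python
from collections import Counter

# Hypothetical, hand-made rows standing in for work-author-affiliation records.
# Field names are assumptions, not the real Swemetrics or OpenAlex schema.
SAMPLE_AUTHORSHIPS = [
    {"work_id": 1, "ror": "05cxzzq07",
     "raw_affiliation": "Independent - affiliation not provided to SSRN"},
    {"work_id": 2, "ror": "05cxzzq07",
     "raw_affiliation": "Independent - affiliation not provided to SSRN"},
    {"work_id": 3, "ror": "05cxzzq07",
     "raw_affiliation": "Independent - affiliation not provided to SSRN"},
    {"work_id": 4, "ror": "05cxzzq07",
     "raw_affiliation": "Independent Consultant, Rabat Morocco"},
    {"work_id": 5, "ror": "placeholder-other-ror",
     "raw_affiliation": "Some University"},
]


def raw_strings_for_ror(rows, ror_id):
    """Return (raw_affiliation, count) pairs for one ROR ID, most common first."""
    counts = Counter(
        row["raw_affiliation"].strip()
        for row in rows
        if row["ror"] == ror_id
    )
    return counts.most_common()


def dominant_share(rows, ror_id):
    """Return the most common raw string and its share of all rows for the ROR ID."""
    ranked = raw_strings_for_ror(rows, ror_id)
    total = sum(count for _, count in ranked)
    if total == 0:
        return None, 0.0
    top_string, top_count = ranked[0]
    return top_string, top_count / total


top_string, share = dominant_share(SAMPLE_AUTHORSHIPS, "05cxzzq07")
print(f"{top_string!r} accounts for {share:.0%} of the matched rows")
```

A single placeholder string dominating the rows of an institution is a useful warning sign, although the threshold at which it becomes suspicious is a judgment call.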
See e.g. these selected parts of the responses to follow-ups to the initial question
“Which are the main organizations globally that perform research in the same
topic in the same time period?” (itself a follow-up to a query on the topic “Metal
and Thin Film Mechanics” related to the Swedish research funder Vinnova):
Claude: “Note that "Oldham Council" (GB, rank 3 with 1,230
works) looks like an OpenAlex data artefact — a local UK council is unlikely to
be a major thin film mechanics research institution. Let me flag that and also
check where the top Swedish and other notable Western institutions land:
[…]
Oldham Council (rank 3) is almost certainly an OpenAlex affiliation
disambiguation error — likely a misclassified UK address — and should be
disregarded.”
User:
“Look online at Oldham council, to see what research organizations or companies that can be found there, which could explain the presence within "Metal and Thin Film Mechanics".”
Claude:
“[…]
The mystery is solved. The raw affiliation string is "Independent —
affiliation not provided to SSRN" in every single case. This is a
classic OpenAlex affiliation disambiguation error: when authors submit
preprints to SSRN without providing an institutional affiliation, OpenAlex's
parsing algorithm has apparently mis-mapped the placeholder string "Independent
— affiliation not provided to SSRN" to the ROR entity "Oldham
Council" (ROR ID 05cxzzq07). There is no research being done by the actual
Oldham Metropolitan Borough Council.”
Claude:
“[…]
The SSRN string does appear exactly as I stated initially — but it is
specifically a SSRN preprint server artefact, not a general
"Independent consultant" catch-all. The venue is consistently SSRN
Electronic Journal across all cases. There are no city, postal code or
address fields — the raw affiliation string contains only the placeholder text.
In multi-affiliation cases (e.g. work_id 4304774962), one author slot maps to
"Oldham Council" via the SSRN string while the other
affiliation slots correctly resolve to real institutions (CNRS labs in France,
Max-Planck in Germany).
My earlier sample showing "Independent Consultant, Rabat Morocco" etc. was from a different, broader set of records also mapped to ROR 05cxzzq07 — so there are actually two patterns:
Both are disambiguation errors, but the SSRN one is by far the most systematic and largest in volume.
”
I have attached parts of the conversation transcript for reference (https://claude.ai/public/artifacts/f03c32e7-de4e-4035-99ca-f10310b45c43, generated
and styled by Claude), which may also be of interest to some of you as an illustration of
some of the possibilities and of how it was prompted. The transcript is basically complete,
but omits some of the “thinking sections” and pre-SQL-query notes by Claude.
This work was done at the KTH Royal Institute of Technology Library against a local
Swedish instance of OpenAlex, built as a DuckDB database on Parquet files (with
the placeholder name Swemetrics, see https://swemetrics.se/).
On our end the MCP connection is a recent development and still in very early
testing, but clearly powerful. The same goes for Swemetrics as a
whole, which is in early development and still fluid, and is being built in collaboration with
Karolinska Institutet University Library. The backend and MCP implementation
were mainly done by my colleague Markus Skyttner in the last couple of weeks.
Kind regards
Tobias Jeppsson
(sorry for possible double posting; seems like this was attached to a previous thread by mistake)
Hi Tobias,
This is really interesting. Your experiment demonstrates that LLM-based services/applications can identify and flag such errors.
Interestingly, I found a few publications (although not about thin films) that were actually written by someone with real ties to Oldham Council, for example https://openalex.org/works/W2321618443 and https://openalex.org/works/W4381283478. These were written by Steve Rogowski (ironically an independent scholar according to his Google Scholar page: https://scholar.google.com/citations?user=zsgwOScAAAAJ&hl=en&oi=ao), who works in the field of Social Work and sometimes uses "Oldham Council" as his affiliation in publications. According to his biography, he did indeed work as a social worker at Oldham Council (https://essl.leeds.ac.uk/law/news/article/2359/from-leeds-law-to-social-justice-the-inspiring-journey-of-dr-steve-rogowski).
I found a few other large-scale mismatches with a simple "visual check". Currently the top two sources among journal-type sources in OpenAlex are two journals in niche fields (https://openalex.org/works?group_by=publication_year,open_access.is_oa,primary_topic.id,authorships.institutions.lineage,type,primary_location.source.type,primary_location.source.id&page=1&filter=primary_location.source.type:journal): "Medical Entomology and Zoology" (https://openalex.org/works?filter=primary_location.source.id:s4210197683) with almost 2 million publications, and "SHILAP Revista de lepidopterología" (https://openalex.org/works?page=1&filter=primary_location.source.id:s112646816) with more than 1.3 million publications.
These counts are obviously the result of source-matching errors, which means that around 3 million items (0.6% of the total 480M records in OpenAlex) have erroneous sources.
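The same kind of overview can presumably be requested from the API rather than the web interface. A minimal sketch that only builds the request URL, assuming the public API at https://api.openalex.org/works accepts the same `filter` and `group_by` field names that appear in the web URL above; the helper name `grouped_works_url` is hypothetical, and no request is actually sent:

```python
from urllib.parse import urlencode

# Assumed base URL of the public OpenAlex works endpoint.
API_BASE = "https://api.openalex.org/works"


def grouped_works_url(filters, group_by):
    """Build a works URL that groups results by one field.

    `filters` maps field names to values; they are joined with commas,
    following the field:value syntax used in the web URL.
    """
    filter_value = ",".join(f"{key}:{value}" for key, value in filters.items())
    # Keep ':' and ',' unescaped so the URL stays readable.
    query = urlencode({"filter": filter_value, "group_by": group_by}, safe=":,")
    return f"{API_BASE}?{query}"


url = grouped_works_url(
    {"primary_location.source.type": "journal"},
    "primary_location.source.id",
)
print(url)
```

Sorting the returned groups by count would then put any implausibly large journals at the top of the list.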
"Medical Entomology and Zoology" is the journal of the Japan Society of Medical Entomology and Zoology, with probably fewer than 10 thousand articles (according to Crossref): https://api.crossref.org/works?filter=issn:0424-7086. The error was most probably caused by mismatched sources when items from the Japanese CiNii database were ingested by OpenAlex (https://en.wikipedia.org/wiki/CiNii).
"SHILAP Revista de lepidopterología" is the journal of the "Sociedad Hispano-Luso-Americana de Lepidopterología" and is about moths and butterflies. It is highly unlikely that such a journal has published so many items, and it is also suspicious that more than 100k of them are articles on the topic of Military technology and strategies. This error is most likely caused by mismatched sources when fetching data from DOAJ.
In the case of "Oldham Council" it is obvious that almost all the matched items are erroneous, but in other cases it might not be this straightforward. In the case of the two "mega-journals" above, there are several thousand legitimate articles that were actually published in these journals, but they are hidden among the hundreds of thousands of erroneous items.
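One way to turn the Crossref comparison above into a rough screening rule is to compare the OpenAlex work count per source with an external reference count and flag large ratios. A minimal sketch, with a hypothetical helper name and an arbitrary threshold; the counts for "Medical Entomology and Zoology" are the rounded figures quoted above, while the second journal is invented for contrast:

```python
def flag_inflated_sources(openalex_counts, reference_counts, max_ratio=5.0):
    """Return (source, ratio) pairs where OpenAlex has far more works than the reference.

    Sources without a usable reference count are skipped rather than flagged.
    """
    flagged = []
    for source, openalex_count in openalex_counts.items():
        reference_count = reference_counts.get(source)
        if not reference_count:  # missing or zero: no basis for comparison
            continue
        ratio = openalex_count / reference_count
        if ratio > max_ratio:
            flagged.append((source, ratio))
    # Largest discrepancy first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Rounded figures from this thread, plus one invented journal for contrast.
openalex_counts = {
    "Medical Entomology and Zoology": 2_000_000,
    "Hypothetical Journal A": 12_000,
}
reference_counts = {
    "Medical Entomology and Zoology": 10_000,
    "Hypothetical Journal A": 11_500,
}

flagged = flag_inflated_sources(openalex_counts, reference_counts)
for source, ratio in flagged:
    print(f"{source}: about {ratio:.0f}x more works than the reference count")
```

A ratio test like this only points at suspicious sources; separating the legitimate articles from the erroneous ones still has to be done record by record.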
Best regards,
Gabor Schubert
Stockholm University