Indexed articles are missing from the database.


Onur Bal

Mar 12, 2026, 6:52:26 AM
to OpenAlex Community

Hello Everyone,

I would like to bring to your attention an issue I have encountered while using the latest version of the OpenAlex API. I am seeking clarification on this matter and would greatly appreciate any insights you may have.

Issue Description:
Some articles that are listed as referenced works in other papers appear to be missing from the OpenAlex database. Specifically, when I attempt to retrieve these articles or click on the provided links, I am unable to access the content. Additionally, I am not being redirected to an alternate page, which suggests that the articles are not available.

Example 1:
The paper with the OpenAlex ID W2103484761 references another paper, identified as W4256223384, which appears as the 181st entry in the list of referenced works. However, this referenced paper does not exist in the OpenAlex database.
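A lookup like this can be scripted to tell a deleted work from a merged one. The sketch below is only an illustration: it assumes the API answers a deleted ID with HTTP 404 and a merged ID with a record whose `id` differs from the one requested, and `short_id`, `classify_lookup` and `fetch_work` are hypothetical helper names, not part of any OpenAlex client.

```python
import json
import urllib.error
import urllib.request

API_ROOT = "https://api.openalex.org/works/"


def short_id(openalex_id):
    """Reduce 'https://openalex.org/W123' or 'w123' to 'W123'."""
    return openalex_id.rstrip("/").rsplit("/", 1)[-1].upper()


def classify_lookup(requested_id, status, returned_id=None):
    """Classify one lookup as 'missing', 'merged', 'present' or 'error'."""
    if status == 404:
        return "missing"
    if status != 200 or returned_id is None:
        return "error"
    if short_id(returned_id) != short_id(requested_id):
        return "merged"  # assumption: the API served a different, canonical record
    return "present"


def fetch_work(work_id, timeout=30):
    """Return (status, returned_id) for one work; needs network access."""
    url = API_ROOT + short_id(work_id)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.load(resp).get("id")
    except urllib.error.HTTPError as err:
        return err.code, None
```

Calling `classify_lookup("W4256223384", *fetch_work("W4256223384"))` would then report how the API currently treats the referenced work from Example 1.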

Example 2:
I have attached a file containing multiple such instances where the referenced articles are missing. The file lists the missing articles as keys, with the corresponding referencing articles provided as values. It is worth noting that these missing references are only from a subset of 6.5k articles.
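For anyone who wants to work with the attached file, a mapping of this shape can be summarised in a few lines. This is a sketch under assumptions: each key is taken to be a missing work ID and each value a list of referencing work IDs, and apart from the pair from Example 1 the sample IDs are placeholders.

```python
import json
from collections import defaultdict


def load_missing(path):
    """Read a JSON object of the assumed form {missing_id: [referencing_id, ...]}."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def most_cited_missing(missing, top=10):
    """Rank missing IDs by how many works reference them."""
    ranked = sorted(missing.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(missing_id, len(citers)) for missing_id, citers in ranked[:top]]


def missing_by_citing_work(missing):
    """Invert the mapping: {referencing_id: sorted missing IDs it cites}."""
    inverted = defaultdict(set)
    for missing_id, citers in missing.items():
        for citer in citers:
            inverted[citer].add(missing_id)
    return {citer: sorted(ids) for citer, ids in inverted.items()}


# Example 1 plus two placeholder IDs, for illustration only.
sample = {
    "W4256223384": ["W2103484761"],
    "W0000000001": ["W2103484761", "W0000000002"],
}
print(most_cited_missing(sample))
print(missing_by_citing_work(sample))
```

A missing ID that is referenced by a very large number of works is worth a closer look, since it may not be an ordinary deleted record.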

Questions:

  1. Have these articles been re-indexed? If so, could you please provide guidance on where I can find the re-indexed versions?

  2. If the articles have not been re-indexed, what could be the underlying issue preventing access to these referenced works?

Thank you in advance for your assistance. I look forward to your feedback.

Best regards,

Onur
missing.json

Gabor Schubert

Mar 19, 2026, 6:50:53 PM
to OpenAlex Community

Hi Onur,

I tried to investigate this a bit. I checked all the works in the attached JSON file: I could identify 1296 individual works with 1296 individual DOIs, and I tried to get the number of references in these works from different sources: OpenAlex (matched by OpenAlex ID), Web of Science and Crossref (both matched by DOI). 875 of these 1296 works had reference counts in both Web of Science and Crossref. Only about 60 articles had the same number of references in OpenAlex as in Crossref or Web of Science (regardless of the missing items in your question). It seems that most of the works have either more or fewer real references than OpenAlex reports. (I attach a CSV file with all the data.)
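The comparison can be reproduced roughly as follows. This is a minimal sketch, not the script behind the attached file: the column names (`openalex`, `crossref`, `wos`) are assumptions, the first three sample rows use counts quoted in this thread, and the last two rows are placeholders.

```python
import csv
import io


def parse_count(value):
    """Return an int, or None when the source has no count for the work."""
    value = (value or "").strip()
    return int(value) if value.isdigit() else None


def agreement_summary(rows):
    """Count works whose OpenAlex reference count matches Crossref or WoS."""
    summary = {"compared": 0, "matches": 0, "openalex_higher": 0, "openalex_lower": 0}
    for row in rows:
        oa = parse_count(row["openalex"])
        others = [parse_count(row["crossref"]), parse_count(row["wos"])]
        if oa is None or None in others:
            continue  # keep only works with a count in all three sources
        summary["compared"] += 1
        if oa in others:
            summary["matches"] += 1
        elif oa > max(others):
            summary["openalex_higher"] += 1
        elif oa < min(others):
            summary["openalex_lower"] += 1
    return summary


SAMPLE = """openalex_id,openalex,crossref,wos
W2087812817,16,12,12
W2058105398,610,529,547
W2142014244,221,146,142
W0000000001,30,30,
W0000000002,25,25,27
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(agreement_summary(rows))
```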

I chose one example to dig deeper: https://openalex.org/works/W2087812817

According to OpenAlex this work "Cites: 16 other works" (https://api.openalex.org/works?filter=openalex_id:W2087812817&select=referenced_works_count), but the number of works actually listed as cited by this work is only 14 (https://api.openalex.org/works?filter=cited_by:W2087812817&select=id). Web of Science and Crossref, however, show only 12 references for this work. And indeed, there are only 12 references in the article itself (https://doi.org/10.1145/260750.260764).
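The two queries and the resulting gaps can be written down as a small sketch. The URL patterns are the ones quoted in this message; the helper names `count_url`, `cited_by_url` and `count_gaps` are made up for illustration.

```python
from urllib.parse import urlencode

API_WORKS = "https://api.openalex.org/works"


def count_url(work_id):
    """URL that returns only referenced_works_count for one work."""
    params = {"filter": f"openalex_id:{work_id}", "select": "referenced_works_count"}
    return f"{API_WORKS}?{urlencode(params, safe=':')}"


def cited_by_url(work_id):
    """URL that lists the works cited by the given work."""
    params = {"filter": f"cited_by:{work_id}", "select": "id"}
    return f"{API_WORKS}?{urlencode(params, safe=':')}"


def count_gaps(stated, listed, in_article):
    """Compare the stated count, the listed works and the article's own list."""
    return {
        "unresolved_ids": stated - listed,       # counted but not retrievable
        "surplus_matches": listed - in_article,  # listed beyond the real references
    }


print(count_url("W2087812817"))
print(cited_by_url("W2087812817"))
print(count_gaps(stated=16, listed=14, in_article=12))
```

For this work both gaps come out as 2: two IDs are counted but cannot be retrieved, and the 14 retrievable works are still two more than the article's 12 references.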

So, I tried to match the OpenAlex data to the actual article data. The original article has 12 references:

Ref1: article in a magazine
Ref2: book
Ref3: article
Ref4: book
Ref5: book
Ref6: article
Ref7: book
Ref8: book
Ref9: book
Ref10: book
Ref11: book chapter
Ref12: article

In OpenAlex there are 14 references listed for this article:

W1576560608 is Ref2, but instead of the book it points to a book review
W2030458922 is Ref2, but instead of the book it points to a book review
W2024784219 is Ref3,
W1546144887 is Ref4, but instead of the book it points to a book review
W1489270652 is Ref5, but instead of the book it points to a book review
W2063215243 is Ref5, but instead of the book it points to a book review
W2131631103 is Ref6,
W2116339812 is Ref7,
W1986910015 is Ref8, but instead of the book it points to an article with the same title
W2024372407 is Ref9, but instead of the book it points to a book review
W2936807480 is Ref9, but instead of the book it points to a book review
W1495038747 is Ref10,
W614483135 is Ref11, but instead of the book chapter it points to the whole book
W2143969326 is Ref12,

And there are two extra references in OpenAlex which are missing now from the database: W4285719527, W6629404584.

OpenAlex misses Ref1 but has multiple matches for Refs 2, 5 and 9.
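The same tally can be checked mechanically. The pairs below are transcribed from the list above; `audit_matches` is a hypothetical helper name.

```python
from collections import Counter

# (OpenAlex ID, reference number in the article), transcribed from the list above.
MATCHES = [
    ("W1576560608", 2), ("W2030458922", 2), ("W2024784219", 3),
    ("W1546144887", 4), ("W1489270652", 5), ("W2063215243", 5),
    ("W2131631103", 6), ("W2116339812", 7), ("W1986910015", 8),
    ("W2024372407", 9), ("W2936807480", 9), ("W1495038747", 10),
    ("W614483135", 11), ("W2143969326", 12),
]


def audit_matches(matches, n_references):
    """Return reference numbers with no match and with more than one match."""
    hits = Counter(ref for _, ref in matches)
    numbers = range(1, n_references + 1)
    unmatched = [ref for ref in numbers if hits[ref] == 0]
    multiple = [ref for ref in numbers if hits[ref] > 1]
    return unmatched, multiple


unmatched, multiple = audit_matches(MATCHES, 12)
print(unmatched, multiple)  # prints [1] [2, 5, 9]
```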

It seems that OpenAlex has some serious problems with matching references to books (most likely its algorithm aggressively tries to match items with DOIs, which are more common for book reviews than for actual books), and there are even some deduplication errors.

The two missing items in this case could be almost anything: duplicates that were deleted, items that were not supposed to be indexed (for example Ref1), or something else. But the reference data clearly suffers from many problems besides the missing items.

Best regards,
Gabor Schubert
Stockholm University

number_of_references.csv

Onur Bal

Mar 20, 2026, 6:28:25 AM
to OpenAlex Community
Hi Gabor,

First of all, thank you so much for putting in this amount of effort and comparing across platforms. I would also like to thank you for the detailed explanations.

If I understand correctly, there are multiple reference IDs for the same piece of work, and some pieces of work are missing. I hope they can overcome this problem in the future, since there is no solution for it.

I will see whether I can find the missing articles manually by cross-referencing them across platforms.

Once again, thank you for your efforts.

All the best,
Onur

On Thursday, March 19, 2026 at 23:50:53 UTC+1, gabor.sch...@gmail.com wrote:

Gabor Schubert

Mar 20, 2026, 9:42:10 AM
to OpenAlex Community

Hi Onur,

I did some further investigation and found the following on Github:

https://github.com/ourresearch/openalex-guts/blob/main/files-for-datadumps/standard-format/RELEASE_NOTES.txt

RELEASE 2024-07-30

"delete 187,452 works: deleted Zenodo records. (merge into deleted id: W4285719527)"

It seems that one of the "missing works" you found is a "sink" or "dummy" ID that OpenAlex uses as a merge target for deleted items. It probably stands in for hundreds of thousands of deleted records, which might be accessible by browsing previous snapshot files (I don't have the capability to process the snapshot files). If I understand correctly, OpenAlex keeps these "dead" links in the database for historical graph integrity reasons. This might explain why you found so many works that cite this non-existent ID.
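If such sink IDs are known, they can be separated from the other unresolved references before any analysis. The sketch below assumes the only known sink is the one named in the release note quoted above; `split_references` is a hypothetical helper, and the sample uses just two of the live references of W2087812817 plus its two missing ones.

```python
# Merge target named in the RELEASE 2024-07-30 note quoted above.
KNOWN_SINK_IDS = frozenset({"W4285719527"})


def short_id(openalex_id):
    """Reduce 'https://openalex.org/W123' to 'W123'."""
    return openalex_id.rstrip("/").rsplit("/", 1)[-1].upper()


def split_references(referenced_works, live_ids, sink_ids=KNOWN_SINK_IDS):
    """Split references into live, sink and otherwise unresolved IDs."""
    live = {short_id(work_id) for work_id in live_ids}
    result = {"live": [], "sink": [], "unresolved": []}
    for ref in map(short_id, referenced_works):
        if ref in sink_ids:
            result["sink"].append(ref)
        elif ref in live:
            result["live"].append(ref)
        else:
            result["unresolved"].append(ref)
    return result


referenced = [
    "https://openalex.org/W2024784219",
    "https://openalex.org/W2131631103",
    "https://openalex.org/W4285719527",
    "https://openalex.org/W6629404584",
]
print(split_references(referenced, live_ids=["W2024784219", "W2131631103"]))
```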

As far as I can see, your sample contained mostly articles in management and similar fields, where the reference lists contain many items that are not journal articles, and this might explain the relatively high frequency of such errors.

But as we saw in the example above even the "live" OpenAlex references might be wrong: they might point to several items instead of one or point to wrongly matched items, or both. Most likely it would be very difficult to uncover what the deleted records were, and most probably they were just deleted because of deduplication or other basic errors.

Gabor

Rainer M Krug

Mar 20, 2026, 9:48:50 AM
to Gabor Schubert, OpenAlex Community
Hi Gabor

So you are saying that the `referenced_works` field in OpenAlex includes deleted and invalid work IDs, and that the `referenced_works_count` is therefore wrong as well? Or did I miss something in the discussion?

Rainer





Gabor Schubert

Mar 20, 2026, 10:16:02 AM
to OpenAlex Community
Hi Rainer,

It seems to be the case: all the items Onur identified have a higher "Cites" number than the number of items on the cited by list. You can see it in the CSV file attached above.

The largest difference is for https://openalex.org/W2058105398, which says "Cites: 610", but there are only 552 works listed on the cited by page: https://openalex.org/works?page=1&filter=cited_by:w2058105398 . For this work Web of Science shows 547 references and Crossref 529. The actual article (https://doi.org/10.1103/revmodphys.81.591) is not open access, but I can access it via my university subscription: it lists 552 numbered references on the HTML page, whereas the PDF file has an alphabetical (unnumbered) reference list, so they are not easy to count there. In this case the 552 in OpenAlex might be correct, but 610 is definitely too many.

Gabor

Rainer M Krug

Mar 20, 2026, 10:26:15 AM
to Gabor Schubert, OpenAlex Community
So the “referenced_works_count” is wrong, but the “referenced_works” are correct. That is a partial relief.

Thanks

Rainer
 

Gabor Schubert

Mar 20, 2026, 10:44:31 AM
to OpenAlex Community
I think there might be cases where they are not correct, like this one: https://openalex.org/works/W2142014244 . Cites: 221, referenced works: 219, Crossref/article ( https://doi.org/10.1186/gm144 ): 146 references, Web of Science: 142 references. There is at least one reference in OpenAlex (https://openalex.org/W2588870577) which was published in 2016, although the citing article is from 2010, so this is most likely an error. And there are evidently dozens of other references in this item that are most probably erroneous.
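References that postdate the citing work are easy to screen for. A minimal sketch, assuming publication years are available for both sides; the first sample pair is the one mentioned above, the other two are placeholders, and the `tolerance` parameter is my own addition to allow for preprints or online-first dates.

```python
def anachronistic_references(citing_year, references, tolerance=0):
    """Return IDs of references published more than `tolerance` years
    after the citing work; references without a year are skipped."""
    return [
        ref_id
        for ref_id, year in references
        if year is not None and year - citing_year > tolerance
    ]


references = [
    ("W2588870577", 2016),  # the reference mentioned above
    ("W0000000001", 2009),  # placeholder
    ("W0000000002", None),  # placeholder with unknown year
]
print(anachronistic_references(2010, references))
```

This only catches one kind of wrong match, but it needs no external data source, so it can serve as a cheap first filter.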

Gabor

Tobias Jeppsson

Mar 21, 2026, 2:19:35 PM
to openalex-...@googlegroups.com

Hi,

I wanted to share a specific type of affiliation mismatch that the OpenAlex team might want to look at, as well as an example of how an LLM (in this case the Claude desktop app) can interact with our local instance of OpenAlex when exposed through MCP. Many of you have probably looked into similar things, but it might serve as an interesting example.

As for the specific mismatch, it seems like “Oldham council” for some reason is being mapped to a large number of organizations based on raw affiliation strings such as “Independent - affiliation not provided to SSRN” (in these cases all publications come from “SSRN Electronic Journal”). The issue came up and was automatically flagged by Claude as a probable anomaly during a test run to try out different queries and mine some of the results. When I look into the specific ROR that was flagged, we seem to have ~400,000 work-author connections for it in our database, so the issue is relatively large.

See e.g. these selected parts of the responses to follow-ups on the initial question “Which are the main organizations globally that perform research in the same topic in the same time period?” (itself a follow-up to a query on the topic “Metal and Thin Film Mechanics” related to the Swedish research funder Vinnova):

Claude: “Note that "Oldham Council" (GB, rank 3 with 1,230 works) looks like an OpenAlex data artefact — a local UK council is unlikely to be a major thin film mechanics research institution. Let me flag that and also check where the top Swedish and other notable Western institutions land:
[…]
Oldham Council (rank 3) is almost certainly an OpenAlex affiliation disambiguation error — likely a misclassified UK address — and should be disregarded.”

User:

“Look online at Oldham council, to see what research organizations or companies that can be found there, which could explain the presence within "Metal and Thin Film Mechanics".”


Claude:

“[…] The mystery is solved. The raw affiliation string is "Independent — affiliation not provided to SSRN" in every single case. This is a classic OpenAlex affiliation disambiguation error: when authors submit preprints to SSRN without providing an institutional affiliation, OpenAlex's parsing algorithm has apparently mis-mapped the placeholder string "Independent — affiliation not provided to SSRN" to the ROR entity "Oldham Council" (ROR ID 05cxzzq07). There is no research being done by the actual Oldham Metropolitan Borough Council.”

Claude:
“[…]
The SSRN string does appear exactly as I stated initially — but it is specifically a SSRN preprint server artefact, not a general "Independent consultant" catch-all. The venue is consistently SSRN Electronic Journal across all cases. There are no city, postal code or address fields — the raw affiliation string contains only the placeholder text. In multi-affiliation cases (e.g. work_id 4304774962), one author slot maps to "Oldham Council" via the SSRN string while the other affiliation slots correctly resolve to real institutions (CNRS labs in France, Max-Planck in Germany).

My earlier sample showing "Independent Consultant, Rabat Morocco" etc. was from a different, broader set of records also mapped to ROR 05cxzzq07 — so there are actually two patterns:

  1. SSRN preprints where the author didn't declare an affiliation → "Independent - affiliation not provided to SSRN" → Oldham Council
  2. Other preprint/journal submissions where authors described themselves as independent consultants/researchers → also incorrectly mapped to the same ROR

Both are disambiguation errors, but the SSRN one is by far the most systematic and largest in volume.”
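A check for this kind of mismatch can be sketched without an LLM as well. The record layout below (`work_id`, `raw_affiliation_string`, `ror`) is an assumption for illustration and not the Swemetrics schema; the ROR ID 05cxzzq07 and work_id 4304774962 come from the transcript, the other values are placeholders.

```python
import re
from collections import Counter

# Matches the SSRN placeholder regardless of case or the dash used before it.
PLACEHOLDER = re.compile(r"affiliation not provided to ssrn", re.IGNORECASE)


def placeholder_targets(records, pattern=PLACEHOLDER):
    """Count the ROR IDs that placeholder affiliation strings were mapped to."""
    counts = Counter()
    for record in records:
        raw = record.get("raw_affiliation_string") or ""
        ror = record.get("ror")
        if ror and pattern.search(raw):
            counts[ror] += 1
    return counts


records = [
    {"work_id": 4304774962, "ror": "05cxzzq07",
     "raw_affiliation_string": "Independent - affiliation not provided to SSRN"},
    {"work_id": 4304774962, "ror": "placeholder-ror",
     "raw_affiliation_string": "Example laboratory, France"},
    {"work_id": 1, "ror": "05cxzzq07",
     "raw_affiliation_string": "independent - Affiliation Not Provided to SSRN"},
]
print(placeholder_targets(records))
```

Any ROR ID that shows up in the result is a candidate for a wrong mapping, since a placeholder string carries no institutional information at all.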


I have attached parts of the conversation transcript for reference (generated and styled by Claude), which may also be interesting for some of you as a way to see some of the possibilities and how it was prompted. It is basically complete, but omits some of the “thinking sections” and pre-SQL-query notes by Claude.

This work was done at the KTH Royal Institute of Technology Library against a local Swedish instance of OpenAlex, a DuckDB database built on Parquet files (with the placeholder name Swemetrics, see https://swemetrics.se/). At our end the MCP connection is a recent development and in very early testing, but clearly powerful. The same goes for Swemetrics as a whole, which is in early, fluid development, done in collaboration with Karolinska Institutet University Library. The backend and MCP implementation were mainly done by my colleague Markus Skyttner in the last couple of weeks.

Kind regards

Tobias Jeppsson



 

Tobias Jeppsson, PhD

Bibliometric analyst, Project leader

KTH Royal Institute of Technology

KTH Library, Unit for Publication Infrastructure

Osquars backe 25, 100 44 Stockholm, Sweden

tj...@kth.se, http://www.kth.se/profile/tjep, 08-7907106

 



swemetrics_conversation_metal_thin_film_full.html