Only arXiv data missing in large chunk among preprint servers

Chiaki Miura

Nov 26, 2025, 2:46:31 AM
to OpenAlex Community

Dear community,

OpenAlex’s coverage of preprint articles is generally very high. Most servers reach over 98% coverage as long as DOIs are available.

However, I recently noticed that arXiv is a clear exception. In my calculations, approximately 24%, 37%, and 74% of articles in computer science, quantitative biology, and astrophysics, respectively, are missing. A summary of these statistics is available here:

https://docs.google.com/spreadsheets/d/1OhsZ_HsV9_3vYIz_P5KR_d6Iby3FN-sicP5jgydbP1A/edit?gid=772832660#gid=772832660


I am wondering whether this gap stems from the fact that arXiv only started assigning DOIs in 2022. Although OpenAlex contains more than 1.2 million pre-2022 arXiv records, the missing items appear to be concentrated between 2007 and 2021.

As scientometric research increasingly focuses on preprints, comprehensive coverage is becoming more important—especially because OpenAlex is one of the few bibliographic databases that index citations to both preprints and their subsequent journal publications separately. This functionality is crucial for investigating many aspects of preprint scholarship.

I am happy to share the list of missing arXiv entries if it would be useful. Please let me know if there is anything I can contribute.

Thank you for your work maintaining and improving OpenAlex.

Best regards,

Chiaki

Chiaki Miura

Dec 18, 2025, 8:58:00 PM
to OpenAlex Community
Hi,

Unfortunately, the problem seems to persist in Walden.
I'm afraid the thread is buried under numerous other important posts.

Thanks in advance.

Adam Buttrick

Dec 18, 2025, 11:52:21 PM
to Chiaki Miura, OpenAlex Community
Hi Chiaki,

Do you have any examples of the arXiv IDs that are missing? I'm working on a project to improve the completeness of arXiv metadata, fleshing it out with things like authors' affiliations and their ROR ID assignments. To do so, we've been using their representation in DataCite, as well as the metadata and PDF files from arXiv's Kaggle dataset. Using the latter as a starting point (just re-downloaded and did a rough check), I see 2,908,095 individual works that have a total of 4,638,308 versions. The gap between these two seems roughly congruent with your summary totals.

Although we have observed differences in the metadata across versions while processing the PDFs (e.g. in the listed authors), there doesn't appear to be a source that provides bulk access to the metadata for individual version records. The arXiv DOIs are unversioned and are updated with the most recent metadata whenever a new version is uploaded. Likewise, the Kaggle metadata file only lists the versions and their creation dates, not any more specific details.

Best,
Adam

Chiaki Miura

Dec 20, 2025, 4:42:17 AM
to Adam Buttrick, OpenAlex Community
Dear Adam,
 
Thanks for your information. 
Here are several arXiv DOIs that I could not match to OpenAlex entries via DOI lookup:
10.48550/arxiv.1404.6874
10.48550/arxiv.1002.4259
10.48550/arxiv.1907.02406
 
Querying the OpenAlex UI with doi_starts_with, e.g. https://openalex.org/works?page=1&view=list,report,api&version=2&filter=doi_starts_with:10.48550/arxiv.1404.6874 returns no result, while other arXiv DOIs (e.g. 10.48550/arxiv.1505.02543) are retrieved correctly.
 
I also searched by preprint title. For instance, for arXiv:1002.4259 (“Modulus stabilization and IR-brane kinetic terms in gauge-Higgs unification”), OpenAlex returns the journal article (JHEP) but not the preprint itself:
https://openalex.org/works?filter=title_and_abstract.search:Modulus+stabilization+and+IR-brane+kinetic+terms+in+gauge-Higgs+unification
 
This made me wonder whether some preprint records exist in the data but are not exposed via the UI or DOI filters.
 
Link to the original arXiv record:
https://arxiv.org/abs/1002.4259
 
I have compiled a full list of arXiv DOIs that failed to match (originally checked against version=1, May 1, 2025; some may now be resolved in version=2):
 
As a side note, when I last checked the Kaggle dataset, many curation DOIs (journal DOIs of subsequent publications) were missing, so I relied instead on the Zenodo community snapshot: https://zenodo.org/records/4990937
 
Best,
Chiaki

Chiaki Miura

Dec 20, 2025, 4:59:35 AM
to Adam Buttrick, OpenAlex Community
For clarity, the statistics I initially reported are based on works within the timeframe 1993-01-01 to 2024-12-31.
 
To clarify the counts: the total 2,856,003 corresponds to the number of arXiv entries in the Zenodo snapshot. Of these, 1,300,083 could not be matched to my OpenAlex snapshot.
 
This implies that the OpenAlex snapshot I used (taken on May 1, 2025) contains 1,555,920 arXiv entries with
1993 <= publication_year < 2025.
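
The subtraction behind that figure can be checked quickly in Python. This is only a sketch using the counts quoted in this message; the variable names are illustrative.

```python
# Counts quoted in this message (Zenodo snapshot vs. OpenAlex snapshot of May 1, 2025).
zenodo_total = 2_856_003   # arXiv entries in the Zenodo snapshot
unmatched = 1_300_083      # entries that could not be matched to OpenAlex

# Entries that were matched, and the share that was not.
matched = zenodo_total - unmatched
unmatched_share = unmatched / zenodo_total

print(matched)                   # 1555920
print(f"{unmatched_share:.1%}")  # 45.5%
```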
 
This number may differ from the current size of arXiv coverage in OpenAlex, which—as you indicated—is 2,908,095 entries.

Adam Buttrick

Dec 20, 2025, 6:10:09 AM
to Chiaki Miura, OpenAlex Community
Hi Chiaki,

I believe there may be a bug that prevents these works from being returned via the UI and some forms of search. All of the listed works appear to be available via the API when accessing the works endpoint directly with the full DOIs:

https://api.openalex.org/works/https://doi.org/10.48550/arxiv.1404.6874
https://api.openalex.org/works/https://doi.org/10.48550/arxiv.1002.4259
https://api.openalex.org/works/https://doi.org/10.48550/arxiv.1907.02406
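
A minimal sketch of this direct lookup in Python, using only the standard library. The URL pattern is copied from the links above; the helper names `work_url_for_doi` and `fetch_work` are hypothetical, and treating HTTP 404 as "work not found" is an assumption that has not been verified here.

```python
import json
import urllib.error
import urllib.request

API_BASE = "https://api.openalex.org/works/"
DOI_RESOLVER = "https://doi.org/"


def work_url_for_doi(doi: str) -> str:
    """Build the direct works-endpoint URL for a DOI such as 10.48550/arxiv.1404.6874."""
    # DOIs are case-insensitive; lowercase to match the form used in this thread.
    doi = doi.strip().lower()
    if doi.startswith(DOI_RESOLVER):
        doi = doi[len(DOI_RESOLVER):]
    return f"{API_BASE}{DOI_RESOLVER}{doi}"


def fetch_work(doi: str, timeout: float = 30.0):
    """Return the parsed JSON record, or None on HTTP 404 (assumed to mean 'not found')."""
    try:
        with urllib.request.urlopen(work_url_for_doi(doi), timeout=timeout) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise
```

Calling `fetch_work("10.48550/arxiv.1404.6874")` performs one network request; looping over a larger list would need rate limiting, which is left out of this sketch.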

Searching for the partial DOIs in the UI brings up a pop-up suggestion for what look to be the correct works. But once I select one, I get redirected to the search page with no results returned.

Using 10.48550/arxiv.1404.6874 as an example, entering this partial DOI string auto-suggests "Field-angle-dependent low-energy excitations around a vortex in the superconducting topological insulator CuxBi2Se3."

Clicking through then redirects to the below URL, with no results returned.

https://openalex.org/works?page=1&filter=ids.openalex:w6947639143

Directly accessing the work record referenced in this URL appears to work, however:

https://openalex.org/works/w6947639143

The API can also return this work when using the filter, if we drop the page parameter used in the UI:

https://api.openalex.org/works?filter=ids.openalex:w6947639143

I'm also able to reproduce the UI search failures you mention. Likewise, the title and abstract search in the API fails to return the arXiv record for this work (even when including the DataCite metadata using the include-xpac parameter):

https://api.openalex.org/works?filter=title_and_abstract.search:Modulus+stabilization+and+IR-brane+kinetic+terms+in+gauge-Higgs+unification&include-xpac

And again, this work looks to exist in the API/data, so should be returned:

https://api.openalex.org/works/https://doi.org/10.48550/arxiv.1002.4259
https://api.openalex.org/works/W6891695487

I'll try to take a look at your larger files and write some scripts to see if these failure patterns consistently occur. If I have some extra time, I'll check in the OpenAlex data files as well.

Best,
Adam
Screenshot 2025-12-20 at 3.05.15 AM.png
Screenshot 2025-12-20 at 3.05.02 AM.png

Chiaki Miura

Dec 20, 2025, 10:02:35 PM
to Adam Buttrick, OpenAlex Community
Much appreciated.
 
If I have some extra time, I'll investigate my data to identify any consistent pattern in the failures.
 
Best,
Chiaki

Adam Buttrick

Dec 21, 2025, 1:13:58 AM
to Chiaki Miura, OpenAlex Community
Hi Chiaki,

Here is my code and analysis on this:

https://github.com/adambuttrick/openalex-missing-arxiv-works-analysis

Using a 5K sample from your CSV, I found:

1. ~20% of DOIs in the provided CSV appear to be missing leading zeros in either the YYMM prefix or the numeric suffix of the arXiv ID. The majority are correctable by re-inserting the zeros:

Examples:

905.3199   → 0905.3199   (YYMM padding)
1104.027   → 1104.0027   (pre-2015: 4-digit suffix)
1503.0311  → 1503.00311  (2015+: 5-digit suffix)

I looked up some of these in the Zenodo dataset mentioned and these errors did not seem to be present there, so they were perhaps dropped somewhere in the transformation of arXiv IDs to DOIs. All spreadsheet applications unfortunately seem to love to silently introduce these kinds of errors - happened to me a few times in reviewing the various files here!
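
The padding rules in the examples above can be sketched as a small Python function. This is an illustration, not the code from the linked repository: the name `repair_arxiv_id` is hypothetical, and it assumes that zeros were lost only on the left and that the suffix is 4 digits for YY below 15 and 5 digits from 15 onward. If zeros were instead dropped from the end of the suffix (as a float conversion would do), left-padding produces a different, valid-looking ID, so repaired IDs should still be verified against arXiv.

```python
import re


def repair_arxiv_id(raw: str) -> str:
    """Re-insert leading zeros into a new-style arXiv ID (YYMM.NNNN or YYMM.NNNNN)."""
    match = re.fullmatch(r"(\d{1,4})\.(\d{1,5})", raw.strip())
    if match is None:
        raise ValueError(f"not a new-style arXiv ID: {raw!r}")

    # Restore the 4-digit YYMM prefix, e.g. 905 -> 0905.
    yymm = match.group(1).zfill(4)

    # Assumed rule: 4-digit suffix before 2015, 5-digit suffix from 2015 on.
    width = 5 if int(yymm[:2]) >= 15 else 4
    number = match.group(2)
    if len(number) > width:
        raise ValueError(f"suffix too long for {yymm}: {raw!r}")

    return f"{yymm}.{number.zfill(width)}"


for broken in ("905.3199", "1104.027", "1503.0311"):
    print(broken, "->", repair_arxiv_id(broken))
```

IDs that are already well formed, such as 1404.6874, pass through unchanged.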

2. 98.7% of works are accessible in OpenAlex when the DOIs are corrected and `include_xpac=true` is used. 1.3% from this dataset appear to be truly missing, i.e. have valid or correctable DOIs, but cannot be returned using these values in OpenAlex. No indication of what accounts for the missing works, would need to do some more digging...

3. The direct `/works/{doi}` endpoint returns the 98.7% of accessible works, even without the `include_xpac=true` flag, but the filter/search endpoints do not behave as consistently. Searches for arXiv works seem to require `include_xpac=true` for the best results, which may partly explain why these works were missing from standard searches. It's unclear what accounts for this discrepancy in search behavior between Walden/v2 data and xpac, since it seems arXiv metadata was being independently harvested prior to xpac's inclusion.
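
To compare filter queries with and without the flag, the two URLs can be built side by side. This is a rough sketch: the helper name `filter_url` is hypothetical, the `doi_starts_with` filter and the `include_xpac=true` parameter are the ones mentioned in this thread, and no claim is made here about what either query returns.

```python
from urllib.parse import urlencode

WORKS_ENDPOINT = "https://api.openalex.org/works"


def filter_url(filter_expr: str, include_xpac: bool = False) -> str:
    """Build a works filter URL, optionally adding include_xpac=true."""
    params = {"filter": filter_expr}
    if include_xpac:
        params["include_xpac"] = "true"
    # Keep ':' and '/' readable so the URL resembles the ones quoted in the thread.
    return f"{WORKS_ENDPOINT}?{urlencode(params, safe=':/')}"


expr = "doi_starts_with:10.48550/arxiv.1404.6874"
print(filter_url(expr))
print(filter_url(expr, include_xpac=True))
```

Fetching both URLs for the same DOI and comparing the result counts would show whether a given work is only reachable with the flag set.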

Best,
Adam

Chiaki Miura

Dec 24, 2025, 2:46:21 AM
to OpenAlex Community
Dear Adam,

Thanks for your investigation. Facepalmed. 

On some other servers, around 0.1-1% of the preprints seem to be missing in the older OpenAlex snapshot.
I'll come back to the issue after the vacation and check whether it persists in Walden v2.

Preprint server  Inception (year)  Total    Matched  Matching rate
---------------  ----------------  -------  -------  -------------
BioRxiv          2013              262,741  262,598  99.9%
OSF Preprints    2016               70,518   69,294  98.2%
PsyArXiv         2016               39,782   39,311  98.8%
SocArXiv         2016               16,193   16,114  99.5%
medRxiv          2019               62,019   61,973  99.9%

Best,
Chiaki