Constructing PDF URLs from arXiv IDs when pdf_url is null

Purna Srivatsa

Dec 10, 2025, 10:03:57 AM

to OpenAlex Community

Hello,

I've been exploring PDF availability across different APIs and found an interesting pattern.

Example paper: https://api.openalex.org/works/W4402775298

is_oa: true
pdf_url: null

I checked the same paper across multiple sources:

OpenAlex (best_oa_location.pdf_url): null
Unpaywall (url_for_pdf): null
Semantic Scholar (openAccessPdf.url): "" (empty)

However, Semantic Scholar returns externalIds.ArXiv: "2408.02784" - and the PDF is directly accessible at https://arxiv.org/pdf/2408.02784.pdf.

For papers where:

pdf_url is null, AND
An arXiv version exists (either in locations or ids)

Could OpenAlex automatically populate pdf_url with the constructed arXiv link? The pattern is simple and reliable:

https://arxiv.org/pdf/{arxiv_id}.pdf

This could significantly improve PDF coverage without relying on external services to provide the URL - arXiv's URL structure is stable and predictable.
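A minimal sketch of the fallback described above, in Python. The function names and the idea of passing pdf_url and arxiv_id as plain arguments are assumptions made for illustration; this is not how OpenAlex populates the field, and it only uses the URL pattern quoted in this post.

```python
from typing import Optional

# URL pattern taken from the post; everything else here is hypothetical.
ARXIV_PDF_TEMPLATE = "https://arxiv.org/pdf/{arxiv_id}.pdf"


def arxiv_pdf_url(arxiv_id: str) -> str:
    """Build a PDF URL from an arXiv ID such as '2408.02784'."""
    cleaned = arxiv_id.strip()
    # Some sources prefix the ID with 'arXiv:'; drop it if present.
    if cleaned.lower().startswith("arxiv:"):
        cleaned = cleaned[len("arxiv:"):]
    return ARXIV_PDF_TEMPLATE.format(arxiv_id=cleaned)


def pdf_url_with_fallback(pdf_url: Optional[str],
                          arxiv_id: Optional[str]) -> Optional[str]:
    """Keep an existing pdf_url; otherwise construct one from the arXiv ID."""
    if pdf_url:  # treats both None and "" as missing
        return pdf_url
    if arxiv_id:
        return arxiv_pdf_url(arxiv_id)
    return None
```

With the example paper, `pdf_url_with_fallback(None, "2408.02784")` yields the arXiv link above, while a record that already has a pdf_url is left unchanged.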

Also, could OpenAlex expose arxiv_id as a top-level field in the response? That might make it easier for users to construct the URL themselves.

Thanks,

Purna

Samuel Mok

Dec 10, 2025, 4:58:41 PM

to Purna Srivatsa, OpenAlex Community

The issue here is that the SSRN preprint entry you linked (https://api.openalex.org/works/W4402775298) is not merged with the arXiv preprint version in OpenAlex (which you can find here: https://api.openalex.org/W4403622645). So your proposal won't solve this case: entry W44...98 does not have any arXiv information. The problems that need solving here are:

- deduplication: identify that these work entries are about the same preprint hosted in two places (this is non-trivial!)

- recognize hosts that always have PDFs: SSRN is an open access platform, so every entry has an accessible PDF and every SSRN entry in OpenAlex should include a PDF link. The same goes for arXiv, etc. If a link is not found by scraping/indexing, add programmatic rules.

Cheers,

Samuel

Purna Srivatsa

Dec 11, 2025, 5:08:45 AM

to OpenAlex Community

Thanks for the response.

Yes, you're right about the two enhancements that could improve this. I just looked at both API responses, and I think one way would be to match entries that have the same authors, title and year (which is the case for the two above). But there is a risk of false positives, and a more deterministic way would be preferable for sure.
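To make the matching idea concrete, here is a rough Python sketch. It assumes each work has been reduced to a simple dict with hypothetical keys "title", "authors" (a list of name strings) and "year"; that is not the OpenAlex response schema, and exact matching on normalized fields is only one possible heuristic.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse punctuation/whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", ascii_only.lower()).strip()


def match_key(record: dict) -> tuple:
    """Key built from normalized title, unordered author set, and year."""
    authors = frozenset(normalize(name) for name in record["authors"])
    return (normalize(record["title"]), authors, record["year"])


def likely_same_work(a: dict, b: dict) -> bool:
    """True when title, authors and year all agree after normalization."""
    return match_key(a) == match_key(b)
```

This still carries the false-positive risk mentioned above (for example, a paper and its erratum by the same authors in the same year), and it misses true duplicates whose preprint and published versions differ in title or year.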

I think the second approach is more deterministic and doable? Of course, I'm not familiar enough with the existing setup to know the actual complexity. But if there's a standard way to populate pdf_url for all SSRN works, I presume it could improve pdf_url/oa_url coverage significantly.
