Constructing PDF URLs from arXiv IDs when pdf_url is null

Purna Srivatsa

Dec 10, 2025, 10:03:57 AM
to OpenAlex Community

Hello,

I've been exploring PDF availability across different APIs and found an interesting pattern.

Example paper: https://api.openalex.org/works/W4402775298

  • is_oa: true
  • pdf_url: null

I checked the same paper across multiple sources:

  • OpenAlex (best_oa_location.pdf_url): null
  • Unpaywall (url_for_pdf): null
  • Semantic Scholar (openAccessPdf.url): "" (empty)

However, Semantic Scholar returns externalIds.ArXiv: "2408.02784" - and the PDF is directly accessible at https://arxiv.org/pdf/2408.02784.pdf.

For papers where:

  1. pdf_url is null, AND
  2. An arXiv version exists (either in locations or ids)

Could OpenAlex automatically populate pdf_url with the constructed arXiv link? The pattern is simple and reliable: https://arxiv.org/pdf/<arXiv ID>.pdf, as in the example above.

This could significantly improve PDF coverage without relying on external services to provide the URL - arXiv's URL structure is stable and predictable.
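
As a rough illustration of the fallback described above, here is a minimal Python sketch. The function names and the idea of passing pdf_url and the arXiv ID as two separate arguments are assumptions for illustration only; OpenAlex does not necessarily expose the data in this shape. The URL pattern is the one from the example paper.

```python
from typing import Optional

# URL pattern taken from the example in this thread:
# https://arxiv.org/pdf/2408.02784.pdf
ARXIV_PDF_BASE = "https://arxiv.org/pdf/"


def arxiv_pdf_url(arxiv_id: str) -> str:
    """Build a PDF link from a bare arXiv ID such as '2408.02784'."""
    return f"{ARXIV_PDF_BASE}{arxiv_id.strip()}.pdf"


def fill_pdf_url(pdf_url: Optional[str], arxiv_id: Optional[str]) -> Optional[str]:
    """Hypothetical helper: keep an existing pdf_url, else fall back to arXiv."""
    # Keep any non-empty URL that is already present. Both None and ""
    # (the empty string Semantic Scholar returned) fall through.
    if pdf_url:
        return pdf_url
    # Only construct a link when an arXiv ID is actually known.
    if arxiv_id:
        return arxiv_pdf_url(arxiv_id)
    return None
```

Calling fill_pdf_url(None, "2408.02784") would return the arXiv link for the example paper, while a record with no arXiv ID stays at None.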

Also, OpenAlex could expose arxiv_id as a top-level field in the response, which might make it easier for users to construct the URL themselves.


Thanks,

Purna

Samuel Mok

Dec 10, 2025, 4:58:41 PM
to Purna Srivatsa, OpenAlex Community
The issue here is that the SSRN preprint entry you linked (https://api.openalex.org/works/W4402775298) is not merged with the arXiv preprint version in OpenAlex (which you can find here: https://api.openalex.org/W4403622645). So your proposal won't solve this issue: entry W44...98 does not have any arXiv information. The problems that need solving here are:

- deduplication: identify that these work entries are about the same preprint hosted in two places (this is non-trivial!)
- recognizing hosts that always have PDFs: SSRN is an open access platform, so every entry has an accessible PDF and every SSRN entry in OpenAlex should include a PDF link; the same goes for arXiv, etc. If the link is not found by scraping/indexing, add programmatic rules.

Cheers,
Samuel

Purna Srivatsa

Dec 11, 2025, 5:08:45 AM
to OpenAlex Community
Thanks for the response.

Yes, you're right about the two enhancements that could improve this. I just looked at both API responses, and I think one way would be to match entries that have the same authors, title, and year (which is the case for the two above). But there is a risk of false positives, and a more deterministic way would be preferable for sure.
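
To make the matching idea concrete, here is a minimal Python sketch. It assumes each work has already been flattened into a dict with "title", "authors" (a list of name strings) and "year" keys; that shape and the function names are hypothetical, not the actual OpenAlex response layout. It only does exact comparison after normalization, so it carries the false-positive risk mentioned above, and it would also miss pairs whose titles or author lists differ slightly between hosts.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace every run of non-alphanumeric characters with one space.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def same_preprint(a: dict, b: dict) -> bool:
    """Hypothetical check: same normalized title, same author set, same year."""
    return (
        normalize(a["title"]) == normalize(b["title"])
        and {normalize(n) for n in a["authors"]} == {normalize(n) for n in b["authors"]}
        and a["year"] == b["year"]
    )
```

Author order is ignored on purpose, since two hosts may list the same authors differently.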

I think the second approach is more deterministic and doable? Of course, I'm not familiar enough with the existing setup to know the actual complexity. But if there's a standard way to populate pdf_url for all SSRN works, it could improve the pdf_url/oa_url coverage significantly, I presume.

Thanks,
Purna
