Bulk PDF download for specific categories — is OAI-PMH + direct PDF URL construction the intended approach?


Stephen R

Mar 17, 2026, 6:11:06 PM
to arXiv API Discussion

Hi everyone, pretty cool community here.

I'm a math and econ student and I'm building an educational RAG system indexing academic papers across a subset of arXiv categories (math, econ, cs, q-fin, stat) for research and study purposes. My goal is to download the full-text PDFs for these categories in bulk.

I've seen the recommendation to use the Kaggle metadata snapshot for discovery — but that only contains metadata, not PDFs. The S3 buckets are cost-prohibitive for an educational project, not to mention they will have papers in categories not relevant to my project.

My current approach is: (1) use OAI-PMH to harvest arXiv IDs filtered by category, then (2) construct PDF URLs in the form https://arxiv.org/pdf/{id} and download them with rate-limiting between requests.

Is this the intended/permitted method for bulk PDF access for educational/research purposes? Is there a specific rate limit that applies to PDF endpoint fetches (separate from the API query endpoint)? Are there any other options available for free bulk PDF access at category scale?

Thank you.

Brian Maltzan

Mar 17, 2026, 7:27:51 PM
to a...@arxiv.org
Hi Stephen,

Yes, this will work fine:
> My current approach is: (1) use OAI-PMH to harvest arXiv IDs filtered by category, then (2) construct PDF URLs in the form https://arxiv.org/pdf/{id} and download them with rate-limiting between requests.

That said, OAI-PMH is meant to give you the changes to the collection since a given date, plus some filtering. It's a catch-up mechanism.
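
For reference, the quoted approach (build the PDF URL from each ID, then download with a pause between requests) might be sketched as below. The helper names pdf_url, pdf_filename and fetch_pdfs are hypothetical, and the default 3-second delay is an assumed placeholder, not a limit confirmed in this thread for the PDF endpoint.

```shell
# Hypothetical helpers; the names and the default delay are assumptions.

# Build the PDF URL in the form https://arxiv.org/pdf/{id}.
pdf_url() {
  printf 'https://arxiv.org/pdf/%s\n' "$1"
}

# Old-style IDs contain a slash, which cannot appear in a file name,
# so replace it with an underscore.
pdf_filename() {
  printf '%s.pdf\n' "$(printf '%s' "$1" | tr '/' '_')"
}

# Read one arXiv ID per line on stdin and fetch each PDF,
# sleeping between requests.
fetch_pdfs() {
  delay="${1:-3}"   # seconds between requests (assumed value)
  while IFS= read -r id; do
    [ -n "$id" ] || continue
    curl -fsSL -o "$(pdf_filename "$id")" "$(pdf_url "$id")" \
      || echo "failed: $id" >&2
    sleep "$delay"
  done
}

# Example (not run here): fetch_pdfs 5 < ids.txt
```

Failed downloads are only reported on stderr, so a rerun over the failed IDs would be needed for a complete set.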

I'd agree with the Kaggle metadata recommendation.
It's quick to filter, as each line in the file is a JSON object for one paper.

# What are some categories
head arxiv-metadata-oai-snapshot.json  | jq . | grep categories
  "categories": "hep-ph",
  "categories": "math.CO cs.CG",

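# Grep the whole metadata file for papers with a given primary category.
# The leading quote anchors the match to the start of the categories string;
# [ "] also catches papers whose only category is math.CO:
grep '"math\.CO[ "]' arxiv-metadata-oai-snapshot.json | more
# or save the matches to a file:
grep '"math\.CO[ "]' arxiv-metadata-oai-snapshot.json > results.math.co.json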
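
The same leading-quote idea can be extended to whole archives, which is closer to the original question (math, econ, cs, q-fin, stat). A minimal sketch follows, assuming the "categories" key is directly followed by its value with at most one space after the colon; filter_primary_archives is a hypothetical helper name.

```shell
# Hypothetical helper: keep only JSON lines whose primary (first-listed)
# category belongs to one of the archives of interest.
# Assumes the field looks like "categories":"math.CO ..." or
# "categories": "math.CO ..." (at most one space after the colon).
filter_primary_archives() {
  grep -E '"categories": ?"(math|econ|cs|q-fin|stat)\.'
}

# Example (not run here):
#   filter_primary_archives < arxiv-metadata-oai-snapshot.json > subset.json
```

Requiring the dot after the archive name keeps separate archives such as math-ph out of the match.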

Here are instructions for getting PDFs from a Google Cloud Storage bucket, which is not rate limited. Click "view more":
https://www.kaggle.com/datasets/Cornell-University/arxiv/data

# Copy a file to your computer:
gcloud storage cp gs://arxiv-dataset/arxiv/arxiv/pdf/1706/1706.03762v1.pdf .

# List a PDF. Papers with old-style identifiers (before April 2007) use a different path structure, e.g.:
gcloud storage ls gs://arxiv-dataset/arxiv/math/pdf/0211/0211159v1.pdf
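
The two path layouts above can be derived from a versioned arXiv ID. This is only a sketch inferred from the two example paths: gcs_pdf_path is a hypothetical helper, and it assumes old-style IDs are written as archive/number (for example math/0211159v1) with the version suffix included.

```shell
# Hypothetical helper: map a versioned arXiv ID to its bucket path,
# following the two example layouts shown above.
gcs_pdf_path() {
  id="$1"
  case "$id" in
    */*)
      # Old-style ID: <archive>/<yymm><number>v<N>
      archive="${id%%/*}"
      rest="${id#*/}"
      yymm="$(printf '%s' "$rest" | cut -c1-4)"
      printf 'gs://arxiv-dataset/arxiv/%s/pdf/%s/%s.pdf\n' "$archive" "$yymm" "$rest"
      ;;
    *)
      # New-style ID: <yymm>.<number>v<N>
      yymm="$(printf '%s' "$id" | cut -c1-4)"
      printf 'gs://arxiv-dataset/arxiv/arxiv/pdf/%s/%s.pdf\n' "$yymm" "$id"
      ;;
  esac
}

# Example (not run here): gcloud storage cp "$(gcs_pdf_path 1706.03762v1)" .
```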

Cheers,
Brian

