Bulk PDF download for specific categories — is OAI-PMH + direct PDF URL construction the intended approach?


Stephen R

Mar 17, 2026, 6:11:06 PM
to arXiv API Discussion

Hi everyone, pretty cool community here.

I'm a math and econ student and I'm building an educational RAG system indexing academic papers across a subset of arXiv categories (math, econ, cs, q-fin, stat) for research and study purposes. My goal is to download the full-text PDFs for these categories in bulk.

I've seen the recommendation to use the Kaggle metadata snapshot for discovery — but that only contains metadata, not PDFs. The S3 buckets are cost-prohibitive for an educational project, not to mention they will have papers in categories not relevant to my project.

My current approach is: (1) use OAI-PMH to harvest arXiv IDs filtered by category, then (2) construct PDF URLs in the form https://arxiv.org/pdf/{id} and download them with rate-limiting between requests.
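The two steps above could be sketched roughly as follows. This is a minimal illustration, not a confirmed arXiv-endorsed client: the base URL in `OAI_BASE_URL` and the helper names are assumptions, and the set names arXiv accepts should be checked against its OAI-PMH documentation. Only the `ListIdentifiers` verb, its arguments, and the XML namespace come from the OAI-PMH protocol itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumption: illustrative base URL; confirm the current endpoint in arXiv's docs.
OAI_BASE_URL = "https://export.arxiv.org/oai2"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(set_spec, from_date=None, token=None):
    """Build a ListIdentifiers request URL (hypothetical helper)."""
    if token:
        # Per OAI-PMH, a resumption token is the only argument besides the verb.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc", "set": set_spec}
        if from_date:
            params["from"] = from_date
    return OAI_BASE_URL + "?" + urlencode(params)


def parse_identifiers(xml_text):
    """Return (arxiv_ids, resumption_token) from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    ids = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        if header.get("status") == "deleted":
            continue  # skip records flagged as deleted
        identifier = header.findtext("oai:identifier", default="", namespaces=OAI_NS)
        # Identifiers look like "oai:arXiv.org:0704.0001"; keep the last part.
        ids.append(identifier.rsplit(":", 1)[-1])
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return ids, (token.strip() or None)


def pdf_url(arxiv_id):
    """Step 2: construct the PDF URL in the form given above."""
    return f"https://arxiv.org/pdf/{arxiv_id}"
```

A harvester would call `list_identifiers_url` repeatedly, passing back each returned resumption token until it is `None`, and pause between both the OAI requests and the PDF downloads.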

Is this the intended/permitted method for bulk PDF access for educational/research purposes? Is there a specific rate limit that applies to PDF endpoint fetches (separate from the API query endpoint)? Are there any other options available for free bulk PDF access at category scale?

Thank you.

Brian Maltzan

Mar 17, 2026, 7:27:51 PM
to a...@arxiv.org
Hi Stephen,

Yes, this will work fine:
> My current approach is: (1) use OAI-PMH to harvest arXiv IDs filtered by category, then (2) construct PDF URLs in the form https://arxiv.org/pdf/{id} and download them with rate-limiting between requests.

Note, though, that OAI-PMH is meant to give you the changes to the collection since a given date, plus some filtering. It's a catch-up mechanism rather than a bulk export.

I'd agree with the Kaggle metadata recommendation.
It's quick to filter, as each line in the file is a JSON object for one paper.

# See what some category values look like:
head arxiv-metadata-oai-snapshot.json | jq . | grep categories
  "categories": "hep-ph",
  "categories": "math.CO cs.CG",

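# Grep the whole metadata file for papers with a given primary category, by using a leading quote.
# The trailing [ "] also matches papers that list only that one category:
grep '"math\.CO[ "]' arxiv-metadata-oai-snapshot.json | more
# or save the matches to a file:
grep '"math\.CO[ "]' arxiv-metadata-oai-snapshot.json > results.math.co.json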
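The same primary-category filter could also be done by parsing each line as JSON, which avoids accidental matches in other fields such as titles or abstracts. A minimal sketch, assuming the first category listed in the `categories` string is the primary one (as the grep with a leading quote implies); the function names are hypothetical:

```python
import json


def primary_category(record):
    """Return the first-listed category, treated here as the primary one."""
    categories = record.get("categories", "").split()
    return categories[0] if categories else None


def filter_by_primary(lines, wanted):
    """Yield records whose primary category, or its archive prefix, is in `wanted`.

    `wanted` may mix full categories ("math.CO") and archives ("math", "q-fin").
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        primary = primary_category(record)
        if primary is None:
            continue
        archive = primary.split(".")[0]  # "math.CO" -> "math"
        if primary in wanted or archive in wanted:
            yield record
```

To use it, open `arxiv-metadata-oai-snapshot.json` and pass the file object as `lines`, for example with `wanted={"math", "econ", "cs", "q-fin", "stat"}`, then collect `record["id"]` from each result. The file is read line by line, so the whole snapshot never has to fit in memory.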

Here are instructions for getting PDFs from a Google Cloud Storage bucket, which is not rate limited. Click "view more":
https://www.kaggle.com/datasets/Cornell-University/arxiv/data

# Copy a file to your computer:
gcloud storage cp gs://arxiv-dataset/arxiv/arxiv/pdf/1706/1706.03762v1.pdf .

# List a PDF. Papers with old-style identifiers (before April 2007) use a different path structure, e.g.:
gcloud storage ls gs://arxiv-dataset/arxiv/math/pdf/0211/0211159v1.pdf
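For scripting these copies, the two path layouts shown above can be derived from an arXiv ID. This is a sketch that generalizes from just those two example paths, so the layout for other archives and months is an assumption; `gcs_pdf_path` is a hypothetical helper, and the version number has to be supplied by the caller.

```python
import re

BUCKET_PREFIX = "gs://arxiv-dataset/arxiv"
NEW_STYLE = re.compile(r"^(\d{4})\.\d{4,5}$")      # e.g. 1706.03762
OLD_STYLE = re.compile(r"^([a-z-]+)/(\d{4})\d{3}$")  # e.g. math/0211159


def gcs_pdf_path(arxiv_id, version=1):
    """Map an arXiv ID (without version suffix) to its assumed bucket path."""
    new = NEW_STYLE.match(arxiv_id)
    if new:
        yymm = new.group(1)
        return f"{BUCKET_PREFIX}/arxiv/pdf/{yymm}/{arxiv_id}v{version}.pdf"
    old = OLD_STYLE.match(arxiv_id)
    if old:
        archive, yymm = old.group(1), old.group(2)
        number = arxiv_id.split("/", 1)[1]  # drop the "math/" prefix
        return f"{BUCKET_PREFIX}/{archive}/pdf/{yymm}/{number}v{version}.pdf"
    raise ValueError(f"unrecognized arXiv ID: {arxiv_id!r}")
```

Each returned path can be passed to `gcloud storage cp` as in the commands above.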

Cheers,
Brian



Stephen R

May 18, 2026, 1:52:03 AM
to arXiv API Discussion, Brian Maltzan
Hi Brian, thanks for the GCS tip; it's perfect for the historical bulk download. For our live pipeline, however, we need to ingest the newest daily papers (the delta since the last GCS update). We are using the export.arxiv.org/api/query endpoint to find the newest IDs, but when we download the PDFs for these daily updates, our IP frequently receives HTTP 429 (Too Many Requests) responses. Is there a recommended polite rate (e.g., a maximum number of requests per minute) for fetching the daily PDF deltas? Or is there a separate mechanism entirely for syncing the daily PDFs?
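Pending an answer on the official rate, one common way to handle 429 responses is to fetch one file at a time, pause between requests, and back off exponentially when throttled. The sketch below illustrates that pattern only. The 3-second delay and the other defaults are placeholder assumptions, not a confirmed limit for PDF fetches, and `fetch` is a hypothetical callable supplied by the caller that returns `(status, retry_after_seconds, body)`. A `Retry-After` header given as an HTTP date is not handled.

```python
import time


def backoff_delay(attempt, retry_after=None, base=3.0, cap=300.0):
    """Seconds to wait before retrying; `attempt` is 0-based."""
    if retry_after is not None:
        # Prefer the server's own hint when it is given in seconds.
        return min(float(retry_after), cap)
    return min(base * (2 ** attempt), cap)


def fetch_politely(urls, fetch, delay=3.0, max_retries=5, sleep=time.sleep):
    """Fetch URLs sequentially; returns {url: body or None}.

    Retries only on 429 and 503, and always sleeps `delay` between URLs.
    """
    results = {}
    for url in urls:
        for attempt in range(max_retries + 1):
            status, retry_after, body = fetch(url)
            if status == 200:
                results[url] = body
                break
            if status not in (429, 503) or attempt == max_retries:
                results[url] = None  # give up on this URL
                break
            sleep(backoff_delay(attempt, retry_after))
        sleep(delay)  # fixed pause between consecutive downloads
    return results
```

Passing `fetch` and `sleep` in as arguments keeps the retry logic independent of any particular HTTP library and makes it easy to test without network access.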