Bulk PDF download for specific categories — is OAI-PMH + direct PDF URL construction the intended approach?

Stephen R

Mar 17, 2026, 6:11:06 PM

to arXiv API Discussion

Hi everyone, pretty cool community here.

I'm a math and econ student and I'm building an educational RAG system indexing academic papers across a subset of arXiv categories (math, econ, cs, q-fin, stat) for research and study purposes. My goal is to download the full-text PDFs for these categories in bulk.

I've seen the recommendation to use the Kaggle metadata snapshot for discovery — but that only contains metadata, not PDFs. The S3 buckets are cost-prohibitive for an educational project, not to mention they will have papers in categories not relevant to my project.

My current approach is: (1) use OAI-PMH to harvest arXiv IDs filtered by category, then (2) construct PDF URLs in the form https://arxiv.org/pdf/{id} and download them with rate-limiting between requests.
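A minimal sketch of step (2), assuming Python with only the standard library. The names `pdf_url` and `download_all` and the 5-second default delay are illustrative assumptions, not values documented by arXiv; only the URL form comes from the description above.

```python
import time
import urllib.request

PDF_BASE = "https://arxiv.org/pdf/"  # URL form quoted in the post


def pdf_url(arxiv_id: str) -> str:
    """Build the PDF URL for an ID such as '1706.03762' or 'math/0211159'."""
    return PDF_BASE + arxiv_id.strip()


def download_all(arxiv_ids, delay_seconds=5.0, fetch=None, sleep=time.sleep):
    """Fetch each PDF in turn, pausing between consecutive requests.

    delay_seconds is an assumed placeholder, not an official arXiv limit.
    fetch and sleep are injectable so the loop can be exercised offline.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=60) as resp:
                return resp.read()

    results = {}
    for i, arxiv_id in enumerate(arxiv_ids):
        if i > 0:
            sleep(delay_seconds)  # no pause before the very first request
        results[arxiv_id] = fetch(pdf_url(arxiv_id))
    return results
```

Keeping the fetch function injectable makes it easy to swap in a retrying fetcher later without touching the loop.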

Is this the intended/permitted method for bulk PDF access for educational/research purposes? Is there a specific rate limit that applies to PDF endpoint fetches (separate from the API query endpoint)? Are there any other options available for free bulk PDF access at category scale?

Thank you.

Brian Maltzan

Mar 17, 2026, 7:27:51 PM

to a...@arxiv.org

Hi Stephen,

Yes, this will work fine:
> My current approach is: (1) use OAI-PMH to harvest arXiv IDs filtered by category, then (2) construct PDF URLs in the form https://arxiv.org/pdf/{id} and download them with rate-limiting between requests.

Note, though, that OAI-PMH is meant to give you the changes to the collection since a given date, plus some filtering. It's a catch-up mechanism.

I'd agree with the Kaggle metadata recommendation.
It's quick to filter, as each line in the file is a JSON object for one paper.

# Show some example categories:
head arxiv-metadata-oai-snapshot.json | jq . | grep categories
"categories": "hep-ph",
"categories": "math.CO cs.CG",

# Grep the whole metadata file for papers with a given primary category by anchoring on the leading quote.
# The [ "] after the name also matches papers whose only category is math.CO:
grep -E '"math\.CO[ "]' arxiv-metadata-oai-snapshot.json | more

or

grep -E '"math\.CO[ "]' arxiv-metadata-oai-snapshot.json > results.math.co.json
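For filtering on whole archives (math, econ, cs, q-fin, stat) rather than a single category, parsing each line as JSON may be less fragile than pattern matching. The following is a sketch under stated assumptions: the helper names are hypothetical, the first entry of the space-separated `categories` string is treated as the primary category, and each record is assumed to carry an `id` field alongside `categories`.

```python
import json


def primary_category(record: dict) -> str:
    """Return the first entry of the space-separated 'categories' string."""
    cats = record.get("categories", "").split()
    return cats[0] if cats else ""


def archive_of(category: str) -> str:
    """'math.CO' -> 'math'; 'hep-ph' -> 'hep-ph'."""
    return category.split(".", 1)[0]


def iter_matching(lines, archives):
    """Yield records whose primary category belongs to one of the archives."""
    wanted = set(archives)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if archive_of(primary_category(record)) in wanted:
            yield record


if __name__ == "__main__":
    wanted = {"math", "econ", "cs", "q-fin", "stat"}
    with open("arxiv-metadata-oai-snapshot.json", encoding="utf-8") as f:
        for rec in iter_matching(f, wanted):
            print(rec["id"])  # 'id' field name is an assumption
```

Reading the file line by line keeps memory use flat, which matters because the snapshot is a single large file.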

Here are instructions for getting PDFs from a Google Cloud Storage bucket, which is not rate limited. Click "view more" on this page:
https://www.kaggle.com/datasets/Cornell-University/arxiv/data

# Copy a file to your computer:
gcloud storage cp gs://arxiv-dataset/arxiv/arxiv/pdf/1706/1706.03762v1.pdf .

# List a PDF. Papers with old-style IDs (before April 2007) have a different structure, e.g.:
gcloud storage ls gs://arxiv-dataset/arxiv/math/pdf/0211/0211159v1.pdf
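The two paths above suggest a simple mapping from an arXiv ID and version number to a bucket object. The sketch below generalises from those two examples only, so the layout for other archives is an assumption, and `gcs_pdf_path` is a hypothetical helper name. The caller has to supply the version number.

```python
BUCKET_PREFIX = "gs://arxiv-dataset/arxiv"  # taken from the example paths


def gcs_pdf_path(arxiv_id: str, version: int) -> str:
    """Map an arXiv ID plus version to a bucket path.

    New-style IDs ('1706.03762') live under 'arxiv/'; old-style IDs
    ('math/0211159') live under their archive name. Layout inferred
    from the two example paths only.
    """
    if "/" in arxiv_id:
        archive, number = arxiv_id.split("/", 1)
    else:
        archive, number = "arxiv", arxiv_id
    yymm = number[:4]  # first four digits give the year and month folder
    return f"{BUCKET_PREFIX}/{archive}/pdf/{yymm}/{number}v{version}.pdf"


if __name__ == "__main__":
    print(gcs_pdf_path("1706.03762", 1))
    print(gcs_pdf_path("math/0211159", 1))
```

The resulting strings can be passed to `gcloud storage cp` as in the commands above.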

Cheers,

Brian

Stephen R

May 18, 2026, 1:52:03 AM

to arXiv API Discussion, Brian Maltzan

Hi Brian, thanks for the GCS tip; it's perfect for the historical bulk! However, for our live pipeline we need to ingest the newest daily papers (the delta since the last GCS update). We are using the export.arxiv.org/api/query endpoint to find the newest IDs, but when we attempt to download the PDFs for these daily updates, our IP frequently receives HTTP 429 (Too Many Requests) responses. Is there a recommended polite rate (e.g., max requests per minute) for fetching the daily PDF deltas? Or is there a separate mechanism entirely for syncing the daily PDFs?

Payam Meyer

May 26, 2026, 10:48:42 AM

to arXiv API Discussion, Stephen R, Brian Maltzan

Hello,

We are experiencing the same difficulty as Stephen. We are not interested in storing or serving the PDFs; we only extract the text for search, in support of analytical work at a government health agency. We wait 5 seconds between calls to fetch and extract the text, and one minute after receiving a 429, but we are still being rate limited frequently.
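For reference, the retry policy described above (a fixed wait after a 429) can be extended to back off exponentially and to honour a `Retry-After` header when the server sends one. This is only a sketch: whether arXiv sends `Retry-After` is an assumption, the 60-second base delay mirrors the one-minute wait mentioned above rather than any official guidance, and `fetch_with_backoff` is a hypothetical helper name.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, base_delay=60.0, max_retries=5,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying on HTTP 429 with a doubling delay.

    A numeric Retry-After header (seconds) takes precedence over the
    computed delay; the HTTP-date form of the header is not handled.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Give up on non-429 errors or once retries are exhausted.
            if err.code != 429 or attempt == max_retries:
                raise
            retry_after = err.headers.get("Retry-After") if err.headers else None
            if retry_after and retry_after.isdigit():
                wait = float(retry_after)
            else:
                wait = delay
            sleep(wait)
            delay *= 2  # back off further if the next attempt also fails
```

Doubling the delay means repeated 429s slow the client down progressively instead of retrying at a constant rate.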

We would appreciate your insights.

Thanks,

Payam
