Hi everyone, pretty cool community here.
I'm a math and econ student building an educational RAG system that indexes academic papers across a subset of arXiv categories (math, econ, cs, q-fin, stat) for research and study purposes. My goal is to download the full-text PDFs for these categories in bulk.
I've seen the recommendation to use the Kaggle metadata snapshot for discovery, but that contains only metadata, not PDFs. The S3 buckets are cost-prohibitive for an educational project, and they also include papers in categories that are not relevant to my project.
My current approach is: (1) use OAI-PMH to harvest arXiv IDs filtered by category, then (2) construct PDF URLs in the form https://arxiv.org/pdf/{id} and download them with rate-limiting between requests.
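To make the approach concrete, here is a minimal Python sketch of the two steps. It is only an illustration under some assumptions of mine: that OAI-PMH identifiers look like oai:arXiv.org:<id>, that deleted records carry status="deleted" on the header (as the OAI-PMH spec describes), and that a 3-second pause is a reasonable placeholder (I do not know the actual limit for the PDF endpoint, which is part of my question). The function names and the User-Agent string are hypothetical.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, Iterator, Tuple

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
# Assumption: arXiv's OAI identifiers use this prefix before the arXiv ID.
OAI_PREFIX = "oai:arXiv.org:"


def ids_from_oai_xml(xml_text: str) -> list:
    """Extract arXiv IDs from one OAI-PMH ListIdentifiers/ListRecords page."""
    root = ET.fromstring(xml_text)
    ids = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        if header.get("status") == "deleted":
            continue  # skip records marked as deleted
        ident = header.find("oai:identifier", OAI_NS)
        text = (ident.text or "").strip() if ident is not None else ""
        if text.startswith(OAI_PREFIX):
            ids.append(text[len(OAI_PREFIX):])
    return ids


def pdf_url(arxiv_id: str) -> str:
    """Build the PDF URL in the form described in step (2)."""
    return f"https://arxiv.org/pdf/{arxiv_id}"


def http_fetch(url: str) -> bytes:
    """Fetch one URL; the User-Agent value is a hypothetical placeholder."""
    req = urllib.request.Request(
        url, headers={"User-Agent": "edu-rag-harvester (placeholder contact)"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read()


def download_all(
    ids: Iterable[str],
    fetch: Callable[[str], bytes] = http_fetch,
    delay_seconds: float = 3.0,  # placeholder, not a confirmed arXiv limit
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (arxiv_id, pdf_bytes), pausing between consecutive requests."""
    for i, arxiv_id in enumerate(ids):
        if i > 0:
            sleep(delay_seconds)
        yield arxiv_id, fetch(pdf_url(arxiv_id))
```

The fetch and sleep functions are passed in as parameters so the loop can be tried against a stub without touching arxiv.org. Paging through OAI-PMH resumption tokens and retry handling are left out of the sketch.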
Is this the intended/permitted method for bulk PDF access for educational/research purposes? Is there a specific rate limit that applies to PDF endpoint fetches (separate from the API query endpoint)? Are there any other options available for free bulk PDF access at category scale?
Thank you.