Bulk source access for ~247K papers is out of date
Eric Price
May 20, 2026, 9:52:08 AM
to arXiv API Discussion
Hi,
I'm on Anthropic's pretraining data team. We consume arXiv source via the requester-pays S3 bucket at s3://arxiv/src/ (the month-tar bulk dataset), which works great — thank you for maintaining it.
We've noticed that each arXiv_src_YYMM_NNN.tar is rebuilt on an irregular schedule (per its manifest <timestamp>), and ~247K papers (~8% of the corpus) currently have an OAI versions[-1].created date newer than the rebuild timestamp of the tar they belong to. For those papers the bulk tar still contains an older version of the source. The stale papers are spread across ~95% of the tars (median ~21 stale per tar).
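The staleness check described above can be sketched as follows. This is a minimal illustration, not the exact pipeline: the function name `find_stale`, the three dictionary inputs, and every ID, tar name, and date in the example are hypothetical, and it assumes the OAI `created` dates and manifest timestamps have already been parsed into timezone-aware datetimes.

```python
from datetime import datetime, timezone


def find_stale(latest_created, tar_of, tar_rebuilt):
    """Return sorted IDs whose newest version postdates their tar's rebuild.

    latest_created: paper ID -> created time of the newest OAI version
    tar_of:         paper ID -> name of the bulk tar holding that paper
    tar_rebuilt:    tar name -> rebuild timestamp from the manifest
    """
    return sorted(
        pid
        for pid, created in latest_created.items()
        if created > tar_rebuilt[tar_of[pid]]
    )


# Made-up values for illustration only (not real arXiv data).
utc = timezone.utc
latest_created = {
    "paper-a": datetime(2024, 1, 15, tzinfo=utc),  # revised after the rebuild
    "paper-b": datetime(2023, 6, 2, tzinfo=utc),   # last revised before it
}
tar_of = {"paper-a": "example_001.tar", "paper-b": "example_001.tar"}
tar_rebuilt = {"example_001.tar": datetime(2023, 7, 1, tzinfo=utc)}

print(find_stale(latest_created, tar_of, tar_rebuilt))  # ['paper-a']
```

A paper counts as stale only when its newest version is strictly newer than the rebuild timestamp, matching the comparison described in the paragraph above.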
We ran a small test (20 papers) against export.arxiv.org/e-print/ and observed both the Disallow: / rule in robots.txt and 429 responses at one request per 7 seconds, so we're asking before doing anything at scale. The options we are considering:
1. We fetch export.arxiv.org/e-print/<id> for each of the 247K IDs at one request per 3 seconds, ~9-10 days. User-Agent anthropic-arxiv-refresh/1.0 (mailto:ecp...@anthropic.com). Total ~860 GB.
2. If there's a better bulk mechanism (per-paper requester-pays prefix, or a tar rebuild) we'd use that instead.
3. If you'd rather we wait for normal rebuilds, I guess we can, but many seem years out of date.
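As a rough check on the timing in option 1, the sketch below computes the pure pacing time and shows one way to space requests. The helper names `crawl_days` and `paced` are hypothetical, and the pacing loop is only an assumption about how such a crawl could be throttled; it performs no network access. Pacing alone comes to about 8.6 days at one request per 3 seconds, so the ~9-10 day figure leaves some room for retries and back-off.

```python
import time


def crawl_days(n_requests: int, seconds_per_request: float) -> float:
    """Wall-clock days needed if requests are spaced at a fixed interval."""
    return n_requests * seconds_per_request / 86_400


def paced(ids, interval_s, sleep=time.sleep, clock=time.monotonic):
    """Yield each ID no sooner than interval_s after the previous one."""
    next_slot = clock()
    for paper_id in ids:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)
        # Schedule the next slot relative to now, so slow requests
        # do not cause a burst of catch-up requests afterwards.
        next_slot = max(next_slot, clock()) + interval_s
        yield paper_id


print(round(crawl_days(247_000, 3), 1))  # 8.6
print(round(crawl_days(247_000, 7), 1))  # 20.0
```

In use, a caller would iterate `for paper_id in paced(id_list, 3): ...` and issue one request per iteration, where `id_list` stands for the stale-ID list.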
Happy to share the ID list if useful. One example that shows what's happening is https://arxiv.org/abs/2306.03498: the website serves source for v5, but the bulk tar ( aws s3 cp --request-payer requester s3://arxiv/src/arXiv_src_2306_027.tar - | tar -tv 2>/dev/null | grep 2306.03498 ) still contains v1.