Fetching all metadata and PDF files from arxiv


Prakritidev Verma

Feb 5, 2018, 12:27:21 PM
to arXiv API
Hi, 

I'm a student trying to build something similar to arxiv-sanity, and I want to fetch all of the metadata and the PDFs from arXiv. I can't figure out how to do that. A previous topic in this group says that the data in the S3 bucket does not contain the metadata. Can anyone help me with how to get the data?

Thanks. 

Andrew Head

Oct 17, 2018, 11:36:27 AM
to arXiv API
Hi Prakritidev! Did you ever figure out an answer to this question? I'm interested in the answer too.

It looks like the PDFs from the bulk download are named by their arXiv ID (e.g., 1801.00001.pdf). Perhaps one could get the metadata by querying the arXiv API with this ID in the id_list argument. Though making that many API calls kind of defeats the purpose of a bulk download.
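A minimal sketch of that idea, assuming the file name is just the arXiv ID plus ".pdf" and that the API endpoint is http://export.arxiv.org/api/query. The function names are made up for illustration. Passing several IDs in one id_list keeps the number of calls down:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path

API_URL = "http://export.arxiv.org/api/query"  # assumed endpoint
ATOM = "{http://www.w3.org/2005/Atom}"         # Atom namespace used by the feed


def arxiv_id_from_filename(filename):
    """'1801.00001.pdf' -> '1801.00001' (assumes the name is ID + '.pdf')."""
    return Path(filename).stem


def build_query_url(arxiv_ids):
    """Build one request for many IDs; id_list is comma-separated."""
    params = {"id_list": ",".join(arxiv_ids), "max_results": len(arxiv_ids)}
    return API_URL + "?" + urllib.parse.urlencode(params, safe=",")


def parse_feed(atom_xml):
    """Pull id, title, and author names out of the Atom feed."""
    root = ET.fromstring(atom_xml)
    records = []
    for entry in root.iter(ATOM + "entry"):
        records.append({
            "id": entry.findtext(ATOM + "id"),
            # Titles can contain line breaks; collapse the whitespace.
            "title": " ".join(entry.findtext(ATOM + "title", "").split()),
            "authors": [a.findtext(ATOM + "name")
                        for a in entry.iter(ATOM + "author")],
        })
    return records


def fetch_metadata(arxiv_ids):
    """Network call: fetch and parse metadata for a batch of IDs."""
    with urllib.request.urlopen(build_query_url(arxiv_ids)) as resp:
        return parse_feed(resp.read())
```

Calling fetch_metadata(["1801.00001", "1801.00002"]) would return one dict per paper. The batch size and any pause between requests are left to the caller.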

Thorsten

Oct 17, 2018, 11:42:48 AM
to arXiv API

The best way to obtain bulk metadata for all of arXiv or specific categories (sets) is via OAI-PMH harvest.

There are many good tools for OAI-PMH and this is very straightforward. For documentation see the section "Bulk Metadata Access" at https://arxiv.org/help/bulk_data
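For anyone who prefers not to use an existing tool, here is a rough sketch of a harvest loop using only the Python standard library. The base URL, the default pause, and the function names are assumptions for illustration, so check the help page above for the current endpoint and rate limits. It requests ListRecords and follows the resumptionToken until the server stops returning one:

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH XML namespace
BASE_URL = "http://export.arxiv.org/oai2"       # assumed; verify on the help page


def http_get(url):
    """Default fetcher: plain HTTP GET returning the response body."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def harvest(set_spec=None, metadata_prefix="oai_dc", fetch=http_get, delay=5.0):
    """Yield every <record> element, following resumption tokens.

    set_spec limits the harvest to one set (for example a category);
    delay is an arbitrary pause in seconds between pages.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        url = BASE_URL + "?" + urllib.parse.urlencode(params)
        root = ET.fromstring(fetch(url))
        for record in root.iter(OAI + "record"):
            yield record
        token = root.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty one, means the list is complete
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
        time.sleep(delay)
```

Something like `for record in harvest(set_spec="cs"): ...` would then walk one set page by page. The sketch does not retry on HTTP errors such as a 503 asking the client to wait, which a real harvester should handle.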

Cheers
T.

Andrew Head

Oct 17, 2018, 8:25:21 PM
to arxi...@googlegroups.com
Great, thank you for following up Thorsten :D This is very helpful.
