Can you directly download many papers using identifiers from OAI-PMH?


Grady D

Nov 26, 2016, 11:21:52 AM
to arXiv api
I am researching trends over time in computer science papers to try to link them to the history of the internet. I would like to download all (or a sample of) the raw .tex files and their metadata for papers in the "cs" set (or papers in the CoRR sub-archive). However, I am not sure of the best way to do this. To download raw source files, I know of two options: 
  • Bulk downloads from Amazon AWS - This seems bad because the raw data on AWS is in big tars of documents ordered by date, not by subject group. 
  • Query the metadata APIs and then download from arXiv directly - This would allow me to be more specific about what I'm looking for, and to more easily associate raw files with their metadata, but I am worried about the limitations of the available APIs (below). 
arXiv's bulk data access page recommends OAI-PMH for bulk metadata access, but I don't know whether I can use this protocol to download documents, or whether I can use the document IDs to download them the way an ordinary client could. 

So, my question is whether or not it is feasible to harvest metadata from OAI-PMH and then download documents directly, or if I have to download all of the data from Amazon AWS and then use the document identifiers to filter out non-computer science documents.

Thanks for any recommendations or advice anyone can give me. 

ganwar

Nov 27, 2016, 1:51:39 AM
to arXiv api
I think the API doesn't provide bulk data (i.e. paper) access.

Grady D

Nov 27, 2016, 3:28:36 PM
to arXiv api
I agree, I also don't think OAI-PMH allows me to access source files. I think I'm going to have to download everything from AWS.

Thorsten

Nov 27, 2016, 6:46:16 PM
to arXiv api

That's correct: the arXiv API and OAI-PMH are both intended for metadata querying and dissemination, although in principle the OAI-PMH protocol allows one to specify custom metadata formats with payloads like TeX source or PDF.

From an operational point of view, making bulk data available via S3 is the most efficient approach in terms of arXiv staff time and resources, and given the very low cost of the requester-pays bucket model, it should not be an impediment to any research project. It may be sub-optimal to download the entire arXiv corpus when one is only interested in a specific subset, but it would also take time and resources to prepare said subset at arXiv.

Cheers
T.
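
For anyone following along, the metadata-harvesting half of this workflow can be sketched roughly as below. This is only a sketch under assumptions: the base URL is the OAI-PMH endpoint I believe arXiv has used (check the bulk data help pages for the current one), the function names are made up for illustration, and the sample XML is hand-written rather than a real response. Nothing here touches the network; it only builds request URLs and parses one response page.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumption: arXiv's OAI-PMH base URL; verify against the current help pages.
OAI_BASE_URL = "http://export.arxiv.org/oai2"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(set_spec=None, metadata_prefix="arXiv", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper, no network access)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it must not be combined with set or metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return OAI_BASE_URL + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (identifiers, next_token) for one ListRecords response page."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing resumptionToken means this was the last page.
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hand-written sample page, not real harvested data.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header>
      <identifier>oai:arXiv.org:0704.0001</identifier>
      <datestamp>2008-11-24</datestamp>
      <setSpec>cs</setSpec>
    </header></record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_page(SAMPLE_PAGE)
next_url = list_records_url(resumption_token=token)
```

A real harvester would fetch `list_records_url("cs")`, then keep requesting `next_url` until no token comes back, pausing between requests so the harvest stays polite.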

Grady D

Nov 27, 2016, 7:41:08 PM
to arXiv api
I have used metadata harvesting software available on GitHub to download the (readily) available metadata. However, I only got results from 2007 and later. Could this have anything to do with the change in document identification? I did not limit the harvest to Computer Science, so I'm sure there should be results from before 2007.

Also, now that I have the metadata, would it be impolite, or costly in terms of arXiv staff resources, to use the document identifiers provided by OAI-PMH to download documents like a client would (a script that plugs in URL arguments and saves the results)? 

Thorsten

Nov 27, 2016, 7:44:35 PM
to arXiv api



This is explained here: https://arxiv.org/help/oa/index

Every OAI-PMH metadata record has a datestamp associated with it, which is the last modification time of that record. Because arXiv has updated metadata records in bulk on several occasions, the OAI-PMH datestamp values do not correspond with the original submission or replacement times for older articles, and may not for newer articles because of administrative and bibliographic updates. The earliest datestamp is given in the <earliestDatestamp> element of the Identify response.
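
To make the distinction concrete, here is a small sketch that reads both dates from a single record. It assumes the record was harvested in arXiv's own metadata format, with the namespace and a `created` element as I understand that format; the sample record is hand-written, and `record_dates` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    # Assumption: namespace used by arXiv's own metadata format.
    "arxiv": "http://arxiv.org/OAI/arXiv/",
}

# Hand-written sample record, not real harvested data.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:arXiv.org:cs/0001001</identifier>
    <datestamp>2007-05-23</datestamp>
    <setSpec>cs</setSpec>
  </header>
  <metadata>
    <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
      <id>cs/0001001</id>
      <created>2000-01-03</created>
    </arXiv>
  </metadata>
</record>"""


def record_dates(record_xml):
    """Return the OAI-PMH datestamp and the arXiv 'created' date of one record."""
    root = ET.fromstring(record_xml)
    # Header datestamp: last modification time of the metadata record.
    datestamp = root.findtext("oai:header/oai:datestamp", namespaces=NS)
    # Metadata 'created': original submission date (assumed element name).
    created = root.findtext("oai:metadata/arxiv:arXiv/arxiv:created", namespaces=NS)
    return {"datestamp": datestamp, "created": created}


dates = record_dates(SAMPLE_RECORD)
```

For trend analysis over time, the date inside the metadata payload is the one to group by; the header datestamp is only useful for incremental harvesting.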

Grady D

Nov 27, 2016, 7:45:41 PM
to arXiv api
Regarding my first question: the metadata harvester I used stores the results in batches (with dates in the filenames), and the dates of the records inside do not correspond exactly to those filenames (I suspect that there is a publication date and an upload/made-available date), so I do actually have data from before 2007.

My mistake.



Thorsten

Nov 27, 2016, 7:51:22 PM
to arXiv api

To answer your second question: any automated download that impacts interactive user experience and responsiveness is forbidden, see https://arxiv.org/help/robots. Also, arXiv does have to pay for bandwidth consumed, so robotic downloads do affect the service. I strongly suggest you use the S3 buckets.

Cheers
T.
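
Once the tars are downloaded from S3, filtering them down to the "cs" subset with the harvested identifiers could look roughly like this. It is a sketch under assumptions: it supposes that tar members are named after the arXiv identifier (for example `0704/0704.0001.gz`, or `0001/cs0001001.gz` for old-style identifiers with the slash dropped), which should be checked against a real tar before relying on it. The helper names are hypothetical, and the demo tar is built in memory with empty files.

```python
import io
import re
import tarfile
from pathlib import PurePosixPath

# Old-style identifiers look like "cs/0001001"; the assumed file name drops the slash.
OLD_STYLE = re.compile(r"^([a-z-]+)(\d{7})$")


def member_to_arxiv_id(member_name):
    """Map an assumed tar member name to an arXiv identifier."""
    stem = PurePosixPath(member_name).stem  # drops the directory and last suffix
    match = OLD_STYLE.match(stem)
    if match:
        return f"{match.group(1)}/{match.group(2)}"
    return stem  # new-style identifiers such as "0704.0001"


def select_members(tar, wanted_ids):
    """Return names of regular files in the tar whose identifier is wanted."""
    return [
        m.name
        for m in tar.getmembers()
        if m.isfile() and member_to_arxiv_id(m.name) in wanted_ids
    ]


def demo_tar(names):
    """Build an in-memory tar of empty files, for demonstration only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            tar.addfile(tarfile.TarInfo(name))
    buf.seek(0)
    return tarfile.open(fileobj=buf, mode="r")


# Identifiers as harvested via OAI-PMH, with the "oai:arXiv.org:" prefix removed.
wanted = {"cs/0001001", "0704.0001"}
tar = demo_tar(["0001/cs0001001.gz", "0001/hep-th0001002.gz", "0704/0704.0001.gz"])
selected = select_members(tar, wanted)
```

With a real tar, `tar.extract(name, path=...)` on each selected name would pull out only the wanted sources, so the rest of the corpus never needs to be unpacked.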




Grady D

Nov 27, 2016, 9:07:46 PM
to arXiv api
Thanks