I am researching trends over time in computer science papers and trying to link them to the history of the internet. I would like to download all (or a sample of) the raw .tex files and their metadata for papers in the "cs" set (in other words, papers in the CoRR sub-archive). However, I am not sure of the best way to do this. To download raw source files, I know of two options:
- Bulk downloads from Amazon AWS - This seems bad because the raw data on AWS is packed into big tars of documents ordered by date, not by subject group.
- Query metadata APIs and then download from arXiv directly - This would allow me to be more specific about what I'm looking for, and to more easily associate raw files with their metadata, but I am worried about the limitations of the available APIs (below).
arXiv's bulk data access page recommends OAI-PMH for bulk metadata access, but I don't know whether I can use this protocol to download the documents themselves, or whether I can take the document IDs it returns and fetch each document the way an ordinary client (such as a browser) would.
So, my question is whether or not it is feasible to harvest metadata from OAI-PMH and then download documents directly, or if I have to download all of the data from Amazon AWS and then use the document identifiers to filter out non-computer science documents.
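To make the second option concrete, here is a rough sketch of what I have in mind. It only builds the OAI-PMH request URLs and parses one page of a `ListRecords` response; it does no networking. The base URL, the `oai_dc` metadata prefix, the `oai:arXiv.org:<id>` identifier format, and the `/e-print/<id>` source URL are my assumptions about how arXiv is set up, not something I have confirmed, and the function names are just placeholders.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumed endpoint; check arXiv's bulk data / OAI-PMH page for the current one.
BASE_URL = "http://export.arxiv.org/oai2"
# Standard OAI-PMH 2.0 XML namespace.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(set_spec="cs", metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for one page of results."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it must not be combined with set or metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {
            "verb": "ListRecords",
            "set": set_spec,
            "metadataPrefix": metadata_prefix,
        }
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return (list of document IDs, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = []
    for header in root.iter(OAI_NS + "header"):
        identifier = header.findtext(OAI_NS + "identifier")
        if identifier:
            # Assumes identifiers look like "oai:arXiv.org:0704.0001".
            ids.append(identifier.rsplit(":", 1)[-1])
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    token = None
    if token_el is not None and token_el.text:
        token = token_el.text.strip()
    return ids, token


def source_url(arxiv_id):
    """Guess at the raw-source URL for a paper (an assumption on my part)."""
    return "https://arxiv.org/e-print/" + arxiv_id
```

The idea would be to fetch `list_records_url()`, pass the response body to `parse_page`, and keep requesting `list_records_url(resumption_token=token)` until no token comes back, pausing between requests to stay within whatever rate limits apply. What I can't tell is whether fetching `source_url(...)` for every ID afterwards is permitted or practical at this scale.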
Thanks for any recommendations or advice anyone can give me.