Hi Charles,
Indeed, OAI-PMH is a good way to obtain a snapshot of the metadata for a particular arXiv category (look at ListSets) and, if desired, to keep it up to date with a daily incremental harvest of new and updated records (--from/--until arguments). arXiv is updated daily, so harvesting more often than that gains nothing.
To use the full arXiv metadata and not just oai_dc, you have to write a custom (XML) parser for one of the arXiv metadata formats. However, since the metadata isn't deeply nested, the main fields map easily into a dict. For example, to get all quant-ph metadata with a few lines of Python:
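As a rough sketch of the incremental harvest, the request arguments for one harvesting window could be built as below. The helper name harvest_window and its defaults are made up for illustration; "from" and "until" are the OAI-PMH selective-harvesting arguments, and because "from" is a reserved word in Python they are passed as an unpacked dict.

```python
from datetime import date


def harvest_window(last_harvest, today=None,
                   metadata_prefix="arXivRaw", set_spec="physics:quant-ph"):
    """Build ListRecords arguments covering last_harvest..today.

    harvest_window is a hypothetical helper, not part of Sickle.
    """
    today = today or date.today()
    if last_harvest > today:
        raise ValueError("last_harvest is in the future")
    return {
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
        # Day granularity (YYYY-MM-DD); both bounds are inclusive in OAI-PMH.
        "from": last_harvest.isoformat(),
        "until": today.isoformat(),
    }


args = harvest_window(date(2024, 1, 1), today=date(2024, 1, 2))
print(args)
```

With a Sickle client s, this would be used as s.ListRecords(ignore_deleted=True, **args). Storing the date of the last successful run and feeding it back in as last_harvest is one simple way to keep the window gap-free.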
In [1]: import json
In [2]: from sickle import Sickle
In [3]: s = Sickle('https://export.arxiv.org/oai2', max_retries=2)
In [4]: records = s.ListRecords(metadataPrefix='arXivRaw', set='physics:quant-ph', ignore_deleted=True)
Now you can iterate over records and extract the metadata of interest and do something with it.
In [5]: rec = records.next()
In [6]: rec.metadata
Out[6]:
{'abstract': [' In a quantum mechanical model, Diosi, Feldmann and Kosloff arrived at a\nconjecture stating that the limit of the entropy of certain mixtures is the\nrelative entropy as system size goes to infinity. The conjecture is proven in\nthis paper for density matrices. The first proof is analytic and uses the\nquantum law of large numbers. The second one clarifies the relation to channel\ncapacity per unit cost for classical-quantum channels. Both proofs lead to\ngeneralization of the conjecture.\n'],
'authors': ['I. Csiszar, F. Hiai and D. Petz'],
'categories': ['quant-ph cs.IT math.IT'],
'comments': ['LATEX file, 11 pages'],
'date': ['Sun, 1 Apr 2007 16:37:36 GMT'],
'doi': ['10.1063/1.2779138'],
'id': ['0704.0046'],
'journal-ref': ['J. Math. Phys. 48(2007), 092102.'],
'size': ['9kb'],
'submitter': ['Denes Petz'],
'title': ['A limit relation for entropy and channel capacity per unit cost'],
'version': [None]}
Note that "version" shows "[None]" because the version is specified as an attribute in the arXivRaw XML, and the simple mapping doesn't handle attributes. This is where custom handling of the XML comes into play, e.g.
d = rec.metadata
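One way to fill in the missing version information is to read the attribute straight from the record's XML. The following is only a minimal sketch using the standard library: extract_versions is a hypothetical helper, the namespace URI and the date/size child elements are assumed from the arXivRaw format, and the sample fragment is hand-written from the values shown above (the "v1" label is an assumption).

```python
import xml.etree.ElementTree as ET

# Namespace of the arXivRaw format (assumed; check it against a real response).
ARXIV_RAW_NS = "{http://arxiv.org/OAI/arXivRaw/}"


def extract_versions(record_xml):
    """Return one dict per <version> element, including its 'version' attribute."""
    root = ET.fromstring(record_xml)
    versions = []
    for v in root.iter(ARXIV_RAW_NS + "version"):
        versions.append({
            "version": v.get("version"),                 # the attribute the dict mapping drops
            "date": v.findtext(ARXIV_RAW_NS + "date"),
            "size": v.findtext(ARXIV_RAW_NS + "size"),
        })
    return versions


# Hand-written fragment for illustration only, not a real server response.
SAMPLE = """<arXivRaw xmlns="http://arxiv.org/OAI/arXivRaw/">
  <id>0704.0046</id>
  <version version="v1">
    <date>Sun, 1 Apr 2007 16:37:36 GMT</date>
    <size>9kb</size>
  </version>
</arXivRaw>"""

versions = extract_versions(SAMPLE)
print(versions)
```

With Sickle, something like d['version'] = extract_versions(rec.raw) should then work, assuming rec.raw holds the record's XML as a string.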
Sickle takes care of resumptionToken handling and flow control, so behind the scenes it retrieves the records in batches of ~1000. It is advisable to sleep for a few seconds every so often in the loop that iterates over the records.
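The sleep delay could be added with a small wrapper around the iterator, roughly as follows. The name throttled and its default values are my own choices for illustration, not anything Sickle provides; the demo uses a plain range and a fake sleep so it runs without touching the network.

```python
import time


def throttled(iterable, every=1000, delay=5.0, sleep=time.sleep):
    """Yield items from iterable, pausing `delay` seconds after every `every` items."""
    for count, item in enumerate(iterable, start=1):
        yield item
        if count % every == 0:
            # Runs when the consumer asks for the next item after a full batch.
            sleep(delay)


# Demo with a stand-in iterable and a fake sleep that just records the pauses.
pauses = []
seen = list(throttled(range(2500), every=1000, delay=5.0, sleep=pauses.append))
print(len(seen), pauses)
```

In the real harvest the loop would read "for rec in throttled(records):", with every set to roughly the batch size so that the pause falls between requests to the server.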
Cheers
T.