Recommended use of API to get all recent articles


Charles Y.

Dec 3, 2020, 1:02:43 PM
to arXiv API
Hi team, 

I wonder what the best practice is for using the API to get all recent articles (for some categories).

I can query with a `cat` constraint and sort by date. Would you recommend that I periodically (e.g., hourly) query for a large number of results (e.g., 1000) and remove the duplicate items on the client side, or is there a better practice?

Thanks for your time.

Bryan Newbold

Dec 3, 2020, 1:57:43 PM
to arxi...@googlegroups.com
[I am not arXiv staff]

Hi Charles,

You could use the OAI-PMH feed to get a stream of new articles, then
filter the results by category. But note that, as far as I know, new
papers are released in daily batches, not a continuous stream (eg, not
hourly).

--bryan
> --
> You received this message because you are subscribed to the Google
> Groups "arXiv API" group.
> To unsubscribe from this group and stop receiving emails from it,
> send an email to arxiv-api+...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/arxiv-api/b31223db-8de8-459d-874f-7fd5490f16bfn%40googlegroups.com.

Charles Y.

Dec 4, 2020, 12:05:49 PM
to arXiv API
Thanks Bryan! I will look into OAI-PMH and use it.

Thorsten

Dec 4, 2020, 1:23:19 PM
to arXiv API

Hi Charles,


Indeed, OAI-PMH is a good way to obtain a snapshot of the metadata of a particular arXiv category (look at ListSets) and, if desired, to keep it up to date with a daily incremental harvest of new and updated records (the OAI-PMH `from`/`until` arguments). The update frequency of arXiv is daily.
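A minimal sketch of such an incremental harvest with Sickle, assuming the same endpoint and set name used later in this message. The helper names `incremental_params` and `harvest` are made up for illustration. Because `from` is a reserved word in Python, the arguments are passed as an unpacked dict.

```python
from datetime import date


def incremental_params(since, until=None, set_spec="physics:quant-ph"):
    """Build ListRecords arguments for records changed since a given day.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    params = {
        "metadataPrefix": "arXivRaw",
        "set": set_spec,
        "from": since.isoformat(),  # day granularity, e.g. '2020-12-03'
        "ignore_deleted": True,
    }
    if until is not None:
        params["until"] = until.isoformat()
    return params


def harvest(since, until=None):
    """Return an iterator over records changed in the given window."""
    # Imported here so the helper above can be used without Sickle installed.
    from sickle import Sickle

    s = Sickle("https://export.arxiv.org/oai2", max_retries=2)
    # 'from' cannot be written as a keyword argument, hence the ** unpacking.
    return s.ListRecords(**incremental_params(since, until))


print(incremental_params(date(2020, 12, 3)))
```

Running `harvest` once a day with `since` set to the date of the previous run would match the daily update cycle. If consecutive windows overlap, the same record can come back twice, so keying stored records by arXiv id keeps the local copy free of duplicates.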


To use the full arXiv metadata and not just oai_dc, you have to write a custom XML parser for one of the arXiv metadata formats. However, since the metadata isn't deeply nested, the main fields map easily into a dict. For example, to get all quant-ph metadata with a few lines of Python:

In [1]: import json

In [2]: from sickle import Sickle

In [3]: s = Sickle('https://export.arxiv.org/oai2', max_retries=2)

In [4]: records = s.ListRecords(metadataPrefix='arXivRaw', set='physics:quant-ph', ignore_deleted=True)

Now you can iterate over records and extract the metadata of interest and do something with it.

In [5]: rec = records.next()

In [6]: rec.metadata
Out[6]:  
{'abstract': ['  In a quantum mechanical model, Diosi, Feldmann and Kosloff arrived at a\nconjecture stating that the limit of the entropy of certain mixtures is the\nrelative entropy as system size goes to infinity. The conjecture is proven in\nthis paper for density matrices. The first proof is analytic and uses the\nquantum law of large numbers. The second one clarifies the relation to channel\ncapacity per unit cost for classical-quantum channels. Both proofs lead to\ngeneralization of the conjecture.\n'],
'authors': ['I. Csiszar, F. Hiai and D. Petz'],
'categories': ['quant-ph cs.IT math.IT'],
'comments': ['LATEX file, 11 pages'],
'date': ['Sun, 1 Apr 2007 16:37:36 GMT'],
'doi': ['10.1063/1.2779138'],
'id': ['0704.0046'],
'journal-ref': ['J. Math. Phys. 48(2007), 092102.'],
'size': ['9kb'],
'submitter': ['Denes Petz'],
'title': ['A limit relation for entropy and channel capacity per unit cost'],
'version': [None]}

Note that "version" shows "[None]" because the version is specified as an attribute in the arXivRaw XML and the simple mapping doesn't handle attributes. This is where custom handling of the XML comes into play, e.g.:

d = rec.metadata
# Copy the XML attributes of the first <version> element into the dict.
d.update(rec.xml.find('.//{http://arxiv.org/OAI/arXivRaw/}version').attrib)
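For records with several versions, a small custom parser can keep every version instead of only the first. The sketch below uses only the standard library. The `SAMPLE` document is hand-written for illustration and assumes the arXivRaw layout implied above (a `version` attribute on each `<version>` element, with `date` and `size` nested inside it), and `parse_arxiv_raw` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = "{http://arxiv.org/OAI/arXivRaw/}"

# Hand-written sample, assumed to mirror the arXivRaw layout.
SAMPLE = """<arXivRaw xmlns="http://arxiv.org/OAI/arXivRaw/">
  <id>0704.0046</id>
  <submitter>Denes Petz</submitter>
  <version version="v1">
    <date>Sun, 1 Apr 2007 16:37:36 GMT</date>
    <size>9kb</size>
  </version>
  <title>A limit relation for entropy and channel capacity per unit cost</title>
  <categories>quant-ph cs.IT math.IT</categories>
</arXivRaw>"""


def parse_arxiv_raw(root):
    """Map an <arXivRaw> element to a dict, keeping all versions."""
    record = {"versions": []}
    for child in root:
        tag = child.tag.replace(NS, "")
        if tag == "version":
            # Start from the attributes, then add the nested fields.
            info = dict(child.attrib)
            for sub in child:
                info[sub.tag.replace(NS, "")] = (sub.text or "").strip()
            record["versions"].append(info)
        else:
            record[tag] = (child.text or "").strip()
    return record


record = parse_arxiv_raw(ET.fromstring(SAMPLE))
print(record["id"], record["versions"])
```

With Sickle, the element to pass in would presumably be the `arXivRaw` element found inside `rec.xml`. The space-separated `categories` string can be split with `record["categories"].split()` when filtering by category.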

Sickle takes care of resumptionToken handling and flow control, so behind the scenes it retrieves batches of ~1000 records at a time. It is advisable to put a few seconds of sleep delay in the loop iterating over the records every so often.
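One way to add that delay is a small wrapper around the record iterator. This is only a sketch: the name `throttled`, the batch size of 1000, and the 3-second pause are illustrative choices, not values prescribed by arXiv or Sickle.

```python
import time


def throttled(records, every=1000, delay=3.0, sleep=time.sleep):
    """Yield items from `records`, pausing after every `every` items.

    `sleep` is injectable so the pause can be replaced in tests.
    """
    for count, record in enumerate(records, start=1):
        yield record
        if count % every == 0:
            sleep(delay)


# Usage with any iterable; with Sickle this would wrap the ListRecords iterator.
for item in throttled(range(5), every=2, delay=0.01):
    print(item)
```

Because the wrapper is a generator, the pause happens lazily as the loop advances, so the rest of the processing code does not need to know about the rate limiting.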

Cheers
T.

Charles Yu

Dec 5, 2020, 1:34:58 PM
to arxi...@googlegroups.com
This is super helpful, thanks Thorsten for the detailed information!
