multiple queries to API & rate limiting?

Christopher White

Feb 15, 2021, 12:57:09 PM

to arXiv API

Hello,

I've written a little Python script that downloads preprints posted in the previous month by a long list of authors whose work I'm interested in. It works by

1. querying for preprints by each author, e.g.

https://export.arxiv.org/api/query?search_query=au:%22Bagadonuts_J%22&sortBy=lastUpdatedDate

2. selecting those posted in the previous month

3. deduplicating the resulting list

4. downloading the pdfs.
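A minimal sketch of steps 1 to 3, not the actual script. It assumes the API returns an Atom feed in which each entry has an `id`, a `published` timestamp, and a link of type `application/pdf`, and it takes "posted" to mean the `published` date. The function names are hypothetical. Step 4 (fetching each PDF URL) is left out.

```python
import urllib.parse
import xml.etree.ElementTree as ET
from datetime import date, timedelta

ATOM = "{http://www.w3.org/2005/Atom}"
API_URL = "https://export.arxiv.org/api/query"


def author_query_url(author):
    """Step 1: build the per-author query URL."""
    params = {"search_query": f'au:"{author}"', "sortBy": "lastUpdatedDate"}
    # safe=":" keeps the field prefix readable; quotes become %22.
    return API_URL + "?" + urllib.parse.urlencode(
        params, safe=":", quote_via=urllib.parse.quote
    )


def previous_month_bounds(today):
    """Return (first day of previous month, first day of current month)."""
    first_of_this_month = today.replace(day=1)
    last_of_previous = first_of_this_month - timedelta(days=1)
    return last_of_previous.replace(day=1), first_of_this_month


def select_recent(feeds, start, end):
    """Steps 2-3: keep entries published in [start, end), deduplicated by id.

    `feeds` is an iterable of Atom XML strings, one per author query.
    Returns a dict mapping entry id to PDF URL (None if no PDF link is found).
    """
    selected = {}
    for feed_xml in feeds:
        root = ET.fromstring(feed_xml)
        for entry in root.iter(ATOM + "entry"):
            # Timestamps look like 2021-01-04T18:00:00Z; the date part is enough.
            published = date.fromisoformat(entry.findtext(ATOM + "published")[:10])
            if not (start <= published < end):
                continue
            pdf_url = next(
                (link.get("href") for link in entry.iter(ATOM + "link")
                 if link.get("type") == "application/pdf"),
                None,
            )
            # Keying on the entry id collapses papers shared by several authors.
            selected[entry.findtext(ATOM + "id")] = pdf_url
    return selected
```

Keying the result on the entry id means a paper co-authored by two people on the list is downloaded once.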

I wait quite a long time between queries (>3 s, per the terms of use, and up to 30 s; this is a knob I turn).

This successfully runs through some number of author queries, then fails with a socket connection timeout error.

One possible cause is my (admittedly unstable) home internet connection, but I didn't think it was *that* unstable. In that case, the solution is to retry. But if I've fallen afoul of some rate-limiting procedure on the arXiv side, retrying is the opposite of helpful.
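One way to handle both cases is to retry only on connection-level failures and stop as soon as the server answers with an HTTP error status. The sketch below assumes that server-side throttling would show up as an HTTP error response rather than a silent timeout. The thread does not confirm this, so treat it as a guess. `fetch_with_retry` and its parameters are hypothetical names.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, fetch=None, max_tries=4, base_delay=3.0, sleep=time.sleep):
    """Fetch `url`, retrying timeouts and dropped connections with backoff.

    `fetch` and `sleep` are injectable so the logic can be tested offline.
    """
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u, timeout=30).read()
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except urllib.error.HTTPError:
            # The server responded with an error status. Assumption: this is
            # where throttling would appear, so do not retry automatically.
            raise
        except (socket.timeout, urllib.error.URLError, ConnectionError):
            # No usable response at all: likely a flaky connection.
            if attempt == max_tries - 1:
                raise
            # Exponential backoff: 3 s, 6 s, 12 s, ... with the defaults.
            sleep(base_delay * 2 ** attempt)
```

`HTTPError` is caught first because it is a subclass of `URLError`; with the order reversed, error responses would be retried too.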

So---am I running into throttling on the arXiv server side? How can I do this and be a good citizen?

Best,

Christopher

Lukas Schwab

Feb 15, 2021, 1:26:41 PM

to arxi...@googlegroups.com

One idea, which may or may not suit your use case: if you want to reduce the number of requests you make, you can

1. include several authors in your query with the OR boolean operator, and then
2. locally segment the resulting list of articles.

This seems like a useful pattern if your authors aren't too prolific, e.g. if there are no new papers from a significant portion of them.

It's also only useful if your request count/frequency is really the issue here.
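A rough sketch of the OR pattern, under some assumptions: the `max_results` value of 100 is arbitrary, and the local segmentation matches on surname only, taking the text before the underscore in a query name such as `Bagadonuts_J` and comparing it with the last word of each author name in the results. That matching rule is a simplification for illustration and will misfile authors who share a surname. Both function names are hypothetical.

```python
import urllib.parse

API_URL = "https://export.arxiv.org/api/query"


def or_query_url(authors, max_results=100):
    """Build one query URL covering several authors joined with OR."""
    query = " OR ".join(f'au:"{a}"' for a in authors)
    params = {
        "search_query": query,
        "sortBy": "lastUpdatedDate",
        "max_results": max_results,
    }
    return API_URL + "?" + urllib.parse.urlencode(
        params, safe=":", quote_via=urllib.parse.quote
    )


def segment_by_author(entries, authors):
    """Split combined results back out per queried author.

    `entries` is a list of (article_id, [author names]) pairs already parsed
    from the feed. Matching is by surname only (an assumption, see above).
    """
    by_author = {a: [] for a in authors}
    for article_id, names in entries:
        surnames = {n.split()[-1].lower() for n in names if n.strip()}
        for a in authors:
            if a.split("_")[0].lower() in surnames:
                by_author[a].append(article_id)
    return by_author
```

Authors who end up with an empty list are either quiet or mistyped, so the segmented result doubles as a quick check on the combined query.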

Good luck!

Lukas

Christopher White

Feb 23, 2021, 11:50:54 AM

to arXiv API

Boolean OR is probably the right thing in the long term, but I'm a little hesitant because (1) I've got a pretty good number of queries, and (2) that makes it hard to sanity check the queries and outputs---it's easy to mis-type, forget how to deal with certain kinds of unusual names, etc., and write a query that just doesn't return anything.
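For the sanity-check concern, one cheap guard is to look at the total hit count the feed reports for each per-author query, with no date filter applied. A name that matches nothing at all is more likely a typo than a quiet month. This sketch assumes the feed carries an OpenSearch `totalResults` element as a direct child of the root; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

OPENSEARCH = "{http://a9.com/-/spec/opensearch/1.1/}"


def total_results(feed_xml):
    """Return the feed's reported hit count, or None if the element is absent."""
    text = ET.fromstring(feed_xml).findtext(OPENSEARCH + "totalResults")
    return int(text) if text is not None else None


def empty_queries(feeds_by_author):
    """List the authors whose query matched nothing at all."""
    return sorted(
        author for author, feed_xml in feeds_by_author.items()
        if total_results(feed_xml) == 0
    )
```

Printing the output of `empty_queries` at the end of a run gives a short list of names to recheck by hand.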

In any case, I added a retry and the thing works fine now. Presumably it was just very unstable home internet.