To be clear, I think this wasn't my issue: None of my integration tests
- paginate through more than ~100 results from a single query
- request results at a high starting offset
Moreover, I still observed API instability this morning, and
(Aside: originally, my integration tests would've collectively exceeded the 3qps rate limit despite a low absolute request volume of <100 requests; I've since rewritten them to respect that rate limit.)
This seems like a more likely explanation. Are you still observing instability?
After investigating yesterday, I'd cluster the "unexpected behavior" into two categories:
- Anomalously empty results.
Consider the example response I shared earlier: that's a request for the first page of results (nowhere near the 1000-result limit). This query usually yields results. Instead, it yields valid XML suggesting there are no results: zero entries, and opensearch:totalResults is 0.
This is an existing issue at a dramatically increased rate. Testing I did in 2021 suggested increasing page sizes helped, which (unscientifically) suggests there's a stochastic per-request failure rate rather than per-result failure rate.
- Sporadic connection reset errors (54, 104).
Both groups of errors seem recoverable with retries, but group 1 is especially insidious: if the first page of a result set is anomalously empty, it's indistinguishable from a truly-empty result set. If you don't know to expect a non-empty response, there's nothing indicating an error.
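For concreteness, here's a rough sketch of the kind of retry wrapper I mean. It treats an empty feed on any attempt as suspicious and retries it, alongside connection resets. The names (`is_empty_feed`, `fetch_with_retries`), the retry count, and the delay are all illustrative assumptions, not part of any client library; `fetch` stands in for whatever callable issues the request and returns the response body. Depending on the HTTP client, a reset may surface as a different exception type than `ConnectionResetError`.

```python
import time
import xml.etree.ElementTree as ET

# Namespaces used by Atom feeds with OpenSearch metadata.
ATOM = "{http://www.w3.org/2005/Atom}"
OPENSEARCH = "{http://a9.com/-/spec/opensearch/1.1/}"


def is_empty_feed(xml_text):
    """True if the feed has zero entries and totalResults is 0 (or missing)."""
    root = ET.fromstring(xml_text)
    total = root.findtext(f"{OPENSEARCH}totalResults")
    entries = root.findall(f"{ATOM}entry")
    return not entries and (total is None or int(total) == 0)


def fetch_with_retries(fetch, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-empty feed or retries run out.

    fetch: hypothetical zero-argument callable returning the response body.
    Returns the last response seen; if every attempt was empty, the result
    set may be genuinely empty, so the empty feed is returned as-is.
    """
    last = None
    for attempt in range(retries + 1):
        try:
            last = fetch()
            if not is_empty_feed(last):
                return last
        except ConnectionResetError:
            # Out of attempts: surface the reset to the caller.
            if attempt == retries:
                raise
        if attempt < retries:
            sleep(delay)
    return last
```

This can't distinguish a truly-empty result set from a run of anomalous ones; it only makes the anomalous case less likely to slip through, at the cost of extra requests for queries that really have no results.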
Since my CI is running on GitHub Actions, I initially thought these might both be cryptic anti-abuse measures, and that some of GitHub's Actions IPs had been added to something like a blocklist (e.g. because another user is issuing abusive calls from those machines). Unfortunately, I can reproduce the failures locally — both through my unit tests and through manual requests in-browser. They seem relatively request-rate-independent.
Hopefully this is helpful diagnostic info! Besides addressing the instability, I think there's room to improve some policy documentation:
- Is there a predictable API response indicating rate limit enforcement, e.g. a 429 HTTP status? Investigations are easier — and bug reports clearer — if one can rule out rate limit behavior.
- Is there an API changelog where policy changes (e.g. the 1000-record limit) are documented? I tried scanning the arxiv/arxiv-docs commit history, but it's hard to spot changes to the API docs specifically.
- How was the 1000-record limit implemented? Is it a limit per caller, across all queries? A limit per query? How does it interact with the `start` query parameter? This might not be relevant anymore, but it would've been useful documentation.
I hope this doesn't come off as demanding; I know this legacy API isn't a top priority, and you're probably battling more immediate stability concerns.
Let me know if I can be of any help!
Cheers,
Lukas