Flaky API behavior

Lukas Schwab

Oct 15, 2023, 6:59:16 PM
to arxi...@googlegroups.com
This weekend I've been trying to debug my API client library. I think there's underlying instability in the API, but it's hard to categorically rule out rate limiter and cache bugs.

Identical, near-simultaneous requests can receive different responses depending on the protocol (HTTP vs. HTTPS). I'm not sure how to reproduce this issue, but 
I understand the wrapper library should standardize on HTTPS (per #128), but because even HTTPS requests seem flaky, I think the real issue is orthogonal to the HTTP/S protocol choice.

Moreover, integration tests that used to pass consistently are now behaving inconsistently, both locally and in GitHub Actions.

Have there been any changes — to the API proper or to the rate limit logic — in the last ~week?
Is there any more widespread evidence of degraded behavior? Nothing is listed on the statuspage.

Thanks for any help,
Lukas

Lukas Schwab

Oct 15, 2023, 7:17:23 PM
to arxi...@googlegroups.com
I'll add: these failures seem to take the same form as a long-observed instability in the RSS API, where sometimes it returns anomalously empty pages of results.

For example, when I just requested https://export.arxiv.org/api/query?search_query=testing I received:
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3Dtesting%26id_list%3D%26start%3D0%26max_results%3D10" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=testing&amp;id_list=&amp;start=0&amp;max_results=10</title>
  <id>http://arxiv.org/api/eUJhearGsjGVUW8fVMsA8Os6PgE</id>
  <updated>2023-10-15T00:00:00-04:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">10</opensearch:itemsPerPage>
</feed>
Here's a GitHub issue documenting that bug (2020).

My wrapper library's work-around is retries (with rate-limit-compliant sleeping between requests). Three retries have typically been enough for the integration tests to pass consistently.
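
A minimal sketch of that kind of work-around, not the wrapper library's actual code. The names `fetch_with_retries`, `fetch`, and `is_suspect` are hypothetical, and the 3-second default delay is an assumption to be replaced with whatever the API's published rate limit requires:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    is_suspect: Callable[[T], bool],
    retries: int = 3,
    delay_seconds: float = 3.0,  # assumed delay; set per the API's rate limit
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() once, then retry up to `retries` times while the
    result looks suspect, sleeping between consecutive requests."""
    result = fetch()
    for _ in range(retries):
        if not is_suspect(result):
            return result
        sleep(delay_seconds)  # stay under the rate limit between attempts
        result = fetch()
    # Retries exhausted: return the last result, suspect or not,
    # and leave it to the caller to decide how to treat it.
    return result
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Note that when anomalous responses cluster, a fixed retry count like this can still come back with a suspect result.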

Now I'm seeing these failures despite retries.

Has something increased the rate of these anomalous responses? Made them more likely to cluster, such that retries fail?

Jake Weiskoff

Oct 17, 2023, 11:03:24 AM
to arxi...@googlegroups.com
Over the past several weeks arXiv has been inundated with excessive out of bounds requests via the API that had caused some system instability (at least one order of magnitude above what we'd been seeing as normal growth). During the early stages of diagnosing the issue, we placed a hard-cap upon the number of records possible to download via the API at 1,000. This had the unfortunate side effects you (as well as others) have reported regarding not being able to access records above that number. 
These changes were lifted as of yesterday evening, after completing research regarding the cause of the requests causing instability. 

Regards,
-Jake 

Lukas Schwab

Oct 17, 2023, 12:38:20 PM
to arxi...@googlegroups.com
> This had the unfortunate side effects you (as well as others) have reported regarding not being able to access records above that number.
To be clear, I think this wasn't my issue: None of my integration tests
  • paginate through more than ~100 results from a single query
  • request results at a high starting offset
Moreover, I still observed API instability this morning, and 

(Aside: originally my integration tests would've collectively exceeded the 3qps rate limit for a low absolute request volume, <100 requests; I've since rewritten them to respect that rate limit.)

> arXiv has been inundated with excessive out of bounds requests via the API that had caused some system instability

This seems like a more likely explanation. Are you still observing instability?
After investigating yesterday, I'd cluster the "unexpected behavior" into two categories:
  1. Anomalously empty results.
    Consider the example response I shared earlier: that's a request for the first page of results (nowhere near the 1000-result limit). This query usually yields results. Instead, it yields valid XML suggesting there are no results: zero entries, and opensearch:totalResults is 0.
    This is an existing issue at a dramatically increased rate. Testing I did in 2021 suggested increasing page sizes helped, which (unscientifically) suggests there's a stochastic per-request failure rate rather than per-result failure rate.
  2. Sporadic connection reset errors (54, 104).
Both groups of errors seem recoverable with retries, but group 1 is especially insidious: if the first page of a result set is anomalously empty, it's indistinguishable from a truly-empty result set. If you don't know to expect a non-empty response, there's nothing indicating an error.
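
As a rough illustration of how a client might flag both categories, here is a sketch using only the Python standard library. The function names are hypothetical, and `expect_results` is an assumption: the caller has to know out of band that the query normally yields results, which is exactly the information a generic client lacks.

```python
import errno
import xml.etree.ElementTree as ET
from typing import Tuple, Union

ATOM = "{http://www.w3.org/2005/Atom}"
OPENSEARCH = "{http://a9.com/-/spec/opensearch/1.1/}"


def summarize_feed(xml_body: Union[str, bytes]) -> Tuple[int, int]:
    """Return (opensearch:totalResults, number of <entry> elements)."""
    root = ET.fromstring(xml_body)
    total_text = root.findtext(f"{OPENSEARCH}totalResults", default="0")
    entries = root.findall(f"{ATOM}entry")
    return int(total_text), len(entries)


def is_suspiciously_empty(xml_body: Union[str, bytes], expect_results: bool) -> bool:
    """Category 1: a valid feed with zero results where results were expected."""
    total, entry_count = summarize_feed(xml_body)
    return bool(expect_results and total == 0 and entry_count == 0)


def is_connection_reset(exc: BaseException) -> bool:
    """Category 2: ECONNRESET, which is errno 54 on macOS and 104 on Linux."""
    return isinstance(exc, OSError) and exc.errno == errno.ECONNRESET
```

With `expect_results=False` the first check can never fire, which mirrors the problem described above: without prior knowledge, an anomalously empty page and a truly empty result set look the same.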

Since my CI is running on GitHub Actions, I initially thought these might both be cryptic anti-abuse measures, and that some of GitHub's Actions IPs had been added to something like a blocklist (e.g. because another user is issuing abusive calls from those machines). Unfortunately, I can reproduce the failures locally — both through my unit tests and through manual requests in-browser. They seem relatively request-rate-independent.

Hopefully this is helpful diagnostic info! Besides addressing the instability, I think there's room to improve some policy documentation:
  • Is there a predictable API response indicating rate limit enforcement, e.g. a 429 HTTP status? Investigations are easier — and bug reports clearer — if one can rule out rate limit behavior.
  • Is there an API changelog where policy changes (e.g. the 1000-record limit) are documented? I've tried scanning the arxiv/arxiv-docs commit history, but it's hard to spot changes to the API docs specifically.
  • How was the 1000-record limit implemented? Is it a limit per caller, across all queries? A limit per query? How does it interact with the `start` query parameter? This might not be relevant anymore, but it would've been useful documentation.
I hope this doesn't come off as demanding; I know this legacy API isn't a top priority, and you're probably battling more immediate stability concerns.

Let me know if I can be of any help!

Cheers,
Lukas

