arXiv query missing results; it seems that you cannot set `start` above 4000-45000


Santosh Gupta

Jun 10, 2019, 7:11:07 PM
to arXiv API
Hello,

For the following parameters, the API does not seem to return results, even though I know they are there. I set `sort_by` to `submittedDate` and `sort_order` to `ascending`.

Here are the full parameters

```
search_query= 'cat:cs.CV OR cat:cs.AI OR cat:cs.LG OR cat:cs.CL OR cat:cs.NE OR cat:stat.ML',
start=40000,
max_results=10,
sort_by="submittedDate",
sort_order="ascending"
```
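For readers who want to reproduce the request without a client library, here is a minimal sketch of how these parameters might map onto a raw API URL. It only builds the string and sends nothing. The endpoint is the one quoted later in this thread; the `sortBy` / `sortOrder` spellings and the helper name `build_query_url` are assumptions for illustration, and this is a reconstruction from the listed parameters, not the URL from the original post.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the reply further down this thread.
API_URL = "http://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=10,
                    sort_by="submittedDate", sort_order="ascending"):
    """Return a raw query URL for the given parameters (no request is sent)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        # Assumption: the raw API spells the sort parameters in camelCase.
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    # urlencode percent-encodes ':' and turns spaces into '+'.
    return f"{API_URL}?{urlencode(params)}"


query = ("cat:cs.CV OR cat:cs.AI OR cat:cs.LG OR "
         "cat:cs.CL OR cat:cs.NE OR cat:stat.ML")
url = build_query_url(query, start=40000)
print(url)
```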

Here is the url:

When I set `start` = 0, the results begin with the first submissions, from 1993. However, the highest value of `start` that still returns results is about 3900, which returns results from 2017. If I set `start` to 40000, I get no results.

But, when I switch the `sort_order` to `descending`, then it starts at 2019 as it should. 

So search_query= 'cat:cs.CV OR cat:cs.AI OR cat:cs.LG OR cat:cs.CL OR cat:cs.NE OR cat:stat.ML'  is missing about a year or so worth of results. 

Something similar happens if I use the same query but only switch `sort_order` to `descending`.

Setting `start`=40000 returns results from early 2018 / late 2017, but `start`=42000 again returns no results.

Here is the url 


So it seems that there is a maximum `start`, but it may also vary by the query being used. If I set:

search_query= 'cat:cs.CV OR cat:cs.AI OR cat:cs.LG'

then I can get results when I set `start`=45000

here is the url 


So if there is a maximum setting for `start`, it seems to vary with the type of query.

Thorsten S

Jun 10, 2019, 7:12:11 PM
to arXiv api



Because of speed limitations in our implementation of the API, the maximum number of results returned from a single call (max_results) is limited to 30000 in slices of at most 2000 at a time, using the max_results and start query parameters. For example, to retrieve matches 6001-8000: http://export.arxiv.org/api/query?search_query=all:electron&start=6000&max_results=2000

Large result sets put considerable load on the server and also take a long time to render. We recommend refining queries that return more than 1,000 results, or at least requesting smaller slices. For bulk metadata harvesting or set information, etc., the OAI-PMH interface is more suitable. A request with max_results >30,000 will result in an HTTP 400 error code with appropriate explanation. A request for 30000 results will typically take a little over 2 minutes to return a response of over 15MB. Requests for fewer results are much faster and correspondingly smaller.
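As a rough illustration of the slicing described above, the sketch below computes the (`start`, `max_results`) pairs needed to walk a result set in slices of at most 2000. The helper name `page_windows` is hypothetical, and treating 30000 as an overall cap on how deep to page is one reading of the text above, not confirmed behaviour.

```python
def page_windows(total, slice_size=2000, cap=30000):
    """Yield (start, max_results) pairs covering the first min(total, cap) matches."""
    if slice_size < 1:
        raise ValueError("slice_size must be at least 1")
    limit = min(total, cap)  # assumption: never page past the overall cap
    for start in range(0, limit, slice_size):
        # The last slice may be shorter than slice_size.
        yield start, min(slice_size, limit - start)


windows = list(page_windows(8000))
print(windows[-1])  # the slice that covers matches 6001-8000
```

Each pair can be passed as the `start` and `max_results` parameters of one request; a pause between requests is advisable given the server load mentioned above.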



--
You received this message because you are subscribed to the Google Groups "arXiv API" group.
To unsubscribe from this group and stop receiving emails from it, send an email to arxiv-api+...@googlegroups.com.
To post to this group, send email to arxi...@googlegroups.com.
Visit this group at https://groups.google.com/group/arxiv-api.
To view this discussion on the web visit https://groups.google.com/d/msgid/arxiv-api/2129c64d-a39e-4bc8-85fe-b4d3ea98c395%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Santosh Gupta

Jun 10, 2019, 8:16:46 PM
to arxi...@googlegroups.com
I am using `max_results` of 10, so the queries are supposed to return 10 results. The summary of my issue is that the parameter of `start` may have a limit.

Thorsten S

Jun 10, 2019, 8:21:52 PM
to arXiv api

`max_results` and `start` are connected.

To provide stable pagination of a query result, the entire result set has to be retrieved and cached; the `start` parameter together with `max_results` then provides the offset into that result set.

Use queries that return a smaller total number of results instead of attempting to page through an exceedingly large result set.

Cheers
T.


Santosh Gupta

Jun 10, 2019, 8:40:19 PM
to arxi...@googlegroups.com
I see, so even during paging, it still returns all of those results, paging just iterates over them. So the solution would be to break up the query into its individual components. 

I've been playing with splitting up the queries and this returns all of the results. Thanks~
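A minimal sketch of that splitting approach, assuming the query is a flat list of terms joined by ` OR ` with no parentheses. Both helper names are hypothetical, and the de-duplication step assumes each result is a dict with an `id` field; a paper listed under more than one of these categories could otherwise come back from more than one sub-query.

```python
def split_or_query(search_query):
    """Split a flat 'a OR b OR c' query into its individual terms."""
    return [term.strip() for term in search_query.split(" OR ") if term.strip()]


def merge_unique(result_lists, key="id"):
    """Merge per-term result lists, keeping the first entry seen for each key."""
    seen = {}
    for results in result_lists:
        for entry in results:
            seen.setdefault(entry[key], entry)
    return list(seen.values())


query = ("cat:cs.CV OR cat:cs.AI OR cat:cs.LG OR "
         "cat:cs.CL OR cat:cs.NE OR cat:stat.ML")
terms = split_or_query(query)
print(terms)

# Placeholder entries standing in for parsed API results.
merged = merge_unique([[{"id": "a"}, {"id": "b"}], [{"id": "b"}, {"id": "c"}]])
print([entry["id"] for entry in merged])
```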

Ion Freeman

Apr 21, 2020, 10:25:00 PM
to arXiv API
Thanks, Thorsten!

I don't see a way to query arXiv through the OAI-PMH interface. Am I just missing it, or is the recommended solution to download the entire set and build an index ourselves offline?

Thorsten S

Apr 21, 2020, 10:30:38 PM
to arXiv api


OAI-PMH is intended for downloading (subsets of) metadata and keeping them in sync via incremental updates. There is no inherent search in OAI-PMH.
You can use ListSets, ListRecords, GetRecord and from/until date windows to request the metadata of a (list of) specific records in a particular metadata format.
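As a hedged illustration, a `ListRecords` request for one date window could be assembled as below. The base URL, the set name `cs`, and the helper name `list_records_url` are assumptions for the sketch and should be checked against the current arXiv OAI-PMH documentation; `verb`, `metadataPrefix`, `from`, `until` and `set` are the standard OAI-PMH request arguments. Nothing is sent; the code only builds the URL.

```python
from urllib.parse import urlencode

# Assumption: OAI-PMH base URL; verify against the arXiv documentation.
OAI_BASE_URL = "http://export.arxiv.org/oai2"


def list_records_url(from_date, until_date, set_spec=None,
                     metadata_prefix="oai_dc"):
    """Build a ListRecords URL for records changed within a date window."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date,    # dates as YYYY-MM-DD
        "until": until_date,
    }
    if set_spec is not None:
        params["set"] = set_spec  # restrict the harvest to one set
    return f"{OAI_BASE_URL}?{urlencode(params)}"


oai_url = list_records_url("2019-01-01", "2019-01-31", set_spec="cs")
print(oai_url)
```

Harvesting one window at a time and then moving `from` forward is how the incremental sync described above would typically be driven.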

Cheers
T.
