Total results don't match the real number of entries


Paulo S.

Aug 25, 2011, 4:51:31 PM
to arXiv api
Using the API, I noticed that the total number of results for the astro-ph
category is 105380, as shown below:

<opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">105380</opensearch:totalResults>
<opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
<opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">10</opensearch:itemsPerPage>

However, when I try to get the last articles, I get no results:

Query: http://export.arxiv.org/api/query?search_query=cat:astro-ph&start=105370&max_results=10
Data:

<feed xmlns="http://www.w3.org/2005/Atom">
<link href="http://arxiv.org/api/query?search_query%3Dcat%3Aastro-ph%26id_list%3D%26start%3D105370%26max_results%3D10" rel="self" type="application/atom+xml"/>
<title type="html">ArXiv Query: search_query=cat:astro-ph&amp;id_list=&amp;start=105370&amp;max_results=10</title>
<id>http://arxiv.org/api/UpGo12YR3p9ADRUPMBxluA9RiLA</id>
<updated>2011-08-25T00:00:00-04:00</updated>
<opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">105380</opensearch:totalResults>
<opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">105370</opensearch:startIndex>
<opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">10</opensearch:itemsPerPage>
</feed>

By varying start and max_results, I found that I can only get results
up to index 49945; that is, any value of max_results greater than one
for the query below returns no results.

Query: http://export.arxiv.org/api/query?search_query=cat:astro-ph&start=49945&max_results=1



Thorsten S

Aug 26, 2011, 6:09:07 AM
to arxi...@googlegroups.com
The maximum number of returned search results is limited to 50000 for
practical reasons.

We recommend using time slices for searches that are too broad.

Cheers
T.


Paulo S.

Aug 26, 2011, 4:55:08 PM
to arXiv api
I didn't understand your answer. Is it possible to retrieve more than
50000 articles by using time slices? I'm retrieving 10 articles per
iteration (waiting 3 seconds after each one), and even this way I can't
retrieve more than 50000. Is that expected? I've read some papers that
used the API to build collaboration networks by category (e.g.
astro-ph) from all the articles in that category, which is exactly
what I'm trying to do for some experiments.

Cheers
Paulo S.


Paulo S.

Aug 29, 2011, 3:13:50 PM
to arXiv api
Sorry for being so insistent, but I really need to do this as soon as
possible. Can anyone explain how I can retrieve all the articles from
a specific category?

Thanks,
Paulo S.

Thorsten S

Aug 30, 2011, 12:13:08 PM
to arxi...@googlegroups.com
The initial search is cached, and you are retrieving subsets of that
search. No new search is performed when you subsequently request
subsets of it. The total number of results for the lookup is therefore
limited to 50000, so your approach will not work and is not the
intended use of the API.

If you modify your original search to use time slices, then those
individual lookups are separate searches and therefore you can cover
the entire search space by using contiguous time slices of adequate
size.

However, a better approach to retrieving an entire archive's content
is to query the corresponding set via the OAI-PMH interface:
http://arxiv.org/help/oa/index
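
For illustration only, here is a rough sketch of what harvesting one set over OAI-PMH could look like at the URL level. The verb ListRecords, the oai_dc metadata prefix, and the resumptionToken argument come from the OAI-PMH protocol itself. The endpoint in OAI_BASE_URL and the set name "physics:astro-ph" are assumptions that should be checked against the help page above, and the function names are made up for this example.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint; take the real one from the arXiv help page.
OAI_BASE_URL = "http://export.arxiv.org/oai2"


def list_records_url(set_spec, metadata_prefix="oai_dc", base_url=OAI_BASE_URL):
    """Build the first ListRecords request for one OAI-PMH set."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    return base_url + "?" + urlencode(params)


def resume_url(token, base_url=OAI_BASE_URL):
    """Build a follow-up request.

    In OAI-PMH, resumptionToken must be the only argument besides the verb.
    """
    params = {"verb": "ListRecords", "resumptionToken": token}
    return base_url + "?" + urlencode(params)


# Assumption: "physics:astro-ph" is the set name for the astro-ph archive.
print(list_records_url("physics:astro-ph"))
```

A harvester would fetch the first URL, read the resumptionToken element from the response (if any), and keep requesting resume_url(token) until no token is returned.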

Best
T.

Paulo S.

Aug 30, 2011, 1:04:57 PM
to arXiv api
I'm afraid I'm lost about this time slice approach. Could you give me
a short example showing how I'm supposed to use it (how I should build
the URL and so on)?

Thanks again,
Paulo S.


Toby Proctor

Aug 30, 2011, 2:56:19 PM
to arxi...@googlegroups.com
Paulo,

Something like the code below should get you the last week's worth of results, if I understand my own code correctly :)

It's been a couple of months since I wrote it, so it might not be perfect, but it should point you in the right direction. If you're grabbing 50k at a time, just increase max_results and the timedelta so you're consistently picking up just under 50k records in one go.

Toby


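import datetime
import urllib.request

base_url = 'http://export.arxiv.org/api/query?'  # endpoint used earlier in this thread
start = 0
max_results = 5000
# Assumed window: the last 7 days, oldest bound first.
startdate = datetime.datetime.now() - datetime.timedelta(days=7)
enddate = datetime.datetime.now()
# Dates are formatted as YYYYMMDD followed by an HHMM suffix.
dateparse1 = startdate.strftime('%Y%m%d') + '0001'
dateparse2 = enddate.strftime('%Y%m%d') + '0000'
daterange = dateparse1 + '+TO+' + dateparse2

query = 'search_query=submittedDate:[%s]&start=%i&max_results=%i' % (daterange, start, max_results)

response = urllib.request.urlopen(base_url + query).read()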
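
To check whether a given window stays under the 50000-result limit mentioned earlier in the thread, one option is to read the opensearch:totalResults element from the response before paging through it. This is only a sketch, assuming the feed looks like the ones quoted above; the names total_results, fits_under_cap and RESULT_CAP are made up for this example, and treating the limit as exclusive is a cautious guess.

```python
import xml.etree.ElementTree as ET

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"
RESULT_CAP = 50000  # limit stated earlier in this thread


def total_results(feed_xml):
    """Return the integer in <opensearch:totalResults>, or None if absent."""
    root = ET.fromstring(feed_xml)
    node = root.find("{%s}totalResults" % OPENSEARCH_NS)
    return int(node.text) if node is not None else None


def fits_under_cap(feed_xml, cap=RESULT_CAP):
    """True when the search reports fewer results than the cap."""
    total = total_results(feed_xml)
    return total is not None and total < cap


# Trimmed-down copy of the feed from the first post in this thread.
sample = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">'
    '105380</opensearch:totalResults>'
    '</feed>'
)
print(total_results(sample))   # 105380
print(fits_under_cap(sample))  # False
```

If fits_under_cap returns False for a window, the window is too wide and should be shortened before paging.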

Thorsten S

Aug 30, 2011, 3:47:53 PM
to arxi...@googlegroups.com
You can specify date ranges. For example, to narrow down your search
to papers submitted in 2009, your query string should contain the date
specification

?search_query=submittedDate:[200901010000+TO+200912312359]&.....


http://export.arxiv.org/api/query?search_query=submittedDate:[200901010000+TO+200912312359]+AND+cat:astro-ph.*

which gives 12485 astro-ph papers in 2009:
<opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">12485</opensearch:totalResults>


Note that you need to specify cat:astro-ph.* to search across all
astro-ph subcategories
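
Extending the 2009 example to several contiguous slices could look roughly like this. It is a sketch, not an official client: year_slices and slice_url are names invented here, one-year windows are an arbitrary choice, and the start and max_results defaults are placeholders.

```python
BASE_URL = "http://export.arxiv.org/api/query"  # endpoint used in this thread


def year_slices(first_year, last_year):
    """Yield (start, end) stamps in YYYYMMDDHHMM form, one pair per calendar year."""
    for year in range(first_year, last_year + 1):
        yield "%d01010000" % year, "%d12312359" % year


def slice_url(category, start_stamp, end_stamp, start=0, max_results=100):
    """Build one query URL restricted to a single time slice."""
    search = "submittedDate:[%s+TO+%s]+AND+cat:%s" % (start_stamp, end_stamp, category)
    return "%s?search_query=%s&start=%d&max_results=%d" % (
        BASE_URL, search, start, max_results)


# One URL per year; each is a separate search with its own result count.
urls = [slice_url("astro-ph.*", s, e) for s, e in year_slices(2008, 2010)]
for url in urls:
    print(url)
```

Each slice is then paged through with start and max_results as usual. A year that reports 50000 or more results would need to be split into shorter windows, such as months.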

Cheers
T.
