Total results doesn't match the real amount of entries
From: "Paulo S." <prssoar....@gmail.com>
Date: Thu, 25 Aug 2011 13:51:31 -0700 (PDT)
Local: Thurs, Aug 25 2011 4:51 pm
Subject: Total results doesn't match the real amount of entries
Using the API, I noticed that the total number of results for the
astro-ph category is 105380, as shown below:

<opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">105380</opensearch:totalResults>
<opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
<opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">10</opensearch:itemsPerPage>

However, when I try to fetch the last articles, I get no results at
all:

Query: http://export.arxiv.org/api/query?search_query=cat:astro-ph&start=105...
Data:

<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3Dcat%3Aastro-ph%26id_list%3D%26start%3D105370%26max_results%3D10" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=cat:astro-ph&amp;id_list=&amp;start=105370&amp;max_results=10</title>
  <id>http://arxiv.org/api/UpGo12YR3p9ADRUPMBxluA9RiLA</id>
  <updated>2011-08-25T00:00:00-04:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">105380</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">105370</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">10</opensearch:itemsPerPage>
</feed>

After varying start and max_results a little, I found that I can only
get results up to index 49945; that is, any value of max_results
greater than one for the query below returns no results.

Query: http://export.arxiv.org/api/query?search_query=cat:astro-ph&start=499...


 
From: Thorsten S <thorsten.schwan...@gmail.com>
Date: Fri, 26 Aug 2011 04:09:07 -0600
Local: Fri, Aug 26 2011 6:09 am
Subject: Re: [arxiv-api] Total results doesn't match the real amount of entries
The maximum number of returned search results is limited to 50000 for
practical reasons.

We recommend using time slices for searches that are too broad.

Cheers
T.


 
From: "Paulo S." <prssoar....@gmail.com>
Date: Fri, 26 Aug 2011 13:55:08 -0700 (PDT)
Local: Fri, Aug 26 2011 4:55 pm
Subject: Re: Total results doesn't match the real amount of entries
I didn't understand your answer. Is it possible to retrieve more than
50000 articles by using time slices? I'm retrieving 10 articles per
iteration (waiting 3 seconds after each one), and even this way I
can't retrieve more than 50000. Is that expected? I ask because I've
read some papers that used the API to build collaboration networks by
category (e.g. astro-ph) using all the articles from that category,
which is exactly what I'm trying to do in order to run some
experiments.

Cheers
Paulo S.



 
From: "Paulo S." <prssoar....@gmail.com>
Date: Mon, 29 Aug 2011 12:13:50 -0700 (PDT)
Local: Mon, Aug 29 2011 3:13 pm
Subject: Re: Total results doesn't match the real amount of entries
Sorry for being so insistent, but I really need to do this as soon as
possible. Can anyone explain how I can retrieve all the articles from
a specific category?

Thanks,
Paulo S.



 
From: Thorsten S <thorsten.schwan...@gmail.com>
Date: Tue, 30 Aug 2011 10:13:08 -0600
Local: Tues, Aug 30 2011 12:13 pm
Subject: Re: [arxiv-api] Re: Total results doesn't match the real amount of entries
The initial search is cached and you are retrieving subsets of that
search. No new search is performed when you subsequently specify
subsets of a search. Therefore the total number of results for the
lookup is limited to 50000, so your approach will not work; it is also
not the intended use of the API.

If you modify your original search to use time slices, then those
individual lookups are separate searches and therefore you can cover
the entire search space by using contiguous time slices of adequate
size.
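
The contiguous-slice idea above could be sketched roughly as follows. This is only an illustration, not official arXiv code: the helper names month_slices and slice_query and the choice of one-month slices are assumptions, while the submittedDate:[... TO ...] syntax and the cat:astro-ph.* pattern are the ones used in this thread. A slice is only "adequate" if its own totalResults stays under the 50000 limit; otherwise it would need to be narrowed.

```python
import datetime

BASE_URL = 'http://export.arxiv.org/api/query?'


def month_slices(first_year, first_month, last_year, last_month):
    """Yield (start, end) datetimes for contiguous one-month slices."""
    year, month = first_year, first_month
    while (year, month) <= (last_year, last_month):
        start = datetime.datetime(year, month, 1, 0, 0)
        # The slice ends one minute before the first day of the next month.
        if month == 12:
            next_year, next_month = year + 1, 1
        else:
            next_year, next_month = year, month + 1
        end = (datetime.datetime(next_year, next_month, 1)
               - datetime.timedelta(minutes=1))
        yield start, end
        year, month = next_year, next_month


def slice_query(category, start, end, offset=0, max_results=100):
    """Build one query URL restricted to a single time slice."""
    fmt = '%Y%m%d%H%M'
    date_range = '%s+TO+%s' % (start.strftime(fmt), end.strftime(fmt))
    query = ('search_query=submittedDate:[%s]+AND+cat:%s'
             '&start=%d&max_results=%d'
             % (date_range, category, offset, max_results))
    return BASE_URL + query


# Twelve separate searches that together cover all of 2009.
urls = [slice_query('astro-ph.*', s, e)
        for s, e in month_slices(2009, 1, 2009, 12)]
print(urls[0])
```

Each URL is a separate search, so paging with start and max_results inside one slice only ever touches that slice's results.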

However, a better approach to retrieving an entire archive's content
is to query the corresponding set via the OAI-PMH interface:
http://arxiv.org/help/oa/index
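
For the OAI-PMH route, a harvesting request might be built along these lines. This is a sketch under assumptions: the base URL http://export.arxiv.org/oai2 and the set name physics:astro-ph are not given in this thread and should be checked against the help page above, and list_records_url is a hypothetical helper name. The arguments verb, metadataPrefix, set and resumptionToken are standard OAI-PMH request arguments, and oai_dc is the metadata format every OAI-PMH repository has to support.

```python
from urllib.parse import urlencode

# Assumption: endpoint and set name are illustrative, not taken from the thread.
OAI_BASE = 'http://export.arxiv.org/oai2'


def list_records_url(set_spec=None, metadata_prefix='oai_dc',
                     resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: a follow-up
        # request carries only the verb and the token.
        params = {'verb': 'ListRecords', 'resumptionToken': resumption_token}
    else:
        params = {'verb': 'ListRecords', 'metadataPrefix': metadata_prefix}
        if set_spec:
            params['set'] = set_spec
    return OAI_BASE + '?' + urlencode(params)


first_request = list_records_url(set_spec='physics:astro-ph')
print(first_request)
```

A harvester would fetch the first URL, read the resumptionToken element from the response (if any), and keep requesting with that token until none is returned.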

Best
T.


 
From: "Paulo S." <prssoar....@gmail.com>
Date: Tue, 30 Aug 2011 10:04:57 -0700 (PDT)
Local: Tues, Aug 30 2011 1:04 pm
Subject: Re: Total results doesn't match the real amount of entries
I'm afraid I'm lost with this time slice approach. Could you give me
a short example showing how I'm supposed to use it (how I should build
the URL and so on)?

Thanks again,
Paulo S.



 
From: Toby Proctor <toby.proc...@gmail.com>
Date: Tue, 30 Aug 2011 19:56:19 +0100
Local: Tues, Aug 30 2011 2:56 pm
Subject: Re: [arxiv-api] Re: Total results doesn't match the real amount of entries

Paulo,

Something like the code below should get you the last week's worth of
results, if I understand my own code correctly :)

It's been a couple of months since I wrote it, so it might not be perfect,
but it should point you in the right direction.  If you're grabbing 50k at a
time, just increase max_results and the timedelta so you're consistently
picking up just under 50k records in one go.

Toby

import datetime
import urllib.request

start = 0
max_results = 5000

# Assumed window: from 7 days ago up to today (one week of results).
enddate = datetime.datetime.now()
startdate = enddate - datetime.timedelta(days=7)
dateparse1 = startdate.strftime('%Y%m%d') + '0001'
dateparse2 = enddate.strftime('%Y%m%d') + '0000'
daterange = dateparse1 + '+TO+' + dateparse2

base_url = 'http://export.arxiv.org/api/query?'
query = 'search_query=submittedDate:[%s]&start=%i&max_results=%i' % (
    daterange, start, max_results)

# Python 3 form of the urllib.urlopen call; returns the Atom feed as bytes.
response = urllib.request.urlopen(base_url + query).read()



 
From: Thorsten S <thorsten.schwan...@gmail.com>
Date: Tue, 30 Aug 2011 13:47:53 -0600
Local: Tues, Aug 30 2011 3:47 pm
Subject: Re: [arxiv-api] Re: Total results doesn't match the real amount of entries
You can specify date ranges. For example, to narrow your search down
to papers submitted in 2009, your query string should contain the date
specification

?search_query=submittedDate:[200901010000+TO+200912312359]&.....

http://export.arxiv.org/api/query?search_query=submittedDate:[200901010000+TO+200912312359]+AND+cat:astro-ph.*

which gives 12485 astro-ph papers in 2009:
<opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">12485</opensearch:totalResults>

Note that you need to specify cat:astro-ph.* to search across all
astro-ph subcategories.
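
To check whether a slice like the 2009 one above stays under the limit, the totalResults value can be read from the returned feed. A minimal sketch, assuming the response has already been downloaded: sample_feed is a hand-trimmed stand-in for a real response, the function names are made up for illustration, and using a strict "less than 50000" test is an assumption about how the limit behaves.

```python
import xml.etree.ElementTree as ET

# Namespace used by the opensearch:* elements in the API's Atom feed.
OPENSEARCH = '{http://a9.com/-/spec/opensearch/1.1/}'

# Hand-trimmed sample; a real response also carries entries and other fields.
sample_feed = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">12485</opensearch:totalResults>
</feed>'''


def total_results(feed_xml):
    """Return the opensearch:totalResults value of a feed as an int."""
    root = ET.fromstring(feed_xml)
    return int(root.find(OPENSEARCH + 'totalResults').text)


def slice_fits(feed_xml, cap=50000):
    """True if the whole slice can be paged through under the result cap."""
    return total_results(feed_xml) < cap


print(total_results(sample_feed))
print(slice_fits(sample_feed))
```

If slice_fits returns False for a slice, splitting that date range into smaller contiguous ranges and querying each one separately keeps every search under the cap.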

Cheers
T.


 