rapid-fire requests

Martin

Aug 10, 2009, 4:21:35 PM

to arXiv api

Hi:

I am putting together a batch PDF recording feature for I, Librarian. I wanted
to ask you how many API requests per second or per minute are tolerable.
PubMed for instance explicitly allows up to 3 requests per second.
What is your limit?

Martin

Dave

Sep 17, 2009, 10:28:50 PM

to arXiv api

Martin, did you ever get a response?

Martin

Sep 23, 2009, 4:19:05 PM

to arXiv api

No, I did not. Summer is distracting. :-)

Thorsten S

Oct 22, 2009, 7:36:20 PM

to arxi...@googlegroups.com, dab...@gmail.com, mku...@gmail.com

The arXiv api does not support bulk download of PDFs.

We ask third-party services to be reasonable with the frequency of
search and metadata requests via the arXiv api. While there are no hard
limits at this time, if our servers get bogged down by certain clients
via the api, we will have to take some protective measures.

For full metadata dumps we recommend OAI-PMH. After the initial
harvest, daily or weekly incremental harvests are recommended.
http://arxiv.org/help/oa/index
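
As a rough illustration of what an incremental request could look like, here is a minimal Python sketch. The verb `ListRecords` and the arguments `metadataPrefix` and `from` come from the OAI-PMH 2.0 protocol; the base URL is an assumption and should be checked against the help page above, and the function name is made up for this example.

```python
import urllib.parse

# Assumed base URL for the arXiv OAI-PMH interface; verify it on the help page.
ASSUMED_BASE_URL = "http://export.arxiv.org/oai2"


def incremental_request_url(last_harvest_date, base_url=ASSUMED_BASE_URL,
                            metadata_prefix="oai_dc"):
    """Build a ListRecords URL for records changed since last_harvest_date.

    last_harvest_date is a UTC date string in YYYY-MM-DD form, as used by
    the OAI-PMH 'from' argument.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": last_harvest_date,
    }
    return base_url + "?" + urllib.parse.urlencode(params)
```

A daily job would call this with the date of its previous successful run and then follow any resumption tokens in the response, which this sketch does not handle.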

For a full copy of (or particular subsets of) the PDFs for arXiv papers, we
are in the process of setting up a service in the Cloud, which will
offer the option for bulk download. I'll let you know when that
becomes available.

Cheers
Thorsten

Martin

Oct 23, 2009, 5:15:18 PM

to arXiv api

Thanks, Thorsten. I did not mean to download PDFs from arXiv. Let me
explain. New users of I, Librarian have hundreds of PDFs on their hard
drives. I, Librarian is able to extract the DOI (or arXiv ID, if you'll
allow this) from these PDFs and fetch the corresponding metadata from
the respective repositories, such as PubMed, NASA ADS, or arXiv. It takes
some time to extract a DOI, move the PDF, and so on. As a result, this batch
recording feature requests metadata every 1-2 seconds. That is why I
asked. I think 1 request per second should be fine, but I have no
problem implementing a sleep function for any number of seconds you
would feel comfortable with.
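
To make the idea concrete, here is a minimal Python sketch of such a sleep function. It is only an illustration, not the I, Librarian code: the `Throttle` class and the helper names are made up for this example, and the query endpoint and `id_list` parameter are my assumption about the arXiv API, to be checked against its documentation.

```python
import time
import urllib.parse
import urllib.request


class Throttle:
    """Keep successive calls to wait() at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def metadata_url(arxiv_id, base="http://export.arxiv.org/api/query"):
    # Assumed endpoint and parameter name; verify against the API manual.
    return base + "?" + urllib.parse.urlencode({"id_list": arxiv_id})


def fetch_metadata(arxiv_ids, throttle):
    """Yield (arxiv_id, raw response bytes), one throttled request per ID."""
    for arxiv_id in arxiv_ids:
        throttle.wait()
        with urllib.request.urlopen(metadata_url(arxiv_id)) as response:
            yield arxiv_id, response.read()
```

With `Throttle(1.0)` the loop never sends more than one request per second, and it sleeps only for whatever part of the second the DOI extraction and file moving have not already used up.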

Thorsten S

Oct 23, 2009, 5:40:54 PM

to arxi...@googlegroups.com

Hi Martin,

thanks for clarifying.

For api requests (in particular individual record requests) a
frequency of 1/second is fine in principle.

However, it seems to me that it would be more efficient to do a full
OAI-PMH harvest of arXiv metadata and keep that up to date with daily
incrementals and then run the lookup function against your local copy
instead of requesting 1 record at a time for each reference
encountered. The OAI web site has links to various tools
http://www.openarchives.org/pmh/tools/tools.php to readily implement a
harvester, and the individual record keys are the arXiv identifiers,
e.g. <identifier>oai:arXiv.org:0804.2273</identifier>, so the lookup
in the local copy is trivial.
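
A minimal Python sketch of that local lookup, assuming the harvest was done with the `oai_dc` metadata format. The namespaces are the standard OAI-PMH and Dublin Core ones; the function name and the sample record (including its placeholder title) are invented for illustration and are not real arXiv data.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
PREFIX = "oai:arXiv.org:"

# Hand-written sample shaped like a ListRecords response; not real data.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:arXiv.org:0804.2273</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Placeholder title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def index_records(xml_text):
    """Map arXiv identifiers to a small metadata dict for local lookup."""
    root = ET.fromstring(xml_text)
    index = {}
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS)
        if not identifier.startswith(PREFIX):
            continue  # skip anything that is not keyed by an arXiv ID
        title = record.findtext(
            "oai:metadata/oai_dc:dc/dc:title", default="", namespaces=NS)
        index[identifier[len(PREFIX):]] = {"title": title.strip()}
    return index


local_copy = index_records(SAMPLE)
```

An ID extracted from a PDF can then be resolved with `local_copy.get("0804.2273")`, with no request to arXiv at lookup time.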

It's up to you to decide what you want to implement. It's great to see
new and creative uses of the api and we encourage people to do so. If
the result is publicly accessible, please share a pointer with the
group so that others can explore it.

Cheers
Thorsten

Thorsten S

Jul 15, 2010, 1:44:17 PM

to ogrisel, arxi...@googlegroups.com

Hi Olivier,

sorry for the delayed response -- it's the conference and vacation
time of the year

Please see

http://arxiv.org/help/bulk_data

and for the bulk PDF download from Amazon cloud service

http://arxiv.org/help/bulk_data_s3

Cheers
Thorsten

On Thu, Jul 1, 2010 at 5:20 AM, ogrisel <olivier...@gmail.com> wrote:
> On Oct 23 2009, 1:36 am, Thorsten S <thorsten.schwan...@gmail.com>
> wrote:

>> for a full copy of (or particular subsets of) PDF for arXiv papers, we
>> are in the process of setting up a service in the Cloud, which will
>> offer the option for bulk download. I'll let you know when that
>> becomes available.
>

> Hi Thorsten,
>
> Have you made any progress on this side? I would like to gain access
> to a corpus of around 1000 to 10000 papers from various arxiv
> categories to test algorithms for semantic document analysis and
> clustering.
>
> Best,
>
> --
> Olivier

Olivier Grisel

Jul 15, 2010, 1:49:30 PM

to Thorsten S, arxi...@googlegroups.com

2010/7/15 Thorsten S <thorsten....@gmail.com>:

> Hi Olivier,
>
> sorry for the delayed response -- it's the conference and vacation
> time of the year
>
> Please see
>
> http://arxiv.org/help/bulk_data
>
> and for the bulk PDF download from Amazon cloud service
>
> http://arxiv.org/help/bulk_data_s3