Downloading large number of PDFs

Franciszek Wieczorek

Oct 23, 2022, 7:16:08 AM

to arXiv API

Hello,

I have a small project that requires a large number of PDFs on a specific topic. I'd like to avoid putting extra stress on the server when using any form of automated downloading through the regular API. I read that there is an alternative method for dealing with bulk data, the Open Archives Initiative (OAI) interface, described here: https://arxiv.org/help/oa.

The problem is I don't really know how to use it! The regular API is clear about it: I can build a query and get a result on a number of papers, from a specific group such as Computer Science.

I found a module called Sickle here: https://sickle.readthedocs.io/en/latest/tutorial.html, which simplifies the work with OAI. But I still can't figure out the way to filter and download papers according to my query.

Is it possible using OAI?

Eric Lease Morgan

Oct 23, 2022, 9:26:42 AM

to arxi...@googlegroups.com

On Oct 23, 2022, at 7:14 AM, Franciszek Wieczorek <twitsoc...@gmail.com> wrote:

> I have a small project that requires a large number of PDFs on a specific topic. I'd like to avoid putting extra stress on the server when using any form of automated downloading through the regular API. I read that there is an alternative method for dealing with bulk data, the Open Archives Initiative (OAI) interface, described here: https://arxiv.org/help/oa.

I too have projects which require large numbers (1000's) of PDFs on a specific topic. I'm not sure, but I believe one is authorized to download such things if the downloads are throttled at no more than one request per 5 seconds or so. Please correct me if I'm wrong.

OAI-PMH is a RESTful protocol: send a URL, get XML back. There are six different shapes (verbs) of URL that can be sent to an OAI-PMH data provider (server). They are outlined here:

https://www.openarchives.org/OAI/openarchivesprotocol.html#ProtocolMessages
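As an illustration of the "send a URL, get XML back" idea, here is a minimal sketch that only builds a request URL and does not contact any server. The endpoint `https://export.arxiv.org/oai2`, the set name `cs`, and the `oai_dc` metadata prefix are assumptions that are not stated in this thread and may have changed; `build_oai_url` is a hypothetical helper name. Check arXiv's OAI help page before relying on any of them.

```python
from urllib.parse import urlencode

# Assumed endpoint (not confirmed in this thread); verify before use.
OAI_BASE_URL = "https://export.arxiv.org/oai2"

# The six request verbs defined by the OAI-PMH 2.0 specification.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_oai_url(verb, params=None):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb!r}")
    query = {"verb": verb}
    query.update(params or {})
    return f"{OAI_BASE_URL}?{urlencode(query)}"


# Ask for records in an assumed "cs" set changed since a given date.
url = build_oai_url(
    "ListRecords",
    {"metadataPrefix": "oai_dc", "set": "cs", "from": "2022-10-01"},
)
print(url)
```

The `set` and `from` arguments are the main filtering tools OAI-PMH offers; it has no keyword search, so topic filtering beyond sets would have to happen on the harvested metadata.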

There are many libraries implementing OAI.

The problem is that, as the protocol is configured here (arXiv), you only get metadata back: author, title, date, maybe abstract, maybe keywords, and probably a link to a splash/landing page. You don't actually get the PDFs.

Again, please correct me if I'm wrong?
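To make the "metadata only" point concrete, the sketch below parses a hand-written sample shaped like an OAI-PMH `ListRecords` response in the `oai_dc` format. The sample XML is made up for illustration (the title is a placeholder, and the `oai:arXiv.org:` identifier prefix is an assumption); only the namespace URIs come from the OAI-PMH and Dublin Core specifications. `extract_records` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace URIs from the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample, NOT real server output.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:arXiv.org:2210.10863</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Placeholder title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_records(xml_text):
    """Return a list of {"arxiv_id", "title"} dicts from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        # Keep only the part after the last colon, e.g. "2210.10863".
        arxiv_id = identifier.rsplit(":", 1)[-1]
        records.append({"arxiv_id": arxiv_id, "title": title.strip()})
    return records


records = extract_records(SAMPLE)
print(records)
```

The response carries identifiers and descriptive fields but no PDF bytes, so any download step has to be built separately from the harvested identifiers.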

--
Eric Lease Morgan
University of Notre Dame

Lukas Schwab

Oct 23, 2022, 9:26:26 PM

to arxi...@googlegroups.com

> I'm not sure, but I believe one is authorized to download such things if the downloads are throttled at no more than one request per 5 seconds or so. Please correct me if I'm wrong.

I don't know of a separate rate limit for downloads; the OAI and arXiv API rate limits are 1req/3s.

The docs recommend programs access the API via the export.arxiv.org subdomain. Though API results always point at arxiv.org, you can add the subdomain to fetch PDFs as well: https://arxiv.org/pdf/2210.10863.pdf becomes https://export.arxiv.org/pdf/2210.10863.pdf.
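The subdomain swap described above can be sketched as a small helper. The function name `to_export_url` is hypothetical, and treating `www.arxiv.org` the same as `arxiv.org` is an assumption; the example URL is the one from this message.

```python
from urllib.parse import urlsplit, urlunsplit


def to_export_url(url):
    """Rewrite an arxiv.org URL to use the export.arxiv.org subdomain."""
    parts = urlsplit(url)
    if parts.netloc.lower() in ("arxiv.org", "www.arxiv.org"):
        parts = parts._replace(netloc="export.arxiv.org")
    # URLs on any other host are returned unchanged.
    return urlunsplit(parts)


print(to_export_url("https://arxiv.org/pdf/2210.10863.pdf"))
```

Parsing the URL rather than doing a plain string replace avoids accidentally rewriting "arxiv.org" where it appears in a path or query string.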

You could also bulk-download PDFs from S3. At that point, you can use the arXiv API or OAI to identify papers of interest, then pick the relevant PDFs out of the S3 data... but there's a whole lot of S3 data, maybe too much to be useful. Those blobs are organized temporally, not categorically.

Cheers,

Lukas


Eric Lease Morgan

Oct 24, 2022, 10:05:02 AM

to arxi...@googlegroups.com

On Oct 23, 2022, at 1:47 PM, Lukas Schwab <lukas....@gmail.com> wrote:

>> I'm not sure, but I believe one is authorized to download such things if the downloads are throttled at no more than one request per 5 seconds or so. Please correct me if I'm wrong.
>
> I don't know of a separate rate limit for downloads; the OAI and arXiv API rate limits are 1req/3s.

Yes, this is more accurate.

> The docs recommend programs access the API via the export.arxiv.org subdomain. Though API results always point at arxiv.org, you can add the subdomain to fetch PDFs as well: https://arxiv.org/pdf/2210.10863.pdf becomes https://export.arxiv.org/pdf/2210.10863.pdf.

This is also very helpful; I forgot this specific. Thank you.

Still, I believe the OAI-PMH interface will only return metadata and not the full text / PDF of archived items. (All puns intended.)

Jim Entwood

Oct 24, 2022, 10:28:42 AM

to arXiv API

Right... the APIs in arXiv currently provide only metadata results. As part of the full modernization and cloud migration of the legacy system we have plans for full text via API, but that feature is likely not available until mid-2024.

For now the full text options for the full corpus are:

Full text PDF or TeX source for new content or a subset of content is available by:

Crawling our export service: https://export.arxiv.org/
- Info is at: https://arxiv.org/help/bulk_data#harvest
- We suggest a rate limit of bursts of at most 4 requests per second, with a 1 second sleep after each burst.
- As Lukas said, the URL format is to use the export.arxiv.org subdomain in place of arxiv.org to avoid getting blocked by the firewall on the main site
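One way to read the suggested rate limit is sketched below: fetch URLs in groups of four and sleep one second between groups. This is an interpretation, not an official client; `download_in_bursts` and `fetch_bytes` are hypothetical names, and the `fetch` and `sleep` parameters exist only so the pacing logic can be exercised without network access.

```python
import time
from urllib.request import urlopen


def fetch_bytes(url):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=60) as response:
        return response.read()


def download_in_bursts(urls, fetch=fetch_bytes, burst_size=4,
                       pause_seconds=1.0, sleep=time.sleep):
    """Fetch URLs in bursts of `burst_size`, pausing between bursts."""
    results = {}
    for index, url in enumerate(urls):
        # Pause before starting every burst except the first one.
        if index > 0 and index % burst_size == 0:
            sleep(pause_seconds)
        results[url] = fetch(url)
    return results


# Dry run with stand-ins for the network call and the sleep.
pauses = []
demo_urls = [f"https://export.arxiv.org/pdf/example-{n}" for n in range(9)]
demo = download_in_bursts(
    demo_urls,
    fetch=lambda url: b"",          # stand-in: no request is made
    sleep=pauses.append,            # stand-in: record instead of sleeping
)
print(len(demo), pauses)
```

The `example-N` paths in the dry run are placeholders, not real paper identifiers. For real use, call `download_in_bursts` with the default `fetch` and `sleep`, and consider a more conservative pause if the server starts returning errors.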

Best,

Jim

Jim Entwood

Oct 24, 2022, 10:57:49 AM

to arxi...@googlegroups.com

The APIs in arXiv currently provide only metadata results. As part of the full modernization and cloud migration of the legacy system we have plans for full text via API, but that feature is likely not available until mid-2024.

For now the full text options for the full corpus are:

AWS https://arxiv.org/help/bulk_data_s3

Kaggle https://www.kaggle.com/Cornell-University/arxiv

Full text for new content or a subset of content is available by:

Crawling our export service: https://export.arxiv.org/

Info is at: https://arxiv.org/help/bulk_data#harvest

We suggest a rate limit of bursts of at most 4 requests per second, with a 1 second sleep after each burst.

As Lukas said, the URL format is to use the export.arxiv.org subdomain in place of arxiv.org to avoid getting blocked by the firewall.

ex: https://export.arxiv.org/pdf/2210.10863.pdf

Best,

Jim

Jim Entwood

arXiv.org Head of Content and User Support

je...@cornell.edu
(he / him)


Felicia

Apr 9, 2024, 9:10:05 AM

to arXiv API

Great work. Is there any update on this: "we have plans for full text via API, but that feature is likely not available until mid-2024"?

Jake Weiskoff

Apr 9, 2024, 9:23:48 AM

to arxi...@googlegroups.com

Hi Felicia,

There has not been any additional development effort for new features/functionality in the search API. I don't expect that we'd be in a position to revisit this within the calendar year, but I've added the enhancement request to our "cloud migration" considerations.

Sincerely,

-Jake Weiskoff

Project Manager, arXiv.org

