Downloading large number of PDFs


Franciszek Wieczorek

Oct 23, 2022, 7:16:08 AM
to arXiv API
Hello,

I have a small project that requires a large number of PDFs on a specific topic. I'd like to avoid putting extra stress on the server with any form of automated downloading through the regular API. I read that there is an alternative method for dealing with bulk data, the Open Archives Initiative (OAI) interface, described here: https://arxiv.org/help/oa.

The problem is I don't really know how to use it! The regular API is clear about it: I can build a query and get a result on a number of papers, from a specific group such as Computer Science.

I found a module called Sickle here: https://sickle.readthedocs.io/en/latest/tutorial.html, which simplifies the work with OAI. But I still can't figure out the way to filter and download papers according to my query.

Is it possible using OAI?

Eric Lease Morgan

Oct 23, 2022, 9:26:42 AM
to arxi...@googlegroups.com


On Oct 23, 2022, at 7:14 AM, Franciszek Wieczorek <twitsoc...@gmail.com> wrote:

> I have a small project, which requires a large number of PDFs on a specific topic. I'd like to prevent a server from extra stress when using any form of automated downloading by regular API. I read there is an alternative method to deal with bulk data, which is Open Archives Initiative (OAI) here: https://arxiv.org/help/oa.


I too have projects which require large numbers (1000s) of PDFs on a specific topic. I'm not sure, but I believe one is authorized to download such things if the downloads are throttled to no more than one request per 5 seconds or so. Please correct me if I'm wrong.

OAI-PMH is a RESTful protocol. Send a URL, get XML back. There are six different shapes (verbs) of URL that can be sent to an OAI-PMH data provider (server). They are outlined here:

https://www.openarchives.org/OAI/openarchivesprotocol.html#ProtocolMessages
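
As a rough illustration of "send a URL, get XML back", here is a minimal Python sketch that builds a `ListRecords` request URL and pulls identifiers and titles out of a response. The base URL `https://export.arxiv.org/oai2`, the set name `cs`, and the hand-written sample XML are assumptions for illustration and should be checked against arXiv's OAI help page; the helper names are hypothetical. Nothing is fetched over the network here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed arXiv OAI-PMH endpoint; verify against the current arXiv docs.
OAI_BASE = "https://export.arxiv.org/oai2"

# XML namespaces defined by the OAI-PMH and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_list_records_url(set_spec, from_date, metadata_prefix="oai_dc", base=OAI_BASE):
    """Build a ListRecords URL (one of the six OAI-PMH verbs)."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
        "from": from_date,
    }
    return base + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        title = rec.findtext(".//dc:title", default="", namespaces=NS)
        records.append((ident, " ".join(title.split())))  # collapse whitespace
    return records


# Hand-written sample response (placeholder title), not real arXiv output.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:arXiv.org:2210.10863</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Placeholder
            title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_list_records_url("cs", "2022-10-01"))
    print(parse_records(SAMPLE))
```

Note that the result is metadata only. A full harvest would also need to follow the `resumptionToken` that the protocol uses for paging, which this sketch leaves out.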

There are many libraries implementing OAI.

The problem is that, as the protocol is configured here (arXiv), you only get metadata back: author, title, date, maybe abstract, maybe keywords, and probably a link to a splash/landing page. You don't actually get the PDFs.

Again, please correct me if I'm wrong.

--
Eric Lease Morgan
University of Notre Dame


Lukas Schwab

Oct 23, 2022, 9:26:26 PM
to arxi...@googlegroups.com
> I'm not sure, but I believe one is authorized to download such things if the downloads are throttled to no more than one request per 5 seconds or so. Please correct me if I'm wrong.

I don't know of a separate rate limit for downloads; the OAI and arXiv API rate limits are one request every 3 seconds.

The docs recommend programs access the API via the export.arxiv.org subdomain. Though API results always point at arxiv.org, you can add the subdomain to fetch PDFs as well: https://arxiv.org/pdf/2210.10863.pdf becomes https://export.arxiv.org/pdf/2210.10863.pdf.
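
To make that concrete, here is a small Python sketch of the URL rewrite together with a throttle that keeps at least 3 seconds between requests. The helper names (`to_export_url`, `throttled`, `download_all`) and the output file naming are hypothetical; only the `export.arxiv.org` rewrite and the 3-second interval come from this thread.

```python
import os
import time
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def to_export_url(url):
    """Rewrite an arxiv.org URL to the export.arxiv.org subdomain."""
    parts = urlsplit(url)
    if parts.netloc == "arxiv.org":
        parts = parts._replace(netloc="export.arxiv.org")
    return urlunsplit(parts)


def throttled(items, min_interval=3.0, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than one per `min_interval` seconds."""
    last = None
    for item in items:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item


def download_all(urls, dest_dir):
    """Fetch each PDF via export.arxiv.org, one request every 3 seconds."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in throttled(to_export_url(u) for u in urls):
        # Hypothetical naming scheme: use the last path segment as the file name.
        target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        with urllib.request.urlopen(url) as resp, open(target, "wb") as out:
            out.write(resp.read())
```

Calling `download_all(["https://arxiv.org/pdf/2210.10863.pdf"], "pdfs")` would request the `export.arxiv.org` version of that URL. The `sleep` and `clock` parameters exist only so the throttle can be exercised without actually waiting.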

You could also bulk-download PDFs from S3. At that point, you can use the arXiv API or OAI to identify papers of interest, then pick the relevant PDFs out of the S3 data... but there's a whole lot of S3 data, maybe too much to be useful. Those blobs are organized temporally, not categorically.
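
A minimal sketch of the "pick the relevant PDFs out" step, assuming you already have one of the bulk tar archives on disk and a set of arXiv IDs from the API or OAI. The member naming it expects (for example `2210/2210.10863v1.pdf`) is an assumption about the archive layout, not something confirmed in this thread, and the helper names are hypothetical. Only new-style IDs (`YYMM.NNNNN`) are handled.

```python
import re
import tarfile
from collections import defaultdict

# Matches new-style IDs with an optional version suffix, e.g. 2210.10863v1.pdf
PDF_NAME = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?\.pdf$")


def ids_by_month(arxiv_ids):
    """Group new-style IDs by their YYMM prefix, since the bulk data is organized by time."""
    groups = defaultdict(list)
    for arxiv_id in arxiv_ids:
        groups[arxiv_id.split(".")[0]].append(arxiv_id)
    return dict(groups)


def select_pdfs(tar, wanted_ids):
    """Yield (arxiv_id, member_name, pdf_bytes) for wanted papers in an open TarFile."""
    wanted = set(wanted_ids)
    for member in tar:
        if not member.isfile():
            continue
        match = PDF_NAME.search(member.name)
        if match and match.group(1) in wanted:
            yield match.group(1), member.name, tar.extractfile(member).read()
```

With an archive opened via `tarfile.open(path)`, iterating `select_pdfs(tar, {"2210.10863"})` yields only the matching members, and `ids_by_month` tells you which months' archives are worth fetching in the first place.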

Cheers,
Lukas


Eric Lease Morgan

Oct 24, 2022, 10:05:02 AM
to arxi...@googlegroups.com

On Oct 23, 2022, at 1:47 PM, Lukas Schwab <lukas....@gmail.com> wrote:

>> I'm not sure, but I believe one is authorized to download such things if the downloads are throttled to no more than one request per 5 seconds or so. Please correct me if I'm wrong.
>
> I don't know of a separate rate limit for downloads; the OAI and arXiv API rate limits are one request every 3 seconds.

Yes, this is more accurate.


> The docs recommend programs access the API via the export.arxiv.org subdomain. Though API results always point at arxiv.org, you can add the subdomain to fetch PDFs as well: https://arxiv.org/pdf/2210.10863.pdf becomes https://export.arxiv.org/pdf/2210.10863.pdf.

This is also very helpful; I forgot this specific. Thank you.

Still, I believe the OAI-PMH interface will only return metadata and not the full text / PDF of archived items. (All puns intended.)

Jim Entwood

Oct 24, 2022, 10:28:42 AM
to arXiv API
Right... the APIs in arXiv currently provide only metadata results. As part of the full modernization and cloud migration of the legacy system we have plans for full text via API, but that feature is likely not available until mid-2024.

For now the full text options for the full corpus are: 


Full text PDF or TeX source for new content or subset of content is available by:


Best,

Jim

Jim Entwood
arXiv.org Head of Content and User Support
je...@cornell.edu
(he / him)
Felicia

Apr 9, 2024, 9:10:05 AM
to arXiv API
Great work. Any update on this: "we have plans for full text via API, but that feature is likely not available until mid-2024"?

Jake Weiskoff

Apr 9, 2024, 9:23:48 AM
to arxi...@googlegroups.com
Hi Felicia,

There has not been any additional development effort for new features/functionality in the search API. I don't expect that we'd be in a position to revisit this within the calendar year, but I've added the enhancement request to our "cloud migration" considerations.

Sincerely,
-Jake Weiskoff
Project Manager, arXiv.org