Bulk data access for licenses?

Colin Raffel

Mar 23, 2021, 11:59:30 AM
to arXiv API
Hi all, we are interested in using the text from Creative Commons-licensed arXiv articles as part of a dataset for machine learning research. We have been able to obtain the articles through the requester-pays bulk data access S3 buckets (https://arxiv.org/help/bulk_data_s3), but the files in these buckets do not specify the license of a given article. The OAI API endpoint seems to return the license, but AFAICT it is rate-limited, and getting the license for every article on arXiv would take a very long time. Is there any bulk data access (via a requester-pays bucket) to arXiv article metadata, including the license?

Thorsten

Mar 24, 2021, 7:06:04 PM
to arXiv API

Hi Colin,

The statement at https://arxiv.org/help/bulk_data_s3 concerning licenses is:

"Note: Most articles submitted to arXiv are submitted with the default arXiv license, which grants arXiv a perpetual, non-exclusive license to distribute the article, but does not assign copyright to arXiv, nor grant arXiv the right to grant any specific rights to others. We are thus unable to grant others the right to distribute arXiv articles. If you build indexes or tools based on the full-text, you must link back to arXiv for downloads. A small fraction of submissions are made with other licenses and this information is available in the OAI-PMH metadata."

So indeed to get at the concrete license information for a specific PDF, you should use OAI-PMH.
However, you don't have to make individual GetRecord requests. You can obtain a complete copy of arXiv metadata via the ListRecords verb in a reasonable amount of time.

Here is some simple Python code (I recommend you store the entire record response for later use). Sickle takes care of flow control and the resumptionToken, and downloads record metadata in sensible batches:

In [1]: from sickle import Sickle

In [2]: s = Sickle('https://export.arxiv.org/oai2', max_retries=2)

In [3]: records = s.ListRecords(metadataPrefix='arXivRaw', set='cs', ignore_deleted=True)

In [4]: for rec in records:
  ...:     if 'license' in rec.metadata:
  ...:         print(rec.metadata.get('id', ''), rec.metadata.get('license', ''))
  ...:          
['0704.0002'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.0217'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.0229'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.0304'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.0468'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.0492'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.1043'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.1748'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.1751'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.1829'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.2010'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.2092'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.2258'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.2808'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.2900'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.3177'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.3313'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.3395'], ['http://creativecommons.org/licenses/publicdomain/']
['0704.3536'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
['0704.3674'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']
....
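
Since the goal in the original question is to keep only Creative Commons material, the (id, license) values printed above can be filtered by license URL. The following is a minimal sketch, not part of the original reply: the helper names `is_creative_commons` and `cc_articles` are hypothetical, and it assumes that a license counts as Creative Commons when its URL is hosted on creativecommons.org.

```python
from urllib.parse import urlparse


def is_creative_commons(license_url):
    """Return True if the license URL is hosted on creativecommons.org.

    Assumption: the host name alone identifies a Creative Commons license.
    """
    host = urlparse(license_url).hostname or ""
    return host == "creativecommons.org" or host.endswith(".creativecommons.org")


def cc_articles(pairs):
    """Yield (arxiv_id, license_url) for entries with a CC license.

    `pairs` is an iterable of (id_list, license_list) tuples, shaped like
    the list values that rec.metadata returns in the session above.
    """
    for ids, licenses in pairs:
        if not ids:
            continue
        for url in licenses:
            if is_creative_commons(url):
                yield ids[0], url
                break


# Two entries copied from the output above.
sample = [
    (['0704.3313'], ['http://arxiv.org/licenses/nonexclusive-distrib/1.0/']),
    (['0704.3395'], ['http://creativecommons.org/licenses/publicdomain/']),
]
print(list(cc_articles(sample)))
# prints [('0704.3395', 'http://creativecommons.org/licenses/publicdomain/')]
```

Whether a particular Creative Commons URL (for example the public domain dedication above) is acceptable for a given dataset is a separate decision that this filter does not make.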

Note that the default mapping of the arXivRaw metadata structure to the flat sickle record structure isn't perfect. For better mapping and access to nested elements and attributes you should define a custom handler.
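
One way to reach nested elements and attributes without relying on the flat mapping is to keep each record's XML (sickle exposes it as `rec.raw`) and parse it separately. The sketch below uses only the standard library and is not the custom sickle handler mentioned above. The function name `parse_arxiv_raw` is hypothetical, and the sample record is hand-written with a placeholder id and date; its element layout (`id`, `license`, and `version` elements with a `version` attribute and a `date` child) is an assumption about the arXivRaw format, so check it against a real response.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix of the form '{uri}name'."""
    return tag.rsplit('}', 1)[-1]


def parse_arxiv_raw(record_xml):
    """Extract id, license and per-version info from one record's XML.

    Elements are matched by local name, so the exact namespace URIs do
    not matter. The element names are assumptions about arXivRaw.
    """
    root = ET.fromstring(record_xml)
    out = {"id": None, "license": None, "versions": []}
    for el in root.iter():
        name = local(el.tag)
        if name == "id" and out["id"] is None:
            out["id"] = (el.text or "").strip()
        elif name == "license":
            out["license"] = (el.text or "").strip()
        elif name == "version":
            dates = [c.text for c in el if local(c.tag) == "date"]
            out["versions"].append({
                "version": el.get("version"),   # attribute on <version>
                "date": dates[0] if dates else None,
            })
    return out


# Hand-written illustrative record; id and date are placeholders.
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header><identifier>oai:arXiv.org:0000.00000</identifier></header>
  <metadata>
    <arXivRaw xmlns="http://arxiv.org/OAI/arXivRaw/">
      <id>0000.00000</id>
      <version version="v1"><date>Mon, 1 Jan 2001 00:00:00 GMT</date></version>
      <license>http://creativecommons.org/licenses/publicdomain/</license>
    </arXivRaw>
  </metadata>
</record>"""

print(parse_arxiv_raw(SAMPLE))
```

In a harvesting loop this would be called as `parse_arxiv_raw(rec.raw)`. Writing `rec.raw` to disk as well keeps the full response available for later reprocessing.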

Cheers
T.

Colin Raffel

Mar 25, 2021, 11:32:19 AM
to arxi...@googlegroups.com
Ah, this is perfect, thank you!


