bulk metadata access not through OAI-PMH

Evan Peters

Sep 30, 2024, 9:17:56 AM
to arXiv API Discussion
Hello,

I'm looking to access and download the arXivRaw metadata for all of the arXiv physics preprints. This is a large number of preprints (>1 million) spanning a few dozen categories.

Currently I'm doing this through the OAI-PMH interface. I just wanted to confirm that this is the correct way to be doing things, and that there isn't a way to download metadata in bulk like there is for the PDFs/source?

Also for rate limit purposes do you happen to know what counts as a 'request' when accessing OAI-PMH through the Sickle python package? (e.g. code below). In their docs they imply that this is determined by a "batch size" of the repository, but I couldn't find out what that meant. I assume that I do not need to wait 3 seconds between accessing each article's metadata, but I can't figure out how to impose rate limits properly here...

```
from sickle import Sickle

s = Sickle('https://export.arxiv.org/oai2', default_retry_after=3, max_retries=10)
records = s.ListRecords(
    **{
        'metadataPrefix': 'arXivRaw',
        'set': 'physics:quant-ph',
        'ignore_deleted': True,
    }
)
records.next()  # <-- Is this one request? Or do I get a few thousand of these per request?
```

Thanks,

Evan Peters
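
On the "what counts as a request" question: OAI-PMH list responses are paged. Each HTTP request returns one batch of records plus a resumption token, and a harvesting client only issues the next request once the current batch is used up, so stepping through the records inside a batch costs no extra requests. The sketch below imitates that flow without touching the network. `iter_records`, `fake_fetch`, and the two-page `PAGES` repository are hypothetical names for illustration, not Sickle or arXiv APIs, and the batch sizes are made up.

```python
import time
from typing import Callable, Iterator, List, Optional, Tuple

# One page is (records, resumption_token); a token of None marks the last page.
Page = Tuple[List[str], Optional[str]]


def iter_records(fetch_page: Callable[[Optional[str]], Page],
                 delay: float = 3.0,
                 sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield records one by one, sleeping only between page requests."""
    token = None
    first_request = True
    while True:
        if not first_request:
            sleep(delay)  # rate limit applies per request, not per record
        first_request = False
        records, token = fetch_page(token)
        yield from records
        if token is None:
            return


# Hypothetical repository with two pages (real batches are far larger).
PAGES = {
    None: (["rec1", "rec2", "rec3"], "tok-1"),
    "tok-1": (["rec4", "rec5"], None),
}
requests_made = []


def fake_fetch(token):
    requests_made.append(token)
    return PAGES[token]


delays = []  # record the sleeps instead of actually waiting
records = list(iter_records(fake_fetch, delay=3.0, sleep=delays.append))
print(len(records), "records from", len(requests_made), "requests")
```

In this toy run five records arrive over two requests with a single pause between them, which is the pattern a per-request delay should follow.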

Brian Maltzan

Sep 30, 2024, 9:20:29 AM
to a...@arxiv.org
Hi Evan,

Yes, you can use the bulk download of metadata and pdfs through Kaggle:

Thanks,
Brian

--
You received this message because you are subscribed to the Google Groups "arXiv API Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to api+uns...@arxiv.org.
To view this discussion on the web visit https://groups.google.com/a/arxiv.org/d/msgid/api/acb63844-d324-498e-b8d3-476d39c2d86en%40arxiv.org.

Isabel Beckenbach

Oct 7, 2024, 4:56:18 AM
to arXiv API Discussion, bmal...@arxiv.org
Hi, I use Sickle with the OAI-PMH interface to add the new or updated metadata for arXiv preprints every day:

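```
from sickle import Sickle
from sickle.oaiexceptions import NoRecordsMatch

# Excerpt from a harvester method: self.ARXIV_URL, oai_params and log
# are defined elsewhere.
sickle = Sickle(
    self.ARXIV_URL,
    iterator=OAIItemIteratorWithDelay,
    max_retries=3
)
try:
    records = sickle.ListRecords(**oai_params)
except NoRecordsMatch:
    log.info("No records found.")
    return False
```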

where OAIItemIteratorWithDelay is a custom iterator for Sickle that adds a delay of 3.1 seconds between two consecutive calls to the arXiv OAI-PMH interface:

```
import logging
import time

from sickle.iterator import OAIItemIterator


class OAIItemIteratorWithDelay(OAIItemIterator):
    """
    Small wrapper around OAIItemIterator that only makes a request to an
    OAI-PMH every 3 seconds
    """

    def __init__(self, sickle, params, ignore_deleted=False):
        # Set the logger first: the parent constructor already calls
        # _next_response(), which uses it.
        self.logger = logging.getLogger()
        super(OAIItemIteratorWithDelay, self).__init__(sickle, params, ignore_deleted)

    def _next_response(self):
        self.logger.info("Sleep for next response.")
        time.sleep(3.1)
        # Fetch the next batch via the base iterator, then rebuild the item iterator.
        super(OAIItemIterator, self)._next_response()
        self._items = self.oai_response.xml.iterfind('.//' + self.sickle.oai_namespace + self.element)
```

This worked until the end of September. Since 09/30/2024 I get a ConnectionResetError 104. However, if I start the harvesting with, for example, just the changes from 10/01/2024, it works fine. I tried it several times and always get this error if the date range includes 09/30/2024. Were there some changes to the API on this date? Should I increase the delay between two calls to the OAI-PMH interface?


Jake Weiskoff

Oct 7, 2024, 11:55:52 AM
to a...@arxiv.org
This may have been related to the database issues that weekend. You may wish to use the Kaggle dataset for the items on that day.

Regards,
-Jake 

Isabel Beckenbach

Oct 7, 2024, 2:01:40 PM
to arXiv API Discussion, ja...@arxiv.org
Thank you, Jake. The Kaggle dataset seems to be very useful for fetching all metadata.

In the meantime I was able to fetch the data from 30th September by increasing the delay between two calls to the API to 10 seconds. However, I still sometimes get 104 errors. I don't think I am making too many requests. Maybe there has been a lot of traffic on the API recently? Is there a time of day at which the API is more reliable? Currently I run at 03:00 UTC and only fetch the metadata of one day. However, I make two calls to the API (one for metadata format arXiv and one for arXivRaw).

If my problem is related to the database issue that you mentioned, then I guess it should go away in the next few days. I will observe it and let you know whether I still get a lot of 104 errors.

Best

Isabel
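
Besides a longer fixed delay, intermittent connection resets like these are often handled by retrying with a growing wait. The following is a minimal sketch of that idea, not part of Sickle: `call_with_backoff`, `flaky_harvest`, and the chosen delays are hypothetical, and which exception actually surfaces depends on the HTTP stack (the `requests` library, for instance, usually wraps a reset in its own `ConnectionError`), so the `retry_on` tuple is an assumption to adapt.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 10.0,
                      retry_on: Tuple[Type[BaseException], ...] = (ConnectionResetError,),
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying on the given errors with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # 10 s, 20 s, 40 s, ...
    raise ValueError("max_attempts must be at least 1")


# Hypothetical stand-in for one day's harvest: fails twice, then succeeds.
calls = {"count": 0}


def flaky_harvest() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionResetError(104, "Connection reset by peer")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky_harvest, sleep=waits.append)
print(result, waits)
```

Wrapping a whole one-day harvest in such a helper restarts that day from the beginning on failure, which is acceptable for small daily ranges but wasteful for large ones.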

Brian Maltzan

Oct 7, 2024, 2:45:05 PM
to a...@arxiv.org
Hi Isabel,

Yes, that sounds like errors from heavy load.
In general, right after announcement is the busiest.
There's less traffic 6am - 4pm UTC.

Thanks,
Brian
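
A scheduled harvester could check the quieter window mentioned above before it starts. This is a small sketch that takes the 6am to 4pm UTC range from this thread at face value; `in_quiet_window` and the two constants are hypothetical names, and treating the end hour as exclusive is an assumption.

```python
from datetime import datetime, timezone

# Window boundaries as quoted in the thread (UTC hours); end is exclusive here.
QUIET_START_HOUR = 6
QUIET_END_HOUR = 16


def in_quiet_window(now: datetime) -> bool:
    """Return True if the timezone-aware datetime falls within 06:00-16:00 UTC."""
    hour = now.astimezone(timezone.utc).hour
    return QUIET_START_HOUR <= hour < QUIET_END_HOUR


early_run = datetime(2024, 10, 8, 3, 0, tzinfo=timezone.utc)
midday_run = datetime(2024, 10, 8, 10, 30, tzinfo=timezone.utc)
print(in_quiet_window(early_run), in_quiet_window(midday_run))
```

By this check, a job scheduled at 03:00 UTC falls outside the window, while one at 10:30 UTC falls inside it.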


--
You received this message because you are subscribed to the Google Groups "arXiv API Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to api+uns...@arxiv.org.
Reply all
Reply to author
Forward
0 new messages