created timestamp in OAI-PMH

Isabel Beckenbach

Jul 11, 2025, 3:07:25 AM
to arXiv API Discussion
Dear arXiv team,

the "created" timestamp in the arXiv OAI-PMH interface with the arXiv metadata format seems to give the creation date of the latest version of an arXiv preprint, and not the creation date of the first submission, which is what I expected.

For example:


Here "created" is 2025-06-16, whereas the first version was submitted in 2021.

Is this intended? Can I only get the creation date of the original submission using the arXivRaw metadata format?

Best

Isabel

arXiv API Discussion

Jul 11, 2025, 1:05:42 PM
to arXiv API Discussion, IsabelBe...@gmx.de

In general, OAI-PMH returns the metadata for the latest version of the paper, so the created date is the date that particular version was created. The arXivRaw format is the best metadata format to get information on previous versions of the paper. The oai_dc format also has the date for the original version of the paper.
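As a rough illustration of the arXivRaw route mentioned above, here is a minimal sketch that reads the per-version dates from one record and picks the earliest. It assumes the arXivRaw namespace and the `<version version="v1"><date>...</date></version>` layout shown in the sample; the record itself (id and timestamps) is a placeholder for illustration, not real data, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Assumed namespace of the arXivRaw metadata format.
RAW_NS = {"raw": "http://arxiv.org/OAI/arXivRaw/"}

# Illustrative record only: the id and timestamps are placeholders.
SAMPLE_ARXIVRAW = """
<arXivRaw xmlns="http://arxiv.org/OAI/arXivRaw/">
  <id>0000.00000</id>
  <version version="v1"><date>Mon, 12 Apr 2021 10:00:00 GMT</date></version>
  <version version="v2"><date>Mon, 16 Jun 2025 09:30:00 GMT</date></version>
</arXivRaw>
"""


def version_dates(arxivraw_xml):
    """Return {version label: datetime} for every <version> element."""
    root = ET.fromstring(arxivraw_xml)
    return {
        v.get("version"): parsedate_to_datetime(
            v.findtext("raw:date", namespaces=RAW_NS)
        )
        for v in root.findall("raw:version", RAW_NS)
    }


def first_submission_date(arxivraw_xml):
    """Date of the earliest version, i.e. the original submission."""
    return min(version_dates(arxivraw_xml).values()).date()


print(first_submission_date(SAMPLE_ARXIVRAW))
```

Taking the minimum over all versions avoids relying on the order in which the `<version>` elements appear.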

Isabel Beckenbach

Jul 14, 2025, 2:11:41 AM
to arXiv API Discussion, arXiv API Discussion, Isabel Beckenbach
Thank you for the answer.

Was it always this way, or were there changes during the migration of the OAI-PMH interface to a cloud provider in June (https://groups.google.com/a/arxiv.org/g/api/c/4Cm8bc7A6JM/m/QSZRWsddBAAJ)?

Micha Moskovic

Jul 14, 2025, 7:23:41 AM
to arXiv API Discussion, IsabelBe...@gmx.de, arXiv API Discussion
Dear Isabel and arXiv team,

I am experiencing the same issue and have already reported it privately. The current behavior differs from the old API, and the change is not documented. I don't understand why it was deemed useful for the "arXiv" format to return a different set of dates from "oai_dc" and "arXivRaw", namely the date of the current version as "created" and some later, mysterious date as "updated".

Is there any rationale behind that? If not, I would strongly advocate for reverting to the previous behavior and harmonizing the dates returned by the different serialization formats, as the current behavior is confusing and not very useful.

Best,
Micha

arXiv API Discussion

Jul 14, 2025, 11:57:53 AM
to arXiv API Discussion, mich...@gmail.com, IsabelBe...@gmx.de, arXiv API Discussion
Hello,
Yes, there were many behind-the-scenes changes and a few user-facing changes made during the move to the cloud, including some to provide less stale data. You can find more details on this page:
Open Archives Initiative (OAI) - arXiv info


As for which dates are present in the different metadata formats: all formats contain the metadata for the most recent version of the paper, including its associated date. Some formats provide additional information: oai_dc also contains a date for the first version of the paper, and arXivRaw contains dates for all versions of the paper for users who are interested.

Micha Moskovic

Jul 17, 2025, 8:11:35 AM
to arXiv API Discussion, arXiv API Discussion, Micha Moskovic, IsabelBe...@gmx.de
Dear arXiv team,

On Monday, July 14, 2025 at 5:57:53 PM UTC+2 arXiv API Discussion wrote:
Hello,
Yes, there were many behind-the-scenes changes and a few user-facing changes made during the move to the cloud, including some to provide less stale data. You can find more details on this page:
Open Archives Initiative (OAI) - arXiv info

Could you please highlight on that page where you mention the change in date handling we're discussing here?
 

As for which dates are present in the different metadata formats: all formats contain the metadata for the most recent version of the paper, including its associated date. Some formats provide additional information: oai_dc also contains a date for the first version of the paper, and arXivRaw contains dates for all versions of the paper for users who are interested.


I don't think that's a correct characterization of the current behavior. If we look at Isabel's example 2104.05109, we indeed see that arXivRaw has full timestamps for all 4 versions, and oai_dc has two dates, the first of which is the date of v1, the second the date of v4 (latest in this case). However, arXiv has a distinct set of dates: its created date is the date of v4, but its updated date is a later date (two days after created), which is not present in arXivRaw at all.

I would understand arXivRaw having the full data and the other formats being derived from it, but that doesn't seem to be the case, as the updated date in arXiv is not present in arXivRaw. It also doesn't make sense to me that for oai_dc you decided that the two relevant dates are the submission dates of v1 and of the latest version, but for arXiv the v1 submission date is somehow irrelevant, and the format instead returns the latest submission date and a mystery date.
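To make the oai_dc observation concrete, here is a minimal sketch that pulls the `dc:date` values out of one oai_dc record and returns the earliest and the latest. It assumes the standard Dublin Core namespace and plain `YYYY-MM-DD` date strings; the sample record is a placeholder for illustration, and `first_and_latest_dates` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element namespace used by oai_dc.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Illustrative record only: title and dates are placeholders.
SAMPLE_OAI_DC = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Placeholder title</dc:title>
  <dc:date>2021-04-12</dc:date>
  <dc:date>2025-06-16</dc:date>
</oai_dc:dc>
"""


def first_and_latest_dates(oai_dc_xml):
    """Return (earliest, latest) dc:date strings from an oai_dc record."""
    root = ET.fromstring(oai_dc_xml)
    # ISO-style YYYY-MM-DD strings sort chronologically as plain text.
    dates = sorted(
        el.text.strip() for el in root.findall("dc:date", DC_NS) if el.text
    )
    if not dates:
        raise ValueError("record has no dc:date element")
    return dates[0], dates[-1]


print(first_and_latest_dates(SAMPLE_OAI_DC))
```

Sorting the values rather than trusting element order is a defensive choice, since the ordering of `dc:date` elements is only what was observed in this thread.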

This whole discussion might seem like nitpicking to you, but these choices have significant consequences. We want to get all the metadata in arXiv format (oai_dc is too incomplete, arXivRaw requires parsing authors) daily for all of arXiv with the actual created date, and it worked very well until you changed the behavior when you switched to the new API implementation. We've already introduced a partial workaround for the current behavior. However, if you don't want to revert the current (and in my opinion buggy) behavior, a full solution would require us to request multiple formats for every record (that is, do a GetRecord with metadataPrefix=oai_dc for each record we encounter in ListRecords with arXiv format). This would dramatically increase the number of API requests we need to make (one additional request per new/updated eprint per day) and push the complexity of our implementation beyond standard OAI-PMH harvesting.

So again, I ask you to please reconsider this behavior and go back to the previous and consistent behavior.
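For readers following along, the per-record workaround described above could look roughly like the sketch below: one extra GetRecord request with metadataPrefix=oai_dc for each identifier seen in the arXiv-format harvest. The base URL is an assumption that should be checked against the arXiv OAI help page, and the helper names are hypothetical. The code only builds the request URLs and does not contact the server.

```python
from urllib.parse import urlencode

# Assumption: base URL of the OAI-PMH endpoint; verify before use.
BASE_URL = "https://oaipmh.arxiv.org/oai"


def get_record_url(arxiv_id, metadata_prefix="oai_dc", base_url=BASE_URL):
    """Build a GetRecord request URL for one arXiv identifier."""
    params = {
        "verb": "GetRecord",
        "identifier": f"oai:arXiv.org:{arxiv_id}",
        "metadataPrefix": metadata_prefix,
    }
    return f"{base_url}?{urlencode(params)}"


def follow_up_urls(harvested_ids):
    """One additional request per record seen in the arXiv-format harvest."""
    return [get_record_url(arxiv_id) for arxiv_id in harvested_ids]


print(get_record_url("2104.05109"))
```

The list returned by `follow_up_urls` grows linearly with the number of new or updated eprints, which is the extra request load the post is concerned about.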