list of all paper ids?

asd

Dec 2, 2012, 9:21:30 AM
to arxi...@googlegroups.com
Hello,
is there something like a list of ids of all the papers published so far on arxiv.org?
Alternatively, is there possibly even a way to download a dump of all the paper metadata?

Thanks in advance!

Thorsten

Dec 2, 2012, 1:08:19 PM
to arxi...@googlegroups.com

Similar questions have been asked here before. Use OAI-PMH for this purpose, see http://arxiv.org/help/oa/index .
OAI-PMH supports sets, so you can selectively harvest categories of interest to you.
OAI-PMH supports incremental harvest, so it's straightforward to stay up to date after the initial global harvest.

Note that there was a major change to the identifier form in 2007; see http://arxiv.org/help/arxiv_identifier
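
For anyone post-processing a harvest, here is a minimal Python sketch of telling the two identifier forms apart. The regular expressions and the function name identifier_style are illustrative assumptions, not anything provided by arXiv; they cover identifiers without a version suffix such as "v2".

```python
import re

# Old-style form (before the 2007 change): archive[.SUBJECT-CLASS]/YYMMNNN,
# for example physics/9803001.
OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}$")

# New-style form (after the 2007 change): YYMM.NNNN; a five-digit sequence
# number is also accepted here.
NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}$")


def identifier_style(arxiv_id: str) -> str:
    """Return 'old', 'new', or 'unknown' for a bare arXiv identifier."""
    if OLD_STYLE.match(arxiv_id):
        return "old"
    if NEW_STYLE.match(arxiv_id):
        return "new"
    return "unknown"


if __name__ == "__main__":
    for example in ("physics/9803001", "0704.0001"):
        print(example, identifier_style(example))
```

Identifiers returned by OAI-PMH carry the prefix "oai:arXiv.org:", which would have to be stripped before calling this helper.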

Cheers
T.

asd

Dec 8, 2012, 2:40:42 PM
to arxi...@googlegroups.com
Thanks!
Now, at least I managed to get a list for, e.g., physics, i.e.:
http://export.arxiv.org/oai2?set=physics&verb=ListIdentifiers&metadataPrefix=arXiv

Unfortunately I couldn't figure out how to also get entries from the years before 2007.
So far, I tried the additional arguments
"metadataPrefix=oldArXiv"
and, e.g., 
"from=1998-01-15",
as described on
www.openarchives.org/OAI/2.0/openarchivesprotocol.htm.
But these didn't work out.
Could someone help with a valid example request?

Cheers!

Simeon Warner

Dec 9, 2012, 11:52:33 AM
to arxi...@googlegroups.com
The OAI-PMH datestamps are the datestamps of the last update of the record and not the original submission date. If you harvest everything you'll get all the articles back to 1991, but there won't be any datestamps before 2007.

I added something to our help page about this as it has been a frequent cause of confusion:


Every OAI-PMH metadata record has a datestamp associated with it, which is the last modification time of that record. Because arXiv has updated metadata records in bulk on several occasions, the OAI-PMH datestamp values do not correspond with the original submission or replacement times for older articles, and may not for newer articles because of administrative and bibliographic updates. The earliest datestamp is given in the <earliestDatestamp> element of the Identify response.
The OAI-PMH interface does not support selective harvesting based on submission date. The datestamps are designed to support incremental harvesting of updates on an ongoing basis. It is not possible to selectively harvest only, say, articles submitted in February 2001 (identifiers 0102.xxxx). Except for selective harvesting based on subject areas (see description of Sets below) the interface is designed to support copying and synchronization of a complete set of arXiv metadata. In order to harvest metadata for all articles, either make requests without a datestamp range (recommended), or make requests from the <earliestDatestamp> through to the present (but beware that because of bulk updates there are some dates on which there were large numbers of updates).
Once an initial harvest has been completed, the copy may be maintained by making incremental harvesting requests with the from date set to the date of last harvest (from is best taken from the last server response; don't set the until date).
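
As a rough illustration of the two kinds of request described above, here is a small Python sketch that only builds the request URLs and does not contact the server. The helper name list_records_url and the example date are assumptions made for illustration; the endpoint and parameter names are the ones used elsewhere in this thread.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/oai2"  # endpoint quoted in this thread


def list_records_url(metadata_prefix="arXiv", set_spec=None, from_date=None):
    """Build a ListRecords URL; the until argument is deliberately never set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # selective harvest by subject area
    if from_date:
        params["from"] = from_date  # datestamp of the last harvest
    return BASE_URL + "?" + urlencode(params)


# Initial harvest: no datestamp range at all (the recommended option).
initial = list_records_url()

# Later incremental harvest: only the from date is given.
# "2012-12-09" is a made-up example value.
update = list_records_url(from_date="2012-12-09")

if __name__ == "__main__":
    print(initial)
    print(update)
```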

Cheers,
Simeon

asd

Dec 15, 2012, 6:25:15 AM
to arxi...@googlegroups.com
Alright, thanks. That's good to know.
But I still can't manage to also list articles from before 2007.
Do you possibly have a sample request?
Cheers!

Thorsten

Dec 16, 2012, 12:51:27 PM
to arxi...@googlegroups.com

Note that the OAI date is the last modification date, and this is

http://export.arxiv.org/oai2?verb=Identify  reports the earliest date as <earliestDatestamp>2007-05-23</earliestDatestamp>

Therefore searching for earlier date ranges does not make sense. However, a List(Records|Identifiers) request without a date range will include all arXiv papers matching the other criteria. The response comes in chunks, so you have to pay attention to the resumptionToken and its completeListSize attribute.

For example
http://export.arxiv.org/oai2?verb=ListIdentifiers&metadataPrefix=arXiv
has
<resumptionToken cursor="0" completeListSize="807066">400806|10001</resumptionToken>

The completeListSize is the size of the complete list of all records in arXiv: 807066 e-prints.
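
A minimal Python sketch of following the resumptionToken, assuming the response looks like the hand-written sample below (a real response carries many more headers per chunk). The function name next_request is hypothetical. In OAI-PMH the follow-up request carries only the verb and the resumptionToken, and an empty resumptionToken element marks the last chunk.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://export.arxiv.org/oai2"

# Hand-written, shortened sample of one ListIdentifiers chunk.
SAMPLE_CHUNK = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv.org:physics/9803001</identifier>
      <datestamp>2009-10-31</datestamp>
    </header>
    <resumptionToken cursor="0" completeListSize="807066">400806|10001</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""


def next_request(response_xml, verb="ListIdentifiers"):
    """Return (url_of_next_chunk, complete_list_size), or (None, None) at the end."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None, None  # no token or empty token: nothing left to fetch
    size = token.get("completeListSize")
    query = urlencode({"verb": verb, "resumptionToken": token.text.strip()})
    return BASE_URL + "?" + query, int(size) if size else None


if __name__ == "__main__":
    print(next_request(SAMPLE_CHUNK))
```

A harvester would repeat this for every chunk until no token comes back, pausing between requests.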

To get a specific record, use e.g.
http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:physics/9803001&metadataPrefix=arXiv
and note that it has <datestamp>2009-10-31</datestamp> even though it was posted <created>1998-03-01</created>
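
To make the difference between the two dates concrete, here is a small Python sketch that reads both from a GetRecord response. The sample XML is hand-written and heavily shortened, and the namespace on the arXiv element is an assumption; the code compares local tag names only, so it does not depend on that namespace. The names local_name and record_dates are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, shortened sample; the two dates are the ones quoted above.
SAMPLE_RECORD = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv.org:physics/9803001</identifier>
        <datestamp>2009-10-31</datestamp>
      </header>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <id>physics/9803001</id>
          <created>1998-03-01</created>
        </arXiv>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def record_dates(xml_text):
    """Return the first <datestamp> and <created> values found in the response."""
    found = {}
    for element in ET.fromstring(xml_text).iter():
        name = local_name(element.tag)
        if name in ("datestamp", "created") and name not in found:
            found[name] = (element.text or "").strip()
    return found


if __name__ == "__main__":
    print(record_dates(SAMPLE_RECORD))
```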

Further OAI-PMH specific questions should be directed at the OAI mailing list.

Cheers
T.


asd

Dec 29, 2012, 1:10:30 PM
to arxi...@googlegroups.com
Ok, thanks for the additional insight about datestamps and records.
But unfortunately I still feel a bit stuck on my initial question.
I'm not sure whether I used misleading vocabulary or just don't understand you.
So let me try again this way:
Is there any way to get a list (or harvest?) of all ids/metadata related to articles
which were created/submitted in the years before 2007?

Cheers!

Thorsten

Dec 29, 2012, 1:47:27 PM
to arxi...@googlegroups.com

There is no OAI-PMH query which will give you this (and only this) list. That's the whole point of explaining the distinction between the <datestamp> (last modified) element and the <created> element and stating that OAI-PMH uses the former (as a necessity for incremental harvest, among other things).

You have to do a complete harvest and then postprocess the results at your end, filtering on the <created> element, as the previous response already alluded to. It's simple XML processing.
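
A minimal Python sketch of that postprocessing step, assuming the harvested ListRecords chunks are already available as XML text. The sample below is hand-written: the first record uses the values quoted earlier in the thread, while the second record and its dates are made up. The function name created_before is hypothetical, and tags are matched by local name so the metadata namespace does not matter.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hand-written, shortened sample of one harvested ListRecords chunk.
SAMPLE_HARVEST = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:arXiv.org:physics/9803001</identifier>
        <datestamp>2009-10-31</datestamp>
      </header>
      <metadata><arXiv><created>1998-03-01</created></arXiv></metadata>
    </record>
    <record>
      <header>
        <identifier>oai:arXiv.org:1201.0001</identifier>
        <datestamp>2012-01-04</datestamp>
      </header>
      <metadata><arXiv><created>2012-01-01</created></arXiv></metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def created_before(xml_text, cutoff):
    """Return OAI identifiers of records whose <created> date is before cutoff."""
    identifiers = []
    for record in ET.fromstring(xml_text).iter():
        if local_name(record.tag) != "record":
            continue
        # Collect the text of every child element of this record by local name.
        fields = {local_name(el.tag): (el.text or "").strip() for el in record.iter()}
        created = fields.get("created")
        if created and date.fromisoformat(created) < cutoff:
            identifiers.append(fields["identifier"])
    return identifiers


if __name__ == "__main__":
    print(created_before(SAMPLE_HARVEST, date(2007, 1, 1)))
```

Note that the filter looks at <created> only; the <datestamp> in the header is ignored on purpose.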

Cheers
T.



