Categories, dates


John Landahl

Dec 3, 2023, 10:52:41 AM
to arXiv API
Hi, I just retrieved a daily list using the OAI interface as suggested in a recent post (e.g. "https://export.arxiv.org/oai2?verb=ListIdentifiers&metadataPrefix=arXiv&from=2023-11-28"), and picked a random entry from the results (1408.2076). The OAI output gives it the categories "math" and "physics:math-ph", but when I get the metadata using the regular API it lists these categories: "math.SP", "math-ph", "math.AP", "math.MP", and "81U40, 47A40". Some questions about this:
  • Why are the lists different?
  • Which one should I consider canonical?
  • What is that last category with two alphanumeric values in it?
When I load up the abstract page for the article, I see that it is listed as "1408.2076v2" (i.e. "v2" was added to what I got from the OAI interface) and that it was last updated in 2015. Some more questions:
  • Which ID should I treat as canonical, 1408.2076 or 1408.2076v2?
  • Why would an article that was last updated in 2015 be returned in a query listing changes since 2023-11-28?
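
For a daily harvest like the one described above, the request URL can be assembled rather than hand-written. This is only a minimal sketch: the endpoint and query parameters are copied from the URL quoted in this post, the helper name `list_identifiers_url` is made up for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL quoted in the post.
OAI_ENDPOINT = "https://export.arxiv.org/oai2"


def list_identifiers_url(from_date, metadata_prefix="arXiv"):
    """Build a ListIdentifiers URL for records changed since from_date (YYYY-MM-DD).

    Hypothetical helper: it only formats the URL, it does not fetch anything.
    """
    query = urlencode(
        {
            "verb": "ListIdentifiers",
            "metadataPrefix": metadata_prefix,
            "from": from_date,
        }
    )
    return f"{OAI_ENDPOINT}?{query}"


print(list_identifiers_url("2023-11-28"))
```
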
I'm building something that's a bit similar to https://trendingpapers.com/ in that it will get and index a list of new/updated papers once a day. I could easily just ignore a paper from 2015 for my purposes, but I'm just wondering if this is a common situation or if I just happened upon an outlier.

Thanks for any help.

Carlos Souza

Dec 3, 2023, 11:59:22 AM
to arxi...@googlegroups.com
Hi John,
It's Carlos, I'm developing Trending Papers. Let me try to answer some of your questions with what I've learned by interacting with the API over the last 6 months:
- I always use the identifiers from the metadata. I can assure you that those never carry a v2, v3, or any other version suffix. It's always the bare ID, e.g. 1408.2076 (YYMM followed by a sequence number).
- An article posted on arXiv years ago (e.g. in 2015) might be updated or retracted. If it's updated, it appears in the new metadata file for that day (e.g. the 28th). (Detecting retractions is a bit harder, but entirely possible.)
- Every day that arXiv publishes a new metadata file, I see roughly 450 new papers (in Computer Science) and roughly 150 updates, so handling updates properly is important.
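
If IDs reach an index from more than one source (the abstract page shows 1408.2076v2, the metadata shows 1408.2076), one option is to normalize everything to the bare ID before storing it. The following is a minimal sketch, not an official parser: the function name `split_version` and the regular expression are assumptions, covering the `YYMM.number` form seen in this thread plus the older `archive/number` form.

```python
import re

# Assumed ID shapes: "1408.2076" (optionally "v2"), or old-style "math/0601001".
_ARXIV_ID = re.compile(
    r"^(?P<base>\d{4}\.\d{4,5}|[a-z-]+(?:\.[A-Z]{2})?/\d{7})(?:v(?P<version>\d+))?$"
)


def split_version(arxiv_id):
    """Return (bare_id, version); version is None when there is no vN suffix."""
    match = _ARXIV_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"unrecognized arXiv identifier: {arxiv_id!r}")
    version = match.group("version")
    return match.group("base"), (int(version) if version else None)


print(split_version("1408.2076v2"))
print(split_version("1408.2076"))
```

Keying the index on the bare ID means an update to an old paper overwrites the existing entry instead of creating a second one.
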

Hope it helps!
Cheers,
Carlos


Thorsten

Dec 3, 2023, 7:11:34 PM
to arXiv API

Hi John,

You are conflating OAI-PMH sets with categories. The OAI response for this record has the following header metadata:

<setSpec>math</setSpec>
<setSpec>physics:math-ph</setSpec>

Sets are often supersets of categories; for example, the set math contains all of the math subcategories.

The distinction is obvious in the full record returned by a GetRecord request.


The header info has the sets, the record metadata the individual categories.
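
The split between header sets and record categories can be sketched as follows. This is a hedged illustration only: the XML is a hand-written, trimmed sample built from the values quoted in this thread (the datestamp is a placeholder), the element names under `<metadata>` are assumed to follow the arXiv metadata format, and `sets_and_categories` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "arxiv": "http://arxiv.org/OAI/arXiv/",  # assumed namespace of the arXiv format
}

# Hand-written sample, not a real server response.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv.org:1408.2076</identifier>
        <datestamp>2023-11-28</datestamp>
        <setSpec>math</setSpec>
        <setSpec>physics:math-ph</setSpec>
      </header>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <id>1408.2076</id>
          <categories>math.SP math-ph math.AP math.MP</categories>
        </arXiv>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def sets_and_categories(xml_text):
    """Return (sets from the header, categories from the record metadata)."""
    record = ET.fromstring(xml_text).find(".//oai:record", NS)
    sets = [el.text for el in record.findall("oai:header/oai:setSpec", NS)]
    cats = record.find("oai:metadata/arxiv:arXiv/arxiv:categories", NS)
    categories = cats.text.split() if cats is not None and cats.text else []
    return sets, categories


sets, categories = sets_and_categories(SAMPLE)
print("sets:", sets)
print("categories:", categories)
```

For indexing by subject, the categories list is the fine-grained one; the sets are only the coarse groupings used for selective harvesting.
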

Best
T.