Filtering for the daily announced papers

Tatsunori Hashimoto

Oct 31, 2023, 1:31:34 PM
to arXiv API
I've been trying to build a simple client for tracking new papers on arXiv - like the email announcements but filtered and more nicely formatted.

The submittedDate query is close, but not quite right (it's wrong for papers under hold). 
I've seen other posts on the topic, but I don't think I've seen a clear 'canonical' solution to this problem.

I think the arxiv ids are actually sequential by announce date and not by submitted date, so in theory I could query for thousands of the most recent papers each day, sort by id and then cut off after trying to infer where the new papers start (is there even a way to grab the most recent papers by ID?).

I could also maintain a large DB of all the papers, so that I can keep track of which papers are actually new and which ones are not. This seems very heavyweight for an app that could be stateless if there was an announcedDate metadata field. Is there something cleaner than either of these solutions? 

I guess another solution would just be to silently drop any paper that was put on hold, but that seems fairly suboptimal.

Best regards,

Tatsu.

Jake Weiskoff

Oct 31, 2023, 2:39:56 PM
to arxi...@googlegroups.com
Hi Tatsu,

Your question is a common one, but the API isn't really intended to handle that sort of query (the category RSS handles all new papers on that day). However, what I think might work for you is to use some of the functionality from the OAI-PMH and pull a day's id listing, for example today's: 


and then build an id_list query based upon that list if you want the API's atom list. Note that this identifiers list contains all articles that have had any metadata change, not necessarily limited to those that are new or appear in the announcements. The OAI's purpose is to harvest the corpus's metadata in bulk, so it has some different parameters and no search...but it will tell you everything changed since that date. For more information on the OAI see: https://info.arxiv.org/help/oa/index.html 

Best,
-Jake
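
As a rough illustration of the two-step approach described above, the sketch below builds an OAI-PMH ListIdentifiers URL for a given date and then turns OAI identifiers into an id_list query against the API. The OAI base URL, the oai_dc metadata prefix, and the max_results setting are assumptions that do not come from this thread, and the helper names are hypothetical; check the arXiv OAI and API help pages for the current endpoints before relying on any of it.

```python
from urllib.parse import urlencode

# Assumed OAI endpoint (not given in the thread); the API endpoint matches
# the query URL quoted elsewhere in this discussion.
OAI_BASE = "https://export.arxiv.org/oai2"
API_BASE = "https://export.arxiv.org/api/query"


def list_identifiers_url(from_date):
    """Build a ListIdentifiers request for records changed since from_date.

    from_date is a YYYY-MM-DD string. oai_dc is used because every OAI-PMH
    repository is required to support it.
    """
    params = {
        "verb": "ListIdentifiers",
        "from": from_date,
        "metadataPrefix": "oai_dc",
    }
    return f"{OAI_BASE}?{urlencode(params)}"


def id_list_query_url(oai_identifiers):
    """Turn identifiers like 'oai:arXiv.org:1407.1670' into an id_list query."""
    # The arXiv ID is whatever follows the last colon of the OAI identifier.
    ids = [identifier.rsplit(":", 1)[-1] for identifier in oai_identifiers]
    # Assumption: the API pages its results, so ask for one result per ID.
    params = {"id_list": ",".join(ids), "max_results": len(ids)}
    # Keep commas and slashes readable instead of percent-encoding them.
    return f"{API_BASE}?{urlencode(params, safe=',/')}"
```

A caller would fetch the first URL, collect the `<identifier>` values from the response, and pass them to `id_list_query_url`. As noted above, the listing covers every record whose metadata changed, not only new announcements.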


Tatsunori Hashimoto

Oct 31, 2023, 3:22:12 PM
to arXiv API
Thanks Jake!

I looked at OAI-PMH but wasn't quite sure if the datestamp was tracking what I was looking for. It seems like maybe from your response that it is. Just to make sure, is the following correct?

1. The OAI-PMH datestamp is different from both submittedDate and lastUpdatedDate (which don't correspond to announcements), and actually corresponds to some more fine-grained update timestamp, which includes being announced.
2. The arxiv_ids are sequential by announcement time, so if I take the most recent contiguous sequence of arxiv_ids for a day's OAI-PMH updated papers, that is most likely the newly announced set of papers.

If those two are true, that's great news since it would give me a pretty clean way to grab all the new announced papers. 

Tatsu.
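
The "most recent contiguous sequence" heuristic from point 2 could be sketched roughly as follows. This is only an illustration under the assumptions stated in the post (IDs are sequential by announcement); the function name is hypothetical, old-style IDs such as hep-th/9901001 are simply ignored, and a run that crosses a month boundary is cut at the boundary because the sequence number restarts there.

```python
import re

# New-style arXiv IDs: YYMM, a dot, then a 4- or 5-digit sequence number.
NEW_STYLE_ID = re.compile(r"^(\d{4})\.(\d{4,5})$")


def newest_contiguous_run(arxiv_ids):
    """Return the highest-numbered unbroken run of new-style IDs, ascending."""
    by_key = {}
    for arxiv_id in arxiv_ids:
        match = NEW_STYLE_ID.match(arxiv_id)
        if match:
            # Sort key: (YYMM string, sequence number as an integer).
            by_key[(match.group(1), int(match.group(2)))] = arxiv_id
    keys = sorted(by_key)
    if not keys:
        return []

    # Walk backwards from the newest ID while each step down is exactly one.
    run = [keys[-1]]
    for key in reversed(keys[:-1]):
        yymm, number = key
        last_yymm, last_number = run[-1]
        if yymm == last_yymm and last_number - number == 1:
            run.append(key)
        else:
            break
    return [by_key[key] for key in reversed(run)]
```

Older IDs that appear in the same day's listing (metadata updates or replacements) fall outside the run and are dropped, which is the intent of the heuristic.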

Jake Weiskoff

Oct 31, 2023, 4:09:35 PM
to arxi...@googlegroups.com
That's mostly correct. It won't account for the case where the first in the sequence is really a replacement of the prior day's last paper, but I'd wager that's a sufficiently rare occurrence. It's also fairly straightforward to detect, as the 2nd element "link href=" will contain a link to the arXiv abstract page entry. This will always include the version number as part of the element. For example: 


has the line: 
    <id>http://arxiv.org/abs/2310.16913v1</id>
which shows that its entry is v1 of that paper (so it can only be a newly announced paper, or a paper that had a metadata change). In this case, it's not a metadata change, so this is the v1 announcement of the article. The first element in that manifest list:
<identifier>oai:arXiv.org:1407.1670</identifier>

is a metadata update. You can tell that from its query:
https://export.arxiv.org/api/query?id_list=1407.1670
because its <updated> element at the top is so wildly different from its <updated> and <published> lines. 
-Jake
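
The checks described above (read the version from the `<id>` element, then compare the entry's dates with the day it showed up in the listing) might look something like the sketch below. It is a heuristic only: the function names and the `max_age_days` threshold are hypothetical, and the threshold in particular is a guess at how long a submission could plausibly wait before being announced.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date, datetime

ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def _parse_timestamp(text):
    # Atom timestamps may end in "Z"; older Pythons need an explicit offset.
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def summarize_entry(entry_xml, listed_on, max_age_days=14):
    """Extract the version and age of one Atom <entry> from an API response.

    listed_on is the date the ID appeared in the OAI-PMH listing.
    max_age_days is an assumed cutoff, not a documented arXiv value.
    """
    entry = ET.fromstring(entry_xml)
    abs_url = entry.findtext("atom:id", namespaces=ATOM).strip()
    match = re.search(r"v(\d+)$", abs_url)
    version = int(match.group(1)) if match else None
    published = _parse_timestamp(entry.findtext("atom:published", namespaces=ATOM))
    age_days = (listed_on - published.date()).days
    return {
        "version": version,
        "age_days": age_days,
        # v1 and recently published: probably a new announcement.
        # v1 but published long ago: probably a metadata-only change.
        "looks_new": version == 1 and age_days <= max_age_days,
    }
```

An entry such as 1407.1670, whose dates are years older than the listing date, would come back with `looks_new` set to `False`.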

Tatsunori Hashimoto

Oct 31, 2023, 4:53:32 PM
to arXiv API
Is there a way to know for sure if an update is a metadata update or an announcement? A paper replacement can be detected from the v1->v2 version change, but it seems like at least a few papers on the feed have metadata changes with no version change. It seemed from your email that detecting metadata changes is a bit of guesswork (looking at submitted date vs current date, concluding that submission holds don't take longer than X days).

Tatsu.

Jake Weiskoff

Oct 31, 2023, 5:14:35 PM
to arxi...@googlegroups.com
There really isn't a way for you to determine that from the feeds. I can tell what happened via back-end data, but that wouldn't be included in any of the feeds as that's not within scope of what they do. The search API is just that; the OAI's purpose is to provide a complete copy of the metadata for local analysis (so anything that's included in the manifest would be considered an update to the new canonical version). 
If you really only wanted to know what was NEW-new, you'd be better off querying against the RSS feeds, which are updated at the time of announcement to include only replacements and new submissions. But there's no history available there, and you're tied to the three supported formats for RSS in arXiv: https://info.arxiv.org/help/rss.html

-Jake

Carlos Souza

Oct 31, 2023, 5:32:46 PM
to arxi...@googlegroups.com
I accomplished that using OAI-PMH and keeping a local copy of the full metadata file. Every day, my system downloads the new metadata and compares it against the local copy, discovering what’s new, what should be updated, and what should be removed.

The application is growing pretty fast, would love to hear what you guys think:

Cheers!
Carlos
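
The compare-against-a-local-copy idea can be illustrated with a small diff over two snapshots. This is a minimal sketch, not the system described above: it assumes each snapshot is a dict mapping arXiv ID to some comparable fingerprint of its metadata (for example the OAI-PMH datestamp), and the function name is hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two {arxiv_id: fingerprint} snapshots of the metadata.

    Returns the IDs that are new, changed, or gone in `current`.
    """
    new_ids = sorted(current.keys() - previous.keys())
    removed_ids = sorted(previous.keys() - current.keys())
    updated_ids = sorted(
        arxiv_id
        for arxiv_id in current.keys() & previous.keys()
        if current[arxiv_id] != previous[arxiv_id]
    )
    return {"new": new_ids, "updated": updated_ids, "removed": removed_ids}
```

The trade-off raised earlier in the thread still applies: this needs persistent state between runs, in exchange for an unambiguous answer about which IDs have not been seen before.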
