Creating a Personalized arXiv Article Recommender with Reexpress (introspectable, uncertainty-aware on-device LLM data analysis)

Allen Schmaltz

Jan 22, 2024, 7:29:41 PM
to arXiv API
Given the large increase in daily submissions to arXiv in Computer Science categories (such as cs.LG and cs.CL), reading the daily RSS feed or email has become largely untenable without some degree of filtering.

However, how should we approach filtering?

Ideally, every researcher would have their own recommendation engine tuned to their research interests and easily updatable as their research programs evolve. Ideally, every researcher would also retain control over their model, its training data, and its training process, as a counter to the biases and conflicts of interest typical of third-party aggregators. Our first inclination may be to use an off-the-shelf large language model (LLM), but that introduces additional challenges: such models are of unknown reliability, lack robust uncertainty quantification, and can require non-trivial time and effort to train, wrap in a usable interface, re-train, and maintain.

It turns out that the on-device, no-code Reexpress (https://re.express/) macOS application (for Apple silicon) can quickly be set up to serve as an effective classifier of arXiv abstracts. (It's currently available on the US Mac App Store, and we have plans to expand to the other country/region stores soon.)

We provide a YouTube tutorial here: https://youtu.be/k1H3GcDdAfs

The classifier can be bootstrapped from a modest number of abstracts downloaded from the search API, so the burden on the arXiv servers remains low. Then, each day, a simple script can be run to download and format the RSS feed for one or more arXiv categories. We provide simple Python scripts to do this. No coding is necessary beyond updating some input arguments, and all of the heavy lifting is done by the no-code Reexpress app. The scripts are available here:

https://github.com/ReexpressAI/Example_Data/tree/main/tutorials/tutorial5_arxiv_recommender
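For readers curious what the bootstrapping step involves, here is a minimal sketch of querying the arXiv search API for one category and pulling out titles and abstracts. It is not the tutorial's script. The function names `build_query_url` and `parse_atom_entries` and the output field names are illustrative assumptions, and the parser only assumes the standard Atom layout (`entry`, `id`, `title`, `summary`) that the API returns.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Atom namespace used by the arXiv search API response.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(category, start=0, max_results=100):
    """Build a search API URL for one arXiv category (hypothetical helper)."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(params)


def _clean(text):
    """Collapse the line breaks and repeated spaces found in feed text."""
    return " ".join(text.split())


def parse_atom_entries(xml_text):
    """Return a list of dicts with id, title, and abstract for each entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        entries.append({
            "id": _clean(entry.findtext("atom:id", default="", namespaces=ATOM_NS)),
            "title": _clean(entry.findtext("atom:title", default="", namespaces=ATOM_NS)),
            "abstract": _clean(entry.findtext("atom:summary", default="", namespaces=ATOM_NS)),
        })
    return entries
```

To use it, fetch the URL from `build_query_url("cs.CL", max_results=100)` with `urllib.request.urlopen`, decode the response, and pass the text to `parse_atom_entries`. Keeping `max_results` modest and pausing between requests keeps the load on the arXiv servers low.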

Once the model is trained, the core capabilities of Reexpress are unlocked: reliable uncertainty quantification, document- and feature-level dense search, advanced hybrid semantic search, and auto-visualizations of the data distributions. All of these capabilities can be used to quickly filter and search the daily arXiv feeds, and to inspect, update, and refine the underlying classifier and its training data. And amazingly, all of the processing is done directly on your Mac.

It would be great to hear your feedback, and don't hesitate to contact us if you have any questions!

Best,

Allen Schmaltz, PhD

Allen Schmaltz

Feb 1, 2024, 7:55:06 PM
to arXiv API
The new arXiv RSS format introduced today (https://blog.arxiv.org/2024/01/31/attention-arxiv-users-re-implemented-rss/) works well and simplifies parsing the output. We have updated the arXiv RSS formatter script for use with Reexpress. The updated 'preprocess_arxiv_from_rss.py' script is necessary to correctly parse the new format, and it is available in the GitHub tutorial repo:

https://github.com/ReexpressAI/Example_Data/tree/main/tutorials/tutorial5_arxiv_recommender
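As a rough illustration of the daily step (not the actual 'preprocess_arxiv_from_rss.py' script), the sketch below reads RSS 2.0 `item` elements and writes one JSON object per line. The function names and the output field names (`id`, `title`, `document`) are placeholders rather than the Reexpress input schema. The handling of an "Abstract:" marker inside the description is an assumption about the feed layout, and the code falls back to the full description when the marker is absent.

```python
import json
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Extract link, title, and abstract text from RSS 2.0 <item> elements."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        description = item.findtext("description", default="")
        # Assumption: the abstract may follow an "Abstract:" marker.
        # If the marker is missing, keep the whole description.
        _, marker, tail = description.partition("Abstract:")
        abstract = tail if marker else description
        items.append({
            "id": item.findtext("link", default="").strip(),
            "title": " ".join(item.findtext("title", default="").split()),
            "document": " ".join(abstract.split()),
        })
    return items


def to_json_lines(items):
    """Serialize the parsed items as JSON lines (placeholder field names)."""
    return "\n".join(json.dumps(item) for item in items)
```

The script in the tutorial repo defines the format that Reexpress actually expects, so treat this only as a picture of the kind of parsing involved.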

Best,

Allen