Building app with ArXiv.


Fangrui Liu

Jun 19, 2023, 8:31:52 AM
to arXiv API
Hi, 

We are building an article search app with LangChain and MyScale. We want to get a list of IDs in the arXiv database and perform a semantic search over their abstracts. Other metadata, such as titles and authors, would be useful too. Where or how can I get this information from the arXiv API?

Best

Fangrui

Eric Lease Morgan

Jun 19, 2023, 8:53:16 AM
to arxi...@googlegroups.com


On Jun 15, 2023, at 11:22 PM, Fangrui Liu <lfr1...@gmail.com> wrote:

> We are building an article search app with LangChain and MyScale. We want to get a list of IDs in arxiv database and perform a semantic search with their abstract. Also other metadata like titles and authors are useful too. May I ask where or how can I get those information from arXiv API?
>
> --
> Fangrui


If I understand your question correctly, then I have found the arXiv dataset on Kaggle to be the most useful. [1] Every once in a while I grab the dataset, run it through a system of my own design, and provide access to the resulting index through a sluggish, ugly, but functional interface. [2, 3] Example queries include:

* "expert systems" AND 2022 - https://bit.ly/3CBRpfJ
* category:"cs.DL" AND (date:2022 OR date:2021) - https://bit.ly/3NAXRtH
* author:thain - https://bit.ly/43JZ47N


[1] Kaggle - https://bit.ly/3NmqSIq
[2] system of my own design - https://github.com/ericleasemorgan/arxiv-index
[3] sluggish and ugly interface - https://distantreader.org/stacks/indexes/arxiv
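
For anyone who wants to read the Kaggle dump directly, here is a minimal sketch. It assumes the dataset is a JSON-lines file (one record per line) with fields named id, title, authors, abstract, and categories; the file name in the usage comment and the helper name iter_arxiv_records are illustrative assumptions, so check them against the copy you download.

```python
import json
from typing import Iterable, Iterator


def iter_arxiv_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield a trimmed-down dict for each JSON-lines metadata record.

    Field names (id, title, authors, abstract, categories) are assumed,
    not confirmed by this thread.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {
            "id": record.get("id"),
            # Collapse the hard line breaks that often appear in titles/abstracts
            "title": " ".join((record.get("title") or "").split()),
            "authors": record.get("authors"),
            "abstract": " ".join((record.get("abstract") or "").split()),
            "categories": (record.get("categories") or "").split(),
        }


# Usage (the file name is an assumption):
# with open("arxiv-metadata-oai-snapshot.json", encoding="utf-8") as fh:
#     for rec in iter_arxiv_records(fh):
#         print(rec["id"], rec["title"])
```

Streaming line by line avoids loading the whole file into memory, which matters if the dump is large.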

--
Eric Lease Morgan
Navari Family Center for Digital Scholarship
University of Notre Dame


Lukas Schwab

Jun 19, 2023, 12:02:09 PM
to arxi...@googlegroups.com
Since you're using LangChain — unless you have reason to build your own search index — the built-in arxiv tool calls the arXiv API: https://python.langchain.com/docs/modules/agents/tools/integrations/arxiv
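
If you would rather call the arXiv API directly, the following is a rough sketch using only the standard library. It assumes the public query endpoint at export.arxiv.org/api/query, which returns an Atom feed; the helper names build_query_url and parse_feed are made up for illustration, and the query string in the usage comment is just an example.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace
API_URL = "http://export.arxiv.org/api/query"  # assumed public endpoint


def build_query_url(search_query, start=0, max_results=10):
    """Build a query URL, e.g. search_query='cat:quant-ph'."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_feed(xml_text):
    """Extract id, title, abstract, and authors from an Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall(ATOM + "entry"):
        abs_url = entry.findtext(ATOM + "id", default="")
        papers.append({
            # The entry id is an abstract URL; keep only the part after /abs/
            "id": abs_url.rsplit("/abs/", 1)[-1],
            "title": " ".join(entry.findtext(ATOM + "title", default="").split()),
            "abstract": " ".join(entry.findtext(ATOM + "summary", default="").split()),
            "authors": [
                author.findtext(ATOM + "name", default="")
                for author in entry.findall(ATOM + "author")
            ],
        })
    return papers


# Usage (performs a network request; pause between calls to be polite):
# import urllib.request
# with urllib.request.urlopen(build_query_url("cat:quant-ph")) as resp:
#     for paper in parse_feed(resp.read()):
#         print(paper["id"], paper["title"])
```

Paging works by increasing start in steps of max_results until a response contains no entries.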


Elbek Keskinoglu

Jun 20, 2023, 11:39:28 AM
to arXiv API
Hi,

Here is the code I have used:

import requests
from bs4 import BeautifulSoup
import datetime

# Set the URL parameters
today = datetime.date.today()
params = {
    'advanced': '',
    'terms-0-operator': 'AND',
    'terms-0-term': '',
    'terms-0-field': 'title',
    'classification-physics': 'y',
    'classification-physics_archives': 'quant-ph',
    'classification-include_cross_list': 'include',
    'date-year': '',
    'date-filter_by': 'date_range',
    'date-from_date': '1992-10-7',
    'date-to_date': today.isoformat(),  # YYYY-MM-DD string
    'date-date_type': 'announced_date_first',
    'abstracts': 'hide',
    'size': '200',  # Maximum number of articles per page
    'order': '-announced_date_first',  # Order by date, newest first
}

# Initialize the starting index and batch size
start_index = 0
batch_size = 200

# Open the output file in append mode (the directory must already exist)
with open('./pre-pre-pre/arxiv_ids.txt', 'a') as f:
    # Loop through all batches and extract the arXiv IDs
    while True:
        # Update the starting index in the URL parameters
        params['start'] = start_index
       
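        # Make a GET request to the URL and get the HTML content
        response = requests.get('https://arxiv.org/search/advanced', params=params, timeout=30)
        print(response.status_code)
        # Raise on a failed request instead of mistaking it for the last page
        response.raise_for_status()
        html_content = response.content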
        # Parse the HTML content using Beautiful Soup
        soup = BeautifulSoup(html_content, 'html.parser')
       
        # Find all the elements with the tag 'p' and the class 'list-title is-inline-block'
        arxiv_elements = soup.find_all('p', {'class': 'list-title is-inline-block'})
       
        # Loop through each element and extract the arXiv ID
        for element in arxiv_elements:
            arxiv_id = element.find('a').text.strip()
            # Write the arXiv ID to the output file
            f.write(arxiv_id + '\n')
       
        # If the number of elements found is less than the batch size, we have reached the end of the results
        if len(arxiv_elements) < batch_size:
            break
       
        # Increase the starting index for the next batch
        start_index += batch_size

Cheers,
E. J. Keskinoglu


Paul Ginsparg

Jun 21, 2023, 12:01:26 PM
to arxi...@googlegroups.com
For retrieval purposes, it might also be useful to be aware of semantic embeddings of arXiv titles and abstracts, described here:
and available here:
(also via Hugging Face and Kaggle, as linked in the above Twitter link).
They are supposed to be updated periodically and can be used to facilitate retrieval.

There are also ongoing attempts to calculate embeddings for chunked versions of the entire full-text database (pdf2txt conversions as well as LaTeX source), both to facilitate retrieval and to provide context for automated chatbot queries (i.e., prompts, where such context tends to suppress hallucinatory behavior).
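
To illustrate how precomputed embeddings can drive retrieval, here is a small sketch of cosine-similarity ranking in plain Python. The dict-of-vectors layout, the function names, and the toy IDs are assumptions for illustration only; the actual embedding files may be organized differently, and a real index would use a vector database or an array library rather than Python lists.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, embeddings, k=3):
    """Return the k (arxiv_id, score) pairs most similar to query_vec.

    `embeddings` maps an ID to its embedding vector (hypothetical layout).
    """
    scored = [(cosine(query_vec, vec), arxiv_id) for arxiv_id, vec in embeddings.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(arxiv_id, score) for score, arxiv_id in scored[:k]]


# Usage with toy 2-dimensional vectors and placeholder IDs:
# toy = {"paper-a": [1.0, 0.0], "paper-b": [0.0, 1.0], "paper-c": [1.0, 1.0]}
# print(top_k([1.0, 0.0], toy, k=2))
```

The query vector has to come from the same embedding model as the stored vectors, otherwise the similarity scores are meaningless.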