How to create a database of all papers and their abstracts?

Udayshankar Menon

Nov 25, 2019, 7:11:11 PM

to arXiv API

Hi,

I am trying to use the arXiv data to create an application using Natural Language Processing. Towards this, I am trying to create a dataframe with paper titles and their abstracts. Any suggestions as to which tools I should use to obtain the abstracts of all papers?

I was initially considering creating a web crawler using Scrapy. Would this be a good approach, or is anyone familiar with the arXiv API and whether it can be used for this task?

I'm new to web scraping and still learning NLP, so apologies in advance if my question is too basic. I tried to run the Python example code in the API manual and it did not work for me. All suggestions are welcome.

Best,

Uday

Thorsten

Nov 25, 2019, 7:15:51 PM

to arXiv API

please search the archives before posting. the question of bulk downloads has been raised many times before.

to retrieve all metadata you should use OAI-PMH. see https://arxiv.org/help/bulk_data
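
For readers new to OAI-PMH, here is a minimal sketch of how a harvesting request is put together. The verb `ListRecords`, the `metadataPrefix` and `set` arguments, and the `resumptionToken` paging argument come from the OAI-PMH protocol itself. The base URL below is an assumption and should be checked against the bulk_data help page linked above. The function name `list_records_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: replace with the OAI-PMH endpoint currently listed on the
# arXiv bulk data help page; this value is only a placeholder for the sketch.
BASE_URL = "http://export.arxiv.org/oai2"


def list_records_url(base_url=BASE_URL, metadata_prefix="oai_dc",
                     set_spec=None, resumption_token=None):
    """Build the URL for one OAI-PMH ListRecords request (hypothetical helper)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: a follow-up
        # request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        # First request: ask for records in a given metadata format.
        # oai_dc (Dublin Core) is the format every OAI-PMH repository must offer.
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            # Optional: restrict the harvest to one set defined by the repository.
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(list_records_url())
    print(list_records_url(resumption_token="abc|1001"))
```

Each response holds one page of records. When more pages exist, the response includes a resumption token, which is passed back in the next request until no token is returned.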

you have a cornell.edu email address. just walk over to the arXiv team at Cornell and talk with them

Cheers

T.

Eric Lease Morgan

Nov 26, 2019, 7:30:34 AM

to arxi...@googlegroups.com

On Nov 25, 2019, at 7:02 PM, Udayshankar Menon <uk...@cornell.edu> wrote:

> I am trying to use the arXiv data to create an application using Natural Language Processing. Towards this, I am trying to create a dataframe with paper titles and their abstracts. Any suggestions as to which tools I should use to obtain the abstracts of all papers? I was initially considering creating a web crawler using Scrapy. Would this be a good approach, or is anyone familiar with the arXiv API and whether it can be used for this task? I'm new to web scraping and still learning NLP, so apologies in advance if my question is too basic. I tried to run the Python example code in the API manual and it did not work for me. All suggestions are welcome.

Based on my reading of your use case, I agree with the previous reply. To elaborate:

* use OAI-PMH to explore the availability of metadata
* design a data frame accordingly
* use OAI-PMH to fill the data frame (start small)
* manipulate the data in the data frame to do your NLP
* go back two steps and make your data frame larger

It will be relatively easy to fill a data frame using OAI-PMH, but in my experience, getting the actual content (PDF files) is more difficult.
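
As a rough illustration of the "start small" step, the sketch below turns one OAI-PMH `ListRecords` response into rows of title and abstract. It assumes the `oai_dc` (Dublin Core) format and assumes the abstract is carried in `dc:description`, which should be verified against a real response. The sample XML, the identifier, and the function name `parse_records` are invented for the example. No network request is made.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response, trimmed to the elements the parser reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:0000.00000</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Made-Up
            Title</dc:title>
          <dc:description>A made-up abstract.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>hypothetical-token</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return (rows, resumption_token) for one ListRecords page."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iterfind(".//oai:record", NS):
        if record.find("oai:metadata", NS) is None:
            continue  # deleted records carry a header but no metadata
        rows.append({
            "id": record.findtext("oai:header/oai:identifier", "", NS).strip(),
            # Collapse line breaks and repeated spaces inside the text.
            "title": " ".join(record.findtext(".//dc:title", "", NS).split()),
            "abstract": " ".join(
                record.findtext(".//dc:description", "", NS).split()),
        })
    token = root.findtext(".//oai:resumptionToken", "", NS).strip()
    return rows, (token or None)


if __name__ == "__main__":
    rows, token = parse_records(SAMPLE)
    print(rows, token)
```

The resulting list of dicts can be passed directly to `pandas.DataFrame(rows)`. To make the data frame larger, repeat the request with the returned token and extend `rows` until the token comes back empty.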

--
Eric Lease Morgan
emo...@nd.edu
