On Nov 25, 2019, at 7:02 PM, Udayshankar Menon <uk...@cornell.edu> wrote:
> I am trying to use the arXiv data to create an application using Natural Language Processing. To that end, I am trying to create a dataframe with paper titles and their abstracts. Any suggestions as to what tools I should use to obtain the abstracts of all papers? I was initially considering creating a web crawler using Scrapy. Would this be a good approach, or is anyone familiar with the arXiv API and whether it can be used to accomplish this task? I'm new to web scraping and learning NLP, so apologies in advance if my question is not complicated. I tried to run the Python example code in the API manual and it did not work for me. All suggestions are welcome.
Based on my reading of your use case, I agree with the previous reply. To elaborate:
* use OAI-PMH to explore the availability of metadata
* design a data frame accordingly
* use OAI-PMH to fill the data frame (start small)
* manipulate the data in the data frame to do your NLP
* go back two steps and make your data frame larger
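
For what it is worth, the "fill the data frame" step might look something like the sketch below. It is only a sketch under a few assumptions of mine: the endpoint URL (please check arXiv's current OAI-PMH documentation, since it may have moved), the set name "cs", the choice of the oai_dc metadata format, and the guess that the first dc:description element holds the abstract. The function names parse_records and harvest are made up for illustration. It uses only the Python standard library.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: this was the arXiv OAI-PMH endpoint at the time of writing;
# verify it against arXiv's current documentation before use.
BASE_URL = "http://export.arxiv.org/oai2"

# XML namespaces used by OAI-PMH responses and Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_records(xml_data):
    """Parse one ListRecords response (str or bytes).

    Returns (rows, resumption_token), where rows is a list of
    {"title": ..., "abstract": ...} dicts and resumption_token is None
    when there are no further pages.
    """
    root = ET.fromstring(xml_data)
    rows = []
    for record in root.iterfind(".//oai:record", NS):
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        # Assumption: the first dc:description is the abstract.
        abstract = record.findtext(".//dc:description", default="", namespaces=NS)
        if not title:
            # Deleted records have a header but no metadata; skip them.
            continue
        rows.append({
            "title": " ".join(title.split()),        # collapse line breaks
            "abstract": " ".join(abstract.split()),
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return rows, (token.strip() or None)


def harvest(set_spec="cs", max_pages=1, pause=5.0):
    """Fetch up to max_pages pages of records; start small."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": set_spec}
    rows = []
    for _ in range(max_pages):
        url = BASE_URL + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            page, token = parse_records(response.read())
        rows.extend(page)
        if token is None:
            break
        # Follow-up requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token}
        time.sleep(pause)  # be polite to the server between requests
    return rows
```

Calling harvest(max_pages=1) should return a list of dictionaries, and if you have pandas installed, pandas.DataFrame(rows) turns that list into a data frame with "title" and "abstract" columns. Raise max_pages once the small run looks right.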
Filling a data frame using OAI-PMH will be relatively easy, but in my experience, getting the actual content (PDF files) is more difficult.
--
Eric Lease Morgan
emo...@nd.edu