Scraping link titles from any site?

Kyle

Jan 10, 2017, 5:19:48 PM

to beautifulsoup

Hi guys,

I'm new to the concept of web scraping and have a question I was hoping someone could answer. I'm interested in using BeautifulSoup to extract link titles from different websites. However, I'm noticing that the links on different sites are in very different formats. For example, if I visit news.google.com and use

titles = soup.find_all('span', class_='titletext')  # Gets titles in page

I am able to extract nearly all headline titles from the page. However, if I try to do the same thing at www.yahoo.com, it does not work. Is there a generalized method that would work for most sites, or does web scraping need to be tailored every time to the site at hand? Thanks.

J. Albert Bowden

Jan 10, 2017, 9:28:03 PM

to beautifulsoup

Kyle,

There are generalized methods and/or best practices for sure, but you hit the nail on the head: scraping needs to be tailored for each site. At least a little bit.

I'm speaking from my scraping experience, but also from having been a web developer for a while now: sites use whatever markup they decide to build with, and you are left at their mercy.

Sites that produce valid HTML documents and use consistent patterns/libraries can make scraping much less painful, and almost achieve the utopia of not having to tweak for each site, but unfortunately they are few and far between.

From the code you posted, you have the "generalized method" already... just keep that snippet, and the next time you want to scrape, inspect the site's markup, note where it differs from what your code expects, and edit your code accordingly.
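As a rough sketch of what "keep the snippet and edit it" can look like: the tag and class become parameters, so only the arguments change per site. The function name extract_titles and both HTML samples are made up for illustration; only the span/titletext pair comes from the question, and real pages will look different.

```python
from bs4 import BeautifulSoup


def extract_titles(html, tag, css_class):
    """Return the stripped text of every <tag class="css_class"> element."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.find_all(tag, class_=css_class)]


# Made-up markup standing in for two different sites (not real page source).
SITE_A = '<a href="/1"><span class="titletext">First headline</span></a>'
SITE_B = '<h3 class="story-title"><a href="/2">Second headline</a></h3>'

# Same function, different arguments after inspecting each site's markup.
titles_a = extract_titles(SITE_A, "span", "titletext")
titles_b = extract_titles(SITE_B, "h3", "story-title")
```

Running the first site's arguments against the second sample returns an empty list, which is the symptom described in the question.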

Let me be clear that I am giving an extremely oversimplified approach... typically you'll have to make other adjustments aside from the site's markup, but those are for another time/question.

tl;dr: keeping your scraping snippets/patterns/code as modular as possible will give you some generalized methods, but 99% of the time in my experience, at some point you are going to have to get hands-on, at least a little bit, for each site you scrape.
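One way to keep things modular, offered as a sketch rather than a recipe: store a CSS selector per site and fall back to a generic "every link with text" guess for sites you have not inspected yet. The SELECTORS table, the link_titles helper, the example.org entry, and the sample page are all assumptions for illustration; the fallback will usually pick up navigation links too, so it is only a starting point.

```python
from bs4 import BeautifulSoup

# Per-site CSS selectors, filled in by hand after inspecting each page.
# "span.titletext" mirrors the question; the example.org entry is hypothetical.
SELECTORS = {
    "news.google.com": "span.titletext",
    "example.org": "h3.story-title a",
}

# Generic guess for sites with no entry yet: any anchor that has an href.
FALLBACK_SELECTOR = "a[href]"


def link_titles(html, site):
    """Return link titles using the site's selector, or the generic fallback."""
    soup = BeautifulSoup(html, "html.parser")
    selector = SELECTORS.get(site, FALLBACK_SELECTOR)
    titles = [el.get_text(" ", strip=True) for el in soup.select(selector)]
    return [t for t in titles if t]  # drop links with no visible text


# Made-up page used only to exercise both code paths.
PAGE = (
    '<h3 class="story-title"><a href="/a">Known layout</a></h3>'
    '<a href="/b"><img src="logo.png"></a>'
    '<a href="/c">Plain link</a>'
)

known = link_titles(PAGE, "example.org")      # uses the tailored selector
unknown = link_titles(PAGE, "unknown.test")   # uses the generic fallback
```

Adding a new site then means adding one line to the table instead of writing a new script.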

Cheers,

Albert
