Crawling for daily news articles only

jai ven

May 10, 2014, 12:30:03 PM
to scrapy...@googlegroups.com
Hello,

I need guidance of any form. What I'm trying to do is scrape a news website such as bbc.co.uk for its daily updates only. Is there a way to do that in Scrapy without having to crawl the whole website?

Nikolaos-Digenis Karagiannis

May 11, 2014, 5:15:05 AM
to scrapy...@googlegroups.com
In their URLs they seem to identify articles by a number matching:
/(\d+)-[^/]*$|-(\d+)$
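For illustration, that pattern can be exercised against a couple of URLs in the two shapes it accepts (the example URLs below are made up for the demonstration):

```python
import re

# The identifier pattern from above: a number either right after a slash
# (followed by a slug) or at the very end of the URL path.
ARTICLE_ID = re.compile(r"/(\d+)-[^/]*$|-(\d+)$")

def article_id(url):
    """Return the numeric article identifier, or None if the URL has none."""
    m = ARTICLE_ID.search(url)
    if not m:
        return None
    # Only one of the two alternatives can match; take whichever group is set.
    return m.group(1) or m.group(2)

print(article_id("http://www.bbc.com/news/world-europe-27360146"))  # -> 27360146
print(article_id("http://www.bbc.com/news/10628494-some-slug"))     # -> 10628494
print(article_id("http://www.bbc.com/news"))                        # -> None
```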
Provided you use this identifier when you store articles in a database, you can write a spider middleware that queries the database to determine whether you already have the article, and allow the request only if you don't. To speed up rejecting already-seen articles, you can cache (in open_spider()) all the article identifiers from the previous day. For the complement, approving articles for scraping, you'll need a workaround: e.g., I would guess their identifier is generated by a sequence, so sort by it and don't look further back in the past than a few days before the current session.
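A minimal sketch of that filtering logic, with an in-memory set standing in for the database query (in a real project this would live in a Scrapy spider middleware, with seen_ids preloaded from the database in open_spider(); the class and method names here are just illustrative):

```python
import re

ARTICLE_ID = re.compile(r"/(\d+)-[^/]*$|-(\d+)$")

class DuplicateArticleFilter:
    """Drops requests whose article identifier was already scraped.

    Stand-in for a Scrapy spider middleware: process_spider_output()
    would receive the requests a spider yields, and seen_ids would be
    loaded from the database when the spider opens.
    """

    def __init__(self, seen_ids):
        self.seen_ids = set(seen_ids)

    def allow(self, url):
        m = ARTICLE_ID.search(url)
        if not m:
            return True  # not an article URL; let it through for crawling
        article_id = m.group(1) or m.group(2)
        if article_id in self.seen_ids:
            return False  # already stored, skip
        self.seen_ids.add(article_id)
        return True

    def process_spider_output(self, urls):
        # Mirrors the Scrapy hook: pass through only requests we still want.
        return [u for u in urls if self.allow(u)]

flt = DuplicateArticleFilter({"27360146"})
print(flt.process_spider_output([
    "http://www.bbc.com/news/world-europe-27360146",  # already seen, dropped
    "http://www.bbc.com/news/world-asia-27361234",    # new, kept
]))
```

The second URL passes through once and is then remembered, so a repeat request for it would also be dropped.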

Also, look at http://www.bbc.com/news/10628494: you can parse the publication date from the feed. Depending on the site, you may miss some articles if the feed states erroneous dates.
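A sketch of date-based filtering on a feed (the feed snippet below is a made-up example in the standard RSS 2.0 shape; a real crawl would fetch the XML from one of the feeds listed on that page):

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Made-up RSS 2.0 snippet; real BBC feeds use the same item/pubDate layout.
FEED = """<rss version="2.0"><channel>
  <item>
    <title>Fresh story</title>
    <link>http://www.bbc.com/news/world-europe-27360146</link>
    <pubDate>Sat, 10 May 2014 09:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Old story</title>
    <link>http://www.bbc.com/news/world-asia-27300000</link>
    <pubDate>Thu, 08 May 2014 09:00:00 GMT</pubDate>
  </item>
</channel></rss>"""

def links_since(feed_xml, cutoff):
    """Return links of items whose pubDate is at or after `cutoff`."""
    links = []
    for item in ET.fromstring(feed_xml).iter("item"):
        # pubDate uses the RFC 2822 format that email.utils can parse.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        if published >= cutoff:
            links.append(item.findtext("link"))
    return links

cutoff = datetime(2014, 5, 9, tzinfo=timezone.utc)
print(links_since(FEED, cutoff))  # only the fresh story's link
```

In a spider, the surviving links would then be yielded as requests, so only articles published since the last run get crawled.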