broad crawl


Jordi Llonch

Apr 13, 2013, 6:56:25 PM4/13/13
to scrapy...@googlegroups.com
Hello,

I am running a broad crawl using scrapy. 

The spider.parse method yields a Request for every scraped link, which is good for any newly discovered URL.

Already-visited Requests are rejected by the Scheduler's duplicate filter.

What is the best way to avoid yielding visited Requests at the spider.parse stage?

Thanks,

Bruno Lima

Apr 13, 2013, 7:54:25 PM4/13/13
to scrapy...@googlegroups.com
The first idea that comes to mind is to keep a dict (or set) keyed by URL.
If the key already exists, don't yield the Request.
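A minimal sketch of that idea (the class and names are illustrative; with Scrapy you would guard each `yield Request(...)` in parse the same way):

```python
# Track URLs already yielded so parse() never produces the same
# Request twice. A plain set is enough; a dict keyed by URL works too.

class SeenFilter:
    def __init__(self):
        self.seen = set()

    def is_new(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        if url in self.seen:
            return False
        self.seen.add(url)
        return True

# In spider.parse you would guard each yield:
#     if self.seen_filter.is_new(link):
#         yield Request(link, callback=self.parse)
```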

Bruno Seabra Mendonça Lima
--------------
http://about.me/bruno.seabra



Andrés Moreira

Apr 13, 2013, 8:16:51 PM4/13/13
to scrapy...@googlegroups.com
I think the standard way is to implement it the way Scrapy's own dupefilter does (https://github.com/scrapy/scrapy/blob/master/scrapy/dupefilter.py#L25), but for broad crawls I would suggest inheriting from BaseDupeFilter and backing it with an efficient persistent set on disk+memory. There are good write-ups on fast data structures for Python; Bloom filters or ZODB persistent BTrees are worth looking at.
Of course, you can do the same in the parse method with an efficient data structure, but I prefer to follow the standard way.
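Since Andrés mentions Bloom filters: they use a fixed amount of memory regardless of how many URLs you have seen, at the cost of occasional false positives (a URL wrongly reported as seen, so it is skipped). This is a toy sketch, not a production filter; the size and hash count are illustrative, and it is not tied to any Scrapy API:

```python
import hashlib

# Toy Bloom filter: membership test in fixed memory with a small,
# tunable false-positive rate. Several bit positions are derived for
# each item; an item is "seen" only if all of its bits are set.

class BloomFilter:
    def __init__(self, size_bits=2 ** 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from one sha1 digest
        # (20 bytes, consumed 4 bytes per position).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

To use it as a dupefilter you would wrap it in a class with Scrapy's `request_seen()` contract; note that a Bloom filter can never report a truly new URL as unseen incorrectly in the other direction, so no page is crawled twice, but a small fraction of new pages may be missed.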

Also, I suggest you take a look at Scrapy's Broad Crawls documentation.

I hope this helps.

Andrés

--
Andrés Moreira.
elkpi...@gmail.com

Jordi Llonch

Apr 13, 2013, 9:04:31 PM4/13/13
to scrapy...@googlegroups.com
I am using the dupefilter in the scheduler. How can the spider get access to the dupefilter, or even the Scheduler?



2013/4/14 Andrés Moreira <elkpi...@gmail.com>

Andrés Moreira

Apr 13, 2013, 9:45:02 PM4/13/13
to scrapy...@googlegroups.com

You can instantiate the dupefilter class from the spider (scrapy.dupefilter.*); check the scheduler code for an example of how it is used.
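A hedged sketch of that: the spider holds its own dupefilter instance and consults it before yielding. With Scrapy installed, that instance could be scrapy.dupefilter.RFPDupeFilter (whose `request_seen()` takes a Request object); the stand-in below keeps the same contract on plain URLs so the example is self-contained:

```python
# Stand-in with RFPDupeFilter's request_seen() contract, operating on
# plain URL strings instead of Request objects.

class StubDupeFilter:
    def __init__(self):
        self._seen = set()

    def request_seen(self, url):
        """Return True if the URL was offered before, recording it if not."""
        if url in self._seen:
            return True
        self._seen.add(url)
        return False

class BroadSpider:
    name = "broad"

    def __init__(self):
        self.dupefilter = StubDupeFilter()

    def parse(self, links):
        # In a real spider, parse(response) would extract the links itself;
        # here they are passed in directly to keep the sketch runnable.
        for url in links:
            if not self.dupefilter.request_seen(url):
                yield url  # with Scrapy: yield Request(url, callback=self.parse)
```

One caveat: a filter instantiated this way keeps its own state, separate from the scheduler's dupefilter, so requests injected from elsewhere (e.g. start_urls) are still deduplicated only by the scheduler.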
