Is it possible to achieve focused crawling using Scrapy?


charlieS

Oct 25, 2010, 2:39:53 PM
to scrapy-users
Hi,

I was wondering if Scrapy lets you handle and prioritize the links
that the spiders collect during a crawl cycle, in order to achieve
focused crawling. At the moment I am experimenting with
SPIDER_MIDDLEWARES, trying to find a way to redirect the crawlers
to specific urls.

Is there a way to prioritize extracted links? Any ideas or examples are
welcome!

Thanks!

Pablo Hoffman

Oct 26, 2010, 11:52:35 AM
to scrapy...@googlegroups.com
What about the Request priority attribute?
http://doc.scrapy.org/topics/request-response.html#scrapy.http.Request

A request with higher priority will be executed before a request with lower
priority. Spider middlewares can adjust the Request.priority attribute.
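
For example, here's a minimal sketch of a spider middleware that bumps
the priority of requests whose url looks interesting. The middleware
name, the substring test, and the priority value are just placeholders,
not anything built into Scrapy:

from scrapy.http import Request

class FocusPriorityMiddleware(object):

    def process_spider_output(self, response, result, spider):
        for request_or_item in result:
            if isinstance(request_or_item, Request) and "target" in request_or_item.url:
                # higher priority values are scheduled earlier
                yield request_or_item.replace(priority=10)
            else:
                yield request_or_item

You would enable it through the SPIDER_MIDDLEWARES setting, like any
other spider middleware.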

Pablo.


charlieS

Oct 26, 2010, 2:44:57 PM
to scrapy-users
Hi Pablo,

I tried assigning a priority to each request, but the spiders did
not process the requests with the highest priority first. For example,
with:

url1 -> priority=1
url2 -> priority=2
url3 -> priority=3

the spider crawls url3 first but then follows url1. Furthermore, as more
requests reach the scheduler, the order correlates less and less with
the priority values assigned to each request, and what I want is to
pick the request with the highest priority each time. Does this have
to do with the asynchronous nature of Scrapy? In that case I assume
there is no evident solution; am I right, or am I missing something
here?
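
Roughly what I am doing, as a sketch (the urls here are placeholders;
the real ones come from extracted links):

from scrapy.http import Request

def parse(self, response):
    # placeholder seed urls; in my spider these are links
    # extracted from the response
    urls = ["http://example.com/url1",
            "http://example.com/url2",
            "http://example.com/url3"]
    for i, url in enumerate(urls):
        yield Request(url, callback=self.parse, priority=i + 1)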

Thanks for your time!


On Oct 26, 6:52 pm, Pablo Hoffman <pablohoff...@gmail.com> wrote:
> What about the Request priority attribute?
> http://doc.scrapy.org/topics/request-response.html#scrapy.http.Request

Pablo Hoffman

Oct 27, 2010, 11:53:47 AM
to scrapy...@googlegroups.com
The order can't be completely guaranteed; think of it more as a hint to
Scrapy about which requests should be processed first.

One thing that can also help is setting CONCURRENT_REQUESTS_PER_SPIDER to 1,
but it won't completely ensure the order either, because the downloader has
its own local queue for performance reasons. So the best you can do is
prioritize the requests, but not ensure their exact order.
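
A minimal sketch of that change, assuming a standard project settings.py:

# settings.py
# process one request at a time, so scheduler order dominates;
# this trades crawl speed for stricter priority ordering
CONCURRENT_REQUESTS_PER_SPIDER = 1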

Pablo.

charlieS

Oct 30, 2010, 12:04:32 PM
to scrapy-users
That's right. After testing with a huge number of pages I realized
that the crawl order stays close to the highest priorities, even
with CONCURRENT_REQUESTS_PER_SPIDER set to a higher number (and with
better performance).
Another subject is how to apply this tactic to more spider instances.
What I have done is to dynamically generate as many classes of the
same spider (but with different name and start_urls) as I need for my
project, and then add them to a spider queue. Thus I get, let's say, 10
spiders that concurrently crawl their starting seeds. However, my
benchmarks show that the number of pages crawled in a given amount of
time is about the same with one spider as with 10 spiders (populated
using the above method). Any suggestions?
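
For reference, a minimal sketch of the kind of dynamic generation I
mean (class names and seed urls are placeholders; the real classes
share the actual parse logic):

from scrapy.spider import BaseSpider

def make_spider_class(index, seeds):
    # build a new spider class with its own name and start_urls;
    # the real version attaches the shared parse method too
    attrs = {"name": "focused_spider_%d" % index,
             "start_urls": seeds,
             "parse": lambda self, response: []}
    return type("FocusedSpider%d" % index, (BaseSpider,), attrs)

seed_groups = [["http://example.com/a"], ["http://example.com/b"]]
spider_classes = [make_spider_class(i, seeds)
                  for i, seeds in enumerate(seed_groups)]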

Thanks!

Pablo Hoffman

Oct 31, 2010, 11:09:09 PM
to scrapy...@googlegroups.com
On Sat, Oct 30, 2010 at 09:04:32AM -0700, charlieS wrote:
> However, my benchmarks show that the number of pages crawled in a given
> amount of time is about the same with one spider as with 10 spiders
> (populated using the above method). Any suggestions?

Maybe you have hit a CPU bottleneck?

Pablo.
