domain_name = quirks

Kyle Clarke

Feb 19, 2010, 1:57:14 AM
to scrapy-users
I'm having an issue with setting the domain_name value in my
spiders.

When I set the domain name to a "valid" URL, the spider runs as
normal, e.g.

class TradeMeCoNzPropertyRent(CrawlSpider):
    domain_name = 'www.trademe.co.nz'

However, if I set the domain to "foobar", the spider is still found,
but none of the links or rules match and the spider closes. I would
have thought that the domain_name var was just a placeholder/namespace,
so to speak, and could be named almost anything, but it does seem to be
tied to the actual website being crawled. I've also tried testing with
a unique URI, e.g. www.trademe.co.nz/rent, but every URI I try that
contains a "/" fails, even if it is a valid URL to browse.

I'm sorry, but I'm not too sure of the connection between the domain
and the spider. My problem is that I have 6+ spiders crawling the one
domain, www.trademe.co.nz, and I don't want all of them called at the
same time, so I want to give them unique domain_name values so I can
be explicit when calling them. Can anyone shed a little light on this
for me?
Thanks
Kyle

Ben

Feb 20, 2010, 5:25:56 PM
to scrapy-users
Hi

I'm in the same boat as Kyle - I would like to set up multiple spiders
to crawl the same general domain (but at different times and with
different rules/parsing structure).

Scrapy Enhancement Proposal 012 (http://dev.scrapy.org/wiki/SEP-012)
would appear to address this issue, but in the meantime I guess we
need a workaround. The SEP refers to such a workaround, but I can't
find the details of what this might be. Can anyone point me (and Kyle)
in the direction of some workaround instructions?

I guess one approach might be to have one spider but different rules
that call different parsing methods, depending on something passed in
at runtime? I'm a complete noob to scrapy, so sorry if that sounds
silly.
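
To make that a bit more concrete, this is roughly what I'm picturing.
It's completely untested, the imports are my best guess, and the URL
patterns and callback names are made up; the "choose a section at
runtime" part is the bit I don't know how to do yet:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TradeMeSpider(CrawlSpider):
    domain_name = 'www.trademe.co.nz'
    start_urls = ['http://www.trademe.co.nz/']

    # One rule per section of the site, each with its own parsing callback.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/property/residential-to-rent/',)),
             callback='parse_rental'),
        Rule(SgmlLinkExtractor(allow=(r'/motors/used-cars/',)),
             callback='parse_car'),
    )

    def parse_rental(self, response):
        # extract rental listing details from the response here
        pass

    def parse_car(self, response):
        # extract car listing details from the response here
        pass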

Thanks.

Daniel Graña

Feb 21, 2010, 12:28:12 PM
to scrapy...@googlegroups.com
Hello Ben, Kyle:

> Scrapy Enhancement Proposal 012 (http://dev.scrapy.org/wiki/SEP-012)
> would appear to address this issue, but in the meantime I guess we
> need a workaround. The SEP refers to such a workaround, but I can't
> find the details of what this might be. Can anyone point me (and Kyle)
> in the direction of some workaround instructions?

spider.domain_name is just a name used to quickly identify spiders in the logs and from the command line. The exception is the OffsiteMiddleware, which historically uses it as the spider's main domain to filter out off-site requests. So, if you leave the OffsiteMiddleware aside, you can put any value in domain_name, as long as it is unique across your project's spiders.

Take "test.com" as example domain to scrape with multiples spiders, then:

from scrapy.spider import BaseSpider

# Spider for test.com
class Test1Spider(BaseSpider):
    domain_name = 'test1'
    extra_domain_names = ['test.com']

# Another spider for test.com
class Test2Spider(BaseSpider):
    domain_name = 'test2'
    extra_domain_names = ['test.com']
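
Then you can crawl each spider separately by its unique name; if I remember the command correctly, something like "scrapy-ctl.py crawl test1" will run only the first one.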
 
> I guess one approach might be to have one spider but different rules
> that call different parsing methods, depending on something passed in
> at runtime? I'm a complete noob to scrapy, so sorry if that sounds
> silly.

That is possible too, but it is a little more complex and you would need to extend the SpiderManager.

Hope it helps.
Daniel

Ben

Feb 22, 2010, 3:00:26 AM
to scrapy-users
Thanks Daniel, that is useful.

I need to turn off the OffsiteMiddleware anyway, so that will make the
approach you suggest an easy choice.
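
In case it's useful to anyone else later, I think disabling it is just a matter of overriding the entry in settings.py; this is from memory and I haven't actually run it yet:

# settings.py -- assigning None should drop the built-in offsite filter
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
}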

Cheers,
Ben.

Kyle Clarke

Feb 23, 2010, 5:34:07 PM
to scrapy...@googlegroups.com
And thanks Ben & Daniel for the additions to my project!
Wicked!
Kyle

Kyle Clarke

Feb 23, 2010, 6:44:54 PM
to scrapy...@googlegroups.com
Additionally though - this won't help me with calling exclusive spiders on the one website domain at different times. E.g. if I were crawling all of the vehicle details on eBay, then since there are a lot of vehicles listed it makes sense to crawl them more frequently than, let's say, "pet supplies", which won't be nearly as popular. So when calling the scrapy command line via cron, all the spiders for the one domain will still be crawled.

At this stage, to stop this behaviour, it looks as though I will need to first check when I last stored crawled details to the database for certain spiders, and then drop the spider as necessary - though "there be dragons" with this approach too.
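
Something like this very rough sketch is what I have in mind. The last_crawl_time() helper is imaginary (it would query my database), the eBay spider is only an example, and I'm not even sure start_requests() is the right hook to use:

import datetime

from scrapy.http import Request
from scrapy.spider import BaseSpider


def last_crawl_time(spider_name):
    # Imaginary helper: would look up this spider's last run in my database.
    return datetime.datetime.min  # stub value so the sketch stands alone


class EbayVehiclesSpider(BaseSpider):
    domain_name = 'ebay_vehicles'          # unique name, as Daniel suggested
    extra_domain_names = ['www.ebay.com']
    start_urls = ['http://www.ebay.com/']  # hypothetical starting page

    def start_requests(self):
        # Only crawl if the last run was more than 6 hours ago; otherwise
        # return no requests so the spider closes straight away.
        six_hours_ago = datetime.datetime.now() - datetime.timedelta(hours=6)
        if last_crawl_time(self.domain_name) > six_hours_ago:
            return []
        return [Request(url) for url in self.start_urls]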

I haven't yet looked into the core scrapy stuff to implement a better approach - though I should be able to hook the command-line params sent via the shell and only allow the spiders explicitly requested... (I say this without having looked at the code, and as a PHP developer.)

Any advice is welcome.
Thanks
Kyle