When setting the domain name to a "valid" URL, the spider runs as
normal, e.g.

class TradeMeCoNzPropertyRent(CrawlSpider):
    domain_name = 'www.trademe.co.nz'
However, if I set the domain to "foobar", the spider is still found,
but none of the links or rules match and the spider closes. I would
have thought that the domain_name var was just a placeholder/namespace,
so to speak, and could be named almost anything - though it does seem
to be linked to the actual website being crawled.
I've also tried testing with a unique URI, e.g. www.trademe.co.nz/rent -
though it seems that every value I try with a "/" in it fails, even if
it is a valid URL to browse.
I'm sorry, but I'm not too sure of the connection between the domain
and the spider. My problem is that I have 6+ spiders crawling the one
domain - www.trademe.co.nz - and I don't want all of them run at the
same time, so I want to give each a unique domain_name value so I can
be explicit when calling them. Can anyone shed a little light on this
for me?
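(Not Scrapy's actual code, but the behaviour you describe is what you'd
expect if domain_name doubles as the allowed host for offsite filtering.
A rough sketch of such a host check in plain Python - the function name
is made up for illustration:)

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domain):
    """Rough sketch of an offsite check: a request is dropped unless its
    host equals the allowed domain or is a subdomain of it."""
    host = urlparse(url).hostname or ""
    return not (host == allowed_domain or host.endswith("." + allowed_domain))

# With domain_name = 'www.trademe.co.nz', links on the site pass:
print(is_offsite("http://www.trademe.co.nz/rent/1", "www.trademe.co.nz"))  # False

# With domain_name = 'foobar', every real link is filtered out, so the
# spider runs out of requests and closes:
print(is_offsite("http://www.trademe.co.nz/rent/1", "foobar"))  # True

# 'www.trademe.co.nz/rent' contains a path, so it can never equal any
# URL's hostname - which would explain why values with "/" always fail:
print(is_offsite("http://www.trademe.co.nz/rent/1", "www.trademe.co.nz/rent"))  # True
```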
Thanks
Kyle
I'm in the same boat as Kyle - I would like to set up multiple spiders
to crawl the same general domain (but at different times and with
different rules/parsing structure).
Scrapy Enhancement Proposal 012 (http://dev.scrapy.org/wiki/SEP-012)
would appear to address this issue, but in the meantime I guess we
need a workaround. The SEP refers to such a workaround, but I can't
find the details of what this might be. Can anyone point me (and Kyle)
in the direction of some workaround instructions?
I guess one approach might be to have one spider, but with different
rules that call different parsing methods depending on something passed
in at runtime? I'm a complete noob to Scrapy, so sorry if that sounds
silly.
Thanks.
On Feb 19, 7:57 pm, Kyle Clarke <kylercla...@gmail.com> wrote:
> I'm having an issue with the setting of the domain_name value in my
> spiders.
>
> When setting the domain name to a "valid" url, the spider runs as
> normal eg
>
> class TradeMeCoNzPropertyRent(CrawlSpider):
> domain_name = 'www.trademe.co.nz'
>
> However, if I were to set the domain to "foobar" - the spider would be
> found, however none of the links or rules would be valid & the spider
> closes. I would have thought that the domain_name var would just be a
> placeholder/namespace so to speak & could be named almost anything -
> tho it does seem to be linked to the actual website to be crawled.
> I've tried to test by using a unique uri, eg www.trademe.co.nz/rent -
> tho it seems that all uri's I try with a "/" fail even if they are a
> valid url to browse?
>
> I'm sorry but I'm not too sure of the connection between the domain &
> the spider. My problem is that I have 6+ spiders crawling the one
> domain - www.trademe.co.nz - & I don't want all of these called at the
I need to turn off the OffsiteMiddleware anyway, so that will make the
approach you suggest an easy choice.
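In case it's useful to anyone following along: turning off
OffsiteMiddleware is normally done in settings.py by mapping its path
to None. A sketch - the exact module path varies between Scrapy
versions, so check it against your install:

```python
# settings.py -- disable the offsite filter so requests to any host
# are allowed through (module path may differ in your Scrapy version).
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
}
```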
Cheers,
Ben.
--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To post to this group, send email to scrapy...@googlegroups.com.
To unsubscribe from this group, send email to scrapy-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/scrapy-users?hl=en.