I'm using a SitemapSpider and code along the following lines:
sitemap_urls = ['http://www.example.com/sitemap.xml.gz']
sitemap_rules = [('/a/items/', 'parse_page'),]
sitemap_follow = ['/item_sitemap']
The sitemap contains links to other gzipped sitemaps, which in turn link to around 60,000 URLs containing /a/items/. Everything works great so far (thanks for the awesome project!!), however, I have the following requirements:
1) To selectively crawl some of the URLs, based on the results of a DB query.
2) To modify some URLs before Scrapy fetches them.
3) To avoid crawling duplicate URLs.
Regarding 1) and 2): XMLFeedSpider has a 'parse_node' method that lets you selectively return a request for each URL, but SitemapSpider doesn't seem to offer an equivalent hook; it just calls '_parse_sitemap' recursively until it hits a URL matching the designated rules.
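For what it's worth, newer Scrapy releases (1.8+) added a 'sitemap_filter' hook on SitemapSpider for exactly this: it receives the parsed sitemap entries (dict-like, each with at least a 'loc' key) and yields the ones to keep, possibly modified. The filtering logic itself is plain Python; here is a minimal sketch, where 'wanted_ids' and the /a/items/<id> URL shape are made-up placeholders for whatever your DB query actually returns:

```python
import re

def filter_entries(entries, wanted_ids):
    """Yield only sitemap entries whose item id is in wanted_ids,
    rewriting the URL on the way out (here: forcing https)."""
    for entry in entries:
        loc = entry["loc"]
        m = re.search(r"/a/items/(\d+)", loc)
        if m and m.group(1) in wanted_ids:
            # Manipulate the URL before Scrapy ever schedules it.
            entry["loc"] = loc.replace("http://", "https://", 1)
            yield entry

# Inside a SitemapSpider subclass this would be wired up as:
#     def sitemap_filter(self, entries):
#         yield from filter_entries(entries, self.wanted_ids)
```

On older Scrapy versions without this hook you'd have to override '_parse_sitemap' itself, which is a private method and may change between releases.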
Regarding 3): I guess it makes sense that SitemapSpider processes all URLs, because sitemap creators shouldn't include duplicates in their sitemaps. The problem is that this site actually attaches a random jsessionid when you hit a URL, and for some reason the spider keeps following those URLs (worse, they don't even match '/a/items/'!!!).
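One way to deduplicate such URLs is to canonicalize them before they reach the scheduler, so that Scrapy's default duplicate filter sees identical requests. A minimal sketch using only the standard library (the 'jsessionid' parameter name is an assumption; adjust it to whatever the site actually appends):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_jsessionid(url):
    """Remove jsessionid from both the path (';jsessionid=...' suffix)
    and the query string, so duplicate URLs compare equal."""
    parts = urlsplit(url)
    # Java app servers often append ';jsessionid=XYZ' to the path.
    path = parts.path.split(";jsessionid=", 1)[0]
    # It can also appear as an ordinary query parameter.
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
         if k.lower() != "jsessionid"]
    )
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))
```

You could apply this in a spider middleware (rewriting request URLs in 'process_spider_output') or wherever you build the requests, and let the dupefilter do the rest.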
Are there any simple solutions that allow me to achieve 1-3, or do I
have to implement a different spider?
--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To post to this group, send email to scrapy...@googlegroups.com.
To unsubscribe from this group, send email to scrapy-users+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/scrapy-users?hl=en.
So should I override the is_gzipped function inside each class that extends SitemapSpider?