Correct way to cherry pick URLs to parse using SitemapSpider?


Edward

Dec 7, 2011, 5:32:51 AM12/7/11
to scrapy-users
Hi,

I'm using a SitemapSpider and code along the following lines:

sitemap_urls = ['http://www.example.com/sitemap.xml.gz']
sitemap_rules = [('/a/items/', 'parse_page'),]
sitemap_follow = ['/item_sitemap']

The site-map contains links to other Gzipped site-maps, which in turn
have links to around 60,000 URLs with /a/items/ in them. Everything
works great so far (thanks for the awesome project!!), however, I have
the following desires:

1) To selectively crawl some of the URLs, according to whatever the
results of a DB query tell me.
2) To manipulate some URLs before allowing Scrapy to fetch them.
3) To not crawl duplicate URLs.

Regarding 1) and 2): XMLFeedSpider has a 'parse_node' method that lets you
selectively return a request for each URL, but SitemapSpider doesn't seem
to offer anything similar; it just calls '_parse_sitemap' recursively
until it hits a URL matching one of the designated rules.
Regarding 3): I guess it makes sense that SitemapSpider processes all
URLs, because site-map creators shouldn't put duplicates in their
site-map. The issue is that the site actually attaches a random
jsessionid when you hit it, and for some reason the spider keeps
following the URL (worse, it doesn't even match '/a/items/'!).

Are there any simple solutions that allow me to achieve 1-3, or do I
have to implement a different spider?
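A framework-independent sketch of the filtering described in 1)-3) above (the helper names, the `wanted` predicate, and the jsessionid pattern are all assumptions for illustration, not Scrapy APIs):

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Strip a ";jsessionid=..." path suffix so duplicate URLs collapse to one.
_JSESSIONID = re.compile(r";jsessionid=[^/?#]*", re.IGNORECASE)

def normalize(url):
    """Remove the session id the server attaches to each URL (point 2/3)."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=_JSESSIONID.sub("", parts.path)))

def filter_entries(urls, wanted):
    """Yield each normalized URL once, keeping only those that `wanted`
    accepts -- e.g. a predicate backed by a DB query (point 1)."""
    seen = set()
    for url in urls:
        url = normalize(url)
        if url in seen or not wanted(url):
            continue
        seen.add(url)
        yield url
```

In a spider, something like this could sit between extracting the sitemap entries and yielding requests, so the duplicate-with-session-id URLs never reach the scheduler.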

Pablo Hoffman

Dec 16, 2011, 1:14:45 PM12/16/11
to scrapy...@googlegroups.com
I don't think you can achieve that with the current SitemapSpider
implementation. Maybe you could contribute a patch that makes the
implementation easier to extend, e.g. by adding a method that decides
whether a URL should be followed? I would gladly review it.
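The extension point suggested here might look like the following (a sketch only; `SitemapWalker` and `should_follow` are hypothetical names, not an existing Scrapy API):

```python
class SitemapWalker:
    """Toy stand-in for the spider's sitemap handling: walks a list of
    sitemap entries and keeps only those that `should_follow` accepts."""

    def should_follow(self, url):
        # Default: follow everything, as the current implementation does.
        return True

    def walk(self, entries):
        return [url for url in entries if self.should_follow(url)]


class ItemsOnlyWalker(SitemapWalker):
    # A subclass overrides the hook to cherry-pick URLs.
    def should_follow(self, url):
        return "/a/items/" in url
```

The point of the hook is that subclasses customize one small decision instead of reimplementing the whole sitemap traversal.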

Sanket Gupta

Dec 22, 2011, 8:09:14 AM12/22/11
to scrapy-users
Hi,
I tried running the sitemap spider but it refused to crawl gzipped
sitemaps. It gave the following error:

[scrapy] WARNING: Ignoring non-XML sitemap

Is there a setting that needs to be enabled to allow parsing of
gzipped sitemaps?

Pablo Hoffman

Dec 23, 2011, 2:42:20 PM12/23/11
to scrapy...@googlegroups.com
It should support gzipped sitemaps if the encoding is declared properly
by the HTTP server. Could you share the URL?

Sanket Gupta

Dec 24, 2011, 2:11:47 AM12/24/11
to scrapy...@googlegroups.com
This one references a .gz sitemap.

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To post to this group, send email to scrapy...@googlegroups.com.
To unsubscribe from this group, send email to scrapy-users+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/scrapy-users?hl=en.


Shane Evans

Dec 24, 2011, 5:56:00 AM12/24/11
to scrapy...@googlegroups.com
It references:
http://www.flipkart.com/sitemap/sitemapMOB.xml.gz
which has a content type header of application/octet-stream.

As a workaround, you could modify the list of accepted content types:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/sitemap.py#L54

More generally, should Scrapy handle this situation? Maybe guess that it's gzip based on the file extension when the Content-Type header carries no useful information?
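The extension-based guess could be sketched roughly like this, standalone rather than inside Scrapy's sitemap code (the function names are made up; checking the two-byte gzip magic number is an extra safeguard beyond what is suggested above):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def looks_gzipped(url, content_type, body):
    """Guess gzip when the Content-Type header is unhelpful: trust an
    explicit gzip type, else fall back to the .gz extension or the
    gzip magic number at the start of the body."""
    if content_type in ("application/gzip", "application/x-gzip"):
        return True
    if content_type in ("application/octet-stream", "", None):
        return url.endswith(".gz") or body[:2] == GZIP_MAGIC
    return False

def sitemap_body(url, content_type, body):
    """Return the decompressed XML if the response looks gzipped."""
    if looks_gzipped(url, content_type, body):
        return gzip.decompress(body)
    return body
```

With a check like this, a server that mislabels its compressed sitemap as application/octet-stream would still be handled correctly.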
Sanket Gupta

Dec 24, 2011, 1:14:01 PM12/24/11
to scrapy...@googlegroups.com
Gzip in sitemaps is part of the sitemap protocol, so ideally the
sitemap spider should handle it out of the box.
I will try the suggested workaround.



Sanket Gupta

Dec 26, 2011, 6:50:59 AM12/26/11
to scrapy...@googlegroups.com

So should I override the is_gzipped function in each class that extends SitemapSpider?

Pablo Hoffman

Jan 3, 2012, 9:20:09 AM1/3/12
to scrapy...@googlegroups.com
I've added support for processing sitemap URLs even if they have a wrong content type:
https://github.com/scrapy/scrapy/commit/10ed28b9d02ddc0b216889c7d37d452ac4b11324

Note that it's only available in the master (0.15) branch, because it's considered a new feature.