Hi all. I'm new to Python/Scrapy.
From the tutorial I have a working spider for my site. Now I'm trying
to customize it to my own requirements. One of these requirements is
to filter out some query parameters from certain URLs.
To do this, I created my own extractor based on SgmlLinkExtractor and
overrode the process_value() method so that it removes the unwanted
query parameters and returns the cleaned-up URL:
class MySgmlLinkExtractor(SgmlLinkExtractor):
    def __init__(self):
        SgmlLinkExtractor.__init__(self, allow=(r'blah\.php',))
    def process_value(self, v):
        [do something]
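The clean-up I want process_value to perform is roughly this (a standard-library sketch, independent of Scrapy; the parameter name "sessionid" is just an example I made up):

```python
# Sketch only: strip unwanted query parameters from a URL using the
# standard library. The parameter name "sessionid" is an example.
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def strip_query_params(url, drop=("sessionid",)):
    """Return `url` with the query parameters named in `drop` removed."""
    parts = urlparse(url)
    # Keep only the (key, value) pairs whose key is not in `drop`.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in drop]
    return urlunparse(parts._replace(query=urlencode(kept)))
```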
However, this doesn't seem to work: no clean-up takes place.
Looking at the Python code in sgml.py, it seems I have no choice but
to pass process_value as a callback when instantiating the link
extractor. I don't know the best way to do that; whatever I tried
didn't work:
class MySpider(CrawlSpider):
    domain_name = "blah.blah"
    start_urls = ["http://www.blah.blah/"]
    rules = (
        Rule(MySgmlLinkExtractor(process_value=my_process_value)),
    )
    def my_process_value(self, v):
        [do something]
    ...
This causes an error. I just don't know how to pass a method as the
callback for the "process_value" parameter (while it is no problem to
pass one for the "callback" parameter of the Rule object).
What is the proper way to do this?
--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To post to this group, send email to
scrapy...@googlegroups.com.
To unsubscribe from this group, send email to
scrapy-users...@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/scrapy-users?hl=en.