How to specify a custom process_value to link_extractor()


rhill

May 5, 2010, 12:39:28 PM
to scrapy-users
Hi all. I am new to Python, Scrapy, etc.

Following the tutorial, I have a working spider for my site. Now I am trying to customize it to my own requirements. One of these is to filter out certain query parameters from some URLs.

To do this, I created my own extractor based on SgmlLinkExtractor. I override the process_value() method to remove unwanted query parameters from the URL and return the cleaned-up URL:

class MySgmlLinkExtractor(SgmlLinkExtractor):
    def __init__(self):
        SgmlLinkExtractor.__init__(self, allow=(r'blah\.php',))

    def process_value(self, v):
        [do something]
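
For illustration, the kind of clean-up I have in mind looks like this (strip_params is a hypothetical name, and 'sessionid' stands in for whatever parameters actually need to go):

import urlparse
import urllib

def strip_params(url, unwanted=('sessionid',)):
    # split the URL, drop the unwanted query parameters, rebuild it
    # ('sessionid' above is a stand-in for the real parameter names)
    parts = urlparse.urlsplit(url)
    query = [(k, v) for k, v in urlparse.parse_qsl(parts.query)
             if k not in unwanted]
    return urlparse.urlunsplit(parts._replace(query=urllib.urlencode(query)))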

However, this doesn't seem to work: no clean-up takes place.

Looking at the Python code in sgml.py, it seems I have no choice but to pass process_value as a callback when instantiating the link extractor. I don't know how best to do that; whatever I tried didn't work:

class MySpider(CrawlSpider):
    domain_name = "blah.blah"
    start_urls = ["http://www.blah.blah/"]
    rules = (
        Rule(MySgmlLinkExtractor(process_value=my_process_value)),
    )

    def my_process_value(self, v):
        [do something]
    ...

This causes an error. I just don't know how to pass a method as a callback for the "process_value" parameter (whereas doing so for the "callback" parameter of the Rule object is no problem).

What is the proper way?


Victor Mireyev

May 6, 2010, 10:41:38 AM
to scrapy-users
The only public method that every LinkExtractor has is extract_links.
http://doc.scrapy.org/topics/link-extractors.html#topics-link-extractors
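
A quick sketch of calling it directly (the allow pattern is just an example; response is whatever Response object your callback receives):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

def list_links(response):
    # extract_links() takes a Response and returns a list of Link objects
    extractor = SgmlLinkExtractor(allow=(r'blah\.php',))
    for link in extractor.extract_links(response):
        print link.url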

Chris

May 12, 2010, 11:24:15 AM
to scrapy-users
I was having a similar problem. To get rid of the errors that were
bothering me, I had to move the declaration of my_process_value ABOVE
the rules in my code. I also did not have "self" as a parameter of
my_process_value.

So it looked like this:

def my_process_value(value):
    [do stuff]
    return modified_value

rules = (
    Rule(SgmlLinkExtractor(allow=(),
                           process_value=my_process_value)),
)

Hope that helps!

As a bit of a Python noob, I found the Scrapy documentation on process_value a little confusing.
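
For completeness, here is a self-contained sketch of how the pieces fit together (the domain, the URLs, and the fragment-stripping clean-up are placeholders for your own logic):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# plain module-level function: no self, and defined before rules references it
def my_process_value(value):
    # put your real clean-up here; this example just drops any #fragment
    return value.split('#')[0]

class MySpider(CrawlSpider):
    domain_name = "blah.blah"
    start_urls = ["http://www.blah.blah/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'blah\.php',),
                               process_value=my_process_value)),
    )

Defining the function at module level (rather than in the class body) also sidesteps the ordering and "self" issues entirely.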
