How to specify a custom process_value to link_extractor()


rhill

May 5, 2010, 12:39:28 PM
to scrapy-users
Hi all. I am new to Python, Scrapy, etc.

Following the tutorial, I have a working spider for my site. Now I am trying to customize it to my own requirements. One of these is to filter out certain query parameters from some URLs.

To do this, I created my own extractor based on SgmlLinkExtractor. I override the process_value() method to remove unwanted query parameters from the URL and return the cleaned-up URL:

class MySgmlLinkExtractor(SgmlLinkExtractor):
    def __init__(self):
        SgmlLinkExtractor.__init__(self, allow=(r'blah\.php',))

    def process_value(self, v):
        [do something]
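
For illustration, the kind of clean-up I have in mind looks like this (strip_params is a hypothetical name, and 'sessionid' stands in for whatever parameters actually need to go):

import urlparse
import urllib

def strip_params(url, unwanted=('sessionid',)):
    # split the URL, drop the unwanted query parameters, rebuild it
    # ('sessionid' above is a stand-in for the real parameter names)
    parts = urlparse.urlsplit(url)
    query = [(k, v) for k, v in urlparse.parse_qsl(parts.query)
             if k not in unwanted]
    return urlparse.urlunsplit(parts._replace(query=urllib.urlencode(query)))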

However, this doesn't seem to work: no clean-up takes place.

Looking at the Python code in sgml.py, it seems I have no choice but to pass process_value as a callback when instantiating the link extractor. I don't know how best to do that; whatever I tried didn't work:

class MySpider(CrawlSpider):
    domain_name = "blah.blah"
    start_urls = ["http://www.blah.blah/"]
    rules = (
        Rule(MySgmlLinkExtractor(process_value=my_process_value)),
    )

    def my_process_value(self, v):
        [do something]
    ...

This causes an error. I just don't know how to pass a method as a callback for the "process_value" parameter (whereas doing so for the "callback" parameter of the Rule object is no problem).

What is the proper way?


Victor Mireyev

May 6, 2010, 10:41:38 AM
to scrapy-users
The only public method that every LinkExtractor has is extract_links.
http://doc.scrapy.org/topics/link-extractors.html#topics-link-extractors
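
A quick sketch of calling it directly (the allow pattern is just an example; response is whatever Response object your callback receives):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

def list_links(response):
    # extract_links() takes a Response and returns a list of Link objects
    extractor = SgmlLinkExtractor(allow=(r'blah\.php',))
    for link in extractor.extract_links(response):
        print link.url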

Chris

May 12, 2010, 11:24:15 AM
to scrapy-users
I was having a similar problem. To get rid of the errors that were
bothering me, I had to move the declaration of my_process_value ABOVE
the rules in my code. I also did not have "self" as a parameter of
my_process_value.

So it looked like this:

def my_process_value(value):
    [do stuff]
    return modified_value

rules = (
    Rule(SgmlLinkExtractor(allow=(),
                           process_value=my_process_value)),
)

Hope that helps!

As a bit of a Python noob, I found the Scrapy documentation on process_value a little confusing.
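
For completeness, here is a self-contained sketch of how the pieces fit together (the domain, the URLs, and the fragment-stripping clean-up are placeholders for your own logic):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# plain module-level function: no self, and defined before rules references it
def my_process_value(value):
    # put your real clean-up here; this example just drops any #fragment
    return value.split('#')[0]

class MySpider(CrawlSpider):
    domain_name = "blah.blah"
    start_urls = ["http://www.blah.blah/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'blah\.php',),
                               process_value=my_process_value)),
    )

Defining the function at module level (rather than in the class body) also sidesteps the ordering and "self" issues entirely.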
