How to extract & follow relative links?


Roberto López

Dec 24, 2013, 5:31:43 PM
to scrapy...@googlegroups.com
Hi.

I have to extract and follow links like this:

<a href="#date=2013-12-24&Id=1269282">Tynwald Titan</a>

The following rule doesn't work; no links are found:

rules = ( Rule(SgmlLinkExtractor(allow=r''), callback='parse_item', follow=True), )

Do you know how I can do it?

Best regards

Roberto López

Dec 25, 2013, 8:58:06 AM
to scrapy...@googlegroups.com
For now I am using SgmlLinkExtractor with process_value to transform relative paths into absolute ones.

rules = (
    Rule(SgmlLinkExtractor(tags="a", attrs="href",
                           allow=r'#day=2013-12-24&Id=33',
                           process_value=my_process_value),
         callback='my_parser', follow=False),
)

def my_process_value(value):
    print '---->' + value
    return value  # return the url; returning None would drop the link

When I run the spider I can see all the links in the response; this is the output:

---->#Day=2013-12-24&Id=33
---->#Day=2013-12-24&Id=1269753
---->#Day=2013-12-24&Id=1269753
---->#Day=2013-12-24&Id=1269772
---->#Day=2013-12-24&Id=1269772

I want the first relative link only; it's as if the allow param doesn't take effect. The output should be this:

---->#Day=2013-12-24&Id=33

Do you know the reason?

Roberto López

Dec 25, 2013, 9:24:31 AM
to scrapy...@googlegroups.com
In the end I used this Rule:

rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths=('//th[@class="headline"]',),
                           tags=("a",), attrs=("href",), allow=(r'',),
                           process_value=my_process_value),
         callback='my_parser', follow=False),
)

def my_process_value(value):
    print '---->' + value
    return value


I get the links I want, the ones within the restricted XPath; it works. This is the output:

---->#Day=2013-12-24&Id=33

But... I would like to do the same using allow and process_value. Do you know how I can do it?



On Tuesday, December 24, 2013 at 23:31:43 UTC+1, Roberto López wrote:

Rolando Espinoza La Fuente

Dec 25, 2013, 12:38:58 PM
to scrapy...@googlegroups.com
Perhaps your links get filtered out because the relative URL becomes the same page once the #fragment is removed by the link extractor.

You can get the links with fragments by using the canonicalize=False option.

See scrapy shell session below:

In [1]: body = '<a href="#date=2013-12-24&Id=1269282">Tynwald Titan</a>'

In [2]: from scrapy.http import HtmlResponse

In [3]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

In [4]: r = HtmlResponse('http://www.example.com/', body=body)

In [5]: lx = SgmlLinkExtractor(canonicalize=False)

In [6]: lx.extract_links(r)
Out[6]: [Link(url='http://www.example.com/#date=2013-12-24&Id=1269282', text=u'Tynwald Titan', fragment='', nofollow=False)]
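A side note on the allow pattern: even with canonicalize=False, allow is an ordinary case-sensitive regex, and the rule earlier in the thread used a lowercase 'day' while the links actually contain 'Day'. A quick check with the stdlib re module (Python 3 here, just to verify the pattern outside Scrapy):

```python
import re

# The URL as the link extractor would see it with canonicalize=False,
# i.e. with the fragment preserved.
url = 'http://www.example.com/#Day=2013-12-24&Id=33'

# The allow pattern from the earlier rule used lowercase 'day'; regexes
# are case-sensitive by default, so it cannot match 'Day'.
assert re.search(r'#day=2013-12-24&Id=33', url) is None

# Matching the actual capitalisation (or using re.IGNORECASE) works.
assert re.search(r'#Day=2013-12-24&Id=33', url) is not None
```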

But the #fragment is not supposed to be sent to the server, so if you request that URL you will get the same page back. Most likely the website uses JavaScript to display the information based on the fragment. Given that Scrapy doesn't execute JavaScript, you will need to figure out what the site does and reproduce it in Scrapy (i.e. building an AJAX request with the date and id taken from the fragment).
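That last step can be sketched as: pull the parameters out of the fragment, then feed them to a direct request. A minimal Python 3 sketch using only the stdlib; the '/ajax/events' endpoint below is a hypothetical placeholder, since the real endpoint has to be discovered by watching the site's XHR traffic in the browser's network tab:

```python
from urllib.parse import urlsplit, parse_qs

def fragment_params(url):
    """Parse a '#Day=...&Id=...' fragment into a dict of single values."""
    fragment = urlsplit(url).fragment          # e.g. 'Day=2013-12-24&Id=33'
    return {key: values[0] for key, values in parse_qs(fragment).items()}

params = fragment_params('http://www.example.com/#Day=2013-12-24&Id=33')
print(params)  # {'Day': '2013-12-24', 'Id': '33'}

# Inside a spider the parameters could then drive a request against the
# site's real data endpoint (HYPOTHETICAL url shown):
# yield scrapy.Request(
#     'http://www.example.com/ajax/events?day=%(Day)s&id=%(Id)s' % params,
#     callback=self.parse_item)
```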

Regards
Rolando





Roberto López

Jan 2, 2014, 7:47:41 PM
to scrapy...@googlegroups.com
Thanks Rolando. It's clear enough and it helped me a lot.

I'm still learning. Now it's time to learn about item loaders.

Regards!