Low speed of xpath select operation


Grigoriy Petukhov

Nov 14, 2011, 4:12:39 AM
to scrapy-users
Hi guys,

I am learning scrapy and I've run into a strange problem.

Here is the code of a simple spider that reproduces my issue. I tried to
write a standalone snippet using only HtmlXPathSelector, but I don't
understand how to build a Response object manually, so here is the full
spider:


# -*- coding: utf-8 -*-
import time

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class GigantclipsSpider(BaseSpider):
    name = 'test'
    start_urls = ['http://tubesexclips.com/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        start = time.time()
        hxs.select('//div[@class="thumb"]').extract()[0]
        print '%.2f' % (time.time() - start)

        start = time.time()
        hxs.select('//div[@class="thumb"]/a').extract()[0]
        print '%.2f' % (time.time() - start)

If I run `scrapy crawl test` I get:

0.07
14.22

14 seconds for a simple xpath query! Please point out what I am doing
wrong. I've tested this query with lxml and it took less than a second,
as expected.
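For reference, this is roughly how I timed the same query against lxml directly. The snippet below uses a synthetic in-memory document standing in for the real page (the ~547-node count comes from this thread; the markup shape is an assumption):

```python
import time
from lxml import html

# Synthetic stand-in for the real page: 547 matching thumb nodes
# (count taken from the thread; markup shape is an assumption).
doc = html.fromstring(
    "<html><body>"
    + '<div class="thumb"><a href="/clip"><img src="t.jpg"/></a></div>' * 547
    + "</body></html>"
)

start = time.time()
links = doc.xpath('//div[@class="thumb"]/a')  # raw lxml xpath call
elapsed = time.time() - start

print(len(links), '%.4f s' % elapsed)
```

On my machine the raw lxml call over the same number of nodes finishes well under a second.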

Rolando Espinoza La Fuente

Nov 14, 2011, 8:56:16 AM
to scrapy...@googlegroups.com
> 14 seconds for a simple xpath query! Please point out what I am doing
> wrong. I've tested this query with lxml and it took less than a second,
> as expected.

There are 547 nodes for that simple xpath query. The HXS wrapper
is slower than raw API calls when many nodes are selected.

If you want to select just the first element (though that seems unlikely
here), you can use this xpath: //div[@class="thumb"]/a[1]
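One caveat worth noting about that path: the `[1]` predicate binds to each div, so `//div[@class="thumb"]/a[1]` still returns the first `<a>` of every thumb. To take one node overall, parenthesize the path first. A small lxml sketch (the markup here is illustrative, not the real page):

```python
from lxml import html

# Two thumbs with one link each (illustrative markup).
doc = html.fromstring(
    '<div class="thumb"><a href="/a1">x</a></div>'
    '<div class="thumb"><a href="/a2">y</a></div>'
)

per_div = doc.xpath('//div[@class="thumb"]/a[1]')    # first <a> inside EACH div
overall = doc.xpath('(//div[@class="thumb"]/a)[1]')  # single first match overall

print(len(per_div), len(overall))
```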

If you want to extract links in order to crawl them, use the link
extractor, which is faster in this case:

$ scrapy shell url
...
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> lx = SgmlLinkExtractor(restrict_xpaths='//div[@class="thumb"]')
>>> ret = lx.extract_links(response)

Regards,

~Rolando

