If I use something other than parse with SgmlLinkExtractor, I get an error


Bob

Mar 10, 2011, 2:28:59 PM
to scrapy-users

The following code
class SiteSpider(BaseSpider):
    name = "some_site.com"
    allowed_domains = ["some_site.com"]
    start_urls = [
        "some_site.com/something/another/PRODUCT-CATEGORY1_10652_-1_85667_29104_85667",
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=('some_site.com/something/another/PRODUCT-CATEGORY_(.*)',))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('some_site.com/something/another/PRODUCT-DETAIL(.*)',)), callback="parse_item"),
    )

    def parse_item(self, response):
        # ... parse stuff

throws the following error:

Traceback (most recent call last):
  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1174, in mainLoop
    self.runUntilCurrent()
  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 796, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 318, in callback
    self._startRunCallbacks(result)
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 424, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 441, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/usr/lib/pymodules/python2.6/scrapy/spider.py", line 62, in parse
    raise NotImplementedError
exceptions.NotImplementedError:

When I change the callback to "parse" and rename my method to "parse", I don't get any errors, but nothing is scraped. Perhaps I'm setting up the link extractor wrong?

What I want to do is parse each ITEM link on the CATEGORY page. Am I
doing this totally wrong?

zanhsieh

Mar 10, 2011, 7:53:28 PM
to scrapy-users
Hi Bob,

Perhaps it would work if you changed from BaseSpider to CrawlSpider? BaseSpider does not seem to implement Rule; see:

http://doc.scrapy.org/topics/spiders.html?highlight=rule#scrapy.contrib.spiders.Rule
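
For illustration, a rough sketch of the CrawlSpider version (import paths as in the 0.x docs linked above; I added the http:// scheme to the start URL and trimmed the allow patterns down to the distinctive part of the path, so treat it as an outline rather than something tested against your site):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SiteSpider(CrawlSpider):
    name = "some_site.com"
    allowed_domains = ["some_site.com"]
    start_urls = [
        "http://some_site.com/something/another/PRODUCT-CATEGORY1_10652_-1_85667_29104_85667",
    ]
    rules = (
        # Follow category pages (no callback, just keep crawling)
        Rule(SgmlLinkExtractor(allow=('PRODUCT-CATEGORY_(.*)',)), follow=True),
        # Hand product detail pages to parse_item
        Rule(SgmlLinkExtractor(allow=('PRODUCT-DETAIL(.*)',)), callback="parse_item"),
    )

    def parse_item(self, response):
        # ... extract fields from the detail page here
        pass

One more thing worth knowing from the CrawlSpider docs: avoid naming a rule's callback "parse", because CrawlSpider uses parse internally to drive the rules.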

-M

Bob

Mar 12, 2011, 8:44:26 PM
to scrapy-users
Thanks Zanhsieh! Worked like a charm. However, now only the last
"item" page is being properly parsed.

Is there a problem in how I am writing my rule such that hxs =
HtmlXPathSelector(response) throws the following:
if ret is None:raise treeError('xmlDocGetRootElement() failed')
libxml2.treeError: xmlDocGetRootElement() failed

It looks like the response isn't being passed properly to parse_item for anything except the last element... Am I missing something?



Bob

Mar 12, 2011, 9:05:38 PM
to scrapy-users
If it is any help, for everything but the last element passed, the following is true about the response:
DEBUG: request =
DEBUG: response type = <type 'str'>
DEBUG: response encoding = ascii

zanhsieh

Mar 13, 2011, 6:38:51 AM
to scrapy-users
Hi Bob,

Since I don't have a chance to peek at the website you are trying to crawl, the tip I can offer is to insert the following lines in your parse_item method:

from scrapy.shell import inspect_response
inspect_response(response)

Re-run the spider; it should stop whenever it enters your parsing snippet. Then start using hxs or xxs, or just type response and hit enter, to see what the response is. My guess is you might be hitting some not-well-formed HTML page or XML document.
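
In your spider that would sit at the top of parse_item, roughly like this (just a sketch repeating the call above inside the method you already have):

    def parse_item(self, response):
        # Pause the crawl and drop into an interactive shell with this
        # response loaded, so hxs / xxs / response can be tried by hand
        from scrapy.shell import inspect_response
        inspect_response(response)

        # (HtmlXPathSelector comes from scrapy.selector)
        hxs = HtmlXPathSelector(response)
        # ... the rest of the normal parsing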

-M

Bob

Mar 14, 2011, 7:59:11 PM
to scrapy-users
Hi M,

Thank you for the response. Here is what I get when the rule runs:
>>> response
TextResponse(url='http://www.somesite.com/asdf/asdf/asfd_3324_adsfa-1_0', status=200, body='',
  headers={'Content-Length': ['0'], 'Content-Language': ['en-US'], 'Server': ['IBM_HTTP_Server'],
    'Connection': ['close'], 'Date': ['Sun, 13 Mar 2011 16:47:56 GMT'],
    'P3P': ['CP="CAO DSP COR CURa ADMa DEVa OUR IND PHY ONL UNI COM NAV INT DEM PRE"'],
    'Content-Type': ['text/plain']},
  request=Request(url='http://www.some.com/site/here/fsproductdetail_10652_7dadf2_565_-1_0', method='GET', body='',
    headers={'Accept-Language': ['en'], 'Accept-Encoding': ['gzip,deflate'],
      'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
      'User-Agent': ['Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.15) Gecko/20110303 Ubuntu/10.10 (maverick) Firefox/3.6.15'],
      'Cookie': ['WC_GENERIC_ACTIVITYDATA=[299194823%3atrue%3afalse%3a0%3aeNxMdWLIdOXPx15ZidBzh%2bnIBXI%3d][com.ibm.commerce.context.base.BaseContext|10652%26%2d1002%26%2d1002%26%2d1][com.ibm.commerce.catalog.businesscontext.CatalogContext|29104%26null][com.ibm.commerce.context.globalization.GlobalizationContext|%2d1%26USD%26%2d1%26USD][com.ibm.commerce.context.entitlement.EntitlementContext|4000000000000000505%264000000000000000505%26null%26%2d2000][com.ibm.commerce.context.experiment.ExperimentContext|null][CTXSETNAME|Store][com.ibm.commerce.context.audit.AuditContext|null]; WC_ACTIVEPOINTER=%2d1%2c10652; WC_USERACTIVITY_-1002=%2d1002%2c10652%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2c9hMW09Lr9qRot%2f%2bbZxrrGhP71ppokVUauZLuyY%2bwhqIe4VYmSxCSCYCavpzQI3WnZ%2bWPO706u0WO%0a%2fd0ZSu29X9jtsTUki1HWw8FB5W5YbVJmcyhif4iaQnWooqR4FLTeVYrJHvaHtMw%3d; WC_SESSION_ESTABLISHED=true; WCS_JSESSIONID=0000AulVYLoP5sLzhTto6hVkYRF:15lh8tfg4'],
      'Referer': ['http://www.some.com/site/here/producthierarchy1_10dd652_-asdfa2']},
    cookies={}, meta={'download_timeout': 180, 'depth': 1, 'link_text': u'The Actual Valid Tag*Power Supplies asdfsdfadsf adsfasdf asdfsadf'}),
  flags=[])
>>> hxs
>>> str(hxs)
'None'

as opposed to when I crawl the page directly:
>>> hxs
<HtmlXPathSelector xpath=None data=u'<html><head><meta name="Keywords" conten'>
From which I can do all the normal xpathing you'd expect...

Very strange to me. Am I not passing the proper request item with the
spider or something?

Best,
Bobby

zanhsieh

Mar 15, 2011, 2:45:34 AM
to scrapy-users
Hi Bob,

I suspect some header parameters might be missing. Two tools I recommend you try:

1. Firefox Live HTTP headers:
https://addons.mozilla.org/en-us/firefox/addon/live-http-headers/

2. Wireshark:
http://www.wireshark.org/

Live HTTP Headers can replay the desired requests, whereas Wireshark gives you a more detailed traffic analysis.
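
If it does turn out that the site expects specific headers, one way to supply them is via settings.py (a sketch only; DEFAULT_REQUEST_HEADERS and USER_AGENT are the documented Scrapy settings, and the values below are just the ones visible in your pasted request, to be replaced with whatever a Live HTTP Headers capture of a working browser request shows):

# settings.py -- placeholder values; copy the real ones from a
# Live HTTP Headers / Wireshark capture of a request that works
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
USER_AGENT = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'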

Bob

Mar 15, 2011, 1:09:50 PM
to scrapy-users
Thank you again.
How would HTTP headers help me in this situation? I understand what HTTP headers are, but I don't really know how to apply that knowledge to find a solution here.

Why would HTTP parameters be missing from items 1...n-1 but not from the last item in the crawled list, or from the manually crawled pages? Does Scrapy handle those pages differently?

Perhaps I'm crawling the pages wrong? Or not passing the right
information? It just seems so odd that ONE element in the list works,
yet the others do not, even though I'm getting Status 200 from the
other pages.

Bob

Mar 17, 2011, 11:42:38 AM
to scrapy-users
Quick reply to my own question: Disabling concurrent connections fixed
my problem. Perhaps I was getting blocked by the webserver.
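
Concretely, that means something along these lines in settings.py (setting names as in the Scrapy settings documentation; older releases used CONCURRENT_REQUESTS_PER_SPIDER instead, so adjust for your version):

# settings.py -- throttle the spider so the server sees one request at a time
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1   # optional: small pause between requests, to be polite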