Images Pipelines and scraping


tek123

Aug 28, 2010, 9:37:19 AM
to scrapy-users
I am a beginner to Scrapy, and after going through the basic tutorial I
managed to scrape what I wanted. Now I would like to extract the
images as well, but I have run into some problems.

http://doc.scrapy.org/topics/images.html#scrapy.contrib.pipeline.images.ImagesPipeline

Using the above documentation I can't seem to work out where
everything goes. It shows the full example:

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.core.exceptions import DropItem
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [info['path'] for success, info in results if success]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

But I don't know where to place this code. Do I place it in
pipelines.py or items.py? I've tried both, but then I get the error
'No module named Image'. Thank you for your response.

Leonardo Lazzaro

Aug 28, 2010, 2:28:31 PM
to scrapy...@googlegroups.com
You must put that in pipelines.py.
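(For the pipeline class to actually run, it also has to be enabled in the project's settings.py. This is a sketch, assuming the project layout that appears later in this thread -- a project named elim with MyImagesPipeline in elim/pipelines.py -- and using the Scrapy 0.x list-style setting; the IMAGES_STORE path is a made-up placeholder.)

```python
# settings.py (Scrapy 0.x style)

# Register the custom pipeline so Scrapy instantiates it.
ITEM_PIPELINES = ['elim.pipelines.MyImagesPipeline']

# Directory where the images pipeline saves downloaded files
# (hypothetical path -- choose your own).
IMAGES_STORE = 'c:/Python26/Scripts/elim/images'
```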


--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To post to this group, send email to scrapy...@googlegroups.com.
To unsubscribe from this group, send email to scrapy-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/scrapy-users?hl=en.


tek123

Aug 29, 2010, 7:29:45 AM
to scrapy-users
I have tried the pipeline, but I don't know how to wire it into the
spider. This is what I get when I try to crawl:

c:\Python26\Scripts\elim>python scrapy-ctl.py crawl ecolabelindex.com
2010-08-29 12:26:34+0100 [-] Log opened.
Traceback (most recent call last):
  File "scrapy-ctl.py", line 7, in <module>
    execute()
  File "C:\Python26\lib\site-packages\scrapy\cmdline.py", line 127, in execute
    scrapymanager.configure(control_reactor=True)
  File "C:\Python26\lib\site-packages\scrapy\core\manager.py", line 30, in configure
    spiders.load()
  File "C:\Python26\lib\site-packages\scrapy\contrib\spidermanager.py", line 71, in load
    for spider in self._getspiders(ISpider, module):
  File "C:\Python26\lib\site-packages\scrapy\contrib\spidermanager.py", line 85, in _getspiders
    adapted = interface(plugin, None)
  File "C:\Python26\zope\interface\interface.py", line 631, in _call_conform
    return conform(self)
  File "C:\Python26\lib\site-packages\twisted\plugin.py", line 68, in __conform__
    return self.load()
  File "C:\Python26\lib\site-packages\twisted\plugin.py", line 63, in load
    return namedAny(self.dropin.moduleName + '.' + self.name)
  File "C:\Python26\lib\site-packages\twisted\python\reflect.py", line 464, in namedAny
    topLevelPackage = _importAndCheckStack(trialname)
  File "c:\Python26\Scripts\elim\elim\spiders\elim_spider.py", line 2, in <module>
    from scrapy.contrib.pipeline.images import ImagesPipeline
  File "C:\Python26\lib\site-packages\scrapy\contrib\pipeline\images.py", line 13, in <module>
    import Image
ImportError: No module named Image

My spider is:

from scrapy.spider import BaseSpider
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.core.exceptions import DropItem
from scrapy.http import Request

class ElimSpider(BaseSpider):
    name = "ecolabelindex.com"
    allowed_domains = ["ecolabelindex.com"]
    start_urls = [
        "http://www.ecolabelindex.com/ecolabels",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[2]/h4')
        items = []
        for site in sites:
            item = ElimItem()
            item['title'] = site.select('a/text()').extract()
            #image[] = site.select('a/text()').extract()
            #item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

SPIDER = ElimSpider()

I am a little confused about how to crawl the images and define the
fields in the item, or are they unrelated to the images pipeline?
Thank you for your help.

Daniel Graña

Aug 29, 2010, 9:10:56 AM
to scrapy...@googlegroups.com
You must install the Python Imaging Library (PIL).

On Debian-based Linux distributions, just:
$ sudo apt-get install python-imaging



tek123

Aug 30, 2010, 8:07:53 AM
to scrapy-users
Thank you Daniel, I have now installed PIL. Just some clarification
on the spider: I get this traceback.

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\tekken6>cd c:\python26\scripts\elim

c:\Python26\Scripts\elim>python scrapy-ctl.py crawl ecolabelindex.com
2010-08-30 13:04:03+0100 [-] Log opened.
2010-08-30 13:04:03+0100 [elim] DEBUG: Enabled extensions: CloseSpider, WebService, TelnetConsole, CoreStats
2010-08-30 13:04:03+0100 [elim] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2010-08-30 13:04:03+0100 [elim] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloaderStats, UserAgentMiddleware, RedirectMiddleware, DefaultHeadersMiddleware, CookiesMiddleware, HttpCompressionMiddleware, RetryMiddleware
2010-08-30 13:04:03+0100 [elim] DEBUG: Enabled spider middlewares: UrlLengthMiddleware, HttpErrorMiddleware, RefererMiddleware, OffsiteMiddleware, DepthMiddleware
2010-08-30 13:04:03+0100 [elim] DEBUG: Enabled item pipelines: MyImagesPipeline
2010-08-30 13:04:03+0100 [-] scrapy.webservice.WebService starting on 6080
2010-08-30 13:04:03+0100 [-] scrapy.telnet.TelnetConsole starting on 6023
2010-08-30 13:04:03+0100 [ecolabelindex.com] INFO: Spider opened
2010-08-30 13:04:03+0100 [ecolabelindex.com] DEBUG: Redirecting (301) to <GET http://www.ecolabelindex.com/ecolabels/> from <GET http://www.ecolabelindex.com/ecolabels>
2010-08-30 13:04:06+0100 [ecolabelindex.com] DEBUG: Crawled (200) <GET http://www.ecolabelindex.com/ecolabels/> (referer: None)
2010-08-30 13:04:06+0100 [ecolabelindex.com] ERROR: Spider exception caught while processing <http://www.ecolabelindex.com/ecolabels> (referer: <None>): [Failure instance: Traceback: <type 'exceptions.NameError'>: global name 'HtmlXPathSelector' is not defined
        C:\Python26\lib\site-packages\twisted\internet\base.py:1179:mainLoop
        C:\Python26\lib\site-packages\twisted\internet\base.py:778:runUntilCurrent
        C:\Python26\lib\site-packages\twisted\internet\defer.py:280:callback
        C:\Python26\lib\site-packages\twisted\internet\defer.py:354:_startRunCallbacks
        --- <exception caught here> ---
        C:\Python26\lib\site-packages\twisted\internet\defer.py:371:_runCallbacks
        c:\Python26\Scripts\elim\elim\spiders\elim_spider.py:14:parse
        ]
2010-08-30 13:04:06+0100 [ecolabelindex.com] INFO: Closing spider (finished)
2010-08-30 13:04:06+0100 [ecolabelindex.com] DEBUG: Reloading module elim.spiders.elim_spider
2010-08-30 13:04:06+0100 [ecolabelindex.com] INFO: Spider closed (finished)
2010-08-30 13:04:06+0100 [scrapy.webservice.WebService] (Port 6080 Closed)
2010-08-30 13:04:06+0100 [scrapy.telnet.TelnetConsole] (Port 6023 Closed)
2010-08-30 13:04:06+0100 [-] Main loop terminated.

My spider is:


from scrapy.spider import BaseSpider
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.core.exceptions import DropItem
from scrapy.http import Request

class ElimSpider(BaseSpider):
    name = "ecolabelindex.com"
    allowed_domains = ["ecolabelindex.com"]
    start_urls = [
        "http://www.ecolabelindex.com/ecolabels",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[2]/h4')
        items = []
        for site in sites:
            item = ElimItem()
            item['title'] = site.select('a/text()').extract()
            #image[] = site.select('a/text()').extract()
            #item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

SPIDER = ElimSpider()

So is there a specific Xpath for extracting images? Or am I treating
the images as text? Thank you

Steven Almeroth

Aug 30, 2010, 11:02:38 AM
to scrapy-users
The error reads, "global name 'HtmlXPathSelector' is not defined", so
try adding:

from scrapy.selector import HtmlXPathSelector

Leonardo Lazzaro

Aug 31, 2010, 12:50:00 AM
to scrapy...@googlegroups.com
Why are you importing ImagesPipeline from the spider?
If this is your first spider, try the tutorial in the Scrapy docs to just crawl the names of the ecolabels or similar, then move on to downloading images.




tek123

Aug 31, 2010, 10:04:31 AM
to scrapy-users
I have already scraped the names of the ecolabels, but now I want to
scrape the images. Ok, I have changed my spider to:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from elim.items import ElimItem

class ElimSpider(BaseSpider):
    name = "ecolabelindex.com"
    allowed_domains = ["ecolabelindex.com"]
    start_urls = [
        "http://www.ecolabelindex.com/ecolabels",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[2]/a')
        items = []
        for site in sites:
            item = ElimItem()
            item['title'] = site.select('a/text()').extract()
            #image[] = site.select('a/text()').extract()
            #item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

SPIDER = ElimSpider()

But I think there may be an error with my items.py, which is:

from scrapy.item import Item, Field

class ElimItem(Item):
    title = Field()
    link = Field()
    desc = Field()


The traceback I get is:

2010-08-31 15:01:37+0100 [ecolabelindex.com] DEBUG: Scraped ElimItem(title=[]) in <http://www.ecolabelindex.com/ecolabels>
2010-08-31 15:01:37+0100 [ecolabelindex.com] DEBUG: Scraped ElimItem(title=[]) in <http://www.ecolabelindex.com/ecolabels>
2010-08-31 15:01:37+0100 [ecolabelindex.com] DEBUG: Scraped ElimItem(title=[]) in <http://www.ecolabelindex.com/ecolabels>
2010-08-31 15:01:37+0100 [ecolabelindex.com] DEBUG: Scraped ElimItem(title=[]) in <http://www.ecolabelindex.com/ecolabels>
2010-08-31 15:01:37+0100 [ecolabelindex.com] ERROR: Error processing ElimItem(title=[]) - [Failure instance: Traceback: <type 'exceptions.KeyError'>: 'image_urls'
        C:\Python26\lib\site-packages\scrapy\core\scraper.py:175:_process_spidermw_output
        C:\Python26\lib\site-packages\scrapy\contrib\pipeline\__init__.py:64:process_item
        C:\Python26\lib\site-packages\scrapy\utils\defer.py:39:mustbe_deferred
        C:\Python26\lib\site-packages\scrapy\contrib\pipeline\__init__.py:60:next_stage
        --- <exception caught here> ---
        C:\Python26\lib\site-packages\scrapy\utils\defer.py:39:mustbe_deferred
        C:\Python26\lib\site-packages\scrapy\contrib\pipeline\media.py:38:process_item
        c:\Python26\Scripts\elim\elim\pipelines.py:13:get_media_requests
        C:\Python26\lib\site-packages\scrapy\item.py:51:__getitem__
        ]
2010-08-31 15:01:37+0100 [ecolabelindex.com] ERROR: Error processing ElimItem(title=[]) - [Failure instance: Traceback: <type 'exceptions.KeyError'>: 'image_urls'
        C:\Python26\lib\site-packages\scrapy\core\scraper.py:175:_process_spidermw_output
        C:\Python26\lib\site-packages\scrapy\contrib\pipeline\__init__.py:64:process_item
        C:\Python26\lib\site-packages\scrapy\utils\defer.py:39:mustbe_deferred
        C:\Python26\lib\site-packages\scrapy\contrib\pipeline\__init__.py:60:next_stage
        --- <exception caught here> ---
        C:\Python26\lib\site-packages\scrapy\utils\defer.py:39:mustbe_deferred
        C:\Python26\lib\site-packages\scrapy\contrib\pipeline\media.py:38:process_item
        c:\Python26\Scripts\elim\elim\pipelines.py:13:get_media_requests
        C:\Python26\lib\site-packages\scrapy\item.py:51:__getitem__
        ]
2010-08-31 15:01:37+0100 [ecolabelindex.com] INFO: Closing spider (finished)
2010-08-31 15:01:37+0100 [ecolabelindex.com] DEBUG: Reloading module elim.spiders.elim_spider
2010-08-31 15:01:37+0100 [ecolabelindex.com] INFO: Spider closed (finished)
2010-08-31 15:01:37+0100 [scrapy.webservice.WebService] (Port 6080 Closed)
2010-08-31 15:01:37+0100 [scrapy.telnet.TelnetConsole] (Port 6023 Closed)
2010-08-31 15:01:37+0100 [-] Main loop terminated.

So do I need title = Field() when I'm scraping images?

Thank you for bearing with me.

Daniel Graña

Aug 31, 2010, 11:31:42 AM
to scrapy...@googlegroups.com
2010-08-31 15:01:37+0100 [ecolabelindex.com] ERROR: Error processing ElimItem(title=[]) - [Failure instance: Traceback: <type 'exceptions.KeyError'>: 'image_urls'

The exception suggests that your item has no 'image_urls' set; you need to extract the URLs and set them as item['image_urls'] in your spider. Remember, your images pipeline expects a *list*.

Also, your item must define 'image_urls' and 'image_paths' fields.

Good luck,
Daniel.
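(The failure mode Daniel describes can be reproduced with a plain dict standing in for the Scrapy Item -- a sketch, not Scrapy's actual Item class; the URL below is a made-up placeholder.)

```python
# Minimal reproduction: get_media_requests reads item['image_urls'],
# so an item scraped without that key raises KeyError.
item = {'title': []}  # what the spider actually produced

try:
    item['image_urls']  # the first thing get_media_requests does
    missing = False
except KeyError:
    missing = True

print(missing)  # True: the spider never set the key

# After the fix, the spider sets the key to a *list* of URLs
# (and items.py declares image_urls/image_paths as Field()s).
item['image_urls'] = ['http://www.ecolabelindex.com/files/some-logo.png']
print(len(item['image_urls']))
```

The real fix is then twofold: add `image_urls = Field()` and `image_paths = Field()` to ElimItem in items.py, and populate item['image_urls'] in the spider's parse() method.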


Pablo Hoffman

Aug 31, 2010, 3:05:59 PM
to scrapy...@googlegroups.com
Hey,

This thread motivated this ticket and change:
http://dev.scrapy.org/ticket/217
http://dev.scrapy.org/changeset/2241

Which will be part of Scrapy 0.10.

Pablo.

On Tue, Aug 31, 2010 at 12:31:42PM -0300, Daniel Grana wrote:
> 2010-08-31 15:01:37+0100 [ecolabelindex.com] ERROR: Error processing
> ElimItem(title=[]) - [Failure instance: Traceback: <type
> 'exceptions.KeyError'>: 'image_urls'
>
> the exception suggests that your item has not 'image_urls', you need to
> extract urls and set them to item['image_urls'] in your spider. Remember,

> your imagepipeline expects a *list*).


tek123

Sep 7, 2010, 7:36:58 AM
to scrapy-users
Thank you Pablo, the change really helped. Now one final thing: in
order to extract the images from the spider, I am having trouble with
the site.select XPath.

My spider is

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from elim.items import ElimItem

class ElimSpider(BaseSpider):
    name = "ecolabelindex.com"
    allowed_domains = ["ecolabelindex.com"]
    start_urls = [
        "http://www.ecolabelindex.com/ecolabels",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[2]/a')
        items = []
        for site in sites:
            item = ElimItem()
            item['image_urls'] = site.select('a/text()').extract()
            item['images'] = site.select('a/@src()').extract()
            #item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

SPIDER = ElimSpider()

It is working OK but returning blanks, since I don't have the right
XPath for the images in the item['images'] = site.select('a/@src()').extract()
part. If I am extracting images, what would I replace ('a/@src()') with?

Thank you


Pablo Hoffman

Sep 7, 2010, 10:29:57 AM
to scrapy...@googlegroups.com
Hi tek123,

You should only extract the 'image_urls' field in your spider. The pipeline
will populate the 'images' field after downloading the images.

If this isn't clear in the documentation let me know and I'll update it.

Pablo.


tek123

Sep 9, 2010, 8:10:16 AM
to scrapy-users
Hi Pablo,

I think I am having some trouble understanding the XPath expressions.
I am still learning HTML/PHP at a very basic level, so I cannot work
out how to extract the image URLs from this site,
ecolabelindex.com/ecolabels. In order to extract the ecolabel images,
no matter what XPath I type into the shell, it returns nothing. Here
are some things I've tried:

hxs.select('//div[2]/a').extract()
hxs.select('//a[contains(@href, "image")]/img/@src').extract()
hxs.select('//a[contains(@href, "image")]/@href').extract()

I just seem to get an empty list []. The HTML code is:

<div class="grid_3 alpha" style="text-align: center">
<a class="image" href="/ecolabel/100-green-electricity-100-energia-verde"><img src="/files/ecolabel-logos-sized/100-green-electricity-100-energia-verde.png" width="100" height="104" class="image" alt="100% Green Electricity - 100% Energia Verde logo" /></a>
</div>

Which XPath would I use?

Thank you
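(A way to sanity-check an XPath against the snippet above using only the standard library -- ElementTree supports a subset of XPath, and Scrapy's HtmlXPathSelector takes the same kind of expression. This is a sketch, not the thread's actual Scrapy code; the html string is the snippet from the message, reassembled.)

```python
import xml.etree.ElementTree as ET

html = (
    '<div class="grid_3 alpha" style="text-align: center">'
    '<a class="image" href="/ecolabel/100-green-electricity-100-energia-verde">'
    '<img src="/files/ecolabel-logos-sized/100-green-electricity-100-energia-verde.png"'
    ' width="100" height="104" class="image"'
    ' alt="100% Green Electricity - 100% Energia Verde logo" /></a></div>'
)

root = ET.fromstring(html)
# The <img> sits inside <a class="image">, so select it there and read @src:
srcs = [img.get('src') for img in root.findall(".//a[@class='image']/img")]
print(srcs)
```

This prints the one relative src path, which suggests the failing attempts above missed because they filtered on @href containing "image" rather than on the class attribute; in the Scrapy shell, something like hxs.select('//a[@class="image"]/img/@src').extract() should match the same nodes.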
tek123

Sep 19, 2010, 7:59:43 AM
to scrapy-users
Ok, now I have the correct XPath: /html/body/div[2]/div/div[4]/div/div/div/a/img

But when I crawl, it just passes the item and there are no images in the
image store?

Thank you
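(One possible cause -- an assumption here, not something confirmed later in the thread: the src attributes on this page are site-relative ("/files/..."), and the images pipeline downloads whatever URLs the spider puts in item['image_urls'], so they need to be made absolute first. A standard-library sketch; in the thread's Python 2.6, the same function lives in the urlparse module rather than urllib.parse.)

```python
# Turn a relative image src into an absolute URL the pipeline can fetch.
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

page_url = 'http://www.ecolabelindex.com/ecolabels/'
src = '/files/ecolabel-logos-sized/100-green-electricity-100-energia-verde.png'

image_url = urljoin(page_url, src)
print(image_url)
# http://www.ecolabelindex.com/files/ecolabel-logos-sized/100-green-electricity-100-energia-verde.png
```

In the spider, that would mean joining each extracted src against response.url before assigning the list to item['image_urls'].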

Siddharth Jain

Jan 18, 2013, 6:30:06 AM
to scrapy...@googlegroups.com
Hey guys, I am trying to work with regular expressions and I don't know how to use them, but I have tried code that gives only one error, IndexError: list index out of range. Please tell me how to run a spider.




import csv
from urllib import urlopen  # Python 2; on Python 3: from urllib.request import urlopen
import re

# open and read the html/xml (the original post never showed this step;
# feed_url is a placeholder for whatever URL was being fetched)
xml = urlopen(feed_url).read()

# grab article titles and urls using regex
xmlTitle = re.compile("<title>(.*)</title>")
xmlLink = re.compile("<link>(.*)</link>")

# store the data
findTitle = re.findall(xmlTitle, xml)
findLink = re.findall(xmlLink, xml)

# open the csv file
writer = csv.writer(open("pytest.csv", "wb"))
head = ("Title", "URL")
writer.writerow(head)

# write the results into the csv file; iterating over the matches
# themselves instead of a hard-coded range(1, 25) avoids the
# IndexError when fewer than 25 items are found
for title, link in zip(findTitle, findLink):
    writer.writerow([title, link])

