Hi Pablo,
I think I'm having some trouble understanding the XPath
selectors. I'm still learning HTML/PHP at a very basic level, so I
can't work out how to extract the image URLs from this site,
ecolabelindex.com/ecolabels. No matter what XPath I type in the
shell to extract the ecolabel images, it returns nothing. Here are
some things I've tried:
hxs.select('//div[2]/a').extract()
hxs.select('//a[contains(@href, "image")]/img/@src').extract()
hxs.select('//a[contains(@href, "image")]/@href').extract()
I just get an empty list []. The HTML is:
<div class="grid_3 alpha" style="text-align: center">
  <a class="image" href="/ecolabel/100-green-electricity-100-energia-verde"><img
    src="/files/ecolabel-logos-sized/100-green-electricity-100-energia-verde.png"
    width="100" height="104" class="image"
    alt="100% Green Electricity - 100% Energia Verde logo" /></a>
</div>
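As a sanity check outside the Scrapy shell, I also tried the XPath against the snippet above with plain lxml (standalone test code, not part of my spider), and there it does match, so maybe the live page is structured differently:

```python
# Standalone sanity check: run an XPath against the pasted HTML snippet
# using lxml directly, outside of Scrapy.
from lxml import html

snippet = '''<div class="grid_3 alpha" style="text-align: center">
  <a class="image" href="/ecolabel/100-green-electricity-100-energia-verde"><img
    src="/files/ecolabel-logos-sized/100-green-electricity-100-energia-verde.png"
    width="100" height="104" class="image"
    alt="100% Green Electricity - 100% Energia Verde logo" /></a>
</div>'''

tree = html.fromstring(snippet)
# The anchor's class (not its href) is "image", so select on @class
srcs = tree.xpath('//a[@class="image"]/img/@src')
print(srcs)
# → ['/files/ecolabel-logos-sized/100-green-electricity-100-energia-verde.png']
```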
Which XPath would I use?
Thank you
> > > > >     name = "ecolabelindex.com"
> > > > >     allowed_domains = ["ecolabelindex.com"]
> > > > >     start_urls = [
> > > > >         "http://www.ecolabelindex.com/ecolabels",
> > > > >     ]
>
> > > > >     def parse(self, response):
> > > > >         hxs = HtmlXPathSelector(response)
> > > > >         sites = hxs.select('//div[2]/a')
> > > > >         items = []
> > > > >         for site in sites:
> > > > >             item = ElimItem()
> > > > >             item['title'] = site.select('a/text()').extract()
> > > > >             #image[] = site.select('a/text()').extract()
> > > > >             #item['desc'] = site.select('text()').extract()
> > > > >             items.append(item)
> > > > >         return items
> >
> > > > > SPIDER = ElimSpider()
>
> > > > > But I think there may be an error in my items.py, which is:
>
> > > > > from scrapy.item import Item, Field
>
> > > > > class ElimItem(Item):
> > > > >     title = Field()
> > > > >     link = Field()
> > > > >     desc = Field()
> > > > >         C:\Python26\lib\site-packages\scrapy\core\scraper.py:175:_process_spidermw_output
> > > > >         C:\Python26\lib\site-packages\scrapy\contrib\pipeline\__init__.py:64:process_item
> > > > >         C:\Python26\lib\site-packages\scrapy\utils\defer.py:39:mustbe_deferred
> > > > >         C:\Python26\lib\site-packages\scrapy\contrib\pipeline\__init__.py:60:next_stage
> > > > >         --- <exception caught here> ---
> > > > >         C:\Python26\lib\site-packages\scrapy\utils\defer.py:39:mustbe_deferred
> > > > >         C:\Python26\lib\site-packages\scrapy\contrib\pipeline\media.py:38:process_item
> > > > >         c:\Python26\Scripts\elim\elim\pipelines.py:13:get_media_requests
> > > > >         C:\Python26\lib\site-packages\scrapy\item.py:51:__getitem__
> > > > >         ]
> > > > > 2010-08-31 15:01:37+0100 [ecolabelindex.com] ERROR: Error processing
> > > > > ElimItem(title=[]) - [Failure instance: Traceback: <type 'exceptions.KeyError'>: 'image_urls'
> > > > > > >  File "scrapy-ctl.py", line 7, in <module>
> > > > > > >    execute()
> > > > > > >  File "C:\Python26\lib\site-packages\scrapy\cmdline.py", line 127, in execute
> > > > > > >    scrapymanager.configure(control_reactor=True)
> > > > > > >  File "C:\Python26\lib\site-packages\scrapy\core\manager.py", line 30, in configure
> > > > > > >    spiders.load()
> > > > > > >  File "C:\Python26\lib\site-packages\scrapy\contrib\spidermanager.py", line 71, in load
> > > > > > >    for spider in self._getspiders(ISpider, module):
> > > > > > >  File "C:\Python26\lib\site-packages\scrapy\contrib\spidermanager.py", line 85, in _getspiders
> > > > > > >    adapted = interface(plugin, None)
> > > > > > >  File "C:\Python26\zope\interface\interface.py", line 631, in _call_conform
> > > > > > >    return conform(self)
> > > > > > >  File "C:\Python26\lib\site-packages\twisted\plugin.py", line 68, in __conform__
> > > > > > >    return self.load()
> > > > > > >  File "C:\Python26\lib\site-packages\twisted\plugin.py", line 63, in load
> > > > > > >    return namedAny(self.dropin.moduleName + '.' + self.name)