How do you strip out \xa0


Scott

Nov 25, 2011, 11:29:44 PM
to scrapy...@googlegroups.com
I can't figure out how to remove the \xa0 that is causing problems with my item loader. Either the fix is obvious and I'm too new to Python and Scrapy to see it, or I'm doing something wrong and shouldn't be hitting this problem in the first place. When I run my spider, it scrapes the information from the page fine, but the loaded item then fails in my CSV pipeline with the following error. Can someone give me a little guidance before I pull all my hair out?

Error message looks like this:
2011-11-25 22:10:54-0600 [mysite.com] ERROR: Error processing outItem(brand=u'Motorola', cost=u'$1.50', link=u'http://www.mysite.com/motorolaz6lens.aspx', supplier=u'SUPPLIER', model=u'Z6m', desc=u'Motorola Z6 , Z6m \xa0lens')
Traceback (most recent call last):
 File "C:\Python27\lib\site-packages\scrapy-0.12.0.2550-py2.7.egg\scrapy\middleware.py", line 54, in _process_chain
   return process_chain(self.methods[methodname], obj, *args)
 File "C:\Python27\lib\site-packages\scrapy-0.12.0.2550-py2.7.egg\scrapy\utils\defer.py", line 65, in process_chain
   d.callback(input)
 File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 361, in callback
   self._startRunCallbacks(result)
 File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 455, in _startRunCallbacks
   self._runCallbacks()
--- <exception caught here> ---
 File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 542, in _runCallbacks
   current.result = callback(current.result, *args, **kw)
 File "C:\scrapes\wire\wire\pipelines.py", line 14, in process_item
   self.csvwriter.writerow([item['supplier'], item['brand'], item['model'], item['desc'], item['cost'], item['link']])
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 18: ordinal not in range(128)
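If I'm reading the traceback right, the \xa0 itself isn't the crash; the crash is Python's default ASCII codec choking on it when the unicode value is converted to a byte string inside csv. A minimal reproduction (the variable name desc is just for illustration):

```python
# Reproduce the failure from the traceback: encoding a unicode string
# containing u'\xa0' with the ASCII codec raises UnicodeEncodeError.
# On Python 2 the csv module does this implicitly; here we do it by hand.
desc = u'Motorola Z6 , Z6m \xa0lens'
try:
    desc.encode('ascii')
    raised = False
except UnicodeEncodeError as exc:
    raised = True
    print(exc)  # 'ascii' codec can't encode character ... in position 18
```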

items.py looks like this:

from scrapy.item import Item, Field
from scrapy.contrib.loader.processor import MapCompose, Join

class outItem(Item): 

    supplier = Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join(),    
    )

    brand = Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join(),     
    )
  
    model = Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join(),    
    )
  
    desc = Field( 
        input_processor=MapCompose(unicode.strip) , 
        output_processor=Join(), 
    ) 
    
    cost = Field( 
        input_processor=MapCompose(unicode.strip) , 
        output_processor=Join(), 
    ) 

    link = Field( 
        input_processor=MapCompose(unicode.strip) , 
        output_processor=Join(), 
    ) 
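One way to keep u'\xa0' out of the items entirely is to chain an extra cleaning step into each MapCompose. A sketch of such a helper (the name clean_field is mine, not Scrapy's; it relies on split() treating the non-breaking space as whitespace, which both Python 2's unicode.split and Python 3's str.split do):

```python
def clean_field(value):
    # split() breaks on any Unicode whitespace (including \xa0, \t, \n),
    # so rejoining with plain spaces both strips and normalises the text
    return u' '.join(value.split())
```

It could then be used as, e.g., `input_processor=MapCompose(clean_field)` in place of `MapCompose(unicode.strip)`.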

The section of the spider.py looks like this:

def parseItem(self, response):

        l = XPathItemLoader(item=outItem(), response=response)
        l.add_value('model', response.request.meta['model'])
        l.add_value('brand', response.request.meta['brand'])
        l.add_xpath('desc', "//div[@id='product-detail-div' and @class='product-detail']/table[@class='prod-detail']/tr[2]/td[@class='prod-detail-bt']/div[@class='prod-detail-desc']/div/text()")
        l.add_xpath('cost', "//div[@id='product-detail-div' and @class='product-detail']/table[1][@class='prod-detail']/tr[1]/td[2][@class='prod-detail-rt']/div[1][@class='prod-detail-price']/div[@class='prod-detail-cost']/span[2][@class='prod-detail-cost-value']/text()")
        l.add_value('link', str_to_unicode(response.url))
        l.add_value('supplier', u'SUPPLIER')
        loadeditem = l.load_item()
        yield loadeditem
    
All guidance is greatly appreciated.
Best Regards

Scott

Nov 25, 2011, 11:36:11 PM
to scrapy...@googlegroups.com
The pipelines.py looks like this:

import csv

class CsvWriterPipeline(object):

    def __init__(self):
        self.csvwriter = csv.writer(open('wire.csv', 'wb'))

    def process_item(self, item, spider):
        
        self.csvwriter.writerow([item['supplier'], item['brand'], item['model'], item['desc'], item['cost'], item['link']])
        
        return item
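Alternatively, the pipeline itself could write non-ASCII text instead of cleaning it away. A sketch of the writing step with an explicit encoding (the function name write_rows_utf8 is illustrative; this is the Python 3 csv/io API, whereas on Python 2 you would instead call .encode('utf-8') on each unicode field before writerow):

```python
import csv
import io

def write_rows_utf8(path, rows):
    # Opening the file with an explicit encoding means characters such
    # as u'\xa0' are written as UTF-8 rather than hitting the default
    # ASCII codec that raised the UnicodeEncodeError above.
    with io.open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)
```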

Rivka Shenhav

Nov 26, 2011, 3:28:17 PM
to scrapy...@googlegroups.com
try:

pClnUp = re.compile(u'\n|\t|\xa0')

and then, if t is your text (a unicode string),

t1 = pClnUp.sub(u'', t)

will get rid of the u'\xa0' in the text (in each of your items, or in the item that has that portion) and avoid the error messages. You'll need import re, and don't wrap the result in str(): on Python 2 that conversion back to a byte string raises the same ASCII UnicodeEncodeError.
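For reference, that cleanup as a self-contained function (the name clean_text is illustrative):

```python
import re

# compiled once; matches newlines, tabs and the non-breaking space
pClnUp = re.compile(u'\n|\t|\xa0')

def clean_text(t):
    # keep the result as unicode: wrapping it in str() on Python 2
    # would re-raise the same ASCII UnicodeEncodeError
    return pClnUp.sub(u'', t)
```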


