How do you strip out \xa0


Scott

Nov 25, 2011, 11:29:44 PM
to scrapy...@googlegroups.com
I can't figure out how to remove the \xa0 that is causing problems with my item loader. Either the fix is obvious and I'm too new to Python and Scrapy to see it, or I'm doing something wrong and shouldn't be hitting this problem in the first place. When I run my spider, it scrapes the information from the page fine, but the loaded item then fails in my CSV pipeline with the following error. Can someone give me a little guidance before I pull all my hair out?

Error message looks like this:
2011-11-25 22:10:54-0600 [mysite.com] ERROR: Error processing outItem(brand=u'Motorola', cost=u'$1.50', link=u'http://www.mysite.com/motorolaz6lens.aspx', supplier=u'SUPPLIER', model=u'Z6m', desc=u'Motorola Z6 , Z6m \xa0lens')
Traceback (most recent call last):
 File "C:\Python27\lib\site-packages\scrapy-0.12.0.2550-py2.7.egg\scrapy\middleware.py", line 54, in _process_chain
   return process_chain(self.methods[methodname], obj, *args)
 File "C:\Python27\lib\site-packages\scrapy-0.12.0.2550-py2.7.egg\scrapy\utils\defer.py", line 65, in process_chain
   d.callback(input)
 File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 361, in callback
   self._startRunCallbacks(result)
 File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 455, in _startRunCallbacks
   self._runCallbacks()
--- <exception caught here> ---
 File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 542, in _runCallbacks
   current.result = callback(current.result, *args, **kw)
 File "C:\scrapes\wire\wire\pipelines.py", line 14, in process_item
   self.csvwriter.writerow([item['supplier'], item['brand'], item['model'], item['desc'], item['cost'], item['link']])
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 18: ordinal not in range(128)
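If I'm reading the traceback right, the \xa0 itself isn't the crash; the crash is Python's default ASCII codec choking on it when the unicode value is converted to a byte string inside csv. A minimal reproduction (the variable name desc is just for illustration):

```python
# Reproduce the failure from the traceback: encoding a unicode string
# containing u'\xa0' with the ASCII codec raises UnicodeEncodeError.
# On Python 2 the csv module does this implicitly; here we do it by hand.
desc = u'Motorola Z6 , Z6m \xa0lens'
try:
    desc.encode('ascii')
    raised = False
except UnicodeEncodeError as exc:
    raised = True
    print(exc)  # 'ascii' codec can't encode character ... in position 18
```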

items.py looks like this:

from scrapy.item import Item, Field
from scrapy.contrib.loader.processor import MapCompose, Join

class outItem(Item): 

    supplier = Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join(),    
    )

    brand = Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join(),     
    )
  
    model = Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join(),    
    )
  
    desc = Field( 
        input_processor=MapCompose(unicode.strip) , 
        output_processor=Join(), 
    ) 
    
    cost = Field( 
        input_processor=MapCompose(unicode.strip) , 
        output_processor=Join(), 
    ) 

    link = Field( 
        input_processor=MapCompose(unicode.strip) , 
        output_processor=Join(), 
    ) 
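One way to keep u'\xa0' out of the items entirely is to chain an extra cleaning step into each MapCompose. A sketch of such a helper (the name clean_field is mine, not Scrapy's; it relies on split() treating the non-breaking space as whitespace, which both Python 2's unicode.split and Python 3's str.split do):

```python
def clean_field(value):
    # split() breaks on any Unicode whitespace (including \xa0, \t, \n),
    # so rejoining with plain spaces both strips and normalises the text
    return u' '.join(value.split())
```

It could then be used as, e.g., `input_processor=MapCompose(clean_field)` in place of `MapCompose(unicode.strip)`.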

The section of the spider.py looks like this:

def parseItem(self, response):

        l = XPathItemLoader(item=outItem(), response=response)
        l.add_value('model', response.request.meta['model'])
        l.add_value('brand', response.request.meta['brand'])
        l.add_xpath('desc', "//div[@id='product-detail-div' and @class='product-detail']/table[@class='prod-detail']/tr[2]/td[@class='prod-detail-bt']/div[@class='prod-detail-desc']/div/text()")
        l.add_xpath('cost', "//div[@id='product-detail-div' and @class='product-detail']/table[1][@class='prod-detail']/tr[1]/td[2][@class='prod-detail-rt']/div[1][@class='prod-detail-price']/div[@class='prod-detail-cost']/span[2][@class='prod-detail-cost-value']/text()")
        l.add_value('link', str_to_unicode(response.url))
        l.add_value('supplier', u'SUPPLIER')
        loadeditem = l.load_item()
        yield loadeditem
    
All guidance is greatly appreciated.
Best Regards

Scott

Nov 25, 2011, 11:36:11 PM
to scrapy...@googlegroups.com
The pipelines.py looks like this:

import csv

class CsvWriterPipeline(object):

    def __init__(self):
        self.csvwriter = csv.writer(open('wire.csv', 'wb'))

    def process_item(self, item, spider):
        
        self.csvwriter.writerow([item['supplier'], item['brand'], item['model'], item['desc'], item['cost'], item['link']])
        
        return item
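Alternatively, the pipeline itself could write non-ASCII text instead of cleaning it away. A sketch of the writing step with an explicit encoding (the function name write_rows_utf8 is illustrative; this is the Python 3 csv/io API, whereas on Python 2 you would instead call .encode('utf-8') on each unicode field before writerow):

```python
import csv
import io

def write_rows_utf8(path, rows):
    # Opening the file with an explicit encoding means characters such
    # as u'\xa0' are written as UTF-8 rather than hitting the default
    # ASCII codec that raised the UnicodeEncodeError above.
    with io.open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)
```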

Rivka Shenhav

Nov 26, 2011, 3:28:17 PM
to scrapy...@googlegroups.com
try:

pClnUp = re.compile(u'\n|\t|\xa0')

and then, if t is your text (a unicode string),

t1 = pClnUp.sub(u'', t)

will get rid of the u'\xa0' in the text (in each of your items, or in the item that has that portion) and avoid the error messages. You'll need import re, and don't wrap the result in str(): on Python 2 that conversion back to a byte string raises the same ASCII UnicodeEncodeError.
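For reference, that cleanup as a self-contained function (the name clean_text is illustrative):

```python
import re

# compiled once; matches newlines, tabs and the non-breaking space
pClnUp = re.compile(u'\n|\t|\xa0')

def clean_text(t):
    # keep the result as unicode: wrapping it in str() on Python 2
    # would re-raise the same ASCII UnicodeEncodeError
    return pClnUp.sub(u'', t)
```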


