Scrapy text encoding


Mindcast Mindcast

Feb 8, 2012, 5:10:07 AM
to scrapy...@googlegroups.com
 

Here is my spider

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from vrisko.items import VriskoItem

class vriskoSpider(CrawlSpider):
    name = 'vrisko'
    allowed_domains = ['vrisko.gr']
    start_urls = ['http://www.vrisko.gr/search/%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82/%CE%BA%CE%BF%CF%81%CE%B4%CE%B5%CE%BB%CE%B9%CE%BF']
    rules = (Rule(SgmlLinkExtractor(allow=r'\?page=\d'), 'parse_start_url', follow=True),)

    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        vriskoit = VriskoItem()
        vriskoit['eponimia'] = hxs.select("//a[@itemprop='name']/text()").extract()
        vriskoit['address'] = hxs.select("//div[@class='results_address_class']/text()").extract()
        return vriskoit

My problem is that the returned strings are ASCII and I want to encode them as UTF-8. I don't know the best way to do this. I have tried several approaches without success.

Thank you in advance!

vitsin

Feb 8, 2012, 10:50:35 AM
to scrapy-users
This is what works for me:
unicode(<any_chars>)

vriskoit['eponimia'] = unicode(hxs.select("//a[@itemprop='name']/text()").extract())
vriskoit['address'] = unicode(hxs.select("//div[@class='results_address_class']/text()").extract())

regards,
--vs
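
A note on the suggestion above: in Scrapy, .extract() returns a list of strings, so converting the list as a whole yields the list's printable representation rather than the text of each element. The following is a minimal sketch in Python 3, where str plays the role of Python 2's unicode; the sample values are made up for illustration and do not come from the site.

```python
# Hypothetical stand-in for what hxs.select(...).extract() returns:
# a list of text strings, one per matched node.
extracted = ["Γιατρός Α ", " Γιατρός Β"]

# Converting the list as a whole gives its repr, brackets and quotes included.
whole = str(extracted)

# Converting (and here also trimming) each element keeps a list of strings.
per_item = [str(value).strip() for value in extracted]

print(whole)
print(per_item)
```

Working element by element keeps each item field usable as text instead of a single string that looks like a list.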



Mindcast Mindcast

Feb 8, 2012, 11:01:57 AM
to scrapy...@googlegroups.com
Unfortunately, this does not work for me. I have Greek characters, so I need UTF-8 encoding!

vitsin

Feb 8, 2012, 11:13:11 AM
to scrapy-users
Convert the extracted ASCII characters to unicode, decoding them as UTF-8:

eponimia_utf8 = unicode(hxs.select("//a[@itemprop='name']/text()").extract(), "utf-8")
vriskoit['eponimia'] = eponimia_utf8

This works if .extract() returns an ASCII string object.

--vs
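
As a sketch of the condition stated above: passing an encoding to the text constructor decodes byte strings, and it raises TypeError when the value is already text. The helper name to_text is hypothetical, and the example is written for Python 3, where bytes and str correspond roughly to Python 2's str and unicode.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it only when it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = "γιατρος".encode("utf-8")  # bytes, as they would arrive over the network
decoded = to_text(raw)           # byte string is decoded to text
unchanged = to_text("γιατρος")   # text passes through untouched

try:
    str("γιατρος", "utf-8")      # asking to decode something already text
except TypeError as exc:
    print("TypeError:", exc)
```

Checking the type first avoids the error when a selector already hands back decoded text.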

Mindcast Mindcast

Feb 8, 2012, 11:22:51 AM
to scrapy...@googlegroups.com
No, it was my mistake: it returns unicode, but I need UTF-8.
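
For the case described here, where the selector already returns unicode text, one possible approach is to encode each extracted string explicitly. This is only a sketch and not necessarily what the other participants had in mind; the sample list and the name encode_all are assumptions, and it is written for Python 3 (in Python 2 the same .encode call exists on unicode objects).

```python
def encode_all(values, encoding="utf-8"):
    """Encode every text string in values to bytes using the given encoding."""
    return [value.encode(encoding) for value in values]


# Hypothetical stand-in for the list returned by .extract().
names = ["γιατρος", "κορδελιο"]
encoded = encode_all(names)

# These byte values match the percent-encoded search terms
# in the spider's start URL (%CE%B3%CE%B9... and %CE%BA%CE%BF...).
print(encoded[0])
print(encoded[1])
```

Encoding is usually easiest to reason about when it happens once, at the point where the data is written out, while the item fields themselves stay as text.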

Максим Горковский

Feb 8, 2012, 8:16:44 PM
to scrapy...@googlegroups.com
The answer is here:
http://groups.google.com/group/scrapy-users/browse_thread/thread/ac971f485559dcee

--
Best regards,
Максим Горковский