Scraping Chinese sites

simon

Sep 21, 2009, 1:18:11 AM

to scrapy-users

Hi All,
I'm trying to use Scrapy to crawl some Chinese websites.
But all the Chinese characters in the response seem to be encoded, like
\xc8\xb8\xbd\xfd
I tried searching Google but didn't have any luck.

Could anyone help me out?
What I want is that, when I write the extracted response to a file, it
shows up in Chinese.
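
For readers hitting the same symptom: a sequence like \xc8\xb8\xbd\xfd is how Python displays raw bytes that have not been decoded into text yet. Below is a minimal sketch in Python 3 (the thread itself uses Python 2), assuming the page is served in GBK/GB2312, which is common for Chinese sites but is not confirmed anywhere in this thread. The helper name `decode_body` is hypothetical and is not part of Scrapy.

```python
# The byte sequence quoted in the question above.
raw = b"\xc8\xb8\xbd\xfd"


def decode_body(raw_bytes, encoding="gbk"):
    """Decode raw response bytes into text.

    The default encoding is an assumption; check the page's declared
    charset before relying on it.
    """
    return raw_bytes.decode(encoding)


text = decode_body(raw)

# Under the GBK assumption, the four bytes form two double-byte characters,
# and encoding the text again gives back the original bytes.
round_trip = text.encode("gbk")
```

If the decoded text looks wrong, the assumed encoding is probably not the one the site actually uses.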

Thank you in advance!

-Simon

Pablo Hoffman

Sep 21, 2009, 12:00:20 PM

to scrapy-users

Simon,

Looks like you're having some kind of encoding issue (unlikely to be a
Scrapy bug). Can you provide the steps to reproduce the problem?

Pablo.

Haisheng Wu

Sep 22, 2009, 1:22:40 PM

to scrapy...@googlegroups.com

Hi, Pablo,

Here are the steps to reproduce the problem (with the scrapy shell):

a. python scrapy-ctl.py shell http://www.cncrk.com/downinfo/11390.html

(The page is basically a software download information page with lots of ads)

b. select the software name

>>> hxs.select("//div[@id='softtitle']/span/text()").extract()

You'll see that the result is not in Chinese.

The same thing happens when a pipeline writes the result to a CSV file (csvwriter).

(I'll publish the 'study' project later, but its idea is very similar to these steps.)

Thanks for your help.

-Simon

simon

Sep 23, 2009, 8:28:08 AM

to scrapy-users

The Scrapy project is also available at
http://code.google.com/p/personal-study/source/browse/#svn/trunk/python-work/cncrk

Thanks.
-Simon

Pablo Hoffman

Sep 23, 2009, 8:38:12 AM

to scrapy...@googlegroups.com

Simon,

When you type in the shell:

>>> hxs.select("//div[@id='softtitle']/span/text()").extract()

[u'\u5c04\u624b\u5f71\u97f3\u64ad\u653e\u5668\u524d\u536b\u7248 V2.3.709 \u7b80\u4f53\u4e2d\u6587\u7eff\u8272\u514d\u8d39\u7248 ']

You're actually seeing the repr() of the object. That's the convention all
Python consoles use when they echo the value of an expression.

What you want is to print the contents, not to see their Python
representation. So you can use "print" instead:

>>> print hxs.select("//div[@id='softtitle']/span/text()").extract()[0]
射手影音播放器前卫版 V2.3.709 简体中文绿色免费版

In both cases, the scraped data is the same. And you can write it to a file,
store it in a database, etc.
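
The same point as a minimal sketch in Python 3 syntax (the session above is Python 2). The names `TITLE`, `escaped_view` and `save_titles` are hypothetical, and writing the file as UTF-8 is an assumption; use whatever encoding the consumer of the file expects.

```python
import csv

# The text from the session above, written with escapes and without the
# trailing space. It is the same string whichever way it is displayed.
TITLE = (
    "\u5c04\u624b\u5f71\u97f3\u64ad\u653e\u5668\u524d\u536b\u7248 V2.3.709 "
    "\u7b80\u4f53\u4e2d\u6587\u7eff\u8272\u514d\u8d39\u7248"
)


def escaped_view(text):
    """Return the escaped form, similar to what the Python 2 console echoed."""
    return ascii(text)


def save_titles(path, titles):
    """Write one title per row to a CSV file, encoded as UTF-8 (an assumption)."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        for title in titles:
            writer.writerow([title])
```

Calling `print(TITLE)` shows the Chinese characters on a terminal that supports them, while `escaped_view(TITLE)` returns the `\uXXXX` form. A file written by `save_titles` contains the characters themselves, so it reads as Chinese in any editor that opens it as UTF-8.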

Pablo.

Haisheng Wu

Sep 23, 2009, 8:47:13 AM

to scrapy...@googlegroups.com

Pablo, I appreciate your quick response.
That works for me.

-Simon
